metadata
license: other
license_name: lfm1.0
license_link: https://www.liquid.ai/license
base_model: LiquidAI/LFM2.5-350M-Base
tags:
- speech-to-text
- transcript-cleanup
- disfluency-correction
- sotto-asr
- lfm2
- liquid-ai
library_name: transformers
pipeline_tag: text-generation
language:
- en
SottoASR Transcript Cleanup — LFM2.5-350M Fine-Tuned
Overview
A fine-tuned version of LiquidAI/LFM2.5-350M-Base for cleaning speech-to-text transcripts. It removes fillers, crutch words, self-corrections, and false starts, and fixes grammar errors, while preserving the speaker's intent.
Performance
| Metric | This Model | Prompted Qwen3.5-2B |
|---|---|---|
| ROUGE-L | 0.868 | 0.891 |
| Self-correction | 0.814 | 0.742 |
| Inference time | 0.12s | 1.0s |
| Model size | 350M | 2B |
| Preserve wording | 0.984 | 0.992 |
| Filler removal | 0.954 | 0.974 |
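Roughly 8x faster than the prompted 2B model (0.12s vs 1.0s in the table above) at under a fifth of the parameters, with competitive quality. It trails the 2B model slightly on ROUGE-L, wording preservation, and filler removal, but exceeds it on self-correction (0.814 vs 0.742).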
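For readers who want to score their own outputs against reference transcripts, ROUGE-L can be sketched as an F-measure over the longest common subsequence of tokens. This is a minimal illustration, not the evaluation script behind the table: the lowercase whitespace tokenization and the balanced F1 weighting are assumptions, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings (assumed: lowercase, whitespace tokens)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens found in the LCS
    recall = lcs / len(ref)      # share of reference tokens found in the LCS
    return 2 * precision * recall / (precision + recall)


reference = "the server is running low on memory"
raw = "uh the server is uh running low on memory"
print(rouge_l_f1(reference, raw))
```

Because punctuation stays attached to words under this tokenization, scores from this sketch may differ from those of a standard ROUGE package.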
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16,  # older transformers releases name this argument torch_dtype
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

# Wrap the raw transcript in the "### Input / ### Output" prompt format.
raw_transcript = "uh the server is uh running low on memory"
prompt = f"### Input:\n{raw_transcript}\n\n### Output:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding (do_sample=False) keeps the output deterministic.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
cleaned = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(cleaned)  # "The server is running low on memory."
Training
- Base model: LiquidAI/LFM2.5-350M-Base (hybrid conv+attention, 32K context)
- Method: LoRA SFT (rank 64) via Unsloth
- Dataset: juanquivilla/sotto-transcript-cleanup — 124K pairs
- Epochs: 3 with early stopping
- Hardware: RTX 4090 (24GB)
- Training time: ~9 minutes
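As a rough illustration of how a raw/clean pair could be turned into a single SFT training string, the sketch below reuses the `### Input:` / `### Output:` prompt from the Usage section. That training used exactly this template is an assumption, `format_example` is a hypothetical helper, and the end-of-sequence token is passed in rather than guessed (in practice it would be `tokenizer.eos_token`).

```python
PROMPT_TEMPLATE = "### Input:\n{raw}\n\n### Output:\n"


def format_example(raw, clean, eos_token):
    """Join a raw transcript and its cleaned target into one training string.

    Assumption: the prompt layout matches the one used at inference time.
    The EOS token is appended so the model learns where the output ends.
    """
    prompt = PROMPT_TEMPLATE.format(raw=raw.strip())
    return prompt + clean.strip() + eos_token


example = format_example(
    "uh the server is uh running low on memory",
    "The server is running low on memory.",
    eos_token="<eos>",  # placeholder; use tokenizer.eos_token in practice
)
print(example)
```

Appending the end-of-sequence token matters for a cleanup model: without it, generation tends to run until `max_new_tokens` instead of stopping after the cleaned sentence.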
Categories Handled
- Filler removal (uh, um, uhm, er)
- Crutch word removal (basically, you know, I mean)
- Self-correction (speaker changes mind mid-sentence)
- False starts (abandoned sentence beginnings)
- Grammar correction (gonna → going to)
- Misheard word correction (post gress → Postgres)
- Dictation commands (period → ., comma → ,)
- List formatting (first/second/third → numbered list)
Limitations
- Grammar correction is the weakest category (0.788 ROUGE-L)
- Long dictation (200+ words) is occasionally truncated in the output
- Domain-specific jargon correction depends on training data coverage
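One possible workaround for truncation on long dictation is to split the transcript into shorter chunks and clean each chunk separately. The sketch below is an untested suggestion, not part of the model's pipeline: `chunk_transcript` is a hypothetical helper, and the 150-word default is an assumed margin below the 200-word threshold mentioned above.

```python
def chunk_transcript(text, max_words=150):
    """Split a transcript into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


long_transcript = " ".join(["word"] * 400)
chunks = chunk_transcript(long_transcript)
print([len(chunk.split()) for chunk in chunks])  # [150, 150, 100]
```

Each chunk can then be passed through the prompt and `generate` call from the Usage section and the cleaned pieces joined with spaces. Fixed-size cuts can split a self-correction or false start across two chunks, so splitting at pauses or sentence boundaries, where available, is likely to work better.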
Part of SottoASR
SottoASR — local, privacy-first speech-to-text for macOS. All processing happens on-device.