---
license: other
license_name: lfm1.0
license_link: https://www.liquid.ai/license
base_model: LiquidAI/LFM2.5-350M-Base
tags:
- speech-to-text
- transcript-cleanup
- disfluency-correction
- sotto-asr
- lfm2
- liquid-ai
library_name: transformers
pipeline_tag: text-generation
language:
- en
---

# SottoASR Transcript Cleanup: LFM2.5-350M Fine-Tuned

## Overview

A fine-tuned [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) model for cleaning speech-to-text transcripts. It removes fillers, crutch words, self-corrections, false starts, and grammar errors while preserving the speaker's intent.

## Performance

| Metric | This Model | Prompted Qwen3.5-2B |
|--------|-----------|---------------------|
| **ROUGE-L** | **0.868** | 0.891 |
| **Self-correction** | **0.814** | 0.742 |
| **Inference time** | **0.12s** | 1.0s |
| **Model size** | **350M** | 2B |
| **Preserve wording** | **0.984** | 0.992 |
| **Filler removal** | **0.954** | 0.974 |

**About 8x faster** (0.12s vs 1.0s) than the prompted 2B model, with competitive quality: it trails slightly on ROUGE-L, wording preservation, and filler removal, and exceeds the 2B model on self-correction (0.814 vs 0.742).

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the fine-tuned model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

# Wrap the raw transcript in the Input/Output prompt format.
raw_transcript = "uh the server is uh running low on memory"
prompt = f"### Input:\n{raw_transcript}\n\n### Output:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding for deterministic cleanup.
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
cleaned = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(cleaned)
# "The server is running low on memory."
```

## Training

- **Base model:** LiquidAI/LFM2.5-350M-Base (hybrid conv+attention, 32K context)
- **Method:** LoRA SFT (rank 64) via Unsloth
- **Dataset:** [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup) (124K pairs)
- **Epochs:** 3 with early stopping
- **Hardware:** RTX 4090 (24GB)
- **Training time:** ~9 minutes

## Categories Handled

- Filler removal (uh, um, uhm, er)
- Crutch word removal (basically, you know, I mean)
- Self-correction (speaker changes mind mid-sentence)
- False starts (abandoned sentence beginnings)
- Grammar correction (gonna → going to)
- Misheard word correction (post gress → Postgres)
- Dictation commands (period → ., comma → ,)
- List formatting (first/second/third → numbered list)

## Limitations

- Grammar correction is the weakest category (0.788 ROUGE-L)
- Output for long dictation (200+ words) is occasionally truncated
- Domain-specific jargon correction depends on training data coverage

## Part of SottoASR

[SottoASR](https://github.com/juanqui/sotto): local, privacy-first speech-to-text for macOS. All processing happens on-device.
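
## Example: Chunking Long Transcripts

The Limitations section notes that long dictation (200+ words) can occasionally truncate. One possible workaround is to split a long transcript into shorter word-count chunks, clean each chunk separately with the prompt format from Usage, and join the results. The sketch below is illustrative only and is not part of the released model or SottoASR: `build_prompt`, `chunk_words`, `clean_long_transcript`, the `clean_fn` callback, and the 150-word default are all assumptions made for this example.

```python
from typing import Callable, List

# Prompt format taken from the Usage section.
PROMPT_TEMPLATE = "### Input:\n{transcript}\n\n### Output:\n"


def build_prompt(raw_transcript: str) -> str:
    """Wrap a raw transcript in the Input/Output prompt format."""
    return PROMPT_TEMPLATE.format(transcript=raw_transcript.strip())


def chunk_words(text: str, max_words: int = 150) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    The 150-word default is an assumption chosen to stay below the
    200+ word range mentioned under Limitations.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def clean_long_transcript(
    text: str,
    clean_fn: Callable[[str], str],
    max_words: int = 150,
) -> str:
    """Clean each chunk with clean_fn and join the non-empty results.

    clean_fn is a hypothetical callback: it receives a full prompt and
    returns the model's decoded output for that prompt.
    """
    cleaned = [
        clean_fn(build_prompt(chunk)).strip()
        for chunk in chunk_words(text, max_words)
    ]
    return " ".join(part for part in cleaned if part)
```

Here `clean_fn` would wrap the `tokenizer` and `model.generate` calls from Usage and return the decoded text. Fixed-size chunks can split a self-correction or false start across a boundary, so splitting on pauses or sentence boundaries may work better where that information is available.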