--- license: mit language: - en base_model: LiquidAI/LFM2.5-350M-Base tags: - speech-to-text - transcript-cleanup - text-correction - asr-post-processing - LFM - LiquidAI pipeline_tag: text-generation datasets: - juanquivilla/sotto-transcript-cleanup --- # SottoASR Transcript Cleanup — LFM2.5-350M (Full Precision, v23 + Paragraphs) [sottoasr.app](https://sottoasr.app) · [MLX 5-bit (recommended)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) · [MLX 4-bit (smaller)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) · [Training Dataset](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup) ## Overview **Full-precision bf16** fine-tune of [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) for on-device speech-to-text transcript cleanup. This is the **training artifact** — for on-device deployment on Apple Silicon, use the [5-bit MLX variant](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) instead. This model powers on-device transcript cleanup in [SottoASR](https://sottoasr.app) — a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, and **— new in v23 — restructures long dictations into paragraph-formatted prose**, all locally with zero cloud dependency. ## What's new in v23 v23 (this model) adds **paragraph emission** for long-form dictation. The previous v18/v22 production model produced output as a single run-on paragraph regardless of input length, which made multi-topic dictations hard to read. v23 was retrained on a dataset augmented with **4,012 new `paragraph_formatting` samples** generated via Bedrock Claude Haiku 4.5, teaching the model to insert `\n\n` paragraph breaks at natural topic / time-reference / discourse-marker boundaries. | Capability | v22 (previous prod) | **v23 (this model)** | |---|---|---| | Paragraph emission rate on long inputs | **0.0 %** | **91.5 %** | | ROUGE-L on paragraph-formatted inputs | 0.9521 | **0.9792** | | ROUGE-L on standard val set | 0.9539 | 0.9506 | | Filler-Free rate on standard val set | 90.3 % | 90.2 % | The 0.003 main val ROUGE-L regression sits within the natural seed-variance band documented across the 100+ prior fine-tuning experiments and is offset by the +0.027 ROUGE-L improvement on paragraph-formatted inputs and the new structural capability. ## Key Specs | Property | Value | |----------|-------| | **Size** | **676 MB** | | **ROUGE-L (val set, 1000 samples)** | **0.9506** | | **Exact Match** | **63.9 %** | | **Filler-Free** | **90.2 %** | | **Paragraph rate (long inputs)** | **91.5 %** | | **Latency** | **119 ms** average per transcript (RTX 4090) | | **Architecture** | Hybrid: 10 conv + 6 GQA attention (354M params) | | **Precision** | bf16 | | **Training context** | 4,096 tokens (packed); model supports 32,768 tokens natively, 128K base | ## What It Does Takes raw, unpunctuated ASR output and produces clean, readable text: | Input (raw ASR) | Output (cleaned) | |-----------------|------------------| | so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. | | the deadline is friday no monday we have until monday | The deadline is Monday. | | what we what i wanted to say is the tests pass | What I wanted to say is the tests pass. | | okay so the thing is basically we're running out of disk space | We're running out of disk space. | | uh yes | Yes. | ### NEW in v23: Paragraph emission on long dictations Long, multi-topic input is now restructured into paragraph-formatted prose: **Input:** > okay so were having some real issues with the deployment pipeline and i want to walk through whats going wrong the main problem is that the redis cache is timing out during deploys we push a new version and then for about thirty seconds the connections hang and customers see errors so i think we need to add a graceful shutdown period before we kill the old pods now separately the elasticsearch cluster has been a pain we deployed a new svelte frontend last month and it caused index corruption we need to validate the schema before pushing anything live going forward i think the answer here is to add both of these checks to our standard deployment checklist **Output:** > We're having some real issues with the deployment pipeline, and I want to walk through what's going wrong. The main problem is that the Redis cache is timing out during deploys. We push a new version and then for about thirty seconds the connections hang, and customers see errors. > > I think we need to add a graceful shutdown period before we kill the old pods. Now separately, the Elasticsearch cluster has been a pain. We deployed a new Svelte frontend last month and it caused index corruption. We need to validate the schema before pushing anything live. Going forward, I think the answer here is to add both of these checks to our standard deployment checklist. Notice the model: - Strips speech disfluencies ("okay so", "uh", "basically") - Capitalizes proper nouns (Redis, Elasticsearch, Svelte) - Adds correct punctuation - **Inserts a paragraph break at the topic shift** ("the elasticsearch cluster has been a pain") ## Benchmark Results ### Main val set (1000 samples, cleaned val.jsonl from training data) | Metric | v23 (this model) | v22 baseline | |---|---|---| | ROUGE-L | **0.9506** | 0.9539 | | Exact Match | **63.9 %** | 64.8 % | | Filler-Free | **90.2 %** | 90.3 % | | Paragraph rate | 0.0 % | 0.0 % | | Avg latency | 119 ms | 117 ms | ### Paragraph val set (200 paragraph_formatting samples) | Metric | v23 (this model) | v22 baseline | Δ | |---|---|---|---| | ROUGE-L | **0.9792** | 0.9521 | **+0.027** | | **Paragraph emission rate** | **91.5 %** | **0.0 %** | **+91.5 pts** | | Exact Match | 2.5 % | 0.0 % | +2.5 pts | | Avg latency | 1.46 s | 1.40 s | +60 ms | ### vs Prompted Qwen 2B Baseline (from earlier benchmarks) | Metric | This model (354M) | Prompted Qwen 2B | Improvement | |--------|-------------------|-------------------|-------------| | ROUGE-L | **0.9506** | 0.891 | **+0.060** | | Exact Match | **63.9 %** | 37 % | **+27 pts** | | Inference | **119 ms** | 1.0 s | **8.4× faster** | | Parameters | 354M | 2B | **5.6× smaller** | ## Usage ### Prompt Format ``` ### Input: {raw transcript} ### Output: {model generates cleaned text} ``` ### Python Example ```python from transformers import AutoModelForCausalLM, AutoTokenizer import torch model = AutoModelForCausalLM.from_pretrained( "juanquivilla/sotto-cleanup-lfm25-350m", dtype=torch.bfloat16, trust_remote_code=True, ) tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m") text = "so uh basically we need to fix the deployment pipeline" prompt = f"### Input:\n{text}\n\n### Output:\n" inputs = tokenizer(prompt, return_tensors="pt").to(model.device) with torch.no_grad(): out = model.generate(**inputs, max_new_tokens=512, do_sample=False) output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True) if "###" in output: output = output[:output.index("###")] print(output.strip()) # → "We need to fix the deployment pipeline." ``` For long dictation that may need paragraph formatting, use a higher `max_new_tokens` (1024-2048). ## Training Details ### Pipeline ``` LiquidAI/LFM2.5-350M-Base → SFT: 157,556 rows (v22 base + 4,012 paragraph_formatting), LR 3e-5, β2=0.95, 3 epochs, batch 1×8, cosine schedule, 50 warmup steps, weight_decay 0.01, bf16+tf32, packed 4,096 context, seed 42 → eval_loss 1.016 (vs v22's 1.0306, -0.014) → GRPO: LoRA r=32, alpha=16, all linear layers, LR 3e-6 cosine, 5K samples × 4 generations, reward = ROUGE-L × 5.0 - filler_count × 0.5 (capped 2.0) × 3.0 + format_bonus → final main val ROUGE-L 0.9506 / paragraph rate 91.5 % ``` ### Dataset **157,556 train / 7,121 val rows** in [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup): - 153,561 v22 base (prior versions: filler removal, crutch words, self-correction, false starts, grammar, dictation commands, list formatting, preserve-wording, mixed disfluencies, short utterances, long dictation) - **3,995 new paragraph_formatting samples** (held out 200 for `paragraph_val.jsonl`) — generated via AWS Bedrock Claude Haiku 4.5, instructed to produce 100–500 word raw input + 2–5 paragraph clean output, split at natural discourse boundaries ### Hardware 1× RTX 4090, ~42 minutes for SFT + ~30 minutes for GRPO = ~72 min total ## All Variants | Variant | Size | Use Case | |---------|------|----------| | **[Full precision (this)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m)** | 676 MB | Training, GPU inference | | **[MLX 5-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit)** | 237 MB | **Recommended for Apple Silicon** | | [MLX 4-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) | 195 MB | Smallest, slight quality trade-off | ## Limitations - Optimized for **English** conversational/meeting-style speech - Domain-specific jargon (medical, legal) may not be corrected without additional fine-tuning - Paragraph emission is conditional on input structure — short single-topic inputs (typical) will not be paragraph-broken - Filler-free rate on long-form content is lower than on short inputs (long content has more legitimate uses of words like "so", "okay", "right", which the eval list flags) - Not designed for formal written text — trained on spoken language patterns ## License MIT ## Links - **Application:** [sottoasr.app](https://sottoasr.app) - **Source:** [github.com/juanqui/sottoasr](https://github.com/juanqui/sottoasr) - **Dataset:** [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)