license: mit
language:
- en
base_model: LiquidAI/LFM2.5-350M-Base
tags:
- speech-to-text
- transcript-cleanup
- text-correction
- asr-post-processing
- LFM
- LiquidAI
pipeline_tag: text-generation
datasets:
- juanquivilla/sotto-transcript-cleanup
SottoASR Transcript Cleanup: LFM2.5-350M (Full Precision, v23 + Paragraphs)
sottoasr.app · MLX 5-bit (recommended) · MLX 4-bit (smaller) · Training Dataset
Overview
Full-precision bf16 fine-tune of LiquidAI/LFM2.5-350M-Base for on-device speech-to-text transcript cleanup. This is the training artifact; for on-device deployment on Apple Silicon, use the 5-bit MLX variant instead.
This model powers on-device transcript cleanup in SottoASR, a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, and (new in v23) restructures long dictations into paragraph-formatted prose, all locally with zero cloud dependency.
What's new in v23
v23 (this model) adds paragraph emission for long-form dictation. The previous v18/v22 production model produced output as a single run-on paragraph regardless of input length, which made multi-topic dictations hard to read. v23 was retrained on a dataset augmented with 4,012 new paragraph_formatting samples generated via Bedrock Claude Haiku 4.5, teaching the model to insert \n\n paragraph breaks at natural topic / time-reference / discourse-marker boundaries.
| Capability | v22 (previous prod) | v23 (this model) |
|---|---|---|
| Paragraph emission rate on long inputs | 0.0 % | 91.5 % |
| ROUGE-L on paragraph-formatted inputs | 0.9521 | 0.9792 |
| ROUGE-L on standard val set | 0.9539 | 0.9506 |
| Filler-Free rate on standard val set | 90.3 % | 90.2 % |
The 0.003 main val ROUGE-L regression sits within the natural seed-variance band documented across the 100+ prior fine-tuning experiments and is offset by the +0.027 ROUGE-L improvement on paragraph-formatted inputs and the new structural capability.
Key Specs
| Property | Value |
|---|---|
| Size | 676 MB |
| ROUGE-L (val set, 1000 samples) | 0.9506 |
| Exact Match | 63.9 % |
| Filler-Free | 90.2 % |
| Paragraph rate (long inputs) | 91.5 % |
| Latency | 119 ms average per transcript (RTX 4090) |
| Architecture | Hybrid: 10 conv + 6 GQA attention (354M params) |
| Precision | bf16 |
| Training context | 4,096 tokens (packed); model supports 32,768 tokens natively, 128K base |
What It Does
Takes raw, unpunctuated ASR output and produces clean, readable text:
| Input (raw ASR) | Output (cleaned) |
|---|---|
| so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
| the deadline is friday no monday we have until monday | The deadline is Monday. |
| what we what i wanted to say is the tests pass | What I wanted to say is the tests pass. |
| okay so the thing is basically we're running out of disk space | We're running out of disk space. |
| uh yes | Yes. |
NEW in v23: Paragraph emission on long dictations
Long, multi-topic input is now restructured into paragraph-formatted prose:
Input:
okay so were having some real issues with the deployment pipeline and i want to walk through whats going wrong the main problem is that the redis cache is timing out during deploys we push a new version and then for about thirty seconds the connections hang and customers see errors so i think we need to add a graceful shutdown period before we kill the old pods now separately the elasticsearch cluster has been a pain we deployed a new svelte frontend last month and it caused index corruption we need to validate the schema before pushing anything live going forward i think the answer here is to add both of these checks to our standard deployment checklist
Output:
We're having some real issues with the deployment pipeline, and I want to walk through what's going wrong. The main problem is that the Redis cache is timing out during deploys. We push a new version and then for about thirty seconds the connections hang, and customers see errors.
I think we need to add a graceful shutdown period before we kill the old pods. Now separately, the Elasticsearch cluster has been a pain. We deployed a new Svelte frontend last month and it caused index corruption. We need to validate the schema before pushing anything live. Going forward, I think the answer here is to add both of these checks to our standard deployment checklist.
Notice the model:
- Strips speech disfluencies ("okay so", "uh", "basically")
- Capitalizes proper nouns (Redis, Elasticsearch, Svelte)
- Adds correct punctuation
- Inserts a paragraph break at the topic shift ("the elasticsearch cluster has been a pain")
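Downstream code usually needs the paragraphs as separate strings. A minimal sketch, assuming the model output uses a blank line (`\n\n`) as the only paragraph separator; `split_paragraphs` is a hypothetical helper name, not part of SottoASR:

```python
def split_paragraphs(cleaned: str) -> list[str]:
    """Split cleaned model output on blank lines and drop empty pieces."""
    return [part.strip() for part in cleaned.split("\n\n") if part.strip()]


if __name__ == "__main__":
    sample = "First topic sentence.\n\nSecond topic sentence."
    for index, paragraph in enumerate(split_paragraphs(sample), start=1):
        print(index, paragraph)
```

Short single-topic outputs come back as a one-element list, so callers can treat both cases uniformly.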
Benchmark Results
Main val set (1000 samples, cleaned val.jsonl from training data)
| Metric | v23 (this model) | v22 baseline |
|---|---|---|
| ROUGE-L | 0.9506 | 0.9539 |
| Exact Match | 63.9 % | 64.8 % |
| Filler-Free | 90.2 % | 90.3 % |
| Paragraph rate | 0.0 % | 0.0 % |
| Avg latency | 119 ms | 117 ms |
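For readers unfamiliar with these metrics, the sketch below shows one common way to compute them. It is an illustration only: the tokenization (lowercased whitespace split), the `ASSUMED_FILLERS` list, and all function names are assumptions, and the scorer actually used for the numbers above may differ.

```python
import re

# Assumed filler list for illustration; the real eval list is not given here.
ASSUMED_FILLERS = ("uh", "um", "basically")
FILLER_RE = re.compile(r"\b(" + "|".join(ASSUMED_FILLERS) + r")\b", re.IGNORECASE)


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction: str, reference: str) -> bool:
    """True when prediction and reference are identical after trimming."""
    return prediction.strip() == reference.strip()


def is_filler_free(text: str) -> bool:
    """True when none of the assumed filler words appear in the text."""
    return FILLER_RE.search(text) is None
```

Averaging `rouge_l_f1` over the val set gives a ROUGE-L score; the mean of `exact_match` and `is_filler_free` gives the two rates as fractions.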
Paragraph val set (200 paragraph_formatting samples)
| Metric | v23 (this model) | v22 baseline | Δ |
|---|---|---|---|
| ROUGE-L | 0.9792 | 0.9521 | +0.027 |
| Paragraph emission rate | 91.5 % | 0.0 % | +91.5 pts |
| Exact Match | 2.5 % | 0.0 % | +2.5 pts |
| Avg latency | 1.46 s | 1.40 s | +60 ms |
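The paragraph emission rate can be read as the share of outputs that contain at least one paragraph break. A minimal sketch under that assumption (the function name is hypothetical, and the evaluation script may define the metric differently):

```python
def paragraph_emission_rate(outputs: list[str]) -> float:
    """Fraction of outputs containing at least one blank-line paragraph break."""
    if not outputs:
        return 0.0
    with_breaks = sum(1 for text in outputs if "\n\n" in text.strip())
    return with_breaks / len(outputs)


if __name__ == "__main__":
    demo = ["One paragraph only.", "First paragraph.\n\nSecond paragraph."]
    print(f"{paragraph_emission_rate(demo):.1%}")
```

Under this definition, 91.5 % of 200 samples corresponds to 183 outputs with at least one break.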
vs Prompted Qwen 2B Baseline (from earlier benchmarks)
| Metric | This model (354M) | Prompted Qwen 2B | Improvement |
|---|---|---|---|
| ROUGE-L | 0.9506 | 0.891 | +0.060 |
| Exact Match | 63.9 % | 37 % | +27 pts |
| Inference | 119 ms | 1.0 s | 8.4× faster |
| Parameters | 354M | 2B | 5.6× smaller |
Usage
Prompt Format
### Input:
{raw transcript}
### Output:
{model generates cleaned text}
Python Example
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")
text = "so uh basically we need to fix the deployment pipeline"
prompt = f"### Input:\n{text}\n\n### Output:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
if "###" in output:
    output = output[:output.index("###")]
print(output.strip())
# → "We need to fix the deployment pipeline."
For long dictation that may need paragraph formatting, use a higher max_new_tokens (1024-2048).
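The prompt construction and output truncation from the example above can be factored into two pure helpers, which makes them easy to unit-test without loading the model. This is a sketch; `build_prompt` and `extract_cleaned` are hypothetical names, not part of any published SottoASR API:

```python
PROMPT_TEMPLATE = "### Input:\n{text}\n\n### Output:\n"


def build_prompt(raw_transcript: str) -> str:
    """Wrap a raw transcript in the Input/Output prompt format."""
    return PROMPT_TEMPLATE.format(text=raw_transcript)


def extract_cleaned(generated: str) -> str:
    """Keep the text before any new '###' section and trim whitespace."""
    head, _, _ = generated.partition("###")
    return head.strip()


if __name__ == "__main__":
    print(repr(build_prompt("uh yes")))
    print(extract_cleaned("Yes.\n\n### Input:\nnext transcript"))
```

`extract_cleaned` expects only the newly generated text (the decoded tokens after the prompt), as in the example above.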
Training Details
Pipeline
LiquidAI/LFM2.5-350M-Base
→ SFT: 157,556 rows (v22 base + 4,012 paragraph_formatting),
  LR 3e-5, β2=0.95, 3 epochs, batch 1×8,
  cosine schedule, 50 warmup steps, weight_decay 0.01,
  bf16+tf32, packed 4,096 context, seed 42
→ eval_loss 1.016 (vs v22's 1.0306, -0.014)
→ GRPO: LoRA r=32, alpha=16, all linear layers,
  LR 3e-6 cosine, 5K samples × 4 generations,
  reward = ROUGE-L × 5.0 - filler_count × 0.5 (capped 2.0) × 3.0 + format_bonus
→ final main val ROUGE-L 0.9506 / paragraph rate 91.5 %
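The GRPO reward line above is compact, so here is one possible reading of it as code. This is an interpretation, not the training script: it assumes the cap of 2.0 applies to `filler_count × 0.5` before the × 3.0 weight, and it takes `format_bonus` as an input because its definition is not given here. The function name `grpo_reward` is hypothetical.

```python
def grpo_reward(rouge_l: float, filler_count: int, format_bonus: float = 0.0) -> float:
    """One reading of: ROUGE-L x 5.0 - capped filler penalty x 3.0 + format_bonus."""
    # Assumption: the penalty grows by 0.5 per filler word and is capped at 2.0.
    filler_penalty = min(filler_count * 0.5, 2.0)
    return rouge_l * 5.0 - filler_penalty * 3.0 + format_bonus


if __name__ == "__main__":
    print(grpo_reward(rouge_l=1.0, filler_count=0))  # perfect overlap, no fillers
    print(grpo_reward(rouge_l=1.0, filler_count=2))  # same overlap, two fillers left in
```

Under this reading, four or more remaining fillers cost a fixed 6.0, which outweighs the maximum 5.0 available from ROUGE-L.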
Dataset
157,556 train / 7,121 val rows in juanquivilla/sotto-transcript-cleanup:
- 153,561 v22 base (prior versions: filler removal, crutch words, self-correction, false starts, grammar, dictation commands, list formatting, preserve-wording, mixed disfluencies, short utterances, long dictation)
- 3,995 new paragraph_formatting samples (200 held out for paragraph_val.jsonl), generated via AWS Bedrock Claude Haiku 4.5, instructed to produce 100-500 word raw input + 2-5 paragraph clean output, split at natural discourse boundaries
Hardware
1× RTX 4090, ~42 minutes for SFT + ~30 minutes for GRPO = ~72 min total
All Variants
| Variant | Size | Use Case |
|---|---|---|
| Full precision (this) | 676 MB | Training, GPU inference |
| MLX 5-bit | 237 MB | Recommended for Apple Silicon |
| MLX 4-bit | 195 MB | Smallest, slight quality trade-off |
Limitations
- Optimized for English conversational/meeting-style speech
- Domain-specific jargon (medical, legal) may not be corrected without additional fine-tuning
- Paragraph emission is conditional on input structure: short single-topic inputs (the typical case) will not be paragraph-broken
- Filler-free rate on long-form content is lower than on short inputs (long content has more legitimate uses of words like "so", "okay", "right", which the eval list flags)
- Not designed for formal written text β trained on spoken language patterns
License
MIT
Links
- Application: sottoasr.app
- Source: github.com/juanqui/sottoasr
- Dataset: juanquivilla/sotto-transcript-cleanup