---
license: mit
language:
- en
base_model: LiquidAI/LFM2.5-350M-Base
tags:
- speech-to-text
- transcript-cleanup
- text-correction
- asr-post-processing
- LFM
- LiquidAI
pipeline_tag: text-generation
datasets:
- juanquivilla/sotto-transcript-cleanup
---

# SottoASR Transcript Cleanup — LFM2.5-350M (Full Precision, v23 + Paragraphs)

[sottoasr.app](https://sottoasr.app) · [MLX 5-bit (recommended)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) · [MLX 4-bit (smaller)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) · [Training Dataset](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)

## Overview

**Full-precision bf16** fine-tune of [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) for on-device speech-to-text transcript cleanup. This is the **training artifact**; for on-device deployment on Apple Silicon, use the [5-bit MLX variant](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) instead.

This model powers on-device transcript cleanup in [SottoASR](https://sottoasr.app), a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, and, new in v23, **restructures long dictations into paragraph-formatted prose**. Everything runs locally with zero cloud dependency.

## What's new in v23

v23 (this model) adds **paragraph emission** for long-form dictation. The previous v18/v22 production model produced output as a single run-on paragraph regardless of input length, which made multi-topic dictations hard to read. v23 was retrained on a dataset augmented with **4,012 new `paragraph_formatting` samples** generated via Bedrock Claude Haiku 4.5, teaching the model to insert `\n\n` paragraph breaks at natural topic / time-reference / discourse-marker boundaries.

| Capability | v22 (previous prod) | **v23 (this model)** |
|---|---|---|
| Paragraph emission rate on long inputs | **0.0 %** | **91.5 %** |
| ROUGE-L on paragraph-formatted inputs | 0.9521 | **0.9792** |
| ROUGE-L on standard val set | 0.9539 | 0.9506 |
| Filler-Free rate on standard val set | 90.3 % | 90.2 % |

The 0.003 ROUGE-L regression on the main val set sits within the natural seed-variance band documented across the 100+ prior fine-tuning experiments. It is offset by the +0.027 ROUGE-L improvement on paragraph-formatted inputs and by the new structural capability.

## Key Specs

| Property | Value |
|----------|-------|
| **Size** | **676 MB** |
| **ROUGE-L (val set, 1000 samples)** | **0.9506** |
| **Exact Match** | **63.9 %** |
| **Filler-Free** | **90.2 %** |
| **Paragraph rate (long inputs)** | **91.5 %** |
| **Latency** | **119 ms** average per transcript (RTX 4090) |
| **Architecture** | Hybrid: 10 conv + 6 GQA attention (354M params) |
| **Precision** | bf16 |
| **Training context** | 4,096 tokens (packed); model supports 32,768 tokens natively, 128K base |

## What It Does

Takes raw, unpunctuated ASR output and produces clean, readable text:

| Input (raw ASR) | Output (cleaned) |
|-----------------|------------------|
| so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
| the deadline is friday no monday we have until monday | The deadline is Monday. |
| what we what i wanted to say is the tests pass | What I wanted to say is the tests pass. |
| okay so the thing is basically we're running out of disk space | We're running out of disk space. |
| uh yes | Yes. |

### NEW in v23: Paragraph emission on long dictations

Long, multi-topic input is now restructured into paragraph-formatted prose:

**Input:**
> okay so were having some real issues with the deployment pipeline and i want to walk through whats going wrong the main problem is that the redis cache is timing out during deploys we push a new version and then for about thirty seconds the connections hang and customers see errors so i think we need to add a graceful shutdown period before we kill the old pods now separately the elasticsearch cluster has been a pain we deployed a new svelte frontend last month and it caused index corruption we need to validate the schema before pushing anything live going forward i think the answer here is to add both of these checks to our standard deployment checklist

**Output:**
> We're having some real issues with the deployment pipeline, and I want to walk through what's going wrong. The main problem is that the Redis cache is timing out during deploys. We push a new version and then for about thirty seconds the connections hang, and customers see errors.
>
> I think we need to add a graceful shutdown period before we kill the old pods. Now separately, the Elasticsearch cluster has been a pain. We deployed a new Svelte frontend last month and it caused index corruption. We need to validate the schema before pushing anything live. Going forward, I think the answer here is to add both of these checks to our standard deployment checklist.

Notice the model:
- Strips speech disfluencies (the opening "okay so" and the leading "so" in "so i think")
- Capitalizes proper nouns (Redis, Elasticsearch, Svelte)
- Adds correct punctuation
- **Inserts a paragraph break** between the description of the problem and the proposed fix ("I think we need to add a graceful shutdown period")

## Benchmark Results

### Main val set (1,000 samples from the dataset's cleaned `val.jsonl`)

| Metric | v23 (this model) | v22 baseline |
|---|---|---|
| ROUGE-L | **0.9506** | 0.9539 |
| Exact Match | **63.9 %** | 64.8 % |
| Filler-Free | **90.2 %** | 90.3 % |
| Paragraph rate | 0.0 % | 0.0 % |
| Avg latency | 119 ms | 117 ms |

### Paragraph val set (200 paragraph_formatting samples)

| Metric | v23 (this model) | v22 baseline | Δ |
|---|---|---|---|
| ROUGE-L | **0.9792** | 0.9521 | **+0.027** |
| **Paragraph emission rate** | **91.5 %** | **0.0 %** | **+91.5 pts** |
| Exact Match | 2.5 % | 0.0 % | +2.5 pts |
| Avg latency | 1.46 s | 1.40 s | +60 ms |
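The rate-style metrics in these tables can be approximated with a few lines of Python. The sketch below is not the evaluation script used for this card. It assumes that "paragraph emission" means the output contains at least one `\n\n` break and that "filler-free" means no word from a filler list appears in the output. The `FILLERS` tuple and the helper names are hypothetical and chosen for illustration only.

```python
import re

# Hypothetical filler list for illustration; the actual eval list is not
# published in this card.
FILLERS = ("uh", "um", "basically", "okay", "so", "right")
_FILLER_RE = re.compile(r"\b(?:" + "|".join(FILLERS) + r")\b", re.IGNORECASE)


def has_paragraph_break(text: str) -> bool:
    """Assumed definition: the output contains at least one blank-line break."""
    return "\n\n" in text.strip()


def is_filler_free(text: str) -> bool:
    """Assumed definition: no word from FILLERS appears as a whole word."""
    return _FILLER_RE.search(text) is None


def rate(outputs, predicate) -> float:
    """Fraction of outputs for which predicate(output) is true."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return sum(1 for o in outputs if predicate(o)) / len(outputs)


if __name__ == "__main__":
    samples = ["First topic.\n\nSecond topic.", "Just one paragraph.", "So, uh, yes."]
    print(f"paragraph rate: {rate(samples, has_paragraph_break):.1%}")
    print(f"filler-free:    {rate(samples, is_filler_free):.1%}")
```

A whole-word filler check like this one also flags legitimate uses of "so" or "right", which is the effect described under Limitations for long-form content.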

### vs Prompted Qwen 2B Baseline (from earlier benchmarks)

| Metric | This model (354M) | Prompted Qwen 2B | Improvement |
|--------|-------------------|-------------------|-------------|
| ROUGE-L | **0.9506** | 0.891 | **+0.060** |
| Exact Match | **63.9 %** | 37 % | **+27 pts** |
| Inference | **119 ms** | 1.0 s | **8.4× faster** |
| Parameters | 354M | 2B | **5.6× smaller** |

## Usage

### Prompt Format

```
### Input:
{raw transcript}

### Output:
{model generates cleaned text}
```

### Python Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the bf16 checkpoint and its tokenizer from the Hub.
model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

# Wrap the raw transcript in the prompt format the model was trained on.
text = "so uh basically we need to fix the deployment pipeline"
prompt = f"### Input:\n{text}\n\n### Output:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding keeps the cleanup deterministic.
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
# Cut the text off if the model starts another "### ..." section.
if "###" in output:
    output = output[:output.index("###")]
print(output.strip())
# → "We need to fix the deployment pipeline."
```

For long dictations that may need paragraph formatting, raise `max_new_tokens` to 1024–2048.
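The prompt wrapping and post-processing from the example above can be factored into small helpers that are easy to unit-test without loading the model. This is a minimal sketch: the function names are hypothetical, and the 100-word threshold for switching to the larger token budget is an assumption (loosely based on the 100–500 word paragraph samples described under Dataset), not a documented setting.

```python
def build_prompt(raw_transcript: str) -> str:
    """Wrap a raw transcript in the '### Input / ### Output' prompt format."""
    return f"### Input:\n{raw_transcript.strip()}\n\n### Output:\n"


def extract_cleaned(generated: str) -> str:
    """Keep only the text before any further '###' section the model emits."""
    marker = generated.find("###")
    if marker != -1:
        generated = generated[:marker]
    return generated.strip()


def pick_max_new_tokens(
    raw_transcript: str,
    short_budget: int = 512,          # value used in the example above
    long_budget: int = 2048,          # upper end of the suggested range
    long_threshold_words: int = 100,  # assumed cut-off, not from the card
) -> int:
    """Choose a generation budget based on the transcript's word count."""
    n_words = len(raw_transcript.split())
    return long_budget if n_words >= long_threshold_words else short_budget


if __name__ == "__main__":
    transcript = "so uh basically we need to fix the deployment pipeline"
    print(repr(build_prompt(transcript)))
    print(pick_max_new_tokens(transcript))
```

`extract_cleaned` leaves `\n\n` paragraph breaks inside the output untouched and only trims leading and trailing whitespace.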

## Training Details

### Pipeline

```
LiquidAI/LFM2.5-350M-Base
  → SFT: 157,556 rows (v22 base + 4,012 paragraph_formatting),
         LR 3e-5, β2=0.95, 3 epochs, batch 1×8,
         cosine schedule, 50 warmup steps, weight_decay 0.01,
         bf16+tf32, packed 4,096 context, seed 42
         → eval_loss 1.016 (vs v22's 1.0306, -0.014)
  → GRPO: LoRA r=32, alpha=16, all linear layers,
          LR 3e-6 cosine, 5K samples × 4 generations,
          reward = ROUGE-L × 5.0 - filler_count × 0.5 (capped 2.0) × 3.0 + format_bonus
          → final main val ROUGE-L 0.9506 / paragraph rate 91.5 %
```
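The GRPO reward line above is terse, so here is one possible reading of it as runnable Python. This is a sketch under stated assumptions, not the training code: it assumes the filler term means `min(filler_count × 0.5, 2.0) × 3.0`, it computes ROUGE-L as an LCS-based F1 over lowercased whitespace tokens, it uses a hypothetical three-word filler list, and it takes `format_bonus` as a plain argument because the card does not define it.

```python
import string

# Hypothetical filler list for illustration only.
FILLERS = {"uh", "um", "basically"}


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased whitespace tokens (simplified tokenization)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def count_fillers(text: str) -> int:
    """Count tokens that match FILLERS after stripping surrounding punctuation."""
    return sum(
        1 for tok in text.lower().split()
        if tok.strip(string.punctuation) in FILLERS
    )


def reward(candidate: str, reference: str, format_bonus: float = 0.0) -> float:
    """Assumed reading: ROUGE-L * 5.0 - min(fillers * 0.5, 2.0) * 3.0 + bonus."""
    penalty = min(count_fillers(candidate) * 0.5, 2.0) * 3.0
    return rouge_l_f1(candidate, reference) * 5.0 - penalty + format_bonus


if __name__ == "__main__":
    ref = "We need to fix the deployment pipeline."
    print(reward(ref, ref))
    print(reward("uh basically we need to fix the deployment pipeline.", ref))
```

Under this reading, the cap limits the filler penalty to 6.0 reward points, so a single very noisy generation cannot dominate the group-relative advantage.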

### Dataset

**157,556 train / 7,121 val rows** in [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup):

- 153,561 v22 base (prior versions: filler removal, crutch words, self-correction, false starts, grammar, dictation commands, list formatting, preserve-wording, mixed disfluencies, short utterances, long dictation)
- **3,995 new paragraph_formatting samples** (held out 200 for `paragraph_val.jsonl`), generated via AWS Bedrock Claude Haiku 4.5 with instructions to produce a 100–500 word raw input and a 2–5 paragraph clean output, split at natural discourse boundaries

### Hardware

1× RTX 4090: ~42 minutes for SFT plus ~30 minutes for GRPO (~72 minutes total)

## All Variants

| Variant | Size | Use Case |
|---------|------|----------|
| **[Full precision (this)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m)** | 676 MB | Training, GPU inference |
| **[MLX 5-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit)** | 237 MB | **Recommended for Apple Silicon** |
| [MLX 4-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) | 195 MB | Smallest, slight quality trade-off |

## Limitations

- Optimized for **English** conversational/meeting-style speech
- Domain-specific jargon (medical, legal) may not be corrected without additional fine-tuning
- Paragraph emission depends on input structure: short single-topic inputs (the typical case) are returned as a single paragraph
- Filler-free rate on long-form content is lower than on short inputs, because long content contains more legitimate uses of words such as "so", "okay", and "right", which the eval list still flags
- Not designed for formal written text — trained on spoken language patterns

## License

MIT

## Links

- **Application:** [sottoasr.app](https://sottoasr.app)
- **Source:** [github.com/juanqui/sottoasr](https://github.com/juanqui/sottoasr)
- **Dataset:** [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)