license: mit
language:
  - en
base_model: LiquidAI/LFM2.5-350M-Base
tags:
  - speech-to-text
  - transcript-cleanup
  - text-correction
  - asr-post-processing
  - LFM
  - LiquidAI
pipeline_tag: text-generation
datasets:
  - juanquivilla/sotto-transcript-cleanup

SottoASR Transcript Cleanup — LFM2.5-350M (Full Precision, v23 + Paragraphs)

sottoasr.app · MLX 5-bit (recommended) · MLX 4-bit (smaller) · Training Dataset

Overview

Full-precision bf16 fine-tune of LiquidAI/LFM2.5-350M-Base for on-device speech-to-text transcript cleanup. This is the training artifact — for on-device deployment on Apple Silicon, use the 5-bit MLX variant instead.

This model powers on-device transcript cleanup in SottoASR — a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, and — new in v23 — restructures long dictations into paragraph-formatted prose, all locally with zero cloud dependency.

What's new in v23

v23 (this model) adds paragraph emission for long-form dictation. The previous v18/v22 production model produced output as a single run-on paragraph regardless of input length, which made multi-topic dictations hard to read. v23 was retrained on a dataset augmented with 4,012 new paragraph_formatting samples generated via Bedrock Claude Haiku 4.5, teaching the model to insert \n\n paragraph breaks at natural topic / time-reference / discourse-marker boundaries.

Capability	v22 (previous prod)	v23 (this model)
Paragraph emission rate on long inputs	0.0 %	91.5 %
ROUGE-L on paragraph-formatted inputs	0.9521	0.9792
ROUGE-L on standard val set	0.9539	0.9506
Filler-Free rate on standard val set	90.3 %	90.2 %

The 0.003 ROUGE-L regression on the main val set (0.9539 to 0.9506) sits within the seed-variance band observed across the 100+ prior fine-tuning experiments. It is offset by the +0.027 ROUGE-L improvement on paragraph-formatted inputs and by the new paragraph-emission capability.

Key Specs

Property	Value
Size	676 MB
ROUGE-L (val set, 1000 samples)	0.9506
Exact Match	63.9 %
Filler-Free	90.2 %
Paragraph rate (long inputs)	91.5 %
Latency	119 ms average per transcript (RTX 4090)
Architecture	Hybrid: 10 conv + 6 GQA attention (354M params)
Precision	bf16
Training context	4,096 tokens (packed); model supports 32,768 tokens natively, 128K base

What It Does

Takes raw, unpunctuated ASR output and produces clean, readable text:

Input (raw ASR)	Output (cleaned)
so uh basically we need to fix the deployment pipeline	We need to fix the deployment pipeline.
the deadline is friday no monday we have until monday	The deadline is Monday.
what we what i wanted to say is the tests pass	What I wanted to say is the tests pass.
okay so the thing is basically we're running out of disk space	We're running out of disk space.
uh yes	Yes.

NEW in v23: Paragraph emission on long dictations

Long, multi-topic input is now restructured into paragraph-formatted prose:

Input:

okay so were having some real issues with the deployment pipeline and i want to walk through whats going wrong the main problem is that the redis cache is timing out during deploys we push a new version and then for about thirty seconds the connections hang and customers see errors so i think we need to add a graceful shutdown period before we kill the old pods now separately the elasticsearch cluster has been a pain we deployed a new svelte frontend last month and it caused index corruption we need to validate the schema before pushing anything live going forward i think the answer here is to add both of these checks to our standard deployment checklist

Output:

We're having some real issues with the deployment pipeline, and I want to walk through what's going wrong. The main problem is that the Redis cache is timing out during deploys. We push a new version and then for about thirty seconds the connections hang, and customers see errors.

I think we need to add a graceful shutdown period before we kill the old pods. Now separately, the Elasticsearch cluster has been a pain. We deployed a new Svelte frontend last month and it caused index corruption. We need to validate the schema before pushing anything live. Going forward, I think the answer here is to add both of these checks to our standard deployment checklist.

Notice the model:

Strips speech disfluencies (here the leading "okay so" and a linking "so"; in other inputs "uh" and "basically")
Capitalizes proper nouns (Redis, Elasticsearch, Svelte)
Adds correct punctuation
Inserts a paragraph break at a discourse boundary (in the output shown, before "I think we need to add a graceful shutdown period")

Benchmark Results

Main val set (1000 samples, cleaned val.jsonl from training data)

Metric	v23 (this model)	v22 baseline
ROUGE-L	0.9506	0.9539
Exact Match	63.9 %	64.8 %
Filler-Free	90.2 %	90.3 %
Paragraph rate	0.0 %	0.0 %
Avg latency	119 ms	117 ms
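The Exact Match and Filler-Free columns can be approximated with a few lines of Python. This is a minimal sketch, not the evaluation script behind the numbers above: the `FILLERS` tuple is a hypothetical stand-in for the real eval list, and exact match is assumed to compare whitespace-stripped strings.

```python
import re

# Hypothetical filler list for illustration; the real eval list is not reproduced here.
FILLERS = ("uh", "um", "basically", "you know")


def is_filler_free(text: str, fillers=FILLERS) -> bool:
    """True when none of the filler phrases occurs as a whole word or phrase."""
    lowered = text.lower()
    return not any(re.search(rf"\b{re.escape(f)}\b", lowered) for f in fillers)


def filler_free_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that contain no filler."""
    return sum(is_filler_free(o) for o in outputs) / len(outputs)


def exact_match_rate(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions identical to their reference after stripping whitespace."""
    pairs = zip(predictions, references)
    return sum(p.strip() == r.strip() for p, r in pairs) / len(references)
```

Word-boundary matching keeps "um" from firing inside words such as "drum". A word-list metric of this kind also flags legitimate uses of a listed word.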

Paragraph val set (200 paragraph_formatting samples)

Metric	v23 (this model)	v22 baseline	Δ
ROUGE-L	0.9792	0.9521	+0.027
Paragraph emission rate	91.5 %	0.0 %	+91.5 pts
Exact Match	2.5 %	0.0 %	+2.5 pts
Avg latency	1.46 s	1.40 s	+60 ms
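Paragraph emission rate can be read as the share of outputs that contain at least one blank-line break. The sketch below assumes that definition (a `\n\n` inside the stripped output); the function names are illustrative and not taken from the evaluation code.

```python
def has_paragraph_break(text: str) -> bool:
    """True when the output contains a blank-line break between non-empty content."""
    return "\n\n" in text.strip()


def paragraph_emission_rate(outputs: list[str]) -> float:
    """Fraction of outputs that were split into two or more paragraphs."""
    return sum(has_paragraph_break(o) for o in outputs) / len(outputs)
```

Stripping first means trailing newlines after a single paragraph are not counted as a break.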

vs Prompted Qwen 2B Baseline (from earlier benchmarks)

Metric	This model (354M)	Prompted Qwen 2B	Improvement
ROUGE-L	0.9506	0.891	+0.060
Exact Match	63.9 %	37 %	+27 pts
Inference	119 ms	1.0 s	8.4× faster
Parameters	354M	2B	5.6× smaller

Usage

Prompt Format

### Input:
{raw transcript}

### Output:
{model generates cleaned text}
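The template can be wrapped in two small helpers, one to build the prompt and one to trim the completion. This is a sketch: `build_prompt` and `extract_cleaned` are illustrative names, and cutting at the first `###` assumes the cleaned text never legitimately contains that marker.

```python
PROMPT_TEMPLATE = "### Input:\n{text}\n\n### Output:\n"


def build_prompt(raw_transcript: str) -> str:
    """Wrap a raw ASR transcript in the Input/Output prompt format."""
    return PROMPT_TEMPLATE.format(text=raw_transcript.strip())


def extract_cleaned(generated: str) -> str:
    """Keep only the text before the next '###' marker, if the model emits one."""
    head, _, _ = generated.partition("###")
    return head.strip()
```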

Python Example

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_ID = "juanquivilla/sotto-cleanup-lfm25-350m"

# Load the bf16 weights. Recent transformers releases accept `dtype`;
# older releases name the same argument `torch_dtype`.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.bfloat16, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Wrap the raw transcript in the prompt format shown above.
text = "so uh basically we need to fix the deployment pipeline"
prompt = f"### Input:\n{text}\n\n### Output:\n"

# Greedy decoding (do_sample=False) keeps the cleanup deterministic.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Cut the text if the model starts another "###" section.
if "###" in output:
    output = output[:output.index("###")]
print(output.strip())
# → "We need to fix the deployment pipeline."

For long dictation that may need paragraph formatting, use a higher max_new_tokens (1024-2048).
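One way to choose the budget automatically is to scale it with the prompt length, since cleaned output is usually no longer than the raw input. The helper below is a hypothetical heuristic, not part of the model or the SottoASR app: the `margin` and the extra 16 tokens are assumptions, while the 512 floor and 2048 ceiling mirror the values mentioned above.

```python
def pick_max_new_tokens(
    n_input_tokens: int,
    floor: int = 512,
    ceiling: int = 2048,
    margin: float = 1.25,
) -> int:
    """Scale the generation budget with input length, clamped to [floor, ceiling]."""
    # Headroom for added punctuation and paragraph breaks (assumed, not measured).
    budget = int(n_input_tokens * margin) + 16
    return max(floor, min(ceiling, budget))
```

With the Python example above, the input length would be `inputs["input_ids"].shape[1]`, and the result would be passed as `max_new_tokens`.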

Training Details

Pipeline

LiquidAI/LFM2.5-350M-Base
  → SFT: 157,556 rows (v22 base + 4,012 paragraph_formatting),
         LR 3e-5, β2=0.95, 3 epochs, batch 1×8,
         cosine schedule, 50 warmup steps, weight_decay 0.01,
         bf16+tf32, packed 4,096 context, seed 42
         → eval_loss 1.016 (vs v22's 1.0306, -0.014)
  → GRPO: LoRA r=32, alpha=16, all linear layers,
          LR 3e-6 cosine, 5K samples × 4 generations,
          reward = ROUGE-L × 5.0 - filler_count × 0.5 (capped 2.0) × 3.0 + format_bonus
          → final main val ROUGE-L 0.9506 / paragraph rate 91.5 %
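The GRPO reward line can be read in more than one way. The sketch below assumes the filler penalty is `min(filler_count × 0.5, 2.0) × 3.0` and treats `format_bonus` as a caller-supplied number, since its definition is not given here. The ROUGE-L function is a simplified whitespace-token F1 written for illustration; it is not the scorer used in training.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-L F1 over whitespace tokens (case-sensitive)."""
    pred, ref = prediction.split(), reference.split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def reward(prediction: str, reference: str, filler_count: int,
           format_bonus: float = 0.0) -> float:
    """Assumed reading: 5.0 * ROUGE-L - 3.0 * min(0.5 * fillers, 2.0) + bonus."""
    filler_penalty = min(filler_count * 0.5, 2.0)
    return rouge_l_f1(prediction, reference) * 5.0 - filler_penalty * 3.0 + format_bonus
```

Under this reading a perfect, filler-free output scores 5.0 before any bonus, and the filler penalty saturates at 6.0 once four or more fillers remain.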

Dataset

157,556 train / 7,121 val rows in juanquivilla/sotto-transcript-cleanup:

153,561 v22 base (prior versions: filler removal, crutch words, self-correction, false starts, grammar, dictation commands, list formatting, preserve-wording, mixed disfluencies, short utterances, long dictation)
3,995 new paragraph_formatting samples (held out 200 for paragraph_val.jsonl) — generated via AWS Bedrock Claude Haiku 4.5, instructed to produce 100–500 word raw input + 2–5 paragraph clean output, split at natural discourse boundaries

Hardware

1× RTX 4090, ~42 minutes for SFT + ~30 minutes for GRPO = ~72 min total

All Variants

Variant	Size	Use Case
Full precision (this)	676 MB	Training, GPU inference
MLX 5-bit	237 MB	Recommended for Apple Silicon
MLX 4-bit	195 MB	Smallest, slight quality trade-off

Limitations

Optimized for English conversational/meeting-style speech
Domain-specific jargon (medical, legal) may not be corrected without additional fine-tuning
Paragraph emission depends on input structure: short single-topic inputs (the typical case) are returned as a single paragraph
Filler-free rate on long-form content is lower than on short inputs (long content has more legitimate uses of words like "so", "okay", "right", which the eval list flags)
Not designed for formal written text — trained on spoken language patterns

License

MIT
