license: mit
language:
  - en
base_model: LiquidAI/LFM2.5-350M-Base
tags:
  - speech-to-text
  - transcript-cleanup
  - text-correction
  - asr-post-processing
  - LFM
  - LiquidAI
pipeline_tag: text-generation
datasets:
  - juanquivilla/sotto-transcript-cleanup

SottoASR Transcript Cleanup: LFM2.5-350M (Full Precision, v23 + Paragraphs)

sottoasr.app · MLX 5-bit (recommended) · MLX 4-bit (smaller) · Training Dataset

Overview

Full-precision bf16 fine-tune of LiquidAI/LFM2.5-350M-Base for on-device speech-to-text transcript cleanup. This is the training artifact; for on-device deployment on Apple Silicon, use the 5-bit MLX variant instead.

This model powers on-device transcript cleanup in SottoASR, a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, and (new in v23) restructures long dictations into paragraph-formatted prose, all locally with zero cloud dependency.

What's new in v23

v23 (this model) adds paragraph emission for long-form dictation. The previous v18/v22 production model produced output as a single run-on paragraph regardless of input length, which made multi-topic dictations hard to read. v23 was retrained on a dataset augmented with 4,012 new paragraph_formatting samples generated via Bedrock Claude Haiku 4.5, teaching the model to insert \n\n paragraph breaks at natural topic / time-reference / discourse-marker boundaries.

| Capability | v22 (previous prod) | v23 (this model) |
|---|---|---|
| Paragraph emission rate on long inputs | 0.0 % | 91.5 % |
| ROUGE-L on paragraph-formatted inputs | 0.9521 | 0.9792 |
| ROUGE-L on standard val set | 0.9539 | 0.9506 |
| Filler-Free rate on standard val set | 90.3 % | 90.2 % |

The 0.003 main val ROUGE-L regression sits within the natural seed-variance band documented across the 100+ prior fine-tuning experiments and is offset by the +0.027 ROUGE-L improvement on paragraph-formatted inputs and the new structural capability.

Key Specs

| Property | Value |
|---|---|
| Size | 676 MB |
| ROUGE-L (val set, 1000 samples) | 0.9506 |
| Exact Match | 63.9 % |
| Filler-Free | 90.2 % |
| Paragraph rate (long inputs) | 91.5 % |
| Latency | 119 ms average per transcript (RTX 4090) |
| Architecture | Hybrid: 10 conv + 6 GQA attention (354M params) |
| Precision | bf16 |
| Training context | 4,096 tokens (packed); model supports 32,768 tokens natively, 128K base |

What It Does

Takes raw, unpunctuated ASR output and produces clean, readable text:

| Input (raw ASR) | Output (cleaned) |
|---|---|
| so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
| the deadline is friday no monday we have until monday | The deadline is Monday. |
| what we what i wanted to say is the tests pass | What I wanted to say is the tests pass. |
| okay so the thing is basically we're running out of disk space | We're running out of disk space. |
| uh yes | Yes. |

NEW in v23: Paragraph emission on long dictations

Long, multi-topic input is now restructured into paragraph-formatted prose:

Input:

okay so were having some real issues with the deployment pipeline and i want to walk through whats going wrong the main problem is that the redis cache is timing out during deploys we push a new version and then for about thirty seconds the connections hang and customers see errors so i think we need to add a graceful shutdown period before we kill the old pods now separately the elasticsearch cluster has been a pain we deployed a new svelte frontend last month and it caused index corruption we need to validate the schema before pushing anything live going forward i think the answer here is to add both of these checks to our standard deployment checklist

Output:

We're having some real issues with the deployment pipeline, and I want to walk through what's going wrong. The main problem is that the Redis cache is timing out during deploys. We push a new version and then for about thirty seconds the connections hang, and customers see errors.

I think we need to add a graceful shutdown period before we kill the old pods. Now separately, the Elasticsearch cluster has been a pain. We deployed a new Svelte frontend last month and it caused index corruption. We need to validate the schema before pushing anything live. Going forward, I think the answer here is to add both of these checks to our standard deployment checklist.

Notice the model:

  • Strips speech disfluencies ("okay so", "uh", "basically")
  • Capitalizes proper nouns (Redis, Elasticsearch, Svelte)
  • Adds correct punctuation
  • Inserts a paragraph break at the topic shift ("the elasticsearch cluster has been a pain")

Benchmark Results

Main val set (1000 samples, cleaned val.jsonl from training data)

| Metric | v23 (this model) | v22 baseline |
|---|---|---|
| ROUGE-L | 0.9506 | 0.9539 |
| Exact Match | 63.9 % | 64.8 % |
| Filler-Free | 90.2 % | 90.3 % |
| Paragraph rate | 0.0 % | 0.0 % |
| Avg latency | 119 ms | 117 ms |

Paragraph val set (200 paragraph_formatting samples)

| Metric | v23 (this model) | v22 baseline | Δ |
|---|---|---|---|
| ROUGE-L | 0.9792 | 0.9521 | +0.027 |
| Paragraph emission rate | 91.5 % | 0.0 % | +91.5 pts |
| Exact Match | 2.5 % | 0.0 % | +2.5 pts |
| Avg latency | 1.46 s | 1.40 s | +60 ms |
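
The card does not spell out how the paragraph emission rate is measured. A minimal sketch, assuming it is simply the share of outputs that contain at least one blank-line (`\n\n`) break, could look like the following. The function names and sample strings are hypothetical and are not taken from the evaluation scripts.

```python
def has_paragraph_break(text: str) -> bool:
    """True if the text contains at least one interior blank-line break."""
    # strip() first so that trailing newlines alone do not count as a break
    return "\n\n" in text.strip()


def paragraph_emission_rate(outputs: list[str]) -> float:
    """Fraction of outputs that contain at least one paragraph break."""
    if not outputs:
        return 0.0
    return sum(has_paragraph_break(o) for o in outputs) / len(outputs)


# Hypothetical model outputs, for illustration only
samples = [
    "First topic.\n\nSecond topic.",
    "A single run-on paragraph.",
    "Intro.\n\nBody.\n\nWrap-up.",
    "Trailing newlines only.\n\n",
]
print(f"{paragraph_emission_rate(samples):.1%}")  # 50.0%
```

Under this reading, an output counts once no matter how many breaks it contains, so the rate says nothing about whether the breaks fall at sensible boundaries.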

vs Prompted Qwen 2B Baseline (from earlier benchmarks)

| Metric | This model (354M) | Prompted Qwen 2B | Improvement |
|---|---|---|---|
| ROUGE-L | 0.9506 | 0.891 | +0.060 |
| Exact Match | 63.9 % | 37 % | +27 pts |
| Inference | 119 ms | 1.0 s | 8.4× faster |
| Parameters | 354M | 2B | 5.6× smaller |

Usage

Prompt Format

### Input:
{raw transcript}

### Output:
{model generates cleaned text}
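
The template above can be wrapped in two small helpers: one that builds the prompt and one that trims the generation if the model starts a new `###` section. This is a sketch only; `build_prompt` and `extract_cleaned` are hypothetical names, not part of any published package.

```python
PROMPT_TEMPLATE = "### Input:\n{transcript}\n\n### Output:\n"


def build_prompt(raw_transcript: str) -> str:
    """Fill the prompt template with a raw ASR transcript."""
    return PROMPT_TEMPLATE.format(transcript=raw_transcript.strip())


def extract_cleaned(generated: str) -> str:
    """Keep only the text before any follow-on '###' marker."""
    marker = generated.find("###")
    if marker != -1:
        generated = generated[:marker]
    return generated.strip()


print(repr(build_prompt("uh yes")))
print(extract_cleaned("Yes.\n\n### Input:\nsomething else"))
```

Truncating at the first `###` assumes that cleaned transcripts never legitimately contain that sequence, which seems reasonable for dictated speech.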

Python Example

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the bf16 weights and the matching tokenizer from the Hub
model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

# Wrap the raw transcript in the prompt format shown above
text = "so uh basically we need to fix the deployment pipeline"
prompt = f"### Input:\n{text}\n\n### Output:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding keeps the cleanup deterministic
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
# Drop anything after a follow-on "###" section marker
if "###" in output:
    output = output[:output.index("###")]
print(output.strip())
# → "We need to fix the deployment pipeline."

For long dictation that may need paragraph formatting, use a higher max_new_tokens (1024-2048).
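
One way to apply this advice is to pick the generation budget from the input length. The sketch below is only an illustration: the word-count thresholds are assumptions chosen for the example rather than values from the training or evaluation setup, and `pick_max_new_tokens` is a hypothetical helper.

```python
def pick_max_new_tokens(raw_transcript: str) -> int:
    """Choose a generation budget from the transcript's word count.

    The thresholds are illustrative assumptions, not tuned values.
    """
    n_words = len(raw_transcript.split())
    if n_words < 100:
        return 512   # short utterances: the default from the example above
    if n_words < 300:
        return 1024  # medium dictation
    return 2048      # long, multi-topic dictation


short = "so uh basically we need to fix the deployment pipeline"
long_dictation = " ".join(["word"] * 400)
print(pick_max_new_tokens(short), pick_max_new_tokens(long_dictation))
```

The result can be passed as `max_new_tokens` to `model.generate`. Because the cleaned output is usually no longer than the input, a budget that comfortably exceeds the input's token count should be enough.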

Training Details

Pipeline

LiquidAI/LFM2.5-350M-Base
  → SFT: 157,556 rows (v22 base + 4,012 paragraph_formatting),
         LR 3e-5, β2=0.95, 3 epochs, batch 1×8,
         cosine schedule, 50 warmup steps, weight_decay 0.01,
         bf16+tf32, packed 4,096 context, seed 42
         → eval_loss 1.016 (vs v22's 1.0306, -0.014)
  → GRPO: LoRA r=32, alpha=16, all linear layers,
          LR 3e-6 cosine, 5K samples × 4 generations,
          reward = ROUGE-L × 5.0 - filler_count × 0.5 (capped 2.0) × 3.0 + format_bonus
          → final main val ROUGE-L 0.9506 / paragraph rate 91.5 %
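
The GRPO reward line can be read in more than one way. The sketch below takes one plausible reading and should not be mistaken for the training code: it assumes the 2.0 cap applies to `filler_count × 0.5` before the ×3.0 weight, that ROUGE-L is the LCS-based F-measure over whitespace tokens, and that `format_bonus` is supplied by the caller, since the card does not define it. The `FILLERS` set is a hypothetical stand-in for the real filler list.

```python
FILLERS = {"uh", "um", "basically"}  # hypothetical stand-in for the real list


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F-measure over lowercased whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def count_fillers(text: str) -> int:
    """Count tokens that match the filler list, ignoring edge punctuation."""
    return sum(t.strip(".,!?;:").lower() in FILLERS for t in text.split())


def reward(candidate: str, reference: str, format_bonus: float = 0.0) -> float:
    """ROUGE-L x 5.0 minus a capped filler penalty x 3.0, plus the bonus."""
    filler_penalty = min(count_fillers(candidate) * 0.5, 2.0)  # assumed cap
    return rouge_l_f1(candidate, reference) * 5.0 - filler_penalty * 3.0 + format_bonus


ref = "We need to fix it."
print(reward("We need to fix it.", ref))     # 5.0: exact match, no fillers
print(reward("uh we need to fix it.", ref))  # lower: one filler, lower precision
```

Under this reading the penalty saturates at four fillers (a maximum deduction of 6.0), so a single leftover filler costs 1.5 reward points before any loss in ROUGE-L.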

Dataset

157,556 train / 7,121 val rows in juanquivilla/sotto-transcript-cleanup:

  • 153,561 v22 base (prior versions: filler removal, crutch words, self-correction, false starts, grammar, dictation commands, list formatting, preserve-wording, mixed disfluencies, short utterances, long dictation)
  • 3,995 new paragraph_formatting samples (held out 200 for paragraph_val.jsonl), generated via AWS Bedrock Claude Haiku 4.5 and instructed to produce 100–500 word raw input + 2–5 paragraph clean output, split at natural discourse boundaries

Hardware

1× RTX 4090, ~42 minutes for SFT + ~30 minutes for GRPO = ~72 min total

All Variants

| Variant | Size | Use Case |
|---|---|---|
| Full precision (this) | 676 MB | Training, GPU inference |
| MLX 5-bit | 237 MB | Recommended for Apple Silicon |
| MLX 4-bit | 195 MB | Smallest, slight quality trade-off |
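
As a rough sanity check on these sizes, the raw weight footprint can be estimated from the parameter count and the bits per weight. This is a back-of-the-envelope sketch: it assumes every one of the 354M parameters is stored at the stated precision and ignores quantization metadata (such as per-group scales), any tensors kept at higher precision, and file-format overhead.

```python
def weight_size_mib(n_params: int, bits_per_weight: float) -> float:
    """Raw weight storage in MiB, ignoring metadata and file overhead."""
    return n_params * bits_per_weight / 8 / (1024 ** 2)


N_PARAMS = 354_000_000  # parameter count from the spec table

for name, bits in [("bf16", 16), ("5-bit", 5), ("4-bit", 4)]:
    print(f"{name}: ~{weight_size_mib(N_PARAMS, bits):.0f} MiB")
```

The bf16 estimate (about 675 MiB) lands close to the listed 676 MB. The quantized estimates come out below the listed 237 MB and 195 MB, which would be consistent with the overheads ignored here, although the card does not break the sizes down.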

Limitations

  • Optimized for English conversational/meeting-style speech
  • Domain-specific jargon (medical, legal) may not be corrected without additional fine-tuning
  • Paragraph emission is conditional on input structure: short single-topic inputs (the typical case) will not be paragraph-broken
  • Filler-free rate on long-form content is lower than on short inputs (long content has more legitimate uses of words like "so", "okay", "right", which the eval list flags)
  • Not designed for formal written text β€” trained on spoken language patterns
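
The filler-free limitation is easier to see with a concrete check. The sketch below assumes the metric counts an output as filler-free only when it contains no word from a fixed list; `EVAL_FILLERS` is a hypothetical list built from the words mentioned in this card, not the actual evaluation list.

```python
import re

# Hypothetical list; the real eval list is not reproduced in this card
EVAL_FILLERS = {"uh", "um", "so", "okay", "right", "basically"}


def flagged_fillers(text: str) -> list[str]:
    """Return every word in the text that appears in the filler list."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w in EVAL_FILLERS]


def filler_free_rate(outputs: list[str]) -> float:
    """Fraction of outputs with no flagged words at all."""
    if not outputs:
        return 0.0
    return sum(not flagged_fillers(o) for o in outputs) / len(outputs)


outputs = [
    "We need to fix the deployment pipeline.",
    "Turn right at the light, so we avoid traffic.",  # legitimate uses
]
print(flagged_fillers(outputs[1]))  # ['right', 'so']
print(filler_free_rate(outputs))    # 0.5
```

A list-based check cannot tell "right" as a direction from "right" as a verbal tic, so longer outputs, which have more chances to contain such words, score lower even when the cleanup is correct.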

License

MIT

Links