v23+paragraphs: ROUGE-L 0.9506, Filler-Free 90.2%, paragraph rate 91.5% (0% in v22)

Files changed (2)

README.md +106 -79
model.safetensors +1 -1

README.md CHANGED

@@ -15,7 +15,7 @@ datasets:
 - juanquivilla/sotto-transcript-cleanup
 ---
-# SottoASR Transcript Cleanup — LFM2.5-350M (Full Precision)
 [sottoasr.app](https://sottoasr.app) · [MLX 5-bit (recommended)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) · [MLX 4-bit (smaller)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) · [Training Dataset](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)
@@ -23,20 +23,34 @@ datasets:
 **Full-precision bf16** fine-tune of [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) for on-device speech-to-text transcript cleanup. This is the **training artifact** — for on-device deployment on Apple Silicon, use the [5-bit MLX variant](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) instead.
-This model powers on-device transcript cleanup in [SottoASR](https://sottoasr.app) — a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, and handles false starts and self-corrections — all locally, with zero cloud dependency.
 ## Key Specs
 | Property | Value |
 |----------|-------|
 | **Size** | **676 MB** |
-| **ROUGE-L** | **0.960** |
-| **Exact Match** | **69.6%** |
-| **Filler-Free** | **88.1%** |
-| **Latency** | **116 ms** average per transcript (RTX 4090) |
 | **Architecture** | Hybrid: 10 conv + 6 GQA attention (354M params) |
 | **Precision** | bf16 |
-| **Context** | 32,768 tokens (trained with 4,096 packed) |
 ## What It Does
@@ -46,38 +60,57 @@ Takes raw, unpunctuated ASR output and produces clean, readable text:
 |-----------------|------------------|
 | so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
 | the deadline is friday no monday we have until monday | The deadline is Monday. |
-| what we what i wanted to say is that the tests pass | What I wanted to say is that the tests pass. |
 | okay so the thing is basically we're running out of disk space | We're running out of disk space. |
 | uh yes | Yes. |
-| i need to uh first fix the docker builds second stabilize the flaky tests and third add better monitoring | I need to first fix the Docker builds, second stabilize the flaky tests, and third add better monitoring. |
 ## Benchmark Results
-Evaluated on 135-sample benchmark covering 12 transcript cleanup categories:
-| Category | N | ROUGE-L | Exact Match |
-|----------|---|---------|-------------|
-| crutch_words | 10 | 0.916 | 60% |
-| dictation_commands | 10 | 0.989 | 80% |
-| false_start | 10 | 0.957 | 80% |
-| filler_removal | 15 | 0.951 | 73% |
-| grammar | 10 | 0.973 | 80% |
-| list_formatting | 10 | 0.990 | 80% |
-| long_dictation | 8 | 0.934 | 13% |
-| misheard_words | 10 | 0.938 | 70% |
-| mixed | 15 | 0.946 | 60% |
-| preserve_wording | 12 | 0.995 | 75% |
-| self_correction | 15 | 0.974 | 80% |
-| short | 10 | 0.947 | 70% |
-### vs Prompted Qwen 2B Baseline
-| Metric | This model (350M) | Prompted Qwen 2B | Improvement |
 |--------|-------------------|-------------------|-------------|
-| ROUGE-L | **0.960** | 0.891 | **+0.069** |
-| Exact Match | **69.6%** | 37% | **+32.6 pts** |
-| Inference | **116 ms** | 1.0s | **8x faster** |
-| Parameters | 354M | 2B | **5.6x smaller** |
 ## Usage
@@ -103,12 +136,12 @@ model = AutoModelForCausalLM.from_pretrained(
 )
 tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")
-text = "so uh basically the thing is we need to uh fix the deployment pipeline"
 prompt = f"### Input:\n{text}\n\n### Output:\n"
 inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
 with torch.no_grad():
-    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
 output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
 if "###" in output:
     output = output[:output.index("###")]
@@ -116,64 +149,58 @@ print(output.strip())
 # → "We need to fix the deployment pipeline."
 ```
 ## Training Details
-| Parameter | Value |
-|-----------|-------|
-| Method | Full fine-tune (all 354M params, no LoRA) |
-| Dataset | 143K samples ([sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)) |
-| Learning rate | 2.5e-5 (cosine schedule) |
-| Epochs | 3 |
-| Batch size | 1 × 8 gradient accumulation |
-| Optimizer | AdamW (full precision) |
-| Precision | bf16 + tf32 |
-| Hardware | 1× RTX 4090, ~25 min |
-### Training Data
-The 143K dataset covers diverse transcript cleanup scenarios:
-- **Filler removal** — uh, um, like, you know, basically
-- **Crutch phrase stripping** — "okay so the thing is basically..."
-- **Self-correction** — "X, no wait, Y" → Y
-- **False starts** — "What we— what I meant was..."
-- **Grammar & punctuation** — capitalization, periods, commas
-- **Dictation commands** — "new paragraph", "period"
-- **Short inputs** — heavy filler, minimal content (2-5 words)
-- **Long-form transcripts** — 500+ word dictation
-## Training Progression
-| Version | ROUGE-L | Key Innovation |
-|---------|---------|----------------|
-| v1: LoRA SFT 15K | 0.771 | Baseline |
-| v3: LoRA SFT 100K | 0.863 | Data scale breakthrough |
-| v4: + GRPO | 0.891 | Matched prompted 2B |
-| v5: Full FT | 0.907 | LoRA was the bottleneck |
-| v7: LR 2e-5 | 0.943 | Learning rate breakthrough |
-| v11: + Targeted data | 0.950 | Pattern-specific training |
-| **v15: LR 2.5e-5** | **0.960** | **Current production model** |
 ## All Variants
-| Variant | Size | ROUGE-L | Use Case |
-|---------|------|---------|----------|
-| **[Full precision (this)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m)** | 676 MB | 0.960 | Training, GPU inference |
-| **[MLX 5-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit)** | 237 MB | ~0.955 | **Recommended for Apple Silicon** |
-| [MLX 4-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) | 195 MB | ~0.945 | Smallest, slight quality trade-off |
 ## Limitations
 - Optimized for **English** conversational/meeting-style speech
 - Domain-specific jargon (medical, legal) may not be corrected without additional fine-tuning
-- Long dictation (>500 words) has lowest exact match rate
 - Not designed for formal written text — trained on spoken language patterns
 ## Links
 - **Application:** [sottoasr.app](https://sottoasr.app)
 - **Source:** [github.com/juanqui/sottoasr](https://github.com/juanqui/sottoasr)
 - **Dataset:** [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)
-## License
-MIT

 - juanquivilla/sotto-transcript-cleanup
 ---
+# SottoASR Transcript Cleanup — LFM2.5-350M (Full Precision, v23 + Paragraphs)
 [sottoasr.app](https://sottoasr.app) · [MLX 5-bit (recommended)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) · [MLX 4-bit (smaller)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) · [Training Dataset](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)
 **Full-precision bf16** fine-tune of [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) for on-device speech-to-text transcript cleanup. This is the **training artifact** — for on-device deployment on Apple Silicon, use the [5-bit MLX variant](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) instead.
+This model powers on-device transcript cleanup in [SottoASR](https://sottoasr.app), a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, and, new in v23, **restructures long dictations into paragraph-formatted prose**. Everything runs locally, with zero cloud dependency.
+## What's new in v23
+v23 (this model) adds **paragraph emission** for long-form dictation. The previous v18/v22 production model produced output as a single run-on paragraph regardless of input length, which made multi-topic dictations hard to read. v23 was retrained on a dataset augmented with **4,012 new `paragraph_formatting` samples** generated via Bedrock Claude Haiku 4.5, teaching the model to insert `\n\n` paragraph breaks at natural topic / time-reference / discourse-marker boundaries.
+| Capability | v22 (previous prod) | **v23 (this model)** |
+|---|---|---|
+| Paragraph emission rate on long inputs | **0.0 %** | **91.5 %** |
+| ROUGE-L on paragraph-formatted inputs | 0.9521 | **0.9792** |
+| ROUGE-L on standard val set | 0.9539 | 0.9506 |
+| Filler-Free rate on standard val set | 90.3 % | 90.2 % |
+The 0.0033 ROUGE-L drop on the main val set (0.9539 → 0.9506) sits within the seed-variance band documented across the 100+ prior fine-tuning experiments. It is offset by the +0.027 ROUGE-L gain on paragraph-formatted inputs and by the new paragraph-emission capability.
 ## Key Specs
 | Property | Value |
 |----------|-------|
 | **Size** | **676 MB** |
+| **ROUGE-L (val set, 1000 samples)** | **0.9506** |
+| **Exact Match** | **63.9 %** |
+| **Filler-Free** | **90.2 %** |
+| **Paragraph rate (long inputs)** | **91.5 %** |
+| **Latency** | **119 ms** average per transcript (RTX 4090) |
 | **Architecture** | Hybrid: 10 conv + 6 GQA attention (354M params) |
 | **Precision** | bf16 |
+| **Training context** | 4,096 tokens (packed); model supports 32,768 tokens natively, 128K base |
 ## What It Does
 |-----------------|------------------|
 | so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
 | the deadline is friday no monday we have until monday | The deadline is Monday. |
+| what we what i wanted to say is the tests pass | What I wanted to say is the tests pass. |
 | okay so the thing is basically we're running out of disk space | We're running out of disk space. |
 | uh yes | Yes. |
+### NEW in v23: Paragraph emission on long dictations
+Long, multi-topic input is now restructured into paragraph-formatted prose:
+**Input:**
+> okay so were having some real issues with the deployment pipeline and i want to walk through whats going wrong the main problem is that the redis cache is timing out during deploys we push a new version and then for about thirty seconds the connections hang and customers see errors so i think we need to add a graceful shutdown period before we kill the old pods now separately the elasticsearch cluster has been a pain we deployed a new svelte frontend last month and it caused index corruption we need to validate the schema before pushing anything live going forward i think the answer here is to add both of these checks to our standard deployment checklist
+**Output:**
+> We're having some real issues with the deployment pipeline, and I want to walk through what's going wrong. The main problem is that the Redis cache is timing out during deploys. We push a new version and then for about thirty seconds the connections hang, and customers see errors.
+>
+> I think we need to add a graceful shutdown period before we kill the old pods. Now separately, the Elasticsearch cluster has been a pain. We deployed a new Svelte frontend last month and it caused index corruption. We need to validate the schema before pushing anything live. Going forward, I think the answer here is to add both of these checks to our standard deployment checklist.
+Notice that the model:
+- Strips speech disfluencies (the leading "okay so" and the "so" before "i think")
+- Capitalizes proper nouns (Redis, Elasticsearch, Svelte)
+- Adds punctuation, including the apostrophes missing from "were" and "whats"
+- **Inserts a paragraph break at a discourse boundary** (in this sample, before "I think we need to add a graceful shutdown period")
 ## Benchmark Results
+### Main val set (1000 samples, cleaned val.jsonl from training data)
+| Metric | v23 (this model) | v22 baseline |
+|---|---|---|
+| ROUGE-L | **0.9506** | 0.9539 |
+| Exact Match | **63.9 %** | 64.8 % |
+| Filler-Free | **90.2 %** | 90.3 % |
+| Paragraph rate | 0.0 % | 0.0 % |
+| Avg latency | 119 ms | 117 ms |
+### Paragraph val set (200 paragraph_formatting samples)
+| Metric | v23 (this model) | v22 baseline | Δ |
+|---|---|---|---|
+| ROUGE-L | **0.9792** | 0.9521 | **+0.027** |
+| **Paragraph emission rate** | **91.5 %** | **0.0 %** | **+91.5 pts** |
+| Exact Match | 2.5 % | 0.0 % | +2.5 pts |
+| Avg latency | 1.46 s | 1.40 s | +60 ms |
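The paragraph emission rate in the table above can be approximated with a few lines of Python. This is a minimal sketch, assuming the metric counts an output as paragraph-formatted when it contains at least one `\n\n` break between non-empty parts. That definition and the function names are illustrative assumptions, not the project's evaluation code.

```python
def has_paragraph_break(text: str) -> bool:
    """Return True if the text has a blank-line break between two non-empty parts."""
    parts = [p for p in text.strip().split("\n\n") if p.strip()]
    return len(parts) >= 2


def paragraph_emission_rate(outputs: list[str]) -> float:
    """Fraction of outputs with at least one paragraph break (assumed definition)."""
    if not outputs:
        return 0.0
    return sum(has_paragraph_break(o) for o in outputs) / len(outputs)


# Hypothetical model outputs, for illustration only.
samples = [
    "First topic.\n\nSecond topic.",
    "A single run-on paragraph.",
    "Yes.",
    "Problem.\n\nFix.\n\nNext steps.",
]
print(f"{paragraph_emission_rate(samples):.1%}")  # → 50.0%
```

Under this definition, a model that never emits `\n\n` scores 0.0 % regardless of output quality, which matches the v22 column.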
+### vs Prompted Qwen 2B Baseline (from earlier benchmarks)
+| Metric | This model (354M) | Prompted Qwen 2B | Improvement |
 |--------|-------------------|-------------------|-------------|
+| ROUGE-L | **0.9506** | 0.891 | **+0.060** |
+| Exact Match | **63.9 %** | 37 % | **+27 pts** |
+| Inference | **119 ms** | 1.0 s | **8.4× faster** |
+| Parameters | 354M | 2B | **5.6× smaller** |
 ## Usage
 )
 tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")
+text = "so uh basically we need to fix the deployment pipeline"
 prompt = f"### Input:\n{text}\n\n### Output:\n"
 inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
 with torch.no_grad():
+    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
 output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
 if "###" in output:
     output = output[:output.index("###")]
 # → "We need to fix the deployment pipeline."
 ```
+For long dictation that may need paragraph formatting, use a higher `max_new_tokens` (1024-2048).
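One way to apply this is to pick the generation budget from the input length before calling `generate`. The sketch below reuses the prompt template and the `###` truncation from the snippet above. The helper names and the 150-word and 400-word thresholds are assumptions made for illustration; only the budgets (512, 1024, 2048) come from this README.

```python
def build_prompt(text: str) -> str:
    """Wrap raw ASR text in the prompt template from the usage snippet."""
    return f"### Input:\n{text}\n\n### Output:\n"


def pick_max_new_tokens(text: str) -> int:
    """Choose a generation budget from the input word count.

    The word-count thresholds are hypothetical and should be tuned.
    """
    n_words = len(text.split())
    if n_words < 150:
        return 512
    if n_words < 400:
        return 1024
    return 2048


def postprocess(generated: str) -> str:
    """Cut the output at the first '###' marker and trim outer whitespace."""
    if "###" in generated:
        generated = generated[: generated.index("###")]
    return generated.strip()


short = "so uh basically we need to fix the deployment pipeline"
print(pick_max_new_tokens(short))
print(postprocess("We need to fix the deployment pipeline.\n\n### Input:"))
```

The value returned by `pick_max_new_tokens` would be passed as `max_new_tokens` to `model.generate(...)`. `postprocess` strips only outer whitespace, so any `\n\n` paragraph breaks inside the output are preserved.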
 ## Training Details
+### Pipeline
+```
+LiquidAI/LFM2.5-350M-Base
+  → SFT: 157,556 rows (v22 base + 4,012 paragraph_formatting),
+         LR 3e-5, β2=0.95, 3 epochs, batch 1×8,
+         cosine schedule, 50 warmup steps, weight_decay 0.01,
+         bf16+tf32, packed 4,096 context, seed 42
+         → eval_loss 1.016 (vs v22's 1.0306, -0.014)
+  → GRPO: LoRA r=32, alpha=16, all linear layers,
+          LR 3e-6 cosine, 5K samples × 4 generations,
+          reward = ROUGE-L × 5.0 - filler_count × 0.5 (capped 2.0) × 3.0 + format_bonus
+          → final main val ROUGE-L 0.9506 / paragraph rate 91.5 %
+```
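The GRPO reward line above leaves its grouping implicit, so the following is a sketch under one possible reading: the filler penalty is `min(filler_count × 0.5, 2.0) × 3.0`. The filler list, the whitespace tokenization, the use of an LCS-based F1 as ROUGE-L, and the choice to pass `format_bonus` in as a parameter (its definition is not given here) are all assumptions, not the training code.

```python
FILLERS = {"uh", "um", "basically"}  # hypothetical filler list


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """LCS-based F1 over lowercased whitespace tokens (one common ROUGE-L form)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def reward(prediction: str, reference: str, format_bonus: float = 0.0) -> float:
    """Assumed reading: ROUGE-L * 5.0 - min(fillers * 0.5, 2.0) * 3.0 + format_bonus."""
    tokens = prediction.lower().split()
    filler_count = sum(tok.strip(".,!?") in FILLERS for tok in tokens)
    filler_penalty = min(filler_count * 0.5, 2.0) * 3.0
    return rouge_l_f1(prediction, reference) * 5.0 - filler_penalty + format_bonus


reference = "We need to fix the deployment pipeline."
print(reward(reference, reference))  # → 5.0
print(round(reward("Uh we need to fix the deployment pipeline.", reference), 3))
```

Under this reading, a perfect output earns 5.0 before any format bonus, and the filler penalty saturates at 6.0 once four or more fillers remain.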
+### Dataset
+**157,556 train / 7,121 val rows** in [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup):
+- 153,561 v22 base (prior versions: filler removal, crutch words, self-correction, false starts, grammar, dictation commands, list formatting, preserve-wording, mixed disfluencies, short utterances, long dictation)
+- **3,995 new paragraph_formatting samples** (held out 200 for `paragraph_val.jsonl`) — generated via AWS Bedrock Claude Haiku 4.5, instructed to produce 100–500 word raw input + 2–5 paragraph clean output, split at natural discourse boundaries
+### Hardware
+1× RTX 4090, ~42 minutes for SFT + ~30 minutes for GRPO = ~72 min total
 ## All Variants
+| Variant | Size | Use Case |
+|---------|------|----------|
+| **[Full precision (this)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m)** | 676 MB | Training, GPU inference |
+| **[MLX 5-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit)** | 237 MB | **Recommended for Apple Silicon** |
+| [MLX 4-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) | 195 MB | Smallest, slight quality trade-off |
 ## Limitations
 - Optimized for **English** conversational/meeting-style speech
 - Domain-specific jargon (medical, legal) may not be corrected without additional fine-tuning
+- Paragraph emission depends on input structure: short, single-topic inputs (the typical case) are returned as a single paragraph
+- Filler-free rate on long-form content is lower than on short inputs, because long content contains more legitimate uses of words like "so", "okay", and "right" that the eval filler list still flags
 - Not designed for formal written text — trained on spoken language patterns
+## License
+MIT
 ## Links
 - **Application:** [sottoasr.app](https://sottoasr.app)
 - **Source:** [github.com/juanqui/sottoasr](https://github.com/juanqui/sottoasr)
 - **Dataset:** [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c37b567f001b8b532b9314a81ae321ac45465f273db36a29ef6ac02ef5415843
 size 708984464

 version https://git-lfs.github.com/spec/v1
+oid sha256:93579c0a12c842c568360fdea81c77d5f3b13a88ef9e5f0a0b3e2f889f6973ec
 size 708984464