v23+paragraphs: ROUGE-L 0.9506, Filler-Free 90.2%, paragraph rate 91.5% (0% in v22)
- README.md +106 -79
- model.safetensors +1 -1
README.md
CHANGED
|
@@ -15,7 +15,7 @@ datasets:
|
|
| 15 |
- juanquivilla/sotto-transcript-cleanup
|
| 16 |
---
|
| 17 |
|
| 18 |
-
# SottoASR Transcript Cleanup - LFM2.5-350M (Full Precision)
|
| 19 |
|
| 20 |
[sottoasr.app](https://sottoasr.app) · [MLX 5-bit (recommended)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) · [MLX 4-bit (smaller)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) · [Training Dataset](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)
|
| 21 |
|
|
@@ -23,20 +23,34 @@ datasets:
|
|
| 23 |
|
| 24 |
**Full-precision bf16** fine-tune of [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) for on-device speech-to-text transcript cleanup. This is the **training artifact**; for on-device deployment on Apple Silicon, use the [5-bit MLX variant](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) instead.
|
| 25 |
|
| 26 |
-
This model powers on-device transcript cleanup in [SottoASR](https://sottoasr.app), a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation,
|
| 27 |
|
| 28 |
## Key Specs
|
| 29 |
|
| 30 |
| Property | Value |
|
| 31 |
|----------|-------|
|
| 32 |
| **Size** | **676 MB** |
|
| 33 |
-
| **ROUGE-L** | **0.
|
| 34 |
-
| **Exact Match** | **
|
| 35 |
-
| **Filler-Free** | **
|
| 36 |
-
| **
|
| 37 |
| **Architecture** | Hybrid: 10 conv + 6 GQA attention (354M params) |
|
| 38 |
| **Precision** | bf16 |
|
| 39 |
-
| **
|
| 40 |
|
| 41 |
## What It Does
|
| 42 |
|
|
@@ -46,38 +60,57 @@ Takes raw, unpunctuated ASR output and produces clean, readable text:
|
|
| 46 |
|-----------------|------------------|
|
| 47 |
| so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
|
| 48 |
| the deadline is friday no monday we have until monday | The deadline is Monday. |
|
| 49 |
-
| what we what i wanted to say is
|
| 50 |
| okay so the thing is basically we're running out of disk space | We're running out of disk space. |
|
| 51 |
| uh yes | Yes. |
|
| 52 |
-
|
| 53 |
|
| 54 |
## Benchmark Results
|
| 55 |
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
|
| 59 |
-
|---
|
| 60 |
-
|
|
| 61 |
-
|
|
| 62 |
-
|
|
| 63 |
-
|
|
| 64 |
-
|
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
|
| 69 |
-
|
|
| 70 |
-
|
|
| 71 |
-
|
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
|--------|-------------------|-------------------|-------------|
|
| 77 |
-
| ROUGE-L | **0.
|
| 78 |
-
| Exact Match | **
|
| 79 |
-
| Inference | **
|
| 80 |
-
| Parameters | 354M | 2B | **5.
|
| 81 |
|
| 82 |
## Usage
|
| 83 |
|
|
@@ -103,12 +136,12 @@ model = AutoModelForCausalLM.from_pretrained(
|
|
| 103 |
)
|
| 104 |
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")
|
| 105 |
|
| 106 |
-
text = "so uh basically
|
| 107 |
prompt = f"### Input:\n{text}\n\n### Output:\n"
|
| 108 |
|
| 109 |
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
|
| 110 |
with torch.no_grad():
|
| 111 |
-
out = model.generate(**inputs, max_new_tokens=
|
| 112 |
output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
|
| 113 |
if "###" in output:
|
| 114 |
output = output[:output.index("###")]
|
|
@@ -116,64 +149,58 @@ print(output.strip())
|
|
| 116 |
# → "We need to fix the deployment pipeline."
|
| 117 |
```
|
| 118 |
|
| 119 |
## Training Details
|
| 120 |
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
-
|
| 141 |
-
- **
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
| Version | ROUGE-L | Key Innovation |
|
| 147 |
-
|---------|---------|----------------|
|
| 148 |
-
| v1: LoRA SFT 15K | 0.771 | Baseline |
|
| 149 |
-
| v3: LoRA SFT 100K | 0.863 | Data scale breakthrough |
|
| 150 |
-
| v4: + GRPO | 0.891 | Matched prompted 2B |
|
| 151 |
-
| v5: Full FT | 0.907 | LoRA was the bottleneck |
|
| 152 |
-
| v7: LR 2e-5 | 0.943 | Learning rate breakthrough |
|
| 153 |
-
| v11: + Targeted data | 0.950 | Pattern-specific training |
|
| 154 |
-
| **v15: LR 2.5e-5** | **0.960** | **Current production model** |
|
| 155 |
|
| 156 |
## All Variants
|
| 157 |
|
| 158 |
-
| Variant | Size |
|
| 159 |
-
|---------|------|---------
|
| 160 |
-
| **[Full precision (this)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m)** | 676 MB |
|
| 161 |
-
| **[MLX 5-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit)** | 237 MB |
|
| 162 |
-
| [MLX 4-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) | 195 MB |
|
| 163 |
|
| 164 |
## Limitations
|
| 165 |
|
| 166 |
- Optimized for **English** conversational/meeting-style speech
|
| 167 |
- Domain-specific jargon (medical, legal) may not be corrected without additional fine-tuning
|
| 168 |
-
-
|
| 169 |
- Not designed for formal written text; it is trained on spoken language patterns
|
| 170 |
|
| 171 |
## Links
|
| 172 |
|
| 173 |
- **Application:** [sottoasr.app](https://sottoasr.app)
|
| 174 |
- **Source:** [github.com/juanqui/sottoasr](https://github.com/juanqui/sottoasr)
|
| 175 |
- **Dataset:** [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)
|
| 176 |
-
|
| 177 |
-
## License
|
| 178 |
-
|
| 179 |
-
MIT
|
- juanquivilla/sotto-transcript-cleanup
---

# SottoASR Transcript Cleanup - LFM2.5-350M (Full Precision, v23 + Paragraphs)

[sottoasr.app](https://sottoasr.app) · [MLX 5-bit (recommended)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) · [MLX 4-bit (smaller)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) · [Training Dataset](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)

**Full-precision bf16** fine-tune of [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) for on-device speech-to-text transcript cleanup. This is the **training artifact**; for on-device deployment on Apple Silicon, use the [5-bit MLX variant](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) instead.

This model powers on-device transcript cleanup in [SottoASR](https://sottoasr.app), a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, and, **new in v23, restructures long dictations into paragraph-formatted prose**, all locally with zero cloud dependency.

## What's new in v23

v23 (this model) adds **paragraph emission** for long-form dictation. The previous production model (v18/v22) emitted a single run-on paragraph regardless of input length, which made multi-topic dictations hard to read. v23 was retrained on a dataset augmented with **4,012 new `paragraph_formatting` samples** generated via Bedrock Claude Haiku 4.5, teaching the model to insert `\n\n` paragraph breaks at natural topic, time-reference, and discourse-marker boundaries.

| Capability | v22 (previous prod) | **v23 (this model)** |
|---|---|---|
| Paragraph emission rate on long inputs | **0.0 %** | **91.5 %** |
| ROUGE-L on paragraph-formatted inputs | 0.9521 | **0.9792** |
| ROUGE-L on standard val set | 0.9539 | 0.9506 |
| Filler-Free rate on standard val set | 90.3 % | 90.2 % |

The 0.003 ROUGE-L regression on the main val set (0.9539 to 0.9506) sits within the seed-variance band documented across the 100+ prior fine-tuning experiments, and it is offset by the +0.027 ROUGE-L improvement on paragraph-formatted inputs and by the new structural capability.

## Key Specs

| Property | Value |
|----------|-------|
| **Size** | **676 MB** |
| **ROUGE-L (val set, 1000 samples)** | **0.9506** |
| **Exact Match** | **63.9 %** |
| **Filler-Free** | **90.2 %** |
| **Paragraph rate (long inputs)** | **91.5 %** |
| **Latency** | **119 ms** average per transcript (RTX 4090) |
| **Architecture** | Hybrid: 10 conv + 6 GQA attention (354M params) |
| **Precision** | bf16 |
| **Training context** | 4,096 tokens (packed); model supports 32,768 tokens natively, 128K base |

## What It Does

|-----------------|------------------|
| so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
| the deadline is friday no monday we have until monday | The deadline is Monday. |
| what we what i wanted to say is the tests pass | What I wanted to say is the tests pass. |
| okay so the thing is basically we're running out of disk space | We're running out of disk space. |
| uh yes | Yes. |

### NEW in v23: Paragraph emission on long dictations

Long, multi-topic input is now restructured into paragraph-formatted prose:

**Input:**
> okay so were having some real issues with the deployment pipeline and i want to walk through whats going wrong the main problem is that the redis cache is timing out during deploys we push a new version and then for about thirty seconds the connections hang and customers see errors so i think we need to add a graceful shutdown period before we kill the old pods now separately the elasticsearch cluster has been a pain we deployed a new svelte frontend last month and it caused index corruption we need to validate the schema before pushing anything live going forward i think the answer here is to add both of these checks to our standard deployment checklist

**Output:**
> We're having some real issues with the deployment pipeline, and I want to walk through what's going wrong. The main problem is that the Redis cache is timing out during deploys. We push a new version and then for about thirty seconds the connections hang, and customers see errors.
>
> I think we need to add a graceful shutdown period before we kill the old pods. Now separately, the Elasticsearch cluster has been a pain. We deployed a new Svelte frontend last month and it caused index corruption. We need to validate the schema before pushing anything live. Going forward, I think the answer here is to add both of these checks to our standard deployment checklist.

Notice the model:
- Strips speech disfluencies (the leading "okay so" and the "so" before "i think")
- Capitalizes proper nouns (Redis, Elasticsearch, Svelte)
- Adds correct punctuation
- **Inserts a paragraph break** where the dictation moves from describing the problem to proposing a fix ("I think we need to add a graceful shutdown period")

## Benchmark Results

### Main val set (1000 samples, cleaned val.jsonl from training data)

| Metric | v23 (this model) | v22 baseline |
|---|---|---|
| ROUGE-L | **0.9506** | 0.9539 |
| Exact Match | **63.9 %** | 64.8 % |
| Filler-Free | **90.2 %** | 90.3 % |
| Paragraph rate | 0.0 % | 0.0 % |
| Avg latency | 119 ms | 117 ms |

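
The Exact Match and Filler-Free rows can be read as simple per-sample rates. The sketch below shows one way such rates might be computed; the filler list and the whole-word matching rule are assumptions for illustration, not the project's actual evaluation code.

```python
import re

# Hypothetical filler list; the project's real eval list is not shown here.
FILLERS = ("uh", "um", "basically", "okay", "so", "right")

def is_filler_free(text: str, fillers=FILLERS) -> bool:
    """True when no filler appears as a whole word (case-insensitive)."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return not any(f in words for f in fillers)

def rate(flags) -> float:
    """Percentage of True values in a sequence of booleans."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0

def evaluate(predictions, references):
    """Return (exact_match_pct, filler_free_pct) for paired outputs."""
    exact = rate(p.strip() == r.strip() for p, r in zip(predictions, references))
    clean = rate(is_filler_free(p) for p in predictions)
    return exact, clean

preds = ["We need to fix the deployment pipeline.", "So the deadline is Monday."]
refs = ["We need to fix the deployment pipeline.", "The deadline is Monday."]
exact_pct, clean_pct = evaluate(preds, refs)
print(f"exact match {exact_pct:.1f} %, filler-free {clean_pct:.1f} %")
```

A whole-word list like this also flags legitimate uses of "so" or "right", which is the effect the Limitations section describes for long-form content.
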

### Paragraph val set (200 paragraph_formatting samples)

| Metric | v23 (this model) | v22 baseline | Δ |
|---|---|---|---|
| ROUGE-L | **0.9792** | 0.9521 | **+0.027** |
| **Paragraph emission rate** | **91.5 %** | **0.0 %** | **+91.5 pts** |
| Exact Match | 2.5 % | 0.0 % | +2.5 pts |
| Avg latency | 1.46 s | 1.40 s | +60 ms |

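
A paragraph emission rate like the one in this table could be measured by checking each output for a `\n\n` break. This is a minimal sketch under that assumption; the function names are illustrative, and the actual evaluation script may define the metric differently.

```python
def has_paragraph_break(text: str) -> bool:
    """True when the stripped text contains at least one blank-line break."""
    return "\n\n" in text.strip()

def paragraph_emission_rate(outputs) -> float:
    """Percentage of outputs that contain a paragraph break."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return 100.0 * sum(has_paragraph_break(o) for o in outputs) / len(outputs)

samples = [
    "The Redis cache is timing out.\n\nSeparately, the cluster has been a pain.",
    "We need to fix the deployment pipeline.",
    "First topic.\n\nSecond topic.\n\nThird topic.",
    "Yes.\n\n",  # trailing whitespace only, so not counted as a break
]
print(f"{paragraph_emission_rate(samples):.1f} %")
```
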

### vs Prompted Qwen 2B Baseline (from earlier benchmarks)

| Metric | This model (354M) | Prompted Qwen 2B | Improvement |
|--------|-------------------|-------------------|-------------|
| ROUGE-L | **0.9506** | 0.891 | **+0.060** |
| Exact Match | **63.9 %** | 37 % | **+27 pts** |
| Inference | **119 ms** | 1.0 s | **8.4× faster** |
| Parameters | 354M | 2B | **5.6× smaller** |

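
The Improvement column follows directly from the two model columns. As a quick check of the arithmetic, using only the figures in the table:

```python
# Figures copied from the comparison table.
rouge_ours, rouge_qwen = 0.9506, 0.891
em_ours, em_qwen = 63.9, 37.0
latency_ours_ms, latency_qwen_ms = 119.0, 1000.0
params_ours_m, params_qwen_m = 354.0, 2000.0

rouge_gain = rouge_ours - rouge_qwen             # absolute ROUGE-L difference
em_gain_pts = em_ours - em_qwen                  # percentage points
speedup = latency_qwen_ms / latency_ours_ms      # ratio of latencies
size_ratio = params_qwen_m / params_ours_m       # ratio of parameter counts

print(f"ROUGE-L gain: {rouge_gain:+.3f}")
print(f"Exact Match gain: {em_gain_pts:+.0f} pts")
print(f"Speedup: {speedup:.1f}x")
print(f"Size ratio: {size_ratio:.1f}x")
```
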

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The diff elides the original from_pretrained arguments (README lines 117-135);
# loading by repository id alone is an assumption.
model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m"
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

text = "so uh basically we need to fix the deployment pipeline"
prompt = f"### Input:\n{text}\n\n### Output:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding keeps the cleanup deterministic.
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, then cut at any trailing "###" marker.
output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
if "###" in output:
    output = output[:output.index("###")]
print(output.strip())
# → "We need to fix the deployment pipeline."
```

For long dictation that may need paragraph formatting, use a higher `max_new_tokens` (1024-2048).
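
One way to apply this advice is to scale the generation budget with the input length. The helper below is a sketch, not part of the project: `build_prompt` and `extract_output` mirror the prompt format and `###` truncation from the snippet above, while `pick_max_new_tokens` and its 100-word threshold are assumptions chosen for illustration.

```python
def build_prompt(text: str) -> str:
    """Wrap raw ASR text in the Input/Output template."""
    return f"### Input:\n{text}\n\n### Output:\n"

def extract_output(generated: str) -> str:
    """Keep text before any trailing '###' marker and trim outer whitespace."""
    if "###" in generated:
        generated = generated[:generated.index("###")]
    return generated.strip()

def pick_max_new_tokens(text: str, long_threshold_words: int = 100) -> int:
    """Hypothetical rule: 512 tokens for short input, 2048 for long dictation."""
    return 2048 if len(text.split()) >= long_threshold_words else 512

short = "so uh basically we need to fix the deployment pipeline"
long_dictation = " ".join(["word"] * 150)
print(pick_max_new_tokens(short))
print(pick_max_new_tokens(long_dictation))
print(extract_output("First paragraph.\n\nSecond paragraph.\n### Input:"))
```

The value returned by `pick_max_new_tokens` would be passed as `max_new_tokens` to `model.generate`. Internal `\n\n` breaks survive `extract_output` because only leading and trailing whitespace is stripped.
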

## Training Details

### Pipeline

```
LiquidAI/LFM2.5-350M-Base
  → SFT: 157,556 rows (v22 base + 4,012 paragraph_formatting),
         LR 3e-5, β2=0.95, 3 epochs, batch 1×8,
         cosine schedule, 50 warmup steps, weight_decay 0.01,
         bf16+tf32, packed 4,096 context, seed 42
  → eval_loss 1.016 (vs v22's 1.0306, -0.014)
  → GRPO: LoRA r=32, alpha=16, all linear layers,
          LR 3e-6 cosine, 5K samples × 4 generations,
          reward = ROUGE-L × 5.0 - filler_count × 0.5 (capped 2.0) × 3.0 + format_bonus
  → final main val ROUGE-L 0.9506 / paragraph rate 91.5 %
```
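
The GRPO reward line is terse, so here is one possible reading of it as code. This sketch assumes the filler penalty is `min(filler_count × 0.5, 2.0)` scaled by 3.0, uses a plain LCS-based ROUGE-L F1 over whitespace tokens, and treats both the filler list and the `format_bonus` rule as hypothetical; the actual training reward may differ on each of these points.

```python
import re

FILLERS = ("uh", "um", "basically")  # hypothetical filler list

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(pred: str, ref: str) -> float:
    """ROUGE-L F1 over lowercase whitespace tokens."""
    p, r = pred.lower().split(), ref.lower().split()
    lcs = lcs_len(p, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def reward(pred: str, ref: str) -> float:
    """Assumed reading: ROUGE-L * 5 - capped filler penalty * 3 + format bonus."""
    words = re.findall(r"[a-z']+", pred.lower())
    filler_count = sum(w in FILLERS for w in words)
    penalty = min(filler_count * 0.5, 2.0) * 3.0
    # Hypothetical format bonus: capitalized start and terminal punctuation.
    well_formed = pred[:1].isupper() and pred.rstrip()[-1:] in (".", "?", "!")
    format_bonus = 0.5 if well_formed else 0.0
    return rouge_l(pred, ref) * 5.0 - penalty + format_bonus

ref = "We need to fix the deployment pipeline."
print(reward(ref, ref))
print(reward("uh we need to fix the deployment pipeline", ref))
```

Under this reading, a perfect cleanup scores higher than the same sentence with a leftover "uh", both through the ROUGE-L term and through the filler penalty.
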

### Dataset

**157,556 train / 7,121 val rows** in [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup):

- 153,561 v22 base (prior versions: filler removal, crutch words, self-correction, false starts, grammar, dictation commands, list formatting, preserve-wording, mixed disfluencies, short utterances, long dictation)
- **3,995 new paragraph_formatting samples** (200 held out for `paragraph_val.jsonl`), generated via AWS Bedrock Claude Haiku 4.5, instructed to produce a 100–500 word raw input and a 2–5 paragraph clean output, split at natural discourse boundaries

### Hardware

1× RTX 4090, ~42 minutes for SFT + ~30 minutes for GRPO = ~72 min total

## All Variants

| Variant | Size | Use Case |
|---------|------|----------|
| **[Full precision (this)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m)** | 676 MB | Training, GPU inference |
| **[MLX 5-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit)** | 237 MB | **Recommended for Apple Silicon** |
| [MLX 4-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) | 195 MB | Smallest, slight quality trade-off |

## Limitations

- Optimized for **English** conversational/meeting-style speech
- Domain-specific jargon (medical, legal) may not be corrected without additional fine-tuning
- Paragraph emission is conditional on input structure: short single-topic inputs (the typical case) will not be split into paragraphs
- Filler-free rate on long-form content is lower than on short inputs (long content has more legitimate uses of words like "so", "okay", "right", which the eval list flags)
- Not designed for formal written text; it is trained on spoken language patterns

## License

MIT

## Links

- **Application:** [sottoasr.app](https://sottoasr.app)
- **Source:** [github.com/juanqui/sottoasr](https://github.com/juanqui/sottoasr)
- **Dataset:** [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)

model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:93579c0a12c842c568360fdea81c77d5f3b13a88ef9e5f0a0b3e2f889f6973ec
 size 708984464
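
The pointer's `size` field is in bytes, which ties it to the 676 MB figure in the README if that figure is read as mebibytes. Below is a small sketch that parses a pointer in the Git LFS v1 key-value layout and converts the size; the `parse_lfs_pointer` helper is illustrative and not part of any library.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:93579c0a12c842c568360fdea81c77d5f3b13a88ef9e5f0a0b3e2f889f6973ec
size 708984464
"""

fields = parse_lfs_pointer(pointer)
size_bytes = int(fields["size"])
size_mib = size_bytes / 2**20  # bytes to mebibytes
print(f"{size_bytes} bytes = {size_mib:.1f} MiB")
```
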