juanquivilla committed
Commit 8d24c18 · verified · 1 Parent(s): 4111016

v23+paragraphs: ROUGE-L 0.9506, Filler-Free 90.2%, paragraph rate 91.5% (0% in v22)

Files changed (2)
  1. README.md +106 -79
  2. model.safetensors +1 -1
README.md CHANGED
@@ -15,7 +15,7 @@ datasets:
15
  - juanquivilla/sotto-transcript-cleanup
16
  ---
17
 
18
- # SottoASR Transcript Cleanup: LFM2.5-350M (Full Precision)
19
 
20
  [sottoasr.app](https://sottoasr.app) · [MLX 5-bit (recommended)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) · [MLX 4-bit (smaller)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) · [Training Dataset](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)
21
 
@@ -23,20 +23,34 @@ datasets:
23
 
24
  **Full-precision bf16** fine-tune of [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) for on-device speech-to-text transcript cleanup. This is the **training artifact**; for on-device deployment on Apple Silicon, use the [5-bit MLX variant](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) instead.
25
 
26
- This model powers on-device transcript cleanup in [SottoASR](https://sottoasr.app), a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, and handles false starts and self-corrections, all locally, with zero cloud dependency.
 
 
 
 
 
 
 
 
 
 
 
 
 
27
 
28
  ## Key Specs
29
 
30
  | Property | Value |
31
  |----------|-------|
32
  | **Size** | **676 MB** |
33
- | **ROUGE-L** | **0.960** |
34
- | **Exact Match** | **69.6%** |
35
- | **Filler-Free** | **88.1%** |
36
- | **Latency** | **116 ms** average per transcript (RTX 4090) |
 
37
  | **Architecture** | Hybrid: 10 conv + 6 GQA attention (354M params) |
38
  | **Precision** | bf16 |
39
- | **Context** | 32,768 tokens (trained with 4,096 packed) |
40
 
41
  ## What It Does
42
 
@@ -46,38 +60,57 @@ Takes raw, unpunctuated ASR output and produces clean, readable text:
46
  |-----------------|------------------|
47
  | so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
48
  | the deadline is friday no monday we have until monday | The deadline is Monday. |
49
- | what we what i wanted to say is that the tests pass | What I wanted to say is that the tests pass. |
50
  | okay so the thing is basically we're running out of disk space | We're running out of disk space. |
51
  | uh yes | Yes. |
52
- | i need to uh first fix the docker builds second stabilize the flaky tests and third add better monitoring | I need to first fix the Docker builds, second stabilize the flaky tests, and third add better monitoring. |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
53
 
54
  ## Benchmark Results
55
 
56
- Evaluated on a 135-sample benchmark covering 12 transcript cleanup categories:
57
-
58
- | Category | N | ROUGE-L | Exact Match |
59
- |----------|---|---------|-------------|
60
- | crutch_words | 10 | 0.916 | 60% |
61
- | dictation_commands | 10 | 0.989 | 80% |
62
- | false_start | 10 | 0.957 | 80% |
63
- | filler_removal | 15 | 0.951 | 73% |
64
- | grammar | 10 | 0.973 | 80% |
65
- | list_formatting | 10 | 0.990 | 80% |
66
- | long_dictation | 8 | 0.934 | 13% |
67
- | misheard_words | 10 | 0.938 | 70% |
68
- | mixed | 15 | 0.946 | 60% |
69
- | preserve_wording | 12 | 0.995 | 75% |
70
- | self_correction | 15 | 0.974 | 80% |
71
- | short | 10 | 0.947 | 70% |
72
-
73
- ### vs Prompted Qwen 2B Baseline
74
-
75
- | Metric | This model (350M) | Prompted Qwen 2B | Improvement |
 
 
76
  |--------|-------------------|-------------------|-------------|
77
- | ROUGE-L | **0.960** | 0.891 | **+0.069** |
78
- | Exact Match | **69.6%** | 37% | **+32.6 pts** |
79
- | Inference | **116 ms** | 1.0s | **8x faster** |
80
- | Parameters | 354M | 2B | **5.6x smaller** |
81
 
82
  ## Usage
83
 
@@ -103,12 +136,12 @@ model = AutoModelForCausalLM.from_pretrained(
103
  )
104
  tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")
105
 
106
- text = "so uh basically the thing is we need to uh fix the deployment pipeline"
107
  prompt = f"### Input:\n{text}\n\n### Output:\n"
108
 
109
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
110
  with torch.no_grad():
111
-     out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
112
  output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
113
  if "###" in output:
114
      output = output[:output.index("###")]
@@ -116,64 +149,58 @@ print(output.strip())
116
  # → "We need to fix the deployment pipeline."
117
  ```
118
 
 
 
119
  ## Training Details
120
 
121
- | Parameter | Value |
122
- |-----------|-------|
123
- | Method | Full fine-tune (all 354M params, no LoRA) |
124
- | Dataset | 143K samples ([sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)) |
125
- | Learning rate | 2.5e-5 (cosine schedule) |
126
- | Epochs | 3 |
127
- | Batch size | 1 × 8 gradient accumulation |
128
- | Optimizer | AdamW (full precision) |
129
- | Precision | bf16 + tf32 |
130
- | Hardware | 1× RTX 4090, ~25 min |
131
-
132
- ### Training Data
133
-
134
- The 143K dataset covers diverse transcript cleanup scenarios:
135
- - **Filler removal**: uh, um, like, you know, basically
136
- - **Crutch phrase stripping**: "okay so the thing is basically..."
137
- - **Self-correction**: "X, no wait, Y" → Y
138
- - **False starts**: "What we- what I meant was..."
139
- - **Grammar & punctuation**: capitalization, periods, commas
140
- - **Dictation commands**: "new paragraph", "period"
141
- - **Short inputs**: heavy filler, minimal content (2-5 words)
142
- - **Long-form transcripts**: 500+ word dictation
143
-
144
- ## Training Progression
145
-
146
- | Version | ROUGE-L | Key Innovation |
147
- |---------|---------|----------------|
148
- | v1: LoRA SFT 15K | 0.771 | Baseline |
149
- | v3: LoRA SFT 100K | 0.863 | Data scale breakthrough |
150
- | v4: + GRPO | 0.891 | Matched prompted 2B |
151
- | v5: Full FT | 0.907 | LoRA was the bottleneck |
152
- | v7: LR 2e-5 | 0.943 | Learning rate breakthrough |
153
- | v11: + Targeted data | 0.950 | Pattern-specific training |
154
- | **v15: LR 2.5e-5** | **0.960** | **Current production model** |
155
 
156
  ## All Variants
157
 
158
- | Variant | Size | ROUGE-L | Use Case |
159
- |---------|------|---------|----------|
160
- | **[Full precision (this)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m)** | 676 MB | 0.960 | Training, GPU inference |
161
- | **[MLX 5-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit)** | 237 MB | ~0.955 | **Recommended for Apple Silicon** |
162
- | [MLX 4-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) | 195 MB | ~0.945 | Smallest, slight quality trade-off |
163
 
164
  ## Limitations
165
 
166
  - Optimized for **English** conversational/meeting-style speech
167
  - Domain-specific jargon (medical, legal) may not be corrected without additional fine-tuning
168
- - Long dictation (>500 words) has lowest exact match rate
 
169
  - Not designed for formal written text; trained on spoken language patterns
170
 
 
 
 
 
171
  ## Links
172
 
173
  - **Application:** [sottoasr.app](https://sottoasr.app)
174
  - **Source:** [github.com/juanqui/sottoasr](https://github.com/juanqui/sottoasr)
175
  - **Dataset:** [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)
176
-
177
- ## License
178
-
179
- MIT
 
15
  - juanquivilla/sotto-transcript-cleanup
16
  ---
17
 
18
+ # SottoASR Transcript Cleanup: LFM2.5-350M (Full Precision, v23 + Paragraphs)
19
 
20
  [sottoasr.app](https://sottoasr.app) · [MLX 5-bit (recommended)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) · [MLX 4-bit (smaller)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) · [Training Dataset](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)
21
 
 
23
 
24
  **Full-precision bf16** fine-tune of [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) for on-device speech-to-text transcript cleanup. This is the **training artifact**; for on-device deployment on Apple Silicon, use the [5-bit MLX variant](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) instead.
25
 
26
+ This model powers on-device transcript cleanup in [SottoASR](https://sottoasr.app), a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, and, **new in v23, restructures long dictations into paragraph-formatted prose**, all locally with zero cloud dependency.
27
+
28
+ ## What's new in v23
29
+
30
+ v23 (this model) adds **paragraph emission** for long-form dictation. The previous v18/v22 production model produced output as a single run-on paragraph regardless of input length, which made multi-topic dictations hard to read. v23 was retrained on a dataset augmented with **4,012 new `paragraph_formatting` samples** generated via Bedrock Claude Haiku 4.5, teaching the model to insert `\n\n` paragraph breaks at natural topic / time-reference / discourse-marker boundaries.
31
+
32
+ | Capability | v22 (previous prod) | **v23 (this model)** |
33
+ |---|---|---|
34
+ | Paragraph emission rate on long inputs | **0.0 %** | **91.5 %** |
35
+ | ROUGE-L on paragraph-formatted inputs | 0.9521 | **0.9792** |
36
+ | ROUGE-L on standard val set | 0.9539 | 0.9506 |
37
+ | Filler-Free rate on standard val set | 90.3 % | 90.2 % |
38
+
39
+ The 0.003 ROUGE-L regression on the main val set (0.9539 to 0.9506) sits within the seed-variance band documented across the 100+ prior fine-tuning experiments. It is offset by the +0.027 ROUGE-L improvement on paragraph-formatted inputs (0.9521 to 0.9792) and by the new structural capability.
40
 
41
  ## Key Specs
42
 
43
  | Property | Value |
44
  |----------|-------|
45
  | **Size** | **676 MB** |
46
+ | **ROUGE-L (val set, 1000 samples)** | **0.9506** |
47
+ | **Exact Match** | **63.9 %** |
48
+ | **Filler-Free** | **90.2 %** |
49
+ | **Paragraph rate (long inputs)** | **91.5 %** |
50
+ | **Latency** | **119 ms** average per transcript (RTX 4090) |
51
  | **Architecture** | Hybrid: 10 conv + 6 GQA attention (354M params) |
52
  | **Precision** | bf16 |
53
+ | **Training context** | 4,096 tokens (packed); model supports 32,768 tokens natively, 128K base |
54
 
55
  ## What It Does
56
 
 
60
  |-----------------|------------------|
61
  | so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
62
  | the deadline is friday no monday we have until monday | The deadline is Monday. |
63
+ | what we what i wanted to say is the tests pass | What I wanted to say is the tests pass. |
64
  | okay so the thing is basically we're running out of disk space | We're running out of disk space. |
65
  | uh yes | Yes. |
66
+
67
+ ### NEW in v23: Paragraph emission on long dictations
68
+
69
+ Long, multi-topic input is now restructured into paragraph-formatted prose:
70
+
71
+ **Input:**
72
+ > okay so were having some real issues with the deployment pipeline and i want to walk through whats going wrong the main problem is that the redis cache is timing out during deploys we push a new version and then for about thirty seconds the connections hang and customers see errors so i think we need to add a graceful shutdown period before we kill the old pods now separately the elasticsearch cluster has been a pain we deployed a new svelte frontend last month and it caused index corruption we need to validate the schema before pushing anything live going forward i think the answer here is to add both of these checks to our standard deployment checklist
73
+
74
+ **Output:**
75
+ > We're having some real issues with the deployment pipeline, and I want to walk through what's going wrong. The main problem is that the Redis cache is timing out during deploys. We push a new version and then for about thirty seconds the connections hang, and customers see errors.
76
+ >
77
+ > I think we need to add a graceful shutdown period before we kill the old pods. Now separately, the Elasticsearch cluster has been a pain. We deployed a new Svelte frontend last month and it caused index corruption. We need to validate the schema before pushing anything live. Going forward, I think the answer here is to add both of these checks to our standard deployment checklist.
78
+
79
+ Notice the model:
80
+ - Strips speech disfluencies ("okay so", "uh", "basically")
81
+ - Capitalizes proper nouns (Redis, Elasticsearch, Svelte)
82
+ - Adds correct punctuation
83
+ - **Inserts a paragraph break at the topic shift** ("the elasticsearch cluster has been a pain")
84
 
85
  ## Benchmark Results
86
 
87
+ ### Main val set (1000 samples, cleaned val.jsonl from training data)
88
+
89
+ | Metric | v23 (this model) | v22 baseline |
90
+ |---|---|---|
91
+ | ROUGE-L | **0.9506** | 0.9539 |
92
+ | Exact Match | **63.9 %** | 64.8 % |
93
+ | Filler-Free | **90.2 %** | 90.3 % |
94
+ | Paragraph rate | 0.0 % | 0.0 % |
95
+ | Avg latency | 119 ms | 117 ms |
96
+
97
+ ### Paragraph val set (200 paragraph_formatting samples)
98
+
99
+ | Metric | v23 (this model) | v22 baseline | Δ |
100
+ |---|---|---|---|
101
+ | ROUGE-L | **0.9792** | 0.9521 | **+0.027** |
102
+ | **Paragraph emission rate** | **91.5 %** | **0.0 %** | **+91.5 pts** |
103
+ | Exact Match | 2.5 % | 0.0 % | +2.5 pts |
104
+ | Avg latency | 1.46 s | 1.40 s | +60 ms |
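The paragraph emission rate reported above can be read as the share of model outputs that contain at least one blank-line break. The following is a minimal sketch under that reading; the function name and the exact criterion (an output counts if it contains `\n\n` after trimming) are assumptions for illustration, not taken from the evaluation script.

```python
def paragraph_emission_rate(outputs: list[str]) -> float:
    """Fraction of outputs containing at least one paragraph break.

    Assumption: an output counts as paragraph-formatted when it contains
    a blank-line break ("\n\n") once leading/trailing whitespace is removed.
    """
    if not outputs:
        return 0.0
    # Strip first so a trailing newline pair is not mistaken for a break.
    with_breaks = sum(1 for text in outputs if "\n\n" in text.strip())
    return with_breaks / len(outputs)


# Illustrative data only: 183 outputs with a break, 17 without.
samples = ["First topic.\n\nSecond topic."] * 183 + ["One paragraph only."] * 17
rate = paragraph_emission_rate(samples)
print(f"{rate:.1%}")
```

On a 200-sample set, a 91.5 % rate corresponds to 183 outputs containing a break.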
105
+
106
+ ### vs Prompted Qwen 2B Baseline (from earlier benchmarks)
107
+
108
+ | Metric | This model (354M) | Prompted Qwen 2B | Improvement |
109
  |--------|-------------------|-------------------|-------------|
110
+ | ROUGE-L | **0.9506** | 0.891 | **+0.060** |
111
+ | Exact Match | **63.9 %** | 37 % | **+27 pts** |
112
+ | Inference | **119 ms** | 1.0 s | **8.4× faster** |
113
+ | Parameters | 354M | 2B | **5.6× smaller** |
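The speed and size ratios in the comparison table follow from simple division. The sketch below reproduces them, assuming "2B" means exactly 2.0e9 parameters and "1.0 s" means 1000 ms; the helper name is hypothetical.

```python
def comparison_ratios(latency_ms: float, baseline_latency_ms: float,
                      params: float, baseline_params: float) -> tuple[float, float]:
    """Return (speedup, size_ratio) of a model relative to a baseline."""
    speedup = baseline_latency_ms / latency_ms   # how many times faster
    size_ratio = baseline_params / params        # how many times smaller
    return speedup, size_ratio


# Values from the table; the 2B baseline is treated as exactly 2.0e9 (assumption).
speedup, size_ratio = comparison_ratios(119, 1000, 354e6, 2e9)
print(f"{speedup:.1f}x faster, {size_ratio:.1f}x smaller")
# → 8.4x faster, 5.6x smaller
```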
114
 
115
  ## Usage
116
 
 
136
  )
137
  tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")
138
 
139
+ text = "so uh basically we need to fix the deployment pipeline"
140
  prompt = f"### Input:\n{text}\n\n### Output:\n"
141
 
142
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
143
  with torch.no_grad():
144
+     out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
145
  output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
146
  if "###" in output:
147
      output = output[:output.index("###")]
 
149
  # → "We need to fix the deployment pipeline."
150
  ```
151
 
152
+ For long dictations that may need paragraph formatting, raise `max_new_tokens` to 1024-2048.
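One way to apply this guidance is to derive the generation budget from the prompt length instead of hard-coding it. The sketch below is only an illustration: the helper name, the 1.25 headroom factor, and the 512 floor are assumptions, chosen because cleaned output is usually no longer than the raw input; only the 2048 upper bound comes from the range above.

```python
def pick_max_new_tokens(prompt_token_count: int,
                        floor: int = 512,
                        ceiling: int = 2048,
                        headroom: float = 1.25) -> int:
    """Choose a generation budget proportional to the prompt length.

    Hypothetical helper: `headroom` leaves slack for added punctuation and
    paragraph breaks; the result is clamped to [floor, ceiling].
    """
    budget = int(prompt_token_count * headroom)
    return max(floor, min(ceiling, budget))


print(pick_max_new_tokens(40))    # short utterance: stays at the floor
print(pick_max_new_tokens(1000))  # long dictation: scales with the input
print(pick_max_new_tokens(5000))  # very long input: clamped to the ceiling
```

The result could then be passed as `max_new_tokens` to `model.generate`, using `inputs["input_ids"].shape[1]` as the prompt token count.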
153
+
154
  ## Training Details
155
 
156
+ ### Pipeline
157
+
158
+ ```
159
+ LiquidAI/LFM2.5-350M-Base
160
+   → SFT: 157,556 rows (v22 base + 4,012 paragraph_formatting),
161
+        LR 3e-5, β2=0.95, 3 epochs, batch 1×8,
162
+        cosine schedule, 50 warmup steps, weight_decay 0.01,
163
+        bf16+tf32, packed 4,096 context, seed 42
164
+   → eval_loss 1.016 (vs v22's 1.0306, -0.014)
165
+   → GRPO: LoRA r=32, alpha=16, all linear layers,
166
+        LR 3e-6 cosine, 5K samples × 4 generations,
167
+        reward = ROUGE-L × 5.0 - filler_count × 0.5 (capped 2.0) × 3.0 + format_bonus
168
+   → final main val ROUGE-L 0.9506 / paragraph rate 91.5 %
169
+ ```
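The GRPO reward line in the pipeline above is terse, so here is one possible reading of it as code. This is a sketch, not the training script: it assumes the filler term is `filler_count × 0.5`, capped at 2.0, then weighted by 3.0, and it treats `format_bonus` as an externally supplied number because its definition is not given.

```python
def grpo_reward(rouge_l: float, filler_count: int, format_bonus: float = 0.0) -> float:
    """One interpretation of the reward formula from the pipeline summary.

    Assumption: the cap of 2.0 applies to (filler_count * 0.5) before the
    3.0 weight, so the filler penalty never exceeds 6.0.
    """
    filler_penalty = min(filler_count * 0.5, 2.0) * 3.0
    return rouge_l * 5.0 - filler_penalty + format_bonus


print(grpo_reward(1.0, 0))   # perfect overlap, no fillers
print(grpo_reward(0.9, 2))   # two leftover fillers cost 3.0
print(grpo_reward(1.0, 10))  # penalty saturates at 6.0
```

Under this reading, four or more leftover fillers all incur the same maximum penalty, so the ROUGE-L term still differentiates between heavily disfluent generations.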
170
+
171
+ ### Dataset
172
+
173
+ **157,556 train / 7,121 val rows** in [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup):
174
+
175
+ - 153,561 v22 base (prior versions: filler removal, crutch words, self-correction, false starts, grammar, dictation commands, list formatting, preserve-wording, mixed disfluencies, short utterances, long dictation)
176
+ - **3,995 new paragraph_formatting samples** (held out 200 for `paragraph_val.jsonl`), generated via AWS Bedrock Claude Haiku 4.5, instructed to produce 100–500 word raw input + 2–5 paragraph clean output, split at natural discourse boundaries
177
+
178
+ ### Hardware
179
+
180
+ 1× RTX 4090, ~42 minutes for SFT + ~30 minutes for GRPO = ~72 min total
 
 
 
 
 
 
 
 
 
181
 
182
  ## All Variants
183
 
184
+ | Variant | Size | Use Case |
185
+ |---------|------|----------|
186
+ | **[Full precision (this)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m)** | 676 MB | Training, GPU inference |
187
+ | **[MLX 5-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit)** | 237 MB | **Recommended for Apple Silicon** |
188
+ | [MLX 4-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) | 195 MB | Smallest, slight quality trade-off |
189
 
190
  ## Limitations
191
 
192
  - Optimized for **English** conversational/meeting-style speech
193
  - Domain-specific jargon (medical, legal) may not be corrected without additional fine-tuning
194
+ - Paragraph emission depends on input structure: short single-topic inputs (the typical case) are not split into paragraphs
195
+ - Filler-free rate on long-form content is lower than on short inputs (long content has more legitimate uses of words like "so", "okay", "right", which the eval list flags)
196
  - Not designed for formal written text; trained on spoken language patterns
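The limitation about the Filler-Free rate on long-form content is easier to see with a word-list check. The sketch below is illustrative only: the filler list and function name are hypothetical, and the actual eval list is not shown here. It demonstrates how a purely lexical check flags a legitimate use of "so" or "right".

```python
import re

# Hypothetical filler list; the real evaluation list is not reproduced here.
FILLERS = ("uh", "um", "so", "okay", "right")


def is_filler_free(text: str, fillers: tuple[str, ...] = FILLERS) -> bool:
    """Return True when no word in `text` appears in the filler list."""
    words = re.findall(r"[a-z']+", text.lower())
    return not any(word in fillers for word in words)


print(is_filler_free("We need to fix the deployment pipeline."))
print(is_filler_free("So that is the right call."))  # flagged despite valid usage
```

Because the check ignores context, longer outputs have more chances to contain a flagged word that is used legitimately.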
197
 
198
+ ## License
199
+
200
+ MIT
201
+
202
  ## Links
203
 
204
  - **Application:** [sottoasr.app](https://sottoasr.app)
205
  - **Source:** [github.com/juanqui/sottoasr](https://github.com/juanqui/sottoasr)
206
  - **Dataset:** [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)
 
 
 
 
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c37b567f001b8b532b9314a81ae321ac45465f273db36a29ef6ac02ef5415843
3
  size 708984464
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:93579c0a12c842c568360fdea81c77d5f3b13a88ef9e5f0a0b3e2f889f6973ec
3
  size 708984464