juanquivilla committed · Commit 1c9bfcc · verified · 1 Parent(s): 9154aef

Upload README.md with huggingface_hub

Files changed (1): README.md +79 -44
README.md CHANGED
@@ -3,6 +3,8 @@ license: other
  license_name: lfm1.0
  license_link: https://www.liquid.ai/license
  base_model: LiquidAI/LFM2.5-350M-Base
  tags:
  - speech-to-text
  - transcript-cleanup
@@ -10,30 +12,80 @@ tags:
  - sotto-asr
  - lfm2
  - liquid-ai
  library_name: transformers
  pipeline_tag: text-generation
  language:
  - en
  ---

- # SottoASR Transcript Cleanup — LFM2.5-350M Fine-Tuned

  ## Overview

- A fine-tuned [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) model for cleaning speech-to-text transcripts. Removes fillers, crutch words, self-corrections, false starts, and grammar errors while preserving the speaker's intent.

  ## Performance

- | Metric | This Model | Prompted Qwen3.5-2B |
- |--------|-----------|---------------------|
- | **ROUGE-L** | **0.868** | 0.891 |
- | **Self-correction** | **0.814** | 0.742 |
- | **Inference speed** | **0.12s** | 1.0s |
- | **Model size** | **350M** | 2B |
- | **Preserve wording** | **0.984** | 0.992 |
- | **Filler removal** | **0.954** | 0.974 |

- **9x faster** than the prompted 2B model while achieving competitive quality. Exceeds the 2B model on self-correction (0.814 vs 0.742).

  ## Usage

@@ -43,48 +95,31 @@ import torch

  model = AutoModelForCausalLM.from_pretrained(
      "juanquivilla/sotto-cleanup-lfm25-350m",
-     dtype=torch.bfloat16,
-     device_map="auto",
-     trust_remote_code=True,
  )
  tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

- raw_transcript = "uh the server is uh running low on memory"
- prompt = f"### Input:\n{raw_transcript}\n\n### Output:\n"
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-
  with torch.no_grad():
-     outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
- cleaned = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
- print(cleaned) # "The server is running low on memory."
  ```

- ## Training
-
- - **Base model:** LiquidAI/LFM2.5-350M-Base (hybrid conv+attention, 32K context)
- - **Method:** LoRA SFT (rank 64) via Unsloth
- - **Dataset:** [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup) (124K pairs)
- - **Epochs:** 3 with early stopping
- - **Hardware:** RTX 4090 (24GB)
- - **Training time:** ~9 minutes

- ## Categories Handled

- - Filler removal (uh, um, uhm, er)
- - Crutch word removal (basically, you know, I mean)
- - Self-correction (speaker changes mind mid-sentence)
- - False starts (abandoned sentence beginnings)
- - Grammar correction (gonna → going to)
- - Misheard word correction (post gress → Postgres)
- - Dictation commands (period → ., comma → ,)
- - List formatting (first/second/third → numbered list)
-
- ## Limitations

- - Grammar correction is the weakest category (0.788 ROUGE-L)
- - Long dictation (200+ words) can occasionally truncate
- - Domain-specific jargon correction depends on training data coverage

- ## Part of SottoASR

- [SottoASR](https://github.com/juanqui/sotto): local, privacy-first speech-to-text for macOS. All processing happens on-device.
 
  license_name: lfm1.0
  license_link: https://www.liquid.ai/license
  base_model: LiquidAI/LFM2.5-350M-Base
+ datasets:
+ - juanquivilla/sotto-transcript-cleanup
  tags:
  - speech-to-text
  - transcript-cleanup

  - sotto-asr
  - lfm2
  - liquid-ai
+ - text2text-generation
  library_name: transformers
  pipeline_tag: text-generation
  language:
  - en
  ---

+ # SottoASR Transcript Cleanup — LFM2.5-350M (bf16)
+
+ <p align="center">
+ <a href="https://sotto.app">sotto.app</a> ·
+ <a href="https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit">MLX 5-bit (recommended for deployment)</a> ·
+ <a href="https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit">MLX 4-bit</a> ·
+ <a href="https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup">Training Dataset</a>
+ </p>

  ## Overview

+ This is the **full-precision (bf16) fine-tuned** [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) model for cleaning speech-to-text transcripts. It is the small language model (SLM) that powers on-device transcript cleanup in [**SottoASR**](https://sotto.app), a local, privacy-first speech-to-text application for macOS.
+
+ **For on-device deployment, use the [MLX 5-bit quantized version](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) (233MB, ROUGE-L 0.926 vs. 0.931 for bf16, a relative drop of about 0.5%).**
+
+ ## What It Does
+
+ Takes raw, unpunctuated ASR output and produces clean, properly formatted text:
+
+ | Input (raw ASR) | Output (cleaned) |
+ |---|---|
+ | `uh the server is uh running low on memory` | The server is running low on memory. |
+ | `use redis wait no memcached is better` | Use Memcached. |
+ | `so basically the the api is um throttling our requests` | The API is throttling our requests. |
+ | `lets go ahead and really focus on the performance issue` | Let's go ahead and really focus on the performance issue. |
+ | `send the email to john period` | Send the email to John. |
+ | `me and the team is working on fixing it` | The team and I are working on fixing it. |
+
+ Handles: filler removal, crutch word removal, self-corrections, false starts, grammar fixes, misheard word correction, dictation commands (`period` → `.`, `comma` → `,`, `slash` → `/`), list formatting, and wording preservation.

  ## Performance

+ | Metric | This Model (350M) | Prompted Qwen3.5-2B | Improvement |
+ |--------|-------------------|---------------------|-------------|
+ | **ROUGE-L** | **0.931** | 0.891 | **+4.5% relative** |
+ | **Exact Match** | **56%** | 37% | **+51% relative** |
+ | **Self-Correction** | **0.869** | 0.742 | **+17.1% relative** |
+ | **Zero-Filler Rate** | **90%** | 82% | **+9.8% relative** |
+ | **Inference** | **0.12s** | 1.0s | **8.3x faster** |
+ | **Model Size** | **354M params** | 2B params | **5.7x smaller** |
+
+ ### Per-Category Scores
+
+ | Category | ROUGE-L | Description |
+ |----------|---------|-------------|
+ | preserve_wording | 0.987 | Clean input passes through unchanged |
+ | list_formatting | 0.972 | Spoken lists → numbered format |
+ | filler_removal | 0.955 | uh, um, uhm, er, ah |
+ | short | 0.940 | Brief utterances (2-10 words) |
+ | false_start | 0.926 | Stutters and restarts |
+ | dictation_commands | 0.971 | period→., comma→,, slash→/ |
+ | mixed | 0.928 | Multiple overlapping disfluencies |
+ | long_dictation | 0.918 | 100+ word passages |
+ | misheard_words | 0.913 | ASR errors (post gress→Postgres) |
+ | grammar | 0.906 | gonna→going to, me and him→he and I |
+ | crutch_words | 0.892 | basically, you know, I mean |
+ | self_correction | 0.869 | Speaker changes mind mid-sentence |
+
+ ## Training

+ - **Base model:** [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) (hybrid convolution + attention, 32K context)
+ - **Dataset:** [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup) (124K synthetic pairs)
+ - **Method:** Two-stage full fine-tuning
+   1. **Stage 1:** Full fine-tuning on the 124K dataset (LR 1e-5, 3 epochs, ~22 min on RTX 4090)
+   2. **Stage 2:** Concentrated hard-pattern fine-tuning on 14K examples (LR 2e-6, 1 epoch, 27 seconds)
+ - **Data sources:** Qwen3.5-35B (95K), Grok 4.20 (29K), hand-crafted (235)
+ - **Key finding:** For this model, full fine-tuning outperformed LoRA by about 7% ROUGE-L on the same data

  ## Usage

 

  model = AutoModelForCausalLM.from_pretrained(
      "juanquivilla/sotto-cleanup-lfm25-350m",
+     dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
  )
  tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

+ raw = "uh the server is uh running low on memory"
+ prompt = f"### Input:\n{raw}\n\n### Output:\n"
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
  with torch.no_grad():
+     out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
+ print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+ # "The server is running low on memory."
  ```
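
The prompt template used in the snippet above can be factored into two small helpers so that callers never hand-build the `### Input:` / `### Output:` markers. This is only a sketch: `build_prompt` and `extract_cleaned` are hypothetical names, not part of the model repository, and cutting the output at a following `###` marker is an assumption about possible run-on generations rather than documented model behavior.

```python
# Template copied from the usage snippet; helper names are illustrative only.
PROMPT_TEMPLATE = "### Input:\n{raw}\n\n### Output:\n"


def build_prompt(raw: str) -> str:
    """Wrap a raw ASR transcript in the Input/Output template."""
    return PROMPT_TEMPLATE.format(raw=raw.strip())


def extract_cleaned(generated: str) -> str:
    """Post-process decoded text (prompt tokens already sliced off).

    Assumption: if the model keeps going and opens a new "###" section,
    everything from that marker onward is discarded.
    """
    return generated.split("\n###", 1)[0].strip()


if __name__ == "__main__":
    print(repr(build_prompt("uh the server is uh running low on memory")))
    print(extract_cleaned("The server is running low on memory.\n\n### Input:\nnext"))
```

The string returned by `build_prompt` can be passed to `tokenizer(...)` in place of the inline f-string, and `extract_cleaned` can be applied to the result of `tokenizer.decode(...)`.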

+ ## Quantized Variants

+ | Variant | Size | ROUGE-L | Filler-Free | Link |
+ |---------|------|---------|-------------|------|
+ | **bf16 (this model)** | 676MB | 0.931 | 90% | - |
+ | **MLX 5-bit (recommended)** | 233MB | 0.926 | 99% | [sotto-cleanup-lfm25-350m-mlx-5bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) |
+ | MLX 4-bit | 190MB | 0.897 | 99% | [sotto-cleanup-lfm25-350m-mlx-4bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) |
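
The trade-off in the table above is easier to compare as relative figures. The sketch below uses only the size and ROUGE-L values listed in the table; the dictionary layout and function names are illustrative, not part of any published tooling.

```python
# Size (MB) and ROUGE-L values copied from the Quantized Variants table.
VARIANTS = {
    "bf16": {"size_mb": 676, "rouge_l": 0.931},
    "mlx-5bit": {"size_mb": 233, "rouge_l": 0.926},
    "mlx-4bit": {"size_mb": 190, "rouge_l": 0.897},
}
BASELINE = VARIANTS["bf16"]


def rouge_loss_pct(name: str) -> float:
    """Relative ROUGE-L drop versus bf16, in percent."""
    drop = BASELINE["rouge_l"] - VARIANTS[name]["rouge_l"]
    return drop / BASELINE["rouge_l"] * 100


def size_ratio(name: str) -> float:
    """How many times smaller the variant is than bf16."""
    return BASELINE["size_mb"] / VARIANTS[name]["size_mb"]


if __name__ == "__main__":
    for name in ("mlx-5bit", "mlx-4bit"):
        print(f"{name}: {rouge_loss_pct(name):.2f}% ROUGE-L loss, "
              f"{size_ratio(name):.1f}x smaller")
    # mlx-5bit: 0.54% ROUGE-L loss, 2.9x smaller
    # mlx-4bit: 3.65% ROUGE-L loss, 3.6x smaller
```

On these numbers, the 5-bit variant gives up roughly half a percent of ROUGE-L for a model about a third the size, while the 4-bit variant saves a further 43MB at a noticeably larger quality cost.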

+ ## Part of SottoASR

+ [**SottoASR**](https://sotto.app) is a local, privacy-first speech-to-text application for macOS. Press a hotkey, speak, and clean text appears at your cursor. All processing happens on-device — no audio or text is ever sent to a cloud service. This model powers the transcript cleanup step.
 
 

+ ## License

+ This model inherits the [LFM 1.0 license](https://www.liquid.ai/license) from the base model.