Commit eda4f6c (verified) by juanquivilla · 1 parent: 83e83fb

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +90 -0
README.md ADDED
@@ -0,0 +1,90 @@
---
license: other
license_name: lfm1.0
license_link: https://www.liquid.ai/license
base_model: LiquidAI/LFM2.5-350M-Base
tags:
- speech-to-text
- transcript-cleanup
- disfluency-correction
- sotto-asr
- lfm2
- liquid-ai
library_name: transformers
pipeline_tag: text-generation
language:
- en
---

# SottoASR Transcript Cleanup: LFM2.5-350M Fine-Tuned

## Overview

A fine-tuned [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) model for cleaning speech-to-text transcripts. It removes fillers, crutch words, self-corrections, and false starts, and fixes grammar errors, while preserving the speaker's intent.

## Performance

| Metric | This Model | Prompted Qwen3.5-2B |
|--------|-----------|---------------------|
| **ROUGE-L** | **0.868** | 0.891 |
| **Self-correction** | **0.814** | 0.742 |
| **Inference time** | **0.12 s** | 1.0 s |
| **Model size** | **350M** | 2B |
| **Preserve wording** | **0.984** | 0.992 |
| **Filler removal** | **0.954** | 0.974 |

**Roughly 8x faster** than the prompted 2B model by the rounded latencies above (0.12 s vs 1.0 s), at about one-sixth the parameter count. Quality is competitive: the model trails the 2B model slightly on ROUGE-L, wording preservation, and filler removal, and exceeds it on self-correction (0.814 vs 0.742).
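
For readers unfamiliar with the headline metric, ROUGE-L scores a candidate against a reference by the length of their longest common subsequence (LCS). The sketch below is a minimal illustration, not the evaluation script behind the table: the lowercase whitespace tokenization and the plain F1 weighting are assumptions, so its numbers will not match the table exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over lowercase whitespace tokens (assumed tokenization)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


raw = "uh the server is uh running low on memory"
reference = "The server is running low on memory."
print(round(rouge_l_f1(reference, reference), 3))  # 1.0
print(round(rouge_l_f1(raw, reference), 3))        # 0.75
```

The uncleaned transcript already scores 0.75 here because most of its words survive cleanup, which is why small differences in ROUGE-L between models can still be meaningful.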

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the fine-tuned model in bfloat16 and place it on the available device.
# Note: recent transformers releases accept `dtype=`; older ones use `torch_dtype=`.
model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

# The model expects the raw transcript wrapped in an Input/Output template.
raw_transcript = "uh the server is uh running low on memory"
prompt = f"### Input:\n{raw_transcript}\n\n### Output:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding (do_sample=False) keeps the output deterministic.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Drop the prompt tokens so that only the generated cleanup is decoded.
cleaned = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(cleaned)  # "The server is running low on memory."
```
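
When cleaning more than one transcript, the prompt construction and prompt stripping from the snippet above can be wrapped in small helpers. This is a sketch only: `build_prompt` and `clean_transcript` are hypothetical names that are not part of the model repository, and stripping surrounding whitespace from the input and output is an added assumption.

```python
PROMPT_TEMPLATE = "### Input:\n{transcript}\n\n### Output:\n"


def build_prompt(raw_transcript: str) -> str:
    """Wrap a raw transcript in the Input/Output template used in the usage snippet."""
    return PROMPT_TEMPLATE.format(transcript=raw_transcript.strip())


def clean_transcript(model, tokenizer, raw_transcript: str, max_new_tokens: int = 512) -> str:
    """Greedy-decode a cleanup and return only the newly generated text.

    `model` and `tokenizer` are the objects loaded in the usage snippet.
    """
    inputs = tokenizer(build_prompt(raw_transcript), return_tensors="pt").to(model.device)
    # generate() runs without gradient tracking, so no explicit torch.no_grad() here.
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()
```

With the model and tokenizer loaded as above, `clean_transcript(model, tokenizer, "uh the server is uh running low on memory")` should behave like the original snippet.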

## Training

- **Base model:** LiquidAI/LFM2.5-350M-Base (hybrid conv+attention, 32K context)
- **Method:** LoRA SFT (rank 64) via Unsloth
- **Dataset:** [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup) (124K pairs)
- **Epochs:** 3, with early stopping
- **Hardware:** RTX 4090 (24 GB)
- **Training time:** ~9 minutes

## Categories Handled

- Filler removal (uh, um, uhm, er)
- Crutch word removal (basically, you know, I mean)
- Self-correction (speaker changes mind mid-sentence)
- False starts (abandoned sentence beginnings)
- Grammar correction (gonna → going to)
- Misheard word correction (post gress → Postgres)
- Dictation commands (period → `.`, comma → `,`)
- List formatting (first/second/third → numbered list)
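
To make the first category concrete, here is a naive rule-based baseline for filler removal. It is not how the model works and is not taken from the project; it only shows what a regex can and cannot do. The filler list comes from the bullet above, and `strip_fillers` is a hypothetical helper name.

```python
import re

# Filler tokens from the category list, matched as whole words only,
# together with an optional trailing comma or period.
FILLERS = re.compile(r"\b(?:uh|um|uhm|er)\b[,.]?\s*", flags=re.IGNORECASE)


def strip_fillers(text: str) -> str:
    """Remove standalone filler words and collapse leftover whitespace."""
    without = FILLERS.sub("", text)
    return re.sub(r"\s+", " ", without).strip()


print(strip_fillers("uh the server is uh running low on memory"))
# the server is running low on memory
```

Unlike the model's output in the usage example, the baseline leaves capitalization and punctuation untouched, and it has no way to handle context-dependent categories such as self-corrections or false starts.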

## Limitations

- Grammar correction is the weakest category (0.788 ROUGE-L)
- Long dictations (200+ words) are occasionally truncated in the output
- Domain-specific jargon correction depends on training data coverage
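
One possible workaround for the long-dictation limitation is to split a transcript into shorter chunks and clean each chunk separately. The sketch below is an assumption rather than a documented feature of the model: `chunk_words` is a hypothetical helper, the 150-word budget is an arbitrary value below the 200-word mark, and whether chunking actually avoids truncation has not been verified here.

```python
def chunk_words(transcript: str, max_words: int = 150) -> list[str]:
    """Split a transcript into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = transcript.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


long_transcript = " ".join(f"word{i}" for i in range(400))
chunks = chunk_words(long_transcript)
print([len(c.split()) for c in chunks])  # [150, 150, 100]
```

Fixed-size splitting can cut a sentence or a self-correction in half, so cleanup quality near chunk boundaries may suffer; splitting at pauses reported by the speech recognizer, where available, is likely a better boundary.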

## Part of SottoASR

[SottoASR](https://github.com/juanqui/sotto): local, privacy-first speech-to-text for macOS. All processing happens on-device.