Upload README.md with huggingface_hub

README.md +79 -44

@@ -3,6 +3,8 @@ license: other
 license_name: lfm1.0
 license_link: https://www.liquid.ai/license
 base_model: LiquidAI/LFM2.5-350M-Base
 tags:
   - speech-to-text
   - transcript-cleanup
@@ -10,30 +12,80 @@ tags:
   - sotto-asr
   - lfm2
   - liquid-ai
 library_name: transformers
 pipeline_tag: text-generation
 language:
   - en
 ---
-# SottoASR Transcript Cleanup — LFM2.5-350M Fine-Tuned
 ## Overview
-A fine-tuned [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) model for cleaning speech-to-text transcripts. Removes fillers, crutch words, self-corrections, false starts, and grammar errors while preserving the speaker's intent.
 ## Performance
-| Metric | This Model | Prompted Qwen3.5-2B |
-|--------|-----------|---------------------|
-| **ROUGE-L** | **0.868** | 0.891 |
-| **Self-correction** | **0.814** | 0.742 |
-| **Inference speed** | **0.12s** | 1.0s |
-| **Model size** | **350M** | 2B |
-| **Preserve wording** | **0.984** | 0.992 |
-| **Filler removal** | **0.954** | 0.974 |
-**9x faster** than the prompted 2B model while achieving competitive quality. Exceeds the 2B model on self-correction (0.814 vs 0.742).
 ## Usage
@@ -43,48 +95,31 @@ import torch
 model = AutoModelForCausalLM.from_pretrained(
     "juanquivilla/sotto-cleanup-lfm25-350m",
-    dtype=torch.bfloat16,
-    device_map="auto",
-    trust_remote_code=True,
 )
 tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")
-raw_transcript = "uh the server is uh running low on memory"
-prompt = f"### Input:\n{raw_transcript}\n\n### Output:\n"
 inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
 with torch.no_grad():
-    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
-cleaned = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
-print(cleaned)  # "The server is running low on memory."
 ```
-## Training
-- **Base model:** LiquidAI/LFM2.5-350M-Base (hybrid conv+attention, 32K context)
-- **Method:** LoRA SFT (rank 64) via Unsloth
-- **Dataset:** [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup) — 124K pairs
-- **Epochs:** 3 with early stopping
-- **Hardware:** RTX 4090 (24GB)
-- **Training time:** ~9 minutes
-## Categories Handled
-- Filler removal (uh, um, uhm, er)
-- Crutch word removal (basically, you know, I mean)
-- Self-correction (speaker changes mind mid-sentence)
-- False starts (abandoned sentence beginnings)
-- Grammar correction (gonna → going to)
-- Misheard word correction (post gress → Postgres)
-- Dictation commands (period → ., comma → ,)
-- List formatting (first/second/third → numbered list)
-## Limitations
-- Grammar correction is the weakest category (0.788 ROUGE-L)
-- Long dictation (200+ words) can occasionally truncate
-- Domain-specific jargon correction depends on training data coverage
-## Part of SottoASR
-[SottoASR](https://github.com/juanqui/sotto) — local, privacy-first speech-to-text for macOS. All processing happens on-device.

 license_name: lfm1.0
 license_link: https://www.liquid.ai/license
 base_model: LiquidAI/LFM2.5-350M-Base
+datasets:
+  - juanquivilla/sotto-transcript-cleanup
 tags:
   - speech-to-text
   - transcript-cleanup
   - sotto-asr
   - lfm2
   - liquid-ai
+  - text2text-generation
 library_name: transformers
 pipeline_tag: text-generation
 language:
   - en
 ---
+# SottoASR Transcript Cleanup — LFM2.5-350M (bf16)
+<p align="center">
+  <a href="https://sotto.app">sotto.app</a> ·
+  <a href="https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit">MLX 5-bit (recommended for deployment)</a> ·
+  <a href="https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit">MLX 4-bit</a> ·
+  <a href="https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup">Training Dataset</a>
+</p>
 ## Overview
+This is the **unquantized (bf16) fine-tune** of [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) for cleaning speech-to-text transcripts. It is the small language model (SLM) that powers on-device transcript cleanup in [**SottoASR**](https://sotto.app), a local, privacy-first speech-to-text application for macOS.
+**For on-device deployment, use the [MLX 5-bit quantized version](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) (233MB, ROUGE-L 0.926 vs. 0.931 for bf16).**
+## What It Does
+Takes raw, unpunctuated ASR output and produces clean, properly formatted text:
+| Input (raw ASR) | Output (cleaned) |
+|---|---|
+| `uh the server is uh running low on memory` | The server is running low on memory. |
+| `use redis wait no memcached is better` | Use Memcached. |
+| `so basically the the api is um throttling our requests` | The API is throttling our requests. |
+| `lets go ahead and really focus on the performance issue` | Let's go ahead and really focus on the performance issue. |
+| `send the email to john period` | Send the email to John. |
+| `me and the team is working on fixing it` | The team and I are working on fixing it. |
+Handles: filler removal, crutch word removal, self-corrections, false starts, grammar fixes, misheard word correction, dictation commands (period→., comma→,, slash→/), list formatting, and wording preservation.
 ## Performance
+| Metric | This Model (350M) | Prompted Qwen3.5-2B | Improvement |
+|--------|-------------------|---------------------|-------------|
+| **ROUGE-L** | **0.931** | 0.891 | **+4.5%** |
+| **Exact Match** | **56%** | 37% | **+51% relative** |
+| **Self-Correction** | **0.869** | 0.742 | **+17.1%** |
+| **Zero-Filler Rate** | **90%** | 82% | **+9.8% relative** |
+| **Inference** | **0.12s** | 1.0s | **8.3x faster** |
+| **Model Size** | **354M params** | 2B params | **5.7x smaller** |
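
The Improvement column for the quality and latency rows can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes each percentage is a relative gain, (this model - baseline) / baseline, and that the "x faster" figure is a plain ratio of latencies; the helper names are illustrative and not part of the model card.

```python
def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent (assumed definition)."""
    return (ours - baseline) / baseline * 100


# Values copied from the Performance table (this model vs. prompted Qwen3.5-2B).
rouge_l_gain = relative_gain(0.931, 0.891)
exact_match_gain = relative_gain(56, 37)
self_correction_gain = relative_gain(0.869, 0.742)
zero_filler_gain = relative_gain(90, 82)

# Latency is reported as a ratio rather than a percentage.
speedup = 1.0 / 0.12

print(f"ROUGE-L: +{rouge_l_gain:.1f}%")
print(f"Exact Match: +{exact_match_gain:.0f}% relative")
print(f"Self-Correction: +{self_correction_gain:.1f}%")
print(f"Zero-Filler Rate: +{zero_filler_gain:.1f}% relative")
print(f"Inference: {speedup:.1f}x faster")
```

Under these assumptions the rounded results match the table rows for ROUGE-L, Exact Match, Self-Correction, Zero-Filler Rate, and Inference.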
+### Per-Category Scores
+| Category | ROUGE-L | Description |
+|----------|---------|-------------|
+| preserve_wording | 0.987 | Clean input passes through unchanged |
+| list_formatting | 0.972 | Spoken lists → numbered format |
+| dictation_commands | 0.971 | period→., comma→,, slash→/ |
+| filler_removal | 0.955 | uh, um, uhm, er, ah |
+| short | 0.940 | Brief utterances (2-10 words) |
+| mixed | 0.928 | Multiple overlapping disfluencies |
+| false_start | 0.926 | Stutters and restarts |
+| long_dictation | 0.918 | 100+ word passages |
+| misheard_words | 0.913 | ASR errors (post gress→Postgres) |
+| grammar | 0.906 | gonna→going to, me and him→he and I |
+| crutch_words | 0.892 | basically, you know, I mean |
+| self_correction | 0.869 | Speaker changes mind mid-sentence |
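
The README does not say which ROUGE-L implementation or tokenization produced these scores. As a rough illustration of what the metric measures, here is a minimal sketch of ROUGE-L as an F1 score over the longest common subsequence (LCS) of tokens. Lowercasing and whitespace tokenization are assumptions made for this example, and the function names are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """LCS-based F1 between a reference and a candidate string.

    Assumption: lowercase + whitespace tokenization, which may differ
    from the scorer used for the numbers in the table.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "The server is running low on memory."
raw_asr = "uh the server is uh running low on memory"
print(rouge_l_f1(reference, reference))  # identical strings score 1.0
print(rouge_l_f1(reference, raw_asr))    # fillers and missing period lower the score
```

With this tokenization the uncleaned transcript scores 0.75 against the reference, because the two fillers reduce precision and `memory` does not match `memory.`.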
+## Training
+- **Base model:** [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) (hybrid convolution + attention, 32K context)
+- **Dataset:** [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup) — 124K synthetic pairs
+- **Method:** Two-stage full fine-tuning
+  1. **Stage 1:** Full FT on 124K dataset (LR 1e-5, 3 epochs, ~22 min on RTX 4090)
+  2. **Stage 2:** Concentrated hard-pattern FT on 14K examples (LR 2e-6, 1 epoch, 27 seconds)
+- **Data sources:** Qwen3.5-35B (95K), Grok 4.20 (29K), hand-crafted (235)
+- **Key finding:** For this model, full fine-tuning outperformed LoRA (+7% ROUGE-L on the same data)
 ## Usage
 model = AutoModelForCausalLM.from_pretrained(
     "juanquivilla/sotto-cleanup-lfm25-350m",
+    dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
 )
 tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")
+raw = "uh the server is uh running low on memory"
+prompt = f"### Input:\n{raw}\n\n### Output:\n"
 inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
 with torch.no_grad():
+    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
+print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+# → "The server is running low on memory."
 ```
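
The usage snippet relies on two conventions: the `### Input:` / `### Output:` prompt template, and slicing off the prompt tokens before decoding. A small sketch of both as reusable helpers follows. The names `PROMPT_TEMPLATE`, `build_prompt`, and `strip_prompt` are hypothetical and do not come from the repository.

```python
# Template copied from the usage snippet; the constant name is illustrative.
PROMPT_TEMPLATE = "### Input:\n{raw}\n\n### Output:\n"


def build_prompt(raw_transcript: str) -> str:
    """Wrap a raw transcript in the Input/Output template."""
    return PROMPT_TEMPLATE.format(raw=raw_transcript)


def strip_prompt(generated_ids: list[int], prompt_length: int) -> list[int]:
    """Drop the echoed prompt tokens so only newly generated ones remain."""
    return generated_ids[prompt_length:]


prompt = build_prompt("uh the server is uh running low on memory")
print(repr(prompt))
```

With the objects from the snippet above, the decode step could then read `tokenizer.decode(strip_prompt(out[0].tolist(), inputs["input_ids"].shape[1]), skip_special_tokens=True)`.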
+## Quantized Variants
+| Variant | Size | ROUGE-L | Zero-Filler Rate | Link |
+|---------|------|---------|-------------|------|
+| **bf16 (this model)** | 676MB | 0.931 | 90% | — |
+| **MLX 5-bit (recommended)** | 233MB | 0.926 | 99% | [sotto-cleanup-lfm25-350m-mlx-5bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) |
+| MLX 4-bit | 190MB | 0.897 | 99% | [sotto-cleanup-lfm25-350m-mlx-4bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) |
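
To make the size/quality trade-off in this table concrete, the sketch below computes the size reduction and relative ROUGE-L drop of each quantized variant against bf16. The figures are copied from the table; the dictionary layout and the `tradeoff` helper are illustrative assumptions.

```python
# Sizes (MB) and ROUGE-L scores copied from the Quantized Variants table.
variants = {
    "bf16": {"size_mb": 676, "rouge_l": 0.931},
    "mlx-5bit": {"size_mb": 233, "rouge_l": 0.926},
    "mlx-4bit": {"size_mb": 190, "rouge_l": 0.897},
}


def tradeoff(name: str, baseline: str = "bf16") -> tuple[float, float]:
    """Return (size reduction %, relative ROUGE-L drop %) versus the baseline."""
    base = variants[baseline]
    variant = variants[name]
    size_reduction = (1 - variant["size_mb"] / base["size_mb"]) * 100
    rouge_drop = (base["rouge_l"] - variant["rouge_l"]) / base["rouge_l"] * 100
    return size_reduction, rouge_drop


for name in ("mlx-5bit", "mlx-4bit"):
    size_reduction, rouge_drop = tradeoff(name)
    print(f"{name}: {size_reduction:.1f}% smaller, {rouge_drop:.2f}% lower ROUGE-L")
```

On these numbers the 5-bit variant is about 65.5% smaller for a relative ROUGE-L drop of roughly 0.54%, while the 4-bit variant is about 71.9% smaller for a drop of roughly 3.65%.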
+## Part of SottoASR
+[**SottoASR**](https://sotto.app) is a local, privacy-first speech-to-text application for macOS. Press a hotkey, speak, and clean text appears at your cursor. All processing happens on-device — no audio or text is ever sent to a cloud service. This model powers the transcript cleanup step.
+## License
+This model inherits the [LFM 1.0 license](https://www.liquid.ai/license) from the base model.