Upload README.md with huggingface_hub

README.md +90 -0

	@@ -0,0 +1,90 @@

+---
+license: other
+license_name: lfm1.0
+license_link: https://www.liquid.ai/license
+base_model: LiquidAI/LFM2.5-350M-Base
+tags:
+  - speech-to-text
+  - transcript-cleanup
+  - disfluency-correction
+  - sotto-asr
+  - lfm2
+  - liquid-ai
+library_name: transformers
+pipeline_tag: text-generation
+language:
+  - en
+---
+# SottoASR Transcript Cleanup — LFM2.5-350M Fine-Tuned
+## Overview
+A fine-tuned [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) model for cleaning speech-to-text transcripts. It removes fillers, crutch words, and false starts, resolves self-corrections, and fixes grammar errors while preserving the speaker's intent.
+## Performance
+| Metric | This Model | Prompted Qwen3.5-2B |
+|--------|-----------|---------------------|
+| **ROUGE-L** | 0.868 | **0.891** |
+| **Self-correction** | **0.814** | 0.742 |
+| **Inference time** | **0.12s** | 1.0s |
+| **Model size** | **350M** | 2B |
+| **Preserve wording** | 0.984 | **0.992** |
+| **Filler removal** | 0.954 | **0.974** |
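+**About 8x faster** than the prompted 2B model (0.12s vs 1.0s, roughly 8.3x) while staying competitive on quality. It trails the 2B model slightly on ROUGE-L (0.868 vs 0.891), preserve wording, and filler removal, and exceeds it on self-correction (0.814 vs 0.742).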
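For readers who want to score their own outputs, ROUGE-L can be sketched as an F1 over the longest common subsequence of tokens. The model card does not say which ROUGE-L implementation, tokenizer, or F-measure weighting produced the numbers above, so the lowercased whitespace tokenization and balanced F1 below are assumptions, and the function names are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Balanced ROUGE-L F1 on lowercased, whitespace-split tokens (assumed setup)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the server is running low on memory"
uncleaned = "uh the server is uh running low on memory"
print(rouge_l_f1(reference, reference))  # identical strings score 1.0
print(rouge_l_f1(reference, uncleaned))  # leftover fillers lower precision
```

Scores from this sketch may not match the table exactly, since punctuation handling and tokenization differ between ROUGE implementations.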
+## Usage
+```python
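+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+# Load the fine-tuned model in bfloat16; device_map="auto" needs `accelerate`.
+# `dtype` is the current argument name; older transformers releases use `torch_dtype`.
+model = AutoModelForCausalLM.from_pretrained(
+    "juanquivilla/sotto-cleanup-lfm25-350m",
+    dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True,
+)
+tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")
+
+# Wrap the raw transcript in the "### Input / ### Output" prompt template.
+raw_transcript = "uh the server is uh running low on memory"
+prompt = f"### Input:\n{raw_transcript}\n\n### Output:\n"
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+# Greedy decoding (do_sample=False) keeps the output deterministic.
+with torch.no_grad():
+    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
+
+# Decode only the newly generated tokens, skipping the prompt.
+cleaned = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+print(cleaned)  # "The server is running low on memory."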
+```
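The prompt template in the snippet above can be kept in one place so that every call uses the same markers. This is a minimal sketch: `build_prompt` and `extract_output` are hypothetical helper names, not part of the model repository, and `extract_output` is only needed if you decode the full sequence instead of slicing off the prompt tokens as the usage code does.

```python
# Template copied from the usage snippet; the helper names are illustrative.
PROMPT_TEMPLATE = "### Input:\n{transcript}\n\n### Output:\n"
OUTPUT_MARKER = "### Output:\n"


def build_prompt(raw_transcript: str) -> str:
    """Wrap a raw transcript in the Input/Output prompt template."""
    return PROMPT_TEMPLATE.format(transcript=raw_transcript)


def extract_output(decoded_text: str) -> str:
    """Return the text after the first output marker, trimmed of whitespace."""
    return decoded_text.split(OUTPUT_MARKER, 1)[-1].strip()


prompt = build_prompt("uh the server is uh running low on memory")
print(repr(prompt))
```

Passing `build_prompt(raw_transcript)` to the tokenizer is equivalent to the inline f-string in the usage example.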
+## Training
+- **Base model:** LiquidAI/LFM2.5-350M-Base (hybrid conv+attention, 32K context)
+- **Method:** LoRA SFT (rank 64) via Unsloth
+- **Dataset:** [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup) — 124K pairs
+- **Epochs:** 3 with early stopping
+- **Hardware:** RTX 4090 (24GB)
+- **Training time:** ~9 minutes
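To give a sense of why rank-64 LoRA is cheap to train, the sketch below counts the parameters LoRA adds to a single linear layer: a rank-`r` adapter on a `d_out x d_in` weight introduces two small matrices with `r * (d_in + d_out)` parameters in total. The 1024 x 1024 layer shape is a hypothetical example, not LFM2.5's actual dimensions, and the card does not state which modules were adapted.

```python
def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by LoRA: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def dense_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight (bias ignored)."""
    return d_in * d_out


# Hypothetical layer shape, for illustration only.
d_in, d_out, rank = 1024, 1024, 64
added = lora_added_params(d_in, d_out, rank)
frozen = dense_params(d_in, d_out)
print(f"LoRA adds {added} trainable params vs {frozen} frozen "
      f"({added / frozen:.1%} of the layer)")
```

Only the adapter matrices are trained while the base weights stay frozen, which keeps optimizer state and gradient memory small.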
+## Categories Handled
+- Filler removal (uh, um, uhm, er)
+- Crutch word removal (basically, you know, I mean)
+- Self-correction (speaker changes mind mid-sentence)
+- False starts (abandoned sentence beginnings)
+- Grammar correction (gonna → going to)
+- Misheard word correction (post gress → Postgres)
+- Dictation commands (period → ., comma → ,)
+- List formatting (first/second/third → numbered list)
+## Limitations
+- Grammar correction is the weakest category (0.788 ROUGE-L)
+- Output for long dictation (200+ words) is occasionally truncated
+- Domain-specific jargon correction depends on training data coverage
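One possible workaround for truncation on long dictation, not described in the model card, is to split the transcript into shorter chunks and clean each one separately. The sketch below assumes a hypothetical `clean_fn` callable that wraps the generation code from the Usage section; the 150-word default is an arbitrary choice below the 200-word mark mentioned above.

```python
from typing import Callable, List


def split_into_chunks(transcript: str, max_words: int = 150) -> List[str]:
    """Split a transcript into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = transcript.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def clean_long_transcript(
    transcript: str,
    clean_fn: Callable[[str], str],  # hypothetical wrapper around model.generate
    max_words: int = 150,
) -> str:
    """Clean each chunk independently and join the results with spaces."""
    return " ".join(clean_fn(chunk) for chunk in split_into_chunks(transcript, max_words))


# Stand-in cleaner for demonstration; a real one would call the model.
print(clean_long_transcript("one two three four five", str.upper, max_words=2))
```

Fixed word-count boundaries can cut through a sentence or a self-correction, so quality near chunk edges may suffer; splitting on pauses or sentence-like boundaries would likely work better where that information is available.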
+## Part of SottoASR
+[SottoASR](https://github.com/juanqui/sotto) — local, privacy-first speech-to-text for macOS. All processing happens on-device.