---
license: other
license_name: lfm1.0
license_link: https://www.liquid.ai/license
base_model: LiquidAI/LFM2.5-350M-Base
tags:
  - speech-to-text
  - transcript-cleanup
  - disfluency-correction
  - sotto-asr
  - lfm2
  - liquid-ai
library_name: transformers
pipeline_tag: text-generation
language:
  - en
---

# SottoASR Transcript Cleanup — LFM2.5-350M Fine-Tuned

## Overview

A fine-tuned [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) model for cleaning speech-to-text transcripts. It removes fillers, crutch words, self-corrections, and false starts, and fixes grammar errors, while preserving the speaker's intent.

## Performance

| Metric | This Model | Prompted Qwen3.5-2B |
|--------|-----------|---------------------|
| **ROUGE-L** | **0.868** | 0.891 |
| **Self-correction** | **0.814** | 0.742 |
| **Inference latency** | **0.12s** | 1.0s |
| **Model size** | **350M** | 2B |
| **Preserve wording** | **0.984** | 0.992 |
| **Filler removal** | **0.954** | 0.974 |

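**About 8x faster** than the prompted 2B model (0.12s vs 1.0s) with competitive overall quality (ROUGE-L 0.868 vs 0.891). It exceeds the 2B model on self-correction (0.814 vs 0.742) and trails it slightly on wording preservation and filler removal.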
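For readers unfamiliar with the metric, ROUGE-L scores a candidate against a reference by the length of their longest common subsequence (LCS). The sketch below is a minimal illustration only: this card does not state which ROUGE implementation or tokenization produced the numbers above, so the lowercased whitespace tokenization and the plain F1 combination used here are assumptions, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over lowercased whitespace tokens (assumed tokenization)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


# An uncleaned transcript scored against its cleaned reference.
reference = "the server is running low on memory"
uncleaned = "uh the server is uh running low on memory"
print(round(rouge_l_f1(reference, uncleaned), 3))  # 0.875
```

Because the two fillers only lower precision (7 of 9 candidate tokens match), an uncleaned transcript can still score fairly high, which is worth keeping in mind when comparing values that are close together.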

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

raw_transcript = "uh the server is uh running low on memory"
prompt = f"### Input:\n{raw_transcript}\n\n### Output:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
cleaned = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(cleaned)  # "The server is running low on memory."
```
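When cleaning many transcripts, it can help to keep the prompt format in one place. The helpers below are a small sketch built around the `### Input:` / `### Output:` template used in the snippet above. The names `build_prompt` and `extract_cleaned` are hypothetical, and two behaviors are assumptions rather than documented requirements of the model: collapsing whitespace in the raw transcript, and cutting the decoded text at the next `###` marker in case generation continues past the answer.

```python
PROMPT_TEMPLATE = "### Input:\n{transcript}\n\n### Output:\n"


def build_prompt(raw_transcript: str) -> str:
    """Wrap a raw transcript in the prompt format shown in the usage example."""
    # Assumption: collapsing runs of whitespace and newlines is harmless here.
    normalized = " ".join(raw_transcript.split())
    return PROMPT_TEMPLATE.format(transcript=normalized)


def extract_cleaned(generated: str) -> str:
    """Trim decoded model output down to the cleaned transcript."""
    # Assumption: if the model keeps generating, the next section starts with
    # a "###" marker, so everything from that marker onward is dropped.
    return generated.split("###", 1)[0].strip()


prompt = build_prompt("  uh the server\n is uh running low on memory ")
print(repr(prompt))
```

`build_prompt` would replace the inline f-string, and `extract_cleaned` would be applied to the string returned by `tokenizer.decode(...)`.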

## Training

- **Base model:** LiquidAI/LFM2.5-350M-Base (hybrid conv+attention, 32K context)
- **Method:** LoRA SFT (rank 64) via Unsloth
- **Dataset:** [juanquivilla/sotto-transcript-cleanup](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup) — 124K pairs
- **Epochs:** 3 with early stopping
- **Hardware:** RTX 4090 (24GB)
- **Training time:** ~9 minutes

## Categories Handled

- Filler removal (uh, um, uhm, er)
- Crutch word removal (basically, you know, I mean)
- Self-correction (speaker changes mind mid-sentence)
- False starts (abandoned sentence beginnings)
- Grammar correction (gonna → going to)
- Misheard word correction (post gress → Postgres)
- Dictation commands (period → ., comma → ,)
- List formatting (first/second/third → numbered list)

## Limitations

- Grammar correction is the weakest category (0.788 ROUGE-L)
- Output for long dictation (200+ words) is occasionally truncated
- Domain-specific jargon correction depends on training data coverage
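
One possible workaround for the long-dictation limitation is to split the transcript into shorter chunks and clean each one separately. This is a sketch under stated assumptions, not a method used or validated by this project: `chunk_transcript` and `clean_long` are hypothetical helpers, `max_words=150` is an arbitrary value chosen to stay under the 200-word mark, and `clean_fn` stands in for whatever function runs the model on a single chunk.

```python
from typing import Callable, List


def chunk_transcript(text: str, max_words: int = 150) -> List[str]:
    """Split a transcript into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def clean_long(text: str, clean_fn: Callable[[str], str], max_words: int = 150) -> str:
    """Clean each chunk with clean_fn and join the results with spaces."""
    cleaned_chunks = [clean_fn(chunk) for chunk in chunk_transcript(text, max_words)]
    return " ".join(chunk for chunk in cleaned_chunks if chunk)


# Demonstration with a stand-in cleaner that only strips the filler "uh".
def fake_clean(chunk: str) -> str:
    return " ".join(word for word in chunk.split() if word != "uh")


long_text = " ".join(["uh", "word"] * 200)  # 400 words
print(len(chunk_transcript(long_text)))     # 3
print(len(clean_long(long_text, fake_clean).split()))  # 200
```

Fixed-size chunks can cut a sentence in half, and a self-correction or false start that spans a chunk boundary may not be resolved, so splitting on pauses or sentence boundaries would likely work better where that information is available.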

## Part of SottoASR

[SottoASR](https://github.com/juanqui/sotto) — local, privacy-first speech-to-text for macOS. All processing happens on-device.