metadata
license: other
license_name: lfm1.0
license_link: https://www.liquid.ai/license
base_model: LiquidAI/LFM2.5-350M-Base
tags:
  - speech-to-text
  - transcript-cleanup
  - disfluency-correction
  - sotto-asr
  - lfm2
  - liquid-ai
library_name: transformers
pipeline_tag: text-generation
language:
  - en

SottoASR Transcript Cleanup — LFM2.5-350M Fine-Tuned

Overview

A fine-tuned LiquidAI/LFM2.5-350M-Base model for cleaning speech-to-text transcripts. It removes fillers, crutch words, self-corrections, and false starts, and corrects grammar errors, while preserving the speaker's intent.

Performance

| Metric | This Model | Prompted Qwen3.5-2B |
|---|---|---|
| ROUGE-L | 0.868 | 0.891 |
| Self-correction | 0.814 | 0.742 |
| Inference speed | 0.12s | 1.0s |
| Model size | 350M | 2B |
| Preserve wording | 0.984 | 0.992 |
| Filler removal | 0.954 | 0.974 |

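About 8x faster than the prompted 2B model (0.12s vs 1.0s) at under a fifth of the parameter count (350M vs 2B). It trails slightly on ROUGE-L (0.868 vs 0.891), wording preservation (0.984 vs 0.992), and filler removal (0.954 vs 0.974), but exceeds the 2B model on self-correction (0.814 vs 0.742).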
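The card does not say which ROUGE implementation produced the scores above. As a rough illustration of what ROUGE-L measures, here is a minimal sketch that computes an LCS-based F1 over lowercased whitespace tokens. The function names and the tokenization are assumptions for illustration, so its numbers will not necessarily match the table.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 on lowercased whitespace tokens (illustrative only)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the server is running low on memory"
uncleaned = "uh the server is uh running low on memory"
print(round(rouge_l_f1(reference, uncleaned), 3))  # 0.875
```

The two leftover fillers cost precision only (7 of 9 candidate tokens match), which is why an uncleaned transcript can still score fairly high on this metric.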

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the fine-tuned model in bfloat16 and place it on the available device.
model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

# Wrap the raw transcript in the "### Input / ### Output" prompt template.
raw_transcript = "uh the server is uh running low on memory"
prompt = f"### Input:\n{raw_transcript}\n\n### Output:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding (do_sample=False) keeps the output deterministic.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, skipping the echoed prompt.
cleaned = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(cleaned)  # "The server is running low on memory."
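For repeated calls, the prompt construction and decoding steps above could be wrapped in small helpers. This is a sketch, not part of the released package: `PROMPT_TEMPLATE`, `build_prompt`, and `clean_transcript` are hypothetical names, and the helper expects the `model` and `tokenizer` objects loaded as shown above.

```python
# Same template as the usage example above.
PROMPT_TEMPLATE = "### Input:\n{transcript}\n\n### Output:\n"


def build_prompt(raw_transcript: str) -> str:
    """Wrap a raw transcript in the Input/Output prompt template."""
    return PROMPT_TEMPLATE.format(transcript=raw_transcript.strip())


def clean_transcript(raw_transcript, model, tokenizer, max_new_tokens=512):
    """Generate a cleaned transcript with an already-loaded model (hypothetical helper)."""
    inputs = tokenizer(build_prompt(raw_transcript), return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so only the generated continuation is decoded.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Calling `clean_transcript("uh the server is uh running low on memory", model, tokenizer)` would then replace the inline prompt, generate, and decode steps.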

Training

  • Base model: LiquidAI/LFM2.5-350M-Base (hybrid conv+attention, 32K context)
  • Method: LoRA SFT (rank 64) via Unsloth
  • Dataset: juanquivilla/sotto-transcript-cleanup — 124K pairs
  • Epochs: 3 with early stopping
  • Hardware: RTX 4090 (24GB)
  • Training time: ~9 minutes

Categories Handled

  • Filler removal (uh, um, uhm, er)
  • Crutch word removal (basically, you know, I mean)
  • Self-correction (speaker changes mind mid-sentence)
  • False starts (abandoned sentence beginnings)
  • Grammar correction (gonna → going to)
  • Misheard word correction (post gress → Postgres)
  • Dictation commands (period → ., comma → ,)
  • List formatting (first/second/third → numbered list)

Limitations

  • Grammar correction is the weakest category (0.788 ROUGE-L)
  • Output for long dictation (200+ words) is occasionally truncated
  • Domain-specific jargon correction depends on training data coverage
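One possible workaround for the long-dictation limitation is to split a transcript into shorter pieces and clean each piece separately. The sketch below is an untested assumption, not a documented feature of the model: `chunk_words` is a hypothetical helper and the 150-word default is an arbitrary value chosen to stay below the 200-word mark.

```python
def chunk_words(transcript: str, max_words: int = 150) -> list[str]:
    """Split a transcript into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = transcript.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


print(chunk_words("one two three four five", max_words=2))
# ['one two', 'three four', 'five']
```

Because raw transcripts carry no punctuation, this splits on word count alone, so a self-correction or false start that straddles a chunk boundary may not be cleaned correctly.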

Part of SottoASR

SottoASR — local, privacy-first speech-to-text for macOS. All processing happens on-device.