license: other
license_name: lfm1.0
license_link: https://www.liquid.ai/license
base_model: LiquidAI/LFM2.5-350M-Base
tags:
  - speech-to-text
  - transcript-cleanup
  - disfluency-correction
  - sotto-asr
  - lfm2
  - liquid-ai
library_name: transformers
pipeline_tag: text-generation
language:
  - en

SottoASR Transcript Cleanup — LFM2.5-350M Fine-Tuned

Overview

A fine-tuned LiquidAI/LFM2.5-350M-Base model for cleaning speech-to-text transcripts. It removes fillers, crutch words, and false starts, resolves self-corrections, and fixes grammar errors while preserving the speaker's intent.

Performance

Metric	This Model	Prompted Qwen3.5-2B
ROUGE-L	0.868	0.891
Self-correction	0.814	0.742
Preserve wording	0.984	0.992
Filler removal	0.954	0.974
Inference time	0.12s	1.0s
Model size (parameters)	350M	2B

Roughly 8x faster than the prompted 2B model (0.12s vs 1.0s inference time) while scoring within 0.03 of it on ROUGE-L, wording preservation, and filler removal. Exceeds the 2B model on self-correction (0.814 vs 0.742).
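The card does not say which ROUGE-L implementation produced these scores. As a rough illustration of what the metric measures, the sketch below computes a word-level ROUGE-L F1 from the longest common subsequence. The whitespace tokenization, the lowercasing, and the function names `lcs_length` and `rouge_l_f1` are assumptions for illustration, not the evaluation code used for this model.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Word-level ROUGE-L F1 (assumed: lowercase + whitespace tokens)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "The server is running low on memory."
raw = "uh the server is uh running low on memory"

print(round(rouge_l_f1(reference, reference), 3))  # 1.0
print(round(rouge_l_f1(reference, raw), 3))        # 0.75
```

With this naive tokenization, "memory." and "memory" count as different tokens, so punctuation handling alone can move the score. Numbers from this sketch are therefore not directly comparable to the table.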

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the fine-tuned model in bfloat16 and place it on the available device.
# On older transformers releases the `dtype` argument is named `torch_dtype`.
model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

# Wrap the raw transcript in the Input/Output prompt template.
raw_transcript = "uh the server is uh running low on memory"
prompt = f"### Input:\n{raw_transcript}\n\n### Output:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding (do_sample=False) keeps the cleanup deterministic.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Drop the prompt tokens so only the newly generated text is decoded.
cleaned = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(cleaned)  # "The server is running low on memory."
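When cleaning more than one transcript, it can help to keep the prompt template in one place. The sketch below wraps the template from the snippet above in two small helpers. The names `build_prompt` and `extract_cleaned` are hypothetical, and so is the trimming at a stray `###` marker: the card does not say whether the model ever continues past its answer, so treat that step as a defensive assumption. The sketch also strips surrounding whitespace, which the snippet above does not.

```python
PROMPT_TEMPLATE = "### Input:\n{transcript}\n\n### Output:\n"


def build_prompt(raw_transcript):
    """Wrap a raw transcript in the Input/Output template from the usage snippet."""
    return PROMPT_TEMPLATE.format(transcript=raw_transcript.strip())


def extract_cleaned(generated_text):
    """Keep the first answer only; drop anything after a stray '###' marker."""
    return generated_text.split("###", 1)[0].strip()


prompt = build_prompt("  uh the server is uh running low on memory ")
print(repr(prompt))

# Simulated decoder output, used here instead of a real model call.
print(extract_cleaned("The server is running low on memory.\n\n### Input:\n"))
```

The string returned by `build_prompt` is what would be passed to the tokenizer, and `extract_cleaned` would be applied to the decoded continuation.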

Training

Base model: LiquidAI/LFM2.5-350M-Base (hybrid conv+attention, 32K context)
Method: LoRA SFT (rank 64) via Unsloth
Dataset: juanquivilla/sotto-transcript-cleanup — 124K pairs
Epochs: 3 with early stopping
Hardware: RTX 4090 (24GB)
Training time: ~9 minutes
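The card does not show how the training pairs were serialized for SFT. A plausible sketch, assuming training used the same `### Input:` / `### Output:` template as inference and that each target ends with an end-of-sequence token, is given below. The function name `format_example`, the `raw`/`clean` field names, and the `<eos>` placeholder are hypothetical; in practice the token would come from the tokenizer.

```python
def format_example(raw, clean, eos_token="<eos>"):
    """Turn one (raw, clean) pair into prompt/completion/text fields.

    Assumption: the training template matches the inference prompt.
    """
    prompt = f"### Input:\n{raw}\n\n### Output:\n"
    completion = clean + eos_token
    return {"prompt": prompt, "completion": completion, "text": prompt + completion}


# Illustrative pair, reusing the example from the Usage section.
pair = {
    "raw": "uh the server is uh running low on memory",
    "clean": "The server is running low on memory.",
}
example = format_example(pair["raw"], pair["clean"])
print(example["text"])
```

Ending each target with an end-of-sequence token is a common way to teach a causal model where the cleaned transcript stops, which matters here because generation is capped only by `max_new_tokens`.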

Limitations

Grammar correction is the weakest category (0.788 ROUGE-L)
Output for long dictation (200+ words) is occasionally truncated
Domain-specific jargon correction depends on training data coverage
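One possible workaround for truncation on long dictation is to split the transcript into chunks below the 200-word mark and clean each chunk separately. The sketch below is an assumption, not part of SottoASR: `chunk_words`, `clean_long`, the 150-word default, and the `clean_fn` callable (standing in for a model call) are all hypothetical.

```python
def chunk_words(text, max_words=150):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def clean_long(text, clean_fn, max_words=150):
    """Clean each chunk with clean_fn and join the results with spaces."""
    return " ".join(clean_fn(chunk) for chunk in chunk_words(text, max_words))


def fake_clean(chunk):
    """Stand-in cleaner for demonstration only: drops the filler 'uh'."""
    return " ".join(w for w in chunk.split() if w != "uh")


long_text = " ".join(["uh the server is running"] * 100)  # 500 words
chunks = chunk_words(long_text)
print(len(chunks), [len(c.split()) for c in chunks])  # 4 [150, 150, 150, 50]

cleaned = clean_long(long_text, fake_clean)
print(len(cleaned.split()))  # 400
```

Splitting on a fixed word count can cut a self-correction or false start in half, so a real pipeline would probably prefer pause or sentence boundaries from the ASR output where those are available.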

Part of SottoASR

SottoASR — local, privacy-first speech-to-text for macOS. All processing happens on-device.
