---
license: mit
language:
- en
base_model: LiquidAI/LFM2.5-350M-Base
tags:
- speech-to-text
- transcript-cleanup
- text-correction
- asr-post-processing
- LFM
- LiquidAI
- grpo
- full-fine-tune
- inverse-text-normalization
pipeline_tag: text-generation
datasets:
- juanquivilla/sotto-transcript-cleanup
---

# SottoASR Transcript Cleanup: LFM2.5-350M (Full Precision, soup_30)

[sottoasr.app](https://sottoasr.app) · [MLX 5-bit (recommended)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) · [MLX 4-bit (smaller)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) · [Training Dataset](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)

## Overview

Full-precision bf16 fine-tune of [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) for on-device speech-to-text transcript cleanup. This is the **training artifact**; for on-device deployment on Apple Silicon, use the [5-bit MLX variant](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit).

## What's new (model soup release)

This model is a **weight-space average** of two strong checkpoints from the same fine-tuning lineage:

- **0.3 × v55** (latest: 2-epoch refinement at lr 2e-6): strongest on number accuracy and filler stripping
- **0.7 × v51** (the prior production model): strongest on the adversarial sampling benchmark

Linear interpolation in weight space (`θ = α·θ_v55 + (1-α)·θ_v51`) is sometimes called "model souping". It works here because v55 was chained from v51 (same architecture, related minima), and the soup recovers v51's bench-sample strengths without losing v55's number/filler gains. The full recipe sweep is in the [research journal](https://github.com/anthropics) (2026-05-06 loop).

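
The interpolation formula above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the script used to build this release: `soup_state_dicts` is a hypothetical helper name, and the toy dictionaries stand in for the two checkpoints' parameters. With real checkpoints, the same arithmetic would presumably be applied to the tensors returned by each model's `state_dict()`.

```python
from typing import Any, Dict


def soup_state_dicts(
    theta_a: Dict[str, Any], theta_b: Dict[str, Any], alpha: float
) -> Dict[str, Any]:
    """Return alpha * theta_a + (1 - alpha) * theta_b, parameter by parameter.

    Values may be floats, NumPy arrays, or torch tensors: anything that
    supports scalar multiplication and addition.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError(f"alpha must be in [0, 1], got {alpha}")
    if theta_a.keys() != theta_b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: alpha * theta_a[name] + (1.0 - alpha) * theta_b[name]
        for name in theta_a
    }


# Toy stand-ins for the two checkpoints (hypothetical values, not real weights).
theta_v55 = {"layer.weight": 1.0, "layer.bias": 0.0}
theta_v51 = {"layer.weight": 0.0, "layer.bias": 1.0}

# alpha = 0.3 reproduces the 0.3 x v55 + 0.7 x v51 mix described above.
soup = soup_state_dicts(theta_v55, theta_v51, alpha=0.3)
print(soup)
```

The key check matters because averaging is only meaningful when both checkpoints share one architecture, which holds here since v55 was chained from v51. Integer buffers, if a checkpoint has any, would need to be copied rather than averaged; the sketch does not handle that case.
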
## Headline numbers (production-mode eval: `max_new_tokens=900`, `repetition_penalty=1.05`)

| Capability | v36 | v45 | v51 | v55 | **soup (this)** |
|---|---:|---:|---:|---:|---:|
| Number accuracy (171-sample stratified val) | 12.9% | 95.9% | 95.3% | 96.5% | **96.5%** |
| 66-case adversarial benchmark (greedy) | n/a | 76% | 84.8% | 84.8% | **86.4%** |
| 66-case adversarial benchmark (temp 0.7 × 4) | n/a | 77% | 84.5% | 82.6% | **86.0%** |
| Loops on 264 sampling-mode probes | n/a | 0 | 1 | 2 | **0** |
| Filler-free on 241 long inputs | 67.2% | 68.0% | 72.2% | 72.6% | 71.8% |
| Sub-deletion >15% on 241 long inputs | 13.3% | 13.7% | 4.6% | 5.0% | **5.0%** |

Composite score (`0.35×num + 0.30×bench_greedy + 0.15×bench_sample + 0.10×filler_long + 0.05×(1-sub15) + 0.05×(1-loops/N)`): **89.51** at full production settings.

## Training pipeline

```
LiquidAI/LFM2.5-350M-Base
  → SFT v23
  → GRPO v23 (paragraph emission)
  → GRPO v36: full FT with substantive-deletion-aware reward
  → SFT v39: + 12.7K augmented number examples (ITN)
  → GRPO v40–v45: chained refinement, fixed reward + amplified filler penalty
  → GRPO v50 + v51: anti-loop n-gram penalty
  → GRPO v55: 2-epoch refinement at lr 2e-6 (best chained checkpoint)
  → soup: 0.3·θ_v55 + 0.7·θ_v51 (weight-space average; this model)
```

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

# Wrap the raw transcript in the "### Input / ### Output" prompt format.
text = "talk about server three sixty"
prompt = f"### Input:\n{text}\n\n### Output:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=max(900, int(len(text.split()) * 1.5)),  # ≥1.5× input word count, floor of 900
        do_sample=False,  # greedy decoding for deterministic output
        repetition_penalty=1.05,  # LFM2.5 official default
    )

# Decode only the newly generated tokens, skipping the prompt.
output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Cut at the next "###" marker in case the model starts a new section.
if "###" in output:
    output = output[:output.index("###")]
print(output.strip())
```

### Inference recommendations

The headline numbers above use the settings below, which match the LFM2.5 model card's defaults and are the settings used in the production deployment for [sottoasr.app](https://sottoasr.app):

- **`repetition_penalty=1.05`**: LFM2.5's official default. Critical for long inputs: it prevents the rare voicemail-style 5-gram loops that can occur with `repetition_penalty=1.0`.
- **`max_new_tokens >= 1.5 × input_word_count`**, with a floor of 900 (that is, `max(900, 1.5 × input_word_count)`): long inputs (>200 words) need headroom, and an output truncated mid-generation looks like content deletion.
- **`do_sample=False`** (greedy) for deterministic output. If sampling is needed, use `temperature=0.1, top_k=50`.

## All Variants

| Variant | Size | Use Case |
|---------|------|----------|
| **[Full precision (this)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m)** | 676 MB | Training, GPU inference |
| **[MLX 5-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit)** | ~237 MB | **Recommended for Apple Silicon** |
| [MLX 4-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) | ~195 MB | Smallest |

## License

MIT