---
license: mit
language:
- en
base_model: LiquidAI/LFM2.5-350M-Base
tags:
- speech-to-text
- transcript-cleanup
- text-correction
- asr-post-processing
- LFM
- LiquidAI
- grpo
- full-fine-tune
- inverse-text-normalization
pipeline_tag: text-generation
datasets:
- juanquivilla/sotto-transcript-cleanup
---
# SottoASR Transcript Cleanup — LFM2.5-350M (Full Precision, soup_30)
[sottoasr.app](https://sottoasr.app) · [MLX 5-bit (recommended)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) · [MLX 4-bit (smaller)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) · [Training Dataset](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)
## Overview
Full-precision bf16 fine-tune of [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) for on-device speech-to-text transcript cleanup. This is the **training artifact** — for on-device deployment on Apple Silicon, use the [5-bit MLX variant](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit).
## What's new (model soup release)
This model is a **weight-space average** of two strong checkpoints from the same fine-tuning lineage:
- **0.3 × v55** (latest: 2-epoch refinement at lr 2e-6): strongest on number accuracy and filler stripping
- **0.7 × v51** (the prior production model): strongest on the adversarial sampling benchmark
Linear interpolation in weight space (`θ = α·θ_v55 + (1-α)·θ_v51`, here with α = 0.3) is sometimes called "model souping". It works here because v55 was chained from v51, so the two checkpoints share an architecture and sit in related minima. The soup recovers v51's bench-sample strengths while keeping v55's number-accuracy gain, at a small cost in filler stripping on long inputs (71.8% vs. 72.6% for v55). The full recipe sweep is in the [research journal](https://github.com/anthropics) (2026-05-06 loop).
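As a rough illustration of the interpolation above, the sketch below averages two checkpoints key by key. The helper name `soup_state_dicts` and the toy checkpoints are hypothetical and are not taken from the actual training code. The toy values are plain floats so the example runs without any dependencies; the same arithmetic should carry over to real `state_dict()` tensors, since it only relies on scalar multiplication and addition.

```python
def soup_state_dicts(theta_a, theta_b, alpha):
    """Return alpha * theta_a + (1 - alpha) * theta_b, key by key.

    Hypothetical helper: values can be plain floats (as in the toy example)
    or tensors/arrays that support scalar multiplication and addition.
    """
    if theta_a.keys() != theta_b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: alpha * theta_a[name] + (1.0 - alpha) * theta_b[name]
        for name in theta_a
    }


# Toy stand-ins for the v55 and v51 weights (made-up numbers).
theta_v55 = {"layer.weight": 1.0, "layer.bias": -2.0}
theta_v51 = {"layer.weight": 0.0, "layer.bias": 2.0}

# alpha = 0.3 puts 30% of the weight on v55 and 70% on v51.
soup = soup_state_dicts(theta_v55, theta_v51, alpha=0.3)
print(soup)
```

Checking that both checkpoints expose the same parameter names guards against silently averaging mismatched architectures, which is the main precondition the paragraph above relies on.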
## Headline numbers (production-mode eval: `max_new_tokens=900`, `repetition_penalty=1.05`)
| Capability | v36 | v45 | v51 | v55 | **soup (this)** |
|---|---:|---:|---:|---:|---:|
| Number accuracy (171-sample stratified val) | 12.9% | 95.9% | 95.3% | 96.5% | **96.5%** |
| 66-case adversarial benchmark (greedy) | n/a | 76% | 84.8% | 84.8% | **86.4%** |
| 66-case adversarial benchmark (temp 0.7 × 4) | n/a | 77% | 84.5% | 82.6% | **86.0%** |
| Loops on 264 sampling-mode probes (lower is better) | n/a | 0 | 1 | 2 | **0** |
| Filler-free on 241 long inputs | 67.2% | 68.0% | 72.2% | 72.6% | 71.8% |
| Sub-deletion >15% on 241 long inputs (lower is better) | 13.3% | 13.7% | 4.6% | 5.0% | 5.0% |
Composite score (0.35×num + 0.30×bench_greedy + 0.15×bench_sample + 0.10×filler_long + 0.05×(1-sub15) + 0.05×(1-loops/N)): **89.51** at full production settings.
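The composite formula can be checked against the table with a few lines of Python. This is a sketch under two assumptions that the card does not state explicitly: every rate is expressed as a percentage on a 0 to 100 scale, and `N` is the 264 sampling-mode probes. The function and argument names are illustrative.

```python
def composite_score(num, bench_greedy, bench_sample, filler_long,
                    sub15, loops, n_probes):
    """Weighted composite on a 0-100 scale; rate inputs are percentages."""
    loop_rate = 100.0 * loops / n_probes  # share of probes that looped
    return (
        0.35 * num
        + 0.30 * bench_greedy
        + 0.15 * bench_sample
        + 0.10 * filler_long
        + 0.05 * (100.0 - sub15)      # reward a low sub-deletion rate
        + 0.05 * (100.0 - loop_rate)  # reward a low loop rate
    )


# Rounded values from the soup column of the table above.
score = composite_score(
    num=96.5, bench_greedy=86.4, bench_sample=86.0,
    filler_long=71.8, sub15=5.0, loops=0, n_probes=264,
)
print(round(score, 2))
```

With the rounded table values the result lands near 89.5, close to the reported 89.51; the small gap presumably comes from rounding in the table.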
## Training pipeline
```
LiquidAI/LFM2.5-350M-Base
→ SFT v23 → GRPO v23 (paragraph emission)
→ GRPO v36: full FT with substantive-deletion-aware reward
→ SFT v39: + 12.7K augmented number examples (ITN)
→ GRPO v40–v45: chained refinement, fixed reward + amplified filler penalty
→ GRPO v50 + v51: anti-loop n-gram penalty
→ GRPO v55: 2-epoch refinement at lr 2e-6 (best chained checkpoint)
→ soup: 0.3·θ_v55 + 0.7·θ_v51 (weight-space average — this model)
```
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

text = "talk about server three sixty"
prompt = f"### Input:\n{text}\n\n### Output:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=max(900, int(len(text.split()) * 1.5)),  # ≥1.5× input word count
        do_sample=False,
        repetition_penalty=1.05,  # LFM2.5 official default
    )

# Decode only the newly generated tokens, then cut at any trailing "###" marker.
output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
if "###" in output:
    output = output[:output.index("###")]
print(output.strip())
```
### Inference recommendations
The headline numbers above use these settings. They match the LFM2.5 model card's defaults and are the production settings for [sottoasr.app](https://sottoasr.app):
- **`repetition_penalty=1.05`** — LFM2.5's official default. Critical for long inputs: prevents the rare voicemail-style 5-gram loops that can occur with `repetition_penalty=1.0`.
- **`max_new_tokens >= 1.5 × input_word_count`**, with a floor of 900: long inputs (>200 words) need headroom, and an output truncated mid-sentence looks like content deletion.
- **`do_sample=False`** (greedy) for deterministic output. If sampling is needed, use `temperature=0.1, top_k=50`.
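The recommendations above can be collected into one small helper that returns keyword arguments for `model.generate`. This is a minimal sketch: the helper name `generation_kwargs` and its `sample` flag are hypothetical, while the values themselves are the ones listed above.

```python
def generation_kwargs(text, sample=False):
    """Build `model.generate` keyword arguments from the settings listed above."""
    n_words = len(text.split())
    kwargs = {
        # At least 1.5x the input word count, never below 900.
        "max_new_tokens": max(900, int(n_words * 1.5)),
        "repetition_penalty": 1.05,
        "do_sample": sample,
    }
    if sample:
        # Only relevant when sampling is enabled.
        kwargs.update(temperature=0.1, top_k=50)
    return kwargs


short = generation_kwargs("talk about server three sixty")
long_input = generation_kwargs("word " * 1000)
print(short["max_new_tokens"], long_input["max_new_tokens"])  # 900 1500
```

The result can be passed as `model.generate(**inputs, **generation_kwargs(text))`, which keeps the token budget tied to the input length instead of a hard-coded constant.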
## All Variants
| Variant | Size | Use Case |
|---------|------|----------|
| **[Full precision (this)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m)** | 676 MB | Training, GPU inference |
| **[MLX 5-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit)** | ~237 MB | **Recommended for Apple Silicon** |
| [MLX 4-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) | ~195 MB | Smallest |
## License
MIT