---
license: mit
language:
- en
base_model: LiquidAI/LFM2.5-350M-Base
tags:
- speech-to-text
- transcript-cleanup
- text-correction
- asr-post-processing
- LFM
- LiquidAI
- grpo
- full-fine-tune
- inverse-text-normalization
pipeline_tag: text-generation
datasets:
- juanquivilla/sotto-transcript-cleanup
---

# SottoASR Transcript Cleanup — LFM2.5-350M (Full Precision, soup_30)

[sottoasr.app](https://sottoasr.app) · [MLX 5-bit (recommended)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) · [MLX 4-bit (smaller)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) · [Training Dataset](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)

## Overview

Full-precision bf16 fine-tune of [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) for on-device speech-to-text transcript cleanup. This is the **training artifact** — for on-device deployment on Apple Silicon, use the [5-bit MLX variant](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit).

## What's new (model soup release)

This model is a **weighted average in weight space** of two strong checkpoints from the same fine-tuning lineage:
- **0.3 × v55** (latest: 2-epoch refinement at lr 2e-6): strongest on number accuracy and filler stripping
- **0.7 × v51** (the prior production model): strongest on the adversarial benchmark under sampling

Linear interpolation in weight space (`θ = α·θ_v55 + (1-α)·θ_v51`, here with `α = 0.3`) is sometimes called "model souping". It works here because v55 was chained from v51 (same architecture, related minima). The soup recovers v51's strength on the sampled benchmark and keeps v55's number accuracy, at the cost of a small dip in the long-input filler-free rate (71.8% vs. 72.6% for v55). The full recipe sweep is in the [research journal](https://github.com/anthropics) (2026-05-06 loop).
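The interpolation formula can be sketched in a few lines. This is a minimal illustration, not the script used to build this release: the function name `soup_state_dicts` and the toy parameter names are hypothetical, and plain floats stand in for the tensors a real `model.state_dict()` would hold.

```python
def soup_state_dicts(sd_a, sd_b, alpha):
    """Return alpha * sd_a + (1 - alpha) * sd_b, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if sd_a.keys() != sd_b.keys():
        raise ValueError("checkpoints must have identical parameter names")
    return {name: alpha * sd_a[name] + (1.0 - alpha) * sd_b[name] for name in sd_a}


# Toy stand-ins for model.state_dict(); real entries would be torch tensors.
theta_v55 = {"layer.weight": 1.0, "layer.bias": -2.0}
theta_v51 = {"layer.weight": 3.0, "layer.bias": 2.0}

# alpha = 0.3 puts weight 0.3 on v55 and 0.7 on v51, as in this release.
theta_soup = soup_state_dicts(theta_v55, theta_v51, alpha=0.3)
```

The same arithmetic applies elementwise to floating-point tensors. Integer buffers, if a checkpoint has any, would need to be copied rather than averaged.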

## Headline numbers (production-mode eval: `max_new_tokens=900`, `repetition_penalty=1.05`)

| Capability | v36 | v45 | v51 | v55 | **soup (this)** |
|---|---:|---:|---:|---:|---:|
| Number accuracy (171-sample stratified val) | 12.9% | 95.9% | 95.3% | 96.5% | **96.5%** |
| 66-case adversarial benchmark (greedy) | n/a | 76% | 84.8% | 84.8% | **86.4%** |
| 66-case adversarial benchmark (temp 0.7 × 4) | n/a | 77% | 84.5% | 82.6% | **86.0%** |
| Loops on 264 sampling-mode probes | n/a | 0 | 1 | 2 | **0** |
| Filler-free on 241 long inputs | 67.2% | 68.0% | 72.2% | 72.6% | 71.8% |
| Sub-deletion >15% on 241 long inputs | 13.3% | 13.7% | 4.6% | 5.0% | 5.0% |

Composite score (0.35×num + 0.30×bench_greedy + 0.15×bench_sample + 0.10×filler_long + 0.05×(1-sub15) + 0.05×(1-loops/N)): **89.51** at full production settings.
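As a rough check, the composite can be recomputed from the rounded table values. The sketch below assumes every component is on a 0–100 percentage scale and that `N` is the 264 sampling-mode probes; neither is stated explicitly in this card, and the function and argument names are illustrative.

```python
def composite_score(num, bench_greedy, bench_sample, filler_long, sub15, loops, n_probes):
    """Weighted composite on a 0-100 scale; all rate inputs are percentages."""
    loop_free = 100.0 * (1.0 - loops / n_probes)  # share of probes without a loop
    return (
        0.35 * num
        + 0.30 * bench_greedy
        + 0.15 * bench_sample
        + 0.10 * filler_long
        + 0.05 * (100.0 - sub15)  # lower sub-deletion is better
        + 0.05 * loop_free
    )


# Soup column of the table above (rounded to one decimal place).
score = composite_score(96.5, 86.4, 86.0, 71.8, 5.0, loops=0, n_probes=264)
print(f"{score:.2f}")
```

With rounded inputs this lands within a few hundredths of the reported 89.51; the remaining gap is presumably due to the table's rounding.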

## Training pipeline

```
LiquidAI/LFM2.5-350M-Base
  → SFT v23 → GRPO v23 (paragraph emission)
  → GRPO v36: full FT with substantive-deletion-aware reward
  → SFT v39: + 12.7K augmented number examples (ITN)
  → GRPO v40–v45: chained refinement, fixed reward + amplified filler penalty
  → GRPO v50 + v51: anti-loop n-gram penalty
  → GRPO v55: 2-epoch refinement at lr 2e-6 (best chained checkpoint)
  → soup: 0.3·θ_v55 + 0.7·θ_v51 (weight-space average — this model)
```

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

text = "talk about server three sixty"
prompt = f"### Input:\n{text}\n\n### Output:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=max(900, int(len(text.split()) * 1.5)),  # ≥1.5× input word count
        do_sample=False,
        repetition_penalty=1.05,                                # LFM2.5 official default
    )
output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
if "###" in output:
    output = output[:output.index("###")]
print(output.strip())
```

### Inference recommendations

The headline numbers above were measured with these settings, which follow the LFM2.5 model card's defaults and are the ones used in production at [sottoasr.app](https://sottoasr.app):

- **`repetition_penalty=1.05`**: LFM2.5's official default. Critical for long inputs: it prevents the rare voicemail-style 5-gram loops that can occur with `repetition_penalty=1.0`.
- **`max_new_tokens = max(900, 1.5 × input_word_count)`**: long inputs (>200 words) need headroom, because an output truncated partway through looks like content deletion.
- **`do_sample=False`** (greedy) for deterministic output. If sampling is needed, use `temperature=0.1, top_k=50`.
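The recommendations above can be bundled into a small helper. This is a sketch: `generation_kwargs` is a hypothetical name, and it rounds the token budget up with `math.ceil` where the usage example truncates with `int`.

```python
import math


def generation_kwargs(text, sample=False):
    """Build keyword arguments for model.generate() from the settings above."""
    word_count = len(text.split())
    kwargs = {
        # At least 900 tokens, or 1.5x the input word count if that is larger.
        "max_new_tokens": max(900, math.ceil(1.5 * word_count)),
        "repetition_penalty": 1.05,
        "do_sample": sample,
    }
    if sample:
        # Low-temperature sampling, only when determinism is not required.
        kwargs.update(temperature=0.1, top_k=50)
    return kwargs


short_cfg = generation_kwargs("talk about server three sixty")
long_cfg = generation_kwargs("word " * 1000)
sampled_cfg = generation_kwargs("talk about server three sixty", sample=True)
```

The result can be passed straight through, for example `model.generate(**inputs, **generation_kwargs(text))`.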

## All Variants

| Variant | Size | Use Case |
|---------|------|----------|
| **[Full precision (this)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m)** | 676 MB | Training, GPU inference |
| **[MLX 5-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit)** | ~237 MB | **Recommended for Apple Silicon** |
| [MLX 4-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) | ~195 MB | Smallest |

## License

MIT