---
license: mit
language:
- en
base_model: LiquidAI/LFM2.5-350M-Base
tags:
- speech-to-text
- transcript-cleanup
- text-correction
- asr-post-processing
- LFM
- LiquidAI
- grpo
- full-fine-tune
- inverse-text-normalization
pipeline_tag: text-generation
datasets:
- juanquivilla/sotto-transcript-cleanup
---

# SottoASR Transcript Cleanup — LFM2.5-350M (Full Precision, soup_30)

[sottoasr.app](https://sottoasr.app) · [MLX 5-bit (recommended)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit) · [MLX 4-bit (smaller)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) · [Training Dataset](https://huggingface.co/datasets/juanquivilla/sotto-transcript-cleanup)

## Overview

Full-precision bf16 fine-tune of [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) for on-device speech-to-text transcript cleanup. This is the **training artifact** — for on-device deployment on Apple Silicon, use the [5-bit MLX variant](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit).

## What's new (model soup release)

This model is a **weight-space average** of two strong checkpoints from the same fine-tuning lineage:
- **0.3 × v55** (latest: 2-epoch refinement at lr 2e-6): strongest on number accuracy and filler stripping
- **0.7 × v51** (the prior production model): strongest on the adversarial benchmark under sampling

Linear interpolation in weight space (`θ = α·θ_v55 + (1-α)·θ_v51`, here with `α = 0.3`) is sometimes called "model souping". It works here because v55 was chained from v51 (same architecture, related minima). The soup recovers v51's strength on the sampled benchmark and keeps v55's number-accuracy gain, at the cost of a small dip in the filler-free rate on long inputs (71.8% vs. 72.6% for v55). The full recipe sweep is in the [research journal](https://github.com/anthropics) (2026-05-06 loop).
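
The interpolation formula can be sketched in a few lines. This is a minimal illustration, not the script used to build this release: `soup_state_dicts` is a hypothetical helper, and the toy values stand in for real parameter tensors. It assumes both checkpoints expose the same parameter names.

```python
def soup_state_dicts(sd_a, sd_b, alpha):
    """Return alpha * sd_a + (1 - alpha) * sd_b, parameter by parameter.

    Hypothetical helper. Values may be floats, NumPy arrays, or torch
    tensors, as long as they support scalar multiplication and addition.
    """
    if sd_a.keys() != sd_b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: alpha * sd_a[name] + (1.0 - alpha) * sd_b[name]
        for name in sd_a
    }


# Toy stand-ins for the two checkpoints (real ones hold tensors).
theta_v55 = {"layer.weight": 1.0, "layer.bias": -2.0}
theta_v51 = {"layer.weight": 2.0, "layer.bias": 0.0}

# alpha = 0.3 weights v55 at 0.3 and v51 at 0.7, as in this release.
soup = soup_state_dicts(theta_v55, theta_v51, alpha=0.3)
print(soup)
```

With real checkpoints the dictionaries would typically come from `model.state_dict()`, and any non-float buffers would need to be copied rather than averaged.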

## Headline numbers (production-mode eval: `max_new_tokens=900`, `repetition_penalty=1.05`)

| Capability | v36 | v45 | v51 | v55 | **soup (this)** |
|---|---:|---:|---:|---:|---:|
| Number accuracy (171-sample stratified val) | 12.9% | 95.9% | 95.3% | 96.5% | **96.5%** |
| 66-case adversarial benchmark (greedy) | n/a | 76% | 84.8% | 84.8% | **86.4%** |
| 66-case adversarial benchmark (temp 0.7 × 4) | n/a | 77% | 84.5% | 82.6% | **86.0%** |
| Loops on 264 sampling-mode probes (lower is better) | n/a | 0 | 1 | 2 | **0** |
| Filler-free on 241 long inputs | 67.2% | 68.0% | 72.2% | 72.6% | 71.8% |
| Sub-deletion >15% on 241 long inputs (lower is better) | 13.3% | 13.7% | 4.6% | 5.0% | **5.0%** |

Composite score (0.35×num + 0.30×bench_greedy + 0.15×bench_sample + 0.10×filler_long + 0.05×(1-sub15) + 0.05×(1-loops/N)): **89.51** at full production settings.
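
As a worked check, the composite can be recomputed from the rounded table values for the soup column. This sketch assumes every term is on a 0–100 scale and that `N` is the 264 sampling-mode probes; neither is stated explicitly above, and `composite_score` is a hypothetical name.

```python
def composite_score(num, bench_greedy, bench_sample, filler_long,
                    sub15, loops, n_probes):
    """Weighted composite on a 0-100 scale (assumed scaling).

    sub15 is a percentage and loops is a raw count; both are inverted
    so that higher is better.
    """
    return (
        0.35 * num
        + 0.30 * bench_greedy
        + 0.15 * bench_sample
        + 0.10 * filler_long
        + 0.05 * (100.0 - sub15)
        + 0.05 * 100.0 * (1.0 - loops / n_probes)
    )


# Soup column of the table above (rounded values).
score = composite_score(
    num=96.5, bench_greedy=86.4, bench_sample=86.0,
    filler_long=71.8, sub15=5.0, loops=0, n_probes=264,
)
print(round(score, 2))
```

The rounded inputs give a value of roughly 89.5, slightly off the reported 89.51; the gap is presumably due to rounding in the table.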

## Training pipeline

```
LiquidAI/LFM2.5-350M-Base
  → SFT v23 → GRPO v23 (paragraph emission)
  → GRPO v36: full FT with substantive-deletion-aware reward
  → SFT v39: + 12.7K augmented number examples (ITN)
  → GRPO v40–v45: chained refinement, fixed reward + amplified filler penalty
  → GRPO v50 + v51: anti-loop n-gram penalty
  → GRPO v55: 2-epoch refinement at lr 2e-6 (best chained checkpoint)
  → soup: 0.3·θ_v55 + 0.7·θ_v51 (weight-space average — this model)
```
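
The anti-loop n-gram penalty and the loop counts in the table both depend on detecting repeated n-grams in the output. This card does not define the exact criterion, so the following is only one plausible sketch: `has_ngram_loop`, the 5-word window, and the `min_repeats=3` threshold are all assumptions.

```python
from collections import Counter


def has_ngram_loop(text: str, n: int = 5, min_repeats: int = 3) -> bool:
    """Return True if any word n-gram occurs at least min_repeats times.

    Hypothetical detector; the thresholds are illustrative, not the ones
    used in training or evaluation.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return False
    return max(Counter(ngrams).values()) >= min_repeats


looping = "please call me back soon " * 3
clean = "Please call me back when you get a chance this afternoon."
print(has_ngram_loop(looping), has_ngram_loop(clean))
```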

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

text = "talk about server three sixty"
prompt = f"### Input:\n{text}\n\n### Output:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=max(900, int(len(text.split()) * 1.5)),  # ≥1.5× input word count
        do_sample=False,
        repetition_penalty=1.05,                                # LFM2.5 official default
    )
output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
if "###" in output:
    output = output[:output.index("###")]
print(output.strip())
```

### Inference recommendations

The headline numbers above use these settings. They match the LFM2.5 model card's defaults and are the settings used in production at [sottoasr.app](https://sottoasr.app):

- **`repetition_penalty=1.05`**: LFM2.5's official default. Critical for long inputs, where it prevents the rare voicemail-style 5-gram loops that can occur with `repetition_penalty=1.0`.
- **`max_new_tokens = max(900, 1.5 × input_word_count)`**: long inputs (>200 words) need headroom, because output cut off by the token limit looks like content deletion.
- **`do_sample=False`** (greedy) for deterministic output. If sampling is needed, use `temperature=0.1, top_k=50`.
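
These recommendations can be collected into one helper that builds the keyword arguments for `model.generate`. This is a sketch under the settings listed above; `generation_kwargs` is a hypothetical name, not part of the released code.

```python
def generation_kwargs(text: str, sample: bool = False) -> dict:
    """Build generate() keyword arguments from the recommendations above."""
    # Token budget: at least 900, or 1.5x the input word count if larger.
    budget = max(900, int(len(text.split()) * 1.5))
    kwargs = {
        "max_new_tokens": budget,
        "repetition_penalty": 1.05,
        "do_sample": sample,
    }
    if sample:
        # Low-temperature sampling, only when determinism is not required.
        kwargs.update(temperature=0.1, top_k=50)
    return kwargs


short_kwargs = generation_kwargs("talk about server three sixty")
long_kwargs = generation_kwargs("word " * 1000, sample=True)
print(short_kwargs["max_new_tokens"], long_kwargs["max_new_tokens"])
```

The result can be passed as `model.generate(**inputs, **generation_kwargs(text))`.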

## All Variants

| Variant | Size | Use Case |
|---------|------|----------|
| **[Full precision (this)](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m)** | 676 MB | Training, GPU inference |
| **[MLX 5-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit)** | ~237 MB | **Recommended for Apple Silicon** |
| [MLX 4-bit](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit) | ~195 MB | Smallest |

## License

MIT