Huihui-LFM2.5-8B-A1B-abliterated-NVFP4
NVFP4 (W4A4) quantization of huihui-ai/Huihui-LFM2.5-8B-A1B-abliterated — the abliterated (refusal-reduced) build of Liquid AI's 8.3B-total / 1.5B-active mixture-of-experts reasoner (131 K context). Runs on one 16 GB Blackwell GPU with room for a ~1.1 M-token KV cache (fp8), and scales cleanly to 2–4 GPUs.
Quantized by Lna-Lab with NVIDIA TensorRT Model-Optimizer (modelopt). To our knowledge this is the first NVFP4 build of an abliterated lfm2_moe.
⚠️ Uncensored model. The base is abliterated by huihui.ai — its safety filtering is significantly reduced. See Usage warnings below. (Per the base card, the abliteration touches the dense path; the MoE experts were not ablated.)
Why it's nice: 8.3B total / 1.5B active MoE + a hybrid backbone (only 6 of 24 layers are attention; the rest are short-convolution) means the KV cache is tiny. Shrink the weights to 4-bit and the freed VRAM turns straight into concurrency — one card happily serves a stack of parallel sessions, and TP fans that out further.
📊 Measured throughput — 1 / 2 / 4× RTX PRO 2000 Blackwell (16 GB, SM120)
Setup: Docker image vllm/vllm-openai:v0.22.0, --quantization modelopt, fp8 KV, max-model-len 32768, 256-token decode via /v1/completions (ignore_eos + min_tokens). Aggregate output tokens/s at concurrency C:
| C | TP=1 (1 GPU) | TP=2 (2 GPU) | TP=4 (4 GPU) |
|---|---|---|---|
| 1 | 130 | 210 | 305 |
| 2 | 206 | 323 | 452 |
| 4 | 363 | 595 | 808 |
| 8 | 687 | 1147 | 1557 |
| 16 | 1166 | 1831 | 2463 |
| 32 | 2111 | 3273 | 4545 |

| | TP=1 | TP=2 | TP=4 |
|---|---|---|---|
| weights / GPU | ~5.3 GB | ~2.7 GB | ~1.4 GB |
| KV pool (fp8) | 1.16 M tok | 3.17 M tok | 7.82 M tok |
| max concurrency @ 32 K ctx | 35× | 97× | 239× |
- Single-stream scales with TP: 130 → 210 → 305 tok/s (1→2→4 GPU).
- Aggregate throughput keeps climbing with concurrency, though sub-linearly (roughly 15–16× from C=1 to C=32); TP=4 reaches ~4.5 K tok/s at C=32 and still has KV headroom for hundreds of sessions.
- Numbers are single-run and indicative (±~10–15 % at high concurrency). Switching to bf16 KV roughly halves the KV pool with only a marginal change in speed.
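The concurrency row and the scaling bullets follow from simple arithmetic on the tables above. The sketch below only re-derives them from the published figures; the helper names are illustrative, and because the pool sizes are already rounded in the table, the results are approximate.

```python
# Figures copied from the tables above; nothing here is newly measured.
KV_POOL_TOKENS = {1: 1.16e6, 2: 3.17e6, 4: 7.82e6}  # fp8 KV pool per TP size
SINGLE_STREAM = {1: 130, 2: 210, 4: 305}             # tok/s at C=1
AGGREGATE_C32 = {1: 2111, 2: 3273, 4: 4545}          # tok/s at C=32
CONTEXT_LEN = 32768                                  # --max-model-len in the benchmark


def max_full_context_sessions(pool_tokens, context_len=CONTEXT_LEN):
    """Sessions that fit if every one fills its whole context window."""
    return round(pool_tokens / context_len)


def speedup(table, tp):
    """Throughput at a given TP size relative to TP=1."""
    return table[tp] / table[1]


for tp in (1, 2, 4):
    print(
        f"TP={tp}: {max_full_context_sessions(KV_POOL_TOKENS[tp])} sessions, "
        f"single-stream x{speedup(SINGLE_STREAM, tp):.2f}, "
        f"C=32 aggregate x{speedup(AGGREGATE_C32, tp):.2f}"
    )
```

The session counts come out as 35, 97 and 239, matching the table. Real workloads rarely fill all 32 K tokens per session, so practical concurrency is usually higher than this worst case.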
Coherence spot-check (greedy / EN-JA-code): MoE explanation, Japanese, and a linked-list reversal all correct; reasoning, multilingual and code generation preserved.
🔗 Multi-GPU (TP) on no-NVLink Blackwell — important
These workstation cards have no NVLink (PCIe-only). With plain --tensor-parallel-size 2/4, NCCL deadlocks at communicator init (GPUs spin at 100 % util, no progress). Launch TP with P2P disabled:
docker run -d --gpus '"device=0,1,3,4"' --ipc=host \
-e NCCL_P2P_DISABLE=1 -e NCCL_CUMEM_ENABLE=0 \
-v $PWD:/model:ro -p 8000:8000 --shm-size 16g \
--entrypoint vllm vllm/vllm-openai:v0.22.0 \
serve /model --served-model-name lfm25-abl \
--quantization modelopt --tensor-parallel-size 4 \
--disable-custom-all-reduce \
--kv-cache-dtype fp8 --max-model-len 32768 \
--max-num-seqs 256 --gpu-memory-utilization 0.90 \
--host 0.0.0.0 --port 8000
P2P-off staging is fine for this small MoE (per-token all-reduce volume is tiny). On NVLink hardware you can drop NCCL_P2P_DISABLE / --disable-custom-all-reduce.
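If you script launches across mixed hardware, the rule above can be captured in a small helper. This is a sketch only: `tp_launch_config` is a hypothetical name, and keeping NCCL_CUMEM_ENABLE=0 on NVLink machines is an assumption, since this card only says that NCCL_P2P_DISABLE and --disable-custom-all-reduce can be dropped there.

```python
def tp_launch_config(tp_size, has_nvlink):
    """Return (env vars, extra `vllm serve` flags) for a launch.

    Mirrors the command above: P2P and custom all-reduce are disabled
    only for multi-GPU runs on PCIe-only (no-NVLink) cards.
    """
    env = {}
    flags = ["--quantization", "modelopt", "--tensor-parallel-size", str(tp_size)]
    if tp_size > 1:
        # Assumption: kept for every multi-GPU launch, as in the command above.
        env["NCCL_CUMEM_ENABLE"] = "0"
        if not has_nvlink:
            env["NCCL_P2P_DISABLE"] = "1"
            flags.append("--disable-custom-all-reduce")
    return env, flags


env, flags = tp_launch_config(tp_size=4, has_nvlink=False)
print(env)
print(" ".join(flags))
```

The returned env dict can be turned into `-e KEY=VALUE` arguments for docker run, and the flags appended to the serve command.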
🧠 It's a reasoning model — chat template
LFM2.5 uses a ChatML-like template and emits an explicit <think> … </think> chain of
thought before the final answer. tokenizer.apply_chat_template(...) renders, e.g.:
<|startoftext|><|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
In vLLM, pass --reasoning-parser deepseek_r1 — it matches the </think> delimiter so
the OpenAI API returns the CoT in reasoning_content and the answer in content. Omit it
for raw text (think tags included).
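When the parser is omitted and you receive raw text, the same split can be done client-side. A minimal sketch, assuming the output contains at most one closing </think> tag; `split_reasoning` is a hypothetical helper, not part of vLLM.

```python
def split_reasoning(raw, open_tag="<think>", close_tag="</think>"):
    """Split raw model text into (reasoning, answer).

    If no closing tag is present (for example, generation was cut off
    mid-thought), everything is returned as the answer.
    """
    head, sep, tail = raw.partition(close_tag)
    if not sep:
        return "", raw.strip()
    head = head.strip()
    if head.startswith(open_tag):
        head = head[len(open_tag):]
    return head.strip(), tail.strip()


reasoning, answer = split_reasoning(
    "<think>France's capital is Paris.</think>\nThe capital of France is Paris."
)
print("reasoning:", reasoning)
print("answer:", answer)
```

A truncated response with no closing tag is a sign that max_tokens was too small for the model to finish thinking.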
🛠 Tool use (agentic)
Pass tools via apply_chat_template(..., tools=[...]). By default the model emits
Pythonic calls between <|tool_call_start|> and <|tool_call_end|>:
<|tool_call_start|>[get_weather(city="Tokyo")]<|tool_call_end|>
For OpenAI-API auto-parsing add --enable-auto-tool-choice --tool-call-parser pythonic
(if your vLLM build's Pythonic parser handles the wrapper); otherwise parse the
special-token block yourself. The bundled chat_template.jinja renders tools / calls /
tool-role results.
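For the "parse it yourself" route, Python's ast module can read the Pythonic call list without executing anything. The sketch below assumes calls use keyword arguments with literal values, as in the example above; `parse_tool_calls` is a hypothetical helper, and positional arguments are rejected rather than guessed at.

```python
import ast
import re

TOOL_CALL_RE = re.compile(
    r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL
)


def parse_tool_calls(text):
    """Extract [{'name': ..., 'arguments': {...}}, ...] from model output."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        tree = ast.parse(block.strip(), mode="eval")
        # The model emits a list of calls; accept a single bare call as well.
        nodes = tree.body.elts if isinstance(tree.body, ast.List) else [tree.body]
        for node in nodes:
            if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
                raise ValueError("expected a plain function call")
            if node.args or any(kw.arg is None for kw in node.keywords):
                raise ValueError("only keyword arguments are handled in this sketch")
            # literal_eval accepts AST nodes and never runs code.
            args = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
            calls.append({"name": node.func.id, "arguments": args})
    return calls


sample = '<|tool_call_start|>[get_weather(city="Tokyo")]<|tool_call_end|>'
print(parse_tool_calls(sample))
```

Using ast.literal_eval instead of eval keeps model-generated text from executing as code, which matters for an agent loop that feeds these calls to real tools.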
🚀 Serve with vLLM
Single GPU:
CUDA_VISIBLE_DEVICES=0 vllm serve sakamakismile/Huihui-LFM2.5-8B-A1B-abliterated-NVFP4 \
--served-model-name lfm25-abl \
--quantization modelopt \
--kv-cache-dtype fp8 \
--max-model-len 128000 \
--max-num-seqs 16 \
--gpu-memory-utilization 0.90 \
--reasoning-parser deepseek_r1 \
--port 8000
(For TP=2/4 see the no-NVLink launch above.)
| flag | what it does for this model |
|---|---|
| --quantization modelopt | required: reads hf_quant_config.json (NVFP4). Omit it and weights load as garbage. |
| --kv-cache-dtype fp8 | ~2× KV capacity (used for the table above). |
| --max-num-seqs | concurrency; KV is cheap here, so 16–256 is comfortable. |
| --max-model-len | up to 131072 (native). |
| --reasoning-parser deepseek_r1 | separates <think> CoT from the answer. |
| --tensor-parallel-size | 1 fits one card; 2/4 for more single-stream speed + KV (mind the NCCL note). |
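Once the server is up, a client can query it with the standard library alone. This is a hedged sketch: it assumes the single-GPU launch above (port 8000, served name lfm25-abl, reasoning parser enabled), the helper names are illustrative, and the fallback to a `reasoning` field is an assumption for builds that name the field differently from `reasoning_content`.

```python
import json
import urllib.request


def build_chat_request(prompt, model="lfm25-abl", max_tokens=512):
    """Request body for /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
        "max_tokens": max_tokens,  # the model thinks first, so leave room
    }


def split_message(response):
    """Return (reasoning, answer) from a chat-completions response dict."""
    msg = response["choices"][0]["message"]
    reasoning = msg.get("reasoning_content") or msg.get("reasoning") or ""
    return reasoning, msg.get("content") or ""


def chat(prompt, base_url="http://localhost:8000"):
    """POST one prompt to a running server (not called on import)."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return split_message(json.load(resp))
```

With the server running, `reasoning, answer = chat("What is the capital of France?")` returns the chain of thought and the final answer separately.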
Offline
from vllm import LLM, SamplingParams

llm = LLM(
    "sakamakismile/Huihui-LFM2.5-8B-A1B-abliterated-NVFP4",
    quantization="modelopt",  # required for the NVFP4 checkpoint
    max_model_len=32768,
    gpu_memory_utilization=0.90,
    max_num_seqs=8,
)
tok = llm.get_tokenizer()
chat = tok.apply_chat_template(  # render the ChatML-like prompt as a string
    [{"role": "user", "content": "日本語で自己紹介して。"}],  # "Introduce yourself in Japanese."
    tokenize=False, add_generation_prompt=True)
params = SamplingParams(
    temperature=0.2, top_k=80, repetition_penalty=1.05, max_tokens=512)
print(llm.generate([chat], params)[0].outputs[0].text)
Sampling (Liquid's recommendation): temperature=0.2, top_k=80,
repetition_penalty=1.05. It thinks first, so give it max_tokens ≥ 512.
⚠️ Usage notes & caveats
- Needs Blackwell (SM120) plus a recent vLLM (≥0.21 with NVFP4/modelopt) and flashinfer: the FP4 GEMM and MoE run on FlashInfer-CUTLASS kernels. ModuleNotFoundError: No module named 'trinity_turbo' in the logs is harmless.
- If a MoE backend objects to the FP4 scales, force Marlin: VLLM_USE_FLASHINFER_MOE_FP4=0.
- Use --quantization modelopt only, not fp8 / awq / gptq.
- A handful of rarely-routed ("cold") experts are calibrated from limited activation coverage and exported with a weight-derived scale fallback; for the overwhelming majority of tokens, output tracks the source closely.
- Uncensored / abliterated (refusal-reduced) — inherited from the base. Safety filtering is significantly reduced; outputs may be sensitive or inappropriate. Use for research / controlled settings, monitor and review outputs, and comply with local law. See the base model card for the full warnings. Quantization changes precision only — no behavioral change vs the abliterated base.
🔬 What's quantized
NVFP4 = e2m1 weights, 16-wide blocks, FP8-e4m3 block scales + FP32 global scale, static
per-tensor input_scale.
- → NVFP4: all 32 MoE experts (per layer) + the 2 dense MLP layers.
- kept BF16: attention (q/k/v/out), short-conv projections, the MoE router (feed_forward.gate), token embeddings, and lm_head.
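To make the format concrete, here is a simplified round trip for one 16-value block. It is an illustration of the number format only, not the modelopt calibration code: the block scale is kept in full precision instead of being rounded to FP8-e4m3, the FP32 global scale is omitted, and ties are broken toward the smaller magnitude rather than by hardware rounding rules.

```python
import math

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes e2m1 can represent
BLOCK = 16                                             # values sharing one scale


def fake_quant_block(block):
    """Quantize one block to the e2m1 grid and dequantize it again.

    Returns (dequantized values, block scale).
    """
    if len(block) != BLOCK:
        raise ValueError(f"expected {BLOCK} values, got {len(block)}")
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * BLOCK, 0.0
    scale = amax / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out, scale


weights = [6.0, -3.0, 2.4, 0.3] + [0.0] * 12
deq, scale = fake_quant_block(weights)
print(scale, deq[:4])
```

Because each group of 16 weights gets its own scale, a single outlier only coarsens the grid for its own block rather than for the whole tensor.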
Full recipe + scripts: Lna-Lab lnarizer/recipes/lfm2_moe/ (the one modelopt calibration
patch for lfm2_moe, plus the expert-key remap for vLLM). The exact same recipe produced
the non-abliterated LFM2.5-8B-A1B-NVFP4 build, unchanged.
License
Inherits the base license (LFM Open License v1.0, license_name: lfm1.0) — see the bundled
LICENSE. Lineage: LiquidAI/LFM2.5-8B-A1B → huihui-ai/Huihui-LFM2.5-8B-A1B-abliterated
→ this NVFP4 build.
Credits
- Base model: Liquid AI — LFM2.5-8B-A1B.
- Abliteration: huihui.ai — Huihui-LFM2.5-8B-A1B-abliterated.
- NVFP4 quantization & packaging: Lna-Lab.
- Tooling: NVIDIA TensorRT Model-Optimizer, vLLM, FlashInfer.
💖 Support the Base Model Author
This NVFP4 build stands on huihui.ai's abliteration work. If it's useful to you, please consider supporting them — a cup of coffee goes a long way:
- Ko-fi: https://ko-fi.com/huihuiai
- Bitcoin: bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
🔬 Lna-Lab · NVFP4 for Blackwell · LLMs without colored glasses, in 4-bit, for the edge
Quantized & benchmarked on 7× RTX PRO 2000 Blackwell · 2026