Llama-3.1-70B-LatamGPT-SFT-1.0-NVFP4

NVFP4 (W4A4) quantization of latam-gpt/Llama-3.1-70B-LatamGPT-SFT-1.0, in the compressed-tensors nvfp4-pack-quantized format, for inference on NVIDIA Blackwell GPUs with vLLM. This card covers only the quantization. For everything about the model itself (training, data, capabilities, intended use, evaluations), see the base model card.

How it was made

Scheme: NVFP4: 4-bit weights (FP4 E2M1, per-group-of-16 FP8-E4M3 scales + per-tensor FP32 global scale) and 4-bit activations (W4A4, calibrated activation global scales). lm_head and embeddings stay in BF16.
Recipe: llmcompressor QuantizationModifier(scheme="NVFP4", ignore=["lm_head"]) (compressed-tensors 0.13.0), run through a memory-bounded sequential pipeline that calibrates the full 70B on a single 30 GB-RAM / RTX 5090 workstation.
Calibration: 128 Spanish samples from CohereLabs/aya_dataset, max sequence length 1024, formatted with the model's chat template.

To reproduce with stock tooling, use llmcompressor >= 0.10; its built-in disk offloading handles larger-than-RAM models and produces the same format:

from transformers import AutoModelForCausalLM
from compressed_tensors.offload import load_offloaded_model
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

with load_offloaded_model():
    model = AutoModelForCausalLM.from_pretrained(
        "latam-gpt/Llama-3.1-70B-LatamGPT-SFT-1.0",
        torch_dtype="auto",
        device_map="auto_offload",
        offload_folder="/path/on/a/real/disk",  # pass a single explicit folder
    )

oneshot(
    model=model,
    dataset=...,  # your calibration set; here: 128 Spanish Aya chats @ 1024 tokens
    recipe=QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"]),
    max_seq_length=1024,
    num_calibration_samples=128,
)
model.save_pretrained("out", save_compressed=True)

What to expect

Size: ~40 GB on disk, 3.5x smaller than the ~140 GB BF16 original.
Serving: load with vLLM (native NVFP4 kernels); it auto-detects the quantization from config.json:

from vllm import LLM, SamplingParams

llm = LLM(model="pebeto/Llama-3.1-70B-LatamGPT-SFT-1.0-NVFP4")
out = llm.generate(
    ["Explica la fotosíntesis en términos simples."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(out[0].outputs[0].text)

Same chat template and tokenizer as the base model.

Limitations

Not yet accuracy-benchmarked. The tensor layout matches the stock llmcompressor NVFP4 recipe byte for byte, but I have not run perplexity or downstream evaluations. Expect some quality loss vs BF16, as with any 4-bit quantization. Community evaluations and feedback are welcome.
vLLM only. The Hugging Face transformers decompress-on-generate path is broken for NVFP4 (a compressed-tensors / PyTorch bug that also hits stock outputs). Do not use AutoModelForCausalLM.generate; load with vLLM.
Hardware: NVFP4 kernels require a Blackwell GPU (compute capability 12.0: RTX 50-series, B100/B200, GB200). The ~40 GB of weights exceed a single 32 GB card, so you need a >=48 GB card, multiple GPUs, or offloading to serve it.
Inherits the limitations, biases, and intended-use constraints of the base model and of Llama 3.1.

License & attribution

This model derives from Meta's Llama 3.1 and from latam-gpt/Llama-3.1-70B-LatamGPT-SFT-1.0, distributed under the Llama 3.1 Community License and subject to the Acceptable Use Policy and the base model's terms. Credit for the model itself goes to the LatamGPT project and to Meta. I quantized it with llmcompressor / compressed-tensors, using Spanish calibration data from Aya.