Llama-3.1-70B-LatamGPT-SFT-1.0-NVFP4
NVFP4 (W4A4) quantization of latam-gpt/Llama-3.1-70B-LatamGPT-SFT-1.0, in the compressed-tensors nvfp4-pack-quantized format, for inference on NVIDIA Blackwell GPUs with vLLM. This card covers only the quantization. For everything about the model itself (training, data, capabilities, intended use, evaluations), see the base model card.
How it was made
- Scheme: NVFP4: 4-bit weights (FP4 E2M1, per-group-of-16 FP8-E4M3 scales + per-tensor FP32 global scale) and 4-bit activations (W4A4, calibrated activation global scales).
lm_headand embeddings stay in BF16. - Recipe:
llmcompressorQuantizationModifier(scheme="NVFP4", ignore=["lm_head"])(compressed-tensors0.13.0), run through a memory-bounded sequential pipeline that calibrates the full 70B on a single 30 GB-RAM / RTX 5090 workstation. - Calibration: 128 Spanish samples from
CohereLabs/aya_dataset, max sequence length 1024, formatted with the model's chat template.
To reproduce with stock tooling, use llmcompressor >= 0.10; its built-in disk offloading handles larger-than-RAM models and produces the same format:
from transformers import AutoModelForCausalLM
from compressed_tensors.offload import load_offloaded_model
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
with load_offloaded_model():
model = AutoModelForCausalLM.from_pretrained(
"latam-gpt/Llama-3.1-70B-LatamGPT-SFT-1.0",
torch_dtype="auto",
device_map="auto_offload",
offload_folder="/path/on/a/real/disk", # pass a single explicit folder
)
oneshot(
model=model,
dataset=..., # your calibration set; here: 128 Spanish Aya chats @ 1024 tokens
recipe=QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"]),
max_seq_length=1024,
num_calibration_samples=128,
)
model.save_pretrained("out", save_compressed=True)
What to expect
- Size: ~40 GB on disk, 3.5x smaller than the ~140 GB BF16 original.
- Serving: load with vLLM (native NVFP4 kernels); it auto-detects the quantization from
config.json:
from vllm import LLM, SamplingParams
llm = LLM(model="pebeto/Llama-3.1-70B-LatamGPT-SFT-1.0-NVFP4")
out = llm.generate(
["Explica la fotosíntesis en términos simples."],
SamplingParams(max_tokens=256, temperature=0.7),
)
print(out[0].outputs[0].text)
- Same chat template and tokenizer as the base model.
Limitations
- Not yet accuracy-benchmarked. The tensor layout matches the stock
llmcompressorNVFP4 recipe byte for byte, but I have not run perplexity or downstream evaluations. Expect some quality loss vs BF16, as with any 4-bit quantization. Community evaluations and feedback are welcome. - vLLM only. The Hugging Face
transformersdecompress-on-generate path is broken for NVFP4 (acompressed-tensors/ PyTorch bug that also hits stock outputs). Do not useAutoModelForCausalLM.generate; load with vLLM. - Hardware: NVFP4 kernels require a Blackwell GPU (compute capability 12.0: RTX 50-series, B100/B200, GB200). The ~40 GB of weights exceed a single 32 GB card, so you need a >=48 GB card, multiple GPUs, or offloading to serve it.
- Inherits the limitations, biases, and intended-use constraints of the base model and of Llama 3.1.
License & attribution
This model derives from Meta's Llama 3.1 and from latam-gpt/Llama-3.1-70B-LatamGPT-SFT-1.0, distributed under the Llama 3.1 Community License and subject to the Acceptable Use Policy and the base model's terms. Credit for the model itself goes to the LatamGPT project and to Meta. I quantized it with llmcompressor / compressed-tensors, using Spanish calibration data from Aya.
- Downloads last month
- 6
Model tree for pebeto/Llama-3.1-70B-LatamGPT-SFT-1.0-NVFP4
Base model
meta-llama/Llama-3.1-70B