---
license: other
license_name: other
license_link: https://huggingface.co/TheDrummer/Magidonia-24B-v4.3
base_model: TheDrummer/Magidonia-24B-v4.3
tags:
- tensorrt-llm
- quantized
- int4_awq
- 4-bit
language:
- en
pipeline_tag: text-generation
---

# TheDrummer/Magidonia-24B-v4.3 — INT4-AWQ (TensorRT-LLM)

This is an INT4-AWQ quantized version of [TheDrummer/Magidonia-24B-v4.3](https://huggingface.co/TheDrummer/Magidonia-24B-v4.3), optimized for TensorRT-LLM inference.

---

## Model Overview

**Key Features:**
- **High-Performance Inference**: Optimized for NVIDIA GPUs with TensorRT-LLM
- **Memory Efficient**: 4-bit weights take roughly a quarter of the memory of FP16 weights
- **Production Ready**: Built for low-latency, high-throughput chat serving
- **Portable Checkpoints**: Checkpoints work across systems; engines are hardware-specific
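
To put a rough number on the memory saving, the sketch below estimates weight storage only. It assumes a nominal 24 billion parameters (read off the "24B" in the model name, not measured) and one FP16 scale per group of 128 weights. It ignores zero points, AWQ pre-quantization scales, layers that stay unquantized, the KV cache, and activations. `weight_bytes` is an illustrative helper, not part of TensorRT-LLM.

```python
def weight_bytes(n_params: int, bits: int, group_size: int = 0, scale_bits: int = 16) -> int:
    """Approximate storage for n_params weights at the given bit width.

    When group_size > 0, one scale of scale_bits is added per group of weights.
    """
    total = n_params * bits // 8
    if group_size > 0:
        total += (n_params // group_size) * scale_bits // 8
    return total


# Assumption: nominal parameter count taken from the "24B" in the model name.
N_PARAMS = 24_000_000_000

fp16 = weight_bytes(N_PARAMS, bits=16)
int4 = weight_bytes(N_PARAMS, bits=4, group_size=128)

print(f"FP16 weights: {fp16 / 1e9:.1f} GB")
print(f"INT4 weights: {int4 / 1e9:.1f} GB")
print(f"ratio:        {int4 / fp16:.2f}")
```

Under these assumptions the per-group scales add only a few percent on top of the packed 4-bit weights, so the total stays close to a quarter of the FP16 figure.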

---

## Technical Specifications

| Specification | Details |
|---------------|---------|
| **Source Model** | https://huggingface.co/TheDrummer/Magidonia-24B-v4.3 |
| **Quantization Method** | INT4-AWQ |
| **Precision** | 4-bit weights |
| **KV Cache** | int8 |
| **KV Cache Type** | paged |
| **KV Reuse** | enabled |
| **Block/Group Size** | 128 |
| **TensorRT-LLM Version** | `1.2.0rc5` (used for quantization) |
| **Max Batch Size** | 64 |
| **Max Input Length** | 5525 |
| **Max Output Length** | 150 |
| **SM Architecture** | sm90 |
| **GPU** | NVIDIA H100 NVL |
| **CUDA Toolkit** | 13.0 |
| **Generated** | 2026-01-03 18:54:59 UTC |

---

## Artifact Layout

```
trt-llm/
  checkpoints/
    *.safetensors
    config.json
  engines/sm90_trt-llm-1.2.0rc5_cuda13.0/
    rank*.engine
    config.json
```
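
The engine directory name encodes the SM architecture, the TensorRT-LLM version, and the CUDA version. A minimal sketch of reading those fields back out is shown below. The pattern is inferred from the single directory name in the layout above, so treat it as an assumption; `EngineTag` and `parse_engine_dir` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class EngineTag(NamedTuple):
    sm: int       # SM architecture, e.g. 90 for sm90
    trtllm: str   # TensorRT-LLM version string
    cuda: str     # CUDA toolkit version string


# Assumption: every engine directory follows sm<NN>_trt-llm-<version>_cuda<version>.
_ENGINE_DIR = re.compile(r"^sm(\d+)_trt-llm-(.+)_cuda(\d+(?:\.\d+)*)$")


def parse_engine_dir(name: str) -> Optional[EngineTag]:
    """Return the parsed tag, or None if the name does not follow the pattern."""
    match = _ENGINE_DIR.match(name)
    if match is None:
        return None
    return EngineTag(int(match.group(1)), match.group(2), match.group(3))


print(parse_engine_dir("sm90_trt-llm-1.2.0rc5_cuda13.0"))
```

Returning `None` for names that do not match lets a caller skip unrelated directories such as `checkpoints` without raising.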

---

## Quantization Details

| Parameter | Value |
|-----------|-------|
| **Method** | INT4-AWQ |
| **Calibration Size** | 64 samples |
| **Calibration Seq Length** | 5675 |
| **AWQ Block Size** | 128 |
| **Calibration Batch Size** | 16 |

---

## Compatibility

### Requirements
- **GPU**: NVIDIA with Compute Capability ≥ 9.0 (Hopper / H100)
- **CUDA**: 13.0+
- **TensorRT-LLM**: `1.2.0rc5`
- **Python**: 3.10+

### Portability Notes
- **Checkpoints**: Portable across sm89/sm90+ GPUs and across systems with a compatible TensorRT-LLM version.
- **Engines**: Hardware-specific. Rebuild them for a different GPU, SM architecture, or CUDA version (e.g., H100/H200/B200/Blackwell, L40S, 4090/RTX), or reuse one of the prebuilt engines if it matches your GPU.
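
The reuse-or-rebuild decision can be sketched as a simple lookup. The example assumes engine directories follow the `sm<NN>_trt-llm-<version>_cuda<version>` naming seen in this repository and that an exact match on all three fields is required, which is a conservative choice rather than a documented TensorRT-LLM rule. Both helper functions are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def expected_engine_dir(capability: Tuple[int, int], trtllm_version: str, cuda_version: str) -> str:
    """Build the directory name an engine for this setup would carry."""
    major, minor = capability
    return f"sm{major}{minor}_trt-llm-{trtllm_version}_cuda{cuda_version}"


def find_prebuilt_engine(
    available: Iterable[str],
    capability: Tuple[int, int],
    trtllm_version: str,
    cuda_version: str,
) -> Optional[str]:
    """Return the matching engine directory, or None if a rebuild is needed."""
    wanted = expected_engine_dir(capability, trtllm_version, cuda_version)
    return wanted if wanted in set(available) else None


available = ["sm90_trt-llm-1.2.0rc5_cuda13.0"]

# On a live system the capability tuple could come from
# torch.cuda.get_device_capability(), assuming PyTorch is installed.
for gpu, capability in (("H100", (9, 0)), ("L40S", (8, 9))):
    match = find_prebuilt_engine(available, capability, "1.2.0rc5", "13.0")
    print(gpu, "->", match or "rebuild from checkpoints")
```

A `None` result means no prebuilt engine fits, so the engine has to be rebuilt from the checkpoints on the target machine.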

---

## Troubleshooting

<details>
<summary><b>Engine fails to load on different GPU</b></summary>
Engines are compiled for a specific SM architecture and CUDA version. Either:

1. Use the checkpoints and rebuild the engine on your target system
2. Download an engine matching your GPU from the `engines/` subdirectories
</details>

<details>
<summary><b>Out of Memory</b></summary>
Reduce `max_batch_size` or `max_seq_len` when building the engine, or lower `kv_cache_config.free_gpu_memory_fraction` at runtime.
</details>
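
To see why batch size and sequence length matter for memory, the sketch below estimates a worst-case KV cache footprint in which every sequence in the batch is at full length. The batch size, sequence length, and int8 cache type come from this card. The layer count, KV head count, and head dimension are placeholder assumptions, so read the real values from the checkpoint's `config.json`. The estimate also ignores any per-block scaling data and the fact that a paged cache allocates blocks on demand.

```python
def kv_cache_bytes(
    batch_size: int,
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int,
) -> int:
    """Bytes needed to hold keys and values for every token of every sequence."""
    # Factor 2 accounts for storing both a key and a value per head per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return batch_size * seq_len * per_token


# From the build settings listed in this card.
MAX_SEQ_LEN = 5525 + 150  # max input length + max output length
INT8_BYTES = 1

# Assumptions: placeholder architecture values, not taken from this card.
NUM_LAYERS = 40
NUM_KV_HEADS = 8
HEAD_DIM = 128

for batch in (64, 32, 16):
    total = kv_cache_bytes(batch, MAX_SEQ_LEN, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, INT8_BYTES)
    print(f"batch {batch:>2}: {total / 2**30:.1f} GiB")
```

The footprint scales linearly with both batch size and sequence length, so halving either one halves the worst-case cache size.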

---

## Resources

- [TensorRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/)
- [Source Model](https://huggingface.co/TheDrummer/Magidonia-24B-v4.3)

---

## License

This quantized model inherits the license from the original base model: **other**

See the [original model's license](https://huggingface.co/TheDrummer/Magidonia-24B-v4.3) for full terms.