---
license: other
license_name: other
license_link: https://huggingface.co/TheDrummer/Magidonia-24B-v4.3
base_model: TheDrummer/Magidonia-24B-v4.3
tags:
- tensorrt-llm
- quantized
- int4_awq
- 4-bit
language:
- en
pipeline_tag: text-generation
---
# TheDrummer/Magidonia-24B-v4.3 — INT4-AWQ (TensorRT-LLM)
This is an INT4-AWQ quantized version of [TheDrummer/Magidonia-24B-v4.3](https://huggingface.co/TheDrummer/Magidonia-24B-v4.3), optimized for TensorRT-LLM inference.

---
## Model Overview
**Key Features:**
- **High-Performance Inference**: Optimized for NVIDIA GPUs with TensorRT-LLM
- **Memory Efficient**: 4-bit weights reduce VRAM usage vs FP16
- **Production Ready**: Built for low-latency, high-throughput chat serving
- **Portable Checkpoints**: Checkpoints work across systems; engines are hardware-specific
---
## Technical Specifications
| Specification | Details |
|---------------|---------|
| **Source Model** | https://huggingface.co/TheDrummer/Magidonia-24B-v4.3 |
| **Quantization Method** | INT4-AWQ |
| **Precision** | 4-bit weights |
| **KV Cache** | int8 |
| **KV Cache Type** | paged |
| **KV Reuse** | enabled |
| **Block/Group Size** | 128 |
| **TensorRT-LLM Version** | `1.2.0rc5` (used for quantization) |
| **Max Batch Size** | 64 |
| **Max Input Length** | 5525 |
| **Max Output Length** | 150 |
| **SM Architecture** | sm90 |
| **GPU** | NVIDIA H100 NVL |
| **CUDA Toolkit** | 13.0 |
| **Generated** | 2026-01-03 18:54:59 UTC |
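
The input and output limits in the table add up to a total budget of 5525 + 150 = 5675 tokens per request. The sketch below shows a client-side pre-check against those limits, assuming the engine enforces the input and output limits separately. The helper names `fits_engine` and `clamp_new_tokens` are hypothetical and are not part of TensorRT-LLM.

```python
# Engine build limits copied from the Technical Specifications table.
MAX_BATCH_SIZE = 64
MAX_INPUT_LEN = 5525
MAX_OUTPUT_LEN = 150
MAX_SEQ_LEN = MAX_INPUT_LEN + MAX_OUTPUT_LEN  # 5675 tokens


def fits_engine(prompt_tokens: int, max_new_tokens: int) -> bool:
    """Return True if a single request stays within the build limits.

    Hypothetical helper for illustration; not a TensorRT-LLM API.
    """
    if prompt_tokens < 1 or max_new_tokens < 1:
        return False
    return prompt_tokens <= MAX_INPUT_LEN and max_new_tokens <= MAX_OUTPUT_LEN


def clamp_new_tokens(prompt_tokens: int, requested: int) -> int:
    """Clamp a requested generation length to what the engine allows."""
    if prompt_tokens > MAX_INPUT_LEN:
        raise ValueError("prompt exceeds the engine's max input length")
    return max(0, min(requested, MAX_OUTPUT_LEN, MAX_SEQ_LEN - prompt_tokens))
```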
---
## Artifact Layout
```
trt-llm/
  checkpoints/
    *.safetensors
    config.json
  engines/sm90_trt-llm-1.2.0rc5_cuda13.0/
    rank*.engine
    config.json
```
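
The engine directory name in the layout above encodes the SM architecture, the TensorRT-LLM version, and the CUDA version. The following sketch parses such a name and checks it against a target system. The naming pattern is inferred from the single directory shown here, so treat the regular expression as an assumption; `EngineLabel`, `parse_engine_label`, and `engine_matches` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class EngineLabel(NamedTuple):
    sm: int               # e.g. 90 for sm90
    trtllm_version: str   # e.g. "1.2.0rc5"
    cuda_version: str     # e.g. "13.0"


# Assumed pattern: sm<SM>_trt-llm-<version>_cuda<version>
_LABEL_RE = re.compile(r"^sm(\d+)_trt-llm-(.+)_cuda(\d+(?:\.\d+)*)$")


def parse_engine_label(name: str) -> Optional[EngineLabel]:
    """Split an engine directory name into its parts, or return None."""
    m = _LABEL_RE.match(name)
    if m is None:
        return None
    return EngineLabel(int(m.group(1)), m.group(2), m.group(3))


def engine_matches(name: str, sm: int, trtllm_version: str,
                   cuda_version: str) -> bool:
    """True if a prebuilt engine directory matches the target system exactly."""
    return parse_engine_label(name) == EngineLabel(sm, trtllm_version, cuda_version)
```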
---
## Quantization Details
| Parameter | Value |
|-----------|-------|
| **Method** | INT4-AWQ |
| **Calibration Size** | 64 samples |
| **Calibration Seq Length** | 5675 |
| **AWQ Block Size** | 128 |
| **Calibration Batch Size** | 16 |
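
To put the 4-bit weights and the block size of 128 in perspective, here is a rough estimate of weight memory. It assumes a nominal 24e9 parameters (taken from "24B" in the model name), one FP16 scale per group of 128 weights, and that every parameter is quantized. Real checkpoints usually keep some tensors, such as embeddings, in higher precision, so the actual footprint will be somewhat larger. KV cache and activations are not included. Under these assumptions the estimate is about 48 GB for FP16 versus about 12.4 GB for INT4-AWQ.

```python
def weight_bytes(num_params: float, weight_bits: int,
                 group_size: int = 0, scale_bits: int = 16) -> float:
    """Estimate weight storage in bytes.

    With group_size > 0, each group of weights carries one scale of
    scale_bits bits (an assumption about the storage format).
    """
    bits_per_weight = float(weight_bits)
    if group_size:
        bits_per_weight += scale_bits / group_size
    return num_params * bits_per_weight / 8


NOMINAL_PARAMS = 24e9  # assumption: read off the "24B" in the model name

fp16_gb = weight_bytes(NOMINAL_PARAMS, 16) / 1e9
int4_gb = weight_bytes(NOMINAL_PARAMS, 4, group_size=128) / 1e9
print(f"FP16: ~{fp16_gb:.1f} GB, INT4-AWQ (group 128): ~{int4_gb:.1f} GB")
```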
---
## Compatibility
### Requirements
- **GPU**: NVIDIA with Compute Capability ≥ 9.0 (Hopper / H100)
- **CUDA**: 13.0+
- **TensorRT-LLM**: `1.2.0rc5`
- **Python**: 3.10+
### Portability Notes
- **Checkpoints**: Portable across systems with compatible TensorRT-LLM versions; rebuild engines on the target GPU
- **Engines**: Hardware-specific; rebuild for a different GPU, CUDA version, or SM architecture (e.g., H100/H200/B200/Blackwell, L40S, 4090/RTX)
- **INT4-AWQ**: Checkpoints are portable across sm89 and sm90+ GPUs. Rebuild engines for the target GPU, or reuse one of the prebuilt engines if it matches your GPU.
---
## Troubleshooting
### Engine fails to load on a different GPU
Engines are compiled for a specific SM architecture and CUDA version. Either:
1. Use the checkpoints and rebuild the engine on your target system.
2. Download an engine matching your GPU from the `engines/` subdirectories.
### Out of memory
- Reduce `max_batch_size` or `max_seq_len` when building the engine.
- Adjust `kv_cache_config.free_gpu_memory_fraction` at runtime.
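
To see how these settings trade off, the sketch below estimates KV cache size for the int8 cache listed in the specifications. The architecture values (40 layers, 8 KV heads, head dimension 128) are assumptions about the base model and should be checked against the checkpoint's `config.json`. The sketch also assumes that `free_gpu_memory_fraction` applies to the GPU memory left free after the engine is loaded, and it ignores paged-block rounding. The free-memory figure of 60e9 bytes is purely illustrative.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 1) -> int:
    """Bytes of KV cache per token; int8 cache means 1 byte per element."""
    # K and V each store num_kv_heads * head_dim elements per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_pool_tokens(free_bytes: float, fraction: float, per_token: int) -> int:
    """Tokens that fit in the KV pool given free memory and the fraction."""
    return int(free_bytes * fraction // per_token)


# ASSUMED architecture values; verify against the checkpoint's config.json.
per_token = kv_bytes_per_token(num_layers=40, num_kv_heads=8, head_dim=128)

# Worst case from the build limits: 64 sequences of 5525 + 150 tokens each.
worst_case_tokens = 64 * (5525 + 150)
worst_case_gb = worst_case_tokens * per_token / 1e9

# Illustrative only: 60e9 bytes free after engine load, fraction 0.5.
pool_tokens = kv_pool_tokens(60e9, 0.5, per_token)
print(f"{per_token} B/token, worst case ~{worst_case_gb:.1f} GB, "
      f"pool holds {pool_tokens} tokens")
```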
---
## Resources
- [TensorRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/)
- [Source Model](https://huggingface.co/TheDrummer/Magidonia-24B-v4.3)
---
## License
This quantized model inherits the license from the original base model: **other**
See the [original model's license](https://huggingface.co/TheDrummer/Magidonia-24B-v4.3) for full terms.