---
license: other
license_name: other
license_link: https://huggingface.co/TheDrummer/Magidonia-24B-v4.3
base_model: TheDrummer/Magidonia-24B-v4.3
tags:
- tensorrt-llm
- quantized
- int4_awq
- 4-bit
language:
- en
pipeline_tag: text-generation
---

# TheDrummer/Magidonia-24B-v4.3 — INT4-AWQ (TensorRT-LLM)

This is an INT4-AWQ quantized version of https://huggingface.co/TheDrummer/Magidonia-24B-v4.3, optimized for TensorRT-LLM inference.

---

## Model Overview

**Key Features:**

- **High-Performance Inference**: Optimized for NVIDIA GPUs with TensorRT-LLM
- **Memory Efficient**: 4-bit weights reduce VRAM usage compared with FP16
- **Production Ready**: Built for low-latency, high-throughput chat serving
- **Portable Checkpoints**: Checkpoints work across systems; engines are hardware-specific

---

## Technical Specifications

| Specification | Details |
|---------------|---------|
| **Source Model** | https://huggingface.co/TheDrummer/Magidonia-24B-v4.3 |
| **Quantization Method** | INT4-AWQ |
| **Precision** | 4-bit weights |
| **KV Cache** | int8 |
| **KV Cache Type** | paged |
| **KV Reuse** | enabled |
| **Block/Group Size** | 128 |
| **TensorRT-LLM Version** | `1.2.0rc5` (used for quantization) |
| **Max Batch Size** | 64 |
| **Max Input Length** | 5525 |
| **Max Output Length** | 150 |
| **SM Architecture** | sm90 |
| **GPU** | NVIDIA H100 NVL |
| **CUDA Toolkit** | 13.0 |
| **Generated** | 2026-01-03 18:54:59 UTC |

---

## Artifact Layout

```
trt-llm/
  checkpoints/
    *.safetensors
    config.json
  engines/sm90_trt-llm-1.2.0rc5_cuda13.0/
    rank*.engine
    config.json
```

---

## Quantization Details

| Parameter | Value |
|-----------|-------|
| **Method** | INT4-AWQ |
| **Calibration Size** | 64 samples |
| **Calibration Seq Length** | 5675 |
| **AWQ Block Size** | 128 |
| **Calibration Batch Size** | 16 |

---

## Compatibility

### Requirements

- **GPU**: NVIDIA with Compute Capability ≥ 9.0 (Hopper / H100)
- **CUDA**: 13.0+
- **TensorRT-LLM**: `1.2.0rc5`
- **Python**: 3.10+

### Portability Notes

- **Checkpoints**: Portable across systems with compatible TensorRT-LLM versions; rebuild engines on the target GPU
- **Engines**: Hardware-specific (rebuild for different GPU/CUDA versions/SMs, e.g., H100/H200/B200/Blackwell, L40S, 4090/RTX)
- **INT4-AWQ checkpoints are portable across sm89/sm90+ GPUs; rebuild engines for the target GPU (e.g., H100/H200/B200/Blackwell, L40S, 4090/RTX) or reuse one of the prebuilt engines if it matches your GPU.**

---

## Troubleshooting
### Engine fails to load on a different GPU

Engines are compiled for a specific SM architecture and CUDA version. Either:

1. Use the checkpoints and rebuild the engine on your target system
2. Download an engine matching your GPU from the `engines/` subdirectories
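
To decide between the two options, it can help to check whether a prebuilt engine directory matches the target system. The sketch below assumes the directory naming seen in the artifact layout (`sm90_trt-llm-1.2.0rc5_cuda13.0`) generalizes to `sm<major><minor>_trt-llm-<version>_cuda<version>`; that pattern and the helper names `engine_dir_name` and `find_prebuilt_engine` are illustrative assumptions, not part of TensorRT-LLM.

```python
from pathlib import Path
from typing import Optional


def engine_dir_name(sm_major: int, sm_minor: int,
                    trtllm_version: str, cuda_version: str) -> str:
    """Build the engine directory label (naming pattern is an assumption
    inferred from the single example in the artifact layout)."""
    return f"sm{sm_major}{sm_minor}_trt-llm-{trtllm_version}_cuda{cuda_version}"


def find_prebuilt_engine(engines_root: str, sm_major: int, sm_minor: int,
                         trtllm_version: str, cuda_version: str) -> Optional[Path]:
    """Return the matching engine directory if it exists and holds at least
    one rank*.engine file, otherwise None (meaning: rebuild from checkpoints)."""
    candidate = Path(engines_root) / engine_dir_name(
        sm_major, sm_minor, trtllm_version, cuda_version)
    if candidate.is_dir() and any(candidate.glob("rank*.engine")):
        return candidate
    return None


if __name__ == "__main__":
    # The compute capability could come from, for example,
    # torch.cuda.get_device_capability(); it is hard-coded here for an H100.
    match = find_prebuilt_engine("trt-llm/engines", 9, 0, "1.2.0rc5", "13.0")
    print(match if match else "No matching engine found; rebuild from checkpoints.")
```

A `None` result corresponds to option 1 above: keep the checkpoints and rebuild the engine on the target GPU.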
### Out of Memory

Reduce `max_batch_size` or `max_seq_len` when building the engine. At runtime, lower `kv_cache_config.free_gpu_memory_fraction` to shrink the share of free GPU memory reserved for the KV cache.
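
For a rough sense of where the memory goes, the following back-of-the-envelope sketch estimates weight storage and the KV cache budget. It assumes about 24 billion parameters (taken from the "24B" in the model name), one 16-bit scale per group of 128 weights, and that the KV cache budget is simply the chosen fraction of the memory left free after the model loads. The helper names are hypothetical, and the estimate ignores activations, any unquantized layers, and runtime workspace.

```python
GIB = 2**30  # bytes per GiB


def effective_bits_per_weight(weight_bits: int = 4, group_size: int = 128,
                              scale_bits: int = 16) -> float:
    """Bits per weight including one scale per group (assumed layout)."""
    return weight_bits + scale_bits / group_size


def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given bit width."""
    return num_params * bits_per_weight / 8 / GIB


def kv_cache_budget_gib(free_gib_after_load: float,
                        free_gpu_memory_fraction: float) -> float:
    """Memory handed to the KV cache: a fraction of what is free after load."""
    if not 0.0 < free_gpu_memory_fraction <= 1.0:
        raise ValueError("free_gpu_memory_fraction must be in (0, 1]")
    return free_gib_after_load * free_gpu_memory_fraction


if __name__ == "__main__":
    params = 24e9  # assumption based on the model name
    print(f"FP16 weights: {weight_memory_gib(params, 16):.1f} GiB")
    print(f"INT4-AWQ weights: "
          f"{weight_memory_gib(params, effective_bits_per_weight()):.1f} GiB")
    # Hypothetical example: 60 GiB free after loading, 80% given to the KV cache.
    print(f"KV cache budget: {kv_cache_budget_gib(60.0, 0.8):.1f} GiB")
```

Under these assumptions the weights shrink from roughly 45 GiB in FP16 to about 11.5 GiB in INT4-AWQ, so on most systems the KV cache, sized by batch size, sequence length, and the memory fraction, is the part worth tuning first.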
---

## Resources

- [TensorRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/)
- [Source Model](https://huggingface.co/TheDrummer/Magidonia-24B-v4.3)

---

## License

This quantized model inherits the license from the original base model: **other**

See the [original model's license](https://huggingface.co/TheDrummer/Magidonia-24B-v4.3) for full terms.