Falcon3-7B-Base-1.58bit-ATLAS (v2.10.0)
This repository contains a highly optimized TQ1 quantized version of the official tiiuae/Falcon3-7B-Base model for the ATLAS Engine ecosystem, designed for native, ultra-low-latency CPU inference without any GPU requirement.
Packed using the unified pack_to_atlas.py toolchain (v2.10.0) with BF16 weight scale correction.
Engine Specifications
| Property | Value |
|---|---|
| Format | ATLAS Binary (.atlas), format_version=2 |
| Quantization | TQ1.0: Ternary Weight Packing (Base-3, ~1.58 bits/weight) |
| Target | Native CPU: Intel AVX2 (Haswell 2013+), no GPU needed |
| File Size | 2.96 GB |
| Inference Speed | 3.2 tok/s (int4 FFN) |
| Description | 28 layers, 3072 hidden, 23040 intermediate (TII Base variant) |
Architecture
| Component | Detail |
|---|---|
| Base Model | tiiuae/Falcon3-7B-Base |
| Architecture | falcon3 |
| Layers | 28 |
| Hidden Size | 3072 |
| Intermediate Size | 23040 |
| Attention Heads | 12 (GQA, 4 KV heads) |
| Head Dim | 256 |
| RoPE Theta | 1000042.0 |
| Vocabulary | 131080 |
| Context Window | 4096 (NTK-scalable up to 8192) |
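As a rough cross-check of the table above, the weight count and KV cache footprint can be estimated from the listed dimensions. This is a back-of-the-envelope sketch, not the packer's own accounting: it assumes a gated FFN with three projection matrices, no biases, untied input and output embeddings, and 2-byte KV cache elements, none of which are stated in this card.

```python
# Back-of-the-envelope sizing from the Architecture table.
# Assumptions (not stated in the model card): gated FFN with three
# projection matrices, no biases, norm weights ignored, untied
# input/output embeddings, 2-byte KV cache elements.
LAYERS, HIDDEN, INTERMEDIATE = 28, 3072, 23040
Q_HEADS, KV_HEADS, HEAD_DIM = 12, 4, 256
VOCAB, CONTEXT = 131080, 4096


def attention_params() -> int:
    """Weights in the Q, K, V and O projections of one layer (GQA)."""
    q_and_o = 2 * HIDDEN * (Q_HEADS * HEAD_DIM)
    k_and_v = 2 * HIDDEN * (KV_HEADS * HEAD_DIM)
    return q_and_o + k_and_v


def ffn_params() -> int:
    """Weights in the gate, up and down projections of one layer."""
    return 3 * HIDDEN * INTERMEDIATE


def total_params() -> int:
    """All linear weights plus input and output embedding matrices."""
    per_layer = attention_params() + ffn_params()
    embeddings = 2 * VOCAB * HIDDEN
    return LAYERS * per_layer + embeddings


def kv_cache_bytes(tokens: int, bytes_per_element: int = 2) -> int:
    """Keys and values for every layer and KV head over `tokens` positions."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * tokens * bytes_per_element


if __name__ == "__main__":
    print(f"parameters: {total_params() / 1e9:.2f} B")
    print(f"KV cache at {CONTEXT} tokens: {kv_cache_bytes(CONTEXT) / 2**20:.0f} MiB")
```

Under these assumptions the FFN matrices hold most of the weights, which is consistent with the engine offering a dedicated int4 FFN mode.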
Verification
During pre-release evaluation (v2.10.0), this quantized derivative generated correct, coherent output:
- T=0 (argmax): "The capital of France is Paris." (correct deterministic output)
- T=0.7 (sampling): Coherent structured generation with sensible continuation
Note on scale mathematics: the legacy dequantization path divides by the scale factor rather than multiplying. Since this is a constant across all logits for any given output row, the relative probability distribution remains identical under softmax normalization, with no effect on output quality.
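For readers unfamiliar with the two decoding settings used in the checks above, the following is a generic sketch of greedy (T=0) and temperature plus top-k sampling. It is not taken from the ATLAS source; the function name `sample_token` and its parameters are hypothetical and chosen only to mirror the `temperature` and `top_k` arguments shown under Usage.

```python
import math
import random


def sample_token(logits, temperature=0.7, top_k=40, rng=random):
    """Pick a token index from raw logits (illustrative, not ATLAS code).

    temperature == 0 is treated as greedy argmax decoding.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Keep only the top_k highest-scoring candidates.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    candidates = order[:top_k]
    # Softmax weights over the kept logits, scaled by temperature.
    # Subtracting the peak keeps exp() numerically stable.
    peak = logits[candidates[0]]
    weights = [math.exp((logits[i] - peak) / temperature) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [0.1, 2.5, -1.0, 1.7]
    print("greedy:", sample_token(toy_logits, temperature=0))
    print("sampled:", sample_token(toy_logits, rng=random.Random(0)))
```

With `temperature=0` the result is deterministic, which is why the T=0 check can be compared against a fixed expected string.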
Prompt Format
This is a Base model: it generates raw text continuation without instruction-following. Simply provide your prompt:
{prompt}
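Because a base model only continues text, it often helps to phrase the prompt so that the desired answer is the natural next line. The helper below is a minimal sketch of one way to do that with a few worked examples; the `Q:`/`A:` layout and the name `build_completion_prompt` are illustrative assumptions, not a format required by the model or the engine.

```python
def build_completion_prompt(examples, query):
    """Format Q/A pairs so the answer is the natural continuation."""
    blocks = [f"Q: {q}\nA: {a}\n" for q, a in examples]
    # Leave the final answer open for the model to complete.
    blocks.append(f"Q: {query}\nA:")
    return "\n".join(blocks)


prompt = build_completion_prompt(
    [("What is the capital of Italy?", "Rome.")],
    "What is the capital of France?",
)
print(prompt)
```

The resulting string can be passed as the prompt in any of the usage paths below.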
Usage
Python
git clone https://github.com/xxxn3m3s1sxxx/ATLAS-TQ1_0.git
from atlas_infer import AtlasModel
model = AtlasModel("Falcon3-7B-Base-tq1.atlas")
output = model.generate_c(
"What is the capital of France?",
max_new_tokens=100,
temperature=0.7,
top_k=40,
)
print(output)
C++ CLI (standalone, no Python required)
atlas --model Falcon3-7B-Base-tq1.atlas --prompt "What is the capital of France?" --max-tokens 100
SSE Web Server
python atlas_server.py --model Falcon3-7B-Base-tq1.atlas --port 8080
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "What is the capital of France?", "max_tokens": 100}'
What is ATLAS?
ATLAS is a CPU inference engine for BitNet b1.58 ternary-quantized models. It repacks HuggingFace safetensors into the TQ1.0 format (5 trits per byte, Base-3 encoding, ~1.58 bits/weight) and runs fast inference via a C++ DLL and a Python wrapper.
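To make the "5 trits per byte" idea concrete: five ternary digits have 3^5 = 243 combinations, which fits in one byte (256 values), so storage costs 8/5 = 1.6 bits per weight, close to the log2(3) ≈ 1.58 bit limit. The sketch below shows one possible base-3 packing. The digit order and the mapping of -1/0/+1 to 0/1/2 are assumptions made for illustration; the actual .atlas byte layout may differ.

```python
def pack_trits(trits):
    """Pack ternary weights (-1, 0, +1) five to a byte, base-3."""
    if len(trits) % 5:
        raise ValueError("length must be a multiple of 5")
    packed = bytearray()
    for start in range(0, len(trits), 5):
        value = 0
        # Most significant digit first; map -1/0/+1 to digits 0/1/2.
        for t in trits[start:start + 5]:
            value = value * 3 + (t + 1)
        packed.append(value)  # at most 3**5 - 1 = 242, fits in one byte
    return bytes(packed)


def unpack_trits(packed):
    """Inverse of pack_trits: recover five weights from each byte."""
    trits = []
    for byte in packed:
        digits = []
        for _ in range(5):
            byte, digit = divmod(byte, 3)
            digits.append(digit - 1)
        # divmod yields the least significant digit first, so reverse.
        trits.extend(reversed(digits))
    return trits


if __name__ == "__main__":
    weights = [-1, 0, 1, 1, -1, 0, 0, 0, 0, 0]
    blob = pack_trits(weights)
    assert unpack_trits(blob) == weights
    print(len(weights), "weights ->", len(blob), "bytes")
```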
| Feature | Description |
|---|---|
| No GPU required | Runs on any x86-64 CPU with AVX2 (Intel Haswell 2013+, AMD Excavator 2015+) |
| Hybrid matmul | FFN tensors in int8, QKV/O in TQ1-packed, per-tensor dispatch |
| int4 FFN mode | Halves FFN memory bandwidth for 18-26% speedup (7B/10B) |
| f32 bypass | Auto-enabled for small models (≤1B) and SubLN architectures |
| Ring buffer KV cache | Extended context via NTK-aware RoPE scaling |
| Standalone C++ CLI | No Python or PyTorch required at runtime |
| SSE web server | FastAPI-based /v1/chat/completions with prompt caching |
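The table mentions NTK-aware RoPE scaling for extended context. The card does not state which formula the engine uses, so the sketch below shows the commonly used NTK-aware base adjustment, theta' = theta * s^(d / (d - 2)), applied to this model's values (theta = 1000042.0, head dim 256, scale 2 for 4096 to 8192 tokens). Function names are hypothetical.

```python
def ntk_scaled_theta(theta: float, head_dim: int, scale: float) -> float:
    """Common NTK-aware adjustment of the RoPE base (assumed, not from ATLAS)."""
    return theta * scale ** (head_dim / (head_dim - 2))


def rope_inv_freq(theta: float, head_dim: int):
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


if __name__ == "__main__":
    theta, head_dim, scale = 1000042.0, 256, 2.0  # scale = 8192 / 4096
    base = rope_inv_freq(theta, head_dim)
    scaled = rope_inv_freq(ntk_scaled_theta(theta, head_dim, scale), head_dim)
    # The fastest-rotating pair is unchanged; the slowest is stretched by `scale`.
    print("highest frequency ratio:", scaled[0] / base[0])
    print("lowest frequency ratio:", scaled[-1] / base[-1])
```

The design intent of this adjustment is that short-range (high-frequency) position information is left intact while the lowest frequency is stretched by the full scale factor, which is what lets the longer window reuse the trained rotation range.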
Links
- Engine source code: github.com/xxxn3m3s1sxxx/ATLAS-TQ1_0
- Original model: tiiuae/Falcon3-7B-Base
License & Usage Restrictions
This is a quantized derivative work based on the Falcon3 series (original model by the Technology Innovation Institute (TII)), originally released under the Falcon-LLM License.
By downloading or utilizing this file, you agree to be bound by the Falcon-LLM License:
- Attribution: Any usage or secondary deployment must credit the Technology Innovation Institute (TII).
- Non-Commercial & Small Commercial Use: Free for academic research, personal projects, and commercial entities with annual revenue under $1,000,000 USD.
- Commercial Hosting: Entities intending to provide shared, managed hosting of the model or its derivatives as a service must enter into a separate license arrangement with TII.
Disclaimer: This quantized file is provided "as-is". The ATLAS engine itself is Apache 2.0 licensed.