Falcon3-3B-Base-1.58bit-ATLAS (v2.10.0)

This repository contains a TQ1-quantized version of the official tiiuae/Falcon3-3B-Base model for the ATLAS Engine ecosystem, designed for native, low-latency CPU inference with no GPU required.

Packed using the unified pack_to_atlas.py toolchain (v2.10.0) with BF16 weight scale correction.


Engine Specifications

Property Value
Format ATLAS Binary (.atlas), format_version=2
Quantization TQ1.0 (ternary weight packing, base-3, ~1.58 bits/weight)
Target Native CPU with AVX2 (Intel Haswell 2013+), no GPU needed
File Size 2.11 GB
Inference Speed 7.1 tok/s (hybrid)
Description 22 layers, 3072 hidden, 9216 intermediate (TII Base variant)

Architecture

Component Detail
Base Model tiiuae/Falcon3-3B-Base
Architecture falcon3
Layers 22
Hidden Size 3072
Intermediate Size 9216
Attention Heads 12 (GQA, 4 KV heads)
Head Dim 256
RoPE Theta 1000042.0
Vocabulary 131072
Context Window 4096 (NTK-scalable up to 8192)
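
The "NTK-scalable" note can be illustrated with the commonly used NTK-aware RoPE adjustment, which enlarges the rotary base instead of compressing positions. This is a minimal sketch that assumes ATLAS follows the standard formula theta' = theta * s^(d / (d - 2)); the engine's actual scaling rule is not documented here, and the function names are hypothetical.

```python
import math

def ntk_scaled_theta(theta, head_dim, scale):
    """Standard NTK-aware base adjustment: theta * scale ** (d / (d - 2))."""
    return theta * scale ** (head_dim / (head_dim - 2))

def rope_inv_freq(theta, head_dim):
    """Rotary frequencies theta ** (-2i / d) for each of the d/2 dimension pairs."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

# Values taken from the Architecture table above.
THETA = 1000042.0
HEAD_DIM = 256
NATIVE_CTX = 4096
TARGET_CTX = 8192

scale = TARGET_CTX / NATIVE_CTX
scaled_theta = ntk_scaled_theta(THETA, HEAD_DIM, scale)

base_freqs = rope_inv_freq(THETA, HEAD_DIM)
scaled_freqs = rope_inv_freq(scaled_theta, HEAD_DIM)

print(f"scale factor: {scale}")
print(f"adjusted RoPE base: {scaled_theta:.1f}")
print(f"lowest frequency ratio: {scaled_freqs[-1] / base_freqs[-1]:.6f}")
```

With this choice of exponent, the highest-frequency pair is left untouched and the lowest-frequency pair is slowed by exactly the scale factor, which is why short-range position resolution is preserved while the usable range grows.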

Verification

In pre-release evaluation (v2.10.0), this quantized derivative produced correct, coherent output:

  • T=0 (argmax): "The capital of France is Paris." (correct, deterministic output)
  • T=0.7 (sampling): Coherent structured generation with sensible continuation

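Note on scale mathematics: the legacy dequantization path divides by the scale factor rather than multiplying. Because the factor is a single constant across all logits, and assuming it is positive, the ranking of the logits is unchanged, so greedy decoding (T=0) is unaffected. Under sampling, a uniform rescaling acts like a change in effective temperature rather than leaving the softmax distribution strictly identical; no effect on output quality was seen in the checks above.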
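
The effect of a constant logit rescaling can be checked numerically. The sketch below uses made-up logits and a hypothetical positive constant `c` (neither comes from ATLAS) to compare softmax under a uniform division and, for contrast, under a uniform additive shift.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.0, 1.0, 0.5]   # illustrative values only
c = 4.0                    # hypothetical positive scale constant

scaled = [x / c for x in logits]    # uniform division by a constant
shifted = [x + c for x in logits]   # uniform additive shift, for contrast

p = softmax(logits)
p_scaled = softmax(scaled)
p_shifted = softmax(shifted)

print("original:", [round(v, 4) for v in p])
print("divided :", [round(v, 4) for v in p_scaled])
print("shifted :", [round(v, 4) for v in p_shifted])
print("same argmax after division:", argmax(p) == argmax(p_scaled))
```

Division by a positive constant keeps the top token the same but flattens or sharpens the distribution, much as a temperature change would; only an additive shift leaves the probabilities untouched.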


Prompt Format

This is a Base model: it generates raw text continuation without instruction-following. Simply provide your prompt:

{prompt}

Usage

Python

# First, in a shell: git clone https://github.com/xxxn3m3s1sxxx/ATLAS-TQ1_0.git
from atlas_infer import AtlasModel

model = AtlasModel("Falcon3-3B-Base-tq1.atlas")
output = model.generate_c(
    "What is the capital of France?",
    max_new_tokens=100,
    temperature=0.7,
    top_k=40,
)
print(output)

C++ CLI (standalone, no Python required)

atlas --model Falcon3-3B-Base-tq1.atlas --prompt "What is the capital of France?" --max-tokens 100

SSE Web Server

python atlas_server.py --model Falcon3-3B-Base-tq1.atlas --port 8080
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?", "max_tokens": 100}'

What is ATLAS?

ATLAS is a CPU inference engine for BitNet b1.58 ternary-quantized models. It repacks Hugging Face safetensors into the TQ1.0 format (5 trits per byte, base-3 encoding; 1.6 bits/weight as stored, close to the log2(3) ≈ 1.58-bit ideal) and runs fast inference via a C++ DLL with a Python wrapper.
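
To make the "5 trits per byte" idea concrete, here is a minimal sketch of base-3 packing. It works because 3^5 = 243 fits in one byte, giving 8 / 5 = 1.6 bits per weight. The digit order, padding rule, and function names are assumptions for illustration; the real .atlas layout may differ.

```python
def pack_trits(trits):
    """Pack ternary weights (-1, 0, +1) into bytes, 5 trits per byte, base-3."""
    out = bytearray()
    for start in range(0, len(trits), 5):
        group = list(trits[start:start + 5])
        group += [0] * (5 - len(group))      # pad the final group with zero weights
        value = 0
        for t in reversed(group):            # first trit ends up as the lowest digit
            value = value * 3 + (t + 1)      # map {-1, 0, +1} to digits {0, 1, 2}
        out.append(value)                    # at most 3**5 - 1 = 242, fits in a byte
    return bytes(out)

def unpack_trits(data, count):
    """Recover the first `count` ternary weights from packed bytes."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)       # lowest digit first, back to {-1, 0, +1}
            byte //= 3
    return trits[:count]

weights = [1, -1, 0, 0, 1, -1, -1]
packed = pack_trits(weights)
restored = unpack_trits(packed, len(weights))

print("packed bytes:", list(packed))
print("round trip ok:", restored == weights)
print("bits per weight (full groups):", 8 / 5)
```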

Feature Description
No GPU required Runs on any x86-64 CPU with AVX2 (Intel Haswell 2013+, AMD Excavator 2015+)
Hybrid matmul FFN tensors in int8, QKV/O in TQ1-packed, per-tensor dispatch
int4 FFN mode Halves FFN memory bandwidth for 18-26% speedup (7B/10B)
f32 bypass Auto-enabled for small models (≤1B) and SubLN architectures
Ring buffer KV cache Extended context via NTK-aware RoPE scaling
Standalone C++ CLI No Python or PyTorch required at runtime
SSE web server FastAPI-based /v1/chat/completions with prompt caching
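
The ring buffer KV cache listed above can be pictured as a fixed-size array whose write position wraps around, so the newest entry overwrites the oldest once the cache is full. The class below is a generic sketch of that data structure, not ATLAS code; the class name and methods are hypothetical, and how the engine combines it with RoPE positions is not covered here.

```python
class RingKVCache:
    """Fixed-capacity cache: once full, each new entry overwrites the oldest one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.count = 0                      # total entries ever written

    def append(self, kv):
        """Store one key/value entry at the current write position."""
        self.slots[self.count % self.capacity] = kv
        self.count += 1

    def window(self):
        """Return the cached entries in oldest-to-newest order."""
        if self.count <= self.capacity:
            return self.slots[:self.count]
        start = self.count % self.capacity  # index of the oldest surviving entry
        return self.slots[start:] + self.slots[:start]

cache = RingKVCache(capacity=4)
for token_index in range(6):
    cache.append(token_index)               # stand-in for a (key, value) pair

print("cached window:", cache.window())
```

Memory use stays constant regardless of how many tokens are generated, at the cost of dropping the earliest context.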

License & Usage Restrictions

This is a quantized derivative work based on the Falcon3 series, whose original model was created by the Technology Innovation Institute (TII) and released under the Falcon-LLM License.

By downloading or utilizing this file, you agree to be bound by the Falcon-LLM License:

  1. Attribution: Any usage or secondary deployment must credit the Technology Innovation Institute (TII).
  2. Non-Commercial & Small Commercial Use: Free for academic research, personal projects, and commercial entities with annual revenue under $1,000,000 USD.
  3. Commercial Hosting: Entities intending to provide shared, managed hosting of the model or its derivatives as a service must enter into a separate license arrangement with TII.

Disclaimer: This quantized file is provided "as-is". The ATLAS engine itself is Apache 2.0 licensed.
