--- license: apache-2.0 base_model: aquif-ai/aquif-3.6-1B tags: - text-generation-inference - reasoning - thinking - hybrid - efficient - dynamic - transformers - aquif - math - coding - small - aquif-3.5 - aquif-3.6 - llm - llama-cpp - gguf-my-repo language: - en - de - it - pt - fr - hi - es - th - zh - ja library_name: transformers pipeline_tag: text-generation --- # aquif-3.6-1B ## Summary **aquif-3.6-1B** is a hybrid reasoning model that automatically determines when and how deeply to think based on query complexity. Built on aquif-3.5-Nano-1B with AutoThink RL data, it achieves 28% better token efficiency and 4% performance improvement across benchmarks. ### Contents - [Key Features](#key-features) - Dynamic reasoning, efficiency gains, and smart resource allocation - [Performance](#performance) - Benchmark results showing 4% average improvement - [Token Efficiency](#token-efficiency) - 28% reduction in token usage - [Thinking Ratio](#thinking-ratio) - 12% reduction in thinking frequency - [Benchmark Highlights](#benchmark-highlights) - Detailed results for AIME, LiveCodeBench, and GPQA Diamond - [Model Details](#model-details) - Architecture and specifications - [Usage](#usage) - Code examples for implementation - [Previous Versions](#previous-versions) - Links to earlier models **Automatic Thinking** aquif-3.6-1B is a hybrid reasoning model that dynamically decides if and how much to think based on query complexity. Inspired by aquif-3.6-8B's approach of automatic thinking using AutoThink RL data on top of aquif-3.5-Nano-1B, the model uses the following format: ``` [analyzes whether to think or not] [thinking content] ``` This is the same format as aquif-3.6-8B. Unlike something like aquif-3.5-Plus's toggleable reasoning that requires manual control (thinking_on/off), aquif-3.6's judge autonomously allocates reasoning depth - intelligently adapting its cognitive effort to each task automatically. ## Key Features - 🧠 **Dynamic Reasoning**: Automatically determines when and how deeply to think - ⚡ **28% More Efficient**: Significant token reduction while improving performance - 📈 **Better Performance**: 4% average improvement across benchmarks - 🎯 **Smart Resource Allocation**: 12% reduction in thinking ratio on average ## Performance Benchmark | aquif-3.6-1B | Qwen3-1.7B | Improvement | |-----------|--------------|--------------|-------------| | AIME 2025 | 75.0 | 39.4 | +35.6% | | LiveCodeBench | 57.5 | 33.2 | +24.3% | | GPQA Diamond | 52.8 | 40.1 | +12.7% | | **Average** | **61.8** | **37.6** | **+24.2%** | ## Token Efficiency | Benchmark | aquif-3.6-1B | Qwen3-1.7B | Reduction | |-----------|--------------|--------------|-----------| | AIME 2025 | 13,670 | 18,450 | -26% | | LiveCodeBench | 10,270 | 13,890 | -26% | | GPQA Diamond | 6,870 | 12,100 | -43% | | **Average** | **10,270** | **14,813** | **-32%** | ## Thinking Ratio | Benchmark | aquif-3.6-1B | Qwen3-1.7B | Reduction | |-----------|--------------|--------------|-----------| | AIME 2025 | 84.0% | 100.0% | -16% | | LiveCodeBench | 78.0% | 100.0% | -22% | | GPQA Diamond | 81.0% | 100.0% | -19% | | **Average** | **81.0%** | **100.0%** | **-19%** | ## Benchmark Highlights - **AIME 2025**: 26% fewer tokens, +35.6% performance, -16% thinking ratio - **LiveCodeBench**: 26% fewer tokens, +24.3% performance, -22% thinking ratio - **GPQA Diamond**: 43% fewer tokens, +12.7% performance, -19% thinking ratio ## Model Details - **Base Model**: 1.7B parameters - **Architecture**: Hybrid reasoning with dynamic thinking allocation - **Context Length**: 40K tokens - **License**: Apache 2.0 ## Usage ## Use with llama.cpp Install llama.cpp through brew (works on Mac and Linux) ```bash brew install llama.cpp ``` Invoke the llama.cpp server or the CLI. ### CLI: ```bash llama-cli --hf-repo Edge-Quant/aquif-3.6-1B-Q4_K_M-GGUF --hf-file aquif-3.6-1b-q4_k_m.gguf -p "The meaning to life and the universe is" ``` ### Server: ```bash llama-server --hf-repo Edge-Quant/aquif-3.6-1B-Q4_K_M-GGUF --hf-file aquif-3.6-1b-q4_k_m.gguf -c 2048 ``` Note: You can also use this checkpoint directly through the [usage steps](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) listed in the Llama.cpp repo as well. Step 1: Clone llama.cpp from GitHub. ``` git clone https://github.com/ggerganov/llama.cpp ``` Step 2: Move into the llama.cpp folder and build it with `LLAMA_CURL=1` flag along with other hardware-specific flags (for ex: LLAMA_CUDA=1 for Nvidia GPUs on Linux). ``` cd llama.cpp && LLAMA_CURL=1 make ``` Step 3: Run inference through the main binary. ``` ./llama-cli --hf-repo Edge-Quant/aquif-3.6-1B-Q4_K_M-GGUF --hf-file aquif-3.6-1b-q4_k_m.gguf -p "The meaning to life and the universe is" ``` or ``` ./llama-server --hf-repo Edge-Quant/aquif-3.6-1B-Q4_K_M-GGUF --hf-file aquif-3.6-1b-q4_k_m.gguf -c 2048 ```