---
license: apache-2.0
language:
- en
library_name: transformers
base_model:
- arcee-ai/AFM-4.5B-Base
tags:
- kda
- kimi-delta-attention
- linear-attention
- distillation
- research
---

# AFM-4.5B-Base-KDA-Only

A research variant of [AFM-4.5B-Base](https://huggingface.co/arcee-ai/AFM-4.5B-Base) in which all attention layers have been replaced with Kimi Delta Attention (KDA) through knowledge distillation. This model contains **no full-attention layers**.

> ⚠️ **Research Model**: This is an experimental model released for research purposes. For production use, see [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B).

More details are available in our blog post: https://www.arcee.ai/blog/distilling-kimi-delta-attention-into-afm-4-5b-and-the-tool-we-used-to-do-it

## Overview

This model explores whether full attention can be completely replaced with linear attention mechanisms. Using [DistillKit](https://github.com/arcee-ai/DistillKit), we distilled the original AFM-4.5B-Base (teacher) into a pure-KDA architecture (student).

**Key characteristics:**

- All 24 layers use KDA instead of full attention
- Trained on sequences of up to 32k tokens
- Linear memory scaling with sequence length
- Smoother long-context degradation than hybrid architectures
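
One way to read the memory-scaling point at inference time is to compare a full-attention KV cache, which grows with every token, against a linear-attention recurrent state, which does not. The sketch below is only back-of-the-envelope arithmetic: the layer count of 24 comes from this card, while the head counts, head dimension, and helper names are hypothetical placeholders and not the model's actual configuration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of a full-attention KV cache: keys and values for every token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def recurrent_state_bytes(n_layers, n_heads, key_dim, value_dim, bytes_per_elem=2):
    """Size of a linear-attention state: one key_dim x value_dim matrix per head."""
    return n_layers * n_heads * key_dim * value_dim * bytes_per_elem


# 24 layers comes from this card; every other dimension is a HYPOTHETICAL placeholder.
N_LAYERS = 24
HYP_HEADS, HYP_KV_HEADS, HYP_HEAD_DIM = 16, 8, 128

state = recurrent_state_bytes(N_LAYERS, HYP_HEADS, HYP_HEAD_DIM, HYP_HEAD_DIM)
for seq_len in (4_096, 32_768, 65_536):
    cache = kv_cache_bytes(seq_len, N_LAYERS, HYP_KV_HEADS, HYP_HEAD_DIM)
    print(
        f"{seq_len:>6} tokens: KV cache {cache / 2**20:8.1f} MiB, "
        f"recurrent state {state / 2**20:6.1f} MiB"
    )
```

The cache column grows in proportion to the sequence length, while the state column stays the same on every row. The absolute numbers depend entirely on the assumed dimensions.
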

## Architecture

| Component | Details |
|-----------|---------|
| Parameters | 4.5B |
| Attention Type | Kimi Delta Attention (all layers) |
| Positional Encoding | None (inherent to KDA) |
| Max Training Length | 32k tokens |
| Base Model | AFM-4.5B-Base |

## Benchmark Results

Performance compared to the teacher model (results for hybrid configurations are not included in this table):

| Benchmark | Teacher (Full Attn) | KDA-Only |
|-----------|:-------------------:|:--------:|
| MMLU (Avg) | 63.1% | 55.8% |
| ARC-Challenge | 55.6% | 49.9% |
| HellaSwag (Norm) | 78.0% | 74.3% |
| GSM8K (Math) | 52.1% | 26.8% |
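
To make the size of each gap explicit, the short script below recomputes the absolute drop and the fraction of the teacher's score that the student retains, using only the numbers in the table. The helper name `gap_report` is introduced here for illustration. On these figures the gap ranges from 3.7 points on HellaSwag to 25.3 points on GSM8K.

```python
# Scores copied from the table above: (teacher, KDA-Only), in percent.
scores = {
    "MMLU (Avg)": (63.1, 55.8),
    "ARC-Challenge": (55.6, 49.9),
    "HellaSwag (Norm)": (78.0, 74.3),
    "GSM8K (Math)": (52.1, 26.8),
}


def gap_report(scores):
    """Return {benchmark: (absolute drop in points, fraction of teacher score retained)}."""
    return {
        name: (round(teacher - student, 1), round(student / teacher, 3))
        for name, (teacher, student) in scores.items()
    }


report = gap_report(scores)
for name, (drop, retained) in report.items():
    print(f"{name:<18} drop: {drop:4.1f} pts, retained: {retained:.1%}")
```
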

### Key Findings

- **Knowledge benchmarks**: KDA-Only performs within the statistical range of hybrid approaches on MMLU, ARC, and HellaSwag
- **Math performance**: The drop on GSM8K is larger than for hybrid configurations, though this may recover with longer training
- **Long-context behavior**: Degrades more smoothly than hybrid models beyond the training length, with no cliff at 32k, just a gradual falloff

## Long-Context Performance (NIAH)

The pure-KDA model shows interesting long-context characteristics on needle-in-a-haystack (NIAH) tests:

- 100% single-needle retrieval up to 65k tokens, well beyond the 32k training length
- Multi-key retrieval starts to degrade at 4k tokens, but does so smoothly
- No sharp "cliff" like the one hybrid models exhibit past 32k

This behavior aligns with expectations for state-space-like architectures: a fixed-size hidden state is in inherent tension with a growing context, but the degradation is graceful.
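
The tension between a fixed-size state and a growing number of stored items can be seen in a toy associative memory. The sketch below uses a plain delta-rule update, without the gating that KDA adds, and is not the model's implementation; the function names and the tiny 2x1 state are illustrative assumptions. Two orthogonal keys fit into the state exactly, but writing a third key that overlaps with them corrupts what was stored before.

```python
def delta_rule_write(S, k, v, beta=1.0):
    """One delta-rule update: S <- S + beta * k (v - S^T k)^T."""
    d_k, d_v = len(S), len(S[0])
    # What the memory currently returns for key k.
    pred = [sum(S[i][j] * k[i] for i in range(d_k)) for j in range(d_v)]
    # Move the stored value for k toward v; the state shape never changes.
    return [
        [S[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
        for i in range(d_k)
    ]


def read(S, q):
    """Query the memory: o = S^T q."""
    return [sum(S[i][j] * q[i] for i in range(len(S))) for j in range(len(S[0]))]


# A 2x1 state can hold two orthogonal keys without interference.
k1, k2 = [1.0, 0.0], [0.0, 1.0]
S = [[0.0], [0.0]]
S = delta_rule_write(S, k1, [1.0])
S = delta_rule_write(S, k2, [2.0])
before = read(S, k1)

# A third unit-norm key must overlap with the first two.
k3 = [0.6, 0.8]
S = delta_rule_write(S, k3, [5.0])
after = read(S, k1)

print("value for k1 before third write:", before)
print("value for k1 after third write: ", after)
print("value for k3:", read(S, k3))
```

The most recent association is recalled correctly while the older one drifts, which is loosely analogous to multi-key retrieval degrading gradually as more items compete for the same state.
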

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arcee-ai/AFM-4.5B-Base-KDA-Only"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The KDA architecture is loaded from custom modeling code in the repository,
# so trust_remote_code=True is needed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "The theory of relativity states that"
# Tokenize and move the input IDs and attention mask to the model's device.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# This is a base model: it continues the prompt rather than following instructions.
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

- **Method**: Knowledge distillation from AFM-4.5B-Base using [DistillKit](https://github.com/arcee-ai/DistillKit)
- **Teacher**: AFM-4.5B-Base (full attention)
- **Student Architecture**: All layers converted to KDA
- **Training Length**: Sequences of up to 32k tokens
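
This card does not state which distillation objective was used, and DistillKit may combine several. As a generic illustration of logit-based knowledge distillation, the sketch below computes the textbook forward KL divergence between teacher and student next-token distributions for a single position. The function names, the temperature parameter, and the example logits are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Forward KL(teacher || student) for one token position."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up logits over a 3-token vocabulary.
teacher = [2.0, 0.5, -1.0]
student_close = [1.9, 0.6, -1.0]
student_far = [-1.0, 0.5, 2.0]

loss_close = kl_distillation_loss(teacher, student_close)
loss_far = kl_distillation_loss(teacher, student_far)
print(f"close student: {loss_close:.4f}, far student: {loss_far:.4f}")
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the signal a student is trained to minimize in this formulation.
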

## Intended Use

This model is intended for:

- Research into linear attention mechanisms
- Studying attention distillation techniques
- Exploring pure state-space-like architectures for language modeling
- Benchmarking KDA vs. full attention tradeoffs

## Limitations

- Lower math and reasoning performance than the full-attention teacher (for example, 26.8% vs. 52.1% on GSM8K)
- Not instruction-tuned
- Research checkpoint, not optimized for production

## License

AFM-4.5B-Base-KDA-Only is released under the Apache-2.0 license.