---
license: apache-2.0
language:
- en
library_name: transformers
base_model:
- arcee-ai/AFM-4.5B-Base
tags:
- kda
- kimi-delta-attention
- linear-attention
- distillation
- research
---
# AFM-4.5B-Base-KDA-Only
A research variant of [AFM-4.5B-Base](https://huggingface.co/arcee-ai/AFM-4.5B-Base) where all attention layers have been replaced with Kimi Delta Attention (KDA) through knowledge distillation. This model contains **no full-attention layers**.
> ⚠️ **Research Model**: This is an experimental model released for research purposes. For production use, see [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B).
More details are available in our blog post: https://www.arcee.ai/blog/distilling-kimi-delta-attention-into-afm-4-5b-and-the-tool-we-used-to-do-it
## Overview
This model explores whether full attention can be completely replaced with linear attention mechanisms. Using [DistillKit](https://github.com/arcee-ai/DistillKit), we distilled the original AFM-4.5B-Base (teacher) into a pure-KDA architecture (student).
**Key characteristics:**
- All 24 layers use KDA instead of full attention
- Trained up to 32k sequence length
- Linear memory scaling with sequence length
- Smoother long-context degradation compared to hybrid architectures
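
To make the scaling bullet concrete, here is a minimal single-head sketch of a gated delta-rule recurrence of the kind KDA belongs to. It is an illustration only, not the kernel this model runs: the function name `kda_step`, the tiny dimensions, and the random inputs are all assumptions made for the example. The point it shows is that the state `S` keeps the same shape no matter how many tokens are processed, so the work per token is constant and total cost grows linearly with sequence length.

```python
import random


def kda_step(S, q, k, v, alpha, beta):
    """One step of a simplified gated delta rule (hypothetical helper).

    S: d_k x d_v state as a list of rows; q, k, alpha: length d_k;
    v: length d_v; beta: scalar write strength.
    """
    d_k, d_v = len(S), len(S[0])
    # 1) Per-channel decay of the old state: Diag(alpha) @ S
    S = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) What the decayed state currently returns for key k: k^T S
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3) Delta-rule correction toward the new value: S += beta * k (v - pred)^T
    S = [[S[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
         for i in range(d_k)]
    # 4) Read out with the query: o = S^T q
    o = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return S, o


rng = random.Random(0)
d_k, d_v = 4, 3  # toy sizes, not the model's real head dimensions
state = [[0.0] * d_v for _ in range(d_k)]
for _ in range(1000):
    k = [rng.gauss(0, 1) for _ in range(d_k)]
    norm = sum(x * x for x in k) ** 0.5
    k = [x / norm for x in k]  # unit-norm key
    q = [rng.gauss(0, 1) for _ in range(d_k)]
    v = [rng.gauss(0, 1) for _ in range(d_v)]
    alpha = [rng.uniform(0.9, 1.0) for _ in range(d_k)]  # decay gates
    beta = rng.uniform(0.0, 1.0)
    state, out = kda_step(state, q, k, v, alpha, beta)

# The state is still d_k x d_v after 1000 tokens.
print(len(state), len(state[0]))
```

Full attention, by contrast, keeps keys and values for every past token, so its cache grows with context length.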
## Architecture
| Component | Details |
|-----------|---------|
| Parameters | 4.5B |
| Attention Type | Kimi Delta Attention (All layers) |
| Positional Encoding | None (position is handled implicitly by KDA) |
| Max Training Length | 32k tokens |
| Base Model | AFM-4.5B-Base |
## Benchmark Results
Performance compared to the teacher model (hybrid results are not tabulated here):
| Benchmark | Teacher (Full Attn) | KDA-Only |
|-----------|:-------------------:|:--------:|
| MMLU (Avg) | 63.1% | 55.8% |
| ARC-Challenge | 55.6% | 49.9% |
| HellaSwag (Norm) | 78.0% | 74.3% |
| GSM8K (Math) | 52.1% | 26.8% |
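
The size of each gap is easier to compare when expressed relative to the teacher. The short script below does that arithmetic on the numbers from the table; the `compare` helper and the output layout are just for illustration.

```python
# (teacher, KDA-Only) scores in percent, copied from the table above
scores = {
    "MMLU (Avg)": (63.1, 55.8),
    "ARC-Challenge": (55.6, 49.9),
    "HellaSwag (Norm)": (78.0, 74.3),
    "GSM8K (Math)": (52.1, 26.8),
}


def compare(teacher, student):
    """Return the absolute drop in points and the fraction of teacher score kept."""
    return {
        "drop_points": round(teacher - student, 1),
        "retained": round(student / teacher, 3),
    }


summary = {name: compare(t, s) for name, (t, s) in scores.items()}
for name, row in summary.items():
    print(f"{name:<18} -{row['drop_points']:.1f} pts  "
          f"({row['retained']:.1%} of teacher)")
```

On these numbers the student keeps roughly 88% to 95% of the teacher's score on MMLU, ARC-Challenge, and HellaSwag, and about half on GSM8K.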
### Key Findings
- **Knowledge benchmarks**: KDA-Only performs within statistical range of hybrid approaches on MMLU, ARC, and HellaSwag
- **Math performance**: Larger drop on GSM8K compared to hybrid, though this may recover with longer training
- **Long-context behavior**: Degrades more smoothly than hybrid models beyond training length—no cliff at 32k, just gradual falloff
## Long-Context Performance (NIAH)
The pure-KDA model shows interesting long-context characteristics:
- 100% single-needle retrieval up to 65k (beyond training length!)
- Multi-key retrieval starts to degrade at 4k, but does so smoothly
- No sharp "cliff" like hybrid models exhibit past 32k
This behavior aligns with expectations for state-space-like architectures: fixed hidden state size creates inherent tension with growing context, but degradation is graceful.
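
For readers who want to probe this themselves, a single-needle prompt can be assembled along the following lines. This is a sketch, not the harness used for the results above: `build_niah_prompt`, the filler sentence, and the needle wording are all hypothetical. Because the model is not instruction-tuned, the question is phrased as a sentence to complete.

```python
def build_niah_prompt(needle, question, filler, n_filler, depth):
    """Insert `needle` among `n_filler` copies of `filler`.

    depth is the relative position of the needle:
    0.0 puts it first, 1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * n_filler
    sentences.insert(round(depth * n_filler), needle)
    haystack = " ".join(sentences)
    # End with an open sentence so a base model can complete it.
    return f"{haystack}\n\n{question}"


prompt = build_niah_prompt(
    needle="The secret number is 7421.",
    question="The secret number is",
    filler="The grass is green and the sky is blue.",
    n_filler=200,
    depth=0.5,
)
print(len(prompt.split()), "whitespace-separated words")
```

The prompt can then be tokenized and passed to `model.generate` as in the Usage section, scoring a hit when the continuation contains the needle value. Scale `n_filler` until the tokenized prompt reaches the context length being tested.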
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arcee-ai/AFM-4.5B-Base-KDA-Only"

# Load the tokenizer and the model weights in bfloat16.
# device_map="auto" requires the accelerate package.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# This is a base model, so prompt it with text to continue rather than a chat turn.
prompt = "The theory of relativity states that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Sample up to 100 new tokens.
outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details
- **Method**: Knowledge distillation from AFM-4.5B-Base using [DistillKit](https://github.com/arcee-ai/DistillKit)
- **Teacher**: AFM-4.5B-Base (full attention)
- **Student Architecture**: All layers converted to KDA
- **Training Length**: 32k sequence length
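
This card does not state which loss DistillKit was configured with for this run. As a generic illustration of logit-level knowledge distillation, the sketch below computes the KL divergence from the teacher's next-token distribution to the student's for a single position. The function names and the `temperature` parameter are assumptions for the example, not DistillKit's API.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum for x in scaled]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position (hypothetical helper)."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, -0.5]
print(kd_loss(teacher, student))   # positive: the distributions differ
print(kd_loss(teacher, teacher))   # zero: the student matches the teacher
```

In training, a loss of this shape would be averaged over all positions and batches, and it reaches zero only when the student reproduces the teacher's distribution.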
## Intended Use
This model is intended for:
- Research into linear attention mechanisms
- Studying attention distillation techniques
- Exploring pure state-space-like architectures for language modeling
- Benchmarking KDA vs full attention tradeoffs
## Limitations
- Lower math/reasoning performance compared to full attention
- Not instruction-tuned
- Research checkpoint—not optimized for production
## License
AFM-4.5B-Base-KDA-Only is released under the Apache-2.0 license.