	---
	license: apache-2.0
	language:
	- en
	library_name: transformers
	base_model:
	- arcee-ai/AFM-4.5B-Base
	tags:
	- kda
	- kimi-delta-attention
	- linear-attention
	- distillation
	- research
	---

	# AFM-4.5B-Base-KDA-Only

	A research variant of [AFM-4.5B-Base](https://huggingface.co/arcee-ai/AFM-4.5B-Base) where all attention layers have been replaced with Kimi Delta Attention (KDA) through knowledge distillation. This model contains no full-attention layers.

	> ⚠️ Research Model: This is an experimental model released for research purposes. For production use, see [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B).

	More details are available in our blog post: https://www.arcee.ai/blog/distilling-kimi-delta-attention-into-afm-4-5b-and-the-tool-we-used-to-do-it

	## Overview

	This model explores whether full attention can be completely replaced with linear attention mechanisms. Using [DistillKit](https://github.com/arcee-ai/DistillKit), we distilled the original AFM-4.5B-Base (teacher) into a pure-KDA architecture (student).

	Key characteristics:
	- All 24 layers use KDA instead of full attention
	- Trained up to 32k sequence length
	- Linear memory scaling with sequence length
	- Smoother long-context degradation compared to hybrid architectures

	## Architecture

	| Component | Details |
	|-----------|---------|
	| Parameters | 4.5B |
	| Attention Type | Kimi Delta Attention (all layers) |
	| Positional Encoding | None (inherent to KDA) |
	| Max Training Length | 32k tokens |
	| Base Model | AFM-4.5B-Base |

	## Benchmark Results

	Performance compared to the teacher model (the hybrid configurations discussed under Key Findings are not listed in this table):

	| Benchmark | Teacher (Full Attn) | KDA-Only |
	|-----------|:-------------------:|:--------:|
	| MMLU (Avg) | 63.1% | 55.8% |
	| ARC-Challenge | 55.6% | 49.9% |
	| HellaSwag (Norm) | 78.0% | 74.3% |
	| GSM8K (Math) | 52.1% | 26.8% |
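
To put the table in perspective, the gap to the teacher can be computed directly from the reported scores. The snippet below is only arithmetic on the four rows above; the dictionary and function names are illustrative and not part of any released tooling.

```python
# Scores copied from the benchmark table above, in percent: (teacher, KDA-only).
SCORES = {
    "MMLU (Avg)": (63.1, 55.8),
    "ARC-Challenge": (55.6, 49.9),
    "HellaSwag (Norm)": (78.0, 74.3),
    "GSM8K (Math)": (52.1, 26.8),
}


def gap_to_teacher(scores):
    """Return {benchmark: (absolute drop in points, fraction of teacher score retained)}."""
    return {
        name: (round(teacher - student, 1), round(student / teacher, 3))
        for name, (teacher, student) in scores.items()
    }


for name, (drop, retained) in gap_to_teacher(SCORES).items():
    print(f"{name:<18} -{drop:>4.1f} pts   {retained:.1%} of teacher")
```

On these numbers the student keeps roughly 88 to 95 percent of the teacher's score on MMLU, ARC-Challenge, and HellaSwag, but only about half on GSM8K.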

	### Key Findings

	- Knowledge benchmarks: KDA-Only performs within statistical range of hybrid approaches on MMLU, ARC, and HellaSwag
	- Math performance: Larger drop on GSM8K than with hybrid configurations, though this may recover with longer training
	- Long-context behavior: Degrades more smoothly than hybrid models beyond the training length: no cliff at 32k, just a gradual falloff

	## Long-Context Performance (NIAH)

	On needle-in-a-haystack (NIAH) retrieval, the pure-KDA model behaves as follows:

	- 100% single-needle retrieval up to 65k tokens, roughly twice the 32k training length
	- Multikey retrieval starts degrading at 4k, but the decline is smooth
	- No sharp "cliff" of the kind hybrid models exhibit past 32k
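
For readers unfamiliar with the setup, a single-needle test hides one fact at a chosen depth in filler text and asks the model to retrieve it. The sketch below is a hypothetical illustration of that idea, not the evaluation harness behind the numbers above: `build_niah_prompt`, `retrieved`, the filler sentence, and the needle wording are all made up for the example, and length is counted in words rather than tokens.

```python
def build_niah_prompt(needle, question, n_filler_words, depth=0.5,
                      filler="The grass is green and the sky is blue."):
    """Hide `needle` at a relative `depth` (0.0 = start, 1.0 = end) of filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    filler_words = filler.split()
    # Repeat the filler sentence until the haystack has the requested word count.
    haystack = [filler_words[i % len(filler_words)] for i in range(n_filler_words)]
    cut = int(len(haystack) * depth)
    parts = haystack[:cut] + [needle] + haystack[cut:]
    return " ".join(parts) + "\n\n" + question


def retrieved(model_output, expected):
    """Score one trial: does the generated text contain the expected answer?"""
    return expected.lower() in model_output.lower()


prompt = build_niah_prompt(
    needle="The secret passcode is 7421.",
    question="What is the secret passcode?",
    n_filler_words=2000,
    depth=0.25,
)
```

Sweeping `n_filler_words` and `depth`, and passing each prompt to `generate` as in the Usage section, would give a retrieval grid. Multikey variants typically insert several similar needles and ask for only one of them.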

	This behavior aligns with expectations for state-space-like architectures: fixed hidden state size creates inherent tension with growing context, but degradation is graceful.
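
The fixed-state tension can be made concrete with a toy recurrence. The sketch below is a simplified, single-head gated delta-rule update in plain Python. It is an assumption about the general shape of KDA-style layers, not the code this model runs: `delta_step`, `run_sequence`, the dimensions, and the constant decay and write strength are all illustrative. The point is only that the state stays `d_k x d_v` however many tokens are processed.

```python
import math
import random


def delta_step(state, q, k, v, alpha, beta):
    """Apply one gated delta-rule update and read out with query q.

    state: d_k x d_v matrix (list of rows); alpha: per-channel decay in [0, 1];
    beta: scalar write strength in [0, 1].
    """
    d_k, d_v = len(state), len(state[0])
    # Channel-wise decay: row i of the state is scaled by alpha[i].
    decayed = [[alpha[i] * state[i][j] for j in range(d_v)] for i in range(d_k)]
    # Value the decayed state currently associates with key k (k^T S).
    pred = [sum(k[i] * decayed[i][j] for i in range(d_k)) for j in range(d_v)]
    # Delta rule: correct that association toward v with step size beta.
    new_state = [
        [decayed[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
        for i in range(d_k)
    ]
    # Read-out: o = S^T q.
    out = [sum(q[i] * new_state[i][j] for i in range(d_k)) for j in range(d_v)]
    return new_state, out


def run_sequence(seq_len, d_k=4, d_v=3, seed=0):
    """Feed seq_len random tokens through the recurrence; return the final state."""
    rng = random.Random(seed)
    state = [[0.0] * d_v for _ in range(d_k)]
    for _ in range(seq_len):
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        norm = math.sqrt(sum(x * x for x in k))
        k = [x / norm for x in k]  # unit-norm key keeps the update stable
        v = [rng.gauss(0, 1) for _ in range(d_v)]
        state, _ = delta_step(state, k, k, v, alpha=[0.95] * d_k, beta=0.5)
    return state


short_state, long_state = run_sequence(16), run_sequence(4096)
# Same number of state entries after 16 tokens and after 4096 tokens.
assert (len(short_state), len(short_state[0])) == (len(long_state), len(long_state[0]))
```

Because every new token is written into the same matrix, older associations are partly decayed or overwritten, which is one plausible reading of the gradual multikey falloff described above.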

	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	model_id = "arcee-ai/AFM-4.5B-Base-KDA-Only"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.bfloat16,  # load weights in bfloat16
	    device_map="auto",  # place layers on the available devices
	)

	# Base (not instruction-tuned) model: give it text to continue.
	prompt = "The theory of relativity states that"
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

	outputs = model.generate(
	    **inputs,  # passes input_ids and attention_mask
	    max_new_tokens=100,
	    do_sample=True,
	    temperature=0.7,
	    top_p=0.95,
	)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	## Training Details

	- Method: Knowledge distillation from AFM-4.5B-Base using [DistillKit](https://github.com/arcee-ai/DistillKit)
	- Teacher: AFM-4.5B-Base (full attention)
	- Student Architecture: All layers converted to KDA
	- Training Length: 32k sequence length
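
As a rough picture of what one common form of knowledge distillation optimizes, the sketch below computes a temperature-scaled KL divergence between teacher and student next-token distributions at a single position. It is a generic textbook formulation in plain Python, not DistillKit's API and not necessarily the loss used for this checkpoint; the function names and the `temperature` parameter are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position, in nats."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


teacher = [2.0, 0.5, -1.0]
print(distill_kl(teacher, teacher))          # student matches the teacher
print(distill_kl(teacher, [0.0, 0.0, 0.0]))  # student predicts a uniform distribution
```

The loss is zero only when the student reproduces the teacher's distribution, so minimizing it over many positions pushes the KDA student toward the full-attention teacher's predictions.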

	## Intended Use

	This model is intended for:
	- Research into linear attention mechanisms
	- Studying attention distillation techniques
	- Exploring pure state-space-like architectures for language modeling
	- Benchmarking KDA vs full attention tradeoffs

	## Limitations

	- Lower math/reasoning performance than the full-attention teacher (GSM8K: 26.8% vs. 52.1%)
	- Not instruction-tuned
	- Research checkpoint, not optimized for production

	## License

	AFM-4.5B-Base-KDA-Only is released under the Apache-2.0 license.