rp440/Qwen3-8b-DSA-index

Text Generation

sparse-attention


rp440 committed on Mar 27

Commit bb8a08c · verified · 1 Parent(s): 2d70b4c

Upload folder using huggingface_hub

Files changed (1)

README.md +2 -1

README.md CHANGED

@@ -20,7 +20,7 @@ language:
 # Qwen3-8B All-Sparse Indexer
-> **Experimental research artifact** — a post-training Dynamic Sparse Attention (DSA) indexer trained at 2K context length. This repository is intended as an exploratory learned sparse-attention index, not a finished production method. The inference code is written in MLX and has not been optimized for speed.
 A lightweight **sparse-attention indexer** trained to approximate dense attention behavior in [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B). Conceptually, this is a **DeepSeek-style learned index** in the sense that a small auxiliary network predicts which key-value positions are worth keeping for attention. This is an independent research artifact and is not affiliated with DeepSeek. Early results suggest the approach can work in some settings, but more research is needed.
@@ -82,6 +82,7 @@ diverse natural-language prose. The model was then asked to retrieve it. These a
 | --------------------------- | ------ | ------------------- |
 | GSM8K accuracy (4-shot)     | 95%    | 92%                 |
 | PPL on C4 (seq_len=2048)    | 13.526 | 13.533 (+0.058%)    |
 ## Training Details

 # Qwen3-8B All-Sparse Indexer
+> **Experimental research artifact** — a trained Dynamic Sparse Attention (DSA) indexer trained at 2K context length. This repository is intended as an exploratory learned sparse-attention index, not a finished production method. The inference code is written in MLX.
 A lightweight **sparse-attention indexer** trained to approximate dense attention behavior in [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B). Conceptually, this is a **DeepSeek-style learned index** in the sense that a small auxiliary network predicts which key-value positions are worth keeping for attention. This is an independent research artifact and is not affiliated with DeepSeek. Early results suggest the approach can work in some settings, but more research is needed.
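The README describes the indexer as a small auxiliary network that predicts which key-value positions are worth keeping. As a rough illustration of that idea only, the sketch below keeps the top-k positions by index score for a single query and runs softmax attention over just those positions. This is not the repository's MLX code: the function names, the plain-list data layout, and the choice of a fixed `k` are all assumptions made for the example, and the index scores are taken as given rather than produced by a trained network.

```python
import math


def topk_indices(index_scores, k):
    """Return the positions of the k largest scores, sorted by position (hypothetical helper)."""
    k = min(k, len(index_scores))
    ranked = sorted(range(len(index_scores)), key=lambda i: index_scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, keys, values, index_scores, k):
    """Attend from one query to only the k key positions the indexer scores highest.

    Assumption: index_scores would come from a learned indexer; here they are inputs.
    """
    keep = topk_indices(index_scores, k)
    scale = 1.0 / math.sqrt(len(query))
    # Scaled dot-product logits, computed only for the kept positions.
    logits = [scale * sum(q * x for q, x in zip(query, keys[i])) for i in keep]
    # Numerically stable softmax over the kept positions.
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, keep):
        for d in range(len(out)):
            out[d] += (w / total) * values[i][d]
    return keep, out


if __name__ == "__main__":
    query = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    scores = [0.1, 0.9, 0.3, 0.8]  # made-up indexer scores for illustration
    print(sparse_attention(query, keys, values, scores, k=2))
```

With `k` equal to the number of keys, the function reduces to ordinary dense attention for that query, which is a convenient sanity check when comparing a sparse path against a dense one.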
 | --------------------------- | ------ | ------------------- |
 | GSM8K accuracy (4-shot)     | 95%    | 92%                 |
 | PPL on C4 (seq_len=2048)    | 13.526 | 13.533 (+0.058%)    |
+| PPL on C4 (seq_len=8192)    | 15.628 | 15.653 (+0.16%)     |
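The percentage deltas in the perplexity rows can be sanity-checked with a short calculation. The sketch below assumes the first numeric column is the dense baseline and the second is the sparse indexer (the table header is not part of this hunk), and `relative_change_pct` is a hypothetical helper name.

```python
def relative_change_pct(dense_ppl, sparse_ppl):
    """Relative perplexity change of the sparse run versus the dense run, in percent."""
    return 100.0 * (sparse_ppl - dense_ppl) / dense_ppl


# Values copied from the table as displayed (rounded to three decimals).
rows = {
    "C4, seq_len=2048": (13.526, 13.533),
    "C4, seq_len=8192": (15.628, 15.653),
}

for name, (dense, sparse) in rows.items():
    print(f"{name}: {relative_change_pct(dense, sparse):+.3f}%")
```

From the rounded table values this gives roughly +0.052% and +0.160%. The second matches the reported +0.16%. The first is slightly below the reported +0.058%, which is consistent with the README's figure having been computed from unrounded perplexities, though the source does not say so.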
 ## Training Details