rp440 committed on
Commit bb8a08c · verified · 1 parent: 2d70b4c

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +2 -1
README.md CHANGED
@@ -20,7 +20,7 @@ language:
 
 # Qwen3-8B All-Sparse Indexer
 
-> **Experimental research artifact** — a post-training Dynamic Sparse Attention (DSA) indexer trained at 2K context length. This repository is intended as an exploratory learned sparse-attention index, not a finished production method. The inference code is written in MLX and has not been optimized for speed.
+> **Experimental research artifact** — a trained Dynamic Sparse Attention (DSA) indexer trained at 2K context length. This repository is intended as an exploratory learned sparse-attention index, not a finished production method. The inference code is written in MLX.
 
 A lightweight **sparse-attention indexer** trained to approximate dense attention behavior in [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B). Conceptually, this is a **DeepSeek-style learned index** in the sense that a small auxiliary network predicts which key-value positions are worth keeping for attention. This is an independent research artifact and is not affiliated with DeepSeek. Early results suggest the approach can work in some settings, but more research is needed.
 
@@ -82,6 +82,7 @@ diverse natural-language prose. The model was then asked to retrieve it. These a
 | --------------------------- | ------ | ------------------- |
 | GSM8K accuracy (4-shot) | 95% | 92% |
 | PPL on C4 (seq_len=2048) | 13.526 | 13.533 (+0.058%) |
+| PPL on C4 (seq_len=8192) | 15.628 | 15.653 (+0.16%) |
 
 
 ## Training Details
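
As a quick sanity check on the perplexity rows in this diff, the parenthesized percentages can be recomputed from the displayed values. The sketch below is illustrative only: the function name `relative_change_percent` is hypothetical, and it assumes the first numeric column is the dense baseline and the second is the sparse indexer (the table header is not visible in this hunk).

```python
def relative_change_percent(baseline: float, candidate: float) -> float:
    """Relative change of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Perplexity values copied from the table rows in the diff.
# Assumption: tuples are (dense baseline, sparse indexer).
rows = {
    "seq_len=2048": (13.526, 13.533),
    "seq_len=8192": (15.628, 15.653),
}

changes = {
    name: relative_change_percent(dense, sparse)
    for name, (dense, sparse) in rows.items()
}

for name, pct in changes.items():
    print(f"{name}: {pct:+.3f}%")
```

From these rounded inputs, the 8192 row comes out at about +0.16%, matching the table. The 2048 row comes out at about +0.05%, slightly below the reported +0.058%; that gap is consistent with the displayed perplexities being rounded to three decimals.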