# SmolLM2-360M-Instruct Architecture Analysis
Technical reference document for the 8bit-threshold-computer LLM integration project.
**Model**: `HuggingFaceTB/SmolLM2-360M-Instruct`
**Architecture**: LlamaForCausalLM (Llama 2 variant)
**Tokenizer**: GPT2TokenizerFast
**Analysis Date**: 2026-01-21
---
## 1. Executive Summary
SmolLM2-360M-Instruct is a 362M-parameter causal language model built on the Llama architecture. Key characteristics relevant to our bit extraction task:
- **Hidden dimension: 960** (matches our extractor input requirement)
- **32 transformer layers** providing multiple extraction points
- **Digit-level tokenization** for numbers (each digit is a separate token)
- **Grouped Query Attention (GQA)** with 15 query heads and 5 KV heads
---
## 2. Architecture Census
### 2.1 Core Parameters
| Parameter | Value |
|-----------|-------|
| Total Parameters | 361,821,120 (361.82M) |
| Vocabulary Size | 49,152 |
| Hidden Dimension | 960 |
| Intermediate Dimension (MLP) | 2,560 |
| Number of Layers | 32 |
| Number of Attention Heads | 15 |
| Number of KV Heads | 5 (Grouped Query Attention) |
| Head Dimension | 64 |
| Max Sequence Length | 8,192 |
| Activation Function | SiLU |
| Normalization | RMSNorm (eps=1e-05) |
| Position Encoding | RoPE (theta=100,000) |
| Word Embedding Tying | Yes (embed_tokens = lm_head) |
### 2.2 Architecture Diagram
```
  Input Token IDs
         |
         v
+------------------+
| Embedding Layer  |  (49152, 960)
+------------------+
         |
         v
+------------------+
| LlamaDecoderLayer|  x 32 (projection shapes shown as (in_features, out_features))
|  +-------------+ |
|  | RMSNorm     | |
|  +-------------+ |
|  | Self-Attn   | |  Q: (960, 960), K: (960, 320), V: (960, 320), O: (960, 960)
|  +-------------+ |
|  | Residual    | |
|  +-------------+ |
|  | RMSNorm     | |
|  +-------------+ |
|  | MLP (SwiGLU)| |  gate: (960, 2560), up: (960, 2560), down: (2560, 960)
|  +-------------+ |
|  | Residual    | |
+------------------+
         |
         v
+------------------+
| Final RMSNorm    |  (960,)
+------------------+
         |
         v
+------------------+
| LM Head          |  (960, 49152) - tied with embeddings
+------------------+
         |
         v
Logits (batch, seq, 49152)
```
### 2.3 Parameter Distribution
| Component | Parameters | Percentage |
|-----------|-----------|------------|
| Embedding | 47,185,920 | 13.04% |
| All Attention Layers | 78,643,200 | 21.74% |
| All MLP Layers | 235,929,600 | 65.21% |
| All Layer Norms | 61,440 | 0.02% |
| Final Norm | 960 | 0.00% |
Per-layer breakdown (each of 32 layers):
- Attention: 2,457,600 params (0.68%)
- MLP: 7,372,800 params (2.04%)
- Norms: 1,920 params (0.00%)
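The counts above can be re-derived from the core dimensions alone. The following is a minimal sketch, assuming bias-free projections and tied embeddings as listed in Section 2.1; `ModelDims` and `count_parameters` are illustrative names, not part of the project code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelDims:
    """Dimensions copied from the Core Parameters table."""
    vocab_size: int = 49_152
    hidden_size: int = 960
    intermediate_size: int = 2_560
    num_layers: int = 32
    num_heads: int = 15
    num_kv_heads: int = 5
    head_dim: int = 64


def count_parameters(d: ModelDims) -> dict:
    """Recompute parameter counts (assumes no biases, lm_head tied to embeddings)."""
    q_dim = d.num_heads * d.head_dim       # 15 * 64 = 960
    kv_dim = d.num_kv_heads * d.head_dim   # 5 * 64 = 320
    # q_proj and o_proj are hidden x q_dim; k_proj and v_proj are hidden x kv_dim
    attention = 2 * d.hidden_size * q_dim + 2 * d.hidden_size * kv_dim
    # gate_proj, up_proj and down_proj all hold hidden x intermediate weights
    mlp = 3 * d.hidden_size * d.intermediate_size
    # two RMSNorm weight vectors per layer
    norms = 2 * d.hidden_size
    embedding = d.vocab_size * d.hidden_size
    layer = attention + mlp + norms
    total = embedding + d.num_layers * layer + d.hidden_size  # + final norm
    return {
        "embedding": embedding,
        "attention_per_layer": attention,
        "mlp_per_layer": mlp,
        "norms_per_layer": norms,
        "layer": layer,
        "total": total,
    }


counts = count_parameters(ModelDims())
for name, value in counts.items():
    print(f"{name:>20}: {value:>12,} ({100 * value / counts['total']:.2f}%)")
```

Running this reproduces the per-layer figures and the 361,821,120 total, which makes it a quick consistency check if the tables are ever regenerated for a different checkpoint.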
---
## 3. Weight Inventory
### 3.1 Embedding and Output Layers
| Parameter Name | Shape | Elements | Notes |
|---------------|-------|----------|-------|
| `model.embed_tokens.weight` | (49152, 960) | 47,185,920 | Token embeddings |
| `model.norm.weight` | (960,) | 960 | Final layer norm |
| `lm_head.weight` | (49152, 960) | (tied) | Tied to embed_tokens |
### 3.2 Single Transformer Layer Structure
Each of the 32 layers (`model.layers.{0-31}`) contains:
**Attention Block:**
| Parameter | Shape | Elements |
|-----------|-------|----------|
| `self_attn.q_proj.weight` | (960, 960) | 921,600 |
| `self_attn.k_proj.weight` | (320, 960) | 307,200 |
| `self_attn.v_proj.weight` | (320, 960) | 307,200 |
| `self_attn.o_proj.weight` | (960, 960) | 921,600 |
| **Attention Total** | | **2,457,600** |
**MLP Block (SwiGLU):**
| Parameter | Shape | Elements |
|-----------|-------|----------|
| `mlp.gate_proj.weight` | (2560, 960) | 2,457,600 |
| `mlp.up_proj.weight` | (2560, 960) | 2,457,600 |
| `mlp.down_proj.weight` | (960, 2560) | 2,457,600 |
| **MLP Total** | | **7,372,800** |
**Normalization:**
| Parameter | Shape | Elements |
|-----------|-------|----------|
| `input_layernorm.weight` | (960,) | 960 |
| `post_attention_layernorm.weight` | (960,) | 960 |
| **Norms Total** | | **1,920** |
**Layer Total: 9,832,320 parameters**
### 3.3 Grouped Query Attention (GQA) Details
SmolLM2 uses GQA with a 3:1 ratio:
- 15 query heads (Q dimension: 960 = 15 x 64)
- 5 key-value heads (KV dimension: 320 = 5 x 64)
- Each KV head is shared by 3 query heads
- This reduces KV cache memory by ~67% vs standard MHA
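The sharing pattern and the cache saving can be illustrated with plain arithmetic. This sketch assumes consecutive query heads share a KV head (which is how Hugging Face's Llama code expands keys and values) and 2 bytes per cached element (fp16 or bf16); the function names are hypothetical.

```python
NUM_LAYERS = 32
NUM_QUERY_HEADS = 15
NUM_KV_HEADS = 5
HEAD_DIM = 64


def kv_head_for_query_head(q_head: int) -> int:
    """Return the KV head index that a given query head reads from."""
    if not 0 <= q_head < NUM_QUERY_HEADS:
        raise ValueError(f"query head must be in [0, {NUM_QUERY_HEADS})")
    group_size = NUM_QUERY_HEADS // NUM_KV_HEADS  # 3 query heads per KV head
    return q_head // group_size


def kv_cache_bytes_per_token(num_kv_heads: int, bytes_per_element: int = 2) -> int:
    """Bytes cached per token: keys plus values, summed over all layers."""
    return 2 * NUM_LAYERS * num_kv_heads * HEAD_DIM * bytes_per_element


gqa_bytes = kv_cache_bytes_per_token(NUM_KV_HEADS)     # 5 KV heads (GQA)
mha_bytes = kv_cache_bytes_per_token(NUM_QUERY_HEADS)  # 15 KV heads (standard MHA)
saving = 1 - gqa_bytes / mha_bytes

print([kv_head_for_query_head(q) for q in range(NUM_QUERY_HEADS)])
print(f"GQA: {gqa_bytes:,} B/token, MHA: {mha_bytes:,} B/token, saving: {saving:.1%}")
```

The saving depends only on the 5:15 head ratio, so it stays at two thirds regardless of the element size chosen.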
---
## 4. Tokenization Analysis
### 4.1 Arithmetic Expression Tokenization
Test input: `"47 + 86"`
| Position | Token ID | Token | Description |
|----------|----------|-------|-------------|
| 0 | 36 | `'4'` | First digit of operand A |
| 1 | 39 | `'7'` | Second digit of operand A |
| 2 | 1232 | `' +'` | Space + plus sign |
| 3 | 216 | `' '` | Space separating the operator from operand B |
| 4 | 40 | `'8'` | First digit of operand B |
| 5 | 38 | `'6'` | Second digit of operand B |
### 4.2 Digit Token Mappings
| Digit | Token ID |
|-------|----------|
| 0 | 32 |
| 1 | 33 |
| 2 | 34 |
| 3 | 35 |
| 4 | 36 |
| 5 | 37 |
| 6 | 38 |
| 7 | 39 |
| 8 | 40 |
| 9 | 41 |
Key observations:
- **Digits are tokenized individually** (no multi-digit tokens like "47")
- Digit tokens are sequential: ID = 32 + digit_value
- Space-prefixed operators exist (e.g., `' +'` = 1232)
- `'='` has token ID 45
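The mapping can be captured in a few helpers. This is a sketch built only from the token IDs in the tables above; it does not call the real tokenizer, and only `"47 + 86"` has been confirmed in this document, so predictions for other inputs are an extrapolation that should be checked against `tokenizer(...)`. All names are illustrative.

```python
from typing import List, Optional

DIGIT_TOKEN_BASE = 32   # token ID of '0' per the digit table
PLUS_TOKEN_ID = 1232    # ' +' (space-prefixed plus)
SPACE_TOKEN_ID = 216    # ' '


def digit_to_token_id(digit: int) -> int:
    """Token ID for a single decimal digit (ID = 32 + digit)."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be in 0..9")
    return DIGIT_TOKEN_BASE + digit


def token_id_to_digit(token_id: int) -> Optional[int]:
    """Inverse mapping; returns None for non-digit tokens."""
    digit = token_id - DIGIT_TOKEN_BASE
    return digit if 0 <= digit <= 9 else None


def predicted_token_ids(a: int, b: int) -> List[int]:
    """Token IDs the tables predict for f"{a} + {b}" (non-negative integers only)."""
    left = [digit_to_token_id(int(ch)) for ch in str(a)]
    right = [digit_to_token_id(int(ch)) for ch in str(b)]
    return left + [PLUS_TOKEN_ID, SPACE_TOKEN_ID] + right


print(predicted_token_ids(47, 86))  # [36, 39, 1232, 216, 40, 38]
```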
### 4.3 Implications for Bit Extraction
The digit-by-digit tokenization means:
1. For `"47 + 86"`, operand A spans positions [0,1] and operand B spans positions [4,5]
2. The model must learn to:
- Recognize digit boundaries
- Compose multi-digit numbers from sequential tokens
- Perform arithmetic across token positions
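3. Hidden states at digit positions are the natural place to look for numerical representations. Because attention is causal, the state at a given position can only encode tokens at or before it, so only the last position sees the full expression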
---
## 5. Hidden State Analysis
### 5.1 Hidden State Output Structure
When running with `output_hidden_states=True`:
- Returns **33 hidden states** (embedding + 32 layer outputs)
- Each has shape: `(batch_size, seq_len, hidden_dim)`
- For `"47 + 86"`: `(1, 6, 960)`
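Because the embedding output occupies index 0, layer numbers and tuple indices are offset by one. A small helper (hypothetical names, using only the dimensions stated above) makes the convention explicit:

```python
NUM_LAYERS = 32
HIDDEN_DIM = 960


def hidden_state_index(layer: int) -> int:
    """Index into outputs.hidden_states for the output of decoder layer `layer` (0-based)."""
    if not 0 <= layer < NUM_LAYERS:
        raise ValueError(f"layer must be in [0, {NUM_LAYERS})")
    return layer + 1  # index 0 holds the embedding output


def expected_shape(batch_size: int, seq_len: int) -> tuple:
    """Shape of every entry in outputs.hidden_states."""
    return (batch_size, seq_len, HIDDEN_DIM)


print(hidden_state_index(0), hidden_state_index(31))  # 1 32
print(expected_shape(1, 6))                           # (1, 6, 960)
```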
### 5.2 Hidden State Statistics by Layer
| Layer | Mean | Std Dev | Min | Max |
|-------|------|---------|-----|-----|
| Embedding | -0.001 | 0.105 | -0.44 | 1.77 |
| Layer 0 | -0.127 | 2.55 | -80.8 | 19.0 |
| Layer 1 | -0.171 | 3.70 | -161 | 39.7 |
| Layer 2 | -0.151 | 3.67 | -102 | 61.4 |
| Layer 3 | -1.13 | 327 | -21,722 | 11,856 |
| Layer 4-12 | ~-1.3 | ~327 | ~-21,700 | ~11,800 |
| Layer 13-26 | ~-1.5 | ~337 | ~-22,400 | ~12,100 |
| Layer 27-30 | ~-1.8 | ~310 | ~-20,000 | ~11,800 |
| Layer 31 | 0.017 | 1.34 | -18.9 | 34.3 |
Key observations:
1. **Dramatic variance explosion at Layer 3**: Std dev jumps from ~4 to ~327
2. **Stable middle layers (4-26)**: Consistent statistics, suggesting numerical computation
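3. **Compression at final layer**: Std dev drops to 1.34 at Layer 31. In Hugging Face's Llama implementation the last `hidden_states` entry is recorded after the final RMSNorm (`model.norm`), which would explain the rescaling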
4. **Layer 31 is well-scaled** for downstream processing
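One way to read the Layer 3 jump is to ask how much of the reported spread the two extreme values could account for by themselves. The sketch below assumes the statistics were computed over the whole `(1, 6, 960)` tensor for `"47 + 86"` and treats every other element as zero; both are assumptions, not something the measurements above state.

```python
import math

# Values copied from the Layer 3 row of the statistics table.
LAYER3_MIN = -21_722.0
LAYER3_MAX = 11_856.0
LAYER3_STD = 327.0

# Assumption: statistics cover all 6 positions x 960 dimensions.
NUM_ELEMENTS = 6 * 960


def std_from_outliers(values, n_elements):
    """Std dev of a tensor that is zero everywhere except at `values`.

    Uses the population formula with the mean approximated as 0,
    in line with the near-zero means reported in the table.
    """
    return math.sqrt(sum(v * v for v in values) / n_elements)


outlier_std = std_from_outliers([LAYER3_MIN, LAYER3_MAX], NUM_ELEMENTS)
print(f"std explained by min and max alone: {outlier_std:.1f}")
print(f"fraction of reported std: {outlier_std / LAYER3_STD:.3f}")
```

Under those assumptions the two extremes alone give a standard deviation of about 326, close to the reported 327. That would suggest a handful of outlier activations, rather than a uniform rise in scale, drives the high-variance layers, so clipping or normalising before extraction from layers 3-30 may be worth testing.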
### 5.3 Extraction Point Candidates
| Layer Range | Characteristics | Suitability |
|-------------|-----------------|-------------|
| 0-2 (Early) | Low variance, close to embeddings | Poor - minimal computation |
| 3-12 (Early-Mid) | High variance, initial processing | Moderate - may contain raw numerical features |
| 13-26 (Middle) | Stable high variance | Good - computation in progress |
| 27-30 (Late) | Variance compression begins | Good - refined representations |
| 31 (Final) | Well-normalized output | Best - final representation before LM head |
---
## 6. Relevance to 8bit-Threshold-Computer Project
### 6.1 Hidden Dimension Match
**The hidden dimension of 960 exactly matches our extractor input requirement.** This is fortuitous as it means:
- No projection layer needed to interface with our bit extractor
- Direct extraction from any layer's hidden states
- Full utilization of the model's representational capacity
### 6.2 Recommended Extraction Strategy
```python
import torch


def extract_hidden_state(model, tokenizer, expression, layer=-1):
    """
    Extract hidden state for bit extraction.

    Args:
        model: Loaded causal LM (LlamaForCausalLM).
        tokenizer: Matching tokenizer.
        expression: Input text, e.g. "47 + 86".
        layer: Index into outputs.hidden_states (default: final entry).
            -1 = Layer 31 (final, pre-LM-head)

    Returns:
        Tensor of shape (960,) for the last token position
    """
    inputs = tokenizer(expression, return_tensors="pt")
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states[0] = embedding, hidden_states[1] = layer 0, ...
    # hidden_states[32] = layer 31 (final)
    hidden = outputs.hidden_states[layer]  # (1, seq_len, 960)
    # Extract last token position for autoregressive prediction
    return hidden[0, -1, :]  # (960,)
```
### 6.3 Token Position Analysis
For arithmetic expressions like `"A + B"`:
```
Tokens:    [d1] [d2] [ +] [ ] [d3] [d4]
Positions:   0    1    2   3    4    5

Operand A: positions 0 to (plus_pos - 1)
Operator:  position where ' +' token appears
Operand B: positions (plus_pos + 2) to end
```
Strategy for operand extraction:
1. Find the `' +'` token (ID 1232) position
2. Collect hidden states at digit positions before it (operand A)
3. Collect hidden states at digit positions after it (operand B)
4. Consider pooling (mean, max) or concatenating digit hidden states
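The four steps above can be sketched without loading the model. This example works on plain token-ID lists and nested lists of floats; `locate_operands` and `mean_pool` are hypothetical helpers, and the token IDs are the ones reported in Section 4.

```python
from typing import List, NamedTuple, Sequence

PLUS_TOKEN_ID = 1232             # ' +'
DIGIT_TOKEN_IDS = range(32, 42)  # '0'..'9'


class OperandPositions(NamedTuple):
    operand_a: List[int]
    plus_pos: int
    operand_b: List[int]


def locate_operands(token_ids: Sequence[int]) -> OperandPositions:
    """Steps 1-3: find ' +' and the digit positions on either side of it."""
    ids = list(token_ids)
    plus_pos = ids.index(PLUS_TOKEN_ID)  # raises ValueError if ' +' is absent
    a = [i for i in range(plus_pos) if ids[i] in DIGIT_TOKEN_IDS]
    b = [i for i in range(plus_pos + 1, len(ids)) if ids[i] in DIGIT_TOKEN_IDS]
    return OperandPositions(a, plus_pos, b)


def mean_pool(states: Sequence[Sequence[float]], positions: Sequence[int]) -> List[float]:
    """Step 4 (one option): average the hidden vectors at the given positions."""
    if not positions:
        raise ValueError("no positions to pool")
    rows = [states[i] for i in positions]
    return [sum(column) / len(rows) for column in zip(*rows)]


pos = locate_operands([36, 39, 1232, 216, 40, 38])  # "47 + 86"
print(pos)  # OperandPositions(operand_a=[0, 1], plus_pos=2, operand_b=[4, 5])
```

Filtering on digit IDs rather than relying on `plus_pos + 2` keeps the helper working if the separating space token is absent. With real hidden states of shape `(seq_len, 960)`, the same pooling would be `states[positions].mean(dim=0)` in PyTorch.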
### 6.4 Attention Pattern Utilization
With GQA (15 query heads, 5 KV heads), we can analyze attention patterns to:
1. Identify which positions attend to operand digits
2. Determine if the model explicitly links corresponding digit positions
3. Find heads that specialize in numerical reasoning
```python
import torch

def get_attention_weights(model, tokenizer, expression):
    inputs = tokenizer(expression, return_tensors="pt")
    # Weights may be unavailable unless the model was loaded with attn_implementation="eager"
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)
    # attentions: tuple of (batch, num_heads, seq_len, seq_len) per layer
    return outputs.attentions
```
### 6.5 Extraction Interface Specification
For integration with the threshold computer:
```python
import torch


class SmolLM2Extractor:
    """Interface between SmolLM2 and threshold-based bit extraction."""

    def __init__(self, model, tokenizer, extraction_layer=31):
        self.model = model
        self.tokenizer = tokenizer
        self.layer = extraction_layer + 1  # +1 because index 0 is embedding

    def get_hidden_state(self, text: str) -> torch.Tensor:
        """
        Returns: Tensor of shape (960,) ready for bit extractor
        """
        tokens = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**tokens, output_hidden_states=True)
        return outputs.hidden_states[self.layer][0, -1, :]

    def get_all_position_states(self, text: str) -> torch.Tensor:
        """
        Returns: Tensor of shape (seq_len, 960) for all positions
        """
        tokens = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**tokens, output_hidden_states=True)
        return outputs.hidden_states[self.layer][0]
```
---
## 7. Complete Weight Inventory Table
### 7.1 All Named Parameters
```
EMBEDDING (47,185,920 params - 13.04%)
  model.embed_tokens.weight                         (49152, 960)  47,185,920

LAYER 0 (9,832,320 params - 2.72%)
  Attention (2,457,600):
    model.layers.0.self_attn.q_proj.weight          (960, 960)       921,600
    model.layers.0.self_attn.k_proj.weight          (320, 960)       307,200
    model.layers.0.self_attn.v_proj.weight          (320, 960)       307,200
    model.layers.0.self_attn.o_proj.weight          (960, 960)       921,600
  MLP (7,372,800):
    model.layers.0.mlp.gate_proj.weight             (2560, 960)    2,457,600
    model.layers.0.mlp.up_proj.weight               (2560, 960)    2,457,600
    model.layers.0.mlp.down_proj.weight             (960, 2560)    2,457,600
  Norms (1,920):
    model.layers.0.input_layernorm.weight           (960,)               960
    model.layers.0.post_attention_layernorm.weight  (960,)               960

[Layers 1-31 follow identical structure, each with 9,832,320 params]

FINAL NORM (960 params - 0.00%)
  model.norm.weight                                 (960,)               960

LM HEAD (tied with embed_tokens)
  lm_head.weight                                    (49152, 960)    [shared]
```
### 7.2 Summary by Component Type
| Component Type | Count | Params Each | Total Params |
|----------------|-------|-------------|--------------|
| Embedding | 1 | 47,185,920 | 47,185,920 |
| Q Projection | 32 | 921,600 | 29,491,200 |
| K Projection | 32 | 307,200 | 9,830,400 |
| V Projection | 32 | 307,200 | 9,830,400 |
| O Projection | 32 | 921,600 | 29,491,200 |
| Gate Projection | 32 | 2,457,600 | 78,643,200 |
| Up Projection | 32 | 2,457,600 | 78,643,200 |
| Down Projection | 32 | 2,457,600 | 78,643,200 |
| Input LayerNorm | 32 | 960 | 30,720 |
| Post-Attn LayerNorm | 32 | 960 | 30,720 |
| Final LayerNorm | 1 | 960 | 960 |
| **Total** | | | **361,821,120** |
---
## 8. Configuration Reference
Complete model configuration from HuggingFace:
```python
{
    "architectures": ["LlamaForCausalLM"],
    "attention_bias": False,
    "attention_dropout": 0.0,
    "bos_token_id": 1,
    "eos_token_id": 2,
    "pad_token_id": 2,
    "head_dim": 64,
    "hidden_act": "silu",
    "hidden_size": 960,
    "initializer_range": 0.02,
    "intermediate_size": 2560,
    "max_position_embeddings": 8192,
    "mlp_bias": False,
    "model_type": "llama",
    "num_attention_heads": 15,
    "num_hidden_layers": 32,
    "num_key_value_heads": 5,
    "pretraining_tp": 1,
    "rms_norm_eps": 1e-05,
    "rope_interleaved": False,
    "rope_theta": 100000,
    "tie_word_embeddings": True,
    "use_cache": True,
    "vocab_size": 49152
}
```
---
## 9. Appendix: PyTorch Model Structure
```
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 960, padding_idx=2)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=960, out_features=960, bias=False)
          (k_proj): Linear(in_features=960, out_features=320, bias=False)
          (v_proj): Linear(in_features=960, out_features=320, bias=False)
          (o_proj): Linear(in_features=960, out_features=960, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=960, out_features=2560, bias=False)
          (up_proj): Linear(in_features=960, out_features=2560, bias=False)
          (down_proj): Linear(in_features=2560, out_features=960, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((960,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((960,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((960,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=960, out_features=49152, bias=False)
)
```