| # SmolLM2-360M-Instruct Architecture Analysis |
|
|
| Technical reference document for the 8bit-threshold-computer LLM integration project. |
|
|
| **Model**: `HuggingFaceTB/SmolLM2-360M-Instruct` |
| **Architecture**: LlamaForCausalLM (Llama 2 variant) |
| **Tokenizer**: GPT2TokenizerFast |
| **Analysis Date**: 2026-01-21 |
|
|
| --- |
|
|
| ## 1. Executive Summary |
|
|
| SmolLM2-360M-Instruct is a 362M-parameter causal language model built on the Llama architecture. The characteristics most relevant to our bit extraction task are: |
|
|
| - **Hidden dimension: 960** (matches our extractor input requirement) |
| - **32 transformer layers** providing multiple extraction points |
| - **Digit-level tokenization** for numbers (each digit is a separate token) |
| - **Grouped Query Attention (GQA)** with 15 query heads and 5 KV heads |
|
|
| --- |
|
|
| ## 2. Architecture Census |
|
|
| ### 2.1 Core Parameters |
|
|
| | Parameter | Value | |
| |-----------|-------| |
| | Total Parameters | 361,821,120 (361.82M) | |
| | Vocabulary Size | 49,152 | |
| | Hidden Dimension | 960 | |
| | Intermediate Dimension (MLP) | 2,560 | |
| | Number of Layers | 32 | |
| | Number of Attention Heads | 15 | |
| | Number of KV Heads | 5 (Grouped Query Attention) | |
| | Head Dimension | 64 | |
| | Max Sequence Length | 8,192 | |
| | Activation Function | SiLU | |
| | Normalization | RMSNorm (eps=1e-05) | |
| | Position Encoding | RoPE (theta=100,000) | |
| | Word Embedding Tying | Yes (embed_tokens = lm_head) | |
|
|
| ### 2.2 Architecture Diagram |
|
|
```
Input Token IDs
        |
        v
+------------------+
| Embedding Layer  |  (49152, 960)
+------------------+
        |
        v
+------------------+
| LlamaDecoderLayer|  x 32
|  +-------------+ |
|  | RMSNorm     | |
|  +-------------+ |
|  | Self-Attn   | |  Q: (960, 960), K: (960, 320), V: (960, 320), O: (960, 960)
|  +-------------+ |
|  | Residual    | |
|  +-------------+ |
|  | RMSNorm     | |
|  +-------------+ |
|  | MLP (SwiGLU)| |  gate: (960, 2560), up: (960, 2560), down: (2560, 960)
|  +-------------+ |
|  | Residual    | |
+------------------+
        |
        v
+------------------+
| Final RMSNorm    |  (960,)
+------------------+
        |
        v
+------------------+
| LM Head          |  (960, 49152) - tied with embeddings
+------------------+
        |
        v
Logits (batch, seq, 49152)
```

Projection shapes in this diagram are written as (in_features, out_features); the weight tables in Section 3 use PyTorch's (out_features, in_features) layout.
|
|
| ### 2.3 Parameter Distribution |
|
|
| | Component | Parameters | Percentage | |
| |-----------|-----------|------------| |
| | Embedding | 47,185,920 | 13.04% | |
| | All Attention Layers | 78,643,200 | 21.74% | |
| | All MLP Layers | 235,929,600 | 65.21% | |
| | All Layer Norms | 61,440 | 0.02% | |
| | Final Norm | 960 | 0.00% | |
|
|
| Per-layer breakdown (each of 32 layers): |
| - Attention: 2,457,600 params (0.68%) |
| - MLP: 7,372,800 params (2.04%) |
| - Norms: 1,920 params (0.00%) |
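
The counts above can be recomputed from the dimensions in Section 2.1. The following is a minimal sketch; the function name `count_parameters` and its keyword arguments are illustrative, not part of any library.

```python
def count_parameters(vocab=49152, hidden=960, intermediate=2560,
                     layers=32, q_heads=15, kv_heads=5, head_dim=64):
    """Recompute parameter counts from the architecture dimensions."""
    q_dim = q_heads * head_dim    # 960
    kv_dim = kv_heads * head_dim  # 320

    counts = {
        # lm_head is tied to the embedding, so it is counted once
        "embedding": vocab * hidden,
        # q_proj + k_proj + v_proj + o_proj, no biases
        "attention": layers * (hidden * q_dim + 2 * hidden * kv_dim + q_dim * hidden),
        # gate_proj + up_proj + down_proj, no biases
        "mlp": layers * 3 * hidden * intermediate,
        # input_layernorm + post_attention_layernorm per layer
        "layer_norms": layers * 2 * hidden,
        "final_norm": hidden,
    }
    counts["total"] = sum(counts.values())
    return counts


counts = count_parameters()
for name, n in counts.items():
    share = 100 * n / counts["total"]
    print(f"{name:12s} {n:>12,d} {share:6.2f}%")
```

Running it prints one row per component with its share of the total, which makes it easy to re-check the table if any dimension changes.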
|
|
| --- |
|
|
| ## 3. Weight Inventory |
|
|
| ### 3.1 Embedding and Output Layers |
|
|
| | Parameter Name | Shape | Elements | Notes | |
| |---------------|-------|----------|-------| |
| | `model.embed_tokens.weight` | (49152, 960) | 47,185,920 | Token embeddings | |
| | `model.norm.weight` | (960,) | 960 | Final layer norm | |
| | `lm_head.weight` | (49152, 960) | (tied) | Tied to embed_tokens | |
| |
| ### 3.2 Single Transformer Layer Structure |
| |
| Each of the 32 layers (`model.layers.{0-31}`) contains: |
| |
| **Attention Block:** |
| | Parameter | Shape | Elements | |
| |-----------|-------|----------| |
| | `self_attn.q_proj.weight` | (960, 960) | 921,600 | |
| | `self_attn.k_proj.weight` | (320, 960) | 307,200 | |
| | `self_attn.v_proj.weight` | (320, 960) | 307,200 | |
| | `self_attn.o_proj.weight` | (960, 960) | 921,600 | |
| | **Attention Total** | | **2,457,600** | |
| |
| **MLP Block (SwiGLU):** |
| | Parameter | Shape | Elements | |
| |-----------|-------|----------| |
| | `mlp.gate_proj.weight` | (2560, 960) | 2,457,600 | |
| | `mlp.up_proj.weight` | (2560, 960) | 2,457,600 | |
| | `mlp.down_proj.weight` | (960, 2560) | 2,457,600 | |
| | **MLP Total** | | **7,372,800** | |
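
In Llama-style models the three MLP projections combine as `down_proj(SiLU(gate_proj(x)) * up_proj(x))`. The sketch below illustrates that data flow in pure Python with toy dimensions (hidden size 2, intermediate size 3). The helper names `silu`, `matvec`, and `swiglu_mlp` are hypothetical and the weights are made up for illustration.

```python
import math


def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(weight, x):
    """Multiply an (out, in) weight matrix, stored as nested lists, by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def swiglu_mlp(x, gate_w, up_w, down_w):
    """Gated MLP: down(silu(gate(x)) * up(x)), with no biases."""
    gate = [silu(g) for g in matvec(gate_w, x)]
    up = matvec(up_w, x)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gating
    return matvec(down_w, hidden)


# Toy weights: gate/up are (3, 2), down is (2, 3)
gate_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
up_w = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.5]]
down_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

y = swiglu_mlp([0.5, -0.25], gate_w, up_w, down_w)
print(len(y))  # output has the hidden size again
```

In the real model the same structure applies with `gate_w` and `up_w` of shape (2560, 960) and `down_w` of shape (960, 2560).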
|
|
| **Normalization:** |
| | Parameter | Shape | Elements | |
| |-----------|-------|----------| |
| | `input_layernorm.weight` | (960,) | 960 | |
| | `post_attention_layernorm.weight` | (960,) | 960 | |
| | **Norms Total** | | **1,920** | |
|
|
| **Layer Total: 9,832,320 parameters** |
|
|
| ### 3.3 Grouped Query Attention (GQA) Details |
|
|
| SmolLM2 uses GQA with a 3:1 ratio: |
| - 15 query heads (Q dimension: 960 = 15 x 64) |
| - 5 key-value heads (KV dimension: 320 = 5 x 64) |
| - Each KV head is shared by 3 query heads |
| - This reduces KV cache memory by ~67% vs standard MHA |
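
The ~67% figure follows directly from the head counts. A small sketch of the arithmetic, assuming 2 bytes per cached value (bf16 or fp16); the function name is illustrative:

```python
def kv_cache_bytes_per_token(layers=32, kv_heads=5, head_dim=64, bytes_per_value=2):
    """Bytes of K and V cached per token: two tensors per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


gqa_bytes = kv_cache_bytes_per_token(kv_heads=5)   # GQA as configured
mha_bytes = kv_cache_bytes_per_token(kv_heads=15)  # hypothetical MHA baseline
reduction = 1 - gqa_bytes / mha_bytes

print(gqa_bytes, mha_bytes, f"{reduction:.1%}")
```

The reduction depends only on the ratio of KV heads to query heads (5/15), so it is the same for any precision or sequence length.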
|
|
| --- |
|
|
| ## 4. Tokenization Analysis |
|
|
| ### 4.1 Arithmetic Expression Tokenization |
|
|
| Test input: `"47 + 86"` |
|
|
| | Position | Token ID | Token | Description | |
| |----------|----------|-------|-------------| |
| | 0 | 36 | `'4'` | First digit of operand A | |
| | 1 | 39 | `'7'` | Second digit of operand A | |
| | 2 | 1232 | `' +'` | Space + plus sign | |
| | 3 | 216 | `' '` | Trailing space | |
| | 4 | 40 | `'8'` | First digit of operand B | |
| | 5 | 38 | `'6'` | Second digit of operand B | |
|
|
| ### 4.2 Digit Token Mappings |
|
|
| | Digit | Token ID | |
| |-------|----------| |
| | 0 | 32 | |
| | 1 | 33 | |
| | 2 | 34 | |
| | 3 | 35 | |
| | 4 | 36 | |
| | 5 | 37 | |
| | 6 | 38 | |
| | 7 | 39 | |
| | 8 | 40 | |
| | 9 | 41 | |
|
|
| Key observations: |
| - **Digits are tokenized individually** (no multi-digit tokens like "47") |
| - Digit tokens are sequential: ID = 32 + digit_value |
| - Space-prefixed operators exist (e.g., `' +'` = 1232) |
| - `'='` has token ID 45 |
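
Because digit IDs are sequential, converting between digit strings and token IDs needs no tokenizer call. The helpers below are a sketch that relies only on the mapping in the table above (ID = 32 + digit); the names are hypothetical, and the mapping should be re-checked against the tokenizer if the vocabulary ever changes.

```python
DIGIT_TOKEN_OFFSET = 32  # token ID of '0' according to the table above


def digits_to_token_ids(number):
    """Map a string of ASCII digits to per-digit token IDs."""
    if not (number.isascii() and number.isdigit()):
        raise ValueError(f"expected ASCII digits only, got {number!r}")
    return [DIGIT_TOKEN_OFFSET + int(ch) for ch in number]


def token_ids_to_number(token_ids):
    """Inverse mapping: compose digit token IDs back into an integer."""
    value = 0
    for token_id in token_ids:
        digit = token_id - DIGIT_TOKEN_OFFSET
        if not 0 <= digit <= 9:
            raise ValueError(f"token ID {token_id} is not a digit token")
        value = value * 10 + digit
    return value


print(digits_to_token_ids("47"))      # IDs for '4' and '7'
print(token_ids_to_number([40, 38]))  # IDs for '8' and '6'
```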
| |
| ### 4.3 Implications for Bit Extraction |
| |
| The digit-by-digit tokenization means: |
| 1. For `"47 + 86"`, operand A spans positions [0,1] and operand B spans positions [4,5] |
| 2. The model must learn to: |
| - Recognize digit boundaries |
| - Compose multi-digit numbers from sequential tokens |
| - Perform arithmetic across token positions |
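| 3. Hidden states at digit positions are the natural place to look for a numerical representation. Because attention is causal, states at operand A positions cannot depend on operand B, so information about both operands can only be present at or after the operand B positions |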
| |
| --- |
| |
| ## 5. Hidden State Analysis |
| |
| ### 5.1 Hidden State Output Structure |
| |
| When running with `output_hidden_states=True`: |
| - Returns **33 hidden states** (embedding + 32 layer outputs) |
| - Each has shape: `(batch_size, seq_len, hidden_dim)` |
| - For `"47 + 86"`: `(1, 6, 960)` |
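
The offset between layer numbers and tuple indices is easy to get wrong, so a tiny helper can make it explicit. This is a sketch with a hypothetical name, assuming the 33-entry layout described above:

```python
NUM_LAYERS = 32


def hidden_state_index(layer):
    """Map "embedding" or a layer number (0..31) to an index into hidden_states."""
    if layer == "embedding":
        return 0
    if not 0 <= layer < NUM_LAYERS:
        raise ValueError(f"layer must be in [0, {NUM_LAYERS - 1}], got {layer}")
    return layer + 1  # index 0 is taken by the embedding output


print(hidden_state_index("embedding"), hidden_state_index(0), hidden_state_index(31))
```

With 33 entries, index 32 and index -1 refer to the same tensor.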
|
|
| ### 5.2 Hidden State Statistics by Layer |
|
|
| | Layer | Mean | Std Dev | Min | Max | |
| |-------|------|---------|-----|-----| |
| | Embedding | -0.001 | 0.105 | -0.44 | 1.77 | |
| | Layer 0 | -0.127 | 2.55 | -80.8 | 19.0 | |
| | Layer 1 | -0.171 | 3.70 | -161 | 39.7 | |
| | Layer 2 | -0.151 | 3.67 | -102 | 61.4 | |
| | Layer 3 | -1.13 | 327 | -21,722 | 11,856 | |
| | Layers 4-12 | ~-1.3 | ~327 | ~-21,700 | ~11,800 | |
| | Layers 13-26 | ~-1.5 | ~337 | ~-22,400 | ~12,100 | |
| | Layers 27-30 | ~-1.8 | ~310 | ~-20,000 | ~11,800 | |
| | Layer 31 | 0.017 | 1.34 | -18.9 | 34.3 | |
|
|
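| Key observations: |
| 1. **Dramatic variance explosion at Layer 3**: Std dev jumps from ~3.7 to ~327, with extremes near -21,700 and +11,900 |
| 2. **Stable middle layers (4-26)**: Statistics stay nearly constant, which may reflect ongoing computation in the residual stream, though these summary statistics alone cannot confirm it |
| 3. **Compression in the final entry**: Std dev drops to 1.34 at Layer 31. In the Transformers Llama implementation the last `hidden_states` entry is returned after the final RMSNorm (`model.norm`), so this row likely reflects that normalization rather than the raw output of decoder layer 31 |
| 4. **Layer 31 is well-scaled** for downstream processing |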
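
One plausible reason for the small scale in the last row is RMSNorm, which rescales each position's vector to roughly unit root-mean-square before applying a learned gain. The sketch below shows the effect on a toy vector whose extremes are borrowed from the min/max columns above; it is an illustration, not a measured hidden state, and the learned gain is set to 1.

```python
import math


def rms(x):
    """Root-mean-square of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def rms_norm(x, weight=None, eps=1e-5):
    """RMSNorm: x / sqrt(mean(x^2) + eps), scaled elementwise by weight."""
    if weight is None:
        weight = [1.0] * len(x)  # assumption: unit gain for illustration
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v * scale for w, v in zip(weight, x)]


x = [-21722.0, 11856.0, 3.0, -1.5]  # toy vector with two outlier dimensions
y = rms_norm(x)

print(f"rms before: {rms(x):.1f}, rms after: {rms(y):.4f}")
```

Note that the outlier dimensions still dominate after normalization; only the overall scale changes, so the learned gain in the real model determines the final per-dimension magnitudes.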
|
|
| ### 5.3 Extraction Point Candidates |
|
|
| | Layer Range | Characteristics | Suitability | |
| |-------------|-----------------|-------------| |
| | 0-2 (Early) | Low variance, close to embeddings | Poor - minimal computation | |
| | 3-12 (Early-Mid) | High variance, initial processing | Moderate - may contain raw numerical features | |
| | 13-26 (Middle) | Stable high variance | Good - computation in progress | |
| | 27-30 (Late) | Variance compression begins | Good - refined representations | |
| | 31 (Final) | Well-normalized output (final RMSNorm already applied in the Transformers output) | Best - final representation before LM head | |
|
|
| --- |
|
|
| ## 6. Relevance to 8bit-Threshold-Computer Project |
|
|
| ### 6.1 Hidden Dimension Match |
|
|
| **The hidden dimension of 960 exactly matches our extractor input requirement.** This is convenient because: |
| - No projection layer is needed to interface with our bit extractor |
| - Hidden states from any layer can be passed to the extractor directly, although their scale differs greatly between layers (see Section 5.2) |
| - The extractor sees the full 960-dimensional representation rather than a compressed projection of it |
|
|
| ### 6.2 Recommended Extraction Strategy |
|
|
```python
import torch


def extract_hidden_state(model, tokenizer, expression, layer=-1):
    """
    Extract the hidden state used for bit extraction.

    Args:
        model: Loaded causal LM.
        tokenizer: Matching tokenizer.
        expression: Input text such as "47 + 86".
        layer: Index into outputs.hidden_states (default: -1, the last entry,
            which corresponds to Layer 31 in this document's numbering).

    Returns:
        Tensor of shape (960,) for the last token position
    """
    inputs = tokenizer(expression, return_tensors="pt").to(model.device)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs, output_hidden_states=True)

    # hidden_states[0] = embedding, hidden_states[1] = layer 0, ...
    # hidden_states[32] = layer 31 (final); in the Transformers Llama
    # implementation this last entry is taken after the final RMSNorm.
    hidden = outputs.hidden_states[layer]  # (1, seq_len, 960)

    # Extract last token position for autoregressive prediction
    return hidden[0, -1, :]  # (960,)
```
|
|
| ### 6.3 Token Position Analysis |
|
|
| For arithmetic expressions like `"A + B"`: |
|
|
```
Tokens:     [d1]  [d2]  [ +]  [ ]  [d3]  [d4]
Positions:   0     1     2     3    4     5

Operand A:  positions 0 to (plus_pos - 1)
Operator:   position where ' +' token appears
Operand B:  positions (plus_pos + 2) to end
```
|
|
| Strategy for operand extraction: |
| 1. Find the `' +'` token (ID 1232) position |
| 2. Collect hidden states at digit positions before it (operand A) |
| 3. Collect hidden states at digit positions after it (operand B) |
| 4. Consider pooling (mean, max) or concatenating digit hidden states |
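
The steps above can be sketched in pure Python on a list of token IDs. The function names are hypothetical, the constants come from Section 4, and the input is assumed to be a plain list of ints (call `.tolist()` first on a tensor). Mean pooling is shown as one of the options from step 4, operating on vectors represented as lists of floats.

```python
PLUS_TOKEN_ID = 1232       # ' +' per Section 4.1
DIGIT_IDS = range(32, 42)  # '0'..'9' per Section 4.2


def find_operand_positions(token_ids):
    """Return (operand A positions, plus position, operand B positions)."""
    plus_pos = token_ids.index(PLUS_TOKEN_ID)  # raises ValueError if absent
    a_positions = [i for i in range(plus_pos) if token_ids[i] in DIGIT_IDS]
    b_positions = [i for i in range(plus_pos + 1, len(token_ids))
                   if token_ids[i] in DIGIT_IDS]
    return a_positions, plus_pos, b_positions


def mean_pool(vectors):
    """Average equally sized vectors elementwise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


ids = [36, 39, 1232, 216, 40, 38]  # "47 + 86"
a_pos, plus_pos, b_pos = find_operand_positions(ids)
print(a_pos, plus_pos, b_pos)
```

Filtering on digit IDs rather than using fixed offsets keeps the helper valid for operands of any length and skips the space token after the operator.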
|
|
| ### 6.4 Attention Pattern Utilization |
|
|
| With GQA (15 query heads, 5 KV heads), we can analyze attention patterns to: |
| 1. Identify which positions attend to operand digits |
| 2. Determine if the model explicitly links corresponding digit positions |
| 3. Find heads that specialize in numerical reasoning |
|
|
```python
import torch

def get_attention_weights(model, tokenizer, expression):
    inputs = tokenizer(expression, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # May require loading the model with attn_implementation="eager"
        outputs = model(**inputs, output_attentions=True)
    # attentions: tuple of (batch, num_heads, seq_len, seq_len) per layer
    return outputs.attentions
```
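
When comparing heads, it helps to know which query heads share a KV head. The sketch below assumes consecutive grouping (query heads 0-2 share KV head 0, and so on), which matches how the Transformers Llama implementation repeats KV heads as far as we know; it is worth confirming against the installed version. The function names are hypothetical.

```python
NUM_QUERY_HEADS = 15
NUM_KV_HEADS = 5
GROUP_SIZE = NUM_QUERY_HEADS // NUM_KV_HEADS  # 3 query heads per KV head


def kv_head_for(query_head):
    """KV head index used by a given query head (consecutive grouping assumed)."""
    if not 0 <= query_head < NUM_QUERY_HEADS:
        raise ValueError(f"query_head out of range: {query_head}")
    return query_head // GROUP_SIZE


def query_heads_sharing(kv_head):
    """Query head indices that read from a given KV head."""
    if not 0 <= kv_head < NUM_KV_HEADS:
        raise ValueError(f"kv_head out of range: {kv_head}")
    start = kv_head * GROUP_SIZE
    return list(range(start, start + GROUP_SIZE))


for kv in range(NUM_KV_HEADS):
    print(kv, query_heads_sharing(kv))
```

Query heads in the same group see identical keys and values, so differences in their attention patterns come only from their query projections.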
|
|
| ### 6.5 Extraction Interface Specification |
|
|
| For integration with the threshold computer: |
|
|
```python
import torch


class SmolLM2Extractor:
    """Interface between SmolLM2 and threshold-based bit extraction."""

    def __init__(self, model, tokenizer, extraction_layer=31):
        self.model = model
        self.tokenizer = tokenizer
        self.layer = extraction_layer + 1  # +1 because index 0 is embedding

    def _layer_states(self, text: str) -> torch.Tensor:
        """Run the model once and return the selected layer, shape (seq_len, 960)."""
        tokens = self.tokenizer(text, return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            outputs = self.model(**tokens, output_hidden_states=True)
        return outputs.hidden_states[self.layer][0]

    def get_hidden_state(self, text: str) -> torch.Tensor:
        """
        Returns: Tensor of shape (960,) ready for bit extractor
        """
        return self._layer_states(text)[-1, :]

    def get_all_position_states(self, text: str) -> torch.Tensor:
        """
        Returns: Tensor of shape (seq_len, 960) for all positions
        """
        return self._layer_states(text)
```
|
|
| --- |
|
|
| ## 7. Complete Weight Inventory Table |
|
|
| ### 7.1 All Named Parameters |
|
|
```
EMBEDDING (47,185,920 params - 13.04%)
  model.embed_tokens.weight                         (49152, 960)  47,185,920

LAYER 0 (9,832,320 params - 2.72%)
  Attention (2,457,600):
    model.layers.0.self_attn.q_proj.weight          (960, 960)       921,600
    model.layers.0.self_attn.k_proj.weight          (320, 960)       307,200
    model.layers.0.self_attn.v_proj.weight          (320, 960)       307,200
    model.layers.0.self_attn.o_proj.weight          (960, 960)       921,600
  MLP (7,372,800):
    model.layers.0.mlp.gate_proj.weight             (2560, 960)    2,457,600
    model.layers.0.mlp.up_proj.weight               (2560, 960)    2,457,600
    model.layers.0.mlp.down_proj.weight             (960, 2560)    2,457,600
  Norms (1,920):
    model.layers.0.input_layernorm.weight           (960,)               960
    model.layers.0.post_attention_layernorm.weight  (960,)               960

[Layers 1-31 follow identical structure, each with 9,832,320 params]

FINAL NORM (960 params - 0.00%)
  model.norm.weight                                 (960,)               960

LM HEAD (tied with embed_tokens)
  lm_head.weight                                    (49152, 960)  [shared]
```
|
|
| ### 7.2 Summary by Component Type |
|
|
| | Component Type | Count | Params Each | Total Params | |
| |----------------|-------|-------------|--------------| |
| | Embedding | 1 | 47,185,920 | 47,185,920 | |
| | Q Projection | 32 | 921,600 | 29,491,200 | |
| | K Projection | 32 | 307,200 | 9,830,400 | |
| | V Projection | 32 | 307,200 | 9,830,400 | |
| | O Projection | 32 | 921,600 | 29,491,200 | |
| | Gate Projection | 32 | 2,457,600 | 78,643,200 | |
| | Up Projection | 32 | 2,457,600 | 78,643,200 | |
| | Down Projection | 32 | 2,457,600 | 78,643,200 | |
| | Input LayerNorm | 32 | 960 | 30,720 | |
| | Post-Attn LayerNorm | 32 | 960 | 30,720 | |
| | Final LayerNorm | 1 | 960 | 960 | |
| | **Total** | | | **361,821,120** | |
|
|
| --- |
|
|
| ## 8. Configuration Reference |
|
|
| Complete model configuration from HuggingFace: |
|
|
```python
{
    "architectures": ["LlamaForCausalLM"],
    "attention_bias": False,
    "attention_dropout": 0.0,
    "bos_token_id": 1,
    "eos_token_id": 2,
    "pad_token_id": 2,
    "head_dim": 64,
    "hidden_act": "silu",
    "hidden_size": 960,
    "initializer_range": 0.02,
    "intermediate_size": 2560,
    "max_position_embeddings": 8192,
    "mlp_bias": False,
    "model_type": "llama",
    "num_attention_heads": 15,
    "num_hidden_layers": 32,
    "num_key_value_heads": 5,
    "pretraining_tp": 1,
    "rms_norm_eps": 1e-05,
    "rope_interleaved": False,
    "rope_theta": 100000,
    "tie_word_embeddings": True,
    "use_cache": True,
    "vocab_size": 49152
}
```
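
To make `rope_theta` and `head_dim` more concrete: in the standard (unscaled) RoPE formulation, each pair of head dimensions rotates at a frequency of theta^(-2i/head_dim). The sketch below computes those frequencies from the values above. It assumes no RoPE scaling is applied, and the function names are illustrative.

```python
def rope_inverse_frequencies(head_dim=64, theta=100000.0):
    """Inverse frequencies theta^(-2i/head_dim) for i = 0 .. head_dim/2 - 1."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def rotation_angle(position, pair_index, inv_freq):
    """Rotation angle in radians applied to one dimension pair at a position."""
    return position * inv_freq[pair_index]


inv_freq = rope_inverse_frequencies()
print(len(inv_freq), inv_freq[0], inv_freq[-1])
```

The first pair rotates by one radian per position, while the last pair rotates far more slowly, which is what lets the encoding distinguish positions across long sequences.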
|
|
| --- |
|
|
| ## 9. Appendix: PyTorch Model Structure |
|
|
```
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 960, padding_idx=2)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=960, out_features=960, bias=False)
          (k_proj): Linear(in_features=960, out_features=320, bias=False)
          (v_proj): Linear(in_features=960, out_features=320, bias=False)
          (o_proj): Linear(in_features=960, out_features=960, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=960, out_features=2560, bias=False)
          (up_proj): Linear(in_features=960, out_features=2560, bias=False)
          (down_proj): Linear(in_features=2560, out_features=960, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((960,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((960,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((960,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=960, out_features=49152, bias=False)
)
```
|
|