SmolLM2-360M-Instruct Architecture Analysis
Technical reference document for the 8bit-threshold-computer LLM integration project.
Model: HuggingFaceTB/SmolLM2-360M-Instruct
Architecture: LlamaForCausalLM (Llama 2 variant)
Tokenizer: GPT2TokenizerFast
Analysis Date: 2026-01-21
1. Executive Summary
SmolLM2-360M-Instruct is a 362M parameter causal language model using the Llama architecture. Key characteristics relevant to our bit extraction task:
- Hidden dimension: 960 (matches our extractor input requirement)
- 32 transformer layers providing multiple extraction points
- Digit-level tokenization for numbers (each digit is a separate token)
- Grouped Query Attention (GQA) with 15 query heads and 5 KV heads
2. Architecture Census
2.1 Core Parameters
| Parameter | Value |
|---|---|
| Total Parameters | 361,821,120 (361.82M) |
| Vocabulary Size | 49,152 |
| Hidden Dimension | 960 |
| Intermediate Dimension (MLP) | 2,560 |
| Number of Layers | 32 |
| Number of Attention Heads | 15 |
| Number of KV Heads | 5 (Grouped Query Attention) |
| Head Dimension | 64 |
| Max Sequence Length | 8,192 |
| Activation Function | SiLU |
| Normalization | RMSNorm (eps=1e-05) |
| Position Encoding | RoPE (theta=100,000) |
| Word Embedding Tying | Yes (embed_tokens = lm_head) |
2.2 Architecture Diagram
    Input Token IDs
              |
              v
    +-------------------+
    |  Embedding Layer  |  (49152, 960)
    +-------------------+
              |
              v
    +-------------------+
    | LlamaDecoderLayer |  x 32
    |  +-------------+  |
    |  | RMSNorm     |  |
    |  +-------------+  |
    |  | Self-Attn   |  |  Q: (960, 960), K: (960, 320), V: (960, 320), O: (960, 960)
    |  +-------------+  |
    |  | Residual    |  |
    |  +-------------+  |
    |  | RMSNorm     |  |
    |  +-------------+  |
    |  | MLP (SwiGLU)|  |  gate: (960, 2560), up: (960, 2560), down: (2560, 960)
    |  +-------------+  |
    |  | Residual    |  |
    |  +-------------+  |
    +-------------------+
              |
              v
    +-------------------+
    |   Final RMSNorm   |  (960,)
    +-------------------+
              |
              v
    +-------------------+
    |      LM Head      |  (960, 49152) - tied with embeddings
    +-------------------+
              |
              v
    Logits (batch, seq, 49152)

Projection shapes in the diagram are written as (in_features, out_features). The weight tables in Section 3 list the stored tensors in PyTorch's (out_features, in_features) layout.
2.3 Parameter Distribution
| Component | Parameters | Percentage |
|---|---|---|
| Embedding | 47,185,920 | 13.04% |
| All Attention Layers | 78,643,200 | 21.74% |
| All MLP Layers | 235,929,600 | 65.21% |
| All Layer Norms | 61,440 | 0.02% |
| Final Norm | 960 | 0.00% |
Per-layer breakdown (each of 32 layers):
- Attention: 2,457,600 params (0.68%)
- MLP: 7,372,800 params (2.04%)
- Norms: 1,920 params (0.00%)
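The totals in this section can be re-derived from the dimensions in the Core Parameters table. The following is a minimal sketch in plain Python; the function name `count_parameters` and the constant names are illustrative and not part of the project code.

```python
# Values copied from the Core Parameters table; biases are absent (bias=False).
VOCAB, HIDDEN, INTERMEDIATE, LAYERS = 49_152, 960, 2_560, 32
NUM_HEADS, NUM_KV_HEADS, HEAD_DIM = 15, 5, 64


def count_parameters():
    """Recompute parameter counts per component from the dimensions above."""
    q_dim = NUM_HEADS * HEAD_DIM        # 960
    kv_dim = NUM_KV_HEADS * HEAD_DIM    # 320
    attention = HIDDEN * q_dim + 2 * HIDDEN * kv_dim + q_dim * HIDDEN  # q, k, v, o
    mlp = 3 * HIDDEN * INTERMEDIATE     # gate, up, down
    norms = 2 * HIDDEN                  # input + post-attention RMSNorm weights
    counts = {
        "embedding": VOCAB * HIDDEN,    # lm_head is tied, so it is counted once
        "attention": LAYERS * attention,
        "mlp": LAYERS * mlp,
        "layer_norms": LAYERS * norms,
        "final_norm": HIDDEN,
    }
    counts["total"] = sum(counts.values())
    return counts


counts = count_parameters()
for name, value in counts.items():
    print(f"{name:<12} {value:>12,} {100 * value / counts['total']:6.2f}%")
```

Because the embedding matrix is tied to the LM head, it contributes to the total only once.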
3. Weight Inventory
3.1 Embedding and Output Layers
| Parameter Name | Shape | Elements | Notes |
|---|---|---|---|
| model.embed_tokens.weight | (49152, 960) | 47,185,920 | Token embeddings |
| model.norm.weight | (960,) | 960 | Final layer norm |
| lm_head.weight | (49152, 960) | (tied) | Tied to embed_tokens |
3.2 Single Transformer Layer Structure
Each of the 32 layers (model.layers.{0-31}) contains:
Attention Block:
| Parameter | Shape | Elements |
|---|---|---|
| self_attn.q_proj.weight | (960, 960) | 921,600 |
| self_attn.k_proj.weight | (320, 960) | 307,200 |
| self_attn.v_proj.weight | (320, 960) | 307,200 |
| self_attn.o_proj.weight | (960, 960) | 921,600 |
| Attention Total | | 2,457,600 |
MLP Block (SwiGLU):
| Parameter | Shape | Elements |
|---|---|---|
| mlp.gate_proj.weight | (2560, 960) | 2,457,600 |
| mlp.up_proj.weight | (2560, 960) | 2,457,600 |
| mlp.down_proj.weight | (960, 2560) | 2,457,600 |
| MLP Total | | 7,372,800 |
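
The three MLP projections combine as `down_proj(silu(gate_proj(x)) * up_proj(x))`. The sketch below shows that data flow in plain Python on toy dimensions (hidden=2, intermediate=3 instead of 960 and 2,560). The weights are made-up illustration values, and the helper names `silu`, `matvec`, and `swiglu_mlp` are hypothetical.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weight, x):
    """Multiply an (out_features, in_features) weight by a vector of length in_features."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def swiglu_mlp(x, gate_w, up_w, down_w):
    """Bias-free SwiGLU block: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(gate_w, x)]
    up = matvec(up_w, x)
    return matvec(down_w, [g * u for g, u in zip(gate, up)])


# Toy weights in (out_features, in_features) layout, as in the table above.
x = [1.0, -2.0]
gate_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # (3, 2)
up_w = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]     # (3, 2)
down_w = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]    # (2, 3)
y = swiglu_mlp(x, gate_w, up_w, down_w)
print(y)  # output has the same length as the input (hidden dimension)
```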
Normalization:
| Parameter | Shape | Elements |
|---|---|---|
| input_layernorm.weight | (960,) | 960 |
| post_attention_layernorm.weight | (960,) | 960 |
| Norms Total | | 1,920 |
Layer Total: 9,832,320 parameters
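Each norm weight above is a per-channel scale applied after dividing by the root mean square of the input. The following is a minimal plain-Python sketch of that computation, using the eps=1e-05 listed in Section 2.1. The function name `rms_norm` is illustrative, and the sketch ignores dtype handling that a real implementation performs.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """RMSNorm: scale x by 1 / sqrt(mean(x^2) + eps), then by a learned per-channel weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for v, w in zip(x, weight)]


# Toy 2-dimensional input with unit weights (the real vectors have 960 entries).
normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(normed)
```

Unlike LayerNorm, RMSNorm subtracts no mean and has no bias term, which is why each norm contributes exactly 960 parameters.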
3.3 Grouped Query Attention (GQA) Details
SmolLM2 uses GQA with a 3:1 ratio:
- 15 query heads (Q dimension: 960 = 15 x 64)
- 5 key-value heads (KV dimension: 320 = 5 x 64)
- Each KV head is shared by 3 query heads
- This reduces KV cache memory by ~67% vs standard MHA
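The head sharing and the cache saving can be checked with a short calculation. This sketch assumes the grouping used by the Hugging Face Llama implementation, where consecutive query heads share a KV head, and a 2-byte (fp16/bf16) cache entry. The function names are illustrative.

```python
NUM_Q_HEADS, NUM_KV_HEADS, HEAD_DIM, NUM_LAYERS = 15, 5, 64, 32
GROUP_SIZE = NUM_Q_HEADS // NUM_KV_HEADS  # 3 query heads per KV head


def kv_head_for(q_head: int) -> int:
    """KV head that a given query head reads from (consecutive grouping assumed)."""
    return q_head // GROUP_SIZE


def kv_cache_bytes(seq_len: int, num_kv_heads: int, bytes_per_value: int = 2) -> int:
    """Size of the K and V caches across all layers for one sequence."""
    return 2 * NUM_LAYERS * num_kv_heads * HEAD_DIM * seq_len * bytes_per_value


mapping = [kv_head_for(q) for q in range(NUM_Q_HEADS)]
gqa = kv_cache_bytes(8192, NUM_KV_HEADS)   # this model
mha = kv_cache_bytes(8192, NUM_Q_HEADS)    # hypothetical MHA variant with 15 KV heads
print(mapping)
print(f"GQA: {gqa / 2**20:.0f} MiB, MHA: {mha / 2**20:.0f} MiB, saving: {1 - gqa / mha:.1%}")
```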
4. Tokenization Analysis
4.1 Arithmetic Expression Tokenization
Test input: "47 + 86"
| Position | Token ID | Token | Description |
|---|---|---|---|
| 0 | 36 | '4' | First digit of operand A |
| 1 | 39 | '7' | Second digit of operand A |
| 2 | 1232 | ' +' | Space + plus sign |
| 3 | 216 | ' ' | Space between operator and operand B |
| 4 | 40 | '8' | First digit of operand B |
| 5 | 38 | '6' | Second digit of operand B |
4.2 Digit Token Mappings
| Digit | Token ID |
|---|---|
| 0 | 32 |
| 1 | 33 |
| 2 | 34 |
| 3 | 35 |
| 4 | 36 |
| 5 | 37 |
| 6 | 38 |
| 7 | 39 |
| 8 | 40 |
| 9 | 41 |
Key observations:
- Digits are tokenized individually (no multi-digit tokens like "47")
- Digit tokens are sequential: ID = 32 + digit_value
- Space-prefixed operators exist (e.g., ' +' = 1232)
- '=' has token ID 45
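The sequential digit mapping makes it possible to predict token IDs for simple expressions without calling the tokenizer. The sketch below uses only the IDs reported in the tables above and generalizes from the single "47 + 86" example, so its output should be checked against the real tokenizer before being relied on. All function and constant names are illustrative.

```python
DIGIT_BASE_ID = 32   # token ID of '0'; digit d maps to 32 + d
PLUS_ID = 1232       # ' +' (space-prefixed plus)
SPACE_ID = 216       # ' '


def digit_token_ids(number: int) -> list[int]:
    """Token IDs for a non-negative integer, one token per digit."""
    if number < 0:
        raise ValueError("only non-negative integers are covered by the digit table")
    return [DIGIT_BASE_ID + int(ch) for ch in str(number)]


def token_ids_to_number(ids: list[int]) -> int:
    """Inverse of digit_token_ids: compose an integer from digit token IDs."""
    value = 0
    for token_id in ids:
        digit = token_id - DIGIT_BASE_ID
        if not 0 <= digit <= 9:
            raise ValueError(f"token ID {token_id} is not a digit token")
        value = value * 10 + digit
    return value


def expected_addition_ids(a: int, b: int) -> list[int]:
    """Predicted token IDs for the string 'A + B', following the observed pattern."""
    return digit_token_ids(a) + [PLUS_ID, SPACE_ID] + digit_token_ids(b)


print(expected_addition_ids(47, 86))
```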
4.3 Implications for Bit Extraction
The digit-by-digit tokenization means:
- For "47 + 86", operand A spans positions [0,1] and operand B spans positions [4,5]
- The model must learn to:
  - Recognize digit boundaries
  - Compose multi-digit numbers from sequential tokens
  - Perform arithmetic across token positions
- Hidden states at digit positions contain the numerical representation
5. Hidden State Analysis
5.1 Hidden State Output Structure
When running with output_hidden_states=True:
- Returns 33 hidden states (embedding + 32 layer outputs)
- Each has shape: (batch_size, seq_len, hidden_dim)
- For "47 + 86": (1, 6, 960)
5.2 Hidden State Statistics by Layer
| Layer | Mean | Std Dev | Min | Max |
|---|---|---|---|---|
| Embedding | -0.001 | 0.105 | -0.44 | 1.77 |
| Layer 0 | -0.127 | 2.55 | -80.8 | 19.0 |
| Layer 1 | -0.171 | 3.70 | -161 | 39.7 |
| Layer 2 | -0.151 | 3.67 | -102 | 61.4 |
| Layer 3 | -1.13 | 327 | -21,722 | 11,856 |
| Layer 4-12 | ~-1.3 | ~327 | ~-21,700 | ~11,800 |
| Layer 13-26 | ~-1.5 | ~337 | ~-22,400 | ~12,100 |
| Layer 27-30 | ~-1.8 | ~310 | ~-20,000 | ~11,800 |
| Layer 31 | 0.017 | 1.34 | -18.9 | 34.3 |
Key observations:
- Dramatic variance explosion at Layer 3: Std dev jumps from ~4 to ~327
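- Stable middle layers (4-26): statistics stay nearly constant (std dev ~327-337), which may indicate ongoing computation at a fixed scale, although the statistics alone do not establish this
- Compression at final layer: std dev drops to 1.34 at Layer 31. In the Hugging Face Llama implementation the last entry of hidden_states is taken after the final RMSNorm, which likely accounts for the drop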
- Layer 31 is well-scaled for downstream processing
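A table like the one above can be produced by summarizing each entry of `hidden_states` separately. The following is a minimal sketch in plain Python that operates on flat lists of floats; with a real model run, each list could come from something like `h.flatten().tolist()` for each `h` in `outputs.hidden_states`. The helper names and toy data are illustrative. The sketch uses the population standard deviation, which is an assumption: the source does not state which estimator was used, and PyTorch's `std` applies Bessel's correction by default.

```python
import statistics


def summarize(values):
    """Return mean, std dev, min, and max for a flat sequence of floats."""
    values = list(values)
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # population std (assumption, see lead-in)
        "min": min(values),
        "max": max(values),
    }


def summarize_layers(layers):
    """layers[0] is the embedding output; layers[k + 1] is decoder layer k."""
    rows = []
    for index, values in enumerate(layers):
        label = "Embedding" if index == 0 else f"Layer {index - 1}"
        rows.append((label, summarize(values)))
    return rows


# Toy stand-in data, not real activations.
toy_layers = [[-0.1, 0.0, 0.1], [-2.0, 0.5, 3.0]]
rows = summarize_layers(toy_layers)
for label, stats in rows:
    print(f"{label:<10} mean={stats['mean']:+.3f} std={stats['std']:.3f} "
          f"min={stats['min']:+.2f} max={stats['max']:+.2f}")
```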
5.3 Extraction Point Candidates
| Layer Range | Characteristics | Suitability |
|---|---|---|
| 0-2 (Early) | Low variance, close to embeddings | Poor - minimal computation |
| 3-12 (Early-Mid) | High variance, initial processing | Moderate - may contain raw numerical features |
| 13-26 (Middle) | Stable high variance | Good - computation in progress |
| 27-30 (Late) | Variance compression begins | Good - refined representations |
| 31 (Final) | Well-normalized output | Best - final representation before LM head |
6. Relevance to 8bit-Threshold-Computer Project
6.1 Hidden Dimension Match
The hidden dimension of 960 exactly matches our extractor input requirement. This is fortuitous as it means:
- No projection layer needed to interface with our bit extractor
- Direct extraction from any layer's hidden states
- Full utilization of the model's representational capacity
6.2 Recommended Extraction Strategy
    import torch

    def extract_hidden_state(model, tokenizer, expression, layer=-1):
        """
        Extract hidden state for bit extraction.

        Args:
            layer: Index into outputs.hidden_states (default: -1).
                   0 = embedding output, k + 1 = output of layer k,
                   -1 = Layer 31 (final, pre-LM-head)
        Returns:
            Tensor of shape (960,) for the last token position
        """
        inputs = tokenizer(expression, return_tensors="pt")
        with torch.no_grad():  # inference only, no gradients needed
            outputs = model(**inputs, output_hidden_states=True)
        # hidden_states[0] = embedding, hidden_states[1] = layer 0, ...
        # hidden_states[32] = layer 31 (final)
        hidden = outputs.hidden_states[layer]  # (1, seq_len, 960)
        # Extract last token position for autoregressive prediction
        return hidden[0, -1, :]  # (960,)
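Because `hidden_states` has 33 entries (the embedding output plus 32 layer outputs), layer numbers and tuple indices differ by one. The helpers below are a small sketch for converting between the two; the names `hidden_state_index` and `describe_index` are hypothetical and not part of the project code.

```python
NUM_LAYERS = 32
NUM_HIDDEN_STATES = NUM_LAYERS + 1  # embedding output + one entry per layer


def hidden_state_index(layer: int) -> int:
    """Map a decoder layer number (0-31) to its index in outputs.hidden_states."""
    if not 0 <= layer < NUM_LAYERS:
        raise ValueError(f"layer must be in [0, {NUM_LAYERS - 1}], got {layer}")
    return layer + 1


def describe_index(index: int) -> str:
    """Describe what a (possibly negative) hidden_states index refers to."""
    if not -NUM_HIDDEN_STATES <= index < NUM_HIDDEN_STATES:
        raise IndexError(f"index {index} is out of range for {NUM_HIDDEN_STATES} entries")
    resolved = index % NUM_HIDDEN_STATES
    return "embedding" if resolved == 0 else f"layer {resolved - 1}"


print(hidden_state_index(31), describe_index(-1), describe_index(0))
```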
6.3 Token Position Analysis
For arithmetic expressions like "A + B":
    Tokens:    [d1] [d2] [ +] [ ]  [d3] [d4]
    Positions:  0    1    2    3    4    5
    Operand A: positions 0 to (plus_pos - 1)
    Operator:  position where ' +' token appears
    Operand B: positions (plus_pos + 2) to end
Strategy for operand extraction:
- Find the ' +' token (ID 1232) position
- Collect hidden states at digit positions before it (operand A)
- Collect hidden states at digit positions after it (operand B)
- Consider pooling (mean, max) or concatenating digit hidden states
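The position lookup and mean pooling can be sketched on plain Python lists, using the token IDs from Section 4. The function names are illustrative. In practice the vectors passed to `mean_pool` would be rows of a `(seq_len, 960)` hidden-state tensor rather than the toy lists used here.

```python
PLUS_ID = 1232
DIGIT_IDS = range(32, 42)  # token IDs for '0'..'9'


def operand_positions(token_ids):
    """Return (positions of operand A digits, positions of operand B digits)."""
    plus_pos = token_ids.index(PLUS_ID)  # raises ValueError if there is no ' +' token
    a = [i for i in range(plus_pos) if token_ids[i] in DIGIT_IDS]
    b = [i for i in range(plus_pos + 1, len(token_ids)) if token_ids[i] in DIGIT_IDS]
    return a, b


def mean_pool(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


token_ids = [36, 39, 1232, 216, 40, 38]  # "47 + 86"
a_pos, b_pos = operand_positions(token_ids)
print(a_pos, b_pos)

# Toy 2-dimensional "hidden states", one per token position.
states = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0], [5.0, 5.0], [7.0, 9.0]]
pooled_a = mean_pool([states[i] for i in a_pos])
pooled_b = mean_pool([states[i] for i in b_pos])
```

Filtering by digit ID rather than using a fixed offset skips the space token after the operator without hard-coding its position.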
6.4 Attention Pattern Utilization
With GQA (15 query heads, 5 KV heads), we can analyze attention patterns to:
- Identify which positions attend to operand digits
- Determine if the model explicitly links corresponding digit positions
- Find heads that specialize in numerical reasoning
    import torch

    def get_attention_weights(model, tokenizer, expression):
        inputs = tokenizer(expression, return_tensors="pt")
        # May require loading the model with attn_implementation="eager"
        with torch.no_grad():
            outputs = model(**inputs, output_attentions=True)
        # attentions: tuple of (batch, num_heads, seq_len, seq_len) per layer
        return outputs.attentions
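One simple way to look for heads that focus on operand digits is to sum, for each head, the attention that the last position assigns to the digit positions. The sketch below works on plain lists; with a real run, the per-head rows for one layer could be obtained with something like `attentions[layer][0, :, -1, :].tolist()`. The function names and toy weights are illustrative.

```python
def attention_mass(attn_row, positions):
    """Total attention weight that one query position assigns to `positions`."""
    return sum(attn_row[p] for p in positions)


def rank_heads(rows_by_head, positions):
    """Sort (head, mass) pairs by attention mass on `positions`, largest first."""
    scored = [(head, attention_mass(row, positions))
              for head, row in enumerate(rows_by_head)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy attention rows for the last query position of "47 + 86" (3 heads, 6 keys).
rows_by_head = [
    [0.10, 0.10, 0.30, 0.30, 0.10, 0.10],
    [0.40, 0.30, 0.05, 0.05, 0.10, 0.10],
    [0.00, 0.00, 0.50, 0.50, 0.00, 0.00],
]
digit_positions = [0, 1, 4, 5]
ranked = rank_heads(rows_by_head, digit_positions)
for head, mass in ranked:
    print(f"head {head}: {mass:.2f} of attention on operand digits")
```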
6.5 Extraction Interface Specification
For integration with the threshold computer:
    import torch

    class SmolLM2Extractor:
        """Interface between SmolLM2 and threshold-based bit extraction."""

        def __init__(self, model, tokenizer, extraction_layer=31):
            self.model = model
            self.tokenizer = tokenizer
            self.layer = extraction_layer + 1  # +1 because index 0 is embedding

        def get_hidden_state(self, text: str) -> torch.Tensor:
            """
            Returns: Tensor of shape (960,) ready for bit extractor
            """
            tokens = self.tokenizer(text, return_tensors="pt")
            with torch.no_grad():
                outputs = self.model(**tokens, output_hidden_states=True)
            return outputs.hidden_states[self.layer][0, -1, :]

        def get_all_position_states(self, text: str) -> torch.Tensor:
            """
            Returns: Tensor of shape (seq_len, 960) for all positions
            """
            tokens = self.tokenizer(text, return_tensors="pt")
            with torch.no_grad():
                outputs = self.model(**tokens, output_hidden_states=True)
            return outputs.hidden_states[self.layer][0]
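This document does not specify the output format of the bit extractor. As a purely hypothetical illustration, the sketch below assumes the target is the 8-bit binary encoding of the arithmetic result, most significant bit first. The helper names `to_bits` and `from_bits` are assumptions, not project code.

```python
def to_bits(value: int, width: int = 8) -> list[int]:
    """Encode a non-negative integer as `width` bits, most significant bit first."""
    if not 0 <= value < 2 ** width:
        raise ValueError(f"{value} does not fit in {width} bits")
    return [(value >> shift) & 1 for shift in range(width - 1, -1, -1)]


def from_bits(bits: list[int]) -> int:
    """Decode an MSB-first bit list back into an integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


target = to_bits(47 + 86)  # 133
print(target)
```

Under this assumption, sums above 255 do not fit in 8 bits and would need a carry bit or a wider encoding.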
7. Complete Weight Inventory Table
7.1 All Named Parameters
    EMBEDDING (47,185,920 params - 13.04%)
        model.embed_tokens.weight                       (49152, 960)  47,185,920
    LAYER 0 (9,832,320 params - 2.72%)
      Attention (2,457,600):
        model.layers.0.self_attn.q_proj.weight          (960, 960)       921,600
        model.layers.0.self_attn.k_proj.weight          (320, 960)       307,200
        model.layers.0.self_attn.v_proj.weight          (320, 960)       307,200
        model.layers.0.self_attn.o_proj.weight          (960, 960)       921,600
      MLP (7,372,800):
        model.layers.0.mlp.gate_proj.weight             (2560, 960)    2,457,600
        model.layers.0.mlp.up_proj.weight               (2560, 960)    2,457,600
        model.layers.0.mlp.down_proj.weight             (960, 2560)    2,457,600
      Norms (1,920):
        model.layers.0.input_layernorm.weight           (960,)               960
        model.layers.0.post_attention_layernorm.weight  (960,)               960
    [Layers 1-31 follow identical structure, each with 9,832,320 params]
    FINAL NORM (960 params - 0.00%)
        model.norm.weight                               (960,)               960
    LM HEAD (tied with embed_tokens)
        lm_head.weight                                  (49152, 960)    [shared]
7.2 Summary by Component Type
| Component Type | Count | Params Each | Total Params |
|---|---|---|---|
| Embedding | 1 | 47,185,920 | 47,185,920 |
| Q Projection | 32 | 921,600 | 29,491,200 |
| K Projection | 32 | 307,200 | 9,830,400 |
| V Projection | 32 | 307,200 | 9,830,400 |
| O Projection | 32 | 921,600 | 29,491,200 |
| Gate Projection | 32 | 2,457,600 | 78,643,200 |
| Up Projection | 32 | 2,457,600 | 78,643,200 |
| Down Projection | 32 | 2,457,600 | 78,643,200 |
| Input LayerNorm | 32 | 960 | 30,720 |
| Post-Attn LayerNorm | 32 | 960 | 30,720 |
| Final LayerNorm | 1 | 960 | 960 |
| Total | | | 361,821,120 |
8. Configuration Reference
Complete model configuration from HuggingFace:
    {
      "architectures": ["LlamaForCausalLM"],
      "attention_bias": false,
      "attention_dropout": 0.0,
      "bos_token_id": 1,
      "eos_token_id": 2,
      "pad_token_id": 2,
      "head_dim": 64,
      "hidden_act": "silu",
      "hidden_size": 960,
      "initializer_range": 0.02,
      "intermediate_size": 2560,
      "max_position_embeddings": 8192,
      "mlp_bias": false,
      "model_type": "llama",
      "num_attention_heads": 15,
      "num_hidden_layers": 32,
      "num_key_value_heads": 5,
      "pretraining_tp": 1,
      "rms_norm_eps": 1e-05,
      "rope_interleaved": false,
      "rope_theta": 100000,
      "tie_word_embeddings": true,
      "use_cache": true,
      "vocab_size": 49152
    }
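The weight shapes listed in Section 3 follow from a handful of these configuration fields. The sketch below derives them from a subset of the config and checks two relationships that hold for this model. The helper names `check_config` and `derived_shapes` are illustrative, and the checks describe this configuration rather than requirements of the Llama architecture in general.

```python
CONFIG = {
    "hidden_size": 960,
    "intermediate_size": 2560,
    "num_attention_heads": 15,
    "num_key_value_heads": 5,
    "head_dim": 64,
    "num_hidden_layers": 32,
}


def check_config(cfg):
    """Return a list of human-readable problems; empty if the relationships hold."""
    problems = []
    if cfg["num_attention_heads"] * cfg["head_dim"] != cfg["hidden_size"]:
        problems.append("num_attention_heads * head_dim differs from hidden_size")
    if cfg["num_attention_heads"] % cfg["num_key_value_heads"] != 0:
        problems.append("query heads are not an integer multiple of KV heads")
    return problems


def derived_shapes(cfg):
    """Weight shapes in PyTorch (out_features, in_features) layout."""
    hidden, inter = cfg["hidden_size"], cfg["intermediate_size"]
    q_dim = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]
    return {
        "q_proj": (q_dim, hidden), "k_proj": (kv_dim, hidden),
        "v_proj": (kv_dim, hidden), "o_proj": (hidden, q_dim),
        "gate_proj": (inter, hidden), "up_proj": (inter, hidden),
        "down_proj": (hidden, inter),
    }


problems = check_config(CONFIG)
shapes = derived_shapes(CONFIG)
print(problems, shapes)
```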
9. Appendix: PyTorch Model Structure
    LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(49152, 960, padding_idx=2)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear(in_features=960, out_features=960, bias=False)
              (k_proj): Linear(in_features=960, out_features=320, bias=False)
              (v_proj): Linear(in_features=960, out_features=320, bias=False)
              (o_proj): Linear(in_features=960, out_features=960, bias=False)
            )
            (mlp): LlamaMLP(
              (gate_proj): Linear(in_features=960, out_features=2560, bias=False)
              (up_proj): Linear(in_features=960, out_features=2560, bias=False)
              (down_proj): Linear(in_features=2560, out_features=960, bias=False)
              (act_fn): SiLUActivation()
            )
            (input_layernorm): LlamaRMSNorm((960,), eps=1e-05)
            (post_attention_layernorm): LlamaRMSNorm((960,), eps=1e-05)
          )
        )
        (norm): LlamaRMSNorm((960,), eps=1e-05)
        (rotary_emb): LlamaRotaryEmbedding()
      )
      (lm_head): Linear(in_features=960, out_features=49152, bias=False)
    )