CharlesCNorton

	# SmolLM2-360M-Instruct Architecture Analysis

	Technical reference document for the 8bit-threshold-computer LLM integration project.

	Model: `HuggingFaceTB/SmolLM2-360M-Instruct`
	Architecture: LlamaForCausalLM (Llama 2 variant)
	Tokenizer: GPT2TokenizerFast
	Analysis Date: 2026-01-21

	---

	## 1. Executive Summary

	SmolLM2-360M-Instruct is a 362M parameter causal language model using the Llama architecture. Key characteristics relevant to our bit extraction task:

	- Hidden dimension: 960 (matches our extractor input requirement)
	- 32 transformer layers providing multiple extraction points
	- Digit-level tokenization for numbers (each digit is a separate token)
	- Grouped Query Attention (GQA) with 15 query heads and 5 KV heads

	---

	## 2. Architecture Census

	### 2.1 Core Parameters

	\| Parameter \| Value \|
	\|-----------\|-------\|
	\| Total Parameters \| 361,821,120 (361.82M) \|
	\| Vocabulary Size \| 49,152 \|
	\| Hidden Dimension \| 960 \|
	\| Intermediate Dimension (MLP) \| 2,560 \|
	\| Number of Layers \| 32 \|
	\| Number of Attention Heads \| 15 \|
	\| Number of KV Heads \| 5 (Grouped Query Attention) \|
	\| Head Dimension \| 64 \|
	\| Max Sequence Length \| 8,192 \|
	\| Activation Function \| SiLU \|
	\| Normalization \| RMSNorm (eps=1e-05) \|
	\| Position Encoding \| RoPE (theta=100,000) \|
	\| Word Embedding Tying \| Yes (embed_tokens = lm_head) \|
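
To make the RoPE row concrete, the sketch below computes the per-pair rotation frequencies implied by theta = 100,000 and a head dimension of 64, using the standard RoPE formula `theta ** (-2i / head_dim)`. The function name is illustrative and not part of the project code.

```python
def rope_inverse_frequencies(head_dim: int = 64, theta: float = 100_000.0) -> list[float]:
    """Standard RoPE inverse frequencies: theta ** (-2i / head_dim) per dimension pair."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even: RoPE rotates dimensions in pairs")
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


inv_freq = rope_inverse_frequencies()
print(len(inv_freq))   # one frequency per dimension pair
print(inv_freq[0])     # fastest-rotating pair
print(inv_freq[-1])    # slowest-rotating pair
```

A larger theta stretches the slowest rotations, which is the usual motivation for raising it when the context window grows.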

	### 2.2 Architecture Diagram

	```
	Input Token IDs
	         |
	         v
	+------------------+
	| Embedding Layer  |  (49152, 960)
	+------------------+
	         |
	         v
	+------------------+
	| LlamaDecoderLayer|  x 32
	|  +-------------+ |
	|  | RMSNorm     | |
	|  +-------------+ |
	|  | Self-Attn   | |  Q: (960, 960), K: (960, 320), V: (960, 320), O: (960, 960)
	|  +-------------+ |
	|  | Residual    | |
	|  +-------------+ |
	|  | RMSNorm     | |
	|  +-------------+ |
	|  | MLP (SwiGLU)| |  gate: (960, 2560), up: (960, 2560), down: (2560, 960)
	|  +-------------+ |
	|  | Residual    | |
	|  +-------------+ |
	+------------------+
	         |
	         v
	+------------------+
	| Final RMSNorm    |  (960,)
	+------------------+
	         |
	         v
	+------------------+
	| LM Head          |  (960, 49152) - tied with embeddings
	+------------------+
	         |
	         v
	Logits (batch, seq, 49152)
	```

	### 2.3 Parameter Distribution

	\| Component \| Parameters \| Percentage \|
	\|-----------\|-----------\|------------\|
	\| Embedding \| 47,185,920 \| 13.04% \|
	\| All Attention Layers \| 78,643,200 \| 21.74% \|
	\| All MLP Layers \| 235,929,600 \| 65.21% \|
	\| All Layer Norms \| 61,440 \| 0.02% \|
	\| Final Norm \| 960 \| 0.00% \|

	Per-layer breakdown (each of 32 layers):
	- Attention: 2,457,600 params (0.68%)
	- MLP: 7,372,800 params (2.04%)
	- Norms: 1,920 params (0.00%)
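
The counts in this section can be rederived from the dimensions in Section 2.1. The following is a minimal sketch, assuming bias-free linear layers and tied input/output embeddings (both consistent with the configuration in Section 8); the function name is hypothetical.

```python
def count_parameters(vocab=49_152, hidden=960, intermediate=2_560,
                     layers=32, q_heads=15, kv_heads=5, head_dim=64):
    """Recompute parameter counts from the dimensions in Section 2.1.

    Assumes bias-free linear layers and tied input/output embeddings,
    so the LM head contributes no parameters of its own.
    """
    q_dim = q_heads * head_dim      # 960
    kv_dim = kv_heads * head_dim    # 320
    attention = 2 * hidden * q_dim + 2 * hidden * kv_dim  # q and o, then k and v
    mlp = 3 * hidden * intermediate                       # gate, up, down
    norms = 2 * hidden                                    # two RMSNorm weights per layer
    counts = {
        "embedding": vocab * hidden,
        "attention": layers * attention,
        "mlp": layers * mlp,
        "layer_norms": layers * norms,
        "final_norm": hidden,
    }
    counts["total"] = sum(counts.values())
    return counts


counts = count_parameters()
for name, value in counts.items():
    print(f"{name:12s} {value:>12,d} {value / counts['total']:7.2%}")
```

Running the loop prints each component's count next to its share of the total, which makes it easy to cross-check the table by hand.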

	---

	## 3. Weight Inventory

	### 3.1 Embedding and Output Layers

	\| Parameter Name \| Shape \| Elements \| Notes \|
	\|---------------\|-------\|----------\|-------\|
	\| `model.embed_tokens.weight` \| (49152, 960) \| 47,185,920 \| Token embeddings \|
	\| `model.norm.weight` \| (960,) \| 960 \| Final layer norm \|
	\| `lm_head.weight` \| (49152, 960) \| (tied) \| Tied to embed_tokens \|

	### 3.2 Single Transformer Layer Structure

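	Each of the 32 layers (`model.layers.{0-31}`) contains the weights below. Shapes follow the PyTorch `nn.Linear` convention of `(out_features, in_features)`, whereas the diagram in Section 2.2 lists each projection as (input dim, output dim):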

	Attention Block:
	\| Parameter \| Shape \| Elements \|
	\|-----------\|-------\|----------\|
	\| `self_attn.q_proj.weight` \| (960, 960) \| 921,600 \|
	\| `self_attn.k_proj.weight` \| (320, 960) \| 307,200 \|
	\| `self_attn.v_proj.weight` \| (320, 960) \| 307,200 \|
	\| `self_attn.o_proj.weight` \| (960, 960) \| 921,600 \|
	\| Attention Total \| \| 2,457,600 \|

	MLP Block (SwiGLU):
	\| Parameter \| Shape \| Elements \|
	\|-----------\|-------\|----------\|
	\| `mlp.gate_proj.weight` \| (2560, 960) \| 2,457,600 \|
	\| `mlp.up_proj.weight` \| (2560, 960) \| 2,457,600 \|
	\| `mlp.down_proj.weight` \| (960, 2560) \| 2,457,600 \|
	\| MLP Total \| \| 7,372,800 \|

	Normalization:
	\| Parameter \| Shape \| Elements \|
	\|-----------\|-------\|----------\|
	\| `input_layernorm.weight` \| (960,) \| 960 \|
	\| `post_attention_layernorm.weight` \| (960,) \| 960 \|
	\| Norms Total \| \| 1,920 \|

	Layer Total: 9,832,320 parameters

	### 3.3 Grouped Query Attention (GQA) Details

	SmolLM2 uses GQA with a 3:1 ratio:
	- 15 query heads (Q dimension: 960 = 15 x 64)
	- 5 key-value heads (KV dimension: 320 = 5 x 64)
	- Each KV head is shared by 3 query heads
	- This reduces KV cache memory by ~67% vs standard MHA
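
The ~67% figure follows directly from the head counts. As a rough sketch, the per-token KV cache size can be compared against a hypothetical MHA baseline with 15 KV heads; the 2 bytes per value (fp16/bf16) is an assumption, and the ratio does not depend on it.

```python
def kv_cache_bytes_per_token(kv_heads, head_dim=64, layers=32, bytes_per_value=2):
    """Bytes of KV cache one token occupies: keys plus values, across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


gqa = kv_cache_bytes_per_token(kv_heads=5)    # SmolLM2 as configured
mha = kv_cache_bytes_per_token(kv_heads=15)   # hypothetical MHA baseline
reduction = 1 - gqa / mha

print(f"GQA: {gqa:,} bytes/token")
print(f"MHA: {mha:,} bytes/token")
print(f"Reduction: {reduction:.1%}")
```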

	---

	## 4. Tokenization Analysis

	### 4.1 Arithmetic Expression Tokenization

	Test input: `"47 + 86"`

	\| Position \| Token ID \| Token \| Description \|
	\|----------\|----------\|-------\|-------------\|
	\| 0 \| 36 \| `'4'` \| First digit of operand A \|
	\| 1 \| 39 \| `'7'` \| Second digit of operand A \|
	\| 2 \| 1232 \| `' +'` \| Space + plus sign \|
	\| 3 \| 216 \| `' '` \| Space separating the operator from operand B \|
	\| 4 \| 40 \| `'8'` \| First digit of operand B \|
	\| 5 \| 38 \| `'6'` \| Second digit of operand B \|

	### 4.2 Digit Token Mappings

	\| Digit \| Token ID \|
	\|-------\|----------\|
	\| 0 \| 32 \|
	\| 1 \| 33 \|
	\| 2 \| 34 \|
	\| 3 \| 35 \|
	\| 4 \| 36 \|
	\| 5 \| 37 \|
	\| 6 \| 38 \|
	\| 7 \| 39 \|
	\| 8 \| 40 \|
	\| 9 \| 41 \|

	Key observations:
	- Digits are tokenized individually (no multi-digit tokens like "47")
	- Digit tokens are sequential: ID = 32 + digit_value
	- Space-prefixed operators exist (e.g., `' +'` = 1232)
	- `'='` has token ID 45
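
The observations above can be turned into a small lookup that reproduces the token IDs for `"47 + 86"` without calling the tokenizer. This is only a sketch built from the tables in this section: the helper names are hypothetical, and extending the same layout to other operands is an assumption that should be checked against the real tokenizer.

```python
DIGIT_BASE_ID = 32   # token ID of '0' (Section 4.2)
PLUS_ID = 1232       # ' +' (space-prefixed plus)
SPACE_ID = 216       # ' '


def digit_token_ids(number: str) -> list[int]:
    """Map a string of decimal digits to token IDs via ID = 32 + digit_value."""
    if not number or any(ch not in "0123456789" for ch in number):
        raise ValueError(f"expected a non-empty string of ASCII digits, got {number!r}")
    return [DIGIT_BASE_ID + (ord(ch) - ord("0")) for ch in number]


def addition_token_ids(a: str, b: str) -> list[int]:
    """Token IDs for 'a + b', following the layout observed for '47 + 86'."""
    return digit_token_ids(a) + [PLUS_ID, SPACE_ID] + digit_token_ids(b)


print(addition_token_ids("47", "86"))  # [36, 39, 1232, 216, 40, 38]
```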

	### 4.3 Implications for Bit Extraction

	The digit-by-digit tokenization means:
	1. For `"47 + 86"`, operand A spans positions [0,1] and operand B spans positions [4,5]
	2. The model must learn to:
	   - Recognize digit boundaries
	   - Compose multi-digit numbers from sequential tokens
	   - Perform arithmetic across token positions
	3. Because no single token encodes a whole operand, any numerical representation must be carried by the hidden states at (or after) the digit positions

	---

	## 5. Hidden State Analysis

	### 5.1 Hidden State Output Structure

	When running with `output_hidden_states=True`:
	- Returns 33 hidden states (embedding + 32 layer outputs)
	- Each has shape: `(batch_size, seq_len, hidden_dim)`
	- For `"47 + 86"`: `(1, 6, 960)`
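
Because the tuple starts with the embedding output, layer numbers and tuple indices are off by one. A tiny helper (hypothetical name) makes that mapping explicit:

```python
NUM_LAYERS = 32


def hidden_state_index(layer: int) -> int:
    """Index into outputs.hidden_states for the output of decoder layer `layer` (0-based)."""
    if not 0 <= layer < NUM_LAYERS:
        raise ValueError(f"layer must be in [0, {NUM_LAYERS - 1}], got {layer}")
    return layer + 1  # index 0 holds the embedding output


print(hidden_state_index(0))    # 1
print(hidden_state_index(31))   # 32, the same entry as hidden_states[-1]
```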

	### 5.2 Hidden State Statistics by Layer

	\| Layer \| Mean \| Std Dev \| Min \| Max \|
	\|-------\|------\|---------\|-----\|-----\|
	\| Embedding \| -0.001 \| 0.105 \| -0.44 \| 1.77 \|
	\| Layer 0 \| -0.127 \| 2.55 \| -80.8 \| 19.0 \|
	\| Layer 1 \| -0.171 \| 3.70 \| -161 \| 39.7 \|
	\| Layer 2 \| -0.151 \| 3.67 \| -102 \| 61.4 \|
	\| Layer 3 \| -1.13 \| 327 \| -21,722 \| 11,856 \|
	\| Layer 4-12 \| ~-1.3 \| ~327 \| ~-21,700 \| ~11,800 \|
	\| Layer 13-26 \| ~-1.5 \| ~337 \| ~-22,400 \| ~12,100 \|
	\| Layer 27-30 \| ~-1.8 \| ~310 \| ~-20,000 \| ~11,800 \|
	\| Layer 31 \| 0.017 \| 1.34 \| -18.9 \| 34.3 \|

	Key observations:
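	1. Dramatic variance explosion at Layer 3: Std dev jumps from ~3.7 to ~327, and the min/max values (about -21,700 and 11,900) point to a few extreme outlier activations rather than a uniform increase
	2. Stable middle layers (4-26): Statistics stay nearly constant, so these large-magnitude components persist in the residual stream from layer to layer
	3. Compression at the final entry: Std dev drops to 1.34 at Layer 31. In the Hugging Face Llama implementation the last `hidden_states` entry is taken after the final RMSNorm, which likely accounts for most of this drop
	4. Layer 31 is well-scaled for downstream processing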
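
One plausible reading of the sharp drop at the last entry is that it reflects RMSNorm rescaling rather than anything Layer 31 computes on its own. The sketch below illustrates the effect on a made-up vector with two extreme outliers; it uses unit norm weights (the real model has learned weights) and synthetic values, so it demonstrates the mechanism only, not the measured statistics.

```python
import math


def rms(values):
    """Root mean square of a sequence of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def rms_norm(values, eps=1e-5):
    """RMSNorm with unit weights: x / sqrt(mean(x^2) + eps)."""
    scale = 1.0 / math.sqrt(sum(v * v for v in values) / len(values) + eps)
    return [v * scale for v in values]


# Hypothetical 960-dim vector: small values plus two extreme outliers.
vector = [0.5 * math.sin(i) for i in range(960)]
vector[10], vector[500] = -21_000.0, 11_000.0

normalized = rms_norm(vector)
print(f"RMS before: {rms(vector):.1f}")
print(f"RMS after:  {rms(normalized):.4f}")
```

After normalization the RMS is close to 1 regardless of how large the outliers were, which matches the order of magnitude seen in the Layer 31 row.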

	### 5.3 Extraction Point Candidates

	\| Layer Range \| Characteristics \| Suitability \|
	\|-------------\|-----------------\|-------------\|
	\| 0-2 (Early) \| Low variance, close to embeddings \| Poor - minimal computation \|
	\| 3-12 (Early-Mid) \| High variance, initial processing \| Moderate - may contain raw numerical features \|
	\| 13-26 (Middle) \| Stable high variance \| Good - computation in progress \|
	\| 27-30 (Late) \| Variance compression begins \| Good - refined representations \|
	\| 31 (Final) \| Well-normalized output \| Best - final representation before LM head \|

	---

	## 6. Relevance to 8bit-Threshold-Computer Project

	### 6.1 Hidden Dimension Match

	The hidden dimension of 960 exactly matches our extractor input requirement. This is fortuitous as it means:
	- No projection layer needed to interface with our bit extractor
	- Direct extraction from any layer's hidden states
	- Full utilization of the model's representational capacity

	### 6.2 Recommended Extraction Strategy

	```python
	import torch


	def extract_hidden_state(model, tokenizer, expression, layer=-1):
	    """
	    Extract hidden state for bit extraction.

	    Args:
	        model: Loaded SmolLM2 causal LM
	        tokenizer: Matching tokenizer
	        expression: Input text, e.g. "47 + 86"
	        layer: Index into outputs.hidden_states (default: -1)
	            0 = embedding output, k + 1 = output of layer k,
	            -1 = Layer 31 (final, pre-LM-head)

	    Returns:
	        Tensor of shape (960,) for the last token position
	    """
	    inputs = tokenizer(expression, return_tensors="pt")
	    with torch.no_grad():  # inference only; no gradients needed
	        outputs = model(**inputs, output_hidden_states=True)

	    # hidden_states[0] = embedding, hidden_states[1] = layer 0, ...
	    # hidden_states[32] = layer 31 (final)
	    hidden = outputs.hidden_states[layer]  # (1, seq_len, 960)

	    # Extract last token position for autoregressive prediction
	    return hidden[0, -1, :]  # (960,)
	```

	### 6.3 Token Position Analysis

	For arithmetic expressions like `"A + B"`:

	```
	Tokens:   [d1] [d2] [ +] [ ] [d3] [d4]
	Positions: 0    1    2    3   4    5

	Operand A: positions 0 to (plus_pos - 1)
	Operator: position where ' +' token appears
	Operand B: positions (plus_pos + 2) to end
	```

	Strategy for operand extraction:
	1. Find the `' +'` token (ID 1232) position
	2. Collect hidden states at digit positions before it (operand A)
	3. Collect hidden states at digit positions after it (operand B)
	4. Consider pooling (mean, max) or concatenating digit hidden states
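
The steps above can be sketched on plain token-ID lists. The helper names are hypothetical, every position before `' +'` is assumed to be a digit of operand A, and the toy 3-dimensional vectors stand in for real `(seq_len, 960)` hidden states.

```python
PLUS_ID = 1232  # ' +' token ID from Section 4.2


def operand_positions(token_ids):
    """Return (positions of operand A, positions of operand B) for an 'A + B' sequence."""
    plus_pos = token_ids.index(PLUS_ID)  # raises ValueError if ' +' is absent
    a_positions = list(range(0, plus_pos))
    b_positions = list(range(plus_pos + 2, len(token_ids)))  # skip the ' ' after ' +'
    return a_positions, b_positions


def mean_pool(vectors):
    """Element-wise mean of equal-length vectors (one per digit position)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


token_ids = [36, 39, 1232, 216, 40, 38]  # "47 + 86"
a_pos, b_pos = operand_positions(token_ids)
print(a_pos, b_pos)  # [0, 1] [4, 5]

# Toy per-position states; replace with rows of the real hidden-state tensor.
states = [[float(p), 1.0, -1.0] for p in range(len(token_ids))]
pooled_a = mean_pool([states[p] for p in a_pos])
pooled_b = mean_pool([states[p] for p in b_pos])
print(pooled_a, pooled_b)
```

Mean pooling discards digit order, so concatenation may be the better choice when place value matters.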

	### 6.4 Attention Pattern Utilization

	With GQA (15 query heads, 5 KV heads), we can analyze attention patterns to:
	1. Identify which positions attend to operand digits
	2. Determine if the model explicitly links corresponding digit positions
	3. Find heads that specialize in numerical reasoning

	```python
	import torch

	def get_attention_weights(model, tokenizer, expression):
	    # May require loading the model with attn_implementation="eager"
	    inputs = tokenizer(expression, return_tensors="pt")
	    with torch.no_grad():
	        outputs = model(**inputs, output_attentions=True)
	    # attentions: one (batch, 15, seq_len, seq_len) tensor per layer (15 query heads)
	    return outputs.attentions
	```
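
When comparing heads, it helps to know which query heads share a KV head. The sketch below assumes consecutive grouping (query heads 0-2 share KV head 0, and so on), which is how the Hugging Face Llama implementation expands KV heads as far as I can tell; treat the mapping as an assumption to verify against the installed version.

```python
NUM_QUERY_HEADS = 15
NUM_KV_HEADS = 5
GROUP_SIZE = NUM_QUERY_HEADS // NUM_KV_HEADS  # 3 query heads per KV head


def kv_head_for_query_head(query_head: int) -> int:
    """KV head index used by a query head, assuming consecutive grouping."""
    if not 0 <= query_head < NUM_QUERY_HEADS:
        raise ValueError(f"query_head must be in [0, {NUM_QUERY_HEADS - 1}]")
    return query_head // GROUP_SIZE


groups = {}
for head in range(NUM_QUERY_HEADS):
    groups.setdefault(kv_head_for_query_head(head), []).append(head)
print(groups)
```

Heads in the same group see identical keys and values, so differences in their attention maps come from the query projections alone.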

	### 6.5 Extraction Interface Specification

	For integration with the threshold computer:

	```python
	import torch


	class SmolLM2Extractor:
	    """Interface between SmolLM2 and threshold-based bit extraction."""

	    def __init__(self, model, tokenizer, extraction_layer=31):
	        self.model = model
	        self.tokenizer = tokenizer
	        self.layer = extraction_layer + 1  # +1 because index 0 is embedding

	    def get_hidden_state(self, text: str) -> torch.Tensor:
	        """
	        Returns: Tensor of shape (960,) ready for bit extractor
	        """
	        tokens = self.tokenizer(text, return_tensors="pt")
	        with torch.no_grad():
	            outputs = self.model(**tokens, output_hidden_states=True)
	        return outputs.hidden_states[self.layer][0, -1, :]

	    def get_all_position_states(self, text: str) -> torch.Tensor:
	        """
	        Returns: Tensor of shape (seq_len, 960) for all positions
	        """
	        tokens = self.tokenizer(text, return_tensors="pt")
	        with torch.no_grad():
	            outputs = self.model(**tokens, output_hidden_states=True)
	        return outputs.hidden_states[self.layer][0]
	```

	---

	## 7. Complete Weight Inventory Table

	### 7.1 All Named Parameters

	```
	EMBEDDING (47,185,920 params - 13.04%)
	  model.embed_tokens.weight                         (49152, 960)  47,185,920

	LAYER 0 (9,832,320 params - 2.72%)
	  Attention (2,457,600):
	    model.layers.0.self_attn.q_proj.weight          (960, 960)       921,600
	    model.layers.0.self_attn.k_proj.weight          (320, 960)       307,200
	    model.layers.0.self_attn.v_proj.weight          (320, 960)       307,200
	    model.layers.0.self_attn.o_proj.weight          (960, 960)       921,600
	  MLP (7,372,800):
	    model.layers.0.mlp.gate_proj.weight             (2560, 960)    2,457,600
	    model.layers.0.mlp.up_proj.weight               (2560, 960)    2,457,600
	    model.layers.0.mlp.down_proj.weight             (960, 2560)    2,457,600
	  Norms (1,920):
	    model.layers.0.input_layernorm.weight           (960,)               960
	    model.layers.0.post_attention_layernorm.weight  (960,)               960

	[Layers 1-31 follow identical structure, each with 9,832,320 params]

	FINAL NORM (960 params - 0.00%)
	  model.norm.weight                                 (960,)               960

	LM HEAD (tied with embed_tokens)
	  lm_head.weight                                    (49152, 960)  [shared]
	```

	### 7.2 Summary by Component Type

	\| Component Type \| Count \| Params Each \| Total Params \|
	\|----------------\|-------\|-------------\|--------------\|
	\| Embedding \| 1 \| 47,185,920 \| 47,185,920 \|
	\| Q Projection \| 32 \| 921,600 \| 29,491,200 \|
	\| K Projection \| 32 \| 307,200 \| 9,830,400 \|
	\| V Projection \| 32 \| 307,200 \| 9,830,400 \|
	\| O Projection \| 32 \| 921,600 \| 29,491,200 \|
	\| Gate Projection \| 32 \| 2,457,600 \| 78,643,200 \|
	\| Up Projection \| 32 \| 2,457,600 \| 78,643,200 \|
	\| Down Projection \| 32 \| 2,457,600 \| 78,643,200 \|
	\| Input LayerNorm \| 32 \| 960 \| 30,720 \|
	\| Post-Attn LayerNorm \| 32 \| 960 \| 30,720 \|
	\| Final LayerNorm \| 1 \| 960 \| 960 \|
	\| Total \| \| \| 361,821,120 \|

	---

	## 8. Configuration Reference

	Complete model configuration from HuggingFace:

	```python
	{
	    "architectures": ["LlamaForCausalLM"],
	    "attention_bias": False,
	    "attention_dropout": 0.0,
	    "bos_token_id": 1,
	    "eos_token_id": 2,
	    "pad_token_id": 2,
	    "head_dim": 64,
	    "hidden_act": "silu",
	    "hidden_size": 960,
	    "initializer_range": 0.02,
	    "intermediate_size": 2560,
	    "max_position_embeddings": 8192,
	    "mlp_bias": False,
	    "model_type": "llama",
	    "num_attention_heads": 15,
	    "num_hidden_layers": 32,
	    "num_key_value_heads": 5,
	    "pretraining_tp": 1,
	    "rms_norm_eps": 1e-05,
	    "rope_interleaved": False,
	    "rope_theta": 100000,
	    "tie_word_embeddings": True,
	    "use_cache": True,
	    "vocab_size": 49152
	}
	```

	---

	## 9. Appendix: PyTorch Model Structure

	```
	LlamaForCausalLM(
	  (model): LlamaModel(
	    (embed_tokens): Embedding(49152, 960, padding_idx=2)
	    (layers): ModuleList(
	      (0-31): 32 x LlamaDecoderLayer(
	        (self_attn): LlamaAttention(
	          (q_proj): Linear(in_features=960, out_features=960, bias=False)
	          (k_proj): Linear(in_features=960, out_features=320, bias=False)
	          (v_proj): Linear(in_features=960, out_features=320, bias=False)
	          (o_proj): Linear(in_features=960, out_features=960, bias=False)
	        )
	        (mlp): LlamaMLP(
	          (gate_proj): Linear(in_features=960, out_features=2560, bias=False)
	          (up_proj): Linear(in_features=960, out_features=2560, bias=False)
	          (down_proj): Linear(in_features=2560, out_features=960, bias=False)
	          (act_fn): SiLUActivation()
	        )
	        (input_layernorm): LlamaRMSNorm((960,), eps=1e-05)
	        (post_attention_layernorm): LlamaRMSNorm((960,), eps=1e-05)
	      )
	    )
	    (norm): LlamaRMSNorm((960,), eps=1e-05)
	    (rotary_emb): LlamaRotaryEmbedding()
	  )
	  (lm_head): Linear(in_features=960, out_features=49152, bias=False)
	)
	```