CharlesCNorton

SmolLM2-360M-Instruct Architecture Analysis

Technical reference document for the 8bit-threshold-computer LLM integration project.

Model: HuggingFaceTB/SmolLM2-360M-Instruct
Architecture: LlamaForCausalLM (Llama 2 variant)
Tokenizer: GPT2TokenizerFast
Analysis Date: 2026-01-21

1. Executive Summary

SmolLM2-360M-Instruct is a causal language model with about 362M parameters (361,821,120) that uses the Llama architecture. Key characteristics relevant to our bit extraction task:

Hidden dimension: 960 (matches our extractor input requirement)
32 transformer layers providing multiple extraction points
Digit-level tokenization for numbers (each digit is a separate token)
Grouped Query Attention (GQA) with 15 query heads and 5 KV heads

2. Architecture Census

2.1 Core Parameters

Parameter	Value
Total Parameters	361,821,120 (361.82M)
Vocabulary Size	49,152
Hidden Dimension	960
Intermediate Dimension (MLP)	2,560
Number of Layers	32
Number of Attention Heads	15
Number of KV Heads	5 (Grouped Query Attention)
Head Dimension	64
Max Sequence Length	8,192
Activation Function	SiLU
Normalization	RMSNorm (eps=1e-05)
Position Encoding	RoPE (theta=100,000)
Word Embedding Tying	Yes (embed_tokens = lm_head)

2.2 Architecture Diagram

Input Token IDs
       |
       v
+------------------+
| Embedding Layer  |  (49152, 960)
+------------------+
       |
       v
+------------------+
| LlamaDecoderLayer| x 32
|  +-------------+ |
|  | RMSNorm     | |
|  +-------------+ |
|  | Self-Attn   | |  Q: (960, 960), K: (960, 320), V: (960, 320), O: (960, 960)
|  +-------------+ |
|  | Residual    | |
|  +-------------+ |
|  | RMSNorm     | |
|  +-------------+ |
|  | MLP (SwiGLU)| |  gate: (960, 2560), up: (960, 2560), down: (2560, 960)
|  +-------------+ |
|  | Residual    | |
+------------------+
       |
       v
+------------------+
| Final RMSNorm    |  (960,)
+------------------+
       |
       v
+------------------+
| LM Head          |  (960, 49152) - tied with embeddings
+------------------+
       |
       v
Logits (batch, seq, 49152)

2.3 Parameter Distribution

Component	Parameters	Percentage
Embedding	47,185,920	13.04%
All Attention Layers	78,643,200	21.74%
All MLP Layers	235,929,600	65.21%
All Layer Norms	61,440	0.02%
Final Norm	960	0.00%

Per-layer breakdown (each of 32 layers):

Attention: 2,457,600 params (0.68%)
MLP: 7,372,800 params (2.04%)
Norms: 1,920 params (0.00%)
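
The totals above can be re-derived from the core parameters in section 2.1. The following is a minimal sketch, assuming no bias terms (the configuration in section 8 lists attention_bias and mlp_bias as False) and counting the tied embedding once; the function and constant names are illustrative, not part of the project code.

```python
# Values copied from the Core Parameters table (section 2.1).
VOCAB_SIZE = 49_152
HIDDEN = 960
INTERMEDIATE = 2_560
NUM_LAYERS = 32
NUM_HEADS = 15
NUM_KV_HEADS = 5
HEAD_DIM = 64


def count_parameters() -> dict:
    """Return parameter counts per component, assuming bias-free linears."""
    q_dim = NUM_HEADS * HEAD_DIM        # 960
    kv_dim = NUM_KV_HEADS * HEAD_DIM    # 320
    # q_proj and o_proj are HIDDEN x q_dim; k_proj and v_proj are HIDDEN x kv_dim
    attention = 2 * HIDDEN * q_dim + 2 * HIDDEN * kv_dim
    mlp = 3 * HIDDEN * INTERMEDIATE     # gate, up, down
    norms = 2 * HIDDEN                  # two RMSNorm weights per layer
    per_layer = attention + mlp + norms
    embedding = VOCAB_SIZE * HIDDEN     # lm_head is tied, so counted once
    final_norm = HIDDEN
    total = embedding + NUM_LAYERS * per_layer + final_norm
    return {
        "embedding": embedding,
        "attention_total": NUM_LAYERS * attention,
        "mlp_total": NUM_LAYERS * mlp,
        "norms_total": NUM_LAYERS * norms,
        "final_norm": final_norm,
        "per_layer": per_layer,
        "total": total,
    }


counts = count_parameters()
for name, value in counts.items():
    share = 100.0 * value / counts["total"]
    print(f"{name:16s} {value:>12,d}  {share:6.2f}%")
```

Running the sketch is a quick way to cross-check the percentage column whenever one of the table values is edited.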

3. Weight Inventory

3.1 Embedding and Output Layers

Parameter Name	Shape	Elements	Notes
`model.embed_tokens.weight`	(49152, 960)	47,185,920	Token embeddings
`model.norm.weight`	(960,)	960	Final layer norm
`lm_head.weight`	(49152, 960)	(tied)	Tied to embed_tokens

3.2 Single Transformer Layer Structure

Each of the 32 layers (model.layers.{0-31}) contains:

Attention Block:

Parameter	Shape	Elements
`self_attn.q_proj.weight`	(960, 960)	921,600
`self_attn.k_proj.weight`	(320, 960)	307,200
`self_attn.v_proj.weight`	(320, 960)	307,200
`self_attn.o_proj.weight`	(960, 960)	921,600
Attention Total		2,457,600

MLP Block (SwiGLU):

Parameter	Shape	Elements
`mlp.gate_proj.weight`	(2560, 960)	2,457,600
`mlp.up_proj.weight`	(2560, 960)	2,457,600
`mlp.down_proj.weight`	(960, 2560)	2,457,600
MLP Total		7,372,800
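
For orientation, the three projections above combine as down_proj(SiLU(gate_proj(x)) * up_proj(x)) in the Llama MLP. The sketch below illustrates that data flow in plain Python on a toy 2-dimensional input; the helper names and the toy weights are hypothetical, and the weights use the same (out_features, in_features) layout as the table.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weight, vec):
    """Multiply an (out, in) weight matrix by a vector of length `in`."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def swiglu_mlp(x, gate_w, up_w, down_w):
    """Gated MLP: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(gate_w, x)]
    up = matvec(up_w, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(down_w, hidden)


# Toy sizes standing in for hidden=960 and intermediate=2560.
toy_x = [1.0, 0.0]
toy_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # (3, 2)
toy_up = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]     # (3, 2)
toy_down = [[1.0, 1.0, 1.0], [0.0, 0.0, 1.0]]     # (2, 3)
print(swiglu_mlp(toy_x, toy_gate, toy_up, toy_down))
```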

Normalization:

Parameter	Shape	Elements
`input_layernorm.weight`	(960,)	960
`post_attention_layernorm.weight`	(960,)	960
Norms Total		1,920

Layer Total: 9,832,320 parameters

3.3 Grouped Query Attention (GQA) Details

SmolLM2 uses GQA with a 3:1 ratio:

15 query heads (Q dimension: 960 = 15 x 64)
5 key-value heads (KV dimension: 320 = 5 x 64)
Each KV head is shared by 3 query heads
This reduces KV cache memory by ~67% vs standard MHA
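
The head sharing and the cache saving can be checked with a few lines of arithmetic. This is a sketch under the assumption that consecutive query heads share a KV head (heads 0-2 use KV head 0, and so on), which is how the Hugging Face Llama code expands KV heads; the function names are illustrative.

```python
NUM_LAYERS = 32
NUM_Q_HEADS = 15
NUM_KV_HEADS = 5
HEAD_DIM = 64


def kv_head_for_query_head(q_head: int) -> int:
    """Return the KV head index used by a given query head."""
    if not 0 <= q_head < NUM_Q_HEADS:
        raise ValueError(f"query head {q_head} out of range")
    group_size = NUM_Q_HEADS // NUM_KV_HEADS   # 3 query heads per KV head
    return q_head // group_size


def kv_cache_elements_per_token(num_kv_heads: int) -> int:
    """Elements cached per token: keys and values, for every layer."""
    return 2 * NUM_LAYERS * num_kv_heads * HEAD_DIM


gqa = kv_cache_elements_per_token(NUM_KV_HEADS)
mha = kv_cache_elements_per_token(NUM_Q_HEADS)   # hypothetical MHA baseline
print([kv_head_for_query_head(h) for h in range(NUM_Q_HEADS)])
print(f"GQA: {gqa} elements/token, MHA: {mha}, saving: {1 - gqa / mha:.1%}")
```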

4. Tokenization Analysis

4.1 Arithmetic Expression Tokenization

Test input: "47 + 86"

Position	Token ID	Token	Description
0	36	`'4'`	First digit of operand A
1	39	`'7'`	Second digit of operand A
2	1232	`' +'`	Space + plus sign
3	216	`' '`	Space between the operator and operand B
4	40	`'8'`	First digit of operand B
5	38	`'6'`	Second digit of operand B

4.2 Digit Token Mappings

Digit	Token ID
0	32
1	33
2	34
3	35
4	36
5	37
6	38
7	39
8	40
9	41

Key observations:

Digits are tokenized individually (no multi-digit tokens like "47")
Digit tokens are sequential: ID = 32 + digit_value
Space-prefixed operators exist (e.g., ' +' = 1232)
'=' has token ID 45
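
Because the digit IDs are sequential, the mapping can be expressed without consulting the tokenizer. The sketch below relies only on the table above (ID = 32 + digit value); the helper names are hypothetical, and results should still be verified against the real tokenizer before use.

```python
from typing import List

DIGIT_TOKEN_OFFSET = 32  # token ID of '0' in the table above


def digits_to_token_ids(number: str) -> List[int]:
    """Map a string of ASCII digits to the per-digit token IDs."""
    if not (number.isascii() and number.isdigit()):
        raise ValueError(f"expected ASCII digits only, got {number!r}")
    return [DIGIT_TOKEN_OFFSET + int(ch) for ch in number]


def token_ids_to_int(token_ids: List[int]) -> int:
    """Inverse mapping: compose digit token IDs back into an integer."""
    value = 0
    for token_id in token_ids:
        digit = token_id - DIGIT_TOKEN_OFFSET
        if not 0 <= digit <= 9:
            raise ValueError(f"token ID {token_id} is not a digit token")
        value = value * 10 + digit
    return value


print(digits_to_token_ids("47"))
print(token_ids_to_int([40, 38]))
```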

4.3 Implications for Bit Extraction

The digit-by-digit tokenization means:

For "47 + 86", operand A spans positions [0,1] and operand B spans positions [4,5]
The model must learn to:
- Recognize digit boundaries
- Compose multi-digit numbers from sequential tokens
- Perform arithmetic across token positions
Hidden states at digit positions contain the numerical representation

5. Hidden State Analysis

5.1 Hidden State Output Structure

When running with output_hidden_states=True:

Returns 33 hidden states (embedding + 32 layer outputs)
Each has shape: (batch_size, seq_len, hidden_dim)
For "47 + 86": (1, 6, 960)
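
Since index 0 of the returned tuple is the embedding output, layer numbers and tuple indices differ by one, which is an easy source of off-by-one errors. A small sketch of the mapping, with illustrative helper names:

```python
NUM_LAYERS = 32
NUM_HIDDEN_STATES = NUM_LAYERS + 1  # embedding output + one entry per layer


def hidden_state_index(layer: int) -> int:
    """Map a 0-based decoder layer number to its index in hidden_states."""
    if not 0 <= layer < NUM_LAYERS:
        raise ValueError(f"layer must be in [0, {NUM_LAYERS - 1}], got {layer}")
    return layer + 1


def describe_index(index: int) -> str:
    """Describe what a (possibly negative) hidden_states index refers to."""
    if not -NUM_HIDDEN_STATES <= index < NUM_HIDDEN_STATES:
        raise IndexError(f"index {index} out of range")
    index %= NUM_HIDDEN_STATES  # normalize negative indices
    return "embedding" if index == 0 else f"layer {index - 1}"


print(hidden_state_index(31), describe_index(-1), describe_index(0))
```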

5.2 Hidden State Statistics by Layer

Layer	Mean	Std Dev	Min	Max
Embedding	-0.001	0.105	-0.44	1.77
Layer 0	-0.127	2.55	-80.8	19.0
Layer 1	-0.171	3.70	-161	39.7
Layer 2	-0.151	3.67	-102	61.4
Layer 3	-1.13	327	-21,722	11,856
Layer 4-12	~-1.3	~327	~-21,700	~11,800
Layer 13-26	~-1.5	~337	~-22,400	~12,100
Layer 27-30	~-1.8	~310	~-20,000	~11,800
Layer 31	0.017	1.34	-18.9	34.3

Key observations:

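Dramatic variance explosion at Layer 3: Std dev jumps from ~4 to ~327, with extremes near -21,700 and 11,900, which points to a small number of very large activations rather than a uniform rescaling
Stable middle layers (4-26): Statistics stay nearly constant; this may reflect ongoing numerical computation, though the table alone cannot confirm it
Compression at final layer: Std dev drops to 1.34 at Layer 31. In the Hugging Face Llama implementation the last entry of hidden_states is recorded after the final RMSNorm (model.norm), so the drop most likely reflects that normalization rather than Layer 31 itself
Layer 31 is well-scaled for downstream processing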
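
One plausible reading of the Layer 31 row is that the last hidden state is captured after the final RMSNorm, which rescales each vector to a root-mean-square near 1. The sketch below illustrates the effect on a synthetic 960-element vector with a single large outlier; the data is made up for illustration, the learned norm weight is omitted (treated as all ones), and the standard deviation is the population form.

```python
import math
import statistics


def summarize(values):
    """Mean, population std dev, min and max of a flat list of floats."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


def rms_norm(values, eps=1e-5):
    """RMSNorm without the learned weight: x / sqrt(mean(x^2) + eps)."""
    mean_square = sum(v * v for v in values) / len(values)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale for v in values]


# Synthetic vector: small alternating values plus one extreme entry.
synthetic = [1.0 if i % 2 == 0 else -1.0 for i in range(960)]
synthetic[0] = -20000.0

print("before:", summarize(synthetic))
print("after: ", summarize(rms_norm(synthetic)))
```

A single extreme entry is enough to push the standard deviation into the hundreds, and the normalization brings it back to roughly 1, which is the same qualitative pattern as the table.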

5.3 Extraction Point Candidates

Layer Range	Characteristics	Suitability
0-2 (Early)	Low variance, close to embeddings	Poor - minimal computation
3-12 (Early-Mid)	High variance, initial processing	Moderate - may contain raw numerical features
13-26 (Middle)	Stable high variance	Good - computation in progress
27-30 (Late)	Variance compression begins	Good - refined representations
31 (Final)	Well-normalized output	Best - final representation before LM head

6. Relevance to 8bit-Threshold-Computer Project

6.1 Hidden Dimension Match

The hidden dimension of 960 exactly matches our extractor input requirement. This is fortuitous as it means:

No projection layer needed to interface with our bit extractor
Direct extraction from any layer's hidden states
Full utilization of the model's representational capacity

6.2 Recommended Extraction Strategy

import torch

def extract_hidden_state(model, tokenizer, expression, layer=-1):
    """
    Extract hidden state for bit extraction.

    Args:
        model: Causal LM that accepts output_hidden_states=True
        tokenizer: Tokenizer matching the model
        expression: Input text, e.g. "47 + 86"
        layer: Index into outputs.hidden_states (default: final entry)
               -1 = Layer 31 (final, pre-LM-head)

    Returns:
        Tensor of shape (960,) for the last token position
    """
    inputs = tokenizer(expression, return_tensors="pt")
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # hidden_states[0] = embedding, hidden_states[1] = layer 0, ...
    # hidden_states[32] = layer 31 (final)
    hidden = outputs.hidden_states[layer]  # (1, seq_len, 960)

    # Extract last token position for autoregressive prediction
    return hidden[0, -1, :]  # (960,)

6.3 Token Position Analysis

For arithmetic expressions like "A + B":

Tokens:    [d1] [d2] [ +] [ ] [d3] [d4]
Positions:  0    1    2   3   4    5

Operand A: positions 0 to (plus_pos - 1)
Operator:  position where ' +' token appears
Operand B: positions (plus_pos + 2) to end

Strategy for operand extraction:

Find the ' +' token (ID 1232) position
Collect hidden states at digit positions before it (operand A)
Collect hidden states at digit positions after it (operand B)
Consider pooling (mean, max) or concatenating digit hidden states
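
The strategy above can be sketched on plain token-ID lists, independent of the model. This is a minimal illustration assuming the token IDs reported in section 4 (' +' = 1232, digits = 32 to 41); the function names are hypothetical, and mean pooling is shown only as one of the options listed.

```python
from typing import List, Sequence, Tuple

PLUS_TOKEN_ID = 1232                   # ' +' per section 4.1
DIGIT_TOKEN_IDS = set(range(32, 42))   # '0'..'9' per section 4.2


def operand_positions(token_ids: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return digit positions before and after the ' +' token."""
    plus_pos = list(token_ids).index(PLUS_TOKEN_ID)  # ValueError if absent
    operand_a = [i for i in range(plus_pos) if token_ids[i] in DIGIT_TOKEN_IDS]
    operand_b = [
        i for i in range(plus_pos + 1, len(token_ids))
        if token_ids[i] in DIGIT_TOKEN_IDS
    ]
    return operand_a, operand_b


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equal-length vectors (one per digit position)."""
    if not vectors:
        raise ValueError("need at least one vector to pool")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


ids_47_plus_86 = [36, 39, 1232, 216, 40, 38]
print(operand_positions(ids_47_plus_86))
```

Filtering by digit IDs rather than using fixed offsets means the separator space after the operator is skipped without hard-coding its position.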

6.4 Attention Pattern Utilization

With GQA (15 query heads, 5 KV heads), we can analyze attention patterns to:

Identify which positions attend to operand digits
Determine if the model explicitly links corresponding digit positions
Find heads that specialize in numerical reasoning

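import torch

def get_attention_weights(model, tokenizer, expression):
    inputs = tokenizer(expression, return_tensors="pt")
    # Depending on the transformers version, attention weights may only be
    # returned when the model is loaded with attn_implementation="eager"
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)
    # attentions: tuple of (batch, num_heads, seq_len, seq_len) per layer
    return outputs.attentions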
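
One way to look for heads that focus on operand digits is to sum, per head, the attention a query position assigns to the operand positions and rank the heads by that sum. The sketch below works on nested Python lists so it can be tested without the model; the function names and the toy attention values are hypothetical.

```python
from typing import List, Sequence, Tuple


def attention_mass(attn_row: Sequence[float], positions: Sequence[int]) -> float:
    """Total attention weight one query position gives to `positions`."""
    return sum(attn_row[p] for p in positions)


def rank_heads_by_operand_attention(
    layer_attention: Sequence[Sequence[Sequence[float]]],
    query_pos: int,
    operand_positions: Sequence[int],
) -> List[Tuple[int, float]]:
    """Rank heads of one layer by attention from query_pos to the operands.

    layer_attention has shape [num_heads][seq_len][seq_len] for a single
    batch element.
    """
    scores = [
        (head, attention_mass(matrix[query_pos], operand_positions))
        for head, matrix in enumerate(layer_attention)
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy example: 2 heads, 3 positions, causal rows that sum to 1.
toy_attention = [
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.1, 0.1, 0.8]],
    [[1.0, 0.0, 0.0], [0.3, 0.7, 0.0], [0.5, 0.4, 0.1]],
]
print(rank_heads_by_operand_attention(toy_attention, query_pos=2, operand_positions=[0, 1]))
```

With real model output, `outputs.attentions[layer][0].tolist()` should give a nested list in this layout for one layer.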

6.5 Extraction Interface Specification

For integration with the threshold computer:

import torch

class SmolLM2Extractor:
    """Interface between SmolLM2 and threshold-based bit extraction."""

    def __init__(self, model, tokenizer, extraction_layer=31):
        self.model = model
        self.tokenizer = tokenizer
        self.layer = extraction_layer + 1  # +1 because index 0 is embedding

    def get_hidden_state(self, text: str) -> torch.Tensor:
        """
        Returns: Tensor of shape (960,) ready for bit extractor
        """
        tokens = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**tokens, output_hidden_states=True)
        return outputs.hidden_states[self.layer][0, -1, :]

    def get_all_position_states(self, text: str) -> torch.Tensor:
        """
        Returns: Tensor of shape (seq_len, 960) for all positions
        """
        tokens = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**tokens, output_hidden_states=True)
        return outputs.hidden_states[self.layer][0]

7. Complete Weight Inventory Table

7.1 All Named Parameters

EMBEDDING (47,185,920 params - 13.04%)
  model.embed_tokens.weight                    (49152, 960)    47,185,920

LAYER 0 (9,832,320 params - 2.72%)
  Attention (2,457,600):
    model.layers.0.self_attn.q_proj.weight     (960, 960)      921,600
    model.layers.0.self_attn.k_proj.weight     (320, 960)      307,200
    model.layers.0.self_attn.v_proj.weight     (320, 960)      307,200
    model.layers.0.self_attn.o_proj.weight     (960, 960)      921,600
  MLP (7,372,800):
    model.layers.0.mlp.gate_proj.weight        (2560, 960)     2,457,600
    model.layers.0.mlp.up_proj.weight          (2560, 960)     2,457,600
    model.layers.0.mlp.down_proj.weight        (960, 2560)     2,457,600
  Norms (1,920):
    model.layers.0.input_layernorm.weight      (960,)          960
    model.layers.0.post_attention_layernorm.weight (960,)      960

[Layers 1-31 follow identical structure, each with 9,832,320 params]

FINAL NORM (960 params - 0.00%)
  model.norm.weight                            (960,)          960

LM HEAD (tied with embed_tokens)
  lm_head.weight                               (49152, 960)    [shared]

7.2 Summary by Component Type

Component Type	Count	Params Each	Total Params
Embedding	1	47,185,920	47,185,920
Q Projection	32	921,600	29,491,200
K Projection	32	307,200	9,830,400
V Projection	32	307,200	9,830,400
O Projection	32	921,600	29,491,200
Gate Projection	32	2,457,600	78,643,200
Up Projection	32	2,457,600	78,643,200
Down Projection	32	2,457,600	78,643,200
Input LayerNorm	32	960	30,720
Post-Attn LayerNorm	32	960	30,720
Final LayerNorm	1	960	960
Total			361,821,120

8. Configuration Reference

Complete model configuration from Hugging Face, written as a Python dict (config.json itself uses lowercase true/false):

{
    "architectures": ["LlamaForCausalLM"],
    "attention_bias": False,
    "attention_dropout": 0.0,
    "bos_token_id": 1,
    "eos_token_id": 2,
    "pad_token_id": 2,
    "head_dim": 64,
    "hidden_act": "silu",
    "hidden_size": 960,
    "initializer_range": 0.02,
    "intermediate_size": 2560,
    "max_position_embeddings": 8192,
    "mlp_bias": False,
    "model_type": "llama",
    "num_attention_heads": 15,
    "num_hidden_layers": 32,
    "num_key_value_heads": 5,
    "pretraining_tp": 1,
    "rms_norm_eps": 1e-05,
    "rope_interleaved": False,
    "rope_theta": 100000,
    "tie_word_embeddings": True,
    "use_cache": True,
    "vocab_size": 49152
}

9. Appendix: PyTorch Model Structure

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 960, padding_idx=2)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=960, out_features=960, bias=False)
          (k_proj): Linear(in_features=960, out_features=320, bias=False)
          (v_proj): Linear(in_features=960, out_features=320, bias=False)
          (o_proj): Linear(in_features=960, out_features=960, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=960, out_features=2560, bias=False)
          (up_proj): Linear(in_features=960, out_features=2560, bias=False)
          (down_proj): Linear(in_features=2560, out_features=960, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((960,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((960,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((960,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=960, out_features=49152, bias=False)
)