
SmolLM2-360M-Instruct Architecture Analysis

Technical reference document for the 8bit-threshold-computer LLM integration project.

Model: HuggingFaceTB/SmolLM2-360M-Instruct
Architecture: LlamaForCausalLM (Llama 2 variant)
Tokenizer: GPT2TokenizerFast
Analysis Date: 2026-01-21


1. Executive Summary

SmolLM2-360M-Instruct is a 362M parameter causal language model using the Llama architecture. Key characteristics relevant to our bit extraction task:

  • Hidden dimension: 960 (matches our extractor input requirement)
  • 32 transformer layers providing multiple extraction points
  • Digit-level tokenization for numbers (each digit is a separate token)
  • Grouped Query Attention (GQA) with 15 query heads and 5 KV heads

2. Architecture Census

2.1 Core Parameters

| Parameter | Value |
|---|---|
| Total Parameters | 361,821,120 (361.82M) |
| Vocabulary Size | 49,152 |
| Hidden Dimension | 960 |
| Intermediate Dimension (MLP) | 2,560 |
| Number of Layers | 32 |
| Number of Attention Heads | 15 |
| Number of KV Heads | 5 (Grouped Query Attention) |
| Head Dimension | 64 |
| Max Sequence Length | 8,192 |
| Activation Function | SiLU |
| Normalization | RMSNorm (eps=1e-05) |
| Position Encoding | RoPE (theta=100,000) |
| Word Embedding Tying | Yes (embed_tokens = lm_head) |

2.2 Architecture Diagram

Input Token IDs
       |
       v
+------------------+
| Embedding Layer  |  (49152, 960)
+------------------+
       |
       v
+------------------+
| LlamaDecoderLayer| x 32
|  +-------------+ |
|  | RMSNorm     | |
|  +-------------+ |
|  | Self-Attn   | |  Q: (960, 960), K: (960, 320), V: (960, 320), O: (960, 960)
|  +-------------+ |
|  | Residual    | |
|  +-------------+ |
|  | RMSNorm     | |
|  +-------------+ |
|  | MLP (SwiGLU)| |  gate: (960, 2560), up: (960, 2560), down: (2560, 960)
|  +-------------+ |
|  | Residual    | |
+------------------+
       |
       v
+------------------+
| Final RMSNorm    |  (960,)
+------------------+
       |
       v
+------------------+
| LM Head          |  (960, 49152) - tied with embeddings
+------------------+
       |
       v
Logits (batch, seq, 49152)

Note: shapes in the diagram are written as (in_features, out_features). The weight tables in Section 3 use PyTorch's storage order, (out_features, in_features), so K and V appear there as (320, 960).

2.3 Parameter Distribution

| Component | Parameters | Percentage |
|---|---|---|
| Embedding | 47,185,920 | 13.04% |
| All Attention Layers | 78,643,200 | 21.74% |
| All MLP Layers | 235,929,600 | 65.21% |
| All Layer Norms | 61,440 | 0.02% |
| Final Norm | 960 | 0.00% |

Per-layer breakdown (each of 32 layers):

  • Attention: 2,457,600 params (0.68%)
  • MLP: 7,372,800 params (2.04%)
  • Norms: 1,920 params (0.00%)
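
The counts above can be re-derived from the core parameters in Section 2.1. The following is a minimal sketch, assuming bias-free projections and a tied LM head as stated elsewhere in this document; the function name `count_parameters` is made up for illustration.

```python
# Configuration values copied from Section 2.1 of this document.
VOCAB_SIZE = 49_152
HIDDEN = 960
INTERMEDIATE = 2_560
NUM_LAYERS = 32
NUM_HEADS = 15
NUM_KV_HEADS = 5
HEAD_DIM = 64


def count_parameters() -> dict:
    """Return parameter counts per component (no biases, tied LM head)."""
    q_dim = NUM_HEADS * HEAD_DIM        # 960
    kv_dim = NUM_KV_HEADS * HEAD_DIM    # 320
    # q_proj and o_proj map between HIDDEN and q_dim; k_proj and v_proj map to kv_dim.
    attention = 2 * HIDDEN * q_dim + 2 * HIDDEN * kv_dim
    # The SwiGLU MLP has three matrices: gate, up, down.
    mlp = 3 * HIDDEN * INTERMEDIATE
    # Two RMSNorm weight vectors per layer.
    norms = 2 * HIDDEN
    counts = {
        "embedding": VOCAB_SIZE * HIDDEN,
        "attention": NUM_LAYERS * attention,
        "mlp": NUM_LAYERS * mlp,
        "layer_norms": NUM_LAYERS * norms,
        "final_norm": HIDDEN,
    }
    counts["total"] = sum(counts.values())
    return counts


counts = count_parameters()
for name, value in counts.items():
    share = 100 * value / counts["total"]
    print(f"{name:12s} {value:>12,d} {share:6.2f}%")
```

The tied LM head contributes no extra parameters, which is why it does not appear as a separate entry.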

3. Weight Inventory

3.1 Embedding and Output Layers

| Parameter Name | Shape | Elements | Notes |
|---|---|---|---|
| model.embed_tokens.weight | (49152, 960) | 47,185,920 | Token embeddings |
| model.norm.weight | (960,) | 960 | Final layer norm |
| lm_head.weight | (49152, 960) | (tied) | Tied to embed_tokens |

3.2 Single Transformer Layer Structure

Each of the 32 layers (model.layers.{0-31}) contains:

Attention Block:

| Parameter | Shape | Elements |
|---|---|---|
| self_attn.q_proj.weight | (960, 960) | 921,600 |
| self_attn.k_proj.weight | (320, 960) | 307,200 |
| self_attn.v_proj.weight | (320, 960) | 307,200 |
| self_attn.o_proj.weight | (960, 960) | 921,600 |
| Attention Total | | 2,457,600 |

MLP Block (SwiGLU):

| Parameter | Shape | Elements |
|---|---|---|
| mlp.gate_proj.weight | (2560, 960) | 2,457,600 |
| mlp.up_proj.weight | (2560, 960) | 2,457,600 |
| mlp.down_proj.weight | (960, 2560) | 2,457,600 |
| MLP Total | | 7,372,800 |
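
For orientation, the three MLP matrices combine as down_proj(silu(gate_proj(x)) * up_proj(x)). The sketch below shows that data flow in plain Python with toy dimensions (2 and 3 standing in for 960 and 2,560). The weights are made up, and the helper names `matvec` and `swiglu_mlp` are illustrative rather than part of any library.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weight, vec):
    """Multiply an (out, in) weight matrix, stored as nested lists, by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def swiglu_mlp(x, gate_w, up_w, down_w):
    """Compute down(silu(gate(x)) * up(x)) with bias-free projections."""
    gate = [silu(g) for g in matvec(gate_w, x)]
    up = matvec(up_w, x)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(down_w, hidden)


# Toy sizes: hidden=2, intermediate=3 (made-up weights for illustration).
x = [1.0, -2.0]
gate_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # (3, 2)
up_w = [[1.0, 1.0], [2.0, 0.0], [0.0, 1.0]]     # (3, 2)
down_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]     # (2, 3)
out = swiglu_mlp(x, gate_w, up_w, down_w)
print(out)
```

The weight shapes follow the (out_features, in_features) order used in the table above.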

Normalization:

| Parameter | Shape | Elements |
|---|---|---|
| input_layernorm.weight | (960,) | 960 |
| post_attention_layernorm.weight | (960,) | 960 |
| Norms Total | | 1,920 |

Layer Total: 9,832,320 parameters

3.3 Grouped Query Attention (GQA) Details

SmolLM2 uses GQA with a 3:1 ratio:

  • 15 query heads (Q dimension: 960 = 15 x 64)
  • 5 key-value heads (KV dimension: 320 = 5 x 64)
  • Each KV head is shared by 3 query heads
  • This reduces KV cache memory by ~67% vs standard MHA
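
The head sharing and the cache saving can be made concrete with a little arithmetic. This is a sketch under two assumptions: consecutive query heads share a KV head (heads 0-2 use KV head 0, and so on), which is how the Hugging Face Llama code repeats KV heads, and cached values are stored in 2 bytes each (fp16/bf16). The function names are made up for illustration.

```python
NUM_LAYERS = 32
NUM_Q_HEADS = 15
NUM_KV_HEADS = 5
HEAD_DIM = 64
GROUP_SIZE = NUM_Q_HEADS // NUM_KV_HEADS  # 3 query heads per KV head


def kv_head_for_query(q_head: int) -> int:
    """Map a query head index to the KV head it reads from (assumed grouping)."""
    if not 0 <= q_head < NUM_Q_HEADS:
        raise ValueError(f"query head out of range: {q_head}")
    return q_head // GROUP_SIZE


def kv_cache_bytes(seq_len: int, num_kv_heads: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values across all layers for one sequence."""
    return 2 * NUM_LAYERS * num_kv_heads * HEAD_DIM * seq_len * bytes_per_value


gqa = kv_cache_bytes(8192, NUM_KV_HEADS)   # actual configuration
mha = kv_cache_bytes(8192, NUM_Q_HEADS)    # hypothetical full multi-head attention
print([kv_head_for_query(h) for h in range(NUM_Q_HEADS)])
print(f"GQA: {gqa / 2**20:.0f} MiB, MHA: {mha / 2**20:.0f} MiB, saving {1 - gqa / mha:.1%}")
```

The saving is independent of sequence length and data type, since both cancel in the ratio 5/15.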

4. Tokenization Analysis

4.1 Arithmetic Expression Tokenization

Test input: "47 + 86"

| Position | Token ID | Token | Description |
|---|---|---|---|
| 0 | 36 | `'4'` | First digit of operand A |
| 1 | 39 | `'7'` | Second digit of operand A |
| 2 | 1232 | `' +'` | Space + plus sign |
| 3 | 216 | `' '` | Trailing space |
| 4 | 40 | `'8'` | First digit of operand B |
| 5 | 38 | `'6'` | Second digit of operand B |

4.2 Digit Token Mappings

| Digit | Token ID |
|---|---|
| 0 | 32 |
| 1 | 33 |
| 2 | 34 |
| 3 | 35 |
| 4 | 36 |
| 5 | 37 |
| 6 | 38 |
| 7 | 39 |
| 8 | 40 |
| 9 | 41 |

Key observations:

  • Digits are tokenized individually (no multi-digit tokens like "47")
  • Digit tokens are sequential: ID = 32 + digit_value
  • Space-prefixed operators exist (e.g., ' +' = 1232)
  • '=' has token ID 45

4.3 Implications for Bit Extraction

The digit-by-digit tokenization means:

  1. For "47 + 86", operand A spans positions [0,1] and operand B spans positions [4,5]
  2. The model must learn to:
    • Recognize digit boundaries
    • Compose multi-digit numbers from sequential tokens
    • Perform arithmetic across token positions
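  3. Hidden states at the digit positions are the most direct candidates for carrying each operand's numerical representation; because attention is causal, only the final position can see the full expression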

5. Hidden State Analysis

5.1 Hidden State Output Structure

When running with output_hidden_states=True:

  • Returns 33 hidden states (embedding + 32 layer outputs)
  • Each has shape: (batch_size, seq_len, hidden_dim)
  • For "47 + 86": (1, 6, 960)

5.2 Hidden State Statistics by Layer

| Layer | Mean | Std Dev | Min | Max |
|---|---|---|---|---|
| Embedding | -0.001 | 0.105 | -0.44 | 1.77 |
| Layer 0 | -0.127 | 2.55 | -80.8 | 19.0 |
| Layer 1 | -0.171 | 3.70 | -161 | 39.7 |
| Layer 2 | -0.151 | 3.67 | -102 | 61.4 |
| Layer 3 | -1.13 | 327 | -21,722 | 11,856 |
| Layer 4-12 | ~-1.3 | ~327 | ~-21,700 | ~11,800 |
| Layer 13-26 | ~-1.5 | ~337 | ~-22,400 | ~12,100 |
| Layer 27-30 | ~-1.8 | ~310 | ~-20,000 | ~11,800 |
| Layer 31 | 0.017 | 1.34 | -18.9 | 34.3 |

Key observations:

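  1. Dramatic variance explosion at Layer 3: Std dev jumps from ~4 to ~327
  2. Stable middle layers (4-26): Consistent statistics, which may indicate sustained computation, though these summary statistics alone do not confirm it
  3. Compression at the final entry: Std dev drops to 1.34 at Layer 31. In Hugging Face Transformers the last hidden_states entry is taken after the final RMSNorm (model.norm), so this drop likely reflects that normalization rather than Layer 31 itself
  4. The Layer 31 entry is well-scaled for downstream processing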
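
RMSNorm, the normalization listed in Section 2.1, rescales a vector to roughly unit root-mean-square before applying a learned weight. That is one mechanism by which values on the scale of the middle layers can end up on the scale seen at the output. The sketch below is plain Python; the input vector is made up for illustration, and unit weights are assumed because the learned weights are not listed in this document.

```python
import math


def rms(x):
    """Root-mean-square of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def rms_norm(x, weight=None, eps=1e-5):
    """RMSNorm: scale x by 1 / sqrt(mean(x^2) + eps), then by a per-element weight."""
    if weight is None:
        weight = [1.0] * len(x)  # assumption: unit weights, for illustration only
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * scale for w, v in zip(weight, x)]


# Made-up vector with one very large entry, loosely mimicking the middle-layer range.
x = [0.5, -1.2, 3.0, -21722.0, 0.8, 2.1]
y = rms_norm(x)
print(f"RMS before: {rms(x):.1f}, after: {rms(y):.3f}")
```

With real learned weights the output RMS need not be exactly 1, so this illustrates the rescaling effect rather than reproducing the table's numbers.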

5.3 Extraction Point Candidates

| Layer Range | Characteristics | Suitability |
|---|---|---|
| 0-2 (Early) | Low variance, close to embeddings | Poor - minimal computation |
| 3-12 (Early-Mid) | High variance, initial processing | Moderate - may contain raw numerical features |
| 13-26 (Middle) | Stable high variance | Good - computation in progress |
| 27-30 (Late) | Variance compression begins | Good - refined representations |
| 31 (Final) | Well-normalized output | Best - final representation before LM head |

6. Relevance to 8bit-Threshold-Computer Project

6.1 Hidden Dimension Match

The hidden dimension of 960 exactly matches our extractor input requirement. This is fortuitous as it means:

  • No projection layer needed to interface with our bit extractor
  • Direct extraction from any layer's hidden states
  • Full utilization of the model's representational capacity

6.2 Recommended Extraction Strategy

import torch

def extract_hidden_state(model, tokenizer, expression, layer=-1):
    """
    Extract hidden state for bit extraction.

    Args:
        layer: Index into outputs.hidden_states (default: -1, the final entry).
               Index 0 is the embedding output and index i + 1 is the output
               of Layer i, so -1 (or 32) selects Layer 31.

    Returns:
        Tensor of shape (960,) for the last token position
    """
    inputs = tokenizer(expression, return_tensors="pt")
    with torch.no_grad():  # inference only; no gradients needed
        outputs = model(**inputs, output_hidden_states=True)

    # hidden_states[0] = embedding, hidden_states[1] = layer 0, ...
    # hidden_states[32] = layer 31 (final)
    hidden = outputs.hidden_states[layer]  # (1, seq_len, 960)

    # Extract last token position for autoregressive prediction
    return hidden[0, -1, :]  # (960,)

6.3 Token Position Analysis

For arithmetic expressions like "A + B":

Tokens:    [d1] [d2] [ +] [ ] [d3] [d4]
Positions:  0    1    2   3   4    5

Operand A: positions 0 to (plus_pos - 1)
Operator:  position where ' +' token appears
Operand B: positions (plus_pos + 2) to end

Strategy for operand extraction:

  1. Find the ' +' token (ID 1232) position
  2. Collect hidden states at digit positions before it (operand A)
  3. Collect hidden states at digit positions after it (operand B)
  4. Consider pooling (mean, max) or concatenating digit hidden states
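
The position logic in this strategy can be sketched on raw token IDs, independent of the model. The sketch assumes the IDs reported in Section 4 (' +' = 1232, digits = 32 + value) and filters by digit token instead of a fixed offset, so the space token after the operator is skipped automatically. `operand_positions` and `mean_pool` are illustrative names, and `mean_pool` works on plain lists in place of tensors.

```python
PLUS_TOKEN_ID = 1232      # ' +' per Section 4.1
DIGIT_TOKEN_OFFSET = 32   # token ID of '0' per Section 4.2


def is_digit_token(token_id: int) -> bool:
    """True if the token ID falls in the digit range 32..41."""
    return DIGIT_TOKEN_OFFSET <= token_id <= DIGIT_TOKEN_OFFSET + 9


def operand_positions(token_ids):
    """Return (operand_a_positions, operand_b_positions) for an 'A + B' sequence."""
    if PLUS_TOKEN_ID not in token_ids:
        raise ValueError("no ' +' token found")
    plus_pos = token_ids.index(PLUS_TOKEN_ID)
    a = [i for i in range(plus_pos) if is_digit_token(token_ids[i])]
    b = [i for i in range(plus_pos + 1, len(token_ids)) if is_digit_token(token_ids[i])]
    return a, b


def mean_pool(vectors):
    """Element-wise mean of equal-length vectors (the mean-pooling option in step 4)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


token_ids = [36, 39, 1232, 216, 40, 38]  # "47 + 86" as tokenized in Section 4.1
a_pos, b_pos = operand_positions(token_ids)
print(a_pos, b_pos)  # [0, 1] [4, 5]
```

With real hidden states, the same position lists would index the (seq_len, 960) tensor before pooling or concatenation.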

6.4 Attention Pattern Utilization

With GQA (15 query heads, 5 KV heads), we can analyze attention patterns to:

  1. Identify which positions attend to operand digits
  2. Determine if the model explicitly links corresponding digit positions
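  3. Find heads that specialize in numerical reasoning

import torch

def get_attention_weights(model, tokenizer, expression):
    # Depending on the transformers version, attention weights may only be
    # returned when the model is loaded with attn_implementation="eager".
    inputs = tokenizer(expression, return_tensors="pt")
    with torch.no_grad():  # inference only; no gradients needed
        outputs = model(**inputs, output_attentions=True)
    # attentions: tuple of (batch, num_heads, seq_len, seq_len) per layer
    return outputs.attentions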
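
One way to use these weights for the first analysis goal is to sum, per head, the attention that the last position places on each operand. The sketch below works on plain lists so it can be read without a model: each row would come from something like `attentions[layer][0, head, -1].tolist()`. The weights shown are made up, only three of the 15 heads are included, and `operand_attention` is an illustrative name.

```python
GROUP_SIZE = 3  # query heads per KV head (15 / 5)


def operand_attention(attn_rows, a_positions, b_positions):
    """For each head, sum the attention weight placed on each operand.

    attn_rows holds one list of attention weights per query head, taken from
    a single query position (for example the last token).
    """
    result = []
    for head, row in enumerate(attn_rows):
        result.append({
            "head": head,
            "kv_group": head // GROUP_SIZE,  # assumes consecutive heads share a KV head
            "operand_a": sum(row[i] for i in a_positions),
            "operand_b": sum(row[i] for i in b_positions),
        })
    return result


# Made-up weights for 3 heads over the 6 tokens of "47 + 86"; each row sums to 1.
rows = [
    [0.40, 0.30, 0.10, 0.05, 0.10, 0.05],
    [0.05, 0.05, 0.10, 0.10, 0.30, 0.40],
    [0.20, 0.20, 0.10, 0.10, 0.20, 0.20],
]
summary = operand_attention(rows, [0, 1], [4, 5])
for entry in summary:
    print(entry)
```

Heads whose mass concentrates on one operand across many expressions would be candidates for closer inspection.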

6.5 Extraction Interface Specification

For integration with the threshold computer:

import torch

class SmolLM2Extractor:
    """Interface between SmolLM2 and threshold-based bit extraction."""

    def __init__(self, model, tokenizer, extraction_layer=31):
        if not 0 <= extraction_layer <= 31:
            raise ValueError("extraction_layer must be a layer number from 0 to 31")
        self.model = model
        self.tokenizer = tokenizer
        self.layer = extraction_layer + 1  # +1 because index 0 is embedding

    def get_hidden_state(self, text: str) -> torch.Tensor:
        """
        Returns: Tensor of shape (960,) ready for bit extractor
        """
        tokens = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**tokens, output_hidden_states=True)
        return outputs.hidden_states[self.layer][0, -1, :]

    def get_all_position_states(self, text: str) -> torch.Tensor:
        """
        Returns: Tensor of shape (seq_len, 960) for all positions
        """
        tokens = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**tokens, output_hidden_states=True)
        return outputs.hidden_states[self.layer][0]

7. Complete Weight Inventory Table

7.1 All Named Parameters

EMBEDDING (47,185,920 params - 13.04%)
  model.embed_tokens.weight                    (49152, 960)    47,185,920

LAYER 0 (9,832,320 params - 2.72%)
  Attention (2,457,600):
    model.layers.0.self_attn.q_proj.weight     (960, 960)      921,600
    model.layers.0.self_attn.k_proj.weight     (320, 960)      307,200
    model.layers.0.self_attn.v_proj.weight     (320, 960)      307,200
    model.layers.0.self_attn.o_proj.weight     (960, 960)      921,600
  MLP (7,372,800):
    model.layers.0.mlp.gate_proj.weight        (2560, 960)     2,457,600
    model.layers.0.mlp.up_proj.weight          (2560, 960)     2,457,600
    model.layers.0.mlp.down_proj.weight        (960, 2560)     2,457,600
  Norms (1,920):
    model.layers.0.input_layernorm.weight      (960,)          960
    model.layers.0.post_attention_layernorm.weight (960,)      960

[Layers 1-31 follow identical structure, each with 9,832,320 params]

FINAL NORM (960 params - 0.00%)
  model.norm.weight                            (960,)          960

LM HEAD (tied with embed_tokens)
  lm_head.weight                               (49152, 960)    [shared]

7.2 Summary by Component Type

| Component Type | Count | Params Each | Total Params |
|---|---|---|---|
| Embedding | 1 | 47,185,920 | 47,185,920 |
| Q Projection | 32 | 921,600 | 29,491,200 |
| K Projection | 32 | 307,200 | 9,830,400 |
| V Projection | 32 | 307,200 | 9,830,400 |
| O Projection | 32 | 921,600 | 29,491,200 |
| Gate Projection | 32 | 2,457,600 | 78,643,200 |
| Up Projection | 32 | 2,457,600 | 78,643,200 |
| Down Projection | 32 | 2,457,600 | 78,643,200 |
| Input LayerNorm | 32 | 960 | 30,720 |
| Post-Attn LayerNorm | 32 | 960 | 30,720 |
| Final LayerNorm | 1 | 960 | 960 |
| Total | | | 361,821,120 |

8. Configuration Reference

Model configuration from HuggingFace, shown as a Python dict (hence True/False rather than JSON true/false):

{
    "architectures": ["LlamaForCausalLM"],
    "attention_bias": False,
    "attention_dropout": 0.0,
    "bos_token_id": 1,
    "eos_token_id": 2,
    "pad_token_id": 2,
    "head_dim": 64,
    "hidden_act": "silu",
    "hidden_size": 960,
    "initializer_range": 0.02,
    "intermediate_size": 2560,
    "max_position_embeddings": 8192,
    "mlp_bias": False,
    "model_type": "llama",
    "num_attention_heads": 15,
    "num_hidden_layers": 32,
    "num_key_value_heads": 5,
    "pretraining_tp": 1,
    "rms_norm_eps": 1e-05,
    "rope_interleaved": False,
    "rope_theta": 100000,
    "tie_word_embeddings": True,
    "use_cache": True,
    "vocab_size": 49152
}

9. Appendix: PyTorch Model Structure

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 960, padding_idx=2)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=960, out_features=960, bias=False)
          (k_proj): Linear(in_features=960, out_features=320, bias=False)
          (v_proj): Linear(in_features=960, out_features=320, bias=False)
          (o_proj): Linear(in_features=960, out_features=960, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=960, out_features=2560, bias=False)
          (up_proj): Linear(in_features=960, out_features=2560, bias=False)
          (down_proj): Linear(in_features=2560, out_features=960, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((960,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((960,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((960,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=960, out_features=49152, bias=False)
)