Spaces:
Running
Running
| <html lang="en"> | |
| <head> | |
| <meta charset="UTF-8"> | |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> | |
| <title>SupraLabs | The Embedding Bottleneck: Optimal Vocab Scales</title> | |
| <script src="https://cdn.jsdelivr.net/npm/chart.js"></script> | |
| <style> | |
| :root { | |
| --bg: #0f0f0f; | |
| --surface: #1a1a1a; | |
| --border: #333; | |
| --text: #e0e0e0; | |
| --accent: #536bfe; /* Supra Blue */ | |
| --muted: #888; | |
| --success: #00e676; | |
| --warning: #ffb300; | |
| --font-mono: 'JetBrains Mono', 'Fira Code', monospace; | |
| } | |
| * { margin: 0; padding: 0; box-sizing: border-box; } | |
| body { | |
| background-color: var(--bg); | |
| color: var(--text); | |
| font-family: 'Inter', -apple-system, sans-serif; | |
| line-height: 1.6; | |
| padding: 2rem; | |
| } | |
| code, pre, .mono { font-family: var(--font-mono); } | |
| .container { max-width: 1000px; margin: 0 auto; } | |
| header { | |
| border-bottom: 2px solid var(--border); | |
| padding-bottom: 2rem; | |
| margin-bottom: 3rem; | |
| display: flex; | |
| justify-content: space-between; | |
| align-items: flex-end; | |
| } | |
| .logo-area a { text-decoration: none; color: inherit; } | |
| .logo-area h1 { | |
| font-size: 1.2rem; | |
| text-transform: uppercase; | |
| letter-spacing: 2px; | |
| color: var(--accent); | |
| line-height: 1; | |
| display: flex; | |
| align-items: center; | |
| gap: 10px; | |
| } | |
| nav a { | |
| color: var(--text); | |
| text-decoration: none; | |
| margin-left: 1.5rem; | |
| font-size: 0.9rem; | |
| border-bottom: 1px solid transparent; | |
| } | |
| nav a:hover { border-bottom: 1px solid var(--accent); } | |
| .hero { margin-bottom: 4rem; } | |
| .hero h2 { font-size: 3rem; line-height: 1.1; margin-bottom: 1.5rem; font-weight: 800; } | |
| .hero p { font-size: 1.2rem; color: var(--muted); max-width: 750px; } | |
| .section-label { | |
| display: block; | |
| font-family: var(--font-mono); | |
| color: var(--accent); | |
| font-size: 0.8rem; | |
| margin-top: 3rem; | |
| margin-bottom: 1rem; | |
| text-transform: uppercase; | |
| } | |
| .card { background: var(--surface); border: 1px solid var(--border); padding: 2.5rem; margin-bottom: 2rem; } | |
| h3 { font-size: 1.6rem; margin-bottom: 1rem; font-weight: 700; } | |
| p { margin-bottom: 1rem; color: #ccc; } | |
| ul { margin-left: 1.5rem; margin-bottom: 1.5rem; } | |
| li { margin-bottom: 0.5rem; } | |
| .winner-badge { color: var(--success); font-weight: bold; font-family: var(--font-mono); } | |
| .table-container { overflow-x: auto; margin: 2rem 0; border: 1px solid var(--border); } | |
| table { width: 100%; border-collapse: collapse; text-align: left; font-size: 0.95rem; } | |
| th, td { padding: 1rem; border-bottom: 1px solid var(--border); } | |
| th { background-color: rgba(26, 26, 26, 0.8); font-family: var(--font-mono); color: var(--accent); font-size: 0.85rem; text-transform: uppercase; } | |
| tr:hover { background-color: rgba(83, 107, 254, 0.05); } | |
| .highlight-row { border-left: 4px solid var(--success); background-color: rgba(0, 230, 118, 0.02); } | |
| .chart-box { background: var(--surface); border: 1px solid var(--border); padding: 2rem; margin-bottom: 2rem; } | |
| .stats-grid { | |
| display: grid; | |
| grid-template-columns: 1fr 1fr 1fr; | |
| gap: 1rem; | |
| margin-top: 4rem; | |
| border-top: 1px solid var(--border); | |
| padding-top: 2rem; | |
| } | |
| .stat-box { padding: 1rem; border-left: 2px solid var(--accent); } | |
| .stat-box small { display: block; color: var(--muted); font-family: var(--font-mono); } | |
| footer { margin-top: 6rem; padding-bottom: 2rem; font-size: 0.8rem; color: var(--muted); text-align: center; } | |
| @media (max-width: 768px) { | |
| .hero h2 { font-size: 2rem; } | |
| header { flex-direction: column; align-items: flex-start; gap: 1rem; } | |
| nav a { margin-left: 0; margin-right: 1rem; } | |
| .stats-grid { grid-template-columns: 1fr; } | |
| } | |
| </style> | |
| </head> | |
| <body> | |
| <div class="container"> | |
| <header> | |
| <div class="logo-area" style="font-size: 1.5em;"> | |
| <a href="index.html"><h1><img src="./image.png" style="height: 2em" alt="Logo"> SupraLabs_</h1></a> | |
| </div> | |
| <nav> | |
| <a href="#summary">Core Learnings</a> | |
| <a href="#benchmarks">Vocab Matrix</a> | |
| <a href="#charts">Visualizations</a> | |
| <a href="https://huggingface.co/SupraLabs" target="_blank">HuggingFace</a> | |
| </nav> | |
| </header> | |
| <section class="hero"> | |
| <h2>Experiment #5:<br>The Embedding Bottleneck — Vocab Size</h2> | |
| <p>Mapping the discrete trade-off between vocabulary size and active internal hidden architecture weights. We processed a steady volume of <strong>50,000,000 tokens</strong> across 5 unique sub-7.5M models running an optimized shallow and wide architecture template.</p> | |
| </section> | |
| <span class="section-label" id="summary">// Tokenization_Gaps_&_Structural_Erosion</span> | |
| <div class="card"> | |
| <h3>Navigating the Pareto Frontier for Megabyte Architecture</h3> | |
| <p>Standard LLMs favor vast vocabularies (32k–128k) to keep tokenization counts short. Our sweep documents a fatal parameter theft paradox when shrinking downstream profiles to sub-10M boundaries:</p> | |
| <ul> | |
| <li><strong>The Sub-Token Fragmentation Cliff:</strong> Dropping vocabulary sizes down to 1024 or 2048 fractures words into tiny components. It exhausts the 1024 sequence length with formatting shards, exploding unprompted Word Perplexity beyond 1000.</li> | |
| <li><strong>The Embedding Parameter Theft:</strong> Expanding the vocabulary to 16,384 maps full expressions easily, reducing Perplexity to 359.2. However, it spikes total parameter allocations by over 114% exclusively inside the lookup layer, crippling the transformer hidden logical layers.</li> | |
| <li><strong>The 4096 Strategic Equilibrium:</strong> At a vocab ceiling of 4096, the fragmentation curve flattens completely. Word Perplexity drops by half to 467.2, capturing clean syntactic continuity while leaving processing parameter room for reasoning paths.</li> | |
| </ul> | |
| </div> | |
| <span class="section-label" id="benchmarks">// Vocabulary_Compression_Matrix</span> | |
| <div class="card" style="padding: 1.5rem;"> | |
| <h3>Unbiased Tokenizer Scaling Data</h3> | |
| <p>Downstream metrics evaluated at zero-shot boundaries. Word Perplexity (PPL) serves as the primary metric for comparative linguistic clarity.</p> | |
| <div class="table-container"> | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Benchmark / Metric</th> | |
| <th>Run 1: 1024 Vocab</th> | |
| <th>Run 2: 2048 Vocab</th> | |
| <th style="color: var(--success)">Run 3: 4096 Vocab (🏆 Peak)</th> | |
| <th>Run 4: 8192 Vocab</th> | |
| <th>Run 5: 16384 Vocab</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td class="mono">Total Active Parameters</td> | |
| <td>3,409,664</td> | |
| <td>3,671,808</td> | |
| <td style="color: var(--success)">4,196,096</td> | |
| <td>5,244,672</td> | |
| <td>7,341,824</td> | |
| </tr> | |
| <tr> | |
| <td class="mono">Pretrain Train Loss (↓)</td> | |
| <td>3.614</td> | |
| <td>4.172</td> | |
| <td>4.598</td> | |
| <td>5.063</td> | |
| <td class="mono" style="color: var(--warning)">5.409</td> | |
| </tr> | |
| <tr> | |
| <td class="mono">ARC-Easy Zero-Shot (↑)</td> | |
| <td>28.37%</td> | |
| <td>29.67%</td> | |
| <td>28.32%</td> | |
| <td class="winner-badge">30.77%</td> | |
| <td>30.93%</td> | |
| </tr> | |
| <tr> | |
| <td class="mono">Wikitext Byte PPL (↓)</td> | |
| <td>3.7336</td> | |
| <td>3.6693</td> | |
| <td>3.1566</td> | |
| <td>3.0746</td> | |
| <td class="winner-badge">3.0052</td> | |
| </tr> | |
| <tr> | |
| <td class="mono">Wikitext Word PPL (↓)</td> | |
| <td>1146.6974</td> | |
| <td>1044.8747</td> | |
| <td style="color: var(--success); font-weight: bold;">467.2369</td> | |
| <td>405.9334</td> | |
| <td class="winner-badge">359.2878</td> | |
| </tr> | |
| <tr> | |
| <td class="mono">Pretrain Compute Speed (⚡)</td> | |
| <td class="winner-badge">8.43 steps/sec</td> | |
| <td>8.03 steps/sec</td> | |
| <td>7.38 steps/sec</td> | |
| <td>6.95 steps/sec</td> | |
| <td>5.03 steps/sec</td> | |
| </tr> | |
| <tr class="highlight-row"> | |
| <td style="font-weight: bold;">EMBEDDING STATUS</td> | |
| <td style="color: var(--warning)">Context Fragmentation</td> | |
| <td style="color: var(--warning)">Information Degradation</td> | |
| <td style="color: var(--success); font-weight: bold;">PARETO CEILING</td> | |
| <td>Layer Starvation</td> | |
| <td>Parameter Overflow</td> | |
| </tr> | |
| </tbody> | |
| </table> | |
| </div> | |
| </div> | |
| <span class="section-label" id="charts">// Mapping_The_Tokenizer_Trade-offs</span> | |
| <div class="chart-box"> | |
| <h3>Linguistic Perplexity Collapse vs. Vocabulary Expansion</h3> | |
| <div style="position: relative; height:350px; width:100%"> | |
| <canvas id="vocabPplChart"></canvas> | |
| </div> | |
| </div> | |
| <div class="chart-box"> | |
| <h3>Total Active Model Parameters vs. Tokenizer Throughput Steps</h3> | |
| <p style="font-size: 0.85rem; color: var(--muted); margin-bottom: 1.5rem;">As vocab sizes scale up, the active parameter volume jumps exponentially inside the static lookup blocks, choking compute steps.</p> | |
| <div style="position: relative; height:350px; width:100%"> | |
| <canvas id="vocabParamChart"></canvas> | |
| </div> | |
| </div> | |
| <section class="stats-grid" id="hardware"> | |
| <div class="stat-box"> | |
| <small>COMPUTE ALLOCATION</small> | |
| <strong>Static S&W Topology Grid</strong> | |
| </div> | |
| <div class="stat-box"> | |
| <small>ISOLATEDPretrain BATCH</small> | |
| <strong>50,000,000 Volume Steps</strong> | |
| </div> | |
| <div class="stat-box"> | |
| <small>SOTA SELECTION MATRIX</small> | |
| <strong>4096 Balanced Ceiling</strong> | |
| </div> | |
| </section> | |
| <footer> | |
| <p>© 2026 SupraLabs. High performance. Small footprints. Proudly open-source.</p> | |
| </footer> | |
| </div> | |
| <script> | |
| // Vocab Perplexity Development | |
| const ctxVocabPpl = document.getElementById('vocabPplChart').getContext('2d'); | |
| new Chart(ctxVocabPpl, { | |
| type: 'line', | |
| data: { | |
| labels: ['Vocab 1024', 'Vocab 2048', 'Vocab 4096', 'Vocab 8192', 'Vocab 16384'], | |
| datasets: [{ | |
| label: 'Wikitext Word Perplexity (Lower = Better)', | |
| data: [1146.69, 1044.87, 467.23, 405.93, 359.28], | |
| borderColor: '#ffb300', | |
| backgroundColor: 'rgba(255, 179, 0, 0.05)', | |
| fill: true, | |
| tension: 0.2, | |
| borderWidth: 3 | |
| }] | |
| }, | |
| options: { | |
| responsive: true, | |
| maintainAspectRatio: false, | |
| plugins: { legend: { labels: { color: '#bbb' } } }, | |
| scales: { | |
| y: { grid: { color: '#222' }, ticks: { color: '#888' } }, | |
| x: { grid: { color: '#222' }, ticks: { color: '#aaa' } } | |
| } | |
| } | |
| }); | |
| // Parameters vs Speed Dual Axis | |
| const ctxVocabParam = document.getElementById('vocabParamChart').getContext('2d'); | |
| new Chart(ctxVocabParam, { | |
| type: 'line', | |
| data: { | |
| labels: ['Vocab 1024', 'Vocab 2048', 'Vocab 4096', 'Vocab 8192', 'Vocab 16384'], | |
| datasets: [ | |
| { | |
| label: 'Total Active Model Parameters (Millions)', | |
| data: [3.40, 3.67, 4.19, 5.24, 7.34], | |
| borderColor: '#536bfe', | |
| backgroundColor: 'transparent', | |
| yAxisID: 'yParams', | |
| tension: 0.1, | |
| borderWidth: 3 | |
| }, | |
| { | |
| label: 'Throughput Efficiency (steps/sec)', | |
| data: [8.43, 8.03, 7.38, 6.95, 5.03], | |
| borderColor: '#00e676', | |
| backgroundColor: 'transparent', | |
| yAxisID: 'ySpeed', | |
| tension: 0.1, | |
| borderWidth: 2, | |
| borderDash: [5, 5] | |
| } | |
| ] | |
| }, | |
| options: { | |
| responsive: true, | |
| maintainAspectRatio: false, | |
| plugins: { legend: { labels: { color: '#bbb' } } }, | |
| scales: { | |
| x: { grid: { display: false }, ticks: { color: '#aaa' } }, | |
| yParams: { | |
| type: 'linear', | |
| position: 'left', | |
| title: { display: true, text: 'Parameters (Millions)', color: '#536bfe' }, | |
| grid: { color: '#222' }, | |
| ticks: { color: '#888' } | |
| }, | |
| ySpeed: { | |
| type: 'linear', | |
| position: 'right', | |
| title: { display: true, text: 'Processing Steps / Sec', color: '#00e676' }, | |
| grid: { display: false }, | |
| ticks: { color: '#888' } | |
| } | |
| } | |
| } | |
| }); | |
| </script> | |
| </body> | |
| </html> |