---
library_name: transformers
tags: []
---

# Model Card for Model ID

## Model Details

### Model Description

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [More Information Needed]

### Model Sources [optional]

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

### Direct Use

[More Information Needed]

### Downstream Use [optional]

[More Information Needed]

### Out-of-Scope Use

[More Information Needed]

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

## Training Details

### Training Data

[More Information Needed]

### Training Procedure

#### Preprocessing [optional]

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** [More Information Needed]

#### Speeds, Sizes, Times [optional]

[More Information Needed]

## Evaluation

# Llama 3.1 8B Benchmark Comparison

Comparing **Esmeralda-Llama-3.1-8B-control** against base **Llama 3.1 8B Instruct** and **Hermes-3-Llama-3.1-8B**.

---

## Benchmark Results

| Benchmark | Esmeralda-Llama-3.1-8B-control | Llama 3.1 8B Instruct | Hermes-3-Llama-3.1-8B |
|---|---:|---:|---:|
| HumanEval | **57.3** | 56.1 | 52.4 |
| MBPP | 53.2 | **56.8** | 48.2 |
| GPQA Diamond | 15.7 | 15.7 | **18.2** |
| EQ-Bench | 59.2 | 61.1 | **63.1** |
| Percent Parseable | **100.0** | 92.4 | 91.2 |

---

## Visual Comparison

### HumanEval

🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 57.3 - Esmeralda

🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦 56.1 - Llama 3.1 Instruct

🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪 52.4 - Hermes-3

### MBPP

🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 53.2 - Esmeralda

🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦 56.8 - Llama 3.1 Instruct

🟪🟪🟪🟪🟪🟪🟪🟪🟪 48.2 - Hermes-3

### GPQA Diamond

🟩🟩🟩 15.7 - Esmeralda

🟦🟦🟦 15.7 - Llama 3.1 Instruct

🟪🟪🟪🟪 18.2 - Hermes-3

### EQ-Bench

🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 59.2 - Esmeralda

🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦 61.1 - Llama 3.1 Instruct

🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪 63.1 - Hermes-3

### Percent Parseable

🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 100.0 - Esmeralda

🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦 92.4 - Llama 3.1 Instruct

🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪 91.2 - Hermes-3

---

## Key Takeaways

- **Esmeralda-Llama-3.1-8B-control** leads narrowly on HumanEval (57.3 vs. 56.1) despite using a relatively small finetuning dataset.
- **Hermes-3-Llama-3.1-8B** shows the strongest EQ-Bench and GPQA Diamond performance.
- Base **Llama 3.1 8B Instruct** remains the strongest on MBPP (56.8).
- **Esmeralda-Llama-3.1-8B-control** achieves the best parseability at **100%**.

## Interpretation

Esmeralda-Llama-3.1-8B-control appears to largely preserve the capabilities of Llama 3.1 8B Instruct (identical GPQA Diamond, slightly higher HumanEval, somewhat lower MBPP and EQ-Bench) while improving output stability, as reflected in its 100% parseability. Hermes-3 emphasizes reasoning and conversational quality more strongly, while the base instruct model maintains balanced coding performance.

## Here Be Dragons 🐉

The following results are exploratory and are **not directly comparable** to standard TruthfulQA leaderboard scores.

### Experimental Truthfulness Evaluation

Esmeralda-Llama-3.1-8B-control was evaluated on TruthfulQA using a freeform-generation setup rather than the standard multiple-choice MC1/MC2 methodology.

Evaluation procedure:

1. The model generated unrestricted freeform answers.
2. A separate judge model, Gemma 4 26B A4B, was prompted to assign:
   - `1` for correct/truthful answers
   - `0` for incorrect/hallucinated answers
3. The judge compared generations against the TruthfulQA reference answers.

#### Result

| Model | Evaluation Method | Score |
|---|---|---:|
| Esmeralda-Llama-3.1-8B-control | TruthfulQA LLM Judge | **0.682** |
| Hermes-3-Llama-3.1-8B | TruthfulQA MC2 (self-reported) | 0.5869 |

### Notes

- These numbers are **not directly comparable** due to differing methodologies.
- MC2 evaluates constrained multiple-choice accuracy.
- The Esmeralda evaluation instead measures freeform answer truthfulness, judged semantically by another LLM.
- Manual inspection of sampled generations suggested the judge model behaved reliably for this experiment.
- No official TruthfulQA score for Llama 3.1 8B Instruct could be located at the time of writing.

This section is provided as an experimental reference rather than a standardized leaderboard claim.
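
As a rough illustration of the judge-based scoring in the evaluation procedure, the sketch below assumes that the reported score is the mean of the judge's binary labels and that each judge reply is a bare `0` or `1`. The helper names `parse_verdict` and `truthfulness_score` are hypothetical and are not taken from the original evaluation code; skipping unparseable replies is likewise an assumption.

```python
from typing import Iterable, Optional


def parse_verdict(raw: str) -> Optional[int]:
    """Return 1 or 0 if the judge reply is exactly that digit, else None.

    Assumption: the judge was prompted to answer with a bare "0" or "1".
    """
    text = raw.strip()
    if text in ("0", "1"):
        return int(text)
    return None


def truthfulness_score(verdicts: Iterable[str]) -> float:
    """Mean of the parsed 0/1 labels.

    Assumption: replies that cannot be parsed are skipped rather than
    counted as incorrect; the original setup may have handled them differently.
    """
    labels = [v for v in map(parse_verdict, verdicts) if v is not None]
    if not labels:
        raise ValueError("no parseable judge verdicts")
    return sum(labels) / len(labels)


# Example: three truthful answers out of four judged generations.
example_score = truthfulness_score(["1", "0", "1", " 1\n"])  # 0.75
```

Under these assumptions, a score of 0.682 would correspond to roughly 68% of generations being labeled `1` by the judge.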