---
library_name: transformers
tags: []
---
# Model Card for Model ID
## Model Details
### Model Description
This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
- **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [More Information Needed]
### Model Sources [optional]
- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]
## Uses
### Direct Use
[More Information Needed]
### Downstream Use [optional]
[More Information Needed]
### Out-of-Scope Use
[More Information Needed]
## Bias, Risks, and Limitations
[More Information Needed]
### Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
## How to Get Started with the Model
Use the code below to get started with the model.
[More Information Needed]
## Training Details
### Training Data
[More Information Needed]
### Training Procedure
#### Preprocessing [optional]
[More Information Needed]
#### Training Hyperparameters
- **Training regime:** [More Information Needed]
#### Speeds, Sizes, Times [optional]
[More Information Needed]
## Evaluation
# Llama 3.1 8B Benchmark Comparison
Comparing **Esmeralda-Llama-3.1-8B-control** against base **Llama 3.1 8B Instruct** and **Hermes-3-Llama-3.1-8B**.
---
## Benchmark Results
| Benchmark | Esmeralda-Llama-3.1-8B-control | Llama 3.1 8B Instruct | Hermes-3-Llama-3.1-8B |
|---|---:|---:|---:|
| HumanEval | **57.3** | 56.1 | 52.4 |
| MBPP | 53.2 | **56.8** | 48.2 |
| GPQA Diamond | 15.7 | 15.7 | **18.2** |
| EQ-Bench | 59.2 | 61.1 | **63.1** |
| Percent Parseable | **100.0** | 92.4 | 91.2 |
---
## Visual Comparison
### HumanEval
🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 57.3 — Esmeralda
🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦 56.1 — Llama 3.1 Instruct
🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪 52.4 — Hermes-3
### MBPP
🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 53.2 — Esmeralda
🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦 56.8 — Llama 3.1 Instruct
🟪🟪🟪🟪🟪🟪🟪🟪🟪 48.2 — Hermes-3
### GPQA Diamond
🟩🟩🟩 15.7 — Esmeralda
🟦🟦🟦 15.7 — Llama 3.1 Instruct
🟪🟪🟪🟪 18.2 — Hermes-3
### EQ-Bench
🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 59.2 — Esmeralda
🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦 61.1 — Llama 3.1 Instruct
🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪 63.1 — Hermes-3
### Percent Parseable
🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 100.0 — Esmeralda
🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦 92.4 — Llama 3.1 Instruct
🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪🟪 91.2 — Hermes-3
---
## Key Takeaways
- **Esmeralda-Llama-3.1-8B-control** slightly leads on HumanEval despite using a relatively small finetuning dataset.
- **Hermes-3-Llama-3.1-8B** shows the strongest EQ-Bench and GPQA performance.
- Base **Llama 3.1 8B Instruct** remains strongest overall on MBPP.
- **Esmeralda-Llama-3.1-8B-control** achieves the best parseability at **100%**.
## Interpretation
Esmeralda-Llama-3.1-8B-control appears to preserve the original capabilities of Llama 3.1 8B Instruct while improving coding consistency and output stability. Hermes-3 emphasizes reasoning and conversational quality more strongly, while the base instruct model maintains balanced coding performance.
## Here Be Dragons 🐉
The following results are exploratory and are **not directly comparable** to standard TruthfulQA leaderboard scores.
### Experimental Truthfulness Evaluation
Esmeralda-Llama-3.1-8B-control was evaluated on TruthfulQA using a freeform-generation setup rather than the standard multiple-choice MC1/MC2 methodology.
Evaluation procedure:
1. The model generated unrestricted freeform answers.
2. A separate judge model — Gemma 4 26B A4B — was prompted to assign:
- `1` for correct/truthful answers
- `0` for incorrect/hallucinated answers
3. The judge compared generations against the TruthfulQA reference answers.
#### Result
| Model | Evaluation Method | Score |
|---|---|---:|
| Esmeralda-Llama-3.1-8B-control | TruthfulQA LLM Judge | **0.682** |
| Hermes-3-Llama-3.1-8B | TruthfulQA MC2 (self-reported) | 0.5869 |
### Notes
- These numbers are **not directly comparable** due to differing methodologies.
- MC2 evaluates constrained multiple-choice accuracy.
- The Esmeralda evaluation instead measures freeform answer truthfulness judged semantically by another LLM.
- Manual inspection of sampled generations suggested the judge model behaved reliably for this experiment.
- No official TruthfulQA score for Llama 3.1 8B Instruct could be located at the time of writing.
This section is provided as an experimental reference rather than a standardized leaderboard claim.