Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -1,12 +1,158 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
sdk: gradio
|
| 7 |
-
sdk_version:
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
---
|
| 11 |
|
| 12 |
-
|
| 1 |
---
|
| 2 |
+
title: LLM Evaluation Dashboard
|
| 3 |
+
emoji: 🧪
|
| 4 |
+
colorFrom: blue
|
| 5 |
+
colorTo: green
|
| 6 |
sdk: gradio
|
| 7 |
+
sdk_version: 5.12.0
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
+
license: mit
|
| 11 |
+
short_description: Compare LLMs on reasoning, knowledge & instructions
|
| 12 |
---
|
| 13 |
|
| 14 |
+
# 🧪 LLM Evaluation Dashboard
|
| 15 |
+
|
| 16 |
+
Compare the performance of multiple Large Language Models across reasoning, knowledge, and instruction-following tasks using the HuggingFace Inference API.
|
| 17 |
+
|
| 18 |
+
## 🎯 What This Does
|
| 19 |
+
|
| 20 |
+
1. **Benchmark Results**: View pre-computed evaluation results across 15 tasks
|
| 21 |
+
2. **Interactive Charts**: Visualize accuracy and latency comparisons
|
| 22 |
+
3. **Live Testing**: Test any model with your own custom prompts
|
| 23 |
+
4. **Detailed Analysis**: Filter and explore results by model and category
|
| 24 |
+
|
| 25 |
+
## 🤖 Models Evaluated
|
| 26 |
+
|
| 27 |
+
| Model | Parameters | Type | Organization |
|
| 28 |
+
|-------|------------|------|--------------|
|
| 29 |
+
| Mistral-7B-Instruct | 7B | General | Mistral AI |
|
| 30 |
+
| Llama-3.2-3B-Instruct | 3B | General | Meta |
|
| 31 |
+
| Llama-3.1-70B-Instruct | 70B | General | Meta |
|
| 32 |
+
| Qwen2.5-72B-Instruct | 72B | General | Alibaba |
|
| 33 |
+
| Qwen2.5-Coder-32B | 32B | Code | Alibaba |
|
| 34 |
+
|
| 35 |
+
## Evaluation Categories
|
| 36 |
+
|
| 37 |
+
### 1. Reasoning (Math & Logic)
|
| 38 |
+
Tests mathematical computation and logical deduction abilities.
|
| 39 |
+
|
| 40 |
+
**Example tasks:**
|
| 41 |
+
- "A store sells apples for $2 each. If I buy 3 apples and pay with $10, how much change do I get?"
|
| 42 |
+
- "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly?"
|
| 43 |
+
|
| 44 |
+
### 2. Knowledge (Facts)
|
| 45 |
+
Tests factual accuracy across science, history, and geography.
|
| 46 |
+
|
| 47 |
+
**Example tasks:**
|
| 48 |
+
- "What is the chemical symbol for gold?"
|
| 49 |
+
- "What planet is known as the Red Planet?"
|
| 50 |
+
|
| 51 |
+
### 3. Instruction Following
|
| 52 |
+
Tests ability to follow specific format constraints.
|
| 53 |
+
|
| 54 |
+
**Example tasks:**
|
| 55 |
+
- "Return a JSON object with keys 'name' and 'age'"
|
| 56 |
+
- "List exactly 3 colors, one per line"
|
| 57 |
+
- "Write a sentence of exactly 5 words"
|
| 58 |
+
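Each example task pairs a prompt with a machine-checkable expectation. One way to represent that, as a sketch only: the `Task` class and its field names are hypothetical, the check types are the ones listed under Scoring Methods, and the expected values are assumptions rather than the project's actual task file.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Task:
    """One benchmark item (hypothetical structure, for illustration)."""
    category: str                 # "reasoning", "knowledge", or "instruction"
    prompt: str
    check_type: str               # e.g. "contains", "line_count"
    expected: Union[str, int, None] = None


TASKS = [
    Task("reasoning",
         "A store sells apples for $2 each. If I buy 3 apples and pay "
         "with $10, how much change do I get?",
         "contains", "4"),        # 10 - 3 * 2 = 4
    Task("knowledge", "What is the chemical symbol for gold?",
         "contains", "Au"),
    Task("knowledge", "What planet is known as the Red Planet?",
         "contains_lower", "mars"),
    Task("instruction", "List exactly 3 colors, one per line",
         "line_count", 3),
    Task("instruction", "Write a sentence of exactly 5 words",
         "word_count", 5),
]
```

Keeping the expectation next to the prompt means the same scoring code can grade every model without task-specific logic.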
|
| 59 |
+
## Key Findings
|
| 60 |
+
|
| 61 |
+
| Category | Best Model | Score |
|
| 62 |
+
|----------|------------|-------|
|
| 63 |
+
| **Overall** | Mistral-7B | 80% |
|
| 64 |
+
| **Reasoning** | Qwen2.5-Coder | 80% |
|
| 65 |
+
| **Knowledge** | Mistral-7B, Llama-3.2-3B, Qwen2.5-Coder | 100% |
|
| 66 |
+
| **Instruction Following** | Qwen2.5-72B | 100% |
|
| 67 |
+
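Scores like those in the table are accuracies over individual pass/fail results. A minimal sketch of that aggregation, assuming each result is a record with `model`, `category`, and `passed` fields (hypothetical names); the sample rows are invented for illustration and are not the dashboard's data:

```python
from collections import defaultdict


def accuracy_by(results: list[dict], *keys: str) -> dict[tuple, float]:
    """Fraction of passed results, grouped by the given record keys."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for row in results:
        group = tuple(row[k] for k in keys)
        total[group] += 1
        passed[group] += bool(row["passed"])
    return {group: passed[group] / total[group] for group in total}


# Invented sample rows, for illustration only.
sample = [
    {"model": "model-a", "category": "reasoning", "passed": True},
    {"model": "model-a", "category": "reasoning", "passed": False},
    {"model": "model-a", "category": "knowledge", "passed": True},
    {"model": "model-b", "category": "knowledge", "passed": False},
]

per_category = accuracy_by(sample, "model", "category")
overall = accuracy_by(sample, "model")
```

With 15 tasks, a single task shifts a model's overall accuracy by about 6.7 percentage points, so small differences between models should be read with caution.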
|
| 68 |
+
### Insights
|
| 69 |
+
|
| 70 |
+
- **Mistral-7B** achieved the best overall accuracy (80%) with the fastest response time (0.39s avg)
|
| 71 |
+
- **Qwen2.5-Coder** excelled at reasoning tasks despite being code-focused
|
| 72 |
+
- **Qwen2.5-72B** had perfect instruction following but struggled with reasoning
|
| 73 |
+
- **Larger models ≠ better performance**: the 7B Mistral model outperformed the 70B+ models
|
| 74 |
+
|
| 75 |
+
## 🔧 Technical Implementation
|
| 76 |
+
|
| 77 |
+
### Evaluation Pipeline
|
| 78 |
+
```
|
| 79 |
+
┌─────────────────────────────────────────────────────────────┐
|
| 80 |
+
│                   LLM Evaluation Pipeline                   │
|
| 81 |
+
├─────────────────────────────────────────────────────────────┤
|
| 82 |
+
│                                                             │
|
| 83 |
+
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐    │
|
| 84 |
+
│  │  15 Tasks   │  →  │  5 Models   │  →  │  75 Total   │    │
|
| 85 |
+
│  │ 3 Categories│     │   HF API    │     │ Evaluations │    │
|
| 86 |
+
│  └─────────────┘     └─────────────┘     └─────────────┘    │
|
| 87 |
+
│                                                             │
|
| 88 |
+
│                              ↓                              │
|
| 89 |
+
│                                                             │
|
| 90 |
+
│  ┌─────────────────────────────────────────────────────┐    │
|
| 91 |
+
│  │                  Scoring Functions                  │    │
|
| 92 |
+
│  │  • contains / contains_lower (substring match)      │    │
|
| 93 |
+
│  │  • json_valid (JSON parsing)                        │    │
|
| 94 |
+
│  │  • line_count / word_count (format validation)      │    │
|
| 95 |
+
│  │  • starts_with_lower (constraint checking)          │    │
|
| 96 |
+
│  └─────────────────────────────────────────────────────┘    │
|
| 97 |
+
│                                                             │
|
| 98 |
+
│                              ↓                              │
|
| 99 |
+
│                                                             │
|
| 100 |
+
│  ┌─────────────────────────────────────────────────────┐    │
|
| 101 |
+
│  │               Dashboard Visualization               │    │
|
| 102 |
+
│  │  • Accuracy bar charts                              │    │
|
| 103 |
+
│  │  • Category heatmaps                                │    │
|
| 104 |
+
│  │  • Latency comparisons                              │    │
|
| 105 |
+
│  │  • Filterable results table                         │    │
|
| 106 |
+
│  └─────────────────────────────────────────────────────┘    │
|
| 107 |
+
│                                                             │
|
| 108 |
+
└─────────────────────────────────────────────────────────────┘
|
| 109 |
+
```
|
| 110 |
+
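The flow in the diagram amounts to a nested loop: every task is sent to every model, so 15 tasks and 5 models yield 75 evaluations. A minimal sketch follows; `run_evaluation`, the `generate` and `check` callables, and the record fields are hypothetical names, not the project's actual code.

```python
import time
from typing import Callable


def run_evaluation(
    tasks: list[dict],
    models: list[str],
    generate: Callable[[str, str], str],
    check: Callable[[dict, str], bool],
) -> list[dict]:
    """Run every task against every model; record pass/fail and latency."""
    results = []
    for model in models:
        for task in tasks:
            start = time.perf_counter()
            try:
                response = generate(model, task["prompt"])
                passed = check(task, response)
            except Exception:
                # Treat API failures (timeouts, rate limits) as a failed task.
                passed = False
            results.append({
                "model": model,
                "category": task["category"],
                "passed": passed,
                "latency_s": time.perf_counter() - start,
            })
    return results
```

Passing `generate` and `check` in as arguments keeps the loop independent of any particular inference client, which also makes it easy to exercise with stub functions.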
|
| 111 |
+
### Scoring Methods
|
| 112 |
+
|
| 113 |
+
| Check Type | Description | Example |
|
| 114 |
+
|------------|-------------|---------|
|
| 115 |
+
| `contains` | Case-sensitive substring match | "4" in "The answer is 4" |
|
| 116 |
+
| `contains_lower` | Case-insensitive match | "mars" in "MARS is red" |
|
| 117 |
+
| `json_valid` | Valid JSON object | `{"name": "Alice"}` |
|
| 118 |
+
| `line_count` | Correct number of lines | 3 lines for "list 3 colors" |
|
| 119 |
+
| `word_count` | Correct word count | 5 words for "5-word sentence" |
|
| 120 |
+
| `starts_with_lower` | Response starts with the expected letter (case-insensitive) | "Apple" starts with "a" |
|
| 121 |
+
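The check types in the table could be implemented along these lines. This is a minimal sketch, not the dashboard's actual code: the function name `score` and its signature are assumptions, and details such as ignoring blank lines in `line_count` are illustrative choices.

```python
import json


def score(check_type: str, response: str, expected=None) -> bool:
    """Return True if `response` passes the named check (hypothetical helper)."""
    text = response.strip()
    if check_type == "contains":
        # Case-sensitive substring match.
        return str(expected) in text
    if check_type == "contains_lower":
        # Case-insensitive substring match.
        return str(expected).lower() in text.lower()
    if check_type == "json_valid":
        # The response must parse as a JSON object (a dict in Python).
        try:
            return isinstance(json.loads(text), dict)
        except json.JSONDecodeError:
            return False
    if check_type == "line_count":
        # Count non-empty lines only (an assumption of this sketch).
        return len([ln for ln in text.splitlines() if ln.strip()]) == expected
    if check_type == "word_count":
        # Words are whitespace-separated tokens.
        return len(text.split()) == expected
    if check_type == "starts_with_lower":
        # The response must begin with the expected letter, ignoring case.
        return text.lower().startswith(str(expected).lower())
    raise ValueError(f"unknown check type: {check_type}")
```

For example, `score("contains_lower", "MARS is red", "mars")` passes, while the same inputs with `contains` fail because that check is case-sensitive.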
|
| 122 |
+
### Tech Stack
|
| 123 |
+
|
| 124 |
+
| Component | Technology | Purpose |
|
| 125 |
+
|-----------|------------|---------|
|
| 126 |
+
| **Frontend** | Gradio | Interactive dashboard UI |
|
| 127 |
+
| **Visualization** | Plotly | Charts and heatmaps |
|
| 128 |
+
| **LLM Access** | HuggingFace Inference API | Free model inference |
|
| 129 |
+
| **Data** | Pandas | Results storage and analysis |
|
| 130 |
+
|
| 131 |
+
## Live Model Comparison
|
| 132 |
+
|
| 133 |
+
The dashboard includes a **Live Comparison** feature where you can:
|
| 134 |
+
|
| 135 |
+
1. Enter any custom prompt
|
| 136 |
+
2. Select which models to compare
|
| 137 |
+
3. See responses side-by-side with latency metrics
|
| 138 |
+
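A side-by-side comparison with latency can be sketched as below. The `compare_models` function and the injected `generate` callable are hypothetical; in the dashboard the callable would wrap a call to the Hugging Face Inference API, which this sketch deliberately leaves out so that it runs offline.

```python
import time
from typing import Callable


def compare_models(
    prompt: str,
    models: list[str],
    generate: Callable[[str, str], str],
) -> list[dict]:
    """Send one prompt to each selected model and time each response."""
    rows = []
    for model in models:
        start = time.perf_counter()
        try:
            response = generate(model, prompt)
        except Exception as exc:
            # Show API errors in the results instead of aborting the run.
            response = f"[error: {exc}]"
        rows.append({
            "model": model,
            "response": response,
            "latency_s": round(time.perf_counter() - start, 3),
        })
    return rows


# Stand-in for a real inference call, used only to make the sketch runnable.
def fake_generate(model: str, prompt: str) -> str:
    return f"{model} says: {prompt.upper()}"


rows = compare_models("hello", ["model-a", "model-b"], fake_generate)
```

Each row holds the model name, its response, and the elapsed time in seconds, which maps directly onto a side-by-side table.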
|
| 139 |
+
## ⚠️ Limitations
|
| 140 |
+
|
| 141 |
+
- **Rate Limiting:** The HF Inference API has rate limits; some models may time out
|
| 142 |
+
- **Task Coverage:** The 15 tasks are a small sample, not a comprehensive benchmark
|
| 143 |
+
- **Single Run:** Results come from one evaluation run (no statistical averaging)
|
| 144 |
+
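One common way to soften rate limits and timeouts is to retry with exponential backoff. The helper below is a generic sketch and not part of the project: `with_retries` and its parameters are hypothetical, and which exceptions are worth retrying depends on the client library in use.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call()` up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter exists so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.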
|
| 145 |
+
## What This Project Demonstrates
|
| 146 |
+
|
| 147 |
+
- **LLM Evaluation Design**: Creating meaningful benchmarks
|
| 148 |
+
- **API Integration**: Working with the HuggingFace Inference API
|
| 149 |
+
- **Data Visualization**: Building interactive dashboards
|
| 150 |
+
- **Scoring Systems**: Implementing automated evaluation metrics
|
| 151 |
+
|
| 152 |
+
## 👤 Author
|
| 153 |
+
|
| 154 |
+
**[Nav772](https://huggingface.co/Nav772)**: Built as part of an AI/ML Engineering portfolio.
|
| 155 |
+
|
| 156 |
+
## License
|
| 157 |
+
|
| 158 |
+
MIT License
|