Nav772 committed
Commit b5a6418 · verified · 1 Parent(s): fe69f61

Upload README.md with huggingface_hub

  1. README.md +152 -6
README.md CHANGED
@@ -1,12 +1,158 @@
  ---
- title: Llm Evaluation Dashboard
- emoji: 👀
- colorFrom: purple
- colorTo: gray
  sdk: gradio
- sdk_version: 6.6.0
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
  ---
+ title: LLM Evaluation Dashboard
+ emoji: 🧪
+ colorFrom: blue
+ colorTo: green
  sdk: gradio
+ sdk_version: 5.12.0
  app_file: app.py
  pinned: false
+ license: mit
+ short_description: Compare LLMs on reasoning, knowledge & instructions
  ---

+ # 🧪 LLM Evaluation Dashboard
+
+ Compare the performance of multiple large language models across reasoning, knowledge, and instruction-following tasks using the Hugging Face Inference API.
+
+ ## 🎯 What This Does
+
+ 1. **Benchmark Results**: View pre-computed evaluation results across 15 tasks
+ 2. **Interactive Charts**: Visualize accuracy and latency comparisons
+ 3. **Live Testing**: Test any model with your own custom prompts
+ 4. **Detailed Analysis**: Filter and explore results by model and category
+
+ ## 🤖 Models Evaluated
+
+ | Model | Parameters | Type | Organization |
+ |-------|------------|------|--------------|
+ | Mistral-7B-Instruct | 7B | General | Mistral AI |
+ | Llama-3.2-3B-Instruct | 3B | General | Meta |
+ | Llama-3.1-70B-Instruct | 70B | General | Meta |
+ | Qwen2.5-72B-Instruct | 72B | General | Alibaba |
+ | Qwen2.5-Coder-32B | 32B | Code | Alibaba |
+
+ ## 📊 Evaluation Categories
+
+ ### 1. Reasoning (Math & Logic)
+ Tests mathematical computation and logical deduction.
+
+ **Example tasks:**
+ - "A store sells apples for $2 each. If I buy 3 apples and pay with $10, how much change do I get?"
+ - "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly?"
+
+ ### 2. Knowledge (Facts)
+ Tests factual accuracy across science, history, and geography.
+
+ **Example tasks:**
+ - "What is the chemical symbol for gold?"
+ - "What planet is known as the Red Planet?"
+
+ ### 3. Instruction Following
+ Tests the ability to follow specific formatting constraints.
+
+ **Example tasks:**
+ - "Return a JSON object with keys 'name' and 'age'"
+ - "List exactly 3 colors, one per line"
+ - "Write a sentence of exactly 5 words"
+
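One way the example tasks above might be represented in code is as small records that pair a prompt with a check type and an expected value. This is a sketch only: the `Task` dataclass, its field names, and the `tasks_by_category` helper are assumptions made for illustration, not the dashboard's actual schema. The check names are taken from the Scoring Methods table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are illustrative only."""
    category: str     # "reasoning", "knowledge", or "instruction"
    prompt: str
    check: str        # name of a scoring check, e.g. "contains_lower"
    expected: object  # value the check compares against


TASKS = [
    Task("reasoning",
         "A store sells apples for $2 each. If I buy 3 apples and pay "
         "with $10, how much change do I get?",
         "contains", "4"),  # 10 - 3 * 2 = 4
    Task("knowledge", "What is the chemical symbol for gold?",
         "contains", "Au"),
    Task("knowledge", "What planet is known as the Red Planet?",
         "contains_lower", "mars"),
    Task("instruction", "List exactly 3 colors, one per line",
         "line_count", 3),
    Task("instruction", "Write a sentence of exactly 5 words",
         "word_count", 5),
]


def tasks_by_category(tasks):
    """Group task prompts by category, preserving their original order."""
    grouped = {}
    for task in tasks:
        grouped.setdefault(task.category, []).append(task.prompt)
    return grouped
```

Keeping the expected value next to the prompt means a new task can be added without touching the scoring code.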
+ ## 📈 Key Findings
+
+ | Category | Best Model | Score |
+ |----------|------------|-------|
+ | **Overall** | Mistral-7B | 80% |
+ | **Reasoning** | Qwen2.5-Coder | 80% |
+ | **Knowledge** | Mistral-7B, Llama-3.2, Qwen2.5-Coder | 100% |
+ | **Instruction Following** | Qwen2.5-72B | 100% |
+
+ ### Insights
+
+ - **Mistral-7B** achieved the best overall accuracy (80%) and the fastest average response time (0.39 s)
+ - **Qwen2.5-Coder** excelled at reasoning tasks despite being code-focused
+ - **Qwen2.5-72B** followed instructions perfectly but struggled with reasoning
+ - **Larger models ≠ better performance**: the 7B Mistral model outperformed the 70B+ models overall
+
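Per-model and per-category scores like those in the table above can be derived from a flat list of pass/fail records. The sketch below assumes one dict per evaluation with hypothetical keys `model`, `category`, and `passed`. The rows are placeholders for illustration and are not the dashboard's results.

```python
from collections import defaultdict


def accuracy_by(rows, key):
    """Return {group: fraction of rows with passed=True}, grouped by row[key]."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row[key]] += 1
        passed[row[key]] += int(row["passed"])
    return {group: passed[group] / total[group] for group in total}


# Placeholder rows for illustration only; NOT the dashboard's results.
rows = [
    {"model": "model-a", "category": "reasoning", "passed": True},
    {"model": "model-a", "category": "knowledge", "passed": True},
    {"model": "model-a", "category": "instruction", "passed": False},
    {"model": "model-b", "category": "reasoning", "passed": False},
    {"model": "model-b", "category": "knowledge", "passed": False},
    {"model": "model-b", "category": "instruction", "passed": True},
]

overall = accuracy_by(rows, "model")         # accuracy per model
per_category = accuracy_by(rows, "category")  # accuracy per category
best_model = max(overall, key=overall.get)    # model with the highest accuracy
```

The Tech Stack lists Pandas for results storage, so the real dashboard may well use a `groupby` instead. The standard-library version here is only meant to show the arithmetic.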
+ ## 🔧 Technical Implementation
+
+ ### Evaluation Pipeline
+ ```
+ ┌─────────────────────────────────────────────────────────────┐
+ │                   LLM Evaluation Pipeline                   │
+ ├─────────────────────────────────────────────────────────────┤
+ │                                                             │
+ │   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐   │
+ │   │  15 Tasks   │  →  │  5 Models   │  →  │  75 Total   │   │
+ │   │ 3 Categories│     │   HF API    │     │ Evaluations │   │
+ │   └─────────────┘     └─────────────┘     └─────────────┘   │
+ │                                                             │
+ │                              ↓                              │
+ │                                                             │
+ │   ┌─────────────────────────────────────────────────────┐   │
+ │   │                  Scoring Functions                  │   │
+ │   │  • contains / contains_lower (substring match)      │   │
+ │   │  • json_valid (JSON parsing)                        │   │
+ │   │  • line_count / word_count (format validation)      │   │
+ │   │  • starts_with_lower (constraint checking)          │   │
+ │   └─────────────────────────────────────────────────────┘   │
+ │                                                             │
+ │                              ↓                              │
+ │                                                             │
+ │   ┌─────────────────────────────────────────────────────┐   │
+ │   │               Dashboard Visualization               │   │
+ │   │  • Accuracy bar charts                              │   │
+ │   │  • Category heatmaps                                │   │
+ │   │  • Latency comparisons                              │   │
+ │   │  • Filterable results table                         │   │
+ │   └─────────────────────────────────────────────────────┘   │
+ │                                                             │
+ └─────────────────────────────────────────────────────────────┘
+ ```
+
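The first stage of the pipeline (15 tasks × 5 models = 75 evaluations) amounts to a nested loop that records a pass/fail flag and a latency per call. The following is a minimal sketch under stated assumptions: `run_evaluation`, the task dict keys, and the `ask` and `score` callables are hypothetical names, and `ask` stands in for whatever call the app makes to the inference API.

```python
import time


def run_evaluation(tasks, models, ask, score):
    """Evaluate every model on every task.

    tasks:  list of dicts with hypothetical keys "prompt", "check", "expected"
    models: list of model names
    ask:    callable (model, prompt) -> response text; stands in for the API call
    score:  callable (response, check, expected) -> bool
    """
    results = []
    for model in models:
        for task in tasks:
            start = time.perf_counter()
            response = ask(model, task["prompt"])
            latency = time.perf_counter() - start
            results.append({
                "model": model,
                "prompt": task["prompt"],
                "passed": score(response, task["check"], task["expected"]),
                "latency_s": latency,
            })
    return results


# Stub model and scorer so the sketch runs without network access.
def fake_ask(model, prompt):
    return "The answer is 4"


def substring_score(response, check, expected):
    return expected in response


demo_tasks = [{"prompt": "What is 2 + 2?", "check": "contains", "expected": "4"}]
demo = run_evaluation(demo_tasks, ["model-a", "model-b"], fake_ask, substring_score)
```

Passing `ask` and `score` in as arguments keeps the loop testable offline, which matters when the real backend is rate limited.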
+ ### Scoring Methods
+
+ | Check Type | Description | Example |
+ |------------|-------------|---------|
+ | `contains` | Case-sensitive substring match | "4" in "The answer is 4" |
+ | `contains_lower` | Case-insensitive substring match | "mars" in "MARS is red" |
+ | `json_valid` | Response parses as a valid JSON object | `{"name": "Alice"}` |
+ | `line_count` | Correct number of lines | 3 lines for "list 3 colors" |
+ | `word_count` | Correct word count | 5 words for "5-word sentence" |
+ | `starts_with_lower` | Response starts with the expected letter (case-insensitive) | "Apple" starts with "a" |
+
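The checks in the table could be implemented as a single dispatch function along the following lines. This is a sketch based only on the descriptions above; the function name `score_response` and details such as ignoring blank lines or stripping surrounding whitespace are assumptions, not the app's confirmed behavior.

```python
import json


def score_response(response, check, expected=None):
    """Return True if `response` passes the named check."""
    text = response.strip()
    if check == "contains":
        return str(expected) in text
    if check == "contains_lower":
        return str(expected).lower() in text.lower()
    if check == "json_valid":
        try:
            # Only a JSON object (dict) counts, not a bare list or number.
            return isinstance(json.loads(text), dict)
        except json.JSONDecodeError:
            return False
    if check == "line_count":
        # Assumption: blank lines are not counted.
        lines = [line for line in text.splitlines() if line.strip()]
        return len(lines) == expected
    if check == "word_count":
        return len(text.split()) == expected
    if check == "starts_with_lower":
        return text.lower().startswith(str(expected).lower())
    raise ValueError(f"unknown check type: {check}")
```

Substring checks are cheap but lenient: `contains` with "4" also passes on "14". Whether to tighten them, or to strip Markdown code fences before parsing JSON, is a design choice the README does not specify.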
+ ### Tech Stack
+
+ | Component | Technology | Purpose |
+ |-----------|------------|---------|
+ | **Frontend** | Gradio | Interactive dashboard UI |
+ | **Visualization** | Plotly | Charts and heatmaps |
+ | **LLM Access** | Hugging Face Inference API | Free model inference |
+ | **Data** | Pandas | Results storage and analysis |
+
+ ## 🚀 Live Model Comparison
+
+ The dashboard includes a **Live Comparison** feature where you can:
+
+ 1. Enter any custom prompt
+ 2. Select which models to compare
+ 3. See responses side by side with latency metrics
+
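The three steps above can be sketched as a concurrent fan-out: the same prompt goes to every selected model, and each response is returned with its own latency. The names `timed_call`, `compare_models`, and `echo_model` are hypothetical, and `ask` is a stand-in for the real inference call, which the README does not show.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(ask, model, prompt):
    """Call ask(model, prompt) and return the response with its latency."""
    start = time.perf_counter()
    try:
        response = ask(model, prompt)
    except Exception as exc:  # show failures (e.g. timeouts) as a row, not a crash
        response = f"[error] {exc}"
    return {"model": model, "response": response,
            "latency_s": time.perf_counter() - start}


def compare_models(ask, models, prompt):
    """Query all selected models concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        return list(pool.map(lambda m: timed_call(ask, m, prompt), models))


def echo_model(model, prompt):
    """Stub standing in for a real inference call."""
    if model == "broken-model":
        raise TimeoutError("request timed out")
    return f"{model} says: {prompt.upper()}"


rows = compare_models(echo_model, ["model-a", "broken-model"], "hello")
```

Running the calls in threads means the slowest model bounds the total wait, and one failing model still leaves the other columns populated.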
+ ## ⚠️ Limitations
+
+ - **Rate Limiting:** The Hugging Face Inference API has rate limits; some models may time out
+ - **Task Coverage:** 15 tasks is a small sample, not a comprehensive benchmark
+ - **Single Run:** Results come from a single evaluation run (no statistical averaging)
+
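Two of these limitations have common mitigations: retrying rate-limited or timed-out calls with exponential backoff, and repeating the evaluation to report a mean and spread. The sketch below is not part of the project; `call_with_retry` and `summarize_runs` are hypothetical helpers, and the retry count and delays are arbitrary assumptions.

```python
import statistics
import time


def call_with_retry(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            sleep(base_delay * 2 ** attempt)


def summarize_runs(accuracies):
    """Mean and sample standard deviation of per-run accuracy values."""
    spread = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return statistics.mean(accuracies), spread


# Simulated flaky call: fails twice, then succeeds.
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky, sleep=delays.append)
```

Catching every `Exception` keeps the sketch short; a real client would more likely retry only on timeouts and rate-limit responses.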
+ ## 🎓 What This Project Demonstrates
+
+ - **LLM Evaluation Design**: Creating meaningful benchmarks
+ - **API Integration**: Working with the Hugging Face Inference API
+ - **Data Visualization**: Building interactive dashboards
+ - **Scoring Systems**: Implementing automated evaluation metrics
+
+ ## 👤 Author
+
+ **[Nav772](https://huggingface.co/Nav772)**. Built as part of an AI/ML Engineering portfolio.
+
+ ## 📄 License
+
+ MIT License