Spaces: Nav772/llm-evaluation-dashboard

Nav772 committed on Feb 22

Commit b5a6418 (verified) · 1 parent: fe69f61

Upload README.md with huggingface_hub

Files changed (1)

README.md +152 -6

@@ -1,12 +1,158 @@
 ---
-title: Llm Evaluation Dashboard
-emoji: 👀
-colorFrom: purple
-colorTo: gray
+title: LLM Evaluation Dashboard
+emoji: 🧪
+colorFrom: blue
+colorTo: green
 sdk: gradio
-sdk_version: 6.6.0
+sdk_version: 5.12.0
 app_file: app.py
 pinned: false
+license: mit
+short_description: Compare LLMs on reasoning, knowledge & instructions
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# 🧪 LLM Evaluation Dashboard
+Compare the performance of multiple Large Language Models across reasoning, knowledge, and instruction-following tasks using the Hugging Face Inference API.
+## 🎯 What This Does
+1. **Benchmark Results** — View pre-computed evaluation results across 15 tasks
+2. **Interactive Charts** — Visualize accuracy and latency comparisons
+3. **Live Testing** — Test any model with your own custom prompts
+4. **Detailed Analysis** — Filter and explore results by model and category
+## 🤖 Models Evaluated
+| Model | Parameters | Type | Organization |
+|-------|------------|------|--------------|
+| Mistral-7B-Instruct | 7B | General | Mistral AI |
+| Llama-3.2-3B-Instruct | 3B | General | Meta |
+| Llama-3.1-70B-Instruct | 70B | General | Meta |
+| Qwen2.5-72B-Instruct | 72B | General | Alibaba |
+| Qwen2.5-Coder-32B | 32B | Code | Alibaba |
+## 📊 Evaluation Categories
+### 1. Reasoning (Math & Logic)
+Tests mathematical computation and logical deduction abilities.
+**Example tasks:**
+- "A store sells apples for $2 each. If I buy 3 apples and pay with $10, how much change do I get?"
+- "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly?"
+### 2. Knowledge (Facts)
+Tests factual accuracy across science, history, and geography.
+**Example tasks:**
+- "What is the chemical symbol for gold?"
+- "What planet is known as the Red Planet?"
+### 3. Instruction Following
+Tests ability to follow specific format constraints.
+**Example tasks:**
+- "Return a JSON object with keys 'name' and 'age'"
+- "List exactly 3 colors, one per line"
+- "Write a sentence of exactly 5 words"
+## 📈 Key Findings
+| Category | Best Model | Score |
+|----------|------------|-------|
+| **Overall** | Mistral-7B | 80% |
+| **Reasoning** | Qwen2.5-Coder | 80% |
+| **Knowledge** | Mistral-7B, Llama-3.2-3B, Qwen2.5-Coder | 100% |
+| **Instruction Following** | Qwen2.5-72B | 100% |
+### Insights
+- **Mistral-7B** achieved the best overall accuracy (80%) with the fastest response time (0.39s avg)
+- **Qwen2.5-Coder** excelled at reasoning tasks despite being code-focused
+- **Qwen2.5-72B** had perfect instruction following but struggled with reasoning
+- **Larger models ≠ better performance:** in this run, Mistral-7B scored higher overall than the 70B and 72B models
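The figures in the Key Findings table are per-model and per-category accuracies. With 15 tasks per model, an overall score of 80% corresponds to 12 correct answers. The aggregation could look roughly like the sketch below. It assumes each evaluation is stored as a record with `model`, `category`, and `correct` keys; those key names and the helpers `accuracy_by` and `best_models` are hypothetical, and the sketch uses only the standard library rather than the Pandas code the dashboard itself may use.

```python
from collections import defaultdict


def accuracy_by(results, *keys):
    """Return {group: fraction correct}, grouping records by the given keys."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in results:
        group = tuple(record[k] for k in keys)
        totals[group] += 1
        hits[group] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


def best_models(results, category=None):
    """Return (models tied for the top accuracy, that accuracy).

    With category=None the score is computed over all tasks.
    """
    rows = [r for r in results if category is None or r["category"] == category]
    scores = accuracy_by(rows, "model")
    if not scores:
        return [], 0.0
    top = max(scores.values())
    # Several models can share the top score, so return all of them.
    return sorted(m for (m,), s in scores.items() if s == top), top
```

Returning a list of tied models, rather than a single winner, matches rows such as Knowledge, where several models share the top score.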
+## 🔧 Technical Implementation
+### Evaluation Pipeline
+```
+┌─────────────────────────────────────────────────────────────┐
+│                  LLM Evaluation Pipeline                    │
+├─────────────────────────────────────────────────────────────┤
+│                                                             │
+│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐     │
+│  │  15 Tasks   │ →  │  5 Models   │ →  │  75 Total   │     │
+│  │ 3 Categories│    │  HF API     │    │  Evaluations│     │
+│  └─────────────┘    └─────────────┘    └─────────────┘     │
+│                                                             │
+│                          ↓                                  │
+│                                                             │
+│  ┌─────────────────────────────────────────────────────┐   │
+│  │              Scoring Functions                       │   │
+│  │  • contains / contains_lower (substring match)      │   │
+│  │  • json_valid (JSON parsing)                        │   │
+│  │  • line_count / word_count (format validation)      │   │
+│  │  • starts_with_lower (constraint checking)          │   │
+│  └─────────────────────────────────────────────────────┘   │
+│                                                             │
+│                          ↓                                  │
+│                                                             │
+│  ┌─────────────────────────────────────────────────────┐   │
+│  │              Dashboard Visualization                 │   │
+│  │  • Accuracy bar charts                              │   │
+│  │  • Category heatmaps                                │   │
+│  │  • Latency comparisons                              │   │
+│  │  • Filterable results table                         │   │
+│  └─────────────────────────────────────────────────────┘   │
+│                                                             │
+└─────────────────────────────────────────────────────────────┘
+```
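The flow in the diagram (every task sent to every model, each response scored and timed) could be sketched as follows. This is an illustration under stated assumptions, not the dashboard's code: the `Task` dataclass, `run_evaluation`, the result keys, and the `generate` callable (a stand-in for the actual Inference API call) are all hypothetical, and only a case-insensitive substring check is shown.

```python
import time
from dataclasses import dataclass


@dataclass
class Task:
    category: str  # e.g. "reasoning", "knowledge", "instruction"
    prompt: str
    expected: str  # substring expected in a correct response


def run_evaluation(tasks, models, generate):
    """Send every task to every model; record correctness and latency.

    `generate(model, prompt)` must return the model's text response.
    """
    results = []
    for model in models:
        for task in tasks:
            start = time.perf_counter()
            try:
                response = generate(model, task.prompt)
            except Exception:
                # Assumption: a timeout or API error counts as a wrong answer.
                response = ""
            latency = time.perf_counter() - start
            results.append({
                "model": model,
                "category": task.category,
                "correct": task.expected.lower() in response.lower(),
                "latency_s": latency,
            })
    return results
```

Called with 15 tasks and 5 models, this loop would produce the 75 evaluation records shown in the diagram. Passing `generate` in as a parameter keeps the loop testable with a stub function, without network access.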
+### Scoring Methods
+| Check Type | Description | Example |
+|------------|-------------|---------|
+| `contains` | Case-sensitive substring match | "4" in "The answer is 4" |
+| `contains_lower` | Case-insensitive match | "mars" in "MARS is red" |
+| `json_valid` | Valid JSON object | `{"name": "Alice"}` |
+| `line_count` | Correct number of lines | 3 lines for "list 3 colors" |
+| `word_count` | Correct word count | 5 words for "5-word sentence" |
+| `starts_with_lower` | First word starts with the given letter (case-insensitive) | "Apple" starts with "a" |
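The check types in the table above could be implemented roughly as below. This is a minimal sketch, not the dashboard's actual code: the function signatures, the `CHECKS` registry, the `score` helper, and the choice to ignore blank lines in `line_count` are assumptions made for illustration.

```python
import json


def contains(response, expected):
    """Case-sensitive substring match."""
    return expected in response


def contains_lower(response, expected):
    """Case-insensitive substring match."""
    return expected.lower() in response.lower()


def json_valid(response, expected=None):
    """True if the whole response parses as a JSON object."""
    try:
        return isinstance(json.loads(response.strip()), dict)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def line_count(response, expected):
    """True if the response has exactly `expected` non-blank lines."""
    return len([ln for ln in response.splitlines() if ln.strip()]) == expected


def word_count(response, expected):
    """True if the response has exactly `expected` whitespace-separated words."""
    return len(response.split()) == expected


def starts_with_lower(response, expected):
    """True if the first word starts with `expected`, ignoring case."""
    words = response.split()
    return bool(words) and words[0].lower().startswith(expected.lower())


# Hypothetical registry mapping check-type names to scoring functions.
CHECKS = {
    "contains": contains,
    "contains_lower": contains_lower,
    "json_valid": json_valid,
    "line_count": line_count,
    "word_count": word_count,
    "starts_with_lower": starts_with_lower,
}


def score(check_type, response, expected=None):
    """Look up the check by name and apply it to a model response."""
    return CHECKS[check_type](response, expected)
```

Note that a strict `json_valid` like this one rejects a response that wraps the JSON in extra prose (for example `Sure! {"name": "Alice"}`); whether the dashboard is that strict is not stated in the README.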
+### Tech Stack
+| Component | Technology | Purpose |
+|-----------|------------|---------|
+| **Frontend** | Gradio | Interactive dashboard UI |
+| **Visualization** | Plotly | Charts and heatmaps |
+| **LLM Access** | Hugging Face Inference API | Free model inference |
+| **Data** | Pandas | Results storage and analysis |
+## 🚀 Live Model Comparison
+The dashboard includes a **Live Comparison** feature where you can:
+1. Enter any custom prompt
+2. Select which models to compare
+3. See responses side-by-side with latency metrics
+## ⚠️ Limitations
+- **Rate Limiting:** The HF Inference API is rate-limited, and some model requests may time out
+- **Task Coverage:** 15 tasks is a small sample, not a comprehensive benchmark
+- **Single Run:** Results come from a single evaluation run, with no statistical averaging
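One way to address the single-run limitation would be to repeat the evaluation several times and report the mean accuracy together with its spread. The sketch below assumes a list of per-run accuracies for one model; the helper name `summarize_runs` and the returned keys are hypothetical and not part of the dashboard.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Mean and sample standard deviation of per-run accuracies."""
    if not accuracies:
        raise ValueError("need at least one run")
    # stdev() needs at least two data points; report 0.0 for a single run.
    spread = stdev(accuracies) if len(accuracies) > 1 else 0.0
    return {"runs": len(accuracies), "mean": mean(accuracies), "stdev": spread}
```

With only 15 tasks per run, one changed answer moves a model's accuracy by about 6.7 percentage points (1/15), so reporting the spread helps show whether a gap between two models is meaningful.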
+## 🎓 What This Project Demonstrates
+- **LLM Evaluation Design** — Creating meaningful benchmarks
+- **API Integration** — Working with the Hugging Face Inference API
+- **Data Visualization** — Building interactive dashboards
+- **Scoring Systems** — Implementing automated evaluation metrics
+## 👤 Author
+**[Nav772](https://huggingface.co/Nav772)** — Built as part of an AI/ML Engineering portfolio.
+## 📄 License
+MIT License