---
license: mit
language:
- en
tags:
- llm-evaluation
- benchmarking
- nlp
- evaluation
- accuracy
- hallucination
- reasoning
- gpt
- claude
- gemini
- mistral
- llama
- mmlu
- truthfulqa
- open-source
- python
- fastapi
- streamlit
library_name: llm-evaluation-framework
pipeline_tag: text-generation
---
# LLM Evaluation Framework
> **Production-grade open-source LLM benchmarking.**
> Evaluate GPT-4, Claude, Gemini, Mistral, and Llama side by side on 5 metrics, in one command.
## What This Is
This is the **model card / hub page** for the LLM Evaluation Framework.
The framework is a Python tool, not a set of model weights; this page is the
Hugging Face Hub entry point that links its resources together.
| Resource | Link |
|---|---|
| GitHub | https://github.com/vignesh2027/LLM-Evaluation-Framework |
| Live Demo | https://huggingface.co/spaces/vigneshwar234/llm-eval-demo |
| Dataset | https://huggingface.co/datasets/vigneshwar234/llm-eval-benchmark |
| Docs | https://vignesh2027.github.io/LLM-Evaluation-Framework/ |
## Quick Start
```bash
pip install llm-evaluation-framework
export OPENAI_API_KEY="sk-..."
llm-eval run --model gpt-4o-mini --benchmark mmlu --samples 100
```
**Output:**
```
╭──────────────────────────────────────╮
│ Evaluation: gpt-4o-mini │
├──────────────────┬───────────────────┤
│ Accuracy │ 78.00% │
│ Avg Latency │ 432 ms │
│ P95 Latency │ 1240 ms │
│ Total Cost │ $0.0023 │
│ Hallucination │ 2.40% │
│ Reasoning Score │ 7.2 / 10 │
╰──────────────────┴───────────────────╯
```
## 5 Evaluation Metrics
| Metric | Description | Output |
|---|---|---|
| **Accuracy** | 4-strategy cascade: exact → normalized → multiple-choice (MC) → fuzzy match | 0.0–1.0 |
| **Latency** | p50, p75, p90, p95, p99 percentiles + SLA violation rate | ms |
| **Cost** | Real token counts × pricing table for 15+ models | $/1K tokens |
| **Hallucination Rate** | Linguistic signal analysis (v1); NLI-based detection planned (v2) | 0.0–1.0 |
| **Reasoning Quality** | Chain-of-thought depth scoring | 1–10 |
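
As an illustration of the accuracy cascade in the table above, here is a minimal sketch of how a four-step match could work. It is not the framework's actual code: the function names (`normalize`, `extract_choice`, `match_strategy`, `accuracy`), the A-D answer letters, and the 0.9 fuzzy threshold are all assumptions made for this example.

```python
import re
import string
from difflib import SequenceMatcher
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def extract_choice(text: str) -> Optional[str]:
    """Return the first standalone capital letter A-D in the text, if any."""
    match = re.search(r"\b([A-D])\b", text)
    return match.group(1) if match else None


def match_strategy(prediction: str, reference: str,
                   fuzzy_threshold: float = 0.9) -> Optional[str]:
    """Return the name of the first strategy that accepts the prediction."""
    # 1. Exact string equality.
    if prediction == reference:
        return "exact"
    # 2. Equality after normalization.
    norm_pred, norm_ref = normalize(prediction), normalize(reference)
    if norm_pred == norm_ref:
        return "normalized"
    # 3. Multiple choice: the reference is a single answer letter.
    letter = reference.strip()
    if letter in {"A", "B", "C", "D"} and extract_choice(prediction) == letter:
        return "mc"
    # 4. Fuzzy match on the normalized strings (threshold is an assumption).
    if SequenceMatcher(None, norm_pred, norm_ref).ratio() >= fuzzy_threshold:
        return "fuzzy"
    return None


def accuracy(pairs: list) -> float:
    """Fraction of (prediction, reference) pairs accepted by any strategy."""
    if not pairs:
        return 0.0
    hits = sum(match_strategy(p, r) is not None for p, r in pairs)
    return hits / len(pairs)
```

Ordering the strategies from strictest to loosest means a cheap exact comparison settles most cases, and the looser fuzzy step only runs when everything else has failed.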
## Supported Models
| Provider | Models |
|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1, o1-mini, GPT-3.5-turbo |
| Anthropic | Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus |
| Google | Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 2.0 Flash |
| Mistral | Mistral Large, Mistral Small |
| Meta | Llama 3 70B, Llama 3 8B (via Together AI) |
| Local | Ollama, vLLM, HuggingFace TGI |
## Sample Benchmark Results (MMLU, 100 samples)
| Model | Accuracy | Latency | Cost per 1K tokens | Hallucination | Reasoning |
|---|---|---|---|---|---|
| GPT-4o | 88.2% | 892ms | $0.0080 | 1.8% | 8.4/10 |
| Claude 3.5 Sonnet | 87.6% | 1240ms | $0.0090 | 2.1% | 8.6/10 |
| GPT-4o-mini | 78.4% | 432ms | $0.0003 | 3.2% | 7.2/10 |
| Gemini 1.5 Flash | 76.8% | 380ms | $0.0001 | 4.1% | 6.8/10 |
| Claude 3 Haiku | 74.2% | 410ms | $0.0010 | 4.8% | 6.5/10 |
**Key finding:** GPT-4o-mini reaches about 89% of GPT-4o's accuracy (78.4% vs. 88.2%) at about 4% of the cost ($0.0003 vs. $0.0080 per 1K tokens).
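
The ratios behind the key finding can be recomputed from the table. The short sketch below copies the two relevant rows and derives both ratios so the rounding can be checked; the dictionary layout and the `relative` helper are made up for this example and are not part of the framework.

```python
# Figures copied from the sample results table above.
results = {
    "GPT-4o": {"accuracy": 88.2, "cost_per_1k": 0.0080},
    "GPT-4o-mini": {"accuracy": 78.4, "cost_per_1k": 0.0003},
}


def relative(metric: str, model: str, baseline: str) -> float:
    """Value of `metric` for `model` as a fraction of the baseline's value."""
    return results[model][metric] / results[baseline][metric]


accuracy_ratio = relative("accuracy", "GPT-4o-mini", "GPT-4o")
cost_ratio = relative("cost_per_1k", "GPT-4o-mini", "GPT-4o")

print(f"accuracy: {accuracy_ratio:.0%} of GPT-4o")
print(f"cost:     {cost_ratio:.0%} of GPT-4o")
```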
## Features
- **Async parallel evaluation**: up to 10 models evaluated concurrently, bounded by an `asyncio.Semaphore`
- **Streamlit dashboard**: radar charts, latency histograms, cost vs. quality scatter plots
- **FastAPI REST API**: 12 endpoints with OpenAPI docs
- **CLI tool**: 7 subcommands with rich terminal output
- **PDF report generator**: professional layout via ReportLab
- **SQLite persistence**: zero-config, file-based storage
- **Docker ready**: multi-stage build, `docker-compose up`
- **40+ tests, 95% coverage**: run with pytest, no API keys needed
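
To illustrate the semaphore-bounded concurrency mentioned in the feature list, here is a minimal sketch of the general pattern. It is not the framework's evaluator: `evaluate_model` is a hypothetical stand-in that sleeps instead of calling a provider, and `evaluate_all` and `max_concurrent` are names invented for this example.

```python
import asyncio
import random


async def evaluate_model(name: str) -> dict:
    """Hypothetical stand-in for one model evaluation (no API call is made)."""
    await asyncio.sleep(random.uniform(0.01, 0.05))  # simulate network latency
    return {"model": name, "status": "done"}


async def evaluate_all(models: list, max_concurrent: int = 10) -> list:
    """Evaluate all models, with at most `max_concurrent` running at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(name: str) -> dict:
        # Each task waits here until one of the semaphore's slots is free.
        async with semaphore:
            return await evaluate_model(name)

    # gather() returns results in the same order as the input list.
    return await asyncio.gather(*(bounded(m) for m in models))


if __name__ == "__main__":
    names = ["model-a", "model-b", "model-c"]
    for row in asyncio.run(evaluate_all(names, max_concurrent=2)):
        print(row)
```

Capping concurrency this way keeps a large model list from triggering provider rate limits while still overlapping the slow network calls.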
## Architecture
```
CLI / FastAPI / Streamlit / PDF Generator
│
Core Evaluator (asyncio)
│
┌──────────┼──────────┬──────────┐
Metrics Benchmarks Database LiteLLM
accuracy MMLU SQLite OpenAI
latency TruthfulQA Anthropic
cost Custom CSV Google
hallucin. Mistral
reasoning Together
```
## Install
```bash
# pip
pip install llm-evaluation-framework
# With extras
pip install "llm-evaluation-framework[dashboard,reports,dev]"
# Docker
docker-compose up -d
```
## License
MIT — free for research and commercial use.
## Citation
```bibtex
@software{vigneshwar234_llm_eval_2025,
author = {Vigneshwar S},
title = {LLM Evaluation Framework},
year = {2025},
url = {https://github.com/vignesh2027/LLM-Evaluation-Framework},
license = {MIT}
}
```