---
title: Local First Education Data Framework
emoji: ⚡
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 6.18.0
python_version: '3.12'
app_file: app.py
pinned: false
short_description: Local First Education Data Analytics for school admins
tags:
- track:backyard
- sponsor:modal
- achievement:offgrid
- achievement:welltuned
- achievement:offbrand
- achievement:fieldnotes
- text-to-sql
- education
- local-first
- duckdb
- gradio
- qwen
- lora
---
# 🏫 Local First Education Data Framework (LFED)
**Local-First Education Data**: ask questions about your district in plain English and get answers instantly. Designed so all inference can run on your own machine. No data ever leaves it.
> Built for the **HF Build Small Hackathon** (Chapter One: Backyard AI)
>
### Demo Video: https://youtu.be/cE0yp4qmFIA
### Social posts:
1. https://huggingface.co/posts/Kasualdad/259252451483236
2. https://www.linkedin.com/posts/franklucido_buildsmallhackathon-backyardai-huggingface-share-7472321525183066112-BowY/?utm_source=share&utm_medium=member_desktop&rcm=ACoAAABE0CIBRwffLN1_r2lGuenCphXxAuED7jE
3. https://www.lucidotechnologyconsulting.com/blog/BuildingLFEDS
---
## Two Deployment Flavors
| | This Space (ZeroGPU) | Local-first (Mac/on-prem) |
|---|---|---|
| Inference | transformers + PEFT, bnb-4bit base + LoRA adapter | llama.cpp + GGUF (Metal/CPU) |
| Model | `unsloth/qwen2.5-coder-14b-instruct-bnb-4bit` + [`lfed-qwen2.5-coder-14b-sql-lora`](https://huggingface.co/build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora) | [`lfed-qwen2.5-coder-14b-sql-gguf`](https://huggingface.co/build-small-hackathon/lfed-qwen2.5-coder-14b-sql-gguf) (Q4_K_M) |
| Where | `main` branch | tag `local-llamacpp-v1` / `product` branch |
| Why | ZeroGPU's CUDA is PyTorch-only, so llama.cpp can't use it (see `DEPLOY.md`) | Full local inference, zero cloud; runs on any Mac with Metal |
Both run the **same fine-tune**: QLoRA (r=32) on 27,859 NL→SQL pairs. The bnb-4bit + LoRA combination on the Space is the exact configuration the model was trained in.
---
## 🏅 Hackathon Badges
| Badge | Status | How |
|---|---|---|
| **Off the Grid** | ✅ | Local-first build runs entirely via llama.cpp + local GGUF. No API calls. No cloud. |
| **Well-Tuned** | ✅ | Fine-tuned Qwen2.5-Coder-14B on 27,859 synthetic NL→SQL pairs via Unsloth QLoRA on Modal A10G. |
| **Llama Champion** | ✅ | llama.cpp is the local inference backend (Q4_K_M GGUF, streaming generation). |
| **Off-Brand** | ✅ | Custom ResearchMono theme: IBM Plex Sans/Mono, IBM-blue accent, WCAG AA. |
---
## 🎯 What It Does
A school district admin (principal, superintendent, department head) types a question:
> *"What's the average GPA for chronically absent students vs non-chronic students in 2023-2024?"*
Local First Education Data Framework:
1. Sends the question + schema context to the fine-tuned LLM
2. **Streams** the generated SQL into the UI token by token
3. Validates the SQL against the actual schema (read-only, column checks, forbidden tokens)
4. Executes it on an in-memory DuckDB seeded from deterministic Parquet files
5. Returns the results as a plain-English sentence, a table, and an optional CSV download
6. Lets the user inspect the generated SQL with **"Show me how this was computed"**
No API keys. No data exfiltration.
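Step 3 (validation) can be illustrated with a minimal sketch. It is not the guard in `data_engine.py`: the `SCHEMA` subset, the `FORBIDDEN` deny-list, and the `validate_sql` name are assumptions made for illustration, and this version checks statement type, forbidden tokens, and table names only (no column checks, no CTE handling).

```python
import re

# Hypothetical schema subset, copied from the Data Schema section.
SCHEMA = {
    "attendance": {"student_id", "school_name", "school_year",
                   "absence_count", "is_chronically_absent"},
    "enrollment": {"school_year", "school_name", "grade_level", "student_count"},
}

# Assumed deny-list; the real guard's list may differ.
FORBIDDEN = {"insert", "update", "delete", "drop", "alter", "create",
             "attach", "copy", "pragma", "install", "load"}


def validate_sql(sql: str) -> list[str]:
    """Return a list of problems; an empty list means the query passed."""
    problems = []
    statement = sql.strip().rstrip(";").strip()
    lowered = statement.lower()
    # A semicolon left after trimming means more than one statement.
    if ";" in statement:
        problems.append("multiple statements are not allowed")
    words = re.findall(r"[a-z_][a-z0-9_]*", lowered)
    # Read-only: the statement must start with SELECT or WITH.
    if not words or words[0] not in {"select", "with"}:
        problems.append("only SELECT queries are allowed")
    for word in words:
        if word in FORBIDDEN:
            problems.append(f"forbidden token: {word}")
    # Every table named after FROM or JOIN must exist in the schema.
    for table in re.findall(r"\b(?:from|join)\s+([a-z_][a-z0-9_]*)", lowered):
        if table not in SCHEMA:
            problems.append(f"unknown table: {table}")
    return problems
```

A token scan like this is deliberately conservative: it also rejects a forbidden word that appears inside a string literal, which is an acceptable trade-off for a read-only analytics tool.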
### Current UI Features
- **Domain starter questions**: three dropdowns (attendance, grades, discipline/enrollment) that fill common questions into the input.
- **First-time info modal**: clicking "First time here?" opens an overlay with intro, how-it-works, FAQ, and privacy notes.
- **SQL disclosure**: every result includes the generated DuckDB query for transparency.
- **CSV download**: download any result table as a timestamped CSV.
- **Footer explainer**: a collapsible "How this works / Your privacy" section at the bottom.
- **Previous-answer ribbon**: a subtle reminder of the most recent successful question.
- **Light, clean theme**: white/gray palette with dark text and IBM Plex typography.
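The timestamped CSV download could be produced along these lines. This is a sketch using only the standard library; the `write_result_csv` helper and the `lfed_result_YYYYMMDD_HHMMSS.csv` naming scheme are assumptions, not the app's actual code.

```python
import csv
from datetime import datetime
from pathlib import Path


def write_result_csv(columns, rows, out_dir, now=None):
    """Write a result table to a timestamped CSV and return its path."""
    now = now or datetime.now()
    # Hypothetical naming scheme: lfed_result_YYYYMMDD_HHMMSS.csv
    path = Path(out_dir) / f"lfed_result_{now:%Y%m%d_%H%M%S}.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)   # header row
        writer.writerows(rows)     # one line per result row
    return path
```

Passing `now` explicitly keeps the file name predictable in tests.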
---
## Architecture
```mermaid
flowchart TD
U[👤 School Admin] -->|natural language| UI[Gradio UI]
UI -->|question + schema| LLM[model_inference.py]
LLM -->|"Space: transformers + LoRA (bnb-4bit)<br/>Local: llama.cpp GGUF"| MODEL[Qwen2.5-Coder-14B<br/>fine-tuned text-to-SQL]
MODEL -->|streamed SQL| GUARD[data_engine.py]
GUARD -->|extract → validate| DUCK[DuckDB in-memory<br/>seeded from Parquet]
DUCK -->|dataframe| UI
UI -->|SQL + table + CSV| U
subgraph Training [Offline Fine-Tuning - modal_train/]
SYNTH[generate_synthetic_v2.py<br/>27,859 NL→SQL pairs]
TRAIN[train_v2.py<br/>Unsloth QLoRA r=32 on A10G]
EXPORT[export_gguf_v2.py<br/>merge → GGUF → HF Hub]
DATASET[[`lfed-training-data`<br/>published dataset]]
SYNTH --> TRAIN --> EXPORT
TRAIN --> DATASET
end
EXPORT -.->|GGUF + LoRA adapter| MODEL
DATASET -.->|training data| MODEL
```
---
## Data Schema
Deterministic seed data (committed as Parquet, byte-reproducible): **5 schools × 4 school years**, ~2,900 students, 15% chronic absenteeism, 178K total rows across 5 tables.
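Determinism of this kind usually comes from a privately seeded random generator. The sketch below shows the idea only; `generate_attendance`, its columns, and the value ranges are hypothetical and are not taken from `data/generate_seed.py`.

```python
import random

SCHOOLS = ["Lincoln Elementary", "Washington Middle", "Jefferson High",
           "Roosevelt Academy", "Kennedy Prep"]


def generate_attendance(seed: int, n_students: int = 5) -> list[dict]:
    """Produce the same rows every time for a given seed."""
    rng = random.Random(seed)  # private RNG, unaffected by global random state
    rows = []
    for student_id in range(1, n_students + 1):
        rows.append({
            "student_id": student_id,
            "school_name": rng.choice(SCHOOLS),
            "absence_count": rng.randint(0, 30),  # illustrative range
        })
    return rows
```

Calling `generate_attendance(42)` twice yields identical lists, which is the property that lets exported Parquet files be regenerated and compared.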
### `enrollment`
| Column | Type |
|---|---|
| `school_year` | VARCHAR (`'YYYY-YYYY'`) |
| `school_name` | VARCHAR |
| `grade_level` | INTEGER (K=0 … 12) |
| `student_count` | INTEGER |
### `attendance`
| Column | Type |
|---|---|
| `student_id` | INTEGER |
| `school_name` | VARCHAR |
| `school_year` | VARCHAR |
| `absence_count` | INTEGER |
| `is_chronically_absent` | BOOLEAN (≥10% of school days missed) |
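The chronic-absence flag follows directly from the 10% rule in the table. A minimal sketch, assuming a 180-day school year (the README does not state the number of school days, so `school_days` is a labeled assumption):

```python
def is_chronically_absent(absence_count: int, school_days: int = 180) -> bool:
    """True when at least 10% of school days were missed."""
    if school_days <= 0:
        raise ValueError("school_days must be positive")
    # Integer comparison avoids floating-point edge cases at exactly 10%.
    return absence_count * 10 >= school_days
```

Under the 180-day assumption the threshold falls at 18 absences: 18 is flagged, 17 is not.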
### `students`
| Column | Type |
|---|---|
| `student_id` | INTEGER |
| `school_name` | VARCHAR |
| `grade_level` | INTEGER |
| `gender`, `race_ethnicity` | VARCHAR |
| `english_learner`, `special_education`, `economically_disadvantaged` | BOOLEAN |
### `discipline`
| Column | Type |
|---|---|
| `incident_id`, `student_id` | INTEGER |
| `school_name`, `school_year` | VARCHAR |
| `grade_level` | INTEGER |
| `incident_type`, `severity`, `action_taken` | VARCHAR |
| `incident_date` | DATE |
| `days_suspended` | INTEGER |
### `grades`
| Column | Type |
|---|---|
| `student_id` | INTEGER |
| `school_name`, `school_year` | VARCHAR |
| `grade_level` | INTEGER |
| `course_name`, `term`, `letter_grade` | VARCHAR |
| `grade_numeric`, `gpa` | DOUBLE |
### Schools
| School | Grades |
|---|---|
| Lincoln Elementary | K–5 |
| Washington Middle | 6–8 |
| Jefferson High | 9–12 |
| Roosevelt Academy | K–8 |
| Kennedy Prep | 6–12 |
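Because `grade_level` is stored as an integer with kindergarten as 0, a grade span from the table above maps to a simple integer range. The `grade_levels` helper is hypothetical and shown only to make the encoding concrete:

```python
def _to_level(label: str) -> int:
    """Map a grade label to its integer level (K = 0)."""
    label = label.strip().upper()
    return 0 if label == "K" else int(label)


def grade_levels(span: str) -> list[int]:
    """Expand a span such as 'K-8' into integer grade levels."""
    # Accept either an en dash (U+2013) or a plain hyphen as the separator.
    low, high = span.replace("\u2013", "-").split("-")
    return list(range(_to_level(low), _to_level(high) + 1))
```

For example, `grade_levels("K-5")` gives the levels 0 through 5 that a filter on Lincoln Elementary would match.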
---
## How to Run
### On this Space
Nothing to do: ask a question or pick a starter dropdown. The first query after a cold start takes longer (ZeroGPU attaches a GPU and restores ~10.5 GB of packed weights).
### Locally (the local-first version)
The `main` branch targets CUDA (bitsandbytes requires it). For Mac/CPU local use, start from the llama.cpp version:
```bash
git checkout -b product local-llamacpp-v1 # or: git checkout product
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt # includes llama-cpp-python (Metal on macOS)
python app.py # downloads the GGUF on first run
```
Open **http://localhost:7860**.
---
## Fine-Tuning Pipeline (v2)
The Modal training pipeline lives in `modal_train/`:
```bash
pip install modal
modal secret create huggingface-secret HF_TOKEN=hf_your_token_here
modal run modal_train/modal_train_v2.py   # QLoRA train → merge → GGUF → push
```
| Script | What it does |
|---|---|
| `generate_synthetic_v2.py` | Builds the 27,859-pair NL→SQL dataset (templates + Gretel + rephrasing) |
| `train_v2.py` | Unsloth QLoRA on Qwen2.5-Coder-14B (r=32, α=32, 4-bit, 2 epochs, lr=1e-4, A10G) |
| `export_gguf_v2.py` | Merges LoRA → GGUF Q4_K_M → pushes to HF Hub |
| `modal_train_v2.py` | Modal orchestration; adapter persisted to the `lfed-training-data` volume |
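The hyperparameters listed for `train_v2.py` can be collected in one place as below. This is a plain-Python sketch, not the Unsloth or PEFT configuration object; the `QLoRAConfig` class and its field names are hypothetical. The `scaling` property assumes standard LoRA scaling (α / r), which is 1.0 when r = α = 32.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QLoRAConfig:
    """Hypothetical container for the values in the table above."""
    base_model: str = "Qwen2.5-Coder-14B"
    rank: int = 32               # r
    alpha: int = 32              # LoRA alpha
    load_in_4bit: bool = True    # QLoRA: 4-bit quantized base weights
    epochs: int = 2
    learning_rate: float = 1e-4

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.alpha / self.rank
```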
Published artifacts:
- LoRA adapter: [`build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora`](https://huggingface.co/build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora)
- GGUF: [`build-small-hackathon/lfed-qwen2.5-coder-14b-sql-gguf`](https://huggingface.co/build-small-hackathon/lfed-qwen2.5-coder-14b-sql-gguf)
- Training dataset (25,886 pairs): [`build-small-hackathon/lfed-training-data`](https://huggingface.co/datasets/build-small-hackathon/lfed-training-data)
---
## 🧪 Tests
```bash
pytest tests/ -v
```
81 tests covering the execution guard (SQL injection, forbidden tokens, schema validation), data engine (isolation, seed integrity, timeout), and model inference (prompt assembly, streaming, singleton caching, JSON parsing). Model calls are mocked, so the suite runs anywhere in ~1s.
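A mocked-model test in this style might look like the following sketch. The `collect_sql` helper and the `stream` method are hypothetical stand-ins, not the real `model_inference.py` API; the point is only that no model weights are loaded.

```python
from unittest.mock import MagicMock


def collect_sql(model, question: str) -> str:
    """Hypothetical helper: join streamed tokens into one SQL string."""
    return "".join(model.stream(question)).strip()


def test_collect_sql_joins_streamed_tokens():
    # The mock replaces the LLM, so the test needs no GPU or download.
    model = MagicMock()
    model.stream.return_value = iter(["SELECT ", "COUNT(*) ", "FROM students", "\n"])
    assert collect_sql(model, "How many students?") == "SELECT COUNT(*) FROM students"
    model.stream.assert_called_once_with("How many students?")
```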
---
## Project Structure
```
Kasualdad_LFED/
├── app.py                  # Gradio UI (thin controller, streaming, @spaces.GPU)
├── model_inference.py      # transformers + PEFT wrapper (llama.cpp-compatible API)
├── data_engine.py          # DuckDB lifecycle, execution guard, timeout
├── prompts.py              # System prompt, 5-table schema docs, 4 few-shot examples
├── ui_strings.py           # All user-facing copy (titles, nudges, FAQs, examples)
├── data/
│   ├── generate_seed.py    # Deterministic seed generator (5 tables)
│   ├── export_parquet.py   # Seed → Parquet exporter
│   └── *.parquet           # 5 committed seed files (LFS, ~260 KB total)
├── tests/                  # 81 pytest tests
├── modal_train/            # v2 fine-tuning pipeline (Modal + Unsloth)
├── docs/
│   ├── SPEC_query-history-dashboards.md   # Next feature spec (draft)
│   └── …                   # plans, handoff, training playbook
├── DEPLOY.md               # ZeroGPU deployment war story + resolution
├── evaluation_queries.md   # 15 real-world evaluation queries (not pushed to Space)
├── requirements.txt
└── README.md
```
Branches & tags:
- `main`: this Space (transformers + LoRA on ZeroGPU)
- `local-llamacpp-v1` (tag) / `product` (branch): the llama.cpp local-first base
---
## 🎨 Design (Off-Brand)
**ResearchMono** theme built on `gr.themes.Soft`:
| Token | Value | Usage |
|---|---|---|
| Background | `#F2F4F8` (light) / `#121619` (dark) | Page |
| Surface | `#ffffff` | Cards, inputs |
| Text | `#21272A` | Primary text |
| Accent | `#4589FF` (IBM blue) | Primary actions, links |
| Accent hover | `#2C6FDD` | Button hover |
| Border | `#DDE1E6` | Subtle borders |
- **Typography:** IBM Plex Sans (UI) + IBM Plex Mono (SQL/code)
- **Layout:** Single column with three domain dropdowns, result region, and footer explainer
- **Accessibility:** WCAG AA contrast, `:focus-visible` rings, `prefers-reduced-motion` support, color never the sole state indicator
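The contrast claim can be checked against the palette with the WCAG 2.x contrast-ratio formula. The sketch below is an independent check rather than part of the theme code; the helper names are illustrative. WCAG AA requires at least 4.5:1 for normal-size text.

```python
def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light."""
    c = channel / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#RRGGBB' color."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(fg: str, bg: str) -> float:
    """Contrast ratio between two colors, from 1.0 to 21.0."""
    lum_a, lum_b = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)
```

For the primary text color on the light page background, `contrast_ratio("#21272A", "#F2F4F8") >= 4.5` evaluates to `True`.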
---
## 👤 Author
**Frank Lucido**, [Lucido Technology Consulting](https://www.lucidotechnologyconsulting.com/)
Building local-first data infrastructure for California public schools.
[💼 LinkedIn](https://www.linkedin.com/in/franklucido) · [GitHub](https://github.com/flucido) · [🤗 Hugging Face](https://huggingface.co/Kasualdad)
---
## License
Apache 2.0