---
title: Local First Education Data Framework
emoji: ⚡
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 6.18.0
python_version: '3.12'
app_file: app.py
pinned: false
short_description: Local First Education Data Analytics for school admins
tags:
  - track:backyard
  - sponsor:modal
  - achievement:offgrid
  - achievement:welltuned
  - achievement:offbrand
  - achievement:fieldnotes
  - text-to-sql
  - education
  - local-first
  - duckdb
  - gradio
  - qwen
  - lora
---

# 🏫 Local First Education Data Framework (LFED)

**Local-First Education Data**: ask questions about your district in plain English, get answers instantly. Designed so all inference can run on your own machine. No data ever leaves.

> 🏆 Built for the **HF Build Small Hackathon** (Chapter One: Backyard AI)

### Demo Video: https://youtu.be/cE0yp4qmFIA

### Social posts:

1. https://huggingface.co/posts/Kasualdad/259252451483236
2. https://www.linkedin.com/posts/franklucido_buildsmallhackathon-backyardai-huggingface-share-7472321525183066112-BowY/?utm_source=share&utm_medium=member_desktop&rcm=ACoAAABE0CIBRwffLN1_r2lGuenCphXxAuED7jE
3. https://www.lucidotechnologyconsulting.com/blog/BuildingLFEDS

---

## Two Deployment Flavors

| | This Space (ZeroGPU) | Local-first (Mac/on-prem) |
|---|---|---|
| Inference | transformers + PEFT, bnb-4bit base + LoRA adapter | llama.cpp + GGUF (Metal/CPU) |
| Model | `unsloth/qwen2.5-coder-14b-instruct-bnb-4bit` + [`lfed-qwen2.5-coder-14b-sql-lora`](https://huggingface.co/build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora) | [`lfed-qwen2.5-coder-14b-sql-gguf`](https://huggingface.co/build-small-hackathon/lfed-qwen2.5-coder-14b-sql-gguf) (Q4_K_M) |
| Where | `main` branch | tag `local-llamacpp-v1` / `product` branch |
| Why | ZeroGPU's CUDA is PyTorch-only, so llama.cpp can't use it (see `DEPLOY.md`) | Full local inference, zero cloud; runs on any Mac with Metal |

Both run the **same fine-tune**: QLoRA (r=32) on 27,859 NL→SQL pairs.
The bnb-4bit + LoRA combination on the Space is the exact configuration the model was trained in.

---

## 🏅 Hackathon Badges

| Badge | Status | How |
|---|---|---|
| **Off the Grid** | ✅ | Local-first build runs entirely via llama.cpp + local GGUF. No API calls. No cloud. |
| **Well-Tuned** | ✅ | Fine-tuned Qwen2.5-Coder-14B on 27,859 synthetic NL→SQL pairs via Unsloth QLoRA on Modal A10G. |
| **Llama Champion** | ✅ | llama.cpp is the local inference backend (Q4_K_M GGUF, streaming generation). |
| **Off-Brand** | ✅ | Custom ResearchMono theme: IBM Plex Sans/Mono, IBM-blue accent, WCAG AA. |

---

## 🎯 What It Does

A school district admin (principal, superintendent, department head) types a question:

> *"What's the average GPA for chronically absent students vs non-chronic students in 2023-2024?"*

Local First Education Data Framework:

1. Sends the question + schema context to the fine-tuned LLM
2. **Streams** the generated SQL into the UI token by token
3. Validates the SQL against the actual schema (read-only, column checks, forbidden tokens)
4. Executes it on an in-memory DuckDB seeded from deterministic Parquet files
5. Returns the results as a plain-English sentence, a table, and an optional CSV download
6. Lets the user inspect the generated SQL with **"Show me how this was computed"**

No API keys. No data exfiltration.

### Current UI Features

- **Domain starter questions**: three dropdowns (attendance, grades, discipline/enrollment) that fill common questions into the input.
- **First-time info modal**: clicking "First time here?" opens an overlay with intro, how-it-works, FAQ, and privacy notes.
- **SQL disclosure**: every result includes the generated DuckDB query for transparency.
- **CSV download**: download any result table as a timestamped CSV.
- **Footer explainer**: a collapsible "How this works / Your privacy" section at the bottom.
- **Previous-answer ribbon**: a subtle reminder of the most recent successful question.
- **Light, clean theme**: white/gray palette with dark text and IBM Plex typography.

---

## 🏗 Architecture

```mermaid
flowchart TD
    U[👤 School Admin] -->|natural language| UI[Gradio UI]
    UI -->|question + schema| LLM[model_inference.py]
    LLM -->|"Space: transformers + LoRA (bnb-4bit)<br/>Local: llama.cpp GGUF"| MODEL[Qwen2.5-Coder-14B<br/>fine-tuned text-to-SQL]
    MODEL -->|streamed SQL| GUARD[data_engine.py]
    GUARD -->|extract → validate| DUCK[DuckDB in-memory<br/>seeded from Parquet]
    DUCK -->|dataframe| UI
    UI -->|SQL + table + CSV| U

    subgraph Training ["Offline Fine-Tuning (modal_train/)"]
        SYNTH[generate_synthetic_v2.py<br/>27,859 NL→SQL pairs]
        TRAIN[train_v2.py<br/>Unsloth QLoRA r=32 on A10G]
        EXPORT[export_gguf_v2.py<br/>merge → GGUF → HF Hub]
        DATASET[[`lfed-training-data`<br/>published dataset]]
        SYNTH --> TRAIN --> EXPORT
        TRAIN --> DATASET
    end

    EXPORT -.->|GGUF + LoRA adapter| MODEL
    DATASET -.->|training data| MODEL
```

---

## 📊 Data Schema

Deterministic seed data (committed as Parquet, byte-reproducible): **5 schools × 4 school years**, ~2,900 students, 15% chronic absenteeism, 178K total rows across 5 tables.

### `enrollment`

| Column | Type |
|---|---|
| `school_year` | VARCHAR (`'YYYY-YYYY'`) |
| `school_name` | VARCHAR |
| `grade_level` | INTEGER (K=0 … 12) |
| `student_count` | INTEGER |

### `attendance`

| Column | Type |
|---|---|
| `student_id` | INTEGER |
| `school_name` | VARCHAR |
| `school_year` | VARCHAR |
| `absence_count` | INTEGER |
| `is_chronically_absent` | BOOLEAN (≥10% of school days missed) |

### `students`

| Column | Type |
|---|---|
| `student_id` | INTEGER |
| `school_name` | VARCHAR |
| `grade_level` | INTEGER |
| `gender`, `race_ethnicity` | VARCHAR |
| `english_learner`, `special_education`, `economically_disadvantaged` | BOOLEAN |

### `discipline`

| Column | Type |
|---|---|
| `incident_id`, `student_id` | INTEGER |
| `school_name`, `school_year` | VARCHAR |
| `grade_level` | INTEGER |
| `incident_type`, `severity`, `action_taken` | VARCHAR |
| `incident_date` | DATE |
| `days_suspended` | INTEGER |

### `grades`

| Column | Type |
|---|---|
| `student_id` | INTEGER |
| `school_name`, `school_year` | VARCHAR |
| `grade_level` | INTEGER |
| `course_name`, `term`, `letter_grade` | VARCHAR |
| `grade_numeric`, `gpa` | DOUBLE |

### Schools

| School | Grades |
|---|---|
| Lincoln Elementary | K–5 |
| Washington Middle | 6–8 |
| Jefferson High | 9–12 |
| Roosevelt Academy | K–8 |
| Kennedy Prep | 6–12 |

---

## 🚀 How to Run

### On this Space

Nothing to do: ask a question or pick a starter dropdown. First query after a cold start takes longer (ZeroGPU attaches a GPU and restores ~10.5 GB of packed weights).
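
Only the first query pays the cold-start cost, which is what you would expect when the model is loaded lazily and then cached for the life of the process (the test suite mentions singleton caching). Below is a minimal sketch of that pattern. The names `_load_model` and `get_model` are hypothetical and the loader is a stub; the actual `model_inference.py` may be organized differently.

```python
import threading
import time

_model = None                 # cached instance, shared by all callers
_lock = threading.Lock()      # guards the first load against concurrent requests
load_count = 0                # demonstration only: counts how often we really load


def _load_model():
    """Hypothetical stand-in for the expensive weight load."""
    global load_count
    load_count += 1
    time.sleep(0.05)          # pretend this is the slow part
    return {"name": "stub-model"}


def get_model():
    """Return the cached model, loading it on first use only."""
    global _model
    if _model is None:
        with _lock:
            # Re-check inside the lock: another thread may have finished loading.
            if _model is None:
                _model = _load_model()
    return _model


first = get_model()           # slow: triggers the load
second = get_model()          # fast: returns the cached object
```

The double check around the lock keeps two simultaneous first requests from loading the weights twice.
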

### Locally (the local-first version)

The `main` branch targets CUDA (bitsandbytes requires it). For Mac/CPU local use, start from the llama.cpp version:

```bash
git checkout -b product local-llamacpp-v1   # or: git checkout product
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt             # includes llama-cpp-python (Metal on macOS)
python app.py                               # downloads the GGUF on first run
```

Open **http://localhost:7860**.

---

## 🔧 Fine-Tuning Pipeline (v2)

The Modal training pipeline lives in `modal_train/`:

```bash
pip install modal
modal secret create huggingface-secret HF_TOKEN=hf_your_token_here
modal run modal_train/modal_train_v2.py     # QLoRA train → merge → GGUF → push
```

| Script | What it does |
|---|---|
| `generate_synthetic_v2.py` | Builds the 27,859-pair NL→SQL dataset (templates + Gretel + rephrasing) |
| `train_v2.py` | Unsloth QLoRA on Qwen2.5-Coder-14B (r=32, α=32, 4-bit, 2 epochs, lr=1e-4, A10G) |
| `export_gguf_v2.py` | Merges LoRA → GGUF Q4_K_M → pushes to HF Hub |
| `modal_train_v2.py` | Modal orchestration; adapter persisted to the `lfed-training-data` volume |

Published artifacts:

- LoRA adapter: [`build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora`](https://huggingface.co/build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora)
- GGUF: [`build-small-hackathon/lfed-qwen2.5-coder-14b-sql-gguf`](https://huggingface.co/build-small-hackathon/lfed-qwen2.5-coder-14b-sql-gguf)
- Training dataset (25,886 pairs): [`build-small-hackathon/lfed-training-data`](https://huggingface.co/datasets/build-small-hackathon/lfed-training-data)

---

## 🧪 Tests

```bash
pytest tests/ -v
```

81 tests covering the execution guard (SQL injection, forbidden tokens, schema validation), data engine (isolation, seed integrity, timeout), and model inference (prompt assembly, streaming, singleton caching, JSON parsing). Model calls are mocked, so the suite runs anywhere in ~1s.
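
To give a feel for what the execution-guard tests exercise, here is a minimal sketch of a read-only SQL check. It is an illustration under stated assumptions, not the code in `data_engine.py`: the `SCHEMA` map is abbreviated to two of the documented tables, the `FORBIDDEN` list and the `validate_sql` name are hypothetical, and column-level checks are left out.

```python
import re

# Assumption: abbreviated copy of two tables from the Data Schema section.
SCHEMA = {
    "enrollment": {"school_year", "school_name", "grade_level", "student_count"},
    "attendance": {"student_id", "school_name", "school_year",
                   "absence_count", "is_chronically_absent"},
}

# Assumption: an illustrative deny-list, not the project's actual one.
FORBIDDEN = {"insert", "update", "delete", "drop", "alter", "create",
             "attach", "copy", "pragma", "install"}


def validate_sql(sql: str) -> tuple[bool, str]:
    """Return (ok, reason). Accepts one read-only SELECT/WITH over known tables.

    Deliberately crude: it scans keywords with regexes, so a forbidden word
    inside a string literal is also rejected. A real guard would parse the SQL.
    """
    stripped = sql.strip().rstrip(";").strip()
    if ";" in stripped:
        return False, "multiple statements"
    lowered = stripped.lower()
    tokens = re.findall(r"[a-z_]+", lowered)
    if not tokens or tokens[0] not in {"select", "with"}:
        return False, "not a SELECT"
    bad = FORBIDDEN.intersection(tokens)
    if bad:
        return False, f"forbidden token: {sorted(bad)[0]}"
    # Names introduced by CTEs ("WITH x AS (...)") count as known tables.
    cte_names = set(re.findall(r"\b([a-z_][a-z0-9_]*)\s+as\s*\(", lowered))
    tables = re.findall(r"\b(?:from|join)\s+([a-z_][a-z0-9_]*)", lowered)
    unknown = [t for t in tables if t not in SCHEMA and t not in cte_names]
    if unknown:
        return False, f"unknown table: {unknown[0]}"
    return True, "ok"


ok, reason = validate_sql(
    "SELECT school_name, SUM(student_count) FROM enrollment GROUP BY school_name;"
)
```

A query that passes a check like this would then be handed to the in-memory DuckDB connection; anything else is rejected before execution.
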

---

## 📁 Project Structure

```
Kasualdad_LFED/
├── app.py                 # Gradio UI (thin controller, streaming, @spaces.GPU)
├── model_inference.py     # transformers + PEFT wrapper (llama.cpp-compatible API)
├── data_engine.py         # DuckDB lifecycle, execution guard, timeout
├── prompts.py             # System prompt, 5-table schema docs, 4 few-shot examples
├── ui_strings.py          # All user-facing copy (titles, nudges, FAQs, examples)
├── data/
│   ├── generate_seed.py   # Deterministic seed generator (5 tables)
│   ├── export_parquet.py  # Seed → Parquet exporter
│   └── *.parquet          # 5 committed seed files (LFS, ~260 KB total)
├── tests/                 # 81 pytest tests
├── modal_train/           # v2 fine-tuning pipeline (Modal + Unsloth)
├── docs/
│   ├── SPEC_query-history-dashboards.md  # Next feature spec (draft)
│   └── …                  # plans, handoff, training playbook
├── DEPLOY.md              # ZeroGPU deployment war story + resolution
├── evaluation_queries.md  # 15 real-world evaluation queries (not pushed to Space)
├── requirements.txt
└── README.md
```

Branches & tags:

- `main`: this Space (transformers + LoRA on ZeroGPU)
- `local-llamacpp-v1` (tag) / `product` (branch): the llama.cpp local-first base

---

## 🎨 Design (Off-Brand)

**ResearchMono** theme built on `gr.themes.Soft`:

| Token | Value | Usage |
|---|---|---|
| Background | `#F2F4F8` (light) / `#121619` (dark) | Page |
| Surface | `#ffffff` | Cards, inputs |
| Text | `#21272A` | Primary text |
| Accent | `#4589FF` (IBM blue) | Primary actions, links |
| Accent hover | `#2C6FDD` | Button hover |
| Border | `#DDE1E6` | Subtle borders |

- **Typography:** IBM Plex Sans (UI) + IBM Plex Mono (SQL/code)
- **Layout:** Single column with three domain dropdowns, result region, and footer explainer
- **Accessibility:** WCAG AA contrast, `:focus-visible` rings, `prefers-reduced-motion` support, color never the sole state indicator

---

## 👤 Author

**Frank Lucido**, [Lucido Technology Consulting](https://www.lucidotechnologyconsulting.com/)

Building local-first data infrastructure for California public schools.

[💼 LinkedIn](https://www.linkedin.com/in/franklucido) · [🐙 GitHub](https://github.com/flucido) · [🤗 Hugging Face](https://huggingface.co/Kasualdad)

---

## 📝 License

Apache 2.0