# Local First Education Data Framework — Project Knowledge Base > **Last updated:** 2026-06-14 (Session 7 — UI polish, CSV download, about modal, percentage fix, rebrand) > **Target:** HF Build Small Hackathon — "Backyard AI" chapter > **Space:** https://huggingface.co/spaces/build-small-hackathon/Kasualdad_LFED > **Remote:** `git@hf.co:spaces/build-small-hackathon/Kasualdad_LFED` (remote name: `space`) Note: the Space slug remains `Kasualdad_LFED` for HF URL stability; the product name is now **Local First Education Data Framework (LFED)**. --- ## Current State (read this first) `main` is the live HF Space demo. It runs **transformers + PEFT** on ZeroGPU: a bnb-4-bit base (`unsloth/qwen2.5-coder-14b-instruct-bnb-4bit`) plus the fine-tuned LoRA adapter (`build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora`). The local-first product path lives on `product` (branch) / `local-llamacpp-v1` (tag) and uses **llama.cpp + GGUF**. The `main` branch and the `product` branch share the same fine-tuned weights but use different inference engines because ZeroGPU only exposes PyTorch CUDA. ### Recent changes (2026-06-14) - **UI polish complete** — removed dark wrapper backgrounds, fixed dataframe light theme, darkened labels, aligned the three domain dropdown columns. - **"Show me how this was computed" SQL disclosure** — added below each result table. - **CSV download** — result tables can be downloaded as `kasualdad_result_YYYYMMDD_HHMMSS.csv`. - **About / FAQ modal** — "First time here?" opens an overlay with intro, how-it-works, what it is/isn't, FAQ, and privacy. - **"Bring it back" button hidden** — logic preserved but button is always invisible; previous-answer ribbon text remains. - **Percentage formatting fixed** — `format_result_df()` no longer double-multiplies values that are already returned as percentages. - **Evaluation query bank** — `evaluation_queries.md` added to the repo for local testing (not pushed to the Space). --- ## What It Does A Gradio app that lets school district staff ask plain-English questions about student data. A fine-tuned LLM (Qwen2.5-Coder-14B, QLoRA on 27,859 NL→SQL pairs) generates DuckDB SQL, which is validated and executed on in-memory seed data. Results are returned as a plain-English summary, a table, an optional CSV download, and the generated SQL. On the `product` branch nothing leaves the machine; the Space demo runs the same fine-tune on ZeroGPU. --- ## Architecture & Data Flow ``` User types question or picks a starter dropdown ↓ app.py → on_submit() → handle_query() [@spaces.GPU on ZeroGPU] ↓ model_inference.py → build_prompt() + TransformersLLM (bnb-4bit base + LoRA) ↓ streamed SQL tokens (llama.cpp-compatible response schema) data_engine.py → extract_sql() → validate_sql() → execute_safe() ↓ pandas DataFrame app.py → renders summary, table, CSV link, and SQL disclosure ``` **Key design rule:** `app.py` is a thin controller. All logic lives in the engine modules. ### UI Layout (current) 1. Header — title + tagline 2. Input box + "Get answer" button 3. Three domain starter dropdowns (attendance, grades, discipline/enrollment) 4. First-time nudge with modal link 5. Result region — previous-answer ribbon, summary, dataframe, CSV download, SQL disclosure 6. Footer — what this is / what it isn't + "Read the full explainer" 7. Hidden about/FAQ modal overlay --- ## File Map | File | Purpose | Lines (approx.) | |---|---|---| | `app.py` | Gradio UI, streaming `handle_query` generator, `@spaces.GPU`, Parquet bootstrap, custom CSS, about modal, CSV download | ~1,250 | | `model_inference.py` | `TransformersLLM` wrapper (bnb-4bit base + LoRA), SQL generation + streaming | ~270 | | `data_engine.py` | DuckDB lifecycle, Parquet loading, SQL extraction/validation/execution | ~280 | | `prompts.py` | System prompt, 5-table schema docs, 4 few-shot examples, prompt assembler | ~190 | | `ui_strings.py` | All user-facing copy: titles, nudges, domain starter questions, about/FAQ content | ~230 | | `data/generate_seed.py` | Deterministic generator: ~2,900 students across 5 tables | ~430 | | `data/export_parquet.py` | One-shot script: seed → 5 Parquet files | ~70 | | `data/*.parquet` | 5 committed seed files via LFS (~260 KB total, byte-deterministic) | — | | `requirements.txt` | Pinned Python deps | 9 lines | | `README.md` | Public Space README | ~280 | | `PROJECT.md` | This file — project knowledge base | ~500 | | `evaluation_queries.md` | 15 real-world evaluation queries for manual testing | ~60 | | `docs/HANDOFF.md` | Developer session handoff doc | ~190 | | `docs/PLAN.md` | Original build plan | ~170 | | `docs/SPEC_query-history-dashboards.md` | Draft spec for next features | — | | `tests/` | pytest suite (81 tests) | 4 files | | `modal_train/` | Modal fine-tuning pipeline | 5+ files | --- ## Model & Inference ### Active Model (`main` — Space) - **Base:** `unsloth/qwen2.5-coder-14b-instruct-bnb-4bit` (pre-quantized NF4, ~10 GB download) - **Adapter:** `build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora` (551 MB, QLoRA r=32, α=32) - **Loaded via:** transformers `AutoModelForCausalLM` + `PeftModel.from_pretrained(..., torch_device="cpu")` (required on ZeroGPU) - **Env overrides:** `LFED_BASE_MODEL`, `LFED_ADAPTER_REPO` ### Active Model (`product` — local) [HISTORICAL — llama.cpp] - **Repo:** `build-small-hackathon/lfed-qwen2.5-coder-14b-sql-gguf` (Q4_K_M, ~9 GB; 7B fallback exists) - llama.cpp via `llama-cpp-python`, Metal on macOS ### Inference Config (`main`) | Parameter | Value | Note | |---|---|---| | `max_tokens` | 192 | Outputs run ~140-170 chars | | `temperature` | 0.0 | Deterministic (greedy) | | `stop` | `\n\n`, `Question:`, `User:`, `<\|im_end\|>`, `<\|im_start\|>` | Applied post-hoc + streaming truncation | | Few-shots | 4 | Trimmed from 7 to cut bnb-4bit prefill cost | | `spaces.GPU` | `duration=30` | Shorter duration = ZeroGPU queue priority | | Generation time | ~5 s/query | Rest of latency = ZeroGPU queue + weight restore | ### Thread Safety `model_inference.py` caches the model in a module-level `_llm` global with a `threading.Lock` (double-check pattern). `generate_sql()` auto-loads if `llm=None`. --- ## Zero GPU / HF Space Configuration ### Space Identity - **URL:** `https://huggingface.co/spaces/build-small-hackathon/Kasualdad_LFED` - **SDK:** Gradio 6.17.3 - **Python:** 3.12 - **Hardware:** Zero GPU (NVIDIA RTX Pro 6000 Blackwell, half — 48 GB VRAM) - **Git remote:** `space` → `https://huggingface.co/spaces/build-small-hackathon/Kasualdad_LFED` ### Zero GPU Architecture Zero GPU uses **CUDA emulation** at module level and **real GPU** inside `@spaces.GPU` functions. This means: - Model loads with `n_gpu_layers=-1` at startup (module level, emulated CUDA) - Inference runs inside `@spaces.GPU` decorated `handle_query()` (real GPU) - Zero GPU transparently switches between emulated and real CUDA contexts ### `@spaces.GPU` Decorator `app.py` — `handle_query()` is the only decorated function. Required for Zero GPU to recognize the Space as GPU-enabled. ### Startup Sequence (`app.py`) 1. Print banner 2. **Ensure Parquet seed files exist** — all **5** files (`enrollment`, `attendance`, `students`, `discipline`, `grades`) must be present in `/data/` or `data/`; they are committed via LFS (byte-deterministic). If any are missing, regenerate via `export_parquet.py`. 3. Load model (`load_model()`) — downloads base + adapter from HF Hub if not cached 4. Launch Gradio UI --- ## Data Engine & Parquet Optimization ### Per-Request Lifecycle `data_engine.py:create_session()` → `get_connection()` + `seed_database()` Each query creates a **fresh in-memory DuckDB** connection. This ensures: - Thread safety (no shared state between requests) - Query isolation (can't affect other requests) - Clean state (no stale data) ### Seeding Priority `data_engine.py:seed_database()` 1. **Parquet files** (fastest — ~260 KB, single-digit ms). Requires **all 5** tables in `/data/` (Space persistent storage) or `data/` (local dev, committed via LFS). Loaded via DuckDB `read_parquet()` → `CREATE TABLE ... AS SELECT *`. 2. **Python generator** `data/generate_seed.py` (slow fallback — ~2,900 students) ### Parquet Bootstrap `app.py` — On startup, if no Parquet files are found in any of the search dirs, `export_parquet.py` is called to generate them. On the Space, they go to `/data/` which persists across restarts. ### SQL Safety Pipeline `data_engine.py:execute_safe()` 1. `extract_sql()` — Parse JSON envelope → ```sql``` block → raw fallback 2. `validate_sql()` — Forbidden token check (DROP, DELETE, INSERT, UPDATE, etc.) + schema-aware `EXPLAIN` validation 3. Wrap: `SELECT * FROM () AS _safe LIMIT 1000` 4. Execute directly on DuckDB 5. Return `(cleaned_sql, DataFrame)` ### Forbidden Tokens `data_engine.py` — `DROP`, `DELETE`, `INSERT`, `UPDATE`, `ALTER`, `TRUNCATE`, `CREATE`, `ATTACH`, `DETACH`, `PRAGMA` --- ## Database Schema ### `enrollment` | Column | Type | Description | |---|---|---| | `school_year` | VARCHAR | School year, format `'YYYY-YYYY'` | | `school_name` | VARCHAR | One of 5 schools | | `grade_level` | INTEGER | Grade level (K=0 through 12) | | `student_count` | INTEGER | Students enrolled in that grade/year/school | ### `attendance` | Column | Type | Description | |---|---|---| | `student_id` | INTEGER | Unique student identifier | | `school_name` | VARCHAR | School the student attends | | `school_year` | VARCHAR | School year, format `'YYYY-YYYY'` | | `absence_count` | INTEGER | Total absences for that year | | `is_chronically_absent` | BOOLEAN | TRUE if missed ≥10% of school days | ### `students` | Column | Type | Description | |---|---|---| | `student_id` | INTEGER | Unique student identifier | | `school_name` | VARCHAR | School the student attends | | `grade_level` | INTEGER | Grade level (K=0 through 12) | | `gender`, `race_ethnicity` | VARCHAR | Demographic fields | | `english_learner`, `special_education`, `economically_disadvantaged` | BOOLEAN | Program flags | ### `discipline` | Column | Type | Description | |---|---|---| | `incident_id`, `student_id` | INTEGER | IDs | | `school_name`, `school_year` | VARCHAR | School / year | | `grade_level` | INTEGER | Grade at time of incident | | `incident_type`, `severity`, `action_taken` | VARCHAR | Incident details | | `incident_date` | DATE | Date of incident | | `days_suspended` | INTEGER | Days suspended, if any | ### `grades` | Column | Type | Description | |---|---|---| | `student_id` | INTEGER | Student ID | | `school_name`, `school_year` | VARCHAR | School / year | | `grade_level` | INTEGER | Grade level | | `course_name`, `term`, `letter_grade` | VARCHAR | Course details | | `grade_numeric`, `gpa` | DOUBLE | Numeric grade and GPA | ### Schools | School | Grades | Base Enrollment | |---|---|---| | Lincoln Elementary | K–5 | 520 | | Washington Middle | 6–8 | 480 | | Jefferson High | 9–12 | 900 | | Roosevelt Academy | K–8 | 380 | | Kennedy Prep | 6–12 | 620 | ### Seed Data Stats - **Students:** ~2,900 - **Chronic absenteeism rate:** ~15% - **Enrollment rows:** ~116 - **Attendance rows:** ~11,600 - **Discipline rows:** ~1,400 - **Grades rows:** ~160,000 - **School years:** 2021-2022, 2022-2023, 2023-2024, 2024-2025 - **Reproducible:** `random.seed(42)` in `generate_seed.py` --- ## Prompt Engineering `prompts.py` assembles the full LLM prompt: ### Structure ``` SYSTEM_PROMPT (rules, constraints, output format) ↓ Schema documentation (table + column list) ↓ Few-shot examples (4 question→SQL pairs) ↓ User question ↓ "SQL:" ``` ### System Prompt Rules 1. Only SELECT statements 2. Exact table/column names from schema 3. Proper DuckDB syntax (VARCHAR → single quotes, BOOLEAN → TRUE/FALSE) 4. Use column aliases for aggregations 5. Join on logical columns 6. Make reasonable assumptions if ambiguous 7. Output ONLY the ```sql``` block, no explanation ### Few-Shot Examples | Question | SQL pattern | |---|---| | "How many chronically absent in 2023-2024?" | `COUNT(*)` with `WHERE ... AND is_chronically_absent = TRUE` | | "Show enrollment per school sorted" | `SUM(student_count) GROUP BY school_name ORDER BY ... DESC` | | "What percentage at Lincoln Elementary?" | `COUNT(CASE WHEN ...) * 100.0 / COUNT(*)` | | "Enrollment trend since 2021" | `GROUP BY school_year ORDER BY school_year` | --- ## Fine-Tuning Pipeline (Modal) ### Location: `modal_train/` — v2 | Script | Purpose | |---|---| | `generate_synthetic_v2.py` + `modal_generate.py` + `augment_gretel.py` + `rephrase_pairs.py` | Builds the 27,859-pair dataset (`train_final_v2.jsonl`) | | `train_v2.py` | Unsloth QLoRA on Qwen2.5-Coder-14B (r=32, α=32, 4-bit, 2 epochs, lr=1e-4, A10G) | | `export_gguf_v2.py` | Merges LoRA → GGUF Q4_K_M → pushes to HF Hub | | `modal_train_v2.py` | Modal orchestration (`modal.App("kasualdad-lfed-train-v2")`) | ### Published training dataset The v2 training dataset is now published on the Hub: | Artifact | Location | |---|---| | NL→SQL training data (25,886 pairs) | [`build-small-hackathon/lfed-training-data`](https://huggingface.co/datasets/build-small-hackathon/lfed-training-data) | | v1 scripts (`train.py`, `export_gguf.py`, `modal_app.py`) | Historical 7B run, 1,289 pairs | ### How to Run ```bash modal secret create huggingface-secret HF_TOKEN= modal run modal_train/modal_train_v2.py ``` ### Artifacts (Status: trained 2026-06-10) | Artifact | Location | |---|---| | LoRA adapter | `build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora` (HF Hub) + Modal volume `lfed-training-data:/lora-adapter-v2` | | GGUF Q4_K_M | `build-small-hackathon/lfed-qwen2.5-coder-14b-sql-gguf` (HF Hub) | | Merged fp16 | Not persisted; re-merge from base + adapter if needed | --- ## Dependencies ### Python (`requirements.txt`, `main` branch) | Package | Why | |---|---| | `gradio` >= 6.15 | Web UI | | `duckdb` 1.5.3 | In-memory SQL engine | | `torch` / `transformers` / `peft` / `bitsandbytes` / `accelerate` | Model inference (bnb-4bit base + LoRA) | | `huggingface_hub` + `hf_transfer` | Fast model download | | `spaces` | `@spaces.GPU` decorator for ZeroGPU | The `product` branch instead pins `llama-cpp-python` (no torch stack needed). --- ## UI Design System Current design is **ResearchMono** on top of `gr.themes.Soft`: - **Typography:** IBM Plex Sans (UI) + IBM Plex Mono (SQL/code) - **Palette:** light page background (`#F2F4F8`), white surfaces (`#FFFFFF`), dark text (`#21272A`), IBM blue accent (`#4589FF`) - **Layout:** single column; result region appears after submitting a question - **Components:** domain starter dropdowns, first-time nudge + modal, result summary, dataframe, CSV download, SQL disclosure, footer explainer The previous aggressive CSS-override approach was replaced by `gr.themes.Soft` with targeted overrides for Gradio's scoped Svelte classes. --- ## Hackathon Badges | Badge | Status | Implementation | |---|---|---| | **Off the Grid** | ✅ | `product` branch: llama.cpp + local GGUF, no API calls. `main`/Space: same fine-tune on ZeroGPU (transformers). | | **Well-Tuned** | ✅ | Fine-tuned Qwen2.5-Coder-14B, QLoRA r=32 on 27,859 NL→SQL pairs (Modal A10G). | | **Llama Champion** | ✅ | llama.cpp is the local-first inference backend (GGUF Q4_K_M, Metal, streaming). | | **Off-Brand** | ✅ | Custom ResearchMono theme — IBM Plex Sans/Mono, `#4589FF` accent, `gr.themes.Soft`. | --- ## Gotchas & Pitfalls ### 1. `LD_LIBRARY_PATH` doesn't work at Python runtime [HISTORICAL — llama.cpp] The dynamic linker caches it at process start. Use `ctypes.CDLL(path, mode=RTLD_GLOBAL)` to preload libraries instead. Only relevant for the `product` branch. ### 2. ZeroGPU = PyTorch-only CUDA [2026-06-12] ZeroGPU's `spaces.GPU` decorator works through a PyTorch patching layer — non-PyTorch CUDA libraries (llama.cpp) cannot benefit from it. The fix was to drop llama.cpp on the Space and use transformers + PEFT + bnb-4-bit instead. ### 3. PEFT adapter loading straight to `cuda:0` fails on ZeroGPU [2026-06-12] `PeftModel.from_pretrained()` defaults to loading adapter safetensors directly onto `cuda:0`, which ZeroGPU forbids at startup. Fix: `torch_device="cpu"`. ### 4. Parquet files committed via LFS [2026-06-12] All 5 Parquet seed files are committed via git LFS. They are byte-deterministic. HF Spaces requires LFS for binary files >10 KB. Re-export with `python data/export_parquet.py` if the seed generator changes. ### 5. Missing Parquet fallback = nondeterministic data per query [2026-06-12] `seed_database()` requires all 5 Parquet files. If any are missing, it falls through to the Python generator, which advances the shared `random` RNG on each call → different data per query. Fixed by committing all 5 files + re-seeding the RNG inside `generate_seed_data()`. ### 6. Gradio scoped Svelte CSS beats generic selectors [2026-06-14] Generic selectors like `.gradio-container .form` or `.gr-dataframe` are overridden by Gradio's generated scoped classes (e.g., `.form.svelte-d5xbca`). Fixes required inspecting the live DOM and using higher-specificity selectors, `!important`, or overriding CSS variables. ### 7. Percentage formatting double-scaling [2026-06-14] `format_result_df()` originally multiplied every `_rate` column by 100, assuming the model always returned 0–1 proportions. The model sometimes returns percentages already (e.g., `10.0`), causing values like `1000.00%`. Fixed by scaling only values in the `0 <= v <= 1` range. ### 8. Transient HF Spaces mount errors [2026-06-14] The Space occasionally fails during init with `Initialization step 'hf-mount' failed`. This is an HF infrastructure issue attaching the persistent storage bucket, not an app bug. Restarting the Space from the Hugging Face UI usually resolves it; the app also has a local Parquet fallback. --- ## Quick Start ```bash # Clone cd Local-First-Education-Data-Framework # Virtual env python3.12 -m venv .venv && source .venv/bin/activate # Install pip install -r requirements.txt # Generate Parquet seed files (first time only) python data/export_parquet.py # Run python app.py # → http://localhost:7860 # Tests pytest tests/ -v # Deploy git push space main ``` --- ## Recent Git History ``` 549e4e0 fix: only scale proportions 0-1 to percentages; avoid double-multiplying d6754e1 feat: hide bring-back button, add CSV download, about/FAQ modal 432039e fix: target real gr.Code label, explainer block border, align dropdown columns 95ddb2d fix: dark sql disclosure text, remove explainer border 78ebc5c fix: override Gradio scoped .form background, label color, dataframe table vars ``` --- ## Future Work - [x] Fine-tuning completed — 14B model trained (27,859 pairs), GGUF + LoRA adapter published - [x] Swapped to fine-tuned model on `main` — transformers + LoRA on ZeroGPU - [x] Streaming re-enabled — SQL streams to the UI - [x] Deterministic seed data — all 5 Parquet files committed via LFS - [x] UI polish — theme, labels, borders, dataframe, dropdown alignment - [x] CSV download - [x] About / FAQ modal - [x] Evaluation query bank - [ ] Query history + comparisons — see `docs/SPEC_query-history-dashboards.md` - [ ] Dashboards — Standard Board + Ephemeral Scratch Board - [ ] HF Space smoke test — verify a subset of evaluation queries after each deploy - [ ] Product branch development — build local-first features on `product` branch - [ ] Model card update on `build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora` --- ## Historical Sections Sections below are kept for reference but describe earlier versions of the project. ### UI Design System (pre-2026-06-10) The original palette used slate/indigo tones with aggressive CSS overrides against Gradio's dark-theme defaults. It was replaced by the current `gr.themes.Soft` + IBM Plex ResearchMono approach. ### CUDA Dependency Story (llama.cpp backend) Full debugging history of preloading CUDA shared libraries for `llama-cpp-python` on the HF Space. Only relevant to the `product` branch. The `main` branch no longer includes llama.cpp or the CUDA preload code. ### llama.cpp Import Guard If `from llama_cpp import Llama` failed on the old backend, a `RuntimeError` pointed users to the CPU wheel. This code was removed on `main` in 2026-06-12.