# Local First Education Data Framework — Project Knowledge Base

> **Last updated:** 2026-06-14 (Session 7 — UI polish, CSV download, about modal, percentage fix, rebrand)
> **Target:** HF Build Small Hackathon — "Backyard AI" chapter
> **Space:** https://huggingface.co/spaces/build-small-hackathon/Kasualdad_LFED
> **Remote:** `git@hf.co:spaces/build-small-hackathon/Kasualdad_LFED` (remote name: `space`)

Note: the Space slug remains `Kasualdad_LFED` for HF URL stability; the product name is now **Local First Education Data Framework (LFED)**.

---

## Current State (read this first)

`main` is the live HF Space demo. It runs **transformers + PEFT** on ZeroGPU: a bnb-4-bit base (`unsloth/qwen2.5-coder-14b-instruct-bnb-4bit`) plus the fine-tuned LoRA adapter (`build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora`).

The local-first product path lives on `product` (branch) / `local-llamacpp-v1` (tag) and uses **llama.cpp + GGUF**. The `main` branch and the `product` branch share the same fine-tuned weights but use different inference engines because ZeroGPU only exposes PyTorch CUDA.

### Recent changes (2026-06-14)

- **UI polish complete** — removed dark wrapper backgrounds, fixed dataframe light theme, darkened labels, aligned the three domain dropdown columns.
- **"Show me how this was computed" SQL disclosure** — added below each result table.
- **CSV download** — result tables can be downloaded as `kasualdad_result_YYYYMMDD_HHMMSS.csv`.
- **About / FAQ modal** — "First time here?" opens an overlay with intro, how-it-works, what it is/isn't, FAQ, and privacy.
- **"Bring it back" button hidden** — logic preserved but button is always invisible; previous-answer ribbon text remains.
- **Percentage formatting fixed** — `format_result_df()` no longer double-multiplies values that are already returned as percentages.
- **Evaluation query bank** — `evaluation_queries.md` added to the repo for local testing (not pushed to the Space).

---

## What It Does

A Gradio app that lets school district staff ask plain-English questions about student data. A fine-tuned LLM (Qwen2.5-Coder-14B, QLoRA on 27,859 NL→SQL pairs) generates DuckDB SQL, which is validated and executed on in-memory seed data. Results are returned as a plain-English summary, a table, an optional CSV download, and the generated SQL.

On the `product` branch nothing leaves the machine; the Space demo runs the same fine-tune on ZeroGPU.

---

## Architecture & Data Flow

```
User types question or picks a starter dropdown
  ↓
app.py  →  on_submit()  →  handle_query()  [@spaces.GPU on ZeroGPU]
  ↓
model_inference.py  →  build_prompt() + TransformersLLM (bnb-4bit base + LoRA)
  ↓  streamed SQL tokens (llama.cpp-compatible response schema)
data_engine.py  →  extract_sql() → validate_sql() → execute_safe()
  ↓  pandas DataFrame
app.py  →  renders summary, table, CSV link, and SQL disclosure
```

**Key design rule:** `app.py` is a thin controller. All logic lives in the engine modules.

### UI Layout (current)

1. Header — title + tagline
2. Input box + "Get answer" button
3. Three domain starter dropdowns (attendance, grades, discipline/enrollment)
4. First-time nudge with modal link
5. Result region — previous-answer ribbon, summary, dataframe, CSV download, SQL disclosure
6. Footer — what this is / what it isn't + "Read the full explainer"
7. Hidden about/FAQ modal overlay

---

## File Map

| File | Purpose | Lines (approx.) |
|---|---|---|
| `app.py` | Gradio UI, streaming `handle_query` generator, `@spaces.GPU`, Parquet bootstrap, custom CSS, about modal, CSV download | ~1,250 |
| `model_inference.py` | `TransformersLLM` wrapper (bnb-4bit base + LoRA), SQL generation + streaming | ~270 |
| `data_engine.py` | DuckDB lifecycle, Parquet loading, SQL extraction/validation/execution | ~280 |
| `prompts.py` | System prompt, 5-table schema docs, 4 few-shot examples, prompt assembler | ~190 |
| `ui_strings.py` | All user-facing copy: titles, nudges, domain starter questions, about/FAQ content | ~230 |
| `data/generate_seed.py` | Deterministic generator: ~2,900 students across 5 tables | ~430 |
| `data/export_parquet.py` | One-shot script: seed → 5 Parquet files | ~70 |
| `data/*.parquet` | 5 committed seed files via LFS (~260 KB total, byte-deterministic) | — |
| `requirements.txt` | Pinned Python deps | 9 lines |
| `README.md` | Public Space README | ~280 |
| `PROJECT.md` | This file — project knowledge base | ~500 |
| `evaluation_queries.md` | 15 real-world evaluation queries for manual testing | ~60 |
| `docs/HANDOFF.md` | Developer session handoff doc | ~190 |
| `docs/PLAN.md` | Original build plan | ~170 |
| `docs/SPEC_query-history-dashboards.md` | Draft spec for next features | — |
| `tests/` | pytest suite (81 tests) | 4 files |
| `modal_train/` | Modal fine-tuning pipeline | 5+ files |

---

## Model & Inference

### Active Model (`main` — Space)

- **Base:** `unsloth/qwen2.5-coder-14b-instruct-bnb-4bit` (pre-quantized NF4, ~10 GB download)
- **Adapter:** `build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora` (551 MB, QLoRA r=32, α=32)
- **Loaded via:** transformers `AutoModelForCausalLM` + `PeftModel.from_pretrained(..., torch_device="cpu")` (required on ZeroGPU)
- **Env overrides:** `LFED_BASE_MODEL`, `LFED_ADAPTER_REPO`

### Active Model (`product` — local) [HISTORICAL — llama.cpp]

- **Repo:** `build-small-hackathon/lfed-qwen2.5-coder-14b-sql-gguf` (Q4_K_M, ~9 GB; 7B fallback exists)
- llama.cpp via `llama-cpp-python`, Metal on macOS

### Inference Config (`main`)

| Parameter | Value | Note |
|---|---|---|
| `max_tokens` | 192 | Outputs run ~140-170 chars |
| `temperature` | 0.0 | Deterministic (greedy) |
| `stop` | `\n\n`, `Question:`, `User:`, `<\|im_end\|>`, `<\|im_start\|>` | Applied post-hoc + streaming truncation |
| Few-shots | 4 | Trimmed from 7 to cut bnb-4bit prefill cost |
| `spaces.GPU` | `duration=30` | Shorter duration = ZeroGPU queue priority |
| Generation time | ~5 s/query | Rest of latency = ZeroGPU queue + weight restore |

### Thread Safety

`model_inference.py` caches the model in a module-level `_llm` global with a `threading.Lock` (double-check pattern). `generate_sql()` auto-loads if `llm=None`.

---

## Zero GPU / HF Space Configuration

### Space Identity
- **URL:** `https://huggingface.co/spaces/build-small-hackathon/Kasualdad_LFED`
- **SDK:** Gradio 6.17.3
- **Python:** 3.12
- **Hardware:** Zero GPU (NVIDIA RTX Pro 6000 Blackwell, half — 48 GB VRAM)
- **Git remote:** `space` → `https://huggingface.co/spaces/build-small-hackathon/Kasualdad_LFED`

### Zero GPU Architecture

Zero GPU uses **CUDA emulation** at module level and **real GPU** inside `@spaces.GPU` functions. This means:
- Model loads with `n_gpu_layers=-1` at startup (module level, emulated CUDA)
- Inference runs inside `@spaces.GPU` decorated `handle_query()` (real GPU)
- Zero GPU transparently switches between emulated and real CUDA contexts

### `@spaces.GPU` Decorator

`app.py` — `handle_query()` is the only decorated function. Required for Zero GPU to recognize the Space as GPU-enabled.

### Startup Sequence (`app.py`)

1. Print banner
2. **Ensure Parquet seed files exist** — all **5** files (`enrollment`, `attendance`, `students`, `discipline`, `grades`) must be present in `/data/` or `data/`; they are committed via LFS (byte-deterministic). If any are missing, regenerate via `export_parquet.py`.
3. Load model (`load_model()`) — downloads base + adapter from HF Hub if not cached
4. Launch Gradio UI

---

## Data Engine & Parquet Optimization

### Per-Request Lifecycle

`data_engine.py:create_session()` → `get_connection()` + `seed_database()`

Each query creates a **fresh in-memory DuckDB** connection. This ensures:
- Thread safety (no shared state between requests)
- Query isolation (can't affect other requests)
- Clean state (no stale data)

### Seeding Priority

`data_engine.py:seed_database()`

1. **Parquet files** (fastest — ~260 KB, single-digit ms). Requires **all 5** tables in `/data/` (Space persistent storage) or `data/` (local dev, committed via LFS). Loaded via DuckDB `read_parquet()` → `CREATE TABLE ... AS SELECT *`.
2. **Python generator** `data/generate_seed.py` (slow fallback — ~2,900 students)

### Parquet Bootstrap

`app.py` — On startup, if no Parquet files are found in any of the search dirs, `export_parquet.py` is called to generate them. On the Space, they go to `/data/` which persists across restarts.

### SQL Safety Pipeline

`data_engine.py:execute_safe()`

1. `extract_sql()` — Parse JSON envelope → ```sql``` block → raw fallback
2. `validate_sql()` — Forbidden token check (DROP, DELETE, INSERT, UPDATE, etc.) + schema-aware `EXPLAIN` validation
3. Wrap: `SELECT * FROM (<user_query>) AS _safe LIMIT 1000`
4. Execute directly on DuckDB
5. Return `(cleaned_sql, DataFrame)`

### Forbidden Tokens

`data_engine.py` — `DROP`, `DELETE`, `INSERT`, `UPDATE`, `ALTER`, `TRUNCATE`, `CREATE`, `ATTACH`, `DETACH`, `PRAGMA`

---

## Database Schema

### `enrollment`

| Column | Type | Description |
|---|---|---|
| `school_year` | VARCHAR | School year, format `'YYYY-YYYY'` |
| `school_name` | VARCHAR | One of 5 schools |
| `grade_level` | INTEGER | Grade level (K=0 through 12) |
| `student_count` | INTEGER | Students enrolled in that grade/year/school |

### `attendance`

| Column | Type | Description |
|---|---|---|
| `student_id` | INTEGER | Unique student identifier |
| `school_name` | VARCHAR | School the student attends |
| `school_year` | VARCHAR | School year, format `'YYYY-YYYY'` |
| `absence_count` | INTEGER | Total absences for that year |
| `is_chronically_absent` | BOOLEAN | TRUE if missed ≥10% of school days |

### `students`

| Column | Type | Description |
|---|---|---|
| `student_id` | INTEGER | Unique student identifier |
| `school_name` | VARCHAR | School the student attends |
| `grade_level` | INTEGER | Grade level (K=0 through 12) |
| `gender`, `race_ethnicity` | VARCHAR | Demographic fields |
| `english_learner`, `special_education`, `economically_disadvantaged` | BOOLEAN | Program flags |

### `discipline`

| Column | Type | Description |
|---|---|---|
| `incident_id`, `student_id` | INTEGER | IDs |
| `school_name`, `school_year` | VARCHAR | School / year |
| `grade_level` | INTEGER | Grade at time of incident |
| `incident_type`, `severity`, `action_taken` | VARCHAR | Incident details |
| `incident_date` | DATE | Date of incident |
| `days_suspended` | INTEGER | Days suspended, if any |

### `grades`

| Column | Type | Description |
|---|---|---|
| `student_id` | INTEGER | Student ID |
| `school_name`, `school_year` | VARCHAR | School / year |
| `grade_level` | INTEGER | Grade level |
| `course_name`, `term`, `letter_grade` | VARCHAR | Course details |
| `grade_numeric`, `gpa` | DOUBLE | Numeric grade and GPA |

### Schools

| School | Grades | Base Enrollment |
|---|---|---|
| Lincoln Elementary | K–5 | 520 |
| Washington Middle | 6–8 | 480 |
| Jefferson High | 9–12 | 900 |
| Roosevelt Academy | K–8 | 380 |
| Kennedy Prep | 6–12 | 620 |

### Seed Data Stats

- **Students:** ~2,900
- **Chronic absenteeism rate:** ~15%
- **Enrollment rows:** ~116
- **Attendance rows:** ~11,600
- **Discipline rows:** ~1,400
- **Grades rows:** ~160,000
- **School years:** 2021-2022, 2022-2023, 2023-2024, 2024-2025
- **Reproducible:** `random.seed(42)` in `generate_seed.py`

---

## Prompt Engineering

`prompts.py` assembles the full LLM prompt:

### Structure

```
SYSTEM_PROMPT (rules, constraints, output format)
  ↓
Schema documentation (table + column list)
  ↓
Few-shot examples (4 question→SQL pairs)
  ↓
User question
  ↓
"SQL:"
```

### System Prompt Rules

1. Only SELECT statements
2. Exact table/column names from schema
3. Proper DuckDB syntax (VARCHAR → single quotes, BOOLEAN → TRUE/FALSE)
4. Use column aliases for aggregations
5. Join on logical columns
6. Make reasonable assumptions if ambiguous
7. Output ONLY the ```sql``` block, no explanation

### Few-Shot Examples

| Question | SQL pattern |
|---|---|
| "How many chronically absent in 2023-2024?" | `COUNT(*)` with `WHERE ... AND is_chronically_absent = TRUE` |
| "Show enrollment per school sorted" | `SUM(student_count) GROUP BY school_name ORDER BY ... DESC` |
| "What percentage at Lincoln Elementary?" | `COUNT(CASE WHEN ...) * 100.0 / COUNT(*)` |
| "Enrollment trend since 2021" | `GROUP BY school_year ORDER BY school_year` |

---

## Fine-Tuning Pipeline (Modal)

### Location: `modal_train/` — v2

| Script | Purpose |
|---|---|
| `generate_synthetic_v2.py` + `modal_generate.py` + `augment_gretel.py` + `rephrase_pairs.py` | Builds the 27,859-pair dataset (`train_final_v2.jsonl`) |
| `train_v2.py` | Unsloth QLoRA on Qwen2.5-Coder-14B (r=32, α=32, 4-bit, 2 epochs, lr=1e-4, A10G) |
| `export_gguf_v2.py` | Merges LoRA → GGUF Q4_K_M → pushes to HF Hub |
| `modal_train_v2.py` | Modal orchestration (`modal.App("kasualdad-lfed-train-v2")`) |

### Published training dataset

The v2 training dataset is now published on the Hub:

| Artifact | Location |
|---|---|
| NL→SQL training data (25,886 pairs) | [`build-small-hackathon/lfed-training-data`](https://huggingface.co/datasets/build-small-hackathon/lfed-training-data) |
| v1 scripts (`train.py`, `export_gguf.py`, `modal_app.py`) | Historical 7B run, 1,289 pairs |

### How to Run

```bash
modal secret create huggingface-secret HF_TOKEN=<token>
modal run modal_train/modal_train_v2.py
```

### Artifacts (Status: trained 2026-06-10)

| Artifact | Location |
|---|---|
| LoRA adapter | `build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora` (HF Hub) + Modal volume `lfed-training-data:/lora-adapter-v2` |
| GGUF Q4_K_M | `build-small-hackathon/lfed-qwen2.5-coder-14b-sql-gguf` (HF Hub) |
| Merged fp16 | Not persisted; re-merge from base + adapter if needed |

---

## Dependencies

### Python (`requirements.txt`, `main` branch)

| Package | Why |
|---|---|
| `gradio` >= 6.15 | Web UI |
| `duckdb` 1.5.3 | In-memory SQL engine |
| `torch` / `transformers` / `peft` / `bitsandbytes` / `accelerate` | Model inference (bnb-4bit base + LoRA) |
| `huggingface_hub` + `hf_transfer` | Fast model download |
| `spaces` | `@spaces.GPU` decorator for ZeroGPU |

The `product` branch instead pins `llama-cpp-python` (no torch stack needed).

---

## UI Design System

Current design is **ResearchMono** on top of `gr.themes.Soft`:

- **Typography:** IBM Plex Sans (UI) + IBM Plex Mono (SQL/code)
- **Palette:** light page background (`#F2F4F8`), white surfaces (`#FFFFFF`), dark text (`#21272A`), IBM blue accent (`#4589FF`)
- **Layout:** single column; result region appears after submitting a question
- **Components:** domain starter dropdowns, first-time nudge + modal, result summary, dataframe, CSV download, SQL disclosure, footer explainer

The previous aggressive CSS-override approach was replaced by `gr.themes.Soft` with targeted overrides for Gradio's scoped Svelte classes.

---

## Hackathon Badges

| Badge | Status | Implementation |
|---|---|---|
| **Off the Grid** | ✅ | `product` branch: llama.cpp + local GGUF, no API calls. `main`/Space: same fine-tune on ZeroGPU (transformers). |
| **Well-Tuned** | ✅ | Fine-tuned Qwen2.5-Coder-14B, QLoRA r=32 on 27,859 NL→SQL pairs (Modal A10G). |
| **Llama Champion** | ✅ | llama.cpp is the local-first inference backend (GGUF Q4_K_M, Metal, streaming). |
| **Off-Brand** | ✅ | Custom ResearchMono theme — IBM Plex Sans/Mono, `#4589FF` accent, `gr.themes.Soft`. |

---

## Gotchas & Pitfalls

### 1. `LD_LIBRARY_PATH` doesn't work at Python runtime [HISTORICAL — llama.cpp]
The dynamic linker caches it at process start. Use `ctypes.CDLL(path, mode=RTLD_GLOBAL)` to preload libraries instead. Only relevant for the `product` branch.

### 2. ZeroGPU = PyTorch-only CUDA [2026-06-12]
ZeroGPU's `spaces.GPU` decorator works through a PyTorch patching layer — non-PyTorch CUDA libraries (llama.cpp) cannot benefit from it. The fix was to drop llama.cpp on the Space and use transformers + PEFT + bnb-4-bit instead.

### 3. PEFT adapter loading straight to `cuda:0` fails on ZeroGPU [2026-06-12]
`PeftModel.from_pretrained()` defaults to loading adapter safetensors directly onto `cuda:0`, which ZeroGPU forbids at startup. Fix: `torch_device="cpu"`.

### 4. Parquet files committed via LFS [2026-06-12]
All 5 Parquet seed files are committed via git LFS. They are byte-deterministic. HF Spaces requires LFS for binary files >10 KB. Re-export with `python data/export_parquet.py` if the seed generator changes.

### 5. Missing Parquet fallback = nondeterministic data per query [2026-06-12]
`seed_database()` requires all 5 Parquet files. If any are missing, it falls through to the Python generator, which advances the shared `random` RNG on each call → different data per query. Fixed by committing all 5 files + re-seeding the RNG inside `generate_seed_data()`.

### 6. Gradio scoped Svelte CSS beats generic selectors [2026-06-14]
Generic selectors like `.gradio-container .form` or `.gr-dataframe` are overridden by Gradio's generated scoped classes (e.g., `.form.svelte-d5xbca`). Fixes required inspecting the live DOM and using higher-specificity selectors, `!important`, or overriding CSS variables.

### 7. Percentage formatting double-scaling [2026-06-14]
`format_result_df()` originally multiplied every `_rate` column by 100, assuming the model always returned 0–1 proportions. The model sometimes returns percentages already (e.g., `10.0`), causing values like `1000.00%`. Fixed by scaling only values in the `0 <= v <= 1` range.

### 8. Transient HF Spaces mount errors [2026-06-14]
The Space occasionally fails during init with `Initialization step 'hf-mount' failed`. This is an HF infrastructure issue attaching the persistent storage bucket, not an app bug. Restarting the Space from the Hugging Face UI usually resolves it; the app also has a local Parquet fallback.

---

## Quick Start

```bash
# Clone
cd Local-First-Education-Data-Framework

# Virtual env
python3.12 -m venv .venv && source .venv/bin/activate

# Install
pip install -r requirements.txt

# Generate Parquet seed files (first time only)
python data/export_parquet.py

# Run
python app.py
# → http://localhost:7860

# Tests
pytest tests/ -v

# Deploy
git push space main
```

---

## Recent Git History

```
549e4e0 fix: only scale proportions 0-1 to percentages; avoid double-multiplying
d6754e1 feat: hide bring-back button, add CSV download, about/FAQ modal
432039e fix: target real gr.Code label, explainer block border, align dropdown columns
95ddb2d fix: dark sql disclosure text, remove explainer border
78ebc5c fix: override Gradio scoped .form background, label color, dataframe table vars
```

---

## Future Work

- [x] Fine-tuning completed — 14B model trained (27,859 pairs), GGUF + LoRA adapter published
- [x] Swapped to fine-tuned model on `main` — transformers + LoRA on ZeroGPU
- [x] Streaming re-enabled — SQL streams to the UI
- [x] Deterministic seed data — all 5 Parquet files committed via LFS
- [x] UI polish — theme, labels, borders, dataframe, dropdown alignment
- [x] CSV download
- [x] About / FAQ modal
- [x] Evaluation query bank
- [ ] Query history + comparisons — see `docs/SPEC_query-history-dashboards.md`
- [ ] Dashboards — Standard Board + Ephemeral Scratch Board
- [ ] HF Space smoke test — verify a subset of evaluation queries after each deploy
- [ ] Product branch development — build local-first features on `product` branch
- [ ] Model card update on `build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora`

---

## Historical Sections

Sections below are kept for reference but describe earlier versions of the project.

### UI Design System (pre-2026-06-10)

The original palette used slate/indigo tones with aggressive CSS overrides against Gradio's dark-theme defaults. It was replaced by the current `gr.themes.Soft` + IBM Plex ResearchMono approach.

### CUDA Dependency Story (llama.cpp backend)

Full debugging history of preloading CUDA shared libraries for `llama-cpp-python` on the HF Space. Only relevant to the `product` branch. The `main` branch no longer includes llama.cpp or the CUDA preload code.

### llama.cpp Import Guard

If `from llama_cpp import Llama` failed on the old backend, a `RuntimeError` pointed users to the CPU wheel. This code was removed on `main` in 2026-06-12.