---
title: Agentic Humanitarian Data Analyst
emoji: ๐Ÿฅ
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "6.18.0"
app_file: app.py
pinned: false
tags:
  - track:backyard
license: mit
---

# Agentic Humanitarian Data Analyst

**Applying a semantic layer and spec-driven development to agentic humanitarian data analysis.**

It turns an analyst's question and a raw survey into a **reviewable data-analysis plan** (which sectors, which indicators, and exactly what this dataset can and can't measure) *before* any agent runs a number. The plan is the spec. A human approves it; the analysis runs against it.

> [Build Small hackathon](https://huggingface.co/build-small-hackathon) · Track: **Backyard AI** · every model **< 32B params**.

---

## The idea

Two patterns from software, applied to AI analysis:

- **A semantic layer:** a governed catalog of indicator definitions (what an indicator *means*, how it's computed, how it fails) that sits between a question and the data, so nobody recomputes "food consumption" from memory.
- **Spec-driven development:** write the plan first, have a human review it, *then* execute. Here the plan is a coverage spec: per indicator, **Measurable / Proxy / Not measurable** from this exact survey, each verdict traced to a published standard, not to the model's memory.

The pipeline is the **skill** ([`humanitarian-data-analyst`](https://github.com/yannsay/humanitarian-data-analyst)), a reusable, domain-general pattern. This **app** is the friendly front end to it.

## Why it needs an LLM at all

The steps are **deterministic**: selecting indicators from the catalog, joining them to the survey, rendering the plan are all code, repeatable byte-for-byte. The **inputs are not**. Two translations are irreducibly fuzzy, and that's the only place the model runs:

1. a free-text **question → which sector / analytical framework** (the route), and
2. per indicator, a **catalog definition → which Kobo question(s) actually measure it** (the map).

That's the bet: keep the LLM to the messy human-to-machine translation, and make everything downstream deterministic code over a governed catalog.

---

## The humanitarian problem this solves

Humanitarian analysis is a specialist domain. Ask for "food security" and you've invoked a specific, named set of indicators: the Food Consumption Score (FCS), the reduced Coping Strategies Index (rCSI), the Household Hunger Scale (HHS), each with an exact definition, a required set of survey questions, and a documented list of ways people get them wrong. The expertise isn't vague; it's a catalog.

The job of an analyst, working a survey collected in Kobo, is to apply that catalog correctly to *this* dataset, and every survey is built differently, so the same indicator maps to different questions every time. There's no fixed lookup; the mapping has to be redone for each new form.

That's where it breaks. A standard indicator gets computed from questions that don't actually support it, producing a result that looks plausible but isn't an indicator the sector recognises, or doesn't exist at all. Not hypothetical: the test case is a real rapid needs assessment that shipped **four documented indicator errors**: an rCSI computed from the wrong columns, a misread JMP water ladder (the WHO/UNICEF Joint Monitoring Programme's drinking-water service classification), a misapplied Sphere threshold (the sector's minimum humanitarian standards), and an FCS reported even though the survey contained no dietary-recall question to build it from.

A capable general LLM can recognise every one of these: the methodology is well documented and in its training. What it lacks is the **attention** to apply each definition, every time, to the right columns, under the pressure of producing an answer. Specialist precision is exactly what a generalist skips.

And this is where a small model can *beat* a big one. The failure is attention, not knowledge, so the fix isn't more capacity, it's structure: hand the model one indicator's definition and known errors, point it at the candidate survey questions, and ask for a single verdict. That narrow, supplied task is precisely what a small model does reliably and a big general model fumbles by trying to hold everything at once. Where the generalist's breadth becomes a liability, the small model's focus becomes the feature.

---

## How it works

The model is fenced to the two translation points above; **code does everything else**.

```
Analyst question + Kobo XLSForm
        │
ROUTE: question → sector / analytical framework
        │   ├─ type a question → LLM translates it                 ◀ fuzzy input, needs the model
        │   └─ pick chips → no LLM, instant
        │
SELECT: sector → indicators from the catalog                       → deterministic script
        │
MAP: each indicator → the survey's questions                       ← the semantic layer
        │   per-indicator loop: LLM proposes candidate variables   ◀ fuzzy input, needs the model
        │   → verdict: Measurable / Proxy / Not measurable
        │   → live trace printed by code, indicator by indicator   ← the visible centrepiece
        │
PLAN: code assembles the data-analysis plan
        │   ← HARD STOP: human analyst reviews & approves
        │
ANALYSE: an agent runs the analysis against the approved plan      → NOT in this demo
```

(An XLSForm is the standard spreadsheet format a Kobo/ODK survey is authored in: one sheet of questions, one of answer choices.)

We stop the demo at **PLAN**, the reviewed data-analysis plan. Routing and mapping are the only fuzzy steps, so they're the only ones the LLM touches; select, plan-assembly, and rendering are deterministic code over a governed catalog. Everything downstream of the human gate (the actual agent analysis) is out of scope here.
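
To make the deterministic SELECT step concrete, here is a minimal sketch. It is not the skill's actual script: the `Indicator` class, the three-entry `CATALOG`, the sector keys, and `select_indicators` are all hypothetical names chosen for illustration. It only shows the shape of the idea, a plain catalog lookup with a fixed sort order and no model call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    # Hypothetical record shape; the real catalog carries more fields.
    code: str
    sector: str
    name: str


# Hypothetical catalog excerpt, deliberately listed out of order.
CATALOG = (
    Indicator("rCSI", "food_security", "reduced Coping Strategies Index"),
    Indicator("JMP_WATER", "wash", "JMP drinking-water service ladder"),
    Indicator("FCS", "food_security", "Food Consumption Score"),
)


def select_indicators(sectors, catalog=CATALOG):
    """Return every catalog indicator for the routed sectors, in a fixed order."""
    wanted = set(sectors)
    picked = [ind for ind in catalog if ind.sector in wanted]
    # Sorting on (sector, code) makes the result independent of input order.
    return sorted(picked, key=lambda ind: (ind.sector, ind.code))


first = select_indicators(["wash", "food_security"])
second = select_indicators(["food_security", "wash"])
print([ind.code for ind in first])  # ['FCS', 'rCSI', 'JMP_WATER']
print(first == second)              # True
```

Because the selection is a filter plus a sort, the order in which ROUTE returns the sectors has no effect on the list.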

That's why the same question gives the **same indicator list every time** (a generic prompt gave 12 vs 42 indicators on identical input; the deterministic select fixed it) and why the whole thing fits a small open model.

### The semantic layer: three governed layers

- **Layer A, Framework:** the analytical ontology of 11 humanitarian sectors, derived from HumSet/DEEP (a published, human-tagged humanitarian classification framework). ROUTE resolves the question into it.
- **Layer B, Indicators:** ~41 indicators across three sectors, WASH (Water, Sanitation and Hygiene), Food Security, and CCCM (Camp Coordination and Camp Management), from authoritative sources: the JMP, the Global Food Security Cluster handbook, Sphere, and the CCCM cluster's camp-management standards. Each carries its definition, thresholds, **common implementation errors**, and what a key-informant survey (one where a community representative answers on the group's behalf, rather than household-by-household) can and can't assess.
- **Layer C, Binding:** the live MAP from the survey's questions to Layer B indicators, gaps surfaced.

Every verdict in the plan points back to a published standard: the line between something an analyst can defend in a report and something they can't.

---

## Why small is the right call

The mechanics follow from the structure above. Because the definition, thresholds, and known errors are *handed* to the model in the prompt rather than recalled from its weights, the model never has to carry the methodology; it only has to judge text we gave it against survey questions we gave it, one indicator at a time. The prompt gets a little bigger; the model can get a lot smaller. A tightly scoped, single-verdict task is what a focused small model does reliably, which is why this whole pipeline fits comfortably under the hackathon's parameter cap.

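
As a rough illustration of "handing" the methodology to the model, the sketch below assembles a prompt for a single indicator and validates the reply against the three allowed verdicts. Everything in it is an assumption made for the example: the dictionary keys, the prompt wording, the two sample Kobo questions, and the functions `build_map_prompt` and `parse_verdict` are hypothetical and are not taken from the skill's code. No model is called; the reply is passed in as a string.

```python
VERDICTS = ("Measurable", "Proxy", "Not measurable")


def build_map_prompt(indicator, questions):
    """Assemble the prompt for ONE indicator from supplied text only."""
    errors = "\n".join(f"- {e}" for e in indicator["known_errors"])
    listing = "\n".join(f"- {q['name']}: {q['label']}" for q in questions)
    return (
        f"Indicator: {indicator['name']}\n"
        f"Definition: {indicator['definition']}\n"
        f"Known implementation errors:\n{errors}\n"
        f"Survey questions:\n{listing}\n"
        f"Answer with exactly one of: {', '.join(VERDICTS)}."
    )


def parse_verdict(reply):
    """Accept the reply only if its first line is one of the three verdicts."""
    lines = reply.strip().splitlines()
    first = lines[0].strip().rstrip(".").lower() if lines else ""
    for verdict in VERDICTS:
        if first == verdict.lower():
            return verdict
    raise ValueError(f"unrecognised verdict: {reply!r}")


# Hypothetical catalog entry and survey questions, for illustration only.
fcs = {
    "name": "Food Consumption Score (FCS)",
    "definition": "Weighted frequency of food groups eaten by the household "
                  "over the past 7 days.",
    "known_errors": ["Reported although the survey has no dietary-recall question."],
}
questions = [
    {"name": "water_source", "label": "What is the main source of drinking water?"},
    {"name": "coping_borrow", "label": "In the past 7 days, how often did the household borrow food?"},
]

prompt = build_map_prompt(fcs, questions)
print(prompt)
print(parse_verdict("not measurable.\nNo dietary-recall question is present."))
```

Rejecting anything outside the three verdicts keeps a malformed reply from silently entering the plan; what the real pipeline does in that case is not specified here.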

---

## Running it

> ⏳ **~5 min cold start.** Inference runs on Modal's on-demand GPUs, so the first run after the Space has been idle takes about 5 minutes to spin up. Runs after that are fast.

**Two input paths:**

- *Type a question* (e.g. "what does the food security data tell us?") → the model routes it. Requires the inference endpoint configured (below).
- *Pick framework chips* → no model call at routing, instant. Good for a quick demo.

### Secrets

Set these in **Space settings → Secrets** (never commit them):

| Secret | Value |
|---|---|
| `MODAL_INFERENCE_URL` | Inference endpoint URL |
| `MODAL_API_KEY` | Inference API key |
| `MODAL_MODEL_ID` | Model served (e.g. `Qwen/Qwen2.5-32B-Instruct`) |
| `LANGSMITH_API_KEY` | LangSmith API key (tracing) |
| `LANGSMITH_ENDPOINT` | `https://eu.smith.langchain.com` |
| `HF_TOKEN` | HF token, read access |

### Local

```bash
uv sync
uv run python app.py   # launches Gradio on :7860
```

---

## Scope

**In:** upload a Kobo XLSForm, route to sectors, select indicators, map them to the survey's questions, produce the reviewable data-analysis plan: verdict per indicator, errors pre-empted, claim boundaries.

**Out, on purpose:** the agent analysis itself (running numbers against the approved plan), charts, multi-turn chat. The plan is the deliverable; we stop at the human review gate.

## The pattern travels

Nothing here is humanitarian-specific by construction. Swap the catalog and the framework and the same shape holds anywhere there's a governed vocabulary of metrics and a cost to getting them wrong: translate a question into the vocabulary, map it to the data you actually have, write a reviewable plan, then let an agent execute. Humanitarian needs assessment is just where the consequences are sharpest.


---

## Social

[Blog](https://huggingface.co/blog/build-small-hackathon/agentic-humanitarian-data-analyst) · [LinkedIn](https://www.linkedin.com/posts/yannsay_github-yannsayhumanitarian-data-analyst-share-7472010240872443905-vlEr) · [YouTube](https://youtu.be/q2qjPJakLGk)

---

## Stack & credits

Gradio · a small open-weight model (< 32B) behind an OpenAI-compatible endpoint · LangSmith tracing · the [`humanitarian-data-analyst`](https://github.com/yannsay/humanitarian-data-analyst) skill (catalog, ontology, scripts) as source of truth; the app is a thin front end over it.

Indicator definitions adapted from JMP, the Global Food Security Cluster, Sphere, and CCCM/CAMP standards.

Part of a longer project on bringing a semantic layer and spec-driven development to AI analysis. Built small, governed, and reviewable.