---
title: Agentic Humanitarian Data Analyst
emoji: 🏥
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "6.18.0"
app_file: app.py
pinned: false
tags:
  - track:backyard
license: mit
---

# Agentic Humanitarian Data Analyst

**Applying a semantic layer and spec-driven development to agentic humanitarian data analysis.**

It turns an analyst's question and a raw survey into a **reviewable data-analysis plan** — which sectors, which indicators, and exactly what this dataset can and can't measure — *before* any agent runs a number. The plan is the spec. A human approves it; the analysis runs against it.

> [Build Small hackathon](https://huggingface.co/build-small-hackathon) · Track: **Backyard AI** · every model **< 32B params**.

---

## The idea

Two patterns from software, applied to AI analysis:

- **A semantic layer** — a governed catalog of indicator definitions (what an indicator *means*, how it's computed, how it fails) that sits between a question and the data, so nobody recomputes "food consumption" from memory.
- **Spec-driven development** — write the plan first, have a human review it, *then* execute. Here the plan is a coverage spec: per indicator, **Measurable / Proxy / Not measurable** from this exact survey, each verdict traced to a published standard, not to the model's memory.
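
A coverage spec of that shape can be sketched as a small record type. This is an illustrative sketch, not the skill's actual schema: `Verdict`, `CoverageEntry`, `render_plan`, and every field name are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    MEASURABLE = "Measurable"
    PROXY = "Proxy"
    NOT_MEASURABLE = "Not measurable"


@dataclass(frozen=True)
class CoverageEntry:
    """One line of the plan (hypothetical field names)."""
    indicator: str                      # e.g. "FCS"
    verdict: Verdict
    survey_questions: tuple[str, ...]   # question names taken from the form
    source: str                         # published standard the verdict is traced to


def render_plan(entries):
    """Render the spec as plain text, one line per indicator, in a stable order."""
    lines = []
    for entry in sorted(entries, key=lambda e: e.indicator):
        questions = ", ".join(entry.survey_questions) or "none"
        lines.append(
            f"{entry.indicator}: {entry.verdict.value} "
            f"(questions: {questions}; source: {entry.source})"
        )
    return "\n".join(lines)
```

Keeping the source on every entry is what lets a reviewer check a verdict against the standard instead of trusting the model.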

The pipeline is the **skill** ([`humanitarian-data-analyst`](https://github.com/yannsay/humanitarian-data-analyst)) — a reusable, domain-general pattern. This **app** is the friendly front end to it.

## Why it needs an LLM at all

The steps are **deterministic**: selecting indicators from the catalog, joining them to the survey, and rendering the plan are all code, repeatable byte-for-byte. The **inputs are not**. Two translations are irreducibly fuzzy, and those are the only places the model runs:

1. a free-text **question → which sector / analytical framework** (the route), and
2. per indicator, a **catalog definition → which Kobo question(s) actually measure it** (the map).

That's the bet: keep the LLM to the messy human-to-machine translation, and make everything downstream deterministic code over a governed catalog.
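
One way to keep the fuzzy step fenced is to accept the model's routing answer only when it names something in the governed vocabulary. The sketch below assumes the model replies with a comma-separated list of sector names. That reply format, the `resolve_route` helper, and the shortened `SECTORS` tuple are assumptions for illustration, not the app's actual code.

```python
# Illustrative subset: the three sectors that carry indicators in this project.
SECTORS = ("Food Security", "WASH", "CCCM")


def resolve_route(raw_answer, allowed=SECTORS):
    """Keep only sectors from the governed list, matched case-insensitively.

    Anything the model invents is dropped rather than passed downstream.
    """
    by_key = {sector.casefold(): sector for sector in allowed}
    resolved = []
    for part in raw_answer.split(","):
        sector = by_key.get(part.strip().casefold())
        if sector and sector not in resolved:
            resolved.append(sector)
    return resolved
```

An empty result is a useful signal in its own right: the question did not land in the framework, so the analyst can be asked to pick chips instead.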

---

## The humanitarian problem this solves

Humanitarian analysis is a specialist domain. Ask for "food security" and you've invoked a specific, named set of indicators — the Food Consumption Score (FCS), the reduced Coping Strategies Index (rCSI), the Household Hunger Scale (HHS) — each with an exact definition, a required set of survey questions, and a documented list of ways people get them wrong. The expertise isn't vague; it's a catalog. The job of an analyst, working a survey collected in Kobo, is to apply that catalog correctly to *this* dataset — and every survey is built differently, so the same indicator maps to different questions every time. There's no fixed lookup; the mapping has to be redone for each new form.

That's where it breaks. A standard indicator gets computed from questions that don't actually support it, or that aren't in the survey at all, producing a result that looks plausible but isn't an indicator the sector recognises. This is not hypothetical. The test case is a real rapid needs assessment that shipped **four documented indicator errors**: an rCSI computed from the wrong columns, a misread JMP water ladder (the WHO/UNICEF Joint Monitoring Programme's drinking-water service classification), a misapplied Sphere threshold (the sector's minimum humanitarian standards), and an FCS reported even though the survey contained no dietary-recall question to build it from.

A capable general LLM can recognise every one of these — the methodology is well documented and in its training. What it lacks is the **attention** to apply each definition, every time, to the right columns, under the pressure of producing an answer. Specialist precision is exactly what a generalist skips.

And this is where a small model can *beat* a big one. The failure is attention, not knowledge — so the fix isn't more capacity, it's structure: hand the model one indicator's definition and known errors, point it at the candidate survey questions, and ask for a single verdict. That narrow, supplied task is precisely what a small model does reliably and a big general model fumbles by trying to hold everything at once. Where the generalist's breadth becomes a liability, the small model's focus becomes the feature.

---

## How it works

The model is fenced to the two translation points above; **code does everything else**.

```
Analyst question + Kobo XLSForm
   │
 ROUTE   — question → sector / analytical framework
   │       ├─ type a question  → LLM translates it  ◀ fuzzy input, needs the model
   │       └─ pick chips        → no LLM, instant
   │
 SELECT  — sector → indicators from the catalog        → deterministic script
   │
 MAP     — each indicator → the survey's questions      ← the semantic layer
   │       per-indicator loop: LLM proposes candidate variables  ◀ fuzzy input, needs the model
   │       → verdict: Measurable / Proxy / Not measurable
   │       → live trace printed by code, indicator by indicator   ← the visible centrepiece
   │
 PLAN    — code assembles the data-analysis plan
   │       ← HARD STOP: human analyst reviews & approves
   │
 ANALYSE — an agent runs the analysis against the approved plan   → NOT in this demo
```

(An XLSForm is the standard spreadsheet format a Kobo/ODK survey is authored in — one sheet of questions, one of answer choices.)

We stop the demo at **PLAN** — the reviewed data-analysis plan. Routing and mapping are the only fuzzy steps, so they're the only ones the LLM touches; select, plan-assembly, and rendering are deterministic code over a governed catalog. Everything downstream of the human gate (the actual agent analysis) is out of scope here.
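
The control flow in the diagram can be sketched in a few lines. This is a minimal sketch, assuming the two model calls are injected as plain callables (`route_llm`, `map_llm`) and the plan is a dictionary with an `approved` flag. All of these names are hypothetical and the real app may be organised differently.

```python
def build_plan(question, chips, form_questions, catalog, route_llm, map_llm):
    """Run ROUTE, SELECT, MAP and PLAN, then stop for human review."""
    # ROUTE: chips bypass the model; a typed question needs it.
    sectors = list(chips) if chips else route_llm(question)

    # SELECT: a plain catalog lookup, no model involved.
    indicators = [name for sector in sectors for name in catalog.get(sector, [])]

    # MAP: one narrow model call per indicator, traced as it goes.
    plan = []
    for name in indicators:
        verdict = map_llm(name, form_questions)
        print(f"{name}: {verdict}")
        plan.append((name, verdict))

    # PLAN: returned unapproved; only a human flips the flag.
    return {"plan": plan, "approved": False}


def analyse(plan_doc):
    """Refuse to run until an analyst has approved the plan."""
    if not plan_doc["approved"]:
        raise PermissionError("plan has not been approved by an analyst")
    raise NotImplementedError("the analysis step is out of scope for this demo")
```

Passing the model calls in as arguments keeps the deterministic parts testable with stubs, and makes it obvious that nothing outside ROUTE and MAP can reach the model.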

That's why the same question gives the **same indicator list every time**, and why the whole thing fits a small open model. Before the deterministic select, a generic prompt returned 12 indicators on one run and 42 on another for identical input.
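
A deterministic select is little more than a filter and a stable sort. The sketch below uses a tiny in-memory catalog whose `id` values and row layout are assumptions for the example. The indicator names come from the sections above, and the real catalog lives in the skill.

```python
# Hypothetical catalog rows; the ids are made up for this example.
CATALOG = [
    {"id": "rcsi", "sector": "Food Security", "name": "reduced Coping Strategies Index"},
    {"id": "jmp_drinking_water", "sector": "WASH", "name": "JMP drinking-water ladder"},
    {"id": "fcs", "sector": "Food Security", "name": "Food Consumption Score"},
    {"id": "hhs", "sector": "Food Security", "name": "Household Hunger Scale"},
]


def select_indicators(sectors, catalog=CATALOG):
    """Return every catalog row for the routed sectors, in a fixed order.

    Sorting on (sector, id) means the result does not depend on the order
    of the input sectors or of the catalog rows.
    """
    wanted = set(sectors)
    picked = [row for row in catalog if row["sector"] in wanted]
    return sorted(picked, key=lambda row: (row["sector"], row["id"]))
```

Because no model is involved, two runs with the same sectors cannot disagree on which indicators are in scope.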

### The semantic layer: three governed layers

- **Layer A — Framework:** the analytical ontology of 11 humanitarian sectors, derived from HumSet/DEEP (a published, human-tagged humanitarian classification framework). ROUTE resolves the question into it.
- **Layer B — Indicators:** ~41 indicators across three sectors — WASH (Water, Sanitation and Hygiene), Food Security, and CCCM (Camp Coordination and Camp Management) — from authoritative sources: the JMP, the Global Food Security Cluster handbook, Sphere, and the CCCM cluster's camp-management standards. Each carries its definition, thresholds, **common implementation errors**, and what a key-informant survey (one where a community representative answers on the group's behalf, rather than household-by-household) can and can't assess.
- **Layer C — Binding:** the live MAP from the survey's questions to Layer B indicators, gaps surfaced.

Every verdict in the plan points back to a published standard — the line between something an analyst can defend in a report and something they can't.
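
Layer C can be illustrated with a check that runs after the model proposes candidate variables: code confirms each proposed name is a real, answerable question in the form, and lists the indicators left with nothing. This is a sketch under assumptions. The `bind` and `answerable_questions` helpers are hypothetical, and the rows are assumed to be the XLSForm `survey` sheet already loaded as dictionaries with its standard `type` and `name` columns.

```python
# First token of XLSForm types that do not collect a respondent's answer:
# group/repeat markers, notes, and the "end" timestamp metadata field.
NON_QUESTION_TOKENS = {"begin", "end", "note"}


def answerable_questions(survey_rows):
    """Names of rows that collect an answer, skipping groups and notes."""
    names = set()
    for row in survey_rows:
        # "select_one water_src" -> "select", "begin_group" -> "begin"
        tokens = row.get("type", "").replace("_", " ").split()
        if tokens and tokens[0] not in NON_QUESTION_TOKENS and row.get("name"):
            names.add(row["name"])
    return names


def bind(proposals, survey_rows):
    """Split model proposals into verified bindings and surfaced gaps.

    proposals maps an indicator to the question names the model suggested.
    """
    known = answerable_questions(survey_rows)
    bound, gaps = {}, []
    for indicator, proposed in proposals.items():
        found = [name for name in proposed if name in known]
        if found:
            bound[indicator] = found
        else:
            gaps.append(indicator)
    return bound, gaps
```

A question name the model made up never reaches the plan, and an indicator with no surviving binding shows up as a gap instead of silently disappearing.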

---

## Why small is the right call

The mechanics follow from the structure above. Because the definition, thresholds, and known errors are *handed* to the model in the prompt rather than recalled from its weights, the model never has to carry the methodology — it only has to judge text we gave it against survey questions we gave it, one indicator at a time. The prompt gets a little bigger; the model can get a lot smaller. A tightly scoped, single-verdict task is what a focused small model does reliably, which is why this whole pipeline fits comfortably under the hackathon's parameter cap.
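
Handing the methodology to the model comes down to assembling the prompt from the catalog entry. The sketch below assumes a catalog entry is a dictionary with `name`, `definition`, and `common_errors` keys. Those keys, the `build_map_prompt` helper, and the paraphrased definition text are illustrative, not the catalog's actual wording.

```python
def build_map_prompt(indicator, candidate_questions):
    """Build a single-indicator, single-verdict prompt from supplied text only."""
    lines = [
        f"Indicator: {indicator['name']}",
        f"Definition: {indicator['definition']}",
        "Known implementation errors:",
    ]
    lines += [f"- {error}" for error in indicator["common_errors"]]
    lines.append("Candidate survey questions:")
    lines += [f"- {q['name']}: {q['label']}" for q in candidate_questions]
    lines.append("Answer with exactly one verdict: Measurable, Proxy, or Not measurable.")
    return "\n".join(lines)


# Example entry; the definition is a paraphrase written for this sketch.
fcs = {
    "name": "Food Consumption Score (FCS)",
    "definition": "Score built from how often the household ate each food group over a 7-day recall.",
    "common_errors": ["Reported although the survey has no dietary-recall question."],
}
questions = [{"name": "meals_yesterday", "label": "How many meals did the household eat yesterday?"}]
prompt = build_map_prompt(fcs, questions)
```

Everything the model needs to judge is in the prompt, so the call can be repeated per indicator without the model carrying any context between them.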

---

## Running it

> ⏳ **~5 min cold start.** Inference runs on Modal's on-demand GPUs, so the first run after the Space has been idle takes about 5 minutes to spin up. Runs after that are fast.

**Two input paths:**
- *Type a question* (e.g. "what does the food security data tell us?") → the model routes it. Requires the inference endpoint to be configured (see Secrets below).
- *Pick framework chips* → no model call at routing, instant. Good for a quick demo.

### Secrets

Set these in **Space settings → Secrets** (never commit them):

| Secret | Value |
|---|---|
| `MODAL_INFERENCE_URL` | Inference endpoint URL |
| `MODAL_API_KEY` | Inference API key |
| `MODAL_MODEL_ID` | Model served (e.g. `Qwen/Qwen2.5-32B-Instruct`) |
| `LANGSMITH_API_KEY` | LangSmith API key (tracing) |
| `LANGSMITH_ENDPOINT` | `https://eu.smith.langchain.com` |
| `HF_TOKEN` | HF token, read access |
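
Space secrets are exposed to the app as environment variables, so a start-up check can fail fast with a readable message when one is missing. This is a sketch, not the app's actual code: `load_inference_config` is a hypothetical helper, and treating exactly these three secrets as required for the typed-question path is an assumption.

```python
import os

# Assumed to be the minimum needed to call the inference endpoint.
REQUIRED = ("MODAL_INFERENCE_URL", "MODAL_API_KEY", "MODAL_MODEL_ID")


def load_inference_config(env=os.environ):
    """Return the inference settings, or raise naming every missing secret."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing secrets: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}
```

Taking the mapping as a parameter keeps the check easy to exercise locally with a plain dictionary.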

### Local

```bash
uv sync
uv run python app.py     # launches Gradio on :7860
```

---

## Scope

**In:** upload a Kobo XLSForm, route to sectors, select indicators, map them to the survey's questions, produce the reviewable data-analysis plan — verdict per indicator, errors pre-empted, claim boundaries.

**Out, on purpose:** the agent analysis itself (running numbers against the approved plan), charts, multi-turn chat. The plan is the deliverable — we stop at the human review gate.

## The pattern travels

Nothing here is humanitarian-specific by construction. Swap the catalog and the framework and the same shape holds anywhere there's a governed vocabulary of metrics and a cost to getting them wrong — translate a question into the vocabulary, map it to the data you actually have, write a reviewable plan, then let an agent execute. Humanitarian needs assessment is just where the consequences are sharpest.

---

## Social

- [Blog](https://huggingface.co/blog/build-small-hackathon/agentic-humanitarian-data-analyst)
- [LinkedIn](https://www.linkedin.com/posts/yannsay_github-yannsayhumanitarian-data-analyst-share-7472010240872443905-vlEr)
- [YouTube](https://youtu.be/q2qjPJakLGk)

---

## Stack & credits

Gradio · a small open-weight model (< 32B) behind an OpenAI-compatible endpoint · LangSmith tracing · the [`humanitarian-data-analyst`](https://github.com/yannsay/humanitarian-data-analyst) skill (catalog, ontology, scripts) as source of truth — the app is a thin front end over it. Indicator definitions adapted from JMP, the Global Food Security Cluster, Sphere, and CCCM/CAMP standards.

Part of a longer project on bringing a semantic layer and spec-driven development to AI analysis. Built small, governed, and reviewable.