================================================================================
PROJECT CONTEXT — sahel-agri-voice
Generated: 2026-04-17
================================================================================

PROJECT NAME
------------
Sahel-Voice-Lab / Sahel-Agri Voice AI
(HuggingFace Space title: "Sahel-Voice-Lab", Phase 1: "The Memory Loop")

PURPOSE
-------
A voice-first, self-learning AI assistant for two West African languages —
Bambara (bam, spoken in Mali) and Fula/Pular (ful, spoken in Guinea and
Senegal) — targeted at farmers in the Sahel region.

The system has two complementary capabilities:

  1. LANGUAGE-LEARNING MEMORY LOOP (Phase 1)
     The assistant behaves like an "eager child learner." Users teach it
     Bambara/Fula words ("I ni ce means hello") via voice or text; an LLM
     detects the teaching intent and the word pair is persisted to a
     HuggingFace Hub dataset (ous-sow/sahel-agri-feedback → vocabulary.jsonl)
     so knowledge accumulates across sessions and users. The vocabulary is
     then injected into the LLM's system prompt as its source of truth for
     answering questions.

  2. AGRICULTURAL IoT VOICE INTERFACE
     Farmers speak questions in their own language ("how is the soil?",
     "is it going to rain?"). Whisper transcribes, an intent parser keyword-
     matches Bambara/Fula agricultural terms (soil, rain, irrigation, pest),
     a sensor bridge fetches data from an IoT backend (or mock data), and
     VoiceResponder + a TTS engine reply in short Bambara/Fula sentences
     with alert thresholds (e.g. "Bunding ji dɔgɔ. I ka foro ji." =
     "Soil moisture is low. Irrigate your field.").

The project is deployed as a HuggingFace Space (Gradio frontend) with an
optional FastAPI service. Its core stack (Whisper / Qwen / F5-TTS / VITS) is
explicitly "100% non-Meta"; the one exception listed below is the
facebook/mms-tts-* family, used for Phase 1 TTS and as the fallback when
F5-TTS is unavailable.

FULL TECH STACK
---------------
Deployment / hosting
  - HuggingFace Spaces           (Gradio SDK 5.25.0, hardware: cpu-basic)
  - Kaggle notebooks (T4 GPU)    for training runs
  - RunPod                       alternative training environment
  - HF Hub datasets              as persistent vocabulary + feedback store

Frontend
  - Gradio 5.25.0   (app.py — main UI; app_lab.py — experimental lab UI)

Backend API
  - FastAPI           (src/api/app.py via create_app() + lifespan)
  - Pydantic v2       (schemas)
  - httpx             (async calls to IoT sensor backend)

Speech-to-text (STT)
  - openai/whisper-large-v3-turbo  (default backbone)
  - transformers 5.5.0 (WhisperForConditionalGeneration, WhisperProcessor)
  - PEFT (LoRA adapters, hot-swappable per language)
  - accelerate 1.13.0
  - librosa 0.10.2, soundfile 0.12.1, torchaudio

LLM (reasoning / teaching-intent detection)
  - Qwen/Qwen2.5-72B-Instruct    (default, via HF Serverless Inference)
  - Qwen/Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, Zephyr-7b-beta
    as faster alternatives
  - huggingface-hub 1.9.0 InferenceClient

Text-to-speech (TTS)
  - Phase 1: facebook/mms-tts-bam, mms-tts-ful, mms-tts-fra, mms-tts-eng
  - Phase 2: ynnov/ekodi-bambara-tts-female (VITS)
             + placeholder ous-sow/fula-tts
  - F5-TTS (SWivid/F5-TTS) for GPU voice cloning (optional, ~2GB)
  - OpenVoice V2 (myshell-ai/openvoice-v2) for tone-color conversion
  - SpeechBrain ECAPA-TDNN for speaker identification (per-user profiles)

Data / datasets
  - google/fleurs (bam_ML, ff_SN) as STT training corpus
  - Text harvested by src/data/web_harvester.py from RobotsMali/jeli-asr,
    google/fleurs (ff_SN), and Wikipedia (bm, ff)
  - datasets 4.8.4 (+ torchcodec for 4.x audio decoding)
  - Adlam ↔ Latin transliteration for Guinea Pular

Training / fine-tuning
  - PEFT LoRA + Seq2SeqTrainer
  - jiwer 3.0.4 (WER / CER metrics)
  - Custom callbacks: EarlyStoppingOnWER, AdapterCheckpointCallback
  - FieldNoiseAugmenter (tractor / wind / livestock noise mixing)
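
  The noise mixing behind FieldNoiseAugmenter can be sketched as follows.
  This is a minimal illustration, not the project's code: mixing at a target
  signal-to-noise ratio, the function names, and the 10 dB default are
  assumptions. Only the Gaussian fallback when no noise clip is available
  comes from this document.

```python
import math
import random
from typing import List, Optional, Sequence


def rms(signal: Sequence[float]) -> float:
    """Root-mean-square level of a signal (0.0 for an empty one)."""
    if not signal:
        return 0.0
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_snr(
    speech: Sequence[float],
    noise: Optional[Sequence[float]] = None,
    snr_db: float = 10.0,  # assumed default, not taken from the project
    seed: int = 0,
) -> List[float]:
    """Add noise to speech so the result has the requested SNR in dB."""
    if not noise:
        # Fallback described in this document: Gaussian noise.
        rng = random.Random(seed)
        noise = [rng.gauss(0.0, 1.0) for _ in speech]
    # Loop the noise clip so it covers the whole utterance.
    looped = [noise[i % len(noise)] for i in range(len(speech))]
    speech_rms, noise_rms = rms(speech), rms(looped)
    if speech_rms == 0.0 or noise_rms == 0.0:
        return list(speech)  # nothing meaningful to scale
    # SNR_dB = 20 * log10(speech_rms / scaled_noise_rms), solved for the gain.
    gain = speech_rms / (noise_rms * 10 ** (snr_db / 20.0))
    return [s + gain * n for s, n in zip(speech, looped)]
```

  A lower snr_db gives a noisier training sample; a real augmenter would
  likely draw the SNR and the noise clip at random per example.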

Optimization / edge deploy
  - optimum[onnxruntime] → per-language ONNX export
  - onnx-tf / TensorFlow → TFLite for Android
  - bitsandbytes NF4 / 8-bit quantization (training environments)

Utilities / runtime
  - PyYAML 6.0.2, python-dotenv 1.1.0
  - NumPy 2.2.4, SciPy 1.15.2
  - rapidfuzz 3.13.0 (fuzzy phrase matching)
  - pypdf, python-docx (Knowledge Base upload → vocabulary.jsonl)
  - Kaggle API (Self-Teaching tab triggers training runs)
  - ffmpeg (packages.txt — sole system-level dep)

Environment variables
  HF_TOKEN, FEEDBACK_REPO_ID (ous-sow/sahel-agri-feedback),
  LLM_MODEL_ID, BAMBARA_ADAPTER_PATH, FULA_ADAPTER_PATH,
  SENSOR_API_URL, BAMBARA_TTS_REPO, FULA_TTS_REPO, DEVICE, LOG_LEVEL
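
  A minimal sketch of how these variables could be read at startup. The
  defaults for FEEDBACK_REPO_ID and LLM_MODEL_ID are taken from this
  document; the Settings class, load_settings(), and the DEVICE and
  LOG_LEVEL defaults are illustrative assumptions, not the project's loader.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for a subset of the variables listed above."""
    hf_token: Optional[str]
    feedback_repo_id: str
    llm_model_id: str
    sensor_api_url: Optional[str]
    device: str
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping (the process environment by default)."""
    return Settings(
        hf_token=env.get("HF_TOKEN"),
        feedback_repo_id=env.get("FEEDBACK_REPO_ID", "ous-sow/sahel-agri-feedback"),
        llm_model_id=env.get("LLM_MODEL_ID", "Qwen/Qwen2.5-72B-Instruct"),
        # Empty or unset URL becomes None so a caller can switch to mock data.
        sensor_api_url=env.get("SENSOR_API_URL") or None,
        device=env.get("DEVICE", "cpu"),                   # assumed default
        log_level=env.get("LOG_LEVEL", "INFO").upper(),    # assumed default
    )
```

  Passing a plain dict instead of os.environ keeps the loader easy to test.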

KEY SOURCE FILES AND WHAT THEY DO
---------------------------------
Top-level entry points
  app.py
    Gradio UI (~99 KB). Main user-facing application running on the HF Space.
    Wires STT → LLM → memory → TTS, exposes the Conversation / Teaching /
    Knowledge Base / Self-Teaching tabs.
  app_lab.py
    Experimental/lab Gradio UI used to prototype new features
    (e.g. CuriosityEngine integration) before folding into app.py.
  setup.sh
    Shell bootstrap for local + RunPod environments.

src/api/  — FastAPI service (alternative to Gradio-only deploy)
  app.py          FastAPI factory with async lifespan: loads Whisper backbone
                  once, registers bam/ful adapters, pre-loads 'bam', attaches
                  Transcriber + SensorBridge to app.state.
  dependencies.py FastAPI DI helpers to pull shared objects off app.state.
  middleware.py   CORS / logging middleware registration.
  schemas.py      Pydantic v2 request/response models.
  routes/health.py    GET /health — model status + loaded adapters.
  routes/transcribe.py POST /transcribe — audio → text, 10 MB cap,
                       wav/mp3/ogg/m4a/flac/webm.
  routes/iot.py   POST /query — full pipeline: audio → transcribe → intent
                   → sensor → voice response (IoTQueryResponse).

src/engine/  — STT core
  whisper_base.py     Singleton loader for WhisperForConditionalGeneration +
                      WhisperProcessor. FP16 on CUDA, FP32 on CPU. free()
                      releases VRAM.
  adapter_manager.py  Hot-swap LoRA adapters via PEFT's multi-adapter API:
                      first load ~2s, subsequent set_adapter ~50ms.
                      Keeps one backbone in VRAM and swaps ~50MB adapters.
  transcriber.py      Public inference API. Transcribes audio of ≤30s directly
                      and slices longer audio into 30s windows. Returns
                      TranscriptionResult (text, language, duration_s,
                      processing_time_ms, confidence).
  stt_processor.py    Extracts an avg_logprob confidence score. A score below
                      the -1.0 threshold is treated as "confused" and the
                      caller should ask the user to repeat.
  curiosity.py        CuriosityEngine — every N interactions, prompts the
                      LLM to spot a vocabulary gap and ask the user how to
                      say a missing agricultural term.
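
  Two of the rules above, the 30s windowing in transcriber.py and the -1.0
  confidence threshold in stt_processor.py, can be sketched in a few lines.
  The function names are hypothetical, non-overlapping windows are an
  assumption, and so is treating only scores strictly below the threshold
  as "confused".

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000       # Whisper models expect 16 kHz audio
WINDOW_SECONDS = 30        # window length named in this document
CONFUSED_THRESHOLD = -1.0  # avg_logprob threshold named in this document


def split_into_windows(
    samples: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    window_s: int = WINDOW_SECONDS,
) -> List[Sequence[float]]:
    """Slice audio into consecutive windows; the last one may be shorter."""
    step = sample_rate * window_s
    return [samples[i:i + step] for i in range(0, len(samples), step)]


def is_confused(avg_logprob: float, threshold: float = CONFUSED_THRESHOLD) -> bool:
    """True when the decoder's average log-probability is below the threshold."""
    return avg_logprob < threshold
```

  Audio of 30s or less yields a single window, so the same code path covers
  both the short and the long case.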

src/llm/
  gemma_client.py     Wraps HF Serverless InferenceClient; despite the
                      filename, the default model is
                      Qwen/Qwen2.5-72B-Instruct. Implements the "adult-child"
                      system prompt that returns structured JSON with
                      intent ∈ {teaching, question, conversation, error}.
                      Parses JSON out of optional markdown fences.
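
  Parsing the LLM reply, as described above, might look like the sketch
  below. parse_llm_reply() and the "detail" field are hypothetical; only the
  four intent values and the optional markdown fences come from this
  document.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)
VALID_INTENTS = {"teaching", "question", "conversation", "error"}


def parse_llm_reply(raw: str) -> dict:
    """Extract the JSON object from a reply, with or without code fences."""
    match = _FENCED.search(raw)
    payload = match.group(1) if match else raw.strip()
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return {"intent": "error", "detail": "reply was not valid JSON"}
    if not isinstance(data, dict) or data.get("intent") not in VALID_INTENTS:
        return {"intent": "error", "detail": "missing or unknown intent"}
    return data
```

  Mapping every malformed reply to the "error" intent means callers only
  ever branch on the four documented values.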

src/memory/
  memory_manager.py   Thread-safe vocabulary store. Persists to
                      data/vocabulary.jsonl locally and pushes asynchronously
                      to HF Hub dataset. Provides get_recent() and a
                      formatted get_vocabulary_context() for the LLM prompt.
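
  A minimal sketch of a thread-safe JSONL vocabulary store in the spirit of
  memory_manager.py. The class name, the record fields (word, meaning,
  language), and the prompt wording are assumptions; the asynchronous push
  to the HF Hub dataset is left out.

```python
import json
import threading
from pathlib import Path
from typing import Dict, List


class VocabularyStore:
    """Append-only word store backed by a local JSONL file."""

    def __init__(self, path: str) -> None:
        self._path = Path(path)
        self._lock = threading.Lock()
        self._entries: List[Dict[str, str]] = []
        if self._path.exists():
            with self._path.open(encoding="utf-8") as fh:
                self._entries = [json.loads(line) for line in fh if line.strip()]

    def add(self, word: str, meaning: str, language: str) -> Dict[str, str]:
        entry = {"word": word, "meaning": meaning, "language": language}
        with self._lock:  # one writer at a time, in memory and on disk
            self._entries.append(entry)
            with self._path.open("a", encoding="utf-8") as fh:
                fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
        return entry

    def get_recent(self, n: int = 5) -> List[Dict[str, str]]:
        with self._lock:
            return list(self._entries[-n:]) if n > 0 else []

    def get_vocabulary_context(self) -> str:
        """Format the vocabulary as a block of text for the system prompt."""
        with self._lock:
            lines = [
                f'- "{e["word"]}" ({e["language"]}) means "{e["meaning"]}"'
                for e in self._entries
            ]
        return "Known vocabulary:\n" + "\n".join(lines)
```

  Writing with ensure_ascii=False keeps characters such as ɔ and ɲ readable
  in the file instead of escaping them.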

src/conversation/
  phrase_matcher.py   RapidFuzz-based matcher over curated JSON phrase
                      libraries (data/phrases/{lang}.json + _additions.json).
                      Handles greetings / thanks / farewells without hitting
                      the LLM.
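
  The idea behind phrase_matcher.py can be illustrated with the standard
  library. Note the swap: this sketch uses difflib.SequenceMatcher as a
  stand-in for RapidFuzz so that it runs without extra packages. The phrase
  table, the match_phrase() name, and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical phrase library; the project loads its own from JSON files.
PHRASES = {
    "greeting": ["i ni ce", "hello"],
    "thanks": ["thank you"],
    "farewell": ["goodbye"],
}


def match_phrase(text: str, threshold: float = 0.8) -> Optional[str]:
    """Return the best-matching phrase category, or None if nothing is close."""
    query = " ".join(text.lower().split())  # lower-case, collapse whitespace
    best_category, best_score = None, 0.0
    for category, phrases in PHRASES.items():
        for phrase in phrases:
            score = SequenceMatcher(None, query, phrase).ratio()
            if score > best_score:
                best_category, best_score = category, score
    return best_category if best_score >= threshold else None
```

  A None result is the signal to hand the utterance on to the LLM, which
  keeps common greetings off the slower inference path.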

src/iot/
  intent_parser.py    Keyword-based Intent classifier
                      (greeting/thanks/farewell/check_soil/check_weather/
                      irrigation_status/pest_alert) for bam, ful, fr, en.
                      Confidence = matched_keywords / total_keywords.
  sensor_bridge.py    Async bridge to an IoT backend (SENSOR_API_URL) for
                      soil / weather / irrigation / pest readings.
                      Falls back to mock random data.
  voice_responder.py  Maps (Intent, SensorData) → short Bambara/Fula reply
                      string (≤6 words per sentence for clean MMS-TTS) plus
                      English translation. Alert thresholds encoded here
                      (SOIL_MOISTURE_LOW=30, PH bounds, TEMP_HIGH=38, etc.).
                      Also has a verbose French-language path.
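
  The keyword confidence and the low-moisture alert described above can be
  sketched together. Assumptions: the English keyword lists are invented for
  illustration, "total_keywords" is read as the number of keywords defined
  for the intent, and the moisture value is taken to be a percentage. The
  Bambara reply and its translation are quoted from this document.

```python
from typing import Dict, List, Optional, Tuple

SOIL_MOISTURE_LOW = 30  # threshold from voice_responder.py; unit assumed to be %

# Hypothetical English keyword lists; the real parser also covers bam, ful, fr.
KEYWORDS: Dict[str, List[str]] = {
    "check_soil": ["soil", "moisture", "ground"],
    "check_weather": ["rain", "weather", "cloud"],
}


def parse_intent(text: str) -> Tuple[Optional[str], float]:
    """Return (intent, confidence) with confidence = matched / total keywords."""
    tokens = set(text.lower().replace("?", " ").split())
    best_intent, best_conf = None, 0.0
    for intent, keywords in KEYWORDS.items():
        matched = sum(1 for kw in keywords if kw in tokens)
        conf = matched / len(keywords)
        if conf > best_conf:
            best_intent, best_conf = intent, conf
    return best_intent, best_conf


def soil_alert(moisture: float) -> Optional[Tuple[str, str]]:
    """Return (Bambara reply, English translation) when moisture is low."""
    if moisture < SOIL_MOISTURE_LOW:
        return ("Bunding ji dɔgɔ. I ka foro ji.",
                "Soil moisture is low. Irrigate your field.")
    return None  # no alert; the non-alert wording is not given in this document
```

  With this reading of the formula, an utterance that hits one of three
  keywords scores 1/3, so any confidence cut-off has to be chosen with the
  size of each keyword list in mind.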

src/data/
  agri_dictionary.py  Bambara + Fula domain vocab used to bias the Whisper
                      decoder prompt toward agricultural terms.
  waxal_loader.py     Streams google/fleurs (bam_ML, ff_SN) — the
                      replacement for the retired google/waxal dataset.
  feature_extractor.py Log-mel spectrogram extraction and batched padding
                       collator for Whisper Seq2SeqTrainer.
  augmentation.py     FieldNoiseAugmenter — mixes clean speech with
                      tractor/wind/livestock samples; falls back to
                      Gaussian noise.
  bam_normalize.py    Bambara phonetic normalizer (ou→u, gn/ny→ɲ,
                      N'Ko-derived standard).
  adlam.py            Adlam (𞤀𞤣𞤤𞤢𞤥) ↔ Latin transliteration for Pular;
                      normalize_pular() for ASR preprocessing.
  web_harvester.py    Harvests RobotsMali/jeli-asr, google/fleurs ff_SN,
                      and bm/ff Wikipedia into the feedback Hub dataset.
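
  The two rewrite rules listed for bam_normalize.py (ou→u, gn/ny→ɲ) can be
  applied with ordinary regular expressions. This sketch covers only those
  two rules and only lower-case digraphs; the function name and the rule
  table layout are assumptions, and the real normalizer likely does more.

```python
import re

# Ordered rewrite rules taken from the description above.
_RULES = [
    (re.compile(r"ou"), "u"),      # French-style "ou" spelling -> "u"
    (re.compile(r"gn|ny"), "ɲ"),   # both digraph spellings -> palatal nasal
]


def normalize_bambara(text: str) -> str:
    """Apply each rewrite rule in order and return the normalized text."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text
```

  Normalizing references and hypotheses the same way before scoring avoids
  counting pure spelling variants as recognition errors.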

src/training/
  trainer.py          WhisperLoRATrainer: end-to-end LoRA fine-tuning
                      orchestration (backbone + LoraConfig + WaxalDataLoader +
                      Seq2SeqTrainer).
  metrics.py          WER/CER for Seq2SeqTrainer eval loop (via jiwer).
  callbacks.py        EarlyStoppingOnWER, AdapterCheckpointCallback
                      (saves adapter-only, not full model).

src/tts/
  waxal_tts.py        VITS engine wrapping ynnov/ekodi-bambara-tts-female
                      for Bambara; Fula is a placeholder until
                      ous-sow/fula-tts is trained.
  mms_tts.py          Facebook MMS-TTS (bam/ful/fra/eng).
  f5_tts.py           F5-TTS voice cloning (optional, GPU-only, ~750MB);
                      gracefully falls back to MMS when missing.
  voice_cloner.py     OpenVoice V2 tone-color converter — reshapes VITS
                      audio to a target speaker's voice.

src/voice/
  speaker_profiles.py SpeakerProfileManager with SpeechBrain ECAPA-TDNN
                      (192-d embeddings). Per-user running-average embeddings
                      for identification + OpenVoice SE for cloning; cosine
                      similarity ≥ 0.75 attributes to an existing user.
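
  Speaker attribution as described above reduces to a running mean per user
  plus a cosine-similarity test. The sketch uses plain lists in place of the
  192-d ECAPA-TDNN embeddings; the class and method names are hypothetical,
  and only the 0.75 threshold and the running-average idea come from this
  document.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.75  # threshold named in this document


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SpeakerProfiles:
    """Keeps one running-average embedding per known user."""

    def __init__(self) -> None:
        self._profiles: Dict[str, Tuple[List[float], int]] = {}

    def enroll(self, user_id: str, embedding: Sequence[float]) -> None:
        mean, count = self._profiles.get(user_id, ([0.0] * len(embedding), 0))
        count += 1
        # Incremental mean: new_mean = mean + (x - mean) / count
        mean = [m + (e - m) / count for m, e in zip(mean, embedding)]
        self._profiles[user_id] = (mean, count)

    def identify(self, embedding: Sequence[float]) -> Optional[str]:
        best_id, best_sim = None, 0.0
        for user_id, (mean, _) in self._profiles.items():
            sim = cosine(mean, embedding)
            if sim > best_sim:
                best_id, best_sim = user_id, sim
        return best_id if best_sim >= SIMILARITY_THRESHOLD else None
```

  When identify() returns None the caller can enroll the embedding under a
  new user id, so unknown voices start their own profile.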

src/optimization/
  onnx_exporter.py    Merges LoRA into backbone and exports per-language
                      ONNX (ONNX can't hot-swap adapters at runtime).
  quantizer.py        BitsAndBytes NF4 / 8-bit quantization for GPU-
                      constrained deploys (turbo ~3GB → ~1GB VRAM).
  tflite_converter.py ONNX → TFLite for offline Android; exports encoder
                      and decoder separately.

Config / data folders
  configs/            base_config.yaml + per-language LoRA configs.
  data/               vocabulary.jsonl, phrases/*.json, profiles/, etc.
  notebooks/          Kaggle / RunPod fine-tune + TTS training notebooks.
  noise_samples/      .wav clips for field-noise augmentation.
  scripts/            utility scripts (bootstrap, harvest, eval).
  tests/              pytest suite (not installed in HF Spaces runtime).

RECENT GIT COMMITS SUMMARY (last 20)
------------------------------------
The recent history is focused on three concurrent tracks:

1. STT / training stability
   - bb78cbf Add torchcodec install for datasets 4.x audio decoding
   - 9049ef3 Prepare training stack for RunPod: env-aware notebook +
             bootstrap script
   - cc50efb Align Whisper default to turbo-v3 + add document upload to
             Knowledge Base tab
   - c33a061 Fix WhisperProcessor import in reload + upgrade base to
             large-v3-turbo
   - 7fae91b Fix mel-bin mismatch: load per-language processor from
             fine-tuned checkpoint
   - 6682858 Fix jiwer crash on post-normalisation empty refs;
             register SLR106/105 datasets
   - 58f431a Fix SyntaxError in Cell 17: unterminated f-string literal
   - 3632a23 Fix compute_metrics crash on empty eval references
             in Fula training
   - 71bb3bc Fix: add trust_remote_code=True for datasets 3.x compatibility
   - cd017e2 Fix Cell 16 ValueError: load model fp32 so AMP gradient scaler
             works

2. Language support / Adlam / Pular expansion
   - ced078c Add Adlam/Pular Fula integration: transliterator +
             3 new datasets + normalisation pipeline
   - 40cf84d Fix language mixing: per-language prompts +
             Mali Bambara / Guinea Pular context
   - 33c3a5a Fix Self-Teaching language detection: parse code from
             dropdown label
   - 24b1617 Fix Self-Teaching tab: float sliders, deduplication,
             Kaggle API fallback

3. Conversation / voice pipeline
   - 8952fff Phase 3: Voice-to-Voice S2S pipeline —
             F5-TTS, LLM brain, CER metric
   - ad902c6 Add real conversational memory + live learning to
             Conversation Mode
   - 8d7d9d8 Fix conversation mode timeout: two-stage pipeline + faster LLM
   - 1958814 Fix "Model loading" stuck state: block in _do_asr until
             Whisper is ready
   - 618eab5 Fix model loading stuck forever + unhandled TTS crash in
             conversation mode
   - bfe5b59 Fix slow build: strip runtime-irrelevant heavy packages from
             requirements.txt

Overall trajectory: the project has moved past initial Phase 1 scaffolding
and is iterating hard on (a) stabilising fine-tuning on Kaggle/RunPod with
large-v3-turbo, (b) expanding to Guinea Pular with the native Adlam script,
and (c) finishing the Phase 3 voice-to-voice pipeline (F5-TTS + LLM brain).
Most recent commits are bug fixes rather than net-new features, suggesting
the current codebase is approaching a stable milestone.

================================================================================