================================================================================
PROJECT CONTEXT: sahel-agri-voice
Generated: 2026-04-17
================================================================================

PROJECT NAME
------------
Sahel-Voice-Lab / Sahel-Agri Voice AI
(HuggingFace Space title: "Sahel-Voice-Lab", Phase 1: "The Memory Loop")

PURPOSE
-------
A voice-first, self-learning AI assistant for two West African languages,
Bambara (bam, spoken in Mali) and Fula/Pular (ful, spoken in Guinea and
Senegal), targeted at farmers in the Sahel region.

The system has two complementary capabilities:

  1. LANGUAGE-LEARNING MEMORY LOOP (Phase 1)
     The assistant behaves like an "eager child learner." Users teach it
     Bambara/Fula words ("I ni ce means hello") via voice or text; an LLM
     detects the teaching intent and the word pair is persisted to a
     HuggingFace Hub dataset (ous-sow/sahel-agri-feedback → vocabulary.jsonl)
     so knowledge accumulates across sessions and users. The vocabulary is
     then injected into the LLM's system prompt as its source of truth for
     answering questions.

  2. AGRICULTURAL IoT VOICE INTERFACE
     Farmers speak questions in their own language ("how is the soil?",
     "is it going to rain?"). Whisper transcribes, an intent parser
     keyword-matches Bambara/Fula agricultural terms (soil, rain, irrigation,
     pest), a sensor bridge fetches data from an IoT backend (or mock data),
     and VoiceResponder + a TTS engine reply in short Bambara/Fula sentences
     with alert thresholds (e.g. "Bunding ji dɔgɔ. I ka foro ji." = "Soil
     moisture is low. Irrigate your field.").

The project is deployed as a HuggingFace Space (Gradio frontend) with an
optional FastAPI service. The system is explicitly "100% non-Meta" for its
core stack (Whisper / Qwen / F5-TTS / VITS), avoiding Meta models for the
main loop.
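
The memory loop described under PURPOSE (persist a taught word pair to
vocabulary.jsonl, then feed the stored vocabulary back into the LLM system
prompt) can be illustrated with a minimal local sketch. The record fields
(word, meaning, lang), the function names, and the prompt wording are all
assumptions made for illustration; the actual schema and the Hub push are
not shown in this summary.

```python
import json
import tempfile
from pathlib import Path


def append_pair(path: Path, word: str, meaning: str, lang: str) -> None:
    """Append one taught word pair as a JSON line (hypothetical schema)."""
    record = {"word": word, "meaning": meaning, "lang": lang}
    with path.open("a", encoding="utf-8") as fh:
        # ensure_ascii=False keeps characters such as ɔ and ɲ readable on disk.
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_pairs(path: Path) -> list[dict]:
    """Read every stored pair; an absent file means an empty vocabulary."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def build_vocabulary_context(pairs: list[dict], limit: int = 50) -> str:
    """Format the most recent pairs as a block for the LLM system prompt."""
    recent = pairs[-limit:]
    lines = [f"- [{p['lang']}] {p['word']} = {p['meaning']}" for p in recent]
    return "Known vocabulary (source of truth):\n" + "\n".join(lines)


def demo() -> str:
    """Teach one word, reload the store, and build the prompt block."""
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / "vocabulary.jsonl"
        append_pair(store, "I ni ce", "hello", "bam")
        return build_vocabulary_context(load_pairs(store))


if __name__ == "__main__":
    print(demo())
```

Appending one JSON object per line keeps writes cheap and makes the file easy
to merge across sessions, which is presumably why a .jsonl store was chosen.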

FULL TECH STACK
---------------
Deployment / hosting
  - HuggingFace Spaces (Gradio SDK 5.25.0, hardware: cpu-basic)
  - Kaggle notebooks (T4 GPU) for training runs
  - RunPod alternative training environment
  - HF Hub datasets as persistent vocabulary + feedback store

Frontend
  - Gradio 5.25.0 (app.py: main UI; app_lab.py: experimental lab UI)

Backend API
  - FastAPI (src/api/app.py via create_app() + lifespan)
  - Pydantic v2 (schemas)
  - httpx (async calls to IoT sensor backend)

Speech-to-text (STT)
  - openai/whisper-large-v3-turbo (default backbone)
  - transformers 5.5.0 (WhisperForConditionalGeneration, WhisperProcessor)
  - PEFT (LoRA adapters, hot-swappable per language)
  - accelerate 1.13.0
  - librosa 0.10.2, soundfile 0.12.1, torchaudio

LLM (reasoning / teaching-intent detection)
  - Qwen/Qwen2.5-72B-Instruct (default, via HF Serverless Inference)
  - Qwen/Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, Zephyr-7b-beta
    as faster alternatives
  - huggingface-hub 1.9.0 InferenceClient

Text-to-speech (TTS)
  - Phase 1: facebook/mms-tts-bam, mms-tts-ful, mms-tts-fra, mms-tts-eng
  - Phase 2: ynnov/ekodi-bambara-tts-female (VITS) + placeholder
    ous-sow/fula-tts
  - F5-TTS (SWivid/F5-TTS) for GPU voice cloning (optional, ~2GB)
  - OpenVoice V2 (myshell-ai/openvoice-v2) for tone-color conversion
  - SpeechBrain ECAPA-TDNN for speaker identification (per-user profiles)

Data / datasets
  - google/fleurs (bam_ML, ff_SN) as STT training corpus
  - RobotsMali/jeli-asr, google/fleurs Fula, Wikipedia (bm, ff) harvested
    text via src/data/web_harvester.py
  - datasets 4.8.4 (+ torchcodec for 4.x audio decoding)
  - Adlam ↔ Latin transliteration for Guinea Pular

Training / fine-tuning
  - PEFT LoRA + Seq2SeqTrainer
  - jiwer 3.0.4 (WER / CER metrics)
  - Custom callbacks: EarlyStoppingOnWER, AdapterCheckpointCallback
  - FieldNoiseAugmenter (tractor / wind / livestock noise mixing)

Optimization / edge deploy
  - optimum[onnxruntime] → per-language ONNX export
  - onnx-tf / TensorFlow → TFLite for Android
  - bitsandbytes NF4 /
    8-bit quantization (training environments)

Utilities / runtime
  - PyYAML 6.0.2, python-dotenv 1.1.0
  - NumPy 2.2.4, SciPy 1.15.2
  - rapidfuzz 3.13.0 (fuzzy phrase matching)
  - pypdf, python-docx (Knowledge Base upload → vocabulary.jsonl)
  - Kaggle API (Self-Teaching tab triggers training runs)
  - ffmpeg (packages.txt; sole system-level dep)

Environment variables
  HF_TOKEN, FEEDBACK_REPO_ID (ous-sow/sahel-agri-feedback), LLM_MODEL_ID,
  BAMBARA_ADAPTER_PATH, FULA_ADAPTER_PATH, SENSOR_API_URL, BAMBARA_TTS_REPO,
  FULA_TTS_REPO, DEVICE, LOG_LEVEL

KEY SOURCE FILES AND WHAT THEY DO
---------------------------------
Top-level entry points
  app.py                Gradio UI (~99 KB). Main user-facing application
                        running on the HF Space. Wires STT → LLM → memory →
                        TTS, exposes the Conversation / Teaching / Knowledge
                        Base / Self-Teaching tabs.
  app_lab.py            Experimental/lab Gradio UI used to prototype new
                        features (e.g. CuriosityEngine integration) before
                        folding into app.py.
  setup.sh              Shell bootstrap for local + RunPod environments.

src/api/ (FastAPI service, alternative to Gradio-only deploy)
  app.py                FastAPI factory with async lifespan: loads Whisper
                        backbone once, registers bam/ful adapters, pre-loads
                        'bam', attaches Transcriber + SensorBridge to
                        app.state.
  dependencies.py       FastAPI DI helpers to pull shared objects off
                        app.state.
  middleware.py         CORS / logging middleware registration.
  schemas.py            Pydantic v2 request/response models.
  routes/health.py      GET /health: model status + loaded adapters.
  routes/transcribe.py  POST /transcribe: audio → text, 10 MB cap,
                        wav/mp3/ogg/m4a/flac/webm.
  routes/iot.py         POST /query: full pipeline, audio → transcribe →
                        intent → sensor → voice response (IoTQueryResponse).

src/engine/ (STT core)
  whisper_base.py       Singleton loader for WhisperForConditionalGeneration
                        + WhisperProcessor. FP16 on CUDA, FP32 on CPU.
                        free() releases VRAM.
  adapter_manager.py    Hot-swap LoRA adapters via PEFT's multi-adapter API:
                        first load ~2s, subsequent set_adapter ~50ms.
                        Keeps one backbone in VRAM and swaps ~50MB adapters.
  transcriber.py        Public inference API. Handles ≤30s chunks directly,
                        >30s by slicing into 30s windows. Returns
                        TranscriptionResult (text, language, duration_s,
                        processing_time_ms, confidence).
  stt_processor.py      avg_logprob confidence extractor; threshold -1.0 =
                        "confused", caller should ask user to repeat.
  curiosity.py          CuriosityEngine: every N interactions, prompts the
                        LLM to spot a vocabulary gap and ask the user how to
                        say a missing agricultural term.

src/llm/
  gemma_client.py       Wraps HF Serverless InferenceClient. Implements the
                        "adult-child" system prompt that returns structured
                        JSON with intent ∈ {teaching, question,
                        conversation, error}. Parses JSON out of optional
                        markdown fences.

src/memory/
  memory_manager.py     Thread-safe vocabulary store. Persists to
                        data/vocabulary.jsonl locally and pushes
                        asynchronously to HF Hub dataset. Provides
                        get_recent() and a formatted
                        get_vocabulary_context() for the LLM prompt.

src/conversation/
  phrase_matcher.py     RapidFuzz-based matcher over curated JSON phrase
                        libraries (data/phrases/{lang}.json +
                        _additions.json). Handles greetings / thanks /
                        farewells without hitting the LLM.

src/iot/
  intent_parser.py      Keyword-based Intent classifier
                        (greeting/thanks/farewell/check_soil/check_weather/
                        irrigation_status/pest_alert) for bam, ful, fr, en.
                        Confidence = matched_keywords / total_keywords.
  sensor_bridge.py      Async bridge to an IoT backend (SENSOR_API_URL) for
                        soil / weather / irrigation / pest readings. Falls
                        back to mock random data.
  voice_responder.py    Maps (Intent, SensorData) → short Bambara/Fula reply
                        string (≤6 words per sentence for clean MMS-TTS)
                        plus English translation. Alert thresholds encoded
                        here (SOIL_MOISTURE_LOW=30, PH bounds, TEMP_HIGH=38,
                        etc.). Also has a verbose French-language path.

src/data/
  agri_dictionary.py    Bambara + Fula domain vocab used to bias the Whisper
                        decoder prompt toward agricultural terms.
  waxal_loader.py       Streams google/fleurs (bam_ML, ff_SN), the
                        replacement for the retired google/waxal dataset.
  feature_extractor.py  Log-mel spectrogram extraction and batched padding
                        collator for Whisper Seq2SeqTrainer.
  augmentation.py       FieldNoiseAugmenter: mixes clean speech with
                        tractor/wind/livestock samples; falls back to
                        Gaussian noise.
  bam_normalize.py      Bambara phonetic normalizer (ou→u, gn/ny→ɲ,
                        N'Ko-derived standard).
  adlam.py              Adlam (𞤀𞤣𞤤𞤢𞤥) ↔ Latin transliteration for Pular;
                        normalize_pular() for ASR preprocessing.
  web_harvester.py      Harvests RobotsMali/jeli-asr, google/fleurs ff_SN,
                        and bm/ff Wikipedia into the feedback Hub dataset.

src/training/
  trainer.py            WhisperLoRATrainer: full fine-tune orchestration
                        (backbone + LoraConfig + WaxalDataLoader +
                        Seq2SeqTrainer).
  metrics.py            WER/CER for Seq2SeqTrainer eval loop (via jiwer).
  callbacks.py          EarlyStoppingOnWER, AdapterCheckpointCallback (saves
                        adapter-only, not full model).

src/tts/
  waxal_tts.py          VITS engine wrapping ynnov/ekodi-bambara-tts-female
                        for Bambara; Fula is a placeholder until
                        ous-sow/fula-tts is trained.
  mms_tts.py            Facebook MMS-TTS (bam/ful/fra/eng).
  f5_tts.py             F5-TTS voice cloning (optional, GPU-only, ~750MB);
                        gracefully falls back to MMS when missing.
  voice_cloner.py       OpenVoice V2 tone-color converter: reshapes VITS
                        audio to a target speaker's voice.

src/voice/
  speaker_profiles.py   SpeakerProfileManager with SpeechBrain ECAPA-TDNN
                        (192-d embeddings). Per-user running-average
                        embeddings for identification + OpenVoice SE for
                        cloning; cosine similarity ≥ 0.75 attributes to an
                        existing user.

src/optimization/
  onnx_exporter.py      Merges LoRA into backbone and exports per-language
                        ONNX (ONNX can't hot-swap adapters at runtime).
  quantizer.py          BitsAndBytes NF4 / 8-bit quantization for
                        GPU-constrained deploys (turbo ~3GB → ~1GB VRAM).
  tflite_converter.py   ONNX → TFLite for offline Android; exports encoder
                        and decoder separately.

Config / data folders
  configs/              base_config.yaml + per-language LoRA configs.
  data/                 vocabulary.jsonl, phrases/*.json, profiles/, etc.
  notebooks/            Kaggle / RunPod fine-tune + TTS training notebooks.
  noise_samples/        .wav clips for field-noise augmentation.
  scripts/              utility scripts (bootstrap, harvest, eval).
  tests/                pytest suite (not installed in HF Spaces runtime).

RECENT GIT COMMITS SUMMARY (last 20)
------------------------------------
The recent history is focused on three concurrent tracks:

  1. STT / training stability
     - bb78cbf  Add torchcodec install for datasets 4.x audio decoding
     - 9049ef3  Prepare training stack for RunPod: env-aware notebook +
                bootstrap script
     - cc50efb  Align Whisper default to turbo-v3 + add document upload to
                Knowledge Base tab
     - c33a061  Fix WhisperProcessor import in reload + upgrade base to
                large-v3-turbo
     - 7fae91b  Fix mel-bin mismatch: load per-language processor from
                fine-tuned checkpoint
     - 6682858  Fix jiwer crash on post-normalisation empty refs; register
                SLR106/105 datasets
     - 58f431a  Fix SyntaxError in Cell 17: unterminated f-string literal
     - 3632a23  Fix compute_metrics crash on empty eval references in Fula
                training
     - 71bb3bc  Fix: add trust_remote_code=True for datasets 3.x
                compatibility
     - cd017e2  Fix Cell 16 ValueError: load model fp32 so AMP gradient
                scaler works

  2. Language support / Adlam / Pular expansion
     - ced078c  Add Adlam/Pular Fula integration: transliterator + 3 new
                datasets + normalisation pipeline
     - 40cf84d  Fix language mixing: per-language prompts + Mali Bambara /
                Guinea Pular context
     - 33c3a5a  Fix Self-Teaching language detection: parse code from
                dropdown label
     - 24b1617  Fix Self-Teaching tab: float sliders, deduplication, Kaggle
                API fallback

  3. Conversation / voice pipeline
     - 8952fff  Phase 3: Voice-to-Voice S2S pipeline - F5-TTS, LLM brain,
                CER metric
     - ad902c6  Add real conversational memory + live learning to
                Conversation Mode
     - 8d7d9d8  Fix conversation mode timeout: two-stage pipeline + faster
                LLM
     - 1958814  Fix "Model loading" stuck state: block in _do_asr until
                Whisper is ready
     - 618eab5  Fix model loading stuck forever + unhandled TTS crash in
                conversation mode
     - bfe5b59  Fix slow build: strip runtime-irrelevant heavy packages from
                requirements.txt

Overall trajectory: the project has moved past initial Phase 1 scaffolding
and is iterating hard on (a) stabilising fine-tuning on Kaggle/RunPod with
large-v3-turbo, (b) expanding to Guinea Pular with the native Adlam script,
and (c) finishing the Phase 3 voice-to-voice pipeline (F5-TTS + LLM brain).
Most recent commits are bug-fixes rather than net-new features, suggesting
the current codebase is approaching a stable milestone.

================================================================================
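
As a supplementary sketch of two rules stated under src/iot/ in KEY SOURCE
FILES: the intent confidence (matched_keywords / total_keywords) and the
SOIL_MOISTURE_LOW=30 alert. The keyword table below is an English-only
placeholder, the "unknown" fallback intent and the function names are
hypothetical, and both the strict "<" comparison and the unit of the
threshold are assumptions. Only the threshold value and the Bambara alert
sentence come from this document.

```python
from typing import Optional

# Hypothetical English-only keyword table; the real parser is described as
# covering bam, ful, fr and en terms.
INTENT_KEYWORDS: dict[str, list[str]] = {
    "check_soil": ["soil", "moisture"],
    "check_weather": ["rain", "weather"],
}

SOIL_MOISTURE_LOW = 30  # value from the source; unit is an assumption


def parse_intent(text: str) -> tuple[str, float]:
    """Return (intent, confidence), confidence = matched / total keywords."""
    tokens = set(text.lower().replace("?", " ").split())
    best_intent, best_conf = "unknown", 0.0  # "unknown" is a placeholder
    for intent, keywords in INTENT_KEYWORDS.items():
        matched = sum(1 for kw in keywords if kw in tokens)
        confidence = matched / len(keywords)
        if confidence > best_conf:
            best_intent, best_conf = intent, confidence
    return best_intent, best_conf


def soil_alert(moisture: float) -> Optional[tuple[str, str]]:
    """Return (Bambara reply, English translation) when moisture is low."""
    if moisture < SOIL_MOISTURE_LOW:  # strict comparison is an assumption
        return (
            "Bunding ji dɔgɔ. I ka foro ji.",
            "Soil moisture is low. Irrigate your field.",
        )
    return None  # no alert; normal-range replies are not sketched here


if __name__ == "__main__":
    print(parse_intent("Is it going to rain?"))
    print(soil_alert(20))
```

A ratio over the full keyword list penalises intents with many keywords when
the utterance is short, so a real parser may want a minimum-confidence cutoff
tuned per intent.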