--- name: uv monorepo init overview: Bootstrap a uv workspace monorepo from scratch with a Gradio HF Space app and a swappable local inference library (llama-cpp-python default, transformers optional), aligned with [Build Small Hackathon](https://huggingface.co/build-small-hackathon) constraints. todos: - id: uv-workspace content: Run uv init at root + apps/gradio-space + libs/inference; configure workspace members, sources, and uv.lock status: pending - id: inference-lib content: Implement inference Protocol, llama_cpp backend, transformers backend (optional extra), and factory with env-based switching status: pending - id: gradio-app content: Create minimal Gradio chat app in apps/gradio-space wired to inference lib status: pending - id: hf-space content: Add Dockerfile, Space README YAML, .env.example, download_model script, and root README with dev/hackathon docs status: pending - id: verify content: Run uv sync, local Gradio smoke test, and confirm imports/backends work status: pending isProject: false --- # uv Monorepo + Gradio + Local Llama Inference ## Context - Repo today: only [`README.md`](README.md) — greenfield setup. - Your choices: **generic track scaffold**, **abstract inference with llama-cpp default**. - Hackathon hard rules: Gradio app on HF Space, models **≤ 32B**, demo video + social post by **June 15, 2026**. ## Target layout ```text small-model-hackathon/ ├── pyproject.toml # workspace root + shared dev tooling ├── uv.lock ├── .python-version # 3.12 ├── .gitignore ├── Dockerfile # HF Space (Docker SDK) — builds whole workspace ├── README.md # dev + hackathon checklist ├── apps/ │ └── gradio-space/ │ ├── pyproject.toml │ ├── README.md # HF Space card YAML (title, sdk, hardware hints) │ └── src/gradio_space/ │ ├── __init__.py │ └── app.py # Gradio UI entrypoint ├── libs/ │ └── inference/ │ ├── pyproject.toml │ └── src/inference/ │ ├── __init__.py │ ├── base.py # Protocol / ABC │ ├── llama_cpp.py # default backend (GGUF) │ ├── transformers.py # optional HF backend │ └── factory.py # INFERENCE_BACKEND env switch └── scripts/ └── download_model.py # pull GGUF from Hub to local cache ``` ```mermaid flowchart LR subgraph app [apps/gradio-space] GradioUI[app.py] end subgraph lib [libs/inference] Factory[factory.py] LlamaCpp[llama_cpp.py] Transformers[transformers.py] end GradioUI --> Factory Factory -->|default| LlamaCpp Factory -->|optional| Transformers LlamaCpp --> GGUF[Local GGUF file] Transformers --> HFModel[HF weights via transformers] ``` ## 1. Initialize uv workspace Run from repo root: ```bash uv init --name small-model-hackathon uv init --package apps/gradio-space uv init --package libs/inference ``` Configure root [`pyproject.toml`](pyproject.toml): - `[tool.uv.workspace]` with `members = ["apps/*", "libs/*"]` - Root depends on both workspace packages so `uv sync` installs everything: - `dependencies = ["gradio-space", "inference"]` - `[tool.uv.sources]` mapping each to `{ workspace = true }` - Shared dev deps at root: `ruff`, `pytest` (optional but lightweight) - `requires-python = ">=3.12"` (matches your installed Python 3.12.9) Lock and install: ```bash uv lock uv sync --all-packages ``` ## 2. `libs/inference` — swappable local backends **Core interface** in `base.py`: ```python class InferenceBackend(Protocol): def load(self) -> None: ... def generate(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.7) -> str: ... def chat(self, messages: list[dict[str, str]], **kwargs) -> str: ... ``` **Default backend — `llama_cpp.py`** - Dependency: `llama-cpp-python` (CPU build by default; GPU variant documented for local/CUDA Spaces) - Load GGUF via env config: - `MODEL_PATH` — local file path, or - `MODEL_REPO` + `MODEL_FILE` — download from Hugging Face Hub at startup (`huggingface_hub.hf_hub_download`) - Suggested default model for dev: `Qwen/Qwen2.5-3B-Instruct-GGUF` with a specific `.gguf` quant (well under 32B; laptop-friendly) **Optional backend — `transformers.py`** - Dependencies kept in an optional extra: `inference[transformers]` → `transformers`, `torch`, `accelerate` - Same public methods; loads `AutoModelForCausalLM` + `AutoTokenizer` from `MODEL_ID` - Heavier; useful if you later fine-tune and publish on Hub **Factory — `factory.py`** - `INFERENCE_BACKEND=llama_cpp|transformers` (default `llama_cpp`) - Lazy singleton so model loads once on first request (important for Gradio cold start) ## 3. `apps/gradio-space` — minimal chat UI **Dependencies:** `gradio`, `inference` (workspace) **`app.py` skeleton:** - `gr.ChatInterface` or simple `Blocks` with textbox + chat history - On startup: call `get_backend().load()` with a status message if model missing - Wire `chat()` to the inference backend - Expose `demo.launch()` guarded by `if __name__ == "__main__"` **Run locally:** ```bash uv run --package gradio-space python -m gradio_space.app # or: uv run --package gradio-space gradio apps/gradio-space/src/gradio_space/app.py ``` **Env template** (`.env.example` at root): ```bash INFERENCE_BACKEND=llama_cpp MODEL_REPO=Qwen/Qwen2.5-3B-Instruct-GGUF MODEL_FILE=qwen2.5-3b-instruct-q4_k_m.gguf N_CTX=4096 N_GPU_LAYERS=0 ``` ## 4. HF Space deployment (monorepo-friendly) Use **Docker SDK** at repo root ([HF Docker Spaces docs](https://huggingface.co/docs/hub/en/spaces-sdks-docker)) so the whole workspace ships together. **Root `Dockerfile` (outline):** - Base: `python:3.12-slim` - Install `uv` via official installer - `COPY` monorepo, `uv sync --frozen --no-dev --package gradio-space` - Run as UID 1000 (HF requirement) - `EXPOSE 7860` - `CMD ["uv", "run", "--package", "gradio-space", "python", "-m", "gradio_space.app"]` **`apps/gradio-space/README.md`** — Space card frontmatter: ```yaml --- title: emoji: ... colorFrom: ... colorTo: ... sdk: docker app_port: 7860 pinned: false license: apache-2.0 --- ``` When creating the Space under [build-small-hackathon](https://huggingface.co/build-small-hackathon): 1. New Space → SDK: Docker → link this repo 2. Hardware: start **CPU basic** for llama-cpp dev; upgrade to GPU Space if you offload layers 3. Add Space secrets/env vars for `MODEL_REPO`, `MODEL_FILE`, etc. 4. Optionally attach a **Storage Bucket** if you cache large GGUF files persistently ## 5. Repo hygiene **[`.gitignore`](.gitignore):** `.venv/`, `__pycache__/`, `.env`, `models/`, `*.gguf`, `.ruff_cache/`, `.pytest_cache/` **[`README.md`](README.md)** sections: - Prerequisites: `uv`, Python 3.12 - Quick start: sync, download model script, run Gradio locally - Monorepo commands cheat sheet (`uv add --package ...`, `uv run --package ...`) - Hackathon checklist: track choice, Space link, demo video, social post, badge targets (Off-the-Grid, Llama Champion, etc.) **[`scripts/download_model.py`](scripts/download_model.py):** small CLI using `huggingface_hub` to fetch the configured GGUF into `./models/` for offline dev. ## 6. Verification checklist (post-init) | Step | Command / check | |------|-----------------| | Workspace resolves | `uv sync --all-packages` succeeds | | Import chain | `uv run python -c "from inference.factory import get_backend"` | | Gradio boots | `uv run --package gradio-space python -m gradio_space.app` → localhost:7860 | | Backend switch | `INFERENCE_BACKEND=transformers` fails gracefully until extra installed | | Docker build | `docker build -t hackathon-space .` (optional local smoke test) | ## Out of scope for this init (pick up later) - Track-specific product logic (Backyard AI vs Thousand Token Wood) - Fine-tuning pipeline / custom model publish - Custom UI via `gr.Server` (Off-Brand badge) - Agent traces dataset upload (Sharing is Caring badge) - CI/GitHub Actions ## Key design decisions | Decision | Rationale | |----------|-----------| | uv workspace with `apps/` + `libs/` | Clean separation; Gradio app stays thin; inference reusable | | llama-cpp default | Matches "Off the Grid" + "Llama Champion" badges; runs on laptop CPU | | transformers as optional extra | Keeps default install light; swap via env when needed | | Docker Space at repo root | Standard pattern for monorepos on HF (see [eu-ai-act example](https://huggingface.co/spaces/MCP-1st-Birthday/eu-ai-act-compliance-agent/blob/main/Dockerfile)) | | Qwen2.5-3B-Instruct GGUF default | Small, capable, llama.cpp-compatible, well under 32B cap |