Spaces:
Running on Zero
Running on Zero
File size: 8,898 Bytes
a0b2364 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 | ---
name: uv monorepo init
overview: Bootstrap a uv workspace monorepo from scratch with a Gradio HF Space app and a swappable local inference library (llama-cpp-python default, transformers optional), aligned with [Build Small Hackathon](https://huggingface.co/build-small-hackathon) constraints.
todos:
- id: uv-workspace
content: Run uv init at root + apps/gradio-space + libs/inference; configure workspace members, sources, and uv.lock
status: pending
- id: inference-lib
content: Implement inference Protocol, llama_cpp backend, transformers backend (optional extra), and factory with env-based switching
status: pending
- id: gradio-app
content: Create minimal Gradio chat app in apps/gradio-space wired to inference lib
status: pending
- id: hf-space
content: Add Dockerfile, Space README YAML, .env.example, download_model script, and root README with dev/hackathon docs
status: pending
- id: verify
content: Run uv sync, local Gradio smoke test, and confirm imports/backends work
status: pending
isProject: false
---
# uv Monorepo + Gradio + Local Llama Inference
## Context
- Repo today: only [`README.md`](README.md) β greenfield setup.
- Your choices: **generic track scaffold**, **abstract inference with llama-cpp default**.
- Hackathon hard rules: Gradio app on HF Space, models **β€ 32B**, demo video + social post by **June 15, 2026**.
## Target layout
```text
small-model-hackathon/
βββ pyproject.toml # workspace root + shared dev tooling
βββ uv.lock
βββ .python-version # 3.12
βββ .gitignore
βββ Dockerfile # HF Space (Docker SDK) β builds whole workspace
βββ README.md # dev + hackathon checklist
βββ apps/
β βββ gradio-space/
β βββ pyproject.toml
β βββ README.md # HF Space card YAML (title, sdk, hardware hints)
β βββ src/gradio_space/
β βββ __init__.py
β βββ app.py # Gradio UI entrypoint
βββ libs/
β βββ inference/
β βββ pyproject.toml
β βββ src/inference/
β βββ __init__.py
β βββ base.py # Protocol / ABC
β βββ llama_cpp.py # default backend (GGUF)
β βββ transformers.py # optional HF backend
β βββ factory.py # INFERENCE_BACKEND env switch
βββ scripts/
βββ download_model.py # pull GGUF from Hub to local cache
```
```mermaid
flowchart LR
subgraph app [apps/gradio-space]
GradioUI[app.py]
end
subgraph lib [libs/inference]
Factory[factory.py]
LlamaCpp[llama_cpp.py]
Transformers[transformers.py]
end
GradioUI --> Factory
Factory -->|default| LlamaCpp
Factory -->|optional| Transformers
LlamaCpp --> GGUF[Local GGUF file]
Transformers --> HFModel[HF weights via transformers]
```
## 1. Initialize uv workspace
Run from repo root:
```bash
uv init --name small-model-hackathon
uv init --package apps/gradio-space
uv init --package libs/inference
```
Configure root [`pyproject.toml`](pyproject.toml):
- `[tool.uv.workspace]` with `members = ["apps/*", "libs/*"]`
- Root depends on both workspace packages so `uv sync` installs everything:
- `dependencies = ["gradio-space", "inference"]`
- `[tool.uv.sources]` mapping each to `{ workspace = true }`
- Shared dev deps at root: `ruff`, `pytest` (optional but lightweight)
- `requires-python = ">=3.12"` (matches your installed Python 3.12.9)
Lock and install:
```bash
uv lock
uv sync --all-packages
```
## 2. `libs/inference` β swappable local backends
**Core interface** in `base.py`:
```python
class InferenceBackend(Protocol):
def load(self) -> None: ...
def generate(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.7) -> str: ...
def chat(self, messages: list[dict[str, str]], **kwargs) -> str: ...
```
**Default backend β `llama_cpp.py`**
- Dependency: `llama-cpp-python` (CPU build by default; GPU variant documented for local/CUDA Spaces)
- Load GGUF via env config:
- `MODEL_PATH` β local file path, or
- `MODEL_REPO` + `MODEL_FILE` β download from Hugging Face Hub at startup (`huggingface_hub.hf_hub_download`)
- Suggested default model for dev: `Qwen/Qwen2.5-3B-Instruct-GGUF` with a specific `.gguf` quant (well under 32B; laptop-friendly)
**Optional backend β `transformers.py`**
- Dependencies kept in an optional extra: `inference[transformers]` β `transformers`, `torch`, `accelerate`
- Same public methods; loads `AutoModelForCausalLM` + `AutoTokenizer` from `MODEL_ID`
- Heavier; useful if you later fine-tune and publish on Hub
**Factory β `factory.py`**
- `INFERENCE_BACKEND=llama_cpp|transformers` (default `llama_cpp`)
- Lazy singleton so model loads once on first request (important for Gradio cold start)
## 3. `apps/gradio-space` β minimal chat UI
**Dependencies:** `gradio`, `inference` (workspace)
**`app.py` skeleton:**
- `gr.ChatInterface` or simple `Blocks` with textbox + chat history
- On startup: call `get_backend().load()` with a status message if model missing
- Wire `chat()` to the inference backend
- Expose `demo.launch()` guarded by `if __name__ == "__main__"`
**Run locally:**
```bash
uv run --package gradio-space python -m gradio_space.app
# or: uv run --package gradio-space gradio apps/gradio-space/src/gradio_space/app.py
```
**Env template** (`.env.example` at root):
```bash
INFERENCE_BACKEND=llama_cpp
MODEL_REPO=Qwen/Qwen2.5-3B-Instruct-GGUF
MODEL_FILE=qwen2.5-3b-instruct-q4_k_m.gguf
N_CTX=4096
N_GPU_LAYERS=0
```
## 4. HF Space deployment (monorepo-friendly)
Use **Docker SDK** at repo root ([HF Docker Spaces docs](https://huggingface.co/docs/hub/en/spaces-sdks-docker)) so the whole workspace ships together.
**Root `Dockerfile` (outline):**
- Base: `python:3.12-slim`
- Install `uv` via official installer
- `COPY` monorepo, `uv sync --frozen --no-dev --package gradio-space`
- Run as UID 1000 (HF requirement)
- `EXPOSE 7860`
- `CMD ["uv", "run", "--package", "gradio-space", "python", "-m", "gradio_space.app"]`
**`apps/gradio-space/README.md`** β Space card frontmatter:
```yaml
---
title: <Your App Name>
emoji: ...
colorFrom: ...
colorTo: ...
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
---
```
When creating the Space under [build-small-hackathon](https://huggingface.co/build-small-hackathon):
1. New Space β SDK: Docker β link this repo
2. Hardware: start **CPU basic** for llama-cpp dev; upgrade to GPU Space if you offload layers
3. Add Space secrets/env vars for `MODEL_REPO`, `MODEL_FILE`, etc.
4. Optionally attach a **Storage Bucket** if you cache large GGUF files persistently
## 5. Repo hygiene
**[`.gitignore`](.gitignore):** `.venv/`, `__pycache__/`, `.env`, `models/`, `*.gguf`, `.ruff_cache/`, `.pytest_cache/`
**[`README.md`](README.md)** sections:
- Prerequisites: `uv`, Python 3.12
- Quick start: sync, download model script, run Gradio locally
- Monorepo commands cheat sheet (`uv add --package ...`, `uv run --package ...`)
- Hackathon checklist: track choice, Space link, demo video, social post, badge targets (Off-the-Grid, Llama Champion, etc.)
**[`scripts/download_model.py`](scripts/download_model.py):** small CLI using `huggingface_hub` to fetch the configured GGUF into `./models/` for offline dev.
## 6. Verification checklist (post-init)
| Step | Command / check |
|------|-----------------|
| Workspace resolves | `uv sync --all-packages` succeeds |
| Import chain | `uv run python -c "from inference.factory import get_backend"` |
| Gradio boots | `uv run --package gradio-space python -m gradio_space.app` β localhost:7860 |
| Backend switch | `INFERENCE_BACKEND=transformers` fails gracefully until extra installed |
| Docker build | `docker build -t hackathon-space .` (optional local smoke test) |
## Out of scope for this init (pick up later)
- Track-specific product logic (Backyard AI vs Thousand Token Wood)
- Fine-tuning pipeline / custom model publish
- Custom UI via `gr.Server` (Off-Brand badge)
- Agent traces dataset upload (Sharing is Caring badge)
- CI/GitHub Actions
## Key design decisions
| Decision | Rationale |
|----------|-----------|
| uv workspace with `apps/` + `libs/` | Clean separation; Gradio app stays thin; inference reusable |
| llama-cpp default | Matches "Off the Grid" + "Llama Champion" badges; runs on laptop CPU |
| transformers as optional extra | Keeps default install light; swap via env when needed |
| Docker Space at repo root | Standard pattern for monorepos on HF (see [eu-ai-act example](https://huggingface.co/spaces/MCP-1st-Birthday/eu-ai-act-compliance-agent/blob/main/Dockerfile)) |
| Qwen2.5-3B-Instruct GGUF default | Small, capable, llama.cpp-compatible, well under 32B cap |
|