File size: 8,898 Bytes
a0b2364
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
---
name: uv monorepo init
overview: Bootstrap a uv workspace monorepo from scratch with a Gradio HF Space app and a swappable local inference library (llama-cpp-python default, transformers optional), aligned with [Build Small Hackathon](https://huggingface.co/build-small-hackathon) constraints.
todos:
  - id: uv-workspace
    content: Run uv init at root + apps/gradio-space + libs/inference; configure workspace members, sources, and uv.lock
    status: pending
  - id: inference-lib
    content: Implement inference Protocol, llama_cpp backend, transformers backend (optional extra), and factory with env-based switching
    status: pending
  - id: gradio-app
    content: Create minimal Gradio chat app in apps/gradio-space wired to inference lib
    status: pending
  - id: hf-space
    content: Add Dockerfile, Space README YAML, .env.example, download_model script, and root README with dev/hackathon docs
    status: pending
  - id: verify
    content: Run uv sync, local Gradio smoke test, and confirm imports/backends work
    status: pending
isProject: false
---

# uv Monorepo + Gradio + Local Llama Inference

## Context

- Repo today: only [`README.md`](README.md) β€” greenfield setup.
- Your choices: **generic track scaffold**, **abstract inference with llama-cpp default**.
- Hackathon hard rules: Gradio app on HF Space, models **≀ 32B**, demo video + social post by **June 15, 2026**.

## Target layout

```text
small-model-hackathon/
β”œβ”€β”€ pyproject.toml              # workspace root + shared dev tooling
β”œβ”€β”€ uv.lock
β”œβ”€β”€ .python-version             # 3.12
β”œβ”€β”€ .gitignore
β”œβ”€β”€ Dockerfile                  # HF Space (Docker SDK) β€” builds whole workspace
β”œβ”€β”€ README.md                   # dev + hackathon checklist
β”œβ”€β”€ apps/
β”‚   └── gradio-space/
β”‚       β”œβ”€β”€ pyproject.toml
β”‚       β”œβ”€β”€ README.md           # HF Space card YAML (title, sdk, hardware hints)
β”‚       └── src/gradio_space/
β”‚           β”œβ”€β”€ __init__.py
β”‚           └── app.py          # Gradio UI entrypoint
β”œβ”€β”€ libs/
β”‚   └── inference/
β”‚       β”œβ”€β”€ pyproject.toml
β”‚       └── src/inference/
β”‚           β”œβ”€β”€ __init__.py
β”‚           β”œβ”€β”€ base.py         # Protocol / ABC
β”‚           β”œβ”€β”€ llama_cpp.py    # default backend (GGUF)
β”‚           β”œβ”€β”€ transformers.py # optional HF backend
β”‚           └── factory.py      # INFERENCE_BACKEND env switch
└── scripts/
    └── download_model.py       # pull GGUF from Hub to local cache
```

```mermaid
flowchart LR
  subgraph app [apps/gradio-space]
    GradioUI[app.py]
  end
  subgraph lib [libs/inference]
    Factory[factory.py]
    LlamaCpp[llama_cpp.py]
    Transformers[transformers.py]
  end
  GradioUI --> Factory
  Factory -->|default| LlamaCpp
  Factory -->|optional| Transformers
  LlamaCpp --> GGUF[Local GGUF file]
  Transformers --> HFModel[HF weights via transformers]
```

## 1. Initialize uv workspace

Run from repo root:

```bash
uv init --name small-model-hackathon
uv init --package apps/gradio-space
uv init --package libs/inference
```

Configure root [`pyproject.toml`](pyproject.toml):

- `[tool.uv.workspace]` with `members = ["apps/*", "libs/*"]`
- Root depends on both workspace packages so `uv sync` installs everything:
  - `dependencies = ["gradio-space", "inference"]`
  - `[tool.uv.sources]` mapping each to `{ workspace = true }`
- Shared dev deps at root: `ruff`, `pytest` (optional but lightweight)
- `requires-python = ">=3.12"` (matches your installed Python 3.12.9)

Lock and install:

```bash
uv lock
uv sync --all-packages
```

## 2. `libs/inference` β€” swappable local backends

**Core interface** in `base.py`:

```python
class InferenceBackend(Protocol):
    def load(self) -> None: ...
    def generate(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.7) -> str: ...
    def chat(self, messages: list[dict[str, str]], **kwargs) -> str: ...
```

**Default backend β€” `llama_cpp.py`**

- Dependency: `llama-cpp-python` (CPU build by default; GPU variant documented for local/CUDA Spaces)
- Load GGUF via env config:
  - `MODEL_PATH` β€” local file path, or
  - `MODEL_REPO` + `MODEL_FILE` β€” download from Hugging Face Hub at startup (`huggingface_hub.hf_hub_download`)
- Suggested default model for dev: `Qwen/Qwen2.5-3B-Instruct-GGUF` with a specific `.gguf` quant (well under 32B; laptop-friendly)

**Optional backend β€” `transformers.py`**

- Dependencies kept in an optional extra: `inference[transformers]` β†’ `transformers`, `torch`, `accelerate`
- Same public methods; loads `AutoModelForCausalLM` + `AutoTokenizer` from `MODEL_ID`
- Heavier; useful if you later fine-tune and publish on Hub

**Factory β€” `factory.py`**

- `INFERENCE_BACKEND=llama_cpp|transformers` (default `llama_cpp`)
- Lazy singleton so model loads once on first request (important for Gradio cold start)

## 3. `apps/gradio-space` β€” minimal chat UI

**Dependencies:** `gradio`, `inference` (workspace)

**`app.py` skeleton:**

- `gr.ChatInterface` or simple `Blocks` with textbox + chat history
- On startup: call `get_backend().load()` with a status message if model missing
- Wire `chat()` to the inference backend
- Expose `demo.launch()` guarded by `if __name__ == "__main__"`

**Run locally:**

```bash
uv run --package gradio-space python -m gradio_space.app
# or: uv run --package gradio-space gradio apps/gradio-space/src/gradio_space/app.py
```

**Env template** (`.env.example` at root):

```bash
INFERENCE_BACKEND=llama_cpp
MODEL_REPO=Qwen/Qwen2.5-3B-Instruct-GGUF
MODEL_FILE=qwen2.5-3b-instruct-q4_k_m.gguf
N_CTX=4096
N_GPU_LAYERS=0
```

## 4. HF Space deployment (monorepo-friendly)

Use **Docker SDK** at repo root ([HF Docker Spaces docs](https://huggingface.co/docs/hub/en/spaces-sdks-docker)) so the whole workspace ships together.

**Root `Dockerfile` (outline):**

- Base: `python:3.12-slim`
- Install `uv` via official installer
- `COPY` monorepo, `uv sync --frozen --no-dev --package gradio-space`
- Run as UID 1000 (HF requirement)
- `EXPOSE 7860`
- `CMD ["uv", "run", "--package", "gradio-space", "python", "-m", "gradio_space.app"]`

**`apps/gradio-space/README.md`** β€” Space card frontmatter:

```yaml
---
title: <Your App Name>
emoji: ...
colorFrom: ...
colorTo: ...
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
---
```

When creating the Space under [build-small-hackathon](https://huggingface.co/build-small-hackathon):

1. New Space β†’ SDK: Docker β†’ link this repo
2. Hardware: start **CPU basic** for llama-cpp dev; upgrade to GPU Space if you offload layers
3. Add Space secrets/env vars for `MODEL_REPO`, `MODEL_FILE`, etc.
4. Optionally attach a **Storage Bucket** if you cache large GGUF files persistently

## 5. Repo hygiene

**[`.gitignore`](.gitignore):** `.venv/`, `__pycache__/`, `.env`, `models/`, `*.gguf`, `.ruff_cache/`, `.pytest_cache/`

**[`README.md`](README.md)** sections:

- Prerequisites: `uv`, Python 3.12
- Quick start: sync, download model script, run Gradio locally
- Monorepo commands cheat sheet (`uv add --package ...`, `uv run --package ...`)
- Hackathon checklist: track choice, Space link, demo video, social post, badge targets (Off-the-Grid, Llama Champion, etc.)

**[`scripts/download_model.py`](scripts/download_model.py):** small CLI using `huggingface_hub` to fetch the configured GGUF into `./models/` for offline dev.

## 6. Verification checklist (post-init)

| Step | Command / check |
|------|-----------------|
| Workspace resolves | `uv sync --all-packages` succeeds |
| Import chain | `uv run python -c "from inference.factory import get_backend"` |
| Gradio boots | `uv run --package gradio-space python -m gradio_space.app` β†’ localhost:7860 |
| Backend switch | `INFERENCE_BACKEND=transformers` fails gracefully until extra installed |
| Docker build | `docker build -t hackathon-space .` (optional local smoke test) |

## Out of scope for this init (pick up later)

- Track-specific product logic (Backyard AI vs Thousand Token Wood)
- Fine-tuning pipeline / custom model publish
- Custom UI via `gr.Server` (Off-Brand badge)
- Agent traces dataset upload (Sharing is Caring badge)
- CI/GitHub Actions

## Key design decisions

| Decision | Rationale |
|----------|-----------|
| uv workspace with `apps/` + `libs/` | Clean separation; Gradio app stays thin; inference reusable |
| llama-cpp default | Matches "Off the Grid" + "Llama Champion" badges; runs on laptop CPU |
| transformers as optional extra | Keeps default install light; swap via env when needed |
| Docker Space at repo root | Standard pattern for monorepos on HF (see [eu-ai-act example](https://huggingface.co/spaces/MCP-1st-Birthday/eu-ai-act-compliance-agent/blob/main/Dockerfile)) |
| Qwen2.5-3B-Instruct GGUF default | Small, capable, llama.cpp-compatible, well under 32B cap |