lesson-agent / .cursor /plans /uv_monorepo_init_d9207227.plan.md
MSGEncrypted's picture
init plan
a0b2364
|
Raw
History Blame
8.9 kB
metadata
name: uv monorepo init
overview: >-
  Bootstrap a uv workspace monorepo from scratch with a Gradio HF Space app and
  a swappable local inference library (llama-cpp-python default, transformers
  optional), aligned with [Build Small
  Hackathon](https://huggingface.co/build-small-hackathon) constraints.
todos:
  - id: uv-workspace
    content: >-
      Run uv init at root + apps/gradio-space + libs/inference; configure
      workspace members, sources, and uv.lock
    status: pending
  - id: inference-lib
    content: >-
      Implement inference Protocol, llama_cpp backend, transformers backend
      (optional extra), and factory with env-based switching
    status: pending
  - id: gradio-app
    content: Create minimal Gradio chat app in apps/gradio-space wired to inference lib
    status: pending
  - id: hf-space
    content: >-
      Add Dockerfile, Space README YAML, .env.example, download_model script,
      and root README with dev/hackathon docs
    status: pending
  - id: verify
    content: Run uv sync, local Gradio smoke test, and confirm imports/backends work
    status: pending
isProject: false

uv Monorepo + Gradio + Local Llama Inference

Context

  • Repo today: only README.md β€” greenfield setup.
  • Your choices: generic track scaffold, abstract inference with llama-cpp default.
  • Hackathon hard rules: Gradio app on HF Space, models ≀ 32B, demo video + social post by June 15, 2026.

Target layout

small-model-hackathon/
β”œβ”€β”€ pyproject.toml              # workspace root + shared dev tooling
β”œβ”€β”€ uv.lock
β”œβ”€β”€ .python-version             # 3.12
β”œβ”€β”€ .gitignore
β”œβ”€β”€ Dockerfile                  # HF Space (Docker SDK) β€” builds whole workspace
β”œβ”€β”€ README.md                   # dev + hackathon checklist
β”œβ”€β”€ apps/
β”‚   └── gradio-space/
β”‚       β”œβ”€β”€ pyproject.toml
β”‚       β”œβ”€β”€ README.md           # HF Space card YAML (title, sdk, hardware hints)
β”‚       └── src/gradio_space/
β”‚           β”œβ”€β”€ __init__.py
β”‚           └── app.py          # Gradio UI entrypoint
β”œβ”€β”€ libs/
β”‚   └── inference/
β”‚       β”œβ”€β”€ pyproject.toml
β”‚       └── src/inference/
β”‚           β”œβ”€β”€ __init__.py
β”‚           β”œβ”€β”€ base.py         # Protocol / ABC
β”‚           β”œβ”€β”€ llama_cpp.py    # default backend (GGUF)
β”‚           β”œβ”€β”€ transformers.py # optional HF backend
β”‚           └── factory.py      # INFERENCE_BACKEND env switch
└── scripts/
    └── download_model.py       # pull GGUF from Hub to local cache
flowchart LR
  subgraph app [apps/gradio-space]
    GradioUI[app.py]
  end
  subgraph lib [libs/inference]
    Factory[factory.py]
    LlamaCpp[llama_cpp.py]
    Transformers[transformers.py]
  end
  GradioUI --> Factory
  Factory -->|default| LlamaCpp
  Factory -->|optional| Transformers
  LlamaCpp --> GGUF[Local GGUF file]
  Transformers --> HFModel[HF weights via transformers]

1. Initialize uv workspace

Run from repo root:

uv init --name small-model-hackathon
uv init --package apps/gradio-space
uv init --package libs/inference

Configure root pyproject.toml:

  • [tool.uv.workspace] with members = ["apps/*", "libs/*"]
  • Root depends on both workspace packages so uv sync installs everything:
    • dependencies = ["gradio-space", "inference"]
    • [tool.uv.sources] mapping each to { workspace = true }
  • Shared dev deps at root: ruff, pytest (optional but lightweight)
  • requires-python = ">=3.12" (matches your installed Python 3.12.9)

Lock and install:

uv lock
uv sync --all-packages

2. libs/inference β€” swappable local backends

Core interface in base.py:

class InferenceBackend(Protocol):
    def load(self) -> None: ...
    def generate(self, prompt: str, *, max_tokens: int = 512, temperature: float = 0.7) -> str: ...
    def chat(self, messages: list[dict[str, str]], **kwargs) -> str: ...

Default backend β€” llama_cpp.py

  • Dependency: llama-cpp-python (CPU build by default; GPU variant documented for local/CUDA Spaces)
  • Load GGUF via env config:
    • MODEL_PATH β€” local file path, or
    • MODEL_REPO + MODEL_FILE β€” download from Hugging Face Hub at startup (huggingface_hub.hf_hub_download)
  • Suggested default model for dev: Qwen/Qwen2.5-3B-Instruct-GGUF with a specific .gguf quant (well under 32B; laptop-friendly)

Optional backend β€” transformers.py

  • Dependencies kept in an optional extra: inference[transformers] β†’ transformers, torch, accelerate
  • Same public methods; loads AutoModelForCausalLM + AutoTokenizer from MODEL_ID
  • Heavier; useful if you later fine-tune and publish on Hub

Factory β€” factory.py

  • INFERENCE_BACKEND=llama_cpp|transformers (default llama_cpp)
  • Lazy singleton so model loads once on first request (important for Gradio cold start)

3. apps/gradio-space β€” minimal chat UI

Dependencies: gradio, inference (workspace)

app.py skeleton:

  • gr.ChatInterface or simple Blocks with textbox + chat history
  • On startup: call get_backend().load() with a status message if model missing
  • Wire chat() to the inference backend
  • Expose demo.launch() guarded by if __name__ == "__main__"

Run locally:

uv run --package gradio-space python -m gradio_space.app
# or: uv run --package gradio-space gradio apps/gradio-space/src/gradio_space/app.py

Env template (.env.example at root):

INFERENCE_BACKEND=llama_cpp
MODEL_REPO=Qwen/Qwen2.5-3B-Instruct-GGUF
MODEL_FILE=qwen2.5-3b-instruct-q4_k_m.gguf
N_CTX=4096
N_GPU_LAYERS=0

4. HF Space deployment (monorepo-friendly)

Use Docker SDK at repo root (HF Docker Spaces docs) so the whole workspace ships together.

Root Dockerfile (outline):

  • Base: python:3.12-slim
  • Install uv via official installer
  • COPY monorepo, uv sync --frozen --no-dev --package gradio-space
  • Run as UID 1000 (HF requirement)
  • EXPOSE 7860
  • CMD ["uv", "run", "--package", "gradio-space", "python", "-m", "gradio_space.app"]

apps/gradio-space/README.md β€” Space card frontmatter:

---
title: <Your App Name>
emoji: ...
colorFrom: ...
colorTo: ...
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
---

When creating the Space under build-small-hackathon:

  1. New Space β†’ SDK: Docker β†’ link this repo
  2. Hardware: start CPU basic for llama-cpp dev; upgrade to GPU Space if you offload layers
  3. Add Space secrets/env vars for MODEL_REPO, MODEL_FILE, etc.
  4. Optionally attach a Storage Bucket if you cache large GGUF files persistently

5. Repo hygiene

.gitignore: .venv/, __pycache__/, .env, models/, *.gguf, .ruff_cache/, .pytest_cache/

README.md sections:

  • Prerequisites: uv, Python 3.12
  • Quick start: sync, download model script, run Gradio locally
  • Monorepo commands cheat sheet (uv add --package ..., uv run --package ...)
  • Hackathon checklist: track choice, Space link, demo video, social post, badge targets (Off-the-Grid, Llama Champion, etc.)

scripts/download_model.py: small CLI using huggingface_hub to fetch the configured GGUF into ./models/ for offline dev.

6. Verification checklist (post-init)

Step Command / check
Workspace resolves uv sync --all-packages succeeds
Import chain uv run python -c "from inference.factory import get_backend"
Gradio boots uv run --package gradio-space python -m gradio_space.app β†’ localhost:7860
Backend switch INFERENCE_BACKEND=transformers fails gracefully until extra installed
Docker build docker build -t hackathon-space . (optional local smoke test)

Out of scope for this init (pick up later)

  • Track-specific product logic (Backyard AI vs Thousand Token Wood)
  • Fine-tuning pipeline / custom model publish
  • Custom UI via gr.Server (Off-Brand badge)
  • Agent traces dataset upload (Sharing is Caring badge)
  • CI/GitHub Actions

Key design decisions

Decision Rationale
uv workspace with apps/ + libs/ Clean separation; Gradio app stays thin; inference reusable
llama-cpp default Matches "Off the Grid" + "Llama Champion" badges; runs on laptop CPU
transformers as optional extra Keeps default install light; swap via env when needed
Docker Space at repo root Standard pattern for monorepos on HF (see eu-ai-act example)
Qwen2.5-3B-Instruct GGUF default Small, capable, llama.cpp-compatible, well under 32B cap