# Usage

How to run the Gradio chat app locally, test it in Docker, and deploy to a Hugging Face Space for the [Build Small Hackathon](https://huggingface.co/build-small-hackathon).

## Prerequisites

- [uv](https://docs.astral.sh/uv/) installed
- Python 3.12 (see `.python-version`)
- For Docker testing: Docker installed locally
- For HF Space deploy: Hugging Face account with access to the `build-small-hackathon` org

## Local development

### 1. Install dependencies

```bash
uv sync --all-packages
```

### 2. Configure environment (optional)

```bash
cp .env.example .env
```

Edit `.env` if you want a different model or local GGUF path. Defaults work out of the box.

### 3. Pre-download the model (recommended)

The app can download the GGUF on first chat, but pre-downloading avoids a long wait during your first message:

```bash
uv run python scripts/download_model.py
```

Then add the printed path to `.env`:

```bash
MODEL_PATH=./models/qwen2.5-3b-instruct-q4_k_m.gguf
```

### 4. Run the Gradio app

```bash
uv run --package gradio-space python -m gradio_space.app
```

Open http://localhost:7860. The model loads on the **first chat message** unless you set `MODEL_PATH`.

After code changes, restart the process to pick up updates.

### 5. Quick sanity checks

```bash
# Inference package resolves
uv run python -c "from inference.factory import get_backend; print(type(get_backend()).__name__)"

# Gradio app module loads
uv run --package gradio-space python -c "from gradio_space.app import build_demo; print(build_demo())"
```

### Local env reference

| Variable | Default | Description |
|----------|---------|-------------|
| `INFERENCE_BACKEND` | `llama_cpp` | `llama_cpp` or `transformers` |
| `MODEL_REPO` | `Qwen/Qwen2.5-3B-Instruct-GGUF` | Hub repo for GGUF |
| `MODEL_FILE` | `qwen2.5-3b-instruct-q4_k_m.gguf` | GGUF filename |
| `MODEL_PATH` | — | Local GGUF path (skips Hub download) |
| `N_CTX` | `4096` | Context window |
| `N_GPU_LAYERS` | `0` | GPU layers for llama.cpp (`0` = CPU only) |
| `PORT` | `7860` | Gradio listen port |
| `MODEL_ID` | `Qwen/Qwen2.5-3B-Instruct` | Used when `INFERENCE_BACKEND=transformers` |

### Optional: transformers backend

Heavier install; only needed if you switch away from llama.cpp:

```bash
uv sync --package inference --extra transformers
INFERENCE_BACKEND=transformers MODEL_ID=Qwen/Qwen2.5-3B-Instruct \
  uv run --package gradio-space python -m gradio_space.app
```

---

## Docker (local prod-like test)

Run the same container image HF Spaces will build:

```bash
docker build -t hackathon-space .
docker run --rm -p 7860:7860 \
  -e MODEL_REPO=Qwen/Qwen2.5-3B-Instruct-GGUF \
  -e MODEL_FILE=qwen2.5-3b-instruct-q4_k_m.gguf \
  -e N_CTX=4096 \
  -e N_GPU_LAYERS=0 \
  hackathon-space
```

Open http://localhost:7860. Stop with `Ctrl+C`.

To use a pre-downloaded local model inside Docker, mount it and set `MODEL_PATH`:

```bash
docker run --rm -p 7860:7860 \
  -v "$(pwd)/models:/app/models:ro" \
  -e MODEL_PATH=/app/models/qwen2.5-3b-instruct-q4_k_m.gguf \
  hackathon-space
```

---

## Hugging Face Space deployment

This repo uses the **Docker SDK**. The Space card metadata lives in the YAML frontmatter at the top of [README.md](README.md).

### 1. Push code to GitHub

Make sure `main` (or your deploy branch) contains at minimum:

- `Dockerfile`
- `README.md` (with `sdk: docker` and `app_port: 7860`)
- `pyproject.toml`, `uv.lock`
- `apps/gradio-space/` and `libs/inference/`

### 2. Create the Space

1. Go to [build-small-hackathon](https://huggingface.co/build-small-hackathon)
2. **New Space**
3. Name: e.g. `small-model-hackathon`
4. SDK: **Docker**
5. Link your GitHub repo, or push directly to the Space repo

CLI alternative (if you have `hf` installed and org access):

```bash
hf repo create build-small-hackathon/ \
  --repo-type space \
  --space_sdk docker
```

### 3. Configure hardware

| Setting | Recommendation |
|---------|----------------|
| Hardware | **CPU basic** to start (llama.cpp with `N_GPU_LAYERS=0`) |
| Upgrade | GPU Space if you set `N_GPU_LAYERS > 0` for faster inference |

### 4. Set Space environment variables

In the Space **Settings → Variables and secrets**:

| Variable | Value |
|----------|-------|
| `INFERENCE_BACKEND` | `llama_cpp` |
| `MODEL_REPO` | `Qwen/Qwen2.5-3B-Instruct-GGUF` |
| `MODEL_FILE` | `qwen2.5-3b-instruct-q4_k_m.gguf` |
| `N_CTX` | `4096` |
| `N_GPU_LAYERS` | `0` (or higher on GPU hardware) |

### 5. Build and verify

HF builds from the root `Dockerfile` and runs:

```bash
uv run --package gradio-space python -m gradio_space.app
```

Check the **Logs** tab while the Space builds. Once running, open the Space URL and send a test chat message. The first message may take several minutes on CPU while the GGUF downloads.

### 6. Optional: persistent model cache

If cold starts are too slow, attach a **Storage Bucket** in Space settings so downloaded GGUF files survive restarts.
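
To make the interplay between `MODEL_PATH`, `MODEL_REPO`, `MODEL_FILE`, and a persistent cache more concrete, here is a minimal sketch of how a resolver could behave. It is not taken from this repo: the function name `resolve_model_path` and the `MODEL_CACHE_DIR` variable are hypothetical, and the only real library call is `hf_hub_download` from `huggingface_hub`. The actual logic in `libs/inference/` may differ.

```python
import os
from pathlib import Path


def resolve_model_path(env=None):
    """Return a local GGUF path, downloading from the Hub only when needed.

    Hypothetical helper for illustration; not the repo's implementation.
    """
    env = os.environ if env is None else env

    # MODEL_PATH wins: a local file means no Hub download at all.
    local = env.get("MODEL_PATH")
    if local:
        path = Path(local)
        if not path.is_file():
            raise FileNotFoundError(f"MODEL_PATH does not exist: {path}")
        return path

    # Imported lazily so the local-path branch works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    downloaded = hf_hub_download(
        repo_id=env.get("MODEL_REPO", "Qwen/Qwen2.5-3B-Instruct-GGUF"),
        filename=env.get("MODEL_FILE", "qwen2.5-3b-instruct-q4_k_m.gguf"),
        # MODEL_CACHE_DIR is an assumed variable; None uses the default HF cache.
        cache_dir=env.get("MODEL_CACHE_DIR") or None,
    )
    return Path(downloaded)
```

Under these assumptions, pointing the hypothetical `MODEL_CACHE_DIR` at a directory that survives restarts would let the download branch reuse the cached file instead of fetching it again on every cold start.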

---

## Troubleshooting

| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| First chat hangs / slow | GGUF downloading from Hub | Pre-download locally; on Space, wait or use Storage Bucket |
| `Failed to load model` in chat | Wrong `MODEL_REPO` / `MODEL_FILE` | Check env vars match a valid GGUF on Hub |
| Docker build fails on `llama-cpp-python` | Missing build tools | Dockerfile already installs `build-essential` and `cmake` |
| Space build fails | Missing `uv.lock` or README YAML | Ensure `sdk: docker` is in root `README.md` frontmatter |
| `transformers` backend error | Optional deps not installed | Run `uv sync --package inference --extra transformers` |
| Port already in use locally | Another process on 7860 | `PORT=7861 uv run --package gradio-space python -m gradio_space.app` |

---

## Entrypoint summary

All three environments use the same command:

```bash
uv run --package gradio-space python -m gradio_space.app
```

| Environment | How to run |
|-------------|------------|
| Local dev | `uv run --package gradio-space python -m gradio_space.app` |
| Docker | `docker run -p 7860:7860 hackathon-space` |
| HF Space | Built and started automatically from `Dockerfile` `CMD` |
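
Because local dev and Docker both expose the app on the same port, one small readiness check can cover either. The sketch below polls the port until the server answers. The helper name `wait_for_app` is hypothetical, and it assumes the Gradio UI responds to `GET /` with HTTP 200; it only confirms the web server is up, not that the model has loaded.

```python
import http.client
import time


def wait_for_app(host="localhost", port=7860, timeout=60.0, interval=1.0):
    """Poll http://host:port/ until it returns HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        conn = http.client.HTTPConnection(host, port, timeout=5)
        try:
            conn.request("GET", "/")
            if conn.getresponse().status == 200:
                return True
        except (OSError, http.client.HTTPException):
            pass  # Not listening yet, or the response was malformed.
        finally:
            conn.close()
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, after starting the app or the container in another terminal, `wait_for_app(port=7861)` would match the alternate `PORT` shown in the troubleshooting table.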