🔮 Oracle of the Ternary Flame — Merged
This is the merged version of the LoRA-fine-tuned Oracle — the base model google/gemma-4-12b-it with the Oracle LoRA adapter permanently merged into the weights.
Format: bfloat16 safetensors (~24 GB) — for GGUF (Q4_K_M, ~7 GB), see the GGUF repo.
Built for the Build Small Hackathon (June 5–15, 2026) by @keypa.
What it does
Ask it anything. It will answer — but never directly.
Q: Should I change my career?
Oracle: The career is not a straight line but a landscape you have been walking across, and sometimes the path you are on leads to a cliff or a dead end. Changing careers is not failure but a recognition that the map you were following no longer matches the terrain.
Q: Pourquoi suis-je si fatigué ?
Oracle: La fatigue n'est pas toujours un manque de sommeil mais un trop-plein d'être — trop de décisions, trop de masques, trop de présence dans des lieux où vous ne devriez pas être.
Why a merged version?
The LoRA adapter is only 262 MB and loads dynamically, but on HF Spaces with ZeroGPU, loading both the base model (~24 GB) and the adapter at startup adds latency. The merged version bakes the LoRA weights directly into the model, so you can load it as a single file with AutoModelForCausalLM and skip the separate adapter step.
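To make "baking in" concrete: for each adapted weight matrix W, a LoRA adapter stores two small matrices A and B, and merging replaces W with W + (alpha / r) * B * A. The toy sketch below uses pure Python on tiny matrices (rank 2 instead of the rank 16 used here) and checks that the merged matrix gives the same output as applying the adapter at runtime. The shapes, values, and the `lora_alpha` setting are illustrative assumptions, not taken from the Oracle adapter.

```python
# Toy illustration of merging a LoRA adapter into one weight matrix.
# All numbers are made up; the real adapter is rank 16 on far larger matrices.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]

def merge_lora(weight, lora_a, lora_b, lora_alpha, rank):
    """Return W + (lora_alpha / rank) * B @ A."""
    return add(weight, scale(matmul(lora_b, lora_a), lora_alpha / rank))

rank, lora_alpha = 2, 4                                        # hypothetical settings
weight = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]   # W: out x in
lora_a = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]                    # A: rank x in
lora_b = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]                  # B: out x rank
x = [[1.0], [2.0], [3.0]]                                      # input column vector

merged = merge_lora(weight, lora_a, lora_b, lora_alpha, rank)

# Path 1: base weights plus adapter applied at runtime.
runtime = add(
    matmul(weight, x),
    scale(matmul(lora_b, matmul(lora_a, x)), lora_alpha / rank),
)
# Path 2: a single matrix multiply with the merged weights.
baked = matmul(merged, x)

print(baked == runtime)
```

Both paths produce the same vector, which is why the merged checkpoint can drop the adapter entirely. For this model the real merge was done with `peft.PeftModel.merge_and_unload()`, as listed under Technical details.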
Usage
Quick inference (HF Space / ZeroGPU)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model = AutoModelForCausalLM.from_pretrained(
    "keypa/oracle-gemma4-12b",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("keypa/oracle-gemma4-12b")
# Then generate as usual with the Oracle system prompt
```
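A minimal sketch of the "generate as usual" step. The actual Oracle system prompt is not reproduced in this card, so `ORACLE_SYSTEM_PROMPT` is a hypothetical placeholder; the helper names and `max_new_tokens` value are also assumptions. It expects the `model` and `tokenizer` loaded as shown in this section, and assumes the model's chat template accepts a system role.

```python
# Placeholder only: replace with the real Oracle system prompt.
ORACLE_SYSTEM_PROMPT = "<Oracle system prompt goes here>"

def build_messages(question, system_prompt=ORACLE_SYSTEM_PROMPT):
    """Build a chat in the role/content format used by chat templates."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]

def ask_oracle(model, tokenizer, question, max_new_tokens=200):
    """Generate one reply and decode only the newly generated tokens."""
    inputs = tokenizer.apply_chat_template(
        build_messages(question),
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Skip the prompt tokens so only the answer is returned.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Example (requires the model and tokenizer to be loaded first):
# print(ask_oracle(model, tokenizer, "Should I change my career?"))
```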
For GGUF / llama.cpp
Use the GGUF repo with llama-cpp-python for pure CPU inference.
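A rough sketch of CPU inference with llama-cpp-python, assuming your installed build supports this model's architecture. The `filename` glob is a guess at how the Q4_K_M file is named in the GGUF repo, and the context size, token limit, and system prompt are placeholders; check the repo's file list before running.

```python
GGUF_REPO = "keypa/oracle-gemma4-12b-GGUF"
GGUF_FILE_PATTERN = "*Q4_K_M.gguf"  # assumed file naming, verify in the repo

def load_oracle_gguf(n_ctx=4096):
    """Download (if needed) and load the quantized model on CPU."""
    # Imported lazily; needs: pip install llama-cpp-python huggingface_hub
    from llama_cpp import Llama
    return Llama.from_pretrained(
        repo_id=GGUF_REPO,
        filename=GGUF_FILE_PATTERN,
        n_ctx=n_ctx,
        n_gpu_layers=0,  # keep every layer on the CPU
    )

def extract_reply(response):
    """Pull the assistant text out of an OpenAI-style chat completion dict."""
    return response["choices"][0]["message"]["content"].strip()

def ask(llm, question, system_prompt="<Oracle system prompt goes here>"):
    """Send one question and return the reply text."""
    response = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        max_tokens=200,
    )
    return extract_reply(response)

# Example:
# llm = load_oracle_gguf()
# print(ask(llm, "Pourquoi suis-je si fatigué ?"))
```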
Technical details
| Field | Value |
|---|---|
| Base model | google/gemma-4-12b-it |
| Fine-tuning | LoRA rank 16 via Unsloth + TRL |
| Merge method | peft.PeftModel.merge_and_unload() |
| Precision | bfloat16 |
| Format | safetensors (single file) |
| Languages | English & French |
| License | Gemma |
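The sizes quoted in this card follow from back-of-the-envelope arithmetic. The sketch below assumes roughly 12 billion parameters (inferred from the model name) and about 4.8 bits per weight for Q4_K_M, which is a commonly cited approximation and not an exact figure; real files also carry metadata and keep some tensors at other precisions.

```python
def checkpoint_size_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 12e9          # assumption: about 12B parameters
Q4_K_M_BITS = 4.8          # assumption: approximate average bits per weight

bf16_gb = checkpoint_size_gb(NUM_PARAMS, 16)
q4_k_m_gb = checkpoint_size_gb(NUM_PARAMS, Q4_K_M_BITS)

print(f"bfloat16: ~{bf16_gb:.0f} GB, Q4_K_M: ~{q4_k_m_gb:.1f} GB")
```

Under these assumptions the estimates come out at about 24 GB and about 7 GB, in line with the figures given for the safetensors and GGUF versions.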
Links
- Live demo: HF Space
- LoRA adapter: keypa/oracle-gemma4-12b-lora
- GGUF quant: keypa/oracle-gemma4-12b-GGUF