Libraries

How to use josephmayo/Holo-3.1-4B-Coder with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="josephmayo/Holo-3.1-4B-Coder")
messages = [
    {"role": "user", "content": "Who are you?"},
]
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("josephmayo/Holo-3.1-4B-Coder")
model = AutoModelForCausalLM.from_pretrained("josephmayo/Holo-3.1-4B-Coder")
messages = [
    {"role": "user", "content": "Who are you?"},
]
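inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, dropping the prompt and special tokens
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))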

Local Apps

vLLM

How to use josephmayo/Holo-3.1-4B-Coder with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "josephmayo/Holo-3.1-4B-Coder"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "josephmayo/Holo-3.1-4B-Coder",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
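
The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `chat`, the `max_tokens` value, and the timeout are illustrative choices, not part of the model card.

```python
import json
import urllib.request

MODEL_ID = "josephmayo/Holo-3.1-4B-Coder"


def build_chat_request(prompt, base_url="http://localhost:8000", max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # assumed cap, adjust as needed
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `print(chat("Write a Python function that reverses a string."))` prints the reply. Passing `base_url="http://localhost:30000"` targets a server on port 30000 instead.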

Use Docker

docker model run hf.co/josephmayo/Holo-3.1-4B-Coder

SGLang

How to use josephmayo/Holo-3.1-4B-Coder with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "josephmayo/Holo-3.1-4B-Coder" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "josephmayo/Holo-3.1-4B-Coder",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "josephmayo/Holo-3.1-4B-Coder" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "josephmayo/Holo-3.1-4B-Coder",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use josephmayo/Holo-3.1-4B-Coder with Docker Model Runner:
```
docker model run hf.co/josephmayo/Holo-3.1-4B-Coder
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Holo-3.1-4B Coding Merged Model

Overview

This repository contains a merged Transformers checkpoint produced from Hcompany/Holo-3.1-4B and the companion coding LoRA adapter. It is intended for users who prefer loading a standard merged model rather than applying a PEFT adapter at runtime.

What Is Included

Merged model weights in sharded safetensors format.
Model configuration and generation configuration.
Tokenizer and chat template files.
A model card summarizing the measured coding adaptation result.

Training And Evaluation Summary

The underlying adapter was trained with supervised fine-tuning on curated coding instruction data. Evaluation used an 80-task held-out greedy decoding probe drawn from HumanEval-style and MBPP-style tasks.

Measured result on the held-out probe:

Base model: 24 / 80 tasks passed.
Adapted model: 31 / 80 tasks passed.
Relative lift over the measured base result: 29.17%.
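
For reference, the arithmetic behind these figures can be reproduced as follows. The variable names are illustrative; the counts are the ones reported above.

```python
base_passed, adapted_passed, total_tasks = 24, 31, 80

base_rate = base_passed / total_tasks          # fraction of tasks the base model passed
adapted_rate = adapted_passed / total_tasks    # fraction of tasks the adapted model passed
absolute_gain = adapted_rate - base_rate       # difference in pass rate
relative_lift = (adapted_passed - base_passed) / base_passed

print(f"base pass rate:    {base_rate:.2%}")      # 30.00%
print(f"adapted pass rate: {adapted_rate:.2%}")   # 38.75%
print(f"absolute gain:     {absolute_gain:.2%}")  # 8.75% (percentage points)
print(f"relative lift:     {relative_lift:.2%}")  # 29.17%
```

The relative lift is measured against the base model's 24 passes, so 7 additional tasks correspond to 29.17%, while the pass rate itself rises by 8.75 percentage points.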

The merged model should match the adapter-applied behavior, subject to normal numerical and runtime differences.

Intended Use

Use this checkpoint for coding assistance experiments, Python function generation, lightweight algorithmic problem solving, and local inference workflows that expect standard Transformers model files.

Known Limitations

The evaluation probe is small and does not cover all programming languages or repository-scale workflows.
The model can produce incorrect code, incomplete reasoning, or solutions that fail edge cases.
Generated code should be reviewed, tested, and sandboxed where appropriate.
The checkpoint inherits limitations and licensing terms from the base model and adaptation data sources.
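
One lightweight way to test generated code before trusting it is to run it in a separate interpreter process with a timeout. The sketch below assumes the generated code and its checks are plain Python source strings; `passes_tests` is a hypothetical helper, and process isolation with a timeout is not a security sandbox, so untrusted code still calls for a container or VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_src, test_src, timeout_s=10.0):
    """Return True if candidate_src followed by test_src exits with status 0."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(candidate_src + "\n\n" + test_src + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Treat hangs and infinite loops as failures
            return False
    return result.returncode == 0
```

For example, `passes_tests("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")` returns `True`, while a candidate that fails the assertion or exceeds the timeout returns `False`.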

File List

model-00001-of-00009.safetensors through model-00009-of-00009.safetensors: merged model shards.
model.safetensors.index.json: shard index.
config.json, generation_config.json: model configuration files.
tokenizer.json, tokenizer_config.json, chat_template.jinja: tokenizer/chat assets.
README.md: this model card.
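
After a manual download, it can be useful to confirm that every shard referenced by the index is present. The sketch below relies on the `weight_map` field of `model.safetensors.index.json`, which maps tensor names to shard filenames; the function names are illustrative.

```python
import json
from pathlib import Path


def missing_shards(index, present_files):
    """Return shard filenames referenced by the index but absent from present_files."""
    referenced = set(index["weight_map"].values())
    return sorted(referenced - set(present_files))


def check_checkpoint_dir(model_dir):
    """Read the shard index in model_dir and list any shards that are missing."""
    root = Path(model_dir)
    index_path = root / "model.safetensors.index.json"
    index = json.loads(index_path.read_text(encoding="utf-8"))
    return missing_shards(index, [p.name for p in root.iterdir()])
```

An empty list from `check_checkpoint_dir("path/to/checkpoint")` means all referenced shards were found in that directory.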

Reproducibility And Provenance

The model was produced by merging a PEFT LoRA coding adapter into Hcompany/Holo-3.1-4B and saving the result as sharded safetensors. Companion evaluation and training provenance artifacts are available in the LoRA repository.
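
A merge of this kind can be sketched with the `peft` library as shown below. This is not the author's actual script: `ADAPTER_ID` is a hypothetical placeholder for the LoRA repository, the `max_shard_size` value and the float32 dtype are assumptions, and the sketch assumes the base model loads through `AutoModelForCausalLM`.

```python
BASE_ID = "Hcompany/Holo-3.1-4B"
ADAPTER_ID = "path/or/repo-of-the-coding-lora"  # hypothetical placeholder
OUTPUT_DIR = "holo-3.1-4b-coder-merged"         # hypothetical output folder


def merge_adapter(base_id=BASE_ID, adapter_id=ADAPTER_ID, output_dir=OUTPUT_DIR):
    """Fold LoRA weights into the base model and save sharded safetensors."""
    # Imported lazily so this module loads even without the heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float32)
    # merge_and_unload() folds the adapter deltas into the base weights
    merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()
    merged.save_pretrained(
        output_dir,
        safe_serialization=True,  # write .safetensors files
        max_shard_size="2GB",     # assumed shard size
    )
    AutoTokenizer.from_pretrained(base_id).save_pretrained(output_dir)
    return output_dir
```

The resulting folder can then be loaded with `from_pretrained` like any standard Transformers checkpoint, with no PEFT dependency at inference time.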

Safetensors

Model size: 4B params
Tensor type: F32

Model tree for josephmayo/Holo-3.1-4B-Coder

Base model: Qwen/Qwen3.5-0.8B-Base
Finetuned: Qwen/Qwen3.5-0.8B
Finetuned: Hcompany/Holo-3.1-4B
Finetuned: this model

Quantizations: 1 model
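
The model size and tensor type give a rough estimate of the memory needed for the weights alone. The sketch below treats "4B params" as a nominal 4e9, which is a rounding assumption, and ignores activations, KV cache, and framework overhead.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in GB (10**9 bytes) for a given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


nominal_params = 4e9  # "4B params" is a rounded figure
for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weight_memory_gb(nominal_params, dtype):.0f} GB")
```

Under these assumptions the F32 weights take roughly 16 GB, and loading them in a 16-bit dtype would take roughly half of that.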