Instructions to use josephmayo/Holo-3.1-4B-Coder with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use josephmayo/Holo-3.1-4B-Coder with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-generation", model="josephmayo/Holo-3.1-4B-Coder")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    pipe(messages)

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("josephmayo/Holo-3.1-4B-Coder")
    model = AutoModelForCausalLM.from_pretrained("josephmayo/Holo-3.1-4B-Coder")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=40)
    # Decode only the newly generated tokens, skipping the prompt
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use josephmayo/Holo-3.1-4B-Coder with vLLM:
Install from pip and serve model

    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "josephmayo/Holo-3.1-4B-Coder"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "josephmayo/Holo-3.1-4B-Coder",
        "messages": [
          {
            "role": "user",
            "content": "What is the capital of France?"
          }
        ]
      }'

Use Docker

    docker model run hf.co/josephmayo/Holo-3.1-4B-Coder
- SGLang
How to use josephmayo/Holo-3.1-4B-Coder with SGLang:
Install from pip and serve model

    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
      --model-path "josephmayo/Holo-3.1-4B-Coder" \
      --host 0.0.0.0 \
      --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "josephmayo/Holo-3.1-4B-Coder",
        "messages": [
          {
            "role": "user",
            "content": "What is the capital of France?"
          }
        ]
      }'

Use Docker images

    docker run --gpus all \
      --shm-size 32g \
      -p 30000:30000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      --env "HF_TOKEN=<secret>" \
      --ipc=host \
      lmsysorg/sglang:latest \
      python3 -m sglang.launch_server \
        --model-path "josephmayo/Holo-3.1-4B-Coder" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "josephmayo/Holo-3.1-4B-Coder",
        "messages": [
          {
            "role": "user",
            "content": "What is the capital of France?"
          }
        ]
      }'

- Docker Model Runner
How to use josephmayo/Holo-3.1-4B-Coder with Docker Model Runner:
docker model run hf.co/josephmayo/Holo-3.1-4B-Coder
Holo-3.1-4B Coding Merged Model
Overview
This repository contains a merged Transformers checkpoint produced from Hcompany/Holo-3.1-4B and the companion coding LoRA adapter. It is intended for users who prefer loading a standard merged model rather than applying a PEFT adapter at runtime.
What Is Included
- Merged model weights in sharded safetensors format.
- Model configuration and generation configuration.
- Tokenizer and chat template files.
- A model card summarizing the measured coding adaptation result.
Training And Evaluation Summary
The underlying adapter was trained with supervised fine-tuning on curated coding instruction data. Evaluation used an 80-task held-out greedy decoding probe drawn from HumanEval-style and MBPP-style tasks.
Measured result on the held-out probe:
- Base model: 24 / 80 tasks passed.
- Adapted model: 31 / 80 tasks passed.
- Relative lift over the measured base result: 29.17%.
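The figures above can be reproduced with a few lines of arithmetic. The sketch below uses only the counts reported in this card (24, 31, and 80); the helper names are illustrative and not part of any evaluation harness.

```python
# Recompute the pass rates and relative lift from the reported counts.
# Helper names are illustrative, not taken from the evaluation code.

def pass_rate(passed: int, total: int) -> float:
    """Fraction of tasks passed."""
    return passed / total


def relative_lift(base_passed: int, adapted_passed: int) -> float:
    """Improvement of the adapted model relative to the base model's pass count."""
    return (adapted_passed - base_passed) / base_passed


TOTAL_TASKS = 80
BASE_PASSED = 24
ADAPTED_PASSED = 31

base_rate = pass_rate(BASE_PASSED, TOTAL_TASKS)
adapted_rate = pass_rate(ADAPTED_PASSED, TOTAL_TASKS)
lift = relative_lift(BASE_PASSED, ADAPTED_PASSED)

print(f"Base pass rate:    {base_rate:.2%}")
print(f"Adapted pass rate: {adapted_rate:.2%}")
print(f"Absolute gain:     {(adapted_rate - base_rate) * 100:.2f} percentage points")
print(f"Relative lift:     {lift:.2%}")
```

Note that the 29.17% figure is relative to the base pass count (7 additional tasks over 24); the absolute gain in pass rate is 8.75 percentage points.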
The merged model should match the adapter-applied behavior, subject to normal numerical and runtime differences.
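To illustrate why merged and adapter-applied behavior should agree, here is a minimal pure-Python sketch of the standard LoRA merge for a single linear layer. It assumes the common convention `W_merged = W + (alpha / r) * B @ A`; the matrix sizes, rank, and alpha below are hypothetical toy values, not the settings used for this adapter.

```python
import random

# Toy demonstration: applying a LoRA update at runtime versus merging it
# into the weight matrix. All dimensions and hyperparameters are hypothetical.


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_out, d_in, rank, alpha = 4, 6, 2, 4.0
scale = alpha / rank

W = rand_matrix(d_out, d_in, rng)   # frozen base weight
B = rand_matrix(d_out, rank, rng)   # LoRA "up" projection
A = rand_matrix(rank, d_in, rng)    # LoRA "down" projection
x = [rng.uniform(-1.0, 1.0) for _ in range(d_in)]

# Adapter applied at runtime: y = W x + scale * B (A x)
base_out = matvec(W, x)
lora_out = matvec(B, matvec(A, x))
y_adapter = [b + scale * l for b, l in zip(base_out, lora_out)]

# Adapter merged into the weights: W' = W + scale * (B A), then y = W' x
delta = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
y_merged = matvec(W_merged, x)

max_diff = max(abs(a - b) for a, b in zip(y_adapter, y_merged))
print(f"Largest difference between the two outputs: {max_diff:.3e}")
```

The two outputs agree up to floating-point rounding, because the operations are algebraically identical but evaluated in a different order. In a full model, lower-precision dtypes and different kernels can make such differences somewhat larger.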
Intended Use
Use this checkpoint for coding assistance experiments, Python function generation, lightweight algorithmic problem solving, and local inference workflows that expect standard Transformers model files.
Known Limitations
- The evaluation probe is small and does not cover all programming languages or repository-scale workflows.
- The model can produce incorrect code, incomplete reasoning, or solutions that fail edge cases.
- Generated code should be reviewed, tested, and sandboxed where appropriate.
- The checkpoint inherits limitations and licensing terms from the base model and adaptation data sources.
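As a rough illustration of the "tested and sandboxed" advice, the sketch below runs a generated snippet together with its assertions in a separate Python process with a timeout. The function name `run_candidate` and the sample snippets are hypothetical. A subprocess with a timeout only limits runtime and isolates interpreter state; it is not a security sandbox, so untrusted code should still run inside a container or VM.

```python
import os
import subprocess
import sys
import tempfile


def run_candidate(code: str, test_code: str, timeout_s: float = 5.0) -> bool:
    """Return True if `code` followed by `test_code` exits cleanly in a child process.

    This is a convenience check, not a security boundary.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(code + "\n\n" + test_code + "\n")
        try:
            result = subprocess.run(
                [sys.executable, "-I", path],  # -I: isolated mode, ignores user site and env vars
                cwd=tmp,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Treat hangs and infinite loops as failures
            return False
        return result.returncode == 0


# Hypothetical model outputs for a trivial task
good_solution = "def add(a, b):\n    return a + b"
bad_solution = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"

print(run_candidate(good_solution, tests))
print(run_candidate(bad_solution, tests))
```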
File List
- model-00001-of-00009.safetensors through model-00009-of-00009.safetensors: merged model shards.
- model.safetensors.index.json: shard index.
- config.json, generation_config.json: model configuration files.
- tokenizer.json, tokenizer_config.json, chat_template.jinja: tokenizer/chat assets.
- README.md: this model card.
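After downloading the repository, a quick completeness check can catch a missing shard before loading. This is a small sketch based only on the file names listed above; the helper names and the local directory path are hypothetical.

```python
from pathlib import Path

NUM_SHARDS = 9  # from the file list: model-00001-of-00009 through model-00009-of-00009


def expected_files(num_shards: int = NUM_SHARDS) -> list[str]:
    """Build the file names this model card lists for the repository."""
    shards = [
        f"model-{i:05d}-of-{num_shards:05d}.safetensors"
        for i in range(1, num_shards + 1)
    ]
    other = [
        "model.safetensors.index.json",
        "config.json",
        "generation_config.json",
        "tokenizer.json",
        "tokenizer_config.json",
        "chat_template.jinja",
        "README.md",
    ]
    return shards + other


def missing_files(model_dir: str) -> list[str]:
    """Return the expected files that are absent from a local directory."""
    root = Path(model_dir)
    return [name for name in expected_files() if not (root / name).is_file()]


if __name__ == "__main__":
    # "./Holo-3.1-4B-Coder" is a hypothetical local download path
    for name in missing_files("./Holo-3.1-4B-Coder"):
        print(f"missing: {name}")
```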
Reproducibility And Provenance
The model was produced by merging a PEFT LoRA coding adapter into Hcompany/Holo-3.1-4B and saving the result as sharded safetensors. Companion evaluation and training provenance artifacts are available in the LoRA repository.