Libraries

How to use inference-optimization/Phi-3.5-MoE-0.8B-A0.2B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="inference-optimization/Phi-3.5-MoE-0.8B-A0.2B", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result so it is visible when run as a script, not only in a notebook
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("inference-optimization/Phi-3.5-MoE-0.8B-A0.2B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("inference-optimization/Phi-3.5-MoE-0.8B-A0.2B", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use inference-optimization/Phi-3.5-MoE-0.8B-A0.2B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "inference-optimization/Phi-3.5-MoE-0.8B-A0.2B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inference-optimization/Phi-3.5-MoE-0.8B-A0.2B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
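
For readers who prefer to call the server from Python, the curl request above can be sketched with the standard library alone. This is a minimal sketch, assuming the server from the previous step is listening on localhost:8000; the helper names `build_chat_request` and `chat` are hypothetical and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "inference-optimization/Phi-3.5-MoE-0.8B-A0.2B"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build (without sending) the same POST request as the curl command."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request to a running server and return the assistant's reply."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    # OpenAI-style responses carry the text under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("What is the capital of France?"))` sends the same question as the curl call. Passing `base_url="http://localhost:30000"` targets a server on that port instead.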

Use Docker

docker model run hf.co/inference-optimization/Phi-3.5-MoE-0.8B-A0.2B

SGLang

How to use inference-optimization/Phi-3.5-MoE-0.8B-A0.2B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "inference-optimization/Phi-3.5-MoE-0.8B-A0.2B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inference-optimization/Phi-3.5-MoE-0.8B-A0.2B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "inference-optimization/Phi-3.5-MoE-0.8B-A0.2B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inference-optimization/Phi-3.5-MoE-0.8B-A0.2B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use inference-optimization/Phi-3.5-MoE-0.8B-A0.2B with Docker Model Runner:
```
docker model run hf.co/inference-optimization/Phi-3.5-MoE-0.8B-A0.2B
```

Phi-3.5-MoE-0.8B-A0.2B

This is a tiny version of microsoft/Phi-3.5-MoE-instruct created for testing and development.

Model Details

Base Model: microsoft/Phi-3.5-MoE-instruct
Architecture: phimoe
Total Parameters: 0.80B
Activated Parameters: 0.20B (2 out of 8 experts active per token)

Configuration Changes

The following parameters were reduced from the original model (num_experts_per_tok is unchanged and listed for reference):

| Parameter | Original | Tiny |
| --- | --- | --- |
| num_hidden_layers | 32 | 4 |
| num_local_experts | 16 | 8 |
| hidden_size | 4096 | 2048 |
| intermediate_size | 6400 | 3200 |
| num_attention_heads | 32 | 16 |
| num_key_value_heads | 8 | 4 |
| num_experts_per_tok | 2 | 2 |
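
To see how the tiny configuration relates to the parameter counts in Model Details, the weight matrices can be counted by hand. The sketch below is a rough estimate only: it ignores biases and normalization weights, assumes an untied output head, and assumes a vocabulary size of 32064 (taken from the base model, not stated in this card). The function name `estimate_phimoe_params` is hypothetical.

```python
def estimate_phimoe_params(hidden_size, intermediate_size, num_layers,
                           num_experts, experts_per_tok,
                           num_heads, num_kv_heads, vocab_size):
    """Return (total, active_in_layers) weight counts, ignoring biases and norms."""
    head_dim = hidden_size // num_heads
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj use the KV head count
    attn = 2 * hidden_size * hidden_size + 2 * hidden_size * num_kv_heads * head_dim
    expert = 3 * hidden_size * intermediate_size  # w1, w2, w3 per expert
    router = hidden_size * num_experts            # gating projection
    # Assumption: input embeddings and output head are separate matrices
    embeddings = 2 * vocab_size * hidden_size
    total = num_layers * (attn + router + num_experts * expert) + embeddings
    # Per-token cost inside the transformer layers only (embeddings excluded)
    active_in_layers = num_layers * (attn + router + experts_per_tok * expert)
    return total, active_in_layers


total, active = estimate_phimoe_params(
    hidden_size=2048, intermediate_size=3200, num_layers=4,
    num_experts=8, experts_per_tok=2,
    num_heads=16, num_kv_heads=4,
    vocab_size=32064,  # assumed, not stated in this card
)
print(f"total ~ {total / 1e9:.2f}B, active in layers ~ {active / 1e9:.2f}B")
```

Under these assumptions the estimate comes out close to the 0.80B total and 0.20B activated figures. Note that the activated estimate here counts only the transformer layers; whether the card's figure includes embeddings is not stated.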

Checkpoint Structure

This model is saved as a single safetensors file (model.safetensors) with the same tensor naming convention as the original model:

Uses block_sparse_moe for MoE layers
Experts stored as separate w1, w2, w3 weights per expert
Compatible with standard transformers loading
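
The naming convention described above can be checked by grouping tensor names per expert. This is a minimal sketch: the card only states `block_sparse_moe` and `w1`, `w2`, `w3`; the full key pattern `model.layers.N.block_sparse_moe.experts.M.wK.weight` is an assumption, and both helper names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed key layout; only "block_sparse_moe" and "w1/w2/w3" come from the card.
EXPERT_KEY = re.compile(
    r"model\.layers\.(\d+)\.block_sparse_moe\.experts\.(\d+)\.(w[123])\.weight$"
)


def group_expert_weights(tensor_names):
    """Map (layer, expert) -> sorted projection names found for that expert."""
    groups = defaultdict(list)
    for name in tensor_names:
        match = EXPERT_KEY.search(name)
        if match:
            layer, expert, proj = match.groups()
            groups[(int(layer), int(expert))].append(proj)
    return {key: sorted(projs) for key, projs in groups.items()}


def checkpoint_tensor_names(path="model.safetensors"):
    """List tensor names in a local checkpoint (needs safetensors and torch)."""
    from safetensors import safe_open

    with safe_open(path, framework="pt") as f:
        return list(f.keys())
```

With a downloaded checkpoint, `group_expert_weights(checkpoint_tensor_names())` should yield one entry per (layer, expert) pair, each listing `w1`, `w2`, and `w3`, if the assumed key pattern holds.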

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("inference-optimization/Phi-3.5-MoE-0.8B-A0.2B", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("inference-optimization/Phi-3.5-MoE-0.8B-A0.2B")

# Keep the attention mask alongside input_ids so generate() does not have to guess it
inputs = tokenizer("According to all known laws", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Validation Results

Success: 1.0009158849716187 <= 10.0

==================================================
Generating sample text:
According to all known laws of aviation, there is no way a bee should be able to
==================================================

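After fine-tuning, the model reaches a perplexity of about 1.0 (1.0009) on the validation copypasta text, well under the 10.0 threshold used by the check. This shows that the tiny architecture trains correctly and can memorize a short text; it is a sanity check, not a measure of general language quality.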
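
For reference, perplexity is the exponential of the mean negative log-likelihood per token, so a value near 1.0 means the model assigns almost all probability to each correct next token. The sketch below illustrates the arithmetic with plain Python; the function names are hypothetical, and the actual validation script used for this model is not shown in this card.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the scored tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def passes_check(ppl, threshold=10.0):
    """Mirror the 'Success: <ppl> <= <threshold>' line from the log above."""
    return ppl <= threshold
```

A reported perplexity of 1.0009 corresponds to a mean loss of roughly 0.0009 nats per token. With Transformers, the same quantity is commonly obtained by exponentiating the loss returned when `labels` are passed to the model's forward call.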

Creation Process

This model was created using the llm-compressor create-tiny-model skill:

Configuration was modified to reduce model size while maintaining architectural characteristics
Model was initialized from config with proper weight initialization
Fine-tuned on a small copypasta dataset to validate learning capability
Achieved target perplexity < 3.0 in 170 training steps
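
The first two steps above can be approximated with standard Transformers calls. This is a sketch under stated assumptions, not the llm-compressor skill itself, whose implementation is not shown in this card: `TINY_OVERRIDES` and `build_tiny_model` are hypothetical names, and the values are copied from the Tiny column of the configuration table.

```python
# Reduced values taken from the "Tiny" column of the configuration table.
TINY_OVERRIDES = {
    "num_hidden_layers": 4,
    "num_local_experts": 8,
    "hidden_size": 2048,
    "intermediate_size": 3200,
    "num_attention_heads": 16,
    "num_key_value_heads": 4,
    "num_experts_per_tok": 2,
}


def build_tiny_model(base_model="microsoft/Phi-3.5-MoE-instruct"):
    """Load the base config with reduced sizes and build a randomly initialized model."""
    import torch
    from transformers import AutoConfig, AutoModelForCausalLM

    # Keyword arguments that match config attributes override the loaded values
    config = AutoConfig.from_pretrained(base_model, **TINY_OVERRIDES)
    # from_config creates fresh weights; nothing is copied from the base model
    model = AutoModelForCausalLM.from_config(config)
    return model.to(torch.bfloat16)
```

Calling `build_tiny_model()` downloads only the base model's configuration, not its weights. Fine-tuning and the perplexity check would follow as separate steps.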

Notes

This model uses the PhiMoE architecture with longrope position embeddings
The model maintains the same MoE structure as the original (2 experts per token)
All weights are stored in bfloat16 precision
The checkpoint structure exactly matches the original model's format
