Instructions to use treadon/mlx-nucleus-image with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use treadon/mlx-nucleus-image with MLX:
# Download the model from the Hub pip install huggingface_hub[hf_xet] huggingface-cli download --local-dir mlx-nucleus-image treadon/mlx-nucleus-image
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
| library_name: mlx | |
| tags: | |
| - mlx | |
| - text-to-image | |
| - image-generation | |
| - mixture-of-experts | |
| - dit | |
| - apple-silicon | |
| - nucleus-image | |
| base_model: NucleusAI/Nucleus-Image | |
| license: apache-2.0 | |
| pipeline_tag: text-to-image | |
| # MLX Nucleus-Image | |
| An [MLX](https://github.com/ml-explore/mlx) port of [NucleusAI/Nucleus-Image](https://huggingface.co/NucleusAI/Nucleus-Image), a 17B parameter Mixture-of-Experts (MoE) DiT for text-to-image generation. Runs natively on Apple Silicon. | |
| ## Sample Outputs (512x512, 30 steps, CFG 4.0, 4-bit) | |
| | "A red apple on a white table" | "A golden retriever puppy playing in autumn leaves" | "A futuristic city skyline at sunset" | | |
| |:---:|:---:|:---:| | |
| | <img src="samples/apple.png" width="256"> | <img src="samples/puppy.png" width="256"> | <img src="samples/city.png" width="256"> | | |
| | "A steaming cup of coffee on a rainy windowsill" | "An astronaut riding a horse on the moon" | | |
| |:---:|:---:| | |
| | <img src="samples/coffee.png" width="256"> | <img src="samples/astronaut.png" width="256"> | | |
| ## Quick Start | |
| ### 1. Clone the repo | |
| ```bash | |
| git clone https://huggingface.co/treadon/mlx-nucleus-image | |
| cd mlx-nucleus-image | |
| ``` | |
| ### 2. Install dependencies | |
| ```bash | |
| pip install mlx torch transformers huggingface_hub pillow | |
| ``` | |
| ### 3. Generate an image | |
| ```bash | |
| python generate.py --prompt "A red apple on a white table" --seed 42 | |
| ``` | |
| The first run downloads ~34GB of weights (cached for subsequent runs). | |
| ### More options | |
| ```bash | |
| python generate.py \ | |
| --prompt "A futuristic city skyline at sunset" \ | |
| --height 512 --width 512 \ | |
| --steps 30 --cfg 4.0 \ | |
| --seed 42 --output city.png \ | |
| --quantize 4 | |
| ``` | |
| | Flag | Default | Description | | |
| |------|---------|-------------| | |
| | `--prompt` | required | Text prompt | | |
| | `--height` | 512 | Image height | | |
| | `--width` | 512 | Image width | | |
| | `--steps` | 50 | Denoising steps | | |
| | `--cfg` | 4.0 | Guidance scale | | |
| | `--seed` | random | Random seed | | |
| | `--output` | output.png | Output file | | |
| | `--quantize` | 4 | Quantization bits (4, 8, or None) | | |
| ## Architecture | |
| | Component | Details | | |
| |-----------|---------| | |
| | DiT | 17B total params, ~2B active per token | | |
| | Layers | 32 (3 dense + 29 MoE) | | |
| | Experts | 64 routed + 1 shared per MoE layer | | |
| | Routing | Expert-choice (capacity-based) | | |
| | Attention | GQA: 16 query / 4 KV heads, head_dim=128 | | |
| | Text Encoder | Qwen3-VL-8B-Instruct (PyTorch, hybrid) | | |
| | VAE | AutoencoderKLQwenImage, 16-ch latents | | |
| | Scheduler | Flow Matching Euler | | |
| ## Performance (M4 Pro 64GB) | |
| | Resolution | Steps | CFG | Quantization | Time | | |
| |-----------|-------|-----|-------------|------| | |
| | 256x256 | 20 | 4.0 | 4-bit | ~54s | | |
| | 512x512 | 20 | 4.0 | 4-bit | ~70s | | |
| | 512x512 | 30 | 4.0 | 4-bit | ~100s | | |
| ## How It Works | |
| The port is a hybrid approach: | |
| - **Text encoder** stays in PyTorch (Qwen3-VL-8B, ~16GB). Loaded once to extract embeddings, then freed. | |
| - **DiT** (17B MoE) runs in MLX with optional 4-bit quantization for attention/modulation layers. Expert weights stay in bfloat16. | |
| - **VAE decoder** runs in MLX (~50MB converted). Original CausalConv3d weights converted to Conv2d. | |
| ### Key Conversion Details | |
| | PyTorch | MLX | Notes | | |
| |---------|-----|-------| | |
| | CausalConv3d (5D weights) | Conv2d (last temporal slice) | Causal padding means only `kernel[:,:,-1,:,:]` matters | | |
| | SwiGLU activation | `value * silu(gate)` (dense), `silu(gate) * up` (experts) | Different split conventions! | | |
| | NucleusMoEEmbedRope (complex polar) | cos/sin decomposition | `scale_rope=True`: centered positions [-H/2..H/2] | | |
| | Expert-choice MoE routing | argsort + indicator matrix scatter | Each expert picks top-C tokens | | |
| | AdaLayerNormContinuous | LayerNorm(affine=False) + adaptive scale/shift | scale first, shift second | | |
| | Timesteps(scale=1000) | `timestep_embedding(sigma * 1000)` | Pipeline normalizes t/1000 before DiT | | |
| ## Python API | |
| ```python | |
| import torch | |
| import mlx.core as mx | |
| from transformers import AutoProcessor, AutoModel | |
| from nucleus_image import NucleusImagePipeline | |
| # Encode text (PyTorch — loaded once, then freed) | |
| processor = AutoProcessor.from_pretrained("NucleusAI/Nucleus-Image", subfolder="processor", trust_remote_code=True) | |
| text_model = AutoModel.from_pretrained("NucleusAI/Nucleus-Image", subfolder="text_encoder", dtype=torch.bfloat16, trust_remote_code=True).eval() | |
| SYSTEM = "You are an image generation assistant." | |
| def encode(prompt): | |
| messages = [{"role": "system", "content": SYSTEM}, {"role": "user", "content": [{"type": "text", "text": prompt}]}] | |
| formatted = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) | |
| inputs = processor(text=[formatted], return_tensors="pt", padding=True) | |
| with torch.no_grad(): | |
| out = text_model(input_ids=inputs["input_ids"], attention_mask=inputs.get("attention_mask"), | |
| output_hidden_states=True, use_cache=False) | |
| return mx.array(out.hidden_states[-8][0].cpu().float().numpy()) | |
| text_emb = encode("A red apple on a white table") | |
| neg_emb = encode("") # empty string for proper CFG | |
| del text_model, processor # free ~16GB | |
| # Generate (MLX) | |
| pipe = NucleusImagePipeline.from_pretrained(quantize=4) | |
| img = pipe.generate( | |
| text_embeddings=mx.expand_dims(text_emb, 0), | |
| neg_text_embeddings=mx.expand_dims(neg_emb, 0), | |
| height=512, width=512, num_inference_steps=30, guidance_scale=4.0, seed=42, | |
| ) | |
| img.save("output.png") | |
| ``` | |
| ## Files | |
| ``` | |
| generate.py — CLI entry point | |
| samples/ — Pre-generated sample images | |
| nucleus_image/ | |
| dit.py — 17B MoE DiT | |
| vae.py — VAE decoder (Conv3d→Conv2d) | |
| pipeline.py — End-to-end pipeline | |
| scheduler.py — Flow matching Euler scheduler | |
| dit/ — Pre-converted DiT weights (safetensors) | |
| vae/ — Pre-converted VAE weights (safetensors) | |
| ``` | |
| ## Acknowledgments | |
| - [NucleusAI](https://huggingface.co/NucleusAI) for the original Nucleus-Image model | |
| - [Apple MLX](https://github.com/ml-explore/mlx) framework | |
| - Source code: [github.com/treadon/mlx-nucleus-image](https://github.com/treadon/mlx-nucleus-image) | |
| - Built by [@treadon](https://x.com/treadon) | |