LocateAnything-3B-MLX
Converted and optimized by Somesh Choudhary (CTO, And AI Platforms).
This repository contains the weights for nvidia/LocateAnything-3B converted to the MLX format for high-performance, local inference on Apple Silicon (M1/M2/M3/M4 Macs).
Model Overview
LocateAnything is a 3-billion-parameter vision-language model (VLM) developed by NVIDIA for open-vocabulary object localization and visual grounding. Given a natural language query, it locates the matching objects or regions of interest in an image.
This MLX conversion adjusts the image-processing configuration, tuning the processing limits (in_token_limit=1024) to fit local memory constraints. The author reports up to 20x faster inference and a significantly lower memory footprint.
🚀 Installation & Setup
Before running the model, make sure the required packages are installed in your Python environment:
pip install mlx-vlm transformers pillow opencv-python
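A quick way to confirm the environment is ready is to check that each package can be found by the interpreter. The sketch below is only an illustration and not part of the original instructions: the helper name `missing_packages` is hypothetical, and the import names (`mlx_vlm`, `transformers`, `PIL`, `cv2`) are the modules provided by the four pip packages above.

```python
import importlib.util

# Import names provided by: mlx-vlm, transformers, pillow, opencv-python
REQUIRED_MODULES = ["mlx_vlm", "transformers", "PIL", "cv2"]


def missing_packages(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing modules:", missing or "none")
```

An empty result means all four packages are installed; any name that is listed should be installed with pip before continuing.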
💻 Quickstart Inference (Python API)
You can load and query the model using the mlx_vlm library.
from PIL import Image
from mlx_vlm import load, generate
# 1. Load the model and processor
model_path = "andai-labs/LocateAnything-3B-MLX"
model, processor = load(model_path, trust_remote_code=True)
# 2. Prepare the image and prompt
image = Image.open("path/to/your/image.jpg")
prompt = "Locate the person."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt}
        ]
    }
]
# 3. Format the chat template
text_prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# 4. Generate the localization bounding box
response = generate(model, processor, text_prompt, image=image, max_tokens=100)
print("Model Output:", response.text)
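To query several targets in one session, the message structure from step 2 can be built by a small helper. This is a minimal sketch: `build_messages` and `build_queries` are hypothetical names, and the "Locate the <label>." phrasing simply mirrors the example prompt above; whether other phrasings work equally well is not confirmed here.

```python
def build_messages(prompt):
    """Wrap a text prompt in the single-image chat structure used in the quickstart."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def build_queries(labels):
    """Map each label to chat messages (assumed prompt pattern: 'Locate the <label>.')."""
    return {label: build_messages(f"Locate the {label}.") for label in labels}


queries = build_queries(["person", "dog"])
print(queries["dog"][0]["content"][1]["text"])  # prints: Locate the dog.
```

Each messages list can then be passed to processor.apply_chat_template exactly as in step 3, followed by a generate call per query.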
🏷️ Understanding the Output Format
The model returns tags containing target labels and coordinate values normalized to a range of 0 to 1000 in the format <ymin><xmin><ymax><xmax>:
- Example Output:
<ref>Locate the person</ref><box><247><220><757><1000></box>
- Coordinate Conversion (Python):
To project these coordinates back onto your original image shape (height, width):
import re

# Size of the original image; PIL reports it as (width, height)
img_width, img_height = image.size

# Parse coordinates
pattern = r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>"
match = re.search(pattern, response.text)
if match:
    ymin, xmin, ymax, xmax = [int(g) for g in match.groups()]
    # Project back to image resolution
    x1 = int(xmin * img_width / 1000)
    y1 = int(ymin * img_height / 1000)
    x2 = int(xmax * img_width / 1000)
    y2 = int(ymax * img_height / 1000)
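If a response contains more than one box, the same conversion can be applied to each match. The following is a sketch under assumptions, not the model's official post-processing: `parse_boxes` is a hypothetical helper, it assumes every box follows the <box><ymin><xmin><ymax><xmax></box> pattern described above, and it clamps results to the image bounds as a precaution.

```python
import re

BOX_PATTERN = re.compile(r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>")


def parse_boxes(text, img_width, img_height):
    """Return a list of (x1, y1, x2, y2) pixel boxes found in the model output."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        ymin, xmin, ymax, xmax = (int(g) for g in match.groups())
        # Scale from the 0-1000 normalized range to pixels, then clamp to the image
        x1 = min(max(int(xmin * img_width / 1000), 0), img_width)
        y1 = min(max(int(ymin * img_height / 1000), 0), img_height)
        x2 = min(max(int(xmax * img_width / 1000), 0), img_width)
        y2 = min(max(int(ymax * img_height / 1000), 0), img_height)
        boxes.append((x1, y1, x2, y2))
    return boxes


sample = "<ref>Locate the person</ref><box><247><220><757><1000></box>"
print(parse_boxes(sample, img_width=640, img_height=480))
# prints: [(140, 118, 640, 363)]
```

The returned tuples use (x1, y1, x2, y2) order, which is the order Pillow's ImageDraw.Draw(image).rectangle(...) expects if you want to draw the result on the image.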
📜 Citation & License
This model is distributed under the original NVIDIA LocateAnything license. Please refer to the NVIDIA License Link for usage constraints.
For the original architecture details, see:
@article{locateanything2026,
title={LocateAnything: Open-Vocabulary Object Localization at Scale},
author={NVIDIA Research},
journal={arXiv preprint arXiv:2605.27365},
year={2026}
}