LocateAnything-3B-MLX

Converted and optimized by Somesh Choudhary (CTO, And AI Platforms).

This repository contains the weights for nvidia/LocateAnything-3B converted to the MLX format for high-performance, local inference on Apple Silicon (M1/M2/M3/M4 Macs).

Model Overview

LocateAnything is a 3-billion parameter vision-language model (VLM) developed by NVIDIA, designed for open-vocabulary object localization and visual grounding tasks. It can accurately locate objects or regions of interest in an image given natural language queries.

This MLX conversion includes custom image processing config adjustments that optimize image processing limits (in_token_limit=1024) for local memory constraints, offering up to 20x faster inference speed and a significantly lower VRAM footprint.

🚀 Installation & Setup

Before running the model, make sure you have the required packages installed in your python environment:

pip install mlx-vlm transformers pillow opencv-python

💻 Quickstart Inference (Python API)

You can load and query the model using the mlx_vlm library.

from PIL import Image
from mlx_vlm import load, generate

# 1. Load the model and processor
model_path = "andai-labs/LocateAnything-3B-MLX"
model, processor = load(model_path, trust_remote_code=True)

# 2. Prepare the image and prompt
image = Image.open("path/to/your/image.jpg")
prompt = "Locate the person."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt}
        ]
    }
]

# 3. Format the chat template
text_prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# 4. Generate the localization bounding box
response = generate(model, processor, text_prompt, image=image, max_tokens=100)
print("Model Output:", response.text)

🏷️ Understanding the Output Format

The model returns tags containing target labels and coordinate values normalized to a range of 0 to 1000 in the format <ymin><xmin><ymax><xmax>:

Example Output:

<ref>Locate the person</ref><box><247><220><757><1000></box>

Coordinate Conversion (Python): To project these coordinates back onto your original image shape (height, width):

import re

# Parse coordinates
pattern = r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>"
match = re.search(pattern, response.text)
if match:
    ymin, xmin, ymax, xmax = [int(g) for g in match.groups()]
    
    # Project back to image resolution
    x1 = int(xmin * img_width / 1000)
    y1 = int(ymin * img_height / 1000)
    x2 = int(xmax * img_width / 1000)
    y2 = int(ymax * img_height / 1000)

📜 Citation & License

This model is distributed under the original NVIDIA LocateAnything license. Please refer to the NVIDIA License Link for usage constraints.

For the original architecture details, see:

@article{locateanything2026,
  title={LocateAnything: Open-Vocabulary Object Localization at Scale},
  author={NVIDIA Research},
  journal={arXiv preprint arXiv:2605.27365},
  year={2026}
}