---
license: other
license_name: nvidia-license
license_link: https://huggingface.co/nvidia/LocateAnything-3B/blob/main/LICENSE
language:
- en
tags:
- nvidia
- eagle
- vision
- object-detection
- grounding
- locateanything
- arxiv:2605.27365
- mlx
pipeline_tag: image-text-to-text
base_model: nvidia/LocateAnything-3B
---

# LocateAnything-3B-MLX

Converted and optimized by **[Somesh Choudhary](https://www.linkedin.com/in/somesh-choudhary-303a53350/) (CTO, [And AI Platforms](https://www.andaiplatforms.com/))**.

This repository contains the weights for **nvidia/LocateAnything-3B** converted to the **MLX** format for high-performance, local inference on Apple Silicon (M1/M2/M3/M4 Macs).

## Model Overview
**LocateAnything** is a 3-billion-parameter vision-language model (VLM) developed by NVIDIA for open-vocabulary object localization and visual grounding. Given a natural language query, it locates the matching objects or regions of interest in an image.

This MLX conversion ships an adjusted image-processing config that sets the image processing limit to `in_token_limit=1024` to fit local memory constraints, offering up to **20x faster inference** and a significantly lower memory footprint (Apple Silicon uses unified memory rather than dedicated VRAM).

---

## 🚀 Installation & Setup

Before running the model, make sure the required packages are installed in your Python environment:

```bash
pip install mlx-vlm transformers pillow opencv-python
```

---

## 💻 Quickstart Inference (Python API)

You can load and query the model using the `mlx_vlm` library.

```python
from PIL import Image
from mlx_vlm import load, generate

# 1. Load the model and processor
model_path = "andai-labs/LocateAnything-3B-MLX"
model, processor = load(model_path, trust_remote_code=True)

# 2. Prepare the image and prompt
image = Image.open("path/to/your/image.jpg")
prompt = "Locate the person."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt}
        ]
    }
]

# 3. Format the chat template
text_prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# 4. Generate the localization bounding box
response = generate(model, processor, text_prompt, image=image, max_tokens=100)
print("Model Output:", response.text)
```

---

## 🏷️ Understanding the Output Format

The model returns tags containing target labels and coordinate values normalized to a range of `0` to `1000` in the format `<ymin><xmin><ymax><xmax>`:

* **Example Output:**
  ```html
  <ref>Locate the person</ref><box><247><220><757><1000></box>
  ```
* **Coordinate Conversion (Python)**:
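  To project these coordinates back onto the original image's pixel dimensions (this reuses `image` and `response` from the Quickstart above):
  ```python
  import re

  # Original image size; PIL's Image.size is (width, height)
  img_width, img_height = image.size

  # Parse the four normalized coordinates from the first <box> tag
  pattern = r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>"
  match = re.search(pattern, response.text)
  if match:
      ymin, xmin, ymax, xmax = [int(g) for g in match.groups()]

      # Scale from the 0-1000 range to pixel coordinates
      x1 = int(xmin * img_width / 1000)
      y1 = int(ymin * img_height / 1000)
      x2 = int(xmax * img_width / 1000)
      y2 = int(ymax * img_height / 1000)
      print("Pixel box (x1, y1, x2, y2):", (x1, y1, x2, y2))
  ```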
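
A response may contain more than one `<box>` tag, for example when a query matches several objects. That is an assumption here, not something this card documents. The sketch below wraps the conversion in a reusable helper that collects every box it finds. The function name `parse_boxes`, the `scale` parameter, and the clamping step are illustrative choices rather than part of `mlx_vlm` or the model's published API.

```python
import re
from typing import List, Tuple

# Matches one <box>...</box> group with four integer fields, assumed to be
# in <ymin><xmin><ymax><xmax> order as described in this section.
BOX_PATTERN = re.compile(r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>")


def parse_boxes(
    text: str, img_width: int, img_height: int, scale: int = 1000
) -> List[Tuple[int, int, int, int]]:
    """Return every box found in `text` as pixel (x1, y1, x2, y2) tuples."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        ymin, xmin, ymax, xmax = (int(g) for g in match.groups())
        # Clamp to [0, scale] in case the model emits an out-of-range value
        # (a defensive assumption, not documented model behavior).
        ymin, xmin, ymax, xmax = (
            min(max(v, 0), scale) for v in (ymin, xmin, ymax, xmax)
        )
        boxes.append((
            xmin * img_width // scale,
            ymin * img_height // scale,
            xmax * img_width // scale,
            ymax * img_height // scale,
        ))
    return boxes


# Example using the sample output above and a hypothetical 640x480 image
sample = "<ref>Locate the person</ref><box><247><220><757><1000></box>"
print(parse_boxes(sample, img_width=640, img_height=480))
# → [(140, 118, 640, 363)]
```

The tuples come back in `(x1, y1, x2, y2)` order, which is the order Pillow's `ImageDraw.rectangle` accepts, so they can be drawn onto the original image directly.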

---

## 📜 Citation & License
This model is distributed under the original NVIDIA LocateAnything license. Please refer to the [NVIDIA License Link](https://huggingface.co/nvidia/LocateAnything-3B/blob/main/LICENSE) for usage constraints.

For the original architecture details, see:
```bibtex
@article{locateanything2026,
  title={LocateAnything: Open-Vocabulary Object Localization at Scale},
  author={NVIDIA Research},
  journal={arXiv preprint arXiv:2605.27365},
  year={2026}
}
```