---
license: other
license_name: nvidia-license
license_link: https://huggingface.co/nvidia/LocateAnything-3B/blob/main/LICENSE
language:
- en
tags:
- nvidia
- eagle
- vision
- object-detection
- grounding
- locateanything
- arxiv:2605.27365
- mlx
pipeline_tag: image-text-to-text
base_model: nvidia/LocateAnything-3B
---

# LocateAnything-3B-MLX

Converted and optimized by **[Somesh Choudhary](https://www.linkedin.com/in/somesh-choudhary-303a53350/) (CTO, [And AI Platforms](https://www.andaiplatforms.com/))**.

This repository contains the weights for **nvidia/LocateAnything-3B** converted to the **MLX** format for high-performance, local inference on Apple Silicon (M1/M2/M3/M4 Macs).

## Model Overview

**LocateAnything** is a 3-billion-parameter vision-language model (VLM) developed by NVIDIA for open-vocabulary object localization and visual grounding. Given a natural-language query, it locates the matching objects or regions of interest in an image.

This MLX conversion adjusts the image processing config, setting `in_token_limit=1024` to fit local memory constraints, which offers up to **20x faster inference** and a significantly lower memory footprint.

---

## 🚀 Installation & Setup

Before running the model, make sure the required packages are installed in your Python environment:

```bash
pip install mlx-vlm transformers pillow opencv-python
```

---

## 💻 Quickstart Inference (Python API)

You can load and query the model using the `mlx_vlm` library.

```python
from PIL import Image
from mlx_vlm import load, generate

# 1. Load the model and processor
model_path = "andai-labs/LocateAnything-3B-MLX"
model, processor = load(model_path, trust_remote_code=True)

# 2. Prepare the image and prompt
image = Image.open("path/to/your/image.jpg")
prompt = "Locate the person."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt}
        ]
    }
]

# 3. Format the chat template
text_prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# 4. Generate the localization bounding box
response = generate(model, processor, text_prompt, image=image, max_tokens=100)
print("Model Output:", response.text)
```

---

## 🏷️ Understanding the Output Format

The model returns tags containing the target label and box coordinates normalized to the range `0` to `1000`. Each box is four bracketed integers in the order `<ymin><xmin><ymax><xmax>`:

* **Example Output:**
  ```html
  Locate the person<247><220><757><1000>
  ```

* **Coordinate Conversion (Python)**: To project these coordinates back onto your original image shape `(height, width)`:
  ```python
  import re

  # Original image size; PIL reports it as (width, height)
  img_width, img_height = image.size

  # Parse coordinates
  pattern = r"<(\d+)><(\d+)><(\d+)><(\d+)>"
  match = re.search(pattern, response.text)
  if match:
      ymin, xmin, ymax, xmax = [int(g) for g in match.groups()]

      # Project back to image resolution
      x1 = int(xmin * img_width / 1000)
      y1 = int(ymin * img_height / 1000)
      x2 = int(xmax * img_width / 1000)
      y2 = int(ymax * img_height / 1000)
  ```

---

## 📜 Citation & License

This model is distributed under the original NVIDIA LocateAnything license. Please refer to the [NVIDIA License Link](https://huggingface.co/nvidia/LocateAnything-3B/blob/main/LICENSE) for usage constraints.

For the original architecture details, see:

```bibtex
@article{locateanything2026,
  title={LocateAnything: Open-Vocabulary Object Localization at Scale},
  author={NVIDIA Research},
  journal={arXiv preprint arXiv:2605.27365},
  year={2026}
}
```
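
---

## 🧩 Appendix: Parsing Multiple Boxes (Sketch)

The coordinate conversion in the output-format section handles a single match. The following is a minimal sketch that generalizes it, assuming the model may emit several boxes in one response and that each box keeps the same `ymin, xmin, ymax, xmax` order. `parse_boxes` and `BOX_PATTERN` are hypothetical names introduced here for illustration; they are not part of `mlx_vlm`. Clamping values to `1000` is also an added assumption, not documented model behavior.

```python
import re
from typing import List, Tuple

# Four bracketed integers per box, matching the example output in this card.
BOX_PATTERN = re.compile(r"<(\d+)><(\d+)><(\d+)><(\d+)>")


def parse_boxes(text: str, img_width: int, img_height: int) -> List[Tuple[int, int, int, int]]:
    """Return every box found in `text` as pixel coordinates (x1, y1, x2, y2)."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        # Assumed order (taken from the conversion snippet): ymin, xmin, ymax, xmax.
        # Values above 1000 are clamped so boxes stay inside the image.
        ymin, xmin, ymax, xmax = (min(int(g), 1000) for g in match.groups())
        boxes.append((
            xmin * img_width // 1000,
            ymin * img_height // 1000,
            xmax * img_width // 1000,
            ymax * img_height // 1000,
        ))
    return boxes


sample = "Locate the person<247><220><757><1000>"
print(parse_boxes(sample, img_width=640, img_height=480))
# [(140, 118, 640, 363)]
```

With a PIL image, `parse_boxes(response.text, *image.size)` passes the dimensions in the expected order, since `image.size` is `(width, height)`.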