Spaces: jzhang533/ai_manga_translator

jzhang533 committed on Feb 27

Commit adabb98 (1 parent: c2dc941)

fix PaddleOCR-VL-1.5 text spotting: use correct prompt and coordinate format

- Use Spotting: prompt to trigger text spotting mode with bounding boxes
- Parse LOC token format (quadrilateral, 8 values) instead of (x1,y1),(x2,y2)
- Improve PaddleOCR-VL-1.5 visibility in app description and README
- Add models metadata for HuggingFace Spaces model linking
- Add local test files to .gitignore
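
The coordinate change in the second bullet can be sketched in isolation. This is a minimal standalone illustration that mirrors the parsing logic in this commit's ocr_model.py diff. The function name `parse_spotting_response` and the sample response string are hypothetical, and the 0-1000 normalization is inferred from the scaling used in the diff, not from model documentation.

```python
import re
from typing import Dict, List

# One detection = text followed by a run of <|LOC_n|> tokens (8 expected: 4 corner points).
DETECTION = re.compile(r"([^<\n]+?)((?:<\|LOC_\d+\|>)+)")
LOC_TOKEN = re.compile(r"<\|LOC_(\d+)\|>")


def parse_spotting_response(response: str, image_width: int, image_height: int) -> List[Dict]:
    """Convert quadrilateral LOC tokens into axis-aligned pixel boxes.

    Assumes coordinates are normalized to a 0-1000 range, as the diff's scaling implies.
    """
    results = []
    for match in DETECTION.finditer(response):
        text = match.group(1).strip()
        locs = [int(v) for v in LOC_TOKEN.findall(match.group(2))]
        if len(locs) != 8:
            # Not a full quadrilateral (x, y for each of 4 corners): skip it.
            continue
        xs, ys = locs[0::2], locs[1::2]
        results.append({
            "text": text,
            # Enclosing rectangle of the quadrilateral, scaled to pixels.
            "x1": int(min(xs) * image_width / 1000),
            "y1": int(min(ys) * image_height / 1000),
            "x2": int(max(xs) * image_width / 1000),
            "y2": int(max(ys) * image_height / 1000),
        })
    return results


# Hypothetical model output: one valid detection and one with too few tokens.
sample = (
    "Hello<|LOC_100|><|LOC_200|><|LOC_300|><|LOC_200|>"
    "<|LOC_300|><|LOC_400|><|LOC_100|><|LOC_400|>\n"
    "Broken<|LOC_1|><|LOC_2|><|LOC_3|><|LOC_4|>"
)
boxes = parse_spotting_response(sample, image_width=2000, image_height=1000)
```

Taking the min/max over the four corners discards any rotation in the quadrilateral, which matches the `x1, y1, x2, y2` box format the rest of the app already consumes.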

Files changed (5)

.gitignore +8 -0
README.md +16 -10
app.py +22 -15
ocr_model.py +43 -34
requirements.txt +1 -1

.gitignore CHANGED

@@ -11,3 +11,11 @@ venv/
 .env.development.local
 .env.test.local
 .env.production.local

 .env.development.local
 .env.test.local
 .env.production.local
+# Local test files
+*.png
+!examples/*.png
+detect_boxes.py
+test_paddleocr.py
+fonts/
+HunyuanOCR

README.md CHANGED

@@ -8,26 +8,32 @@ sdk_version: 6.0.1
 app_file: app.py
 pinned: false
 license: apache-2.0
-short_description: Translate Manga Images
 ---
 # 📚 AI Manga Translator
-An intelligent tool designed to detect, recognize, and translate text in images, with specialized features for Manga and Comics.
-**Key Capabilities:**
 - 🖌️ **Smart Text Replacement**: Automatically detects text bubbles, wipes them clean, and overlays translated text.
 - 📖 **Manga-Optimized**: Handles vertical text and right-to-left reading order correctly.
 - 🌏 **Multi-Language Translation**: Translates detected text into your preferred language (Chinese, English, French, etc.).
 ## Technologies
-- **OCR Engine**: PaddleOCR-VL-1.5
-- **Translation**: ERNIE 4.5 (via API)
-- **Development**: Vibe coded with Gemini 3 Pro
 ## Setup
-To run this locally:
-1. Install dependencies: `pip install -r requirements.txt`
-2. Set up `.env`
-3. Run `python app.py`.

 app_file: app.py
 pinned: false
 license: apache-2.0
+short_description: Translate Manga Images with PaddleOCR-VL-1.5
+models:
+  - PaddlePaddle/PaddleOCR-VL-1.5
 ---
 # 📚 AI Manga Translator
+**Powered by [PaddleOCR-VL-1.5](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5)** — a state-of-the-art 0.9B Vision-Language Model for text spotting and document parsing.
+An intelligent tool that detects, recognizes, and translates text in manga/comic images end-to-end.
+## Key Capabilities
+- 🔍 **High-Precision OCR**: [PaddleOCR-VL-1.5](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5) accurately spots and recognizes text with bounding box coordinates, even in complex manga layouts.
 - 🖌️ **Smart Text Replacement**: Automatically detects text bubbles, wipes them clean, and overlays translated text.
 - 📖 **Manga-Optimized**: Handles vertical text and right-to-left reading order correctly.
 - 🌏 **Multi-Language Translation**: Translates detected text into your preferred language (Chinese, English, French, etc.).
 ## Technologies
+- **OCR Engine**: [PaddleOCR-VL-1.5](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5) — a 0.9B multi-task VLM achieving SOTA on OmniDocBench v1.5, with text spotting (localization + recognition) capabilities.
+- **Translation**: ERNIE 4.5 (via OpenAI-compatible API)
 ## Setup
+To run locally:
+1. Install dependencies: `pip install -r requirements.txt`
+2. Configure `.env` with your translation API credentials
+3. Run `python app.py`

app.py CHANGED

@@ -9,7 +9,14 @@ import os
 # Set environment variable to avoid tokenizer parallelism deadlocks
 os.environ["TOKENIZERS_PARALLELISM"] = "false"
-import spaces
 from ocr_model import PaddleOCRVL
 from visualization import draw_detection_boxes, get_detection_summary
 from dotenv import load_dotenv
@@ -88,9 +95,9 @@ def process_image(image: Image.Image, prompt: str = None, target_language: str =
         # Get image dimensions
         image_width, image_height = image.size
-        # Use default prompt if not provided
         if not prompt or prompt.strip() == "":
-            prompt = "检测并识别图片中的文字,将文本内容与坐标格式化输出。"
         # Detect text
         print("Running text detection...")
@@ -168,14 +175,15 @@ def create_demo():
     with gr.Blocks(title="AI Manga Translator") as demo:
         gr.Markdown("""
         # 📚 AI Manga Translator
-        An intelligent tool designed to detect, recognize, and translate text in images, with specialized features for Manga and Comics.
         **Key Capabilities:**
         - 🖌️ **Smart Text Replacement**: Automatically detects text bubbles, wipes them clean, and overlays translated text.
         - 📖 **Manga-Optimized**: Handles vertical text and right-to-left reading order correctly.
         - 🌏 **Multi-Language Translation**: Translates detected text into your preferred language (Chinese, English, French, etc.).
-        - 🔍 **High-Precision OCR**: Accurately spots text even in complex backgrounds.
         """)
         with gr.Row():
@@ -190,7 +198,7 @@ def create_demo():
                 custom_prompt = gr.Textbox(
                     label="Custom Prompt (Optional)",
-                    placeholder="检测并识别图片中的文字,将文本内容与坐标格式化输出。",
                     lines=2
                 )
@@ -231,9 +239,9 @@ def create_demo():
         gr.Markdown("### 📝 Examples")
         gr.Examples(
             examples=[
-                ["examples/dandadan.png", "检测并识别图片中的文字,将文本内容与坐标格式化输出。"],
-                ["examples/ruridragon.png", "检测并识别图片中的文字,将文本内容与坐标格式化输出。"],
-                ["examples/spyfamily.png", "检测并识别图片中的文字,将文本内容与坐标格式化输出。"],
             ],
             inputs=[input_image, custom_prompt],
             label="Click to use example image"
@@ -242,12 +250,11 @@ def create_demo():
         gr.Markdown("""
         ---
         ### ℹ️ About
         This application combines state-of-the-art AI technologies to provide seamless manga translation:
-        - **OCR Engine**: PaddleOCR-VL-1.5.
         - **Translation**: Powered by **ERNIE 4.5** for natural and context-aware translations.
-        - **Development**: Vibe coded with **Gemini 3 Pro**.
         """)
     return demo

 # Set environment variable to avoid tokenizer parallelism deadlocks
 os.environ["TOKENIZERS_PARALLELISM"] = "false"
+try:
+    import spaces
+except ImportError:
+    # Not running on HuggingFace Spaces — make @spaces.GPU a no-op
+    class spaces:
+        @staticmethod
+        def GPU(fn):
+            return fn
 from ocr_model import PaddleOCRVL
 from visualization import draw_detection_boxes, get_detection_summary
 from dotenv import load_dotenv
         # Get image dimensions
         image_width, image_height = image.size
+        # Use default prompt if not provided (None lets ocr_model use "Spotting:")
         if not prompt or prompt.strip() == "":
+            prompt = None
         # Detect text
         print("Running text detection...")
     with gr.Blocks(title="AI Manga Translator") as demo:
         gr.Markdown("""
         # 📚 AI Manga Translator
+        **Powered by [PaddleOCR-VL-1.5](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5)** — a state-of-the-art 0.9B Vision-Language Model for text spotting and document parsing.
+        An intelligent tool that detects, recognizes, and translates text in manga/comic images end-to-end.
         **Key Capabilities:**
+        - 🔍 **High-Precision OCR**: PaddleOCR-VL-1.5 accurately spots and recognizes text with bounding box coordinates, even in complex manga layouts.
         - 🖌️ **Smart Text Replacement**: Automatically detects text bubbles, wipes them clean, and overlays translated text.
         - 📖 **Manga-Optimized**: Handles vertical text and right-to-left reading order correctly.
         - 🌏 **Multi-Language Translation**: Translates detected text into your preferred language (Chinese, English, French, etc.).
         """)
         with gr.Row():
                 custom_prompt = gr.Textbox(
                     label="Custom Prompt (Optional)",
+                    placeholder="Spotting:",
                     lines=2
                 )
         gr.Markdown("### 📝 Examples")
         gr.Examples(
             examples=[
+                ["examples/dandadan.png", ""],
+                ["examples/ruridragon.png", ""],
+                ["examples/spyfamily.png", ""],
             ],
             inputs=[input_image, custom_prompt],
             label="Click to use example image"
         gr.Markdown("""
         ---
         ### ℹ️ About
         This application combines state-of-the-art AI technologies to provide seamless manga translation:
+        - **OCR Engine**: [PaddleOCR-VL-1.5](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5) — a 0.9B multi-task VLM achieving SOTA on OmniDocBench v1.5, with text spotting (localization + recognition) capabilities.
         - **Translation**: Powered by **ERNIE 4.5** for natural and context-aware translations.
         """)
     return demo
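
The `spaces` fallback added in this file handles the bare `@spaces.GPU` form. As a possible extension that is not part of this commit, a shim could also accept the parametrized form (for example `@spaces.GPU(duration=60)`). The sketch below assumes that goal; the class name `SpacesShim` is hypothetical.

```python
class SpacesShim:
    """Hypothetical local stand-in for the `spaces` package (not part of this commit)."""

    @staticmethod
    def GPU(fn=None, **_kwargs):
        # Bare usage: @spaces.GPU -> `fn` is the decorated function itself.
        if callable(fn):
            return fn

        # Parametrized usage: @spaces.GPU(duration=60) -> return a no-op decorator.
        def decorator(inner):
            return inner

        return decorator


try:
    import spaces  # present when running on HuggingFace Spaces
except ImportError:
    # Running locally: fall back to the no-op shim.
    spaces = SpacesShim


@SpacesShim.GPU
def add_one(x):
    return x + 1


@SpacesShim.GPU(duration=60)
def double(x):
    return x * 2
```

Both decorated functions behave exactly like their undecorated versions, so the same app code can run on and off Spaces.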

ocr_model.py CHANGED

@@ -7,7 +7,7 @@ import os
 import torch
 from typing import Dict, List, Tuple, Optional
 from PIL import Image
-from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
 import requests
 from io import BytesIO
@@ -37,7 +37,11 @@ class PaddleOCRVL:
         print(f"Loading PaddleOCR-VL-1.5 model on {self.device}...")
-        self.processor = AutoProcessor.from_pretrained(model_path)
         if self.device == "cuda":
             torch_dtype = torch.bfloat16
@@ -46,11 +50,20 @@ class PaddleOCRVL:
         else:
             torch_dtype = torch.float32
-        self.model = Qwen2VLForConditionalGeneration.from_pretrained(
-            model_path,
-            torch_dtype=torch_dtype,
-            device_map="auto" if self.device == "cuda" else None
-        )
         if self.device != "cuda":
             self.model = self.model.to(self.device)
@@ -96,7 +109,7 @@ class PaddleOCRVL:
             Model response with detected text and coordinates
         """
         if prompt is None:
-            prompt = "检测并识别图片中的文字,将文本内容与坐标格式化输出。"
         messages = [
             {
@@ -163,34 +176,30 @@ class PaddleOCRVL:
         """
         results = []
-        # Pattern to match text and coordinates: text(x1,y1),(x2,y2)
-        pattern = r'([^()]+?)(\(\d+,\d+\),\(\d+,\d+\))'
-        matches = re.finditer(pattern, response)
-        for match in matches:
             try:
                 text = match.group(1).strip()
-                coords = match.group(2)
-                coord_pattern = r'\((\d+),(\d+)\)'
-                coord_matches = re.findall(coord_pattern, coords)
-                if len(coord_matches) == 2:
-                    x1_norm, y1_norm = float(coord_matches[0][0]), float(coord_matches[0][1])
-                    x2_norm, y2_norm = float(coord_matches[1][0]), float(coord_matches[1][1])
-                    x1 = int(x1_norm * image_width / 1000)
-                    y1 = int(y1_norm * image_height / 1000)
-                    x2 = int(x2_norm * image_width / 1000)
-                    y2 = int(y2_norm * image_height / 1000)
-                    results.append({
-                        'text': text,
-                        'x1': x1,
-                        'y1': y1,
-                        'x2': x2,
-                        'y2': y2
-                    })
             except Exception as e:
                 print(f"Error parsing detection result: {str(e)}")
                 continue

 import torch
 from typing import Dict, List, Tuple, Optional
 from PIL import Image
+from transformers import AutoProcessor, AutoModelForImageTextToText
 import requests
 from io import BytesIO
         print(f"Loading PaddleOCR-VL-1.5 model on {self.device}...")
+        try:
+            self.processor = AutoProcessor.from_pretrained(model_path)
+        except Exception:
+            print("Network error loading processor, falling back to local cache...")
+            self.processor = AutoProcessor.from_pretrained(model_path, local_files_only=True)
         if self.device == "cuda":
             torch_dtype = torch.bfloat16
         else:
             torch_dtype = torch.float32
+        try:
+            self.model = AutoModelForImageTextToText.from_pretrained(
+                model_path,
+                dtype=torch_dtype,
+                device_map="auto" if self.device == "cuda" else None
+            )
+        except Exception:
+            print("Network error loading model, falling back to local cache...")
+            self.model = AutoModelForImageTextToText.from_pretrained(
+                model_path,
+                dtype=torch_dtype,
+                device_map="auto" if self.device == "cuda" else None,
+                local_files_only=True
+            )
         if self.device != "cuda":
             self.model = self.model.to(self.device)
             Model response with detected text and coordinates
         """
         if prompt is None:
+            prompt = "Spotting:"
         messages = [
             {
         """
         results = []
+        # Pattern to match text followed by <|LOC_xxx|> tokens (8 per detection, quadrilateral)
+        for match in re.finditer(r'([^<\n]+?)((?:<\|LOC_\d+\|>)+)', response):
             try:
                 text = match.group(1).strip()
+                locs = [int(v) for v in re.findall(r'<\|LOC_(\d+)\|>', match.group(2))]
+                if len(locs) != 8:
+                    continue
+                xs = [locs[i] for i in range(0, 8, 2)]
+                ys = [locs[i] for i in range(1, 8, 2)]
+                x1 = int(min(xs) * image_width / 1000)
+                y1 = int(min(ys) * image_height / 1000)
+                x2 = int(max(xs) * image_width / 1000)
+                y2 = int(max(ys) * image_height / 1000)
+                results.append({
+                    'text': text,
+                    'x1': x1,
+                    'y1': y1,
+                    'x2': x2,
+                    'y2': y2
+                })
             except Exception as e:
                 print(f"Error parsing detection result: {str(e)}")
                 continue

requirements.txt CHANGED

@@ -1,7 +1,7 @@
 gradio>=4.0.0
 torch>=2.0.0
 torchvision>=0.15.0
-transformers>=4.45.0
 Pillow>=10.0.0
 numpy>=1.24.0
 requests>=2.31.0

 gradio>=4.0.0
 torch>=2.0.0
 torchvision>=0.15.0
+transformers>=4.52.0
 Pillow>=10.0.0
 numpy>=1.24.0
 requests>=2.31.0