jzhang533 committed
Commit adabb98 · 1 Parent(s): c2dc941

fix PaddleOCR-VL-1.5 text spotting: use correct prompt and coordinate format

- Use Spotting: prompt to trigger text spotting mode with bounding boxes
- Parse LOC token format (quadrilateral, 8 values) instead of (x1,y1),(x2,y2)
- Improve PaddleOCR-VL-1.5 visibility in app description and README
- Add models metadata for HuggingFace Spaces model linking
- Add local test files to .gitignore
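
The coordinate change in the second bullet can be illustrated with a small standalone sketch. It assumes the model emits each text run followed by eight `<|LOC_n|>` tokens (x/y pairs for four corners) on a 0-1000 grid, as the `/ 1000` scaling in `ocr_model.py` suggests. The function name `parse_spotting` and the sample response string are hypothetical and are not part of the commit.

```python
import re
from typing import Dict, List

# A text run followed by one or more <|LOC_n|> tokens (same regex as the commit's parser).
DETECTION = re.compile(r'([^<\n]+?)((?:<\|LOC_\d+\|>)+)')
LOC_VALUE = re.compile(r'<\|LOC_(\d+)\|>')


def parse_spotting(response: str, image_width: int, image_height: int) -> List[Dict]:
    """Convert quadrilateral LOC tokens into axis-aligned pixel boxes."""
    results = []
    for match in DETECTION.finditer(response):
        text = match.group(1).strip()
        locs = [int(v) for v in LOC_VALUE.findall(match.group(2))]
        if len(locs) != 8:
            # Not a full quadrilateral (4 points x 2 values); skip it.
            continue
        xs, ys = locs[0::2], locs[1::2]
        # Assumption: coordinates are normalised to a 0-1000 grid.
        results.append({
            "text": text,
            "x1": int(min(xs) * image_width / 1000),
            "y1": int(min(ys) * image_height / 1000),
            "x2": int(max(xs) * image_width / 1000),
            "y2": int(max(ys) * image_height / 1000),
        })
    return results


# Hypothetical model output: one detection with four corner points.
sample = (
    "Hello<|LOC_100|><|LOC_200|><|LOC_300|><|LOC_200|>"
    "<|LOC_300|><|LOC_400|><|LOC_100|><|LOC_400|>"
)
boxes = parse_spotting(sample, image_width=2000, image_height=1000)
print(boxes)
```

Taking the min/max over the four corners collapses a possibly rotated quadrilateral into its enclosing rectangle, which keeps the downstream `x1, y1, x2, y2` consumers unchanged at the cost of losing the rotation.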

Files changed (5)
  1. .gitignore +8 -0
  2. README.md +16 -10
  3. app.py +22 -15
  4. ocr_model.py +43 -34
  5. requirements.txt +1 -1
.gitignore CHANGED
@@ -11,3 +11,11 @@ venv/
 .env.development.local
 .env.test.local
 .env.production.local
+
+# Local test files
+*.png
+!examples/*.png
+detect_boxes.py
+test_paddleocr.py
+fonts/
+HunyuanOCR
README.md CHANGED
@@ -8,26 +8,32 @@ sdk_version: 6.0.1
 app_file: app.py
 pinned: false
 license: apache-2.0
-short_description: Translate Manga Images
+short_description: Translate Manga Images with PaddleOCR-VL-1.5
+models:
+- PaddlePaddle/PaddleOCR-VL-1.5
 ---
 
 # 📚 AI Manga Translator
 
-An intelligent tool designed to detect, recognize, and translate text in images, with specialized features for Manga and Comics.
+**Powered by [PaddleOCR-VL-1.5](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5)** a state-of-the-art 0.9B Vision-Language Model for text spotting and document parsing.
 
-**Key Capabilities:**
+An intelligent tool that detects, recognizes, and translates text in manga/comic images end-to-end.
+
+## Key Capabilities
+
+- 🔍 **High-Precision OCR**: [PaddleOCR-VL-1.5](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5) accurately spots and recognizes text with bounding box coordinates, even in complex manga layouts.
 - 🖌️ **Smart Text Replacement**: Automatically detects text bubbles, wipes them clean, and overlays translated text.
 - 📖 **Manga-Optimized**: Handles vertical text and right-to-left reading order correctly.
 - 🌏 **Multi-Language Translation**: Translates detected text into your preferred language (Chinese, English, French, etc.).
 
 ## Technologies
-- **OCR Engine**: PaddleOCR-VL-1.5
-- **Translation**: ERNIE 4.5 (via API)
-- **Development**: Vibe coded with Gemini 3 Pro
+
+- **OCR Engine**: [PaddleOCR-VL-1.5](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5) — a 0.9B multi-task VLM achieving SOTA on OmniDocBench v1.5, with text spotting (localization + recognition) capabilities.
+- **Translation**: ERNIE 4.5 (via OpenAI-compatible API)
 
 ## Setup
-To run this locally:
-1. Install dependencies: `pip install -r requirements.txt`
-2. Set up `.env`
-3. Run `python app.py`.
 
+To run locally:
+1. Install dependencies: `pip install -r requirements.txt`
+2. Configure `.env` with your translation API credentials
+3. Run `python app.py`
app.py CHANGED
@@ -9,7 +9,14 @@ import os
 # Set environment variable to avoid tokenizer parallelism deadlocks
 os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
-import spaces
+try:
+    import spaces
+except ImportError:
+    # Not running on HuggingFace Spaces — make @spaces.GPU a no-op
+    class spaces:
+        @staticmethod
+        def GPU(fn):
+            return fn
 from ocr_model import PaddleOCRVL
 from visualization import draw_detection_boxes, get_detection_summary
 from dotenv import load_dotenv
@@ -88,9 +95,9 @@ def process_image(image: Image.Image, prompt: str = None, target_language: str =
     # Get image dimensions
     image_width, image_height = image.size
 
-    # Use default prompt if not provided
+    # Use default prompt if not provided (None lets ocr_model use "Spotting:")
     if not prompt or prompt.strip() == "":
-        prompt = "检测并识别图片中的文字,将文本内容与坐标格式化输出。"
+        prompt = None
 
     # Detect text
     print("Running text detection...")
@@ -168,14 +175,15 @@ def create_demo():
     with gr.Blocks(title="AI Manga Translator") as demo:
         gr.Markdown("""
         # 📚 AI Manga Translator
-
-        An intelligent tool designed to detect, recognize, and translate text in images, with specialized features for Manga and Comics.
-
+        **Powered by [PaddleOCR-VL-1.5](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5)** — a state-of-the-art 0.9B Vision-Language Model for text spotting and document parsing.
+
+        An intelligent tool that detects, recognizes, and translates text in manga/comic images end-to-end.
+
         **Key Capabilities:**
+        - 🔍 **High-Precision OCR**: PaddleOCR-VL-1.5 accurately spots and recognizes text with bounding box coordinates, even in complex manga layouts.
         - 🖌️ **Smart Text Replacement**: Automatically detects text bubbles, wipes them clean, and overlays translated text.
         - 📖 **Manga-Optimized**: Handles vertical text and right-to-left reading order correctly.
         - 🌏 **Multi-Language Translation**: Translates detected text into your preferred language (Chinese, English, French, etc.).
-        - 🔍 **High-Precision OCR**: Accurately spots text even in complex backgrounds.
         """)
 
         with gr.Row():
@@ -190,7 +198,7 @@ def create_demo():
 
                 custom_prompt = gr.Textbox(
                     label="Custom Prompt (Optional)",
-                    placeholder="检测并识别图片中的文字,将文本内容与坐标格式化输出。",
+                    placeholder="Spotting:",
                     lines=2
                 )
 
@@ -231,9 +239,9 @@ def create_demo():
         gr.Markdown("### 📝 Examples")
         gr.Examples(
             examples=[
-                ["examples/dandadan.png", "检测并识别图片中的文字,将文本内容与坐标格式化输出。"],
-                ["examples/ruridragon.png", "检测并识别图片中的文字,将文本内容与坐标格式化输出。"],
-                ["examples/spyfamily.png", "检测并识别图片中的文字,将文本内容与坐标格式化输出。"],
+                ["examples/dandadan.png", ""],
+                ["examples/ruridragon.png", ""],
+                ["examples/spyfamily.png", ""],
             ],
             inputs=[input_image, custom_prompt],
             label="Click to use example image"
@@ -242,12 +250,11 @@ def create_demo():
         gr.Markdown("""
         ---
         ### ℹ️ About
-
+
         This application combines state-of-the-art AI technologies to provide seamless manga translation:
-
-        - **OCR Engine**: PaddleOCR-VL-1.5.
+
+        - **OCR Engine**: [PaddleOCR-VL-1.5](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5) — a 0.9B multi-task VLM achieving SOTA on OmniDocBench v1.5, with text spotting (localization + recognition) capabilities.
         - **Translation**: Powered by **ERNIE 4.5** for natural and context-aware translations.
-        - **Development**: Vibe coded with **Gemini 3 Pro**.
         """)
 
     return demo
ocr_model.py CHANGED
@@ -7,7 +7,7 @@ import os
 import torch
 from typing import Dict, List, Tuple, Optional
 from PIL import Image
-from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
+from transformers import AutoProcessor, AutoModelForImageTextToText
 import requests
 from io import BytesIO
 
@@ -37,7 +37,11 @@ class PaddleOCRVL:
 
         print(f"Loading PaddleOCR-VL-1.5 model on {self.device}...")
 
-        self.processor = AutoProcessor.from_pretrained(model_path)
+        try:
+            self.processor = AutoProcessor.from_pretrained(model_path)
+        except Exception:
+            print("Network error loading processor, falling back to local cache...")
+            self.processor = AutoProcessor.from_pretrained(model_path, local_files_only=True)
 
         if self.device == "cuda":
             torch_dtype = torch.bfloat16
@@ -46,11 +50,20 @@ class PaddleOCRVL:
         else:
             torch_dtype = torch.float32
 
-        self.model = Qwen2VLForConditionalGeneration.from_pretrained(
-            model_path,
-            torch_dtype=torch_dtype,
-            device_map="auto" if self.device == "cuda" else None
-        )
+        try:
+            self.model = AutoModelForImageTextToText.from_pretrained(
+                model_path,
+                dtype=torch_dtype,
+                device_map="auto" if self.device == "cuda" else None
+            )
+        except Exception:
+            print("Network error loading model, falling back to local cache...")
+            self.model = AutoModelForImageTextToText.from_pretrained(
+                model_path,
+                dtype=torch_dtype,
+                device_map="auto" if self.device == "cuda" else None,
+                local_files_only=True
+            )
 
         if self.device != "cuda":
             self.model = self.model.to(self.device)
@@ -96,7 +109,7 @@ class PaddleOCRVL:
             Model response with detected text and coordinates
         """
         if prompt is None:
-            prompt = "检测并识别图片中的文字,将文本内容与坐标格式化输出。"
+            prompt = "Spotting:"
 
         messages = [
             {
@@ -163,34 +176,30 @@ class PaddleOCRVL:
         """
         results = []
 
-        # Pattern to match text and coordinates: text(x1,y1),(x2,y2)
-        pattern = r'([^()]+?)(\(\d+,\d+\),\(\d+,\d+\))'
-        matches = re.finditer(pattern, response)
-
-        for match in matches:
+        # Pattern to match text followed by <|LOC_xxx|> tokens (8 per detection, quadrilateral)
+        for match in re.finditer(r'([^<\n]+?)((?:<\|LOC_\d+\|>)+)', response):
             try:
                 text = match.group(1).strip()
-                coords = match.group(2)
-
-                coord_pattern = r'\((\d+),(\d+)\)'
-                coord_matches = re.findall(coord_pattern, coords)
-
-                if len(coord_matches) == 2:
-                    x1_norm, y1_norm = float(coord_matches[0][0]), float(coord_matches[0][1])
-                    x2_norm, y2_norm = float(coord_matches[1][0]), float(coord_matches[1][1])
-
-                    x1 = int(x1_norm * image_width / 1000)
-                    y1 = int(y1_norm * image_height / 1000)
-                    x2 = int(x2_norm * image_width / 1000)
-                    y2 = int(y2_norm * image_height / 1000)
-
-                    results.append({
-                        'text': text,
-                        'x1': x1,
-                        'y1': y1,
-                        'x2': x2,
-                        'y2': y2
-                    })
+                locs = [int(v) for v in re.findall(r'<\|LOC_(\d+)\|>', match.group(2))]
+
+                if len(locs) != 8:
+                    continue
+
+                xs = [locs[i] for i in range(0, 8, 2)]
+                ys = [locs[i] for i in range(1, 8, 2)]
+
+                x1 = int(min(xs) * image_width / 1000)
+                y1 = int(min(ys) * image_height / 1000)
+                x2 = int(max(xs) * image_width / 1000)
+                y2 = int(max(ys) * image_height / 1000)
+
+                results.append({
+                    'text': text,
+                    'x1': x1,
+                    'y1': y1,
+                    'x2': x2,
+                    'y2': y2
+                })
             except Exception as e:
                 print(f"Error parsing detection result: {str(e)}")
                 continue
requirements.txt CHANGED
@@ -1,7 +1,7 @@
 gradio>=4.0.0
 torch>=2.0.0
 torchvision>=0.15.0
-transformers>=4.45.0
+transformers>=4.52.0
 Pillow>=10.0.0
 numpy>=1.24.0
 requests>=2.31.0