Translation
Transformers
TensorBoard
Safetensors
French
English
marian
text2text-generation
opus-mt
marian-mt
marianMTModel
fr-to-en
neuro-symbolic
NMT
How to use DomLoyer/opus-mt-fr-en-finetuned-fr-to-en with Transformers:
# Use a pipeline as a high-level helper
# Warning: Pipeline type "translation" is no longer supported in transformers v5.
# You must load the model directly (see below) or downgrade to v4.x with:
# pip install "transformers<5.0.0"
from transformers import pipeline
pipe = pipeline("translation", model="DomLoyer/opus-mt-fr-en-finetuned-fr-to-en")
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("DomLoyer/opus-mt-fr-en-finetuned-fr-to-en")
model = AutoModelForSeq2SeqLM.from_pretrained("DomLoyer/opus-mt-fr-en-finetuned-fr-to-en")
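The "load model directly" route can be wrapped in a small helper. The following is a minimal sketch, not part of the model card: the name `translate_batch` and the `max_length=512` default are illustrative assumptions, and the helper relies only on the standard tokenizer call, `generate`, and `batch_decode`.

```python
def translate_batch(sentences, tokenizer, model, max_length=512):
    """Translate a list of French sentences with a seq2seq tokenizer/model pair.

    Hypothetical helper: `tokenizer` and `model` are expected to be the objects
    returned by AutoTokenizer.from_pretrained(...) and
    AutoModelForSeq2SeqLM.from_pretrained(...).
    """
    if not sentences:
        return []
    # Pad to the longest sentence and truncate anything beyond max_length tokens.
    inputs = tokenizer(
        sentences,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    )
    # generate() returns one sequence of token ids per input sentence.
    outputs = model.generate(**inputs)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Example usage (downloads the model, so it is left commented out):
# from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# name = "DomLoyer/opus-mt-fr-en-finetuned-fr-to-en"
# tokenizer = AutoTokenizer.from_pretrained(name)
# model = AutoModelForSeq2SeqLM.from_pretrained(name)
# print(translate_batch(["Bonjour, comment allez-vous ?"], tokenizer, model))
```

Passing the tokenizer and model as arguments keeps the helper independent of how they were loaded, so it should work the same way on CPU or after moving both the model and the inputs to a GPU.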
| # -*- coding: utf-8 -*- | |
| """Copy of en-fr_160326 | |
| Automatically generated by Colab. | |
| Original file is located at | |
| https://colab.research.google.com/drive/1IDTAHSPu8h3v8fs0yAhH8ak1dNF2d_qz | |
| """ | |
| # Load model directly | |
| from transformers import AutoTokenizer, AutoModelForSeq2SeqLM | |
| tokenizer = AutoTokenizer.from_pretrained("DomLoyer/opus-mt-fr-en-finetuned-fr-to-en") | |
| model = AutoModelForSeq2SeqLM.from_pretrained("DomLoyer/opus-mt-fr-en-finetuned-fr-to-en") | |
| """### Option 1: Use a Pipeline (simplest) | |
| The simplest approach is to create a `pipeline` object, which handles preprocessing and postprocessing automatically. | |
| """ | |
| from transformers import pipeline | |
| try: | |
|     # For this specific model, the manual method is recommended. | |
|     # Try the pipeline with the generic task name. | |
|     translator = pipeline("translation", model=model, tokenizer=tokenizer) | |
|     texte_fr = "Bonjour, comment allez-vous aujourd'hui ?" | |
|     resultat = translator(texte_fr) | |
|     print(f"Translation (Pipeline): {resultat[0]['translation_text']}") | |
| except Exception as e: | |
|     print(f"Note: the Pipeline method failed for this specific model ({e}).") | |
|     print("Please use Option 2 (manual method), which works correctly.") | |
| """### Option 2: Manual use (Tokenizer + Model) | |
| If you want more control, you can encode the text and generate the output yourself: | |
| """ | |
| from transformers import AutoTokenizer, AutoModelForSeq2SeqLM | |
| # 0. Initialize the model and tokenizer | |
| model_name = "DomLoyer/opus-mt-fr-en-finetuned-fr-to-en" | |
| tokenizer = AutoTokenizer.from_pretrained(model_name) | |
| model = AutoModelForSeq2SeqLM.from_pretrained(model_name) | |
| # 1. Split the text into chunks of fewer than 512 tokens | |
| def get_chunks(text, tokenizer, max_tokens=450): | |
|     sentences = text.replace('\n', ' ').split('. ') | |
|     chunks = [] | |
|     current_chunk = "" | |
|     for sentence in sentences: | |
|         test_sentence = sentence + ". " | |
|         if len(tokenizer.encode(current_chunk + test_sentence)) < max_tokens: | |
|             current_chunk += test_sentence | |
|         else: | |
|             if current_chunk: chunks.append(current_chunk.strip()) | |
|             current_chunk = test_sentence | |
|     if current_chunk: chunks.append(current_chunk.strip()) | |
|     return chunks | |
| print("Model and chunking functions initialized.") | |
| """## Document Translation System (PDF & LaTeX) | |
| This system imports a document, splits it to respect the 512-token limit, and translates the entire content. | |
| """ | |
| from google.colab import files | |
| import os | |
| # 1. Upload the file | |
| print('Please select your file (.pdf or .tex):') | |
| uploaded = files.upload() | |
| if uploaded: | |
|     uploaded_file_name = list(uploaded.keys())[0] | |
|     file_extension = os.path.splitext(uploaded_file_name)[1].lower() | |
|     print(f'File loaded: {uploaded_file_name}') | |
| else: | |
|     print('No file selected.') | |
| !pip install pymupdf | |
| import fitz # PyMuPDF | |
| import os | |
| from google.colab import files | |
| from tqdm.auto import tqdm | |
| # 1. Select and upload the file | |
| print('Please select your file (.pdf or .tex):') | |
| uploaded = files.upload() | |
| if uploaded: | |
|     uploaded_file_name = list(uploaded.keys())[0] | |
|     file_extension = os.path.splitext(uploaded_file_name)[1].lower() | |
|     def extract_text_from_file(file_path, extension): | |
|         if extension == '.pdf': | |
|             text = "" | |
|             with fitz.open(file_path) as doc: | |
|                 for page in doc: | |
|                     text += page.get_text() | |
|             return text | |
|         elif extension == '.tex': | |
|             with open(file_path, 'r', encoding='utf-8') as f: | |
|                 return f.read() | |
|         return "" | |
|     # 2. Extraction and chunking (get_chunks and tokenizer must already be initialized) | |
|     raw_text = extract_text_from_file(uploaded_file_name, file_extension) | |
|     try: | |
|         chunks = get_chunks(raw_text, tokenizer) | |
|     except NameError: | |
|         print("Error: please run cell 99afb542 first to initialize the tokenizer and functions.") | |
|         chunks = [] | |
|     if chunks: | |
|         print(f'File loaded: {uploaded_file_name}') | |
|         print(f'The document contains {len(chunks)} segments to translate.') | |
|         # 3. Translate chunk by chunk | |
|         resultats_traduction = [] | |
|         for segment in tqdm(chunks, desc="Translating"): | |
|             inputs = tokenizer(segment, return_tensors='pt', truncation=True, max_length=512) | |
|             outputs = model.generate(**inputs) | |
|             resultats_traduction.append(tokenizer.decode(outputs[0], skip_special_tokens=True)) | |
|         # 4. Final assembly and save | |
|         traduction_complete = "\n\n".join(resultats_traduction) | |
|         output_filename = f"translated_{os.path.splitext(uploaded_file_name)[0]}.txt" | |
|         with open(output_filename, 'w', encoding='utf-8') as f: | |
|             f.write(traduction_complete) | |
|         print(f'\nTranslation complete! File: {output_filename}') | |
|         files.download(output_filename) | |
| else: | |
|     print('No file selected.') | |
| !pip install pymupdf | |
| import fitz | |
| print(f'PyMuPDF version installed: {fitz.__version__}') | |
| """# Task | |
| Build a complete PDF and LaTeX translation pipeline using the `DomLoyer/opus-mt-fr-en-finetuned-fr-to-en` model. The system must: | |
| 1. **Install Dependencies**: Install `PyMuPDF` (fitz) for PDF processing and `sacremoses` for the translation model. | |
| 2. **File Upload**: Implement a mechanism to upload `.pdf` or `.tex` files. | |
| 3. **Text Extraction & Chunking**: Extract text from the uploaded documents and split it into manageable chunks (under 512 tokens) to avoid the `IndexError` encountered previously. | |
| 4. **Batch Translation**: Process each chunk through the model, implement a progress bar, and reassemble the translated text. | |
| 5. **Export**: Provide the final translated content for download. | |
| ## Install Dependencies | |
| ### Subtask: | |
| Install the necessary libraries for PDF processing, LaTeX handling, and model-specific tokenization requirements. | |
| **Reasoning**: | |
| Install the required libraries `pymupdf` (fitz) and `sacremoses` to handle PDF processing and satisfy model tokenization requirements as specified in the instructions. | |
| """ | |
| !pip install pymupdf sacremoses | |
| import fitz | |
| import sacremoses | |
| print(f'PyMuPDF version: {fitz.__version__}') | |
| print('sacremoses successfully imported.') | |
| """## Upload Functionality | |
| ### Subtask: | |
| Implement a file upload mechanism to allow users to import PDF or .tex files into the Colab environment. | |
| **Reasoning**: | |
| I will implement the file upload functionality as requested, allowing users to upload .pdf or .tex files and validating the extension. | |
| """ | |
| from google.colab import files | |
| import os | |
| print('Please upload your .pdf or .tex file:') | |
| uploaded = files.upload() | |
| if uploaded: | |
|     filename = list(uploaded.keys())[0] | |
|     ext = os.path.splitext(filename)[1].lower() | |
|     if ext in ['.pdf', '.tex']: | |
|         print(f'Successfully uploaded: {filename}') | |
|         # Store filename for later use | |
|         uploaded_file_path = filename | |
|     else: | |
|         print(f'Error: Unsupported file type ({ext}). Please upload a .pdf or .tex file.') | |
| else: | |
|     print('No file was uploaded.') | |
| """## Document Processing and Chunking | |
| ### Subtask: | |
| Extract text from the uploaded .pdf or .tex file and split it into segments under 512 tokens to prevent model errors. | |
| **Reasoning**: | |
| I will implement the text extraction for PDF and LaTeX files and create a chunking mechanism that respects the 512-token limit of the translation model to avoid indexing errors. | |
| """ | |
| import fitz | |
| import os | |
| def extract_text(file_path): | |
|     ext = os.path.splitext(file_path)[1].lower() | |
|     text = "" | |
|     if ext == '.pdf': | |
|         doc = fitz.open(file_path) | |
|         for page in doc: | |
|             text += page.get_text("text") + "\n" | |
|         doc.close() | |
|     elif ext == '.tex': | |
|         with open(file_path, 'r', encoding='utf-8') as f: | |
|             text = f.read() | |
|     return text | |
| # 1. Extract raw text | |
| raw_text = extract_text(uploaded_file_path) | |
| print(f'Extracted {len(raw_text)} characters.') | |
| # 2. Initialize tokenizer for chunking | |
| from transformers import AutoTokenizer | |
| tokenizer = AutoTokenizer.from_pretrained("DomLoyer/opus-mt-fr-en-finetuned-fr-to-en") | |
| # 3 & 4. Chunking strategy respecting sentence boundaries and token limit | |
| def get_chunks(text, tokenizer, max_tokens=500): | |
|     sentences = text.replace('\n', ' ').split('. ') | |
|     chunks = [] | |
|     current_chunk = "" | |
|     for sentence in sentences: | |
|         test_sentence = sentence + ". " | |
|         # Check if adding the next sentence exceeds the limit | |
|         combined_tokens = len(tokenizer.encode(current_chunk + test_sentence)) | |
|         if combined_tokens < max_tokens: | |
|             current_chunk += test_sentence | |
|         else: | |
|             if current_chunk: | |
|                 chunks.append(current_chunk.strip()) | |
|             # If a single sentence is too long, split it by words (fallback) | |
|             if len(tokenizer.encode(test_sentence)) >= max_tokens: | |
|                 words = test_sentence.split(' ') | |
|                 temp_subchunk = "" | |
|                 for word in words: | |
|                     if len(tokenizer.encode(temp_subchunk + word + " ")) < max_tokens: | |
|                         temp_subchunk += word + " " | |
|                     else: | |
|                         # Skip empty sub-chunks (a single word over the limit) | |
|                         if temp_subchunk: | |
|                             chunks.append(temp_subchunk.strip()) | |
|                         temp_subchunk = word + " " | |
|                 current_chunk = temp_subchunk | |
|             else: | |
|                 current_chunk = test_sentence | |
|     if current_chunk: | |
|         chunks.append(current_chunk.strip()) | |
|     return chunks | |
| # 5. Store result | |
| text_chunks = get_chunks(raw_text, tokenizer) | |
| print(f'Split text into {len(text_chunks)} chunks.') | |
| if text_chunks: | |
|     print(f'First chunk tokens: {len(tokenizer.encode(text_chunks[0]))}') | |
| """## Batch Translation | |
| ### Subtask: | |
| Translate the pre-processed text chunks using the fine-tuned Marian MT model and monitor progress with a progress bar. | |
| **Reasoning**: | |
| I will implement the batch translation logic by iterating through the previously created text chunks, using the model to generate translations with GPU acceleration if available, and tracking progress with tqdm. | |
| """ | |
| import torch | |
| from tqdm.auto import tqdm | |
| from transformers import AutoModelForSeq2SeqLM | |
| # 1. Setup device | |
| device = "cuda" if torch.cuda.is_available() else "cpu" | |
| print(f"Using device: {device}") | |
| # 2. Load model and move to device | |
| model_name = "DomLoyer/opus-mt-fr-en-finetuned-fr-to-en" | |
| model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device) | |
| translated_chunks = [] | |
| # 3. Iterate through chunks with progress bar | |
| print("Starting translation...") | |
| for chunk in tqdm(text_chunks, desc="Translating chunks"): | |
|     # Tokenize | |
|     inputs = tokenizer(chunk, return_tensors="pt", truncation=True, max_length=512).to(device) | |
|     # Generate | |
|     with torch.no_grad(): | |
|         outputs = model.generate(**inputs) | |
|     # Decode | |
|     decoded_text = tokenizer.decode(outputs[0], skip_special_tokens=True) | |
|     translated_chunks.append(decoded_text) | |
| # 4. Join results | |
| final_translation = "\n".join(translated_chunks) | |
| print(f"\nTranslation complete. Total translated chunks: {len(translated_chunks)}") | |
| print(f"Preview of translation (first 200 chars):\n{final_translation[:200]}...") | |
| """## Final Task | |
| ### Subtask: | |
| Verify the final translated text and provide a mechanism to download the result as a text file. | |
| ## Summary: | |
| ### Q&A | |
| **How does the system prevent the `IndexError` encountered in previous versions?** | |
| The system implements a token-aware chunking strategy. It uses the specific tokenizer for the `opus-mt-fr-en-finetuned-fr-to-en` model to measure text segments, ensuring every chunk is under a 500-token safety limit (well within the model's 512-token maximum). It also includes a word-based fallback for exceptionally long sentences. | |
| **Which file formats are supported by this pipeline?** | |
| The pipeline supports `.pdf` files (processed via `PyMuPDF`) and `.tex` (LaTeX) files (processed via standard text reading). | |
| **What model is used for the translation?** | |
| The system uses the `DomLoyer/opus-mt-fr-en-finetuned-fr-to-en` model, which is a Marian MT transformer fine-tuned for French-to-English translation. | |
| ### Data Analysis Key Findings | |
| * **Environment Efficiency**: The pipeline integrates `PyMuPDF` for document parsing and `sacremoses`, which the Marian tokenizer uses for Moses-style punctuation normalization when it is installed. | |
| * **Robust Extraction**: The system handles binary PDF data and plain-text LaTeX files, consolidating them into a raw string for processing. | |
| * **Intelligent Segmentation**: Instead of arbitrary character counts, the system splits text at sentence boundaries (`. `) to maintain linguistic context, which is critical for high-quality machine translation. | |
| * **Optimized Inference**: The translation logic automatically detects and utilizes **CUDA (GPU)** for faster processing and includes a `tqdm` progress bar for real-time monitoring of batch tasks. | |
| * **Successful Reconstruction**: All processed chunks are reassembled into a single `final_translation` string, maintaining the logical flow of the original document. | |
| ### Insights or Next Steps | |
| * **Format Preservation**: While the text is successfully translated, a valuable next step would be implementing logic to preserve specific LaTeX commands or PDF layouts in the output file. | |
| * **Post-Processing**: Implement a "Download" button using `google.colab.files.download` to allow users to save the `final_translation` string as a local `.txt` or `.docx` file. | |
| # Task | |
| Export the `final_translation` as a downloadable text file, named `translated_document.txt`. | |
| ## Export Translated Content | |
| ### Subtask: | |
| Provide a mechanism to download the final translated content. | |
| ## Summary: | |
| ### Data Analysis Key Findings | |
| The primary objective of this subtask was to export the final translated content, stored in the `final_translation` variable, into a downloadable text file. The specified filename for the output was `translated_document.txt`. | |
| ### Insights or Next Steps | |
| * No implementation for exporting `final_translation` was provided, so the specific steps for generating and downloading `translated_document.txt` are not available. | |
| * The next step is to write the `final_translation` content to a file named `translated_document.txt` and then give the user a way to download it (e.g., using a library function). | |
| """ |
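The export step described in the final summary was left unimplemented. The following is a minimal sketch of one way it could look, not the notebook author's code: the helper names `export_translation` and `download_in_colab` and the `sample_translation` stand-in are hypothetical, and the download only triggers when `google.colab` is importable.

```python
import tempfile
from pathlib import Path


def export_translation(text, filename="translated_document.txt"):
    """Write the translated text to a UTF-8 file and return its path."""
    path = Path(filename)
    path.write_text(text, encoding="utf-8")
    return path


def download_in_colab(path):
    """Trigger a browser download in Colab; return False anywhere else."""
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        return False
    files.download(str(path))
    return True


# Stand-in text so the sketch runs on its own; in the notebook this would be
# the `final_translation` string built in the batch translation step.
sample_translation = "Hello, how are you today?"
output_path = export_translation(
    sample_translation,
    Path(tempfile.gettempdir()) / "translated_document.txt",
)
if not download_in_colab(output_path):
    print(f"Saved translation to {output_path}")
```

In the notebook itself, calling `export_translation(final_translation)` would write `translated_document.txt` to the working directory, and `download_in_colab` would then offer it to the browser.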