---
license: cc-by-nc-4.0
---

# SeamlessM4T-v2 T2ST Lite Model

Extracted from `facebook/seamless-m4t-v2-large`, containing only the T2ST (Text-to-Speech Translation) components.

> Original Model: [facebook/seamless-m4t-v2-large](https://huggingface.co/facebook/seamless-m4t-v2-large)
>
> Official Documentation: [SeamlessM4T-v2 Documentation](https://huggingface.co/docs/transformers/en/model_doc/seamless_m4t_v2)

Note: This package only reorganizes publicly available weights from Meta's original model for T2ST usage. No new training or fine-tuning is introduced. All rights to the model and weights belong to their original owner.

## Supported Features

- **T2ST (Text-to-Speech Translation)**: Text-to-speech translation with voice control
- **Multi-Speaker Support**: 200 different speaker voices
- **96 Languages**: Supported for text-to-speech translation

## Included Components

### Model Weights

- `text_encoder`: Text encoder
- `t2u_model`: Text-to-unit encoder-decoder (contains `t2u_encoder` and `t2u_decoder`)
- `vocoder`: HiFi-GAN vocoder, includes 200 speaker embeddings
- `shared.weight`: Shared word embeddings
- `lang_embed`: Language embeddings

## Model Size

- Original Model: ~8.6 GB
- Lite Model: ~4.0 GB
- Removed Weights: 1428 (`speech_encoder`, `text_decoder`)
- Space Saved: ~4.6 GB

## Usage Examples

### 1. Basic T2ST: Text-to-Speech Translation

```python
from transformers import SeamlessM4Tv2Model, AutoProcessor
import torchaudio

# Load model
model = SeamlessM4Tv2Model.from_pretrained("jaman21/seamless-m4t-v2-t2st")
processor = AutoProcessor.from_pretrained("jaman21/seamless-m4t-v2-t2st")

# Translate text to speech
text_inputs = processor(text="Hello world", src_lang="eng", return_tensors="pt")
# generate() returns (waveform, waveform_lengths); waveform has shape (batch, num_samples)
waveform = model.generate(**text_inputs, tgt_lang="cmn", generate_speech=True)[0].cpu()

# Save audio (sample rate: 16000 Hz)
# torchaudio.save expects a 2-D tensor of shape (channels, num_samples), not a NumPy array
torchaudio.save("output.wav", waveform.reshape(1, -1), 16000)
```

### 2. Use Different Speaker Voices

```python
# Use different speaker IDs (0-199) to get different voice characteristics
text_inputs = processor(text="Good morning!", src_lang="eng", return_tensors="pt")

# Speaker 0 - default voice (pretrained)
audio_spk0 = model.generate(**text_inputs, tgt_lang="spa", generate_speech=True, speaker_id=0)

# Speaker 5 - different voice (pretrained)
audio_spk5 = model.generate(**text_inputs, tgt_lang="spa", generate_speech=True, speaker_id=5)

# Speaker 42 - another voice option (pretrained)
audio_spk42 = model.generate(**text_inputs, tgt_lang="spa", generate_speech=True, speaker_id=42)

# Note: The same speaker_id may sound different in different target languages
# Try values between 0-199 to find the voice that best suits your use case
```

### 3. Batch Processing Multiple Texts

```python
# Process multiple texts at once
texts = [
    "Hello, how are you?",
    "What is your name?",
    "Nice to meet you!"
]

text_inputs = processor(text=texts, src_lang="eng", return_tensors="pt", padding=True)
# generate() returns a (waveforms, waveform_lengths) pair, so unpack it before iterating
waveforms, waveform_lengths = model.generate(**text_inputs, tgt_lang="ita", generate_speech=True)

# Save each audio output, trimming the padding added to match the longest item in the batch
for i, (audio, length) in enumerate(zip(waveforms, waveform_lengths)):
    audio = audio[: int(length)].cpu().unsqueeze(0)  # shape (1, num_samples)
    torchaudio.save(f"output_{i}.wav", audio, 16000)
```

### 4. Control Generation Quality

```python
text_inputs = processor(text="Translate this sentence", src_lang="eng", return_tensors="pt")

# Higher quality but more computationally expensive
high_quality_output = model.generate(
    **text_inputs,
    tgt_lang="rus",
    generate_speech=True,
    speaker_id=10,
    num_beams=5,          # Beam search
    max_new_tokens=512,   # Allow longer output
    length_penalty=1.0,   # No length penalty
    early_stopping=True,
    use_cache=True        # Accelerate generation
)

# Faster generation speed, acceptable quality
fast_output = model.generate(
    **text_inputs,
    tgt_lang="rus",
    generate_speech=True,
    speaker_id=10,
    num_beams=1,          # Greedy decoding
    max_new_tokens=256,
    use_cache=True
)
```

### 5. GPU/CPU Usage

```python
import torch

# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Process inputs on the same device
text_inputs = processor(text="Hello", src_lang="eng", return_tensors="pt")
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}

# Generate
with torch.inference_mode():  # More efficient than torch.no_grad()
    outputs = model.generate(**text_inputs, tgt_lang="cmn", generate_speech=True)
```

## License

Same as the original model: **CC-BY-NC-4.0**

For commercial use, please refer to Meta's licensing terms.

## References

- [SeamlessM4T-v2 Paper](https://arxiv.org/abs/2312.05187)
- [Official Model Card](https://huggingface.co/facebook/seamless-m4t-v2-large)
- [Transformers Documentation](https://huggingface.co/docs/transformers/en/model_doc/seamless_m4t_v2)
- [GitHub Repository](https://github.com/facebookresearch/seamless_communication)
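
## Appendix: Saving Audio Without torchaudio

The usage examples save audio at 16000 Hz with `torchaudio`. If `torchaudio` is not available, a mono waveform can also be written with Python's standard-library `wave` module. The following is a minimal sketch, not part of this package: the helper name `write_pcm16_wav` is hypothetical, and a synthetic sine tone stands in for real model output so the snippet runs without loading the model.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 16000  # Hz, the sample rate quoted in the usage examples


def write_pcm16_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file.

    Hypothetical helper for illustration; returns the duration in seconds.
    """
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to the valid range
        frames += struct.pack("<h", int(round(s * 32767)))  # little-endian int16
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(samples) / sample_rate


# One second of a 440 Hz tone as a stand-in for a generated waveform
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
out_path = os.path.join(tempfile.gettempdir(), "tone.wav")
duration = write_pcm16_wav(out_path, tone)
```

With real model output, the generated waveform would first be flattened to a 1-D sequence of floats (for example via `.squeeze().tolist()`) before being passed as `samples`.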