---
base_model: unsloth/csm-1b
library_name: peft
license: mit
datasets:
- Dev372/Cardiology_Medical_STT_Dataset
language:
- en
pipeline_tag: text-to-speech
tags:
- cardiology
- medical
- transformers
---

# Model Card for Cardiology-TTS

## Model Details

This is a fine-tuned version of the Conversational Speech Model (CSM-1B), trained with LoRA for parameter-efficient fine-tuning. The model was trained on a 1,530-sample dataset of cardiology texts paired with audio and is designed to generate speech from cardiology-related text. It builds on the text-to-speech capabilities of the original CSM-1B model and adapts them to cardiology terminology. It is intended for English speech generation, especially in clinical and educational contexts.

## Uses

### Direct Use

- Text-to-speech (TTS) generation for cardiology educational content, medical reports, or clinical explanations.
- Integrating spoken content into healthcare apps, e-learning platforms, or patient-facing tools for cardiology topics.
- Research and prototyping of domain-specific TTS applications using small medical datasets.

## Bias, Risks, and Limitations

- Small training dataset (1,530 samples) → The model may not generalize well to rare medical terms, long passages, or medical domains outside cardiology.
- English-only support → The model is not trained for other languages.
- TTS artifacts → Some generated audio may contain unnatural pauses, mispronunciations, or clipping in challenging sentences.
- Not for diagnostic purposes → The model outputs speech for educational and illustrative purposes and should not be used for medical diagnosis or patient instructions.
- Model size and resources → CSM-1B is large; it requires a GPU for real-time inference and significant VRAM for batch synthesis.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
import soundfile as sf
from peft import PeftModel
from transformers import AutoProcessor, CsmForConditionalGeneration

model_id = "unsloth/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base CSM-1B model and its processor, then attach the LoRA adapter.
processor = AutoProcessor.from_pretrained(model_id)
base_model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
model = PeftModel.from_pretrained(base_model, "khazarai/Cardiology-TTS")

text = "The coronary arteries are patent with no significant stenosis."
speaker_id = 0

# CSM expects a chat-style conversation; the role is the speaker ID as a string.
conversation = [
    {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
]

# Move the inputs to the same device as the model (not a hard-coded "cuda").
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

audio_values = model.generate(
    **inputs,
    max_new_tokens=200,
    # Play with these parameters to tweak results:
    # depth_decoder_top_k=0,
    # depth_decoder_top_p=0.9,
    # depth_decoder_do_sample=True,
    # depth_decoder_temperature=0.9,
    # top_k=0,
    # top_p=1.0,
    # temperature=0.9,
    # do_sample=True,
    output_audio=True,
)

# Convert the first generated waveform to float32 and save it at 24 kHz.
audio = audio_values[0].to(torch.float32).cpu().numpy()
sf.write("example.wav", audio, 24000)
```

## Training Details

### Training Data

- Dataset: Dev372/Cardiology_Medical_STT_Dataset
- 1,530 samples of cardiology-related text paired with audio.

### Framework versions

- PEFT 0.15.2
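
## Estimating Output Length

The getting-started example caps generation at `max_new_tokens=200` and writes audio at 24,000 Hz. The sketch below shows one way to relate that token budget to audio duration. It assumes that each generated token corresponds to one audio codec frame and that the codec runs at 12.5 frames per second; both figures are assumptions not stated in this card, so check them against the base model's documentation before relying on them. The helper names are hypothetical.

```python
import math

# ASSUMPTION: one generated token = one codec frame at 12.5 frames per second.
# This value is not stated in the model card; adjust it if the base model differs.
ASSUMED_FRAMES_PER_SECOND = 12.5

# Sample rate used by sf.write(...) in the getting-started example.
SAMPLE_RATE_HZ = 24_000


def estimate_duration_seconds(max_new_tokens, frames_per_second=ASSUMED_FRAMES_PER_SECOND):
    """Upper bound on audio length (seconds) for a given token budget."""
    return max_new_tokens / frames_per_second


def tokens_for_duration(seconds, frames_per_second=ASSUMED_FRAMES_PER_SECOND):
    """Smallest token budget that covers the requested duration."""
    return math.ceil(seconds * frames_per_second)


def duration_from_samples(num_samples, sample_rate=SAMPLE_RATE_HZ):
    """Actual length (seconds) of a generated waveform with num_samples samples."""
    return num_samples / sample_rate


if __name__ == "__main__":
    print(estimate_duration_seconds(200))  # token budget from the example
    print(tokens_for_duration(10))         # budget for roughly 10 s of speech
```

Under these assumptions, a sentence that is cut off mid-word is a sign that `max_new_tokens` is too low for the text. Comparing `duration_from_samples(len(audio))` with the estimate gives a quick check on whether generation stopped early or hit the cap.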