---
base_model: unsloth/csm-1b
library_name: peft
license: mit
datasets:
- Dev372/Cardiology_Medical_STT_Dataset
language:
- en
pipeline_tag: text-to-speech
tags:
- cardiology
- medical
- transformers
---
# Model Card for Cardiology-TTS
## Model Details
This is a fine-tuned version of the Conversational Speech Model (CSM-1B) using LoRA for parameter-efficient fine-tuning.
The model is trained on 1,530 samples of cardiology text paired with audio, so that it can synthesize speech from cardiology-related text.
It leverages the capabilities of the original CSM-1B model for text-to-speech synthesis, extended with domain-specific terminology for medical cardiology.
It is intended for speech generation in English, especially for clinical and educational contexts.
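To illustrate why LoRA counts as parameter-efficient, here is a minimal arithmetic sketch. The layer size and rank below are hypothetical values chosen for illustration; this card does not state the actual LoRA configuration used for Cardiology-TTS.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer.

    LoRA freezes the original weight W and learns a low-rank update B @ A,
    where A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * d_in + d_out * rank


# Hypothetical layer size and rank, for illustration only.
d_in, d_out, rank = 2048, 2048, 16
full = d_in * d_out
added = lora_param_count(d_in, d_out, rank)
print(f"full layer: {full:,} weights; LoRA adapter: {added:,} ({added / full:.2%})")
```

Only the adapter weights are trained and stored, which is why the adapter published here can be loaded on top of the unchanged base model.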
## Uses
### Direct Use
- Text-to-Speech (TTS) generation for cardiology educational content, medical reports, or clinical explanations.
- Integrating spoken content in healthcare apps, e-learning platforms, or patient-facing tools for cardiology topics.
- Research and prototyping domain-specific TTS applications using small medical datasets.
## Bias, Risks, and Limitations
- Small training dataset (1,530 samples) → Model may not generalize well to rare medical terms, long passages, or other medical domains outside cardiology.
- English-only support → Model is not trained for other languages.
- TTS artifacts → Some generated audio may contain unnatural pauses, mispronunciations, or clipping in challenging sentences.
- Not for diagnostic purposes → Model outputs speech for educational/illustrative purposes and should not be used for medical diagnosis or patient instructions.
- Model size and resources → CSM-1B is large; requires GPU for real-time inference and significant VRAM for batch synthesis.
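Because long passages are listed as a weak spot, one possible workaround is to split the input into short chunks and synthesize each chunk separately. The sketch below is a hypothetical helper, not part of this model or of any library; the name `split_into_chunks` and the 200-character default are assumptions.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Split text at sentence boundaries and pack sentences into chunks.

    Each chunk holds whole sentences and stays within max_chars where
    possible. A single sentence longer than max_chars is kept as its own chunk.
    """
    # Split after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


report = (
    "The left ventricle is dilated. Ejection fraction is reduced. "
    "No pericardial effusion is seen."
)
for chunk in split_into_chunks(report, max_chars=65):
    print(chunk)
```

The regular expression is deliberately simple: it also splits after abbreviations such as "Dr." when they are followed by a space, so clinical text may need a more careful sentence splitter.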
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
import soundfile as sf
from peft import PeftModel
from transformers import AutoProcessor, CsmForConditionalGeneration

model_id = "unsloth/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base CSM-1B model and processor, then attach the LoRA adapter.
processor = AutoProcessor.from_pretrained(model_id)
base_model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
model = PeftModel.from_pretrained(base_model, "khazarai/Cardiology-TTS")

text = "The coronary arteries are patent with no significant stenosis."
speaker_id = 0

# CSM takes chat-style input; the role is the speaker ID as a string.
conversation = [
    {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
]

# Tokenize and move the inputs to the same device as the model.
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

audio_values = model.generate(
    **inputs,
    max_new_tokens=200,
    # play with these parameters to tweak results
    # depth_decoder_top_k=0,
    # depth_decoder_top_p=0.9,
    # depth_decoder_do_sample=True,
    # depth_decoder_temperature=0.9,
    # top_k=0,
    # top_p=1.0,
    # temperature=0.9,
    # do_sample=True,
    output_audio=True,
)

# Convert the first generated waveform to float32 and save it as a 24 kHz WAV file.
audio = audio_values[0].to(torch.float32).cpu().numpy()
sf.write("example.wav", audio, 24000)
```
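As a quick sanity check on the result, the duration of the saved file can be read back with Python's standard `wave` module. This is a minimal sketch that assumes `example.wav` was written as a PCM WAV file; the helper name `wav_duration_seconds` is hypothetical.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        # Duration = number of audio frames / frames per second.
        return wav_file.getnframes() / wav_file.getframerate()
```

For example, `wav_duration_seconds("example.wav")` returns the clip length in seconds. An output that ends mid-sentence may mean generation stopped at the `max_new_tokens` limit, in which case raising that value is worth trying.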
## Training Details
### Training Data
- Dataset: Dev372/Cardiology_Medical_STT_Dataset
- Size: 1,530 samples of cardiology-related text paired with audio.
### Framework versions
- PEFT 0.15.2