
	---
	base_model: unsloth/csm-1b
	library_name: peft
	license: mit
	datasets:
	- Dev372/Cardiology_Medical_STT_Dataset
	language:
	- en
	pipeline_tag: text-to-speech
	tags:
	- cardiology
	- medical
	- transformers
	---

	# Model Card for Cardiology-TTS



	## Model Details

	Cardiology-TTS is a fine-tuned version of the Conversational Speech Model (CSM-1B), trained with LoRA for parameter-efficient fine-tuning.
	It was fine-tuned on 1,530 samples of cardiology-related text paired with audio, with the goal of generating clear speech from cardiology text.
	It builds on the text-to-speech capabilities of the original CSM-1B model and adapts them to cardiology terminology.
	It is intended for English speech generation, especially in clinical and educational contexts.
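
To make "parameter-efficient" concrete, the sketch below counts the weights a LoRA adapter adds to a single linear layer compared with the frozen weights of that layer. The layer shape and rank are hypothetical values chosen for illustration; this card does not state the LoRA rank or target modules used for Cardiology-TTS.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Weights added by one LoRA adapter on a d_out x d_in matrix.

    LoRA keeps the original matrix W frozen and learns a low-rank update
    B @ A, where B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


def full_params(d_out: int, d_in: int) -> int:
    """Weights in the frozen d_out x d_in matrix itself."""
    return d_out * d_in


# ASSUMPTION: hypothetical layer shape and rank, not taken from this model.
d_out, d_in, rank = 2048, 2048, 16

added = lora_trainable_params(d_out, d_in, rank)
frozen = full_params(d_out, d_in)
fraction = added / frozen

print(f"adapter weights: {added}, frozen weights: {frozen}, ratio: {fraction:.4f}")
```

Under these assumed numbers the adapter trains under 2% of the weights in that layer, which is why a small dataset can be used without updating the full model.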

	## Uses

	### Direct Use

	- Text-to-Speech (TTS) generation for cardiology educational content, medical reports, or clinical explanations.
	- Integrating spoken content in healthcare apps, e-learning platforms, or patient-facing tools for cardiology topics.
	- Research and prototyping domain-specific TTS applications using small medical datasets.


	## Bias, Risks, and Limitations

	- Small training dataset (1,530 samples) → Model may not generalize well to rare medical terms, long passages, or other medical domains outside cardiology.
	- English-only support → Model is not trained for other languages.
	- TTS artifacts → Some generated audio may contain unnatural pauses, mispronunciations, or clipping in challenging sentences.
	- Not for diagnostic purposes → Model outputs speech for educational/illustrative purposes and should not be used for medical diagnosis or patient instructions.
	- Model size and resources → CSM-1B is a large model; real-time inference requires a GPU, and batch synthesis requires significant VRAM.
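
Since long passages are listed as a weak spot, one possible workaround is to split a long report into short, sentence-aligned chunks and synthesize each chunk separately. The sketch below is a minimal illustration only: `split_into_chunks` and the `max_chars` limit are hypothetical and are not part of this model or its libraries.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    Sentences are split on '.', '!' or '?' followed by whitespace, so
    decimals such as "2.5 mg" stay intact, but abbreviations such as
    "Dr. Smith" would be split and may need extra handling.
    A single sentence longer than max_chars is kept as its own chunk.
    """
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", text.strip())
        if s.strip()
    ]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


report = (
    "The left ventricle is normal in size. "
    "Ejection fraction is 60%. "
    "No pericardial effusion is seen."
)
chunks = split_into_chunks(report, max_chars=70)
for chunk in chunks:
    print(chunk)
```

Each chunk could then be passed as the `text` of a separate generation call and the resulting audio segments concatenated; whether this improves quality for this model has not been evaluated here.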

	## How to Get Started with the Model

	Use the code below to get started with the model.

	```python
	import torch
	import soundfile as sf
	from peft import PeftModel
	from transformers import AutoProcessor, CsmForConditionalGeneration

	base_model_id = "unsloth/csm-1b"
	adapter_id = "khazarai/Cardiology-TTS"
	device = "cuda" if torch.cuda.is_available() else "cpu"

	# Load the processor and the base CSM-1B model, then attach the LoRA adapter.
	processor = AutoProcessor.from_pretrained(base_model_id)
	base_model = CsmForConditionalGeneration.from_pretrained(base_model_id, device_map=device)
	model = PeftModel.from_pretrained(base_model, adapter_id)

	text = "The coronary arteries are patent with no significant stenosis."
	speaker_id = 0

	# The speaker ID is passed as the chat role.
	conversation = [
	    {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
	]

	# Tokenize the conversation and move the inputs to the same device as the model.
	inputs = processor.apply_chat_template(
	    conversation,
	    tokenize=True,
	    return_dict=True,
	).to(device)

	audio_values = model.generate(
	    **inputs,
	    max_new_tokens=200,
	    # Play with these parameters to tweak results:
	    # depth_decoder_top_k=0,
	    # depth_decoder_top_p=0.9,
	    # depth_decoder_do_sample=True,
	    # depth_decoder_temperature=0.9,
	    # top_k=0,
	    # top_p=1.0,
	    # temperature=0.9,
	    # do_sample=True,
	    output_audio=True,
	)

	# Convert the first generated waveform to float32 and save it at 24 kHz.
	audio = audio_values[0].to(torch.float32).cpu().numpy()
	sf.write("example.wav", audio, 24000)
	```
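
If generated audio stops before the sentence is finished, `max_new_tokens` may be too small for the text. The sketch below converts between a target duration and a token budget. It assumes that each generated step corresponds to one audio frame at 12.5 frames per second, which is an assumption about the underlying audio codec and is not stated in this card. The helper names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 24_000  # the rate used when writing example.wav
FRAME_RATE_HZ = 12.5     # ASSUMPTION: one generated frame per 80 ms of audio


def max_new_tokens_for(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Smallest token budget that covers the requested duration."""
    return math.ceil(seconds * frame_rate_hz)


def seconds_for(max_new_tokens: int, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Upper bound on audio length for a given token budget."""
    return max_new_tokens / frame_rate_hz


budget_seconds = seconds_for(200)                  # budget used in the example
budget_samples = int(budget_seconds * SAMPLE_RATE_HZ)
tokens_for_ten_seconds = max_new_tokens_for(10.0)

print(budget_seconds, budget_samples, tokens_for_ten_seconds)
```

Under this assumption, `max_new_tokens=200` caps the output at about 16 seconds, which is ample for a single sentence but may truncate a longer passage.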

	## Training Details

	### Training Data

	- Dataset: Dev372/Cardiology_Medical_STT_Dataset
	- Size: 1,530 samples of cardiology-related text paired with audio.


	### Framework versions

	- PEFT 0.15.2