---
library_name: transformers
base_model:
- meta-llama/Llama-3.1-8B-Instruct
- DeSTA-ntu/Llama-3.1-8B-Instruct
datasets:
- DeSTA-ntu/DeSTA-AQA5M-FROM-Llama3.1-8B-Instruct
tags:
- audio-text-to-text
- Audio-understanding
- Audio-chat
---

# DeSTA2.5-Audio

[📑 Paper](https://arxiv.org/abs/2507.02768) | [👩‍💻 GitHub](https://github.com/kehanlu/DeSTA2.5-Audio) | [🤗 Model](https://huggingface.co/collections/DeSTA-ntu/desta25-audio-686a6b9e71afd92e1dd87486) | [🤗 Dataset](https://huggingface.co/datasets/DeSTA-ntu/DeSTA-AQA5M-FROM-Llama3.1-8B-Instruct)

**DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment**

> **Self-generated data is what you need for developing general-purpose LALMs!**

- 🧪 **A new training framework** ([read the paper](https://arxiv.org/abs/2507.02768))
  - Highly scalable and efficient without task-specific instruction-tuning data
  - Preserves language ability and avoids catastrophic forgetting
  - Comprehensive studies on data quality in LALM development
- 📦 **Open resources for the community**
  - Model checkpoints and training scripts
  - DeSTA-AQA5M dataset (5M audio-text pairs from 7,000 hours of audio)

## 🚀 Quickstart

### Installation

```shell
git clone https://github.com/kehanlu/DeSTA2.5-Audio.git
cd DeSTA2.5-Audio
pip install -e .
```

### Inference

```python
from desta import DeSTA25AudioModel

# Load the model from Hugging Face
model = DeSTA25AudioModel.from_pretrained("DeSTA-ntu/DeSTA2.5-Audio-Llama-3.1-8B")
model.to("cuda")

# Run inference with audio input
messages = [
    {
        "role": "system",
        "content": "Focus on the audio clips and instructions."
    },
    {
        "role": "user",
        "content": "<|AUDIO|>\nDescribe this audio.",
        "audios": [{
            "audio": "/path/to/audio.wav",  # Path to your audio file
            "text": None
        }]
    }
]

outputs = model.generate(
    messages=messages,
    do_sample=False,
    top_p=1.0,
    temperature=1.0,
    max_new_tokens=512
)

print(outputs.text)
```

## 📚 Citation

```bibtex
@article{lu2025desta25Audio,
  title={DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment},
  author={Lu, Ke-Han and Chen, Zhehuai and Fu, Szu-Wei and Yang, Chao-Han Huck and Huang, Sung-Feng and Yang, Chih-Kai and Yu, Chee-En and Chen, Chun-Wei and Chen, Wei-Chih and Huang, Chien-yu and others},
  journal={arXiv preprint arXiv:2507.02768},
  year={2025}
}

@inproceedings{lu2025developing,
  title={Developing instruction-following speech language model without speech instruction-tuning data},
  author={Lu, Ke-Han and Chen, Zhehuai and Fu, Szu-Wei and Yang, Chao-Han Huck and Balam, Jagadeesh and Ginsburg, Boris and Wang, Yu-Chiang Frank and Lee, Hung-yi},
  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2025},
  organization={IEEE}
}

@inproceedings{lu24c_interspeech,
  title = {DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment},
  author = {Ke-Han Lu and Zhehuai Chen and Szu-Wei Fu and He Huang and Boris Ginsburg and Yu-Chiang Frank Wang and Hung-yi Lee},
  year = {2024},
  booktitle = {Interspeech 2024},
  pages = {4159--4163},
  doi = {10.21437/Interspeech.2024-457},
  issn = {2958-1796},
}
```

## 👥 Contributors

Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee