-
SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation
Paper • 2405.18503 • Published • 9
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation
Paper • 2405.20289 • Published • 11
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
Paper • 2406.02897 • Published • 16
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
Paper • 2406.03344 • Published • 22
Collections
Collections including paper arxiv:2507.15375
-
Energy-Based Transformers are Scalable Learners and Thinkers
Paper • 2507.02092 • Published • 70
MOSPA: Human Motion Generation Driven by Spatial Audio
Paper • 2507.11949 • Published • 25
Sound and Complete Neuro-symbolic Reasoning with LLM-Grounded Interpretations
Paper • 2507.09751 • Published • 2
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Paper • 2507.07982 • Published • 34
-
SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models
Paper • 2510.16917 • Published • 20
Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations
Paper • 2510.16893 • Published • 18
Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition
Paper • 2510.08047 • Published • 8
SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models
Paper • 2510.06917 • Published • 35
-
BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing
Paper • 2503.13434 • Published • 28
Edit Transfer: Learning Image Editing via Vision In-Context Relations
Paper • 2503.13327 • Published • 29
WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes
Paper • 2503.13435 • Published • 18
MediaTek-Research/Llama-Breeze2-8B-Instruct
8B • Updated • 797 • 54