Qwen3.5 Audio Graft AVQA Adapter

This repository is an experimental adapter checkpoint for grafting openai/whisper-medium hidden states onto the Qwen/Qwen3.5-9B backbone as soft tokens. It does not include the full Qwen backbone weights, so it is not a standalone model: it must be loaded as the combination of base model + audio encoder + audio graft adapter.

Components

  • Base LLM/VLM backbone: Qwen/Qwen3.5-9B
  • Audio encoder: openai/whisper-medium
  • Training data: Joysw909/AVQA
  • Saved files:
    • audio_graft.pt: audio projector, audio start/end soft tokens, and configuration values
    • tokenizer/: tokenizer files
    • training_config.json: training settings
    • qwen_lora/: present only when Qwen LoRA was enabled
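
Before loading anything, it can help to confirm which of the files listed above are actually present in a downloaded copy of the repository. The following is a minimal sketch using only the standard library; the helper name describe_checkpoint_dir is hypothetical and not part of this repository, and the local path "." is only a placeholder.

```python
from pathlib import Path


def describe_checkpoint_dir(root):
    """Report which of the files listed in this card exist under `root`.

    Hypothetical helper for illustration; it only checks for presence and
    does not open or validate any file.
    """
    root = Path(root)
    return {
        "audio_graft.pt": (root / "audio_graft.pt").is_file(),
        "tokenizer/": (root / "tokenizer").is_dir(),
        "training_config.json": (root / "training_config.json").is_file(),
        # Optional: only written when Qwen LoRA was enabled during training.
        "qwen_lora/": (root / "qwen_lora").is_dir(),
    }


# Placeholder path: point this at the directory holding the downloaded files.
for name, present in describe_checkpoint_dir(".").items():
    print(f"{name}: {'found' if present else 'missing'}")
```

A missing qwen_lora/ directory is expected for the default configuration, in which the Qwen backbone stays frozen.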

Training method

The defaults below were chosen with low-VRAM Kaggle environments in mind.

  • Qwen backbone: frozen
  • Whisper encoder: frozen
  • Trained parameters: audio projector + audio soft tokens
  • Input: audio waveform + text prompt
  • Output: AVQA multiple-choice answer text
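
The setup above (frozen backbone and encoder, trainable projector and soft tokens) can be illustrated with a toy PyTorch sketch. This is not the notebook's QwenAudioGraft class: the class name ToyAudioGraft, the two-layer projector, the tiny dimensions, and the choice to place the audio span before the text prompt are all assumptions made for illustration, and a small embedding table stands in for the frozen Qwen backbone.

```python
import torch
from torch import nn


class ToyAudioGraft(nn.Module):
    """Hypothetical stand-in, not the notebook's QwenAudioGraft class."""

    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        # Trainable projector: audio-encoder width -> LLM embedding width.
        self.projector = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Trainable soft tokens marking the start and end of the audio span.
        self.audio_start = nn.Parameter(torch.randn(1, 1, llm_dim) * 0.02)
        self.audio_end = nn.Parameter(torch.randn(1, 1, llm_dim) * 0.02)

    def forward(self, audio_hidden, text_embeds):
        # audio_hidden: (batch, audio_len, audio_dim) from the frozen encoder
        # text_embeds:  (batch, text_len, llm_dim) from the frozen LLM embeddings
        batch = audio_hidden.size(0)
        soft = self.projector(audio_hidden)
        start = self.audio_start.expand(batch, -1, -1)
        end = self.audio_end.expand(batch, -1, -1)
        # Assumed layout: [start] audio soft tokens [end] text prompt.
        return torch.cat([start, soft, end, text_embeds], dim=1)


def freeze(module):
    """Disable gradients and switch to eval mode, as for a frozen backbone."""
    for p in module.parameters():
        p.requires_grad_(False)
    return module.eval()


# Toy sizes so the sketch runs on CPU; the real widths come from the models.
audio_dim, llm_dim = 16, 32
embed = freeze(nn.Embedding(100, llm_dim))  # stands in for the frozen LLM
graft = ToyAudioGraft(audio_dim, llm_dim)

audio_hidden = torch.randn(2, 5, audio_dim)   # fake encoder hidden states
prompt_ids = torch.randint(0, 100, (2, 7))    # fake tokenized prompt
inputs_embeds = graft(audio_hidden, embed(prompt_ids))

trainable = sum(p.numel() for p in graft.parameters() if p.requires_grad)
print(inputs_embeds.shape, trainable)
```

Only the parameters of the graft module would be handed to the optimizer in this arrangement, which keeps optimizer state small compared with full fine-tuning of the backbone.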

Limitations

This checkpoint is a research prototype.

  • The original Qwen3.5-9B is required.
  • The original Whisper-medium is required.
  • Training centers on the audio hidden-state graft; it is not full end-to-end training of the video branch.
  • The adapter is tuned for English AVQA question answering, so performance on Korean or general spoken dialogue is not guaranteed.
  • This is not a production model: safety and accuracy validation have not been completed.

Usage example (outline)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, WhisperFeatureExtractor, WhisperModel

ckpt = torch.load("audio_graft.pt", map_location="cpu")
base_model_id = ckpt["qwen_model_id"]
audio_encoder_id = ckpt["audio_encoder_id"]

# Use this together with the QwenAudioGraft class definition from this notebook.
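
Because the outline above reads two identifiers straight from the checkpoint, a small guard can give a clearer error when a file does not have the expected layout. The sketch below assumes only the two keys used in the outline (qwen_model_id and audio_encoder_id); the helper name read_model_ids is hypothetical, and the dictionary is a stand-in so the example runs without the real file.

```python
REQUIRED_KEYS = ("qwen_model_id", "audio_encoder_id")


def read_model_ids(ckpt):
    """Return (base_model_id, audio_encoder_id) or raise a readable error.

    Hypothetical helper: `ckpt` is the mapping returned by torch.load.
    """
    missing = [key for key in REQUIRED_KEYS if key not in ckpt]
    if missing:
        raise KeyError(f"audio_graft.pt is missing expected keys: {missing}")
    return ckpt["qwen_model_id"], ckpt["audio_encoder_id"]


# Stand-in dict with assumed values; with the real checkpoint, pass the
# object returned by torch.load("audio_graft.pt", map_location="cpu").
example_ckpt = {
    "qwen_model_id": "Qwen/Qwen3.5-9B",
    "audio_encoder_id": "openai/whisper-medium",
}
base_model_id, audio_encoder_id = read_model_ids(example_ckpt)
print(base_model_id, audio_encoder_id)
```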

Reproduction

This model was produced with the qwen35_audio_graft_avqa_kaggle family of Kaggle notebooks. The main settings are recorded in training_config.json.
