# T³: Test-Time Model Merging for Medical Vision-Language Models

![T³ Workflow](figures/method.png)
*Figure 1: Dynamic test-time merging workflow of T³*

Official implementation of **T³: Test-Time Model Merging in Vision-Language Models for Zero-Shot Medical Imaging**, a method that adaptively fuses pretrained and fine-tuned vision-language models at test time using Jensen-Shannon divergence.

---

## Key Features

- 🧠 **Mutual Information Guidance**: Uses JS divergence to measure model consensus.
- ⚡ **Backpropagation-Free**: No gradient updates required during inference.
- 🏥 **Medical Modality Agnostic**: Validated on four medical imaging domains.
- 🚀 **Batch-Wise Efficiency**: Reduces compute cost by 32x vs. sample-wise merging.
- 📈 **SOTA Performance**: Outperforms 8+ baselines in accuracy and robustness.

---

## Table of Contents

- [Installation](#installation)
- [Method Overview](#method-overview)
- [Folder Structure](#folder-structure)
- [Reproducing Results](#reproducing-results)
- [Pretrained Weights](#pretrained-weights)
- [License](#license)
- [Contact](#contact)

## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/T3.git
   cd T3
   ```

2. Create the conda environment and install dependencies:

   ```bash
   conda create -n t3 python=3.9
   conda activate t3
   pip install -r requirements.txt
   ```

## Method Overview

### Adaptive Merging via Jensen-Shannon Divergence

The interpolation coefficient λ is computed dynamically for each sample using the following equation:

```math
\lambda(x) = \lambda_{\min} + (\lambda_{\max} - \lambda_{\min})\,\sigma\bigl(\gamma \cdot \mathrm{JS}(p_{pt}(x) \,\Vert\, p_{ft}(x))\bigr)
```

Where:

- `JS` = Jensen-Shannon divergence between pretrained and fine-tuned model predictions.
- `σ` = Sigmoid function for smooth scaling.
- `γ` = Scaling factor (default=0.5).
- `λ_min`, `λ_max` = Lower and upper bounds of the interpolation coefficient.

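The equation above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code in `t_cube.py`: the function names are hypothetical, the divergence uses natural logarithms, and the bounds `lam_min=0.0` and `lam_max=1.0` in the toy call are placeholder values chosen for the example (the README does not state defaults for them). Only `gamma=0.5` comes from the text above.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    # Mixture distribution: the element-wise average of p and q.
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def interpolation_coefficient(p_pt, p_ft, lam_min, lam_max, gamma=0.5):
    """lambda(x) = lam_min + (lam_max - lam_min) * sigmoid(gamma * JS)."""
    js = js_divergence(p_pt, p_ft)
    return lam_min + (lam_max - lam_min) * sigmoid(gamma * js)


# Toy class probabilities for one sample (hypothetical values).
p_pt = [0.7, 0.2, 0.1]  # pretrained model prediction
p_ft = [0.1, 0.2, 0.7]  # fine-tuned model prediction

lam = interpolation_coefficient(p_pt, p_ft, lam_min=0.0, lam_max=1.0)
print(f"JS = {js_divergence(p_pt, p_ft):.4f}, lambda = {lam:.4f}")
```

Because the JS divergence is non-negative, the sigmoid term is at least 0.5, so under this formula identical predictions (JS = 0) yield the midpoint of `[lam_min, lam_max]` and larger disagreement moves the coefficient toward `lam_max`.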
### Visual Explanation of the Method

The following analysis and figures motivate the method and illustrate its effectiveness:

1. **Dynamic Weighting Based on Model Agreement**

   We propose using Jensen–Shannon (JS) divergence to measure mutual information between pretrained (`p_pt`) and fine-tuned (`p_ft`) model predictions, offering a more robust gauge of joint confidence than entropy-based methods such as DaWin's entropy ratio:

   ```math
   R(x) = \frac{\mathcal{H}(p_{ft}(x))}{\mathcal{H}(p_{pt}(x)) + \mathcal{H}(p_{ft}(x))}
   ```

   JS divergence explicitly captures agreement vs. disagreement by comparing the full predictive distributions:

   ```math
   I(x) = \frac{1}{2} \Bigl(\mathrm{KL}(p_{pt}(x) \Vert \bar{p}(x)) + \mathrm{KL}(p_{ft}(x) \Vert \bar{p}(x))\Bigr)
   ```

   where

   ```math
   \bar{p}(x) = \frac{1}{2} \bigl(p_{pt}(x) + p_{ft}(x)\bigr)
   ```

   This ensures:

   - $I(x) = 0$ when the models fully agree.
   - $I(x) > 0$ when confident predictions disagree.

   Empirically, $I(x)$ correlates positively with $R(x)$ but better distinguishes disagreements, validating its use for adaptive merging.

2. **Mutual Information vs. Entropy**

   ![MI vs Entropy](figures/mi_v_ent.png)
   *Figure 3: Relationship between mutual information and entropy for adaptive merging.*

3. **Performance Across Modalities**

   ![Performance Comparison](figures/results.png)
   *Figure 4: T³ achieves superior performance across multiple medical imaging modalities.*

---

## Folder Structure

```
T3/
├── clip/            # CLIP model adaptations
├── data/            # Data utilities
├── utils/           # Helper functions
├── baselines.py     # Comparison methods
├── t_cube.py        # Core T³ implementation
├── BetaMixture.py   # Auxiliary models
└── README.md        # This document
```

---

## Reproducing Results

To reproduce the results from the paper, run the `t_cube.py` script. It evaluates T³ and its baselines across multiple datasets and severity levels. Additional baselines are available in `baselines.py`.

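For intuition about the two weighting signals this evaluation compares, the entropy ratio `R(x)` and the JS score `I(x)` defined under Method Overview can be computed on toy distributions. The sketch below is a standalone illustration, not the repository's implementation: the function names and probability values are hypothetical, and natural logarithms are assumed.

```python
import math


def entropy(p):
    """Shannon entropy in nats; zero-probability classes contribute zero."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def entropy_ratio(p_pt, p_ft):
    """R(x) = H(p_ft) / (H(p_pt) + H(p_ft)).

    Undefined when both distributions are one-hot (both entropies are zero).
    """
    h_pt, h_ft = entropy(p_pt), entropy(p_ft)
    return h_ft / (h_pt + h_ft)


def js_score(p_pt, p_ft):
    """I(x): average KL divergence of each prediction to their mixture."""
    mix = [0.5 * (a + b) for a, b in zip(p_pt, p_ft)]

    def kl(p, q):
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    return 0.5 * (kl(p_pt, mix) + kl(p_ft, mix))


# Both models are confident and pick the same class.
agree = ([0.9, 0.05, 0.05], [0.9, 0.05, 0.05])
# Both models are equally confident but pick different classes.
disagree = ([0.9, 0.05, 0.05], [0.05, 0.9, 0.05])

for name, (p_pt, p_ft) in [("agree", agree), ("disagree", disagree)]:
    print(f"{name}: R = {entropy_ratio(p_pt, p_ft):.3f}, "
          f"I = {js_score(p_pt, p_ft):.3f}")
```

In both cases the two predictions have the same entropy, so the entropy ratio is 0.5 and cannot tell the cases apart, whereas the JS score is zero only when the predictions coincide.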
To understand the script better:

- Refer to `compute_samplewise_tcube_weights` for the entropy-based weighting (the DaWin baseline) and to `compute_samplewise_tcube_weights_MI` for our mutual-information-based merging.
- Check the `evaluate_on_test_set` function to see how datasets and severities are processed.
- Explore the `evaluate_tcube` function for the merging and evaluation logic.

---

## Pretrained Weights

Pretrained weights are available on request for the following models:

1. **Generalist CLIP**: A pretrained model for general vision-language tasks.
2. **Expert CLIPs**: Four fine-tuned models, one for each of the following medical imaging domains:
   - Breast Imaging
   - Fundoscopy
   - Cell Microscopy
   - Retinal OCT

If you would like access to these weights, please contact [Raza Imam](mailto:raza.imam@mbzuai.ac.ae) directly.

---

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Contact

For questions or collaborations, contact [Raza Imam](mailto:raza.imam@mbzuai.ac.ae).