new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jun 22

APTv2: Benchmarking Animal Pose Estimation and Tracking with a Large-scale Dataset and Beyond

Animal Pose Estimation and Tracking (APT) is a critical task in detecting and monitoring the keypoints of animals across a series of video frames, which is essential for understanding animal behavior. Past works relating to animals have primarily focused on either animal tracking or single-frame animal pose estimation only, neglecting the integration of both aspects. The absence of comprehensive APT datasets inhibits the progression and evaluation of animal pose estimation and tracking methods based on videos, thereby constraining their real-world applications. To fill this gap, we introduce APTv2, the pioneering large-scale benchmark for animal pose estimation and tracking. APTv2 comprises 2,749 video clips filtered and collected from 30 distinct animal species. Each video clip includes 15 frames, culminating in a total of 41,235 frames. Following meticulous manual annotation and stringent verification, we provide high-quality keypoint and tracking annotations for a total of 84,611 animal instances, split into easy and hard subsets based on the number of instances that exists in the frame. With APTv2 as the foundation, we establish a simple baseline method named \posetrackmethodname and provide benchmarks for representative models across three tracks: (1) single-frame animal pose estimation track to evaluate both intra- and inter-domain transfer learning performance, (2) low-data transfer and generalization track to evaluate the inter-species domain generalization performance, and (3) animal pose tracking track. Our experimental results deliver key empirical insights, demonstrating that APTv2 serves as a valuable benchmark for animal pose estimation and tracking. It also presents new challenges and opportunities for future research. The code and dataset are released at https://github.com/ViTAE-Transformer/APTv2{https://github.com/ViTAE-Transformer/APTv2}.

  • 4 authors
·
Dec 24, 2023

MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion

Multi-modal multi-view action recognition is a rapidly growing field in computer vision, offering significant potential for applications in surveillance. However, current datasets often fail to address real-world challenges such as wide-area distributed settings, asynchronous data streams, and the lack of frame-level annotations. Furthermore, existing methods face difficulties in effectively modeling inter-view relationships and enhancing spatial feature learning. In this paper, we introduce the MultiSensor-Home dataset, a novel benchmark designed for comprehensive action recognition in home environments, and also propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method. The proposed MultiSensor-Home dataset features untrimmed videos captured by distributed sensors, providing high-resolution RGB and audio data along with detailed multi-view frame-level action labels. The proposed MultiTSF method leverages a Transformer-based fusion mechanism to dynamically model inter-view relationships. Furthermore, the proposed method integrates a human detection module to enhance spatial feature learning, guiding the model to prioritize frames with human activity to enhance action the recognition accuracy. Experiments on the proposed MultiSensor-Home and the existing MM-Office datasets demonstrate the superiority of MultiTSF over the state-of-the-art methods. Quantitative and qualitative results highlight the effectiveness of the proposed method in advancing real-world multi-modal multi-view action recognition. The source code is available at https://github.com/thanhhff/MultiTSF.

  • 5 authors
·
Apr 3, 2025

Selectivity Drives Productivity: Efficient Dataset Pruning for Enhanced Transfer Learning

Massive data is often considered essential for deep learning applications, but it also incurs significant computational and infrastructural costs. Therefore, dataset pruning (DP) has emerged as an effective way to improve data efficiency by identifying and removing redundant training samples without sacrificing performance. In this work, we aim to address the problem of DP for transfer learning, i.e., how to prune a source dataset for improved pretraining efficiency and lossless finetuning accuracy on downstream target tasks. To our best knowledge, the problem of DP for transfer learning remains open, as previous studies have primarily addressed DP and transfer learning as separate problems. By contrast, we establish a unified viewpoint to integrate DP with transfer learning and find that existing DP methods are not suitable for the transfer learning paradigm. We then propose two new DP methods, label mapping and feature mapping, for supervised and self-supervised pretraining settings respectively, by revisiting the DP problem through the lens of source-target domain mapping. Furthermore, we demonstrate the effectiveness of our approach on numerous transfer learning tasks. We show that source data classes can be pruned by up to 40% ~ 80% without sacrificing downstream performance, resulting in a significant 2 ~ 5 times speed-up during the pretraining stage. Besides, our proposal exhibits broad applicability and can improve other computationally intensive transfer learning techniques, such as adversarial pretraining. Codes are available at https://github.com/OPTML-Group/DP4TL.

  • 9 authors
·
Oct 12, 2023

Trans-LoRA: towards data-free Transferable Parameter Efficient Finetuning

Low-rank adapters (LoRA) and their variants are popular parameter-efficient fine-tuning (PEFT) techniques that closely match full model fine-tune performance while requiring only a small number of additional parameters. These additional LoRA parameters are specific to the base model being adapted. When the base model needs to be deprecated and replaced with a new one, all the associated LoRA modules need to be re-trained. Such re-training requires access to the data used to train the LoRA for the original base model. This is especially problematic for commercial cloud applications where the LoRA modules and the base models are hosted by service providers who may not be allowed to host proprietary client task data. To address this challenge, we propose Trans-LoRA -- a novel method for lossless, nearly data-free transfer of LoRAs across base models. Our approach relies on synthetic data to transfer LoRA modules. Using large language models, we design a synthetic data generator to approximate the data-generating process of the observed task data subset. Training on the resulting synthetic dataset transfers LoRA modules to new models. We show the effectiveness of our approach using both LLama and Gemma model families. Our approach achieves lossless (mostly improved) LoRA transfer between models within and across different base model families, and even between different PEFT methods, on a wide variety of tasks.

  • 7 authors
·
May 27, 2024

Self-Supervised Dataset Distillation for Transfer Learning

Dataset distillation methods have achieved remarkable success in distilling a large dataset into a small set of representative samples. However, they are not designed to produce a distilled dataset that can be effectively used for facilitating self-supervised pre-training. To this end, we propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL). We first prove that a gradient of synthetic samples with respect to a SSL objective in naive bilevel optimization is biased due to the randomness originating from data augmentations or masking. To address this issue, we propose to minimize the mean squared error (MSE) between a model's representations of the synthetic examples and their corresponding learnable target feature representations for the inner objective, which does not introduce any randomness. Our primary motivation is that the model obtained by the proposed inner optimization can mimic the self-supervised target model. To achieve this, we also introduce the MSE between representations of the inner model and the self-supervised target model on the original full dataset for outer optimization. Lastly, assuming that a feature extractor is fixed, we only optimize a linear head on top of the feature extractor, which allows us to reduce the computational cost and obtain a closed-form solution of the head with kernel ridge regression. We empirically validate the effectiveness of our method on various applications involving transfer learning.

  • 6 authors
·
Oct 10, 2023

Enhancing Dataset Distillation via Non-Critical Region Refinement

Dataset distillation has become a popular method for compressing large datasets into smaller, more efficient representations while preserving critical information for model training. Data features are broadly categorized into two types: instance-specific features, which capture unique, fine-grained details of individual examples, and class-general features, which represent shared, broad patterns across a class. However, previous approaches often struggle to balance these features-some focus solely on class-general patterns, neglecting finer instance details, while others prioritize instance-specific features, overlooking the shared characteristics essential for class-level understanding. In this paper, we introduce the Non-Critical Region Refinement Dataset Distillation (NRR-DD) method, which preserves instance-specific details and fine-grained regions in synthetic data while enriching non-critical regions with class-general information. This approach enables models to leverage all pixel information, capturing both feature types and enhancing overall performance. Additionally, we present Distance-Based Representative (DBR) knowledge transfer, which eliminates the need for soft labels in training by relying on the distance between synthetic data predictions and one-hot encoded labels. Experimental results show that NRR-DD achieves state-of-the-art performance on both small- and large-scale datasets. Furthermore, by storing only two distances per instance, our method delivers comparable results across various settings. The code is available at https://github.com/tmtuan1307/NRR-DD.

  • 5 authors
·
Mar 23, 2025

FAIR Jupyter: a knowledge graph approach to semantic sharing and granular exploration of a computational notebook reproducibility dataset

The way in which data are shared can affect their utility and reusability. Here, we demonstrate how data that we had previously shared in bulk can be mobilized further through a knowledge graph that allows for much more granular exploration and interrogation. The original dataset is about the computational reproducibility of GitHub-hosted Jupyter notebooks associated with biomedical publications. It contains rich metadata about the publications, associated GitHub repositories and Jupyter notebooks, and the notebooks' reproducibility. We took this dataset, converted it into semantic triples and loaded these into a triple store to create a knowledge graph, FAIR Jupyter, that we made accessible via a web service. This enables granular data exploration and analysis through queries that can be tailored to specific use cases. Such queries may provide details about any of the variables from the original dataset, highlight relationships between them or combine some of the graph's content with materials from corresponding external resources. We provide a collection of example queries addressing a range of use cases in research and education. We also outline how sets of such queries can be used to profile specific content types, either individually or by class. We conclude by discussing how such a semantically enhanced sharing of complex datasets can both enhance their FAIRness, i.e., their findability, accessibility, interoperability, and reusability, and help identify and communicate best practices, particularly with regards to data quality, standardization, automation and reproducibility.

  • 2 authors
·
Apr 19, 2024

Hierarchical Dataset Selection for High-Quality Data Sharing

The success of modern machine learning hinges on access to high-quality training data. In many real-world scenarios, such as acquiring data from public repositories or sharing across institutions, data is naturally organized into discrete datasets that vary in relevance, quality, and utility. Selecting which repositories or institutions to search for useful datasets, and which datasets to incorporate into model training are therefore critical decisions, yet most existing methods select individual samples and treat all data as equally relevant, ignoring differences between datasets and their sources. In this work, we formalize the task of dataset selection: selecting entire datasets from a large, heterogeneous pool to improve downstream performance under resource constraints. We propose Dataset Selection via Hierarchies (DaSH), a dataset selection method that models utility at both dataset and group (e.g., collections, institutions) levels, enabling efficient generalization from limited observations. Across two public benchmarks (Digit-Five and DomainNet), DaSH outperforms state-of-the-art data selection baselines by up to 26.2% in accuracy, while requiring significantly fewer exploration steps. Ablations show DaSH is robust to low-resource settings and lack of relevant datasets, making it suitable for scalable and adaptive dataset selection in practical multi-source learning workflows.

Valentine: Evaluating Matching Techniques for Dataset Discovery

Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use. Nowadays schema matching serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method's success relies highly on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad-hoc fashion due to the lack of openly-available datasets with ground truth, reference method implementations, and evaluation metrics. In this paper, we aim to rectify the problem of evaluating the effectiveness and efficiency of schema matching methods for the specific needs of dataset discovery. To this end, we propose Valentine, an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine includes implementations of seminal schema matching methods that we either implemented from scratch (due to absence of open source code) or imported from open repositories. The contributions of Valentine are: i) the definition of four schema matching scenarios as encountered in dataset discovery methods, ii) a principled dataset fabrication process tailored to the scope of dataset discovery methods and iii) the most comprehensive evaluation of schema matching techniques to date, offering insight on the strengths and weaknesses of existing techniques, that can serve as a guide for employing schema matching in future dataset discovery methods.

  • 9 authors
·
Oct 14, 2020

Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model

Training deep networks requires various design decisions regarding for instance their architecture, data augmentation, or optimization. In this work, we find these training variations to result in networks learning unique feature sets from the data. Using public model libraries comprising thousands of models trained on canonical datasets like ImageNet, we observe that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other -- independent of overall performance. Given any arbitrary pairing of pretrained models and no external rankings (such as separate test sets, e.g. due to data privacy), we investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation -- a task made particularly difficult as additional knowledge can be contained in stronger, equiperformant or weaker models. Yet facilitating robust transfer in scenarios agnostic to pretrained model pairings would unlock auxiliary gains and knowledge fusion from any model repository without restrictions on model and problem specifics - including from weaker, lower-performance models. This work therefore provides an initial, in-depth exploration on the viability of such general-purpose knowledge transfer. Across large-scale experiments, we first reveal the shortcomings of standard knowledge distillation techniques, and then propose a much more general extension through data partitioning for successful transfer between nearly all pretrained models, which we show can also be done unsupervised. Finally, we assess both the scalability and impact of fundamental model properties on successful model-agnostic knowledge transfer.

  • 6 authors
·
Oct 26, 2023

Amplifying Pathological Detection in EEG Signaling Pathways through Cross-Dataset Transfer Learning

Pathology diagnosis based on EEG signals and decoding brain activity holds immense importance in understanding neurological disorders. With the advancement of artificial intelligence methods and machine learning techniques, the potential for accurate data-driven diagnoses and effective treatments has grown significantly. However, applying machine learning algorithms to real-world datasets presents diverse challenges at multiple levels. The scarcity of labelled data, especially in low regime scenarios with limited availability of real patient cohorts due to high costs of recruitment, underscores the vital deployment of scaling and transfer learning techniques. In this study, we explore a real-world pathology classification task to highlight the effectiveness of data and model scaling and cross-dataset knowledge transfer. As such, we observe varying performance improvements through data scaling, indicating the need for careful evaluation and labelling. Additionally, we identify the challenges of possible negative transfer and emphasize the significance of some key components to overcome distribution shifts and potential spurious correlations and achieve positive transfer. We see improvement in the performance of the target model on the target (NMT) datasets by using the knowledge from the source dataset (TUAB) when a low amount of labelled data was available. Our findings indicate a small and generic model (e.g. ShallowNet) performs well on a single dataset, however, a larger model (e.g. TCN) performs better on transfer and learning from a larger and diverse dataset.

  • 6 authors
·
Sep 19, 2023

BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR

IR in low-resource languages remains limited by the scarcity of high-quality, task-specific annotated datasets. Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity. This work presents a Bangla IR dataset constructed using a BETA-labeling framework involving multiple LLM annotators from diverse model families. The framework incorporates contextual alignment, consistency checks, and majority agreement, followed by human evaluation to verify label quality. Beyond dataset creation, we examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation. Using LLM-based translation across multiple language pairs, we experimented on meaning preservation and task validity between source and translated datasets. Our experiment reveal substantial variation across languages, reflecting language-dependent biases and inconsistent semantic preservation that directly affect the reliability of cross-lingual dataset reuse. Overall, this study highlights both the potential and limitations of LLM-assisted dataset creation for low-resource IR. It provides empirical evidence of the risks associated with cross-lingual dataset reuse and offers practical guidance for constructing more reliable benchmarks and evaluation pipelines in low-resource language settings.

  • 4 authors
·
Feb 16

Synthetic Dataset Evaluation Based on Generalized Cross Validation

With the rapid advancement of synthetic dataset generation techniques, evaluating the quality of synthetic data has become a critical research focus. Robust evaluation not only drives innovations in data generation methods but also guides researchers in optimizing the utilization of these synthetic resources. However, current evaluation studies for synthetic datasets remain limited, lacking a universally accepted standard framework. To address this, this paper proposes a novel evaluation framework integrating generalized cross-validation experiments and domain transfer learning principles, enabling generalizable and comparable assessments of synthetic dataset quality. The framework involves training task-specific models (e.g., YOLOv5s) on both synthetic datasets and multiple real-world benchmarks (e.g., KITTI, BDD100K), forming a cross-performance matrix. Following normalization, a Generalized Cross-Validation (GCV) Matrix is constructed to quantify domain transferability. The framework introduces two key metrics. One measures the simulation quality by quantifying the similarity between synthetic data and real-world datasets, while another evaluates the transfer quality by assessing the diversity and coverage of synthetic data across various real-world scenarios. Experimental validation on Virtual KITTI demonstrates the effectiveness of our proposed framework and metrics in assessing synthetic data fidelity. This scalable and quantifiable evaluation solution overcomes traditional limitations, providing a principled approach to guide synthetic dataset optimization in artificial intelligence research.

  • 6 authors
·
Sep 14, 2025

Prefix Conditioning Unifies Language and Label Supervision

Image-classification datasets have been used to pretrain image recognition models. Recently, web-scale image-caption datasets have emerged as a source of powerful pretraining alternative. Image-caption datasets are more ``open-domain'', containing a wider variety of scene types and vocabulary words than traditional classification datasets, and models trained on these datasets have demonstrated strong performance on few- and zero-shot recognition tasks. When naively unifying image-classification and -caption dataset, we show that such dataset biases negatively affect pre-training by reducing the generalizability of learned representations and thus jeopardizing zero-shot performance since the unification can tailor the model for the classification dataset, making it vulnerable to the distribution shift from the dataset. In this work, we address the problem by disentangling the dataset bias using prefix tokens that inform a language encoder of the type of the input dataset (e.g., image-classification or caption) at training time. This approach allows the language encoder to share the knowledge from two datasets as well as switch the mode of feature extraction, i.e., image-classification dataset or image-caption dataset tailored mode, where we use image-caption mode in the zero-shot evaluation. Our method is generic and can be easily integrated into existing VL pre-training objectives such as CLIP or UniCL. In experiments, we show that this simple technique improves the performance in zero-shot image recognition accuracy and robustness to the image-level distribution shift.

  • 7 authors
·
Jun 2, 2022

Data Filtering Networks

Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best performing dataset DFN-5B enables us to train state-of-the-art models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 83.0% zero-shot transfer accuracy on ImageNet, out-performing models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. In order to facilitate further research in dataset design, we also release a new 2 billion example dataset DFN-2B and show that high performance data filtering networks can be trained from scratch using only publicly available data.

  • 6 authors
·
Sep 29, 2023 1

AMUSE: Adaptive Multi-Segment Encoding for Dataset Watermarking

Curating high quality datasets that play a key role in the emergence of new AI applications requires considerable time, money, and computational resources. So, effective ownership protection of datasets is becoming critical. Recently, to protect the ownership of an image dataset, imperceptible watermarking techniques are used to store ownership information (i.e., watermark) into the individual image samples. Embedding the entire watermark into all samples leads to significant redundancy in the embedded information which damages the watermarked dataset quality and extraction accuracy. In this paper, a multi-segment encoding-decoding method for dataset watermarking (called AMUSE) is proposed to adaptively map the original watermark into a set of shorter sub-messages and vice versa. Our message encoder is an adaptive method that adjusts the length of the sub-messages according to the protection requirements for the target dataset. Existing image watermarking methods are then employed to embed the sub-messages into the original images in the dataset and also to extract them from the watermarked images. Our decoder is then used to reconstruct the original message from the extracted sub-messages. The proposed encoder and decoder are plug-and-play modules that can easily be added to any watermarking method. To this end, extensive experiments are preformed with multiple watermarking solutions which show that applying AMUSE improves the overall message extraction accuracy upto 28% for the same given dataset quality. Furthermore, the image dataset quality is enhanced by a PSNR of approx2 dB on average, while improving the extraction accuracy for one of the tested image watermarking methods.

AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers

Frozen-backbone transfer with Vision Transformers faces two under-addressed issues: optimization instability when adapters are naively inserted into a fixed feature extractor, and the absence of principled guidance for setting adapter capacity. We introduce AdapterTune, which augments each transformer block with a residual low-rank bottleneck whose up-projection is zero-initialized, guaranteeing that the adapted network starts exactly at the pretrained function and eliminates early-epoch representation drift. On the analytical side, we formalize adapter rank as a capacity budget for approximating downstream task shifts in feature space. The resulting excess-risk decomposition predicts monotonic but diminishing accuracy gains with increasing rank, an ``elbow'' behavior we confirm through controlled sweeps. We evaluate on 9 datasets and 3 backbone scales with multi-seed reporting throughout. On a core 5 dataset transfer suite, AdapterTune improves top-1 accuracy over head-only transfer by +14.9 points on average while training only 0.92 of the parameters required by full fine-tuning, and outperforms full fine-tuning on 10 of 15 dataset-backbone pairs. Across the full benchmark, AdapterTune improves over head-only transfer on every dataset-backbone pair tested. Ablations on rank, placement, and initialization isolate each design choice. The code is available at: https://github.com/salimkhazem/adaptertune

Talan Talan
·
Mar 15 2

Feature Coding in the Era of Large Models: Dataset, Test Conditions, and Benchmark

Large models have achieved remarkable performance across various tasks, yet they incur significant computational costs and privacy concerns during both training and inference. Distributed deployment has emerged as a potential solution, but it necessitates the exchange of intermediate information between model segments, with feature representations serving as crucial information carriers. To optimize information exchange, feature coding methods are applied to reduce transmission and storage overhead. Despite its importance, feature coding for large models remains an under-explored area. In this paper, we draw attention to large model feature coding and make three contributions to this field. First, we introduce a comprehensive dataset encompassing diverse features generated by three representative types of large models. Second, we establish unified test conditions, enabling standardized evaluation pipelines and fair comparisons across future feature coding studies. Third, we introduce two baseline methods derived from widely used image coding techniques and benchmark their performance on the proposed dataset. These contributions aim to advance the field of feature coding, facilitating more efficient large model deployment. All source code and the dataset are now available at https://github.com/chansongoal/FCM-LM/tree/master{https://github.com/chansongoal/FCM-LM/tree/master}.

  • 6 authors
·
Dec 5, 2024

An Automatic SOAP Classification System Using Weakly Supervision And Transfer Learning

In this paper, we introduce a comprehensive framework for developing a machine learning-based SOAP (Subjective, Objective, Assessment, and Plan) classification system without manually SOAP annotated training data or with less manually SOAP annotated training data. The system is composed of the following two parts: 1) Data construction, 2) A neural network-based SOAP classifier, and 3) Transfer learning framework. In data construction, since a manual construction of a large size training dataset is expensive, we propose a rule-based weak labeling method utilizing the structured information of an EHR note. Then, we present a SOAP classifier composed of a pre-trained language model and bi-directional long-short term memory with conditional random field (Bi-LSTM-CRF). Finally, we propose a transfer learning framework that re-uses the trained parameters of the SOAP classifier trained with the weakly labeled dataset for datasets collected from another hospital. The proposed weakly label-based learning model successfully performed SOAP classification (89.99 F1-score) on the notes collected from the target hospital. Otherwise, in the notes collected from other hospitals and departments, the performance dramatically decreased. Meanwhile, we verified that the transfer learning framework is advantageous for inter-hospital adaptation of the model increasing the models' performance in every cases. In particular, the transfer learning approach was more efficient when the manually annotated data size was smaller. We showed that SOAP classification models trained with our weakly labeling algorithm can perform SOAP classification without manually annotated data on the EHR notes from the same hospital. The transfer learning framework helps SOAP classification model's inter-hospital migration with a minimal size of the manually annotated dataset.

  • 3 authors
·
Nov 26, 2022

Towards a Benchmark for Colorectal Cancer Segmentation in Endorectal Ultrasound Videos: Dataset and Model Development

Endorectal ultrasound (ERUS) is an important imaging modality that provides high reliability for diagnosing the depth and boundary of invasion in colorectal cancer. However, the lack of a large-scale ERUS dataset with high-quality annotations hinders the development of automatic ultrasound diagnostics. In this paper, we collected and annotated the first benchmark dataset that covers diverse ERUS scenarios, i.e. colorectal cancer segmentation, detection, and infiltration depth staging. Our ERUS-10K dataset comprises 77 videos and 10,000 high-resolution annotated frames. Based on this dataset, we further introduce a benchmark model for colorectal cancer segmentation, named the Adaptive Sparse-context TRansformer (ASTR). ASTR is designed based on three considerations: scanning mode discrepancy, temporal information, and low computational complexity. For generalizing to different scanning modes, the adaptive scanning-mode augmentation is proposed to convert between raw sector images and linear scan ones. For mining temporal information, the sparse-context transformer is incorporated to integrate inter-frame local and global features. For reducing computational complexity, the sparse-context block is introduced to extract contextual features from auxiliary frames. Finally, on the benchmark dataset, the proposed ASTR model achieves a 77.6% Dice score in rectal cancer segmentation, largely outperforming previous state-of-the-art methods.

  • 10 authors
·
Aug 19, 2024

TransDAE: Dual Attention Mechanism in a Hierarchical Transformer for Efficient Medical Image Segmentation

In healthcare, medical image segmentation is crucial for accurate disease diagnosis and the development of effective treatment strategies. Early detection can significantly aid in managing diseases and potentially prevent their progression. Machine learning, particularly deep convolutional neural networks, has emerged as a promising approach to addressing segmentation challenges. Traditional methods like U-Net use encoding blocks for local representation modeling and decoding blocks to uncover semantic relationships. However, these models often struggle with multi-scale objects exhibiting significant variations in texture and shape, and they frequently fail to capture long-range dependencies in the input data. Transformers designed for sequence-to-sequence predictions have been proposed as alternatives, utilizing global self-attention mechanisms. Yet, they can sometimes lack precise localization due to insufficient granular details. To overcome these limitations, we introduce TransDAE: a novel approach that reimagines the self-attention mechanism to include both spatial and channel-wise associations across the entire feature space, while maintaining computational efficiency. Additionally, TransDAE enhances the skip connection pathway with an inter-scale interaction module, promoting feature reuse and improving localization accuracy. Remarkably, TransDAE outperforms existing state-of-the-art methods on the Synaps multi-organ dataset, even without relying on pre-trained weights.

  • 3 authors
·
Sep 3, 2024

Few-shot learning for COVID-19 Chest X-Ray Classification with Imbalanced Data: An Inter vs. Intra Domain Study

Medical image datasets are essential for training models used in computer-aided diagnosis, treatment planning, and medical research. However, some challenges are associated with these datasets, including variability in data distribution, data scarcity, and transfer learning issues when using models pre-trained from generic images. This work studies the effect of these challenges at the intra- and inter-domain level in few-shot learning scenarios with severe data imbalance. For this, we propose a methodology based on Siamese neural networks in which a series of techniques are integrated to mitigate the effects of data scarcity and distribution imbalance. Specifically, different initialization and data augmentation methods are analyzed, and four adaptations to Siamese networks of solutions to deal with imbalanced data are introduced, including data balancing and weighted loss, both separately and combined, and with a different balance of pairing ratios. Moreover, we also assess the inference process considering four classifiers, namely Histogram, kNN, SVM, and Random Forest. Evaluation is performed on three chest X-ray datasets with annotated cases of both positive and negative COVID-19 diagnoses. The accuracy of each technique proposed for the Siamese architecture is analyzed separately and their results are compared to those obtained using equivalent methods on a state-of-the-art CNN. We conclude that the introduced techniques offer promising improvements over the baseline in almost all cases, and that the selection of the technique may vary depending on the amount of data available and the level of imbalance.

  • 4 authors
·
Jan 17, 2024

Expanding Small-Scale Datasets with Guided Imagination

The power of DNNs relies heavily on the quantity and quality of training data. However, collecting and annotating data on a large scale is often expensive and time-consuming. To address this issue, we explore a new task, termed dataset expansion, aimed at expanding a ready-to-use small dataset by automatically creating new labeled samples. To this end, we present a Guided Imagination Framework (GIF) that leverages cutting-edge generative models like DALL-E2 and Stable Diffusion (SD) to "imagine" and create informative new data from the input seed data. Specifically, GIF conducts data imagination by optimizing the latent features of the seed data in the semantically meaningful space of the prior model, resulting in the creation of photo-realistic images with new content. To guide the imagination towards creating informative samples for model training, we introduce two key criteria, i.e., class-maintained information boosting and sample diversity promotion. These criteria are verified to be essential for effective dataset expansion: GIF-SD obtains 13.5% higher model accuracy on natural image datasets than unguided expansion with SD. With these essential criteria, GIF successfully expands small datasets in various scenarios, boosting model accuracy by 36.9% on average over six natural image datasets and by 13.5% on average over three medical datasets. The source code is available at https://github.com/Vanint/DatasetExpansion.

  • 5 authors
·
Nov 25, 2022

Informative Data Mining for One-Shot Cross-Domain Semantic Segmentation

Contemporary domain adaptation offers a practical solution for achieving cross-domain transfer of semantic segmentation between labeled source data and unlabeled target data. These solutions have gained significant popularity; however, they require the model to be retrained when the test environment changes. This can result in unbearable costs in certain applications due to the time-consuming training process and concerns regarding data privacy. One-shot domain adaptation methods attempt to overcome these challenges by transferring the pre-trained source model to the target domain using only one target data. Despite this, the referring style transfer module still faces issues with computation cost and over-fitting problems. To address this problem, we propose a novel framework called Informative Data Mining (IDM) that enables efficient one-shot domain adaptation for semantic segmentation. Specifically, IDM provides an uncertainty-based selection criterion to identify the most informative samples, which facilitates quick adaptation and reduces redundant training. We then perform a model adaptation method using these selected samples, which includes patch-wise mixing and prototype-based information maximization to update the model. This approach effectively enhances adaptation and mitigates the overfitting problem. In general, we provide empirical evidence of the effectiveness and efficiency of IDM. Our approach outperforms existing methods and achieves a new state-of-the-art one-shot performance of 56.7\%/55.4\% on the GTA5/SYNTHIA to Cityscapes adaptation tasks, respectively. The code will be released at https://github.com/yxiwang/IDM.

  • 6 authors
·
Sep 25, 2023

Zero-shot Cross-lingual Transfer Learning with Multiple Source and Target Languages for Information Extraction: Language Selection and Adversarial Training

The majority of previous researches addressing multi-lingual IE are limited to zero-shot cross-lingual single-transfer (one-to-one) setting, with high-resource languages predominantly as source training data. As a result, these works provide little understanding and benefit for the realistic goal of developing a multi-lingual IE system that can generalize to as many languages as possible. Our study aims to fill this gap by providing a detailed analysis on Cross-Lingual Multi-Transferability (many-to-many transfer learning), for the recent IE corpora that cover a diverse set of languages. Specifically, we first determine the correlation between single-transfer performance and a wide range of linguistic-based distances. From the obtained insights, a combined language distance metric can be developed that is not only highly correlated but also robust across different tasks and model scales. Next, we investigate the more general zero-shot multi-lingual transfer settings where multiple languages are involved in the training and evaluation processes. Language clustering based on the newly defined distance can provide directions for achieving the optimal cost-performance trade-off in data (languages) selection problem. Finally, a relational-transfer setting is proposed to further incorporate multi-lingual unlabeled data based on adversarial training using the relation induced from the above linguistic distance.

  • 2 authors
·
Nov 13, 2024

Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP

Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image pre-training) or Flamingo, but little is known about the dataset creation processes. In this work, we introduce a testbed of six publicly available data sources - YFCC, LAION, Conceptual Captions, WIT, RedCaps, Shutterstock - to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the pre-training data varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting, where combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design. Code is available at https://github.com/mlfoundations/clip_quality_not_quantity.

  • 5 authors
·
Aug 10, 2022

Dataset Distillation via Committee Voting

Dataset distillation aims to synthesize a smaller, representative dataset that preserves the essential properties of the original data, enabling efficient model training with reduced computational resources. Prior work has primarily focused on improving the alignment or matching process between original and synthetic data, or on enhancing the efficiency of distilling large datasets. In this work, we introduce {bf C}ommittee {bf V}oting for {bf D}ataset {bf D}istillation (CV-DD), a novel and orthogonal approach that leverages the collective wisdom of multiple models or experts to create high-quality distilled datasets. We start by showing how to establish a strong baseline that already achieves state-of-the-art accuracy through leveraging recent advancements and thoughtful adjustments in model design and optimization processes. By integrating distributions and predictions from a committee of models while generating high-quality soft labels, our method captures a wider spectrum of data features, reduces model-specific biases and the adverse effects of distribution shifts, leading to significant improvements in generalization. This voting-based strategy not only promotes diversity and robustness within the distilled dataset but also significantly reduces overfitting, resulting in improved performance on post-eval tasks. Extensive experiments across various datasets and IPCs (images per class) demonstrate that Committee Voting leads to more reliable and adaptable distilled data compared to single/multi-model distillation methods, demonstrating its potential for efficient and accurate dataset distillation. Code is available at: https://github.com/Jiacheng8/CV-DD.

  • 6 authors
·
Jan 13, 2025

FYI: Flip Your Images for Dataset Distillation

Dataset distillation synthesizes a small set of images from a large-scale real dataset such that synthetic and real images share similar behavioral properties (e.g, distributions of gradients or features) during a training process. Through extensive analyses on current methods and real datasets, together with empirical observations, we provide in this paper two important things to share for dataset distillation. First, object parts that appear on one side of a real image are highly likely to appear on the opposite side of another image within a dataset, which we call the bilateral equivalence. Second, the bilateral equivalence enforces synthetic images to duplicate discriminative parts of objects on both the left and right sides of the images, limiting the recognition of subtle differences between objects. To address this problem, we introduce a surprisingly simple yet effective technique for dataset distillation, dubbed FYI, that enables distilling rich semantics of real images into synthetic ones. To this end, FYI embeds a horizontal flipping technique into distillation processes, mitigating the influence of the bilateral equivalence, while capturing more details of objects. Experiments on CIFAR-10/100, Tiny-ImageNet, and ImageNet demonstrate that FYI can be seamlessly integrated into several state-of-the-art methods, without modifying training objectives and network architectures, and it improves the performance remarkably.

  • 4 authors
·
Jul 10, 2024

Extending the WILDS Benchmark for Unsupervised Adaptation

Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data and can often be obtained from distributions beyond the source distribution as well. However, existing distribution shift benchmarks with unlabeled data do not reflect the breadth of scenarios that arise in real-world applications. In this work, we present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data that would be realistically obtainable in deployment. These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). The update maintains consistency with the original WILDS benchmark by using identical labeled training, validation, and test sets, as well as the evaluation metrics. On these datasets, we systematically benchmark state-of-the-art methods that leverage unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on WILDS is limited. To facilitate method development and evaluation, we provide an open-source package that automates data loading and contains all of the model architectures and methods used in this paper. Code and leaderboards are available at https://wilds.stanford.edu.

  • 20 authors
·
Dec 9, 2021

Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources

Traditionally, data selection has been studied in settings where all samples from prospective sources are fully revealed to a machine learning developer. However, in practical data exchange scenarios, data providers often reveal only a limited subset of samples before an acquisition decision is made. Recently, there have been efforts to fit scaling laws that predict model performance at any size and data source composition using the limited available samples. However, these scaling functions are black-box, computationally expensive to fit, highly susceptible to overfitting, or/and difficult to optimize for data selection. This paper proposes a framework called <projektor>, which predicts model performance and supports data selection decisions based on partial samples of prospective data sources. Our approach distinguishes itself from existing work by introducing a novel *two-stage* performance inference process. In the first stage, we leverage the Optimal Transport distance to predict the model's performance for any data mixture ratio within the range of disclosed data sizes. In the second stage, we extrapolate the performance to larger undisclosed data sizes based on a novel parameter-free mapping technique inspired by neural scaling laws. We further derive an efficient gradient-based method to select data sources based on the projected model performance. Evaluation over a diverse range of applications demonstrates that <projektor> significantly improves existing performance scaling approaches in terms of both the accuracy of performance inference and the computation costs associated with constructing the performance predictor. Also, <projektor> outperforms by a wide margin in data selection effectiveness compared to a range of other off-the-shelf solutions.

  • 4 authors
·
Jul 5, 2023

Google Landmarks Dataset v2 -- A Large-Scale Benchmark for Instance-Level Recognition and Retrieval

While image retrieval and instance recognition techniques are progressing rapidly, there is a need for challenging datasets to accurately measure their performance -- while posing novel challenges that are relevant for practical applications. We introduce the Google Landmarks Dataset v2 (GLDv2), a new benchmark for large-scale, fine-grained instance recognition and image retrieval in the domain of human-made and natural landmarks. GLDv2 is the largest such dataset to date by a large margin, including over 5M images and 200k distinct instance labels. Its test set consists of 118k images with ground truth annotations for both the retrieval and recognition tasks. The ground truth construction involved over 800 hours of human annotator work. Our new dataset has several challenging properties inspired by real world applications that previous datasets did not consider: An extremely long-tailed class distribution, a large fraction of out-of-domain test photos and large intra-class variability. The dataset is sourced from Wikimedia Commons, the world's largest crowdsourced collection of landmark photos. We provide baseline results for both recognition and retrieval tasks based on state-of-the-art methods as well as competitive results from a public challenge. We further demonstrate the suitability of the dataset for transfer learning by showing that image embeddings trained on it achieve competitive retrieval performance on independent datasets. The dataset images, ground-truth and metric scoring code are available at https://github.com/cvdfoundation/google-landmark.

  • 4 authors
·
Apr 3, 2020

Scale Efficient Training for Large Datasets

The rapid growth of dataset scales has been a key driver in advancing deep learning research. However, as dataset scale increases, the training process becomes increasingly inefficient due to the presence of low-value samples, including excessive redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model improvement.To address this challenge, we propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time. To remove low-value samples, SeTa first performs random pruning to eliminate redundant samples, then clusters the remaining samples according to their learning difficulty measured by loss. Building upon this clustering, a sliding window strategy is employed to progressively remove both overly challenging and inefficient easy clusters following an easy-to-hard curriculum.We conduct extensive experiments on large-scale synthetic datasets, including ToCa, SS1M, and ST+MJ, each containing over 3 million samples.SeTa reduces training costs by up to 50\% while maintaining or improving performance, with minimal degradation even at 70\% cost reduction. Furthermore, experiments on various scale real datasets across various backbones (CNNs, Transformers, and Mambas) and diverse tasks (instruction tuning, multi-view stereo, geo-localization, composed image retrieval, referring image segmentation) demonstrate the powerful effectiveness and universality of our approach. Code is available at https://github.com/mrazhou/SeTa.

  • 3 authors
·
Mar 17, 2025

A Unified Statistical And Computational Framework For Ex-Post Harmonisation Of Aggregate Statistics

Ex-post harmonisation is one of many data preprocessing processes used to combine the increasingly vast and diverse sources of data available for research and analysis. Documenting provenance and ensuring the quality of multi-source datasets is vital for ensuring trustworthy scientific research and encouraging reuse of existing harmonisation efforts. However, capturing and communicating statistically relevant properties of harmonised datasets is difficult without a universal standard for describing harmonisation operations. Our paper combines mathematical and computer science perspectives to address this need. The Crossmaps Framework defines a new approach for transforming existing variables collected under a specific measurement or classification standard to an imputed counterfactual variable indexed by some target standard. It uses computational graphs to separate intended transformation logic from actual data transformations, and avoid the risk of syntactically valid data manipulation scripts resulting in statistically questionable data. In this paper, we introduce the Crossmaps Framework through the example of ex-post harmonisation of aggregated statistics in the social sciences. We define a new provenance task abstraction, the crossmap transform, and formalise two associated objects, the shared mass array and the crossmap. We further define graph, matrix and list encodings of crossmaps and discuss resulting implications for understanding statistical properties of ex-post harmonisation and designing error minimising workflows.

  • 1 authors
·
Jun 20, 2024

DataComp: In search of the next generation of multimodal datasets

Large multimodal datasets have been instrumental in recent breakthroughs such as CLIP, Stable Diffusion, and GPT-4. At the same time, datasets rarely receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a benchmark where the training code is fixed and researchers innovate by proposing new training sets. We provide a testbed for dataset experiments centered around a new candidate pool of 12.8B image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing on 38 downstream test sets. Our benchmark consists of multiple scales, with four candidate pool sizes and associated compute budgets ranging from 12.8M to 12.8B samples seen during training. This multi-scale design facilitates the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow is a promising way of improving multimodal datasets. We introduce DataComp-1B, a dataset created by applying a simple filtering algorithm to the 12.8B candidate pool. The resulting 1.4B subset enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet. Our new ViT-L/14 model outperforms a larger ViT-g/14 trained on LAION-2B by 0.7 percentage points while requiring 9x less training compute. We also outperform OpenAI's CLIP ViT-L/14 by 3.7 percentage points, which is trained with the same compute budget as our model. These gains highlight the potential for improving model performance by carefully curating training sets. We view DataComp-1B as only the first step and hope that DataComp paves the way toward the next generation of multimodal datasets.

  • 34 authors
·
Apr 27, 2023

tasksource: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation

The HuggingFace Datasets Hub hosts thousands of datasets. This provides exciting opportunities for language model training and evaluation. However, the datasets for a given type of task are stored with different schemas, and harmonization is harder than it seems (https://xkcd.com/927/). Multi-task training or evaluation requires manual work to fit data into task templates. Various initiatives independently address this problem by releasing the harmonized datasets or harmonization codes to preprocess datasets to the same format. We identify patterns across previous preprocessings, e.g. mapping of column names, and extraction of a specific sub-field from structured data in a column, and propose a structured annotation framework that makes our annotations fully exposed and not buried in unstructured code. We release a dataset annotation framework and dataset annotations for more than 400 English tasks (https://github.com/sileod/tasksource). These annotations provide metadata, like the name of the columns that should be used as input or labels for all datasets, and can save time for future dataset preprocessings, even if they do not use our framework. We fine-tune a multi-task text encoder on all tasksource tasks, outperforming every publicly available text encoder of comparable size on an external evaluation https://hf.co/sileod/deberta-v3-base-tasksource-nli.

  • 1 authors
·
Jan 14, 2023

Efficient Dataset Distillation through Alignment with Smooth and High-Quality Expert Trajectories

Training a large and state-of-the-art machine learning model typically necessitates the use of large-scale datasets, which, in turn, makes the training and parameter-tuning process expensive and time-consuming. Some researchers opt to distil information from real-world datasets into tiny and compact synthetic datasets while maintaining their ability to train a well-performing model, hence proposing a data-efficient method known as Dataset Distillation (DD). Despite recent progress in this field, existing methods still underperform and cannot effectively replace large datasets. In this paper, unlike previous methods that focus solely on improving the efficacy of student distillation, we are the first to recognize the important interplay between expert and student. We argue the significant impact of expert smoothness when employing more potent expert trajectories in subsequent dataset distillation. Based on this, we introduce the integration of clipping loss and gradient penalty to regulate the rate of parameter changes in expert trajectories. Furthermore, in response to the sensitivity exhibited towards randomly initialized variables during distillation, we propose representative initialization for synthetic dataset and balanced inner-loop loss. Finally, we present two enhancement strategies, namely intermediate matching loss and weight perturbation, to mitigate the potential occurrence of cumulative errors. We conduct extensive experiments on datasets of different scales, sizes, and resolutions. The results demonstrate that the proposed method significantly outperforms prior methods.

  • 3 authors
·
Oct 16, 2023

Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the thrive of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembling of such medical datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large-scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.

Noise May Contain Transferable Knowledge: Understanding Semi-supervised Heterogeneous Domain Adaptation from an Empirical Perspective

Semi-supervised heterogeneous domain adaptation (SHDA) addresses learning across domains with distinct feature representations and distributions, where source samples are labeled while most target samples are unlabeled, with only a small fraction labeled. Moreover, there is no one-to-one correspondence between source and target samples. Although various SHDA methods have been developed to tackle this problem, the nature of the knowledge transferred across heterogeneous domains remains unclear. This paper delves into this question from an empirical perspective. We conduct extensive experiments on about 330 SHDA tasks, employing two supervised learning methods and seven representative SHDA methods. Surprisingly, our observations indicate that both the category and feature information of source samples do not significantly impact the performance of the target domain. Additionally, noise drawn from simple distributions, when used as source samples, may contain transferable knowledge. Based on this insight, we perform a series of experiments to uncover the underlying principles of transferable knowledge in SHDA. Specifically, we design a unified Knowledge Transfer Framework (KTF) for SHDA. Based on the KTF, we find that the transferable knowledge in SHDA primarily stems from the transferability and discriminability of the source domain. Consequently, ensuring those properties in source samples, regardless of their origin (e.g., image, text, noise), can enhance the effectiveness of knowledge transfer in SHDA tasks. The codes and datasets are available at https://github.com/yyyaoyuan/SHDA.

  • 5 authors
·
Feb 19, 2025 2

UniBioTransfer: A Unified Framework for Multiple Biometrics Transfer

Deepface generation has traditionally followed a task-driven paradigm, where distinct tasks (e.g., face transfer and hair transfer) are addressed by task-specific models. Nevertheless, this single-task setting severely limits model generalization and scalability. A unified model capable of solving multiple deepface generation tasks in a single pass represents a promising and practical direction, yet remains challenging due to data scarcity and cross-task conflicts arising from heterogeneous attribute transformations. To this end, we propose UniBioTransfer, the first unified framework capable of handling both conventional deepface tasks (e.g., face transfer and face reenactment) and shape-varying transformations (e.g., hair transfer and head transfer). Besides, UniBioTransfer naturally generalizes to unseen tasks, like lip, eye, and glasses transfer, with minimal fine-tuning. Generally, UniBioTransfer addresses data insufficiency in multi-task generation through a unified data construction strategy, including a swapping-based corruption mechanism designed for spatially dynamic attributes like hair. It further mitigates cross-task interference via an innovative BioMoE, a mixture-of-experts based model coupled with a novel two-stage training strategy that effectively disentangles task-specific knowledge. Extensive experiments demonstrate the effectiveness, generalization, and scalability of UniBioTransfer, outperforming both existing unified models and task-specific methods across a wide range of deepface generation tasks. Project page is at https://scy639.github.io/UniBioTransfer.github.io/

  • 8 authors
·
Mar 20

DMGD: Train-Free Dataset Distillation with Semantic-Distribution Matching in Diffusion Models

Dataset distillation enables efficient training by distilling the information of large-scale datasets into significantly smaller synthetic datasets. Diffusion based paradigms have emerged in recent years, offering novel perspectives for dataset distillation. However, they typically necessitate additional fine-tuning stages, and effective guidance mechanisms remain underexplored. To address these limitations, we rethink diffusion based dataset distillation and propose a Dual Matching Guided Diffusion (DMGD) framework, centered on efficient training-free guidance. We first establish Semantic Matching via conditional likelihood optimization, eliminating the need for auxiliary classifiers. Furthermore, we propose a dynamic guidance mechanism that enhances the diversity of synthetic data while maintaining semantic alignment. Simultaneously, we introduce an optimal transport (OT) based Distribution Matching approach to further align with the target distribution structure. To ensure efficiency, we develop two enhanced strategies for diffusion based framework: Distribution Approximate Matching and Greedy Progressive Matching. These strategies enable effective distribution matching guidance with minimal computational overhead. Experimental results on ImageNet-Woof, ImageNet-Nette, and ImageNet-1K demonstrate that our training-free approach achieves significant improvements, outperforming state-of-the-art (SOTA) methods requiring additional fine-tuning by average accuracy gains of 2.1%, 5.4%, and 2.4%, respectively.

  • 5 authors
·
May 4

AutoTune: Automatically Tuning Convolutional Neural Networks for Improved Transfer Learning

Transfer learning enables solving a specific task having limited data by using the pre-trained deep networks trained on large-scale datasets. Typically, while transferring the learned knowledge from source task to the target task, the last few layers are fine-tuned (re-trained) over the target dataset. However, these layers are originally designed for the source task that might not be suitable for the target task. In this paper, we introduce a mechanism for automatically tuning the Convolutional Neural Networks (CNN) for improved transfer learning. The pre-trained CNN layers are tuned with the knowledge from target data using Bayesian Optimization. First, we train the final layer of the base CNN model by replacing the number of neurons in the softmax layer with the number of classes involved in the target task. Next, the pre-trained CNN is tuned automatically by observing the classification performance on the validation data (greedy criteria). To evaluate the performance of the proposed method, experiments are conducted on three benchmark datasets, e.g., CalTech-101, CalTech-256, and Stanford Dogs. The classification results obtained through the proposed AutoTune method outperforms the standard baseline transfer learning methods over the three datasets by achieving 95.92%, 86.54%, and 84.67% accuracy over CalTech-101, CalTech-256, and Stanford Dogs, respectively. The experimental results obtained in this study depict that tuning of the pre-trained CNN layers with the knowledge from the target dataset confesses better transfer learning ability. The source codes are available at https://github.com/JekyllAndHyde8999/AutoTune_CNN_TransferLearning.

  • 5 authors
·
Apr 25, 2020

Learnability-Guided Diffusion for Dataset Distillation

Training machine learning models on massive datasets is expensive and time-consuming. Dataset distillation addresses this by creating a small synthetic dataset that achieves the same performance as the full dataset. Recent methods use diffusion models to generate distilled data, either by promoting diversity or matching training gradients. However, existing approaches produce redundant training signals, where samples convey overlapping information. Empirically, disjoint subsets of distilled datasets capture 80-90% overlapping signals. This redundancy stems from optimizing visual diversity or average training dynamics without accounting for similarity across samples, leading to datasets where multiple samples share similar information rather than complementary knowledge. We propose learnability-driven dataset distillation, which constructs synthetic datasets incrementally through successive stages. Starting from a small set, we train a model and generate new samples guided by learnability scores that identify what the current model can learn from, creating an adaptive curriculum. We introduce Learnability-Guided Diffusion (LGD), which balances training utility for the current model with validity under a reference model to generate curriculum-aligned samples. Our approach reduces redundancy by 39.1%, promotes specialization across training stages, and achieves state-of-the-art results on ImageNet-1K (60.1%), ImageNette (87.2%), and ImageWoof (72.9%). Our code is available on our project page https://jachansantiago.github.io/learnability-guided-distillation/.

  • 2 authors
·
Mar 31

Neural Organ Transplantation (NOT): Checkpoint-Based Modular Adaptation for Transformer Models

We introduce Neural Organ Transplantation (NOT), a modular adaptation framework that enables trained transformer layers to function as reusable transferable checkpoints for domain adaptation. Unlike conventional fine-tuning approaches that tightly couple trained parameters to specific model instances and training data, NOT extracts contiguous layer subsets ("donor organs") from pre-trained models, trains them independently on domain-specific data, and saves them as standalone checkpoint files that can be transplanted into compatible recipient models without access to the original training data. Through experiments on three decoder-only transformer architectures spanning 124M to 20B parameters (GPT-2, TinyLlama, and GPT-OSS), we demonstrate that donor transplantation substantially outperforms existing adaptation methods, achieving an order-of-magnitude improvement in perplexity over LoRA while training significantly faster. The method exhibits position dependence, with early insertion positions yielding optimal results. Cross-domain transfer at billion-parameter scale reveals unexpected regularization benefits. These findings demonstrate that transformer middle layers can support efficient modular transfer for decoder-only architectures, enabling privacy-preserving expertise sharing through checkpoint distribution. We note that this approach is currently limited to decoder-only models; preliminary experiments on encoder-based architectures show reduced effectiveness.

  • 1 authors
·
Jan 19

Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching

The ultimate goal of Dataset Distillation is to synthesize a small synthetic dataset such that a model trained on this synthetic set will perform equally well as a model trained on the full, real dataset. Until now, no method of Dataset Distillation has reached this completely lossless goal, in part due to the fact that previous methods only remain effective when the total number of synthetic samples is extremely small. Since only so much information can be contained in such a small number of samples, it seems that to achieve truly loss dataset distillation, we must develop a distillation method that remains effective as the size of the synthetic dataset grows. In this work, we present such an algorithm and elucidate why existing methods fail to generate larger, high-quality synthetic sets. Current state-of-the-art methods rely on trajectory-matching, or optimizing the synthetic data to induce similar long-term training dynamics as the real data. We empirically find that the training stage of the trajectories we choose to match (i.e., early or late) greatly affects the effectiveness of the distilled dataset. Specifically, early trajectories (where the teacher network learns easy patterns) work well for a low-cardinality synthetic set since there are fewer examples wherein to distribute the necessary information. Conversely, late trajectories (where the teacher network learns hard patterns) provide better signals for larger synthetic sets since there are now enough samples to represent the necessary complex patterns. Based on our findings, we propose to align the difficulty of the generated patterns with the size of the synthetic dataset. In doing so, we successfully scale trajectory matching-based methods to larger synthetic datasets, achieving lossless dataset distillation for the very first time. Code and distilled datasets are available at https://gzyaftermath.github.io/DATM.

  • 6 authors
·
Oct 9, 2023

AutoData: A Multi-Agent System for Open Web Data Collection

The exponential growth of data-driven systems and AI technologies has intensified the demand for high-quality web-sourced datasets. While existing datasets have proven valuable, conventional web data collection approaches face significant limitations in terms of human effort and scalability. Current data-collecting solutions fall into two categories: wrapper-based methods that struggle with adaptability and reproducibility, and large language model (LLM)-based approaches that incur substantial computational and financial costs. To address these challenges, we propose AutoData, a novel multi-agent system for Automated web Data collection, that requires minimal human intervention, i.e., only necessitating a natural language instruction specifying the desired dataset. In addition, AutoData is designed with a robust multi-agent architecture, featuring a novel oriented message hypergraph coordinated by a central task manager, to efficiently organize agents across research and development squads. Besides, we introduce a novel hypergraph cache system to advance the multi-agent collaboration process that enables efficient automated data collection and mitigates the token cost issues prevalent in existing LLM-based systems. Moreover, we introduce Instruct2DS, a new benchmark dataset supporting live data collection from web sources across three domains: academic, finance, and sports. Comprehensive evaluations over Instruct2DS and three existing benchmark datasets demonstrate AutoData's superior performance compared to baseline methods. Case studies on challenging tasks such as picture book collection and paper extraction from surveys further validate its applicability. Our source code and dataset are available at https://github.com/GraphResearcher/AutoData.

  • 12 authors
·
May 21, 2025

MultiEdit: Advancing Instruction-based Image Editing on Diverse and Challenging Tasks

Current instruction-based image editing (IBIE) methods struggle with challenging editing tasks, as both editing types and sample counts of existing datasets are limited. Moreover, traditional dataset construction often contains noisy image-caption pairs, which may introduce biases and limit model capabilities in complex editing scenarios. To address these limitations, we introduce MultiEdit, a comprehensive dataset featuring over 107K high-quality image editing samples. It encompasses 6 challenging editing tasks through a diverse collection of 18 non-style-transfer editing types and 38 style transfer operations, covering a spectrum from sophisticated style transfer to complex semantic operations like person reference editing and in-image text editing. We employ a novel dataset construction pipeline that utilizes two multi-modal large language models (MLLMs) to generate visual-adaptive editing instructions and produce high-fidelity edited images, respectively. Extensive experiments demonstrate that fine-tuning foundational open-source models with our MultiEdit-Train set substantially improves models' performance on sophisticated editing tasks in our proposed MultiEdit-Test benchmark, while effectively preserving their capabilities on the standard editing benchmark. We believe MultiEdit provides a valuable resource for advancing research into more diverse and challenging IBIE capabilities. Our dataset is available at https://huggingface.co/datasets/inclusionAI/MultiEdit.

inclusionAI inclusionAI
·
Sep 18, 2025 2

R2D2: Reducing Redundancy and Duplication in Data Lakes

Enterprise data lakes often suffer from substantial amounts of duplicate and redundant data, with data volumes ranging from terabytes to petabytes. This leads to both increased storage costs and unnecessarily high maintenance costs for these datasets. In this work, we focus on identifying and reducing redundancy in enterprise data lakes by addressing the problem of 'dataset containment'. To the best of our knowledge, this is one of the first works that addresses table-level containment at a large scale. We propose R2D2: a three-step hierarchical pipeline that efficiently identifies almost all instances of containment by progressively reducing the search space in the data lake. It first builds (i) a schema containment graph, followed by (ii) statistical min-max pruning, and finally, (iii) content level pruning. We further propose minimizing the total storage and access costs by optimally identifying redundant datasets that can be deleted (and reconstructed on demand) while respecting latency constraints. We implement our system on Azure Databricks clusters using Apache Spark for enterprise data stored in ADLS Gen2, and on AWS clusters for open-source data. In contrast to existing modified baselines that are inaccurate or take several days to run, our pipeline can process an enterprise customer data lake at the TB scale in approximately 5 hours with high accuracy. We present theoretical results as well as extensive empirical validation on both enterprise (scale of TBs) and open-source datasets (scale of MBs - GBs), which showcase the effectiveness of our pipeline.

  • 7 authors
·
Dec 20, 2023

Federated Learning on Virtual Heterogeneous Data with Local-global Distillation

While Federated Learning (FL) is gaining popularity for training machine learning models in a decentralized fashion, numerous challenges persist, such as asynchronization, computational expenses, data heterogeneity, and gradient and membership privacy attacks. Lately, dataset distillation has emerged as a promising solution for addressing the aforementioned challenges by generating a compact synthetic dataset that preserves a model's training efficacy. However, we discover that using distilled local datasets can amplify the heterogeneity issue in FL. To address this, we propose Federated Learning on Virtual Heterogeneous Data with Local-Global Dataset Distillation (FedLGD), where we seamlessly integrate dataset distillation algorithms into FL pipeline and train FL using a smaller synthetic dataset (referred as virtual data). Specifically, to harmonize the domain shifts, we propose iterative distribution matching to inpaint global information to local virtual data and use federated gradient matching to distill global virtual data that serve as anchor points to rectify heterogeneous local training, without compromising data privacy. We experiment on both benchmark and real-world datasets that contain heterogeneous data from different sources, and further scale up to an FL scenario that contains a large number of clients with heterogeneous and class-imbalanced data. Our method outperforms state-of-the-art heterogeneous FL algorithms under various settings. Our code is available at https://github.com/ubc-tea/FedLGD.

  • 5 authors
·
Mar 3, 2023

A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation

Deep learning in computer vision has achieved great success with the price of large-scale labeled training data. However, exhaustive data annotation is impracticable for each task of all domains of interest, due to high labor costs and unguaranteed labeling accuracy. Besides, the uncontrollable data collection process produces non-IID training and test data, where undesired duplication may exist. All these nuisances may hinder the verification of typical theories and exposure to new findings. To circumvent them, an alternative is to generate synthetic data via 3D rendering with domain randomization. We in this work push forward along this line by doing profound and extensive research on bare supervised learning and downstream domain adaptation. Specifically, under the well-controlled, IID data setting enabled by 3D rendering, we systematically verify the typical, important learning insights, e.g., shortcut learning, and discover the new laws of various data regimes and network architectures in generalization. We further investigate the effect of image formation factors on generalization, e.g., object scale, material texture, illumination, camera viewpoint, and background in a 3D scene. Moreover, we use the simulation-to-reality adaptation as a downstream task for comparing the transferability between synthetic and real data when used for pre-training, which demonstrates that synthetic data pre-training is also promising to improve real test results. Lastly, to promote future research, we develop a new large-scale synthetic-to-real benchmark for image classification, termed S2RDA, which provides more significant challenges for transfer from simulation to reality. The code and datasets are available at https://github.com/huitangtang/On_the_Utility_of_Synthetic_Data.

  • 2 authors
·
Mar 16, 2023

Leveraging Generic Foundation Models for Multimodal Surgical Data Analysis

We investigate how both the adaptation of a generic foundation model via transfer learning and the integration of complementary modalities from the operating room (OR) can support surgical data science. To this end, we use V-JEPA as the single-modality foundation of a multimodal model for minimally invasive surgery support. We analyze how the model's downstream performance can benefit (a) from finetuning on unlabeled surgical video data and (b) from providing additional time-resolved data streams from the OR in a multimodal setup. In an in-house dataset of liver surgery videos, we analyze the tasks of predicting hospital length of stay and postoperative complications. In videos of the public HeiCo dataset, we analyze the task of surgical phase recognition. As a baseline, we apply pretrained V-JEPA to all tasks. We then finetune it on unlabeled, held-out videos to investigate its change in performance after domain adaptation. Following the idea of modular decision support networks, we integrate additional data streams from the OR by training a separate encoder to form a shared representation space with V-JEPA's embeddings. Our experiments show that finetuning on domain-specific data increases model performance. On the in-house data, integrating additional time-resolved data likewise benefits the model. On the HeiCo data, accuracy of the pretrained video-only, single-modality baseline setup is on par with the top-performing submissions of the EndoVis2017 challenge, while finetuning on domain-specific data increases accuracy further. Our results thus demonstrate how surgical data science can leverage public, generic foundation models. Likewise, they indicate the potential of domain adaptation and of integrating suitable complementary data streams from the OR. To support further research, we release our code and model weights at https://github.com/DigitalSurgeryLab-Basel/ML-CDS-2025.

  • 5 authors
·
Sep 8, 2025

SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification

Data curation is the problem of how to collect and organize samples into a dataset that supports efficient learning. Despite the centrality of the task, little work has been devoted towards a large-scale, systematic comparison of various curation methods. In this work, we take steps towards a formal evaluation of data curation strategies and introduce SELECT, the first large-scale benchmark of curation strategies for image classification. In order to generate baseline methods for the SELECT benchmark, we create a new dataset, ImageNet++, which constitutes the largest superset of ImageNet-1K to date. Our dataset extends ImageNet with 5 new training-data shifts, each approximately the size of ImageNet-1K itself, and each assembled using a distinct curation strategy. We evaluate our data curation baselines in two ways: (i) using each training-data shift to train identical image classification models from scratch (ii) using the data itself to fit a pretrained self-supervised representation. Our findings show interesting trends, particularly pertaining to recent methods for data curation such as synthetic data generation and lookup based on CLIP embeddings. We show that although these strategies are highly competitive for certain tasks, the curation strategy used to assemble the original ImageNet-1K dataset remains the gold standard. We anticipate that our benchmark can illuminate the path for new methods to further reduce the gap. We release our checkpoints, code, documentation, and a link to our dataset at https://github.com/jimmyxu123/SELECT.

  • 6 authors
·
Oct 7, 2024 2

PTMTorrent: A Dataset for Mining Open-source Pre-trained Model Packages

Due to the cost of developing and training deep learning models from scratch, machine learning engineers have begun to reuse pre-trained models (PTMs) and fine-tune them for downstream tasks. PTM registries known as "model hubs" support engineers in distributing and reusing deep learning models. PTM packages include pre-trained weights, documentation, model architectures, datasets, and metadata. Mining the information in PTM packages will enable the discovery of engineering phenomena and tools to support software engineers. However, accessing this information is difficult - there are many PTM registries, and both the registries and the individual packages may have rate limiting for accessing the data. We present an open-source dataset, PTMTorrent, to facilitate the evaluation and understanding of PTM packages. This paper describes the creation, structure, usage, and limitations of the dataset. The dataset includes a snapshot of 5 model hubs and a total of 15,913 PTM packages. These packages are represented in a uniform data schema for cross-hub mining. We describe prior uses of this data and suggest research opportunities for mining using our dataset. The PTMTorrent dataset (v1) is available at: https://app.globus.org/file-manager?origin_id=55e17a6e-9d8f-11ed-a2a2-8383522b48d9&origin_path=%2F~%2F. Our dataset generation tools are available on GitHub: https://doi.org/10.5281/zenodo.7570357.

  • 8 authors
·
Mar 15, 2023

Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face

Advances in machine learning are closely tied to the creation of datasets. While data documentation is widely recognized as essential to the reliability, reproducibility, and transparency of ML, we lack a systematic empirical understanding of current dataset documentation practices. To shed light on this question, here we take Hugging Face -- one of the largest platforms for sharing and collaborating on ML models and datasets -- as a prominent case study. By analyzing all 7,433 dataset documentation on Hugging Face, our investigation provides an overview of the Hugging Face dataset ecosystem and insights into dataset documentation practices, yielding 5 main findings: (1) The dataset card completion rate shows marked heterogeneity correlated with dataset popularity. (2) A granular examination of each section within the dataset card reveals that the practitioners seem to prioritize Dataset Description and Dataset Structure sections, while the Considerations for Using the Data section receives the lowest proportion of content. (3) By analyzing the subsections within each section and utilizing topic modeling to identify key topics, we uncover what is discussed in each section, and underscore significant themes encompassing both technical and social impacts, as well as limitations within the Considerations for Using the Data section. (4) Our findings also highlight the need for improved accessibility and reproducibility of datasets in the Usage sections. (5) In addition, our human annotation evaluation emphasizes the pivotal role of comprehensive dataset content in shaping individuals' perceptions of a dataset card's overall quality. Overall, our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis and underlines the need for more thorough dataset documentation in machine learning research.

  • 3 authors
·
Jan 24, 2024

Residual Feature Integration is Sufficient to Prevent Negative Transfer

Transfer learning has become a central paradigm in modern machine learning, yet it suffers from the long-standing problem of negative transfer, where leveraging source representations can harm rather than help performance on the target task. Although empirical remedies have been proposed, there remains little theoretical understanding of how to reliably avoid negative transfer. In this paper, we investigate a simple yet remarkably effective strategy: augmenting frozen, pretrained source-side features with a trainable target-side encoder that adapts target features to capture residual signals overlooked by models pretrained on the source data. We show this residual feature integration strategy is sufficient to provably prevent negative transfer, by establishing theoretical guarantees that it has no worse convergence rate than training from scratch under the informative class of target distributions up to logarithmic factors, and that the convergence rate can transition seamlessly from nonparametric to near-parametric when source representations are informative. To our knowledge, this is the first theoretical work that ensures protection against negative transfer. We carry out extensive numerical experiments across image, text and tabular benchmarks, and empirically verify that the method consistently safeguards performance under distribution shift, label noise, semantic perturbation, and class imbalance. We additionally demonstrate that this residual integration mechanism uniquely supports adapt-time multimodality extension, enabling a pretrained single-cell foundation model to incorporate spatial signals for lymph-node anatomical classification despite the source model being trained without them. Our study thus advances the theory of safe transfer learning, and provides a principled approach that is simple, robust, architecture-agnostic, and broadly applicable.

  • 4 authors
·
Feb 13

MedShift: Implicit Conditional Transport for X-Ray Domain Adaptation

Synthetic medical data offers a scalable solution for training robust models, but significant domain gaps limit its generalizability to real-world clinical settings. This paper addresses the challenge of cross-domain translation between synthetic and real X-ray images of the head, focusing on bridging discrepancies in attenuation behavior, noise characteristics, and soft tissue representation. We propose MedShift, a unified class-conditional generative model based on Flow Matching and Schrodinger Bridges, which enables high-fidelity, unpaired image translation across multiple domains. Unlike prior approaches that require domain-specific training or rely on paired data, MedShift learns a shared domain-agnostic latent space and supports seamless translation between any pair of domains seen during training. We introduce X-DigiSkull, a new dataset comprising aligned synthetic and real skull X-rays under varying radiation doses, to benchmark domain translation models. Experimental results demonstrate that, despite its smaller model size compared to diffusion-based approaches, MedShift offers strong performance and remains flexible at inference time, as it can be tuned to prioritize either perceptual fidelity or structural consistency, making it a scalable and generalizable solution for domain adaptation in medical imaging. The code and dataset are available at https://caetas.github.io/medshift.html

  • 4 authors
·
Aug 29, 2025