MoDE: CLIP Data Experts via Clustering

arXiv preprint arXiv:2404.16030

Published On 2024/4/24

The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, which makes it less sensitive to false-negative noise in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 models from OpenAI CLIP and OpenCLIP on zero-shot image classification at a lower (less than 35%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at https://github.com/facebookresearch/MetaCLIP/tree/main/mode.
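The inference-time ensembling described in the abstract can be illustrated with a small sketch. This is one plausible reading of the abstract, not the authors' implementation: the function names (`expert_weights`, `ensemble_logits`), the cosine-similarity scoring, the softmax with a `temperature` parameter, and the toy 2-D embeddings are all assumptions made for illustration. Each data expert is represented by a set of fine-grained cluster centers, the task metadata (for example, class-name embeddings) is compared against those centers, and the resulting per-expert weights are used to combine the experts' logits.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expert_weights(task_embeddings, expert_centers, temperature=0.1):
    """Hypothetical routing rule: score each expert by how closely the task
    metadata matches its fine-grained cluster centers, then apply a softmax
    over experts so the weights sum to 1."""
    scores = []
    for centers in expert_centers:
        # For every metadata item, take its best-matching fine-grained center
        # of this expert, then average over all metadata items.
        best = [max(cosine(t, c) for c in centers) for t in task_embeddings]
        scores.append(sum(best) / len(best))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_logits(per_expert_logits, weights):
    """Weighted sum of the experts' logits, one weight per expert."""
    num_classes = len(per_expert_logits[0])
    return [
        sum(w * logits[i] for w, logits in zip(weights, per_expert_logits))
        for i in range(num_classes)
    ]


# Toy example (all values are made up for illustration).
# Expert 0 owns two fine-grained centers near the x-axis, expert 1 owns one
# center on the y-axis; the task metadata lies close to the x-axis.
expert_centers = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0]],
]
task_embeddings = [[1.0, 0.05]]

weights = expert_weights(task_embeddings, expert_centers)
combined = ensemble_logits([[2.0, 0.0], [0.0, 2.0]], weights)
```

In this toy setup the task metadata sits near expert 0's centers, so expert 0 receives most of the weight and its logits dominate the ensemble. Using several fine-grained centers per expert, rather than a single coarse centroid, is what lets the similarity estimate stay precise while the number of experts remains small.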

Authors

Shih-Fu Chang

Columbia University in the City of New York

H-Index

134

Research Interests

Multimedia

Computer Vision

Machine Learning

Signal Processing

Information Retrieval

Luke Zettlemoyer

University of Washington

H-Index

100

Research Interests

Natural Language Processing

Semantics

Machine Learning

Artificial Intelligence

Po-Yao (Bernie) Huang

Carnegie Mellon University

Research Interests

Multimodal machine learning

Multi-modal learning

natural language processing

Jiawei Phoenix MA

Columbia University in the City of New York

Research Interests

Data-Centric AI

De-Centralized AI

Reliable Life-Long Learning

Multi-Modal

Computer Vision

Other Articles from authors

Shih-Fu Chang

Columbia University in the City of New York

arXiv preprint arXiv:2403.18600

RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos

Procedure planning in instructional videos entails generating a sequence of action steps based on visual observations of the initial and target states. Despite the rapid progress in this task, there remain several critical challenges to be solved: (1) Adaptive procedures: Prior works hold an unrealistic assumption that the number of action steps is known and fixed, leading to non-generalizable models in real-world scenarios where the sequence length varies. (2) Temporal relation: Understanding the temporal relations between steps is essential in producing reasonable and executable plans. (3) Annotation cost: Annotating instructional videos with step-level labels (i.e., timestamp) or sequence-level labels (i.e., action category) is demanding and labor-intensive, limiting its generalizability to large-scale datasets. In this work, we propose a new and practical setting, called adaptive procedure planning in instructional videos, where the procedure length is not fixed or pre-determined. To address these challenges, we introduce the Retrieval-Augmented Planner (RAP) model. Specifically, for adaptive procedures, RAP adaptively determines the conclusion of actions using an auto-regressive model architecture. For temporal relation, RAP establishes an external memory module to explicitly retrieve the most relevant state-action pairs from the training videos and revises the generated procedures. To tackle high annotation cost, RAP uses weakly supervised learning to expand the training dataset to other task-relevant, unannotated videos by generating pseudo labels for action steps. Experiments on CrossTask and COIN benchmarks show the …

2024/3/27
