Shih-Fu Chang
Columbia University in the City of New York
H-index: 134
North America-United States
Description
Shih-Fu Chang, a distinguished researcher at Columbia University in the City of New York, specializes in Multimedia, Computer Vision, Machine Learning, Signal Processing, and Information Retrieval, with an exceptional h-index of 134 overall and a recent h-index of 71 (since 2020).
His recent articles reflect a diverse array of research interests and contributions to the field:
RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos
Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities
From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos
MoDE: CLIP Data Experts via Clustering
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Towards fast adaptation of pretrained contrastive models for multi-channel video-language retrieval
What, when, and where?--Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Professor Information
| University | Columbia University in the City of New York |
| --- | --- |
| Position | Professor of Electrical Engineering and Computer Science |
| Citations (all) | 72,959 |
| Citations (since 2020) | 21,353 |
| Cited By | 59,894 |
| h-index (all) | 134 |
| h-index (since 2020) | 71 |
| i10-index (all) | 559 |
| i10-index (since 2020) | 269 |
| University Profile Page | Columbia University in the City of New York |
Research & Interests List
Multimedia
Computer Vision
Machine Learning
Signal Processing
Information Retrieval
Top articles of Shih-Fu Chang
RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos
Procedure planning in instructional videos entails generating a sequence of action steps based on visual observations of the initial and target states. Despite rapid progress on this task, several critical challenges remain: (1) Adaptive procedures: Prior works hold the unrealistic assumption that the number of action steps is known and fixed, leading to non-generalizable models in real-world scenarios where the sequence length varies. (2) Temporal relation: Understanding the temporal relations between steps is essential for producing reasonable and executable plans. (3) Annotation cost: Annotating instructional videos with step-level labels (i.e., timestamps) or sequence-level labels (i.e., action categories) is demanding and labor-intensive, limiting generalizability to large-scale datasets. In this work, we propose a new and practical setting, called adaptive procedure planning in instructional videos, where the procedure length is not fixed or pre-determined. To address these challenges, we introduce the Retrieval-Augmented Planner (RAP) model. Specifically, for adaptive procedures, RAP adaptively determines the conclusion of actions using an auto-regressive model architecture. For temporal relations, RAP establishes an external memory module to explicitly retrieve the most relevant state-action pairs from the training videos and revises the generated procedures. To tackle the high annotation cost, RAP adopts weakly-supervised learning to expand the training dataset to other task-relevant, unannotated videos by generating pseudo labels for action steps. Experiments on CrossTask and COIN benchmarks show the …
Authors
Ali Zare, Yulei Niu, Hammad Ayyubi, Shih-Fu Chang
Journal
arXiv preprint arXiv:2403.18600
Published Date
2024/3/27
Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities
Events describe happenings in our world that are of importance. Naturally, understanding events mentioned in multimedia content and how they are related forms an important way of comprehending our world. Existing literature can infer whether events across textual and visual (video) domains are identical (via grounding) and thus on the same semantic level. However, grounding fails to capture the intricate cross-event relations that exist when the same events are referred to at many semantic levels. For example, in Figure 1, the abstract event of "war" manifests at a lower semantic level through the subevents "tanks firing" (in video) and airplane "shot" (in text), leading to a hierarchical, multimodal relationship between the events. In this paper, we propose the task of extracting event hierarchies from multimodal (video and text) data to capture how the same event manifests itself in different modalities at different semantic levels. This reveals the structure of events and is critical to understanding them. To support research on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve) dataset. Unlike prior video-language datasets, MultiHiEve is composed of news video-article pairs, which makes it rich in event hierarchies. We densely annotate a part of the dataset to construct the test benchmark. We show the limitations of state-of-the-art unimodal and multimodal baselines on this task. Further, we address these limitations via a new weakly supervised model, leveraging only unannotated video-article pairs from MultiHiEve. We perform a thorough evaluation of our proposed method which demonstrates improved performance on this task and …
Authors
Hammad A. Ayyubi, Christopher Thomas, Lovish Chum, Rahul Lokesh, Long Chen, Yulei Niu, Xudong Lin, Xuande Feng, Jaywon Koo, Sounak Ray, Shih-Fu Chang
Published Date
2024/2/24
From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
Data visualization in the form of charts plays a pivotal role in data analysis, offering critical insights and aiding in informed decision-making. Automatic chart understanding has witnessed significant advancements with the rise of large foundation models in recent years. Foundation models, such as large language models (LLMs), have revolutionized various natural language processing (NLP) tasks and are increasingly being applied to chart understanding tasks. This survey paper provides a comprehensive overview of the recent developments, challenges, and future directions in chart understanding within the context of these foundation models. The paper begins by defining chart understanding, outlining problem formulations, and discussing fundamental building blocks crucial for studying chart understanding tasks. In the section on tasks and datasets, we explore various tasks within chart understanding and discuss their evaluation metrics and sources of both charts and textual inputs. Modeling strategies are then examined, encompassing both classification-based and generation-based approaches, along with tool augmentation techniques that enhance chart understanding performance. Furthermore, we discuss the state-of-the-art performance of each task and discuss how we can improve the performance. Challenges and future directions are addressed in a dedicated section, highlighting issues such as domain-specific charts, lack of efforts in evaluation, and agent-oriented settings. This survey paper serves to provide valuable insights and directions for future research in chart understanding leveraging large foundation models. The studies …
Authors
Kung-Hsiang Huang, Hou Pong Chan, Yi R Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, Heng Ji
Journal
arXiv preprint arXiv:2403.12027
Published Date
2024/3/18
SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos
We study the problem of procedure planning in instructional videos, which aims to make a goal-oriented sequence of action steps given partial visual state observations. The motivation of this problem is to learn a structured and plannable state and action space. Recent works succeeded in sequence modeling of steps with only sequence-level annotations accessible during training, but overlooked the roles of states in the procedures. In this work, we point out that State CHangEs MAtter (SCHEMA) for procedure planning in instructional videos. We aim to establish a more structured state space by investigating the causal relations between steps and states in procedures. Specifically, we explicitly represent each step as state changes and track the state changes in procedures. For step representation, we leverage the commonsense knowledge in large language models (LLMs) to describe the state changes of steps via our designed chain-of-thought prompting. For state change tracking, we align visual state observations with language state descriptions via cross-modal contrastive learning, and explicitly model the intermediate states of the procedure using LLM-generated state descriptions. Experiments on the CrossTask, COIN, and NIV benchmark datasets demonstrate that our proposed SCHEMA model achieves state-of-the-art performance and obtains explainable visualizations.
Authors
Yulei Niu, Wenliang Guo, Long Chen, Xudong Lin, Shih-Fu Chang
Journal
arXiv preprint arXiv:2403.01599
Published Date
2024/3/3
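The cross-modal alignment step in the abstract above follows a general recipe: visual state features and language state descriptions of the same moment are pulled together, and mismatched pairs pushed apart, with a symmetric contrastive (InfoNCE-style) loss. A minimal NumPy sketch of that general loss, with toy features standing in for real encoders (all shapes and values here are illustrative, not from the SCHEMA code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: paired visual state features and language state descriptions.
n_pairs, dim, temperature = 6, 16, 0.07
visual = rng.normal(size=(n_pairs, dim))                # visual state features
text = visual + 0.1 * rng.normal(size=(n_pairs, dim))   # paired descriptions

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

v, t = l2_normalize(visual), l2_normalize(text)
logits = (v @ t.T) / temperature     # (n_pairs, n_pairs) similarity matrix

def cross_entropy(logits, labels):
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

# The i-th visual state matches the i-th description; average both directions.
labels = np.arange(n_pairs)
loss = 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

The diagonal of `logits` holds the matched pairs, so minimizing this loss makes matched visual and language states the most similar entries in each row and column.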
MoDE: CLIP Data Experts via Clustering
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, making it less sensitive to false-negative noise in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification, but with less (35%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at https://github.com/facebookresearch/MetaCLIP/tree/main/mode.
Authors
Jiawei Ma, Po-Yao Huang, Saining Xie, Shang-Wen Li, Luke Zettlemoyer, Shih-Fu Chang, Wen-Tau Yih, Hu Xu
Journal
arXiv preprint arXiv:2404.16030
Published Date
2024/4/24
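The inference-time ensembling described in the MoDE abstract can be sketched in a few lines: each expert's output is weighted by how well the task metadata matches that expert's data cluster. The following toy NumPy sketch illustrates only the ensembling arithmetic; the expert scores, cluster centers, and shapes are invented stand-ins, not the MoDE implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: each data expert is summarized by its cluster center in a
# shared embedding space, and produces its own zero-shot similarity scores.
n_experts, dim, n_classes = 4, 8, 5
cluster_centers = rng.normal(size=(n_experts, dim))      # one center per expert
expert_scores = rng.normal(size=(n_experts, n_classes))  # per-expert logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Task metadata (e.g., embedded class names) determines how much each expert
# is trusted: weight experts by similarity between metadata and cluster centers.
task_metadata = rng.normal(size=dim)
weights = softmax(cluster_centers @ task_metadata)       # (n_experts,)

# Ensemble: weighted sum of the experts' outputs.
ensemble = weights @ expert_scores                       # (n_classes,)
prediction = int(np.argmax(ensemble))
```

Because the weights depend only on metadata and cluster centers, experts can be trained (and added) independently, which is the asynchronous-training property the abstract highlights.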
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by the fixed pre-trained visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any-resolution grounding and referring: A flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: By integrating the additional DINOv2 encoder, the model learns better and more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: Besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.
Authors
Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang
Journal
arXiv preprint arXiv:2404.07973
Published Date
2024/4/11
Towards fast adaptation of pretrained contrastive models for multi-channel video-language retrieval
Multi-channel video-language retrieval requires models to understand information from different channels (e.g., video + question, video + speech) to correctly link a video with a textual response or query. Fortunately, contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text, e.g., CLIP; text contrastive models have recently been extensively studied for their strong ability to produce discriminative sentence embeddings, e.g., SimCSE. However, there is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources. In this paper, we identify a principled model design space with two axes: how to represent videos and how to fuse video and text information. Based on a categorization of recent methods, we investigate the options of representing videos using continuous feature vectors or discrete text tokens; for the fusion method, we explore the use of a multimodal transformer or a pretrained contrastive text model. We extensively evaluate the four combinations on five video-language datasets. We surprisingly find that discrete text tokens coupled with a pretrained contrastive text model yields the best performance, which can even outperform the state of the art on the iVQA and How2QA datasets without additional training on millions of video-text pairs. Further analysis shows that this is because representing videos as text tokens captures the key visual information, and text tokens are naturally aligned with text models that are strong retrievers after the contrastive pretraining process. All the empirical analysis establishes a solid foundation for future research on …
Authors
Xudong Lin, Simran Tiwari, Shiyuan Huang, Manling Li, Mike Zheng Shou, Heng Ji, Shih-Fu Chang
Published Date
2023
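The winning combination in the abstract above — video represented as text tokens, fused with the query by simple concatenation, and matched by a contrastive text model — can be illustrated with a tiny retrieval loop. In this sketch a bag-of-words cosine similarity stands in for a real contrastive sentence encoder such as SimCSE, and the frame captions and candidates are invented examples:

```python
import math
from collections import Counter

# Stand-in "text encoder": bag-of-words embedding plus cosine similarity.
# A real system would use a pretrained contrastive sentence encoder here.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# The video channel is replaced by generated text (e.g., captions of sampled
# frames) and fused with the question channel by concatenation.
video_tokens = "a person whisks eggs in a bowl"   # hypothetical frame captions
question = "what is the person mixing"
query_emb = embed(video_tokens + " " + question)

# Retrieve the candidate answer most similar to the fused query.
candidates = ["eggs", "the person is mixing eggs", "a car drives down the road"]
best = max(candidates, key=lambda c: cosine(query_emb, embed(c)))
# → "the person is mixing eggs"
```

The point of the sketch is the data flow: once the video is text, both channels live in the input space the text retriever was pretrained on, which is the alignment advantage the paper's analysis attributes the result to.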
What, when, and where?--Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding box supervision. This work addresses this task from a multimodal supervision perspective, proposing a framework for spatio-temporal action grounding trained on loose video and subtitle supervision only, without human annotation. To this end, we combine local representation learning, which focuses on leveraging fine-grained spatial information, with a global representation encoding that captures higher-level representations and incorporates both in a joint approach. To evaluate this challenging task in a real-life setting, a new benchmark dataset is proposed providing dense spatio-temporal grounding annotations in long, untrimmed, multi-action instructional videos for over 5K events. We evaluate the proposed approach and other methods on the proposed and standard downstream tasks showing that our method improves over current baselines in various settings, including spatial, temporal, and untrimmed multi-action spatio-temporal grounding.
Authors
Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, Hilde Kuehne
Journal
arXiv preprint arXiv:2303.16990
Published Date
2023/3/29
Professor FAQs
What is Shih-Fu Chang's h-index at Columbia University in the City of New York?
The h-index of Shih-Fu Chang has been 71 since 2020 and 134 in total.
What are Shih-Fu Chang's top articles?
The top articles of Shih-Fu Chang at Columbia University in the City of New York include:
RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos
Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities
From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos
MoDE: CLIP Data Experts via Clustering
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Towards fast adaptation of pretrained contrastive models for multi-channel video-language retrieval
What, when, and where?--Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
...
What are Shih-Fu Chang's research interests?
The research interests of Shih-Fu Chang are: Multimedia, Computer Vision, Machine Learning, Signal Processing, and Information Retrieval.
What is Shih-Fu Chang's total number of citations?
Shih-Fu Chang has 72,959 citations in total.
What are the co-authors of Shih-Fu Chang?
The co-authors of Shih-Fu Chang include Yu-Gang Jiang, Rongrong Ji (纪荣嵘), Alexander C. Loui, Winston Hsu, Lexing Xie, and Hari Sundaram.