Shih-Fu Chang
Columbia University in the City of New York
H-index: 134
North America-United States
Description
Shih-Fu Chang, a distinguished researcher at Columbia University in the City of New York, specializes in Multimedia, Computer Vision, Machine Learning, Signal Processing, and Information Retrieval, with an exceptional h-index of 134 overall and a recent h-index of 71 (since 2020).
His recent articles reflect a diverse array of research interests and contributions to the field:
RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos
Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities
From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos
MoDE: CLIP Data Experts via Clustering
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Towards fast adaptation of pretrained contrastive models for multi-channel video-language retrieval
What, when, and where?--Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Professor Information
| University | Columbia University in the City of New York |
| --- | --- |
| Position | Professor of Electrical Engineering and Computer Science |
| Citations (all) | 72,959 |
| Citations (since 2020) | 21,353 |
| Cited By | 59,894 |
| h-index (all) | 134 |
| h-index (since 2020) | 71 |
| i10-index (all) | 559 |
| i10-index (since 2020) | 269 |
| University Profile Page | Columbia University in the City of New York |
Research & Interests List
Multimedia
Computer Vision
Machine Learning
Signal Processing
Information Retrieval
Top articles of Shih-Fu Chang
RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos
Procedure planning in instructional videos entails generating a sequence of action steps based on visual observations of the initial and target states. Despite rapid progress on this task, several critical challenges remain: (1) Adaptive procedures: Prior works hold the unrealistic assumption that the number of action steps is known and fixed, leading to non-generalizable models in real-world scenarios where the sequence length varies. (2) Temporal relation: Understanding the temporal relations between steps is essential for producing reasonable and executable plans. (3) Annotation cost: Annotating instructional videos with step-level labels (i.e., timestamps) or sequence-level labels (i.e., action categories) is demanding and labor-intensive, limiting generalizability to large-scale datasets. In this work, we propose a new and practical setting, called adaptive procedure planning in instructional videos, where the procedure length is not fixed or pre-determined. To address these challenges, we introduce the Retrieval-Augmented Planner (RAP) model. Specifically, for adaptive procedures, RAP adaptively determines the conclusion of actions using an auto-regressive model architecture. For temporal relations, RAP establishes an external memory module to explicitly retrieve the most relevant state-action pairs from the training videos and revises the generated procedures. To tackle the high annotation cost, RAP adopts weakly-supervised learning to expand the training dataset to other task-relevant, unannotated videos by generating pseudo labels for action steps. Experiments on CrossTask and COIN benchmarks show the …
Authors
Ali Zare, Yulei Niu, Hammad Ayyubi, Shih-Fu Chang
Journal
arXiv preprint arXiv:2403.18600
Published Date
2024/3/27
Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities
Events describe happenings in our world that are of importance. Naturally, understanding events mentioned in multimedia content and how they are related forms an important way of comprehending our world. Existing literature can infer whether events across textual and visual (video) domains are identical (via grounding) and thus on the same semantic level. However, grounding fails to capture the intricate cross-event relations that exist when the same events are referred to at many semantic levels. For example, in Figure 1, the abstract event of "war" manifests at a lower semantic level through the subevents "tanks firing" (in video) and airplane "shot" (in text), leading to a hierarchical, multimodal relationship between the events. In this paper, we propose the task of extracting event hierarchies from multimodal (video and text) data to capture how the same event manifests itself in different modalities at different semantic levels. This reveals the structure of events and is critical to understanding them. To support research on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve) dataset. Unlike prior video-language datasets, MultiHiEve is composed of news video-article pairs, which makes it rich in event hierarchies. We densely annotate a part of the dataset to construct the test benchmark. We show the limitations of state-of-the-art unimodal and multimodal baselines on this task. Further, we address these limitations via a new weakly supervised model, leveraging only unannotated video-article pairs from MultiHiEve. We perform a thorough evaluation of our proposed method which demonstrates improved performance on this task and …
Authors
Hammad A. Ayyubi, Christopher Thomas, Lovish Chum, Rahul Lokesh, Long Chen, Yulei Niu, Xudong Lin, Xuande Feng, Jaywon Koo, Sounak Ray, Shih-Fu Chang
Published Date
2024/2/24
From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
Data visualization in the form of charts plays a pivotal role in data analysis, offering critical insights and aiding in informed decision-making. Automatic chart understanding has witnessed significant advancements with the rise of large foundation models in recent years. Foundation models, such as large language models (LLMs), have revolutionized various natural language processing (NLP) tasks and are increasingly being applied to chart understanding tasks. This survey paper provides a comprehensive overview of the recent developments, challenges, and future directions in chart understanding within the context of these foundation models. The paper begins by defining chart understanding, outlining problem formulations, and discussing fundamental building blocks crucial for studying chart understanding tasks. In the section on tasks and datasets, we explore various tasks within chart understanding and discuss their evaluation metrics and sources of both charts and textual inputs. Modeling strategies are then examined, encompassing both classification-based and generation-based approaches, along with tool augmentation techniques that enhance chart understanding performance. Furthermore, we discuss the state-of-the-art performance of each task and discuss how we can improve the performance. Challenges and future directions are addressed in a dedicated section, highlighting issues such as domain-specific charts, lack of efforts in evaluation, and agent-oriented settings. This survey paper serves to provide valuable insights and directions for future research in chart understanding leveraging large foundation models. The studies …
Authors
Kung-Hsiang Huang, Hou Pong Chan, Yi R Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, Heng Ji
Journal
arXiv preprint arXiv:2403.12027
Published Date
2024/3/18
SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos
We study the problem of procedure planning in instructional videos, which aims to make a goal-oriented sequence of action steps given partial visual state observations. The motivation of this problem is to learn a structured and plannable state and action space. Recent works succeeded in sequence modeling of steps with only sequence-level annotations accessible during training, but overlooked the roles of states in the procedures. In this work, we point out that State CHangEs MAtter (SCHEMA) for procedure planning in instructional videos. We aim to establish a more structured state space by investigating the causal relations between steps and states in procedures. Specifically, we explicitly represent each step as state changes and track the state changes in procedures. For step representation, we leverage the commonsense knowledge in large language models (LLMs) to describe the state changes of steps via our designed chain-of-thought prompting. For state change tracking, we align visual state observations with language state descriptions via cross-modal contrastive learning, and explicitly model the intermediate states of the procedure using LLM-generated state descriptions. Experiments on the CrossTask, COIN, and NIV benchmark datasets demonstrate that our proposed SCHEMA model achieves state-of-the-art performance and obtains explainable visualizations.
Authors
Yulei Niu, Wenliang Guo, Long Chen, Xudong Lin, Shih-Fu Chang
Journal
arXiv preprint arXiv:2403.01599
Published Date
2024/3/3
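The cross-modal alignment step in the abstract above follows a general recipe: visual state features and language state descriptions of the same moment are pulled together, and mismatched pairs pushed apart, with a symmetric contrastive (InfoNCE-style) loss. A minimal NumPy sketch of that general loss, with toy features standing in for real encoders (all shapes and values here are illustrative, not from the SCHEMA code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: paired visual state features and language state descriptions.
n_pairs, dim, temperature = 6, 16, 0.07
visual = rng.normal(size=(n_pairs, dim))                # visual state features
text = visual + 0.1 * rng.normal(size=(n_pairs, dim))   # paired descriptions

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

v, t = l2_normalize(visual), l2_normalize(text)
logits = (v @ t.T) / temperature     # (n_pairs, n_pairs) similarity matrix

def cross_entropy(logits, labels):
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

# The i-th visual state matches the i-th description; average both directions.
labels = np.arange(n_pairs)
loss = 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

The diagonal of `logits` holds the matched pairs, so minimizing this loss makes matched visual and language states the most similar entries in each row and column.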
MoDE: CLIP Data Experts via Clustering
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, making it less sensitive to false-negative noise in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification, but with less (35%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at https://github.com/facebookresearch/MetaCLIP/tree/main/mode.
Authors
Jiawei Ma, Po-Yao Huang, Saining Xie, Shang-Wen Li, Luke Zettlemoyer, Shih-Fu Chang, Wen-Tau Yih, Hu Xu
Journal
arXiv preprint arXiv:2404.16030
Published Date
2024/4/24
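The inference-time ensembling described in the MoDE abstract can be sketched in a few lines: each expert's output is weighted by how well the task metadata matches that expert's data cluster. The following toy NumPy sketch illustrates only the ensembling arithmetic; the expert scores, cluster centers, and shapes are invented stand-ins, not the MoDE implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: each data expert is summarized by its cluster center in a
# shared embedding space, and produces its own zero-shot similarity scores.
n_experts, dim, n_classes = 4, 8, 5
cluster_centers = rng.normal(size=(n_experts, dim))      # one center per expert
expert_scores = rng.normal(size=(n_experts, n_classes))  # per-expert logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Task metadata (e.g., embedded class names) determines how much each expert
# is trusted: weight experts by similarity between metadata and cluster centers.
task_metadata = rng.normal(size=dim)
weights = softmax(cluster_centers @ task_metadata)       # (n_experts,)

# Ensemble: weighted sum of the experts' outputs.
ensemble = weights @ expert_scores                       # (n_classes,)
prediction = int(np.argmax(ensemble))
```

Because the weights depend only on metadata and cluster centers, experts can be trained (and added) independently, which is the asynchronous-training property the abstract highlights.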
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by the fixed pre-trained visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any-resolution grounding and referring: A flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: By integrating the additional DINOv2 encoder, the model learns better and more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: Besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.
Authors
Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang
Journal
arXiv preprint arXiv:2404.07973
Published Date
2024/4/11
Towards fast adaptation of pretrained contrastive models for multi-channel video-language retrieval
Multi-channel video-language retrieval requires models to understand information from different channels (e.g., video + question, video + speech) to correctly link a video with a textual response or query. Fortunately, contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text, e.g., CLIP; text contrastive models have recently been extensively studied for their strong ability to produce discriminative sentence embeddings, e.g., SimCSE. However, there is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources. In this paper, we identify a principled model design space with two axes: how to represent videos and how to fuse video and text information. Based on a categorization of recent methods, we investigate the options of representing videos using continuous feature vectors or discrete text tokens; for the fusion method, we explore the use of a multimodal transformer or a pretrained contrastive text model. We extensively evaluate the four combinations on five video-language datasets. We surprisingly find that discrete text tokens coupled with a pretrained contrastive text model yields the best performance, which can even outperform the state of the art on the iVQA and How2QA datasets without additional training on millions of video-text pairs. Further analysis shows that this is because representing videos as text tokens captures the key visual information, and text tokens are naturally aligned with text models that are strong retrievers after the contrastive pretraining process. All the empirical analysis establishes a solid foundation for future research on …
Authors
Xudong Lin, Simran Tiwari, Shiyuan Huang, Manling Li, Mike Zheng Shou, Heng Ji, Shih-Fu Chang
Published Date
2023
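The winning combination in the abstract above — video represented as text tokens, fused with the query by simple concatenation, and matched by a contrastive text model — can be illustrated with a tiny retrieval loop. In this sketch a bag-of-words cosine similarity stands in for a real contrastive sentence encoder such as SimCSE, and the frame captions and candidates are invented examples:

```python
import math
from collections import Counter

# Stand-in "text encoder": bag-of-words embedding plus cosine similarity.
# A real system would use a pretrained contrastive sentence encoder here.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# The video channel is replaced by generated text (e.g., captions of sampled
# frames) and fused with the question channel by concatenation.
video_tokens = "a person whisks eggs in a bowl"   # hypothetical frame captions
question = "what is the person mixing"
query_emb = embed(video_tokens + " " + question)

# Retrieve the candidate answer most similar to the fused query.
candidates = ["eggs", "the person is mixing eggs", "a car drives down the road"]
best = max(candidates, key=lambda c: cosine(query_emb, embed(c)))
# → "the person is mixing eggs"
```

The point of the sketch is the data flow: once the video is text, both channels live in the input space the text retriever was pretrained on, which is the alignment advantage the paper's analysis attributes the result to.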
What, when, and where?--Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding box supervision. This work addresses this task from a multimodal supervision perspective, proposing a framework for spatio-temporal action grounding trained on loose video and subtitle supervision only, without human annotation. To this end, we combine local representation learning, which focuses on leveraging fine-grained spatial information, with a global representation encoding that captures higher-level representations and incorporates both in a joint approach. To evaluate this challenging task in a real-life setting, a new benchmark dataset is proposed providing dense spatio-temporal grounding annotations in long, untrimmed, multi-action instructional videos for over 5K events. We evaluate the proposed approach and other methods on the proposed and standard downstream tasks showing that our method improves over current baselines in various settings, including spatial, temporal, and untrimmed multi-action spatio-temporal grounding.
Authors
Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, Hilde Kuehne
Journal
arXiv preprint arXiv:2303.16990
Published Date
2023/3/29
Professor FAQs
What is Shih-Fu Chang's h-index at Columbia University in the City of New York?
The h-index of Shih-Fu Chang has been 71 since 2020 and 134 in total.
What are Shih-Fu Chang's top articles?
The top articles of Shih-Fu Chang at Columbia University in the City of New York include:
RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos
Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities
From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos
MoDE: CLIP Data Experts via Clustering
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Towards fast adaptation of pretrained contrastive models for multi-channel video-language retrieval
What, when, and where?--Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
...
What are Shih-Fu Chang's research interests?
The research interests of Shih-Fu Chang are: Multimedia, Computer Vision, Machine Learning, Signal Processing, and Information Retrieval.
What is Shih-Fu Chang's total number of citations?
Shih-Fu Chang has 72,959 citations in total.
What are the co-authors of Shih-Fu Chang?
The co-authors of Shih-Fu Chang include Yu-Gang Jiang, Rongrong Ji (纪荣嵘), Alexander C. Loui, Winston Hsu, Lexing Xie, and Hari Sundaram.