Antonio Torralba
Massachusetts Institute of Technology
H-index: 138
North America-United States
Description
Antonio Torralba is a distinguished researcher at the Massachusetts Institute of Technology specializing in vision and computer vision. He has an exceptional h-index of 138 overall and a recent h-index of 112 (since 2020).
His recent articles reflect a diverse array of research interests and contributions to the field:
CAvatar: Real-time Human Activity Mesh Reconstruction via Tactile Carpets
Follow Anything: Open-Set Detection, Tracking, and Following in Real-Time
A Multimodal Automated Interpretability Agent
Structure from Duplicates: Neural Inverse Graphics from a Pile of Objects
FIND: A function description benchmark for evaluating interpretability methods
Automatic Discovery of Visual Circuits
A Vision Check-up for Language Models
3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes
Professor Information
| University | Massachusetts Institute of Technology |
|---|---|
| Position | Professor of Computer Science |
| Citations (all) | 123,395 |
| Citations (since 2020) | 72,680 |
| Cited by | 77,229 |
| h-index (all) | 138 |
| h-index (since 2020) | 112 |
| i10-index (all) | 285 |
| i10-index (since 2020) | 258 |
| University Profile Page | Massachusetts Institute of Technology |
Research & Interests List
vision
computer vision
Top articles of Antonio Torralba
CAvatar: Real-time Human Activity Mesh Reconstruction via Tactile Carpets
Human mesh reconstruction is essential for various applications, including virtual reality, motion capture, sports performance analysis, and healthcare monitoring. In healthcare contexts such as nursing homes, it is crucial to employ plausible and non-invasive methods for human mesh reconstruction that preserve privacy and dignity. Traditional vision-based techniques encounter challenges related to occlusion, viewpoint limitations, lighting conditions, and privacy concerns. In this research, we present CAvatar, a real-time human mesh reconstruction approach that innovatively utilizes pressure maps recorded by a tactile carpet as input. This advanced, non-intrusive technology obviates the need for cameras during usage, thereby safeguarding privacy. Our approach addresses several challenges, such as the limited spatial resolution of tactile sensors, extracting meaningful information from noisy pressure maps, and …
Authors
Wenqiang Chen, Yexin Hu, Wei Song, Yingcheng Liu, Antonio Torralba, Wojciech Matusik
Journal
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Published Date
2024/1/12
Follow Anything: Open-Set Detection, Tracking, and Following in Real-Time
Tracking and following objects of interest is critical to several robotics use cases, ranging from industrial automation to logistics and warehousing, to healthcare and security. In this letter, we present a robotic system to detect, track, and follow any object in real-time. Our approach, dubbed Follow Anything (FAn), is an open-vocabulary and multimodal model – it is not restricted to concepts seen at training time and can be applied to novel classes at inference time using text, images, or click queries. Leveraging rich visual descriptors from large-scale pre-trained models (foundation models), FAn can detect and segment objects by matching multimodal queries (text, images, clicks) against an input image sequence. These detected and segmented objects are tracked across image frames, all while accounting for occlusion and object re-emergence. We demonstrate FAn on a real-world robotic system (a micro aerial …
Authors
Alaa Maalouf, Ninad Jadhav, Krishna Murthy Jatavallabhula, Makram Chahine, Daniel M Vogt, Robert J Wood, Antonio Torralba, Daniela Rus
Journal
IEEE Robotics and Automation Letters
Published Date
2024/2/14
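The open-vocabulary matching that the FAn abstract describes — comparing multimodal query embeddings against region embeddings from an image — can be illustrated with a cosine-similarity sketch. The embeddings below are random placeholders standing in for foundation-model features; this is an illustration of the general idea, not FAn's actual implementation.

```python
import numpy as np

def best_matching_region(query_emb, region_embs):
    """Return the index and score of the region whose embedding is most
    similar to the query embedding under cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    sims = r @ q  # cosine similarity of each region to the query
    return int(np.argmax(sims)), float(sims.max())

# Toy example with 2-D embeddings: region 1 points the same way as the query.
query = np.array([1.0, 0.0])
regions = np.array([[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
idx, score = best_matching_region(query, regions)
print(idx)  # → 1
```

In a real open-vocabulary system the query embedding would come from encoding text, an example image, or a clicked region with a pre-trained model, but the selection step reduces to this nearest-neighbor search.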
A Multimodal Automated Interpretability Agent
This paper describes MAIA, a Multimodal Automated Interpretability Agent. MAIA is a system that uses neural models to automate neural model understanding tasks like feature interpretation and failure mode discovery. It equips a pre-trained vision-language model with a set of tools that support iterative experimentation on subcomponents of other models to explain their behavior. These include tools commonly used by human interpretability researchers: for synthesizing and editing inputs, computing maximally activating exemplars from real-world datasets, and summarizing and describing experimental results. Interpretability experiments proposed by MAIA compose these tools to describe and explain system behavior. We evaluate applications of MAIA to computer vision models. We first characterize MAIA's ability to describe (neuron-level) features in learned representations of images. Across several trained models and a novel dataset of synthetic vision neurons with paired ground-truth descriptions, MAIA produces descriptions comparable to those generated by expert human experimenters. We then show that MAIA can aid in two additional interpretability tasks: reducing sensitivity to spurious features, and automatically identifying inputs likely to be misclassified.
Authors
Tamar Rott Shaham, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, Antonio Torralba
Journal
arXiv preprint arXiv:2404.14394
Published Date
2024/4/22
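The agent design the MAIA abstract outlines — a model that composes tools into experiments — can be sketched as a minimal tool-dispatch loop. The tool names and behaviors below are hypothetical stand-ins for MAIA's input-synthesis and description tools, not its actual architecture.

```python
def run_agent(tools, plan):
    """Execute each (tool_name, args) step of a plan and collect results:
    a toy tool-dispatch skeleton, not MAIA's real implementation."""
    results = []
    for name, args in plan:
        results.append(tools[name](*args))
    return results

# Hypothetical tools standing in for input synthesis and result description.
tools = {
    "synthesize": lambda concept: f"image of {concept}",
    "describe": lambda obs: f"unit responds to {obs}",
}
out = run_agent(tools, [("synthesize", ("dog",)), ("describe", ("image of dog",))])
print(out[-1])  # → unit responds to image of dog
```

In the actual system, the plan itself is proposed iteratively by a vision-language model based on earlier tool outputs, rather than fixed in advance.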
Structure from Duplicates: Neural Inverse Graphics from a Pile of Objects
Our world is full of identical objects (e.g., cans of Coke, cars of the same model). These duplicates, when seen together, provide additional and strong cues for us to effectively reason about 3D. Inspired by this observation, we introduce Structure from Duplicates (SfD), a novel inverse graphics framework that reconstructs geometry, material, and illumination from a single image containing multiple identical objects. SfD begins by identifying multiple instances of an object within an image, and then jointly estimates the 6-DoF pose for all instances. An inverse graphics pipeline is subsequently employed to jointly reason about the shape and material of the object and the environment light, while adhering to the shared geometry and material constraints across instances. Our primary contributions involve utilizing object duplicates as a robust prior for single-image inverse graphics and proposing an in-plane rotation-robust Structure from Motion (SfM) formulation for joint 6-DoF object pose estimation. By leveraging multi-view cues from a single image, SfD generates more realistic and detailed 3D reconstructions, significantly outperforming existing single-image reconstruction models and multi-view reconstruction approaches with a similar or greater number of observations.
Authors
Tianhang Cheng, Wei-Chiu Ma, Kaiyu Guan, Antonio Torralba, Shenlong Wang
Journal
arXiv preprint arXiv:2401.05236
Published Date
2024/1/10
FIND: A function description benchmark for evaluating interpretability methods
Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in-the-loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad-hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. The functions are procedurally constructed across textual and numeric domains, and involve a range of real-world complexities, including noise, composition, approximation, and bias. We evaluate methods that use pretrained language models (LMs) to produce code-based and natural language descriptions of function behavior. Additionally, we introduce a new interactive method in which an Automated Interpretability Agent (AIA) generates function descriptions. We find that an AIA, built with an off-the-shelf LM augmented with black-box access to functions …
Authors
Sarah Schwettmann, Tamar Shaham, Joanna Materzynska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba
Journal
Advances in Neural Information Processing Systems
Published Date
2024/2/13
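A FIND-style benchmark entry pairs a procedurally constructed function with a ground-truth description of its behavior. The hypothetical example below is not drawn from the actual suite; it only shows the flavor of a numeric function with added noise and its paired description.

```python
import random

def make_noisy_function(a, b, noise=0.1, seed=0):
    """Build a linear function f(x) = a*x + b with small uniform noise,
    paired with a ground-truth description of the kind FIND provides."""
    rng = random.Random(seed)
    def f(x):
        return a * x + b + rng.uniform(-noise, noise)
    description = f"approximately linear: f(x) ≈ {a}*x + {b}, with noise ±{noise}"
    return f, description

f, desc = make_noisy_function(2, 1)
print(desc)
print(abs(f(3) - 7) <= 0.1)  # output stays within the noise bound → True
```

An interpretability method is then scored on how well its generated description of `f` (produced from black-box queries alone) matches the ground-truth one.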
Automatic Discovery of Visual Circuits
To date, most discoveries of network subcomponents that implement human-interpretable computations in deep vision models have involved close study of single units and large amounts of human labor. We explore scalable methods for extracting the subgraph of a vision model's computational graph that underlies recognition of a specific visual concept. We introduce a new method for identifying these subgraphs: specifying a visual concept using a few examples, and then tracing the interdependence of neuron activations across layers, or their functional connectivity. We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.
Authors
Achyuta Rajaram, Neil Chowdhury, Antonio Torralba, Jacob Andreas, Sarah Schwettmann
Journal
arXiv preprint arXiv:2404.14349
Published Date
2024/4/22
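The functional-connectivity idea in the abstract — tracing which units co-activate across layers on examples of a concept — can be sketched in simplified form by correlating per-example activations between units. The random activations below stand in for a real model's; this is an illustration of the statistic, not the paper's circuit-extraction method.

```python
import numpy as np

def functional_connectivity(acts_a, acts_b):
    """Pearson correlation between two units' activations over the same
    set of concept examples: a crude proxy for functional connectivity."""
    return float(np.corrcoef(acts_a, acts_b)[0, 1])

# Unit B amplifies unit A's activations, so they are perfectly
# correlated; unit C is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = 2.0 * a
c = rng.normal(size=100)
print(round(functional_connectivity(a, b), 3))  # → 1.0
print(abs(functional_connectivity(a, c)) < 0.5)  # → True
```

Thresholding such cross-layer correlations yields a candidate subgraph, whose causal relevance can then be tested by ablating or editing its units.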
A Vision Check-up for Language Models
What does learning to model relationships between strings teach large language models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As language models lack the ability to consume or output visual information as pixels, we use code to represent images in our study. Although LLM-generated images do not look like natural images, results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach language models about numerous aspects of the visual world. Furthermore, experiments on self-supervised visual representation learning, utilizing images generated with text models, highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs.
Authors
Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, Antonio Torralba
Journal
arXiv preprint arXiv:2401.01862
Published Date
2024/1/3
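Because LLMs cannot consume or emit pixels directly, the study represents images as code that renders them. A toy illustration of that representation (not the paper's actual rendering setup): a short program that draws a filled circle into a binary pixel grid.

```python
import numpy as np

def render_circle(size=16, cx=8, cy=8, r=5):
    """Render a filled circle as a binary pixel grid: a toy example of
    representing an image as code rather than raw pixels."""
    ys, xs = np.mgrid[0:size, 0:size]
    return ((xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2).astype(np.uint8)

img = render_circle()
print(img.shape)   # → (16, 16)
print(img[8, 8])   # center pixel is inside the circle → 1
print(img[0, 0])   # corner pixel is outside → 0
```

A language model that can write or correct such programs demonstrates knowledge about shape and spatial layout without ever touching pixel data itself.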
3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes
Given a visual scene, humans have strong intuitions about how a scene can evolve over time under given actions. The intuition, often termed visual intuitive physics, is a critical ability that allows us to make effective plans to manipulate the scene to achieve desired outcomes without relying on extensive trial and error. In this paper, we present a framework capable of learning 3D-grounded visual intuitive physics models from videos of complex scenes with fluids. Our method is composed of a conditional Neural Radiance Field (NeRF)-style visual frontend and a 3D point-based dynamics prediction backend, using which we can impose strong relational and structural inductive bias to capture the structure of the underlying environment. Unlike existing intuitive point-based dynamics works that rely on the supervision of dense point trajectories from simulators, we relax the requirements and only assume access to multi-view RGB images and (imperfect) instance masks acquired using a color prior. This enables the proposed model to handle scenarios where accurate point estimation and tracking are hard or impossible. We generate datasets covering three challenging scenarios involving fluid, granular materials, and rigid objects in simulation. Because the datasets do not include any dense particle information, most previous 3D-based intuitive physics pipelines can barely handle them. We show our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space. We also show that once trained, our model can achieve strong generalization in complex …
Authors
Haotian Xue, Antonio Torralba, Josh Tenenbaum, Dan Yamins, Yunzhu Li, Hsiao-Yu Tung
Journal
Advances in Neural Information Processing Systems
Published Date
2024/2/13
Professor FAQs
What is Antonio Torralba's h-index at Massachusetts Institute of Technology?
Antonio Torralba's h-index is 138 overall and 112 since 2020.
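For reference, the h-index reported here is the largest h such that the author has h papers each cited at least h times. A minimal sketch of the computation:

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    # Sort citation counts in descending order, then find the last
    # position where the count still meets or exceeds its 1-based rank.
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

# Example: five papers with these citation counts give an h-index of 3,
# since three papers have at least 3 citations but not four with at least 4.
print(h_index([10, 8, 5, 3, 1]))  # → 3
```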
What are Antonio Torralba's top articles?
Antonio Torralba's top articles at Massachusetts Institute of Technology include:
CAvatar: Real-time Human Activity Mesh Reconstruction via Tactile Carpets
Follow Anything: Open-Set Detection, Tracking, and Following in Real-Time
A Multimodal Automated Interpretability Agent
Structure from Duplicates: Neural Inverse Graphics from a Pile of Objects
FIND: A function description benchmark for evaluating interpretability methods
Automatic Discovery of Visual Circuits
A Vision Check-up for Language Models
3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes
What are Antonio Torralba's research interests?
The research interests of Antonio Torralba are vision and computer vision.
What is Antonio Torralba's total number of citations?
Antonio Torralba has 123,395 citations in total.
Who are Antonio Torralba's co-authors?
Antonio Torralba's co-authors include William T. Freeman, Joshua B. Tenenbaum, Fredo Durand, Rob Fergus, Sanja Fidler, and Aude Oliva.