Antonio Torralba
Massachusetts Institute of Technology
H-index: 138
North America-United States
Description
Antonio Torralba is a distinguished researcher at the Massachusetts Institute of Technology specializing in vision and computer vision. He has an exceptional h-index of 138 overall and a recent h-index of 112 (since 2020).
His recent articles reflect a diverse array of research interests and contributions to the field:
CAvatar: Real-time Human Activity Mesh Reconstruction via Tactile Carpets
Follow Anything: Open-Set Detection, Tracking, and Following in Real-Time
A Multimodal Automated Interpretability Agent
Structure from Duplicates: Neural Inverse Graphics from a Pile of Objects
FIND: A function description benchmark for evaluating interpretability methods
Automatic Discovery of Visual Circuits
A Vision Check-up for Language Models
3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes
Professor Information
| University | Massachusetts Institute of Technology |
|---|---|
| Position | Professor of Computer Science |
| Citations (all) | 123,395 |
| Citations (since 2020) | 72,680 |
| Cited by | 77,229 |
| h-index (all) | 138 |
| h-index (since 2020) | 112 |
| i10-index (all) | 285 |
| i10-index (since 2020) | 258 |
| University Profile Page | Massachusetts Institute of Technology |
Research & Interests List
vision
computer vision
Top articles of Antonio Torralba
CAvatar: Real-time Human Activity Mesh Reconstruction via Tactile Carpets
Human mesh reconstruction is essential for various applications, including virtual reality, motion capture, sports performance analysis, and healthcare monitoring. In healthcare contexts such as nursing homes, it is crucial to employ plausible and non-invasive methods for human mesh reconstruction that preserve privacy and dignity. Traditional vision-based techniques encounter challenges related to occlusion, viewpoint limitations, lighting conditions, and privacy concerns. In this research, we present CAvatar, a real-time human mesh reconstruction approach that innovatively utilizes pressure maps recorded by a tactile carpet as input. This advanced, non-intrusive technology obviates the need for cameras during usage, thereby safeguarding privacy. Our approach addresses several challenges, such as the limited spatial resolution of tactile sensors, extracting meaningful information from noisy pressure maps, and …
Authors
Wenqiang Chen, Yexin Hu, Wei Song, Yingcheng Liu, Antonio Torralba, Wojciech Matusik
Journal
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Published Date
2024/1/12
Follow Anything: Open-Set Detection, Tracking, and Following in Real-Time
Tracking and following objects of interest is critical to several robotics use cases, ranging from industrial automation to logistics and warehousing, to healthcare and security. In this letter, we present a robotic system to detect, track, and follow any object in real-time. Our approach, dubbed Follow Anything (FAn), is an open-vocabulary and multimodal model – it is not restricted to concepts seen at training time and can be applied to novel classes at inference time using text, images, or click queries. Leveraging rich visual descriptors from large-scale pre-trained models (foundation models), FAn can detect and segment objects by matching multimodal queries (text, images, clicks) against an input image sequence. These detected and segmented objects are tracked across image frames, all while accounting for occlusion and object re-emergence. We demonstrate FAn on a real-world robotic system (a micro aerial …
Authors
Alaa Maalouf, Ninad Jadhav, Krishna Murthy Jatavallabhula, Makram Chahine, Daniel M Vogt, Robert J Wood, Antonio Torralba, Daniela Rus
Journal
IEEE Robotics and Automation Letters
Published Date
2024/2/14
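The open-vocabulary matching that the FAn abstract describes — comparing multimodal query embeddings against region embeddings from an image — can be illustrated with a cosine-similarity sketch. The embeddings below are random placeholders standing in for foundation-model features; this is an illustration of the general idea, not FAn's actual implementation.

```python
import numpy as np

def best_matching_region(query_emb, region_embs):
    """Return the index and score of the region whose embedding is most
    similar to the query embedding under cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    sims = r @ q  # cosine similarity of each region to the query
    return int(np.argmax(sims)), float(sims.max())

# Toy example with 2-D embeddings: region 1 points the same way as the query.
query = np.array([1.0, 0.0])
regions = np.array([[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
idx, score = best_matching_region(query, regions)
print(idx)  # → 1
```

In a real open-vocabulary system the query embedding would come from encoding text, an example image, or a clicked region with a pre-trained model, but the selection step reduces to this nearest-neighbor search.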
A Multimodal Automated Interpretability Agent
This paper describes MAIA, a Multimodal Automated Interpretability Agent. MAIA is a system that uses neural models to automate neural model understanding tasks like feature interpretation and failure mode discovery. It equips a pre-trained vision-language model with a set of tools that support iterative experimentation on subcomponents of other models to explain their behavior. These include tools commonly used by human interpretability researchers: for synthesizing and editing inputs, computing maximally activating exemplars from real-world datasets, and summarizing and describing experimental results. Interpretability experiments proposed by MAIA compose these tools to describe and explain system behavior. We evaluate applications of MAIA to computer vision models. We first characterize MAIA's ability to describe (neuron-level) features in learned representations of images. Across several trained models and a novel dataset of synthetic vision neurons with paired ground-truth descriptions, MAIA produces descriptions comparable to those generated by expert human experimenters. We then show that MAIA can aid in two additional interpretability tasks: reducing sensitivity to spurious features, and automatically identifying inputs likely to be misclassified.
Authors
Tamar Rott Shaham, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, Antonio Torralba
Journal
arXiv preprint arXiv:2404.14394
Published Date
2024/4/22
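The agent design the MAIA abstract outlines — a model that composes tools into experiments — can be sketched as a minimal tool-dispatch loop. The tool names and behaviors below are hypothetical stand-ins for MAIA's input-synthesis and description tools, not its actual architecture.

```python
def run_agent(tools, plan):
    """Execute each (tool_name, args) step of a plan and collect results:
    a toy tool-dispatch skeleton, not MAIA's real implementation."""
    results = []
    for name, args in plan:
        results.append(tools[name](*args))
    return results

# Hypothetical tools standing in for input synthesis and result description.
tools = {
    "synthesize": lambda concept: f"image of {concept}",
    "describe": lambda obs: f"unit responds to {obs}",
}
out = run_agent(tools, [("synthesize", ("dog",)), ("describe", ("image of dog",))])
print(out[-1])  # → unit responds to image of dog
```

In the actual system, the plan itself is proposed iteratively by a vision-language model based on earlier tool outputs, rather than fixed in advance.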
Structure from Duplicates: Neural Inverse Graphics from a Pile of Objects
Our world is full of identical objects (e.g., cans of Coke, cars of the same model). These duplicates, when seen together, provide additional and strong cues for us to effectively reason about 3D. Inspired by this observation, we introduce Structure from Duplicates (SfD), a novel inverse graphics framework that reconstructs geometry, material, and illumination from a single image containing multiple identical objects. SfD begins by identifying multiple instances of an object within an image, and then jointly estimates the 6-DoF pose for all instances. An inverse graphics pipeline is subsequently employed to jointly reason about the shape and material of the object and the environment light, while adhering to the shared geometry and material constraints across instances. Our primary contributions involve utilizing object duplicates as a robust prior for single-image inverse graphics and proposing an in-plane rotation-robust Structure from Motion (SfM) formulation for joint 6-DoF object pose estimation. By leveraging multi-view cues from a single image, SfD generates more realistic and detailed 3D reconstructions, significantly outperforming existing single-image reconstruction models and multi-view reconstruction approaches with a similar or greater number of observations.
Authors
Tianhang Cheng, Wei-Chiu Ma, Kaiyu Guan, Antonio Torralba, Shenlong Wang
Journal
arXiv preprint arXiv:2401.05236
Published Date
2024/1/10
FIND: A function description benchmark for evaluating interpretability methods
Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in-the-loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad-hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. The functions are procedurally constructed across textual and numeric domains, and involve a range of real-world complexities, including noise, composition, approximation, and bias. We evaluate methods that use pretrained language models (LMs) to produce code-based and natural language descriptions of function behavior. Additionally, we introduce a new interactive method in which an Automated Interpretability Agent (AIA) generates function descriptions. We find that an AIA, built with an off-the-shelf LM augmented with black-box access to functions …
Authors
Sarah Schwettmann, Tamar Shaham, Joanna Materzynska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba
Journal
Advances in Neural Information Processing Systems
Published Date
2024/2/13
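A FIND-style benchmark entry pairs a procedurally constructed function with a ground-truth description of its behavior. The hypothetical example below is not drawn from the actual suite; it only shows the flavor of a numeric function with added noise and its paired description.

```python
import random

def make_noisy_function(a, b, noise=0.1, seed=0):
    """Build a linear function f(x) = a*x + b with small uniform noise,
    paired with a ground-truth description of the kind FIND provides."""
    rng = random.Random(seed)
    def f(x):
        return a * x + b + rng.uniform(-noise, noise)
    description = f"approximately linear: f(x) ≈ {a}*x + {b}, with noise ±{noise}"
    return f, description

f, desc = make_noisy_function(2, 1)
print(desc)
print(abs(f(3) - 7) <= 0.1)  # output stays within the noise bound → True
```

An interpretability method is then scored on how well its generated description of `f` (produced from black-box queries alone) matches the ground-truth one.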
Automatic Discovery of Visual Circuits
To date, most discoveries of network subcomponents that implement human-interpretable computations in deep vision models have involved close study of single units and large amounts of human labor. We explore scalable methods for extracting the subgraph of a vision model's computational graph that underlies recognition of a specific visual concept. We introduce a new method for identifying these subgraphs: specifying a visual concept using a few examples, and then tracing the interdependence of neuron activations across layers, or their functional connectivity. We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.
Authors
Achyuta Rajaram, Neil Chowdhury, Antonio Torralba, Jacob Andreas, Sarah Schwettmann
Journal
arXiv preprint arXiv:2404.14349
Published Date
2024/4/22
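The functional-connectivity idea in the abstract — tracing which units co-activate across layers on examples of a concept — can be sketched in simplified form by correlating per-example activations between units. The random activations below stand in for a real model's; this is an illustration of the statistic, not the paper's circuit-extraction method.

```python
import numpy as np

def functional_connectivity(acts_a, acts_b):
    """Pearson correlation between two units' activations over the same
    set of concept examples: a crude proxy for functional connectivity."""
    return float(np.corrcoef(acts_a, acts_b)[0, 1])

# Unit B amplifies unit A's activations, so they are perfectly
# correlated; unit C is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = 2.0 * a
c = rng.normal(size=100)
print(round(functional_connectivity(a, b), 3))  # → 1.0
print(abs(functional_connectivity(a, c)) < 0.5)  # → True
```

Thresholding such cross-layer correlations yields a candidate subgraph, whose causal relevance can then be tested by ablating or editing its units.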
A Vision Check-up for Language Models
What does learning to model relationships between strings teach large language models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As language models lack the ability to consume or output visual information as pixels, we use code to represent images in our study. Although LLM-generated images do not look like natural images, results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach language models about numerous aspects of the visual world. Furthermore, experiments on self-supervised visual representation learning, utilizing images generated with text models, highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs.
Authors
Pratyusha Sharma, Tamar Rott Shaham, Manel Baradad, Stephanie Fu, Adrian Rodriguez-Munoz, Shivam Duggal, Phillip Isola, Antonio Torralba
Journal
arXiv preprint arXiv:2401.01862
Published Date
2024/1/3
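Because LLMs cannot consume or emit pixels directly, the study represents images as code that renders them. A toy illustration of that representation (not the paper's actual rendering setup): a short program that draws a filled circle into a binary pixel grid.

```python
import numpy as np

def render_circle(size=16, cx=8, cy=8, r=5):
    """Render a filled circle as a binary pixel grid: a toy example of
    representing an image as code rather than raw pixels."""
    ys, xs = np.mgrid[0:size, 0:size]
    return ((xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2).astype(np.uint8)

img = render_circle()
print(img.shape)   # → (16, 16)
print(img[8, 8])   # center pixel is inside the circle → 1
print(img[0, 0])   # corner pixel is outside → 0
```

A language model that can write or correct such programs demonstrates knowledge about shape and spatial layout without ever touching pixel data itself.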
3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes
Given a visual scene, humans have strong intuitions about how a scene can evolve over time under given actions. The intuition, often termed visual intuitive physics, is a critical ability that allows us to make effective plans to manipulate the scene to achieve desired outcomes without relying on extensive trial and error. In this paper, we present a framework capable of learning 3D-grounded visual intuitive physics models from videos of complex scenes with fluids. Our method is composed of a conditional Neural Radiance Field (NeRF)-style visual frontend and a 3D point-based dynamics prediction backend, using which we can impose strong relational and structural inductive bias to capture the structure of the underlying environment. Unlike existing intuitive point-based dynamics works that rely on the supervision of dense point trajectories from simulators, we relax the requirements and only assume access to multi-view RGB images and (imperfect) instance masks acquired using a color prior. This enables the proposed model to handle scenarios where accurate point estimation and tracking are hard or impossible. We generate datasets covering three challenging scenarios involving fluid, granular materials, and rigid objects in simulation. Because the datasets do not include any dense particle information, most previous 3D-based intuitive physics pipelines can barely handle them. We show our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space. We also show that once trained, our model can achieve strong generalization in complex …
Authors
Haotian Xue, Antonio Torralba, Josh Tenenbaum, Dan Yamins, Yunzhu Li, Hsiao-Yu Tung
Journal
Advances in Neural Information Processing Systems
Published Date
2024/2/13
Professor FAQs
What is Antonio Torralba's h-index at Massachusetts Institute of Technology?
Antonio Torralba's h-index is 138 overall and 112 since 2020.
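For reference, the h-index reported here is the largest h such that the author has h papers each cited at least h times. A minimal sketch of the computation:

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    # Sort citation counts in descending order, then find the last
    # position where the count still meets or exceeds its 1-based rank.
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

# Example: five papers with these citation counts give an h-index of 3,
# since three papers have at least 3 citations but not four with at least 4.
print(h_index([10, 8, 5, 3, 1]))  # → 3
```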
What are Antonio Torralba's top articles?
Antonio Torralba's top articles at Massachusetts Institute of Technology include:
CAvatar: Real-time Human Activity Mesh Reconstruction via Tactile Carpets
Follow Anything: Open-Set Detection, Tracking, and Following in Real-Time
A Multimodal Automated Interpretability Agent
Structure from Duplicates: Neural Inverse Graphics from a Pile of Objects
FIND: A function description benchmark for evaluating interpretability methods
Automatic Discovery of Visual Circuits
A Vision Check-up for Language Models
3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes
What are Antonio Torralba's research interests?
The research interests of Antonio Torralba are vision and computer vision.
What is Antonio Torralba's total number of citations?
Antonio Torralba has 123,395 citations in total.
Who are Antonio Torralba's co-authors?
Antonio Torralba's co-authors include William T. Freeman, Joshua B. Tenenbaum, Fredo Durand, Rob Fergus, Sanja Fidler, and Aude Oliva.