Yann LeCun

New York University

H-index: 145

North America-United States

Yann LeCun Information

University

New York University

Position

Chief AI Scientist at Facebook & Silver Professor at the Courant Institute

Citations (all)

338733

Citations (since 2020)

228006

Cited By

196357

h-index (all)

145

h-index (since 2020)

113

i10-index (all)

364

i10-index (since 2020)

286

University Profile Page

New York University

Yann LeCun Skills & Research Interests

AI

machine learning

computer vision

robotics

image compression

Top articles of Yann LeCun

Learning and Leveraging World Models in Visual Representation Learning

Authors

Quentin Garrido,Mahmoud Assran,Nicolas Ballas,Adrien Bardes,Laurent Najman,Yann LeCun

Journal

arXiv preprint arXiv:2403.00504

Published Date

2024/3/1

The Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising self-supervised approach that learns by leveraging a world model. While previously limited to predicting missing parts of an input, we explore how to generalize the JEPA prediction task to a broader set of corruptions. We introduce Image World Models (IWM), an approach that goes beyond masked image modeling and learns to predict the effect of global photometric transformations in latent space. We study the recipe for learning performant IWMs and show that it relies on three key aspects: conditioning, prediction difficulty, and capacity. Additionally, we show that the predictive world model learned by IWM can be adapted through finetuning to solve diverse tasks; a finetuned IWM world model matches or surpasses the performance of previous self-supervised methods. Finally, we show that learning with an IWM allows one to control the abstraction level of the learned representations, learning invariant representations, as contrastive methods do, or equivariant representations, as in masked image modeling.
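As a rough illustration of the IWM idea, the sketch below conditions a predictor on the parameters of a photometric transformation and trains it to map the latent of a clean view to the latent of the transformed view. All names, dimensions, and the toy "augmentation" are invented for illustration; this is a minimal sketch, not the authors' implementation.

```python
# Minimal sketch of the Image World Model idea (not the authors' code): a predictor,
# conditioned on the parameters of a photometric transformation, maps the latent of a
# clean view to the latent of the transformed view.
import torch
import torch.nn as nn

class ToyIWM(nn.Module):
    def __init__(self, dim=128, cond_dim=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim),
                                     nn.ReLU(), nn.Linear(dim, dim))
        # the target encoder would typically be an EMA copy; here we reuse the encoder
        self.predictor = nn.Sequential(nn.Linear(dim + cond_dim, dim),
                                       nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x_clean, x_aug, aug_params):
        z_clean = self.encoder(x_clean)
        with torch.no_grad():                      # stop-gradient on the target branch
            z_target = self.encoder(x_aug)
        z_pred = self.predictor(torch.cat([z_clean, aug_params], dim=1))
        return nn.functional.mse_loss(z_pred, z_target)

# toy batch: 8 RGB 32x32 images, "transformed" by a hypothetical photometric shift
x = torch.randn(8, 3, 32, 32)
aug_params = torch.rand(8, 3)                      # hypothetical transformation parameters
x_aug = x * (1 + aug_params[:, :1, None, None])    # stand-in for a real photometric transform
loss = ToyIWM()(x, x_aug, aug_params)
print(loss.item())
```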

EgoPet: Egomotion and Interaction Data from an Animal's Perspective

Authors

Amir Bar,Arya Bakhtiar,Danny Tran,Antonio Loquercio,Jathushan Rajasegaran,Yann LeCun,Amir Globerson,Trevor Darrell

Journal

arXiv preprint arXiv:2404.09991

Published Date

2024/4/15

Animals perceive the world to plan their actions and interact with other agents to accomplish complex tasks, demonstrating capabilities that are still unmatched by AI systems. To advance our understanding and reduce the gap between the capabilities of animals and AI systems, we introduce EgoPet, a dataset of pet egomotion imagery with diverse examples of simultaneous egomotion and multi-agent interaction. Current video datasets separately contain egomotion and interaction examples, but rarely both at the same time. In addition, EgoPet offers a radically distinct perspective from existing egocentric datasets of humans or vehicles. We define two in-domain benchmark tasks that capture animal behavior, and a third benchmark to assess the utility of EgoPet as a pretraining resource for robotic quadruped locomotion, showing that models trained on EgoPet outperform those trained on prior datasets.

Fast and exact enumeration of deep networks partitions regions

Authors

Randall Balestriero,Yann LeCun

Published Date

2023/6/4

One fruitful formulation of Deep Networks (DNs), enabling their theoretical study and providing practical guidelines to practitioners, relies on piecewise affine splines. In that realm, a DN's input-output mapping is expressed as a per-region affine mapping, where the regions are implicitly determined by the model's architecture and form a partition of its input space. That partition, which is involved in all the results stemming from this line of research, has so far only been computed on 2/3-dimensional slices of the DN's input space or estimated by random sampling. In this paper, we provide the first parallel algorithm that performs exact enumeration of the DN's partition regions. The proposed algorithm enables one to finally assess the closeness of the commonly employed approximation methods, e.g., those based on random sampling of the DN input space. One of our key findings is that if one is only interested in regions with "large" …
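The sampling-based approximation that the abstract contrasts with (and that the exact algorithm is meant to replace) can be sketched in a few lines: each distinct ReLU activation pattern corresponds to one affine region, so counting unique patterns over a dense grid gives a lower bound on the number of regions on that slice. This is an illustrative baseline, not the paper's parallel enumeration algorithm.

```python
# Sketch of the sampling-based approximation mentioned in the abstract (not the paper's
# exact enumeration algorithm): count distinct ReLU activation patterns of a small MLP
# over a dense 2D grid of inputs; each pattern corresponds to one affine region.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)

xs = np.linspace(-3, 3, 400)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)   # 160k points on a 2D slice

h1 = grid @ W1.T + b1
a1 = h1 > 0                                                    # layer-1 activation pattern
h2 = np.maximum(h1, 0) @ W2.T + b2
a2 = h2 > 0                                                    # layer-2 activation pattern

patterns = np.concatenate([a1, a2], axis=1)
n_regions = len(np.unique(patterns, axis=0))
print(f"~{n_regions} regions found by sampling (a lower bound on the true count)")
```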

An Information Theory Perspective on Variance-Invariance-Covariance Regularization

Authors

Ravid Shwartz-Ziv,Randall Balestriero,Kenji Kawaguchi,Tim GJ Rudner,Yann LeCun

Journal

Advances in Neural Information Processing Systems

Published Date

2024/2/13

Variance-Invariance-Covariance Regularization (VICReg) is a self-supervised learning (SSL) method that has shown promising results on a variety of tasks. However, the fundamental mechanisms underlying VICReg remain unexplored. In this paper, we present an information-theoretic perspective on the VICReg objective. We begin by deriving information-theoretic quantities for deterministic networks as an alternative to unrealistic stochastic network assumptions. We then relate the optimization of the VICReg objective to mutual information optimization, highlighting underlying assumptions and facilitating a constructive comparison with other SSL algorithms, and we derive a generalization bound for VICReg that reveals its inherent advantages for downstream tasks. Building on these results, we introduce a family of SSL methods derived from information-theoretic principles that outperform existing SSL techniques.
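For reference, the VICReg objective analyzed in this paper combines an invariance term between two views with a variance hinge and a covariance (decorrelation) penalty on each view's embeddings. The sketch below is a simplified version; the coefficients and toy embeddings are illustrative, not the reference implementation.

```python
# Minimal sketch of the VICReg objective (simplified; coefficients are illustrative).
import torch

def vicreg_loss(z1, z2, lam=25.0, mu=25.0, nu=1.0, eps=1e-4):
    n, d = z1.shape
    inv = torch.nn.functional.mse_loss(z1, z2)                 # invariance term
    def var_term(z):
        std = torch.sqrt(z.var(dim=0) + eps)
        return torch.relu(1.0 - std).mean()                    # hinge on per-dimension std
    def cov_term(z):
        zc = z - z.mean(dim=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return (off_diag ** 2).sum() / d                       # decorrelation term
    return lam * inv + mu * (var_term(z1) + var_term(z2)) + nu * (cov_term(z1) + cov_term(z2))

z1, z2 = torch.randn(256, 64), torch.randn(256, 64)            # embeddings of two views
print(vicreg_loss(z1, z2).item())
```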

Eyes wide shut? exploring the visual shortcomings of multimodal llms

Authors

Shengbang Tong,Zhuang Liu,Yuexiang Zhai,Yi Ma,Yann LeCun,Saining Xie

Journal

arXiv preprint arXiv:2401.06209

Published Date

2024/1/11

Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify "CLIP-blind pairs": images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. Together, our research suggests visual representation learning remains an open challenge, and accurate visual grounding is crucial for future successful multimodal systems.
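The "CLIP-blind pair" mining step can be sketched as follows: look for image pairs that are nearly identical under CLIP's embedding but clearly separated under a vision-only SSL embedding such as DINO. The embeddings below are random placeholders; in practice they would come from pretrained CLIP and DINO image encoders, and the thresholds are illustrative.

```python
# Illustrative sketch of mining "CLIP-blind pairs": image pairs that are very similar
# under CLIP yet clearly different under a vision-only SSL embedding (e.g. DINO).
# Embeddings here are random placeholders standing in for real encoder outputs.
import torch
import torch.nn.functional as F

n = 1000
clip_emb = F.normalize(torch.randn(n, 512), dim=1)   # placeholder CLIP image embeddings
dino_emb = F.normalize(torch.randn(n, 768), dim=1)   # placeholder DINO image embeddings

clip_sim = clip_emb @ clip_emb.T
dino_sim = dino_emb @ dino_emb.T
mask = ~torch.eye(n, dtype=torch.bool)

# candidate pairs: very similar under CLIP, clearly different under DINO
candidates = (clip_sim > 0.95) & (dino_sim < 0.6) & mask
pairs = candidates.nonzero()
print(f"{len(pairs)} candidate CLIP-blind pairs (with random embeddings, usually 0)")
```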

G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering

Authors

Xiaoxin He,Yijun Tian,Yifei Sun,Nitesh V Chawla,Thomas Laurent,Yann LeCun,Xavier Bresson,Bryan Hooi

Journal

arXiv preprint arXiv:2402.07630

Published Date

2024/2/12

Given a graph with textual attributes, we enable users to `chat with their graph': that is, to ask questions about the graph using a conversational interface. In response to a user's questions, our method provides textual replies and highlights the relevant parts of the graph. While existing works integrate large language models (LLMs) and graph neural networks (GNNs) in various ways, they mostly focus on either conventional graph tasks (such as node, edge, and graph classification), or on answering simple graph queries on small or synthetic graphs. In contrast, we develop a flexible question-answering framework targeting real-world textual graphs, applicable to multiple applications including scene graph understanding, common sense reasoning, and knowledge graph reasoning. Toward this goal, we first develop our Graph Question Answering (GraphQA) benchmark with data collected from different tasks. Then, we propose our G-Retriever approach, which integrates the strengths of GNNs, LLMs, and Retrieval-Augmented Generation (RAG), and can be fine-tuned to enhance graph understanding via soft prompting. To resist hallucination and to allow for textual graphs that greatly exceed the LLM's context window size, G-Retriever performs RAG over a graph by formulating this task as a Prize-Collecting Steiner Tree optimization problem. Empirical evaluations show that our method outperforms baselines on textual graph tasks from multiple domains, scales well with larger graph sizes, and resists hallucination. (Our codes and datasets are available at: https://github.com/XiaoxinHe/G-Retriever.)
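A much-simplified view of the retrieval step is sketched below: embed the question and the node texts, retrieve the most similar nodes, and textualize the induced subgraph as context for the LLM. The paper instead formulates retrieval as a Prize-Collecting Steiner Tree optimization; the graph, embeddings, and top-k heuristic here are invented placeholders.

```python
# Much-simplified sketch of graph retrieval for RAG: retrieve the top-k nodes most
# similar to the question and textualize their induced subgraph as prompt context.
# The paper uses Prize-Collecting Steiner Tree optimization; this is only a toy stand-in.
import networkx as nx
import numpy as np

G = nx.Graph()
G.add_edges_from([("paper_A", "paper_B"), ("paper_B", "paper_C"), ("paper_C", "paper_D")])

rng = np.random.default_rng(0)
node_emb = {n: rng.normal(size=64) for n in G.nodes}     # placeholder node-text embeddings
q_emb = rng.normal(size=64)                              # placeholder question embedding

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

top_nodes = sorted(G.nodes, key=lambda n: cos(q_emb, node_emb[n]), reverse=True)[:2]
subgraph = G.subgraph(top_nodes)
prompt_context = "; ".join(f"{u} -- {v}" for u, v in subgraph.edges) or ", ".join(top_nodes)
print("retrieved context:", prompt_context)
```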

Learning by Reconstruction Produces Uninformative Features For Perception

Authors

Randall Balestriero,Yann LeCun

Journal

arXiv preprint arXiv:2402.11337

Published Date

2024/2/17

Input space reconstruction is an attractive representation learning paradigm. Despite the interpretability of reconstruction and generation, we identify a misalignment between learning by reconstruction and learning for perception. We show that the former allocates a model's capacity towards a subspace of the data explaining the observed variance, a subspace whose features are uninformative for the latter. For example, the supervised TinyImageNet task with images projected onto the top subspace explaining 90% of the pixel variance can be solved with 45% test accuracy. Using the bottom subspace instead, accounting for only 20% of the pixel variance, reaches 55% test accuracy. Because the features useful for perception are learned last, long training times are needed, e.g., with Masked Autoencoders. Learning by denoising is a popular strategy to alleviate that misalignment. We prove that while some noise strategies, such as masking, are indeed beneficial, others, such as additive Gaussian noise, are not. Yet, even in the case of masking, we find that the benefits vary as a function of the mask's shape, ratio, and the considered dataset. While tuning the noise strategy without knowledge of the perception task seems challenging, we provide initial clues on how to detect whether a noise strategy is never beneficial regardless of the perception task.
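The top-versus-bottom-subspace experiment can be reproduced in miniature: project images onto the leading PCA components (most pixel variance) versus trailing components, then fit a linear classifier on each projection. The sketch below uses the small scikit-learn digits dataset rather than TinyImageNet, so the numbers will differ from the paper; the point is only that the highest-variance directions need not be the most discriminative.

```python
# Toy version of the abstract's experiment: compare a linear classifier trained on the
# top PCA subspace of the pixels against one trained on the bottom subspace.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

pca = PCA(n_components=64).fit(Xtr)
Z_tr, Z_te = pca.transform(Xtr), pca.transform(Xte)

for name, sl in [("top 8 components", slice(0, 8)), ("bottom 32 components", slice(32, 64))]:
    clf = LogisticRegression(max_iter=2000).fit(Z_tr[:, sl], ytr)
    print(name, "test accuracy:", round(clf.score(Z_te[:, sl], yte), 3))
```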

To Compress or Not to Compress—Self-Supervised Learning and Information Theory: A Review

Authors

Ravid Shwartz Ziv,Yann LeCun

Published Date

2024/3/12

Deep neural networks excel in supervised learning tasks but are constrained by the need for extensive labeled data. Self-supervised learning emerges as a promising alternative, allowing models to learn without explicit labels. Information theory, and in particular the information bottleneck principle, has shaped the development of deep neural networks. This principle optimizes the trade-off between compression and preserving relevant information, providing a foundation for efficient network design in supervised contexts. However, its precise role and adaptation in self-supervised learning remain unclear. In this work, we scrutinize various self-supervised learning approaches from an information-theoretic perspective, introducing a unified framework that encapsulates the self-supervised information-theoretic learning problem. This framework includes multiple encoders and decoders, suggesting that all existing self-supervised learning methods can be seen as specific instances of it. We aim to unify these approaches to better understand their underlying principles and address the main challenge: many works present different frameworks with differing theories that may seem contradictory. By weaving existing research into a cohesive narrative, we delve into contemporary self-supervised methodologies, spotlight potential research areas, and highlight inherent challenges. Moreover, we discuss how to estimate information-theoretic quantities and their associated empirical problems. Overall, this paper provides a comprehensive review of the intersection of information theory, self-supervised learning, and deep neural networks, aiming for a better understanding through our …

Learning With Fewer Labels in Computer Vision

Authors

Li Liu,Timothy Hospedales,Yann LeCun,Mingsheng Long,Jiebo Luo,Wanli Ouyang,Matti Pietikäinen,Tinne Tuytelaars

Journal

IEEE Transactions on Pattern Analysis and Machine Intelligence

Published Date

2024/2/6

Undoubtedly, Deep Neural Networks (DNNs), from AlexNet to ResNet to Transformer, have sparked revolutionary advancements in diverse computer vision tasks. The scale of DNNs has grown exponentially due to the rapid development of computational resources. Despite the tremendous success, DNNs typically depend on massive amounts of training data (especially the recent various foundation models) to achieve high performance and are brittle in that their performance can degrade severely with small changes in their operating environment. Generally, collecting massive-scale training datasets is costly or even infeasible, as for certain fields, only very limited or no examples at all can be gathered. Nevertheless, collecting, labeling, and vetting massive amounts of practical training data is certainly difficult and expensive, as it requires the painstaking efforts of experienced human annotators or experts, and in …

Revisiting feature prediction for learning visual representations from video

Authors

Adrien Bardes,Quentin Garrido,Jean Ponce,Xinlei Chen,Michael Rabbat,Yann LeCun,Mahmoud Assran,Nicolas Ballas

Journal

arXiv preprint arXiv:2404.08471

Published Date

2024/2/15

This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion- and appearance-based tasks, without adaptation of the model's parameters, e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.

Predicting masked tokens in stochastic locations improves masked image modeling

Authors

Amir Bar,Florian Bordes,Assaf Shocher,Mahmoud Assran,Pascal Vincent,Nicolas Ballas,Trevor Darrell,Amir Globerson,Yann LeCun

Journal

arXiv preprint arXiv:2308.00566

Published Date

2023/7/31

Self-supervised learning is a promising paradigm in deep learning that enables learning from unlabeled data by constructing pretext tasks that require learning useful representations. In natural language processing, the dominant pretext task has been masked language modeling (MLM), while in computer vision there exists an equivalent called Masked Image Modeling (MIM). However, MIM is challenging because it requires predicting semantic content at accurate locations. E.g., given an incomplete picture of a dog, we can guess that there is a tail, but we cannot determine its exact location. In this work, we propose FlexPredict, a stochastic model that addresses this challenge by incorporating location uncertainty into the model. Specifically, we condition the model on stochastic masked token positions to guide the model toward learning features that are more robust to location uncertainties. Our approach improves downstream performance on a range of tasks; e.g., compared to MIM baselines, FlexPredict boosts ImageNet linear probing by 1.6% with ViT-B and by 2.5% for semi-supervised video segmentation using ViT-L.

URLOST: Unsupervised Representation Learning without Stationarity or Topology

Authors

Zeyu Yun,Juexiao Zhang,Bruno Olshausen,Yann LeCun,Yubei Chen

Journal

arXiv preprint arXiv:2310.04496

Published Date

2023/10/6

Unsupervised representation learning has seen tremendous progress but is constrained by its reliance on data modality-specific stationarity and topology, a limitation not found in biological intelligence systems. For instance, human vision processes visual signals derived from irregular and non-stationary sampling lattices yet accurately perceives the geometry of the world. We introduce a novel framework that learns from high-dimensional data lacking stationarity and topology. Our model combines a learnable self-organizing layer, density adjusted spectral clustering, and masked autoencoders. We evaluate its effectiveness on simulated biological vision data, neural recordings from the primary visual cortex, and gene expression datasets. Compared to state-of-the-art unsupervised learning methods like SimCLR and MAE, our model excels at learning meaningful representations across diverse modalities without depending on stationarity or topology. It also outperforms other methods not dependent on these factors, setting a new benchmark in the field. This work represents a step toward unsupervised learning methods that can generalize across diverse high-dimensional data modalities.

Variance-Covariance Regularization Improves Representation Learning

Authors

Jiachen Zhu,Katrina Evtimova,Yubei Chen,Ravid Shwartz-Ziv,Yann LeCun

Journal

arXiv preprint arXiv:2306.13292

Published Date

2024

Transfer learning has emerged as a key approach in the machine learning domain, enabling the application of knowledge derived from one domain to improve performance on subsequent tasks. Given the often limited information about these subsequent tasks, a strong transfer learning approach calls for the model to capture a diverse range of features during the initial pretraining stage. However, recent research suggests that, without sufficient regularization, the network tends to concentrate on features that primarily reduce the pretraining loss function. This tendency can result in inadequate feature learning and impaired generalization capability for target tasks. To address this issue, we propose Variance-Covariance Regularization (VCR), a regularization technique aimed at fostering diversity in the learned network features. Drawing inspiration from recent advancements in the self-supervised learning approach, our approach promotes learned representations that exhibit high variance and minimal covariance, thus preventing the network from focusing solely on loss-reducing features. We empirically validate the efficacy of our method through comprehensive experiments coupled with in-depth analytical studies on the learned representations. In addition, we develop an efficient implementation strategy that assures minimal computational overhead associated with our method. Our results indicate that VCR is a powerful and efficient method for enhancing transfer learning performance for both supervised learning and self-supervised learning, opening new possibilities for future research in this domain.

Compact and optimal deep learning with recurrent parameter generators

Authors

Jiayun Wang,Yubei Chen,Stella X Yu,Brian Cheung,Yann LeCun

Published Date

2023

Deep learning has achieved tremendous success by training increasingly large models, which are then compressed for practical deployment. We propose a drastically different approach to compact and optimal deep learning: we decouple the degrees of freedom (DoF) from the actual number of parameters of a model and optimize a small DoF, under predefined random linear constraints, for a large model of arbitrary architecture, in one-stage end-to-end learning. Specifically, we create a recurrent parameter generator (RPG), which repeatedly fetches parameters from a ring and unpacks them onto a large model with random permutation and sign flipping to promote parameter decorrelation. We show that gradient descent can automatically find the best model under the constraints, in fact with faster convergence. Our extensive experimentation reveals a log-linear relationship between model DoF and accuracy. Our RPG demonstrates remarkable DoF reduction and can be further pruned and quantized for additional run-time performance gain. For example, in terms of top-1 accuracy on ImageNet, RPG achieves 96% of ResNet18's performance with only 18% of the DoF (the equivalent of one convolutional layer) and 52% of ResNet34's performance with only 0.25% of the DoF! Our work shows the significant potential of constrained neural optimization in compact and optimal deep learning.
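The ring-sharing mechanism can be illustrated with a toy stack of linear layers that all draw their weights from one shared parameter vector through fixed random index maps and sign flips, so the trainable parameter count is set by the ring size rather than the model size. Names and dimensions below are invented; this is a sketch of the idea, not the authors' RPG implementation.

```python
# Minimal sketch of the recurrent-parameter-generator idea: every layer's weight matrix is
# assembled from one shared parameter "ring" via fixed random indexing and sign flips.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPGLinearStack(nn.Module):
    def __init__(self, dim=64, n_layers=6, ring_size=2000):
        super().__init__()
        self.ring = nn.Parameter(torch.randn(ring_size) * 0.05)   # the only trainable tensor
        self.dim, self.n_layers = dim, n_layers
        g = torch.Generator().manual_seed(0)
        for i in range(n_layers):
            # fixed random mapping into the ring, plus random sign flips, per layer
            self.register_buffer(f"idx_{i}", torch.randint(ring_size, (dim * dim,), generator=g))
            self.register_buffer(f"sign_{i}", torch.randint(0, 2, (dim * dim,), generator=g).float() * 2 - 1)

    def forward(self, x):
        for i in range(self.n_layers):
            w = (self.ring[getattr(self, f"idx_{i}")] * getattr(self, f"sign_{i}")).view(self.dim, self.dim)
            x = F.relu(F.linear(x, w))
        return x

model = RPGLinearStack()
print("trainable params:", sum(p.numel() for p in model.parameters()))  # 2000, not 6*64*64
print(model(torch.randn(8, 64)).shape)
```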

Blockwise self-supervised learning at scale

Authors

Shoaib Ahmed Siddiqui,David Krueger,Yann LeCun,Stéphane Deny

Journal

arXiv preprint arXiv:2302.01647

Published Date

2023/2/3

Current state-of-the-art deep networks are all powered by backpropagation. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. We show that a blockwise pretraining procedure consisting of training independently the 4 main blocks of layers of a ResNet-50 with Barlow Twins' loss function at each block performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48%, only 1.1% below the accuracy of an end-to-end pretrained network (71.57% accuracy). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.
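The blockwise recipe can be sketched schematically: each block gets its own projection head and local SSL loss, and gradients are stopped between blocks with detach(), so no error signal propagates end to end. The blocks, views, and simplified Barlow Twins-style loss below are toy stand-ins, not the ResNet-50 setup used in the paper.

```python
# Schematic sketch of blockwise training with a local SSL loss per block and
# stop-gradients between blocks (illustrative toy features, not the paper's setup).
import torch
import torch.nn as nn

def barlow_loss(z1, z2, lam=5e-3):
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / z1.shape[0]                        # cross-correlation matrix
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()
    return on_diag + lam * off_diag

blocks = nn.ModuleList([nn.Sequential(nn.Linear(128, 128), nn.ReLU()) for _ in range(4)])
heads = nn.ModuleList([nn.Linear(128, 64) for _ in range(4)])
opt = torch.optim.SGD(list(blocks.parameters()) + list(heads.parameters()), lr=0.1)

v1, v2 = torch.randn(256, 128), torch.randn(256, 128)   # two augmented views (toy features)
total = 0.0
for block, head in zip(blocks, heads):
    v1, v2 = block(v1), block(v2)
    total = total + barlow_loss(head(v1), head(v2))      # local loss for this block only
    v1, v2 = v1.detach(), v2.detach()                    # no gradient flows to earlier blocks
opt.zero_grad(); total.backward(); opt.step()
print("sum of local losses:", total.item())
```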

The ssl interplay: Augmentations, inductive bias, and generalization

Authors

Vivien Cabannes,Bobak T Kiani,Randall Balestriero,Yann LeCun,Alberto Bietti

Published Date

2023/2/6

Self-supervised learning (SSL) has emerged as a powerful framework to learn representations from raw data without supervision. Yet in practice, engineers face issues such as instability in tuning optimizers and collapse of representations during training. Such challenges motivate the need for a theory to shed light on the complex interplay between the choice of data augmentation, network architecture, and training algorithm, and the resulting performance on downstream tasks. We study such an interplay with a precise analysis of generalization performance on both pretraining and downstream tasks in kernel regimes, and highlight several insights for SSL practitioners that arise from our theory.

Just How Flexible are Neural Networks in Practice?

Authors

Ravid Shwartz-Ziv,Micah Goldblum,Arpit Bansal,C Bayan Bruss,Yann LeCun,Andrew Gordon Wilson

Published Date

2023/10/13

It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters, underpinning notions of overparameterized and underparameterized models. In practice, however, we only find solutions accessible via our training procedure, including the optimizer and regularizers, limiting flexibility. Moreover, the exact parameterization of the function class, built into an architecture, shapes its loss surface and impacts the minima we find. In this work, we examine the ability of neural networks to fit data in practice. Our findings indicate that: (1) standard optimizers find minima where the model can only fit training sets with significantly fewer samples than it has parameters; (2) convolutional networks are more parameter-efficient than MLPs and ViTs, even on randomly labeled data; (3) whereas stochastic training is thought to have a regularizing effect, SGD actually finds minima that fit more training data than full-batch gradient descent; (4) the difference in capacity to fit correctly labeled and incorrectly labeled samples predicts generalization; (5) ReLU activation functions enable fitting more data despite being designed to avoid vanishing and exploding gradients in deep architectures.

POLICE: Provably optimal linear constraint enforcement for deep neural networks

Authors

Randall Balestriero,Yann LeCun

Published Date

2023/6/4

Deep Neural Networks (DNNs) outshine alternative function approximators in many settings thanks to their modularity in composing any desired differentiable operator. The formed parametrized functional is then tuned to solve a task at hand from simple gradient descent. This modularity comes at the cost of making strict enforcement of constraints on DNNs, e.g., from a priori knowledge of the task or from desired physical properties, an open challenge. In this paper we propose the first provable affine constraint enforcement method for DNNs that requires only minimal changes to a given DNN's forward pass, that is computationally friendly, and that leaves the optimization of the DNN's parameters unconstrained, i.e., standard gradient-based methods can be employed. Our method does not require any sampling and provably ensures that the DNN fulfills the affine constraint on a given input space's region at any …

Active self-supervised learning: A few low-cost relationships are all you need

Authors

Vivien Cabannes,Leon Bottou,Yann Lecun,Randall Balestriero

Published Date

2023

Self-Supervised Learning (SSL) has emerged as the solution of choice to learn transferable representations from unlabeled data. However, SSL requires building samples that are known to be semantically akin, i.e., positive views. Requiring such knowledge is the main limitation of SSL and is often tackled by ad-hoc strategies, e.g., applying known data augmentations to the same input. In this work, we generalize and formalize this principle through Positive Active Learning (PAL), where an oracle queries semantic relationships between samples. PAL achieves three main objectives. First, it is a theoretically grounded learning framework that encapsulates standard SSL but also supervised and semi-supervised learning, depending on the employed oracle. Second, it provides a consistent algorithm to embed a priori knowledge, e.g., some observed labels, into any SSL loss without any change in the training pipeline. Third, it provides a proper active learning framework yielding low-cost solutions to annotate datasets, arguably bridging the gap between the theory and practice of active learning, based on queries of semantic relationships between inputs that are simple for non-experts to answer.

Reverse engineering self-supervised learning

Authors

Ido Ben-Shaul,Ravid Shwartz-Ziv,Tomer Galanti,Shai Dekel,Yann LeCun

Journal

Advances in Neural Information Processing Systems

Published Date

2023/12/15

Understanding the learned representation and underlying mechanisms of Self-Supervised Learning (SSL) often poses a challenge. In this paper, we 'reverse engineer' SSL, conducting an in-depth empirical analysis of its learned internal representations, encompassing diverse models, architectures, and hyperparameters. Our study reveals an intriguing process within the SSL training: an inherent facilitation of semantic label-based clustering, which is surprisingly driven by the regularization component of the SSL objective. This clustering not only enhances downstream classification, but also compresses the information. We further illustrate that the alignment of the SSL-trained representation is more pronounced with semantic classes than with random functions. Remarkably, the learned representations align with semantic classes across various hierarchical levels, with this alignment intensifying when going deeper into the network. This 'reverse engineering' approach provides valuable insights into the inner mechanisms of SSL and their influence on performance across different class sets.

Augmented language models: a survey

Authors

Grégoire Mialon,Roberto Dessì,Maria Lomeli,Christoforos Nalmpantis,Ram Pasunuru,Roberta Raileanu,Baptiste Rozière,Timo Schick,Jane Dwivedi-Yu,Asli Celikyilmaz,Edouard Grave,Yann LeCun,Thomas Scialom

Journal

arXiv preprint arXiv:2302.07842

Published Date

2023/2/15

This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools. The former is defined as decomposing a potentially complex task into simpler subtasks, while the latter consists of calling external modules such as a code interpreter. LMs can leverage these augmentations separately or in combination via heuristics, or learn to do so from demonstrations. While adhering to a standard missing-token prediction objective, such augmented LMs can use various, possibly non-parametric, external modules to expand their context-processing ability, thus departing from the pure language modeling paradigm. We therefore refer to them as Augmented Language Models (ALMs). The missing-token objective allows ALMs to learn to reason, use tools, and even act, while still performing standard natural language tasks and even outperforming most regular LMs on several benchmarks. In this work, after reviewing current advances in ALMs, we conclude that this new research direction has the potential to address common limitations of traditional LMs such as interpretability, consistency, and scalability issues.

Language, common sense, and the Winograd schema challenge

Authors

Jacob Browning,Yann LeCun

Published Date

2023/10/6

Since the 1950s, philosophers and AI researchers have held that disambiguating natural language sentences depended on common sense. In 2011, the Winograd Schema Challenge was established to evaluate the common-sense reasoning abilities of a machine by testing its ability to disambiguate sentences. The designers argued that only a system capable of "thinking in the full-bodied sense" would be able to pass the test. However, by 2021, the original authors conceded that the test had been soundly defeated by large language models, which still seem to lack common sense or full-bodied thinking. In this paper, we argue that disambiguating sentences only seemed like a good test of common sense based on a certain picture of the relationship between linguistic comprehension and semantic knowledge—one typically associated with the early computational theory of mind and Symbolic AI. If this picture is rejected, as …

An information-theoretic perspective on variance-invariance-covariance regularization

Authors

Ravid Shwartz-Ziv,Randall Balestriero,Kenji Kawaguchi,Tim GJ Rudner,Yann LeCun

Published Date

2023/12

In this paper, we provide an information-theoretic perspective on Variance-Invariance-Covariance Regularization (VICReg) for self-supervised learning. To do so, we first demonstrate how information-theoretic quantities can be obtained for deterministic networks as an alternative to the commonly used unrealistic stochastic networks assumption. Next, we relate the VICReg objective to mutual information maximization and use it to highlight the underlying assumptions of the objective. Based on this relationship, we derive a generalization bound for VICReg, providing generalization guarantees for downstream supervised learning tasks and present new self-supervised learning methods, derived from a mutual information maximization objective, that outperform existing methods in terms of performance. This work provides a new information-theoretic perspective on self-supervised learning and Variance-Invariance-Covariance Regularization in particular and guides the way for improved transfer learning via information-theoretic self-supervised learning objectives.

Emp-ssl: Towards self-supervised learning in one training epoch

Authors

Shengbang Tong,Yubei Chen,Yi Ma,Yann Lecun

Journal

arXiv preprint arXiv:2304.03977

Published Date

2023/4/8

Recently, self-supervised learning (SSL) has achieved tremendous success in learning image representations. Despite the empirical success, most self-supervised learning methods are rather "inefficient" learners, typically taking hundreds of training epochs to fully converge. In this work, we show that the key to efficient self-supervised learning is to increase the number of crops taken from each image instance. Leveraging one of the state-of-the-art SSL methods, we introduce a simple self-supervised learning method called Extreme-Multi-Patch Self-Supervised Learning (EMP-SSL) that does not rely on many of the heuristic techniques used in SSL, such as weight sharing between the branches, feature-wise normalization, output quantization, and stop-gradient, and reduces the training epochs by two orders of magnitude. We show that the proposed method is able to converge to 85.1% on CIFAR-10, 58.5% on CIFAR-100, 38.1% on Tiny ImageNet and 58.5% on ImageNet-100 in just one epoch. Furthermore, the proposed method achieves 91.5% on CIFAR-10, 70.1% on CIFAR-100, 51.5% on Tiny ImageNet and 78.9% on ImageNet-100 with linear probing in less than ten training epochs. In addition, we show that EMP-SSL exhibits significantly better transferability to out-of-domain datasets than baseline SSL methods. We will release the code at https://github.com/tsb0601/EMP-SSL.

Rankme: Assessing the downstream performance of pretrained self-supervised representations by their rank

Authors

Quentin Garrido,Randall Balestriero,Laurent Najman,Yann Lecun

Published Date

2023/7/3

Joint-Embedding Self-Supervised Learning (JE-SSL) has seen rapid development, with the emergence of many method variations but only a few principled guidelines that would help practitioners deploy them successfully. The main reason for that pitfall is JE-SSL's core principle of not employing any input reconstruction, which removes the visual cues of unsuccessful training. Add to that uninformative loss values, and it becomes difficult to deploy SSL on a new dataset for which no labels can help to judge the quality of the learned representation. In this study, we develop a simple unsupervised criterion that is indicative of the quality of the learned JE-SSL representations: their effective rank. Albeit simple and computationally friendly, this method—coined RankMe—allows one to assess the performance of JE-SSL representations, even on different downstream datasets, without requiring any labels. A further benefit of RankMe is that it does not have any training or hyper-parameters to tune. Through thorough empirical experiments involving hundreds of training episodes, we demonstrate how RankMe can be used for hyperparameter selection with nearly no reduction in final performance compared to the current selection methods that involve a dataset's labels. We hope that RankMe will facilitate the deployment of JE-SSL in domains that do not have the opportunity to rely on labels to assess representation quality.
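The effective-rank criterion itself is compact: compute the singular values of the embedding matrix, normalize them into a distribution, and take the exponential of its entropy. The sketch below follows this common formulation; the epsilon handling and toy embeddings are illustrative and may differ in detail from the paper.

```python
# Minimal sketch of an effective (smooth) rank criterion in the spirit of RankMe:
# exponential of the entropy of the normalized singular values of the embedding matrix.
import torch

def effective_rank(z, eps=1e-7):
    s = torch.linalg.svdvals(z)              # singular values of the (N x d) embedding matrix
    p = s / (s.sum() + eps) + eps
    return torch.exp(-(p * p.log()).sum())   # value in [1, min(N, d)]

healthy = torch.randn(2048, 256)                          # spread-out embeddings
collapsed = torch.randn(2048, 4) @ torch.randn(4, 256)    # embeddings collapsed to rank 4
print("effective rank (healthy):", effective_rank(healthy).item())
print("effective rank (collapsed):", effective_rank(collapsed).item())
```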

An Information-Theoretic Understanding of Maximum Manifold Capacity Representations

Authors

Berivan Isik,Victor Lecomte,Rylan Schaeffer,Yann LeCun,Mikail Khona,Ravid Shwartz-Ziv,Sanmi Koyejo,Andrey Gromov

Published Date

2023/12/18

Maximum Manifold Capacity Representations (MMCR) is a recent multi-view self-supervised learning (MVSSL) method that matches or surpasses other leading MVSSL methods. MMCR is interesting for at least two reasons. Firstly, MMCR is an oddity in the zoo of MVSSL methods: it is not (explicitly) contrastive, applies no masking, performs no clustering, leverages no distillation, and does not (explicitly) reduce redundancy. Secondly, while many self-supervised learning (SSL) methods originate in information theory, MMCR distinguishes itself by claiming a different origin: a statistical mechanical characterization of the geometry of linear separability of data manifolds. However, given the rich connections between statistical mechanics and information theory, and given recent work showing how many SSL methods can be understood from an information-theoretic perspective, we conjecture that MMCR can be similarly understood from an information-theoretic perspective. In this paper, we leverage tools from high dimensional probability and information theory to demonstrate that an optimal solution to MMCR's nuclear norm-based objective function is the same optimal solution that maximizes a well-known lower bound on mutual information.

Mc-jepa: A joint-embedding predictive architecture for self-supervised learning of motion and content features

Authors

Adrien Bardes,Jean Ponce,Yann LeCun

Journal

arXiv preprint arXiv:2307.12698

Published Date

2023/7/24

Self-supervised learning of visual representations has been focusing on learning content features, which do not capture object motion or location and instead focus on identifying and differentiating objects in images and videos. On the other hand, optical flow estimation is a task that does not involve understanding the content of the images on which it is estimated. We unify the two approaches and introduce MC-JEPA, a joint-embedding predictive architecture and self-supervised learning approach that jointly learns optical flow and content features within a shared encoder, demonstrating that the two associated objectives, the optical flow estimation objective and the self-supervised learning objective, benefit from each other and thus learn content features that incorporate motion information. The proposed approach achieves performance on par with existing unsupervised optical flow benchmarks, as well as with common self-supervised learning approaches on downstream tasks such as semantic segmentation of images and videos.

Catalyzing next-generation artificial intelligence through neuroai

Authors

Anthony Zador,Sean Escola,Blake Richards,Bence Ölveczky,Yoshua Bengio,Kwabena Boahen,Matthew Botvinick,Dmitri Chklovskii,Anne Churchland,Claudia Clopath,James DiCarlo,Surya Ganguli,Jeff Hawkins,Konrad Körding,Alexei Koulakov,Yann LeCun,Timothy Lillicrap,Adam Marblestone,Bruno Olshausen,Alexandre Pouget,Cristina Savin,Terrence Sejnowski,Eero Simoncelli,Sara Solla,David Sussillo,Andreas S Tolias,Doris Tsao

Published Date

2023/3/22

Neuroscience has long been an essential driver of progress in artificial intelligence (AI). We propose that to accelerate progress in AI, we must invest in fundamental research in NeuroAI. A core component of this is the embodied Turing test, which challenges AI animal models to interact with the sensorimotor world at skill levels akin to their living counterparts. The embodied Turing test shifts the focus from those capabilities like game playing and language that are especially well-developed or uniquely human to those capabilities – inherited from over 500 million years of evolution – that are shared with all animals. Building models that can pass the embodied Turing test will provide a roadmap for the next generation of AI.

Harnessing explanations: Llm-to-lm interpreter for enhanced text-attributed graph representation learning

Authors

Xiaoxin He,Xavier Bresson,Thomas Laurent,Adam Perold,Yann LeCun,Bryan Hooi

Published Date

2023/10/13

Representation learning on text-attributed graphs (TAGs) has become a critical research problem in recent years. A typical example of a TAG is a paper citation graph, where the text of each paper serves as node attributes. Initial graph neural network (GNN) pipelines handled these text attributes by transforming them into shallow or hand-crafted features, such as skip-gram or bag-of-words features. Recent efforts have focused on enhancing these pipelines with language models (LMs), which typically demand intricate designs and substantial computational resources. With the advent of powerful large language models (LLMs) such as GPT or Llama2, which demonstrate an ability to reason and to utilize general knowledge, there is a growing need for techniques which combine the textual modelling abilities of LLMs with the structural learning capabilities of GNNs. Hence, in this work, we focus on leveraging LLMs to capture textual information as features, which can be used to boost GNN performance on downstream tasks. A key innovation is our use of \emph{explanations as features}: we prompt an LLM to perform zero-shot classification, request textual explanations for its decision-making process, and design an \emph{LLM-to-LM interpreter} to translate these explanations into informative features that enhance downstream GNNs. Our experiments demonstrate that our method achieves state-of-the-art results on well-established TAG datasets, including \texttt{Cora}, \texttt{PubMed}, \texttt{ogbn-arxiv}, as well as our newly introduced dataset, \texttt{arXiv-2023}. Furthermore, our method significantly speeds up training, achieving a 2.88 times …

V-JEPA: Latent Video Prediction for Visual Representation Learning

Authors

Adrien Bardes,Quentin Garrido,Jean Ponce,Xinlei Chen,Michael Rabbat,Yann LeCun,Mido Assran,Nicolas Ballas

Published Date

2023/10/13

This paper shows that the masked-modelling principle driving the success of large foundational language models can be effectively applied to video by making predictions in latent space. We introduce V-JEPA, a method for self-supervised learning from video that predicts masked spatio-temporal regions in a learned representation space. Our latent video prediction strategy produces visual features that can be applied to various downstream image and video tasks without adaption of the model's parameters (using only frozen evaluation), achieving 82.1% on Kinetics-400 and 71.2% on Something-Something-v2, surpassing the previous best video models by +4 and +10 points respectively. We also demonstrate the benefit of video pretraining compared to image pretraining for tasks involving motion understanding, where V-JEPA outperforms the largest state-of-the-art image models, DINOv2 and OpenCLIP. Finally, V-JEPA trained only on video achieves 77.9% on ImageNet classification without any image fine-tuning, surpassing the previous best video model by +6 points top-1.

A cookbook of self-supervised learning

Authors

Randall Balestriero,Mark Ibrahim,Vlad Sobal,Ari Morcos,Shashank Shekhar,Tom Goldstein,Florian Bordes,Adrien Bardes,Gregoire Mialon,Yuandong Tian,Avi Schwarzschild,Andrew Gordon Wilson,Jonas Geiping,Quentin Garrido,Pierre Fernandez,Amir Bar,Hamed Pirsiavash,Yann LeCun,Micah Goldblum

Journal

arXiv preprint arXiv:2304.12210

Published Date

2023/4/24

Self-supervised learning, dubbed the dark matter of intelligence, is a promising path to advance machine learning. Yet, much like cooking, training SSL methods is a delicate art with a high barrier to entry. While many components are familiar, successfully training a SSL method involves a dizzying set of choices from the pretext tasks to training hyper-parameters. Our goal is to lower the barrier to entry into SSL research by laying the foundations and latest SSL recipes in the style of a cookbook. We hope to empower the curious researcher to navigate the terrain of methods, understand the role of the various knobs, and gain the know-how required to explore how delicious SSL can be.

Self-supervised learning with lie symmetries for partial differential equations

Authors

Grégoire Mialon*,Quentin Garrido*,Hannah Lawrence,Danyal Rehman,Yann LeCun,Bobak Kiani

Published Date

2024/2/14

Machine learning for differential equations paves the way for computationally efficient alternatives to numerical solvers, with potentially broad impacts in science and engineering. Though current algorithms typically require simulated training data tailored to a given setting, one may instead wish to learn useful information from heterogeneous sources, or from real dynamical systems observations that are messy or incomplete. In this work, we learn general-purpose representations of PDEs from heterogeneous data by implementing joint embedding methods for self-supervised learning (SSL), a framework for unsupervised representation learning that has had notable success in computer vision. Our representation outperforms baseline approaches to invariant tasks, such as regressing the coefficients of a PDE, while also improving the time-stepping performance of neural solvers. We hope that our proposed methodology will prove useful in the eventual development of general-purpose foundation models for PDEs.

Gaia: a benchmark for general ai assistants

Authors

Grégoire Mialon,Clémentine Fourrier,Craig Swift,Thomas Wolf,Yann LeCun,Thomas Scialom

Journal

arXiv preprint arXiv:2311.12983

Published Date

2023/11/21

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in, e.g., law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks, which suggests targeting tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit robustness similar to that of the average human on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining the answers to 300 of them to power a leaderboard available at https://huggingface.co/gaia-benchmark.

Adapting Grounded Visual Question Answering Models to Low Resource Languages

Authors

Ying Wang,Jonas Pfeiffer,Nicolas Carion,Yann LeCun,Aishwarya Kamath

Published Date

2023

While huge progress has been made on a variety of vision and language tasks in recent years, most major advances have been restricted to the English language due to the scarcity of relevant training and evaluation datasets in other languages. A popular approach to address this gap has been to utilize machine-translated multi-modal datasets or multi-lingual text-only datasets for pre-training. This approach not only fails to exploit existing pre-trained state-of-the-art English multi-modal models, but is also not a viable solution for low-resource languages where translation quality is not reliable. Therefore, we propose xMDETR, a multi-lingual grounded vision-language model based on the state-of-the-art model MDETR, adapted to new languages without machine-translated data while keeping most of the pre-trained weights frozen. xMDETR leverages mono-lingual pre-trained MDETR to achieve results competitive with the state of the art on xGQA, a standard multilingual VQA benchmark. It is also interpretable, providing bounding boxes for key phrases in the multi-lingual questions. Our method utilizes several architectural as well as data-driven techniques, such as training a new embedding space with a Masked Language Modeling (MLM) objective, code-switching, and adapters for efficient and modular training. We also explore contrastive losses to enforce the bridging of multi-modal and multi-lingual representations on multi-lingual multi-modal data, when available. We evaluate xMDETR on xGQA in both zero-shot and few-shot settings, improving results on Portuguese, Indonesian and Bengali, while remaining competitive on other …

Self-supervised learning of split invariant equivariant representations

Authors

Quentin Garrido,Laurent Najman,Yann Lecun

Journal

arXiv preprint arXiv:2302.10283

Published Date

2023/2/14

Recent progress has been made towards learning invariant or equivariant representations with self-supervised learning. While invariant methods are evaluated on large scale datasets, equivariant ones are evaluated in smaller, more controlled, settings. We aim at bridging the gap between the two in order to learn more diverse representations that are suitable for a wide range of tasks. We start by introducing a dataset called 3DIEBench, consisting of renderings from 3D models over 55 classes and more than 2.5 million images where we have full control on the transformations applied to the objects. We further introduce a predictor architecture based on hypernetworks to learn equivariant representations with no possible collapse to invariance. We introduce SIE (Split Invariant-Equivariant) which combines the hypernetwork-based predictor with representations split in two parts, one invariant, the other equivariant, to learn richer representations. We demonstrate significant performance gains over existing methods on equivariance related tasks from both a qualitative and quantitative point of view. We further analyze our introduced predictor and show how it steers the learned latent space. We hope that both our introduced dataset and approach will enable learning richer representations without supervision in more complex scenarios. Code and data are available at https://github.com/facebookresearch/SIE.

Self-supervised learning from images with a joint-embedding predictive architecture

Authors

Mahmoud Assran,Quentin Duval,Ishan Misra,Piotr Bojanowski,Pascal Vincent,Michael Rabbat,Yann LeCun,Nicolas Ballas

Published Date

2023

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
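The core I-JEPA prediction task can be caricatured in a few lines: encode a context block, encode a disjoint target block with a stop-gradient branch, and train a predictor to regress the target representations given the context and the targets' positions. The sketch below uses toy patch embeddings, a pooled context, and a shared encoder in place of the paper's Vision Transformers, EMA target encoder, and multi-block masking; all names and dimensions are illustrative.

```python
# Highly simplified I-JEPA-style sketch: predict target-patch representations from a
# disjoint context block, in representation space (not the authors' architecture).
import torch
import torch.nn as nn

n_patches, dim = 64, 128
encoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
predictor = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
pos_emb = nn.Parameter(torch.randn(n_patches, dim) * 0.02)

tokens = torch.randn(8, n_patches, dim)                  # toy patch embeddings, batch of 8
target_idx = torch.arange(40, 48)                        # one contiguous "target block"
context_idx = torch.arange(0, 32)                        # context block, disjoint from targets

ctx = encoder(tokens[:, context_idx] + pos_emb[context_idx]).mean(dim=1)  # pooled context
with torch.no_grad():                                    # stop-gradient target branch
    tgt = encoder(tokens[:, target_idx] + pos_emb[target_idx])

# predict each target token from the pooled context plus the target's position embedding
query = pos_emb[target_idx].expand(8, -1, -1)
pred = predictor(torch.cat([ctx.unsqueeze(1).expand(-1, len(target_idx), -1), query], dim=-1))
loss = nn.functional.mse_loss(pred, tgt)
print(loss.item())
```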

Self-Supervised Learning for Understanding Text, Imaging, Equations and Everything Else

Authors

Yann LeCun

Journal

Video Memorie della Societa Astronomica Italiana

Published Date

2023/12

No abstract available; NASA/ADS record 2023VMSAI...4....1L (keyword: Astroinformatics).

Introduction to latent variable energy-based models: A path towards autonomous machine intelligence

Authors

Anna Dawid,Yann LeCun

Journal

arXiv preprint arXiv:2306.02572

Published Date

2023/6/5

Current automated systems have crucial limitations that need to be addressed before artificial intelligence can reach human-like levels and bring new technological revolutions. Among others, our societies still lack Level 5 self-driving cars, domestic robots, and virtual assistants that learn reliable world models, reason, and plan complex action sequences. In these notes, we summarize the main ideas behind the architecture of autonomous intelligence of the future proposed by Yann LeCun. In particular, we introduce energy-based and latent variable models and combine their advantages in the building block of LeCun's proposal, that is, in the hierarchical joint embedding predictive architecture (H-JEPA).
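The latent-variable energy-based view summarized in these notes can be illustrated with a toy model: define an energy over an input, an output, and a latent, and perform inference by minimizing the energy over the latent with a few gradient steps. The decoder, dimensions, and optimizer settings below are invented for illustration; this is a sketch of the general EBM recipe, not the H-JEPA architecture itself.

```python
# Toy latent-variable energy-based model: E(x, y, z) = ||y - Dec(x, z)||^2, with
# inference done by gradient descent over the latent z (illustrative only).
import torch
import torch.nn as nn

decoder = nn.Sequential(nn.Linear(4 + 2, 32), nn.Tanh(), nn.Linear(32, 4))  # Dec(x, z)

def energy(x, y, z):
    return ((y - decoder(torch.cat([x, z], dim=-1))) ** 2).sum(dim=-1)

def infer_z(x, y, steps=50, lr=0.1):
    z = torch.zeros(x.shape[0], 2, requires_grad=True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):                     # gradient-based inference over the latent
        opt.zero_grad()
        energy(x, y, z).sum().backward()
        opt.step()
    return z.detach()

x, y = torch.randn(16, 4), torch.randn(16, 4)
z_star = infer_z(x, y)
print("free energy F(x, y) = min_z E(x, y, z):", energy(x, y, z_star).mean().item())
```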

A generalization of vit/mlp-mixer to graphs

Authors

Xiaoxin He,Bryan Hooi,Thomas Laurent,Adam Perold,Yann LeCun,Xavier Bresson

Published Date

2023/7/3

Graph Neural Networks (GNNs) have shown great potential in the field of graph representation learning. Standard GNNs define a local message-passing mechanism which propagates information over the whole graph domain by stacking multiple layers. This paradigm suffers from two major limitations, over-squashing and poor long-range dependencies, that can be solved using global attention but significantly increases the computational cost to quadratic complexity. In this work, we propose an alternative approach to overcome these structural limitations by leveraging the ViT/MLP-Mixer architectures introduced in computer vision. We introduce a new class of GNNs, called Graph ViT/MLP-Mixer, that holds three key properties. First, they capture long-range dependency and mitigate the issue of over-squashing as demonstrated on Long Range Graph Benchmark and TreeNeighbourMatch datasets. Second, they offer better speed and memory efficiency with a complexity linear to the number of nodes and edges, surpassing the related Graph Transformer and expressive GNN models. Third, they show high expressivity in terms of graph isomorphism as they can distinguish at least 3-WL non-isomorphic graphs. We test our architecture on 4 simulated datasets and 7 real-world benchmarks, and show highly competitive results on all of them. The source code is available for reproducibility at: https://github.com/XiaoxinHe/Graph-ViT-MLPMixer.

Gradient-based Planning with World Models

Authors

Jyothir SV,Siddhartha Jalagam,Yann LeCun,Vlad Sobal

Journal

arXiv preprint arXiv:2312.17227

Published Date

2023/12/28

The enduring challenge in the field of artificial intelligence has been the control of systems to achieve desired behaviours. While for systems governed by straightforward dynamics equations, methods like Linear Quadratic Regulation (LQR) have historically proven highly effective, most real-world tasks, which require a general problem-solver, demand world models with dynamics that cannot be easily described by simple equations. Consequently, these models must be learned from data using neural networks. Most model predictive control (MPC) algorithms designed for visual world models have traditionally explored gradient-free population-based optimisation methods, such as Cross Entropy and Model Predictive Path Integral (MPPI), for planning. However, we present an exploration of a gradient-based alternative that fully leverages the differentiability of the world model. In our study, we conduct a comparative analysis between our method and other MPC-based alternatives, as well as policy-based algorithms. In a sample-efficient setting, our method achieves performance on par with or superior to the alternative approaches in most tasks. Additionally, we introduce a hybrid model that combines policy networks and gradient-based MPC, which outperforms pure policy-based methods, thereby holding promise for gradient-based planning with world models in complex real-world tasks.
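The gradient-based planning idea is simple to demonstrate: roll a differentiable dynamics model out over a candidate action sequence, accumulate a planning cost, and optimize the actions directly by backpropagation through the rollout. The dynamics network, goal, horizon, and cost below are toy placeholders (a randomly initialized model rather than a learned one); this is a sketch of the mechanism, not the paper's method.

```python
# Minimal sketch of gradient-based planning through a differentiable world model:
# optimize an action sequence by backpropagating a rollout cost through the dynamics.
import torch
import torch.nn as nn

dynamics = nn.Sequential(nn.Linear(4 + 2, 64), nn.Tanh(), nn.Linear(64, 4))  # s' = f(s, a)
goal = torch.tensor([1.0, 0.0, 0.0, 0.0])

actions = torch.zeros(10, 2, requires_grad=True)          # planning horizon of 10 actions
opt = torch.optim.Adam([actions], lr=0.05)

for step in range(200):
    s = torch.zeros(4)                                     # initial state
    cost = 0.0
    for a in actions:                                      # differentiable rollout
        s = dynamics(torch.cat([s, a]))
        cost = cost + ((s - goal) ** 2).sum() + 1e-2 * (a ** 2).sum()
    opt.zero_grad()
    cost.backward()
    opt.step()

print("final planning cost:", cost.item())
```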

Methods, systems, and media for detecting spoofing in mobile authentication

Published Date

2022/8/23

Provided herein are devices, systems, and methods for detecting spoofing of a 3D object, using a 2D representation, in a mobile object authentication process, comprising capturing image data of the 3D object by a front-facing camera, to record a current spatial characteristic of the 3D object, while a front-facing screen displays an authentication pattern comprising a plurality of regions, wherein at least one of the regions varies in at least one of: brightness, position, size, shape, and color over time causing a variance of lighting effects which create highlights and shadows on the 3D object over time. The devices, systems, and methods thereby provide an efficient and secure process for determining if spoofing of the 3D object, using a 2D representation, is attempted in a mobile authentication process, by comparing the current spatial characteristic of the 3D object with a stored reference spatial characteristic of the 3D …

Pre-train your loss: Easy bayesian transfer learning with informative priors

Authors

Ravid Shwartz-Ziv,Micah Goldblum,Hossein Souri,Sanyam Kapoor,Chen Zhu,Yann LeCun,Andrew G Wilson

Journal

Advances in Neural Information Processing Systems

Published Date

2022/12/6

Deep learning is increasingly moving towards a transfer learning paradigm whereby large foundation models are fine-tuned on downstream tasks, starting from an initialization learned on the source task. But an initialization contains relatively little information about the source task, and does not reflect the belief that our knowledge of the source task should affect the locations and shape of optima on the downstream task. Instead, we show that we can learn highly informative posteriors from the source task, through supervised or self-supervised approaches, which then serve as the basis for priors that modify the whole loss surface on the downstream task. This simple modular approach enables significant performance gains and more data-efficient learning on a variety of downstream classification and segmentation tasks, serving as a drop-in replacement for standard pre-training strategies. These highly informative priors also can be saved for future use, similar to pre-trained weights, and stand in contrast to the zero-mean isotropic uninformative priors that are typically used in Bayesian deep learning.
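
A minimal sketch of the loss structure this implies: a source-task posterior, simplified here to a diagonal Gaussian, acts as a prior during fine-tuning by penalizing deviation from the source mean weighted by the source precision. The paper learns richer posteriors than this diagonal simplification; all names and variances below are placeholders.

# Simplified sketch: a (diagonal) Gaussian posterior learned on the source task
# reused as a prior when fine-tuning downstream, i.e. a quadratic penalty toward
# the source mean scaled by the source precision. Illustrative only.
import torch

def prior_penalty(model, source_mean, source_var, scale=1.0):
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + 0.5 * ((p - source_mean[name]) ** 2 / source_var[name]).sum()
    return scale * penalty

# Toy usage with a tiny linear model standing in for a pre-trained network.
model = torch.nn.Linear(4, 2)
source_mean = {n: p.detach().clone() for n, p in model.named_parameters()}
source_var = {n: torch.full_like(p, 0.1) for n, p in model.named_parameters()}
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y) + prior_penalty(model, source_mean, source_var)
loss.backward()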

Vicregl: Self-supervised learning of local visual features

Authors

Adrien Bardes,Jean Ponce,Yann LeCun

Journal

Advances in Neural Information Processing Systems

Published Date

2022/12/6

Most recent self-supervised methods for learning image representations focus on either producing a global feature with invariance properties, or producing a set of local features. The former works best for classification tasks while the latter is best for detection and segmentation tasks. This paper explores the fundamental trade-off between learning local and global features. A new method called VICRegL is proposed that learns good global and local features simultaneously, yielding excellent performance on detection and segmentation tasks while maintaining good performance on classification tasks. Concretely, two identical branches of a standard convolutional net architecture are fed two differently distorted versions of the same image. The VICReg criterion is applied to pairs of global feature vectors. Simultaneously, the VICReg criterion is applied to pairs of local feature vectors occurring before the last pooling layer. Two local feature vectors are attracted to each other if their l2-distance is below a threshold or if their relative locations are consistent with a known geometric transformation between the two input images. We demonstrate strong performance on linear classification and segmentation transfer tasks. Code and pretrained models are publicly available at: https://github.com/facebookresearch/VICRegL

What Do We Maximize in Self-Supervised Learning And Why Does Generalization Emerge?

Authors

Ravid Shwartz-Ziv,Randall Balestriero,Kenji Kawaguchi,Yann LeCun

Published Date

2022/9/29

In this paper, we provide an information-theoretic (IT) understanding of self-supervised learning methods, their construction, and optimality. As a first step, we demonstrate how IT quantities can be obtained for deterministic networks, as an alternative to the commonly used unrealistic stochastic networks assumption. Secondly, we demonstrate how different SSL models can be (re)discovered based on first principles and highlight what the underlying assumptions of different SSL variants are. Third, we derive a novel generalization bound based on our IT understanding of SSL methods, providing generalization guarantees for the downstream supervised learning task. As a result of this bound, along with our unified view of SSL, we can compare the different approaches and provide general guidelines to practitioners. Consequently, our derivation and insights can contribute to a better understanding of SSL and transfer learning from a theoretical and practical perspective.

Minimalistic unsupervised learning with the sparse manifold transform

Authors

Yubei Chen,Zeyu Yun,Yi Ma,Bruno Olshausen,Yann LeCun

Journal

arXiv preprint arXiv:2209.15261

Published Date

2022/9/30

We describe a minimalistic and interpretable method for unsupervised learning, without resorting to data augmentation, hyperparameter tuning, or other engineering designs, that achieves performance close to the SOTA SSL methods. Our approach leverages the sparse manifold transform, which unifies sparse coding, manifold learning, and slow feature analysis. With a one-layer deterministic sparse manifold transform, one can achieve 99.3% KNN top-1 accuracy on MNIST, 81.1% KNN top-1 accuracy on CIFAR-10 and 53.2% on CIFAR-100. With a simple gray-scale augmentation, the model gets 83.2% KNN top-1 accuracy on CIFAR-10 and 57% on CIFAR-100. These results significantly close the gap between simplistic "white-box" methods and the SOTA methods. Additionally, we provide visualization to explain how an unsupervised representation transform is formed. The proposed method is closely connected to latent-embedding self-supervised methods and can be treated as the simplest form of VICReg. Though there remains a small performance gap between our simple constructive model and SOTA methods, the evidence points to this as a promising direction for achieving a principled and white-box approach to unsupervised learning.

Joint embedding predictive architectures focus on slow features

Authors

Vlad Sobal,Jyothir SV,Siddhartha Jalagam,Nicolas Carion,Kyunghyun Cho,Yann LeCun

Journal

arXiv preprint arXiv:2211.10831

Published Date

2022/11/20

Many common methods for learning a world model for pixel-based environments use generative architectures trained with pixel-level reconstruction objectives. Recently proposed Joint Embedding Predictive Architectures (JEPA) offer a reconstruction-free alternative. In this work, we analyze performance of JEPA trained with VICReg and SimCLR objectives in the fully offline setting without access to rewards, and compare the results to the performance of the generative architecture. We test the methods in a simple environment with a moving dot with various background distractors, and probe learned representations for the dot's location. We find that JEPA methods perform on par or better than reconstruction when distractor noise changes every time step, but fail when the noise is fixed. Furthermore, we provide a theoretical explanation for the poor performance of JEPA-based methods with fixed noise, highlighting an important limitation.

VICReg: Variance-invariance-covariance regularization for self-supervised learning

Authors

Adrien Bardes,Jean Ponce,Yann LeCun

Published Date

2022

Recent self-supervised methods for image representation learning are based on maximizing the agreement between embedding vectors from different views of the same image. A trivial solution is obtained when the encoder outputs constant vectors. This collapse problem is often avoided through implicit biases in the learning architecture, that often lack a clear justification or interpretation. In this paper, we introduce VICReg (Variance-Invariance-Covariance Regularization), a method that explicitly avoids the collapse problem with a simple regularization term on the variance of the embeddings along each dimension individually. VICReg combines the variance term with a decorrelation mechanism based on redundancy reduction and covariance regularization, and achieves results on par with the state of the art on several downstream tasks. In addition, we show that incorporating our new variance term into other methods helps stabilize the training and leads to performance improvements.
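
The three terms of the VICReg objective translate directly into a short loss function. The sketch below uses commonly reported default coefficients and omits the encoder, projector, and augmentations; it is an illustration of the criterion, not the authors' released implementation.

# Minimal VICReg loss: invariance (MSE) + variance hinge + covariance penalty.
# Coefficients and the hinge threshold follow common defaults; the encoder,
# projector, and augmentation pipeline are omitted.
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_coef=25.0, std_coef=25.0, cov_coef=1.0, eps=1e-4):
    n, d = z_a.shape
    inv = F.mse_loss(z_a, z_b)                               # invariance term
    std_a = torch.sqrt(z_a.var(dim=0) + eps)                 # variance term: keep each
    std_b = torch.sqrt(z_b.var(dim=0) + eps)                 # dimension's std above 1
    var = torch.relu(1.0 - std_a).mean() + torch.relu(1.0 - std_b).mean()
    z_a = z_a - z_a.mean(dim=0)                              # covariance term: decorrelate
    z_b = z_b - z_b.mean(dim=0)                              # off-diagonal entries
    cov_a = (z_a.T @ z_a) / (n - 1)
    cov_b = (z_b.T @ z_b) / (n - 1)
    off_diag = lambda m: m - torch.diag(torch.diag(m))
    cov = off_diag(cov_a).pow(2).sum() / d + off_diag(cov_b).pow(2).sum() / d
    return sim_coef * inv + std_coef * var + cov_coef * cov

print(vicreg_loss(torch.randn(256, 128), torch.randn(256, 128)))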

Masked siamese convnets

Authors

Li Jing,Jiachen Zhu,Yann LeCun

Journal

arXiv preprint arXiv:2206.07700

Published Date

2022/6/15

Self-supervised learning has shown superior performances over supervised methods on various vision benchmarks. The siamese network, which encourages embeddings to be invariant to distortions, is one of the most successful self-supervised visual representation learning approaches. Among all the augmentation methods, masking is the most general and straightforward method that has the potential to be applied to all kinds of input and requires the least amount of domain knowledge. However, masked siamese networks require particular inductive bias and practically only work well with Vision Transformers. This work empirically studies the problems behind masked siamese networks with ConvNets. We propose several empirical designs to overcome these problems gradually. Our method performs competitively on low-shot image classification and outperforms previous methods on object detection benchmarks. We discuss several remaining issues and hope this work can provide useful data points for future general-purpose self-supervised learning.

Deep learning, reinforcement learning, and world models

Authors

Yutaka Matsuo,Yann LeCun,Maneesh Sahani,Doina Precup,David Silver,Masashi Sugiyama,Eiji Uchibe,Jun Morimoto

Published Date

2022/8/1

Deep learning (DL) and reinforcement learning (RL) methods seem to be a part of indispensable factors to achieve human-level or super-human AI systems. On the other hand, both DL and RL have strong connections with our brain functions and with neuroscientific findings. In this review, we summarize talks and discussions in the “Deep Learning and Reinforcement Learning” session of the symposium, International Symposium on Artificial Intelligence and Brain Science. In this session, we discussed whether we can achieve comprehensive understanding of human intelligence based on the recent advances of deep learning and reinforcement learning algorithms. Speakers contributed to provide talks about their recent studies that can be key technologies to achieve human-level intelligence.

Intra-Instance VICReg: Bag of Self-Supervised Image Patch Embedding Explains the Performance

Authors

Yubei Chen,Adrien Bardes,Zengyi Li,Yann LeCun

Published Date

2022/9/29

Recently, self-supervised learning (SSL) has achieved tremendous empirical advancements in learning image representation. However, our understanding and knowledge of the representation are still limited. This work shows that the success of the SOTA Siamese-network-based SSL approaches is primarily based on learning a distributed representation of image patches. In particular, we show that when we learn a representation only for fixed-scale image patches and aggregate different patch representations for an image (instance), it can achieve on par or even better results than the baseline methods on several benchmarks. Further, we show that the patch representation aggregation can also improve various SOTA baseline methods by a large margin. We also establish a formal connection between the Siamese-network-based SSL objective and the image patches co-occurrence statistics modeling, which supplements the prevailing invariance perspective. By visualizing the nearest neighbors of different image patches in the embedding space and projection space, we show that while the projection has more invariance, the embedding space tends to preserve more equivariance and locality. While it is important to push the SOTA engineering frontier, we show that it is also a promising direction to simplify the SOTA methods to build better understanding.

Unsupervised learning of structured representations via closed-loop transcription

Authors

Shengbang Tong,Xili Dai,Yubei Chen,Mingyang Li,Zengyi Li,Brent Yi,Yann LeCun,Yi Ma

Journal

arXiv preprint arXiv:2210.16782

Published Date

2022/10/30

This paper proposes an unsupervised method for learning a unified representation that serves both discriminative and generative purposes. While most existing unsupervised learning approaches focus on a representation for only one of these two goals, we show that a unified representation can enjoy the mutual benefits of having both. Such a representation is attainable by generalizing the recently proposed closed-loop transcription framework, known as CTRL, to the unsupervised setting. This entails solving a constrained maximin game over a rate reduction objective that expands features of all samples while compressing features of augmentations of each sample. Through this process, we see discriminative low-dimensional structures emerge in the resulting representations. Under comparable experimental conditions and network complexities, we demonstrate that these structured representations enable classification performance close to state-of-the-art unsupervised discriminative representations, and conditionally generated image quality significantly higher than that of state-of-the-art unsupervised generative models. Source code can be found at https://github.com/Delay-Xili/uCTRL.

Blockwise self-supervised learning with Barlow Twins

Authors

Shoaib Ahmed Siddiqui,David Krueger,Yann LeCun,Stephane Deny

Published Date

2022/9/29

Current state-of-the-art deep networks are all powered by backpropagation. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. Notably, we show that a blockwise pretraining procedure consisting of training independently the 4 main blocks of layers of a ResNet-50 with Barlow Twins loss function at each block performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48%, only 1.1% below the accuracy of an end-to-end pretrained network (71.57% accuracy). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.

On the duality between contrastive and non-contrastive self-supervised learning

Authors

Quentin Garrido,Yubei Chen,Adrien Bardes,Laurent Najman,Yann LeCun

Journal

arXiv preprint arXiv:2206.02574

Published Date

2022/6/3

Recent approaches in self-supervised learning of image representations can be categorized into different families of methods and, in particular, can be divided into contrastive and non-contrastive approaches. While differences between the two families have been thoroughly discussed to motivate new approaches, we focus more on the theoretical similarities between them. By designing contrastive and covariance based non-contrastive criteria that can be related algebraically and shown to be equivalent under limited assumptions, we show how close those families can be. We further study popular methods and introduce variations of them, allowing us to relate this theoretical result to current practices and show the influence (or lack thereof) of design choices on downstream performance. Motivated by our equivalence result, we investigate the low performance of SimCLR and show how it can match VICReg's with careful hyperparameter tuning, improving significantly over known baselines. We also challenge the popular assumption that non-contrastive methods need large output dimensions. Our theoretical and quantitative results suggest that the numerical gaps between contrastive and non-contrastive methods in certain regimes can be closed given better network design choices and hyperparameter tuning. The evidence shows that unifying different SOTA methods is an important direction to build a better understanding of self-supervised learning.

What do we maximize in self-supervised learning?

Authors

Ravid Shwartz-Ziv,Randall Balestriero,Yann LeCun

Published Date

2022/7/20

In this paper, we examine self-supervised learning methods, particularly VICReg, to provide an information-theoretical understanding of their construction. As a first step, we demonstrate how information-theoretic quantities can be obtained for a deterministic network, offering a possible alternative to prior work that relies on stochastic models. This enables us to demonstrate how VICReg can be (re)discovered from first principles and its assumptions about data distribution. Furthermore, we empirically demonstrate the validity of our assumptions, confirming our novel understanding of VICReg. Finally, we believe that the derivation and insights we obtain can be generalized to many other SSL methods, opening new avenues for theoretical and practical understanding of SSL and transfer learning.

Coarse-to-fine vision-language pre-training with fusion in the backbone

Authors

Zi-Yi Dou,Aishwarya Kamath,Zhe Gan,Pengchuan Zhang,Jianfeng Wang,Linjie Li,Zicheng Liu,Ce Liu,Yann LeCun,Nanyun Peng,Jianfeng Gao,Lijuan Wang

Journal

arXiv preprint arXiv:2206.07643

Published Date

2022/6/15

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones to better capture multimodal interactions. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is released at https://github.com/microsoft/FIBER.

Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods

Authors

Randall Balestriero,Yann LeCun

Journal

Advances in Neural Information Processing Systems

Published Date

2022/12/6

Self-Supervised Learning (SSL) surmises that inputs and pairwise positive relationships are enough to learn meaningful representations. Although SSL has recently reached a milestone, outperforming supervised methods in many modalities, the theoretical foundations are limited, method-specific, and fail to provide principled design guidelines to practitioners. In this paper, we propose a unifying framework under the helm of spectral manifold learning. Through the course of this study, we will demonstrate that VICReg, SimCLR, BarlowTwins et al. correspond to eponymous spectral methods such as Laplacian Eigenmaps, ISOMAP et al. From this unified viewpoint, we obtain (i) the closed-form optimal representation, (ii) the closed-form optimal network parameters in the linear regime, (iii) the impact of the pairwise relations used during training on each of those quantities and on downstream task performances, and, most importantly, (iv) the first theoretical bridge between contrastive and non-contrastive methods to global and local spectral methods respectively, hinting at the benefits and limitations of each. For example, if the pairwise relation is aligned with the downstream task, all SSL methods produce optimal representations for that downstream task.

Variance covariance regularization enforces pairwise independence in self-supervised representations

Authors

Grégoire Mialon,Randall Balestriero,Yann LeCun

Published Date

2022/9/29

Self-Supervised Learning (SSL) methods such as VICReg, Barlow Twins or W-MSE avoid collapse of their joint embedding architectures by constraining or regularizing the covariance matrix of their projector's output. This study highlights important properties of such a strategy, which we coin Variance-Covariance regularization (VCReg). More precisely, we show that VCReg enforces pairwise independence between the features of the learned representation. This result emerges by bridging VCReg applied on the projector's output to kernel independence criteria applied on the projector's input. This provides the first theoretical motivations and explanations of VCReg. We empirically validate our findings where (i) we put in evidence which projector characteristics favor pairwise independence, (ii) we use these findings to obtain nontrivial performance gains for VICReg, (iii) we demonstrate that the scope of VCReg goes beyond SSL by using it to solve Independent Component Analysis. We hope that our findings will support the adoption of VCReg in SSL and beyond.

Decoupled contrastive learning

Authors

Chun-Hsiao Yeh,Cheng-Yao Hong,Yen-Chi Hsu,Tyng-Luh Liu,Yubei Chen,Yann LeCun

Published Date

2022/10/23

Contrastive learning (CL) is one of the most successful paradigms for self-supervised learning (SSL). In a principled way, it considers two augmented “views” of the same image as positive to be pulled closer, and all other images as negative to be pushed further apart. However, behind the impressive success of CL-based techniques, their formulation often relies on heavy-computation settings, including large sample batches, extensive training epochs, etc. We are thus motivated to tackle these issues and establish a simple, efficient, yet competitive baseline of contrastive learning. Specifically, we identify, from theoretical and empirical studies, a noticeable negative-positive-coupling (NPC) effect in the widely used InfoNCE loss, leading to unsuitable learning efficiency concerning the batch size. By removing the NPC effect, we propose decoupled contrastive learning (DCL) loss, which removes the positive term from the …
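
The change DCL makes to the InfoNCE objective is small enough to show inline: the positive pair is removed from the denominator (the log-sum-exp), which decouples the positive and negative terms. The sketch below is a simplified, single-direction version that only contrasts one view against the other.

# Sketch of the decoupled contrastive (DCL) loss: unlike InfoNCE, the positive
# pair is excluded from the log-sum-exp denominator. Simplified to one direction
# (view 1 against view 2); temperature and batch handling are illustrative.
import torch
import torch.nn.functional as F

def dcl_loss(z1, z2, temperature=0.1):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature                 # (N, N) scaled cosine similarities
    pos = logits.diag()                              # positives sit on the diagonal
    diag_mask = torch.eye(len(z1), dtype=torch.bool)
    neg = torch.logsumexp(logits.masked_fill(diag_mask, float("-inf")), dim=1)
    return (-pos + neg).mean()                       # InfoNCE would keep pos inside the logsumexp

print(dcl_loss(torch.randn(64, 128), torch.randn(64, 128)))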

Minimalistic unsupervised representation learning with the sparse manifold transform

Authors

Yubei Chen,Zeyu Yun,Yi Ma,Bruno Olshausen,Yann LeCun

Published Date

2022/9/29

We describe a minimalistic and interpretable method for unsupervised representation learning that does not require data augmentation, hyperparameter tuning, or other engineering designs, but nonetheless achieves performance close to the state-of-the-art (SOTA) SSL methods. Our approach leverages the sparse manifold transform, which unifies sparse coding, manifold learning, and slow feature analysis. With a one-layer deterministic (one training epoch) sparse manifold transform, it is possible to achieve 99.3% KNN top-1 accuracy on MNIST, 81.1% KNN top-1 accuracy on CIFAR-10, and 53.2% on CIFAR-100. With simple gray-scale augmentation, the model achieves 83.2% KNN top-1 accuracy on CIFAR-10 and 57% on CIFAR-100. These results significantly close the gap between simplistic ``white-box'' methods and SOTA methods. We also provide visualization to illustrate how an unsupervised representation transform is formed. The proposed method is closely connected to latent-embedding self-supervised methods and can be treated as the simplest form of VICReg. Though a small performance gap remains between our simple constructive model and SOTA methods, the evidence points to this as a promising direction for achieving a principled and white-box approach to unsupervised representation learning, which has potential to significantly improve learning efficiency.

Separating the world and ego models for self-driving

Authors

Vlad Sobal,Alfredo Canziani,Nicolas Carion,Kyunghyun Cho,Yann LeCun

Journal

arXiv preprint arXiv:2204.07184

Published Date

2022/4/14

Training self-driving systems to be robust to the long-tail of driving scenarios is a critical problem. Model-based approaches leverage simulation to emulate a wide range of scenarios without putting users at risk in the real world. One promising path to faithful simulation is to train a forward model of the world to predict the future states of both the environment and the ego-vehicle given past states and a sequence of actions. In this paper, we argue that it is beneficial to model the state of the ego-vehicle, which often has simple, predictable and deterministic behavior, separately from the rest of the environment, which is much more complex and highly multimodal. We propose to model the ego-vehicle using a simple and differentiable kinematic model, while training a stochastic convolutional forward model on raster representations of the state to predict the behavior of the rest of the environment. We explore several configurations of such decoupled models, and evaluate their performance both with Model Predictive Control (MPC) and direct policy learning. We test our methods on the task of highway driving and demonstrate lower crash rates and better stability. The code is available at https://github.com/vladisai/pytorch-PPUU/tree/ICLR2022.
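
The ego-vehicle model the paper argues for can be as small as a closed-form kinematic update, which stays fully differentiable so it can be used inside MPC or direct policy learning. The unicycle-style update below is an illustrative simplification, not the exact parameterization used in the paper.

# Minimal differentiable kinematic ego model (unicycle-style), illustrating why
# the ego-vehicle can be described by a simple closed-form update while the rest
# of the environment needs a learned stochastic model. Names are illustrative.
import torch

def ego_step(state, action, dt=0.1):
    # state = (x, y, heading, speed); action = (acceleration, steering rate)
    x, y, theta, v = state.unbind(-1)
    accel, steer = action.unbind(-1)
    x = x + v * torch.cos(theta) * dt
    y = y + v * torch.sin(theta) * dt
    theta = theta + steer * dt
    v = v + accel * dt
    return torch.stack([x, y, theta, v], dim=-1)

state = torch.tensor([0.0, 0.0, 0.0, 10.0], requires_grad=True)
next_state = ego_step(state, torch.tensor([1.0, 0.05]))
next_state.sum().backward()          # gradients flow through the kinematic model
print(next_state, state.grad)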

Sistema de prevenção de mistificação de identidade (Identity spoofing prevention system)

Published Date

2022/7/12

Patent classified under CPC H04L9/32 and H04L9/3226: cryptographic mechanisms and network security protocols including means for verifying the identity or authority of a user of the system or for message authentication (e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials), including authentication using a predetermined code such as a password, passphrase or PIN.

Joint embedding self-supervised learning in the kernel regime

Authors

Bobak T Kiani,Randall Balestriero,Yubei Chen,Seth Lloyd,Yann LeCun

Journal

arXiv preprint arXiv:2209.14884

Published Date

2022/9/29

The fundamental goal of self-supervised learning (SSL) is to produce useful representations of data without access to any labels for classifying the data. Modern methods in SSL, which form representations based on known or constructed relationships between samples, have been particularly effective at this task. Here, we aim to extend this framework to incorporate algorithms based on kernel methods where embeddings are constructed by linear maps acting on the feature space of a kernel. In this kernel regime, we derive methods to find the optimal form of the output representations for contrastive and non-contrastive loss functions. This procedure produces a new representation space with an inner product denoted as the induced kernel which generally correlates points which are related by an augmentation in kernel space and de-correlates points otherwise. We analyze our kernel model on small datasets to identify common features of self-supervised learning algorithms and gain theoretical insights into their performance on downstream tasks.

Toward next-generation artificial intelligence: Catalyzing the neuroai revolution

Authors

Anthony Zador,Sean Escola,Blake Richards,Bence Ölveczky,Yoshua Bengio,Kwabena Boahen,Matthew Botvinick,Dmitri Chklovskii,Anne Churchland,Claudia Clopath,James DiCarlo,Surya Ganguli,Jeff Hawkins,Konrad Koerding,Alexei Koulakov,Yann LeCun,Timothy Lillicrap,Adam Marblestone,Bruno Olshausen,Alexandre Pouget,Cristina Savin,Terrence Sejnowski,Eero Simoncelli,Sara Solla,David Sussillo,Andreas S Tolias,Doris Tsao

Journal

arXiv preprint arXiv:2210.08340

Published Date

2022/10/15

Neuroscience has long been an essential driver of progress in artificial intelligence (AI). We propose that to accelerate progress in AI, we must invest in fundamental research in NeuroAI. A core component of this is the embodied Turing test, which challenges AI animal models to interact with the sensorimotor world at skill levels akin to their living counterparts. The embodied Turing test shifts the focus from those capabilities like game playing and language that are especially well-developed or uniquely human to those capabilities, inherited from over 500 million years of evolution, that are shared with all animals. Building models that can pass the embodied Turing test will provide a roadmap for the next generation of AI.

A data-augmentation is worth a thousand samples: Analytical moments and sampling-free training

Authors

Randall Balestriero,Ishan Misra,Yann LeCun

Journal

Advances in Neural Information Processing Systems

Published Date

2022/12/6

Data-Augmentation (DA) is known to improve performance across tasks and datasets. We propose a method to theoretically analyze the effect of DA and study questions such as: how many augmented samples are needed to correctly estimate the information encoded by that DA? How does the augmentation policy impact the final parameters of a model? We derive several quantities in closed form, such as the expectation and variance of an image, loss, and model's output under a given DA distribution. To our knowledge, we obtain the first explicit regularizer that corresponds to using DA during training for non-trivial transformations such as affine transformations, color jittering, or Gaussian blur. Those derivations open new avenues to quantify the benefits and limitations of DA. For example, given a loss at hand, we find that common DAs require tens of thousands of samples for the loss to be correctly estimated and for the model training to converge. We then show that for a training loss to have reduced variance under DA sampling, the model's saliency map (gradient of the loss with respect to the model's input) must align with the smallest eigenvector of the sample's covariance matrix under the considered DA augmentation; this is exactly the quantity estimated and regularized by TangentProp. Those findings also hint at a possible explanation on why models tend to shift their focus from edges to textures when specific DAs are employed.
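
For intuition on the quantities involved, the sketch below estimates the mean and variance of a loss under a random-crop augmentation distribution by brute-force sampling, which is exactly the Monte Carlo procedure the closed-form derivations make unnecessary. The model, augmentation, and sample count are arbitrary placeholders, and torchvision is assumed for the augmentation.

# Monte Carlo estimate of the mean and variance of a model's loss under a
# data-augmentation distribution -- the kind of quantity the paper derives in
# closed form. Model, augmentation, and sample counts are illustrative.
import torch
import torchvision.transforms as T

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
augment = T.RandomResizedCrop(32, scale=(0.3, 1.0))
image, label = torch.rand(3, 32, 32), torch.tensor([3])

losses = []
for _ in range(1000):                                  # closed-form moments avoid this sampling loop
    x = augment(image).unsqueeze(0)
    losses.append(torch.nn.functional.cross_entropy(model(x), label))
losses = torch.stack(losses)
print(f"E[loss] ~ {losses.mean():.3f}, Var[loss] ~ {losses.var():.4f}")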

Masked Siamese ConvNets: Towards an Effective Masking Strategy for General-purpose Siamese Networks

Authors

Li Jing,Jiachen Zhu,Yann LeCun

Published Date

2022/9/29

Siamese Networks are a popular self-supervised learning framework that learns useful representation without human supervision by encouraging representations to be invariant to distortions. Existing methods heavily rely on hand-crafted augmentations, which are not easily adapted to new domains. To explore a general-purpose or domain-agnostic siamese network, we investigate using masking as augmentations in siamese networks. Recently, masking for siamese networks has only been shown useful with transformer architectures, e.g. MSN and data2vec. In this work, we identify the underlying problems of masking for siamese networks with arbitrary backbones, including ConvNets. We propose an effective and general-purpose masking strategy and demonstrate its effectiveness on various siamese network frameworks. Our method generally improves siamese networks' performances in the few-shot image classification, and object detection tasks.

projUNN: efficient method for training deep networks with unitary matrices

Authors

Bobak Kiani,Randall Balestriero,Yann LeCun,Seth Lloyd

Journal

Advances in Neural Information Processing Systems

Published Date

2022/12/6

In learning with recurrent or very deep feed-forward networks, employing unitary matrices in each layer can be very effective at maintaining long-range stability. However, restricting network parameters to be unitary typically comes at the cost of expensive parameterizations or increased training runtime. We propose instead an efficient method based on rank-k updates, or their rank-k approximation, that maintains performance at a nearly optimal training runtime. We introduce two variants of this method, named Direct (projUNN-D) and Tangent (projUNN-T) projected Unitary Neural Networks, that can parameterize full N-dimensional unitary or orthogonal matrices with a training runtime scaling as O(kN^2). Our method either projects low-rank gradients onto the closest unitary matrix (projUNN-T) or transports unitary matrices in the direction of the low-rank gradient (projUNN-D). Even in the fastest setting (k = 1), projUNN is able to train a model's unitary parameters to reach comparable performances against baseline implementations. In recurrent neural network settings, projUNN closely matches or exceeds benchmarked results from prior unitary neural networks. Finally, we preliminarily explore projUNN in training orthogonal convolutional neural networks, which are currently unable to outperform state-of-the-art models but can potentially enhance stability and robustness at large depth.
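
The "project onto the closest unitary matrix" operation at the heart of projUNN-T can be illustrated with a naive full-SVD (polar decomposition) version. This sketch deliberately ignores the paper's efficient rank-k machinery, which is what makes the projection affordable; it only shows the projection step itself.

# Naive illustration of projecting a weight matrix back onto the orthogonal
# manifold after a gradient step, via a full SVD. projUNN replaces this O(N^3)
# projection with efficient rank-k updates; this only shows the operation.
import torch

def project_to_orthogonal(w):
    u, _, vh = torch.linalg.svd(w)
    return u @ vh                           # polar factor: nearest orthogonal matrix in Frobenius norm

w = torch.linalg.qr(torch.randn(64, 64)).Q  # start from an orthogonal matrix
grad = torch.randn(64, 64) * 0.01
w = project_to_orthogonal(w - 0.1 * grad)   # gradient step followed by projection
print(torch.dist(w.T @ w, torch.eye(64)))   # ~0: w is orthogonal again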

A data-augmentation is worth a thousand samples: Exact quantification from analytical augmented sample moments

Authors

Randall Balestriero,Ishan Misra,Yann LeCun

Journal

arXiv preprint arXiv:2202.08325

Published Date

2022/2/16

Data-Augmentation (DA) is known to improve performance across tasks and datasets. We propose a method to theoretically analyze the effect of DA and study questions such as: how many augmented samples are needed to correctly estimate the information encoded by that DA? How does the augmentation policy impact the final parameters of a model? We derive several quantities in closed form, such as the expectation and variance of an image, loss, and model's output under a given DA distribution. Those derivations open new avenues to quantify the benefits and limitations of DA. For example, we show that common DAs require tens of thousands of samples for the loss at hand to be correctly estimated and for the model training to converge. We show that for a training loss to be stable under DA sampling, the model's saliency map (gradient of the loss with respect to the model's input) must align with the smallest eigenvector of the sample variance under the considered DA augmentation, hinting at a possible explanation on why models tend to shift their focus from edges to textures.

A path towards autonomous machine intelligence version 0.9.2, 2022-06-27

Authors

Yann LeCun

Journal

Open Review

Published Date

2022/6/27

How could machines learn as efficiently as humans and animals? How could machines learn to reason and plan? How could machines learn representations of percepts and action plans at multiple levels of abstraction, enabling them to reason, predict, and plan at multiple time horizons? This position paper proposes an architecture and training paradigms with which to construct autonomous intelligent agents. It combines concepts such as configurable predictive world model, behavior driven through intrinsic motivation, and hierarchical joint embedding architectures trained with self-supervised learning.

Light-weight probing of unsupervised representations for reinforcement learning

Authors

Wancong Zhang,Anthony GX-Chen,Vlad Sobal,Yann LeCun,Nicolas Carion

Journal

arXiv preprint arXiv:2208.12345

Published Date

2022/8/25

Unsupervised visual representation learning offers the opportunity to leverage large corpora of unlabeled trajectories to form useful visual representations, which can benefit the training of reinforcement learning (RL) algorithms. However, evaluating the fitness of such representations requires training RL algorithms which is computationally intensive and has high variance outcomes. To alleviate this issue, we design an evaluation protocol for unsupervised RL representations with lower variance and up to 600x lower computational cost. Inspired by the vision community, we propose two linear probing tasks: predicting the reward observed in a given state, and predicting the action of an expert in a given state. These two tasks are generally applicable to many RL domains, and we show through rigorous experimentation that they correlate strongly with the actual downstream control performance on the Atari100k Benchmark. This provides a better method for exploring the space of pretraining algorithms without the need of running RL evaluations for every setting. Leveraging this framework, we further improve existing self-supervised learning (SSL) recipes for RL, highlighting the importance of the forward model, the size of the visual backbone, and the precise formulation of the unsupervised objective.
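
The reward-prediction probe is simple to reproduce in outline: freeze the pretrained encoder and train only a linear head to regress the observed reward from its features. Everything below (encoder architecture, data, dimensions) is a placeholder rather than the benchmark setup.

# Minimal version of the reward-prediction linear probe: the pretrained encoder
# stays frozen and only a linear head is fit. Encoder, data, and sizes are
# placeholders, not the Atari100k setup.
import torch

encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(84 * 84, 256)).eval()
for p in encoder.parameters():
    p.requires_grad_(False)                          # representation stays frozen

probe = torch.nn.Linear(256, 1)                      # the only trained component
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

frames = torch.rand(512, 1, 84, 84)                  # stand-in for observations
rewards = torch.rand(512, 1)                         # stand-in for observed rewards
for _ in range(100):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(probe(encoder(frames)), rewards)
    loss.backward()
    opt.step()
print(f"probe loss: {loss.item():.4f}")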

Volta: Vision-language transformer with weakly-supervised local-feature alignment

Authors

Shraman Pramanick,Li Jing,Sayan Nag,Jiachen Zhu,Hardik Shah,Yann LeCun,Rama Chellappa

Journal

Transactions on Machine Learning Research (TMLR)

Published Date

2023

Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods use high-resolution image-text box data to perform well on fine-grained region-level tasks, such as object detection, segmentation, and referring expression comprehension. Unfortunately, such high-resolution images with accurate bounding box annotations are expensive to collect and use for supervision at scale. In this work, we propose VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that only utilizes image-caption data but achieves fine-grained region-level image understanding, eliminating the use of expensive box annotations. VoLTA adopts graph optimal transport-based weakly-supervised alignment on local image patches and text tokens to germinate an explicit, self-normalized, and interpretable low-level matching criterion. In addition, VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, further reducing memory requirements. Extensive experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising the coarse-grained downstream performance, often outperforming methods using significantly more caption and box annotations.

Graph MLP-Mixer

Authors

Xiaoxin He,Bryan Hooi,Thomas Laurent,Adam Perold,Yann LeCun,Xavier Bresson

Published Date

2022/9/29

Graph Neural Networks (GNNs) have shown great potential in the field of graph representation learning. Standard GNNs define a local message-passing mechanism which propagates information over the whole graph domain by stacking multiple layers. This paradigm suffers from two major limitations, over-squashing and poor long-range dependencies, that can be solved using global attention but significantly increases the computational cost to quadratic complexity. In this work, we consider an alternative approach to overcome these structural limitations while keeping a low complexity cost. Motivated by the recent MLP-Mixer architecture introduced in computer vision, we propose to generalize this network to graphs. This GNN model, namely Graph MLP-Mixer, can make long-range connections without over-squashing or high complexity due to the mixer layer applied to the graph patches extracted from the original graph. As a result, this architecture exhibits promising results when comparing standard GNNs vs. Graph MLP-Mixers on benchmark graph datasets.

The effects of regularization and data augmentation are class dependent

Authors

Randall Balestriero,Leon Bottou,Yann LeCun

Journal

Advances in Neural Information Processing Systems

Published Date

2022/12/6

Regularization is a fundamental technique to prevent over-fitting and to improve generalization performance by constraining a model's complexity. Current Deep Networks heavily rely on regularizers such as Data-Augmentation (DA) or weight-decay, and employ structural risk minimization, i.e., cross-validation, to select the optimal regularization hyper-parameters. In this study, we demonstrate that techniques such as DA or weight decay produce a model with a reduced complexity that is unfair across classes. The optimal amount of DA or weight decay found from cross-validation over all classes leads to disastrous model performance on some classes, e.g., on ImageNet with a ResNet-50, the "barn spider" classification test accuracy drops sharply merely by introducing random crop DA during training. Even more surprising, such a performance drop also appears when introducing uninformative regularization techniques such as weight decay. Those results demonstrate that our search for ever-increasing generalization performance, averaged over all classes and samples, has left us with models and regularizers that silently sacrifice performance on some classes. This scenario can become dangerous when deploying a model on downstream tasks, e.g., an ImageNet pre-trained ResNet-50 deployed on iNaturalist sees its performance on class #8889 fall substantially when random crop DA is introduced during the ImageNet pre-training phase. Those results demonstrate that finding a correct measure of a model's complexity without class-dependent preference remains an open research question.
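
The practical takeaway, that accuracy must be inspected per class rather than averaged, needs nothing more than a per-class accuracy helper such as the one sketched below; the predictions and labels here are synthetic.

# Per-class accuracy, so that regularization choices can be compared class by
# class instead of through a single averaged number. Data is synthetic.
import torch

def per_class_accuracy(preds, labels, num_classes):
    acc = {}
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            acc[c] = (preds[mask] == c).float().mean().item()
    return acc

labels = torch.randint(0, 5, (1000,))
preds = torch.where(torch.rand(1000) < 0.8, labels, torch.randint(0, 5, (1000,)))
print(per_class_accuracy(preds, labels, num_classes=5))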

Neural manifold clustering and embedding

Authors

Zengyi Li,Yubei Chen,Yann LeCun,Friedrich T Sommer

Journal

arXiv preprint arXiv:2201.10000

Published Date

2022/1/24

Given a union of non-linear manifolds, non-linear subspace clustering or manifold clustering aims to cluster data points based on manifold structures and also learn to parameterize each manifold as a linear subspace in a feature space. Deep neural networks have the potential to achieve this goal under highly non-linear settings given their large capacity and flexibility. We argue that achieving manifold clustering with neural networks requires two essential ingredients: a domain-specific constraint that ensures the identification of the manifolds, and a learning algorithm for embedding each manifold to a linear subspace in the feature space. This work shows that many constraints can be implemented by data augmentation. For subspace feature learning, the Maximum Coding Rate Reduction (MCR²) objective can be used. Putting them together yields Neural Manifold Clustering and Embedding (NMCE), a novel method for general-purpose manifold clustering, which significantly outperforms autoencoder-based deep subspace clustering. Further, on more challenging natural image datasets, NMCE can also outperform other algorithms specifically designed for clustering. Qualitatively, we demonstrate that NMCE learns a meaningful and interpretable feature space. As the formulation of NMCE is closely related to several important self-supervised learning (SSL) methods, we believe this work can help us build a deeper understanding of SSL representation learning.

Tico: Transformation invariance and covariance contrast for self-supervised visual representation learning

Authors

Jiachen Zhu,Rafael M Moraes,Serkan Karakulak,Vlad Sobol,Alfredo Canziani,Yann LeCun

Journal

arXiv preprint arXiv:2206.10698

Published Date

2022/6/21

We present Transformation Invariance and Covariance Contrast (TiCo) for self-supervised visual representation learning. Similar to other recent self-supervised learning methods, our method is based on maximizing the agreement among embeddings of different distorted versions of the same image, which pushes the encoder to produce transformation invariant representations. To avoid the trivial solution where the encoder generates constant vectors, we regularize the covariance matrix of the embeddings from different images by penalizing low rank solutions. By jointly minimizing the transformation invariance loss and covariance contrast loss, we get an encoder that is able to produce useful representations for downstream tasks. We analyze our method and show that it can be viewed as a variant of MoCo with an implicit memory bank of unlimited size at no extra memory cost. This makes our method perform better than alternative methods when using small batch sizes. TiCo can also be seen as a modification of Barlow Twins. By connecting the contrastive and redundancy-reduction methods together, TiCo gives us new insights into how joint embedding methods work.

Deep generative models create new and diverse protein structures

Authors

Zeming Lin,Tom Sercu,Yann LeCun,Alexander Rives

Journal

Machine Learning for Structural Biology Workshop, NeurIPS

Published Date

2021

We explore the use of modern variational autoencoders for generating protein structures. Models are trained across a diverse set of natural protein domains. Three-dimensional structures are encoded implicitly in the form of an energy function that expresses constraints on pairwise distances and angles. Atomic coordinates are recovered by optimizing the parameters of a rigid body representation of the protein chain to fit the constraints. The model generates diverse structures across a variety of folds, and exhibits local coherence at the level of secondary structure, generating alpha helices and beta sheets, as well as globally coherent tertiary structure. A number of generated protein sequences have high confidence predictions by AlphaFold that agree with their designs. The majority of these have no significant sequence homology to natural proteins. Most designed proteins are variations on existing proteins. It is of great interest to create de novo proteins that go beyond what has been invented by nature. A line of recent work has explored generative models for protein structures [1, 2, 3, 4, 5, 6]. The main challenge for a generative model is to propose stable structures that can be realized as the minimum energy state for a protein sequence, i.e., the endpoint of folding. The space of possible three-dimensional conformations of a protein sequence is exponentially large [7], but out of this set of possible conformations, most do not correspond to stable realizable structures.

Deep learning for AI

Authors

Yoshua Bengio,Yann Lecun,Geoffrey Hinton

Journal

Communications of the ACM

Published Date

2021/6/21

How can neural networks learn the rich internal representations required for difficult tasks such as recognizing objects or understanding language?

Mdetr-modulated detection for end-to-end multi-modal understanding

Authors

Aishwarya Kamath,Mannat Singh,Yann LeCun,Gabriel Synnaeve,Ishan Misra,Nicolas Carion

Published Date

2021

Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text. In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. We pre-train the network on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. We then fine-tune on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation, achieving state-of-the-art results on popular benchmarks. We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting. We show that our pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances. Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR. The code and models are available at https://github.com/ashkamath/mdetr.

Neural potts model

Authors

Tom Sercu,Robert Verkuil,Joshua Meier,Brandon Amos,Zeming Lin,Caroline Chen,Jason Liu,Yann LeCun,Alexander Rives

Journal

bioRxiv

Published Date

2021/4/11

We propose the Neural Potts Model objective as an amortized optimization problem. The objective enables training a single model with shared parameters to explicitly model energy landscapes across multiple protein families. Given a protein sequence as input, the model is trained to predict a pairwise coupling matrix for a Potts model energy function describing the local evolutionary landscape of the sequence. Couplings can be predicted for novel sequences. A controlled ablation experiment assessing unsupervised contact prediction on sets of related protein families finds a gain from amortization for low-depth multiple sequence alignments; the result is then confirmed on a database with broad coverage of protein sequences.

Sparse coding with multi-layer decoders using variance regularization

Authors

Katrina Evtimova,Yann LeCun

Journal

arXiv preprint arXiv:2112.09214

Published Date

2021/12/16

Sparse representations of images are useful in many computer vision applications. Sparse coding with an l1 penalty and a learned linear dictionary requires regularization of the dictionary to prevent a collapse in the norms of the codes. Typically, this regularization entails bounding the Euclidean norms of the dictionary's elements. In this work, we propose a novel sparse coding protocol which prevents a collapse in the codes without the need to regularize the decoder. Our method regularizes the codes directly so that each latent code component has variance greater than a fixed threshold over a set of sparse representations for a given set of inputs. Furthermore, we explore ways to effectively train sparse coding systems with multi-layer decoders since they can model more complex relationships than linear dictionaries. In our experiments with MNIST and natural image patches, we show that decoders learned with our approach have interpretable features both in the linear and multi-layer case. Moreover, we show that sparse autoencoders with multi-layer decoders trained using our variance regularization method produce higher quality reconstructions with sparser representations when compared to autoencoders with linear dictionaries. Additionally, sparse representations obtained with our variance regularization approach are useful in the downstream tasks of denoising and classification in the low-data regime.

Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors

Authors

Zeyu Yun,Yubei Chen,Bruno A Olshausen,Yann LeCun

Journal

arXiv preprint arXiv:2103.15949

Published Date

2021/3/29

Transformer networks have revolutionized NLP representation learning since they were introduced. Though a great effort has been made to explain the representation in transformers, it is widely recognized that our understanding is not sufficient. One important reason is that there are not enough visualization tools for detailed analysis. In this paper, we propose to use dictionary learning to open up these "black boxes" as linear superpositions of transformer factors. Through visualization, we demonstrate the hierarchical semantic structures captured by the transformer factors, e.g., word-level polysemy disambiguation, sentence-level pattern formation, and long-range dependency. While some of these patterns confirm the conventional prior linguistic knowledge, the rest are relatively unexpected, which may provide new insights. We hope this visualization tool can bring further knowledge and a better understanding of how transformer networks work. The code is available at https://github.com/zeyuyun1/TransformerVis

Understanding dimensional collapse in contrastive self-supervised learning

Authors

Li Jing,Pascal Vincent,Yann LeCun,Yuandong Tian

Journal

arXiv preprint arXiv:2110.09348

Published Date

2021/10/18

Self-supervised visual representation learning aims to learn useful representations without relying on human annotations. The joint embedding approach is based on maximizing the agreement between embedding vectors from different views of the same image. Various methods have been proposed to solve the collapsing problem where all embedding vectors collapse to a trivial constant solution. Among these methods, contrastive learning prevents collapse via negative sample pairs. It has been shown that non-contrastive methods suffer from a lesser collapse problem of a different nature: dimensional collapse, whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space. Here, we show that dimensional collapse also happens in contrastive learning. In this paper, we shed light on the dynamics at play in contrastive learning that leads to dimensional collapse. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on an explicit trainable projector. Experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet.
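
Dimensional collapse can be diagnosed directly from the spectrum of the embedding covariance matrix: a collapsed representation has many near-zero singular values. The sketch below computes that spectrum on synthetic embeddings, one full-rank and one deliberately rank-deficient.

# Diagnostic for dimensional collapse: the singular value spectrum of the
# embedding covariance matrix. Embeddings here are synthetic.
import torch

def embedding_spectrum(z):
    z = z - z.mean(dim=0)
    cov = z.T @ z / (len(z) - 1)
    return torch.linalg.svdvals(cov)

full_rank = torch.randn(2048, 128)
collapsed = torch.randn(2048, 16) @ torch.randn(16, 128)   # embeddings span only 16 dims
print(embedding_spectrum(full_rank)[-5:])                   # tail stays well above zero
print(embedding_spectrum(collapsed)[-5:])                   # tail is (numerically) zero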

Inspirational adversarial image generation

Authors

Baptiste Rozière,Morgane Riviere,Olivier Teytaud,Jérémy Rapin,Yann LeCun,Camille Couprie

Journal

IEEE Transactions on Image Processing

Published Date

2021/3/18

The task of image generation started receiving some attention from artists and designers, providing inspiration for new creations. However, exploiting the results of deep generative models such as Generative Adversarial Networks can be long and tedious given the lack of existing tools. In this work, we propose a simple strategy to inspire creators with new generations learned from a dataset of their choice, while providing some control over the output. We design a simple optimization method to find the optimal latent parameters corresponding to the closest generation to any input inspirational image. Specifically, we allow the generation given an inspirational image of the user’s choosing by performing several optimization steps to recover optimal parameters from the model’s latent space. We tested several exploration methods from classical gradient descents to gradient-free optimizers. Many gradient-free …
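
The latent-recovery step described here, finding the latent vector whose generation is closest to an inspirational image, reduces to plain gradient descent in latent space when a gradient-based optimizer is chosen. The sketch below uses an untrained placeholder generator and a simple MSE objective; the paper also explores gradient-free optimizers and richer losses.

# Sketch of latent recovery: optimize z so that G(z) approaches a target image.
# The generator here is an untrained placeholder, not a pretrained GAN.
import torch

generator = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.ReLU(),
                                torch.nn.Linear(256, 3 * 32 * 32), torch.nn.Tanh())
target = torch.rand(3 * 32 * 32) * 2 - 1             # stand-in for an inspirational image

z = torch.randn(64, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)
for step in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(generator(z), target)
    loss.backward()
    opt.step()
print(f"final reconstruction error: {loss.item():.4f}")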

Learning in high dimension always amounts to extrapolation

Authors

Randall Balestriero,Jerome Pesenti,Yann LeCun

Journal

arXiv preprint arXiv:2110.09485

Published Date

2021/10/18

The notion of interpolation and extrapolation is fundamental in various fields from deep learning to function approximation. Interpolation occurs for a sample whenever this sample falls inside or on the boundary of the given dataset's convex hull. Extrapolation occurs when the sample falls outside of that convex hull. One fundamental (mis)conception is that state-of-the-art algorithms work so well because of their ability to correctly interpolate training data. A second (mis)conception is that interpolation happens throughout tasks and datasets, in fact, many intuitions and theories rely on that assumption. We empirically and theoretically argue against those two points and demonstrate that on any high-dimensional (>100) dataset, interpolation almost surely never happens. Those results challenge the validity of our current interpolation/extrapolation definition as an indicator of generalization performances.
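
The paper's definition of interpolation, a sample lying inside the convex hull of the dataset, can be tested exactly with a small feasibility linear program, which makes the dimensionality effect easy to reproduce on synthetic data. SciPy is assumed; sample counts and dimensions below are arbitrary.

# Convex hull membership (the paper's definition of interpolation) checked via
# a feasibility LP: does a convex combination of the data reproduce the point?
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, data):
    n = len(data)
    # find lambda >= 0 with sum(lambda) = 1 and data.T @ lambda = point
    a_eq = np.vstack([data.T, np.ones(n)])
    b_eq = np.append(point, 1.0)
    res = linprog(c=np.zeros(n), A_eq=a_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.success

rng = np.random.default_rng(0)
for dim in (2, 10, 50):
    data = rng.normal(size=(100, dim))
    hits = sum(in_convex_hull(rng.normal(size=dim), data) for _ in range(50))
    print(f"dim={dim}: {hits}/50 test points interpolate")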

Barlow twins: Self-supervised learning via redundancy reduction

Authors

Jure Zbontar,Li Jing,Ishan Misra,Yann LeCun,Stéphane Deny

Published Date

2021/3/4

Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn embeddings which are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant solutions. Most current methods avoid such solutions by careful implementation details. We propose an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. The method is called Barlow Twins, owing to neuroscientist H. Barlow’s redundancy-reduction principle applied to a pair of identical networks. Barlow Twins does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly it benefits from very high-dimensional output vectors. Barlow Twins outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with current state of the art for ImageNet classification with a linear classifier head, and for transfer tasks of classification and object detection.
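
The Barlow Twins criterion fits in a few lines: batch-normalize the two views' embeddings, form their cross-correlation matrix, and push it toward the identity. The sketch below omits the backbone, projector, and augmentations, and the trade-off coefficient only follows the order of magnitude reported in the paper.

# Minimal Barlow Twins objective: drive the cross-correlation matrix of the two
# views' batch-normalized embeddings toward the identity. Illustrative only.
import torch

def barlow_twins_loss(z_a, z_b, lambd=5e-3):
    n = len(z_a)
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)         # normalize each dimension over the batch
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = z_a.T @ z_b / n                            # cross-correlation matrix (D x D)
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()                 # invariance: diagonal -> 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()    # redundancy reduction
    return on_diag + lambd * off_diag

print(barlow_twins_loss(torch.randn(256, 128), torch.randn(256, 128)))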

Recurrent parameter generators

Authors

Jiayun Wang,Yubei Chen,Stella Yu,Brian Cheung,Yann LeCun

Published Date

2021/10/6

We present a generic method for recurrently using the same parameters across many different convolution layers to build a deep network. Specifically, for a network, we create a recurrent parameter generator (RPG) from which the parameters of each convolution layer are generated. Though using recurrent models to build a deep convolutional neural network (CNN) is not entirely new, our method achieves significant performance gains compared to existing work. We demonstrate how to build a one-layer-size neural network that achieves performance similar to traditional CNN models on various applications and datasets. We use the RPG to build a ResNet18 network with the number of weights equivalent to one convolutional layer of a conventional ResNet and show this model can achieve ImageNet top-1 accuracy. Additionally, such a method allows us to build an arbitrarily complex neural network with any number of parameters. For example, we build a ResNet34 with model parameters reduced by more than times, which still achieves ImageNet top-1 accuracy. Furthermore, the RPG can be further pruned and quantized for better run-time performance in addition to the model-size reduction. We provide a new perspective on model compression: rather than shrinking parameters from a large model, RPG sets a certain parameter-size constraint and uses gradient descent to automatically find the best model under that constraint. Extensive experimental results demonstrate the power of the proposed recurrent parameter generator.
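The sketch below illustrates the core idea of generating every convolution layer's weights from one shared parameter bank; the fixed-stride slicing used here is an illustrative assumption and not the paper's exact generation scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentParamGenerator(nn.Module):
    """Toy parameter generator: all layers draw their weights from one shared bank."""

    def __init__(self, bank_size, layer_shapes):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(bank_size) * 0.02)  # shared parameters
        self.layer_shapes = layer_shapes

    def weight_for(self, layer_idx):
        shape = self.layer_shapes[layer_idx]
        numel = int(torch.tensor(shape).prod())
        # Wrap around the bank with a layer-dependent offset so different layers
        # reuse the same underlying parameters (and share their gradients).
        idx = (torch.arange(numel) + layer_idx * 9973) % self.bank.numel()
        return self.bank[idx].view(shape)

gen = RecurrentParamGenerator(bank_size=100_000,
                              layer_shapes=[(64, 3, 3, 3), (64, 64, 3, 3)])
x = torch.randn(1, 3, 32, 32)
h = F.conv2d(x, gen.weight_for(0), padding=1)
h = F.conv2d(h, gen.weight_for(1), padding=1)
```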

Implicit Rank-Minimizing Autoencoder

Authors

Li Jing,Jure Zbontar,Yann LeCun

Published Date

2020/10/1

An important component of autoencoder methods is the mechanism by which the information capacity of the latent representation is minimized or limited. In this work, the rank of the covariance matrix of the codes is implicitly minimized by relying on the fact that gradient descent learning in multi-layer linear networks leads to minimum-rank solutions. By inserting a number of extra linear layers between the encoder and the decoder, the system spontaneously learns representations with a low effective dimension. The model, dubbed Implicit Rank-Minimizing Autoencoder (IRMAE), is simple, deterministic, and learns a continuous latent space. We demonstrate the validity of the method on several image generation and representation learning tasks.
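Since the construction is just a stack of extra linear layers between encoder and decoder, a minimal sketch is short; `encoder` and `decoder` below are placeholders for any autoencoder pair with a matching code dimension, and the number of inserted layers is an arbitrary example.

```python
import torch.nn as nn

def implicit_rank_minimizing_ae(encoder, decoder, code_dim, num_linear=4):
    """Insert extra linear layers (no nonlinearity) between encoder and decoder.

    Gradient descent through this linear stack implicitly biases the code's
    covariance toward low rank, per the mechanism described above.
    """
    linear_stack = nn.Sequential(*[nn.Linear(code_dim, code_dim) for _ in range(num_linear)])
    return nn.Sequential(encoder, linear_stack, decoder)
```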

Methods, systems and media for detecting spoofing in mobile authentication

Published Date

2020/9/24

The mind of a mouse

Authors

Larry F Abbott,Davi D Bock,Edward M Callaway,Winfried Denk,Catherine Dulac,Adrienne L Fairhall,Ila Fiete,Kristen M Harris,Moritz Helmstaedter,Viren Jain,Narayanan Kasthuri,Yann LeCun,Jeff W Lichtman,Peter B Littlewood,Liqun Luo,John HR Maunsell,R Clay Reid,Bruce R Rosen,Gerald M Rubin,Terrence J Sejnowski,H Sebastian Seung,Karel Svoboda,David W Tank,Doris Tsao,David C Van Essen

Journal

Cell

Published Date

2020/9/17

Large scientific projects in genomics and astronomy are influential not because they answer any single question but because they enable investigation of continuously arising new questions from the same data-rich sources. Advances in automated mapping of the brain's synaptic connections (connectomics) suggest that the complicated circuits underlying brain function are ripe for analysis. We discuss benefits of mapping a mouse brain at the level of synapses.

System and method for biometric authentication in connection with camera-equipped devices

Published Date

2020/7/28

The present invention relates generally to the use of biometric technology for authentication and identification, and more particularly to non-contact based solutions for authenticating and identifying users, via computers, such as mobile devices, to selectively permit or deny access to various resources. In the present invention, authentication and/or identification is performed using an image or a set of images of an individual's palm through a process involving the following key steps: (1) detecting the palm area using local classifiers; (2) extracting features from the region(s) of interest; and (3) computing the matching score against user models stored in a database, which can be augmented dynamically through a learning process.
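As a purely illustrative outline of the three-step structure described above (detect, extract, match), the toy sketch below uses stand-in components (a fixed crop, an intensity histogram, cosine similarity); none of the function names or choices come from the patent.

```python
import numpy as np

def detect_palm_region(image):
    # Stand-in for the local classifiers: take a fixed central crop.
    h, w = image.shape[:2]
    return image[h // 4: 3 * h // 4, w // 4: 3 * w // 4]

def extract_features(region):
    # Stand-in feature extractor: a normalized intensity histogram.
    hist, _ = np.histogram(region, bins=64, range=(0, 255))
    return hist / (hist.sum() + 1e-8)

def match_score(features, user_template):
    # Cosine similarity against the enrolled user model.
    denom = np.linalg.norm(features) * np.linalg.norm(user_template) + 1e-8
    return float(features @ user_template / denom)

def authenticate(image, user_template, threshold=0.9):
    return match_score(extract_features(detect_palm_region(image)), user_template) >= threshold
```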

Yann LeCun FAQs

What is Yann LeCun's h-index at New York University?

Yann LeCun's h-index is 145 in total and 113 since 2020.

What are Yann LeCun's top articles?

Yann LeCun's top articles at New York University include:

Learning and Leveraging World Models in Visual Representation Learning

EgoPet: Egomotion and Interaction Data from an Animal's Perspective

Fast and exact enumeration of deep networks partitions regions

An Information Theory Perspective on Variance-Invariance-Covariance Regularization

Eyes wide shut? exploring the visual shortcomings of multimodal llms

G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering

Learning by Reconstruction Produces Uninformative Features For Perception

To Compress or Not to Compress—Self-Supervised Learning and Information Theory: A Review

...

What are Yann LeCun's research interests?

The research interests of Yann LeCun are AI, machine learning, computer vision, robotics, and image compression.

What is Yann LeCun's total number of citations?

Yann LeCun has 338,733 citations in total.

What are the co-authors of Yann LeCun?

The co-authors of Yann LeCun include Yoshua Bengio, Rob Fergus, Bernhard Boser, Richard E. Howard, Joan Bruna, and Clement Farabet.

Co-Authors

Yoshua Bengio, Université de Montréal (H-index: 227)

Rob Fergus, New York University (H-index: 83)

Bernhard Boser, University of California, Berkeley (H-index: 66)

Richard E. Howard, Rutgers, The State University of New Jersey (H-index: 60)

Joan Bruna, New York University (H-index: 45)

Clement Farabet, New York University (H-index: 21)
