Unsupervised data augmentation for consistency training

Advances in neural information processing systems

Published On 2020

Semi-supervised learning has recently shown much promise in improving deep learning models when labeled data is scarce. Common among recent approaches is the use of consistency training on a large amount of unlabeled data to constrain model predictions to be invariant to input noise. In this work, we present a new perspective on how to effectively noise unlabeled examples and argue that the quality of noising, specifically the noise produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning. By substituting simple noising operations with advanced data augmentation methods such as RandAugment and back-translation, our method brings substantial improvements across six language and three vision tasks under the same consistency training framework. On the IMDb text classification dataset, with only 20 labeled examples, our method achieves an error rate of 4.20, outperforming the state-of-the-art model trained on 25,000 labeled examples. On a standard semi-supervised learning benchmark, CIFAR-10, our method outperforms all previous approaches and achieves an error rate of 5.43 with only 250 examples. Our method also combines well with transfer learning, e.g., when finetuning from BERT, and yields improvements in the high-data regime, such as ImageNet, both when only 10% of the data is labeled and when the full labeled set is used with 1.3M extra unlabeled examples. Code is available at https://github.com/google-research/uda.
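The consistency training objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`cross_entropy`, `consistency_loss`, `combined_loss`) and the `weight` parameter are hypothetical, and the sketch assumes the model's logits for labeled, unlabeled, and augmented-unlabeled inputs have already been computed. It only shows the general shape of the idea, a supervised loss on the few labeled examples plus a divergence term that penalizes predictions that change when an unlabeled example is augmented.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the integer class labels.
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())

def consistency_loss(logits_clean, logits_augmented):
    # Mean KL(p_clean || p_augmented) over the unlabeled batch.
    # Assumption: in a real training loop the clean prediction would
    # typically be held fixed (no gradient flows through it).
    p = softmax(logits_clean)
    q = softmax(logits_augmented)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean())

def combined_loss(sup_logits, sup_labels, unsup_logits, unsup_aug_logits, weight=1.0):
    # Supervised term on labeled data plus weighted consistency term
    # on unlabeled data. `weight` is a hypothetical hyperparameter.
    supervised = cross_entropy(sup_logits, sup_labels)
    unsupervised = consistency_loss(unsup_logits, unsup_aug_logits)
    return supervised + weight * unsupervised

# Toy usage with made-up logits for a 3-class problem.
sup_logits = np.array([[2.0, 0.1, -1.0], [0.0, 1.5, 0.3]])
sup_labels = np.array([0, 1])
unsup_logits = np.array([[1.0, 0.2, 0.1], [0.3, 0.3, 2.0]])
unsup_aug_logits = np.array([[0.8, 0.4, 0.1], [0.1, 0.6, 1.7]])
total = combined_loss(sup_logits, sup_labels, unsup_logits, unsup_aug_logits)
```

In this sketch the augmentation itself (for example RandAugment for images or back-translation for text, as named in the abstract) happens before the model is called; the loss only sees the resulting logits.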

Page

6256-6268

Authors

Eduard Hovy

Carnegie Mellon University

H-Index

104

Research Interests

NLP

Other articles from Advances in neural information processing systems journal

Min Zhang

Soochow University

Advances in Neural Information Processing Systems

Beyond MLE: Convex Learning for Text Generation

Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution that best explain the observed data. In the context of text generation, MLE is often used to train generative language models, which can then be used to generate new text. However, we argue that MLE is not always necessary or optimal, especially for closed-ended text generation tasks like machine translation. In these tasks, the goal of the model is to generate the most appropriate response, which does not necessarily require it to estimate the entire data distribution with MLE. To this end, we propose a novel class of training objectives based on convex functions, which enables text generation models to focus on highly probable outputs without having to estimate the entire data distribution. We investigate the theoretical properties of the optimal predicted distribution when applying convex functions to the loss, demonstrating that convex functions can sharpen the optimal distribution, thereby enabling the model to better capture outputs with high probabilities. Experiments on various text generation tasks and models show the effectiveness of our approach. It enables autoregressive models to bridge the gap between greedy and beam search, and facilitates the learning of non-autoregressive models with a maximum improvement of 9+ BLEU points. Moreover, our approach also exhibits significant impact on large language models (LLMs), substantially enhancing their generative capability on various tasks. Source code is available at https://github.com/ictnlp/Convex-Learning.

2024/2/13

Other Articles from authors

A survey of data augmentation approaches for NLP

Self-training with noisy student improves imagenet classification
