Data Augmentation

2783 papers with code • 3 benchmarks • 63 datasets

Data augmentation refers to techniques that expand a dataset by applying modifications to its existing examples. It not only grows the dataset but also increases its diversity. When training machine learning models, data augmentation acts as a regularizer and helps to avoid overfitting.

Data augmentation techniques have proven useful in domains such as NLP and computer vision. In computer vision, common transformations include cropping, flipping, and rotation. In NLP, techniques include swapping, deletion, and random insertion, among others.

Further readings:

  • A Survey of Data Augmentation Approaches for NLP
  • A survey on Image Data Augmentation for Deep Learning

Benchmarks

Best models: DeiT-B (+MixPro), Shake-Shake (26 2×96d) (Faster AA), DiffAug

Most implemented papers

YOLOv4: Optimal Speed and Accuracy of Object Detection

There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy.

Improved Baselines with Momentum Contrastive Learning

Contrastive unsupervised learning has recently shown encouraging progress, e.g., in Momentum Contrast (MoCo) and SimCLR.

AutoAugment: Learning Augmentation Policies from Data

In our implementation, we have designed a search space where a policy consists of many sub-policies, one of which is randomly chosen for each image in each mini-batch.

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model.

Improved Regularization of Convolutional Neural Networks with Cutout

Convolutional neural networks are capable of learning powerful representational spaces, which are necessary for tackling complex learning tasks.

3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation

This paper introduces a network for volumetric segmentation that learns from sparsely annotated volumetric images.

SimCSE: Simple Contrastive Learning of Sentence Embeddings

This paper presents SimCSE, a simple contrastive learning framework that greatly advances state-of-the-art sentence embeddings.

Supervised Contrastive Learning

Contrastive learning applied to self-supervised representation learning has seen a resurgence in recent years, leading to state of the art performance in the unsupervised training of deep image models.

EfficientNetV2: Smaller Models and Faster Training

By pretraining on the same ImageNet21k, our EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT by 2.0% accuracy while training 5x-11x faster using the same computing resources.

Unsupervised Data Augmentation for Consistency Training

In this work, we present a new perspective on how to effectively noise unlabeled examples and argue that the quality of noising, specifically those produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning.

  • Survey Paper
  • Open access
  • Published: 19 July 2021

Text Data Augmentation for Deep Learning

  • Connor Shorten (ORCID: 0000-0001-6253-6861),
  • Taghi M. Khoshgoftaar &
  • Borko Furht

Journal of Big Data, volume 8, Article number: 101 (2021)

Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form. We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data. Deep Learning generally struggles with the measurement of generalization and characterization of overfitting. We highlight studies that cover how augmentations can construct test sets for generalization. NLP is at an early stage in applying Data Augmentation compared to Computer Vision. We highlight the key differences and promising ideas that have yet to be tested in NLP. For the sake of practical implementation, we describe tools that facilitate Data Augmentation such as the use of consistency regularization, controllers, and offline and online augmentation pipelines, to preview a few. Finally, we discuss interesting topics around Data Augmentation in NLP such as task-specific augmentations, the use of prior knowledge in self-supervised learning versus Data Augmentation, intersections with transfer and multi-task learning, and ideas for AI-GAs (AI-Generating Algorithms). We hope this paper inspires further research interest in Text Data Augmentation.

Introduction

Nearly all the successes of Deep Learning stem from supervised learning. Supervised learning describes the use of loss functions that align predictions with manually annotated ground truth. Deep Learning can achieve remarkable performance through the combination of this learning strategy and large labeled datasets. The problem is that collecting these annotated datasets is very difficult at the scale required. For example, one of the key Deep Learning applications for COVID-19 rapid response was question answering [ 1 ]. Tang et al. [ 2 ] constructed COVID-QA, a supervised learning dataset in which articles are annotated with an answer span to a given question. The authors of the paper describe working for 23 hours to produce 124 question-answer pairs. Fitting 124 question-answer annotations without overfitting is extremely challenging in the current state of Deep Learning. In addition to question answering, Natural Language Processing (NLP) researchers are also exploring the application of abstractive summarization in which a model outputs a novel summary from a collection of input documents. Cachola et al. [ 3 ] were able to collect a dataset of 5.4K Too Long; Didn’t Read (TLDR) summaries of 3.2K machine learning papers. This required employing 28 undergraduate students to refine data bootstrapped from the OpenReview platform. These anecdotes are provided to highlight the difficulty of curating annotated big data for knowledge-intensive NLP tasks with millions of examples.

The Deep Learning research community is currently exploring many solutions to the problem of learning without labeled big data. In addition to Data Augmentation, self-supervised learning and transfer learning have performed very well. Few and zero-shot learning are categories of research gaining interest as well. In this survey, we explore getting more performance out of the supervised data available with Data Augmentation. Our survey additionally explores how Data Augmentation is driving key advances in learning strategies outside of supervised learning. This includes self-supervised learning from unlabeled datasets, and transfer learning from other domains, whether that data is labeled or unlabeled.

Data Augmentation describes a set of algorithms that construct synthetic data from an available dataset. This synthetic data typically contains small changes in the data that the model’s predictions should be invariant to. Synthetic data can also represent combinations between distant examples that would be very difficult to infer otherwise. Data Augmentation is one of the most useful interfaces to influence the training of Deep Neural Networks. This is largely due to the interpretable nature of the transformations and the window to observe how the model is failing.

Preventing overfitting is the most common use case of Data Augmentation. Without augmentation, or regularization more generally, Deep Neural Networks are prone to learning spurious correlations and memorizing high-frequency patterns that are difficult for humans to detect. In NLP, this could describe high-frequency numeric patterns in token embeddings, or memorization of particular forms of language that do not generalize. Data Augmentation can mitigate these types of overfitting by shuffling the particular forms of language. To overcome the noisy data, the model must resort to learning abstractions of information, which are more likely to generalize.

Data Augmentation is a regularization strategy. Other regularization techniques have been developed such as dropout [ 4 ] or weight penalties [ 5 ]. These techniques apply functional regularization by either adding noise to intermediate activations of the network or adding constraints to the functional form. These techniques have found successes, but they lack the power to express the esoteric concept of semantic invariance. Data Augmentation enables an intuitive interface for demonstrating label-preserving transformations.

Our survey presents several strategies for applying Data Augmentation to text data. We cluster these augmentations into symbolic or neural methods. Symbolic methods use rules or discrete data structures to form synthetic examples. This includes Rule-Based Augmentations, Graph-Structured Augmentations, Feature-Space Augmentation, and MixUp. Neural augmentations use a deep neural network trained on a different task to augment data. Neural augmentations surveyed include Back-Translation, Generative Data Augmentation, and Style Augmentation. In addition to symbolic vs. neural-based augmentations, we highlight other distinctions between augmentations such as task-specific versus task-agnostic augmentations and form versus meaning augmentations. We describe these distinctions further throughout our survey.

Generalization is the core challenge of Deep Learning. How far can we extrapolate from the instances available? The same interface used to control the training data is also useful for simulating potential test sets and distribution shifts. We can simulate distribution shift by applying augmentations to a dataset, such as adding random tokens to the inputs of an email spam detector or increasing the prevalence of tokens that lie on the long tail of the frequency distribution. These simulated shifts can also describe higher-level linguistic phenomena, such as deeper fact chaining than what was seen in the training set, or the ability to change predictions given counterfactual evidence. As our tools for Generative Data Augmentation continue to improve, we will be able to simulate more semantic distribution shifts. This looks like a very promising direction to advance generalization testing.
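As a minimal sketch of simulating a distribution shift by adding random tokens, the following helper inserts noise tokens at random positions in a text. The function name `inject_random_tokens`, the noise vocabulary, and the example email are hypothetical choices made for illustration and do not come from the survey.

```python
import random

def inject_random_tokens(text, vocab, n_insert, seed=0):
    """Insert n_insert tokens drawn from vocab at random positions in text."""
    rng = random.Random(seed)
    tokens = text.split()
    for _ in range(n_insert):
        # randint is inclusive on both ends, so a token may also be appended.
        pos = rng.randint(0, len(tokens))
        tokens.insert(pos, rng.choice(vocab))
    return " ".join(tokens)

# Hypothetical spam example and noise vocabulary.
email = "claim your free prize now"
noise_vocab = ["meeting", "invoice", "regards", "schedule"]
shifted = inject_random_tokens(email, noise_vocab, n_insert=3)
print(shifted)
```

Evaluating a trained classifier on both the original and the shifted inputs gives a rough view of how sensitive it is to this kind of perturbation.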

Our survey on Text Data Augmentation for Deep Learning builds on our work surveying Image Data Augmentation for Deep Learning [ 6 ]. In Computer Vision, this describes applying transformations such as rotating images, horizontally flipping them, or increasing the brightness to form augmented examples. We found that it is currently much easier to apply label-preserving transformations in Computer Vision than NLP. It is additionally easier to stack these augmentations in Computer Vision, enabling even more diversity in the augmented set, which has been shown to be a key contributor to success. Data Augmentation research has been more thoroughly explored in Computer Vision than NLP. We present some ideas that have found interesting results with images, but remain to be tested in the text data domain. Finally, we discuss the intersection of visual supervision for language understanding and how vision-language models may help overcome the grounding problem. We discuss the grounding problem in greater detail under our Motifs Of Data Augmentation section.
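To make the image transformations mentioned above concrete, here is a minimal sketch that treats a grayscale image as a nested list of pixel intensities in the range 0 to 255. The representation and the function names are assumptions made for illustration; practical pipelines would typically rely on an image library instead.

```python
def horizontal_flip(img):
    """Mirror each row of the image left to right."""
    return [row[::-1] for row in img]

def rotate_90_clockwise(img):
    """Rotate the image by 90 degrees clockwise."""
    # Reversing the row order and transposing yields a clockwise rotation.
    return [list(col) for col in zip(*img[::-1])]

def adjust_brightness(img, delta):
    """Add delta to every pixel, clipping to the valid range [0, 255]."""
    return [[min(255, max(0, p + delta)) for p in row] for row in img]

image = [[1, 2],
         [3, 4]]

# Stacking transformations, as is common in Computer Vision pipelines.
augmented = adjust_brightness(rotate_90_clockwise(horizontal_flip(image)), 10)
print(augmented)
```

Each of these transformations preserves the label of a typical object classification image, which is why they stack so easily.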

Our next section presents practical implementation decisions for text data augmentation. We begin by describing the use of a consistency regularization loss to further influence the impact of augmented data. Differently from consistency regularization, contrastive learning additionally uses negative examples to structure the loss function. The next key question is how to control the strength and sampling of each augmentation. Augmentation controllers apply a meta-level abstraction to the hyperparameters of augmentation selection and the magnitude of the transformation. This is commonly explored with an adversarial controller that aims to produce mistakes in the model. We also describe controllers that search for performance improvements such as AutoAugment [ 7 ], Population-Based Augmentation [ 8 ], and RandAugment [ 9 ]. Although similar in concept, we discuss the key distinction between augmentation controllers and curriculum learning. Another important consideration for implementing Data Augmentation is the CPU to GPU transfer in the preprocessing pipeline, as well as the conceptual understanding of offline versus online augmentation. Finally, we describe the application of augmentation to alleviate issues caused by class imbalance.
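One common way to express a consistency regularization loss is as a divergence between the model's predicted distributions on an original example and on its augmented version. The sketch below assumes a KL divergence over softmax outputs; this particular choice, and the example logits, are illustrative assumptions rather than the formulation used by any specific surveyed work.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(logits_original, logits_augmented):
    """Penalize disagreement between predictions on original and augmented inputs."""
    p = softmax(logits_original)
    q = softmax(logits_augmented)
    return kl_divergence(p, q)

# Hypothetical logits for one example and its augmented copy.
print(consistency_loss([2.0, 0.5, -1.0], [1.8, 0.7, -0.9]))
```

In training, a term like this would typically be added to the supervised loss with a weighting coefficient; contrastive objectives additionally introduce negative examples.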

Our Discussion section presents opportunities to explore text data augmentation. We begin with task-specific augmentations describing how key NLP tasks such as question answering differ from natural language inference, particularly with respect to input length or the categorization as a knowledge-intensive task. We quickly previewed that self-supervised and transfer learning are also emerging solutions to learning with limited labeled data. We discuss the use of Data Augmentation in self-supervised learning and then recent works with transfer and multi-task learning. Finally, we discuss AI-GAs, short for AI-generating Algorithms [ 10 ]. This is a very interesting idea encompassing papers such as POET [ 11 ], Generative Teaching Networks [ 12 ], and the Synthetic Petri Dish [ 13 ] which describe algorithms that learn the environment to learn from. We present how this differs from augmentation controllers or curriculum learning, the idea of skill acquisition from artificial data, and opportunities to test these ideas in NLP.

Data Augmentation for NLP prevents overfitting, provides the easiest way to inject prior knowledge into a Deep Learning system, and offers a view into the generalization ability of these models. Our survey is organized as follows:

  • We begin with the key Motifs Of Data Augmentation that augmentations strive to achieve.
  • We provide a list of Text Data Augmentations. This list can be summarized into symbolic augmentations, using rules and graph-structured decomposition to form new examples, and neural augmentations, that use auxiliary neural networks to sample new data.
  • Following our list of available augmentations, we dive deeper into Testing Generalization with Data Augmentation.
  • We continue with a comparison of Image versus Text Augmentation.
  • Returning to Text Data Augmentation, we describe Practical Considerations for Implementation.
  • Finally, we present interesting ideas and research questions in our Discussion section.
  • Our Conclusion briefly summarizes the motivation and findings of our survey.

Data Augmentation has been a heavily studied area of Machine Learning. The advancement of the prior knowledge encoded in augmentations is one of the key distinctions between previous works and now. As we will discuss in depth later in the survey, the success of Data Augmentation in Computer Vision has been fueled by the ease of designing label-preserving transformations. For example, a cat image is still a cat after rotating it, translating it on the x or y axis, increasing the intensity of the red channel, and so on. It is easy to brainstorm these semantically-preserving augmentations for images, whereas it is much harder to do this in the text domain.

We believe our survey on text data augmentation is well-timed. Why now, and what has changed recently? Recent advances in generative modeling such as StyleGAN for images, GPT-3 for text [ 14 ], and DALL-E unifying both text and images [ 15 ], have been astounding. We summarize many exciting works on the use of prompting for adapting language models for downstream tasks. As discussed in further detail later on, we believe these advances in generative modeling could be game changing for the way we store datasets and build Deep Learning models. More particularly, it could become common to use labeled datasets solely for the sake of evaluation, rather than representation learning.

Our survey has some similarities to Feng et al. [ 16 ], which was published at roughly the same time as ours. Both surveys seek a clear definition of Data Augmentation and aim to highlight key motifs. Additionally, both surveys narrate the development of NLP augmentation around the successes of augmentation in Computer Vision and how these may transfer. Feng et al. [ 16 ] provide a deeper enumeration of task-specific augmentation than is covered in our survey. Our survey adds important concepts such as the debate between Meaning versus Form, Counterfactual Examples, and the use of prompts in Generative Data Augmentation.

Many of the successes of Deep Learning stem from access to large labeled datasets such as ImageNet [ 17 ]. However, constructing these datasets is very challenging and time-consuming. Therefore, researchers are looking for alternative ways to leverage data without manual annotation. This is a large motivation behind the success of self-supervised language modeling with papers such as GPT-3 [ 14 ] or BERT [ 18 ]. Data Augmentation shares this motivation: overcoming the challenge of learning with limited labeled data and avoiding manual labeling. For example, many of the surveyed studies highlight the success of their algorithms when sub-setting the labeled data.

Transfer Learning has been one of the most effective solutions to this challenge of learning from limited labeled datasets [ 19 ]. Transfer Learning refers to initializing a model with the weights learned from a previous task. This previous task usually has the benefit of big data, whether that data is labeled, such as ImageNet, or unlabeled, as is used in self-supervised language models. There are many research questions around the procedure of Transfer Learning. In our Discussion section we discuss opportunities with Data Augmentation such as freezing the base feature extractor and training separate heads on the original and augmented datasets.

Self-supervised learning describes a general set of algorithms that learn from unlabeled data with supervised learning. This is done by algorithmically labeling the data. Some of the most popular self-supervised learning tasks include generation, contrastive learning, and pretext tasks. Generation describes how language models are trained. A token is algorithmically selected to be masked out and the masked out token is used as the label for supervised learning. Contrastive learning aligns representations of data algorithmically determined to be similar (usually through the use of augmentations), and distances these representations from negatives (usually other samples in the mini-batch). Pretext tasks describe ideas such as applying an augmentation to data and tasking the model to predict the transformation. The augmentation interface powers many task constructions in self-supervised learning.
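The algorithmic labeling described above can be sketched for the masked-token case as follows. The `[MASK]` placeholder string, the function name, and the example sentence are assumptions made for illustration; real language models operate on subword token ids and often mask several positions at once.

```python
import random

MASK = "[MASK]"  # assumed placeholder symbol for the hidden token

def make_masked_example(tokens, rng):
    """Hide one randomly chosen token and return it as the training label."""
    idx = rng.randrange(len(tokens))
    masked = list(tokens)
    label = masked[idx]
    masked[idx] = MASK
    return masked, idx, label

rng = random.Random(0)
sentence = "data augmentation acts as a regularizer".split()
masked, idx, label = make_masked_example(sentence, rng)
print(masked, "->", label)
```

No human annotation is involved: the label is recovered from the unlabeled text itself, which is what makes the task self-supervised.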

Motifs of text data augmentation

This section introduces a unifying view of the objectives that the augmentations presented in the rest of the survey address. We introduce the key motifs of Text Data Augmentation as Strengthening Decision Boundaries, Brute Force Training, Causality and Counterfactual Examples, and the distinction between Meaning versus Form. These concepts dig into the understanding of Data Augmentation and its particular application to language processing.

Strengthening decision boundaries

Data Augmentation is commonly applied to classification problems where class boundaries are learned from label assignments. Augmented examples are typically only slightly different from existing samples. Training on these examples results in added space between the original example and its respective class boundary. Well defined class boundaries result in more robust classifiers and uncertainty estimates. For example, these boundaries are often reported with lower dimensional visualizations derived from t-SNE [ 20 ] or UMAP [ 21 ].

A key motif of Data Augmentation is to perturb data so that the model is more familiar with the local space around these examples. Expanding the radius around each example in the dataset helps the model get a better sense of the decision boundary and results in smoother interpolation paths. This refers to small changes to the original data points. In NLP this could be deleting or adding words, synonym swaps, or well-controlled paraphrases. The model becomes more robust to the local space and decision boundary based on available labels simply by increased exposure.

Brute force training

Deep Neural Networks are highly parametric models with very high variance that can easily model their training data. Fitting the training data is surprisingly robust to interpolation, or moving within the data points provided. What Deep Learning struggles with, as we will unpack in Generalization Testing with Data Augmentation, is extrapolating outside of data points provided during training. A potential solution to this is to brute force the data space with the training data.

The upper bound solution to many problems in Computer Science is to simply enumerate all candidate solutions. Brute force solutions rely on computing speed to overpower the complexity of a given problem. In Deep Learning, this entails training on an exhaustive set of natural language sequences such that all potential distributions the test set could be sampled from are covered in the training data. This way, even the most extreme edge cases will have been covered in the training set. The design of brute force training requires exhaustive coverage of the natural language manifold. A key question is whether this idea is reasonable. It may be better to identify key regions that are missing, although these are challenging to probe for and define.

Causality and counterfactual examples

Learning causal representations [ 22 ], as opposed to solely representing correlations, is vital to achieving the goals of Deep Learning. The field of Causal Inference demonstrates how to use interventions to establish causality. Reinforcement Learning is the most similar branch of Deep Learning research, in which an agent deliberately samples interventions to learn about its environment. In this survey, we consider how the results of interventions can be integrated into observational language data. This is also similar to the subset of Reinforcement Learning known as the offline setting [ 23 ].

Many of the Text Data Augmentations described throughout the survey utilize the terminology of Counterfactual Examples [ 24 ]. These Counterfactual Examples describe augmentations such as the introduction of negations or numeric alterations to flip the label of the example. The construction of counterfactuals in language generally relies on human expertise, rather than algorithmic construction. Although the model does not deliberately sample these interventions as in a randomized controlled trial, the hope is that it can still establish causal links between semantic concepts and labels by observing the result of interventions.

Liu et al. [ 25 ] lay the groundwork for formal causal language in Data Augmentation. This entails the use of structured causal models and the procedure of abduction, action, and prediction to generate counterfactual examples. These experiments rely on phrasal alignment between sequences in neural machine translation to sample counterfactual replacements. Their counterfactual augmentation improves on a baseline English to French translation system from 26.0 to 28.92 according to the BLEU metric. It seems possible that this phrasal alignment could be extended to other sequence-to-sequence problems such as abstractive question answering, summarization, or dialogue systems. This explicit counterfactual structure is different from most reviewed works that rather use natural language prompts to automate counterfactual sampling. For example, DINO [ 26 ] generates natural language inference data by either seeding the generation with “mean the same thing” or “are on completely different topics”. We think it is an interesting research direction to see if rigorous causal modeling such as computing the conditional probabilities of the context removing the variable [ 27 ] will provide benefits over prompts and large language models.

Meaning versus form

One of the most interesting ideas in language processing is the distinction between meaning and form. Bender and Koller [ 28 ] introduced the argument, providing several ideas and thought experiments. A particularly salient anecdote to illustrate this is known as the octopus example. In this example, two people are stranded on separate islands, communicating through an underwater cable. This underwater cable is intercepted by an intelligent octopus who learns to mimic the speaking patterns of each person. The octopus does this well enough that it can substitute for either person, as in the Turing test. However, when one of the stranded islanders encounters a bear and seeks advice, the octopus is unable to help. This is because the octopus has learned the form of their communication, but it has not learned the underlying meaning of the world that their language describes.

We will present many augmentations in this paper that aid in learning form. Similar to the concept of strengthening decision boundaries, ideas like synonym swap or rotating syntactic trees will help the octopus further strengthen its understanding of how language is generally organized. With respect to achieving an understanding of meaning in these models and defining this esoteric concept, many have turned to ideas in grounding and embodiment. Grounding typically refers to pairing language with other modalities such as vision-language or audio-language models. However, grounding can also refer to abstract concepts and worlds constructed solely from language. Embodiment refers to learning agents that act in their environment. Although Bender and Koller propose that meaning cannot be learned from form alone, many other works highlight different areas of the language modeling task such as assertions [ 29 ] or multiple embedded tasks [ 30 ] that could lead to learning meaning. Another useful way of thinking about meaning versus form could be to look at recently developed benchmarks in language processing, such as the distinction between GLUE [ 31 ] and SuperGLUE [ 32 ] tasks, which predominantly test an understanding of form, and knowledge-intensive tasks such as KILT [ 33 ], which better probe for meaning. In our survey, we generally use the terms “understanding” and “meaning” to describe passing black-box tests designed by humans. We believe that drilling into the definition of these terms is one of the most promising pursuits in language processing research.

Text data augmentations

We described Data Augmentation as a strategy to prevent overfitting via regularization. This regularization is enabled through an intuitive interface. As we study a task or dataset, we learn more about what kind of priors or what kind of additional data we need to collect to improve the system. For example, we might discover characteristics about our question answering dataset such as that it fails with symmetric consistency on comparison questions. The following list of augmentations describes the mechanisms we currently have available to inject these priors into our datasets.

Symbolic augmentation

We categorize these augmentations as “Symbolic Augmentations” in contrast to “Neural Augmentations”. As stated earlier, the key difference is the use of auxiliary neural networks, or other types of statistical models, to generate data compared to using symbolic rules to augment data. A key benefit of symbolic augmentation is the interpretability for the human designer. Symbolic augmentations also work better with short transformations, such as replacing words or phrases to form augmented examples. However, some information-heavy applications rely on longer inputs such as question answering or summarization. Symbolic rules are limited in applying global transformations such as augmenting entire sentences or paragraphs.

Rule-based augmentation

Rule-based Augmentations construct rules to form augmented examples. This entails if-else programs for augmentation and symbolic templates to insert and re-arrange existing data. Easy Data Augmentation (EDA) from Wei et al. [ 34 ] presents four augmentations: random swapping, random deletion, random insertion, and random synonym replacement. Examples of these are shown in Fig. 2. Figure 1 highlights the performance improvement with EDA; note that the smallest subset of 500 labeled examples benefits the most. One of the main reasons to be excited about Easy Data Augmentation is that it is relatively easy to use off-the-shelf. Many of the augmentations mentioned later in this survey are still in the research phase, waiting for large-scale testing and adoption.

Fig. 1: Success of EDA applied to 5 text classification datasets. A key takeaway from these results is the performance difference with less data. The gain is much more pronounced with 500 labeled examples, compared to 5,000 or the full training set.

Fig. 2: Examples of easy data augmentation transformations.
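The four Easy Data Augmentation operations can be sketched in a few lines. This is a simplified illustration, not the reference implementation of Wei et al.: the synonym table `TOY_SYNONYMS` is a hypothetical stand-in for a real thesaurus, each function applies a single edit, and the deletion probability `p` is an assumed parameter.

```python
import random

# Hypothetical toy thesaurus; a real system would use a lexical resource.
TOY_SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "cheerful"]}

def synonym_replacement(tokens, rng, synonyms=TOY_SYNONYMS):
    """Replace one token that has a known synonym."""
    candidates = [i for i, t in enumerate(tokens) if t in synonyms]
    out = list(tokens)
    if candidates:
        i = rng.choice(candidates)
        out[i] = rng.choice(synonyms[out[i]])
    return out

def random_insertion(tokens, rng, synonyms=TOY_SYNONYMS):
    """Insert a synonym of some token at a random position."""
    candidates = [t for t in tokens if t in synonyms]
    out = list(tokens)
    if candidates:
        new_word = rng.choice(synonyms[rng.choice(candidates)])
        out.insert(rng.randint(0, len(out)), new_word)
    return out

def random_swap(tokens, rng):
    """Swap the positions of two distinct tokens."""
    out = list(tokens)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, rng, p=0.1):
    """Drop each token with probability p, always keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

rng = random.Random(0)
sentence = "the quick dog is happy".split()
for augment in (synonym_replacement, random_insertion, random_swap, random_deletion):
    print(augment.__name__, "->", " ".join(augment(sentence, rng)))
```

Because each operation changes only a small part of the input, the label of a classification example is usually assumed to be preserved.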

There are many opportunities to build on these augmentations. Firstly, we note that with random swapping, the classification of the word is incredibly useful. From the Data Augmentation perspective of introducing semantic invariances, “I am jogging”, is much more similar to “I am swimming” than “I am yelling”. Further designing token vocabularies with this kind of structure should lead to an improvement.

Programs for Rule-based augmentation further encompass many of the adversarial attacks that have been developed for NLP. Adversarial attacks are equivalent to augmentations, differing solely in the intention of their construction. As an example of a rule-based attack, Jin et al. [ 35 ] present TextFooler. TextFooler first computes word importance scores by looking at the change in output when deleting each word. TextFooler then selects the words which most significantly changed the outputs for synonym replacement. This is an example of a rule-based symbolic program that can be used to organize the construction of augmented examples.
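The deletion-based importance scoring described above can be sketched as follows. The scoring function `toy_spam_score` is a hypothetical stand-in for a trained classifier, and the ranking helper is a simplified illustration rather than the TextFooler implementation, which includes further steps such as synonym selection and similarity checks.

```python
def toy_spam_score(tokens):
    """Hypothetical classifier stand-in: fraction of tokens that look spammy."""
    spammy = {"free", "prize", "winner"}
    if not tokens:
        return 0.0
    return sum(t in spammy for t in tokens) / len(tokens)

def word_importance(tokens, score_fn):
    """Score each word by how much the output changes when it is deleted."""
    base = score_fn(tokens)
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        scores.append(abs(base - score_fn(reduced)))
    return scores

def most_important(tokens, score_fn, k=1):
    """Return the k words whose deletion changes the output the most."""
    scores = word_importance(tokens, score_fn)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in ranked[:k]]

tokens = "claim your free prize today".split()
print(most_important(tokens, toy_spam_score, k=2))
```

The selected words would then be candidates for synonym replacement, either to attack the model or to generate augmented training examples.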

Another rule-based strategy available is Regular Expression Augmentation. Regular Expression filtering is one of the most common ways to clean data that has been scraped from the internet, as well as several other data sources such as Clinical Notes [ 36 ]. Regular Expressions describe matching patterns in text. They are usually used to clean data, but they can also be used to find common forms of language and generate extensions that align with a graph-structured grammar. For example, we can match patterns like “This object is adjective” and extend them with patterns such as “and adjective”. Another strategy is to re-order the syntax based on the grammar, such as rewriting “This object is adjective” as “An adjective object”.
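
As a rough sketch of this idea, the snippet below matches the “This object is adjective” form with Python's `re` module and produces both the extended and the re-ordered variants. The pattern, the function names, and the simple article rule are assumptions made for illustration.

```python
import re

# Matches sentences of the form "This <object> is <adjective>" (optional final period).
PATTERN = re.compile(r"^This (?P<obj>\w+) is (?P<adj>\w+)\.?$")

def extend_with_adjective(sentence, extra_adj):
    """Extend a matched sentence with 'and <adjective>'; return None if no match."""
    m = PATTERN.match(sentence)
    if m is None:
        return None
    return f"This {m['obj']} is {m['adj']} and {extra_adj}."

def reorder(sentence):
    """Rewrite 'This <object> is <adjective>' as 'A/An <adjective> <object>'."""
    m = PATTERN.match(sentence)
    if m is None:
        return None
    article = "An" if m["adj"][0].lower() in "aeiou" else "A"
    return f"{article} {m['adj']} {m['obj']}"

print(extend_with_adjective("This car is fast.", "red"))
print(reorder("This idea is elegant"))
```

Sentences that do not match the pattern are returned as `None`, so the caller can fall back to the unaugmented example.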

Min et al. [ 37 ] propose rules for augmentation based on syntactic heuristics. This includes Inversion, which swaps the subject and object in sentences, and Passivization, where the hypothesis in premise-hypothesis NLI (Natural-Language Inference) pairs is converted to the passive version of the sentence. An example of Inversion is the change from “The lawyer saw the actor” to “The actor saw the lawyer”. An example of Passivization is changing from “This small collection contains 16 El Grecos” to “This small collection is contained by 16 El Grecos”. The authors show that applying these augmentations improves performance on the HANS challenge set for NLI [ 38 ].

Graph-structured augmentation

An interesting opportunity for text data augmentation is to construct graph-structured representations of text data. This includes relation and entity encodings in knowledge graphs, grammatical structures in syntax trees, or metadata grounding language data, such as citation networks. These augmentations add explicit structural information, a relatively new integration with Deep Learning architectures. The addition of structure can aid in finding label-preserving transformations, representation analysis, and adding prior knowledge to the dataset or application. We will begin our analysis of Graph-Structured Augmentation by unpacking the difference between structured versus unstructured representations.

Deep Learning operates by converting high-dimensional, and sometimes sparse, data into lower-dimensional, continuous vector embedding spaces. The learned vector space has corresponding metrics such as L2 or cosine similarity distance functions. This is a core distinction from topological spaces, in which distance between points is not defined. A topological space is a more general mathematical space with fewer constraints than Euclidean or metric spaces. Topological spaces encode information that is challenging to integrate into modern Deep Learning architectures. Rather than designing entirely new architectures, we can leverage the power of structured data through the Data Augmentation interface.

One of the most utilized structures in language processing is the Knowledge Graph [ 39 ]. A Knowledge Graph is composed of (entity, relation, entity) tuple relations. The motivation of the augmentation scheme is that paths along the graph provide information about entities and relations which are challenging to represent without structure. Under the scope of Rule-based Augmentation, we presented the idea of synonym swap. One strategy to implement synonym swap would be to use a Knowledge Graph with “is equivalent” relationships to find synonyms. This can be more practical than manually defining dictionaries with synonym entries. This is especially the case thanks to rapid acceleration in automated knowledge graph construction from unlabeled data. Knowledge Graphs often contain more fine-grained relations as well.

Previously, we mentioned how random synonym replacement would benefit enormously from the perspective of preserving the class label with better swaps. Improved swaps describe transitions such as “I am jogging” to “I am running” compared to “I am yelling”, or even “I am market”. Structured language in graph-form is a very useful tool to achieve this augmentation capability. These kinds of graphs have been heavily developed with notable examples such as WordNet [ 40 ], Penn Treebank [ 41 ], and the ImageNet class label structure [ 17 ]. Graphs such as WordNet describe words in relationship to one another through “synsets”.

Graphs are made up of nodes and edges. In WordNet, each node represents a word such as “tiger”. The genius of WordNet is the simplification of which edges to connect. In WordNet, the nodes are connected with the same edge type, a “synset” relationship. Synsets are loosely defined as words belonging to a similar semantic category. The word “tiger” would have a synset relation with nodes such as “lion” or “jaguar”. The word “tiger” may also have finer-grained synset relations with nodes that describe more particular types of tigers. WordNet is an example of a Graph-Structured Augmentation that builds on synonym replacement. WordNet describes a graph where each node is related to another node by being a “synset”.
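
A graph with a single edge type, as described above, can be used for replacement with very little code. The sketch below uses a small hand-written edge list rather than WordNet itself; the entries, the function names, and the choice to sample a neighbor uniformly are assumptions made for illustration.

```python
import random

# Toy undirected graph with one edge type (illustrative entries, not taken from WordNet).
EDGES = [
    ("jogging", "running"),
    ("jogging", "sprinting"),
    ("tiger", "lion"),
    ("tiger", "jaguar"),
]

def build_graph(edges):
    """Build an adjacency map from an undirected edge list."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def graph_swap(words, graph, rng):
    """Replace every word that has neighbors in the graph with a random neighbor."""
    return [rng.choice(sorted(graph[w])) if w in graph else w for w in words]

graph = build_graph(EDGES)
print(" ".join(graph_swap("i am jogging".split(), graph, random.Random(0))))
```

Because replacements are restricted to graph neighbors, “jogging” can become “running” or “sprinting” but never an unrelated word such as “yelling”.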

We additionally consider graphs that contain finer-grained edge classifications. This kind of graph is frequently referred to as a Knowledge Graph [ 39 ]. As an example, CoV-KGE [ 42 ] contains 39 different types of edges relating biomedical concept nodes such as drugs or potential binding targets. Huang et al. [ 43 ] provide another interesting example of constructing a knowledge graph from the long context provided as input to abstractive summarization. This graph enables semantic swaps that preserve global consistency.

Another heavily studied area of adding structure to text data is known as syntactic parsing. Syntactic parsing describes different tasks that require structural analysis of text such as the construction of syntax or dependency trees. Recently, Glavas and Vulic [ 44 ] demonstrated that supervised syntactic parsing offered little to no benefit in the modern pre-train, then fine-tune pipeline with large language models.

The final use of structure for Text Data Augmentation we consider is to integrate metadata via structural information. For example, scientific literature mining has become a very popular application of NLP. These applications could benefit from the underlying citation network characterizing these papers, in addition to the text content of the papers themselves. Particularly, network structure has played an enormous role in biology and medicine. Li et al. [ 45 ] present many of these graphs in high-level application domains such as molecules, genomics, therapeutics, and healthcare. The integration of this structure with text data could be a key component to grounding text representations.

In the theme of our survey, we note that these auxiliary graphs may benefit from augmentation as well. Data Augmentation for explicitly graph-structured data is still in its early stages. Zhao et al. [ 46 ] propose an edge augmentation technique that “exposes GNNs to likely (but nonexistent) edges and limiting exposure to unlikely (but existent) ones” [ 46 ]. This graph augmentation leads to an average accuracy improvement of 5% across 6 popular node classification datasets. Kong et al. [ 47 ] further demonstrate the effectiveness of adversarially controlled node feature augmentation on graph classification.

In the section, Practical Considerations for Implementation, we will present the use of consistency regularization and contrastive learning to further enforce the use of augmented data in training. Building on these ideas, we can use graph-structures to assign nearest neighbor assignments and regularize embeddings. Neural Structured Learning [ 48 ] describes constructing a graph connecting instances that share fine-grained class labels. This is used to penalize a misclassification of “golden retriever” less so than “elephant” if the ground truth label is “labrador retriever”. Li et al. [ 49 ] similarly construct an embedding graph to enforce consistency between predictions of strong and weakly augmented data.

MixUp augmentation

MixUp Augmentation describes forming new examples by meshing existing examples together, sometimes blending the labels as well. As an example, MixUp may take half of one text sequence and concatenate it with half of another sequence in the dataset to form a new example. MixUp may be one of the best interfaces available to connect distant points and illuminate a path of interpolation.

Most implementations of MixUp vary with respect to the layer in which samples are interpolated. Guo et al. [ 50 ] test MixUp at word and sentence levels. This difference is shown in Fig. 3 . Their wordMixup technique combines existing samples by averaging embedding vectors at the input layer. The sentMixup approach combines existing samples by averaging sentence embeddings as each original sequence is passed through siamese encoders. Their experiments find a significant improvement in reducing overfitting compared to no regularization or using dropout.

figure 3

Left, word-level mixup. Right, sentence-level mixup. The red outline highlights where augmentation occurs in the processing pipeline
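
A word-level mixup step can be sketched with plain lists. The mixing weight `lam` is an assumption of this sketch: setting it to 0.5 gives the plain averaging described above, while other values give a weighted interpolation. The function names and the requirement that both sequences are already padded to the same length are also ours.

```python
def mix_vectors(a, b, lam):
    """Element-wise interpolation lam * a + (1 - lam) * b."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]

def word_mixup(emb_a, emb_b, label_a, label_b, lam=0.5):
    """Mix two equal-length sequences of word embeddings and their one-hot labels."""
    if len(emb_a) != len(emb_b):
        raise ValueError("pad or truncate both sequences to the same length first")
    mixed_emb = [mix_vectors(u, v, lam) for u, v in zip(emb_a, emb_b)]
    mixed_label = mix_vectors(label_a, label_b, lam)
    return mixed_emb, mixed_label

# Two toy sequences of two 2-dimensional word embeddings each.
emb, label = word_mixup(
    [[1.0, 0.0], [0.0, 2.0]],
    [[3.0, 4.0], [2.0, 0.0]],
    [1.0, 0.0],
    [0.0, 1.0],
)
print(emb, label)
```

A sentence-level variant would apply `mix_vectors` once to the two sentence embeddings produced by the encoder instead of once per token position.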

Feature space augmentation

Feature Space Augmentation describes augmenting data in the intermediate representation space of Deep Neural Networks. Nearly all Deep Neural Networks follow a sequential processing structure where input data is progressively transformed into distributed representations and eventually, task-specific predictions. Feature Space Augmentations isolate intermediate features and apply noise to form new data instances. This noise could be sampled from standard uniform or Gaussian distributions, or it could be designed with adversarial controllers.

MODALS [ 51 ] presents a few strategies for feature space augmentations. Shown in Fig. 4 , these strategies describe how to move along class boundaries to form new examples in the feature space. Hard example interpolation (a) forms a new example by moving it in the direction of existing embeddings that lie on the decision boundary for classification. Hard example extrapolation (b) describes moving existing examples along the same angle they currently lie from the mean vector of the class boundary. Gaussian noise (c) entails adding Gaussian noise in the feature space. Difference transform (d) moves an existing sample in the directional distance calculated from two separate points in the same class. As described as one of the general Motifs Of Data Augmentation, MODALS aims to strengthen decision boundaries. Research in Supervised Contrastive Learning [ 52 ], replacing the commonly used KL-divergence of logits and class labels with contrastive losses such as NCE with positives and negatives formed based on class labels, has been shown to improve these boundaries. It could be useful to explore how this benefits the MODALS algorithm.
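
The four directions can be written as simple vector operations. The formulas below are our reading of the descriptions above, not code from MODALS: the step size `lam`, the noise scale `sigma`, and the function names are assumptions, and feature vectors are represented as plain Python lists.

```python
import random

def interpolate(x, hard, lam):
    """(a) Move x toward a hard example that lies near the decision boundary."""
    return [xi + lam * (hi - xi) for xi, hi in zip(x, hard)]

def extrapolate(x, class_mean, lam):
    """(b) Move x further along its current direction away from the class mean."""
    return [xi + lam * (xi - mi) for xi, mi in zip(x, class_mean)]

def gaussian_noise(x, sigma, rng):
    """(c) Add zero-mean Gaussian noise with standard deviation sigma."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def difference_transform(x, a, b, lam):
    """(d) Shift x by the difference between two other points a, b of the same class."""
    return [xi + lam * (ai - bi) for xi, ai, bi in zip(x, a, b)]

rng = random.Random(0)
x = [1.0, 1.0]
print(interpolate(x, [3.0, 5.0], 0.5))
print(extrapolate(x, [0.0, 0.0], 0.5))
print(gaussian_noise(x, 0.1, rng))
print(difference_transform(x, [3.0, 0.0], [1.0, 0.0], 0.5))
```

In practice these operations would be applied to the intermediate activations of the network rather than to raw inputs.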

We also consider Differentiable Data Augmentation [ 53 ] techniques to fall under the umbrella of Feature Space Augmentation. Data Augmentation is a function f(x) that produces augmented examples x’. Similar to any other layers in the network, we can treat the beginning of the network as an augmentation module and backpropagate gradients through it. We can also separate the augmentation function and add it to the inputs such that the transformation is not too dramatic, akin to adding an optimized noise map to the input. Minderer et al. [ 54 ] use this technique to facilitate self-supervised pretext tasks.

figure 4

Directions for feature space augmentation explored in MODALS

Neural augmentation

The following augmentations rely on auxiliary neural networks to generate new training data. This entails using a model trained on supervised Neural Machine Translation datasets to translate from one language to another and back to sample new instances, or a model trained on generative language modeling to replace masked out tokens or sentences to produce new data. We additionally discuss the use of neural style transfer in NLP to translate from one writing style to another or one semantic characteristic such as formal to casual writing.

Back-translation augmentation

Back-translation describes translating text from one language to another and then back from the translation to the original language. An example could be taking 1,000 IMDB movie reviews in English and translating them to French and back, Chinese and back, or Arabic and back. There has been an enormous interest in machine translation. This has resulted in the curation of large labeled datasets of parallel sentences. We can also imagine the use of other text datasets such as translations between programming languages or writing styles as we describe in more detail under Style Augmentation.
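
The round trip can be sketched as the composition of two translation functions. In the toy example below, word-level lookup tables stand in for the forward and backward translation models; the tables, the function names, and the word-by-word translation are all assumptions made so that the example runs without a real translation system.

```python
# Toy lookup tables standing in for English-to-French and French-to-English models.
EN_TO_FR = {"the": "le", "film": "film", "movie": "film", "is": "est",
            "great": "formidable", "terrific": "formidable"}
FR_TO_EN = {"le": "the", "film": "movie", "est": "is", "formidable": "terrific"}

def translate(words, table):
    """Translate word by word, leaving unknown words unchanged."""
    return [table.get(w, w) for w in words]

def back_translate(sentence, forward, backward):
    """Translate to a pivot language and back to obtain a paraphrase."""
    pivot = forward(sentence.split())
    return " ".join(backward(pivot))

augmented = back_translate(
    "the film is great",
    lambda words: translate(words, EN_TO_FR),
    lambda words: translate(words, FR_TO_EN),
)
print(augmented)
```

Because several source words map to the same pivot word, the round trip returns a paraphrase rather than the original sentence, which is the effect back-translation relies on.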

Back-translation leverages the semantic invariances encoded in supervised translation datasets to produce semantic invariances for the sake of augmentation. Also interestingly, back-translation is used to train unsupervised translation models by enforcing consistency on the back-translations. This form of back-translation is also heavily used to train machine translation models with a large set of monolingual data and a limited set of paired translation data. Outside of translation we could imagine structuring these domain pairings such as scientific papers and news articles or college-level and high-level reading and so on.

An interesting design question is how much importance to place on using a high-performance machine translation model for the back-translation. However, as stated by Pham et al., the lesson has been “better translation quality of the pseudo-parallel data does not necessarily lead to a better final translation model, while lower-quality but diverse data often yields stronger results instead” [ 55 ]. The curation of paired languages and domains could also impact the final performance. Exploring back-translation augmentation for question answering, Longpre et al. discuss “curating our input data and learning regime to encourage representations that are not biased by any one domain or distribution” [ 56 ].

Style augmentation

Finally, we present another augmentation strategy utilizing Deep Networks to augment data for the training of other Deep Nets. In our previous survey of Image Data Augmentation, we explored works that use Neural Style Transfer for augmentation. Artistic style transfers, such as a Picasso-themed dog image, may be useful as an OOD augmentation in a Negative Data Augmentation framework, which we will present later. However, we are more interested in styles within the dataset. This is an interesting strategy to prevent overfitting to high-frequency features, or to blur out the form of language so that the model focuses on meaning. In the text data domain, this could describe transferring the writing-style of one author to another for applications such as abstractive summarization or context for extractive question answering.

Data Augmentation is often deployed to focus models on semantics, rather than particular forms of language. These particular forms could emerge from one author’s writing style or general tonality in the language such as an optimistic versus a pessimistic writer. Style transfer offers an interesting window to extract semantic similarities between writing styles. This could help with modeling contexts in question answering systems or documents for information retrieval.

Generative data augmentation

Generative Data Augmentation is one of the most exciting emerging ideas in Deep Learning. This includes generating photorealistic facial images [ 57 ] or indistinguishable text passages [ 14 ]. These models have been very useful for Transfer Learning, but the question remains: What is the killer application of the generative task? These generations are certainly interesting for artistic applications, but more importantly is their use for representation learning and Data Augmentation.

We note a core distinction in the use of generative models for Data Augmentation. A popular use is to take a pre-trained language model off the shelf and optionally fine-tune it further with the language modeling task. This is the standard operating procedure of Transfer Learning. However, the fine-tuning is usually done with the Supervised Learning task, rather than additional language modeling. The pre-trained language models have learned many interesting properties of language because they are trained on massive datasets. An interesting example that is publicly available is The Pile [ 58 ]. The Pile is 800GB of text data spanning Wikipedia, comment forums, entire books, and many more examples of data like this. Even though these models and datasets are very impressive, additional benefits will likely be achieved by domain-tuning with additional language modeling on the limited dataset.

Language modeling is a very useful pre-training stage and we often have more data for language modeling than a downstream task like question-answering. Whereas we may only have 100 question-answer pairs, the question, answer, and surrounding context could easily contain 300 words each, accounting for a total of 30,000 words for constructing language modeling examples. A dataset size of 30,000 compared to 100 can make a large difference in success with Deep Learning and is the prime reason for our interest in Data Augmentation to begin with. Gururangan et al. [ 59 ] present an argument for this use of language models since downstream performance is dramatically improved when pre-training on a relevant dataset. Here, “relevant dataset” stands in contrast to the general-purpose data used to train models like GPT-3 [ 14 ].

One of the most popular strategies for training a language model for Generative Data Augmentation is Conditional BERT (C-BERT) [ 60 ]. C-BERT augments data by replacing masked out tokens of the original instance. The key novelty is that it takes an embedding of the class label as input, such as to preserve the semantic label when replacing masked out tokens. This targets the label-preserving property of Data Augmentation. The C-BERT training strategy can be used when fine-tuning a model pre-trained on another dataset or starting from a random initialization.

An emerging strategy to adapt pre-trained generative models to downstream tasks is to re-purpose the interface of masking out tokens. This is known as prompting. The output of language models can be guided with text templates for the sake of generating or labeling new data. Testing the efficacy of prompting with respect to the objective of learning from limited data, Scao and Rush [ 61 ] show that prompting is often worth 100s of data points on SuperGLUE classification tasks [ 32 ]. This is in direct comparison with the more heavily studied paradigm of Transfer Learning, head-based fine-tuning. We will present a few variants on implementing prompts, including in-context learning, pattern-exploiting training, and prompt tuning.

The first implementation of prompting we consider is in-context learning. In-context learning became well known when demonstrated with GPT-3. The idea is to prepend each input with a fixed task description and a collection of examples of the task. This does not require any further gradient updates of the model. Brown et al. [ 14 ] show that scale is crucial to making this work, reporting significant performance drops from 175B parameters to 13B and less. This technique has likely not yet hit its ceiling, especially with the development of transformer models that can take in sequences longer than 512 tokens as inputs. Similar to excitement about retrieval-augmented modeling, this will allow in-context learning models to process more demonstrations of the task. However, due to limitations of scale, methods that continue with gradient updates are more practically useful.
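
Constructing an in-context prompt amounts to string assembly: a task description, a handful of demonstrations, and the new input with its label left blank. The template below, including the “Input:” and “Label:” field names, is a hypothetical format chosen for illustration and is not the one used by Brown et al.

```python
def build_in_context_prompt(task_description, demonstrations, query):
    """Prepend a task description and labeled demonstrations to a new input."""
    lines = [task_description, ""]
    for text, label in demonstrations:
        lines.append(f"Input: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Label:")  # left open for the language model to complete
    return "\n".join(lines)

demos = [
    ("A moving and beautifully shot film.", "positive"),
    ("Two hours I will never get back.", "negative"),
]
prompt = build_in_context_prompt(
    "Classify the sentiment of each movie review.",
    demos,
    "The plot was thin but the acting was superb.",
)
print(prompt)
```

The resulting string is passed to the language model as-is, and the model's continuation after the final “Label:” is read off as the prediction.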

The next implementation of prompting we will present is prompt tuning. Prompt tuning describes first embedding the prompt into a continuous space, and then optimizing the embedding with gradient descent while keeping the rest of the network frozen. Similarly to GPT-3, Lester et al. [ 62 ] show that scale improves performance with prompt tuning and that prompt tuning significantly outperforms the in-context learning results reported from Brown et al. [ 14 ]. Performance can be further improved by ensembling optimized prompts and running inference as a single batch of the input and the appended prompts. Tuned prompt ensembling reaches 90.5 on SuperGLUE, compared with 88.5 for the average of the individual prompts and 89.8 for the best-performing individual prompt. The authors further highlight that analysis of the optimized prompt embedding can aid in task complexity and similarity metrics, as well as Meta-Learning. Prompt tuning shares the same underlying concept of prepending context to the input of downstream tasks to facilitate fine-tuning; however, this technique is more in line with research on Transfer Learning with minimal modifications. For example, adapter layers [ 63 ] aim to introduce a small number of parameters to fine-tune a pre-trained Transformer.

An emerging theme in the pre-train then fine-tune paradigm has been that domain and task alignment tends to improve fine-tuned performance. Gururangan et al. [ 59 ] demonstrate the effectiveness of data domain alignment and Zhang et al. [ 64 ] demonstrate the effectiveness of task alignment in the proposed PEGASUS algorithm. In correspondence with the lesson of alignment, Zhong et al. [ 65 ] tune language models to be better fitted to answer prompts. This is done by manually annotating 441 questions across 43 existing datasets that map every task to a “Yes” or “No” answer. Measured by AUC-ROC plots, the authors show that further fine-tuning on prompt specialization improves these models and that this also benefits from scale. The authors call for the organization of NLP datasets into unified formats that better aid in fine-tuning models for answering prompts.

Pattern exploiting training (PET) [ 66 ] uses the pre-trained language model to label task-specific unlabeled data. This is done with manually-defined templates that convert the supervised learning task into a language modeling task. The outputs of the language model are then mapped to supervised learning labels with a verbalizer. Gradient-descent optimization is applied to verbalized outputs to fine-tune the model with the same cross-entropy loss function used to train classifiers. Schick and Schütze [ 67 ] demonstrated that the PET technique enables much smaller models to surpass GPT-3 with 32 labeled examples from SuperGLUE. Tam et al. [ 68 ] further developed the algorithm into ADAPET. ADAPET utilizes dense supervision in the labeling task, applying the loss to the entire vocabulary distribution without a verbalizer and additionally requiring the model to predict the masked tokens in the context given the label, similarly to conditional-BERT. ADAPET outperforms PET without the use of task-specific unlabeled data.
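
The pattern and verbalizer can be illustrated without a language model by supplying made-up probabilities for the masked position. The template text, the verbalizer entries, the probability values, and the function names below are all hypothetical; the sketch only shows how a cloze output is turned into a soft label over the task's classes.

```python
def pattern(review):
    """Hypothetical cloze template that recasts sentiment classification as token prediction."""
    return f"{review} All in all, it was [MASK]."

# Verbalizer: maps each task label to a single vocabulary token.
VERBALIZER = {"positive": "great", "negative": "terrible"}

def label_distribution(mask_token_probs, verbalizer):
    """Renormalize the [MASK] probabilities over the verbalizer tokens only."""
    scores = {label: mask_token_probs.get(token, 0.0) for label, token in verbalizer.items()}
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("no verbalizer token received any probability mass")
    return {label: score / total for label, score in scores.items()}

# Made-up probabilities standing in for a masked language model's output at [MASK].
mask_probs = {"great": 0.30, "terrible": 0.10, "okay": 0.20}
print(pattern("The pacing never drags."))
soft_label = label_distribution(mask_probs, VERBALIZER)
print(soft_label)
```

Soft labels produced this way for unlabeled examples can then serve as training targets for a standard classifier.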

A limitation to pattern-exploiting training, in-context learning, and prompt tuning, is that they require retaining a large language model for downstream tasks. Most applications are interested in compressing these models for the sake of efficiency. Under the scope of Label Augmentation, we will present the use of knowledge distillation. For now, we consider compression by generating data to train a smaller model with. This approach is most similar to pattern-exploiting training, except that rather than use the pre-trained language model to label data, we will instead use it to generate entire examples.

Drawing inspiration from the success of MixUp, which was presented in further detail in MixUp Augmentation, Yoo et al. developed GPT3Mix [ 69 ]. The input to GPT3Mix begins with a Task Specification that defines the task such as, “Text Type T = movie review, Label Type L = sentiment”. Akin to MixUp, the next inputs are examples of the task formulated as “text type: example text k (label type: example label k)”, such as “Example 1: The cat is running my mat. (negative)”. The final piece of the input is the template to generate new examples. Further, the generated example is “soft-labeled” by the generating probabilities of each token in the process of generating the new example. GPT3Mix achieves massive performance improvements over no augmentation, Easy Data Augmentation, and BackTranslation when subsetting available data to extreme levels such as 0.1% and 0.3%.

Schick and Schütze [ 26 ] also explore the strategy of generating data from language models, presenting Datasets from Instructions (DINO). DINO uses a task description and one example from the dataset to generate pairwise classification datasets. Interestingly, they contrast task descriptions which entail the resulting label to decode language model generation. For example, the task description could begin with “Write two sentences that” and continue with either “mean the same thing” or “are on completely different topics”. The generation accounts for the token another label description would generate. Evaluated on the STS text similarity dataset, representations learned from DINO show improvements over state-of-the-art sentence embedding techniques trained with supervised learning, such as Universal Sentence Encoders [ 70 ] and Siamese BERT and RoBERTa models [ 71 ].

While built on the same underlying concept, discrete versus continuous prompt search diverge heavily from one another. Discrete prompt search has the benefit of interpretability. For example, comparing different task descriptions and examples provided by a human annotator offers insights into what the model has learned. However, prompt optimization in the continuous embedding space fully automates the search. Continuous prompt optimization is likely more susceptible to overfitting due to the freedom of the optimization space.

Another somewhat similar theme to prompting in NLP has been to augment knowledge-enhanced text generation with retrieval. Popular models include Retrieval-Augmented Generation (RAG) [ 72 ], and Retrieval-Augmented Language Model Pre-training (REALM) [ 73 ]. Shuster et al. [ 74 ] show how retrieving information to prepend to the input reduces the problem of hallucination in text generation. Once this retrieved information is embedded into the continuous representation space of language models, it is a similar optimization problem as prompt tuning.

Another interesting idea is the intersection of Data Privacy and Generative Data Augmentation. Can we store data in the parameters of models instead of centralized databases? The idea of Federated Learning [ 75 ] is to send copies of the global model weights to a local database such as to avoid a centralized database. Which models should we send to local databases? Classifiers or generative models? If we send a generative model, we have the potential to cover more of the data distribution and learn more about general data manifolds such as the use of language more broadly, however, we risk exposing more critical information [ 76 ].

Label augmentation

Supervised Learning describes fitting an input, x, to a label, y. Throughout this survey, we have presented strategies for regularizing the x values. In this section, we explore research looking to augment the y class labels. The most successful example of this is Knowledge Distillation [ 77 ]. Knowledge Distillation describes transforming the traditional one-hot encoded y labels into a soft distribution by re-labeling each x with the logits of another neural network’s prediction. This has been very influential in compression such as DistilBERT [ 78 ], information retrieval [ 79 ], and achieving state-of-the-art classification results in Computer Vision [ 80 ].
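
The re-labeling step can be sketched as a softmax over the teacher's logits followed by a cross-entropy against the student's prediction. The temperature parameter comes from the standard distillation formulation and is an addition to the description above; the logit values and function names are made up for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)

# Made-up logits for a three-class problem.
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.0, 1.5, 0.1]

soft_labels = softmax(teacher_logits, temperature=2.0)  # replaces the one-hot label
loss = cross_entropy(soft_labels, softmax(student_logits, temperature=2.0))
print(soft_labels, loss)
```

Unlike a one-hot label, the soft label keeps the teacher's relative preference among the incorrect classes, which is the extra signal the student learns from.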

In addition to Knowledge Distillation, several other strategies have been developed to augment the label space. Label smoothing uses a heuristic adjustment to the density on negative classes and has been highly influential for training classifiers [ 81 ] and generative adversarial networks [ 82 ]. Another exciting approach is the use of a meta-controller, similar to knowledge distillation, but massively different in that the Teacher is learning from the gradients of the Student’s loss to update the label augmentation. Notable examples exploring this include Meta Pseudo Labels [ 83 ] and Teaching with Commentaries [ 84 ]. This ambitious idea of learning to augment data through outer-inner loop gradients has also been explored in the data space, x, with Generative Teaching Networks [ 12 ]. As of the time of this writing, Generative Teaching Networks have only been applied to image data. A similar idea is “Meta Back-Translation” [ 55 ]. In this work, the authors “propose a meta-learning framework where the back-translation model learns to match the forward translation model’s gradients on the development data with those on the pseudo-parallel data.”
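
Label smoothing in its common uniform form can be written in one line: a fraction `epsilon` of the probability mass is taken from the one-hot label and spread evenly over all classes. The value of `epsilon` and the function name below are illustrative choices.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Uniform label smoothing: (1 - epsilon) * y + epsilon / K for K classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]

smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0], epsilon=0.1)
print(smoothed)
```

With four classes and `epsilon` set to 0.1, each negative class receives 0.025 and the true class keeps 0.925, so the target still sums to one.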

Thakur et al. [ 85 ] present the Augmented SBERT to augment data labels for distillation. The authors note that the cross-encoder, although much slower and less efficient than bi-encoders, tends to reach higher accuracy on pairwise classification tasks such as ranking or duplicate question detection. The paper proposes to label data with the cross-encoder and fit these augmented labels with the bi-encoder. Also worth mentioning is that the cross-encoder heavily outperforms the bi-encoder with less training data. Thakur et al. find a significant benefit in strategically selecting data to soft label with the cross-encoder. We have found this idea throughout experiments in Data Augmentation and discuss it further in our Discussion section under Curriculum Learning.

Testing generalization with data augmentation

The holy grail of Machine Learning is to achieve out-of-distribution (OOD) generalization. This is distinct from in-distribution generalization where the training and test sets are sampled from the same data distribution. In order to measure OOD generalization, we need to make assumptions about how the distribution will shift. As Arjovsky writes, “if the test data is arbitrary or unrelated to the training data, then generalization is obviously futile” [ 86 ]. Chollet further describes the relationship between system-centric and developer-aware generalization, as well as levels of generalization such as absent, local, broad, and extreme [ 87 ]. We argue that Data Augmentation is the natural interface to quantify the relationship between test and train data distributions and levels of generalization.

A classic tool to test for generalization is to simply report the difference in accuracy between the training and test sets. However, as shown in papers such as Deep Double Descent [ 88 ], the phenomenon of overfitting is generally poorly understood with large-scale Deep Neural Networks. We believe it is more practical to study overfitting and generalization in the data space. For example, the success of adversarial examples shows that Deep Neural Networks cannot generalize to distributions perturbed with adversarially optimized noise maps. Jia and Liang [ 89 ] show that models trained on SQuAD cannot generalize when adversarially optimized sentences are added to the context. An example of this is shown in Fig. 5 . In addition to adversarial attacks, many other datasets show intuitive examples of distribution shifts where Deep Neural Networks fail to generalize.

figure 5

Fooled by injected text. Image taken from Jia and Liang [ 89 ]

We present Data Augmentation as a black-box test for generalization. CheckList [ 90 ] proposes a foundational idea for these kinds of tests in NLP. CheckList is designed to test the linguistic capabilities of models such as robustness to negation, vocabulary perturbations, or temporal consistency. We view this as introducing a distribution shift of linguistic phenomena in the test set. Clark et al. [ 91 ] construct a toy example for transformers to see how far they can generalize fact chaining. In this test, the training data requires the model to chain together more or fewer facts than are tested in the test set. Again, the distribution shift is controlled with an intuitive interface akin to Data Augmentation. Finally, WILDS [ 92 ] is a collection of real-world distribution shifts. These real-world shifts can also be mapped to Data Augmentations.

Kaushik et al. [ 24 ] describe employing human labelers to construct a set of counterfactual movie reviews and natural language inference examples. The authors construct an elegant annotation interface and task Mechanical Turk workers with minimally editing examples so as to switch the label. For example, converting “The world of Atlantis, hidden beneath the earth’s core, is fantastic” to “The world of Atlantis, hidden beneath the earth’s core is supposed to be fantastic”. For movie reviews, the authors group the workers’ revisions into categories such as recasting fact as hoped for, suggesting sarcasm, inserting modifiers, inserting phrases, diminishing value qualifiers, differing perspectives, and changing ratings. For natural language inference, the authors group the workers’ revisions into categories such as modifying/removing actions, substituting entities, adding details to entities, inserting relationships, numerical modifications, using/removing negation, and unrelated hypothesis. These examples are constructed to test generalization to counterfactual inputs.

Returning to our description of Generative Data Augmentation, are generative models capable of making these edits? If GPT-3 were given an IMDB review with the task prompt of “change this movie review from positive to negative”, it could probably manage it. We leave it to future work to investigate the generalization shifts induced by human-designed counterfactuals and generative models. To further motivate this study, the authors note that their dataset construction came with a hefty price tag of $10,778.14. Inference costs of generative models are unlikely to approach this cost, unless working with extremely large models. A categorization of the changes similar to the one Kaushik et al. [ 24 ] use could help us understand the linguistic phenomena underlying this kind of generalization test.

Generative Data Augmentation provides another lens to study generalization. Nakkiran et al. propose a novel way of studying generalization in “The Deep Bootstrap Framework” [ 93 ]. The idea is to compare the Online test error to the Bootstrap test error. The Online error describes the performance of a model trained on an infinite data stream, i.e. without repeating samples. The Bootstrap test error describes the common training setup in Deep Learning, repeating batches of the same data. The authors simulate the Online learning scenario by fitting a generative model, in this particular case a Denoising diffusion probabilistic model [ 94 ]. The generative model is used to sample 6 million examples, compared to the standard 50,000 samples used to train CIFAR-10. Garg et al. [ 95 ] additionally propose RATT, a technique that analyzes learning curves and generalization when randomly labeled unlabeled data is added to the training batch. The augmentations described in this survey may be able to simulate this unlabeled data and provide similar insights.

To conclude, when is overfitting problematic? How much of a data distribution are modern neural networks capable of covering? Deep Neural Networks have a remarkable ability to interpolate within the training data distribution. A potential solution could be to leverage Data Augmentation to expand the training distribution such that there are no reasonable out-of-distribution shifts in the test sets. Even if all the potential distributions cannot be compressed into a single neural network, this interface can illuminate where the model will fail.

Image versus text augmentation

Our survey on Text Data Augmentation for Deep Learning is intended to follow a similar format as our prior work on Image Data Augmentation for Deep Learning [ 6 ]. We note there are many similarities between the Easy Data Augmentations and basic geometric and color space transformations used in Computer Vision. Most similarly, both are easy to implement and complement nearly any problem working with text or image data respectively. We have described how Easy Data Augmentation can easily interface with text classification, pairwise classification, extractive question answering, abstractive summarization, and chatbots, to name a few. Similarly, geometric and color space transformations in Computer Vision are used in image classification, object detection, semantic segmentation, and image generation.

As described in the beginning of our survey, Data Augmentation biases the model towards certain semantic invariances. Image Data Augmentation has largely been successful because it is easy to think of semantic invariances relevant to vision. These include semantic invariance to horizontal flips, rotations, and increased brightness, to name a few. Comparatively, it is much harder to define transformations to text data that are guaranteed to be semantically invariant. All of the augmentations described in Easy Data Augmentation have the potential to perturb the original data such that it changes the ground truth label, y.

Another interesting trend is the integration of vision and language in recent models such as CLIP and DALL-E. For the sake of Data Augmentation, a notable example is Vokenization from Tan and Bansal [ 96 ]. The authors align tokens such as “humans” with images of “humans” and so on, even for verbs such as “speaking”. The masked language modeling task then uses the visual tokens as additional supervision for predicting masked out tokens. There is some noise in this alignment, such as finding a visual token for words such as “by” or “the”. Tan and Bansal report visual grounding ratios for tokens of 54.8%, 57.6%, and 41.7% on curated vision-language datasets compared to 26.6%, 27.7%, and 28.3% for solely language corpora. Across the SST-2, QNLI, QQP, MNLI, SQuAD v1.1 and v2.0, and SWAG benchmark tasks, Vokenization improves BERT-Large from 79.4 to 82.1 and RoBERTa-Large from 77.6 to 80.6. There are many interesting vision-language datasets labeled for tasks such as visual question answering, image captioning, and text-image retrieval, to name a few. Vision-language Data Augmentation schemes such as Vokenization look to be a very promising area of research.

A recent trend in Image Data Augmentation has been its integration in the training of generative models, namely generative adversarial networks (GANs) [ 97 ]. The GAN framework, similar to the ELECTRA model [ 98 ], consists of a generator and a discriminator. The generator transforms random noise into images and the discriminator classifies these images as either coming from the generator or the provided training set. Below, we describe why this does not work as well as autoregressive modeling for text. Returning to how Data Augmentation has been used for GANs, this investigation began with Zhang et al.’s work on consistency regularization [ 99 ]. Consistency regularization requires the discriminator to make the same classification on a real image and an augmented view of that same image. Unfortunately, this led to the augmentations being “leaked” into the generated distribution such that the generator produces augmented data as well.

We will end this discussion by presenting some ideas from LeCun and Misra [ 100 ] on the key distinction in generative modeling between Images and Text. The key issue stated in the article is handling uncertainty. As an example, take the masked token completion task: “The mask chases the mask in the savanna”. LeCun and Misra point out that the language model can easily “associate a score or a probability to all words in the vocabulary: high score for ‘lion’, ‘cheetah’, and a few other predators, and low scores for all other words in the vocabulary” [ 100 ]. In comparison, applying this kind of density on candidate images is highly intractable. The missing token can only be 1 of a typical 30,000 tokens, whereas a missing 8x8 RGB patch with 8-bit color channels can take on a ridiculously large 256^(8x8x3) possible values. Therefore, image models need to rely on energy-based models that learn joint embedding spaces and assign similarity scores, rather than exactly modeling the probability of each missing patch. Perhaps the GAN framework, or something similar, will take over in NLP once generative modeling expands its scope to sentence-level or paragraph-level generation, such as the pre-training task used for abstractive summarization in PEGASUS [ 64 ].
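To make the size comparison concrete, the short calculation below expresses both output spaces in bits. It is a back-of-the-envelope sketch that assumes 8-bit color channels (256 values each), an assumption not stated explicitly in the text.

```python
import math

vocab_size = 30_000          # typical vocabulary size, as quoted in the text
values_per_channel = 256     # assumption: 8-bit color channels
channels_in_patch = 8 * 8 * 3  # an 8x8 RGB patch holds 192 channel values

# log2 of the number of possible outcomes for each prediction target
bits_per_token = math.log2(vocab_size)
bits_per_patch = channels_in_patch * math.log2(values_per_channel)

print(f"masked token: about {bits_per_token:.1f} bits")
print(f"masked patch: {bits_per_patch:.0f} bits, i.e. 2^{bits_per_patch:.0f} possible patches")
```

A softmax over roughly 2^15 tokens is routine, whereas an explicit distribution over 2^1536 patches cannot be enumerated, which is the intractability the paragraph describes.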

Another interesting success of Data Augmentation has been its application in Reinforcement Learning. This has been heavily studied with Robotic Control from Visual Inputs and the Atari benchmark. One of the biggest bottlenecks with robotic learning, and most deep reinforcement learning problems, is a lack of data. It is challenging to restart a robot laundry folder back to the beginning of the unfolded shirt and collect millions of trajectories. To solve this problem, researchers have turned to forming augmented trajectories from collections in a replay buffer. Amongst many applications of reinforcement learning with Text data that have been proposed, patient care control is particularly exciting. Ji et al. [ 101 ] explore the use of model-based reinforcement learning for patient care of septic patients using the MIMIC-III dataset [ 102 ]. The authors use clinical notes to sanity check the model-based rollouts of physiological patient state markers. A promising area of research will be to apply Text Data Augmentation to collected clinical note trajectories to improve patient care and trajectory simulation.

Practical considerations for implementation

This section presents many details of implementing Text Data Augmentation that make a large performance difference in terms of evaluation metrics and training efficiency.

Consistency regularization

Consistency regularization is a strong complement to the priors introduced via Data Augmentation. A consistency loss requires a model to minimize the distance between the representations of an instance and the augmented example derived from it. In line with the motif of strengthening decision boundaries, consistency regularization enforces a connection between original and augmented samples. This is usually implemented in a multi-task learning framework where a model simultaneously optimizes the downstream task and a secondary consistency term.
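A minimal sketch of such a multi-task objective follows. It assumes the "distance" is a KL divergence between the predicted class distributions for the original and augmented inputs, which is one common choice; other distances over hidden representations are equally valid. The function names and the `weight` parameter are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Downstream task loss for a single labeled example."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def multitask_loss(logits_orig, logits_aug, label, weight=1.0):
    """Task loss on the original example plus a weighted consistency term."""
    p_orig = softmax(logits_orig)
    p_aug = softmax(logits_aug)
    task = cross_entropy(p_orig, label)
    consistency = kl_divergence(p_orig, p_aug)
    return task + weight * consistency

# The penalty vanishes when predictions agree and grows when they diverge.
print(multitask_loss([2.0, 0.5], [2.0, 0.5], label=0))
print(multitask_loss([2.0, 0.5], [0.5, 2.0], label=0))
```

In practice the consistency term is often computed on unlabeled data as well, since it does not require the label.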

Consistency regularization has been successfully applied to translate between programming languages by enforcing consistency on back-translations [ 103 ]. Alberti et al. [ 104 ] use a slightly different form of consistency regularization to generate synthetic question-answer pairs. Rather than minimizing the distance between representations of original and augmented examples, the framework requires that the model outputs the exact same answer when predicting from (context, question) inputs as when a separate model generates the question from (context, answer) inputs. The original BERT-Large model achieves an F1 score of 83.1 when fine-tuned on SQuAD 2.0. Fine-tuning BERT with an additional 7 million questions generated with the consistency condition improves performance to 84.8.

Consistency regularization is a common technique for self-supervised representation learning because unlabeled data should still have this property of consistent representations before and after augmentation. Xie et al. [ 105 ] deploy consistency regularization in Unsupervised Data Augmentation (UDA), as shown in Fig. 6. This technique surpasses the previous state of the art trained solely with supervised learning while using significantly less data. These improvements continue even in the extreme case of only 20 labeled examples. As an example of the performance gain, the fine-tuned BERT model achieves a 6.5% error rate on IMDB review classification, which is reduced to 4.2% with UDA. The multi-task loss formulation is also fairly common in consistency regularization implementations.

Fig. 6. Unsupervised data augmentation schema. Image taken from Xie et al. [ 105 ]

Contrastive learning

Contrastive learning differs from consistency regularization by utilizing negative samples to normalize the loss function. This is a critical distinction because the negative samples can provide a significant learning signal. We believe that the development of Text Data Augmentation can benefit from adapting successful examples in Computer Vision. The use of Data Augmentation to power contrastive self-supervised learning has been one of the most interesting stories in Computer Vision. This involves frameworks such as SimCLR [ 106 ], MoCo [ 107 ], SwAV [ 108 ], and BYOL [ 109 ], to name a few. This training strategy should be well suited for information retrieval in NLP.

Krishna et al. [ 110 ] propose contrastive REALM (c-REALM). The contrastive loss is used to align the embedding of the question and supervised answer, and contrast the question with other supervised answers from the mini-batch. However, this technique of contrastive learning is more akin to supervised contrastive learning [ 52 ], than frameworks such as SimCLR. In SimCLR, Data Augmentation is used to form the positive pairs. This strategy has not been heavily explored in information retrieval, likely due to the lack of augmentations. Hopefully, the list we have provided will help those interested pursue this idea.

Gunel et al. [ 111 ] demonstrate significant improvements on GLUE benchmark tasks by training with a supervised contrastive loss in addition to cross-entropy loss on one-hot encoded label vectors. The gain is especially pronounced when learning from 20 labeled examples, while they do not report much of a difference at 1,000 labeled examples. In addition to quantitative metrics, the authors highlight that the embeddings of classes are much more spread out through the lens of a t-SNE visualization.

Contrastive learning, similarly to consistency regularization, describes making the representation of an instance and a transformation-derived pair similar. However, contrastive learning adds a negative normalization that additionally pushes these representations away from other instances in the sampled mini-batch. Contrastive learning has achieved large advances in representation learning for Computer Vision with frameworks such as SimCLR [ 106 ] and MoCo [ 107 ]. Using Data Augmentation for contrastive learning is a very promising area of research with recent extensions to the information-retrieval language model REALM [ 73 ]. We refer interested readers to a report from Rethmeier and Augenstein [ 112 ] for more details on early efforts to apply contrastive learning to NLP.
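The role of the negative normalization can be illustrated with an InfoNCE-style loss, the general form used by SimCLR-like frameworks. This is a simplified sketch on hand-written vectors: the cosine similarity, the temperature value, and the function names are assumptions made for illustration, not the exact objective of any cited paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Pull the augmented view towards the anchor, push other instances away."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    # The negatives appear only in the denominator: this is the normalization
    # that distinguishes contrastive learning from a plain consistency loss.
    return -math.log(pos / (pos + neg))

anchor = [1.0, 0.0]           # embedding of the original instance
augmented = [0.9, 0.1]        # embedding of its augmented view
others = [[0.0, 1.0], [-1.0, 0.0]]  # embeddings of other mini-batch instances

print(info_nce(anchor, augmented, others))
```

Dropping `neg` from the denominator would make the loss identically zero, which is one way to see why the negatives carry a learning signal.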

Consistency regularization and contrastive learning are candidate solutions to a common problem found by inspecting model performance. For example, Thorne et al. [ 113 ] find that fact verification models achieve better accuracy when classifying if claims are supported or refuted by the evidence when ignoring the evidence. Contrastive learning would require the model to correctly associate supporting evidence by contrasting it with refuting evidence. Consistency regularization would instead require a similar prediction when the evidence has been slightly perturbed, such as by inserting a random word or replacing the evidence with a paraphrase that shares the same semantics.

Negative data augmentation

Negative Data Augmentation is a similar concept to the negative examples used in contrastive learning. However, a key difference is that contrastive learning generally uses other data points as the negatives, whereas Negative Data Augmentation entails applying aggressive augmentations. These augmentations are not limited to label corruptions, but may push the example out of the natural language distribution entirely. Returning to the motif of Meaning versus Form [ 28 ], these augmentations may not be useful for learning meaning, but they can help reinforce the form of natural language. Sinha et al. [ 114 ] demonstrate how this can be used to improve contrastive learning and generative adversarial networks.
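As a rough illustration of an aggressive augmentation that leaves the natural language distribution, the sketch below shuffles token order and labels the result as a negative. Token shuffling is chosen here purely as an example; it is an assumption of this sketch rather than a method attributed to Sinha et al., and the function names are hypothetical.

```python
import random

def shuffle_tokens(text, rng):
    """Destroy word order while keeping the same bag of words."""
    tokens = text.split()
    shuffled = tokens[:]
    # A different ordering exists only if at least two distinct tokens occur.
    if len(set(tokens)) > 1:
        while shuffled == tokens:
            rng.shuffle(shuffled)
    return " ".join(shuffled)

def with_negatives(texts, seed=0):
    """Pair each natural sentence (label 1) with a corrupted copy (label 0)."""
    rng = random.Random(seed)
    pairs = [(t, 1) for t in texts]
    pairs += [(shuffle_tokens(t, rng), 0) for t in texts]
    return pairs

for text, is_natural in with_negatives(["the cat sat on the mat"]):
    print(is_natural, text)
```

A model trained to separate the two labels is pushed to encode form (word order) even though the negatives carry no usable meaning.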

Augmentation controllers

A large contributor to the success of Data Augmentation in Computer Vision is the development of controllers. Controllers refer to algorithms that optimize the strength of augmentations throughout training. The strength of an augmentation describes the magnitude of the operation, such as inserting 3 additional words compared to 15. Augmentation strength also describes how many augmentations are stacked together, such as random insertion followed by deletion followed by back-translation and so on; stacking is discussed further below. Successful controllers such as AutoAugment [ 7 ], Population-Based Augmentation [ 8 ], or RandAugment [ 9 ] have not yet seen large-scale adoption in NLP.
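The two notions of strength, magnitude and number of stacked operations, map naturally onto the two parameters of a RandAugment-style policy. The sketch below adapts that idea to token lists under stated assumptions: the three operations are simplified stand-ins for Easy Data Augmentation (insertion copies an existing token instead of a synonym), and all names are illustrative.

```python
import random

def random_deletion(tokens, magnitude, rng):
    """Delete up to `magnitude` tokens, always keeping at least one."""
    kept = tokens[:]
    for _ in range(min(magnitude, len(kept) - 1)):
        kept.pop(rng.randrange(len(kept)))
    return kept

def random_swap(tokens, magnitude, rng):
    """Swap `magnitude` randomly chosen pairs of positions."""
    out = tokens[:]
    if len(out) < 2:
        return out
    for _ in range(magnitude):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_insertion(tokens, magnitude, rng):
    """Insert `magnitude` copies of existing tokens at random positions."""
    out = tokens[:]
    if not out:
        return out
    for _ in range(magnitude):
        out.insert(rng.randrange(len(out) + 1), rng.choice(tokens))
    return out

OPS = [random_deletion, random_swap, random_insertion]

def rand_augment_text(tokens, num_ops, magnitude, rng):
    """Stack `num_ops` randomly chosen operations, each at strength `magnitude`."""
    out = tokens[:]
    for op in rng.choices(OPS, k=num_ops):
        out = op(out, magnitude, rng)
    return out

rng = random.Random(0)
tokens = "the quick brown fox jumps".split()
print(rand_augment_text(tokens, num_ops=2, magnitude=1, rng=rng))
```

A controller would then search over `num_ops` and `magnitude` (or schedule them during training) instead of fixing them by hand.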

When applying Easy Data Augmentation, several hyperparameters arise. Hyperparameter optimization is one of the active areas of Deep Learning research [ 115 , 116 , 117 ]. This presents a perfect problem to find optimal values for random augmentation samplings, as well as magnitudes such as: how many tokens to delete? SpanBERT [ 118 ], for example, shows that instead of masking out single tokens for language modeling, masking out multiple tokens at a time, known as spans, results in better downstream performance.

Adversarial augmentation

Adversarial attacks and the use of adversarially optimized inputs for augmentation are very similar to the previous discussion on controllers. The key difference is that adversarial controllers target misclassifications, whereas controllers generally try to avoid misclassifications. In particular, adversarial optimization aims to improve robustness to high-frequency pattern shifts. Adversarial attacks on text data generally range from introducing typos to swapping out individual words or chunks of words. There is a great deal of ambiguity here, since many of these perturbations would be cleaned and filtered by text data preprocessing techniques such as spell checkers, case normalization, or regular expression filtering.

TextAttack [ 119 ] is an open-source library implementing adversarial text attacks and providing APIs for Data Augmentation. There are four main components of an attack in the TextAttack framework: a goal function, constraints, transformations, and a search method. This pipeline is illustrated in Fig. 7. The goal function defines the target output; for example, instead of solely flipping the predicted output, we may want to target a 50-50 density. The constraints define how far the input can be changed. The transformation describes the tools available to change the input, such as synonym swaps, deletions, back-translation, and all the other techniques discussed previously. Finally, the search method describes the algorithm for searching for the attack. Similar to our discussion of controllers, there are many different ways to perform black-box searches, such as grid or random searches, Bayesian optimization, and evolutionary search, to name a few [ 115 ].
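The division of labor between the four components can be sketched without any library. The code below is a toy illustration only and deliberately does not use the TextAttack API: the synonym table, the stand-in classifier, and every function name are hypothetical, and the search method is a simple greedy search.

```python
SYNONYMS = {"great": ["fine", "decent"], "film": ["movie"]}  # hypothetical table

def toy_model(words):
    """Hypothetical classifier: returns P(positive) for a list of words."""
    return 0.9 if "great" in words else 0.4

def label_probability(model, words, label):
    p = model(words)
    return p if label == 1 else 1.0 - p

def goal_reached(model, words, label):
    """Goal function: the true label is no longer the most likely one."""
    return label_probability(model, words, label) < 0.5

def satisfies_constraints(original, candidate, max_changed):
    """Constraint: limit how many words differ from the original input."""
    return sum(a != b for a, b in zip(original, candidate)) <= max_changed

def transformations(words):
    """Transformation: every single-word synonym swap."""
    for i, w in enumerate(words):
        for s in SYNONYMS.get(w, []):
            yield words[:i] + [s] + words[i + 1:]

def greedy_attack(text, label, model, max_steps=3, max_changed=1):
    """Search method: greedily take the swap that hurts the true label most."""
    original = text.split()
    current = original
    for _ in range(max_steps):
        candidates = [c for c in transformations(current)
                      if satisfies_constraints(original, c, max_changed)]
        if not candidates:
            return None
        current = min(candidates, key=lambda c: label_probability(model, c, label))
        if goal_reached(model, current, label):
            return " ".join(current)
    return None

print(greedy_attack("a great film", 1, toy_model))
```

Swapping the goal function (for example, targeting a probability near 0.5) or the search method (random or evolutionary search) changes the attack without touching the other components.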

Fig. 7. Developing attacks in TextAttack [ 119 ]

A key consideration with adversarial augmentation is how quickly we can construct adversarial examples. Many adversarial example construction techniques, such as that of Szegedy et al. [ 120 ], rely on iterative optimization such as L-BFGS to find the adversarial example. Waiting for the adversarial search at each training batch would be a significant bottleneck in Deep Learning training. Towards solving this issue, Wang et al. [ 121 ] reduce time consumption by up to 60% with their DEAT algorithm. The high-level idea of DEAT is to use batch replay to avoid repeatedly computing adversarial batches.

Stacking augmentations

Stacking augmentations is a strategy that has improved vision models but is less straightforward to apply to text data. One strategy for this is CoDA [ 122 ]. CoDA introduces a local consistency loss to make sure stacking augmentations has not overly corrupted the sample, and a global loss to preserve local neighborhoods around the original instance.

Tokenization

The preprocessing pipeline of tokenization presents a formidable challenge for implementing Data Augmentations. It is common to tokenize, or convert word tokens to their respective numeric indices in a vocabulary-embedding lookup table, offline before the data reaches the Data Loader itself. Applying Data Augmentations on these index lists could require significantly more engineering effort. Even for simple synonym replacement, additional code will have to be written to construct dictionaries of synonym index values for swaps. Notably, researchers are exploring tokenizer-free models such as ByT5 [ 123 ] and CANINE [ 124 ]. These models process byte-level sequences such as ASCII codes [ 125 , 126 ] and will require special processing to integrate these augmentations.
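The extra engineering mentioned above, translating a word-level synonym dictionary into one keyed by vocabulary indices, might look as follows. The vocabulary, the synonym table, and the function names are toy assumptions made for this sketch.

```python
import random

# Hypothetical vocabulary and word-level synonym table.
vocab = {"[UNK]": 0, "the": 1, "movie": 2, "film": 3, "was": 4, "good": 5, "great": 6}
synonyms = {"movie": ["film"], "good": ["great"]}

def build_id_synonyms(vocab, synonyms):
    """Map each token id to the ids of its in-vocabulary synonyms."""
    table = {}
    for word, alternatives in synonyms.items():
        if word in vocab:
            ids = [vocab[a] for a in alternatives if a in vocab]
            if ids:
                table[vocab[word]] = ids
    return table

def replace_ids(ids, id_synonyms, prob, rng):
    """Swap each replaceable id for a synonym id with probability `prob`."""
    return [
        rng.choice(id_synonyms[i]) if i in id_synonyms and rng.random() < prob else i
        for i in ids
    ]

id_synonyms = build_id_synonyms(vocab, synonyms)
encoded = [1, 2, 4, 5]  # "the movie was good", tokenized offline
print(replace_ids(encoded, id_synonyms, prob=0.5, rng=random.Random(0)))
```

Building the id-level table once, offline, keeps the per-batch augmentation a cheap dictionary lookup. Subword tokenizers complicate this further, because a synonym may span a different number of tokens than the word it replaces.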

Position embeddings

Another, more subtle detail of Transformer implementations is the use of position embeddings. The original Transformer [ 92 ] uses sine and cosine functions to integrate positional information into text sequences. Another subtle Data Augmentation could be to explore perturbing the parameters that render these encodings.
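For reference, the sinusoidal encoding sets even dimensions to sin(pos / 10000^(2i/d)) and odd dimensions to the matching cosine. The sketch below implements that formula and then jitters the base constant as one possible reading of "perturbing the parameters"; the jitter scheme and the name `perturbed_encoding` are assumptions of this example, not an established method.

```python
import math
import random

def sinusoidal_encoding(position, d_model, base=10000.0):
    """Sinusoidal position encoding for one position."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (base ** (i / d_model))
        encoding.append(math.sin(angle))      # even dimension
        if i + 1 < d_model:
            encoding.append(math.cos(angle))  # odd dimension
    return encoding

def perturbed_encoding(position, d_model, rng, jitter=0.1):
    """Hypothetical augmentation: rescale the base by a random factor."""
    scale = 1.0 + rng.uniform(-jitter, jitter)
    return sinusoidal_encoding(position, d_model, base=10000.0 * scale)

rng = random.Random(0)
print(sinusoidal_encoding(3, 4))
print(perturbed_encoding(3, 4, rng))
```

The first pair of dimensions is unaffected by the base (its exponent is zero), so this perturbation only shifts the lower-frequency components.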

Augmentation on CPUs or GPUs?

Another important aspect of Data Augmentation is to understand the typical data preprocessing pipeline from CPUs to GPUs. It has been standard practice to apply Data Augmentation to data on the CPU before it is passed to the GPU for model training. However, recent practice has looked at applying Data Augmentation directly on the GPU. This is done in Keras, for example, by adding Data Augmentation as a layer in the model immediately after the input layer. It is also worth noting clever schemes such as Data Echoing from Choi et al. [ 127 ] that apply additional techniques to avoid idle time between CPU data loading and GPU model training.

Offline and online augmentation

Similarly to the discussion of augmenting data on the CPU or on the GPU, another important consideration is to make sure the Data Augmentation is happening online, compared to offline. This refers to when the original instance is augmented in the data pipeline. Offline augmentation refers to augmenting the data and storing the augmented examples to the disk. Online augmentation describes augmenting the data as a new batch of the original data is loaded for a training step. We note that Online augmentation is much more powerful than Offline augmentation. Offline augmentation offers the slight benefit of faster loading times, but it does not really take advantage of the stochasticity and diversity enabled with most of the described augmentations.

Another important detail of this pipeline is augmentation multiplicity [ 128 ]. Augmentation multiplicity refers to the number of augmented samples derived from one original example. Fort et al. [ 128 ] and Hoffer et al. [ 129 ] illustrate how increasing augmentation multiplicity can improve performance. This approach could introduce significant memory overhead without an online augmentation pipeline. Additionally Wei et al. [ 130 ] point out that examples are often augmented online such that the model never actually trains with the original instances. Wei et al. propose separating the model into two fine-tuning heads, one which trains solely on the unaugmented data and the other trained on high magnitude augmentations. These works highlight the opportunity to explore fine-grained details in augmentation pipelines.
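Online augmentation and augmentation multiplicity can both be expressed in a small batch generator. This is a minimal sketch assuming an in-memory dataset of (text, label) pairs; `random_deletion` and `online_batches` are illustrative names, and a real pipeline would typically sit inside a framework's data loader.

```python
import random

def random_deletion(text, rng, p=0.2):
    """Drop each word with probability p, keeping at least one word."""
    words = text.split()
    if not words:
        return text
    kept = [w for w in words if rng.random() >= p]
    return " ".join(kept) if kept else rng.choice(words)

def online_batches(dataset, batch_size, multiplicity, epochs, seed=0):
    """Yield batches whose examples are re-augmented every time they are loaded."""
    rng = random.Random(seed)
    for _ in range(epochs):
        order = list(range(len(dataset)))
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = []
            for idx in order[start:start + batch_size]:
                text, label = dataset[idx]
                # Multiplicity: several fresh augmented views per original.
                batch.extend(
                    (random_deletion(text, rng), label) for _ in range(multiplicity)
                )
            yield batch

dataset = [("a truly great film", 1), ("a painfully dull film", 0)]
for batch in online_batches(dataset, batch_size=2, multiplicity=2, epochs=1):
    print(batch)
```

Because views are sampled on the fly, nothing is stored on disk and each epoch sees different augmented examples, whereas an offline pipeline with the same multiplicity would multiply the stored dataset size.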

Curriculum learning

Curriculum Learning describes organizing the data batches according to a structure set by a human or a meta-controller. This includes varying the strength of Data Augmentation throughout training. Kuchnik and Smith [ 131 ] find that it is much more efficient to subsample a portion of the dataset to be augmented, rather than augmenting the entire dataset. Wei et al. [ 132 ] demonstrate the efficacy of gradually introducing augmented examples to original examples in the training of triplet networks for text classification. We note this is very similar to our discussion of controllers for augmentation and searching for optimal magnitude and chaining parameters. Thakur et al. [ 85 ] describe that “selecting the sentence pairs is non-trivial and crucial for the success of the method”.
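One simple way to introduce augmented examples gradually is a schedule for the fraction of each batch that is augmented. The linear ramp below is an assumed schedule chosen for illustration, not the specific curriculum of any cited work; the parameter names are hypothetical.

```python
import random

def augmented_fraction(step, total_steps, max_fraction=0.5, warmup_fraction=0.2):
    """Train on originals only during warm-up, then ramp up linearly."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return 0.0
    progress = (step - warmup_steps) / max(1, total_steps - 1 - warmup_steps)
    return max_fraction * min(1.0, progress)

def build_batch(originals, augment, step, total_steps, rng):
    """Replace each example with an augmented copy at the scheduled rate."""
    fraction = augmented_fraction(step, total_steps)
    return [augment(x) if rng.random() < fraction else x for x in originals]

rng = random.Random(0)
batch = ["a great film", "a dull film", "an odd film", "a long film"]
for step in (0, 50, 99):
    print(step, build_batch(batch, str.upper, step, total_steps=100, rng=rng))
```

The same schedule could drive augmentation magnitude instead of the augmented fraction, which connects this idea back to the controllers discussed earlier.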

Class imbalance

A prevalent issue explored in classification models is Class Imbalance [ 133 ]. In addition to customized loss functions, sampling techniques are a promising solution to overcome biases stemming from Class Imbalance. These solutions generally describe strategies such as random oversampling or undersampling [ 134 , 135 ], in addition to interpolation strategies such as the synthetic minority oversampling technique (SMOTE) [ 136 ]. SMOTE is a general framework to oversample minority instances by interpolating between them. From the list of augmentations we have covered, we note that MixUp is very similar to this technique and has been explored for text data. It may be useful to use other techniques for oversampling to avoid potential pitfalls of duplicating instances.
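The interpolation idea behind SMOTE, and its resemblance to MixUp, can be sketched on feature vectors such as sentence embeddings. This is a simplification under stated assumptions: SMOTE proper interpolates between an instance and one of its nearest minority-class neighbors, whereas this sketch pairs minority instances at random, and the function names are illustrative.

```python
import random

def interpolate(a, b, lam):
    """Convex combination of two feature vectors, as in MixUp."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]

def oversample_minority(minority, n_new, rng):
    """Create synthetic minority examples between randomly chosen pairs."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        synthetic.append(interpolate(a, b, rng.random()))
    return synthetic

rng = random.Random(0)
minority_embeddings = [[0.0, 1.0], [1.0, 1.0], [0.5, 2.0]]
for vector in oversample_minority(minority_embeddings, n_new=3, rng=rng):
    print(vector)
```

Unlike random oversampling, every synthetic point is new, which avoids training repeatedly on exact duplicates of the few minority instances.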

Task-specific augmentation for NLP

NLP encompasses many different task formulations. These range from text classification to paraphrase identification, question answering, and abstractive summarization, to name a few. The off-the-shelf Data Augmentation prescribed in the previous section will need slight adaptations for each of these tasks. For example, when augmenting the context in a question answering dataset, it is important to take care not to remove the answer. The largest difference we have found between tasks from the perspective of Data Augmentation is that they vary massively with respect to input length. With short sequences, we have to be more mindful of how augmentations change the original example. Longer sequences involve more design decisions, such as how to sample nested sentences for back-translation and so on. We refer interested readers to Feng et al. [ 16 ] who enumerate how Data Augmentation applies to summarization, question answering, sequence tagging, parsing, grammatical error correction, neural machine translation, data-to-text natural language generation (NLG), open-ended and conditional generation, dialogue, and multimodal tasks.
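For extractive question answering, one way to protect the answer is to augment only the text around it and recompute the answer offset. The sketch below assumes the answer occurs verbatim in the context and uses random word deletion as the augmentation; the function name and return format are hypothetical.

```python
import random

def delete_outside_answer(context, answer, p, rng):
    """Randomly delete context words while leaving the answer span intact."""
    start = context.find(answer)
    if start < 0:
        raise ValueError("answer must occur verbatim in the context")
    before = context[:start].split()
    after = context[start + len(answer):].split()

    def keep(words):
        return [w for w in words if rng.random() >= p]

    new_before, new_after = keep(before), keep(after)
    new_context = " ".join(new_before + [answer] + new_after)
    # Recompute the character offset of the answer in the augmented context.
    new_start = len(" ".join(new_before)) + (1 if new_before else 0)
    return new_context, new_start

rng = random.Random(0)
context = "The Eiffel Tower is located in Paris and opened in 1889"
augmented, offset = delete_outside_answer(context, "Paris", p=0.3, rng=rng)
print(augmented)
print(augmented[offset:offset + len("Paris")])
```

Returning the new offset matters because span-based datasets such as SQuAD store the answer position, which any length-changing augmentation invalidates.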

Self-supervised learning and data augmentation

In both the case of self-supervised learning and Data Augmentation, we are looking to inject prior knowledge about a data domain. When a model is deployed, what is more likely: the data distribution changes or the task the model is supposed to perform with the data changes? In self-supervised learning, we look for ways to set up tasks and loss functions for representation learning. In Data Augmentation, we look for priors to manipulate the data distribution. A key advantage of Data Augmentation is that it is much easier to stack priors than self-supervised learning. In order to utilize multiple priors, self-supervised learning relies on highly unstable multi-task learning or costly multi-stage learning. In contrast, Data Augmentation only requires random sampling operations to integrate multiple priors.

We note that many of the key successes in self-supervised learning rely on Data Augmentation, or have at least been dramatically improved by Data Augmentation. For example, the success of contrastive learning relies on Data Augmentation to form two views of the original instance. The most data-efficient GAN frameworks achieve data-efficiency through the use of Data Augmentation [ 137 ]. Further, DistAug [ 138 ] even tests Data Augmentation with large-scale pixel autoregressive modeling in the ImageGPT model [ 139 ].

Transfer and multi-task learning

Transfer learning has been one of the most successful approaches to training deep neural networks. This looks especially promising as more annotated datasets are collected and unified in dataset hubs. A notable example is HuggingFace datasets [ 140 ], which contained 884 datasets at the time of this publication. In addition to transfer learning, researchers have explored multi-task learning, in which a model simultaneously optimizes multiple tasks. This has been well explored in T5 [ 141 ], which converts all tasks into language modeling. We believe there is room for Data Augmentation experiments in this space, such as the use of MixUp to combine data from multiple tasks or Back-Translation between curated datasets.

Wei et al. [ 130 ] propose an interesting extension to the common practice of transfer learning, named Multi-Task View (MTV), to better utilize augmented subsets and share information across distributions. MTV trains separate heads on augmented subsets and ensembles predictions for the final output. Geva et al. [ 142 ] have also shown utility in sharing a feature extractor base and training separate heads. In this case, Geva et al. train each head with a different task and reformulate inputs into unifying prompts for inference. Similar to the discussion of prompting under Generative Data Augmentation, there remains a significant opportunity to explore transfer learning, multi-task learning, and Data Augmentation.

One of the most interesting ideas in artificial intelligence research is AI-GAs (AI-generating algorithms) [ 10 ]. An AI-generating algorithm is composed of three pillars: meta-learning architectures, meta-learning the learning algorithms themselves, and generating effective learning environments. We believe that Data Augmentation and this interface to control data distributions will play a large role in the third pillar of generating learning environments. For example, learning agents could be embedded in teacher-student loops in which the teacher controls augmentation parameters to render the learning environment.

Learning the learning environment itself has been successfully applied to bipedal walking control with neural networks in POET [ 11 ]. POET is a co-evolutionary framework of control parameters and parameters that render walking terrains. Data Augmentation may be the most natural way of extending this framework to understanding language in which the environment searches for magnitude parameters of augmentation or subsets of data, as in curriculum learning. AI-GAs have been applied to vision problems in examples such as Generative Teaching Networks [ 12 ] and Synthetic Petri Dish [ 13 ]. In GTNs, a teacher network generates training data for a student network. Notably, the training data has high-frequency noise patterns that do not resemble natural image data. It could be interesting to see how well GTNs could generate text embeddings similar to the continuous optimization of prompt tuning.

In conclusion, this survey has presented several strategies for applying Data Augmentation in Text data. These augmentations provide an interface to allow developers to inject priors about their task and data domain into the model. We have additionally presented how Data Augmentation can help simulate distribution shift and test generalization. As Data Augmentation for NLP is relatively immature compared to Computer Vision, we highlight some of the key similarities and differences. We have also presented many ideas surrounding Data Augmentation, from practical engineering considerations to broader discussions of the potential of data augmentation in building artificial intelligence. Data Augmentation is a very promising strategy and we hope our discussion section helps motivate further research interest.

Availability of data and materials

Not applicable.

Shorten C, Khoshgoftaar T, Furht B. Deep learning applications for covid-19. J Big Data. 2021. https://doi.org/10.1186/s40537-020-00392-9 .

Tang R, Nogueira R, Zhang E, Gupta N, Cam P, Cho K, Lin J. Rapidly bootstrapping a question answering dataset for covid-19. 2020. arXiv:2004.11339 . Accessed Jul 2021

Cachola I, Lo K, Cohan A, Weld DS. TLDR: extreme summarization of scientific documents. 2020. arXiv:2004.15011 . Accessed Jul 2021

Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–58 .

Kukačka J, Golkov V, Cremers D. Regularization for deep learning: a taxonomy 2017. arXiv:1710.10686 . Accessed Jul 2021

Shorten C, Khoshgoftaar T. A survey on image data augmentation for deep learning. J Big Data. 2019;6:1–48.

Cubuk ED, Zoph B, Mane D, Vasudevan V, Le QV. AutoAugment: learning augmentation policies from data 2018. arXiv:1805.09501 . Accessed Jul 2021

Ho D, Liang E, Stoica I, Abbeel P, Chen X. Population based augmentation: efficient learning of augmentation policy schedules 2019. arXiv:1905.05393 . Accessed Jul 2021

Cubuk ED, Zoph B, Shlens J, Le QV. RandAugment: practical automated data augmentation with a reduced search space 2019. arXiv:1909.13719 . Accessed Jul 2021

Clune J. AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence 2019. arXiv:1905.10985 . Accessed Jul 2021

Wang R, Lehman J, Clune J, Stanley KO. Poet: Open-ended coevolution of environments and their optimized solutions. In: Proceedings of the Genetic and Evolutionary Computation Conference. GECCO ’19, pp. 142–151. Association for Computing Machinery, New York, NY, USA 2019. https://doi.org/10.1145/3321707.3321799 .

Such FP, Rawal A, Lehman J, Stanley KO, Clune J. Generative teaching networks: accelerating neural architecture search by learning to generate synthetic training data 2019. arXiv:1912.07768 . Accessed Jul 2021

Rawal A, Lehman J, Such FP, Clune J, Stanley KO. Synthetic petri dish: a novel surrogate model for rapid architecture search 2020. arXiv:2005.13092 . Accessed Jul 2021

Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D. Language models are few-shot learners 2020. arXiv:2005.14165 . Accessed Jul 2021

OpenAI: DALL.E: Creating Images from Text. OpenAI 2021. https://openai.com/blog/dall-e/ . Accessed Jul 2021

Feng SY, Gangal V, Wei J, Chandar S, Vosoughi S, Mitamura T, Hovy E. A survey of data augmentation approaches for NLP 2021. arXiv:2105.03075 .

Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L. ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE 2009.

Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota 2019. https://doi.org/10.18653/v1/N19-1423 .

Weiss K, Khoshgoftaar T, Wang D. A survey of transfer learning. J Big Data. 2016. https://doi.org/10.1186/s40537-016-0043-6 .

van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9:2579–605.

McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction 2018. arXiv:1802.03426 . Accessed Jul 2021

Schölkopf B, Locatello F, Bauer S, Ke NR, Kalchbrenner N, Goyal A, Bengio Y. Towards causal representation learning 2021. arXiv:2102.11107 . Accessed Jul 2021

Levine S, Kumar A, Tucker G, Fu J. Offline Reinforcement learning: tutorial, review, and perspectives on open problems 2020. arXiv:2005.01643 . Accessed Jul 2021

Kaushik D, Hovy E, Lipton ZC. Learning the difference that makes a difference with counterfactually-augmented data 2019. arXiv:1909.12434 . Accessed Jul 2021

Liu Q, Kusner M, Blunsom P. Counterfactual data augmentation for neural machine translation. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 187–197. Association for Computational Linguistics, Online 2021. https://www.aclweb.org/anthology/2021.naacl-main.18 . Accessed Jul 2021

Schick T, Schütze H. Generating Datasets with Pretrained Language Models 2021. arXiv:2104.07540 . Accessed Jul 2021

Pearl J. Causality: models, reasoning, and inference. USA: Cambridge University Press; 2000.

Bender EM, Koller A. Climbing towards NLU: On meaning, form, and understanding in the age of data. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5185–5198. Association for Computational Linguistics, Online 2020. https://doi.org/10.18653/v1/2020.acl-main.463 .

Merrill W, Goldberg Y, Schwartz R, Smith NA. Provable limitations of acquiring meaning from ungrounded form: what will future language models understand? 2021. arXiv:2104.10809 . Accessed Jul 2021

Weber L, Jumelet J, Bruni E, Hupkes D. Language modelling as a multi-task problem. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 2049–2060. Association for Computational Linguistics, Online 2021. https://www.aclweb.org/anthology/2021.eacl-main.176 . Accessed Jul 2021

Wang A, Singh A, Michael J, Hill F, Levy O, Bowman S. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks For NLP, pp. 353–355. Association for Computational Linguistics, Brussels, Belgium 2018. https://doi.org/10.18653/v1/W18-5446 .

Sarlin P-E, DeTone D, Malisiewicz T, Rabinovich A. SuperGlue: Learning Feature Matching with Graph Neural Networks 2019. arXiv:1911.11763 . Accessed Jul 2021

Petroni F, Piktus A, Fan A, Lewis P, Yazdani M, Cao ND, Thorne J, Jernite Y, Karpukhin V, Maillard J, Plachouras V, Rocktäschel T, Riedel S. KILT: a benchmark for knowledge intensive language tasks 2020. arXiv:2009.02252 . Accessed Jul 2021

Wei J, Zou K. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6382–6388. Association for Computational Linguistics, Hong Kong, China 2019. https://doi.org/10.18653/v1/D19-1670 .

Jin D, Jin Z, Zhou J, Szolovits P. Is bert really robust? a strong baseline for natural language attack on text classification and entailment. Proc Conf AAAI Artif Intell. 2020;34:8018–25. https://doi.org/10.1609/aaai.v34i05.6311 .

Spasic I, Nenadic G. Clinical text data in machine learning: systematic review. JMIR Med Inform. 2020. https://doi.org/10.2196/17984 .

Min J, McCoy RT, Das D, Pitler E, Linzen T. Syntactic data augmentation increases robustness to inference heuristics. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2339–2352. Association for Computational Linguistics, Online 2020. https://doi.org/10.18653/v1/2020.acl-main.212 .

McCoy T, Pavlick E, Linzen T. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3428–3448. Association for Computational Linguistics, Florence, Italy 2019. https://doi.org/10.18653/v1/P19-1334 .

Ji S, Pan S, Cambria E, Marttinen P, Yu PS. A survey on knowledge graphs: representation, acquisition and applications 2020. https://doi.org/10.1109/TNNLS.2021.3070843 .

Miller GA. Wordnet: a lexical database for english. Commun ACM. 1995;38(11):39–41. https://doi.org/10.1145/219717.219748 .

Marcus MP, Marcinkiewicz MA, Santorini B. Building a large annotated corpus of english: the penn treebank. Comput Linguist. 1993;19(2):313–30.

Zeng X, Song X, Ma T, Pan X, Zhou Y, Hou Y, Zhang Z, Karypis G, Cheng F. Repurpose open data to discover therapeutics for COVID-19 using deep learning 2020. arXiv:2005.10831 . Accessed Jul 2021

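Huang L, Wu L, Wang L. Knowledge graph-augmented abstractive summarization with semantic-driven cloze reward. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5094–5107. Association for Computational Linguistics, Online 2020. https://doi.org/10.18653/v1/2020.acl-main.457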

Glavaš G, Vulić I. Is supervised syntactic parsing beneficial for language understanding tasks? an empirical investigation. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 3090–3104. Association for Computational Linguistics, Online 2021. https://www.aclweb.org/anthology/2021.eacl-main.270 . Accessed Jul 2021

Li MM, Huang K, Zitnik M. Representation learning for networks in biology and medicine: advancements, challenges, and opportunities 2021. arXiv:2104.04883 . Accessed Jul 2021

Zhao T, Liu Y, Neves L, Woodford O, Jiang M, Shah N. Data augmentation for graph neural networks 2020. arXiv:2006.06830 . Accessed Jul 2021

Kong K, Li G, Ding M, Wu Z, Zhu C, Ghanem B, Taylor G, Goldstein T. FLAG: adversarial data augmentation for graph neural networks 2020. arXiv:2010.09891 . Accessed Jul 2021

Gopalan A, Juan D-C, Magalhaes CI, Ferng C-S, Heydon A, Lu C-T, Pham P, Yu G, Fan Y, Wang Y. Neural structured learning: Training neural networks with structured signals. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining. WSDM ’21, pp. 1150–1153. Association for Computing Machinery, New York, NY, USA 2021. https://doi.org/10.1145/3437963.3441666 .

Li J, Xiong C, Hoi S. Comatch: Semi-supervised learning with contrastive graph regularization. arXiv:2011.11183 . Accessed Jul 2021

Guo H, Mao Y, Zhang R. Augmenting data with mixup for sentence classification: an empirical study. arXiv:1905.08941 . Accessed Jul 2021

Cheung T-H, Yeung D-Y. Modals: Modality-agnostic automated data augmentation in the latent space. In: International Conference on Learning Representations 2021. https://openreview.net/forum?id=XjYgR6gbCEc . Accessed Jul 2021

Khosla P, Teterwak P, Wang C, Sarna A, Tian Y, Isola P, Maschinot A, Liu C, Krishnan D. Supervised contrastive learning 2020. arXiv:2004.11362 . Accessed Jul 2021

Li Y, Hu G, Wang Y, Hospedales T, Robertson NM, Yang Y. DADA: differentiable automatic data augmentation 2020. arXiv:2003.03780 . Accessed Jul 2021

Minderer M, Bachem O, Houlsby N, Tschannen M. Automatic shortcut removal for self-supervised representation learning 2020. arXiv:2002.08822 . Accessed Jul 2021

Pham H, Wang X, Yang Y, Neubig G. Meta back-translation. In: International Conference on Learning Representations 2021. https://openreview.net/forum?id=3jjmdp7Hha . Accessed Jul 2021

Longpre S, Wang Y, DuBois C. How effective is task-agnostic data augmentation for pretrained transformers? arXiv:2010.01764 . Accessed Jul 2021

Karras T, Laine S, Aila T. A style-based generator architecture for generative adversarial networks 2018. arXiv:1812.04948 . Accessed Jul 2021

Gao L, Biderman S, Black S, Golding L, Hoppe T, Foster C, Phang J, He H, Thite A, Nabeshima N, Presser S, Leahy C. The pile: an 800GB dataset of diverse text for language modeling 2020. arXiv:2101.00027 . Accessed Jul 2021

Gururangan S, Marasović A, Swayamdipta S, Lo K, Beltagy I, Downey D, Smith NA. Don’t stop pretraining: Adapt language models to domains and tasks. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8342–8360. Association for Computational Linguistics, Online 2020. https://doi.org/10.18653/v1/2020.acl-main.740 . Accessed Jul 2021

Wu X, Lv S, Zang L, Han J, Hu S. Conditional BERT contextual augmentation 2018. arXiv:1812.06705 . Accessed Jul 2021

Scao TL, Rush AM. How many data points is a prompt worth? 2021. arXiv:2103.08493 . Accessed Jul 2021

Lester B, Al-Rfou R, Constant N. The power of scale for parameter-efficient prompt tuning 2021. arXiv:2104.08691 . Accessed Jul 2021

Houlsby N, Giurgiu A, Jastrzebski S, Morrone B, de Laroussilhe Q, Gesmundo A, Attariyan M, Gelly S. Parameter-efficient transfer learning for NLP 2019. arXiv:1902.00751 . Accessed Jul 2021

Zhang J, Zhao Y, Saleh M, Liu P. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In: Daumé III H, Singh A (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 11328–11339. PMLR 2020. http://proceedings.mlr.press/v119/zhang20ae.html . Accessed Jul 2021

Zhong R, Lee K, Zhang Z, Klein D. Meta-tuning language models to answer prompts better 2021. arXiv:2104.04670 . Accessed Jul 2021

Schick T, Schütze H. Exploiting cloze-questions for few-shot text classification and natural language inference. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 255–269. Association for Computational Linguistics, Online 2021. https://www.aclweb.org/anthology/2021.eacl-main.20 . Accessed Jul 2021

Schick T, Schütze H. It’s Not Just Size That Matters: small language models are also few-shot learners 2020. arXiv:2009.07118 . Accessed Jul 2021

Tam D, Menon RR, Bansal M, Srivastava S, Raffel C. Improving and simplifying pattern exploiting training 2021. arXiv:2103.11955 . Accessed Jul 2021

Yoo KM, Park D, Kang J, Lee S-W, Park W. GPT3Mix: leveraging large-scale language models for text augmentation 2021. arXiv:2104.08826 . Accessed Jul 2021

Cer D, Yang Y, Kong S-Y, Hua N, Limtiaco N, St. John R, Constant N, Guajardo-Cespedes M, Yuan S, Tar C, Strope B, Kurzweil R. Universal sentence encoder for English. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 169–174. Association for Computational Linguistics, Brussels, Belgium 2018. https://doi.org/10.18653/v1/D18-2029 . Accessed Jul 2021

Reimers N, Gurevych I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992. Association for Computational Linguistics, Hong Kong, China 2019. https://doi.org/10.18653/v1/D19-1410 . Accessed Jul 2021

Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, Küttler H, Lewis M, Yih W-T, Rocktäschel T, Riedel S, Kiela D. Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474. Curran Associates, Inc. 2020. https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf . Accessed Jul 2021

Guu K, Lee K, Tung Z, Pasupat P, Chang M-W. REALM: Retrieval-augmented language model pre-training. arXiv:2002.08909 . Accessed Jul 2021

Shuster K, Poff S, Chen M, Kiela D, Weston J. Retrieval augmentation reduces hallucination in conversation 2021. arXiv:2104.07567 . Accessed Jul 2021

Kairouz P, McMahan HB, Avent B, Bellet A, Bennis M, Bhagoji AN, Bonawitz K, Charles Z, Cormode G, Cummings R, D’Oliveira RGL, Eichner H, Rouayheb SE, Evans D, Gardner J, Garrett Z, Gascón A, Ghazi B, Gibbons PB, Gruteser M, Harchaoui Z, He C, He L, Huo Z, Hutchinson B, Hsu J, Jaggi M, Javidi T, Joshi G, Khodak M, Konečný J, Korolova A, Koushanfar F, Koyejo S, Lepoint T, Liu Y, Mittal P, Mohri M, Nock R, Özgür A, Pagh R, Raykova M, Qi H, Ramage D, Raskar R, Song D, Song W, Stich SU, Sun Z, Suresh AT, Tramèr F, Vepakomma P, Wang J, Xiong L, Xu Z, Yang Q, Yu FX, Yu H, Zhao S. Advances and open problems in federated learning 2019. arXiv:1912.04977 . Accessed Jul 2021

Carlini N, Tramer F, Wallace E, Jagielski M, Herbert-Voss A, Lee K, Roberts A, Brown T, Song D, Erlingsson U, Oprea A, Raffel C. Extracting training data from large language models 2020. arXiv:2012.07805 . Accessed Jul 2021

Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. In: NIPS Deep Learning and Representation Learning Workshop 2015. arXiv:1503.02531 . Accessed Jul 2021

Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter 2019. arXiv:1910.01108 . Accessed Jul 2021

Chen X, He B, Hui K, Sun L, Sun Y. Simplified TinyBERT: knowledge distillation for document retrieval 2020. arXiv:2009.07531 . Accessed Jul 2021

Xie Q, Luong M-T, Hovy E, Le QV. Self-training with noisy student improves imagenet classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020

Müller R, Kornblith S, Hinton G. When does label smoothing help? 2019. arXiv:1906.02629 . Accessed Jul 2021

Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X, Chen X. Improved techniques for training GANs. In: Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R (eds.) Advances in Neural Information Processing Systems, vol. 29. Curran Associates, Inc. 2016. https://proceedings.neurips.cc/paper/2016/file/8a3363abe792db2d8761d6403605aeb7-Paper.pdf . Accessed Jul 2021

Pham H, Dai Z, Xie Q, Luong M-T, Le QV. Meta pseudo labels 2020. arXiv:2003.10580 . Accessed Jul 2021

Raghu A, Raghu M, Kornblith S, Duvenaud D, Hinton G. Teaching with commentaries. In: International Conference on Learning Representations 2021. https://openreview.net/forum?id=4RbdgBh9gE . Accessed Jul 2021

Thakur N, Reimers N, Daxenberger J, Gurevych I. Augmented SBERT: data augmentation method for improving bi-encoders for pairwise sentence scoring tasks 2020. arXiv:2010.08240 . Accessed Jul 2021

Arjovsky M. Out of Distribution generalization in machine learning 2021. arXiv:2103.02667 . Accessed Jul 2021

Chollet F. On the measure of intelligence 2019. arXiv:1911.01547 . Accessed Jul 2021

Nakkiran P, Kaplun G, Bansal Y, Yang T, Barak B, Sutskever I. Deep double descent: Where bigger models and more data hurt. In: International Conference on Learning Representations 2020. https://openreview.net/forum?id=B1g5sA4twr . Accessed Jul 2021

Jia R, Liang P. Adversarial examples for evaluating reading comprehension systems. arXiv:1707.07328 . Accessed Jul 2021

Ribeiro MT, Wu T, Guestrin C, Singh S. Beyond accuracy: Behavioral testing of NLP models with CheckList. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4902–4912. Association for Computational Linguistics, Online 2020. https://doi.org/10.18653/v1/2020.acl-main.442

Clark P, Tafjord O, Richardson K. Transformers as soft reasoners over language 2020. arXiv:2002.05867 . Accessed Jul 2021

Koh PW, Sagawa S, Marklund H, Xie SM, Zhang M, Balsubramani A, Hu W-H, Yasunaga M, Phillips RL, Beery S, Leskovec J, Kundaje A, Pierson E, Levine S, Finn C, Liang P. Wilds: a benchmark of in-the-wild distribution shifts. arXiv:2012.07421 . Accessed Jul 2021

Nakkiran P, Neyshabur B, Sedghi H. The deep bootstrap framework: Good online learners are good offline generalizers. In: International Conference on Learning Representations 2021. https://openreview.net/forum?id=guetrIHLFGI . Accessed Jul 2021

Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models 2020. arXiv:2006.11239 . Accessed Jul 2021

Garg S, Balakrishnan S, Kolter JZ, Lipton ZC. RATT: Leveraging unlabeled data to guarantee generalization 2021. arXiv:2105.00303 . Accessed Jul 2021

Tan H, Bansal M. Vokenization: Improving language understanding with contextualized, visual-grounded supervision. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2066–2080. Association for Computational Linguistics, Online 2020. https://doi.org/10.18653/v1/2020.emnlp-main.162

Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. NIPS’14, pp. 2672–2680. MIT Press, Cambridge, MA, USA 2014

Clark K, Luong M-T, Le QV, Manning CD. Electra: Pre-training text encoders as discriminators rather than generators. In: International Conference on Learning Representations 2020. https://openreview.net/forum?id=r1xMH1BtvB . Accessed Jul 2021

Zhang H, Zhang Z, Odena A, Lee H. Consistency regularization for generative adversarial networks. In: International Conference on Learning Representations 2020. https://openreview.net/forum?id=S1lxKlSKPH . Accessed Jul 2021

Self-supervised learning: The Dark Matter of Intelligence. https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence . Accessed 14 May 2021.

Ji CX, Oberst M, Kanjilal S, Sontag D. Trajectory inspection: A method for iterative clinician-driven design of reinforcement learning studies. AMIA 2021 Virtual Informatics Summit 2021

Johnson A, Pollard T, Shen L, Lehman L-W, Feng M, Ghassemi M, Moody B, Szolovits P, Celi L, Mark R. MIMIC-III, a freely accessible critical care database. Sci Data. 2016. https://doi.org/10.1038/sdata.2016.35 .

Lachaux M-A, Roziere B, Chanussot L, Lample G. Unsupervised translation of programming languages 2020. arXiv:2006.03511 . Accessed Jul 2021

Alberti C, Andor D, Pitler E, Devlin J, Collins M. Synthetic QA corpora generation with roundtrip consistency 2019. arXiv:1906.05416 . Accessed Jul 2021

Xie Q, Dai Z, Hovy E, Luong M-T, Le QV. Unsupervised data augmentation for consistency training 2019. arXiv:1904.12848 . Accessed Jul 2021

Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. In: Daumé III H, Singh A (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 1597–1607. PMLR 2020. http://proceedings.mlr.press/v119/chen20j.html . Accessed Jul 2021

He K, Fan H, Wu Y, Xie S, Girshick R. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020

Caron M, Misra I, Mairal J, Goyal P, Bojanowski P, Joulin A. Unsupervised learning of visual features by contrasting cluster assignments 2020. arXiv:2006.09882 . Accessed Jul 2021

Grill J-B, Strub F, Altché F, Tallec C, Richemond PH, Buchatskaya E, Doersch C, Pires BA, Guo ZD, Azar MG, Piot B, Kavukcuoglu K, Munos R, Valko M. Bootstrap your own latent: a new approach to self-supervised learning 2020. arXiv:2006.07733 . Accessed Jul 2021

Krishna K, Roy A, Iyyer M. Hurdles to progress in long-form question answering 2021. arXiv:2103.06332 . Accessed Jul 2021

Gunel B, Du J, Conneau A, Stoyanov V. Supervised contrastive learning for pre-trained language model fine-tuning. In: International Conference on Learning Representations 2021. https://openreview.net/forum?id=cu7IUiOhujH . Accessed Jul 2021

Rethmeier N, Augenstein I. A primer on contrastive pretraining in language processing: methods, lessons learned and perspectives 2021. arXiv:2102.12982 . Accessed Jul 2021

Thorne J, Vlachos A, Christodoulopoulos C, Mittal A. FEVER: a large-scale dataset for fact extraction and verification 2018. arXiv:1803.05355 . Accessed Jul 2021

Sinha A, Ayush K, Song J, Uzkent B, Jin H, Ermon S. Negative data augmentation. In: International Conference on Learning Representations 2021. https://openreview.net/forum?id=Ovp8dvB8IBH . Accessed Jul 2021

He X, Zhao K, Chu X. Automl: a survey of the state-of-the-art. Knowl-Based Syst. 2021. https://doi.org/10.1016/j.knosys.2020.106622 .

Li L, Jamieson K, DeSalvo G, Rostamizadeh A, Talwalkar A. Hyperband: a novel bandit-based approach to hyperparameter optimization. J Mach Learn Res. 2018;18:1–52.

Li L, Jamieson K, Rostamizadeh A, Gonina E, Hardt M, Recht B, Talwalkar A. A system for massively parallel hyperparameter tuning 2018. arXiv:1810.05934 . Accessed Jul 2021

Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O. Spanbert: improving pre-training by representing and predicting spans. TACL. 2019;8:64–77.

Morris JX, Lifland E, Yoo JY, Grigsby J, Jin D, Qi Y. TextAttack: a framework for adversarial attacks, data augmentation, and adversarial training in NLP 2020. arXiv:2005.05909 . Accessed Jul 2021

Szegedy C, Zaremba W, Sutskever I, Bruna J, Erhan D, Goodfellow I, Fergus R. Intriguing properties of neural networks. In: International Conference on Learning Representations 2014. http://arxiv.org/abs/1312.6199 . Accessed Jul 2021

Wang F, Zhang Y, Zheng Y, Ruan W. Gradient-guided dynamic efficient adversarial training. arXiv:2103.03076 . Accessed Jul 2021

Qu Y, Shen D, Shen Y, Sajeev S, Chen W, Han J. Coda: Contrast-enhanced and diversity-promoting data augmentation for natural language understanding. In: International Conference on Learning Representations 2021. https://openreview.net/forum?id=Ozk9MrX1hvA . Accessed Jul 2021

Xue L, Barua A, Constant N, Al-Rfou R, Narang S, Kale M, Roberts A, Raffel C. ByT5: Towards a token-free future with pre-trained byte-to-byte models 2021. arXiv:2105.13626 . Accessed Jul 2021

Clark JH, Garrette D, Turc I, Wieting J. CANINE: pre-training an efficient tokenization-free encoder for language representation 2021. arXiv:2103.06874 . Accessed Jul 2021

Prusa JD, Khoshgoftaar TM. Designing a better data representation for deep neural networks and text classification. https://doi.org/10.1109/IRI.2016.61 . Accessed Jul 2021

Prusa J, Khoshgoftaar T. Improving deep neural network design with new text data representations. J Big Data. 2017. https://doi.org/10.1186/s40537-017-0065-8 .

Choi D, Passos A, Shallue CJ, Dahl GE. Faster neural network training with data echoing 2019. arXiv:1907.05550 . Accessed Jul 2021

Fort S, Brock A, Pascanu R, De S, Smith SL. Drawing multiple augmentation samples per image during training efficiently decreases test error 2021. arXiv:2105.13343 . Accessed Jul 2021

Hoffer E, Ben-Nun T, Hubara I, Giladi N, Hoefler T, Soudry D. Augment your batch: Improving generalization through instance repetition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020

Wei J, Huang C, Xu S, Vosoughi S. Text augmentation in a multi-task view. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 2888–2894. Association for Computational Linguistics, Online 2021. https://www.aclweb.org/anthology/2021.eacl-main.252

Kuchnik M, Smith V. Efficient augmentation via data subsampling. In: International Conference on Learning Representations 2019. https://openreview.net/forum?id=Byxpfh0cFm . Accessed Jul 2021

Wei J, Huang C, Vosoughi S, Cheng Y, Xu S. Few-shot text classification with triplet networks, data augmentation, and curriculum learning 2021. arXiv:2103.07552 . Accessed Jul 2021

Johnson J, Khoshgoftaar T. Survey on deep learning with class imbalance. J Big Data. 2019;6:27. https://doi.org/10.1186/s40537-019-0192-5 .

Prusa J, Khoshgoftaar TM, Dittman DJ, Napolitano A. Using random undersampling to alleviate class imbalance on tweet sentiment data. In: 2015 IEEE International Conference on Information Reuse and Integration, pp. 197–202 2015. https://doi.org/10.1109/IRI.2015.39

Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A. Rusboost: a hybrid approach to alleviating class imbalance. IEEE Trans Syst Man Cybern Part A. 2010;40(1):185–97. https://doi.org/10.1109/TSMCA.2009.2029559 .

Chawla N, Bowyer K, Hall L, Kegelmeyer W. Smote: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57. https://doi.org/10.1613/jair.953 .

Karras T, Aittala M, Hellsten J, Laine S, Lehtinen J, Aila T. Training generative adversarial networks with limited data. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 12104–12114. Curran Associates, Inc. 2020. https://proceedings.neurips.cc/paper/2020/file/8d30aa96e72440759f74bd2306c1fa3d-Paper.pdf . Accessed Jul 2021

Jun H, Child R, Chen M, Schulman J, Ramesh A, Radford A, Sutskever I. Distribution augmentation for generative modeling. In: Daumé III H, Singh A (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 5006–5019. PMLR 2020. http://proceedings.mlr.press/v119/jun20a.html . Accessed Jul 2021

Image GPT. https://openai.com/blog/image-gpt/ . Accessed 14 May 2021.

HuggingFace Datasets. https://huggingface.co/docs/datasets/ . Accessed 14 May 2021.

Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu PJ. Exploring the limits of transfer learning with a unified text-to-text transformer 2019. arXiv:1910.10683 . Accessed Jul 2021

Geva M, Katz U, Ben-Arie A, Berant J. What’s in your Head? Emergent behaviour in multi-task transformer models 2021. arXiv:2104.06129 . Accessed Jul 2021

Acknowledgements

We would like to thank the reviewers in the Data Mining and Machine Learning Laboratory at Florida Atlantic University. Additionally, we acknowledge partial support by the NSF (IIS-2027890). Opinions, findings, conclusions, or recommendations in this paper are the authors’ and do not reflect the views of the NSF.

Funding

NSF RAPID (IIS-2027890).

Author information

Authors and affiliations.

Florida Atlantic University, 777 Glades Road, Boca Raton, FL, 33431, USA

Connor Shorten, Taghi M. Khoshgoftaar & Borko Furht

Contributions

CS performed the literature review and drafted the manuscript. TMK worked with CS to develop the article’s framework and focus. TMK introduced this topic to CS. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Connor Shorten .

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Shorten, C., Khoshgoftaar, T.M. & Furht, B. Text Data Augmentation for Deep Learning. J Big Data 8, 101 (2021). https://doi.org/10.1186/s40537-021-00492-0

Received: 22 June 2021

Accepted: 28 June 2021

Published: 19 July 2021

DOI: https://doi.org/10.1186/s40537-021-00492-0

  • Data Augmentation
  • Natural Language Processing
  • Overfitting

A survey of automated data augmentation algorithms for deep learning-based image classification tasks

  • Regular Paper
  • Open access
  • Published: 17 March 2023
  • Volume 65, pages 2805–2861 (2023)


  • Zihan Yang 1 ,
  • Richard O. Sinnott 1 ,
  • James Bailey 1 &
  • Qiuhong Ke 2  


In recent years, deep learning has become one of the most popular techniques in the computer vision community. As a data-driven technique, a deep model requires enormous amounts of accurately labelled training data, which are often inaccessible in many real-world applications. A data-space solution is Data Augmentation (DA), which can artificially generate new images from original samples. Image augmentation strategies can vary by dataset, as different data types might require different augmentations to facilitate model training. However, the design of DA policies has largely been decided by human experts with domain knowledge, which is considered to be highly subjective and error-prone. To mitigate this problem, a novel direction is to automatically learn the image augmentation policies from the given dataset using Automated Data Augmentation (AutoDA) techniques. The goal of AutoDA models is to find the optimal DA policies that can maximize the model performance gains. This survey discusses the underlying reasons for the emergence of AutoDA technology from the perspective of image classification. We identify three key components of a standard AutoDA model: a search space, a search algorithm and an evaluation function. Based on their architecture, we provide a systematic taxonomy of existing image AutoDA approaches. This paper presents the major works in the AutoDA field, discussing their pros and cons, and proposing several potential directions for future improvements.


1 Introduction

Driven by recent advances in neural network architectures, deep learning has made great progress in Computer Vision (CV) [ 1 , 2 , 3 , 4 ]. In particular, deep learning models have been successfully applied to image classification tasks in diverse areas from medical imaging [ 5 , 6 ] to agriculture [ 7 , 8 ]. However, to achieve enhanced performance, deep learning, as a data-driven technology, places significant demands on both the quantity and quality of data for model training and testing. Effectively training a supervised model relies heavily on enormous amounts of annotated data, which are often difficult to obtain in practical applications [ 9 ].

To address the issue of data insufficiency, Data Augmentation (DA) is widely utilized. In general, data augmentation refers to the process of artificially generating data samples to increase the size of training data [ 10 ]. In the imaging domain, this is usually done by applying image Transformation Functions (TFs), such as translation, rotation or flipping. For computer vision tasks, image DA has been utilised in nearly all supervised neural network architectures to increase data volume and variety, including traditional data-driven models [ 11 , 12 , 13 , 14 ], and few/zero-shot learning [ 15 ]. Besides supervised approaches, DA techniques are also extensively applied in the field of unsupervised learning. For example, contrastive self-supervised learning relies on image transformations to incorporate data invariance in the representation space across various augmentations [ 16 ].

In the context of image augmentation, a DA policy refers to a set of image operations, which are used to transform the image data. When applying image DA, choosing a carefully designed augmentation scheme (i.e. DA policy) is necessary to improve the effectiveness of DA and hence the associated network training [ 1 , 17 ]. For instance, data augmented by random image operations can be redundant. But overly aggressive TFs might corrupt the original semantics, and introduce potential biases into the training dataset [ 13 ]. Therefore, different datasets or domains may require different types of augmentations. Specifically, when standard supervised approaches are applied, classification tasks with limited data may require label-preserving augmentations to provide direct semantic supervision. However, for few/zero-shot learning models, more emphasis is placed on increasing data diversity in order to generate an enriched training set [ 18 ], which might promote more aggressive augmentation TFs.

In spite of the ubiquity and importance of DA techniques, there is little systematic strategy for selecting a DA policy for a given dataset. Unlike other machine learning topics that have been thoroughly explored, less attention has been paid to finding effective DA policies that benefit a particular dataset and hence improve model accuracy. Instead, DA policies are often intuitively decided based on past experience or limited trials [ 19 ]. Decisions on augmentation strategies are still made by human experts, based on prior knowledge. For example, the standard augmentation policy on ImageNet data was proposed in 2012 [ 1 ]. This is still widely used in most contemporary networks without much modification [ 20 ]. Furthermore, criteria for selecting good augmentation methods on different datasets may vary greatly due to the nature of the given tasks. The traditional trial-and-error approach based on training loss or accuracy can give rise to extensive, redundant data collections, wasting computational effort and resources.

Fig. 1 General taxonomy of image data augmentation techniques

Motivated by progress in Automated Machine Learning (AutoML), there has been a rising interest in automatically searching for effective augmentation policies from training data [ 20 , 21 , 22 , 23 ]. Such techniques are often referred to as Automated Data Augmentation (AutoDA). Figure 1 shows a basic taxonomy of DA techniques, depicting the relationship between traditional DA and advanced AutoDA, as well as several sub-classes of AutoDA. Compared with standard DA, AutoDA emphasizes the automation aspect of DA policy selection. Recent research has found that instead of manually designing the DA schemes, directly learning a DA strategy from the target dataset has the potential to significantly improve model performance [ 10 , 24 , 25 , 26 ]. Specifically, the DA policy that yields the most performance gain on the classification model is regarded as the optimal augmentation policy for a given dataset. Among various contemporary works, AutoAugment (AA) stands out as the first AutoDA model, achieving state-of-the-art results on several popular image classification datasets, including CIFAR-10/100 [ 27 ], ImageNet [ 28 ] and SVHN [ 29 ]. More importantly, AA provides an essential theoretical foundation for later works that support automated augmentation [ 21 , 22 , 23 , 30 , 31 ].

Progress in automating DA policy search can potentially change the existing process of model training. An AutoDA model can automatically select the most effective combination of augmentation transformations to form the final DA policy. Once the optimal augmentation policy is found, the training set augmented by the learned policy can dramatically boost the model performance without extra input. Furthermore, AutoDA methods can be designed to be applied directly to the datasets of interest. The optimal DA policy learned from the data is regarded as the best augmentation formula for the target task, and hence it should guarantee the best model performance. Another desirable aspect of AutoDA techniques is their transferability. According to the findings in [ 20 ], learned DA policies can also be applied to other similar datasets with promising results.

Although considerable progress has been made in DA policy search, there is still a lack of a comprehensive survey that systematically summarizes the diverse methods. To the best of our knowledge, no one has conducted a qualitative comparison of existing AutoDA methods or provided a systematic evaluation of their advantages and disadvantages. To fill this gap, this paper aims to identify the current state of research in the field of Automated Data Augmentation (AutoDA), especially for image classification tasks.

In this paper, we mainly review contemporary AutoDA works in the imaging domain. We provide a systematic analysis, identifying three key components of standard AutoDA techniques, i.e. the search space, the search algorithm and the evaluation function. Based on the different choices of search algorithm in the reviewed works, we then propose a two-layer taxonomy of all AutoDA approaches. We also evaluate AutoDA approaches in terms of the efficiency of the search algorithm, as well as the final performance of the trained classification model. Through comparative analysis, we identify major contributions and limitations of these methods. Lastly, we summarise several main challenges and propose potential future directions in the AutoDA field.

Our main contributions can be summarized as follows:

  • Background on image data augmentation, including traditional approaches and Automated Data Augmentation (AutoDA) models (Sect.  2 ).
  • Introduction of three key components within standard AutoDA models, along with evaluation metrics and benchmarks used in most works (Sect.  3 ).
  • A hierarchical taxonomy of the mainstream AutoDA algorithms for image classification tasks from the perspective of hyper-parameter optimization (Sect.  4 ).
  • Thorough review of each AutoDA method in the taxonomy, detailing the search algorithm applied (Sects.  5 and 6 ).
  • Discussion of the current state of AutoDA techniques, as well as the existing challenges and potential future opportunities (Sect.  7 ).

2 Background

This section introduces background information about data augmentation in the computer vision field with a focus on image classification tasks. We first provide a general overview of how DA techniques have developed and been applied to computer vision tasks. Then we briefly describe several traditional image processing operations that are involved in most AutoDA models. Finally, we discuss the recent advances in AutoDA techniques and how such techniques relate to Automated Machine Learning (AutoML).

2.1 Historical overview of image data augmentation

The early application of image augmentation can be traced back to LeNet-5 [ 32 ], where a Data Augmentation (DA) technique was applied by distorting images for recognizing handwritten and machine-printed characters. This work was one of the earliest pre-trained Convolutional Neural Networks (CNNs) that used DA for image classification tasks. Generally, DA can be regarded as an oversampling method. The objective of oversampling is to mitigate the negative influence of limited data or class imbalances by increasing the number of data samples. A naive approach to oversampling is random oversampling, which randomly duplicates data points in minority classes until a desired data amount or data distribution is achieved. However, the duplicate images created by this technique can result in model overfitting towards the minority class. This problem becomes even more notable when deep learning techniques are used. To add more variety to generated samples, DA via image transformations has emerged.
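The random oversampling baseline described above can be sketched in a few lines. This is a minimal illustration, not code from the cited works; the helper name `random_oversample` and the toy sample names are assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            # Each added item is an exact duplicate, so no new variety is created.
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


samples = ["cat_0", "cat_1", "cat_2", "cat_3", "dog_0"]
labels = ["cat", "cat", "cat", "cat", "dog"]
balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

In this toy case the single minority sample ends up repeated four times, which illustrates why plain duplication can push a model towards memorising the minority class.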

The most famous early use case of image DA was the AlexNet model [ 1 ]. AlexNet significantly improved classification results on ImageNet data [ 28 ] through the use of a revolutionary CNN architecture. In their work, image augmentation was used to artificially expand the dataset. Multiple image operations were applied to the original training set, including random cropping, horizontal flipping and colour adjustment in RGB space. These transformation functions helped mitigate overfitting problems during model training. According to the experimental results in [ 1 ], image DA reduced the final error rate by approximately \(1\%\) . Since then, image augmentation has been regarded as a necessary pre-processing procedure before training complex CNNs, from VGG [ 3 ] to ResNet [ 4 ] and Inception [ 33 ].

Image augmentation is not limited to basic image processing. Following the proposal of the Generative Adversarial Network (GAN) in [ 34 ], related works flourished in the following decade. Among them, the most influential technique was Neural Architecture Search (NAS) [ 35 ]. NAS is a type of AutoML technique, which is the process of searching for model architectures through automation. The advancement of NAS greatly promoted the development of DA technology in the imaging field. Applying concepts and techniques from NAS and AutoML has gained increasing interest in the CV community. Recent progress includes Neural Augmentation [ 36 ], which tests the effectiveness of GANs in image augmentation; Smart Augmentation [ 10 ], which generates synthetic image data using neural networks; and AA [ 20 ], which is aimed at the automation of image transformation selection for DA. The latter work forms the basis for AutoDA and is the focus of this survey.

Most of the augmentation methods mentioned above were designed for image classification. The ultimate goal of image DA in classification tasks is to improve the prediction accuracy of discriminative models. However, the same technique is applicable to other computer vision tasks, for example Object Detection (OD), where image augmentation can be combined with advanced deep neural networks including YOLO [ 37 ] and the R-CNN series [ 38 , 39 , 40 ]. Semantic segmentation tasks [ 41 ] can also benefit from DA before training complex networks such as U-Net [ 42 ]. In this study, we particularly focus on the application of Automated DA (AutoDA) to image classification tasks, as there exist more published datasets in this domain that allow a fair evaluation. For some AutoDA methods, we also discuss the possibilities of applying them in object detection tasks if there are experimental results available.

2.2 Traditional image augmentation techniques

Image augmentation aims to enhance both the quantity and quality of datasets so that neural networks can be better trained [ 43 ]. Usually, DA does this in two ways, either through traditional image operations or based on deep learning technology. Traditional augmentation often emphasizes preserving the image’s original label and transforms existing images into a new form [ 43 ]. This can be achieved through various image processing approaches, including but not limited to geometric transformations, adjustment in colour space or even combinations of them.

Another augmentation technique based on deep learning attempts to generate synthetic images as the training set. Major techniques involve Mixup augmentation [ 44 ], GANs and transformations in feature space [ 25 ]. Due to the complexity of deep learning DA, only the basic image processing operations are considered in recent automated DA methods. Hence, we focus on the basic image transformations that can be easily parameterized. The rest of this section briefly introduces several basic image processing functions that are usually considered in AutoDA models, including geometric transformations, flipping, rotation, cropping, colour adjustment and kernel filters. Another two augmentation algorithms are also covered due to their presence in AutoAugment work [ 20 ], namely Cutout [ 45 ] and SamplePairing [ 46 ].

2.2.1 Geometric transformations

The simplest place to start image augmentation is with geometric transformation functions, such as image translation or scaling. These operations are easy to implement and can also be combined with other transformations to form more advanced DA algorithms. One important consideration when applying such operations is whether they preserve the original image label after the transformation [ 47 ]. From the perspective of image augmentation, the ability to keep label consistency can also be called the safety feature of transformation functions [ 43 ]. In other words, transformations that may risk corrupting annotation information are considered to be unsafe. In general, geometric transformations tend to preserve the labels as they only change the position of key features. However, depending on the magnitude of the transformation function, the application of the chosen operation might not always be safe. For example, translation along the y axis with a high magnitude may end up completely shifting the object of interest outside of the visible area, in which case it fails to preserve the label of the post-processed image.
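The effect of translation magnitude on label safety can be made concrete with a small NumPy sketch. The function `translate`, the 8 × 8 toy image and the chosen shifts are illustrative assumptions, not part of any surveyed method.

```python
import numpy as np


def translate(image, dx, dy, fill=0):
    """Shift an H x W (x C) image by dx pixels right and dy pixels down,
    keeping the resolution and filling vacated pixels with `fill`."""
    h, w = image.shape[:2]
    out = np.full_like(image, fill)
    if abs(dx) >= w or abs(dy) >= h:
        return out  # everything has been shifted out of the frame
    out[max(0, dy):min(h, h + dy), max(0, dx):min(w, w + dx)] = \
        image[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    return out


# Toy image: a 2 x 2 "object" in the middle of an 8 x 8 frame.
image = np.zeros((8, 8), dtype=np.uint8)
image[3:5, 3:5] = 1

safe = translate(image, dx=1, dy=1)    # object is still fully visible
unsafe = translate(image, dx=0, dy=6)  # object is pushed out of the frame
print(safe.sum(), unsafe.sum())
```

With the small shift the object survives intact, whereas the large vertical shift leaves an empty frame whose original label no longer describes the content.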

2.2.2 Rotation

Rotating the image by a given angle is another common DA technique. It is a special type of geometric transformation, which also has the risk of removing meaningful semantic information from the visible area. Aggressive operations with a large rotation angle are usually unsafe, especially for text-related data, e.g. "6" and "9" in SVHN data [ 29 ]. However, according to [ 43 ], slight rotation within the range of \(1^{\circ }\) to \(30^{\circ }\) can be helpful for most image classification tasks.
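As a rough sketch of rotation as an augmentation, the following NumPy function rotates a 2-D image about its centre using nearest-neighbour sampling. The function name `rotate` and the inverse-mapping implementation are assumptions chosen for brevity; image libraries normally provide interpolating versions.

```python
import numpy as np


def rotate(image, angle_deg, fill=0):
    """Rotate a 2-D image counter-clockwise about its centre by angle_deg,
    with nearest-neighbour sampling; unmapped pixels are set to `fill`."""
    h, w = image.shape
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, locate its source pixel.
    x0 = np.cos(theta) * (xs - cx) - np.sin(theta) * (ys - cy) + cx
    y0 = np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    xi = np.rint(x0).astype(int)
    yi = np.rint(y0).astype(int)
    inside = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.full_like(image, fill)
    out[inside] = image[yi[inside], xi[inside]]
    return out


image = np.arange(25).reshape(5, 5)
slight = rotate(image, 15)       # mild rotation, usually label-preserving
upside_down = rotate(image, 180) # aggressive: would turn a "6" into a "9"
```

A 180° rotation is exactly the kind of aggressive setting that is unsafe for digit data, since it maps one class onto another.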

2.2.3 Flipping

Flipping is different from rotation augmentation as it generates mirror images. Flipping can be done either horizontally or vertically, while the former is more commonly used in computer vision [ 43 ]. This is one of the simple yet most effective augmentation techniques on image data, especially for CIFAR-10/100 [ 27 ] and ImageNet [ 28 ]. The safety feature of flipping largely depends on the type of input data. For normal classification or object detection tasks, flipping augmentation preserves the original label. But it can be unsafe for data involving digits or texts, such as SVHN data [ 29 ].
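Flipping is simple enough to express with array slicing. The sketch below assumes images are stored as NumPy arrays with rows as the first axis; `random_horizontal_flip` is a hypothetical helper showing how an application probability is typically attached to the operation.

```python
import random

import numpy as np

image = np.array([[1, 2, 3],
                  [4, 5, 6]])

h_flip = image[:, ::-1]  # horizontal flip: mirror left-right
v_flip = image[::-1, :]  # vertical flip: mirror top-bottom


def random_horizontal_flip(img, p, rng):
    """Return the mirrored image with probability p, otherwise the input."""
    return img[:, ::-1] if rng.random() < p else img


rng = random.Random(0)
maybe_flipped = random_horizontal_flip(image, p=0.5, rng=rng)
```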

2.2.4 Cropping

Cropping is not only a basic DA method, but also an important pre-processing step before training when the input data contains image samples of various sizes. Before being fed into the model, training data needs to be cropped into a unified \(x \times y\) dimension for later matrix calculations. As an augmentation technique, cropping has a similar effect to geometric translation. Both augmentation methods remove part of the original image patch, but image translation keeps the same spatial resolution for the input and output image. In contrast, cropping reduces the size of the processed image. As described previously, cropping can be safe or unsafe depending on its associated magnitude value. Aggressive operations might crop out the distinguishable features, affecting label consistency, whereas a reasonable magnitude value helps to improve the quality of the training data.
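A minimal random-crop sketch, assuming NumPy arrays in height × width × channel layout; the function name `random_crop` and the 32 → 24 pixel sizes are illustrative choices only.

```python
import numpy as np


def random_crop(image, out_h, out_w, rng):
    """Cut a random out_h x out_w window from an H x W (x C) image."""
    h, w = image.shape[:2]
    if out_h > h or out_w > w:
        raise ValueError("crop size must not exceed the image size")
    top = rng.integers(0, h - out_h + 1)   # upper bound is exclusive
    left = rng.integers(0, w - out_w + 1)
    return image[top:top + out_h, left:left + out_w]


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
patch = random_crop(image, 24, 24, rng)
print(patch.shape)
```

Unlike translation, the output here is smaller than the input, so every sample ends up with the same unified size regardless of where the window falls.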

2.2.5 Colour adjustment

Adjusting values in colour space is another practical augmentation strategy that has been commonly adopted. By jittering the values of single colour channels, it is possible to quickly obtain different colour representations of an image. These RGB values can also be manipulated through matrix operations to mimic different lighting conditions. Alternatively, self-defined rules on pixel values can be applied to implement transformations such as the Solarize, Equalize and Posterize functions used in the Python Imaging Library (PIL) [ 48 ]. Different from previous DA transformations, colour adjustment preserves the original size and content of input images. However, it might discard some colour information and thus might raise safety issues. For example, if colour is a discriminative feature of an object of interest, then after manipulating the colour values the distinctive colour of the object may be hard to observe, which can confuse the model. The magnitude of the colour transformation is again the determining factor that affects its safety property.
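To illustrate what rule-based colour operations do at the pixel level, the sketch below re-implements Solarize and Posterize in NumPy for 8-bit images. These are simplified stand-ins written for this example (the threshold and bit-depth defaults are assumptions), not the library's own code.

```python
import numpy as np


def solarize(image, threshold=128):
    """Invert every pixel whose value is at or above `threshold` (uint8 input)."""
    return np.where(image < threshold, image, 255 - image).astype(np.uint8)


def posterize(image, bits=4):
    """Keep only the `bits` most significant bits of each 8-bit value."""
    mask = np.uint8((0xFF << (8 - bits)) & 0xFF)
    return image & mask


pixels = np.array([[0, 100, 128, 255]], dtype=np.uint8)
print(solarize(pixels))           # bright values are inverted
print(posterize(pixels, bits=2))  # values collapse onto four levels
```

Both operations keep the image size and layout, but posterizing to very few bits shows how colour information can be discarded when the magnitude is high.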

2.2.6 Kernel filters

Instead of directly changing pixel values in colour space, they can be manipulated via kernel filters. This is a widely used technique in the computer vision field for image processing. A filter is usually a matrix of self-defined numbers, much smaller than the input image. Depending on the element values in the matrix, kernel filters can provide various functionalities. The most common kernel filters include blurring and sharpening. To apply a kernel filter to an input image, we treat it as a sliding window and slide it over the whole image; at each position, the output pixel value is the sum of the element-wise products between the kernel and the image patch beneath it. A Gaussian kernel causes a blurring effect on the filtered image, which can better prepare the model for low-quality images with limited resolution. In contrast, a sharpening filter emphasizes the details in the image, which can help the model gain more information about the key features.
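The sliding-window procedure can be written out explicitly. The following is a deliberately unoptimised sketch assuming a 2-D greyscale image, an odd-sized kernel and zero padding; the 3 × 3 Gaussian approximation and the sharpening kernel are common textbook choices rather than values from the survey.

```python
import numpy as np


def apply_kernel(image, kernel):
    """Slide `kernel` over a 2-D image (zero padding, same output size) and
    sum the element-wise products at each position. For the symmetric
    kernels used here this equals a convolution."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(image, dtype=float)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * kernel)
    return out


gaussian = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16.0  # blur
sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float)

impulse = np.zeros((5, 5))
impulse[2, 2] = 1.0
blurred = apply_kernel(impulse, gaussian)                 # spreads the single bright pixel
flat_sharpened = apply_kernel(np.ones((5, 5)), sharpen)  # flat interior stays unchanged
```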

2.2.7 Cutout

Besides simple transformations, another interesting augmentation technique is Cutout [ 45 ]. Cutout is inspired by the concept of dropout regularisation, but is performed on the input data instead of on embedded units within the neural network [ 49 ]. This algorithm is specifically designed for object occlusion problems. Occlusion happens when some parts of the object are vague or occluded (hidden) by other non-relevant objects, in which case only partial observation of the object is possible. This is a common problem, especially in real-world scenarios. Cutout augmentation combats this by randomly masking out a small patch of the original image to simulate occlusion. Trained on such transformed data, models are forced to learn from the whole picture rather than just a section of it, which enhances their ability to distinguish object features. Another convenient feature of the Cutout algorithm is that it can be applied along with other image augmentation methods, such as geometric or colour transformations, to generate more diverse training data.
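A minimal Cutout-style sketch in NumPy is given below. It assumes a square patch centred on a random pixel and clipped at the image border, with masked pixels set to a constant; the function name and the fill value are illustrative assumptions.

```python
import numpy as np


def cutout(image, size, rng, fill=0):
    """Mask a square patch of side `size` centred on a random pixel.
    The patch is clipped at the border, so it may be smaller than size x size."""
    h, w = image.shape[:2]
    cy = rng.integers(0, h)
    cx = rng.integers(0, w)
    y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = image.copy()  # leave the original sample untouched
    out[y0:y1, x0:x1] = fill
    return out


rng = np.random.default_rng(0)
image = np.ones((32, 32), dtype=np.uint8)
occluded = cutout(image, size=8, rng=rng)
print(int((occluded == 0).sum()), "pixels masked")
```

Because the function returns a copy, it can be chained with geometric or colour operations in any order to build more diverse samples.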

2.2.8 SamplePairing

SamplePairing [ 46 ] is an example of a complex augmentation algorithm that combines several simple transformations. It creates a completely new image by randomly choosing two data samples from the training set and mixing them. In standard SamplePairing, this combination is done by averaging the pixel values of the two samples. The label of the generated image follows the first image and ignores the annotation of the second sample in the input pair. One of the advantages of SamplePairing augmentation is that it can create up to \(N^2\) new data points out of a dataset of size N via simple permutation. SamplePairing is a straightforward augmentation method that generates synthetic data points out of the original data. The enhancement of data quantity and variety significantly improves model performance and avoids model overfitting problems. This technique is especially helpful for computer vision tasks with limited training data.
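The mixing rule described above (pixel-wise average, label taken from the first sample) can be sketched as follows. The helper names `mix_pair` and `sample_pairing` and the toy data are assumptions for illustration.

```python
import numpy as np


def mix_pair(first, second):
    """Pixel-wise average of two equally sized images."""
    return (first.astype(float) + second.astype(float)) / 2.0


def sample_pairing(images, labels, rng):
    """Pick two training samples at random and mix them.
    The new sample keeps the label of the first image only."""
    i, j = rng.integers(0, len(images), size=2)
    return mix_pair(images[i], images[j]), labels[i]


rng = np.random.default_rng(0)
images = np.stack([np.zeros((2, 2)), np.full((2, 2), 10.0), np.full((2, 2), 20.0)])
labels = ["a", "b", "c"]
mixed, label = sample_pairing(images, labels, rng)
n_ordered_pairs = len(images) ** 2  # the N^2 upper bound on distinct pairs
```

Since the pair is ordered (the first image supplies the label), a dataset of N images yields up to N × N labelled combinations.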

2.3 Development of automated data augmentation (AutoDA)

With various image augmentation operations available, the question is how to choose an effective DA policy from these transformations for CV tasks. A naive solution is to apply random augmentations, generating vast amounts of transformed data for training. However, without appropriate control on the type and magnitude of augmentation TFs, the augmented data points might be simple duplicates or even semantically corrupted, which can lead to performance loss. Furthermore, overly augmented data might require excessive computational resources during model training, causing efficiency issues. A systematic selection strategy for a DA policy is therefore needed. A DA policy refers to a composition of various image distortion functions, which can be applied to training data for data augmentation.

Despite extensive research on DA transformations, the selection of a given augmentation policy usually relies on human experts. Especially in the context of CV tasks, the decision on which image operations to use is mainly made by machine learning engineers based on past experience or domain knowledge. Therefore, the optimal strength of a given DA policy is highly task-specific. For example, geometric and colour transformations are commonly used in standard classification datasets, including CIFAR-10/100 [ 27 ] and ImageNet [ 28 ], while resizing and elastic deformations are more popular on digit images such as the MNIST [ 50 , 51 ] and SVHN [ 29 ] datasets. There is no universal agreement on augmentation strategies for all types of CV tasks. In most cases, DA policies need to be manually selected based on prior knowledge. However, human effort involved in deep learning is usually considered biased and error-prone. There is no theoretical evidence that human-decided DA policies are optimal, and it is infeasible to manually search for the optimal DA policy that achieves the best model performance. Additionally, without the help of advanced ML techniques, finding an effective DA policy must rely on empirical results from multiple experiments, which can be excessively time-consuming.

To reduce the potential bias and accelerate the design process, there has been increasing interest in automating the selection of DA policies. This technique is known as Automated Data Augmentation (AutoDA). The development of AutoDA is motivated by the advancements in Neural Architecture Search (NAS) [ 35 ], which automatically searches for the optimal architecture for deep neural networks instead of relying on manual design. The majority of AutoDA techniques rely on different search algorithms to search for the most effective (optimal) augmentation policy for a given dataset. In the context of AutoDA, an optimal DA policy is the augmentation scheme that yields the most performance gain and the highest accuracy score.

The earliest AutoDA work can be traced back to Transformation Adversarial Networks for Data Augmentations (TANDA) [ 52 ] in 2017. This was the first attempt to automatically construct and tune DA policies according to the provided data. The parameterization in TANDA inspired the design of the search space in AutoAugment (AA) [ 20 ], and provided a standard problem formulation for the AutoDA field. AA used Reinforcement Learning (RL) to perform the augmentation search. During the search, augmentation policies were sampled via a Recurrent Neural Network (RNN) controller and then used for model training. Instead of directly searching on the target data, AA created a subset of the original training set as a proxy task. The evaluation of augmentation policies was also conducted on a simplified network instead of the final classification model. Unfortunately, searching in AA requires thousands of GPU hours to complete even under this reduced setting.

With the establishment of the augmentation search space, efficiency problems have become the focus of later AutoDA works. Fast AutoAugment (Fast AA) [ 21 ] is one of the most popular improved versions of the original AA. Instead of an RNN, Fast AA applies Bayesian optimization to sample the next augmentation policy to be evaluated, which greatly reduces the search cost. Additionally, Fast AA is the first to use density matching for policy evaluation, which completely eliminates the need for repeated training. Another approach to improving search efficiency is via parallel computation. Population-Based Augmentation (PBA) [ 23 ] adopts Population-Based Training to optimize the augmentation policy using several subsets of the target data simultaneously. The search goal in PBA is also slightly different from that of previous approaches. PBA aims to find a dynamic schedule during model training, rather than a static policy. Both Fast AA and PBA substantially reduce the complexity of the AA algorithm while maintaining comparable performance. However, there is still an expensive search phase in these models, especially when faced with large datasets or complicated models, which inevitably leads to poor efficiency.

To further enhance the scalability of AutoDA models, techniques such as gradient-based hyper-parameter optimization have been explored recently. Gradient-based AutoDA usually relies on gradient approximators to estimate the gradient of model performance with respect to the augmentation hyper-parameters. This makes the hyper-parameters differentiable, so that they can be optimized along with the model training. Adversarial AutoAugment (AAA) [ 53 ] and Online Hyper-parameter Learning AutoAugment (OHL-AA) [ 54 ] apply the REINFORCE gradient estimator [ 55 ] to achieve gradient approximation. Other gradient estimators are also applicable in AutoDA. For example, DARTS [ 56 ] is employed in Faster AutoAugment (Faster AA) [ 22 ] and Differentiable Automatic Data Augmentation (DADA) [ 57 ]. Using the same policy model as in Fast AA [ 21 ], OHL-AA optimizes augmentation policies in an online fashion during model training. There is no separate stage for searching in OHL-AA. Instead, it adopts a bi-level optimization framework, where the algorithm updates the weights of the classification model and the hyper-parameters of the augmentation policy at the same time. This scheme significantly reduces the search time. Similarly, there are two optimization objectives in AAA, one of which is the minimization of training loss, and the other is the minimization of adversarial loss [ 53 ]. The two objectives are optimized simultaneously in AAA, providing a much more computationally affordable solution.

Even though gradient-based approaches are more efficient in comparison to vanilla AA, these methods are still based on an expensive policy search. The bi-level setting also increases the complexity of the model training stage. Recent advancements in AutoDA aim to further enhance the efficiency of augmentation design by removing the need for search. Proposed in 2020, RandAugment (RA) [ 31 ] reparameterizes the classical search space. It replaces the individual parameters for each transformation with two global variables. A simple grid search is performed in RA to optimize the two hyper-parameters. The findings in RA not only suggest that the traditional search phase may not be necessary, but also indicate that a search using surrogate models could be sub-optimal. The effectiveness of a DA policy was found to depend on the size of the model and the training set, which challenges all previous approaches based on proxy tasks. Another AutoDA model that does not rely on searching is UniformAugment (UA) [ 58 ]. UA further simplifies the augmentation space through invariance theory. The authors hypothesize an approximately invariant augmentation space. Any augmentation policy sampled from that space could lead to similar model performance, thus completely removing the search phase. Despite the promising speed, model performance is a bottleneck in these search-free methods. Neither approach is able to make significant progress on model accuracy when compared with previous approaches. To apply AutoDA techniques in practice, further research and experiments are required.
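The reduced search space of RA can be sketched with two global values, here called `n_ops` (number of operations per image) and `magnitude` (shared strength), tuned by a plain grid search. The operation names, the grid values and the synthetic `toy_evaluate` score are all assumptions for illustration; a real evaluation would train a classifier for each grid point.

```python
import itertools
import random

# Hypothetical operation names; a real space would bind these to image TFs.
OPS = ["rotate", "translate_x", "translate_y", "solarize", "posterize", "cutout"]


def rand_policy(n_ops, magnitude, rng):
    """Sample n_ops operations uniformly at random; all share one magnitude."""
    return [(rng.choice(OPS), magnitude) for _ in range(n_ops)]


def grid_search(evaluate, n_values, m_values):
    """Return the (n_ops, magnitude) pair with the highest score."""
    return max(itertools.product(n_values, m_values), key=lambda nm: evaluate(*nm))


def toy_evaluate(n_ops, magnitude):
    """Synthetic stand-in for validation accuracy; peaks at (2, 9)."""
    return -((n_ops - 2) ** 2) - ((magnitude - 9) ** 2) / 10.0


rng = random.Random(0)
best_n, best_m = grid_search(toy_evaluate, n_values=[1, 2, 3], m_values=[3, 6, 9, 12])
policy = rand_policy(best_n, best_m, rng)
print(best_n, best_m, policy)
```

With only two hyper-parameters the grid stays small (12 candidates here), which is what makes a direct search on the target model and dataset affordable instead of a proxy task.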

3 Automated data augmentation techniques

This section introduces the basic concepts and terminology of Automated Data Augmentation (AutoDA) techniques. In general, finding an optimal DA policy is formulated as a standard search problem in most works [ 20 , 21 , 22 , 23 , 54 , 57 ]. A standard AutoDA model consists of three major components: a search space, a search algorithm and an evaluation function. In this section, the functionalities and relationships of the three components are discussed. We also describe how to assess the proposed AutoDA models, including two different evaluations based on direct and indirect approaches. Lastly, we introduce several commonly used datasets and benchmarks for comparative analysis.

3.1 Key components

For image classification tasks, the major objective of AutoDA models is to achieve the best classification accuracy using an optimal DA policy automatically learned from a given dataset. Inspired by the DA strategy modelling in Transformation Adversarial Networks for Data Augmentations (TANDA) [ 52 ], AutoAugment (AA) [ 20 ] is considered to be the first work that attempted to automate the augmentation policy search. AA formulates the AutoDA task as a search problem, and provides a basic parameterization for the search space. The parameterization in AA is largely adopted in later AutoDA works, and is regarded as the de facto standard [ 21 , 23 , 59 ]. Specifically, there are three key components within a standard AutoDA formulation:

Definition 1

( Search Space ) is regarded as the domain of DA policies to be optimized where all candidate solutions via augmentation hyper-parameters are defined.

Definition 2

( Search Algorithm ) is used to retrieve augmentation policies from the search space and to sample the next search point based on a reward signal returned by an evaluation function.

Definition 3

( Evaluation Function ) is the procedure of assessing or ranking sampled DA policies by assigning reward values. This usually relies on the training of the classification model.

3.1.1 Search space

The search space defines how DA policies are formed for subsequent searches. Specifically for image classification tasks, the augmentation policy refers to a composition of several image operations, which can be described by augmentation hyper-parameters. Generally, a complete augmentation policy consists of multiple sub-policies, each of which is used to augment one training batch. A sub-policy is composed of several basic image Transformation Functions (TFs). An augmentation policy is usually parameterized by two hyper-parameters: the application probability and the operation magnitude. The probability describes the probability of applying a certain transformation function on input images, while the magnitude determines the strength of the operation. Each TF within a DA policy is associated with a pair of probability and magnitude hyper-parameters. Depending on the choice of search or optimization algorithm, the formulation of the search space can vary greatly. For example, some works completely re-parameterize the search space to reduce the search complexity [ 31 , 58 ]. However, the DA policy parameterization proposed in AA [ 20 ] has been widely used in later works with little or no modification.
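The parameterization described above can be sketched in a few lines of Python. This is a minimal illustration, not the AA implementation: the names `Operation`, `apply_sub_policy`, `augment_batch` and the toy `brighten` transformation are hypothetical, images are stood in for by flat lists of pixel values in [0, 1], and one sub-policy is drawn per batch as stated in the text.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

Image = List[float]  # stand-in for real image data (assumption)


@dataclass
class Operation:
    """One transformation function (TF) with its pair of hyper-parameters."""
    tf: Callable[[Image, float], Image]
    probability: float  # chance of applying the TF
    magnitude: float    # strength of the TF


SubPolicy = List[Operation]
Policy = List[SubPolicy]


def apply_sub_policy(image: Image, sub_policy: SubPolicy, rng: random.Random) -> Image:
    """Apply each TF in order, each with its own application probability."""
    for op in sub_policy:
        if rng.random() < op.probability:
            image = op.tf(image, op.magnitude)
    return image


def augment_batch(batch: List[Image], policy: Policy, rng: random.Random) -> List[Image]:
    """Pick one sub-policy at random and use it for the whole batch."""
    sub_policy = rng.choice(policy)
    return [apply_sub_policy(img, sub_policy, rng) for img in batch]


def brighten(image: Image, magnitude: float) -> Image:
    """Toy TF: add `magnitude` to every pixel and clip at 1.0."""
    return [min(1.0, pixel + magnitude) for pixel in image]
```

Under this sketch, the search space is simply the set of all `Policy` values reachable by varying the choice of TFs and their `probability` and `magnitude` fields.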

3.1.2 Evaluation functions

The evaluation of augmentation policies is conducted from two perspectives: effectiveness and safety. The former emphasizes the impact of DA on final classification results, while the latter focuses on the label preservation of the transformed data. Generally, the efficacy of augmentations is judged by the performance of classification models based on training loss or accuracy values. Such procedures can also be called direct evaluation functions, since the strength of the DA strategy is directly reflected in how much performance gain the augmentation policy can produce. The higher the classification accuracy, the better the associated DA policy.

An alternative evaluation is an indirect method that emphasizes the safety of data augmentation. Examining the safety of a DA policy often resorts to the use of density matching [ 21 ]. The main objective of density matching is to match the distribution of the augmented data to that of the original training data. The basic idea behind this indirect evaluation is to treat the transformed images as missing data points of the input data, thereby improving the generalizability of the classification model. Smaller density differences indicate higher similarity of data distributions, which can lead to better augmentation strategies. With density matching, policy evaluation does not require back-propagation through model training. Such algorithms can be regarded as the indirect evaluation function of AutoDA tasks.
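To make the idea of comparing densities concrete, the following toy sketch scores augmented samples by their average log-density under a Gaussian kernel density estimate fitted on the original data. It is an illustrative assumption, not the density matching algorithm of [ 21 ]: the function name `mean_log_density`, the one-dimensional features and the fixed bandwidth are all hypothetical simplifications.

```python
import math
import random


def mean_log_density(reference, samples, bandwidth=0.5):
    """Average log-density of `samples` under a 1-D Gaussian KDE fitted on `reference`."""
    log_norm = math.log(len(reference) * bandwidth * math.sqrt(2 * math.pi))
    total = 0.0
    for x in samples:
        exponents = [-0.5 * ((x - r) / bandwidth) ** 2 for r in reference]
        peak = max(exponents)  # log-sum-exp trick for numerical stability
        total += peak + math.log(sum(math.exp(e - peak) for e in exponents)) - log_norm
    return total / len(samples)


rng = random.Random(0)
original = [rng.gauss(0.0, 1.0) for _ in range(200)]   # stand-in for original features
mild = [x + rng.gauss(0.0, 0.1) for x in original]     # gentle augmentation
strong = [x + 5.0 for x in original]                   # augmentation that leaves the data distribution

score_mild = mean_log_density(original, mild)
score_strong = mean_log_density(original, strong)
```

In this toy setting the mildly perturbed samples receive a higher score than the strongly shifted ones, which mirrors the intuition that a safe policy keeps augmented data close to the original distribution.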

3.2 Overall workflow

figure 2

General workflow of a standard AutoDA model involving three key components

The relationship between these components is depicted in Fig. 2. Firstly, the AutoDA model specifies the parameterization of DA policies for the given task, providing a finite number of potential solutions to be searched and evaluated. Within the defined search space, the search algorithm then samples DA policies and passes the candidates to the evaluation function. In earlier AutoDA works, augmentation policies were sampled one by one [ 20 , 60 ], while later approaches tend to employ multi-threaded processes, sampling multiple candidates and evaluating them in a distributed fashion. This substantially improves the search efficiency. After a DA policy is selected by the search algorithm, it is rated by the evaluation function to compute the reward signal. Each augmentation strategy is associated with a reward value, indicating its effectiveness in improving model performance. Finally, the reward information is used to update the search algorithm, guiding the sampling of the next DA policy to be evaluated. In principle, this iterative search process ends when the optimal policy is found, which can be determined by examining the difference in performance gain between the current search point and the previous point. However, this stopping criterion might lead to excessive searching with little improvement, especially in the later phases, resulting in a waste of resources. In most practical implementations of AutoDA algorithms, the search stops when a self-defined stopping condition is fulfilled, for example after a certain number of search epochs [ 21 , 23 , 61 ].
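The sample, evaluate and update cycle can be sketched as a small loop. The code below is a deliberately simple stand-in: it uses a local random search as the search algorithm and a closed-form `toy_reward` in place of training a classifier, and it stops after a fixed budget. The names `propose`, `search` and `toy_reward` are hypothetical and do not come from any of the surveyed methods.

```python
import random


def clamp(value):
    """Keep a hyper-parameter inside [0, 1]."""
    return min(1.0, max(0.0, value))


def propose(policy, rng, step=0.2):
    """Search algorithm: perturb the best policy found so far."""
    return tuple(clamp(v + rng.uniform(-step, step)) for v in policy)


def search(evaluate, budget, rng):
    """Sample policies, score them, and keep the best one within a fixed budget."""
    best = (rng.random(), rng.random())  # (probability, magnitude) of a single TF
    best_reward = evaluate(best)
    for _ in range(budget):
        candidate = propose(best, rng)
        reward = evaluate(candidate)     # reward signal from the evaluation function
        if reward > best_reward:         # the reward guides the next sampling step
            best, best_reward = candidate, reward
    return best, best_reward


def toy_reward(policy):
    """Stand-in for validation accuracy, peaking at probability 0.7 and magnitude 0.3."""
    probability, magnitude = policy
    return -((probability - 0.7) ** 2 + (magnitude - 0.3) ** 2)
```

A call such as `search(toy_reward, 200, random.Random(0))` returns the best policy seen and its reward. In a real AutoDA system, each call to `evaluate` would involve training or at least scoring a model, which is why the budget dominates the total cost.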

3.3 Two stages of AutoDA

The standard AutoDA pipeline can be divided into two major stages:

Definition 4

( Generation stage ) is the process of generating the optimal augmentation policy when given certain datasets. A DA policy is typically described by a sequence of augmentation hyper-parameters. Usually, the final DA solution is generated by a search or optimization algorithm, which samples various candidate strategies from the defined search space, and relies on an evaluation function to assess efficacy of the searched policies.

Definition 5

( Application stage ) is the process of applying the policy learned in the generation stage. This is done by augmenting the target dataset using the obtained DA policy to artificially increase both the data quantity and variety, and then training the classification model on the transformed training set.

With the aim of finding the best augmentation strategy for the target dataset, a typical AutoDA problem is mainly solved in the policy generation stage. The best DA policy here specifically refers to the hyper-parameter setting that maximizes the classification model accuracy or minimizes the training loss in the later phase, i.e. the setting that best solves the classification task in the application phase. We identify several criteria used in published studies to determine the completion of policy generation:

The sampled DA policy can help train the classification model to achieve the highest accuracy scores,

The sampled DA policy can provide comparable performance gains to the optimal policy, or

A certain number of training/searching epochs has been completed.

Theoretically, the policy generation stage can only end when the first criterion is achieved, i.e. the sampled policy is evaluated to be the optimal augmentation strategy for the given dataset. However, it is often impractical to thoroughly explore the entire search space to identify the best DA policy. A potential solution is to set a specific threshold for model accuracy or training loss to help decide whether the policy is optimal. However, in application scenarios, the optimal strength of data augmentation for classification models is often unknown. It is therefore tricky to set such thresholds as success criteria.

An alternative strategy to stop the generation phase is to relax the optimal criteria. In other words, if the sampled policy can produce comparable improvement in model performance to the optimal DA strategy, it is considered to be optimal. This idea has been widely adopted in many existing AutoDA works [ 21 , 60 , 62 , 63 ]. It can be implemented by using the performance difference. For example, if the difference in performance gains between the sampled policy and the best rewarded one is smaller than a certain value, then this policy can be treated as the final output of the generation stage [ 60 , 63 ]. A more popular alternative is to use density matching. Instead of directly training the classification model, density matching compares the distribution/density of the original data and the augmented samples. The assumption of density matching is that the optimal DA policy can best generalize the classification model by matching the density of given data with the density of the transformed data [ 21 , 62 ].

In practice, the most commonly used stopping criterion is a manually chosen search limit. Once training has run for a certain number of epochs, policy generation is forced to stop and outputs the DA policy with the best model performance so far. The choice of epoch number usually depends on the available computational resources as well as the complexity of the given task. There is no universal agreement on the stopping criteria.
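The relaxed criterion and the fixed budget can be combined in one small check. The function below is a hedged sketch under assumed conventions: `rewards` holds the best reward observed after each search epoch, and the `tolerance` and `patience` parameters are hypothetical knobs, not values taken from the cited works.

```python
def should_stop(rewards, max_epochs, tolerance=1e-3, patience=3):
    """Decide whether policy generation should end.

    rewards: best-so-far reward recorded after each completed search epoch.
    """
    if len(rewards) >= max_epochs:
        return True  # fixed search limit reached
    if len(rewards) > patience:
        recent_gain = rewards[-1] - rewards[-1 - patience]
        return recent_gain < tolerance  # improvement has plateaued
    return False
```

For example, `should_stop([0.5, 0.8, 0.8, 0.8, 0.8], max_epochs=10)` returns `True` because the last three epochs brought no gain, while `should_stop([0.5, 0.6, 0.7, 0.8], max_epochs=10)` returns `False`.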

3.4 Datasets

This section aims at providing a brief overview of the datasets employed in the considered approaches. Annotated datasets are generally used as benchmarks to provide a fair comparison among different AutoDA algorithms and architectures. Furthermore, the growth in size of data and complexity of application scenarios increases the challenge, resulting in constant development of new and improved techniques.

The most used datasets for the task of automated augmentation search are: (i) CIFAR-10/100 [ 27 ], (ii) SVHN [ 29 ], (iii) ImageNet [ 64 ]. CIFAR stands for Canadian Institute for Advanced Research. CIFAR-10 and CIFAR-100 share the CIFAR name, while the number specifies the total number of classes in the dataset. SVHN stands for Street View House Numbers. ImageNet is used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [ 64 ]. The characteristics of each dataset are shown in Table , while their statistics are summarized in Table .

The CIFAR and ImageNet datasets were published in the same year. Both present standard image classification tasks and are commonly used in computer vision research. However, ImageNet is much larger than the CIFAR series in scale and diversity. There are over 5,000 different categories in the original ImageNet set, with 3.2 million images that have been hand-annotated [ 28 ]. For the AutoDA search problem, the enormous quantity of ImageNet data might require significant amounts of computational resources. Training on the complete ImageNet is usually infeasible in practice. Instead, it is often more suitable to use a reduced ImageNet subset for the target task. Additionally, the distribution of instances among different classes can vary considerably, which can decrease the performance of AutoDA models. To address these issues, each AutoDA work that conducts experiments using ImageNet data uses a distinctive trimming method to set up a smaller and cleaner subset for model evaluation. Nevertheless, due to the diversity of data and imbalanced class distribution, classification on an ImageNet subset is still considered a relatively difficult task for augmentation search when compared with other datasets. In AutoDA works, the reduced ImageNet datasets are constructed differently, with varying sizes and class numbers. Each of them will be described in the works where they are employed.

The CIFAR series consists of far fewer categories, as indicated by the number in each name [ 27 ]. The CIFAR-10 dataset consists of 60,000 \(32\times 32\) colour images in total. The data distribution among classes in CIFAR-10 is more controlled and uniform. The 60,000 images are evenly distributed across 10 classes, providing 6000 images per class. The train:test split ratio is 5:1. In the CIFAR-10 dataset, the test batch contains 10,000 images, randomly selected from the full dataset such that each class contributes exactly 1,000 images. The training set contains the remaining 50,000 instances. The structure of the CIFAR-100 dataset is similar to CIFAR-10, except that there are 100 classes in CIFAR-100, each of which comprises 600 images. The train:test ratio is also 5:1, providing 500 training images and 100 test images per class. With a more balanced data distribution and a limited number of classes, CIFAR data is usually more suitable for benchmarking proposed AutoDA algorithms.
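The split sizes quoted above are mutually consistent, which a short check makes explicit. The dictionary layout and the helper name `per_class_counts` are illustrative choices; the numbers themselves are those given in the text.

```python
CIFAR = {
    "CIFAR-10": {"classes": 10, "train": 50_000, "test": 10_000},
    "CIFAR-100": {"classes": 100, "train": 50_000, "test": 10_000},
}


def per_class_counts(spec):
    """Derive per-class image counts and the train:test ratio from the split sizes."""
    total = spec["train"] + spec["test"]
    return {
        "total": total,
        "images_per_class": total // spec["classes"],
        "train_per_class": spec["train"] // spec["classes"],
        "test_per_class": spec["test"] // spec["classes"],
        "train_test_ratio": spec["train"] / spec["test"],
    }
```

Calling `per_class_counts(CIFAR["CIFAR-10"])` gives 6000 images per class (5000 training, 1000 test), and the CIFAR-100 entry gives 600 images per class (500 training, 100 test), both with a 5:1 ratio.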

SVHN refers to Street View House Numbers. It is also collected from real-world scenarios and is widely used in deep learning research. Similar to MNIST data [ 32 ], images in SVHN are digits, cropped from house numbers in Google Street View images [ 29 ]. The major task of SVHN is to recognize numbers in natural scene images. There are 10 categories in total, each of which represents one digit, e.g. digit 1 has label 1. In SVHN, there are 73,257 digits for training, 26,032 digits for testing, and 531,131 additional data items that can be used as extra training data. In contrast to the previous datasets, SVHN specifically emphasizes pictures of digits and numbers. This might reveal the relationship between DA selection strategy and data types. However, unlike CIFAR, the SVHN data distribution among classes is biased. There are more 0 and 1 digits present in the data, resulting in a skewed class distribution in both the training and test sets. As seen in Table 2, for SVHN, the number of images per class ranges from 5000 to 14,000. Such imbalance can be considered a challenge that helps to assess AutoDA models from different perspectives.

3.5 Evaluation metrics

To measure the effectiveness of AutoDA approaches, an intuitive way is to evaluate the performance of the final classification model, i.e., after it has been trained using datasets augmented by AutoDA methods. There are two major steps in AutoDA evaluation. First, the AutoDA model to be tested needs to be applied to the original dataset, in order to generate the optimal DA policy, which is then used to augment the data. After obtaining augmented training data, the target classifier is trained to solve the given classification task. A commonly used evaluation metric for classification networks is accuracy, which is defined as the ratio of the number of correct predictions to the total number of cases examined:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

where T , F stand for true and false respectively and indicate whether a prediction is correct or not, and P , N represent positive and negative results. Accuracy is considered to be a valid classification metric in this study because all datasets used in experiments, e.g. CIFAR-10/100, SVHN and reduced ImageNet, are well balanced without any class skew. The complementary metric to Accuracy is the Error rate , which is the proportion of erroneously classified instances among the total instances:

\[ \text{Error rate} = \frac{FP + FN}{TP + TN + FP + FN} = 1 - \text{Accuracy} \]
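Both metrics follow directly from the four confusion counts. The helper names below are illustrative, and the counts in the usage note are made-up numbers chosen only to show that the two metrics sum to one.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions among all examined cases."""
    return (tp + tn) / (tp + tn + fp + fn)


def error_rate(tp, tn, fp, fn):
    """Fraction of erroneous predictions among all examined cases."""
    return (fp + fn) / (tp + tn + fp + fn)
```

With 40 true positives, 50 true negatives, 5 false positives and 5 false negatives, the accuracy is 0.9 and the error rate is 0.1.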

Another dimension to evaluate AutoDA algorithms is their efficiency. Despite the impressive efficacy of some DA approaches, the total computational cost of DA policy generation can be significant [ 20 ], which hinders their large-scale application to many real world scenarios. A more comprehensive measure of AutoDA models should incorporate both their efficacy and their performance efficiency. Following the tradition of most AutoDA works, we consider the total GPU time (GPU hours) needed for a single run of a given AutoDA algorithm as a key evaluation metric used in this work.

4 Taxonomy of image AutoDA methods

Table  shows a summary of primary works in the AutoDA field. The column Key technique describes the most important technique adopted in each AutoDA model to formulate augmentation search problems. These methods are usually borrowed from other ML-related fields, such as NAS or hyper-parameter optimization. Policy optimizer indicates the algorithm or controller that is used to optimize or update the augmentation policy during the search. These AutoDA approaches are classified into two major types, namely one-stage or two-stage, based on the stages involved in solving the classification task using the learned DA policy. Additionally, from the perspective of hyper-parameter optimization, these methods can be further categorized into three classes: gradient-based, gradient-free and search-free. Table 3 provides a categorization based on the underlying optimization algorithm used by each of the methods.

Based on the application sequence of the two stages of the AutoDA model, we classify all existing works into two major categories: one-stage and two-stage approaches (as shown in Fig.  1 ). Two-stage approaches conduct generation and application separately. In a typical two-stage method, the optimal augmentation policy is first generated from the task dataset. After that, the learned augmentation strategies are applied to the training set to train the discriminative model. One-stage approaches combine generation and application through the use of gradient approximation methods. By estimating the gradient of a DA policy with regard to model performance, one-stage approaches are able to optimize the augmentation policy and the classification model simultaneously. As a result, they can obtain the final results and the trained model in a single run.

4.1 Two-stage methods

There are two steps involved in applying AutoDA to discriminative tasks in the imaging domain. Generally, an AutoDA model searches for the optimal augmentation strategy and then applies the obtained policy to the target data for model training. Due to the separate processes of searching and training, this kind of approach is described as two-stage in this paper. The general framework of a typical two-stage approach is displayed in Fig. 3. In the first stage, given a specific dataset, the search algorithm looks for the best composition of image transformation functions, also known as the DA policy. The generation stage ends once the optimal policy is identified by the evaluation function or the search reaches a given time limit. In the second stage, the learnt policy is applied to the target training set, ideally yielding additional data of increased quantity and targeted variety. The augmented training samples are then fed into the classification model for final training.

figure 3

Overall framework of two-stage AutoDA approaches

The algorithm used to find the best scheme for data augmentation has been explored in a wide range of existing works. We categorize them into three different classes according to the problem formulation. Some works treat augmentation searching as a standard gradient-free optimization problem [ 20 , 21 , 23 , 30 , 52 , 59 , 60 ], whilst other methods approach it from the gradient perspective by means of various gradient approximation algorithms [ 22 ]. The remaining methods re-parameterize the entire AutoDA problem in a way that eliminates the need for searching, the so-called search-free approaches.

4.1.1 Gradient-free

Gradient-free approaches search for the best parameters of the augmentation policy based on model hyper-parameter optimization without gradient approximation. Intuitively, such optimizations can be accomplished by selecting several values for each hyper-parameter, completing a model training run for each combination on the target task, and then comparing the evaluation metrics of the model across all hyper-parameter combinations. The first attempt to automate such a search process was Transformation Adversarial Networks for Data Augmentations (TANDA) [ 52 ], which utilizes a Generative Adversarial Network (GAN) architecture at its core. The objective of the generator is to propose appropriate sequences of arbitrary augmentation operations, which are then sent to a discriminator for effectiveness assessment. The problem formulation in [ 66 ] motivated the deep learning community to explore other methods such as AA [ 20 ]. AA inherits the augmentation sequence modeling in [ 10 ], but applies a different strategy based on Reinforcement Learning (RL). Several possible augmentation policies are sampled via a Recurrent Neural Network (RNN) controller and are then assessed by training a simplified child model instead of the given classification model. Despite its promising performance in terms of model improvement, AA has a non-negligible limitation: its extremely low efficiency. It can take up to 15,000 GPU hours to complete a search over ImageNet data. Even with the smallest CIFAR-10 set, AA still requires thousands of GPU hours to complete a single run.

The majority of the later works on this topic aim to contribute to efficiency enhancements and computational cost reductions. For example, [ 30 ] utilizes a similar reinforcement learning method, but slightly modifies the search procedure by sharing the same augmentation parameters from earlier stages. Such auto-augment techniques can be further improved through the application of advanced evolutionary algorithms such as Population-Based Training (PBT) [ 67 ]. Some simple search algorithms have also been found to accelerate the first stage. For instance, [ 60 , 63 ] replace the original RNN controller with a traditional Greedy Breadth First Search algorithm to simplify the process, and therefore reduce the overall computation cost. In addition to the selection of the search algorithm, modification of the evaluation function can also greatly reduce the computational demands. A landmark work in this direction is Fast AutoAugment (Fast AA) [ 21 ], which takes advantage of the variable kernel density [ 68 ] and proposes an efficient density matching algorithm as a substitute. In the AutoDA context, the data density represents the overall distribution of data. Instead of training a classification model, density matching evaluates DA policies by comparing the distribution of the original training set and the transformed data. Such algorithms eliminate the need for re-training the model and hence result in a significant efficiency boost.

Another approach is to focus on the effectiveness or precision of the learned augmentation policy. All of the aforementioned methods focus on the resources and time consumed by the search phase, and not much progress has been made in terms of improving classification accuracy. To fill this gap, [ 65 ] proposes a more fine-grained Patch AutoAugment (PAA) technique, which optimizes the augmentation transformations targeted at local regions of images rather than the whole image. Other state-of-the-art methods from the Neural Architecture Search (NAS) field also help to increase the augmentation precision. One example is Knowledge Distillation (KD) [ 69 ] as used in [ 59 ].

4.1.2 Gradient-based

In contrast to gradient-free algorithms, approaches that approximate the gradient of the hyper-parameters to be searched are referred to as gradient-based optimizations. So far, the only two-stage approach based on gradients is Faster AutoAugment (Faster AA) [ 22 ]. This achieves a more efficient augmentation search for image classification tasks than prior methods including AutoAugment [ 20 ], Fast AA [ 21 ] and PBA [ 23 ]. The authors of Faster AA adopt a gradient approximation method, namely the Relaxed Bernoulli distribution [ 70 ], to relax the non-differentiable distributions of hyper-parameters and use their gradients as input to a standard optimization algorithm. The two consecutive phases can therefore be done within a single pass. The Faster AA model jointly optimizes the hyper-parameters of the augmentation policy (i.e. generation phase) and the weights of the classification model (i.e. application phase). The simplification of the policy search space significantly reduces the search cost when compared to previous algorithms, whilst maintaining performance. What should be emphasized here is that the model trained during the search in Faster AA is actually abandoned later. To get the final classification result, the learned policy is applied to train the target classification model again. Hence there are still two stages involved in the Faster AA scheme.
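The relaxation idea can be illustrated with the standard binary Concrete (relaxed Bernoulli) sampler: instead of a hard 0/1 decision on whether to apply a transformation, a value between 0 and 1 is produced that varies smoothly with the application probability. The sketch below shows only this sampling step under assumed names (`relaxed_bernoulli`, `temperature`); it is not the Faster AA code, and gradient computation through the sample would be handled by an autodiff framework in practice.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relaxed_bernoulli(p, temperature, rng):
    """Draw a soft 0/1 sample whose value depends smoothly on p.

    Lower temperatures push samples towards 0 or 1; higher ones give softer values.
    """
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
    logistic_noise = math.log(u) - math.log(1.0 - u)
    logit = math.log(p) - math.log(1.0 - p)
    return sigmoid((logit + logistic_noise) / temperature)
```

A sample exceeds 0.5 exactly when the noisy logit is positive, which happens with probability `p`, so thresholding the relaxed samples recovers an ordinary Bernoulli variable.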

4.1.3 Search-free

Despite the advantages of the aforementioned approaches, the added complexity of standard two-stage AutoDA methods might demand prohibitive computing resources, as in the original implementation of AA in [ 20 ]. Subsequent works mainly aim to reduce the search cost [ 21 , 23 ] and utilize gradient approximation [ 22 ]. However, these approaches still require an expensive search stage, which usually relies on a simplified proxy task to alleviate efficiency issues. This setting presumes that the DA policy learned on the proxy task can be transferred to the larger target dataset. However, this assumption is challenged in [ 31 ]. According to the findings in [ 31 ], a proxy task might produce sub-optimal DA policies.

To solve the aforementioned problems, several works aim to re-formulate the search problem in AutoDA. These approaches are known as search-free methods due to the complete exclusion of the search phase. By challenging the optimality of traditional AutoDA methods, search-free approaches re-parameterize the entire search space, resulting in a small number of hyper-parameters, which can be manually adjusted. Therefore, there is no need to conduct the search anymore [ 62 ]. Additionally, it becomes feasible to learn directly from the full target dataset instead of a reduced proxy task. As a result, AutoDA models may learn an augmentation policy that is better tailored to the task of interest.

Existing works such as [ 58 ] and [ 31 ] both belong to the search-free category. Both approaches completely re-parameterize the entire search space so that there is no need to perform resource-intensive searches at all. RandAugment (RA) replaces the enormous search space with a small search space controlled by only two parameters. Both parameters are human-interpretable such that a simple grid search is quite effective. Inspired by RA, UniformAugment (UA) further reduces the complexity of the search space by assuming the approximate invariance of the augmentation space, where uniform sampling is sufficient. Both methods completely avoid a search phase and dramatically increase the efficiency of AutoDA algorithms while maintaining their performance.
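A two-parameter scheme in the spirit of RA can be sketched as follows. The two global parameters are taken here to be the number of transformations applied per image and a shared magnitude; the operation names in `OPS`, the helper names and the `toy_score` function are hypothetical placeholders, and in a real setting each grid point would be scored by training the target model.

```python
import itertools
import random

OPS = ["rotate", "shear_x", "translate_y", "color", "contrast"]  # hypothetical TF names


def rand_augment_policy(n_ops, magnitude, rng, ops=OPS):
    """Sample n_ops transformations uniformly; all share one global magnitude."""
    return [(rng.choice(ops), magnitude) for _ in range(n_ops)]


def grid_search(evaluate, n_values, m_values):
    """Return the (n_ops, magnitude) pair with the highest evaluation score."""
    return max(itertools.product(n_values, m_values), key=lambda nm: evaluate(*nm))


def toy_score(n_ops, magnitude):
    """Stand-in for validation accuracy, peaking at n_ops=2 and magnitude=9."""
    return -((n_ops - 2) ** 2) - (magnitude - 9) ** 2


best_n, best_m = grid_search(toy_score, n_values=[1, 2, 3], m_values=[5, 7, 9, 11])
```

The grid here has only twelve points, which illustrates why a plain grid search becomes practical once the search space is reduced to two interpretable parameters.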

4.2 One-stage methods

The biggest difference between two-stage and one-stage approaches is the joint optimization process in the latter. Previous approaches in the two-stage category mainly rely on an additional surrogate model for policy sampling. They then evaluate the sampled policies via full training on another classification network. The expensive training and evaluation procedure leads to efficiency bottlenecks of AutoDA techniques. To mitigate this issue, one-stage approaches complete the policy generation and application in one single step, eliminating the need for repetitive model training. In standard one-stage schemes, the weights of the classification network and the hyper-parameters of the augmentation policy are optimized simultaneously. This is implemented by a bi-level optimization scheme [ 71 ].

At the inner level, these approaches seek to optimize the weights of the discriminative network, whilst at the outer level they look for hyper-parameters that describe the optimal augmentation policy, under which the best-performing model is obtained as the solution to the inner problem. Due to the dependency between inner and outer level optimization, the learning of these two goals is conducted in an interleaved way. Specifically, a separate augmentation network is adopted to describe the probability distribution of sampled policies. The parameters of such a policy model are regarded as hyper-parameters, which are updated after a given number of epochs of inner training [ 53 , 54 ]. In this bi-level framework, the distribution hyper-parameters and network weights are optimized simultaneously. The minimization of training loss (inner objective) can be easily achieved through classical Stochastic Gradient Descent (SGD), while the vanilla gradient of the outer objective is relatively hard to obtain, as the model accuracy is non-differentiable with regard to augmentation hyper-parameters. Therefore, one-stage AutoDA models need to leverage gradient approximation to estimate such gradients for later optimization. In other words, all one-stage approaches in AutoDA are based on gradients.
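One generic way to write this bi-level problem is given below. The notation is an assumption introduced for illustration and is not taken from the cited works: \(\theta\) denotes the augmentation hyper-parameters, \(w\) the network weights, \(p_{\theta}\) the distribution over policies \(\tau\), and \(\mathcal{L}_{\mathrm{train}}\), \(\mathcal{L}_{\mathrm{val}}\) the training and validation losses. Individual methods differ in the exact outer objective they use.

\[
\begin{aligned}
\theta^{*} &= \arg\min_{\theta}\; \mathcal{L}_{\mathrm{val}}\bigl(w^{*}(\theta)\bigr) \\
\text{s.t.}\quad w^{*}(\theta) &= \arg\min_{w}\; \mathbb{E}_{\tau \sim p_{\theta}}\Bigl[\mathcal{L}_{\mathrm{train}}\bigl(w;\, \tau(\mathcal{D}_{\mathrm{train}})\bigr)\Bigr]
\end{aligned}
\]

The inner problem is solved approximately with a few SGD steps on \(w\), after which \(\theta\) is updated with an estimated gradient of the outer objective, and the two updates alternate.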

4.2.1 Gradient-based

As the name suggests, gradient-based models optimize the augmentation policy from the perspective of gradients. They have to rely on gradient approximation because the original model accuracy is non-differentiable with regard to the augmentation policy distribution. Only after the distribution is relaxed can the gradient of validation accuracy or training loss with regard to hyper-parameters be obtained. There are several advantages of gradient-based approaches. Once the objective is made differentiable, gradient-based methods can directly optimize the hyper-parameters according to the estimated gradient. There is no need to invest a significant amount of time in training child models to test sampled policies. This substantially reduces the workload of policy evaluation. The removal of expensive evaluation procedures also enables the AutoDA algorithm to scale up to even larger datasets and deeper models. The first one-stage AutoDA work based on gradients was Online Hyper-parameter Learning AutoAugment (OHL-AA) [ 54 ] in 2019, based on the REINFORCE gradient estimator [ 55 ]. The augmentation policy model in OHL-AA is similar to previous works [ 20 , 21 ], while the original search problem is reformulated as a bi-level optimization task. Published in the same year, Adversarial AutoAugment (AAA) [ 53 ] employs the same gradient approximator in an adversarial framework, which further eases the efficiency issue. Following advances in NAS techniques, Differentiable Automatic Data Augmentation (DADA) [ 57 ] and Automated Dataset Optimization (AutoDO) [ 62 ], both proposed in 2020, use the more advanced DARTS estimator [ 56 ].
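The REINFORCE estimator can be illustrated on a toy problem in which a policy chooses one of three augmentation operations and receives a fixed reward for each. The sketch below is a generic score-function gradient update, not the OHL-AA or AAA implementation: the reward table, the batch-mean baseline and all function names are illustrative assumptions.

```python
import math
import random

REWARDS = [0.1, 0.9, 0.3]  # stand-in validation accuracy for each candidate operation


def toy_reward(action):
    return REWARDS[action]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, rng, batch_size=64, lr=1.0):
    """One update: grad of E[reward] is estimated as mean((r - b) * grad log pi(a))."""
    probs = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=probs, k=batch_size)
    rewards = [reward_fn(a) for a in actions]
    baseline = sum(rewards) / batch_size  # reduces the variance of the estimate
    grad = [0.0] * len(logits)
    for action, reward in zip(actions, rewards):
        for i in range(len(logits)):
            indicator = 1.0 if i == action else 0.0
            # d log pi(action) / d logit_i = indicator - probs[i]
            grad[i] += (reward - baseline) * (indicator - probs[i]) / batch_size
    return [l + lr * g for l, g in zip(logits, grad)]  # gradient ascent on reward
```

Repeating `reinforce_step` from uniform logits shifts probability mass towards the operation with the highest reward, without ever differentiating through the reward itself. This is the property that makes the estimator usable when model accuracy is non-differentiable with respect to the policy.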

5 Two-stage approaches

In this section, we review two-stage strategies in detail, with focus on the pipeline of the algorithms. We start from the fundamental definition of the augmentation parameters and corresponding search space used in each method. After that, the core algorithms are explored along with their overall workflow. Following that, the major contribution of each method is covered based on experimental results provided in the original paper. Then, we provide a systematic analysis and evaluate the pros and cons of the different two-stage category approaches. Finally, we compare all available two-stage algorithms from the perspective of their accuracy and efficiency, and give suggestions on model selection from a practical application perspective.

5.1 Gradient-free optimization

5.1.1 Transformation Adversarial Networks for Data Augmentations (TANDA)

TANDA is considered to be the earliest work supporting automatic discovery of optimised data augmentation policies. Even though other works aimed at automating data augmentation, most of them focused on either creating innovative augmentation algorithms [ 10 ], or generating synthetic training data based on a given set of starting images [ 24 ]. TANDA, on the other hand, used only the basic image operations based on a user’s specification, and output a sequence of transformation functions as the final augmentation policy. This made it more relevant to many scenarios with diverse data augmentation demands.

An augmentation policy is represented as a sequence of image processing operations in [ 52 ]. Users need to specify a range of augmentation operations for the TANDA model to select from, which are also called Transformation Functions (TFs). In order to support various types of TFs, TANDA regards them as black-box functions, ignoring application details and emphasizing only the final effect of such transformations. For instance, a \(30^{\circ }\) rotation can be achieved with one single TF, or alternatively it can be split into a combination of three \(10^{\circ }\) rotation transformations. The policy modelling in TANDA might not be deterministic or differentiable, but it provides an applicable way of tuning the TF hyper-parameters.
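The rotation example can be checked numerically with plain 2D rotation matrices. The helper names below are illustrative; the check confirms that composing three \(10^{\circ }\) rotations gives the same linear map as a single \(30^{\circ }\) rotation.

```python
import math


def rotation(degrees):
    """2x2 rotation matrix for a counter-clockwise rotation."""
    t = math.radians(degrees)
    return [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]


def compose(a, b):
    """Matrix product a @ b: apply b first, then a."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def close(a, b, tol=1e-9):
    """Element-wise comparison that tolerates floating-point rounding."""
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))


three_small = compose(rotation(10), compose(rotation(10), rotation(10)))
same_effect = close(three_small, rotation(30))
```

Here `same_effect` is `True`, which is why a black-box view of TFs can treat the single rotation and the three-step sequence as interchangeable.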

The major objective of TANDA is to learn a model that can generate augmentation policies composed of a fixed number of TFs. Depending on the types of TFs, the DA policy is modelled in two different ways. The first policy model, namely the straightforward mean field model, assumes each TF in an augmentation policy is selected independently. Therefore, the probability of each operation is optimized individually. Mean field modelling largely reduces the number of learnable hyper-parameters during the search. However, this independent representation can be biased, especially when TFs affect each other. In practical scenarios, a certain image processing operation can lead to totally different effects if applied with other TFs. The actual sequence of TF application also matters when some of the TFs are not commutative. To fully represent the interaction among augmentation TFs, TANDA offers another option to model DA policies, the Long Short-Term Memory (LSTM) network. The LSTM model in TANDA outputs probability distributions over all TFs, which emphasizes the relationship among the searched TFs.
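The mean field idea can be illustrated with a minimal sketch. The TF names and the sampling weights below are hypothetical placeholders, and the sketch assumes one shared categorical distribution for every position in the sequence; it is not the TANDA implementation.

```python
import random

# Hypothetical TF names; TANDA itself treats TFs as user-supplied black boxes.
TF_NAMES = ["rotate_10", "flip_horizontal", "shift_x", "blur"]


def sample_mean_field_policy(tf_probs, length, rng):
    """Draw `length` TFs independently from one categorical distribution."""
    return rng.choices(TF_NAMES, weights=tf_probs, k=length)


rng = random.Random(0)
tf_probs = [0.4, 0.3, 0.2, 0.1]  # assumed: one learnable sampling weight per TF
policy = sample_mean_field_policy(tf_probs, length=3, rng=rng)
print(policy)
```

Because every draw ignores the TFs already chosen, this model cannot express order or interaction effects; a sequence model such as an LSTM would instead condition each draw on the previous ones.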

Fig. 4 TANDA workflow [ 52 ]. Upper/lower sections indicate the policy generation/application stage respectively

The TANDA model applies the standard GAN architecture, consisting of a generator G and a discriminative model D . The general workflow of TANDA is illustrated in Fig. 4. There are two stages involved in TANDA: policy generation and application. The policy generation phase can be viewed as a classical min-max game in GAN. The goal of the sequence generator model G is to sample DA policies that are most likely to fool the discriminator model D , while D tries to distinguish the transformed images from the original data. This is done by assigning reward values to the input data. Ideally, in-distribution data points will get higher values whereas the images generated via augmentation will be assigned lower rewards. The reward information is then used to update G for the next policy sampling. After the search is completed, the final generator is used to augment the original training set to better train the classification network.

TANDA has various advantages. Firstly, the performance improvement achieved by TANDA is convincing. From the experimental results in [ 52 ], TANDA outperforms most contemporaneous heuristic DA approaches. In terms of problem formulation, the LSTM policy model tends to be more effective than the mean field representation in most cases, which empirically encourages sequence modelling in the AutoDA scheme. The proposal of these two policy models is considered to be the most significant contribution of TANDA. The representation of augmentation transformations inspired AutoAugment (AA) [ 20 ], which also utilized the LSTM model for policy prediction. Furthermore, the positive influence resulting from sequential modelling provides empirical support for the later Population-Based Augmentation (PBA) [ 23 ], which outputs application schedules rather than a fixed policy. The use of unlabelled data is also a favorable characteristic, especially for tasks with limited data. Additionally, a trained TANDA model shows a certain degree of robustness against TF mis-specification. In TANDA, there is no limitation on the selection of the TF range and no safety requirement on the available transformations, so it is much easier for users to apply in practice. More importantly, TANDA is open-source and can be adapted and applied to any task with limited datasets, not only in the imaging domain but also for text data.

5.1.2 AutoAugment (AA)

AutoAugment (AA) [ 20 ] is one of the most popular AutoDA approaches. The majority of subsequent works in this field [ 21 , 22 , 23 ] adopt a setup similar to AA, especially the definition of the search space and policy model. The AA algorithm itself does not provide an optimal solution to the policy search problem due to its severe efficiency issues. However, as the authors of AA emphasize, the fundamental contribution of AA lies in the automated approach to DA and the development of the search space, rather than the search strategy.

AA formulates the automation of DA policy design as a discrete search problem. In AA, an augmentation policy is a composition of 5 sub-policies, each of which is applied to one training batch. One sub-policy consists of two sequential transformation functions, such as geometric translation, flipping or colour distortion. Each TF in an augmentation policy is described by two hyper-parameters, i.e. the probability of applying this transformation and the magnitude of the application. Inspired by TANDA, the application sequence of these TFs is emphasized. For simplification, the range of probability and magnitude is discrete. The probability is evenly discretized into 11 values, ranging from 0 to 1, whilst the magnitude is selected from the positive integers from 1 to 10. The 14 operations implemented in AA are all from the standard Python Imaging Library (PIL). Two additional augmentation techniques, Cutout [ 45 ] and SamplePairing [ 46 ], are also considered due to their effectiveness in classification tasks. Overall, there are 16 distinct TFs in AA’s search space. With 5 sub-policies of two TFs each, a policy contains 10 TF slots, so finding an augmentation policy via AA has \((16\times 10\times 11)^{10}\approx 2.9\times 10^{32}\) possibilities.
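The size of this search space can be checked with a few lines of arithmetic. The sketch below only restates the counts given above (16 TFs, 11 probability values, 10 magnitude values, 5 sub-policies of 2 TFs); the constant names are illustrative.

```python
NUM_TFS = 16                                  # 14 PIL operations + Cutout + SamplePairing
PROBABILITIES = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
MAGNITUDES = list(range(1, 11))               # 1, 2, ..., 10
TFS_PER_SUB_POLICY = 2
SUB_POLICIES_PER_POLICY = 5

# Each TF slot independently picks an operation, a magnitude and a probability.
choices_per_tf = NUM_TFS * len(MAGNITUDES) * len(PROBABILITIES)
tf_slots = TFS_PER_SUB_POLICY * SUB_POLICIES_PER_POLICY
search_space_size = choices_per_tf ** tf_slots

print(choices_per_tf, tf_slots)
print(f"{search_space_size:.2e}")
```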

Fig. 5 AutoAugment workflow [ 20 ]. Upper/lower sections indicate the policy generation/application stage respectively

To automate the process of constructing a DA policy, AA has to search over an enormous search space. With the aforementioned formulation, this becomes a discrete search problem. At a high level, the workflow of AA is displayed in Fig. 5. One of the key components in the AutoDA model is the search algorithm. In the search phase, the search algorithm is used to generate an augmentation policy, which is then evaluated for updates. AA chooses a simple Recurrent Neural Network (RNN) as its search algorithm/controller to sample policy P . The evaluation procedure is done through model training, but using reduced data and a simplified model. Such a model is also called a child model, due to its similar but much simpler architecture when compared to the final classification network. After testing the trained child model on a validation set, the validation accuracy is regarded as reward R to update the search controller. Generally, the reward signal R reflects how effective a policy P is in improving the performance of a child model. The training of the child model has to be done multiple times, because R is not differentiable over policy hyper-parameters, i.e. probability and magnitude.

Through extensive experiments, AA achieves excellent results. It can be directly applied on the target data and achieves competitive model accuracy. Experiments in [ 20 ] report state-of-the-art results for common datasets, including CIFAR-10/100, ImageNet and SVHN. AA not only shows superiority in terms of DA policy design, but also provides the option of transferring the searched policy to other similar data. For example, the augmentation policy learned on CIFAR-10 can function well on the similar CIFAR-100 data. There is no need to conduct an expensive search on the latter, as the policies discovered by AA generalize over multiple models and datasets. This is a viable alternative especially when direct search is unaffordable. Another advantage of AA is its simple structure and procedure. The search phase is actually conducted over a subset of data, using a simplified child model. Those simplifications provide direct evaluation of augmentation policies, without recourse to any complicated approximation algorithms. More importantly, AA standardizes the modelling of the augmentation policy and search space in the AutoDA field. The policy model it designs has been widely acknowledged as the de facto solution.

However, AA has serious disadvantages. The choice of algorithms in AA can be substantially improved. It applies Reinforcement Learning as the search algorithm, but this selection is made mainly out of convenience. The authors of AA also indicate that other search algorithms, such as genetic programming [ 72 ] or even random search [ 73 , 74 ], may further improve the final performance. Furthermore, the reduced dataset and simplified model used during the search phase can result in sub-optimal results. According to [ 31 ], the power of an augmentation policy largely depends on the size of model and dataset. Therefore, simplification in AA is likely to introduce bias into the found policy. Additionally, the final policy is formed by a simple concatenation of the 5 best policies found in the data batch. The application schedule of these policies is not considered in AA. The greatest shortcoming of AA lies in its efficiency. Evaluation of augmentation policies relies on expensive model training. Due to the stochasticity of DA policies introduced by the probability hyper-parameter, such training has to be conducted for a certain number of epochs till the policy starts to take effect. In most cases, running AA is extremely resource-intensive, which raises timing and cost issues. This also becomes the major challenge for AutoDA tasks and promotes multiple later methods aiming at efficiency improvement.

5.1.3 Augmentation-wise weight sharing (AWS)

A major reason for the inefficiency of AA is the repeated training process during policy evaluation. To enhance the efficiency of evaluation, some methods [ 23 , 53 , 54 ] sacrifice reliability to some extent. On the contrary, Augmentation-wise Weight Sharing (AWS) designs a proxy task based on the weight sharing concept in NAS, proposing a faster but still accurate evaluation process. The augmentation policies found by AWS also achieve competitive accuracy when compared to other AutoDA methods.

Fig. 6 Augmentation-wise Weight Sharing workflow [ 30 ]. Upper/lower sections indicate the policy generation/application stage respectively

Inspired by the idea of early stopping, the authors of AWS hypothesize that the benefit of DA is mainly shown in the later phase of training. This assumption is supported through empirical observations in [ 30 ]. Motivated by this observation, AWS proposes a new proxy task to test the sampled policies. In this proxy task, the original search stage is split into two parts. The AWS pipeline is displayed in Fig. 6. In the early stage, the child model is trained using a fixed augmentation strategy, i.e. a shared policy. During this phase, the search controller will not sample policies or be updated. Only the child model used for policy evaluation will be trained for a certain number of epochs. The network weights obtained after the first part of training will be shared and reused during the later evaluation, so that the AWS model does not need to repeat the full training for each of the sampled policies. The major challenge here is to select a representative shared policy for the initial stage. According to the findings in [ 30 ], simple uniform sampling can work for most tasks.

In the second part of searching, AWS samples augmentation policies via a controller, and updates the model according to an associated accuracy reward. The reward information is obtained from the shared model, instead of an untrained child model. Therefore, in order to evaluate the sampled policies, it is only necessary to resume training for a few epochs using these policies. Since the training of the child model in AWS is divided into two parts based on the different DA policies utilized, AWS is an augmentation-wise algorithm. The idea of weight sharing originates from NAS, where training from scratch is prohibitively expensive. This scheme substantially accelerates the overall evaluation procedure. The design of proxy tasks in AWS is flexible, so it can be combined with other search algorithms. Standard AWS follows a similar setting to the original AA [ 20 ] applying Reinforcement Learning (RL) techniques.
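The two-part proxy task can be sketched with toy stand-ins. Everything below is hypothetical: `train_epochs` and `validation_accuracy` only mimic training and evaluation, the "strength" field is an invented score, and the epoch counts are arbitrary. The point is the control flow, in which every candidate policy resumes from the same shared checkpoint instead of training from scratch.

```python
import copy
import random


def train_epochs(state, policy, epochs):
    """Toy stand-in for training: record which policy was used in each epoch."""
    new_state = copy.deepcopy(state)  # never mutate the shared checkpoint
    new_state["history"].extend([policy["name"]] * epochs)
    new_state["recent_strength"] = policy["strength"]
    return new_state


def validation_accuracy(state):
    """Toy reward that depends only on the policy used in the final epochs."""
    return state["recent_strength"]


rng = random.Random(0)
shared_policy = {"name": "uniform", "strength": 0.5}

# Part 1: train once with the shared policy; the controller stays idle.
shared_state = train_epochs({"history": [], "recent_strength": 0.0},
                            shared_policy, epochs=8)

# Part 2: each sampled policy fine-tunes the shared weights for a few epochs.
candidates = [{"name": f"p{i}", "strength": rng.random()} for i in range(4)]
rewards = {}
for candidate in candidates:
    tuned = train_epochs(shared_state, candidate, epochs=2)
    rewards[candidate["name"]] = validation_accuracy(tuned)

best = max(rewards, key=rewards.get)
print(best, rewards[best])
```

In a real system the rewards would feed back into the controller; here they are only collected to show that the expensive first part is paid once.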

The major contribution of AWS is its insight into when data augmentation takes effect during training. The empirical conclusion in [ 30 ] is that DA policies mainly improve the model in the late training phase. This observation underpins the greatest innovation in AWS, its unique augmentation-wise proxy task that substitutes the traditional evaluation procedure. By sharing the policy at the early phase of searching, the child model only needs to be pre-trained once. The selection of the shared augmentation policy in the first part of searching is done via uniform sampling on the search space. The network weights are then re-used in the later stage to evaluate each of the sampled augmentation policies. There is no need to conduct child model training from scratch thousands of times. Compared to the original AA [ 20 ], it is much more efficient to obtain reward signals in AWS through the use of weight-sharing strategies. These efficiency gains give AWS the potential to scale to even larger datasets. Moreover, according to [ 30 ], the evaluation process in AWS is still reliable. This is because in the second part of searching, the child model will be fine-tuned by DA policies to reflect the strength of each of the policies.

However, the disadvantages of AWS cannot be ignored. Overall, there are excessive simplifications in AWS, aimed at increasing search efficiency. For example, sampled policies are evaluated by child models on the reduced data, and the early stage of training is substituted by shared model weights. Such settings might lead to sub-optimal results. According to the findings in [ 31 ], the final policy output by the AWS model may be tailored more to the proxy task than to the target dataset. In terms of the search algorithm, AWS utilizes the same RL framework as AA, bringing little improvement, especially when compared with methods such as Fast AA [ 21 ] and PBA [ 23 ]. Lastly, AWS is not open-source, which makes it less accessible for users.

5.1.4 Greedy AutoAugment (GAA)

To improve the search efficiency, Greedy AutoAugment (GAA) [ 60 , 63 ] adopts a completely different algorithm. The GAA model applies a greedy search algorithm to exponentially reduce its complexity when sampling the next policy to be searched. From the experiments conducted in [ 60 ], the TFs learned by GAA are able to further enhance the generalization ability of the classification model. Moreover, the greedy idea in GAA can be a reliable complement to other search approaches in AutoDA tasks.

The policy model of GAA follows a setup similar to that of AA [ 20 ]. A complete augmentation policy is comprised of k sub-policies, each of which contains two consecutive TFs. Each TF is described by two essential hyper-parameters: probability and magnitude. The values of these two parameters are modeled following the same discretization as described earlier. There are 11 values for the probability parameter, ranging from 0 to 1 with uniform spacing, whilst the discrete values for magnitude are the positive integers from 1 to 10. However, GAA employs a wider range of augmentation transformations. There are 20 available image transformation functions in GAA that can be selected to form the DA policy, including 4 extra operations compared to the original AA. Assuming each augmentation policy contains L image operations, where L is a positive integer, the size of the search space is \((20\times 11\times 10)^L\) . In this setup, the search space grows exponentially with L , which can make the search infeasible for larger L values.

To tackle this problem, the search space in GAA is re-formulated into a much simpler setup. It has been argued in [ 60 , 63 ] that using separate probability parameters for each TF may not be necessary, especially when dealing with small amounts of data. Instead, all images in the original training set should be fully augmented till the enhancement of model performance becomes obvious. Therefore, GAA completely discards the probability hyper-parameter of each TF and adopts a constant value of 1 as the application probability of all available augmentation functions. Rather than optimising the probability hyper-parameter for each TF, GAA simply selects the TF that can give the best accuracy results. This is quite different from other AutoDA methods in how an augmentation policy is formed. By doing so, GAA successfully reduces the search space to \((20\times 10)^L\) .

However, even with the reduced search space, the space still grows exponentially. GAA employs tree-based Breadth-First Search (BFS) to tackle this problem. In a standard BFS pipeline, the next search point is sampled in a greedy way, which means GAA will choose the TF that can bring the most performance gain to the child model. GAA only needs to evaluate the best image operation at each epoch, rather than all available TFs in the augmentation space. The greedy BFS changes the exponential growth of the search space from \((20\times 10)^L\) to a linear one. The size of the final search space in GAA is \(20 \times 10 \times k\) , where k is the number of sub-policies within one DA policy. By default, k is set to 5 in GAA for a fair comparison with AA.

The search process of GAA is conducted as follows. Firstly, it goes through all possible TFs and their magnitude values, while the probabilities are all set to 1. Then, each operation is scored with its respective accuracy value obtained from the training of the child model. As discussed before, searching in GAA is conducted in a greedy manner. Accordingly, only the TF with the highest score/accuracy will be stored. It is then concatenated to the next operation. This search procedure will be repeated k times to find the top k best sub-policies. Each of these selected sub-policies will be concatenated with all previously learned TFs to form the final policy of GAA.
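The greedy procedure can be sketched as follows. This is a simplified illustration, not the GAA implementation: the TF names are placeholders, `toy_score` is an invented stand-in for child-model accuracy, and each greedy step adds a single (TF, magnitude) pair. It shows why the number of evaluations grows linearly, as \(20 \times 10 \times k\), instead of exponentially.

```python
import itertools

TFS = [f"tf{i}" for i in range(20)]  # placeholder names for the 20 TFs
MAGNITUDES = range(1, 11)            # probability is fixed to 1, so it is omitted


def greedy_search(score_fn, steps):
    """At each step, keep only the (TF, magnitude) pair that scores best."""
    policy = []
    evaluations = 0
    for _ in range(steps):
        best_step, best_score = None, float("-inf")
        for step in itertools.product(TFS, MAGNITUDES):
            evaluations += 1
            score = score_fn(policy + [step])
            if score > best_score:
                best_step, best_score = step, score
        policy.append(best_step)  # earlier choices are never revisited
    return policy, evaluations


def toy_score(policy):
    """Hypothetical score: prefers magnitude 5 and distinct TFs."""
    distinct_tfs = len({tf for tf, _ in policy})
    return distinct_tfs - sum(abs(mag - 5) for _, mag in policy)


policy, evaluations = greedy_search(toy_score, steps=5)
exhaustive = (len(TFS) * len(MAGNITUDES)) ** 5
print(policy)
print(evaluations, exhaustive)
```

Because earlier choices are frozen, the search only ever looks one step ahead, which is exactly the source of the local-optimum risk in a greedy scheme.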

The efficiency of GAA is substantially improved without a performance drop. From the experiments in [ 60 , 63 ], GAA requires 360 times less computational cost than the original AA [ 20 ] while still maintaining comparable model accuracy. Though the improvement in search speed is appealing, GAA has several limitations. The most significant disadvantage of GAA comes from its simplification of the search space. In GAA, the augmentation policy is formed by selecting TFs one after another, solely based on the performance of the selected operation. This setting assumes that each TF is independent even though they might affect each other in practice. As discussed in [ 52 ], certain image operations might have quite different results depending on which other TFs are applied together. This conclusion is also supported by experiments in [ 23 ]. However, due to the greedy nature of its BFS, GAA only looks one step forward during the search, and hence can easily be trapped in a local maximum that results in sub-optimal solutions. In addition, most of the hyper-parameters in GAA are selected manually, and there is a lack of empirical evidence supporting these decisions.

5.1.5 Population-based augmentation (PBA)

Population-Based Augmentation (PBA) [ 23 ] is one of the most widely accepted AutoDA approaches. Unlike GAA, which models DA policies as independent TFs, PBA emphasizes the relationship between them. In PBA, the standard augmentation policy search problem is treated as a special hyper-parameter optimization, where the schedule of these parameters is stressed. The schedule here refers to the application sequence of TFs. The augmentation policy is not a fixed setting. Instead, it changes as training progresses. To accommodate additional sequential information, PBA leverages the Population-Based Training (PBT) [ 67 ] technique. This algorithm optimizes the hyper-parameters and the network weights simultaneously to achieve optimal performance. The final output of PBA is not a fixed configuration, but an application schedule of selected TFs when training the classification model. However, due to efficiency considerations, searching in PBA is still conducted on a simplified child model instead of the target network. Like other AutoDA methods, PBA discards the trained child model after the search. The searched DA schedule is then adapted to the target training set to help train more complicated models.

In order to directly compare with AA, PBA retains similar settings as much as possible. The same augmentation TFs as in AA are available in PBA except SamplePairing [ 46 ], giving 15 TFs. The discretization of policy hyper-parameters follows the same formulation, allowing 10 values for magnitude and 11 for probability. In PBA, a sub-policy still consists of two TFs, which are applied on one of the training batches. The policy modelling in PBA is motivated by the need for fair comparison with AA, rather than achieving optimal performance. Since the order of augmentation TFs in a policy matters, PBA has an enormous search space even when compared with AA. For a single augmentation function, there are \((10\times 11)^{15\times 2} \approx 1.75\times 10^{61}\) possibilities, much more than \((16\times 10\times 11)^{10} \approx 2.9\times 10^{32}\) in AA [ 20 ].

Despite having a larger search space, PBA demonstrates that searching for a schedule is considerably more efficient than searching for a fixed policy. This is due to several factors. In traditional AutoDA methods such as AA, evaluating the sampled policies is extremely time-consuming. Such a process needs to be conducted via a full training of a child model, because data augmentation techniques primarily take effect in the later stage of model training. In order to estimate the effectiveness of a fixed policy, the child model has to be trained for a certain number of epochs till the model can actually benefit from the policy. However, the situation is totally different when testing a policy with an application schedule. If two newly sampled policies share the same prefix TFs, the evaluation algorithm can reuse the prior training weights for the evaluation of both policies. This is similar to the weight-sharing idea in AWS [ 30 ] but it is more reliable. Moreover, it is also argued in [ 23 ] that DA can provide better accuracy results when utilizing schedule information. Different types of augmentation TFs may be appropriate for different epochs during model training. It is therefore natural to choose the most suitable augmentation functions according to the training stage.

Fig. 7 Population-Based Augmentation workflow [ 23 ]. Upper/lower sections indicate the policy generation/application stage respectively

The basic workflow of PBA is displayed in Fig. 7. To start, several child models, i.e. the population, are initialized and trained concurrently. Each child model is responsible for evaluating a single augmentation policy. After 3 epochs of training, PBA runs one epoch of Gradient Descent (GD) by testing all child models on the validation dataset. The validation accuracy estimates the performance of the policy used to train each child model. After obtaining the performance rank of all child models and corresponding DA policies, PBA employs a classical “exploit-and-explore” procedure to update policies, where the lower ranked models copy the parameters of the higher ranked ones. Specifically, PBA uses Truncation Selection [ 67 ] for exploitation. For exploration, PBA randomly perturbs the parameter values during sampling and exploiting. PBA does not re-initialize the child model from scratch, which greatly reduces the computational cost.
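One exploit-and-explore step can be sketched as below. The details are assumptions for illustration only: the 25% truncation fraction, the policy encoding as (probability index, magnitude index) pairs, and the ±1 perturbation rule are hypothetical choices, not values taken from [ 23 ].

```python
import copy
import random

PROB_LEVELS = 11  # assumed probability indices 0..10
MAG_LEVELS = 10   # assumed magnitude indices 0..9


def perturb(policy, rng):
    """Explore: nudge each (probability, magnitude) index by -1, 0 or +1."""
    new_policy = {}
    for tf, (prob, mag) in policy.items():
        prob = min(PROB_LEVELS - 1, max(0, prob + rng.choice([-1, 0, 1])))
        mag = min(MAG_LEVELS - 1, max(0, mag + rng.choice([-1, 0, 1])))
        new_policy[tf] = (prob, mag)
    return new_policy


def exploit_and_explore(population, rng, fraction=0.25):
    """Truncation selection: the worst members clone one of the best members."""
    ranked = sorted(population, key=lambda m: m["val_acc"], reverse=True)
    cut = max(1, int(len(ranked) * fraction))
    top, bottom = ranked[:cut], ranked[-cut:]
    for loser in bottom:
        winner = rng.choice(top)
        loser["weights"] = copy.deepcopy(winner["weights"])  # exploit
        loser["policy"] = perturb(winner["policy"], rng)      # explore
    return population


rng = random.Random(0)
population = [
    {"name": f"child{i}", "weights": [float(i)], "val_acc": acc,
     "policy": {"rotate": (5, 5), "cutout": (5, 5)}}
    for i, acc in enumerate([0.90, 0.80, 0.70, 0.10])
]
exploit_and_explore(population, rng)
print(population[3]["weights"], population[3]["policy"])
```

The cloned member continues training from the copied weights rather than from scratch, which is where the cost saving comes from.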

The improvement in search efficiency brought by PBA is substantial. It is approximately 1,000 times faster than AA while still preserving similar accuracy. This is mainly due to the joint optimization of the child model and policy hyper-parameters. Even though the trained model is discarded after the search, re-using the network weights without repetitive training requires much less computational resources. The output of PBA is an application schedule for the augmentation policy. This can be represented as an augmentation function f ( t ), with training epoch t as a variable. The final output of PBA reports moderate probability values for all operations. This may be due to the random perturbation when updating the policies. The magnitude values of all augmentation TFs also share a pattern. In the early phase of training, the increase in magnitude values is rapid. As training progresses, all magnitudes reach a stable state. Explaining this phenomenon, the authors of PBA argue that an effective evaluation procedure should be conducted for at least a certain number of epochs till the DA policies fully function on the model [ 23 ]. Moreover, the experimental results in [ 23 ] also suggest that simple TFs might be more suitable in the initial stage, while more complicated DA operations can be applied later.

5.1.6 Fast AutoAugment (Fast AA)

Besides PBA, another widely accepted AutoDA approach is Fast AutoAugment (Fast AA) [ 21 ]. This method is motivated by Bayesian Data Augmentation [ 24 ] to solve the AutoDA problem. However, the objective of the search phase in Fast AA is no longer focused on the highest model accuracy. Instead, Fast AA treats the augmented images as data points from the original data distribution, which can best enhance the generalization ability of the model to be trained. Therefore, in Fast AA, the modified optimization objective focuses on minimizing the distribution distance between the original data and new data generated by DA policies. This is realized by adopting a density matching algorithm. This algorithm operates by matching the density of the original and the generated data, and hence completely eliminates the need for re-training the child model.

Similar to the aforementioned methods, Fast AA employs the same problem formulation as in AA [ 20 ], but uses continuous values instead. The two hyper-parameters of each augmentation TF (probability and magnitude) therefore have far more possible values in Fast AA. However, in contrast to PBA [ 23 ], the application of Fast AA policies is random and without any sequential order. The final policy of Fast AA is a combination of 5 sub-policies. Each sub-policy is a conjunction of 2 TFs, following the same experimental setting as AA. The authors of Fast AA emphasize that the number of sub-policies can be further tuned during the search [ 21 ] because of the efficiency of the approach.

Fig. 8 Fast AutoAugment workflow [ 21 ]. Upper/lower sections indicate the policy generation/application stage respectively

Fast AA optimizes the augmentation policy during the process of density matching between training data pairs. Figure 8 displays the general workflow of Fast AA. The target data is divided into two sets, namely the training data \(D_{train}\) and validation data \(D_{valid}\) . At a high level, the goal of the search algorithm is to find a DA policy P that can best match the density of \(D_{train}\) and \(P(D_{valid})\) , the validation data augmented by P . Unfortunately, it is infeasible to directly compare data distributions for each of the sampled policies. Fast AA tackles this problem by approximating the data distribution via model predictions. Specifically, if the same model can achieve equally promising results over two datasets, it is reasonable to consider that the two sets might share a similar distribution.

In detail, Fast AA first splits the full data into several folds, each of which is assigned to a classification model M and processed in parallel. Every data fold consists of two separate datasets \(D_{train}\) and \(D_{valid}\) . Training data is used to pre-train the model M , while the validation data will be retained for the subsequent policy search. After pre-training, the Fast AA model starts to sample augmentation policies via classical Bayesian Optimization [ 24 ]. These policies are then evaluated based on the performance of the trained model M on augmented data \(P(D_{valid})\) . Specifically, the policy that can generate data to maximize the performance of M is desired. During the evaluation process in Fast AA, there is no training step involved at all. As such, it is a much more efficient approach than other training-based evaluation methods.
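The training-free evaluation step can be sketched with a toy example. All names and data below are hypothetical: the "pre-trained model" is a fixed threshold classifier, the policies are simple numeric transforms, and the candidates are enumerated exhaustively instead of being proposed by Bayesian Optimization. The sketch only shows that a policy is scored by the frozen model's accuracy on \(P(D_{valid})\), with no gradient step.

```python
def accuracy(model, data):
    """Fraction of (x, y) pairs that the model classifies correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


def evaluate_policy(model, policy, d_valid):
    """Score a policy by the frozen model's accuracy on augmented validation data."""
    augmented = [(policy(x), y) for x, y in d_valid]  # labels are kept unchanged
    return accuracy(model, augmented)


def pretrained_model(x):
    """Stand-in for a model M already trained on D_train: positive vs. negative."""
    return int(x > 0)


d_valid = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

# Hypothetical candidate policies of increasing aggressiveness.
candidate_policies = {
    "shift_0.5": lambda x: x + 0.5,
    "shift_1.5": lambda x: x + 1.5,
    "negate": lambda x: -x,
}

scores = {name: evaluate_policy(pretrained_model, policy, d_valid)
          for name, policy in candidate_policies.items()}
best_policy = max(scores, key=scores.get)
print(scores, best_policy)
```

A policy that pushes samples across the decision boundary lowers the score, which is how the evaluation penalises augmented data that drifts away from the original distribution.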

The experimental results in [ 21 ] show that this evaluation setup leads to superior speed and provides accuracy competitive with AA on various data. Fast AA achieves promising results not only for direct policy search, but also for the transfer of learned policies to new data. The evidence presented in [ 21 ] suggests that the transferability of Fast AA far exceeds that of the original AA. This is because Fast AA conducts the search directly on full datasets using the target classification models, which minimizes the sub-optimality of learned policies. Therefore, Fast AA also has the potential to achieve better results when given more complex tasks. Thanks to the efficiency of the Fast AA model, the total number of sub-policies contained in an augmentation policy can exceed the original setting. The experiments in [ 21 ] show an improved generalization performance with more sub-policies searched by Fast AA. Moreover, it is also possible for Fast AA to search augmentation policies for each class. By doing so, Fast AA also obtains a slightly improved performance.

The biggest advantage of Fast AA lies in its efficiency. After introducing density matching into the evaluation process, Fast AA does not need to train the child model at all. As a result, Fast AA achieves a significantly faster search speed than the vanilla AA [ 20 ]. Moreover, contrary to all previous AutoDA approaches [ 20 , 23 , 30 ], Fast AA does not necessarily simplify the given task by searching on a proxy task. This avoids possible sub-optimality, and hence guarantees the competitive performance of Fast AA. Another benefit of Fast AA is its continuous search space instead of discrete values, which finds more candidate policies during the search and potentially enhances the final result. Furthermore, the operations in Fast AA can be conducted in parallel, making it more practical to implement for ordinary users.

5.1.7 AutoAugment with knowledge distillation (AA-KD)

Although AA-based algorithms have been a powerful augmentation method for many classification tasks, they are often sensitive to the choice of TFs. An inappropriate DA policy may even deteriorate the model performance. In traditional AutoDA methods [ 20 , 21 , 23 ], usually there exists a trade-off between data diversity and label safety. Aggressive TFs can give more diverse training data that may better generalize the model, but they can potentially corrupt annotation information. In contrast, mild augmentation operations preserve the original image label, while the diversity of augmented data might be constrained and only lead to limited performance improvements.

This has been investigated in AutoAugment with Knowledge Distillation (AA-KD) [ 59 ]. Examples in [ 59 ] show that some aggressive TFs can remove important semantics from the training data, which results in augment ambiguity. Augment ambiguity happens when the original label of a given image is no longer the most suitable annotation for the augmented data. Such a phenomenon might confuse the classification model and result in performance drops. AA-KD approaches this problem by utilizing a Knowledge Distillation (KD) technique, which provides label correction after augmentation.

The authors of AA-KD point out that aggressive transformations can still be helpful for model training, provided that the associated labels are adjusted accordingly. Using the original labels of all transformed images is not the best option. Therefore, each of the data samples should be treated differently based on the different effects of an augmentation operation. AA-KD leverages the idea of Knowledge Distillation to obtain data that is as diverse as possible, while still providing accurate annotations. In AA-KD, a teacher model is used to generate complementary information for label corrections. Each transformed image is described by the original label as well as the teacher signal. The latter is also called a soft label, as the ground-truth label is slightly softened by the associated teacher signal. Such soft labels are then used during the training of student models in the policy application phase. The teacher-student framework in AA-KD is mainly designed to filter out biased or noisy annotations resulting from DA to further enhance the later model training.
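One simple way to picture such a softened label is a convex combination of the one-hot ground truth and the teacher's predicted distribution. This mixing rule and the weight `alpha` are assumptions made for illustration; the exact formulation in [ 59 ] may differ.

```python
def soft_label(true_class, teacher_probs, alpha=0.3):
    """Blend the one-hot label with the teacher distribution (assumed linear mix)."""
    one_hot = [1.0 if i == true_class else 0.0 for i in range(len(teacher_probs))]
    return [(1 - alpha) * h + alpha * p for h, p in zip(one_hot, teacher_probs)]


# After an aggressive TF, the hypothetical teacher believes class 1 is now more likely.
teacher_probs = [0.2, 0.7, 0.1]
target = soft_label(true_class=0, teacher_probs=teacher_probs, alpha=0.3)
print(target)
```

The resulting target still sums to one and keeps most mass on the original class, while shifting some probability towards what the teacher sees in the augmented image.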

The major contribution of KD-AA is to recognise that KD is a useful technique to enhance traditional AutoDA methods. It is a complement to augmentation search algorithms rather than a complete AutoDA model. The effectiveness of KD-AA is demonstrated via experiments using AA [ 20 ] and RandAugment [ 31 ] in [ 59 ]. With much larger magnitude values, AutoDA supported by KD produces consistent improvements in model accuracy. The authors of KD-AA also report the possibility of employing semi-supervised learning techniques with KD in AutoDA [ 75 , 76 , 77 ], where the input data is mostly unlabelled.

5.1.8 Patch AutoAugment (PAA)

Generally, AutoDA methods search for the augmentation policies at the image level. The same DA policy is used to transform the entire image. However, depending on the content within various regions of an image, the optimal TF may be different. Treating an image as a whole might ignore the differences between its internal regions, which constrains the diversity of augmented data [ 78 ]. Moreover, overly aggressive augmentation functions may modify or remove semantic features, causing safety concerns in terms of label preservation. To address these problems, a more fine-grained approach, Patch AutoAugment (PAA) [ 65 ], has been proposed. PAA considers an image as a combination of several patches, and optimizes DA policies at the patch level. To fully represent the inner relationship between patches, PAA formulates AutoDA tasks as a Multi-Agent Reinforcement Learning (MARL) problem, where each agent handles a single patch and the agents update the final DA policies cooperatively.

Similar to AA, the problem modelling in PAA follows the basic Reinforcement Learning (RL) framework. In a standard RL framework, given a current state, the agent/controller samples a DA policy and receives the corresponding reward signal from the child model training. Then the PAA model updates the policy according to the reward and moves to the next state. The optimization objective of an agent in PAA is to maximize the reward in order to find the best performing DA policy. Moreover, PAA employs the idea of MARL [ 79 ], using multiple agents to search over a single image. Each agent handles one sub-region of the original image, and all agents share a global reward to cooperatively update the next augmentation strategy.

In PAA, the augmentation policy model consists of a global state, local observations and actions. Firstly, an input image is divided into several non-overlapping patches of equal size. Each patch is controlled by one policy agent for the policy optimization. To capture the contextual relationship with other patches, a global state is shared among all agents, representing the semantics of the whole image. The global state is obtained by extracting the deep features of the entire input image through a standard CNN model. In PAA, the ResNet-18 network [ 4 ] pre-trained on ImageNet data [ 28 ] is used. In addition to the global state, each agent also utilizes local information, i.e. observations, to update its own DA policy. Local observations are also described by the deep features of the associated patches. Unlike global information, this information is unavailable to other agents. During the search, each agent updates its policy based on both global and local information. The action of the policy model in PAA represents standard TF techniques, controlled by the probability and magnitude. There are 15 operations defined in PAA. Similar to PBA [ 23 ], PAA outputs an application schedule of TFs instead of fixed policies. However, since the search in PAA is conducted on a child model for a limited number of epochs, such a schedule needs to be linearly scaled up when applied to target networks in the second stage.
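The patch-level view described above can be sketched as follows. This is a minimal illustration under stated assumptions: the image is a plain 2-D list of pixel values, the grid size is chosen by the caller, and the per-patch operations (`identity`, `flip_horizontal`) are stand-ins for the actions that the agents would select. None of the names come from [ 65 ], and the agents, states and rewards are omitted.

```python
def split_into_patches(image, grid):
    """Split a 2-D image (list of rows) into grid x grid non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image size must be divisible by the grid size")
    ph, pw = height // grid, width // grid
    patches = []
    for gy in range(grid):
        for gx in range(grid):
            rows = image[gy * ph:(gy + 1) * ph]
            patches.append([row[gx * pw:(gx + 1) * pw] for row in rows])
    return patches


def merge_patches(patches, grid):
    """Inverse of split_into_patches: stitch patches back into one image."""
    rows = []
    for gy in range(grid):
        band = patches[gy * grid:(gy + 1) * grid]
        for r in range(len(band[0])):
            rows.append([pixel for patch in band for pixel in patch[r]])
    return rows


def augment_patches(image, grid, ops):
    """Apply one (hypothetical) operation per patch, then reassemble."""
    patches = split_into_patches(image, grid)
    if len(ops) != len(patches):
        raise ValueError("need exactly one operation per patch")
    return merge_patches([op(p) for op, p in zip(ops, patches)], grid)


def identity(patch):
    return patch


def flip_horizontal(patch):
    return [row[::-1] for row in patch]


# 4x4 toy image, 2x2 grid: only the top-left patch is transformed.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
augmented = augment_patches(image, 2, [flip_horizontal, identity, identity, identity])
```

In PAA itself, the operation chosen for each patch would come from the corresponding agent rather than from a fixed list.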

The main strength of PAA is its patch-based policy search. Through the use of Grad-CAM [ 80 ], the importance of each region in the image is clearly shown in [ 65 ]. This further shows that the optimal augmentation strategy for each patch can vary depending on its semantics. It is therefore reasonable to perform the augmentation search at a more fine-grained level. Different patches might prefer totally different TFs during training. For example, regions that contain important features may prefer mild TFs, which can better preserve the semantics. However, for less important patches that do not include objects of interest, aggressive operations with larger magnitude values might provide a higher level of variety in the augmented data. Overall, a fine-grained PAA can not only provide sufficient variety for the proposed policies, but also further enhance model performance.

5.2 Gradient-based optimization

5.2.1 Faster AutoAugment (Faster AA)

From the perspective of hyper-parameter optimization, all of the aforementioned two-stage AutoDA methods do not directly optimize augmentation policies via gradients. This is because the augmentation operations are usually not differentiable with respect to the hyper-parameters of policy probability and magnitude [ 22 ]. As a result, it is often tricky to obtain the gradient information of validation accuracy with regard to policy hyper-parameters [ 54 ]. However, several works have proposed to approximate the hyper-parameters as probability distributions, and relax such distributions in a way that a DA policy can be optimized based on gradients [ 22 , 54 , 57 , 62 ]. Faster AutoAugment (Faster AA) [ 22 ] is the only two-stage approach that is based on gradient optimization.

Faster AA employs a policy model similar to that of previous works [ 20 , 21 , 23 ]. In Faster AA, a DA policy consists of several sub-policies, where a single sub-policy contains 2 consecutive augmentation TFs. Each TF is described by two hyper-parameters: the probability and the magnitude. Operations used in Faster AA are the same 16 image TFs from the original AA work, including 14 basic image transformation functions implemented in the PIL library and 2 extra augmentation algorithms, i.e. Cutout [ 45 ] and SamplePairing [ 46 ]. Since a DA policy is often non-differentiable with respect to its hyper-parameters, traditional AutoDA models have to conduct a full training of a child model to evaluate a policy. Even after the discretization of policy hyper-parameters, this formulation still requires exorbitant computational resources [ 20 , 23 , 30 ]. To address this challenge, Faster AA approximates the original search space in a differentiable setting and directly optimizes the policy via gradients, which significantly reduces the search cost.

Inspired by the bi-level optimization in OHL-AA [ 54 ], Faster AA adopts a differentiable framework for DA policy search, which substantially accelerates the search. The key modification in Faster AA is the approximation of the policy gradient via a straight-through estimator [ 81 , 82 ] inspired by DARTS [ 55 ]. The success of DARTS in the NAS field makes it suitable for AutoDA tasks. In Faster AA, the distributions of augmentation hyper-parameters are approximated as Relaxed Bernoulli distributions [ 70 ]. After the relaxation of the search space, each TF within an augmentation policy becomes differentiable with regard to the probability and magnitude hyper-parameters [ 81 , 82 ]. This makes it easier to calculate their gradients in Faster AA. Using gradient estimation techniques, Faster AA can directly optimise DA policies based on gradients. This makes DA policy optimization end-to-end differentiable, and thus provides much more control over the entire process, especially when compared to previous black-box optimizations such as Reinforcement Learning in AA [ 20 ].
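To make the relaxation above more concrete, the following sketch draws a sample from a Relaxed Bernoulli distribution using the standard logistic reparameterization, and uses it as a soft gate between the original and the transformed input. It is an illustrative sketch rather than the implementation in [ 22 ]: the function names, the scalar stand-in for an image, and the convex mixing in `soft_apply` are all assumptions.

```python
import math
import random


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relaxed_bernoulli(p, temperature, rng):
    """Draw a value in [0, 1] that approaches a Bernoulli(p) sample as
    the temperature goes to zero (logistic reparameterization)."""
    if not 0.0 < p < 1.0 or temperature <= 0.0:
        raise ValueError("need 0 < p < 1 and temperature > 0")
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
    logit = math.log(p) - math.log(1.0 - p) + math.log(u) - math.log(1.0 - u)
    return _sigmoid(logit / temperature)


def soft_apply(x, op, magnitude, p, temperature, rng):
    """Assumed mixing rule: blend the transformed and the original input
    with a relaxed gate, so the output varies smoothly with p."""
    gate = relaxed_bernoulli(p, temperature, rng)
    return gate * op(x, magnitude) + (1.0 - gate) * x
```

Because the gate is a smooth function of `p`, an automatic differentiation framework could propagate gradients through it, which is the property that the relaxation is meant to provide. The sketch uses plain floats and therefore only illustrates the sampling step.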

Fig. 9 Faster AutoAugment workflow [ 22 ]. The upper and lower sections indicate the policy generation and application stages, respectively

To further reduce the search cost, inspired by Fast AA [ 21 ], Faster AA also applies a density matching technique during policy evaluation. The objective of optimization is to minimize the distance between the distributions of the original and the augmented data. Faster AA also employs an adversarial framework to help the policy sampling. Due to the use of Density Matching, the overall workflow of Faster AA is similar to Fast AA, as displayed in Fig. 9. However, Faster AA uses a reduced dataset D as its input to further improve the search efficiency. Input data D is first split into two sets: a training set \(D_M\) to prepare the evaluation model and an augmentation set \(D_A\) for policy searching. The pre-trained model M is then used to estimate the performance of sampled DA policies P in the first stage of Faster AA. The Faster AA model examines the search space in a gradient-based manner to identify the optimal DA policies. After searching, the final policy is applied to augment the full dataset and train the classification model M . During both the searching and training phases, Faster AA trains the same target model M to provide a more reliable evaluation.

Faster AA is the first two-stage AutoDA model that resorts to gradient approximation to achieve faster searches than other state-of-the-art algorithms, such as Fast AA [ 21 ] and PBA [ 23 ]. Importantly, it introduces straight-through estimators [ 81 ] to approximate the gradient of the non-differentiable AutoDA task. By relaxing the original distributions of policy hyper-parameters, Faster AA can directly back-propagate through the augmentation process and optimize DA policies based on the gradient. The black-box optimization used in traditional two-stage AutoDA models is therefore transformed into a more transparent and controlled process that significantly improves the search speed. In addition, Faster AA follows the density matching idea of Fast AA [ 21 ], completely removing the repetitive training of the model in the first stage. Instead, policy evaluation is conducted by minimizing the distribution distance between the original and the augmented data. As a result, Faster AA can substantially reduce the required computational resources compared to AA [ 20 ]. The experimental results in [ 22 ] show the competitive performance of Faster AA compared to other AutoDA methods, in terms of both search efficiency and final accuracy.

5.3 Search-free

5.3.1 RandAugment (RA)

RandAugment (RA) [ 31 ] is the first search-free scheme in the AutoDA field. To reduce the cost of the search phase, the parameter space in RA is significantly smaller, defined by only two hyper-parameters. This reduced space allows RA to learn DA policies directly from the full dataset without resorting to a separate proxy task. In fact, the simplification of policy hyper-parameters in RA is so dramatic, that a simple grid search is sufficient to output an effective DA policy. Therefore, the policy generation stage in RA is quite different from the classical search scheme in other approaches, where the latter usually involves selective sampling and expensive evaluation. According to [ 31 ], it is possible to apply more advanced sampling methods instead of a naive grid search, which may further reduce the computational cost. Therefore, RA can be recognised as a search-free AutoDA model. Additionally, RA is also able to optimize DA policies based on different sizes of classification models and training data. The experimental results in [ 31 ] also show that RA can produce competitive accuracy result when compared with other search-based AutoDA approaches.

The primary goal of RA is to reduce the complexity caused by the separate policy search stage in the earlier two-stage AutoDA methods. To do so, RA eliminates the need for expensive policy searches by greatly simplifying the search problem. The entire search phase of traditional AutoDA methods is removed in RA out of efficiency considerations. This is because most of the computational workload comes from the first stage, when the model repetitively samples DA policies and evaluates them. It is also a complicated bi-level optimization problem to conduct the policy search and network training simultaneously. Another downside of prior search-based methods lies in the proxy task used during searching. In the proxy task, AutoDA models search on a reduced sub-set of the original training data, and evaluate sampled augmentation policies using a simpler network. Both simplifications are applied in order to decrease the search cost. A major premise of this framework is that such a proxy task can reflect some core features of the target task, so that the final policy is also the optimal augmentation scheme for the full training data. While the DA policy found through the proxy task is able to produce promising performance [ 20 , 21 , 23 , 26 ], it is likely to be a sub-optimal result [ 31 ]. According to [ 31 ], the optimal strength of an augmentation policy depends on the size of both the training set and the network. Therefore, searching for DA policies on a proxy task can only produce results suited to the proxy task rather than the target task, which leads to sub-optimal solutions.

To avoid sub-optimal results, AutoDA models need to directly search for DA policies over the full training set. However, this is usually computationally infeasible in practice as the traditional search space in AA [ 20 ] is extremely large. In order to mitigate such efficiency issues, RA substantially reduces the number of hyper-parameters to optimise. In RA, the reduction in the size of the search space is achieved in two ways: the simplification of the existing formulation and the proposal of new parameters. In prior methods [ 20 , 21 , 23 ], each TF in an augmentation policy is controlled by two hyper-parameters, probability and magnitude. In RA, by contrast, all image operations are selected with uniform probability, which depends entirely on the total number of available TFs in the search space. For instance, given K different TFs in RA, the probability of applying each operation is \(\frac{1}{K}\) .

To further reduce the search space, RA simplifies the magnitude hyper-parameter as well. The value range of the magnitude hyper-parameter follows the same setting as in the original AA [ 20 ], with 11 discrete values in total, ranging from 0 to 10. In previous AutoDA models, the scale of each transformation function is specified by its own magnitude. However, after examining changes in each operation magnitude during searching [ 23 ], the authors of RA point out that all magnitude values follow a similar schedule over time. Therefore, RA postulates that it may be sufficient to use a shared magnitude hyper-parameter M for all TFs. As a result, in RA, all image operations within DA policies share the same probability and magnitude hyper-parameters, which significantly reduces the parameter space. Besides reformulating the search space, RA also proposes a new free parameter to improve the performance gain, namely the number of TFs N within one augmentation policy. N is predominantly decided manually in most popular AutoDA methods [ 20 , 21 , 23 ] due to limited computational resources. In RA, however, automating the search for N becomes feasible because of the extremely reduced parameter space. Optimizing the TF number N can eliminate human bias and further improve performance.

After the re-parameterization of parameter space, there are only two hyper-parameters to optimize in RA: the number of TFs N to form a complete augmentation policy, and the global magnitude value M to control all TFs. Both hyper-parameters can be easily interpreted by humans so that larger values of N and M indicate more aggressive augmentation strategies, while smaller values represent more conservative schemes. After RA has reformulated the entire search problem, various advanced algorithms can be applied to perform standard hyper-parameter optimization [ 83 ]. However, since the final search space in RA is extremely small, the authors of RA suggest that a simple grid search can yield sufficient performance gains, which is supported by experiment [ 31 ].
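The two-parameter formulation above lends itself to a short sketch: N operations are drawn uniformly from the K available TFs, every operation uses the same magnitude M, and a plain grid search selects N and M. The operation names, the helper names and the `toy_evaluate` function are illustrative assumptions; in practice, the evaluation would be the validation accuracy of a model trained with the sampled policies.

```python
import random


def sample_rand_augment(op_names, n, m, rng):
    """Pick N operations uniformly (probability 1/K each); all share magnitude M."""
    return [(rng.choice(op_names), m) for _ in range(n)]


def grid_search(evaluate, n_values, m_values):
    """Return (best_score, N, M) over the full grid; evaluate is user-supplied."""
    best = None
    for n in n_values:
        for m in m_values:
            score = evaluate(n, m)
            if best is None or score > best[0]:
                best = (score, n, m)
    return best


OPS = ["rotate", "shear_x", "solarize", "posterize"]  # illustrative names only


def toy_evaluate(n, m):
    """Stand-in for validation accuracy; peaks at N=2, M=9 by construction."""
    return 1.0 - 0.01 * (n - 2) ** 2 - 0.001 * (m - 9) ** 2


# Magnitudes follow the 11 discrete values (0 to 10) mentioned in the text.
best_score, best_n, best_m = grid_search(toy_evaluate, range(1, 4), range(0, 11))
policy = sample_rand_augment(OPS, best_n, best_m, random.Random(0))
```

The grid here has only 3 x 11 = 33 points, which illustrates why an exhaustive search remains affordable once the space is reduced to N and M.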

RA makes several noteworthy contributions to the AutoDA task. Via re-parameterization of the standard AutoDA problem, RA employs a reduced search space, which is controlled by only two hyper-parameters, N and M . N indicates how many TFs are contained within a single DA policy, and M refers to the uniform distortion parameter for all image operations. In RA, the optimization of N and M is achieved by naive grid search. This feature allows RA to easily scale to larger datasets and deeper models without significantly increasing the search cost. Moreover, RA shows promising performance on various datasets, matching or even outperforming previous AutoDA models including AA [ 20 ], Fast AA [ 21 ] and PBA [ 23 ]. This finding demonstrates the limitations of prior approaches based on proxy tasks. The experimental results in [ 31 ] are also in agreement with this finding, showing that the optimal DA policy depends on the size of the training data and the discriminative network. Transferring a DA policy learned from a simplified proxy task can lead to performance degradation. After removing the expensive search phase, RA avoids the sub-optimality of learned DA policies through direct searches on the target dataset and classification model. Finally, the results in [ 31 ] reveal the relationship between the augmentation policy and the size of the dataset and model. Most existing AutoDA methods optimize DA policies using reduced data and smaller models to accelerate the search [ 20 , 21 , 23 ]; however, this leads to sub-optimal performance. In practical applications, searching a full dataset can be computationally infeasible. Therefore, the findings in [ 31 ] have stimulated future innovations aimed at balancing effectiveness and efficiency.

5.3.2 UniformAugment (UA)

UniformAugment (UA) [ 58 ] is another search-free method that also significantly reduces the parameter space. Unlike RA which employs a grid search to tune its augmentation parameters, UA completely eliminates the need for hyper-parameter optimization. Instead, UA restricts the range of values over which the policy hyper-parameters can be sampled, so that all DA policies falling into this range can preserve the original label of most of the data. Such a range is defined as an approximately invariant augmentation space in UA. A simple uniform sampling from the invariant space can produce effective DA policies, and eventually lead to sufficient performance gains. As a result, UA greatly surpasses all existing AutoDA models in terms of efficiency. The efficacy of UA is also demonstrated by extensive experiments [ 31 ]. Using the same 15 augmentation TFs that are implemented in AA [ 20 ] and other approaches [ 21 , 23 , 31 ], UA achieves comparable improvements in model accuracy. Furthermore, due to the removal of the search phase, UA is by far the most scalable AutoDA method, and can be easily applied to different tasks in the real world.

The key concept in UA is the introduction of an invariant augmentation space. In [ 58 ], an approximately invariant space is defined as a selected value range for the policy hyper-parameters. Each DA policy sampled from such a space is able to retain the representative features of the original data after transformation. In other words, most of the augmented data remain within the distribution of the original training set, without any change of label information. From the perspective of Group Theory [ 84 ], when given such an invariant augmentation space, further optimizing policy hyper-parameters within this space can only yield limited performance gains, and is therefore unnecessary in practice. In that case, a naive random sampling approach might also lead to effective strategies, thus avoiding expensive computing cost. The experiments in [ 58 ] demonstrate the promising performance of UA, supporting the invariance assumption. However, UA is based on the premise that an invariant augmentation space is already known. In UA, the invariant range is actually decided manually, based on empirical evidence from prior works [ 20 , 21 , 23 , 31 ].

Similar to RA [ 31 ], UA also explores the influence of two hyper-parameters M and N , where M refers to the operation magnitude and N is the total number of TFs in a given policy. According to empirical evidence and theoretical analysis [ 28 , 36 ], a good augmentation policy should be approximately invariant in order to generate in-distribution data, while being able to maximize data variety at the same time [ 58 ]. Usually, the generalizability of a classification network can be improved if the model is trained on more diverse data. This assumption emphasizes the importance of the value range for the magnitude M . Constraining M within a narrow range can result in limited data diversity, whereas sampling policies from a wide M range may produce overly aggressive TFs, which can remove original label information. Both results are considered sub-optimal, which suggests that there is a trade-off between diversity and correctness of learned DA policies.

A similar trade-off can be found in the experiments on the hyper-parameter N . Usually, optimizing N is impractical in most prior AutoDA methods due to limited computational resources. However, it is possible to examine various N values in search-free models such as RA [ 31 ] and UA. From the results in [ 58 ], a smaller N value usually indicates safer augmentation policies with fewer TFs applied to the image data. The transformed data tend to be less diverse. In contrast, a larger N value might impose stronger DA operations on the training data and hence risks corrupting the original labels. In order to obtain the optimal strength of data augmentation, AutoDA models need to balance effectiveness against safety when choosing N . After systematic experiments on N values in [ 58 ], N in UA is set to 2. This is sufficient to effectively improve model performance, while maintaining the same data distribution after augmentation. Moreover, \(N = 2\) is also in line with the original proposal of AA [ 20 ].
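The sampling scheme discussed above can be sketched in a few lines. The sketch assumes that, for each of the N = 2 operations, the operation, its application probability and its magnitude are all drawn uniformly, and that the normalized range [0, 1) stands in for the approximately invariant space. The helper names and the toy scalar operations are illustrative assumptions; in practice, each TF would map the normalized magnitude to its own predefined range.

```python
import random


def sample_uniform_policy(op_names, rng, n=2):
    """Draw N (operation, probability, magnitude) triples uniformly at random.

    No search is involved: every draw is assumed to fall inside the
    approximately invariant space, here normalized to [0, 1).
    """
    return [(rng.choice(op_names), rng.random(), rng.random()) for _ in range(n)]


def apply_policy(x, policy, ops, rng):
    """Apply each sampled operation with its own sampled probability."""
    for name, prob, magnitude in policy:
        if rng.random() < prob:
            x = ops[name](x, magnitude)
    return x


# Toy operations acting on a scalar that stands in for an image.
OPS = {
    "brighten": lambda x, mag: x + mag,
    "dim": lambda x, mag: x - mag,
}

rng = random.Random(0)
policy = sample_uniform_policy(sorted(OPS), rng)
augmented = apply_policy(0.5, policy, OPS, rng)
```

A fresh policy would typically be sampled for every training image, so the only design decisions left are the operation set, the value ranges and N.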

The contribution of UA has been revolutionary in the field of AutoDA. It not only proposes an effective automated DA scheme, but also substantially surpasses all existing approaches in terms of efficiency. More importantly, the hypothesis of augmentation invariance in [ 58 ] challenges the central premise of the AutoDA field. Most prior AutoDA models are motivated by the need for automatically searching for optimal augmentation hyper-parameters on given datasets, replacing biased and sub-optimal manual design. However, the necessity of such searching is questioned in [ 58 ]. Authors of UA propose the definition of an invariant augmentation space, in such a way that optimising DA hyper-parameters within that space is not necessary. This assumption is theoretically supported by group theory in [ 84 ]. The comparable performance of UA also provides empirical evidence for the validity of the invariance hypothesis in data augmentation.

However, how to decide an approximately invariant space for augmentation policy remains an open question. The efficacy of UA is mainly based on the application of domain knowledge, adapted from previous works [ 20 , 21 , 23 , 31 ]. However, in real-world scenarios, a given task and domain might be unknown. In addition, the pre-defined space of UA is not guaranteed to be invariant, due to the lack of theoretical support for its selection strategy. Even though UA yields positive results in empirical research, it is very likely that the final performance of a classification model can be further improved through the use of a more reliable policy space. According to [ 31 ], it is important to develop a systematic methodology to determine an invariant policy space when given a specific dataset. Once such a range is decided, it is no longer necessary to perform an expensive search for DA hyper-parameters. Any augmentation strategy sampled from an invariant space should be effective enough for the given task to produce promising performance gains.

6 One-stage approaches

After pioneering AutoDA works such as AutoAugment (AA) [ 20 ], it is intuitive to approach AutoDA problems from a two-stage perspective. In the first stage, policy generation, AutoDA models generate the optimal DA policy for a given dataset. In the second stage, the learned policy is applied to the training set for model training. The optimization of policy hyper-parameters and that of network weights are performed in strict order. However, the separate generation stage in two-stage approaches results in additional computational complexity, which is also the major reason for their efficiency issues. For example, the original AA [ 20 ] requires thousands of GPU hours to learn an augmentation policy that outperforms the baseline.

To improve the efficiency of AutoDA models, one idea is to perform policy generation and application simultaneously, thus forgoing the extra computation of two separate stages. Despite the inefficiency of two-stage methods, it is inherently impractical to directly merge the two stages and optimize the DA policy along with the classification model. This is because tuning policy hyper-parameters based on model performance is not a differentiable optimization problem. In other words, the gradients of augmentation hyper-parameters can be neither directly calculated nor optimized, thus precluding the possibility of joint optimization. However, with advances in gradient approximation techniques from the Hyper-parameter Optimization (HPO) field, it is feasible to relax the original distribution and estimate policy gradients, allowing for one-stage AutoDA models.

This section reviews existing one-stage AutoDA methods including Online Hyper-parameter Learning AutoAugment (OHL-AA) [ 54 ], Adversarial AutoAugment (AAA) [ 53 ], Differentiable Automatic Data Augmentation (DADA) [ 57 ] and Automated Dataset Optimization (AutoDO) [ 62 ]. All of these approaches are based on gradient approximation through the use of differentiable frameworks. We discuss the formulation of optimization problem in each method. In particular, we focus on gradient approximation, which is the core technique in one-stage models. Lastly, we discuss the main contributions and limitations of each method.

6.1 Gradient-based optimization

6.1.1 Online Hyper-parameter Learning AutoAugment (OHL-AA)

Before the proposal of Online Hyper-parameter Learning AutoAugment (OHL-AA) [ 54 ], the majority of the works in automated DA optimization followed the basic two-stage procedure [ 20 , 21 , 23 ]. Despite promising performance, most two-stage methods have serious bottlenecks in search time and cost. The authors of OHL-AA argue that the major cause of efficiency issues is the offline search, i.e. policy searching is performed independently of the final model training. In contrast, the OHL-AA model applies an online scheme, where the policy hyper-parameters and network weights are optimized jointly in a single pass. In OHL-AA, policy hyper-parameters are formulated as probability distributions. Additionally, the gradients of the DA policy are approximated via the REINFORCE estimator [ 55 ], which allows for direct optimization during model training. The final outputs of OHL-AA include not only the optimal DA policy for the given task, but also the completely trained classification network. By combining the searching and training stages, the OHL-AA model reduces the additional computational costs resulting from two stages, while still maintaining comparable performance.

Following the problem formulation in AA [ 20 ], OHL-AA also approaches AutoDA problem from the perspective of hyper-parameter optimization, but using a different optimization framework. Specifically, the DA policy in OHL-AA is sampled from a parameterized probability distribution, whose parameters are regarded as DA hyper-parameters that are optimized along with network weights. The joint optimization is achieved via a bi-level framework [ 71 ]. There are two layers of optimization in this bi-level setting. The inner objective is the training of classification model, while the outer objective is the optimization of the policy hyper-parameters through the use of REINFORCE gradient approximator [ 55 ]. Due to this bi-level optimization, OHL-AA is also acknowledged as an online AutoDA model, where the DA policy is updated together with the classification network. By optimizing the DA policy and task model simultaneously, OHL-AA completely discards the searching on small proxy tasks. Unlike previous two-stage methods [ 20 , 23 ], policy optimization in OHL-AA no longer requires thousands of evaluations, e.g. training of surrogate models. OHL-AA can directly optimize the DA policy through classical gradient descent algorithm, which substantially improves the search efficiency.
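The bi-level setting described above can be written compactly as follows. The notation is an assumption introduced here for illustration and may differ from that of [ 54 ]: \(\theta\) denotes the parameters of the policy distribution \(p_{\theta}\), \(\tau\) a DA policy sampled from it, \(w\) the network weights, \(\mathcal{L}_{\mathrm{train}}\) the training loss and \(\mathrm{Acc}_{\mathrm{val}}\) the validation accuracy.

```latex
\[
\theta^{*} = \arg\max_{\theta}\;
  \mathbb{E}_{\tau \sim p_{\theta}}
  \Big[ \mathrm{Acc}_{\mathrm{val}}\big( w^{*}(\tau) \big) \Big]
\quad \text{s.t.} \quad
w^{*}(\tau) = \arg\min_{w}\; \mathcal{L}_{\mathrm{train}}(w;\, \tau)
\]
```

The inner problem corresponds to training the classification model under a sampled policy, and the outer problem to updating the policy distribution based on validation accuracy.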

Fig. 10 Online Hyper-parameter Learning AutoAugment (OHL-AA) workflow [ 54 ]

Inspired by [ 85 ], policy hyper-parameters in OHL-AA are updated in a forward manner. Specifically, the network weights after a certain number I of epochs of inner optimization are forwarded to the outer optimization for an update at the outer level. In other words, the inner objective is updated for I steps between two adjacent updates at the outer level. The overall workflow of OHL-AA is illustrated in Fig. 10. The bi-level optimization problem in OHL-AA is represented by two nested loops. In the inner loop, a group of identical classification models is trained in parallel using different DA policies, and each model serves to evaluate the efficacy of its associated augmentation policy. After I training epochs, all models are evaluated on the validation set. Among these models, the one with the highest validation accuracy is selected and broadcast to the other models to synchronize network weights.

Outer optimization is performed along with the inner procedures. In the outer loop, after the same number I of inner updates, accuracy values are used to optimize the probability distributions of the DA policy. To be specific, after obtaining the validation accuracies of the models, OHL-AA first calculates the average gradient with respect to the distribution hyper-parameters (via the REINFORCE algorithm [ 55 ]). Such gradients are then used to update the policy distributions with one step of gradient ascent. After the update, new augmentation policies are sampled from the updated distributions, and then used to train a group of synchronized networks. The whole process continues iteratively until the network or policy distribution finally converges. Overall, OHL-AA aims to find the optimal policy distribution that can generate the best DA policy. This framework drastically reduces the search cost, as the policy hyper-parameters are updated using only I steps of model optimization instead of a complete training.
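The outer update described above can be illustrated with a minimal REINFORCE sketch. It assumes that the policy distribution is a categorical distribution over a small set of candidate policies, parameterized by logits, and that the rewards are the validation accuracies of the parallel models. The mean-reward baseline, the learning rate and all function names are assumptions added for illustration, not details taken from [ 54 ].

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, sampled, rewards, lr=0.1):
    """One gradient-ascent step on the expected reward.

    sampled[i] is the index of the policy used by model i and rewards[i]
    its validation accuracy. For a categorical distribution, the gradient
    of log p(k) with respect to the logits is one_hot(k) - softmax(logits).
    """
    probs = softmax(logits)
    baseline = sum(rewards) / len(rewards)  # assumed variance-reduction term
    grad = [0.0] * len(logits)
    for k, reward in zip(sampled, rewards):
        advantage = reward - baseline
        for j in range(len(logits)):
            indicator = 1.0 if j == k else 0.0
            grad[j] += advantage * (indicator - probs[j])
    count = len(sampled)
    return [value + lr * g / count for value, g in zip(logits, grad)]


# Three candidate policies, three parallel models, one policy each.
logits = [0.0, 0.0, 0.0]
new_logits = reinforce_step(logits, sampled=[0, 1, 2], rewards=[0.9, 0.5, 0.1])
```

After the step, the policy that produced the highest validation accuracy receives the largest logit, so it becomes more likely to be sampled in the next round.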

The biggest contribution of OHL-AA is the proposal of the bi-level framework. It is also considered to be the first one-stage AutoDA model that optimizes both the DA policy and the network weights in a single pass. The removal of repetitive model training and proxy tasks significantly reduces the overall search cost. According to the results in [ 54 ], OHL-AA is \(60\times \) faster than the original AA [ 20 ] on CIFAR-10 and \(24\times \) faster on ImageNet data, while still maintaining comparable model performance. In addition, the probability distribution of the DA policy in OHL-AA provides a feasible differentiation method for estimating policy gradients of AutoDA problems, which stimulates innovations in later one-stage approaches, such as DADA [ 57 ] and AutoDO [ 62 ].

6.1.2 Adversarial AutoAugment (AAA)

Adversarial AutoAugment (AAA) [ 53 ] is another one-stage AutoDA model that simultaneously optimizes the target network and augmentation policy. In addition to gradient approximation of the DA policy, AAA innovatively employs the adversarial concepts of GANs, leading to a more computationally efficient AutoDA approach. The ultimate goal of the AAA method is to best train the target classification model, rather than to search for the optimal DA policy. Similar to OHL-AA [ 54 ], training and searching in AAA are conducted in an online way, where the augmentation policy is dynamically updated along with the training of the discriminative model. Such procedures in AAA avoid the need for re-training the classification model, which significantly decreases the computational cost.

AAA preserves the standard formulation of the AutoDA problem in AA [ 20 ]. In AAA, a complete augmentation policy for the full dataset contains 5 sub-policies. Each sub-policy is applied to one data batch before training the target model. A sub-policy is composed of two separate augmentation transformation functions, each of which is controlled by two hyper-parameters, i.e. probability and magnitude. To obtain better performance and to compare easily with the original AA [ 20 ], AAA excludes the probability factor of augmentation operations during training. According to [ 53 ], this hyper-parameter requires a certain number of training epochs to take effect. This works for offline frameworks such as AA [ 20 ], because policy models are updated based on the result of fully training a child model. However, in the online AAA model, the DA policy evolves dynamically along with the training of the target network. The number of training epochs for each update is not sufficient to fully demonstrate the randomness of image operations, which may limit the strength of the reward signal.
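The policy structure described above can be written down as a small data structure. The following is a sketch only: the class and function names, the operation names and the normalization of probability and magnitude to [0, 1] are assumptions made for illustration. The `probability` field reflects the AA formulation; a method that drops it, as AAA does, could simply fix it to a constant.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Operation:
    name: str           # hypothetical transformation name, e.g. "Rotate"
    probability: float  # chance of applying the operation, assumed in [0, 1]
    magnitude: float    # strength of the operation, assumed in [0, 1]


# A sub-policy is an ordered pair of operations.
SubPolicy = Tuple[Operation, Operation]


def random_policy(op_names: List[str], num_sub_policies: int = 5, rng=random) -> List[SubPolicy]:
    """Build a policy made of `num_sub_policies` randomly initialized sub-policies."""
    policy = []
    for _ in range(num_sub_policies):
        first = Operation(rng.choice(op_names), rng.random(), rng.random())
        second = Operation(rng.choice(op_names), rng.random(), rng.random())
        policy.append((first, second))
    return policy


def pick_sub_policy(policy: List[SubPolicy], rng=random) -> SubPolicy:
    """Select the sub-policy used to augment the next data batch."""
    return rng.choice(policy)


rng = random.Random(0)
policy = random_policy(["Rotate", "ShearX", "Invert"], rng=rng)
batch_sub_policy = pick_sub_policy(policy, rng=rng)
```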

In AAA, network training and DA policy generation are performed simultaneously. Different from standard two-stage approaches, the augmentation policy is dynamically updated rather than fixed during model training. As with other one-stage AutoDA methods, AAA also needs to perform gradient approximation on DA hyper-parameters to support joint optimization in a non-differentiable framework [ 86 , 87 ]. Specifically, the REINFORCE algorithm [ 55 ] is applied in AAA to estimate the policy gradient. The overall framework of AAA follows a standard GAN structure, consisting of a policy model and a target model. The training of these two models is formulated as an adversarial min-max game, in which the policy model is regarded as the adversary. During training, the target network aims to minimize the training loss over the input data, while the objective of the policy network is to maximize the training loss of the target model by generating adversarial DA policies. These adversarial policies force the target model to learn from harder data samples and thus substantially improve its generalizability. When updating the policy network, the reward signal comes from the normalized training losses of the target network. Each loss value is associated with one augmentation strategy and indicates the efficacy of the corresponding DA policy.
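The min-max game described above can be summarized in one expression. The notation is an assumption introduced here for illustration: \(w\) denotes the weights of the target network \(f\), \(\theta\) the parameters of the policy model, \(\tau \sim \mathcal{A}(\cdot ; \theta )\) an augmentation policy sampled from the policy model, and \(\mathcal{L}\) the training loss.

\[
\min_{w}\; \max_{\theta}\;\; \mathbb{E}_{(x, y) \sim \mathcal{D}}\; \mathbb{E}_{\tau \sim \mathcal{A}(\cdot ; \theta )} \Big[ \mathcal{L}\big( f(\tau (x); w),\, y \big) \Big]
\]

Under the same assumptions, a REINFORCE-style estimate of the policy gradient from \(M\) sampled policies with normalized losses \(\tilde{\mathcal{L}}_m\) takes the form

\[
\nabla_{\theta} J(\theta ) \approx \frac{1}{M} \sum_{m=1}^{M} \tilde{\mathcal{L}}_m \, \nabla_{\theta} \log p(\tau_m ; \theta ),
\]

so that policies producing a higher training loss become more likely to be sampled again.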

The major motivation for proposing AAA is the limited randomness of traditional policy search. Although the enormous search space in most AA-based approaches allows for a large variety of policy candidates [ 20 , 21 ], fixing the sampled policy during the entire model training often leads to overfitting [ 23 ]. To tackle this issue, AAA uses a dynamic augmentation policy, which is updated based on the state of the target model during training. The concept of a dynamic DA policy was first proposed in PBA [ 23 ], where the application schedule of TFs was emphasized by sharing TF prefixes. In AAA, by contrast, the stochasticity of the policy search is further enhanced, as the entire DA policy is updated along with model training, rather than just a subset of TFs within one policy. The increased randomness in policy sampling provides the augmented data with more diversity, and thus can better train the target model.

Another motivation of AAA is the efficiency problem existing in most AutoDA methods. The first AutoDA model AA [ 20 ] was largely criticized due to its excessive training time. Later works such as PBA [ 23 ] manage to accelerate the whole process by trading time with space. Although the overall search time is substantially decreased, training a large population of child models simultaneously still requires significant computational resources. AAA, however, is considered to be computation-efficient and resource-friendly. During the search in AAA, the target model only needs to be trained once. By reusing the prior computation in training, policy networks are updated based on the intermediate state of target models instead of the final result. By the end of training, the target network is supposed to be optimized via combating adversarial policies. Due to the reduced computational cost and time overheads, AAA can directly perform searching on the full data using the target network. A direct search not only guarantees the effectiveness of AAA, but also eliminates the potential sub-optimality that may result from employing proxy tasks.

The most significant innovation in AAA is its adversarial framework. AAA is not, however, the first AutoDA method to use adversarial learning. The earliest TANDA approach [ 52 ] also employed a standard GAN structure, in which policy models act as generators that sample DA policies, while a discriminative network identifies augmented samples among the original data for policy evaluation. In contrast, AAA uses the policy model as an adversary against the training of the target model. The policy model in AAA produces aggressive DA policies that maximize the training loss. The data transformed by these policies are often more distorted and therefore harder for the target model to classify, which in turn allows the target model to learn more robust features via adversarial learning. The final goal of AAA is not just to find the optimal policy on a given dataset. Instead, AAA places more emphasis on the final result of target model training. By feeding deformed examples, AAA trains the classification network to be more resilient to a variety of data points, thereby greatly enhancing its generalization ability.

In addition, AAA also outperforms previous AutoDA methods in terms of evaluation. AAA directly evaluates the performance of the target model via the training loss, while in methods such as TANDA [ 52 ] or Fast AA [ 21 ], the effectiveness of DA policies is estimated using the similarity of the augmented data to the original data. In those methods, a policy is generally considered effective if the transformed pictures resemble the original samples. Such an approximation might degrade performance due to the lack of variety in the learned policies. This issue is substantially mitigated in AAA, as the training loss is used as an intuitive criterion to evaluate the target model. The experimental results in [ 77 ] also show that classification networks trained in AAA have higher accuracy than previous methods.

6.1.3 Differentiable automatic data augmentation (DADA)

Motivated by the development of differentiable NAS [ 3 , 56 , 88 ], a number of one-stage AutoDA approaches have been proposed following OHL-AA [ 54 ]. Differentiable Automatic Data Augmentation (DADA) is another effective method that relies on gradient approximation to optimize model weights and DA policy at the same time. The basic gradient estimator utilized in DADA is the Gumbel-Softmax gradient estimator [ 70 ]. Additionally, DADA adopts the RELAX estimator [ 91 ] to obtain unbiased gradient estimates. By combining the two gradient approximators, DADA offers an effective and efficient way of learning DA policies.

The formulation of the DA policy in DADA follows the standard setting of AA [ 20 ]. A complete augmentation policy comprises 25 sub-policies, each of which is used to augment one data batch of the training set. One sub-policy contains two TFs, which are described by two hyper-parameters (probability and magnitude). Similar to AAA [ 53 ], DADA uses probability distributions to encode augmentation TFs when sampling DA policies. The sub-policy is sampled from a categorical distribution, and the hyper-parameters of each transformation are modelled with Bernoulli distributions. After this re-parameterization of the search space, the AutoDA task is formulated in DADA as a Monte-Carlo optimization problem [ 89 ]. The search for the DA policy and the training of the classifier can be conducted simultaneously within this bi-level optimization framework. However, neither categorical nor Bernoulli distributions are differentiable. To directly optimize the policy hyper-parameters, it is necessary to perform gradient approximation on these non-differentiable distributions prior to policy search. Inspired by DARTS [ 56 ], DADA employs the Gumbel-Softmax gradient estimator [ 70 ], which is also known as the concrete distribution [ 90 ]. This gradient estimator is used to relax the distributions of augmentation operations. As for the operation hyper-parameters, i.e. operation probability and magnitude, the unbiased estimator RELAX [ 91 ] is applied to obtain their gradients with regard to model performance.

The gradient relaxation in DADA consists of two parts, including the re-parameterization of categorical distribution for sub-policy selection, and the approximation of Bernoulli distributions for image operation hyper-parameters. In DADA, the sub-policy to augment data batches is selected from a categorical distribution. The preference for each sub-policy is defined using a probability parameter. After optimizing the parameter for categorical distribution, the sub-policies associated with higher probability will be selected to form the final DA policy. However, the parameter of sub-policy conforms to a non-differentiable distribution. Inspired by its success in the NAS field [ 88 , 92 ], DADA employs the Gumbel-Softmax estimator [ 70 ] to approximate the gradient of parameters for sub-policy selection. For hyper-parameters to describe augmentation TFs, both application probability and magnitude are sampled from Bernoulli distributions. Similar to categorical distribution, Bernoulli distributions are not differentiable. To overcome the gradient issue, the same relaxation procedure is applied on Bernoulli distributions to obtain the gradient of TF hyper-parameters. Moreover, to mitigate the bias resulting from gradient approximations, DADA employs the RELAX estimator [ 91 ], to achieve an unbiased gradient estimation, which further improves the policy search.
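The relaxation of the categorical sub-policy selection can be illustrated with a minimal Gumbel-Softmax sketch. This is not the DADA implementation: the function names, the temperature value and the example logits are assumptions, and the RELAX correction is not shown. The sketch only demonstrates how a hard categorical sample is replaced by a soft weight vector that varies smoothly with the logits.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    # Clamp the uniform sample away from 0 and 1 so that both logarithms are finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return a relaxed (soft) one-hot sample over the sub-policies.

    Lower temperatures push the output towards a hard one-hot vector,
    higher temperatures push it towards a uniform vector.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    perturbed = [(logit + sample_gumbel(rng)) / temperature for logit in logits]
    m = max(perturbed)
    exps = [math.exp(p - m) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Hypothetical preference logits for three sub-policies.
weights = gumbel_softmax([1.0, 0.5, -1.0], temperature=0.5, rng=rng)
```

Because the output is a differentiable function of the logits for a fixed noise sample, an automatic differentiation framework can propagate gradients from the training loss back to the sub-policy preferences.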

The major contribution of DADA is its innovation regarding gradient approximation. Instead of using only a standard Gumbel-Softmax approximator, DADA applies the unbiased RELAX estimator [ 91 ], which estimates gradients more accurately. In the extensive experiments of [ 57 ], DADA models using the RELAX gradient estimator achieve higher accuracy than models using Gumbel-Softmax alone. DADA not only enhances model performance, but also offers a significant speedup of the search process over alternative AutoDA approaches. Due to its increased efficiency, the search for the DA policy in DADA is conducted on the full dataset instead of a reduced subset, which also improves the final results. In the field of automated DA policy search, the common assumption is that using more data for the policy search provides more information about the target task, and thus leads to a better final policy. On the other hand, a large amount of data slows down the search. This results in a trade-off between performance and efficiency in most AutoDA models. However, unlike prior methods such as AA [ 20 ] and Fast AA [ 21 ], DADA is able to balance model accuracy and search cost in resource-constrained environments. Due to its efficiency, DADA is considered to be a feasible AutoDA approach for practical application.

6.1.4 Automated dataset optimization (AutoDO)

The majority of recent works in the AutoDA field focus on the reduction of the search cost, while Automated Dataset Optimization (AutoDO) [ 62 ] evolves in the direction of mitigating the negative impacts of noisy or imbalanced data. To achieve a robust policy search, AutoDO adapts the idea of density matching [ 21 ] in a bi-level optimization framework. Specifically, the AutoDO model optimizes a set of augmentation hyper-parameters for each data point instead of a batch of data, allowing for more flexibility in tuning distributions of transformed data. In addition, AutoDO further refines the policy estimate by generalizing the training loss and softening the original labels. Through implicit differentiation, AutoDO jointly optimizes the results from three sub-models: the policy sub-model, the loss weighting sub-model and the soft label sub-model. Moreover, by using Fisher information [ 93 , 94 ], AutoDO provides theoretical proof that the complexity of AutoDA problem scales linearly with the size of the dataset.

The proposal of AutoDO is mainly motivated by data problems present in training sets, including biased distributions and noisy labels. These problems become more pronounced when existing approaches apply the same DA strategy to all data points. For instance, data distributions of different classes are usually uneven in practice. However, by sharing the same augmentation policy across the entire training set, data samples in all categories are evenly augmented to increase diversity. Since the intensity of augmentation remains the same, classes with more data points might be over-augmented after transformation, while minority classes may be under-augmented. After data augmentation, an imbalanced or biased training set transformed by a shared policy may mislead the classification model. A shared DA policy is therefore not robust enough for data with distortions. For multi-class classification tasks, the overfitting issue may worsen significantly [ 95 ], especially when there is noise in the original data labels [ 96 ]. This phenomenon is defined as the dilemma of shared-policy in [ 62 ]. To overcome this limitation, AutoDO estimates DA hyper-parameters for each training data point, rather than for the entire dataset. Additionally, the AutoDO algorithm uses loss weights to constrain distribution biases and soft labels to address label noise.

A complete AutoDO model is composed of three sub-models: augmentation, loss re-weighting and soft-labelling sub-models. The overall workflow of AutoDO can be described as follows. Firstly, data are sampled as input to the augmentation sub-model, where the original data points are transformed by a set of point-wise augmentation operations. Each data sample is separately augmented by a sequence of specific transformation functions. The augmentation hyper-parameters for each image are defined and updated in the augmentation sub-model as well. To be more specific, application probabilities are binary values, while the magnitudes are sampled from a continuous Gaussian distribution. After data augmentation, the distorted data output from the augmentation block is used to train the classification network. During training, the loss re-weighting sub-model and the soft label sub-model are applied at the same time. Specifically, the loss sub-model is used to normalize the training loss at a given training epoch, restraining the negative impacts of biased distributions. The soft label sub-model softens the original label of transformed data based on noise-free validation data. This soft labelling technique is applied in AutoDO to limit the annotation noise that may result from aggressive augmentation [ 97 , 98 ]. Lastly, the reward signal produced by the loss re-weighting block, along with the soft labels from the soft-labelling model, is back-propagated to update the augmentation hyper-parameters accordingly.
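How a per-sample loss weight and a soft label enter the training loss can be shown with a small sketch. It is an illustration under strong assumptions: plain label smoothing stands in for the learned soft-label sub-model, a scalar weight stands in for the learned re-weighting sub-model, and the function names and parameter `alpha` are hypothetical rather than taken from [ 62 ].

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def soften(one_hot, alpha):
    """Blend a one-hot label with the uniform distribution (label smoothing stand-in)."""
    k = len(one_hot)
    return [(1.0 - alpha) * y + alpha / k for y in one_hot]


def weighted_soft_label_loss(logits, soft_label, weight=1.0):
    """Cross-entropy against a soft label, scaled by a per-sample weight."""
    log_probs = log_softmax(logits)
    cross_entropy = -sum(q * lp for q, lp in zip(soft_label, log_probs))
    return weight * cross_entropy


# One augmented sample from a three-class problem.
soft_label = soften([1.0, 0.0, 0.0], alpha=0.3)
loss = weighted_soft_label_loss([2.0, 0.5, -1.0], soft_label, weight=0.8)
```

A smaller weight reduces the influence of a sample on the update, and a softer label reduces the penalty for predictions that disagree with a possibly noisy annotation.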

In AutoDO, the optimization of the classification network and the augmentation policy is conducted simultaneously. Inspired by prior works [ 22 , 54 , 57 ], AutoDO employs a bi-level setting, where the inner objective is to find the optimal network weights for the target task, while the outer objective is to search for the optimal DA policy by hyper-parameter optimization. Such joint optimization is realized by gradient differentiation [ 99 ]. However, directly solving the bi-level problem is usually computationally infeasible, especially when AutoDO aims to optimize augmentation hyper-parameters per data point. To accommodate the search over such a large number of hyper-parameters, AutoDO combines density matching techniques [ 21 ] with an implicit differentiation method. To be more specific, the major objective of searching in AutoDO is to minimize the distribution difference between the augmented data and an unbiased and clean validation set. According to the analysis in [ 62 ], which uses Fisher information, the modified differentiation framework can yield results equivalent to those of the DARTS gradient approximator [ 56 ] used in previous methods [ 22 , 57 ]. Furthermore, the use of Fisher information suggests a linear relationship between the complexity of the AutoDA search and the size of the task data.
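The bi-level setting described above can be written in a generic form. The notation is an assumption introduced here for illustration: \(w\) denotes the network weights, \(\lambda \) the augmentation hyper-parameters, and \(\mathcal{L}_{\text{train}}\) and \(\mathcal{L}_{\text{val}}\) the training and validation losses.

\[
\min_{\lambda}\; \mathcal{L}_{\text{val}}\big( w^{*}(\lambda ) \big) \quad \text{s.t.} \quad w^{*}(\lambda ) = \arg\min_{w}\; \mathcal{L}_{\text{train}}(w, \lambda )
\]

When the validation loss depends on \(\lambda \) only through \(w^{*}(\lambda )\), the implicit function theorem gives the standard hyper-gradient

\[
\frac{d \mathcal{L}_{\text{val}}}{d \lambda } = - \frac{\partial \mathcal{L}_{\text{val}}}{\partial w} \left( \frac{\partial ^{2} \mathcal{L}_{\text{train}}}{\partial w\, \partial w^{\top }} \right) ^{-1} \frac{\partial ^{2} \mathcal{L}_{\text{train}}}{\partial w\, \partial \lambda ^{\top }},
\]

evaluated at \(w = w^{*}(\lambda )\). The inverse Hessian is the expensive term, which is why approximations of it are needed when \(\lambda \) contains hyper-parameters for every data point.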

The effectiveness of AutoDO can be evaluated from two perspectives: class imbalance and label noise. Before being fed into the AutoDO model, the training data is distorted by adjusting the class distribution and the associated labels. To display the strength of the DA policy, the t-SNE method [ 100 ] is employed to visualise the embedded features of test data in the classification model. The distance between data clusters represents the difference between data categories from the perspective of the model. Usually, larger margins or clear boundaries between clusters are preferable. When compared with gradient-based Fast AA [ 21 ], the AutoDO model produces larger margins between data clusters in t-SNE plots. This result suggests that point-wise augmentation in AutoDO might achieve better performance.

The extensive experiments in [ 62 ] further confirm that AutoDO is more robust to distorted data. When compared with prior AutoDA approaches, AutoDO avoids overfitting to the majority data by optimizing point-wise hyper-parameters instead of a single shared policy. As a result, the augmentation policy learned by AutoDO can better separate images in different classes, mitigating the impacts of biased class distributions. The issue of noisy labels is addressed by using the re-weighted loss and soft labels. According to the experimental results, when trained on noisy data, AutoDO achieves superior results compared to other AutoDA models. Additionally, the smooth labels provided by the soft-labelling sub-model further enlarge the margins between data clusters in t-SNE plots, enhancing the generalization ability of the model. More importantly, such improvements can be found in both well- and under-represented classes, aligning the accuracy of minority categories. The greatest advantage of AutoDO is its resilience to low-quality data. Overall, AutoDO is considered more applicable to real-world tasks with imperfect data.

7 Discussion

Following the categorization in Fig. 1, a wide range of AutoDA methods have been covered in this survey. However, automation of data augmentation is still a relatively new concept and has not been fully addressed. The development of AutoDA techniques is still in its infancy. Although AutoDA models have the potential to become an essential component of standard deep learning pipelines, there are still a number of difficulties to overcome. In this section, we discuss the reviewed AutoDA algorithms, focusing on the current challenges in this field, as well as some directions we believe are important for future work.

7.1 Comparison of one-stage and two-stage approaches

One-stage algorithms usually achieve better efficiency than traditional two-stage approaches, as shown in Table 4. Table 4 displays the estimated number of GPU hours needed to complete a single run of a given AutoDA algorithm, including the training time for the end classifier. Although different GPUs were used in these works, we can clearly see that one-stage AutoDA methods greatly outperform two-stage methods in terms of efficiency.

However, gradient approximation generally has a negative impact on final model performance. Table 5 summarises the error rates of classification models trained using various AutoDA methods on the CIFAR-10/100, SVHN and ImageNet datasets. These data are obtained from the original papers of the reviewed AutoDA methods. We only include the best (lowest) error rate for each AutoDA method among all tested classifiers. According to Table 5, one-stage algorithms provide performance gains that are only comparable to or smaller than those of most two-stage approaches, due to the approximation process during policy generation.

7.2 Search space formulation

The policy search space defined by AA [ 20 ] has been widely accepted as the standard setting. Most of the later AutoDA works reuse the same formulation as the basis of their search model. However, without any modification, the parameterization in AA can result in an enormous search space. Even with a limited range of available TFs, the search process of AA still requires extensive computational resources, so it is not feasible in practice and has serious issues when scaling to larger datasets or models.

As AutoDA techniques evolve, the traditional settings of the search space in AA-based models are challenged by search-free approaches. These methods aim to avoid the search phase by re-parameterization. For example, in RandAugment (RA) [ 31 ], instead of optimizing the application probability, all TFs share the same global probability value. Moreover, based on empirical evidence, the magnitude is set to be a global variable for all transformations. The final search space in RA is controlled by only two hyper-parameters: the number of transformations applied to each image and the shared magnitude, whose suitable values depend on the size of the classification model and the training set. Neither requires significant computation to be optimized.
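The reduced search space of RA can be illustrated with a short sketch. It assumes, following the description in the RandAugment paper, that a fixed number of transformations is drawn uniformly for each image and that all of them share one global magnitude. The function name, the operation names and the magnitude scale are hypothetical, and the transformations themselves are not implemented here.

```python
import random
from typing import List, Tuple


def rand_augment(op_names: List[str], num_ops: int, magnitude: float, rng=random) -> List[Tuple[str, float]]:
    """Sample `num_ops` operations uniformly, all sharing the same global magnitude."""
    return [(rng.choice(op_names), magnitude) for _ in range(num_ops)]


rng = random.Random(0)
# Only num_ops and magnitude need to be tuned, e.g. with a small grid search.
sequence = rand_augment(["Rotate", "ShearX", "Solarize"], num_ops=2, magnitude=0.3, rng=rng)
```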

UniformAugment (UA) [ 58 ] also re-formulates the search space. Instead of optimizing hyper-parameters, UA proposes the invariance hypothesis. Sampling any policies from an invariant policy space can retain the original label information for the majority of the augmented data. The authors of UA argue that if an augmentation space is approximately invariant, then optimizing the augmentation policy within that space is unnecessary. Therefore, a simple uniform sampling of the invariant space is sufficient to effectively enhance the model performance, which completely eliminates the need for searching.
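Uniform sampling of an augmentation space can be sketched in a few lines. The sketch assumes that the operation, its application probability and its magnitude are each drawn uniformly, with probability and magnitude normalized to [0, 1]. The function name, the default number of operations and the operation names are hypothetical.

```python
import random
from typing import List, Tuple


def uniform_augment(op_names: List[str], num_ops: int = 2, rng=random) -> List[Tuple[str, float, float]]:
    """Draw (operation, probability, magnitude) triples uniformly, with no search."""
    return [(rng.choice(op_names), rng.random(), rng.random()) for _ in range(num_ops)]


rng = random.Random(0)
# A fresh policy is drawn for every image, so no hyper-parameter is optimized.
sampled = uniform_augment(["Rotate", "ShearX", "Solarize"], rng=rng)
```

The approach relies entirely on the chosen operation set being approximately label-preserving, since nothing in the sampling step filters out harmful transformations.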

The emergence of RA and UA has been revolutionary. The removal of the search stage raises questions about the necessity and optimality of searching in traditional AutoDA methods. In particular, the hypothesis of an invariant augmentation space has the potential to completely remove the current efficiency bottleneck of search-based methods. Hence, we believe a viable topic for future research is to develop a methodology for finding invariant augmentation spaces in specific domains. Furthermore, it is worth exploring other ways of defining augmentation policy searches that lead to even simpler search spaces.

7.3 Optimal augmentation transformations

Though extensive efforts have been put into the search of the hyper-parameters that describe image transformation functions, less attention has been paid to the selection of TFs to be applied. Conventionally, the available image operations in AutoDA models are from the PIL Python library. Nearly all image transformation functions in PIL are considered in later search phases. In AA, two additional augmentation techniques, Cutout [ 45 ] and SamplePairing [ 46 ] are also used.

In the majority of later works, similar selections are made for fair comparison purposes. A few of them remove some image operations from the search list [ 23 , 30 ], while others add several new augmentation techniques to their models [ 60 , 63 ]. The choice of image operations is made empirically, with little theoretical basis. Though there is discussion around the different impact of each TF on different datasets or sub-regions, no one has systematically investigated the optimal augmentation transformation functions for AutoDA to search. We argue that the optimization of available image operations in AutoDA with various data may be another interesting direction for researchers to explore. When searching for augmentation policies, future users might have the option to switch a certain type of transformation on or off in different application scenarios, so that the obtained augmentation policy can be more tailored to the given task.

7.4 Unsupervised evaluation

The existing AutoDA methods extensively use supervised evaluation to determine which augmentation policies to use for training. Normally, the most generalized model is trained by applying the best augmentation policy. The optimal DA policy is supposed to provide the greatest enhancement in data variety and quantity while still retaining salient image features. However, in practice, it can be difficult or impossible to obtain accurately labeled data, especially for sensitive tasks [ 108 ]. This is a great challenge for existing AutoDA methods, and it has given rise to self-supervised AutoDA. This possibility is also discussed in [ 59 ]. So far, only a few AutoDA models support semi-supervised learning, such as SelfAugment [ 109 ]. However, with the rapid development of the AutoDA field, this may change in the future.

7.5 Biased or noisy data

When evaluating the effectiveness of AutoDA models, most existing works presume that the training data is clean and balanced. This assumption holds for benchmark datasets such as CIFAR-10/100 [ 27 ] and ImageNet [ 28 ]. In real-world scenarios, however, things can be very different. It is all too common for a training dataset to be not only insufficient but also extremely biased and affected by label noise. Experimental results in [ 62 ] show that distorted training data with imbalanced distributions and noisy labels can negatively affect AutoDA models and eventually lead to overfitting. Additionally, findings in [ 59 ] indicate that aggressive augmentation transformations might introduce label noise even though the original annotation is correct.

Dealing with biased or noisy data is a great challenge for AutoDA models. AA-KD [ 59 ] is the first AutoDA work targeting this issue. By leveraging the idea of KD, a stand-alone model is applied in AA-KD to provide extra guidance for model training. During training, the model receives supervision from both the ground-truth label and the teacher signal, in case the discriminative information is accidentally removed by aggressive augmentation. AutoDO [ 62 ] uses a similar idea to KD by softening the original label to better train the model. Additionally, AutoDO contains a re-weighting sub-model which is used to normalize the training loss. The combination of all three sub-models makes AutoDO much more robust than other AutoDA algorithms when given biased or noisy data.

Overall, both methods outperform other AutoDA methods, especially when dealing with imbalanced data with label noise. This further indicates that AutoDA models can benefit from the extra handling of label noise. In the future, we expect more research to tackle this topic. Such developments would greatly improve the applicability of AutoDA methods in real-world tasks.

7.6 Application domains

Though the major focus of this paper is on image classification tasks, we identify several other domains which might greatly benefit from AutoDA techniques. One is Object Detection (OD), which has already been partly explored in recently published works, including AutoAugment for OD [ 26 ], DADA [ 57 ], RandAugment [ 31 ], SelfAugment [ 109 ] and Scale-aware AutoAugment (SA) [ 110 ]. Data augmentation may be even more important for OD tasks, because annotation is much more time-consuming. The experimental results in [ 26 ] demonstrate that even a direct transfer of DA policies obtained from classification data can be useful for OD tasks. However, according to their findings, such improvement may be limited. Moreover, the extremely long search time is also a serious problem in terms of applicability.

To further improve the overall model performance, additional adjustments to the original AutoDA scheme are required. Later works such as RA [ 31 ] concentrate more on improving efficiency, providing accuracy competitive with AA at a superior search speed. Self-supervised evaluation has also been found useful for detection training [ 109 ]. DADA shows another possibility: tuning the pre-trained backbone network used for detection, instead of directly operating on the detection model. The experimental results in [ 57 ] indicate that pre-training the backbone network using DADA can improve the model performance for later detection tasks. Scale-aware Augmentation [ 110 ] is specifically designed for detection tasks and incorporates bounding-box level augmentations. Such a design is more fine-grained and thus leads to considerable improvements for various OD networks.

While the aforementioned research mainly focuses on computer vision tasks, Natural Language Processing (NLP) is another field that can greatly benefit from the application of AutoDA techniques. Traditional data augmentation has been used extensively in NLP research. Automating the augmentation procedure in text-based tasks can be another promising research direction. Several studies have already applied AutoDA to linguistic problems, including the earliest TANDA [ 52 ], Text AutoAugment [ 111 ] and works such as [ 61 , 66 ]. The adaptation of AutoDA methods in NLP yields competitive performance gains by improving the quantity and quality of training data.

8 Conclusions

With the increasing development of deep learning, efficiently training a performant deep model largely depends on the quantity and quality of available training data. Data Augmentation (DA) is an essential tool for solving data problems, and has been widely used in various computer vision tasks. However, designing an effective DA policy still relies heavily on human effort. It is difficult to select the optimal augmentation policy for a specific dataset without domain knowledge. Therefore, researchers seek to solve this problem by automating the search for augmentation policies via deep learning, which has stimulated the development of Automated Data Augmentation (AutoDA) techniques.

This survey provides a comprehensive overview of AutoDA techniques for image classification tasks in the computer vision field. The focus of this paper is on various search algorithms in AutoDA. In order to describe and categorize approaches for augmentation policy optimization, we introduce searching and training phases for a standard AutoDA pipeline. Based on different optimization approaches, all AutoDA methods can be divided into two-stage or one-stage approaches. The searching process in AutoDA can be further classified into gradient-free, gradient-based or search-free methods. The associated qualitative evaluation describes AutoDA methods in terms of the complexity of the search space, the computational cost, the available augmentation transformations, as well as the reported performance improvements on classification models.

There are some limitations of the current study. Although we present experimental statistics based on the original works, we lack a universal benchmark across two-stage and one-stage AutoDA approaches that can be used to better measure their performance. The formulation of the search problem in most AutoDA algorithms is also questionable given the success of search-free methods. Additionally, the choice of supported transformation functions is made for comparison purposes rather than performance optimisation. Moreover, most works in the AutoDA field (including ours) concentrate on classic image classification tasks. Less attention is paid to other tasks, such as object detection and segmentation, or even more complicated computer vision challenges.

There are many other open issues with AutoDA analysis, including benchmarking the performance of AutoDA methods across multiple datasets; the reasonable formulation of DA search problems; the implementation of various augmentation functions, including advanced DA algorithms; developing novel AutoDA approaches that are applicable to other CV tasks; and the possibility of incorporating different computational frameworks such as IoT [112] or Cloud-based architectures [113]. A further challenge is the balance between accuracy and efficiency of AutoDA algorithms and the trade-off between safety and the variety of the obtained DA policies. These will be explored in future extensions to this work.

Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25:1097–1105

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9

Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639):115–118

Shin H-C, Roth HR, Gao M, Lu L, Xu Z, Nogues I, Yao J, Mollura D, Summers RM (2016) Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans Med Imaging 35(5):1285–1298

Zheng Y-Y, Kong J-L, Jin X-B, Wang X-Y, Su T-L, Zuo M (2019) Cropdeep: The crop vision dataset for deep-learning-based classification and detection in precision agriculture. Sensors 19(5):1058

Kamilaris A, Prenafeta-Boldú FX (2018) Deep learning in agriculture: a survey. Comput Electron Agric 147:70–90

Shijie J, Ping W, Peiyi J, Siping H (2017) Research on data augmentation for image classification based on convolution neural networks. In: 2017 Chinese automation congress (CAC), pp 4165–4170. IEEE

Lemley J, Bazrafkan S, Corcoran P (2017) Smart augmentation learning an optimal data augmentation strategy. IEEE Access 5:5858–5869

Cireşan DC, Meier U, Gambardella LM, Schmidhuber J (2010) Deep, big, simple neural nets for handwritten digit recognition. Neural Comput 22(12):3207–3220

Dosovitskiy A, Fischer P, Springenberg JT, Riedmiller M, Brox T (2015) Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Trans Pattern Anal Mach Intell 38(9):1734–1747

Graham B (2014) Fractional max-pooling. arXiv preprint arXiv:1412.6071

Sajjadi M, Javanmardi M, Tasdizen T (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Adv Neural Inf Process Syst 29:1163–1171

Rios A, Kavuluru R (2018) Few-shot and zero-shot multi-label learning for structured label spaces. In: Proceedings of the conference on empirical methods in natural language processing, vol. 2018, p 3132. NIH Public Access

Bachman P, Hjelm RD, Buchwalter W (2019) Learning representations by maximizing mutual information across views. Adv Neural Inf Process Syst 32

Paschali M, Simson W, Roy AG, Naeem MF, Göbl R, Wachinger C, Navab N (2019) Data augmentation with manifold exploring geometric transformations for increased performance and robustness. arXiv preprint arXiv:1901.04420

Wang Y, Yao Q, Kwok JT, Ni LM (2020) Generalizing from a few examples: a survey on few-shot learning. ACM Comput Surv (CSUR) 53(3):1–34

Dao T, Gu A, Ratner A, Smith V, De Sa C, Ré C (2019) A kernel theory of modern data augmentation. In: International conference on machine learning, pp 1528–1537. PMLR

Cubuk ED, Zoph B, Mane D, Vasudevan V, Le QV (2019) Autoaugment: learning augmentation strategies from data. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 113–123

Lim S, Kim I, Kim T, Kim C, Kim S (2019) Fast autoaugment. Adv Neural Inf Process Syst 32:6665–6675

Hataya R, Zdenek J, Yoshizoe K, Nakayama H (2020) Faster autoaugment: learning augmentation strategies using backpropagation. In: European conference on computer vision, pp 1–16. Springer

Ho D, Liang E, Chen X, Stoica I, Abbeel P (2019) Population based augmentation: efficient learning of augmentation policy schedules. In: International conference on machine learning, pp 2731–2741. PMLR

Tran T, Pham T, Carneiro G, Palmer L, Reid I (2017) A bayesian data augmentation approach for learning deep models. arXiv preprint arXiv:1710.10564

DeVries T, Taylor GW (2017) Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538

Zoph B, Cubuk ED, Ghiasi G, Lin T-Y, Shlens J, Le QV (2020) Learning data augmentation strategies for object detection. In: European conference on computer vision, pp 566–583. Springer

Krizhevsky A, Hinton G, et al (2009) Learning multiple layers of features from tiny images

Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, pp 248–255. IEEE

Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng AY (2011) Reading digits in natural images with unsupervised feature learning

Tian K, Lin C, Sun M, Zhou L, Yan J, Ouyang W (2020) Improving auto-augment via augmentation-wise weight sharing. arXiv preprint arXiv:2009.14737

Cubuk ED, Zoph B, Shlens J, Le QV (2020) Randaugment: practical automated data augmentation with a reduced search space. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp 702–703

LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324

Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818–2826

Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. Adv Neural Inf Process Syst 27:2672–2680

Zoph B, Le QV (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578

Perez L, Wang J (2017) The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621

Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788

Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 580–587

Girshick R (2015) Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision, pp 1440–1448

Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst 28

Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440

Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention, pp 234–241. Springer

Shorten C, Khoshgoftaar TM (2019) A survey on image data augmentation for deep learning. J Big Data 6(1):1–48

Zhang H, Cisse M, Dauphin YN, Lopez-Paz D (2017) mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412

DeVries T, Taylor GW (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552

Inoue H (2018) Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929

Bagherinezhad H, Horton M, Rastegari M, Farhadi A (2018) Label refinery: improving imagenet classification through label progression. arXiv preprint arXiv:1805.02641

Umesh P (2012) Image processing in python. CSI Communications 23

Zhong Z, Zheng L, Kang G, Li S, Yang Y (2020) Random erasing data augmentation. In: AAAI, pp 13001–13008

Sato I, Nishimura H, Yokoi K (2015) Apac: augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229

Simard PY, Steinkraus D, Platt JC et al (2003) Best practices for convolutional neural networks applied to visual document analysis. In: ICDAR, vol. 3

Ratner AJ, Ehrenberg HR, Hussain Z, Dunnmon J, Ré C (2017) Learning to compose domain-specific transformations for data augmentation. Adv Neural Inf Process Syst 30:3239

Zhang X, Wang Q, Zhang J, Zhong Z (2019) Adversarial autoaugment. arXiv preprint arXiv:1912.11188

Lin C, Guo M, Li C, Yuan X, Wu W, Yan J, Lin D, Ouyang W (2019) Online hyper-parameter learning for auto-augmentation strategy. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6579–6588

Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8(3):229–256

Liu H, Simonyan K, Yang Y (2018) Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055

Li Y, Hu G, Wang Y, Hospedales T, Robertson NM, Yang Y (2020) Dada: differentiable automatic data augmentation. arXiv preprint arXiv:2003.03780

LingChen TC, Khonsari A, Lashkari A, Nazari MR, Sambee JS, Nascimento MA (2020) Uniformaugment: a search-free probabilistic data augmentation approach. arXiv preprint arXiv:2003.14348

Wei L, Xiao A, Xie L, Zhang X, Chen X, Tian Q (2020) Circumventing outliers of autoaugment with knowledge distillation. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pp 608–625. Springer

Naghizadeh A, Abavisani M, Metaxas DN (2020) Greedy autoaugment. Pattern Recogn Lett 138:624–630

Niu T, Bansal M (2019) Automatically learning data augmentation policies for dialogue tasks. arXiv preprint arXiv:1909.12868

Gudovskiy D, Rigazio L, Ishizaka S, Kozuka K, Tsukizawa S (2021) Autodo: robust autoaugment for biased data with label noise via scalable probabilistic implicit differentiation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16601–16610

Naghizadeh A, Metaxas DN, Liu D (2021) Greedy auto-augmentation for n-shot learning using deep neural networks. Neural Netw 135:68–77

Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M et al (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252

Lin S, Yu T, Feng R, Li X, Jin X, Chen Z (2021) Local patch autoaugment with multi-agent collaboration. arXiv preprint arXiv:2103.11099

Hu Z, Tan B, Salakhutdinov R, Mitchell T, Xing EP (2019) Learning data manipulation for augmentation and weighting. arXiv preprint arXiv:1910.12795

Jaderberg M, Dalibard V, Osindero S, Czarnecki WM, Donahue J, Razavi A, Vinyals O, Green T, Dunning I, Simonyan K, et al (2017) Population based training of neural networks. arXiv preprint arXiv:1711.09846

Terrell GR, Scott DW (1992) Variable kernel density estimation. Ann Stat 1236–1265

Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531

Jang E, Gu S, Poole B (2016) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144

Colson B, Marcotte P, Savard G (2007) An overview of bilevel optimization. Ann Oper Res 153(1):235–256

Real E, Aggarwal A, Huang Y, Le QV (2019) Regularized evolution for image classifier architecture search. Proc AAAI Conf Artif Intell 33:4780–4789

Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13(2)

Mania H, Guy A, Recht B (2018) Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055

Tarvainen A, Valpola H (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780

Thornton C, Hutter F, Hoos HH, Leyton-Brown K (2013) Auto-weka: combined selection and hyperparameter optimization of classification algorithms. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, pp 847–855

Xie C, Tan M, Gong B, Wang J, Yuille AL, Le QV (2020) Adversarial examples improve image recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 819–828

Gontijo-Lopes R, Smullin SJ, Cubuk ED, Dyer E (2020) Affinity and diversity: quantifying mechanisms of data augmentation. arXiv preprint arXiv:2002.08973

Boutilier C (1996) Planning, learning and coordination in multiagent decision processes. TARK 96:195–210 (Citeseer)

Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, pp 618–626

Bengio Y, Léonard N, Courville A (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432

Oord Avd, Vinyals O, Kavukcuoglu K (2017) Neural discrete representation learning. arXiv preprint arXiv:1711.00937

Snoek J, Larochelle H, Adams RP (2012) Practical Bayesian optimization of machine learning algorithms. Adv Neural Inf Process Syst 25

Chen S, Dobriban E, Lee J (2020) A group-theoretic framework for data augmentation. Adv Neural Inf Process Syst 33:21321–21333

Franceschi L, Donini M, Frasconi P, Pontil M (2017) Forward and reverse gradient-based hyperparameter optimization. In: International conference on machine learning, pp 1165–1173. PMLR

Wang X, Shrivastava A, Gupta A (2017) A-fast-rcnn: Hard positive generation via adversary for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2606–2615

Peng X, Tang Z, Yang F, Feris RS, Metaxas D (2018) Jointly optimize data augmentation and network training: adversarial data augmentation in human pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2226–2234

Dong X, Yang Y (2019) Searching for a robust neural architecture in four gpu hours. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1761–1770

Mohamed S, Rosca M, Figurnov M, Mnih A (2020) Monte carlo gradient estimation in machine learning. J Mach Learn Res 21(132):1–62

Maddison CJ, Mnih A, Teh YW (2016) The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712

Grathwohl W, Choi D, Wu Y, Roeder G, Duvenaud D (2017) Backpropagation through the void: optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123

Xie S, Zheng H, Liu C, Lin L (2018) Snas: stochastic neural architecture search. arXiv preprint arXiv:1812.09926

Domingos P (2020) Every model learned by gradient descent is approximately a kernel machine. arXiv preprint arXiv:2012.00152

Gudovskiy D, Hodgkinson A, Yamaguchi T, Tsukizawa S (2020) Deep active learning for biased datasets via fisher kernel self-supervision. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9041–9049

Terhörst P, Kolf JN, Huber M, Kirchbuchner F, Damer N, Morales A, Fierrez J, Kuijper A (2021) A comprehensive study on face recognition biases beyond demographics. arXiv preprint arXiv:2103.01592

Zhang C, Bengio S, Hardt M, Recht B, Vinyals O (2021) Understanding deep learning (still) requires rethinking generalization. Commun ACM 64(3):107–115

Tanaka D, Ikami D, Yamasaki T, Aizawa K (2018) Joint optimization framework for learning with noisy labels. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5552–5560

Yi K, Wu J (2019) Probabilistic end-to-end noise correction for learning with noisy labels. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7017–7025

Lorraine J, Vicol P, Duvenaud D (2020) Optimizing millions of hyperparameters by implicit differentiation. In: International conference on artificial intelligence and statistics, pp 1540–1552. PMLR

Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(11)

Yamada Y, Iwamura M, Akiba T, Kise K (2019) Shakedrop regularization for deep residual learning. IEEE Access 7:186126–186136

He K, Zhang X, Ren S, Sun J (2016) Identity mappings in deep residual networks. In: Computer vision–ECCV 2016: 14th European conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pp 630–645. Springer

Gastaldi X (2017) Shake-shake regularization. arXiv preprint arXiv:1705.07485

Zagoruyko S, Komodakis N (2016) Wide residual networks. arXiv preprint arXiv:1605.07146

Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4700–4708

Tan M, Le Q (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In: International conference on machine learning, pp 6105–6114. PMLR

Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141

Shin H-C, Orton M, Collins DJ, Doran S, Leach MO (2011) Autoencoder in time-series analysis for unsupervised tissues characterisation in a large unlabelled medical image dataset. In: 2011 10th international conference on machine learning and applications and workshops, vol 1, pp 259–264. IEEE

Reed CJ, Metzger S, Srinivas A, Darrell T, Keutzer K (2021) Selfaugment: automatic augmentation policies for self-supervised learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2674–2683

Chen Y, Li Y, Kong T, Qi L, Chu R, Li L, Jia J (2021) Scale-aware automatic augmentation for object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9563–9572

Ren S, Zhang J, Li L, Sun X, Zhou J (2021) Text autoaugment: learning compositional augmentation policy for text classification. arXiv preprint arXiv:2109.00523

Rajavel R, Ravichandran SK, Harimoorthy K, Nagappan P, Gobichettipalayam KR (2022) Iot-based smart healthcare video surveillance system using edge computing. J Ambient Intell Humanized Comput 1–13

Rajavel R, Sundaramoorthy B, GR K, Ravichandran SK, Leelasankar K (2022) Cloud-enabled diabetic retinopathy prediction system using optimized deep belief network classifier. J Ambient Intell Humanized Comput 1–9

Open Access funding enabled and organized by CAUL and its Member Institutions

Author information

Authors and affiliations.

Faculty of Engineering and Information Technology, The University of Melbourne, 700 Swanston Street, Melbourne, VIC, 3010, Australia

Zihan Yang, Richard O. Sinnott & James Bailey

Faculty of Information Technology, Monash University, 20 Exhibition Walk, Clayton, VIC, 3800, Australia

Contributions

ZY wrote the main manuscript text and prepared all figures and tables. All authors reviewed the manuscript.

Corresponding author

Correspondence to Zihan Yang.

Ethics declarations

Conflict of interest.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Yang, Z., Sinnott, R.O., Bailey, J. et al. A survey of automated data augmentation algorithms for deep learning-based image classification tasks. Knowl Inf Syst 65, 2805–2861 (2023). https://doi.org/10.1007/s10115-023-01853-2

Received: 13 June 2022

Revised: 08 February 2023

Accepted: 27 February 2023

Published: 17 March 2023

Issue Date: July 2023

DOI: https://doi.org/10.1007/s10115-023-01853-2

  • Automated data augmentation
  • Deep learning
  • Image classification

Data augmentation for improving deep learning in image classification problem

  • Conference: 2018 International Interdisciplinary PhD Workshop (IIPhDW)

Agnieszka Mikołajczyk-Bareła at Gdansk University of Technology

  • Gdansk University of Technology

Michał Grochowski at Gdansk University of Technology

Abstract and Figures

Image generated with Style Transfer - benign image + malignant image

Navigating uncharted waters: ASU drives solutions for water resilience

Collage of photos of a lake surrounded by a canyon, a utility worker looking at water pipes and a child washing their hands.

Illustration by Andy Keena

Editor's note: This is the fifth story in a series exploring how ASU is changing the way the world solves problems.

In the Southwest, water seems to exist in two vastly conflicting states: abundance and scarcity. For some, simply turning on a faucet at work or at home yields a seemingly on-demand supply of one of our planet’s most precious resources. And yet, persistent drought, extreme heat, reduced precipitation and high demand for water have drastically altered our water supply.

The Southwest has grappled with an ongoing megadrought since 2000, the driest period in the last 1,200 years. In a place already known for extreme heat and an arid climate, a secure water supply is especially crucial in order for humanity to thrive.

The Arizona Water Innovation Initiative at ASU, aimed at providing immediate, actionable and evidence-based solutions to strengthen Arizona’s water security, has already seen great success in patenting technologies, empowering communities and better understanding our state’s water challenges. Additionally, the newly launched Water Institute draws from existing academic capacity across ASU, led by the Julie Ann Wrigley Global Futures Laboratory, to develop educational, research and communication projects that benefit communities across the world.

The barriers to water resilience are multifaceted: Water is a building block for all life and a driving force behind agricultural, energy and technological development. From the food we eat to the cooling systems that keep our desert summers bearable, water plays a role in just about everything humanity touches. These complexities require a diverse range of expertise, strong collaborative efforts and creativity.

With this unprecedented challenge comes the opportunity to lead a wave of education, technology and collaboration toward water resiliency for all. As a unique test bed for transdisciplinary solutions, ASU is at the forefront of a new mission: to secure a thriving water future in Arizona and beyond.

Managing 'liquid gold' in the Southwest

While all of the Southwest faces a stressed water supply, water resources are not split uniformly between all states in the region. Each state has its own unique set of priorities and management strategies, requiring a more personalized approach.

In Arizona, the annual water demand is roughly 7 million acre-feet, split among agricultural, municipal and industrial use. Sarah Porter, the inaugural director of the Kyl Center for Water Policy at ASU, says there is a lot of variance in how that water is allocated, particularly the water that goes toward municipalities.
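For readers unfamiliar with the unit, an acre-foot is the volume of water that covers one acre to a depth of one foot. The short sketch below converts the 7 million acre-feet figure into US gallons using standard unit definitions (43,560 square feet per acre, 231 cubic inches per US gallon); the function name is only illustrative.

```python
SQ_FT_PER_ACRE = 43_560               # definition of the acre
CUBIC_INCHES_PER_CUBIC_FOOT = 12 ** 3
CUBIC_INCHES_PER_US_GALLON = 231      # definition of the US liquid gallon


def acre_feet_to_gallons(acre_feet: float) -> float:
    """Convert a volume in acre-feet to US gallons."""
    cubic_feet = acre_feet * SQ_FT_PER_ACRE  # one acre, one foot deep
    return cubic_feet * CUBIC_INCHES_PER_CUBIC_FOOT / CUBIC_INCHES_PER_US_GALLON


one_acre_foot = acre_feet_to_gallons(1)
annual_demand = acre_feet_to_gallons(7_000_000)

print(f"1 acre-foot is about {one_acre_foot:,.0f} US gallons")
print(f"7 million acre-feet is about {annual_demand / 1e12:.2f} trillion US gallons")
```

One acre-foot works out to roughly 326,000 gallons, so the statewide figure is on the order of 2.3 trillion gallons a year.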

Illustration of a football field covered in water to demonstrate the concept of one acre-foot of water.

“Most people get their water from their city water department — some get their water from a private water provider or private water company,” says Porter, an executive committee member of the Arizona Water Innovation Initiative. “A comparatively small number of people get their water another way, typically from a shared well or a well that’s on their property. Under this framework, the responsibility for securing the water supply that is delivered to taps lies with the water provider.”

Putting the responsibility on the water provider means that water portfolios vary greatly: one city's mix of supplies can look vastly different from its neighbor's.

“There is a persisting idea that we're all in the same boat in terms of water challenges in Arizona or in the Southwest,” Porter says. “That's simply not the case.”

Amber Wutich, an ASU President’s Professor, director of the Center for Global Health and a 2023 MacArthur Fellow, has dedicated her career to understanding the intersection between water insecurity and the human experience. She says strong water policy is a key element to ensuring future habitability in the Southwest, but it is not a guarantee of water security for all.

“Even great water policy won’t necessarily solve everyone’s problems, and often the most vulnerable get left behind,” says Wutich. “The challenge I am interested in is how to meet the needs of Arizona’s most water-insecure people and communities. Here at ASU, we’re experimenting with new ways to bring together engineered and social infrastructures to ensure water security for all.”

Wutich says water insecurity threatens both physical and mental health and is known to contribute to anxiety, depression and PTSD. She leads the “Arizona Water for All” pillar of the Arizona Water Innovation Initiative, which focuses on increasing participation in community water decision-making, deploying proven water security solutions, and advancing measurement and monitoring of household water insecurity.

While the current system is not perfect, Porter says it comes with a key advantage: Arizona cities have full-time workers who are responsible for ensuring a strong water supply. In central Arizona specifically, the ongoing goal is making sure that there is 100 years of water. This approach also allows for solutions that can account for nuance and local water use; this would not be possible with a one-size-fits-all approach to water management in the Southwest.

Utilizing creativity in times of urgency

Creativity, Porter says, is an area where universities can thrive in the solutions space. As test beds for budding technologies and groundbreaking discoveries, universities can support the discovery phase of solutions. Where a city may be hesitant to invest in a new, unproven technology, institutes of higher education are uniquely positioned to test, expand and then transition new ideas into implementable solutions.

The Kyl Center for Water Policy, for example, provides modeling of Colorado River scenarios to help inform water managers what their risks of shortage are at the municipal or irrigation district level. This level of detailed modeling often goes beyond what a federal or statewide agency can explore given their constraints. Universities, on the other hand, are the perfect place to consider the "what if" questions.

These “what if” questions can prove invaluable, especially in a world that is rapidly adapting to human-driven stressors. Climate change and water supplies are closely linked, says Dave White, associate vice president of research advancement in ASU Knowledge Enterprise. Rising temperatures alone affect both water supply availability and water demand, a “double whammy” for regions like the Southwest, says White, who leads the Arizona Water Innovation Initiative.

“Because higher average temperatures and higher extreme temperatures are driving greater levels of evaporation and drying out the soils, you're seeing less water available from the system under these higher temperatures,” he says. “At the same time, those higher temperatures are driving up the demand from plants, which impacts agriculture.”

Last year, White served as the lead author of The White House’s Fifth National Climate Assessment, a report considered to be the scientific consensus regarding climate change impacts, mitigation and adaptation strategies across the country. Twelve ASU faculty members also contributed to the report, which detailed a series of key messages.

The first key message, “Drought and Increasing Aridity Threaten Water Resources,” details the intersection of climate change and water insecurity. It also offers a potential solution: Flexible and adaptive approaches to water management may be able to soften the impacts of climate-driven changes on people, the environment and the economy.

Adapting to meet the challenge

The effects of climate change on our water supply have not gone unnoticed. In 2021, the federal government issued its first-ever water shortage declaration for the Colorado River, a significant water supply in the Southwest. The river system supplies water for 40 million people in seven Western states and Mexico. It also irrigates more than 5 million acres of farmland.

Map of the area of the United States covered by the Colorado River.

While this shortage spurred important conversations and urgency, White says continued efforts are needed. It can be easy to be comforted by years that offer high snowpack or above-average inflows into the reservoirs, but White says that would be an error.

It would take years of exceptionally strong rainfall and snowpack to combat the megadrought that has persisted across the Southwest for more than 20 years. Through long-term drought and climate change alike, the Colorado River Basin has already undergone structural, systematic changes that pose a new reality to water managers.

“There is a built-in assumption that because we are in a drought, the drought will end,” says Porter. “The reality is that we are looking at a permanent adjustment to our Colorado River supply.”

Porter says that while supplies are dropping, we are not “running out of water.” However, we do have to adjust to using less of the Colorado River. One of the best strategies water managers and policymakers can implement is finding ways to conserve water.

Porter says that while the Phoenix metro area has seen a significant population increase in the last few years, this growth does not always produce a corresponding increase in water demand. This is less applicable in municipalities located close to the periphery of metropolitan areas (like Buckeye, Queen Creek or Maricopa, for example) but is especially true in more “built out” areas. This is largely due to conservation efforts and built-in measures at the municipal level to accommodate population changes.

Recent numbers are promising: Porter says in the last 20 years, central Arizona experienced about a 45% increase in population but saw only a 14% increase in municipal water demand.

Jay Famiglietti, director of science for the Arizona Water Innovation Initiative and a global futures professor in the School of Sustainability, says metropolitan areas have shown that they can be more water-efficient, partially due to technologies like sewage recycling and stormwater capture.

Both metropolitan and urban areas also use groundwater, a vital part of the natural water cycle. Famiglietti says that while many cities have found ways to conserve water in growing populations, there is another element to consider: how to feed that population. This is where groundwater — an already strong contributor to water supplies — really shines.

Illustration showing Arizona's main sources of water.

“In the Southwest, groundwater is absolutely crucial for food security because we use so much of that water to grow crops,” says Famiglietti. “We have a growing population to grow food for, and we will need to grow food forever. At the same time, groundwater is a fixed resource. We have to understand our supply and how to protect it.”

Famiglietti measures groundwater supply using satellites. These satellites circle our planet and collect data on “mass variations” on the Earth’s surface. These mass variations are typically made up of water, either through snow, soil moisture, river supplies or groundwater. Using a combination of satellite data and measurements taken from the ground, the technology has allowed Famiglietti to put together a global and regional picture of groundwater supplies.

“This provides us a picture like we’ve never seen before,” Famiglietti says. “Using data, we can do a better job of making the changes that matter. There is a saying: You have to do the measurement to do the management.”

Bolstering water supplies through innovation

Paul Westerhoff, Regents Professor in the School of Sustainable Engineering and the Built Environment at ASU and the Fulton Chair of Environmental Engineering, has seen tremendous development in water technology in the last 20 years alone. A washing machine purchased today, he says, is significantly more water-conscious than a washing machine purchased a few decades ago. From water treatment to water transportation, the Southwest is getting more creative and efficient in its water solutions.

As the area tackles issues presented by reduced water supplies, it has an opportunity to be a global leader in the water solutions space. Westerhoff, the lead of the Arizona Water Innovation Initiative’s “Global Center for Water Technology” pillar, develops and deploys advanced technologies for water augmentation, conservation, treatment and reuse. The Global Center for Water Technology supports over 20 faculty research teams that are developing new technologies, including patents and startup companies, and working on big challenges posed by the Arizona Department of Environmental Quality.

Water and technology are deeply intertwined — there is the technology you need to clean and transport water, and there is the technology that needs water to function. Whether you are in a restaurant, your home, a hospital or a space station, water is all around you.

“Even when you don't see water, taste water or touch water, you're still using water,” says Westerhoff. “Those systems are becoming more and more water- and energy-efficient, and there is more we can continue to do.”

Earlier this year, Westerhoff hosted the first-ever Atmospheric Water Harvesting Summit at ASU. Atmospheric water harvesting is an emerging method of water collection that draws water from humidity in the air. The summit gathered participants from around the world and has since resulted in a global seminar series and the newly created Atmospheric Water Harvesting Association.

Westerhoff says times of urgency inspire innovation, and Arizona is currently shifting from a technology demonstrator to a technology innovator.

“We're really seizing this opportunity to take a bolder step in developing technologies like atmospheric water harvesting and others that will hopefully be exported to the world,” he says. “In the meantime, creating these technologies here in Arizona supports local job growth. As we are tackling these issues, people you know will likely get hired by companies that start here. They'll be able to stay local and have great jobs.”

Upmanu Lall, director of the Water Institute at ASU, says scaling solutions globally will take major collaborative efforts both inside and outside of ASU. The Water Institute, which launched in March of this year, aims to bring together discourse across the university on topics related to water.

“There are around 200 faculty who work on topics related to water here at ASU,” Lall says. “The goal is to unite those 200 people and collaborate to strengthen all of our efforts. From there, we are able to more efficiently work in teams to participate in robust engagement with governments, the private sector and with nongovernmental organizations.”

In addition to connecting faculty members, the Water Institute aims to create a national consortium of universities, industry and public agencies to assess and meet needs for water and climate adaptation. The institute will also work with the World Bank and related organizations to address groundwater depletion and climate hazard impacts across the world. Other priorities include developing a program in weather engineering, in addition to exploring how to reduce evaporation and increase energy production through the installation of floating solar electricity technology on Western reservoirs.

Lall says it is a university’s responsibility to drive policy and solutions efforts in the water space, but also to provide relevant training for future water leaders. Through the College of Global Futures and the Ira A. Fulton Schools of Engineering, students are exposed firsthand to a transdisciplinary approach to water problem-solving.

“In the process of educating students, we are also improving the human workforce that we need to solve these issues,” Lall says. “Water insecurity won’t be solved in the next 10 years. It will take generations to develop solutions, and then to redevelop solutions as the challenges shift. It’s crucial that we give students more than the tools to write a paper. They need to be embedded in projects and initiatives that matter.”


Text Data Augmentation for Deep Learning

Connor Shorten

Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431 USA

Taghi M. Khoshgoftaar

Borko Furht

Associated data

Not applicable.

Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form. We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data. Deep Learning generally struggles with the measurement of generalization and characterization of overfitting. We highlight studies that cover how augmentations can construct test sets for generalization. NLP is at an early stage in applying Data Augmentation compared to Computer Vision. We highlight the key differences and promising ideas that have yet to be tested in NLP. For the sake of practical implementation, we describe tools that facilitate Data Augmentation such as the use of consistency regularization, controllers, and offline and online augmentation pipelines, to preview a few. Finally, we discuss interesting topics around Data Augmentation in NLP such as task-specific augmentations, the use of prior knowledge in self-supervised learning versus Data Augmentation, intersections with transfer and multi-task learning, and ideas for AI-GAs (AI-Generating Algorithms). We hope this paper inspires further research interest in Text Data Augmentation.

Introduction

Nearly all the successes of Deep Learning stem from supervised learning. Supervised learning describes the use of loss functions that align predictions with manually annotated ground truth. Deep Learning can achieve remarkable performance through the combination of this learning strategy and large labeled datasets. The problem is that collecting these annotated datasets is very difficult at the scale required. For example, one of the key Deep Learning applications for COVID-19 rapid response was question answering [ 1 ]. Tang et al. [ 2 ] constructed COVID-QA, a supervised learning dataset in which articles are annotated with an answer span to a given question. The authors of the paper describe working for 23 hours to produce 124 question-answer pairs. Fitting 124 question-answer annotations without overfitting is extremely challenging in the current state of Deep Learning. In addition to question answering, Natural Language Processing (NLP) researchers are also exploring the application of abstractive summarization in which a model outputs a novel summary from a collection of input documents. Cachola et al. [ 3 ] were able to collect a dataset of 5.4K Too Long; Didn’t Read (TLDR) summaries of 3.2K machine learning papers. This required employing 28 undergraduate students to refine data bootstrapped from the OpenReview platform. These anecdotes are provided to highlight the difficulty of curating annotated big data for knowledge-intensive NLP tasks with millions of examples.

The Deep Learning research community is currently exploring many solutions to the problem of learning without labeled big data. In addition to Data Augmentation, self-supervised learning and transfer learning have performed very well. Few and zero-shot learning are categories of research gaining interest as well. In this survey, we explore getting more performance out of the supervised data available with Data Augmentation. Our survey additionally explores how Data Augmentation is driving key advances in learning strategies outside of supervised learning. This includes self-supervised learning from unlabeled datasets, and transfer learning from other domains, whether that data is labeled or unlabeled.

Data Augmentation describes a set of algorithms that construct synthetic data from an available dataset. This synthetic data typically contains small changes in the data that the model’s predictions should be invariant to. Synthetic data can also represent combinations between distant examples that would be very difficult to infer otherwise. Data Augmentation is one of the most useful interfaces to influence the training of Deep Neural Networks. This is largely due to the interpretable nature of the transformations and the window to observe how the model is failing.

Preventing overfitting is the most common use case of Data Augmentation. Without augmentation, or regularization more generally, Deep Neural Networks are prone to learning spurious correlations and memorizing high-frequency patterns that are difficult for humans to detect. In NLP, this could describe high-frequency numeric patterns in token embeddings, or memorization of particular forms of language that do not generalize. Data Augmentation can mitigate these types of overfitting by shuffling the particular forms of language. To overcome the noisy data, the model must resort to learning abstractions of information, which are more likely to generalize.

Data Augmentation is a regularization strategy. Other regularization techniques have been developed such as dropout [ 4 ] or weight penalties [ 5 ]. These techniques apply functional regularization by either adding noise to intermediate activations of the network or adding constraints to the functional form. These techniques have found successes, but they lack the power to express the esoteric concept of semantic invariance. Data Augmentation enables an intuitive interface for demonstrating label-preserving transformations.

Our survey presents several strategies for applying Data Augmentation to text data. We cluster these augmentations into symbolic or neural methods. Symbolic methods use rules or discrete data structures to form synthetic examples. This includes Rule-Based Augmentations, Graph-Structured Augmentations, Feature-Space Augmentation, and MixUp. Neural augmentations use a deep neural network trained on a different task to augment data. Neural augmentations surveyed include Back-Translation, Generative Data Augmentation, and Style Augmentation. In addition to symbolic vs. neural-based augmentations, we highlight other distinctions between augmentations such as task-specific versus task-agnostic augmentations and form versus meaning augmentations. We describe these distinctions further throughout our survey.

Generalization is the core challenge of Deep Learning. How far can we extrapolate from the instances available? The same interface used to control the training data is also useful for simulating potential test sets and distribution shifts. We can simulate distribution shift by applying augmentations to a dataset, such as adding random tokens to the inputs of an email spam detector or increasing the prevalence of tokens that lie on the long tail of the frequency distribution. These simulated shifts can also describe higher-level linguistic phenomena, such as deeper fact chaining than what was seen in the training set, or the ability to change predictions given counterfactual evidence. As our tools for Generative Data Augmentation continue to improve, we will be able to simulate more semantic distribution shifts. This looks like a very promising direction to advance generalization testing.
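The random-token shift described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names, the keyword-based toy classifier, and the filler vocabulary are all hypothetical and do not come from the surveyed works.

```python
import random

def insert_random_tokens(tokens, vocab, n_insert, rng):
    """Return a copy of tokens with n_insert vocabulary tokens at random positions."""
    shifted = list(tokens)
    for _ in range(n_insert):
        pos = rng.randrange(len(shifted) + 1)
        shifted.insert(pos, rng.choice(vocab))
    return shifted

def accuracy_under_shift(classify, examples, vocab, n_insert, rng):
    """Fraction of (tokens, label) pairs still classified correctly after the shift."""
    correct = 0
    for tokens, label in examples:
        shifted = insert_random_tokens(tokens, vocab, n_insert, rng)
        correct += int(classify(shifted) == label)
    return correct / len(examples)

# Hypothetical toy spam detector: flags any message containing "prize".
def toy_classifier(tokens):
    return "spam" if "prize" in tokens else "ham"

rng = random.Random(0)
examples = [
    ("claim your prize now".split(), "spam"),
    ("meeting moved to friday".split(), "ham"),
]
filler_vocab = ["hello", "regards", "thanks"]  # assumed neutral tokens
acc = accuracy_under_shift(toy_classifier, examples, filler_vocab, 3, rng)
```

Comparing `acc` against the accuracy on the unmodified examples gives a rough measure of how sensitive a classifier is to this particular simulated shift.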

Our survey on Text Data Augmentation for Deep Learning builds on our work surveying Image Data Augmentation for Deep Learning [ 6 ]. In Computer Vision, this describes applying transformations such as rotating images, horizontally flipping them, or increasing the brightness to form augmented examples. We found that it is currently much easier to apply label-preserving transformations in Computer Vision than NLP. It is additionally easier to stack these augmentations in Computer Vision, enabling even more diversity in the augmented set, which has been shown to be a key contributor to success. Data Augmentation research has been more thoroughly explored in Computer Vision than NLP. We present some ideas that have found interesting results with images, but remain to be tested in the text data domain. Finally, we discuss the intersection of visual supervision for language understanding and how vision-language models may help overcome the grounding problem. We discuss the grounding problem in greater detail under our Motifs Of Data Augmentation section.

Our next section presents practical implementation decisions for text data augmentation. We begin by describing the use of a consistency regularization loss to further influence the impact of augmented data. Differently from consistency regularization, contrastive learning additionally uses negative examples to structure the loss function. The next key question is how to control the strength and sampling of each augmentation. Augmentation controllers apply a meta-level abstraction to the hyperparameters of augmentation selection and the magnitude of the transformation. This is commonly explored with an adversarial controller that aims to produce mistakes in the model. We also describe controllers that search for performance improvements such as AutoAugment [ 7 ], Population-Based Augmentation [ 8 ], and RandAugment [ 9 ]. Although similar in concept, we discuss the key distinction between augmentation controllers and curriculum learning. Another important consideration for implementing Data Augmentation is the CPU to GPU transfer in the preprocessing pipeline, as well as the conceptual understanding of offline versus online augmentation. Finally, we describe the application of augmentation to alleviate issues caused by class imbalance.
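As a rough sketch of the consistency regularization idea, the loss below adds a penalty when the model's predicted distribution on an augmented input drifts from its prediction on the original input. The use of KL divergence, the function names, and the `weight` parameter are assumptions chosen for illustration; other divergence measures are possible.

```python
import math

def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(logits_original, logits_augmented):
    """Penalize disagreement between predictions on original and augmented inputs."""
    return kl_divergence(softmax(logits_original), softmax(logits_augmented))

def total_loss(supervised_loss, logits_original, logits_augmented, weight=1.0):
    """Supervised loss plus a weighted consistency term (weight is a hyperparameter)."""
    return supervised_loss + weight * consistency_loss(logits_original, logits_augmented)

# Hypothetical logits for one example before and after augmentation.
logits_clean = [2.0, 0.5, -1.0]
logits_aug = [1.5, 0.7, -0.8]
loss = total_loss(0.3, logits_clean, logits_aug, weight=0.5)
```

When the two predictions agree exactly, the consistency term vanishes and only the supervised loss remains.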

Our Discussion section presents opportunities to explore text data augmentation. We begin with task-specific augmentations describing how key NLP tasks such as question answering differ from natural language inference, particularly with respect to input length or the categorization as a knowledge-intensive task. We quickly previewed that self-supervised and transfer learning are also emerging solutions to learning with limited labeled data. We discuss the use of Data Augmentation in self-supervised learning and then recent works with transfer and multi-task learning. Finally, we discuss AI-GAs, short for AI-generating Algorithms [ 10 ]. This is a very interesting idea encompassing papers such as POET [ 11 ], Generative Teaching Networks [ 12 ], and the Synthetic Petri Dish [ 13 ] which describe algorithms that learn the environment to learn from. We present how this differs from augmentation controllers or curriculum learning, the idea of skill acquisition from artificial data, and opportunities to test these ideas in NLP.

Data Augmentation for NLP prevents overfitting, provides the easiest way to inject prior knowledge into a Deep Learning system, and offers a view into the generalization ability of these models. Our survey is organized as follows:

  • We begin with the key Motifs Of Data Augmentation that augmentations strive to achieve.
  • We provide a list of Text Data Augmentations. This list can be summarized into symbolic augmentations, using rules and graph-structured decomposition to form new examples, and neural augmentations, that use auxiliary neural networks to sample new data.
  • Following our list of available augmentations, we dive deeper into Testing Generalization with Data Augmentation.
  • We continue with a comparison of Image versus Text Augmentation.
  • Returning to Text Data Augmentation, we describe Practical Considerations for Implementation.
  • Finally, we present interesting ideas and research questions in our Discussion section.
  • Our Conclusion briefly summarizes the motivation and findings of our survey.

Data Augmentation has been a heavily studied area of Machine Learning. The advancement of the prior knowledge encoded in augmentations is one of the key distinctions between previous works and now. As we will discuss in depth later in the survey, the success of Data Augmentation in Computer Vision has been fueled by the ease of designing label-preserving transformations. For example, a cat image is still a cat after rotating it, translating it on the x or y axis, increasing the intensity of the red channel, and so on. It is easy to brainstorm these semantically-preserving augmentations for images, whereas it is much harder to do this in the text domain.

We believe our survey on text data augmentation is well-timed. Why now, and what has changed recently? Recent advances in generative modeling such as StyleGAN for images, GPT-3 for text [ 14 ], and DALL-E unifying both text and images [ 15 ], have been astounding. We summarize many exciting works on the use of prompting for adapting language models for downstream tasks. As discussed in further detail later on, we believe these advances in generative modeling could be game changing for the way we store datasets and build Deep Learning models. More particularly, it could become common to use labeled datasets solely for the sake of evaluation, rather than representation learning.

Our survey has some similarities to Feng et al. [ 16 ], which was published at roughly the same time as ours. Both surveys seek a clear definition of Data Augmentation and aim to highlight key motifs. Additionally, both surveys narrate the development of NLP augmentation around the successes of augmentation in Computer Vision and how these may transfer. Feng et al. [ 16 ] provide a deeper enumeration of task-specific augmentation than is covered in our survey. Our survey adds important concepts such as the debate between Meaning versus Form, Counterfactual Examples, and the use of prompts in Generative Data Augmentation.

Many of the successes of Deep Learning stem from access to large labeled datasets such as ImageNet [ 17 ]. However, constructing these datasets is very challenging and time-consuming. Therefore, researchers are looking for alternative ways to leverage data without manual annotation. This is a large motivation behind the success of self-supervised language modeling with papers such as GPT-3 [ 14 ] or BERT [ 18 ]. Data Augmentation shares this motivation: overcoming the challenge of learning with limited labeled data and avoiding manual labeling. For example, many of the surveyed studies highlight the success of their algorithms when sub-setting the labeled data.

Transfer Learning has been one of the most effective solutions to this challenge of learning from limited labeled datasets [ 19 ]. Transfer Learning refers to initializing a model with the weights learned from a previous task. This previous task usually has the benefit of big data, whether that data is labeled, such as ImageNet, or unlabeled, as is used in self-supervised language models. There are many research questions around the procedure of Transfer Learning. In our Discussion section we discuss opportunities with Data Augmentation such as freezing the base feature extractor and training separate heads on the original and augmented datasets.

Self-supervised learning describes a general set of algorithms that learn from unlabeled data with supervised learning. This is done by algorithmically labeling the data. Some of the most popular self-supervised learning tasks include generation, contrastive learning, and pretext tasks. Generation describes how language models are trained. A token is algorithmically selected to be masked out and the masked out token is used as the label for supervised learning. Contrastive learning aligns representations of data algorithmically determined to be similar (usually through the use of augmentations), and distances these representations from negatives (usually other samples in the mini-batch). Pretext tasks describe ideas such as applying an augmentation to data and tasking the model to predict the transformation. The augmentation interface powers many task constructions in self-supervised learning.
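The algorithmic labeling step for masked-token generation can be illustrated with a minimal sketch. The `[MASK]` placeholder string and the function name are assumptions for illustration, and the sketch masks a single whitespace-separated token rather than using a real tokenizer.

```python
import random

MASK = "[MASK]"  # placeholder symbol, assumed for illustration

def make_masked_example(tokens, rng):
    """Pick one position, hide it, and return (masked input, position, label)."""
    pos = rng.randrange(len(tokens))
    masked = list(tokens)
    label = masked[pos]   # the hidden token becomes the supervised target
    masked[pos] = MASK
    return masked, pos, label

rng = random.Random(0)
tokens = "data augmentation acts as a regularizer".split()
masked, pos, label = make_masked_example(tokens, rng)
```

No human annotation is involved: the label is recovered from the unlabeled text itself, which is what makes the task self-supervised.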

Motifs of text data augmentation

This section introduces a unifying view of the objectives that the augmentations presented in the rest of the survey address. We introduce the key motifs of Text Data Augmentation as Strengthening Decision Boundaries, Brute Force Training, Causality and Counterfactual Examples, and the distinction between Meaning versus Form. These concepts dig into the understanding of Data Augmentation and its particular application to language processing.

Strengthening decision boundaries

Data Augmentation is commonly applied to classification problems where class boundaries are learned from label assignments. Augmented examples are typically only slightly different from existing samples. Training on these examples results in added space between the original example and its respective class boundary. Well defined class boundaries result in more robust classifiers and uncertainty estimates. For example, these boundaries are often reported with lower dimensional visualizations derived from t-SNE [ 20 ] or UMAP [ 21 ].

A key motif of Data Augmentation is to perturb data so that the model is more familiar with the local space around these examples. Expanding the radius from each example in the dataset will overall help the model get a better sense of the decision boundary and result in smoother interpolation paths. This is in reference to small changes to the original data points. In NLP this could be deleting or adding words, synonym swaps, or well-controlled paraphrases. The model becomes more robust to the local space and decision boundary based on available labels simply by increased exposure.
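Two of the small perturbations mentioned above, word deletion and synonym swaps, might look like the following. This is a minimal sketch: the function names and the tiny synonym table are hypothetical, and a practical system would draw synonyms from a lexical resource rather than a hand-written dictionary.

```python
import random

def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

def synonym_swap(tokens, synonyms, rng):
    """Replace every token that has an entry in the synonym table."""
    return [rng.choice(synonyms[t]) if t in synonyms else t for t in tokens]

rng = random.Random(0)
tokens = "the movie was great".split()
# Hypothetical synonym table used only for this example.
synonyms = {"great": ["excellent", "wonderful"], "movie": ["film"]}

swapped = synonym_swap(tokens, synonyms, rng)    # same length, label-preserving
shortened = random_deletion(tokens, 0.2, rng)    # possibly shorter sequence
```

Both outputs stay close to the original sentence, so they are intended to keep the original label while exposing the model to the neighborhood around the example.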

Brute force training

Deep Neural Networks are highly parametric models with very high variance that can easily model their training data. Fitting the training data is surprisingly robust to interpolation, or moving within the data points provided. What Deep Learning struggles with, as we will unpack in Generalization Testing with Data Augmentation, is extrapolating outside of data points provided during training. A potential solution to this is to brute force the data space with the training data.

The upper bound solution to many problems in Computer Science is to simply enumerate all candidate solutions. Brute force solutions rely on computing speed to overpower the complexity of a given problem. In Deep Learning, this entails training on an exhaustive set of natural language sequences such that all potential distributions the test set could be sampled from are covered in the training data. This way, even the most extreme edge cases will have been covered in the training set. The design of brute force training requires exhaustive coverage of the natural language manifold. A key question is whether this idea is reasonable. It may be better to identify key regions that are missing, although these are challenging to probe for and define.
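To get a feel for why exhaustive coverage is questionable, one can count the token sequences that a literal enumeration would have to visit. The vocabulary size of 30,000 and the maximum length of 20 below are assumed values chosen only for illustration, and the count ignores the fact that most token sequences are not valid language.

```python
def num_sequences(vocab_size, max_len):
    """Count all token sequences of length 1..max_len over a fixed vocabulary."""
    return sum(vocab_size ** n for n in range(1, max_len + 1))

# Assumed, illustrative sizes: a 30,000-token vocabulary and 20-token sequences.
total = num_sequences(30_000, 20)
digits = len(str(total))  # number of decimal digits in the count
```

Even at these modest sizes the count is on the order of 10^89, which suggests that any practical version of brute force training must target a far smaller region of the space.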

Causality and counterfactual examples

Vital to achieving the goals of Deep Learning is learning causal representations [ 22 ], as opposed to solely representing correlations. The field of Causal Inference demonstrates how to use interventions to establish causality. Reinforcement Learning is the most similar branch of Deep Learning research in which an agent deliberately samples interventions to learn about its environment. In this survey, we consider how the results of interventions can be integrated into observational language data. This is also similar to the subset of Reinforcement Learning known as the offline setting [ 23 ].

Many of the Text Data Augmentations described throughout the survey utilize the terminology of Counterfactual Examples [ 24 ]. These Counterfactual Examples describe augmentations such as the introduction of negations or numeric alterations to flip the label of the example. The construction of counterfactuals in language generally relies on human expertise, rather than algorithmic construction. Although the model does not deliberately sample these interventions akin to a randomized controlled trial, the hope is that it can still establish causal links between semantic concepts and labels by observing the result of interventions.
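A label-flipping negation can be sketched with a deliberately simple rule. Everything here is a toy assumption: the rule only looks for the first "is" or "was", the label names are hypothetical, and, as noted above, realistic counterfactuals generally require human expertise rather than a rule like this one.

```python
FLIP = {"positive": "negative", "negative": "positive"}  # hypothetical label set

def negate(sentence):
    """Insert or remove 'not' after the first 'is'/'was'; return None if no match."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok in ("is", "was"):
            if i + 1 < len(tokens) and tokens[i + 1] == "not":
                # Already negated: remove the negation.
                return " ".join(tokens[:i + 1] + tokens[i + 2:])
            # Not negated: add the negation.
            return " ".join(tokens[:i + 1] + ["not"] + tokens[i + 1:])
    return None

def counterfactual(example):
    """Return (negated text, flipped label), or None when the rule does not apply."""
    text, label = example
    negated = negate(text)
    if negated is None:
        return None
    return negated, FLIP[label]

pair = counterfactual(("the film was good", "positive"))
# pair is ("the film was not good", "negative")
```

Unlike the label-preserving perturbations used to strengthen decision boundaries, this augmentation changes the label on purpose, so the model observes the result of an intervention on a single semantic element.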

Liu et al. [ 25 ] lay the groundwork for formal causal language in Data Augmentation. This entails the use of structured causal models and the procedure of abduction, action, and prediction to generate counterfactual examples. These experiments rely on phrasal alignment between sequences in neural machine translation to sample counterfactual replacements. Their counterfactual augmentation improves on a baseline English to French translation system from 26.0 to 28.92 according to the BLEU metric. It seems possible that this phrasal alignment could be extended to other sequence-to-sequence problems such as abstractive question answering, summarization, or dialogue systems. This explicit counterfactual structure is different from most reviewed works that rather use natural language prompts to automate counterfactual sampling. For example, DINO [ 26 ] generates natural language inference data by either seeding the generation with “mean the same thing” or “are on completely different topics”. We think it is an interesting research direction to see if rigorous causal modeling such as computing the conditional probabilities of the context removing the variable [ 27 ] will provide benefits over prompts and large language models.

Meaning versus form

One of the most interesting ideas in language processing is the distinction between meaning and form. Bender and Koller [ 28 ] introduced the argument, providing several ideas and thought experiments. A particularly salient anecdote to illustrate this is known as the octopus example. In this example, two people are stranded on separate islands, communicating through an underwater cable. This underwater cable is intercepted by an intelligent octopus who learns to mimic the speaking patterns of each person. The octopus does this well enough that it can substitute for either person, as in the Turing test. However, when one of the stranded islanders encounters a bear and seeks advice, the octopus is unable to help. This is because the octopus has learned the form of their communication, but it has not learned the underlying meaning of the world that their language describes.

We will present many augmentations in this paper that aid in learning form. Similar to the concept of strengthening decision boundaries, ideas like synonym swap or rotating syntactic trees will help the octopus further strengthen its understanding of how language is generally organized. With respect to achieving an understanding of meaning in these models and defining this esoteric concept, many have turned to ideas in grounding and embodiment. Grounding typically refers to pairing language with other modalities such as vision-language or audio-language models. However, grounding can also refer to abstract concepts and worlds constructed solely from language. Embodiment references learning agents that act in their environment. Although Bender and Koller propose that meaning cannot be learned from form alone, many other works highlight different areas of the language modeling task such as assertions [ 29 ] or multiple embedded tasks [ 30 ] that could lead to learning meaning. Another useful way of thinking about meaning versus form could be to look at recently developed benchmarks in language processing, such as the distinction between GLUE [ 31 ] and SuperGLUE [ 32 ] tasks, which predominantly test an understanding of form, and knowledge-intensive tasks such as KILT [ 33 ], which better probe for meaning. In our survey, we generally use the terms “understanding” and “meaning” to describe passing black-box tests designed by humans. We believe that drilling into the definition of these terms is one of the most promising pursuits in language processing research.

Text data augmentations

We described Data Augmentation as a strategy to prevent overfitting via regularization. This regularization is enabled through an intuitive interface. As we study a task or dataset, we learn more about what kind of priors or what kind of additional data we need to collect to improve the system. For example, we might discover characteristics about our question answering dataset such as that it fails with symmetric consistency on comparison questions. The following list of augmentations describes the mechanisms we currently have available to inject these priors into our datasets.

Symbolic augmentation

We categorize these augmentations as “Symbolic Augmentations” in contrast to “Neural Augmentations”. As stated earlier, the key difference is the use of auxiliary neural networks, or other types of statistical models, to generate data compared to using symbolic rules to augment data. A key benefit of symbolic augmentation is the interpretability for the human designer. Symbolic augmentations also work better with short transformations, such as replacing words or phrases to form augmented examples. However, some information-heavy applications rely on longer inputs such as question answering or summarization. Symbolic rules are limited in applying global transformations such as augmenting entire sentences or paragraphs.

Rule-based augmentation

Rule-based Augmentations construct rules to form augmented examples. This entails if-else programs for augmentation and symbolic templates to insert and re-arrange existing data. Easy Data Augmentation (EDA) from Wei et al. [ 34 ] presents four augmentations. Figure 1 highlights the performance improvement with EDA; note that the smallest subset of 500 labeled examples benefits the most. One of the main reasons to be excited about Easy Data Augmentation is that it is relatively easy to use off-the-shelf. Many of the augmentations mentioned later in this survey are still in the research phase, waiting for large-scale testing and adoption. Easy Data Augmentation includes random swapping, random deletion, random insertion, and random synonym replacement. Examples of these are shown in Fig. 2.
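The four EDA operations named above can be sketched in a few lines. This is a minimal illustration rather than the reference implementation from Wei et al.: the `SYNONYMS` table is a hypothetical stand-in for a real synonym source such as WordNet, and the function names and parameters are our own.

```python
import random

# Hypothetical toy synonym table (an assumption for illustration only).
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "joyful"],
    "movie": ["film"],
}


def synonym_replacement(words, n, rng):
    """Replace up to n words that have an entry in SYNONYMS."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(words, n, rng):
    """Insert a synonym of a randomly chosen known word at a random position, n times."""
    out = list(words)
    for _ in range(n):
        known = [w for w in out if w in SYNONYMS]
        if not known:
            break
        synonym = rng.choice(SYNONYMS[rng.choice(known)])
        out.insert(rng.randint(0, len(out)), synonym)
    return out


def random_swap(words, n, rng):
    """Swap the positions of two randomly chosen words, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)
sentence = "the quick movie made me happy".split()
for augmented in (
    synonym_replacement(sentence, 2, rng),
    random_insertion(sentence, 1, rng),
    random_swap(sentence, 1, rng),
    random_deletion(sentence, 0.2, rng),
):
    print(" ".join(augmented))
```

Passing an explicit `random.Random` instance keeps the augmentations reproducible, which is convenient when comparing runs on small labeled subsets.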

Fig. 1. Success of EDA applied to 5 text classification datasets. A key takeaway from these results is the performance difference with less data. The gain is much more pronounced with 500 labeled examples, compared to 5,000 or the full training set

Fig. 2. Examples of easy data augmentation transformations

There are many opportunities to build on these augmentations. Firstly, we note that when one word is swapped for another, the classification of the word is incredibly useful. From the Data Augmentation perspective of introducing semantic invariances, “I am jogging” is much more similar to “I am swimming” than to “I am yelling”. Further designing token vocabularies with this kind of structure should lead to an improvement.

Programs for Rule-based augmentation further encompass many of the adversarial attacks that have been developed for NLP. Adversarial attacks are equivalent to augmentations, differing solely in the intention of their construction. As an example of a rule-based attack, Jin et al. [ 35 ] present TextFooler. TextFooler first computes word importance scores by looking at the change in output when deleting each word. TextFooler then selects the words which most significantly changed the outputs for synonym replacement. This is an example of a rule-based symbolic program that can be used to organize the construction of augmented examples.
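The importance-scoring step described above can be sketched as follows. This is a simplified illustration, not the TextFooler implementation: `toy_sentiment_score` is a hypothetical stand-in for a trained classifier, the synonym table is invented, and the additional filtering that a full attack would apply to candidate replacements is omitted.

```python
def toy_sentiment_score(words):
    """Hypothetical stand-in for a classifier's positive-class probability."""
    positive = {"great": 0.4, "good": 0.2, "enjoyable": 0.3}
    negative = {"boring": 0.4, "bad": 0.3}
    score = 0.5
    for w in words:
        score += positive.get(w, 0.0) - negative.get(w, 0.0)
    return min(1.0, max(0.0, score))


def word_importance(words, score_fn):
    """Rank words by how much the model output changes when each word is deleted."""
    base = score_fn(words)
    ranked = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        ranked.append((abs(base - score_fn(reduced)), i, words[i]))
    return sorted(ranked, reverse=True)


def replace_most_important(words, score_fn, synonyms, k=1):
    """Replace the k most important words that have a known synonym."""
    out = list(words)
    replaced = 0
    for _, i, w in word_importance(words, score_fn):
        if replaced == k:
            break
        if w in synonyms:
            out[i] = synonyms[w]
            replaced += 1
    return out


sentence = "the plot was great".split()
print(word_importance(sentence, toy_sentiment_score)[0][2])
print(replace_most_important(sentence, toy_sentiment_score, {"great": "excellent"}))
```

Used as an augmentation rather than an attack, the same ranking simply decides which words are worth perturbing.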

Another rule-based strategy available is Regular Expression Augmentation. Regular Expression filtering is one of the most common ways to clean data that has been scraped from the internet, as well as several other data sources such as Clinical Notes [ 36 ]. Regular Expressions describe matching patterns in text. They are usually used to clean data, but they can also be used to find common forms of language and generate extensions that align with a graph-structured grammar. For example, we can match patterns like “This object is adjective” and extend them with patterns such as “and adjective”. Another strategy is to re-order the syntax based on the grammar, such as rewriting “This object is adjective” as “An adjective object”.
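Both regular-expression strategies above can be sketched with Python's standard `re` module. The pattern and helper names below are illustrative assumptions; a practical system would need a much richer grammar than this single template.

```python
import re

# Matches sentences of the form "This <object> is <adjective>" (optional final period).
PATTERN = re.compile(r"^This (?P<obj>\w+) is (?P<adj>\w+)\.?$")


def extend_with_adjective(sentence, extra_adj):
    """Extend "This object is adjective" with "and adjective"."""
    m = PATTERN.match(sentence)
    if m is None:
        return None
    return f"This {m['obj']} is {m['adj']} and {extra_adj}."


def reorder(sentence):
    """Rewrite "This object is adjective" as "A/An adjective object"."""
    m = PATTERN.match(sentence)
    if m is None:
        return None
    adj = m["adj"]
    article = "An" if adj[0].lower() in "aeiou" else "A"
    return f"{article} {adj} {m['obj']}"


print(extend_with_adjective("This car is fast.", "red"))  # This car is fast and red.
print(reorder("This idea is elegant"))                    # An elegant idea
```

Sentences that do not match the template return `None`, so the rule leaves them untouched rather than producing malformed text.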

Min et al. [ 37 ] propose rules for augmentation based on syntactic heuristics. These include Inversion, which swaps the subject and object in sentences, and Passivization, where the hypothesis in premise-hypothesis NLI (Natural-Language Inference) pairs is converted to the passive version of the sentence. An example of Inversion is the change from “The lawyer saw the actor” to “The actor saw the lawyer”. An example of Passivization is changing from “This small collection contains 16 El Grecos” to “This small collection is contained by 16 El Grecos”. The authors show improvement applying these augmentations on the HANS challenge set for NLI [ 38 ].

Graph-structured augmentation

An interesting opportunity for text data augmentation is to construct graph-structured representations of text data. This includes relation and entity encodings in knowledge graphs, grammatical structures in syntax trees, or metadata grounding language data, such as citation networks. These augmentations add explicit structural information, a relatively new integration with Deep Learning architectures. The addition of structure can aid in finding label-preserving transformations, representation analysis, and adding prior knowledge to the dataset or application. We will begin our analysis of Graph-Structured Augmentation by unpacking the difference between structured versus unstructured representations.

Deep Learning operates by converting high-dimensional, and sometimes sparse, data into lower-dimensional, continuous vector embedding spaces. The learned vector space has corresponding metrics such as L2 or cosine similarity distance functions. This is a core distinction from topological spaces, in which distance between points is not defined. A topological space is a more general mathematical space with fewer constraints than Euclidean or metric spaces. Topological spaces encode information that is challenging to integrate in modern Deep Learning architectures. Rather than designing entirely new architectures, we can leverage the power of structured data through the Data Augmentation interface.

One of the most utilized structures in language processing is the Knowledge Graph [ 39 ]. A Knowledge Graph is composed of (entity, relation, entity) tuple relations. The motivation of the augmentation scheme is that paths along the graph provide information about entities and relations which are challenging to represent without structure. Under the scope of Rule-based Augmentation, we presented the idea of synonym swap. One strategy to implement synonym swap would be to use a Knowledge Graph with “is equivalent” relationships to find synonyms. This can be more practical than manually defining dictionaries with synonym entries. This is especially the case thanks to rapid acceleration in automated knowledge graph construction from unlabeled data. Knowledge Graphs often contain more fine-grained relations as well.
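As a minimal sketch of this idea, the snippet below stores a toy knowledge graph as (entity, relation, entity) tuples and uses only the “is equivalent” relation to find swap candidates. The triples, the relation name `is_equivalent`, and the function names are hypothetical and chosen for illustration.

```python
import random

# Toy knowledge graph as (head entity, relation, tail entity) tuples (hypothetical).
TRIPLES = [
    ("jogging", "is_equivalent", "running"),
    ("jogging", "is_a", "exercise"),
    ("swimming", "is_a", "exercise"),
    ("yelling", "is_a", "vocalization"),
]


def equivalents(entity, triples):
    """Collect entities linked to `entity` by an is_equivalent edge, in either direction."""
    found = set()
    for head, relation, tail in triples:
        if relation != "is_equivalent":
            continue
        if head == entity:
            found.add(tail)
        elif tail == entity:
            found.add(head)
    return sorted(found)


def kg_synonym_swap(words, triples, rng):
    """Swap each word that has an equivalent in the graph; leave other words unchanged."""
    out = []
    for w in words:
        options = equivalents(w, triples)
        out.append(rng.choice(options) if options else w)
    return out


print(kg_synonym_swap("I am jogging".split(), TRIPLES, random.Random(0)))
# ['I', 'am', 'running']
```

Because “swimming” and “yelling” are not connected to “jogging” by an equivalence edge, they are never proposed as replacements, although finer-grained relations such as `is_a` could be used for looser swaps.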

Previously, we mentioned how random synonym replacement would benefit enormously from the perspective of preserving the class label with better swaps. Improved swaps describe transitions such as “I am jogging” to “I am running” compared to “I am yelling”, or even “I am market”. Structured language in graph-form is a very useful tool to achieve this augmentation capability. These kinds of graphs have been heavily developed with notable examples such as WordNet [ 40 ], Penn Treebank [ 41 ], and the ImageNet class label structure [ 17 ]. Graphs such as WordNet describe words in relationship to one another through “synsets”.

Graphs are made up of nodes and edges. In WordNet, each node represents a word such as “tiger”. The genius of WordNet is the simplification of which edges to connect. In WordNet, the nodes are connected with the same edge type, a “synset” relationship. Synsets are loosely defined as words belonging to a similar semantic category. The word “tiger” would have a synset relation with nodes such as “lion” or “jaguar”. The word “tiger” may also have finer-grained synset relations with nodes that describe more particular types of tigers. WordNet is an example of a Graph-Structured Augmentation that builds on synonym replacement. WordNet describes a graph where each node is related to another node by being a “synset”.

We additionally consider graphs that contain finer-grained edge classifications; this kind of graph is frequently referred to as a Knowledge Graph [ 39 ]. As an example, CoV-KGE [ 42 ] contains 39 different types of edges relating biomedical concept nodes such as drugs or potential binding targets. Huang et al. [ 43 ] provide another interesting example of constructing a knowledge graph from the long context provided as input to abstractive summarization. This graph enables semantic swaps that preserve global consistency.

Another heavily studied area of adding structure to text data is known as syntactic parsing. Syntactic parsing describes different tasks that require structural analysis of text such as the construction of syntax or dependency trees. Recently, Glavas and Vulic [ 44 ] demonstrated that supervised syntactic parsing offered little to no benefit in the modern pre-train, then fine-tune pipeline with large language models.

The final use of structure for Text Data Augmentation we consider is to integrate metadata via structural information. For example, scientific literature mining has become a very popular application of NLP. These applications could benefit from the underlying citation network characterizing these papers, in addition to the text content of the papers themselves. Particularly, network structure has played an enormous role in biology and medicine. Li et al. [ 45 ] present many of these graphs in high-level application domains such as molecules, genomics, therapeutics, and healthcare. The integration of this structure with text data could be a key component to grounding text representations.

In the theme of our survey, we note that these auxiliary graphs may benefit from augmentation as well. Data Augmentation for explicitly graph-structured data is still in its early stages. Zhao et al. [ 46 ] propose an edge augmentation technique that “exposes GNNs to likely (but nonexistent) edges and limiting exposure to unlikely (but existent) ones” [ 46 ]. This graph augmentation leads to an average accuracy improvement of 5% across 6 popular node classification datasets. Kong et al. [ 47 ] further demonstrate the effectiveness of adversarially controlled node feature augmentation on graph classification.

In the section Practical Considerations for Implementation, we will present the use of consistency regularization and contrastive learning to further enforce the use of augmented data in training. Building on these ideas, we can use graph structures to assign nearest neighbors and regularize embeddings. Neural Structured Learning [ 48 ] describes constructing a graph connecting instances that share fine-grained class labels. This is used to penalize a misclassification of “golden retriever” less than one of “elephant” if the ground truth label is “labrador retriever”. Li et al. [ 49 ] similarly construct an embedding graph to enforce consistency between predictions of strongly and weakly augmented data.

MixUp augmentation

MixUp Augmentation describes forming new examples by meshing existing examples together, sometimes blending the labels as well. As an example, MixUp may take half of one text sequence and concatenate it with half of another sequence in the dataset to form a new example. MixUp may be one of the best interfaces available to connect distant points and illuminate a path of interpolation.

Most implementations of MixUp vary with respect to the layer in which samples are interpolated. Guo et al. [ 50 ] test MixUp at word and sentence levels. This difference is shown in Fig. 3. Their wordMixup technique combines existing samples by averaging embedding vectors at the input layer. The sentMixup approach combines existing samples by averaging sentence embeddings as each original sequence is passed through siamese encoders. Their experiments find a significant improvement in reducing overfitting compared to no regularization or using dropout.
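The two interpolation points can be sketched with plain Python lists standing in for embeddings. This is an illustrative sketch under our own assumptions: the mixing weight `lam` is a free parameter here (0.5 gives the plain average described above), the sequences are assumed to be padded to the same length, and the function names are not taken from Guo et al.

```python
def mix(a, b, lam):
    """Element-wise interpolation lam * a + (1 - lam) * b of two equal-length vectors."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def word_mixup(tokens_a, tokens_b, label_a, label_b, lam=0.5):
    """Mix two sequences of token embeddings position by position at the input layer."""
    if len(tokens_a) != len(tokens_b):
        raise ValueError("sequences must be padded to the same length")
    mixed_tokens = [mix(ta, tb, lam) for ta, tb in zip(tokens_a, tokens_b)]
    return mixed_tokens, mix(label_a, label_b, lam)


def sent_mixup(sent_a, sent_b, label_a, label_b, lam=0.5):
    """Mix two sentence embeddings produced by the (shared) encoder."""
    return mix(sent_a, sent_b, lam), mix(label_a, label_b, lam)


tokens, label = word_mixup(
    [[1.0, 0.0], [0.0, 2.0]],   # two token embeddings of sequence A
    [[3.0, 4.0], [2.0, 0.0]],   # two token embeddings of sequence B
    [1.0, 0.0],                 # one-hot label of A
    [0.0, 1.0],                 # one-hot label of B
)
print(tokens)  # [[2.0, 2.0], [1.0, 1.0]]
print(label)   # [0.5, 0.5]
```

The only difference between the two functions is where the interpolation happens: before the encoder for `word_mixup`, after it for `sent_mixup`.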

Fig. 3. Left, word-level mixup. Right, sentence-level mixup. The red outline highlights where augmentation occurs in the processing pipeline

Feature space augmentation

Feature Space Augmentation describes augmenting data in the intermediate representation space of Deep Neural Networks. Nearly all Deep Neural Networks follow a sequential processing structure where input data is progressively transformed into distributed representations and eventually, task-specific predictions. Feature Space Augmentations isolate intermediate features and apply noise to form new data instances. This noise could be sampled from standard uniform or Gaussian distributions, or it could be designed with adversarial controllers.

MODALS [ 51 ] presents a few strategies for feature space augmentations. Shown in Fig. 4, these strategies describe how to move along class boundaries to form new examples in the feature space. Hard example interpolation (a) forms a new example by moving it in the direction of existing embeddings that lie on the decision boundary for classification. Hard example extrapolation (b) describes moving existing examples along the same angle they currently lie from the mean vector of the class boundary. Gaussian noise (c) entails adding Gaussian noise in the feature space. Difference transform (d) moves an existing sample in the directional distance calculated from two separate points in the same class. As described as one of the general Motifs Of Data Augmentation, MODALS aims to strengthen decision boundaries. Research in Supervised Contrastive Learning [ 52 ], replacing the commonly used KL-divergence of logits and class labels with contrastive losses such as NCE with positives and negatives formed based on class labels, has been shown to improve these boundaries. It could be useful to explore how this benefits the MODALS algorithm.
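The four directions (a) to (d) can be written as simple vector operations. The sketch below follows the verbal descriptions above and is not the MODALS code: the step size `lam`, the noise scale `sigma`, and the argument names `z_hard` and `class_mean` are assumptions made for illustration, and finding hard examples or class means is left to the caller.

```python
import random


def interpolate(z, z_hard, lam):
    """(a) Move z toward a hard example that lies near the decision boundary."""
    return [a + lam * (h - a) for a, h in zip(z, z_hard)]


def extrapolate(z, class_mean, lam):
    """(b) Move z further along its current direction away from the class mean."""
    return [a + lam * (a - m) for a, m in zip(z, class_mean)]


def gaussian_noise(z, sigma, rng):
    """(c) Add zero-mean Gaussian noise to every feature dimension."""
    return [a + rng.gauss(0.0, sigma) for a in z]


def difference_transform(z, z_i, z_j, lam):
    """(d) Shift z by the difference between two other samples of the same class."""
    return [a + lam * (i - j) for a, i, j in zip(z, z_i, z_j)]


rng = random.Random(0)
z = [1.0, 1.0]
print(interpolate(z, [3.0, 5.0], 0.5))                       # [2.0, 3.0]
print(extrapolate(z, [0.0, 0.0], 0.5))                       # [1.5, 1.5]
print(difference_transform(z, [3.0, 0.0], [1.0, 0.0], 0.5))  # [2.0, 1.0]
print(len(gaussian_noise(z, 0.1, rng)))                      # 2
```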

Fig. 4. Directions for feature space augmentation explored in MODALS

We also consider Differentiable Data Augmentation [ 53 ] techniques to fall under the umbrella of Feature Space Augmentation. Data Augmentation is a function f(x) that produces augmented examples x’. Similar to any other layers in the network, we can treat the beginning of the network as an augmentation module and backpropagate gradients through it. We can also separate the augmentation function and add it to the inputs such that the transformation is not too dramatic, akin to adding an optimized noise map to the input. Minderer et al. [ 54 ] use this technique to facilitate self-supervised pretext tasks.

Neural augmentation

The following augmentations rely on auxiliary neural networks to generate new training data. This entails using a model trained on supervised Neural Machine Translation datasets to translate from one language to another and back to sample new instances, or a model trained on generative language modeling to replace masked out tokens or sentences to produce new data. We additionally discuss the use of neural style transfer in NLP to translate from one writing style to another or one semantic characteristic such as formal to casual writing.

Back-translation augmentation

Back-translation describes translating text from one language to another and then back from the translation to the original language. An example could be taking 1,000 IMDB movie reviews in English and translating them to French and back, Chinese and back, or Arabic and back. There has been an enormous interest in machine translation. This has resulted in the curation of large labeled datasets of parallel sentences. We can also imagine the use of other text datasets such as translations between programming languages or writing styles as we describe in more detail under Style Augmentation.
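The round trip itself is a two-step function composition. In the sketch below, the word-level dictionaries `EN_TO_FR` and `FR_TO_EN` are hypothetical toy stand-ins for real neural translation models, which would be plugged in as the `forward` and `backward` callables.

```python
# Hypothetical toy "translation models": word-level lookup tables.
EN_TO_FR = {"the": "le", "film": "film", "movie": "film",
            "was": "était", "great": "super", "superb": "super"}
FR_TO_EN = {"le": "the", "film": "film", "était": "was", "super": "great"}


def translate(text, table):
    """Translate word by word, leaving unknown words unchanged."""
    return " ".join(table.get(token, token) for token in text.split())


def back_translate(text, forward, backward):
    """Translate to the pivot language and back to the source language."""
    return backward(forward(text))


def augment(texts, forward, backward):
    """Return only those round-trip outputs that differ from their source text."""
    paraphrases = []
    for text in texts:
        candidate = back_translate(text, forward, backward)
        if candidate != text:
            paraphrases.append(candidate)
    return paraphrases


def to_french(text):
    return translate(text, EN_TO_FR)


def to_english(text):
    return translate(text, FR_TO_EN)


print(back_translate("the movie was superb", to_french, to_english))
# the film was great
```

The paraphrase arises because several source words map to the same pivot word, which is a crude analogue of the semantic invariances a real translation model encodes.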

Back-translation leverages the semantic invariances encoded in supervised translation datasets to produce semantic invariances for the sake of augmentation. Interestingly, back-translation is also used to train unsupervised translation models by enforcing consistency on the back-translations. This form of back-translation is also heavily used to train machine translation models with a large set of monolingual data and a limited set of paired translation data. Outside of translation, we could imagine structuring similar domain pairings, such as scientific papers and news articles, or college-level and high-school-level reading, and so on.

An interesting design question with this may be to weigh the importance of using a high performance machine translation model for the back-translation. However, as stated by Pham et al., the lesson has been “better translation quality of the pseudo-parallel data does not necessarily lead to a better final translation model, while lower-quality but diverse data often yields stronger results instead” [ 55 ]. The curation of paired languages and domains could also impact the final performance. Exploring back-translation augmentation for question answering, Longpre et al. discuss “curating our input data and learning regime to encourage representations that are not biased by any one domain or distribution” [ 56 ].

Style augmentation

Finally, we present another augmentation strategy utilizing Deep Networks to augment data for the training of other Deep Nets. In our previous survey of Image Data Augmentation, we explored works that use Neural Style Transfer for augmentation. Artistic style transfers, such as a Picasso-themed dog image, may be useful as an OOD augmentation in a Negative Data Augmentation framework, which we will present later. However, we are more interested in styles within the dataset. This is an interesting strategy to prevent overfitting to high-frequency features, or to blur out the form of language so that the model focuses on meaning. In the text data domain, this could describe transferring the writing style of one author to another for applications such as abstractive summarization or context for extractive question answering.

Data Augmentation is often deployed to focus models on semantics, rather than particular forms of language. These particular forms could emerge from one author’s writing style or general tonality in the language such as an optimistic versus a pessimistic writer. Style transfer offers an interesting window to extract semantic similarities between writing styles. This could help with modeling contexts in question answering systems or documents for information retrieval.

Generative data augmentation

Generative Data Augmentation is one of the most exciting emerging ideas in Deep Learning. This includes generating photorealistic facial images [ 57 ] or indistinguishable text passages [ 14 ]. These models have been very useful for Transfer Learning, but the question remains: What is the killer application of the generative task? These generations are certainly interesting for artistic applications, but more important is their use for representation learning and Data Augmentation.

We note a core distinction in the use of generative models for Data Augmentation. A popular use is to take a pre-trained language model off the shelf and optionally fine-tune it further with the language modeling task. This is the standard operating procedure of Transfer Learning. However, the fine-tuning is usually done with the Supervised Learning task, rather than additional language modeling. The pre-trained language models have learned many interesting properties of language because they are trained on massive datasets. An interesting example that is publicly available is The Pile [ 58 ]. The Pile is 800GB of text data spanning Wikipedia, comment forums, entire books, and many more examples of data like this. Even though these models and datasets are very impressive, additional benefits will likely be achieved by domain-tuning with additional language modeling on the limited dataset.

Language modeling is a very useful pre-training stage and we often have more data for language modeling than a downstream task like question-answering. Whereas we may only have 100 question-answer pairs, the question, answer, and surrounding context could easily contain 300 words each, accounting for a total of 3,000 words for constructing language modeling examples. A dataset size of 3,000 compared to 100 can make a large difference in success with Deep Learning and is the prime reason for our interest in Data Augmentation to begin with. Gururangan et al. [ 59 ] present an argument for this use of language models since downstream performance is dramatically improved when pre-training on a relevant dataset. This distinction of “relevant dataset” is in contrasting reference to what is used to train models like GPT-3 [ 14 ].

One of the most popular strategies for training a language model for Generative Data Augmentation is Conditional BERT (C-BERT) [ 60 ]. C-BERT augments data by replacing masked out tokens of the original instance. The key novelty is that it takes an embedding of the class label as input, such as to preserve the semantic label when replacing masked out tokens. This targets the label-preserving property of Data Augmentation. The C-BERT training strategy can be used when fine-tuning a model pre-trained on another dataset or starting from a random initialization.

An emerging strategy to adapt pre-trained generative models to downstream tasks is to re-purpose the interface of masking out tokens. This is known as prompting. The output of language models can be guided with text templates for the sake of generating or labeling new data. Testing the efficacy of prompting with respect to the objective of learning from limited data, Scao and Rush [ 61 ] show that prompting is often worth 100s of data points on SuperGLUE classification tasks [ 32 ]. This is in direct comparison with the more heavily studied paradigm of Transfer Learning, head-based fine-tuning. We will present a few variants on implementing prompts, this includes in-context learning, pattern-exploiting training, and prompt tuning.

The first implementation of prompting we consider is in-context learning. In-context learning became well known when demonstrated with GPT-3. The idea is to prepend each input with a fixed task description and a collection of examples of the task. This does not require any further gradient updates of the model. Brown et al. [ 14 ] show that scale is crucial to making this work, reporting significant performance drops from 175B parameters to 13B and less. This technique has likely not yet hit its ceiling, especially with the development of transformer models that can take in sequences longer than 512 tokens as inputs. Similar to excitement about retrieval-augmented modeling, this will allow in-context learning models to process more demonstrations of the task. However, due to limitations of scale, methods that continue with gradient updates are more practically useful.
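Constructing an in-context prompt amounts to string assembly: a fixed task description, a handful of demonstrations, and the new input left open for the model to complete. The field names “Review” and “Sentiment” and the function `build_prompt` are illustrative assumptions, not a format prescribed by Brown et al.

```python
def build_prompt(task_description, demonstrations, query):
    """Prepend a task description and labeled demonstrations to a new input."""
    lines = [task_description, ""]
    for text, label in demonstrations:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final example is left unlabeled; the language model is expected
    # to continue the text with the label.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


prompt = build_prompt(
    "Classify the sentiment of each review.",
    [("Loved it.", "positive"), ("Dull and slow.", "negative")],
    "A fine film.",
)
print(prompt)
```

Since the number of demonstrations is bounded by the context length of the model, longer input windows directly translate into room for more examples.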

The next implementation of prompting we will present is prompt tuning. Prompt tuning describes first embedding the prompt into a continuous space, and then optimizing the embedding with gradient descent while keeping the rest of the network frozen. Similarly to GPT-3, Lester et al. [ 62 ] show that scale improves performance with prompt tuning and that prompt tuning significantly outperforms the in-context learning results reported by Brown et al. [ 14 ]. Performance can be further improved by ensembling optimized prompts and running inference as a single batch of the input and the appended prompts. On SuperGLUE, the ensemble of tuned prompts reaches 90.5, compared with 88.5 for the average prompt and 89.8 for the best-performing individual prompt. The authors further highlight that analysis of the optimized prompt embedding can aid in task complexity and similarity metrics, as well as Meta-Learning. Prompt tuning shares the same underlying concept of prepending context to the input of downstream tasks to facilitate fine-tuning; however, this technique is more in line with research on Transfer Learning with minimal modifications. For example, adapter layers [ 63 ] aim to introduce a small number of parameters to fine-tune a pre-trained Transformer.

An emerging theme in the pre-train then fine-tune paradigm has been that domain and task alignment tends to improve fine-tuned performance. Gururangan et al. [ 59 ] demonstrate the effectiveness of data domain alignment and Zhang et al. [ 64 ] demonstrate the effectiveness of task alignment in the proposed PEGASUS algorithm. In correspondence with the lesson of alignment, Zhong et al. [ 65 ] tune language models to be better fitted to answer prompts. This is done by manually annotating 441 questions across 43 existing datasets that map every task to a “Yes” or “No” answer. Measured by AUC-ROC plots, the authors show that further fine-tuning on prompt specialization improves these models and that this also benefits from scale. The authors call for the organization of NLP datasets into unified formats that better aid in fine-tuning models for answering prompts.

Pattern exploiting training (PET) [ 66 ] uses the pre-trained language model to label task-specific unlabeled data. This is done with manually-defined templates that convert the supervised learning task into a language modeling task. The outputs of the language model are then mapped to supervised learning labels with a verbalizer. Gradient-descent optimization is applied to the verbalized outputs to fine-tune the model with the same cross-entropy loss function used to train classifiers. Schick and Schütze [ 67 ] demonstrated that the PET technique enables much smaller models to surpass GPT-3 with 32 labeled examples from SuperGLUE. Tam et al. [ 68 ] further developed the algorithm into ADAPET. ADAPET utilizes dense supervision in the labeling task, applying the loss to the entire vocabulary distribution without a verbalizer and additionally requiring the model to predict the masked tokens in the context given the label, similarly to conditional-BERT. ADAPET outperforms PET without the use of task-specific unlabeled data.
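The pattern and verbalizer steps of PET can be sketched without a real language model. In the snippet below, `toy_masked_lm` is a hypothetical stand-in that returns a probability for each candidate token at the `[MASK]` position; the pattern text and the verbalizer tokens are likewise invented for illustration. Only the labeling step is shown, not the subsequent fine-tuning.

```python
def pattern(text):
    """Convert a classification input into a cloze-style language modeling input."""
    return f"{text} It was [MASK]."


# Verbalizer: maps each task label to a single vocabulary token (hypothetical choice).
VERBALIZER = {"positive": "great", "negative": "terrible"}


def toy_masked_lm(cloze):
    """Hypothetical masked LM: returns token probabilities for the [MASK] slot."""
    if "loved" in cloze:
        return {"great": 0.7, "terrible": 0.1, "long": 0.2}
    return {"great": 0.1, "terrible": 0.6, "long": 0.3}


def soft_label(text, masked_lm):
    """Restrict the LM output to verbalizer tokens and renormalize into label probabilities."""
    token_probs = masked_lm(pattern(text))
    scores = {label: token_probs.get(token, 0.0) for label, token in VERBALIZER.items()}
    total = sum(scores.values())
    if total == 0.0:
        # No verbalizer token received any probability mass: fall back to uniform.
        return {label: 1.0 / len(scores) for label in scores}
    return {label: score / total for label, score in scores.items()}


print(soft_label("I loved it.", toy_masked_lm))
print(soft_label("I hated it.", toy_masked_lm))
```

The resulting soft labels on unlabeled text are what a downstream classifier would then be trained on.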

A limitation to pattern-exploiting training, in-context learning, and prompt tuning, is that they require retaining a large language model for downstream tasks. Most applications are interested in compressing these models for the sake of efficiency. Under the scope of Label Augmentation, we will present the use of knowledge distillation. For now, we consider compression by generating data to train a smaller model with. This approach is most similar to pattern-exploiting training, except that rather than use the pre-trained language model to label data, we will instead use it to generate entire examples.

Drawing inspiration from the success of MixUp, which was presented in further detail in MixUp Augmentation, Yoo et al. developed GPT3Mix [ 69 ]. The input to GPT3Mix begins with a Task Specification that defines the task such as, “Text Type T = movie review, Label Type L = sentiment”. Akin to MixUp, the next inputs are examples of the task formulated as “text type: example text k (label type: example label k)”, such as “Example 1: The cat is running my mat. (negative)”. The final piece of the input is the template to generate new examples. Further, the generated example is “soft-labeled” by the generating probabilities of each token in the process of generating the new example. GPT3Mix achieves massive performance improvements over no augmentation, Easy Data Augmentation, and BackTranslation when subsetting available data to extreme levels such as 0.1% and 0.3%.

Schick and Schütze [ 26 ] also explore the strategy of generating data from language models, presenting Datasets from Instructions (DINO). DINO uses a task description and one example from the dataset to generate pairwise classification datasets. Interestingly, they contrast task descriptions which entail the resulting label to decode language model generation. For example, the task description could begin with “Write two sentences that” and continue with either “mean the same thing” or “are on completely different topics”. The generation accounts for the token another label description would generate. Evaluated on the STS text similarity dataset, representations learned from DINO show improvements over state-of-the-art sentence embedding techniques trained with supervised learning, such as Universal Sentence Encoders [ 70 ] and Siamese BERT and RoBERTa models [ 71 ].

While built on the same underlying concept, discrete versus continuous prompt search diverge heavily from one another. Discrete prompt search has the benefit of interpretability. For example, comparing different task descriptions and examples provided by a human annotator offers insights into what the model has learned. However, prompt optimization in the continuous embedding space fully automates the search. Continuous prompt optimization is likely more susceptible to overfitting due to the freedom of the optimization space.

Another somewhat similar theme to prompting in NLP has been to augment knowledge-enhanced text generation with retrieval. Popular models include Retrieval-Augmented Generation (RAG) [ 72 ], and Retrieval-Augmented Language Model Pre-training (REALM) [ 73 ]. Shuster et al. [ 74 ] show how retrieving information to prepend to the input reduces the problem of hallucination in text generation. Once this retrieved information is embedded into the continuous representation space of language models, it is a similar optimization problem as prompt tuning.

Another interesting idea is the intersection of Data Privacy and Generative Data Augmentation. Can we store data in the parameters of models instead of centralized databases? The idea of Federated Learning [ 75 ] is to send copies of the global model weights to a local database such as to avoid a centralized database. Which models should we send to local databases? Classifiers or generative models? If we send a generative model, we have the potential to cover more of the data distribution and learn more about general data manifolds such as the use of language more broadly, however, we risk exposing more critical information [ 76 ].

Label augmentation

Supervised Learning describes fitting an input, x, to a label, y. Throughout this survey, we have presented strategies for regularizing the x values. In this section, we explore research looking to augment the y class labels. The most successful example of this is Knowledge Distillation [ 77 ]. Knowledge Distillation describes transforming the traditional one-hot encoded y labels into a soft distribution by re-labeling xs with the logits of another neural network’s prediction. This has been very influential in compression such as DistilBERT [ 78 ], information retrieval [ 79 ], and achieving state-of-the-art classification results in Computer Vision [ 80 ].
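As a minimal sketch of this re-labeling, assuming a temperature-scaled softmax over the teacher's logits and a cross-entropy objective for the student (the function names, logits, and temperature values below are illustrative and not taken from [ 77 ]):

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into a probability distribution; higher temperature gives softer labels."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft labels and the student's predictions."""
    soft_labels = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(soft_labels, student_probs))

teacher_logits = [4.0, 1.0, 0.5]   # hypothetical teacher output for one input x
one_hot = [1.0, 0.0, 0.0]          # the traditional hard label for the same x
sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=2.0)  # softer distribution used as the new label
```

Unlike the one-hot vector, the soft label keeps non-zero mass on the other classes, which is the extra signal the student is fit to.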

In addition to Knowledge Distillation, several other strategies have been developed to augment the label space. Label smoothing uses a heuristic adjustment to the density on negative classes and has been highly influential for training classifiers [ 81 ] and generative adversarial networks [ 82 ]. Another exciting approach is the use of a meta-controller, similar to knowledge distillation, but massively different in that the Teacher is learning from the gradients of the Student’s loss to update the label augmentation. Notable examples exploring this include Meta Pseudo Labels [ 83 ] and Teaching with Commentaries [ 84 ]. This ambitious idea of learning to augment data through outer-inner loop gradients has also been explored in the data space, x, with Generative Teaching Networks [ 12 ]. As of the time of this writing, Generative Teaching Networks have only been applied to image data. A similar idea is “Meta Back-Translation” [ 55 ]. In this work, the authors “propose a meta-learning framework where the back-translation model learns to match the forward translation model’s gradients on the development data with those on the pseudo-parallel data.”
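The label smoothing adjustment mentioned above can be sketched as follows. This assumes the common uniform variant, in which a fraction epsilon of the probability mass is spread evenly over all classes; the function name and the value of epsilon are illustrative.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Move `epsilon` of the probability mass from the true class to a uniform distribution."""
    num_classes = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / num_classes for y in one_hot]

# A four-class example: the true class keeps most of the mass,
# and every negative class receives epsilon / num_classes.
smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0], epsilon=0.1)
```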

Thakur et al. [ 85 ] present the Augmented SBERT to augment data labels for distillation. The authors note that the cross-encoder, although much slower and less efficient than bi-encoders, tends to reach higher accuracy on pairwise classification tasks such as ranking or duplicate question detection. The paper proposes to label data with the cross-encoder and fit these augmented labels with the bi-encoder. Also worth mentioning is that the cross-encoder heavily outperforms the bi-encoder with less training data. Thakur et al. find a significant benefit in strategically selecting data to soft label with the cross-encoder. We have found this idea throughout experiments in Data Augmentation, discussing it further in our Discussion section under Curriculum Learning.
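The labeling step can be illustrated with a small sketch. The scorer below is a hypothetical stand-in (word-set overlap) for a trained cross-encoder, and the function names are ours; in practice the resulting soft-labeled pairs would be used to train the bi-encoder.

```python
def toy_cross_encoder(sentence_a, sentence_b):
    """Hypothetical stand-in for a trained cross-encoder: Jaccard overlap of word sets."""
    a = set(sentence_a.lower().split())
    b = set(sentence_b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def soft_label_pairs(pairs, scorer):
    """Attach a soft similarity label from the scorer to every unlabeled sentence pair."""
    return [(a, b, scorer(a, b)) for a, b in pairs]

unlabeled_pairs = [
    ("how do I reset my password", "how do I reset my password"),
    ("how do I reset my password", "best pizza in town"),
]
# Each triple (sentence_a, sentence_b, score) becomes a training example for the bi-encoder.
augmented_labels = soft_label_pairs(unlabeled_pairs, toy_cross_encoder)
```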

Testing generalization with data augmentation

The holy grail of Machine Learning is to achieve out-of-distribution (OOD) generalization. This is distinct from in-distribution generalization where the training and test sets are sampled from the same data distribution. In order to measure OOD generalization, we need to make assumptions about how the distribution will shift. As Arjovsky writes, “if the test data is arbitrary or unrelated to the training data, then generalization is obviously futile” [ 86 ]. Chollet further describes the relationship between system-centric and developer-aware generalization, as well as levels of generalization such as absent, local, broad, and extreme [ 87 ]. We argue that Data Augmentation is the natural interface to quantify the relationship between test and train data distributions and levels of generalization.

A classic tool to test for generalization is to simply report the difference in accuracy between the training and test sets. However, as shown in papers such as Deep Double Descent [ 88 ], the phenomenon of overfitting is generally poorly understood with large-scale Deep Neural Networks. We believe it is more practical to study overfitting and generalization in the data space. For example, the success of adversarial examples shows that Deep Neural Networks cannot generalize to distributions added with adversarially optimized noise maps. Jia and Liang [ 89 ] show that models trained on SQuAD cannot generalize when adversarially optimized sentences are added to the context; an example of this is shown in Fig. 5. In addition to adversarial attacks, many other datasets show intuitive examples of distribution shifts where Deep Neural Networks fail to generalize.

Fig. 5 Fooled by injected text. Image taken from Jia and Liang [ 89 ]

We present Data Augmentation as a black-box test for generalization. CheckList [ 90 ] proposes a foundational idea for these kinds of tests in NLP. CheckList is designed to test the linguistic capabilities of models such as robustness to negation, vocabulary perturbations, or temporal consistency. We view this as introducing a distribution shift of linguistic phenomena in the test set. Clark et al. [ 91 ] construct a toy example for transformers to see how far they can generalize fact chaining. In this test, the training data requires the model to chain together more or fewer facts than are tested in the test set. Again, the distribution shift is controlled through an intuitive interface akin to Data Augmentation. Finally, WILDS [ 92 ] is a collection of real-world distribution shifts. These real-world shifts can also be mapped to Data Augmentations.
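A black-box test of this kind can be sketched in a few lines. The keyword classifier and the negation perturbation below are hypothetical stand-ins for illustration only; they do not use the CheckList library or its API.

```python
def toy_sentiment_model(text):
    """Hypothetical keyword classifier standing in for a trained model."""
    return "positive" if "good" in text.lower().split() else "negative"

def add_negation(text):
    """Perturbation: negate the first occurrence of ' is '."""
    return text.replace(" is ", " is not ", 1)

def failure_rate(model, examples, perturb, expect_change):
    """Fraction of examples whose prediction did not respond to the perturbation as expected."""
    failures = 0
    for text in examples:
        changed = model(text) != model(perturb(text))
        if changed != expect_change:
            failures += 1
    return failures / len(examples)

examples = ["The food is good", "The plot is good"]
# Negation should flip the sentiment, so we expect the prediction to change.
rate = failure_rate(toy_sentiment_model, examples, add_negation, expect_change=True)
```

Here the keyword model ignores the inserted negation, so every example counts as a failure, which is exactly the kind of gap such a test is meant to expose.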

Kaushik et al. [ 24 ] describe employing human labelers to construct a set of counterfactual movie reviews and natural language inference examples. The authors construct an elegant annotation interface and task Mechanical Turk workers to minimally edit examples such as to switch the label. For example, converting “The world of Atlantis, hidden beneath the earth’s core, is fantastic” to “The world of Atlantis, hidden beneath the earth’s core is supposed to be fantastic”. For movie reviews, the authors group the workers’ revisions into categories such as recasting fact as hoped for, suggesting sarcasm, inserting modifiers, inserting phrases, diminishing value qualifiers, differing perspectives, and changing ratings. For natural language inference, the authors group the workers’ revisions into categories such as modifying/removing actions, substituting entities, adding details to entities, inserting relationships, numerical modifications, using/removing negation, and unrelated hypothesis. These examples are constructed for testing generalization to these counterfactual examples.

Returning to our description of Generative Data Augmentation, are generative models capable of making these edits? If GPT-3 were given an IMDB review with the task prompt of “change this movie review from positive to negative”, it could probably manage it. We leave it to future work to investigate the generalization shifts induced by human-designed counterfactuals and generative models. To further motivate this study, the authors note that their dataset construction came with a hefty price tag of $10,778.14. Inference costs of generative models are unlikely to approach this cost, unless working with extremely large models. A categorization of the changes similar to the one Kaushik et al. [ 24 ] use could help us understand the linguistic phenomena underlying this kind of generalization test.

Generative Data Augmentation provides another lens to study generalization. Nakkiran et al. propose a novel way of studying generalization in “The Deep Bootstrap Framework” [ 93 ]. The idea is to compare the Online test error to the Bootstrap test error. The Online error describes the performance of a model trained on an infinite data stream, i.e. without repeating samples. The Bootstrap test error describes the common training setup in Deep Learning, repeating batches of the same data. The authors simulate the Online learning scenario by fitting a generative model, in this particular case a denoising diffusion probabilistic model [ 94 ]. The generative model is used to sample 6 million examples, compared to the standard 50,000 samples used to train on CIFAR-10. Garg et al. [ 95 ] additionally propose RATT, a technique that analyzes learning curves and generalization when unlabeled data with randomly assigned labels is added to the training batch. The augmentations described in this survey may be able to simulate this unlabeled data and provide similar insights.

To conclude, when is overfitting problematic? How much of a data distribution are modern neural networks capable of covering? Deep Neural Networks have a remarkable ability to interpolate within the training data distribution. A potential solution could be to leverage Data Augmentation to expand the training distribution such that there are no reasonable out-of-distribution shifts in the test sets. Even if all the potential distributions cannot be compressed into a single neural network, this interface can illuminate where the model will fail.

Image versus text augmentation

Our survey on Text Data Augmentation for Deep Learning is intended to follow a similar format as our prior work on Image Data Augmentation for Deep Learning [ 6 ]. We note there are many similarities between the Easy Data Augmentations and basic geometric and color space transformations used in Computer Vision. Most similarly, both are easy to implement and complement nearly any problem working with text or image data respectively. We have described how Easy Data Augmentation can easily interface with text classification, pairwise classification, extractive question answering, abstractive summarization, and chatbots, to name a few. Similarly, geometric and color space transformations in Computer Vision are used in image classification, object detection, semantic segmentation, and image generation.

As described in the beginning of our survey, Data Augmentation biases the model towards certain semantic invariances. Image Data Augmentation has largely been successful because it is easy to think of semantic invariances relevant to vision. These include semantic invariance to horizontal flips, rotations, and increased brightness, to name a few. Comparatively, it is much harder to define transformations to text data that are guaranteed to be semantically invariant. All of the augmentations described in Easy Data Augmentation have the potential to perturb the original data such that it changes the ground truth label, y.

Another interesting trend is the integration of vision and language in recent models such as CLIP and DALL-E. For the sake of Data Augmentation, a notable example is Vokenization from Tan and Bansal [ 96 ]. The authors align tokens such as “humans” with images of “humans” and so on, even for verbs such as “speaking”. The masked language modeling task then uses the visual tokens as additional supervision for predicting masked out tokens. There is some noise in this alignment such as finding a visual token for words such as “by” or “the”. Tan and Bansal report visual grounding ratios for tokens of 54.8%, 57.6%, and 41.7% on curated vision-language datasets compared to 26.6%, 27.7%, and 28.3% for solely language corpora. Across the SST-2, QNLI, QQP, MNLI, SQuAD v1.1 and v2.0, and SWAG benchmark tasks, Vokenization improves BERT-Large from 79.4 to 82.1 and RoBERTa-Large from 77.6 to 80.6. There are many interesting vision-language datasets labeled for tasks such as visual question answering, image captioning, and text-image retrieval, to name a few. Vision-language Data Augmentation schemes such as Vokenization look to be a very promising area of research.

A recent trend in Image Data Augmentation has been its integration in the training of generative models, namely generative adversarial networks (GANs) [ 97 ]. The GAN framework, similar to the ELECTRA model [ 98 ], consists of a generator and a discriminator. The generator transforms random noise into images and the discriminator classifies these images as either coming from the generator or the provided training set. Below, we describe why this does not work as well as autoregressive modeling for text. Returning to how Data Augmentation has been used for GANs, this investigation began with Zhang et al.’s work on consistency regularization [ 99 ]. Consistency regularization requires the discriminator to make the same classification on a real image and an augmented view of that same image. Unfortunately, this led to the augmentations being “leaked” into the generated distribution such that the generator produces augmented data as well.

We will end this discussion by presenting some ideas from LeCun and Misra [ 100 ] on the key distinction in generative modeling between images and text. The key issue stated in the article is handling uncertainty. As an example, take the masked token completion task: “The mask chases the mask in the savanna”. LeCun and Misra point out that the language model can easily “associate a score or a probability to all words in the vocabulary: high score for ‘lion’, ‘cheetah’, and a few other predators, and low scores for all other words in the vocabulary” [ 100 ]. In comparison, applying this kind of density on candidate images is highly intractable. The missing token can only be 1 of a typical 30,000 tokens, whereas a missing 8x8 RGB patch can take on a ridiculously large number of values, 256^(8x8x3) if each channel is stored with 8 bits. Therefore, image models need to rely on energy-based models that learn joint embedding spaces and assign similarity scores, rather than exactly modeling the probability of each missing patch. Perhaps the GAN framework, or something similar, will take over in NLP once generative modeling expands its scope to sentence-level or paragraph-level generation, such as the pre-training task used for abstractive summarization in PEGASUS [ 64 ].
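To make the scale of this gap concrete, the short calculation below compares the number of decimal digits needed to write down each count. It assumes 8-bit channels (256 intensity levels per value), which is our assumption about the image encoding rather than a figure from [ 100 ].

```python
import math

vocab_size = 30_000          # typical vocabulary size quoted above
levels_per_channel = 256     # assumption: 8-bit color channels
patch_dims = 8 * 8 * 3       # values in one 8x8 RGB patch

# Work in log10 to compare magnitudes without printing a several-hundred-digit integer.
token_magnitude = math.log10(vocab_size)
patch_magnitude = patch_dims * math.log10(levels_per_channel)
```

Under this assumption the token vocabulary is a 5-digit number, while the count of possible patches has more than 460 digits, which is why an explicit density over candidate patches is impractical.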

Another interesting success of Data Augmentation has been its application in Reinforcement Learning. This has been heavily studied with Robotic Control from Visual Inputs and the Atari benchmark. One of the biggest bottlenecks with robotic learning, and most deep reinforcement learning problems, is a lack of data. It is challenging to restart a robot laundry folder back to the beginning of the unfolded shirt and collect millions of trajectories. To solve this problem, researchers have turned to forming augmented trajectories from collections in a replay buffer. Amongst many applications of reinforcement learning with Text data that have been proposed, patient care control is particularly exciting. Ji et al. [ 101 ] explore the use of model-based reinforcement learning for patient care of septic patients using the MIMIC-III dataset [ 102 ]. The authors use clinical notes to sanity check the model-based rollouts of physiological patient state markers. A promising area of research will be to apply Text Data Augmentation to collected clinical note trajectories to improve patient care and trajectory simulation.

Practical considerations for implementation

This section presents many details of implementing Text Data Augmentation that make a large performance difference in terms of evaluation metrics and training efficiency.

Consistency regularization

Consistency regularization is a strong complement to the priors introduced via Data Augmentation. A consistency loss requires a model to minimize the distance in representations of an instance and the augmented example derived from it. In line with the motif of strengthening decision boundaries, consistency regularization enforces a connection between original and augmented samples. This is usually implemented in a multi-task learning framework where a model simultaneously optimizes the downstream task and a secondary consistency term.
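The multi-task objective can be sketched as a weighted sum of the two terms. This sketch assumes the consistency term is a KL divergence between the predicted class distributions for the original and augmented inputs; other distances over representations are equally valid, and the weight and probabilities below are illustrative.

```python
import math

def cross_entropy(probs, label):
    """Downstream task loss for a single labeled example."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q): how far the augmented prediction q drifts from the original prediction p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def multitask_loss(probs_original, probs_augmented, label, weight=1.0):
    """Task loss on the original example plus a weighted consistency term."""
    task = cross_entropy(probs_original, label)
    consistency = kl_divergence(probs_original, probs_augmented)
    return task + weight * consistency

# Identical predictions incur no consistency penalty; diverging predictions do.
loss_consistent = multitask_loss([0.8, 0.2], [0.8, 0.2], label=0)
loss_inconsistent = multitask_loss([0.8, 0.2], [0.4, 0.6], label=0)
```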

Consistency regularization has been successfully applied to translate between programming languages by enforcing consistency on back-translations [ 103 ]. Alberti et al. [ 104 ] use a slightly different form of consistency regularization to generate synthetic question-answer pairs. Rather than minimizing the distance between representations of original and augmented examples, the framework requires that the model outputs the exact same answer when predicting from (context, question) inputs as when a separate model generates the question from (context, answer) inputs. The original BERT-Large model achieves an F1 score of 83.1 when fine-tuned on SQuAD 2.0. Fine-tuning BERT with an additional 7 million questions generated with the consistency condition improves performance to 84.8.

Consistency regularization is a common technique for self-supervised representation learning because unlabeled data should still have this property of consistent representations before and after augmentation. Xie et al. [ 105 ] deploy consistency regularization as shown in Fig. 6. This technique surpasses the previous state of the art trained solely with supervised learning while using significantly less data. These improvements continue even in the extreme case of only 20 labeled examples. As an example of the performance gain, the fine-tuned BERT model achieves a 6.5% error rate on IMDB review classification, which is reduced to 4.2% with UDA. The multi-task loss formulation is also fairly common in consistency regularization implementations.

Fig. 6 Unsupervised data augmentation schema. Image taken from Xie et al. [ 105 ]

Contrastive learning

Contrastive learning differs from consistency regularization by utilizing negative samples to normalize the loss function. This is a critical distinction because the negative samples can provide a significant learning signal. We believe that the development of Text Data Augmentation can benefit from adapting successful examples in Computer Vision. The use of Data Augmentation to power contrastive self-supervised learning has been one of the most interesting stories in Computer Vision. This involves frameworks such as SimCLR [ 106 ], MoCo [ 107 ], SwAV [ 108 ], and BYOL [ 109 ], to name a few. This training strategy should be well suited for information retrieval in NLP.

Krishna et al. [ 110 ] propose contrastive REALM (c-REALM). The contrastive loss is used to align the embedding of the question and supervised answer, and contrast the question with other supervised answers from the mini-batch. However, this technique of contrastive learning is more akin to supervised contrastive learning [ 52 ], than frameworks such as SimCLR. In SimCLR, Data Augmentation is used to form the positive pairs. This strategy has not been heavily explored in information retrieval, likely due to the lack of augmentations. Hopefully, the list we have provided will help those interested pursue this idea.

Gunel et al. [ 111 ] demonstrate significant improvements on GLUE benchmark tasks by training with a supervised contrastive loss in addition to cross-entropy loss on one-hot encoded label vectors. The gain is especially pronounced when learning from 20 labeled examples, while they do not report much of a difference at 1,000 labeled examples. In addition to quantitative metrics, the authors highlight that the embeddings of classes are much more spread out through the lens of a t-SNE visualization.

Contrastive learning, similarly to consistency regularization, describes making the representation of an instance and a transformation-derived pair similar. However, contrastive learning adds a negative normalization that additionally pushes these representations away from other instances in the sampled mini-batch. Contrastive learning has achieved large advances in representation learning for Computer Vision with frameworks such as SimCLR [ 106 ] and MoCo [ 107 ]. Using Data Augmentation for contrastive learning is a very promising area of research with recent extensions to the information-retrieval language model REALM [ 73 ]. We refer interested readers to a report from Rethmeier and Augenstein [ 112 ] for more details on early efforts to apply contrastive learning to NLP.
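The role of the negative normalization can be seen in a small sketch of an InfoNCE-style loss. This is a simplified, single-anchor version under our own assumptions (cosine similarity, a fixed temperature, toy 2-dimensional embeddings), not the exact objective of any framework cited above.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Pull the augmented view (positive) toward the anchor and push mini-batch negatives away."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))  # negatives appear only in the normalizer

anchor = [1.0, 0.0]
# The positive is close to the anchor and the negatives are far: low loss.
aligned = info_nce(anchor, [0.9, 0.1], negatives=[[0.0, 1.0], [-1.0, 0.0]])
# The positive is far from the anchor and a negative is close: high loss.
misaligned = info_nce(anchor, [0.0, 1.0], negatives=[[0.9, 0.1], [-1.0, 0.0]])
```

Dropping the `neg` term would reduce this to a pure consistency objective, which is the distinction drawn in this section.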

Consistency regularization and contrastive learning are candidate solutions to a common problem found by inspecting model performance. For example, Thorne et al. [ 113 ] find that fact verification models achieve better accuracy when classifying if claims are supported or refuted by the evidence when ignoring the evidence. Contrastive learning would require the model to correctly associate supporting evidence by contrasting it with refuting evidence. Consistency Regularization would more so describe having a similar prediction when the evidence has been slightly perturbed, such as inserting a random word or replacing it with a paraphrase that shares the same semantics.

Negative data augmentation

Negative Data Augmentation is a similar concept to the negative examples used in contrastive learning. However, a key difference is that contrastive learning generally uses other data points as the negatives, whereas Negative Data Augmentation entails applying aggressive augmentations. These augmentations are not limited to label corruptions, but may push the example out of the natural language distribution entirely. Returning to the motif of Meaning versus Form [ 28 ], these augmentations may not be useful for learning meaning, but they can help reinforce the form of natural language. Sinha et al. [ 114 ] demonstrate how this can be used to improve contrastive learning and generative adversarial networks.

Augmentation controllers

A large contributor to the success of Data Augmentation in Computer Vision is the development of controllers. Controllers refer to algorithms that optimize the strength of augmentations throughout training. The strength of an augmentation describes the magnitude of the operation, such as inserting 3 additional words compared to 15. Augmentation strength also describes how many augmentations are stacked together, such as random insertion followed by deletion followed by back-translation and so on, as discussed further below. Successful controllers such as AutoAugment [ 7 ], Population-Based Augmentation [ 8 ], or RandAugment [ 9 ] have not yet seen large-scale adoption in NLP.

When applying Easy Data Augmentation, several hyperparameters arise. Hyperparameter optimization is one of the active areas of Deep Learning research [ 115 – 117 ]. This presents a perfect problem to find optimal values for random augmentation samplings, as well as magnitudes such as: how many tokens to delete? SpanBERT [ 118 ], for example, shows that instead of masking out single tokens for language modeling, masking out multiple tokens at a time, known as spans, results in better downstream performance.
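The two hyperparameters discussed here, how many operations to stack and at what magnitude, can be exposed in a sketch loosely modeled on RandAugment. The three token-level operations and the function names are our own simplified choices for text, not an implementation from [ 9 ].

```python
import random

def random_deletion(tokens, magnitude, rng):
    """Drop each token with probability `magnitude`, always keeping at least one."""
    if len(tokens) <= 1:
        return tokens
    kept = [t for t in tokens if rng.random() > magnitude]
    return kept or [rng.choice(tokens)]

def random_swap(tokens, magnitude, rng):
    """Swap random pairs of positions; the number of swaps grows with `magnitude`."""
    tokens = list(tokens)
    for _ in range(max(1, int(magnitude * len(tokens)))):
        if len(tokens) < 2:
            break
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_duplication(tokens, magnitude, rng):
    """Repeat each token with probability `magnitude`."""
    out = []
    for t in tokens:
        out.append(t)
        if rng.random() < magnitude:
            out.append(t)
    return out

OPERATIONS = [random_deletion, random_swap, random_duplication]

def rand_augment_text(tokens, num_ops=2, magnitude=0.1, seed=None):
    """Apply `num_ops` randomly chosen operations, all at the same `magnitude`."""
    rng = random.Random(seed)
    for op in rng.choices(OPERATIONS, k=num_ops):
        tokens = op(tokens, magnitude, rng)
    return tokens

sentence = "data augmentation acts as a regularizer".split()
augmented = rand_augment_text(sentence, num_ops=2, magnitude=0.2, seed=0)
```

With only `num_ops` and `magnitude` to tune, a plain grid or random search over the two values is feasible, which is the appeal of this style of controller.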

Adversarial augmentation

Adversarial attacks and the use of adversarially optimized inputs for augmentation are very similar to the previous discussion on controllers. The key differentiation is that adversarial controllers target misclassifications whereas controllers generally try to avoid misclassifications. In particular, adversarial optimization aims to improve robustness to high-frequency pattern shifts. Adversarial attacks on text data generally range from introducing typos to swapping out individual words or chunks of words. There is a great deal of ambiguity with this since many of these perturbations would be cleaned and filtered by text data preprocessing techniques such as spell checkers, case normalization, or regular expression filtering.

TextAttack [ 119 ] is an open-source library implementing adversarial text attacks and providing APIs for Data Augmentation. There are four main components of an attack in the TextAttack framework: a goal function, constraints, transformations, and a search method. This pipeline is illustrated in Fig. 7. The goal function defines the target output; for example, instead of solely flipping the predicted output we may want to target a 50-50 density. The constraints define how far the input can be changed. The transformation describes the tools available to change the input such as synonym swaps, deletions, applying back-translation, and all the other techniques discussed previously. Finally, the search method describes the algorithm for searching for the attack. Similar to our discussion of controllers, there are many different ways to perform black-box searches such as grid or random searches, Bayesian optimization, and evolutionary search, to name a few [ 115 ].
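The four components can be illustrated with a toy attack written in plain Python. This sketch deliberately does not use the TextAttack API; the classifier, synonym table, and function names are all hypothetical and exist only to show how the pieces fit together.

```python
SYNONYMS = {"good": ["fine", "decent"], "movie": ["film"]}  # hypothetical synonym table

def toy_model(tokens):
    """Hypothetical classifier returning the probability of the 'positive' class."""
    return 0.9 if "good" in tokens else 0.4

def goal_reached(score):
    """Goal function: flip the prediction by pushing the positive score below 0.5."""
    return score < 0.5

def within_constraints(original, candidate, max_changes=1):
    """Constraint: allow at most `max_changes` substituted tokens."""
    return sum(a != b for a, b in zip(original, candidate)) <= max_changes

def transformations(tokens):
    """Transformation: every single-word synonym swap of the input."""
    for i, token in enumerate(tokens):
        for synonym in SYNONYMS.get(token, []):
            yield tokens[:i] + [synonym] + tokens[i + 1:]

def search(tokens):
    """Search method: exhaustive scan over one-step transformations."""
    for candidate in transformations(tokens):
        if within_constraints(tokens, candidate) and goal_reached(toy_model(candidate)):
            return candidate
    return None

adversarial = search("a good movie".split())
```

Swapping the exhaustive scan for a random, Bayesian, or evolutionary search changes only the `search` function, leaving the other three components untouched.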

Fig. 7 Developing attacks in TextAttack [ 119 ]

A key consideration with adversarial augmentation is how quickly we can construct adversarial examples. Many adversarial example construction techniques such as Szegedy et al. [ 120 ] rely on iterative optimization such as L-BFGS to find the adversarial example. This would be a significant bottleneck in Deep Learning training to wait for the adversarial search at each training batch. Towards solving this issue, Wang et al. [ 121 ] reduce time consumption up to 60% with their DEAT algorithm. The high-level idea of DEAT is to use batch replay to avoid repeatedly computing adversarial batches.

Stacking augmentations

Stacking augmentations is a strategy that has improved vision models but is less straightforward to apply to text data. One strategy for this is CoDA [ 122 ]. CoDA introduces a local consistency loss to make sure stacking augmentations has not overly corrupted the sample, and a global loss to preserve local neighborhoods around the original instance.

Tokenization

The preprocessing pipeline of tokenization presents a formidable challenge for implementing Data Augmentations. It is common to tokenize, or convert word tokens to their respective numeric index in a vocabulary-embedding lookup table, offline before the data reaches the Data Loader itself. Applying Data Augmentations on these index lists could require significantly more engineering effort. Even for simple synonym replacement, additional code will have to be written to construct dictionaries of the synonyms’ index values for swaps. Notably, researchers are exploring tokenizer-free models such as byT5 [ 123 ] and CANINE [ 124 ]. These models process byte-level sequences such as ASCII codes [ 125 , 126 ] and will require special processing to integrate these augmentations.
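The extra bookkeeping for synonym replacement on pre-tokenized data can be sketched as follows. The vocabulary and synonym table are toy assumptions; a real pipeline would build the index-level dictionary from its own tokenizer vocabulary.

```python
import random

vocab = {"the": 0, "movie": 1, "film": 2, "was": 3, "great": 4, "excellent": 5}
word_synonyms = {"movie": ["film"], "great": ["excellent"]}  # hypothetical word-level table

# Translate the word-level synonym table into an index-level one, once, offline.
index_synonyms = {
    vocab[word]: [vocab[s] for s in synonyms if s in vocab]
    for word, synonyms in word_synonyms.items()
    if word in vocab
}

def synonym_swap_ids(token_ids, prob=0.5, seed=None):
    """Replace each token id that has synonyms with a random synonym id, with probability `prob`."""
    rng = random.Random(seed)
    return [
        rng.choice(index_synonyms[i]) if i in index_synonyms and rng.random() < prob else i
        for i in token_ids
    ]

encoded = [vocab[w] for w in "the movie was great".split()]
swapped = synonym_swap_ids(encoded, prob=1.0)
```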

Position embeddings

Another more subtle detail of Transformer implementations is the use of position embeddings. The original Transformer [92] uses sine and cosine functions to integrate positional information into text sequences. Another subtle Data Augmentation could be to perturb the parameters that render these encodings.
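As a sketch of what such a perturbation might look like, the function below computes the sine and cosine encoding of the original Transformer and exposes the frequency base (10000 in the original formulation) as a parameter. Treating the base as the quantity to perturb is our own illustrative choice, not a method evaluated in this survey.

```python
import math

def sinusoidal_encoding(position, d_model, base=10000.0):
    """Sine on even dimensions, cosine on odd dimensions, with geometrically spaced frequencies."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (base ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]

standard = sinusoidal_encoding(position=3, d_model=8)
# Hypothetical augmentation: render the same position with a slightly different base.
perturbed = sinusoidal_encoding(position=3, d_model=8, base=9000.0)
```

The first sine and cosine pair is unaffected because its frequency does not depend on the base, while the remaining dimensions shift slightly.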

Augmentation on CPUs or GPUs?

Another important aspect of Data Augmentation is to understand the typical data preprocessing pipeline from CPUs to GPUs. It has been standard practice to apply Data Augmentation to data on the CPU before it is passed to the GPU for model training. However, recent practice has looked at applying Data Augmentation directly on the GPU. This is done in Keras, for example, by adding Data Augmentation as a layer in the model immediately after the input layer. It is also worth noting clever schemes such as Data Echoing from Choi et al. [ 127 ] that apply additional techniques to avoid idle time between CPU data loading and GPU model training.

Offline and online augmentation

Similarly to the discussion of augmenting data on the CPU or on the GPU, another important consideration is to make sure the Data Augmentation is happening online, compared to offline. This refers to when the original instance is augmented in the data pipeline. Offline augmentation refers to augmenting the data and storing the augmented examples to the disk. Online augmentation describes augmenting the data as a new batch of the original data is loaded for a training step. We note that Online augmentation is much more powerful than Offline augmentation. Offline augmentation offers the slight benefit of faster loading times, but it does not really take advantage of the stochasticity and diversity enabled with most of the described augmentations.
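The difference between the two pipelines can be made concrete with a small sketch. It assumes random token deletion as the augmentation; the function names and the example sentence are illustrative.

```python
import random

def random_deletion(tokens, p, rng):
    """Drop each token with probability p, falling back to the original if all are dropped."""
    kept = [t for t in tokens if rng.random() > p]
    return kept or tokens

def offline_dataset(examples, p=0.3, seed=0):
    """Offline: augment once and store the result (on disk in practice)."""
    rng = random.Random(seed)
    return [random_deletion(x, p, rng) for x in examples]

def online_batches(examples, epochs, p=0.3, seed=0):
    """Online: draw a fresh augmentation every time the data is loaded."""
    rng = random.Random(seed)
    for _ in range(epochs):
        yield [random_deletion(x, p, rng) for x in examples]

examples = [["data", "augmentation", "adds", "useful", "diversity", "to", "training"]]

stored = offline_dataset(examples)
# The stored copy is reused unchanged in every epoch.
offline_views = {tuple(stored[0]) for _ in range(20)}
# The online pipeline resamples, so the model sees many distinct views.
online_views = {tuple(batch[0]) for batch in online_batches(examples, epochs=20)}
```

Over 20 epochs the offline pipeline exposes the model to a single augmented view of the example, whereas the online pipeline produces a new sample each epoch.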

Another important detail of this pipeline is augmentation multiplicity [ 128 ]. Augmentation multiplicity refers to the number of augmented samples derived from one original example. Fort et al. [ 128 ] and Hoffer et al. [ 129 ] illustrate how increasing augmentation multiplicity can improve performance. This approach could introduce significant memory overhead without an online augmentation pipeline. Additionally, Wei et al. [ 130 ] point out that examples are often augmented online such that the model never actually trains with the original instances. Wei et al. propose separating the model into two fine-tuning heads, one which trains solely on the unaugmented data and the other trained on high magnitude augmentations. These works highlight the opportunity to explore fine-grained details in augmentation pipelines.

Curriculum learning

Curriculum Learning describes having a human or meta-controller structure the organization of the data batches. This includes varying the strength of Data Augmentation throughout training. Kuchnik and Smith [ 131 ] find that it is much more efficient to subsample a portion of the dataset to be augmented, rather than augmenting the entire dataset. Wei et al. [ 132 ] demonstrate the efficacy of gradually introducing augmented examples to original examples in the training of triplet networks for text classification. We note this is very similar to our discussion of controllers for augmentation and searching for optimal magnitude and chaining parameters. Thakur et al. [ 85 ] describe that “selecting the sentence pairs is non-trivial and crucial for the success of the method”.
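Gradually introducing augmented examples can be sketched with a simple schedule. The linear ramp and the cap of 50% augmented examples per batch are our own assumptions for illustration; they are not the schedule used by Wei et al. [ 132 ].

```python
def augmented_fraction(step, total_steps, max_fraction=0.5):
    """Linear schedule: no augmented data at step 0, `max_fraction` at the end of training."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return max_fraction * progress

def build_batch(originals, augmented, step, total_steps):
    """Replace a growing share of the original examples with augmented ones."""
    n_aug = round(len(originals) * augmented_fraction(step, total_steps))
    return originals[: len(originals) - n_aug] + augmented[:n_aug]

originals = ["o1", "o2", "o3", "o4"]
augmented = ["a1", "a2", "a3", "a4"]

early_batch = build_batch(originals, augmented, step=0, total_steps=100)
late_batch = build_batch(originals, augmented, step=100, total_steps=100)
```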

Class imbalance

A prevalent issue explored in classification models is Class Imbalance [ 133 ]. In addition to customized loss functions, sampling techniques are a promising solution to overcome biases stemming from Class Imbalance. These solutions generally describe strategies such as random oversampling or undersampling [ 134 , 135 ], in addition to interpolation strategies such as synthetic minority oversampling technique (SMOTE) [ 136 ]. SMOTE is a general framework to oversample minority instances by averaging between them. From the list of augmentations we have covered, we note that MixUp is very similar to this technique and has been explored for text data. It may be useful to use other techniques for oversampling to avoid potential pitfalls of duplicating instances.
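The interpolation idea behind SMOTE can be sketched as below. For brevity this version interpolates toward the single nearest minority neighbor, whereas SMOTE as usually described samples among several nearest neighbors; the feature vectors are toy values standing in for text embeddings.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def smote_sample(minority, rng):
    """Create one synthetic point on the segment between a minority point and its nearest neighbor."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbor = min(others, key=lambda p: squared_distance(base, p))
    lam = rng.random()  # interpolation coefficient in [0, 1)
    return [a + lam * (b - a) for a, b in zip(base, neighbor)]

minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]  # hypothetical minority-class features
rng = random.Random(0)
synthetic = [smote_sample(minority, rng) for _ in range(5)]
```

Because each synthetic point is a convex combination of two real minority points, it avoids the exact duplicates produced by random oversampling.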

Task-specific augmentation for NLP

NLP encompasses many different task formulations. This ranges from text classification to paraphrase identification, question answering, and abstractive summarization, to name a few. The off-the-shelf Data Augmentation prescribed in the previous section will need slight adaptations for each of these tasks. For example, when augmenting the context in a question answering dataset, it is important to be mindful of removing the answer. The largest difference we have found between tasks from the perspective of Data Augmentation is that they vary massively with respect to input length. Short sequences will have to be more mindful of how augmentations change the original example. Longer sequences have more design decisions such as how to sample nested sentences for back-translation and so on. We refer interested readers to Feng et al. [ 16 ] who enumerate how Data Augmentation applies to summarization, question answering, sequence tagging, parsing, grammatical error correction, neural machine translation, data-to-text natural language generation (NLG), open-ended and conditional generation, dialogue, and multimodal tasks.

Self-supervised learning and data augmentation

In both the case of self-supervised learning and Data Augmentation, we are looking to inject prior knowledge about a data domain. When a model is deployed, what is more likely: the data distribution changes or the task the model is supposed to perform with the data changes? In self-supervised learning, we look for ways to set up tasks and loss functions for representation learning. In Data Augmentation, we look for priors to manipulate the data distribution. A key advantage of Data Augmentation is that it is much easier to stack priors than self-supervised learning. In order to utilize multiple priors, self-supervised learning relies on highly unstable multi-task learning or costly multi-stage learning. In contrast, Data Augmentation only requires random sampling operations to integrate multiple priors.

We note that many of the key successes in self-supervised learning rely on Data Augmentation, or have at least been dramatically improved by it. For example, the success of contrastive learning relies on Data Augmentation to form two views of the original instance. The most data-efficient GAN frameworks achieve their data efficiency through the use of Data Augmentation [ 137 ]. Further, DistAug [ 138 ] even tests Data Augmentation with large-scale pixel autoregressive modeling in the ImageGPT model [ 139 ].
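As a rough illustration of how augmentation forms the two views used in contrastive learning, the sketch below applies the same stochastic augmentation twice to one sentence and returns the results as a positive pair. Token dropout is an assumed choice of augmentation, and the names token_dropout and two_views are hypothetical; the encoder and contrastive loss are out of scope here.

```python
import random


def token_dropout(tokens, rng, p=0.2):
    """Drop each token with probability p; keep the original if all are dropped."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else list(tokens)


def two_views(tokens, rng, p=0.2):
    """Return two independently augmented views of the same instance."""
    return token_dropout(tokens, rng, p), token_dropout(tokens, rng, p)


if __name__ == "__main__":
    rng = random.Random(1)
    sentence = "contrastive learning pulls two views of one sentence together".split()
    view_a, view_b = two_views(sentence, rng, p=0.3)
    print(view_a)
    print(view_b)
```

A contrastive objective would then pull the representations of the two views together while pushing them away from views of other sentences in the batch.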

Transfer and multi-task learning

Transfer learning has been one of the most successful approaches to training deep neural networks. It looks especially promising as more annotated datasets are collected and unified in dataset hubs. A notable example is HuggingFace Datasets [ 140 ], which contained 884 datasets at the time of this publication. Researchers have additionally explored multi-task learning, in which a model simultaneously optimizes multiple tasks. This has been well explored in T5 [ 141 ], which converts all tasks into language modeling. We believe there is room for Data Augmentation experiments in this space, such as the use of MixUp to combine data from multiple tasks or Back-Translation between curated datasets.
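For readers unfamiliar with MixUp, a minimal sketch of the standard interpolation is given below: two examples are blended with a weight drawn from a Beta distribution, and their label vectors are blended with the same weight. For text this is usually applied to embeddings or hidden states rather than raw tokens. The sketch assumes both examples share one feature dimension and one label vector layout; how to align label spaces across different tasks is left open, and the function name mixup and its parameters are illustrative.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.4, rng=None, lam=None):
    """Blend two (features, label-vector) pairs with a shared weight `lam`.

    If `lam` is not given, it is sampled from Beta(alpha, alpha).
    Assumption: features and label vectors are equal-length lists of floats.
    """
    if len(x_a) != len(x_b) or len(y_a) != len(y_b):
        raise ValueError("feature and label vectors must have matching lengths")
    if lam is None:
        rng = rng or random.Random()
        lam = rng.betavariate(alpha, alpha)

    def blend(u, v):
        return [lam * a + (1.0 - lam) * b for a, b in zip(u, v)]

    return blend(x_a, x_b), blend(y_a, y_b), lam


if __name__ == "__main__":
    # Toy 2-d "embeddings" with one-hot labels over two classes.
    x, y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
                      rng=random.Random(0))
    print(lam, x, y)
```

Because the labels are interpolated as well, the model is trained on soft targets that lie between the two original examples.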

Wei et al. [ 130 ] propose an interesting extension to the common practice of transfer learning, named Multi-Task View (MTV), to better utilize augmented subsets and share information across distributions. MTV trains separate heads on augmented subsets and ensembles their predictions for the final output. Geva et al. [ 142 ] have also shown utility in sharing a feature extractor base and training separate heads. In this case, Geva et al. train each head with a different task and reformulate inputs into unifying prompts for inference. Similar to the discussion of prompting under Generative Data Augmentation, there remains a significant opportunity to explore how transfer learning, multi-task learning, and Data Augmentation can be combined.

One of the most interesting ideas in artificial intelligence research is AI-GAs (AI-generating algorithms) [ 10 ]. An AI-generating algorithm is composed of three pillars: meta-learning architectures, meta-learning the learning algorithms themselves, and generating effective learning environments. We believe that Data Augmentation, as an interface to control data distributions, will play a large role in the third pillar of generating learning environments. For example, learning agents could be embedded in teacher-student loops in which the teacher controls augmentation parameters to render the learning environment.

Learning the learning environment itself has been successfully applied to bipedal walking control with neural networks in POET [ 11 ]. POET is a co-evolutionary framework that jointly evolves control parameters and the parameters that render walking terrains. Data Augmentation may be the most natural way of extending this framework to language understanding, with the environment searching for magnitude parameters of augmentation or subsets of data, as in curriculum learning. AI-GAs have been applied to vision problems in examples such as Generative Teaching Networks (GTNs) [ 12 ] and Synthetic Petri Dish [ 13 ]. In GTNs, a teacher network generates training data for a student network. Notably, the training data has high-frequency noise patterns that do not resemble natural image data. It could be interesting to see how well GTNs could generate text embeddings, similar to the continuous optimization of prompt tuning.

In conclusion, this survey has presented several strategies for applying Data Augmentation to text data. These augmentations provide an interface that allows developers to inject priors about their task and data domain into the model. We have additionally presented how Data Augmentation can help simulate distribution shift and test generalization. As Data Augmentation for NLP is relatively immature compared to Computer Vision, we have highlighted some of the key similarities and differences. We have also presented many ideas surrounding Data Augmentation, from practical engineering considerations to broader discussions of the potential of Data Augmentation in building artificial intelligence. Data Augmentation is a very promising strategy, and we hope our discussion section helps motivate further research interest.

Acknowledgements

We would like to thank the reviewers in the Data Mining and Machine Learning Laboratory at Florida Atlantic University. Additionally, we acknowledge partial support by the NSF (IIS-2027890). Opinions, findings, conclusions, or recommendations in this paper are the authors’ and do not reflect the views of the NSF.

Authors' contributions

CS performed the literature review and drafted the manuscript. TMK worked with CS to develop the article’s framework and focus. TMK introduced this topic to CS. All authors read and approved the final manuscript.

NSF RAPID (IIS-2027890).

Availability of data and materials

Declarations.

The authors declare that they have no competing interests.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Title: Research Trends and Applications of Data Augmentation Algorithms

Abstract: In the Machine Learning research community, there is a consensus regarding the relationship between model complexity and the required amount of data and computational power. In real-world applications, these computational resources are not always available, motivating research on regularization methods. In addition, current and past research has shown that simpler classification algorithms can reach state-of-the-art performance on computer vision tasks given a robust method to artificially augment the training dataset. Because of this, data augmentation techniques have become a popular research topic in recent years. However, existing data augmentation methods are generally less transferable than other regularization methods. In this paper we identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature. To do this, the related literature was collected through the Scopus database. Its analysis was done following network science, text mining and exploratory analysis approaches. We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
Comments: 23 pages, 9 figures, 5 tables
Subjects: Machine Learning (cs.LG)

IMAGES

  1. Data Augmentation Techniques for Text Classification in NLP (Research Paper Walkthrough)

  2. [ Paper Summary ] The Effectiveness of Data Augmentation in Image

  3. Examples of data augmentation techniques used in reviewed papers to

  4. (PDF) Data augmentation for improving deep learning in image

  5. (PDF) Advanced Data Augmentation Approaches: A Comprehensive Survey and

  6. (a) An illustration of the data augmentation procedure designed to

VIDEO

  1. Coursework1

  2. Image Data Augmentation with Edge Impulse

  3. Classifying Images using Data Augmentation

  4. LiDA Language Independent Data Augmentation for Text Classification final year projects

  5. PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation

  6. 038 Memory Augmentation

COMMENTS

  1. Data augmentation: A comprehensive survey of modern approaches

  2. [1712.04621] The Effectiveness of Data Augmentation in Image

    View a PDF of the paper titled The Effectiveness of Data Augmentation in Image Classification using Deep Learning, by Luis Perez and 1 other authors. In this paper, we explore and compare multiple solutions to the problem of data augmentation in image classification. Previous work has demonstrated the effectiveness of data augmentation through ...

  3. A survey on Image Data Augmentation for Deep Learning

  4. Data Augmentation

  5. [2301.02830] Image Data Augmentation Approaches: A Comprehensive Survey

    View a PDF of the paper titled Image Data Augmentation Approaches: A Comprehensive Survey and Future directions, by Teerath Kumar and 2 other authors. Deep learning (DL) algorithms have shown significant performance in various computer vision tasks. However, having limited labelled data lead to a network overfitting problem, where network ...

  6. Data augmentation for improving deep learning in image classification

    One of the ways of dealing with this problem is so called data augmentation. In the paper we have compared and analyzed multiple methods of data augmentation in the task of image classification, starting from classical image transformations like rotating, cropping, zooming, histogram based methods and finishing at Style Transfer and Generative ...

  7. Frontiers and developments of data augmentation for image: From

    However, a significant challenge these models face is their generalization performance, which has prompted ongoing research efforts aimed at enhancing this capacity through advancements in network structure and data augmentation techniques. This paper uniquely focuses on the pivotal role of data augmentation in achieving improved generalization.

  8. Title: Image Data Augmentation for Deep Learning: A Survey

    Image Data Augmentation for Deep Learning: A Survey. Suorong Yang, Weikang Xiao, Mengchen Zhang, Suhan Guo, Jian Zhao, Furao Shen. View a PDF of the paper titled Image Data Augmentation for Deep Learning: A Survey, by Suorong Yang and 4 other authors. Deep learning has achieved remarkable results in many computer vision tasks.

  9. A Survey on Data Augmentation Approaches for NLP

    Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and the popularity of large-scale neural networks that require large amounts of training data. ... We conclude by outlining current challenges and directions for future research. Overall, our paper aims to clarify the landscape of ...

  10. Text Data Augmentation for Deep Learning

    We hope this paper inspires further research interest in Text Data Augmentation. Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. ... The paper proposes to label data with the cross-encoder and fit ...

  11. A survey of automated data augmentation algorithms for deep learning

  12. A review: Data pre-processing and data augmentation techniques

    This review paper provides an overview of data pre-processing in Machine learning, focusing on all types of problems while building the machine learning problems. It deals with two significant issues in the pre-processing process: (i) issues with data and (ii) steps to follow to do data analysis with its best approach.

  13. Data Augmentation in Classification and Segmentation: A Survey and New

    Image classification and image segmentation are two common, yet important, research areas in computer vision, which typically use data augmentation approaches. In this section, we discuss recent research, mostly within the past five years, in these two areas that leverage data augmentation for performance enhancement.

  14. [2105.03075] A Survey of Data Augmentation Approaches for NLP

    Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy. View a PDF of the paper titled A Survey of Data Augmentation Approaches for NLP, by Steven Y. Feng and 6 other authors. Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and ...

  15. A Comprehensive Survey of Image Augmentation Techniques for Deep

    To utilize an image augmentation algorithm efficiently, it is crucial to understand the challenges of application and apply suitable methods. This study was conducted to provide a survey that enhances the understanding of a wide range of image augmentation algorithms.

  16. A Survey of Data Augmentation Approaches for NLP

    data. In this paper, we present a comprehensive and unifying survey of data augmentation for NLP by summarizing the literature in a structured manner. We first introduce and motivate data augmentation for NLP, and then discuss major methodologically representative approaches. Next, we highlight techniques that are used for popular NLP ...

  17. (PDF) Data augmentation for improving deep learning in image

    One of the ways of dealing with this problem is so called data augmentation. In the paper we have compared and analyzed multiple methods of data augmentation in the task of image ...

  18. Data augmentation approaches in natural language processing: A survey

    Data augmentation is widely applied in the field of computer vision (Shorten and Khoshgoftaar, 2019), such as flipping and rotation, then introduced to natural language processing (NLP). Different to images, natural language is discrete, which makes the adoption of DA methods more difficult and underexplored in NLP.

  19. Data augmentation in natural language processing: a novel text

    Foundations of data augmentation. Data augmentation is a machine learning technique that artificially enlarges the amount of training data by means of label preserving transformations []. First variations of data augmentation can be identified in the well-known LeNet by []. Using random distortions of training pictures, the MNIST-dataset was ninefold enlarged, so that a better detection of ...

  20. [2405.09591] A Comprehensive Survey on Data Augmentation

    View a PDF of the paper titled A Comprehensive Survey on Data Augmentation, by Zaitian Wang and 8 other authors. Data augmentation is a series of techniques that generate high-quality artificial data by manipulating existing data samples. By leveraging data augmentation techniques, AI models can achieve significantly improved applicability in ...

  21. Navigating uncharted waters: ASU drives solutions for water resilience

    Editor's note: This is the fifth story in a series exploring how ASU is changing the way the world solves problems. In the Southwest, water seems to exist in two vastly conflicting states: abundance and scarcity. For some, simply turning on a faucet at work or at home yields a seemingly on-demand supply of one of our planet's most precious resources.

  22. Data Augmentation for Image Classification using Generative AI

    Scaling laws dictate that the performance of AI models is proportional to the amount of available data. Data augmentation is a promising solution to expanding the dataset size. Traditional approaches focused on augmentation using rotation, translation, and resizing. Recent approaches use generative AI models to improve dataset diversity. However, the generative methods struggle with issues ...

  23. Data augmentation for medical imaging: A systematic literature review

    Learnable data augmentation is a recent subfield of deep learning research that studies approaches that can reduce the human effort required when selecting and validating a set of data augmentation techniques [49]. The main idea is to discover automatically an optimal data augmentation strategy for a specific task.

  24. Text Data Augmentation for Deep Learning

    Data Augmentation research has been more thoroughly explored in Computer Vision than NLP. We present some ideas that have found interesting results with images, but remain to be tested in the text data domain. Finally, we discuss the intersection of visual supervision for language understanding and how vision-language models may help overcome ...

  25. Research Trends and Applications of Data Augmentation Algorithms

    In this paper we identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature. To do this, the related literature was collected through the Scopus database. Its analysis was done following network ...

  26. Data augmentation techniques in natural language processing
