Sample Submission

This post outlines a few more things you may need to know for creating and configuring your blog posts.


MODALS: Data augmentation that works for everyone


In this blog we discuss the paper “MODALS: Modality- agnostic Automated Data Augmentation in the Latent Space”[1]

Data augmentation is important for all types of deep learning tasks: unsupervised, supervised and self-supervised. Data augmentation has almost become a prerequisite for deep learning. In fact many self-supervised learning tasks [2] use data augmentation as a core component of the learning algorithm. It is also heavily used in image models and has consistently produced good results. By perturbing the data during training we are able to train more robust models, especially when there is limited data. Data augmentation is also very useful in incorporating known priors or domain knowledge into the model. For example the random flip transform applied to image data implicitly encodes the prior knowledge we have that the data is invariant to changes in orientation.

The usefulness of data augmentation has led to the development of specific techniques of augmentation unique to each modality of data. The techniques developed for one modality usually suit the type of data in that particular modality. For image data some commonly used augmentation techniques are rotation, cropping, applying affine transforms, random flips, contrast and color augmentations and adding Gaussian blur. More recent techniques like CutMix[3] and MixUp[4] apply data augmentation on both the image and label space. Many robust data augmentation techniques for image data already exist and are used widely. However modalities like tables and graph data don’t have as many robust augmentation techniques. Since the techniques developed for images were made taking into consideration the nature of image data, they usually cannot be applied to other modalities. If a generalized, modality agnostic framework for augmentation could be developed then standard, robust augmentation techniques can be applied across many modalities. This is exactly what the authors of the paper propose.

In order to be independent of modality, such a framework would have to convert data from different modalities to a common representation on which label preserving transforms can be applied. The authors point out that there exist techniques of data augmentation that use auto encoders to learn a latent space representation [5]. This is then used to generate augmentation data for the downstream task. In the framework proposed by the paper the learning of the latent space is integrated with the downstream task. A model is trained for the downstream task and an intermediate layer encodes the representations of the input data in the latent space. The framework uses interpolation and extrapolation of existing data points to produce augmented data. This requires that the latent space learned must be smooth (the representations cannot vary abruptly for a minute change in input features) for transformations to be label preserving and produce valid data. The smoothness ensures that the augmented data is inside the data manifold (ensuring it is valid data) and has the same label as the points used to generate it. To ensure this property the authors add two additional losses to the encoder used to learn the latent space. These are the adversarial loss and the triplet loss.

Adversarial Loss

To achieve a smooth latent space data manifold we use a discriminator network and add an adversarial loss to the model. The discriminator, given a data point, tries to predict if it is from the latent space or is a randomly sampled point from a Gaussian distribution. To keep the adversarial loss low (to “fool” the discriminator network) the model must also learn a smooth data representation so as to be indistinguishable from a smooth Gaussian distribution.

Triplet Loss

The aim is to have the representations of all the data points of a class to be close to each other and far away from data points from other classes. To learn the latent space in this way we add the triplet loss to the model. The Triplet loss pulls together samples from the same class and pushes away samples from other classes.




Figure 2 from the paper

Figure: The MODALS framework (figure from paper)


Transformations in the latent space

Once the representations of points have been found, the authors propose the following transforms in the latent space for data augmentation:

Hard example interpolation

For each sample in the training set, the $q$ nearest hard examples of that class are found. Then the data point is interpolated towards the nearest hard example to get a new augmented data point.
In the figure below, the data point being considered is represented by $\vec{\,a}$. $q1$ through $q5$ are 5 hard examples of the same class as $\vec{\,a}$ and $q3$ (represented by $\vec{\,b}$) is selected as it is the nearest to $\vec{\,a}$ among the five. $\vec{\,a}$ is interpolated towards $\vec{\,b}$ with a scaling factor $\lambda$ applied.

Illustration of hard example interpolation

Figure: Illustration of hard example interpolation


Hard example extrapolation

Similar to interpolation, each sample in the training set is considered and then shifted away from the class center to get the augmented data point.
In the figure, the data point being considered is represented by $\vec{\,a}$. The class center of $\vec{\,a}$ is represented by $\vec{\,b}$. $\vec{\,a}$ is extrapolated away from $\vec{\,b}$ by a factor $\lambda$.

Illustration of hard example extrapolation

Figure: Illustration of hard example extrapolation


Gaussian Noise

Gaussian noise is added to data points to get augmented data. The mean of the Gaussian noise is zero and the standard deviation is calculated on a per class basis.
In the figure below, $\vec{\,a}$ is the data point being considered and $\vec{\,b}$, $\vec{\,c}$ and $\vec{\,d}$ are three points to which a may be transformed after adding Gaussian noise. The Gaussian noise which is scaled by a factor $\lambda$ and added is shown as a circle.

Illustration of addition of Gaussian noise

Figure: Illustration of addition of Gaussian noise


Difference Transform

The rationale behind the difference transform is that labels should be independent of differences in features within a class. The difference between two samples from the same class is added to a data point to get an augmented instance of data. Adding the difference in features to the point should not change its label.
The figure shows $\vec{\,a}$ as the data point being considered and $\vec{\,b}$ and $\vec{\,c}$ are two randomly sampled points from the same class as $\vec{\,a}$. The difference of $\vec{\,b}$ and $\vec{\,c}$ is added to $\vec{\,a}$ to get the augmented point.

Illustration of the difference transform

Figure: Illustration of the difference transform



Population Based Augmentation (PBA)

Instead of fixing the set of augmentations to be applied to latent representations at all stages of training, the authors use a technique called PBA[6] to find optimized sets of augmentations to be applied at different stages. A few terms that PBA uses are defined below.

An augmentation function is defined as a 3-tuple consisting of a transformation, the probability of its application and its magnitude. An example given in the paper is that (op: rotation, p = 0.4, lambda = 0.5) is an augmentation function that applies the rotation transformation with a probability of 0.4 and a magnitude of 0.5.
A policy is defined as the set of all 3-tuples which are to be applied to the dataset. 0, 1 or 2 tuples are randomly sampled from the current policy and applied to each mini-batch.
A policy schedule is the order of policies which will be applied at different stages of training.

PBA searches for a good policy schedule which will give us optimized augmentation hyperparameters to be applied at different stages of training. PBA exploits population based training which is a method in which multiple child models are trained simultaneously to learn an optimal policy schedule. At the beginning each child is assigned a randomly sampled policy from the set of all possible policies (corresponding to all possible values of probability and magnitude).The child models are initialized with random weights and are trained for a fixed number of steps following which they are evaluated using the validation set. The evaluation reveals the worst and best of the models being trained. The worst models copy the weights and policies learned by the best models. The parameters (probability and magnitude) of the copied policies are then perturbed slightly or resampled from the set of all possible values. This is done to avoid creating duplicate models. This sequence of steps are repeated until the child models are done training. At the end the best child model is selected and its policy schedule is used.
There is reason to believe that this approach for finding a good policy schedule will work. The authors of [6] state that “… though the space of schedules is large, most good schedules are necessarily smooth and hence easily discoverable through evolutionary search algorithms such as PBT.”

The sequence of steps of PBA

Figure: The sequence of steps of PBA



Closing Thoughts

In their experiments the authors applied the MODALS framework on text, tabular, time-series and image data. In general the framework works well across modalities except in the imaging modality where they found that using PBA with image specific data augmentation techniques performed better than MODALS. While MODALS is a powerful technique for data augmentation, domain specific augmentation techniques could still be used in certain contexts like image data. Domain knowledge and insights about the data can be incorporated through hand crafted augmentations but not in MODALS. As the authors themselves note, one possible approach is to use a combination of both, using hand crafted augmentations to incorporate domain knowledge and insights into the augmentation.

The framework proposed in the paper builds around classification tasks where each data point has a single label. There is potential for a technique like MODALS to be extended to other tasks like regression and multi label classification.

References

[1] Cheung, Tsz-Him, and Dit-Yan Yeung. “MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space.” International Conference on Learning Representations. 2020.
[2] He, Kaiming, et al. “Momentum contrast for unsupervised visual representation learning.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
[3] Yun, Sangdoo, et al. “Cutmix: Regularization strategy to train strong classifiers with localizable features.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
[4] Liu, Zicheng, et al. “AutoMix: Unveiling the Power of Mixup.” arXiv preprint arXiv:2103.13027 (2021).
[5] DeVries, Terrance, and Graham W. Taylor. “Dataset augmentation in feature space.” arXiv preprint arXiv:1702.05538 (2017).
[6] Ho, Daniel, et al. “Population based augmentation: Efficient learning of augmentation policy schedules.” International Conference on Machine Learning. PMLR, 2019.

Images created using: The Manim Community Developers. (2022). Manim – Mathematical Animation Framework (Version v0.14.0) [Computer software]. https://www.manim.community/


Example content (Basic Markdown)

Howdy! This is an example blog post that shows several types of HTML content supported in this theme.