Cookbook for Vision Transformers: A Formula for Training ViTs

Vision Transformers (ViTs) are overtaking convolutional neural networks (CNNs) in many vision tasks, but procedures for training them are still tailored for CNNs. New research investigated how various training ingredients affect ViT performance.

What’s new: Hugo Touvron and colleagues at Meta and Sorbonne University formulated a new recipe for training ViTs. They call their third-generation approach Data Efficient Image Transformers (DeiT III).

Key insight: The CNN and transformer architectures differ. For instance, when processing an image, a CNN works on one local patch of pixels at a time, while a transformer attends to the whole image at once. Moreover, while a CNN’s computational cost scales roughly in proportion to input size, a transformer’s self-attention scales quadratically with the number of input tokens, so it requires dramatically more processing as input size grows. Training recipes that take these differences (and other, less obvious ones) into account should impart better performance.

How it works: The authors pretrained ViTs to classify images in ImageNet using various combinations of training data, data augmentation, and regularization. (They also experimented with variables such as weight decay, dropout, and type of optimizer, for which they didn’t describe results in detail.) They fine-tuned and tested on ImageNet.

  • The authors pretrained the transformers on ImageNet-21K using lower image resolutions, such as 192x192 pixels, before fine-tuning on full-res 224x224-pixel images. Pretraining transformers on lower-res versions is faster and less memory-intensive and has been shown to result in better classification of full-res images.
  • ImageNet-21K includes roughly 10 times as many images as the more common ImageNet. The larger dataset made aggressive random cropping unnecessary as a hedge against overfitting. Instead, the authors used a gentler cropping procedure that was more likely to retain an image’s subject (see the first sketch after this list). First, they resized training examples so their smaller dimension matched the training resolution (say, from 224x448 to 192x384). Then they cropped the larger dimension to form a square (192x192) with a random offset.
  • The authors altered the colors of training examples by blurring, grayscaling, or solarizing (that is, inverting colors above a certain intensity). They also randomly changed brightness, contrast, and saturation (see the second sketch after this list). Less consistent color information may have forced the transformers, which are less sensitive than CNNs to object outlines, to focus more on shapes.
  • They used two regularization schemes, both sketched in code after this list. Stochastic depth forces individual layers to play a greater role in the output by skipping layers at random during training. LayerScale achieves a similar end by multiplying each layer’s output by small, learnable weights. Because a transformer’s residual connections carry each block’s input around the block, scaling a block’s output toward zero lets the network begin learning as though it had only a few layers and effectively add more as training progresses. This gradual accumulation helps it keep learning despite having a large number of layers, which otherwise can impede convergence.
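
A minimal sketch of the cropping procedure above, assuming torchvision (the authors’ released code may differ): resize the shorter side to the pretraining resolution, then take a square crop at a random offset along the longer side. Running the same pipeline at 224 pixels covers the full-res fine-tuning stage.

```python
from torchvision import transforms

def simple_square_crop(resolution: int) -> transforms.Compose:
    """Resize the shorter side to `resolution`, keeping the aspect ratio
    (e.g., 224x448 -> 192x384), then cut a square at a random offset."""
    return transforms.Compose([
        transforms.Resize(resolution),      # shorter side -> resolution
        transforms.RandomCrop(resolution),  # square crop, random offset along the longer side
    ])

pretrain_crop = simple_square_crop(192)  # lower-res pretraining
finetune_crop = simple_square_crop(224)  # full-res fine-tuning
```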
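
The color alterations can be sketched the same way: apply exactly one of grayscale, solarization, or blur per image, then jitter brightness, contrast, and saturation. The specific parameter values below (solarization threshold, blur kernel, jitter strengths) are illustrative assumptions, not the paper’s settings.

```python
from torchvision import transforms

color_aug = transforms.Compose([
    transforms.RandomChoice([                             # pick one per image
        transforms.RandomGrayscale(p=1.0),
        transforms.RandomSolarize(threshold=128, p=1.0),  # invert pixels above the threshold
        transforms.GaussianBlur(kernel_size=9),
    ]),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
])
```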
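
Both regularizers are short to express in PyTorch. The sketch below assumes they sit inside a standard residual transformer block: LayerScale is a set of learnable per-channel weights initialized near zero, and stochastic depth randomly skips the residual branch per sample during training.

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Scale a residual branch's output by small learnable per-channel weights,
    so each block starts near the identity and 'switches on' as training progresses."""
    def __init__(self, dim: int, init_value: float = 1e-4):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * x

def drop_path(x: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
    """Stochastic depth: zero out the residual branch for a random subset of
    samples, rescaling survivors so the expected output is unchanged."""
    if drop_prob == 0.0 or not training:
        return x
    keep_prob = 1.0 - drop_prob
    mask = x.new_empty((x.shape[0],) + (1,) * (x.dim() - 1)).bernoulli_(keep_prob)
    return x * mask / keep_prob

# Inside a block: x = x + drop_path(layer_scale(attention(norm(x))), p, self.training)
```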

Results: The authors’ approach substantially improved ViT performance. An 86-million-parameter ViT-B pretrained on ImageNet-21K and fine-tuned on ImageNet using the full recipe achieved 85.7 percent accuracy. Their cropping technique alone yielded 84.8 percent accuracy. In contrast, the same architecture trained on the same datasets using full-resolution examples augmented via RandAugment achieved 84.6 percent accuracy.

Why it matters: Deep learning is evolving at a breakneck pace, and familiar hyperparameter choices may no longer be the most productive. This work is an early step toward updating, for the transformer era, recipes that were developed when CNNs ruled computer vision.

We’re thinking: The transformer architecture’s hunger for data makes it especially important to reconsider habits around data-related training procedures like augmentation and regularization.