Vision Transformers Made Manageable: FlexiViT, the vision transformer that allows users to specify the patch size

Vision transformers typically process images in patches of fixed size. Smaller patches yield higher accuracy but require more computation. A new training method lets AI engineers adjust the tradeoff.

What’s new: Lucas Beyer and colleagues at Google Research trained FlexiViT, a vision transformer that allows users to specify the desired patch size.

Key insight: Vision transformers turn each patch into a token using two matrices of weights, whose values describe the patch’s position and appearance. The dimensions of these matrices depend on patch size. Resizing the matrices enables a transformer to use patches of arbitrary size.
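
To make the dependence on patch size concrete, here is a minimal sketch of how the token count and the shapes of the two weight matrices change with patch size. The function names are hypothetical, the 240x240 image size is an assumption chosen because it divides evenly by many patch sizes, and the embedding width of 768 is an assumed ViT-B-style value. The appearance weights are modeled as a flattened linear projection.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total tokens) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("patch size must divide the image size evenly")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


def embedding_shapes(image_size, patch_size, channels=3, dim=768):
    """Shapes of the appearance and position weights (illustrative only)."""
    _, tokens = patch_grid(image_size, patch_size)
    # Appearance: one weight per pixel value in a flattened patch, per output dim.
    appearance = (patch_size * patch_size * channels, dim)
    # Position: one learned vector per patch location in the grid.
    position = (tokens, dim)
    return appearance, position


# Assumed image size of 240 pixels per side (hypothetical, for illustration).
for p in (8, 16, 30, 48):
    per_side, tokens = patch_grid(240, p)
    print(f"patch {p}x{p}: {tokens} tokens, shapes {embedding_shapes(240, p)}")
```

Because self-attention cost grows roughly with the square of the token count, halving the patch side (which quadruples the tokens) raises the attention cost far more than fourfold.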

How it works: The authors trained a standard vision transformer on patches of random sizes between 8x8 and 48x48 pixels. They trained it to classify ImageNet-21K (256x256 pixels).

  • FlexiViT learned a matrix of size 32x32 to describe each patch’s appearance and a matrix of size 7x7 to describe its position.
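  • Given an image, FlexiViT resized the matrices according to the desired patch size without otherwise changing the architecture. To accomplish this, the authors developed a method they call pseudo-inverse resize (PI resize), which resizes the appearance weights so that a resized patch yields approximately the same token as the original patch did with the original weights.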
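
The idea behind resizing weights with a pseudo-inverse can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: it handles a single-channel patch, uses a simple linear-interpolation resize with half-pixel centers, and all function names are hypothetical. If resizing a flattened patch is a linear map B, the new weights are chosen so that the resized patch produces the same token as the original patch, which leads to applying the pseudo-inverse of B transposed to the old weights.

```python
import numpy as np


def resize_matrix_1d(old: int, new: int) -> np.ndarray:
    """Linear-interpolation resize along one axis as a (new, old) matrix."""
    # Source coordinate of each output pixel (half-pixel centers), clamped to the edges.
    src = np.clip((np.arange(new) + 0.5) * old / new - 0.5, 0, old - 1)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, old - 1)
    frac = src - lo
    R = np.zeros((new, old))
    rows = np.arange(new)
    R[rows, lo] += 1.0 - frac
    R[rows, hi] += frac
    return R


def resize_matrix_2d(old: int, new: int) -> np.ndarray:
    """Resize of a flattened (row-major) old x old patch as a (new*new, old*old) matrix."""
    R = resize_matrix_1d(old, new)
    return np.kron(R, R)  # the resize is separable, so the 2D map is a Kronecker product


def pi_resize(weights: np.ndarray, old: int, new: int) -> np.ndarray:
    """Map (old*old, dim) patch-embedding weights to (new*new, dim)."""
    B = resize_matrix_2d(old, new)
    # Solve B.T @ new_weights = weights in the least-squares sense.
    return np.linalg.pinv(B.T) @ weights


# Demo with assumed sizes: enlarge a 4x4 patch embedding to 6x6.
rng = np.random.default_rng(0)
w_old = rng.normal(size=(16, 8))   # hypothetical weights, embedding width 8
patch = rng.normal(size=16)        # a flattened 4x4 patch
w_new = pi_resize(w_old, 4, 6)
big_patch = resize_matrix_2d(4, 6) @ patch
print(np.allclose(big_patch @ w_new, patch @ w_old))
```

When the patch is enlarged, as in this demo, the match between old and new tokens is exact up to floating-point error. When the patch is shrunk, information is lost and the pseudo-inverse gives only a least-squares approximation. Resizing the weights directly as if they were an image (B @ w_old) does not preserve the tokens.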

Results: The authors compared FlexiViT to two vanilla vision transformers, ViT-B/16 and ViT-B/30, trained on ImageNet-21K using patch sizes of 16x16 and 30x30 respectively. Given patches of various sizes, the vanilla vision transformers’ position and appearance matrices were resized in the same manner as FlexiViT’s. FlexiViT performed consistently well across patch sizes, while the models trained on a fixed patch size performed well only with that size. For example, given 8x8 patches, FlexiViT achieved 50.2 percent precision, ViT-B/16 achieved 30.5 percent precision, and ViT-B/30 achieved 2.9 percent precision. Given 30x30 patches, FlexiViT achieved 46.6 percent precision, ViT-B/16 achieved 2.4 percent precision, and ViT-B/30 achieved 47.1 percent precision.

Why it matters: The processing power available at inference varies from one project and deployment to the next. This approach makes it possible to train a single vision transformer and tailor its patch size to accommodate the computation budget at inference.

We’re thinking: Unlike text transformers, for which turning text into a sequence of tokens is relatively straightforward, vision transformers offer many possibilities for turning an image into patches and patches into tokens. It’s exciting to see continued innovation in this area.