Vision transformers have bested convolutional neural networks (CNNs) in a number of key vision tasks. Have CNNs hit their limit? New research suggests otherwise.
What’s new: Sanghyun Woo and colleagues at Korea Advanced Institute of Science & Technology, Meta, and New York University built ConvNeXt V2, a purely convolutional architecture that, after pretraining and fine-tuning, achieved state-of-the-art performance on ImageNet. ConvNeXt V2 improves upon ConvNeXt, which updated the classic ResNet.
Key insight: Vision transformers learn via masked pretraining — that is, hiding part of an image and learning to reconstruct the missing part. This enables them to learn from unlabeled data, which simplifies amassing large training datasets and thus enables them to produce better embeddings. If masked pretraining works for transformers, it ought to work for CNNs as well.
How it works: ConvNeXt V2 is an encoder-decoder pretrained on 14 million images in ImageNet 22k. For the decoder, the authors used a single ConvNeXt convolutional block (made up of three convolutional layers). They modified the ConvNeXt encoder (36 ConvNeXt blocks) as follows:
- The authors removed LayerScale from each ConvNeXt block. In ConvNeXt, this operation learned how much to scale each layer’s output, but in ConvNeXt V2, it didn’t improve performance.
- They added to each block a scaling operation called global response normalization (GRN). A block’s intermediate layer generated an embedding with 384 values, known as channels. GRN scaled each channel based on its magnitude relative to the magnitude of all channels combined. This scaling narrowed the range of channel activation values, which prevented feature collapse, a problem with ConvNeXt in which channels with small weights don’t contribute to the output.
- During pretraining, ConvNeXt V2 split each input image into a 32x32 grid and masked random grid squares. Given the masked image, the encoder learned to produce an embedding. Given the embedding, the decoder learned to reproduce the unmasked image.
- After pretraining, the authors fine-tuned the encoder to classify images using 1.28 million images from ImageNet 1k.
Results: The biggest ConvNeXt V2 model (659 million parameters) achieved 88.9 percent top-1 accuracy on ImageNet. The previous state of the art, MViTV2 (a transformer with roughly the same number of parameters) achieved 88.8 percent accuracy. In addition, ConvNeXt V2 required less processing power: 600.7 gigaflops versus 763.5 gigaflops.
Why it matters: Transformers show great promise in computer vision, but convolutional architectures can achieve comparable performance with less computation.
We’re thinking: While ImageNet 22k is one of the largest publicly available image datasets, vision transformers benefit from training on proprietary datasets that are much larger. We’re eager to see how ConvNeXt V2 would fare if it were scaled to billions of parameters and images. In addition, ImageNet has been joined by many newer benchmarks. We’d like to see results for some of those.