How to perform image search efficiently and accurately? A look at lightweight vision pre-trained models

Original link: https://www.msra.cn/zh-cn/news/features/lightweight-vision-pre-training

Editor’s note: Have you ever struggled with image retrieval, either unable to pinpoint the image you want in a massive collection, or getting unsatisfactory results from text-based search? To address this problem, researchers from Microsoft Research Asia and the Microsoft Cloud Computing and Artificial Intelligence Division have studied lightweight vision models in depth and proposed a series of design and compression methods for vision pre-trained models that meet the requirements for deploying vision Transformers in lightweight form. These methods and models have been successfully applied in the Microsoft Bing search engine, enabling accurate and fast inference and retrieval over tens of billions of images. This article explains in depth the development, key technologies, applications, and potential of lightweight vision pre-trained models, as well as future opportunities and challenges, in the hope of helping readers better understand the field of lightweight vision pre-training and jointly advance the technology.


Recently, Transformer-based vision pre-trained models have achieved superior performance on many computer vision tasks and attracted extensive attention. However, vision Transformer pre-trained models usually have large parameter counts and high complexity, which restricts their deployment in practical applications, especially on resource-constrained devices or in scenarios with strict real-time requirements. Making large vision pre-trained models “lightweight” has therefore become a new focus of attention in both academia and industry.

In this context, researchers from Microsoft Research Asia and the Microsoft Cloud Computing and Artificial Intelligence Division have explored the structural design, training, and inference of large vision models in depth, and built innovative applications around making large models lightweight, real-time, and deployable in the cloud. This article starts from the development of lightweight vision pre-trained models, discusses the key technologies in model-lightweighting research and the application and potential of lightweight vision Transformer models in real products, and concludes with a look at the future opportunities and challenges of lightweight vision models.

Large vision models emerge one after another, but lightweight pre-trained models remain scarce

In recent years, the progress of deep learning on the ImageNet image classification task has mainly come from the rapid expansion of vision model capacity. As shown in Figure 1, in just a few years the capacity of vision pre-trained models has grown more than 300-fold, from the ResNet-101 model with 44.5 million parameters to the V-MoE model with 15 billion parameters. These large pre-trained models have made great strides in tasks such as image understanding and visual content generation.

Figure 1: Trend in the parameter counts of vision pre-trained models

Whether it is Microsoft’s 3-billion-parameter Swin-V2 model or Google’s 1.8-billion-parameter ViT-G/14 model, large vision models have demonstrated superior performance on many tasks; in particular, their powerful few-shot and even zero-shot generalization is critical to achieving general intelligence.

However, in many practical scenarios, limits on storage and computing resources make large models difficult to deploy directly or unable to meet real-time requirements. Research on lightweight vision pre-trained models has therefore become increasingly important and has strong practical value. Although some existing work discusses lightweight models, most of those methods are designed for specific tasks and specific architectures; generality is not considered during design and training, so they are limited in their ability to generalize across data domains and tasks.

Key technologies in lightweight vision model research

To build lightweight vision pre-trained models, Microsoft researchers identified two key questions: 1) How to design a more general lightweight model structure? 2) Given the limited capacity of lightweight vision pre-trained models, how to design efficient pre-training methods so that small models can also learn effective information from large-scale data? Facing these difficulties, the researchers have achieved initial results through persistent research and exploration.

Since the core of improving the generality of lightweight pre-trained models is to strengthen the model’s learning ability under limited resources (parameter count, latency, etc.) so that it can better learn general-purpose features from large-scale data, the researchers explored the problem in depth from the following three perspectives:

1. Lightweight module design

Lightweight, low-latency modules are an important component of lightweight models. In convolutional neural networks, representative lightweight modules include the Inverted Residual Block of MobileNet and the Shuffle Unit of ShuffleNet. For the vision Transformer, attention computed between image patches does not account well for relative position information, so the researchers designed iRPE [1], a plug-and-play, lightweight two-dimensional relative position encoding method for images that improves model performance without modifying any training hyperparameters. In addition, to address parameter redundancy in vision Transformers, the researchers designed a Weight Multiplexing module [2]. As shown in Figure 2, this method reduces parameter redundancy by reusing weights across multiple layers and introduces non-shared linear transformations to increase parameter diversity.

Figure 2: Weight Multiplexing Module in Transformer
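To make Figure 2 concrete, here is a minimal PyTorch sketch of the weight-multiplexing idea; it is an illustration under our own assumptions, not the actual MiniViT [2] implementation, and the class and parameter names are invented for the example. Several Transformer layers reuse one shared attention/MLP block, and each layer applies its own small linear transformation to restore diversity:

```python
import torch
import torch.nn as nn

class MultiplexedBlocks(nn.Module):
    """Sketch of weight multiplexing: num_layers Transformer layers reuse
    one shared attention/MLP block; each layer adds a small, non-shared
    linear transformation so its output stays distinct."""

    def __init__(self, dim: int, num_heads: int, num_layers: int):
        super().__init__()
        # One set of shared (multiplexed) weights used by every layer.
        self.shared_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.shared_mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Per-layer, non-shared transformations: cheap compared with a full
        # block, they recover diversity across the reused layers.
        self.transforms = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for transform in self.transforms:
            h = self.norm1(x)
            h, _ = self.shared_attn(h, h, h, need_weights=False)
            x = x + h
            x = x + self.shared_mlp(self.norm2(x))
            x = transform(x)  # layer-specific variation on the shared weights
        return x

# Four "layers" cost roughly one block plus four dim-by-dim linear maps.
blocks = MultiplexedBlocks(dim=192, num_heads=3, num_layers=4)
out = blocks(torch.randn(2, 196, 192))  # (batch, patches, dim)
```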

2. Lightweight Model Search

Neural architecture search (NAS) can automatically find lighter and better-performing model structures within a model design space [3]. For convolutional neural networks, representative works include NASNet and EfficientNet. For structure search in vision Transformers, the researchers successively proposed AutoFormer [4] and S3 [5], which cover multiple dimensions of the vision model such as channel width, network depth, and number of attention heads, and realize dynamically scalable training and structure search. At the same accuracy, the searched models have fewer parameters and less computation than hand-designed ones. Notably, in S3 the researchers used the E-T Error metric [5] and a weight-sharing supernet to guide and improve the search space; while obtaining a more efficient model structure, they also analyzed how the search space evolves, as shown in Figure 3. In addition, the search process provides useful experience and reference for the manual design of lightweight models.

Figure 3: Lightweight model search space evolution process
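The weight-sharing mechanism at the heart of this kind of search can be sketched briefly. The snippet below is a simplified illustration in the spirit of one-shot NAS as in AutoFormer/S3, not the released code; the search dimensions (DEPTHS, WIDTHS, MLP_RATIOS) and all names are invented for the example. A “super” layer holds the maximal weights, and every sampled subnet runs on a slice of them:

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative search dimensions; the real spaces in [4, 5] differ.
DEPTHS = [4, 6, 8]          # number of Transformer layers
WIDTHS = [192, 256, 384]    # embedding dimension
MLP_RATIOS = [2, 3, 4]      # MLP hidden size / embedding dimension

class SuperLinear(nn.Linear):
    """A linear layer that can run on a slice of its weights, so every
    sampled subnet shares (and trains) the same underlying parameters."""

    def forward_sliced(self, x, in_dim, out_dim):
        return F.linear(x, self.weight[:out_dim, :in_dim], self.bias[:out_dim])

def sample_subnet():
    """Uniformly sample one architecture from the search space."""
    return {
        "depth": random.choice(DEPTHS),
        "width": random.choice(WIDTHS),
        "mlp_ratio": random.choice(MLP_RATIOS),
    }

# During supernet training, one subnet is sampled per step and only its
# slice of the shared weights receives gradients; after training, candidate
# subnets are ranked on validation data without retraining each one.
layer = SuperLinear(max(WIDTHS) * max(MLP_RATIOS), max(WIDTHS))
cfg = sample_subnet()
hidden = cfg["width"] * cfg["mlp_ratio"]
x = torch.randn(1, 196, hidden)               # (batch, patches, hidden)
y = layer.forward_sliced(x, hidden, cfg["width"])
print(cfg, tuple(y.shape))
```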

3. Compression and knowledge transfer for large vision models

Another difficulty for lightweight pre-trained models is that, due to limited model capacity, it is hard for them to directly learn the rich information and knowledge contained in large-scale data. To solve this problem, the researchers proposed a fast pre-training distillation scheme that transfers knowledge from a large model to a lightweight small model [6]. As shown in Figure 4, unlike traditional single-stage knowledge distillation, fast pre-training distillation proceeds in two stages: 1) compress and save the data-augmentation information and prediction information produced while training the large model; 2) load and restore the large model’s predictions and data augmentations, then use the large model as a teacher to guide the learning and training of the lightweight student model through pre-training distillation. Unlike pruning and quantization, this method builds on the weight multiplexing mentioned above [2]: by introducing lightweight weight transformations and distillation, it successfully compresses large vision pre-trained models and yields general, more robust lightweight models. Without sacrificing performance, the method can compress the original large model by dozens of times.

Figure 4: Fast pre-training knowledge distillation
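The two stages can be sketched in a few lines of PyTorch. This is a simplified illustration of the scheme in [6] under assumptions made for the example: only the teacher’s top-K logits per augmented image are stored, together with the seed of the augmentation, and all function names are ours rather than an actual API.

```python
import torch
import torch.nn.functional as F

K = 10  # keep only the teacher's top-K logits as sparse soft labels

@torch.no_grad()
def save_teacher_outputs(teacher, images, aug_seed):
    """Stage 1 (offline): run the expensive teacher once and store only its
    top-K logits plus the seed of the data augmentation that produced
    `images`, which keeps the cache small at pre-training scale."""
    values, indices = teacher(images).topk(K, dim=-1)
    return {"values": values, "indices": indices, "aug_seed": aug_seed}

def distillation_loss(student, images, saved, num_classes, T=1.0):
    """Stage 2: replay the same augmented images (rebuilt from the saved
    seed) and train the student to match the sparse teacher labels."""
    # Densify the sparse logits; classes the teacher did not rank highly
    # get a large negative score, i.e. near-zero probability mass.
    teacher_logits = torch.full(
        (images.size(0), num_classes), -1e4, device=images.device
    )
    teacher_logits.scatter_(1, saved["indices"], saved["values"])
    student_logits = student(images)
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
```

Because the teacher never runs during stage 2, the student can be trained, or retrained, many times at roughly the cost of training the small model alone.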

This line of research has not only produced many papers at top computer vision conferences (CVPR, ICCV, ECCV, NeurIPS, etc.) [1-6]; through cooperation with Microsoft Bing, the lightweight pre-trained models have also been applied in image search products, improving the understanding of image and video content in actual business.

Applications of lightweight vision pre-trained models

Lightweight vision pre-trained models have many practical uses, especially in scenarios with strict real-time requirements or limited resources, such as real-time rendering and enhancement of cloud video and on-device understanding of image and video content. Lightweight vision models have shown broad application prospects in fields such as smart retail and advanced manufacturing, and will play an important role in emerging industries such as the metaverse and autonomous driving. Taking image content search in Microsoft Bing as an example, the following shows the practical application and deployment of lightweight vision models.

At present, content-based image search is relatively mature at understanding the category attributes of an image, but understanding the content of complex scenes remains very challenging. Images of complex scenes usually feature large depth of field, cluttered backgrounds, many people, and complex relationships between objects, which significantly increases the difficulty of content understanding and places higher demands on the robustness and generalization of pre-trained models.

For example, the search quality for anime images could not be improved effectively for a long time. The main challenges include: lines and colors are drawn in a more exaggerated way than in photos of real scenes; the images contain more actions and scenes; and style and content vary enormously between different manga. Figures 5 to 7 show characters and actions from “Slam Dunk”, “Pikachu”, and “Captain Tsubasa”, whose cartoon styles and content differ greatly. Understanding the content of such images effectively places higher demands on vision pre-trained models.

Figure 5: Understanding of actions in “Slam Dunk” in the Microsoft Bing search engine, including dunking, dribbling, stealing, shooting, etc.

Figure 6: Understanding of Pikachu’s behavior in the Microsoft Bing search engine, such as eating an apple, eating watermelon, eating ice cream, etc.

Figure 7: Close-ups in the Microsoft Bing search engine of the young footballer’s shooting action

The lightweight general-purpose vision model and the fast pre-training distillation algorithm described above have been successfully applied in the Microsoft Bing search engine. With the help of the vision-language multimodal pre-trained model provided by Microsoft Research Asia, Microsoft Bing’s image search has a stronger understanding of comic content and can return image content that better matches user needs.

At the same time, the huge index database of the Microsoft Bing search engine places very high demands on retrieval efficiency. The fast pre-training distillation method provided by Microsoft Research Asia effectively transfers the indexing ability of the pre-trained large model to the lightweight model, improving the recognition accuracy of the existing model by 14% while greatly optimizing its computational efficiency, enabling fast inference over tens of billions of images.
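The article does not describe Bing’s indexing pipeline in detail, but the role a distilled lightweight encoder plays in embedding-based retrieval can be sketched generically; the functions below are our own illustration, not Bing’s implementation. Gallery images are embedded once offline, and a query is answered with a normalized dot product:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_index(encoder, image_batches):
    """Offline: encode the image gallery once with the lightweight model
    and L2-normalize, so similarity search is a single matrix multiply."""
    embeddings = [F.normalize(encoder(batch), dim=-1) for batch in image_batches]
    return torch.cat(embeddings)  # (num_images, dim)

@torch.no_grad()
def search(encoder, queries, index, top_k=5):
    """Online: embed the query and rank gallery images by cosine
    similarity (the dot product of unit-norm vectors)."""
    q = F.normalize(encoder(queries), dim=-1)  # (num_queries, dim)
    scores = q @ index.T                       # (num_queries, num_images)
    return scores.topk(top_k, dim=-1)          # scores and image indices
```

At the scale of tens of billions of images, the brute-force matrix multiply would be replaced by an approximate nearest-neighbor index; the lightweight encoder’s job is to make both offline indexing and online query embedding cheap.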

Future opportunities and challenges

Model lightweighting is central to the future application of artificial intelligence. As vision technology, algorithms, computing power, and data continue to improve, model complexity has risen sharply, and the energy cost of neural network computation keeps growing. The efficient computation and low deployment cost of lightweight vision models will be a major advantage in more and more real products. In addition, running lightweight pre-trained vision models locally can better protect user data and privacy while supporting more services: user data no longer needs to leave the device, and features such as model services can still be upgraded remotely.

Of course, the researchers are also aware of the challenges that lightweight pre-trained vision models face. On the one hand, in model structure design, how to achieve the best learning ability under constraints on parameter count and inference latency has long been a question followed closely by academia and industry. Although many effective model structures have accumulated, and great progress has been made in areas such as the Universal Approximation Theorem (UAT) and neural architecture search (NAS), a gap remains between existing lightweight pre-trained vision models and large vision models, and further optimization and improvement are needed. On the other hand, in training methods, academia and industry have proposed a variety of training approaches for large vision models, such as self-supervision, image classification, and multimodal training, which have significantly improved the models’ general capabilities; how to design more effective training methods for lightweight models with limited capacity still requires further research and exploration. Researchers at Microsoft Research Asia will continue to advance research on lightweight pre-trained vision models, and welcome more colleagues in science and technology to exchange ideas and explore related technologies in this field.

References

[1] Rethinking and Improving Relative Position Encoding for Vision Transformer, ICCV 2021.

[2] MiniViT: Compressing Vision Transformers with Weight Multiplexing, CVPR 2022.

[3] Cyclic Differentiable Architecture Search, TPAMI 2022.

[4] AutoFormer: Searching Transformers for Visual Recognition, ICCV 2021.

[5] Searching the Search Space of Vision Transformer, NeurIPS 2021.

[6] TinyViT: Fast Pretraining Distillation for Small Vision Transformers, ECCV 2022.

Authors: Peng Houwen, Yan Haoran, Li Bichong, Fu Jianlong, Wei Sining

