Original link: https://www.msra.cn/zh-cn/news/features/davinci
Editor’s note: Do you ever dig up old movies and animations to relive the past? Do you keep treasured videos that take you back to the good old days? Accustomed as we are to high-definition viewing, the picture quality of those old recordings can be painful to look at today. In this era of exploding multimedia content, demand for video has grown ever stronger, and technologies for creating, enhancing, and re-creating video material have advanced greatly. With existing video restoration tools, editors can indeed make videos sharper, but only with an extremely powerful computer and processing time several times, even dozens of times, the length of the video itself; even then, the results may still be unsatisfactory.
So is it possible to upscale and intelligently interpolate video in real time, at high quality, on end devices with limited computing power, even without a network connection? The answer from Microsoft Research Asia’s intelligent video enhancement toolset “DaVinci” is: “I can!” Relying on 4 million high-definition training samples and large-scale low-level vision pre-training, DaVinci can restore video quality on-device at low computational cost. For practical production needs in particular, large-scale low-level vision pre-training further improves the robustness of the models, making them applicable to more challenging scenarios.
In November 1998, Microsoft Research Asia was established in Beijing. Bill Gates, then CEO of Microsoft, recorded a video for the occasion. Let’s first take a look at a clip.
For those of us used to watching high-definition video today, the quality of this clip may seem rather poor. To address the pain points of existing video enhancement and restoration tools and bring the advantages of AI to bear, researchers from the Multimedia Search and Mining Group at Microsoft Research Asia combined their experience in deep learning and machine learning with real-world scenarios and needs to launch an intelligent video enhancement toolset, “DaVinci.” It greatly lowers the barrier to processing video material: a single click makes video clearer and smoother.
Now take a look at the version below, restored by DaVinci. Doesn’t it feel as though the sun has suddenly come out?
“Da Vinci was one of the most outstanding artists of the Renaissance. He combined artistic creation with science, leaving behind many immortal works. We hope that the intelligent video enhancement toolset ‘DaVinci’ can likewise bring AI technology into the field of multimedia content processing, letting video and image creators better express their creativity. That is why we named the project ‘DaVinci,’” said Yang Huan, a researcher at Microsoft Research Asia who works on the project.
DaVinci’s three core skills: academic concepts put into real-world use
According to Fu Jianlong, a researcher at Microsoft Research Asia, academia began studying image and video processing long ago and has explored theoretical methods in many directions, but to truly put these proof-of-concept innovations into practice, the feasible directions must be carefully screened. “After careful investigation, we concluded that image quality enhancement in two scenarios, general images and video conferencing, has great potential. Behind them lie mainly image/video super-resolution, video frame interpolation, and compressed video super-resolution. With the support of these technologies, they have better opportunities to land in real applications, and are most likely to let people experience the advantages of today’s AI technology.” Ultimately, these three technologies were integrated into the DaVinci toolset and released as open source for users to download and use.
Relying on the innovative Transformer-based image/video super-resolution, video frame interpolation, and compressed video super-resolution technologies developed at Microsoft Research Asia, the DaVinci toolset helps users meet video enhancement needs in different scenarios in real time. Whether online or offline, it can generate clear, coherent, high-quality video, greatly improving the viewing experience.
Video super-resolution: reconstructing a sequence of high-resolution frames from a sequence of low-resolution frames. The most intuitive effect is that the picture looks sharper and the details more vivid, meeting people’s growing demand for video clarity and better matching ever-higher display resolutions. For example, an old 480p video can be turned into a 2K/4K high-definition version whose picture quality holds up on both small and large screens.
Comparison of video super-resolution results (left: traditional bicubic algorithm; right: algorithm provided by the DaVinci toolset)
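To make the task concrete, here is a minimal sketch of what naive upscaling does. It uses nearest-neighbor replication (even simpler than the bicubic baseline above); the frame size and scale factor are illustrative assumptions, and this is not the DaVinci model:

```python
import numpy as np

def upscale_nearest(frame: np.ndarray, scale: int = 4) -> np.ndarray:
    # Each pixel becomes a scale x scale block. No new detail is invented,
    # which is why learned super-resolution looks so much sharper.
    return frame.repeat(scale, axis=0).repeat(scale, axis=1)

lr = np.zeros((480, 640, 3), dtype=np.uint8)   # a stand-in 480p frame
hr = upscale_nearest(lr, scale=3)
print(hr.shape)  # (1440, 1920, 3) -- roughly 2K
```

A learned super-resolution network has the same input/output shapes but hallucinates plausible high-frequency detail instead of merely enlarging pixels.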
Video frame interpolation: synthesizing frames that do not exist between two adjacent frames. The current mainstream video frame rate is 24 frames per second, i.e., 24 pictures shown each second. As video processing and display hardware improves, this original frame rate no longer satisfies viewers. In sports events and games especially, raising the frame rate to 60 or even 120 frames per second makes the picture smoother and reduces the dizziness caused by an insufficient frame rate. The technology applies to many scenarios, including slow-motion video, frame-rate conversion, and more.
Comparison of video frame interpolation results (left: traditional frame-interleaving algorithm; right: algorithm provided by the DaVinci toolset)
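As a contrast to learned interpolation, the most naive baseline simply blends the two neighboring frames. The toy frames below are assumptions for illustration, not the toolset’s method:

```python
import numpy as np

def blend_midframe(f0: np.ndarray, f1: np.ndarray, t: float = 0.5) -> np.ndarray:
    # Naive interpolation: a linear blend of the two frames. Moving objects
    # "ghost" with this approach; learned interpolators instead estimate
    # motion and warp pixels along it before blending.
    mid = (1.0 - t) * f0.astype(np.float32) + t * f1.astype(np.float32)
    return mid.round().astype(np.uint8)

f0 = np.zeros((4, 4, 3), dtype=np.uint8)        # dark frame
f1 = np.full((4, 4, 3), 200, dtype=np.uint8)    # bright frame
mid = blend_midframe(f0, f1)
print(mid[0, 0, 0])  # 100
```

The ghosting of this baseline on fast motion is exactly what motion-aware interpolation methods are designed to avoid.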
Compressed video super-resolution: restoring high-resolution video frames from compressed, low-resolution ones. To ensure high transmission rates on the Internet, or to deliver video as smoothly as possible over limited network conditions, most video on the Internet or on user devices is stored and transmitted in compressed form. Compression, however, costs quality: to end viewers the video can look blocky and mosaicked, especially in scenes with heavy motion. Compressed video super-resolution repairs this loss and improves the picture.
Comparison of compressed video super-resolution results (left: traditional bicubic algorithm; right: algorithm provided by the DaVinci toolset)
All kinds of needs on all kinds of devices: DaVinci’s innovative design handles them all
What a technology presents in an academic paper is usually the upper bound it can reach under ideal conditions; turned into a tool for practical use, it must cope with all sorts of worst cases. For example, we cannot predict what kinds of material users will feed a video enhancement tool: childhood footage shot on a phone, scenery captured on a DV camcorder, film movies, nostalgic music videos saved as MP4s, or a compressed 4K movie shared by a friend. The model therefore needs to be robust enough to handle these varied demands.
In addition, the user’s deployment environment is unknown. Most devices will be phones, laptops, or desktops, but their memory, CPU, and graphics-card performance varies widely. Researchers must also consider power consumption: on mobile devices such as phones, both power draw and processing time need careful optimization. Moreover, model performance degrades to some extent when moving from a laboratory server to an end device. Ensuring a consistent experience across all devices was another important challenge in designing the DaVinci models.
DaVinci’s three skills are underpinned by industry-leading low-level vision pre-training, supported by training on large amounts of data. For robustness, the researchers collected 4 million public images and video clips; to further enlarge the training set and diversify data types, they also synthesized additional degraded, noisy data from the existing material, so that model training covers more of users’ real application scenarios.
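As an illustration of this kind of data synthesis, the sketch below fabricates a low-resolution, noisy counterpart of a high-resolution frame to form a training pair. The strided downsampling and Gaussian noise are assumptions for illustration; the actual DaVinci degradation pipeline is presumably richer (blur kernels, compression artifacts, etc.):

```python
import numpy as np

rng = np.random.default_rng(0)

def degrade(frame: np.ndarray, scale: int = 4, noise_sigma: float = 5.0) -> np.ndarray:
    # Downsample by striding, then add Gaussian noise, producing the kind of
    # low-quality input the model must learn to restore.
    lr = frame[::scale, ::scale].astype(np.float32)
    lr += rng.normal(0.0, noise_sigma, size=lr.shape)
    return np.clip(lr, 0, 255).astype(np.uint8)

hr = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
lr = degrade(hr)           # (lr, hr) forms one synthetic training pair
print(lr.shape)  # (64, 64, 3)
```

Varying the degradation parameters per sample is what lets a single model generalize across many real-world sources.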
To suit diverse deployment environments, the researchers designed the models to be lightweight, with special optimizations to the network structure and model storage. Traditional video processing methods, for example, consider the entire temporal sequence when processing each frame, which greatly increases the amount of computation. The researchers at Microsoft Research Asia observed that video playback traces the motion trajectories of objects over time: only the content along a trajectory helps enhance the current frame, while content elsewhere is far less relevant.
Accordingly, the researchers proposed the trajectory-aware Transformer for video super-resolution (TTVSR), an extension of Microsoft Research Asia’s earlier texture Transformer for image super-resolution (TTSR). For super-resolution and frame interpolation, the trajectory-aware Transformer reduces the cost per query from time × the pixels of a full frame down to computation along the temporal sequence alone, simplifying the computational complexity of the whole model. Industrial models used to take ten minutes or even an hour to process one minute of video; DaVinci can now complete high-definition processing in real time or even faster than real time. On compressed video super-resolution, DaVinci better preserves high-frequency visual details and guides the generation of high-frequency textures, reducing the impact of compression artifacts.
Yang Huan and Fu Jianlong note that video is richer than still images: besides the spatial dimension, the temporal dimension must be considered, and the computational demand is higher. Video processing therefore needs a more sophisticated design such as the trajectory-aware Transformer. For example, given the continuity and correlation between frames, if a person appears at a certain position in one frame and has moved slightly to the left in the next, then enhancement computations for that person only need to follow their trajectory, with no need to search and compute over the entire video.
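The saving this buys can be sketched with rough numbers. The grid size and frame count below are illustrative assumptions, not TTVSR’s actual configuration:

```python
# Back-of-the-envelope cost of attending over a whole clip vs. along a
# trajectory (illustrative only, not the TTVSR implementation).
T, H, W = 24, 64, 64                      # frames, token-grid height/width
queries = H * W                           # tokens in the frame being enhanced
full_attention = queries * (T * H * W)    # each query scores every token in the clip
trajectory_attention = queries * T        # each query scores only the T tokens on its path
print(full_attention // trajectory_attention)  # 4096
```

With these numbers, restricting attention to trajectories cuts the number of similarity comparisons by a factor of H × W = 4096, which is why the approach scales to real-time processing.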
Outperforming existing methods on industry benchmarks
Measured by two metrics widely used in the industry, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), the DaVinci toolset outperforms existing methods. The table below shows the results of the trajectory-aware Transformer for video super-resolution (TTVSR) on the most challenging REDS4 dataset, where PSNR improves by 0.70 dB and 0.45 dB over BasicVSR and IconVSR, respectively.
Test Results of Trajectory-Aware Video Super-Resolution Transformer (TTVSR) on REDS4 Dataset
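For reference, PSNR, the metric quoted above, can be computed in a few lines; the toy images here are assumptions for illustration:

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    # PSNR in dB: log-scaled ratio of peak signal power to mean squared error.
    # Higher is better; identical images give infinity.
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.full((8, 8), 100, dtype=np.uint8)
noisy = np.full((8, 8), 110, dtype=np.uint8)   # off by 10 everywhere -> MSE = 100
print(round(psnr(ref, noisy), 2))  # 28.13
```

Because PSNR is logarithmic, gains of a few tenths of a dB, like the 0.70 dB improvement cited above, represent meaningful reductions in reconstruction error.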
Applying these Transformer-based video super-resolution techniques to frame interpolation and compressed video super-resolution also yields very good results: on the Vimeo-90K frame interpolation dataset they bring a 0.36 dB PSNR improvement, and on the REDS4 compressed video super-resolution benchmark at a CRF 25 compression level, a remarkable 1.04 dB PSNR improvement.
The tests above use the specific degradation models adopted in academia. In real usage, however, videos uploaded by users are not high-quality standard material, and there is no ground truth to compare against. To get closer to users’ real needs, researchers at Microsoft Research Asia also designed CKDN, a video quality evaluation method that “requires no standard answer,” i.e., quality assessment without a pristine reference, aiming to give the industry another reference point for evaluating video processing methods. (Paper link: https://ift.tt/1iDITuH)
Download the toolset executables for crisp, silky-smooth video
At present, Microsoft Research Asia has packaged and released some executables from the DaVinci toolset, and the project homepage on GitHub will launch in the near future; there, the researchers will publish and update more video enhancement tools. Professional developers can use the toolset to try out their own scenarios, integrate it deeply with their businesses, or build on it further. Ordinary users with no technical background can also simply download and run the executables to experience clear, silky-smooth video.
DaVinci toolset downloads
Image Super Resolution: https://ift.tt/EFIzeb9
Video Super Resolution: https://ift.tt/zuwVOeM
Links to related papers:
Learning Texture Transformer Network for Image Super-Resolution
Learning Trajectory-Aware Transformer for Video Super-Resolution
Learning Conditional Knowledge Distillation for Degraded-Reference Image Quality Assessment
The Da Vinci project GitHub page (coming soon, stay tuned!):
If you find any issues with the toolset, please contact us by emailing [email protected] or filing an issue on the upcoming GitHub page.