The Next Generation of AutoAI: From Model-Centric to Data-Centric

Author | Li Mei

Editor | Chen Caixian

At the beginning of this year, in an interview with IEEE Spectrum, well-known AI scholar Andrew Ng (Wu Enda) called on the field to shift its focus from model-centric to data-centric development. Since the advent of deep learning, neural network architectures have become relatively fixed and mature, so finding ways to improve data has become a new direction for AI research and development.

Recently, Dr. Wang Xiaoyu, first-prize winner of the 2021 Wu Wenjun Artificial Intelligence Science and Technology Progress Award and chief scientist of Yuntian Lifei, delivered a keynote talk entitled “Towards Automated Artificial Intelligence” at the annual conference of the artificial intelligence industry. In the talk, Dr. Wang detailed the three development stages of AutoML/AutoAI and introduced YMIR, an automated AI model production platform whose development he led at Yuntian Lifei.

Wang Xiaoyu, currently chief scientist of Yuntian Lifei, was previously chair of computer vision at Snap Research Institute and a research scientist at NEC American Research Institute. He received his bachelor’s degree from the University of Science and Technology of China, and later obtained a master’s degree in statistics and a doctorate in electrical and computer engineering from the University of Missouri in the United States. His main research fields are computer vision, machine learning, and data mining. He is among the AI scholars who have won the Wu Wenjun Artificial Intelligence Science and Technology Progress Award, with work spanning these three directions.

AI Technology Review has edited Dr. Wang Xiaoyu’s talk at the Wu Wenjun Award conference without changing its original meaning, and held an in-depth conversation with him about AutoAI.

1 The three stages of AutoML/AutoAI

Stage 1: Automation of Model Design and Parameter Tuning

Many scholars have noticed that outstanding researchers in academia and industry spend too much of their time on model architecture design and parameter tuning, which should not be the main content of research. So is there an automated way for a deep learning network to evolve its own architecture when faced with a problem?

This year, scholars interested in this question co-founded the first International Conference on Automated Machine Learning (AutoML Conference 2022), to be held in Baltimore from July 25 to July 27, 2022.

For the conference, scholars outlined 10 topics covered by automated machine learning:

Neural Architecture Search (NAS)
Hyperparameter Optimization (HPO)
Combined Algorithm Selection and Hyperparameter Optimization (CASH)
Automated Data Mining
Automated Reinforcement Learning (AutoRL)
Meta-Learning and Learning to Learn
Bayesian Optimization for AutoML
Evolutionary Algorithm for AutoML
Multi-Objective Optimization for AutoML
AutoAI (including Algorithm Configuration and Selection)

NAS studies the automatic search and design of neural network structures. The goal of automating hyperparameter optimization (HPO) is that, when training a neural network, we no longer have to spend time hand-picking parameters and judging which values are better or worse; good values can instead be predicted and found automatically. CASH is a harder problem: given a specific problem to solve, it asks which machine learning method can be chosen automatically, rather than having us select and design it by hand.
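As a rough illustration of what automating hyperparameter optimization means, the sketch below runs a random search, one of the simplest HPO strategies, in Python. The objective `validation_loss`, its optimum, and the search ranges are all hypothetical stand-ins for actually training a network; they are not taken from the talk.

```python
import math
import random


def validation_loss(learning_rate: float, weight_decay: float) -> float:
    """Toy stand-in for "train a model and measure its validation loss".

    Hypothetical objective: a smooth bowl in log10 space whose minimum
    sits at learning_rate=1e-2 and weight_decay=1e-3.
    """
    return (math.log10(learning_rate) + 2) ** 2 + (math.log10(weight_decay) + 3) ** 2


def random_search(n_trials: int, seed: int = 0):
    """Sample configurations at random and keep the best one seen."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        # Sample on a log scale, since these values span several orders of magnitude.
        config = {
            "learning_rate": 10 ** rng.uniform(-5, 0),
            "weight_decay": 10 ** rng.uniform(-6, -1),
        }
        loss = validation_loss(**config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


best_config, best_loss = random_search(n_trials=200)
print(best_config, best_loss)
```

In a real system each trial would be a full training run, which is why more sample-efficient strategies such as Bayesian optimization appear in the topic list above.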

Stage 2: Turning Simple Model Training into Software

If the automation in the first stage is mainly for professional algorithm researchers, the systematization in the second stage is for general AI practitioners. Its main goal is to let users train a model through a visual interface, given labeled data. But the second stage is just a good story and is difficult to apply in practice, because it lacks support for the continuous iteration of algorithms in real scenarios.

Stage 3: Automation of Data Iteration

Building on the automation of algorithm design, further changes are taking place. At last year’s NeurIPS conference, well-known artificial intelligence scientist Andrew Ng held a workshop to discuss “which one is more important, model or data”. His view is that in industrial production, technology research and development is shifting from model-centric to data-centric.

The relationship between models and data can be understood through an analogy (this is my personal understanding and does not represent the views of others): a model is like a person’s IQ, and data is like a person’s store of knowledge. Suppose a person is born with a high IQ but is kept at home from a young age, never interacts with society, and is never allowed to learn anything new; that person will still grow up knowing very little. Conversely, a person of mediocre aptitude who has seen the world, studied in Europe and the United States, worked in actual industrial production in China, and seen many design cases may turn out to be more capable than the person with the high IQ. Understood this way, the model is somewhat like IQ and the data is somewhat like knowledge. The two are equally important at first, but over time knowledge matters more and more, because some things can only be known by experiencing them, and “knowing” is more important than “not knowing”.

As industrial-scale development proceeds, we are gradually moving from model-centric to data-centric production. The following figure shows some experimental comparisons made by Andrew Ng (Wu Enda):

We can see that, starting from a baseline algorithm, we can improve its performance along two dimensions. One is the model-centric approach, in which we try to increase the sophistication of the model design. The other is the data-centric approach, such as adding data (there are principled methods for doing so; adding data does not guarantee that performance will improve), checking the data for problems, and so on. He found that the data-centric approach improved performance more than the model-centric approach. In our own model production, we reached the same conclusion: the further a project progresses, the more important data iteration becomes, because every model ultimately serves a specific scenario and uses specific data.

In our practice over the past eight years, we have found that iterating on an algorithm in fact becomes iterating on its data. Seen another way, we have developed a large number of algorithm models so far, but there has never been one for which we collected data once, tuned parameters once, and never had to adjust again. Many models have been iterated on for 5-6 years, and what gets iterated is the data. When we address different needs, we run into generalization problems in different scenarios, and the problems we encounter keep diverging. This is not because the algorithms differ, but because the scenarios and the data to be processed differ, so we need to keep updating and iterating on the data to meet the needs of applications in different scenarios.

Now that algorithm iteration has become data iteration, can data iteration be automated as well? If algorithm design can be automated and data iteration can also be automated, then an end-to-end automated AI model production platform gradually becomes possible.

The automation of data iteration requires technical support as well as system-level support.
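To make the idea of automated data iteration more concrete, here is a minimal Python sketch of one possible loop: train, evaluate, mine the samples the model is least sure about, have them labeled, and retrain. It is a toy under stated assumptions, not YMIR’s implementation: the one-dimensional threshold model, the `annotate` stand-in for a human labeler, and all names and numbers are hypothetical.

```python
import random

TRUE_BOUNDARY = 0.6  # hypothetical ground truth, known only to the "annotator"


def annotate(x: float) -> int:
    """Stand-in for a human annotator who labels one sample."""
    return int(x >= TRUE_BOUNDARY)


def train(labeled):
    """Fit a threshold midway between the largest negative and smallest positive."""
    largest_negative = max(x for x, y in labeled if y == 0)
    smallest_positive = min(x for x, y in labeled if y == 1)
    return (largest_negative + smallest_positive) / 2


def accuracy(threshold, samples):
    """Fraction of labeled samples the threshold classifies correctly."""
    return sum(int(x >= threshold) == y for x, y in samples) / len(samples)


def iterate(target=0.99, budget_per_round=5, max_rounds=10, seed=0):
    """Repeat train -> evaluate -> mine -> label until the target is met."""
    rng = random.Random(seed)
    labeled = [(0.1, 0), (0.9, 1)]  # small initial training set
    validation = [(x, annotate(x)) for x in (rng.random() for _ in range(500))]
    unlabeled = [rng.random() for _ in range(2000)]  # data from the field
    history = []
    for _ in range(max_rounds):
        threshold = train(labeled)
        score = accuracy(threshold, validation)
        history.append(score)
        if score >= target:
            break
        # Mine the samples closest to the decision boundary (least certain).
        unlabeled.sort(key=lambda x: abs(x - threshold))
        mined, unlabeled = unlabeled[:budget_per_round], unlabeled[budget_per_round:]
        # Send only the mined samples for labeling, then retrain next round.
        labeled += [(x, annotate(x)) for x in mined]
    return threshold, history


threshold, history = iterate()
print(threshold, history)
```

The point of the sketch is that the model code never changes between rounds; only the labeled data grows, and it grows where the current model is weakest.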

2 YMIR: Automated AI Model Production Platform

Why do we need to turn AI model production into a platform? Today, it is not only big companies such as Google, Microsoft, Meta, IBM, and Apple that need automated AI model production platforms. Many real estate companies in China have begun to invest in AI. They all need AI talent, and automated AI can reduce their costs. Real estate companies, property management companies, and battery companies like CATL are gradually introducing AI to solve practical problems.

Why is this happening? Because “AI is the electricity of the new era”. AI is a very basic capability that can improve the efficiency of what we do. AI does not change an industry, but it can improve the production efficiency of the industry in which it is applied. Its impact is therefore comprehensive and has gradually spread to non-technology companies. This is all the more true of manufacturing, where many links in the production process can make use of AI. If you want to improve your international competitiveness and production quality, you need AI capabilities to empower production.

But this raises another problem: we do not have that many AI experts, so we need a more scalable way to do AI research and development. That is why we built YMIR, an automated model production system.

YMIR is an open-source, non-profit AI model production platform. It is an international open-source project jointly initiated by us, internationally renowned universities, and Silicon Valley technology companies. We also invited the chief AI officers of several American technology giants to serve as project advisors. The project has been open-sourced on GitHub.

GitHub address: https://ift.tt/XEUhAMs

YMIR: Covers the whole process of model production, focusing on the rapid iteration capability of models

With the support of these technologies, we have built an engineered system, YMIR. YMIR covers the entire model production process and focuses on rapid model iteration. Rather than training a model once, we iterate on the model (through data) until it meets the needs of the real-world scenario.

The following figure shows the overall technical framework. The left side is the initial stage of model production, including data preparation, data labeling, and model training.

R&D practice: Algorithm production efficiency is increased by 6 times, and the demand for algorithm personnel is reduced to 1/10

We have run many large-scale R&D experiments to see whether the system can solve problems in actual production. We tracked the work for about 6 months and assigned 10 annotators, who were either high school students or vocational college graduates, along with algorithm personnel. At present, we cannot do entirely without algorithm personnel: when faced with a problem, deciding how to decompose it into a technical implementation requires their involvement, and we also need them to give annotators some simple training on the system. The algorithm personnel input is about 0.3, that is, they spend about 30% of their day on these tasks; the rest of the time they also do algorithm development, review annotation documents, review model iterations, and look for model problems.

In total, we annotated 750,000 images and 1 million bounding boxes. Annotators spend 90% of their working time on labeling, that is, marking the objects to be detected, and 10% on operating the YMIR system. In 3 months, with 10 annotators and 0.3 algorithm personnel, we produced 50 algorithms, most of which meet the needs of practical applications in urban governance, such as fire extinguisher detection and fire hydrant detection for emergency events. Some algorithms have reached 97% accuracy.

Here is how our costs compare with and without the system:

The cycle is about three months. Without this system, the algorithm manpower input is about 36 people/day, the annotator input is 24 people/day, and the output is six algorithms. With the system, we can produce 51 algorithms in the same period, a production efficiency of about 17 algorithms/month, compared with 3 algorithms/month before. After adopting the automation platform, algorithm production efficiency increased by 6 times, while the demand for algorithm personnel fell to 1/10 of the original.

3 Dialogue with Wang Xiaoyu

AI Technology Review: Yuntian Lifei is an algorithm company, why is it researching AutoAI?

Wang Xiaoyu: We are not a company that only produces algorithms; we provide customers with end-to-end AI solutions.

At the same time, we realize that the foundation for intelligent technology and informatization in our country is still relatively weak. We hope that within five years, companies will recognize the importance of AI, and that when they invest in AI upgrades, an automated AI platform will save them significant cost and become a catalyst for the large-scale adoption of AI. When AI becomes indispensable, there will be opportunities for platform-based hardware, platform-based productivity tools, and platform-based services. We hope that the YMIR AutoAI system will move the industry forward and enable companies to take part in the research and development of next-generation AI technologies and services.

AI Technology Review: You mentioned that AutoML has gone through three stages of development. What is the essential difference between them?

Wang Xiaoyu: The first stage is mainly academic. For example, scholars have launched AutoML Conference 2022. The main focus is on exploring which aspects of algorithm and model design can be automated and how, such as neural architecture search, hyperparameter optimization, and combined algorithm selection.

The second stage is to build an automated production system for algorithm models, consolidating the methodology accumulated in the first stage into a platform and realizing automatic model training in a low-code or even zero-code way. However, this type of platform does not build the process of model iteration into the system and does not cover the complete production cycle of real model training, so it cannot meet the needs of industrial production. I regard AutoML at this stage as a “toy”: you can play with it, but you cannot really use it for actual tasks, because no industrially produced model can be trained only once; it needs iteration.

What we are doing is the third stage of AutoML: building an automated model training platform for industrial applications. According to our market research, YMIR is the only system on the market that covers the full life cycle of model production and can actually be used in industrial production. It can be said that early AutoML leaned toward pure technology, while YMIR emphasizes practical industrial applications. What we are building is a product system, so we consider not only technical issues but also engineering and system issues.

AI Technology Review: What is the difference between AutoML and AutoAI?

Wang Xiaoyu: I think it is more appropriate to limit the concept of AutoML to its first stage, which focuses on technology. Machine learning is only one of the technologies of artificial intelligence. Our production system is not really AutoML in the traditional sense, but we have not yet found a fully suitable term for it. By comparison, AutoAI better describes what we are doing now.

AI Technology Review: Why is data increasingly important?

Wang Xiaoyu: Data and technologies such as algorithms complement each other. If the final technology is to meet application requirements, data availability is indispensable.

Algorithms can raise the accuracy of an AI model from 50% to 60%, which is still not enough to solve practical problems in applications, while data can raise the accuracy of an AI system from 60% to 90%. As model design gradually converges and the technology matures, data iteration becomes more important than the technology itself. Algorithmic technology has always been important, but beyond a certain point it needs data to drive it forward.

AI Technology Review: Some other AI model production platforms already claim that it takes only ten minutes to train a model. What do you think?

Wang Xiaoyu: Models are only useful if they can be deployed into real systems and run. It does not make sense to advertise how long it takes to train a model, because it is the data that really takes time. Model training may take only ten minutes, but labeling millions of data samples takes a month. In the full production cycle of a model, we first define the problem, then collect data, and then train the model. We then apply the trained model to the real scenario to see whether there are any problems, and collect a large amount of further data for iteration. This iteration process is very long.

Many of our algorithm engineers spend 90% of their time processing data and only 10% writing code and developing model structures. Data on the Internet is relatively easy to obtain, but it still requires a lot of work because it is very noisy. As this wave of artificial intelligence applications gradually moves into offline scenarios, the data becomes even noisier. For example, the quality of the images taken and the annotations made by quality inspectors in traditional enterprises varies with the individual inspector.

AI Technology Review: Does the YMIR platform include automatic labeling of data?

Wang Xiaoyu: We provide pre-labeling. So-called “automatic labeling” is a pseudo-concept. At least at this stage, no platform can truly achieve automatic labeling; human intervention is still required. For example, when detecting manhole covers, we draw a detection box around the manhole cover in advance. If the box is correct, the annotator simply approves it during review; if it is not, the annotator has to correct it.
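The review step described here can be sketched in Python as follows. The `Box` type, the `review` function, and the use of an IoU (intersection over union) threshold to stand in for the annotator’s judgment of whether a pre-drawn box is correct are illustrative assumptions, not part of YMIR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned detection box given by its corner coordinates (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Clamp at zero so an "inverted" box (no overlap) has zero area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, in the range [0, 1]."""
    inter = Box(max(a.x1, b.x1), max(a.y1, b.y1),
                min(a.x2, b.x2), min(a.y2, b.y2)).area()
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def review(pre_label: Box, annotator_box: Box, accept_iou: float = 0.9):
    """Return (final_box, accepted_as_is).

    Assumption: the annotator's judgment is simulated by comparing the
    pre-label with the box the annotator would have drawn.
    """
    if iou(pre_label, annotator_box) >= accept_iou:
        return pre_label, True       # pre-label passes review unchanged
    return annotator_box, False      # annotator corrects the box


pre_label = Box(10, 10, 110, 110)
print(review(pre_label, Box(12, 10, 110, 110)))   # nearly identical box
print(review(pre_label, Box(60, 60, 160, 160)))   # badly misplaced pre-label
```

The saving comes from the first branch: when most pre-labels are close enough, the annotator only confirms them instead of drawing every box from scratch.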

AI Technology Review: Why have you chosen to make YMIR free and open source rather than commercialize it?

Wang Xiaoyu: Our domestic consumer market is developing very well because we have a good mobile Internet foundation. In ToB services, however, our companies lag far behind those in developed countries. I think a considerable part of the reason is that we lack a good foundation of enterprise informatization and a good enterprise service ecosystem. We see ourselves as evangelists at the foundational level who want to promote the prosperity of the enterprise service ecosystem. So our platform is completely open source, and it is free whether you use it personally or commercially. This is indeed a bit idealistic, but we believe that if the whole industry does well, we will do well too. It is a kind of long-termism.

This article is reprinted from: https://www.leiphone.com/category/academic/iqFiicnCgT5OO8dS.html
This site is for inclusion only, and the copyright belongs to the original author.
