Original link: https://www.latepost.com/news/dj_detail?id=1626
Zhou Ming decided to start a business when the temperature of the domestic AI market had dropped to the bottom. That was two years ago. At the end of 2020, he considered resigning from his job as vice president of Microsoft Research Asia. Many friends advised him not to leave, but he was determined to build a large-model business, believing that “large models will become a kind of infrastructure in the future.”
Six months earlier, OpenAI had released GPT-3, which drew attention within the field of artificial intelligence, but the consensus that the moment for large models had arrived would not form until ChatGPT was released at the end of last year.
Unlike some entrepreneurs who joined the boom and were happy to talk about the “endgame,” Zhou Ming repeatedly emphasized that “the company must survive” in an interview with “LatePost.” His business prospects are less “sexy”: he believes that for domestic entrepreneurs, the slower 2B business is more reliable.
He admitted with a smile, “Those with 2C aspirations probably look down on those doing 2B.” In his view, large-model 2C entrepreneurship is an endless “arms race” with great uncertainty, while 2B is more pragmatic.
Behind the pragmatism is the fact that he has always been constrained by insufficient resources.
When Zhou Ming founded Lanzhou Technology, domestic AI venture investment was at a freezing point. He is glad he did not leave Microsoft a year later than he did. Since 2019, enthusiasm for venture investment in China’s AI field has been declining, reaching its lowest point last year. According to IT Juzi, as of November 2022, total primary-market financing in China’s AI field had dropped 61% compared with the same period in 2021.
As soon as the company was established, Zhou Ming and Lanzhou ran into a cold market environment, and few people recognized the potential of large models at the time.
Resource constraints also come from customer habits: many large enterprises in China, especially central state-owned enterprises, have a strong demand for data privatization. After Zhou Ming visited hundreds of domestic customers, the feedback he got was often: “If you make a big model, we can’t afford it.”
They want to deploy the large model locally, which requires buying a large number of GPUs and building a computing center, an investment of at least tens of millions of yuan. So Zhou Ming chose at the beginning to build a model with 1 billion parameters, focusing R&D on how to solve problems with lightweight models. It was not until ChatGPT educated customers that Lanzhou began to accelerate development of models with tens and hundreds of billions of parameters.
The imprint of the times also lies behind this scarcity. Zhou Ming has been studying natural language processing (NLP) since the 1980s and took part in developing CEMT, China’s first Chinese-English machine translation system. Computing power was very low then, and he thought every day about how to save memory. In the 1990s he went to teach at Tsinghua University, where the research funding he could apply for was still limited. After joining the newly established Microsoft Research Asia in 1999, he and his team long focused on how to use small data to train results comparable to those from big data.
Zhou Ming often jokes that “poverty limits imagination”: he never dared to imagine that artificial general intelligence (AGI) could be realized. It was not until ChatGPT broke out that he took AGI as his vision. After nearly forty years of natural language processing research and two years of running a startup, he finally has the ambition to match.
However, in today’s atmosphere, which imagines greater opportunities and newer species, Zhou Ming is not the entrepreneur most attractive to capital and resources. More than one investor following large models emphasized the age of entrepreneurs to “LatePost.” Some believe that the old NLP research paradigm has been completely overturned and that younger entrepreneurs understand new technologies and grasp new opportunities better.
Zhou Ming started his business at the age of “knowing one’s destiny” (fifty), and he is used to doubts about his age. Age also brings a benefit: resilience through cycles.
“Making good use of the right time and place is a person’s core competitiveness,” he said.
Zhou Ming, founder and CEO of Lanzhou Technology
The following is the conversation between Zhou Ming and “LatePost”:
Ordinary people don’t feel it, but the “huge shock” in the AI world has already begun
“Later”: You resigned from Microsoft Research Asia at the end of 2020 and decided to start a large-model business. ChatGPT caused a shock two years later. Why did you see the opportunity earlier?
Zhou Ming: I did a lot of research with my team at Microsoft Research Asia and saw how useful large models were. I thought that further down the road, they would become some kind of infrastructure.
At that time, many domestic small and medium-sized enterprises did not realize what a large model was or what it was good for. BAT had already started building large models but had not released many technologies or services externally. Chinese enterprises, especially small and medium-sized ones, will definitely use large models in the future. Who will provide them? Therein lies an entrepreneurial opportunity.
“Later”: You say the large model is useful. How did you perceive that at the time?
Zhou Ming: In fact, after Google’s Transformer came out in 2017, the field of NLP (Natural Language Processing) immediately shifted to Transformer.
The natural language group I led at Microsoft immediately began to use Transformer for encoding, decoding, and various large-scale models; at that time they were called pre-training models. We built a model well known in the industry called the Unified Language Model (UniLM). Our technology was successfully applied to multiple products, including the Microsoft Turing large model, Bing search relevance improvements, Office grammatical error checking, and Azure machine translation.
“Later”: So Transformer in 2017 brought more shock to the industry than ChatGPT?
Zhou Ming: If the Turing Award is awarded in the future, it may be awarded to Transformer instead of ChatGPT, because the Turing Award generally encourages basic technologies with long-term and extensive influence.
Ordinary people didn’t feel it, but the whole AI world switched to Transformer at that time. Google may feel it lost out a bit: it created Transformer, and the sensational BERT as well, but now it is GPT that picks the fruit.
(*BERT is a large model based on Transformer launched by Google in 2018.)
“Later”: What changes did Transformer bring?
Zhou Ming: Let’s start from the beginning. Why has natural language processing developed so rapidly in recent years? “Self-supervised learning” is the most important factor.
In the past, for many natural language tasks such as Chinese-English translation, you had to find Chinese-English bilingual corpora on the Internet, manually check and confirm them, or add new corpora. Different tasks needed different labeled data, and the labeling cost was especially high. You then used the labeled data to design a model for learning. This is “supervised learning.”
The GPT large model uses “self-supervised learning.” It does not need data labeled in advance; it only needs a large-scale corpus, and the neural network adjusts its parameters by itself until it learns a stable state.
When doing specific tasks, such as information extraction or text generation, you still need to fine-tune the model, which requires labeled data for those tasks, but the amount of labeling is much smaller than in supervised learning. Because the model is smarter, you only need to give it a few examples: where you once had to label 10,000 pieces of data, 100 may now be enough.
Now GPT-4 does not even need labeled data for specific tasks. You can tell the model how to do the task directly through prompts. The more detailed and accurate the prompt, the better it completes the task.
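To make this concrete, below is a minimal, hypothetical sketch of the “describe the task in the prompt” mode, using the open-source Hugging Face transformers library and the small public GPT-2 model as stand-ins (they are assumptions for illustration, not Lanzhou’s or OpenAI’s actual stack):

```python
# A hedged illustration: the task is specified entirely in the prompt, with a
# couple of in-context examples and no task-specific labeled training data.
# "gpt2" is a small public stand-in; a real large model follows such prompts
# far more reliably.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Extract the company name from each sentence.\n"
    "Sentence: Zhou Ming founded Lanzhou Technology in 2021. Company: Lanzhou Technology\n"
    "Sentence: OpenAI released ChatGPT at the end of 2022. Company:"
)
print(generator(prompt, max_new_tokens=5)[0]["generated_text"])

# The fine-tuning path described above would instead adapt the same pretrained
# weights with on the order of 100 labeled examples (e.g. via transformers.Trainer).
```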
“Later”: How does Transformer realize self-supervised learning?
Zhou Ming: People working on natural language had long thought of self-supervised learning, but there was no good encoding method to realize it.
The first big change in the NLP field in recent years came when ImageNet took off in 2012. Everyone realized the powerful capability of deep learning in image recognition and began applying it to NLP. Initially, deep learning only transformed part of the original NLP pipeline, mainly being used to generate features that help machines understand language. But at that time it was impossible to do end-to-end training from input data to output results the way a large model does, mainly because encoding ability and efficiency were insufficient.
The emergence of Transformer changed this situation. It brought the most efficient encoder and decoder available, and it can be computed in parallel at high speed. The key is that it introduced the “multi-head self-attention mechanism,” and when encoding a word it adds, besides semantic information, the word’s position in the context. Put simply, this extracts sentence information along multiple dimensions; combining the multi-layer attention model with position information greatly improves encoding and decoding capability.
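As a rough illustration of the two ingredients named here, scaled dot-product self-attention and added position information, the following NumPy sketch shows a single attention head. It is a simplified reading of the mechanism for illustration only, not the original Transformer code; a real multi-head layer runs several such heads in parallel and concatenates their outputs.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal position signal added to word vectors so that word order is not lost.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

def self_attention(x, w_q, w_k, w_v):
    # Every position attends to every other position; the scores form one matrix
    # product, which is why the whole sequence can be processed in parallel.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ v

seq_len, d_model = 5, 16
x = np.random.randn(seq_len, d_model) + positional_encoding(seq_len, d_model)
w_q, w_k, w_v = (np.random.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (5, 16)
```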
After this, everyone became bolder, and BERT, GPT-1, 2, 3, and then ChatGPT came out one after another along this line.
“Later”: Looking back now, this seems to be a very natural way of thinking. Why was it realized in 2017?
Zhou Ming: First, computing power had genuinely improved. This form of encoding demands extremely high computing power: so much attention has to be computed, each word has many encoding bits, and the number of neural network layers is large, all of which consume a great deal of computation.
Second, imagination expanded, which is also related to the increase in computing power. Before, we did not even dare to think about attention with a single head; it took up too much space.
Computing power, algorithms, and data advance interactively: the stronger the computing power, the more you dare to think, and the stronger the algorithm, the more efficiently data can be processed.
“Later”: For those who have been doing NLP for many years, is Transformer a ground-breaking disruptive innovation or an incremental innovation based on existing technologies?
Zhou Ming: Disruptive innovation. Every part of it may have been thought of in the past, but turning it into a system and becoming the basis of a neural network is definitely a disruptive innovation.
“Later”: Did you ever think of any part of it?
Zhou Ming: Encoding, and the correlation between words. I can’t say I thought of it myself. I have been doing NLP since 1985. At that time, some people studied multi-feature encoding: could everything, regardless of part of speech or language, be encoded with a unified multi-dimensional vector?
“Later”: But over the years, you and others have failed to realize these visions.
Zhou Ming: Poverty limits imagination.
Our machines were too small at that time, and what we thought about all day was how to save memory. If someone had really proposed this, you would have called it stupid: it would eat up all the memory and storage at once, how could it possibly work? Large-model thinking is the reverse: think about how to fully mobilize computing power, and worry less about how much of it is consumed.
And in the past we only scratched the surface with initial ideas. Transformer embodies an all-round spirit of unified encoding across languages and modalities: it can handle all languages, including program code, because the encoding mechanism is the same.
Ilya has the ability to do underlying innovation, and Sam has taken integrated innovation to the extreme
“Later”: Based on Transformer, what did OpenAI do?
Zhou Ming: Keep pushing everything to the extreme: data cleaning, scale, parameter count, training speed… all of it.
“Later”: Meta’s Chief AI Scientist Yann LeCun commented that ChatGPT, “in terms of underlying technology, is nothing innovative.”
Zhou Ming: What he said makes sense. People who do academic research will say ChatGPT is nothing special; the technology it uses is scattered throughout the literature and has been used elsewhere.
But people who work on engineering and products will think ChatGPT is amazing. Its greatest achievement is pushing every aspect to the extreme; it is a model of integrated innovation.
China’s integrated innovation capability is relatively weak; what we do well is point innovation and application innovation.
“Later”: What kind of innovation was ResNet, developed under the guidance of Sun Jian at Microsoft Research Asia?
Zhou Ming: It is fundamental, underlying innovation. The light of ResNet still shines on the entire neural network and AI field, and it is the pride of Microsoft Research Asia.
(*ResNet mainly solves the problem that deep neural networks are difficult to train. It was proposed by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun at the end of 2015, and the paper has been cited more than 120,000 times. AlphaGo Zero, the strongest Go player, also used this technology.)
“Later”: What is the difference between underlying innovation and integrated innovation?
Zhou Ming: In underlying innovation, you propose the algorithm yourself from beginning to end. Integrated innovation is like the “star-attracting method”: integrating all the excellent algorithms, engineering, interfaces, interaction capabilities, and even PR capabilities.
In the field of large models, you can understand that Transformer is the underlying innovation, and ChatGPT is the master of integrated innovation.
“Later”: In the last AI upsurge, driven by computer vision, Chinese scientists produced achievements such as ResNet. Looking at language models now, why did both the underlying Transformer and the later ChatGPT appear in the United States?
Zhou Ming: The emergence of any technology involves an element of chance.
China is relatively weak in integrated innovation, and so are many American companies other than OpenAI. Microsoft also did a lot for OpenAI, and OpenAI cleverly used Microsoft’s computing power, resources, and data.
“Later”: Then the question becomes: why OpenAI?
Zhou Ming: There are several types of people in the world. Some want to work on underlying innovation. Some build applications on top of underlying innovation, and an application generally solves a single task. Then there is integrated innovation, where all the work, applications, and algorithms come together on a large platform to form a milestone. OpenAI happens to do integrated innovation very well.
China’s application innovation is relatively strong, its integrated innovation is relatively weak, and its underlying innovation has made some breakthroughs.
“Later”: Where does the underlying innovation come from, such as how did the first person think of Transformer?
Zhou Ming: One is imagination, and imagination comes from the ability to ask questions.
Those who do application innovation will not think about “how to encode language better,” but those who can propose underlying innovation will think that if this problem is not solved, it will be hard to push forward; they see the whole set of problems and can find the breaking point.
The second is that underlying innovation requires a mathematical foundation.
“Later”: What inspiration does the success of OpenAI give you?
Zhou Ming: OpenAI and Microsoft are a rare match made in heaven. Sam has good personal relationships with Microsoft CEO Nadella, with Musk, and with Jensen Huang, and they trust him. In addition, Sam has seen many entrepreneurial projects, has strategic determination, knows which direction to go, and is matched with Ilya, a chief scientist who is extremely persistent.
“Later”: Is Ilya harder to find, or is Sam harder to find?
Zhou Ming: China has its Ilyas and its Sams, but it is not easy for them to come together. China also lacks companies like Microsoft.
Ilya is a firm believer that certain technologies can work wonders. Our field has such people too.
“Later”: What kind of talent do you think you are?
Zhou Ming: I am probably more like an architect. I have a clear idea, can organize different people and resources, and know where to go. But if you asked me to write a particularly good algorithm, I couldn’t.
Lanzhou’s applications are very strong, and our models and algorithms are still at the top level in China. I have also seen some entrepreneurial teams that do not understand the underlying technology and go straight to integration, which may turn out to be more haste, less speed.
“Later”: Lanzhou is one of the earliest companies in China to do large-model development and application. What do you do now that other companies are poaching people?
Zhou Ming: They haven’t set their eyes on us yet. We don’t emphasize individual heroism; each of us plays to our own strengths, and different people support each other to complete big projects.
Large companies, and start-ups too, are looking at international talent. If you have worked at OpenAI, even as a floor sweeper, you are worth a lot now. When OpenAI people can’t be found, people who have worked at Microsoft or Google are also being sought now.
“Later”: Is it a wise move for them to go to the United States to poach people?
Zhou Ming: Whether you poach at home or abroad, simply poaching people is not the best policy. Most people only work on a particular screw in a company and have only a partial understanding of the problem. They were originally soldiers; if you expect them to come to you and serve as the general, where do you think your company will end up?
2C is more ambitious, but 2B is more pragmatic
“Later”: You once commented that OpenAI’s “ambition is admirable” and that doing NLP in China is a bit “timid.” What is the “ambition”? What is the “timidity”?
Zhou Ming: OpenAI wanted to do AGI (artificial general intelligence) from the very beginning. Most other companies, at home and abroad, do not have this ambition; they are more interested in doing tasks such as machine translation and search engines well, and do not necessarily have to reach AGI.
But now, having seen OpenAI’s success, some Chinese companies are too ambitious, thinking that as long as they have money and can afford the machines, they will soon reach or surpass ChatGPT. I think that is unlikely.
“Later”: You haven’t thought about AGI yourself?
Zhou Ming: I used to think it could not be done, and even now I dare not say Lanzhou can do it, but we do have this vision. There is a big difference between having it and not having it: our generation may or may not achieve it, but everyone is approaching it every day, so we must have this ambition.
“Later”: How do you define AGI? Some people think that AGI is already here.
Zhou Ming: AGI is a progressive process. Originally you could only do one task; later it becomes N tasks, then 10,000 tasks, all realized on one platform.
Are 10,000 tasks AGI? No, the number may keep going up. The earlier tasks are the ones people use more readily; the later ones form the longer tail.
“Later”: You are defining AGI from the perspective of generality; you do not consider machine cognition or consciousness?
Zhou Ming: I look at it from the perspective of productivity, not production relations. Even the productivity has not been achieved yet.
“Later”: The idea is quite pragmatic. How do you do it?
Zhou Ming: I pay attention to walking on two legs: one is the Vision, the other is the Stage. The Vision is the ultimate goal, and each stage also has its own goals, so that the company can produce intermediate results or generate revenue.
So we not only refine the model, we also hope it can land in some fields soon, refining and using at the same time, never separating the two. The two feed back into each other: when refining the model, we consider how it will be used, which makes the work more focused and efficient; when using it, we think about how to cover the “last mile.” Among current entrepreneurial teams, very few have the ability to refine and use at the same time.
“Later”: Wang Huiwen’s idea is also “big model + application”, which he calls two-wheel drive.
Zhou Ming: That shows he really did work at a big company. Lanzhou’s advantage is that we have been at this for two years and have taken our share of hard knocks. Our earlier models already have deployment experience, and now we are building larger models for deployment. We have an extra “feedback chain.”
“Later”: Unlike this batch of new companies, Lanzhou built a model with 1 billion parameters in early 2021. Looking back, was that a relatively timid choice?
Zhou Ming: When I first started the business, I wanted to build a very large model, but I surveyed hundreds of organizations, and they said: if you make a huge model, we can’t afford it; if you give us a model with tens or hundreds of billions of parameters, how many machines would we have to buy? China’s central state-owned enterprises want privatized deployment, and I think they are the most important 2B customers in China. So over the past two years Lanzhou took the pragmatic, lightweight route.
“Later”: What is the specific cost for the customer to deploy the large model?
Zhou Ming: If you want to train a large model with hundreds of billions of parameters and you care about training speed, you need thousands of A100s. One A100 now costs about 100,000 yuan, so that is an investment of hundreds of millions of yuan. Even if you accept very slow training, I think you need at least 128 A100s, an investment of over ten million yuan, and you are still not sure the training will succeed.
Of course, if you only deploy inference locally, you don’t need so many cards. Inference means using the model after it has been trained. A 100-billion-parameter model needs 8 to 16 A100s, which is still an investment of one or two million yuan. If the tasks the model supports are not that important, customers still feel it is not cost-effective. So at the time I could only do lightweight models.
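A quick back-of-the-envelope check of these figures (the per-card price and card counts below are simply the approximate numbers quoted in this answer, not verified prices):

```python
# Rough cost arithmetic using ~100,000 yuan per A100, as stated above.
price_per_a100 = 100_000  # yuan, approximate

scenarios = {
    "fast training of a 100B-parameter model": 2_000,   # "thousands" of A100s
    "slow training of a 100B-parameter model": 128,
    "local inference for a 100B-parameter model": 16,
}
for label, cards in scenarios.items():
    cost_million_yuan = cards * price_per_a100 / 1e6
    print(f"{label}: {cards} x A100 is roughly {cost_million_yuan:.1f} million yuan")
# Prints roughly 200, 12.8, and 1.6 million yuan, matching the orders of magnitude
# mentioned above: hundreds of millions, over ten million, and one to two million.
```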
“Later”: At Lanzhou’s Mencius large-model launch in March this year, you said that models with tens or hundreds of billions of parameters come next.
Zhou Ming: The distribution of 2B demand is roughly this: 80% are tasks that lightweight models can solve, such as machine translation, information collection, and reading comprehension; the other 20% require multi-round dialogue, complex semantic understanding, or intent recognition, such as customer service and contract review, and these can only be done with a large model. We didn’t touch that 20% before, even though its unit price is higher.
What we did in the first two years was take the 80% of tasks first and accumulate capabilities, then gradually build larger models to win the 20% of large orders.
“Later”: Before ChatGPT came out, you simply couldn’t get that 20% of orders?
Zhou Ming: We couldn’t. Your model’s capability is not enough, and customers think you can’t do it. I had to judge the situation: as a startup team, we had to live on the 80% of tasks first.
But when ChatGPT came, it educated customers, and customers wanted to use it. Our original plan, plus the advancement of technology, customer education, and competition from peers, has made us more capable. Everything is in place, so now I should do this (a large model with 100 billion parameters).
“Later”: Now that you are building a model with more than ten billion parameters, will corporate customers still be unable to afford it?
Zhou Ming: First, larger and more important tasks come with larger budgets; second, by Moore’s Law, machine performance doubles roughly every 18 months at the same price. Of course, China is now restricted by the United States on chips.
“Later”: You were researching corporate customers from the very beginning, why didn’t you consider doing 2C?
Zhou Ming: A 2C company may become a great company, while 2B is slower but more pragmatic. People with 2C ambitions probably look down on those doing 2B.
But large-model 2C is very difficult in China; privately I think it may be a road of no return. First of all, many people don’t understand the difference between C and B. They think that if they copy ChatGPT, they can do both in the future.
In fact, 2C needs AGI even more. It needs to put various functions on one common engine; you can’t have one app for translation, another app for writing, a whole pile of apps. That requires putting two kinds of ability into one model: understanding human speech, the basic ability of language understanding, and getting things done, that is, the ability to solve various tasks. Accordingly, the model’s parameter scale must be large. ChatGPT’s parameter count reached 175 billion, and future models will be even larger. To do 2C, the future is an arms race of growing parameter scale, data volume, and machines, in which you may always be suppressed by OpenAI.
Second, it is difficult for domestic 2C products to collect money directly from users, and supervision is relatively strict.
In fact, there is a third way, 2B2C, similar to how OpenAI embeds GPT’s capabilities into Microsoft’s standard products such as Bing or Office. This road has opportunities, and it requires finding a good partner.
“Later”: Lanzhou is now focusing on 2B, considering 2B2C, not 2C?
Zhou Ming: We also do some 2C, but for the purpose of acquiring customers. For 2B2C, we have cooperated with a large communications manufacturer to serve its customers.
“Later”: Will the more general 2C large models of the future crush smaller models?
Zhou Ming: On specific tasks, a relatively small model, combined with better fine-tuning and domain-specific data, will surpass the general-purpose large model. There is also cost: in many scenarios, customers need something cheap and good enough.
“Later”: If future general-purpose large models are offered 2B on the public cloud, the cost of small tasks could be amortized.
Zhou Ming: Many businesses of central state-owned enterprises and other state-owned enterprises generally do not use public clouds for data security reasons. I don’t think this will change in the next ten years.
“Later”: This poses a problem for 2B, can you use customer data to help optimize the model and form a data flywheel?
Zhou Ming: It is difficult to build a flywheel effect with domestic industry data; the data and trained models of central state-owned enterprises are not something you can take away. Of course, this is the same for every company; everyone is on the same starting line.
People can’t control the situation; they can only adapt to it. China’s SaaS (software as a service) is not as widespread as in the United States. Public cloud and SaaS may explode one day. Before that, we must accumulate and build up capabilities and wait for future changes.
“Later”: Where might the change come from?
Zhou Ming: Strengthen yourself first and do well what you can do, then wait for some adjustment of external conditions, including looking at the possibility of going overseas, and keep looking for new opportunities to survive.
I was born into a relatively poor family and experienced all kinds of harsh environments as a child, so I am not afraid of hardship. I feel things are getting a little better every day.
“Later”: Compared with the last AI boom, has the gap between China and the world widened or narrowed?
Zhou Ming: It’s getting better and better. If it weren’t for chips, the gap wouldn’t be as big as it looks.
“Later”: At the beginning of this venture, you told people you wanted to build the best NLP company in the world. That is influenced by many factors.
Zhou Ming: How a person makes good use of the time and place is his core competitiveness.
“Later”: If this wish is not realized in the end, what kind of regret is it?
Zhou Ming: It’s like machine learning: it needs both positive and negative feedback, and in the end the neural network becomes stronger and stronger. When you hold a learning heart, all of life’s experience, whether success or failure, every person and every thing, is your learning, your training corpus.