From AlphaGo to GPT: the inevitably bumpy road to artificial general intelligence

Geoffrey Hinton, the 75-year-old Turing Award winner, has spent most of his life studying AI. He helped make deep learning the mainstream of AI research, which in turn gave rise to today’s large models and ChatGPT. Now he is fearful and regretful, worried that his life’s work may push humanity into unknown risks.

“I comfort myself with the usual excuse: if I hadn’t done it, someone else would have,” Hinton told the media after announcing his departure from Google in early May. “Until it is clear whether AI can be controlled, I don’t think it should be scaled up any further.”

That one of the founders of deep learning is so deeply worried is itself a testament to the power of GPT.

ChatGPT launched at the end of November last year and reached 100 million users in less than two months, growing faster than any previous Internet product. In April of this year, the number of users visiting ChatGPT on the website alone surpassed that of Microsoft’s Bing search and was equivalent to 60% of Baidu’s. This record-setting growth ignited the enthusiasm of industry and capital worldwide.

Every week, companies large and small launch their own large models, and AI upending the world is no longer seen as a distant fantasy. Artificial general intelligence (AGI), which could replace human labor in every field, seems as if it could be born at any moment.

Similar AI manias have occurred several times in history. The most recent was in the spring of 2016, when AlphaGo, developed by Google’s DeepMind, beat Go master Lee Sedol. People expected AI to be quickly replicated and applied across industries, to upend many existing businesses, and even to bring about broader social change.


Healthcare Futurist, April 2016: AlphaGo can reshape the future of healthcare

But over the past seven years, both the investors and entrepreneurs who hoped to reap huge business value and the critics who worried that AI would cause mass unemployment turned out to have “thought too much.”

One of the reasons is that the difficulty and cost of replicating AI capabilities among different industries, scenarios, and customers is far beyond imagination—AI versatility is greatly overestimated.

When people tried to use AI in industries and scenarios such as medical care, finance, industrial quality inspection, and marketing decision-making, they found it hard to carry a solution from one industry or customer to the next. A group of start-ups with the vision of “AI empowering thousands of industries” grew into super unicorns after 2016. However, because of high personnel costs and the difficulty of standardizing products and replicating them at scale, most of them became software outsourcing companies that keep doing customized projects.

The second reason is that AI based on deep learning behaves like a “black box.” At NIPS, the top AI academic conference, held at the end of 2017, Ali Rahimi of Google compared deep learning to alchemy: “I don’t understand the principles of airplanes, but I am not afraid of flying because I know there are many experts who have mastered those principles. The most worrying thing about the deep learning world is that I don’t know the principles myself, and I know you don’t either.”

Those who develop AI models and algorithms cannot fully explain how they work. In commercial practice, it is impossible to know clearly what effect a given change will have, which greatly increases the difficulty of model tuning and raises project delivery costs. It is also difficult to eliminate certain specific problems of a model completely. This hinders the adoption of AI in scenarios where no mistakes can be tolerated, such as autonomous driving, which pursues the utmost safety.

The current large-model boom has removed some of the obstacles that AI commercialization faced during the AlphaGo period, but other problems exposed at that time remain and have even been amplified.

How the power of general-purpose AI was misjudged the last time

During the AlphaGo period, AI companies basically had to train a different model from scratch for each customer project. In 2019, 2020, and the first half of 2021, the artificial intelligence company SenseTime produced 1,152, 9,673, and 8,377 AI models respectively, without producing corresponding commercial returns. Once built, some models could be reused by only three to five customers. From 2019 to 2020, the number of SenseTime models increased by 740%, while revenue grew by only about 14%.
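The gap between model output and revenue can be checked with simple arithmetic. The sketch below uses only the figures quoted above; the "growth ratio" is an illustrative metric introduced here, not one from the original article, and the revenue figure is treated as exactly 14% for the sake of the calculation.

```python
# Figures quoted in the paragraph above.
models_2019 = 1152
models_2020 = 9673
revenue_growth_pct = 14.0  # quoted as "nearly 14%"; assumed to be exactly 14 here

# Year-over-year growth in the number of models, in percent.
model_growth_pct = (models_2020 - models_2019) / models_2019 * 100

# Illustrative ratio (an assumption of this sketch): points of model-count
# growth for every point of revenue growth.
growth_ratio = model_growth_pct / revenue_growth_pct

print(f"Model count growth, 2019 to 2020: {model_growth_pct:.0f}%")
print(f"Model growth per point of revenue growth: {growth_ratio:.0f}x")
```

In other words, the number of models grew roughly fifty times faster than revenue, which is what "no corresponding commercial returns" looks like in numbers.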


China’s “Four AI Tigers,” SenseTime, Megvii, Yuncong, and Yitu, had raised a total of about US$7.5 billion before submitting their prospectuses, roughly equal to the total that ByteDance, founded in the same period, had raised before 2020. But ByteDance’s revenue in 2020 exceeded 236 billion yuan, more than 30 times the combined revenue of these four companies in the same period.

The key to ByteDance’s success is that it put AI technology to good use. It polished a set of efficient recommendation algorithms and paired them with news and short-video apps to monetize through advertising, a typical Internet business model. The original aspiration of companies such as SenseTime was more ambitious but harder to carry out: using AI to empower thousands of industries.

It is a vision big enough to attract a great deal of ambition, talent, and capital. Around 2016, a group of technology giants, top scholars, investment institutions, and start-ups invested in this direction for the same reason as now: AlphaGo, too, let people glimpse the possibility of a more general AI. But the short-term business value of that “more general” AI was overestimated.

Before AlphaGo, most deep learning models had to be trained on large amounts of pre-labeled data, an approach called “supervised learning.” Such data usually pairs information collected by machines with a manually annotated result. For example, the labeled data for a hospital diagnosis scenario consists of a “patient symptom description” and the corresponding “doctor’s diagnosis and treatment.”

The data problems encountered in supervised learning directly affect the effect and cost of AI commercial implementation.

In supervised learning, model performance can fluctuate greatly when the data environment changes. Moreover, the data needed for commercialization, especially real data from the customer’s site, is hard to obtain. At that time, most models could only be trained during development on small-scale industry data, which might not reflect the latest industry conditions or customer characteristics. So although a model could perform well in the research and development stage, its performance could collapse at the customer’s site as soon as the data and environment changed. AI companies had to re-collect data in the customer environment and retrain the model. They could not “develop once, sell many times,” and the marginal cost of delivering a model stayed high.
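A toy example makes the data-environment problem concrete. The sketch below is a deliberately tiny stand-in for a supervised inspection model: it learns a single brightness cutoff from labeled readings at one site, then is applied at a second site where the lighting is dimmer. All readings, the `LIGHTING_SHIFT` value, and the function names are hypothetical.

```python
def fit_threshold(samples):
    """Learn a brightness cutoff: the midpoint between the two class means."""
    good = [x for x, label in samples if label == "good"]
    bad = [x for x, label in samples if label == "defect"]
    return (sum(good) / len(good) + sum(bad) / len(bad)) / 2


def accuracy(threshold, samples):
    """Fraction of samples classified correctly by the cutoff."""
    correct = sum(
        ("good" if x >= threshold else "defect") == label for x, label in samples
    )
    return correct / len(samples)


# Hypothetical labeled brightness readings collected at "factory A".
factory_a = [
    (0.70, "good"), (0.80, "good"), (0.90, "good"),
    (0.30, "defect"), (0.40, "defect"), (0.50, "defect"),
]

# Same parts under dimmer lighting at "factory B" (assumed uniform shift).
LIGHTING_SHIFT = -0.25
factory_b = [(x + LIGHTING_SHIFT, label) for x, label in factory_a]

threshold = fit_threshold(factory_a)
acc_a = accuracy(threshold, factory_a)   # evaluated where it was trained
acc_b = accuracy(threshold, factory_b)   # same model, new environment

# Recovering performance requires relabeled data from the new site.
acc_b_retrained = accuracy(fit_threshold(factory_b), factory_b)

print(acc_a, acc_b, acc_b_retrained)
```

The model is perfect on the site it was trained on, misclassifies some good parts as defects once the lighting changes, and recovers only after being refit on labeled data from the new site. That refit is the per-customer cost the paragraph above describes.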

IBM’s medical AI system Watson once produced misdiagnoses because the data used to train the model had not kept up with the latest standards of care. Another common example: when AI inspection tools are deployed in factories, changes in lighting and dust in the real workshop affect model performance, so the same solution performs differently across factories, assembly lines, or workstations.

AlphaGo, and especially its successor AlphaGo Zero, brought a turning point. AlphaGo Zero does not learn from human game records. It generates its own data and keeps improving by playing against itself, without requiring human experts to label large amounts of data. This greatly reduces the difficulty of acquiring data, increases the amount of training data, and achieves amazing results.
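The idea that a game can label its own training data can be shown with a much simpler game than Go. The sketch below is a toy under stated assumptions: it uses a take-away game (take 1 to 3 stones, whoever takes the last stone wins) and a purely random policy, whereas AlphaGo Zero used a neural network guided by tree search. The function name and record layout are hypothetical.

```python
import random


def self_play_game(rng, pile=10, max_take=3):
    """Let one policy play both sides; return (stones_left, stones_taken, outcome).

    The outcome is +1 if the player who made the move went on to win, else -1.
    No human labels are involved: the rules of the game supply the result.
    """
    history = []  # (stones_left, stones_taken, player)
    player = 0
    while pile > 0:
        take = rng.randint(1, min(max_take, pile))  # random policy (assumption)
        history.append((pile, take, player))
        pile -= take
        player = 1 - player
    winner = history[-1][2]  # whoever took the last stone wins
    return [(stones, take, 1 if p == winner else -1) for stones, take, p in history]


rng = random.Random(0)
dataset = [record for _ in range(1000) for record in self_play_game(rng)]
print(len(dataset), "labeled positions generated without human annotation")
```

A learning system would then train on `dataset` and use the improved policy to generate the next round of games. The loop only works because the simulator is cheap and the rules decide the outcome, which is exactly what most real-world scenarios lack.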

But when people tried to copy this approach to scenarios other than rule-bound games such as Go, poker, and StarCraft, they found its scope of application very limited.

AlphaGo can play itself in a computer simulation environment, trial and error infinitely, until it finds the hand of God. But most scenarios, especially those involving the offline physical world, do not have such room for trial and error. AI doctors cannot use patients for trial and error, and AI drivers cannot use pedestrians for trial and error.

One solution is to build autonomous driving simulators and robot simulators in a virtual environment to obtain data, but this runs into new difficulties: a simulator that reproduces the physical world with 100% fidelity would be prohibitively expensive to develop, while one that differs too much from the real world produces training that has no effect.

DeepMind, the developer of AlphaGo, itself failed to expand into other industries. From 2016 to 2017, it launched a batch of collaborations with the medical and energy industries, trying to take AI out of the virtual world of games. These projects either ended dismally or have never moved beyond the publicity stage.

The companies that were spurred by AlphaGo to join the last AI commercialization boom then had to spend the following years wrapping a largely unchanged technical solution in new stories, and they became “project companies that cannot serve multiple customers with standard products.”

In order to implement an AI product, parameter adjustment, model optimization, engineering support, and embedding AI into the customer’s existing business system or process, all these tasks are indispensable. Among them, model tuning, optimization and other tasks generally require AI companies to send well-paid algorithm engineers to the customer site, which is a hard job. In addition to high wage costs, this also makes AI companies unable to retain people.

The last mile became the farthest and most tiring mile.

Big models make AI more general, but maybe not enough

Like AlphaGo before it, the fledgling large-model craze seems to promise a more general AI. Last month, OpenAI and the Wharton School published a paper titled “GPTs are GPTs,” in which the second “GPTs” stands for general-purpose technologies.

GPT has indeed shown greater generality than past systems: it can learn some new skills without dedicated new data and training, and when the skills it has learned are replicated across different scenarios and customers, it can adapt to new data environments at lower cost.

Among these, the ability to learn new skills comes from the “emergence” phenomenon that appears as a large model keeps expanding its training data and parameter count. For example, OpenAI discovered in 2019 that GPT-2 had unexpectedly taught itself several skills, such as text summarization and translation. OpenAI had not trained it specifically for translation using parallel corpora between languages, which had been the conventional way for AI to learn to translate.

Besides GPT, other generative pre-trained large models show the same pattern of “learning more new skills as the model grows.” This unprecedented self-teaching is a major source of the greater generality of large models.


The large Google PaLM model learned several new skills as its parameters grew from 8 billion to 540 billion.

Better adaptation to data refers to the process of fine-tuning based on the basic large model to obtain customized models and products.

Pre-training of large models uses data that does not need to be labeled, while fine-tuning still needs to use more expensive and harder-to-obtain labeled data. But because there is already a powerful general-purpose model, fine-tuning is theoretically cheaper and easier than re-collecting and labeling data and re-training the model when encountering new tasks and new environments in the past.

All of these characteristics point to lower marginal costs when a business is replicated and carried into new industries. Companies that provide GPT-style large models may be able to use one unified general model to serve a variety of industries and scenarios. This is the goal that many companies pursued after AlphaGo, only to drift further and further from it.

However, there are also some indications that today’s “last mile” from a general-purpose model to a real usable product is still expensive.

When ChatGPT was built on top of GPT-3.5, OpenAI hired a large number of annotators to score the machine’s answers and to flag hateful, pornographic, and violent content, in order to improve dialogue quality and reduce false and discriminatory information. According to media reports, OpenAI pays data labelers in the United States US$15 per hour and pays contract workers in Africa only one-tenth of that; these workers are fighting for higher wages and have voted to form a union. OpenAI also employs about 1,000 contract workers doing similar work in Eastern Europe and Latin America.

Given the technical characteristics of GPT-style large models and the needs of various industries, their first commercial limitation is that their generality has bounds.

At present, GPT acquires its various natural-language cognitive abilities mainly through pre-training on text data. The reach of large models’ multimodal capabilities has not yet been verified: GPT-4 can get the joke in a meme image, but it may not be able to spot abnormalities in a chest X-ray. In decision intelligence, represented by numerical regression and behavior prediction, and in perceptual intelligence, represented by image and video recognition, the applicability of large models therefore remains in question.

Normally, this limitation is not a problem, because no technology is suitable for all situations. But fanaticism dispels common sense. Some industries that were far from the big model are now talking about “rapid disruption”. Some companies promise customers that new technologies have magical powers, but they do not describe the applicability enough.

For those who seriously want to put large models to work, a more important consideration is that even in language-based cognitive intelligence, the area where large models are strongest, their inherent defects will limit commercialization in certain scenarios.

These shortcomings include insufficient planning ability, inability to learn continuously, lack of long-term memory, high inference cost, and high latency. The biggest threat to commercialization is that large models are uncontrollable, unpredictable, and unreliable. A typical symptom is that they are prone to hallucination, producing “completely unrealistic content without provenance.”

Ilya Sutskever, a student of Hinton and the chief scientist of OpenAI, has repeatedly stated that hallucination is the biggest problem in applying large models to more important scenarios, but that it currently cannot be controlled at the root. This is because OpenAI’s training goal for GPT is to “predict the next word” more accurately. To keep producing content that matches human language habits and expectations, the model does not care whether the information is true. Sometimes, the more plausible and elaborate a fabricated story is, the more easily it wins human approval. Hallucination can thus be seen as a flaw inherent in the objective and architecture of the GPT model. As users come to trust the model more, the harm done by hallucinations will grow.
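Why a next-word objective is indifferent to truth can be seen in a model far simpler than GPT. The sketch below is a bigram counter, not a transformer, and the three-sentence corpus is made up for illustration. It always continues with the word that most often followed the previous one in training.

```python
from collections import Counter, defaultdict

# Tiny made-up training corpus (an assumption of this sketch).
corpus = (
    "the model writes fluent text . "
    "the model writes fluent text . "
    "the doctor writes the diagnosis ."
).split()

# Count which word follows which.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1


def predict_next(word):
    """Return the most frequent continuation seen in training, or None."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None


def generate(start, length=5):
    """Greedily extend `start` one most-likely word at a time."""
    words = [start]
    while len(words) < length:
        nxt = predict_next(words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)


print(generate("doctor"))
```

Starting from "doctor", the model produces a fluent sentence that never appeared in its training text, because "fluent" follows "writes" more often than "the" does. Nothing in the objective checks the output against what the doctor actually wrote; it only rewards likely continuations.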

At the same time, the “emergence” that gives large models greater generality is itself uncontrollable and unpredictable, so it is difficult to do business planning around it.

The paper “Emergent Abilities of Large Language Models,” jointly published by Google, Stanford, and DeepMind, found that new capabilities of large models do not emerge gradually as model parameters grow linearly. Instead, they appear abruptly once the parameter count exceeds a certain critical point. For example, beyond 62 billion parameters a model gains chain-of-thought (CoT) multi-step reasoning ability, and beyond 175 billion parameters it gains the ability to recognize satirical content.


Learning curves for new skills that scale with parameter size for models like LaMDA, GPT-3, etc.

It is difficult for people to predict where the next few critical points of the parameter scale will be, and what capabilities may emerge after breaking through these critical points, and whether these capabilities will be beneficial or harmful to specific business goals. This makes it difficult to improve the efficiency and certainty of the emergence of specific new skills when the application is implemented.

These limitations of large models may cause some theoretically feasible and exciting application scenarios to face long-term obstacles on the way to commercialization.

For example, a truly usable one-on-one private-doctor service based on a large model depends on breakthroughs in multimodal image capabilities for medical scenarios. It also requires the model to remember a patient’s condition, treatment history, and past communication over the long term, and the system must be accurate, because a small mistake may be fatal. These are exactly the weaknesses of current large models. As long as the model’s defects cannot be fully resolved, companies developing large-model medical services must also reach agreement with regulators on the room for trial and error, legal norms, and methods of supervision.

When estimating how long it will take for large models to be deployed broadly, OpenAI chief scientist Ilya Sutskever once used the analogy of autonomous driving: “Look at Tesla’s self-driving: it seems able to do everything, but there is still a long way to go in terms of reliability. The same goes for large models. While it looks like they can do anything, they need more work until we actually get it all figured out.”

Market frenzy, OpenAI cautious

Based on the recognition of the difficulty and risks of cross-industry implementation, OpenAI has been conducting professional and rigorous experiments on the effects and problems of GPT-4 industry applications with partners in different industries such as law, education, and medical care since the second half of last year.

For example, in the legal industry, OpenAI supports Casetext to develop AI legal assistant CoCounsel based on GPT-4, trying to deal with the problem of hallucinations.

First, a team of AI engineers and lawyers spent nearly 4,000 hours on model training and fine-tuning based on more than 30,000 legal questions. Then, more than 400 lawyers were organized as a Beta testing group, and they were required to use CoCounsel at least 50,000 times in their daily work and provide detailed problem feedback. Through such a project, potential problems can be exposed earlier, and the level of risk and the effectiveness of risk solutions can be assessed before the large model is put into application.

This is not an easy and cheap job, and requires a group of professionals and technicians to complete industry customization. Similar to the last round of AI boom after AlphaGo, if the target market is too small, there are not many opportunities for future reuse, or the marginal cost of reuse is too high, such high-investment early-stage development has no commercial prospects.

While OpenAI is prudently and steadily advancing certain industry applications, the market it has stirred up is far more fanatical and radical.

Since the Spring Festival, many stocks in the A-share “ChatGPT” sector have risen sharply, and releasing a large model has become a powerful tool for many listed companies to lift their share prices. U.S. stocks are similar. For example, at the end of January this year, C3 AI, a U.S. enterprise application provider, announced a generative AI enterprise search product based on OpenAI’s latest GPT large model. Enterprise users can ask the product questions directly in natural language, such as “How likely is this sales plan to be completed?” or “When will this truck need service?”

Although C3 AI did not explain how it uses enterprise data for analysis and prediction, its share price soared by more than 20% as soon as the news was released. The market is free to use its imagination, but enterprise customers weighing the promises of applications like C3’s should understand that large models currently lack the ability to make numerical predictions, and it is unknown whether that ability will emerge in the future.

The chairman of a new-energy industrial software company said in a recent media interview that he believes it will take a long time for ChatGPT to generate business value in the industrial field, but that “the industry needs such a concept. If you don’t invest in this, you have no chance.”

He was politely pointing to a phenomenon: regardless of whether GPT can generate value in the short term, or whether it is central to their own products, many companies choose to put the label on first in order to get money or resources. This further amplifies the false prosperity and optimism. It is just that some people are acting optimistic, while others really are overly optimistic.

Pursuing control amid loss of control

Out of Control: The New Biology of Machines, Social Systems, and the Economic World, published in 1994, describes the evolution, emergence, and loss of control of complex systems. Its author, Kevin Kelly, concluded that the actions of systems such as the neural network of the human brain, ant colonies, and bee colonies arise from a great many chaotic but interconnected events, like a system of thousands of springs driven in parallel, in which the movement of any one spring is transmitted to the whole. What emerges from the group is not a series of individual behaviors but a collective action completed by many individuals: human thought emerges from countless interconnected neurons, and an ant nest is built by thousands of ants cooperating unconsciously. This is the swarm model of a system.

AI large models also fit the characteristics of a swarm system. Lacking central control, a swarm system is relatively inefficient; a large model, for example, carries redundant information, and it is unpredictable, unknowable, and uncontrollable. But the lack of central control also brings adaptability, evolvability, boundlessness, and novelty, which is why large models can teach themselves new skills through emergence.

OpenAI speculates that emergence may be a mechanism similar to evolution. The paper “Learning to Generate Reviews and Discovering Sentiment,” of which OpenAI chief scientist Ilya Sutskever is one of the authors, reports that given enough model capacity, training data, and computing time, the model internally developed a sentiment-analysis unit that could accurately distinguish the sentiment a text expresses.

The paper suggests this may be because being able to tell sentiment apart greatly helps the model accomplish its goal of predicting the next word. In the same way, humans evolved complex physiological traits and cultural customs in service of the single goal of survival and reproduction, and the traits that better suit survival and allow the population to grow are the ones preserved. Emergence may be an evolutionary process of this kind, similar to natural selection.

The other side of evolution is loss of control. Things that can evolve cannot be fully controlled or designed in advance. Evolution creates not only new skills but possibly hallucinations as well. Learning to use a rapidly evolving black-box tool is a challenge humanity has never faced before. While accepting, understanding, and adapting to this loss of control, we need to find the parts that can be controlled, so as to avoid commercial risks and larger ones.

On the one hand, the AI academic community is still doing more research to understand, as far as possible, the laws governing emergence in large models. When Ali Rahimi compared AI to alchemy in 2017, he suggested drawing lessons from physics to simplify problems and open the AI black box.

Six years later, Sébastien Bubeck of Microsoft Research, the first author of the paper “Sparks of Artificial General Intelligence: Early experiments with GPT-4,” again proposed a research direction he calls the Physics of AI: borrowing the way of thinking of physics, using simplified models and controlled experiments to take large models apart, in the hope of finding the factors that affect emergence.

Just yesterday, OpenAI also published a paper, “Language models can explain neurons in language models,” which uses GPT-4 to explain the behavior of some neurons in GPT-2. The method still has limitations: for example, it explains only behavior, not the underlying mechanism or the effect on task performance. But OpenAI hopes to improve the method to advance the interpretability of large models, and it can be fully automated.

On the other hand, the industry is also trying various methods to make the overall results of AI large models more controllable.

  • Choose what to do and what not to do.


In some scenarios, certain existing defects of large models have little impact on commercial use and may even be beneficial. For example, the founder of a chat application that emphasizes personalization and fun said: “I don’t think hallucinations are a problem that needs to be solved. I even like it. This is an interesting feature of the model.” In “role-playing” chat scenarios, hallucinations are a source of imagination.

But for other industries with very low fault tolerance, such as medical diagnosis, autonomous driving, and industrial automation, hallucinations are significantly harmful.

  • Patch defects in large models by manual or machine means.

In areas where the capabilities of GPT’s large models are suitable, there are now partial solutions to defects such as hallucinations, insufficient planning ability, and lack of long-term memory.

Machine means include bringing past conversation history into the dialogue by querying a local database, which extends the model’s memory, and having two models check each other’s output to identify hallucinations. Manual means include using prompt engineering to guide large models through complex planning, and using human review to find and correct hallucinations.
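The two machine-side patches can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular product implements them: `MemoryStore`, `build_prompt`, and `cross_check` are hypothetical names, retrieval uses plain word overlap rather than a real database or embedding search, and the two "models" are stand-in functions supplied by the caller.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class MemoryStore:
    """Hypothetical local store of past dialogue turns, searched by word overlap."""

    def __init__(self):
        self.turns = []

    def add(self, text):
        self.turns.append(text)

    def search(self, query, k=2):
        """Return up to k stored turns that share the most words with the query."""
        q = tokenize(query)
        scored = [(len(q & tokenize(turn)), turn) for turn in self.turns]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort
        return [turn for _, turn in scored[:k]]


def build_prompt(store, question):
    """Prepend retrieved history so the model can 'remember' earlier turns."""
    context = "\n".join(f"- {turn}" for turn in store.search(question))
    return f"Relevant history:\n{context}\n\nQuestion: {question}"


def cross_check(question, model_a, model_b):
    """Ask two models the same question; flag disagreement for human review."""
    a, b = model_a(question), model_b(question)
    return {"answer": a, "needs_review": a.strip().lower() != b.strip().lower()}


store = MemoryStore()
store.add("Patient reported a penicillin allergy in March.")
store.add("Patient prefers morning appointments.")
print(build_prompt(store, "Does the patient have a penicillin allergy?"))
```

Agreement between two models does not prove an answer is correct, and exact string comparison is a crude stand-in for a real consistency check. The sketch only shows where such patches sit around the model, which is also where their extra cost comes from.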

The effects and costs of the above methods in different industry scenarios and different data environments need to be verified by practice, and the comprehensive results will affect the commercial value of the GPT large model in this industry or scenario.

  • Customize deeply for the target industry, be cautious about expecting rapid disruption, and expect additional costs.

Because the general-purpose large model, though more general than before, is still not general enough to cover every industry, more and more people realize that vertical-domain large models can be customized by fine-tuning on top of a general-purpose one. Such a model performs only a limited set of tasks in specified industry scenarios, and its scale can be reduced accordingly.

From light to heavy, the ways to do customization are:

  • Based on the API interface of the existing closed-source large model, customized applications are made through application-level fine-tuning + patching.
  • Choose an open source, basic model that has completed pre-training work, and do more customization.
  • Train the vertical model from scratch: start with pre-training data selection and model structure design, and customize a new large model to solve problems in specific industry scenarios. For example, Bloomberg has launched a 50 billion-parameter financial vertical large-scale model BloombergGPT. The pre-training uses financial data sets and general data sets in equal proportions. It is ahead of general-purpose large-scale models in financial-specific tasks, such as news sentiment analysis.

The heavier the approach, the greater the cost and the higher the barrier to entry. Which approach gives the most competitive advantage differs by industry. There is no standard answer, but there is a general decision-making model:

A more stable choice is to do the lightest patching first, and then decide whether to take the completely customized route after mastering the problem and data and verifying the business value. However, this may miss the time window, resulting in the failure to catch up with companies that made vertical models earlier in the industry. The latter may form a “data flywheel” that feeds back data to model capability iterations faster, widening the gap with others.

A bolder way is to plunge into the chosen direction and look for business value while building a large model from scratch. This requires sustained resources, and it is the common choice of the group of entrepreneurs with the strongest fundraising ability, such as Meituan co-founder Wang Huiwen and Sogou founder Wang Xiaochuan.

The future large-model business ecosystem may look like this: the market will have a few general-purpose large models, coming from combinations of top start-ups and cloud giants such as “OpenAI + Microsoft Cloud,” or from giants that develop general-purpose large models in-house. They will build some products themselves to directly serve the scenarios that large models are best at, or that tolerate errors and unreliability better, such as personal knowledge assistants, companion chat, creative content generation, and non-professional question answering and retrieval. At the same time, they will offer MaaS (Models as a Service), giving enterprise customers who want customized applications a model to build on. This requires simplifying fine-tuning so that third-party integrators or customers can easily fine-tune models themselves and deliver products at low cost. OpenAI is already making fine-tuning more automatic.

In professional scenarios that are high-value and demanding, or where data must stay confidential or is hard to obtain, there will be more vertical large models that are trained independently and serve a specific industry.

In the age of the steam engine, it took decades for society as a whole to realize the potential of the new technology, and the cross-industry adoption of large models will likewise be more tortuous than imagined. As the OpenAI paper “GPTs are GPTs” notes, this requires addressing the models’ shortcomings, taking the characteristics of each industry into account, co-inventing application solutions with industry partners, and even redesigning how enterprises are organized.

This process will be accompanied by swings in confidence, devaluation after overestimation, and the next change brewing in the trough. OpenAI, which ignited the GPT boom, was established in November 2015. It is itself a product of the previous round of AI confidence, but its confidence in AI was of a different kind. OpenAI has pursued AGI since its inception, at a time when most practitioners did not think AGI could be achieved in their lifetimes. Over the years, OpenAI was in no hurry to find practical business scenarios for the AI progress of the previous stage. It kept searching for a breakthrough toward AGI until the hard rock finally cracked open. This is not a moment for easy harvest, but only the beginning of a long journey.

* Long Zhiyong, the first author of this article, has written “From <Huawei’s Winter> to AI’s Winter”, translated “How to Create Credible AI” and “Scarcity”, and will soon publish a new book “The Age of Large Models”.

Author’s contact information, Zhihu ID: douglong


Literature citations:

DeepMind’s quest for AGI may not be successful, say AI researchers

Ali Rahimi’s talk at NIPS (NIPS 2017 Test-of-time award presentation)

GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models

Sparks of Artificial General Intelligence: Early experiments with GPT-4

What’s in my AI? – Dr Alan D. Thompson – Life Architect

Emergent Abilities of Large Language Models

AI Frontiers: The Physics of AI with Sébastien Bubeck – Microsoft Research

Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance – Google AI Blog

Physics of AI – YouTube

What’s Next for AI Systems & Language Models With Ilya Sutskever of OpenAI – Video | Scale Virtual Events

Episode 85: A Conversation with Ilya Sutskever – Voices in AI

Out of Control: The New Biology of Machines, Social Systems, and the Economic World

The AI Revolution in Medicine: GPT-4 and Beyond
