Original link: https://www.latepost.com/news/dj_detail?id=1703
In the second half of 2022, as ChatGPT took off, a16z, the well-known Silicon Valley venture capital firm, visited dozens of AI startups and big technology companies. It found that startups were handing 80%-90% of their early funding to cloud computing platforms to train their own models. It estimated that even after these companies’ products mature, they will still pay 10%-20% of their revenue to cloud computing companies every year, in effect an “AI tax”.
This has created a large market for providing model capabilities and training services on the cloud and renting out computing power to other customers and startups. In China alone, at least dozens of startups and small and medium-sized companies are building their own complex large language models, and all of them have to rent GPUs from cloud computing platforms. By a16z’s reckoning, only when a company’s annual spending on AI computing exceeds US$50 million does it have enough scale to justify buying GPUs in bulk itself.
According to “LatePost”, after this year’s Spring Festival, all the major Chinese Internet companies with cloud computing businesses placed large orders with Nvidia. ByteDance has ordered more than US$1 billion worth of GPUs from Nvidia this year, and another large company has ordered at least 1 billion yuan worth.
ByteDance’s orders this year alone may already approach the total value of commercial GPUs Nvidia sold in China last year. In September last year, when the U.S. government imposed export controls on the A100 and H100 (Nvidia’s two latest generations of data center commercial GPUs), Nvidia responded that the controls could affect US$400 million (about 2.8 billion yuan) of potential sales in the Chinese market in the fourth quarter of last year. Extrapolating from that figure, Nvidia’s data center GPU sales in China in 2022 were roughly 10 billion yuan.
Compared with the overseas giants, China’s big technology companies are in a greater hurry to buy GPUs. Amid the cost-cutting and efficiency drives of the past two years, some cloud computing platforms reduced GPU purchases and now have insufficient reserves. Moreover, no one can guarantee that the high-performance GPUs that can be bought today will not be subject to new restrictions tomorrow.
From cutting orders to adding them, while reshuffling resources internally
Before the beginning of this year, demand for GPUs from China’s big tech companies was tepid.
GPUs have two main uses in China’s large Internet technology companies: supporting the companies’ own businesses and cutting-edge AI research internally, and selling GPU computing power externally through their cloud platforms.
A person at ByteDance told “LatePost” that after OpenAI released GPT-3 in June 2020, ByteDance trained a generative language model with billions of parameters, mainly using the A100’s predecessor, the V100. Because of the limited parameter scale, the model’s generation ability was mediocre, and ByteDance saw no path to commercializing it at the time: “the ROI (return on investment) couldn’t be calculated,” and the attempt came to nothing.
Alibaba also purchased GPUs actively in 2018-2019. According to an Alibaba Cloud source, Alibaba’s purchases at the time reached at least ten thousand GPUs, mainly Nvidia’s earlier V100 and T4 models. However, only about one-tenth of these GPUs went to DAMO Academy for AI research and development. After releasing the trillion-parameter large model M6 in 2021, DAMO Academy disclosed that 480 V100s had been used to train it.
More of the GPUs Alibaba bought back then went to Alibaba Cloud for external leasing. However, Alibaba Cloud and a group of other Chinese cloud computing companies overestimated AI demand in the Chinese market. A technology investor said that before the large-model boom, GPU computing power at the major domestic cloud vendors was not in short supply; the worry was selling it, and cloud vendors even had to cut prices to move resources. Last year, Alibaba Cloud cut prices six times, and GPU rental prices fell by more than 20%.
Against the backdrop of cutting costs, raising efficiency, and pursuing “quality growth” and profits, it is understood that Alibaba scaled back GPU procurement after 2020, and Tencent also cut one batch of Nvidia GPU orders at the end of last year.
But not long after, at the end of 2022, ChatGPT changed everyone’s views, and a consensus quickly formed: large models are an opportunity too big to miss.
The founders of each company tracked large-model progress personally: Zhang Yiming, founder of ByteDance, began reading artificial intelligence papers; Zhang Yong, chairman of Alibaba’s board, took over Alibaba Cloud and announced Alibaba’s large-model progress at the Alibaba Cloud Summit, saying that “all industries, applications, software, and services are worth redoing based on the capabilities of large models.”
A person at ByteDance said that in the past, anyone applying to buy GPUs inside the company had to explain the input-output ratio and the priority and importance of the business. Now the large-model business is a new, company-level strategic business: the ROI cannot be calculated for the time being, and the investment must be made anyway.
Developing their own general-purpose large models is only the first step. Each company’s bigger goal is to launch cloud services that provide large-model capabilities, which is the truly large market that can justify the investment.
Microsoft’s cloud service Azure has little presence in China’s cloud computing market; for ten years it has mainly served the China operations of multinational companies. But now customers have to wait in line, because it is the exclusive cloud broker for OpenAI’s commercial offerings.
At its cloud summit in April, Alibaba again stressed that MaaS (Model as a Service) is the future of cloud computing. Besides opening its self-developed general-purpose foundation model “Tongyi Qianwen” for testing, it released a series of tools to help customers train and use large models on the cloud. Soon afterward, Tencent and ByteDance’s Volcano Engine released new versions of their own training cluster services: Tencent said that with its new-generation cluster, training a trillion-parameter large model can be compressed to 4 days; ByteDance said its new cluster supports large-model training at the ten-thousand-GPU scale, and that most of the dozens of large-model companies in China are already using Volcano Engine.
All of these platforms use either Nvidia’s A100 and H100 GPUs or the cut-down A800 and H800 that Nvidia launched specifically after last year’s ban. The bandwidth of these two processors is about three-quarters and about half that of the originals, respectively, keeping them under the regulatory threshold for high-performance GPUs.
Around the H800 and A800, China’s major technology companies have started a new round of competition for orders.
A person at a cloud vendor said that big companies such as ByteDance and Alibaba mainly negotiate purchases directly with Nvidia; agents and the second-hand market cannot meet their enormous demand.
Nvidia negotiates a discount off the list price based on purchase volume. According to Nvidia’s official website, the A100 is priced at US$10,000 per card (about 71,000 yuan) and the H100 at US$36,000 per card (about 257,000 yuan); it is understood that the A800 and H800 are priced slightly below the originals.
Whether a Chinese company can actually get the cards depends more on business relationships, such as whether it was a major Nvidia customer in the past. “It makes a difference whether you talk to Nvidia China or go to the United States and talk directly to Lao Huang (Huang Renxun, or Jensen Huang, Nvidia’s founder and CEO),” a person at a cloud vendor said.
Some companies also pursue “business cooperation” with Nvidia: when buying the sought-after data center GPUs, they purchase other products as well to compete for priority supply. It is much like Hermès distribution, where getting a popular bag often means also buying clothes and shoes worth tens of thousands of yuan.
Based on the industry information we have obtained, ByteDance’s new orders this year are aggressive, exceeding the US$1 billion level.
According to a person close to Nvidia, counting both cards that have arrived and those that have not, ByteDance’s A100 and H800 orders total 100,000 units. Of these, the H800 only went into production in March this year, so those chips should come from purchases added this year. It is understood that, on the current production schedule, some H800s will not be delivered until the end of this year.
ByteDance started building its own data centers in 2017. Those data centers once relied mainly on CPUs for all computing; as late as 2020, ByteDance spent more on Intel CPUs than on Nvidia GPUs. The shift in its purchasing reflects how, in the computing needs of today’s large technology companies, intelligent computing is catching up with general-purpose computing.
It is understood that another major Internet company placed an order with Nvidia this year at the ten-thousand-GPU level, worth more than 1 billion yuan at list price.
Tencent was the first to announce use of the H800: Tencent Cloud already used H800s in the new version of its high-performance computing service released in March this year, saying it was the first in China to do so. The service is now open to enterprise customers for trial by application, putting Tencent ahead of most Chinese companies.
It is understood that Alibaba Cloud also proposed internally in May that the “Smart Computing Battle” would be its number one campaign this year, setting three goals: machine scale, customer scale, and revenue scale; the key indicator for machine scale is the number of GPUs.
Before the new GPUs arrive, companies are also reshuffling resources internally to prioritize large-model development.
The way to free up a lot of resources at once is to cut less important directions, or directions with no clear short-term prospects. “Big companies have plenty of half-dead businesses taking up resources,” said an AI practitioner at a major Internet company.
In May this year, Alibaba’s DAMO Academy disbanded its autonomous driving laboratory: about one-third of its 300-plus employees were transferred to the Cainiao technical team, and the rest were laid off; DAMO Academy no longer keeps an autonomous driving business. Developing autonomous driving also requires high-performance GPUs for training, so although the adjustment may not be directly related to large models, it did give Alibaba a batch of “freed-up GPUs”.
ByteDance and Meituan went straight to the commercialization technology teams that bring in their advertising revenue and reallocated GPUs from them.
According to “LatePost”, shortly after this year’s Spring Festival, ByteDance handed a batch of A100s originally earmarked for its commercialization technology team to Zhu Wenjia, head of TikTok product technology, who is leading ByteDance’s large-model R&D. The commercialization technology team is the core business department that supports Douyin’s advertising recommendation algorithm.
Meituan began developing its large model around the first quarter of this year. It is understood that Meituan recently pulled a batch of top-spec A100s with 80 GB of memory from multiple departments to supply the large-model effort first, leaving those departments to switch to lower-spec GPUs.
Bilibili, whose financial resources are far thinner than the big platforms’, also has large-model plans. It is understood that Bilibili had previously stockpiled hundreds of GPUs. This year it is buying additional GPUs while also coordinating departments to spare cards for the large model. “Some departments hand over 10 cards, some 20,” a person close to Bilibili said.
Internet companies such as ByteDance, Meituan, and Bilibili generally have some redundant GPU resources in the technical departments that support search and recommendation, which can be “spared”.
However, the number of GPUs that can be obtained by this robbing-Peter-to-pay-Paul approach is limited; the bulk of the GPUs needed to train large models still depends on each company’s past accumulation and on waiting for new cards to arrive.
The whole world is scrambling for computing power
The race for Nvidia’s data center GPUs is happening worldwide. But the overseas giants started buying large volumes of GPUs earlier, have bought more, and have invested more consistently in recent years.
In 2022, Meta and Oracle were already investing heavily in the A100. Meta partnered with Nvidia to build the RSC supercomputing cluster in January last year, which contains 16,000 A100s. In November of the same year, Oracle announced the purchase of tens of thousands of A100s and H100s to build a new computing center; that center has now deployed more than 32,700 A100s, with new H100s coming online one after another.
Since its first investment in OpenAI in 2019, Microsoft has provided OpenAI with tens of thousands of GPUs. In March this year, Microsoft announced it had helped OpenAI build a new computing center containing tens of thousands of A100s. In May, Google launched Compute Engine A3, a computing cluster with 26,000 H100s, serving companies that want to train large models themselves.
The actions and mindset of China’s major companies are now more urgent than those of the overseas giants. Baidu, for example, placed a new order with Nvidia this year for tens of thousands of GPUs, an order of magnitude comparable to companies like Google, even though Baidu is far smaller: its revenue last year was 123.6 billion yuan, only about 6% of Google’s.
It is understood that ByteDance, Tencent, Alibaba, and Baidu, the four Chinese technology companies that have invested the most in AI and cloud computing, have accumulated tens of thousands of A100s over the years. ByteDance holds the largest absolute number of A100s; excluding this year’s new orders, its A100s and their predecessor V100s together come close to 100,000 cards.
Among up-and-coming companies, SenseTime announced this year that a total of 27,000 GPUs, including 10,000 A100s, have been deployed in its “AI large device” computing cluster. Even High-Flyer, a quantitative investment firm that seemingly has nothing to do with AI, bought 10,000 A100s earlier.
Looking only at the totals, these GPUs seem more than enough for the companies to train large models. According to a case study on Nvidia’s official website, OpenAI used 10,000 V100s to train the 175-billion-parameter GPT-3; training GPT-3 on A100s would require 1,024 cards for about one month, the A100 offering a 4.3-fold performance improvement over the V100. But much of what large Chinese companies bought in the past has to support existing businesses or be sold through cloud computing platforms, and cannot be freely devoted to large-model development or to serving customers’ large-model needs externally.
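As a rough illustration of where a figure like “1,024 A100s for about a month” can come from, the sketch below applies the common approximation that training compute is about 6 × parameters × tokens. The token count (300 billion) and sustained utilization (45%) are assumptions for illustration only, not figures from this article.

```python
# Back-of-the-envelope estimate of GPT-3-scale training time on A100s.
# Assumptions (not from the article): ~300B training tokens, ~45% sustained
# utilization of the A100's 312 TFLOPS FP16/BF16 tensor peak, and the common
# rule of thumb that training FLOPs ~= 6 * parameters * tokens.

params = 175e9                 # GPT-3 parameter count
tokens = 300e9                 # assumed training tokens
total_flops = 6 * params * tokens          # ~3.2e23 FLOPs

a100_peak = 312e12             # A100 FP16/BF16 tensor peak, FLOP/s
utilization = 0.45             # assumed sustained fraction of peak
num_gpus = 1024

cluster_flops = num_gpus * a100_peak * utilization
days = total_flops / cluster_flops / 86400
print(f"~{days:.0f} days on {num_gpus} A100s")   # roughly 25 days, about a month
```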
This also explains the wide spread in how Chinese AI practitioners estimate the country’s computing resources. Zhang Yaqin, dean of Tsinghua University’s Institute for AI Industry Research, said at a Tsinghua forum at the end of April: “If China’s computing power is added together, it is equivalent to 500,000 A100s; training five models is no problem.” Yin Qi, CEO of the AI company Megvii, told Caixin in an interview that China currently has only about 40,000 A100s in total that can be used for large-model training.
Capital expenditure, which mainly reflects investment in fixed assets such as chips, servers, and data centers, gives an intuitive picture of the order-of-magnitude gap in computing resources between large Chinese and foreign companies.
Baidu, which was the first to test a ChatGPT-like product, has had annual capital expenditure of between US$800 million and US$2 billion since 2020; Alibaba’s has been between US$6 billion and US$8 billion, and Tencent’s between US$7 billion and US$11 billion. Over the same period, the annual capital expenditures of Amazon, Meta, Google, and Microsoft, the four American technology companies that build their own data centers, all exceeded US$15 billion.
During the three years of the pandemic, overseas companies’ capital expenditure kept climbing: Amazon’s reached US$58 billion last year, Meta’s and Google’s were both US$31.4 billion, and Microsoft’s was close to US$24 billion. Chinese companies’ investment has been shrinking since 2021; Tencent’s and Baidu’s capital expenditures both fell more than 25% year-on-year last year.
The GPUs on hand for training large models are no longer enough. If Chinese companies truly intend to invest in large models over the long term, and to earn the “shovel-selling” money from other companies’ model needs, they will need to keep adding GPU resources.
OpenAI, which is running ahead of everyone, has already hit this wall. In mid-May, OpenAI CEO Sam Altman told a small group of developers that because of the GPU shortage, OpenAI’s current API service is neither stable nor fast enough; until more GPUs arrive, GPT-4’s multimodal capabilities cannot be extended to every user, and OpenAI does not plan to release new consumer products in the near term. The technology consultancy TrendForce said in a June report that OpenAI needs about 30,000 A100s to keep optimizing and commercializing ChatGPT.
Microsoft, which works closely with OpenAI, faces a similar situation: in May, some users complained that New Bing was answering slowly, and Microsoft responded that GPU replenishment could not keep up with user growth. Microsoft 365 Copilot, which embeds large-model capabilities into Office, has not been rolled out at scale; the latest figure is that more than 600 companies are trying it, against a worldwide Office 365 user base of nearly 300 million.
If a large Chinese company aims not merely to train and release a large model but to actually build products that serve more users with it, and further to support other customers training their own large models on its cloud, it needs to stockpile even more GPUs in advance.
Why only those four cards?
For training large AI models, there is no substitute for the A100, the H100, and the cut-down A800 and H800 supplied specifically to China. According to the quantitative hedge fund Khaveen Investments, Nvidia’s data center GPU market share reached 88% in 2022, with AMD and Intel splitting the rest.
At the GTC conference in 2020, Huang Renxun unveiled the A100.
Nvidia’s current irreplaceability stems from how large models are trained. The core steps are pre-training and fine-tuning: the former lays the foundation, equivalent to a general education through university graduation; the latter optimizes the model for specific scenarios and tasks to improve its performance on the job.
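To make the pre-training versus fine-tuning split concrete, here is a minimal, hypothetical PyTorch sketch; the tiny model is a stand-in, not any company’s actual model. Pre-training updates every parameter on massive general data, while fine-tuning typically updates only a small, task-specific part on much less data.

```python
import torch
from torch import nn

# Hypothetical tiny language model standing in for a real pre-trained network.
class TinyLM(nn.Module):
    def __init__(self, vocab=32000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        return self.head(self.backbone(self.embed(ids)))

model = TinyLM()

# Pre-training: every parameter is trainable; this is the dominant compute cost.
pretrain_opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Fine-tuning: freeze the backbone and adapt only the output head on task data.
for p in model.backbone.parameters():
    p.requires_grad = False
finetune_opt = torch.optim.AdamW(model.head.parameters(), lr=1e-4)
```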
The pre-training stage is especially compute-intensive, with extremely high demands on the performance of each individual GPU and on data transfer between cards.
At present, only the A100 and H100 deliver the computing efficiency that pre-training requires. They look expensive but are in fact the cheapest option: AI is still in the early stage of commercial use, and cost directly determines whether a service is viable.
Some earlier models, such as VGG16, which can recognize a cat as a cat, had only 130 million parameters; back then, some companies would run their AI models on the RTX-series consumer graphics cards made for gaming. GPT-3, released more than two years ago, has 175 billion parameters.
At the enormous compute scale of large models, stitching together computing power from more low-performance GPUs no longer works. Training on many GPUs requires transmitting data and synchronizing parameters between chips, during which some GPUs sit idle and can never stay saturated; the weaker each card, the more cards are needed, and the greater the loss of computing power. When OpenAI trained GPT-3 on 10,000 V100s, compute utilization was under 50%.
The A100 and H100 offer both high single-card compute and high bandwidth for moving data between cards. The A100’s FP32 compute (single-precision floating point, stored in 4 bytes) is 19.5 TFLOPS (1 TFLOPS is one trillion floating-point operations per second), while the H100’s FP32 compute reaches 134 TFLOPS, roughly 4 times that of AMD’s MI250.
The A100 and H100 also provide efficient data transfer to minimize idle compute. Nvidia’s exclusive weapons here are its interconnect technologies, NVLink and NVSwitch, introduced starting in 2014. The fourth-generation NVLink used in the H100 raises the bidirectional bandwidth between GPUs in the same server to 900 GB/s (900 GB of data per second), more than 7 times that of the latest generation of PCIe (a point-to-point high-speed serial transmission standard).
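A simplified way to see why inter-card bandwidth matters: in data-parallel training, every step ends with gradients being synchronized across GPUs (an all-reduce), and the time that takes is roughly the traffic volume divided by link bandwidth. The sketch below uses a GPT-3-scale gradient size and the 900 GB/s versus roughly 128 GB/s figures as illustrative assumptions, ignoring overlap with compute and topology details.

```python
# Toy estimate of per-step gradient-synchronization time in data-parallel
# training. Real all-reduce cost depends on topology and overlap with compute;
# this only shows how link bandwidth sets the floor.

def allreduce_seconds(params: float, bytes_per_param: int, bandwidth_gbs: float) -> float:
    # A ring all-reduce moves roughly 2x the gradient volume per GPU.
    traffic_bytes = 2 * params * bytes_per_param
    return traffic_bytes / (bandwidth_gbs * 1e9)

params = 175e9                          # assumed GPT-3-scale gradient count
for name, bw in [("NVLink-class, 900 GB/s", 900.0),
                 ("PCIe-class, ~128 GB/s", 128.0)]:
    t = allreduce_seconds(params, bytes_per_param=2, bandwidth_gbs=bw)  # fp16 gradients
    print(f"{name}: ~{t:.1f} s of communication per synchronization")
```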
The U.S. Department of Commerce’s export rules on GPUs last year drew their lines on exactly these two dimensions, computing power and bandwidth: a computing-power ceiling of 4,800 TOPS and a bandwidth ceiling of 600 GB/s.
The A800 and H800 match the originals in computing power but with reduced bandwidth. The A800’s bandwidth drops from the A100’s 600 GB/s to 400 GB/s; the H800’s exact specifications have not been disclosed, but according to Bloomberg its bandwidth is only about half of the H100’s 900 GB/s. On the same AI task, the H800 takes 10%-30% longer than the H100. One AI engineer speculated that the H800’s training performance may not even match the A100’s, yet it costs more.
Even so, the A800 and H800 still outperform comparable products from other big companies and startups. Limited by performance and by more specialized architectures, the AI chips and GPUs launched by various vendors are now mainly used for AI inference and struggle with large-model pre-training. Put simply, AI training makes the model and AI inference uses it, and training places much higher demands on the chip.
In addition to the performance gap, Nvidia’s deeper moat is the software ecology.
As early as 2006, Nvidia launched CUDA, a parallel computing software platform that lets developers run AI training and inference more efficiently and make full use of GPU computing power. CUDA has become part of today’s AI infrastructure: the mainstream AI frameworks, libraries, and tools are all built on it.
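As a minimal illustration of that ecosystem in practice, the sketch below runs the same matrix multiplication on the CPU and, if a CUDA device is present, on the GPU, where PyTorch dispatches to Nvidia’s CUDA libraries (such as cuBLAS) under the hood without the developer writing any GPU code.

```python
import torch

# The same high-level call; on a CUDA device PyTorch dispatches to
# Nvidia's CUDA/cuBLAS kernels without the developer writing GPU code.
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)
cpu_result = a @ b

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    gpu_result = a_gpu @ b_gpu            # executed by CUDA kernels
    torch.cuda.synchronize()              # wait for the asynchronous GPU work
    print(torch.version.cuda)             # CUDA toolkit version PyTorch was built against
```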
GPUs and AI chips other than Nvidia’s that want to hook into CUDA must supply their own adaptation software, which captures only part of CUDA’s performance and iterates more slowly. AI frameworks such as PyTorch are trying to break CUDA’s software-ecosystem monopoly by offering more software capabilities that support other vendors’ GPUs, but this has limited appeal to developers.
One AI practitioner said his company had talked with a non-Nvidia GPU maker, which quoted lower prices for its chips and services than Nvidia and promised more responsive support; but they judged that the overall cost of training and development on other GPUs would end up higher than with Nvidia, and that they would also have to bear uncertainty about the results and spend more time.
“Although the A100 is expensive, it’s actually the cheapest to use,” he said. For large technology companies and leading startups that intend to seize the opportunity of large models, money is often not a problem, and time is a more precious resource.
In the short term, the only thing affecting Nvidia’s data center GPU sales may be TSMC’s production capacity.
The H100/H800 are made on a 4 nm process, the A100/A800 on 7 nm, and all four chips are produced by TSMC. According to Taiwanese media reports, Nvidia has placed an additional 10,000 data center GPU orders with TSMC this year, including a super-urgent order that can shorten production time by up to 50%; normally it takes TSMC several months to produce an A100. The current production bottleneck is mainly insufficient advanced-packaging capacity, with a shortfall of 10-20 percent that will take 3-6 months to gradually close.
In the decade-plus since GPUs built for parallel computing were introduced to deep learning, AI progress has been driven by hardware and software together, with GPU computing power and models and algorithms advancing in step: model development drives demand for computing power, and growing computing power in turn makes larger-scale training, once out of reach, possible.
In the last wave of the deep-learning boom, represented by image recognition, China’s AI software capabilities were on par with the world’s most advanced. Computing power is the current difficulty: designing and manufacturing chips takes longer accumulation and involves a long supply chain and numerous patent barriers.
Large models are another major advance at the model and algorithm layer, and there is no time to take things slowly. Companies that want to build large models, or to provide cloud computing power for them, must secure enough advanced computing power as soon as possible. The scramble for GPUs will not stop until this wave brings the first batch of companies either cheers or disappointment.