Tech Enthusiast Weekly (Issue 253): The day the training material runs out

Original link: http://www.ruanyifeng.com/blog/2023/05/weekly-issue-253.html

This weekly records technology content worth sharing and is published every Friday.

The weekly is open source and contributions are welcome. It also runs a service called “Who’s Recruiting”, which publishes programmer job postings. For business cooperation, please email ([email protected]).

Cover picture

This is not an art gallery but a cluster of red bayberry sheds in Sankou Village, Lin’an, Hangzhou, stacked up along the hillside. (via)

Topic of the Week: The day the training material runs out

AI is in the news every day, and the reports mention many different models.

A key indicator of a model’s strength is its number of parameters. In general, the more parameters, the stronger the model.

GPT-2 has 1.5 billion parameters, GPT-3 and ChatGPT have 175 billion, and GPT-4’s count has not been announced, though it is said to be more than 5 times that of the previous generation.

So, what are parameters?

As I roughly understand it, the parameter count corresponds to the number of nodes in the neural network that the model uses for prediction. The more parameters, the more possibilities the model considers, the greater the amount of computation, and the better the result.

If more parameters are better, can the parameter count grow without limit?

The answer is no, because the parameters are constrained by the training material. There must be enough training material to fit all of the parameters. If the parameters grew without limit, the training material would have to grow without limit too.

One rule of thumb I’ve seen is that the training material should be at least 10 times the number of parameters. For example, a model with 1,000 parameters that distinguishes cat photos from dog photos should be trained on at least 10,000 images.

ChatGPT has 175 billion parameters, so its training material should preferably be no less than 1.75 trillion tokens. Tokens are words and symbols of all kinds. Take the novel “Dream of Red Mansions” as an example. It has 788,451 characters, or roughly 1 million tokens. The training material for ChatGPT would therefore be equivalent to 1.75 million copies of “Dream of Red Mansions”.
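Applying the 10-to-1 rule of thumb to 175 billion parameters can be worked through in a few lines of Python. This is only a sketch of the arithmetic: it assumes the rule of thumb holds and rounds one copy of the novel to 1 million tokens, and the variable names are illustrative.

```python
# Rule of thumb from the text: training tokens >= 10 x parameters.
TOKENS_PER_PARAMETER = 10        # assumption: the "10 times" heuristic
PARAMETERS = 175_000_000_000     # 175 billion, the figure quoted for ChatGPT
TOKENS_PER_NOVEL = 1_000_000     # rounded token count for one copy of the novel

required_tokens = PARAMETERS * TOKENS_PER_PARAMETER
novel_copies = required_tokens // TOKENS_PER_NOVEL

print(f"{required_tokens:,} tokens")  # 1,750,000,000,000 tokens
print(f"{novel_copies:,} copies")     # 1,750,000 copies
```

Ten times 175 billion is 1.75 trillion tokens, which at 1 million tokens per copy comes to 1.75 million copies of the novel.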

According to reports, ChatGPT was actually trained on 570 GB of material from Wikipedia, online libraries, Reddit, Twitter, and more.

Consider what this means: a more powerful model requires more training material. Can we find that much material, and will it one day run out?

Scholars have in fact written a paper studying this question.

Over the past 10 years, AI training datasets have grown much faster than the world’s data stock. If this trend continues, exhausting the data stock is inevitable.

The paper gives three time points.

  • 2026: general language data is used up
  • 2030-2050: all language data is used up
  • 2030-2060: all visual data is used up

In other words, they predict that new training material will become hard to find in about three or four years. In thirty years at the latest, all the material in the world will not be enough for AI training.

In the chart above, given by the authors, the dotted line is the growth of the training material, and the red and blue lines are different projections of model growth. After 2035, the three lines merge and the curve becomes flatter and flatter.

At that point, the authors argue, AI model development could slow down significantly due to insufficient training material.

If their prediction is correct, it means that, contrary to popular belief, the rapid development of AI will not last long. We may now be in the stage of fastest development. After that, progress will begin to slow, and by the middle of this century it will have slowed markedly and be close to stagnation, much like the current state of quantum physics.

Tech news

1. Wheel steering system

South Korea’s Hyundai has released a new technology that allows each wheel to turn 90 degrees independently.

In the demo video, the concept car can drive sideways or turn around on the spot.

Although very practical, this technology increases the vehicle’s complexity and cost, and it is unknown whether it affects normal driving. Hyundai did not say whether it will go into production.

2. Static electricity from computer chairs

A user posted that his monitor at home often went dark for a few seconds for no apparent reason and then came back on.

He thought it was a monitor problem, but later found that the glitch only occurred when he moved his computer chair, sat down, or stood up.

His chair is the MARKUS from IKEA. Many others replied that their computer chairs have the same problem.

The fabric or metal frame of this chair readily builds up static electricity, and any movement can cause a discharge that briefly blanks the monitor.

The obvious solution is to replace the chair, but some hands-on users ran a ground wire from the chair to earth, which solved the discharge problem.

3. The hearing aid effect of wireless headphones

Wireless headphones could replace hearing aids and help the hearing-impaired, a study finds.

Apple’s AirPods have a “Live Listen” function that amplifies external sounds, much like a hearing aid, and it works very well in practice.

Hearing aids are expensive: tens of thousands of yuan for good ones and several thousand for ordinary ones. If wireless headphones can replace them, many hearing-impaired people will benefit.

4. Sand Dam Reservoir

South Korea has built its first sand dam reservoir to address dry-season water shortages in mountainous areas.

There is a sandstone reservoir inside the dam body, which is usually used to store water. When necessary, the pipeline is opened to let the water flow downstream.

Doing so is said to have three benefits: water evaporation is greatly reduced; water quality is improved as it passes through the sand bed; and water does not freeze in winter.

5. Smart wedding ring

A Czech company has launched a “smart wedding ring”, which can sense the wearer’s heartbeat and display the heartbeat curve on the ring.

What’s interesting is that it doesn’t show your own heartbeat, but the heartbeat of the other party.

It communicates with the phone via Bluetooth, and whenever the wearer presses on the ring, the phone contacts the other paired ring.

The heartbeat frequency of the other party will be transmitted to your mobile phone, and the heartbeat curve will also be displayed on the ring.

According to its inventors, it allows you to feel the romantic heartbeat of your lover at all times. It is made of rose gold and the price is $3,000 per pair.

Articles

1. My open source experience (Chinese)

The author shares his experience of developing an open source web application for image editing. (Contributed by @nihaojob)

2. How to implement CodePen by yourself (English)

CodePen is a well-known tool for editing web pages with a live preview. This article shows how to implement its main functions yourself, which turns out to be quite simple.

3. Quick Start with tcpdump (English)

The author teaches you how to use the command line tool tcpdump to view the TCP communication of a website.

4. Why is WebGPU important (English)

Operating systems currently have no unified graphics API: Windows uses DirectX, Apple uses Metal, and Linux uses Vulkan.

WebGPU is a cross-platform solution that provides a unified interface. This long article is recommended.

5. My 30 years of developing PCalc (English)

The author wrote a calculator, PCalc (above), for the Macintosh in 1992. He has maintained the project for 30 years since and ported it to other Apple devices, such as the iPhone and Apple Watch (below). In this article he looks back on those 30 years.

6. Use hurl to automate HTTP testing (English)

This article introduces a simple way to automate testing of a website’s API with the tool hurl and check that it responds correctly.

7. Error handling mechanism of programming language (English)

This article discusses how different languages handle errors. Java, for example, throws an exception, while Go returns the error as a value that the caller assigns to a variable.

Here is another article on the same topic, which is also worth reading.
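The two styles mentioned above, throwing an exception and handing the error back as a value, can be contrasted in a short sketch. Python is used for both here purely for illustration; the function names and the port-parsing task are hypothetical and are not taken from either article.

```python
from typing import Optional, Tuple


def parse_port_raising(text: str) -> int:
    """Exception style (as in Java): signal failure by raising."""
    port = int(text)  # raises ValueError on non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def parse_port_returning(text: str) -> Tuple[int, Optional[str]]:
    """Error-value style (as in Go): return (result, error) for the caller to check."""
    try:
        port = int(text)
    except ValueError:
        return 0, f"not a number: {text!r}"
    if not 0 < port < 65536:
        return 0, f"port out of range: {port}"
    return port, None


# Exception style: the happy path reads straight through, errors are caught elsewhere.
try:
    parse_port_raising("http")
except ValueError as exc:
    print("caught:", exc)

# Error-value style: every call site inspects the error explicitly.
port, err = parse_port_returning("8080")
if err is not None:
    print("failed:", err)
else:
    print("port:", port)
```

The trade-off is visible even at this size: exceptions keep the normal path uncluttered but make failures easy to overlook, while returned errors are verbose but impossible to ignore silently at the call site.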

8. Crazy C language strings (English)

This article is a tutorial on strings in C. It runs from the terminating \0 to Unicode and concludes that handling strings correctly in C is a lot of trouble.
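Two of the pitfalls named above, the terminating \0 and Unicode, can be observed without writing any C. The sketch below uses Python's standard ctypes module to create C-style buffers; it is an illustration of the general behavior, not an example from the linked article.

```python
import ctypes

# A C string is a byte array terminated by a NUL byte (\0).
buf = ctypes.create_string_buffer(b"hello")
print(ctypes.sizeof(buf))  # 6: five letters plus the terminating \0
print(buf.raw)             # b'hello\x00'

# Reading a buffer as a C string stops at the first \0,
# so an embedded NUL silently truncates the text.
cut = ctypes.create_string_buffer(b"ab\x00cd")
print(cut.value)           # b'ab' (read as a NUL-terminated string)
print(cut.raw)             # b'ab\x00cd\x00' (the full underlying memory)

# Once text leaves ASCII, byte length and character count differ.
text = "caf\u00e9"         # "café", with a precomposed e-acute
encoded = text.encode("utf-8")
print(len(text), len(encoded))  # 4 5
```

A C function that counts bytes up to the first \0 would report 5 for the last string, although it has only 4 characters.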

Tools

1. Stagit

This software can turn a Git repository into a static website, generating a page for each file and commit.

2. Meta tag generator

When an external URL is shared, many social media sites display a card with the page’s title, thumbnail, and summary. This information comes from meta tags inside the page, and this tool helps you generate them.
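To make the idea concrete, here is a minimal sketch of what such a generator produces. It assumes the widely used Open Graph properties (og:title, og:description, og:image, og:url) plus the twitter:card tag; the helper name build_meta_tags and the example.com URLs are hypothetical, and the linked tool may emit a different set of tags.

```python
from html import escape


def build_meta_tags(title: str, description: str, image_url: str, page_url: str) -> str:
    """Return Open Graph and Twitter Card meta tags for a page (hypothetical helper)."""
    properties = {
        "og:title": title,
        "og:description": description,
        "og:image": image_url,
        "og:url": page_url,
    }
    lines = [
        # escape() also escapes quotes, so values are safe inside attributes.
        f'<meta property="{escape(name)}" content="{escape(value)}">'
        for name, value in properties.items()
    ]
    # Twitter Card tags use name= rather than property=.
    lines.append('<meta name="twitter:card" content="summary_large_image">')
    return "\n".join(lines)


print(build_meta_tags(
    title="Tech Enthusiast Weekly (Issue 253)",
    description="The day the training material runs out",
    image_url="https://example.com/cover.png",
    page_url="https://example.com/issue-253",
))
```

The output is pasted into the page's head element, where social media crawlers look for it when building the preview card.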

3. CJK font recognition

Upload a picture of East Asian text, and this open source tool identifies the fonts used in it. (Contributed by @JeffersonQin)

4. microblog.pub

A self-hosted, open-source microblogging site for a single user (no multi-user support) that supports the ActivityPub protocol.

5. Textual Markdown Browser

A Markdown file renderer for the terminal window, suitable for reading Markdown files in the terminal.

6. HorusPass

This site takes text entered by the user and generates a URL for sharing. The URL can be opened only once and no longer exists on the second visit, a bit like “burn after reading”.
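The open-once behavior described above can be sketched with an in-memory store that deletes each entry as it is read. This is an illustration of the general idea only, not how HorusPass itself is implemented; the class name OneTimeStore and the URL layout are hypothetical.

```python
import secrets
from typing import Dict, Optional


class OneTimeStore:
    """In-memory store whose entries can be read exactly once (illustrative only)."""

    def __init__(self) -> None:
        self._messages: Dict[str, str] = {}

    def put(self, text: str) -> str:
        """Save the text and return an unguessable token to embed in the share URL."""
        token = secrets.token_urlsafe(16)
        self._messages[token] = text
        return token

    def take(self, token: str) -> Optional[str]:
        """Return the text and delete it; later reads of the same token get None."""
        return self._messages.pop(token, None)


store = OneTimeStore()
token = store.put("meet at the usual place")
print(f"https://example.com/s/{token}")  # hypothetical URL layout
print(store.take(token))  # first visit: the text
print(store.take(token))  # second visit: None
```

Reading and deleting in a single pop call is the key design choice, since it leaves no window in which a second visitor could still fetch the text.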

7. Progress-up

A JS library for uploading multiple files from a web page, with upload progress display.

8. snappify

A tool to generate screenshots of code snippets.

9. RustDesk

An open source remote desktop software that allows you to remotely operate the desktops of other computers, with clients for various operating systems.

10. Lossless Cut

A video editor whose main feature is that it does not re-encode: it cuts and joins clips in the original video format, so it is extremely fast.

Resources

1. ChatGPT Prompt Engineering for Developers

A free English-language course from Andrew Ng and OpenAI that teaches you how to write ChatGPT prompts and build your own chatbot.

2. The Complete Guide to Next.js and React

A Chinese-subtitled version of a highly rated paid Udemy course. (Contributed by @lyf61)

3. The Illustrated QUIC Connection (Chinese version)

Explains the meaning of every byte in a QUIC connection. It is a translation of the English original. (Contributed by @cangSDARM)

4. Musico

An AI model that generates music automatically. You can listen to music it has generated on its official website.

Pictures

1. The expression of the cloud

An American artist specializes in adding expressions to photos of clouds, making them look like cartoon characters.

At first, out of boredom, he took some random photos of clouds, drew expressions on them, and posted them on the Internet.

Later, he found that many people liked these works, so he kept going.

“Looking at the clouds gives you endless inspiration,” he said.

Now, more and more readers are sending him submissions. He is also preparing to publish a book.

Excerpt

1. The seven levels of busyness

The busyness of life can be divided into seven levels.

See which level you are at.

Level 1: Not busy at all.

Your time is completely free and you can arrange it however you like. There is nothing you must do, and you can sleep as long as you want on weekends.

Level 2: There are some little things.

You remember that there are things to do. They are real tasks with no deadlines, but you know they have to be done sooner or later.

Level 3: Something important.

You have things that must be done and need to be tracked in time without procrastination, and you will always remind yourself of these things.

Level 4: The schedule is packed.

Your schedule is full, and you have to constantly ask yourself “what is more important?” in order to decide which things to do first and which things to do later.

You have no unplanned time, but you can still control the schedule.

Level 5: Chaos occurs in life.

You can’t finish your work during working hours, and you start working overtime.

You often apologize to others because things are late. You have not given those things up, but you have to rush, and some of them get done sloppily.

Level 6: The task is endless.

You need to do more than you can schedule. Even if you give up some things, you still can’t finish the rest.

Your working hours are greatly extended, affecting normal life. You feel very tired.

Level 7: Life can’t get by.

Tasks fill every waking minute of your life. Time for meals and other necessities has to be squeezed out. When you’re busy, you don’t even have time to eat.

You stop writing your schedule because there is no time to plan and things change every hour.

You are absent-minded even when walking, and you often feel that you are about to collapse and cannot go on.

Quotes

1.

I left Google so that I could speak out about the risks of AI. It was not convenient to talk about these things while at Google.

“Father of Deep Learning” Geoffrey Hinton, announcing his resignation from Google

2.

The problem in Europe is that, instead of seeing the Internet as an economic opportunity to be exploited, it is seen as something to be regulated.

“Europe is not ready to become a ‘third superpower’”

3.

Most people think it’s okay to have people under them who are smarter than they are. Typically, leaders hire advisers and staff who are smarter than themselves.

So why would people feel threatened when their subordinates become AI models that are smarter than they are?

Yann LeCun, Chief AI Scientist at Meta

4.

To be a good programmer, write a lot of code; to be a top programmer, read a lot of code.

“Write CRISP Code Please”

This week in history

How to Get Past Disappointment and Doubt (2022 #206)

Graphics card shortages and competition from other industries (2021 #156)

Digital Nomads (2020 #106)

Why do liberal arts students have a hard time finding jobs? (2019 #56)

Thanks

The Weekly is deeply grateful for the help of FlowUs, a new-generation knowledge management and collaboration platform from China.

FlowUs = documents + tables + cloud drive. You can use it to write documents, make a home page, manage data, store files, and more.

Each issue of the weekly is published in the FlowUs column at the same time. You are welcome to open your own column and home page.

(End)

Document information

  • Copyright Notice: Free Reprint-Non-Commercial-Non-Derivative-Attribution (Creative Commons 3.0 License)
  • Date of publication: May 5, 2023
