Original link: https://www.williamlong.info/archives/7251.html
OpenAI has launched a web crawler called GPTBot to collect public web data for improving future AI models. According to OpenAI, GPTBot will respect paywalls and will not scrape content that requires payment, nor will it collect data that can be traced back to individuals.
In addition, OpenAI gives website owners the choice of whether to let GPTBot crawl their data: they can disallow it in their robots.txt file, or block GPTBot's published IP addresses at the network level.
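Per OpenAI's documentation, blocking GPTBot entirely takes just two lines in the site's robots.txt:

```
User-agent: GPTBot
Disallow: /
```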
This is certainly not enough: modifying robots.txt is one mechanism, but opting out could be made more convenient and more transparent, for instance by telling site owners up front exactly what the collected data will be used for.
Previously, OpenAI's practice of scraping public data to train its proprietary AI models was controversial. Sites such as Reddit and Twitter have taken steps to crack down on AI companies using their users' posts for free, while some authors and other creators have filed lawsuits over the alleged unauthorized use of their work.
Training OpenAI's GPT models requires a large amount of web data, which raises issues such as data privacy and copyright. To address these concerns, OpenAI recently introduced a way for websites to prevent its web crawler from scraping their data for GPT training.
A web crawler is an automated program that searches for and retrieves information on the Internet. OpenAI's crawler is named GPTBot; it visits websites at a certain frequency and saves the content of their pages for training GPT models.
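As an illustration, here is a minimal Python sketch of how any robots.txt-respecting crawler (GPTBot included) decides whether a page may be fetched. This is not OpenAI's actual code, and the example.com URLs are placeholders:

```python
import urllib.robotparser

# Load a site's robots.txt (example.com is a placeholder domain).
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A compliant crawler identifying itself as "GPTBot" checks each URL
# against the site's rules before downloading it.
if rp.can_fetch("GPTBot", "https://example.com/articles/some-page.html"):
    print("robots.txt permits GPTBot to fetch this page")
else:
    print("robots.txt forbids GPTBot from fetching this page")
```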
OpenAI said in its blog post that website operators can prevent GPTBot from scraping data from their sites by disallowing GPTBot's access in the site's robots.txt file, or by blocking its IP addresses. OpenAI also stated that web pages crawled with the GPTBot user agent may be used to improve future models, and that the crawled data is filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or contain text that violates its policies. For sources that do not meet the exclusion criteria, OpenAI says, allowing GPTBot to access your website "can help AI models become more accurate and improve their general capabilities and safety."
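Standard robots.txt semantics also allow finer-grained control, so a site can expose some sections to GPTBot while hiding others. OpenAI's documentation shows a pattern along these lines (the directory names here are placeholders):

```
User-agent: GPTBot
Allow: /public-content/
Disallow: /members-only/
```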
However, this does not retroactively remove content that was previously scraped from a website from ChatGPT's training data.
The Internet provides most of the training data for large language models such as OpenAI's GPT models and Google's Bard, and as noted above, obtaining that data for AI training has become increasingly controversial.
Source: compiled from MyDrivers and ITHome.