Is your website used by Google for AI training?

Original link: https://seo.g2soft.net/2023/04/23/google-ai-C4-discuss.html

This came to mind after I read Zac’s article “Has your website content been used for AI training? Do you want it to be?”

Basically, the AI training done by Google requires a huge corpus, and websites large and small all become its sources. I may feel differently from others about the rapid emergence and viral spread of these AI tools over the last six months. I think ordinary users should be patient while the technology is advancing this quickly: try the tools, but don’t get in too deep. Whatever people are still using after the big waves have washed through will be the tools that really improve productivity and efficiency.

In February, in the article “Grandpa try the new technology”, I introduced Midjourney and ChatGPT to my father-in-law, who found them very interesting. In March, I tried Stable Diffusion on my local computer and found it quite painful. On an ordinary personal computer it is better not to try; it is too time-consuming.

For the time being, I still think that the major Internet giants will challenge OpenAI by doing their own AI training. The article by Zac that I saw today is about Google’s AI training set.

Google uses the C4 data set, which covers a huge number of websites, although of course many more websites are not included. The Washington Post has made an interactive tool for checking whether a website was included and how much of it was used.

C4 began as a crawl performed in April 2019 by the nonprofit CommonCrawl, a well-known resource for AI models. CommonCrawl told The Post that it tries to prioritize the most important and reputable sites, but does not try to avoid licensed or copyrighted content.

According to Zac’s test, his website ranks 11,196,890 and has 280 tokens.

seozac.com rank in C4

I was also curious to see how this site fares.

seo.g2soft.net ranking in Google C4

It seems that this site, SEO website optimization and promotion, ranks a little higher, with as many as 1.9K tokens used. I am happy about that; after all, it means the content is useful.
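To make the two numbers concrete, here is a minimal sketch of how a per-site token count and rank could be derived from a crawl. It assumes that the rank is simply the site's position when all domains are sorted by token count, and it uses whitespace splitting as a crude stand-in for a real tokenizer. The function name `rank_domains` and the sample records are hypothetical; this is not the actual method behind the Washington Post tool.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_domains(records):
    """Return {domain: (rank, token_count)} for an iterable of (url, text) pairs."""
    tokens_per_domain = Counter()
    for url, text in records:
        domain = urlparse(url).netloc.lower()
        # Assumption: whitespace splitting approximates the tokenizer.
        tokens_per_domain[domain] += len(text.split())
    # Rank 1 goes to the domain with the most tokens; ties are broken by name.
    ordered = sorted(tokens_per_domain.items(), key=lambda kv: (-kv[1], kv[0]))
    return {
        domain: (rank, count)
        for rank, (domain, count) in enumerate(ordered, start=1)
    }


# Hypothetical sample records, not real C4 data.
sample = [
    ("https://example.com/a", "one two three four"),
    ("https://example.org/b", "one two"),
    ("https://example.com/c", "five six"),
]
ranking = rank_domains(sample)
print(ranking["example.com"])
```

Under this assumption, a site with more text in the crawl gets a better (smaller) rank number, which matches the pattern in the two results above.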

I also looked at how another major English blog is used in the Google C4 dataset.

yinfor.com Rank in Google C4 dataset

It seems that more attention is paid to English websites.

After looking at more websites, I feel that C4 places more emphasis on credibility. The Rank here can be seen as another Google PageRank, or SiteRank.

I don’t think it’s a big problem for Google to use the data of these websites for AI training. But if the training results are offered as a service and Google profits from them, then these websites have the right to demand something in return, at least credit or links.

Maybe it’s time to rewrite the website’s copyright notice.

Check to see if your website is in the Google C4 dataset.
