AI Rookies

Common Crawl

Fact

A huge, free public archive of crawled web pages, often used to pretrain AI models.

In Plain Words

Common Crawl is a giant Roomba for the public web. It bumps around and sweeps pages into a huge dustbin.

AI teams use it to pretrain big models. But the dustbin holds more than clean text: it can also carry copyrighted work, personal data, and plenty of junk pages.

Related Concepts

Pretraining
Common Crawl data often feeds pretraining, the first stage where a model picks up general language patterns.

LLM
LLMs often learn language and common facts from huge web datasets like this.

Big Data
Common Crawl packages its web crawls as Big Data that anyone can download and reuse.

Copyright
Putting web pages into training data can raise copyright problems when those pages contain protected work.