A huge public web-page dataset often used to pretrain AI models.
Common Crawl is a giant Roomba for the public web. It bumps around and sweeps pages into a huge dustbin.
AI teams use it to pretrain big models. It can also bring trouble: copyright fights, leaked personal information, and junk text.
Pretraining
Common Crawl often feeds pretraining, the first training stage, where models pick up general language patterns before any fine-tuning.
LLM
LLMs often learn language and common facts from huge web datasets like this.
Big Data
Common Crawl packages the results of web crawling into downloadable, reusable Big Data.
Copyright
Putting web pages into training data can cross copyright lines, because many of those pages are copyrighted works.