aidatahub.io guide

Open Datasets for LLM Training

Training data quality is still the strongest driver of downstream model quality and specialization.

Who this guide is for

  • You are building a domain-specific model
  • You need clean multilingual corpora
  • You need benchmark-aligned evaluation sets

Breakdown

FineWeb

FineWeb is a 15-trillion-token English web dataset curated by Hugging Face from Common Crawl, designed for high-quality LLM pretraining.

  • Provider: HuggingFace
  • Size: 15T tokens
  • License: ODC-By 1.0

RedPajama v2

RedPajama V2 is a 30-trillion-token open dataset for LLM pretraining, containing web documents in five European languages with quality annotations.

  • Provider: Together AI
  • Size: 100B+ documents (30T+ tokens, 20B deduplicated)
  • License: CC-BY 4.0

The Stack v2

The Stack v2 is a large dataset of permissively licensed source code in 600+ programming languages, created by BigCode for training code models.

  • Provider: BigCode / HuggingFace
  • Size: 67.5TB
  • License: Mixed per-repo

Dolma

Dolma is a 3-trillion-token open English corpus from AI2, combining web, academic, code, and social media text for LLM pretraining.

  • Provider: AI2 (Allen Institute)
  • Size: 3T tokens (v1.5), 4.5TB (v1.7 gzip)
  • License: ODC-By 1.0

Cosmopedia

Cosmopedia is a 25-billion-token synthetic English dataset of textbooks and educational content generated by Mixtral for LLM pretraining.

  • Provider: HuggingFace
  • Size: 25B tokens
  • License: Apache 2.0

Comparison table

Item         | Type    | Category | Size
FineWeb      | dataset | text     | 15T tokens
RedPajama v2 | dataset | text     | 100B+ documents (30T+ tokens, 20B deduplicated)
The Stack v2 | dataset | code     | 67.5TB
Dolma        | dataset | text     | 3T tokens (v1.5); 4.5TB gzip (v1.7)
Cosmopedia   | dataset | text     | 25B tokens

How to choose

  • Coverage: breadth of topics and language support
  • License fit: commercial viability and legal clarity
  • Data quality: noise level and deduplication quality
  • Pipeline readiness: format and ease of ingestion for training
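The quality and deduplication criteria above can be sketched as a minimal filtering pass. This is an illustrative sketch, not any provider's actual pipeline; the thresholds and the `keep_document` helper are assumptions chosen for the example (real curation pipelines such as FineWeb's use far more elaborate heuristics and fuzzy deduplication).

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical copies hash alike."""
    return " ".join(text.lower().split())

def keep_document(text: str, min_words: int = 5, max_symbol_ratio: float = 0.3) -> bool:
    """Crude quality gate: drop very short docs and symbol-heavy noise."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / max(len(text), 1) <= max_symbol_ratio

def filter_and_dedup(docs):
    """Yield docs that pass the quality gate, keeping the first copy of each."""
    seen = set()
    for doc in docs:
        if not keep_document(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick brown fox jumps over the lazy dog.",  # near-duplicate, removed
    "ok",                                            # too short, removed
    "Training data quality drives downstream model quality.",
]
print(list(filter_and_dedup(docs)))  # keeps the first and last documents
```

Exact-hash deduplication like this only catches verbatim repeats; large corpora typically add MinHash or similar fuzzy matching on top.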

Our verdict

FineWeb and Dolma are strong text baselines; pair with task-specific datasets for robust fine-tuning outcomes.

FAQ

Which dataset is best for general pretraining?

FineWeb, RedPajama, and Dolma are common modern starting points.

Can all datasets be used commercially?

No. Licensing varies by dataset (The Stack v2, for example, is licensed per repository). Always validate license and provider restrictions before use.

Should I use one large dataset or mix many?

Most high-quality pipelines mix broad corpora with domain-specific datasets.
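Mixing broad and domain corpora usually comes down to sampling ratios. Below is a minimal sketch assuming in-memory document lists and illustrative weights; the `mix_sources` helper and the 80/20 split are assumptions for the example, and real pipelines stream data and typically weight by tokens rather than documents.

```python
import random

def mix_sources(sources: dict, weights: dict, n: int, seed: int = 0) -> list:
    """Sample n documents, choosing each document's source by the given weights."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    cursors = {name: 0 for name in names}
    mixed = []
    for _ in range(n):
        name = rng.choices(names, weights=probs)[0]
        docs = sources[name]
        mixed.append(docs[cursors[name] % len(docs)])  # wrap around small sources
        cursors[name] += 1
    return mixed

sources = {
    "web": [f"web-doc-{i}" for i in range(100)],
    "domain": [f"domain-doc-{i}" for i in range(10)],
}
weights = {"web": 0.8, "domain": 0.2}  # illustrative ratio, not a recommendation
batch = mix_sources(sources, weights, n=50)
print(sum(doc.startswith("web") for doc in batch), "web docs of", len(batch))
```

Note the wrap-around on small sources: upweighting a tiny domain corpus effectively repeats it, which is why mixing ratios and deduplication need to be considered together.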