FineWeb
FineWeb is a 15-trillion-token English web dataset curated by HuggingFace from CommonCrawl, designed for high-quality LLM pretraining.
- Provider: HuggingFace
- Size: 15T tokens
- License: ODC-By 1.0
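For a quick look at the data itself, the sketch below streams a few FineWeb documents with the Hugging Face `datasets` library. The repository id `HuggingFaceFW/fineweb`, the `sample-10BT` config, and the `text` field are assumptions based on the public release, so confirm them on the dataset card before relying on them.

```python
# Minimal sketch: stream a handful of FineWeb records without downloading 15T tokens.
from datasets import load_dataset

fineweb = load_dataset(
    "HuggingFaceFW/fineweb",   # assumed repository id
    name="sample-10BT",        # assumed small sample config; the full corpus is far larger
    split="train",
    streaming=True,            # stream records instead of materializing the corpus on disk
)

for i, record in enumerate(fineweb):
    print(record["text"][:200])   # "text" field assumed from the dataset card
    if i >= 2:
        break
```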
Training data quality is still the strongest driver of downstream model quality and specialization.
RedPajama V2 is a 30-trillion-token open dataset for LLM pretraining, containing web data in 5 European languages with quality annotations.
The Stack v2 is a large dataset of permissively licensed source code in 600+ programming languages, created by BigCode for training code models.
Dolma is a 3-trillion-token open English corpus from AI2, combining web, academic, code, and social media text for LLM pretraining.
Cosmopedia is a 25-billion-token synthetic English dataset of textbooks and educational content generated by Mixtral for LLM pretraining.
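As a lighter-weight entry point among the corpora above, the sketch below streams a small Cosmopedia sample and reports a rough average document length. The repo id `HuggingFaceTB/cosmopedia`, the `stories` subset, and the `text` field are assumptions; check the dataset card for the actual configs.

```python
# Minimal sketch: peek at Cosmopedia, the smallest corpus listed here,
# and estimate average document length over a tiny sample.
from itertools import islice

from datasets import load_dataset

cosmo = load_dataset(
    "HuggingFaceTB/cosmopedia",  # assumed repository id
    name="stories",              # assumed subset; Cosmopedia ships several
    split="train",
    streaming=True,
)

sample = list(islice(cosmo, 100))
avg_chars = sum(len(r["text"]) for r in sample) / len(sample)  # "text" field assumed
print(f"sampled {len(sample)} docs, avg length ~{avg_chars:.0f} characters")
```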
| Item | Type | Category | Key Metric | Access |
|---|---|---|---|---|
| FineWeb | dataset | text | Size: 15T tokens | External source |
| RedPajama V2 | dataset | text | Size: 100B+ documents (30T+ tokens; 20B deduplicated documents) | External source |
| The Stack v2 | dataset | code | Size: 67.5TB | External source |
| Dolma | dataset | text | Size: 3T tokens (v1.5), 4.5TB (v1.7 gzip) | External source |
| Cosmopedia | dataset | text | Size: 25B tokens | External source |
FineWeb and Dolma are strong text baselines; pair them with task-specific datasets for robust fine-tuning outcomes.
FineWeb, RedPajama, and Dolma are common modern starting points.
Open availability does not mean unrestricted use: always validate license and provider restrictions before training on any of these datasets.
Most high-quality pipelines mix broad corpora with domain-specific datasets.
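Below is a minimal sketch of that mixing pattern using weighted interleaving of two streamed corpora. The repository ids, config names, and the 90/10 ratio are illustrative assumptions, not recommendations from this guide.

```python
# Minimal sketch: mix a broad web corpus with a more specialized one via
# weighted interleaving of streaming datasets.
from datasets import interleave_datasets, load_dataset

web = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                   split="train", streaming=True)   # assumed repo id and config
edu = load_dataset("HuggingFaceTB/cosmopedia", name="stories",
                   split="train", streaming=True)   # assumed repo id and subset

# Reduce both streams to a shared "text" column so the schemas line up.
web = web.select_columns(["text"])
edu = edu.select_columns(["text"])

mixed = interleave_datasets(
    [web, edu],
    probabilities=[0.9, 0.1],  # illustrative broad/specialized ratio
    seed=42,
)

for i, record in enumerate(mixed):
    print(record["text"][:80].replace("\n", " "))
    if i >= 4:
        break
```

Weighted interleaving keeps the broad corpus dominant while injecting specialized data at a controlled rate; in practice the ratio is tuned per target task.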