aidatahub.io guide

Open Datasets for LLM Training

Training data quality is still the strongest driver of downstream model quality and specialization.

Who this guide is for

  • You are building a domain-specific model
  • You need clean multilingual corpora
  • You need benchmark-aligned evaluation sets

Breakdown

FineWeb

FineWeb is a 15-trillion-token English web dataset curated by Hugging Face from Common Crawl, designed for high-quality LLM pretraining.

  • Provider: HuggingFace
  • Size: 15T tokens
  • License: ODC-By 1.0

RedPajama v2

RedPajama V2 is a 30-trillion-token open dataset for LLM pretraining, containing web documents in five European languages with quality annotations.

  • Provider: Together AI
  • Size: 100B+ documents (30T+ tokens, 20B deduplicated)
  • License: CC-BY 4.0

The Stack v2

The Stack v2 is a large dataset of permissively licensed source code in 600+ programming languages, created by BigCode for training code models.

  • Provider: BigCode / HuggingFace
  • Size: 67.5TB
  • License: Mixed per-repo

Dolma

Dolma is a 3-trillion-token open English corpus from AI2, combining web, academic, code, and social media text for LLM pretraining.

  • Provider: AI2 (Allen Institute)
  • Size: 3T tokens (v1.5), 4.5TB (v1.7 gzip)
  • License: ODC-By 1.0

Cosmopedia

Cosmopedia is a 25-billion-token synthetic English dataset of textbooks and educational content generated by Mixtral for LLM pretraining.

  • Provider: HuggingFace
  • Size: 25B tokens
  • License: Apache 2.0

Comparison table

Item         | Type    | Category | Size
FineWeb      | dataset | text     | 15T tokens
RedPajama v2 | dataset | text     | 100B+ documents (30T+ tokens, 20B deduplicated)
The Stack v2 | dataset | code     | 67.5TB
Dolma        | dataset | text     | 3T tokens (v1.5); 4.5TB gzip (v1.7)
Cosmopedia   | dataset | text     | 25B tokens

How to choose

  • Coverage: breadth of topics and language support
  • License fit: commercial viability and legal clarity
  • Data quality: noise level and deduplication quality
  • Pipeline readiness: format and ease of ingestion for training
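The quality and deduplication criteria above can be sketched as a minimal filtering pass. This is an illustrative sketch, not any provider's actual pipeline; the thresholds and the `keep_document` helper are assumptions chosen for the example (real curation pipelines such as FineWeb's use far more elaborate heuristics and fuzzy deduplication).

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical copies hash alike."""
    return " ".join(text.lower().split())

def keep_document(text: str, min_words: int = 5, max_symbol_ratio: float = 0.3) -> bool:
    """Crude quality gate: drop very short docs and symbol-heavy noise."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / max(len(text), 1) <= max_symbol_ratio

def filter_and_dedup(docs):
    """Yield docs that pass the quality gate, keeping the first copy of each."""
    seen = set()
    for doc in docs:
        if not keep_document(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick brown fox jumps over the lazy dog.",  # near-duplicate, removed
    "ok",                                            # too short, removed
    "Training data quality drives downstream model quality.",
]
print(list(filter_and_dedup(docs)))  # keeps the first and last documents
```

Exact-hash deduplication like this only catches verbatim repeats; large corpora typically add MinHash or similar fuzzy matching on top.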

Our verdict

FineWeb and Dolma are strong text baselines; pair with task-specific datasets for robust fine-tuning outcomes.

FAQ

Which dataset is best for general pretraining?

FineWeb, RedPajama, and Dolma are common modern starting points.

Can all datasets be used commercially?

No. Licensing varies by dataset (The Stack v2, for example, is licensed per repository). Always validate license and provider restrictions before use.

Should I use one large dataset or mix many?

Most high-quality pipelines mix broad corpora with domain-specific datasets.
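Mixing broad and domain corpora usually comes down to sampling ratios. Below is a minimal sketch assuming in-memory document lists and illustrative weights; the `mix_sources` helper and the 80/20 split are assumptions for the example, and real pipelines stream data and typically weight by tokens rather than documents.

```python
import random

def mix_sources(sources: dict, weights: dict, n: int, seed: int = 0) -> list:
    """Sample n documents, choosing each document's source by the given weights."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    cursors = {name: 0 for name in names}
    mixed = []
    for _ in range(n):
        name = rng.choices(names, weights=probs)[0]
        docs = sources[name]
        mixed.append(docs[cursors[name] % len(docs)])  # wrap around small sources
        cursors[name] += 1
    return mixed

sources = {
    "web": [f"web-doc-{i}" for i in range(100)],
    "domain": [f"domain-doc-{i}" for i in range(10)],
}
weights = {"web": 0.8, "domain": 0.2}  # illustrative ratio, not a recommendation
batch = mix_sources(sources, weights, n=50)
print(sum(doc.startswith("web") for doc in batch), "web docs of", len(batch))
```

Note the wrap-around on small sources: upweighting a tiny domain corpus effectively repeats it, which is why mixing ratios and deduplication need to be considered together.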