Open AI Datasets

25 widely used datasets for text, code, image, audio, video, and benchmark workflows.

| Name | Provider | Category | Size | License | Commercial use |
| --- | --- | --- | --- | --- | --- |
| AudioSet | Google | audio | 2,084,320 10-second clips (about 5,800 hours of audio) | CC-BY 4.0 (labels only; audio subject to YouTube ToS) | No |
| BIG-Bench | BIG-bench Collaboration | benchmark | 200+ tasks | Apache 2.0 | Yes |
| C4 (Colossal Clean Crawled Corpus) | Google Research | text | 305GB (English subset); 1.7TB (full dataset) | ODC-By 1.0 | Yes |
| Common Voice | Mozilla | audio | 30K+ hours | CC0 | Yes |
| Cosmopedia | Hugging Face | text | 25B tokens | Apache 2.0 | Yes |
| Dolma | AI2 (Allen Institute for AI) | text | 3T tokens (v1.5); 4.5TB gzipped (v1.7) | ODC-By 1.0 | Yes |
| FineWeb | Hugging Face | text | 15T tokens | ODC-By 1.0 | Yes |
| FineWeb-Edu | Hugging Face | text | 1.3T tokens | ODC-By 1.0 | Yes |
| ImageNet | Stanford/Princeton | image | 14,197,122 images | Custom (research) | No |
| LAION-5B | LAION | image | 5.85B image-text pairs | CC-BY 4.0 (metadata only; images carry original licenses) | No |
| LibriSpeech | OpenSLR | audio | 1,000 hours | CC-BY 4.0 | Yes |
| MS COCO | Microsoft | image | 330K images (>200K labeled) | CC-BY 4.0 (annotations); mixed (Flickr images) | Yes |
| MS MARCO | Microsoft | text | 1M+ passages (additional datasets: 100K QnA, 1M QnA, 180K NLG, 148K KeyPhrase, crawling dataset, conversational search) | Custom (non-commercial research only; no license granted) | No |
| Natural Questions | Google | text | 300K+ questions | CC-BY-SA 3.0 | Yes |
| Open Images V7 | Google | image | ~9M images (15.8M+ bounding boxes) | CC-BY 4.0 | Yes |
| OpenAssistant Conversations | LAION | text | 88.8K rows | Apache 2.0 | Yes |
| RedPajama v2 | Together AI | text | 100B+ documents (20B after deduplication); 30T+ tokens | CC-BY 4.0 | Yes |
| SlimPajama | Cerebras | text | 627B tokens | Apache 2.0 | Yes |
| SQuAD 2.0 | Stanford | text | 150K questions | CC-BY-SA 4.0 | Yes |
| StarCoderData | BigCode | code | 783GB | Apache 2.0 | Yes |
| The Pile | EleutherAI | text | 825 GiB | Mixed (constituent data) | Yes |
| The Stack v2 | BigCode / Hugging Face | code | 67.5TB | Mixed (per repository) | No |
| UltraChat | Tsinghua NLP | text | ~949K dialogues | MIT | Yes |
| WebVid | University of Oxford | video | 10M video-text pairs | CC-BY 4.0 (metadata); mixed (videos from web) | No |
| Wikipedia Dump | Wikimedia | text | 22GB (English) | CC-BY-SA 3.0 / GFDL | Yes |
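
One way to use the table is to shortlist datasets by category and by the commercial-use flag. The sketch below is illustrative only: the `Dataset` record, the `CATALOG` list, and the `commercially_usable` helper are hypothetical names, and the entries are a small sample copied from the table rather than the full catalog. The flag is a summary, so the actual license text should still be checked before use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Dataset:
    """One table row, reduced to the fields needed for filtering (hypothetical schema)."""
    name: str
    category: str
    commercial: bool  # value of the "Commercial" column: Yes -> True, No -> False


# A sample of rows copied from the table above, not the full catalog.
CATALOG = [
    Dataset("AudioSet", "audio", False),
    Dataset("Common Voice", "audio", True),
    Dataset("LibriSpeech", "audio", True),
    Dataset("FineWeb", "text", True),
    Dataset("MS MARCO", "text", False),
    Dataset("StarCoderData", "code", True),
    Dataset("The Stack v2", "code", False),
]


def commercially_usable(catalog: List[Dataset], category: Optional[str] = None) -> List[str]:
    """Return names of datasets flagged for commercial use, optionally limited to one category."""
    return [
        d.name
        for d in catalog
        if d.commercial and (category is None or d.category == category)
    ]


if __name__ == "__main__":
    print(commercially_usable(CATALOG, "audio"))
```

Calling `commercially_usable(CATALOG)` with no category returns every sampled dataset whose commercial flag is set, in table order.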