Open AI Datasets

25 widely used datasets for text, code, image, audio, video, and benchmark workflows.

Name	Provider	Category	Size	License	Commercial	Link
AudioSet	Google	audio	2,084,320 10-second clips (~5,800 hours of audio)	CC-BY 4.0 (labels only; audio subject to YouTube ToS)	No	Open
BIG-Bench	BIG-bench Collaboration	benchmark	200+ tasks	Apache 2.0	Yes	Open
C4 (Colossal Clean Crawled Corpus)	Google Research	text	305GB (English subset), 1.7TB (full dataset)	ODC-By 1.0	Yes	Open
Common Voice	Mozilla	audio	30K+ hours	CC0	Yes	Open
Cosmopedia	HuggingFace	text	25B tokens	Apache 2.0	Yes	Open
Dolma	AI2 (Allen Institute)	text	4.5TB (v1.7, gzip), 5.4TB (v1.6), 6.4TB (v1.5), 6.0TB (v1); samples: 16.4GB (v1.6-sample), 2.9TB (v1.5-sample)	ODC-By 1.0	Yes	Open
FineWeb	HuggingFace	text	15T tokens	ODC-By 1.0	Yes	Open
FineWeb-Edu	HuggingFace	text	1.3T tokens	ODC-By 1.0	Yes	Open
ImageNet	Stanford/Princeton	image	14,197,122 images	Custom (non-commercial research and educational use)	No	Open
LAION-5B	LAION	image	5.85B img-text pairs	CC-BY 4.0 (metadata only; images carry original licenses)	No	Open
LibriSpeech	OpenSLR	audio	1,000 hours	CC-BY 4.0	Yes	Open
MS COCO	Microsoft	image	330K images (>200K labeled)	CC-BY 4.0 (annotations); mixed (Flickr images)	Yes	Open
MS MARCO	Microsoft	text	1M+ passages; sub-datasets: QnA (100K and 1M versions), NLG (180K), KeyPhrase (148K), crawling, conversational search	Custom (non-commercial research only; no license granted)	No	Open
Natural Questions	Google	text	300K+ questions	CC-BY-SA 3.0	Yes	Open
Open Images V7	Google	image	15.8M+ images	CC-BY 4.0	Yes	Open
OpenAssistant Conversations	LAION	text	88.8K rows	Apache 2.0	Yes	Open
RedPajama v2	Together AI	text	100B+ documents (20B deduplicated documents, 30T+ tokens)	CC-BY 4.0	Yes	Open
SlimPajama	Cerebras	text	627B tokens	Apache 2.0	Yes	Open
SQuAD 2.0	Stanford	text	150K questions	CC-BY-SA 4.0	Yes	Open
StarCoderData	BigCode	code	783GB	Apache 2.0	Yes	Open
The Pile	EleutherAI	text	825 GiB	mixed (constituent data)	Yes	Open
The Stack v2	BigCode / HuggingFace	code	67.5TB	Mixed per-repo	No	Open
UltraChat	Tsinghua NLP	text	~949K dialogues	MIT	Yes	Open
WebVid	University of Oxford	video	10 million video-text pairs	CC-BY 4.0 (metadata); mixed (videos from web)	No	Open
Wikipedia Dump	Wikimedia	text	22GB (English)	CC-BY-SA 3.0 and GFDL	Yes	Open
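
The table can also be filtered programmatically, for example by category and by whether commercial use is marked as allowed. The following is a minimal sketch, assuming the rows are available as tab-separated text with the column names used above. Only five rows are copied in, the Link column is left out, and the helper names `load_rows` and `filter_rows` are hypothetical and not part of any published tool.

```python
import csv
import io

# A handful of rows copied from the table above (tab-separated).
# The Link column is omitted here; this sample is illustrative only.
TABLE = (
    "Name\tProvider\tCategory\tSize\tLicense\tCommercial\n"
    "BIG-Bench\tBIG-bench Collaboration\tbenchmark\t200+ tasks\tApache 2.0\tYes\n"
    "Open Images V7\tGoogle\timage\t15.8M+ images\tCC-BY 4.0\tYes\n"
    "SlimPajama\tCerebras\ttext\t627B tokens\tApache 2.0\tYes\n"
    "SQuAD 2.0\tStanford\ttext\t150K questions\tCC-BY-SA 4.0\tYes\n"
    "The Stack v2\tBigCode / HuggingFace\tcode\t67.5TB\tMixed per-repo\tNo\n"
)


def load_rows(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def filter_rows(rows, category=None, commercial=None):
    """Keep rows matching the given category and/or commercial flag.

    A filter set to None is not applied.
    """
    selected = []
    for row in rows:
        if category is not None and row["Category"] != category:
            continue
        if commercial is not None and (row["Commercial"] == "Yes") != commercial:
            continue
        selected.append(row)
    return selected


rows = load_rows(TABLE)
for row in filter_rows(rows, category="text", commercial=True):
    print(row["Name"], "|", row["License"])
```

The Commercial column is only a summary. Entries with mixed or metadata-only licenses still need a check against the dataset's own terms before use.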