Data
Open AI Datasets
25 widely used datasets for text, code, image, audio, video, and benchmark workflows.
| Name | Provider | Category | Size | License | Commercial | Link |
|---|---|---|---|---|---|---|
| AudioSet | Google | audio | 2,084,320 10s clips (5.8 thousand hours of audio) | CC-BY 4.0 (labels only; audio from YouTube ToS) | No | Open |
| BIG-Bench | BIG-bench Collaboration | benchmark | 200+ tasks | Apache 2.0 | Yes | Open |
| C4 (Colossal Clean Crawled Corpus) | Google Research | text | 305GB (English subset), 1.7TB (full dataset) | ODC-By 1.0 | Yes | Open |
| Common Voice | Mozilla | audio | 30K+ hours | CC0 1.0 | Yes | Open |
| Cosmopedia | HuggingFace | text | 25B tokens | Apache 2.0 | Yes | Open |
| Dolma | AI2 (Allen Institute) | text | 3T tokens (v1.5), 4.5TB (v1.7 gzip) | ODC-By 1.0 | Yes | Open |
| FineWeb | HuggingFace | text | 15T tokens | ODC-By 1.0 | Yes | Open |
| FineWeb-Edu | HuggingFace | text | 1.3T tokens | ODC-By 1.0 | Yes | Open |
| ImageNet | Stanford/Princeton | image | 14,197,122 images | Custom (research) | No | Open |
| LAION-5B | LAION | image | 5.85B img-text pairs | CC-BY 4.0 (metadata only; images carry original licenses) | No | Open |
| LibriSpeech | OpenSLR | audio | 1,000 hours | CC-BY 4.0 | Yes | Open |
| MS COCO | Microsoft | image | 330K images (>200K labeled) | CC-BY 4.0 (annotations); mixed (Flickr images) | Yes | Open |
| MS MARCO | Microsoft | text | 1M+ passages (with additional datasets: 100K QnA, 1M QnA, 180K NLG, 148K KeyPhrase, Crawling dataset, Conversational search) | Custom (non-commercial research only, no license granted) | No | Open |
| Natural Questions | Google | text | 300K+ questions | CC-BY-SA 3.0 | Yes | Open |
| Open Images V7 | Google | image | 15.8M+ images | CC-BY 4.0 | Yes | Open |
| OpenAssistant Conversations | LAION | text | 88.8k rows | Apache 2.0 | Yes | Open |
| RedPajama v2 | Together AI | text | 100B+ documents (30T+ tokens, 20B deduplicated) | CC-BY 4.0 | Yes | Open |
| SlimPajama | Cerebras | text | 627B tokens | Apache 2.0 | Yes | Open |
| SQuAD 2.0 | Stanford | text | 150K questions | CC-BY-SA 4.0 | Yes | Open |
| StarCoderData | BigCode | code | 783GB | Apache 2.0 | Yes | Open |
| The Pile | EleutherAI | text | 825 GiB | mixed (constituent data) | Yes | Open |
| The Stack v2 | BigCode / HuggingFace | code | 67.5TB | Mixed per-repo | No | Open |
| UltraChat | Tsinghua NLP | text | ~949k dialogues | MIT | Yes | Open |
| WebVid | University of Oxford | video | 10 million video-text pairs | CC-BY 4.0 (metadata); mixed (videos from web) | No | Open |
| Wikipedia Dump | Wikimedia | text | 22GB (English) | CC-BY-SA 3.0 / GFDL | Yes | Open |
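
A common first step with a catalog like this is to shortlist datasets by category and commercial-use flag. The following is a minimal sketch of that, assuming a handful of rows copied from the table with the Link column left out. The `Dataset` class and the `parse_table` and `select` helpers are hypothetical names introduced for illustration. The Commercial flag is only the table's summary, so the license text itself should still be checked before use.

```python
from dataclasses import dataclass

# A few rows copied from the table above (Link column omitted).
TABLE = """\
| Name | Provider | Category | Size | License | Commercial |
|---|---|---|---|---|---|
| Cosmopedia | HuggingFace | text | 25B tokens | Apache 2.0 | Yes |
| ImageNet | Stanford/Princeton | image | 14,197,122 images | Custom (research) | No |
| SlimPajama | Cerebras | text | 627B tokens | Apache 2.0 | Yes |
| StarCoderData | BigCode | code | 783GB | Apache 2.0 | Yes |
| The Stack v2 | BigCode / HuggingFace | code | 67.5TB | Mixed per-repo | No |
"""


@dataclass(frozen=True)
class Dataset:
    """One table row (hypothetical record type for this sketch)."""
    name: str
    provider: str
    category: str
    size: str
    license: str
    commercial: bool


def parse_table(markdown: str) -> list[Dataset]:
    """Parse a pipe-delimited markdown table into Dataset records."""
    rows = [line for line in markdown.splitlines() if line.startswith("|")]
    datasets = []
    for line in rows[2:]:  # skip the header row and the |---| separator
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        name, provider, category, size, license_, commercial = cells
        datasets.append(
            Dataset(name, provider, category, size, license_, commercial == "Yes")
        )
    return datasets


def select(datasets: list[Dataset], category: str, commercial_only: bool = True) -> list[str]:
    """Return names in a category, optionally only those flagged for commercial use."""
    return [
        d.name
        for d in datasets
        if d.category == category and (d.commercial or not commercial_only)
    ]


DATASETS = parse_table(TABLE)
print(select(DATASETS, "code"))                         # ['StarCoderData']
print(select(DATASETS, "code", commercial_only=False))  # ['StarCoderData', 'The Stack v2']
```

The parser assumes no cell contains a literal `|`, which holds for the rows shown here.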