Talk
Intermediate

DCLM: Forging a Transparent Future with an Open Dataset for Language Models

Review Pending

The remarkable capabilities of leading Large Language Models (LLMs) such as ChatGPT, Gemini, and Claude often lie in vast yet opaque and proprietary datasets, hindering reproducibility and broader community-driven innovation. It is not just proprietary models: frontier open-source models such as Llama, Qwen, DeepSeek, and Gemma also do not disclose the secret sauce (pre-training dataset and code), which hinders further development and the study of the role pre-training datasets play in building frontier models. Given the high cost of training language models, it is also paramount to understand how training recipes (data + strategy) perform across different compute and data scales.

In this talk, I will share my insights on the necessity of open pre-training datasets for building efficient foundation models and introduce DataComp for Language Models (DCLM), a groundbreaking initiative addressing this challenge by bringing researchers together to study the massive 240-trillion-token DCLM-Pool dataset (the largest public corpus for language model training), with the goal of improving data curation for language models.

Under DCLM, researchers conducted 400+ baseline experiments with different training sets and compute scales to identify the key components of effective data curation. We will delve into the results of these experiments and into the details of the meticulous construction of the DCLM-Baseline dataset (a state-of-the-art 2.6-trillion-token public training set), built with an innovative multi-stage filtering pipeline, including model-based scoring and semantic deduplication, designed to maximize data quality and diversity while ensuring full transparency. A 7B-parameter model pre-trained on this dataset achieves 64% on MMLU, which is state-of-the-art among open-data models of similar size.
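To make the model-based scoring step concrete, here is a minimal illustrative sketch (not the official DCLM pipeline) of how a fastText quality classifier could be used to keep only the highest-scoring web documents; the model path, label name, and keep-percentile are placeholder assumptions, not values from the paper.

    # Illustrative sketch, not the official DCLM code: score web documents
    # with a fastText quality classifier and keep the top-scoring fraction.
    import fasttext
    import numpy as np

    QUALITY_LABEL = "__label__hq"   # assumed positive-class label
    KEEP_PERCENTILE = 90            # assumption: keep roughly the top 10%

    def quality_score(model, text: str) -> float:
        # fastText expects single-line input, so collapse newlines first.
        labels, probs = model.predict(text.replace("\n", " "), k=2)
        return dict(zip(labels, probs)).get(QUALITY_LABEL, 0.0)

    def filter_documents(docs: list[str], model_path: str) -> list[str]:
        model = fasttext.load_model(model_path)  # hypothetical classifier file
        scores = np.array([quality_score(model, d) for d in docs])
        threshold = np.percentile(scores, KEEP_PERCENTILE)
        return [d for d, s in zip(docs, scores) if s >= threshold]

The actual scoring, deduplication, and threshold choices used for DCLM-Baseline are documented in the released code and paper; the sketch above only shows the general shape of a model-based quality filter.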

This talk will highlight DCLM's role in fostering a more open, collaborative, and scientifically rigorous future for language model research, directly contributing to the principles of the Open Data track.

Attendees will gain insights into:

  1. The critical need for open and reproducible datasets in advancing LLM research.

  2. The methodologies behind creating high-quality, large-scale text corpora from noisy web data.

  3. The performance and characteristics of models trained on DCLM compared to existing benchmarks.

  4. How the DCLM ecosystem (testbed, dataset, code, models) empowers researchers to transparently study data and model interactions and democratizes access to state-of-the-art LLM development.

Which track are you applying for?
Knowledge Commons (Open Hardware, Open Science, Open Data etc.)
Open Data Devroom

Approvability: 100%
Approvals: 1
Rejections: 0
Not Sure: 0

+1. We have very few full-talk slots, and I am not yet 100% sure about the length. But this will make a good addition to the devroom.

Reviewer #1
Approved