New Croissant metadata format helps standardise ML datasets

We’re excited to share the release of Croissant 🥐, a metadata format to help standardise machine learning datasets. Croissant aims to enhance the discoverability and usability of datasets across various tools and platforms, making them more accessible to everyone. Today’s release includes the format documentation, an open source library, and a visual editor, with industry support from Hugging Face, Google Dataset Search, Kaggle, and OpenML, amongst others.

The problem

Data serves as the foundation for all AI and ML models, yet there is no standardised way to organise and structure the data and files that make up each dataset. As a result, finding, understanding, and using ML datasets is often cumbersome and slow.

Solution

A key objective of Croissant is to enhance the accessibility and findability of data. By extending the schema.org vocabulary, a machine-readable standard for describing structured data utilized by over 40 million datasets on the web, Croissant enables datasets to be easily found through search engines like Google Dataset Search.

Benefits

Croissant is easy to adopt because it doesn’t require changing the data itself or how it is represented. Instead, it adds a layer of metadata that represents the contents of the dataset in a standardized way, describing key attributes and properties.
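To make the idea of a metadata layer concrete, here is a minimal sketch of what such a description might look like, built as a Python dictionary and serialised to JSON. The dataset name, URLs, and fields are hypothetical, the context is abbreviated, and the structure is a simplified approximation; the format documentation remains the authoritative reference for required properties.

```python
import json

# Hypothetical dataset: every name and URL below is illustrative only.
metadata = {
    # Abbreviated context; a real Croissant file declares more terms.
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "sc:Dataset",
    "name": "toy_flowers",
    "description": "A hypothetical table of flower measurements.",
    "url": "https://example.org/toy_flowers",
    # The data file itself is untouched; the metadata only points to it.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "flowers.csv",
            "contentUrl": "https://example.org/toy_flowers/flowers.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # A record set describes how the file's columns map to typed fields.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "measurements",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "measurements/species",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "flowers.csv"},
                        "extract": {"column": "Species"},
                    },
                },
                {
                    "@type": "cr:Field",
                    "@id": "measurements/petal_length",
                    "dataType": "sc:Float",
                    "source": {
                        "fileObject": {"@id": "flowers.csv"},
                        "extract": {"column": "Petal.Length"},
                    },
                },
            ],
        }
    ],
}

croissant_json = json.dumps(metadata, indent=2)
print(croissant_json)
```

Note that nothing in this sketch modifies flowers.csv: the description sits alongside the data and can be added to a dataset that is already published.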

Croissant enables datasets to be loaded into different ML platforms without the need for reformatting. Popular ML frameworks like TensorFlow, JAX and PyTorch can already load Croissant datasets via the TensorFlow Datasets library. Additionally, because Croissant provides operationalized documentation, users can easily understand the best practices for contributing to and using the data.
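The toy sketch below illustrates why a metadata layer removes the need for reformatting: a generic loader reads the file as published and relies on the metadata to pick and type the columns. This is not the Croissant library or the TensorFlow Datasets API; the `record_set` structure and the `iter_records` helper are hypothetical simplifications written for this example.

```python
import csv
import io

# Hypothetical, simplified metadata: each field names the CSV column it
# reads from and the Python type to convert the value to.
record_set = {
    "name": "measurements",
    "fields": [
        {"name": "species", "column": "Species", "type": str},
        {"name": "petal_length", "column": "Petal.Length", "type": float},
    ],
}

# The raw file is left exactly as published, including a column the
# metadata does not describe.
raw_csv = "Species,Petal.Length,Notes\nsetosa,1.4,ok\nvirginica,5.1,ok\n"


def iter_records(record_set, csv_text):
    """Yield one dict per row, keeping only the fields the metadata describes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield {
            field["name"]: field["type"](row[field["column"]])
            for field in record_set["fields"]
        }


records = list(iter_records(record_set, raw_csv))
print(records)
```

Because the loader depends only on the metadata, the same function would work for any CSV file that comes with a description of this shape, which is the property that lets different tools consume one dataset without each needing a custom converter.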

Users looking to publish a dataset in the Croissant format benefit from the Croissant editor, which lets them easily inspect, create, or modify Croissant descriptions for their datasets.

Croissant working group co-chairs speak

"Data is a critical element of any model's performance, and as some experts suggest it will run out, making the need to harness it even more important. Croissant allows more people to do more with data. As co-chair of the working group, it is a privilege to collaborate with world-class machine learning scientists and engineers around the globe making an enormous contribution to the AI data ecosystem."

  • Elena Simperl, professor of Computer Science at King’s College London and Croissant working group co-chair

"The development of Croissant was grounded in the needs of ML practitioners, and the technical requirements of ML tools, platforms, and datasets. Our goal with Croissant is to unlock real value for users by enabling the tools they use to work seamlessly together, while keeping the format as simple and intuitive as possible."

  • Omar Benjelloun, software engineer at Google and Croissant working group co-chair

Let's improve ML/AI work together

We invite dataset creators to supply descriptions in Croissant format, and we invite those hosting datasets to make Croissant files available for download and to integrate Croissant metadata into their dataset pages. Additionally, developers of tools for data analysis and labeling are encouraged to add support for Croissant datasets, making these datasets easier to find and use.

The MLCommons Croissant working group that ‘baked’ Croissant includes talented and inspiring AI/ML experts from the following organisations:

🧩 Bayer

🧩 cTuning foundation

🧩 DANS-KNAW

🧩 Dotphoton

🧩 Google

🧩 Harvard

🧩 Hugging Face

🧩 Kaggle

🧩 King's College London

🧩 Meta

🧩 NASA

🧩 NASA IMPACT

🧩 Open Data Institute

🧩 Open University of Catalonia

🧩 Luxembourg Institute of Science and Technology

🧩 TU Eindhoven

AI/ML scientists who are passionate about standardisation are welcome to join the Croissant working group, contribute to the GitHub repository, and stay informed about the latest updates.

You can download the Croissant Editor to start implementing the Croissant vocabulary on your existing datasets today! Together, we can reduce the data development burden and enable a richer ecosystem of AI and ML research and development.