Zarr: Cloud-Native, Chunked & Compressed N-Dimensional Arrays

Approved

Session Description

A key feature of the Python data ecosystem is its reliance on simple but efficient primitives that follow well-defined interfaces, which lets tools work together seamlessly (cf. http://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask parallelises tensor access. Xarray provides metadata linking tensor dimensions. Zarr provides a missing piece: scalable, persistent storage for annotated hierarchies of tensors. Defined through a community process, the Zarr specification enables the storage of large out-of-memory datasets locally and in the cloud. Implementations exist in C++, C, Java, JavaScript, Julia, and Python.

This talk introduces Zarr, an open-source data format for storing chunked, compressed N-dimensional arrays. It takes a systematic approach to understanding and using Zarr, showing how it works and why it is needed. Zarr is based on an open technical specification, which makes implementations across several languages possible. I’ll mainly talk about Zarr’s Python implementation and show how it interoperates with the existing libraries in the PyData stack.
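To make "chunked, compressed" concrete, here is a minimal conceptual sketch using only the Python standard library. It is not the Zarr API and does not follow the Zarr on-disk format: the function names `write_chunked` and `read_element`, the `"meta"` key, and the `"row.col"` chunk keys are all hypothetical, chosen for illustration. It assumes a 2-D array of 32-bit integers and a plain dict standing in for a key-value store such as a directory or an object-storage bucket.

```python
import json
import struct
import zlib


def write_chunked(store, data, chunk_shape):
    """Split a 2-D list of ints into compressed chunks inside a dict-like store."""
    rows, cols = len(data), len(data[0])
    crow, ccol = chunk_shape
    # Array metadata is kept as JSON next to the chunks (hypothetical key name).
    store["meta"] = json.dumps(
        {"shape": [rows, cols], "chunks": [crow, ccol], "dtype": "<i4"}
    ).encode()
    for i in range(0, rows, crow):
        for j in range(0, cols, ccol):
            # Collect the rectangular block that belongs to this chunk.
            block = [data[r][j:j + ccol] for r in range(i, min(i + crow, rows))]
            flat = [v for row in block for v in row]
            raw = struct.pack(f"<{len(flat)}i", *flat)
            # Each chunk is compressed and stored under its own key.
            store[f"{i // crow}.{j // ccol}"] = zlib.compress(raw)


def read_element(store, r, c):
    """Read one element by decompressing only the chunk that contains it."""
    meta = json.loads(store["meta"])
    _, cols = meta["shape"]
    crow, ccol = meta["chunks"]
    ci, cj = r // crow, c // ccol
    raw = zlib.decompress(store[f"{ci}.{cj}"])
    # In this simplified sketch, edge chunks are stored unpadded,
    # so they may be narrower than the nominal chunk width.
    width = min(ccol, cols - cj * ccol)
    values = struct.unpack(f"<{len(raw) // 4}i", raw)
    return values[(r % crow) * width + (c % ccol)]


store = {}
data = [[r * 10 + c for c in range(5)] for r in range(4)]
write_chunked(store, data, chunk_shape=(2, 2))
value = read_element(store, 3, 4)  # touches a single chunk only
```

Because every chunk lives under its own key, a reader fetches and decompresses only the chunks it needs, which is the property that makes this layout suit object storage and parallel access.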

I will also briefly discuss the evolution of Zarr: the development of the Zarr Enhancement Proposal (ZEP) process and its use to define the next major version of the specification (V3), as well as the uptake of the format across the research landscape.

Key Takeaways

None

Reviews

Approvability: 100 %

Approvals: 2

Rejections: 0

Not Sure: 0

The proposal is accepted from my side because it is descriptive enough, has all the required content, and the speaker is part of the OSS project.

Reviewer #1

Approved

An introductory talk about Zarr will be interesting, but in my personal opinion so would a talk that goes into the details of how the technical specification is put together, enabling implementations in different languages. The Zarr Enhancement Proposal (ZEP) process is also worth an entire talk, in my humble opinion, as it highlights the value of a slow but methodical process for improving a library that is used directly by thousands (maybe tens of thousands) of projects. Overall, the proposal is good to go.

Reviewer #2

Approved

Zarr: Cloud-Native, Chunked & Compressed N-Dimensional Arrays

Sanket Verma