Zarr: Cloud-optimised, N-dimensional, typed array storage

Review Pending

Session Description

A key feature of the Python data ecosystem is its reliance on simple but efficient primitives that follow well-defined interfaces so that tools work seamlessly together (https://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask provides parallelisation of tensor access. Xarray provides metadata linking tensor dimensions. Zarr provides a missing feature: scalable, persistent storage for annotated hierarchies of tensors. Defined through a community-driven process, the Zarr specification enables the storage of large, out-of-memory datasets both locally and in the cloud. Implementations exist in C++, C, Rust, Java, JavaScript, Julia, and Python.

This talk presents a systematic approach to understanding and adopting Zarr-Python 3, the latest version of Zarr-Python, by explaining the new API, deprecations, the new storage backend, and the improved codec pipeline, among other features.

Zarr is a data format for storing chunked, compressed N-dimensional arrays and is a fiscally sponsored project of NumFOCUS.

It is based on an open-source technical specification and has implementations in several languages, with Zarr-Python being the most used.

Following the successful adoption of Specification V3, our team has worked diligently over the past year to ensure the Python library's compliance with the latest specification.
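
To make "chunked, compressed" concrete, here is a minimal conceptual sketch using only the Python standard library. It is not how Zarr-Python is implemented: the helper names (`write_chunks`, `read_value`), the use of zlib, and the handling of edge chunks are assumptions chosen for illustration, and the `c/<i>/<j>` key layout is only loosely modelled on the way Zarr names chunks. The point it shows is that each chunk is compressed on its own, so reading one element only touches one chunk.

```python
import struct
import zlib


def write_chunks(data, chunk_rows, chunk_cols):
    """Split a 2-D list of ints into chunks and compress each chunk separately."""
    n_rows, n_cols = len(data), len(data[0])
    store = {}  # hypothetical key-value store: chunk key -> compressed bytes
    for r0 in range(0, n_rows, chunk_rows):
        for c0 in range(0, n_cols, chunk_cols):
            # Flatten the block in row-major order.
            block = [
                value
                for row in data[r0:r0 + chunk_rows]
                for value in row[c0:c0 + chunk_cols]
            ]
            raw = struct.pack(f"<{len(block)}i", *block)  # little-endian int32
            key = f"c/{r0 // chunk_rows}/{c0 // chunk_cols}"
            store[key] = zlib.compress(raw)
    return store


def read_value(store, row, col, chunk_rows, chunk_cols, n_cols):
    """Read one element by decompressing only the chunk that contains it."""
    ci, cj = row // chunk_rows, col // chunk_cols
    raw = zlib.decompress(store[f"c/{ci}/{cj}"])
    values = struct.unpack(f"<{len(raw) // 4}i", raw)
    # In this sketch, edge chunks are simply narrower than chunk_cols.
    width = min(chunk_cols, n_cols - cj * chunk_cols)
    return values[(row % chunk_rows) * width + (col % chunk_cols)]


# A 4 x 5 array stored as 2 x 2 chunks.
data = [[r * 10 + c for c in range(5)] for r in range(4)]
store = write_chunks(data, chunk_rows=2, chunk_cols=2)
value = read_value(store, row=3, col=4, chunk_rows=2, chunk_cols=2, n_cols=5)
print(sorted(store), value)
```

Because every chunk is an independent entry in a key-value mapping, the same layout works on a local filesystem or an object store, and chunks can be read or written in parallel.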

Outline

First, I'll talk about:

Introduction and Working of Zarr (10 mins.)

What is Zarr, and how does it work?
- The inner workings of Zarr using illustrated graphics
What is the Zarr Specification?
- A summary of the technical specification of Zarr
- What's new in Zarr Spec V3?

Then, I'll be talking about the new Zarr-Python 3 and its significant features:

What's new in Zarr-Python 3? (10 mins.)

Major design updates
- New storage backend
- Creating Zarr arrays and groups asynchronously
- New and improved codec pipeline
- GPU support for creating and writing arrays
Changes and deprecations
- Overview of the new API
- Optimising performance for large arrays
- Deprecation of several stores like LMDBStore, SQLStore, MongoDBStore, etc.
Extensions
- How can Zarr-Python 3 be extended to add new custom data types, stores, chunking strategies, etc.?

Then, I'll run a hands-on session covering the following:

Hands-on (5 mins.)

Creating Zarr arrays and groups using Zarr-Python 3
- Plus, a walkthrough of the new features (mentioned above)
Looking under the hood
- Use the store and info functions to explain how your Zarr data is stored and display important information

I'll close the talk with:

Conclusion (5 mins.)

Key takeaways
How can you get involved?
QnA

This talk aims to address an audience that works with large amounts of data and is looking for a format that is transparent, open-source, reliable, cloud-optimised, and environmentally friendly.

Zarr is widely adopted across bioimaging, geospatial, genomics and research communities. The talk is highly relevant for communities or organisations dealing with large, high-volume array datasets.

The tone of the talk will be informative, story-driven, and fun.

Attendees need an intermediate knowledge of Python and NumPy arrays.

Key Takeaways

Understand the basics of Zarr and what's new in V3
Leverage the new functionalities of Zarr-Python 3 with improved performance
Make an informed decision on what data format to use for your data

Which track are you applying for?

FOSS in Science Devroom

Reviews

Approvability: 100 %
Approvals: 1
Rejections: 0
Not Sure: 0

Overall, the proposal looks good, but the proposer should start the talk by explicitly introducing the various array packages that are already described at the start of the proposal. Given the domains where Zarr is predominantly used, it would also be useful to cover how Zarr data can be streamed. And given that one of the focal points is Zarr version 3, the proposer should consider introducing the Zarr standardisation process, i.e. how changes can be introduced to the standard. We also expect users from non-Python backgrounds to be in the audience, so it would be great to explicitly highlight the Zarr packages from those ecosystems.

Reviewer #1

Approved