A key feature of the Python data ecosystem is its reliance on simple but efficient primitives that follow well-defined interfaces, which lets tools work seamlessly together (https://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask provides parallelisation of tensor access. Xarray provides metadata linking tensor dimensions. Zarr provides a missing feature: scalable, persistent storage for annotated hierarchies of tensors. Defined through a community-driven process, the Zarr specification enables the storage of large, out-of-memory datasets both locally and in the cloud. Implementations exist in C++, C, Rust, Java, JavaScript, Julia, and Python.
This talk presents a systematic approach to understanding and implementing the latest version of Zarr-Python, specifically Zarr-Python 3, by explaining the new API, deprecations, the new storage backend, and the improved codec pipeline, among other features.
Zarr is a data format for storing chunked, compressed N-dimensional arrays and is a fiscally sponsored project of NumFOCUS.
It is based on an open-source technical specification and has implementations in several languages, with Zarr-Python being the most used.
Following the successful adoption of Specification V3, our team has worked diligently over the past year to ensure the Python library's compliance with the latest specification.
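To make the "chunked, compressed" idea concrete, here is a minimal sketch using only the Python standard library. It is not Zarr's implementation or on-disk layout: the function names `write_chunked` and `read_element`, the `"i.j"` key naming, and the use of a plain dict as the store are all assumptions made for illustration. It shows a 2-D array split into fixed-size chunks, each compressed independently, so that reading one element only requires decompressing one chunk.

```python
import zlib
from array import array


def write_chunked(data, shape, chunks, store):
    """Split a row-major 2-D int array into chunks and compress each one.

    Each chunk is stored under a hypothetical "i.j" key in `store`.
    """
    rows, cols = shape
    crow, ccol = chunks
    for i in range(0, rows, crow):
        for j in range(0, cols, ccol):
            block = array("i")
            # Copy the rows of this chunk; edge chunks may be smaller.
            for r in range(i, min(i + crow, rows)):
                start = r * cols + j
                stop = r * cols + min(j + ccol, cols)
                block.extend(data[start:stop])
            store[f"{i // crow}.{j // ccol}"] = zlib.compress(block.tobytes())


def read_element(store, shape, chunks, r, c):
    """Read a single element by decompressing only the chunk that holds it."""
    _, cols = shape
    crow, ccol = chunks
    ci, cj = r // crow, c // ccol
    block = array("i")
    block.frombytes(zlib.decompress(store[f"{ci}.{cj}"]))
    # Width of this particular chunk (narrower at the right edge).
    width = min(ccol, cols - cj * ccol)
    return block[(r % crow) * width + (c % ccol)]


# Illustrative usage: a 4x6 array stored as four 2x3 chunks.
shape = (4, 6)
chunks = (2, 3)
data = array("i", range(24))
store = {}  # stands in for a directory, zip file, or object store
write_chunked(data, shape, chunks, store)
value = read_element(store, shape, chunks, 3, 4)
```

Because every chunk is an independent compressed blob behind a key, the same idea maps naturally onto key-value backends such as a file system or cloud object storage, which is what lets a chunked format scale to datasets that do not fit in memory.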
Outline
First, I'll talk about:
Introduction and Working of Zarr (10 mins.)
Then, I'll be talking about the new Zarr-Python 3 and its significant features:
What's new in Zarr-Python 3? (10 mins.)
Major design updates
New storage backend
Creating Zarr arrays and groups asynchronously
New and improved codec pipeline
GPU support for creating and writing arrays
Changes and deprecations
Overview of the new API
Optimising performance for large arrays
Deprecation of several stores like LMDBStore, SQLiteStore, MongoDBStore, etc.
Extensions
Then, I'll run a hands-on session:
Hands-on (5 mins.)
I'll close the talk with:
Conclusion (5 mins.)
This talk aims to address an audience that works with large amounts of data and is looking for a format that is transparent, open-source, reliable, cloud-optimised, and environmentally friendly.
Zarr is widely adopted across bioimaging, geospatial, genomics and research communities. The talk is highly relevant for communities or organisations dealing with large array datasets.
The tone of the talk will be informative, story-driven and fun.
Attendees should have an intermediate knowledge of Python and NumPy arrays.