About Scientific Python
Python has been helping scientists and researchers with their work for more than a decade. There are several reasons why Python is so popular for research. First, it is an easy language to learn. The syntax is simple and straightforward, making it accessible to beginners. Second, Python is versatile. It can be used for a wide range of tasks, including web development, data science, machine learning, and scientific computing. Third, Python has a large and active community of developers. This means that there are many resources available to help you learn Python and solve problems.
Another reason for Python's popularity in scientific computing research is its rich set of numerical computing libraries.
The Problem with Arrays and Array Libraries
The Scientific Python ecosystem is incredibly powerful but historically fragmented. Core libraries like NumPy, PyTorch, JAX, TensorFlow, CuPy, and Dask all implement their own array types with differing APIs and semantics. While each was developed with specific goals in mind—performance, autodiff, GPU support, etc.—this divergence has led to a lack of coherence when building tools or applications meant to span multiple libraries.
What does that mean? To understand this, let's first divide array libraries into two types: array producer libraries and array consumer libraries.
Array producer libraries are the ones that create or expose new kinds of arrays. Examples: NumPy, JAX, PyData/Sparse, CuPy, etc. Array consumer libraries are the ones that accept arrays as input but don't care where they came from. Examples: scikit-learn, SciPy, Matplotlib.
If a consumer library wants to support multiple producer libraries, it runs into a problem. Why? Simply because different producer libraries define their array objects differently, according to their own needs and design semantics. For example, a library built on NumPy arrays is restricted to the CPU; for GPU support we may want to use a library like CuPy, and to run in a distributed environment we would need support for Dask arrays.
It would be much simpler if there were a standard API across all libraries, making it easier and more maintainable for consumer libraries to support them. This is where the Array API Standard comes into the picture.
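A minimal sketch of this problem, using toy stand-ins rather than the real libraries. The two hypothetical namespaces below mimic the naming split between NumPy's `concatenate(..., axis=...)` and PyTorch's `cat(..., dim=...)`; `numpy_like`, `torch_like`, `_join`, and `consumer_concat` are illustrative names, not real APIs.

```python
from types import SimpleNamespace


def _join(rows_a, rows_b, along):
    """Join two 2-D nested lists along dimension 0 (rows) or 1 (columns)."""
    if along == 0:
        return rows_a + rows_b
    return [ra + rb for ra, rb in zip(rows_a, rows_b)]


# Toy stand-ins (hypothetical): same operation, different name and keyword.
numpy_like = SimpleNamespace(
    name="numpy_like",
    concatenate=lambda arrays, axis=0: _join(arrays[0], arrays[1], axis),
)
torch_like = SimpleNamespace(
    name="torch_like",
    cat=lambda tensors, dim=0: _join(tensors[0], tensors[1], dim),
)


def consumer_concat(backend, a, b, axis=0):
    """A consumer library has to special-case every producer it supports."""
    if backend.name == "numpy_like":
        return backend.concatenate([a, b], axis=axis)
    if backend.name == "torch_like":
        return backend.cat([a, b], dim=axis)
    raise TypeError(f"unsupported backend: {backend.name}")


a, b = [[1, 2]], [[3, 4]]
result_np = consumer_concat(numpy_like, a, b, axis=0)
result_torch = consumer_concat(torch_like, a, b, axis=1)
```

Every additional producer adds another branch to `consumer_concat`, and the same branching would be repeated for every operation the consumer library uses.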
The Array-API-Standards
The initiative began under the Consortium for Python Data API Standards, with the goal of defining a minimal, consistent, framework-agnostic standard API for arrays, based on the common subset of operations used by libraries. The first stable release of the Array API Standard was published in 2022, with the latest release in 2024.
With this standard, behavior is uniform across backends, so a consumer library can be made backend-agnostic.
Producer libraries, in turn, have to adhere to the Array API Standard. That means ensuring that their array objects expose the required attributes (.shape, .dtype, .ndim, .device, etc.), that their namespace provides the standard functions (add, reshape, matmul, mean, astype, etc.), and that behavior is consistent (broadcasting, indexing, casting, etc.).
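The two sides can be sketched together as follows. This is a toy illustration, not how any real library implements the standard: `ToyArray`, `toy_xp`, and `center` are hypothetical names, and only a tiny slice of the required surface is covered. The one real piece is the `__array_namespace__` method, which the standard defines as the way to get from an array object to the namespace holding its functions.

```python
from types import SimpleNamespace


class ToyArray:
    """Hypothetical 0-D/1-D float array backed by a Python list."""

    def __init__(self, values):
        # A bare number is stored as a 0-D array, a sequence as a 1-D array.
        self._scalar = not isinstance(values, (list, tuple))
        self._data = [float(values)] if self._scalar else [float(v) for v in values]

    @property
    def shape(self):
        return () if self._scalar else (len(self._data),)

    @property
    def ndim(self):
        return len(self.shape)

    @property
    def dtype(self):
        return "float64"  # toy placeholder; real libraries return dtype objects

    def __array_namespace__(self, api_version=None):
        # Entry point defined by the standard: returns the module-like
        # object that holds the standard functions for this array type.
        return toy_xp

    def tolist(self):
        # Demo helper only; not part of the standard.
        return self._data[0] if self._scalar else list(self._data)


def _mean(x):
    return ToyArray(sum(x._data) / len(x._data))


def _subtract(x1, x2):
    # Minimal broadcasting: a 0-D operand is stretched to match the other.
    n = max(len(x1._data), len(x2._data))
    a = x1._data * n if x1.ndim == 0 else x1._data
    b = x2._data * n if x2.ndim == 0 else x2._data
    out = [p - q for p, q in zip(a, b)]
    return ToyArray(out[0]) if x1.ndim == 0 and x2.ndim == 0 else ToyArray(out)


# Producer side: the namespace of standard-named functions.
toy_xp = SimpleNamespace(mean=_mean, subtract=_subtract)


def center(x):
    """Consumer side: never names a specific array library."""
    xp = x.__array_namespace__()
    return xp.subtract(x, xp.mean(x))


centered = center(ToyArray([1.0, 2.0, 3.0]))
```

Because `center` only relies on `__array_namespace__`, `mean`, and `subtract`, it should also accept arrays from any library that exposes those names, for example a NumPy 2.x `ndarray`; that case is not exercised here.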
How the Standards have helped the community
- Scikit-learn on GPUs with Array API
- PyTorch added torch._refs, an internal namespace that implements the Array API Standard.
- Thanks to CuPy's cupy.array_api namespace, libraries like SciPy and future tools can use GPU acceleration transparently by supporting the Array API.
- JAX implements its own jax.experimental.array_api module.
- array-api-compat: this third-party compatibility layer wraps existing backends (NumPy, PyTorch, JAX, etc.) to expose the standardized API.
Talk Outline
Here is a tentative outline of my talk.
Introduction to Scientific Python (2-3 min)
- What is Scientific Python.
- How you can contribute and related communities.
Overview of Arrays (5 min)
- What are Producer and Consumer libraries.
- Some examples of how different libraries define arrays and how they differ from each other.
- Problem faced by having multiple array providers.
Array-API-Standards (10 min)
- What are Array-API-Standards.
- How do these standards help unite the forces? One Standard to rule them all!
- How to adhere to these standards with real world examples.
- Code snippets explaining how it helps across various libraries.
My work towards Array API compatibility in PyData/Sparse (3-5 min)
- What is PyData/Sparse
- How we implement functions to improve compatibility.
- How to bump up to Array-API-Standards v2024.12
- How can you contribute
Conclusion and AMA (2-5 min)
- How it can help if you are a library maintainer or core contributor.
- A few points to keep in mind.
- QnA
Attendees can gain insight into:
- The Scientific Python communities.
- How arrays work in different Python libraries.
- How Array-API-Standards help create uniformity across arrays.
- How to contribute to the Scientific Python ecosystem and why more people should talk about it.
Interesting talk.
Audience gets to understand:
1. What goes on under the hood.
2. Dynamic dispatching of array routines.
Good talk overall - there aren't a lot of array producer libraries out there but there definitely are a lot of array consumer libraries so highlighting what APIs consumer libraries can rely on will be valuable to the end users.
Additionally, it'll be good to also spend some time talking about other language ecosystems. Is this a problem that only Python suffers from, or is it true for other languages as well, e.g. JS?
Finally, it'll be good to introduce how the Array API Standard was codified in the first place and how the Standard evolves, in case new array producer or consumer libraries want to provide comments.