Add components and flexibility pages #131
Conversation
**Format**: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format".
Most, but not all, zarr implementations will serialize to this format.
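The key-to-path idea above can be sketched in a few lines (a hypothetical illustration — `write_key`, `read_key`, and the example key are invented for this sketch, not zarr-python APIs):

```python
from pathlib import Path

# Hypothetical sketch: the store key is used unaltered as a relative path,
# which is what yields the conventional on-disk zarr layout.
def write_key(root: str, key: str, value: bytes) -> None:
    path = Path(root) / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(value)

def read_key(root: str, key: str) -> bytes:
    return (Path(root) / key).read_bytes()
```

An object-store variant would do the same thing, using the key as an object name under a bucket prefix instead of a filesystem path.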
I feel like this needs an explicit section in the specification, even if it's pretty trivial.
Turns out it does (at least for filesystems - there's nothing for object storage). See #131 (comment) for more context.
components/index.md
Outdated
**Zarr-Python Abstract Base Classes**: Zarr-python's [`zarr.abc`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains abstract base classes enforcing a particular python realization of the specification's Key-Value Store interface, based on a `MutableMapping`-like API.
This component is concrete in the sense that it is implemented in a specific programming language, and enforces particular syntax for getting and setting values in a key-value store.
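A `MutableMapping`-flavoured store might be sketched like this (a toy in-memory example for illustration only; the real `zarr.abc` classes differ):

```python
from collections.abc import MutableMapping

class ToyMemoryStore(MutableMapping):
    """Toy key-value store exposing the MutableMapping API
    (illustrative only, not a zarr-python class)."""

    def __init__(self) -> None:
        self._data: dict = {}

    def __getitem__(self, key: str) -> bytes:
        return self._data[key]

    def __setitem__(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def __delitem__(self, key: str) -> None:
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)
```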
Feels weird to have "abstract" base classes in the "concrete" section, but I think jumping back and forth between talking about zarr-python and language-agnostic concepts would be more confusing.
**Data Model**: The specification's description of the [Stored Representation](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#stored-representation) implies a particular data model, based on the [HDF Abstract Data Model](https://support.hdfgroup.org/documentation/hdf5/latest/_h5_d_m__u_g.html).
It consists of a hierarchical tree of groups and arrays, with optional arbitrary metadata at every node. This model is completely domain-agnostic.

**Format**: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format".
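The tree-of-groups-and-arrays model might be pictured concretely as follows (a hypothetical hierarchy; all names, shapes, and attributes are invented for illustration):

```python
# Hypothetical hierarchy: groups and arrays form a tree,
# and every node can carry arbitrary metadata ("attributes").
hierarchy = {
    "/": {"node": "group", "attributes": {"project": "demo"}},
    "/observations": {"node": "group", "attributes": {}},
    "/observations/temperature": {
        "node": "array",
        "shape": [720, 1440],
        "dtype": "float32",
        "attributes": {"units": "K"},
    },
}
```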
Is it okay for me to enshrine the name "Native Zarr Format" here?
what does "native" mean here?
Following #131 (comment), the word "native" is perhaps redundant if we have a clear understanding of what "format" refers to.
See the following GitHub repositories for more information:

* [Zarr Python](https://github.com/zarr-developers/zarr)
* [Zarr Specs](https://github.com/zarr-developers/zarr-specs)
* [Numcodecs](https://github.com/zarr-developers/numcodecs)
* [Z5](https://github.com/constantinpape/z5)
* [N5](https://github.com/saalfeldlab/n5)
* [Zarr.jl](https://github.com/meggart/Zarr.jl)
* [ndarray.scala](https://github.com/lasersonlab/ndarray.scala)
I think it's deeply unhelpful to immediately point at specific implementations here as the source of further explanation. That's not what their docs are for!
@@ -51,6 +45,7 @@ See the following GitHub repositories for more information:
## Features

* Chunk multi-dimensional arrays along any dimension.
* Compress array chunks via an extensible system of compressors.
Seemed like an important omission.
components/index.md
Outdated
These abstract components together describe what type of data can be stored in zarr, and how to store it, without assuming you are working in a particular programming language, or with a particular storage system.

**Specification**: All zarr-related projects obey the [Zarr Specification](https://zarr-specs.readthedocs.io/), which formally describes how to serialize and de-serialize array data and metadata as byte streams via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
and metadata as byte streams
small nit: the spec doesn't say the metadata has to be serialized as bytes. (e.g. a memorystore or other database could keep the metadata in a dict-like object)
Should be addressed by 3514d41
components/index.md
Outdated
- **NCZarr** and **Lindi** can both in some sense be considered as the opposite of VirtualiZarr - they allow interacting with zarr-formatted data on disk via a non-zarr API.
Lindi maps zarr's data model to the HDF data model and allows access via the `h5py` library through the [`LindiH5pyFile`](https://github.com/NeurodataWithoutBorders/lindi/blob/b125c111880dd830f2911c1bc2084b2de94f6d71/lindi/LindiH5pyFile/LindiH5pyFile.py#L28) class.
[NCZarr](https://docs.unidata.ucar.edu/nug/current/nczarr_head.html) allows interacting with zarr-formatted data via the netcdf-c library. Note that both libraries implement optional additional optimizations by going beyond the zarr specification and format on disk, which is not recommended.
I'm not very confident that I've actually understood what NCZarr does properly.
components/index.md
Outdated
- **MongoDBStore** is a concrete store implementation in python, which stores values in a MongoDB NoSQL database under zarr keys.
It is therefore spec-compliant, and can be interacted with via the zarr-python user API, but does not write data in the native zarr format.
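The same idea — a spec-compliant store whose persisted bytes look nothing like the native layout — can be sketched with a toy SQLite-backed store (hypothetical illustration, not a real zarr store implementation):

```python
import sqlite3

class ToySQLiteStore:
    """Toy store keeping zarr keys/values as rows in a SQLite table.
    Behaviour is key-value-store-shaped, but what lands on disk is a
    database file, not the native zarr directory layout."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)"
        )

    def __setitem__(self, key: str, value: bytes) -> None:
        self._db.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))

    def __getitem__(self, key: str) -> bytes:
        row = self._db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]
```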
Does this still exist anywhere? I wanted an example of a python store implementation that wasn't in zarr-python v3's `zarr.storage` module, and didn't use the zarr native format on disk.
index.md
Outdated
For more details read about the various [Components of Zarr](https://zarr.dev/components/),
see the canonical [Zarr-Python](https://github.com/zarr-developers/zarr-python) implementation,
or look through [other Zarr implementations](https://zarr.dev/implementations/) for one in your preferred language.
I'm not sure how to do relative links on this site. These links are broken in the preview docs build because they don't exist on the released site.
The way to have the framework check them would be to link to the .md within the directory.
components/index.md
Outdated
These abstract components together describe what type of data can be stored in zarr, and how to store it, without assuming you are working in a particular programming language, or with a particular storage system.

**Specification**: All zarr-related projects obey the [Zarr Specification](https://zarr-specs.readthedocs.io/), which formally describes how to serialize and de-serialize array data as byte streams as well as store metadata via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
It might be more accurate to call this the "Zarr Protocol" - that's what it actually is, a set of rules for transferring data between devices. The "specification" then could refer to the description of the protocol + of the data model + of the zarr native format specification.
I edited this - the more I think about it the more I think that the spec itself should explicitly talk about the protocol and the format as separate things.
See #131 (comment) for more explanation
Thanks Tom! I'm on the road for the next week and will read ASAP but I love the idea. 🙌🏼
## Features

* Serialize NumPy-like arrays in a simple and fast way.
I felt like the applications and features were mixed up together.
* Store arrays in memory, on disk, inside a Zip file, on S3, etc.
* Read and write arrays concurrently from multiple threads or processes.
* Organize arrays into hierarchies via annotatable groups.
* Extend easily thanks to the [flexible design](https://zarr.dev/flexibility/).
The link here is intended to start the reader reading through each page in turn, as the other technical pages I added also have a link at the bottom to the next one along.
The protocol works by serializing and de-serializing array data as byte streams and storing both this data and accompanying metadata via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
A system of [Codecs](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#chunk-encoding) is used to describe the encoding and serialization steps.
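The chunk-encoding role of codecs can be sketched as a toy two-stage pipeline (illustrative only, using zlib; real zarr codecs are configurable and recorded in the array metadata):

```python
import struct
import zlib

# Toy codec pipeline for one chunk of int32 data:
# stage 1 packs values to little-endian bytes, stage 2 compresses them.
def encode_chunk(values: list) -> bytes:
    raw = struct.pack(f"<{len(values)}i", *values)
    return zlib.compress(raw)

def decode_chunk(blob: bytes, n: int) -> list:
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{n}i", raw))
```

Decoding simply runs the stages in reverse order, which is why the metadata must record the codec chain.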
**Data Model**: The specification's description of the [Stored Representation](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#stored-representation) implies a particular data model, based on the [HDF Abstract Data Model](https://support.hdfgroup.org/documentation/hdf5/latest/_h5_d_m__u_g.html).
i feel like it makes more sense to lead with the data model. the spec, i.e. the protocol, defines operations (create group, create array, write chunks to an array, etc) that only make sense in light of that particular data model.
the spec, i.e. the protocol
I think I disagree that these are one and the same (see #131 (comment)), but otherwise agree with your suggestion here.
what's the difference between the contents of the zarr v2 / v3 specs and the zarr v2 / v3 protocols?
See my long comment below: #131 (comment)
**Abstract Base Classes**: Zarr-python's [`zarr.abc`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains abstract base classes enforcing a particular python realization of the specification's Key-Value Store interface, using a `Store` ABC, which is based on a `MutableMapping`-like API.
This component is concrete in the sense that it is implemented in a specific programming language, and enforces particular syntax for getting and setting values in a key-value store.
In zarr-python v2 the store API was based on `MutableMapping`, but IMO the zarr-python v3 `Store` API is not really `MutableMapping`-like. Instead it's a pretty vanilla "read and write stuff to kv storage" API.
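That contrast might be sketched roughly as follows (all names here are illustrative, not the actual `zarr.abc.store` API):

```python
from abc import ABC, abstractmethod

# A "vanilla key-value" store ABC in the v3 spirit: explicit get/set of
# bytes, rather than Python's MutableMapping dunder protocol (v2 spirit).
class KVStore(ABC):
    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def set(self, key: str, value: bytes) -> None: ...

class DictKVStore(KVStore):
    """Trivial concrete implementation backed by a dict."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, key: str) -> bytes:
        return self._data[key]

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value
```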
**Protocol**: All zarr-related projects use the Zarr Protocol, described in the [Zarr Specification](https://zarr-specs.readthedocs.io/), which allows transfer of chunked array data and metadata between devices (or between memory regions of the same device).
The protocol works by serializing and de-serializing array data as byte streams and storing both this data and accompanying metadata via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
A system of [Codecs](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#chunk-encoding) is used to describe the encoding and serialization steps.
I would try to distinguish how metadata documents are stored vs how chunk data is stored. for example, it's significant that the compressor / filters (v2) and codecs (v3) define the encoding of chunk data, not metadata documents.
My wording was intended to make that distinction already, because Joe said the same thing in an earlier comment. Clearly I need to distinguish them better though.
I think the prose only needs a minor adjustment, since in the previous section you distinguish array data and metadata. It might be sufficient to just disambiguate what exactly is encoded and serialized by the codecs (i.e., the chunks of an array).
**Format**: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format".
Most, but not all, zarr implementations will serialize to this format.

**Extensions**: Zarr provides a core set of generally-useful features, but extensions to this core are encouraged. These might take the form of domain-specific [metadata conventions](https://zarr.dev/conventions/), new codecs, or additions to the data model via [extension points](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points). These can be abstract, or enforced by implementations or client libraries however they like, but generally should be opt-in.
what does opt-in mean here? if you are using xarray with zarr, the xarray extensions to zarr are mandatory.
Fair point. All extensions are by definition not required (as then they would be core), but specific tools might well require you to use a certain extension, so calling things "opt-in" or "opt-out" doesn't make much sense.
thanks for working on this, here are a few rambling thoughts that hopefully you find useful: you list 4 abstract components of zarr:
I'm having trouble placing these 4 things in separate conceptual categories. For me, a clearer "abstract parts list" would be something like this:
It feels weird calling this latter description a "protocol" without defining some verbs, but we could restructure the statements to take the form "to create an array at

When discussing extensibility, I think it's important to distinguish between a few scenarios that all get called "extensions":
so basically I don't see "extensibility" as a core abstract component of zarr. Instead I see extensibility as a vital property / feature of different layers of the zarr model, and this varies with the version of the zarr format. And I'm not sure what you mean when you say an extension is "abstract".
You've convinced me that extensions are not a core abstract component, they are something else. I can edit this PR to reflect that. That leaves
Before going through these, let me re-describe the conceptual confusion that I'm attempting to clarify with this nomenclature. I used to think that "zarr" was simply a format, which was laid out in a specific way in filesystem or object storage, and that the spec described this format. I think a lot of other people assumed (and still assume) this.

Then once Icechunk came out I was told that icechunk was a valid implementation of the zarr specification but did not use the same format on-disk. This was quite surprising and confusing, because I had thought the specification dictated the format on-disk. Now I realise that VirtualiZarr's

For this to all be consistent, one of the following must be true:
I'm trying to find a nomenclature that more clearly separates these two components. Perhaps that nomenclature already exists, but if so then it's not documented at all on the main

We seem to agree that the data model is its own abstract component. You then mention
this is basically what I mean by the "protocol". I was looking for a word to play counterpart to "format" (i.e. "icechunk obeys the X, not the format"). I think protocol is quite a good word for it - it's an agreement between two systems (or parts of a system) on a scheme for transferring chunk data and metadata. It makes no claims about the type of system implementing the protocol. It's not a networking protocol, but still seems to fit the broader definition of a protocol.
I had thought this didn't exist anywhere, but it turns out that it's here - https://zarr-specs.readthedocs.io/en/latest/v3/stores/filesystem/v1.0.html#file-system-store-v1. (At least that document covers filesystem storage - I think there should be another one for object storage too.) So (1) above is incorrect. That leaves a choice between (2) and (3): whether we say that there is one zarr specification with a mandatory "protocol" and an optional "format", or we say that zarr has a "protocol" and an optional "format", with separate specifications describing each. I have no strong opinion on that, I only request that we have some word other than "specification" to describe the non-format abstract component of zarr. |
I agree with this, and I think we should emphasize the protocol angle. In this context I think a key difference between a format and a protocol is that a format is a state, but a protocol is set of rules (generally speaking). So a good framing of the zarr specs would be:
So I think my preference would be to give primacy to the protocol. The stored representation of metadata and chunks should be considered the interaction between the zarr protocol and the behavior of a storage backend.
I don't see how that solves the nomenclature problem I'm talking about.
The expected result of anything involving "interpretation" should be standardized. And that standardized result (which is a format on object storage/filesystems) needs a specific name we can use to refer to it. If we're using "zarr format" to refer to the metadata documents in-memory then we need another name to mean the whole thing on disk.
I don't disagree that this is how the format comes about, but I still think the resultant format needs a name so that we can distinguish it. Otherwise it's still hard for me to express that e.g. "icechunk follows the zarr protocol and the in-memory zarr metadata format, but doesn't use the ??? on disk".
I'm definitely not suggesting we use the term "zarr format" to refer to metadata documents. I'm pointing out that, while the spec defines a format for those documents, the spec does not explicitly define a format for the on-disk representation. So I'm kind of wondering why we need to give on-disk representations their own conceptual category, when they are essentially a side-effect of a particular store API. For example, this sentence: could be rephrased as "icechunk follows the zarr protocol (this already implies that it uses the right representation for in-memory zarr metadata); it uses a unique stored representation / layout for zarr data".
Then how would you refer to the not-unique stored representation of zarr data (what 90% of people erroneously believe to be "the zarr format")?
I guess it depends on the context. If I just saved some data to s3, and I want to describe what I saved to another person, I would say "I saved the data to zarr". If I'm having a high level discussion about the spec, I would probably say "the layout used by stores that target cloud storage" (this is of course inaccurate if we include stores like icechunk!). I might also say "the conventional zarr-on-[backend] layout". For s3 and posix storage I think basically all implementations agree on the same layout without the need for a spec. But I could imagine storage backends with more degrees of freedom, like databases, where one zarr implementation might make very different choices from another. In this case the store API would need a specification, but we could then use the name of that specification to address the layout it required.
Both of these seem unnecessarily verbose and confusing to me. For that reason alone I think there deserves to be a specific agreed-upon name to use to refer to the "conventional/native"-persistent-zarr-on-object/file-storage-layout. These definitions seem problematic in other ways too. The first is clearly store-implementation-dependent, and potentially circular: "
If that's true then why is there currently a dedicated section in the spec to describe the layout for posix storage, but not one for s3? Whatever your position on the need for a format spec that seems inconsistent. I don't understand what the reluctance is to just:
That would make it so much easier to explain what zarr/zarr-python/zarr-formatted data/icechunk/virtualizarr are unambiguously to people who aren't deep in the weeds like we are. (e.g. "
I wasn't providing definitions, I was answering the question you asked, which was "how would you refer to it". The definition, IMO, would be something like "the implementation of the zarr protocol where object names are used directly as keys in key-value storage", but this is also not very concise.
I'm not expressing a normative view here, just stating how things have worked so far. in zarr v2, there is no formal spec for posix file systems or object storage, yet implementations in different languages basically all agreed on the exact same stored representation. And I doubt anyone actually relies on the spec document you linked for v3. Organic agreement between implementations isn't surprising because zarr was designed to be easy to implement for posix file systems and object storage. I brought up the database-zarr backend as an example of a scenario where a spec would be necessary, because the mapping from zarr to a db engine has a lot more ambiguity.
It wouldn't hurt to write up the object store layout, but I don't think there have been issues with interoperability there, so I'm not sure what the value would be. It would also be a lot of work.
I don't think the word "native" is very helpful here, but something like "plain zarr layout" would certainly make sense to me if you introduced the term specifically in contrast to zarr layouts like the in-memory storage, or icechunk. For reference, this is exactly why I suggested making a zarr from scratch demo as an explanatory tool. if you do everything in memory, there is no stable layout at all -- it's all protocol. Once the protocol is clear, then you can introduce the concept of the stored representation, and demonstrate how those representations can look different on different backends.
I do think the spec should be more clear and say "this is a protocol". But it's probably too late to make these changes for the published specs.
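The "zarr from scratch, all in memory, it's all protocol" demo suggested above might start something like this (key names echo v3-style conventions, but the metadata fields are simplified and illustrative):

```python
import json
import zlib

# The "store" is just a mapping from string keys to byte values.
store: dict = {}

# Create an array: write a metadata document under an agreed key.
meta = {"shape": [16], "data_type": "uint8", "chunk_shape": [16]}
store["my_array/zarr.json"] = json.dumps(meta).encode()

# Write a chunk: encode the values to (compressed) bytes under a chunk key.
store["my_array/c/0"] = zlib.compress(bytes(range(16)))

# Reading reverses the steps; at no point did a "layout on disk" exist.
loaded = json.loads(store["my_array/zarr.json"])
chunk = zlib.decompress(store["my_array/c/0"])
```

Only once such a mapping is persisted to a filesystem or object store does a stored layout appear, which is the point being made about protocol vs format.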
Yeah that's fair, sorry for shifting the goalposts there. Ultimately I just want to be able to write documentation that refers to things succinctly and unambiguously.
I agree that the lack of this object storage layout spec doesn't seem to have caused any interoperability issues. I'm just saying that the lack of it is causing me some documentation issues.
I fully agree that would be a very demonstrative and worthwhile example, but that's far more in-depth and developer-oriented than what I'm looking to put on the
So what is the recommendation here then? Can I use the word "protocol" on this website? Can I use "plain zarr layout"? I would have hyperlinks each time to the spec for details.
Those both seem like good ideas to me. But given that the spec document is not super clear about this, I do think the most important thing is that these concepts are clearly defined where they are introduced.
Hi folks. Apologies for the delay. I'm not sure how many of the above comments are still outstanding (can I suggest resolving those that aren't?) My knee-jerk reaction from the first glance still remains: can I suggest we rename this page to Concepts? I've recently seen a few pages that use that to suggest, "Read these docs first". If so, the headlines might be:
And +1 for having a section on "native zarr format" (or similar) if it's going to be mentioned. All that being said, happy to get this in sooner rather than later and then encourage others to add to over time. Thanks again, @TomNicholas.
Implements the suggestion in zarr-developers/zarr-python#2956.
~~Not quite finished yet.~~ This is ready for review (@d-v-b @joshmoore)