Add components and flexibility pages #131
Conversation
**Format**: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format".
Most, but not all, zarr implementations will serialize to this format.
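The key-to-path idea above can be sketched in a few lines (a hypothetical illustration — `write_key`, `read_key`, and the example key are invented for this sketch, not zarr-python APIs):

```python
from pathlib import Path

# Hypothetical sketch: the store key is used unaltered as a relative path,
# which is what yields the conventional on-disk zarr layout.
def write_key(root: str, key: str, value: bytes) -> None:
    path = Path(root) / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(value)

def read_key(root: str, key: str) -> bytes:
    return (Path(root) / key).read_bytes()
```

An object-store variant would do the same thing, using the key as an object name under a bucket prefix instead of a filesystem path.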
I feel like this needs an explicit section in the specification, even if it's pretty trivial.
Turns out it does (at least for filesystems - there's nothing for object storage). See #131 (comment) for more context.
components/index.md
Outdated
**Zarr-Python Abstract Base Classes**: Zarr-python's [`zarr.abc`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains abstract base classes enforcing a particular python realization of the specification's Key-Value Store interface, based on a `MutableMapping`-like API.
This component is concrete in the sense that it is implemented in a specific programming language, and enforces particular syntax for getting and setting values in a key-value store.
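A `MutableMapping`-flavoured store might be sketched like this (a toy in-memory example for illustration only; the real `zarr.abc` classes differ):

```python
from collections.abc import MutableMapping

class ToyMemoryStore(MutableMapping):
    """Toy key-value store exposing the MutableMapping API
    (illustrative only, not a zarr-python class)."""

    def __init__(self) -> None:
        self._data: dict = {}

    def __getitem__(self, key: str) -> bytes:
        return self._data[key]

    def __setitem__(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def __delitem__(self, key: str) -> None:
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)
```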
Feels weird to have "abstract" base classes in the "concrete" section, but I think jumping back and forth between talking about zarr-python and language-agnostic concepts would be more confusing.
**Data Model**: The specification's description of the [Stored Representation](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#stored-representation) implies a particular data model, based on the [HDF Abstract Data Model](https://support.hdfgroup.org/documentation/hdf5/latest/_h5_d_m__u_g.html).
It consists of a hierarchical tree of groups and arrays, with optional arbitrary metadata at every node. This model is completely domain-agnostic.

**Format**: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format".
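The tree-of-groups-and-arrays model might be pictured concretely as follows (a hypothetical hierarchy; all names, shapes, and attributes are invented for illustration):

```python
# Hypothetical hierarchy: groups and arrays form a tree,
# and every node can carry arbitrary metadata ("attributes").
hierarchy = {
    "/": {"node": "group", "attributes": {"project": "demo"}},
    "/observations": {"node": "group", "attributes": {}},
    "/observations/temperature": {
        "node": "array",
        "shape": [720, 1440],
        "dtype": "float32",
        "attributes": {"units": "K"},
    },
}
```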
Is it okay for me to enshrine the name "Native Zarr Format" here?
what does "native" mean here?
Following #131 (comment), the word "native" is perhaps redundant if we have a clear understanding of what "format" refers to.
See the following GitHub repositories for more information:

* [Zarr Python](https://github.com/zarr-developers/zarr)
* [Zarr Specs](https://github.com/zarr-developers/zarr-specs)
* [Numcodecs](https://github.com/zarr-developers/numcodecs)
* [Z5](https://github.com/constantinpape/z5)
* [N5](https://github.com/saalfeldlab/n5)
* [Zarr.jl](https://github.com/meggart/Zarr.jl)
* [ndarray.scala](https://github.com/lasersonlab/ndarray.scala)
I think it's deeply unhelpful to immediately point at specific implementations here as the source of further explanation. That's not what their docs are for!
@@ -51,6 +45,7 @@ See the following GitHub repositories for more information:
## Features

* Chunk multi-dimensional arrays along any dimension.
* Compress array chunks via an extensible system of compressors.
Seemed like an important omission.
components/index.md
Outdated
These abstract components together describe what type of data can be stored in zarr, and how to store it, without assuming you are working in a particular programming language, or with a particular storage system.

**Specification**: All zarr-related projects obey the [Zarr Specification](https://zarr-specs.readthedocs.io/), which formally describes how to serialize and de-serialize array data and metadata as byte streams via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
and metadata as byte streams
small nit: the spec doesn't say the metadata has to be serialized as bytes. (e.g. a memorystore or other database could keep the metadata in a dict-like object)
Should be addressed by 3514d41
components/index.md
Outdated
- **NCZarr** and **Lindi** can both in some sense be considered as the opposite of VirtualiZarr - they allow interacting with zarr-formatted data on disk via a non-zarr API.
Lindi maps zarr's data model to the HDF data model and allows access via the `h5py` library through the [`LindiH5pyFile`](https://github.com/NeurodataWithoutBorders/lindi/blob/b125c111880dd830f2911c1bc2084b2de94f6d71/lindi/LindiH5pyFile/LindiH5pyFile.py#L28) class.
[NCZarr](https://docs.unidata.ucar.edu/nug/current/nczarr_head.html) allows interacting with zarr-formatted data via the netcdf-c library. Note that both libraries implement optional additional optimizations by going beyond the zarr specification and format on disk, which is not recommended.
I'm not very confident that I've actually understood what NCZarr does properly.
components/index.md
Outdated
- **MongoDBStore** is a concrete store implementation in python, which stores values in a MongoDB NoSQL database under zarr keys.
It is therefore spec-compliant, and can be interacted with via the zarr-python user API, but does not write data in the native zarr format.
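The same idea — a spec-compliant store whose persisted bytes look nothing like the native layout — can be sketched with a toy SQLite-backed store (hypothetical illustration, not a real zarr store implementation):

```python
import sqlite3

class ToySQLiteStore:
    """Toy store keeping zarr keys/values as rows in a SQLite table.
    Behaviour is key-value-store-shaped, but what lands on disk is a
    database file, not the native zarr directory layout."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)"
        )

    def __setitem__(self, key: str, value: bytes) -> None:
        self._db.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))

    def __getitem__(self, key: str) -> bytes:
        row = self._db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]
```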
Does this still exist anywhere? I wanted an example of a python store implementation that wasn't in zarr-python v3's `zarr.storage` module, and didn't use the zarr native format on disk.
index.md
Outdated
For more details read about the various [Components of Zarr](https://zarr.dev/components/),
see the canonical [Zarr-Python](https://github.com/zarr-developers/zarr-python) implementation,
or look through [other Zarr implementations](https://zarr.dev/implementations/) for one in your preferred language.
I'm not sure how to do relative links on this site. These links are broken in the preview docs build because they don't exist on the released site.
The way to have the framework check them would be to link to the .md within the directory.
components/index.md
Outdated
These abstract components together describe what type of data can be stored in zarr, and how to store it, without assuming you are working in a particular programming language, or with a particular storage system.

**Specification**: All zarr-related projects obey the [Zarr Specification](https://zarr-specs.readthedocs.io/), which formally describes how to serialize and de-serialize array data as byte streams as well as store metadata via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
It might be more accurate to call this the "Zarr Protocol" - that's what it actually is, a set of rules for transferring data between devices. The "specification" then could refer to the description of the protocol + of the data model + of the zarr native format specification.
I edited this - the more I think about it the more I think that the spec itself should explicitly talk about the protocol and the format as separate things.
See #131 (comment) for more explanation
Thanks Tom! I'm on the road for the next week and will read ASAP but I love the idea. 🙌🏼
## Features

* Serialize NumPy-like arrays in a simple and fast way.
I felt like the applications and features were mixed up together.
* Store arrays in memory, on disk, inside a Zip file, on S3, etc.
* Read and write arrays concurrently from multiple threads or processes.
* Organize arrays into hierarchies via annotatable groups.
* Extend easily thanks to the [flexible design](https://zarr.dev/flexibility/).
The link here is intended to start the reader reading through each page in turn, as the other technical pages I added also have a link at the bottom to the next one along.
The protocol works by serializing and de-serializing array data as byte streams and storing both this data and accompanying metadata via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
A system of [Codecs](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#chunk-encoding) is used to describe the encoding and serialization steps.
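The chunk-encoding role of codecs can be sketched as a toy two-stage pipeline (illustrative only, using zlib; real zarr codecs are configurable and recorded in the array metadata):

```python
import struct
import zlib

# Toy codec pipeline for one chunk of int32 data:
# stage 1 packs values to little-endian bytes, stage 2 compresses them.
def encode_chunk(values: list) -> bytes:
    raw = struct.pack(f"<{len(values)}i", *values)
    return zlib.compress(raw)

def decode_chunk(blob: bytes, n: int) -> list:
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{n}i", raw))
```

Decoding simply runs the stages in reverse order, which is why the metadata must record the codec chain.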
**Data Model**: The specification's description of the [Stored Representation](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#stored-representation) implies a particular data model, based on the [HDF Abstract Data Model](https://support.hdfgroup.org/documentation/hdf5/latest/_h5_d_m__u_g.html).
i feel like it makes more sense to lead with the data model. the spec, i.e. the protocol, defines operations (create group, create array, write chunks to an array, etc) that only make sense in light of that particular data model.
the spec, i.e. the protocol
I think I disagree that these are one and the same (see #131 (comment)), but otherwise agree with your suggestion here.
what's the difference between the contents of the zarr v2 / v3 specs and the zarr v2 / v3 protocols?
See my long comment below: #131 (comment)
**Abstract Base Classes**: Zarr-python's [`zarr.abc`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains abstract base classes enforcing a particular python realization of the specification's Key-Value Store interface, using a `Store` ABC, which is based on a `MutableMapping`-like API.
This component is concrete in the sense that it is implemented in a specific programming language, and enforces particular syntax for getting and setting values in a key-value store.
In zarr-python v2 the store API was based on `MutableMapping`, but IMO the zarr-python v3 `Store` API is not really `MutableMapping`-like. Instead it's a pretty vanilla "read and write stuff to kv storage" API.
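That contrast might be sketched roughly as follows (all names here are illustrative, not the actual `zarr.abc.store` API):

```python
from abc import ABC, abstractmethod

# A "vanilla key-value" store ABC in the v3 spirit: explicit get/set of
# bytes, rather than Python's MutableMapping dunder protocol (v2 spirit).
class KVStore(ABC):
    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def set(self, key: str, value: bytes) -> None: ...

class DictKVStore(KVStore):
    """Trivial concrete implementation backed by a dict."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, key: str) -> bytes:
        return self._data[key]

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value
```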
**Protocol**: All zarr-related projects use the Zarr Protocol, described in the [Zarr Specification](https://zarr-specs.readthedocs.io/), which allows transfer of chunked array data and metadata between devices (or between memory regions of the same device).
The protocol works by serializing and de-serializing array data as byte streams and storing both this data and accompanying metadata via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
A system of [Codecs](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#chunk-encoding) is used to describe the encoding and serialization steps.
I would try to distinguish how metadata documents are stored vs how chunk data is stored. for example, it's significant that the compressor / filters (v2) and codecs (v3) define the encoding of chunk data, not metadata documents.
My wording was intended to make that distinction already, because Joe said the same thing in an earlier comment. Clearly I need to distinguish them better though.
I think the prose only needs a minor adjustment, since in the previous section you distinguish array data and metadata. It might be sufficient to just disambiguate what exactly is encoded and serialized by the codecs (i.e., the chunks of an array).
**Format**: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format".
Most, but not all, zarr implementations will serialize to this format.

**Extensions**: Zarr provides a core set of generally-useful features, but extensions to this core are encouraged. These might take the form of domain-specific [metadata conventions](https://zarr.dev/conventions/), new codecs, or additions to the data model via [extension points](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points). These can be abstract, or enforced by implementations or client libraries however they like, but generally should be opt-in.
what does opt-in mean here? if you are using xarray with zarr, the xarray extensions to zarr are mandatory.
Fair point. All extensions are by definition not required (as then they would be core), but specific tools might well require you to use a certain extension, so calling things "opt-in" or "opt-out" doesn't make much sense.
thanks for working on this, here are a few rambling thoughts that hopefully you find useful: you list 4 abstract components of zarr:
I'm having trouble placing these 4 things in separate conceptual categories. For me, a clearer "abstract parts list" would be something like this:
It feels weird calling this latter description a "protocol" without defining some verbs, but we could restructure the statements to take the form "to create an array at

When discussing extensibility, I think it's important to distinguish between a few scenarios that all get called "extensions":
so basically I don't see "extensibility" as a core abstract component of zarr. Instead I see extensibility as a vital property / feature of different layers of the zarr model, and this varies with the version of the zarr format. And I'm not sure what you mean when you say an extension is "abstract".
You've convinced me that extensions are not a core abstract component, they are something else. I can edit this PR to reflect that. That leaves
Before going through these, let me re-describe the conceptual confusion that I'm attempting to clarify with this nomenclature. I used to think that "zarr" was simply a format, which was laid out in a specific way in filesystem or object storage, and that the spec described this format. I think a lot of other people assumed (and still assume) this.

Then once Icechunk came out I was told that icechunk was a valid implementation of the zarr specification but did not use the same format on-disk. This was quite surprising and confusing, because I had thought the specification dictated the format on-disk. Now I realise that VirtualiZarr's

For this to all be consistent, one of the following must be true:
I'm trying to find a nomenclature that more clearly separates these two components. Perhaps that nomenclature already exists, but if so then it's not documented at all on the main

We seem to agree that the data model is its own abstract component. You then mention
this is basically what I mean by the "protocol". I was looking for a word to play counterpart to "format" (i.e. "icechunk obeys the X, not the format"). I think protocol is quite a good word for it - it's an agreement between two systems (or parts of a system) on a scheme for transferring chunk data and metadata. It makes no claims about the type of system implementing the protocol. It's not a networking protocol, but still seems to fit the broader definition of a protocol.
I had thought this didn't exist anywhere, but it turns out that it's here - https://zarr-specs.readthedocs.io/en/latest/v3/stores/filesystem/v1.0.html#file-system-store-v1. (At least that document covers filesystem storage - I think there should be another one for object storage too.) So (1) above is incorrect. That leaves a choice between (2) and (3): whether we say that there is one zarr specification with a mandatory "protocol" and an optional "format", or we say that zarr has a "protocol" and an optional "format", with separate specifications describing each. I have no strong opinion on that, I only request that we have some word other than "specification" to describe the non-format abstract component of zarr. |
I agree with this, and I think we should emphasize the protocol angle. In this context I think a key difference between a format and a protocol is that a format is a state, but a protocol is set of rules (generally speaking). So a good framing of the zarr specs would be:
So I think my preference would be to give primacy to the protocol. The stored representation of metadata and chunks should be considered the interaction between the zarr protocol and the behavior of a storage backend.
I don't see how that solves the nomenclature problem I'm talking about.
The expected result of anything involving "interpretation" should be standardized. And that standardized result (which is a format on object storage/filesystems) needs a specific name we can use to refer to it. If we're using "zarr format" to refer to the metadata documents in-memory then we need another name to mean the whole thing on disk.
I don't disagree that this is how the format comes about, but I still think the resultant format needs a name so that we can distinguish it. Otherwise it's still hard for me to express that e.g. "icechunk follows the zarr protocol and the in-memory zarr metadata format, but doesn't use the ??? on disk".
I'm definitely not suggesting we use the term "zarr format" to refer to metadata documents. I'm pointing out that, while the spec defines a format for those documents, the spec does not explicitly define a format for the on-disk representation. So I'm kind of wondering why we need to give on-disk representations their own conceptual category, when they are essentially a side-effect of a particular store API. For example, this sentence: could be rephrased as "icechunk follows the zarr protocol (this already implies that it uses the right representation for in-memory zarr metadata); it uses a unique stored representation / layout for zarr data".
Then how would you refer to the not-unique stored representation of zarr data (what 90% of people erroneously believe to be "the zarr format")?
I guess it depends on the context. If I just saved some data to s3, and I want to describe what I saved to another person, I would say "I saved the data to zarr". If I'm having a high level discussion about the spec, I would probably say "the layout used by stores that target cloud storage" (this is of course inaccurate if we include stores like icechunk!). I might also say "the conventional zarr-on-[backend] layout". For s3 and posix storage I think basically all implementations agree on the same layout without the need for a spec. But I could imagine storage backends with more degrees of freedom, like databases, where one zarr implementation might make very different choices from another. In this case the store API would need a specification, but we could then use the name of that specification to address the layout it required.
Both of these seem unnecessarily verbose and confusing to me. For that reason alone I think there deserves to be a specific agreed-upon name to use to refer to the "conventional/native"-persistent-zarr-on-object/file-storage-layout. These definitions seem problematic in other ways too. The first is clearly store-implementation-dependent, and potentially circular: "
If that's true then why is there currently a dedicated section in the spec to describe the layout for posix storage, but not one for s3? Whatever your position on the need for a format spec that seems inconsistent. I don't understand what the reluctance is to just:
That would make it so much easier to explain what zarr/zarr-python/zarr-formatted data/icechunk/virtualizarr are unambiguously to people who aren't deep in the weeds like we are. (e.g. "
I wasn't providing definitions, I was answering the question you asked, which was "how would you refer to it". The definition, IMO, would be something like "the implementation of the zarr protocol where object names are used directly as keys in key-value storage", but this is also not very concise.
I'm not expressing a normative view here, just stating how things have worked so far. in zarr v2, there is no formal spec for posix file systems or object storage, yet implementations in different languages basically all agreed on the exact same stored representation. And I doubt anyone actually relies on the spec document you linked for v3. Organic agreement between implementations isn't surprising because zarr was designed to be easy to implement for posix file systems and object storage. I brought up the database-zarr backend as an example of a scenario where a spec would be necessary, because the mapping from zarr to a db engine has a lot more ambiguity.
It wouldn't hurt to write up the object store layout, but I don't think there have been issues with interoperability there, so I'm not sure what the value would be. It would also be a lot of work.
I don't think the word "native" is very helpful here, but something like "plain zarr layout" would certainly make sense to me if you introduced the term specifically in contrast to zarr layouts like the in-memory storage, or icechunk. For reference, this is exactly why I suggested making a zarr from scratch demo as an explanatory tool. if you do everything in memory, there is no stable layout at all -- it's all protocol. Once the protocol is clear, then you can introduce the concept of the stored representation, and demonstrate how those representations can look different on different backends.
I do think the spec should be more clear and say "this is a protocol". But it's probably too late to make these changes for the published specs.
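The "zarr from scratch, all in memory, it's all protocol" demo suggested above might start something like this (key names echo v3-style conventions, but the metadata fields are simplified and illustrative):

```python
import json
import zlib

# The "store" is just a mapping from string keys to byte values.
store: dict = {}

# Create an array: write a metadata document under an agreed key.
meta = {"shape": [16], "data_type": "uint8", "chunk_shape": [16]}
store["my_array/zarr.json"] = json.dumps(meta).encode()

# Write a chunk: encode the values to (compressed) bytes under a chunk key.
store["my_array/c/0"] = zlib.compress(bytes(range(16)))

# Reading reverses the steps; at no point did a "layout on disk" exist.
loaded = json.loads(store["my_array/zarr.json"])
chunk = zlib.decompress(store["my_array/c/0"])
```

Only once such a mapping is persisted to a filesystem or object store does a stored layout appear, which is the point being made about protocol vs format.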
Yeah that's fair, sorry for shifting the goalposts there. Ultimately I just want to be able to write documentation that refers to things succinctly and unambiguously.
I agree that the lack of this object storage layout spec doesn't seem to have caused any interoperability issues. I'm just saying that the lack of it is causing me some documentation issues.
I fully agree that would be a very demonstrative and worthwhile example, but that's far more in-depth and developer-oriented than what I'm looking to put on the
So what is the recommendation here then? Can I use the word "protocol" on this website? Can I use "plain zarr layout"? I would have hyperlinks each time to the spec for details.
Those both seem like good ideas to me. But given that the spec document is not super clear about this, I do think the most important thing is that these concepts are clearly defined where they are introduced.
Hi folks. Apologies for the delay. I'm not sure how many of the above comments are still outstanding (can I suggest resolving those that aren't?) My knee-jerk reaction from the first glance still remains: can I suggest we rename this page to Concepts? I've recently seen a few pages that use that to suggest, "Read these docs first". If so, the headlines might be:
And +1 for having a section on "native zarr format" (or similar) if it's going to be mentioned. All that being said, happy to get this in sooner rather than later and then encourage others to add to over time. Thanks again, @TomNicholas.
Implements the suggestion in zarr-developers/zarr-python#2956.
~~Not quite finished yet.~~ This is ready for review (@d-v-b @joshmoore)