Collections Specification #31

Open
DragaDoncila opened this issue Feb 24, 2021 · 119 comments

Comments

@DragaDoncila

What is an image collection?
A collection of images is a semantic grouping of two or more associated ome-ngff images and/or image-labels.

This definition could include

  • Images which do not share a physical coordinate space e.g. training dataset of images containing bees
  • Images which share a physical coordinate space and whose storage specification must support sufficient metadata to determine this positioning e.g. high-content screening plates and wells
  • A hierarchy of image groups of arbitrary depth which may or may not share physical coordinates
  • Other things…?

What workflows should it support?
The specification should support implementations being able to traverse the image collection and, where relevant, map the associated metadata to the physical coordinate space for loading these images.

Ideally, the specification should provide sufficient information at each level of a hierarchical grouping to allow for the loading of both the entire collection, and the loading of an arbitrary level of the hierarchy. This can be important when wanting to share/view partial datasets or update only small parts of the entire collection.

Where labels or other related data are provided (e.g. meshes, points…), the specification should support associating any member of the image collection with its labels, regardless of the level in the hierarchy.

The OME-NGFF spec is close to supporting this functionality with the HCS specification, which allows the positioning of wells into rows and columns on a plate. The main drawbacks of this specification are

  • It is too specific to be easily used for images which ARE physically associated but are not HCS acquisitions
  • It may be difficult to understand for researchers who are not working with HCS images but nevertheless wish to store their collection in OME-NGFF format
  • It does not support an arbitrary depth of groupings
  • It does not support collections which are not physically associated

What should it be called?

  • Dataset - this term is already used in various places so may not be the best choice
  • Collection - a general enough term which is currently mostly unused
  • Hierarchical definition - there is a case for this specification being a hierarchy of specifications, with each one defining a more tightly bound collection e.g.
    • Bag - associated images with no metadata
    • Stack - associated images which overlap in physical space
    • Panorama - associated images which stitch together in physical space

Ideally, the names used in the base specification would be general enough to support a broad variety of use cases and tailored use cases could be demonstrated using examples in the documentation.

Reference specifications
BDV XML Files
SVG
TrakEM2
Napari Plugin for image-label collections
mobie grid view of many sources

Related
Image.SC discussion on collections
Live notes from latest community call
HCS Specification

What next?
I think we should first decide on whether we want to support arbitrary levels in the hierarchy and whether we want a general spec which we can “inherit” from for more detailed specs, or whether we want one spec to rule them all.

My vote is that we define the most generic collection (a “bag” of images) which works with arbitrary levels of grouping (it’s collections all the way down), and then work to add to it for more complex collections. I will be working on this over the coming week and will post here once I have something working, but of course would love to hear what everyone’s thoughts are on the best way forward.

@imagesc-bot

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/next-call-on-next-gen-bioimaging-data-tools-feb-23/48386/9

@tischi

tischi commented Mar 1, 2021

@DragaDoncila Thank you very much for the detailed post! It makes a lot of sense and I am looking forward to whatever you come up with! Incidentally we (cc @constantinpape ) were also working on this topic during the past few days.

I also ping @d-v-b

I would like to add a notion and would be curious to hear opinions:

Currently I think that I would prefer storing images within a zarr container without any hierarchy, i.e. just as a flat list. The main reason is simplicity for the reader and writer libraries (the current HCS specification does not follow this). Anything that imposes a hierarchy would be handled by the collections specification, which I think could be seen as metadata that specifies how to display and lay out several images together.

@constantinpape
Contributor

Currently I think that I would prefer storing images within a zarr container without any hierarchy, i.e. just as a flat list.

This means you would like to store images as 2d arrays and volumes as 3d arrays, correct?
I created #35 to discuss this.

@tischi

tischi commented Mar 1, 2021

This means you would like to store images as 2d arrays and volumes as 3d arrays, correct?

I was not gonna enter the 3D vs 5D discussion here, but just wanted to say that I feel that structuring the zarr like this: https://ngff.openmicroscopy.org/latest/#hcs-layout feels overly complex to me.

@constantinpape
Contributor

I was not gonna enter the 3D vs 5D discussion here, but just wanted to say that I feel that structuring the zarr like this: https://ngff.openmicroscopy.org/latest/#hcs-layout feels overly complex to me.

Ok, so your point is to rather have a flat hierarchy of images in the zarr container:

image1/
image2/
image3/
image4/
...

and then define the potential hierarchies in the collections metadata (just a mock-up):

{
  "well1": ["image1", "image2"],
  "well2": ["image3", "image4"]
}
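A minimal Python sketch of how a reader might resolve such a mapping, assuming the hierarchy lives only in the metadata and every image sits directly under the container root. The function name `resolve_group` and the root name `plate.zarr` are hypothetical and not part of any spec.

```python
import posixpath

# Hypothetical collections metadata, copied from the mock-up above.
collections = {
    "well1": ["image1", "image2"],
    "well2": ["image3", "image4"],
}


def resolve_group(root, metadata, group):
    """Return the full path of every image listed under `group`."""
    return [posixpath.join(root, name) for name in metadata[group]]


# "plate.zarr" is an assumed container name used only for illustration.
well1_paths = resolve_group("plate.zarr", collections, "well1")
```

The storage layout stays flat; only the metadata dictionary changes if the grouping changes.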

@tischi

tischi commented Mar 1, 2021

Yes, exactly.

The way I see it conceptually is that a multi-well plate is a specific layout of a bag of images and, as such, should be covered by our collections specification, which I would currently see as metadata that exists independent of the way we store the raw image data. What do you think?

@will-moore
Member

One feature that the current HCS layout gives us is a URL to a specific Well. So I can open a specific Well like: https://hms-dbmi.github.io/vizarr/v0.1?source=https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/plates/2551.zarr/A/1
Demo movie at https://twitter.com/will_j_moore/status/1322187662762090497

I guess you could try to use a URL ?query or a #fragment to refer to a Well or other subgroup. E.g. path/to/plate.zarr/#A1

@tischi

tischi commented Mar 1, 2021

a URL to a specific Well

I see that this is cool, but I am afraid that (i) these hierarchies make it harder to parse an ome.zarr and (ii) it is not flexible; for example, I guess I cannot produce a single URL to show me all the images that were subjected to the same biological treatment (which may be several wells).

@DragaDoncila
Author

The way I see it conceptually is that a multi-well plate is a specific layout of a bag of images and, as such, should be covered by our collections specification, which I would currently see as metadata that exists independent of the way we store the raw image data.

I completely agree that the metadata and storage should be independent, because I think this also provides the opportunity to support a wider range of custom metadata. For example this:

{
  "well1": ["image1", "image2"],
  "well2": ["image3", "image4"]
}

could easily be this (for some geographical feature learning model):

{
  "lakes": ["image1", "image2"],
  "mountains": ["image3", "image4"]
}

I guess that's what I was thinking of when I said

tailored use cases could be demonstrated using examples in the documentation.

I like the idea of a flat set of images with the hierarchy determined entirely by the metadata. That certainly seems the easiest way to support an arbitrary level of hierarchy without ending up with a very complex storage structure.

@joshmoore
Member

#31 (comment) Currently I think that I would prefer storing images within a zarr container without any hierarchy, i.e. just as a flat list.

Is this a MAY or a MUST? And what happens when/if someone does make use of the folder structure available in Zarr/N5/HDF5?

@will-moore
Member

Re @tischi "flexibility and biological treatment" - I'm wondering whether there must be a single 'hierarchy' in the container, or whether we can have multiple. E.g.

{
  "well1": ["image1", "image2"],
  "well2": ["image3", "image4"]
}

And:

{
  "acquisition1": ["image1", "image3"],
  "acquisition2": ["image2", "image4"]
}

or

{
  "drug1": ["image1", "image3"],
  "drug2": ["image2", "image4"]
}

Those are all different ways of grouping the images.
But if you have:

{
  "well1": ["image1", "image2"],
  "well2": ["image3", "image4"],
  "drug1": ["image1", "image3"],
  "drug2": ["image2", "image4"]
}
how do you know which groups are mutually exclusive, e.g. which ones are Wells vs. Treatments?

Having multiple hierarchies might provide more flexibility, but this makes it harder to understand how to view the data.
Instead, it might make more sense to only have a single hierarchy (like a file-system) and then add other metadata in other ways?
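To make the ambiguity concrete, here is a small Python sketch that reports which groups in a flat dictionary share images. The helper `overlapping_groups` is hypothetical. It can show that the keys are not mutually exclusive, but it cannot tell which keys are Wells and which are Treatments, and that is exactly the information the flat form lacks.

```python
from itertools import combinations


def overlapping_groups(groups):
    """Return pairs of group names that share at least one image."""
    return [
        (a, b)
        for a, b in combinations(groups, 2)
        if set(groups[a]) & set(groups[b])
    ]


# The mixed dictionary from the example above.
mixed = {
    "well1": ["image1", "image2"],
    "well2": ["image3", "image4"],
    "drug1": ["image1", "image3"],
    "drug2": ["image2", "image4"],
}

overlaps = overlapping_groups(mixed)
```

Here every well overlaps with every drug, while the two wells (and the two drugs) do not overlap with each other.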

@tischi

tischi commented Mar 4, 2021

@will-moore

My current idea would be to have no hierarchy on the data storage level, but provide the possibility to specify different "views" on the data on the metadata level. Something along the lines:

views: 
{
   "well_based": {...},
   "treatment_based":{...}
}
default_view: "well_based"

Does that make sense to you?
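A minimal Python sketch of how a reader could pick a view under this idea. The function `select_view` is hypothetical, and the contents of each view are invented for illustration because the mock-up above elides them.

```python
def select_view(metadata, name=None):
    """Return the requested view, falling back to the default view."""
    key = metadata["default_view"] if name is None else name
    return metadata["views"][key]


# Hypothetical contents: the mock-up does not say what each view holds.
metadata = {
    "views": {
        "well_based": {"well1": ["image1", "image2"], "well2": ["image3", "image4"]},
        "treatment_based": {"drug1": ["image1", "image3"], "drug2": ["image2", "image4"]},
    },
    "default_view": "well_based",
}

default = select_view(metadata)
by_treatment = select_view(metadata, "treatment_based")
```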

@tischi

tischi commented Mar 4, 2021

Is this a MAY or a MUST? And what happens when/if someone does make use of the folder structure available in Zarr/N5/HDF5?

Personally, I'd be for a MUST, i.e. not support hierarchies and then ignore anything stored at deeper levels.
But, obviously, that's just my personal opinion. Very curious to hear other opinions!

@constantinpape
Contributor

Personally, I'd be for a MUST, i.e. not support hierarchies and then ignore anything stored at deeper levels.
But, obviously, that's just my personal opinion. Very curious to hear other opinions!

I think I am not such a big fan of the MUST here. There are some use cases where hierarchies make a lot of sense to keep the data ordered. As a simple example:
I have segmentations computed with two different algorithms and two hyperparameters for the algorithms, and I want to store them in the same container to compare them with some viewer that can ingest it.
For this use case having

algorithm1/
    parameter_set1
    parameter_set2
    parameter_set3
algorithm2/
    parameter_set1
    parameter_set2

is a more natural (and easier to navigate) way of storing this than

algorithm1_parameter_set1
algorithm1_parameter_set2
algorithm1_parameter_set3
algorithm2_parameter_set1
algorithm2_parameter_set2

@tischi

tischi commented Mar 4, 2021

There are some use cases where hierarchies make a lot of sense to keep the data ordered

OK, fair enough :)

I guess the question is whether, in practice, one would navigate the data via the "views" or via the folder structure. If one changes one's mind at some point about the folder structure, this could be quite expensive in terms of reordering all the data (at least that's how I understood how the object stores work), while it would be very cheap to just replace the views, wouldn't it?

@constantinpape
Contributor

I guess the question is whether, in practice, one would navigate the data via the "views" or via the folder structure. If one changes one's mind at some point about the folder structure, this could be quite expensive in terms of reordering all the data (at least that's how I understood how the object stores work), while it would be very cheap to just replace the views

Sure, reordering the folder structure is not such a good idea, but it is also not necessary because we can have multiple views for the same data. But having a hierarchical folder structure does not change anything about the views except that there will be some / in the data names.

@tischi

tischi commented Mar 5, 2021

But having a hierarchical folder structure does not change anything about the views except that there will be some / in the data names.

Yes, that is true. I guess it'd be fine with a MAY, but should we then maybe "strongly encourage" that there is a default_view specified that one could go to in order to efficiently find out what's in the dataset, without having to go through the whole "folder structure"? (I am also thinking about our experience that things like cd and ls sometimes are super slow on object stores).

@will-moore
Member

I'm also a little hazy on object stores, but my impression is that all the 'paths' within a bucket are really just 'keys'. So I imagine they could be changed without moving the data on disk.
So, as @constantinpape said, I'm not sure there's really much difference between algorithm1/parameter_set1 and algorithm1_parameter_set1. I don't think you can browse to algorithm1/ on an object store.
This is why you need to specify all the paths to child objects in the group metadata.

However, since you won't always be working with object stores, allowing algorithm1/parameter_set1 could let you browse the data via algorithm1/ elsewhere, so I think this could be helpful. Also conceptually helpful to tokenise the path in this way. So I think we should allow / in the path names.

E.g. this could be valid:

{
  "lakes": ["day1/image1", "image2"],
  "mountains": ["image3", "day2/image4"]
}

Any reason not to allow this?
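As a sketch of what "tokenising the path" could buy a reader, the Python snippet below rebuilds a browsable tree from flat keys that contain `/`. The helper `build_tree` is hypothetical; it only illustrates that the same keys work as opaque object-store keys and as nested directories.

```python
from pathlib import PurePosixPath


def build_tree(paths):
    """Turn flat '/'-separated keys into a nested dict of path components."""
    tree = {}
    for path in paths:
        node = tree
        for part in PurePosixPath(path).parts:
            node = node.setdefault(part, {})
    return tree


# The image paths from the example above.
tree = build_tree(["day1/image1", "image2", "image3", "day2/image4"])
```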

@tischi

tischi commented Mar 5, 2021

@will-moore I think that's fine and a very good point. On a file system there are some benefits to this and on an object store there are no disadvantages.

@will-moore
Member

OK, so it looks like there's enough consensus here to start on something a bit more concrete.
Aiming for something that is just a list of images in its simplest form, but can include more metadata without a breaking change.

Option 1

A path/to/collection/ directory would include a .zattrs file that defines a "collection" because it MUST include the collection key, which MUST contain an images list:

{
  "collection": {
    "images": [
      {"path": "image1"},
      {"path": "dir_1/image2"}
    ]
  }
}

Each path is a path/to/directory containing an OME-Zarr image.
Each item in the images list MUST have a path, but MAY also have other attributes (TBD: e.g. id, name, timestamp, etc. Maybe even 'row': 0, 'column': 1, for a grid layout.) We should probably not allow any user-chosen key-value data here, since that could lead to breaking changes if we add keys to the spec. So maybe a properties: {} for user-defined metadata.

Option 2

An alternative is to use the "path" as the ID/key of each image. Any reason not to do this? (for labels-metadata we decided not to use an ID as key because the ID was a number which is not a valid key in JSON.). This protects from having 2 identical 'path' values which could be possible above.

{
  "collection": {
    "images": {
      "image1": {},    # empty if we don't have any other info
      "dir_1/image2": {"row": 0, "column": 1},
      "dir_2/image3": {"properties": {"rating": 5}}
    }
  }
}
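The difference between the two options can be sketched in Python as a conversion from the Option 1 list to the Option 2 dict. The helper `images_list_to_dict` is hypothetical; it shows that duplicate paths, which Option 1 permits, have to be rejected explicitly, whereas Option 2 rules them out by construction.

```python
def images_list_to_dict(images):
    """Convert an Option 1 style list into an Option 2 style dict keyed by path."""
    result = {}
    for entry in images:
        entry = dict(entry)  # copy so the caller's data is not modified
        path = entry.pop("path")
        if path in result:
            raise ValueError(f"duplicate image path: {path!r}")
        result[path] = entry
    return result


option1 = [
    {"path": "image1"},
    {"path": "dir_1/image2", "row": 0, "column": 1},
]
option2 = images_list_to_dict(option1)
```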

Other optional metadata

Within the collection, alongside images we could imagine other metadata such as layout:

  "layout": {
    "type": "grid",   # or 'auto-grid'
    "rows": [
      {"name": "A"}, {"name": "B"}, {"name": "C"}
    ],
    "columns": [
      {"name": "1"}, {"name": "2"}, {"name": "3"}
    ]
  },

and groupings. I guess we could use the path as the identifier of each image.

  "groups": {
    "lakes": ["image1", "dir_1/image2"],
    "mountains": ["dir_2/image3"]
  }

Which could mean that the "images" list/dict above is not needed (if we don't have any other metadata, and every image is in a group)? BUT it simplifies the spec to say that images MUST exist, and it's not hard to always add it.
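If `images` MUST exist, a validator can check every group against it. A minimal Python sketch, assuming the Option 2 dict form and the `groups` key from the mock-up above; the function name `missing_group_members` is hypothetical.

```python
def missing_group_members(collection):
    """Map each group name to the members that are not listed in `images`."""
    known = set(collection["images"])
    missing = {}
    for group, members in collection.get("groups", {}).items():
        unknown = [path for path in members if path not in known]
        if unknown:
            missing[group] = unknown
    return missing


collection = {
    "images": {"image1": {}, "dir_1/image2": {}},
    "groups": {
        "lakes": ["image1", "dir_1/image2"],
        "mountains": ["dir_2/image3"],  # deliberately absent from "images"
    },
}
problems = missing_group_members(collection)
```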

So, is everyone happy with Option 1 or Option 2? Or would like to suggest improvements to whichever is their favourite?
The other metadata can be decided later, but any suggestions welcome.

@joshmoore
Member

#31 (comment) So I imagine they could be changed without moving the data on disk.

No. To move objects around in object storage is always a copy/delete operation.

#31 (comment) I guess the question is whether, in practice, one would navigate the data via the "views" or via the folder structure.

It sounds like we're struggling with the semantics of having one "hard-coded" hierarchy beside the additional collections. In the ome-zarr-py implementation (and we could work to formalize this), there's a generator pattern. You start at the group you're given and then ask for what it "points to" and then process that. You will always start from a single group, so perhaps we're saying that you will only use the metadata of the given group for the objects that are generated.
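A rough Python sketch of that generator idea, using an in-memory dict in place of a real store. This is not the ome-zarr-py API: the function `walk`, the `store` layout and the use of the Option 2 `collection`/`images` keys are all assumptions made for illustration.

```python
def walk(store, path):
    """Yield (path, attrs) for the group at `path`, then for what it points to."""
    attrs = store[path]
    yield path, attrs
    for child in attrs.get("collection", {}).get("images", {}):
        child_path = f"{path}/{child}" if path else child
        if child_path in store:
            yield from walk(store, child_path)


# Hypothetical store: group path -> contents of that group's .zattrs.
store = {
    "": {"collection": {"images": {"image1": {}, "dir_1/image2": {}}}},
    "image1": {"multiscales": []},
    "dir_1/image2": {"multiscales": []},
}

visited = [path for path, _ in walk(store, "")]
```

Only the metadata of the group you start from decides which objects are generated; nothing is discovered by listing directories.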

@tischi

tischi commented Mar 5, 2021

It sounds like we're struggling with the semantics of having one "hard-coded" hierarchy beside the additional collections.

@joshmoore I agree. Maybe, for simplicity, we could restrict this issue to discussion of the additional collections, and make an extra issue about the "hard-coded" hierarchy?

A path/to/collection/ directory would include a .zattrs file

@will-moore Do I get it right that currently one such .zattrs file would contain only one collection? Meaning that to specify multiple additional collections we would need several path/to/collection/ .zattrs? I guess that's fine, but then I guess somewhere there should be information on how to find them?

So, is everyone happy with Option 1 or Option 2?

Option 2 looks more concise, so maybe slight preference for that one.

In terms of the layout, instead of specifying row and column, I think specifying a translation in physical coordinates may also be an option.

{
  "collection": {
    "images": {
      "image1": {"translate": [0,0,0], "name": "A"},   
      "dir_1/image2": {"translate": [10,0,0], "name": "B"},
      "dir_2/image3": {"translate": [20,0,0], "name": "C"}
    }
  }
}
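For comparison, a grid position can always be turned into such a translation, while the reverse is not generally possible. A small Python sketch, assuming the translation is ordered (x, y, z) and that all tiles have the same physical size; neither assumption is stated in the thread, and `grid_to_translate` is a hypothetical helper.

```python
def grid_to_translate(row, column, tile_size, spacing=0.0):
    """Convert a grid cell to an assumed (x, y, z) translation in physical units."""
    width, height = tile_size
    return [column * (width + spacing), row * (height + spacing), 0]


# Three tiles of assumed size 10 x 10 laid out in one row, as in the example above.
translations = [grid_to_translate(0, column, (10, 10)) for column in range(3)]
```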

@constantinpape
Contributor

I also prefer Option 2. And as @tischi brought up, I think it's important to think about how to map different collections for the same data (or subsets of it), either in the same .zattrs or distributed into different ones in some defined pattern.

@will-moore
Member

I was thinking the multiple groups above were different groupings of images in a collection. But I guess that's not enough, e.g. if each image has a different translate or other property in each collection.

So the simplest way to support multiple collections is to make the dict -> list, and plural:

# .zattrs
{
  "collections": [
    {
      "name": "first collection",
      "images": {
        "image1": {"translate": [0,0,0], "name": "A"},   
        "dir_1/image2": {"translate": [10,0,0], "name": "B"},
        "dir_2/image3": {"translate": [20,0,0], "name": "C"}
      }
    },
    {
      "images": {
        "dir_2/image3": {},
        "dir_2/image4": {}
      }
    }
  ]
}
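With a list of collections, the same image can appear more than once. A minimal Python sketch of looking up which collections contain a given image, assuming the structure above; `collections_containing` is a hypothetical helper, and it returns list indices because `name` is optional in the example.

```python
def collections_containing(zattrs, image_path):
    """Return the indices of all collections that list `image_path`."""
    return [
        index
        for index, collection in enumerate(zattrs["collections"])
        if image_path in collection["images"]
    ]


zattrs = {
    "collections": [
        {
            "name": "first collection",
            "images": {
                "image1": {"translate": [0, 0, 0], "name": "A"},
                "dir_1/image2": {"translate": [10, 0, 0], "name": "B"},
                "dir_2/image3": {"translate": [20, 0, 0], "name": "C"},
            },
        },
        {"images": {"dir_2/image3": {}, "dir_2/image4": {}}},
    ]
}

shared = collections_containing(zattrs, "dir_2/image3")
```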

@DragaDoncila
Author

Do I get it right that currently one such .zattrs file would contain only one collection? Meaning that to specify multiple additional collections we would need several path/to/collection/ .zattrs? I guess that's fine, but then I guess somewhere there should be information on how to find them?

I think if we want to easily support images being opened both as part of their collection and on their own then it would make sense to have each image as its own well-formed ome-zarr, including a .zattrs file? It would mean either duplication of some metadata, or a top level .zattrs which only contains the necessary information for traversing the collection i.e. the snippet @will-moore posted just above

@will-moore
Member

Yes, in the examples I've posted, there would be a full OME-Zarr in each of the images paths. E.g. dir_1/image2/ would contain .zattrs etc.

@joshmoore
Member

Just to clarify one issue that will become more important with Zarr V3, each of those paths contains an OME-Zarr image. In the future, there will be a root file which will define the entire OME-Zarr fileset.

@will-moore
Member

Taking the current HCS spec as an example, it is true that this doesn't allow an image to be in more than one collection (plate). The collection's metadata is one (or more) levels up from the Image or Well.

When I pass the single collection's path to a workflow, the workflow would not know (or would have to guess) where the "collections" metadata is stored

I'm not sure I understand this. The path/to/collection would point to the collections metadata at the same level in path/to/collection/.zattrs.

E.g. viewing https://hms-dbmi.github.io/vizarr/v0.1?source=https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.1/plates/2551.zarr that path to plate contains the plate metadata at https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.1/plates/2551.zarr/.zattrs

Also in the case of HCS, the hierarchies are quite well defined, so we can even start at a Well (and we know it's a well because the .zattrs has well data) and if we want plate metadata we can simply look in the parent directory.
E.g. we load the plate JSON data at https://hms-dbmi.github.io/vizarr/v0.1?source=https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.1/plates/2551.zarr/B/1
This may not be ideal, but I don't think it's a security issue and it's a balance between having all the metadata in one place or duplicating it at different levels.

In the case of a more generic Collection, as discussed above, it's true that for a given path/to/image/ we wouldn't know which parent directory defines the collection container and holds the collection metadata.
But if you're starting at a file path to the collections zarr, then that is not such a limitation.

@will-moore
Member

I can't find any info on JSON-schema validating references within JSON instances. The only use of references or ids is within a schema, to refer to other schemas. e.g. https://json-schema.org/understanding-json-schema/structuring.html#ref
Also see
openMetadataInitiative/openMINDS_core#256

@aeisenbarth

I'm not sure I understand this.

I was referring to the proposals above. A path to a Zarr subgroup with images of a single collection does not have the NGFF Collection metadata, so when reading it, it looks like a plain (non-NGFF) Zarr group containing NGFF images. One could check whether the parent directory is a Zarr group and has NGFF Collections metadata, then iterate over all collections to find the one containing these images. This works if I introduce restrictions on my Zarr hierarchy. But for the general case you would have to walk up to the file system root and check every parent because image paths can have several nesting levels.

In short, "a Zarr collection of images" cannot be expressed as a path/URL (but as path to collections + collection name).
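The parent walk described here can be sketched in Python as follows. It uses an in-memory dict instead of real Zarr groups, and the function `find_collection` as well as the plural `collections` list form are assumptions for illustration.

```python
from pathlib import PurePosixPath


def find_collection(attrs_by_group, image_path):
    """Walk up from an image and return (group path, collection name) if found."""
    image = PurePosixPath(image_path)
    for parent in image.parents:  # nearest parent first, up to the root
        attrs = attrs_by_group.get(str(parent), {})
        relative = str(image.relative_to(parent))
        for collection in attrs.get("collections", []):
            if relative in collection["images"]:
                return str(parent), collection.get("name")
    return None


# Hypothetical mapping: group path -> contents of that group's .zattrs.
attrs_by_group = {
    "data.zarr": {
        "collections": [
            {"name": "first collection", "images": {"dir_2/image3": {}}}
        ]
    }
}

found = find_collection(attrs_by_group, "data.zarr/dir_2/image3")
```

Every ancestor has to be checked because image paths may be nested, and the result is a pair (path to collections, collection name) rather than a single path.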

@will-moore
Member

At ome/omero-cli-zarr#88 there is code for exporting a Dataset of Images, according to one of the Collection specs discussed above. And there is an example Dataset hosted at https://minio-dev.openmicroscopy.org/idr/v0.3/datasets/idr0043/13901.zarr/.zattrs

That URL defines a collection of Images, listed in the .zattrs where we could also include the name and other metadata (this example doesn't).

The .zattrs is simple, and we have since decided to use a different structure, but it doesn't look like a plain (non-NGFF) Zarr group because it contains the "collection" dictionary:

{
    "collection": {
        "images": {
            "165383_A_1_1.tif": {},
            "165383_A_1_2.tif": {},
            "165383_A_1_4.tif": {},
            ...
        }
    }
}

That collection can be viewed in a vizarr PR deployed at
https://deploy-preview-124--vizarr.netlify.app/?source=https://minio-dev.openmicroscopy.org/idr/v0.3/datasets/idr0043/13901.zarr (or see screenshot on the PR above).

Apologies if I'm getting my Zarr terminology wrong, but is that not an example of a path to a Zarr group (or subgroup?) with (limited) Collection metadata and no searching of parent directories needed to find the images?

@imagesc-bot

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/next-call-on-next-gen-bioimaging-data-tools-2022-01-27/60885/11

@imagesc-bot

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/collections-in-ome-ngff/63656/1

@imagesc-bot

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/collections-in-ome-ngff/63656/5

@joshmoore joshmoore mentioned this issue Feb 25, 2022
2 tasks
@normanrz
Contributor

normanrz commented Mar 23, 2022

Copying from image.sc:

We are in the process of integrating OME-NGFF in webKnossos. webKnossos has the concept of “Datasets” to organize images. A “Dataset” consists of multiple multi-scale layers, which can be a variety of images including raw EM data, multi-channel images, instance segmentations, ML probability maps etc. All layers in a Dataset have a shared coordinate system and somewhat belong together in a semantic sense.

With regards to this proposal, I was wondering if there should be different types of collections (those with shared coordinate spaces and ones with separate spaces)?

@imagesc-bot

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/intermission-ome-ngff-0-4-1-bioformats2raw-0-5-0-et-al/72214/1

@imagesc-bot

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/cli-programmatic-browsing-of-ome-zarr-hierarchies-on-idr/75907/7

@tischi

tischi commented Mar 21, 2023

@normanrz @joshmoore @will-moore

  1. Should we consider meeting to work on this?
  2. I wonder whether such a collection spec is OME-Zarr specific or whether it could be on top of it in a sense that also a few important other image data formats could be handled?

@normanrz
Contributor

  1. Should we consider meeting to work on this?

Yes!

2. I wonder whether such a collection spec is OME-Zarr specific or whether it could be on top of it in a sense that also a few important other image data formats could be handled?

I think it fits nicely in the OME-Zarr world. For our purposes, there is no requirement to include other image formats.

@tischi

tischi commented Mar 21, 2023

For me, having a collection spec for a bunch of TIFF files also would be very handy.
We can see whether the collection spec will need to refer to OME-Zarr specifics.
So far all the JSON snippets that have been posted above seem to also work for other image files...

@jluethi
Contributor

jluethi commented Mar 21, 2023

I'd also be very interested to see where the collection work is going! We're just interested in OME-Zarr images and use the HCS spec heavily. But would be interesting to hear how this can generalize to other collections and how metadata about the contents of the collection is represented :)

@will-moore
Member

I think we got bogged down previously because we tried to do too much, so let's see if we can list the requirements that we all agree on before we try to find solutions. Here's a possible list, and I'm only selecting 2 items to start. Maybe others could copy this list and select items or add their own items?

@normanrz
Contributor

normanrz commented Mar 21, 2023

What it could look like:

{
  "collection": {
    "images": [{
      "path": "em", # relative
      "label": "EM",
      "rendering": {
        "visible": true,
        "color": {
          "format": "hex",
          "type": "rgb",
          "value": "000000"
        },
        "blending": "normal",
        "opacity": 1
      }
    }, {
      "path": "s3://bucketname/prefix/image", # absolute, fsspec-style
      "label": "LM",
      "rendering": {
        "visible": false,
        "color": {
          "format": "hex",
          "type": "rgb",
          "value": "00FF00"
        },
        "blending": "normal",
        "opacity": 0.7
      }
    }],
    "attributes": {
      "name": "Fancy data",
      "description": "Lorem ipsum"
    }
  }
}
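Mixing relative and absolute paths means a reader has to resolve each entry against the collection's own location. A minimal Python sketch, assuming that any path with a URL scheme or a leading `/` is absolute; the function `resolve_image_path` and the example collection URL are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def resolve_image_path(collection_url, path):
    """Resolve an image path from the collection metadata to a full location."""
    if urlsplit(path).scheme or path.startswith("/"):
        return path  # already absolute, e.g. an fsspec-style s3:// URL
    return posixpath.join(collection_url.rstrip("/"), path)


# "https://example.org/data.zarr" is a placeholder location for the collection.
base = "https://example.org/data.zarr"
relative = resolve_image_path(base, "em")
absolute = resolve_image_path(base, "s3://bucketname/prefix/image")
```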

@tischi

tischi commented Mar 21, 2023

How do I copy the list? :-) If someone tells me I am happy to reformat what I write here:

For my current work I (and I think @jluethi may be interested) would need one more thing:

Some way to specify the shape and pixel type metadata for a collection of images, such that I don't need to open all of them during initialisation (use case is similar to HCS where we have hundreds of images with identical metadata that we want to open in a "grid view"); essentially I want to be able to express that "here is a collection of images and they all have the same shape (dimensions) and pixel type, namely ...".
I realise now that this may be covered by: `[x] A way to store extra metadata on the collection`.

  • Attach rendering metadata to image in the collection
    • For me it would be important to express that all images in the collection should have the same rendering settings (same use-case as above).

@jluethi
Contributor

jluethi commented Mar 21, 2023

Some way to specify the shape and pixel type metadata for a collection of images, such that I don't need to open all of them during initialisation (use case is similar to HCS where we have hundreds of images with identical metadata that we want to open in a "grid view")

it would be important to express that all images in the collection should have the same rendering settings (same use-case as above).

In the use case where we assume all items of the collection (e.g. all wells) have the same metadata, I don't see a huge difference about whether I load the metadata from the collection level or from a single (randomly chosen? alphabetically first?) image. A simple flag may be sufficient then.
It gets most interesting to us when we know that the collection items (e.g. different wells) may have images of different shapes. This we'd need to know upon plate loading (to initialize the pyramids correctly), and loading metadata from hundreds of images becomes a bottleneck (see the discussion in this PR to add support to ome-zarr-py for loading HCS plates with varying well sizes: ome/ome-zarr-py#241).

@aeisenbarth

We had a need for collections and did a most minimal implementation of this proposal. We were using it for grouping images associated with each well (collection A1 → {modality1, modality2…}). Now with the plate spec not being marked transitional, we are contemplating moving to that (storing each modality as a separate plate), because it is more standards-compliant than our own extension, we practically don't encounter plate layouts that are not row/column based, and we don't need assumptions that each collection actually contains all expected images.

Things to consider:

  • Do you want to be able to reference a collection like a file path?
  • It is practical that images are still self-contained and complete. When attaching metadata at the collection level, it should probably only be a summary/copy of what is at the image level, so that reading a child in the hierarchy does not require looking up its parent (if it has one).
  • Maybe most flexible is a design where the optional image properties follow the NGFF image spec. That way you could have overrides, let's say when loading image1 as part of this collection, use a different coordinate transformation.

@jluethi, couldn't you define (sub)collections based on same metadata? Unless the partitioning into subcollections is different for every metadata property.
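
A minimal sketch of that partitioning, assuming per-image metadata is already available as a mapping of path to a dict with `shape` and `dtype` entries (a hypothetical structure, not something the spec defines):

```python
from collections import defaultdict


def partition_by_metadata(images, keys=("shape", "dtype")):
    """Group image paths whose metadata agrees on every key in `keys`."""
    groups = defaultdict(list)
    for path, meta in images.items():
        # Lists (e.g. shapes parsed from JSON) are not hashable, so use tuples.
        signature = tuple(
            tuple(meta[k]) if isinstance(meta[k], list) else meta[k] for k in keys
        )
        groups[signature].append(path)
    return dict(groups)


images = {
    "A/1": {"shape": [1, 2160, 2560], "dtype": "uint16"},
    "A/2": {"shape": [1, 2160, 2560], "dtype": "uint16"},
    "B/1": {"shape": [1, 1080, 1280], "dtype": "uint16"},
}
subcollections = partition_by_metadata(images)
by_dtype_only = partition_by_metadata(images, keys=("dtype",))
```

Here `subcollections` has two groups while `by_dtype_only` has one, which illustrates the caveat: each choice of properties can yield a different partitioning.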

@will-moore
Member

@tischi here's the markdown for the list to copy/paste

 - [x] Simple `collection` that lists child images: e.g. https://github.com/ome/ngff/issues/31#issuecomment-1011869652
 - [x] A way to store extra metadata on the collection: e.g. `properties: {}`
 - [ ] Image in multiple collections (needs absolute paths or `../../relative/path/to/sibling/collection`)
 - [ ] Replacement for Plate: e.g. https://github.com/ome/ngff/issues/31#issuecomment-802064440
 - [ ] Types, e.g. `"@type": "ImageCollection"` subclassing `Collection` etc., e.g. https://github.com/ome/ngff/issues/31#issuecomment-982629030 (questions remain)
 - [ ] Different collection types/metadata for shared coordinates spaces (views) and separate spaces
 - [ ] Multiple collections in a single file? e.g. https://github.com/ome/ngff/issues/31#issuecomment-792582677
 - [ ] Collections can contain collections? 

@will-moore
Member

@normanrz Do you need to specify rendering settings for each image (rather than a rendering for the whole collection)? Is this just to save time, to avoid loading the settings from each image.zarr/.zattrs?

I like @jluethi's idea of flags to say e.g.

{
  "allSameShape": true,
  "allSameDataType": true,
  "allSameRendering": true
}

This is very minimal and once we know that, we can load shape, dtype, rendering settings from the first image. But it might be too restrictive if e.g. you want different settings for the plate than for individual images.
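
A sketch of how a reader might act on such flags, assuming the three flag names above and a plain list of image paths (the function name is hypothetical):

```python
FLAGS = ("allSameShape", "allSameDataType", "allSameRendering")


def images_to_inspect(collection_attrs, image_paths):
    """Return the image paths whose metadata a reader still has to open."""
    paths = list(image_paths)
    # A missing flag is treated as False: nothing is assumed about uniformity.
    if paths and all(collection_attrs.get(flag, False) for flag in FLAGS):
        return paths[:1]
    return paths


attrs = {"allSameShape": True, "allSameDataType": True, "allSameRendering": True}
wells = ["A/1", "A/2", "B/1"]
only_first = images_to_inspect(attrs, wells)
everything = images_to_inspect({"allSameShape": True}, wells)
```

When every flag is set, only the first image is read; otherwise the reader falls back to opening all of them.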

Making these "image properties follow the NGFF image spec" as @aeisenbarth suggested makes sense when possible (e.g. name and rendering settings that are in the NGFF multiscales metadata), but might be tricky for others that are in the Zarr .zarray data (dtype, shape).

@tischi

tischi commented Mar 22, 2023

Use-case wise I realised that for me it would also be interesting to know whether

  • all images are 2D
  • all images have the same spatial extents (but may have different numbers of time points)

I wonder now, if this collection.json is mainly meant to be read and written by a computer, whether it would be easier and more generic to simply have the option to add to each image all sorts of metadata from the .zarray files, such as dtype, shape, multiscales. Since this is in a single file, it would still be very fast for a computer to figure out whether "all are 2D" or "all have the same shape" and so on. I think the same argument can be made for rendering settings.

What about the following:

We first figure out all properties that we would like to be able to add to each individual image of the collection?

Then, as a second step, we discuss whether it is worth having a common_properties field where we can say things that apply to all images?
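
To make the second step concrete, here is a sketch that derives a `common_properties` block from per-image entries. The layout of the per-image entries is an assumption for illustration; only the `common_properties` name comes from the text above.

```python
def split_common_properties(per_image):
    """Split per-image metadata into shared values and per-image remainders."""
    metas = list(per_image.values())
    if not metas:
        return {}, {}
    # A property is common if every image has the key with an equal value.
    common = {
        key: value
        for key, value in metas[0].items()
        if all(key in m and m[key] == value for m in metas[1:])
    }
    remainder = {
        path: {k: v for k, v in meta.items() if k not in common}
        for path, meta in per_image.items()
    }
    return common, remainder


per_image = {
    "img1": {"dtype": "uint16", "shape": [10, 512, 512]},
    "img2": {"dtype": "uint16", "shape": [5, 512, 512]},
}
common, remainder = split_common_properties(per_image)
```

In this example `dtype` moves to the common block while `shape` stays per image.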

@normanrz
Contributor

@normanrz Do you need to specify rendering settings for each image (rather than a rendering for the whole collection)? Is this just to save time, to avoid loading the settings from each image.zarr/.zattrs?

This came up in the community call: Think of a collection as the definition of the view state in a visualization tool. Now, the render settings feel better placed in the view than with the individual images because you may want to have the same image in different views with different settings. For example, you may want to choose different colors for an image in different views. visible also only really makes sense in the context of a view. So to clarify my proposal: move all rendering settings from the image into the collection.
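
A small sketch of what that could look like, written as Python dicts. All field names (`images`, `color`, `visible`) and paths are placeholders, not a proposed schema:

```python
# Two collections (views) reference the same image with different settings.
collections = {
    "overview": {"images": {"nuclei.zarr": {"color": "0000FF", "visible": True}}},
    "figure_2": {"images": {"nuclei.zarr": {"color": "FFFFFF", "visible": False}}},
}


def rendering_for(collection_name, image_path):
    """Look up rendering settings in the view, not in the image itself."""
    return collections[collection_name]["images"][image_path]


in_overview = rendering_for("overview", "nuclei.zarr")
in_figure = rendering_for("figure_2", "nuclei.zarr")
```

The image itself carries no rendering settings in this layout, so the two views cannot conflict.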

Making these "image properties follow the NGFF image spec" as @aeisenbarth suggested makes sense when possible (e.g. name and rendering settings that are in the NGFF multiscales metadata), but might be tricky for others that are in the Zarr .zarray data (dtype, shape).

I want to mention that the idea of consolidated metadata has come up on the Zarr level quite a few times. This would mean that the zarr.json of a group could store the zarr.json contents of descendant groups and arrays. This is primarily a performance optimization to save IO.
Even though there is no formal proposal for this yet, I don't think we should reinvent that on the OME level, at least for core metadata (shape, dtype etc.). The OME-Zarr metadata would still be duplicated for each image in the consolidated json. I don't think that is a big problem, because I assume for 100s of images it should still be on the order of 1MB. Also, the json probably compresses very well.

Storing information like

{
  "allSameShape": true,
  "allSameDataType": true,
  "allSameRendering": true
}

or

  • all images are 2D
  • all images have the same spatial extents (but may have different numbers of time points)

seems like redundant aggregations that could easily be derived by the visualization or processing software (given consolidated metadata).
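
For instance, a sketch of such a derivation, assuming the software has already gathered per-array `shape`, `dtype` and axis names into one mapping (the mapping layout is hypothetical, and axis names would come from the OME multiscales metadata rather than from the array metadata):

```python
def derive_aggregates(arrays):
    """Compute collection-wide summaries from per-array metadata."""

    def spatial_extent(meta):
        sizes = dict(zip(meta["axes"], meta["shape"]))
        # A missing spatial axis is treated as having size 1.
        return tuple(sizes.get(axis, 1) for axis in ("z", "y", "x"))

    extents = [spatial_extent(m) for m in arrays.values()]
    return {
        "allSameShape": len({tuple(m["shape"]) for m in arrays.values()}) <= 1,
        "allSameDataType": len({m["dtype"] for m in arrays.values()}) <= 1,
        "all2D": all(extent[0] == 1 for extent in extents),
        "sameSpatialExtent": len(set(extents)) <= 1,
    }


arrays = {
    "a": {"shape": [10, 512, 512], "dtype": "uint16", "axes": ["t", "y", "x"]},
    "b": {"shape": [3, 512, 512], "dtype": "uint16", "axes": ["t", "y", "x"]},
}
summary = derive_aggregates(arrays)
```

The two example images differ only in their number of time points, so the shapes differ while the spatial extents match.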

@imagesc-bot

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/faim-hcs-functions-to-work-with-hcs-data/78868/20

@imagesc-bot

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/save-a-single-labels-dataset-into-an-ome-zarr/93505/39
