Support exporting dataset #83

Closed · wants to merge 6 commits

Conversation

dominikl
Member

Attempt to add 'export dataset' functionality. Exports all images of a dataset into a directory with the dataset ID as its name. Export of the images happens in parallel using dask.
But something's not quite right: after running for a while you get an out-of-memory (Java heap space) error on the server side:

...
File "/home/dlindner/miniconda3/lib/python3.9/site-packages/omero_api_RawPixelsStore_ice.py", line 1199, in getPlane
    return _M_omero.api.RawPixelsStore._op_getPlane.invoke(self, ((z, c, t), _ctx))
omero.InternalException: exception ::omero::InternalException
{
    serverStackTrace = ome.conditions.InternalException:  Wrapped Exception: (java.lang.OutOfMemoryError):
Java heap space
        at loci.formats.tiff.TiffParser.getSamples(TiffParser.java:1030)

Is that an error on the client side (session not closed, etc.) or a server-side issue? Any ideas @sbesson @joshmoore ?
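
For context, a rough sketch (hypothetical names, not the exact PR code) of the kind of dask-based parallel plane fetch attempted here, assuming conn is a BlitzGateway connection:

import dask
import numpy as np

def fetch_plane(conn, image_id, zct):
    # One API call per plane; each call goes through RawPixelsStore on the server.
    image = conn.getObject("Image", image_id)
    pixels = image.getPrimaryPixels()
    z, c, t = zct
    return pixels.getPlane(z, c, t)

def export_image(conn, image_id):
    image = conn.getObject("Image", image_id)
    zct_list = [(z, c, t)
                for t in range(image.getSizeT())
                for c in range(image.getSizeC())
                for z in range(image.getSizeZ())]
    delayed = [dask.delayed(fetch_plane)(conn, image_id, zct) for zct in zct_list]
    planes = dask.compute(*delayed)
    return np.stack(planes)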

@joshmoore
Member

I don't see a missed close() or similar, so perhaps it's just the total number of calls to

planes = pixels.getPlanes(zct_list)
?

@will-moore
Member

@dominikl That's great. I wonder if you could add .zattrs into the top-level Dataset dir, to list the paths to the images.

"collection": {
  "images": {
      "image1": {}, "image2": {}, "image3": {}, "image4": {}, "image5": {},
  },

I think that represents the summary of the discussion at ome/ngff#31

We probably want to consider some different naming of Dataset and Images (not just ID.zarr)?
That's not part of the spec, but if you're browsing the "collection" in another client, you probably want to know the Image names at the top, instead of having to open the .zattrs of each Image to get the names.

I guess we could do:

"collection": {
"images": {
"123.zarr": {"name": "image1.tiff"}, "456.zarr": {"name": "image2.tiff"},
},

but that's less nice. Using IDs as paths on the file-system is also not so human readable.

I wonder if we want to convert e.g. "image1.tiff" into "image1" since it's not really a Tiff anymore!?
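
For illustration, a minimal sketch of writing such a top-level .zattrs with zarr-python (the "collection" layout is still under discussion, and the paths and names here are hypothetical):

import zarr

# One top-level Zarr group per Dataset, one subgroup per Image.
root = zarr.open_group("dataset_123.zarr", mode="a")
root.attrs["collection"] = {
    "images": {
        "456.zarr": {"name": "image1.tiff"},
        "789.zarr": {"name": "image2.tiff"},
    }
}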

@dominikl
Member Author

Yes, I guess in the end the dataset has to be a "zarr" itself with the appropriate metadata, instead of just a directory. This PR is more of a draft, trying to export images in parallel, which unfortunately doesn't work very well and needs some more debugging. I agree, names are nicer than meaningless IDs, but they lead to conflicts.

@sbesson
Member

sbesson commented Sep 17, 2021

Even without the addition of .zattrs while that discussion is ongoing, I would say structuring the top-level directory as a Zarr group should already allow the different images to be grouped in agreement with the Zarr specification.

Incidentally, this raises the question of whether ome_zarr info should be updated to detect and report on this type of hierarchy. Another use case would be the Zarr groups created by bioformats2raw.
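
As a rough sketch of what such detection could look like (not an existing ome_zarr feature; plain zarr-python with hypothetical function and path names):

import zarr

def list_image_groups(path):
    # Report subgroups of a top-level group that carry OME-Zarr image metadata.
    root = zarr.open_group(path, mode="r")
    return [name for name, group in root.groups() if "multiscales" in group.attrs]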

@dominikl
Member Author

Tested locally; it doesn't work very well either. After a while the export dies (even with only 3 threads) with:

ERROR:omero.gateway:Failed to getPlane() or getTile() from rawPixelsStore
Traceback (most recent call last):
  File "/Users/dom/miniconda3/envs/zarr/lib/python3.9/site-packages/omero/gateway/__init__.py", line 7466, in getTiles
    rawPlane = rawPixelsStore.getPlane(z, c, t)
  File "/Users/dom/miniconda3/envs/zarr/lib/python3.9/site-packages/omero/gateway/__init__.py", line 4796, in __call__
    return self.handle_exception(e, *args, **kwargs)
  File "/Users/dom/miniconda3/envs/zarr/lib/python3.9/site-packages/omero/gateway/__init__.py", line 4793, in __call__
    return self.f(*args, **kwargs)
  File "/Users/dom/miniconda3/envs/zarr/lib/python3.9/site-packages/omero_api_RawPixelsStore_ice.py", line 1199, in getPlane
    return _M_omero.api.RawPixelsStore._op_getPlane.invoke(self, ((z, c, t), _ctx))
Ice.UnknownLocalException: exception ::Ice::UnknownLocalException
{
    unknown = ../../include/Ice/BasicStream.h:307: Ice::EncapsulationException:
protocol error: illegal encapsulation
  0 IceUtil::Exception::Exception(char const*, int) in /opt/ice-3.6.5/bin/../lib64/libIceUtil.so.36
  1 Ice::LocalException::LocalException(char const*, int) in /opt/ice-3.6.5/bin/../lib64/libIce.so.36
...

Does that mean the RawPixelsStore simply can't be called in parallel? Or is the issue the parallel omero.cli.cli_login() calls used to create sessions, and if so, is there an alternative?

@joshmoore
Member

Does that mean the RawPixelsStore simply can't be called in parallel?

A single RawPixelsStore should block multiple access, i.e. it should be thread-safe but not concurrent. Can you find the underlying exception that was thrown?

@dominikl
Member Author

It was actually an out-of-heap-space error too. I ran it with just one thread and got it too:

2021-09-22 11:54:35,192 ERROR [            ome.services.throttling.Task] (l.Server-0) Failed to invoke: ome.services.throttling.Callback@4c7969ed (omero.api._AMD_RawPixelsStore_getPlane@47a90f73 )
java.lang.reflect.InvocationTargetException: null
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:na]
...
Caused by: java.lang.OutOfMemoryError: Java heap space

Is there a memory leak somewhere on the server side when repeatedly calling RawPixelsStore.getPlane()?

@joshmoore
Member

The only route I know of is if the pixel store doesn't get closed and more are created.

@dominikl
Member Author

I think the problem is that I create an omero.gateway.BlitzGateway(client_obj=c.get_client()) for each image but don't close it (so probably the pixel store isn't closed either). However, I can't close it, as that also kills the session. Is there a way to close the BlitzGateway without closing the session @will-moore ?

@will-moore
Member

add_image() uses BlitzGateway's getPlanes() -> getTiles(), and this should close the raw pixels store after the last plane is requested: https://github.com/ome/omero-py/blob/e99bcbbdfdbfffdff9beaa954b5bb2143f23effa/src/omero/gateway/__init__.py#L7553

You could also try conn.close(False) which should close all the various proxy services but not kill the session: https://github.com/ome/omero-py/blob/e99bcbbdfdbfffdff9beaa954b5bb2143f23effa/src/omero/gateway/__init__.py#L1965

@dominikl
Member Author

Thanks. I tried conn.close(False), which seemed to work, but after a while it also closes the session. I have found a way now that works, but the RawPixelsStore has to be closed explicitly. There seems to be a leak somewhere in getPlanes()/getTiles(), but I couldn't see it; the methods look fine to me.
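
Roughly, the explicit-close pattern is (a sketch, not the exact PR code; pixels_id, z, c, t are placeholders):

# Create the store, fetch what is needed, and always close it so the
# server-side service is released even if an error occurs mid-export.
rps = conn.createRawPixelsStore()
try:
    rps.setPixelsId(pixels_id, True, conn.SERVICE_OPTS)
    raw = rps.getPlane(z, c, t)  # raw bytes; still needs unpacking per pixel type
finally:
    rps.close()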

@will-moore
Member

I wonder if some images aren't getting all their planes downloaded, so their pixel stores don't get closed.
It might be interesting to see if conn.c.getStatefulServices() gives you anything when you're closing the pixels store, and to log e.g. the image name to check whether the image got downloaded OK.
Also see https://github.com/ome/omero-py/blob/master/src/omero/gateway/__init__.py#L1650 where conn._assert_unregistered() does this.
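
Something like this rough check (assuming conn is the BlitzGateway and image is the current image wrapper) would show whether stateful services are piling up:

# Log any server-side stateful services still registered for this session.
leftover = conn.c.getStatefulServices()
if leftover:
    print("%s: %d stateful services still open" % (image.getName(), len(leftover)))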

NB: We discussed an alternative export strategy (when you don't have access to the binary repo):

  • download all the original files locally, using the CLI omero download command (e.g. into a temp dir)
  • then run bioformats2raw on this dir

This means that the server is doing much less work, and maybe more parallelisation is possible; a rough sketch follows below.
I could start looking at that approach if it sounds like a good idea (and if you don't want to take it on)?
Potential downside: it's possible to download more data than you want, e.g. for MIF data where some images are in a different Dataset, you'd still download them.
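
A rough sketch of that strategy (hypothetical IDs and paths; it assumes the omero CLI and bioformats2raw are on the PATH, and that omero download is called once per Image):

import subprocess
import tempfile
from pathlib import Path

image_ids = [456, 789]  # would come from the Dataset in practice

with tempfile.TemporaryDirectory() as tmp:
    # 1. download the original files for each image into a temp dir
    for iid in image_ids:
        subprocess.run(["omero", "download", "Image:%d" % iid, tmp], check=True)
    # 2. convert each downloaded file to OME-Zarr; these conversions can run
    #    in parallel since the OMERO server is no longer involved
    for path in Path(tmp).iterdir():
        subprocess.run(["bioformats2raw", str(path), "%s.zarr" % path.stem], check=True)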

@dominikl
Member Author

👍 Would be good to have different strategies.
So far there would be three different ways to do it:

  • This PR: Retrieve the pixel data via the API and write the Zarr. Might be the fastest way for smallish images with just a few planes, with a limited amount of parallelisation.
  • Your suggestion: Download the original files and run bioformats2raw. That should definitely be faster for larger images and/or images with many planes. And as you said, it doesn't need access to the managed repo either.
  • @sbesson 's suggestion: Parallelise the current bioformats2raw export with direct access to the managed repository. Although the bioformats2raw call has quite some overhead (e.g. a single HPA image export is slower using bioformats2raw than getting the pixels via the API), this is probably the fastest way. But for smallish images it needs a certain minimum amount of parallelisation to outweigh the bioformats2raw overhead.

@will-moore
Member

So, with a bit more testing, I realise that my approach of blindly using the Image name for the Zarr image group causes issues with various characters, such as / (obviously) but also [ and ], I think.
It needs a thorough test of all special characters to see what needs replacing or escaping. Happy to look into this...?

@dominikl
Member Author

dominikl commented Oct 4, 2021

Ah true, you'd have to escape all POSIX and Windows special characters, which can be tricky. How about using UUIDs as directory names (instead of the arbitrary OMERO IDs) and putting the names into the metadata? That means one always has to look up the metadata to make sense of the directory structure, but it saves the hassle of handling all sorts of special characters.
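
A minimal sketch of that idea (hypothetical layout; the "collection" key and the names used here are not part of any spec):

import uuid
import zarr

def add_image_group(root, image_name):
    # Use a random, filesystem-safe directory name and keep the human-readable
    # (possibly special-character-laden) name in the parent group's metadata.
    safe_name = str(uuid.uuid4())
    group = root.require_group(safe_name)
    collection = dict(root.attrs.get("collection", {}))
    images = dict(collection.get("images", {}))
    images[safe_name] = {"name": image_name}
    root.attrs["collection"] = {"images": images}  # reassign so zarr persists the change
    return group

root = zarr.open_group("dataset.zarr", mode="a")
add_image_group(root, "image [test] / 1.tiff")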

@dominikl
Member Author

dominikl commented Oct 5, 2021

Superseded by #88

@dominikl dominikl closed this Oct 5, 2021
@will-moore will-moore mentioned this pull request Oct 5, 2021