Proposal: reduce number of top level packages #331

Closed
Kirill888 opened this issue Aug 25, 2021 · 9 comments
@Kirill888 (Member) commented Aug 25, 2021

Problem

This repository grew "organically" for a while, and some of the earlier experiments have proven less relevant than others. The side effect of this growth is a large number of packages and namespaces. Currently there is a one-to-one mapping between namespace and package: odc.algo.* is shipped by odc-algo, and odc-algo ships only files in odc/algo. While this is a reasonable and clean relation, it does make for a larger number of packages. On one hand this allows for higher granularity when declaring dependencies; on the other, packages are not "free". They add to CI delays, make renaming and moving code around harder, and make publishing to and managing PyPI/conda harder too, as more secrets need to be managed, maintainers need to be added to more projects, etc. (I spent some time adding @GypsyBojangles as an owner to every PyPI project pushed from this repo, and will need to do a similar thing for generating publishing tokens.)

Stocktake

First, let's decide which things we're definitely keeping as is. I'd say that apps can stay as they are. As far as user-facing libraries go, I have this list:

  • odc.algo - mostly xarray + Dask tools; example: xr_reproject
  • odc.ui - tools for visualizations in Jupyter
  • odc.stac - (previously odc.index) STAC and missing datacube index utilities
  • odc.stats - large-scale data processing libs/apps (work in progress)
  • odc.dscache - used by odc.stats, but has other possible future use cases (odc index export/import, for example)

Then we have "cloud IO helper libs" that are mostly used via apps or other higher-level libs, and not directly by users.

  • odc.aws - AWS S3 and SQS
  • odc.azure - Azure blob storage
  • odc.thredds - crawling THREDDS
  • odc.aio - AWS S3, but async; has an annoying aiobotocore dependency

The reason these are all separate is the dependencies they pull in. For example, odc.aio depends on aiobotocore, which is really challenging to install in the presence of dependencies on boto* libraries. One option is to put them all into one package, say odc-cloud, and use feature flags to enable/disable features, so instead of depending on odc-azure one would declare a dependency on odc-cloud[AZURE].
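As a rough sketch of how those feature flags could be wired up with setuptools extras_require (the odc-cloud name is from this proposal; the exact dependency lists here are illustrative assumptions, not a final layout):

```python
# setup.py for a hypothetical merged odc-cloud package: the core install
# pulls only the AWS bits; everything else hides behind an extra.
from setuptools import setup, find_packages

setup(
    name="odc-cloud",
    packages=find_packages(),
    install_requires=["botocore"],                    # odc.aws core
    extras_require={
        "AZURE": ["azure-storage-blob"],              # enables odc.azure
        "THREDDS": ["thredds-crawler", "requests"],   # enables odc.thredds
        "ASYNC": ["aiobotocore"],                     # enables odc.aio
    },
)
```

A consumer would then declare a dependency on odc-cloud[AZURE] (or odc-cloud[ASYNC], etc.) instead of on a separate package.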

Finally there is an odd bunch of libs

  • odc.io - poor name; used by CLI apps for various unrelated text-processing helpers
  • odc.ppt - "parallel processing tools"; some generic Future object handling (not used) and an Async->Thread adapter
    • only used by odc.aio
  • odc.dtools - not used, mostly moved to datacube; used to have "rasterio environment activation/configuration" helpers
  • odc.geom - not used, mostly moved to datacube; still has some unfinished work that might be of use later on

Immediate Actions

  • Dissolve odc.ppt by moving AsyncThread into odc.aio and abandoning the rest
  • Dissolve odc.dtools, possibly moving some things into odc.algo (and adding tests)
  • Remove dead code from other libs
  • Decide on a course of action for "cloud libs"
@alexgleith (Contributor)

I'm not addressing your post quite yet, but will have a read and do so.

But relatedly, I would like to refactor s3-to-dc and, I guess, s3-find to use threading instead of async, because I think performance will be the same, the aio code is slow to work with, and it's a frustrating dependency. Any thoughts on this?
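For what it's worth, a minimal sketch of what the threaded version might look like, assuming plain boto3 and a thread pool (fetch_all is an illustrative name, not the actual s3-find internals):

```python
# Threaded alternative to the async fetcher: list keys under a prefix,
# then download the objects concurrently from a pool of worker threads.
from concurrent.futures import ThreadPoolExecutor

import boto3

def fetch_all(bucket, prefix, max_workers=32):
    s3 = boto3.client("s3")  # boto3 clients are safe to share across threads
    paginator = s3.get_paginator("list_objects_v2")
    keys = [obj["Key"]
            for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
            for obj in page.get("Contents", [])]

    def fetch(key):
        return key, s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        yield from pool.map(fetch, keys)
```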

@alexgleith (Contributor)

Another quick point: we could combine the two apps installs; cloud and dc_tools are not that far apart.

@Kirill888 (Member, Author)

> Another quick point: we could combine the two apps installs; cloud and dc_tools are not that far apart.

cloud and dc_tools were separate because cloud was independent of the datacube dependency, and that was important at the time. I don't mind merging the two now. We can always move s3-find and s3-to-tar into a lib if need be.

> But relatedly, I would like to refactor s3-to-dc and, I guess, s3-find to use threading instead of async, because I think performance will be the same, the aio code is slow to work with, and it's a frustrating dependency. Any thoughts on this?

I do agree that odc.aio.S3Fetcher needs more love, and not just swapping out async for threads but a better bucket-listing strategy; that's a discussion for a different issue, though. I do believe that async CAN offer better performance, especially on lower-end machines, but I also agree that the aiobotocore dependency is just too much. I was thinking of using plain aiohttp instead, and I even have code for signing requests here:

```python
def s3_get_object_request_maker(region_name=None, credentials=None, ssl=True):
    ...
```
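For illustration, here's a minimal sketch of the same idea: sign the request with botocore's SigV4 machinery, then send it with plain aiohttp (an assumed approach, not the implementation behind that signature):

```python
# Sign an S3 GET with botocore, then send it with aiohttp: the IO stays
# async, but the aiobotocore dependency goes away.
import aiohttp
from botocore.auth import S3SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.session import Session

async def s3_get(bucket, key, region="ap-southeast-2"):
    creds = Session().get_credentials().get_frozen_credentials()
    url = f"https://{bucket}.s3.{region}.amazonaws.com/{key}"
    request = AWSRequest(method="GET", url=url)
    S3SigV4Auth(creds, "s3", region).add_auth(request)  # adds auth headers
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=dict(request.headers.items())) as resp:
            resp.raise_for_status()
            return await resp.read()
```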

But ultimately it's not so much threading vs async that would allow for better performance, but a better strategy of user-guided "parallel" listing: some shallow directory listing followed by deep prefix listing running across several "threads", regardless of whether those threads are async or "normal".
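To make that listing strategy concrete, a rough sketch under stated assumptions (plain boto3, with threads standing in for whatever concurrency primitive ends up being used):

```python
# Shallow Delimiter listing discovers the top-level "directories", then
# the deep per-prefix listings fan out across a thread pool.
from concurrent.futures import ThreadPoolExecutor

import boto3

def list_keys_parallel(bucket, prefix="", max_workers=16):
    s3 = boto3.client("s3")

    # Shallow pass: a single request splits the keyspace into sub-prefixes.
    shallow = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
    subdirs = [d["Prefix"] for d in shallow.get("CommonPrefixes", [])]
    keys = [obj["Key"] for obj in shallow.get("Contents", [])]

    def list_deep(sub_prefix):
        pages = s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=sub_prefix)
        return [obj["Key"] for page in pages for obj in page.get("Contents", [])]

    # Deep pass: each sub-prefix gets its own worker.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk in pool.map(list_deep, subdirs):
            keys.extend(chunk)
    return keys
```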

@alexgleith (Contributor)

I know what you mean about parallel listing. Not sure if there's an elegant way to partition that.

Damien said that the s5cmd tool was the fastest he found, maybe we could wrap that... https://github.com/peak/s5cmd

@Kirill888 (Member, Author)

> I know what you mean about parallel listing. Not sure if there's an elegant way to partition that.
>
> Damien said that the s5cmd tool was the fastest he found, maybe we could wrap that... https://github.com/peak/s5cmd

Made #332 for this.

Kirill888 added a commit that referenced this issue Aug 25, 2021
It's only used by odc-aio and we have too many projects
Kirill888 added a commit that referenced this issue Aug 25, 2021
whatever remaining functionality was there was moved to odc.algo.*, as this is
where all the Dask related experiments/utilities are now
@Kirill888 (Member, Author)

Update

odc-dtools and odc-ppt are no more, and odc-index is now odc-stac.

Still not sure about the cloud libs. I'm leaning towards making them one package with feature flags to pull in optional dependencies like thredds; that might actually make it easier to refactor S3Fetcher to work without aiobotocore.

@alexgleith (Contributor)

I agree that we should merge cloud and dc apps, keep it simple.

@Kirill888 (Member, Author)

@alexgleith I'm not too concerned about apps, as they are "leaf nodes" and can be merged later on without much disruption. What do we do with the libraries, though? We have 4 related but separate libs:

  • odc-aws
  • odc-aio
  • odc-azure
  • odc-thredds

One option is to put them into an odc-cloud package and make -aio, -azure, and -thredds optional (hidden behind feature flags). The only downside I see is that one cannot depend on, say, -thredds without also depending on boto; but currently only dc-tools/cloud use those, and they don't have aws as an optional dependency.
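For illustration, the merged package could guard its optional imports so that a missing extra fails with a clear message; a minimal sketch (the module layout and message wording are assumptions):

```python
# odc/cloud/_azure.py (hypothetical layout): fail loudly if the AZURE
# extra was not installed alongside odc-cloud.
try:
    from azure.storage.blob import ContainerClient
except ImportError as e:
    raise ImportError(
        "Azure support requires the AZURE extra: "
        "pip install 'odc-cloud[AZURE]'"
    ) from e
```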

@alexgleith (Contributor)

Ok, sorry.

Yeah, I'm happy with odc-cloud.

Thredds is such a niche use case, and I think if they're using it they have bigger problems than needing to have an unused boto dependency!
