Proposal: reduce number of top level packages #331

Closed
Kirill888 opened this issue Aug 25, 2021 · 9 comments
@Kirill888 (Member) commented Aug 25, 2021

Problem

This repository grew "organically" for a while, and some of the earlier experiments have proven less relevant than others. The side effect of this growth is a large number of packages and namespaces. Currently there is a one-to-one mapping between namespace and package: odc.algo.* is shipped by odc-algo, and odc-algo ships only files in odc/algo. While this is a reasonable and clean relation, it does make for a larger number of packages. On one hand this allows for higher granularity when declaring dependencies; on the other, packages are not "free". They add to CI delays, make renaming and moving code around harder, and make publishing to and managing PyPI/conda harder too, as more secrets need to be managed, maintainers need to be added to more projects, etc. (I spent some time adding @GypsyBojangles as an owner to every PyPI project pushed from this repo, and will need to do a similar thing for generating publishing tokens.)

Stocktake

First, let's decide which things we're definitely keeping as is. I'd say that apps can stay as they are. As far as user-facing libraries go, I have this list:

  • odc.algo - mostly xarray + Dask tools; example: xr_reproject
  • odc.ui - tools for visualizations in Jupyter
  • odc.stac - (previously odc.index) STAC and missing datacube index utilities
  • odc.stats - large-scale data processing libs/apps (work in progress)
  • odc.dscache - used by odc.stats, but has other possible future use cases (odc index export/import, for example)

Then we have "cloud IO helper libs" that are mostly used via apps or other higher-level libs, and not directly by users.

  • odc.aws - AWS S3 and SQS
  • odc.azure - Azure blob storage
  • odc.thredds - crawling THREDDS
  • odc.aio - AWS S3, but async; has an annoying aiobotocore dependency

The reason these are all separate is the dependencies they pull in. For example, odc.aio depends on aiobotocore, which is really challenging to install in the presence of dependencies on boto* libraries. One option is to put them all into one package, say odc-cloud, and use feature flags to enable/disable features, so instead of depending on odc-azure one would declare a dependency on odc-cloud[AZURE].
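As a rough sketch of how those feature flags could be wired up with setuptools extras_require (the odc-cloud name is from this proposal; the exact dependency lists here are illustrative assumptions, not a final layout):

```python
# setup.py for a hypothetical merged odc-cloud package: the core install
# pulls only the AWS bits; everything else hides behind an extra.
from setuptools import setup, find_packages

setup(
    name="odc-cloud",
    packages=find_packages(),
    install_requires=["botocore"],                    # odc.aws core
    extras_require={
        "AZURE": ["azure-storage-blob"],              # enables odc.azure
        "THREDDS": ["thredds-crawler", "requests"],   # enables odc.thredds
        "ASYNC": ["aiobotocore"],                     # enables odc.aio
    },
)
```

A consumer would then declare a dependency on odc-cloud[AZURE] (or odc-cloud[ASYNC], etc.) instead of on a separate package.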

Finally there is an odd bunch of libs

  • odc.io - poor name; used by CLI apps for various unrelated text-processing helpers
  • odc.ppt - "parallel processing tools"; some generic Future object handling (not used) and an Async->Thread adapter
    • only used by odc.aio
  • odc.dtools - not used, mostly moved to datacube; used to have "rasterio environment activation/configuration" helpers
  • odc.geom - not used, mostly moved to datacube; still has some unfinished work that might be of use later on

Immediate Actions

  • Dissolve odc.ppt by moving AsyncThread into odc.aio and abandoning the rest
  • Dissolve odc.dtools, possibly moving some things into odc.algo (and adding tests)
  • Remove dead code from other libs
  • Decide on a course of action for "cloud libs"
@alexgleith (Contributor)

I'm not addressing your post quite yet, but will have a read and do so.

But relatedly, I would like to refactor s3-to-dc and, I guess, s3-find to use threading instead of async, because I think performance will be the same, the aio code is slow to work with, and it's a frustrating dependency. Any thoughts on this?
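For what it's worth, a minimal sketch of what the threaded version might look like, assuming plain boto3 and a thread pool (fetch_all is an illustrative name, not the actual s3-find internals):

```python
# Threaded alternative to the async fetcher: list keys under a prefix,
# then download the objects concurrently from a pool of worker threads.
from concurrent.futures import ThreadPoolExecutor

import boto3

def fetch_all(bucket, prefix, max_workers=32):
    s3 = boto3.client("s3")  # boto3 clients are safe to share across threads
    paginator = s3.get_paginator("list_objects_v2")
    keys = [obj["Key"]
            for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
            for obj in page.get("Contents", [])]

    def fetch(key):
        return key, s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        yield from pool.map(fetch, keys)
```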

@alexgleith (Contributor)

Another quick point: we could combine the two apps installs; cloud and dc_tools are not that far apart.

@Kirill888 (Member, Author)

> Another quick point: we could combine the two apps installs; cloud and dc_tools are not that far apart.

cloud and dc_tools were separate because cloud was independent of the datacube dependency, and that was important at the time. I don't mind merging the two now. We can always move s3-find and s3-to-tar into a lib if need be.

> But relatedly, I would like to refactor s3-to-dc and, I guess, s3-find to use threading instead of async, because I think performance will be the same, the aio code is slow to work with, and it's a frustrating dependency. Any thoughts on this?

I do agree that odc.aio.S3Fetcher needs more love, and not just swapping out async for threads but a better bucket-listing strategy; that's a discussion for a different issue, though. I do believe that async CAN offer better performance, especially on lower-end machines, but I also agree that the aiobotocore dependency is just too much. I was thinking of using plain aiohttp instead, and I even have code for signing requests here:

```python
def s3_get_object_request_maker(region_name=None, credentials=None, ssl=True):
    ...
```
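For illustration, here's a minimal sketch of the same idea: sign the request with botocore's SigV4 machinery, then send it with plain aiohttp (an assumed approach, not the implementation behind that signature):

```python
# Sign an S3 GET with botocore, then send it with aiohttp: the IO stays
# async, but the aiobotocore dependency goes away.
import aiohttp
from botocore.auth import S3SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.session import Session

async def s3_get(bucket, key, region="ap-southeast-2"):
    creds = Session().get_credentials().get_frozen_credentials()
    url = f"https://{bucket}.s3.{region}.amazonaws.com/{key}"
    request = AWSRequest(method="GET", url=url)
    S3SigV4Auth(creds, "s3", region).add_auth(request)  # adds auth headers
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=dict(request.headers.items())) as resp:
            resp.raise_for_status()
            return await resp.read()
```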

But ultimately it's not so much threading vs async that would allow for better performance, but a better strategy of user-guided "parallel" listing: some shallow directory listing followed by deep prefix listing running across several "threads", regardless of whether those threads are async or "normal".
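To make that listing strategy concrete, a rough sketch under stated assumptions (plain boto3, with threads standing in for whatever concurrency primitive ends up being used):

```python
# Shallow Delimiter listing discovers the top-level "directories", then
# the deep per-prefix listings fan out across a thread pool.
from concurrent.futures import ThreadPoolExecutor

import boto3

def list_keys_parallel(bucket, prefix="", max_workers=16):
    s3 = boto3.client("s3")

    # Shallow pass: a single request splits the keyspace into sub-prefixes.
    shallow = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
    subdirs = [d["Prefix"] for d in shallow.get("CommonPrefixes", [])]
    keys = [obj["Key"] for obj in shallow.get("Contents", [])]

    def list_deep(sub_prefix):
        pages = s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=sub_prefix)
        return [obj["Key"] for page in pages for obj in page.get("Contents", [])]

    # Deep pass: each sub-prefix gets its own worker.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk in pool.map(list_deep, subdirs):
            keys.extend(chunk)
    return keys
```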

@alexgleith (Contributor)

I know what you mean about parallel listing. Not sure if there's an elegant way to partition that.

Damien said that the s5cmd tool was the fastest he found, maybe we could wrap that... https://github.com/peak/s5cmd

@Kirill888 (Member, Author)

> I know what you mean about parallel listing. Not sure if there's an elegant way to partition that.
>
> Damien said that the s5cmd tool was the fastest he found, maybe we could wrap that... https://github.com/peak/s5cmd

Made #332 for this.

Kirill888 added a commit that referenced this issue Aug 25, 2021
It's only used by odc-aio and we have too many projects
Kirill888 added a commit that referenced this issue Aug 25, 2021
whatever remaining functionality was there was moved to odc.algo.*, as this is
where all the Dask related experiments/utilities are now
@Kirill888 (Member, Author)

Update

odc-dtools and odc-ppt are no more, and odc-index is now odc-stac.

Still not sure about the cloud libs. I'm leaning towards making them one package with feature flags to pull in optional dependencies like thredds; that might actually make it easier to refactor S3Fetcher to work without aiobotocore.

@alexgleith (Contributor)

I agree that we should merge cloud and dc apps, keep it simple.

@Kirill888 (Member, Author)

@alexgleith I'm not too concerned about apps, as they are "leaf nodes" and can be merged later on without much disruption. What do we do with the libraries, though? We have 4 related but separate libs:

  • odc-aws
  • odc-aio
  • odc-azure
  • odc-thredds

One option is to put them into an odc-cloud package and make -aio, -azure, and -thredds optional (hidden behind feature flags). The only downside I see is that one cannot depend on, say, -thredds without also depending on boto; but currently only dc-tools/cloud use those, and they don't have aws as an optional dependency.
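For illustration, the merged package could guard its optional imports so that a missing extra fails with a clear message; a minimal sketch (the module layout and message wording are assumptions):

```python
# odc/cloud/_azure.py (hypothetical layout): fail loudly if the AZURE
# extra was not installed alongside odc-cloud.
try:
    from azure.storage.blob import ContainerClient
except ImportError as e:
    raise ImportError(
        "Azure support requires the AZURE extra: "
        "pip install 'odc-cloud[AZURE]'"
    ) from e
```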

@alexgleith (Contributor)

Ok, sorry.

Yeah, I'm happy with odc-cloud.

Thredds is such a niche use case, and I think if they're using it they have bigger problems than needing to have an unused boto dependency!
