Directory model has very bad performance on large directories. #10

Open
eap opened this issue Mar 9, 2018 · 2 comments

Comments

@eap
Contributor

eap commented Mar 9, 2018

In the _dir_model method there is a segment of code that calls self.get on every contained blob when content is requested (see below). This creates a massive slowdown when navigating to GCS directories with more than a dozen files.

I suspect it would be straightforward to refactor this to use the google.cloud.storage.Blob objects returned by bucket.list_blobs directly. Similarly, for directories with many sub-directories, the list of prefixes could be used directly rather than calling self.get many times (a rough sketch follows the snippet below).

offending code:

def _dir_model(self, path, members, content=True):
    ...
    for blob in blobs:
        ...
        # one self.get() call per child blob, each of which hits the API again
        contents.append(self.get(
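
For illustration, a rough sketch of the refactor suggested above, assuming google.cloud.storage's delimiter/prefixes listing behaviour; the dict layout is purely illustrative and not the actual Jupyter contents model:

from google.cloud import storage

def list_dir(bucket: storage.Bucket, prefix: str):
    # delimiter="/" makes GCS return only direct children: real blobs come
    # back from the iterator, pseudo-directories accumulate in .prefixes.
    iterator = bucket.list_blobs(prefix=prefix, delimiter="/")
    # Build file entries from the already-fetched Blob metadata instead of
    # issuing one self.get() round trip per child.
    files = [{"name": blob.name, "size": blob.size,
              "last_modified": blob.updated, "type": "file"}
             for blob in iterator]
    # .prefixes is only populated once the pages above have been consumed.
    dirs = [{"name": p.rstrip("/"), "type": "directory"}
            for p in sorted(iterator.prefixes)]
    return dirs + files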

I've filed this bug for tracking purposes - I don't have the bandwidth to resolve the bug at present.

@vmarkovtsev
Collaborator

Yes, this is a known problem. In our org's case the slowdown appeared at around 100 entries. I tried to investigate and got stuck on the Google API's inability to properly limit the number of results returned by the iterator. It may be fixed by now; I don't know.

The reason I call get() is that we need to decide which of the children are directories and which are files. As you are aware, GCS makes no distinction between a file and a directory, and even the standard viewer in the Cloud Console sometimes behaves weirdly (attaching a blob to a transient node is entirely possible). So I have to fetch every entry of the listed directory to decide whether it is a file or a dir, and yes, that's slow.

The solution would be to limit the listing iterator to a sane number of results, but as I said, the Google API handled that poorly back then.
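
For reference, a minimal sketch of such a cap with a current google-cloud-storage client, reusing the bucket/prefix names from the sketch above; MAX_CHILDREN is an arbitrary illustrative limit, and this assumes max_results behaves as documented today:

MAX_CHILDREN = 500  # arbitrary cap, purely illustrative

# max_results bounds the total number of entries the listing returns, so a
# pathologically large "directory" cannot stall the content manager.
iterator = bucket.list_blobs(prefix=prefix, delimiter="/",
                             max_results=MAX_CHILDREN)
blobs = list(iterator)               # at most MAX_CHILDREN blob entries
subdirs = sorted(iterator.prefixes)  # pseudo-directories seen so far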

@eap
Contributor Author

eap commented Mar 9, 2018

W/r/t iterator limiting: that's now supported, but I wouldn't recommend it for this case. In my experience even large collections of blobs (well over 1000) can be listed quickly; the per-call round-trip penalty is usually much steeper, so I minimize paging unless there's a compelling reason. Another optimization here could be a fields constraint. I don't recall offhand whether this API supports it, but many GCP APIs use field constraints to speed up their back-end calls, so limiting the response to just the essential fields could make this even snappier.
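
A hedged sketch of that idea, assuming a google-cloud-storage release whose list_blobs accepts a fields argument (newer versions expose the objects.list partial-response syntax this way; the client available back then may not), with an illustrative field list:

# Ask the API to return only the fields the directory listing needs;
# nextPageToken has to stay in the list or paging silently breaks.
iterator = bucket.list_blobs(
    prefix=prefix,
    delimiter="/",
    fields="items(name,size,updated),prefixes,nextPageToken",
)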

As you noted, there will always be fun edge cases inherent to the prefix-based emulation of folders, but I suspect there are ways around them with relatively simple constraints.
