Directory model has very bad performance on large directories. #10

Open
eap opened this issue Mar 9, 2018 · 2 comments

Comments

@eap
Contributor

eap commented Mar 9, 2018

In the _dir_model method there is a segment of code that calls self.get on every contained blob when content is requested (see below). This creates a massive slowdown when navigating to GCS directories with more than a dozen files.

I suspect it would be straightforward to refactor this to use the google.cloud.storage.Blob objects returned by bucket.list_blobs directly. Similarly, for directories with many sub-directories, the list of prefixes could be used directly rather than calling self.get many times (a rough sketch follows the snippet below).

offending code:

def _dir_model(self, path, members, content=True):
    ...
    for blob in blobs:
        ...
        # one self.get() call per child blob, each of which hits the API again
        contents.append(self.get(
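
For illustration, a rough sketch of the refactor suggested above, assuming google.cloud.storage's delimiter/prefixes listing behaviour; the dict layout is purely illustrative and not the actual Jupyter contents model:

from google.cloud import storage

def list_dir(bucket: storage.Bucket, prefix: str):
    # delimiter="/" makes GCS return only direct children: real blobs come
    # back from the iterator, pseudo-directories accumulate in .prefixes.
    iterator = bucket.list_blobs(prefix=prefix, delimiter="/")
    # Build file entries from the already-fetched Blob metadata instead of
    # issuing one self.get() round trip per child.
    files = [{"name": blob.name, "size": blob.size,
              "last_modified": blob.updated, "type": "file"}
             for blob in iterator]
    # .prefixes is only populated once the pages above have been consumed.
    dirs = [{"name": p.rstrip("/"), "type": "directory"}
            for p in sorted(iterator.prefixes)]
    return dirs + files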

I've filed this bug for tracking purposes - I don't have the bandwidth to resolve the bug at present.

@vmarkovtsev
Collaborator

Yes, this is a known problem. In our org's case the slowdown appeared at around 100 entries. I tried to investigate and got stuck on the Google API's inability to properly limit the number of results returned by the iterator. It may be fixed by now; I don't know.

The reason I call get() is that we need to decide which of the children are directories and which are files. As you are aware, GCS makes no distinction between a file and a directory, and even the standard viewer in the Cloud Console sometimes behaves weirdly (attaching a blob to a transient node is entirely possible). So I have to fetch every entry of the listed directory to decide whether it is a file or a dir, and yes, that's slow.

The solution would be to limit the listing iterator to a sane number of results, but as I said, the Google API handled that poorly back then.
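
For reference, a minimal sketch of such a cap with a current google-cloud-storage client, reusing the bucket/prefix names from the sketch above; MAX_CHILDREN is an arbitrary illustrative limit, and this assumes max_results behaves as documented today:

MAX_CHILDREN = 500  # arbitrary cap, purely illustrative

# max_results bounds the total number of entries the listing returns, so a
# pathologically large "directory" cannot stall the content manager.
iterator = bucket.list_blobs(prefix=prefix, delimiter="/",
                             max_results=MAX_CHILDREN)
blobs = list(iterator)               # at most MAX_CHILDREN blob entries
subdirs = sorted(iterator.prefixes)  # pseudo-directories seen so far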

@eap
Contributor Author

eap commented Mar 9, 2018

W/r/t iterator limiting: that's now supported, but I wouldn't recommend it for this case. In my experience even large collections of blobs (well over 1000) can be listed quickly; the per-call round-trip penalty is usually much steeper, so I minimize paging unless there's a compelling reason. Another optimization here could be a fields constraint. I don't recall offhand whether this API supports it, but many GCP APIs use field constraints to speed up their back-end calls, so limiting the response to just the essential fields could make this even snappier.
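
A hedged sketch of that idea, assuming a google-cloud-storage release whose list_blobs accepts a fields argument (newer versions expose the objects.list partial-response syntax this way; the client available back then may not), with an illustrative field list:

# Ask the API to return only the fields the directory listing needs;
# nextPageToken has to stay in the list or paging silently breaks.
iterator = bucket.list_blobs(
    prefix=prefix,
    delimiter="/",
    fields="items(name,size,updated),prefixes,nextPageToken",
)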

As you noted, there will always be fun edge cases inherent to the prefix-based emulation of folders, but I suspect there are ways around them with relatively simple constraints.
