In the _dir_model method there is a segment of code that calls self.get on every contained blob when contents is set (see below). This approach causes a massive slowdown when navigating to GCS directories with more than a dozen files.
I suspect it should be straightforward to refactor this to use the google.cloud.storage.Blob objects returned by bucket.list_blobs directly. Similarly, for directories with many sub-directories, the list of prefixes could be used directly rather than running self.get many times.
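A rough sketch of what that refactor could look like, assuming the google-cloud-storage Python client; the function name list_dir is illustrative and is not the existing _dir_model code:

```python
from google.cloud import storage


def list_dir(bucket: storage.Bucket, prefix: str):
    """Build a directory listing from a single list_blobs call.

    Sketch only: a delimiter-based listing returns both the Blob
    objects (files) and the sub-directory prefixes in one request,
    so no per-entry self.get() round trip is needed.
    """
    iterator = bucket.list_blobs(prefix=prefix, delimiter="/")
    # Skip the placeholder object for the directory itself, if any.
    files = [blob for blob in iterator if blob.name != prefix]
    # iterator.prefixes is populated once the iterator has been consumed.
    directories = sorted(iterator.prefixes)
    return files, directories
```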
Yes, this is a known problem. In our org's case the slowdown appeared at around 100 entries. I tried to investigate and got stuck on the Google API's inability at the time to properly limit the number of results returned from the iterator. It may be fixed now; I don't know.
The reason I call get() is that we need to decide which of the children are directories and which are files. As you are aware, GCS makes no distinction between a file and a directory, and even the standard viewer in the Cloud Console behaves oddly sometimes (attaching a blob to a transient node is entirely possible). So I have to fetch every entry of the listed directory to decide whether it is a file or a directory, and yes, that's slow.
The solution would be to limit the listing iterator to a sane number, but as I said, the Google API did a bad job of that back then.
Regarding iterator limiting: that's now supported, but I wouldn't recommend it for this case. In my experience even large collections of blobs (well over 1000) can be listed quickly; the round-trip penalty of per-entry API calls is usually much steeper, so I minimize paging unless there's a critical motivator. Another optimization here could be a fields constraint. I don't recall offhand whether this API supports it, but many GCP APIs use field constraints to speed up their back-end calls, so limiting the fields to just the essentials could make this even snappier.
As you noted, there will always be fun edge cases inherent to the prefix-based emulation of folders, but I also suspect there are ways around the issue with relatively simple constraints.
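If field limiting does turn out to apply here, a capped, field-restricted listing might look roughly like the sketch below. The bucket name, prefix, cap, and exact field selector are illustrative; recent google-cloud-storage releases accept a fields argument on list_blobs, but the selector syntax should be checked against the GCS JSON API docs:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # placeholder bucket name

# Restrict the response to the fields a contents manager actually needs.
# nextPageToken must stay in the selector for paging to keep working,
# and prefixes is needed for the delimiter-based "directories".
iterator = bucket.list_blobs(
    prefix="some/dir/",   # placeholder prefix
    delimiter="/",
    max_results=1000,     # optional cap, now supported by the client
    fields="items(name,size,updated),prefixes,nextPageToken",
)
blobs = list(iterator)        # file-like entries as partial Blob objects
subdirs = iterator.prefixes   # "directory" prefixes
```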
I've filed this bug for tracking purposes; I don't have the bandwidth to resolve it at present.