Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added ListObjectsV2 call #64

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

ExpandingMan
Copy link
Contributor

This adds the ListObjectsV2 call. As far as I can tell, this is the only way to actually list all objects in a bucket. No matter how I try, the older version never lists more than 1000 objects for me.

Right now the way I have this written gives lots of code duplication. I left it this way for now because merging s3_list_objects and s3_list_objects_v2 was a little bit messy. As I see it we have 3 options:

  • Leave it as is (two completely separate functions).
  • Merge the two functions and handle all possible keys in one function.
  • Deprecate s3_list_objects in favor of the new version. I can't see any reason not to do this as the older version seems like a subset of the new one.

@bkamins
Copy link

bkamins commented Dec 4, 2019

CC @pszufe

@mattBrzezinski mattBrzezinski self-assigned this Dec 4, 2019
@mattBrzezinski
Copy link
Member

No matter how I try, the older version never lists more than 1000 objects for me.

ListObjects is limited to 1,000 objects.

Deprecate s3_list_objects in favor of the new version. I can't see any reason not to do this as the older version seems like a subset of the new one.

I think this makes the most sense.

Copy link
Member

@mattBrzezinski mattBrzezinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just did a first pass on this CR. I'm in favour of just marking list_objects as deprecated and getting people to use v2.

Also, could you bump the version number in the Project.toml, and add some testing for this?

if delimiter != ""
q["delimiter"] = delimiter
end
if contoken ≠ ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency we should probably keep using != instead of

# Add each object from the response and update our object count / marker
if haskey(r, "Contents")
l = isa(r["Contents"], Vector) ? r["Contents"] : [r["Contents"]]
for object in l
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of iterating over these objects, could we not just put! the vector into the channel? (Not sure if this is possible).

And then just do num_objects += size(l)[1]

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the xml_dict() function still need to be applied to each element in that vector before put!-ing the vector to the channel?

@pszufe
Copy link

pszufe commented Dec 4, 2019

In production systems an S3 bucket can contain millions of items and S3 has no directory structure (this what you see as directories are in fact name prefixes so there is not such thing as directory index).
Maybe since you have a chance yo use a new AWS API it would be also a good idea to update the AWSS3 API?
Perhaps, s3_list_objects_v2 could return some ResultSet obejct and then one can call getdata(resultset) and next!(resultset). Otherwise for large scale system this API will be not usable anyway (listing 1million objects can take very considerable amount of time while a typical large scale scenario is to list objects and asynchronously launch jobs for processing them).

I just noticed that you have a Channel in the new code so disregard the last comment. Paginating via a Channel seems to be more beautiful. so disregard the last paragraph. However, still tests for buckets with more than 1000 items are required.

@ExpandingMan
Copy link
Contributor Author

Note that there is (and has been) an option for limiting the number of items returned, so I'd assume it would work fine as is.

Is there a consensus on deprecating the older ListObjects in favor of this new ListObjectsV2?

If so I will eliminate the old function and try to make sure we don't lose any functionality in the new one (might have to add a couple of keys, I will check).

@pszufe
Copy link

pszufe commented Dec 5, 2019

The AWS S# REST API docs reads (https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html):

We recommend that you use this revised API for application development. For backward compatibility, Amazon S3 continues to support the prior version of this API, ListObjects.

So it seems to be reasonable to depreciate the old version.

@mattBrzezinski
Copy link
Member

I am in favour of adding the @deprecated tag to the ListObjects function, introducing the ListObjectsV2 call and releasing a patch rather than removing the ListObjects function and doing a major release.

@ExpandingMan
Copy link
Contributor Author

Any concensus about what should be done here so I can update the PR? Are we ok with deprecating the old version?

@s2maki
Copy link

s2maki commented Dec 13, 2019

The only reason I can think of for keeping both is that the pagination arguments are different between calls. ListObjects uses Marker, while V2 uses ContinuationToken. I use ListObjects, but I expect my object count to be small (a couple dozen). Anything more than that, and I find I have to maintain my index another way anyhow.

@mattBrzezinski
Copy link
Member

X-posting a similar merge request from @pschmied back in here:

#94

@pschmied
Copy link

pschmied commented Aug 3, 2020

The only reason I can think of for keeping both is that the pagination arguments are different between calls. ListObjects uses Marker, while V2 uses ContinuationToken. I use ListObjects, but I expect my object count to be small (a couple dozen). Anything more than that, and I find I have to maintain my index another way anyhow.

In the V1 ListObjects API, the "marker" parameter causes the API to return objects that follow the marker lexicographically:

https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjects.html#API_ListObjects_RequestSyntax

In the ListObjectsV2 API, the "start-after" parameter causes the API to return objects that follow the start-after parameter lexicographically:

https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html#API_ListObjectsV2_RequestSyntax

My feeling is that exposing the "start-after" parameter in the s3_list_objects_v2 function would allow anyone relying on the old "marker" parameter to do what they need to do. Thus you could deprecate the old ListObjects / s3_list_objects. Do others agree?

Copy link

@pschmied pschmied left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By adding the start-after query parameter, this should allow people to achieve with the ListObjectsV2 API what they were able to do with the deprecated ListObjects API.


"""
s3_list_objects_v2([::AWSConfig], bucket, [path_prefix]; delimiter="/", max_items=1000)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s3_list_objects_v2([::AWSConfig], bucket, [path_prefix]; delimiter="/", max_items=1000)
s3_list_objects_v2([::AWSConfig], bucket, [path_prefix]; delimiter="/", start_after="", max_items=1000)


This uses the `ListObjectV2` function call.
"""
function s3_list_objects_v2(aws::AWSS3.AWSConfig, bucket, path_prefix=""; delimiter="/", max_items=nothing)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
function s3_list_objects_v2(aws::AWSS3.AWSConfig, bucket, path_prefix=""; delimiter="/", max_items=nothing)
function s3_list_objects_v2(aws::AWSS3.AWSConfig, bucket, path_prefix=""; delimiter="/", start_after="", max_items=nothing)

end
if contoken ≠ ""
q["continuation-token"] = contoken
end
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
end
end
if start_after != ""
q["start-after"] = start_after
end

if delimiter != ""
q["delimiter"] = delimiter
end
if contoken ≠ ""
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if contoken ""
if contoken != ""

@pschmied
Copy link

Just bumping here to see if there's anything more I can do to help move this along.

@pschmied
Copy link

If it is useful, I have collected I think all of the feedback, plus tests into the master branch of my fork: https://github.com/pschmied/AWSS3.jl

@mattBrzezinski
Copy link
Member

Just bumping here to see if there's anything more I can do to help move this along.

I'm currently quite busy with other work, but at some point I will have time to dedicate to this. If someone else could take the lead on reviewing / giving feedback that'd be greatly appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants