Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Google Drive] Fix permission fetching bug for domain-wide delegation sync #2024

Conversation

jedrazb
Copy link
Member

@jedrazb jedrazb commented Jan 3, 2024

Closes #1979

This fixes a bug when doing domain-wide delegation sync with DLS enabled.

Reproduction

This organisation has 2 files:

File 1: Owned by User A, and User C has read access.
File 2: Owned by User B

Fetch user -> [User C, User A, User B] (this list is generally unordered)

User C:

  • Fetch files -> [File 1]
  • Fetch file content and permissions. This throws an error (403, cannot fetch permissions). Users with read-only access can't fetch permissions. Error handled gracefully, empty permission list [] is returned and the file with a given ID is marked as "seen".

User B: All good.

User A

  • Fetch files -> [File 1]
  • Fetch file content and permissions. We are able to fetch the permissions but the file is marked as seen so we don't update the file with permission list.

Expected behavior

The ideal state of Documents in Elasticsearch should have been:

File 1

  • allow_access_control: [User A, User C]
    File 2
  • _allow_access_control: [User B]

Actual Behavior

File 1

  • _allow_access_control: []
    File 2
  • _allow_access_control: [User B]

The fix was suggested as a part of SDH ticket by the customer. I verified that it:

  • leads to expected results
  • fixes the warnings about permission fetching errors
  • it only affects domain-wide delegation syncs.

Pre-Review Checklist

  • this PR has a meaningful title
  • this PR links to all relevant github issues that it fixes or partially addresses
  • if there is no GH issue, please create it. Each PR should have a link to an issue
  • this PR has a thorough description
  • Tested the changes locally
  • Added a label for each target release version (example: v7.13.2, v7.14.0, v8.0.0)

@jedrazb jedrazb requested a review from a team January 3, 2024 14:51
Copy link
Member

@artem-shelkovnikov artem-shelkovnikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably not a problem or impossible setup, but wanted to confirm with you:

How does it work, if we have users A and B who both have only read-only access to the file. Will it mark this file as not accessible by anyone, or is this situation impossible?

@jedrazb
Copy link
Member Author

jedrazb commented Jan 4, 2024

How does it work, if we have users A and B who both have only read-only access to the file. Will it mark this file as not accessible by anyone, or is this situation impossible?

That's a valid question. This scenario is only possible if someone from outside of the organization has shared a read-only file with people inside the organization. This implementation will skip indexing such a file since no-one inside the org has write access to it.

Another possible approach would be to keep the list_files query unchanged (q="trashed=false") when DLS is disabled (meaning we would index also read-only files people from your org have access to), and only use the new list query (q="trashed=false and 'me' in writers") when DLS is enabled to fix the actual bug.

The downside of this approach is that we would end up with different content index with DLS enabled/disabled - because the query would be different. I'm not sure if we need to index files that onone in the org has write access to. Note, I'm talking here about content sync with domain-wide delegation enabled where we iterate over all org users and "impersonate" them to access their files.

So in short:

  • current bugfix: same query q for DLS and no-DLS sync, indexing just files that someone has write-access
  • alternative bugfix: different query q for DLS and no-DLS, potentially more files "discovered" in no-DLS mode (since we are considering read-only access as well)

@artem-shelkovnikov
Copy link
Member

I'm not sure if we need to index files that onone in the org has write access to.

I feel like it makes sense, as all we need is documents that people can read - if you can read a doc A, that you're not allowed to write to, the doc might still be of a lot of value for you.

That's, though, my opinion, I don't know if I can be a good source of answers to how it should work - might it be worth to discuss it with the team?

Copy link
Member

@artem-shelkovnikov artem-shelkovnikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meanwhile approving it as it seems like an improvement to existing behaviour

@jedrazb
Copy link
Member Author

jedrazb commented Jan 5, 2024

if you can read a doc A, that you're not allowed to write to, the doc might still be of a lot of value for you.

I thought more about it. Also I raised this with the team, but it was pretty complicated setup to explain in a call. I think the best solution would be to:

  • For non-DLS, index all files any org user has read/write access to (status quo)
  • For DLS, fix the actual bug and change the query to consider only files that org has write access to. We need to drop all read-only files anyway since we can't fetch permissions for them.

I think this should maximise utility for users of this connector. What do you think? I will change the PR.

…nly-access' of github.com:elastic/connectors-python into google-drive-retrieve-permissions-for-users-with-read-only-access
@wangch079
Copy link
Member

Trying to review with limited knowledge of Google Drive.

I'm not sure if we need to index files that onone in the org has write access to.

I'd say we need to index files the user has access, unless we specifically document it like in the RCF use_domain_wide_delegation_for_sync:

Enable domain-wide delegation to automatically sync content from all shared and personal drives in the Google workspace. This eliminates the need to manually share Google Drive data with your service account, though it may increase sync time. If disabled, only items and folders manually shared with the service account will be synced. Please refer to the connector documentation to ensure domain-wide delegation is correctly configured and has the appropriate scopes.

The downside of this approach is that we would end up with different content index with DLS enabled/disabled

This concerns me. I don't think enabling/disabling DLS should affect the number of docs indexed.

What if you query with q="trashed=false, and check 'me' in writers" in code? If me is not in writers and DLS is enabled, add me into permissions.

@jedrazb
Copy link
Member Author

jedrazb commented Jan 8, 2024

I don't think enabling/disabling DLS should affect the number of docs indexed.

Correct, but with DLS enabled we are unable to fetch permissions for read-only files shared with me. So those files are unusable anyway from document level security perspective. Even though we can index a file, we can't fetch its access control list.

What if you query with q="trashed=false, and check 'me' in writers" in code?

I don't think we can execute dynamic queries for g drive such as check 'me' in writers" (even if we did it would be one additional request per file which would be wasteful).

If me is not in writers and DLS is enabled, add me into permissions.

The me in writers is not about me being absent from the file's access control list. If I don't have write level access I can't fetch the file's access control list (all accounts that have access to the file). So, I don't think the proposed solution would work in such a case. Also, note that we are not "building" file's ACLs in-mem, we are relying on fetching complete ACL from g drive api.

Copy link
Member

@artem-shelkovnikov artem-shelkovnikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation of the problem, @jedrazb !

@jedrazb jedrazb enabled auto-merge (squash) January 9, 2024 10:10
@jedrazb jedrazb merged commit 64d9692 into main Jan 9, 2024
2 checks passed
@jedrazb jedrazb deleted the google-drive-retrieve-permissions-for-users-with-read-only-access branch January 9, 2024 10:27
Copy link

github-actions bot commented Jan 9, 2024

💚 Backport PR(s) successfully created

Status Branch Result
8.11 #2034
8.12 #2035

The backport PRs will be merged automatically after passing CI.

jedrazb added a commit that referenced this pull request Jan 9, 2024
jedrazb added a commit that referenced this pull request Jan 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Google Drive] Unable to Retrieve File Permissions for Users with Read-Only Access
3 participants