
Paginate all SCIM list requests in the SDK #440

Merged 7 commits into main on Nov 14, 2023
Conversation


@mgyucht mgyucht commented Nov 14, 2023

Changes

This PR makes two hard-coded changes to SCIM API handling in the Python SDK:

  1. startIndex starts at 1 for SCIM APIs, not 0. However, the existing .Pagination.Increment controls both the start index as well as whether the pagination is per-page or per-resource. Later, we should replace this extension with two independent OpenAPI options: one_indexed (defaulting to false) and pagination_basis (defaulting to resource but can be overridden to page).
  2. If users don't specify a limit, the SDK will include a hard-coded limit of 100 resources per request. We could add this to the OpenAPI spec as an option, default_limit, which would be useful for non-paginated APIs that later expose pagination options, allowing the SDK to support them gracefully. However, we don't want to encourage folks to use this pattern: all new list APIs are required to be paginated from the start.
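The two behaviors above can be sketched as a single pagination loop. This is an illustrative sketch only, not the SDK's actual generated code: `paginate_scim`, `fetch`, and `DEFAULT_PAGE_SIZE` are hypothetical names.

```python
from typing import Callable, Iterator, List

DEFAULT_PAGE_SIZE = 100  # hypothetical constant mirroring the PR's hard-coded limit


def paginate_scim(fetch: Callable[[int, int], List[dict]],
                  count: int = DEFAULT_PAGE_SIZE) -> Iterator[dict]:
    """Yield all resources from a SCIM-style list endpoint.

    SCIM pagination is 1-indexed: the first request uses startIndex=1,
    and each subsequent request advances by the page size (a per-page
    increment, not a per-resource one).
    """
    start_index = 1  # SCIM start indices begin at 1, not 0
    seen = set()     # deduplicate items that may be added during iteration
    while True:
        page = fetch(start_index, count)
        if not page:
            return
        for item in page:
            if item["id"] not in seen:
                seen.add(item["id"])
                yield item
        start_index += count  # page-based increment

# Example against a fake endpoint holding 250 users
users = [{"id": str(i)} for i in range(250)]
fetch = lambda start, count: users[start - 1:start - 1 + count]
assert len(list(paginate_scim(fetch))) == 250
```

With a 100-resource default, the loop above issues three requests (startIndex 1, 101, 201) and stops when an empty page comes back.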

Tests

  • make test run locally
  • make fmt applied
  • relevant integration tests applied

@mgyucht mgyucht mentioned this pull request Nov 14, 2023
3 tasks
@mgyucht mgyucht requested a review from tanmay-db November 14, 2023 10:27
@nfx nfx left a comment

Thank you for adding this. A few nits:

Comment on lines 19 to 21
("/api/2.0/preview/scim/v2/Users", lambda w: w.users.list(count=1)),
("/api/2.0/preview/scim/v2/Groups", lambda w: w.groups.list(count=1)),
("/api/2.0/preview/scim/v2/ServicePrincipals", lambda w: w.service_principals.list(count=1)),

Suggested change
("/api/2.0/preview/scim/v2/Users", lambda w: w.users.list(count=1)),
("/api/2.0/preview/scim/v2/Groups", lambda w: w.groups.list(count=1)),
("/api/2.0/preview/scim/v2/ServicePrincipals", lambda w: w.service_principals.list(count=1)),
("/api/2.0/preview/scim/v2/Users", lambda w: w.users.list(count=10)),
("/api/2.0/preview/scim/v2/Groups", lambda w: w.groups.list(count=10)),
("/api/2.0/preview/scim/v2/ServicePrincipals", lambda w: w.service_principals.list(count=10)),

I think count=1 would make integration tests too slow...

Contributor Author

I tried this, but I wanted to make sure that we actually exercised the pagination flows. This test took around 15 seconds for all test cases combined.



@pytest.mark.parametrize("path,call", [
# there are ~7k users in our aws prod account
Contributor

Suggested change
# there are ~7k users in our aws prod account

I don't think this should be in public code.

Contributor Author

Excellent point.

# there are ~7k users in our aws prod account
("/api/2.0/accounts/%s/scim/v2/Users", lambda a: a.users.list(count=1000)),
("/api/2.0/accounts/%s/scim/v2/Groups", lambda a: a.groups.list(count=1)),
# there are ~3k service principals in our aws prod account
Contributor

Suggested change
# there are ~3k service principals in our aws prod account

Contributor Author

Excellent point

@@ -1182,7 +1182,8 @@ def list(self,

# deduplicate items that may have been added during iteration
seen = set()
query['startIndex'] = 0
query['startIndex'] = 1
Contributor

Is this because the first item starts at index 1 in the query? https://docs.databricks.com/api/workspace/groups/list

Contributor Author

Yes, that's right.
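For reference, SCIM's startIndex is 1-based, so a server backed by a zero-based collection translates it with an off-by-one shift. A hypothetical sketch (the `scim_page` helper is illustrative, not part of the SDK):

```python
def scim_page(resources, start_index, count):
    # SCIM uses a 1-based startIndex; convert it to a zero-based
    # slice offset to fetch the requested page.
    offset = start_index - 1
    return resources[offset:offset + count]

users = list(range(1, 11))  # ten resources, ids 1..10
assert scim_page(users, 1, 3) == [1, 2, 3]  # startIndex=1 is the first item
assert scim_page(users, 4, 3) == [4, 5, 6]  # the next page starts at 4
```

This is why the SDK's previous `query['startIndex'] = 0` silently asked for a page starting before the first resource.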

@mgyucht mgyucht enabled auto-merge November 14, 2023 11:37
@mgyucht mgyucht added this pull request to the merge queue Nov 14, 2023
Merged via the queue into main with commit 9ba48cc Nov 14, 2023
7 checks passed
@mgyucht mgyucht deleted the feature/scim-pagination branch November 14, 2023 11:40
mgyucht added a commit that referenced this pull request Nov 14, 2023
* Introduce more specific exceptions, like `NotFound`, `AlreadyExists`, `BadRequest`, `PermissionDenied`, `InternalError`, and others ([#376](#376)). This makes it easier to handle errors thrown by the Databricks API. Instead of catching `DatabricksError` and checking the error_code field, you can catch one of these subtypes of `DatabricksError`, which is more ergonomic and removes the need to rethrow exceptions that you don't want to catch. For example:
```python
try:
  return (self._ws
    .permissions
    .get(object_type, object_id))
except DatabricksError as e:
  if e.error_code in [
    "RESOURCE_DOES_NOT_EXIST",
    "RESOURCE_NOT_FOUND",
    "PERMISSION_DENIED",
    "FEATURE_DISABLED",
    "BAD_REQUEST"]:
    logger.warning(...)
    return None
  raise RetryableError(...) from e
```
can be replaced with
```python
try:
  return (self._ws
    .permissions
    .get(object_type, object_id))
except (PermissionDenied, FeatureDisabled):
  logger.warning(...)
  return None
except NotFound:
  raise RetryableError(...)
```
* Paginate all SCIM list requests in the SDK ([#440](#440)). This change ensures that SCIM list() APIs use a default limit of 100 resources, leveraging SCIM's offset + limit pagination to batch requests to the Databricks API.
* Added taskValues support in remoteDbUtils ([#406](#406)).
* Added more detailed error message on default credentials not found error ([#419](#419)).
* Request management token via Azure CLI only for Service Principals and not human users ([#408](#408)).

API Changes:

 * Fixed `create()` method for [w.functions](https://databricks-sdk-py.readthedocs.io/en/latest/workspace/functions.html) workspace-level service and corresponding `databricks.sdk.service.catalog.CreateFunction` and `databricks.sdk.service.catalog.FunctionInfo` dataclasses.
 * Changed `create()` method for [w.metastores](https://databricks-sdk-py.readthedocs.io/en/latest/workspace/metastores.html) workspace-level service with new required argument order.
 * Changed `storage_root` field for `databricks.sdk.service.catalog.CreateMetastore` to be optional.
 * Added `skip_validation` field for `databricks.sdk.service.catalog.UpdateExternalLocation`.
 * Added `libraries` field for `databricks.sdk.service.compute.CreatePolicy`, `databricks.sdk.service.compute.EditPolicy` and `databricks.sdk.service.compute.Policy`.
 * Added `init_scripts` field for `databricks.sdk.service.compute.EventDetails`.
 * Added `file` field for `databricks.sdk.service.compute.InitScriptInfo`.
 * Added `zone_id` field for `databricks.sdk.service.compute.InstancePoolGcpAttributes`.
 * Added several dataclasses related to init scripts.
 * Added `databricks.sdk.service.compute.LocalFileInfo` dataclass.
 * Replaced `ui_state` field with `edit_mode` for `databricks.sdk.service.jobs.CreateJob` and `databricks.sdk.service.jobs.JobSettings`.
 * Replaced `databricks.sdk.service.jobs.CreateJobUiState` dataclass with `databricks.sdk.service.jobs.CreateJobEditMode`.
 * Added `include_resolved_values` field for `databricks.sdk.service.jobs.GetRunRequest`.
 * Replaced `databricks.sdk.service.jobs.JobSettingsUiState` dataclass with `databricks.sdk.service.jobs.JobSettingsEditMode`.
 * Removed [a.o_auth_enrollment](https://databricks-sdk-py.readthedocs.io/en/latest/account/o_auth_enrollment.html) account-level service. This was only used to aid in OAuth enablement during the public preview of OAuth. OAuth is now enabled for all AWS E2 accounts, so usage of this API is no longer needed.
 * Added `network_connectivity_config_id` field for `databricks.sdk.service.provisioning.UpdateWorkspaceRequest`.
 * Added [a.network_connectivity](https://databricks-sdk-py.readthedocs.io/en/latest/account/network_connectivity.html) account-level service.
 * Added `string_shared_as` field for `databricks.sdk.service.sharing.SharedDataObject`.

Internal changes:

* Added regression question to issue template ([#414](#414)).
* Made test_auth no longer fail if you have a default profile setup ([#426](#426)).

OpenAPI SHA: d136ad0541f036372601bad9a4382db06c3c912d, Date: 2023-11-14
@mgyucht mgyucht mentioned this pull request Nov 14, 2023
github-merge-queue bot pushed a commit that referenced this pull request Nov 14, 2023

williamdphillips commented Nov 22, 2023

Hi @mgyucht & @nfx, maybe I'm missing something here, but I'm no longer able to page through users due to the hardcoding of startIndex.

I have some code that removes inactive users on workspaces, so it iterates over all users. Previously, I used startIndex and count to fetch 100 users at a time and then query the next 100. Now, when I try to query the next 100 users, the request still returns the first 100 because startIndex is hardcoded.

Is this change working as intended?
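If the change works as described in the PR, the intended pattern is to let the SDK's list() generator drive pagination and filter on the caller's side, rather than passing startIndex manually. A self-contained stand-in for that pattern (the `FakeUsersAPI` class is illustrative, not the SDK's real client):

```python
class FakeUsersAPI:
    """Illustrative stand-in for an SDK service with auto-pagination."""

    def __init__(self, all_users, page_size=100):
        self._all = all_users
        self._page_size = page_size

    def list(self):
        # Yields every resource, issuing paged requests behind the
        # scenes (startIndex=1, 101, 201, ...) until a page is empty.
        start = 1
        while True:
            page = self._all[start - 1:start - 1 + self._page_size]
            if not page:
                return
            yield from page
            start += self._page_size


api = FakeUsersAPI([{"id": i, "active": i % 2 == 0} for i in range(250)])
inactive = [u for u in api.list() if not u["active"]]
assert len(inactive) == 125  # all 250 users were visited, not just page one
```

Under this model, caller-supplied startIndex paging conflicts with the SDK's own paging, which would explain the behavior reported above.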
