Make random data in Python tests deterministic #14071

Open
wants to merge 11 commits into
base: branch-23.12

Conversation

vuule
Contributor

@vuule vuule commented Sep 8, 2023

Description

Some random data generators in cuDF default to seed=None, which means that an OS- or time-dependent seed is used, so the test data differs between systems and runs.
This PR changes the default to a fixed integer so that the same data is always generated.
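
The difference between the two defaults can be sketched as follows. This is a minimal illustration that uses Python's standard-library `random` module as a stand-in for the generators in cuDF; `make_values` is a hypothetical helper, not a cuDF function.

```python
import random


def make_values(n, seed=None):
    # Hypothetical helper. With seed=None the generator seeds itself from
    # OS entropy (or the clock), so the data can differ on every call.
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]


# Unseeded: the two lists are drawn independently and will usually differ.
unseeded_a = make_values(5)
unseeded_b = make_values(5)

# Fixed seed: the same data is produced on every call and on every system.
fixed_a = make_values(5, seed=1)
fixed_b = make_values(5, seed=1)
assert fixed_a == fixed_b
```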

Contributes to #17045.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule added the tests (Unit testing for project), tech debt, and non-breaking (Non-breaking change) labels Sep 8, 2023
@vuule self-assigned this Sep 8, 2023
@github-actions bot added the Python (Affects Python cuDF API.) label Sep 8, 2023
@vuule added the improvement (Improvement / enhancement to an existing function) label and removed the Python (Affects Python cuDF API.) label Sep 8, 2023
@github-actions bot added the Python (Affects Python cuDF API.) label Sep 8, 2023
-for arg in args:
-    set_random_null_mask_inplace(arg)
+for idx, arg in enumerate(args):
+    set_random_null_mask_inplace(arg, seed=idx)
Contributor Author

seed=idx to ensure different null masks for different columns
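
A rough sketch of why the per-column seed matters, assuming a hypothetical `random_null_mask` helper built on the standard-library `random` module (it only mimics the idea behind `set_random_null_mask_inplace` and is not the cuDF implementation):

```python
import random


def random_null_mask(size, seed):
    # Hypothetical stand-in: True marks a valid row, False marks a null.
    rng = random.Random(seed)
    return [rng.random() < 0.5 for _ in range(size)]


columns = ["a", "b", "c"]

# One shared fixed seed: every column gets exactly the same null pattern.
same_seed = [random_null_mask(64, seed=0) for _ in columns]
assert same_seed[0] == same_seed[1] == same_seed[2]

# seed=idx: each column gets its own pattern, still reproducible per column.
per_column = [random_null_mask(64, seed=idx) for idx, _ in enumerate(columns)]
assert per_column[0] != per_column[1]
assert per_column[1] == random_null_mask(64, seed=1)
```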

@vuule vuule marked this pull request as ready for review September 11, 2023 18:21
@vuule vuule requested review from a team as code owners September 11, 2023 18:21
@vuule
Contributor Author

vuule commented Sep 11, 2023

CC @galipremsagar

Contributor

@wence- wence- left a comment

I realise that tracking down all uses of random sampling in the test suite is a big task, and providing a fixed default seed everywhere is a pragmatic way to get deterministic tests, but I don't want to break API compatibility with pandas for the two sample calls.

@@ -950,7 +950,7 @@ def sample(
frac: Optional[float] = None,
replace: bool = False,
weights: Union[abc.Sequence, "cudf.Series", None] = None,
-random_state: Union[np.random.RandomState, int, None] = None,
+random_state: Union[np.random.RandomState, int, None] = 1,
Contributor

issue: I am not sure I like this change. It means that user code that previously drew a sequence of independent samples from groupby objects would now get the same result for every sample.
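
A minimal sketch of the behaviour change described in this comment. Both functions are hypothetical stand-ins written with the standard-library `random` module; they are not the cuDF or pandas `sample` methods.

```python
import random


def sample_none_default(rows, n, random_state=None):
    # Hypothetical: the pandas-style default, a fresh seed on every call.
    return random.Random(random_state).sample(rows, n)


def sample_fixed_default(rows, n, random_state=1):
    # Hypothetical: the fixed default proposed in this diff.
    return random.Random(random_state).sample(rows, n)


rows = list(range(100))

# With the None default, repeated calls draw independent samples.
draws_none = [sample_none_default(rows, 3) for _ in range(4)]

# With the fixed default, a loop that expects independent samples
# silently receives the same sample every time.
draws_fixed = [sample_fixed_default(rows, 3) for _ in range(4)]
assert all(draw == draws_fixed[0] for draw in draws_fixed)
```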

@@ -3346,7 +3346,7 @@ def sample(
frac=None,
replace=False,
weights=None,
-random_state=None,
+random_state=1,
Contributor

issue: Similarly here, I don't think we should set a specific seed as the default argument to sample. This also creates a difference in the default API with respect to pandas, which defaults to None (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html).

@@ -80,7 +80,7 @@ def timeseries(
return gdf


-def randomdata(nrows=10, dtypes=None, seed=None):
+def randomdata(nrows=10, dtypes=None, seed=1):
Contributor

note (non-blocking): I am on the fence about these defaults. I suppose it is OK. Perhaps it would be better to make this a keyword-only argument with no default, forcing the caller to specify a seed:

Suggested change
-def randomdata(nrows=10, dtypes=None, seed=1):
+def randomdata(nrows=10, dtypes=None, *, seed):
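
For reference, a sketch of how a required keyword-only `seed` behaves at the call site. The body below is a simplified, hypothetical stand-in (it drops `dtypes` and returns a plain list rather than a DataFrame); only the signature style follows the suggestion.

```python
import random


def randomdata(nrows=10, *, seed):
    # Simplified, hypothetical body; only the signature style matters here.
    rng = random.Random(seed)
    return [rng.random() for _ in range(nrows)]


# Omitting the seed is an error at call time, so no test can forget it.
missing_seed_error = None
try:
    randomdata(5)
except TypeError as exc:
    missing_seed_error = exc

# Passing it positionally is also rejected because of the bare `*`.
positional_seed_error = None
try:
    randomdata(5, 1)
except TypeError as exc:
    positional_seed_error = exc

# The caller must spell out the seed, which keeps the data reproducible.
data = randomdata(5, seed=1)
assert data == randomdata(5, seed=1)
```

The trade-off is that every existing call site has to be updated to pass `seed=...` explicitly.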

@vuule vuule changed the base branch from branch-23.10 to branch-23.12 September 22, 2023 16:22
@vuule vuule added the 0 - Waiting on Author Waiting for author to respond to review label Sep 22, 2023
@vyasr vyasr removed the tech debt label Feb 23, 2024
Labels
0 - Waiting on Author (Waiting for author to respond to review), improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), Python (Affects Python cuDF API.), tests (Unit testing for project)
Projects
Status: In Progress
3 participants