Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a duplicates_strategy argument to read_df… #10130

Merged
merged 9 commits into from
Nov 13, 2024
Merged

Conversation

ptormene
Copy link
Member

@ptormene ptormene commented Nov 8, 2024

…so in case of duplicates we can:

  • keep the first occurrence,
  • keep the last one
  • calculate the average values

…es we can keep the first occurrence, the last one or calculate the average values
@ptormene ptormene added this to the Engine 3.22.0 milestone Nov 8, 2024
@ptormene ptormene self-assigned this Nov 8, 2024
@ptormene ptormene marked this pull request as draft November 8, 2024 15:52
@ptormene ptormene marked this pull request as ready for review November 11, 2024 09:24
- 'error': raise an error (default)
- 'keep_first': keep the first occurrence
- 'keep_last': keep the last occurrence
- 'avg': calculate the average numeric values
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a line

assert duplicates_strategy in ('error', 'keep_first', 'keep_last', 'avg'), duplicates_strategy

just to be protected against mispellings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

def concat_if_different(values):
unique_values = values.dropna().unique().astype(str)
# If all values are identical, return the single unique value,
# otherwise join with "|"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return '|'.join(unique_values) is shorter and does exactly the same thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -982,7 +982,8 @@ def _read_risk3(self):
# NB: get_station_data is extending the complete sitecol
# which then is associated to the site parameters below
self.station_data, self.observed_imts = \
readinput.get_station_data(oq, self.sitecol)
readinput.get_station_data(oq, self.sitecol,
duplicates_strategy='error')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would set duplicates_strategy='avg' so that the aristotle tests will get green.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@micheles
Copy link
Contributor

LGTM, but please also update the changelog.

@ptormene ptormene merged commit bb8d783 into master Nov 13, 2024
7 checks passed
@ptormene ptormene deleted the duplicates-strategy branch November 13, 2024 11:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants