Constant column check with nunique #119

Open
1 of 3 tasks
sbrugman opened this issue Jul 7, 2023 · 2 comments
Labels
new check New check for the linter

Comments

sbrugman commented Jul 7, 2023

Try to respond to as many of the following as possible

Generally describe the pandas behavior that the linter should check for and why that is a problem. Links to resources, recommendations, docs appreciated

The linter should check for nunique() being compared to 1. This pattern is slower than necessary because nunique() cannot short-circuit once a second unique value is found; it simply continues counting.

[Plot: perf_short_circuit]

def setup(n):
    # All n values are distinct, so a second unique value appears immediately.
    return pd.Series(list(range(n)))

[Plot: perf_worst]

def setup(n):
    # Constant except for the last element: the difference shows up only at the end.
    return pd.Series([1] * (n - 1) + [2])
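
The two setups above can be timed locally with a small script. The following is only a rough sketch using the standard-library timeit module; the function names (setup_short_circuit, setup_worst, via_nunique, via_first_element, compare) and the default sizes are made up for illustration and are not part of the original benchmark, which is not shown in this issue.

```python
import timeit

import pandas as pd


def setup_short_circuit(n):
    # All values distinct, as in the perf_short_circuit setup.
    return pd.Series(list(range(n)))


def setup_worst(n):
    # Constant except for the final element, as in the perf_worst setup.
    return pd.Series([1] * (n - 1) + [2])


def via_nunique(series):
    # The pattern the linter would flag.
    return series.nunique() == 1


def via_first_element(series):
    # Compare every value against the first one.
    v = series.values
    return bool((v[0] == v).all())


def compare(n=100_000, repeat=20):
    """Return {case: (seconds_nunique, seconds_first_element)}."""
    timings = {}
    for name, setup in [("short_circuit", setup_short_circuit), ("worst", setup_worst)]:
        s = setup(n)
        # Both checks must agree before their timings are worth comparing.
        assert via_nunique(s) == via_first_element(s)
        timings[name] = (
            timeit.timeit(lambda: via_nunique(s), number=repeat),
            timeit.timeit(lambda: via_first_element(s), number=repeat),
        )
    return timings


if __name__ == "__main__":
    for case, (t_nunique, t_first) in compare().items():
        print(f"{case}: nunique={t_nunique:.4f}s first_element={t_first:.4f}s")
```

Absolute numbers will depend on the machine, the pandas version, and the dtype, so treat the output as indicative only.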

Suggest specific syntax or pattern(s) that should trigger the linter (e.g., .iat)

  • df.column.nunique() == 1
  • df.column.nunique() != 1
  • df.column.nunique(dropna=True) == 1
  • df.column.nunique(dropna=True) != 1
  • df.column.nunique(dropna=False) == 1
  • df.column.nunique(dropna=False) != 1
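
As a rough illustration of how these patterns might be detected, here is a minimal sketch built on Python's standard ast module. It is not the pandas-vet implementation: the function names are hypothetical, and it assumes that matching any call to a method named nunique compared with the literal 1 via == or != is good enough, regardless of the arguments passed.

```python
import ast


def _is_nunique_call(node):
    """True if node is a call of the form `<expr>.nunique(...)`."""
    return (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr == "nunique"
    )


def _is_literal_one(node):
    """True if node is the integer literal 1 (not True, not 1.0)."""
    return isinstance(node, ast.Constant) and type(node.value) is int and node.value == 1


def find_nunique_constant_checks(source):
    """Return sorted (line, column) pairs for `.nunique(...) == 1` or `!= 1`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # Only simple two-operand comparisons: `left <op> right`.
        if not (isinstance(node, ast.Compare) and len(node.ops) == 1):
            continue
        if not isinstance(node.ops[0], (ast.Eq, ast.NotEq)):
            continue
        if _is_nunique_call(node.left) and _is_literal_one(node.comparators[0]):
            hits.append((node.lineno, node.col_offset))
    return sorted(hits)
```

Because the check only looks at the method name, it would also flag nunique() calls on objects that are not pandas Series; a real linter rule would have to accept that trade-off or add more context.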

Suggest specific syntax or pattern(s) that the linter should allow (e.g., .iloc)

Note that the solution is simple when there are no NaN values:

(series.values[0] == series.values).all()

Additional logic is needed when NaN/NA values are present.

For dropna=True

# remove_na_arraylike is a pandas-internal helper (pandas.core.dtypes.missing)
v = series.values
v = remove_na_arraylike(v)
if v.shape[0] == 0:
    return False
return (v[0] == v).all()

For dropna=False

v = series.values
if v.shape[0] == 0:
    return False
return (v[0] == v).all() or not pd.notna(v).any()
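
The two variants above can be folded into a single function. This is a sketch under a few assumptions: the name is_constant and its dropna parameter are hypothetical, Series.dropna() is used as a public stand-in for the internal remove_na_arraylike helper, and the data is assumed to be NumPy-backed (plain int/float columns with NaN as the missing value).

```python
import pandas as pd


def is_constant(series, dropna=True):
    """Hypothetical helper: same answer as `series.nunique(dropna=dropna) == 1`."""
    if dropna:
        # Public stand-in for the internal remove_na_arraylike helper.
        v = series.dropna().values
    else:
        v = series.values
    if v.shape[0] == 0:
        # nunique() is 0 for an empty column, so it is not "constant".
        return False
    if bool((v[0] == v).all()):
        return True
    # With dropna=False, a column holding only NaN counts as one unique value.
    return (not dropna) and not pd.notna(v).any()
```

For example, is_constant(pd.Series([1, 1, 1])) and is_constant(pd.Series([float("nan")] * 2), dropna=False) should both agree with the corresponding nunique() comparison.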

If a dedicated method were included in pandas:

series.is_constant()

Suggest a specific error message that the linter should display (e.g., "Use '.iloc' instead of '.iat'. If speed is important, use numpy indexing")

Consider checking equality to first element instead of .nunique() == 1 for checking for a constant column.

Are you willing to try to implement this check?

  • Yes
  • No
  • Maybe, with some guidance

sbrugman (Author) commented

Note that pandas-dev/pandas#54064 was merged. This adds documentation to the cookbook on how users can check for constant columns.

deppen8 (Owner) commented Aug 9, 2023

Thanks for flagging this, @sbrugman. This would be nice to add, indeed. We can match the PD101 code from ruff if it gets done.

Side note: I had no idea ruff wrapped in the pandas-vet checks!

This repo isn't very active (pandas doesn't change much these days) but it is still alive. I just made a number of long-overdue improvements to docs and CI/CD that should hopefully keep things current for a while.
