Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bool(x == x) when x is masked? #28

Open
mdhaber opened this issue Nov 25, 2024 · 9 comments
Open

bool(x == x) when x is masked? #28

mdhaber opened this issue Nov 25, 2024 · 9 comments

Comments

@mdhaber
Copy link
Owner

mdhaber commented Nov 25, 2024

According to NumPy:

import numpy as np
y = np.ma.MaskedArray(1, mask=True)
bool(y == y)  # False

Yet:

x = np.ma.MaskedArray([1], mask=True)
bool(x == x)  # True
bool(np.ma.all(x==x))  # False
bool(np.all([]))  # True

Let me try to interpret this:

  • bool(y == y) is False because y == y is a masked 0D thing, and any masked 0D thing is "falsy".
  • bool(np.ma.all(x==x)) is False because np.ma.all(x==x) reduces to a masked 0D thing.
  • bool(np.all([])) is True for the same reason that all([]) is True.
  • bool(x == x) being True is erroneous all the ways I try to look at it. x == x is a 1D thing with one masked element. I think bool(x == x) is most comparable to bool([]), which should be False.

All that said, I think it was choice to make bool of a masked 0D thing "falsy". Similarly, what should be returned when we try to get an int/float/complex out of a masked number? It would probably be most consistent to raise an error about the ambiguity whenever bool/int/float/complex is used on a masked array with only a masked element.


All this came up when I was thinking about whether masked elements should be treated as the same or distinct when computing unique_ functions.

The note in the documentation (e.g. here) of these functions states:

Uniqueness should be determined based on value equality (see equal()).

But I think this is still ambiguous for the reasons above, so I think we have a choice.

I'll just say that treating all masked elements as the same is easier because then the implementation is as simple as replacing the masked values with a sentinel that is not already in the data. If the masked elements are distinct, I see a few options:

  • use a different sentinel value for each masked element (which may require promoting the type - e.g. bool has only two values)
  • use np.nan as the sentinel value because nans are treated as distinct (but this only works for inexact types when there are no NaNs in the data),
  • performing the calculation on the compressed array, then append to the results:
    • values - one masked element for each masked element of the input
    • indices - the indices of all the masked elements of the input (e.g. nonzero(mask))
    • count - a separate 1 for each masked element

inverse_indices is a bit more complicated because we can't just append to the results of the compressed array. I think we have to:

  • create a new array the size of the argument.
  • wherever there was a non-masked value, use inverse_indices from the compressed array
  • replace the elements where there was a masked value with arange(n_nonmasked, n_total) (because in values, the masked elements appear after n_nonmasked values).

And all that headache for something that doesn't make much sense to begin with, IMO!

@keewis
Copy link
Collaborator

keewis commented Dec 7, 2024

I think bool(x == x) is most comparable to bool([]), which should be False.

As far as I can tell, the Array API only defines the __<python scalar type>__ methods for 0D arrays, so I think we can safely assume this means we should raise (which is what numpy.ndarray does)

It would probably be most consistent to raise an error about the ambiguity whenever bool/int/float/complex is used on a masked array with only a masked element

I think this is where we can come up with our own strategy. Raising does make sense to me, but we could just as well return _sentinel (if we still have that) as the requested type.

I will need to take some more time to understand what equality would mean in the context of unique_* and inverse_indices.

@mdhaber
Copy link
Owner Author

mdhaber commented Dec 28, 2024

@lucascolley can you weigh in from the perspective of whether masked values should be treated as distinct in the set functions? It's easier to implement the set functions if they are the same; see bottom of #28 (comment).

@lucascolley
Copy link
Collaborator

I think they should be treated as absent from the set. So if a value is masked, it shouldn't contribute to a count in unique_counts. Does that make sense, or am I missing something?

@mdhaber
Copy link
Owner Author

mdhaber commented Dec 28, 2024

Maybe. If we want to be able to recreate the original set (including masked elements) from the inverse_indices, we can't just strip the masked elements out completely, right?

(Otherwise, it sounds like what you're proposing is that the set functions are as simple as removing the masked values completely, performing the function, and returning the results as MArrays with no elements masked, right?)

@keewis
Copy link
Collaborator

keewis commented Dec 28, 2024

If we want to be able to recreate the original set (including masked elements) from the inverse_indices, we can't just strip the masked elements out completely, right?

that, or the result of inverse_indices needs to contain information on the missing values. However, for that we'd have to allow indexing with a masked array, which we wanted to avoid before.

@lucascolley
Copy link
Collaborator

(Otherwise, it sounds like what you're proposing is that the set functions are as simple as removing the masked values completely, performing the function, and returning the results as MArrays with no elements masked, right?)

That is what I was proposing, but I hadn't considered that you may want to reconstruct the original masked array.

@mdhaber
Copy link
Owner Author

mdhaber commented Dec 28, 2024

Right. The standard says

the indices from the set of unique elements that reconstruct x

With that in mind, do you prefer one of the other options?

@lucascolley
Copy link
Collaborator

And all that headache for something that doesn't make much sense to begin with, IMO!

My preference would be to give up on being able to reconstruct the original array (just throw masked elements away), until somebody makes a feature request. I think it would be a cool thing to think through if there is a use-case. But probably not worth the hassle without one!

@mdhaber
Copy link
Owner Author

mdhaber commented Dec 29, 2024

There is an option on the table for allowing reconstruction that is simple: treat all masked elements as the same. No headache. I think I will do that for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants