Distributions: unequal bin width #5131
Comments
It is not entirely clear what this visualization should look like. The variable has values 1, 2, 3, 4, 5, 6, 7, 8, 24. For variables with so few distinct values, the widget can also assign one bin to each value. But what is the bin width in this case? (Note that the x axis is not categorical.)
Or, in general, consider a variable whose distinct values are [1.5, 1.8, 2, 2.34, 10]. With one value per bin, what is the expected bin width? I would tend to say this works as expected, but with unexpected results. We can add an information icon, explaining that each bin represents one unique value.
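To make the question concrete, here is a minimal sketch of the gaps between consecutive distinct values for the two examples above. It assumes, purely for illustration, that a one-value-per-bin bar would stretch from its value to the next distinct one; `gaps_between_distinct` is a hypothetical helper, not part of the widget.

```python
def gaps_between_distinct(values):
    """Return the distances between consecutive distinct values, in sorted order."""
    distinct = sorted(set(values))
    # Pair each distinct value with its successor and take the difference.
    return [b - a for a, b in zip(distinct, distinct[1:])]


# Integer example from the thread: every gap is 1 except the last one.
print(gaps_between_distinct([1, 2, 3, 4, 5, 6, 7, 8, 24]))
# → [1, 1, 1, 1, 1, 1, 1, 16]

# Float example: gaps are rounded only for display.
print([round(g, 2) for g in gaps_between_distinct([1.5, 1.8, 2, 2.34, 10])])
# → [0.3, 0.2, 0.34, 7.66]
```

Under that assumption there is no single "natural" width: the gaps differ by a factor of 16 in the first case and almost 40 in the second.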
This doesn't happen only with a single value per bin. I have a dataset with 56108 instances. The default visualization for a certain variable creates the following bins:
I would expect the following default bins:
Alternatively, I would expect the bin not to stretch more than the other bins. If a bin represents a single value, then its width should be the same as other bins. For RAD, the first seven bins should be a single large bin, or the final two bins should be two narrow bins with empty space in between. No?
If the bins' boundaries are not round decimal numbers, I guess they must represent single values. Do you by chance have just 9 distinct values in your data? (We're talking about single values, not single instances, right?)
If a bin represents a single value, then all bins represent single values and thus have various widths. In this particular case, all widths except the last were 1. But here are 17 instances from heart disease with 9 distinct values. Bar widths are 1, 2, 3 or 4. All we can do is set the width of every bin to the smallest distance between two values (what is currently shown as the narrowest bin). I've done so in #5139. Please report how this looks on your data.
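The idea of using the smallest distance between two values as the common width can be sketched roughly as follows. This is only an illustration of the rule as described in the comment, not the code from #5139; `uniform_bar_width` is a hypothetical name.

```python
def uniform_bar_width(values):
    """Smallest distance between two distinct values, used as one width for all bars.

    Returns None when there are fewer than two distinct values,
    because no distance is defined in that case.
    """
    distinct = sorted(set(values))
    if len(distinct) < 2:
        return None
    return min(b - a for a, b in zip(distinct, distinct[1:]))


# Values from the first comment: every bar would get width 1,
# leaving empty space between 8 and 24.
print(uniform_bar_width([1, 2, 3, 4, 5, 6, 7, 8, 24]))  # → 1
```

With this rule no bar can overlap its neighbour, since the width never exceeds the closest pair of values, but widely spaced values end up as narrow bars separated by gaps.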
You're right, I confused the two. I'll check the PR.
Distributions in some cases shows unequal bin widths, making the histogram confusing. I would expect all bins (bars) to be of equal width.
File (housing) - Distributions. Select the RAD column with bin width at minimum. It seems to happen when there are a lot of integer-like floats and only some decimal data, e.g. [20.0, 20.0, 20.5, 21.0, 21.0, 21.0, 21.6, 24.0, 24.0, 24.0].
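For comparison, this is roughly what equal-width binning of the example list would give. It is a plain-Python sketch of the expected behaviour, not the widget's binning code; `equal_width_counts` and the choice of 4 bins are assumptions made for illustration.

```python
def equal_width_counts(values, n_bins):
    """Split [min, max] into n_bins bins of equal width and count values per bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("need at least two distinct values to define a bin width")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum falls on the right edge, so clamp it into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


data = [20.0, 20.0, 20.5, 21.0, 21.0, 21.0, 21.6, 24.0, 24.0, 24.0]
edges, counts = equal_width_counts(data, 4)
print(edges)   # → [20.0, 21.0, 22.0, 23.0, 24.0]
print(counts)  # → [3, 4, 0, 3]
```

Every bar has width 1.0 here, and the interval with no data simply shows up as an empty bin.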