Distributions: unequal bin width #5131

Closed
ajdapretnar opened this issue Dec 16, 2020 · 6 comments · Fixed by #5139

@ajdapretnar
Contributor

  • What's wrong?

In some cases, Distributions shows bins of unequal width, which makes the histogram confusing. I would expect all bins (bars) to be of equal width.

[Image: Screen Shot 2020-12-16 at 10 02 49]

  • How can we reproduce the problem?

File (housing) - Distributions. Select the RAD column with bin width at minimum. It seems to happen when there are a lot of integer-like floats and only some decimal data, e.g. [20.0, 20.0, 20.5, 21.0, 21.0, 21.0, 21.6, 24.0, 24.0, 24.0].

  • What's your environment?
  • Operating system: OSX High Sierra
  • Orange version: 3.28.dev
  • How you installed Orange: conda/pip
@ajdapretnar added the bug report label Dec 16, 2020
@janezd added the bug and good first issue labels and removed the bug report label Dec 18, 2020
@janezd
Contributor

janezd commented Dec 18, 2020

It is not entirely clear what this visualization should look like.

The variable has values 1, 2, 3, 4, 5, 6, 7, 8, 24. For variables with so few distinct values, the widget can also assign one bin to each value. But what is the bin width in this case? (Note that the x axis is not categorical.)

@janezd
Contributor

janezd commented Dec 18, 2020

Or, in general, consider a variable whose distinct values are [1.5, 1.8, 2, 2.34, 10]. With one value per bin, what is the expected bin width?

I would tend to say this works as expected, but with unexpected results. We can add an information icon, explaining that each bin represents one unique value.
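
To make the question concrete, here is a minimal sketch (plain Python, not code from the widget) that lists the distances between consecutive distinct values for the two examples above. The helper name `gaps_between_distinct` is made up for illustration. If each bin is tied to one distinct value and stretches to the next one, these gaps are what would show up as bar widths.

```python
def gaps_between_distinct(values):
    """Return the distances between consecutive distinct values."""
    distinct = sorted(set(values))
    # Rounding only hides floating-point noise such as 0.30000000000000004.
    return [round(b - a, 10) for a, b in zip(distinct, distinct[1:])]


# RAD values quoted in the comment above.
rad_gaps = gaps_between_distinct([1, 2, 3, 4, 5, 6, 7, 8, 24])
# The second example from the comment above.
example_gaps = gaps_between_distinct([1.5, 1.8, 2, 2.34, 10])

print(rad_gaps)      # seven gaps of 1 followed by one gap of 16
print(example_gaps)  # four different gaps, the last one much larger
```

For RAD the single large gap between 8 and 24 is what makes one bar look far wider than the rest.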

@ajdapretnar
Contributor Author

This doesn't happen only with a single value per bin. I have a dataset with 56108 instances. The default visualization for a certain variable creates the following bins:

  • (, 48.67) (5383 instances)
  • [48.67, 49) (143 instances)
  • [49, 50) (7426 instances)
  • [50, 50.33) (246 instances)
  • [50.33, 50.5) (75 instances)
  • [50.5, 50.67) (2723 instances)
  • [50.67, 51) (70 instances)
  • [51, 52) (18411 instances)
  • [52,) (3979 instances)

I would expect the following default bins:
(, 49), [49, 50), [50, 51), [51, 52), [52,)
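
As a rough illustration of what the expected binning would show, the sketch below merges the counts listed above into the expected bins. It uses only the numbers from this comment; the helper `merge_counts` is hypothetical and assumes every reported bin lies entirely inside one expected bin, which holds for these edges.

```python
from bisect import bisect_right

# Bins and counts exactly as listed above; None marks an open end.
reported = [
    (None, 48.67, 5383),
    (48.67, 49, 143),
    (49, 50, 7426),
    (50, 50.33, 246),
    (50.33, 50.5, 75),
    (50.5, 50.67, 2723),
    (50.67, 51, 70),
    (51, 52, 18411),
    (52, None, 3979),
]
# Inner edges of the expected bins (, 49), [49, 50), [50, 51), [51, 52), [52,).
expected_edges = [49, 50, 51, 52]


def merge_counts(bins, edges):
    """Add each reported count to the expected bin containing its lower edge."""
    counts = [0] * (len(edges) + 1)
    for lower, _upper, count in bins:
        index = 0 if lower is None else bisect_right(edges, lower)
        counts[index] += count
    return counts


merged = merge_counts(reported, expected_edges)
print(merged)  # [5526, 7426, 3114, 18411, 3979]
```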

@ajdapretnar
Contributor Author

Alternatively, I would expect the bin not to stretch more than the other bins. If a bin represents a single value, then its width should be the same as other bins. For RAD, the first seven bins should be a single large bin, or the final two bins should be two narrow bins with empty space in between. No?

@janezd
Contributor

janezd commented Dec 18, 2020

> This doesn't happen only for single value per bin.

If the bins' boundaries are not round decimal numbers, I guess they must represent single values. Don't you by chance have just 9 distinct values in your data? (We're not talking about single instances but about single values, right?)

> If a bin represents a single value, then its width should be the same as other bins.

If a bin represents a single value, then all bins represent single values and thus have various widths. In this particular case, all widths except the last were 1. But here are 17 instances from heart disease with 9 distinct values. Bar widths are 1, 2, 3 or 4.

[Image: Screenshot 2020-12-18 at 17 13 05]

All we can do is to let the widths of all bins equal the smallest distance between two values (what is currently shown as the narrowest bin).

I've done so in #5139. Please report how this looks on your data.
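
The idea of giving every bar the smallest distance between two values can be sketched as follows. This is an illustration only and not the code from #5139: the function name `equal_width_bars` is made up, and centering each bar on its value is an assumption, since the thread does not say how the bars are aligned.

```python
def equal_width_bars(values):
    """Return (width, bars) with one bar per distinct value.

    Every bar gets the same width: the smallest distance between two
    consecutive distinct values. Centering bars on their values is an
    assumption made for this sketch.
    """
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values")
    width = min(b - a for a, b in zip(distinct, distinct[1:]))
    bars = [(v - width / 2, v + width / 2) for v in distinct]
    return width, bars


# RAD values quoted earlier in the thread.
width, bars = equal_width_bars([1, 2, 3, 4, 5, 6, 7, 8, 24])
print(width)     # 1
print(bars[0])   # (0.5, 1.5)
print(bars[-1])  # (23.5, 24.5)
```

With this choice the bar for 24 is as narrow as the others and the range between 8 and 24 stays empty.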

@ajdapretnar
Contributor Author

> We're not talking about single instances but about single values, right?

You're right, I confused the two.

I'll check the PR.

@janezd janezd self-assigned this Dec 18, 2020