Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][ENH] Scatterplot, HeatMap, TreeGraph, ConfusionMatrix and Unsupervised widgets: Output Flagged Data #1655

Merged
merged 15 commits into from
Oct 19, 2016

Conversation

VesnaT
Copy link
Contributor

@VesnaT VesnaT commented Oct 12, 2016

No description provided.

@janezd
Copy link
Contributor

janezd commented Oct 12, 2016

I haven't yet checked the content of the PR (obviously), but before we merge it we should perhaps talk about the signal name again. I don't think that the verb "to flag" appears in the dictionary with that meaning. Even the use of word "flag" for some kind of marker is IMHO limited to computer science, so the name "flagged data" wouldn't mean anything to an outsider.

What is wrong with "marked"?

Vesna, sorry if this will require some (hopefully trivial) refactoring.

@codecov-io
Copy link

codecov-io commented Oct 12, 2016

Current coverage is 89.38% (diff: 100%)

Merging #1655 into master will increase coverage by <.01%

@@             master      #1655   diff @@
==========================================
  Files            79         79          
  Lines          8589       8593     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           7677       7681     +4   
  Misses          912        912          
  Partials          0          0          

Sunburst

Powered by Codecov. Last update 4d1ea03...5610750

@VesnaT
Copy link
Contributor Author

VesnaT commented Oct 12, 2016

No worries.. I've intentionally made this PS 'small' (only four widgets are modified).

@VesnaT VesnaT changed the title [ENH] Scatterplot and Unsupervised widgets: Output Flagged Data [WIP][ENH] Scatterplot and Unsupervised widgets: Output Flagged Data Oct 12, 2016
@VesnaT VesnaT changed the title [WIP][ENH] Scatterplot and Unsupervised widgets: Output Flagged Data [ENH] Scatterplot and Unsupervised widgets: Output Flagged Data Oct 13, 2016
@VesnaT VesnaT force-pushed the flagged_data branch 4 times, most recently from cd69ccc to cafa07c Compare October 14, 2016 12:38
@VesnaT VesnaT changed the title [ENH] Scatterplot and Unsupervised widgets: Output Flagged Data [ENH] Scatterplot, TreeGraph, ConfusionMatrix and Unsupervised widgets: Output Flagged Data Oct 14, 2016
@VesnaT VesnaT changed the title [ENH] Scatterplot, TreeGraph, ConfusionMatrix and Unsupervised widgets: Output Flagged Data [ENH] Scatterplot, ConfusionMatrix and Unsupervised widgets: Output Flagged Data Oct 14, 2016
@VesnaT VesnaT changed the title [ENH] Scatterplot, ConfusionMatrix and Unsupervised widgets: Output Flagged Data [ENH] Scatterplot, TreeGraph, ConfusionMatrix and Unsupervised widgets: Output Flagged Data Oct 14, 2016
@VesnaT VesnaT changed the title [ENH] Scatterplot, TreeGraph, ConfusionMatrix and Unsupervised widgets: Output Flagged Data [ENH] Scatterplot, HeatMap, TreeGraph, ConfusionMatrix and Unsupervised widgets: Output Flagged Data Oct 14, 2016
if name not in names:
return name
counts = [int(re.match(r"(" + name + " )(\d{1,}$)", n).group(2))
for n in names if re.match(r"(" + name + " )(\d{1,}$)", n)] + [1]
Copy link
Contributor

@janezd janezd Oct 14, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this:

counts = max((int(mo.group(2))
              for mo in re.finditer(r"(" + name + " )(\d{1,}$)", n)),
             default=0)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matter of taste: I'd write r"({})(\d{{1,}}$".format(name)

@@ -228,6 +228,11 @@ def get_instances(self, nodes):
if subsets:
return self.instances[np.unique(np.hstack(subsets))]

def get_indices(self, nodes):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_instances could call this function. Don't forget to handle the case when get_indices returns None.

self.assertEqual(len(flagged), len(self.zoo))
self.assertEqual(0, np.sum([i[FLAGGED_FEATURE_NAME] for i in flagged]))

def test_cascade_flagged_tables(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be too smart by half and replace your function that uses regular expressions with one that uses the first unoccupied name. This test wouldn't fail because there are no "holes". Can you add some code to this test to remove the second meta and then "flag" the table again, so my smart idea would fail?

import numpy as np
from Orange.data import Table, Domain, DiscreteVariable

FLAGGED_SIGNAL_NAME = "Flagged Data"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it again, I started liking the constant because it will indeed make us stick to the same name in all widgets.

The name is not perfect, though. Not only will we change "flagged" to something else, but it also suggests that it is a name of a flagged signal. ANNOTATED_DATA_SIGNAL_NAME = "Data" is better (than ANNOTATED_SIGNAL_NAME) but awfully long. Think about it...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If nothing else, the module belongs within Orange.widgets because it includes a name of the signal, hence it is related to widgets. :)

self.assertEqual(0, np.sum([i[FLAGGED_FEATURE_NAME] for i in flagged]))

# select data points
points = random.sample(range(0, len(self.iris)), 20)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer deterministic tests. Randomly choose some indices instead of choosing some indices at random. (https://xkcd.com/221/).


# check selected data output
selected = self.get_output("Data")
self.assertEqual(len(selected), len(points))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about testing that the correct instances were chosen? There may be a better way to do it, but I somewhere used something like np.testing.assert_almost_equal(selected.X, self.iris.X[points]).

self.assertEqual(0, np.sum([i[FLAGGED_FEATURE_NAME] for i in flagged]))

# select data points
points = random.sample(range(0, len(self.iris)), 20)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as in DistanceMap.

@janezd
Copy link
Contributor

janezd commented Oct 14, 2016

Github is showing some of my comments as outdated, although you haven't made any further commits. Please check those, too.

Apart from these minor suggestions, I like the PR, in particular your factoring out of the parts of the tests. It would be even greater if you could simulate, say, selection action, but I know this is probably too hard.

Please tell me/us when you make the changes, so we don't wait too long with merging, since rebasing dozens of widgets is not that much fun.

@VesnaT VesnaT changed the title [ENH] Scatterplot, HeatMap, TreeGraph, ConfusionMatrix and Unsupervised widgets: Output Flagged Data [WIP][ENH] Scatterplot, HeatMap, TreeGraph, ConfusionMatrix and Unsupervised widgets: Output Flagged Data Oct 17, 2016
@VesnaT
Copy link
Contributor Author

VesnaT commented Oct 17, 2016

The comments are outdated because the code was moved to a Mixin in one of the following commits (441b15a), since it was very similar for all widgets. I could have rebased, but wanted to keep the code in case someone didn't like the Mixin idea.
I thought I did simulate the selection...

@VesnaT
Copy link
Contributor Author

VesnaT commented Oct 18, 2016

Done.

domain = Domain(data.domain.attributes, data.domain.class_vars, metas)
annotated = np.zeros((len(data), 1))
if selected_indices is not None:
annotated[selected_indices] = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be 0 or 1? If nothing is selected, all instances have to have Selected=No, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be 0 or 1? If nothing is selected, all instances have to have Selected=No, no?

I'm stupid. Please ignore.

selected = [i for i, t in enumerate(zip(
self.widget.results.actual, self.widget.results.predicted[0]))
if t in indices]
self.selected_indices = self.widget.results.row_indices[selected]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be nicer if _select_data returned selected_indices instead of (ab?)using the instance's attributes for semi-global data storage?

self.same_input_output_domain = True
self.selected_indices = []

def test_outputs(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the way you factor out the tests.

def _compare_selected_annotated_domains(self, selected, annotated):
selected_vars = selected.domain.variables + selected.domain.metas
annotated_vars = annotated.domain.variables + annotated.domain.metas
self.assertTrue(all((var in annotated_vars for var in selected_vars)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tests whether annotated.domain.variables + annotated.domain.metas are a subset (<=) of selected.domain.variables + selected.domain.metas. Doing it explicitly, using sets, would be more obvious and easier to read, I guess.

@@ -1092,10 +1099,12 @@ def commit(self):

if not selected_indices:
self.send("Selected Data", None)
self.send("Other Data", None)
annotated_data = create_annotated_table(items, selected_indices) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really an issue, but since you're going to make another commit anyway, can you replace selected_indices with [], so that it's obvious this call will always select all (or no) instances.

unselected_data = data[~mask]
if self.append_clusters:
def remove_other_value(vars_):
vars_ = [var for var in vars_]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not copy?

Copy link
Contributor

@janezd janezd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through all changes and widgets. Since none of my suggestions are substantial, I'm approving the request, but you can still follow them if you decide so.

@VesnaT
Copy link
Contributor Author

VesnaT commented Oct 19, 2016

I will fix the suggested in the next PR.

@astaric astaric modified the milestone: 3.3.9 Nov 28, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants