-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP][ENH] Scatterplot, HeatMap, TreeGraph, ConfusionMatrix and Unsupervised widgets: Output Flagged Data #1655
Conversation
I haven't yet checked the content of the PR (obviously), but before we merge it we should perhaps talk about the signal name again. I don't think that the verb "to flag" appears in the dictionary with that meaning. Even the use of word "flag" for some kind of marker is IMHO limited to computer science, so the name "flagged data" wouldn't mean anything to an outsider. What is wrong with "marked"? Vesna, sorry if this will require some (hopefully trivial) refactoring. |
Current coverage is 89.38% (diff: 100%)@@ master #1655 diff @@
==========================================
Files 79 79
Lines 8589 8593 +4
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 7677 7681 +4
Misses 912 912
Partials 0 0
|
No worries.. I've intentionally made this PS 'small' (only four widgets are modified). |
cd69ccc
to
cafa07c
Compare
if name not in names: | ||
return name | ||
counts = [int(re.match(r"(" + name + " )(\d{1,}$)", n).group(2)) | ||
for n in names if re.match(r"(" + name + " )(\d{1,}$)", n)] + [1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this:
counts = max((int(mo.group(2))
for mo in re.finditer(r"(" + name + " )(\d{1,}$)", n)),
default=0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Matter of taste: I'd write r"({})(\d{{1,}}$".format(name)
@@ -228,6 +228,11 @@ def get_instances(self, nodes): | |||
if subsets: | |||
return self.instances[np.unique(np.hstack(subsets))] | |||
|
|||
def get_indices(self, nodes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_instances
could call this function. Don't forget to handle the case when get_indices
returns None
.
self.assertEqual(len(flagged), len(self.zoo)) | ||
self.assertEqual(0, np.sum([i[FLAGGED_FEATURE_NAME] for i in flagged])) | ||
|
||
def test_cascade_flagged_tables(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could be too smart by half and replace your function that uses regular expressions with one that uses the first unoccupied name. This test wouldn't fail because there are no "holes". Can you add some code to this test to remove the second meta and then "flag" the table again, so my smart idea would fail?
import numpy as np | ||
from Orange.data import Table, Domain, DiscreteVariable | ||
|
||
FLAGGED_SIGNAL_NAME = "Flagged Data" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it again, I started liking the constant because it will indeed make us stick to the same name in all widgets.
The name is not perfect, though. Not only will we change "flagged" to something else, but it also suggests that it is a name of a flagged signal. ANNOTATED_DATA_SIGNAL_NAME = "Data"
is better (than ANNOTATED_SIGNAL_NAME
) but awfully long. Think about it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If nothing else, the module belongs within Orange.widgets
because it includes a name of the signal, hence it is related to widgets. :)
self.assertEqual(0, np.sum([i[FLAGGED_FEATURE_NAME] for i in flagged])) | ||
|
||
# select data points | ||
points = random.sample(range(0, len(self.iris)), 20) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer deterministic tests. Randomly choose some indices instead of choosing some indices at random. (https://xkcd.com/221/).
|
||
# check selected data output | ||
selected = self.get_output("Data") | ||
self.assertEqual(len(selected), len(points)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about testing that the correct instances were chosen? There may be a better way to do it, but I somewhere used something like np.testing.assert_almost_equal(selected.X, self.iris.X[points])
.
self.assertEqual(0, np.sum([i[FLAGGED_FEATURE_NAME] for i in flagged])) | ||
|
||
# select data points | ||
points = random.sample(range(0, len(self.iris)), 20) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as in DistanceMap
.
Github is showing some of my comments as outdated, although you haven't made any further commits. Please check those, too. Apart from these minor suggestions, I like the PR, in particular your factoring out of the parts of the tests. It would be even greater if you could simulate, say, selection action, but I know this is probably too hard. Please tell me/us when you make the changes, so we don't wait too long with merging, since rebasing dozens of widgets is not that much fun. |
The comments are outdated because the code was moved to a Mixin in one of the following commits (441b15a), since it was very similar for all widgets. I could have rebased, but wanted to keep the code in case someone didn't like the Mixin idea. |
Since there are only clusters in Selected Data, 'Other' should be removed from its domain. The value is still present in Flagged Data domain.
Done. |
domain = Domain(data.domain.attributes, data.domain.class_vars, metas) | ||
annotated = np.zeros((len(data), 1)) | ||
if selected_indices is not None: | ||
annotated[selected_indices] = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be 0 or 1? If nothing is selected, all instances have to have Selected=No, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be 0 or 1? If nothing is selected, all instances have to have Selected=No, no?
I'm stupid. Please ignore.
selected = [i for i, t in enumerate(zip( | ||
self.widget.results.actual, self.widget.results.predicted[0])) | ||
if t in indices] | ||
self.selected_indices = self.widget.results.row_indices[selected] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be nicer if _select_data
returned selected_indices
instead of (ab?)using the instance's attributes for semi-global data storage?
self.same_input_output_domain = True | ||
self.selected_indices = [] | ||
|
||
def test_outputs(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the way you factor out the tests.
def _compare_selected_annotated_domains(self, selected, annotated): | ||
selected_vars = selected.domain.variables + selected.domain.metas | ||
annotated_vars = annotated.domain.variables + annotated.domain.metas | ||
self.assertTrue(all((var in annotated_vars for var in selected_vars))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tests whether annotated.domain.variables + annotated.domain.metas
are a subset (<=
) of selected.domain.variables + selected.domain.metas
. Doing it explicitly, using sets, would be more obvious and easier to read, I guess.
@@ -1092,10 +1099,12 @@ def commit(self): | |||
|
|||
if not selected_indices: | |||
self.send("Selected Data", None) | |||
self.send("Other Data", None) | |||
annotated_data = create_annotated_table(items, selected_indices) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really an issue, but since you're going to make another commit anyway, can you replace selected_indices
with []
, so that it's obvious this call will always select all (or no) instances.
unselected_data = data[~mask] | ||
if self.append_clusters: | ||
def remove_other_value(vars_): | ||
vars_ = [var for var in vars_] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not copy
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through all changes and widgets. Since none of my suggestions are substantial, I'm approving the request, but you can still follow them if you decide so.
I will fix the suggested in the next PR. |
No description provided.