Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Move Louvain clustering from prototypes to core #3111

Merged
merged 14 commits into from
Aug 2, 2018

Conversation

pavlin-policar
Copy link
Collaborator

Issue

Fixes #3110

Description of changes

Move Louvain clustering widget from prototypes

Includes
  • Code changes
  • Tests
  • Documentation

@pavlin-policar pavlin-policar requested a review from lanzagar July 3, 2018 11:20
@BlazZupan BlazZupan self-requested a review July 6, 2018 07:10
_DEFAULT_K_NEIGHBORS = 30


METRICS = [('Euclidean', 'l2'), ('Manhattan', 'l1')]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these the only two metrics that are supported?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the time being, only these two metrics are supported because of the way sklearn computes nearest neighbours. The only other metrics I had considered adding was the cosine, pearsonr and spearmanr distances, but unfortunately the fast tree methods for finding nearest neighbours in sklearn don't support these and fall back to the O(n^2) brute force approach, which is prohibitively expensive for larger data sets, so we decided it is better to leave them out.

If you feel any of the other supported metrics should be added, they are listed in the source: KDTree and BallTree. Most of these don't appear anywhere else in Orange, so including them doesn't really make sense.

self._invalidate_graph()
self.commit()

def _update_k_neighbors(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as you do not expect these methods will be expanded in the future with code specific to changing one of the settings I don't see a reason to have 3 methods with the same body but different names.
I think having just one _update_graph and using it as the callback for all 3 controls would be better.


import Orange
from Orange.data import Table
from community import best_partition
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this import belongs to the above group of external libraries now

@lanzagar
Copy link
Contributor

lanzagar commented Jul 6, 2018

The widget works well as far as I was able to test it.

What I am missing is documentation - at least a stub with something completely basic that can be expanded later.
Also, I was wondering what numbers to set for Resolution and was thinking that a tooltip explaining what I am doing there wouldn't be bad either. But maybe when the widget has some documentation, that will be enough.

Another thought was, why not a slider for Resolution like for PCA components. If anything I would sooner expect a spinbox for the number of components (everybody understands the number, I know what I am doing when setting e.g. 5 or 20). For Resolution, I have no idea what 0.5 means and how much bigger 1.5 is or 2 (or why 5 is the max). So a resolution slider from smaller scale communities to larger scale could make more sense.

@lanzagar
Copy link
Contributor

lanzagar commented Jul 6, 2018

Oh, and the only issue I have with the functionality is that the pca slider now commits on any change so dragging it starts a lot of computations and isn't smooth. Commit should happen when I stop dragging.

@codecov-io
Copy link

codecov-io commented Jul 17, 2018

Codecov Report

Merging #3111 into master will increase coverage by 0.05%.
The diff coverage is 92.55%.

@@            Coverage Diff             @@
##           master    #3111      +/-   ##
==========================================
+ Coverage   82.54%   82.59%   +0.05%     
==========================================
  Files         337      340       +3     
  Lines       58431    58767     +336     
==========================================
+ Hits        48229    48540     +311     
- Misses      10202    10227      +25

@pavlin-policar
Copy link
Collaborator Author

Using sliders makes more sense, I agree. I've changed it now. As far as the documentation goes, I can write some documentation, but I can't for the life of me figure out how to incorporate that into Orange.

@ajdapretnar
Copy link
Contributor

ajdapretnar commented Jul 17, 2018

Ajda to the rescue! 🏇
Here is where you put the .rst file of your widget: https://github.com/biolab/orange3/tree/master/doc/visual-programming/source/widgets
Best thing is to copy an existing file and just change the content, so you keep the form. Then just include it in your PR. You will have to run make html to see how the output looks like.
Also important:

  • stamper for numbers on images can be found here: https://github.com/biolab/orange-tools
  • in that same repo you will find trim.py that removes a background of the image (say for complex workflows)
  • you will have to index images for tests to pass (I recommend Gimp)

If you have any questions, just drop by. I would be happy if you just provide the key content, I can later take care of the rest.

@pavlin-policar
Copy link
Collaborator Author

Great, thanks! I had no idea this existed.

I've added a fairly minimal stub explaining what the controls do and how the widget works. I've tried to make the stamping consistent with the other widgets. Let me know if I should add anything or anything is poorly worded.

Copy link
Contributor

@ajdapretnar ajdapretnar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some small comments to the documentation. You can opt not to include the example as the widget is easy to use, but perhaps it could be nice if an expert did it. :) However, it is best if you run make html in the visual-programming folder to inspect where the documentation needs to be fixed. The log will give you hints.

Another a small note: it would be nice to include a report option in the widget, too. Just implement send_report function and report on parameters.

@@ -0,0 +1,38 @@
Louvain Clustering
=======
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to extend the signs to encompass at minimum the entire title.


.. figure:: images/Louvain-stamped.png

1. PCA processing is typically be applied to the original data to remove noise.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typically be applied

values recover clusters containing more data points.
5. When *Apply Automatically* is ticked, the widget will automatically
communicate all changes. Alternatively, click *Apply*.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the bottom it is great to provide a minimum workflow for the widget. Or a cool way how to use it, preferably on a simple data set that exists in Orange.

@pavlin-policar
Copy link
Collaborator Author

@ajdapretnar I've added the report and a simple example showing off integration with the Network addon. I also don't know what indexing images is, or if it's a problem in this case - if it is I can look into it. Is this all right?

@ajdapretnar
Copy link
Contributor

🤔 I think tests won't pass if images aren't indexed. This is what I had in mind. It is as simple as opening Gimp and using Image - Mode - Indexed (Convert). Then File - Export As (and click all the Export buttons that appear). This is what you add then to the repo.

@pavlin-policar
Copy link
Collaborator Author

@lanzagar I've fixed the images, all the tests are passing. The only thing remaining are some pointless pylint errors. Should I try to fix them and add ignore statements where necessary or is this fine?

@lanzagar
Copy link
Contributor

Most of the pylint errors seem to be import related and easy to fix and improve the code.
There is one bare-except that probably should be except ImportError:.
The cell-var-from-loop warning might not be relevant in this case, but is one of those whatthehellishappening python things and even coupled with threading here - so I suggest to make it as clear and bug proof as possible :)
The only "pointless" warning that should probably be ignored is the broad-except (we actually want it) - a pylint ignore comment wouldn't be bad, but a single pylint warning is acceptable too.

current_progress = idx / num_tasks
# How much progress can each task contribute to the total
# work to be done
task_percentage = len(self.__tasks) ** -1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is wrong with 1 / len(self.__tasks)? :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I wrote this code at a point where I was doing a considerable amount of calculus and I guess I found this clearer at that point.

@pavlin-policar
Copy link
Collaborator Author

@lanzagar The only remaining issue is the coverage. I've written all the tests that I think are reasonable for this widget, and I think most of the code is hit at least once, so I'm not sure where this poor coverage comes from. Should I look into this or can this be merged?

@lanzagar
Copy link
Contributor

I would like to merge this tomorrow if possible. If you can, take a look at the diff from @astaric and add a test or two to cover the bigger red parts. While I know that saying tests can also be added later, after this is merged, is not good practice, but I also wouldn't like to leave this PR open too long. It would be nice if it gets some use before the release and really should be in the next one.

@lanzagar lanzagar merged commit 68d10cb into biolab:master Aug 2, 2018
@pavlin-policar pavlin-policar deleted the louvain-clustering branch November 7, 2018 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants