[ENH] Tree (hierarchical clustering, dendrogram) of clusters #3680
Comments
I agree with everything written: HC is infeasible for large data, a tree of clusters can be useful, and cluster centroids (provided by k-means) can be used for this. One issue I have with this approach is that when the original data contains categorical variables, the centroids have a different domain (k-means continuizes the data). But for numerical data (e.g. gene expression) this is not a problem.
It's more an enhancement than an issue. Certain clustering algorithms do not output cluster centroids, only categorical labels. Hierarchical clustering would not even need to operate on centroids; it could use the appropriate linkage method to aggregate distances between the provided clusters. Categorical variables can be continuized too.
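A minimal sketch of that idea, assuming only a numeric data matrix and a vector of cluster labels are available (no centroids). The names `X`, `labels`, and `D` and the random data are hypothetical, made up for illustration. Distances between the provided clusters are aggregated as the mean pairwise distance between their members (average linkage), and the resulting cluster-by-cluster matrix is passed to SciPy. Note that SciPy then treats each provided cluster as a single observation, so later merges are not weighted by the original cluster sizes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import cdist, squareform

rng = np.random.default_rng(0)
# Hypothetical input: 600 points with 4 numeric features and 6 given labels.
X = rng.normal(size=(600, 4))
labels = rng.integers(0, 6, size=600)

ids = np.unique(labels)
k = len(ids)

# Aggregate point-to-point distances into a cluster-to-cluster matrix.
# Here: mean pairwise Euclidean distance between members of two clusters.
D = np.zeros((k, k))
for i in range(k):
    for j in range(i + 1, k):
        d = cdist(X[labels == ids[i]], X[labels == ids[j]]).mean()
        D[i, j] = D[j, i] = d

# linkage() accepts a condensed distance matrix, so convert with squareform().
Z = linkage(squareform(D), method="average")
```

Replacing `.mean()` with `.min()` or `.max()` would give single or complete linkage between the provided clusters instead.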
I see - so instead of using k-means you would like to use predefined clusters (a categorical variable in the data), which can be the result of (any) clustering or otherwise given. In that case, the solution would be to have a Pivot table, which can compute averages (or other aggregates) to obtain the centroids of the selected groups.
Closed due to inactivity. Probably partially solved by the Pivot widget.
It is infeasible to compute hierarchical clustering on datasets of more than a few thousand data points. Instead, a tree between the clusters is useful. Use cluster centroids (mean points) or other linkage options from SciPy to compute the linkages.
Tree between 17 clusters:
tree.pdf
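A rough sketch of the centroid variant, assuming the clusters are already known as a label per data point. The data, `X`, `labels`, and the choice of 17 clusters are hypothetical and only mirror the example above. Each cluster is reduced to its mean point, and `scipy.cluster.hierarchy.linkage` is run on the 17 centroids rather than on all points.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
# Hypothetical input: 5000 points, 10 numeric features, 17 cluster labels.
X = rng.normal(size=(5000, 10))
labels = rng.integers(0, 17, size=5000)

# One centroid (mean point) per cluster.
cluster_ids = np.unique(labels)
centroids = np.vstack([X[labels == c].mean(axis=0) for c in cluster_ids])

# Hierarchical clustering on the centroids only: 17 rows instead of 5000.
Z = linkage(centroids, method="average", metric="euclidean")

# no_plot=True returns the tree layout without drawing it; drop the
# argument to draw the dendrogram with matplotlib.
tree = dendrogram(Z, labels=cluster_ids.tolist(), no_plot=True)
```

Other values of `method`, such as `"single"`, `"complete"`, or `"ward"`, select different linkage options in SciPy.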