Further expository text updates
lmcinnes committed Oct 24, 2015
1 parent 26ffa4b commit 0cb7296
Showing 1 changed file with 15 additions and 6 deletions.
@@ -745,7 +745,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we run that for each of our pre-existing datasets to extrapolate out predicted performance on the relevant dataset sizes. A little pandas wrangling later and we've produced a table of roughly how large a dataset you can tackle in each time frame with each implementation. I had to leave out the scipy KMeans timings because the noise in timing results caused the model to be unrealistic at larger data sizes. Note how the $O(n\\log n)$ algorithms utterly dominate here. In the meantime, for medium sizes data sets you can still get quite a lot done with HDBSCAN."
"Now we run that for each of our pre-existing datasets to extrapolate out predicted performance on the relevant dataset sizes. A little pandas wrangling later and we've produced a table of roughly how large a dataset you can tackle in each time frame with each implementation."
]
},
{
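To make the extrapolation concrete, here is a minimal sketch of the sort of wrangling the cell above describes, assuming we already have runtime models fitted to the benchmark timings: invert each model to find the largest dataset size whose predicted runtime fits a given time budget, and collect the results into a DataFrame like the `datasize_table` the next hunk displays. The model constants, the `models` and `time_frames` dictionaries, and the use of `brentq` are all illustrative assumptions, not the notebook's actual code.

```python
import numpy as np
import pandas as pd
from scipy.optimize import brentq

# Hypothetical fitted runtime models: predicted seconds for a dataset of size n.
models = {
    'hdbscan': lambda n: 1e-6 * n * np.log(n),  # roughly O(n log n)
    'fastcluster': lambda n: 5e-9 * n ** 2,     # roughly O(n^2)
}

# Time budgets to tabulate against.
time_frames = {'coffee break (5 min)': 300,
               'lunch (1 hour)': 3600,
               'overnight (8 hours)': 8 * 3600}

# For each implementation and budget, solve model(n) = budget for n;
# the runtimes are monotone increasing, so a bracketing root finder works.
rows = {}
for name, model in models.items():
    rows[name] = {label: int(brentq(lambda n: model(n) - budget, 2, 1e12))
                  for label, budget in time_frames.items()}

datasize_table = pd.DataFrame(rows).T  # implementations as rows, budgets as columns
print(datasize_table)
```

A root finder is overkill for forms like $an^2$ that invert in closed form, but it handles the $n\log n$ case, which has no elementary inverse, in the same way.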
@@ -888,6 +888,15 @@
"datasize_table"
]
},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"I had to leave out the scipy KMeans timings because the noise in the timing results caused the model to be unrealistic at larger data sizes. It is also worth keeping in mind that some of the model results for the larger dataset sizes are simply false -- you'll recall that Fastcluster and Scipy's single linkage both didn't scale at all well past 40000 points on my laptop, so I'm certainly not going to manage 50000 or 100000 over lunch. The same applies to DeBaCl and the slower Sklearn implementations, as they also produce the full pairwise distance matrix during computations.\n",
+"\n",
+"The main thing to note is how the $O(n\\log n)$ algorithms utterly dominate here. In the meantime, for medium sized data sets you can still get quite a lot done with HDBSCAN."
+]
+},
{
"cell_type": "markdown",
"metadata": {},
@@ -911,21 +920,21 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"display_name": "Python 3",
"language": "python",
"name": "python2"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
