Further expository text updates
lmcinnes committed Oct 24, 2015
1 parent 26ffa4b commit 0cb7296
Showing 1 changed file with 15 additions and 6 deletions.
@@ -745,7 +745,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we run that for each of our pre-existing datasets to extrapolate out predicted performance on the relevant dataset sizes. A little pandas wrangling later and we've produced a table of roughly how large a dataset you can tackle in each time frame with each implementation. I had to leave out the scipy KMeans timings because the noise in timing results caused the model to be unrealistic at larger data sizes. Note how the $O(n\\log n)$ algorithms utterly dominate here. In the meantime, for medium sizes data sets you can still get quite a lot done with HDBSCAN."
"Now we run that for each of our pre-existing datasets to extrapolate out predicted performance on the relevant dataset sizes. A little pandas wrangling later and we've produced a table of roughly how large a dataset you can tackle in each time frame with each implementation."
]
},
{
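To make the extrapolation concrete, here is a minimal sketch of the sort of wrangling the cell above describes, assuming we already have runtime models fitted to the benchmark timings: invert each model to find the largest dataset size whose predicted runtime fits a given time budget, and collect the results into a DataFrame like the `datasize_table` the next hunk displays. The model constants, the `models` and `time_frames` dictionaries, and the use of `brentq` are all illustrative assumptions, not the notebook's actual code.

```python
import numpy as np
import pandas as pd
from scipy.optimize import brentq

# Hypothetical fitted runtime models: predicted seconds for a dataset of size n.
models = {
    'hdbscan': lambda n: 1e-6 * n * np.log(n),  # roughly O(n log n)
    'fastcluster': lambda n: 5e-9 * n ** 2,     # roughly O(n^2)
}

# Time budgets to tabulate against.
time_frames = {'coffee break (5 min)': 300,
               'lunch (1 hour)': 3600,
               'overnight (8 hours)': 8 * 3600}

# For each implementation and budget, solve model(n) = budget for n;
# the runtimes are monotone increasing, so a bracketing root finder works.
rows = {}
for name, model in models.items():
    rows[name] = {label: int(brentq(lambda n: model(n) - budget, 2, 1e12))
                  for label, budget in time_frames.items()}

datasize_table = pd.DataFrame(rows).T  # implementations as rows, budgets as columns
print(datasize_table)
```

A root finder is overkill for forms like $an^2$ that invert in closed form, but it handles the $n\log n$ case, which has no elementary inverse, in the same way.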
@@ -888,6 +888,15 @@
"datasize_table"
]
},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"I had to leave out the scipy KMeans timings because the noise in the timing results caused the model to be unrealistic at larger data sizes. It is also worth keeping in mind that some of the model results for the larger dataset sizes are simply false -- you'll recall that Fastcluster and Scipy's single linkage both didn't scale at all well past 40000 points on my laptop, so I'm certainly not going to manage 50000 or 100000 over lunch. The same applies to DeBaCl and the slower Sklearn implementations, as they also produce the full pairwise distance matrix during computations.\n",
+"\n",
+"The main thing to note is how the $O(n\\log n)$ algorithms utterly dominate here. In the meantime, for medium sized data sets you can still get quite a lot done with HDBSCAN."
+]
+},
{
"cell_type": "markdown",
"metadata": {},
@@ -911,21 +920,21 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"display_name": "Python 3",
"language": "python",
"name": "python2"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
