Clustering - DBSCAN #86
Conversation
Thanks for the PR 🙂
I've left some comments regarding changes that need to be made. I've only familiarized myself with the algorithm this weekend, so please let me know if you feel that I've made a mistake in any of the comments regarding the algorithm's behavior.
Besides these comments, please try to limit the use of mutable constructs (not at all costs, of course) - I'll let you know if I can think of some concrete improvements along the way.
I have been thinking about the mutable parameters, but I haven't thought of a good way to avoid them yet. Please let me know if you have any ideas on that.
Hi @Wei-1, thanks for the PR, it's much appreciated 🙂. I only skimmed through the code and will do a more detailed review over the weekend. One thing that stands out is the calls to the findNeighbors function; the distance between the same pair of points is unnecessarily calculated many times. We have at least two better ways of doing this IMHO:
- construct a pairwise distance matrix before the clustering and use it in
findNeighbors
- construct a kd-tree (or a ball-tree) before the clustering and use it in
findNeighbors
The second approach is what we eventually want to get to as it's faster and more memory efficient, but for now I'm happy with the first method as it already avoids recalculating the same distances.
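To make the first option concrete, here is a small self-contained sketch (the names, e.g. pairwiseDistances, are made up for illustration and are not from the PR): the pairwise Euclidean distances are computed once up front, after which findNeighbors becomes a pure lookup-and-filter.

```scala
import scala.math.sqrt

object PairwiseDistanceSketch {
  type Point = Vector[Double]

  def euclidean(a: Point, b: Point): Double =
    sqrt(a.zip(b).map { case (u, v) => (u - v) * (u - v) }.sum)

  // Compute every pairwise distance exactly once, keyed (i, j) with i < j.
  def pairwiseDistances(xs: IndexedSeq[Point]): Map[(Int, Int), Double] =
    (for {
      i <- xs.indices
      j <- (i + 1) until xs.length
    } yield (i, j) -> euclidean(xs(i), xs(j))).toMap

  // With distances precomputed, finding neighbours is a lookup, not a recalculation.
  def findNeighbors(idx: Int, numPoints: Int,
                    dist: Map[(Int, Int), Double], eps: Double): Set[Int] =
    (0 until numPoints).filter { j =>
      j != idx && dist(if (idx < j) (idx, j) else (j, idx)) <= eps
    }.toSet
}
```

A kd-tree or ball-tree would replace the quadratic map with logarithmic-time range queries, which is why it remains the eventual goal.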
Added the Distance Map as stated by @inejc.
Modified as stated. @matejklemen
I'm just leaving a comment with some sample code here to start a bit of discussion (the following should not be integrated into the code in its current state).

// current seed points -> new seed points
LazyList.iterate(groupQueue)((seedPoints: Set[Int]) => {
// iterate over each seed point
seedPoints.foldLeft(Set[Int]())(
(newSeedPoints: Set[Int], currSeedPoint: Int) => {
label(currSeedPoint) = groupId
val neighsOfNeigh = findNeighbors(currSeedPoint, x, model.eps)
// check if neighbourhood of neighbor contains at least minSamples
if (neighsOfNeigh.size + 1 >= model.minSamples) {
neighsOfNeigh.foldLeft(newSeedPoints: Set[Int])((seeds, currNeighPoint) => {
label(currNeighPoint) match {
case NOISE =>
label(currNeighPoint) = groupId
seeds
case UNASSIGNED =>
label(currNeighPoint) = groupId
seeds + currNeighPoint
case _ => seeds
}
})
}
else
newSeedPoints
})
}).takeWhile(_.nonEmpty).foreach(_ => {})

This would be a replacement for the while loop. I'd like to hear your thoughts and/or suggestions on improving this (though it's really not of a high priority). @Wei-1 @inejc
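The frontier-expansion shape of that snippet can be tried out in isolation. Below is a toy stand-in (all names hypothetical, nothing here is from the PR): LazyList.iterate produces successive frontiers and takeWhile(_.nonEmpty) stops once a frontier drains, which is exactly the role of the while loop being replaced.

```scala
object FrontierSketch {
  /** Drain a work queue functionally: each point's "neighbours" are n + 1 and
    * n + 2, and we expand the frontier until nothing new below `bound` is
    * found, mirroring how seed points expand a DBSCAN cluster.
    */
  def visitedBelow(start: Set[Int], bound: Int): Set[Int] = {
    val visited = scala.collection.mutable.Set.empty[Int]
    LazyList
      .iterate(start) { frontier =>
        frontier.flatMap { n =>
          visited += n
          Set(n + 1, n + 2).filter(m => m < bound && !visited(m))
        }
      }
      .takeWhile(_.nonEmpty)
      .foreach(_ => ()) // force evaluation; the stream is used for its side effects
    visited.toSet
  }
}
```

Note the mutable set is confined to the method body, so the function itself stays referentially transparent; whether that reads better than a plain while loop is exactly the open question in this thread.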
Hey @Wei-1, I went ahead and fixed the Clusterer typeclass.

Clusterer.scala

package io.picnicml.doddlemodel.typeclasses
import breeze.linalg.DenseVector
import io.picnicml.doddlemodel.data.Features
trait Clusterer[A] extends Estimator[A] {
def fit(model: A, x: Features): A = {
require(!isFitted(model), "Called fit on a model that is already fitted")
fitSafe(copy(model), x)
}
def fitPredict(model: A, x: Features): DenseVector[Double] = {
require(!isFitted(model), "Called fit on a model that is already fitted")
labelsSafe(fitSafe(copy(model), x))
}
/** A function that creates an identical clusterer. */
protected def copy(model: A): A
/** A function that is guaranteed to be called on a non-fitted model. */
protected def fitSafe(model: A, x: Features): A
def labels(model: A): DenseVector[Double] = {
require(isFitted(model), "Called labels on a model that is not fitted yet")
labelsSafe(model)
}
/** A function that is guaranteed to be called on a fitted model. */
protected def labelsSafe(model: A): DenseVector[Double]
}

DBSCAN.scala

package io.picnicml.doddlemodel.cluster
import breeze.linalg.DenseVector
import breeze.linalg.functions.euclideanDistance
import cats.syntax.option._
import io.picnicml.doddlemodel.data.Features
import io.picnicml.doddlemodel.syntax.OptionSyntax._
import io.picnicml.doddlemodel.typeclasses.Clusterer
import scala.collection.mutable
/** An immutable DBSCAN clustering model.
*
* @param eps: the maximum distance between two datapoints to be considered in a common neighborhood
* @param minSamples: the minimum number of datapoints in a neighborhood for a point to be considered a core point
*/
case class DBSCAN private (eps: Double, minSamples: Int, private val labels: Option[DenseVector[Double]])
object DBSCAN {
def apply(eps: Double = 0.5, minSamples: Int = 5): DBSCAN = {
require(eps > 0.0, "Maximum distance eps needs to be larger than 0")
require(minSamples > 0, "Minimum number of samples needs to be larger than 0")
DBSCAN(eps, minSamples, none)
}
implicit lazy val ev: Clusterer[DBSCAN] = new Clusterer[DBSCAN] {
override protected def copy(model: DBSCAN): DBSCAN =
model.copy()
override def isFitted(model: DBSCAN): Boolean = model.labels.isDefined
override protected def fitSafe(model: DBSCAN, x: Features): DBSCAN = {
val distances = computeDistances(x)
println(distances)
???
}
private def computeDistances(x: Features): Distances = {
val distanceMatrix = mutable.AnyRefMap[(Int, Int), Double]()
(0 until x.rows).combinations(2).foreach { case rowIndex0 +: rowIndex1 +: IndexedSeq() =>
distanceMatrix((rowIndex0, rowIndex1)) = euclideanDistance(x(rowIndex0, ::).t, x(rowIndex1, ::).t)
}
new Distances(distanceMatrix)
}
override protected def labelsSafe(model: DBSCAN): DenseVector[Double] = model.labels.getOrBreak
}
private class Distances(private val distanceMatrix: mutable.AnyRefMap[(Int, Int), Double]) {
def get(x: Int, y: Int): Double = if (x > y) distanceMatrix((y, x)) else distanceMatrix((x, y))
}
}
DBSCANTest.scala

package io.picnicml.doddlemodel.cluster
import breeze.linalg.{DenseMatrix, DenseVector}
import io.picnicml.doddlemodel.TestingUtils
import io.picnicml.doddlemodel.cluster.DBSCAN.ev
import org.scalactic.{Equality, TolerantNumerics}
import org.scalatest.{FlatSpec, Matchers}
class DBSCANTest extends FlatSpec with Matchers with TestingUtils {
implicit val doubleTolerance: Equality[Double] = TolerantNumerics.tolerantDoubleEquality(1e-4)
private val x = DenseMatrix(
List(1.0, 1.0),
List(0.0, 2.0),
List(2.0, 0.0),
List(8.0, 1.0),
List(7.0, 2.0),
List(9.0, 0.0)
)
"DBSCAN" should "cluster the datapoints" in {
val model = DBSCAN(eps = 3.0, minSamples = 1)
breezeEqual(ev.fitPredict(model, x), DenseVector(0.0, 0.0, 0.0, 1.0, 1.0, 1.0)) shouldBe true
}
it should "cluster each datapoint into its own group when eps is too small" in {
val model = DBSCAN()
breezeEqual(ev.fitPredict(model, x), DenseVector(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)) shouldBe true
}
it should "cluster all data points into a single group when eps is too large" in {
val model = DBSCAN(eps = 10.0)
breezeEqual(ev.fitPredict(model, x), DenseVector(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)) shouldBe true
}
it should "label all points as outliers when min samples is too large" in {
val model = DBSCAN(minSamples = 7)
breezeEqual(ev.fitPredict(model, x), DenseVector(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0)) shouldBe true
}
it should "cluster all datapoints into a single group when eps equals the distance between points" in {
val smallX = DenseMatrix(
List(0.0, 0.0),
List(3.0, 0.0)
)
val model = DBSCAN(eps = 3.0)
breezeEqual(ev.fitPredict(model, smallX), DenseVector(0.0, 0.0)) shouldBe true
}
it should "cluster all datapoints into a single group" in {
val d1X = DenseMatrix(
List(0.0, 12.0),
List(0.0, 9.0),
List(0.0, 6.0),
List(0.0, 3.0),
List(0.0, 0.0)
)
val model = DBSCAN(eps = 3.0, minSamples = 3)
breezeEqual(ev.fitPredict(model, d1X), DenseVector(0.0, 0.0, 0.0, 0.0, 0.0)) shouldBe true
}
it should "prevent the usage of negative eps" in {
an [IllegalArgumentException] shouldBe thrownBy(DBSCAN(eps = -0.5))
}
it should "prevent the usage of negative min samples" in {
an [IllegalArgumentException] shouldBe thrownBy(DBSCAN(minSamples = -1))
}
}
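One detail of the DBSCAN.scala proposal worth a note is the half-matrix trick in Distances: each unordered pair is stored exactly once, keyed with the smaller index first, and get swaps the indices on lookup. A dependency-free sketch of the same idea (illustrative names, plain tuples instead of Breeze vectors):

```scala
object TriangularMapSketch {
  // Store each unordered pair once, keyed (i, j) with i < j; halves the memory.
  final class Distances(matrix: Map[(Int, Int), Double]) {
    def get(x: Int, y: Int): Double =
      if (x > y) matrix((y, x)) else matrix((x, y))
  }

  def build(points: IndexedSeq[(Double, Double)]): Distances = {
    val entries = for {
      Seq(i, j) <- points.indices.combinations(2)
    } yield (i, j) -> math.hypot(points(i)._1 - points(j)._1,
                                 points(i)._2 - points(j)._2)
    new Distances(entries.toMap)
  }
}
```

Since Euclidean distance is symmetric, the lookup order never matters to callers, which keeps findNeighbors free of index bookkeeping.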
The changes are available in this branch, which I will delete later.
These are beautiful!
@Wei-1 @matejklemen the proposed implementation for DBSCAN is available in this file. The typeclass is available here. The tests need to be fixed because I changed the default values of the parameters (I suggest that we come up with a few examples produced by the scikit-learn implementation). Anyway, let me know what you think; there might be things that should be changed as I haven't tested this (and it probably needs some refactoring too). @Wei-1 should we rebase your branch with
Also todo: fix 2.11/2.12 compatibility because of
@matejklemen I also looked at your proposal for the replacement of the
@inejc, I think it would be easier to simply close this PR and start another one with
@Wei-1 I can try to push my changes to your fork or alternatively, you can try to pull
@inejc Well, if we don't have a clean way to do it in a functional style, I'm fine with leaving a while loop in (my solution is not more readable than a simple while loop IMO).
update-format-from-inejc
Updated this PR with new commits from @inejc
Thanks @Wei-1. Will you work on 2.11 and 2.12 support and fixing the tests, or should I do it?
Before that: the current code fails 2 unit tests. I checked and found that all data points are clustered as NOISE.
I will check it out, the tests might be failing due to changed default parameters for
By 2.11 and 2.12 I mean Scala versions and the fact that the current code doesn't compile for them. It compiles for 2.13 though. We have this for 2.11/2.12 and this for 2.13 interop.
Interesting. I thought we always needed to compile with a specific Scala version.
Compatibility files allow for compilation on 2.11, 2.12 and 2.13, one at a time, i.e. one still has to choose a specific Scala version. The current code only compiles on 2.13, though, due to
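For reference, the usual sbt arrangement for this kind of cross-building looks roughly like the following; the directory names and version numbers are illustrative, not taken from the repository's actual build definition.

```scala
// build.sbt (illustrative sketch): cross-build against 2.11, 2.12 and 2.13,
// one Scala version at a time. Version-specific shims (e.g. for the 2.13
// collections changes) live in separate source directories selected here.
crossScalaVersions := Seq("2.11.12", "2.12.10", "2.13.1")

Compile / unmanagedSourceDirectories += {
  val sourceDir = (Compile / sourceDirectory).value
  CrossVersion.partialVersion(scalaVersion.value) match {
    case Some((2, 13)) => sourceDir / "scala-2.13"      // e.g. LazyList-based code
    case _             => sourceDir / "scala-2.11_2.12" // e.g. Stream-based shim
  }
}
```

Running `sbt +test` then compiles and tests against each listed version in turn.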
@inejc I can't seem to find why your structure is not working as intended for those unit tests.
Description of Changes
Add the DBSCAN clustering algorithm.
Please help to check the format and the corresponding tests.
Includes