Clustering - DBSCAN #86
Conversation
Thanks for the PR 🙂
I've left some comments regarding changes that need to be made. I've only familiarized myself with the algorithm this weekend, so please let me know if you feel that I've made a mistake in any of the comments regarding the algorithm's behavior.
Besides these comments, please try to limit the use of mutable constructs (not at all costs, of course) - I'll let you know if I can think of some concrete improvements along the way.
I have been thinking about the mutable parameters, but I haven't thought of a good way to avoid them yet. Please let me know if you have any ideas on that.
Hi @Wei-1, thanks for the PR, it's much appreciated 🙂. I only skimmed through the code and will do a more detailed review over the weekend. One thing that stands out is the calls to the findNeighbors function; the distance between the same pair of points is unnecessarily calculated many times. We have at least two better ways of doing this IMHO:
- construct a pairwise distance matrix before the clustering and use it in
findNeighbors
- construct a kd-tree (or a ball-tree) before the clustering and use it in
findNeighbors
The second approach is what we eventually want to get to as it's faster and more memory efficient, but for now I'm happy with the first method as it already avoids recalculating the same distances.
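To make the first option concrete, here is a small self-contained sketch (the names, e.g. pairwiseDistances, are made up for illustration and are not from the PR): the pairwise Euclidean distances are computed once up front, after which findNeighbors becomes a pure lookup-and-filter.

```scala
import scala.math.sqrt

object PairwiseDistanceSketch {
  type Point = Vector[Double]

  def euclidean(a: Point, b: Point): Double =
    sqrt(a.zip(b).map { case (u, v) => (u - v) * (u - v) }.sum)

  // Compute every pairwise distance exactly once, keyed (i, j) with i < j.
  def pairwiseDistances(xs: IndexedSeq[Point]): Map[(Int, Int), Double] =
    (for {
      i <- xs.indices
      j <- (i + 1) until xs.length
    } yield (i, j) -> euclidean(xs(i), xs(j))).toMap

  // With distances precomputed, finding neighbours is a lookup, not a recalculation.
  def findNeighbors(idx: Int, numPoints: Int,
                    dist: Map[(Int, Int), Double], eps: Double): Set[Int] =
    (0 until numPoints).filter { j =>
      j != idx && dist(if (idx < j) (idx, j) else (j, idx)) <= eps
    }.toSet
}
```

A kd-tree or ball-tree would replace the quadratic map with logarithmic-time range queries, which is why it remains the eventual goal.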
Added the Distance Map as stated by @inejc.
Modified as stated. @matejklemen
I'm just leaving a comment with some sample code here to start a bit of discussion (the following should not be integrated into the code in its current state).

// current seed points -> new seed points
LazyList.iterate(groupQueue)((seedPoints: Set[Int]) => {
// iterate over each seed point
seedPoints.foldLeft(Set[Int]())(
(newSeedPoints: Set[Int], currSeedPoint: Int) => {
label(currSeedPoint) = groupId
val neighsOfNeigh = findNeighbors(currSeedPoint, x, model.eps)
// check if neighbourhood of neighbor contains at least minSamples
if (neighsOfNeigh.size + 1 >= model.minSamples) {
neighsOfNeigh.foldLeft(newSeedPoints: Set[Int])((seeds, currNeighPoint) => {
label(currNeighPoint) match {
case NOISE =>
label(currNeighPoint) = groupId
seeds
case UNASSIGNED =>
label(currNeighPoint) = groupId
seeds + currNeighPoint
case _ => seeds
}
})
}
else
newSeedPoints
})
}).takeWhile(_.nonEmpty).foreach(_ => {})

This would be a replacement for the while loop. I'd like to hear your thoughts and/or suggestions on improving this (though it's really not of a high priority). @Wei-1 @inejc
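The frontier-expansion shape of that snippet can be tried out in isolation. Below is a toy stand-in (all names hypothetical, nothing here is from the PR): LazyList.iterate produces successive frontiers and takeWhile(_.nonEmpty) stops once a frontier drains, which is exactly the role of the while loop being replaced.

```scala
object FrontierSketch {
  /** Drain a work queue functionally: each point's "neighbours" are n + 1 and
    * n + 2, and we expand the frontier until nothing new below `bound` is
    * found, mirroring how seed points expand a DBSCAN cluster.
    */
  def visitedBelow(start: Set[Int], bound: Int): Set[Int] = {
    val visited = scala.collection.mutable.Set.empty[Int]
    LazyList
      .iterate(start) { frontier =>
        frontier.flatMap { n =>
          visited += n
          Set(n + 1, n + 2).filter(m => m < bound && !visited(m))
        }
      }
      .takeWhile(_.nonEmpty)
      .foreach(_ => ()) // force evaluation; the stream is used for its side effects
    visited.toSet
  }
}
```

Note the mutable set is confined to the method body, so the function itself stays referentially transparent; whether that reads better than a plain while loop is exactly the open question in this thread.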
Hey @Wei-1, I went ahead and fixed the Clusterer typeclass.

Clusterer.scala

package io.picnicml.doddlemodel.typeclasses
import breeze.linalg.DenseVector
import io.picnicml.doddlemodel.data.Features
trait Clusterer[A] extends Estimator[A] {
def fit(model: A, x: Features): A = {
require(!isFitted(model), "Called fit on a model that is already fitted")
fitSafe(copy(model), x)
}
def fitPredict(model: A, x: Features): DenseVector[Double] = {
require(!isFitted(model), "Called fit on a model that is already fitted")
labelsSafe(fitSafe(copy(model), x))
}
/** A function that creates an identical clusterer. */
protected def copy(model: A): A
/** A function that is guaranteed to be called on a non-fitted model. */
protected def fitSafe(model: A, x: Features): A
def labels(model: A): DenseVector[Double] = {
require(isFitted(model), "Called labels on a model that is not fitted yet")
labelsSafe(model)
}
/** A function that is guaranteed to be called on a fitted model. */
protected def labelsSafe(model: A): DenseVector[Double]
}

DBSCAN.scala

package io.picnicml.doddlemodel.cluster
import breeze.linalg.DenseVector
import breeze.linalg.functions.euclideanDistance
import cats.syntax.option._
import io.picnicml.doddlemodel.data.Features
import io.picnicml.doddlemodel.syntax.OptionSyntax._
import io.picnicml.doddlemodel.typeclasses.Clusterer
import scala.collection.mutable
/** An immutable DBSCAN clustering model.
*
* @param eps: the maximum distance between two datapoints to be considered in a common neighborhood
* @param minSamples: the minimum number of datapoints in a neighborhood for a point to be considered a core point
*/
case class DBSCAN private (eps: Double, minSamples: Int, private val labels: Option[DenseVector[Double]])
object DBSCAN {
def apply(eps: Double = 0.5, minSamples: Int = 5): DBSCAN = {
require(eps > 0.0, "Maximum distance eps needs to be larger than 0")
require(minSamples > 0, "Minimum number of samples needs to be larger than 0")
DBSCAN(eps, minSamples, none)
}
implicit lazy val ev: Clusterer[DBSCAN] = new Clusterer[DBSCAN] {
override protected def copy(model: DBSCAN): DBSCAN =
model.copy()
override def isFitted(model: DBSCAN): Boolean = model.labels.isDefined
override protected def fitSafe(model: DBSCAN, x: Features): DBSCAN = {
val distances = computeDistances(x)
println(distances)
???
}
private def computeDistances(x: Features): Distances = {
val distanceMatrix = mutable.AnyRefMap[(Int, Int), Double]()
(0 until x.rows).combinations(2).foreach { case rowIndex0 +: rowIndex1 +: IndexedSeq() =>
distanceMatrix((rowIndex0, rowIndex1)) = euclideanDistance(x(rowIndex0, ::).t, x(rowIndex1, ::).t)
}
new Distances(distanceMatrix)
}
override protected def labelsSafe(model: DBSCAN): DenseVector[Double] = model.labels.getOrBreak
}
private class Distances(private val distanceMatrix: mutable.AnyRefMap[(Int, Int), Double]) {
def get(x: Int, y: Int): Double = if (x > y) distanceMatrix((y, x)) else distanceMatrix((x, y))
}
}
DBSCANTest.scala

package io.picnicml.doddlemodel.cluster
import breeze.linalg.{DenseMatrix, DenseVector}
import io.picnicml.doddlemodel.TestingUtils
import io.picnicml.doddlemodel.cluster.DBSCAN.ev
import org.scalactic.{Equality, TolerantNumerics}
import org.scalatest.{FlatSpec, Matchers}
class DBSCANTest extends FlatSpec with Matchers with TestingUtils {
implicit val doubleTolerance: Equality[Double] = TolerantNumerics.tolerantDoubleEquality(1e-4)
private val x = DenseMatrix(
List(1.0, 1.0),
List(0.0, 2.0),
List(2.0, 0.0),
List(8.0, 1.0),
List(7.0, 2.0),
List(9.0, 0.0)
)
"DBSCAN" should "cluster the datapoints" in {
val model = DBSCAN(eps = 3.0, minSamples = 1)
breezeEqual(ev.fitPredict(model, x), DenseVector(0.0, 0.0, 0.0, 1.0, 1.0, 1.0)) shouldBe true
}
it should "cluster each datapoint into its own group when eps is too small" in {
val model = DBSCAN()
breezeEqual(ev.fitPredict(model, x), DenseVector(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)) shouldBe true
}
it should "cluster all data points into a single group when eps is too large" in {
val model = DBSCAN(eps = 10.0)
breezeEqual(ev.fitPredict(model, x), DenseVector(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)) shouldBe true
}
it should "label all points as outliers when min samples is too large" in {
val model = DBSCAN(minSamples = 7)
breezeEqual(ev.fitPredict(model, x), DenseVector(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0)) shouldBe true
}
it should "cluster all datapoints into a single group when eps equals the distance between points" in {
val smallX = DenseMatrix(
List(0.0, 0.0),
List(3.0, 0.0)
)
val model = DBSCAN(eps = 3.0)
breezeEqual(ev.fitPredict(model, smallX), DenseVector(0.0, 0.0)) shouldBe true
}
it should "cluster all datapoints into a single group" in {
val d1X = DenseMatrix(
List(0.0, 12.0),
List(0.0, 9.0),
List(0.0, 6.0),
List(0.0, 3.0),
List(0.0, 0.0)
)
val model = DBSCAN(eps = 3.0, minSamples = 3)
breezeEqual(ev.fitPredict(model, d1X), DenseVector(0.0, 0.0, 0.0, 0.0, 0.0)) shouldBe true
}
it should "prevent the usage of negative eps" in {
an [IllegalArgumentException] shouldBe thrownBy(DBSCAN(eps = -0.5))
}
it should "prevent the usage of negative min samples" in {
an [IllegalArgumentException] shouldBe thrownBy(DBSCAN(minSamples = -1))
}
}
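One detail of the DBSCAN.scala proposal worth a note is the half-matrix trick in Distances: each unordered pair is stored exactly once, keyed with the smaller index first, and get swaps the indices on lookup. A dependency-free sketch of the same idea (illustrative names, plain tuples instead of Breeze vectors):

```scala
object TriangularMapSketch {
  // Store each unordered pair once, keyed (i, j) with i < j; halves the memory.
  final class Distances(matrix: Map[(Int, Int), Double]) {
    def get(x: Int, y: Int): Double =
      if (x > y) matrix((y, x)) else matrix((x, y))
  }

  def build(points: IndexedSeq[(Double, Double)]): Distances = {
    val entries = for {
      Seq(i, j) <- points.indices.combinations(2)
    } yield (i, j) -> math.hypot(points(i)._1 - points(j)._1,
                                 points(i)._2 - points(j)._2)
    new Distances(entries.toMap)
  }
}
```

Since Euclidean distance is symmetric, the lookup order never matters to callers, which keeps findNeighbors free of index bookkeeping.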
The changes are available in this branch, which I will delete later.
These are beautiful!
@Wei-1 @matejklemen the proposed implementation for DBSCAN is available in this file. The typeclass is available here. The tests need to be fixed because I changed the default values of the parameters (I suggest that we come up with a few examples produced by the scikit-learn implementation). Anyway, let me know what you think; there might be things that should be changed as I haven't tested this (and it probably needs some refactoring too). @Wei-1 should we rebase your branch with
Also todo: fix 2.11/2.12 compatibility because of
@matejklemen I also looked at your proposal for the replacement of the
@inejc, I think it would be easier to simply close this PR and start another one with
@Wei-1 I can try to push my changes to your fork or alternatively, you can try to pull
@inejc Well, if we don't have a clean way to do it in a functional style, I'm fine with leaving a while loop in (my solution is not more readable than a simple while loop IMO).
update-format-from-inejc
Updated this PR with new commits from @inejc
Thanks @Wei-1. Will you work on 2.11 and 2.12 support and fixing the tests, or should I do it?
Before that: the current code fails 2 unit tests. I checked and found that all data points are clustered as NOISE.
I will check it out, the tests might be failing due to changed default parameters for
By 2.11 and 2.12 I mean Scala versions and the fact that the current code doesn't compile for them. It compiles for 2.13 though. We have this for 2.11/2.12 and this for 2.13 interop.
Interesting. I thought we always needed to compile with a specific Scala version.
Compatibility files allow for compilation on 2.11, 2.12 and 2.13, one at a time, i.e. one still has to choose a specific Scala version. The current code only compiles on 2.13, though, due to
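For reference, the usual sbt arrangement for this kind of cross-building looks roughly like the following; the directory names and version numbers are illustrative, not taken from the repository's actual build definition.

```scala
// build.sbt (illustrative sketch): cross-build against 2.11, 2.12 and 2.13,
// one Scala version at a time. Version-specific shims (e.g. for the 2.13
// collections changes) live in separate source directories selected here.
crossScalaVersions := Seq("2.11.12", "2.12.10", "2.13.1")

Compile / unmanagedSourceDirectories += {
  val sourceDir = (Compile / sourceDirectory).value
  CrossVersion.partialVersion(scalaVersion.value) match {
    case Some((2, 13)) => sourceDir / "scala-2.13"      // e.g. LazyList-based code
    case _             => sourceDir / "scala-2.11_2.12" // e.g. Stream-based shim
  }
}
```

Running `sbt +test` then compiles and tests against each listed version in turn.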
@inejc I can't seem to find why your structure is not working as intended for those unit tests.
Description of Changes
Add the DBSCAN clustering algorithm.
Please help to check the format and the corresponding tests.
Includes