-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds Quantile Discretizer #90
base: master
Are you sure you want to change the base?
Conversation
Initially made this using
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Helw150, thanks for opening the PR, this is pretty awesome 🙂. I wrote a couple of comments and suggested a change for the splitEvenly
function, let me know what's your thinking about that. Regarding your comment about using IntVector
for numQuantiles
, I'll let you know once I look at it more thoroughly. The first guess would be to use DenseVector[Int]
instead of IntVector
as this is just a type alias in doddle-model and it might confuse breeze.
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
@inejc I think the new version should address all of your comments! Just changing to DenseVector[Int] still had me running into that issue, but I don't have any particular insights that would allow me to debug so lmk if you can figure anything out. |
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote some suggestions but this is definitely in the right direction! Let me know if you disagree with my comments.
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/test/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizerTest.scala
Show resolved
Hide resolved
@inejc Resolved all but one, let me know if it is a sticking point and I will change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we are one step away from merging this. Regarding using IntVector
for bucketCounts
; I simply changed all : DenseVector[Double]
to : IntVector
, removed all internal DenseVector[Double]
types and removed unnecessary .toInt
and .toDouble
and compilation was successful.
import io.picnicml.doddlemodel.typeclasses.Transformer | ||
import scala.Double.{MaxValue, MinValue} | ||
|
||
case class QuantileDiscretizer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this formatting. Is this scalafmt? We need to add a formatter to the project 😅.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I ran Scalafmt (but didn't PR my setup of it since I didn't know if you wanted it). It's a one line addition to the plugins file and a configurable settings file. I can make a PR with the setup and you can tune the config to your personal preferences!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this issue: #95.
|
||
/** Create a quantile discretizer which splits data into discrete evenly sized buckets. | ||
* | ||
* @param bucketCount The number of quantiles desired |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm nitpicking here but I wouldn't describe bucketCount
as The number of quantiles
because the number of quantiles is always one less than the number of buckets. From Wikipedia: Quartiles: the three points that divide the data set into four equal groups in descriptive statistics
.
|
||
private def computeQuantiles(target: Seq[Double], bucketCount: Int): Seq[(Double, Double)] = { | ||
val binPercentileWidth = 1.0 / bucketCount | ||
val targetArray = target.toArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I wrote the comment about target: Seq[Double]
is that as a result, we copy each numerical column twice, instead of just once. The first time it is copied in def fit
with x(::, colIndex).toScalaVector
and the second time in def computeQuantiles
with target.toArray
.
The solution would be to change target: Seq[Double]
to target: Array[Double]
here and then create an array in def fit
directly with .toArray
which also makes a copy based on this.
Hope this makes sense and I'm not making a mistake reading this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it! I didn't understand the comment but makes sense now
.map(_.toDouble) | ||
.map(DescriptiveStats.percentileInPlace(targetArray, _)) | ||
.sliding(2) | ||
.map({case Seq(lowerBound, upperBound) => (lowerBound, upperBound)}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line can be just .map { case Seq(lowerBound, upperBound) => (lowerBound, upperBound) }
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oop yeah, I always use parens but it's personal preference
|
||
override protected def transformSafe(model: QuantileDiscretizer, x: Features): Features = { | ||
val xCopy = x.copy | ||
model.featureIndex.numerical.columnIndices.zipWithIndex.foreach { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this, it's elegant. I personally would format it as (just a subjective preference):
model.featureIndex.numerical.columnIndices.zipWithIndex.foreach { case (colIndex, bucketsIndex) =>
val buckets = model.quantiles.getOrBreak(bucketsIndex)
(0 until xCopy.rows).foreach { rowIndex =>
xCopy(rowIndex, colIndex) = buckets.indexWhere { case (lowerBound, upperBound) =>
lowerBound <= xCopy(rowIndex, colIndex) && xCopy(rowIndex, colIndex) <= upperBound
}.toDouble
}
}
import io.picnicml.doddlemodel.data.Feature.FeatureIndex | ||
import io.picnicml.doddlemodel.data.Features | ||
import io.picnicml.doddlemodel.syntax.OptionSyntax._ | ||
import io.picnicml.doddlemodel.typeclasses.Transformer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another line between the penultimate import and import scala.Double.{MaxValue, MinValue}
(I used Optimize imports
in IntelliJ which also reordered some of the imports).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use Emacs sadly, but I'll open IntelliJ for the import optimization (I have used Scalafix for similar things, but it's a big dependency for something IntelliJ does for free for most folks)
import scala.Double.{MaxValue, MinValue} | ||
|
||
case class QuantileDiscretizer( | ||
private val bucketCounts: DenseVector[Double], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understood correctly, you implemented this with IntVector
initially and the tests failed (didn't compile)?
I tried using IntVector
here and changing the DenseVector[Double]
s to DenseVector[Int]
s throughout the code/tests and the tests passed on all three supported versions of Scala for this project (2.11.12, 2.12.9 and 2.13.0). I got some mysterious error once that I can't reproduce anymore, but it went away after deleting the target/
folder and building again.
Could you please check again, deleting target/
if the problem persists? I'm not sure where the problem could be as I'm assuming you are using the dependency versions listed in project/Dependencies
Description of Changes
Adds Quantile Discretization for NumericalFeatures
Includes