move feature frequencies to separate table #185
Conversation
CassandraConnector(session.sparkContext).withSessionDo { cassandra =>
  val docsCols = tables.featuresDocsCols
  cassandra.execute(
Not sure if it's a good idea to do execute on each row. Maybe it's better to sc.parallelize(seq).saveToCassandra.
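For illustration, a minimal sketch of what that could look like with the spark-cassandra-connector. This is hypothetical: the "hashes"/"features_freq" keyspace/table and the column names are placeholders, and docFreq is assumed to be the driver-side map computed earlier.

import com.datastax.spark.connector._

// Build one row per (mode, feature, weight) and let the connector write them
// from the workers in parallel, instead of executing one INSERT per row.
sc.parallelize(docFreq.toSeq.map { case (feature, weight) => (mode, feature, weight) })
  .saveToCassandra("hashes", "features_freq", SomeColumns("id", "feature", "weight"))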
Regarding this, it may be better to wait for someone with more insight into Spark than me. 😛
We calculate docFreq on the driver. sc.parallelize would split it and copy it to the workers so we can work with the collection in parallel. But I'm not sure saving to the DB would actually benefit much from that, taking into account that the copying isn't free either.
If it's already on the driver, can't we write in batches from there?
Mostly because Cassandra does not recommend it:
But batches are often mistakenly used in an attempt to optimize performance. Depending on the batch operation, the performance may actually worsen. Some batch operations place a greater burden on the coordinator node and lessen the efficiency of the data insertion. The number of partitions involved in a batch operation, and thus the potential for multi-node accessing, can increase the latency dramatically. In all batching, the coordinator node manages all write operations, so that the coordinator node can pose a bottleneck to completion.
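For reference, a hedged sketch of what that driver-side batch might look like with the DataStax Java driver. This is hypothetical, and it is exactly the pattern the quoted documentation advises against: the coordinator has to manage every write in the batch.

import com.datastax.driver.core.{BatchStatement, SimpleStatement}

// One big unlogged batch built on the driver from the whole docFreq map;
// the coordinator node becomes the bottleneck described above.
val batch = new BatchStatement(BatchStatement.Type.UNLOGGED)
docFreq.foreach { case (feature, weight) =>
  batch.add(new SimpleStatement(
    s"INSERT INTO ${tables.featuresFreq} (${freqCols.id}, ${freqCols.feature}, ${freqCols.weight}) VALUES (?, ?, ?)",
    mode, feature, Int.box(weight)))
}
cassandra.execute(batch)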
I updated the code to use a prepared statement to make it a bit better.
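Roughly like this (a sketch; the session, docFreq, and the tables/freqCols accessors are assumed from the surrounding code):

// Prepare the INSERT once and bind values per row, instead of building a
// fresh CQL string for every feature.
val insert = cassandra.prepare(
  s"INSERT INTO ${tables.featuresFreq} (${freqCols.id}, ${freqCols.feature}, ${freqCols.weight}) VALUES (?, ?, ?)")
docFreq.foreach { case (feature, weight) =>
  cassandra.execute(insert.bind(mode, feature, Int.box(weight)))
}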
docFreqCols: DocFreqCols)
featuresDocsCols: FeaturesDocsCols,
featuresFreqCols: FeaturesFreqCols) {
def hashtables(mode: String): String = s"${hashtables}_$mode" |
I'd use ${x} or $x and not a mix of them, at least not in the same string.
If I don't do it as above, the IntelliJ IDEA linter complains.
Why does it complain? 😕
We can't use $x because of "can't resolve symbol hashtables_": the interpolator treats the trailing underscore as part of the identifier. If I use ${mode} it complains that "the enclosing block is redundant".
I try to keep IntelliJ IDEA happy because right now it's the only linter we have.
Some work on style/lint was done before in #87 and #86, but there is no agreement.
If we add some opinionated linter/formatter (like gofmt or prettier), it would most probably be possible to update the code to follow the style automatically.
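To make the complaint concrete, both warnings apply to the same line:

// s"$hashtables_$mode" fails to compile: the interpolator reads the
// identifier as `hashtables_` ("cannot resolve symbol hashtables_").
// s"${hashtables}_${mode}" compiles, but IDEA flags ${mode} as a redundant
// enclosing block, since `mode` is followed by a non-identifier character.
def hashtables(mode: String): String = s"${hashtables}_$mode"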
val docsCols = tables.featuresDocsCols
val freqCols = tables.featuresFreqCols
val docsRow = conn.execute(s"SELECT * FROM ${tables.featuresDocs} WHERE ${docsCols.id} = '$mode'").one()
if (docsRow == null) { |
null in Scala should be avoided. It is usually preferred to wrap it with Option. You could then also replace the if-else with the .fold method of Option:
// assuming docsRow was wrapped first: val docsRow = Option(conn.execute(...).one())
docsRow.fold[Option[OrderedDocFreq]]({
  log.warn("Document frequency table is empty.")
  None
})(r => {
  var tokens = IndexedSeq[String]()
  val df = conn
    .execute(s"SELECT * FROM ${tables.featuresFreq} WHERE ${freqCols.id} = '$mode'")
    .asScala
    .map { row =>
      // tokens have to be sorted, df.keys isn't sorted
      val name = row.getString(freqCols.feature)
      tokens = tokens :+ name
      (name, row.getInt(freqCols.weight))
    }.toMap
  Some(OrderedDocFreq(r.getInt(docsCols.docs), tokens, df))
})
How can I avoid null here? It's a Java lib; it returns null.
This code works similarly to "Converting a null into an Option, or something else".
By doing as follows:
val docsRow = Option(conn.execute(s"SELECT * FROM ${tables.featuresDocs} WHERE ${docsCols.id} = '$mode'").one())
But maybe it's just me who doesn't like reading null, so I usually wrap it with Option ASAP, also in order to use the Scala stdlib, such as fold in this case.
Hm, I see what you mean now. IMO in this particular case checking null makes the code simpler.
Also, I see some other checks on null in our code base, but we don't do Option(<something>) at all.
Please let me know if you see any clear disadvantages of the current code or clear advantages of the code with fold, and I'll update it.
No, I don't think there are any clear pros/cons in favour of one or the other. I think it's more of a style thing.
"Also, I see some other checks on null in our code base but we don't do Option() at all."
In this case I think we should be consistent.
Just for completeness, using pattern matching is even more readable IMO 😛
Option(docsRow) match {
  case None =>
    log.warn("Document frequency table is empty.")
    None
  case Some(row) =>
    var tokens = IndexedSeq[String]()
    val df = conn
      .execute(s"SELECT * FROM ${tables.featuresFreq} WHERE ${freqCols.id} = '$mode'")
      .asScala
      .map { r => // renamed from `row` to avoid shadowing the match binding
        // tokens have to be sorted, df.keys isn't sorted
        val name = r.getString(freqCols.feature)
        tokens = tokens :+ name
        (name, r.getInt(freqCols.weight))
      }.toMap
    Some(OrderedDocFreq(row.getInt(docsCols.docs), tokens, df))
}
I hope you agree to keep the code as it is now. With your last example I don't see why it's better:
- you wrap it into an Option only to do case None/Some; most probably it's possible to write the same with case null/_ (see the sketch below)
- it adds one more level of indentation and cyclomatic complexity compared to a simple if/else.
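For example, a sketch of the case null variant (same body as the suggestions above, just without the Option wrapper):

docsRow match {
  case null =>
    log.warn("Document frequency table is empty.")
    None
  case row =>
    var tokens = IndexedSeq[String]()
    val df = conn
      .execute(s"SELECT * FROM ${tables.featuresFreq} WHERE ${freqCols.id} = '$mode'")
      .asScala
      .map { r =>
        val name = r.getString(freqCols.feature)
        tokens = tokens :+ name
        (name, r.getInt(freqCols.weight))
      }.toMap
    Some(OrderedDocFreq(row.getInt(docsCols.docs), tokens, df))
}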
Sure! I just wrote it down for completeness. Anyway, I found it more idiomatic, but I don't actually have that much experience in Scala to say what is idiomatic and what is not. 👍
.asScala
.mapValues(_.toInt)
.map { row =>
// tokens have to be sorted, df.keys isn't sorted |
How do you want them to be sorted? Alphabetically? Isn't it possible to sort outside the block? I'm thinking about something like:
val df = conn
.execute(s"SELECT * FROM ${tables.featuresFreq} WHERE ${freqCols.id} = '$mode'")
.asScala
.map(row => (row.getString(freqCols.feature), row.getInt(freqCols.weight)))
.toMap
Some(OrderedDocFreq(r.getInt(docsCols.docs), df.keys.toIndexedSeq.sorted, df))
Alphabetical. The DB already returns it in the correct order.
This map is quite big: on a small 1k-repos dataset, the serialized version of this map took 45 MB. I would prefer to avoid unnecessary sorting if possible.
I don't actually know Cassandra, so correct me if I'm wrong.
Isn't relying on the DB ordering without specifying it at query time a bit dangerous? Let me explain better: I guess that in the table there's some sort of index that is being used as the ordering key, and we're using it implicitly here. But if we change something at the DB level, we also need to remember to change the code. Isn't it better to also include the ordering in the query? I guess it won't affect the execution time, given that even without including it, the DB still orders rows using that key.
If we then change something at the DB level, everything should continue to work. It could just perform much more slowly (if, for example, we remove the ordering key).
(mode, feature) is the primary key, that's why it's sorted.
Thanks for pointing out that I missed the ORDER BY in the SELECT! I'll add it.
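A sketch of what the updated query could look like (assuming the same tables/freqCols accessors as above):

// Order explicitly by the clustering column; valid in CQL here because the
// partition key (the id/mode column) is restricted to a single value.
conn.execute(
  s"SELECT * FROM ${tables.featuresFreq}" +
    s" WHERE ${freqCols.id} = '$mode' ORDER BY ${freqCols.feature} ASC")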
fixed
rebased on master and solved conflicts
The map field of Cassandra/ScyllaDB works great, but on many repositories
the dictionary becomes too big and ScyllaDB can't commit the row with the
default configuration:
exception during mutation write to xx.xx.xx.xx: std::invalid_argument
(Mutation of 45358979 bytes is too large for the maxiumum size of 16777216)
It's possible to increase the commit size, but for a really huge dataset the
dictionary would exceed any reasonable limit.
Signed-off-by: Maxim Sukharev max@smacker.ru