move feature frequencies to separate table #185
Conversation
CassandraConnector(session.sparkContext).withSessionDo { cassandra =>
  val docsCols = tables.featuresDocsCols
  cassandra.execute(
Not sure if it's a good idea to do execute on each row. Maybe it's better to sc.parallelize(seq).saveToCassandra.
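For illustration, a minimal sketch of what that could look like with the spark-cassandra-connector. This is hypothetical: the "hashes"/"features_freq" keyspace/table and the column names are placeholders, and docFreq is assumed to be the driver-side map computed earlier.

import com.datastax.spark.connector._

// Build one row per (mode, feature, weight) and let the connector write them
// from the workers in parallel, instead of executing one INSERT per row.
sc.parallelize(docFreq.toSeq.map { case (feature, weight) => (mode, feature, weight) })
  .saveToCassandra("hashes", "features_freq", SomeColumns("id", "feature", "weight"))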
Regarding this, it may be better to wait for someone with more insight into Spark than me. 😛
We calculate docFreq on the driver. sc.parallelize would split it and copy it to the workers so we can work with the collection in parallel. But I'm not sure saving to the DB would actually benefit much from that, taking into account that the copying isn't free either.
If it's already on the driver, can't we write in batches from there?
Mostly because Cassandra does not recommend it:
But batches are often mistakenly used in an attempt to optimize performance. Depending on the batch operation, the performance may actually worsen. Some batch operations place a greater burden on the coordinator node and lessen the efficiency of the data insertion. The number of partitions involved in a batch operation, and thus the potential for multi-node accessing, can increase the latency dramatically. In all batching, the coordinator node manages all write operations, so that the coordinator node can pose a bottleneck to completion.
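For reference, a hedged sketch of what that driver-side batch might look like with the DataStax Java driver. This is hypothetical, and it is exactly the pattern the quoted documentation advises against: the coordinator has to manage every write in the batch.

import com.datastax.driver.core.{BatchStatement, SimpleStatement}

// One big unlogged batch built on the driver from the whole docFreq map;
// the coordinator node becomes the bottleneck described above.
val batch = new BatchStatement(BatchStatement.Type.UNLOGGED)
docFreq.foreach { case (feature, weight) =>
  batch.add(new SimpleStatement(
    s"INSERT INTO ${tables.featuresFreq} (${freqCols.id}, ${freqCols.feature}, ${freqCols.weight}) VALUES (?, ?, ?)",
    mode, feature, Int.box(weight)))
}
cassandra.execute(batch)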
I updated the code to use a prepared statement to make it a bit better.
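Roughly like this (a sketch; the session, docFreq, and the tables/freqCols accessors are assumed from the surrounding code):

// Prepare the INSERT once and bind values per row, instead of building a
// fresh CQL string for every feature.
val insert = cassandra.prepare(
  s"INSERT INTO ${tables.featuresFreq} (${freqCols.id}, ${freqCols.feature}, ${freqCols.weight}) VALUES (?, ?, ?)")
docFreq.foreach { case (feature, weight) =>
  cassandra.execute(insert.bind(mode, feature, Int.box(weight)))
}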
docFreqCols: DocFreqCols)
featuresDocsCols: FeaturesDocsCols,
featuresFreqCols: FeaturesFreqCols) {
def hashtables(mode: String): String = s"${hashtables}_$mode" |
I'd use ${x} or $x and not a mix of them, at least not in the same string.
If I don't do it as above, the IntelliJ IDEA linter complains.
Why does it complain? 😕
We can't use $x because of "can't resolve symbol hashtables_": the interpolator treats the trailing underscore as part of the identifier. If I use ${mode} it complains that "the enclosing block is redundant".
I try to keep IntelliJ IDEA happy because right now it's the only linter we have.
Some work on style/lint was done before in #87 and #86, but there is no agreement.
If we add some opinionated linter/formatter (like gofmt or prettier), it would most probably be possible to update the code to follow the style automatically.
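To make the complaint concrete, both warnings apply to the same line:

// s"$hashtables_$mode" fails to compile: the interpolator reads the
// identifier as `hashtables_` ("cannot resolve symbol hashtables_").
// s"${hashtables}_${mode}" compiles, but IDEA flags ${mode} as a redundant
// enclosing block, since `mode` is followed by a non-identifier character.
def hashtables(mode: String): String = s"${hashtables}_$mode"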
val docsCols = tables.featuresDocsCols
val freqCols = tables.featuresFreqCols
val docsRow = conn.execute(s"SELECT * FROM ${tables.featuresDocs} WHERE ${docsCols.id} = '$mode'").one()
if (docsRow == null) { |
null in Scala should be avoided. It is usually preferred to wrap it with Option. You could then also replace the if-else with the .fold method of Option:
// assuming docsRow was wrapped first: val docsRow = Option(conn.execute(...).one())
docsRow.fold[Option[OrderedDocFreq]]({
  log.warn("Document frequency table is empty.")
  None
})(r => {
  var tokens = IndexedSeq[String]()
  val df = conn
    .execute(s"SELECT * FROM ${tables.featuresFreq} WHERE ${freqCols.id} = '$mode'")
    .asScala
    .map { row =>
      // tokens have to be sorted, df.keys isn't sorted
      val name = row.getString(freqCols.feature)
      tokens = tokens :+ name
      (name, row.getInt(freqCols.weight))
    }.toMap
  Some(OrderedDocFreq(r.getInt(docsCols.docs), tokens, df))
})
How can I avoid null here? It's a Java lib; it returns null.
This code works similarly to "Converting a null into an Option, or something else".
By doing as follows:
val docsRow = Option(conn.execute(s"SELECT * FROM ${tables.featuresDocs} WHERE ${docsCols.id} = '$mode'").one())
But maybe it's just me who doesn't like reading null, so I usually wrap it with Option ASAP, also in order to use the Scala stdlib, such as fold in this case.
Hm, I see what you mean now. IMO in this particular case checking null makes the code simpler.
Also, I see some other checks on null in our code base, but we don't do Option(<something>) at all.
Please let me know if you see any clear disadvantages of the current code or clear advantages of the code with fold, and I'll update it.
No, I don't think there are any clear pros/cons in favour of one or the other. I think it's more of a style thing.
"Also, I see some other checks on null in our code base but we don't do Option() at all."
In this case I think we should be consistent.
Just for completeness, using pattern matching is even more readable IMO 😛
Option(docsRow) match {
  case None =>
    log.warn("Document frequency table is empty.")
    None
  case Some(row) =>
    var tokens = IndexedSeq[String]()
    val df = conn
      .execute(s"SELECT * FROM ${tables.featuresFreq} WHERE ${freqCols.id} = '$mode'")
      .asScala
      .map { r => // renamed from `row` to avoid shadowing the match binding
        // tokens have to be sorted, df.keys isn't sorted
        val name = r.getString(freqCols.feature)
        tokens = tokens :+ name
        (name, r.getInt(freqCols.weight))
      }.toMap
    Some(OrderedDocFreq(row.getInt(docsCols.docs), tokens, df))
}
I hope you agree to keep the code as it is now. With your last example I don't see why it's better:
- you wrap it into an Option only to do case None/Some; most probably it's possible to write the same with case null/_ (see the sketch below)
- it adds one more level of indentation and cyclomatic complexity compared to a simple if/else.
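For example, a sketch of the case null variant (same body as the suggestions above, just without the Option wrapper):

docsRow match {
  case null =>
    log.warn("Document frequency table is empty.")
    None
  case row =>
    var tokens = IndexedSeq[String]()
    val df = conn
      .execute(s"SELECT * FROM ${tables.featuresFreq} WHERE ${freqCols.id} = '$mode'")
      .asScala
      .map { r =>
        val name = r.getString(freqCols.feature)
        tokens = tokens :+ name
        (name, r.getInt(freqCols.weight))
      }.toMap
    Some(OrderedDocFreq(row.getInt(docsCols.docs), tokens, df))
}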
Sure! I just wrote it down for completeness. Anyway, I found it more idiomatic, but I don't actually have that much experience in Scala to say what is idiomatic and what is not. 👍
.asScala
.mapValues(_.toInt)
.map { row =>
// tokens have to be sorted, df.keys isn't sorted |
How do you want them to be sorted? Alphabetically? Isn't it possible to sort outside the block? I'm thinking about something like:
val df = conn
.execute(s"SELECT * FROM ${tables.featuresFreq} WHERE ${freqCols.id} = '$mode'")
.asScala
.map(row => (row.getString(freqCols.feature), row.getInt(freqCols.weight)))
.toMap
Some(OrderedDocFreq(r.getInt(docsCols.docs), df.keys.toIndexedSeq.sorted, df))
Alphabetical. The DB already returns it in the correct order.
This map is quite big: on a small 1k-repos dataset, the serialized version of this map took 45 MB. I would prefer to avoid unnecessary sorting if possible.
I don't actually know Cassandra, so correct me if I'm wrong.
Isn't relying on the DB ordering without specifying it at query time a bit dangerous? Let me explain better: I guess that in the table there's some sort of index that is being used as the ordering key, and we're using it implicitly here. But if we change something at the DB level, we also need to remember to change the code. Isn't it better to also include the ordering in the query? I guess it won't affect the execution time, given that even without including it, the DB still orders rows using that key.
If we then change something at the DB level, everything should continue to work. It could just perform much more slowly (if, for example, we remove the ordering key).
(mode, feature) is the primary key, that's why it's sorted.
Thanks for pointing out that I missed the ORDER BY in the SELECT! I'll add it.
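A sketch of what the updated query could look like (assuming the same tables/freqCols accessors as above):

// Order explicitly by the clustering column; valid in CQL here because the
// partition key (the id/mode column) is restricted to a single value.
conn.execute(
  s"SELECT * FROM ${tables.featuresFreq}" +
    s" WHERE ${freqCols.id} = '$mode' ORDER BY ${freqCols.feature} ASC")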
fixed
rebased on master and solved conflicts
The map field of Cassandra/ScyllaDB works great, but on many repositories
the dictionary becomes too big and ScyllaDB can't commit the row with the
default configuration:
exception during mutation write to xx.xx.xx.xx: std::invalid_argument
(Mutation of 45358979 bytes is too large for the maxiumum size of 16777216)
It's possible to increase the commit size, but for a really huge dataset the
dictionary would exceed any reasonable limit.
Signed-off-by: Maxim Sukharev max@smacker.ru