move feature frequencies to separate table #185

Merged (4 commits) on Jan 30, 2019
Conversation

@smacker (Contributor) commented Jan 28, 2019

The map field of cassandra/scylladb works great, but on many repositories the dictionary becomes too big and scylladb can't commit the row with the default configuration:

exception during mutation write to xx.xx.xx.xx: std::invalid_argument
(Mutation of 45358979 bytes is too large for the maxiumum size of 16777216)

It's possible to increase the commit size, but for a really huge dataset the dictionary would exceed any reasonable limit.

Signed-off-by: Maxim Sukharev max@smacker.ru


CassandraConnector(session.sparkContext).withSessionDo { cassandra =>
val docsCols = tables.featuresDocsCols
cassandra.execute(
@smacker:

I'm not sure it's a good idea to do execute on each row.
Maybe it's better to sc.parallelize(seq).saveToCassandra.
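For reference, a minimal sketch of the sc.parallelize alternative, assuming the spark-cassandra-connector API; the keyspace, table, and column names here are illustrative, not the exact ones in this PR:

```scala
// Sketch only (not tested against a cluster): writes the driver-side
// docFreq map as one row per feature, letting the connector handle the writes.
import com.datastax.spark.connector._

// `df` is the Map[String, Int] of feature -> frequency computed on the driver,
// `mode` is the current feature mode; both names are illustrative.
val rows = df.toSeq.map { case (feature, weight) => (mode, feature, weight) }

sc.parallelize(rows)
  .saveToCassandra("keyspace", "features_freq",
    SomeColumns("id", "feature", "weight"))
```

The trade-off discussed below applies: the map must first be copied from the driver to the workers, which isn't free.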

Reviewer:

Regarding this, it may be better to wait for someone with more insight on Spark than me. 😛

@smacker:

We calculate docFreq on the driver. sc.parallelize would split it and copy it to the workers so we could work with the collection in parallel. But I'm not sure saving to the db would actually benefit much from it, taking into account that copying isn't free either.

Reviewer:

If it's already in the driver, can't we write in batches from there?

@smacker:

Mostly because Cassandra does not recommend it

But batches are often mistakenly used in an attempt to optimize performance. Depending on the batch operation, the performance may actually worsen. Some batch operations place a greater burden on the coordinator node and lessen the efficiency of the data insertion. The number of partitions involved in a batch operation, and thus the potential for multi-node accessing, can increase the latency dramatically. In all batching, the coordinator node manages all write operations, so that the coordinator node can pose a bottleneck to completion.

@smacker:

I updated the code to use a prepared statement to make it a bit better.
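For context, a prepared-statement version could look roughly like this (a hedged sketch; the `tables`/`freqCols` accessors and the loop shape are illustrative, not the exact code in this PR):

```scala
// Sketch only (DataStax Java driver): prepare the statement once,
// then bind and execute it per row instead of building a string per row.
val insert = cassandra.prepare(
  s"INSERT INTO ${tables.featuresFreq} " +
  s"(${freqCols.id}, ${freqCols.feature}, ${freqCols.weight}) VALUES (?, ?, ?)")

df.foreach { case (feature, weight) =>
  // Int must be boxed for the driver's Object... bind signature.
  cassandra.execute(insert.bind(mode, feature, Int.box(weight)))
}
```

This avoids re-parsing the statement on every row while still writing one row at a time, sidestepping the batch caveats quoted above.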

docFreqCols: DocFreqCols)
featuresDocsCols: FeaturesDocsCols,
featuresFreqCols: FeaturesFreqCols) {
def hashtables(mode: String): String = s"${hashtables}_$mode"

Reviewer:

I'd use either ${x} or $x, but not a mix of them, at least not in the same string.

@smacker:

If I don't write it as above, the IntelliJ IDEA linter complains.

Reviewer:

Why does it complain? 😕

@smacker:

We can't use $x because of "can't resolve symbol hashtables_".
If I use ${mode} it complains "the enclosing block is redundant".

I try to keep IntelliJ IDEA happy because right now it's the only linter we have.
Some work on style/lint was done before in #87 and #86, but there is no agreement.
If we add an opinionated linter/formatter (like gofmt or prettier), most probably it would be possible to update the code to follow the style automatically.
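A self-contained illustration of why the braces are required in the first position but redundant in the second (the values here are hypothetical):

```scala
object InterpolationDemo extends App {
  val hashtables = "hashtables"
  val mode = "file"

  // Braces are required here: without them, `$hashtables_` would be parsed
  // as the (nonexistent) identifier `hashtables_`.
  val withBraces = s"${hashtables}_$mode"

  // Braces around `mode` would be redundant: `$mode` ends at the
  // closing quote anyway, so both forms produce the same string.
  val alsoBraced = s"${hashtables}_${mode}"

  assert(withBraces == alsoBraced)
  println(withBraces) // hashtables_file
}
```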

val docsCols = tables.featuresDocsCols
val freqCols = tables.featuresFreqCols
val docsRow = conn.execute(s"SELECT * FROM ${tables.featuresDocs} WHERE ${docsCols.id} = '$mode'").one()
if (docsRow == null) {

Reviewer:

null in Scala should be avoided; it is usually preferred to wrap it with Option. You could then also replace the if-else with the .fold method of Option:

    docsRow.fold[Option[OrderedDocFreq]]({
      log.warn("Document frequency table is empty.")
      None
    })(r => {
      var tokens = IndexedSeq[String]()
      val df = conn
        .execute(s"SELECT * FROM ${tables.featuresFreq} WHERE ${freqCols.id} = '$mode'")
        .asScala
        .map { row =>
          // tokens have to be sorted, df.keys isn't sorted
          val name = row.getString(freqCols.feature)
          tokens = tokens :+ name

          (name, row.getInt(freqCols.weight))
        }.toMap

      Some(OrderedDocFreq(r.getInt(docsCols.docs), tokens, df))
    })

@smacker:

How can I avoid null here? It's a Java lib that returns null.
This code works similarly to "Converting a null into an Option, or something else".

@se7entyse7en commented Jan 29, 2019:

By doing as follows:

val docsRow = Option(conn.execute(s"SELECT * FROM ${tables.featuresDocs} WHERE ${docsCols.id} = '$mode'").one())

But maybe it's just me who doesn't like reading null, so I usually wrap it with Option as soon as possible, also in order to use the Scala stdlib, such as fold in this case.
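A tiny self-contained illustration of the Option(...) null-wrapping (the lookup function here is a stand-in, not the real driver API):

```scala
object OptionWrapDemo extends App {
  // Stand-in for a Java API that returns null on a miss:
  def lookup(key: String): String =
    if (key == "known") "value" else null

  // Option(x) yields Some(x) for non-null x and None for null:
  assert(Option(lookup("known")) == Some("value"))
  assert(Option(lookup("missing")).isEmpty)
  println("ok")
}
```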

@smacker:

Hm, I see what you mean now. Imo in this particular case checking for null makes the code simpler.
Also, I see some other null checks in our code base, but we don't use Option(<something>) at all.

Please let me know if you see any clear disadvantages of the current code or clear advantages of the code with fold and I'll update it.

@se7entyse7en commented Jan 29, 2019:

No, I don't think there are any clear pros/cons in favour of one or the other. I think it's more of a style thing.

"Also, I see some other checks on null in our code base but we don't do Option() at all."

In this case I think we should be consistent.

Reviewer:

Just for completeness, using pattern matching is even more readable imo 😛

    Option(docsRow) match {
      case None =>
        log.warn("Document frequency table is empty.")
        None
      case Some(r) =>
        var tokens = IndexedSeq[String]()
        val df = conn
          .execute(s"SELECT * FROM ${tables.featuresFreq} WHERE ${freqCols.id} = '$mode'")
          .asScala
          .map { row =>
            // tokens have to be sorted, df.keys isn't sorted
            val name = row.getString(freqCols.feature)
            tokens = tokens :+ name

            (name, row.getInt(freqCols.weight))
          }.toMap

        Some(OrderedDocFreq(r.getInt(docsCols.docs), tokens, df))
    }

@smacker:

I hope you agree to keep the code as it is now.

With your last example I don't see why it's better:

  1. you wrap into an Option only to do case None/Some. Most probably it's possible to write the same with case null/_.
  2. it adds one more level of indentation and cyclomatic complexity compared to a simple if/else.

Reviewer:

Sure! I just wrote it down for completeness. Anyway, I found it more idiomatic, but I don't actually have that much experience in Scala to say what is idiomatic and what is not. 👍

.asScala
.mapValues(_.toInt)
.map { row =>
// tokens have to be sorted, df.keys isn't sorted

Reviewer:

How do you want them sorted? Alphabetically? Isn't it possible to sort outside the block? I'm thinking about something like:

      val df = conn
        .execute(s"SELECT * FROM ${tables.featuresFreq} WHERE ${freqCols.id} = '$mode'")
        .asScala
        .map(row => (row.getString(freqCols.feature), row.getInt(freqCols.weight)))
        .toMap
      Some(OrderedDocFreq(r.getInt(docsCols.docs), df.keys.toIndexedSeq.sorted, df))

@smacker:

Alphabetically. The DB already returns them in the correct order.
This map is quite big: on a small dataset of 1k repos, the serialized version of this map took 45 MB. I would prefer to avoid unnecessary sorting if possible.

@se7entyse7en commented Jan 29, 2019:

I don't actually know Cassandra, so correct me if I'm wrong.

Isn't relying on the db ordering without actually specifying it at query time a bit dangerous? Let me explain better: I guess that in the table there's some sort of index that is being used as the ordering key, and we're relying on it implicitly here. But if we change something at the db level we also need to remember to change the code. Isn't it better to also include the ordering in the query? I guess it won't affect the execution time, given that even without including it, the db still orders rows using that key.

@se7entyse7en commented Jan 29, 2019:

Then if we change something at the db level, everything should continue to work. It could just perform much slower (in case, for example, we remove the ordering key).

@smacker:

(mode, feature) is the primary key, that's why it's sorted.

Thanks for pointing out that I missed ORDER BY in the SELECT! I'll add it.
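For reference, the explicit ordering could look like this (a sketch assuming `feature` is the clustering column of the `(mode, feature)` primary key; the table and column names here are illustrative, not the exact ones in this PR):

```scala
object OrderByDemo extends App {
  // Illustrative names; the real ones come from the tables config.
  val featuresFreq = "features_freq"
  val idCol = "id"
  val featureCol = "feature"
  val mode = "file"

  // CQL only allows ORDER BY on clustering columns, so this makes the
  // already-used clustering order explicit instead of relying on it silently.
  val query =
    s"SELECT * FROM $featuresFreq WHERE $idCol = '$mode' " +
    s"ORDER BY $featureCol ASC"
  println(query)
}
```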

@smacker:

fixed

@smacker commented Jan 29, 2019:

rebased on master and solved conflicts

@smacker smacker merged commit df2042e into src-d:master Jan 30, 2019
4 participants