Optimize mnesia cache querying #625

Closed
wants to merge 1 commit

Conversation

jchristgit
Collaborator

QLC queries via mnesia-based caches that use `{traverse,
{select, MatchSpec}}` in any shape or form cause the QLC query to be
executed in two parts: the `mnesia:table` call runs the entire table
through the selected match specification, and a plain Erlang list
comprehension then filters the results of that. Per the discussion in
#622, this is what such a query looks like under `qlc:info/1`:

    qlc:q([
           Guild ||
               {GuildId, Guild} <-
                   mnesia:table(nostrum_guilds,
                                [{n_objects, 100},
                                 {lock, read} |
                                 {traverse,
                                  {select,
                                   [{{'_', '$1', '$2'},
                                     [],
                                     [{{'$1', '$2'}}]}]}}]),
               GuildId =:= RequestedGuildId
          ])
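
For clarity, the second stage above is the filter applied in plain
code after mnesia has already walked the whole table; in Elixir terms
it amounts to something like this (the data here is illustrative):

    # Illustrative: `rows` stands for every {guild_id, guild} pair the
    # match specification produced, i.e. the reshaped full table.
    rows = [{1, %{name: "a"}}, {2, %{name: "b"}}]
    requested_guild_id = 2
    for {guild_id, guild} <- rows, guild_id == requested_guild_id, do: guild
    # => [%{name: "b"}]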

The issue is that neither QLC nor mnesia can cleanly optimize this:
Mnesia does not know about the condition specified in QLC, and QLC's
optimization to use a `lookup_fun` is knocked out by the fact that it
can't reach into the traverse call to detect the reordered columns. It
might be possible to implement this in QLC itself, given that it is
smart enough to figure out when to use the lookup function based on
the table indices, together with some cooperation from Mnesia itself.
Obviously, the resulting full-table traversal leads to unacceptable
performance.

This commit introduces an optimization that allows guild, member and
presence cache implementations to export a `query_handle/1` function
that accepts a match specification guard, that is, the "middle part"
of a match specification. The guard determines which rows are
selected. Do note, however, that this still cannot fully optimize
lookups of single records - it will still traverse the table, but in
native ETS code.
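
For illustration, a minimal sketch of what such a `query_handle/1`
could look like for the mnesia guild cache; the function name follows
the description above, but the exact body is an assumption:

    # Sketch (assumed shape): splice the caller-supplied guard list into
    # the "middle part" of the match specification, so mnesia/ETS filters
    # rows natively. :"$1" is the guild ID column, :"$2" the guild.
    def query_handle(guards) do
      ms = [{{:_, :"$1", :"$2"}, guards, [{{:"$1", :"$2"}}]}]
      :mnesia.table(:nostrum_guilds, [{:traverse, {:select, ms}}])
    end

    # A single-guild lookup would then pass a guard such as:
    #   query_handle([{:==, :"$1", {:const, guild_id}}])
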
@Th3-M4jor
Contributor

I don't think this improves the performance on large tables much at all:

qlc:q([ 
       Guild ||
           {GuildId, Guild} <-
               mnesia:table(nostrum_guilds,
                            [{n_objects, 100},
                             {lock, read} |
                             {traverse,
                              {select,
                               [{{'_', '$1', '$2'},
                                 [{'==', '$1', {const, 12234}}],
                                 [{{'$1', '$2'}}]}]}}]),
           GuildId =:= RequestedGuildId
      ])

I'll inject a few thousand fake guilds and see how it performs locally in this PR vs 0.10.0
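
(For reference, a hypothetical way to seed the table for such a test;
`create/1` is the cache write callback, and the payload here is a
trimmed-down stand-in for a real guild:)

    # Hypothetical seeding sketch: inject fake guilds into the cache.
    for id <- 1..5_000 do
      Nostrum.Cache.GuildCache.create(%{id: id, name: "fake guild #{id}"})
    end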

@Th3-M4jor
Contributor

Th3-M4jor commented Aug 15, 2024

Okay, this is weird. Tested with 5000 guilds injected into the table,

iex(8)> func = fn ->
...(8)> start = :erlang.monotonic_time()
...(8)> Nostrum.Cache.GuildCache.get(guild.id)
...(8)> :erlang.monotonic_time() - start
...(8)> end

this usually takes 40-ish milliseconds on 0.10.0

But an almost empty table takes under 1 millisecond on 0.10.0

The report said it took closer to 400ms at 4000 guilds 🤔

There was no significant difference at 5000 guilds when using this PR

@atlas-oc

That is strange... another metric impacted was our CPU usage after replacing `Nostrum.Cache.GuildCache.get/1` with `:mnesia.read(:nostrum_guilds, id)`:
[image: CPU usage graph]
The spikes correspond to America-centric server activity (with our workload, almost every `:MESSAGE_CREATE` event was hitting the GuildCache, so each process spawned by the Consumer was presumably doing a full table scan before anything else, causing massive CPU strain).

This is also reflected in our GC statistics, where (I think?) the entire table was constantly being loaded in and out of memory, with 200-500 MB reclaimed each time:
[image: GC statistics graph]

@jchristgit
Collaborator Author

@Th3-M4jor can you share your benchmarking script?

@atlas-oc thanks for the valuable info from the graphs. To be honest, that looks as expected: our table type is a set, which gives us O(1) access on reads :-)
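
For context, the direct key read mentioned above: records live in the
table as three-element tuples (as the `{'_', '$1', '$2'}` heads earlier
in this thread show), so a set-table read is a single hash lookup
rather than a scan. A sketch:

    # Direct key read on a set table: constant-time, no traversal.
    # `guild_id` is the requested guild's ID.
    {:atomic, records} =
      :mnesia.transaction(fn -> :mnesia.read(:nostrum_guilds, guild_id) end)

    case records do
      [{_tag, _id, guild}] -> {:ok, guild}
      [] -> {:error, :not_found}
    end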

@jchristgit
Collaborator Author

Speaking of benchmarks, it might be nice to have that in a benchmark file, if you are interested in adding it. See benchmarks/member_cache_bench.exs for an example.
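
A minimal sketch of what such a benchmark file could contain, assuming
the existing benchmarks use Benchee (the file name and seeding here
are illustrative):

    # Hypothetical benchmarks/guild_cache_bench.exs
    alias Nostrum.Cache.GuildCache

    # Seed the cache once so the lookup has something to chew on.
    for id <- 1..5_000 do
      GuildCache.create(%{id: id, name: "bench guild #{id}"})
    end

    Benchee.run(%{
      "GuildCache.get/1" => fn -> GuildCache.get(2_500) end
    })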

@Th3-M4jor
Contributor

Opened #627 to add a basic benchmark for get/1

I may update it with one for create/1 and possibly also add one for the MessageCache this weekend if I can come up with a clean way to test both :set and :ordered_set on that one.

@jchristgit
Collaborator Author

jchristgit commented Aug 17, 2024

Thanks for the PR.

I've noticed something else, namely that in theory QLC supports what we're doing - I've dug through the source code and it seems capable of translating QLC filters into match specifications directly, without us having to supply them here. POC:

diff --git a/lib/nostrum/cache/guild_cache/mnesia.ex b/lib/nostrum/cache/guild_cache/mnesia.ex
index 04f27955..80bc6dbb 100644
--- a/lib/nostrum/cache/guild_cache/mnesia.ex
+++ b/lib/nostrum/cache/guild_cache/mnesia.ex
@@ -291,8 +291,9 @@ if Code.ensure_loaded?(:mnesia) do
     @doc "Get a QLC handle for the guild cache."
     @spec query_handle :: :qlc.query_handle()
     def query_handle do
-      ms = [{{:_, :"$1", :"$2"}, [], [{{:"$1", :"$2"}}]}]
-      :mnesia.table(@table_name, {:traverse, {:select, ms}})
+      #ms = [{{:_, :"$1", :"$2"}, [], [{{:"$1", :"$2"}}]}]
+      qh = :mnesia.table(@table_name, {:traverse, :select})
+      :nostrum_guild_cache_mnesia_qlc.transform_table_records(qh)
     end
 
     @impl GuildCache
diff --git a/src/nostrum_guild_cache_mnesia_qlc.erl b/src/nostrum_guild_cache_mnesia_qlc.erl
new file mode 100644
index 00000000..bea90c7c
--- /dev/null
+++ b/src/nostrum_guild_cache_mnesia_qlc.erl
@@ -0,0 +1,8 @@
+-module(nostrum_guild_cache_mnesia_qlc).
+-export([transform_table_records/1]).
+
+-include_lib("stdlib/include/qlc.hrl").
+
+
+transform_table_records(QH) ->
+    qlc:q([{GuildId, Guild} || {_RecordTag, GuildId, Guild} <- QH]).

This yields the following query:

iex(4)> IO.puts(:qlc.info(:nostrum_guild_cache_qlc.get(123, Nostrum.Cache.GuildCache.Mnesia)))
qlc:q([ 
       Guild ||
           {GuildId, Guild} <-
               mnesia:table(nostrum_guilds,
                            [{traverse,
                              {select,
                               [{{'_', '$1', '$2'},
                                 [true],
                                 [{{'$1', '$2'}}]}]}},
                             {n_objects, 100},
                             {lock, read} |
                             {traverse, select}]),
           GuildId =:= RequestedGuildId
      ])

Notice that we didn't specify an explicit match spec, but QLC got the message. That seems to stem from https://github.com/erlang/otp/blob/0b33c5a4c0a647cf7df6d50c8768b574c48752fd/lib/stdlib/src/qlc_pt.erl#L1140.

I think it boils down to this issue I reported last year: erlang/otp#7268
And indeed, if I modify the line like this:

diff --git a/lib/nostrum/cache/guild_cache/mnesia.ex b/lib/nostrum/cache/guild_cache/mnesia.ex
index 80bc6dbb..0ad7a855 100644
--- a/lib/nostrum/cache/guild_cache/mnesia.ex
+++ b/lib/nostrum/cache/guild_cache/mnesia.ex
@@ -293,7 +293,7 @@ if Code.ensure_loaded?(:mnesia) do
     def query_handle do
       #ms = [{{:_, :"$1", :"$2"}, [], [{{:"$1", :"$2"}}]}]
       qh = :mnesia.table(@table_name, {:traverse, :select})
-      :nostrum_guild_cache_mnesia_qlc.transform_table_records(qh)
+      qh
     end
 
     @impl GuildCache

which effectively removes the intermediate query handle and allows QLC to hook directly into the `:mnesia.table` call, then we get the following:

iex(2)>  IO.puts(:qlc.info(:nostrum_guild_cache_qlc.get(123, Nostrum.Cache.GuildCache.Mnesia)))
mnesia:table(nostrum_guilds,
             [{traverse,
               {select,
                [{{'$1', '$2'}, [{'=:=', '$1', {const, 123}}], ['$2']}]}},
              {n_objects, 100},
              {lock, read} |
              {traverse, select}])

(Note this query is "wrong" because it will compare the record tag in the first column, not the guild ID, to 123.) As you can see, this is still not perfect, because it still does not use the `lookup_fun`. I think that is caused by the fact that a match spec is passed in there. I need to do more testing; I think we may be able to implement optimizations upstream.
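
For comparison, a match specification that is correct for the raw
three-column records would bind the key into :"$1" and compare that
instead (illustrative, Elixir syntax):

    # With untransformed {tag, guild_id, guild} records, the key
    # comparison must target the second tuple element, not the tag.
    ms = [{{:_, :"$1", :"$2"}, [{:"=:=", :"$1", {:const, 123}}], [:"$2"]}]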

@jchristgit
Collaborator Author

So, I've been digging through the QLC source code.

Parse transform

At compile time, the qlc_pt.erl parse transform replaces any call to qlc:q with a qlc_lc record. That record contains some information about the query list comprehension.
The core idea is that the parse transform generates a function that acts as a state machine. The comment in qlc_pt.erl starting at line 521 explains it decently. Any referenced variables are stored in a function that returns said variable.

Don't even try to understand the code of the parse transform. I have no idea what is going on there...

Evaluation

With our list comprehension rewritten into a record of magic functions, we have something we can pass to qlc:e for evaluation. The important logic then happens around prep_qlc_lc in the qlc_v1 branch (qlc_v1 is the inner content of qlc_lc when the list comprehension is non-trivial).

My proposed optimization would basically traverse down into any nested qlc_v1 query handles and check whether they support match specifications. If they do, we check whether our current query handle is expressed using a match specification, and if so, we merge the match specification of our parent query into the child query and kill off the child query altogether.
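
Conceptually, the merge would prepend the parent query's condition to
the guard list of every clause in the child's match specification,
along these lines (illustrative Elixir, not actual QLC internals):

    # Illustrative only: fold a parent filter into a child match spec.
    child_ms = [{{:_, :"$1", :"$2"}, [], [{{:"$1", :"$2"}}]}]
    parent_guard = {:"=:=", :"$1", {:const, 123}}

    merged_ms =
      for {head, guards, body} <- child_ms do
        {head, [parent_guard | guards], body}
      end
    # => [{{:_, :"$1", :"$2"}, [{:"=:=", :"$1", {:const, 123}}],
    #      [{{:"$1", :"$2"}}]}]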

Problem

The issue is that this is much harder than it sounds, because pretty much everything in the qlc record is represented as a function that is hooked into the state machine in various ways. Even the match specification requires calling a function with some QNum parameter. To give you an idea, this is the function that exposes that information:

fun(size) -> fun(1) -> 2; (_) -> undefined end;
   (template) -> fun(_, _) -> [] end;
   (constants) -> fun(_) -> no_column_fun end;
   (equal_constants) ->
       fun(1) -> fun(1) -> {values, [2], {all, [2]}};
                 (_) -> false
              end;
          (_) -> no_column_fun
       end;
   (n_leading_constant_columns) -> fun(1) -> 1; (_) -> 0 end;
   (constant_columns) -> fun(1) -> "\001"; (_) -> [] end;
   (match_specs) ->
       fun(1) -> {[{{'$1', '$2'}, [{'==', '$1', 2}], ['$2']}], all};
          (_) -> undefined end;
   (_) -> undefined

Some of these are somewhat self-explanatory, but then there is the question of what happens with arguments other than 1 and so on; I assume that comes into play in larger queries. Anyhow, my point is: due to this state machine construction, rewriting the query is out.
My guess is that all of this is implemented as a state machine for performance. The actual recursive function calling itself for results is a bit hard to read at first, but I think it is pretty smart.

Conclusion

Honestly, after some discussion with @jb3, we mostly conclude that while QLC is pretty cool tech and a pretty cool idea, it comes with invisible issues and a very hard to digest codebase (as is also evident from the lack of upstream contributions to it). We've invested a lot of time into using it, testing it, and - sometimes - fighting it, and I'm sure that given enough time one of us could manage to fix this issue, but I also think that time would be better spent building something new, maybe backed by an EEP to integrate it into Erlang (think Elixir's Stream module).

We'll therefore sunset the QLC cache implementations and revert to requiring individual functions for our cache accesses.

@jchristgit jchristgit closed this Aug 18, 2024
@jchristgit jchristgit deleted the match-spec-speed branch August 18, 2024 06:07
jchristgit added a commit that referenced this pull request Aug 18, 2024
See #625 and #629 for further information.