Optimize mnesia cache querying #625

Closed
wants to merge 1 commit

Conversation

jchristgit
Collaborator

QLC queries via mnesia-based caches that use `{traverse,
{select, MatchSpec}}` in any shape or form cause the QLC query to be
executed in two parts: the `mnesia:table` call runs the entire table
through the selected match specification, and a plain Erlang list
comprehension then filters the results of that. Per the discussion in
#622, this is what such a query looks like under `qlc:info/1`:

    qlc:q([
           Guild ||
               {GuildId, Guild} <-
                   mnesia:table(nostrum_guilds,
                                [{n_objects, 100},
                                 {lock, read} |
                                 {traverse,
                                  {select,
                                   [{{'_', '$1', '$2'},
                                     [],
                                     [{{'$1', '$2'}}]}]}}]),
               GuildId =:= RequestedGuildId
          ])
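
For clarity, the second stage above is the filter applied in plain
code after mnesia has already walked the whole table; in Elixir terms
it amounts to something like this (the data here is illustrative):

    # Illustrative: `rows` stands for every {guild_id, guild} pair the
    # match specification produced, i.e. the reshaped full table.
    rows = [{1, %{name: "a"}}, {2, %{name: "b"}}]
    requested_guild_id = 2
    for {guild_id, guild} <- rows, guild_id == requested_guild_id, do: guild
    # => [%{name: "b"}]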

The issue is that neither QLC nor mnesia can cleanly optimize this:
Mnesia does not know about the condition specified in QLC, and QLC's
optimization to use a `lookup_fun` is knocked out by the fact that it
can't reach into the traverse call to detect the reordered columns. It
might be possible to implement this in QLC itself, given that it is
smart enough to figure out when to use the lookup function based on
the table indices, together with some cooperation from Mnesia itself.
Obviously, the resulting full-table traversal leads to unacceptable
performance.

This commit introduces an optimization that allows guild, member and
presence cache implementations to export a `query_handle/1` function
that accepts a match specification guard, that is, the "middle part"
of a match specification. The guard determines which rows are
selected. Do note, however, that this still cannot fully optimize
lookups of single records - it will still traverse the table, but in
native ETS code.
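
For illustration, a minimal sketch of what such a `query_handle/1`
could look like for the mnesia guild cache; the function name follows
the description above, but the exact body is an assumption:

    # Sketch (assumed shape): splice the caller-supplied guard list into
    # the "middle part" of the match specification, so mnesia/ETS filters
    # rows natively. :"$1" is the guild ID column, :"$2" the guild.
    def query_handle(guards) do
      ms = [{{:_, :"$1", :"$2"}, guards, [{{:"$1", :"$2"}}]}]
      :mnesia.table(:nostrum_guilds, [{:traverse, {:select, ms}}])
    end

    # A single-guild lookup would then pass a guard such as:
    #   query_handle([{:==, :"$1", {:const, guild_id}}])
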
@Th3-M4jor
Contributor

I don't think this improves the performance on large tables much at all:

qlc:q([ 
       Guild ||
           {GuildId, Guild} <-
               mnesia:table(nostrum_guilds,
                            [{n_objects, 100},
                             {lock, read} |
                             {traverse,
                              {select,
                               [{{'_', '$1', '$2'},
                                 [{'==', '$1', {const, 12234}}],
                                 [{{'$1', '$2'}}]}]}}]),
           GuildId =:= RequestedGuildId
      ])

I'll inject a few thousand fake guilds and see how it performs locally in this PR vs 0.10.0
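
(For reference, a hypothetical way to seed the table for such a test;
`create/1` is the cache write callback, and the payload here is a
trimmed-down stand-in for a real guild:)

    # Hypothetical seeding sketch: inject fake guilds into the cache.
    for id <- 1..5_000 do
      Nostrum.Cache.GuildCache.create(%{id: id, name: "fake guild #{id}"})
    end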

@Th3-M4jor
Contributor

Th3-M4jor commented Aug 15, 2024

Okay, this is weird. Tested with 5000 guilds injected into the table,

iex(8)> func = fn ->
...(8)> start = :erlang.monotonic_time()
...(8)> Nostrum.Cache.GuildCache.get(guild.id)
...(8)> :erlang.monotonic_time() - start
...(8)> end

this usually takes 40-ish milliseconds on 0.10.0

But an almost empty table takes under 1 millisecond on 0.10.0

The report said it took closer to 400ms at 4000 guilds 🤔

There was no significant difference at 5000 guilds when using this PR

@atlas-oc

That is strange... another metric impacted was our CPU usage after replacing `Nostrum.Cache.GuildCache.get/1` with `:mnesia.read(:nostrum_guilds, id)`:
[image: CPU usage graph]
The spikes correspond to America-centric server activity (with our workload, almost every `:MESSAGE_CREATE` event was hitting the GuildCache, so each process spawned by the Consumer was presumably doing a full table scan before anything else, causing massive CPU strain).

This is also reflected in our GC statistics, where (I think?) the entire table was constantly being loaded in and out of memory, with 200-500 MB reclaimed each time:
[image: GC statistics graph]

@jchristgit
Collaborator Author

@Th3-M4jor can you share your benchmarking script?

@atlas-oc thanks for the valuable info from the graphs. To be honest, that looks as expected: our table type is a set, which gives us O(1) access on reads :-)
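
For context, the direct key read mentioned above: records live in the
table as three-element tuples (as the `{'_', '$1', '$2'}` heads earlier
in this thread show), so a set-table read is a single hash lookup
rather than a scan. A sketch:

    # Direct key read on a set table: constant-time, no traversal.
    # `guild_id` is the requested guild's ID.
    {:atomic, records} =
      :mnesia.transaction(fn -> :mnesia.read(:nostrum_guilds, guild_id) end)

    case records do
      [{_tag, _id, guild}] -> {:ok, guild}
      [] -> {:error, :not_found}
    end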

@jchristgit
Collaborator Author

Speaking of benchmarks, it might be nice to have that in a benchmark file, if you are interested in adding it. See benchmarks/member_cache_bench.exs for an example.
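
A minimal sketch of what such a benchmark file could contain, assuming
the existing benchmarks use Benchee (the file name and seeding here
are illustrative):

    # Hypothetical benchmarks/guild_cache_bench.exs
    alias Nostrum.Cache.GuildCache

    # Seed the cache once so the lookup has something to chew on.
    for id <- 1..5_000 do
      GuildCache.create(%{id: id, name: "bench guild #{id}"})
    end

    Benchee.run(%{
      "GuildCache.get/1" => fn -> GuildCache.get(2_500) end
    })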

@Th3-M4jor
Contributor

Opened #627 to add a basic benchmark for get/1

I may update it with one for create/1 and possibly also add one for the MessageCache this weekend if I can come up with a clean way to test both :set and :ordered_set on that one.

@jchristgit
Collaborator Author

jchristgit commented Aug 17, 2024

Thanks for the PR.

I've noticed something else, namely that in theory QLC supports what we're doing - I've dug through the source code and it seems capable of translating QLC filters into match specifications directly, without us having to supply them here. POC:

diff --git a/lib/nostrum/cache/guild_cache/mnesia.ex b/lib/nostrum/cache/guild_cache/mnesia.ex
index 04f27955..80bc6dbb 100644
--- a/lib/nostrum/cache/guild_cache/mnesia.ex
+++ b/lib/nostrum/cache/guild_cache/mnesia.ex
@@ -291,8 +291,9 @@ if Code.ensure_loaded?(:mnesia) do
     @doc "Get a QLC handle for the guild cache."
     @spec query_handle :: :qlc.query_handle()
     def query_handle do
-      ms = [{{:_, :"$1", :"$2"}, [], [{{:"$1", :"$2"}}]}]
-      :mnesia.table(@table_name, {:traverse, {:select, ms}})
+      #ms = [{{:_, :"$1", :"$2"}, [], [{{:"$1", :"$2"}}]}]
+      qh = :mnesia.table(@table_name, {:traverse, :select})
+      :nostrum_guild_cache_mnesia_qlc.transform_table_records(qh)
     end
 
     @impl GuildCache
diff --git a/src/nostrum_guild_cache_mnesia_qlc.erl b/src/nostrum_guild_cache_mnesia_qlc.erl
new file mode 100644
index 00000000..bea90c7c
--- /dev/null
+++ b/src/nostrum_guild_cache_mnesia_qlc.erl
@@ -0,0 +1,8 @@
+-module(nostrum_guild_cache_mnesia_qlc).
+-export([transform_table_records/1]).
+
+-include_lib("stdlib/include/qlc.hrl").
+
+
+transform_table_records(QH) ->
+    qlc:q([{GuildId, Guild} || {_RecordTag, GuildId, Guild} <- QH]).

This yields the following query:

iex(4)> IO.puts(:qlc.info(:nostrum_guild_cache_qlc.get(123, Nostrum.Cache.GuildCache.Mnesia)))
qlc:q([ 
       Guild ||
           {GuildId, Guild} <-
               mnesia:table(nostrum_guilds,
                            [{traverse,
                              {select,
                               [{{'_', '$1', '$2'},
                                 [true],
                                 [{{'$1', '$2'}}]}]}},
                             {n_objects, 100},
                             {lock, read} |
                             {traverse, select}]),
           GuildId =:= RequestedGuildId
      ])

Notice that we didn't specify an explicit match spec, but QLC got the message. That seems to stem from https://github.com/erlang/otp/blob/0b33c5a4c0a647cf7df6d50c8768b574c48752fd/lib/stdlib/src/qlc_pt.erl#L1140.

I think it boils down to this issue I reported last year: erlang/otp#7268
And indeed, if I modify the line like this:

diff --git a/lib/nostrum/cache/guild_cache/mnesia.ex b/lib/nostrum/cache/guild_cache/mnesia.ex
index 80bc6dbb..0ad7a855 100644
--- a/lib/nostrum/cache/guild_cache/mnesia.ex
+++ b/lib/nostrum/cache/guild_cache/mnesia.ex
@@ -293,7 +293,7 @@ if Code.ensure_loaded?(:mnesia) do
     def query_handle do
       #ms = [{{:_, :"$1", :"$2"}, [], [{{:"$1", :"$2"}}]}]
       qh = :mnesia.table(@table_name, {:traverse, :select})
-      :nostrum_guild_cache_mnesia_qlc.transform_table_records(qh)
+      qh
     end
 
     @impl GuildCache

which effectively removes the intermediate query handle and allows QLC to hook directly into the `:mnesia.table` call, then we get the following:

iex(2)>  IO.puts(:qlc.info(:nostrum_guild_cache_qlc.get(123, Nostrum.Cache.GuildCache.Mnesia)))
mnesia:table(nostrum_guilds,
             [{traverse,
               {select,
                [{{'$1', '$2'}, [{'=:=', '$1', {const, 123}}], ['$2']}]}},
              {n_objects, 100},
              {lock, read} |
              {traverse, select}])

(Note this query is "wrong" because it will compare the record tag in the first column, not the guild ID, to 123.) As you can see, this is still not perfect, because it still does not use the `lookup_fun`. I think that is caused by the fact that a match spec is passed in there. I need to do more testing; I think we may be able to implement optimizations upstream.
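
For comparison, a match specification that is correct for the raw
three-column records would bind the key into :"$1" and compare that
instead (illustrative, Elixir syntax):

    # With untransformed {tag, guild_id, guild} records, the key
    # comparison must target the second tuple element, not the tag.
    ms = [{{:_, :"$1", :"$2"}, [{:"=:=", :"$1", {:const, 123}}], [:"$2"]}]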

@jchristgit
Collaborator Author

So, I've been digging through the QLC source code.

Parse transform

At compile time, the qlc_pt.erl parse transform replaces any call to qlc:q with a qlc_lc record. That record contains some information about the query list comprehension.
The core idea is that the parse transform generates a function that acts as a state machine. The comment in qlc_pt.erl starting at line 521 explains it decently. Any referenced variables are stored in a function that returns said variable.

Don't even try to understand the code of the parse transform. I have no idea what is going on there...

Evaluation

With our list comprehension rewritten into a record of magic functions, we have something we can pass to qlc:e for evaluation. The important logic then happens around prep_qlc_lc in the qlc_v1 branch (qlc_v1 is the inner content of qlc_lc when the list comprehension is non-trivial).

My proposed optimization would basically traverse down into any nested qlc_v1 query handles and check whether they support match specifications. If they do, we check whether our current query handle is expressed using a match specification, and if so, we merge the match specification of our parent query into the child query and kill off the child query altogether.
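
Conceptually, the merge would prepend the parent query's condition to
the guard list of every clause in the child's match specification,
along these lines (illustrative Elixir, not actual QLC internals):

    # Illustrative only: fold a parent filter into a child match spec.
    child_ms = [{{:_, :"$1", :"$2"}, [], [{{:"$1", :"$2"}}]}]
    parent_guard = {:"=:=", :"$1", {:const, 123}}

    merged_ms =
      for {head, guards, body} <- child_ms do
        {head, [parent_guard | guards], body}
      end
    # => [{{:_, :"$1", :"$2"}, [{:"=:=", :"$1", {:const, 123}}],
    #      [{{:"$1", :"$2"}}]}]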

Problem

The issue is that this is much harder than it sounds, because pretty much everything in the qlc record is represented as a function that is hooked into the state machine in various ways. Even the match specification requires calling a function with some QNum parameter. To give you an idea, this is the function that exposes that information:

fun(size) -> fun(1) -> 2; (_) -> undefined end;
   (template) -> fun(_, _) -> [] end;
   (constants) -> fun(_) -> no_column_fun end;
   (equal_constants) ->
       fun(1) -> fun(1) -> {values, [2], {all, [2]}};
                 (_) -> false
              end;
          (_) -> no_column_fun
       end;
   (n_leading_constant_columns) -> fun(1) -> 1; (_) -> 0 end;
   (constant_columns) -> fun(1) -> "\001"; (_) -> [] end;
   (match_specs) ->
       fun(1) -> {[{{'$1', '$2'}, [{'==', '$1', 2}], ['$2']}], all};
          (_) -> undefined end;
   (_) -> undefined

Some of these are somewhat self-explanatory, but then there is the question of what happens with arguments other than 1 and so on; I assume that comes into play in larger queries. Anyhow, my point is: due to this state machine construction, rewriting the query is out.
My guess is that all of this is implemented as a state machine for performance. The actual recursive function calling itself for results is a bit hard to read at first, but I think it is pretty smart.

Conclusion

Honestly, after some discussion with @jb3, we mostly conclude that while QLC is pretty cool tech and a pretty cool idea, it comes with invisible issues and a very hard to digest codebase (as is also evident from the lack of upstream contributions to it). We've invested a lot of time into using it, testing it, and - sometimes - fighting it, and I'm sure that given enough time one of us could manage to fix this issue, but I also think that time would be better spent building something new, maybe backed by an EEP to integrate it into Erlang (think Elixir's Stream module).

We'll therefore sunset the QLC cache implementations and revert to requiring individual functions for our cache accesses.

@jchristgit jchristgit closed this Aug 18, 2024
@jchristgit jchristgit deleted the match-spec-speed branch August 18, 2024 06:07
jchristgit added a commit that referenced this pull request Aug 18, 2024
See #625 and #629 for further information.