Skip to content

Commit

Permalink
Ra log v2
Browse files Browse the repository at this point in the history
This is a substantial refactoring of the Ra log implementation. All disk formats remain the same - the changes in this PR primarily deal with:

Changes to memtable management and lifetimes
Refinement of Ra server / WAL / Segment writer messages and handling.
Motivation
Previously the WAL process had quite a lot of responsibilities:

* batching incoming writes from all local ra servers (writers).
* Accounting
* Writing entries to disk and fsyncing
* Notifying ra server post fsync
* Creating and updating new memtable and the mem-table lookup tables. (ra_log_open_mem_tables, ra_log_closed_mem_tables).
* Notifying segment writer

As the WAL is the primary contention point of the Ra log system it was clear it would benefit from fewer responsibilities which should increase the performance and scalability of the log implementation.

In addition the ra server wrote all entries first to a local ETS "cache" table, the same entry that later was written to a memtable by the WAL. So each entry was written to two ETS tables (at a minimum, low priority entries may also be written to yet another ETS table).

V2

In the v2 implementation the WAL no longer writes to memtables. Instead of the ra servers maintaining a cache table this table, in effect, becomes the memtable. Only the ra servers write to memtables now.

The WAL's responsibilities have been reduced to:

* batching
* accounting
* writing and syncing
* notifying

Instead of memtables being linked to the life-time of a WAL file and deleted after being flushed to segments they are now maintained indefinitely and instead flushed entries are deleted from the table.

There are many additional changes and improvements and some API simplifictions.

Breaking change:

The rarely used ra:register_external_reader/2 API has been deprecated and will now only read entries from segment files. So in effect it will provide the same behaviour just delayed.

The docs/LOG_V2.md provides further details on the new log design.

Fix typos

wip
  • Loading branch information
kjnilsson committed Nov 11, 2024
1 parent 0146f87 commit 24c7c00
Show file tree
Hide file tree
Showing 60 changed files with 4,802 additions and 3,619 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/erlang.yml
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ jobs:
strategy:
fail-fast: false
matrix:
otp_version: [25, 26, 27]
otp_version: [26, 27]
steps:
- name: CHECKOUT
uses: actions/checkout@v2
Expand Down
4 changes: 2 additions & 2 deletions MODULE.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -56,8 +56,8 @@ erlang_package.hex_package(

erlang_package.hex_package(
name = "gen_batch_server",
sha256 = "94a49a528486298b009d2a1b452132c0a0d68b3e89d17d3764cb1ec879b7557a",
version = "0.8.7",
sha256 = "c8581fe4a4b6bccf91e53ce6a8c7e6c27c8c591bab5408b160166463f5579c22",
version = "0.8.9",
)

erlang_package.git_package(
Expand Down
2 changes: 1 addition & 1 deletion Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ PROJECT = ra
ESCRIPT_NAME = ra_fifo_cli
ESCRIPT_EMU_ARGS = -noinput -setcookie ra_fifo_cli

dep_gen_batch_server = hex 0.8.8
dep_gen_batch_server = hex 0.8.9
dep_aten = hex 0.6.0
dep_seshat = hex 0.6.0
DEPS = aten gen_batch_server seshat
Expand Down
5 changes: 2 additions & 3 deletions docs/edoc-info
Original file line number Diff line number Diff line change
@@ -1,5 +1,4 @@
%% encoding: UTF-8
{application,ra}.
{modules,[ra,ra_counters,ra_dbg,ra_directory,ra_env,ra_leaderboard,
ra_log_pre_init,ra_log_reader,ra_machine,ra_monitors,ra_server,
ra_snapshot,ra_system]}.
{modules,[ra,ra_aux,ra_counters,ra_dbg,ra_directory,ra_env,ra_leaderboard,
ra_log_reader,ra_machine,ra_snapshot,ra_system]}.
109 changes: 109 additions & 0 deletions docs/internals/LOG_V2.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
# Ra log v2 internals

```mermaid
sequenceDiagram
participant ra-server-n
participant wal
participant segment-writer
loop until wal full
ra-server-n-)+wal: write(Index=1..N, Term=T)
wal->>wal: write-batch([1,2,3])
wal-)-ra-server-n: written event: Term=T, Range=(1, 3)
end
wal-)+segment-writer: flush-wal-ranges
segment-writer-->segment-writer: flush to segment files
segment-writer-)ra-server-n: notify flushed segments
ra-server-n-->ra-server-n: update mem-table-ranges
ra-server-n-)ets-server: delete range from mem-table
```

In the Ra log v2 implementation some work previously done by the `ra_log_wal`
process has now been either factored out or moved elsewhere.

In Ra log v1 the WAL process would be responsible for both writing to disk and
to memtables (ETS). Each writer (identified by a locally scoped binary "UId") would
have a unique ETS table to cover the lifetime of each WAL file. Once the WAL breaches
its' configured `max_wal_size_bytes` limit it closes the file and hands it over to
the segment writer to flush any still live entries to per-server segments.
The segment writer reads each entry from the memtables, not the WAL file.
When all still live entries in the WAL have been flushed to segments the segment
writer deletes the WAL file and notifies all relevant ra servers of the new
segments. Once each ra server receives this notifications and updates their
"seg-refs" they delete the whole memtable.

In the v2 implementation the WAL no longer writes to memtables during normal
operation (exception being the recovery phase). Instead the memtables are
written to by the Ra servers before the write request is sent to the WAL.
The removes the need for a separate ETS table per Ra server "cache" which was
required in the v1 implementation.

In v2 memtables aren't deleted after segment flush. Instead they are kept until
a Ra server needs to overwrite some entries. This cannot be allowed due to the
async nature of the log implementation. E.g. the segment writer could be reading
from the memtables and if an index is overwritten it may generate an inconsistent
end result. Overwrites are typically only needed when a leader has been replaced
and have some written but uncommitted entries that another leader in a higher
term has overwritten.


## In-Memory Tables (memtables)

Module: `ra_mt`

Mem tables are owned and created by the `ra_log_ets` process. Ra servers call
into the process to create new memtables and a registry of current tables is
kept in the `ra_log_open_memtables` table. From v2 the `ra_log_closed_memtables`
ETS table is no longer used or created.

Invariant: Entries can be written or deleted but never overwritten.

During normal operation each Ra server only writes to a single ETS memtable.
Entries that are no longer required to be kept in the memtable due to snapshotting
or having been written to disk segments are deleted. The actual delete operation
is performed by `ra_log_ets` on request by Ra servers.

Memtables are no longer linked to the lifetime of a given WAL file as before.
Apart from recovery after a system restart only the Ra servers write to
memtables which reduces the workload of the WAL process.

New memtables are only created when a server needs to overwrite indexes in its
log. This typically only happens when a leader has been replaced and steps down
to follower with uncommitted entries in its log. Due to the async nature of the
Ra log implementation it is not safe to ever overwrite an entry in a memtable
(as concurrent reads may be done by the segment writer process). Therefore a new
memtable needs to be created when this situation occurs.

When a new memtable is created the old ones will not be written to any further
and will be deleted as soon as they are emptied.

## WAL

Module: `ra_log_wal`

The `ra_log_wal` process now has the following responsibilities:

* Write entries to disk and notify the writer processes when their entries
have been synced to the underlying storage.
* Track the ranges written by each writer (ra server) for which ETS table and
notifies the segment writer when a WAL file has filled up.
* Recover memtables from WAL files after a system restart.

## Segment Writer

Module: `ra_log_segment_writer`

The segment writer's responsibilities remain much as before.
When a WAL file reaches it's max size limit the WAL will send the segment writer
a map of `#{ra_uid() => [{ets:tid(), ra_range()}]}` describing the "tid ranges"
that need to be written to disk for each `ra_uid()` (i.e. a Ra server).

The range that is actually written can be dynamically truncated if the Ra server
writes a snapshot before or during the segment flush. E.g. if the segment writer
is asked to flush the range `{1000, 2000}` and the Ra server writes a snapshot
at index 1500 the segment writer will update the range to `{1501, 2000}` to avoid flushing
redundant entries to disk.

The latest snapshot index for each Ra server is kept in the `ra_log_snapshot_state`
ETS table.

4 changes: 1 addition & 3 deletions docs/modules-frame.html
Original file line number Diff line number Diff line change
Expand Up @@ -8,16 +8,14 @@
<h2 class="indextitle">Modules</h2>
<table width="100%" border="0" summary="list of modules">
<tr><td><a href="ra.html" target="overviewFrame" class="module">ra</a></td></tr>
<tr><td><a href="ra_aux.html" target="overviewFrame" class="module">ra_aux</a></td></tr>
<tr><td><a href="ra_counters.html" target="overviewFrame" class="module">ra_counters</a></td></tr>
<tr><td><a href="ra_dbg.html" target="overviewFrame" class="module">ra_dbg</a></td></tr>
<tr><td><a href="ra_directory.html" target="overviewFrame" class="module">ra_directory</a></td></tr>
<tr><td><a href="ra_env.html" target="overviewFrame" class="module">ra_env</a></td></tr>
<tr><td><a href="ra_leaderboard.html" target="overviewFrame" class="module">ra_leaderboard</a></td></tr>
<tr><td><a href="ra_log_pre_init.html" target="overviewFrame" class="module">ra_log_pre_init</a></td></tr>
<tr><td><a href="ra_log_reader.html" target="overviewFrame" class="module">ra_log_reader</a></td></tr>
<tr><td><a href="ra_machine.html" target="overviewFrame" class="module">ra_machine</a></td></tr>
<tr><td><a href="ra_monitors.html" target="overviewFrame" class="module">ra_monitors</a></td></tr>
<tr><td><a href="ra_server.html" target="overviewFrame" class="module">ra_server</a></td></tr>
<tr><td><a href="ra_snapshot.html" target="overviewFrame" class="module">ra_snapshot</a></td></tr>
<tr><td><a href="ra_system.html" target="overviewFrame" class="module">ra_system</a></td></tr></table>
</body>
Expand Down
Loading

0 comments on commit 24c7c00

Please sign in to comment.