-
Notifications
You must be signed in to change notification settings - Fork 96
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
This is a substantial refactoring of the Ra log implementation. All disk formats remain the same - the changes in this PR primarily deal with: Changes to memtable management and lifetimes Refinement of Ra server / WAL / Segment writer messages and handling. Motivation Previously the WAL process had quite a lot of responsibilities: * batching incoming writes from all local ra servers (writers). * Accounting * Writing entries to disk and fsyncing * Notifying ra server post fsync * Creating and updating new memtable and the mem-table lookup tables. (ra_log_open_mem_tables, ra_log_closed_mem_tables). * Notifying segment writer As the WAL is the primary contention point of the Ra log system it was clear it would benefit from fewer responsibilities which should increase the performance and scalability of the log implementation. In addition the ra server wrote all entries first to a local ETS "cache" table, the same entry that later was written to a memtable by the WAL. So each entry was written to two ETS tables (at a minimum, low priority entries may also be written to yet another ETS table). V2 In the v2 implementation the WAL no longer writes to memtables. Instead of the ra servers maintaining a cache table this table, in effect, becomes the memtable. Only the ra servers write to memtables now. The WAL's responsibilities have been reduced to: * batching * accounting * writing and syncing * notifying Instead of memtables being linked to the life-time of a WAL file and deleted after being flushed to segments they are now maintained indefinitely and instead flushed entries are deleted from the table. There are many additional changes and improvements and some API simplifictions. Breaking change: The rarely used ra:register_external_reader/2 API has been deprecated and will now only read entries from segment files. So in effect it will provide the same behaviour just delayed. The docs/LOG_V2.md provides further details on the new log design. Fix typos wip
- Loading branch information
Showing
60 changed files
with
4,802 additions
and
3,619 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,4 @@ | ||
%% encoding: UTF-8 | ||
{application,ra}. | ||
{modules,[ra,ra_counters,ra_dbg,ra_directory,ra_env,ra_leaderboard, | ||
ra_log_pre_init,ra_log_reader,ra_machine,ra_monitors,ra_server, | ||
ra_snapshot,ra_system]}. | ||
{modules,[ra,ra_aux,ra_counters,ra_dbg,ra_directory,ra_env,ra_leaderboard, | ||
ra_log_reader,ra_machine,ra_snapshot,ra_system]}. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,109 @@ | ||
# Ra log v2 internals | ||
|
||
```mermaid | ||
sequenceDiagram | ||
participant ra-server-n | ||
participant wal | ||
participant segment-writer | ||
loop until wal full | ||
ra-server-n-)+wal: write(Index=1..N, Term=T) | ||
wal->>wal: write-batch([1,2,3]) | ||
wal-)-ra-server-n: written event: Term=T, Range=(1, 3) | ||
end | ||
wal-)+segment-writer: flush-wal-ranges | ||
segment-writer-->segment-writer: flush to segment files | ||
segment-writer-)ra-server-n: notify flushed segments | ||
ra-server-n-->ra-server-n: update mem-table-ranges | ||
ra-server-n-)ets-server: delete range from mem-table | ||
``` | ||
|
||
In the Ra log v2 implementation some work previously done by the `ra_log_wal` | ||
process has now been either factored out or moved elsewhere. | ||
|
||
In Ra log v1 the WAL process would be responsible for both writing to disk and | ||
to memtables (ETS). Each writer (identified by a locally scoped binary "UId") would | ||
have a unique ETS table to cover the lifetime of each WAL file. Once the WAL breaches | ||
its' configured `max_wal_size_bytes` limit it closes the file and hands it over to | ||
the segment writer to flush any still live entries to per-server segments. | ||
The segment writer reads each entry from the memtables, not the WAL file. | ||
When all still live entries in the WAL have been flushed to segments the segment | ||
writer deletes the WAL file and notifies all relevant ra servers of the new | ||
segments. Once each ra server receives this notifications and updates their | ||
"seg-refs" they delete the whole memtable. | ||
|
||
In the v2 implementation the WAL no longer writes to memtables during normal | ||
operation (exception being the recovery phase). Instead the memtables are | ||
written to by the Ra servers before the write request is sent to the WAL. | ||
The removes the need for a separate ETS table per Ra server "cache" which was | ||
required in the v1 implementation. | ||
|
||
In v2 memtables aren't deleted after segment flush. Instead they are kept until | ||
a Ra server needs to overwrite some entries. This cannot be allowed due to the | ||
async nature of the log implementation. E.g. the segment writer could be reading | ||
from the memtables and if an index is overwritten it may generate an inconsistent | ||
end result. Overwrites are typically only needed when a leader has been replaced | ||
and have some written but uncommitted entries that another leader in a higher | ||
term has overwritten. | ||
|
||
|
||
## In-Memory Tables (memtables) | ||
|
||
Module: `ra_mt` | ||
|
||
Mem tables are owned and created by the `ra_log_ets` process. Ra servers call | ||
into the process to create new memtables and a registry of current tables is | ||
kept in the `ra_log_open_memtables` table. From v2 the `ra_log_closed_memtables` | ||
ETS table is no longer used or created. | ||
|
||
Invariant: Entries can be written or deleted but never overwritten. | ||
|
||
During normal operation each Ra server only writes to a single ETS memtable. | ||
Entries that are no longer required to be kept in the memtable due to snapshotting | ||
or having been written to disk segments are deleted. The actual delete operation | ||
is performed by `ra_log_ets` on request by Ra servers. | ||
|
||
Memtables are no longer linked to the lifetime of a given WAL file as before. | ||
Apart from recovery after a system restart only the Ra servers write to | ||
memtables which reduces the workload of the WAL process. | ||
|
||
New memtables are only created when a server needs to overwrite indexes in its | ||
log. This typically only happens when a leader has been replaced and steps down | ||
to follower with uncommitted entries in its log. Due to the async nature of the | ||
Ra log implementation it is not safe to ever overwrite an entry in a memtable | ||
(as concurrent reads may be done by the segment writer process). Therefore a new | ||
memtable needs to be created when this situation occurs. | ||
|
||
When a new memtable is created the old ones will not be written to any further | ||
and will be deleted as soon as they are emptied. | ||
|
||
## WAL | ||
|
||
Module: `ra_log_wal` | ||
|
||
The `ra_log_wal` process now has the following responsibilities: | ||
|
||
* Write entries to disk and notify the writer processes when their entries | ||
have been synced to the underlying storage. | ||
* Track the ranges written by each writer (ra server) for which ETS table and | ||
notifies the segment writer when a WAL file has filled up. | ||
* Recover memtables from WAL files after a system restart. | ||
|
||
## Segment Writer | ||
|
||
Module: `ra_log_segment_writer` | ||
|
||
The segment writer's responsibilities remain much as before. | ||
When a WAL file reaches it's max size limit the WAL will send the segment writer | ||
a map of `#{ra_uid() => [{ets:tid(), ra_range()}]}` describing the "tid ranges" | ||
that need to be written to disk for each `ra_uid()` (i.e. a Ra server). | ||
|
||
The range that is actually written can be dynamically truncated if the Ra server | ||
writes a snapshot before or during the segment flush. E.g. if the segment writer | ||
is asked to flush the range `{1000, 2000}` and the Ra server writes a snapshot | ||
at index 1500 the segment writer will update the range to `{1501, 2000}` to avoid flushing | ||
redundant entries to disk. | ||
|
||
The latest snapshot index for each Ra server is kept in the `ra_log_snapshot_state` | ||
ETS table. | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.