[no-release-notes] go: sqle: dprocedures: dolt_gc: Implement a session-aware GC safepoint controller. #8798

reltuk · 2025-01-29T00:03:48Z

Currently, if you run GC against a running sql-server, the GC process cancels all in-flight queries and closes all existing SQL connections. It also leaves the connection that called dolt_gc() in an invalidated state, in which every subsequent query fails. The end result is that calling dolt_gc on a running server is disruptive and requires careful handling by existing users.

This PR introduces a new session-aware safepoint controller. In order to establish a safepoint, it starts tracking all inflight sessions shortly after the GC process begins. It adds lifecycle hooks so that those sessions get a chance to give their GC roots to the GC process once they are quiesced and before the GC process completes. It allows the GC process to block on these rendezvous so that it can be certain it has seen all inflight work before it finalizes the GC.
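
The rendezvous described above can be sketched roughly as follows. This is a minimal toy model, not the controller in this PR: the type and method names (`controller`, `CommandBegin`, `CommandEnd`, `Waiter`) are hypothetical, roots are plain strings, and it leaves out blocking new commands while a visit is in progress. It only shows the core idea that quiesced sessions are visited immediately while busy sessions are visited when their current command ends.

```go
package main

import (
	"fmt"
	"sync"
)

// session is a hypothetical stand-in for a SQL session that can report GC roots.
type session struct {
	id    int
	roots []string
}

// sessionState is the controller's bookkeeping for one tracked session.
type sessionState struct {
	busy    bool // a command is currently running on the session
	pending bool // a waiter still needs to visit this session
}

// controller is a toy safepoint controller; none of these names come from Dolt.
type controller struct {
	mu       sync.Mutex
	sessions map[*session]*sessionState
	visit    func(*session)  // set while a waiter is active
	wg       *sync.WaitGroup // counts sessions that still need a visit
}

func newController() *controller {
	return &controller{sessions: make(map[*session]*sessionState)}
}

// CommandBegin marks the session as busy, tracking it if it is new.
func (c *controller) CommandBegin(s *session) {
	c.mu.Lock()
	defer c.mu.Unlock()
	st, ok := c.sessions[s]
	if !ok {
		st = &sessionState{}
		c.sessions[s] = st
	}
	st.busy = true
}

// CommandEnd marks the session quiesced and, if a waiter is pending on it,
// visits it right away.
func (c *controller) CommandEnd(s *session) {
	c.mu.Lock()
	defer c.mu.Unlock()
	st := c.sessions[s]
	if st == nil {
		return // never tracked; nothing to do
	}
	st.busy = false
	if st.pending {
		st.pending = false
		c.visit(s)
		c.wg.Done()
	}
}

// Waiter visits quiesced sessions immediately and defers busy ones to
// CommandEnd. The returned function blocks until every session that was
// tracked at this point has been visited.
func (c *controller) Waiter(visit func(*session)) (wait func()) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.visit = visit
	c.wg = &sync.WaitGroup{}
	for s, st := range c.sessions {
		if st.busy {
			st.pending = true
			c.wg.Add(1)
		} else {
			visit(s)
		}
	}
	return c.wg.Wait
}

func main() {
	c := newController()
	idle := &session{id: 1, roots: []string{"a"}}
	busy := &session{id: 2, roots: []string{"b"}}
	c.CommandBegin(idle)
	c.CommandEnd(idle)
	c.CommandBegin(busy)

	var roots []string
	wait := c.Waiter(func(s *session) { roots = append(roots, s.roots...) })
	fmt.Println("roots before busy session ends:", roots)

	go c.CommandEnd(busy)
	wait()
	fmt.Println("roots after wait:", roots)
}
```

In this sketch the GC process would call `wait()` before finalizing, so it cannot finish until the busy session has handed over its roots.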

To turn this behavior on, run dolt sql-server with DOLT_GC_SAFEPOINT_CONTROLLER_CHOICE=session_aware.
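
As an illustration of an environment-variable switch like this one, here is a small sketch. The variable name and the `session_aware` value come from the description above; the helper `chooseController` and the `legacy` label are made up for this example, and how Dolt itself reads the variable is not shown in this PR description.

```go
package main

import (
	"fmt"
	"os"
)

// envVar is the variable named in the PR description.
const envVar = "DOLT_GC_SAFEPOINT_CONTROLLER_CHOICE"

// chooseController is a hypothetical helper that maps the variable's value to
// a mode name. Only "session_aware" comes from the PR; "legacy" is an
// illustrative label for the existing connection-killing behavior.
func chooseController(getenv func(string) string) string {
	if getenv(envVar) == "session_aware" {
		return "session_aware"
	}
	return "legacy"
}

func main() {
	// Prints the mode selected by the current process environment.
	fmt.Println(chooseController(os.Getenv))
}
```

Passing the lookup function as a parameter keeps the helper testable without touching the real process environment.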

This PR builds on a number of preceding PRs, and introduces two major subtleties:

Session lifecycle callbacks need to be made from everywhere that may be running mutations against the database. For example, PRs leading up to this one needed to change GMS server/handler, GMS eventscheduler, Dolt remotesrv and Dolt sqle/cluster. It is easy to forget these in a particular instance and everything will seemingly work, but GC will no longer be safe.
(*DoltSession).VisitGCRoots needs to know how to find all reachable GC roots from the session. It's easy to forget to update it if you add things to session state, or anywhere else. If a developer omits a GC root, everything will seemingly work, but GC will no longer be safe.
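
To make the second subtlety concrete, here is a toy sketch of a root-visiting method and a collector that trusts it. The names (`toySession`, `visitGCRoots`, `collect`) are hypothetical and the store is a flat set of hashes; a real collector follows references transitively. The point is only that a hash held by the session but not reported by the visit method would be dropped silently.

```go
package main

import "fmt"

// toySession is a hypothetical session holding references to stored data by hash.
type toySession struct {
	workingSet string   // hash of the in-memory working set
	staged     []string // hashes referenced by other session state
}

// visitGCRoots reports every hash the session can still reach.
// Any field added to toySession that holds a hash must also be reported here.
func (s *toySession) visitGCRoots(visit func(hash string)) {
	visit(s.workingSet)
	for _, h := range s.staged {
		visit(h)
	}
}

// collect keeps only hashes reported as roots and returns the ones it dropped.
func collect(store map[string]bool, sessions []*toySession) (dropped []string) {
	live := make(map[string]bool)
	for _, s := range sessions {
		s.visitGCRoots(func(h string) { live[h] = true })
	}
	for h := range store {
		if !live[h] {
			delete(store, h)
			dropped = append(dropped, h)
		}
	}
	return dropped
}

func main() {
	store := map[string]bool{"ws": true, "s1": true, "old": true}
	s := &toySession{workingSet: "ws", staged: []string{"s1"}}
	fmt.Println(collect(store, []*toySession{s}), len(store))
}
```

If `visitGCRoots` skipped the `staged` field, `collect` would still run without error, but `s1` would be deleted while the session still depends on it.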

That being said, auto GC is a critical requirement for Dolt and that means making it so that GC is correct and not disruptive to ongoing workloads. As a result, we are pursuing this solution as is, and will continue iterating on developer ergonomics and safety under change going forward.

…le to control dolt_gc safepoint behavior. This is a short-term setting which will allow choosing the session-aware gc safepoint behavior, instead of the legacy behavior which kills all in-flight connections when performing a GC.

coffeegoddd · 2025-01-29T00:38:44Z

@reltuk DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`fc3217e`	ok	5937457

version	total_tests
`fc3217e`	5937457

correctness_percentage
100.0

coffeegoddd · 2025-01-29T01:34:52Z

@reltuk DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`13c9ddf`	ok	5937457

version	total_tests
`13c9ddf`	5937457

correctness_percentage
100.0

coffeegoddd · 2025-01-29T18:21:03Z

@reltuk DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`819136d`	ok	5937457

version	total_tests
`819136d`	5937457

correctness_percentage
100.0

zachmu

The thing I don't understand about this behavior: new sessions starting after the call to BeginGC are not subject to the restrictions to not begin new work until GC finalization finishes. This means they can write chunks that the GC finalizer won't know about, since they aren't in the set of sessions visited during finalization.

The only way this could work is if the chunks they are writing aren't subject to collection, maybe because they are after a high water mark in the journal or something similar? I missed the first several PRs in this chain of work so I'm catching up here, but I couldn't easily find whether that assumption is true. Even then it's not obvious to me how we would prevent a new session from having a chunk it needs be collected in all cases (it could need a chunk that it thinks is already present in the value store, but is getting marked for collection).

Some high-level and in-method comments could clear this up, maybe

zachmu · 2025-01-29T20:00:41Z

go/libraries/doltcore/doltdb/doltdb.go

+// from the working set. This is used in GC, for example, where all dependencies of the in-memory working
+// set value need to be accounted for.
+func (ddb *DoltDB) WorkingSetHashes(ctx context.Context, ws *WorkingSet) ([]hash.Hash, error) {
+	spec, err := ws.writeValues(ctx, ddb, nil)


The logic in this method seems like it would be more naturally contained by the WorkingSet type

zachmu · 2025-01-29T20:45:27Z

go/libraries/doltcore/sqle/dsess/gc_safepoint_controller.go

+		panic("SesisonBeginCommand called on a session that already had an outstanding command.")
+	}
+	toWait := state.QuiesceCallbackDone.Load().(chan struct{})
+	select {


A comment might be helpful for this latch logic

Is the idea that we immediately unblock on a closed channel, but otherwise we actually do block until channel close? Not obvious why the unlocking then relocking is required in the latter case.

So reading the tests, it seems the purpose of this logic is that existing sessions must wait to begin new commands until an existing call to Wait() has completed, but new sessions aren't subject to this constraint?

Added a bunch of comments and some helper methods and stuff. Maybe it's more clear...if you want to take a look...
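
One plausible reading of the latch discussed in this thread, sketched with hypothetical names (`latch`, `visitDone`, `BeginVisit`): a closed channel means no visit is in progress, so a command can begin immediately. An open channel means a visit is running, so the caller releases the mutex, blocks on the channel, then re-acquires the mutex and re-checks. This is an assumption about the intent, not a copy of the code in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// latch is a hypothetical one-session gate; the field names are illustrative.
type latch struct {
	mu        sync.Mutex
	visitDone chan struct{} // closed whenever no visit is in progress
	busy      bool
}

func newLatch() *latch {
	done := make(chan struct{})
	close(done) // start with no visit in progress
	return &latch{visitDone: done}
}

// BeginVisit swaps in an open channel so that new commands block until the
// returned function is called. The caller is assumed to visit only when the
// session is not busy.
func (l *latch) BeginVisit() (end func()) {
	l.mu.Lock()
	defer l.mu.Unlock()
	ch := make(chan struct{})
	l.visitDone = ch
	return func() { close(ch) }
}

// CommandBegin waits for any in-progress visit, then marks the session busy.
func (l *latch) CommandBegin() {
	l.mu.Lock()
	for {
		ch := l.visitDone
		select {
		case <-ch:
			// Already closed: no visit in progress, proceed under the lock.
			l.busy = true
			l.mu.Unlock()
			return
		default:
		}
		// A visit is running. Release the lock so other callers are not
		// stalled behind this wait, then re-acquire it and re-check, because
		// a new visit may have started in the meantime.
		l.mu.Unlock()
		<-ch
		l.mu.Lock()
	}
}

// CommandEnd marks the session as quiesced again.
func (l *latch) CommandEnd() {
	l.mu.Lock()
	l.busy = false
	l.mu.Unlock()
}

func main() {
	l := newLatch()
	l.CommandBegin() // returns immediately: no visit in progress
	l.CommandEnd()

	end := l.BeginVisit()
	began := make(chan struct{})
	go func() {
		l.CommandBegin() // blocks until end() is called
		close(began)
	}()
	end()
	<-began
	fmt.Println("command began after the visit finished")
}
```

Holding the mutex across the channel receive would block every other caller that needs the lock for the duration of the visit, which is why the sketch drops the lock before waiting.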

max-hoffman

LGTM, just some naming comments for session controller. Same comments about testing from before, I'm running into the same issues in stats where you have to dig a bit to find weird concurrency bugs.

max-hoffman · 2025-01-30T18:25:18Z

go/libraries/doltcore/sqle/binlogreplication/binlog_replica_controller.go

-func (d *doltBinlogReplicaController) AutoStart(_ context.Context) error {
-	runningState, err := loadReplicationRunningState(d.ctx)
+func (d *doltBinlogReplicaController) AutoStart(ctx *sql.Context) error {
+	sql.SessionCommandBegin(ctx.Session)


I noted this on the other PR, I don't think we are generally very disciplined about sql session lifecycle management. Maybe it only matters in a few key places for GC

Agreed, but we will have to get better where it matters...

max-hoffman · 2025-01-30T18:55:26Z

go/libraries/doltcore/sqle/enginetest/dolt_harness.go

@@ -484,9 +485,11 @@ func (d *DoltHarness) NewReadOnlyEngine(provider sql.DatabaseProvider) (enginete
 	if err != nil {
 		return nil, err
 	}
+	gcSafepointController := dsess.NewGCSafepointController()
+	readOnlyProvider.RegisterProcedure(dprocedures.NewDoltGCProcedure(gcSafepointController))


does this need to be structured different from other procs? the controller is accessible from the session, and it seems like the controller holds all of the state

max-hoffman · 2025-01-30T20:08:14Z

go/libraries/doltcore/sqle/dsess/gc_safepoint_controller.go

+// 3) It sets |OutstandingCommand| for the Session to true. Only
+//    one command can be outstanding at a time, and whether a command


wording maybe confused me a bit, you mean that each session can only have one controller callback at a time?

also seems like you use OutstandingVisitCall and OutstandingCommand interchangeably?

the optionality and use of this in waiter seems like it's basically "do callback", which is always going to be a session finalize outside of unit tests. I guess it might just be possible to simplify "SessionCommand"'s naming, since there are so many session-prefixed names moving around

Hmm, I could definitely find better names.

OutstandingVisitCall and OutstandingCommand are different concepts here. Basically, OutstandingCommand just means "the application layer is currently touching this session." OutstandingCommand is used to control when we visit the session if we make a Waiter – we can visit sessions without commands immediately, but for sessions currently doing work, we need to wait until their commandend call comes in.

OutstandingVisitCall means we made a waiter and we are currently calling the callback on this session. In that case, the rest of the application layer should not be touching the session, and SessionCommandBegin will block.

coffeegoddd · 2025-01-30T20:39:00Z

@reltuk DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`7c1cff7`	ok	5937457

version	total_tests
`7c1cff7`	5937457

correctness_percentage
100.0

coffeegoddd · 2025-01-30T20:48:13Z

@coffeegoddd DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`f79db4a`	ok	5937457

version	total_tests
`f79db4a`	5937457

correctness_percentage
100.0

coffeegoddd · 2025-01-30T22:19:58Z

@reltuk DOLT

comparing_percentages
100.000000 to 100.000000

version	result	total
`b9f6a56`	ok	5937457

version	total_tests
`b9f6a56`	5937457

correctness_percentage
100.0

reltuk added 5 commits January 28, 2025 15:12

go: sqle/dprocedures: dolt_gc: Move to an instantiated instance.

366e466

Allows dolt_gc implementation to carry state, such as a session manager. This prepares for it to implement more robust GC safepoints.

go: sqle: dsess: Make DoltSession Lifecycle aware. Move towards a GCS…

915e392

…afepointController which can work with it.

go: sqle/dprocedures: dolt_gc: Implement a session aware safepoint co…

ef954d0

…ntroller.

dolt_gc,dsess: Add VisitGCRoots to dsess.Session and call it from dol…

bdc8ff1

…t_gc safepoint controller.

reltuk requested review from max-hoffman and zachmu January 29, 2025 00:03

coffeegoddd added the correctness_approved label Jan 29, 2025

go: sqlserver,binlogreplication: Clean up session usage a little to h…

13c9ddf

…ave more principled lifecycle. Starting replication never uses the replication execution context.

go: dtables: help_table: Manually specify that dolt_gc exists for now.

819136d

zachmu approved these changes Jan 29, 2025

reltuk and others added 4 commits January 30, 2025 11:30

go: sqle: gc_safepoint_controller: PR feedback, comments to explain s…

153d46b

…ome things.

Merge remote-tracking branch 'origin/main' into aaron/dsess-lifecycle

5826451

integration-tests/go-sql-server-driver: Fix bug in concurrent_gc_test.

7c1cff7

[ga-format-pr] Run go/utils/repofmt/format_repo.sh and go/Godeps/upda…

f79db4a

…te.sh

max-hoffman approved these changes Jan 30, 2025

go: sqle: dprocedures: dolt_gc: Get the safepoint controller from the…

b9f6a56

… session. Fixes special-case lifecycle for dolt_gc procedure.
