MySQL driver: on connect try setting wsrep_sync_wait=7, swallow error 1193 #665

Al2Klimov · 2023-11-15T13:00:09Z

In Galera clusters wsrep_sync_wait=7 lets statements catch up all pending sync between nodes first. This way new child rows await fresh parent ones from other nodes not to run into foreign key errors. MySQL single nodes will reject this with error 1193 "Unknown system variable" which is OK.

fixes #577

To do

Code structure OK? (@julianbrost)
GHA pass?
Before/after tests with actual Galera cluster and single node (@Al2Klimov)
Test in prod (customer)
Consider unit testing setGaleraOpts() (@Al2Klimov)

Al2Klimov · 2023-11-15T15:11:40Z

Single node test

https://github.com/Al2Klimov/twintowers with this patch:

--- docker-compose.yml
+++ docker-compose.yml
@@ -25,12 +25,12 @@ x-ido: &x-ido
     MARIADB_DATABASE: ido

 x-dbicinga: &x-dbicinga
-  image: mariadb:10
+  image: mysql:5.7
   environment:
-    MARIADB_RANDOM_ROOT_PASSWORD: '1'
-    MARIADB_USER: icingadb
-    MARIADB_PASSWORD: icingadb
-    MARIADB_DATABASE: icingadb
+    MYSQL_RANDOM_ROOT_PASSWORD: '1'
+    MYSQL_USER: icingadb
+    MYSQL_PASSWORD: icingadb
+    MYSQL_DATABASE: icingadb

 x-icingadb-env: &x-icingadb-env
   ICINGADB_REDIS_PORT: '6379'
@@ -186,7 +186,7 @@ services:
         condition: service_started
       dbicinga1:
         condition: service_started
-    image: icinga/icingadb
+    image: icinga/icingadb:test
     environment:
       <<: *x-icingadb-env
       ICINGADB_REDIS_HOST: redis1
@@ -279,7 +279,7 @@ services:
         condition: service_started
       dbicinga2:
         condition: service_started
-    image: icinga/icingadb
+    image: icinga/icingadb:test
     environment:
       <<: *x-icingadb-env
       ICINGADB_REDIS_HOST: redis2

The MySQL image throws 1193 on the command line with SET SESSION wsrep_sync_wait=4.
👍 But Icinga DB still works.

julianbrost · 2023-11-15T15:23:48Z

What effect does changing mariadb:10 to mysql:5.7? I'd expect both to behave identically. Is that what you wanted to test here, that old MySQL behaves like somewhat recent MariaDB?

Al2Klimov · 2023-11-15T15:59:36Z

The MySQL image throws 1193 on the command line with SET SESSION wsrep_sync_wait=4.

In contrast to the MariaDB Docker image it doesn't even ship cluster functionality and rejects the var. The perfect single node for testing IMAO.

Al2Klimov · 2023-11-15T16:17:31Z

Galera test

Cluster as per #577 (comment) verified with https://github.com/Al2Klimov/provoke-galera (that #577-ish errors can happen under laboratory conditions). In addition there's a Debian 12 load balancer:

root@aklimov-lb:~# iptables -t nat -A PREROUTING -d 10.27.2.59 -p tcp --dport 3306 -m statistic --mode random --probability 0.5 -j DNAT --to-destination 10.27.3.229:3306
root@aklimov-lb:~# iptables -t nat -A PREROUTING -d 10.27.2.59 -p tcp --dport 3306 -j DNAT --to-destination 10.27.1.199:3306
root@aklimov-lb:~# iptables -t nat -A POSTROUTING -o ens3 -j MASQUERADE
root@aklimov-lb:~# sysctl net.ipv4.ip_forward=1
root@aklimov-lb:~#

With this LB in https://github.com/Al2Klimov/twintowers

--- docker-compose.yml
+++ docker-compose.yml
@@ -190,7 +190,7 @@ services:
     environment:
       <<: *x-icingadb-env
       ICINGADB_REDIS_HOST: redis1
-      ICINGADB_DATABASE_HOST: dbicinga1
+      ICINGADB_DATABASE_HOST: 10.27.2.59

   web1:
     depends_on:
@@ -209,7 +209,7 @@ services:
       icingaweb.modules.icingadb.redis.redis1.host: redis1
       icingaweb.modules.monitoring.commandtransports.icinga2.host: master1
       icingaweb.resources.icingaweb_db.host: dbweb1
-      icingaweb.resources.icingadb.host: dbicinga1
+      icingaweb.resources.icingadb.host: 10.27.2.59
       icingaweb.resources.icinga_ido.host: ido1
     volumes:
       - ./volumes/icingaweb2/1:/data

and a bunch of history

--- icinga2.conf
+++ icinga2.conf
@@ -44,3 +44,17 @@ object IdoMysqlConnection "ido-mysql" {
     database = "ido"
     enable_ha = false
 }
+
+object Host "ok" {
+    check_command = "dummy"
+}
+
+for (i in range(10000)) {
+object Service i {
+    host_name = "ok"
+    check_command = "random"
+    max_check_attempts = 1
+    check_interval = 1ms
+    retry_interval = 1ms
+}
+}

bad things happen:

twintowers-icingadb1-1      | 2023-11-15T16:05:34.004Z	FATAL	icingadb	Error 1452 (23000): Cannot add or update a child row: a foreign key constraint fails ("icingadb"."history", CONSTRAINT "fk_history_state_history" FOREIGN KEY ("state_history_id") REFERENCES "state_history" ("id") ON DELETE CASCADE)
twintowers-icingadb1-1      | can't perform "INSERT INTO \"history\" (\"host_id\", \"event_type\", \"state_history_id\", \"event_time\", \"environment_id\", \"object_type\", \"endpoint_id\", \"service_id\", \"id\") VALUES (:host_id,:event_type,:state_history_id,:event_time,:environment_id,:object_type,:endpoint_id,:service_id,:id) ON DUPLICATE KEY UPDATE \"id\" = VALUES(\"id\")"
twintowers-icingadb1-1      | github.com/icinga/icingadb/internal.CantPerformQuery
twintowers-icingadb1-1      | 	/icingadb-src/internal/internal.go:30
twintowers-icingadb1-1      | github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.func2.1
twintowers-icingadb1-1      | 	/icingadb-src/pkg/icingadb/db.go:391
twintowers-icingadb1-1      | github.com/icinga/icingadb/pkg/retry.WithBackoff
twintowers-icingadb1-1      | 	/icingadb-src/pkg/retry/retry.go:49
twintowers-icingadb1-1      | github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.func2
twintowers-icingadb1-1      | 	/icingadb-src/pkg/icingadb/db.go:386
twintowers-icingadb1-1      | golang.org/x/sync/errgroup.(*Group).Go.func1
twintowers-icingadb1-1      | 	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75
twintowers-icingadb1-1      | runtime.goexit
twintowers-icingadb1-1      | 	/usr/local/go/src/runtime/asm_amd64.s:1650
twintowers-icingadb1-1      | can't retry
twintowers-icingadb1-1      | github.com/icinga/icingadb/pkg/retry.WithBackoff
twintowers-icingadb1-1      | 	/icingadb-src/pkg/retry/retry.go:68
twintowers-icingadb1-1      | github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.func2
twintowers-icingadb1-1      | 	/icingadb-src/pkg/icingadb/db.go:386
twintowers-icingadb1-1      | golang.org/x/sync/errgroup.(*Group).Go.func1
twintowers-icingadb1-1      | 	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75
twintowers-icingadb1-1      | runtime.goexit
twintowers-icingadb1-1      | 	/usr/local/go/src/runtime/asm_amd64.s:1650
twintowers-icingadb1-1 exited with code 1

But not with this PR! 👍

--- docker-compose.yml
+++ docker-compose.yml
@@ -186,7 +186,7 @@ services:
         condition: service_started
       dbicinga1:
         condition: service_started
-    image: icinga/icingadb
+    image: icinga/icingadb:test
     environment:
       <<: *x-icingadb-env
       ICINGADB_REDIS_HOST: redis1
@@ -279,7 +279,7 @@ services:
         condition: service_started
       dbicinga2:
         condition: service_started
-    image: icinga/icingadb
+    image: icinga/icingadb:test
     environment:
       <<: *x-icingadb-env
       ICINGADB_REDIS_HOST: redis2

julianbrost · 2023-11-15T16:33:51Z

pkg/driver/driver.go

-	sql.Register(MySQL, &Driver{ctxDriver: &mysql.MySQLDriver{}, Logger: logger})
+	sql.Register(MySQL, &Driver{ctxDriver: &mySQLDriver{}, Logger: logger})


Wait, we are already wrapping the driver. Why do you wrap it once more instead of just initializing the connection there?

Sure, we have a common Driver, but my stuff is only for MySQL.

That could be a simple callback attribute that's just set for MySQL:

&Driver{ctxDriver: &mysql.MySQLDriver{}, initConn: setGaleraOpts, Logger: logger}

Call it if it's set and you no longer need half of the code you're adding here.

julianbrost

In Galera clusters wsrep_sync_wait=4 ensures inserted rows to be synced over all nodes before reporting success to their inserter.

I don't think that's what this variable does, the documentation says:

Setting this variable ensures causality checks will take place before executing an operation of the type specified by the value, ensuring that the statement is executed on a fully synced node.

Which I read as this waits for everything to sync onto the local node before executing the query, not waiting for the result of the query to sync to all other nodes before reporting success.

The MySQL image throws 1193 on the command line with SET SESSION wsrep_sync_wait=4.

In contrast to the MariaDB Docker image it doesn't even ship cluster functionality and rejects the var. The perfect single node for testing IMAO.

So a non-Galera-cluster MariaDB understands the variable but treats it as a no-op?

FWIW: no testing from my side so far, but looks like it should allow doing what we want to do: testing the effect of wsrep_sync_wait=4.

julianbrost · 2023-11-16T13:39:13Z

pkg/driver/mysql.go

+// setGaleraOpts tries SET SESSION wsrep_sync_wait=4.
+// Error 1193 "Unknown system variable" is ignored to support MySQL single nodes.


This comment would be more helpful if it also said why.

Al2Klimov · 2023-11-16T14:15:22Z

So a non-Galera-cluster MariaDB understands the variable but treats it as a no-op?

Yes, our tests confirm this.

oxzi

This might not be the perfect place for those questions, but I failed finding them being addressed somewhere else. If it was already discussed, please point me there and excuse this comment.

With no intention to sound rude, but isn't setting wsrep_sync_wait more a hack than a solution? The core problem, a foreign key violation due to non-linear or non-synchronized INSERT, is not really mitigated by doing so.
How about putting the state_history and history UPSERT commands into one transaction? Of course, this would require some refactoring of history.Sync..
Why should mode 4 (INSERT and REPLACE) be used? What about UPDATE, which is, e.g., covered by mode 6 (UPDATE, DELETE, INSERT and REPLACE)? Furthermore, are READ operations now not ensured to be executed on a fully synced node?
Can we be sure that there are no other races when using a "master-master" replication or a cluster instead of a single node?
What about increasing wsrep_retry_autocommit, even if this also seems a bit desperate? Or might this only increase the chance of a deadlock?

Please excuse my rambling above. I hope it's still understandable.

Furthermore, I have added two more trivial code comments below.

pkg/driver/mysql.go

oxzi · 2024-02-15T10:30:09Z

pkg/driver/mysql.go

+		err = errors.Wrap(err, "can't execute "+galeraOpts)
+	}
+
+	if err != nil && errors.Is(err, errUnknownSysVar) {


Suggested change

if err != nil && errors.Is(err, errUnknownSysVar) {

if errors.Is(err, errUnknownSysVar) {

errors.Is performs the nil check for you.

julianbrost · 2024-02-15T15:32:36Z

This might not be the perfect place for those questions, but I failed finding them being addressed somewhere else. If it was already discussed, please point me there and excuse this comment.

There's the history of the referenced issue starting at #577 (comment) where wsrep_sync_wait was proposed for the first time.

With no intention to sound rude, but isn't setting wsrep_sync_wait more a hack than a solution? The core problem, a foreign key violation due to non-linear or non-synchronized INSERT, is not really mitigated by doing so.

Our code should synchronize them insofar that it will only issue the second insert after it has received the result for the insert the second row depends on. For a single server, this should be good enough, even if it happens on different connections. However, in Galera cluster, if both queries are issues on connections to different nodes, the result of the first query might not have propagated yet. wsrep_sync_wait is supposed to work around this by forcing the server to wait to be synchronized (probably comes at some cost, the documentation hints this and it also seems plausible, but at least I have no idea how big of an impact this is).

How about putting the state_history and history UPSERT commands into one transaction? Of course, this would require some refactoring of history.Sync..

Shouldn't require a transaction, simply using the same connection should suffice. But maybe using a transaction would provide the nicer API on the Go side, but then we should also see if this has a performance impact (I think we observed some non-obvious performance penalties from explicit transactions over auto-commit in the past, maybe @lippserd remembers some details?).

As you might deduce from #577, I'm not yet convinced that wsrep_sync_wait is the best way to address this problem, I'm still open to other approaches.

pkg/driver/mysql.go

… 1193 In Galera clusters wsrep_sync_wait=7 lets statements catch up all pending sync between nodes first. This way new child rows await fresh parent ones from other nodes not to run into foreign key errors. MySQL single nodes will reject this with error 1193 "Unknown system variable" which is OK.

oxzi

I have successfully tested the current state on top of my three MariaDB node Galera testing cluster. Therefore, I have also altered the database.options.wsrep_sync_wait value.

oxzi · 2024-03-11T10:00:27Z

pkg/icingadb/db.go

+
+	// WsrepSyncWait defines which kinds of SQL statements catch up all pending sync between nodes first, see:
+	// https://mariadb.com/kb/en/galera-cluster-system-variables/#wsrep_sync_wait
+	WsrepSyncWait int `yaml:"wsrep_sync_wait" default:"7"`


This should also be documented in a place that users can easily find. The current example config does not contain an options block. This could perhaps be added, as for many people a look at the default config is a starting point.

Textual documentation would also be helpful. Perhaps also referring to the HA area?

The current example config does not contain an options block.

That's exactly my reason for not documenting it (yet).

The reasoning for the existing options was that those those should not need to be changed by users. The default values should work fine everywhere, they were merely exposed to the config so that we might try to tweak them even in some production environment if problems would arise (but so far, those defaults seem to have worked out fine).

oxzi · 2024-03-11T10:02:40Z

pkg/driver/mysql.go

+		if s, ok := config.Params["wsrep_sync_wait"]; ok {
+			if i, err := strconv.ParseInt(s, 10, 64); err == nil {
+				wsrepSyncWait = i
+				delete(config.Params, "wsrep_sync_wait")


Storing wsrep_sync_wait inside the config.Params feels a bit like a cheat, especially as it will be deleted at this point. Maybe add at least a short comment explaining this?

pkg/driver/mysql.go

julianbrost · 2024-03-19T14:54:45Z

pkg/driver/mysql.go

+	if config, err := mysql.ParseDSN(name); err == nil {
+		if s, ok := config.Params["wsrep_sync_wait"]; ok {
+			if i, err := strconv.ParseInt(s, 10, 64); err == nil {
+				// MySQL single nodes don't know wsrep_sync_wait and fail with error 1193 "Unknown system variable".
+				// We have to SET it manually later and swallow error 1193 not to fail our connections.
+				wsrepSyncWait = i
+				delete(config.Params, "wsrep_sync_wait")
+				name = config.FormatDSN()
+			}
+		}


It should actually be possible to completely avoid the DSN for connection to a database by chaining these functions:

https://pkg.go.dev/github.com/go-sql-driver/mysql#NewConnector

https://pkg.go.dev/database/sql#OpenDB

https://pkg.go.dev/github.com/jmoiron/sqlx#NewDb

This should then allow simply passing the config option as a struct member instead.

(And maybe this could also allow us to avoid having to register the custom "icingadb-mysql" driver at all.)

Why just MySQL? Our Pg driver has a NewConnector() as well.

And what's the problem in general? Describing a connection to a database is the DSN's purpose.

Al2Klimov requested a review from julianbrost November 15, 2023 13:00

cla-bot bot added the cla/signed label Nov 15, 2023

Al2Klimov force-pushed the error-1452-foreign-key-577 branch from d70c3b6 to e99a39b Compare November 15, 2023 14:14

julianbrost reviewed Nov 15, 2023

View reviewed changes

Al2Klimov force-pushed the error-1452-foreign-key-577 branch from e99a39b to 87c520d Compare November 16, 2023 09:20

Al2Klimov requested a review from julianbrost November 16, 2023 09:21

Al2Klimov force-pushed the error-1452-foreign-key-577 branch 2 times, most recently from dae7eb3 to 9a47f6a Compare November 16, 2023 12:25

julianbrost reviewed Nov 16, 2023

View reviewed changes

Al2Klimov requested a review from julianbrost November 16, 2023 16:04

Al2Klimov marked this pull request as ready for review November 16, 2023 16:04

Al2Klimov force-pushed the error-1452-foreign-key-577 branch from 9a47f6a to da9f80c Compare December 6, 2023 15:07

oxzi reviewed Feb 15, 2024

View reviewed changes

Al2Klimov mentioned this pull request Feb 16, 2024

retry.Retryable(): re-try MySQL error 1047 for a "not yet prepared node" #679

Closed

This was referenced Feb 21, 2024

icingadb process crashes with Error 1452: Cannot add or update a child row: a foreign key constraint fails #577

Closed

History Sync: Use single, distinct DB connection #681

Closed

julianbrost added this to the 1.1.2 milestone Mar 8, 2024

Al2Klimov force-pushed the error-1452-foreign-key-577 branch from da9f80c to 71cdede Compare March 8, 2024 13:11

Al2Klimov changed the title ~~MySQL driver: on connect try setting wsrep_sync_wait=4, swallow error 1193~~ MySQL driver: on connect try setting wsrep_sync_wait=7, swallow error 1193 Mar 8, 2024

oxzi reviewed Mar 8, 2024

View reviewed changes

pkg/driver/mysql.go Outdated Show resolved Hide resolved

Al2Klimov force-pushed the error-1452-foreign-key-577 branch from 71cdede to 2f2c943 Compare March 8, 2024 17:02

oxzi reviewed Mar 11, 2024

View reviewed changes

Make value for SET SESSION wsrep_sync_wait= configurable

2f4a542

Al2Klimov force-pushed the error-1452-foreign-key-577 branch from 2f2c943 to 2f4a542 Compare March 11, 2024 10:20

yhabteab reviewed Mar 18, 2024

View reviewed changes

pkg/driver/mysql.go Show resolved Hide resolved

julianbrost reviewed Mar 19, 2024

View reviewed changes

yhabteab mentioned this pull request Mar 19, 2024

Try setting wsrep_sync_wait for mysql connections #711

Merged

julianbrost removed this from the 1.1.2 milestone Apr 5, 2024

julianbrost closed this in #711 Apr 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MySQL driver: on connect try setting wsrep_sync_wait=7, swallow error 1193 #665

MySQL driver: on connect try setting wsrep_sync_wait=7, swallow error 1193 #665

Al2Klimov commented Nov 15, 2023 •

edited

Loading

Al2Klimov commented Nov 15, 2023

julianbrost commented Nov 15, 2023

Al2Klimov commented Nov 15, 2023

Al2Klimov commented Nov 15, 2023

julianbrost Nov 15, 2023

Al2Klimov Nov 15, 2023

julianbrost Nov 16, 2023

Al2Klimov Nov 16, 2023

julianbrost left a comment

julianbrost Nov 16, 2023

Al2Klimov commented Nov 16, 2023

oxzi left a comment

oxzi Feb 15, 2024

julianbrost commented Feb 15, 2024

oxzi left a comment

oxzi Mar 11, 2024

Al2Klimov Mar 11, 2024

julianbrost Mar 13, 2024

oxzi Mar 11, 2024

julianbrost Mar 19, 2024

Al2Klimov Mar 25, 2024

		sql.Register(MySQL, &Driver{ctxDriver: &mysql.MySQLDriver{}, Logger: logger})
		sql.Register(MySQL, &Driver{ctxDriver: &mySQLDriver{}, Logger: logger})

		// setGaleraOpts tries SET SESSION wsrep_sync_wait=4.
		// Error 1193 "Unknown system variable" is ignored to support MySQL single nodes.

	if err != nil && errors.Is(err, errUnknownSysVar) {
	if errors.Is(err, errUnknownSysVar) {

MySQL driver: on connect try setting wsrep_sync_wait=7, swallow error 1193 #665

MySQL driver: on connect try setting wsrep_sync_wait=7, swallow error 1193 #665

Conversation

Al2Klimov commented Nov 15, 2023 • edited Loading

To do

Al2Klimov commented Nov 15, 2023

Single node test

julianbrost commented Nov 15, 2023

Al2Klimov commented Nov 15, 2023

Al2Klimov commented Nov 15, 2023

Galera test

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

julianbrost left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Al2Klimov commented Nov 16, 2023

oxzi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

julianbrost commented Feb 15, 2024

oxzi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Al2Klimov commented Nov 15, 2023 •

edited

Loading