Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MySQL driver: on connect try setting wsrep_sync_wait=7, swallow error 1193 #665

Closed
wants to merge 2 commits into from

Conversation

Al2Klimov
Copy link
Member

@Al2Klimov Al2Klimov commented Nov 15, 2023

In Galera clusters wsrep_sync_wait=7 lets statements catch up all pending sync between nodes first. This way new child rows await fresh parent ones from other nodes not to run into foreign key errors. MySQL single nodes will reject this with error 1193 "Unknown system variable" which is OK.

fixes #577

To do

  • Code structure OK? (@julianbrost)
  • GHA pass?
  • Before/after tests with actual Galera cluster and single node (@Al2Klimov)
  • Test in prod (customer)
  • Consider unit testing setGaleraOpts() (@Al2Klimov)

@Al2Klimov
Copy link
Member Author

Single node test

https://github.com/Al2Klimov/twintowers with this patch:

--- docker-compose.yml
+++ docker-compose.yml
@@ -25,12 +25,12 @@ x-ido: &x-ido
     MARIADB_DATABASE: ido

 x-dbicinga: &x-dbicinga
-  image: mariadb:10
+  image: mysql:5.7
   environment:
-    MARIADB_RANDOM_ROOT_PASSWORD: '1'
-    MARIADB_USER: icingadb
-    MARIADB_PASSWORD: icingadb
-    MARIADB_DATABASE: icingadb
+    MYSQL_RANDOM_ROOT_PASSWORD: '1'
+    MYSQL_USER: icingadb
+    MYSQL_PASSWORD: icingadb
+    MYSQL_DATABASE: icingadb

 x-icingadb-env: &x-icingadb-env
   ICINGADB_REDIS_PORT: '6379'
@@ -186,7 +186,7 @@ services:
         condition: service_started
       dbicinga1:
         condition: service_started
-    image: icinga/icingadb
+    image: icinga/icingadb:test
     environment:
       <<: *x-icingadb-env
       ICINGADB_REDIS_HOST: redis1
@@ -279,7 +279,7 @@ services:
         condition: service_started
       dbicinga2:
         condition: service_started
-    image: icinga/icingadb
+    image: icinga/icingadb:test
     environment:
       <<: *x-icingadb-env
       ICINGADB_REDIS_HOST: redis2

The MySQL image throws 1193 on the command line with SET SESSION wsrep_sync_wait=4.
👍 But Icinga DB still works.

@julianbrost
Copy link
Contributor

What effect does changing mariadb:10 to mysql:5.7? I'd expect both to behave identically. Is that what you wanted to test here, that old MySQL behaves like somewhat recent MariaDB?

@Al2Klimov
Copy link
Member Author

The MySQL image throws 1193 on the command line with SET SESSION wsrep_sync_wait=4.

In contrast to the MariaDB Docker image it doesn't even ship cluster functionality and rejects the var. The perfect single node for testing IMAO.

@Al2Klimov
Copy link
Member Author

Galera test

Cluster as per #577 (comment) verified with https://github.com/Al2Klimov/provoke-galera (that #577-ish errors can happen under laboratory conditions). In addition there's a Debian 12 load balancer:

root@aklimov-lb:~# iptables -t nat -A PREROUTING -d 10.27.2.59 -p tcp --dport 3306 -m statistic --mode random --probability 0.5 -j DNAT --to-destination 10.27.3.229:3306
root@aklimov-lb:~# iptables -t nat -A PREROUTING -d 10.27.2.59 -p tcp --dport 3306 -j DNAT --to-destination 10.27.1.199:3306
root@aklimov-lb:~# iptables -t nat -A POSTROUTING -o ens3 -j MASQUERADE
root@aklimov-lb:~# sysctl net.ipv4.ip_forward=1
root@aklimov-lb:~#

With this LB in https://github.com/Al2Klimov/twintowers

--- docker-compose.yml
+++ docker-compose.yml
@@ -190,7 +190,7 @@ services:
     environment:
       <<: *x-icingadb-env
       ICINGADB_REDIS_HOST: redis1
-      ICINGADB_DATABASE_HOST: dbicinga1
+      ICINGADB_DATABASE_HOST: 10.27.2.59

   web1:
     depends_on:
@@ -209,7 +209,7 @@ services:
       icingaweb.modules.icingadb.redis.redis1.host: redis1
       icingaweb.modules.monitoring.commandtransports.icinga2.host: master1
       icingaweb.resources.icingaweb_db.host: dbweb1
-      icingaweb.resources.icingadb.host: dbicinga1
+      icingaweb.resources.icingadb.host: 10.27.2.59
       icingaweb.resources.icinga_ido.host: ido1
     volumes:
       - ./volumes/icingaweb2/1:/data

and a bunch of history

--- icinga2.conf
+++ icinga2.conf
@@ -44,3 +44,17 @@ object IdoMysqlConnection "ido-mysql" {
     database = "ido"
     enable_ha = false
 }
+
+object Host "ok" {
+    check_command = "dummy"
+}
+
+for (i in range(10000)) {
+object Service i {
+    host_name = "ok"
+    check_command = "random"
+    max_check_attempts = 1
+    check_interval = 1ms
+    retry_interval = 1ms
+}
+}

bad things happen:

twintowers-icingadb1-1      | 2023-11-15T16:05:34.004Z	FATAL	icingadb	Error 1452 (23000): Cannot add or update a child row: a foreign key constraint fails ("icingadb"."history", CONSTRAINT "fk_history_state_history" FOREIGN KEY ("state_history_id") REFERENCES "state_history" ("id") ON DELETE CASCADE)
twintowers-icingadb1-1      | can't perform "INSERT INTO \"history\" (\"host_id\", \"event_type\", \"state_history_id\", \"event_time\", \"environment_id\", \"object_type\", \"endpoint_id\", \"service_id\", \"id\") VALUES (:host_id,:event_type,:state_history_id,:event_time,:environment_id,:object_type,:endpoint_id,:service_id,:id) ON DUPLICATE KEY UPDATE \"id\" = VALUES(\"id\")"
twintowers-icingadb1-1      | github.com/icinga/icingadb/internal.CantPerformQuery
twintowers-icingadb1-1      | 	/icingadb-src/internal/internal.go:30
twintowers-icingadb1-1      | github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.func2.1
twintowers-icingadb1-1      | 	/icingadb-src/pkg/icingadb/db.go:391
twintowers-icingadb1-1      | github.com/icinga/icingadb/pkg/retry.WithBackoff
twintowers-icingadb1-1      | 	/icingadb-src/pkg/retry/retry.go:49
twintowers-icingadb1-1      | github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.func2
twintowers-icingadb1-1      | 	/icingadb-src/pkg/icingadb/db.go:386
twintowers-icingadb1-1      | golang.org/x/sync/errgroup.(*Group).Go.func1
twintowers-icingadb1-1      | 	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75
twintowers-icingadb1-1      | runtime.goexit
twintowers-icingadb1-1      | 	/usr/local/go/src/runtime/asm_amd64.s:1650
twintowers-icingadb1-1      | can't retry
twintowers-icingadb1-1      | github.com/icinga/icingadb/pkg/retry.WithBackoff
twintowers-icingadb1-1      | 	/icingadb-src/pkg/retry/retry.go:68
twintowers-icingadb1-1      | github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.(*DB).NamedBulkExec.func1.1.func2
twintowers-icingadb1-1      | 	/icingadb-src/pkg/icingadb/db.go:386
twintowers-icingadb1-1      | golang.org/x/sync/errgroup.(*Group).Go.func1
twintowers-icingadb1-1      | 	/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75
twintowers-icingadb1-1      | runtime.goexit
twintowers-icingadb1-1      | 	/usr/local/go/src/runtime/asm_amd64.s:1650
twintowers-icingadb1-1 exited with code 1

But not with this PR! 👍

--- docker-compose.yml
+++ docker-compose.yml
@@ -186,7 +186,7 @@ services:
         condition: service_started
       dbicinga1:
         condition: service_started
-    image: icinga/icingadb
+    image: icinga/icingadb:test
     environment:
       <<: *x-icingadb-env
       ICINGADB_REDIS_HOST: redis1
@@ -279,7 +279,7 @@ services:
         condition: service_started
       dbicinga2:
         condition: service_started
-    image: icinga/icingadb
+    image: icinga/icingadb:test
     environment:
       <<: *x-icingadb-env
       ICINGADB_REDIS_HOST: redis2

sql.Register(MySQL, &Driver{ctxDriver: &mysql.MySQLDriver{}, Logger: logger})
sql.Register(MySQL, &Driver{ctxDriver: &mySQLDriver{}, Logger: logger})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, we are already wrapping the driver. Why do you wrap it once more instead of just initializing the connection there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we have a common Driver, but my stuff is only for MySQL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could be a simple callback attribute that's just set for MySQL:

&Driver{ctxDriver: &mysql.MySQLDriver{}, initConn: setGaleraOpts, Logger: logger}

Call it if it's set and you no longer need half of the code you're adding here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@Al2Klimov Al2Klimov force-pushed the error-1452-foreign-key-577 branch 2 times, most recently from dae7eb3 to 9a47f6a Compare November 16, 2023 12:25
Copy link
Contributor

@julianbrost julianbrost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Galera clusters wsrep_sync_wait=4 ensures inserted rows to be synced over all nodes before reporting success to their inserter.

I don't think that's what this variable does, the documentation says:

Setting this variable ensures causality checks will take place before executing an operation of the type specified by the value, ensuring that the statement is executed on a fully synced node.

Which I read as this waits for everything to sync onto the local node before executing the query, not waiting for the result of the query to sync to all other nodes before reporting success.

The MySQL image throws 1193 on the command line with SET SESSION wsrep_sync_wait=4.

In contrast to the MariaDB Docker image it doesn't even ship cluster functionality and rejects the var. The perfect single node for testing IMAO.

So a non-Galera-cluster MariaDB understands the variable but treats it as a no-op?

FWIW: no testing from my side so far, but looks like it should allow doing what we want to do: testing the effect of wsrep_sync_wait=4.

Comment on lines 12 to 87
// setGaleraOpts tries SET SESSION wsrep_sync_wait=4.
// Error 1193 "Unknown system variable" is ignored to support MySQL single nodes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment would be more helpful if it also said why.

@Al2Klimov
Copy link
Member Author

So a non-Galera-cluster MariaDB understands the variable but treats it as a no-op?

Yes, our tests confirm this.

@Al2Klimov Al2Klimov marked this pull request as ready for review November 16, 2023 16:04
Copy link
Member

@oxzi oxzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not be the perfect place for those questions, but I failed finding them being addressed somewhere else. If it was already discussed, please point me there and excuse this comment.

  1. With no intention to sound rude, but isn't setting wsrep_sync_wait more a hack than a solution? The core problem, a foreign key violation due to non-linear or non-synchronized INSERT, is not really mitigated by doing so.
    How about putting the state_history and history UPSERT commands into one transaction? Of course, this would require some refactoring of history.Sync..
  2. Why should mode 4 (INSERT and REPLACE) be used? What about UPDATE, which is, e.g., covered by mode 6 (UPDATE, DELETE, INSERT and REPLACE)? Furthermore, are READ operations now not ensured to be executed on a fully synced node?
    Can we be sure that there are no other races when using a "master-master" replication or a cluster instead of a single node?
  3. What about increasing wsrep_retry_autocommit, even if this also seems a bit desperate? Or might this only increase the chance of a deadlock?

Please excuse my rambling above. I hope it's still understandable.

Furthermore, I have added two more trivial code comments below.

pkg/driver/mysql.go Show resolved Hide resolved
err = errors.Wrap(err, "can't execute "+galeraOpts)
}

if err != nil && errors.Is(err, errUnknownSysVar) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if err != nil && errors.Is(err, errUnknownSysVar) {
if errors.Is(err, errUnknownSysVar) {

errors.Is performs the nil check for you.

@julianbrost
Copy link
Contributor

This might not be the perfect place for those questions, but I failed finding them being addressed somewhere else. If it was already discussed, please point me there and excuse this comment.

There's the history of the referenced issue starting at #577 (comment) where wsrep_sync_wait was proposed for the first time.

  1. With no intention to sound rude, but isn't setting wsrep_sync_wait more a hack than a solution? The core problem, a foreign key violation due to non-linear or non-synchronized INSERT, is not really mitigated by doing so.

Our code should synchronize them insofar that it will only issue the second insert after it has received the result for the insert the second row depends on. For a single server, this should be good enough, even if it happens on different connections. However, in Galera cluster, if both queries are issues on connections to different nodes, the result of the first query might not have propagated yet. wsrep_sync_wait is supposed to work around this by forcing the server to wait to be synchronized (probably comes at some cost, the documentation hints this and it also seems plausible, but at least I have no idea how big of an impact this is).

How about putting the state_history and history UPSERT commands into one transaction? Of course, this would require some refactoring of history.Sync..

Shouldn't require a transaction, simply using the same connection should suffice. But maybe using a transaction would provide the nicer API on the Go side, but then we should also see if this has a performance impact (I think we observed some non-obvious performance penalties from explicit transactions over auto-commit in the past, maybe @lippserd remembers some details?).

As you might deduce from #577, I'm not yet convinced that wsrep_sync_wait is the best way to address this problem, I'm still open to other approaches.

@julianbrost julianbrost added this to the 1.1.2 milestone Mar 8, 2024
@Al2Klimov Al2Klimov changed the title MySQL driver: on connect try setting wsrep_sync_wait=4, swallow error 1193 MySQL driver: on connect try setting wsrep_sync_wait=7, swallow error 1193 Mar 8, 2024
pkg/driver/mysql.go Outdated Show resolved Hide resolved
… 1193

In Galera clusters wsrep_sync_wait=7 lets statements catch up all
pending sync between nodes first. This way new child rows await fresh parent
ones from other nodes not to run into foreign key errors. MySQL single nodes
will reject this with error 1193 "Unknown system variable" which is OK.
Copy link
Member

@oxzi oxzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have successfully tested the current state on top of my three MariaDB node Galera testing cluster. Therefore, I have also altered the database.options.wsrep_sync_wait value.


// WsrepSyncWait defines which kinds of SQL statements catch up all pending sync between nodes first, see:
// https://mariadb.com/kb/en/galera-cluster-system-variables/#wsrep_sync_wait
WsrepSyncWait int `yaml:"wsrep_sync_wait" default:"7"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be documented in a place that users can easily find. The current example config does not contain an options block. This could perhaps be added, as for many people a look at the default config is a starting point.

Textual documentation would also be helpful. Perhaps also referring to the HA area?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current example config does not contain an options block.

That's exactly my reason for not documenting it (yet).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reasoning for the existing options was that those those should not need to be changed by users. The default values should work fine everywhere, they were merely exposed to the config so that we might try to tweak them even in some production environment if problems would arise (but so far, those defaults seem to have worked out fine).

if s, ok := config.Params["wsrep_sync_wait"]; ok {
if i, err := strconv.ParseInt(s, 10, 64); err == nil {
wsrepSyncWait = i
delete(config.Params, "wsrep_sync_wait")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing wsrep_sync_wait inside the config.Params feels a bit like a cheat, especially as it will be deleted at this point. Maybe add at least a short comment explaining this?

pkg/driver/mysql.go Show resolved Hide resolved
Comment on lines +30 to +39
if config, err := mysql.ParseDSN(name); err == nil {
if s, ok := config.Params["wsrep_sync_wait"]; ok {
if i, err := strconv.ParseInt(s, 10, 64); err == nil {
// MySQL single nodes don't know wsrep_sync_wait and fail with error 1193 "Unknown system variable".
// We have to SET it manually later and swallow error 1193 not to fail our connections.
wsrepSyncWait = i
delete(config.Params, "wsrep_sync_wait")
name = config.FormatDSN()
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should actually be possible to completely avoid the DSN for connection to a database by chaining these functions:

This should then allow simply passing the config option as a struct member instead.

(And maybe this could also allow us to avoid having to register the custom "icingadb-mysql" driver at all.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why just MySQL? Our Pg driver has a NewConnector() as well.

And what's the problem in general? Describing a connection to a database is the DSN's purpose.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

icingadb process crashes with Error 1452: Cannot add or update a child row: a foreign key constraint fails
5 participants