Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

timesketch-web container crashes and restarts initially due to postgres race-condition #3263

Open
ZauberNerd opened this issue Jan 16, 2025 · 0 comments
Labels

Comments

@ZauberNerd
Copy link

Describe the bug
When starting the docker compose stack for the first time the timesketch-web container tries to create the database tables.
During the initial startup multiple workers try to CREATE TABLE, which causes postgres to throw an error and the timesketch-web container crashes and restarts.

wsgi_error.log
[2025-01-16 13:51:57 +0000] [6] [INFO] Starting gunicorn 22.0.0
[2025-01-16 13:51:57 +0000] [6] [INFO] Listening at: http://0.0.0.0:5000 (6)
[2025-01-16 13:51:57 +0000] [6] [INFO] Using worker: sync
[2025-01-16 13:51:57 +0000] [7] [INFO] Booting worker with pid: 7
[2025-01-16 13:51:57 +0000] [7] [INFO] Starting gunicorn 22.0.0
[2025-01-16 13:51:57 +0000] [7] [INFO] Listening at: http://0.0.0.0:5000 (7)
[2025-01-16 13:51:57 +0000] [7] [INFO] Using worker: sync
[2025-01-16 13:51:57 +0000] [8] [INFO] Booting worker with pid: 8
[2025-01-16 13:52:02 +0000] [8] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/engine/base.py", line 1905, in _execute_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "pg_type_typname_nsp_index"
DETAIL:  Key (typname, typnamespace)=(user_id_seq, 2200) already exists.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gunicorn/arbiter.py", line 609, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python3.10/dist-packages/gunicorn/workers/base.py", line 134, in init_process
    self.load_wsgi()
  File "/usr/local/lib/python3.10/dist-packages/gunicorn/workers/base.py", line 146, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/usr/local/lib/python3.10/dist-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/local/lib/python3.10/dist-packages/gunicorn/app/wsgiapp.py", line 58, in load
    return self.load_wsgiapp()
  File "/usr/local/lib/python3.10/dist-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/usr/local/lib/python3.10/dist-packages/gunicorn/util.py", line 371, in import_app
    mod = importlib.import_module(module)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/timesketch/wsgi.py", line 46, in <module>
    application = create_app()
  File "/usr/local/lib/python3.10/dist-packages/timesketch/app.py", line 132, in create_app
    db = init_db()
  File "/usr/local/lib/python3.10/dist-packages/timesketch/models/__init__.py", line 55, in init_db
    BaseModel.metadata.create_all(bind=engine)
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/sql/schema.py", line 4930, in create_all
    bind._run_ddl_visitor(
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/future/engine.py", line 342, in _run_ddl_visitor
    conn._run_ddl_visitor(visitorcallable, element, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/engine/base.py", line 2216, in _run_ddl_visitor
    visitorcallable(self.dialect, self, **kwargs).traverse_single(element)
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/sql/visitors.py", line 524, in traverse_single
    return meth(obj, **kw)
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/sql/ddl.py", line 855, in visit_metadata
    self.traverse_single(
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/sql/visitors.py", line 524, in traverse_single
    return meth(obj, **kw)
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/sql/ddl.py", line 899, in visit_table
    self.connection.execute(
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/future/engine.py", line 280, in execute
    return self._execute_20(
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/engine/base.py", line 1710, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/sql/ddl.py", line 80, in _execute_on_connection
    return connection._execute_ddl(
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/engine/base.py", line 1477, in _execute_ddl
    ret = self._execute_context(
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/engine/base.py", line 1948, in _execute_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/engine/base.py", line 2129, in _handle_dbapi_exception
    util.raise_(
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/engine/base.py", line 1905, in _execute_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "pg_type_typname_nsp_index"
DETAIL:  Key (typname, typnamespace)=(user_id_seq, 2200) already exists.

[SQL: 
CREATE TABLE "user" (
	id SERIAL NOT NULL, 
	created_at TIMESTAMP WITHOUT TIME ZONE, 
	updated_at TIMESTAMP WITHOUT TIME ZONE, 
	username VARCHAR(255), 
	password VARCHAR(128), 
	name VARCHAR(255), 
	email VARCHAR(255), 
	active BOOLEAN, 
	admin BOOLEAN, 
	PRIMARY KEY (id), 
	UNIQUE (username)
)

]
(Background on this error at: https://sqlalche.me/e/14/gkpj)
[2025-01-16 13:52:02 +0000] [8] [INFO] Worker exiting (pid: 8)
[2025-01-16 13:52:02 +0000] [7] [ERROR] Worker (pid:8) exited with code 3
[2025-01-16 13:52:02 +0000] [7] [ERROR] Shutting down: Master
[2025-01-16 13:52:02 +0000] [7] [ERROR] Reason: Worker failed to boot.

multiple workers are started

NUM_WSGI_WORKERS="${NUM_WSGI_WORKERS:-4}"
gunicorn --bind 0.0.0.0:5000 --log-file /var/log/timesketch/wsgi.log \
--error-logfile /var/log/timesketch/wsgi_error.log --log-level info \
--capture-output --timeout 600 --limit-request-line 8190 \
--workers ${NUM_WSGI_WORKERS} timesketch.wsgi:application

an app is created

application = create_app()

which initializes the database

application = create_app()

which calls BaseModel.metadata.create_all

BaseModel.metadata.create_all(bind=engine)

Postgres errors, because concurrent CREATE TABLE statements aren't supported: https://www.postgresql.org/message-id/CA+TgmoZAdYVtwBfp1FL2sMZbiHCWT4UPrzRLNnX1Nb30Ku3-gg@mail.gmail.com

Eventually the table exists and timesketch-web continues to work as intended (with one or two restarts though).

The problem I have with this issue, is that I'm trying to use docker compose up -d --wait with health checks, which should wait until all containers are ready. But because the container crashes it will be marked unhealthy and this seems to be intended behavior with docker compose:

To Reproduce

diff docker-compose.yml
diff --git a/docker-compose.yml b/docker-compose.yml
index 0cad1346..b548c2b8 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -11,6 +11,19 @@ services:
       - ${TIMESKETCH_CONFIG_PATH}:/etc/timesketch/
       - ${TIMESKETCH_UPLOAD_PATH}:/usr/share/timesketch/upload/
       - ${TIMESKETCH_LOGS_PATH}:/var/log/timesketch/
+    depends_on:
+      postgres:
+        condition: service_healthy
+      redis:
+        condition: service_healthy
+      opensearch:
+        condition: service_healthy
+    healthcheck:
+      test: ["CMD", "wget", "--spider", "-q", "http://localhost:5000/"]
+      interval: 5s
+      timeout: 1s
+      retries: 5
+      start_period: 120s
 
   # Temporary service while the V2 UI is in the testing phase.
   # TODO: Remove when V2 is the default.
@@ -25,6 +38,13 @@ services:
       - ./etc/timesketch:/etc/timesketch/
       - ./upload:/usr/share/timesketch/upload/
       - ./logs:/var/log/timesketch/
+    depends_on:
+      postgres:
+        condition: service_healthy
+      redis:
+        condition: service_healthy
+      opensearch:
+        condition: service_healthy
 
   timesketch-worker:
     container_name: timesketch-worker
@@ -37,6 +57,13 @@ services:
       - ${TIMESKETCH_CONFIG_PATH}:/etc/timesketch/
       - ${TIMESKETCH_UPLOAD_PATH}:/usr/share/timesketch/upload/
       - ${TIMESKETCH_LOGS_PATH}:/var/log/timesketch/
+    depends_on:
+      postgres:
+        condition: service_healthy
+      redis:
+        condition: service_healthy
+      opensearch:
+        condition: service_healthy
 
   opensearch:
     container_name: opensearch
@@ -57,6 +84,12 @@ services:
         hard: 65536
     volumes:
       - ${OPENSEARCH_DATA_PATH}:/usr/share/opensearch/data/
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:9200/_cat/health?v"]
+      interval: 5s
+      timeout: 1s
+      retries: 5
+      start_period: 120s
 
   postgres:
     container_name: postgres
@@ -67,12 +100,24 @@ services:
     restart: always
     volumes:
       - ${POSTGRES_DATA_PATH}:/var/lib/postgresql/data
+    healthcheck:
+      test: ["CMD", "pg_isready", "-U", "timesketch"]
+      interval: 5s
+      timeout: 1s
+      retries: 5
+      start_period: 120s
 
   redis:
     container_name: redis
     image: redis:${REDIS_VERSION}
     command: "redis-server"
     restart: always
+    healthcheck:
+      test: ["CMD-SHELL", "redis-cli ping | grep PONG"]
+      interval: 5s
+      timeout: 1s
+      retries: 5
+      start_period: 120s
 
   nginx:
     container_name: nginx

Steps to reproduce the behavior:

  1. Run the deploy_timesketch.sh script
  2. Apply the above patch diff docker-compose.yml to include healthchecks
  3. Run docker compose up -d --wait
  4. Verify the container restarted by watching docker ps or docker inspect timesketch-web | jq '.[] | .RestartCount'
  5. Take a look at the wsgi_error.log

Expected behavior
The container should not crash.

Possible fixes
Call an additional python script that gets run from the docker-entrypoint.sh, before the gunicorn workers are started, to create the tables:

if [ "$1" = "timesketch-web" ]; then
# Get number of WSGI workers from environment, or set it to default value.
NUM_WSGI_WORKERS="${NUM_WSGI_WORKERS:-4}"

Or: use pg_advisory_lock as described by the Posgres maintainers.

Desktop (please complete the following information):

  • OS: Ubuntu 22.04
  • Browser chrome
  • Version 131.0.6778.264

Additional context
Related: #3014 (comment)

@ZauberNerd ZauberNerd added the Bug label Jan 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant