
Dagster SQLite EventLog failures #3868

Open
zaneselvans opened this issue Sep 26, 2024 · 7 comments
Labels
dagster: Issues related to our use of the Dagster orchestrator
sqlite: Issues related to interacting with sqlite databases

Comments

@zaneselvans
Member

As previously addressed in e.g. #2417, #2996, #3003, #3208, and #3211, SQLite can't handle multiple concurrent writes. Our SQLite IO Manager has worked around this for the PUDL DB, but we seem to be hitting a new limit of some kind with Dagster's event logging DB, which still uses SQLite locally by default.

@cmgosnell recently encountered the issue in attempting to debug some integration test failures through the Dagster UI, in which it failed basically 100% of the time. I have had the problem on and off with maybe a 50-75% failure rate on the full ETL.

For me, the failure always seems to happen right after the execution of the raw_eia860m__all_dfs asset.

Full stack trace here:

2024-09-25 21:43:12 -0600 - dagster - DEBUG - etl_full - 3162fc18-e15b-4284-81dd-b46c063d8d35 - 8873 - raw_eia860m__all_dfs - STEP_SUCCESS - Finished execution of step "raw_eia860m__all_dfs" in 2m35s.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 941, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: unable to open database file

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_grpc/impl.py", line 266, in start_run_in_subprocess
    _run_in_subprocess(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_grpc/impl.py", line 250, in _run_in_subprocess
    instance.report_engine_event(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/instance/__init__.py", line 2497, in report_engine_event
    self.report_dagster_event(dagster_event, run_id=run_id, log_level=log_level)
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/instance/__init__.py", line 2519, in report_dagster_event
    self.handle_new_event(event_record)
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/instance/__init__.py", line 2417, in handle_new_event
    self._event_storage.store_event(events[0])
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/storage/event_log/sqlite/sqlite_event_log.py", line 242, in store_event
    conn.execute(insert_event_statement)
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1418, in execute
    return meth(
           ^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/sql/elements.py", line 515, in _execute_on_connection
    return connection._execute_clauseelement(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1640, in _execute_clauseelement
    ret = self._execute_context(
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
    return self._exec_single_context(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1986, in _exec_single_context
    self._handle_dbapi_exception(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2355, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 941, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) unable to open database file
[SQL: INSERT INTO event_logs (run_id, event, dagster_event_type, timestamp, step_key, asset_key, partition) VALUES (?, ?, ?, ?, ?, ?, ?)]
[parameters: ('3162fc18-e15b-4284-81dd-b46c063d8d35', '{"__class__": "EventLogEntry", "dagster_event": {"__class__": "DagsterEvent", "event_specific_data": {"__class__": "EngineEventData", "error": null,  ... (350 characters truncated) ... "pipeline_name": "etl_full", "run_id": "3162fc18-e15b-4284-81dd-b46c063d8d35", "step_key": null, "timestamp": 1727322130.5879638, "user_message": ""}', 'ENGINE_EVENT', '2024-09-26 03:42:10.587964', None, None, None)]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

This "unable to open database file" error seems different from the "database is locked" error we were getting before from attempted concurrent writes.

So, is there a new workaround to squeeze some more life out of SQLite? Or what is the easiest way to use Postgres locally for development?

  • @jdangerx suggested OrbStack as a containerized Postgres option for local use that doesn't run ~10x slower on macOS/arm64.
  • Is there some way to make Dagster log fewer events? Can we turn off DEBUG level logging? I don't think we really look at it!
  • How does this intersect with our development environment setup, which is already challenging for new contributors? Is there an "easy" way to use dev containers and integrate this Postgres setup into that?
  • Is there a possibility of making our tests not use the in-process (single threaded) executor if we're messing around with containers anyway?
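On the log-volume question above: Dagster's instance config can restrict which Python log levels get captured as events. A sketch of what that might look like in dagster.yaml, assuming the `python_logs` key that recent Dagster versions support (check the docs for your version):

```yaml
# dagster.yaml -- only capture INFO and above from Python loggers, which
# should cut down the number of DEBUG events written to the event log DB.
python_logs:
  python_log_level: INFO
```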
@zaneselvans added the sqlite and dagster labels Sep 26, 2024
@zaneselvans
Member Author

I may just be a superstitious pigeon about this, but sometimes the failure seems to happen 100% of the time, and other times it's maybe a 50% chance, so if it fails and I run the ETL again, it might work. I don't know what influences the probability. Last night I tried ten runs in a row, and all of them failed within 2.5 minutes.

Some things I've explored in the past without success:

  • Does the probability vary with the Python environment? That is, is there some combination of packages not playing well together that makes the problem more likely? Sometimes updating the pudl-dev environment seems to fix things, but it's not clear whether that's real or me imagining things.
  • Are there too many files & sockets open at once? Using lsof I can see that Dagster often has thousands of "files" open at once, and my system as a whole might have 20,000 open when the ETL is running.
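One cheap check on the open-files theory is to compare the process's file-descriptor limit against what lsof reports. A small standard-library sketch (Unix only):

```python
import resource

# Check this process's open-file limit (RLIMIT_NOFILE). On macOS the soft
# limit often defaults to just 256, so a process like Dagster holding
# thousands of files open can plausibly exhaust it, at which point SQLite
# may report "unable to open database file" rather than a clearer error.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# The soft limit can be raised (up to the hard limit) with `ulimit -n`
# in the shell, or with resource.setrlimit() from within Python.
```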

@zaneselvans
Member Author

Looking at the logging databases themselves:

  • They exist, and Dagster was able to write data to them before it started failing.
  • Recent DBs from failed runs are about 4.0 MB in size.
  • The event_logs table is the only table containing any records, and it holds only Dagster logging messages.
  • In the failed logging DBs there are about 2000 event log records.

@zaneselvans
Member Author

I was able to work around the SQLite limitation by running postgres locally (using Postgres.app) and pointing Dagster at it with the same settings that we use in the nightly builds (after creating the specified database and user):

storage:
  postgres:
    postgres_db:
      username: dagster
      password: dagster_password
      hostname: 127.0.0.1
      db_name: dagster
      port: 5432
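The "creating the specified database and user" step can be done in psql with something like the following (names matching the config above; adjust to taste):

```sql
-- Run as a Postgres superuser, e.g. via `psql -U postgres`:
CREATE USER dagster WITH PASSWORD 'dagster_password';
CREATE DATABASE dagster OWNER dagster;
```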

I brought up the independent "Too many open files" problem in this Dagster issue and it is getting some attention now, so maybe it'll get fixed.

@KeeganMyers

I'm very new to Dagster, but would it be possible to wrap the SQLite calls in a mutex in the IOManager? That could resolve the too-many-files issue without the inevitable regressions involved in migrating the DB to Postgres.

@zaneselvans
Member Author

The SQLite DB that's having trouble here isn't the one we write our data to with an IOManager; it contains Dagster's event logging info. The SQLite IOManager we use has a bunch of retry logic for when it hits accidental concurrent writes, so it seems to be doing okay. Thankfully, it sounds like a fix to the too-many-files problem will land in Dagster 1.9.3 later this week -- it was a resource leak on their side, where files that got opened weren't always being closed.

@bendnorman
Member

Do we feel like we can close this now?

@zaneselvans
Member Author

The Too Many Open Files issue was fixed by dagster-io/dagster#25132, and I am currently able to run the DAG locally with Postgres for logging. If that's how we need to run Dagster locally now, we should probably update our documentation to reflect it; otherwise folks will run into the original DB issue. Are others able to run the whole DAG locally without using Postgres?
