Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi-threaded custom pipelines #1003

Merged

Conversation

ZeroErrors
Copy link
Contributor

@ZeroErrors ZeroErrors commented Jul 15, 2023

Changelog

  • Feature: Added the ability for ecs_run_pipeline to execute multi-threaded systems on the worker threads.
    • Added pipeline test cases: run_pipeline_multithreaded and run_pipeline_multithreaded_tasks.
  • Performance:
    • Improved the performance of workers by reducing the number of times the main thread needs to synchronize with the worker threads.
    • Worker threads are no longer restarted when using ecs_set_pipeline
  • Bug: Fixed a bug where if an empty pipeline was run it would cause ecs_progress to block forever (Report on Discord)

TODO

  • Resolve how system_time_total is calculated

Notes

  • I discovered during this work that the way system_time_total is measured with multi-threaded systems is not accurate as it only measures the work the main thread does, but this may not be accurate as a worker thread could have more or less work. It would be straightforward with the changes in this branch to instead measure the system time including thread synchronization. This would make system_time_total the wall time of the system.

Details

Multi-threaded ecs_run_pipeline

  • Modified ecs_run_pipeline to call flecs_workers_progress and to more closely mirror ecs_progress with support for task threads.
  • Removed ecs_worker_state_t and instead pass ecs_pipeline_state_t* via the ecs_stage_t allowing for it to be dynamically updated.
  • Moved the main body of code for executing pipeline operations from flecs_run_pipeline into a new function flecs_run_pipeline_ops.
  • Moved flecs_worker to now call flecs_run_pipeline_ops so it only operates on a specific pipeline operation, making the synchronization with the main thread clearer.
  • Modified flecs_run_pipeline to now only be the main-thread operations handling updating the pipeline, updating the pipeline operation, waking works, synchronizing workers and also calls flecs_run_pipeline_ops to execute the pipeline operation for the main thread.
  • Removed flecs_worker_begin, flecs_worker_end, flecs_worker_sync,flecs_is_multithreaded and flecs_is_main_thread as they were no longer needed anymore.
  • Exposed flecs_signal_workers and flecs_wait_for_sync from worker.c to pipeline.c in pipeline.h.
  • Remove calls to ecs_set_threads from ecs_set_pipeline now that the workers can operate on any pipeline dynamically so there is no need to restart them.

Some info on how this changes worker synchronization.
I ran both master and the branch for this PR with log level 3 enabled in a release build to compare how the workers operate.

Here in the following diff, you can see the functional difference.

diff --git "a/master.txt" "b/multithreaded-custom-piepline.txt"
index 5b1a059..22b34e4 100644
--- "a/master.txt"
+++ "b/multithreaded-custom-piepline.txt"
@@ -2665,42 +2665,40 @@ info: | | pipeline: waiting for worker sync
 info: | | pipeline: workers synced
 info: | merge
 info: | readonly
-info: | | pipeline: signal workers
 info: | | worker 0: ecs_run_intern - SysB
 info: | | worker 0: ecs_run_intern - SysC
-info: | | pipeline: waiting for worker sync
-info: | | pipeline: workers synced
 info: | merge
 info: | readonly
 info: | | pipeline: signal workers
 info: | | worker 0: ecs_run_intern - SysD
 info: | | worker 0: ecs_run_intern - SysE
+info: | | worker 1: run
 info: | | worker 1: ecs_run_intern - SysD
 info: | | worker 1: ecs_run_intern - SysE
+info: | | worker 2: run
 info: | | worker 2: ecs_run_intern - SysD
 info: | | worker 2: ecs_run_intern - SysE
+info: | | worker 3: run
 info: | | worker 3: ecs_run_intern - SysD
 info: | | worker 3: ecs_run_intern - SysE
+info: | | worker 4: run
 info: | | worker 4: ecs_run_intern - SysD
 info: | | worker 4: ecs_run_intern - SysE
+info: | | worker 5: run
 info: | | worker 5: ecs_run_intern - SysD
 info: | | worker 5: ecs_run_intern - SysE
+info: | | worker 6: run
 info: | | worker 6: ecs_run_intern - SysD
 info: | | worker 6: ecs_run_intern - SysE
+info: | | worker 7: run
 info: | | worker 7: ecs_run_intern - SysD
 info: | | worker 7: ecs_run_intern - SysE
 info: | | pipeline: waiting for worker sync
 info: | | pipeline: workers synced
 info: | merge
 info: | readonly
-info: | | pipeline: signal workers
 info: | | worker 0: ecs_run_intern - SysF
-info: | | pipeline: waiting for worker sync
-info: | | pipeline: workers synced
 info: | merge
-info: | pipeline: signal workers
-info: | pipeline: waiting for worker sync
-info: | pipeline: workers synced
 info: shutting down world
 info: | cleanup root entities
 info: | run fini actions

In this diff, you can see a few changes.

  1. When running SysB and SysC, and then also lower down for SysF the worker threads are no longer signalled since the operation is not marked as multi_threaded. This means we can avoid needing to synchronize on the worker threads when not needed.
  2. When running SysD and SysE there are new log messages for worker %d: run, this is because the workers now go back to their main loop after executing one operation, rather than synchronizing internally in flecs_run_pipeline.
  3. At the end of the pipeline execution after the final merge (info: | merge), we no longer wake the worker threads an additional time. This is possible since the workers are staying in their main loop when waiting for the next operation to execute. This effectively removes an unneeded sync point.

Notes about log collection:

  • Had to tweak flecs_log_msg to lock while writing to make the logging to be more stable
  • Tweaked the worker log message in ecs_run_intern so its clearer what it is when viewed next to the worker %d: run message:
-ecs_dbg_3("worker %d: %s", stage_index, path);
+ecs_dbg_3("worker %d: ecs_run_intern - %s", stage_index, path);
  • I manually sorted some log lines by their worker id and that only differs due to multithreading so that the diff is clearer.
    eg.
    info: worker 2: start
    
    info: | | worker 2: ecs_run_intern - SysA
    
    info: | | worker 1: finalizing
    info: | | worker 1: stop
    

I've attached the full logs in case someone wants to look at them.
master.txt
multithreaded-custom-piepline.txt


Deadlock when using ecs_run_pipeline with worker threads enabled (Report on Discord)
Fixed a bug in flecs_worker_begin where if an empty pipeline was run it would cause ecs_progress to block forever

  • The problem was caused by flecs_worker_begin signalling the workers to start but then flecs_run_pipeline would immediately return if there are no operations to perform, meaning the worker threads were never synchronized before exiting.
    if (!flecs_worker_begin(world, stage, pq, true)) {
    return;
    }

Removed unneeded worker signal and sync at end of pipeline
While working on this I noticed that when a pipeline is run across the worker threads (eg. ecs_progress) at the end of the pipeline when no more operations exist flecs_worker_sync would call flecs_worker_begin which would then signal the workers, only for the loop to exit and then flecs_worker_end to immediately be called.

@ZeroErrors ZeroErrors marked this pull request as draft July 15, 2023 00:47
@codylico
Copy link
Contributor

As for the void* pq;, you could try to use something like this as a forward declaration:

struct ecs_pipeline_state_t;
typedef struct ecs_pipeline_state_t ecs_pipeline_state_t;

@ZeroErrors ZeroErrors force-pushed the multithreaded-custom-pipelines branch 2 times, most recently from 075b8d9 to 1106b3e Compare July 15, 2023 18:57
@ZeroErrors ZeroErrors force-pushed the multithreaded-custom-pipelines branch from 1106b3e to 1a460ab Compare July 15, 2023 19:00
@ZeroErrors ZeroErrors marked this pull request as ready for review July 15, 2023 21:35
@ZeroErrors
Copy link
Contributor Author

@codylico Yeah the first thing I did was add typedef struct ecs_pipeline_state_t ecs_pipeline_state_t; to forward declare it, but the build complained about having duplicate typedef's since it builds with c99.

In file included from ./src/addons/pipeline/worker.c:10:
./src/addons/pipeline/pipeline.h:38:3: error: redefinition of typedef 'ecs_pipeline_state_t' is a C11 feature [-Werror,-Wtypedef-redefinition]
} ecs_pipeline_state_t;
  ^
./src/addons/pipeline/../system/../../private_types.h:484:37: note: previous definition is here
typedef struct ecs_pipeline_state_t ecs_pipeline_state_t;
                                    ^

I've now fixed that by just removing the typedef from the original definition of the struct, that seems to work fine but still unsure if this is the correct way to do this for the project or if the code should be organized in a different way.

@ZeroErrors
Copy link
Contributor Author

Not sure why this test failed. Seems unrelated to these changes, I tested it locally on macOS and didn't have any issues. The test also passed in the test-c-unix (ubuntu-latest) job.

It also exited wth signal 4, which is SIGILL which seems strange.

https://github.com/SanderMertens/flecs/actions/runs/5568489788/jobs/10171153134?pr=1003#step:6:126

FAIL: Observer.on_add_yield_existing_wildcard_multi exited with signal 4
To run/debug your test, do:
export $(bake env)
bin/x64-Darwin-debug/api Observer.on_add_yield_existing_wildcard_multi

PASS:111, FAIL:  1, EMPTY:  0 (api.Observer)

@SanderMertens
Copy link
Owner

Rerunning the failed tests, sometimes CI flukes and a process running a test is killed.

src/addons/pipeline/pipeline.c Outdated Show resolved Hide resolved
src/private_types.h Outdated Show resolved Hide resolved
@SanderMertens
Copy link
Owner

LGTM, I like it! Thanks for the PR :)

@SanderMertens SanderMertens merged commit b5241cc into SanderMertens:master Jul 18, 2023
@ZeroErrors ZeroErrors deleted the multithreaded-custom-pipelines branch July 18, 2023 23:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants