Consistent estimation of task duration between stealing, adaptive and occupancy calculation #9000

hendrikmakait · 2025-02-03T15:05:28Z

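We've noticed that work stealing could ping-pong tasks between two workers when tasks with long execution durations but no recorded average duration were in processing. This PR fixes that by using one consistent task duration estimate for stealing, adaptive, and occupancy calculation.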

Tests added / passed
Passes pre-commit run --all-files

github-actions · 2025-02-03T15:58:37Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

27 files ±0 · 27 suites ±0 · 11h 33m 27s ⏱️ +2m 11s
4 117 tests +1 · 4 000 ✅ ±0 · 111 💤 ±0 · 6 ❌ +1
51 628 runs +13 · 49 324 ✅ +14 · 2 296 💤 ±0 · 8 ❌ -1

For more details on these failures, see this check.

Results for commit 2bd934e. ± Comparison against base commit 5589049.

This pull request removes 1 and adds 2 tests. Note that renamed tests count towards both.

Removed:
distributed.tests.test_scheduler ‑ test_get_task_duration

Added:
distributed.tests.test_scheduler ‑ test_get_prefix_duration
distributed.tests.test_steal ‑ test_do_not_ping_pong

♻️ This comment has been updated with latest results.

hendrikmakait · 2025-02-03T16:53:07Z

distributed/scheduler.py

@@ -2536,13 +2545,6 @@ def _transition_processing_memory(
                    action=startstop["action"],
                )

-        s = self.unknown_durations.pop(ts.prefix.name, set())


I've moved this into the stealing plugin.

hendrikmakait · 2025-02-03T16:53:47Z

distributed/scheduler.py

@@ -1931,22 +1925,37 @@ def total_occupancy(self) -> float:
            self._network_occ_global,
        )

+    def _get_prefix_duration(self, prefix: TaskPrefix) -> float:


This is the single source of truth for the duration estimation.
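As a rough illustration of what a single source of truth for duration estimates can look like, the sketch below routes every estimate through one helper, so an occupancy-style sum cannot disagree with other callers. It is not the implementation from this PR. The `TaskPrefix` stand-in, the negative-sentinel convention for a missing average, the 0.5 second default, and the free functions are assumptions made for the example. Only the names `_get_prefix_duration`, `duration_average`, and `UNKNOWN_TASK_DURATION` appear in the diff.

```python
from dataclasses import dataclass

# Assumed value for this example only; treat it as a placeholder.
UNKNOWN_TASK_DURATION = 0.5


@dataclass
class TaskPrefix:
    """Minimal, hypothetical stand-in for the scheduler's TaskPrefix."""

    name: str
    # Assumption: a negative average means "no duration measured yet".
    duration_average: float = -1.0


def get_prefix_duration(prefix: TaskPrefix) -> float:
    """Return the one estimate that every caller should use for this prefix."""
    if prefix.duration_average < 0:
        # No measurement yet: fall back to the shared default.
        return UNKNOWN_TASK_DURATION
    return prefix.duration_average


def queued_occupancy(prefixes: list[TaskPrefix]) -> float:
    """Sum estimates for queued tasks by delegating to the shared helper."""
    return sum(get_prefix_duration(prefix) for prefix in prefixes)
```

Because the fallback lives in one function, changing how unknown durations are estimated changes it for every consumer at once.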

hendrikmakait · 2025-02-03T16:54:33Z

distributed/scheduler.py

@@ -1674,9 +1674,6 @@ class SchedulerState:
    #: Subset of tasks that exist in memory on more than one worker
    replicated_tasks: set[TaskState]

-    #: Tasks with unknown duration, grouped by prefix
-    #: {task prefix: {ts, ts, ...}}
-    unknown_durations: dict[str, set[TaskState]]


This has been moved into stealing.
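One way to picture the move: the set of tasks whose prefix has no measured duration becomes state owned by the stealing extension rather than by `SchedulerState`. The sketch below is a hypothetical, stripped-down stand-in. The class `StealingBookkeeping` and its method names are invented for illustration, and tasks are represented by plain string keys instead of `TaskState` objects. It only shows the dict-of-sets layout from the removed attribute and the `pop(prefix, set())` pattern from the removed scheduler line.

```python
class StealingBookkeeping:
    """Hypothetical holder for per-prefix tracking of unknown durations."""

    def __init__(self) -> None:
        # Tasks with unknown duration, grouped by prefix:
        # {task prefix: {task key, task key, ...}}
        self.unknown_durations: dict[str, set[str]] = {}

    def track_unknown(self, prefix: str, key: str) -> None:
        """Remember a task whose prefix has no measured duration yet."""
        self.unknown_durations.setdefault(prefix, set()).add(key)

    def on_duration_known(self, prefix: str) -> set[str]:
        """Stop tracking a prefix and return the tasks that were waiting on it.

        A caller could use the returned keys to re-evaluate those tasks;
        what happens next is outside the scope of this sketch.
        """
        return self.unknown_durations.pop(prefix, set())
```

Popping with a default of `set()` keeps the call safe for prefixes that were never tracked.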

hendrikmakait · 2025-02-03T16:55:53Z

distributed/worker_state_machine.py

@@ -236,8 +236,6 @@ class TaskState:
    #: The next state of the task. It is not None iff :attr:`state` == resumed.
    next: Literal["fetch", "waiting", None] = None

-    #: Expected duration of the task
-    duration: float | None = None


This isn't used anywhere.

hendrikmakait · 2025-02-04T06:19:28Z

distributed/scheduler.py

-                queued_occupancy += self.UNKNOWN_TASK_DURATION
-            else:
-                queued_occupancy += ts.prefix.duration_average
+            queued_occupancy += self._get_prefix_duration(ts.prefix)


I don't have a test for this, but the old version was definitely inconsistent.

hendrikmakait added 3 commits February 3, 2025 15:59

Move task duration calculation to stealing · 14ac478

add test · 78256d3

move · 5b88656

hendrikmakait requested a review from fjetter as a code owner February 3, 2025 15:05

comments · 49b916d

hendrikmakait added 3 commits February 3, 2025 17:38

Refactor for maintainability · 54fdd21

Fix tests · 102dc19

Align adaptive target · fb8d8ec

hendrikmakait changed the title ~~Consistent estimation of task duration between stealing and occupancy calculation~~ Consistent estimation of task duration between stealing, adaptive and occupancy calculation Feb 3, 2025

Trigger CI · f1243a3

fix test · 2bd934e
