Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pause/resume/context to workers #101

Merged
merged 5 commits into from
Aug 18, 2022

Conversation

Northbadge
Copy link
Contributor

@Northbadge Northbadge commented Aug 11, 2022

  • Allows a user to start/stop runners' subprocesses at will, via OS signals SIGSTOP and SIGCONT, in cancellation_manager.
  • Allows a user to bind processes to specific CPUs, and modify their niceness.
  • Allows local_worker_pool to be used outside of a context manager (by adding a destructor)
  • Switch workers to be Protocol based, so Workers are effectively duck-typed (i.e. anything that has the required methods passes as a Worker)
  • Fixes a memory leak in unregister_process

Part of #96

@Northbadge Northbadge requested a review from mtrofin August 11, 2022 22:12
@Northbadge
Copy link
Contributor Author

@petrhosek - I added a new dependency, psutil here

@Northbadge Northbadge force-pushed the validation-peel-2 branch 3 times, most recently from 468adc3 to f505374 Compare August 11, 2022 23:32
@Northbadge
Copy link
Contributor Author

so turns out this just sends SIGSTOP to the python worker and not the clang subprocess as well. I could either:

  • have the clang subprocesses each be in their own pgrp, which makes .resume and .pause work on a per-worker basis
  • or, have LocalValidationDataCollector spawned clang subprocesses share a single pgrp
    But, either way, bash-sent signals would stop working since those get sent to the pgrp of the first process, and doing the above creates multiple pgrps

alternatively, could figure out a way to maintain a record of spawned clang processes PIDs, do the "cancel and requeue" version, or accept that some clang processes will run to completion when asked to stop. I'll probably get back to this next week.


from absl import logging
# pylint: disable=unused-import
from compiler_opt.distributed.worker import Worker

from contextlib import AbstractContextManager
from multiprocessing import connection
from typing import Any, Callable, Dict, Optional
from typing import Any, Callable, Dict, Optional, List
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: put list in alphabetical order

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(same further below)

ContextAwareWorker can check for this with isinstance(obj, ContextAwareWorker)
"""

def set_context(self, local: bool) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ContextAwareWorker is used nowhere, remove it for now.

@@ -102,13 +104,22 @@ class WorkerCancellationManager:
managing resources.
"""

def __init__(self):
@dataclasses.dataclass
class ProcData:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a thought: the motivating scenario here is the validator. For validation, we're actually OK to let compilation run longer than x seconds - because the goal is to get a thorough idea of what-if this model were shipped. So, how about:

  • no ProcData
  • just have the validator use a very large timeout, like 20 minutes (i.e. ~half of that in real compilation time)

- Allows a user to start/stop processes at will, via OS signals SIGSTOP and SIGCONT.
- Allows a user to bind processes to specific CPUs.
- Allows local_worker_pool to be used outside of a context manager
- Switch workers to be Protocol based, so Workers are effectively duck-typed (i.e. anything that has the required methods passes as a Worker)

Part of google#96
@mtrofin mtrofin merged commit 16eb08b into google:main Aug 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants