-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose stats #279
base: master
Are you sure you want to change the base?
Expose stats #279
Conversation
5fc01f6
to
b63a2e9
Compare
b63a2e9
to
54ce432
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would like to know some more details on how this will be used exactly before approving.
def __init__(self, stats: Stats) -> None: | ||
super().__init__() | ||
|
||
self.tiger = stats.tiger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this ideally just say self._stats_interval = stats.tiger.config["STATS_INTERVAL"]
rather than looking it up every time?
# For example, the worker's utilisation over the last 30 minutes | ||
# can be obtained by dividing the sum of task durations reported | ||
# over the last 30 minutes by 30 minutes. | ||
"STATS_CALLBACK": None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the point of the stats thread is to print out metrics periodically, I would have also expected this to be called periodically rather than at the end of the task, and report the same metrics that we log (time total, time busy, utilization). Otherwise this could be solved via something like CHILD_CONTEXT_MANAGERS
, although that runs in the child process. Should this be called TASK_END_STATS_CALLBACK
then?
the worker's utilisation over the last 30 minutes can be obtained by dividing the sum of task durations reported over the last 30 minutes by 30 minutes.
This is not actually correct. Example: If a task runs from 0m to 5m, and another task runs from 29m59s-34m59s, and the second task's stats callback looks at any reports in the past 30 minutes (i.e. 5m - 34m59s) , it would show 10m in task durations over 30m (~33% utilization), rather than the actual 5m1s (~17% utilization).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should definitely have both start and end callbacks to measure and report utilization accurately while the task is still running. With our configuration, we often collect metrics on sub-minute intervals, and a single task can run for more than a few minutes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the point of the stats thread is to print out metrics periodically, I would have also expected this to be called periodically rather than at the end of the task, and report the same metrics that we log (time total, time busy, utilization).
Yep, that sounds better than what I did.
This is not actually correct. Example: If a task runs from 0m to 5m, and another task runs from 29m59s-34m59s, and the second task's stats callback looks at any reports in the past 30 minutes (i.e. 5m - 34m59s) , it would show 10m in task durations over 30m (~33% utilization), rather than the actual 5m1s (~17% utilization).
Good point.
We should definitely have both start and end callbacks to measure and report utilization accurately while the task is still running. With our configuration, we often collect metrics on sub-minute intervals, and a single task can run for more than a few minutes.
Couldn't we achieve it with simply moving the callback to the log
method to be called at an interval?
I don't think we need to get notified of the start and stop of every task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get notified of the start and stop of every task.
Having that interface allows for reporting much more than what we do now. We could report utilization grouped per some other metric, for example task duration bucket, task name, or success/failure condition.
The interval of metric collection should be owned by the metric collection setup, not by tasktiger. Tasktiger does provide a simple log-only "monitoring" of periodic prints but in production we could use much more sophisticated stuff like opentelemetry. Reporting metrics on an interval is also not great if your metric collection is also pull-based periodic and the periods don't align well. For example, there was a task running between 1s and 16s. At 30s Tasktiger reports utilization of 50% (15s busy out of last 30s) and then does nothing for then next minute. At 59s (just before next log
call at 60s) monitoring pulls the last reported value of 50%. The monitoring shows utilization of 50% at 59s while in reality it was 0%. The metric is lagging and can be very confusing when debugging performance issues. In contrast, with start/end notifications we can compute utilization metric right at the collection time - or better yet, report the up-to-date (not lagging) totals of busy/idle times and let the monitoring compute utilization.
This allows hooking into the stats by defining a
STATS_CALLBACK
which will receive task runtimes.This is useful for measuring the utilisation of workers.