AUTO TUNING of job time limits
Problem: most jobs have a very long time limit, because users either do not set one (and so get the default of 24h) or set a conservative limit large enough that all jobs in the task succeed. But glideinWMS uses the time limit both to kill jobs (where being conservative is good) and to schedule them. Using a long time limit for scheduling makes it impossible to fit jobs in the tail of existing pilots, leading to pilot churn and/or underutilization of partially used multicore pilots.
Solution (proposed by Brian) in two steps:
- Introduce two ClassAd attributes (see https://github.com/dmwm/CRABServer/pull/5463 for the implementation):
  - `EstimatedWallTimeMins`: used for matchmaking of jobs within HTCondor. This is initially set to the wall time requested by the user.
  - `MaxWallTimeMins`: if the job is idle (i.e. waiting to be matched), this evaluates to the value of `EstimatedWallTimeMins`. Otherwise it is set to the user-requested limit (in CRAB, this defaults to 20 hours) and is used by the `condor_schedd` for killing jobs that have gone over the runtime limit.
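A minimal sketch of the intended evaluation logic, written in Python for illustration (the real implementation is a ClassAd expression; see the PR above). The function name and arguments here are illustrative; `JobStatus == 1` is the standard HTCondor code for an idle job:

```python
# Sketch of how MaxWallTimeMins is meant to evaluate; the real
# implementation is a ClassAd expression (see the PR above).
IDLE = 1  # HTCondor JobStatus code for an idle (not yet matched) job

def max_wall_time_mins(job_status, estimated_wall_time_mins,
                       user_limit_mins=20 * 60):  # CRAB default: 20 hours
    """While idle, the tunable estimate drives matchmaking; once the job
    runs, the user-requested limit drives the schedd's kill decision."""
    if job_status == IDLE:
        return estimated_wall_time_mins
    return user_limit_mins
```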
- Introduce a mechanism (based on the existing work for WMAgent) to automatically tune `EstimatedWallTimeMins` based on the time it actually takes for jobs to run (a sketch is given after this list):
  - `gwmsmon` provides running time percentiles for a task.
  - A python script calculates the new `EstimatedWallTimeMins` as follows:
    - If fewer than 20 jobs have finished, or the `gwmsmon` query results in errors, do nothing!
    - If at least 20 jobs have finished, take the 95th percentile of the runtime of completed jobs and set the estimated run time to `min(95th percentile, user-provided runtime)`.
  - This python script will provide a new configuration for the JobRouter running on the CRAB3 schedd. The route will update the ClassAds for idle jobs.
    - JobRouter scales much better than a cronjob performing `condor_qedit` on CRAB3 jobs.
  - In order to preserve a single autocluster per task, all jobs in a CRAB3 task will get the same value of `EstimatedWallTimeMins`.
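A minimal sketch of such a script, assuming the gwmsmon URL and JSON layout given in reference [1] below. The function names, the reading of `hits.total` as the number of finished jobs, the use of `CRAB_ReqName` to select a task's jobs, and the exact route syntax are illustrative assumptions, not the final implementation:

```python
# Sketch of the tuning script: fetch runtime percentiles from gwmsmon,
# apply the 20-job / 95th-percentile rule, and emit one JobRouter route
# per task. URL and JSON layout are taken from reference [1] below.
import json
import urllib.request

MIN_FINISHED_JOBS = 20
GWMSMON_URL = ("http://cms-gwmsmon.cern.ch/analysisview/json/historynew/"
               "percentileruntime720/%s/%s")

def tuned_estimate_mins(user, task, user_limit_mins):
    """Return a new EstimatedWallTimeMins for the task, or None to do nothing."""
    try:
        data = json.load(urllib.request.urlopen(GWMSMON_URL % (user, task)))
        finished = data["hits"]["total"]  # assumed to count finished jobs
        p95_hours = data["aggregations"]["2"]["values"]["95.0"]  # in hours, see [1]
    except Exception:
        return None  # gwmsmon query failed: do nothing
    if finished < MIN_FINISHED_JOBS:
        return None  # fewer than 20 finished jobs: do nothing
    return min(int(p95_hours * 60), user_limit_mins)

def job_router_route(task, estimate_mins):
    """One route per task, so every idle job in the task gets the same
    EstimatedWallTimeMins and the task keeps a single autocluster."""
    return ('[ Name = "Tune_%s"; '
            'Requirements = (target.CRAB_ReqName =?= "%s" && target.JobStatus == 1); '
            'set_EstimatedWallTimeMins = %d; ]' % (task, task, estimate_mins))
```

Note that the route's `Requirements` only matches idle jobs, so jobs that are already running keep the limit they started with.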
As of April 19, Justas has done the work in gwmsmon [1].
Work to do is tracked in:
- (put pointers here to git issues and/or JIRA issues, as appropriate)
QUESTIONS:
- How do we deal with jobs which run into the time limit? Do we resubmit in the post-job with `limit *= 1.5` until we hit 48h (a sketch is given after this list)? See https://hypernews.cern.ch/HyperNews/CMS/get/crabDevelopment/2617/1/1.html
- What happens to jobs killed by the pilot reaching its end of life before the payload does?
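For concreteness, the `limit *= 1.5` scheme asked about above would progress from the 20-hour CRAB default as in this hypothetical sketch:

```python
# Hypothetical sketch of the "limit *= 1.5 until 48h" resubmission scheme.
limit_mins = 20 * 60          # CRAB default
while limit_mins < 48 * 60:   # 48h hard cap
    limit_mins = min(int(limit_mins * 1.5), 48 * 60)
    print(limit_mins / 60.0)  # prints 30.0, 45.0, 48.0 (hours)
```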
REFERENCES:
- original mail thread: https://github.com/dmwm/CRABServer/files/938902/AutoTuningMails.pdf
- [1] gwmsmon API to use:
  http://cms-gwmsmon.cern.ch/analysisview/json/historynew/percentileruntime720/<user>/<task>
  This works from the CERN LAN; outside CERN one needs to use `https` and SSO. Example:
  https://cms-gwmsmon.cern.ch/analysisview/json/historynew/percentileruntime720/smitra/170411_132805:smitra_crab_DYJets
  which returns a json file in a hopefully fixed-forever format, so that the "values" can be extracted and one would e.g. pick the "95.0" one (i.e. 19.8 hours); a parsing sketch is given after this list:

  ```json
  {"hits": {"hits": [], "total": 263, "max_score": 0.0},
   "_shards": {"successful": 31, "failed": 0, "total": 31},
   "took": 40, "timed_out": false,
   "aggregations": {"2": {"values": {"1.0": 3.5115222222222222,
                                     "5.0": 6.5436111111111126,
                                     "25.0": 11.444305555555555,
                                     "50.0": 13.365277777777777,
                                     "75.0": 16.773194444444446,
                                     "95.0": 19.811305555555556,
                                     "99.0": 20.513038888888889}}}}
  ```
- gwmsmon: https://github.com/juztas/gwmsmon
- how this is done in production: https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/go_condor.py#L168 Note: there, Unified pre-computes the final number; `go_condor.py` simply converts the Unified data to a JobRouter configuration. Here, we would be doing the calculation in the script based on the raw data from gwmsmon.
- job router used in production is here (fill link!)
- job router for CRAB schedd's is here (fill link!)
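A self-contained parsing sketch for the gwmsmon response format in [1] above; the example payload is abridged from the one shown there, and `pick_95th_percentile` is a hypothetical helper name:

```python
# Check that the "values" block of the gwmsmon response in [1] parses
# as expected; the example payload below is abridged from the one above.
import json

EXAMPLE = '''{"hits": {"hits": [], "total": 263, "max_score": 0.0},
 "aggregations": {"2": {"values": {"95.0": 19.811305555555556,
                                   "50.0": 13.365277777777777}}}}'''

def pick_95th_percentile(raw_json):
    """Hypothetical helper: return the 95th-percentile runtime in hours."""
    return json.loads(raw_json)["aggregations"]["2"]["values"]["95.0"]

print(pick_95th_percentile(EXAMPLE))  # 19.811..., i.e. the 19.8 hours above
```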