
AUTO TUNING of jobs time limit

Stefano Belforte edited this page Apr 20, 2017 · 41 revisions

Problem: most jobs have a very long time limit, because users either do not set anything (and so get the default of 24h) or set a conservative limit that lets all jobs in the task succeed. But gWms uses the time limit both to kill jobs (where being conservative is good) and to schedule them. Using a long time for scheduling makes it impossible to fit jobs into the tail of existing pilots, leading to pilot churn and/or underutilization of partially used multicore pilots.

Solution (proposed by Brian) in two steps:

  1. Introduce two ClassAd attributes (see https://github.com/dmwm/CRABServer/pull/5463 for implementation):

    • EstimatedWallTimeMins: Used for matchmaking of jobs within HTCondor. This is initially set to the wall time requested by the user.
    • MaxWallTimeMins: While the job is idle (waiting to be matched), this evaluates to the value of EstimatedWallTimeMins. Once the job is running, it is set to the user-requested limit (in CRAB, this defaults to 20 hours) and is used by the condor_schedd to kill jobs that have gone over the runtime limit.
  2. Introduce a mechanism (based on the existing work for WMAgent) to automatically tune EstimatedWallTimeMins based on the time it actually takes for jobs to run:

    • gwmsmon provides running time percentiles for a task.
    • A Python script calculates the new EstimatedWallTimeMins as follows:
      • If less than 20 jobs have finished - or the gwmsmon query results in errors - do nothing!
      • If at least 20 jobs have finished, take the 95th percentile of the runtime for completed jobs; set estimated run time as min(95th percentile, user-provided runtime).
    • This python script will provide a new configuration for the JobRouter running on the CRAB3 schedd. The route will update the ClassAds for idle jobs
      • JobRouter scales much better than a cronjob performing condor_qedit for CRAB3 jobs.
    • In order to preserve a single autocluster per task, all jobs in a CRAB3 task will get the same value of EstimatedWallTimeMins.
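The tuning rule in step 2 can be sketched as follows (a minimal sketch: the function and argument names are illustrative, not the actual script's API, and gwmsmon is assumed to report percentiles in hours):

```python
def tuned_estimate_mins(percentiles, user_limit_mins, n_finished):
    """Return the EstimatedWallTimeMins to advertise for a task.

    percentiles     -- runtime percentiles in hours as served by gwmsmon,
                       e.g. {"95.0": 19.8, ...}, or None if the query failed
    user_limit_mins -- the user-requested MaxWallTimeMins
    n_finished      -- number of completed jobs in the task
    """
    # Fewer than 20 finished jobs, or a failed gwmsmon query: do nothing,
    # i.e. keep the user-provided limit.
    if percentiles is None or n_finished < 20:
        return user_limit_mins
    p95_mins = percentiles["95.0"] * 60          # gwmsmon reports hours
    # Never raise the estimate above what the user asked for.
    return min(int(round(p95_mins)), user_limit_mins)
```

Since all jobs in a task get the same value, this function runs once per task, and its result goes into the JobRouter route that edits the idle jobs' ClassAds.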

As of April 19, Justas has done the work in gwmsmon (see REFERENCES below).

Work to do is tracked in:

  • put here pointers to GitHub issues and/or JIRA issues as appropriate

QUESTIONS:

    • what happens to jobs killed because the pilot reaches its end of life before the payload does?

REFERENCES:

https://cms-gwmsmon.cern.ch/analysisview/json/historynew/percentileruntime720/smitra/170411_132805:smitra_crab_DYJets

which returns a JSON document:

    {"hits": {"hits": [], "total": 263, "max_score": 0.0},
     "_shards": {"successful": 31, "failed": 0, "total": 31},
     "took": 40,
     "aggregations": {"2": {"values": {
         "1.0": 3.5115222222222222, "5.0": 6.5436111111111126,
         "25.0": 11.444305555555555, "50.0": 13.365277777777777,
         "75.0": 16.773194444444446, "95.0": 19.811305555555556,
         "99.0": 20.513038888888889}}},
     "timed_out": false}

in a format that is hopefully fixed forever, so that the "values" dictionary can be extracted and, e.g., the 95.0 percentile picked (19.8 hours in this example).
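
Extracting the needed fields from that response is then straightforward; a minimal sketch, using a copy of the sample document above abridged to the fields actually used:

```python
import json

# Abridged copy of the gwmsmon response shown above.
response = '''
{"hits": {"total": 263},
 "aggregations": {"2": {"values": {
     "5.0": 6.5436111111111126, "50.0": 13.365277777777777,
     "95.0": 19.811305555555556, "99.0": 20.513038888888889}}}}
'''

doc = json.loads(response)
n_finished = doc["hits"]["total"]            # 263 completed jobs
values = doc["aggregations"]["2"]["values"]  # percentile -> hours
p95_hours = values["95.0"]                   # ~19.8 hours
```

Note that "hits.total" gives the number of completed jobs, which is exactly the count needed for the "at least 20 jobs finished" check.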