
AUTO TUNING of jobs time limit

Stefano Belforte edited this page Apr 20, 2017 · 41 revisions

Problem: most jobs have a very long time limit, because users either do not set anything (and so get the default of 24h) or set a conservative limit that lets all jobs in the task succeed. But gWms uses the time limit both to kill jobs (where being conservative is good) and to schedule them. Using a long time for scheduling makes it impossible to fit jobs into the tail of existing pilots, leading to pilot churn and/or underutilization of partially used multicore pilots.

Solution (proposed by Brian) in two steps:

  1. Introduce two ClassAd attributes (see https://github.com/dmwm/CRABServer/pull/5463 for implementation):

    • EstimatedWallTimeMins: Used for matchmaking of jobs within HTCondor. This is initially set to the wall time requested by the user.
    • MaxWallTimeMins: While the job is idle (waiting to be matched), this evaluates to the value of EstimatedWallTimeMins. Once the job is running, it is set to the user-requested limit (in CRAB, this defaults to 20 hours) and is used by the condor_schedd to kill jobs that have gone over the runtime limit.
  2. Introduce a mechanism (based on the existing work for WMAgent) to automatically tune EstimatedWallTimeMins based on the time it actually takes for jobs to run:

    • gwmsmon provides running time percentiles for a task.
    • A Python script calculates the new EstimatedWallTimeMins as follows:
      • If less than 20 jobs have finished - or the gwmsmon query results in errors - do nothing!
      • If at least 20 jobs have finished, take the 95th percentile of the runtime for completed jobs; set estimated run time as min(95th percentile, user-provided runtime).
    • This python script will provide a new configuration for the JobRouter running on the CRAB3 schedd. The route will update the ClassAds for idle jobs
      • JobRouter scales much better than a cronjob performing condor_qedit for CRAB3 jobs.
    • In order to preserve a single autocluster per task, all jobs in a CRAB3 task will get the same value of EstimatedWallTimeMins.
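The tuning rule in step 2 can be sketched as follows (a minimal sketch: the function and argument names are illustrative, not the actual script's API, and gwmsmon is assumed to report percentiles in hours):

```python
def tuned_estimate_mins(percentiles, user_limit_mins, n_finished):
    """Return the EstimatedWallTimeMins to advertise for a task.

    percentiles     -- runtime percentiles in hours as served by gwmsmon,
                       e.g. {"95.0": 19.8, ...}, or None if the query failed
    user_limit_mins -- the user-requested MaxWallTimeMins
    n_finished      -- number of completed jobs in the task
    """
    # Fewer than 20 finished jobs, or a failed gwmsmon query: do nothing,
    # i.e. keep the user-provided limit.
    if percentiles is None or n_finished < 20:
        return user_limit_mins
    p95_mins = percentiles["95.0"] * 60          # gwmsmon reports hours
    # Never raise the estimate above what the user asked for.
    return min(int(round(p95_mins)), user_limit_mins)
```

Since all jobs in a task get the same value, this function runs once per task, and its result goes into the JobRouter route that edits the idle jobs' ClassAds.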

As of April 19, Justas has done the work in gwmsmon (see REFERENCES below).

Work to do is tracked in:

  • put here pointers to GitHub issues and/or JIRA issues as appropriate

QUESTIONS:

    • what happens to jobs killed because the pilot reaches its end of life before the payload does?

REFERENCES:

https://cms-gwmsmon.cern.ch/analysisview/json/historynew/percentileruntime720/smitra/170411_132805:smitra_crab_DYJets

which returns a JSON document:

    {"hits": {"hits": [], "total": 263, "max_score": 0.0},
     "_shards": {"successful": 31, "failed": 0, "total": 31},
     "took": 40,
     "aggregations": {"2": {"values": {
         "1.0": 3.5115222222222222, "5.0": 6.5436111111111126,
         "25.0": 11.444305555555555, "50.0": 13.365277777777777,
         "75.0": 16.773194444444446, "95.0": 19.811305555555556,
         "99.0": 20.513038888888889}}},
     "timed_out": false}

in a format that is hopefully fixed forever, so that the "values" dictionary can be extracted and, e.g., the 95.0 percentile picked (19.8 hours in this example).
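
Extracting the needed fields from that response is then straightforward; a minimal sketch, using a copy of the sample document above abridged to the fields actually used:

```python
import json

# Abridged copy of the gwmsmon response shown above.
response = '''
{"hits": {"total": 263},
 "aggregations": {"2": {"values": {
     "5.0": 6.5436111111111126, "50.0": 13.365277777777777,
     "95.0": 19.811305555555556, "99.0": 20.513038888888889}}}}
'''

doc = json.loads(response)
n_finished = doc["hits"]["total"]            # 263 completed jobs
values = doc["aggregations"]["2"]["values"]  # percentile -> hours
p95_hours = values["95.0"]                   # ~19.8 hours
```

Note that "hits.total" gives the number of completed jobs, which is exactly the count needed for the "at least 20 jobs finished" check.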