Long execution time for JobCreator #11335

germanfgv · 2022-10-11T08:25:57Z

Impact of the bug
T0 Agent and likely any WMAgent

Describe the bug
We noticed the JobCreator on our T0 production agent was taking increasingly long times to complete a loop, up to 4h20m handeling around 2k subscriptions. During that time we would receive more input files, increasing the amount of work to do in the next loop, Increasing the loop duration for as long as we keep taking data

After checking the logs and debugging, I noticed that most of this time was spent in the execution of the GetLocationsForJobs query. Each execution takes arouind 4.7seconds, which is a lot for a relatively simple query. The same query from sqlplus or sqlDeveloper takes around 30ms.

As the CMS duty cycle increases, this problem can render our agent unusable

How to reproduce it
Simply run an agent with with jobs to create

Expected behavior
The GetLocationsForJobs query should take just a fraction of a second.

Additional context and error message
Here you can see the delay from the polling of the jobSplitter to the creation of the jobGroup

2022-10-02 23:47:41,026:140111126890240:INFO:JobCreatorPoller:Completed iteration over subscription 89451
2022-10-02 23:47:41,099:140111126890240:INFO:JobCreatorPoller:Retrieved 1 jobGroups from jobSplitter
2022-10-02 23:47:45,731:140111126890240:INFO:JobCreatorPoller:JobCreator has finished creating jobGroup 166906.

2022-10-02 23:47:45,735:140111126890240:INFO:JobFactory:No additional files found; Ending.
2022-10-02 23:47:45,736:140111126890240:INFO:JobCreatorPoller:Retrieved 0 jobGroups from jobSplitter
2022-10-02 23:47:45,736:140111126890240:INFO:JobCreatorPoller:Found end in iteration over subscription 89453
2022-10-02 23:47:45,774:140111126890240:INFO:JobCreatorPoller:Retrieved 1 jobGroups from jobSplitter
2022-10-02 23:47:50,343:140111126890240:INFO:JobCreatorPoller:JobCreator has finished creating jobGroup 166907.

2022-10-02 23:47:50,346:140111126890240:INFO:JobCreatorPoller:Completed iteration over subscription 89454
2022-10-02 23:47:50,380:140111126890240:INFO:JobCreatorPoller:Retrieved 1 jobGroups from jobSplitter
2022-10-02 23:47:54,936:140111126890240:INFO:JobCreatorPoller:JobCreator has finished creating jobGroup 166908.

The text was updated successfully, but these errors were encountered:

germanfgv · 2022-10-11T12:29:46Z

Running the query from 'python3' using cProfile I got the following:

>>> action = daofactory(classname = "JobGroup.GetLocationsForJobs")
>>> cProfile.run('action.execute(id = 167196, transaction = False)')
         172 function calls in 11.111 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   11.111   11.111 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 DBCore.py:116(connection)
        1    0.000    0.000   11.111   11.111 DBCore.py:123(processData)
        1    0.000    0.000   11.110   11.110 DBCore.py:54(executebinds)
        1    0.000    0.000    0.000    0.000 GetLocationsForJobs.py:31(format)
        1    0.000    0.000   11.111   11.111 GetLocationsForJobs.py:45(execute)
        1    0.000    0.000    0.000    0.000 ResultSet.py:14(__init__)
        1    0.000    0.000    0.000    0.000 ResultSet.py:27(fetchall)
        1    0.000    0.000    0.000    0.000 ResultSet.py:30(add)
        2    0.000    0.000    0.000    0.000 WMObject.py:22(makelist)
        2    0.000    0.000    0.000    0.000 __init__.py:1677(isEnabledFor)
        1    0.000    0.000    0.000    0.000 __init__.py:710(listvalues)
        3    0.000    0.000    0.000    0.000 attr.py:267(__bool__)
        1    0.000    0.000   11.110   11.110 base.py:1138(_execute_text)
        1    0.000    0.000    0.000    0.000 base.py:116(_for_class)
        1    0.000    0.000   11.110   11.110 base.py:1163(_execute_context)
        1    0.000    0.000    0.000    0.000 base.py:119(_for_instance)
        1    0.000    0.000    0.000    0.000 base.py:1208(normalize_name)
        1    0.000    0.000    0.000    0.000 base.py:123(_join)
        1    0.000    0.000    0.000    0.000 base.py:1327(_safe_close_cursor)
        1    0.000    0.000    0.000    0.000 base.py:1674(__init__)
        1    0.000    0.000    0.000    0.000 base.py:1679(_parent)
        1    0.000    0.000    0.000    0.000 base.py:1711(commit)
        1    0.000    0.000    0.000    0.000 base.py:1737(__init__)
        1    0.000    0.000    0.000    0.000 base.py:1745(_do_commit)
        1    0.000    0.000    0.000    0.000 base.py:2179(connect)
        1    0.000    0.000    0.000    0.000 base.py:2259(_wrap_pool_connect)
        1    0.000    0.000    0.000    0.000 base.py:2271(raw_connection)
        1    0.000    0.000    0.000    0.000 base.py:261(__init__)
        1    0.000    0.000    0.000    0.000 base.py:292(unique_connection)
        1    0.000    0.000    0.000    0.000 base.py:295(__get__)
        6    0.000    0.000    0.000    0.000 base.py:370(connection)
        1    0.000    0.000    0.000    0.000 base.py:375(_return_conn)
        1    0.000    0.000    0.000    0.000 base.py:490(checkout)
        1    0.000    0.000    0.000    0.000 base.py:516(checkin)
        1    0.000    0.000    0.000    0.000 base.py:589(get_connection)
        1    0.000    0.000    0.000    0.000 base.py:596(begin)
        1    0.000    0.000    0.000    0.000 base.py:654(_finalize_fairy)
        1    0.000    0.000    0.000    0.000 base.py:69(__init__)
        1    0.000    0.000    0.000    0.000 base.py:713(_begin_impl)
        1    0.000    0.000    0.000    0.000 base.py:724(__init__)
        1    0.000    0.000    0.000    0.000 base.py:752(_commit_impl)
        1    0.000    0.000    0.000    0.000 base.py:757(_checkout)
        1    0.000    0.000    0.000    0.000 base.py:77(__init__)
        1    0.000    0.000    0.000    0.000 base.py:840(_checkin)
        1    0.000    0.000    0.000    0.000 base.py:854(_reset)
        1    0.000    0.000    0.000    0.000 base.py:869(close)
        1    0.000    0.000   11.110   11.110 base.py:922(execute)
        1    0.000    0.000    0.000    0.000 base.py:958(cursor)
        2    0.000    0.000    0.000    0.000 base.py:967(__getattr__)
        1    0.000    0.000    0.000    0.000 base.py:996(close)
        1    0.000    0.000    0.000    0.000 compiler.py:3487(_requires_quotes)
        1    0.000    0.000    0.000    0.000 cx_oracle.py:626(create_cursor)
        1    0.000    0.000    0.000    0.000 cx_oracle.py:633(get_result_proxy)
        1    0.000    0.000    0.000    0.000 cx_oracle.py:884(output_type_handler)
        1    0.000    0.000    0.000    0.000 default.py:1051(get_result_processor)
        1    0.000    0.000    0.001    0.001 default.py:1108(_setup_crud_result_proxy)
        1    0.000    0.000    0.000    0.000 default.py:498(do_begin)
        1    0.000    0.000    0.000    0.000 default.py:501(do_rollback)
        1    0.000    0.000    0.000    0.000 default.py:504(do_commit)
        1    0.000    0.000   11.109   11.109 default.py:551(do_execute)
        1    0.000    0.000    0.000    0.000 default.py:862(_init_statement)
        1    0.000    0.000    0.000    0.000 default.py:947(should_autocommit)
        1    0.000    0.000    0.000    0.000 default.py:999(should_autocommit_text)
        1    0.000    0.000    0.000    0.000 impl.py:103(_do_return_conn)
        1    0.000    0.000    0.000    0.000 impl.py:112(_do_get)
        1    0.000    0.000    0.000    0.000 langhelpers.py:852(__get__)
        1    0.000    0.000    0.000    0.000 log.py:56(_should_log_debug)
        1    0.000    0.000    0.000    0.000 log.py:59(_should_log_info)
        1    0.000    0.000    0.000    0.000 queue.py:135(get)
        1    0.000    0.000    0.000    0.000 queue.py:191(_empty)
        1    0.000    0.000    0.000    0.000 queue.py:195(_full)
        1    0.000    0.000    0.000    0.000 queue.py:199(_put)
        1    0.000    0.000    0.000    0.000 queue.py:203(_get)
        1    0.000    0.000    0.000    0.000 queue.py:92(put)
        2    0.000    0.000    0.000    0.000 result.py:1146(_fetchone_impl)
        1    0.000    0.000    0.000    0.000 result.py:1178(process_rows)
        1    0.000    0.000    0.000    0.000 result.py:1191(<listcomp>)
        2    0.000    0.000    0.000    0.000 result.py:1250(fetchone)
        1    0.000    0.000    0.000    0.000 result.py:182(keys)
        1    0.000    0.000    0.000    0.000 result.py:215(__init__)
        1    0.000    0.000    0.000    0.000 result.py:263(<listcomp>)
        1    0.000    0.000    0.000    0.000 result.py:266(<listcomp>)
        1    0.000    0.000    0.000    0.000 result.py:334(_merge_cursor_description)
        1    0.000    0.000    0.000    0.000 result.py:440(<listcomp>)
        2    0.000    0.000    0.000    0.000 result.py:461(_colnames_from_description)
        2    0.000    0.000    0.000    0.000 result.py:560(_merge_cols_by_none)
        1    0.000    0.000    0.000    0.000 result.py:714(__init__)
        1    0.000    0.000    0.000    0.000 result.py:740(_init_metadata)
        1    0.000    0.000    0.000    0.000 result.py:838(returns_rows)
        1    0.000    0.000    0.000    0.000 result.py:864(_cursor_description)
        2    0.000    0.000    0.000    0.000 result.py:869(_soft_close)
        1    0.000    0.000    0.000    0.000 result.py:901(close)
        2    0.000    0.000    0.000    0.000 result.py:942(__iter__)
        1    0.000    0.000    0.000    0.000 threading.py:1306(current_thread)
        2    0.000    0.000    0.000    0.000 threading.py:341(notify)
        1    0.000    0.000    0.000    0.000 type_api.py:483(_cached_result_processor)
        2    0.000    0.000    0.000    0.000 weakref.py:382(__getitem__)
        1    0.000    0.000    0.000    0.000 {built-in method __new__ of type object at 0x7f5c9a619960}
        1    0.000    0.000    0.000    0.000 {built-in method _thread.get_ident}
        1    0.000    0.000   11.111   11.111 {built-in method builtins.exec}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        7    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        9    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {built-in method sqlalchemy.cutils._distill_params}
        1    0.000    0.000    0.000    0.000 {built-in method time.time}
        2    0.000    0.000    0.000    0.000 {method '_is_owned' of '_thread.RLock' objects}
        2    0.000    0.000    0.000    0.000 {method 'acquire' of '_thread.RLock' objects}
        1    0.000    0.000    0.000    0.000 {method 'add' of 'set' objects}
        1    0.000    0.000    0.000    0.000 {method 'append' of 'collections.deque' objects}
        4    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'close' of 'cx_Oracle.Cursor' objects}
        1    0.000    0.000    0.000    0.000 {method 'commit' of 'cx_Oracle.Connection' objects}
        1    0.000    0.000    0.000    0.000 {method 'cursor' of 'cx_Oracle.Connection' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'discard' of 'set' objects}
        1   11.109   11.109   11.109   11.109 {method 'execute' of 'cx_Oracle.Cursor' objects}
        1    0.000    0.000    0.000    0.000 {method 'extend' of 'list' objects}
        2    0.000    0.000    0.000    0.000 {method 'fetchone' of 'cx_Oracle.Cursor' objects}
        1    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        3    0.000    0.000    0.000    0.000 {method 'lower' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {method 'match' of 're.Pattern' objects}
        1    0.000    0.000    0.000    0.000 {method 'popleft' of 'collections.deque' objects}
        2    0.000    0.000    0.000    0.000 {method 'release' of '_thread.RLock' objects}
        1    0.000    0.000    0.000    0.000 {method 'rollback' of 'cx_Oracle.Connection' objects}
        1    0.000    0.000    0.000    0.000 {method 'update' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'upper' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'values' of 'sqlalchemy.cresultproxy.BaseRowProxy' objects}
        1    0.000    0.000    0.000    0.000 {method 'var' of 'cx_Oracle.Cursor' objects}

todor-ivanov · 2022-10-11T13:03:45Z

Thanks @germanfgv this seems to include highly valuable information also for the more generic issue we are currently investigating: #11330

vkuznet · 2022-10-12T11:01:55Z

Germán, based on your profile I would say that it is issue with ORACLE since the
main call from your profile is this one:

1   11.109   11.109   11.109   11.109 {method 'execute' of 'cx_Oracle.Cursor' objects}

As such, the actual error comes from ORACLE execution. I suggest to dump the actual query and profile it with ORACLE DBA.

todor-ivanov · 2022-10-13T16:50:59Z

Hi @vkuznet,
As @germanfgv suggested, based on his previous observations, I did run one sql query from inside the agent in two modes:

Executing it through ExecuteDAO.py and using the properly defined DAO object from our stack:

bash-4.2$ python3 ExecuteDAO.py -c /data/tier0/srv/wmagent/current/config/tier0/config.py -p WMCore.WMBS -m Monitoring.JobTypeCountByState -- created 
2022-10-13 16:47:58,621:INFO:ExecuteDAO: Loading configFile: /data/tier0/srv/wmagent/current/config/tier0/config.py
2022-10-13 16:47:58,638:INFO:DBCore: Instantiating base WM DBInterface
2022-10-13 16:47:58,638:INFO:ExecuteDAO: DAO Factory initialised from package: WMCore.WMBS
2022-10-13 16:47:58,641:INFO:ExecuteDAO: DAO Module initialised as: Monitoring.JobTypeCountByState
2022-10-13 17:25:04,826:INFO:ExecuteDAO: DAO Results:
{'Cleanup': 5,
 'Express': 0,
 'Harvesting': 0,
 'LogCollect': 2,
 'Merge': 8,
 'Processing': 10134,
 'Production': 0,
 'Repack': 0,
 'Skim': 0}

-- looking at the log timestamps shows the actual time it took for this query is 38 min

Executing it as a pure SQL query directly from the agent's prompt, again using the same agent and same configuration but instead of the python code we provide the standard sqlplus was client sitting at the background:

$manage db-prompt
SQL*Plus: Release 19.0.0.0.0 - Production on Thu Oct 13 18:41:28 2022
Version 19.11.0.0.0

Copyright (c) 1982, 2020, Oracle.  All rights reserved.

Last Successful login time: Thu Oct 13 2022 18:41:24 +02:00

Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.13.0.0.0

SQL> SELECT wmbs_sub_types.name AS job_type, COUNT(*) AS count FROM wmbs_job   INNER JOIN wmbs_jobgroup ON wmbs_job.jobgroup = wmbs_jobgroup.id   INNER JOIN wmbs_subscription ON wmbs_jobgroup.subscription = wmbs_subscription.id    INNER JOIN wmbs_sub_types ON wmbs_subscription.subtype = wmbs_sub_types.id    WHERE wmbs_job.state = (SELECT id FROM wmbs_job_state WHERE name = 'created')   GROUP BY wmbs_subscription.subtype, wmbs_sub_types.name;

JOB_TYPE
--------------------------------------------------------------------------------
     COUNT
----------
Processing
      5966

Cleanup
	 4

Merge
	43

-- took less than a second.

I doubt this has anything to do with Oracle here.
I am more inclined to search the problem in the client we are using and the whole multilayered mechanisms of calls provided by sqlalchemy library that we use in the background. I'd also rule out any infrastructural issues, or database overload.

vkuznet · 2022-10-13T18:23:45Z

@todor-ivanov , I doubt that python cProfile lies, it shows exactly where the issue is. Said that, it does not mean that surrounding code, threads, conditions, etc. does not influence the ORACLE call. For instance, server may be stuck with lots of unclosed pool connections which caused latency of ORACLE call, or server was busy that context (thread) switching took lots of time which influence ORACLE latency call, etc. Of course, I have no doubt that pristine environment shows no sign of issues, but the same call within framework may. If you'll look closely to profile output you'll notice '_thread.RLock' which indicates that lock was placed on ORACLE query and this is your clue what caused a delay. Most likely another thread was busy with something. Since profile output doe snot show state of all threads you can only deduce from it that ORACLE call in this particular thread took a long time.

todor-ivanov · 2022-10-20T13:04:04Z

As we figured out with @germanfgv two days ago, as the agent started to get drained more and more .... the WMBS queries got back to some normal states:

bash-4.2$ time python3 ExecuteDAO.py -c /data/tier0/srv/wmagent/current/config/tier0/config.py -p WMCore.WMBS -m Monitoring.JobTypeCountByState -- created
2022-10-20 14:46:11,826:INFO:ExecuteDAO: Loading configFile: /data/tier0/srv/wmagent/current/config/tier0/config.py
2022-10-20 14:46:11,842:INFO:DBCore: Instantiating base WM DBInterface
2022-10-20 14:46:11,842:INFO:ExecuteDAO: DAO Factory initialised from package: WMCore.WMBS
2022-10-20 14:46:11,845:INFO:ExecuteDAO: DAO Module initialised as: Monitoring.JobTypeCountByState
2022-10-20 14:46:12,078:INFO:ExecuteDAO: DAO Results:
{'Cleanup': 0,
 'Express': 0,
 'Harvesting': 0,
 'LogCollect': 0,
 'Merge': 0,
 'Processing': 0,
 'Production': 0,
 'Repack': 0,
 'Skim': 0}

real	0m0.740s
user	0m0.443s
sys	0m0.106s

What we also noticed from the logs was that the huge delays for the component were also mixed with some pretty short/quick polling cycles taking just the expected several seconds rather than hours.

I'd say we put that issue in waiting until the problem reveals itself again.

germanfgv · 2022-10-20T13:23:52Z

Thanks @todor-ivanov. We will keep logs from the previous agent and keep an eye on the new one.

todor-ivanov · 2022-10-20T14:08:09Z

I am closing the issue now, since there is no problem to investigate at the moment. We will reopen it in case the problem shows up again.

FYI @amaltaro

germanfgv added the Tier0-systems label Oct 11, 2022

amaltaro added WMAgent BUG High Priority labels Oct 11, 2022

todor-ivanov self-assigned this Oct 17, 2022

todor-ivanov closed this as completed Oct 20, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Long execution time for JobCreator #11335

Long execution time for JobCreator #11335

germanfgv commented Oct 11, 2022

germanfgv commented Oct 11, 2022

todor-ivanov commented Oct 11, 2022

vkuznet commented Oct 12, 2022

todor-ivanov commented Oct 13, 2022 •

edited

Loading

vkuznet commented Oct 13, 2022

todor-ivanov commented Oct 20, 2022

germanfgv commented Oct 20, 2022

todor-ivanov commented Oct 20, 2022

Long execution time for JobCreator #11335

Long execution time for JobCreator #11335

Comments

germanfgv commented Oct 11, 2022

germanfgv commented Oct 11, 2022

todor-ivanov commented Oct 11, 2022

vkuznet commented Oct 12, 2022

todor-ivanov commented Oct 13, 2022 • edited Loading

vkuznet commented Oct 13, 2022

todor-ivanov commented Oct 20, 2022

germanfgv commented Oct 20, 2022

todor-ivanov commented Oct 20, 2022

todor-ivanov commented Oct 13, 2022 •

edited

Loading