Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long execution time for JobCreator #11335

Closed
germanfgv opened this issue Oct 11, 2022 · 8 comments
Closed

Long execution time for JobCreator #11335

germanfgv opened this issue Oct 11, 2022 · 8 comments

Comments

@germanfgv
Copy link
Contributor

Impact of the bug
T0 Agent and likely any WMAgent

Describe the bug
We noticed the JobCreator on our T0 production agent was taking increasingly long times to complete a loop, up to 4h20m handeling around 2k subscriptions. During that time we would receive more input files, increasing the amount of work to do in the next loop, Increasing the loop duration for as long as we keep taking data

After checking the logs and debugging, I noticed that most of this time was spent in the execution of the GetLocationsForJobs query. Each execution takes arouind 4.7seconds, which is a lot for a relatively simple query. The same query from sqlplus or sqlDeveloper takes around 30ms.

As the CMS duty cycle increases, this problem can render our agent unusable

How to reproduce it
Simply run an agent with with jobs to create

Expected behavior
The GetLocationsForJobs query should take just a fraction of a second.

Additional context and error message
Here you can see the delay from the polling of the jobSplitter to the creation of the jobGroup

2022-10-02 23:47:41,026:140111126890240:INFO:JobCreatorPoller:Completed iteration over subscription 89451
2022-10-02 23:47:41,099:140111126890240:INFO:JobCreatorPoller:Retrieved 1 jobGroups from jobSplitter
2022-10-02 23:47:45,731:140111126890240:INFO:JobCreatorPoller:JobCreator has finished creating jobGroup 166906.

2022-10-02 23:47:45,735:140111126890240:INFO:JobFactory:No additional files found; Ending.
2022-10-02 23:47:45,736:140111126890240:INFO:JobCreatorPoller:Retrieved 0 jobGroups from jobSplitter
2022-10-02 23:47:45,736:140111126890240:INFO:JobCreatorPoller:Found end in iteration over subscription 89453
2022-10-02 23:47:45,774:140111126890240:INFO:JobCreatorPoller:Retrieved 1 jobGroups from jobSplitter
2022-10-02 23:47:50,343:140111126890240:INFO:JobCreatorPoller:JobCreator has finished creating jobGroup 166907.

2022-10-02 23:47:50,346:140111126890240:INFO:JobCreatorPoller:Completed iteration over subscription 89454
2022-10-02 23:47:50,380:140111126890240:INFO:JobCreatorPoller:Retrieved 1 jobGroups from jobSplitter
2022-10-02 23:47:54,936:140111126890240:INFO:JobCreatorPoller:JobCreator has finished creating jobGroup 166908.
@germanfgv
Copy link
Contributor Author

Running the query from 'python3' using cProfile I got the following:

>>> action = daofactory(classname = "JobGroup.GetLocationsForJobs")
>>> cProfile.run('action.execute(id = 167196, transaction = False)')
         172 function calls in 11.111 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   11.111   11.111 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 DBCore.py:116(connection)
        1    0.000    0.000   11.111   11.111 DBCore.py:123(processData)
        1    0.000    0.000   11.110   11.110 DBCore.py:54(executebinds)
        1    0.000    0.000    0.000    0.000 GetLocationsForJobs.py:31(format)
        1    0.000    0.000   11.111   11.111 GetLocationsForJobs.py:45(execute)
        1    0.000    0.000    0.000    0.000 ResultSet.py:14(__init__)
        1    0.000    0.000    0.000    0.000 ResultSet.py:27(fetchall)
        1    0.000    0.000    0.000    0.000 ResultSet.py:30(add)
        2    0.000    0.000    0.000    0.000 WMObject.py:22(makelist)
        2    0.000    0.000    0.000    0.000 __init__.py:1677(isEnabledFor)
        1    0.000    0.000    0.000    0.000 __init__.py:710(listvalues)
        3    0.000    0.000    0.000    0.000 attr.py:267(__bool__)
        1    0.000    0.000   11.110   11.110 base.py:1138(_execute_text)
        1    0.000    0.000    0.000    0.000 base.py:116(_for_class)
        1    0.000    0.000   11.110   11.110 base.py:1163(_execute_context)
        1    0.000    0.000    0.000    0.000 base.py:119(_for_instance)
        1    0.000    0.000    0.000    0.000 base.py:1208(normalize_name)
        1    0.000    0.000    0.000    0.000 base.py:123(_join)
        1    0.000    0.000    0.000    0.000 base.py:1327(_safe_close_cursor)
        1    0.000    0.000    0.000    0.000 base.py:1674(__init__)
        1    0.000    0.000    0.000    0.000 base.py:1679(_parent)
        1    0.000    0.000    0.000    0.000 base.py:1711(commit)
        1    0.000    0.000    0.000    0.000 base.py:1737(__init__)
        1    0.000    0.000    0.000    0.000 base.py:1745(_do_commit)
        1    0.000    0.000    0.000    0.000 base.py:2179(connect)
        1    0.000    0.000    0.000    0.000 base.py:2259(_wrap_pool_connect)
        1    0.000    0.000    0.000    0.000 base.py:2271(raw_connection)
        1    0.000    0.000    0.000    0.000 base.py:261(__init__)
        1    0.000    0.000    0.000    0.000 base.py:292(unique_connection)
        1    0.000    0.000    0.000    0.000 base.py:295(__get__)
        6    0.000    0.000    0.000    0.000 base.py:370(connection)
        1    0.000    0.000    0.000    0.000 base.py:375(_return_conn)
        1    0.000    0.000    0.000    0.000 base.py:490(checkout)
        1    0.000    0.000    0.000    0.000 base.py:516(checkin)
        1    0.000    0.000    0.000    0.000 base.py:589(get_connection)
        1    0.000    0.000    0.000    0.000 base.py:596(begin)
        1    0.000    0.000    0.000    0.000 base.py:654(_finalize_fairy)
        1    0.000    0.000    0.000    0.000 base.py:69(__init__)
        1    0.000    0.000    0.000    0.000 base.py:713(_begin_impl)
        1    0.000    0.000    0.000    0.000 base.py:724(__init__)
        1    0.000    0.000    0.000    0.000 base.py:752(_commit_impl)
        1    0.000    0.000    0.000    0.000 base.py:757(_checkout)
        1    0.000    0.000    0.000    0.000 base.py:77(__init__)
        1    0.000    0.000    0.000    0.000 base.py:840(_checkin)
        1    0.000    0.000    0.000    0.000 base.py:854(_reset)
        1    0.000    0.000    0.000    0.000 base.py:869(close)
        1    0.000    0.000   11.110   11.110 base.py:922(execute)
        1    0.000    0.000    0.000    0.000 base.py:958(cursor)
        2    0.000    0.000    0.000    0.000 base.py:967(__getattr__)
        1    0.000    0.000    0.000    0.000 base.py:996(close)
        1    0.000    0.000    0.000    0.000 compiler.py:3487(_requires_quotes)
        1    0.000    0.000    0.000    0.000 cx_oracle.py:626(create_cursor)
        1    0.000    0.000    0.000    0.000 cx_oracle.py:633(get_result_proxy)
        1    0.000    0.000    0.000    0.000 cx_oracle.py:884(output_type_handler)
        1    0.000    0.000    0.000    0.000 default.py:1051(get_result_processor)
        1    0.000    0.000    0.001    0.001 default.py:1108(_setup_crud_result_proxy)
        1    0.000    0.000    0.000    0.000 default.py:498(do_begin)
        1    0.000    0.000    0.000    0.000 default.py:501(do_rollback)
        1    0.000    0.000    0.000    0.000 default.py:504(do_commit)
        1    0.000    0.000   11.109   11.109 default.py:551(do_execute)
        1    0.000    0.000    0.000    0.000 default.py:862(_init_statement)
        1    0.000    0.000    0.000    0.000 default.py:947(should_autocommit)
        1    0.000    0.000    0.000    0.000 default.py:999(should_autocommit_text)
        1    0.000    0.000    0.000    0.000 impl.py:103(_do_return_conn)
        1    0.000    0.000    0.000    0.000 impl.py:112(_do_get)
        1    0.000    0.000    0.000    0.000 langhelpers.py:852(__get__)
        1    0.000    0.000    0.000    0.000 log.py:56(_should_log_debug)
        1    0.000    0.000    0.000    0.000 log.py:59(_should_log_info)
        1    0.000    0.000    0.000    0.000 queue.py:135(get)
        1    0.000    0.000    0.000    0.000 queue.py:191(_empty)
        1    0.000    0.000    0.000    0.000 queue.py:195(_full)
        1    0.000    0.000    0.000    0.000 queue.py:199(_put)
        1    0.000    0.000    0.000    0.000 queue.py:203(_get)
        1    0.000    0.000    0.000    0.000 queue.py:92(put)
        2    0.000    0.000    0.000    0.000 result.py:1146(_fetchone_impl)
        1    0.000    0.000    0.000    0.000 result.py:1178(process_rows)
        1    0.000    0.000    0.000    0.000 result.py:1191(<listcomp>)
        2    0.000    0.000    0.000    0.000 result.py:1250(fetchone)
        1    0.000    0.000    0.000    0.000 result.py:182(keys)
        1    0.000    0.000    0.000    0.000 result.py:215(__init__)
        1    0.000    0.000    0.000    0.000 result.py:263(<listcomp>)
        1    0.000    0.000    0.000    0.000 result.py:266(<listcomp>)
        1    0.000    0.000    0.000    0.000 result.py:334(_merge_cursor_description)
        1    0.000    0.000    0.000    0.000 result.py:440(<listcomp>)
        2    0.000    0.000    0.000    0.000 result.py:461(_colnames_from_description)
        2    0.000    0.000    0.000    0.000 result.py:560(_merge_cols_by_none)
        1    0.000    0.000    0.000    0.000 result.py:714(__init__)
        1    0.000    0.000    0.000    0.000 result.py:740(_init_metadata)
        1    0.000    0.000    0.000    0.000 result.py:838(returns_rows)
        1    0.000    0.000    0.000    0.000 result.py:864(_cursor_description)
        2    0.000    0.000    0.000    0.000 result.py:869(_soft_close)
        1    0.000    0.000    0.000    0.000 result.py:901(close)
        2    0.000    0.000    0.000    0.000 result.py:942(__iter__)
        1    0.000    0.000    0.000    0.000 threading.py:1306(current_thread)
        2    0.000    0.000    0.000    0.000 threading.py:341(notify)
        1    0.000    0.000    0.000    0.000 type_api.py:483(_cached_result_processor)
        2    0.000    0.000    0.000    0.000 weakref.py:382(__getitem__)
        1    0.000    0.000    0.000    0.000 {built-in method __new__ of type object at 0x7f5c9a619960}
        1    0.000    0.000    0.000    0.000 {built-in method _thread.get_ident}
        1    0.000    0.000   11.111   11.111 {built-in method builtins.exec}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        7    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        9    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {built-in method sqlalchemy.cutils._distill_params}
        1    0.000    0.000    0.000    0.000 {built-in method time.time}
        2    0.000    0.000    0.000    0.000 {method '_is_owned' of '_thread.RLock' objects}
        2    0.000    0.000    0.000    0.000 {method 'acquire' of '_thread.RLock' objects}
        1    0.000    0.000    0.000    0.000 {method 'add' of 'set' objects}
        1    0.000    0.000    0.000    0.000 {method 'append' of 'collections.deque' objects}
        4    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'close' of 'cx_Oracle.Cursor' objects}
        1    0.000    0.000    0.000    0.000 {method 'commit' of 'cx_Oracle.Connection' objects}
        1    0.000    0.000    0.000    0.000 {method 'cursor' of 'cx_Oracle.Connection' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'discard' of 'set' objects}
        1   11.109   11.109   11.109   11.109 {method 'execute' of 'cx_Oracle.Cursor' objects}
        1    0.000    0.000    0.000    0.000 {method 'extend' of 'list' objects}
        2    0.000    0.000    0.000    0.000 {method 'fetchone' of 'cx_Oracle.Cursor' objects}
        1    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        3    0.000    0.000    0.000    0.000 {method 'lower' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {method 'match' of 're.Pattern' objects}
        1    0.000    0.000    0.000    0.000 {method 'popleft' of 'collections.deque' objects}
        2    0.000    0.000    0.000    0.000 {method 'release' of '_thread.RLock' objects}
        1    0.000    0.000    0.000    0.000 {method 'rollback' of 'cx_Oracle.Connection' objects}
        1    0.000    0.000    0.000    0.000 {method 'update' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'upper' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'values' of 'sqlalchemy.cresultproxy.BaseRowProxy' objects}
        1    0.000    0.000    0.000    0.000 {method 'var' of 'cx_Oracle.Cursor' objects}

@todor-ivanov
Copy link
Contributor

Thanks @germanfgv this seems to include highly valuable information also for the more generic issue we are currently investigating: #11330

@vkuznet
Copy link
Contributor

vkuznet commented Oct 12, 2022

Germán, based on your profile I would say that it is issue with ORACLE since the
main call from your profile is this one:

1   11.109   11.109   11.109   11.109 {method 'execute' of 'cx_Oracle.Cursor' objects}

As such, the actual error comes from ORACLE execution. I suggest to dump the actual query and profile it with ORACLE DBA.

@todor-ivanov
Copy link
Contributor

todor-ivanov commented Oct 13, 2022

Hi @vkuznet,
As @germanfgv suggested, based on his previous observations, I did run one sql query from inside the agent in two modes:

  • Executing it through ExecuteDAO.py and using the properly defined DAO object from our stack:
bash-4.2$ python3 ExecuteDAO.py -c /data/tier0/srv/wmagent/current/config/tier0/config.py -p WMCore.WMBS -m Monitoring.JobTypeCountByState -- created 
2022-10-13 16:47:58,621:INFO:ExecuteDAO: Loading configFile: /data/tier0/srv/wmagent/current/config/tier0/config.py
2022-10-13 16:47:58,638:INFO:DBCore: Instantiating base WM DBInterface
2022-10-13 16:47:58,638:INFO:ExecuteDAO: DAO Factory initialised from package: WMCore.WMBS
2022-10-13 16:47:58,641:INFO:ExecuteDAO: DAO Module initialised as: Monitoring.JobTypeCountByState
2022-10-13 17:25:04,826:INFO:ExecuteDAO: DAO Results:
{'Cleanup': 5,
 'Express': 0,
 'Harvesting': 0,
 'LogCollect': 2,
 'Merge': 8,
 'Processing': 10134,
 'Production': 0,
 'Repack': 0,
 'Skim': 0}

-- looking at the log timestamps shows the actual time it took for this query is 38 min

  • Executing it as a pure SQL query directly from the agent's prompt, again using the same agent and same configuration but instead of the python code we provide the standard sqlplus was client sitting at the background:
$manage db-prompt
SQL*Plus: Release 19.0.0.0.0 - Production on Thu Oct 13 18:41:28 2022
Version 19.11.0.0.0

Copyright (c) 1982, 2020, Oracle.  All rights reserved.

Last Successful login time: Thu Oct 13 2022 18:41:24 +02:00

Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.13.0.0.0

SQL> SELECT wmbs_sub_types.name AS job_type, COUNT(*) AS count FROM wmbs_job   INNER JOIN wmbs_jobgroup ON wmbs_job.jobgroup = wmbs_jobgroup.id   INNER JOIN wmbs_subscription ON wmbs_jobgroup.subscription = wmbs_subscription.id    INNER JOIN wmbs_sub_types ON wmbs_subscription.subtype = wmbs_sub_types.id    WHERE wmbs_job.state = (SELECT id FROM wmbs_job_state WHERE name = 'created')   GROUP BY wmbs_subscription.subtype, wmbs_sub_types.name;

JOB_TYPE
--------------------------------------------------------------------------------
     COUNT
----------
Processing
      5966

Cleanup
	 4

Merge
	43

-- took less than a second.

I doubt this has anything to do with Oracle here.
I am more inclined to search the problem in the client we are using and the whole multilayered mechanisms of calls provided by sqlalchemy library that we use in the background. I'd also rule out any infrastructural issues, or database overload.

@vkuznet
Copy link
Contributor

vkuznet commented Oct 13, 2022

@todor-ivanov , I doubt that python cProfile lies, it shows exactly where the issue is. Said that, it does not mean that surrounding code, threads, conditions, etc. does not influence the ORACLE call. For instance, server may be stuck with lots of unclosed pool connections which caused latency of ORACLE call, or server was busy that context (thread) switching took lots of time which influence ORACLE latency call, etc. Of course, I have no doubt that pristine environment shows no sign of issues, but the same call within framework may. If you'll look closely to profile output you'll notice '_thread.RLock' which indicates that lock was placed on ORACLE query and this is your clue what caused a delay. Most likely another thread was busy with something. Since profile output doe snot show state of all threads you can only deduce from it that ORACLE call in this particular thread took a long time.

@todor-ivanov todor-ivanov self-assigned this Oct 17, 2022
@todor-ivanov
Copy link
Contributor

As we figured out with @germanfgv two days ago, as the agent started to get drained more and more .... the WMBS queries got back to some normal states:

bash-4.2$ time python3 ExecuteDAO.py -c /data/tier0/srv/wmagent/current/config/tier0/config.py -p WMCore.WMBS -m Monitoring.JobTypeCountByState -- created
2022-10-20 14:46:11,826:INFO:ExecuteDAO: Loading configFile: /data/tier0/srv/wmagent/current/config/tier0/config.py
2022-10-20 14:46:11,842:INFO:DBCore: Instantiating base WM DBInterface
2022-10-20 14:46:11,842:INFO:ExecuteDAO: DAO Factory initialised from package: WMCore.WMBS
2022-10-20 14:46:11,845:INFO:ExecuteDAO: DAO Module initialised as: Monitoring.JobTypeCountByState
2022-10-20 14:46:12,078:INFO:ExecuteDAO: DAO Results:
{'Cleanup': 0,
 'Express': 0,
 'Harvesting': 0,
 'LogCollect': 0,
 'Merge': 0,
 'Processing': 0,
 'Production': 0,
 'Repack': 0,
 'Skim': 0}

real	0m0.740s
user	0m0.443s
sys	0m0.106s

What we also noticed from the logs was that the huge delays for the component were also mixed with some pretty short/quick polling cycles taking just the expected several seconds rather than hours.

I'd say we put that issue in waiting until the problem reveals itself again.

@germanfgv
Copy link
Contributor Author

Thanks @todor-ivanov. We will keep logs from the previous agent and keep an eye on the new one.

@todor-ivanov
Copy link
Contributor

I am closing the issue now, since there is no problem to investigate at the moment. We will reopen it in case the problem shows up again.

FYI @amaltaro

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants