WMCore services reported being slow #11330

Closed
todor-ivanov opened this issue Oct 10, 2022 · 24 comments · May be fixed by #11351
@todor-ivanov
Contributor

Impact of the bug
All WMCore services

Describe the bug
We were contacted by several groups during the last week or two, reporting slowness of the WMCore APIs they are using. We did check a few of them, and they seemed fine. So far, the information we have managed to collect is not enough to properly describe the issue. The only clue we have so far is that it seems to be related mostly to POST HTTP requests, rather than GET.

One place where we are 100% sure the APIs are slow is workflow creation/injection. It has been taking minutes, compared to seconds in previous months.

How to reproduce it
No clue so far

Expected behavior
No slow APIs are expected.

Additional context and error message
None

@todor-ivanov
Contributor Author

FYI: @amaltaro

@amaltaro
Contributor

amaltaro commented Oct 25, 2022

I am able to reproduce this slowness by simply creating a workflow in my private VM services.

The bulk of the delay - and processing as well - comes from the validateSchema call when constructing a workload:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L893

Zooming in, it's actually validating the PSet (ConfigCacheID) configuration here:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L944

and performing the following HTTP calls:

  1. GET https://cmsweb.cern.ch:8443/couchdb/_all_dbs (to verify that the db reqmgr_config_cache actually exists, took almost 2secs on the client side - couchdb reports 3ms)
  2. again the same call (!!!), but this time it took <1sec
  3. GET https://cmsweb.cern.ch:8443/couchdb/reqmgr_config_cache/6c075881a8454070ce3c1e8921cdb45e (took <1sec, but I've seen it taking a few secs - couchdb reports 2ms)
  4. now it tries to update the PSet user/group, but first it fires another call as in 1)
  5. and yet another execution of the call above
  6. now it has to load the configFile attachment, thus it runs a GET https://cmsweb.cern.ch:8443/couchdb/reqmgr_config_cache/6c075881a8454070ce3c1e8921cdb45e/configFile (took 3secs on the client side - couchdb reports 3ms)
  7. another GET https://alancc7-cloud1.cern.ch/couchdb/_all_dbs (to verify that the db reqmgr_workload_cache actually exists, took <1sec - couchdb reports 4ms)
  8. HEAD https://alancc7-cloud1.cern.ch/couchdb/reqmgr_workload_cache/amaltaro_DQMHarvest_RunWhitelist_Oct2022_Val_221025_210317_7603 (to check whether this workflow already exists in the system, took <1sec - couchdb reports 2ms)
  9. PUT https://alancc7-cloud1.cern.ch/couchdb/reqmgr_workload_cache/amaltaro_DQMHarvest_RunWhitelist_Oct2022_Val_221025_210317_7603 (finally, to persist the workflow information, took <1sec - couchdb reports 10ms)
  10. another PUT https://alancc7-cloud1.cern.ch/couchdb/reqmgr_workload_cache/amaltaro_DQMHarvest_RunWhitelist_Oct2022_Val_221025_210317_7603/spec?rev=1-3d6228fdc600c66f5c323147d21f67dd (I guess this one creates an attachment for the document above, with the workload object, took <1sec - couchdb reports 21ms)

and the client POST call to the endpoint /reqmgr2/data/request is completed, this time taking 7719986 us (about 7.7 seconds), but I've seen it take 10-14 secs.

BTW, I am not able to reproduce any of these delays with a vanilla curl command.
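For completeness, when timing one of these calls by hand with curl, a per-phase write-out can show where any delay would sit (DNS, TCP connect, TLS handshake, or time to first byte). A minimal sketch, reusing the configFile document from above:

curl -s -o /dev/null --cert $X509_USER_CERT --key $X509_USER_KEY \
  -w 'dns: %{time_namelookup}s connect: %{time_connect}s tls: %{time_appconnect}s ttfb: %{time_starttransfer}s total: %{time_total}s\n' \
  "https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/6c075881a8454070ce3c1e8921cdb45e/configFile"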

@amaltaro
Contributor

Just another not-so-useful update on this investigation.

I created an ad-hoc script called usePycurl.py, which basically imports RequestHandler from the pycurl_manager module and does an HTTP call just like the one done by ReqMgr2. Changes are in this branch:
master...amaltaro:WMCore:fix-11330

I then made some modifications to the pycurl_manager module, such that I can time exactly where in the pycurl module the time is spent. Here is a diff of my changes [1].

After running this test a few times, for this URI:
Uri: https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/6c075881a8454070ce3c1e8921cdb45e/configFile

one can see that the total elapsed time varies from 0.07 to 3.x seconds. CouchDB responds with the following header:

 'X-Couch-Request-ID': 'fc180d9ad7',

such that we can search for this request identifier in the CouchDB logs, which I also paste here:

[notice] 2022-10-26T20:57:46.875621Z [email protected] <0.18462.1070> d458bcf084 vocms0842.cern.ch:5984 188.185.89.194 undefined GET /reqmgr_config_cache/6c075881a8454070ce3c1e8921cdb45e/configFile 200 ok 2
[notice] 2022-10-26T20:57:53.533188Z [email protected] <0.23077.1070> fc180d9ad7 vocms0842.cern.ch:5984 xxx.xxx.xxx.xxx undefined GET /reqmgr_config_cache/6c075881a8454070ce3c1e8921cdb45e/configFile 200 ok 3

confirming that it takes only 2 or 3 milliseconds to be served by CouchDB itself.
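For anyone trying to correlate the client and server views: that request ID can be grepped straight out of the CouchDB log on the backend VM. The log path below is an assumption and depends on the deployment:

grep fc180d9ad7 /var/log/couchdb/couch.log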

With this information, the best I can assume is that something is delaying either:

  • the HTTP request to reach the backend; or
  • the HTTP response to come back to the client.

Comparing the HTTP response Date time with the cmsweb frontend logs, I can identify some inconsistency which is not very clear to me (CMS-Server-Time matches the elapsed time seen on the client side though). On the client side, this is what I see in the response headers:

# request 1
 'CMS-Server-Time': 'D=747111 t=1666817866128390',
 'Date': 'Wed, 26 Oct 2022 20:57:45 GMT',
 'X-Couch-Request-ID': 'd458bcf084',
# request 2
 'CMS-Server-Time': 'D=3089492 t=1666817870443434',
 'Date': 'Wed, 26 Oct 2022 20:57:52 GMT',
 'X-Couch-Request-ID': 'fc180d9ad7',

while this is the access log of the frontend (note the different timestamps?)

/data/s3/logs/cmsweb/frontend-logs/cmsweb/2022/10/26/cmsweb-wvdrd-startAt=22:45_0.txt:[26/Oct/2022:22:57:46 +0200] 24ee3f389641 xxx.xxx.xxx.xxx "GET /couchdb/reqmgr_config_cache/6c075881a8454070ce3c1e8921cdb45e/configFile HTTP/1.1" 200 [data: 3252 in 24736 out 3948 body 747611 us ] [auth: blah ] [ref: "-" "WMCore/usePycurl" ] 8443 - Podname=frontend-hrflv
/data/s3/logs/cmsweb/frontend-logs/cmsweb/2022/10/26/cmsweb-wvdrd-startAt=22:45_0.txt:[26/Oct/2022:22:57:50 +0200] 24ee3f389641 xxx.xxx.xxx.xxx "GET /couchdb/reqmgr_config_cache/6c075881a8454070ce3c1e8921cdb45e/configFile HTTP/1.1" 200 [data: 3252 in 24737 out 3948 body 3090094 us ] [auth: blah ] [ref: "-" "WMCore/usePycurl" ] 8443 - Podname=frontend-hrflv
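A quick way to pull those per-request times out of the frontend access log and print them in seconds (log path taken from the lines above; the number between "body" and "us" is assumed to be microseconds, matching the CMS-Server-Time D= value):

grep configFile "/data/s3/logs/cmsweb/frontend-logs/cmsweb/2022/10/26/cmsweb-wvdrd-startAt=22:45_0.txt" \
  | awk '{for (i = 1; i < NF; i++) if ($i == "body") printf "%.3f s\n", $(i+1)/1e6}'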

[1]

@@ -321,13 +322,18 @@ class RequestHandler(object):
                 verbose=0, ckey=None, cert=None, capath=None,
                 doseq=True, encode=False, decode=False, cainfo=None, cookie=None):
         """Fetch data for given set of parameters"""
+        tStart = time.time()
         curl = pycurl.Curl()
+        print("AMR created curl object in: %s" % (time.time() - tStart))
         bbuf, hbuf = self.set_opts(curl, url, params, headers, ckey, cert, capath,
                                    verbose, verb, doseq, encode, cainfo, cookie)
+        print("AMR set options in: %s" % (time.time() - tStart))
         curl.perform()
+        print("AMR performed call in : %s" % (time.time() - tStart))
         if verbose:
             print(verb, url, params, headers)
         header = self.parse_header(hbuf.getvalue())
+        print("AMR parsed headers in : %s" % (time.time() - tStart))
         if header.status < 300:

@vkuznet
Contributor

vkuznet commented Oct 26, 2022

Alan, I suggest you simplify your measurements by testing individual pieces, e.g.:

  • first, measure the time spent in FE+nginx using a simpler (than CouchDB) service like httpgo, e.g. https://cmsweb.cern.ch:8443/httpgo will only print the results of the HTTP request, i.e. it only gives you the time spent in FE+nginx. This is a high-throughput Go server which spends essentially zero time on your request. If you find inconsistencies here, then the issue is latency in the FE+nginx stack; otherwise you can move forward with testing CouchDB only.
  • measure CouchDB access directly without going through the FE, i.e. bypass FE+nginx. Since the FE has CouchDB redirect rules, you can put your script on a node (VM) and measure the time using a localhost URL; this may give you an idea whether something is wrong with CouchDB itself, e.g. its storage.
  • use the hey tool to measure concurrent calls, instead of pycurl. It will send N requests and provide stats with mean/min/max times (a sketch follows this list). You can find documentation about it over here. We use it in all our stress tests.
  • you can put the httpgo executable directly on your CouchDB node and test the VM itself using this HTTP service. This will give you an idea whether any networking issue exists on the VM itself. Since it is a static executable, you can grab it directly from the release area and run it on your node. Then you can use either the pycurl script or hey to measure the performance of this server, to understand whether the VM itself may misbehave.
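For illustration, a hey run along these lines (request counts are made up, and the target here is the local CouchDB port on the backend VM rather than a production URL) prints mean/percentile latencies and a response-time histogram:

hey -n 200 -c 10 http://localhost:5984/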

@amaltaro
Contributor

amaltaro commented Oct 27, 2022

Valentin, those are good suggestions, thanks! I ran some of these tests and I think there are now more results supporting my suspicion that something is wrong in the communication between the FEs and the VM backends.

Regarding the httpgo test: this is not going to exercise exactly the same components; when talking to CouchDB, nginx is out of scope. Anyhow, I ran some tests and the results are pretty constant, in general < 0.1 seconds.

  • measure CouchDB access directly without going through the FE, i.e. bypass FE+nginx.

this test exposes the problem, and it can be reproduced with plain curl. Running the same call (Accept-Encoding has no effect here) on localhost, in this case vocms0842, I get constant results like the following (matching what gets reported in the CouchDB log):

time curl "localhost:5984/reqmgr_config_cache/6c075881a8454070ce3c1e8921cdb45e/configFile" -v -H "Content-type: application/json" -H "Accept: application/json" -H "Accept-Encoding: deflate,identity" | head -n2
...
real	0m0.014s
user	0m0.003s
sys	0m0.006s

while running the same command from a remote host, I get something like (with large variations):

time curl --cert $X509_USER_CERT --key $X509_USER_KEY "https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/6c075881a8454070ce3c1e8921cdb45e/configFile" -v -H "Content-type: application/json" -H "Accept: application/json" -H "Accept-Encoding: deflate,identity" | head -n2
...
< CMS-Server-Time: D=1856123 t=1666832943150137
real	0m1.897s
user	0m0.018s
sys	0m0.020s

where D=1856123 is 1.856 seconds.

  • use hey tool to measure concurrent calls, ...

The hey test was performed before putting those nodes into production, and we should definitely NOT run it against the production services.

  • you can put httpgo executable directly on your couchDB ...

for that to work as we need, we would have to change the Apache FE backend rules. Otherwise, traffic won't go outside of localhost.

Last but not least, I ran the same remote tests, but this time using a different FE (cmsweb-prod instead of cmsweb), and the results are pretty good and constant:

 time curl --cert $X509_USER_CERT --key $X509_USER_KEY "https://cmsweb-prod.cern.ch/couchdb/reqmgr_config_cache/6c075881a8454070ce3c1e8921cdb45e/configFile" -v -H "Content-type: application/json" -H "Accept: application/json" -H "Accept-Encoding: deflate,identity" | head -n2
...
< CMS-Server-Time: D=65918 t=1666833329545629
real	0m0.103s
user	0m0.019s
sys	0m0.015s

To me it's clear that something is wrong with the cmsweb frontends and/or their communication with the VM backends.
@muhammadimranfarooqi would you have any clue about what might be wrong? I think it all started when we swapped the CouchDB VMs, a procedure we ran with Aroosha. If you read this post (no need to read the whole thread), there are instructions to easily reproduce this problem.

@muhammadimranfarooqi

muhammadimranfarooqi commented Oct 28, 2022

Hi @amaltaro

I did some investigation and comparison between cmsweb and cmsweb-prod.

These are the rules and tags, which are the same in both FE clusters:

[_frontend@cmsweb-zone-b-mjgekmafkolv-node-0 data]$ cat /data/srv/current/config/frontend/backends.txt  | grep couchdb
^/auth/complete/(?:couchdb/workqueue|workqueue)(?:/|$) vocms0841.cern.ch :affinity
^/auth/complete/(?:couchdb/workqueue_inbox|workqueue_inbox)(?:/|$) vocms0841.cern.ch :affinity
^/auth/complete/(?:couchdb/wmstats|wmstats)(?:/|$) vocms0843.cern.ch :affinity
^/auth/complete/(?:couchdb/workloadsummary|workloadsummary)(?:/|$) vocms0843.cern.ch :affinity
^/auth/complete/(?:couchdb/wmstats_logdb|wmstats_logdb)(?:/|$) vocms0843.cern.ch :affinity
^/auth/complete/(?:couchdb/tier0_wmstats|tier0_wmstats)(?:/|$) vocms0844.cern.ch :affinity
^/auth/complete/(?:couchdb/t0_workloadsummary|t0_workloadsummary)(?:/|$) vocms0844.cern.ch :affinity
^/auth/complete/(?:couchdb/t0_request|t0_request)(?:/|$) vocms0844.cern.ch :affinity
^/auth/complete/(?:couchdb/t0_logdb|t0_logdb)(?:/|$) vocms0844.cern.ch :affinity
^/auth/complete/(?:couchdb/wmdatamining|wmdatamining)(?:/|$) vocms0842.cern.ch :affinity
^/auth/complete/(?:couchdb/reqmgr_workload_cache|reqmgr_workload_cache)(?:/|$) vocms0842.cern.ch :affinity
^/auth/complete/(?:couchdb/reqmgr_config_cache|reqmgr_config_cache)(?:/|$) vocms0842.cern.ch :affinity
^/auth/complete/(?:couchdb/acdcserver|acdcserver)(?:/|$) vocms0842.cern.ch :affinity
^/auth/complete/(?:couchdb/reqmgr_auxiliary|couchdb)(?:/|$) vocms0842.cern.ch :affinity
[_frontend@cmsweb-zone-b-mjgekmafkolv-node-0 data]$ ls -l /data/srv/
total 0
drwxr-xr-x. 1 _frontend _frontend 32 Oct 13 05:16 HG2211a
drwxr-xr-x. 2 _frontend _frontend  6 Oct 13 05:10 auth
lrwxrwxrwx. 1 _frontend _frontend  7 Oct 13 05:16 current -> HG2211a
drwxr-xr-x. 2 _frontend _frontend 22 Oct 13 05:16 enabled
drwxr-xr-x. 4 _frontend _frontend 35 Oct 13 05:10 logs
drwxr-xr-x. 1 _frontend _frontend 22 Oct 13 05:10 state

The firewall rules are also correct and are implemented together for cmsweb and cmsweb-prod here.

Regarding the monitoring of the nodes for both clusters, these are the dashboards for cmsweb and cmsweb-prod.

And this dashboard is for the frontend Apache, for both cmsweb and cmsweb-prod.

Please have a look at the dashboards; so far I did not notice anything unusual between the two clusters, apart from the TCP connections, which are double in cmsweb compared to cmsweb-prod.

@amaltaro
Contributor

@muhammadimranfarooqi Hi Imran,
thank you for providing these pointers and for double-checking for possible configuration differences.

Looking at the dashboards, I don't see anything wrong with them. There are a few differences here and there, like a much larger number of "Sockstat TCP" and larger "Network Traffic" as well, but I am not sure whether those are relevant or not.

I performed many queries to CouchDB and monitored which cmsweb frontend IP was used; it is not associated with a single FE, so we can sort of rule out a single misbehaving FE.

Could you please try to reproduce this problem with the following two curl calls (e.g., from lxplus)?

time curl --cert $X509_USER_CERT --key $X509_USER_KEY "https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/fd79406f016da454e5ac53c393491320/configFile" -H "Content-type: application/json" -H "Accept: application/json" -H "Accept-Encoding: deflate,identity" | head -n2

and

time curl --cert $X509_USER_CERT --key $X509_USER_KEY "https://cmsweb-prod.cern.ch/couchdb/reqmgr_config_cache/fd79406f016da454e5ac53c393491320/configFile" -H "Content-type: application/json" -H "Accept: application/json" -H "Accept-Encoding: deflate,identity" | head -n2


@arooshap
Member

arooshap commented Oct 28, 2022

Hi @amaltaro, I tried both commands through lxplus8.

Using the first command, I get the following output:

[apervaiz@lxplus8s20 ~]$ time curl --cert $X509_USER_CERT --key $X509_USER_KEY "https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/fd79406f016da454e5ac53c393491320/configFile" -H "Content-type: application/json" -H "Accept: application/json" -H "Accept-Encoding: deflate,identity" | head -n2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4341  100  4341    # Auto generated configuration file-:-- --:--:--     0
# using:
0     0   9396      0 --:--:-- --:--:-- --:--:--  9396
(23) Failed writing body

real	0m0.476s
user	0m0.018s
sys	0m0.014s

And using the second, I am getting the following output:

[apervaiz@lxplus8s20 ~]$ time curl --cert $X509_USER_CERT --key $X509_USER_KEY "https://cmsweb-prod.cern.ch/couchdb/reqmgr_config_cache/fd79406f016da454e5ac53c393491320/configFile" -H "Content-type: application/json" -H "Accept: application/json" -H "Accept-Encoding: deflate,identity" | head -n2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4341  # Auto generated configuration file--:--:-- --:--:-- --:--:--     0
1# using:
00  4341    0     0  44295      0 --:--:-- --:--:-- --:--:-- 44295
(23) Failed writing body

real	0m0.122s
user	0m0.020s
sys	0m0.017s
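Side note: the "(23) Failed writing body" message is just curl reporting that head closed the pipe before the whole body was written, so it should not affect the timings. A quieter variant that prints only the total transfer time would be something like:

curl -s -o /dev/null -w '%{time_total}s\n' --cert $X509_USER_CERT --key $X509_USER_KEY \
  "https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/fd79406f016da454e5ac53c393491320/configFile" \
  -H "Accept: application/json" -H "Accept-Encoding: deflate,identity"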

@vkuznet
Contributor

vkuznet commented Oct 28, 2022

@arooshap, please use the silent flag for curl to avoid the extra output, and run a series of curl calls to acquire statistics; a minimal sketch of such a loop follows.
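Something along these lines (document ID and sample count taken from the earlier commands, purely illustrative) would collect ten samples and sort them:

for i in $(seq 1 10); do
  curl -s -o /dev/null -w '%{time_total}\n' --cert $X509_USER_CERT --key $X509_USER_KEY \
    "https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/fd79406f016da454e5ac53c393491320/configFile" \
    -H "Accept: application/json" -H "Accept-Encoding: deflate,identity"
done | sort -n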

@muhammadimranfarooqi, you should try to log in to the minions of both clusters and use traceroute to the CouchDB hosts to see how many hops each cluster's minions have. It may be that the traffic routes are different and there is extra delay in delivering packets.

From the tests with the httpgo and CouchDB URLs, we may conclude that it is not an issue per se with the FE, but rather with the communication between the FE and CouchDB. The httpgo access gives the same results, which rules out an issue in the FE talking to its BE service. Access to CouchDB, on the other hand, certainly shows a big difference, e.g. cmsweb-prod is in the range of 0.8s while cmsweb gives 2-3s. And you do not need to access the reqmgr2 CouchDB database, since the results are consistent with a simple access to Couch:

time scurl -s "https://cmsweb.cern.ch/couchdb/"
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"121e5be7435f9d9f31b37171122b6645","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m3.350s
user    0m0.043s
sys     0m0.037s

time scurl -s "https://cmsweb-prod.cern.ch/couchdb/"
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"121e5be7435f9d9f31b37171122b6645","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.421s
user    0m0.045s
sys     0m0.037s

So, I would say that we need to inspect the routes between the k8s minions and Couch on both clusters.

@amaltaro
Contributor

@muhammadimranfarooqi @arooshap if you prefer, please give me access to both cmsweb and cmsweb-prod FE clusters (including configuration and everything that is needed to list/connect to those pods) and I can try to figure something out.

As Valentin correctly mentioned, you need to perform this same query many times (> 5 times) to have better statistics.

@muhammadimranfarooqi

muhammadimranfarooqi commented Oct 28, 2022

Hi @amaltaro @vkuznet

After logging in to the cmsweb-prod FE pods, I ran the following commands:

[_frontend@cmsweb-prod-otvm4plyph5p-node-2 data]$ time curl vocms0841.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"62667967fbb70f5e51bcda5c36df28cb","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.032s
user    0m0.006s
sys     0m0.007s
[_frontend@cmsweb-prod-otvm4plyph5p-node-2 data]$ time curl vocms0842.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"121e5be7435f9d9f31b37171122b6645","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.022s
user    0m0.006s
sys     0m0.003s
[_frontend@cmsweb-prod-otvm4plyph5p-node-2 data]$ time curl vocms0843.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"b1d94ffc05edae0dc3e2b9f7dee93339","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.027s
user    0m0.006s
sys     0m0.007s
[_frontend@cmsweb-prod-otvm4plyph5p-node-2 data]$ time curl vocms0844.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"d1a3739528ee6a423e5f49b754d98744","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.022s
user    0m0.003s
sys     0m0.006s

However, the pattern is different in the cmsweb cluster.

[_frontend@cmsweb-zone-b-mjgekmafkolv-node-0 data]$ time curl vocms0841.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"62667967fbb70f5e51bcda5c36df28cb","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m3.095s
user    0m0.004s
sys     0m0.007s
[_frontend@cmsweb-zone-b-mjgekmafkolv-node-0 data]$ time curl vocms0842.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"121e5be7435f9d9f31b37171122b6645","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.256s
user    0m0.005s
sys     0m0.003s
[_frontend@cmsweb-zone-b-mjgekmafkolv-node-0 data]$ time curl vocms0843.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"b1d94ffc05edae0dc3e2b9f7dee93339","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m3.033s
user    0m0.001s
sys     0m0.009s
[_frontend@cmsweb-zone-b-mjgekmafkolv-node-0 data]$ time curl vocms0844.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"d1a3739528ee6a423e5f49b754d98744","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.014s
user    0m0.003s
sys     0m0.005s

Traceroute does not reveal any latency though. Some nodes are two hops away and some are four.
For example, in cmsweb-prod:

[_frontend@cmsweb-prod-otvm4plyph5p-node-0 data]$ traceroute vocms0841.cern.ch
traceroute to vocms0841.cern.ch (188.184.76.226), 30 hops max, 60 byte packets
 1  k513-v-rjuxl-v11-ip35.cern.ch (10.116.40.2)  1.032 ms k513-v-rjuxl-v12-ip35.cern.ch (10.116.40.7)  0.839 ms k513-v-rjuxl-v11-ip35.cern.ch (10.116.40.2)  1.204 ms
 2  k773-b-rjuxl-v1-cp41.cern.ch (185.249.58.81)  2.513 ms k513-b-rjuxl-v2-cb41.cern.ch (185.249.58.57)  0.823 ms k773-b-rjuxl-v1-cp42.cern.ch (185.249.58.85)  0.912 ms
 3  k513-a-rjuxl-v11-cb31.cern.ch (185.249.58.50)  0.883 ms k513-a-rjuxl-v11-ca31.cern.ch (185.249.58.18)  1.170 ms  8.301 ms
 4  vocms0841.cern.ch (188.184.76.226)  10.844 ms  18.192 ms  32.609 ms

and from cmsweb:

[_frontend@cmsweb-zone-b-mjgekmafkolv-node-0 data]$ traceroute vocms0841.cern.ch
traceroute to vocms0841.cern.ch (188.184.76.226), 30 hops max, 60 byte packets
 1  k513-a-rjuxl-v12-ip560.cern.ch (10.116.129.3)  0.873 ms  0.809 ms k513-a-rjuxl-v11-ip560.cern.ch (10.116.129.2)  0.836 ms
 2  vocms0841.cern.ch (188.184.76.226)  11.865 ms  19.365 ms  26.886 ms

I ran the time curl commands above from these same pods.

@vkuznet
Contributor

vkuznet commented Oct 28, 2022

OK, now we clearly see which nodes are causing the issue, e.g.:

  • cmsweb-prod-otvm4plyph5p-node-0 is fine
  • while cmsweb-zone-b-mjgekmafkolv-node-0 is not, when it talks to vocms0841 and vocms0843

I suggest talking to CERN IT (Spyros) to diagnose this. It may be a faulty network card or something else on the physical node. Another way would be to add a new node to the cmsweb cluster and then discard the cmsweb-zone-b-mjgekmafkolv-node-0 one (maybe creating the node in a different CERN zone).

@amaltaro
Contributor

Thanks for running these tests, Imran.
Note that in my tests I see the same problematic behavior coming from multiple IP addresses, so it's likely not just one minion having problems.

@muhammadimranfarooqi

Actually, I don't see a problem with the node itself. I deployed a curl pod on the same node (one way to do that is sketched below), and here is the result from inside that pod:
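For reference, one way to run such a throw-away curl pod pinned to a specific node (the image, flags, and node name here are assumptions, not necessarily what was actually used):

kubectl run curl-test --image=curlimages/curl -i --rm --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"cmsweb-zone-b-mjgekmafkolv-node-0"}}' \
  --command -- sh -c \
  'for h in vocms0841 vocms0842 vocms0843 vocms0844; do curl -s -o /dev/null -w "$h: %{time_total}s\n" "http://$h.cern.ch:5984/"; done'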

root@curl-87dd648d8-jlhxl:/go# time curl vocms0841.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"62667967fbb70f5e51bcda5c36df28cb","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.018s
user    0m0.004s
sys     0m0.010s
root@curl-87dd648d8-jlhxl:/go# time curl vocms0842.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"121e5be7435f9d9f31b37171122b6645","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.215s
user    0m0.007s
sys     0m0.005s
root@curl-87dd648d8-jlhxl:/go# time curl vocms0843.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"b1d94ffc05edae0dc3e2b9f7dee93339","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.014s
user    0m0.006s
sys     0m0.005s
root@curl-87dd648d8-jlhxl:/go# time curl vocms0844.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"d1a3739528ee6a423e5f49b754d98744","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.217s
user    0m0.007s
sys     0m0.005s

This may suggest the problem is in the FE pods, maybe due to high load, I guess.

@muhammadimranfarooqi

I am running one more test: I am adding one more node to the cmsweb cluster. I'll deploy the FE but will not put it in the DNS. Then I'll test from the new FE pod. Let's see.

@muhammadimranfarooqi

muhammadimranfarooqi commented Oct 28, 2022

Here is the result from the FE pod on the new node, in the same zone-b of the cmsweb cluster.

[_frontend@cmsweb-zone-b-mjgekmafkolv-node-1 data]$ time curl vocms0841.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"62667967fbb70f5e51bcda5c36df28cb","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.025s
user    0m0.002s
sys     0m0.003s
[_frontend@cmsweb-zone-b-mjgekmafkolv-node-1 data]$ time curl vocms0842.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"121e5be7435f9d9f31b37171122b6645","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.010s
user    0m0.000s
sys     0m0.004s
[_frontend@cmsweb-zone-b-mjgekmafkolv-node-1 data]$ time curl vocms0843.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"b1d94ffc05edae0dc3e2b9f7dee93339","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.020s
user    0m0.001s
sys     0m0.004s
[_frontend@cmsweb-zone-b-mjgekmafkolv-node-1 data]$ time curl vocms0844.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"d1a3739528ee6a423e5f49b754d98744","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.012s
user    0m0.002s
sys     0m0.002s

As there is no load, the results match those of cmsweb-prod.

@muhammadimranfarooqi

I have restarted the previous node and now the results are better.

[_frontend@cmsweb-zone-b-mjgekmafkolv-node-0 data]$ time curl vocms0844.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"d1a3739528ee6a423e5f49b754d98744","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.019s
user    0m0.003s
sys     0m0.007s
[_frontend@cmsweb-zone-b-mjgekmafkolv-node-0 data]$ time curl vocms0841.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"62667967fbb70f5e51bcda5c36df28cb","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.012s
user    0m0.003s
sys     0m0.004s
[_frontend@cmsweb-zone-b-mjgekmafkolv-node-0 data]$ time curl vocms0842.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"121e5be7435f9d9f31b37171122b6645","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.027s
user    0m0.001s
sys     0m0.007s
[_frontend@cmsweb-zone-b-mjgekmafkolv-node-0 data]$ time curl vocms0843.cern.ch:5984
{"couchdb":"Welcome","version":"3.2.2","git_sha":"d5b746b7c","uuid":"b1d94ffc05edae0dc3e2b9f7dee93339","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"]}

real    0m0.013s
user    0m0.002s
sys     0m0.005s

Let me restart all the nodes one by one. But I guess the situation might return to the same state after some time.

@muhammadimranfarooqi

I have restarted all the nodes. Now things look better. @amaltaro please monitor the status for a few days and let's see if the same situation arises again.

@vkuznet
Contributor

vkuznet commented Oct 28, 2022

@muhammadimranfarooqi, if the issue is the load, then we need to compare the load on cmsweb vs cmsweb-prod as some ratio factor; @mrceyhun can help to build a proper plot using the Apache load metrics. And if that is the case, then the solution will be to add more nodes to cmsweb and more FE pods, which will bring the average load per pod down.

@amaltaro
Contributor

I ran around 10 curl calls (using the cmsweb frontends) and the response time indeed seems to be back to normal.
Thank you for following this up, Imran and Valentin. I will provide another update on Monday.

@amaltaro
Contributor

I injected the full DMWM and Integration set of test workflows into cmsweb-testbed, and it seems to be in much better shape now. I am still not 100% sure whether we are back to normal though. I will come back to this in a day or two.

@muhammadimranfarooqi

Alan, shouldn't the workflows be injected into cmsweb instead of cmsweb-testbed, since the issue was related to the cmsweb FEs?

@amaltaro
Contributor

amaltaro commented Nov 1, 2022

Imran, to have a more complete test, yes. However, even when we inject workflows into testbed, we can point the workflow to a job configuration stored in the production CouchDB instance (accessing reqmgr_config_cache).

@amaltaro
Contributor

amaltaro commented Jan 4, 2023

After the cmsweb frontend restarts back in November, there haven't been any more reports of such slowness.
There still seems to be a minor delay when injecting workflows into testbed that rely on a ConfigCacheID from the production CouchDB, but it's not really significant and the cause is likely not in the WMCore stack. So I am closing this ticket.
