WMCore services reported being slow #11330
FYI: @amaltaro |
I am able to reproduce this slowness by simply creating a workflow in my private VM services. The bulk of the delay - and of the processing as well - comes from the workflow validation. Zooming in, it's actually the validation of the PSet (ConfigCacheID) configuration, which performs a few HTTP calls against ConfigCache in CouchDB, plus the client POST call to the corresponding endpoint. BTW, I am not able to reproduce any of these delays with vanilla client calls. |
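For reference, this is the kind of single-request timing that exposes the delay; the ConfigCache document path and the certificate locations below are illustrative assumptions, not taken from the original report:

```bash
# Time one ConfigCache document fetch through the cmsweb frontend.
# Database/document path and certificate locations are placeholders.
curl -s -o /dev/null \
     --cert ~/.globus/usercert.pem --key ~/.globus/userkey.pem \
     -w "connect: %{time_connect}s  tls: %{time_appconnect}s  ttfb: %{time_starttransfer}s  total: %{time_total}s\n" \
     "https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/<config-doc-id>"
```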
Just another not-so-useful update on this investigation. I created an ad-hoc script called usePycurl.py, which basically imports the relevant pycurl-based client code from WMCore, and I made some modifications to it to report timing information. After running this test a few times against the same CouchDB URI, one can see that the total elapsed time varies from 0.07 to 3.x seconds. CouchDB responds with a request-identifier header, such that we can search for this UID request identifier in the CouchDB logs; the corresponding log entries confirm that it takes only 2 or 3 milliseconds to be served by CouchDB itself. With this information, the best I can assume is that something is delaying either the request on its way to CouchDB or the response on its way back to the client.
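As an illustration of matching a client request with the server log: CouchDB returns an X-Couch-Request-ID response header, and the same identifier appears in the CouchDB log lines. The URL, port, and log path below are assumptions about the backend VM setup:

```bash
# Capture the request identifier from the response headers (client side)...
REQ_ID=$(curl -s -D - -o /dev/null "http://localhost:5984/" \
         | awk 'tolower($1) == "x-couch-request-id:" {print $2}' | tr -d '\r')

# ...then look it up in the CouchDB log (server side) to see how long Couch itself took.
grep "$REQ_ID" /var/log/couchdb/couch.log
```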
Comparing the HTTP response as received by the client with the access log of the frontend, note the different timestamps between the two.
|
Alan, I suggest that you simplify your measurements by testing the individual pieces separately, e.g. querying CouchDB directly on the backend versus going through the frontend with plain curl.
|
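A minimal sketch of this kind of per-piece measurement, assuming direct access to CouchDB on the backend VM (port 5984) and X509 certificates for the frontend URL; the paths and certificate locations are my assumptions:

```bash
FMT="dns:%{time_namelookup}s connect:%{time_connect}s tls:%{time_appconnect}s ttfb:%{time_starttransfer}s total:%{time_total}s\n"

# Piece 1: CouchDB alone, from the backend VM (no frontend in the path).
curl -s -o /dev/null -w "$FMT" "http://localhost:5984/"

# Piece 2: the same service, but going through the cmsweb frontend.
curl -s -o /dev/null -w "$FMT" \
     --cert ~/.globus/usercert.pem --key ~/.globus/userkey.pem \
     "https://cmsweb.cern.ch/couchdb/"
```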
Valentin, those are good suggestions, thanks! I ran some of these tests and I think there are now more results supporting my suspicion of something wrong in the communication between the FEs and the VM backends.
Querying CouchDB directly is not going to exercise exactly the same components - when talking to CouchDB directly, Nginx is out of scope. Anyhow, I ran some of these tests and the results are pretty constant, in general < 0.1 seconds.
The test through the frontend, on the other hand, exposes the problem, and it can be reproduced with plain curl. Running the same command from a remote host, I get elapsed times with large variations, much larger than what I measure locally. Testing it from localhost would not help much: for that to work as we need, we would have to change the Apache FE backend rules; otherwise, traffic won't go outside of localhost. Last but not least, I ran the same remote tests, but this time using a different FE cluster, and the numbers there looked considerably better.
To me it's clear that something is wrong on the cmsweb frontend side, or in its communication with the backends. |
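A rough sketch of that local-versus-remote comparison; hostnames, ports, and certificate paths are placeholders, since the exact commands were not preserved in this thread:

```bash
# On the backend VM itself (traffic never leaves localhost):
curl -s -o /dev/null -w "local  total: %{time_total}s\n" "http://localhost:5984/"

# From a remote host (e.g. lxplus), going through the frontend:
curl -s -o /dev/null -w "remote total: %{time_total}s\n" \
     --cert ~/.globus/usercert.pem --key ~/.globus/userkey.pem \
     "https://cmsweb.cern.ch/couchdb/"
```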
Hi @amaltaro, I tried to do some investigation and comparison between cmsweb and cmsweb-prod. These are the rules and tags, which are the same in both FE clusters.
The firewall rules are also correct and implemented together for cmsweb and cmsweb-prod. Regarding the monitoring of the nodes, there are dashboards for both cmsweb and cmsweb-prod, plus a dashboard for the frontend Apache of both clusters. Please have a look at the dashboards; so far I did not notice anything unusual between the two clusters, apart from the number of TCP connections, which is roughly twice as high on one of them. |
@muhammadimranfarooqi Hi Imran, looking at the dashboards, I don't see anything wrong with them. There are a few differences here and there, like a much larger number of "Sockstat TCP" connections and larger "Network Traffic" as well, but I am not sure whether those are relevant or not. I performed many queries to CouchDB and monitored which nodes were serving them.
Could you please try to reproduce this problem with the following two different curl calls (e.g., from lxplus): one against the httpgo test service and another one against CouchDB? |
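The exact commands were not preserved here; based on the follow-up discussion (one call against the httpgo test service and one against CouchDB), they would have looked roughly like this sketch, where both URL paths are assumptions:

```bash
# Call 1: the httpgo test service behind the frontend (exercises FE <-> BE only).
time curl -s -o /dev/null \
     --cert ~/.globus/usercert.pem --key ~/.globus/userkey.pem \
     "https://cmsweb.cern.ch/httpgo"

# Call 2: CouchDB behind the frontend (exercises the FE <-> CouchDB path as well).
time curl -s -o /dev/null \
     --cert ~/.globus/usercert.pem --key ~/.globus/userkey.pem \
     "https://cmsweb.cern.ch/couchdb/"
```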
Hi @amaltaro, I tried both commands from lxplus8. Using the first command, I get the following output:
And using the second one, I get the following output:
|
@arooshap, please use the silent flag for curl to avoid extra output, and run a series of curl calls to acquire statistics. @muhammadimranfarooqi, you should try to log in to the minions of both clusters and use traceroute towards the CouchDB backend.
From the tests with the httpgo and CouchDB URLs we may conclude that it is not an issue with the FE per se, but rather with the communication of the FE with CouchDB. The httpgo access gives the same results on both clusters, which rules out an issue in the communication between the FE and its BE service. Access to CouchDB, however, certainly shows a big difference: cmsweb-prod is in the range of 0.8s, while cmsweb gives 2-3s. And you do not need to access the reqmgr2 CouchDB database, since the results are consistent with a simple access to CouchDB itself.
So, I would say that we need to inspect the routes between the k8s minions and CouchDB on both clusters. |
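A sketch of the route inspection suggested above, to be run from a minion of each cluster; the CouchDB backend hostname is a placeholder:

```bash
# From a cmsweb minion (repeat from a cmsweb-prod minion for comparison).
traceroute <couchdb-backend-vm>.cern.ch

# Simple timed access to CouchDB through the frontend, no specific database needed.
curl -s -o /dev/null -w "total: %{time_total}s\n" \
     --cert ~/.globus/usercert.pem --key ~/.globus/userkey.pem \
     "https://cmsweb.cern.ch/couchdb/"
```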
@muhammadimranfarooqi @arooshap if you prefer, please give me access to both clusters. As Valentin correctly mentioned, you need to perform this same query many times (> 5 times) to have better statistics. |
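One way to collect such statistics (the URL and certificate paths are again placeholders):

```bash
# Run the same query 10 times and report min/avg/max of the total time.
for i in $(seq 1 10); do
    curl -s -o /dev/null -w "%{time_total}\n" \
         --cert ~/.globus/usercert.pem --key ~/.globus/userkey.pem \
         "https://cmsweb.cern.ch/couchdb/"
done | awk '{sum+=$1; if(min==""||$1<min)min=$1; if($1>max)max=$1}
            END{printf "min=%.3fs avg=%.3fs max=%.3fs\n", min, sum/NR, max}'
```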
After logging in to the minions of both clusters and running the tests, I see that the pattern differs between them. Traceroute does not reflect any latency though; some nodes are two hops away and some are four hops. From the same pod I also ran time curl. |
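For completeness, a sketch of taking the same timing from inside one of the frontend pods; the namespace and pod names are purely illustrative, and depending on the auth rules you may need a certificate or to target the CouchDB backend port directly:

```bash
# List the frontend pods (namespace and name patterns are illustrative).
kubectl get pods -A | grep frontend

# Measure the total request time from inside one of them.
kubectl exec -it <frontend-pod> -n <namespace> -- \
    curl -s -k -o /dev/null -w "total: %{time_total}s\n" "https://cmsweb.cern.ch/couchdb/"
```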
OK, now we clearly see which nodes are causing the issue, e.g. cmsweb-zone-b-mjgekmafkolv-node-0.
I suggest talking to CERN IT (Spyros) to diagnose this. It may be a faulty network card or something on the physical node. Another way would be to add a new node to the cmsweb cluster and then discard the cmsweb-zone-b-mjgekmafkolv-node-0 one (maybe creating the new node in a different CERN zone). |
Thanks for running these tests, Imran. |
Actually, I don't see a problem with the node itself after deploying a test there, which may suggest the problem is in the FE pods - maybe due to high load, I guess. |
I am running one more test: I am adding one more node to the cmsweb cluster. |
Here is the result from the FE pod with the new node, on the same cluster. |
As there is no load on it, the results match those of cmsweb-prod. |
I have restarted the previous node and now the results are better.
Let me restart all nodes one by one. But I guess the situation might come back to the same state after some time. |
I have restarted all nodes. Now things look better. @amaltaro please monitor the status for a few days and let's see if the same situation arises again. |
@muhammadimranfarooqi, if the issue is in the load, then we need to compare the load on cmsweb vs cmsweb-prod as some ratio factor; @mrceyhun can help to build a proper plot using the Apache load metrics. And, if that is the case, then the solution will be to add more nodes to cmsweb and more FE pods, which will bring the average load per pod down. |
I ran around 10 curl calls, using the same curl command as before. |
I injected all the DMWM and Integration sets of test workflows into cmsweb-testbed, and it seems to be in a much better shape now. I am still not 100% sure whether we are back to normal though. I will come back to this in a day or two. |
Alan, shouldn't the workflows be injected into cmsweb production instead? |
Imran, to have a more complete test, yes. However, even when we inject workflows into testbed, we can point the workflow to a job configuration stored in the production CouchDB instance (accessing it through the production cmsweb frontends), so the slow path is still exercised. |
After the cmsweb frontend restarts back in November, there haven't been any more reports of such slowness. |
Impact of the bug
All WMCore services
Describe the bug
We were contacted by several groups during the last week or two, reporting slowness of the WMCore APIs they are using. We did check a few of them, and they seemed fine. So far, the information we managed to collect is not enough to properly describe the issue. The only clue we have so far is that it may be related mostly to POST HTTP requests, rather than GET ones.
One place where we are 100% sure the APIs are slow is the workflow creation/injection. It was taking minutes, compared to seconds in previous months.
How to reproduce it
No clue so far
Expected behavior
No slow APIs are expected.
Additional context and error message
None