get_machines is too slow and does not use the book keeping data #2101
Comments
Well there is an intrinsic complexity here: a list of all active tasks needs to be made. On the primary instance the db access could be eliminated by using the cache and the set of unfinished runs (which is dynamically kept up to date), but I believe get_machines runs on a secondary instance. What we could do (in addition perhaps) is to create a scheduled task to regenerate the machines list every minute and then serve this list. |
how is that? the goal is to get the active machines, the "added" complexity that you need to loop through the tasks is an implementation detail.. |
Actually you are right. The main instance has a data structure
wtt_map_schema = {
    short_worker_name: (runs_schema, task_id),
}
So that could be used to get the machines list quickly on the main instance. |
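A minimal sketch of how a machines list could be read off such a map without touching the db. The field names (`tasks`, `active`, `worker_info`, `_id`) and the function `machines_from_wtt_map` are assumptions made for illustration, not the actual Fishtest code.

```python
def machines_from_wtt_map(wtt_map):
    """Build a machines list from a map: short worker name -> (run, task_id).

    Assumes (hypothetically) that each run is a dict with an "_id" and a
    "tasks" list, and that each task has "active" and "worker_info" fields.
    """
    machines = []
    for short_name, (run, task_id) in wtt_map.items():
        task = run["tasks"][task_id]
        # The map may still contain inactive tasks that were not yet cleaned.
        if not task.get("active", False):
            continue
        machine = dict(task.get("worker_info", {}))
        machine.update(worker=short_name, run_id=run["_id"], task_id=task_id)
        machines.append(machine)
    return machines
```

The cost is one pass over the map, so it grows with the number of connected workers rather than with the number of runs in the db.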
Either the machine list api should be moved to a primary instance, or else an internal api should be created where a secondary instance asks the primary instance for the list (this may involve serialization/deserialization of json). An example of such an api is in #1993. |
And as I said, an additional enhancement could be to recreate the list every minute in a scheduled task (rather than creating it on demand). |
Note that the |
I think we moved this API to the secondary instance because of performance issues.. While I expected we would have such a map in the flow that should provide us with this information, I'm not quite sure about the 1 min rescheduled build. Do you mean that the numbers shown in the home page cards currently only renew every 1 min? This sounds like too much TBH.. I'm not generally a fan of rescheduling APIs.. nor does the nature of book-keeping approaches require it.. so this caught me by surprise. |
Well I do not know where the bottleneck is. It could also be in mako. Currently the table is computed on demand but there is a delay (of the order of 30-60s) before the db is fully updated. I like to schedule expensive computing tasks at regular intervals rather than on demand (unless accuracy is critical). In that way one gets predictable performance independently of load. Note that the new scheduler in Fishtest makes creating such tasks very easy. |
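To make the "regenerate at regular intervals, then serve the stored result" idea concrete, here is a rough sketch using only the standard library. The class `PeriodicSnapshot` is hypothetical and does not use the Fishtest scheduler, whose API is not shown in this thread.

```python
import threading


class PeriodicSnapshot:
    """Recompute an expensive value every `period` seconds; serve the last copy."""

    def __init__(self, compute, period):
        self._compute = compute
        self._period = period
        self._lock = threading.Lock()
        self._value = compute()  # initial snapshot so get() never blocks
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def refresh(self):
        value = self._compute()  # the expensive part runs outside the lock
        with self._lock:
            self._value = value

    def _loop(self):
        # Event.wait returns False on timeout, True once stop() was called.
        while not self._stop.wait(self._period):
            self.refresh()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()

    def get(self):
        with self._lock:
            return self._value
```

With `period=60`, a request handler would only call `get()`, so the cost per page view no longer depends on how many people are looking at the page, at the price of data that can be up to a minute old.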
Well, my point is that if we look at book-keeping as a technique, it is neither on demand nor scheduled.. it is neither of the two.. So my question was: suppose each time a worker becomes active you add it to the dict, and each time a worker becomes inactive you remove it.. I'm not sure I understand why such a map would require scheduling? |
I do not know how much time it takes to generate the html from the data structure. If it takes 1s and 60 people want to view the machines lists then there is no time to do anything else... EDIT: Simplified of course... |
Sorry but I'm not following.. my question is related to the wtt_map: why do we have it every 1 min? I don't understand conceptually why such a map doesn't always exist up to date, just like how run results get incrementally updated.. is it because we don't spot the workers when they leave? |
The wtt_map is always up to date (except that it contains some inactive tasks which are periodically cleaned). But you still need to generate the html from it. That's what I am talking about (during a visit of Vondele's fleet this will typically be a table with 10000 rows). Of course you could fiddle with paging but then sorting by username becomes unpleasant (I think). PS. The reason why there are some inactive tasks is that I did not want to clutter the |
Perhaps I should clarify that the
For the machine list perhaps another data structure would be more appropriate: a dictionary with keys the run_id's of unfinished runs and with values the sets of corresponding active tasks. I.e. with schema
{
    run_id: {task_id}
}
This would be very easy to keep up to date. |
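A sketch of how such a dictionary could be kept up to date incrementally. The class `ActiveTasks` and its method names are hypothetical; where these hooks would be called from in the server is an assumption not covered here.

```python
class ActiveTasks:
    """Book-keeping map: run_id of an unfinished run -> set of active task_ids."""

    def __init__(self):
        self._by_run = {}

    def run_started(self, run_id):
        self._by_run.setdefault(run_id, set())

    def run_finished(self, run_id):
        self._by_run.pop(run_id, None)

    def task_started(self, run_id, task_id):
        self._by_run.setdefault(run_id, set()).add(task_id)

    def task_stopped(self, run_id, task_id):
        tasks = self._by_run.get(run_id)
        if tasks is not None:
            tasks.discard(task_id)  # no error if the task was already removed

    def active(self):
        """Return all (run_id, task_id) pairs that are currently active."""
        return [
            (run_id, task_id)
            for run_id, tasks in self._by_run.items()
            for task_id in sorted(tasks)
        ]
```

Every update is a constant-time set operation, and listing the active machines is a single pass over the active tasks only.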
I'm not sure where to look, perhaps it would be easier for you to code the new function please.
{ |
The thing is that I am not 100% sure that generating the data is the bottleneck. I wonder if it is not the generation of a table with > 15000 rows from that data (a table that also needs to be transferred to the client). If the latter is the case then moving the machines api to the primary instance would be bad, since there would be less time to handle the worker apis. |
I guess we could try. But it would require a fleet visit to see the effect. |
Good thing the fleet is visiting right this very moment :) (the server is even holding fairly steady at 100+k cores for two hours now!) |
Well it takes more than 15min to implement this... |
Yes the server seems to be doing fine. Local |
Server is indeed doing fine under this load now. Current api locally:
a bit earlier the 3 pgn_upload pserves were 100% loaded, but not hanging (eventually we should just reduce the amount of STC pgns we upload, IMO, e.g. only for the first N tasks or so). |
Hmm the worker seems to log only successful api calls (which take a few ms), not those that time out (~30s I guess). |
Note that these are rare (17 out of 6810) for the table above. |
It would be nice to understand this though. I wonder if the As a safety measure I am thinking of only scavenging a limited number of dead tasks per turn. E.g. 500. Then clearing 17000 dead tasks would take 34 minutes. This is not entirely free though because it means runs may take longer to finish. But it seems like a worthy trade off. |
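The trade-off above can be sketched as follows: 17000 dead tasks at 500 per turn is 34 turns, which gives 34 minutes if a turn runs once per minute (the one-minute period is an assumption inferred from that figure). The function names are hypothetical.

```python
import math
from collections import deque


def scavenge_turn(dead_tasks, limit=500):
    """Remove and return at most `limit` dead tasks from the queue."""
    batch = []
    while dead_tasks and len(batch) < limit:
        batch.append(dead_tasks.popleft())
    return batch


def turns_needed(n_dead, limit=500):
    """Number of turns required to clear `n_dead` tasks at `limit` per turn."""
    return math.ceil(n_dead / limit)


if __name__ == "__main__":
    queue = deque(range(17000))
    turns = 0
    while queue:
        scavenge_turn(queue)
        turns += 1
    print(turns, turns_needed(17000))
```

Bounding the batch size keeps the duration of a single turn predictable, while the backlog is still cleared after a known number of turns.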
See #2109 |
pagination needed in both cases
|
Yes but one should still be able to sort the whole table (not only the current page). |
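One way to reconcile the two requirements is to sort the full list on the server and only then cut out the requested page. A minimal sketch, where the function `sorted_page` and the row fields are hypothetical:

```python
def sorted_page(rows, key, page, page_size=100, reverse=False):
    """Sort ALL rows by `key`, then return the 1-based `page` of the result."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    ordered = sorted(rows, key=lambda row: row[key], reverse=reverse)
    start = (page - 1) * page_size
    return ordered[start:start + page_size]


if __name__ == "__main__":
    rows = [
        {"username": "carol", "cores": 4},
        {"username": "alice", "cores": 64},
        {"username": "bob", "cores": 8},
    ]
    print(sorted_page(rows, "username", page=1, page_size=2))
    print(sorted_page(rows, "cores", page=1, page_size=2, reverse=True))
```

Only one page of rows is rendered and transferred per request, but the sort order still reflects the whole table.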
Actually vondele suggested this one which I forgot about:
|
Sounds good. Sorting by username is indeed the only thing I do. |
I actually do sort per core count, compiler version, python version etc... |
Yeah doable.. will work on it |
https://tests.stockfishchess.org/tests/machines
This route should use the latest optimizations of the workers' book-keeping, if they exist
It's also too slow under fleet load, ignoring the 10s cached http response