Workers with issues #1360
This worker is probably running into a bug in the latest clang: llvm/llvm-project#55377
Yes, but Oakwen-5cores-05c1d913 also runs clang 15 on WSL and seems to be doing fine.
For whatever reason, the TLS might be handled differently, or the code might be aligned properly by luck, depending on the OS.
Oakwen-3cores-8708609b switched to g++, so the problem is solved.
Another issue: the matches of Dantist-7cores-3e5ab901 always finish with "Finished match uncleanly". This has been going on forever; I have no idea how it is possible.
Hard to guess; we would need a more detailed error message.
The matches with "Finished match uncleanly" have no games but also no crashes. This suggests that cutechess-cli failed to start the engine(s). It would be good to have access to the worker output of Dantist-7cores-3e5ab901 so that we can see what's going on.
@Dantist can you provide such output?
@noobpwnftw Your worker ChessDBCN-16cores-f3dad03d is now suffering from throttling. See https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1655459532.996&max_actions=33 . Note: The hexadecimal number f3dad03d is the first 8 characters of the UUID (which are constant). It can be found as a comment in the config file and also in uuid.txt.
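As a rough illustration of the naming scheme just described (a sketch with assumed values and behavior, not fishtest's actual code):

```python
import uuid

# Hypothetical sketch: the first 8 hex characters of the worker's UUID stay
# constant across sessions, while the remainder may be regenerated each time.
stable_prefix = "f3dad03d"  # stored as a comment in the config file / in uuid.txt
session_uuid = stable_prefix + str(uuid.uuid4())[8:]   # e.g. "f3dad03d-xxxx-..."
worker_name = f"ChessDBCN-16cores-{stable_prefix}"     # user-cores-prefix
print(worker_name, session_uuid)
```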
That actually looks correct: run.log. I have to say that I have a somewhat unique setup. I planned to deploy the worker to many servers, so I dockerized it. In general, it worked great: the latest GCC, Python, and cutechess-cli (compiled from source). Anyway, currently I see that this Docker image is no longer working, rebuilding doesn't fix the issue, and sadly enough, I have no time to fix it. Unfortunately, I haven't looked after my workers for some time and have fallen out of life, because now I have to defend my country from Putin's barbaric invaders, but if I get out alive, I'll definitely fix everything. I'll attach my Docker setup below for your convenience (in troubleshooting), but you might want to amend it and add this run method as one of the options on the "Running the worker" wiki page, or even make your own official Docker image and push it to hub.docker.com so people can run the worker with a single CLI command without manually downloading anything. Tiny Alpine Docker image: fishtest-docker.zip. I hope this can be of some help.
Best of luck, and stay healthy.
I wish you way more than luck @Dantist.
@Dantist Thanks for the logs. They are very helpful. And good luck!
@noobpwnftw The worker ChessDBCN-16cores-97544138 is also suffering from throttling. See https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1655554389.898&max_actions=100 .
@noobpwnftw Now the worker ChessDBCN-16cores-b858eb82 is throttled:
See https://stackoverflow.com/questions/71580631/how-can-i-get-code-coverage-with-clang-13-0-1-on-mac . A macOS worker is running fine with clang, perhaps because it has an x86_64 CPU, so for the moment we skip the profiled build only for Apple silicon, with #1370.
I have removed the worker.
@ppigazzini I have noticed this AssertionError once before. I did a code review then but could not find what might cause it, so it is a mystery. I suspect it is some kind of race condition...
@noobpwnftw The worker ChessDBCN-16cores-97544138 still suffers quite heavily from throttling. See https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1658567399.302&max_actions=100 .
Worker technologov-28cores-r345 suffers quite badly from throttling.
technologov-56cores-r101 suffers from "Finished match uncleanly". It plays no games, so this suggests that cutechess is unable to start the engines (the same issue Dantist had, but in this case it was fixed with the new cutechess binary). EDIT: However, I checked that technologov-56cores-r101 does not always suffer from this. In many cases it can execute a task.
Since it has not yet been discussed here (see Discord): one technologov worker, as well as most or all workers of linrock and sebastronomy, has severe time-loss problems. This is of course yet another symptom of the known cutechess concurrency issues; however, until the worker or cutechess is fixed, this is causing substantial pollution of fishtest data (time losses causing higher-than-nominal pairwise "draws", in the form of 1-0 1-0, thereby biasing test Elos towards 0). See also #1393 for implementing a worker-side workaround of the cutechess problems, and #1394 for server-side filtering of bad data.
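To make the bias concrete, here is a small back-of-the-envelope calculation (the game counts are made up; the logistic Elo formula is the standard one):

```python
import math

def elo_from_score(score):
    # Standard logistic model: expected score -> Elo difference.
    return -400 * math.log10(1 / score - 1)

# Hypothetical numbers: 1000 clean game pairs at a 52% score for the test
# engine, plus 100 pairs ruined by time losses that end 1-0 1-0 (net 50%).
clean_pairs, clean_score = 1000, 0.52
timeloss_pairs, timeloss_score = 100, 0.50

polluted_score = (clean_pairs * clean_score + timeloss_pairs * timeloss_score) / (
    clean_pairs + timeloss_pairs
)
print(f"clean:    {elo_from_score(clean_score):+.1f} Elo")     # about +13.9
print(f"polluted: {elo_from_score(polluted_score):+.1f} Elo")  # about +12.6, pulled toward 0
```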
@dubslow This issue is specifically for documenting ill-behaving workers. So it is best to refer to a worker by its full name (as has been done in the earlier comments). For documentation purposes it would be nice if there were a method in fishtest to link to a task (in a similar way as it is possible to link to an event). Currently we can only link to a run.
@MinetaS @silversolver1 Thanks for reporting. Unfortunately there is currently not really a strategy for dealing with time losses (or other undesirable worker behavior). However, since yesterday excessive time losses are recorded in the event log. This will make it easier to follow such workers. https://tests.stockfishchess.org/actions?action=crash_or_time&user=&text=
An excessive number of time losses by Wencey-32cores: https://tests.stockfishchess.org/actions?action=crash_or_time&user=&text=%22Wencey-32cores%22
I assume that the issue is that on some systems the communication between the engine and cutechess-cli steals time from the engine. There really should be an option in cutechess-cli that makes it trust the time reported by the engines. Perhaps someone can implement this?
This task has more time losses than played games: https://tests.stockfishchess.org/tests/view/63df697473223e7f52ad5d79?show_task=1097 Must be a bug...
@noobpwnftw Currently many of your workers have the same uuid prefix, which is not really desirable. I assume they were all started with a config file with the same private section.
The private section is generated once and then saved in the config file. If it is deleted then it is regenerated. The hw_seed, which is simply a random number, is the last line of defense to distinguish workers that are otherwise completely identical (e.g. running from the same virtual OS image). |
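A minimal sketch of the generate-once / regenerate-if-deleted behavior just described (the file name, and the field names other than hw_seed, are assumptions, not fishtest's actual schema):

```python
import configparser
import random
import uuid

def ensure_private_section(path="fishtest.cfg"):  # file name assumed
    """Create the [private] section once; regenerate it only if it was deleted."""
    config = configparser.ConfigParser()
    config.read(path)
    if "private" not in config:
        config["private"] = {
            # Random tie-breaker for otherwise identical machines (see above).
            "hw_seed": str(random.randint(0, 0xFFFFFFFF)),
            # Assumed field: a stable 8-character worker identifier.
            "uuid_prefix": uuid.uuid4().hex[:8],
        }
        with open(path, "w") as f:
            config.write(f)
    return dict(config["private"])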
I have written an email to sebastronomy.
Hi, that's me. It seems to be the NUMA issue discussed a while ago on Discord. I had to update the kernel and OS of the machine, and restarted the worker without numactl. I have now restarted it with numactl and will keep an eye on it.
Fixed now. Faulty CPU cooling on one CPU. I reduced the workload on this CPU and locked the workers to certain CPU cores (numactl). No more crashes since then. Will have to call DELL to fix this. The fans are all OK; it looks like a problem with the cooler itself.
These workers seem to have serious issues with building SF.
The issue is that the newest clang compiler doesn't recognize one of the options we used in older versions of SF. The newer Makefile works, but not for the regression test against SF15.
technologov-56cores-r116 appears to be generating a large number of dead tasks: https://tests.stockfishchess.org/actions?action=dead_task&user=&text=%22technologov-56cores-r116%22 Edit: the tasks don't have any games. After each reported dead task the worker restarts.
technologov-56cores-r116 is still generating a large number of dead tasks. Perhaps someone can write an email?
I wrote to him on Discord; the user has already answered.
There seem to be two workers named okrout-28cores-0a0cde5b, i.e. they have the same UUID prefix (but different UUIDs). This is harmless, but not nice. It is also not easy to achieve accidentally. There should be a server-side mechanism to avoid duplicate UUID prefixes, but this is not so easy, as one has to take, for example, dead tasks into account, which can linger for quite some time.
During request_task, we check if there are active tasks for workers with the same uuid prefix, which have recently been updated (i.e. they are not dead). If so then we return an error. Should fix official-stockfish#1360.
During request_task, we check if there are active tasks for workers with the same name, which have recently been updated (i.e. they are not dead). If so then we return an error. Should fix #1360 (comment).
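A minimal sketch of the check these PRs describe (the task fields and the staleness threshold are assumptions, not fishtest's actual code):

```python
import time

STALE_AFTER = 120  # seconds without an update before a task counts as dead (assumed)

def may_request_task(active_tasks, worker_name, now=None):
    """Refuse a new task if a worker with the same name already holds an
    active task that has recently been updated (i.e. is not dead)."""
    now = time.time() if now is None else now
    for task in active_tasks:
        if task["worker_name"] == worker_name and now - task["last_updated"] < STALE_AFTER:
            return False  # the server would return an error here
    return True
```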
This worker has a problem with its make installation: https://tests.stockfishchess.org/actions?action=&user=&text=+%22maximmasiutin-2cores-f21151d3%22
I can block him if you give me approver rights.
Done :)
The strange case of okrout-28cores-ca072243. Despite having been blocked, and having been sent an email about the blocking 3 days ago, the worker keeps dutifully reconnecting every 15 minutes, trying to get a task (see https://tests.stockfishchess.org/workers/show). Presumably the worker runs unattended and the email address is stale. It is not a problem, just strange. EDIT: The worker has been taken offline now. EDIT2: The worker came briefly back online again and was then replaced by okrout-28cores-ba47b84b, which suffers from the same problem (I blocked it again). It seems the owner reads neither email nor the console messages.
I suspect some users create multiple workers whose total number of cores is larger than the number of cores in the system, with predictably bad results. Perhaps this happens here: https://tests.stockfishchess.org/actions?action=&user=tolkki963&text= ? It seems feasible to count the total number of cores in use by the workers using (named) shared memory (available since Python 3.8). For Python <= 3.12 this requires a monkey patch to work around some bugs in resource tracking: python/cpython#82300 (comment). I tested this code on Python 3.10 and it works. As usual there are two annoying issues to deal with
|
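The tested snippet itself is not preserved in this thread; a minimal sketch of the named shared-memory idea (the segment name and layout are assumptions) could look like this:

```python
import struct
from multiprocessing import shared_memory

SEGMENT = "fishtest_worker_cores"  # assumed fixed name shared by all workers on a machine

def register_cores(n):
    """Add this worker's core count to a machine-wide tally and return the total."""
    try:
        # First worker on the machine creates the segment and zeroes the counter.
        shm = shared_memory.SharedMemory(name=SEGMENT, create=True, size=8)
        struct.pack_into("q", shm.buf, 0, 0)
    except FileExistsError:
        shm = shared_memory.SharedMemory(name=SEGMENT)
    # NOTE: a real implementation needs a lock around this read-modify-write,
    # and on Python <= 3.12 the resource-tracker monkey patch mentioned above,
    # or the segment may be unlinked prematurely (python/cpython#82300).
    total = struct.unpack_from("q", shm.buf, 0)[0] + n
    struct.pack_into("q", shm.buf, 0, total)
    shm.close()
    return total
```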
I am thinking that it is easier for the server to sort this out (if the workers send the necessary information). The server needs to know which workers run on the same machine, but this can be done by using a random number stored in some fixed temporary file.
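A sketch of the fixed-temporary-file idea (the path and token size are assumptions); the worker would send this token to the server so it can group workers by machine:

```python
import os
import secrets
import tempfile

MACHINE_ID_FILE = os.path.join(tempfile.gettempdir(), "fishtest_machine_id")  # assumed path

def machine_id():
    """Return a per-machine random token, creating it on first use."""
    try:
        with open(MACHINE_ID_FILE) as f:
            return f.read().strip()
    except FileNotFoundError:
        token = secrets.token_hex(16)
        with open(MACHINE_ID_FILE, "w") as f:
            f.write(token)
        return token
```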
I am creating this issue to report workers with issues. E.g., currently:
https://tests.stockfishchess.org/actions?action=failed_task&user=Oakwen&before=1655180683.077&max_actions=100