Workers with issues #1360
This worker is probably running into a bug in the latest clang: llvm/llvm-project#55377
Yes, but Oakwen-5cores-05c1d913 also runs clang 15 on WSL and seems to be doing fine.
For whatever reason, the TLS might be handled differently, or the code might be aligned properly by luck, depending on the OS.
Oakwen-3cores-8708609b switched to g++, so the problem is solved.
Another issue: the matches of Dantist-7cores-3e5ab901 always finish with "Finished match uncleanly". This has been going on forever; I have no idea how it is possible.
Hard to guess; we would need a more detailed error message.
The matches with "Finished match uncleanly" have no games but also no crashes. This suggests that cutechess-cli failed to start the engine(s). It would be good to have access to the worker output of Dantist-7cores-3e5ab901 so that we can see what's going on.
@Dantist can you provide such output?
@noobpwnftw Your worker ChessDBCN-16cores-f3dad03d is now suffering from throttling. See https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1655459532.996&max_actions=33 . Note: The hexadecimal number f3dad03d is the first 8 characters of the UUID (which are constant). It can be found as a comment in the config file and also in uuid.txt.
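As a rough illustration of the naming scheme just described (a sketch with assumed values and behavior, not fishtest's actual code):

```python
import uuid

# Hypothetical sketch: the first 8 hex characters of the worker's UUID stay
# constant across sessions, while the remainder may be regenerated each time.
stable_prefix = "f3dad03d"  # stored as a comment in the config file / in uuid.txt
session_uuid = stable_prefix + str(uuid.uuid4())[8:]   # e.g. "f3dad03d-xxxx-..."
worker_name = f"ChessDBCN-16cores-{stable_prefix}"     # user-cores-prefix
print(worker_name, session_uuid)
```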
That actually looks correct: run.log. I have to say that I have a somewhat unique setup. I planned to deploy the worker to many servers, so I dockerized it. In general, it worked great: the latest GCC, Python, and cutechess-cli (compiled from source). Anyway, currently I see that this Docker image is no longer working, rebuilding doesn't fix the issue, and sadly enough, I have no time to fix it. Unfortunately, I haven't looked after my workers for some time and have fallen out of life, because now I have to defend my country from Putin's barbaric invaders, but if I get out alive, I'll definitely fix everything. I'll attach my Docker setup below for your convenience (in troubleshooting), but you might want to amend it and add this run method as one of the options on the "Running the worker" wiki page, or even make your own official Docker image and push it to hub.docker.com so people can run the worker with a single CLI command without manually downloading anything. Tiny Alpine Docker image: fishtest-docker.zip. I hope this can be of some help.
Best of luck, and stay healthy.
I wish you way more than luck @Dantist.
@Dantist Thanks for the logs. They are very helpful. And good luck!
@noobpwnftw The worker ChessDBCN-16cores-97544138 is also suffering from throttling. See https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1655554389.898&max_actions=100 .
@noobpwnftw Now the worker ChessDBCN-16cores-b858eb82 is throttled:
See https://stackoverflow.com/questions/71580631/how-can-i-get-code-coverage-with-clang-13-0-1-on-mac . A macOS worker is running fine with clang, perhaps because it has an x86_64 CPU, so for the moment we skip the profiled build only for Apple silicon, with #1370.
I have removed the worker.
@ppigazzini I have noticed this AssertionError once before. I did a code review then but could not find what might cause it, so it is a mystery. I suspect it is some kind of race condition...
@noobpwnftw The worker ChessDBCN-16cores-97544138 still suffers quite heavily from throttling. See https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1658567399.302&max_actions=100 .
Worker technologov-28cores-r345 suffers quite badly from throttling.
technologov-56cores-r101 suffers from "Finished match uncleanly". It plays no games, so this suggests that cutechess is unable to start the engines (the same issue Dantist had, but in this case it was fixed with the new cutechess binary). EDIT: However, I checked that technologov-56cores-r101 does not always suffer from this. In many cases it can execute a task.
Since it has not yet been discussed here (see Discord): one technologov worker, as well as most or all workers of linrock and sebastronomy, has severe time-loss problems. This is of course yet another symptom of the known cutechess concurrency issues; however, until the worker or cutechess is fixed, this is causing substantial pollution of fishtest data (time losses causing higher-than-nominal pairwise "draws", in the form of 1-0 1-0, thereby biasing test Elos towards 0). See also #1393 for implementing a worker-side workaround of the cutechess problems, and #1394 for server-side filtering of bad data.
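To make the bias concrete, here is a small back-of-the-envelope calculation (the game counts are made up; the logistic Elo formula is the standard one):

```python
import math

def elo_from_score(score):
    # Standard logistic model: expected score -> Elo difference.
    return -400 * math.log10(1 / score - 1)

# Hypothetical numbers: 1000 clean game pairs at a 52% score for the test
# engine, plus 100 pairs ruined by time losses that end 1-0 1-0 (net 50%).
clean_pairs, clean_score = 1000, 0.52
timeloss_pairs, timeloss_score = 100, 0.50

polluted_score = (clean_pairs * clean_score + timeloss_pairs * timeloss_score) / (
    clean_pairs + timeloss_pairs
)
print(f"clean:    {elo_from_score(clean_score):+.1f} Elo")     # about +13.9
print(f"polluted: {elo_from_score(polluted_score):+.1f} Elo")  # about +12.6, pulled toward 0
```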
@dubslow This issue is specifically for documenting ill-behaving workers. So it is best to refer to a worker by its full name (as has been done in the earlier comments). For documentation purposes it would be nice if there were a method in fishtest to link to a task (in a similar way as it is possible to link to an event). Currently we can only link to a run.
@MinetaS @silversolver1 Thanks for reporting. Unfortunately there is currently not really a strategy for dealing with time losses (or other undesirable worker behavior). However, since yesterday excessive time losses are recorded in the event log. This will make it easier to follow such workers. https://tests.stockfishchess.org/actions?action=crash_or_time&user=&text=
An excessive number of time losses by Wencey-32cores: https://tests.stockfishchess.org/actions?action=crash_or_time&user=&text=%22Wencey-32cores%22
I assume that the issue is that on some systems the communication between the engine and cutechess-cli steals time from the engine. There really should be an option in cutechess-cli that makes it trust the time reported by the engines. Perhaps someone can implement this?
This task has more time losses than played games: https://tests.stockfishchess.org/tests/view/63df697473223e7f52ad5d79?show_task=1097 Must be a bug...
@noobpwnftw Currently many of your workers have the same uuid prefix, which is not really desirable. I assume they were all started with a config file with the same private section.
The private section is generated once and then saved in the config file. If it is deleted then it is regenerated. The hw_seed, which is simply a random number, is the last line of defense to distinguish workers that are otherwise completely identical (e.g. running from the same virtual OS image). |
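A minimal sketch of the generate-once / regenerate-if-deleted behavior just described (the file name, and the field names other than hw_seed, are assumptions, not fishtest's actual schema):

```python
import configparser
import random
import uuid

def ensure_private_section(path="fishtest.cfg"):  # file name assumed
    """Create the [private] section once; regenerate it only if it was deleted."""
    config = configparser.ConfigParser()
    config.read(path)
    if "private" not in config:
        config["private"] = {
            # Random tie-breaker for otherwise identical machines (see above).
            "hw_seed": str(random.randint(0, 0xFFFFFFFF)),
            # Assumed field: a stable 8-character worker identifier.
            "uuid_prefix": uuid.uuid4().hex[:8],
        }
        with open(path, "w") as f:
            config.write(f)
    return dict(config["private"])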
I have written an email to sebastronomy.
Hi, that's me. It seems to be the NUMA issue discussed a while ago on Discord. I had to update the kernel and OS of the machine, and restarted the worker without numactl. I have now restarted it with numactl and will keep an eye on it.
Fixed now. Faulty CPU cooling on one CPU. I reduced the workload on this CPU and locked the workers to certain CPU cores (numactl). No more crashes since then. Will have to call DELL to fix this. The fans are all OK; it looks like a problem with the cooler itself.
These workers seem to have serious issues with building SF.
The issue is that the newest clang compiler doesn't recognize one of the options we used in older versions of SF. The newer Makefile works, but not for the regression test against SF15.
technologov-56cores-r116 appears to be generating a large number of dead tasks: https://tests.stockfishchess.org/actions?action=dead_task&user=&text=%22technologov-56cores-r116%22 Edit: the tasks don't have any games. After each reported dead task the worker restarts.
technologov-56cores-r116 is still generating a large number of dead tasks. Perhaps someone can write an email?
I wrote to him on Discord; the user has already answered.
There seem to be two workers named okrout-28cores-0a0cde5b, i.e. they have the same UUID prefix (but different UUIDs). This is harmless, but not nice. It is also not easy to achieve accidentally. There should be a server-side mechanism to avoid duplicate UUID prefixes, but this is not so easy, as one has to take, for example, dead tasks into account, which can linger for quite some time.
During request_task, we check if there are active tasks for workers with the same uuid prefix, which have recently been updated (i.e. they are not dead). If so then we return an error. Should fix official-stockfish#1360.
During request_task, we check if there are active tasks for workers with the same name, which have recently been updated (i.e. they are not dead). If so then we return an error. Should fix #1360 (comment).
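A minimal sketch of the check these PRs describe (the task fields and the staleness threshold are assumptions, not fishtest's actual code):

```python
import time

STALE_AFTER = 120  # seconds without an update before a task counts as dead (assumed)

def may_request_task(active_tasks, worker_name, now=None):
    """Refuse a new task if a worker with the same name already holds an
    active task that has recently been updated (i.e. is not dead)."""
    now = time.time() if now is None else now
    for task in active_tasks:
        if task["worker_name"] == worker_name and now - task["last_updated"] < STALE_AFTER:
            return False  # the server would return an error here
    return True
```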
This worker has a problem with its make installation: https://tests.stockfishchess.org/actions?action=&user=&text=+%22maximmasiutin-2cores-f21151d3%22
I can block him if you give me approver rights.
Done :)
The strange case of okrout-28cores-ca072243. Despite having been blocked, and having been sent an email about the blocking 3 days ago, the worker keeps dutifully reconnecting every 15 minutes, trying to get a task (see https://tests.stockfishchess.org/workers/show). Presumably the worker runs unattended and the email address is stale. It is not a problem, just strange. EDIT: The worker has been taken offline now. EDIT2: The worker came briefly back online again and was then replaced by okrout-28cores-ba47b84b, which suffers from the same problem (I blocked it again). It seems the owner reads neither email nor the console messages.
I suspect some users create multiple workers whose total number of cores is larger than the number of cores in the system, with predictably bad results. Perhaps this happens here: https://tests.stockfishchess.org/actions?action=&user=tolkki963&text= ? It seems feasible to count the total number of cores in use by the workers using (named) shared memory (available since Python 3.8). For Python <= 3.12 this requires a monkey patch to work around some bugs in resource tracking: python/cpython#82300 (comment). I tested this code on Python 3.10 and it works. As usual there are two annoying issues to deal with
|
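The tested snippet itself is not preserved in this thread; a minimal sketch of the named shared-memory idea (the segment name and layout are assumptions) could look like this:

```python
import struct
from multiprocessing import shared_memory

SEGMENT = "fishtest_worker_cores"  # assumed fixed name shared by all workers on a machine

def register_cores(n):
    """Add this worker's core count to a machine-wide tally and return the total."""
    try:
        # First worker on the machine creates the segment and zeroes the counter.
        shm = shared_memory.SharedMemory(name=SEGMENT, create=True, size=8)
        struct.pack_into("q", shm.buf, 0, 0)
    except FileExistsError:
        shm = shared_memory.SharedMemory(name=SEGMENT)
    # NOTE: a real implementation needs a lock around this read-modify-write,
    # and on Python <= 3.12 the resource-tracker monkey patch mentioned above,
    # or the segment may be unlinked prematurely (python/cpython#82300).
    total = struct.unpack_from("q", shm.buf, 0)[0] + n
    struct.pack_into("q", shm.buf, 0, total)
    shm.close()
    return total
```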
I am thinking that it is easier for the server to sort this out (if the workers send the necessary information). The server needs to know which workers run on the same machine, but this can be done by using a random number stored in some fixed temporary file.
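A sketch of the fixed-temporary-file idea (the path and token size are assumptions); the worker would send this token to the server so it can group workers by machine:

```python
import os
import secrets
import tempfile

MACHINE_ID_FILE = os.path.join(tempfile.gettempdir(), "fishtest_machine_id")  # assumed path

def machine_id():
    """Return a per-machine random token, creating it on first use."""
    try:
        with open(MACHINE_ID_FILE) as f:
            return f.read().strip()
    except FileNotFoundError:
        token = secrets.token_hex(16)
        with open(MACHINE_ID_FILE, "w") as f:
            f.write(token)
        return token
```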
I am creating this issue to report workers with issues. E.g., currently:
https://tests.stockfishchess.org/actions?action=failed_task&user=Oakwen&before=1655180683.077&max_actions=100