artifacts-tests are coredumping during SCT shutdown #9784

Open

yaronkaikov opened this issue Jan 12, 2025 · 9 comments

yaronkaikov commented Jan 12, 2025

Although the test passed,

07:56:40  < t:2025-01-12 05:56:40,267 f:setup.py        l:89   c:sdcm.sct_events.setup p:INFO  > }
07:56:40  < t:2025-01-12 05:56:40,268 f:tester.py       l:3088 c:ArtifactsTest        p:INFO  > ================================= TEST RESULTS =================================
07:56:40  < t:2025-01-12 05:56:40,268 f:tester.py       l:3088 c:ArtifactsTest        p:INFO  > ================================================================================
07:56:40  < t:2025-01-12 05:56:40,268 f:tester.py       l:3088 c:ArtifactsTest        p:INFO  > SUCCESS :)
07:56:41  < t:2025-01-12 05:56:41,597 f:base.py         l:274  c:elasticsearch        p:INFO  > GET https://746f0ad652a3447d83b1572f657c67cb.us-east-1.aws.found.io/ [status:200 request:0.077s]
07:56:41  < t:2025-01-12 05:56:41,620 f:base.py         l:274  c:elasticsearch        p:INFO  > HEAD https://746f0ad652a3447d83b1572f657c67cb.us-east-1.aws.found.io/sct_test_runs [status:200 request:0.022s]
07:56:41  < t:2025-01-12 05:56:41,717 f:base.py         l:274  c:elasticsearch        p:INFO  > PUT https://746f0ad652a3447d83b1572f657c67cb.us-east-1.aws.found.io/sct_test_runs/sct_test_run_short_v1/f95ee4d6-1636-4033-9b2c-d6493cc90f59/_create 

and then the test coredumped during shutdown:

07:56:46  /bin/bash: line 1:     7 Segmentation fault      (core dumped) ./sct.py run-test artifacts_test --backend gce --logdir /tmp/workspace/enterprise-2024.2/artifacts-offline-install/artifacts-debian11-nonroot-test/scylla-cluster-tests

Logs

Seen in https://jenkins.scylladb.com/job/enterprise-2024.2/job/artifacts-offline-install/job/artifacts-debian11-nonroot-test/20/consoleFull


fruch commented Jan 12, 2025

The failure here is not related to Argus.

The test seems to coredump at the very end (after the passing lines you mentioned), and it's not clear why.

The problem is we don't have logs from the Jenkins builders.

fruch changed the title from "artifacts-debian11-nonroot-test failed to Failed to submit data to Argus" to "artifacts-tests are coredumping during SCT shutdown" on Jan 16, 2025

fruch commented Jan 16, 2025

@k0machi I need a way to get those runs into Argus (i.e. things that are triggered inside the releng-testing folder, and also under SCT CI). Please open an issue for it on Argus, and let's discuss there what we can do about it.


fruch commented Jan 16, 2025

Looking at the logs when SCT stops, these are the processes and threads left:

========= Thread argus-heartbeat daemonic=True from sdcm.tester =========
========= STACK TRACE =========
File: "/usr/local/lib/python3.10/threading.py", line 973, in _bootstrap
  self._bootstrap_inner()
File: "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
  self.run()
File: "/usr/local/lib/python3.10/threading.py", line 953, in run
  self._target(*self._args, **self._kwargs)
File: "/tmp/workspace/enterprise-2024.2/artifacts-offline-install/artifacts-debian11-nonroot-test/scylla-cluster-tests/sdcm/tester.py", line 459, in send_argus_heartbeat
  time.sleep(30.0)
========= END OF Thread argus-heartbeat from sdcm.tester =========
========= Thread QueueFeederThread daemonic=True from multiprocessing.queues =========
========= STACK TRACE =========
File: "/usr/local/lib/python3.10/threading.py", line 973, in _bootstrap
  self._bootstrap_inner()
File: "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
  self.run()
File: "/usr/local/lib/python3.10/threading.py", line 953, in run
  self._target(*self._args, **self._kwargs)
File: "/usr/local/lib/python3.10/multiprocessing/queues.py", line 231, in _feed
  nwait()
File: "/usr/local/lib/python3.10/threading.py", line 320, in wait
  waiter.acquire()
========= END OF Thread QueueFeederThread from multiprocessing.queues =========
========= Thread event_loop daemonic=True from cassandra.io.libevreactor =========
========= SOURCE =========
    def _run_loop(self):
        while True:
            self._loop.start()
            # there are still active watchers, no deadlock
            with self._lock:
                if not self._shutdown and self._live_conns:
                    log.debug("Restarting event loop")
                    continue
                else:
                    # all Connections have been closed, no active watchers
                    log.debug("All Connections currently closed, event loop ended")
                    self._started = False
                    break

========= STACK TRACE =========
File: "/usr/local/lib/python3.10/threading.py", line 973, in _bootstrap
  self._bootstrap_inner()
File: "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
  self.run()
File: "/usr/local/lib/python3.10/threading.py", line 953, in run
  self._target(*self._args, **self._kwargs)
File: "/usr/local/lib/python3.10/site-packages/cassandra/io/libevreactor.py", line 101, in _run_loop
  self._loop.start()
========= END OF Thread event_loop from cassandra.io.libevreactor =========
========= Process SyncManager-3 daemonic=False from multiprocessing.context =========

========= END OF Process SyncManager-3 from multiprocessing.context  =========
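
For reference, a dump like the one above can be produced with a small helper along these lines (a minimal sketch, not SCT's actual implementation; `dump_leftover_threads` is a hypothetical name):

```python
# Minimal sketch: list the non-main threads still alive at shutdown and print
# their current stack traces, similar in spirit to the dump above.
import sys
import threading
import traceback


def dump_leftover_threads():
    frames = sys._current_frames()  # thread id -> current frame of every running thread
    for thread in threading.enumerate():
        if thread is threading.main_thread():
            continue
        print(f"========= Thread {thread.name} daemonic={thread.daemon} =========")
        frame = frames.get(thread.ident)
        if frame is not None:
            print("========= STACK TRACE =========")
            print("".join(traceback.format_stack(frame)), end="")
        print(f"========= END OF Thread {thread.name} =========")
```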


fruch commented Jan 16, 2025

so it boils down to 3 possible issues (not sure if they're related to the crash); a cleanup sketch follows the list:

  1. python-driver - the cassandra.io.libevreactor event loop isn't supposed to be left open when we reach the end of the test; also, we updated to the latest driver version recently (a few days ago)
  2. multiprocessing.queues - a queue that isn't emptied might cause it to get stuck; we have the decode queue, which would be nice to clear before exiting
  3. argus heartbeat - it should be waited on, and not be left running
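
A rough sketch of the kind of explicit cleanup I mean (hypothetical; `cluster`, `decode_queue`, `sync_manager`, and `post_heartbeat` are placeholders, not SCT's actual attributes):

```python
# Hypothetical cleanup sketch for the three suspects above; all names are placeholders.
import threading

stop_heartbeat = threading.Event()

def send_argus_heartbeat():
    while not stop_heartbeat.is_set():
        # post_heartbeat()           # placeholder for the actual Argus call
        stop_heartbeat.wait(30.0)    # interruptible, unlike time.sleep(30.0)

heartbeat_thread = threading.Thread(
    target=send_argus_heartbeat, name="argus-heartbeat", daemon=True)
heartbeat_thread.start()

def shutdown(cluster, decode_queue, sync_manager):
    # 1. python-driver: close all connections so the libev event loop can exit
    if cluster is not None:
        cluster.shutdown()
    # 2. multiprocessing queue: close it and wait for the feeder thread to flush
    if decode_queue is not None:
        decode_queue.close()
        decode_queue.join_thread()
    # 3. argus heartbeat: signal the thread and join it instead of abandoning it
    stop_heartbeat.set()
    heartbeat_thread.join(timeout=60)
    # SyncManager process: stop the manager so its child process exits cleanly
    if sync_manager is not None:
        sync_manager.shutdown()
```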


soyacz commented Jan 16, 2025

isn't it just another fallout of mounting .local as tmpfs?


fruch commented Jan 16, 2025

> isn't it just another fallout of mounting .local as tmpfs?

It doesn't look like that...


soyacz commented Jan 16, 2025

> > isn't it just another fallout of mounting .local as tmpfs?
>
> It doesn't look like that...

Yes, but we started to see it recently - maybe some fix for the above caused it?


fruch commented Jan 16, 2025

> > > isn't it just another fallout of mounting .local as tmpfs?
> >
> > It doesn't look like that...
>
> Yes, but we started to see it recently - maybe some fix for the above caused it?

As I listed, we changed a bunch of other things as well; I don't know where it's originating from.
