artifacts-tests are coredumping during SCT shutdown #9784

Open

yaronkaikov opened this issue Jan 12, 2025 · 9 comments

yaronkaikov commented Jan 12, 2025

Although the test passed,

07:56:40  < t:2025-01-12 05:56:40,267 f:setup.py        l:89   c:sdcm.sct_events.setup p:INFO  > }
07:56:40  < t:2025-01-12 05:56:40,268 f:tester.py       l:3088 c:ArtifactsTest        p:INFO  > ================================= TEST RESULTS =================================
07:56:40  < t:2025-01-12 05:56:40,268 f:tester.py       l:3088 c:ArtifactsTest        p:INFO  > ================================================================================
07:56:40  < t:2025-01-12 05:56:40,268 f:tester.py       l:3088 c:ArtifactsTest        p:INFO  > SUCCESS :)
07:56:41  < t:2025-01-12 05:56:41,597 f:base.py         l:274  c:elasticsearch        p:INFO  > GET https://746f0ad652a3447d83b1572f657c67cb.us-east-1.aws.found.io/ [status:200 request:0.077s]
07:56:41  < t:2025-01-12 05:56:41,620 f:base.py         l:274  c:elasticsearch        p:INFO  > HEAD https://746f0ad652a3447d83b1572f657c67cb.us-east-1.aws.found.io/sct_test_runs [status:200 request:0.022s]
07:56:41  < t:2025-01-12 05:56:41,717 f:base.py         l:274  c:elasticsearch        p:INFO  > PUT https://746f0ad652a3447d83b1572f657c67cb.us-east-1.aws.found.io/sct_test_runs/sct_test_run_short_v1/f95ee4d6-1636-4033-9b2c-d6493cc90f59/_create 

and then the test coredumped during shutdown:

07:56:46  /bin/bash: line 1:     7 Segmentation fault      (core dumped) ./sct.py run-test artifacts_test --backend gce --logdir /tmp/workspace/enterprise-2024.2/artifacts-offline-install/artifacts-debian11-nonroot-test/scylla-cluster-tests

Logs

Seen in https://jenkins.scylladb.com/job/enterprise-2024.2/job/artifacts-offline-install/job/artifacts-debian11-nonroot-test/20/consoleFull


fruch commented Jan 12, 2025

The failure here is not related to Argus.

The test seems to coredump at the very end (after the passing lines you mentioned), and it's not clear why.

The problem is we don't have logs from the Jenkins builders.

fruch changed the title from "artifacts-debian11-nonroot-test failed to Failed to submit data to Argus" to "artifacts-tests are coredumping during SCT shutdown" on Jan 16, 2025

fruch commented Jan 16, 2025

@k0machi I need a way to get those runs into Argus (i.e. things that are triggered inside the releng-testing folder, and also under SCT CI). Please open an issue for it on Argus, and let's discuss there what we can do about it.


fruch commented Jan 16, 2025

Looking at the logs when SCT stops, these are the processes and threads left:

========= Thread argus-heartbeat daemonic=True from sdcm.tester =========
========= STACK TRACE =========
File: "/usr/local/lib/python3.10/threading.py", line 973, in _bootstrap
  self._bootstrap_inner()
File: "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
  self.run()
File: "/usr/local/lib/python3.10/threading.py", line 953, in run
  self._target(*self._args, **self._kwargs)
File: "/tmp/workspace/enterprise-2024.2/artifacts-offline-install/artifacts-debian11-nonroot-test/scylla-cluster-tests/sdcm/tester.py", line 459, in send_argus_heartbeat
  time.sleep(30.0)
========= END OF Thread argus-heartbeat from sdcm.tester =========
========= Thread QueueFeederThread daemonic=True from multiprocessing.queues =========
========= STACK TRACE =========
File: "/usr/local/lib/python3.10/threading.py", line 973, in _bootstrap
  self._bootstrap_inner()
File: "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
  self.run()
File: "/usr/local/lib/python3.10/threading.py", line 953, in run
  self._target(*self._args, **self._kwargs)
File: "/usr/local/lib/python3.10/multiprocessing/queues.py", line 231, in _feed
  nwait()
File: "/usr/local/lib/python3.10/threading.py", line 320, in wait
  waiter.acquire()
========= END OF Thread QueueFeederThread from multiprocessing.queues =========
========= Thread event_loop daemonic=True from cassandra.io.libevreactor =========
========= SOURCE =========
    def _run_loop(self):
        while True:
            self._loop.start()
            # there are still active watchers, no deadlock
            with self._lock:
                if not self._shutdown and self._live_conns:
                    log.debug("Restarting event loop")
                    continue
                else:
                    # all Connections have been closed, no active watchers
                    log.debug("All Connections currently closed, event loop ended")
                    self._started = False
                    break

========= STACK TRACE =========
File: "/usr/local/lib/python3.10/threading.py", line 973, in _bootstrap
  self._bootstrap_inner()
File: "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
  self.run()
File: "/usr/local/lib/python3.10/threading.py", line 953, in run
  self._target(*self._args, **self._kwargs)
File: "/usr/local/lib/python3.10/site-packages/cassandra/io/libevreactor.py", line 101, in _run_loop
  self._loop.start()
========= END OF Thread event_loop from cassandra.io.libevreactor =========
========= Process SyncManager-3 daemonic=False from multiprocessing.context =========

========= END OF Process SyncManager-3 from multiprocessing.context  =========
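
For reference, a dump like the one above can be produced with a small helper along these lines (a minimal sketch, not SCT's actual implementation; `dump_leftover_threads` is a hypothetical name):

```python
# Minimal sketch: list the non-main threads still alive at shutdown and print
# their current stack traces, similar in spirit to the dump above.
import sys
import threading
import traceback


def dump_leftover_threads():
    frames = sys._current_frames()  # thread id -> current frame of every running thread
    for thread in threading.enumerate():
        if thread is threading.main_thread():
            continue
        print(f"========= Thread {thread.name} daemonic={thread.daemon} =========")
        frame = frames.get(thread.ident)
        if frame is not None:
            print("========= STACK TRACE =========")
            print("".join(traceback.format_stack(frame)), end="")
        print(f"========= END OF Thread {thread.name} =========")
```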


fruch commented Jan 16, 2025

so it boils down to 3 possible issues (not sure if they're related to the crash); a cleanup sketch follows the list:

  1. python-driver - the cassandra.io.libevreactor event loop isn't supposed to be left open when we reach the end of the test; also, we updated to the latest driver version recently (a few days ago)
  2. multiprocessing.queues - a queue that isn't emptied might cause it to get stuck; we have the decode queue, which would be nice to clear before exiting
  3. argus heartbeat - it should be waited on, and not be left running
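
A rough sketch of the kind of explicit cleanup I mean (hypothetical; `cluster`, `decode_queue`, `sync_manager`, and `post_heartbeat` are placeholders, not SCT's actual attributes):

```python
# Hypothetical cleanup sketch for the three suspects above; all names are placeholders.
import threading

stop_heartbeat = threading.Event()

def send_argus_heartbeat():
    while not stop_heartbeat.is_set():
        # post_heartbeat()           # placeholder for the actual Argus call
        stop_heartbeat.wait(30.0)    # interruptible, unlike time.sleep(30.0)

heartbeat_thread = threading.Thread(
    target=send_argus_heartbeat, name="argus-heartbeat", daemon=True)
heartbeat_thread.start()

def shutdown(cluster, decode_queue, sync_manager):
    # 1. python-driver: close all connections so the libev event loop can exit
    if cluster is not None:
        cluster.shutdown()
    # 2. multiprocessing queue: close it and wait for the feeder thread to flush
    if decode_queue is not None:
        decode_queue.close()
        decode_queue.join_thread()
    # 3. argus heartbeat: signal the thread and join it instead of abandoning it
    stop_heartbeat.set()
    heartbeat_thread.join(timeout=60)
    # SyncManager process: stop the manager so its child process exits cleanly
    if sync_manager is not None:
        sync_manager.shutdown()
```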


soyacz commented Jan 16, 2025

isn't it just another fallout of mounting .local as tmpfs?


fruch commented Jan 16, 2025

> isn't it just another fallout of mounting .local as tmpfs?

It doesn't look like that...


soyacz commented Jan 16, 2025

> > isn't it just another fallout of mounting .local as tmpfs?
>
> It doesn't look like that...

Yes, but we started to see it recently - maybe some fix for the above caused it?


fruch commented Jan 16, 2025

> > > isn't it just another fallout of mounting .local as tmpfs?
> >
> > It doesn't look like that...
>
> Yes, but we started to see it recently - maybe some fix for the above caused it?

As I listed, we changed a bunch of other things as well; I don't know where it's originating from.
