The `destroy_data_then_repair` nemesis takes `~8 hours` to delete `51900` sstables. #9595

vponomaryov · 2024-12-19T18:09:11Z

Packages

Scylla version: 2024.2.0-20241118.614d56348f46 with build-id e67376d9ddfea081a3bab398f4581ecdde59911d

Kernel Version: 5.15.0-1072-aws

Issue description

In a test that creates and populates 5000 tables, the destroy_data_then_repair nemesis was triggered.
As part of this nemesis, the scylla service was stopped and 50% (51900) of the sstables were deleted.
The problem is that the deletion alone took about 8 hours (05:43 to 13:32 in the log below):

2024-12-10 04:11:25,484 f:remote_base.py  l:560  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.4.209>: Running command "sudo systemctl stop scylla-server.service"...
...
2024-12-10 05:43:20,464 f:nemesis.py      l:1175 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: SStables amount to destroy (50 percent of all SStables): 51900
...
2024-12-10 05:43:20,969 f:nemesis.py      l:1190 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Files /var/lib/scylla/data/feeds/table1040_field4_table1040_index-18614371b64911efa67f40a70c16fdee/me-3gly_08fl_1zsvk2s1uyyslpee6h-big-Data.db were destroyed
...
2024-12-10 13:32:08,892 f:nemesis.py      l:1190 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Files /var/lib/scylla/data/feeds/table1291-1d809a90b64911efa67f40a70c16fdee/me-3gly_07c5_5v4b42s1uyyslpee6h-big-Data.db were destroyed
...
2024-12-10 13:32:09,010 f:remote_base.py  l:560  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.4.209>: Running command "sudo systemctl start scylla-server.service"...
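
For a rough sense of the per-file cost, the timestamps above can be turned into a rate. This is only a sketch: it treats the first and last "were destroyed" lines shown in the excerpt as the start and end of the deletion, which is an assumption because the excerpt elides the lines in between.

```python
from datetime import datetime

# Timestamp layout used by the SCT log lines quoted above.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# First and last "were destroyed" lines shown in the excerpt
# (assumed to bracket the whole deletion).
first_destroyed = datetime.strptime("2024-12-10 05:43:20,969", LOG_TIME_FORMAT)
last_destroyed = datetime.strptime("2024-12-10 13:32:08,892", LOG_TIME_FORMAT)

# "SStables amount to destroy" value reported by the nemesis.
sstables_destroyed = 51900

elapsed = last_destroyed - first_destroyed
seconds_per_sstable = elapsed.total_seconds() / sstables_destroyed

print(f"elapsed: {elapsed}")
print(f"seconds per sstable: {seconds_per_sstable:.2f}")
```

That works out to roughly half a second per deleted sstable, sustained for almost 8 hours.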

Impact

A significant amount of test time is wasted on an operation that could be done much faster.
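
One possible direction, sketched under assumptions: if the time goes into issuing one remote command per file (this issue does not confirm that), the deletions could be grouped into a small number of `rm` invocations. The helper name `batched_rm_commands`, the batch size, and the example paths below are hypothetical and are not taken from the SCT code base.

```python
import shlex
from typing import Iterable, Iterator, List


def batched_rm_commands(paths: Iterable[str], batch_size: int = 500) -> Iterator[str]:
    """Yield shell commands that each remove up to `batch_size` files.

    Hypothetical helper: each yielded string is meant to be passed to
    whatever remote runner executes commands on the node.
    """
    batch: List[str] = []
    for path in paths:
        # Quote every path so unusual characters cannot break the command.
        batch.append(shlex.quote(path))
        if len(batch) >= batch_size:
            yield "sudo rm -f -- " + " ".join(batch)
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield "sudo rm -f -- " + " ".join(batch)


# Made-up paths, only to show how many commands 51900 files would need.
example_paths = [f"/var/lib/scylla/data/ks/table{i}/file-{i}-Data.db" for i in range(51900)]
commands = list(batched_rm_commands(example_paths, batch_size=500))
print(len(commands))
```

With a batch size of 500, the 51900 files from this run would need 104 remote commands instead of one per file. The batch size has to stay small enough that each command fits within the node's command-line length limit.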

How frequently does it reproduce?

1/1

Installation details

Cluster size: 1 nodes (i4i.8xlarge)

Scylla Nodes used in this run:

longevity-5000-tables-dev-db-node-577988a0-6 (3.249.0.189 | 10.4.4.209) (shards: 30)
longevity-5000-tables-dev-db-node-577988a0-5 (54.75.190.27 | 10.4.5.109) (shards: 30)
longevity-5000-tables-dev-db-node-577988a0-4 (63.33.66.182 | 10.4.5.238) (shards: 30)
longevity-5000-tables-dev-db-node-577988a0-3 (34.243.59.47 | 10.4.7.22) (shards: 30)
longevity-5000-tables-dev-db-node-577988a0-2 (34.254.96.143 | 10.4.6.248) (shards: 30)
longevity-5000-tables-dev-db-node-577988a0-1 (54.77.124.97 | 10.4.4.86) (shards: 30)

OS / Image: ami-0698e16ac1b56a821 (aws: undefined_region)

Test: vp-scale-5000-tables-test
Test id: 577988a0-bc60-4abe-b176-dd4bea6b8666
Test name: scylla-staging/valerii/vp-scale-5000-tables-test
Test method: longevity_test.LongevityTest.test_user_batch_custom_time
Test config file(s):

longevity-5000-tables.yaml

Logs and commands

Restore Monitor Stack command: $ hydra investigate show-monitor 577988a0-bc60-4abe-b176-dd4bea6b8666
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 577988a0-bc60-4abe-b176-dd4bea6b8666

Logs:

longevity-5000-tables-dev-db-node-577988a0-1 - https://cloudius-jenkins-test.s3.amazonaws.com/577988a0-bc60-4abe-b176-dd4bea6b8666/20241209_161117/longevity-5000-tables-dev-db-node-577988a0-1-577988a0.tar.gz
db-cluster-577988a0.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/577988a0-bc60-4abe-b176-dd4bea6b8666/20241211_111129/db-cluster-577988a0.tar.gz
sct-runner-events-577988a0.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/577988a0-bc60-4abe-b176-dd4bea6b8666/20241211_111129/sct-runner-events-577988a0.tar.gz
2024_12_09__16_11_18_784.sct-577988a0.log.gz - https://cloudius-jenkins-test.s3.amazonaws.com/577988a0-bc60-4abe-b176-dd4bea6b8666/20241211_111129/2024_12_09__16_11_18_784.sct-577988a0.log.gz
2024_12_10__02_52_47_341.sct-577988a0.log.gz - https://cloudius-jenkins-test.s3.amazonaws.com/577988a0-bc60-4abe-b176-dd4bea6b8666/20241211_111129/2024_12_10__02_52_47_341.sct-577988a0.log.gz
2024_12_10__12_47_10_484.sct-577988a0.log.gz - https://cloudius-jenkins-test.s3.amazonaws.com/577988a0-bc60-4abe-b176-dd4bea6b8666/20241211_111129/2024_12_10__12_47_10_484.sct-577988a0.log.gz
2024_12_10__20_49_36_127.sct-577988a0.log.gz - https://cloudius-jenkins-test.s3.amazonaws.com/577988a0-bc60-4abe-b176-dd4bea6b8666/20241211_111129/2024_12_10__20_49_36_127.sct-577988a0.log.gz
loader-set-577988a0.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/577988a0-bc60-4abe-b176-dd4bea6b8666/20241211_111129/loader-set-577988a0.tar.gz
monitor-set-577988a0.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/577988a0-bc60-4abe-b176-dd4bea6b8666/20241211_111129/monitor-set-577988a0.tar.gz
core.scylla-longevity-5000-tables-dev-db-node-577988a0-2-2024-12-11_11-30-40.gz - https://storage.cloud.google.com/upload.scylladb.com/core.scylla.112.f5e7a73c317c4845933d1d7afebb91d6.995.1733916173000000/core.scylla.112.f5e7a73c317c4845933d1d7afebb91d6.995.1733916173000000.gz

Jenkins job URL
Argus

github-actions bot assigned vponomaryov Dec 19, 2024

vponomaryov added the Bug (Something isn't working right) label Dec 19, 2024

fruch added master/triage and removed master/triage labels Dec 23, 2024
