disrupt_replace_service_level_using_detach_during_load validate_role_service_level_attributes_against_db fails with: IndexError: list index out of range #9671

yarongilor · 2025-01-07T12:50:34Z

Packages

Scylla version: 2024.3.0~dev-20241220.e8463f6b719f with build-id bb749b8a21072b70e631576259d767ecfe654583

Kernel Version: 6.8.0-1021-aws

Issue description

This issue is a regression.
It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

disrupt_replace_service_level_using_detach_during_load	elasticity-test-nemesis-master-db-node-2556bfba-2	Failed	2024-12-23 19:55:30	2024-12-23 20:31:03
Nemesis Information
Class: Sisyphus
Name: disrupt_replace_service_level_using_detach_during_load
Status: Failed
Failure reason
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5452, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4938, in disrupt_replace_service_level_using_detach_during_load
    self.format_error_for_sla_test_and_raise(error_events=error_events)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5031, in format_error_for_sla_test_and_raise
    raise NemesisSubTestFailure("\n".join(f"Step: {error.step}. Error:\n - {error}"
sdcm.exceptions.NemesisSubTestFailure: Step: Run stress command and validate io_queue_operations during load. Error:
 - (TestStepEvent Severity.ERROR) period_type=end event_id=afcdaccf-4177-4637-80ab-916770064553 during_nemesis=ReplaceServiceLevelUsingDetachDuringLoad duration=10m20s: step=Run stress command and validate io_queue_operations during load  errors=list index out of range
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sla/sla_tests.py", line 31, in run_stress_and_validate_scheduler_io_queue_operations_during_load
    self.validate_io_queue_operations(start_time=start_time,
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sla/libs/sla_utils.py", line 157, in validate_io_queue_operations
    user['role'].validate_role_service_level_attributes_against_db()
  File "/home/ubuntu/scylla-cluster-tests/test_lib/sla.py", line 386, in validate_role_service_level_attributes_against_db
    LOGGER.debug("Service level from LIST: %s", service_level[0].service_level)
IndexError: list index out of range

Step: Attach service level 'sl50_dda50134' with 50 shares to role500_dda50134. Validate io_queue_operations during load. Error:
 - (TestStepEvent Severity.ERROR) period_type=end event_id=b6287188-84d6-4586-b294-aed18dedd245 during_nemesis=ReplaceServiceLevelUsingDetachDuringLoad duration=0s: step=Attach service level 'sl50_dda50134' with 50 shares to role500_dda50134. Validate io_queue_operations during load  errors=list index out of range
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sla/sla_tests.py", line 121, in attach_sl_and_validate_io_queue_operations
    role_for_attach.validate_role_service_level_attributes_against_db()
  File "/home/ubuntu/scylla-cluster-tests/test_lib/sla.py", line 386, in validate_role_service_level_attributes_against_db
    LOGGER.debug("Service level from LIST: %s", service_level[0].service_level)
IndexError: list index out of range

Installation details

Cluster size: 3 nodes (i4i.large)

Scylla Nodes used in this run:

elasticity-test-nemesis-master-db-node-2556bfba-3 (52.210.179.65 | 10.4.14.61) (shards: 2)
elasticity-test-nemesis-master-db-node-2556bfba-2 (54.216.17.80 | 10.4.15.159) (shards: 2)
elasticity-test-nemesis-master-db-node-2556bfba-1 (54.220.142.116 | 10.4.15.152) (shards: 2)

OS / Image: ami-0af6c6b0a814f34a7 (aws: undefined_region)

Test: byo-longevity-test-yg2
Test id: 2556bfba-bff7-4ec5-833d-312330270ab4
Test name: scylla-staging/yarongilor/byo-longevity-test-yg2
Test method: longevity_sla_test.LongevitySlaTest.test_custom_time
Test config file(s):

perf-regression-latency-i4i_2xlarge-elasticity-90-percent-with-nemesis.yaml

Logs and commands

Restore Monitor Stack command: $ hydra investigate show-monitor 2556bfba-bff7-4ec5-833d-312330270ab4
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 2556bfba-bff7-4ec5-833d-312330270ab4

Logs:

db-cluster-2556bfba.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/2556bfba-bff7-4ec5-833d-312330270ab4/20241223_220910/db-cluster-2556bfba.tar.gz
sct-runner-events-2556bfba.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/2556bfba-bff7-4ec5-833d-312330270ab4/20241223_220910/sct-runner-events-2556bfba.tar.gz
sct-2556bfba.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/2556bfba-bff7-4ec5-833d-312330270ab4/20241223_220910/sct-2556bfba.log.tar.gz
loader-set-2556bfba.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/2556bfba-bff7-4ec5-833d-312330270ab4/20241223_220910/loader-set-2556bfba.tar.gz
monitor-set-2556bfba.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/2556bfba-bff7-4ec5-833d-312330270ab4/20241223_220910/monitor-set-2556bfba.tar.gz

Jenkins job URL
Argus

fruch · 2025-01-13T08:49:19Z

@yarongilor

You are not running it with the configuration these nemeses need:

< t:2024-12-23 17:15:03,765 f:sct_config.py   l:2144 c:sdcm.sct_config      p:INFO  > config_files:
< t:2024-12-23 17:15:03,765 f:sct_config.py   l:2144 c:sdcm.sct_config      p:INFO  > - test-cases/performance/perf-regression-latency-i4i_2xlarge-elasticity-90-percent-with-nemesis.yaml
< t:2024-12-23 17:15:03,765 f:sct_config.py   l:2144 c:sdcm.sct_config      p:INFO  > - configurations/disable_kms.yaml
< t:2024-12-23 17:15:03,765 f:sct_config.py   l:2144 c:sdcm.sct_config      p:INFO  > - configurations/network_config/two_interfaces.yaml
< t:2024-12-23 17:15:03,765 f:sct_config.py   l:2144 c:sdcm.sct_config      p:INFO  > - configurations/ldap-authorization.yaml

'configurations/nemesis/additional_configs/sla_config.yaml' is missing

yarongilor · 2025-01-13T10:24:58Z

@fruch, I wouldn't guess that from an "IndexError: list index out of range" error,
so I think it is still an SCT issue.

pehala · 2025-01-13T11:12:14Z

In this case, it is either a documentation issue or an enhancement to the existing nemesis, so I would create a new issue for it with a more focused description.

fruch · 2025-01-13T11:21:17Z

@yarongilor
how did you know to add configurations/network_config/two_interfaces.yaml and configurations/ldap-authorization.yaml ?
it's documented exactly the same way

also seems you added the following boolean in your configuration :

        # Temporary solution. We do not want to run SLA nemeses during not-SLA test until the feature is stable
        dict(name="sla", env="SCT_SLA", type=boolean,
             help="run SLA nemeses if the test is SLA only"),

So what documentation do you expect? (If you expect it, please write it.)

yarongilor · 2025-01-13T15:48:24Z

@fruch, as can be seen above, this parameter's help text is not detailed enough to understand what's missing. I actually still have no idea what's missing.
As for configurations/ldap-authorization.yaml: I know to add it because the LDAP nemesis specifically reports what is missing when it skips running.
I have a vague feeling about the root cause here. You mentioned that configurations/nemesis/additional_configs/sla_config.yaml is missing, but which specific parameter is it? The nemesis didn't skip for missing parameters. That is because the test used configurations/ldap-authorization.yaml, which has:

use_ldap: true
ldap_server_type: 'openldap'
use_ldap_authorization: true
authenticator: 'PasswordAuthenticator'
authenticator_user: cassandra
authenticator_password: cassandra
authorizer: 'CassandraAuthorizer'

Could it be that the combination of LDAP and SLA is somehow the root of the problem?
If so, could we add something like this to the SLA nemeses:

        if self.cluster.params.get('use_ldap'):
            raise UnsupportedNemesis("SLA feature can't work with Ldap")

Or should we adjust the SLA nemesis code to work with LDAP somehow?
I'm not too familiar with SLA, so perhaps @juliayakovlev could advise here.
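
A minimal sketch of such a guard, assuming the nemesis can read the test parameters as a plain mapping. `check_sla_nemesis_supported` is a hypothetical helper, `UnsupportedNemesis` is a local stand-in for the SCT exception, and the LDAP incompatibility is only the unconfirmed guess from this comment. The point is that a skip with an explicit reason is easier to act on than a late IndexError:

```python
class UnsupportedNemesis(Exception):
    """Local stand-in for the SCT exception used to skip a nemesis."""


def check_sla_nemesis_supported(params: dict) -> None:
    """Skip early, naming the parameter that blocks the SLA nemesis.

    Hypothetical helper: both conditions are assumptions taken from this
    thread, not confirmed SCT behavior.
    """
    # The 'sla' boolean is the parameter quoted earlier in this thread.
    if not params.get("sla"):
        raise UnsupportedNemesis(
            "SLA nemesis needs the 'sla' parameter enabled in the test configuration"
        )
    # Unconfirmed assumption: SLA and LDAP cannot be combined.
    if params.get("use_ldap"):
        raise UnsupportedNemesis("SLA feature can't work with Ldap")
```

With this shape, a run that lacks the SLA configuration would report the missing parameter by name when the nemesis is skipped.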

juliayakovlev · 2025-01-13T17:21:55Z

By SCT log I see that Service level 'sl250_dda50134' has been created and attached to role role250_dda50134.

< t:2024-12-23 19:55:30,348 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-12-23 19:55:30.345: (DisruptionEvent Severity.NORMAL) period_type=begin event_id=48401a81-6ddc-4e7b-a04d-082f6ec94a1b: nemesis_name=ReplaceServiceLevelUsingDetachDuringLoad target_node=Node elasticity-test-nemesis-master-db-node-2556bfba-2 [54.216.17.80 | 10.4.15.159]

< t:2024-12-23 19:55:31,641 f:sla.py          l:361  c:test_lib.sla         p:DEBUG > CREATE role query: CREATE ROLE IF NOT EXISTS role250_dda50134 WITH password = 'rolep250' AND login = True AND superuser = True
< t:2024-12-23 19:55:31,641 f:common.py       l:1317 c:utils                p:DEBUG > Executing CQL 'CREATE ROLE IF NOT EXISTS role250_dda50134 WITH password = 'rolep250' AND login = True AND superuser = True' ...
< t:2024-12-23 19:55:31,737 f:sla.py          l:150  c:test_lib.sla         p:DEBUG > Create service level query: CREATE SERVICE_LEVEL IF NOT EXISTS 'sl250_dda50134' WITH shares = 250
< t:2024-12-23 19:55:31,737 f:common.py       l:1317 c:utils                p:DEBUG > Executing CQL 'CREATE SERVICE_LEVEL IF NOT EXISTS 'sl250_dda50134' WITH shares = 250' ...
< t:2024-12-23 19:55:31,737 f:common.py       l:1325 c:utils                p:DEBUG > Executing CQL 'CREATE SERVICE_LEVEL IF NOT EXISTS 'sl250_dda50134' WITH shares = 250' ...
< t:2024-12-23 19:55:31,752 f:sla.py          l:152  c:test_lib.sla         p:DEBUG > Service level 'sl250_dda50134' has been created
< t:2024-12-23 19:55:31,752 f:sla.py          l:273  c:test_lib.sla         p:DEBUG > Attach service level query: ATTACH SERVICE_LEVEL 'sl250_dda50134' TO role250_dda50134
< t:2024-12-23 19:55:31,752 f:common.py       l:1317 c:utils                p:DEBUG > Executing CQL 'ATTACH SERVICE_LEVEL 'sl250_dda50134' TO role250_dda50134' ...
< t:2024-12-23 19:55:32,208 f:sla_utils.py    l:405  c:sdcm.sla.libs.sla_utils p:DEBUG > Start wait for service level propagated for service_level 'sl250_dda50134' on the node 'elasticity-test-nemesis-master-db-node-2556bfba-1'
< t:2024-12-23 19:55:33,361 f:sla_utils.py    l:425  c:sdcm.sla.libs.sla_utils p:DEBUG > Finish wait for service level propagated for service_level 'sl250_dda50134' on the node 'elasticity-test-nemesis-master-db-node-2556bfba-1'
< t:2024-12-23 19:55:33,361 f:sla_utils.py    l:405  c:sdcm.sla.libs.sla_utils p:DEBUG > Start wait for service level propagated for service_level 'sl250_dda50134' on the node 'elasticity-test-nemesis-master-db-node-2556bfba-2'
< t:2024-12-23 19:55:34,477 f:sla_utils.py    l:425  c:sdcm.sla.libs.sla_utils p:DEBUG > Finish wait for service level propagated for service_level 'sl250_dda50134' on the node 'elasticity-test-nemesis-master-db-node-2556bfba-2'
< t:2024-12-23 19:55:34,477 f:sla_utils.py    l:405  c:sdcm.sla.libs.sla_utils p:DEBUG > Start wait for service level propagated for service_level 'sl250_dda50134' on the node 'elasticity-test-nemesis-master-db-node-2556bfba-3'
< t:2024-12-23 19:55:35,596 f:sla_utils.py    l:425  c:sdcm.sla.libs.sla_utils p:DEBUG > Finish wait for service level propagated for service_level 'sl250_dda50134' on the node 'elasticity-test-nemesis-master-db-node-2556bfba-3'

But 'LIST ATTACHED SERVICE_LEVEL OF role250_dda50134' returned an empty list, which is very odd: it is as if no SL is attached to the role, although one should be. This caused the IndexError:

< t:2024-12-23 20:06:00,243 f:sla.py          l:317  c:test_lib.sla         p:DEBUG > List attached service level(s) query: LIST ATTACHED SERVICE_LEVEL OF role250_dda50134
< t:2024-12-23 20:06:00,243 f:common.py       l:1325 c:utils                p:DEBUG > Executing CQL 'LIST ATTACHED SERVICE_LEVEL OF role250_dda50134' ...
< t:2024-12-23 20:06:00,246 f:sla.py          l:374  c:test_lib.sla         p:DEBUG > List of service levels for role role250_dda50134: []
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-12-23 20:06:00.247: (TestStepEvent Severity.ERROR) period_type=end event_id=afcdaccf-4177-4637-80ab-916770064553 during_nemesis=ReplaceServiceLevelUsingDetachDuringLoad duration=10m20s: step=Run stress command and validate io_queue_operations during load  errors=list index out of range
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Traceback (most recent call last):
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/sla/sla_tests.py", line 31, in run_stress_and_validate_scheduler_io_queue_operations_during_load
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     self.validate_io_queue_operations(start_time=start_time,
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/sla/libs/sla_utils.py", line 157, in validate_io_queue_operations
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     user['role'].validate_role_service_level_attributes_against_db()
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/test_lib/sla.py", line 386, in validate_role_service_level_attributes_against_db
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     LOGGER.debug("Service level from LIST: %s", service_level[0].service_level)
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > IndexError: list index out of range

According to the SCT code, the failing line LOGGER.debug("Service level from LIST: %s", service_level[0].service_level) should not have been reached.
The function was expected to raise ValueError because the service_level list is empty, or at least to log "No attached Service level to the role %s" and return:

    def validate_role_service_level_attributes_against_db(self):
        service_level = self.list_user_role_attached_service_levels()
        LOGGER.debug("List of service levels for role %s: %s", self.name, service_level)
        if not service_level and self.attached_service_level:
            ValueError(f"No Service Level attached to the role '{self.name}'. But it is expected that Service Level "
                       f"'{self._attached_service_level_name}' is attached to this role. Validate if it is test or "
                       "Scylla issue")
        elif not self.attached_service_level and service_level:
            ValueError(f"Found attached Service Level '{service_level[0].service_level}' to the role '{self.name}'. "
                       "But it is expected that no attached Service Level. Validate if it is test or Scylla issue")
        elif not service_level and not self.attached_service_level:
            LOGGER.debug("No attached Service level to the role %s", self.name)
            return

        LOGGER.debug("Service level from LIST: %s", service_level[0].service_level)

So there are 2 problems here:

1. 'LIST ATTACHED SERVICE_LEVEL OF role250_dda50134' returned empty list unexpectedly
2. ValueError was not raised
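
The second problem comes from the ValueError(...) expressions in the function above being constructed but never raised, so execution falls through to service_level[0]. A minimal sketch of the corrected control flow follows. `RoleStub` and `AttachedServiceLevel` are hypothetical stand-ins for the real classes in test_lib/sla.py, and the database query is replaced by canned rows:

```python
import logging
from dataclasses import dataclass, field
from typing import List, Optional

LOGGER = logging.getLogger(__name__)


@dataclass
class AttachedServiceLevel:
    """Hypothetical stand-in for one row of LIST ATTACHED SERVICE_LEVEL."""
    role: str
    service_level: str


@dataclass
class RoleStub:
    """Hypothetical, simplified stand-in for the role class in test_lib/sla.py."""
    name: str
    attached_service_level_name: Optional[str] = None  # what the test expects
    rows_from_db: List[AttachedServiceLevel] = field(default_factory=list)

    def list_user_role_attached_service_levels(self) -> List[AttachedServiceLevel]:
        # The real method runs the CQL query; this stub returns canned rows.
        return self.rows_from_db

    def validate_role_service_level_attributes_against_db(self) -> None:
        service_level = self.list_user_role_attached_service_levels()
        LOGGER.debug("List of service levels for role %s: %s", self.name, service_level)
        expected = self.attached_service_level_name

        if not service_level and expected:
            # 'raise' is the keyword missing from the quoted function.
            raise ValueError(
                f"No Service Level attached to the role '{self.name}'. But it is "
                f"expected that Service Level '{expected}' is attached to this role."
            )
        if service_level and not expected:
            raise ValueError(
                f"Found attached Service Level '{service_level[0].service_level}' "
                f"to the role '{self.name}'. But it is expected that no Service "
                "Level is attached."
            )
        if not service_level:
            LOGGER.debug("No attached Service level to the role %s", self.name)
            return

        # Reached only when the list is non-empty, so indexing is safe.
        LOGGER.debug("Service level from LIST: %s", service_level[0].service_level)
```

With the raise in place, the run in this issue would have failed with the "No Service Level attached" message, which points at the first problem directly.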

About LDAP and SLA: I do not know how LDAP might break the test.

temichus · 2025-01-15T12:16:24Z

Problem 2, at least, can be easily solved: #9829

github-actions bot assigned yarongilor Jan 7, 2025

yarongilor assigned temichus and unassigned yarongilor Jan 7, 2025

yarongilor mentioned this issue Jan 7, 2025

Run existing nemesis with 90% storage utilization test function #9155

Open

fruch closed this as completed Jan 13, 2025

yarongilor reopened this Jan 13, 2025
