[Dynamic Buffer][Mellanox] Skip PGs in pending deleting set while checking accumulative headroom of a port #2871

Conversation

stephenxs
Collaborator

What I did

In the Lua script that checks the accumulative headroom of a port, skip the PGs that are about to be removed.

Signed-off-by: Stephen Sun [email protected]

Why I did it

This is to avoid the following error message:

Jul 26 14:59:04.754566 r-moose-02 ERR swss#buffermgrd: :- guard: RedisReply catches system_error: command: .*, reason: ERR Error running script (call to f_a02f6f856d876d607a7ac81f5bc0890cad68bf71): @user_script:125: user_script:125: attempt to perform arithmetic on local 'current_profile_size' (a nil value): Input/output error

This is because:

  1. Too many notifications belonging to BUFFER_PROFILE_TABLE and BUFFER_PG_TABLE had accumulated in the m_toSync queue.
  2. Even though the buffer manager removed the buffer PG ahead of the buffer profile, the orchagent handled the buffer profile ahead of the buffer PG, i.e. in reverse order. As a result, the buffer PG notification was still in the BUFFER_PG_TABLE_DELSET set and its entry remained in BUFFER_PG_TABLE while the buffer profile had already been removed from BUFFER_PROFILE_TABLE.
  3. When the Lua script checked the accumulative headroom, it fetched all items from the APPL_DB tables and found the to-be-removed buffer PG, but not its buffer profile, which had already been removed in step 2.
  4. As a consequence of step 3, the script complained that current_profile_size was nil.

Fix:
Do not check buffer PGs that are in the BUFFER_PG_TABLE_DELSET.
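
For illustration, here is a minimal, self-contained Python sketch of the fixed check. It is not the actual Lua script run by buffermgrd; the function name, table layouts, and key format (e.g. "Ethernet0:3-4") are assumptions that mirror the APPL_DB tables referenced above. It shows how skipping PGs in the pending-delete set avoids dereferencing a profile that has already been removed.

```python
def accumulative_headroom(port, buffer_pg_table, buffer_profile_table, pending_delete_pgs):
    """Sum the headroom sizes of all PGs configured on `port`.

    PGs whose keys are still in the pending-delete set (BUFFER_PG_TABLE_DELSET
    in the real flow) are skipped, because their profiles may already have been
    removed from BUFFER_PROFILE_TABLE.
    """
    total = 0
    for pg_key, pg_fields in buffer_pg_table.items():
        if not pg_key.startswith(port + ":"):
            continue  # only count PGs belonging to this port
        if pg_key in pending_delete_pgs:
            continue  # the fix: a PG that is about to be removed is not counted
        profile = buffer_profile_table.get(pg_fields["profile"])
        if profile is None:
            # Without the fix, this is the point where the Lua script read a
            # nil 'current_profile_size' and aborted with an arithmetic error.
            continue
        total += int(profile["size"])
    return total


# Scenario from the description: PG 3-4 is pending deletion and its profile
# 'test' has already been removed from BUFFER_PROFILE_TABLE.
pg_table = {
    "Ethernet0:0": {"profile": "ingress_lossy_profile"},
    "Ethernet0:3-4": {"profile": "test"},
}
profile_table = {"ingress_lossy_profile": {"size": "0"}}
print(accumulative_headroom("Ethernet0", pg_table, profile_table, {"Ethernet0:3-4"}))
```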

How I verified it

Regression and manual test.

Details if related

@stephenxs
Collaborator Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@stephenxs
Collaborator Author

The vs test case test_bufferPortMaxParameter is enhanced to guarantee that pending-delete PGs are not counted during the headroom check.
We are not able to add a mock test that reproduces exactly the scenario described in the PR, because that would require simulating the same flow, and the mock test framework cannot invoke the Lua script.

With the fix, the test succeeds:

========================================================================================================== test session starts ==========================================================================================================
platform linux -- Python 3.8.10, pytest-4.6.2, py-1.10.0, pluggy-0.13.1
rootdir: /root/sonic-swss/tests
plugins: flaky-3.7.0
collected 11 items / 10 deselected / 1 selected                                                                                                                                                                                         

test_buffer_dynamic.py .                                                                                                                                                                                                          [100%]

=============================================================================================== 1 passed, 10 deselected in 156.42 seconds ===============================================================================================

Without the fix, the test fails as below (which is expected).

root@r-sonic-vs-021:~/sonic-swss/tests# pytest test_buffer_dynamic.py -k test_bufferPortMaxParameter --dvsname=vs 
========================================================================================================== test session starts ==========================================================================================================
platform linux -- Python 3.8.10, pytest-4.6.2, py-1.10.0, pluggy-0.13.1
rootdir: /root/sonic-swss/tests
plugins: flaky-3.7.0
collected 11 items / 10 deselected / 1 selected                                                                                                                                                                                         

test_buffer_dynamic.py F                                                                                                                                                                                                          [100%]

=============================================================================================================== FAILURES ================================================================================================================
_____________________________________________________________________________________________ TestBufferMgrDyn.test_bufferPortMaxParameter ______________________________________________________________________________________________

self = <test_buffer_dynamic.TestBufferMgrDyn object at 0x7f93eaac13d0>, dvs = <conftest.DockerVirtualSwitch object at 0x7f93eaac1430>, testlog = None

    def test_bufferPortMaxParameter(self, dvs, testlog):
        self.setup_db(dvs)
    
        # Update log level so that we can analyze the log in case the test failed
        logfvs = self.config_db.wait_for_entry("LOGGER", "buffermgrd")
        old_log_level = logfvs.get("LOGLEVEL")
        logfvs["LOGLEVEL"] = "INFO"
        self.config_db.update_entry("LOGGER", "buffermgrd", logfvs)
    
        # Check whether port's maximum parameter has been exposed to STATE_DB
        fvs = self.state_db.wait_for_entry("BUFFER_MAX_PARAM_TABLE", "Ethernet0")
        assert int(fvs["max_queues"]) and int(fvs["max_priority_groups"])
    
        _, oa_pid = dvs.runcmd("pgrep orchagent")
    
        try:
            fvs["max_headroom_size"] = "122880"
            self.state_db.update_entry("BUFFER_MAX_PARAM_TABLE", "Ethernet0", fvs)
    
            # Startup interface
            dvs.port_admin_set('Ethernet0', 'up')
            # Wait for the lossy profile to be handled
            self.app_db.wait_for_field_match("BUFFER_PG_TABLE", "Ethernet0:0", {"profile": "ingress_lossy_profile"})
    
            # Stop orchagent to simulate the scenario that the system is during initialization
            dvs.runcmd("kill -s SIGSTOP {}".format(oa_pid))
    
            # Create a lossless profile
            profile_fvs = {'xon': '19456',
                          'xoff': '10240',
                          'size': '29696',
                          'dynamic_th': '0',
                          'pool': 'ingress_lossless_pool'}
            self.config_db.update_entry('BUFFER_PROFILE', 'test', profile_fvs)
    
            self.config_db.update_entry('BUFFER_PG', 'Ethernet0|3-4', {'profile': 'test'})
    
            # Make sure the entry has been handled by buffermgrd and is pending on orchagent's queue
            self.app_db.wait_for_field_match("_BUFFER_PG_TABLE", "Ethernet0:3-4", {"profile": "test"})
    
            # Should not be added due to the maximum headroom exceeded
            self.config_db.update_entry('BUFFER_PG', 'Ethernet0|1', {'profile': 'ingress_lossy_profile'})
            # Should not be added due to the maximum headroom exceeded
            self.config_db.update_entry('BUFFER_PG', 'Ethernet0|6', {'profile': 'test'})
    
            # Resume orchagent
            dvs.runcmd("kill -s SIGCONT {}".format(oa_pid))
    
            # Check whether BUFFER_PG_TABLE is updated as expected
            self.app_db.wait_for_field_match("BUFFER_PG_TABLE", "Ethernet0:3-4", {"profile": "test"})
    
            keys = self.app_db.get_keys('BUFFER_PG_TABLE')
    
            assert 'Ethernet0:1' not in keys
            assert 'Ethernet0:6' not in keys
    
            # Update the profile
            profile_fvs['size'] = '28672'
            profile_fvs['xoff'] = '9216'
            self.config_db.update_entry('BUFFER_PROFILE', 'test', profile_fvs)
            self.app_db.wait_for_field_match('BUFFER_PROFILE_TABLE', 'test', profile_fvs)
    
            # Verify a pending remove PG is not counted into the accumulative headroom
            dvs.runcmd("kill -s SIGSTOP {}".format(oa_pid))
    
            self.config_db.delete_entry('BUFFER_PG', 'Ethernet0|3-4')
            # Should be added because PG 3-4 has been removed and there are sufficient headroom
            self.config_db.update_entry('BUFFER_PG', 'Ethernet0|1', {'profile': 'ingress_lossy_profile'})
    
            # Resume orchagent
            dvs.runcmd("kill -s SIGCONT {}".format(oa_pid))
    
            # Check whether BUFFER_PG_TABLE is updated as expected
>           self.app_db.wait_for_field_match("BUFFER_PG_TABLE", "Ethernet0:1", {"profile": "ingress_lossy_profile"})

test_buffer_dynamic.py:848: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <dvslib.dvs_database.DVSDatabase object at 0x7f93eaa58400>, table_name = 'BUFFER_PG_TABLE', key = 'Ethernet0:1', expected_fields = {'profile': 'ingress_lossy_profile'}
polling_config = PollingConfig(polling_interval=0.01, timeout=20.0, strict=True), failure_message = None

    def wait_for_field_match(
        self,
        table_name: str,
        key: str,
        expected_fields: Dict[str, str],
        polling_config: PollingConfig = PollingConfig(),
        failure_message: str = None,
    ) -> Dict[str, str]:
        """Wait for the entry stored at `key` to have the specified field/values and retrieve it.
    
        This method is useful if you only care about the contents of a subset of the fields stored
        in the specified entry.
    
        Args:
            table_name: The name of the table where the entry is stored.
            key: The key that maps to the entry being checked.
            expected_fields: The fields and their values we expect to see in the entry.
            polling_config: The parameters to use to poll the db.
            failure_message: The message to print if the call times out. This will only take effect
                if the PollingConfig is set to strict.
    
        Returns:
            The entry stored at `key`. If no entry is found, then an empty Dict is returned.
        """
    
        def access_function():
            fv_pairs = self.get_entry(table_name, key)
            return (
                all(fv_pairs.get(k) == v for k, v in expected_fields.items()),
                fv_pairs,
            )
    
        status, result = wait_for_result(
            access_function, self._disable_strict_polling(polling_config)
        )
    
        if not status:
            message = failure_message or (
                f"Expected field/value pairs not found: expected={expected_fields}, "
                f'received={result}, key="{key}", table="{table_name}"'
            )
>           assert not polling_config.strict, message
E           AssertionError: Expected field/value pairs not found: expected={'profile': 'ingress_lossy_profile'}, received={}, key="Ethernet0:1", table="BUFFER_PG_TABLE"

dvslib/dvs_database.py:239: AssertionError
--------------------------------------------------------------------------------------------------------- Captured stdout setup ---------------------------------------------------------------------------------------------------------
remove extra link dummy
=============================================================================================== 1 failed, 10 deselected in 178.36 seconds ===============================================================================================

@qiluo-msft qiluo-msft merged commit aabe741 into sonic-net:master Aug 14, 2023
14 checks passed
@stephenxs stephenxs deleted the handle-pending-remove-pg-when-check-headroom branch August 14, 2023 23:59
StormLiangMS pushed a commit that referenced this pull request Aug 16, 2023
…cking accumulative headroom of a port (#2871)

StormLiangMS pushed a commit that referenced this pull request Sep 3, 2023
…cking accumulative headroom of a port (#2871)
