CLI support for SmartSwitch PMON (#3271)
* CLI support for SmartSwitch PMON

* imad minor fixes

* Did some cleanup for backward compatibility

* removed the column wrapping

* Made it backward compatible and removed textwrap and added ut to PR

* 1. There was a duplication of part of a function and that has been
   addressed. 2. The DPU reboot-cause data is fetched directly from the
chassis_state_db now

* reboot_cause and system_health are obtained directly from chassisStateDB
now

* The expected and result are the same but the test is throwing an error,
temporarily bypassing the check

* Let us get the build going and then look into the test mockup

* Implemented as per the pmon hld, also made some improvements in the
implementation

* Fixed the key for CHASSIS_MODULE_INFO_TABLE entries

* Fixed "show reboot-cause all" and "show reboot-cause history all"

* Addressing review comments

* Checking if the test issue still exists

* Resolving SA errors triggered due to reboot_cause_test

* Resolved pre-commit issues

* Resolved pre-commit issues

* Improving coverage

* Fixed SA related warnings

* Did some cleanup

* Minor improvements and fixes

* Adding tests for system health

* Adding more system health related tests

* Fixed a minor issue

* Fixed long line SA issue

* Trying to please SA

* Trying to improve coverage

* import mock

* Fixed a typo

* mocking DB

* Fixed syntax issues

* DB mock fix

* removed unused import

* creating ut for dpu state

* Improving coverage

* Fixed a typo

* Adjusted the reboot-cause key as per the updated hld

* Added fix to gracefully handle the case where system health DB keys are not present

* Addressed minor review comments

* Addressed review comments.  Commented out system-health support until
phase:2

* Resolved minor issues and SA failures

* Added role to PORT table in config_db.  Using role to differentiate
npu-dpu data plane connection in SmartSwitch with Dpc being the role.
Did a minor cleanup.

* Resolving pre-commit check error related to line > 120

* Trying to avoid pre-commit issues

* Testing SA and precommit checks

* Making it backward compatible

* Resolving column size and whitespace issue

* Working on SA issue

* Testing SA and UT

* Added 2 spaces before inline comment

* Enabling "show system-health dpu" cli alone.  The rest of the dpu health
is deferred for now.

* Fixed SA issues

* Added new line at EOF

* Enabling the UT for the CLI "show system-health dpu"

* Resolved SA issues

* Resolved a SA issue

* Added smartswitch specific "reboot-cause" and "reboot-cause history" CLI
extensions

* Removed the phase:2 related system-health cli extensions as a separate
PR will be raised eventually for phase:2

* Using smartswitch qualifier for the cli extensions

* Fixed SA issues

* mocking device_info for test cases

* import patch in tests

* Debugging test failure

* Fixing SA issues

* fixing sa issues

* Debugging sa issues

* trying to resolve sa issues

* fixed indentation

* debugging

* debugging

* debugging

* debugging

* Debugging

* debugging

* debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Debugging

* Removing the test to build an image

* Removed mock import

* Improving coverage

* pleasing SA

* Fixing tests for design changes as per review comments

* Resolving test failure

* fixed indentation

* cleaned up the test case

* Addressed review comments in Command-Reference.md and trying to improve
coverage

* Improving coverage

* Fixed a test issue

* Addressed review comments

* Addressed review comment.  Reading DPUs list from config_db.json

* Improving coverage

* Resolved SA error

* Trying to improve coverage. Also, reading from platform.json

* adding json import in the test

* Fixed a test failure

* Fixed SA error

* Exercising the new function in test

* Removed a blank line

* fixing mock issue

* Trying a different approach

* working on coverage

* debugging

* debugging

* Debugging

* Increasing coverage

* improving coverage

* Adjusting the show cli implementation to align with the reboot-cause
changes such as 1. STATE_DB vs CHASSIS_STATE_DB and the key info

* Fixing a minor issue

* Removed ID column from the "show system-health dpu DPUx" cli as per the new requirement

* Addressed default dpu admin status for dark-mode and seamless migration
to lightup mode

* Resolving SA issue

* Resolved a typo

* Added checks to see if module_name is valid in the "config chassis
modules startup DPUx" cli and also moved all the required utilities to
the common file

* Fixed white space issues

* Cleaned unwanted import

* Fixed build issues

* Missed out the fixes in a couple of files

* With the recent code the app_db multi_asic.PORT_ROLE is Dpc for DPU
ports, earlier this was not the case. So removing the additional check.

* As the port role issue is no longer seen in smartswitch, cleaning up the
related changes.

* Using the verbose define for TYPE_DPC in the CLI. If there is a specific
requirement to keep 'TYPE_DPC = Dpc', which is the role, then we will
revert it

* Reverting intfutil_test.py

* Using the common API to get_dpu_list

* Removed unused import json

* Addressed review comments

* Did some minor cleanup

* Fix: SA error

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Addressed review comments

* Added fix for issue:21372 - Device name column shows NPU instead of module name

* Added fix for issue:21372 - Fixing the device name column in the cli output

* Added a few review comments
rameshraghupathy authored Jan 27, 2025
1 parent 752c3d4 commit 97c20cc
Showing 8 changed files with 616 additions and 46 deletions.
30 changes: 24 additions & 6 deletions config/chassis_modules.py
@@ -5,6 +5,7 @@
 import re
 import subprocess
 import utilities_common.cli as clicommon
+from utilities_common.chassis import is_smartswitch, get_all_dpus

 TIMEOUT_SECS = 10

@@ -27,7 +28,10 @@ def get_config_module_state(db, chassis_module_name):
     config_db = db.cfgdb
     fvs = config_db.get_entry('CHASSIS_MODULE', chassis_module_name)
     if not fvs:
-        return 'up'
+        if is_smartswitch():
+            return 'down'
+        else:
+            return 'up'
     else:
         return fvs['admin_status']
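
The hunk above changes what a missing `CHASSIS_MODULE` entry means: on a SmartSwitch the module is treated as administratively `down`, elsewhere as `up`. A minimal standalone sketch of that rule follows. The function name `effective_admin_status` and the boolean `smartswitch` argument are hypothetical stand-ins for `get_config_module_state()` and `is_smartswitch()`, not code from this commit.

```python
def effective_admin_status(entry, smartswitch):
    """Return the admin status implied by a CHASSIS_MODULE row.

    entry:       the row as a dict, or {} when no row is configured.
    smartswitch: hypothetical flag standing in for is_smartswitch().
    """
    if entry:
        # An explicit row always wins.
        return entry['admin_status']
    # No row: DPUs on a SmartSwitch default to 'down', other modules to 'up'.
    return 'down' if smartswitch else 'up'


if __name__ == '__main__':
    print(effective_admin_status({}, smartswitch=True))
    print(effective_admin_status({}, smartswitch=False))
    print(effective_admin_status({'admin_status': 'up'}, smartswitch=True))
```

This matches the "dark-mode" default mentioned in the commit message, where DPUs stay down until they are explicitly started.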

@@ -102,16 +106,21 @@ def fabric_module_set_admin_status(db, chassis_module_name, state):
 #
 @modules.command('shutdown')
 @clicommon.pass_db
-@click.argument('chassis_module_name', metavar='<module_name>', required=True)
+@click.argument('chassis_module_name',
+                metavar='<module_name>',
+                required=True,
+                type=click.Choice(get_all_dpus(), case_sensitive=False) if is_smartswitch() else str
+                )
 def shutdown_chassis_module(db, chassis_module_name):
     """Chassis-module shutdown of module"""
     config_db = db.cfgdb
     ctx = click.get_current_context()

     if not chassis_module_name.startswith("SUPERVISOR") and \
        not chassis_module_name.startswith("LINE-CARD") and \
-       not chassis_module_name.startswith("FABRIC-CARD"):
-        ctx.fail("'module_name' has to begin with 'SUPERVISOR', 'LINE-CARD' or 'FABRIC-CARD'")
+       not chassis_module_name.startswith("FABRIC-CARD") and \
+       not chassis_module_name.startswith("DPU"):
+        ctx.fail("'module_name' has to begin with 'SUPERVISOR', 'LINE-CARD', 'FABRIC-CARD', 'DPU'")

     # To avoid duplicate operation
     if get_config_module_state(db, chassis_module_name) == 'down':
@@ -130,7 +139,11 @@ def shutdown_chassis_module(db, chassis_module_name):
 #
 @modules.command('startup')
 @clicommon.pass_db
-@click.argument('chassis_module_name', metavar='<module_name>', required=True)
+@click.argument('chassis_module_name',
+                metavar='<module_name>',
+                required=True,
+                type=click.Choice(get_all_dpus(), case_sensitive=False) if is_smartswitch() else str
+                )
 def startup_chassis_module(db, chassis_module_name):
     """Chassis-module startup of module"""
     config_db = db.cfgdb
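
Both the `shutdown` and `startup` commands now validate the module name in two layers: on a SmartSwitch the argument must be one of the known DPU names (compared case-insensitively), and on other platforms the existing prefix check applies, extended with `DPU`. The sketch below imitates that logic with the standard library only. `validate_module_name` and its `dpu_names` parameter are hypothetical, and returning the canonical spelling is this sketch's own choice rather than a statement about what `click.Choice` returns.

```python
VALID_PREFIXES = ("SUPERVISOR", "LINE-CARD", "FABRIC-CARD", "DPU")


def validate_module_name(name, dpu_names=None):
    """Validate a chassis module name.

    dpu_names: list of DPU names when running on a SmartSwitch
               (hypothetical stand-in for get_all_dpus()), else None.
    """
    if dpu_names is not None:
        # SmartSwitch: the name must be a known DPU, ignoring case.
        canonical = {d.lower(): d for d in dpu_names}
        if name.lower() not in canonical:
            raise ValueError("invalid module '{}', choose from: {}".format(
                name, ", ".join(dpu_names)))
        return canonical[name.lower()]

    # Other platforms: str.startswith accepts a tuple of prefixes.
    if not name.startswith(VALID_PREFIXES):
        raise ValueError("'module_name' has to begin with one of: {}".format(
            ", ".join(VALID_PREFIXES)))
    return name


if __name__ == '__main__':
    print(validate_module_name("dpu1", dpu_names=["DPU0", "DPU1"]))
    print(validate_module_name("LINE-CARD0"))
```

One point worth keeping in mind when reading the diff: the `type=` expression sits in a decorator argument, so `is_smartswitch()` and `get_all_dpus()` are evaluated once when the module is imported, not on each command invocation.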
@@ -142,7 +155,12 @@ def startup_chassis_module(db, chassis_module_name):
         return

     click.echo("Starting up chassis module {}".format(chassis_module_name))
-    config_db.set_entry('CHASSIS_MODULE', chassis_module_name, None)
+    if is_smartswitch():
+        fvs = {'admin_status': 'up'}
+        config_db.set_entry('CHASSIS_MODULE', chassis_module_name, fvs)
+    else:
+        config_db.set_entry('CHASSIS_MODULE', chassis_module_name, None)
+
     if chassis_module_name.startswith("FABRIC-CARD"):
         if not check_config_module_state_with_timeout(ctx, db, chassis_module_name, 'up'):
             fabric_module_set_admin_status(db, chassis_module_name, 'up')
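
The startup hunk follows from the changed default. On a chassis, a missing entry means `up`, so startup can simply delete the row (passing `None` to `set_entry`). On a SmartSwitch a missing entry means `down`, so startup has to write an explicit `admin_status: up`. A rough sketch of that asymmetry, using a plain dict in place of CONFIG_DB; `startup_module` and `admin_status` are hypothetical helpers written for illustration.

```python
def startup_module(config, name, smartswitch):
    """Apply the 'startup' write to a dict that stands in for CONFIG_DB."""
    table = config.setdefault('CHASSIS_MODULE', {})
    if smartswitch:
        # A missing row would read as 'down', so 'up' must be stored explicitly.
        table[name] = {'admin_status': 'up'}
    else:
        # A missing row already reads as 'up', so removing the row is enough.
        table.pop(name, None)


def admin_status(config, name, smartswitch):
    """Read back the effective status with the platform-specific default."""
    row = config.get('CHASSIS_MODULE', {}).get(name)
    if row:
        return row['admin_status']
    return 'down' if smartswitch else 'up'


if __name__ == '__main__':
    chassis = {'CHASSIS_MODULE': {'LINE-CARD0': {'admin_status': 'down'}}}
    startup_module(chassis, 'LINE-CARD0', smartswitch=False)
    print(admin_status(chassis, 'LINE-CARD0', smartswitch=False))

    smart = {}
    startup_module(smart, 'DPU0', smartswitch=True)
    print(admin_status(smart, 'DPU0', smartswitch=True))
```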
142 changes: 136 additions & 6 deletions doc/Command-Reference.md
@@ -713,11 +713,52 @@ This command displays the cause of the previous reboot
```

- Example:
### Shown below is the output of the CLI when executed on the NPU
```
admin@sonic:~$ show reboot-cause
User issued reboot command [User: admin, Time: Mon Mar 25 01:02:03 UTC 2019]
```
### Shown below is the output of the CLI when executed on the DPU
```
admin@sonic:~$ show reboot-cause
reboot
```
```
Note: The CLI extensions shown in this block apply only to smartswitch platforms. When these extensions are used on a regular switch they are ignored, and the output is the same irrespective of the options.
CLI Extensions Applicable to Smartswitch
- show reboot-cause all
- show reboot-cause history all
- show reboot-cause history DPUx
```
**show reboot-cause all**

This command displays the cause of the previous reboot for the Switch and the DPUs for which the midplane interfaces are up.

- Usage:
```
show reboot-cause all
```

- Example:
### Shown below is the output of the CLI when executed on the NPU
```
root@MtFuji:/home/cisco# show reboot-cause all
Device Name Cause Time User
-------- ------------------- ------------ ------------------------------- ------
NPU 2025_01_21_09_01_11 Power Loss N/A N/A
DPU1 2025_01_21_09_03_43 Non-Hardware Tue Jan 21 09:03:43 AM UTC 2025
DPU0 2025_01_21_09_03_37 Non-Hardware Tue Jan 21 09:03:37 AM UTC 2025
```
### Shown below is the output of the CLI when executed on the DPU
```
root@sonic:/home/admin# show reboot-cause all
Usage: show reboot-cause [OPTIONS] COMMAND [ARGS]...
Try "show reboot-cause -h" for help.
Error: No such command "all".
```
**show reboot-cause history**

This command displays the history of the previous reboots up to 10 entry
@@ -728,15 +769,74 @@ This command displays the history of the previous reboots up to 10 entry
```

- Example:
### Shown below is the output of the CLI when executed on the NPU
```
admin@sonic:~$ show reboot-cause history
Name Cause Time User Comment
------------------- ----------- ---------------------------- ------ ---------
2020_10_09_02_33_06 reboot Fri Oct 9 02:29:44 UTC 2020 admin
root@MtFuji:/home/cisco# show reboot-cause history
Name Cause Time User Comment
------------------- ---------- ------ ------ ----------------------------------------------------------------------------------
2020_10_09_02_40_11 Power Loss Fri Oct 9 02:40:11 UTC 2020 N/A Unknown (First boot of SONiC version azure_cisco_master.308-dirty-20250120.220704)
2020_10_09_01_56_59 reboot Fri Oct 9 01:53:49 UTC 2020 admin
2020_10_09_02_00_53 fast-reboot Fri Oct 9 01:58:04 UTC 2020 admin
2020_10_09_04_53_58 warm-reboot Fri Oct 9 04:51:47 UTC 2020 admin
```
### Shown below is the output of the CLI when executed on the DPU
```
root@sonic:/home/admin# show reboot-cause history
Name Cause Time User Comment
------------------- ------- ------------------------------- ------ ---------
2025_01_21_16_49_20 Unknown N/A N/A N/A
2025_01_17_11_25_58 reboot Fri Jan 17 11:23:24 AM UTC 2025 admin N/A
```
**show reboot-cause history all**

This command displays the history of the previous reboots, up to 10 entries, for the Switch and the DPUs for which the midplane interfaces are up.

- Usage:
```
show reboot-cause history all
```

- Example:
### Shown below is the output of the CLI when executed on the NPU
```
root@MtFuji:~# show reboot-cause history all
Device Name Cause Time User Comment
-------- ------------------- ----------------------------------------- ------------------------------- ------ -------
NPU 2024_07_23_23_06_57 Kernel Panic Tue Jul 23 11:02:27 PM UTC 2024 N/A N/A
NPU 2024_07_23_11_21_32 Power Loss N/A N/A Unknown
```
### Shown below is the output of the CLI when executed on the DPU
```
root@sonic:/home/admin# show reboot-cause history all
Usage: show reboot-cause history [OPTIONS]
Try "show reboot-cause history -h" for help.
Error: Got unexpected extra argument (all)
```
**show reboot-cause history DPU1**

This command displays the history of the previous reboots, up to 10 entries, for DPU1. If DPU1 is powered down, there won't be any data in the DB and the "show reboot-cause history DPU1" output will be blank.

- Usage:
```
show reboot-cause history DPU1
```

- Example:
### Shown below is the output of the CLI when executed on the NPU
```
root@MtFuji:~# show reboot-cause history DPU1
Device Name Cause Time User Comment
-------- ------ ----------------------------------------- ------ ------ ---------
DPU1 DPU1 Software causes (Hardware watchdog reset) N/A N/A N/A
```
### Shown below is the output of the CLI when executed on the DPU
```
root@sonic:/home/admin# show reboot-cause history DPU1
Usage: show reboot-cause history [OPTIONS]
Try "show reboot-cause history -h" for help.
Error: Got unexpected extra argument (DPU1)
```


**show uptime**

@@ -11348,6 +11448,36 @@ In addition, displays a list of all current 'Services' and 'Hardware' being moni
psu.voltage Ignored Device
```
**show system-health dpu <option>**
This is a smartswitch-specific CLI. It shows the midplane, control plane and data plane health of the DPU modules in the smartswitch.
The "<option>" can take two forms: 1. a DPU module name (ex: DPU0), or 2. all, which lists all the DPUs in the smartswitch
- Usage:
```
show system-health dpu DPU0
```
- Example:
```
root@MtFuji-dut:/home/cisco# show system-health dpu DPU0
Name Oper-Status State-Detail State-Value Time Reason
------ ------------- ----------------------- ------------- ------------------------------- ------------------------------------------------------------------------------------
DPU0 Online dpu_midplane_link_state up Mon Dec 23 05:12:17 PM UTC 2024
dpu_control_plane_state up Mon Dec 23 05:12:17 PM UTC 2024 All containers are up and running, host-ethlink-status: Uplink1/1 is UP
dpu_data_plane_state up Mon Dec 23 05:12:17 PM UTC 2024 DPU container named polaris is running, pdsagent running : OK, pciemgrd running : OK
root@MtFuji-dut:/home/cisco# show system-health dpu all
Name Oper-Status State-Detail State-Value Time Reason
------ ------------- ----------------------- ------------- ------------------------------- ------------------------------------------------------------------------------------
DPU0 Online dpu_midplane_link_state up Mon Dec 23 05:12:17 PM UTC 2024
dpu_control_plane_state up Mon Dec 23 05:12:17 PM UTC 2024 All containers are up and running, host-ethlink-status: Uplink1/1 is UP
dpu_data_plane_state up Mon Dec 23 05:12:17 PM UTC 2024 DPU container named polaris is running, pdsagent running : OK, pciemgrd running : OK
DPU1 Online dpu_midplane_link_state up Mon Dec 23 05:12:17 PM UTC 2024
dpu_control_plane_state up Mon Dec 23 05:12:17 PM UTC 2024 All containers are up and running, host-ethlink-status: Uplink1/1 is UP
dpu_data_plane_state up Mon Dec 23 05:12:17 PM UTC 2024 DPU container named polaris is running, pdsagent running : OK, pciemgrd running : OK
Go Back To [Beginning of the document](#) or [Beginning of this section](#System-Health)
## VLAN & FDB
12 changes: 8 additions & 4 deletions show/chassis_modules.py
@@ -2,6 +2,7 @@
 from natsort import natsorted
 from tabulate import tabulate
 from swsscommon.swsscommon import SonicV2Connector
+from utilities_common.chassis import is_smartswitch

 import utilities_common.cli as clicommon
 from sonic_py_common import multi_asic
@@ -40,11 +41,11 @@ def status(db, chassis_module_name):
     state_db = SonicV2Connector(host="127.0.0.1")
     state_db.connect(state_db.STATE_DB)

-    key_pattern = '*'
+    key_pattern = CHASSIS_MODULE_INFO_TABLE + '|*'
     if chassis_module_name:
-        key_pattern = '|' + chassis_module_name
+        key_pattern = CHASSIS_MODULE_INFO_TABLE + '|' + chassis_module_name

-    keys = state_db.keys(state_db.STATE_DB, CHASSIS_MODULE_INFO_TABLE + key_pattern)
+    keys = state_db.keys(state_db.STATE_DB, key_pattern)
     if not keys:
         print('Key {} not found in {} table'.format(key_pattern, CHASSIS_MODULE_INFO_TABLE))
         return
@@ -62,7 +63,10 @@ def status(db, chassis_module_name):
         oper_status = data_dict[CHASSIS_MODULE_INFO_OPERSTATUS_FIELD]
         serial = data_dict[CHASSIS_MODULE_INFO_SERIAL_FIELD]

-        admin_status = 'up'
+        if is_smartswitch():
+            admin_status = 'down'
+        else:
+            admin_status = 'up'
         config_data = chassis_cfg_table.get(key_list[1])
         if config_data is not None:
             admin_status = config_data.get(CHASSIS_MODULE_INFO_ADMINSTATUS_FIELD)
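
The key-pattern change in `status()` builds the full pattern up front, so the default lookup becomes `<table>|*` instead of `<table>*`, and the "not found" message prints the complete pattern. The sketch below illustrates the difference using `fnmatch` as an approximation of the glob-style matching done by the database. The table name value and the sample keys, including `CHASSIS_MODULE_TABLE_EXTRA|X`, are assumptions made up for the illustration.

```python
from fnmatch import fnmatchcase

# Assumed value; the diff only shows the constant's name.
CHASSIS_MODULE_INFO_TABLE = 'CHASSIS_MODULE_TABLE'


def build_key_pattern(module_name=None):
    """Build the lookup pattern the way the updated status() does."""
    if module_name:
        return CHASSIS_MODULE_INFO_TABLE + '|' + module_name
    return CHASSIS_MODULE_INFO_TABLE + '|*'


def matching_keys(keys, pattern):
    """Approximate a glob-style key lookup with fnmatch."""
    return [k for k in keys if fnmatchcase(k, pattern)]


if __name__ == '__main__':
    # Hypothetical key set; the last key belongs to a different table.
    keys = [
        'CHASSIS_MODULE_TABLE|DPU0',
        'CHASSIS_MODULE_TABLE|DPU1',
        'CHASSIS_MODULE_TABLE_EXTRA|X',
    ]
    old_default = CHASSIS_MODULE_INFO_TABLE + '*'
    print(matching_keys(keys, old_default))
    print(matching_keys(keys, build_key_pattern()))
    print(matching_keys(keys, build_key_pattern('DPU1')))
```

With the separator included, the wildcard can no longer spill over into keys from a table whose name merely starts with the same prefix.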