- Introduction
- Scope
- Testbed and Version
- Topology
- Definitions and Abbreviations
- Objectives of CLI Test Cases
- CLI Test Cases
- 1.1 Check DPU Status
- 1.2 Check platform voltage
- 1.3 Check platform temperature
- 1.4 Check DPU console
- 1.5 Check midplane ip address between Switch and DPU
- 1.6 Check DPU shutdown and power up individually
- 1.7 Check pcie link status between Switch and DPU
- 1.8 Check the NTP date and timezone between DPU and Switch
- 1.9 Check the State of DPUs
- 1.10 Check the Health of DPUs
- 1.11 Check reboot cause history
- 1.12 Check the DPU state after Switch reboot
- 1.13 Check memory on DPU
- 1.14 Check DPU status and pcie Link after memory exhaustion on Switch
- 1.15 Check DPU status and pcie Link after memory exhaustion on DPU
- 1.16 Check DPU status and pcie Link after restart pmon on Switch
- 1.17 Check DPU status and pcie Link after reload of configuration on Switch
- 1.18 Check DPU status and pcie Link after kernel panic on Switch
- 1.19 Check DPU status and pcie Link after kernel panic on DPU
- 1.20 Check DPU status and pcie Link after switch power cycle
- 1.21 Check DPU status and pcie Link after SW trip by temperature trigger
- Objectives of API Test Cases
- API Test Cases
The purpose of this test plan is to verify the functionality of the SmartSwitch platform. The SmartSwitch is connected to the DPUs via PCIe links.
The test targets a running SONiC on the Switch and a SONiC-DASH system on each DPU. The purpose of the test is to verify SmartSwitch platform related functionalities/features for each DPU and the PMON APIs. For every test case, all DPUs need to be powered on unless otherwise specified in the case. The general convention of DPU0, DPU1, DPU2 and DPUX is followed to represent DPU modules; the number of DPU modules can vary.
The test runs on OS versions 202411 and above. Add a check to confirm that the test environment uses version 202411 or later; if the version is earlier, skip the test. After that check, determine whether the DPUs in the testbed are in dark mode; if they are, power up all the DPUs. Dark mode is the state in which the admin_status of all the DPUs is down. A minimal sketch of this precondition check is shown below.
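The sketch below is a pytest-style outline of the version/dark-mode precondition, not a definitive implementation: the `duthost` object (assumed to be a sonic-mgmt host wrapper exposing `shell()`), the release-string parsing, and the column positions of `show chassis modules status` are all assumptions that may need adjustment per image and platform.

```python
import pytest

MIN_RELEASE = 202411  # smartswitch tests require 202411 or later


def check_release_and_dark_mode(duthost):
    """Skip on releases older than 202411, then power up all DPUs if the
    testbed is in dark mode (admin_status of every DPU is down)."""
    # Read the release string; the exact field/format can differ per image.
    version = duthost.shell(
        "sonic-cfggen -y /etc/sonic/sonic_version.yml -v build_version")["stdout"].strip()
    digits = "".join(ch for ch in version if ch.isdigit())[:6]
    if not digits or int(digits) < MIN_RELEASE:
        pytest.skip("SmartSwitch tests require SONiC release 202411 or later")

    # Dark mode: every DPU row in 'show chassis modules status' has Admin-Status 'down'.
    lines = duthost.shell("show chassis modules status")["stdout_lines"]
    dpus = [line.split() for line in lines if line.startswith("DPU")]
    if dpus and all(fields[-2] == "down" for fields in dpus):
        for fields in dpus:
            duthost.shell("config chassis modules startup {}".format(fields[0]))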
A new topology called smartswitch-t1 has been added for running the SmartSwitch cases. T1 cases also run on the new topology.
Term | Meaning |
---|---|
DPU | Data Processing Unit |
NPU | Network Processing Unit |
NTP | Network Time Protocol |
SWITCH | Refers to the NPU and anything other than the DPUs |
SS | SmartSwitch |
Test Case | Intention | Comments | |
---|---|---|---|
1.1 | Check DPU Status | To verify the DPU Status shown in the cli | |
1.2 | Check platform voltage | To verify the voltage sensor values and the alarm functionality by changing the threshold values | |
1.3 | Check platform temperature | To Verify the Temperature sensor values and functionality of alarm by changing the threshold values | |
1.4 | Check DPU console | To Verify console access for all DPUs | |
1.5 | Check midplane ip address between Switch and DPU | To Verify PCIe interface created between Switch and DPU according to bus number | |
1.6 | Check DPU shutdown and power up individually | To Verify DPU shutdown and DPUs power up | |
1.7 | Check pcie link status between Switch and DPU | To verify the PCIe hot-plug functionality | |
1.8 | Check the NTP date and timezone between DPU and Switch | To Verify Switch and DPU are in sync with respect to timezone and logs timestamp | |
1.9 | Check the State of DPUs | To Verify DPU state details during online and offline | |
1.10 | Check the Health of DPUs | To Verify overall health (LED, process, docker, services and hw) of DPU | Phase:2 |
1.11 | Check reboot cause history | To Verify reboot cause history cli | |
1.12 | Check the DPU state after Switch reboot | To verify DPU state on host reboot | |
1.13 | Check memory on DPU | To verify Memory and its threshold on all the DPUs | |
1.14 | Check DPU status and pcie Link after memory exhaustion on Switch | To verify dpu status and connectivity after memory exhaustion on Switch | |
1.15 | Check DPU status and pcie Link after memory exhaustion on DPU | To verify dpu status and connectivity after memory exhaustion on DPU | |
1.16 | Check DPU status and pcie Link after restart pmon on Switch | To verify dpu status and connectivity after restart of pmon on NPU | |
1.17 | Check DPU status and pcie Link after reload of configuration on Switch | To verify dpu status and connectivity after reload of configuration on Switch | |
1.18 | Check DPU status and pcie Link after kernel panic on Switch | To verify dpu status and connectivity after Kernel Panic on Switch | |
1.19 | Check DPU status and pcie Link after kernel panic on DPU | To verify dpu status and connectivity after Kernel Panic on DPU | |
1.20 | Check DPU status and pcie Link after switch power cycle | To verify dpu status and connectivity after switch power cycle | |
1.21 | Check DPU status and pcie Link after SW trip by temperature trigger | To verify dpu status and connectivity after SW trip by temperature trigger |
- Use command on Switch: "show chassis modules status" to get DPU status.
- Get the number of DPU modules from the ansible inventory file for the testbed.
- Switch
On Switch:
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- --------------
DPU0 Data Processing Unit N/A Online up 154226463179136
DPU1 Data Processing Unit N/A Online up 154226463179152
DPU2 Data Processing Unit N/A Online up 154226463179168
DPUX Data Processing Unit N/A Online up 154226463179184
- Verify that the number of DPUs shown in the CLI output matches the number of DPUs in the inventory file for the testbed, as in the sketch below.
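A parsing helper for the table above, plus the count check, might look like the following sketch. The column layout is assumed to match the sample output, and the expected count comes from the ansible inventory; this helper is reused by later sketches in this plan.

```python
def parse_chassis_modules(cli_output):
    """Turn 'show chassis modules status' output into a list of row dicts."""
    rows = []
    for line in cli_output.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("DPU"):
            rows.append({
                "name": fields[0],
                "oper_status": fields[-3],
                "admin_status": fields[-2],
                "serial": fields[-1],
            })
    return rows


def verify_dpu_count(cli_output, expected_num_dpus):
    """expected_num_dpus is read from the ansible inventory for the testbed."""
    dpus = parse_chassis_modules(cli_output)
    assert len(dpus) == expected_num_dpus, (
        "CLI reports %d DPUs, inventory expects %d" % (len(dpus), expected_num_dpus))
```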
- Use command on Switch:
show platform voltage
to get platform voltage
- Switch
On Switch:
root@sonic:/home/cisco# show platform voltage
Sensor Voltage High TH Low TH Crit High TH Crit Low TH Warning Timestamp
------------------------ --------- --------- -------- -------------- ------------- --------- -----------------
A1V2_BB 1211 mV 1308 1092 1320 1080 False 20230619 11:31:08
A1V8_BB 1810 mV 1962 1638 1980 1620 False 20230619 11:31:07
A1V_BB 1008 mV 1090 910 1100 900 False 20230619 11:31:06
A1V_CPU 1001 mV 1090 910 1100 900 False 20230619 11:31:51
A1_2V_CPU 1209 mV 1308 1092 1320 1080 False 20230619 11:31:52
A1_8V_CPU 1803 mV 1962 1638 1980 1620 False 20230619 11:31:51
A2_5V_CPU 2517 mV 2725 2275 2750 2250 False 20230619 11:31:52
A3P3V_CPU 3284 mV 3597 3003 3603 2970 False 20230619 11:31:53
A3V3_BB 3298 mV 3597 3003 3630 2970 False 20230619 11:31:08
GB_CORE_VIN_L1_BB 12000 mV 13800 10200 14400 9600 False 20230619 11:31:06
GB_CORE_VOUT_L1_BB 824 mV N/A 614 918 608 False 20230619 11:31:50
GB_P1V8_PLLVDD_BB 1812 mV 1962 1638 1980 1620 False 20230619 11:31:11
GB_P1V8_VDDIO_BB 1815 mV 1962 1638 1980 1620 False 20230619 11:31:11
GB_PCIE_VDDACK_BB 755 mV 818 683 825 675 False 20230619 11:31:12
GB_PCIE_VDDH_BB 1208 mV 1308 1092 1320 1080 False 20230619 11:31:12
P3_3V_BB 3330 mV 3597 3003 3630 2970 False 20230619 11:31:12
P5V_BB 5069 mV 5450 4550 5500 4500 False 20230619 11:31:07
P12V_CPU 12103 mV 13080 10920 13200 10800 False 20230619 11:31:54
P12V_SLED1_VIN 12048 mV N/A N/A 12550 11560 False 20230619 11:31:13
P12V_SLED2_VIN 12048 mV N/A N/A 12550 11560 False 20230619 11:31:13
P12V_SLED3_VIN 12079 mV N/A N/A 12550 11560 False 20230619 11:31:13
P12V_SLED4_VIN 12043 mV N/A N/A 12550 11560 False 20230619 11:31:13
P12V_STBY_CPU 12103 mV 13080 10920 13200 10800 False 20230619 11:31:54
P12V_U1_VR3_CPU 11890 mV 13800 10200 14400 9600 False 20230619 11:31:54
P12V_U1_VR4_CPU 11890 mV N/A N/A N/A N/A False 20230619 11:31:54
P12V_U1_VR5_CPU 11890 mV 13800 10200 14400 9600 False 20230619 11:31:54
TI_3V3_L_VIN_BB 12015 mV N/A 10200 14400 9600 False 20230619 11:31:06
TI_3V3_L_VOUT_L1_BB 3340 mV N/A 2839 4008 2672 False 20230619 11:31:13
TI_3V3_R_VIN_BB 12078 mV N/A 10200 14400 9600 False 20230619 11:31:06
TI_3V3_R_VOUT_L1_BB 3340 mV N/A 2839 4008 2672 False 20230619 11:31:13
TI_GB_VDDA_VOUT_L2_BB 960 mV N/A 816 1152 768 False 20230619 11:31:13
TI_GB_VDDCK_VIN_BB 12031 mV N/A 10200 14400 9600 False 20230619 11:31:06
TI_GB_VDDCK_VOUT_L1_BB 1150 mV N/A 978 1380 920 False 20230619 11:31:13
TI_GB_VDDS_VIN_BB 12046 mV N/A 10200 14400 9600 False 20230619 11:31:50
TI_GB_VDDS_VOUT_L1_BB 750 mV N/A 638 900 600 False 20230619 11:31:12
VP0P6_DDR0_VTT_DPU0 599 mV 630 570 642 558 False 20230619 11:31:55
VP0P6_DDR0_VTT_DPU1 597 mV 630 570 642 558 False 20230619 11:31:56
VP0P6_DDR0_VTT_DPU2 600 mV 630 570 642 558 False 20230619 11:31:58
VP0P6_DDR0_VTT_DPUX 600 mV 630 570 642 558 False 20230619 11:31:59
VP0P6_DDR1_VTT_DPU0 600 mV 630 570 642 558 False 20230619 11:31:56
VP0P6_DDR1_VTT_DPU1 602 mV 630 570 642 558 False 20230619 11:31:57
VP0P6_DDR1_VTT_DPU2 601 mV 630 570 642 558 False 20230619 11:31:58
VP0P6_DDR1_VTT_DPUX 601 mV 630 570 642 558 False 20230619 11:32:00
VP0P6_VTT_DIMM_CPU 597 mV 654 546 660 540 False 20230619 11:31:51
VP0P8_AVDD_D6_DPU0 801 mV 840 760 856 744 False 20230619 11:31:16
VP0P8_AVDD_D6_DPU1_ADC 806 mV 840 760 856 744 False 20230619 11:31:20
VP0P8_AVDD_D6_DPU2 804 mV 840 760 856 744 False 20230619 11:31:25
VP0P8_AVDD_D6_DPU3_ADC 805 mV 840 760 856 744 False 20230619 11:31:29
VP0P8_AVDD_D6_DPU4 806 mV 840 760 856 744 False 20230619 11:31:34
VP0P8_AVDD_D6_DPU5_ADC 801 mV 840 760 856 744 False 20230619 11:31:39
VP0P8_AVDD_DX_DPUX 805 mV 840 760 856 744 False 20230619 11:31:44
VP0P8_AVDD_D6_DPU7_ADC 806 mV 840 760 856 744 False 20230619 11:31:48
VP0P8_NW_DPU0 803 mV 840 760 856 744 False 20230619 11:31:17
VP0P8_NW_DPU1 804 mV 840 760 856 744 False 20230619 11:31:21
VP0P8_NW_DPU2 803 mV 840 760 856 744 False 20230619 11:31:26
VP0P8_NW_DPUX 804 mV 840 760 856 744 False 20230619 11:31:31
VP0P8_PLL_AVDD_PCIE_DPU0 802 mV 840 760 856 744 False 20230619 11:31:56
VP0P8_PLL_AVDD_PCIE_DPU1 804 mV 840 760 856 744 False 20230619 11:31:57
VP0P8_PLL_AVDD_PCIE_DPU2 801 mV 840 760 856 744 False 20230619 11:31:59
VP0P8_PLL_AVDD_PCIE_DPUX 802 mV 840 760 856 744 False 20230619 11:32:00
VP0P9_AVDDH_D6_DPU0 906 mV 945 855 963 837 False 20230619 11:31:15
VP0P9_AVDDH_D6_DPU1 908 mV 945 855 963 837 False 20230619 11:31:19
VP0P9_AVDDH_D6_DPU2 907 mV 945 855 963 837 False 20230619 11:31:24
VP0P9_AVDDH_D6_DPUX 908 mV 945 855 963 837 False 20230619 11:31:29
VP0P9_AVDDH_PCIE_DPU0 901 mV 945 855 963 837 False 20230619 11:31:17
VP0P9_AVDDH_PCIE_DPU1 903 mV 945 855 963 837 False 20230619 11:31:22
VP0P9_AVDDH_PCIE_DPU2 901 mV 945 855 963 837 False 20230619 11:31:26
VP0P9_AVDDH_PCIE_DPUX 903 mV 945 855 963 837 False 20230619 11:31:31
VP0P75_PVDD_DPU0 752 mV 788 713 803 698 False 20230619 11:31:15
VP0P75_PVDD_DPU1 756 mV 788 713 802 698 False 20230619 11:31:20
VP0P75_PVDD_DPU2 756 mV 788 713 803 698 False 20230619 11:31:24
VP0P75_PVDD_DPUX 755 mV 788 713 802 698 False 20230619 11:31:29
VP0P75_RTVDD_DPU0 753 mV 788 713 803 698 False 20230619 11:31:14
VP0P75_RTVDD_DPU1 755 mV 788 713 802 698 False 20230619 11:31:19
VP0P75_RTVDD_DPU2 752 mV 788 713 803 698 False 20230619 11:31:24
VP0P75_RTVDD_DPUX 755 mV 788 713 802 698 False 20230619 11:31:28
VP0P82_AVDD_PCIE_DPU0 823 mV 861 779 877 763 False 20230619 11:31:18
VP0P82_AVDD_PCIE_DPU1 823 mV 861 779 877 763 False 20230619 11:31:22
VP0P82_AVDD_PCIE_DPU2 822 mV 861 779 877 763 False 20230619 11:31:27
VP0P82_AVDD_PCIE_DPUX 822 mV 861 779 877 763 False 20230619 11:31:31
root@sonic:/home/cisco#
- Verify warnings are all False.
- Change the threshold values (high, low, critical high and critical low) and verify the alarm Warning changes to True.
- Change the threshold values back to the original values and verify the alarm Warning changes back to False (see the sketch below for the Warning check).
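The Warning-column check can be automated roughly as in the sketch below. It assumes the column layout of the sample output above; the threshold manipulation itself is vendor specific and not shown.

```python
def verify_no_voltage_warnings(cli_output):
    """Assert the Warning column is 'False' for every sensor row in
    'show platform voltage' output."""
    for line in cli_output.splitlines():
        fields = line.split()
        # Sensor rows end with: ... Warning Date Time (at least 10 columns).
        if len(fields) >= 10 and fields[-3] in ("True", "False"):
            assert fields[-3] == "False", "alarm raised for sensor %s" % fields[0]
```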
- Use command on Switch:
show platform temperature
to get platform temperature
- Switch
On Switch:
root@sonic:~#show platform temperature
Sensor Temperature High TH Low TH Crit High TH Crit Low TH Warning Timestamp
--------------- ------------- --------- -------- -------------- ------------- --------- -----------------
DPU_0_T 37.438 100.0 -5.0 105.0 -10.0 False 20230728 06:39:18
DPU_1_T 37.563 100.0 -5.0 105.0 -10.0 False 20230728 06:39:18
DPU_2_T 38.5 100.0 -5.0 105.0 -10.0 False 20230728 06:39:18
DPU_3_T 38.813 100.0 -5.0 105.0 -10.0 False 20230728 06:39:18
FAN_Sensor 23.201 100.0 -5.0 102.0 -10.0 False 20230728 06:39:18
MB_PORT_Sensor 21.813 97.0 -5.0 102.0 -10.0 False 20230728 06:39:18
MB_TMP421_Local 26.25 135.0 -5.0 140.0 -10.0 False 20230728 06:39:18
SSD_Temp 40.0 80.0 -5.0 83.0 -10.0 False 20230728 06:39:18
X86_CORE_0_T 37.0 100.0 -5.0 105.0 -10.0 False 20230728 06:39:18
X86_PKG_TEMP 41.0 100.0 -5.0 105.0 -10.0 False 20230728 06:39:18
- Verify warnings are all False.
- Change the threshold values (high, low, critical high and critical low) and verify the alarm Warning changes to True.
- Change the threshold values back to the original values and verify the alarm Warning changes back to False.
- Use serial port utility to access console for given DPU.
- Get the mapping of serial port to DPU number from platform.json file.
- Get the number of DPU modules from ansible inventory file for the testbed. Test is to check for console access for all DPUs.
- Switch
On Switch: (shows connection to dpu-4 console and offset is 4 for this case)
root@sonic:/home/cisco# cat /proc/tty/driver/serial
serinfo:1.0 driver revision:
0: uart:16550A port:000003F8 irq:16 tx:286846 rx:6096 oe:1 RTS|DTR|DSR|CD|RI
1: uart:16550A port:00006000 irq:19 tx:0 rx:0 CTS|DSR|CD
2: uart:16550A port:000003E8 irq:18 tx:0 rx:0 DSR|CD|RI
3: uart:16550A port:00007000 irq:16 tx:0 rx:0 CTS|DSR|CD
**4: uart:16550 mmio:0x94040040 irq:94 tx:0 rx:0
5: uart:16550 mmio:0x94040060 irq:94 tx:20 rx:68
6: uart:16550 mmio:0x94040080 irq:94 tx:0 rx:0
7: uart:16550 mmio:0x940400A0 irq:94 tx:0 rx:0
8: uart:16550 mmio:0x940400C0 irq:94 tx:0 rx:0
9: uart:16550 mmio:0x940400E0 irq:94 tx:0 rx:0
10: uart:16550 mmio:0x94040100 irq:94 tx:0 rx:0
11: uart:16550 mmio:0x94040120 irq:94 tx:0 rx:0**
12: uart:16550 mmio:0x94040140 irq:94 tx:0 rx:0 CTS|DSR
13: uart:16550 mmio:0x94040160 irq:94 tx:0 rx:0 CTS|DSR
14: uart:16550 mmio:0x94040180 irq:94 tx:0 rx:0 CTS|DSR
15: uart:16550 mmio:0x940401A0 irq:94 tx:0 rx:0 CTS|DSR
root@sonic:/home/cisco# /usr/bin/picocom -b 115200 /dev/ttyS8
picocom v3.1
port is : /dev/ttyS8
flowcontrol : none
baudrate is : 115200
parity is : none
databits are : 8
stopbits are : 1
escape is : C-a
local echo is : no
noinit is : no
noreset is : no
hangup is : no
nolock is : no
send_cmd is : sz -vv
receive_cmd is : rz -vv -E
imap is :
omap is :
emap is : crcrlf,delbs,
logfile is : none
initstring : none
exit_after is : not set
exit is : no
Type [C-a] [C-h] to see available commands
Terminal ready
sonic login: admin
Password:
Linux sonic 6.1.0-11-2-arm64 #1 SMP Debian 6.1.38-4 (2023-08-08) aarch64
You are on
____ ___ _ _ _ ____
/ ___| / _ \| \ | (_)/ ___|
\___ \| | | | \| | | |
___) | |_| | |\ | | |___
|____/ \___/|_| \_|_|\____|
-- Software for Open Networking in the Cloud --
Unauthorized access and/or use are prohibited.
All access and/or use are subject to monitoring.
Help: https://sonic-net.github.io/SONiC/
Last login: Fri Jan 26 21:49:12 UTC 2024 from 169.254.143.2 on pts/1
admin@sonic:~$
admin@sonic:~$
Terminating...
Thanks for using picocom
root@sonic:/home/cisco#
- Verify the login prompt is displayed, as in the sketch below.
- Press Ctrl-a and then Ctrl-x to come out of the DPU console.
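A console-login check could be sketched with pexpect as below. The tty device comes from the serial-port-to-DPU mapping in platform.json, and the credentials are placeholders taken from the inventory; the prompt patterns are assumptions based on the sample session above.

```python
import pexpect


def check_dpu_console(tty_dev, user, password):
    """Open the DPU console with picocom, confirm login works, then exit
    with C-a C-x as described above."""
    con = pexpect.spawn("/usr/bin/picocom -b 115200 %s" % tty_dev, timeout=60)
    con.expect("Terminal ready")
    con.sendline("")                      # nudge the console to print the login prompt
    con.expect("login:")
    con.sendline(user)
    con.expect("Password:")
    con.sendline(password)
    con.expect(r"[$#]")                   # a shell prompt means login succeeded
    con.sendcontrol("a")                  # picocom escape ...
    con.sendcontrol("x")                  # ... followed by C-x terminates it
    con.expect(pexpect.EOF)
```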
- Get the number of DPU modules from the ansible inventory file for the testbed.
- Get mid plane ip address for each DPU module from ansible inventory file for the testbed.
- Switch
On Switch:
root@sonic:/home/cisco# show ip interface
Interface Master IPv4 address/mask Admin/Oper BGP Neighbor Neighbor IP
------------ -------- ------------------- ------------ -------------- -------------
eth0 172.25.42.65/24 up/up N/A N/A
bridge-midplane 169.254.200.254/24 up/up N/A N/A
lo 127.0.0.1/16 up/up N/A N/A
root@sonic:/home/cisco#
- Verify ping works to all the midplane IPs listed in the ansible inventory file for the testbed, as in the sketch below.
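A minimal reachability sketch, assuming the midplane IP list has already been read from the ansible inventory:

```python
import subprocess


def verify_midplane_reachability(midplane_ips):
    """Ping every DPU midplane IP from the switch and fail on any unreachable address."""
    for ip in midplane_ips:
        rc = subprocess.call(["ping", "-c", "3", "-W", "2", ip],
                             stdout=subprocess.DEVNULL)
        assert rc == 0, "midplane IP %s is not reachable" % ip
```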
- Get the number of DPU modules from Ansible inventory file for the testbed.
- Use command on Switch: "config chassis modules shutdown <DPU_Number>" to shut down an individual DPU.
- Use command on Switch: "show chassis modules status" to show DPU status.
- Use command on Switch: "config chassis modules startup <DPU_Number>" to power up an individual DPU.
- Use command on Switch: "show chassis modules status" to show DPU status.
- Switch
On Switch:
root@sonic:/home/cisco# config chassis modules shutdown DPU3
root@sonic:/home/cisco#
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ----------------
DPU0 Data Processing Unit N/A Online up 154226463179136
DPU1 Data Processing Unit N/A Online up 154226463179152
DPU2 Data Processing Unit N/A Online up 154226463179168
DPU3 Data Processing Unit N/A Offline down 154226463179184
root@sonic:/home/cisco# config chassis modules startup DPU3
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ ------------------- --------------- ------------- -------------- ----------------
DPU0 Data Processing Unit N/A Online up 154226463179136
DPU1 Data Processing Unit N/A Online up 154226463179152
DPU2 Data Processing Unit N/A Online up 154226463179168
DPU3 Data Processing Unit N/A Online up 154226463179184
- Verify the DPU shows Offline in show chassis modules status after the DPU is shut down.
- Verify the DPU shows Online in show chassis modules status after the DPU is powered on (see the polling sketch below).
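The shutdown/startup cycle with polling could be sketched as below. The `duthost` wrapper, the poll timings, and the reuse of `parse_chassis_modules` from the test case 1.1 sketch are all assumptions.

```python
import time


def wait_for_oper_status(duthost, dpu_name, expected, timeout=300, interval=10):
    """Poll 'show chassis modules status' until the DPU reaches the expected
    Oper-Status or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = duthost.shell("show chassis modules status")["stdout"]
        for row in parse_chassis_modules(out):      # helper from the test case 1.1 sketch
            if row["name"] == dpu_name and row["oper_status"] == expected:
                return True
        time.sleep(interval)
    return False


def shutdown_and_startup_dpu(duthost, dpu_name):
    duthost.shell("config chassis modules shutdown %s" % dpu_name)
    assert wait_for_oper_status(duthost, dpu_name, "Offline"), "%s did not go Offline" % dpu_name
    duthost.shell("config chassis modules startup %s" % dpu_name)
    assert wait_for_oper_status(duthost, dpu_name, "Online"), "%s did not come back Online" % dpu_name
```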
- Use command on Switch: "show platform pcieinfo -c" to run the PCIe info test and check that everything is passing.
- Use command on Switch: "config chassis modules shutdown DPU<DPU_NUM>" to bring down the DPU (this brings down the PCIe link between NPU and DPU).
- Use command on Switch: "show platform pcieinfo -c" to check that the PCIe link has been removed.
- Use command on Switch: "config chassis modules startup DPU<DPU_NUM>" to bring up the DPU (this rescans the PCIe links).
- Use command on Switch: "show platform pcieinfo -c" to check that everything is passing.
- This test checks the PCIe hot-plug functionality, since OIR is not possible.
- Switch
On Switch: Showing example of one DPU pcie link
root@sonic:/home/cisco# show platform pcieinfo -c
root@sonic:/home/cisco# config chassis modules shutdown DPU<DPU_NUM>
root@sonic:/home/cisco#
root@sonic:/home/cisco# config chassis modules startup DPU<DPU_NUM>
root@sonic:/home/cisco#
root@sonic:/home/cisco# show platform pcieinfo -c
- Verify that after removing the PCIe link, the PCIe info test fails only for that link.
- Verify the pcieinfo test passes for all links after bringing the link back up.
- Use command on Switch: "date" to get the date and time zone on the Switch.
- Use command on Switch: "ssh [email protected]" to enter the required DPU.
- Use command on DPU: "date" to get the date and time zone on the DPU.
- Switch and DPU
On Switch:
root@sonic:/home/cisco# date
Tue 23 Apr 2024 11:46:47 PM UTC
root@sonic:/home/cisco#
.
root@sonic:/home/cisco# ssh [email protected]
root@sonic:/home/cisco#
.
On DPU:
root@sonic:/home/admin# date
Tue 23 Apr 2024 11:46:54 PM UTC
root@sonic:/home/cisco#
- Verify both the date and the time zone are the same (see the sketch below).
- Verify the syslog timestamps on both the Switch and the DPU are the same.
- Intentionally change the time on the DPU, restart the DPU, and verify the time syncs again.
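The date/timezone comparison could be sketched as below; the date format string follows the sample output above, and the 5-second skew tolerance is an assumption.

```python
from datetime import datetime


def verify_time_in_sync(switch_date, dpu_date, max_skew_sec=5):
    """Compare the 'date' output captured on the Switch and on a DPU,
    e.g. 'Tue 23 Apr 2024 11:46:47 PM UTC'."""
    fmt = "%a %d %b %Y %I:%M:%S %p %Z"
    sw = datetime.strptime(switch_date.strip(), fmt)
    dpu = datetime.strptime(dpu_date.strip(), fmt)
    assert switch_date.split()[-1] == dpu_date.split()[-1], "timezone mismatch"
    assert abs((dpu - sw).total_seconds()) <= max_skew_sec, "clocks out of sync"
```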
- Use command on Switch:
show system-health DPU all
to get DPU health status.
- Switch and DPU
On Switch:
root@sonic:~#show system-health DPU all
Name ID Oper-Status State-Detail State-Value Time Reason
DPU0 1 Online dpu_midplane_link_state up Wed 20 Oct 2023 06:52:28 PM UTC
dpu_booted_state up Wed 20 Oct 2023 06:52:28 PM UTC
dpu_control_plane_state up Wed 20 Oct 2023 06:52:28 PM UTC
dpu_data_plane_state up Wed 20 Oct 2023 06:52:28 PM UTC
DPU1 2 Online dpu_midplane_link_state up Wed 20 Oct 2023 06:52:28 PM UTC
dpu_booted_state up Wed 20 Oct 2023 06:52:28 PM UTC
dpu_control_plane_state up Wed 20 Oct 2023 06:52:28 PM UTC
dpu_data_plane_state up Wed 20 Oct 2023 06:52:28 PM UTC
root@sonic:~#show system-health DPU 0
Name ID Oper-Status State-Detail State-Value Time Reason
DPU0 1 Offline dpu_midplane_link_state down Wed 20 Oct 2023 06:52:28 PM UTC
dpu_booted_state down Wed 20 Oct 2023 06:52:28 PM UTC
dpu_control_plane_state down Wed 20 Oct 2023 06:52:28 PM UTC
dpu_data_plane_state down Wed 20 Oct 2023 06:52:28 PM UTC
root@sonic:~#show system-health DPU 0
Name ID Oper-Status State-Detail State-Value Time Reason
DPU0 1 Partial Online dpu_midplane_link_state up Wed 20 Oct 2023 06:52:28 PM UTC
dpu_booted_state up Wed 20 Oct 2023 06:52:28 PM UTC
dpu_control_plane_state up Wed 20 Oct 2023 06:52:28 PM UTC
dpu_data_plane_state down Wed 20 Oct 2023 06:52:28 PM UTC Pipeline failure
- Verify the following criteria for Pass/Fail:
- Online : All states are up
- Offline: dpu_midplane_link_state or dpu_booted_state is down
- Partial Online: dpu_midplane_link_state is up and dpu_booted_state is up and dpu_control_plane_state is up and dpu_data_plane_state is down
- Power down a DPU and check the status, then power it up again and verify the status shows Online. The expected Oper-Status for a given set of state values can be derived as in the sketch below.
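The pass/fail criteria above can be expressed as a small classifier. This is a sketch only; `states` is assumed to map each State-Detail name from `show system-health DPU` to its "up"/"down" value.

```python
def expected_oper_status(states):
    """Return the Oper-Status implied by the per-state values of 'show system-health DPU'."""
    if states["dpu_midplane_link_state"] == "down" or states["dpu_booted_state"] == "down":
        return "Offline"
    if all(states[key] == "up" for key in ("dpu_midplane_link_state",
                                           "dpu_booted_state",
                                           "dpu_control_plane_state",
                                           "dpu_data_plane_state")):
        return "Online"
    return "Partial Online"
```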
- This Test case is to be covered in Phase 2
- Use command on Switch:
show system-health detail <DPU_SLOT_NUMBER>
to check the health of the DPU.
- Switch
On Switch:
root@sonic:/home/cisco# show system-health detail <DPU_SLOT_NUMBER>
Device: DPU0
System status summary
System status LED green
Services:
Status: Not OK
Not Running: container_checker, lldp
Hardware:
Status: OK
system services and devices monitor list
Name Status Type
------------------------- -------- ----------
mtvr-r740-04-bf3-sonic-01 OK System
rsyslog OK Process
- Verify System Status - Green, Service Status - OK, Hardware Status - OK
- Stop any docker on the DPU and check that the Service Status is Not OK and that the docker is listed as Not Running.
- Start the docker again and Verify System Status - Green, Service Status - OK, Hardware Status - OK
- Use command on Switch: "show reboot-cause" CLI to show the most recent rebooted device, time and the cause.
- Use command on Switch: "show reboot-cause history" CLI to show the history of the Switch and all DPUs
- Use command on Switch: "show reboot-cause history module-name" CLI to show the history of the specified module
- Use command on Switch:
config chassis modules shutdown <DPU_Number>
- Use command on Switch:
config chassis modules startup <DPU_Number>
- Wait for 5 minutes for Pmon to update the DPU states
- Use command on Switch:
show reboot-cause <DPU_Number>
to check the latest reboot is displayed
- Switch
On Switch:
root@sonic:~#show reboot-cause
Device Name Cause Time User Comment
switch 2023_10_20_18_52_28 Watchdog:1 expired; Wed 20 Oct 2023 06:52:28 PM UTC N/A N/A
DPU3 2023_10_03_18_23_46 Watchdog: stage 1 expired; Mon 03 Oct 2023 06:23:46 PM UTC N/A N/A
DPU2 2023_10_02_17_20_46 reboot Sun 02 Oct 2023 05:20:46 PM UTC admin User issued 'reboot'
root@sonic:~#show reboot-cause history
Device Name Cause Time User Comment
switch 2023_10_20_18_52_28 Watchdog:1 expired; Wed 20 Oct 2023 06:52:28 PM UTC N/A N/A
switch 2023_10_05_18_23_46 reboot Wed 05 Oct 2023 06:23:46 PM UTC user N/A
DPU3 2023_10_03_18_23_46 Watchdog: stage 1 expired; Mon 03 Oct 2023 06:23:46 PM UTC N/A N/A
DPU3 2023_10_02_18_23_46 Host Power-cycle Sun 02 Oct 2023 06:23:46 PM UTC N/A Host lost DPU
DPU3 2023_10_02_17_23_46 Host Reset DPU Sun 02 Oct 2023 05:23:46 PM UTC N/A N/A
DPU2 2023_10_02_17_20_46 reboot Sun 02 Oct 2023 05:20:46 PM UTC admin User issued 'reboot'
"show reboot-cause history module-name"
root@sonic:~#show reboot-cause history dpu3
Device Name Cause Time User Comment
DPU3 2023_10_03_18_23_46 Watchdog: stage 1 expired; Mon 03 Oct 2023 06:23:46 PM UTC N/A N/A
DPU3 2023_10_02_18_23_46 Host Power-cycle Sun 02 Oct 2023 06:23:46 PM UTC N/A Host lost DPU
DPU3 2023_10_02_17_23_46 Host Reset DPU Sun 02 Oct 2023 05:23:46 PM UTC N/A N/A
- Verify the output shows the latest reboot cause with the date/time stamp from the start of the reboot.
- Verify all the reboot causes - Watchdog, reboot command, Host Reset
Existing Test case for Switch:
- Reboot using a particular command (sonic reboot, watchdog reboot, etc)
- All the timeout and poll timings are read from platform.json
- Wait for ssh to drop
- Wait for ssh to connect
- Database check
- Check for uptime – (NTP sync)
- Check for critical process
- Check for transceiver status
- Check for pmon status
- Check for reboot cause
- Reboot is successful
Reboot Test Case for DPU:
- Save the configurations of all DPU state before reboot
- Power on all the DPUs that were powered on before reboot using
config chassis modules startup <DPU_Number>
- Wait for DPUs to be up
- Use command on Switch: "show chassis modules status" to get DPU status.
- Get the number of DPU modules from the ansible inventory file for the testbed.
- Switch
On Switch:
root@sonic:/home/cisco# reboot
root@sonic:/home/cisco#
root@sonic:/home/cisco# config chassis modules startup <DPU_Number>
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ---------------
DPU0 Data Processing Unit N/A Online up 154226463179136
DPU1 Data Processing Unit N/A Online up 154226463179152
DPU2 Data Processing Unit N/A Online up 154226463179168
DPUX Data Processing Unit N/A Online up 154226463179184
- Verify that the number of DPUs shown in the CLI output matches the number of DPUs in the inventory file for the testbed. A sketch of saving and restoring the DPU power state across the reboot is shown below.
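Saving and restoring the DPU power state around the switch reboot could be sketched as follows, reusing the `parse_chassis_modules` helper from the test case 1.1 sketch; `duthost` is again an assumption.

```python
def save_powered_on_dpus(duthost):
    """Record which DPUs are administratively up before the switch reboot."""
    out = duthost.shell("show chassis modules status")["stdout"]
    return [r["name"] for r in parse_chassis_modules(out) if r["admin_status"] == "up"]


def restore_powered_on_dpus(duthost, dpu_names):
    """After the switch is back, power on the DPUs that were up before the reboot."""
    for name in dpu_names:
        duthost.shell("config chassis modules startup %s" % name)
```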
- Infrastructure support will be provided to choose from following clis or in combinations based on the vendor.
- Use command on DPU: "show system-memory" to get memory usage on each of the DPUs.
- Use command on Switch: "show system-health dpu <DPU_NUM>" to check the memory check service status.
- Use command on DPU: "pdsctl show system --events" to check for memory related events. (This is a vendor specific event monitoring CLI.)
- DPU
On DPU:
root@sonic:/home/admin# show system-memory
total used free shared buff/cache available
Mem: 6266 4198 1509 28 765 2067
Swap: 0 0 0
root@sonic:/home/admin#
root@sonic:/home/admin#
On Switch:
root@MtFuji:/home/cisco# show system-health dpu DPU0
is_smartswitch returning True
Name ID Oper-Status State-Detail State-Value Time Reason
------ ---- ------------- ----------------------- ------------- ----------------- --------------------------------------------------------------------------------------------------------------------------
DPU0 0 Online dpu_midplane_link_state UP 20241003 17:57:22 INTERNAL-MGMT : admin state - UP, oper_state - UP, status - OK, HOST-MGMT : admin state - UP, oper_state - UP, status - OK
dpu_control_plane_state UP 20241003 17:57:22 All containers are up and running, host-ethlink-status: Uplink1/1 is UP
dpu_data_plane_state UP 20241001 19:54:30 DPU container named polaris is running, pdsagent running : OK, pciemgrd running : OK
root@MtFuji:/home/cisco#
On DPU:
# Note: This command is run on cisco specific testbed.
# HwSKU: Cisco-8102-28FH-DPU-O-T1
root@sonic:/home/admin# pdsctl show system --events
----------------------------------------------------------------------------------------------------
Event Severity Timestamp
----------------------------------------------------------------------------------------------------
DSE_SERVICE_STARTED DEBUG 2024-09-20 21:48:11.515551 +0000 UTC
DSE_SERVICE_STARTED DEBUG 2024-09-20 21:48:11.668685 +0000 UTC
DSE_SERVICE_STARTED DEBUG 2024-09-20 21:48:12.379261 +0000 UTC
DSE_SERVICE_STARTED DEBUG 2024-09-20 21:48:19.379819 +0000 UTC
root@sonic:/home/admin#
- Verify that used memory does not cross the specified threshold (90%) of total memory (see the sketch below).
- The threshold can be set differently per platform.
- Verify that dpu_control_plane_state is up in the "show system-health dpu <DPU_NUM>" CLI.
- Verify there are no memory related events (MEM_FAILURE_EVENT) in the "pdsctl show system --events" CLI. This is a vendor specific event monitoring CLI.
- Increase memory usage beyond the threshold (head -c <MEM_SIZE> /dev/zero | tail &) and verify the event in the "pdsctl show system --events" CLI.
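A sketch of the memory-threshold check against the `show system-memory` output shown above; the 90% default mirrors the text and is assumed to be overridable per platform.

```python
def verify_dpu_memory_below_threshold(show_system_memory_output, threshold_pct=90):
    """Fail if used memory crosses threshold_pct of total memory."""
    for line in show_system_memory_output.splitlines():
        if line.startswith("Mem:"):
            total, used = (int(v) for v in line.split()[1:3])
            pct = 100.0 * used / total
            assert pct < threshold_pct, (
                "DPU memory usage %.1f%% exceeds %d%%" % (pct, threshold_pct))
            return
    raise AssertionError("'Mem:' line not found in show system-memory output")
```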
- Use command on Switch: "sudo swapoff -a". Swapping is turned off so the OOM is triggered in a shorter time.
- Use command on Switch: 'nohup bash -c "sleep 5 && tail /dev/zero" &' to run out of memory completely.
- The command runs in the background, and nohup is necessary to protect the background process. sleep 5 is added to ensure ansible receives the result first.
- Check the status and power on the DPUs after the switch reboots and comes back.
- Use command on Switch: "show chassis modules status" to check the status of the DPUs.
- Append to the existing test case: https://github.com/sonic-net/sonic-mgmt/blob/master/tests/platform_tests/test_memory_exhaustion.py
- Switch
root@sonic:/home/cisco# sudo swapoff -a
root@sonic:/home/cisco#
root@sonic:/home/cisco# nohup bash -c "sleep 5 && tail /dev/zero" &
root@sonic:/home/cisco# nohup: ignoring input and appending output to 'nohup.out'
root@sonic:/home/cisco#
.
(Going for reboot)
.
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ---------------
DPU0 Data Processing Unit N/A Offline down 154226463179136
DPU1 Data Processing Unit N/A Offline down 154226463179152
DPU2 Data Processing Unit N/A Offline down 154226463179168
DPUX Data Processing Unit N/A Offline down 154226463179184
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco# config chassis startup DPU0
root@sonic:/home/cisco# config chassis startup DPU1
root@sonic:/home/cisco# config chassis startup DPU2
root@sonic:/home/cisco# config chassis startup DPUX
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ---------------
DPU0 Data Processing Unit N/A Online up 154226463179136
DPU1 Data Processing Unit N/A Online up 154226463179152
DPU2 Data Processing Unit N/A Online up 154226463179168
DPUX Data Processing Unit N/A Online up 154226463179184
root@sonic:/home/cisco# ping 169.254.200.1
PING 169.254.200.1 (169.254.200.1) 56(84) bytes of data.
64 bytes from 169.254.200.1: icmp_seq=1 ttl=64 time=0.160 ms
^C
--- 169.254.200.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.160/0.160/0.160/0.000 ms
root@sonic:/home/cisco#
- Verify number of DPUs from inventory file for the testbed and number of DPUs shown in the cli output.
- Verify Ping works to all the mid plane ip listed in the ansible inventory file for the testbed.
- Use command on DPU: "sudo swapoff -a". Swapping is turned off so the OOM is triggered in a shorter time.
- Use command on DPU: 'nohup bash -c "sleep 5 && tail /dev/zero" &' to run out of memory completely.
- The command runs in the background, and nohup is necessary to protect the background process. sleep 5 is added to ensure ansible receives the result first.
- Power cycling the DPU ensures that the PCIe link comes up properly after the memory exhaustion test.
- Use command on Switch: "config chassis modules shutdown <DPU_NUMBER>" to power off the DPUs.
- Wait for 3 mins.
- Use command on Switch: "config chassis modules startup <DPU_NUMBER>" to power on the DPUs.
- DPU and Switch
DPU:
root@sonic:/home/admin# sudo swapoff -a
root@sonic:/home/admin#
root@sonic:/home/admin# nohup bash -c "sleep 5 && tail /dev/zero" &
root@sonic:/home/admin# nohup: ignoring input and appending output to 'nohup.out'
root@sonic:/home/admin#
.
(Going for reboot)
.
Switch:
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ---------------
DPU0 Data Processing Unit N/A Offline down 154226463179136
DPU1 Data Processing Unit N/A Offline down 154226463179152
DPU2 Data Processing Unit N/A Offline down 154226463179168
DPUX Data Processing Unit N/A Offline down 154226463179184
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco# config chassis startup DPU0
root@sonic:/home/cisco# config chassis startup DPU1
root@sonic:/home/cisco# config chassis startup DPU2
root@sonic:/home/cisco# config chassis startup DPUX
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ---------------
DPU0 Data Processing Unit N/A Online up 154226463179136
DPU1 Data Processing Unit N/A Online up 154226463179152
DPU2 Data Processing Unit N/A Online up 154226463179168
DPUX Data Processing Unit N/A Online up 154226463179184
root@sonic:/home/cisco# ping 169.254.200.1
PING 169.254.200.1 (169.254.200.1) 56(84) bytes of data.
64 bytes from 169.254.200.1: icmp_seq=1 ttl=64 time=0.160 ms
^C
--- 169.254.200.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.160/0.160/0.160/0.000 ms
root@sonic:/home/cisco#
- Verify number of DPUs from inventory file for the testbed and number of DPUs shown in the cli output.
- Verify Ping works to all the mid plane ip listed in the ansible inventory file for the testbed.
- Use command on Switch: "docker ps" to check the status of all the dockers.
- Use command on Switch: "systemctl restart pmon"
- Wait for 3 mins.
- Use command on Switch: "docker ps" to check the status of all the dockers.
- Use command on Switch: "show chassis modules status" to check the status of the DPUs.
- Switch
root@MtFuji:/home/cisco# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a49fcf07beb8 docker-snmp:latest "/usr/local/bin/supe…" 3 days ago Up 3 days snmp
57ecc675292d docker-platform-monitor:latest "/usr/bin/docker_ini…" 3 days ago Up 3 days pmon
f1306072ba01 docker-sonic-mgmt-framework:latest "/usr/local/bin/supe…" 3 days ago Up 3 days mgmt-framework
571cc36585ae docker-lldp:latest "/usr/bin/docker-lld…" 3 days ago Up 3 days lldp
db4b1444e8a0 docker-sonic-gnmi:latest "/usr/local/bin/supe…" 3 days ago Up 3 days gnmi
a90702b9c541 d0a0fb621c53 "/usr/bin/docker_ini…" 3 days ago Up 3 days dhcp_server
4d2d79b77c66 2c214d2315a2 "/usr/bin/docker_ini…" 3 days ago Up 3 days dhcp_relay
90246d1e26d2 docker-fpm-frr:latest "/usr/bin/docker_ini…" 3 days ago Up 3 days bgp
42cf834770a8 docker-orchagent:latest "/usr/bin/docker-ini…" 3 days ago Up 3 days swss
7eb9da209385 docker-router-advertiser:latest "/usr/bin/docker-ini…" 3 days ago Up 3 days radv
66c4c8779e60 docker-syncd-cisco:latest "/usr/local/bin/supe…" 3 days ago Up 3 days syncd
5d542c98fb00 docker-teamd:latest "/usr/local/bin/supe…" 3 days ago Up 3 days teamd
a5225d08bcf4 docker-eventd:latest "/usr/local/bin/supe…" 3 days ago Up 3 days eventd
bd7555425d6d docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days databasedpu5
42fd04767a03 docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days databasedpu4
a1633fc4a6ff docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days databasedpu3
32b4e9506827 docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days databasedpu0
bb73239399e4 docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days databasedpu6
e5281aba74de docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days databasedpu7
96032ebcb451 docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days databasedpu1
45418ff0d88f docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days databasedpu2
39ddbddd3fb3 docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days database
f1cac669cd08 docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days database-chassis
root@MtFuji:/home/cisco#
root@MtFuji:/home/cisco#
root@MtFuji:/home/cisco# systemctl restart pmon
root@MtFuji:/home/cisco#
root@MtFuji:/home/cisco#
root@MtFuji:/home/cisco#
root@MtFuji:/home/cisco# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a49fcf07beb8 docker-snmp:latest "/usr/local/bin/supe…" 3 days ago Up 3 days snmp
57ecc675292d docker-platform-monitor:latest "/usr/bin/docker_ini…" 3 days ago Up 4 seconds pmon
f1306072ba01 docker-sonic-mgmt-framework:latest "/usr/local/bin/supe…" 3 days ago Up 3 days mgmt-framework
571cc36585ae docker-lldp:latest "/usr/bin/docker-lld…" 3 days ago Up 3 days lldp
db4b1444e8a0 docker-sonic-gnmi:latest "/usr/local/bin/supe…" 3 days ago Up 3 days gnmi
a90702b9c541 d0a0fb621c53 "/usr/bin/docker_ini…" 3 days ago Up 3 days dhcp_server
4d2d79b77c66 2c214d2315a2 "/usr/bin/docker_ini…" 3 days ago Up 3 days dhcp_relay
90246d1e26d2 docker-fpm-frr:latest "/usr/bin/docker_ini…" 3 days ago Up 3 days bgp
42cf834770a8 docker-orchagent:latest "/usr/bin/docker-ini…" 3 days ago Up 3 days swss
7eb9da209385 docker-router-advertiser:latest "/usr/bin/docker-ini…" 3 days ago Up 3 days radv
66c4c8779e60 docker-syncd-cisco:latest "/usr/local/bin/supe…" 3 days ago Up 3 days syncd
5d542c98fb00 docker-teamd:latest "/usr/local/bin/supe…" 3 days ago Up 3 days teamd
a5225d08bcf4 docker-eventd:latest "/usr/local/bin/supe…" 3 days ago Up 3 days eventd
bd7555425d6d docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days databasedpu5
42fd04767a03 docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days databasedpu4
a1633fc4a6ff docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days databasedpu3
32b4e9506827 docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days databasedpu0
bb73239399e4 docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days databasedpu6
e5281aba74de docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days databasedpu7
96032ebcb451 docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days databasedpu1
45418ff0d88f docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days databasedpu2
39ddbddd3fb3 docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days database
f1cac669cd08 docker-database:latest "/usr/local/bin/dock…" 3 days ago Up 3 days database-chassis
root@MtFuji:/home/cisco#
root@MtFuji:/home/cisco#
root@MtFuji:/home/cisco# systemctl status pmon
● pmon.service - Platform monitor container
Loaded: loaded (/lib/systemd/system/pmon.service; static)
Drop-In: /etc/systemd/system/pmon.service.d
└─auto_restart.conf
Active: active (running) since Sat 2024-10-05 00:22:29 UTC; 24s ago
Process: 3584922 ExecStartPre=/usr/bin/pmon.sh start (code=exited, status=0>
Main PID: 3584995 (pmon.sh)
Tasks: 2 (limit: 153342)
Memory: 27.6M
CGroup: /system.slice/pmon.service
├─3584995 /bin/bash /usr/bin/pmon.sh wait
└─3585000 python3 /usr/local/bin/container wait pmon
Oct 05 00:22:28 MtFuji container[3584943]: container_start: pmon: set_owner:loc>
Oct 05 00:22:29 MtFuji container[3584943]: docker cmd: start for pmon
Oct 05 00:22:29 MtFuji container[3584943]: container_start: END
Oct 05 00:22:29 MtFuji systemd[1]: Started pmon.service - Platform monitor cont>
Oct 05 00:22:29 MtFuji container[3585000]: container_wait: BEGIN
Oct 05 00:22:29 MtFuji container[3585000]: read_data: config:True feature:pmon >
Oct 05 00:22:29 MtFuji container[3585000]: read_data: config:False feature:pmon>
Oct 05 00:22:29 MtFuji container[3585000]: docker get image version for pmon
Oct 05 00:22:29 MtFuji container[3585000]: container_wait: pmon: set_owner:loca>
Oct 05 00:22:29 MtFuji container[3585000]: container_wait: END -- transitioning>
root@MtFuji:/home/cisco#
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ---------------
DPU0 Data Processing Unit N/A Online up 154226463179136
DPU1 Data Processing Unit N/A Online up 154226463179152
DPU2 Data Processing Unit N/A Online up 154226463179168
DPUX Data Processing Unit N/A Online up 154226463179184
root@sonic:/home/cisco# ping 169.254.200.1
PING 169.254.200.1 (169.254.200.1) 56(84) bytes of data.
64 bytes from 169.254.200.1: icmp_seq=1 ttl=64 time=0.160 ms
^C
--- 169.254.200.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.160/0.160/0.160/0.000 ms
root@sonic:/home/cisco#
- Verify pmon and all the associated critical processes are up.
- Verify that the number of DPUs shown in the CLI output matches the number of DPUs in the inventory file for the testbed.
- Verify ping works to all the midplane IPs listed in the ansible inventory file for the testbed. A sketch of the container status check is shown below.
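A container-status check over `docker ps` output might look like this sketch; the set of expected container names (pmon plus the per-DPU database containers in the listing above) is assumed to be supplied by the test.

```python
def verify_containers_running(docker_ps_output, expected_names):
    """Assert every expected container appears in 'docker ps' with an Up status."""
    running = set()
    for line in docker_ps_output.splitlines()[1:]:   # skip the header row
        if " Up " in line:
            running.add(line.split()[-1])            # NAMES is the last column
    missing = set(expected_names) - running
    assert not missing, "containers not running: %s" % ", ".join(sorted(missing))
```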
- Use command on Switch: "config reload -y" to reload the configuration on the switch.
- Wait for 3 mins.
- Use command on Switch: "show chassis modules status" to check the status of the DPUs.
- Use command on Switch: "config chassis modules startup <DPU_NUMBER>" to power on the DPUs.
- Use command on Switch: "show chassis modules status" to check the status of the DPUs.
- Switch
root@sonic:/home/cisco# config reload -y
Acquired lock on /etc/sonic/reload.lock
Disabling container monitoring ...
Stopping SONiC target ...
Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db
Running command: /usr/local/bin/db_migrator.py -o migrate
Running command: /usr/local/bin/sonic-cfggen -d -y /etc/sonic/sonic_version.yml -t /usr/share/sonic/templates/sonic-environment.j2,/etc/sonic/sonic-environment
Restarting SONiC target ...
Enabling container monitoring ...
Reloading Monit configuration ...
Reinitializing monit daemon
Released lock on /etc/sonic/reload.lock
root@MtFuji:/home/cisco#
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ---------------
DPU0 Data Processing Unit N/A Offline down 154226463179136
DPU1 Data Processing Unit N/A Offline down 154226463179152
DPU2 Data Processing Unit N/A Offline down 154226463179168
DPUX Data Processing Unit N/A Offline down 154226463179184
root@sonic:/home/cisco#
root@sonic:/home/cisco# config chassis startup DPU0
root@sonic:/home/cisco# config chassis startup DPU1
root@sonic:/home/cisco# config chassis startup DPU2
root@sonic:/home/cisco# config chassis startup DPUX
root@sonic:/home/cisco#
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ---------------
DPU0 Data Processing Unit N/A Online up 154226463179136
DPU1 Data Processing Unit N/A Online up 154226463179152
DPU2 Data Processing Unit N/A Online up 154226463179168
DPUX Data Processing Unit N/A Online up 154226463179184
root@sonic:/home/cisco# ping 169.254.200.1
PING 169.254.200.1 (169.254.200.1) 56(84) bytes of data.
64 bytes from 169.254.200.1: icmp_seq=1 ttl=64 time=0.160 ms
^C
--- 169.254.200.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.160/0.160/0.160/0.000 ms
root@sonic:/home/cisco#
- Verify number of DPUs from inventory file for the testbed and number of DPUs shown in the cli output.
- Verify Ping works to all the mid plane ip listed in the ansible inventory file for the testbed.
- Use command on Switch: 'nohup bash -c "sleep 5 && echo c > /proc/sysrq-trigger" &' to trigger a kernel panic.
- Use command on Switch: "show chassis modules status" to check the status of the DPUs.
- Use command on Switch: "config chassis modules startup <DPU_NUMBER>" to power on the DPUs.
- Use command on Switch: "show chassis modules status" to check the status of the DPUs.
- Append to the existing test case: https://github.com/sonic-net/sonic-mgmt/blob/master/tests/platform_tests/test_kdump.py
- Switch
root@sonic:/home/cisco# nohup bash -c "sleep 5 && echo c > /proc/sysrq-trigger" &
.
(Going for reboot)
.
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ---------------
DPU0 Data Processing Unit N/A Offline down 154226463179136
DPU1 Data Processing Unit N/A Offline down 154226463179152
DPU2 Data Processing Unit N/A Offline down 154226463179168
DPUX Data Processing Unit N/A Offline down 154226463179184
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco# config chassis startup DPU0
root@sonic:/home/cisco# config chassis startup DPU1
root@sonic:/home/cisco# config chassis startup DPU2
root@sonic:/home/cisco# config chassis startup DPUX
root@sonic:/home/cisco#
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ---------------
DPU0 Data Processing Unit N/A Online up 154226463179136
DPU1 Data Processing Unit N/A Online up 154226463179152
DPU2 Data Processing Unit N/A Online up 154226463179168
DPUX Data Processing Unit N/A Online up 154226463179184
root@sonic:/home/cisco# ping 169.254.200.1
PING 169.254.200.1 (169.254.200.1) 56(84) bytes of data.
64 bytes from 169.254.200.1: icmp_seq=1 ttl=64 time=0.160 ms
^C
--- 169.254.28.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.160/0.160/0.160/0.000 ms
root@sonic:/home/cisco# show reboot-cause
<cli implementation is in progress for getting the reasons in output for DPUs>
- Verify number of DPUs from inventory file for the testbed and number of DPUs shown in the cli output.
- Verify Ping works to all the mid plane ip listed in the ansible inventory file for the testbed.
- Verify "show reboot-cause <DPU_Number>" to check that the reboot was caused by a kernel panic.
- Use command on DPU: 'nohup bash -c "sleep 5 && echo c > /proc/sysrq-trigger" &' to trigger a kernel panic on the DPU.
- Power cycling the DPU ensures that the PCIe link comes up properly after the kernel panic test.
- Use command on Switch: "config chassis modules shutdown <DPU_NUMBER>" to power off the DPUs.
- Wait for 3 mins.
- Use command on Switch: "config chassis modules startup <DPU_NUMBER>" to power on the DPUs.
- DPU and Switch
DPU:
root@sonic:/home/admin# nohup bash -c "sleep 5 && echo c > /proc/sysrq-trigger" &
.
(Going for reboot)
.
Switch:
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ---------------
DPU0 Data Processing Unit N/A Offline down 154226463179136
DPU1 Data Processing Unit N/A Offline down 154226463179152
DPU2 Data Processing Unit N/A Offline down 154226463179168
DPUX Data Processing Unit N/A Offline down 154226463179184
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco# config chassis startup DPU0
root@sonic:/home/cisco# config chassis startup DPU1
root@sonic:/home/cisco# config chassis startup DPU2
root@sonic:/home/cisco# config chassis startup DPUX
root@sonic:/home/cisco#
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ---------------
DPU0 Data Processing Unit N/A Online up 154226463179136
DPU1 Data Processing Unit N/A Online up 154226463179152
DPU2 Data Processing Unit N/A Online up 154226463179168
DPUX Data Processing Unit N/A Online up 154226463179184
root@sonic:/home/cisco# ping 169.254.200.1
PING 169.254.200.1 (169.254.200.1) 56(84) bytes of data.
64 bytes from 169.254.200.1: icmp_seq=1 ttl=64 time=0.160 ms
^C
--- 169.254.200.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.160/0.160/0.160/0.000 ms
root@sonic:/home/cisco#
root@sonic:/home/cisco# show reboot-cause
<cli implementation is in progress for getting the reasons in output for DPUs>
- Verify number of DPUs from inventory file for the testbed and number of DPUs shown in the cli output.
- Verify Ping works to all the mid plane ip listed in the ansible inventory file for the testbed.
- Verify "show reboot-cause <DPU_Number>" to check that the reboot was caused by a kernel panic.
- Power cycle the testbed using PDU controller.
- Use command on Switch: "config chassis modules startup <DPU_NUMBER>" to power on the DPUs.
- Use command on Switch: "show chassis modules status" to check the status of the DPUs.
- Append to the existing test case: https://github.com/sonic-net/sonic-mgmt/blob/master/tests/platform_tests/test_power_off_reboot.py
- Switch
After power cycle of the testbed,
root@sonic:/home/cisco#
root@sonic:/home/cisco#
.
(Going for power off reboot using pdu controller)
.
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ---------------
DPU0 Data Processing Unit N/A Offline down 154226463179136
DPU1 Data Processing Unit N/A Offline down 154226463179152
DPU2 Data Processing Unit N/A Offline down 154226463179168
DPUX Data Processing Unit N/A Offline down 154226463179184
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco# config chassis startup DPU0
root@sonic:/home/cisco# config chassis startup DPU1
root@sonic:/home/cisco# config chassis startup DPU2
root@sonic:/home/cisco# config chassis startup DPUX
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco# config chassis modules startup <DPU_NUMBER>
root@sonic:/home/cisco#
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ---------------
DPU0 Data Processing Unit N/A Online up 154226463179136
DPU1 Data Processing Unit N/A Online up 154226463179152
DPU2 Data Processing Unit N/A Online up 154226463179168
DPUX Data Processing Unit N/A Online up 154226463179184
root@sonic:/home/cisco# ping 169.254.200.1
PING 169.254.200.1 (169.254.200.1) 56(84) bytes of data.
64 bytes from 169.254.200.1: icmp_seq=1 ttl=64 time=0.160 ms
^C
--- 169.254.200.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.160/0.160/0.160/0.000 ms
root@sonic:/home/cisco#
- Verify number of DPUs from inventory file for the testbed and number of DPUs shown in the cli output.
- Verify Ping works to all the mid plane ip listed in the ansible inventory file for the testbed.
- Infrastructure will be provided to run the scripts that triggers the temperature trip based on vendor.
- The following is the example sequence to trigger temperature trip on the DPU
- Note: the following steps apply to a Cisco setup.
- In DPU, Execute:
docker exec -it polaris /bin/bash
- Create /tmp/temp_sim.json file with dictionary { "hbmtemp": 65, "dietemp": 85}
- Increase dietemp to 125 to trigger the trip.
- DPU
DPU:
# Note: This command is run on cisco specific testbed.
# HwSKU: Cisco-8102-28FH-DPU-O-T1
root@sonic:/home/admin#
root@sonic:/home/admin# docker exec -it polaris /bin/bash
bash-4.4#
bash-4.4# cat /tmp/temp_sim.json
{ "hbmtemp": 65,
"dietemp": 85 }
bash-4.4#
bash-4.4#
bash-4.4#
bash-4.4# cat /tmp/temp_sim.json
{ "hbmtemp": 65,
"dietemp": 125 }
bash-4.4# exit
exit
root@sonic:/home/admin# e** RESETTING: SW TRIP dietemBoot0 v19, Id 0x82
Boot fwid 0 @ 0x74200000... OK
.
(Going for reboot)
.
Switch:
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ---------------
DPU0 Data Processing Unit N/A Offline down 154226463179136
DPU1 Data Processing Unit N/A Online up 154226463179152
DPU2 Data Processing Unit N/A Online up 154226463179168
DPUX Data Processing Unit N/A Online up 154226463179184
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco# config chassis shutdown DPU0
root@sonic:/home/cisco# config chassis startup DPU0
root@sonic:/home/cisco#
root@sonic:/home/cisco#
root@sonic:/home/cisco# show chassis modules status
Name Description Physical-Slot Oper-Status Admin-Status Serial
------ -------------------- --------------- ------------- -------------- ---------------
DPU0 Data Processing Unit N/A Online up 154226463179136
DPU1 Data Processing Unit N/A Online up 154226463179152
DPU2 Data Processing Unit N/A Online up 154226463179168
DPUX Data Processing Unit N/A Online up 154226463179184
root@sonic:/home/cisco# ping 169.254.200.1
PING 169.254.200.1 (169.254.200.1) 56(84) bytes of data.
64 bytes from 169.254.200.1: icmp_seq=1 ttl=64 time=0.160 ms
^C
--- 169.254.200.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.160/0.160/0.160/0.000 ms
root@sonic:/home/cisco#
root@sonic:/home/cisco# show reboot-cause history dpu0
<reset reason is work in progress>
- Verify ping works to the DPU midplane IP listed in the ansible inventory file for the testbed.
- Verify with the command on Switch "show reboot-cause history" that the cause is reported as cattrip.
Test Case | Intention | Comments | |
---|---|---|---|
1.1 | Check SmartSwitch specific ChassisClass APIs | To verify the newly implemented SmartSwitch specific ChassisClass APIs | |
1.2 | Check modified ChassisClass APIs for SmartSwitch | To verify the existing ChassisClass APIs that undergo minor changes with the addition of SmartSwitch | |
1.3 | Check DpuModule APIs for SmartSwitch | To verify the newly implemented DpuModule APIs for SmartSwitch | |
1.4 | Check modified ModuleClass APIs for SmartSwitch | To verify the existing ModuleClass APIs that undergo minor changes with the addition of SmartSwitch |
- Execute the following APIs on SmartSwitch
- get_dpu_id(self, name):
- Provide name (Example: DPU0 - Get it from platform.json file)
- This API should return an integer from 0-<max_num_of_dpus> (check it against platform.json)
- is_smartswitch(self):
- This API should return True
- get_module_dpu_data_port(self, index):
- It will return a dict as shown below.
- Switch
On Switch:
get_dpu_id(self, DPU3)
Output: 4
is_smartswitch(self):
Output: True
get_module_dpu_data_port(self, DPU0):
Output: {
"interface": {"Ethernet224": "Ethernet0"}
}
- The test result is a pass if the return value matches the expected value as shown in the "steps" and "Sample Output". A minimal on-device sketch follows.
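The sketch below exercises the new ChassisClass APIs directly on the switch. It assumes the vendor's sonic_platform package is importable there; the DPU name and the expected data-port mapping come from platform.json, and the return shapes follow the sample output above.

```python
from sonic_platform.chassis import Chassis

chassis = Chassis()

# SmartSwitch platforms must report True here.
assert chassis.is_smartswitch() is True

# get_dpu_id() maps a DPU name from platform.json to an integer id.
dpu_id = chassis.get_dpu_id("DPU0")
assert isinstance(dpu_id, int)

# get_module_dpu_data_port() returns the DPU data-port mapping, e.g.
# {"interface": {"Ethernet224": "Ethernet0"}}; compare it against platform.json.
print(chassis.get_module_dpu_data_port(dpu_id))
```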
- is_modular_chassis(self):
- Should return False
- get_num_modules(self):
- Should return number of DPUs
- get_module(self, index):
- Make sure for each index this API returns an object and has some content and not None
- Check that the object's class is inherited from the ModuleBase class
- get_all_modules(self):
- This should return a list of items
- get_module_index(self, module_name):
- Given a module name such as "DPU0", this should return its index (for example, 1)
- Switch
On Switch:
is_modular_chassis(self):
Output: False
get_num_modules(self):
Output: number of DPUs
get_module(self, DPU0):
Output: DPU0 object
get_all_modules(self):
Output: list of objects (one per DPU + 1 switch object)
get_module_index(self, DPU0):
Output: could be any value from 0 to modules count -1
- The test result is a pass if the return value matches the expected value as shown in the "steps" and "Sample Output"
- get_dpu_id(self):
- Should return ID of the DpuModule Ex: 1 on DPU0
- get_reboot_cause(self):
- Reboot the module and then execute the "show reboot-cause ..." CLIs
- Verify the output string shows the correct Time and Cause
- Limit the testing to software reboot
- get_state_info(self):
- This should return an object
- Stop the Syncd container on this DPU
- Execute the CLI and check that the dpu_control_plane state is down
- Check that dpu_data_plane is also down
- This will be enhanced in Phase 2 to test multiple failure scenarios and transitions.
- Switch
On Switch:
get_dpu_id(self):
Output: When on module DPUx should return x+1
get_reboot_cause(self):
Output: {"Device": "DPU0", "Name": 2024_05_31_00_33_30, "Cause": "reboot", "Time": "Fri 31 May 2024 12:29:34 AM UTC", "User": "NA", "Comment": "NA"}
get_state_info(self):
Output: get_state_info() {
'dpu_control_plane_reason': 'All containers are up and running, host-ethlink-status: Uplink1/1 is UP',
'dpu_control_plane_state': 'UP',
'dpu_control_plane_time': '20240626 21:13:25',
'dpu_data_plane_reason': 'DPU container named polaris is running, pdsagent running : OK, pciemgrd running : OK',
'dpu_data_plane_state': 'UP',
'dpu_data_plane_time': '20240626 21:10:07',
'dpu_midplane_link_reason': 'INTERNAL-MGMT : admin state - UP, oper_state - UP, status - OK, HOST-MGMT : admin state - UP, oper_state - UP, status - OK',
'dpu_midplane_link_state': 'UP',
'dpu_midplane_link_time': '20240626 21:13:25',
'id': '0'
}
- Verify that all the APIs mentioned return the expected output (see the sketch below).
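A sketch of checking the DpuModule APIs via the chassis module objects, again assuming sonic_platform is importable on the switch; module index 0 is used as an example, and the exact return shapes are implementation specific (see the sample output above).

```python
from sonic_platform.chassis import Chassis

chassis = Chassis()
module = chassis.get_module(0)               # DpuModule object for the first DPU

# The DPU id should be an integer and unique per DPU.
assert isinstance(module.get_dpu_id(), int)

# State info should be populated with the midplane/control/data plane entries.
state = module.get_state_info()
assert state is not None

# Reboot cause is compared against 'show reboot-cause ...' after a software reboot.
print(module.get_reboot_cause())
```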
- get_base_mac(self):
- Should return the base mac address of this DPU
- Read the MAC of every DPU and verify they are unique and not None
- get_system_eeprom_info(self):
- Verify the returned dictionary key:value
- get_name(self):
- Verify if this API returns “DPUx" on each of them
- get_description(self):
- Should return a string
- get_type(self):
- Should return “DPU” which is “MODULE_TYPE_DPU”
- get_oper_status(self):
- Should return the operational status of the DPU
- Stop one or more containers
- Execute the CLI and see if it is down
- Power down the dpu and check if the operational status is down.
- reboot(self, reboot_type):
- Issue this CLI with input “
- verify if the module reboots
- The reboot type should be updated based on SmartSwitch reboot HLD sonic-net/SONiC#1699
- get_midplane_ip(self):
- should return the midplane IP
- Switch
On Switch:
get_base_mac(self):
Output: BA:CE:AD:D0:D0:01
get_system_eeprom_info(self):
Output: eeprom info object
get_name(self):
Output: DPU2
get_description(self):
Output "Pensando DSC"
get_type(self):
Output: DPU
get_oper_status(self):
Output: Online
reboot(self, reboot_type):
Result: the DPU should reboot
get_midplane_ip(self):
Output: 169.254.200.1
- The test result is a pass if the return value matches the expected value as shown in the "steps" and "Sample Output".