[202205] FDB learning caused orchagent exited when reboot #1267

Open
shuaishang opened this issue Jul 26, 2023 · 7 comments

@shuaishang
Collaborator

On SONiC 202205, orchagent (portsorch) occasionally crashes during initialization.
The cause is that an FDB entry is learned during reboot, which adds a reference to a default bridge port and raises its reference count.
portsorch then calls removeDefaultBridgePorts, which fails because the reference count is non-zero.

8783 2023 Jul 25 19:16:59.229376 NOTICE swss#orchagent: removeDefaultVlanMembers:801: Remove 34 VLAN members from default VLAN
8784 2023 Jul 25 19:16:59.254058 ERR swss#orchagent: meta_generic_validation_remove:2978: object 0x3a000000000082 reference count is 1, can't remove
8785 2023 Jul 25 19:16:59.254058 ERR swss#orchagent: removeDefaultBridgePorts:851: Failed to remove bridge port, rv:-17
8786 2023 Jul 25 19:16:59.255135 INFO swss#supervisord: orchagent terminate called after throwing an instance of 'std::runtime_error'
8787 2023 Jul 25 19:16:59.255135 INFO swss#supervisord: orchagent   what():  PortsOrch initialization failure

This issue was fixed previously in SONiC 201911:
#572

However, that fix was removed in 202205 and master.
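
The failure described in this report can be modelled with a toy reference counter. This is only a sketch to illustrate the sequence; `MetaDb` and its methods are hypothetical names, not sairedis or orchagent code, and the value -17 is taken from `rv:-17` in the log (assumed here to correspond to SAI_STATUS_OBJECT_IN_USE).

```python
# Toy model of the failure sequence; not sairedis or orchagent code.
SAI_STATUS_SUCCESS = 0
SAI_STATUS_OBJECT_IN_USE = -17  # assumption: the meaning of rv:-17 in the log


class MetaDb:
    """Hypothetical stand-in that tracks references to each object ID."""

    def __init__(self):
        self.refcount = {}

    def create(self, oid):
        self.refcount[oid] = 0

    def add_reference(self, oid):
        self.refcount[oid] += 1

    def remove(self, oid):
        # An object that is still referenced cannot be removed.
        if self.refcount[oid] != 0:
            return SAI_STATUS_OBJECT_IN_USE
        del self.refcount[oid]
        return SAI_STATUS_SUCCESS


def remove_default_bridge_ports(db, bridge_ports):
    """Mimic the init step that aborts on the first failed removal."""
    for oid in bridge_ports:
        rv = db.remove(oid)
        if rv != SAI_STATUS_SUCCESS:
            raise RuntimeError(f"PortsOrch initialization failure (rv:{rv})")


db = MetaDb()
bridge_port = 0x3A000000000082
db.create(bridge_port)
db.add_reference(bridge_port)  # an FDB entry learned on this bridge port

try:
    remove_default_bridge_ports(db, [bridge_port])
    outcome = "removed"
except RuntimeError as exc:
    outcome = str(exc)

print(outcome)
```

With the learned entry holding a reference, the removal is rejected and initialization aborts, which is the same shape as the `std::runtime_error` in the syslog.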

@shuaishang
Collaborator Author

@stephenxs Do you have any idea about this issue? Why was the fix removed from the 202205 and master branches?

@kcudnik
Collaborator

kcudnik commented Jul 26, 2023

FDB learning is disabled before reboot, so there should be no learning messages unless this was a race condition. Do you have the syslog and sairedis.rec from that timestamp?

@shuaishang
Collaborator Author

When the system boots up, the default FDB learning behavior depends on the vendor SAI/SDK.
orchagent has no chance to disable it before "PortsOrch::PortsOrch" calls "removeDefaultVlanMembers".
On our system, we did see an FDB event immediately after the switch was created:

2023-07-25.19:16:53.783625|A|SAI_STATUS_SUCCESS
2023-07-25.19:16:53.786185|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_INIT_SWITCH=true|SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY=0x55c0a7627db0|SAI_SWITCH_ATTR_FDB_UNICAST_MISS_PACKET_ACTION=SAI_PACKET_ACTION_DROP|SAI_SWITCH_ATTR_FDB_BROADCAST_MISS_PACKET_ACTION=SAI_PACKET_ACTION_DROP|SAI_SWITCH_ATTR_FDB_MULTICAST_MISS_PACKET_ACTION=SAI_PACKET_ACTION_DROP|SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY=0x55c0a7627dc0|SAI_SWITCH_ATTR_BFD_SESSION_STATE_CHANGE_NOTIFY=0x55c0a7627f30|SAI_SWITCH_ATTR_SWITCH_SHUTDOWN_REQUEST_NOTIFY=0x55c0a7627de0|SAI_SWITCH_ATTR_QUEUE_PFC_DEADLOCK_NOTIFY=0x55c0a7627e50|SAI_SWITCH_ATTR_SRC_MAC_ADDRESS=00:A0:C9:12:34:56|SAI_SWITCH_ATTR_CAPABILITY_EXTENSION=1:3204448703
2023-07-25.19:16:53.786632|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_DEFAULT_VIRTUAL_ROUTER_ID=oid:0x0
2023-07-25.19:16:59.154007|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_DEFAULT_VIRTUAL_ROUTER_ID=oid:0x3000000000024
2023-07-25.19:16:59.154041|n|port_state_change|[{"port_id":"oid:0x1000000000012","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2023-07-25.19:16:59.154623|n|port_state_change|[{"port_id":"oid:0x1000000000004","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2023-07-25.19:16:59.154660|c|SAI_OBJECT_TYPE_ROUTER_INTERFACE:oid:0x6000000000649|SAI_ROUTER_INTERFACE_ATTR_VIRTUAL_ROUTER_ID=oid:0x3000000000024|SAI_ROUTER_INTERFACE_ATTR_TYPE=SAI_ROUTER_INTERFACE_TYPE_LOOPBACK|SAI_ROUTER_INTERFACE_ATTR_MTU=9100
2023-07-25.19:16:59.154724|n|port_state_change|[{"port_id":"oid:0x1000000000023","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2023-07-25.19:16:59.154755|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x26000000000031\",\"mac\":\"52:54:00:A1:C3:B0\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_LEARNED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a000000000082"}]}]|
2023-07-25.19:16:59.154837|n|port_state_change|[{"port_id":"oid:0x1000000000012","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2023-07-25.19:16:59.154851|n|port_state_change|[{"port_id":"oid:0x1000000000023","port_state":"SAI_PORT_OPER_STATUS_DOWN"}]|
2023-07-25.19:16:59.156049|q|attribute_capability|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|OBJECT_TYPE=SAI_OBJECT_TYPE_QUEUE|ATTR_ID=SAI_QUEUE_ATTR_PFC_DLR_INIT
2023-07-25.19:16:59.157095|Q|attribute_capability|SAI_STATUS_SUCCESS|OBJECT_TYPE=SAI_OBJECT_TYPE_QUEUE|ATTR_ID=SAI_QUEUE_ATTR_PFC_DLR_INIT|CREATE_IMP=false|SET_IMP=true|GET_IMP=false
2023-07-25.19:16:59.157154|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_MAX_NUMBER_OF_TEMP_SENSORS=0
2023-07-25.19:16:59.157355|G|SAI_STATUS_NOT_SUPPORTED|
2023-07-25.19:16:59.157422|q|attribute_capability|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|OBJECT_TYPE=SAI_OBJECT_TYPE_PORT|ATTR_ID=SAI_PORT_ATTR_TPID
2023-07-25.19:16:59.157612|Q|attribute_capability|SAI_STATUS_SUCCESS|OBJECT_TYPE=SAI_OBJECT_TYPE_PORT|ATTR_ID=SAI_PORT_ATTR_TPID|CREATE_IMP=true|SET_IMP=true|GET_IMP=true
2023-07-25.19:16:59.157644|q|attribute_capability|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|OBJECT_TYPE=SAI_OBJECT_TYPE_LAG|ATTR_ID=SAI_LAG_ATTR_TPID
2023-07-25.19:16:59.157836|Q|attribute_capability|SAI_STATUS_SUCCESS|OBJECT_TYPE=SAI_OBJECT_TYPE_LAG|ATTR_ID=SAI_LAG_ATTR_TPID|CREATE_IMP=false|SET_IMP=false|GET_IMP=false
2023-07-25.19:16:59.163658|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_CPU_PORT=oid:0x0
2023-07-25.19:16:59.164116|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_CPU_PORT=oid:0x1000000000034
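
For anyone checking their own recording for the same pattern, a small script along these lines might help pull out the learned FDB entries and the bridge port each one references. It is a minimal sketch that assumes one record per line in the `timestamp|op|payload` layout seen in the paste above; `learned_fdb_events` is a hypothetical helper, not part of sairedis, and the first sample line is abbreviated.

```python
import json


def learned_fdb_events(lines):
    """Yield (timestamp, mac, bridge_port_oid) for each learned FDB entry."""
    for line in lines:
        parts = line.strip().split("|", 2)
        if len(parts) < 3:
            continue
        timestamp, op, rest = parts
        # Notifications use op "n"; only fdb_event notifications matter here.
        if op != "n" or not rest.startswith("fdb_event|"):
            continue
        payload = rest[len("fdb_event|"):].rstrip("|")
        for event in json.loads(payload):
            if event["fdb_event"] != "SAI_FDB_EVENT_LEARNED":
                continue
            # fdb_entry is itself a JSON document stored as a string.
            entry = json.loads(event["fdb_entry"])
            attrs = {a["id"]: a["value"] for a in event["list"]}
            yield (timestamp, entry["mac"],
                   attrs.get("SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID"))


# Lines taken from the recording above (switch create line abbreviated).
sample = [
    r'2023-07-25.19:16:53.786185|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_INIT_SWITCH=true',
    r'2023-07-25.19:16:59.154041|n|port_state_change|[{"port_id":"oid:0x1000000000012","port_state":"SAI_PORT_OPER_STATUS_UP"}]|',
    r'2023-07-25.19:16:59.154755|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x26000000000031\",\"mac\":\"52:54:00:A1:C3:B0\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_LEARNED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a000000000082"}]}]|',
]

events = list(learned_fdb_events(sample))
for timestamp, mac, bridge_port in events:
    print(timestamp, mac, bridge_port)
```

The bridge port OID it reports for this recording is the same object that the syslog says cannot be removed because its reference count is 1.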

@kcudnik
Collaborator

kcudnik commented Jul 26, 2023

OA explicitly turns off FDB learning before reboot so that this situation does not happen. Maybe this is some scenario other than a planned reboot, such as an unexpected reboot?

Please attach the full syslog and sairedis log from that day/event. Looking at your paste:

19:16:53.786185|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000
19:16:59.154041|n|port_state_change|[{"port_id":"oid:0x1000000000012","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
19:16:59.154623|n|port_state_change|[{"port_id":"oid:0x1000000000004","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
19:16:59.154724|n|port_state_change|[{"port_id":"oid:0x1000000000023","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
19:16:59.154755|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x26000000000031\",\"mac\":\"52:54:00:A1:C3:B0\",\"
  1. the switch is created
  2. some ports come up
  3. an FDB entry is learned. This means that FDB learning was not disabled by OA in the first place, which suggests that the switch was not shut down cleanly or that OA crashed. That's why we need the syslog to confirm.

@prsunny maybe we need to handle this kind of behavior as a special case in swss
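
The ordering in the list above can be lined up against the syslog from the original report with a little timestamp arithmetic. The sketch below uses only timestamps quoted in this thread and assumes the sairedis recording and the syslog share the same clock; the event labels are descriptive names chosen for illustration.

```python
from datetime import datetime

SAIREDIS_FMT = "%Y-%m-%d.%H:%M:%S.%f"   # format used in the sairedis recording
SYSLOG_FMT = "%Y %b %d %H:%M:%S.%f"     # format used in the syslog excerpt

switch_created = datetime.strptime("2023-07-25.19:16:53.786185", SAIREDIS_FMT)
fdb_learned = datetime.strptime("2023-07-25.19:16:59.154755", SAIREDIS_FMT)
vlan_member_removal = datetime.strptime("2023 Jul 25 19:16:59.229376", SYSLOG_FMT)
bridge_port_failure = datetime.strptime("2023 Jul 25 19:16:59.254058", SYSLOG_FMT)

# Merge both sources into one timeline, ordered by time.
timeline = sorted([
    (bridge_port_failure, "bridge port removal fails"),
    (vlan_member_removal, "default VLAN members removed"),
    (fdb_learned, "FDB entry learned"),
    (switch_created, "switch created"),
])

for when, name in timeline:
    print(when.time(), name)

create_to_learn = fdb_learned - switch_created
learn_to_removal = vlan_member_removal - fdb_learned
print("switch create -> FDB learned:", create_to_learn)
print("FDB learned -> VLAN member removal:", learn_to_removal)
```

Under that assumption, the learned entry is recorded before orchagent starts removing the default VLAN members, so the reference already exists by the time the bridge port removal is attempted.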

@shuaishang
Collaborator Author

Hi @kcudnik,

Thanks for your comments.
Regardless of the learning mode OA configured, on a cold reboot the vendor SAI/SDK does not retain the previous setting.
The SDK initializes the switch from scratch when OA creates the switch.

Thanks

@kcudnik
Collaborator

kcudnik commented Jul 27, 2023

If it's a cold boot, then all ports should be down by default. From the sairedis recording it seems you get port up notifications, so the ports were administratively up, which should not be the case in a cold boot scenario.

Again, please attach the syslog from around this boot, plus or minus a few minutes, so we can analyze what happened.

@prsunny
Contributor

prsunny commented Sep 7, 2023

> If it's a cold boot, then all ports should be down by default. From the sairedis recording it seems you get port up notifications, so the ports were administratively up, which should not be the case in a cold boot scenario.
>
> Again, please attach the syslog from around this boot, plus or minus a few minutes, so we can analyze what happened.

Agree with Kamil. Also, orchagent removes all ports from the bridge and from the default VLAN member association. MAC learning is not expected to happen in a normal cold boot.
