-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HLD for diagnostic monitoring of CMIS based transceivers #1828
base: master
Are you sure you want to change the base?
HLD for diagnostic monitoring of CMIS based transceivers #1828
Conversation
Signed-off-by: Mihir Patel <[email protected]>
@qinchuanares - It will be great if you can help in reviewing this PR. |
@Junchao-Mellanox @keboliu - It will be great if you can help in reviewing this PR. |
|
||
### 4.2 Dynamic Diagnostic Information | ||
|
||
The `DomInfoUpdateTask` thread is responsible for updating the dynamic diagnostic information for all the transceivers in the system. The `DomInfoUpdateTask` thread is triggered by a timer (`DOM_INFO_UPDATE_PERIOD_SECS`), which is set to 60 seconds by default. The `DomInfoUpdateTask` thread reads the diagnostic information from the transceiver and updates the relevant tables in `redis-db`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious if there is any plan for making this alarm logging interrupt based?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor Can you please help in clarifying this?
doc/platform_api/CMIS_Diagnostic_Monitoring_Overview_in_SONiC.md
Outdated
Show resolved
Hide resolved
doc/platform_api/CMIS_Diagnostic_Monitoring_Overview_in_SONiC.md
Outdated
Show resolved
Hide resolved
doc/platform_api/CMIS_Diagnostic_Monitoring_Overview_in_SONiC.md
Outdated
Show resolved
Hide resolved
doc/platform_api/CMIS_Diagnostic_Monitoring_Overview_in_SONiC.md
Outdated
Show resolved
Hide resolved
doc/platform_api/CMIS_Diagnostic_Monitoring_Overview_in_SONiC.md
Outdated
Show resolved
Hide resolved
doc/platform_api/CMIS_Diagnostic_Monitoring_Overview_in_SONiC.md
Outdated
Show resolved
Hide resolved
doc/platform_api/CMIS_Diagnostic_Monitoring_Overview_in_SONiC.md
Outdated
Show resolved
Hide resolved
doc/platform_api/CMIS_Diagnostic_Monitoring_Overview_in_SONiC.md
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mihirpat1 I don't see how the timestamp, count for each alarm will be updated so that these times are as closer to the actual reporting time by the module
3. Read the transceiver firmware information from the module and update the `TRANSCEIVER_FIRMWARE_INFO` table. | ||
4. Read the transceiver DOM sensor data from the module and update the `TRANSCEIVER_DOM_SENSOR` table. | ||
5. Read the transceiver DOM flag data from the module, record the timestamp, and update the `TRANSCEIVER_DOM_FLAG` table. | ||
6. Analyze the transceiver DOM flag data by comparing the current flag data with the previous flag data and update the `TRANSCEIVER_DOM_FLAG_CHANGE_COUNT`, `TRANSCEIVER_DOM_FLAG_TIME_SET`, and `TRANSCEIVER_DOM_FLAG_TIME_CLEAR` tables. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also record the flag set/clear event to syslog?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Junchao-Mellanox Sure, will plan to record the set/clear event to syslog as a debug. Making it as a notice will make the syslog chatty.
6. Analyze the transceiver DOM flag data by comparing the current flag data with the previous flag data and update the `TRANSCEIVER_DOM_FLAG_CHANGE_COUNT`, `TRANSCEIVER_DOM_FLAG_TIME_SET`, and `TRANSCEIVER_DOM_FLAG_TIME_CLEAR` tables. | ||
7. Read the transceiver status data from the module and update the `TRANSCEIVER_STATUS` table. | ||
8. If the transceiver supports VDM monitoring, perform the following steps: | ||
1. Freeze the statistics by calling the CMIS API and wait for `FreezeDone`. Once the statistics are frozen, record the timestamp and copy the VDM and PM statistics from the transceiver. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please specify which API will be used to "freeze" the statistic? I guess it will write something to module EEPROM, and it is only applicable when module is under software control.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Junchao-Mellanox I have now updated the API name now. Can you please elaborate on what do you mean by "it is only applicable when module is under software control"?
----------- --------------- -------- -------- -------- -------- | ||
Example: | ||
admin@sonic#show interfaces transceiver vdm flag Ethernet1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to add a filter to get abnormal status only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor do we want to add filter as part of current enhancement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Junchao-Mellanox We are now displaying the output after applying the filter. A --detail
option has been added to allow user to dump data for all lanes irrespective of the flag being set.
**Tables Used for Flag Analysis:** | ||
|
||
- `TRANSCEIVER_VDM_FLAG`: This table stores flags indicating the status of various VDM parameters. | ||
- `TRANSCEIVER_VDM_FLAG_CHANGE_COUNT`: This table keeps a count of how many times each VDM flag has changed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If xcvrd stopped, will the change count be reset to 0? The same question to time set and time clear. Also, should we record the change count to syslog?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Junchao-Mellanox Yes - the change count and the set/clear time in redis-db tables will be removed when xcvrd
is stopped. We can log the change count to syslog before deleting the table.
@prgeor I have now added |
doc/platform_api/CMIS_Diagnostic_Monitoring_Overview_in_SONiC.md
Outdated
Show resolved
Hide resolved
doc/platform_api/CMIS_Diagnostic_Monitoring_Overview_in_SONiC.md
Outdated
Show resolved
Hide resolved
doc/platform_api/CMIS_Diagnostic_Monitoring_Overview_in_SONiC.md
Outdated
Show resolved
Hide resolved
3. Read the transceiver firmware information from the module and update the `TRANSCEIVER_FIRMWARE_INFO` table. | ||
4. Read the transceiver DOM sensor data from the module and update the `TRANSCEIVER_DOM_SENSOR` table. | ||
5. Read the transceiver DOM flag data from the module, record the timestamp, and update the `TRANSCEIVER_DOM_FLAG` table. | ||
6. Analyze the transceiver DOM flag data by comparing the current flag data with the previous flag data and update the `TRANSCEIVER_DOM_FLAG_CHANGE_COUNT`, `TRANSCEIVER_DOM_FLAG_TIME_SET`, and `TRANSCEIVER_DOM_FLAG_TIME_CLEAR` tables. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Junchao-Mellanox Sure, will plan to record the set/clear event to syslog as a debug. Making it as a notice will make the syslog chatty.
doc/platform_api/CMIS_Diagnostic_Monitoring_Overview_in_SONiC.md
Outdated
Show resolved
Hide resolved
----------- --------------- -------- -------- -------- -------- | ||
Example: | ||
admin@sonic#show interfaces transceiver vdm flag Ethernet1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor do we want to add filter as part of current enhancement?
6. Analyze the transceiver DOM flag data by comparing the current flag data with the previous flag data and update the `TRANSCEIVER_DOM_FLAG_CHANGE_COUNT`, `TRANSCEIVER_DOM_FLAG_TIME_SET`, and `TRANSCEIVER_DOM_FLAG_TIME_CLEAR` tables. | ||
7. Read the transceiver status data from the module and update the `TRANSCEIVER_STATUS` table. | ||
8. If the transceiver supports VDM monitoring, perform the following steps: | ||
1. Freeze the statistics by calling the CMIS API and wait for `FreezeDone`. Once the statistics are frozen, record the timestamp and copy the VDM and PM statistics from the transceiver. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Junchao-Mellanox I have now updated the API name now. Can you please elaborate on what do you mean by "it is only applicable when module is under software control"?
**Tables Used for Flag Analysis:** | ||
|
||
- `TRANSCEIVER_VDM_FLAG`: This table stores flags indicating the status of various VDM parameters. | ||
- `TRANSCEIVER_VDM_FLAG_CHANGE_COUNT`: This table keeps a count of how many times each VDM flag has changed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Junchao-Mellanox Yes - the change count and the set/clear time in redis-db tables will be removed when xcvrd
is stopped. We can log the change count to syslog before deleting the table.
doc/platform_api/CMIS_Diagnostic_Monitoring_Overview_in_SONiC.md
Outdated
Show resolved
Hide resolved
lasertemplowwarning_chg_cnt = INTEGER ; laser temperature low warning change count | ||
``` | ||
|
||
#### 2.1.5 Transceiver DOM flag time set data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if all the data type is in string, is it easy to convert it to datetime data type? In some cases we may need to parse this data and do time diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always show the current time from any CLI output would help us determining the time difference too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qinchuanares The timestamp is recorded in the format Day Mon DD HH:MM:SS YYYY
. Yes, I have now updated the CLI to show the current time as well.
doc/platform_api/CMIS_Diagnostic_Monitoring_Overview_in_SONiC.md
Outdated
Show resolved
Hide resolved
doc/platform_api/CMIS_Diagnostic_Monitoring_Overview_in_SONiC.md
Outdated
Show resolved
Hide resolved
; Defines Transceiver PM information for a port | ||
key = TRANSCEIVER_PM|ifname ; information of PM on port | ||
; field = value | ||
prefec_ber_avg = FLOAT ; prefec ber avg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could expand this to media and host prefec BER and uncorr frames, even if this is C-CMIS only.
Media BER and UC frames info comes from page 34h, host BER and UC frames come from page 3Ah
…ription of using EWMA to update diagnostics_update_interval and added info using first subport only for breakout port config
/azp run |
No pipelines are associated with this pull request. |
/azp run |
No pipelines are associated with this pull request. |
1. **Periodic update of the diagnostic information:** | ||
- The `DomInfoUpdateTask` thread periodically updates the diagnostic information for all the ports. | ||
- The update period interval can be retrieved by reading the `dom_info_update_periodic_secs` field from the `/usr/share/sonic/device/{platform}/{hwsku}/pmon_daemon_control.json` file. | ||
- If this field or the file is absent, the default timer value is 0 seconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mihirpat1 lets check this 0 secs does not caust high cpu in existing slower CPU platforms
doc/platform_api/CMIS_Diagnostic_Monitoring_Overview_in_SONiC.md
Outdated
Show resolved
Hide resolved
|
||
- **TRANSCEIVER_DOM_FLAG_CHANGE_COUNT:** | ||
- Each time a flag in the `TRANSCEIVER_DOM_FLAG` table changes (either set or cleared), the corresponding count in this table is incremented. | ||
- **TRANSCEIVER_DOM_FLAG_SET_TIME:** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- this is the
FIRST
time dom finds the flag is set either as part of link down event or normal periodic? - What if the flag remains set, do we update the time next polling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This table represents the time at which the
DomInfoUpdateTask
thread finds the corresponding diagnostic flag to be set for the first time since it was cleared. This can happen either during link change event or as part of the periodic polling. - If the flag remains set during the next polling,
DomInfoUpdateTask
will skip updating the set time.
- Number of ports in the breakout port group | ||
- Number of link change events between successive polls | ||
|
||
#### Calculation of `diagnostics_update_interval` Using EWMA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mihirpat1 please give some rationale behind what problem we are solving with EWMA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor I have addressed this now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mihirpat1 can we add the following in xcvr status table
- VDM supported
- VDM fine interval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
page 2F, byte 128, 129, 130
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor I have addressed this now. BTW, I am assuming that for VDM supported, you meant page 01h, byte 142, bit 6. Please correct me if I am wrong.
|
||
1. **Periodic update of the diagnostic information:** | ||
- The `DomInfoUpdateTask` thread periodically updates the diagnostic information for all the ports. | ||
- The update period interval can be retrieved by reading the `dom_info_update_periodic_secs` field from the `/usr/share/sonic/device/{platform}/{hwsku}/pmon_daemon_control.json` file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mihirpat1 is it ok to have .json per platform? Reason being Xcvrd will do DOM polling per physical port count which is same across all HWSKUs (so far) and unlike polling of all logical ports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor Agreed - I have removed the hwsku from the path now.
1. The `DomInfoUpdateTask` thread is created by the `xcvrd` process. | ||
2. The `dom_info_update_periodic_secs` value is retrieved from the `pmon_daemon_control.json` file to determine the interval for updating the diagnostic information for all the ports. If the `dom_info_update_periodic_secs` field is absent or set to 0, the diagnostic information will be updated continuously without any delay between updates. | ||
3. The `DomInfoUpdateTask` thread starts polling for the diagnostic information of the ports in a sequential manner: | ||
- It first checks if the `dom_info_update_periodic_secs` value is set. If not, it defaults to 0 seconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mihirpat1 can you add a section of how Xcvrd detects link event happened?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor I have added a section for this now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mihirpat1 can you add a requirement section detailing what are the goals we need to meet and what are the NON-goals or future work
read_diagnostic_info(current_pport) | ||
|
||
update_all_diagnostic_info_in_db = False | ||
expired_time = time.time() + dom_info_update_periodic_secs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't expired_time be computed by accounting for lost time in reading the dom values especially where lost time < dom_info_update_periodic_secs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor Yes, accounting for lost time in reading values can be useful if the lost time is < dom_info_update_periodic_secs
. However, I have suggested the current implementation to ensure that CPU usage does not spike if the lost time > dom_info_update_periodic_secs
. This approach is also more in line with the current implementation in DomInfoUpdateTask
. Let me know if you would like me to address the case where lost time is less than dom_info_update_periodic_secs
separately.
/azp run |
No pipelines are associated with this pull request. |
/azp run |
No pipelines are associated with this pull request. |
@mihirpat1 Capture the FreezeDone and UnfreezeDone time value defined by SONiC |
/azp run |
No pipelines are associated with this pull request. |
This HLD while provide an overview of how SONiC reads and stores the various diagnostic parameters read from a CMIS based transceiver.
Proposed changes for existing tables in SONiC
TRANSCEIVER_STATUS
toTRANSCEIVER_DOM_FLAG
table.