[FC] process FC after apply view #3326

Open
wants to merge 20 commits into
base: master

stepanblyschak
Contributor

What I did

Simplified the approach to delaying counters on warm boot and fast boot. Removed FLEX_COUNTER_DELAY_STATUS_FIELD; all flex counter (FC) processing is instead postponed until after apply view, so that it does not delay data plane configuration.

CONFIG_DB no longer needs to be updated at runtime in order to delay counters.
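
As a rough illustration of the idea described above (hold counter configuration back until apply view has completed, then process everything that was queued), here is a minimal, self-contained sketch. It is not the code from this PR: `DeferredCounterProcessor`, `onConfig`, and `onApplyViewDone` are hypothetical names invented for the example, and no swss-common API is used.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an orch that must not touch counters until
// the view has been applied. This is NOT the real FlexCounterOrch.
class DeferredCounterProcessor
{
public:
    // Called when a counter group config entry arrives.
    void onConfig(const std::string &group)
    {
        if (!m_ready)
        {
            // Hold the entry back; data plane configuration goes first.
            m_pending.push_back(group);
            return;
        }
        m_processed.push_back(group);
    }

    // Called once the (assumed) apply-view step has completed.
    void onApplyViewDone()
    {
        m_ready = true;
        // Drain everything that was queued, in arrival order.
        for (const auto &group : m_pending)
        {
            m_processed.push_back(group);
        }
        m_pending.clear();
    }

    const std::vector<std::string> &processed() const { return m_processed; }
    std::size_t pendingCount() const { return m_pending.size(); }

private:
    bool m_ready = false;
    std::vector<std::string> m_pending;
    std::vector<std::string> m_processed;
};
```

In this sketch, entries that arrive before `onApplyViewDone()` are only queued, and entries that arrive afterwards are processed immediately, so nothing has to be written back to a config store to express the delay.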

Why I did it

To address sonic-net/sonic-buildimage#20302.

How I verified it

Ran a warm boot and made sure the FC orch runs only after APPLY_VIEW.

@stepanblyschak
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run


Azure Pipelines successfully started running 1 pipeline(s).

@stepanblyschak
Contributor Author

Putting back the 60-second delay, as we found some cases where oper state update handling is delayed by FC configuration after APPLY_VIEW.

    m_bufferQueueConfigTable(db, CFG_BUFFER_QUEUE_TABLE_NAME),
    m_bufferPgConfigTable(db, CFG_BUFFER_PG_TABLE_NAME),
    m_deviceMetadataConfigTable(db, CFG_DEVICE_METADATA_TABLE_NAME)
{
    SWSS_LOG_ENTER();
    m_delayTimer = new SelectableTimer(timespec{.tv_sec = FLEX_COUNTER_DELAY_SEC, .tv_nsec = 0});
    if (WarmStart::isWarmStart())

Collaborator

Just to confirm: this will also handle fast-reboot?

Contributor Author

Yes

    SWSS_LOG_ENTER();

    SWSS_LOG_NOTICE("Processing counters");
    m_delayTimer->stop();

Collaborator

How about the following?

if (!m_delayTimerExpired)
{
    m_delayTimer->stop();
    m_delayTimerExpired = true;
}
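
To show what the suggested guard buys, here is a small self-contained sketch in which the timer is stopped at most once no matter how many times the handler fires. `FakeTimer` and `DelayGuard` are hypothetical stand-ins written for this example; only the member names `m_delayTimer` and `m_delayTimerExpired` come from the suggestion above, and the real timer class is not modelled.

```cpp
// Hypothetical stand-in for a timer; it only counts stop() calls.
struct FakeTimer
{
    int stopCalls = 0;
    void stop() { ++stopCalls; }
};

// Hypothetical holder for the guard logic from the review suggestion.
class DelayGuard
{
public:
    explicit DelayGuard(FakeTimer &timer) : m_delayTimer(timer) {}

    // May be invoked more than once; stops the timer only the first time.
    void onTimer()
    {
        if (!m_delayTimerExpired)
        {
            m_delayTimer.stop();
            m_delayTimerExpired = true;
        }
    }

    bool expired() const { return m_delayTimerExpired; }

private:
    FakeTimer &m_delayTimer;
    bool m_delayTimerExpired = false;
};
```

Calling `onTimer()` twice on the same `DelayGuard` leaves `stopCalls` at 1, which is the idempotent behaviour the flag is meant to provide.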

Contributor Author

Done

@@ -254,6 +258,15 @@ void FlexCounterOrch::doTask(Consumer &consumer)
    }
}

void FlexCounterOrch::doTask(SelectableTimer &timer)

Collaborator

Should we delete the timer here?

Contributor Author

Done

Signed-off-by: Stepan Blyschak <[email protected]>
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

    m_delayTimer = new SelectableTimer(timespec{.tv_sec = FLEX_COUNTER_DELAY_SEC, .tv_nsec = 0});
    if (WarmStart::isWarmStart())
    {
        auto executor = new ExecutableTimer(m_delayTimer, this, "FLEX_COUNTER_DELAY");

Contributor

I am not convinced by this approach, and I could use some help understanding it.

Previously we had a hardcoded delay in FC.
The hardcoded delay became a problem and we needed a more deterministic way to delay. So we added the new fast-reboot infra, which uses warm-reboot-style reconciliation and uses the fast-reboot done status to enable FC.

But now we are going back to a hardcoded delay. I am afraid that, in solving the new problem, we are (in a way) drifting back towards the day-1 problem. Scenarios I can imagine becoming problematic include 60s not being enough, so that FC processing overlaps with apply_view, or race conditions similar to the ones that led to the more deterministic delay.

Also, I think the concerns that Jingwen raised about new entries showing up in the table will still be present, right?
The config will still look different from what the init config specifies.

@wen587 , what do you think about this? Does this change really solve the concern that you have raised?

Contributor Author

@vaibhavhd The problem being solved here is sonic-net/sonic-buildimage#20302.
The hardcoded delay is not a reported problem, and this PR does not attempt to address it.
I am not sure about race conditions; could you provide an example? The timer can't overlap with apply view, as it is triggered in a different flow.

Hi @vaibhavhd, I think so. We want the config to stay exactly as it is in init_cfg.json, with no derived changes after a service restart. All of config DB is static in Golden Config, and we need Golden Config to be the source of truth.

VladimirKuk pushed a commit to Marvell-switching/sonic-buildimage that referenced this pull request Jan 21, 2025
Related to sonic-net/sonic-swss#3326.

Why I did it
Simplify approach to delaying counters on warm boot and fast boot. Removed FLEX_COUNTER_DELAY_STATUS_FIELD and instead postpone all FC processing to happen after apply view to not delay data plane configuration.

The CONFIG_DB should not be updated in runtime anymore for counters to be delayed.

To address sonic-net#20302.

How I did it
Removed FLEX_COUNTER_DELAY_STATUS_FIELD set in enable_counters.py.

How to verify it
Run warm-boot - make sure FC orch runs only after APPLY_VIEW.
stepanblyschak pushed a commit to stepanblyschak/sonic-swss that referenced this pull request Jan 27, 2025
@stepanblyschak
Contributor Author

Infra issue with tests: failure to install dotnet package:

+ sudo apt-get install -y dotnet-sdk-8.0
Reading package lists...
Building dependency tree...
Reading state information...
E: The package keysinuse needs to be reinstalled, but I can't find an archive for it.

Restarting

@stepanblyschak
Contributor Author

@qiluo-msft @bingwang-ms @vaibhavhd Can this be merged?

10 participants