Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsfs - monitor only nsrs that are mounted. DFBUGS-153 #8561

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 6 additions & 10 deletions src/endpoint/endpoint.js
Original file line number Diff line number Diff line change
Expand Up @@ -207,16 +207,12 @@ async function main(options = {}) {
// there for namespace monitor won't be registered
if (internal_rpc_client && config.NAMESPACE_MONITOR_ENABLED) {
endpoint_stats_collector.instance().set_rpc_client(internal_rpc_client);

//wait with monitoring until pod has started
setTimeout(() => {
// Register a bg monitor on the endpoint
background_scheduler.register_bg_worker(new NamespaceMonitor({
name: 'namespace_fs_monitor',
client: internal_rpc_client,
should_monitor: nsr => Boolean(nsr.nsfs_config),
}));
}, 1000 * 60);
// Register a bg monitor on the endpoint
background_scheduler.register_bg_worker(new NamespaceMonitor({
name: 'namespace_fs_monitor',
client: internal_rpc_client,
should_monitor: nsr => Boolean(nsr.nsfs_config && process.env['NSFS_NSR_' + nsr.name]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions -

  1. If the endpoint got up before the namespace resource was mounted, when is the next time we will get to this flow for start monitoring? why not add a retry after 60 seconds -
    nsfs | wait for endpoint startup before namespace monitor registration #8474 (comment)
  2. Why avoid start monitoring instead of externalizing that the value of process.env['NSFS_NSR_' + nsr.name] is undefined which means that the PV was not mounted yet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. An endpoint that was started before the mount will be deleted after a new endpoint is created with the new mount.
    The retry will not help as the endpoint that opened the report is removed when the new nsfs nsr mount is added (after nsfs nsr was created in kubernetes cluster).
  2. There could be other nsfs nsrs that should be monitored (correct me if I'm wrong).

Maybe I'll make the scenario more concrete-

  1. Operator install a system in a cluster.
  2. There is endpoint A. It does NOT have any nsfs nsr mounts.
  3. At some point, an nsfs nsr is created in the cluster.
  4. In reconcile:
  5. a. operator adds a mount for the nsfs nsr to endpoints' container.
  6. b. operator creates an nsr object in system store.
  7. A new endpoint B with the new mount is created by kubernetes.
  8. While B is being created, A updates its system store, reads the nsfs nsr. The new nsfs nsr is NOT mounted in A. A reports NOENT on the nsfs nsr. Note since default interval for nsfs nsr monitoring is less than creating a new endpoint, this doesn't necessarily happen. Reducing config.NAMESPACE_MONITOR_DELAY will ensure bug reproduction.
  9. Endpoint B is ready. Endpoint A is deleted. Nsr status is stuck in rejected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned on Slack, I think that the correct path is not to avoid monitoring a namespace resource that is still not mounted but add this check to the monitoring process.

Copy link
Contributor Author

@alphaprinz alphaprinz Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nsr will be monitored by the new endpoint.
The old endpoint is about to be deleted.
The only difference this commit makes is that old endpoints won't mistakenly report nsr as rejected.

If you think that the about-to-be-deleted endpoint should do something about the mount it will never have (or anything else, for that matter) please specify it explicitly. The current "add this to monitoring process" is too vague. Also specify explicitly if this is an enhancement or part of the bug fix.

Copy link
Contributor

@romayalon romayalon Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"About-to-be-deleted" is the happy path :)
There is also the sad path where there is an issue with the mounting and it takes a while/never happens - that's exactly why I think it's important and avoiding monitoring it if it's not mounted is a partial solution from my prespective.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not trying to solve monitoring, but rather to fix a bug in monitoring.
I'm not removing any feature that we currently have.

Again, I would like a more specific way to proceed.
If you think a different fix or an enhancement to the monitoring is needed, please specify it explicitly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alphaprinz
I already explained it in the above comment, but I'll be happy to summarize my comments -

My specific idea for solving it -
Instead of not monitoring unmounted namespace resources, I think you should move the new condition you added inside the monitoring check, and externalize that this is the current issue that the namespace resource has.
Comment 1, bullet 2
Comment 3

Why I think that my suggestion is a better behavior / user experience -
It will behave better in cases where the re-start of the endpoint takes time/won't happen at all.
Comment 5

How to proceed -
The above summary of comments is my opinion/suggestion/how I would fix it.
IMO, You shall proceed from here as how you see it, fix it, open an issue and call it enhancement, document this gap or anything else you feel appropriate.

}));
}

if (config.ENABLE_SEMAPHORE_MONITOR) {
Expand Down
Loading