Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reuse Rebuild IO handles #1755

Merged
merged 3 commits into from
Oct 17, 2024
Merged

Reuse Rebuild IO handles #1755

merged 3 commits into from
Oct 17, 2024

Conversation

tiagolobocastro
Copy link
Contributor

fix(rebuild): reuse rebuild IO handles

Reuses the rebuild IO handles, rather than attempting to allocate
them per rebuild task.
The main issue with handle allocation on the fly is that the target
may have not cleaned up a previous IO qpair connection, and so the
connect may fail. We started seeing this more on CI because we forgot
to cherry-pick a commit increasing the retry delay.
However, after inspecting a bunch of user support bundles I see that
we still have occasional connect errors. Rather than increasing the
timeout, we attempt here to reuse the handles, thus avoid the
problem almost entirely.

Signed-off-by: Tiago Castro <[email protected]>

refactor(rebuild): rebuild completion is not an error

When the rebuild has been complete, if we wait for it this fails because
the channels are not longer available.
Instead, simply return the rebuild state, since this is what we want anyway.

Signed-off-by: Tiago Castro <[email protected]>

When the rebuild has been complete, if we wait for it this fails because
the channels are not longer available.
Instead, simply return the rebuild state, since this is what we want anyway.

Signed-off-by: Tiago Castro <[email protected]>
Copy link
Contributor

@dsharma-dc dsharma-dc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.
As I understand we'll now make all 16 rebuild tasks use the same connection for reading/writing the rebuild segments. Is the rebuild descriptor destroyed today as soon as rebuild finishes and the nexus channels are reconfigured to add rebuilt bdev's handle? We wouldn't want a lingering handle in rebuild descriptor while the rebuilt child bdev handle has been added to main nexus channels too.

@tiagolobocastro
Copy link
Contributor Author

lgtm. As I understand we'll now make all 16 rebuild tasks use the same connection for reading/writing the rebuild segments.

Yes, each task has the immutable RebuildDescriptor and reuse the handles from there.

Is the rebuild descriptor destroyed today as soon as rebuild finishes and the nexus channels are reconfigured to add rebuilt bdev's handle? We wouldn't want a lingering handle in rebuild descriptor while the rebuilt child bdev handle has been added to main nexus channels too.

Yes that's right, the rebuild backend "runs away" when the rebuild is started. When the rebuild completes the backend terminates, and on the drop, the handles are also dropped.
Eventually we can also improve things here by moving the handle to the tasks themselves, allowing us to drop the handles if the rebuild is paused for example. Also that could be the first step to allowing the rebuild to run on multi-cores too.

For now simply moving the handle to descriptor is good enough for solving the connection issues.

Reuses the rebuild IO handles, rather than attempting to allocate
them per rebuild task.
The main issue with handle allocation on the fly is that the target
may have not cleaned up a previous IO qpair connection, and so the
connect may fail. We started seeing this more on CI because we forgot
to cherry-pick a commit increasing the retry delay.
However, after inspecting a bunch of user support bundles I see that
we still have occasional connect errors. Rather than increasing the
timeout, we attempt here to reuse the handles, thus avoid the
problem almost entirely.

Signed-off-by: Tiago Castro <[email protected]>
Brings in latest fixes and improvements.

Signed-off-by: Tiago Castro <[email protected]>
@tiagolobocastro
Copy link
Contributor Author

bors merge

@bors-openebs-mayastor
Copy link

Build succeeded:

@bors-openebs-mayastor bors-openebs-mayastor bot merged commit c7ec2b1 into develop Oct 17, 2024
4 checks passed
@bors-openebs-mayastor bors-openebs-mayastor bot deleted the rebuild-poll branch October 17, 2024 11:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants