Add watch request timeout to prevent watch request hang #5732

xigang · 2024-10-23T11:51:35Z

What type of PR is this?
/kind bug

What this PR does / why we need it:
When the federate-apiserver's watch request to the member cluster gets stuck, it will cause the watch request from the federated client to get stuck as well.

Which issue(s) this PR fixes:
Fixes #5672

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

karmada-bot · 2024-10-23T11:51:41Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ikaven1024 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/search/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov-commenter · 2024-10-23T12:04:18Z

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 48.14815% with 14 lines in your changes missing coverage. Please review.

Project coverage is 41.58%. Comparing base (331145f) to head (b47f1d6).
Report is 35 commits behind head on master.

Files with missing lines	Patch %	Lines
pkg/search/proxy/store/multi_cluster_cache.go	48.14%	13 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5732      +/-   ##
==========================================
+ Coverage   40.90%   41.58%   +0.67%     
==========================================
  Files         650      655       +5     
  Lines       55182    55773     +591     
==========================================
+ Hits        22573    23191     +618     
+ Misses      31171    31076      -95     
- Partials     1438     1506      +68

Flag	Coverage Δ
unittests	`41.58% <48.14%> (+0.67%)`	⬆️

xigang · 2024-10-23T14:29:11Z

cc @XiShanYongYe-Chang @RainbowMango @ikaven1024 PTAL.

zhzhuang-zju · 2024-10-24T01:44:12Z

pkg/search/proxy/store/multi_cluster_cache.go

 			return nil, err
+		case <-time.After(30 * time.Second):
+			// If the watch request times out, return an error, and the client will retry.
+			return nil, fmt.Errorf("timeout waiting for watch for resource %v in cluster %q", gvr.String(), cluster)


@xigang Hi, if a watch request is hanging and causes a timeout, will the hanging watch request continue to exist in the subprocess?

@zhzhuang-zju Yes, there is this issue. When a watch request times out, the goroutine needs to be terminated.

Good point! In that case we have to cancel the context passed to cache.Watch().

So this patch intends to end the hang by returning an error after a period of time. Is that the idea?
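To make the idea under discussion concrete, here is a minimal, self-contained sketch of "wait for the watch with a timeout, and cancel its context if the timeout fires". It is not the code in this PR: `watcher`, `watchFunc`, `watchWithTimeout`, `fakeWatch`, and `demo` are hypothetical names, and `watchFunc` only stands in for `cache.Watch`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// watcher is a hypothetical stand-in for the watch handle returned by cache.Watch.
type watcher interface{ Stop() }

// watchFunc is a hypothetical stand-in for cache.Watch.
type watchFunc func(ctx context.Context) (watcher, error)

var errWatchTimeout = errors.New("timeout waiting for watch")

// watchWithTimeout runs watch in a goroutine and waits at most timeout for it.
// On timeout it calls cancel so that a watch honoring ctx can abandon the call.
func watchWithTimeout(ctx context.Context, cancel context.CancelFunc, watch watchFunc, timeout time.Duration) (watcher, error) {
	type result struct {
		w   watcher
		err error
	}
	// Buffered so the goroutine can always send and exit, even after a timeout.
	// Note: a watcher that arrives after the timeout is simply dropped here.
	resCh := make(chan result, 1)
	go func() {
		w, err := watch(ctx)
		resCh <- result{w, err}
	}()

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case res := <-resCh:
		return res.w, res.err
	case <-timer.C:
		cancel()
		return nil, errWatchTimeout
	}
}

type fakeWatcher struct{}

func (fakeWatcher) Stop() {}

// fakeWatch simulates a watch call that needs delay to return and that
// gives up early when its context is cancelled.
func fakeWatch(delay time.Duration) watchFunc {
	return func(ctx context.Context) (watcher, error) {
		select {
		case <-time.After(delay):
			return fakeWatcher{}, nil
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}

// demo returns "ok" when the watch is established in time, else the error text.
func demo(delayMs, timeoutMs int) string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	w, err := watchWithTimeout(ctx, cancel,
		fakeWatch(time.Duration(delayMs)*time.Millisecond),
		time.Duration(timeoutMs)*time.Millisecond)
	if err != nil {
		return err.Error()
	}
	w.Stop()
	return "ok"
}

func main() {
	fmt.Println(demo(1, 200))  // watch is ready well before the timeout
	fmt.Println(demo(500, 20)) // watch hangs past the timeout
}
```

The sketch only helps if the hung call actually observes its context; a call that ignores ctx would keep its goroutine alive after the error is returned.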

Another question:
Before starting the Watch, we try to get the cache for that cluster. I'm curious why this cache still exists after the cluster is gone. Do we have a chance to clean it up?

karmada/pkg/search/proxy/store/multi_cluster_cache.go

Lines 333 to 336 in e7b6513

	cache := c.cacheForClusterResource(cluster, gvr)
	if cache == nil {
		continue
	}

@RainbowMango When a member cluster goes offline but its Cluster resource in the control plane is not deleted, the offline cluster may not be removed from the ResourceRegistry, so its resource cache is retained for a short time.

@RainbowMango @zhzhuang-zju Fixed, please take a look.

xigang · 2024-10-28T02:18:22Z

/retest

ikaven1024 · 2024-10-28T03:25:14Z

pkg/search/proxy/store/multi_cluster_cache.go

 			return nil, err
+		case <-time.After(30 * time.Second):


It seems this waits up to 30s for each cluster in turn. Should we wait on all clusters in parallel?

@ikaven1024 There’s no issue here; as soon as a single cache.Watch times out, the Watch request returns with an error and ends.😄

But if creating the watch takes 20s for every cluster, which is under the timeout, the total time spent is 20s * N.
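As a rough illustration of the parallel alternative raised here, the sketch below starts the setup for every cluster concurrently under one shared deadline, so the total wait is bounded by the slowest cluster (or the timeout) rather than by the sum. It is a simplified model, not the PR's code: `connect`, `connectAll`, and `demoParallel` are hypothetical names, and `connect` models only the setup phase of a watch and is assumed to honor its context.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// connect is a hypothetical stand-in for setting up a watch on one member cluster.
// It is assumed to return promptly once ctx is done.
type connect func(ctx context.Context, cluster string) error

// connectAll runs fn for every cluster concurrently under one shared deadline
// and returns the first recorded error, if any.
func connectAll(parent context.Context, clusters []string, fn connect, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	errs := make([]error, len(clusters)) // one slot per goroutine, so no locking is needed
	var wg sync.WaitGroup
	for i, cluster := range clusters {
		wg.Add(1)
		go func(i int, cluster string) {
			defer wg.Done()
			if err := fn(ctx, cluster); err != nil {
				errs[i] = fmt.Errorf("cluster %q: %w", cluster, err)
			}
		}(i, cluster)
	}
	wg.Wait()

	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

// demoParallel connects to n simulated clusters that each need delayMs and
// reports the elapsed wall time in milliseconds and whether all succeeded.
func demoParallel(n, delayMs, timeoutMs int) (elapsedMs int, ok bool) {
	clusters := make([]string, n)
	for i := range clusters {
		clusters[i] = fmt.Sprintf("member%d", i+1)
	}
	slow := func(ctx context.Context, cluster string) error {
		select {
		case <-time.After(time.Duration(delayMs) * time.Millisecond):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	start := time.Now()
	err := connectAll(context.Background(), clusters, slow, time.Duration(timeoutMs)*time.Millisecond)
	return int(time.Since(start).Milliseconds()), err == nil
}

func main() {
	elapsed, ok := demoParallel(5, 40, 1000)
	fmt.Printf("5 clusters at 40ms each: ok=%v elapsed=%dms\n", ok, elapsed)
}
```

With a sequential loop the same five clusters would take roughly 5 * 40ms; here the elapsed time should stay close to a single 40ms delay.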

ikaven1024 · 2024-10-28T03:28:48Z

pkg/search/proxy/store/multi_cluster_cache.go

+		errChan := make(chan error, 1)
+
+		go func(cluster string) {
+			w, err := cache.Watch(ctx, options)


If this watcher is created after the 30s timeout, there seems to be no way to stop it. Is that a leak?

If creating the watcher takes longer than 30 seconds, the time.After case fires and an error is returned; cancel is then called, which stops the watcher goroutine.

karmada/vendor/k8s.io/apiserver/pkg/endpoints/handlers/get.go

Line 263 in e65e993

defer func() { cancel() }()
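For the leak question above, one defensive pattern is for the creating goroutine to stop a watcher itself when it arrives after the caller has already given up. The following is a sketch under assumptions, not a description of what this PR or the apiserver handler does: `watcher`, `signalWatcher`, `startWatch`, and `demo` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// watcher is a hypothetical stand-in for the watch handle returned by cache.Watch.
type watcher interface{ Stop() }

// signalWatcher closes stopped when Stop is called, so the demo can observe it.
type signalWatcher struct{ stopped chan struct{} }

func (w signalWatcher) Stop() { close(w.stopped) }

// startWatch runs create in a goroutine and hands the watcher to the caller.
// If ctx is already cancelled when create returns (the caller timed out),
// the goroutine stops the late watcher instead of leaving it running.
func startWatch(ctx context.Context, create func() watcher) <-chan watcher {
	out := make(chan watcher) // unbuffered: the send succeeds only if the caller is still receiving
	go func() {
		w := create()
		select {
		case out <- w:
		case <-ctx.Done():
			w.Stop()
		}
	}()
	return out
}

// demo reports whether the caller received the watcher and whether a late
// watcher was stopped by the creating goroutine.
func demo(createMs, timeoutMs int) (received, stopped bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	sw := signalWatcher{stopped: make(chan struct{})}
	out := startWatch(ctx, func() watcher {
		time.Sleep(time.Duration(createMs) * time.Millisecond) // simulated slow creation
		return sw
	})

	select {
	case <-out:
		return true, false // the caller now owns the watcher
	case <-time.After(time.Duration(timeoutMs) * time.Millisecond):
		cancel() // give up; the goroutine cleans up when create returns
	}

	select {
	case <-sw.stopped:
		return false, true
	case <-time.After(2 * time.Second):
		return false, false
	}
}

func main() {
	fmt.Println(demo(1, 300))  // created in time: received by the caller
	fmt.Println(demo(100, 10)) // created after the timeout: stopped by the goroutine
}
```

Whether such explicit cleanup is needed depends on whether the real watcher already shuts down when its context is cancelled, which is the point being debated in this thread.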

karmada-bot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 23, 2024

karmada-bot requested review from ikaven1024 and XiShanYongYe-Chang October 23, 2024 11:51

karmada-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 23, 2024

xigang changed the title ~~Add timeout for watch requests to member clusters to prevent request …~~ Add watch request timeout to prevent watch request hang Oct 23, 2024

zhzhuang-zju reviewed Oct 24, 2024

xigang force-pushed the bugfix/watch branch 2 times, most recently from 21c10d3 to 5a55ca2 Compare October 27, 2024 02:50

ikaven1024 reviewed Oct 28, 2024

xigang force-pushed the bugfix/watch branch from 5a55ca2 to a9e3aa1 Compare October 29, 2024 07:30

Add timeout for watch requests to member clusters to prevent request …

b47f1d6

…hang Signed-off-by: xigang <[email protected]>

xigang force-pushed the bugfix/watch branch from a9e3aa1 to b47f1d6 Compare October 29, 2024 07:39
