Reconciliation process with many images is slow #177

Open
mbaynton opened this issue Jun 14, 2022 · 2 comments

@mbaynton

Our use case involves having nearly two thousand distinct images on each kubelet, and different images on different kubelets. We are evaluating kube-fledged as a component of how we can manage our image collections at this scale.

One finding not already covered by other issues: when we edit an existing ImageCache CRD of this size, reconciling the desired images against the images actually present takes a few minutes, even if the change only added or removed one image.

This appears to be attributable to this block, which adds every image in the modified CRD to a rate-limited work queue. Whether an image is already present is only determined later, inside the queue consumer. Computing a diff between the image list in the updated CRD and the image list in the node status once, upfront, before pushing to the work queue, might improve responsiveness.
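
To illustrate the idea, here is a minimal sketch of such an upfront diff in Go. It is not kube-fledged code: `diffImages` and its parameters are hypothetical names, and it assumes `present` holds only the images this ImageCache is known to manage on the node, so that unrelated images on the node are never selected for deletion.

```go
package main

import (
	"fmt"
	"sort"
)

// diffImages (hypothetical helper) compares the desired image list from the
// edited ImageCache with the images recorded as present on a node. It returns
// only the images that need a pull job and only those that need a delete job.
// Assumption: `present` contains only images managed by this ImageCache.
func diffImages(desired, present []string) (toPull, toDelete []string) {
	desiredSet := make(map[string]struct{}, len(desired))
	for _, img := range desired {
		desiredSet[img] = struct{}{}
	}
	presentSet := make(map[string]struct{}, len(present))
	for _, img := range present {
		presentSet[img] = struct{}{}
	}

	// Desired but not present: needs a pull.
	for img := range desiredSet {
		if _, ok := presentSet[img]; !ok {
			toPull = append(toPull, img)
		}
	}
	// Present but no longer desired: needs a delete.
	for img := range presentSet {
		if _, ok := desiredSet[img]; !ok {
			toDelete = append(toDelete, img)
		}
	}

	// Map iteration order is random, so sort for deterministic output.
	sort.Strings(toPull)
	sort.Strings(toDelete)
	return toPull, toDelete
}

func main() {
	present := []string{"repo/a:1", "repo/b:1", "repo/c:1"}
	desired := []string{"repo/a:1", "repo/c:1", "repo/d:1"}

	toPull, toDelete := diffImages(desired, present)
	fmt.Println("queue pull jobs for:", toPull)
	fmt.Println("queue delete jobs for:", toDelete)
}
```

With something like this, an edit that touches one image in a list of roughly two thousand would push one item to the work queue instead of the whole list. The cost of the diff itself is linear in the list sizes.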

We could be open to working on this issue so that kube-fledged better meets our particular use case, but we wanted to file this issue as a first step to see if there is interest in supporting ImageCaches of this size in principle, and if you foresee any difficulties with the proposal to reconcile the CRD with the node status data upfront before pushing to the work queue.

@omar-rs

omar-rs commented Jun 14, 2022

Here are some additional notes related to the issue above.

Setup:

  • 1676 images in a single imagecache instance on a single node
  • all images are already pulled on the node (from an ECR registry)
  • all images are already part of the ImageCache instance definition

Test 1: Remove images from an imagecache

  • edit the imagecache instance to remove 10 images from the node
  • 1666 "Job not created (image-already-present)" messages appeared in the controller log
  • first delete job did not start until about 2min 37sec from the start of the image cache edit - all this time is consumed by the checking of existing images (Job not created)
  • all delete jobs completed within 13sec of start
  • overall, took about 2min 51sec to complete the imagecache sync and status update

Test 2: Append images to the end of the imagecache

  • edit the imagecache instance to append 10 images to the node (at the end of the list)
  • 1676 "Job not created (image-already-present)" messages appeared in the controller log
  • first pull job did not start until about 2min 37sec from the start of the image cache edit - all this time is consumed by the checking of existing images (Job not created)
  • all pull jobs completed within 21sec of start
  • overall, took about 3min 1sec to complete the imagecache sync and status update

Test 3: Add images at the top of the imagecache list

  • edit the imagecache instance to add 10 images to the node at the top of the list
  • 10 pull jobs were created in < 1sec
  • "Job not created (image-already-present)" output then started to appear in the controller log
  • took about 20 sec to complete the last pull job
  • overall, took about 2min 40sec to complete the imagecache sync and status update
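
As a rough cross-check of these timings, the sketch below estimates how long a token-bucket rate limiter would take to admit that many queue entries. It rests on an assumption this thread does not confirm: that the work queue uses client-go's default controller rate limiter, whose overall bucket allows 10 items per second with a burst of 100. `drainSeconds` is a hypothetical helper written for this estimate.

```go
package main

import "fmt"

// drainSeconds (hypothetical helper) estimates how long a token-bucket rate
// limiter takes to admit `items` queue entries: the first `burst` entries pass
// immediately, and the remainder are admitted at `qps` entries per second.
func drainSeconds(items, burst int, qps float64) float64 {
	if items <= burst {
		return 0
	}
	return float64(items-burst) / qps
}

func main() {
	// 1676 entries, as in the setup above.
	// burst=100 and qps=10 are ASSUMED values (client-go defaults),
	// not values read from the kube-fledged configuration.
	fmt.Printf("%.1f s\n", drainSeconds(1676, 100, 10))
}
```

Under those assumptions the estimate is (1676 - 100) / 10 = 157.6 seconds, close to the roughly 2min 37sec observed before the first job starts in Tests 1 and 2. If that is the cause, the delay comes from queue admission rather than from the image checks themselves, and it would grow linearly with the number of images in the imagecache.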

@senthilrch senthilrch self-assigned this Oct 21, 2022
@senthilrch
Owner

@mbaynton @omar-rs : Thanks for reporting this issue and the in-depth analyses you performed with kube-fledged.

I am keen on improving the performance of kube-fledged to meet your particular use case. The scenario of modifying an existing imagecache is not fully optimised for performance: it is treated the same as reconciling a new imagecache, so you see ALL the image pulls (and deletes) getting queued to the image manager routine.

It makes perfect sense to queue only the image pulls (and deletes) that are required. I'll come up with a proposal for this.

@senthilrch senthilrch added the enhancement Enhancement to existing feature label Oct 26, 2022
@senthilrch senthilrch added this to the v0.11.0 milestone Mar 10, 2023
@senthilrch senthilrch added the done Code pushed to develop branch label Mar 10, 2023
3 participants