Clean up after unexpectedly terminated build #25102

Merged
merged 1 commit into from
Jan 27, 2025

@Honny1 Honny1 commented Jan 23, 2025

The podman system prune command can remove build containers that were created during the build but were not removed because the build terminated unexpectedly.

By default, build containers are not removed to prevent interference with builds in progress. Use the --build flag when running the command to remove build containers as well.

Reproducer:

  • Containerfile:
    FROM ubi8/ubi
    RUN truncate -s 10G out
    RUN echo "Hi"
    RUN sleep infinity
  • Test script run.sh:
    #!/usr/bin/env bash

    podman build -f Containerfile -t podmanleaker &
    sleep 60 && kill -9 $!
  • Measure the current size of images, containers, etc. before the build:
    podman unshare du -sh ~/.local/share/containers/
  • Run the test script (note: requires about 32 GB of disk space):
    ./run.sh
  • Measure the size again after the build has been terminated:
    podman unshare du -sh ~/.local/share/containers/
  • Clean up the leftovers of the build:
    podman system prune --build -f
  • Measure the size once more after the cleanup:
    podman unshare du -sh ~/.local/share/containers/

The size should match the first measurement, but it may differ if a base image is present on the system.

Fixes: #14523
Fixes: #23683
Fixes: https://issues.redhat.com/browse/RHEL-62009

Does this PR introduce a user-facing change?

The `podman system prune` command now supports removing build containers with the new `--build` option. 

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note labels Jan 23, 2025
@Honny1 Honny1 force-pushed the prune branch 2 times, most recently from 596c6bb to c607887 Compare January 23, 2025 10:34
@github-actions github-actions bot added the kind/api-change Change to remote API; merits scrutiny label Jan 23, 2025
if err != nil {
	return stageContainersPruneReports, err
}
if _, err := os.Stat(filepath.Join(path, "buildah.json")); errors.Is(err, fs.ErrNotExist) {

use fileutils.Exists()

@nalind Is there another (better?) way to check if a storage container is from buildah?
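
For readers unfamiliar with the check under discussion, here is a minimal standard-library sketch of it. It is not the code that was merged: the helper names `hasBuildahMarker` and `demoMarkerCheck` are hypothetical, and the only detail taken from the diff is the `buildah.json` file name. The suggested `fileutils.Exists()` is deliberately not used so the example stays free of external dependencies.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// hasBuildahMarker reports whether dir contains a buildah.json file.
// The helper name is hypothetical; only the file name comes from the diff.
func hasBuildahMarker(dir string) (bool, error) {
	_, err := os.Stat(filepath.Join(dir, "buildah.json"))
	if err == nil {
		return true, nil
	}
	if errors.Is(err, fs.ErrNotExist) {
		// A missing marker is an expected outcome, not a failure.
		return false, nil
	}
	// Any other error (for example a permission problem) is passed on.
	return false, err
}

// demoMarkerCheck checks a fresh temp dir before and after adding the marker.
func demoMarkerCheck() (before, after bool, err error) {
	dir, err := os.MkdirTemp("", "stage-container-")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	if before, err = hasBuildahMarker(dir); err != nil {
		return false, false, err
	}
	marker := filepath.Join(dir, "buildah.json")
	if err = os.WriteFile(marker, []byte("{}"), 0o600); err != nil {
		return false, false, err
	}
	after, err = hasBuildahMarker(dir)
	return before, after, err
}

func main() {
	before, after, err := demoMarkerCheck()
	fmt.Println(before, after, err)
}
```

Separating "file is absent" from "stat failed for another reason" keeps unrelated I/O errors from being read as "not a build container".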

size, err := r.store.ContainerSize(container.ID)
if err != nil {
	report.Err = err
	logrus.Warnf("Failed to get size of build stage container %s: %v", container.ID, err)

Do not both report the error to the caller and print a warning.
Just reporting the errors back to the caller is good enough. The caller can then log once.
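
A rough sketch of the pattern being asked for, under stated assumptions: `pruneReport`, `pruneOne`, `fakeRemove`, and `errBusy` are illustrative names, not the types or functions used in libpod. The point is only that the failure travels in the report and is logged in a single place.

```go
package main

import (
	"errors"
	"fmt"
)

// pruneReport is an illustrative stand-in for a prune report:
// an ID plus an optional error. It is not the libpod type.
type pruneReport struct {
	ID  string
	Err error
}

// errBusy and fakeRemove exist only to drive the example.
var errBusy = errors.New("container is busy")

func fakeRemove(id string) error {
	if id == "bad" {
		return errBusy
	}
	return nil
}

// pruneOne stores any failure in the report and does not log it.
func pruneOne(id string, remove func(string) error) pruneReport {
	return pruneReport{ID: id, Err: remove(id)}
}

func main() {
	var reports []pruneReport
	for _, id := range []string{"good", "bad"} {
		reports = append(reports, pruneOne(id, fakeRemove))
	}
	// The caller is the only place that logs, so each failure appears once.
	for _, r := range reports {
		if r.Err != nil {
			fmt.Printf("failed to prune %s: %v\n", r.ID, r.Err)
		}
	}
}
```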

Comment on lines 1299 to 1300
report.Err = err
logrus.Warnf("Failed to remove build stage container %s: %v", container.ID, err)

same here


reclaimedSpace := (uint64)(0)
found := true
for found {

Why is this a loop? Running once should be enough.

Comment on lines 79 to 103
stageContainersPruneReports, err := ic.Libpod.PruneStageContainers()
if err != nil {
	return nil, err
}
if len(stageContainersPruneReports) > 0 {
	found = true
}
reclaimedSpace += reports.PruneReportsSize(stageContainersPruneReports)
systemPruneReport.ContainerPruneReports = append(systemPruneReport.ContainerPruneReports, stageContainersPruneReports...)

// Prune Images
imagePruneOptions := entities.ImagePruneOptions{
	External:   true,
	BuildCache: true,
}
imageEngine := ImageEngine{Libpod: ic.Libpod}
imagePruneReports, err := imageEngine.Prune(ctx, imagePruneOptions)
if err != nil {
	return nil, err
}
if len(imagePruneReports) > 0 {
	found = true
}
reclaimedSpace += reports.PruneReportsSize(imagePruneReports)
systemPruneReport.ImagePruneReports = append(systemPruneReport.ImagePruneReports, imagePruneReports...)

This seems inconsistent with the documentation. From the docs you wrote, it sounds like this only prunes the build containers, but it prunes everything else as well, including other containers and images.
I don't think this is desirable.
If we want that behavior, then the code should not cause a conflict with the --build option and should only do this in addition to the existing code in SystemPrune(), instead of duplicating the image logic here.

Comment on lines 621 to 624
hasNone, result := none.GrepString("<none>")
Expect(result).To(HaveLen(1))
Expect(hasNone).To(BeTrue())

This leads to horrible "Expected false to be true" errors.

Please use the proper matchers, something like Expect(none.OutputToString()).To(ContainSubstring("none"))
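
To illustrate why the matcher form is preferred, the following standard-library sketch imitates the two styles of assertion and the failure text each one yields. It does not use Gomega: `expectTrue` and `expectContains` are hypothetical helpers written for this example, and the messages only approximate what a real matcher would print.

```go
package main

import (
	"fmt"
	"strings"
)

// expectTrue imitates asserting on a bare boolean: when it fails, the
// message carries no information about the value that was inspected.
func expectTrue(ok bool) error {
	if !ok {
		return fmt.Errorf("Expected false to be true")
	}
	return nil
}

// expectContains imitates a substring matcher: when it fails, the message
// includes both the actual output and the substring that was expected.
func expectContains(actual, substr string) error {
	if !strings.Contains(actual, substr) {
		return fmt.Errorf("expected %q to contain substring %q", actual, substr)
	}
	return nil
}

func main() {
	output := "localhost/notleaker latest"

	// Both checks fail, but only the second failure explains itself.
	fmt.Println(expectTrue(strings.Contains(output, "<none>")))
	fmt.Println(expectContains(output, "<none>"))
}
```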

Comment on lines 626 to 628
dirents, err := os.ReadDir(containerStorageDir)
Expect(err).ToNot(HaveOccurred())
Expect(dirents).To(HaveLen(6))

Does this matter? This peeks at rather internal storage details, which can change and would require the test to be updated often. Would it not be better to run buildah containers instead, and then at the end ensure the container is removed there?

after := podmanTest.Podman([]string{"images", "-a"})
after.WaitWithDefaultTimeout()
Expect(after).Should(ExitCleanly())
Expect(len(after.OutputToStringArray())).To(BeNumerically(">", 1))

Not sure what this should check, AFAIK in the test/e2e setup we will always have extra images from the additional store shown.

Comment on lines 645 to 649
hasNoneAfter, result := after.GrepString("<none>")
Expect(result).To(BeEmpty())
Expect(hasNoneAfter).To(BeFalse())
hasNotLeakerImager, _ := after.GrepString("notleaker")
Expect(hasNotLeakerImager).To(BeTrue())

also needs to use proper matchers

Comment on lines 651 to 654
// still have: volatile-containers.json, containers.json, containers.lock and container dir
dirents, err = os.ReadDir(containerStorageDir)
Expect(err).ToNot(HaveOccurred())
Expect(dirents).To(HaveLen(4))

Same comment: I would use buildah containers to check.

Luap99 commented Jan 23, 2025

Also, this doesn't answer what happens if the build process is still running. Another thing that this does not clean up is the networking. I'm not sure if there is a good way to do that, but if you kill the build and run with bridge networking, the interfaces/firewall rules and IPAM DB allocations are still leaked. They should be cleaned up after a reboot, so it is not as bad as the storage leak and certainly not a blocker for this PR. I just mention it as there are other leaks to consider too.

Honny1 commented Jan 23, 2025

@Luap99 I have incorporated your review. I've changed the approach and the --build flag allows the removal of build/stage containers for the podman system prune command, so it works the same as --volumes. The networking should also be cleaned.
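
The opt-in behavior described here can be pictured with a small sketch. Everything in it is an assumption made for illustration: `pruneOptions`, `planPrune`, and the step names are hypothetical and do not mirror the actual entities or control flow in Podman. The only idea taken from the discussion is that build containers are handled as an extra step enabled by a flag, in the same way as volumes.

```go
package main

import "fmt"

// pruneOptions is a hypothetical stand-in for the system prune options.
type pruneOptions struct {
	Volumes bool // corresponds to the idea of --volumes
	Build   bool // corresponds to the idea of --build
}

// planPrune returns the steps a prune run would perform. The default
// steps always run; opt-in steps are appended only when requested, so
// enabling Build never changes or duplicates the default behavior.
func planPrune(opts pruneOptions) []string {
	steps := []string{"containers", "pods", "images", "networks"}
	if opts.Volumes {
		steps = append(steps, "volumes")
	}
	if opts.Build {
		steps = append(steps, "build containers")
	}
	return steps
}

func main() {
	fmt.Println(planPrune(pruneOptions{}))
	fmt.Println(planPrune(pruneOptions{Build: true}))
}
```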

@Honny1 Honny1 marked this pull request as ready for review January 23, 2025 15:21
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 23, 2025
@Honny1 Honny1 requested a review from Luap99 January 23, 2025 15:46

@Luap99 Luap99 left a comment

Thanks, code-wise this seems good to me. Just a minor comment on the man page, and we need to make sure the test does not run forever.


Removes any build containers that were created during the build, but were not removed because the build was unexpectedly terminated.

> **This is not safe operation and should be executed only when no builds are in progress. It can interfere with builds in progress.**

I think the general style in the man pages was to write Note: ... or something like that. I don't think we have used > elsewhere. I don't mind it for the web view but I have not yet looked at how it looks in the rendered man page.

Comment on lines 606 to 612
if build.LineInOutputContains("Please use signal 9") {
	break
}

This is great, but maybe add a small sleep of 10ms or so to avoid busy polling, which also eats a lot of resources.

Second, this needs a timeout. If the build process fails and never prints the output, this would loop forever, causing hard-to-debug issues in CI.

Comment on lines 607 to 618
done := make(chan struct{})
go func() {
	for {
		if build.LineInOutputContains("Please use signal 9") {
			break
		}
		time.Sleep(100 * time.Millisecond)
	}
	build.Signal(syscall.SIGKILL)
	close(done)
}()
Eventually(done, defaultWaitTimeout).Should(BeClosed())

This will still hang the goroutine which means you leak the entire goroutine. This really needs a timeout in the loop so the loop actually ends.

matchedOutput := false

for range 900 {
	if build.LineInOutputContains("Please use signal 9") {
		matchedOutput = true
		break
	}
	time.Sleep(100 * time.Millisecond)
}
if !matchedOutput {
	Fail("Did not match special string in podman build")
}

Also when using goroutines in a ginkgo test one should use defer GinkgoRecover() to ensure error reporting there works. In this case it may not strictly be needed as there is no actual matching done but in case of a panic it should still be recovered by ginkgo.
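
The bounded loop suggested above generalizes to a small helper. The sketch below is standard-library only and self-contained; `pollUntil` is a hypothetical name, and the check function stands in for `build.LineInOutputContains(...)`. It uses a classic counted loop rather than `for range 900` so that it also compiles with Go versions older than 1.22.

```go
package main

import (
	"fmt"
	"time"
)

// pollUntil calls check up to attempts times and sleeps for interval after
// each unsuccessful try. It returns true as soon as check succeeds and false
// once the attempts are used up, so the loop always terminates.
func pollUntil(check func() bool, attempts int, interval time.Duration) bool {
	for i := 0; i < attempts; i++ {
		if check() {
			return true
		}
		time.Sleep(interval)
	}
	return false
}

func main() {
	calls := 0
	// Stand-in condition: becomes true on the third call.
	matched := pollUntil(func() bool {
		calls++
		return calls == 3
	}, 10, time.Millisecond)
	fmt.Println(matched, calls)

	// A condition that never becomes true ends after the attempt budget.
	fmt.Println(pollUntil(func() bool { return false }, 5, time.Millisecond))
}
```

With 900 attempts and a 100 ms interval, as in the suggestion, the worst-case wait is about 90 seconds plus the time spent in the checks themselves.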

Member Author

Thanks. I didn't realize that. Fixed.

The `podman system prune` command is able to remove build containers that were created during the build, but were not removed because the build terminated unexpectedly.

By default, build containers are not removed to prevent interference with builds in progress. Use the **--build** flag when running the command to remove build containers as well.

Fixes: https://issues.redhat.com/browse/RHEL-62009

Signed-off-by: Jan Rodák <[email protected]>

@Luap99 Luap99 left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 27, 2025

openshift-ci bot commented Jan 27, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Honny1, Luap99

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 27, 2025
@Luap99 Luap99 removed the lgtm Indicates that a PR is ready to be merged. label Jan 27, 2025

Luap99 commented Jan 27, 2025

Removed the lgtm label again (forgot there wasn't a second review)

rhatdan commented Jan 27, 2025

I am hitting this all the time when playing with AI builds

Thanks @Honny1
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 27, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 8d65d1e into containers:main Jan 27, 2025
80 checks passed
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/api-change Change to remote API; merits scrutiny lgtm Indicates that a PR is ready to be merged. release-note
3 participants