Squashed all layers #3138
Signed-off-by: tomersein <[email protected]>
@tomersein - I know very little about the Syft internals, and I'm trying to understand this PR. From the code and comments I understand that the new option will catalog packages from all layers, but then only include packages that are visible in the squashed file-system. How is that different from the regular squashed scope (or, I could probably rephrase this to: what is the difference between 'cataloging' and 'including')? My main concern is whether this would (eventually) help to fix issue #1818. Many thanks!
hi @dbrugman ,
Got it, thanks @tomersein
Hi @tomersein -- thanks for the contribution. I don't think we would want to merge this as-is, though. I wonder if there are any other things we may be able to do in order for you to accomplish what you're hoping to achieve. So I understand correctly: the use case is to be able to find the layer which introduced a package, right?
yes, correct @kzantow, let me know what the gaps are so I can push some fixes / improvements.
@kzantow - please see my notes after the meeting yesterday
any update? :) @wagoodman
Signed-off-by: tomersein <[email protected]>
Signed-off-by: tomersein <[email protected]>
Signed-off-by: tomersein <[email protected]>
Signed-off-by: tomersein <[email protected]>
I made some static analysis corrections and all checks now pass.
@tomersein thank you for submitting a candidate solution to the layer-of-first-attribution tracking problem. Let me first summarize how this PR is achieving attribution. The first change involves adding a new file Resolver, which makes use of the squashed resolver and all-layer resolver based on the use case. The second change is adding

Take for example a (rather silly) Dockerfile:

FROM ubuntu:latest
RUN apt update -y
RUN apt install -y jq
RUN apt install -y vim
RUN apt install -y wget curl

And after build:
[
"sha256:c26ee3582dcbad8dc56066358653080b606b054382320eb0b869a2cb4ff1b98b",
"sha256:5ba46f5cab5074e141556c87b924bc3944507b12f3cd0f71c5b0aa3982fb3cd4",
"sha256:1fde57bfea7ecd80e6acc2c12d90890d32b7977fec17495446678eb16604d8c7",
"sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
"sha256:4097f47ebf86f581c9adc3c46b7dc9f2a27db5c571175c066377d0cef9995756"
]

Here we'll have multiple copies of the DPKG status file, which means classically we'll use the last layer for all evidence locations for packages (at least when it comes to the primary evidence location for the status file). Let's take a look at just
[
{
"path": "/usr/share/doc/vim/copyright",
"layerID": "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
"accessPath": "/usr/share/doc/vim/copyright",
"annotations": {
"evidence": "supporting"
}
},
{
"path": "/var/lib/dpkg/info/vim.md5sums",
"layerID": "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
"accessPath": "/var/lib/dpkg/info/vim.md5sums",
"annotations": {
"evidence": "supporting"
}
},
{
"path": "/var/lib/dpkg/status",
"layerID": "sha256:4097f47ebf86f581c9adc3c46b7dc9f2a27db5c571175c066377d0cef9995756",
"accessPath": "/var/lib/dpkg/status",
"annotations": {
"evidence": "primary"
}
},
{
"path": "/var/lib/dpkg/status",
"layerID": "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
"accessPath": "/var/lib/dpkg/status",
"annotations": {
"evidence": "primary"
}
}
]

Note that we see the original layer the package was added (

Here's what I see when running a before and after:
It looks like ~138 packages were found during cataloging, and the number then dropped to ~132 before finalizing, so that's good. But I noticed these runs took different times: 8 seconds vs 11 seconds. That is not a big difference, but given that this is a small and simple image it is worth looking at. I believe this is because we're essentially doing both a squashed scan + an all-layers scan implicitly, since the resolver will return all references from both resolvers (not deduplicating

Also note that there are several more executables and files cataloged! This is concerning since this should behave no differently than the squashed cataloger from a count perspective. It's not immediately apparent what is causing this, but it is a large blocker for this change (at first glance I think it's because catalogers are creating duplicate packages and relationships, and only the packages are getting deduplicated, not the relationships... this should be confirmed though).

After reviewing the PR there are a few problems that seem fundamental:
What's the path forward from here? I think there is a chance of modifying this PR to get it to a mergeable state, but it would require looking into the following things:
The following changes would additionally be needed:
@tomersein shout out if you want to sync on this explicitly, I'd be glad to help. A good default time to chat with us is during our community office hours. Our next one is going to be this Thursday at noon ET. If that doesn't work we can always chat through discourse group topics or DMs to set up a separate zoom call.
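To make the attribution goal concrete, here is a minimal sketch (not Syft's implementation) of picking the layer of first attribution from the evidence locations and layer order shown above. The `Location` type and the `firstAttributionLayer` function are hypothetical names used only for illustration; the digests and the primary evidence entries are copied from the example output for `vim`.

```go
package main

import "fmt"

// Location is a simplified, hypothetical stand-in for the JSON entries above;
// it is not Syft's own location type.
type Location struct {
	Path     string
	LayerID  string
	Evidence string // "primary" or "supporting"
}

// firstAttributionLayer returns the earliest layer (by position in layerOrder)
// that holds primary evidence, and false if no such layer is found.
func firstAttributionLayer(layerOrder []string, locs []Location) (string, bool) {
	index := make(map[string]int, len(layerOrder))
	for i, id := range layerOrder {
		index[id] = i
	}
	best := -1
	for _, loc := range locs {
		if loc.Evidence != "primary" {
			continue // supporting evidence is ignored in this sketch
		}
		i, known := index[loc.LayerID]
		if !known {
			continue // layer not part of this image
		}
		if best == -1 || i < best {
			best = i
		}
	}
	if best == -1 {
		return "", false
	}
	return layerOrder[best], true
}

func main() {
	// Layer order as listed after the build, base layer first.
	layers := []string{
		"sha256:c26ee3582dcbad8dc56066358653080b606b054382320eb0b869a2cb4ff1b98b",
		"sha256:5ba46f5cab5074e141556c87b924bc3944507b12f3cd0f71c5b0aa3982fb3cd4",
		"sha256:1fde57bfea7ecd80e6acc2c12d90890d32b7977fec17495446678eb16604d8c7",
		"sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
		"sha256:4097f47ebf86f581c9adc3c46b7dc9f2a27db5c571175c066377d0cef9995756",
	}
	// The two primary evidence locations from the example output.
	locs := []Location{
		{"/var/lib/dpkg/status", layers[4], "primary"},
		{"/var/lib/dpkg/status", layers[3], "primary"},
	}
	if id, ok := firstAttributionLayer(layers, locs); ok {
		fmt.Println("first attributed in:", id)
	}
}
```

With the data above, the earlier of the two primary locations is the `sha256:9b67...` layer, which is the fourth entry in the layer list.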
Hi @wagoodman. Please let me know if it's OK!
Hi @wagoodman,
So I might need some more details, or a pointer in the code, on how to do so. Moreover, feel free to put it under "waiting for discussion". I will not be able to attend the meeting, but I do listen to the summary on YouTube. I will have time to develop this feature, which in my opinion can be useful :)
Signed-off-by: tomersein <[email protected]>
Signed-off-by: tomersein <[email protected]>
hi @wagoodman
This PR tries to solve the squash-with-all-layer resolver issue, aligned with the newest version of Syft.
Please let me know how to proceed further. I guess the solution here is not perfect, but it does know how to handle deleted packages.
part of - #15
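The idea discussed in this thread (catalog from all layers, then include only what is visible in the squashed filesystem) can be sketched roughly as follows. This is a toy model under stated assumptions, not the code in this PR: each layer is reduced to the list of package names in its copy of the package database, the last layer's list stands in for the squashed view, and `firstSeen` and `visibleInSquashed` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"sort"
)

// firstSeen records, for each package name, the index of the first layer
// whose package list mentions it (the "catalog all layers" step).
func firstSeen(layers [][]string) map[string]int {
	first := make(map[string]int)
	for i, names := range layers {
		for _, n := range names {
			if _, seen := first[n]; !seen {
				first[n] = i
			}
		}
	}
	return first
}

// visibleInSquashed keeps only packages still listed in the final layer,
// which this sketch assumes is equivalent to the squashed view.
func visibleInSquashed(first map[string]int, layers [][]string) map[string]int {
	kept := make(map[string]int)
	if len(layers) == 0 {
		return kept
	}
	for _, n := range layers[len(layers)-1] {
		if i, ok := first[n]; ok {
			kept[n] = i
		}
	}
	return kept
}

func main() {
	// Hypothetical image: jq is installed in layer 1 and removed in layer 3.
	layers := [][]string{
		{"base-files"},
		{"base-files", "jq"},
		{"base-files", "jq", "vim"},
		{"base-files", "vim"},
	}
	kept := visibleInSquashed(firstSeen(layers), layers)

	names := make([]string, 0, len(kept))
	for n := range kept {
		names = append(names, n)
	}
	sort.Strings(names) // map iteration order is random, so sort for stable output
	for _, n := range names {
		fmt.Printf("%s first seen in layer %d\n", n, kept[n])
	}
}
```

In this model the deleted package (`jq`) is dropped from the result, while the surviving packages keep the index of the layer that introduced them. A package that is removed and later re-added would be attributed to its earliest layer here, which may or may not be the desired behavior.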