Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: introduce new internal representation for Opossum files #192

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

abraemer
Copy link
Contributor

@abraemer abraemer commented Jan 20, 2025

Summary of changes

  • introduced new internal opossum model
  • switch ScanCode pipeline to the new model

This PR does not contain tests for the new model. These are done separately see #189

Context and reason for change

See: #190

Closes: #195
Part1 of: #196 (tests follow in a follow-up)

@abraemer abraemer force-pushed the refactor-consistent-model-pipeline branch 2 times, most recently from 1e1d053 to 057727c Compare January 21, 2025 11:23
@abraemer abraemer marked this pull request as ready for review January 21, 2025 11:24
@abraemer abraemer force-pushed the refactor-consistent-model-pipeline branch 2 times, most recently from 60b1e47 to 59f5dc7 Compare January 21, 2025 11:44
@Hellgartner Hellgartner self-requested a review January 21, 2025 11:54
attribution_to_id: dict[OpossumPackage, str] = field(
default_factory=default_attribution_id_mapper
)
output_file: OpossumOutputFile | None = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really optimal. We are mixing here two things in a model which are different from business point of view

  • What is currently stored in the opossum input file at the end but from business point of view is the results of the scan
  • What is currently stored in the opossum output file but actually is the result of the human interaction in the opossumUI tool

Proposal: Structure:

Opossum --> ScanResults (What is now opossum but without the output_file
|-----> ReviewResults (what is now the opossum output file)

Note: I would still use the opossum output file here as object ATM as we do not yet know the best format here

src/opossum_lib/opossum_model.py Show resolved Hide resolved
attribution_breakpoints=self.attribution_breakpoints,
external_attribution_sources=external_attribution_sources,
frequent_licenses=frequent_licenses,
files_with_children=self.files_with_children,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to deep-copy here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to be defensive and do a deepcopy just in case

] = {}

def process_node(node: Resource) -> None:
path = str(node.path).replace("\\", "/")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As replace("\\", "/") is

  • used at two different places
  • is not totally obvious what it is doing

I would propose to extract to a function with an explanatory name

self.get_attribution_key(a): a.to_opossum_file_format()
for a in attributions
}
external_attributions.update(new_attributions_with_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could principally do a merge. After all associated discussions, is that by purpose?



class BaseUrlsForSources(BaseModel):
model_config = ConfigDict(frozen=True, extra="allow")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it by purpose that some of these models are frozen while others are not?

default_text: str

def to_opossum_file_format(self) -> opossum_file.FrequentLicense:
return opossum_file.FrequentLicense(**self.model_dump())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the type safe way of converting

package_version: str | None = None
package_namespace: str | None = None
package_type: str | None = None
package_p_u_r_l_appendix: str | None = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not need to use this ugly name here as we do no longer need to match the structure of the file

is_relevant_for_preferred: bool | None = None

def to_opossum_file_format(self) -> opossum_file.ExternalAttributionSource:
return opossum_file.ExternalAttributionSource(**self.model_dump())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again think about a type safe conversion

return opossum_file.Metadata(**self.model_dump())


class ResourceType(Enum):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move that closer to the Resource class?

* this model encapsulates also the semantic relationships of resources, resourcesToAttribuions
 and externalAttributions. These are not enforced by the file structure alone.
* This will be used as a target for the other file format frontends and simplify their logic.
* It also allows for easier testing since it allows to check for semantic/structural
 equivalence among opossum files (e.g. the IDs of the attribution carry no semantic semantic
 information themselves i.e. are arbitrary labels)
* Opossum class should be able to carry all information that could be present in an
 .opossum file
* resources can now by added to the resource structure without knowledge about internals
* for this reason:
 - resources can be created with just a path (i.e. without type)
 - resources can now be merged together if the types are compatible.
   types are compatible if at least one is not set or types are identical
 - when converting to opossum file format, unset type defaults to folder.
   Maybe this should raise an error instead, but being more permissible probably just
   makes things more ergonomic without hurting correctness.
* default to treating a Resource as file if the type is undefined and no children present
* provide slightly more information in an error message
@abraemer abraemer force-pushed the refactor-consistent-model-pipeline branch from 54f5c61 to bd3735f Compare January 22, 2025 11:59
@abraemer
Copy link
Contributor Author

Note: rebased onto current main

)
output_file: OpossumOutputFile | None = None

def to_opossum_file_format(self) -> opossum_file_content.OpossumFileContent:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it would be nicer to read if at the top of the file you don't use namespace imports but import exactly the classes you need.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some duplication of names between opossum_file and opossum_model. That's why I wanted to explicit where which name comes from.
This duplication comes from the fact that opossum_model is quite similar to opossum_file generally but we chose to separate the implementations which meant copying some models verbatim.
The specific instance you marked could be removed though as there are only clashes between opossum_file and opossum_model.

attribution_breakpoints=[],
external_attribution_sources={},
frequent_licenses=None,
files_with_children=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the two None values? These are default values on the Opossum Models.

segments = path_segments(file.path)
temp_root.get_path(segments).file = file
resource = opossum_model.Resource(
path=PurePath(file.path.replace("\\", "/")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again the replacement.

process_node(child)

return external_attributions, resources_to_attributions
def convert_resource_type(val: FileType) -> opossum_model.ResourceType:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename val -> file_type

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Introduce base model
3 participants