refactor: introduce new internal representation for Opossum files #192

abraemer · 2025-01-20T16:43:21Z

Summary of changes

* introduced a new internal Opossum model
* switched the ScanCode pipeline to the new model

This PR does not contain tests for the new model. These are done separately; see #189.

Context and reason for change

See: #190

Closes: #195
Part 1 of: #196 (tests come in a follow-up)

Hellgartner · 2025-01-22T07:49:50Z

src/opossum_lib/opossum_model.py

+    attribution_to_id: dict[OpossumPackage, str] = field(
+        default_factory=default_attribution_id_mapper
+    )
+    output_file: OpossumOutputFile | None = None


Is this really optimal? We are mixing two things in one model that are different from a business point of view:

* what currently ends up in the opossum input file, but from a business point of view is the result of the scan
* what is currently stored in the opossum output file, but is actually the result of the human interaction in the OpossumUI tool

Proposal for the structure:

Opossum --> ScanResults (what is now Opossum, but without the output_file)
|-----> ReviewResults (what is now the opossum output file)

Note: I would still use the opossum output file as the object here for now, as we do not yet know the best format.
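
The proposed split could look roughly like the sketch below. It uses stdlib dataclasses as a stand-in for the real models. `ScanResults` and `ReviewResults` are the names from the proposal; all fields and the `OpossumOutputFile` placeholder are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OpossumOutputFile:
    """Placeholder for the existing output-file object (field is hypothetical)."""

    manual_attributions: dict = field(default_factory=dict)


@dataclass
class ScanResults:
    """What is now Opossum, minus output_file (fields shortened, hypothetical)."""

    resources: list = field(default_factory=list)
    attribution_to_id: dict = field(default_factory=dict)


@dataclass
class ReviewResults:
    """Result of the human review; still wraps the output file object as-is."""

    output_file: OpossumOutputFile


@dataclass
class Opossum:
    scan_results: ScanResults
    review_results: Optional[ReviewResults] = None


# A fresh scan has no review results yet.
opossum = Opossum(scan_results=ScanResults(resources=["src/main.py"]))
```

Keeping `ReviewResults` as a thin wrapper leaves room to change its internal format later without touching `ScanResults`.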

Hellgartner · 2025-01-22T07:51:03Z

src/opossum_lib/opossum_model.py

+                attribution_breakpoints=self.attribution_breakpoints,
+                external_attribution_sources=external_attribution_sources,
+                frequent_licenses=frequent_licenses,
+                files_with_children=self.files_with_children,


Do we need to deep-copy here?

It is probably better to be defensive and do a deepcopy just in case.
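
A minimal sketch of the aliasing risk behind this question, using stdlib dataclasses rather than the real models: if the converted object receives the same list object, a later mutation on one side shows up on the other. `Model` and `FileFormat` are hypothetical stand-ins, and the sketch does not settle whether the real Pydantic models already copy their inputs on construction.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class FileFormat:
    attribution_breakpoints: list = field(default_factory=list)


@dataclass
class Model:
    attribution_breakpoints: list = field(default_factory=list)

    def to_file_format_shared(self) -> FileFormat:
        # Passes the same list object on: both sides alias one list.
        return FileFormat(attribution_breakpoints=self.attribution_breakpoints)

    def to_file_format_copied(self) -> FileFormat:
        # Defensive variant: the result owns an independent copy.
        return FileFormat(
            attribution_breakpoints=deepcopy(self.attribution_breakpoints)
        )


model = Model(attribution_breakpoints=["/src/"])
shared = model.to_file_format_shared()
copied = model.to_file_format_copied()

# Mutating the model afterwards only leaks into the shared variant.
model.attribution_breakpoints.append("/vendor/")
```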

Hellgartner · 2025-01-22T07:52:36Z

src/opossum_lib/opossum_model.py

+        ] = {}
+
+        def process_node(node: Resource) -> None:
+            path = str(node.path).replace("\\", "/")


As replace("\\", "/")

* is used in two different places
* is not totally obvious in what it is doing

I would propose extracting it into a function with an explanatory name.
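
A sketch of what such a helper might look like. The name `to_posix_separators` is hypothetical; any name that states the intent would do.

```python
from pathlib import PurePath


def to_posix_separators(path: str) -> str:
    """Return the path with Windows backslashes replaced by forward slashes."""
    return path.replace("\\", "/")


# Both call sites from the diff would then read the same way.
normalized = to_posix_separators("src\\opossum_lib\\opossum_model.py")
resource_path = PurePath(to_posix_separators("a\\b\\c.txt"))
```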

Hellgartner · 2025-01-22T07:53:32Z

src/opossum_lib/opossum_model.py

+                self.get_attribution_key(a): a.to_opossum_file_format()
+                for a in attributions
+            }
+            external_attributions.update(new_attributions_with_id)


In principle, this could perform a merge. After all the associated discussions, is that on purpose?
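
For reference, a small sketch of what `dict.update` does when keys collide: the incoming value replaces the existing one without any warning. The IDs and attribution contents below are made up for illustration.

```python
external_attributions = {"id-1": {"packageName": "foo", "source": "first"}}
new_attributions_with_id = {
    "id-1": {"packageName": "foo", "source": "second"},
    "id-2": {"packageName": "bar", "source": "second"},
}

# Keys present on both sides; update() will replace their values.
colliding_keys = external_attributions.keys() & new_attributions_with_id.keys()

external_attributions.update(new_attributions_with_id)
```

If overwriting is not intended, checking `colliding_keys` before the update would be one way to make that explicit.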

Hellgartner · 2025-01-22T07:55:38Z

src/opossum_lib/opossum_model.py

+
+
+class BaseUrlsForSources(BaseModel):
+    model_config = ConfigDict(frozen=True, extra="allow")


Is it on purpose that some of these models are frozen while others are not?
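
One property that frozen models bring, which may or may not be the reason here, is hashability: the `attribution_to_id: dict[OpossumPackage, str]` field shown earlier needs hashable keys. The sketch below shows the effect with stdlib dataclasses as a stand-in for the Pydantic models; the class names are hypothetical.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class FrozenPackage:
    name: str


@dataclass
class MutablePackage:
    name: str


# Frozen instances are hashable and compare by value, so they work as dict keys.
attribution_to_id = {FrozenPackage("foo"): "id-1"}
lookup = attribution_to_id[FrozenPackage("foo")]

# A mutable dataclass with value equality is not hashable.
try:
    hash(MutablePackage("foo"))
    mutable_is_hashable = True
except TypeError:
    mutable_is_hashable = False

# Assigning to a field of a frozen instance is rejected.
try:
    FrozenPackage("foo").name = "bar"  # type: ignore[misc]
    frozen_is_writable = True
except FrozenInstanceError:
    frozen_is_writable = False
```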

Hellgartner · 2025-01-22T07:56:06Z

src/opossum_lib/opossum_model.py

+    default_text: str
+
+    def to_opossum_file_format(self) -> opossum_file.FrequentLicense:
+        return opossum_file.FrequentLicense(**self.model_dump())


Please use the type-safe way of converting.
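
`**self.model_dump()` passes an untyped dict, so a type checker cannot tell whether the field names and types of the two models still match. A sketch of the explicit alternative follows, with stdlib dataclasses standing in for the Pydantic models. The fields `full_name` and `short_name` are assumptions; only `default_text` appears in the diff.

```python
from dataclasses import dataclass


@dataclass
class FileFrequentLicense:  # stand-in for opossum_file.FrequentLicense
    full_name: str
    short_name: str
    default_text: str


@dataclass
class FrequentLicense:  # stand-in for the model class
    full_name: str
    short_name: str
    default_text: str

    def to_opossum_file_format(self) -> FileFrequentLicense:
        # Each field is named explicitly, so a type checker can flag
        # fields that were renamed or retyped on either side.
        return FileFrequentLicense(
            full_name=self.full_name,
            short_name=self.short_name,
            default_text=self.default_text,
        )


converted = FrequentLicense(
    full_name="MIT License",
    short_name="MIT",
    default_text="Permission is hereby granted...",
).to_opossum_file_format()
```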

Hellgartner · 2025-01-22T07:56:41Z

src/opossum_lib/opossum_model.py

+    package_version: str | None = None
+    package_namespace: str | None = None
+    package_type: str | None = None
+    package_p_u_r_l_appendix: str | None = None


We do not need to use this ugly name here, as we no longer need to match the structure of the file.

Hellgartner · 2025-01-22T07:57:14Z

src/opossum_lib/opossum_model.py

+    is_relevant_for_preferred: bool | None = None
+
+    def to_opossum_file_format(self) -> opossum_file.ExternalAttributionSource:
+        return opossum_file.ExternalAttributionSource(**self.model_dump())


Again, think about a type-safe conversion.

Hellgartner · 2025-01-22T07:57:31Z

src/opossum_lib/opossum_model.py

+        return opossum_file.Metadata(**self.model_dump())
+
+
+class ResourceType(Enum):


Could you move that closer to the Resource class?

* this model also encapsulates the semantic relationships of resources, resourcesToAttributions and externalAttributions. These are not enforced by the file structure alone.
* This will be used as a target for the other file format frontends and simplify their logic.
* It also allows for easier testing since it allows checking for semantic/structural equivalence among opossum files (e.g. the IDs of the attributions carry no semantic information themselves, i.e. they are arbitrary labels)

* Opossum class should be able to carry all information that could be present in an .opossum file

* resources can now be added to the resource structure without knowledge about internals
* for this reason:
  - resources can be created with just a path (i.e. without type)
  - resources can now be merged together if the types are compatible. Types are compatible if at least one is not set or the types are identical
  - when converting to opossum file format, an unset type defaults to folder. Maybe this should raise an error instead, but being more permissive probably just makes things more ergonomic without hurting correctness.

* default to treating a Resource as file if the type is undefined and no children are present
* provide slightly more information in an error message
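
The type rules described in these notes can be sketched as follows. This is a reading of the notes, not the code from the PR: `merge_types` and `resolve_type` are hypothetical helper names, and falling back to folder when children are present is an assumption derived from the earlier "defaults to folder" note.

```python
from enum import Enum, auto
from typing import Optional


class ResourceType(Enum):
    FILE = auto()
    FOLDER = auto()


def merge_types(
    a: Optional[ResourceType], b: Optional[ResourceType]
) -> Optional[ResourceType]:
    """Merge two types: compatible if at least one is unset or both are identical."""
    if a is None:
        return b
    if b is None or a == b:
        return a
    raise ValueError(f"Incompatible resource types: {a.name} vs {b.name}")


def resolve_type(type_: Optional[ResourceType], has_children: bool) -> ResourceType:
    """Pick the type to write out when it may be unset."""
    if type_ is not None:
        return type_
    # Unset type: file when there are no children, folder otherwise (assumption).
    return ResourceType.FOLDER if has_children else ResourceType.FILE
```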

abraemer · 2025-01-22T12:00:34Z

Note: rebased onto current main

mstykow · 2025-01-22T12:54:06Z

src/opossum_lib/opossum_model.py

+    )
+    output_file: OpossumOutputFile | None = None
+
+    def to_opossum_file_format(self) -> opossum_file_content.OpossumFileContent:


I think it would be nicer to read if, at the top of the file, you didn't use namespace imports but imported exactly the classes you need.

abraemer · 2025-01-22

There is some duplication of names between opossum_file and opossum_model. That's why I wanted to make explicit where each name comes from.
This duplication comes from the fact that opossum_model is generally quite similar to opossum_file, but we chose to separate the implementations, which meant copying some models verbatim.
The specific instance you marked could be removed, though, as there are only clashes between opossum_file and opossum_model.

Hellgartner · 2025-01-22T14:00:27Z

src/opossum_lib/scancode/convert_scancode_to_opossum.py

+        attribution_breakpoints=[],
+        external_attribution_sources={},
+        frequent_licenses=None,
+        files_with_children=None,


Do we need the two None values? These are default values on the Opossum Models.

Hellgartner · 2025-01-22T14:00:57Z

src/opossum_lib/scancode/resource_tree.py

-        segments = path_segments(file.path)
-        temp_root.get_path(segments).file = file
+        resource = opossum_model.Resource(
+            path=PurePath(file.path.replace("\\", "/")),


Again the replacement.

Hellgartner · 2025-01-22T14:01:19Z

src/opossum_lib/scancode/resource_tree.py

-        process_node(child)
-
-    return external_attributions, resources_to_attributions
+def convert_resource_type(val: FileType) -> opossum_model.ResourceType:


rename val -> file_type

abraemer force-pushed the refactor-consistent-model-pipeline branch 2 times, most recently from 1e1d053 to 057727c Compare January 21, 2025 11:23

abraemer marked this pull request as ready for review January 21, 2025 11:24

abraemer force-pushed the refactor-consistent-model-pipeline branch 2 times, most recently from 60b1e47 to 59f5dc7 Compare January 21, 2025 11:44

Hellgartner self-requested a review January 21, 2025 11:54

Hellgartner reviewed Jan 22, 2025

abraemer added 9 commits January 22, 2025 12:59

refactor: Switch ScanCode frontend to new Opossum model

15067fd

feat: extend Opossum model

e14483d

* Opossum class should be able to carry all information that could be present in an .opossum file

refactor: simplify scancode conversion using new Resource functions

4317789

fix: use PurePaths instead to make it work on windows

818c8c5

fix: paths on windows by replacing \ with /

ceda49d

fix: Forgot to convert external_attribution_sources

41625aa

fix: minor changes to opossum_model.py

bd3735f

* default to treating a Resource as file if the type is undefined and no children present
* provide slightly more information in an error message

abraemer force-pushed the refactor-consistent-model-pipeline branch from 54f5c61 to bd3735f Compare January 22, 2025 11:59

mstykow reviewed Jan 22, 2025

Hellgartner requested changes Jan 22, 2025
