syft extract the full description of the license in python #3088

Open
tomersein opened this issue Aug 1, 2024 · 9 comments · May be fixed by #3450

@tomersein
Contributor

What happened:
I scanned a Python image, and one of the packages reports the full text of its license as the license value.
Here is an example:

        {
          "path": "/usr/local/lib/python3.9/site-packages/numpy-1.26.4.dist-info/RECORD",
          "layerID": "sha256:6baba55ee976c6469c4b3b5a3c4585320bc0da3e3c9638e2169195f7243b8d03",
          "accessPath": "/usr/local/lib/python3.9/site-packages/numpy-1.26.4.dist-info/RECORD",
          "annotations": {
            "evidence": "supporting"
          }
        }
      ],
      "licenses": [
        {
          "value": "Copyright (c) 2005-2023, NumPy Developers.\n All rights reserved.\n \n Redistribution and use in source and binary forms, with or without\n modification, are permitted provided that the following conditions are\n met:\n \n * Redistributions of source code must retain the above copyright\n notice, this list of conditions and the following disclaimer.\n \n * Redistributions in binary form must reproduce the above\n copyright notice, this list of conditions and the following\n disclaimer in the documentation and/or other materials provided\n with the distribution.\n \n * Neither the name of the NumPy Developers nor the names of any\n contributors may be used to endorse or promote products derived\n from this software without specific prior written permission.\n \n THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS\n \"AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT\n LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR\n A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT\n OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,\n SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT\n LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,\n DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY\n THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT\n (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN

What you expected to happen:
When a license value contains \n, I think syft should trim it so that the SBOM does not carry a multi-line license text in that field.
Steps to reproduce the issue:
Here is a Dockerfile that builds a sample image:

# Use the official Python 3.9 image from Docker Hub
FROM python:3.9

# Set the working directory in the container
WORKDIR /app

# Install the specific version of NumPy
RUN pip install numpy==1.26.4

# Specify the command to run on container start
CMD ["python"]

Anything else we need to know?:

Environment:

  • Output of syft version: 1.9.0
  • OS (e.g: cat /etc/os-release or similar): mac
@tomersein tomersein added the bug Something isn't working label Aug 1, 2024
@willmurphyscode
Contributor

Hi @tomersein, thanks for the report! I was able to reproduce this issue. I appreciate the inclusion of a Dockerfile - it makes reproducing the issue a lot easier.

  1. Make a Dockerfile exactly as above (thanks!)
  2. Run docker build -t syft3088 .
  3. Run syft -o json syft3088 > syft-sbom.json
  4. Run cat syft-sbom.json | jq '.artifacts[] | select(.name=="numpy") | { name: .name, licenses: .licenses }'

You can see that the license set has a value that's the whole text of the license and the SPDX expression is blank.
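For anyone who would rather check this programmatically than with jq, here is a minimal Go sketch of the same inspection. The `artifacts`, `name`, `licenses` and `value` keys come from the output above; the `spdxExpression` key name, the struct shapes, and the embedded sample document are assumptions made for illustration, not syft's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// license and artifact mirror only the fields this check needs.
// The spdxExpression key is an assumption about the JSON layout.
type license struct {
	Value          string `json:"value"`
	SPDXExpression string `json:"spdxExpression"`
}

type artifact struct {
	Name     string    `json:"name"`
	Licenses []license `json:"licenses"`
}

type sbom struct {
	Artifacts []artifact `json:"artifacts"`
}

// suspectLicenses returns the names of packages that have at least one
// license with a multi-line value and no SPDX expression.
func suspectLicenses(doc []byte) ([]string, error) {
	var s sbom
	if err := json.Unmarshal(doc, &s); err != nil {
		return nil, err
	}
	var names []string
	for _, a := range s.Artifacts {
		for _, l := range a.Licenses {
			if l.SPDXExpression == "" && strings.Contains(l.Value, "\n") {
				names = append(names, a.Name)
				break
			}
		}
	}
	return names, nil
}

// sampleSBOM is a made-up, heavily shortened document for illustration.
const sampleSBOM = `{"artifacts":[
 {"name":"numpy","licenses":[{"value":"Copyright (c) 2005-2023, NumPy Developers.\nAll rights reserved.","spdxExpression":""}]},
 {"name":"example-pkg","licenses":[{"value":"MIT","spdxExpression":"MIT"}]}
]}`

func main() {
	names, err := suspectLicenses([]byte(sampleSBOM))
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

In practice the input would be the contents of syft-sbom.json rather than the embedded sample.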

Interestingly, on GitHub, the license is also shown as just "license" rather than having some automatic identification: https://github.com/numpy/numpy?tab=License-1-ov-file

@tomersein
Contributor Author

My suggestion is to split on \n and take the first part of the value in cases like that.
What do you think?
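A minimal sketch of what that truncation could look like, assuming a hypothetical helper named `firstLine` (it is not part of syft):

```go
package main

import (
	"fmt"
	"strings"
)

// firstLine returns the text before the first newline, trimmed of
// surrounding whitespace. Hypothetical helper, not part of syft.
func firstLine(license string) string {
	return strings.TrimSpace(strings.SplitN(license, "\n", 2)[0])
}

func main() {
	full := "Copyright (c) 2005-2023, NumPy Developers.\nAll rights reserved.\n"
	fmt.Println(firstLine(full))
	fmt.Println(firstLine("MIT"))
}
```

Note that for the NumPy value shown above, the first line is the copyright notice rather than a license name.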

@spiffcs
Contributor

spiffcs commented Aug 6, 2024

👋 thanks for the license issue @tomersein!

It looks like syft is pulling this value from the following path in the container:
/usr/local/lib/python3.9/site-packages/numpy-1.26.4.dist-info/METADATA

This happens when the Python cataloger runs and we construct package details from the Egg/Wheel metadata:

func assembleEggOrWheelMetadata(resolver file.Resolver, metadataLocation file.Location) (*parsedData, []file.Location, error) {
	var sources = []file.Location{
		metadataLocation.WithAnnotation(pkg.EvidenceAnnotationKey, pkg.PrimaryEvidenceAnnotation),
	}
	metadataContents, err := resolver.FileContentsByLocation(metadataLocation)
	if err != nil {

Here is the definition of the field we key off of, from the Python core metadata specification:
https://packaging.python.org/en/latest/specifications/core-metadata/#license

In this case the package distributor opted to put the full text of the license in this field.

This is a valid value when we decode the map structure of this file here:

if err := mapstructure.Decode(fields, &pd); err != nil {
	return pd, fmt.Errorf("unable to translate python wheel/egg metadata: %w", err)
}
// add additional metadata not stored in the egg/wheel metadata file
path := locationReader.Path()
pd.SitePackagesRootPath = determineSitePackagesRootPath(path)
if pd.Licenses != "" || pd.LicenseExpression != "" {
	pd.LicenseLocation = file.NewLocation(path)
} else if pd.LicenseFile != "" {
	pd.LicenseLocation = file.NewLocation(filepath.Join(filepath.Dir(path), pd.LicenseFile))
}

I appreciate the suggestion of splitting on \n to truncate longer values in this case, but I am afraid that might be inadequate in the general case.

If you look at the contents of this license block, the authors put a good amount of effort into addressing the different distributions they consume as part of NumPy.
First, the general clause for redistribution of the software:

        THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
        "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
        LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
        A PARTICULAR PURPOSE ARE DISCLAIMED.

Then more information (not in their GitHub license.txt) about what they distribute:

        The NumPy repository and source distributions bundle several libraries that are
        compatibly licensed.  We list these here.

        Name: lapack-lite
        Files: numpy/linalg/lapack_lite/*
        License: BSD-3-Clause
          For details, see numpy/linalg/lapack_lite/LICENSE.txt

        Name: tempita
        Files: tools/npy_tempita/*
        License: MIT
          For details, see tools/npy_tempita/license.txt

        Name: dragon4
        Files: numpy/core/src/multiarray/dragon4.c
        License: MIT
          For license text, see numpy/core/src/multiarray/dragon4.c

        Name: libdivide
        Files: numpy/core/include/numpy/libdivide/*
        License: Zlib
          For license text, see numpy/core/include/numpy/libdivide/LICENSE.txt


        Note that the following files are vendored in the repository and sdist but not
        installed in built numpy packages:

        Name: Meson
        Files: vendored-meson/meson/*
        License: Apache 2.0
          For license text, see vendored-meson/meson/COPYING

        Name: spin
        Files: .spin/cmds.py
        License: BSD-3
          For license text, see .spin/LICENSE
        ----

There are also considerations for the following (there are more entries, truncated here for brevity):

        Name: OpenBLAS
        Files: numpy.libs/libopenblas*.so
        Description: bundled as a dynamically linked library
        Availability: https://github.com/OpenMathLib/OpenBLAS/
        License: BSD-3-Clause
          Copyright (c) 2011-2014, The OpenBLAS Project
          All rights reserved.

        Name: LAPACK
        Files: numpy.libs/libopenblas*.so
        Description: bundled in OpenBLAS
        Availability: https://github.com/OpenMathLib/OpenBLAS/
        License: BSD-3-Clause-Attribution

        Name: libquadmath
        Files: numpy.libs/libquadmath*.so
        Description: dynamically linked to files compiled with gcc
        Availability: https://gcc.gnu.org/git/?p=gcc.git;a=tree;f=libquadmath
        License: LGPL-2.1-or-later

Cutting the license off at the first newline here would remove this information from the SBOM.

I know long field values are ugly and bloat the document, but in this case the value of the license file is as accurate as we can make it without resorting to some specialized parsing for this unique case.
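To give a sense of what "specialized parsing" might involve, here is a rough Go sketch that pulls the Name/License pairs out of a block shaped like the excerpts above. The `bundled` type and `parseBundled` function are hypothetical, and the approach assumes every entry follows the `Name:` then `License:` layout, which is not guaranteed for other packages.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// bundled is one Name/License entry found in the license text.
// Hypothetical type for illustration only.
type bundled struct {
	Name    string
	License string
}

// parseBundled scans the text line by line and pairs each "Name:" line
// with the first "License:" line that follows it.
func parseBundled(text string) []bundled {
	var out []bundled
	current := ""
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case strings.HasPrefix(line, "Name:"):
			current = strings.TrimSpace(strings.TrimPrefix(line, "Name:"))
		case strings.HasPrefix(line, "License:") && current != "":
			id := strings.TrimSpace(strings.TrimPrefix(line, "License:"))
			out = append(out, bundled{Name: current, License: id})
			current = ""
		}
	}
	return out
}

// excerpt reuses two entries quoted earlier in this comment.
const excerpt = `
        Name: lapack-lite
        Files: numpy/linalg/lapack_lite/*
        License: BSD-3-Clause
          For details, see numpy/linalg/lapack_lite/LICENSE.txt

        Name: libquadmath
        Files: numpy.libs/libquadmath*.so
        License: LGPL-2.1-or-later
`

func main() {
	for _, b := range parseBundled(excerpt) {
		fmt.Printf("%s: %s\n", b.Name, b.License)
	}
}
```

Even this small sketch depends on formatting conventions that are specific to one package, which is why a general solution is hard to build this way.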

@anchore/tools should we NOT be including this when we discover it associated with the package?

@wagoodman
Contributor

Maybe we should add a new fullText field to the existing package License structure, so that when we detect a case like this one we can put the full text in its own field. This can be leveraged directly in SPDX too, which is nice.
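As a rough illustration of the idea, a simplified license record with a separate full-text field might look like the sketch below. The `License` struct, its field names, and its JSON keys are assumptions for this example and do not reproduce syft's actual type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// License is a simplified stand-in for a package license record.
// Field names and JSON keys are assumptions for illustration.
type License struct {
	Value          string `json:"value"`
	SPDXExpression string `json:"spdxExpression"`
	FullText       string `json:"fullText,omitempty"`
}

// encode renders a License as compact JSON.
func encode(l License) string {
	b, err := json.Marshal(l)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// A short identifier stays in value; fullText is omitted entirely.
	fmt.Println(encode(License{Value: "MIT", SPDXExpression: "MIT"}))
	// A long license body goes to fullText instead of value.
	fmt.Println(encode(License{FullText: "Copyright (c) 2005-2023, NumPy Developers.\nAll rights reserved."}))
}
```

With `omitempty`, packages that only declare a short identifier would not gain an extra key in the output.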

@tomersein
Contributor Author

Hi @wagoodman, what is the decision?
Is there a way I can help with the solution?

@spiffcs
Contributor

spiffcs commented Aug 16, 2024

Hey @tomersein!

We talked about this on our livestream the other day. We're moving forward with adding a FullText field to the license struct. What do you think is the best way forward detecting this? Just doing a simple len(arbitraryNumber) to see if it's the full text?

https://www.youtube.com/@Anchore/streams

@tomersein
Contributor Author

I suggest detecting it with the \n character. I don't think storing the long license value twice is good practice.
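To compare the two detection ideas side by side, here is a small Go sketch. The function names and the 64-character cutoff are arbitrary choices for illustration; nothing here is taken from syft.

```go
package main

import (
	"fmt"
	"strings"
)

// maxIDLength is an arbitrary cutoff chosen for illustration only.
const maxIDLength = 64

// byLength treats any value longer than maxIDLength as full text.
func byLength(v string) bool {
	return len(v) > maxIDLength
}

// byNewline treats any value with an interior newline as full text.
// Leading and trailing whitespace is ignored first.
func byNewline(v string) bool {
	return strings.Contains(strings.TrimSpace(v), "\n")
}

func main() {
	samples := []string{
		"BSD-3-Clause",
		"Copyright (c) 2005-2023, NumPy Developers.\nAll rights reserved.",
		"GPL-2.0-only OR BSD-3-Clause OR LGPL-2.1-or-later OR MIT OR Apache-2.0 OR Zlib",
	}
	for _, s := range samples {
		fmt.Printf("byLength=%t byNewline=%t %q\n", byLength(s), byNewline(s), s)
	}
}
```

The length check can misfire on a long single-line expression, while the newline check would miss a full license text that happens to be stored on a single line.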

@spiffcs spiffcs moved this to In Progress in OSS Aug 19, 2024
@westonsteimel
Contributor

@spiffcs, https://peps.python.org/pep-0639/ has been provisionally accepted, which will hopefully make this metadata field more standardised, at least for future Python package releases.

@kzantow
Contributor

kzantow commented Nov 7, 2024

Regarding:

What do you think is the best way forward detecting this? Just doing a simple len(arbitraryNumber) to see if it's the full text?

My suggestion would be to do this:
a) Fuzzy match the license to an SPDX ID, and don't include "full text" if we're just given an SPDX ID.
b) For all other cases, include the license text we used as a fullText field, as was suggested above, and additionally try to classify the license using the library. If it returns a license that we can match to an SPDX ID, include that ID as well as the original license text.
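The two branches above can be sketched roughly as follows. Everything named here is hypothetical: `knownIDs` is a tiny stand-in for the SPDX license list, `matchSPDXID` does a case-insensitive exact lookup rather than a real fuzzy match, and `classify` is a placeholder phrase check standing in for a license classification library.

```go
package main

import (
	"fmt"
	"strings"
)

// License is a simplified record for illustration only.
type License struct {
	SPDXExpression string
	FullText       string
}

// knownIDs is a tiny stand-in for the SPDX license list.
var knownIDs = map[string]string{
	"mit":          "MIT",
	"bsd-3-clause": "BSD-3-Clause",
	"apache-2.0":   "Apache-2.0",
}

// matchSPDXID is a case-insensitive exact lookup that stands in for a
// real fuzzy match against SPDX identifiers.
func matchSPDXID(v string) (string, bool) {
	id, ok := knownIDs[strings.ToLower(strings.TrimSpace(v))]
	return id, ok
}

// classify is a placeholder for a license classification library. It
// only recognizes the opening phrase of the MIT license text.
func classify(text string) (string, bool) {
	if strings.Contains(text, "Permission is hereby granted, free of charge") {
		return "MIT", true
	}
	return "", false
}

// build applies branch (a) first and falls back to branch (b).
func build(raw string) License {
	if id, ok := matchSPDXID(raw); ok {
		return License{SPDXExpression: id} // (a) ID only, no full text
	}
	l := License{FullText: raw} // (b) always keep the original text
	if id, ok := classify(raw); ok {
		l.SPDXExpression = id // (b) add the ID when classification succeeds
	}
	return l
}

func main() {
	fmt.Printf("%+v\n", build("mit"))
	fmt.Printf("%+v\n", build("Permission is hereby granted, free of charge, to any person"))
	fmt.Printf("%+v\n", build("Custom terms.\nAll rights reserved."))
}
```

The third call shows the fallback: text that cannot be classified keeps its full text and gets no SPDX ID.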

Projects
Status: In Review
6 participants