From 581bd2a87b494d4e14b8dd5384cafb2f8ee681b7 Mon Sep 17 00:00:00 2001 From: Hero2323 Date: Thu, 11 Jul 2024 16:29:35 +0200 Subject: [PATCH] chore(report): AI Powered License Detection Reports for Weeks 6 and 7 --- .../license-detection/updates/2024-06-06.md | 2 +- .../license-detection/updates/2024-06-13.md | 2 +- .../license-detection/updates/2024-06-20.md | 2 +- .../license-detection/updates/2024-06-27.md | 2 +- .../license-detection/updates/2024-07-04.md | 125 +++++++++++++ .../license-detection/updates/2024-07-11.md | 165 ++++++++++++++++++ 6 files changed, 294 insertions(+), 4 deletions(-) create mode 100644 docs/2024/license-detection/updates/2024-07-04.md create mode 100644 docs/2024/license-detection/updates/2024-07-11.md diff --git a/docs/2024/license-detection/updates/2024-06-06.md b/docs/2024/license-detection/updates/2024-06-06.md index 24548ed6c..8248e6e36 100644 --- a/docs/2024/license-detection/updates/2024-06-06.md +++ b/docs/2024/license-detection/updates/2024-06-06.md @@ -10,7 +10,7 @@ SPDX-FileCopyrightText: 2024 Abdelrahman Jamal # Meeting 2 -*(June 6,2023)* +*(June 6,2024)* ## Attendees: - [Kaushlendra Pratap](https://github.com/Kaushl2208) diff --git a/docs/2024/license-detection/updates/2024-06-13.md b/docs/2024/license-detection/updates/2024-06-13.md index 2b1126dea..7be28aeee 100644 --- a/docs/2024/license-detection/updates/2024-06-13.md +++ b/docs/2024/license-detection/updates/2024-06-13.md @@ -10,7 +10,7 @@ SPDX-FileCopyrightText: 2024 Abdelrahman Jamal # Meeting 3 -*(June 13,2023)* +*(June 13,2024)* ## Attendees: - [Kaushlendra Pratap](https://github.com/Kaushl2208) diff --git a/docs/2024/license-detection/updates/2024-06-20.md b/docs/2024/license-detection/updates/2024-06-20.md index 9aa40eb06..36e314e5c 100644 --- a/docs/2024/license-detection/updates/2024-06-20.md +++ b/docs/2024/license-detection/updates/2024-06-20.md @@ -10,7 +10,7 @@ SPDX-FileCopyrightText: 2024 Abdelrahman Jamal # Meeting 4 -*(June 20,2023)* +*(June 20,2024)* ## 
Attendees: - [Kaushlendra Pratap](https://github.com/Kaushl2208) diff --git a/docs/2024/license-detection/updates/2024-06-27.md b/docs/2024/license-detection/updates/2024-06-27.md index f8da22507..116e537b2 100644 --- a/docs/2024/license-detection/updates/2024-06-27.md +++ b/docs/2024/license-detection/updates/2024-06-27.md @@ -10,7 +10,7 @@ SPDX-FileCopyrightText: 2024 Abdelrahman Jamal # Meeting 5 -*(June 27,2023)* +*(June 27,2024)* ## Attendees: - [Kaushlendra Pratap](https://github.com/Kaushl2208) diff --git a/docs/2024/license-detection/updates/2024-07-04.md b/docs/2024/license-detection/updates/2024-07-04.md new file mode 100644 index 000000000..ac0889f85 --- /dev/null +++ b/docs/2024/license-detection/updates/2024-07-04.md @@ -0,0 +1,125 @@ +--- +title: Week 6 +author: Abdelrahman Jamal +--- + + +# Meeting 6 + +*(July 4,2024)* + +## Attendees: +- [Kaushlendra Pratap](https://github.com/Kaushl2208) +- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd) +- [Ayush Bhardwaj](https://github.com/hastagAB) +- [Avinal Kumar](https://github.com/avinal) + +## Discussion: + +### Integration of Semantic Search with LLMs + +- Initial Attempt + 1. Prompt: The initial prompt focused on providing text and metadata to the LLM for license identification. + 2. Issues: The LLM attempted to match all provided lines to a license, even when many lines were clearly irrelevant to licensing. + +- Initial Prompt +``` +[Task] +You are provided with text extracted from a file, along with potential license matches identified by a semantic search tool. +Your task is to carefully analyze the provided text and metadata to determine the actual software license(s) present in the original file. +Out of the 10 provided lines, not all matches will be correct or relevant, so focus on the most relevant lines in your analysis. + +[Metadata Explanation] +The metadata provided for each line is a tuple containing four elements: + * **Line:** The actual line of text extracted from the file. 
+ * **Potential License Match:** The name of a license that the semantic search tool believes the line might belong to. + * **License ID:** The SPDX identifier of the potential license match. + * **Matched License Text:** The specific text within the potential license that the line was matched to. + +[Guidelines] +1. **License Identification:** If a license is found, clearly state its name and its corresponding SPDX identifier (e.g., MIT License, SPDX-License-Identifier: MIT). If multiple licenses are found, list them all. +2. **Evidence and Reasoning (Focus on Relevance and Clarity):** + * For each identified license, extract the specific text snippet(s) from the provided text that confirm its presence. Include surrounding context if it helps clarify the license's applicability. Prioritize the most relevant lines of text. + * Explain why the identified license is the most likely match, taking into account the potential license matches and the matched license text provided in the metadata. + * Only consider matches that are clear and obviously correct. The semantic search tool will always attempt to match lines to licenses, but these matches are not always accurate. +3. **Override Semantic Search:** If the semantic search tool's suggested match seems incorrect, feel free to disregard it and rely on your own knowledge and analysis to determine the correct license. Provide a clear explanation of why you chose a different license. +4. **Exclude Irrelevant Information:** + * Disregard copyright notices and statements and lines of code as they do not indicate the software license. + * Focus only on text that is found in licenses or clearly identifies licenses. +5. **No License Scenario:** If no license is detected in the text, explicitly state "No software license found." +6. **Ambiguity:** If the license cannot be confidently determined due to ambiguity or conflicting information, clearly state this and provide an explanation. +7. 
**Response Format:** Provide the results in the following format: + * **Licenses = [list of identified licenses]** + * **SPDX-IDs = [list of corresponding SPDX identifiers]** + + If no licenses are found, both lists should be empty: + * **Licenses = []** + * **SPDX-IDs = []** + +[Text and Metadata] +``` +- Outcome: The LLM tried too hard to relate irrelevant lines to licenses, resulting in many false positives. + +### Revised Approach + +- Second Attempt + - Prompt: Changed the task to identify relevant lines before determining licenses. + - Issues: Reduced the number of irrelevant lines identified, but the problem of false positives persisted. + +- Second Prompt +``` +[Task] +From the following tuples, select those that are relevant to software licensing and ignore the rest. +A relevant tuple is a tuple that contains a line of text that is relevant and can be used to identify a license. + +[Tuples] +Each tuple consists of three elements: + 1. **Line:** The actual line of text extracted from the file. This is the element you need to evaluate for relevance to software licensing. + 2. **Potential License Match:** The name of a license that the semantic search tool suggests the line might belong to (provided for reference). + 3. **License ID:** The SPDX identifier of the potential license match (provided for reference). + +[Guidelines] +1. **Select License-Specific Lines:** Choose only lines that: + * Explicitly mention license terms + * Directly quote from known license texts + * Include specific license references or titles. + +2. **Ignore Irrelevant Lines:** + * Disregard lines that do not explicitly mention license terms. + * Ignore copyright notices, code snippets, comments, and general documentation. + * Ignore code documentation lines that seem to be documenting code or just general instructions or information. + * Do not select lines that are general descriptions, code, or comments unrelated to license terms. + +3. 
**No License:** If no license is found, state "No software license found." +4. **Ambiguity:** If uncertain, explain the ambiguity. +5. **Response Format:** + * **Relevant Lines = [list of relevant lines]** + * **Licenses = [list of identified licenses from relevant lines]** + * **SPDX-IDs = [list of corresponding SPDX identifiers from relevant lines]** + +[Text and Metadata] +``` +- Outcome: The LLM still included irrelevant lines in its output, indicating a persistent issue with following the prompt guidelines. + +### Key Findings +- Performance Issues: Despite detailed prompts, the LLMs struggled to correctly identify relevant lines and accurately match licenses. + +- RAG Exploration: Suggested by Kaushl, Retrieval-Augmented Generation (RAG) may provide a more robust solution to improve accuracy in license identification. + +## Conclusions and Next Steps +- Improve Semantic Search: Continue refining the semantic search approach for better initial filtering of potential license lines. + +- RAG Implementation: Investigate and implement RAG to enhance the LLM's ability to accurately identify relevant lines and match licenses. + +- Further Prompt Engineering: Experiment with additional prompt variations to improve LLM performance. + + +- Performance Metrics: Establish metrics to evaluate the effectiveness of the integrated approach and analyze the results for further improvements. 
+**** + + + diff --git a/docs/2024/license-detection/updates/2024-07-11.md b/docs/2024/license-detection/updates/2024-07-11.md new file mode 100644 index 000000000..c5bb3561c --- /dev/null +++ b/docs/2024/license-detection/updates/2024-07-11.md @@ -0,0 +1,165 @@ +--- +title: Week 7 +author: Abdelrahman Jamal +--- + + +# Meeting 7 + +*(July 11,2024)* + +## Attendees: +- [Kaushlendra Pratap](https://github.com/Kaushl2208) +- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd) +- [Ayush Bhardwaj](https://github.com/hastagAB) +- [Avinal Kumar](https://github.com/avinal) +- [Anupam Ghosh](https://github.com/ag4ums) + +## Discussion: + +### Improved Semantic Search Algorithm + +- Presentation of Improved Algorithm + + 1. Approach: Implemented chunking by dividing code comments and paragraphs into multiline strings, starting new chunks at empty lines. + + 2. Challenges: Original method of extracting comments resulted in long, merged lines, necessitating a rework to handle multiline comments effectively. + +- Chunking Example + +``` +/* +This is a multiline comment +that should be considered as one +single chunk for better accuracy. + +This is a separate chunk. +*/ + +// This is a single line comment + +// This is still a single line comment +// This is also still a single line comment and not a chunk with the previous comment +``` +### License Matching Performance + +- Initial Results + + 1. Accuracy: The chunking approach significantly improved license matching accuracy but struggled with extremely similar licenses (e.g., 0BSD and ISC). + + 2. Problem: Minor differences between very similar licenses led to occasional misidentifications. + +- License Text Examples + +``` +0BSD License Text: +Copyright (C) YEAR by AUTHOR EMAIL + +Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted. 
+ +THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. + +ISC License Text: +Copyright (c) 2004-2010 by Internet Systems Consortium, Inc. ("ISC") +Copyright (c) 1995-2003 by Internet Software Consortium + +Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies. + +THE SOFTWARE IS PROVIDED "AS IS" AND ISC DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL ISC BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. + +``` + +- Enhancements + +1. Chunk Merging: Attempted merging potential chunks to improve license identification by providing more comprehensive text for comparison. + +2. Combined Line and Chunk Matching: Implemented both line and chunk matching to enhance accuracy, though it increased processing time due to the greater number of combinations. + +- Results + + 1. Metrics: + Predicted License Accuracy: 93.33% + Predicted Licenses Covered: 84.0% + + 2. Notes: Approximately 5% of the remaining 7.6% were files referring to a different license file, not containing the license text directly. 
+ +- Matching Output Example: + +``` +[(100.0, + ' THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.', + 'BSD Zero Clause License', + '0BSD', + 'THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.', + [('BSD Zero Clause License', 100.0), + ('BSD Zero Clause License', 100.0), + ('ISC License', 97.0), + ('ISC License', 97.0), + ('Mackerras 3-Clause License', 93.0)]), + (100.0, + ' Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.', + 'curl License', + 'curl', + 'Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.', + [('curl License', 100.0), + ('curl License', 100.0), + ('pkgconf License', 99.0), + ('ISC License', 99.0), + ('ISC License', 99.0)]), + (99.0, + ' Permission to use, copy, modify, and distribute this software for any', + 'OAR License', + 'OAR', + 'Permission to use, copy, modify, and distribute 
this software for any', + [('Historical Permission Notice and Disclaimer - Fenneberg-Livingston variant', + 99.0), + ('David M. Gay dtoa License', 99.0), + ('OAR License', 99.0), + ('pkgconf License', 97.0), + ('SGI OpenGL License', 96.0)]), + (99.0, + ' copyright notice and this permission notice appear in all copies.', + 'pkgconf License', + 'pkgconf', + 'copyright notice and this permission notice appear in all copies.', + [('pkgconf License', 99.0), + ('Historical Permission Notice and Disclaimer - documentation variant', + 99.0), + ('CMU Mach - no notices-in-documentation variant', 96.0), + ('ISC Veillard variant', 87.0), + ('Historical Permission Notice and Disclaimer - Pbmplus variant', 86.0)]), +.... (many more) + +``` + +### Key Findings + +- Enhanced Accuracy: The combination of chunking and line matching improved overall accuracy and coverage. + +- Increased Processing Time: The dual approach led to longer search times due to the increased number of combinations. + +## Conclusions and Next Steps + +- Evaluate on Nomos Test Dataset: + + 1. Dataset Links: + - [LastGoodNomosTestfilesScan](https://github.com/fossology/fossology/blob/master/src/nomos/agent_tests/testdata/LastGoodNomosTestfilesScan) + - [NomosTestfiles](https://github.com/fossology/fossology/tree/master/src/nomos/agent_tests/testdata/NomosTestfiles) + + 2. Objective: Assess the performance of the semantic search and LLM integration on the provided test datasets. + +- Limit Line/Chunk Matching: To address the issue of excessive matches, limit the line/chunk matching to optimize search time and accuracy. + +- Additional Tasks + + 1. Acknowledgement from Notice Files: Begin work on identifying and acknowledging licenses from notice files. + + 2. Obligations: Convert identified licenses to obligations, detailing the requirements and conditions of each license. + + + +
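The chunking rule described in the Week 7 report under "Improved Semantic Search Algorithm" (start a new chunk at every empty line) can be illustrated with a minimal sketch. This is not the project's actual implementation: the function name `chunk_paragraphs` is hypothetical, the input is assumed to be the already-extracted body of a block comment, and joining the lines of a chunk with newlines is an assumption based on the report's mention of "multiline strings". Per the report's example, `//` comments would each remain their own chunk and are not handled here.

```python
def chunk_paragraphs(text: str) -> list[str]:
    """Split text into chunks, starting a new chunk at every empty line.

    Hypothetical helper for illustration only; it assumes the comment
    delimiters (/* ... */) have already been stripped by the caller.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            # Non-empty line: it belongs to the chunk being built.
            current.append(stripped)
        elif current:
            # Empty line: close the current chunk, if there is one.
            chunks.append("\n".join(current))
            current = []
    if current:
        # Flush the last chunk when the text does not end with an empty line.
        chunks.append("\n".join(current))
    return chunks


# Body of the block comment from the report's chunking example.
block_comment_body = """
This is a multiline comment
that should be considered as one
single chunk for better accuracy.

This is a separate chunk.
"""

chunks = chunk_paragraphs(block_comment_body)
for index, chunk in enumerate(chunks, start=1):
    print(f"chunk {index}: {chunk!r}")
```

Each resulting chunk could then be passed to the semantic search in place of a single long merged line, which is the problem the report says the rework was meant to address.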