Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addition of three new predefined recognizers, improved regex for IN_PAN #1323

Open
wants to merge 47 commits into
base: main
Choose a base branch
from

Conversation

devopam
Copy link
Contributor

@devopam devopam commented Mar 4, 2024

Change Description

Added three new recognizers

  1. IN_GSTIN
  2. ISIN_CODE
  3. CFI_CODE

Improved IN_PAN regex to not get spoofed by 0000 pattern

Describe your changes
Added new recognizers to improve detection of PII elements , one India specific and two global.

Issue reference

This PR fixes issue #XX

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

devopam and others added 30 commits June 26, 2023 16:39
Added India PAN (Permanent Account Number) recognizer
refined the regex for better recognition and enhanced the test cases accordingly
Fixed lint error that was missed earlier.
Added test cases , verification and context data
Added negative test cases per review comments.
update pattern recognizer value per suggestion in review
added PresidioAnalyzerUtils class with generic functions. removed usage of stdnum
added test cases for analyzer_utils.py in prescribed format
added to the count of predefined recognizers
Added India specific predefined pattern recognizer for vehicle registration number
reinstated python 3.9 compatibility, reorganized code
Logic reverted from analyzer_utils to recognizer classfile
added min size check to avoid failures per review comment
Two english language predefine recognizers added viz. ISIN , CFI
1. Improved IN_PAN regex
2.  Utility function for LUHN ModN validation
3. New recognizers : IN_GSTIN, CFI_CODE, ISIN_CODE
@devopam
Copy link
Contributor Author

devopam commented Mar 4, 2024

hi @omri374 ,
Apologies if I made it bit heavy by adding three recognizers in one go.
One additional library used : pycountry (https://pypi.org/project/pycountry/) , please let me know if their licenses are okay to be included within presidio.

regards
Devopam

@omri374
Copy link
Contributor

omri374 commented Mar 5, 2024

Hi @devopam, thanks for this! unfortunately the pycountry package is LGPL and we cannot have a dependency on it.

@omri374
Copy link
Contributor

omri374 commented Mar 7, 2024

Is there a simple alternative to it?

@devopam
Copy link
Contributor Author

devopam commented Mar 7, 2024

hi @omri374 , yes I will add it neatly in analyzer_utils so that others can use it as well. We will need to maintain all these metadata whenever world events change as such. I am down with cold & fever :( - that's the delay cause.

@omri374
Copy link
Contributor

omri374 commented Mar 7, 2024

Absolutely no rush! Hope you feel better soon!

Removed pycountry per feedback on it's license. Built the utility in analyzer_utils.py & removed all references.
@devopam
Copy link
Contributor Author

devopam commented Mar 9, 2024

hi @omri374 ,
I have removed pycountry as desired and added the metadata to country_master.csv file. Added util functions in analyzer_utils to load the required information as well (added one function for future use also). However, I am not convinced of the location of country_master.csv myself - shall we create a new sub-folder at the top level for such data ?
Please have a look when you can and let me know your feedback on the overall changes.

@omri374
Copy link
Contributor

omri374 commented Mar 10, 2024

Thanks! I will take a look soon.

Copy link
Contributor

@omri374 omri374 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work! Left some comments, would be happy to discuss.

presidio-analyzer/tests/test_recognizer_registry.py Outdated Show resolved Hide resolved
presidio-analyzer/presidio_analyzer/country_master.csv Outdated Show resolved Hide resolved

gstin_country_codes_iso3a = ""
utils = Utils()
countries = utils.get_country_codes(iso_code="ISO3166-1-Alpha-3")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to think of way to make this more efficient. Creating the PresidioAnalyzerUtils object for every recognizer that needs it is inefficient, and would require holding the countries data in memory multiple times.

On one hand, there's the need for this data for some recognizers, on the other hand most recognizers (and users) might not need it. Therefore, we can think of two options to handle this in my view.

  1. Instantiate the utils in the AnalyzerEngine class and pass it to each recognizer. This would allow it to only be instantiated once.
  2. Make some of these recognizers optional (which aren't loaded in load_predefined_recognizers) and therefore reduce the memory footprint.

Long term, I think we should think of a mechanism to specify the countries you expect PII to come from. If someone doesn't expect any Australian license plates, it doesn't make sense to load that recognizer for them.

Happy to get your thoughts on this.

Copy link
Contributor Author

@devopam devopam Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @omri374
I am confused with the best approach here frankly. #2 is not a good idea as it will defeat the purpose of 'built-in' recognizers readily available as such from a consumer's perspective imho.
We will need more of such metadata to improve detection beyond pattern matching as we add more industry specific recognizers to give comprehensive solution offering.
So, how best to design this is bit complex !
approach 1 above seems to be a feasible approach to me but I will need fair bit of help here to implement since I don't want to end up breaking things due to my limited knowledge of the core.
Shall we discuss offline ? Please suggest

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @devopam. I agree that (1) is a better approach, also since in parallel we're thinking of introducing a country flag to be able to filter out recognizers that aren't needed.

Adding a new input to the __init__ of EntityRecognizer would not break the API. I would suggest to do that, and instantiate the PresidioAnalyzerUtils in the recognizer registry, so that it could be passed to all recognizers during instantiation.

How about taking this constructor:

and adding a new param to it:

    def __init__(
        self,
        supported_entities: List[str],
        name: str = None,
        supported_language: str = "en",
        version: str = "0.0.1",
        context: Optional[List[str]] = None,
        analyzer_utils: Optional[PresidioAnalyzerUtils] = None
    ):
        self.analyzer_utils = analyzer_utils

We could inject the analyzer_utils during recognizers instantiation:

WDYT?
Happy to discuss offline too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devopam how about we have a quick chat on this? If you're interested, let's chat on LinkedIn

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @omri374 ,
Please have a look at the current implementation and advise. Utils is instantiated only in the recognizers using it now viz. in_gstin, isin

@omri374
Copy link
Contributor

omri374 commented Mar 13, 2024

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@omri374 omri374 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @devopam, added a few additional comments. This looks much better!
Thanks

@@ -36,13 +52,33 @@ def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
text = text.replace(search_string, replacement_string)
return text

@staticmethod
def get_luhn_mod_n(input_str: str, alphabet="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def get_luhn_mod_n(input_str: str, alphabet="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
def get_luhn_mod_n(input_str: str, alphabet="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ") -> bool:

return (
sum(luhn_input[::2]) + sum(sum(divmod(i * 2, n)) for i in luhn_input[1::2])
) % n == 0

@staticmethod
def is_verhoeff_number(input_number: int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def is_verhoeff_number(input_number: int):
def is_verhoeff_number(input_number: int) -> bool:

# return full country list for given code
return self.__get_country_master_full_data__(iso_code=iso_code)

def get_full_country_information(self, lookup_key: str, lookup_index: str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please define return type (List[str]?)

ISO3166-1-Alpha-2,ISO3166-1-Alpha-3, ISO3166-1-Numeric,
International_licence_plate_country_code, Country_code_top_level_domain,
Currency_Name, ISO4217-Alpha-3, ISO4217-Numeric, Capital_City, Dialing_Code
:return: Dictionary object with additional information enriched from
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It says it returns a dictionary, but it looks like the code returns a list


self.load()
logger.info("Loaded recognizer: %s", self.name)
self.is_loaded = True

if analyzer_utils is not None:
self.analyzer_utils = analyzer_utils
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

class fields should not be optional. I would suggest to remove the if analyzer_utils is not None and put the value anyway (even if it's None). Otherwise the analyzer_utils field would not always be part of the class.

supported_language: str = "en",
supported_entity: str = "ISIN_CODE",
replacement_pairs: Optional[List[Tuple[str, str]]] = None,
analyzer_utils=PresidioAnalyzerUtils(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment on other recognizer

:param patterns: List of patterns to be used by this recognizer
:param context: List of context words to increase confidence in detection
:param supported_language: Language this recognizer supports
:param supported_entity: The entity this recognizer can detect
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add analyzer_utils to docstring

:param supported_language: Language this recognizer supports
:param supported_entity: The entity this recognizer can detect
:param replacement_pairs: List of tuples with potential replacement values
for different strings to be used during pattern matching.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add analyzer_utils to docstring

from presidio_analyzer.analyzer_utils import PresidioAnalyzerUtils


# from presidio_analyzer.analyzer_utils import PresidioAnalyzerUtils as Utils
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove

luhn_mod_n_test_set = [
["27AAACM6094R1ZP", True],
["36AAICA3369H1ZJ", True],
["36AAHAA2262Q1ZF", True],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please add negative examples?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants