Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recent changes broke types #1396

Open
tonyhammainen opened this issue Jun 5, 2024 · 6 comments
Open

Recent changes broke types #1396

tonyhammainen opened this issue Jun 5, 2024 · 6 comments

Comments

@tonyhammainen
Copy link

Describe the bug
As of 2.2.33 I was not getting any type-related errors, but upgrading to the latest 2.2.354 resulted in getting them.

error: Call to untyped function "AnonymizerEngine" in typed context [no-untyped-call]

^ Initializing the AnonymizerEngine throws an error. I believe it is because the class has no typed init function

error: Argument "ad_hoc_recognizers" to "analyze" of "AnalyzerEngine" has incompatible type "list[PatternRecognizer]"; expected "list[EntityRecognizer] | None" [arg-type]
^ I do not understand why mypy is not picking the fact that PatternRecognizer is a child class of EntityRecognizer

Argument "analyzer_results" to "anonymize" of "AnonymizerEngine" has incompatible type "list[presidio_analyzer.recognizer_result.RecognizerResult]"; expected "list[presidio_anonymizer.entities.engine.recognizer_result.RecognizerResult]" [arg-type]
^ I believe the above is a result of the RecognizerResult class in presidio-anonymizer being a copy of the class in presidio-analyzer, and mypy doesn't understand their equivalency

To Reproduce
Steps to reproduce the behavior:

  1. pip install analyzer, anonymizer == 2.2.354
  2. In project, include:
  • instantiation of a AnonymizerEngine()
  • call AnonymizerEngine.anonymize(ad_hoc_recognizers = <list[PatternRecognizer]>)
  • call AnonymizerEngine.anonymize(analyzer_results= <list[presidio_analyzer.recognizer_result.RecognizerResult]>)
  1. Run mypy on the project
  2. See error

Expected behavior
No mypy errors when using the library as expected

@omri374
Copy link
Contributor

omri374 commented Jun 6, 2024

Hi @tonyhammainen, we really do have a duplicate RecognizerResult in both packages, as the two packages are currently independent. The solution would be to have a base package which both packages use, but it's not on our roadmap to add this, as there is no functional benefit. Having said that, I don't see why this would be a mypy issue, as those packages are independent and each accepts its own type of recognizer result.

on the ad_hoc_recognizers object, I'm not sure, because as you said, PatternRecognizer is a child of EntityRecognizer

@tonyhammainen
Copy link
Author

tonyhammainen commented Jun 18, 2024

Hi @omri374, thanks for the reply!

The intended pattern is to use both libraries sequentially, first getting recognized PII out of analyzer, and inputting that for the anonymizer to anonymize. Or have I misunderstood?

If so, this means that we are inputting RecognizerResult of analyzer package to the anonymization function of anonymizer package, necessarily leading to mypy errors in the end-user app, if anonymizer package is only typed for its own version of RecognizerResults, which de-facto is something that nobody ever uses?

Tbh, I don't see why you are not including analyzer as a dependency of anonymizer, given the main usage pattern for anonymizer relies on analyzer?

If I have misunderstood how anonymizer package should be used do enlighten me!

@eicca
Copy link
Contributor

eicca commented Jul 1, 2024

I can confirm the issue with pyright (pylance) by using the first example from the quickstart: https://microsoft.github.io/presidio/getting_started/

If the libraries MUST be independent, then the conversion between the two classes should be done by the user. For example this way:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine, RecognizerResult

text = "My phone number is 212-555-5555"

# Set up the engine, loads the NLP module (spaCy model by default)
# and other PII recognizers
analyzer = AnalyzerEngine()

# Call analyzer to get results
results = analyzer.analyze(text=text, entities=["PHONE_NUMBER"], language="en")
print(results)

# Convert recognizer results to RecognizerResult objects from presidio_anonymizer
results = [
    RecognizerResult(
        entity_type=result.entity_type,
        start=result.start,
        end=result.end,
        score=result.score,
    )
    for result in results
]

# Analyzer results are passed to the AnonymizerEngine for anonymization

anonymizer = AnonymizerEngine()

anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)

print(anonymized_text)

Over time, the difference between RecognizerResult and RecognizerResult will grow, so the above solution provides a somewhat typed-checked way to keep our sanity. In fact, they are already different: the analyzer's RecognizerResult has extra information like recognition_metadata and analysis_explanation.

Another approach might be to refactor the anonymizer so that it depends on the protocol of the PIIEntity (right now it's an abstract class) instead of the concrete RecognizerResult implementation. This will keep the analyzer and anonymizer independent, but provide a bridge between two RecognizerResult classes. If it makes sense to you @omri374 and you see that it's doable - I could give it a try.

Additionally, I'd suggest deprecating one of the RecognizerResult (probably the one on the anonymizer side) and creating a different class name, just to avoid user confusion.

@omri374
Copy link
Contributor

omri374 commented Jul 1, 2024

Hi @eicca, thanks for your suggestions. We should also think of backward compatibilty here. In my view, the optimal solution is to have a shared library. One option is to have the anonymizer as a dependency of the analyzer. The second is to create a presidio-core library which is used by both.

@eicca
Copy link
Contributor

eicca commented Jul 2, 2024

Hi @eicca, thanks for your suggestions. We should also think of backward compatibilty here. In my view, the optimal solution is to have a shared library. The second is to create a presidio-core library which is used by both.

Thanks for your reply @omri374!

One option is to have the anonymizer as a dependency of the analyzer.

This totally makes sense to me, especially since it won't bring a lot of unnecessary dependencies for people who want to use only analyzer (in fact, just pycryptodome, which is most probably already required by other dependencies). I can try to prepare a PR if you're open to it.

The second is to create a presidio-core library which is used by both.

This also makes sense. This can also be done at a later time after the previous step is done.

@omri374
Copy link
Contributor

omri374 commented Jul 3, 2024

Thanks! A PR adding the anonymizer to the analyzer sounds great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants