Slow Execution Time When Scanning Large Files #1461

Open
vinay-cldscle opened this issue Oct 4, 2024 · 3 comments
Comments

@vinay-cldscle

vinay-cldscle commented Oct 4, 2024

Hey team,
When I scan a 7 MB file containing more than 700,000 lines, passing the data in chunks (chunk size 100,000), execution takes about 7 to 10 minutes. Is this normal behavior? Can the execution time be reduced? Does batch analysis support TXT files? I would like execution to complete within 1 minute. Is that possible?

@omri374
Contributor

omri374 commented Oct 8, 2024

Hi @vinay-cldscle, have you looked into the BatchAnalyzerEngine option?

@vinay-cldscle
Author

vinay-cldscle commented Oct 15, 2024

Hi @omri374, yes, I tried using the BatchAnalyzerEngine for TXT files, but it is not working.
from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine

analyzer_engine = AnalyzerEngine()
analyzer = BatchAnalyzerEngine(analyzer_engine=analyzer_engine)

error:
results = analyzer.analyze(texts=text_chunks, language="en", return_decision_process=True)
^^^^^^^^^^^^^^^^
AttributeError: 'BatchAnalyzerEngine' object has no attribute 'analyze'

Does the batch analyzer work only for lists and dicts?

@omri374
Contributor

omri374 commented Oct 15, 2024

Please see the Python API reference here: https://microsoft.github.io/presidio/api/analyzer_python/#presidio_analyzer.BatchAnalyzerEngine.analyze_iterator

Your text_chunks should be an iterable (such as List[str]), and then you can call batch_analyzer.analyze_iterator(text_chunks, ...).
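
For anyone landing here with the same AttributeError, the following is a minimal sketch of how a TXT file might be fed to the method named in the API reference linked above (analyze_iterator). The helper names read_in_batches and analyze_file, the batch size of 10,000 lines, and the UTF-8 encoding are assumptions made for illustration and are not part of Presidio. Only the texts and language arguments are used, since those appear in this thread. This shows the call shape only and makes no claim about the resulting run time.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_in_batches(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size lines (hypothetical helper)."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def analyze_file(path: str, batch_size: int = 10_000):
    """Yield one list of recognizer results per line of the file (hypothetical helper)."""
    # Imported lazily so the batching helper above works without Presidio installed.
    from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine

    # Build the engines once and reuse them for every batch.
    batch_analyzer = BatchAnalyzerEngine(analyzer_engine=AnalyzerEngine())

    with open(path, encoding="utf-8") as handle:  # encoding is an assumption
        stripped = (line.rstrip("\n") for line in handle)
        for batch in read_in_batches(stripped, batch_size):
            # analyze_iterator takes an iterable of texts, not a single string.
            yield from batch_analyzer.analyze_iterator(texts=batch, language="en")
```

Usage would look like `for line_results in analyze_file("data.txt"): ...`, where data.txt is a placeholder path. Creating AnalyzerEngine once outside the loop avoids reloading the NLP model for every chunk.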


2 participants