Big inputs result in MemoryError #7
Major changes:
* Real incremental update (finally)
* Split up in phases: body and metadata import, loading the lists, list to database, files
* With the phases, there's a chance to restart the import with finished phases intact after errors
* Careful parallelism: lists are fetched in parallel with a thread pool (because the multi-second wait times of the endpoints were the bottleneck) and files are loaded and analysed by a process pool, since analysing is CPU-intensive and independent. Uncaught exceptions are rethrown (see the sketch after this list)
* Progress bars
* Good test coverage
* Fixed some lurking data model problems
* No mainapp -> importer dependency
* Faster file analysis (which is required so that the import for bigger cities doesn't take days)
* Adds python-dateutil, which was already a transitive dependency through three paths
* Added a workaround for stadt-karlsruhe/geoextract#7
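The "careful parallelism" item above pairs a thread pool for the I/O-bound list fetching with a process pool for the CPU-bound file analysis. A minimal sketch of that split using Python's `concurrent.futures`; `fetch_list` and `analyze_file` are hypothetical placeholders, not the importer's actual functions:

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def fetch_list(url):
    # Placeholder: download one OParl list page. I/O-bound, dominated by
    # the multi-second response times of the endpoints.
    ...

def analyze_file(path):
    # Placeholder: load and analyse one file. CPU-intensive and independent
    # of the other files, so it parallelises well across processes.
    ...

def run_import(list_urls, file_paths):
    # Thread pool: hides the endpoints' wait times for the list downloads.
    with ThreadPoolExecutor(max_workers=8) as pool:
        lists = list(pool.map(fetch_list, list_urls))

    # Process pool: uses all cores for analysis. map() re-raises any
    # exception from a worker, so uncaught errors surface in the caller.
    with ProcessPoolExecutor() as pool:
        analyses = list(pool.map(analyze_file, file_paths))

    return lists, analyses
```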
Thanks for the report, @konstin, and for posting your workaround! Truly fixing this will take some work: simply breaking up the input text into separate chunks carries (as you noted) the risk of splitting one part into several pieces and then missing geo references that span more than one of these chunks. One approach would be to break up the input into chunks but join the broken parts afterwards. Another would be a completely different implementation that works in a streaming fashion; however, that would probably be custom code and probably not as fast as the NumPy/SciPy routines we're using now.
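One possible variant of the chunking idea (overlapping chunks rather than joining broken parts after the fact) is sketched below; chunk size and overlap are arbitrary assumptions, and duplicate matches in the overlapped regions would still need to be deduplicated downstream:

```python
def split_with_overlap(text, chunk_size=1_000_000, overlap=1_000):
    """Yield chunks that overlap by `overlap` characters, so a geo
    reference cut by one chunk boundary still appears whole in the
    following chunk."""
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        yield text[start:end]
        if end == len(text):
            break
        # Step back by `overlap` characters so boundary-spanning
        # references are not lost between chunks.
        start = end - overlap
```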
geoextract fails to parse big inputs such as https://ratsinfo.leipzig.de/bi/oparl/1.0/download.asp?dtyp=130&id=1008643 (~4,000 pages / ~7,000,000 characters) with a MemoryError.

I've seen it fail at different positions, all of them inside `self.splitter.split(text)`. On my machine it seems to fail for inputs bigger than ~5,000,000 characters. This issue is easily worked around for my use case (I just split the text into batches of ~1,000,000 characters and accept that a word at each boundary may be lost); I'm opening this issue mainly for documentation purposes.