Big inputs result in MemoryError #7

Open

konstin opened this issue Jan 15, 2019 · 1 comment

konstin (Contributor) commented Jan 15, 2019

geoextract fails with a MemoryError on big inputs such as https://ratsinfo.leipzig.de/bi/oparl/1.0/download.asp?dtyp=130&id=1008643 (~4000 pages / ~7,000,000 characters).

I've seen it fail at different positions, all of them inside self.splitter.split(text). On my machine it seems to fail for inputs bigger than ~5,000,000 characters.

    found_locations = pipeline.extract(text)
  File "/home/konsti/meine-stadt-transparent/.venv/lib/python3.6/site-packages/geoextract/__init__.py", line 459, in extract
    parts = map(self._normalize, self._split(text))
  File "/home/konsti/meine-stadt-transparent/.venv/lib/python3.6/site-packages/geoextract/__init__.py", line 452, in _split
    return self.splitter.split(text)

This issue is easy to work around for my use case (I just split the text into batches of ~1,000,000 characters and accept that a word at each boundary may be lost); I'm opening this issue mainly for documentation purposes.
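A minimal sketch of that batching workaround, assuming a pipeline object built with geoextract (as in the traceback above); extract_in_batches and BATCH_SIZE are illustrative names, not part of geoextract's API:

    # Feed the pipeline fixed-size slices of the text instead of the whole
    # document. A word (or a location reference) straddling a slice boundary
    # may be lost -- that is the trade-off accepted above.
    BATCH_SIZE = 1_000_000  # stays well below the ~5,000,000-character failure point

    def extract_in_batches(pipeline, text, batch_size=BATCH_SIZE):
        locations = []
        for start in range(0, len(text), batch_size):
            locations.extend(pipeline.extract(text[start:start + batch_size]))
        return locations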

konstin added a commit to meine-stadt-transparent/meine-stadt-transparent that referenced this issue Jan 15, 2019
Major changes:
 * Real incremental update (finally)
 * Split up in phases: Body and metadata import, loading the lists, list to database, files
 * With the phases, there's a chance to restart the import with the finished phases intact after errors
 * Careful parallelism: lists are fetched in parallel with a thread pool (because the multi-second wait times of the endpoints were the bottleneck) and files are loaded and analysed by a process pool, since analysing is CPU-intensive and independent. Uncaught exceptions are rethrown (see the sketch after this list)
 * Progress bars
 * Good test coverage
 * Fixed some lurking data model problems
 * No mainapp -> importer dependency
 * Faster file analysis (which is required so that the import for bigger cities doesn't take days)
 * Adds python-dateutil, which was already a transitive dependency through three paths
 * Added a workaround for stadt-karlsruhe/geoextract#7
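As a rough illustration of the thread-pool/process-pool split described in the parallelism item above (all names -- fetch_list, analyse_file, run_import -- are hypothetical, not the project's actual API):

    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def fetch_list(url):
        ...  # placeholder: HTTP request with multi-second latency (I/O-bound)

    def analyse_file(path):
        ...  # placeholder: CPU-intensive, independent file analysis

    def run_import(urls, files):
        # Threads for the network-bound list fetching ...
        with ThreadPoolExecutor(max_workers=8) as pool:
            lists = list(pool.map(fetch_list, urls))
        # ... and separate processes for the CPU-bound analysis. Iterating
        # over Executor.map re-raises worker exceptions in the caller,
        # matching "uncaught exceptions are rethrown".
        with ProcessPoolExecutor() as pool:
            analysed = list(pool.map(analyse_file, files))
        return lists, analysed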
torfsen (Contributor) commented Jan 16, 2019

Thanks for the report, @konstin, and for posting your workaround!

Truly fixing this will take some work: simply breaking up the input text into separate chunks carries (as you noted) the risk of splitting a single part across chunks and then missing geo references that span a chunk boundary.

One approach would be to break up the input into chunks and join the broken parts afterwards. Alternatively, one could switch to a completely different algorithm that works in a streaming fashion -- however, that would probably require a custom implementation and would probably not be as fast as the NumPy/SciPy routines we're using now.
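A sketch of the first idea, using overlapping chunks so that anything cut at one chunk's end also appears intact at the start of the next chunk (chunks_with_overlap is a hypothetical helper, and the chunk and overlap sizes are assumptions):

    def chunks_with_overlap(text, size=1_000_000, overlap=10_000):
        # The overlap must be longer than the longest geo reference we expect,
        # so that a reference broken at one boundary is contained whole in the
        # following chunk.
        start = 0
        while start < len(text):
            end = min(start + size, len(text))
            yield text[start:end]
            if end == len(text):
                break
            start = end - overlap

Matches found twice in an overlap region would then need to be de-duplicated, e.g. by their position in the original text.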
