Big inputs result in MemoryError #7

Open

konstin opened this issue Jan 15, 2019 · 1 comment

konstin (Contributor) commented Jan 15, 2019

geoextract fails with a MemoryError on big inputs such as https://ratsinfo.leipzig.de/bi/oparl/1.0/download.asp?dtyp=130&id=1008643 (~4000 pages / ~7,000,000 characters).

I've seen it fail at different positions, all of them inside self.splitter.split(text). On my machine it seems to fail for inputs bigger than ~5,000,000 characters.

    found_locations = pipeline.extract(text)
  File "/home/konsti/meine-stadt-transparent/.venv/lib/python3.6/site-packages/geoextract/__init__.py", line 459, in extract
    parts = map(self._normalize, self._split(text))
  File "/home/konsti/meine-stadt-transparent/.venv/lib/python3.6/site-packages/geoextract/__init__.py", line 452, in _split
    return self.splitter.split(text)

This issue is easy to work around for my use case (I just split the text into batches of ~1,000,000 characters and accept that a word at each boundary may be lost); I'm opening this issue mainly for documentation purposes.
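A minimal sketch of that batching workaround, assuming a pipeline object built with geoextract (as in the traceback above); extract_in_batches and BATCH_SIZE are illustrative names, not part of geoextract's API:

    # Feed the pipeline fixed-size slices of the text instead of the whole
    # document. A word (or a location reference) straddling a slice boundary
    # may be lost -- that is the trade-off accepted above.
    BATCH_SIZE = 1_000_000  # stays well below the ~5,000,000-character failure point

    def extract_in_batches(pipeline, text, batch_size=BATCH_SIZE):
        locations = []
        for start in range(0, len(text), batch_size):
            locations.extend(pipeline.extract(text[start:start + batch_size]))
        return locations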

konstin added a commit to meine-stadt-transparent/meine-stadt-transparent that referenced this issue Jan 15, 2019
Major changes:
 * Real incremental update (finally)
 * Split up in phases: Body and metadata import, loading the lists, list to database, files
 * With the phases, there's a chance to restart the import with the finished phases intact after errors
 * Careful parallelism: lists are fetched in parallel with a thread pool (because the multi-second wait times of the endpoints were the bottleneck) and files are loaded and analysed by a process pool, since analysing is CPU-intensive and independent. Uncaught exceptions are rethrown (see the sketch after this list)
 * Progress bars
 * Good test coverage
 * Fixed some lurking data model problems
 * No mainapp -> importer dependency
 * Faster file analysis (which is required so that the import for bigger cities doesn't take days)
 * Adds python-dateutil, which was already a transitive dependency through three paths
 * Added a workaround for stadt-karlsruhe/geoextract#7
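As a rough illustration of the thread-pool/process-pool split described in the parallelism item above (all names -- fetch_list, analyse_file, run_import -- are hypothetical, not the project's actual API):

    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def fetch_list(url):
        ...  # placeholder: HTTP request with multi-second latency (I/O-bound)

    def analyse_file(path):
        ...  # placeholder: CPU-intensive, independent file analysis

    def run_import(urls, files):
        # Threads for the network-bound list fetching ...
        with ThreadPoolExecutor(max_workers=8) as pool:
            lists = list(pool.map(fetch_list, urls))
        # ... and separate processes for the CPU-bound analysis. Iterating
        # over Executor.map re-raises worker exceptions in the caller,
        # matching "uncaught exceptions are rethrown".
        with ProcessPoolExecutor() as pool:
            analysed = list(pool.map(analyse_file, files))
        return lists, analysed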
torfsen (Contributor) commented Jan 16, 2019

Thanks for the report, @konstin, and for posting your workaround!

Truly fixing this will take some work: simply breaking up the input text into separate chunks carries (as you noted) the risk of splitting a single part across chunks and then missing geo references that span a chunk boundary.

One approach would be to break up the input into chunks and join the broken parts afterwards. Alternatively, one could switch to a completely different algorithm that works in a streaming fashion -- however, that would probably require a custom implementation and would probably not be as fast as the NumPy/SciPy routines we're using now.
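A sketch of the first idea, using overlapping chunks so that anything cut at one chunk's end also appears intact at the start of the next chunk (chunks_with_overlap is a hypothetical helper, and the chunk and overlap sizes are assumptions):

    def chunks_with_overlap(text, size=1_000_000, overlap=10_000):
        # The overlap must be longer than the longest geo reference we expect,
        # so that a reference broken at one boundary is contained whole in the
        # following chunk.
        start = 0
        while start < len(text):
            end = min(start + size, len(text))
            yield text[start:end]
            if end == len(text):
                break
            start = end - overlap

Matches found twice in an overlap region would then need to be de-duplicated, e.g. by their position in the original text.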
