Invalid xml data in planet osm #23

Open
gartenkralle opened this issue Jun 22, 2021 · 15 comments

Comments

gartenkralle commented Jun 22, 2021

I'm not sure whether this is the right place to report this, but I found the following data in the planet-210524.osm file. The opening tag (way) doesn't match the closing tag (relation). Also, "chaer type" does not seem to be valid.

 <way id="933805767" timestamp="2021-04-22T09:46:48Z" version="1" chaer type="way" ref="182400916" role="inner"/>
  <member type="way" ref="182400928" role="inner"/>
  <member type="way" ref="182400935" role="inner"/>
  <member type="way" ref="182400910" role="inner"/>
  <member type="way" ref="182400991" role="inner"/>
  <member type="way" ref="182401067" role="inner"/>
  <member type="way" ref="182400927" role="inner"/>
  <member type="way" ref="182400934" role="inner"/>
  <member type="way" ref="182400921" role="inner"/>
  <member type="way" ref="182400925" role="inner"/>
  <member type="way" ref="182400985" role="inner"/>
  <member type="way" ref="182400907" role="inner"/>
  <member type="way" ref="182401094" role="inner"/>
  <member type="way" ref="182400940" role="inner"/>
  <member type="way" ref="182401005" role="inner"/>
  <member type="way" ref="182401080" role="inner"/>
  <member type="way" ref="182401092" role="inner"/>
  <member type="way" ref="182400972" role="inner"/>
  <member type="way" ref="182400983" role="inner"/>
  <member type="way" ref="182401068" role="inner"/>
  <member type="way" ref="182400942" role="inner"/>
  <member type="way" ref="182401019" role="inner"/>
  <member type="way" ref="182400989" role="inner"/>
  <member type="way" ref="182401004" role="inner"/>
  <member type="way" ref="182401022" role="inner"/>
  <member type="way" ref="182401075" role="inner"/>
  <member type="way" ref="182401077" role="inner"/>
  <member type="way" ref="182401069" role="inner"/>
  <member type="way" ref="182401041" role="inner"/>
  <member type="way" ref="182400978" role="inner"/>
  <member type="way" ref="182401090" role="inner"/>
  <member type="way" ref="182401029" role="inner"/>
  <member type="way" ref="182401031" role="inner"/>
  <member type="way" ref="182401017" role="inner"/>
  <member type="way" ref="182400995" role="inner"/>
  <member type="way" ref="182401061" role="inner"/>
  <member type="way" ref="182400986" role="inner"/>
  <member type="way" ref="182401056" role="inner"/>
  <member type="way" ref="182400959" role="inner"/>
  <member type="way" ref="182401057" role="inner"/>
  <member type="way" ref="182401058" role="inner"/>
  <member type="way" ref="182401078" role="inner"/>
  <member type="way" ref="182401086" role="inner"/>
  <tag k="natural" v="grassland"/>
  <tag k="type" v="multipolygon"/>
 </relation>

This is not the only entry where the opening and closing tags don't match.

joto commented Jun 22, 2021

Did you check whether the MD5 matches (see planet-210524.osm.bz2.md5)?

@zerebubuth
Owner

I figured that if the file was corrupt, it would be very unlikely for bzip2 to output anything other than garbage. But playing around with it now, it does seem as if a corrupt bz2 file can decompress into something that isn't completely noise.

Unhelpfully, it seems that bzcat doesn't stop output when it senses a CRC error, but just outputs a warning to stderr and exits with a non-zero code after processing the rest of the file. So if you're not checking stderr or the exit code, it would be easy to think it had succeeded.
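
One way to avoid being misled by partial output is to test the archive explicitly and treat any decoder error as a failure. The following is a minimal Python sketch, not what was run on the planet server. It assumes Python's standard `bz2` module is an acceptable stand-in for `bzcat`, and `bz2_stream_is_intact` is a hypothetical helper name.

```python
import bz2


def bz2_stream_is_intact(path, chunk_size=1 << 20):
    """Return True only if the whole .bz2 file decompresses without error.

    Hypothetical helper: the decompressed data is read and discarded,
    so nothing is written to disk.
    """
    try:
        with bz2.open(path, "rb") as stream:
            # Read to the very end: an error may only surface after
            # earlier chunks have already been returned successfully.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError):
        # OSError: invalid or corrupt data (including a failed CRC check),
        #          or an I/O problem such as a missing file.
        # EOFError: the file ended before the end-of-stream marker.
        return False
    return True
```

With the command-line tools, the comparable check is to run `bzip2 -t` on the file and look at the exit status rather than at the output alone.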

I started testing the original file on the planet server, but it is taking a very, very long time. I'll update here when it's finished.

@gartenkralle
Author

> Did you check whether the MD5 matches (see planet-210524.osm.bz2.md5)?

Yes, it matched.

@zerebubuth
Owner

This is a bit weird - the planet file on the server looks completely fine. I grepped it for the way ID you mention, and the result is:

<way id="933805767" timestamp="2021-04-22T09:46:48Z" version="1" changeset="103400299" user="lipsigal" uid="438670">
  <nd ref="8654953875"/>
  <nd ref="8654953876"/>
  <nd ref="8654953877"/>
  ...

with no chaer type= or skipping into the relations section.

So if the file on the server is OK, and the MD5sum matches, and it matches your downloaded file too, does that mean that whatever problem is occurring must be during or after decompression? How are you decompressing? Using bzcat on the fly, or bunzip2, or something else?

@gartenkralle
Author

I used 7-Zip File Manager version 19 on Windows 10 x64.

I will try another decompressor. Thanks for investigating so far.

@gartenkralle
Author

This time I decompressed the file with another tool (https://github.com/philr/bzip2-windows/releases), but got the same result.

Any more guesses?

joto commented Jun 24, 2021

Looks to me like you (@gartenkralle) might have a hardware problem, such as faulty memory. I suggest running a memory tester.

@zerebubuth
Owner

I think it's unlikely that a hardware fault would affect the decompression in exactly the same way with two different programs (with different memory layouts, etc...).

@gartenkralle are you decompressing the whole file? (In other words, do you have a file called planet-210524.osm which is not compressed?) Please could you tell me how big it is, and what the MD5sum of the decompressed file is?

@gartenkralle
Author

I ran a two-cycle memory check. No faulty memory was found.

Yes, I decompressed the whole file. I am decompressing it again and will then run MD5sum on it. I will report the results in a few days...

@gartenkralle
Author

Size: 1,542,302,591,588 bytes

MD5 now running...

@gartenkralle
Author

MD5 checksum: dfdff2778d0dfad6569ecc2b3613fbb4

@zerebubuth
Owner

Here's what I got, for the same input file (our MD5s match for the .osm.bz2) - I guess the computer I was using was much slower!

MD5: 2cf5fcca63685b13440902f0f1fa24e6
Size: 1,542,302,591,588

We get the same size, but different MD5s. I think something might be going wrong because it's a 1.4TiB file, and that might be pushing the limits of what the decompression software has been tested with (perhaps some subtle bugs when the file length / offset exceeds 40 bits?)
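
To repeat this kind of comparison without keeping the decompressed file on disk, the decompressed bytes can be hashed as they are produced. This is a minimal sketch, assuming Python's standard `bz2` and `hashlib` modules. `md5_and_size_of_decompressed` is a hypothetical name, and this is not how the numbers above were produced.

```python
import bz2
import hashlib


def md5_and_size_of_decompressed(path, chunk_size=1 << 20):
    """Return (md5_hex, byte_count) of the decompressed contents of a .bz2 file."""
    digest = hashlib.md5()
    size = 0
    with bz2.open(path, "rb") as stream:
        # Read fixed-size chunks until read() returns b"" at end of stream.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size
```

Python integers are arbitrary precision, so the byte counter itself is not affected by the size reported above, which needs 41 bits to represent.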

It might be worth trying some other software. I'm using bzip2, a block-sorting file compressor. Version 1.0.8, 13-Jul-2019 on Linux, so it might be worth trying to replicate that (either a virtual machine, or Windows Subsystem for Linux).

Alternatively, is it possible to do what you wanted without decompressing the whole file? If whatever is parsing the OSM file is capable of streaming (e.g. a SAX or event parser), then you could run bzcat planet.osm.bz2 | whatever and not need to uncompress the whole thing.
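
As an illustration of the streaming approach, here is a minimal Python sketch that counts top-level OSM elements while decompressing on the fly. It assumes the standard `bz2` module in place of a `bzcat` pipe and `xml.etree.ElementTree.iterparse` as the event parser. `count_osm_elements` is a hypothetical name, and the code has not been tried on a full planet file.

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter

TOP_LEVEL = {"node", "way", "relation"}


def count_osm_elements(path):
    """Count nodes, ways and relations in a .osm.bz2 file without
    writing the decompressed XML to disk."""
    counts = Counter()
    with bz2.open(path, "rb") as stream:
        context = ET.iterparse(stream, events=("start", "end"))
        _, root = next(context)  # first event is the start of <osm>
        for event, elem in context:
            if event == "end" and elem.tag in TOP_LEVEL:
                counts[elem.tag] += 1
                # Drop finished elements so memory use stays bounded.
                root.clear()
    return counts
```

Because the XML parser handles element boundaries and character decoding, a relation without members or a tag value with multi-byte characters needs no special treatment here.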

Finally, if all those things won't work, then it might be worth rewriting your parser to use the PBF binary file. The data inside is exactly the same, but the PBF is about half the size of the XML and 10 or more times quicker to parse. @joto's excellent https://github.com/osmcode/libosmium is a well-tested and fast library for parsing PBFs, and there's a suite of utilities (https://github.com/osmcode/osmium-tool) for common tasks such as making geographic extracts and filtering by tags. (I think it builds on Windows, but I don't know enough about Windows to say for sure.)

@gartenkralle
Author

Thanks for all your tips. Even with bzip2 under Cygwin I got the wrong MD5 checksum. Maybe it is a very low-level bug or a file system bug. I will now try decompressing on Linux and transferring the file to Windows. Otherwise I will go with the PBF.

mmd-osm commented Jul 29, 2021

@gartenkralle: do you have any updates on this? Can this issue be closed now?

@gartenkralle
Author

Yes, the issue can be closed.

The tool which calculated the checksum after decompression was wrong. I made a mistake in my parsing method. In the XML file there are relations which have no members, and I had not considered that case. Additionally, I did not consider that UTF-8 has variable-sized characters. After fixing both, it worked fine.
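
The two pitfalls mentioned here can be shown in a few lines of Python. This is only an illustrative sketch: the original parser was not posted in this thread, and `member_refs` and `utf8_widths` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def member_refs(relation_xml):
    """Return the member refs of one <relation> element as a list of ints.

    A relation without members simply yields an empty list.
    """
    relation = ET.fromstring(relation_xml)
    return [int(m.get("ref")) for m in relation.findall("member")]


def utf8_widths(text):
    """Return how many bytes each character of text occupies in UTF-8."""
    return [len(ch.encode("utf-8")) for ch in text]
```

For example, `member_refs('<relation id="1"/>')` returns `[]`, and `utf8_widths("aé€")` returns `[1, 2, 3]`, so a byte offset into the file is not the same thing as a character index.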
