Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

German postcodes via geo search #110

Draft
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

skurfuerst
Copy link

Hey everybody,

I've tried to include german Postcodes in the database, which means we cannot read the post_code attribute, but instead have to read an OSM boundary relation: https://wiki.openstreetmap.org/wiki/DE:Relation:boundary - so I've implemented a geo overlap search to determine the postcodes.

So far the code seems to work, but is quite ugly (I am just learning rust).

I want to polish this PR for a merge-able state, but I'd need your feedback on the following things:

  • is the direction generally useful?
  • any rust errors I should fix / improve upon?
  • should we maybe make this configurable?

Thanks for your work,
Sebastian

Copy link
Member

@amatissart amatissart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this suggestion. It looks like a very promising direction, and would definitely be useful. (For example, in order to improve the search by german postcodes in Qwant Maps!)

I made a few comments, mostly about the integration in the existing implementation. Depending on which kind of zones should be affected by this change, the postcodes lookup may need to be moved further in the generation steps, after the zone_type has been computed. But overall your approach looks good to me.

I would agree to make this an optional feature, possibly enabled by default. In many countries little data will be affected, but it's not clear for me yet if this additional step could cause a significant performance hit for large input files.

cosmogony/src/postcode.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/zone_ext.rs Outdated
.split(';')
.filter(|s| !s.is_empty())
.map(|s| s.to_string())
.sorted()
.collect();
if let Some(boundary) = boundary.as_ref() {
if let Some(bbox) = bbox {
if (zip_codes.is_empty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should all zones get their postcodes completed here ?

For large zones (countries, states, and possibly any zone with type larger than ZoneType::City) this lookup could be expensive and of little use. The problem is the zone_type is still unknown at this point.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @amatissart :-)

So far, I thought that if an object which already has zip_codes (the 3rd if condition), we will not add additional ones.

The checks above are only to make the Rust compiler happy 👍 - I'd be glad to do this differently :D

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah. now I get what you mean I guess. I'll think through this some more.

src/zone_ext.rs Outdated
Comment on lines 156 to 168
.filter(|postcode_bbox| {
//info!(" - Candidate Postcode: {:?}", postcode_bbox.get_postcode().zipcode);

let overlap_between_postcode_and_area = BooleanOp::intersection(boundary, postcode_bbox.get_postcode().get_boundary());

// anteil überlappender Bereiches / Postcode: "Wieviel % des Postcodes sind von dieser Fläche befüllt"
let overlap_percentage_relative_to_postcode = overlap_between_postcode_and_area.unsigned_area() / postcode_bbox.get_postcode().unsigned_area();

//info!(" CHOSEN {} {:?}", overlap_percentage_relative_to_postcode, overlap_percentage_relative_to_postcode > 0.05);
// at least 5% des Postcodes müssen in der genannten Fläche liegen
overlap_percentage_relative_to_postcode > 0.05

})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly is the justification for this check ? Should we expect that some boundary=postal_code relations do not represent their exact coverage because of approximate boundaries ? Or do you chose to focus on the main postcodes for each zone, and deliberately ignore the ones that cover only a small part of the area ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey :)

  • only checking the bounding boxes (as is done by the RTree) is not enough to determine whether postcodes match or not.
  • Thus, I first did a "contains" check between postcode exact coordinates and the underlying geometry.
  • However, this check had the problem that sometimes, even if the postcode and the shape shared only a single boundary line the contains check returned true.
  • Thus, I came up with this "heuristic"; which seems to improve real-world results quite a bit.

Maybe we make the heuristic opt-in, or we make the percentage configurable?

All the best,
Sebastian

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of contains check do you refer to exactly ?

We had similar issues when implementing the hierarchy lookup among administrative areas. Unfortunately, the geo crate does not provide extensive algorithms to implement such checks natively yet. That's why Zone implements a .contains() method that relies on the .covers() method provided by the GEOS library. It's supposed to handle such edge cases nicely even when geometries share boundaries. So if that's not what you've already tried, implementing a similar .contains() method for Postcode could be worth it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coool, I'll try this :) I've checked the contains method of the simple geo Crate types.

src/lib.rs Outdated Show resolved Hide resolved
@skurfuerst
Copy link
Author

Hey @amatissart ,

thank you so much for your detailed review ❤️

Depending on which kind of zones should be affected by this change, the postcodes lookup may need to be moved further in the generation steps, after the zone_type has been computed. But overall your approach looks good to me.

Could you point me to the place where zone_type is computed? I'm afraid I don't understand the overall structure well enough to move that code block around :-)

Thanks and all the best,
Sebastian
PS: I'll try to work on this in this week; but it might be that I need some more days to get a next iteration up and running for your review (depending on other workload).

@skurfuerst
Copy link
Author

Hey @amatissart ,

PS: I just figured out that my changes greatly slow down the system on big instances :-( I guess I somehow need to parallelize the work.

I've seen that the hierarchy building in Cosmogony is using all 32 cores which I have. Can you give me a pointer how you exploit this parallelism? (A short link to docs or source code would be awesome).

Thanks so much ❤️ ,
Sebastian

@amatissart
Copy link
Member

Could you point me to the place where zone_type is computed? I'm afraid I don't understand the overall structure well enough to move that code block around :-)

zone_type is populated by type_zones() after all boundaries have been read from the .pbf file and the inclusions (i.e which zones are covered by each zone) have been computed.

I must say that this structure of this file lib.rs is a bit convoluted, and each step could be described in more detail. (Added on my todo list.)

I've seen that the hierarchy building in Cosmogony is using all 32 cores which I have. Can you give me a pointer how you exploit this parallelism? (A short link to docs or source code would be awesome).

To parallelize the process, Cosmogony is using rayon, that notably provides a helpful .par_iter() method to consume iterators in parallel. This is a simple approach, that comes with some constraints about the ownership and mutability of the data structure though. See type_zones() for an example on how it's used in here.

@skurfuerst
Copy link
Author

@amatissart Hey,

Thank you so much 👍 👍 Your pointers are greatly appreciated - I'll try to implement this as soon as possible (but might take 2-3 weeks).

BTW - if I can somewhen return the favor of your guidance, let me know ;-) I am a Neos CMS core dev, so if you happen to have a problem in the Open Source Content Management area anytime soon, let me know ;-)

All the best,
Sebastian

@skurfuerst
Copy link
Author

Hey @amatissart ,

A quick status update: I fixed a few of the issues; the main issue with performance still being a problem; and I still need to move postcode generation around to a later stage. I'll do that in the upcoming days.

All the best,
Sebastian

@remi-dupre
Copy link
Contributor

remi-dupre commented Dec 2, 2020

Hi, and thanks for your contribution, it looks indeed very promising!

About the performance issues I suppose updating to a simpler "contains" mechanism could be helpful? I am probably wrong but I was wondering if we could get less candidates from the Rtree if we queried for zones that contain an arbitrary point inside of the envelope instead of zones that intersect with its bbox?

As you are new to Rust I'll slip a few comments about the ecosystem that could be helpful:

  • When it comes to code formatting, the community standard is to rely on rustfmt: you can reformat the code for the whole project by running cargo fmt. It can be quite frustrating in a first place if you are not already used to code formatters, but it becomes become quite addictive with time! At least it helps with keeping sound diffs and avoiding endless negotiations about ideal number of columns & cie...

  • You can run extra lints by running cargo clippy. You can usually consider that clippy recommendations will be somewhat idiomatic, they are particularly useful for discovering Rust features and could eventually lead to small performance improvement. It is not expected that all clippy lints are fixed, but I'd definitely recommend to have a quick look to what it says, especially for someone who is learning the language :)

@skurfuerst
Copy link
Author

Hey @remi-dupre ,

ah, is the rtree capable of doing this; or would I need to iterate over all points of my feature here?

IMHO the problem might be that if you have the following situation, we would not find the postcode correctly:

/**

              ┌───────────────────────────────────┐
              │city boundary                      │
              │                                   │
              │    ┌───────────────────────────┐  │
              │    │                           │  │
              │    │                           │  │
              │    │                           │  │
              │    │                           │  │
              │    │     postcode boundary     │  │
              │    │                           │  │
              │    │                           │  │
              │    │                           │  │
              │    │                           │  │
              │    └───────────────────────────┘  │
              │                                   │
              └───────────────────────────────────┘

*/

Or am I missing anything? 😄

Thank you so much for your comments, keep them coming - they are extremely helpful for me 👍 👍 I'll try that out ASAP!

All the best,
Sebastian

@remi-dupre
Copy link
Contributor

IMHO the problem might be that if you have the following situation, we would not find the postcode correctly:

Oh your right, that's where I was wrong: I thought the inclusion was checked in the other direction :/

Sorry about that !

@skurfuerst
Copy link
Author

IMHO the problem might be that if you have the following situation, we would not find the postcode correctly:

Oh your right, that's where I was wrong: I thought the inclusion was checked in the other direction :/

Sorry about that !

No need to be sorry ❤️ Thanks for your work and your comments!

All the best,
Sebastian

@skurfuerst
Copy link
Author

@remi-dupre @amatissart btw - which IDE are you using? I'm currently trying out IntelliJ IDEA with Rust support, but I am not yet fully sure on it...

@skurfuerst
Copy link
Author

Current status: I think we are getting closer to a possible release.

Still missing are the CLI flag to enable/disable the algorithm; and maybe the other "contains" algorithm to speed up the process some more. :)

@remi-dupre
Copy link
Contributor

@remi-dupre @amatissart btw - which IDE are you using? I'm currently trying out IntelliJ IDEA with Rust support, but I am not yet fully sure on it...

I personally use Vim with the Coc plugin. I think people typically use visual code (rust-analyzer seems to be optimized for this use case).

Current status: I think we are getting closer to a possible release.

Still missing are the CLI flag to enable/disable the algorithm; and maybe the other "contains" algorithm to speed up the process some more. :)

That's very cool, thanks again for your energy!

@skurfuerst
Copy link
Author

Hey,
a happy new year to everyone!
Just a small heads-up - I did not forget about these changes, but due to the Lockdown in Germany I spent way less time in front of the computer in the last month... So I can't promise a timeline right now, but I'll promise to get this to a mergeable state somewhen ;-)

All the best,
Sebastian

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants