German postcodes via geo search #110

skurfuerst · 2020-11-27T07:30:29Z

Hey everybody,

I've tried to include german Postcodes in the database, which means we cannot read the post_code attribute, but instead have to read an OSM boundary relation: https://wiki.openstreetmap.org/wiki/DE:Relation:boundary - so I've implemented a geo overlap search to determine the postcodes.

So far the code seems to work, but is quite ugly (I am just learning rust).

I want to polish this PR for a merge-able state, but I'd need your feedback on the following things:

is the direction generally useful?
any rust errors I should fix / improve upon?
should we maybe make this configurable?

Thanks for your work,
Sebastian

amatissart

Thanks for this suggestion. It looks like a very promising direction, and would definitely be useful. (For example, in order to improve the search by german postcodes in Qwant Maps!)

I made a few comments, mostly about the integration in the existing implementation. Depending on which kind of zones should be affected by this change, the postcodes lookup may need to be moved further in the generation steps, after the zone_type has been computed. But overall your approach looks good to me.

I would agree to make this an optional feature, possibly enabled by default. In many countries little data will be affected, but it's not clear for me yet if this additional step could cause a significant performance hit for large input files.

cosmogony/src/postcode.rs

src/lib.rs

amatissart · 2020-11-30T10:35:18Z

src/zone_ext.rs

            .split(';')
            .filter(|s| !s.is_empty())
            .map(|s| s.to_string())
            .sorted()
            .collect();
+        if let Some(boundary) = boundary.as_ref() {
+            if let Some(bbox) = bbox {
+                if (zip_codes.is_empty()) {


Should all zones get their postcodes completed here ?

For large zones (countries, states, and possibly any zone with type larger than ZoneType::City) this lookup could be expensive and of little use. The problem is the zone_type is still unknown at this point.

Hey @amatissart :-)

So far, I thought that if an object which already has zip_codes (the 3rd if condition), we will not add additional ones.

The checks above are only to make the Rust compiler happy 👍 - I'd be glad to do this differently :D

ah. now I get what you mean I guess. I'll think through this some more.

amatissart · 2020-11-30T10:42:46Z

src/zone_ext.rs

+                        .filter(|postcode_bbox| {
+                            //info!(" - Candidate Postcode: {:?}", postcode_bbox.get_postcode().zipcode);
+
+                            let overlap_between_postcode_and_area = BooleanOp::intersection(boundary, postcode_bbox.get_postcode().get_boundary());
+
+                            // anteil überlappender Bereiches / Postcode: "Wieviel % des Postcodes sind von dieser Fläche befüllt"
+                            let overlap_percentage_relative_to_postcode = overlap_between_postcode_and_area.unsigned_area() / postcode_bbox.get_postcode().unsigned_area();
+
+                            //info!("   CHOSEN {} {:?}", overlap_percentage_relative_to_postcode, overlap_percentage_relative_to_postcode > 0.05);
+                            // at least 5% des Postcodes müssen in der genannten Fläche liegen
+                            overlap_percentage_relative_to_postcode > 0.05
+
+                        })


What exactly is the justification for this check ? Should we expect that some boundary=postal_code relations do not represent their exact coverage because of approximate boundaries ? Or do you chose to focus on the main postcodes for each zone, and deliberately ignore the ones that cover only a small part of the area ?

Hey :)

only checking the bounding boxes (as is done by the RTree) is not enough to determine whether postcodes match or not.

Thus, I first did a "contains" check between postcode exact coordinates and the underlying geometry.

However, this check had the problem that sometimes, even if the postcode and the shape shared only a single boundary line the contains check returned true.

Thus, I came up with this "heuristic"; which seems to improve real-world results quite a bit.

Maybe we make the heuristic opt-in, or we make the percentage configurable?

All the best,
Sebastian

What kind of contains check do you refer to exactly ?

We had similar issues when implementing the hierarchy lookup among administrative areas. Unfortunately, the geo crate does not provide extensive algorithms to implement such checks natively yet. That's why Zone implements a .contains() method that relies on the .covers() method provided by the GEOS library. It's supposed to handle such edge cases nicely even when geometries share boundaries. So if that's not what you've already tried, implementing a similar .contains() method for Postcode could be worth it.

Coool, I'll try this :) I've checked the contains method of the simple geo Crate types.

src/lib.rs

skurfuerst · 2020-11-30T11:34:36Z

Hey @amatissart ,

thank you so much for your detailed review ❤️

Depending on which kind of zones should be affected by this change, the postcodes lookup may need to be moved further in the generation steps, after the zone_type has been computed. But overall your approach looks good to me.

Could you point me to the place where zone_type is computed? I'm afraid I don't understand the overall structure well enough to move that code block around :-)

Thanks and all the best,
Sebastian
PS: I'll try to work on this in this week; but it might be that I need some more days to get a next iteration up and running for your review (depending on other workload).

skurfuerst · 2020-11-30T12:56:41Z

Hey @amatissart ,

PS: I just figured out that my changes greatly slow down the system on big instances :-( I guess I somehow need to parallelize the work.

I've seen that the hierarchy building in Cosmogony is using all 32 cores which I have. Can you give me a pointer how you exploit this parallelism? (A short link to docs or source code would be awesome).

Thanks so much ❤️ ,
Sebastian

amatissart · 2020-11-30T14:59:13Z

Could you point me to the place where zone_type is computed? I'm afraid I don't understand the overall structure well enough to move that code block around :-)

zone_type is populated by type_zones() after all boundaries have been read from the .pbf file and the inclusions (i.e which zones are covered by each zone) have been computed.

I must say that this structure of this file lib.rs is a bit convoluted, and each step could be described in more detail. (Added on my todo list.)

I've seen that the hierarchy building in Cosmogony is using all 32 cores which I have. Can you give me a pointer how you exploit this parallelism? (A short link to docs or source code would be awesome).

To parallelize the process, Cosmogony is using rayon, that notably provides a helpful .par_iter() method to consume iterators in parallel. This is a simple approach, that comes with some constraints about the ownership and mutability of the data structure though. See type_zones() for an example on how it's used in here.

skurfuerst · 2020-11-30T19:23:23Z

@amatissart Hey,

Thank you so much 👍 👍 Your pointers are greatly appreciated - I'll try to implement this as soon as possible (but might take 2-3 weeks).

BTW - if I can somewhen return the favor of your guidance, let me know ;-) I am a Neos CMS core dev, so if you happen to have a problem in the Open Source Content Management area anytime soon, let me know ;-)

All the best,
Sebastian

skurfuerst · 2020-12-01T21:19:25Z

Hey @amatissart ,

A quick status update: I fixed a few of the issues; the main issue with performance still being a problem; and I still need to move postcode generation around to a later stage. I'll do that in the upcoming days.

All the best,
Sebastian

remi-dupre · 2020-12-02T07:29:46Z

Hi, and thanks for your contribution, it looks indeed very promising!

About the performance issues I suppose updating to a simpler "contains" mechanism could be helpful? I am probably wrong but I was wondering if we could get less candidates from the Rtree if we queried for zones that contain an arbitrary point inside of the envelope instead of zones that intersect with its bbox?

As you are new to Rust I'll slip a few comments about the ecosystem that could be helpful:

When it comes to code formatting, the community standard is to rely on rustfmt: you can reformat the code for the whole project by running cargo fmt. It can be quite frustrating in a first place if you are not already used to code formatters, but it becomes become quite addictive with time! At least it helps with keeping sound diffs and avoiding endless negotiations about ideal number of columns & cie...
You can run extra lints by running cargo clippy. You can usually consider that clippy recommendations will be somewhat idiomatic, they are particularly useful for discovering Rust features and could eventually lead to small performance improvement. It is not expected that all clippy lints are fixed, but I'd definitely recommend to have a quick look to what it says, especially for someone who is learning the language :)

skurfuerst · 2020-12-02T09:59:29Z

Hey @remi-dupre ,

ah, is the rtree capable of doing this; or would I need to iterate over all points of my feature here?

IMHO the problem might be that if you have the following situation, we would not find the postcode correctly:

/**

              ┌───────────────────────────────────┐
              │city boundary                      │
              │                                   │
              │    ┌───────────────────────────┐  │
              │    │                           │  │
              │    │                           │  │
              │    │                           │  │
              │    │                           │  │
              │    │     postcode boundary     │  │
              │    │                           │  │
              │    │                           │  │
              │    │                           │  │
              │    │                           │  │
              │    └───────────────────────────┘  │
              │                                   │
              └───────────────────────────────────┘

*/

Or am I missing anything? 😄

Thank you so much for your comments, keep them coming - they are extremely helpful for me 👍 👍 I'll try that out ASAP!

All the best,
Sebastian

remi-dupre · 2020-12-02T10:58:34Z

IMHO the problem might be that if you have the following situation, we would not find the postcode correctly:

Oh your right, that's where I was wrong: I thought the inclusion was checked in the other direction :/

Sorry about that !

skurfuerst · 2020-12-02T11:49:28Z

IMHO the problem might be that if you have the following situation, we would not find the postcode correctly:

Oh your right, that's where I was wrong: I thought the inclusion was checked in the other direction :/

Sorry about that !

No need to be sorry ❤️ Thanks for your work and your comments!

All the best,
Sebastian

skurfuerst · 2020-12-02T19:08:31Z

@remi-dupre @amatissart btw - which IDE are you using? I'm currently trying out IntelliJ IDEA with Rust support, but I am not yet fully sure on it...

skurfuerst · 2020-12-02T20:13:05Z

Current status: I think we are getting closer to a possible release.

Still missing are the CLI flag to enable/disable the algorithm; and maybe the other "contains" algorithm to speed up the process some more. :)

remi-dupre · 2020-12-03T07:30:23Z

@remi-dupre @amatissart btw - which IDE are you using? I'm currently trying out IntelliJ IDEA with Rust support, but I am not yet fully sure on it...

I personally use Vim with the Coc plugin. I think people typically use visual code (rust-analyzer seems to be optimized for this use case).

Current status: I think we are getting closer to a possible release.

Still missing are the CLI flag to enable/disable the algorithm; and maybe the other "contains" algorithm to speed up the process some more. :)

That's very cool, thanks again for your energy!

skurfuerst · 2021-01-03T13:05:53Z

Hey,
a happy new year to everyone!
Just a small heads-up - I did not forget about these changes, but due to the Lockdown in Germany I spent way less time in front of the computer in the last month... So I can't promise a timeline right now, but I'll promise to get this to a mergeable state somewhen ;-)

All the best,
Sebastian

skurfuerst added 4 commits November 26, 2020 18:13

WIP - initially working

8081518

FEATURE: more fine grained intersection matching

34c11c4

FEATURE: intersection matching with % inclusion

2f7c673

TASK: code cleanups

59d6358

amatissart reviewed Nov 30, 2020

View reviewed changes

skurfuerst commented Nov 30, 2020

View reviewed changes

src/lib.rs Outdated Show resolved Hide resolved

skurfuerst added 4 commits December 1, 2020 21:25

TASK: move Postcode model around

d2e6baa

BUGFIX: fix compiler warnings

26a360a

BUGFIX: do not read OSM file twice for getting postcodes

0c73aab

TASK: use rayon for reading postcodes

0e16fb1

skurfuerst added 4 commits December 2, 2020 20:32

TASK: restructure and parallelize postcode assignment

20edf9e

run cargo fmt

9e5af94

fix clippy issues

b9a4b4c

TASK: minor refactorings for code clarity

75521ed

FEATURE: Sort Zip Codes

5cfc987

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

German postcodes via geo search #110

German postcodes via geo search #110

skurfuerst commented Nov 27, 2020

amatissart left a comment

amatissart Nov 30, 2020

skurfuerst Nov 30, 2020

skurfuerst Dec 1, 2020

amatissart Nov 30, 2020

skurfuerst Nov 30, 2020

amatissart Nov 30, 2020

skurfuerst Nov 30, 2020

skurfuerst commented Nov 30, 2020

skurfuerst commented Nov 30, 2020

amatissart commented Nov 30, 2020

skurfuerst commented Nov 30, 2020

skurfuerst commented Dec 1, 2020

remi-dupre commented Dec 2, 2020 •

edited

Loading

skurfuerst commented Dec 2, 2020

remi-dupre commented Dec 2, 2020

skurfuerst commented Dec 2, 2020

skurfuerst commented Dec 2, 2020

skurfuerst commented Dec 2, 2020

remi-dupre commented Dec 3, 2020

skurfuerst commented Jan 3, 2021

German postcodes via geo search #110

Are you sure you want to change the base?

German postcodes via geo search #110

Conversation

skurfuerst commented Nov 27, 2020

amatissart left a comment

Choose a reason for hiding this comment

amatissart Nov 30, 2020

Choose a reason for hiding this comment

skurfuerst Nov 30, 2020

Choose a reason for hiding this comment

skurfuerst Dec 1, 2020

Choose a reason for hiding this comment

amatissart Nov 30, 2020

Choose a reason for hiding this comment

skurfuerst Nov 30, 2020

Choose a reason for hiding this comment

amatissart Nov 30, 2020

Choose a reason for hiding this comment

skurfuerst Nov 30, 2020

Choose a reason for hiding this comment

skurfuerst commented Nov 30, 2020

skurfuerst commented Nov 30, 2020

amatissart commented Nov 30, 2020

skurfuerst commented Nov 30, 2020

skurfuerst commented Dec 1, 2020

remi-dupre commented Dec 2, 2020 • edited Loading

skurfuerst commented Dec 2, 2020

remi-dupre commented Dec 2, 2020

skurfuerst commented Dec 2, 2020

skurfuerst commented Dec 2, 2020

skurfuerst commented Dec 2, 2020

remi-dupre commented Dec 3, 2020

skurfuerst commented Jan 3, 2021

remi-dupre commented Dec 2, 2020 •

edited

Loading