Switch from flatgeobuffer to geomedea for GTFS #8

Merged: 4 commits, Jul 19, 2024

@dabreegster dabreegster commented Jul 18, 2024

CC @michaelkirk, I'm trying out geomedea for the use case I described in Discord!

| Metric | flatgeobuffer | geomedea |
| --- | --- | --- |
| File size | 99MB | 53MB |
| Bristol | 3.6MB in 23 requests | 5MB in 20 requests |
| Elephant & Castle | 6.4MB in 935 requests, 1.76 minutes | 9.4MB in 24 requests, 8.3s |

Bristol doesn't have many GTFS trip shapes intersecting the area, while E&C in London has loads.

Unless I'm measuring something wrong, the current approach with geomedea uses more bandwidth, but with far fewer requests and much lower latency.

    .collect(),
    ));
    let mut props = Properties::empty();
    // TODO bincode or something else?
Contributor Author

The size difference is probably coming from here; I need to rethink how to encode this. Most of the size comes from a bunch of chrono::NaiveTimes right now, which get encoded in a pretty naive way.

@michaelkirk

Is variant some pre-encoding of all your properties into a single byte stream?

Internally geomedea is using bincode for encoding property and geometry, and then each page is zstd compressed.

Varints probably make sense for property data, but probably not currently for geometry data, until/unless I also implement delta encoding for geometries. I'd like to implement delta encoding but it'll require reworking some API internals.
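For readers unfamiliar with the term, here is a minimal sketch of a LEB128-style varint, the kind of encoding being discussed for property data. It is hand-rolled for illustration only; the function names are hypothetical and this is not geomedea's or bincode's API.

```rust
/// Encode an unsigned integer as a LEB128-style varint:
/// 7 payload bits per byte, with the high bit set on every byte except the last.
fn encode_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte); // final byte: continuation bit clear
            return;
        }
        out.push(byte | 0x80); // more bytes follow
    }
}

/// Decode one varint from the front of `input`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input ends before a terminating byte (or runs past 10 bytes).
fn decode_varint(input: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in input.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

fn main() {
    let mut buf = Vec::new();
    encode_varint(300, &mut buf);
    // 300 needs two bytes instead of the eight a fixed-width u64 would use.
    println!("{:?} -> {:?}", buf, decode_varint(&buf));
}
```

Small values, such as short time deltas, fit in one or two bytes this way, which is why varints tend to pair well with delta encoding.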

@dabreegster dabreegster Jul 19, 2024

I'm using serde_json::to_vec on

    // Per stop, (original ID and name)
    pub stop_info: Vec<(orig_ids::StopID, String)>,

    // Each one has an arrival time per stop
    pub trips: Vec<Vec<NaiveTime>>,

    // Metadata
    pub route: Route,
}

The space is dominated by the times. Dropping some precision and delta-encoding the values would make a lot of sense there.

I'll also play with using PropertyValue::Vec of some integers manually, instead of this.
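A rough sketch of the delta-encoding idea, assuming each arrival time has first been reduced to whole seconds since midnight (which already drops sub-second precision). The helper names are hypothetical, and this is not the encoding the PR ended up using.

```rust
/// Delta-encode one trip's arrival times (seconds since midnight).
/// The first value is stored as-is; every later value is stored as the
/// difference from its predecessor. Deltas are signed in case the input
/// is not strictly increasing.
fn delta_encode(times: &[u32]) -> Vec<i32> {
    let mut prev = 0i32;
    times
        .iter()
        .map(|&t| {
            let t = t as i32; // seconds in a day always fit in an i32
            let delta = t - prev;
            prev = t;
            delta
        })
        .collect()
}

/// Invert `delta_encode` by accumulating the deltas.
fn delta_decode(deltas: &[i32]) -> Vec<u32> {
    let mut acc = 0i32;
    deltas
        .iter()
        .map(|&d| {
            acc += d;
            acc as u32
        })
        .collect()
}

fn main() {
    // 08:00:00, 08:02:00, 08:05:00
    let times = vec![28_800, 28_920, 29_100];
    let deltas = delta_encode(&times);
    println!("{:?}", deltas);
    assert_eq!(delta_decode(&deltas), times);
}
```

After the first stop, the values are small numbers of seconds between stops, which compress better and suit a variable-length integer encoding.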

Contributor Author

https://docs.rs/chrono/latest/src/chrono/naive/time/serde.rs.html#5

OK, NaiveTime gets serde-ified as a string right now, that's amazing. Switching to something more appropriate...
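To give a feel for the overhead, here is a small sketch comparing the JSON size of one time written as a quoted HH:MM:SS string against the same time written as an integer count of seconds. It assumes the string form has no fractional part and does not call chrono or serde_json; both helpers are hypothetical stand-ins.

```rust
/// JSON text for a time of day written as a quoted "HH:MM:SS" string.
fn time_as_json_string(secs_since_midnight: u32) -> String {
    let h = secs_since_midnight / 3600;
    let m = (secs_since_midnight / 60) % 60;
    let s = secs_since_midnight % 60;
    format!("\"{:02}:{:02}:{:02}\"", h, m, s)
}

/// JSON text for the same time written as a plain integer of seconds.
fn time_as_json_number(secs_since_midnight: u32) -> String {
    secs_since_midnight.to_string()
}

fn main() {
    let t = 8 * 3600 + 15 * 60 + 30; // 08:15:30
    let as_string = time_as_json_string(t);
    let as_number = time_as_json_number(t);
    println!("{} is {} bytes", as_string, as_string.len());
    println!("{} is {} bytes", as_number, as_number.len());
}
```

Multiplied across every stop of every trip, that per-value difference adds up before compression even starts.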


    [target.'cfg(not(target_arch = "wasm32"))'.dependencies]
    geomedea = { git = "https://github.com/michaelkirk/geomedea" }

@michaelkirk

btw this should compile in wasm now.

Contributor Author

In WASM, I want to disable the writer feature, but otherwise enable it. I fiddled around with specifying features based on architecture and landed on this as something that works, but I'll take another look to see if there's a clearer way to express it...
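On the Rust side, the same native-versus-WASM split can be mirrored with `cfg` attributes. The sketch below is only an illustration of that pattern: the `writer` module and `try_write` function are hypothetical and are not taken from geomedea or from this PR.

```rust
/// Hypothetical writer module, compiled only for non-WASM targets.
#[cfg(not(target_arch = "wasm32"))]
mod writer {
    /// Stand-in for real writing logic; returns the number of bytes "written".
    pub fn write_all(bytes: &[u8]) -> usize {
        bytes.len()
    }
}

/// Native builds delegate to the writer module.
#[cfg(not(target_arch = "wasm32"))]
fn try_write(bytes: &[u8]) -> Option<usize> {
    Some(writer::write_all(bytes))
}

/// WASM builds have no writer, so writing is reported as unavailable.
#[cfg(target_arch = "wasm32")]
fn try_write(_bytes: &[u8]) -> Option<usize> {
    None
}

fn main() {
    match try_write(b"abc") {
        Some(n) => println!("wrote {} bytes", n),
        None => println!("writer not available on this target"),
    }
}
```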

@michaelkirk

michaelkirk commented Jul 18, 2024

Unless I'm measuring something wrong, the current approach with geomedea uses more bandwidth, but with far fewer requests and much lower latency.

It's not entirely surprising that geomedea might request more data.

In FGB, there is a single buffer of uncompressed features. Since there is no compression, the index tells us exactly where each feature is in the file. Using this, I implemented smart feature batching: adjacent features are merged into a single request only if they are "close enough".

To take advantage of compression, geomedea groups features into pages, so you have to download an entire page even if you only need one feature in the page. Because geomedea's features are in compressed pages, request batching would be a little different. It can still be done, but I guess it'd be "page batches" rather than "feature batches". I haven't implemented this yet, but it should be doable in a non-breaking way.
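The batching trade-off described here can be sketched as merging byte ranges whenever the gap between them is small enough. This is an illustration only, not the actual FlatGeobuf or geomedea client code. It assumes the wanted ranges (features or pages) do not overlap, and `max_gap` is a hypothetical tuning parameter. The `used_bytes` and `wasted_bytes` names are borrowed from the log lines in this thread, but their exact definition in the library is not confirmed here.

```rust
/// A half-open byte range `[start, end)` within the remote file.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ByteRange {
    start: u64,
    end: u64,
}

/// The requests to issue, plus how many bytes were wanted versus fetched
/// only to bridge a gap between wanted ranges.
#[derive(Debug, PartialEq)]
struct Plan {
    requests: Vec<ByteRange>,
    used_bytes: u64,
    wasted_bytes: u64,
}

/// Merge wanted ranges into as few requests as possible, joining two ranges
/// only when the gap between them is at most `max_gap` bytes.
fn plan_requests(mut wanted: Vec<ByteRange>, max_gap: u64) -> Plan {
    wanted.sort_by_key(|r| r.start);
    let mut requests: Vec<ByteRange> = Vec::new();
    let mut used_bytes = 0;
    let mut wasted_bytes = 0;
    for r in wanted {
        used_bytes += r.end - r.start;
        if let Some(last) = requests.last_mut() {
            if r.start <= last.end + max_gap {
                // Close enough: extend the previous request and pay for the gap.
                wasted_bytes += r.start.saturating_sub(last.end);
                last.end = last.end.max(r.end);
                continue;
            }
        }
        requests.push(r); // too far away: start a new request
    }
    Plan { requests, used_bytes, wasted_bytes }
}

fn main() {
    let wanted = vec![
        ByteRange { start: 0, end: 100 },
        ByteRange { start: 110, end: 200 },
        ByteRange { start: 5_000, end: 5_100 },
    ];
    println!("{:?}", plan_requests(wanted, 50));
}
```

A larger `max_gap` means fewer round trips but more wasted bytes. With compressed pages, the unit being merged is a whole page rather than a single feature.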

@michaelkirk

Could you do me a favor?

RUST_LOG=debug

And give me the lines matching: Finished using an HTTP client. used_bytes

e.g.:

Finished using an HTTP client. used_bytes=839712, wasted_bytes=293690, req_count=4
Finished using an HTTP client. used_bytes=17, wasted_bytes=0, req_count=1

wasted_bytes should correspond to the bytes that could be gained by having more clever page-batching.
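If it helps when collecting these numbers, here is a small sketch that pulls the three counters out of a log line and totals them across lines. It assumes the line format matches the examples above exactly; the `HttpStats` type and `parse_stats` function are hypothetical helpers, not part of geomedea.

```rust
/// Counters from one "Finished using an HTTP client." log line.
#[derive(Debug, Default, PartialEq)]
struct HttpStats {
    used_bytes: u64,
    wasted_bytes: u64,
    req_count: u64,
}

/// Parse one log line; returns `None` if the marker text or any of the
/// three counters is missing or malformed.
fn parse_stats(line: &str) -> Option<HttpStats> {
    let rest = line.split_once("Finished using an HTTP client. ")?.1;
    let (mut used, mut wasted, mut reqs) = (None, None, None);
    for field in rest.split(',') {
        let (key, value) = field.trim().split_once('=')?;
        let value: u64 = value.trim().parse().ok()?;
        match key {
            "used_bytes" => used = Some(value),
            "wasted_bytes" => wasted = Some(value),
            "req_count" => reqs = Some(value),
            _ => {} // ignore counters this sketch does not know about
        }
    }
    Some(HttpStats { used_bytes: used?, wasted_bytes: wasted?, req_count: reqs? })
}

fn main() {
    let log = "Finished using an HTTP client. used_bytes=839712, wasted_bytes=293690, req_count=4\n\
               Finished using an HTTP client. used_bytes=17, wasted_bytes=0, req_count=1";
    let mut total = HttpStats::default();
    for stats in log.lines().filter_map(parse_stats) {
        total.used_bytes += stats.used_bytes;
        total.wasted_bytes += stats.wasted_bytes;
        total.req_count += stats.req_count;
    }
    println!("{:?}", total);
}
```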

@michaelkirk

wasted_bytes should correspond to the bytes that could be gained by having more clever page-batching.

I had a go at "more clever page-batching" here:
michaelkirk/geomedea#12

@michaelkirk

michaelkirk commented Jul 19, 2024

I was looking at the network traffic for your existing FGB integration - and I feel like there must be a bug in the FGB client. It makes no sense for all those small nearby requests (4 bytes?!).

[Image: Screenshot 2024-07-18 at 17 01 39]

I'm looking into that now.

@dabreegster
Contributor Author

dabreegster commented Jul 19, 2024

After updating to the latest 417d4f43cd35aa98aea19a0b17632c8309b50466:

  • Bristol reads 2.6MB over 25 requests, total 3.6s (time measured from a perfectly fast localhost -- I could also try yocalhost or on a real wifi connection to cloudflare or something). I see 3 logs:
    • Finished using an HTTP client. used_bytes=156716, wasted_bytes=0, req_count=5
    • Finished using an HTTP client. used_bytes=3717536, wasted_bytes=1570173, req_count=19
    • Finished using an HTTP client. used_bytes=17, wasted_bytes=0, req_count=1
  • Elephant & Castle reads 6.5MB, also over exactly 25 requests:
    • Finished using an HTTP client. used_bytes=352492, wasted_bytes=0, req_count=10
    • Finished using an HTTP client. used_bytes=5062757, wasted_bytes=2614402, req_count=37
    • Finished using an HTTP client. used_bytes=17, wasted_bytes=0, req_count=1

These two cases are now competitive with fgb, so I'm almost definitely going to switch to this. :)

@dabreegster
Contributor Author

With the new property encoding...

Elephant reads 6.3MB over 23 requests.
Finished using an HTTP client. used_bytes=156716, wasted_bytes=0, req_count=5
Finished using an HTTP client. used_bytes=3776001, wasted_bytes=1490472, req_count=17
Finished using an HTTP client. used_bytes=17, wasted_bytes=0, req_count=1

Bristol reads 2.5MB over 25 requests
Finished using an HTTP client. used_bytes=144956, wasted_bytes=1344, req_count=9
Finished using an HTTP client. used_bytes=1078601, wasted_bytes=473493, req_count=15
Finished using an HTTP client. used_bytes=17, wasted_bytes=0, req_count=1

So the new encoding is not giving a huge advantage, but it still opens the way to doing something nicer later with delta encoding.

I'm going to merge this in now and continue to play with encoding / perf later on. It's a huge improvement for little work, so thanks so much for the new format, adding WASM support, and these page batching fixes!

@michaelkirk

michaelkirk commented Jul 23, 2024

Here's Elephant & Castle with flatgeobuf/flatgeobuf#376
[Image: Screenshot 2024-07-23 at 15 07 52]

tl;dr: there was a bad bug in the HTTP fetch implementation, triggered by those 1.05MB requests. It hadn't come up with the shape of my own data and requests, so thanks for helping to uncover it.

With the bug fix, the two formats seem to be in the same ballpark of network transfer for your queries.

Edit: for completeness, here's the same with geomedea (one more request, 15% fewer bytes transferred):
[Image: Screenshot 2024-07-23 at 15 17 16]
