-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch from flatgeobuffer to geomedea for GTFS #8
Conversation
backend/src/gtfs/gmd.rs
Outdated
.collect(), | ||
)); | ||
let mut props = Properties::empty(); | ||
// TODO bincode or something else? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The size difference is probably coming from here, I need to rethink how to encode this. Most of the size comes from a bunch of chrono::NaiveTime
s right now, which get encoded in a pretty naive way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is variant
some pre-encoding of all your properties into a single byte stream?
Internally geomedea is using bincode for encoding property and geometry, and then each page is zstd compressed.
Varints probably make sense for property data, but probably not currently for geometry data, until/unless I also implement delta encoding for geometries. I'd like to implement delta encoding but it'll require reworking some API internals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using serde_json::to_vec
on
// Per stop, (original ID and name)
pub stop_info: Vec<(orig_ids::StopID, String)>,
// Each one has an arrival time per stop
pub trips: Vec<Vec<NaiveTime>>,
// Metadata
pub route: Route,
}
The space is dominated by the times. Dropping some precision and some delta encoding would make tons of sense there.
I'll also play with using PropertyValue::Vec
of some integers manually, instead of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://docs.rs/chrono/latest/src/chrono/naive/time/serde.rs.html#5
OK, NaiveTime gets serde-ified as a string right now, that's amazing. Switching to something more appropriate...
|
||
[target.'cfg(not(target_arch = "wasm32"))'.dependencies] | ||
geomedea = { git = "https://github.com/michaelkirk/geomedea" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw this should compile in wasm now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In WASM, I want to disable the writer
feature, but otherwise enable it. I fiddled around with specifying features based on architecture and landed here as something that works, but I'll take another look if there's a way to be more clear here...
It's not entirely surprising that geomedea might request more data. In FGB, there is a single buffer of uncompressed features. In FGB, since there is no compression, the index tells us exactly where each feature is in the file. Using this I implemented smart feature batching, so feature requests will only merge adjacent features into a single request if they are "close enough". To take advantage of compression, geomedea groups features into pages, so you have to download an entire page even if you only need one feature in the page. Because geomedea's features are in compressed pages, request batching would be a little different. It can still be done, but I guess it'd be "page batches" rather than "feature batches". I haven't implemented this yet, but it should be doable in a non-breaking way. |
Could you do me a favor? RUST_LOG=debug And give me the lines matching: e.g.:
wasted_bytes should correspond to the bytes that could be gained by having more clever page-batching. |
I had a go at "more clever page-batching" here: |
After updating to the latest 417d4f43cd35aa98aea19a0b17632c8309b50466:
These two cases are now competitive with fgb, so I'm almost definitely going to switch to this. :) |
With the new property encoding... Elephant reads 6.3MB over 23 requests. Bristol reads 2.5MB over 25 requests So the new encoding is not giving that huge of an advantage, but still opens the way to doing something nicer later with delta encoding. I'm going to merge this in now and continue to play with encoding / perf later on. It's a huge improvement with low work, so thanks so much for the new format, adding WASM support, and these page batching fixes! |
Here's Elephant & Castle with flatgeobuf/flatgeobuf#376 tldr; there was a bad bug in the http fetch implementation, triggered by those 1.05MB requests. It hadn't came up in the shape of my own data and requests, so thanks for helping to uncover it. With the bug fix, the two formats seem to be in the same ballpark of network transfer for your queries. edit for completeness, here's the same with geomedea (one more request, 15% less bytes transferred): |
CC @michaelkirk, I'm trying out geomedea for the use case I described in Discord!
Bristol doesn't have many GTFS trip shapes intersecting the area, while E&C in London has loads.
Unless I'm measuring something wrong, the current approach with geomedea incurs more bandwidth, but through way less requests and latency.