Complex Properties Overhaul #121

Closed
wants to merge 12 commits into from

Conversation

flippmoke
Member

This differs from the simpler approach to adding lists and maps in #117. It is a complete overhaul of the way properties are encoded. The thought process behind this change is to allow for higher levels of compression of values by:

  • Allowing integer values to be inlined rather than pointing to indexes
  • Indexed positions of properties now point to packed types
  • Keys and values share the same string storage system

This also allows for null values.

Solves #75 and #62

@flippmoke flippmoke changed the base branch from master to v3.0-development July 20, 2018 19:40
@flippmoke
Member Author

I should mention another slight change proposed by @kkaefer (not currently reflected in this PR):

using the special index "0" to mean "specified inline" for any of the index types

This could allow all types to be inlined.

repeated double double_values = 8 [ packed = true ];
repeated float float_values = 9 [ packed = true ];
repeated int64 int64_values = 10 [ packed = true ];
repeated uint64 uint64_values = 11 [ packed = true ];
Contributor

Please use sint64 instead of int64 since this would only be preferred over uint64 when the value is negative.
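To make the size argument concrete, here is a rough Python sketch of how many varint bytes a value takes when written as `int64` versus `sint64`. The helper names are made up for illustration and are not part of the PR or of any protobuf library.

```python
def varint_len(value: int) -> int:
    """Number of bytes a protobuf varint needs for an unsigned 64-bit value."""
    assert 0 <= value < 1 << 64
    length = 1
    while value >= 0x80:
        value >>= 7
        length += 1
    return length


def as_int64_wire(value: int) -> int:
    """int64 puts negatives on the wire as their 64-bit two's complement."""
    return value & 0xFFFFFFFFFFFFFFFF


def as_sint64_wire(value: int) -> int:
    """sint64 zigzag-maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ..."""
    return ((value << 1) ^ (value >> 63)) & 0xFFFFFFFFFFFFFFFF


for v in (-1, -1000, 1000):
    print(v, varint_len(as_int64_wire(v)), varint_len(as_sint64_wire(v)))
```

Any negative `int64` costs the full 10 bytes, while zigzagged values cost roughly in proportion to their magnitude.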

// | | (if 4th bit is 1 is map)
// | | remaining bits are the number of key_index and
// | | complex_value pairs to follow (same as properties)
repeated uint64 properties = 5 [ packed = true ];
Contributor

I will try an experimental implementation of this so we can see if it actually makes the tiles significantly smaller.

//
// Type | Id | Parameter
// ---------------------------------
// inline int | 0 | value of integer ( values between -2^60+1 to 2^60-1 )
Contributor

Please use sint instead of int since this would only be preferred over uint when values are negative.

@e-n-f
Contributor

e-n-f commented Jul 20, 2018

An experimental implementation of this scheme is in mapbox/tippecanoe#611

In a first, superficial test, it reduces the Natural Earth countries (to z5) to 99.55% of their usual size. There may be other data sets that show meaningful improvement though.

@flippmoke
Member Author

I have experimented a little more with your branch and put in a few different sets of data:

North American Road Data

For one data set, I used some North American road data. Here are the resulting file sizes:

-rw-r--r-- 1 mthompson 410M Jul 23 12:23 na_roads_blake.mbtiles
-rw-r--r-- 1 mthompson 430M Jul 23 11:42 na_roads_master.mbtiles

or

-rw-r--r-- 1 mthompson 428904448 Jul 23 12:23 na_roads_blake.mbtiles
-rw-r--r-- 1 mthompson 450768896 Jul 23 11:42 na_roads_master.mbtiles

This is about a 5% reduction in size.
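The figure can be reproduced from the byte counts in the listing. A minimal Python sketch (the helper name `percent_reduction` is made up for illustration):

```python
def percent_reduction(new_bytes: int, old_bytes: int) -> float:
    """Percentage by which new_bytes is smaller than old_bytes."""
    return 100.0 * (old_bytes - new_bytes) / old_bytes


# Byte counts copied from the `ls -l` output for the road data.
na_roads = percent_reduction(428904448, 450768896)
print(f"North American roads: {na_roads:.2f}% smaller")
```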

Sample data format properties:

"properties": { "prefix": null, "number": "331", "class": "State", "type": "Other Paved", "divided": null, "country": "United States", "state": "Alabama", "note": null, "scalerank": 11, "uident": 142, "length": 4.559190, "rank": 0, "continent": "North America" }

My guess would be that we are seeing a reduction due to inline integers mostly in this case.

OSM Based Point Data

File sizes:

-rw-r--r-- 1 mthompson 832K Jul 23 12:17 kathmandu_blake.mbtiles
-rw-r--r-- 1 mthompson 888K Jul 23 11:43 kathmandu_master.mbtiles

or

-rw-r--r-- 1 mthompson 851968 Jul 23 12:17 kathmandu_blake.mbtiles
-rw-r--r-- 1 mthompson 909312 Jul 23 11:43 kathmandu_master.mbtiles

This is about a 6% reduction.

Example properties:

"properties": { "osm_id": 3483919843.0, "access": null, "aerialway": null, "aeroway": null, "amenity": null, "area": null, "barrier": null, "bicycle": null, "brand": null, "bridge": null, "boundary": null, "building": null, "capital": null, "covered": null, "culvert": null, "cutting": null, "disused": null, "ele": null, "embankment": null, "foot": null, "harbour": null, "highway": null, "historic": null, "horse": null, "junction": null, "landuse": null, "layer": null, "leisure": null, "lock": null, "man_made": null, "military": null, "motorcar": null, "name": null, "natural": null, "oneway": null, "operator": null, "poi": null, "population": null, "power": null, "place": null, "railway": null, "ref": null, "religion": null, "route": null, "service": null, "shop": null, "sport": null, "surface": null, "toll": null, "tourism": null, "tower:type": null, "tunnel": null, "water": null, "waterway": null, "wetland": null, "width": null, "wood": null, "z_order": null, "tags": "\"ford\"=>\"yes\"" }

@e-n-f
Contributor

e-n-f commented Jul 23, 2018

Interesting. Thanks for the additional research. Can you add links to the files you are testing with?

@e-n-f
Contributor

e-n-f commented Jul 23, 2018

New experiment with sorting the values but retaining the v2 encoding:

➤ ./tippecanoe -Voriginal -zg -f -o kathmandu-original.mbtiles ../kathmandu_nepal_osm_point.geojson
For layer 0, using name "kathmandu_nepal_osm_point"
12681 features, 2459902 bytes of geometry, 4 bytes of separate metadata, 389570 bytes of string pool
Choosing a maxzoom of -z11 for features about 228 feet (70 meters) apart
  99.9%  11/1508/860
➤ ./tippecanoe -Vreordered -zg -f -o kathmandu-reordered.mbtiles ../kathmandu_nepal_osm_point.geojson
For layer 0, using name "kathmandu_nepal_osm_point"
12681 features, 2459902 bytes of geometry, 4 bytes of separate metadata, 389570 bytes of string pool
Choosing a maxzoom of -z11 for features about 228 feet (70 meters) apart
  99.9%  11/1508/860
➤ ./tippecanoe -Vblake -zg -f -o kathmandu-blake.mbtiles ../kathmandu_nepal_osm_point.geojson
For layer 0, using name "kathmandu_nepal_osm_point"
12681 features, 2459902 bytes of geometry, 4 bytes of separate metadata, 389570 bytes of string pool
Choosing a maxzoom of -z11 for features about 228 feet (70 meters) apart
  99.9%  11/1508/860
➤ ls -l kathmandu-*mbtiles
-rw-r--r-- 1 enf staff 577536 Jul 23 14:58 kathmandu-blake.mbtiles
-rw-r--r-- 1 enf staff 655360 Jul 23 14:58 kathmandu-original.mbtiles
-rw-r--r-- 1 enf staff 602112 Jul 23 14:58 kathmandu-reordered.mbtiles
  • Blake format: 88% of original size
  • Reordered format: 92% of original size
➤ ./tippecanoe -Voriginal --no-tile-size-limit -zg -f -o neroads-original.mbtiles ../north-america-roads_natural-earth.geojson
For layer 0, using name "northamericaroads_naturalearth"
49183 features, 24440999 bytes of geometry, 2317328 bytes of separate metadata, 775873 bytes of string pool
Choosing a maxzoom of -z4 for features about 24225 feet (7384 meters) apart
Choosing a maxzoom of -z8 for resolution of about 1118 feet (340 meters) within features
  99.9%  8/41/96
➤ ./tippecanoe -Vreordered --no-tile-size-limit -zg -f -o neroads-reordered.mbtiles ../north-america-roads_natural-earth.geojson
For layer 0, using name "northamericaroads_naturalearth"
49183 features, 24440999 bytes of geometry, 2317328 bytes of separate metadata, 775873 bytes of string pool
Choosing a maxzoom of -z4 for features about 24225 feet (7384 meters) apart
Choosing a maxzoom of -z8 for resolution of about 1118 feet (340 meters) within features
  99.9%  8/41/96
➤ ./tippecanoe -Vblake --no-tile-size-limit -zg -f -o neroads-blake.mbtiles ../north-america-roads_natural-earth.geojson
For layer 0, using name "northamericaroads_naturalearth"
49183 features, 24440999 bytes of geometry, 2317328 bytes of separate metadata, 775873 bytes of string pool
Choosing a maxzoom of -z4 for features about 24225 feet (7384 meters) apart
Choosing a maxzoom of -z8 for resolution of about 1118 feet (340 meters) within features
  99.9%  8/57/96
➤ ls -l neroads-*mbtiles
-rw-r--r-- 1 enf staff 17453056 Jul 23 15:05 neroads-blake.mbtiles
-rw-r--r-- 1 enf staff 18923520 Jul 23 15:03 neroads-original.mbtiles
-rw-r--r-- 1 enf staff 18317312 Jul 23 15:04 neroads-reordered.mbtiles
  • Blake format: 92% of original size
  • Reordered format: 97% of original size

At least now I know it's worth spending a little extra time to sort the values before writing out the tile, even if there is some additional advantage to either inlining values or using repeated messages.

@e-n-f
Contributor

e-n-f commented Jul 23, 2018

The Natural Earth roads are improved slightly by also sorting the keys:

➤ ls -l neroads-*mbtiles
-rw-r--r-- 1 enf staff 17453056 Jul 23 15:05 neroads-blake.mbtiles
-rw-r--r-- 1 enf staff 18923520 Jul 23 15:03 neroads-original.mbtiles
-rw-r--r-- 1 enf staff 18292736 Jul 23 15:17 neroads-reordered.mbtiles

@e-n-f
Contributor

e-n-f commented Jul 23, 2018

Blake's format, but without inline ints:

  • Kathmandu: 94% instead of 88%
  • Natural Earth roads: 97% instead of 92%

So I think inlining is helping more than repeated messages are.

@e-n-f
Contributor

e-n-f commented Jul 24, 2018

Inlining floats does help a little, but the difference is in the noise (91.97% vs 92.16%):

-rw-r--r-- 1 enf staff 17403904 Jul 24 10:21 neroads-blake-float.mbtiles
-rw-r--r-- 1 enf staff 17440768 Jul 24 10:20 neroads-blake-regular.mbtiles

This also highlights that we need more than 3 bits for types. In fact this PR already actually requires 4, because it specifies "list / map" as type 8, which won't fit in 3 bits. I'll add a 4th type bit to the test implementation and recalculate.

@e-n-f
Contributor

e-n-f commented Jul 24, 2018

Adding the 4th type bit raises the roads from using 92.16% to 92.49% of the original tileset size.

-rw-r--r-- 1 enf staff 17502208 Jul 24 10:31 neroads-blake-regular.mbtiles

Member

@mourner mourner left a comment

Overall I really like the flat approach. While we introduce 4 bits per value, this should be more than offset by not wrapping each value as a separate tagged message, and nested properties fit here naturally.

// list / map | 8 | (if 4th bit is 0 is list)
// | | remaining bits are length of the list where
// | | each item in the list is a complex value
// | | (if 4th bit is 1 is map)
Member

This is a bit confusing — the id 8 is 0b1000, but if the 4th bit is 1 (so that it becomes 0b1001), the id equals 9. Then why not just indicate 8 for list and 9 for map instead of mentioning the fourth bit?
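The arithmetic in this comment, spelled out in Python purely as an illustration (the constant name is made up):

```python
LIST_OR_MAP = 0b1000  # id 8 as written in the table under review

# Setting the extra flag bit on id 8 yields a value indistinguishable from id 9.
for type_id in (LIST_OR_MAP, LIST_OR_MAP | 1):
    print(type_id, bin(type_id), "bits needed:", type_id.bit_length())
```

Both values already need 4 bits, so a separate id for maps costs nothing extra.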

Member Author

I was attempting to get away with just using 3 bits so that we could represent higher int values without having to use the int index system. I am not against 4 bits.

//
// The properties field is much like the tags value in the it is two integers
// pairs that reference key and value pairs however, it is broken out into a
// "key_index" and an "complex_value".
Member

Nit: had to read many times to understand the sentence. the -> that? also, ; before "however" would help

@mourner
Member

mourner commented Jul 27, 2018

Also, a question: how does an encoder decide whether to inline a value or put it in the packed array? Should we always inline int/sint values? And if we remove "index to int/sint" types, this makes the types fit into 3 bits again (including list/map without an additional bit).

@flippmoke
Copy link
Member Author

@mourner We cannot currently remove the indexed integer types because there is a limit to the size that the inlined values can represent. Here is how @ericfischer currently did the implementation for when to use inline vs indexed. I think this makes sense overall. The only time there might be a bit of savings from using an index over inline would be for values that are larger than can be represented by 24-bit integers, that are highly repeated (probably more than 3 or 4 times), and that would have a low index value. This could be calculated on the fly if required, but I don't know that we need such a complex implementation. Overall, inlining seems to save space on average, so we may not need such complex code.
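For readers following along, the simplest policy consistent with this description is "inline whenever the value fits, otherwise fall back to an index". The Python sketch below assumes 4 type bits in a 64-bit word (so 60 bits for the inlined parameter) and is not taken from the linked implementation. The function name and the returned labels are hypothetical.

```python
INLINE_BITS = 60  # assumption: a 64-bit word minus 4 type bits


def choose_int_encoding(value: int) -> str:
    """Pick an encoding for an integer attribute value (hypothetical policy)."""
    if 0 <= value < (1 << INLINE_BITS):
        return "inline uint"
    if -(1 << (INLINE_BITS - 1)) <= value < 0:
        return "inline sint"  # zigzagged into the same 60 bits
    if (1 << INLINE_BITS) <= value < (1 << 64):
        return "indexed uint"
    if -(1 << 63) <= value < 0:
        return "indexed sint"
    raise ValueError("value does not fit in 64 bits")


for v in (142, -5, 1 << 60, -(1 << 59) - 1):
    print(v, "->", choose_int_encoding(v))
```

A frequency-aware encoder, as described above, would additionally compare the inline cost against the cost of a small index for highly repeated large values.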

@joto

joto commented Aug 9, 2018

Here are some random thoughts:

  • Large integers that can't be inlined (due to us using those 4 bits for the type) could still be inlined by having a special type that says: next integer is the actual value.
  • We might not need both the uint64/sint64_values tables, because internally they are both varints and we know the type from the type field
  • We could put all tables into one large buffer using offsets instead of indexes into those tables. We already know the value type. From an encoding point of view all those tables in a layer are a problem, because we need to keep them around in memory until the layer is finished, so this could simplify things. (A variant of this would be to store the data inline the first time and after that use offsets.)
  • The proposal removes the distinction between the keys and values tables and only has the string_values field. This simplifies things slightly (and saves a few bytes for the second table header) but probably makes the index values of the often-used keys larger (unless you take care to put all keys into the string table first). It also makes it impossible to reuse the keys table in shaving or similar use cases. Also, creating this table is more expensive in the first place, because the key space is usually small, which makes it easier to find the index of a given key. On the other hand, with nested maps, having a single table is a bit easier, because there is no question where all the strings are.
  • Keys are always strings, so we don't need to store the type for the keys. Saves the 4 bits.
  • How common is the case where float/double values are used that actually occur multiple times in the layer, so that the lookup table makes sense? Maybe it is better to convert them to ints somehow and store them inline?

@e-n-f
Contributor

e-n-f commented Aug 9, 2018

  • I am OK with inlining large integers as internal varints if we're also doing that with lists and maps. The value I see to not making any types variable-length is that it is nice to be able to know how many attributes there are just by dividing the length of the list by 2. But I'm not sure how much that really matters.
  • We either need separate signed and unsigned integer types, or we need to zigzag unsigned integers as well, since non-zigzag negative numbers take so much extra space.
  • Is there a way to represent the one-large-buffer approach in standard protobuf syntax, or does that make the format protozero-only?
  • No objection from me to a separate keys table.
  • I tried inlining floats and it didn't make much difference in size, and makes the format harder to describe and implement. Inlining doubles would require a larger-than-64-bit integer type to pack them into. It might be worth trying encoding the mantissa and exponent as a pair of varints and see how that works out, though.

@e-n-f
Contributor

e-n-f commented Aug 9, 2018

The mantissas of floating point numbers seem to be fairly uniformly distributed across the [.5…1) interval, so there's probably not much potential for giving more common mantissas shorter representations. Low exponents are more common than high ones, though, so we might be able to squeeze a little bit out there.
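The mantissa/exponent split referred to here can be inspected with Python's standard `math.frexp`, which returns a mantissa in the [0.5, 1) interval and an integer exponent. This is only a way to look at the decomposition, not a proposed encoding; the sample values are arbitrary apart from 4.559190, which is the "length" value from the road sample earlier in the thread.

```python
import math

samples = [8.0, 4.559190, 0.1]
parts = [math.frexp(x) for x in samples]  # (mantissa, exponent) pairs

for x, (mantissa, exponent) in zip(samples, parts):
    # x == mantissa * 2**exponent, with 0.5 <= mantissa < 1 for positive x
    print(f"{x!r} = {mantissa!r} * 2**{exponent}")
```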

@joto

joto commented Aug 16, 2018

Regarding special encoding of floating point numbers: I don't think it is worth it to come up with complex schemes here. I had thought about just using raw bytes stored in a string field or something. But while that might be easy to use in C++, it will be more difficult in JS or so.

Regarding the keys/string_values tables: With the encoding proposed here it doesn't cost us anything to split these up, because each string is encoded by itself anyway. But as mentioned it will lead to smaller index numbers which, especially for the keys case is probably worth it. Here is another idea though: Currently all keys/string_values are directly in the layer object, if we push this down one level and have an intermediate string_table object, it could be more efficient. It would allow us to jump over the whole table or copy the whole table in one go. The cost is one more byte for the type and one varint for the length of the whole table. Double that if we have separate keys/string_values tables.

Regarding integer value encodings: If we put large integers that can't be inlined as separate varint in the properties array, we will hit a bad case for varints. Because they are always large, chances are they will get even larger as varints (max 10 bytes compared to 8 bytes for the int itself). So there is some inefficiency there. On the other hand, if we want to put them in an index table, we can use a fixed size type instead of a varint, which would avoid this and also make access more efficient, because we can directly address values in those tables without having to decode them first. So I think if we keep the tables, they should be of type (s)fixed32/64 instead of (s/u)int32/64. It would still be an indirect access which likely is slower than inlining though.
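The "10 bytes versus 8 bytes" point can be checked with a small Python sketch. `encode_varint` and `encode_fixed64` are illustrative helpers written for this note, not functions from protozero or any protobuf library.

```python
import struct


def encode_varint(value: int) -> bytes:
    """Encode an unsigned 64-bit integer as a protobuf varint."""
    if not 0 <= value < 1 << 64:
        raise ValueError("value out of range for uint64")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # low 7 bits plus continuation bit
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_fixed64(value: int) -> bytes:
    """Encode an unsigned 64-bit integer as a little-endian fixed64."""
    return struct.pack("<Q", value)


# Values too large for a 60-bit inline parameter.
for value in (1 << 60, 1 << 63, (1 << 64) - 1):
    print(value, len(encode_varint(value)), len(encode_fixed64(value)))
```

Every value in that range costs 9 or 10 bytes as a varint but always 8 bytes as a fixed64.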

We either need separate signed and unsigned integer types, or we need to zigzag unsigned integers as well, since non-zigzag negative numbers take so much extra space.

This is one of those cases where we are hitting the limits of protobuf encoding again. We know the type, so we could do the zigzag encoding ourselves for sints and not for uints. For the C++ code this doesn't matter, because we do the zigzag encoding ourselves anyway, but for anybody using the protobuf encodings, we either need two tables, or they have to do the zigzag encoding outside the protobuf lib.

@e-n-f
Contributor

e-n-f commented Aug 16, 2018

  • Glad to hear that inlining floats sounds like it is off the table.

  • I'm fine with putting keys and strings in separate tables, and within a container object if that is considered useful. I'll try changing my prototype to do that. Should we put all the attributes inside that message, or is there a case where readers just want to skip/copy the strings?

  • Good point that the integers-by-reference should be fixed-size instead of varint, since they will always be large if they didn't get inlined. I'll change my prototype and this .proto file to do that.

  • If we talk about inlining signed integers at all, we are inherently doing bit-packing, so we have to explicitly talk about either zigzagging or sign extension, and zigzagging is probably the better choice of the two. All clients, no matter what language they are written in, will have to be able to unpack inlined integers.
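To illustrate the two options for a signed value in a 60-bit inline field, here is a Python sketch. The 60-bit width assumes 4 type bits in a 64-bit word, and all names are made up for this note.

```python
PARAM_BITS = 60  # assumption: 64-bit word minus 4 type bits
PARAM_MASK = (1 << PARAM_BITS) - 1


def zigzag(value: int) -> int:
    """Map 0, -1, 1, -2, ... to 0, 1, 2, 3, ... within PARAM_BITS bits."""
    return ((value << 1) ^ (value >> (PARAM_BITS - 1))) & PARAM_MASK


def unzigzag(encoded: int) -> int:
    return (encoded >> 1) ^ -(encoded & 1)


def twos_complement(value: int) -> int:
    """Truncate to PARAM_BITS bits; the reader must sign-extend."""
    return value & PARAM_MASK


def sign_extend(encoded: int) -> int:
    return encoded - (1 << PARAM_BITS) if encoded >> (PARAM_BITS - 1) else encoded


for v in (-1, -12345):
    print(v, zigzag(v).bit_length(), twos_complement(v).bit_length())
```

With zigzag, small negative numbers stay small and therefore produce short varints; with sign extension every negative number fills all 60 parameter bits.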

On a different topic:

  • If we inline lists and hashes, meaning that we mix single-word and multi-word values in the attribute list, we need to be clear about which sets of types occupy only a single slot and which refer to suffixes, so that soon-to-exist clients can skip over types that future versions of the standard may define. I think we should be explicit that types 9 through 15 only use a single slot and are either single-word inline types or reference types, not multi-word inline types.

// uses the properties field instead. This would only be used if version
// for a layer is 3 or greater and tags should not be used at that point
// Additional tags (or all the tags) of this feature may be
// encoded as repeated pairs of 32-bit integers, to take

Didn't we want to get away from the "tags" name and use "properties" instead? Also the properties field has 64bit uints, not 32 bit ints. And this is not necessarily "pairs" when we deal with lists and maps.

Contributor

Good point, will reword. @flippmoke and I want to consistently call it "attributes" to match what OGC does.

@@ -69,6 +112,12 @@ message Tile {
// See https://github.com/mapbox/vector-tile-spec/issues/47
optional uint32 extent = 5 [ default = 4096 ];

repeated string string_values = 7;
repeated double double_values = 8 [ packed = true ];
repeated float float_values = 9 [ packed = true ];

Can we try to keep the types ordered consistently throughout the .proto file? I.e., some places have float first, then double, and others use a different order.

Contributor

Sounds good to me. I'll make that edit.

repeated float float_values = 9 [ packed = true ];
repeated sfixed64 sfixed64_values = 10 [ packed = true ];
repeated fixed64 fixed64_values = 11 [ packed = true ];

I suggest these should get "logical" names like signed_integer_values or so instead of ones based on the encoding sfixed....

Contributor

Also fine with me.

// | | each item in the list is a complex value
// | | (if 4th bit is 1 is map)
// | | remaining bits are the number of key_index and
// | | complex_value pairs to follow (same as properties)

Can we simply make these list -> 8, map -> 9? The extra bit is confusing and doesn't buy us anything, because we already have 9 values (0-8) for the Id anyway.

Contributor

Sounds good to me, since we have enough type fields to spare. I think they were combined only because it looked like the types would fit in 3 bits.

// an index position into a value storage of the layer.
//
// uint64t type = complex_value & 0x0F; // First 4 Bits
// uint64t parameter = complex_value >> 4;

uint64_t

Contributor

Good point, will fix
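For reference, the split described in the quoted comment (low 4 bits are the type, the remaining bits are the parameter) looks like this in Python. The function names are illustrative only.

```python
TYPE_BITS = 4
TYPE_MASK = 0x0F


def pack_complex_value(type_id: int, parameter: int) -> int:
    """Combine a 4-bit type id and a 60-bit parameter into one 64-bit word."""
    if not 0 <= type_id <= TYPE_MASK:
        raise ValueError("type id must fit in 4 bits")
    if not 0 <= parameter < 1 << (64 - TYPE_BITS):
        raise ValueError("parameter must fit in 60 bits")
    return (parameter << TYPE_BITS) | type_id


def unpack_complex_value(complex_value: int) -> tuple:
    """Return (type_id, parameter), mirroring `& 0x0F` and `>> 4`."""
    return complex_value & TYPE_MASK, complex_value >> TYPE_BITS


word = pack_complex_value(5, 142)
print(word, unpack_complex_value(word))
```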

// bool/null | 7 | value of 0 = false, 1 = true, 2 = null
// list | 8 | value is the number of sub-attributes to follow:
// | | each item in the list is a complex value
// map | 9 | value is the number of sub-attributes to follow:
Member Author

Question on the intent of the wording here: is the number of sub-attributes to follow based on the number of key-value pairs, or on the number of keys and values?

Contributor

I meant it to be the number of pairs, not the total number of words to follow. Thanks. I'll reword.
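A decoder sketch makes the "count of pairs" reading concrete. It handles only the type ids visible in the quoted table (inline uint 5, inline sint 6, bool/null 7, list 8, map 9) and assumes that map keys are written as plain indexes into a keys table. That key layout, the `keys` list, and the function name are assumptions for illustration, not the PR's final wording.

```python
INLINE_UINT, INLINE_SINT, BOOL_NULL, LIST, MAP = 5, 6, 7, 8, 9


def decode_value(words, pos, keys):
    """Decode one complex value at words[pos]; return (value, next_pos)."""
    type_id, param = words[pos] & 0x0F, words[pos] >> 4
    pos += 1
    if type_id == INLINE_UINT:
        return param, pos
    if type_id == INLINE_SINT:
        return (param >> 1) ^ -(param & 1), pos  # undo zigzag
    if type_id == BOOL_NULL:
        return (False, True, None)[param], pos
    if type_id == LIST:
        items = []
        for _ in range(param):  # param = number of items
            item, pos = decode_value(words, pos, keys)
            items.append(item)
        return items, pos
    if type_id == MAP:
        result = {}
        for _ in range(param):  # param = number of pairs, not words
            key = keys[words[pos]]
            value, pos = decode_value(words, pos + 1, keys)
            result[key] = value
        return result, pos
    raise ValueError(f"type {type_id} is not handled in this sketch")


keys = ["tags", "lanes", "oneway"]  # hypothetical keys table
words = [
    (2 << 4) | MAP,          # map with 2 pairs
    0, (2 << 4) | LIST,      # "tags": list of 2 items
    (3 << 4) | INLINE_UINT,  #   3
    (1 << 4) | INLINE_SINT,  #   zigzag 1 -> -1
    2, (1 << 4) | BOOL_NULL, # "oneway": true
]
decoded, end = decode_value(words, 0, keys)
print(decoded, end)
```

Note that the map header says 2 even though 6 words follow it, which is exactly the distinction raised in the question.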

// int | 3 | index into layer.attribute_pool.signed_integer_values
// uint | 4 | index into layer.attribute_pool.unsigned_integer_values
// inline uint | 5 | value of unsigned integer (values between 0 to 2^60-1)
// inline sint | 6 | value of zigzag-encoded integer (values between -2^59 to 2^59-1)
Member Author

We probably should change this to be 2^56 for uint and 2^55 for signed due to the way varints are encoded.

I think we need to differentiate between what encodings are possible and what encodings are recommended. The spec may well say that this or that encoding is recommended because it is usually better, but still require readers to understand a different encoding.
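As background for the range discussion, the sketch below computes how many varint bytes the largest inline word needs for a given parameter width, assuming 4 type bits below the parameter. It only tabulates the sizes and does not claim to be the reasoning behind the 2^56 suggestion; the helper names are made up.

```python
TYPE_BITS = 4


def varint_len(value: int) -> int:
    """Bytes needed for a non-negative value at 7 payload bits per byte."""
    return max(1, -(-value.bit_length() // 7))


def inline_word_len(parameter_bits: int) -> int:
    """Varint size of the largest word with the given parameter width."""
    largest = (((1 << parameter_bits) - 1) << TYPE_BITS) | 0x0F
    return varint_len(largest)


for bits in (52, 56, 59, 60):
    print(bits, "parameter bits ->", inline_word_len(bits), "bytes")
```

A full 60-bit parameter produces a 10-byte varint, which is larger than an 8-byte fixed-size table entry before the index is even counted.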

@joto joto mentioned this pull request Sep 22, 2018
7 tasks
@flippmoke
Member Author

Closing in favor of #123

@flippmoke flippmoke closed this Sep 27, 2018
4 participants