-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add encoder/decoder for Avro object container files #81
base: master
Are you sure you want to change the base?
Conversation
Only adding the raising functions for now The implementation is designed so that people can pass different implementations of the codecs as required. Some people may want to use pure beam functions, some may want to use NIFs.
Thank you for the PR! I will provide some direct feedback this week. One immediate piece of feedback is to move all codecs to separate files |
Improve how block headers are encoded to use an avro record This will help during decoding, since longs are variable length
lib/avro_ex/object_container.ex
Outdated
@bh_schema AvroEx.decode_schema!(~S({ | ||
"type":"record","name":"block_header", | ||
"fields":[ | ||
{"name":"num_objects","type":"long"}, | ||
{"name":"num_bytes","type":"long"} | ||
] | ||
})) | ||
@fh_schema AvroEx.decode_schema!(~S({ | ||
"type": "record", "name": "org.apache.avro.file.Header", | ||
"fields" : [ | ||
{"name": "magic", "type": {"type": "fixed", "name": "Magic", "size": 4}}, | ||
{"name": "meta", "type": {"type": "map", "values": "bytes"}}, | ||
{"name": "sync", "type": {"type": "fixed", "name": "Sync", "size": 16}} | ||
] | ||
})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- decode_schema! supports directly passing elixir terms, so this can pass an elixir map directly rather than a string
- Calling
decode_schema!/1
here causing a compile time dependency between this module andAvroEx
. Since this is a premature optimization, I suggest we move out of module attributes
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- decode_schema! supports directly passing elixir terms, so this can pass an elixir map directly rather than a string
Didn't notice that, thanks.
A good reason for using json may be to "match" with the Avro spec:
vs
%{
"type" => "record",
"name" => "org.apache.avro.file.Header",
"fields" => [
%{"name" => "magic", "type" => %{"type" => "fixed", "name" => "Magic", "size" => 4}},
%{"name" => "meta", "type" => %{"type" => "map", "values" => "bytes"}},
%{"name" => "sync", "type" => %{"type" => "fixed", "name" => "Sync", "size" => 16}}
]
})
the spec is unlikely to change, so no one will ever need to update it
I don't have a strong opinion about it 🤷
let me know which you prefer with the options side by side
2. Calling `decode_schema!/1` here causing a compile time dependency between this module and `AvroEx`. Since this is a premature optimization, I suggest we move out of module attributes
WDYT?
I'm not aware of any downsides. I work with a lot of embedded devices though, so doing stuff at compile time feels natural to me.
Is there a reason to avoid this optimization?
I can stick it in the file_header_schema/1
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it shouldn't change much over time, I think its fine to represent in elixir and get the benefits of formatting. It is also easier to edit as elixir code
Regarding compile-time, it shouldn't make a massive difference here, but accrued work at compile time does effect the time it takes to compile the library.
lib/avro_ex/object_container.ex
Outdated
def magic(), do: @magic | ||
def block_header_schema(), do: @bh_schema | ||
def file_header_schema(), do: @fh_schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a need to expose these directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The magic is unnecessary, but the raw schemas are useful to have.
I'm using them in tests right now. It feels icky to repeat these schemas in the encoding and decoding test files.
Is there a better way other than exposing them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should drop the need for them at tests, since we can just validate against expected inputs/outputs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is true. The format isn't expected to change either, so the tests won't need to be updated.
I'll try it out and see how it looks, it may make the tests 99% encoded data blobs though 😛
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best way to test this is that you can round trip the data through the encoder, so that should be completely reasonable
Took a quick look at this, here is some general feedback:
|
Matches the rest of the library
@behaviour AvroEx.ObjectContainer.Codec | ||
@impl AvroEx.ObjectContainer.Codec | ||
def encode!(data) do | ||
{:ok, compressed} = :snappyer.compress(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will need to add these as optional dependencies, no? We should also probably not compile this module if the user does not have snappyer
installed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm torn over how to handle it.
The issue is that the snappy codec is implemented differently from all the other codecs.
I haven't looked at how to do it yet, but it would be nice to only compile it in if :snappyer is in the list of compiled/included apps or maybe some kind of build flag?
The implementation could just live in the docs, but it seems to be a popular codec
What do you think?
My plan was to worry about it after finishing the rest of the object container
will do 👍
Thanks, that is a useful document. |
Handle case with missing codec Add tests for parts of the file header decoding required by the spec
hello @veedo, just wanted to check in on this PR, are you waiting on any review from me, or still a WIP? |
I've just been swamped this month unfortunately. |
No rush! Just wanted to make sure you weren't waiting on me |
Implement all the functions that go into decoding parts of the object container
Hello @veedo! Checking back in here, is there anything I can do to help with this PR? If you're not going to come back to it, we can consider other options |
The swamping has continued unfortunately, and will probably continue for the next 2 months at least. Currently the encoding part works correctly and consistently. I'll whip myself to finish the encoding tests this weekend. |
Implement the function that decodes an object container
The implementation of the snappy codec in avro is quite unique, so providing an implementation is valuable. Uses the optional dependency method similar to ecto.
@davydog187 Disregard my last comment. I had forgotten how much was left to do. |
@veedo wanted to remind you about this PR! Its come so far I'd love to still be able to get this merged at some point |
For some of my projects, I need full control of the encoding/decoding process and AvroEx provides a good basis for that.
The only things missing from AvroEx that I use a lot are object containers and cloud wire formats.
This pull request is my attempt to add object containers in a flexible way that gives a user maximum control over the process.
Theory of operation:
Please provide feedback on the PR as I go in case there's something untenable