-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: Table.save() store hash for parquet files #446
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this very relevant. Would maybe a little more detail come in handy?
- when it is triggered (on publication). You write implicitly that it is used when "using parquet as the storage format". I think this could be a little bit more explicit.
- what is the purpose of having a hash (to determine table identity)
- why the md5sums can differ in parquet (mainly compression I believe)
I am approving this anyway.
Your discussion points are valid, maybe you should not directly approve, but first let us discuss them ;)
Publication is not part of
I added the following statement to the docs:
So far I was not able to find documentation on that. There seems to be several factors that might influence the md5 sum:
At least the last point could be avoided, but I found it nice that storing the hash inside the file also provides the advantage that the hash needs to be calculated only ones. So I didn't bother trying to fix it. I'm also pretty sure that otherwise we will have the problem that Any suggestion, what and how I should add this to the docstring? |
I still do not know how to give approval after I had not done so in the first place. I promise I will figure this out :-)
My understanding is that you are asking for the reasoning why parquet files can differ. Probably using a kind of vagueness would help? One could talk of "reasons include" to keep it as something that is not necessarily comprehensive. And one could drop "likely" in order to avoid that it is hard to get such a comprehensive picture, maybe like this:? "md5sums of parquet files containing identical information often differ. Reasons include various factors like the library that wrote the parquet file and the version of this library, the chosen compression codec and metadata written by the library" I have so far not added the first point "the binary data always stored deterministically ". I though parquet will not do much more than a (deterministic) run length encoding, and the non-deterninism comes from the compression. But I am not an export on that at all. |
That's indeed a hidden feature in Github. I recently figured out how to do it. Let's assume you did your first review by just adding comments to certain parts of the code, and you didn't selected "Approve". |
Thanks, I shortened your suggestion slightly and included it as: |
I think the description is nice now. As I had prematurely given approval this is not necessary a second time. Or is it? |
Thanks, is fine, I will merge now. |
Document, that we store a hash value in the metadata of the schema of the PARQUET file, when saving to the file.