[WIP] Modular Encryption Support When Reading Parquet Files #480
base: master
Conversation
I was able to tidy up the PR. However, there is a bug that happens when running
Okay, some findings. If any test runs after my new file decryption test in the same xunit collection, it crashes the CLR. I moved my test to its own test collection and disabled parallelization, which essentially means xunit will run my test in isolation (see 9e0bbbe). This way my test sometimes works, but it still randomly fails with similar memory-mismanagement issues, so it's flaky at the moment. This is just a band-aid to get the PR green. I'm sure I'm doing something stupid somewhere that's causing this issue, but I haven't been able to find it so far.
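For reference, isolating a test in xunit is roughly this pattern (class and collection names here are illustrative, not the ones used in commit 9e0bbbe):

```csharp
using Xunit;

// Defining a dedicated collection with parallelization disabled means xunit runs
// its tests serially and never alongside tests from other collections.
// Names are illustrative placeholders.
[CollectionDefinition("FileDecryption", DisableParallelization = true)]
public class FileDecryptionCollection { }

[Collection("FileDecryption")]
public class FileDecryptionTests
{
    // the new file decryption test lives here, isolated from the rest of the suite
}
```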
Not a solution, but here are some of my findings from investigating this; they may help debug the issue. I've found a reliable way to reproduce the exception that crashes the test host in a single test: the issue happens when allocating memory. Adding two small allocations at the end of the "Z_DecryptFile_UTF8_AesGcmV1_192bit" test crashes it every time on the second allocation with a System.ExecutionEngineException. The precise minimum size of the allocation is unknown to me; a single larger allocation doesn't trigger the issue, but several smaller allocations do.
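For what it's worth, the reproduction is roughly of this shape (the exact allocations and sizes from the original comment weren't preserved, so the values below are only illustrative), placed at the very end of the test method:

```csharp
// Illustrative only: after the decrypted file has been read, a couple of ordinary
// heap allocations are enough to surface the corruption as a
// System.ExecutionEngineException. The sizes are guesses, not the original values.
var first = new byte[64 * 1024];
var second = new byte[64 * 1024];   // crashes here on the second allocation
GC.KeepAlive(first);
GC.KeepAlive(second);
```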
That said, it only crashes if I've read the first data column: the datetime column that has invalid data requiring the catch statement in the AsUnixMillisecondsInDateTime extension method, so there's probably something in that. (I'm not sure the int and float column data is valid either, for that matter, but reading them doesn't seem to provoke this issue.) @mukunku, do you know if the file is actually valid and what the unencrypted contents should be? (I couldn't figure out how to specify an encryption key with parquet-tools.) It would also be useful to have the unencrypted and encrypted files side by side, for verifying the validity of the decrypted contents in the test.
Spent a little more time looking into this and I think I've found the problem. The code is decrypting the dictionary page header but not the dictionary page itself, so it is actually attempting to decompress encrypted data. As for the cause of the crash, this part is mostly educated speculation: because CompressedPageSize is actually the size of the encrypted payload (with length, nonce and tag), it is too large, and that may be causing problems in the decompressor, where unsafe pointers end up corrupting memory. In the case of the first column, CompressedPageSize = 50 and UncompressedPageSize = 16, but after actually decrypting the data first, CompressedPageSize = 18 and it is correctly decompressed into valid data. This also addresses the problem that required the catch in AsUnixMillisecondsInDateTime: a valid DateTime of "9/10/2023 2:26:07 PM" is decoded now (well, two DateTimes that are 2001 microseconds apart, actually).
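In other words, the fix amounts to decrypting the page payload before handing it to the decompressor. A rough sketch of the ordering (the `reader`, `decryptor` and `decompressor` names are stand-ins, not the actual members in this PR):

```csharp
// Illustrative ordering only: decrypt the dictionary page bytes, then decompress.
// CompressedPageSize covers the whole encrypted module (4-byte length prefix,
// 12-byte nonce, ciphertext, 16-byte tag), which is why it reads as 50 instead
// of 18 for the first column (4 + 12 + 18 + 16 = 50).
byte[] encryptedPage = reader.ReadBytes(pageHeader.CompressedPageSize);                  // 50 bytes
byte[] compressedPage = decryptor.Decrypt(encryptedPage);                                // 18 bytes
byte[] page = decompressor.Decompress(compressedPage, pageHeader.UncompressedPageSize);  // 16 bytes
```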
Summary
I made significant progress on getting footer decryption working with Parquet files (#191).
I'm opening this work-in-progress pull request with hopes that some other folks can help get this across the finish line.
AES_GCM_V1
Thanks to a test file @pzatschl shared with me, I was able to implement the AES_GCM_V1 encryption algorithm.
link to code
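For anyone reviewing, the core of AES_GCM_V1 module decryption looks roughly like the sketch below, assuming the buffer layout from the Parquet modular encryption spec (4-byte little-endian length, 12-byte nonce, ciphertext, 16-byte tag). It is not the exact code in the PR:

```csharp
using System;
using System.Buffers.Binary;
using System.Security.Cryptography;

internal static class GcmSketch
{
    // Sketch only: decrypt one encrypted module assuming the layout
    // [4-byte LE length][12-byte nonce][ciphertext][16-byte tag].
    public static byte[] DecryptGcmModule(byte[] buffer, byte[] key, byte[] aad)
    {
        const int NonceLength = 12, TagLength = 16;

        int length = BinaryPrimitives.ReadInt32LittleEndian(buffer);   // nonce + ciphertext + tag
        var nonce = new ReadOnlySpan<byte>(buffer, 4, NonceLength);
        var ciphertext = new ReadOnlySpan<byte>(buffer, 4 + NonceLength, length - NonceLength - TagLength);
        var tag = new ReadOnlySpan<byte>(buffer, 4 + length - TagLength, TagLength);

        var plaintext = new byte[ciphertext.Length];
        using var aesGcm = new AesGcm(key);
        aesGcm.Decrypt(nonce, ciphertext, tag, plaintext, aad);        // throws if the tag fails to verify
        return plaintext;
    }
}
```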
AES_GCM_CTR_V1
I also implemented the AES_GCM_CTR_V1 encryption algorithm; however, I don't have any test files to confirm it's working 🙃
link to code
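The CTR half differs only in how data pages are decrypted: they use AES-CTR with no authentication tag, with a 16-byte counter block built from the 12-byte nonce plus a 32-bit counter that starts at 1. A rough, untested sketch under those assumptions (no AES_GCM_CTR_V1 sample file exists yet, so treat it as an illustration only):

```csharp
using System;
using System.Buffers.Binary;
using System.Security.Cryptography;

internal static class CtrSketch
{
    // Untested sketch: decrypt one AES-CTR module assuming the layout
    // [4-byte LE length][12-byte nonce][ciphertext] and counter block = nonce || counter.
    public static byte[] DecryptCtrModule(byte[] buffer, byte[] key)
    {
        const int NonceLength = 12;
        int length = BinaryPrimitives.ReadInt32LittleEndian(buffer);  // nonce + ciphertext
        int cipherLength = length - NonceLength;

        var counter = new byte[16];
        Buffer.BlockCopy(buffer, 4, counter, 0, NonceLength);
        counter[15] = 1;                                              // initial counter value

        using Aes aes = Aes.Create();
        aes.Mode = CipherMode.ECB;                                    // ECB on the counter block yields the CTR keystream
        aes.Padding = PaddingMode.None;
        aes.Key = key;
        using ICryptoTransform enc = aes.CreateEncryptor();

        var plaintext = new byte[cipherLength];
        var keystream = new byte[16];
        for (int offset = 0; offset < cipherLength; offset += 16)
        {
            enc.TransformBlock(counter, 0, 16, keystream, 0);
            int blockLen = Math.Min(16, cipherLength - offset);
            for (int i = 0; i < blockLen; i++)
                plaintext[offset + i] = (byte)(buffer[4 + NonceLength + offset + i] ^ keystream[i]);

            for (int i = 15; i >= 12 && ++counter[i] == 0; i--) { }   // big-endian counter increment
        }
        return plaintext;
    }
}
```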
How to test
Check out the unit test I added that tests the sample file mentioned above:
link to code
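The shape of the test is roughly as below; the file path is a placeholder and the way the decryption key is supplied is a hypothetical stand-in for whatever this PR actually adds, so follow the link above for the real code:

```csharp
using System.IO;
using System.Threading.Tasks;
using Parquet;
using Xunit;

public class DecryptionTests
{
    // Hypothetical shape only: the path and the key-passing mechanism are placeholders.
    [Fact]
    public async Task Can_read_an_AES_GCM_V1_encrypted_file()
    {
        using Stream file = File.OpenRead("data/encrypted_aes_gcm_v1_sample.parquet");

        // ParquetOptions is the library's existing options type; whichever property
        // carries the decryption key here is an assumption, not the PR's actual API.
        var options = new ParquetOptions();

        using ParquetReader reader = await ParquetReader.CreateAsync(file, options);
        Assert.True(reader.RowGroupCount > 0);
    }
}
```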
However, even though I can decrypt the test file successfully, the data itself doesn't seem to be valid, so I had to add this try-catch as a temporary workaround:
link to code
We should remove this once we have a proper test file. (Unfortunately, I don't have any other test files.)
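For context, the workaround is roughly of this shape (a sketch, not the exact diff); as noted in the discussion above, it becomes unnecessary once dictionary pages are decrypted before decompression:

```csharp
using System;

internal static class OtherExtensions
{
    // Sketch of the temporary workaround: swallow out-of-range values when converting
    // Unix milliseconds to a DateTime, which is what garbage decoded from a
    // still-encrypted page produces. Not the exact code in the PR.
    public static DateTime AsUnixMillisecondsInDateTime(this long unixMillis)
    {
        try
        {
            return DateTimeOffset.FromUnixTimeMilliseconds(unixMillis).UtcDateTime;
        }
        catch (ArgumentOutOfRangeException)
        {
            return DateTime.MinValue;   // placeholder for invalid data
        }
    }
}
```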