-
Notifications
You must be signed in to change notification settings - Fork 909
Add Key Management Tools API for Parquet encryption #7387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This should come from the lower level API once it is available there, rather than being defined in the KMT API
I had implemented some integration tests with PyArrow but have removed them from this PR as it is already very big (0fb5372). I can add those in a follow up if people think they're worthwhile, although they add a bit of complexity to the CI for this one feature. Maybe there's a better way to implement them. |
@ggershinsky could you please take a look at this? |
hi @adamreeve , certainly. |
I'll be adding comments as I go thru the code. The review might take a week or so, but I'll inform when I'm finished with this round. |
this PR might need a comprehensive unitest for different encryption options, similar to https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/crypto/TestPropertiesDrivenEncryption.java . Also, it would be good to test an interop with Spark or PyArrow, where files are written by these frameworks, and read by the Rust code (probably manually, as we don't have these files in the parquet-testing repo - but still, an initial/partial check would be helpful). |
Ok, I'm finished with the review round, adding some comments above. Overall, looks very good. |
Thanks for the review @ggershinsky, I will try to address all your comments soon.
Yes, I did write some tests that wrote files with Python and read them in Rust, and also wrote with Rust and read in Python, but I left them out of this PR to keep the size down as they were a bit complicated. I will make a follow up PR to add these later. |
I'm afraid I've not really been following this effort closely, and so I may be missing something but I would have thought this would need to be async to accommodate external stores. Taking a step back though I wonder if this makes sense to include in the parquet crate proper, or if it could be some third-party crate. Is there some way we could add the necessary hooks to parquet-rs, if they don't already exist, and have this be an external project. I suspect arrow-cpp bundles all this largely for packaging reasons that don't apply to arrow-rs. I say this for a few reasons
Basically I'm a little concerned that the complexity and risks involved in this outstrip the arrow-rs project's ability to effectively review it... |
In my opinion, features can not be implemented outside the crate (such as support for actually encrypting/decrypting parquet during encoder/decode ) clearly belong in the crate Things that can be implemented entirely outside the crate (such as what this PR appears to be proposing) we should be much more careful about accepting (as @tustvold says) because we don't have infinite capacity to maintain features. Keeping the crates focused on arrow/parquet is part of how we'll be able to keep maintaining it. The fact that other language implementations seem to have key management libraries is interesting. I am not much of an expert in how keys are normally managed |
BTW the recent assistance in updating the Rust implementation of parquet to get closer to parity with the Java and C/C++ implementaions is uch appreciated |
This is a good point but I don't think that's really feasible, it would require changing all code paths where encryption is used to be async which I expect would be a very large and breaking change. Providing a non-async API doesn't prevent you from communicating with external stores, you can always use
Yes this could easily be a third party crate. My concern with that is that it would be less discoverable, and users would assume that if they want Parquet encryption they should use the existing lower level API where encryption keys are specified directly. But I believe users should be directed to using this KMS based API if possible to push them towards better security practices.
I don't think any KMS specific clients should be included in the Parquet crate, those should definitely be third party crates. Arrow C++ and PyArrow don't include any KMS client implementations either, besides an example Vault client that is clearly documented as just being an example and not for production use. |
I can't help feeling this is something we are going to need eventually, and we should probably work out how it would work... If it means the sync APIs only support encryption with static keys, then maybe that is fine...
I mean even better would be for them to be using the envelope encryption facilities the cloud/hosting providers themselves provide... My understanding of this PR is the user's still have to manage and store the KEKs themselves, handle rotation, etc... Is there an argument that whilst perhaps better than the low-level interface it still requires quite a sophisticated user to use it securely? IMO parquet-rs should provide the minimal hooks to allow people to securely support modular encryption in their environment with the primitives available to them, be they a cloud-based KMS or HSM solution or something else. I understand the thinking of providing an out of the box key management toolkit, especially as a way to dog-food these interfaces, but I worry about being able to maintain what is a piece of complex and security critical code.
At least AWS PME appears to integrate with AWS KMS, i.e. the approach I am alluding to above. |
I don't think this understanding is quite right. Users would be expected to use the encryption facilities provided by their cloud environment or organisation's security team. This module just provides a way to integrate with those facilities while being compatible with other Parquet implementations. The management and rotation of master keys is the responsibility of the KMS, and ideally the user would only need to implement a very thin wrapper over the KMS API to integrate with this crate.
That is what this module does. I think it's not actually doing as much as you think it is? It really just provides a way to generate random data encryption keys that can then be encrypted in whatever way makes sense in the user's environment, and implements a standardised JSON metadata format for the key material to allow later decrypting those keys and providing compatibility with other Parquet implementations. |
I'll try to find time to take a more detailed look at the weekend |
Which issue does this PR close?
Closes #7256
What changes are included in this PR?
Adds a new
key_management
module that allows generating Parquet encryption and decryption properties that integrate with a Key Management Server (KMS).This module is enabled by a new
key_management
feature.The implementation is fairly closely based on the C++ Arrow library behaviour (see https://github.com/apache/arrow/blob/main/cpp/src/parquet/encryption/crypto_factory.h as a starting point for reference).
There's also a design document on key management tools which is another useful reference.
One feature not included is the ability to use external key material to handle rotation of master encryption keys in the KMS. That could be added later.
Are there any user-facing changes?
Yes, this adds new user-facing functionality (within an experimental module).