Add Key Management Tools API for Parquet encryption #7387


Open
adamreeve wants to merge 37 commits into main

Conversation

@adamreeve (Contributor) commented Apr 5, 2025

Which issue does this PR close?

Closes #7256

What changes are included in this PR?

Adds a new key_management module that allows generating Parquet encryption and decryption properties that integrate with a Key Management Server (KMS).

This module is enabled by a new key_management feature.

The implementation is fairly closely based on the C++ Arrow library behaviour (see https://github.com/apache/arrow/blob/main/cpp/src/parquet/encryption/crypto_factory.h as a starting point for reference).

There's also a design document on key management tools which is another useful reference.

One feature not included is the ability to use external key material to handle rotation of master encryption keys in the KMS. That could be added later.
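
To illustrate the general shape of such a KMS integration (this is a hedged sketch, not the module's actual API: the trait and type names here are hypothetical, and the XOR "cipher" is a stand-in for a real KMS call, not cryptography):

```rust
// Hypothetical sketch of a KMS client interface for envelope encryption.
// A real implementation would call out to a KMS (Vault, a cloud KMS, ...)
// to wrap/unwrap keys; the XOR below is a toy placeholder for illustration.
trait KmsClient {
    /// Encrypt (wrap) a data encryption key under a master key held by the KMS.
    fn wrap_key(&self, key_bytes: &[u8], master_key_id: &str) -> Vec<u8>;
    /// Decrypt (unwrap) a previously wrapped key.
    fn unwrap_key(&self, wrapped: &[u8], master_key_id: &str) -> Vec<u8>;
}

struct ToyKms;

impl KmsClient for ToyKms {
    fn wrap_key(&self, key_bytes: &[u8], master_key_id: &str) -> Vec<u8> {
        // NOT real cryptography: XOR with a key-id-derived byte, demo only.
        let pad = master_key_id.bytes().fold(0u8, |acc, b| acc ^ b);
        key_bytes.iter().map(|b| b ^ pad).collect()
    }
    fn unwrap_key(&self, wrapped: &[u8], master_key_id: &str) -> Vec<u8> {
        // XOR is its own inverse, so unwrapping is the same operation.
        self.wrap_key(wrapped, master_key_id)
    }
}

fn main() {
    let kms = ToyKms;
    let dek = vec![1u8, 2, 3, 4]; // data encryption key (normally random)
    let wrapped = kms.wrap_key(&dek, "footer-master-key");
    assert_eq!(kms.unwrap_key(&wrapped, "footer-master-key"), dek);
    println!("round-trip ok");
}
```

The point of such an interface is that users plug in a thin client for whatever KMS their environment provides, while the crate handles generating data encryption keys and producing compatible Parquet encryption properties.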

Are there any user-facing changes?

Yes, this adds new user-facing functionality (within an experimental module).

adamreeve added 30 commits April 5, 2025 11:19
This should come from the lower level API once it is available there,
rather than being defined in the KMT API
@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 5, 2025
@adamreeve (Contributor, Author)

I had implemented some integration tests with PyArrow but have removed them from this PR as it is already very big (0fb5372). I can add those in a follow up if people think they're worthwhile, although they add a bit of complexity to the CI for this one feature. Maybe there's a better way to implement them.

@adamreeve (Contributor, Author)

@ggershinsky could you please take a look at this?

@ggershinsky (Contributor)

Hi @adamreeve, certainly.

@ggershinsky (Contributor)

I'll be adding comments as I go through the code. The review might take a week or so, but I'll let you know when I'm finished with this round.

@ggershinsky (Contributor)

This PR might need comprehensive unit tests for the different encryption options, similar to https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/crypto/TestPropertiesDrivenEncryption.java .

Also, it would be good to test interop with Spark or PyArrow, where files are written by these frameworks and read by the Rust code (probably manually, as we don't have these files in the parquet-testing repo, but still, an initial/partial check would be helpful).

@ggershinsky (Contributor)

Ok, I'm finished with the review round, adding some comments above. Overall, looks very good.

@adamreeve (Contributor, Author)

Thanks for the review @ggershinsky, I will try to address all your comments soon.

Also, it would be good to test interop with Spark or PyArrow, where files are written by these frameworks and read by the Rust code (probably manually, as we don't have these files in the parquet-testing repo, but still, an initial/partial check would be helpful).

Yes, I did write some tests that wrote files with Python and read them in Rust, and also wrote files with Rust and read them in Python, but I left them out of this PR to keep the size down, as they were a bit complicated. I will make a follow-up PR to add these later.

@tustvold (Contributor) commented Apr 9, 2025

I'm afraid I've not really been following this effort closely, and so I may be missing something, but I would have thought this would need to be async to accommodate external stores.

Taking a step back though, I wonder if this makes sense to include in the parquet crate proper, or if it could be some third-party crate. Is there some way we could add the necessary hooks to parquet-rs, if they don't already exist, and have this be an external project? I suspect arrow-cpp bundles all this largely for packaging reasons that don't apply to arrow-rs.

I say this for a few reasons

  • Supporting external KMS providers is a monumental undertaking
  • There is very limited arrow-rs review budget
  • Key management is hard and needs to be done and reviewed with care

Basically I'm a little concerned that the complexity and risks involved in this outstrip the arrow-rs project's ability to effectively review it...

@alamb (Contributor) commented Apr 9, 2025

In my opinion, features that cannot be implemented outside the crate (such as support for actually encrypting/decrypting Parquet during encode/decode) clearly belong in the crate.

Things that can be implemented entirely outside the crate (such as what this PR appears to be proposing) we should be much more careful about accepting (as @tustvold says), because we don't have infinite capacity to maintain features. Keeping the crates focused on arrow/parquet is part of how we'll be able to keep maintaining them.

The fact that other language implementations seem to have key management libraries is interesting. I am not much of an expert in how keys are normally managed.

@alamb (Contributor) commented Apr 9, 2025

BTW, the recent assistance in updating the Rust implementation of Parquet to get closer to parity with the Java and C/C++ implementations is much appreciated.

@adamreeve (Contributor, Author) commented Apr 9, 2025

I'm afraid I've not really been following this effort closely, and so I may be missing something, but I would have thought this would need to be async to accommodate external stores.

This is a good point, but I don't think that's really feasible; it would require changing all code paths where encryption is used to be async, which I expect would be a very large and breaking change. Providing a non-async API doesn't prevent you from communicating with external stores, as you can always use block_on, but I can see how some users might prefer to avoid that.
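
To make the point concrete, here is a std-only sketch (all names hypothetical) of how a synchronous key-retrieval function can still talk to a remote store by simply blocking the calling thread, which is effectively what wrapping an async KMS client in a runtime's block_on would do:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Stand-in for a network round trip to a KMS; in real code this would be
// an async HTTP call driven by block_on, rather than a spawned thread.
fn fetch_key_from_remote_kms(key_id: &str) -> Vec<u8> {
    let key_id = key_id.to_owned();
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(10)); // simulated network latency
        tx.send(format!("key-bytes-for-{key_id}").into_bytes()).unwrap();
    });
    // The synchronous caller blocks until the "remote" call completes.
    rx.recv().unwrap()
}

fn main() {
    let key = fetch_key_from_remote_kms("footer-key");
    assert_eq!(key, b"key-bytes-for-footer-key".to_vec());
}
```

The trade-off is exactly the one raised above: the thread is tied up for the duration of the KMS round trip, which some async-heavy applications may prefer to avoid.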

I wonder if this makes sense to include in the parquet crate proper, or if it could be some third-party crate.

Yes, this could easily be a third-party crate. My concern with that is that it would be less discoverable, and users would assume that if they want Parquet encryption they should use the existing lower-level API where encryption keys are specified directly. But I believe users should be directed to this KMS-based API where possible, to push them towards better security practices.

Supporting external KMS providers is a monumental undertaking

I don't think any KMS specific clients should be included in the Parquet crate, those should definitely be third party crates. Arrow C++ and PyArrow don't include any KMS client implementations either, besides an example Vault client that is clearly documented as just being an example and not for production use.

@ggershinsky (Contributor)

A side comment: adding this layer will help arrow-rs readers to open files written by AWS PME, Spark PME or PyArrow/ArrowC++. They all package a similar mechanism.
Also, the key management libraries are indeed not included here; instead, an interface is provided for working with basically any KMS.

@tustvold (Contributor) commented Apr 9, 2025

This is a good point, but I don't think that's really feasible; it would require changing all code paths where encryption is used to be async, which I expect would be a very large and breaking change.

I can't help feeling this is something we are going to need eventually, and we should probably work out how it would work... If it means the sync APIs only support encryption with static keys, then maybe that is fine...

But I believe users should be directed to this KMS-based API where possible, to push them towards better security practices

I mean, even better would be for them to use the envelope encryption facilities the cloud/hosting providers themselves provide... My understanding of this PR is that users still have to manage and store the KEKs themselves, handle rotation, etc. Is there an argument that, whilst perhaps better than the low-level interface, it still requires quite a sophisticated user to use it securely?

IMO parquet-rs should provide the minimal hooks to allow people to securely support modular encryption in their environment with the primitives available to them, be they a cloud-based KMS or HSM solution or something else. I understand the thinking of providing an out of the box key management toolkit, especially as a way to dog-food these interfaces, but I worry about being able to maintain what is a piece of complex and security critical code.

Adding this layer will help arrow-rs readers to open files written by AWS PME, Spark PME or PyArrow/ArrowC++. They all package a similar mechanism.

At least AWS PME appears to integrate with AWS KMS, i.e. the approach I am alluding to above.

@adamreeve (Contributor, Author)

I mean, even better would be for them to use the envelope encryption facilities the cloud/hosting providers themselves provide... My understanding of this PR is that users still have to manage and store the KEKs themselves, handle rotation, etc. Is there an argument that, whilst perhaps better than the low-level interface, it still requires quite a sophisticated user to use it securely?

I don't think this understanding is quite right. Users would be expected to use the encryption facilities provided by their cloud environment or organisation's security team. This module just provides a way to integrate with those facilities while being compatible with other Parquet implementations.

The management and rotation of master keys is the responsibility of the KMS, and ideally the user would only need to implement a very thin wrapper over the KMS API to integrate with this crate.

IMO parquet-rs should provide the minimal hooks to allow people to securely support modular encryption in their environment with the primitives available to them, be they a cloud-based KMS or HSM solution or something else

That is what this module does. I think it's not actually doing as much as you think it is? It really just provides a way to generate random data encryption keys that can then be encrypted in whatever way makes sense in the user's environment, and implements a standardised JSON metadata format for the key material to allow later decrypting those keys and providing compatibility with other Parquet implementations.
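
As a rough illustration of what "a standardised JSON metadata format for the key material" could look like, here is a hedged sketch: the function name is hypothetical, and the field names loosely follow the Parquet key management tools design document rather than the exact format this PR implements.

```rust
// Sketch of assembling key-material metadata for a wrapped data encryption
// key (DEK). Field names are illustrative, not the normative on-disk format;
// the wrapped DEK would be base64-encoded ciphertext returned by the KMS.
fn key_material_json(master_key_id: &str, wrapped_dek_b64: &str) -> String {
    format!(
        "{{\"keyMaterialType\":\"PKMT1\",\"internalStorage\":true,\
         \"masterKeyID\":\"{master_key_id}\",\"wrappedDEK\":\"{wrapped_dek_b64}\"}}"
    )
}

fn main() {
    let json = key_material_json("footer-master-key", "AQIDBA==");
    assert!(json.contains("\"masterKeyID\":\"footer-master-key\""));
    assert!(json.starts_with('{') && json.ends_with('}'));
    println!("{json}");
}
```

Because the metadata identifies the master key rather than containing it, any Parquet implementation with access to the same KMS can later unwrap the DEK and decrypt the file, which is where the cross-implementation compatibility comes from.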

@tustvold (Contributor) commented Apr 9, 2025

I'll try to find time to take a more detailed look at the weekend

Successfully merging this pull request may close these issues.

Support Parquet key management tools
4 participants