Add Key Management Tools API for Parquet encryption #7387


Open
adamreeve wants to merge 37 commits into main

Conversation

@adamreeve (Contributor) commented Apr 5, 2025

Which issue does this PR close?

Closes #7256

What changes are included in this PR?

Adds a new key_management module that allows generating Parquet encryption and decryption properties that integrate with a Key Management Server (KMS).

This module is enabled by a new key_management feature.

The implementation is fairly closely based on the C++ Arrow library behaviour (see https://github.com/apache/arrow/blob/main/cpp/src/parquet/encryption/crypto_factory.h as a starting point for reference).

There's also a design document on key management tools which is another useful reference.

One feature not included is the ability to use external key material to handle rotation of master encryption keys in the KMS. That could be added later.
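
To illustrate the general shape of such a KMS integration (this is a hedged sketch, not the module's actual API: the trait and type names here are hypothetical, and the XOR "cipher" is a stand-in for a real KMS call, not cryptography):

```rust
// Hypothetical sketch of a KMS client interface for envelope encryption.
// A real implementation would call out to a KMS (Vault, a cloud KMS, ...)
// to wrap/unwrap keys; the XOR below is a toy placeholder for illustration.
trait KmsClient {
    /// Encrypt (wrap) a data encryption key under a master key held by the KMS.
    fn wrap_key(&self, key_bytes: &[u8], master_key_id: &str) -> Vec<u8>;
    /// Decrypt (unwrap) a previously wrapped key.
    fn unwrap_key(&self, wrapped: &[u8], master_key_id: &str) -> Vec<u8>;
}

struct ToyKms;

impl KmsClient for ToyKms {
    fn wrap_key(&self, key_bytes: &[u8], master_key_id: &str) -> Vec<u8> {
        // NOT real cryptography: XOR with a key-id-derived byte, demo only.
        let pad = master_key_id.bytes().fold(0u8, |acc, b| acc ^ b);
        key_bytes.iter().map(|b| b ^ pad).collect()
    }
    fn unwrap_key(&self, wrapped: &[u8], master_key_id: &str) -> Vec<u8> {
        // XOR is its own inverse, so unwrapping is the same operation.
        self.wrap_key(wrapped, master_key_id)
    }
}

fn main() {
    let kms = ToyKms;
    let dek = vec![1u8, 2, 3, 4]; // data encryption key (normally random)
    let wrapped = kms.wrap_key(&dek, "footer-master-key");
    assert_eq!(kms.unwrap_key(&wrapped, "footer-master-key"), dek);
    println!("round-trip ok");
}
```

The point of such an interface is that users plug in a thin client for whatever KMS their environment provides, while the crate handles generating data encryption keys and producing compatible Parquet encryption properties.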

Are there any user-facing changes?

Yes, this adds new user-facing functionality (within an experimental module).

adamreeve added 30 commits April 5, 2025 11:19
This should come from the lower level API once it is available there,
rather than being defined in the KMT API
@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 5, 2025
@adamreeve (Contributor, Author)

I had implemented some integration tests with PyArrow but have removed them from this PR as it is already very big (0fb5372). I can add those in a follow up if people think they're worthwhile, although they add a bit of complexity to the CI for this one feature. Maybe there's a better way to implement them.

@adamreeve (Contributor, Author)

@ggershinsky could you please take a look at this?

@ggershinsky (Contributor)

Hi @adamreeve, certainly.

@ggershinsky (Contributor)

I'll be adding comments as I go through the code. The review might take a week or so, but I'll let you know when I'm finished with this round.

@ggershinsky (Contributor)

This PR might need comprehensive unit tests for the different encryption options, similar to https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/crypto/TestPropertiesDrivenEncryption.java .

Also, it would be good to test interop with Spark or PyArrow, where files are written by these frameworks and read by the Rust code (probably manually, as we don't have these files in the parquet-testing repo, but still, an initial/partial check would be helpful).

@ggershinsky (Contributor)

Ok, I'm finished with the review round, adding some comments above. Overall, looks very good.

@adamreeve (Contributor, Author)

Thanks for the review @ggershinsky, I will try to address all your comments soon.

Also, it would be good to test interop with Spark or PyArrow, where files are written by these frameworks and read by the Rust code (probably manually, as we don't have these files in the parquet-testing repo, but still, an initial/partial check would be helpful).

Yes, I did write some tests that wrote files with Python and read them in Rust, and also wrote files with Rust and read them in Python, but I left them out of this PR to keep the size down, as they were a bit complicated. I will make a follow-up PR to add these later.

@tustvold (Contributor) commented Apr 9, 2025

I'm afraid I've not really been following this effort closely, and so I may be missing something, but I would have thought this would need to be async to accommodate external stores.

Taking a step back though, I wonder if this makes sense to include in the parquet crate proper, or if it could be some third-party crate. Is there some way we could add the necessary hooks to parquet-rs, if they don't already exist, and have this be an external project? I suspect arrow-cpp bundles all this largely for packaging reasons that don't apply to arrow-rs.

I say this for a few reasons

  • Supporting external KMS providers is a monumental undertaking
  • There is very limited arrow-rs review budget
  • Key management is hard and needs to be done and reviewed with care

Basically I'm a little concerned that the complexity and risks involved in this outstrip the arrow-rs project's ability to effectively review it...

@alamb (Contributor) commented Apr 9, 2025

In my opinion, features that cannot be implemented outside the crate (such as support for actually encrypting/decrypting Parquet during encode/decode) clearly belong in the crate.

Things that can be implemented entirely outside the crate (such as what this PR appears to be proposing) we should be much more careful about accepting (as @tustvold says), because we don't have infinite capacity to maintain features. Keeping the crates focused on arrow/parquet is part of how we'll be able to keep maintaining them.

The fact that other language implementations seem to have key management libraries is interesting. I am not much of an expert in how keys are normally managed.

@alamb (Contributor) commented Apr 9, 2025

BTW, the recent assistance in updating the Rust implementation of Parquet to get closer to parity with the Java and C/C++ implementations is much appreciated.

@adamreeve (Contributor, Author) commented Apr 9, 2025

I'm afraid I've not really been following this effort closely, and so I may be missing something, but I would have thought this would need to be async to accommodate external stores.

This is a good point, but I don't think that's really feasible; it would require changing all code paths where encryption is used to be async, which I expect would be a very large and breaking change. Providing a non-async API doesn't prevent you from communicating with external stores, as you can always use block_on, but I can see how some users might prefer to avoid that.
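
To make the point concrete, here is a std-only sketch (all names hypothetical) of how a synchronous key-retrieval function can still talk to a remote store by simply blocking the calling thread, which is effectively what wrapping an async KMS client in a runtime's block_on would do:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Stand-in for a network round trip to a KMS; in real code this would be
// an async HTTP call driven by block_on, rather than a spawned thread.
fn fetch_key_from_remote_kms(key_id: &str) -> Vec<u8> {
    let key_id = key_id.to_owned();
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(10)); // simulated network latency
        tx.send(format!("key-bytes-for-{key_id}").into_bytes()).unwrap();
    });
    // The synchronous caller blocks until the "remote" call completes.
    rx.recv().unwrap()
}

fn main() {
    let key = fetch_key_from_remote_kms("footer-key");
    assert_eq!(key, b"key-bytes-for-footer-key".to_vec());
}
```

The trade-off is exactly the one raised above: the thread is tied up for the duration of the KMS round trip, which some async-heavy applications may prefer to avoid.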

I wonder if this makes sense to include in the parquet crate proper, or if it could be some third-party crate.

Yes, this could easily be a third-party crate. My concern with that is that it would be less discoverable, and users would assume that if they want Parquet encryption they should use the existing lower-level API where encryption keys are specified directly. But I believe users should be directed to this KMS-based API where possible, to push them towards better security practices.

Supporting external KMS providers is a monumental undertaking

I don't think any KMS specific clients should be included in the Parquet crate, those should definitely be third party crates. Arrow C++ and PyArrow don't include any KMS client implementations either, besides an example Vault client that is clearly documented as just being an example and not for production use.

@ggershinsky (Contributor)

A side comment: adding this layer will help arrow-rs readers to open files written by AWS PME, Spark PME or PyArrow/ArrowC++. They all package a similar mechanism.
Also, the key management libraries are indeed not included here; instead, an interface is provided for working with basically any KMS.

@tustvold (Contributor) commented Apr 9, 2025

This is a good point, but I don't think that's really feasible; it would require changing all code paths where encryption is used to be async, which I expect would be a very large and breaking change.

I can't help feeling this is something we are going to need eventually, and we should probably work out how it would work... If it means the sync APIs only support encryption with static keys, then maybe that is fine...

But I believe users should be directed to this KMS-based API where possible, to push them towards better security practices

I mean, even better would be for them to use the envelope encryption facilities the cloud/hosting providers themselves provide... My understanding of this PR is that users still have to manage and store the KEKs themselves, handle rotation, etc. Is there an argument that, whilst perhaps better than the low-level interface, it still requires quite a sophisticated user to use it securely?

IMO parquet-rs should provide the minimal hooks to allow people to securely support modular encryption in their environment with the primitives available to them, be they a cloud-based KMS or HSM solution or something else. I understand the thinking of providing an out of the box key management toolkit, especially as a way to dog-food these interfaces, but I worry about being able to maintain what is a piece of complex and security critical code.

Adding this layer will help arrow-rs readers to open files written by AWS PME, Spark PME or PyArrow/ArrowC++. They all package a similar mechanism.

At least AWS PME appears to integrate with AWS KMS, i.e. the approach I am alluding to above.

@adamreeve (Contributor, Author)

I mean, even better would be for them to use the envelope encryption facilities the cloud/hosting providers themselves provide... My understanding of this PR is that users still have to manage and store the KEKs themselves, handle rotation, etc. Is there an argument that, whilst perhaps better than the low-level interface, it still requires quite a sophisticated user to use it securely?

I don't think this understanding is quite right. Users would be expected to use the encryption facilities provided by their cloud environment or organisation's security team. This module just provides a way to integrate with those facilities while being compatible with other Parquet implementations.

The management and rotation of master keys is the responsibility of the KMS, and ideally the user would only need to implement a very thin wrapper over the KMS API to integrate with this crate.

IMO parquet-rs should provide the minimal hooks to allow people to securely support modular encryption in their environment with the primitives available to them, be they a cloud-based KMS or HSM solution or something else

That is what this module does. I think it's not actually doing as much as you think it is? It really just provides a way to generate random data encryption keys that can then be encrypted in whatever way makes sense in the user's environment, and implements a standardised JSON metadata format for the key material to allow later decrypting those keys and providing compatibility with other Parquet implementations.
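
As a rough illustration of what "a standardised JSON metadata format for the key material" could look like, here is a hedged sketch: the function name is hypothetical, and the field names loosely follow the Parquet key management tools design document rather than the exact format this PR implements.

```rust
// Sketch of assembling key-material metadata for a wrapped data encryption
// key (DEK). Field names are illustrative, not the normative on-disk format;
// the wrapped DEK would be base64-encoded ciphertext returned by the KMS.
fn key_material_json(master_key_id: &str, wrapped_dek_b64: &str) -> String {
    format!(
        "{{\"keyMaterialType\":\"PKMT1\",\"internalStorage\":true,\
         \"masterKeyID\":\"{master_key_id}\",\"wrappedDEK\":\"{wrapped_dek_b64}\"}}"
    )
}

fn main() {
    let json = key_material_json("footer-master-key", "AQIDBA==");
    assert!(json.contains("\"masterKeyID\":\"footer-master-key\""));
    assert!(json.starts_with('{') && json.ends_with('}'));
    println!("{json}");
}
```

Because the metadata identifies the master key rather than containing it, any Parquet implementation with access to the same KMS can later unwrap the DEK and decrypt the file, which is where the cross-implementation compatibility comes from.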

@tustvold (Contributor) commented Apr 9, 2025

I'll try to find time to take a more detailed look at the weekend

Successfully merging this pull request may close these issues.

Support Parquet key management tools
4 participants