Docs for multiple storage backend support #8758

Merged · 25 commits · Mar 6, 2025
Binary file added docs/assets/img/msb/msb_create_repo_ui.png
Binary file added docs/assets/img/msb/msb_repo_settings_ui.png
1 change: 0 additions & 1 deletion docs/enterprise/architecture.md
@@ -34,7 +34,6 @@ it is separated by ports for security reasons.
[6] SSO IdP - Identity provider (e.g. Azure AD, Okta, JumpCloud). Fluffy implements SAML and OAuth2 protocols.


For more details and pricing, please [contact sales](https://lakefs.io/contact-sales/).


4 changes: 3 additions & 1 deletion docs/enterprise/index.md
@@ -34,8 +34,10 @@ With lakeFS Enterprise you’ll receive access to the security package containing

## What additional functionality does lakeFS Enterprise provide?

- 1. [lakeFS Mount]({% link reference/mount.md %}) allows users to virtually mount a remote lakeFS repository onto a local directory. Once mounted, users can access the data as if it resides on their local filesystem, using any tool, library, or framework that reads from a local filesystem.
+ 1. [lakeFS Mount]({% link reference/mount.md %}) - allows users to virtually mount a remote lakeFS repository onto a local directory. Once mounted, users can access the data as if it resides on their local filesystem, using any tool, library, or framework that reads from a local filesystem.
**Contributor:** depends on how we communicate mount - we mount a branch

**Author:** This is outside the scope of this PRD, let's defer it for later?
2. [Transactional Mirroring]({% link howto/mirroring.md %}) - allows replicating lakeFS repositories into consistent read-only copies in remote locations.
3. [Multiple Storage Backends]({% link howto/multiple-storage-backends.md %}) - allows managing data stored across multiple storage locations: on-prem, hybrid, or multi-cloud.


| Feature | OSS | Enterprise |
|------------------------------------------------|-----------|-----------|
Expand Down
320 changes: 320 additions & 0 deletions docs/howto/multiple-storage-backends.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,320 @@
---
title: Multiple Storage Backend
description: How to manage data across multiple storage systems with lakeFS
parent: How-To
---

**Contributor:** Not sure this is the parent

**Author:** I agree. I opened this task to move features to the right place in a separate PR.

# Multi-Storage Backend

lakeFS Enterprise
{: .label .label-purple }

{: .note}
> Multi-storage backend support is only available to licensed [lakeFS Enterprise]({% link enterprise/index.md %}) customers.
> [Contact us](https://info.lakefs.io/thanks-msb) to get started!

{% include toc.html %}

## What is Multi-storage Backend Support?
**Contributor:** Here also, every feature can have "Support" in it. I think we should simply use "Multi-storage Backend".

**Author:** I disagree. I changed the title, but the sentence "What is Multi-storage Backend?" has a different answer than I'd expect for "What is Multi-storage Backend Support?". The latter relates to what it means in lakeFS.

lakeFS multi-storage backend support enables seamless data management across multiple storage systems —
on-premises, across public clouds, or hybrid environments. This capability makes lakeFS a unified data management platform
for all organizational data assets, which is especially critical in AI/ML environments that rely on diverse datasets stored
in multiple locations.

**Contributor (suggested change):** Replace "lakeFS multi-storage backend support enables seamless data management across multiple storage systems" with "Using lakeFS multi-storage backend enables seamless data management across multiple storage systems".

**Author:** Disagree here too.

**Contributor:** Cross-cloud support may be a trap. Obviously it will work, but most use cases will involve paying egress fees to at least one of the cloud providers. Do we want to go there with our customers?

**Author:** Good point. This feature lays the groundwork for supporting cross-cloud in an egress-fee-aware way: I can now manage cross-cloud data from a single platform. Making it more cost-efficient is another (very reasonable) feature to build.

With a multi-storage setup, lakeFS can connect to and manage any combination of supported storage systems, including:
* AWS S3
* Azure Blob
* Google Cloud Storage
* other S3-compatible storage
* local storage

{: .note}
> Multi-storage backend support is available from version v1.51.0 of lakeFS Enterprise.
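Beyond the S3 and Azure entries shown in the configuration examples below, store entries for the remaining types can be sketched by carrying the [single-blockstore settings](../reference/configuration.md#blockstore) over into the `stores` array, as the upgrade guidelines below describe. This is an unverified sketch with placeholder IDs and paths:

```yaml
stores:
  - id: "gcs-example"                                # placeholder ID
    type: "gs"
    gs:
      credentials_file: "/secrets/lakefs-gcs.json"   # assumes the single-blockstore `gs` keys carry over
  - id: "local-example"                              # placeholder ID
    type: "local"
    local:
      path: "~/lakefs/data"                          # assumes the single-blockstore `local` keys carry over
```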

## Use Cases

1. **Distributed Data Management**:
* Eliminate data silos and enable seamless cross-cloud collaboration.
* Maintain version control across different storage providers for consistency and reproducibility.
* Ideal for AI/ML environments where datasets are distributed across multiple storage locations.
**Member:** Is this for SEO purposes? I think removing the AI/ML makes this bullet more inclusive.

**Author:** Not for SEO, but to make it clear who we think our target customers are. I think making this more inclusive is not necessarily a good thing here. I want people to easily understand what's in it for them, and this is one way.

2. **Unified Data Access**:
* Access data across multiple storage backends using a single, consistent [URI format](../understand/model.md#lakefs-protocol-uris), e.g. `lakefs://my-repo/main/datasets/example.parquet`, regardless of where the data physically resides.

3. **Centralized Access Control & Governance**:
* Access permissions and policies can be centrally managed across all connected storage systems using lakeFS [RBAC](../security/rbac.md).
**Contributor:** Not sure we need to specify, but we currently do not provide a mechanism to control access based on storage ID.

**Author:** Thanks. I think we shouldn't specify this, at least not here where we're walking through the high-level use cases. For now, I'd rather not mention it.

* Compliance and security controls remain consistent, regardless of where the data is stored.

## Configuration

To configure your lakeFS server to connect to multiple storage backends, define them under the `blockstores` section in your server configuration.
The `blockstores.stores` field is an array of storage backends, each with its own configuration.

For a complete list of available options, refer to the [server configuration reference](../reference/configuration.md#blockstores).

{: .note}
> **Note:** If you're upgrading from a single-store lakeFS setup, refer to the [upgrade guidelines](#upgrading-from-a-single-storage-backend-to-multiple-storage-backends)
> to ensure a smooth transition.

**Contributor:** I think we need to use the same term "multi-storage" instead of "store" throughout the doc.

**Author:** done

### Example Configurations

<div class="tabs">
**Contributor:** Why use HTML instead of Markdown here?

**Contributor:** "tabs"... this is some CSS thing that renders multiple selectable tabs.

**Author:** This is how we do it in other places in the docs. I don't know of another way to do it and prefer to defer this for later.

**Contributor:** Well, at least in GitHub this doesn't render as tabs... Not sure about the guidelines in our documentation, though, regarding using HTML etc.
<ul>
<li><a href="#on-prem">On-prem</a></li>
<li><a href="#multi-cloud">Multi-cloud</a></li>
<li><a href="#hybrid">Hybrid</a></li>
</ul>

<div markdown="1" id="on-prem">

This example setup configures lakeFS to manage data across two separate MinIO instances:

```yaml
blockstores:
  signing:
    secret_key: "some-secret"
  stores:
    - id: "minio-prod"
      description: "Primary on-prem MinIO storage for production data"
      type: "s3"
      s3:
        force_path_style: true
        endpoint: 'http://minio-prod.local'
        discover_bucket_region: false
        credentials:
          access_key_id: "prod_access_key"
          secret_access_key: "prod_secret_key"
    - id: "minio-backup"
      description: "Backup MinIO storage for disaster recovery"
      type: "s3"
      s3:
        force_path_style: true
        endpoint: 'http://minio-backup.local'
        discover_bucket_region: false
        credentials:
          access_key_id: "backup_access_key"
          secret_access_key: "backup_secret_key"
```

**Collaborator:** indentation is wrong (2 spaces everywhere)

**Author:** done

</div>

<div markdown="1" id="multi-cloud">

This example setup configures lakeFS to manage data across two public cloud providers: AWS and Azure:

```yaml
blockstores:
  signing:
    secret_key: "some-secret"
  stores:
    - id: "s3-prod"
      description: "AWS S3 storage for production data"
      type: "s3"
      s3:
        region: "us-east-1"
    - id: "azure-analytics"
      description: "Azure Blob storage for analytics data"
      type: "azure"
      azure:
        storage_account: "analytics-account"
        storage_access_key: "EXAMPLE45551FSAsVVCXCF"
```
</div>

<div markdown="1" id="hybrid">

This hybrid setup allows lakeFS to manage data across both cloud and on-prem storage.

```yaml
blockstores:
  signing:
    secret_key: "some-secret"
  stores:
    - id: "s3-archive"
      description: "AWS S3 storage for long-term archival"
      type: "s3"
      s3:
        region: "us-west-2"
    - id: "minio-fast-access"
      description: "On-prem MinIO for high-performance workloads"
      type: "s3"
      s3:
        force_path_style: true
        endpoint: 'http://minio.local'
        discover_bucket_region: false
        credentials:
          access_key_id: "minio_access_key"
          secret_access_key: "minio_secret_key"
```
</div>
</div>

### Key Considerations

* **Unique blockstore IDs**: Each storage backend must have a unique `id`.
* **Persistence of blockstore IDs**: Once defined, an `id` must not change.
* **S3 authentication handling**:
  * All standard S3 authentication methods are supported.
  * Every blockstore must be authenticated, so configure a profile or static credentials for each store of type `s3`.
    By default, S3 stores use the AWS credentials chain, which you may be able to rely on for one store (see the sketch below).
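The following is a minimal sketch, modeled on the examples above with placeholder IDs, endpoints, and keys, that mixes both approaches: the first store relies on the default AWS credentials chain, while the second uses static credentials:

```yaml
blockstores:
  signing:
    secret_key: "some-secret"
  stores:
    - id: "s3-main"                    # placeholder ID
      description: "AWS S3, authenticated via the default credentials chain"
      type: "s3"
      s3:
        region: "us-east-1"            # no credentials block: env vars, profile, or IAM role apply
    - id: "minio-local"                # placeholder ID
      description: "On-prem MinIO, authenticated with static credentials"
      type: "s3"
      s3:
        force_path_style: true
        endpoint: 'http://minio.local'
        discover_bucket_region: false
        credentials:
          access_key_id: "example_access_key"
          secret_access_key: "example_secret_key"
```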

{: .warning}
> Changing a storage ID is not supported and may result in unexpected behavior. Ensure IDs remain consistent once configured.

### Upgrading from a Single Storage Backend to Multiple Storage Backends

When upgrading from a single storage backend to a multi-storage setup, follow these guidelines (a configuration sketch follows below):
* Use the new `blockstores` structure, **replacing** the existing `blockstore` configuration. Note that `blockstore` and `blockstores`
configurations are mutually exclusive - lakeFS does not support both simultaneously.
* Define all previously available [single-blockstore settings](../reference/configuration.md#blockstore) under their respective storage backends.
* The `signing.secret_key` is a required setting, global to all connected stores.
* Set `backward_compatible: true` for the existing storage backend to ensure:
  * Existing repositories continue to use the original storage backend.
  * Newly created repositories default to this backend unless explicitly assigned a different one, ensuring a non-breaking upgrade process.
  * **This setting is mandatory** — lakeFS will not function if it is unset.
  * **Do not remove this setting** as long as you need to support repositories created before the upgrade.
    If removed, lakeFS will fail to start because it will treat existing repositories as disconnected from any configured storage.

**Member:** I feel like we need to be much more explicit here about the fact that things will not work correctly without it and that it cannot be removed.

**Author:** Good call, I added your points.
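For illustration, here is a minimal before/after sketch of the upgrade, assuming the original setup used a single S3 blockstore and that `backward_compatible` is set on the store entry, as the guidelines above describe; all IDs, regions, and keys are placeholders:

```yaml
# Before: single storage backend
# blockstore:
#   type: s3
#   s3:
#     region: us-east-1

# After: `blockstores` replaces `blockstore` entirely
blockstores:
  signing:
    secret_key: "some-secret"          # required, global to all connected stores
  stores:
    - id: "s3-original"                # placeholder; must never change once set
      description: "Pre-upgrade S3 storage"
      type: "s3"
      backward_compatible: true        # existing repos keep resolving to this store
      s3:
        region: "us-east-1"
    - id: "minio-new"                  # placeholder for a store added after the upgrade
      description: "Newly added on-prem MinIO"
      type: "s3"
      s3:
        force_path_style: true
        endpoint: 'http://minio.local'
        discover_bucket_region: false
        credentials:
          access_key_id: "minio_access_key"
          secret_access_key: "minio_secret_key"
```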

### Adding or Removing a Storage Backend

To add a storage backend, update the server configuration with the new storage entry and restart the server, for example:
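As a sketch, appending a hypothetical third store to the on-prem example above (all values are placeholders):

```yaml
blockstores:
  signing:
    secret_key: "some-secret"
  stores:
    # ...existing "minio-prod" and "minio-backup" entries stay unchanged...
    - id: "minio-staging"              # placeholder for the new backend
      description: "Newly added MinIO instance for staging data"
      type: "s3"
      s3:
        force_path_style: true
        endpoint: 'http://minio-staging.local'
        discover_bucket_region: false
        credentials:
          access_key_id: "staging_access_key"
          secret_access_key: "staging_secret_key"
```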

To remove a storage backend:
* Delete all repositories associated with the storage backend.
* Remove the storage entry from the configuration.
* Restart the server.

{: .warning}
> lakeFS will fail to start if there are repositories defined on a removed storage. Ensure all necessary cleanup is completed before removing a storage backend.

### Listing Connected Storage Backends

The [Get Config](https://docs.lakefs.io/reference/api.html#/config/getConfig) API endpoint returns a list of storage
configurations. In multi-storage setups, this is the recommended method to list connected storage backends and view their details.

### Troubleshooting

| Issue | Cause | Solution |
|---------------------------------------------------------------------|-------|----------------------------------------------------------|
| Blockstore ID conflicts | Duplicate `id` values in `stores` | Ensure each storage backend has a unique ID |
| Missing `backward_compatible` | Upgrade from single to multi-storage without setting the flag | Add `backward_compatible: true` for the existing storage |
| Unsupported configurations in OSS or unlicensed Enterprise accounts | Using multi-storage features in an unsupported setup | Contact us to start using the feature |

## Working with Repositories

This section explains how to work with your connected storage backends once lakeFS Enterprise has been configured to use them.

With multiple storage backends configured, each lakeFS repository is linked to a specific storage backend. Together with
the repository's [storage namespace](../understand/model.md#concepts-unique-to-lakefs), this defines the exact location in
the underlying storage where the repository's data is stored.

The choice of storage backend impacts the following lakeFS operations:

### Creating a Repository

In a multi-storage setup, users must specify a storage ID when creating a repository. This can be done using the following methods:

<div class="tabs">
<ul>
<li><a href="#ui">UI</a></li>
<li><a href="#cli">CLI</a></li>
<li><a href="#api">API</a></li>
<li><a href="#hl-sdk">High-level Python SDK</a></li>
</ul>

<div markdown="1" id="ui">

Select a storage backend from the dropdown menu.
![create repo with storage id](../assets/img/msb/msb_create_repo_ui.png)

</div>

<div markdown="1" id="cli">

Use the `--storage-id` flag with the [repo create](../reference/cli.md#lakectl-repo-create) command:

```bash
lakectl repo create lakefs://my-repo s3://my-bucket --storage-id my-storage
```

**Note**: The `--storage-id` flag is currently hidden in the CLI.

</div>

<div markdown="1" id="api">

Use the `storage_id` parameter in the [Create Repository endpoint](../reference/api.md#/repositories/createRepository).

</div>

<div markdown="1" id="hl-sdk">

Starting from version 0.9.0 of the [High-level Python SDK](https://docs.lakefs.io/integrations/python.html#using-the-lakefs-sdk),
you can use `kwargs` to pass `storage_id` dynamically when calling the [create repository method](https://pydocs-lakefs.lakefs.io/lakefs.repository.html#lakefs.repository.Repository.create):

```python
import lakefs

repo = lakefs.Repository("example-repo").create(storage_namespace="s3://storage-bucket/repos/example-repo", storage_id="my-storage-id")
```
</div>

</div>

**Important notes:**
* In multi-storage setups where a storage backend is marked as `backward_compatible: true`, repository creation requests
without a storage ID will default to this storage.
* If no storage backend is marked as `backward_compatible`, repository creation requests without a storage ID will fail.
* Each repository is linked to a single backend and stores data within a single storage namespace on that backend.

### Viewing Repository Details

To check which storage backend is associated with a repository:

<div class="tabs">
<ul>
<li><a href="#ui">UI</a></li>
<li><a href="#api">API</a></li>
</ul>

<div markdown="1" id="ui">

The storage ID is displayed under "Storage" in the repository settings page.
![repo settings](../assets/img/msb/msb_repo_settings_ui.png)

</div>

<div markdown="1" id="api">

Use the [List Repositories](../reference/api.md#/repositories/listRepositories) endpoint. Its response includes the storage ID.

</div>
</div>

### Importing Data into a Repository

Importing data into a repository is supported when the credentials used for the repository's backing blockstore allow
read and list access to the storage location.

## Limitations

### Supported storages

Multi-storage backend support has been validated on:
* Self-managed S3-compatible object storage (MinIO)
* Amazon S3
* Local storage

{: .note}
> **Note:** Other storage backends may work but have not been officially tested. If you're interested in exploring
> additional configurations, please [contact us](https://info.lakefs.io/thanks-msb).

### Unsupported clients

The following clients do not currently support working with multiple storage backends. However, we are actively working
to bridge this gap:
* [Spark-based GC](../howto/garbage-collection/index.md)
* [Spark client](../reference/spark-client.md)
* [lakeFS Hadoop FileSystem](../integrations/spark.md#lakefs-hadoop-filesystem)
* [Everest](../reference/mount.md)