
Conversation

@SergeiPatiakin (Contributor):

This PR adds Iceberg support with the pg-to-iceberg and parquet-to-iceberg subcommands.

Testing

# Run lakehouse-loader pg-to-iceberg
AWS_ALLOW_HTTP=true AWS_ENDPOINT=http://localhost:9000 \
AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin \
PGPASSWORD=test-password \
cargo run pg-to-iceberg \
  --query "select cint4, cint8, ctext, cbool from t1" \
  "postgres://test-user@localhost:5432/test-db" \
  "s3://lhl-test-bucket/default/t1"

# Ensure you have a recent build of Seafowl with Iceberg support
# Start the Seafowl CLI
cargo run -- --cli

-- Register the external table
CREATE EXTERNAL TABLE t1 STORED AS ICEBERG OPTIONS ('s3.access-key-id' 'minioadmin', 's3.secret-access-key' 'minioadmin', 's3.endpoint' 'http://127.0.0.1:9000', 'allow_http' 'true') LOCATION 's3://lhl-test-bucket/default/t1/metadata/v1.metadata.json';

-- Verify you can read data from Iceberg
SELECT * FROM staging.t1 LIMIT 10;
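
The PR also adds a parquet-to-iceberg subcommand. A plausible invocation by analogy with pg-to-iceberg above, assuming it takes a local Parquet source and a target URL (the arguments here are an assumption, not taken from the PR):

# Hypothetical: argument shape assumed by analogy with pg-to-iceberg
AWS_ALLOW_HTTP=true AWS_ENDPOINT=http://localhost:9000 \
AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin \
cargo run parquet-to-iceberg source.parquet "s3://lhl-test-bucket/default/t2"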

@SergeiPatiakin marked this pull request as ready for review on November 25, 2024, 16:18.

use crate::error::DataLoadingError;

pub async fn record_batches_to_iceberg(

Reviewer:

This function is almost 200 lines long and handles multiple tasks. You could break it into something like:

let file_io = setup_file_io(...)
let iceberg_schema = assign_field_ids(...)
let metadata_v0 = prepare_metadata(...)
let data_files = write_record_batches(...)
let manifest_file = write_manifest(...)
write_snapshot(manifest_file, metadata_v0, target_url, file_io)

Also, some values are repeated throughout, so you could store them in constants:

const DEFAULT_SCHEMA_ID: i32 = 0;
const METADATA_PATH: &str = "metadata";
const PARTITION_FILENAME: &str = "part";
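
For concreteness, a hypothetical skeleton of this decomposition. The helper names follow the sketch above; all signatures, parameter lists, and types are illustrative assumptions, not lakehouse-loader's actual code:

// Hypothetical decomposition sketch, not the PR's actual implementation.
use arrow::datatypes::Schema as ArrowSchema;
use arrow::record_batch::RecordBatch;
use iceberg::io::FileIO;
use iceberg::spec::{DataFile, ManifestFile, Schema, TableMetadata};

use crate::error::DataLoadingError;

const DEFAULT_SCHEMA_ID: i32 = 0;
const METADATA_PATH: &str = "metadata";
const PARTITION_FILENAME: &str = "part";

fn setup_file_io(target_url: &str) -> Result<FileIO, DataLoadingError> {
    todo!("build a FileIO for the s3:// target")
}

fn assign_field_ids(arrow_schema: &ArrowSchema) -> Result<Schema, DataLoadingError> {
    todo!("convert the Arrow schema, assigning stable Iceberg field ids")
}

fn prepare_metadata(schema: &Schema, target_url: &str) -> Result<TableMetadata, DataLoadingError> {
    todo!("build initial table metadata using DEFAULT_SCHEMA_ID")
}

async fn write_record_batches(
    batches: Vec<RecordBatch>,
    schema: &Schema,
    file_io: &FileIO,
) -> Result<Vec<DataFile>, DataLoadingError> {
    todo!("write PARTITION_FILENAME-prefixed Parquet files, return DataFile entries")
}

async fn write_manifest(
    data_files: Vec<DataFile>,
    metadata: &TableMetadata,
    file_io: &FileIO,
) -> Result<ManifestFile, DataLoadingError> {
    todo!("write a manifest listing the new data files")
}

async fn write_snapshot(
    manifest_file: ManifestFile,
    metadata: TableMetadata,
    target_url: &str,
    file_io: &FileIO,
) -> Result<(), DataLoadingError> {
    todo!("commit a snapshot and write metadata/v1.metadata.json")
}

// The orchestrator then reads as a short pipeline:
pub async fn record_batches_to_iceberg(
    batches: Vec<RecordBatch>,
    arrow_schema: &ArrowSchema,
    target_url: &str,
) -> Result<(), DataLoadingError> {
    let file_io = setup_file_io(target_url)?;
    let iceberg_schema = assign_field_ids(arrow_schema)?;
    let metadata_v0 = prepare_metadata(&iceberg_schema, target_url)?;
    let data_files = write_record_batches(batches, &iceberg_schema, &file_io).await?;
    let manifest_file = write_manifest(data_files, &metadata_v0, &file_io).await?;
    write_snapshot(manifest_file, metadata_v0, target_url, &file_io).await?;
    Ok(())
}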

@SergeiPatiakin (Contributor, Author):

OK, I've now moved the more self-contained logic out of record_batches_to_iceberg. The remaining logic could be split up further, but right now I have low confidence that we know the right way to do it. It will become clearer in the future as we learn more about Iceberg and which Iceberg-related features we want to build.

let iceberg_schema = iceberg::arrow::arrow_schema_to_schema(&cloned_arrow_schema)?;

let table_creation = TableCreation::builder()
.name("dummy_name".to_string()) // Required by TableCreationBuilder. Doesn't affect output
Reviewer (Contributor):

Presumably this name is stored in the table metadata? If so, how about using something that conveys at least some information (e.g. signifying that the table was created by lakehouse-loader, a timestamp, or the source), or adding a name param to the CLI?

@SergeiPatiakin (Contributor, Author):

No, it doesn't get persisted in the table metadata or anywhere else.
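
For context, a minimal sketch of why the name is throwaway, assuming the iceberg crate's TableCreation / TableMetadataBuilder API (exact builder signatures vary by release; target_url and iceberg_schema are placeholders):

use iceberg::spec::TableMetadataBuilder;
use iceberg::TableCreation;

let table_creation = TableCreation::builder()
    .name("dummy_name".to_string()) // consumed by the builder only
    .location(target_url.to_string())
    .schema(iceberg_schema)
    .build();
// The Iceberg table-metadata file format has no table-name field (names live
// in the catalog), so "dummy_name" never appears in v1.metadata.json.
let metadata = TableMetadataBuilder::from_table_creation(table_creation)?.build()?;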


while let Some(maybe_batch) = record_batch_stream.next().await {
    let batch = maybe_batch?;
    file_writer.write(&batch).await?;
}
Reviewer (Contributor):

I think this results in one or more row groups being buffered in memory prior to flushing, which may (or may not) pose memory usage issues. If it turns out it does, we'd need to customize it, like we do with Delta (temp files + multipart upload), but as a starting point I think this is fine.

@SergeiPatiakin (Contributor, Author) commented on Nov 26, 2024:

> I think this results in one or more row groups being buffered in memory

You're right, it could be one or more row groups, but it should be bounded, if I'm interpreting this comment correctly: https://github.com/apache/arrow-rs/blob/ccd18f203116e87675ea4ddaf5ddc1b4f986ad14/parquet/src/arrow/async_writer/mod.rs#L211-L214

Therefore I don't expect memory usage issues.
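
If the buffering ever did need tightening, one available knob is the row group size. A sketch, assuming a recent arrow-rs where AsyncArrowWriter::try_new takes (writer, schema, props); output and arrow_schema stand in for whatever the loader actually passes:

use parquet::arrow::AsyncArrowWriter;
use parquet::file::properties::WriterProperties;

// Cap rows per row group; the writer flushes each completed row group to the
// sink, so roughly one row group's worth of data is buffered at a time.
let props = WriterProperties::builder()
    .set_max_row_group_size(64 * 1024)
    .build();
let mut file_writer = AsyncArrowWriter::try_new(output, arrow_schema.clone(), Some(props))?;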

@SergeiPatiakin merged commit 41c2631 into main on Nov 26, 2024. 1 check passed.