
Conversation

@SergeiPatiakin (Contributor):

This PR adds Iceberg support with the pg-to-iceberg and parquet-to-iceberg subcommands.

Testing

# Run lakehouse-loader pg-to-iceberg
AWS_ALLOW_HTTP=true AWS_ENDPOINT=http://localhost:9000 \
AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin \
PGPASSWORD=test-password \
cargo run pg-to-iceberg \
  --query "select cint4, cint8, ctext, cbool from t1" \
  "postgres://test-user@localhost:5432/test-db" \
  "s3://lhl-test-bucket/default/t1"

# Ensure you have a recent build of Seafowl with Iceberg support
# Start the Seafowl CLI
cargo run -- --cli

-- Register the external table
CREATE EXTERNAL TABLE t1 STORED AS ICEBERG OPTIONS ('s3.access-key-id' 'minioadmin', 's3.secret-access-key' 'minioadmin', 's3.endpoint' 'http://127.0.0.1:9000', 'allow_http' 'true') LOCATION 's3://lhl-test-bucket/default/t1/metadata/v1.metadata.json';

-- Verify you can read data from Iceberg
SELECT * FROM staging.t1 LIMIT 10;
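
The PR also adds a parquet-to-iceberg subcommand. A plausible invocation by analogy with pg-to-iceberg above, assuming it takes a local Parquet source and a target URL (the arguments here are an assumption, not taken from the PR):

# Hypothetical: argument shape assumed by analogy with pg-to-iceberg
AWS_ALLOW_HTTP=true AWS_ENDPOINT=http://localhost:9000 \
AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin \
cargo run parquet-to-iceberg source.parquet "s3://lhl-test-bucket/default/t2"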

@SergeiPatiakin marked this pull request as ready for review on November 25, 2024, 16:18.

use crate::error::DataLoadingError;

pub async fn record_batches_to_iceberg(

Reviewer:

This function is almost 200 lines long and handles multiple tasks. You could break it into something like:

let file_io = setup_file_io(...)
let iceberg_schema = assign_field_ids(...)
let metadata_v0 = prepare_metadata(...)
let data_files = write_record_batches(...)
let manifest_file = write_manifest(...)
write_snapshot(manifest_file, metadata_v0, target_url, file_io)

Also, some values are repeated throughout, so you could store them in constants:

const DEFAULT_SCHEMA_ID: i32 = 0;
const METADATA_PATH: &str = "metadata";
const PARTITION_FILENAME: &str = "part";
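
For concreteness, a hypothetical skeleton of this decomposition. The helper names follow the sketch above; all signatures, parameter lists, and types are illustrative assumptions, not lakehouse-loader's actual code:

// Hypothetical decomposition sketch, not the PR's actual implementation.
use arrow::datatypes::Schema as ArrowSchema;
use arrow::record_batch::RecordBatch;
use iceberg::io::FileIO;
use iceberg::spec::{DataFile, ManifestFile, Schema, TableMetadata};

use crate::error::DataLoadingError;

const DEFAULT_SCHEMA_ID: i32 = 0;
const METADATA_PATH: &str = "metadata";
const PARTITION_FILENAME: &str = "part";

fn setup_file_io(target_url: &str) -> Result<FileIO, DataLoadingError> {
    todo!("build a FileIO for the s3:// target")
}

fn assign_field_ids(arrow_schema: &ArrowSchema) -> Result<Schema, DataLoadingError> {
    todo!("convert the Arrow schema, assigning stable Iceberg field ids")
}

fn prepare_metadata(schema: &Schema, target_url: &str) -> Result<TableMetadata, DataLoadingError> {
    todo!("build initial table metadata using DEFAULT_SCHEMA_ID")
}

async fn write_record_batches(
    batches: Vec<RecordBatch>,
    schema: &Schema,
    file_io: &FileIO,
) -> Result<Vec<DataFile>, DataLoadingError> {
    todo!("write PARTITION_FILENAME-prefixed Parquet files, return DataFile entries")
}

async fn write_manifest(
    data_files: Vec<DataFile>,
    metadata: &TableMetadata,
    file_io: &FileIO,
) -> Result<ManifestFile, DataLoadingError> {
    todo!("write a manifest listing the new data files")
}

async fn write_snapshot(
    manifest_file: ManifestFile,
    metadata: TableMetadata,
    target_url: &str,
    file_io: &FileIO,
) -> Result<(), DataLoadingError> {
    todo!("commit a snapshot and write metadata/v1.metadata.json")
}

// The orchestrator then reads as a short pipeline:
pub async fn record_batches_to_iceberg(
    batches: Vec<RecordBatch>,
    arrow_schema: &ArrowSchema,
    target_url: &str,
) -> Result<(), DataLoadingError> {
    let file_io = setup_file_io(target_url)?;
    let iceberg_schema = assign_field_ids(arrow_schema)?;
    let metadata_v0 = prepare_metadata(&iceberg_schema, target_url)?;
    let data_files = write_record_batches(batches, &iceberg_schema, &file_io).await?;
    let manifest_file = write_manifest(data_files, &metadata_v0, &file_io).await?;
    write_snapshot(manifest_file, metadata_v0, target_url, &file_io).await?;
    Ok(())
}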

@SergeiPatiakin (Contributor, Author):

OK, I've now moved the more self-contained logic out of record_batches_to_iceberg. The remaining logic could be split up further, but right now I have low confidence that we know the right way to do it. It will become clearer in the future as we learn more about Iceberg and which Iceberg-related features we want to build.

let iceberg_schema = iceberg::arrow::arrow_schema_to_schema(&cloned_arrow_schema)?;

let table_creation = TableCreation::builder()
.name("dummy_name".to_string()) // Required by TableCreationBuilder. Doesn't affect output
Reviewer (Contributor):

Presumably this name is stored in the table metadata? If so, how about using something that conveys at least some information (e.g. signifying that the table was created by lakehouse-loader, a timestamp, or the source), or adding a name param to the CLI?

@SergeiPatiakin (Contributor, Author):

No, it doesn't get persisted in the table metadata or anywhere else.
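
For context, a minimal sketch of why the name is throwaway, assuming the iceberg crate's TableCreation / TableMetadataBuilder API (exact builder signatures vary by release; target_url and iceberg_schema are placeholders):

use iceberg::spec::TableMetadataBuilder;
use iceberg::TableCreation;

let table_creation = TableCreation::builder()
    .name("dummy_name".to_string()) // consumed by the builder only
    .location(target_url.to_string())
    .schema(iceberg_schema)
    .build();
// The Iceberg table-metadata file format has no table-name field (names live
// in the catalog), so "dummy_name" never appears in v1.metadata.json.
let metadata = TableMetadataBuilder::from_table_creation(table_creation)?.build()?;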


while let Some(maybe_batch) = record_batch_stream.next().await {
    let batch = maybe_batch?;
    file_writer.write(&batch).await?;
}
Reviewer (Contributor):

I think this results in one or more row groups being buffered in memory prior to flushing, which may (or may not) pose memory usage issues. If it turns out it does, we'd need to customize it, like we do with Delta (temp files + multipart upload), but as a starting point I think this is fine.

@SergeiPatiakin (Contributor, Author) commented on Nov 26, 2024:

> I think this results in one or more row groups being buffered in memory

You're right, it could be one or more row groups, but it should be bounded, if I'm interpreting this comment correctly: https://github.com/apache/arrow-rs/blob/ccd18f203116e87675ea4ddaf5ddc1b4f986ad14/parquet/src/arrow/async_writer/mod.rs#L211-L214

Therefore I don't expect memory usage issues.
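
If the buffering ever did need tightening, one available knob is the row group size. A sketch, assuming a recent arrow-rs where AsyncArrowWriter::try_new takes (writer, schema, props); output and arrow_schema stand in for whatever the loader actually passes:

use parquet::arrow::AsyncArrowWriter;
use parquet::file::properties::WriterProperties;

// Cap rows per row group; the writer flushes each completed row group to the
// sink, so roughly one row group's worth of data is buffered at a time.
let props = WriterProperties::builder()
    .set_max_row_group_size(64 * 1024)
    .build();
let mut file_writer = AsyncArrowWriter::try_new(output, arrow_schema.clone(), Some(props))?;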

@SergeiPatiakin merged commit 41c2631 into main on Nov 26, 2024. 1 check passed.