
feat: IBM watsonx.data Destination connector #428

Merged
mpolomdeepsense merged 55 commits into main from feat/ibm-watsonx-data-connector
Mar 25, 2025

Conversation

@mpolomdeepsense
Contributor

No description provided.

@mpolomdeepsense changed the title from "feat: IBM watsonx.data Uploader connector" to "feat: IBM watsonx.data Destination connector" on Mar 18, 2025
Contributor

@nikpocuca left a comment


very nice

@potter-potter
Contributor

I'll check this out later today.

@mpolomdeepsense
Contributor Author

mpolomdeepsense commented Mar 20, 2025

@rbiseck3

Is ibm_watson sufficient? I feel like ibm_watson_data is overly verbose and redundant.

Changed to ibm_watsonx

We should also drop all of the dependency updates from this PR and isolate it to just the new watson file added.

Changed, only updating ibm-watsonx and test dependencies

I'm assuming this new connector wasn't registered; when I run the CLI, I don't see watson listed under the destination connectors.

Not sure why you are not seeing it; I'm able to make CLI calls. I registered it inside setup.py and the connectors' __init__.py

Comment thread unstructured_ingest/v2/processes/connectors/ibm_watsonx.py
"uri": self.iceberg_url,
"token": bearer_token,
"warehouse": self.catalog,
"s3.endpoint": self.object_storage_url,
Contributor


Is ibm watson always backed by s3? Is this something we want configurable for other blob stores from other cloud providers if that's supported?

Contributor Author


I think you can make it use different types of storage, but from what I understand, the specific example they sent us, and the one they need an implementation for, uses S3.

Contributor


Then we should follow the same pattern we have for other connectors that can be backed by various blob stores: give this one an s3 suffix and create a method, i.e. def get_catalog_configs() -> dict, that gets called to generate this on a base class. For now we don't need to introduce the base class, but at least add the method so that if we introduce other blob stores, we can easily support them. Take a look at Databricks Volumes for an example of a connector supporting multiple blob stores.
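A minimal sketch of the pattern being suggested, assuming hypothetical config classes (the class and field names here are illustrative, not the actual connector code): a base config builds the blob-store-agnostic catalog properties, and the S3-backed subclass layers its own settings on top.

```python
from dataclasses import dataclass


@dataclass
class IbmWatsonxConnectionConfig:
    iceberg_url: str
    catalog: str

    def get_catalog_configs(self) -> dict:
        # Settings shared by every blob-store variant.
        return {"uri": self.iceberg_url, "warehouse": self.catalog}


@dataclass
class IbmWatsonxS3ConnectionConfig(IbmWatsonxConnectionConfig):
    object_storage_url: str = ""

    def get_catalog_configs(self) -> dict:
        configs = super().get_catalog_configs()
        # S3-specific settings layered on top of the base configs.
        configs["s3.endpoint"] = self.object_storage_url
        return configs
```

A future GCS- or Azure-backed variant would only override `get_catalog_configs()`, leaving the shared upload logic untouched.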

Contributor Author


Actually, it could be even more precise, ibm-watsonx-iceberg-s3, but let's stick with ibm-watsonx-s3 for now; we can always change it later.

Comment thread unstructured_ingest/v2/processes/connectors/ibm_watsonx.py
Comment thread unstructured_ingest/v2/processes/connectors/ibm_watsonx.py Outdated
def _upload_data_table(table: "Table", data_table: "ArrowTable", file_data: FileData):
try:
with table.transaction() as transaction:
self._delete(transaction, file_data.identifier)
Contributor


We might want to break this up into a delete transaction and an append transaction, where we wrap each in its own tenacity retry loop. Otherwise this is going to retry the delete each time, even if it already passed or doesn't apply.
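To illustrate the reviewer's suggestion (which the author pushes back on below): each step gets its own retry loop, so a delete that already succeeded is never re-run when only the append fails. The codebase uses tenacity for retries; this is a stdlib-only sketch with hypothetical delete/append calls, not the actual connector code.

```python
import functools
import time


def with_retries(max_retries: int = 5, delay: float = 0.0):
    """Minimal stand-in for a tenacity retry loop (illustrative only)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


@with_retries(max_retries=5)
def delete_records(table, identifier):
    table.delete(identifier)  # hypothetical delete call


@with_retries(max_retries=5)
def append_records(table, data_table):
    table.append(data_table)  # hypothetical append call


def upload_data_table(table, data_table, identifier):
    # Each step retries independently: a delete that succeeded on an
    # earlier attempt is not repeated when the append needs a retry.
    delete_records(table, identifier)
    append_records(table, data_table)
```

The trade-off, as the author notes below, is that splitting the transaction gives up atomic rollback and can multiply retries under Iceberg's optimistic concurrency.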

Contributor Author

mpolomdeepsense commented Mar 24, 2025


I think it's a bad idea. I ran some tests and:

  1. Because of the way Iceberg works, this is going to produce a lot of retries.
  2. I tried partitioning and uploading 4 files with separate transactions, and one file fails (on delete) when using max_retries=5, so only 3 of the 4 files got uploaded. It worked after changing it to max_retries=10.
  3. Because of so many retries, this makes the uploader very slow.
  4. Why even use a transaction? Isn't the point of a transaction to roll back the changes on failure? After the split we have two separate single operations that shouldn't require a transaction.

Comment thread unstructured_ingest/v2/processes/connectors/ibm_watsonx.py Outdated

4 participants