feat: IBM watsonx.data Destination connector #428
Conversation
Retry on failed commit when uploading data to IBM watsonx.data connector
I'll check this out later today.
Changed to
Changed, only updating ibm-watsonx and test dependencies
Not sure why you are not seeing it; I'm able to make CLI calls. I registered it inside
| "uri": self.iceberg_url, | ||
| "token": bearer_token, | ||
| "warehouse": self.catalog, | ||
| "s3.endpoint": self.object_storage_url, |
Is IBM watsonx always backed by S3? Is this something we want to make configurable for other blob stores from other cloud providers, if that's supported?
I think you can make it use different types of storage, but from what I understand, the specific example they sent us, and that they need an implementation for, is using S3.
Then we should follow the same pattern we have for other connectors that can be backed by various blob stores, so we should have an s3 suffix for this one and create a method, i.e. `def get_catalog_configs() -> dict`, that gets called to generate this on a base class. For now we don't need to introduce this base class, but at least add that method so that if we introduce other blob stores, we could easily support them. Take a look at Databricks Volumes for an example of a connector supporting multiple blob stores.
Actually, it could even be more precise: `ibm-watsonx-iceberg-s3`, but let's stick with `ibm-watsonx-s3` for now; we can always change it later.
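A minimal sketch of the suggested shape, with illustrative class and attribute names (not the connector's actual API):

```python
# Illustrative sketch of the get_catalog_configs() pattern: a base class
# owns the blob-store-agnostic settings, and each store-specific subclass
# layers in its own keys. Names here are assumptions, not the real API.
from dataclasses import dataclass


@dataclass
class WatsonxUploaderConfig:
    iceberg_url: str
    catalog: str

    def get_catalog_configs(self) -> dict:
        # Settings shared by every blob-store variant.
        return {
            "uri": self.iceberg_url,
            "warehouse": self.catalog,
        }


@dataclass
class WatsonxS3UploaderConfig(WatsonxUploaderConfig):
    object_storage_url: str

    def get_catalog_configs(self) -> dict:
        # Extend the shared settings with S3-specific keys; a future
        # subclass could add Azure- or GCS-specific keys the same way.
        configs = super().get_catalog_configs()
        configs["s3.endpoint"] = self.object_storage_url
        return configs
```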
Co-authored-by: Roman Isecke <[email protected]>
```python
def _upload_data_table(table: "Table", data_table: "ArrowTable", file_data: FileData):
    try:
        with table.transaction() as transaction:
            self._delete(transaction, file_data.identifier)
```
We might want to break this up into a delete transaction and an append transaction, where we wrap each in its own tenacity retry loop. Otherwise this is going to retry the delete each time, even if that already passed or doesn't apply.
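A sketch of that suggestion, using tenacity's public API and a hypothetical stand-in for the connector's delete helper:

```python
# Sketch of the proposed split (helper names are hypothetical): each commit
# gets its own tenacity retry loop, so retrying the append never re-runs a
# delete that already succeeded.
from tenacity import retry, stop_after_attempt, wait_exponential


def _delete(transaction, identifier: str):
    # Stand-in for the connector's delete helper; assumes a
    # "record_id"-style column identifies previously written rows.
    transaction.delete(delete_filter=f"record_id = '{identifier}'")


@retry(stop=stop_after_attempt(5), wait=wait_exponential())
def _delete_with_retry(table, identifier: str):
    with table.transaction() as transaction:
        _delete(transaction, identifier)


@retry(stop=stop_after_attempt(5), wait=wait_exponential())
def _append_with_retry(table, data_table):
    with table.transaction() as transaction:
        transaction.append(data_table)
```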
I think it's a bad idea. I ran some tests and:
- Because of the way Iceberg works, this is going to create a lot of retries.
- I tried partitioning and uploading 4 files with separate transactions, and it fails (on delete) when using `max_retries=5` for 1 file (so 3 out of 4 files got uploaded). It worked after changing it to `max_retries=10`.
- Because of so many retries, it makes this uploader very slow.
- Why even make a transaction? Isn't the point of a transaction to roll back the changes in case of a failure? After the split we'd have two separate single operations that shouldn't require a transaction. (See the sketch after this list for the single-transaction shape.)
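As referenced above, a rough sketch of the single-transaction shape being defended, with one retry loop around the whole commit (helper names are hypothetical):

```python
# Rough sketch: delete and append stay in one transaction, so both changes
# commit atomically on exit and a single retry loop covers the whole commit.
from tenacity import retry, stop_after_attempt


@retry(stop=stop_after_attempt(10), reraise=True)
def _upload_data_table(table, data_table, identifier: str):
    with table.transaction() as transaction:
        _delete(transaction, identifier)  # same hypothetical helper as above
        transaction.append(data_table)
```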
Co-authored-by: Roman Isecke <[email protected]>