feat(datasets): support configuring the "database" #909
…o avoid breaking changes Signed-off-by: Mark Druffel <[email protected]>
Signed-off-by: Mark Druffel <[email protected]>
47331ff
to
ef3712e
Compare
Just leaving initial comments; happy to review later once it's ready.
def save(self, data: ir.Table) -> None:
    if self._table_name is None:
        raise DatasetError("Must provide `table_name` for materialization.")

    writer = getattr(self.connection, f"create_{self._materialized}")
    writer(self._table_name, data, **self._save_args)
    writer(self._table_name, data, **self._table_args)
Is this right? I think the table args should only apply to the `table` call, but haven't looked into it deeply before commenting now.
@deepyaman Sorry, this is a little confusing, so I'm just adding a bit more context.
This PR
The `table` method takes the `database` argument, but the `create_table` and `create_view` methods both take the `database` and `overwrite` arguments. The `overwrite` argument is already in `save_args`, but I'm assuming `save_args` will be removed from `TableDataset` in version 6. To avoid breaking changes, but also minimize the change between this release and version 6, I just added the new parameter (`database`) to `table_args` and left the old parameters alone.
To avoid breaking changes but still allow `create_table` and `create_view` arguments to flow through, I combined `_save_args` and `_table_args` here.
Version 6
I am assuming that `save_args` and `load_args` will be dropped from `TableDataset` in version 6. In that change, I'd assume the arguments still used from `load_args` and `save_args` would be added to `table_args`. To make TableDataset and FileDataset look and feel similar, we could consider adding a commensurate `file_args`. I've not used 5.1 enough yet to say with certainty, but I can't think of a reason a user would want different values in `load_args` than in `save_args` now that it's split from TableDataset (i.e. the `filepath`, `file_type`, `sep`, etc. would be the same for load and save)? I may be totally overlooking some things though 🤷♂️
bronze_tracks:
  type: ibis.FileDataset # use `to_<file_format>` (write) & `read_<file_format>` (read)
  connection:
    backend: pyspark
  file_args:
    filepath: hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv
    file_format: csv
    materialized: view
    overwrite: True
    table_name: tracks # `to_<file_format>` in ibis has no database parameter, so there's no ability to write to a specific catalog / db schema atm; `to_<file_format>` just writes to whatever is active
    sep: ","
silver_tracks:
  type: ibis.TableDataset # would use `create_<materialized>` (write) & `table` (read)
  connection:
    backend: pyspark
  table_args:
    name: tracks
    database: spotify.silver
    overwrite: True
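The merging described under "This PR" above can be sketched roughly as follows. This is only an illustration, not the code in the PR: `StubConnection` is a hypothetical stand-in for an Ibis backend so the example runs without any dependencies, and letting `table_args` win on a key clash is an assumption rather than something the thread settles.

```python
from typing import Any


class StubConnection:
    """Hypothetical stand-in for an Ibis backend; it only records calls."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str, dict[str, Any]]] = []

    def create_table(self, name: str, obj: Any = None, **kwargs: Any) -> None:
        self.calls.append(("create_table", name, kwargs))

    def create_view(self, name: str, obj: Any = None, **kwargs: Any) -> None:
        self.calls.append(("create_view", name, kwargs))


def save(
    connection: Any,
    table_name: str,
    data: Any,
    materialized: str,
    save_args: dict[str, Any],
    table_args: dict[str, Any],
) -> None:
    # Pick create_table or create_view based on the `materialized` setting.
    writer = getattr(connection, f"create_{materialized}")
    # Merge both dicts into one set of keyword arguments.
    # Assumption: on a duplicate key, the value from table_args wins.
    writer(table_name, data, **{**save_args, **table_args})


con = StubConnection()
save(con, "tracks", object(), "view", {"overwrite": True}, {"database": "spotify.silver"})
```

With this shape, existing `save_args` such as `overwrite` keep working, while the new `database` value reaches the same writer call.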
Signed-off-by: Mark Druffel <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Mark Druffel <[email protected]>
Signed-off-by: Mark Druffel <[email protected]>
…ark-druffel/kedro-plugins into fix/datasets/ibis-TableDataset
@deepyaman I changed this to ready for review, but I'm failing a bunch of steps. I tried to follow the guidelines, but when I run the Aside from the failing checks, I tested this version of table_dataset.py on a duckdb pipeline, a pyspark pipeline, and a pyspark pipeline on databricks and it seems to be working. My only open question relates to my musing above about the expected format of
@jakepenzak For visibility
Signed-off-by: Mark Druffel <[email protected]>
Sorry, I saw this yesterday and started drafting an apology. 🙈
I will review it later today. 🤞
…On Wed, Nov 13, 2024, 6:16 AM Merel Theisen ***@***.***> wrote:
@merelcht <https://github.com/merelcht> requested your review on: #909
<#909> feat(datasets):
Created table_args to pass to create_table, create_view, and table
methods.
No worries @deepyaman, really appreciate your help! Let me know what I can do to support, just trying to make sure the yaml changes I'm introducing make sense and figure out how to get through the PR process :) Regarding my issues with For testing, unfortunately I don't think the tests will work on my personal machine because I'm on an old processor that doesn't support
@mark-druffel Actually, putting aside the issues with local development, if you look at the CI failure on
Looks good on the whole, but one comment re how `database` is handled.
Let me know if I can help with any of the technical aspects of resolving merge conflicts, adding tests, etc.!
if table_args is not None:
    save_args["database"] = table_args.get("database", None)
This feels a bit magical to me. It's not really consistent with the docstring, either, which says that arguments will be passed to `create_{materialized}`; in reality, the user needs to know that just `database` will be passed.
I personally would recommend one of two approaches. One is to not do anything special here; the user can pass `database` in `save_args` and `database` in `table_args`, and, while it may feel duplicative, at least it's explicit. The other approach is to make an explicit `database` keyword for the dataset, and likely raise an error if `database` is specified in `save_args` and/or `table_args` when it is also passed explicitly.
@mark-druffel does this make sense, and do you have a preference?
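A minimal sketch of the second approach (an explicit `database` keyword that rejects duplicates) might look like this. The helper name `resolve_database` is hypothetical, and `DatasetError` is redefined locally as a stand-in for Kedro's exception so the example is self-contained; the real dataset may validate differently.

```python
from typing import Any, Optional


class DatasetError(Exception):
    """Local stand-in for Kedro's DatasetError (keeps the sketch dependency-free)."""


def resolve_database(
    database: Optional[str],
    save_args: Optional[dict[str, Any]] = None,
    table_args: Optional[dict[str, Any]] = None,
) -> Optional[str]:
    """Return the explicit `database`, refusing a second copy in the arg dicts."""
    for label, args in (("save_args", save_args), ("table_args", table_args)):
        # Only complain when the value is given both explicitly and in a dict.
        if database is not None and args and "database" in args:
            raise DatasetError(
                f"`database` was passed explicitly and in `{label}`; pass it only once."
            )
    return database
```

Failing loudly here avoids having to decide silently which of two conflicting values should win.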
@deepyaman As discussed yesterday, I've moved `database` to the top level. I'm trying to push the changes, but I'm getting blocked by pre-commit now that I have it set up properly.
When it ran, it changed a bunch of files I never touched. I staged those as well (not sure if I should've), but my commit still failed because of Black. I've also run Black manually on the file I changed to try to lint it. Any suggestions on how I can get this working properly? 😬
Based on the screenshot, it's only reformatting one file. Maybe you can do a `git diff` to see what's changed? You can also just add that change, and I can take a look.
Also happy to help debug the workflow on a quick call, if that would help!
@mark-druffel It would be great to get this included in 6.0.0 in the form we eventually arrived at, with no need for Let me know if I can help
… only file I touched was table_dataset Signed-off-by: Mark Druffel <[email protected]>
06196aa
to
2905cf8
Compare
kedro-datasets/tests/conftest.py
Outdated
@fixture(params=[None])
def database(request):
    return request.param
@mark-druffel Just for your future information (I will update this now), the fixtures under kedro-datasets/tests/conftest.py are shared across all tests, so we don't need to add an Ibis-specific fixture here.
Signed-off-by: Deepyaman Datta <[email protected]>
618a96e
to
6be937e
Compare
Signed-off-by: Deepyaman Datta <[email protected]>
6be937e
to
ec8ccb2
Compare
database parameter to pass through to create_table(), create_view(), and table()
This looks good to me! Looks like it's your first contribution @mark-druffel; congratulations on taking on a tricky one and seeing it through. :)
Appreciate if you can take a quick look at some of the changes I've added on top, just as a sanity check; of course, we'll need a second review from the Kedro maintainer group, as well. Feel free to update the PR description, etc., as well, if what I threw in is too terse/not communicated clearly.
Signed-off-by: Deepyaman Datta <[email protected]>
return (
    self.connection.table(self._table_name)
    if self._database is None
    else self.connection.table(self._table_name, **self._load_args)
)
Suggested change:
return self.connection.table(self._table_name, **self._load_args)
This seems unnecessary
This was to address backends that don't support `database` (e.g. polars), we discussed here
Ah, right! I think you set it up in `__init__()` though such that `database` is only added to `self._load_args` if it's not `None`, so this simplification is probably still correct. Maybe it makes sense to add a Polars test if making this change, just to be safe, though.
I also am not 100% sure if it could be an issue that `**self._load_args` isn't being included if `database is None`, in case something beyond `name` and `database` can be passed to `Backend.table()`; asking here: https://ibis-project.zulipchat.com/#narrow/channel/405263-general/topic/Are.20there.20any.20restrictions.20around.20.60Backend.2Etable.60.20signature.3F/near/486157200
(Also probably relevant to some of the questions from @ravi-kumar-pilla in #956, although the use of `**self._load_args` here is one of the reasons why it shouldn't be removed in that PR regardless.)
> Ah, right! I think you set it up in `__init__()` though such that `database` is only added to `self._load_args` if it's not `None`, so this simplification is probably still correct. Maybe it makes sense to add a Polars test if making this change, just to be safe, though.
Agree doing it in `__init__()` would make sense 👍 I recall looking to do it there first, but it felt more complicated because of this reader code, which I think is what's going to get deprecated in favor of `FileDataset`? The `reader` fails if `load_args` has a `database` parameter, so it can't be passed in those use cases...
I could modify `__init__()` to pass `database` to `load_args` only `if self._database is not None and self._filepath is None:` if that seems more readable?
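The guard proposed above could be sketched like this. The helper `build_load_args` and the two stub backends are hypothetical, written only to show why `database` should be left out of the keyword arguments when it is `None` or when a file reader is in use; they are not the actual `TableDataset` code or real Ibis backends.

```python
from typing import Any, Optional


def build_load_args(
    load_args: Optional[dict[str, Any]],
    database: Optional[str],
    filepath: Optional[str],
) -> dict[str, Any]:
    """Copy load_args and add `database` only when loading an existing table."""
    args = dict(load_args or {})
    # Skip `database` for file readers and when no database was configured.
    if database is not None and filepath is None:
        args["database"] = database
    return args


class NoDatabaseBackend:
    """Hypothetical backend whose table() has no `database` parameter."""

    def table(self, name: str) -> str:
        return name


class DatabaseBackend:
    """Hypothetical backend whose table() accepts an optional `database`."""

    def table(self, name: str, database: Optional[str] = None) -> str:
        return name if database is None else f"{database}.{name}"


# Because `database` is absent from the dict when unset, one unconditional
# call works for both kinds of backend.
plain = NoDatabaseBackend().table("tracks", **build_load_args(None, None, None))
qualified = DatabaseBackend().table(
    "tracks", **build_load_args(None, "spotify.silver", None)
)
```

Doing the filtering once in `__init__()` keeps the load path to a single `table(...)` call with no branching.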
> I also am not 100% sure if it could be an issue that `**self._load_args` isn't being included if `database is None`, in case something beyond `name` and `database` can be passed to `Backend.table()`; asking here: https://ibis-project.zulipchat.com/#narrow/channel/405263-general/topic/Are.20there.20any.20restrictions.20around.20.60Backend.2Etable.60.20signature.3F/near/486157200
> (Also probably relevant to some of the questions from @ravi-kumar-pilla in #956, although the use of `**self._load_args` here is one of the reasons why it shouldn't be removed in that PR regardless.)
Definitely makes sense to confirm, but I did review several ibis backends and test the few I use to confirm that `database` defaults to `None`, for what it's worth.
I commented on the one change, I think the code you removed is required for backends that don't support Thanks so much for all your patience and help, you carried me the whole way through this 🥇
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
876f804
to
6e411e4
Compare
Signed-off-by: Deepyaman Datta <[email protected]>
Refs: 10af4db Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Merged and synced; this is ready to go pending one review!
Thanks, @mark-druffel and @deepyaman ! LGTM!
Description
Close #833
Development notes
Ended up supporting database in two senses: attached external databases and what many databases call schemas
Checklist
RELEASE.md
file