Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(datasets): support configuring the "database" #909

Merged
merged 30 commits into from
Dec 10, 2024

Conversation

mark-druffel
Copy link
Contributor

@mark-druffel mark-druffel commented Oct 25, 2024

Description

Close #833

Development notes

Ended up supporting database in two senses: attached external databases and what many databases call schemas

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

@mark-druffel mark-druffel changed the title Created table_args to pass to create_table, create_view, and table methods Fix(datasets): Created table_args to pass to create_table, create_view, and table methods Oct 25, 2024
@mark-druffel mark-druffel changed the title Fix(datasets): Created table_args to pass to create_table, create_view, and table methods fix(datasets): Created table_args to pass to create_table, create_view, and table methods Oct 25, 2024
@mark-druffel mark-druffel force-pushed the fix/datasets/ibis-TableDataset branch from 47331ff to ef3712e Compare October 25, 2024 22:44
Copy link
Member

@deepyaman deepyaman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just leaving initial comments; happy to review later once it's ready.

kedro-datasets/RELEASE.md Outdated Show resolved Hide resolved

def save(self, data: ir.Table) -> None:
if self._table_name is None:
raise DatasetError("Must provide `table_name` for materialization.")

writer = getattr(self.connection, f"create_{self._materialized}")
writer(self._table_name, data, **self._save_args)
writer(self._table_name, data, **self._table_args)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right? I think the table args should only apply to the table call, but haven't looked into it deeply before commenting now.

Copy link
Contributor Author

@mark-druffel mark-druffel Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deepyaman Sorry this is a little confusing so just adding a bit more context.

This PR

The table method takes the database argument, butcreate_table & create_view methods both take the database and overwrite arguments. The overwrite argument is already in save_args, but I'm assuming save_args will be removed from TableDataset in version 6. To avoid breaking changes, but also minimize change between this release and version 6 I just added the new parameters (database) to table_args and left the old parameters alone. is already in the save_args they both also have overwrite which is already in _save_args.

To avoid breaking changes but still allow create_table and create_view arguments to flow through, I combined _save_args and _table_args here.

Version 6

I am assuming that save_args & load_args will be dropped from TableDataset in version 6. In that change, I'd assume the arguments still used from load_args and save_args would be added to table_args. To make TableDataset and FileDataset look / feel similar, we could consider just making a commensurate file_args. I've not used 5.1 enough yet to say with certainty, but I can't think of a reason a user would want different values in load_args than save_args now that it's split from TableDataset (i.e. the filepath, file_type, sep, etc. would be same for load and save)? I may be totally overlooking some things though 🤷‍♂️

bronze_tracks:
  type: ibis.FileDataset # use `to_<file_format>` (write) & `read_<file_format>` (read)
  connection:
    backend: pyspark
  file_args:
    filepath: hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv
    file_format: csv
    materialized: view
    overwrite: True
    table_name: tracks #`to_<file_format>` in ibis has no database parameter so there's no ability to write to a specific catalog / db schema atm, `to_<file_format>` just writes to w/e is active
    sep: "," 

silver_tracks:
  type: ibis.TableDataset # would use `create_<materialized>` (write) & `table` (read)
  connection:
    backend: pyspark
  table_args:
    name: tracks
    database: spotify.silver
    overwrite: True

@mark-druffel mark-druffel changed the title fix(datasets): Created table_args to pass to create_table, create_view, and table methods feat(datasets): Created table_args to pass to create_table, create_view, and table methods Oct 28, 2024
@mark-druffel mark-druffel deleted the fix/datasets/ibis-TableDataset branch October 28, 2024 19:39
@deepyaman deepyaman reopened this Oct 28, 2024
@mark-druffel mark-druffel marked this pull request as ready for review November 1, 2024 22:37
@mark-druffel
Copy link
Contributor Author

@deepyaman I changed this to ready for review, but I'm failing a bunch of steps. I tried to follow the guidelines, but when I run the make tests they all fail saying No rule. Any chance you can take a look and give me a bit of guidance? Sorry just not sure where to go from here 😬

image

Aside from the failing checks, I tested this version of table_dataset.py on a duckdb pipeline, a pyspark pipeline, and a pyspark pipeline on databricks and it seems to be working. My only open question relates to my musing above about the expected format of TableDataset and FileDataset above.

@mark-druffel
Copy link
Contributor Author

@jakepenzak For visibility

@merelcht merelcht requested a review from deepyaman November 13, 2024 13:16
@deepyaman
Copy link
Member

deepyaman commented Nov 13, 2024 via email

@mark-druffel
Copy link
Contributor Author

mark-druffel commented Nov 13, 2024

Sorry, I saw this yesterday and started drafting an apology. 🙈 I will review it later today. 🤞

On Wed, Nov 13, 2024, 6:16 AM Merel Theisen @.> wrote: @merelcht https://github.com/merelcht requested your review on: #909 <#909> feat(datasets): Created table_args to pass to create_table, create_view, and table methods. — Reply to this email directly, view it on GitHub <#909 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADK3W3SOIHESNW3FEMOTGED2ANGKTAVCNFSM6AAAAABQUDWM3CVHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJVGI4DGMBQGQYTQMQ . You are receiving this because your review was requested.Message ID: @.>

No worries @deepyaman, really appreciate your help! Let me know what I can do to support, just trying to make sure the yaml changes I'm introducing make sense and figure out how to get through the PR process :)

Regarding my issues with make, I was able to figure out my initial question, but still ran into some errors running when linting and testing. Not sure what went wrong, but for linting I get errors related to some other datasets. Perhaps from rebasing 🤷
image

For testing, unfortunately I don't think the tests will work on my personal machine because I'm on an old processor that doesn't support AVX2. I know that causes issues when running polars, not sure if that's the root cause or not, the setps seem to fail because of a python illegal instruction error in execnet...
image

@deepyaman
Copy link
Member

@mark-druffel Actually, putting aside the issues with local development, if you look at the CI failure on kedro-datasets / unit-tests, you'll see that the main thing is not having unit tests covering lines 150 and 155 in kedro_datasets/ibis/table_dataset.py:

kedro_datasets/ibis/table_dataset.py                            66      2    97%   150, 155

Copy link
Member

@deepyaman deepyaman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good on the whole, but one comment re how database is handled.

Let me know if I can help with any of the technical aspects of resolving merge conflicts, adding tests, etc.!

Comment on lines 149 to 150
if table_args is not None:
save_args["database"] = table_args.get("database", None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit magical to me. It's not really consistent with the docstring, either, which says that arguments will be passed to create_{materialized}; in reality, the user needs to know that just database will be passed.

I personally would recommend one of two approaches. One is to not do anything special here; the user can pass database in save_args and database in table_args, and, while it may feel duplicative, at least it's explicit. The other approach to make an explicit database keyword for the dataset, and likely raise an error if database is specified in save_args and/or table_args if also passed explicitly.

@mark-druffel does this make sense, and do you have a preference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deepyaman As discussed yesterday, I've moved database to the top-level as discussed. I'm trying to push the changes, but I'm getting blocked by pre-commit now that I have it setup properly.

When it ran, it changed a bunch of files I never touched. I staged those as well (not sure if I should've), but my commit still failed because of Black. I've run black manually on the file I changed too to try to lint the file. Any suggestions how I can get this working properly? 😬

image

@deepyaman
Copy link
Member

deepyaman commented Nov 15, 2024 via email

@merelcht merelcht mentioned this pull request Nov 25, 2024
7 tasks
@deepyaman
Copy link
Member

@mark-druffel It would be great to get this included in 6.0.0 in the form we eventually arrived to—no need for table_args, just accept a top-level database, I think? The reason it would be great in 6.0.0 is because we will also be removing the file reading from TableDataset for the release (in favor of FileDataset), so we don't need to think about that edge case at all.

Let me know if I can help

… only file I touched was table_dataset

Signed-off-by: Mark Druffel <[email protected]>
@deepyaman deepyaman force-pushed the fix/datasets/ibis-TableDataset branch 2 times, most recently from 06196aa to 2905cf8 Compare December 2, 2024 00:25
Comment on lines 92 to 96
@fixture(params=[None])
def database(request):
return request.param


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@fixture(params=[None])
def database(request):
return request.param
@fixture(params=[None])
def database(request):
return request.param

@mark-druffel Just for your future information (I will update this now), the fixtures under kedro-datasets/tests/conftest.py are shared across all tests, so we don't need to add an Ibis-specific fixture here.

@deepyaman deepyaman force-pushed the fix/datasets/ibis-TableDataset branch from 618a96e to 6be937e Compare December 4, 2024 00:01
@deepyaman deepyaman force-pushed the fix/datasets/ibis-TableDataset branch from 6be937e to ec8ccb2 Compare December 4, 2024 00:03
@deepyaman deepyaman changed the title feat(datasets): table_dataset add database parameter to pass through to create_table(), create_view(), and table() feat(datasets): support configuring the "database" Dec 4, 2024
Copy link
Member

@deepyaman deepyaman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! Looks like it's your first contribution @mark-druffel; congratulations on taking on a tricky one and seeing it through. :)

Appreciate if you can take a quick look at some of the changes I've added on top, just as a sanity check; of course, we'll need a second review from the Kedro maintainer group, as well. Feel free to update the PR description, etc., as well, if what I threw in is too terse/not communicated clearly.

Comment on lines 181 to 185
return (
self.connection.table(self._table_name)
if self._database is None
else self.connection.table(self._table_name, **self._load_args)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return (
self.connection.table(self._table_name)
if self._database is None
else self.connection.table(self._table_name, **self._load_args)
)
return self.connection.table(self._table_name, **self._load_args)

This seems unnecessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was to address backends that don't support database (e.g. polars), we discussed here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right! I think you set it up in __init__() though such that database is only added to self._load_args if it's not None, so this simplification is probably still correct. Maybe it makes sense to add a Polars test if making this change, just to be safe, though.

Copy link
Member

@deepyaman deepyaman Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also am not 100% sure if it could be an issue that **self._load_args isn't being included if database is None, in case something beyond name and database can be passed to Backend.table(); asking here: https://ibis-project.zulipchat.com/#narrow/channel/405263-general/topic/Are.20there.20any.20restrictions.20around.20.60Backend.2Etable.60.20signature.3F/near/486157200

(Also probably relevant to some of the questions from @ravi-kumar-pilla in #956, although the use of **self._load_args here is one of the reasons why it shouldn't be removed in that PR regardless.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right! I think you set it up in __init__() though such that database is only added to self._load_args if it's not None, so this simplification is probably still correct. Maybe it makes sense to add a Polars test if making this change, just to be safe, though.

Agree doing it in _init_() would make sense 👍 I recall looking to do it there first, but it felt more complicated because of this reader code which I think is what's going to get deprecated in favor of FileDataset? The reader fails if load_args has a database parameter so I can't be passed in those use cases...

I could modify the _init_() to pass database to load_args if self._database is not None and self._filepath is None: if that seems more readable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also am not 100% sure if it could be an issue that **self._load_args isn't being included if database is None, in case something beyond name and database can be passed to Backend.table(); asking here: https://ibis-project.zulipchat.com/#narrow/channel/405263-general/topic/Are.20there.20any.20restrictions.20around.20.60Backend.2Etable.60.20signature.3F/near/486157200

(Also probably relevant to some of the questions from @ravi-kumar-pilla in #956, although the use of **self._load_args here is one of the reasons why it shouldn't be removed in that PR regardless.)

Definitely makes sense to confirm, but I did review several ibis backends and test the few I use to confirm that database defaults to None for what it's worth

@mark-druffel
Copy link
Contributor Author

This looks good to me! Looks like it's your first contribution @mark-druffel; congratulations on taking on a tricky one and seeing it through. :)

Appreciate if you can take a quick look at some of the changes I've added on top, just as a sanity check; of course, we'll need a second review from the Kedro maintainer group, as well. Feel free to update the PR description, etc., as well, if what I threw in is too terse/not communicated clearly.

I commented on the one change, I think the code you removed is required for backends that don't support database, but I didn't run any direct tests on polars after we chatted about it so I could be wrong.

Thanks so much for all your patience and help, you carried me the whole way through this 🥇

@deepyaman deepyaman force-pushed the fix/datasets/ibis-TableDataset branch from 876f804 to 6e411e4 Compare December 5, 2024 13:42
@deepyaman
Copy link
Member

deepyaman commented Dec 5, 2024

Merge after #957

Merged and synced; this is ready to go pending one review!

@deepyaman deepyaman enabled auto-merge (squash) December 6, 2024 18:31
@DimedS DimedS self-requested a review December 9, 2024 14:50
Copy link
Member

@DimedS DimedS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @mark-druffel and @deepyaman ! LGTM!

@deepyaman deepyaman merged commit 50fa3c0 into kedro-org:main Dec 10, 2024
12 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat(datasets): support setting Ibis table schemas
3 participants