
synchronise pipeline protocol definitions and fix signature to use proper typing #2364

Open · wants to merge 4 commits into base: devel

Conversation

@ubiquitousbyte commented Mar 1, 2025

Description

Adds Optional to all arguments of pipeline.run(...) so that None values are accepted by all type checkers.

Related Issues

Additional Context

Extracts the pipeline.run(...) arguments into a typed dictionary (TypedDict) and uses Unpack to splice them back into SupportsPipeline, SupportsPipelineRun and Pipeline. This way, the protocols and the implementation always stay in sync with each other.
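
A minimal sketch of the pattern, with simplified placeholder field types standing in for dlt's real aliases (TDestinationReferenceArg and friends):

from typing import Any, Optional, Protocol, TypedDict

from typing_extensions import Unpack  # typing.Unpack on Python >= 3.11

class SupportsPipelineRunArgs(TypedDict, total=False):
    # total=False makes every key optional, so callers may omit any of them.
    destination: Optional[Any]  # TDestinationReferenceArg in dlt
    dataset_name: Optional[str]
    credentials: Optional[Any]
    table_name: Optional[str]
    write_disposition: Optional[Any]  # TWriteDispositionConfig in dlt
    schema_contract: Optional[Any]  # TSchemaContract in dlt

class SupportsPipelineRun(Protocol):
    def __call__(self, data: Any = None, **kwargs: Unpack[SupportsPipelineRunArgs]) -> Any:
        ...

The same TypedDict is then unpacked into the **kwargs of the protocol methods and of Pipeline.run, giving a single source of truth for the run arguments.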

This also allows users of dlt to create custom wrappers around run with enhanced functionality. Here's an example of our custom run wrapper:

def run(
    pipeline: SupportsPipeline,
    data: Any = None,
    *,
    destination: TDestinationReferenceArg | None = None,
    dataset_name: str | None = None,
    credentials: Any = None,
    table_name: str | None = None,
    write_disposition: TWriteDispositionConfig | None = None,
    columns: Sequence[TColumnSchema] | None = None,
    primary_key: TColumnNames | None = None,
    schema: Schema | None = None,
    loader_file_format: TLoaderFileFormat | None = None,
    schema_contract: TSchemaContract | None = None,
    state_store: PipelineStateStore | None = None,
) -> LoadInfo:
    with _maybe_reverse_etl_pipeline(pipeline, state_store) as pipeline:
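        # Each keyword needs a type: ignore because the current protocol signatures
        # do not mark the defaults as Optional, so Pyright rejects None values.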
        return pipeline.run(
            data,
            dataset_name=dataset_name,  # type: ignore
            destination=destination,  # type: ignore
            credentials=credentials,  # type: ignore
            table_name=table_name,  # type: ignore
            write_disposition=write_disposition,  # type: ignore
            columns=columns,  # type: ignore
            primary_key=primary_key,  # type: ignore
            schema=schema,  # type: ignore
            loader_file_format=loader_file_format,  # type: ignore
            schema_contract=schema_contract,  # type: ignore
        )

With the proposed changes here, we can refactor it into:

def run(
    pipeline: SupportsPipeline,
    data: Any = None,
    *,
    state_store: PipelineStateStore | None = None,
    **kwargs: Unpack[SupportsPipelineRunArgs],
) -> LoadInfo:
    with _maybe_reverse_etl_pipeline(pipeline, state_store) as pipeline:
        return pipeline.run(data, **kwargs)
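
For illustration, hypothetical call sites against this wrapper now type-check through the shared TypedDict:

info = run(pipeline, [1, 2, 3], destination="bigquery", table_name="numbers")  # OK
info = run(pipeline, [1, 2, 3], table_name=123)  # rejected by Pyright: int is not str | None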

netlify bot commented Mar 1, 2025

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: 4824e0f
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/67c4af26c9c383000839c64b

@rudolfix (Collaborator) left a comment


This looks really good! We were considering using Unpack here and in many other places (e.g. to represent resource hints). The problem was that support in code editors was really bad:

  • you could not see the arguments in Unpack, just kwargs
  • docstrings for an argument are not automatically displayed

I just checked VS Code and it looks like it works! What editor are you on, @ubiquitousbyte? We need to check PyCharm and Cursor (is it VS Code based?) to make sure we are not breaking the dev experience.

EDIT: VS Code does not display docstrings for unpacked kwargs :/ For example:

pipeline.run([1, 2, 3], destination="bigquery")

I do not see a docstring when I hover over destination. Does it work for you?

@rudolfix self-assigned this Mar 2, 2025
@ubiquitousbyte (Author) commented Mar 2, 2025

I wasn't getting the docstrings either. Adding them to SupportsPipelineRunArgs has partially fixed this for me: they are rendered, as can be seen in the attached screenshot, and I get an error if I try to use an invalid type, indicating that type checking works. However, I am unable to get any docstrings for the credentials, staging, and destination arguments. :/

[screenshot: docstrings rendered for unpacked kwargs]

I am using the Zed editor with Pyright as the language server.

EDIT: I do get docstrings for credentials, staging and destination when I use SupportsPipeline:

def __test(pipeline: "SupportsPipeline"):
    return pipeline.run(destination="bigquery")  # <--- destination auto-completes with a docstring here

I don't get docstrings for credentials, staging and destination only when I use dlt.pipeline().run; I guess that has something to do with the return type being TPipeline, which is bound to Pipeline.

@ubiquitousbyte (Author) commented

@rudolfix let me know if the insufficient auto-completion support for unpacked kwargs is a deal breaker for you. I'll revert to named arguments, but will still keep the Optional wrappers around them to avoid our original # type: ignore issues with Pyright.

@rudolfix (Collaborator) commented Mar 6, 2025

@ubiquitousbyte let me check your newest changes. I'd prefer to keep Unpack, of course. Thanks!

@rudolfix (Collaborator) commented Mar 9, 2025

Now I do not see any arguments at all... but there's still a chance. IMO if you convert SupportsPipelineRun into a def (a plain function) with all the arguments and docstrings, you'll be able to use dlt.common.typing:copy_sig to copy the signature and docstring.

For methods (those that take self) you may try copy_sig_any, which should skip the first parameter of the copied signature. If that does not work, I'm pretty sure you can add another copy function that does that (using Concatenate).

See how the methods above are used in the codebase.
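
For reference, a minimal sketch of the signature-copying pattern being described, assuming simplified generics (dlt ships its own helpers in dlt.common.typing; this is not their exact implementation):

from typing import Any, Callable, TypeVar

from typing_extensions import Concatenate, ParamSpec  # stdlib typing on Python >= 3.10

P = ParamSpec("P")
T = TypeVar("T")

def copy_sig(wrapper: Callable[P, Any]) -> Callable[[Callable[..., T]], Callable[P, T]]:
    # Make the decorated function advertise `wrapper`'s parameters to type checkers.
    def decorator(func: Callable[..., T]) -> Callable[P, T]:
        return func  # runtime no-op; only the declared type changes
    return decorator

def copy_sig_any(
    wrapper: Callable[Concatenate[Any, P], Any],
) -> Callable[[Callable[..., T]], Callable[Concatenate[Any, P], T]]:
    # Like copy_sig, but the first parameter (e.g. self) stays untyped
    # while the remaining parameters are copied from `wrapper`.
    def decorator(func: Callable[..., T]) -> Callable[Concatenate[Any, P], T]:
        return func
    return decorator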

Development

Successfully merging this pull request may close these issues:

Pipeline protocols are out of sync with implementation and do not play well with static type checkers.