Add deferred pagination mode to GenericTransfer #44809

Status: Open. Wants to merge 114 commits into base: main.

Commits (114)
8f3657f
refactor: Refactored the GenericTransfer operator to support paginate…
davidblain-infrabel Dec 10, 2024
9b36a14
refactor: updated provider dependencies
dabla Dec 10, 2024
eedcd52
refactor: Added TestSQLExecuteQueryTrigger and moved test code which …
davidblain-infrabel Dec 10, 2024
38f654a
refactor: Fixed static checks
davidblain-infrabel Dec 10, 2024
15383e3
refactor: Fixed static checks
davidblain-infrabel Dec 10, 2024
a93b49a
refactor: Fixed static checks
davidblain-infrabel Dec 10, 2024
8a1d2de
refactor: Reformatted GenericTransfer
davidblain-infrabel Dec 10, 2024
757ab68
refactor: Moved source and destination hooks into cached properties
davidblain-infrabel Dec 10, 2024
0744383
refactor: Moved imports to type checking block
davidblain-infrabel Dec 11, 2024
edff5f4
refactor: Fixed execute method of GenericTransfer
davidblain-infrabel Dec 11, 2024
f3b2893
refactor: Refactored get_hook method of GenericTransfer which checks …
davidblain-infrabel Dec 11, 2024
ac4df02
refactor: Remove white lines from mock_context
davidblain-infrabel Dec 11, 2024
0a771d8
refactor: Reformatted get_hook in GenericTransfer operator
davidblain-infrabel Dec 11, 2024
0e426dc
refactor: Added sql.pyi for SQLExecuteQueryTrigger
davidblain-infrabel Dec 16, 2024
4bc7d6f
refactor: Reformatted SQLExecuteQueryTrigger definition
davidblain-infrabel Dec 16, 2024
59eae35
refactor: Added alias in SQLExecuteQueryTrigger definition
davidblain-infrabel Dec 16, 2024
fff10ce
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 2, 2025
1662d2f
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 2, 2025
735a557
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 7, 2025
f092507
refactor: Added unit test for GenericTransfer using deferred pageable…
davidblain-infrabel Jan 7, 2025
3550b84
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 7, 2025
44a6965
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 8, 2025
9e75525
refactor: Moved generic_transfer operator definition in provider.yaml…
davidblain-infrabel Jan 8, 2025
5c52e4a
refactor: Renamed typo of module which allows you to run deferrable o…
davidblain-infrabel Jan 8, 2025
73f5f1c
refactor: Reformatted xcom_pull method from mock_context
davidblain-infrabel Jan 8, 2025
abadb0b
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 9, 2025
9a5760f
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 10, 2025
ca6464e
refactor: Fixed module name of the GenericTransfer in the deprecated …
davidblain-infrabel Jan 10, 2025
d0cddfd
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 10, 2025
ba7fd40
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 10, 2025
f1971c3
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 11, 2025
483b816
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 13, 2025
0e70c4a
refactor: updated provider dependencies
dabla Jan 14, 2025
d721806
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 14, 2025
8a4a8ab
refactor: Fixed static checks
davidblain-infrabel Jan 14, 2025
f9a2f90
refactor: Added definition for generic_transfer.pyi
davidblain-infrabel Jan 14, 2025
440135f
Merge branch 'main' into feature/paginated-generic-transfer
davidblain-infrabel Jan 15, 2025
de1adff
Merge branch 'main' into feature/paginated-generic-transfer
davidblain-infrabel Jan 15, 2025
3539e1e
refactor: Reformatted SQLExecuteQueryTrigger definition
davidblain-infrabel Jan 17, 2025
961a0a2
refactor: Reformatted GenericTransfer definition
davidblain-infrabel Jan 17, 2025
b4fca1d
refactor: Removed compat check in TestSQLExecuteQueryTrigger
davidblain-infrabel Jan 17, 2025
8b01016
refactor: Removed template_fields_renderers from GenericTransfer defi…
davidblain-infrabel Jan 17, 2025
84b75e9
refactor: Added newline after imports GenericTransfer definition
davidblain-infrabel Jan 17, 2025
b0f56c5
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 17, 2025
96c3691
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 19, 2025
015509d
refactor: Updated definitions for GenericTransfer and SQLExecuteQuery…
davidblain-infrabel Jan 19, 2025
faa0399
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 19, 2025
5cc657a
refactor: Fixed TestSQLExecuteQueryTrigger
davidblain-infrabel Jan 19, 2025
e8d0759
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 23, 2025
fde46da
refactor: Removed unused json import from TestSQLExecuteQueryTrigger
davidblain-infrabel Jan 20, 2025
fb240c8
refactor: Changed return type of run method in SQLExecuteQueryTrigger…
davidblain-infrabel Jan 20, 2025
7aca76d
Merge branch 'main' into feature/paginated-generic-transfer
davidblain-infrabel Jan 28, 2025
029ab5c
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 28, 2025
7276e33
refactor: Updated provider dependencies
dabla Jan 28, 2025
1b47ae7
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 28, 2025
11bf47b
refactor: Removed dependencies form provider.yaml and removed common …
davidblain-infrabel Jan 28, 2025
d70520d
refactor: Updated get_provider_info for standard and common sql provider
davidblain-infrabel Jan 28, 2025
c23d4d0
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 28, 2025
775f913
refactor: Updated provider dependencies
dabla Jan 28, 2025
c54bb73
refactor: Updated README.rst
davidblain-infrabel Jan 28, 2025
ea9040a
refactor: Reorganized imports
davidblain-infrabel Jan 28, 2025
325a7b5
refactor: Reorganized imports generic transfer pyi
davidblain-infrabel Jan 28, 2025
ec1c563
refactor: Reformatted provider info
davidblain-infrabel Jan 28, 2025
e8bacc7
refactor: Some reformatting
davidblain-infrabel Jan 28, 2025
04bdafc
refactor: Fixed import SQLExecuteQueryTrigger for AsyncIterator
davidblain-infrabel Jan 28, 2025
028bc24
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 28, 2025
1440f88
refactor: Changed back import of AsyncIterator
davidblain-infrabel Jan 28, 2025
7fa79f1
refactor: Removed definition of serialize and run methods
davidblain-infrabel Jan 28, 2025
0c68547
refactor: Re-added serialize and run methods
davidblain-infrabel Jan 28, 2025
f5a43bf
refactor: Added alias for BaseTrigger
davidblain-infrabel Jan 28, 2025
88941f1
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 28, 2025
c3f91db
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 28, 2025
c10e063
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 29, 2025
0d27771
refactor: Updated exception message in SQLTrigger
davidblain-infrabel Jan 29, 2025
f117a6c
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 29, 2025
2d67c70
Updates docs to improve Readability (#46205)
aritra24 Jan 29, 2025
6c4d041
Add Webserver parameters: max_form_parts, max_form_memory_size (#45749)
moiseenkov Jan 29, 2025
2e59373
Fix `api_connexion` test failures in main (#46245)
vincbeck Jan 29, 2025
ff0df4c
Add autorefresh to Task Instance and Dag Run pages (#46213)
bbovenzi Jan 29, 2025
46e1a5a
use populate_by_name (#46168)
rawwar Jan 29, 2025
25f8cb9
Disable Flask-SQLAlchemy modification tracking in FAB provider (#46249)
jedcunningham Jan 29, 2025
be657ad
Fix AWS auth manager compatibility test (#46246)
vincbeck Jan 29, 2025
947ccf9
Prevent from being added to root logger by default (#46236)
amoghrajesh Jan 29, 2025
7294194
Move more DAG parsing related config to dag_processor section (#46034)
jedcunningham Jan 29, 2025
b98fe8e
Update test fixture docstring (#46252)
jedcunningham Jan 29, 2025
bcbeeac
AIP-38 Invalidate DryRun query cache on submit (#46238)
pierrejeambrun Jan 29, 2025
b089ffc
Fix pre-commit to re-generate on dependency changes (#46259)
jscheffl Jan 29, 2025
a12b762
update some icons and add dag import error to dagslist (#46251)
bbovenzi Jan 29, 2025
0634ed6
Limit `google-cloud-aiplatform` to fix issues in CI (#46242)
amoghrajesh Jan 29, 2025
0453bdb
Adding from_path.exists() check in a couple of conditions (#46255)
kunaljubce Jan 30, 2025
81cec72
Remove caplog from HDFS tests (#46263)
jscheffl Jan 30, 2025
114fa7b
Remove caplog from Livy tests (#46272)
potiuk Jan 30, 2025
8c0b2ed
Replace caplog with patching log property in k8s tests (#46273)
potiuk Jan 30, 2025
f4122e9
Remove import from MySQL provider tests in generic transfer test (#46…
potiuk Jan 30, 2025
6bf3a23
Fix GitDagBundle to support https (include 46073/46179) (#46226)
jx2lee Jan 30, 2025
9b6a211
AIP-72: Port _validate_inlet_outlet_assets_activeness into Task SDK (…
amoghrajesh Jan 30, 2025
5cae77d
Try and reduce flakiness of number of test call asserts (#46276)
amoghrajesh Jan 30, 2025
49ac241
Add evaluation extra to google-cloud-aiplatform (#46270)
potiuk Jan 30, 2025
3c5fd86
Move Docker Provider to the New Structure (#46097)
bugraoz93 Jan 30, 2025
7e88cb3
Move Atlassian Jira Provider to the new structure (#46271)
bugraoz93 Jan 30, 2025
5abd218
moved apache.cassandra provider to new structure and fixed check-for-…
Prab-27 Jan 30, 2025
ff34487
Make azure test less flaky/racy (#46281)
potiuk Jan 30, 2025
0dbe2ed
move Influxdb provider to new provider structure (#46277)
Prab-27 Jan 30, 2025
ba42ba2
Move Apprise provider to new structure (#46161)
Prab-27 Jan 30, 2025
1def6cf
Make racy test test_start_pod_startup_interval_seconds less racy (#4…
potiuk Jan 30, 2025
f65064e
Move FTP Provider to the New Structure (#46206)
aritra24 Jan 30, 2025
df0c941
move Trino provider to new provider structure (#46162)
Prab-27 Jan 30, 2025
a19f87d
refactor(providers/slack): move slack provider to new structure (#46209)
josix Jan 30, 2025
a341c3d
Move Apache Livy to new provider structure (#46131)
jason810496 Jan 30, 2025
b6ebe21
refactor: Re-added test_generic_transfer under standard provider
davidblain-infrabel Jan 30, 2025
6c1ae47
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 30, 2025
63f74d5
refactor: Moved test_generic_transfer from standard to common sql pro…
davidblain-infrabel Jan 30, 2025
66f5ac5
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 30, 2025
260928e
Merge branch 'main' into feature/paginated-generic-transfer
dabla Jan 31, 2025
2 changes: 1 addition & 1 deletion airflow/operators/__init__.py
@@ -43,7 +43,7 @@
"BranchDateTimeOperator": "airflow.providers.standard.operators.datetime.BranchDateTimeOperator",
},
"generic_transfer": {
"GenericTransfer": "airflow.providers.standard.operators.generic_transfer.GenericTransfer",
"GenericTransfer": "airflow.providers.common.sql.operators.generic_transfer.GenericTransfer",
},
"weekday": {
"BranchDayOfWeekOperator": "airflow.providers.standard.operators.weekday.BranchDayOfWeekOperator",
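The mapping above redirects the deprecated `airflow.operators.generic_transfer` import path to the operator's new home in the common.sql provider. A minimal sketch of what this appears to mean for DAG code (the legacy path should still resolve through the deprecation mapping shown above, while the new module path is the one registered in provider.yaml below):

# New import location after this change (common.sql provider)
from airflow.providers.common.sql.operators.generic_transfer import GenericTransfer

# Legacy import, still resolved via the deprecation mapping above
# from airflow.operators.generic_transfer import GenericTransfer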
1 change: 0 additions & 1 deletion generated/provider_dependencies.json
@@ -1281,7 +1281,6 @@
},
"standard": {
"deps": [
"apache-airflow-providers-common-sql>=1.20.0",
"apache-airflow>=2.9.0"
],
"devel-deps": [],
6 changes: 6 additions & 0 deletions providers/common/sql/provider.yaml
@@ -76,6 +76,7 @@ operators:
- integration-name: Common SQL
python-modules:
- airflow.providers.common.sql.operators.sql
- airflow.providers.common.sql.operators.generic_transfer

dialects:
- dialect-type: default
@@ -87,6 +88,11 @@ hooks:
- airflow.providers.common.sql.hooks.handlers
- airflow.providers.common.sql.hooks.sql

triggers:
- integration-name: Common SQL
python-modules:
- airflow.providers.common.sql.triggers.sql

sensors:
- integration-name: Common SQL
python-modules:
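The new `triggers` section registers the module that ships `SQLExecuteQueryTrigger`; a minimal sketch of the corresponding import, matching the one used in the new generic_transfer module further down:

# Trigger module registered above; GenericTransfer defers to this trigger when page_size is set
from airflow.providers.common.sql.triggers.sql import SQLExecuteQueryTrigger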
@@ -80,7 +80,10 @@ def get_provider_info():
"operators": [
{
"integration-name": "Common SQL",
"python-modules": ["airflow.providers.common.sql.operators.sql"],
"python-modules": [
"airflow.providers.common.sql.operators.sql",
"airflow.providers.common.sql.operators.generic_transfer",
],
}
],
"dialects": [
@@ -98,6 +101,12 @@
],
}
],
"triggers": [
{
"integration-name": "Common SQL",
"python-modules": ["airflow.providers.common.sql.triggers.sql"],
}
],
"sensors": [
{"integration-name": "Common SQL", "python-modules": ["airflow.providers.common.sql.sensors.sql"]}
],
New file: airflow.providers.common.sql.operators.generic_transfer (219 lines added)
@@ -0,0 +1,219 @@
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
from __future__ import annotations

from collections.abc import Sequence
from functools import cached_property
from typing import TYPE_CHECKING, Any

from airflow.exceptions import AirflowException
from airflow.hooks.base import BaseHook
from airflow.models import BaseOperator
from airflow.providers.common.sql.hooks.sql import DbApiHook
from airflow.providers.common.sql.triggers.sql import SQLExecuteQueryTrigger

if TYPE_CHECKING:
import jinja2

try:
from airflow.sdk.definitions.context import Context
except ImportError:
# TODO: Remove once provider drops support for Airflow 2
from airflow.utils.context import Context


class GenericTransfer(BaseOperator):
"""
Moves data from one connection to another.

Assumes that both connections provide the required methods in their respective hooks.
The source hook needs to expose a `get_records` method, and the destination an
`insert_rows` method.

This is meant to be used on small-ish datasets that fit in memory.

:param sql: SQL query to execute against the source database. (templated)
:param destination_table: target table. (templated)
:param source_conn_id: source connection. (templated)
:param source_hook_params: source hook parameters.
:param destination_conn_id: destination connection. (templated)
:param destination_hook_params: destination hook parameters.
:param preoperator: SQL statement or list of statements to be
executed prior to loading the data. (templated)
:param insert_args: extra params for `insert_rows` method.
:param page_size: number of records to be read in paginated mode (optional).
"""

template_fields: Sequence[str] = (
"source_conn_id",
"destination_conn_id",
"sql",
"destination_table",
"preoperator",
"insert_args",
)
template_ext: Sequence[str] = (
".sql",
".hql",
)
template_fields_renderers = {"preoperator": "sql"}
ui_color = "#b0f07c"

def __init__(
self,
*,
sql: str,
destination_table: str,
source_conn_id: str,
source_hook_params: dict | None = None,
destination_conn_id: str,
destination_hook_params: dict | None = None,
preoperator: str | list[str] | None = None,
insert_args: dict | None = None,
page_size: int | None = None,
**kwargs,
) -> None:
super().__init__(**kwargs)
self.sql = sql
self.destination_table = destination_table
self.source_conn_id = source_conn_id
self.source_hook_params = source_hook_params
self.destination_conn_id = destination_conn_id
self.destination_hook_params = destination_hook_params
self.preoperator = preoperator
self.insert_args = insert_args or {}
self.page_size = page_size
self._paginated_sql_statement_format = kwargs.get(
"paginated_sql_statement_format", "{} LIMIT {} OFFSET {}"
)

@classmethod
def get_hook(cls, conn_id: str, hook_params: dict | None = None) -> DbApiHook:
"""
Return DbApiHook for this connection id.

:param conn_id: connection id
:param hook_params: hook parameters
:return: DbApiHook for this connection
"""
connection = BaseHook.get_connection(conn_id)
hook = connection.get_hook(hook_params=hook_params)
if not isinstance(hook, DbApiHook):
raise RuntimeError(f"Hook for connection {conn_id!r} must be of type {DbApiHook.__name__}")
return hook

@cached_property
def source_hook(self) -> DbApiHook:
return self.get_hook(conn_id=self.source_conn_id, hook_params=self.source_hook_params)

@cached_property
def destination_hook(self) -> DbApiHook:
return self.get_hook(conn_id=self.destination_conn_id, hook_params=self.destination_hook_params)

def get_paginated_sql(self, offset: int) -> str:
"""Format the paginated SQL statement using the current format."""
return self._paginated_sql_statement_format.format(self.sql, self.page_size, offset)

def render_template_fields(
self,
context: Context,
jinja_env: jinja2.Environment | None = None,
) -> None:
super().render_template_fields(context=context, jinja_env=jinja_env)

# Make sure strings are converted to integers
if isinstance(self.page_size, str):
self.page_size = int(self.page_size)
commit_every = self.insert_args.get("commit_every")
if isinstance(commit_every, str):
self.insert_args["commit_every"] = int(commit_every)

def execute(self, context: Context):
if self.preoperator:
self.log.info("Running preoperator")
self.log.info(self.preoperator)
self.destination_hook.run(self.preoperator)

if self.page_size and isinstance(self.sql, str):
self.defer(
trigger=SQLExecuteQueryTrigger(
conn_id=self.source_conn_id,
hook_params=self.source_hook_params,
sql=self.get_paginated_sql(0),
),
method_name=self.execute_complete.__name__,
)
else:
self.log.info("Extracting data from %s", self.source_conn_id)
self.log.info("Executing: \n %s", self.sql)

results = self.source_hook.get_records(self.sql)

self.log.info("Inserting rows into %s", self.destination_conn_id)
self.destination_hook.insert_rows(table=self.destination_table, rows=results, **self.insert_args)

def execute_complete(
self,
context: Context,
event: dict[Any, Any] | None = None,
) -> Any:
if event:
if event.get("status") == "failure":
raise AirflowException(event.get("message"))

results = event.get("results")

if results:
map_index = context["ti"].map_index
offset = (
context["ti"].xcom_pull(
key="offset",
task_ids=self.task_id,
dag_id=self.dag_id,
map_indexes=map_index,
default=0,
)
+ self.page_size
)

self.log.info("Offset increased to %d", offset)
self.xcom_push(context=context, key="offset", value=offset)

self.log.info("Inserting %d rows into %s", len(results), self.destination_conn_id)
self.destination_hook.insert_rows(
table=self.destination_table, rows=results, **self.insert_args
)
self.log.info(
"Inserting %d rows into %s done!",
len(results),
self.destination_conn_id,
)

self.defer(
trigger=SQLExecuteQueryTrigger(
conn_id=self.source_conn_id,
hook_params=self.source_hook_params,
sql=self.get_paginated_sql(offset),
),
method_name=self.execute_complete.__name__,
)
else:
self.log.info(
"No more rows to fetch into %s; ending transfer.",
self.destination_table,
)
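For context, a minimal usage sketch of the new paginated mode; the DAG id, table names, and connection ids below are hypothetical. With `page_size` set, `execute()` defers to `SQLExecuteQueryTrigger` and `execute_complete()` keeps paging via the LIMIT/OFFSET format shown above until a page comes back empty:

import pendulum

from airflow import DAG
from airflow.providers.common.sql.operators.generic_transfer import GenericTransfer

with DAG(
    dag_id="example_paginated_generic_transfer",  # hypothetical DAG id
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
):
    GenericTransfer(
        task_id="copy_orders",
        sql="SELECT * FROM orders",  # paginated as "SELECT * FROM orders LIMIT 1000 OFFSET <offset>"
        destination_table="orders_copy",
        source_conn_id="source_db",  # hypothetical connection id
        destination_conn_id="dest_db",  # hypothetical connection id
        page_size=1000,
        insert_args={"commit_every": 1000},
    )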
New file: generic_transfer.pyi stub for the common.sql provider public API (85 lines added)
@@ -0,0 +1,85 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
# This is automatically generated stub for the `common.sql` provider
#
# This file is generated automatically by the `update-common-sql-api stubs` pre-commit
# and the .pyi file represents part of the "public" API that the
# `common.sql` provider exposes to other providers.
#
# Any, potentially breaking change in the stubs will require deliberate manual action from the contributor
# making a change to the `common.sql` provider. Those stubs are also used by MyPy automatically when checking
# if only public API of the common.sql provider is used by all the other providers.
#
# You can read more in the README_API.md file
#
"""
Definition of the public interface for airflow.providers.common.sql.operators.generic_transfer
isort:skip_file
"""

from collections.abc import Sequence
from functools import cached_property as cached_property
from typing import Any, ClassVar

import jinja2
from _typeshed import Incomplete as Incomplete

from airflow.models import BaseOperator
from airflow.providers.common.sql.hooks.sql import DbApiHook as DbApiHook
from airflow.utils.context import Context as Context

class GenericTransfer(BaseOperator):
template_fields: Sequence[str]
template_ext: Sequence[str]
template_fields_renderers: ClassVar[dict]
ui_color: str
sql: Incomplete
destination_table: Incomplete
source_conn_id: Incomplete
source_hook_params: Incomplete
destination_conn_id: Incomplete
destination_hook_params: Incomplete
preoperator: Incomplete
insert_args: Incomplete
page_size: Incomplete
def __init__(
self,
*,
sql: str,
destination_table: str,
source_conn_id: str,
source_hook_params: dict | None = None,
destination_conn_id: str,
destination_hook_params: dict | None = None,
preoperator: str | list[str] | None = None,
insert_args: dict | None = None,
page_size: int | None = None,
**kwargs,
) -> None: ...
@classmethod
def get_hook(cls, conn_id: str, hook_params: dict | None = None) -> DbApiHook: ...
@cached_property
def source_hook(self) -> DbApiHook: ...
@cached_property
def destination_hook(self) -> DbApiHook: ...
def get_paginated_sql(self, offset: int) -> str: ...
def render_template_fields(
self, context: Context, jinja_env: jinja2.Environment | None = None
) -> None: ...
def execute(self, context: Context): ...
def execute_complete(self, context: Context, event: dict[Any, Any] | None = None) -> Any: ...