Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for resource parameters causes performance issues #88

Open
titan-teej opened this issue Jul 25, 2024 · 0 comments
Open

Support for resource parameters causes performance issues #88

titan-teej opened this issue Jul 25, 2024 · 0 comments
Labels
performance Something is too slow

Comments

@titan-teej
Copy link
Contributor

Resources that support parameters (warehouse, database, schema, table, task) require at least 2 metadata queries to fetch, typically a SHOW command and a SHOW PARAMETERS IN command. For example, in Titan Schema supports the max_data_extension_time_in_days parameter.

For most fetch methods, Titan uses the broadest possible SHOW command (eg SHOW TABLES IN ACCOUNT) and caches the result. This minimizes queries to Snowflake. In the best case, only a single SHOW command is needed for all resources of the same type in the blueprint.

However, there is no way to retrieve resource parameters in bulk. So a SHOW PARAMETERS command must be run for every single resource that supports parameters. This leads to unnecessary performance issues.

My proposal is to handle parameters similar to tags: accept parameters as named kwargs just like today, but instead of storing them in the _data member, make a call to set_parameters(...) which will selectively create a new resource type: ResourceParameters.

When a user specifies parameters for a resource, Titan will track and fetch those parameters separately. Critically: if parameters aren't specified by a user, Titan will skip fetching them, saving time.

class Schema(ParameterizedResource, ...):

    def __init__(
        self,
        name: str,
        transient: bool = False,
        managed_access: bool = False,
        data_retention_time_in_days: int = None,
        max_data_extension_time_in_days: int = None,
        default_ddl_collation: str = None,
        tags: dict[str, str] = None,
        owner: str = "SYSADMIN",
        comment: str = None,
        **kwargs,
    ):
        super().__init__(name, **kwargs)
        ...
        self._data: _Schema = _Schema(
            name=self._name,
            transient=transient,
            managed_access=managed_access,
            owner=owner,
            comment=comment,
        )
        self.set_parameters(
            data_retention_time_in_days=data_retention_time_in_days,
            max_data_extension_time_in_days=max_data_extension_time_in_days,
            default_ddl_collation=default_ddl_collation,
        )
        self.set_tags(tags)


class ParameterizedResource:
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._parameters: Optional[ResourceParameters] = None

    def set_parameters(self, parameters: dict[str, str]):
        if parameters is None:
            return
        if self._parameters is None:
            self._parameters = ResourceParameters(parameters)
        else:
            raise ValueError("Parameters cannot be set on a resource that already has parameters")

    def create_tag_reference(self):
        if self._parameters is None:
            return None
        ref = parameters_for_resource(self, self._parameters)
        ref.requires(self)
        return ref

    @property
    def parameters(self) -> Optional[ResourceParameters]:
        return self._parameters

This design has trade-offs. During apply, when a resource with parameters is created for the first time, Titan will run 2 commands (CREATE & ALTER) instead of just one. To me, this trade-off is acceptable. This proposed design will significantly reduce queries required for plan and export. This design has no performance impact on apply if parameter fields aren't used.

There are issues with drift that also need to be considered.

@teej teej added the performance Something is too slow label Jul 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Something is too slow
Projects
None yet
Development

No branches or pull requests

2 participants