Support for resource parameters causes performance issues #88

titan-teej · 2024-07-25T23:47:00Z

Resources that support parameters (warehouse, database, schema, table, task) require at least 2 metadata queries to fetch, typically a SHOW command and a SHOW PARAMETERS IN command. For example, in Titan Schema supports the max_data_extension_time_in_days parameter.

For most fetch methods, Titan uses the broadest possible SHOW command (eg SHOW TABLES IN ACCOUNT) and caches the result. This minimizes queries to Snowflake. In the best case, only a single SHOW command is needed for all resources of the same type in the blueprint.

However, there is no way to retrieve resource parameters in bulk. So a SHOW PARAMETERS command must be run for every single resource that supports parameters. This leads to unnecessary performance issues.

My proposal is to handle parameters similar to tags: accept parameters as named kwargs just like today, but instead of storing them in the _data member, make a call to set_parameters(...) which will selectively create a new resource type: ResourceParameters.

When a user specifies parameters for a resource, Titan will track and fetch those parameters separately. Critically: if parameters aren't specified by a user, Titan will skip fetching them, saving time.

class Schema(ParameterizedResource, ...):

    def __init__(
        self,
        name: str,
        transient: bool = False,
        managed_access: bool = False,
        data_retention_time_in_days: int = None,
        max_data_extension_time_in_days: int = None,
        default_ddl_collation: str = None,
        tags: dict[str, str] = None,
        owner: str = "SYSADMIN",
        comment: str = None,
        **kwargs,
    ):
        super().__init__(name, **kwargs)
        ...
        self._data: _Schema = _Schema(
            name=self._name,
            transient=transient,
            managed_access=managed_access,
            owner=owner,
            comment=comment,
        )
        self.set_parameters(
            data_retention_time_in_days=data_retention_time_in_days,
            max_data_extension_time_in_days=max_data_extension_time_in_days,
            default_ddl_collation=default_ddl_collation,
        )
        self.set_tags(tags)


class ParameterizedResource:
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._parameters: Optional[ResourceParameters] = None

    def set_parameters(self, parameters: dict[str, str]):
        if parameters is None:
            return
        if self._parameters is None:
            self._parameters = ResourceParameters(parameters)
        else:
            raise ValueError("Parameters cannot be set on a resource that already has parameters")

    def create_tag_reference(self):
        if self._parameters is None:
            return None
        ref = parameters_for_resource(self, self._parameters)
        ref.requires(self)
        return ref

    @property
    def parameters(self) -> Optional[ResourceParameters]:
        return self._parameters

This design has trade-offs. During apply, when a resource with parameters is created for the first time, Titan will run 2 commands (CREATE & ALTER) instead of just one. To me, this trade-off is acceptable. This proposed design will significantly reduce queries required for plan and export. This design has no performance impact on apply if parameter fields aren't used.

There are issues with drift that also need to be considered.

The text was updated successfully, but these errors were encountered:

teej added the performance Something is too slow label Jul 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for resource parameters causes performance issues #88

Support for resource parameters causes performance issues #88

titan-teej commented Jul 25, 2024

Support for resource parameters causes performance issues #88

Support for resource parameters causes performance issues #88

Comments

titan-teej commented Jul 25, 2024