Add measures metadata for metrics #1234

Merged
merged 8 commits into from
Dec 3, 2024

Conversation

Contributor

@shangyian shangyian commented Dec 2, 2024

Summary

The detailed context and explanation of the broader problem we're trying to solve is in the issue. To briefly summarize here:

For any set of metrics and dimensions, we can generate measures SQL, which is used to materialize a dataset that can serve those metrics and dimensions in a more efficient manner. Today, the generated measures SQL is non-optimized - the measures are computed at the level of the metrics' upstream transforms, without considering any downstream aggregation or grain optimizations.

We can make the generated measures SQL significantly more efficient by breaking down metrics into pre-aggregated (but still further aggregatable) measures. This PR provides the appropriate metadata about a given metric's measures, as a first step towards better measures SQL.

image

  • Metrics are combinations of one or more measures with aggregations or formulas applied (e.g., SUM(sales_amount) * 100.0, AVG(revenue), SUM(clicks) / SUM(impressions)).
  • Measures are components used to build metrics (e.g., sales_amount, revenue, user_count), along with aggregation and other metadata:
measures:
- name: sales_amount_sum
  expression: amount
  aggregation: SUM
  rule: FULL
- name: user_id_count_distinct
  expression: "distinct user_id"
  aggregation: COUNT
  rule: LIMITED
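
To make "pre-aggregated but still further aggregatable" concrete, here is a minimal sketch (not code from this PR) of how a metric like AVG(revenue) could be served from two measures. The rows, the `revenue_sum` / `revenue_count` measure names, and both helper functions are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical (country, revenue) rows from a metric's upstream transform.
rows = [
    ("US", 10.0),
    ("US", 30.0),
    ("CA", 20.0),
]


def preaggregate(rows):
    """Materialize the two measures behind AVG(revenue) at the country grain."""
    measures = defaultdict(lambda: {"revenue_sum": 0.0, "revenue_count": 0})
    for country, revenue in rows:
        measures[country]["revenue_sum"] += revenue
        measures[country]["revenue_count"] += 1
    return dict(measures)


def avg_revenue(measures):
    """Roll the pre-aggregated measures up further, then apply the metric formula."""
    total = sum(m["revenue_sum"] for m in measures.values())
    count = sum(m["revenue_count"] for m in measures.values())
    return total / count


direct = sum(revenue for _, revenue in rows) / len(rows)
print(avg_revenue(preaggregate(rows)), direct)  # 20.0 20.0
```

Both measures here are fully aggregatable: sums of sums and sums of counts stay correct at any coarser grain, which is why the metric can be rebuilt from the smaller materialized dataset.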

Test Plan

Deployment Plan

Copy link

netlify bot commented Dec 2, 2024

Deploy Preview for thriving-cassata-78ae72 canceled.

Name Link
🔨 Latest commit 08a1a42
🔍 Latest deploy log https://app.netlify.com/sites/thriving-cassata-78ae72/deploys/674ec5428e11db0008da42db

Member

@agorajek agorajek left a comment

@shangyian this is so good. Very clear and easy to follow. Minor questions in line.

I can't wait to use this in Cube's materialization and collapse the two styles into one!

aggregation=func.name.name.upper(),
rule=AggregationRule(
    type=Aggregability.FULL
    if func.quantifier != "DISTINCT"
Member

We should probably make a constant for "DISTINCT".

Contributor Author

Good point, let me add this


def update_ast(self, func, measures: list[Measure]):
    """
    Updates the query AST based on the measures derived from the function.
Member

How does this function make any updates to self? Is it by updating the func pointers?

Contributor Author

Yeah, you're right, it actually doesn't make any changes to self and just updates func. Let me refactor this method to be a staticmethod instead.

Contributor

@anhqle anhqle left a comment

Great PR! Really enjoyable & elegant to read.

Initially, I was envisioning composing metric expressions from user-input measures. This PR's approach of parsing arbitrary, user-input metric expressions into measures is more general, at the cost of greater complexity. Thanks to you, I think the solution here isn't that crazy complex.

Plus, the experimentation UI side may force users to construct metric expressions out of constrained measures anyway, so the risk of us parsing a metric expression wrong is further reduced.

Thanks again!

query_ast = parse(metric_query)
measures = []

for idx, func in enumerate(query_ast.find_all(ast.Function)):
Contributor

Should we operate on all Functions here or just Aggregate Functions?

Contributor Author

Actually these are already just aggregate functions, because at the moment they're limited to the ones registered under self.handlers, but let me add a check anyway to future-proof this.

),
]

def _avg(self, func, idx) -> list[Measure]:
Contributor

q: Do we expect _avg to ever be used for a func other than AVG?

Contributor Author

I don't think so. This should only be mapped to AVG

Contributor

Maybe it's worth raising when func is not AVG then
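
For illustration, such a guard might look like the sketch below. It is not the PR's code: `ensure_avg` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `func.name.name` access seen in the diff.

```python
from types import SimpleNamespace


def ensure_avg(func) -> None:
    """Raise if an AVG-specific handler is handed any other function."""
    name = func.name.name.upper()
    if name != "AVG":
        raise ValueError(f"_avg handler received {name}, expected AVG")


# Stand-ins for parsed AST function nodes (hypothetical shape).
avg_call = SimpleNamespace(name=SimpleNamespace(name="avg"))
sum_call = SimpleNamespace(name=SimpleNamespace(name="sum"))

ensure_avg(avg_call)  # passes silently
try:
    ensure_avg(sum_call)
except ValueError as exc:
    print(exc)
```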

Comment on lines +15 to +16
LIMITED = "limited"
NONE = "none"
Contributor

What's an example of "limited aggregability"? COUNT DISTINCT is categorized as "limited" below, but I can't think of a way to aggregate COUNT DISTINCT

Contributor Author

Let's consider a metric definition like COUNT(DISTINCT user_id). This can be decomposed into a single measure with expression DISTINCT user_id that can be aggregatable if the measure is calculated at least at the user_id level, but not if it's calculated at some other level (e.g., country).

Technically, if the aggregability type is LIMITED, the rule should also specify the level to which the measure must be aggregated in order to support the specified aggregation function. I haven't added this yet because I'm not sure how useful it is in practice.

So in the particular example of COUNT(DISTINCT user_id), it can be decomposed into:

    - name: users_count_distinct
      expression: DISTINCT user_id
      aggregation: COUNT
      rule:
        type: LIMITED
        level: ["user_id"]
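
As a rough sketch of how a rule with a `level` could be represented and checked, assuming dataclass-style models: the names `Aggregability`, `AggregationRule`, and `Measure` come from the diff, but the `level` field, the `"full"` enum value, and the `can_serve` helper are assumptions and not part of this PR. The check follows the "calculated at least at the user_id level" reading above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Aggregability(str, Enum):
    FULL = "full"  # assumed value; only the two below are visible in the diff
    LIMITED = "limited"
    NONE = "none"


@dataclass(frozen=True)
class AggregationRule:
    type: Aggregability = Aggregability.FULL
    level: Optional[list[str]] = None  # hypothetical field, not yet in the PR


@dataclass(frozen=True)
class Measure:
    name: str
    expression: str
    aggregation: str
    rule: AggregationRule


def can_serve(measure: Measure, grain: list[str]) -> bool:
    """Hypothetical check: can a dataset at `grain` still support this measure?"""
    rule = measure.rule
    if rule.type == Aggregability.FULL:
        return True
    if rule.type == Aggregability.LIMITED:
        # The dataset grain must retain every column named in the rule's level.
        return rule.level is not None and set(rule.level) <= set(grain)
    return False


users = Measure(
    name="users_count_distinct",
    expression="DISTINCT user_id",
    aggregation="COUNT",
    rule=AggregationRule(type=Aggregability.LIMITED, level=["user_id"]),
)

print(can_serve(users, ["user_id", "country"]))  # True
print(can_serve(users, ["country"]))  # False
```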

Member

Not saying this always makes sense, but I can see one doing some additional aggs (say AVG()) on top of count_distinct results over some particular dimensions. And if your data is MECE over some dimensions you could even do SUM() on top of it.

Contributor Author

Yeah, agreed, although it does also get super complicated to manage or indicate through metadata.

Contributor

This can be decomposed into a single measure with expression DISTINCT user_id that can be aggregatable if the measure is calculated at least at the user_id level, but not if it's calculated at some other level (e.g., country).

OK, I see the case for "limited aggregability" now.
But I think the example above is backward. Let's say we want to COUNT DISTINCT user_id globally. It is possible to first COUNT DISTINCT user_id GROUP BY country, then SUM(count_distinct_user_id). Essentially, GROUP BY any grain that is as or more coarse than user_id is allowed.

Is that right?

Member

Essentially, GROUP BY any grain that is as or more coarse than user_id is allowed.

Only if the grain is 1:many with the user_id, otherwise you may be double counting.

Contributor Author

@anhqle if you did COUNT DISTINCT user_id GROUP BY country first, then you would lose the ability to distinguish between overlapping user_ids across countries, and end up with double-counted results.
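
A quick way to see the double counting, as a sketch with made-up rows (the `events` data and all variable names are illustrative only):

```python
# Hypothetical (user_id, country) rows; user u1 appears in two countries.
events = [("u1", "US"), ("u1", "CA"), ("u2", "US"), ("u3", "CA")]

# Ground truth: COUNT(DISTINCT user_id) over all rows.
true_count = len({user for user, _ in events})

# Pre-aggregating to the country grain collapses user_ids into per-country counts.
users_by_country = {}
for user, country in events:
    users_by_country.setdefault(country, set()).add(user)
count_by_country = {c: len(users) for c, users in users_by_country.items()}

# Summing those counts counts u1 once per country.
summed = sum(count_by_country.values())

# Keeping user_id in the grain preserves what the final COUNT(DISTINCT ...) needs.
user_grain = {(user, country) for user, country in events}
from_user_grain = len({user for user, _ in user_grain})

print(true_count, summed, from_user_grain)  # 3 4 3
```

The sum only matches the true count when no user appears under more than one country, which is the many:one assumption discussed in this thread.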

Contributor

Yeah, my unstated assumption is that user:country is many:one.

In any case, it's complicated to track that relationship as we discussed above. So in practice, "limited aggregability" only works when the measure is aggregated to exactly level.

@shangyian shangyian marked this pull request as ready for review December 3, 2024 08:42
@shangyian shangyian merged commit 95e607a into DataJunction:main Dec 3, 2024
16 checks passed
@shangyian shangyian deleted the decompose-metrics branch December 3, 2024 16:14
@shangyian shangyian mentioned this pull request Dec 6, 2024
3 tasks

3 participants