Add measures metadata for metrics #1234

shangyian · 2024-12-02T17:49:44Z

Summary

The detailed context and explanation of the broader problem we're trying to solve is in the issue. To briefly summarize here:

For any set of metrics and dimensions, we can generate measures SQL, which is used to materialize a dataset that can serve those metrics and dimensions in a more efficient manner. Today, the generated measures SQL is non-optimized - the measures are computed at the level of the metrics' upstream transforms, without considering any downstream aggregation or grain optimizations.

We can make the generated measures SQL significantly more efficient by breaking down metrics into pre-aggregated (but still further aggregatable) measures. This PR provides the appropriate metadata on a given metric's measures, as a first step towards better measures SQL.

Metrics are combinations of one or more measures with various aggregations or formulas applied (e.g., SUM(sales_amount) * 100.0, AVG(revenue), SUM(clicks) / SUM(impressions)).
Measures are components used to build metrics (e.g., sales_amount, revenue, user_count), along with aggregation and other metadata:

measures:
- name: sales_amount_sum
  expression: amount
  aggregation: SUM
  rule: FULL
- name: user_id_count_distinct
  expression: "distinct user_id"
  aggregation: COUNT
  rule: LIMITED
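
As a rough illustration of the decomposition described above, the sketch below models a measure with the same fields as the YAML (name, expression, aggregation, rule) and shows why SUM-style measures stay further aggregatable: partial sums per group can be summed again before the metric's formula is applied. The `Measure` dataclass, the `rollup_ratio` helper, and the sample rows are assumptions made for illustration, not the classes added in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    """Hypothetical stand-in mirroring the YAML fields above."""
    name: str
    expression: str
    aggregation: str
    rule: str


# SUM(clicks) / SUM(impressions) broken into two fully aggregatable measures.
CTR_MEASURES = [
    Measure("clicks_sum", "clicks", "SUM", "FULL"),
    Measure("impressions_sum", "impressions", "SUM", "FULL"),
]

# Pre-aggregated rows, e.g. one row per (country, day); made-up numbers.
preaggregated = [
    {"country": "US", "clicks_sum": 30, "impressions_sum": 600},
    {"country": "US", "clicks_sum": 20, "impressions_sum": 400},
    {"country": "CA", "clicks_sum": 5, "impressions_sum": 100},
]


def rollup_ratio(rows, numerator, denominator):
    """Re-aggregate the partial sums, then apply the metric's formula."""
    num = sum(row[numerator] for row in rows)
    den = sum(row[denominator] for row in rows)
    return num / den


overall_ctr = rollup_ratio(preaggregated, "clicks_sum", "impressions_sum")
us_ctr = rollup_ratio(
    [row for row in preaggregated if row["country"] == "US"],
    "clicks_sum",
    "impressions_sum",
)
```

The same pre-aggregated rows serve both the overall metric and the per-country slice, which is the efficiency gain the summary refers to.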

Test Plan

PR has an associated issue: Surface metric metadata on aggregation #1223
make check passes
make test shows 100% unit test coverage

Deployment Plan

netlify · 2024-12-02T17:50:52Z

✅ Deploy Preview for thriving-cassata-78ae72 canceled.

Name	Link
🔨 Latest commit	`08a1a42`
🔍 Latest deploy log	https://app.netlify.com/sites/thriving-cassata-78ae72/deploys/674ec5428e11db0008da42db

agorajek

@shangyian this is so good. Very clear and easy to follow. Minor questions in line.

I can't wait to use this in Cube's materialization and collapsing the 2 styles into one!

agorajek · 2024-12-02T19:35:23Z

datajunction-server/datajunction_server/sql/decompose.py

+                aggregation=func.name.name.upper(),
+                rule=AggregationRule(
+                    type=Aggregability.FULL
+                    if func.quantifier != "DISTINCT"


We should probably make a constant for "DISTINCT".

Good point, let me add this
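
One way the suggested constant could look is an enum of set quantifiers, in line with the later commit titled "Use enum values for all set quantifiers". This is only a sketch: `SetQuantifier`, `aggregability_for`, and the `FULL = "full"` value are assumptions (the diff excerpts in this thread only show `LIMITED` and `NONE`), and the choice of `LIMITED` for the DISTINCT branch is inferred from the discussion rather than from the code.

```python
from enum import Enum


class SetQuantifier(str, Enum):
    """Hypothetical enum for SQL set quantifiers (names are assumptions)."""
    DISTINCT = "DISTINCT"
    ALL = "ALL"


class Aggregability(str, Enum):
    """LIMITED and NONE appear in the diff; FULL's value is assumed."""
    FULL = "full"
    LIMITED = "limited"
    NONE = "none"


def aggregability_for(quantifier: str) -> Aggregability:
    """Treat DISTINCT aggregations as limited, everything else as full."""
    if quantifier.upper() == SetQuantifier.DISTINCT.value:
        return Aggregability.LIMITED
    return Aggregability.FULL
```

Comparing against the enum member keeps the string literal in one place, so a typo in "DISTINCT" cannot silently change the rule.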

agorajek · 2024-12-02T19:44:25Z

datajunction-server/datajunction_server/sql/decompose.py

+
+    def update_ast(self, func, measures: list[Measure]):
+        """
+        Updates the query AST based on the measures derived from the function.


How does this function make any updates to self? Is it by updating the func pointers?

Yeah, you're right, it actually doesn't make any changes to self and just updates func. Let me refactor this method to be a staticmethod instead.
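
A minimal sketch of the refactor being discussed, assuming a much simplified AST node: because the method only mutates the `func` argument and never reads instance state, it can be declared as a staticmethod. `FunctionNode`, `Extractor`, and the way arguments are replaced are hypothetical and do not reflect the real AST classes.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionNode:
    """Hypothetical, simplified stand-in for an AST function node."""
    name: str
    args: list = field(default_factory=list)


class Extractor:
    """Hypothetical container class; only the method shape matters here."""

    @staticmethod
    def update_ast(func: FunctionNode, measure_names: list[str]) -> None:
        """Rewrite `func` in place to read from pre-aggregated measure columns."""
        func.args = list(measure_names)


node = FunctionNode(name="SUM", args=["amount"])
Extractor.update_ast(node, ["amount_sum"])  # no Extractor instance needed
```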

anhqle

Great PR! Really enjoyable & elegant to read.

Initially, I was envisioning composing metric expressions from user-input measures. This PR's approach of parsing arbitrary, user-input metric expressions into measures is more general, at the cost of greater complexity. Thanks to you, I think the solution here wasn't that crazy complex.

Plus, the experimentation UI side may force users to construct metric expressions out of constrained measures anyway, so the risk of us parsing a metric expression wrong is further reduced.

Thanks again!

anhqle · 2024-12-02T23:50:15Z

datajunction-server/datajunction_server/sql/decompose.py

+        query_ast = parse(metric_query)
+        measures = []
+
+        for idx, func in enumerate(query_ast.find_all(ast.Function)):


Should we operate on all Functions here or just Aggregate Functions?

Actually these are already just aggregate functions, because at the moment they're limited to the ones registered under self.handlers, but let me add a check anyway to future-proof this.
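
The future-proofing check mentioned here could be sketched as below, assuming each parsed function exposes a flag saying whether it is an aggregation and that handlers are registered by upper-cased name. `FunctionNode`, `HANDLERS`, and `decomposable` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class FunctionNode:
    """Hypothetical stand-in for a parsed function call."""
    name: str
    is_aggregation: bool


# Hypothetical handler registry keyed by upper-cased function name.
HANDLERS = {"SUM": "_sum", "COUNT": "_count", "AVG": "_avg"}


def decomposable(functions):
    """Keep only functions that are aggregations and have a handler."""
    return [
        func
        for func in functions
        if func.is_aggregation and func.name.upper() in HANDLERS
    ]


found = [
    FunctionNode("sum", True),
    FunctionNode("COALESCE", False),  # scalar function, skipped
    FunctionNode("MEDIAN", True),  # aggregate, but no handler registered
]
kept = [func.name for func in decomposable(found)]
```

Checking both conditions means a scalar function that is later added to the registry by mistake would still be skipped.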

anhqle · 2024-12-02T23:59:44Z

datajunction-server/datajunction_server/sql/decompose.py

+            ),
+        ]
+
+    def _avg(self, func, idx) -> list[Measure]:


q: Do we expect _avg to ever be used for a func other than AVG?

I don't think so. This should only be mapped to AVG

Maybe it's worth raising when func is not AVG then
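
A sketch of what raising on a non-AVG function might look like, assuming AVG is decomposed into a SUM measure and a COUNT measure. The function signature, the measure naming scheme, and the `ValueError` are assumptions for illustration and are not taken from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    """Hypothetical, reduced measure with only the fields needed here."""
    name: str
    expression: str
    aggregation: str


def decompose_avg(func_name: str, arg: str, idx: int) -> list[Measure]:
    """Split AVG(arg) into a SUM and a COUNT measure; reject anything else."""
    if func_name.upper() != "AVG":
        raise ValueError(f"Expected AVG, got {func_name}")
    return [
        Measure(f"{arg}_sum_{idx}", arg, "SUM"),
        Measure(f"{arg}_count_{idx}", arg, "COUNT"),
    ]


measures = decompose_avg("avg", "revenue", 0)
```

The average can then be rebuilt downstream as the sum of the SUM measure divided by the sum of the COUNT measure.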

anhqle · 2024-12-03T01:00:45Z

datajunction-server/datajunction_server/sql/decompose.py

+    LIMITED = "limited"
+    NONE = "none"


What's an example of "limited aggregability"? COUNT DISTINCT is categorized as "limited" below, but I can't think of a way to aggregate COUNT DISTINCT

Let's consider a metric definition like COUNT(DISTINCT user_id). This can be decomposed into a single measure with expression DISTINCT user_id that can be aggregatable if the measure is calculated at least at the user_id level, but not if it's calculated at some other level (e.g., country).

Technically, if the aggregability type is LIMITED, the level should be specified to indicate the grain at which the measure must be aggregated in order to support the specified aggregation function. I haven't added this yet because I'm not sure how useful it is in practice.

So in the particular example of COUNT(DISTINCT user_id), it can be decomposed into:

- name: users_count_distinct
  expression: DISTINCT user_id
  aggregation: COUNT
  rule:
    type: LIMITED
    level: ["user_id"]

Not saying this always makes sense, but I can see one doing some additional aggs (say AVG()) on top of count_distinct results over some particular dimensions. And if your data is MECE over some dimensions you could even do SUM() on top of it.

Yeah, agreed, although it does also get super complicated to manage or indicate through metadata.

This can be decomposed into a single measure with expression DISTINCT user_id that can be aggregatable if the measure is calculated at least at the user_id level, but not if it's calculated at some other level (e.g., country).

OK, I see the case for "limited aggregability" now.
But I think the example above is backward. Let's say we want to COUNT DISTINCT user_id globally. It is possible to first COUNT DISTINCT user_id GROUP BY country, then SUM(count_distinct_user_id). Essentially, GROUP BY any grain that is as or more coarse than user_id is allowed.

Is that right?

Essentially, GROUP BY any grain that is as or more coarse than user_id is allowed.

Only if the grain is 1:many with the user_id, otherwise you may be double counting.

@anhqle if you did COUNT DISTINCT user_id GROUP BY country first, then you would lose the ability to distinguish between overlapping user_ids across countries, and end up with double-counted results.

Yeah, my unstated assumption is that user:country is many:one.

In any case, it's complicated to track that relationship, as we discussed above. So in practice, "limited aggregability" only works when the measure is aggregated to exactly the specified level.
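
The double-counting concern in this thread can be reproduced with a small worked example. The rows below are made up, and user 2 deliberately appears in two countries, so user:country is not many:one.

```python
# Made-up events; user 2 appears in two countries.
events = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "US"},
    {"user_id": 2, "country": "CA"},
    {"user_id": 3, "country": "CA"},
]

# Ground truth: COUNT(DISTINCT user_id) over all rows.
true_count = len({e["user_id"] for e in events})

# COUNT(DISTINCT user_id) GROUP BY country, then SUM of those counts.
per_country = {}
for e in events:
    per_country.setdefault(e["country"], set()).add(e["user_id"])
summed_counts = sum(len(users) for users in per_country.values())

# Pre-aggregating at the user_id level keeps the distinct values around,
# so the count can still be computed correctly afterwards.
per_user_rows = {(e["user_id"],) for e in events}
count_from_user_level = len(per_user_rows)
```

Here `true_count` and `count_from_user_level` are both 3, while `summed_counts` is 4 because user 2 is counted once per country. Summing per-country counts is only safe when each user belongs to exactly one country.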

shangyian force-pushed the decompose-metrics branch from de4e064 to a40b9bd Compare December 2, 2024 17:54

shangyian requested review from agorajek and anhqle December 2, 2024 17:54

agorajek reviewed Dec 2, 2024

anhqle approved these changes Dec 3, 2024

shangyian force-pushed the decompose-metrics branch from 41c8c37 to 3d035be Compare December 3, 2024 07:20

shangyian added 7 commits December 3, 2024 00:41

Add the ability to extract measures from a metric definition and generate SQL derived from measures

f24e578

Add tests for measure extraction and derived sql generation

1489c5f

Add aggregation rule metadata to measures

c0603ac

Use enum values for all set quantifiers

d31116b

Add clearer comments on the limited aggregability type

5d72f51

Add check to make sure the decomposed functions are only aggregate functions

cb83827

Fix coverage

4860eef

shangyian force-pushed the decompose-metrics branch from 7785015 to 4860eef Compare December 3, 2024 08:41

shangyian marked this pull request as ready for review December 3, 2024 08:42

Fix

08a1a42

shangyian merged commit 95e607a into DataJunction:main Dec 3, 2024
16 checks passed

shangyian deleted the decompose-metrics branch December 3, 2024 16:14

shangyian mentioned this pull request Dec 6, 2024

Add derived measures to GraphQL #1241

Merged

3 tasks
