Add measures metadata for metrics #1234
Conversation
Force-pushed from `de4e064` to `a40b9bd`.
@shangyian this is so good. Very clear and easy to follow. Minor questions in line.
I can't wait to use this in Cube's materialization and collapsing the 2 styles into one!
```python
aggregation=func.name.name.upper(),
rule=AggregationRule(
    type=Aggregability.FULL
    if func.quantifier != "DISTINCT"
```
We should probably make a constant for `"DISTINCT"`.
Good point, let me add this
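A minimal sketch of what that constant could look like (the constant's name, placement, and the helper around it are assumptions for illustration, not the PR's actual code):

```python
# Hypothetical module-level constant replacing the bare "DISTINCT" literal.
DISTINCT = "DISTINCT"

def aggregability_for(quantifier):
    """Return the aggregability label implied by a function's quantifier."""
    # COUNT(DISTINCT x) and friends are only limitedly aggregatable.
    return "LIMITED" if quantifier == DISTINCT else "FULL"
```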
```python
def update_ast(self, func, measures: list[Measure]):
    """
    Updates the query AST based on the measures derived from the function.
```
How does this function make any updates to `self`? Is it by updating the `func` pointers?
Yeah, you're right, it actually doesn't make any changes to `self` and just updates `func`. Let me refactor this method to be a `staticmethod` instead.
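A rough sketch of that refactor (the class name, the `Measure` shape, and the in-place mutation here are assumed for illustration; the real method likely does more):

```python
from dataclasses import dataclass

@dataclass
class Measure:
    # Minimal, assumed shape of a measure for illustration.
    name: str

class MetricDecomposer:
    @staticmethod
    def update_ast(func, measures: list[Measure]) -> None:
        """
        Rewrites the function node in place to reference the derived
        measures; no instance state is touched, hence the staticmethod.
        """
        # Illustrative mutation: point the function's arguments at the
        # measure names instead of the original columns.
        func.args = [measure.name for measure in measures]
```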
Great PR! Really enjoyable & elegant to read.
Initially, I was envisioning composing metric expressions from user-input measures. This PR's approach of parsing arbitrary, user-input metric expressions into measures is more general at the cost of greater complexity. Thanks to you, I think the solution here wasn't that crazy complex.
Plus, the experimentation UI side may force users to construct metric expressions out of constrained measures anyway, so the risk of us parsing the metric expression wrong is further reduced.
Thanks again!
```python
query_ast = parse(metric_query)
measures = []

for idx, func in enumerate(query_ast.find_all(ast.Function)):
```
Should we operate on all Functions here or just Aggregate Functions?
Actually these are already just aggregate functions, because at the moment they're limited to the ones registered under `self.handlers`, but let me add a check anyway to future-proof this.
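A hedged sketch of such a future-proofing check (the set of names and the generator shape are assumptions; the real code keys handlers differently):

```python
# Illustrative registry of aggregate function names; the PR registers
# handlers per function, so this flat set is an assumption.
AGGREGATE_FUNCTIONS = {"SUM", "COUNT", "AVG", "MIN", "MAX"}

def iter_aggregations(functions):
    """Yield (index, func) pairs, skipping any non-aggregate function."""
    for idx, func in enumerate(functions):
        if func.name.upper() not in AGGREGATE_FUNCTIONS:
            continue  # future-proofing: ignore scalar functions like LOWER
        yield idx, func
```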
```python
    ),
]

def _avg(self, func, idx) -> list[Measure]:
```
q: Do we expect `_avg` to ever be used for a func other than AVG?
I don't think so. This should only be mapped to `AVG`.
Maybe it's worth raising when `func` is not `AVG` then.
```python
LIMITED = "limited"
NONE = "none"
```
What's an example of "limited aggregability"? `COUNT DISTINCT` is categorized as "limited" below, but I can't think of a way to aggregate `COUNT DISTINCT`.
Let's consider a metric definition like `COUNT(DISTINCT user_id)`. This can be decomposed into a single measure with expression `DISTINCT user_id` that can be aggregatable if the measure is calculated at least at the `user_id` level, but not if it's calculated at some other level (e.g., `country`).
Technically, if the aggregability type is `LIMITED`, the `level` should be specified to highlight the level at which the measure needs to be aggregated in order to support the specified aggregation function. I haven't added this yet because I'm not sure how useful it is in practice.
So in the particular example of `COUNT(DISTINCT user_id)`, it can be decomposed into:
```yaml
- name: users_count_distinct
  expression: DISTINCT user_id
  aggregation: COUNT
  rule:
    type: LIMITED
    level: ["user_id"]
```
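The metadata described above could be modeled roughly like this (field names mirror the YAML; the exact classes and enum values in the PR may differ):

```python
from __future__ import annotations
from dataclasses import dataclass
from enum import Enum

class Aggregability(Enum):
    FULL = "full"
    LIMITED = "limited"
    NONE = "none"

@dataclass
class AggregationRule:
    type: Aggregability
    # For LIMITED aggregability: the grain the measure must be computed
    # at for further aggregation to stay valid (e.g. ["user_id"]).
    level: list[str] | None = None

# The COUNT(DISTINCT user_id) decomposition from the thread:
rule = AggregationRule(type=Aggregability.LIMITED, level=["user_id"])
```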
Not saying this always makes sense, but I can see one doing some additional aggs (say AVG()) on top of count_distinct results over some particular dimensions. And if your data is MECE over some dimensions you could even do SUM() on top of it.
Yeah, agreed, although it does also get super complicated to manage or indicate through metadata.
> This can be decomposed into a single measure with expression DISTINCT user_id that can be aggregatable if the measure is calculated at least at the user_id level, but not if it's calculated at some other level (e.g., country).

OK, I see the case for "limited aggregability" now.
But I think the example above is backward. Let's say we want to `COUNT DISTINCT user_id` globally. It is possible to first `COUNT DISTINCT user_id GROUP BY country`, then `SUM(count_distinct_user_id)`. Essentially, GROUP BY any grain that is as or more coarse than `user_id` is allowed.
Is that right?
> Essentially, GROUP BY any grain that is as or more coarse than user_id is allowed.

Only if the grain is 1:many with the `user_id`, otherwise you may be double counting.
@anhqle if you did `COUNT DISTINCT user_id GROUP BY country` first, then you would lose the ability to distinguish between overlapping `user_id`s across countries, and end up with double-counted results.
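The double-counting risk is easy to see with a tiny sketch (the data is made up for illustration):

```python
# Toy data: user u1 appears in two countries, so per-country distinct
# counts overlap.
rows = [("us", "u1"), ("us", "u2"), ("fr", "u1")]

# True global distinct count: {u1, u2}.
global_distinct = len({user for _, user in rows})  # 2

# Per-country COUNT DISTINCT, then SUM: u1 is counted once per country.
per_country: dict[str, set] = {}
for country, user in rows:
    per_country.setdefault(country, set()).add(user)
summed = sum(len(users) for users in per_country.values())  # 3

assert summed > global_distinct  # the rollup double-counts u1
```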
Yeah, my unstated assumption is that user:country is many:one.
In any case, it's complicated to track that relationship, as we discussed above. So in practice, "limited aggregability" only works when the measure is aggregated to exactly `level`.
Force-pushed from `41c8c37` to `3d035be`.
Force-pushed from `7785015` to `4860eef` (…rate SQL derived from measures).
Summary
The detailed context and explanation of the broader problem we're trying to solve is in the issue. To briefly summarize here:
For any set of metrics and dimensions, we can generate measures SQL, which is used to materialize a dataset that can serve those metrics and dimensions in a more efficient manner. Today, the generated measures SQL is non-optimized - the measures are computed at the level of the metrics' upstream transforms, without considering any downstream aggregation or grain optimizations.
We can make the generated measures SQL significantly more efficient by breaking down metrics into pre-aggregated (but still further aggregatable) measures. This PR provides appropriate metadata for a given metric on its measures, as a first step towards better measures SQL.
For example, metric expressions like:

- `SUM(sales_amount) * 100.0`
- `AVG(revenue)`
- `SUM(clicks) / SUM(impressions)`

can be decomposed into measures (`sales_amount`, `revenue`, `user_count`), along with aggregation and other metadata.

Test Plan

- `make check` passes
- `make test` shows 100% unit test coverage

Deployment Plan