If you've gone through the dataflow troubleshooting and do not want to make any
changes to your query, consider [sizing up your cluster](/sql/create-cluster/#size).
A larger size cluster will provision more memory and CPU resources.

## Which part of my query runs slowly or uses a lot of memory?

{{< public-preview />}}

You can [`EXPLAIN`](/sql/explain-plan/) a query to see how it will be run as a
dataflow. In particular, `EXPLAIN PHYSICAL PLAN` will show the concrete, fully
optimized plan that Materialize will run. That plan is written in our "low-level
intermediate representation" (LIR).

For [indexes](/concepts/indexes) and [materialized
views](/concepts/views#materialized-views), you can use
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
to attribute various performance characteristics to the operators inside your
query.

Every time you create an index or materialized view, Materialize records in
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
how each higher-level LIR operator maps to zero or more lower-level dataflow
operators. You can construct queries that combine information from
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
and other internal views to discover which parts of your query are
computationally expensive (e.g.,
[`mz_introspection.mz_compute_operator_durations_histogram`](/sql/system-catalog/mz_introspection/#mz_compute_operator_durations_histogram), [`mz_introspection.mz_scheduling_elapsed`](/sql/system-catalog/mz_introspection/#mz_scheduling_elapsed))
or consume excessive memory (e.g., [`mz_introspection.mz_arrangement_sizes`](/sql/system-catalog/mz_introspection/#mz_arrangement_sizes)).

To show how you can use
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
to attribute performance characteristics, the attribution examples in this
section reference the `wins_by_item` index (and the underlying `winning_bids`
view) from the [quickstart
guide](/get-started/quickstart/#step-2-create-the-source):

```sql
CREATE SOURCE auction_house
  FROM LOAD GENERATOR AUCTION
  (TICK INTERVAL '1s', AS OF 100000)
  FOR ALL TABLES;

CREATE VIEW winning_bids AS
SELECT DISTINCT ON (a.id) b.*, a.item, a.seller
FROM auctions AS a
JOIN bids AS b
  ON a.id = b.auction_id
WHERE b.bid_time < a.end_time
  AND mz_now() >= a.end_time
ORDER BY a.id, b.amount DESC, b.bid_time, b.buyer;

CREATE INDEX wins_by_item ON winning_bids (item);
```
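
If you want to see the full plan that the attribution examples below refer to,
you can ask Materialize to explain the index directly. A minimal sketch
(assuming your Materialize version supports `EXPLAIN ... FOR INDEX`; the plan
output is elided here):

```sql
-- Show the fully optimized (LIR) plan for the wins_by_item index.
EXPLAIN PHYSICAL PLAN FOR INDEX wins_by_item;
```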

We attribute four different kinds of performance data to parts of the
`wins_by_item` query:

- [computation time](#attributing-computation-time)
- [memory usage](#attributing-memory-usage)
- [`TopK` query sizing hints](#attributing-topk-hints)
- [worker skew](#localizing-worker-skew)

### Attributing computation time

When optimizing a query, it helps to be able to attribute 'cost' to its parts,
starting with how much time is spent computing in each part overall. Materialize
reports the time spent in each _dataflow operator_ in
[`mz_introspection.mz_compute_operator_durations_histogram`](/sql/system-catalog/mz_introspection/#mz_compute_operator_durations_histogram).
By joining it with
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping),
we can attribute the time spent in each operator to the higher-level, more
intelligible LIR operators.

For example, to find out how much time is spent in each operator for the `wins_by_item` index (and the underlying `winning_bids` view), run the following query:

```sql
SELECT mo.name AS name, global_id, lir_id, parent_lir_id, REPEAT(' ', nesting * 2) || operator AS operator,
       SUM(duration_ns)/1000 * '1 microsecond'::INTERVAL AS duration, SUM(count) AS count
FROM mz_introspection.mz_lir_mapping mlm
LEFT JOIN mz_introspection.mz_compute_operator_durations_histogram mcodh
  ON (mlm.operator_id_start <= mcodh.id AND mcodh.id < mlm.operator_id_end)
JOIN mz_catalog.mz_objects mo
  ON (mlm.global_id = mo.id)
WHERE mo.name IN ('wins_by_item', 'winning_bids')
GROUP BY mo.name, global_id, lir_id, operator, parent_lir_id, nesting
ORDER BY global_id, lir_id DESC;
```

The query produces results similar to the following (your specific numbers will
vary):

{{< yaml-table data="query_attribution_computation_time_output" >}}

- The `duration` column shows that the `TopK` operator is where we spend the
bulk of the query's computation time.

- Creating an index on a view starts _two_ dataflows, as denoted by the two
  `global_id`s (see the sketch after this list):
  - `u148` is the dataflow for the `winning_bids` view (installed when the index
    is created), and
  - `u149` is the dataflow for the `wins_by_item` index on `winning_bids` (which
    arranges the results of the `winning_bids` view by the index key).
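
To see which dataflow backs each of these `global_id`s, you can join
[`mz_introspection.mz_dataflow_global_ids`](/sql/system-catalog/mz_introspection/#mz_dataflow_global_ids)
with `mz_catalog.mz_objects` (a minimal sketch):

```sql
SELECT mo.name AS name, mdgi.global_id, mdgi.id AS dataflow_id
FROM mz_introspection.mz_dataflow_global_ids mdgi
JOIN mz_catalog.mz_objects mo
  ON (mdgi.global_id = mo.id)
WHERE mo.name IN ('wins_by_item', 'winning_bids')
ORDER BY mdgi.global_id;
```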

The LIR operators reported in `mz_lir_mapping.operator` are terser than those in
`EXPLAIN PHYSICAL PLAN`. Each operator is restricted to a single line of the
form `OPERATOR [INPUT_LIR_ID ...]`. For example, `lir_id` 4 is the operator
`Arrange 3`: `Arrange` arranges in memory the results of `lir_id` 3
(`GET::PassArrangements u145`), which reads (`GET`s) dataflow `u145` (while not
shown, `u145` is the `bids` table from the `auction_house` source).
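
To look at just the LIR operators and the dataflow operator ranges they map to,
without joining in any performance data, you can query the mapping on its own.
A minimal sketch:

```sql
SELECT mo.name AS name, mlm.global_id, mlm.lir_id, mlm.parent_lir_id,
       REPEAT(' ', mlm.nesting * 2) || mlm.operator AS operator,
       mlm.operator_id_start, mlm.operator_id_end
FROM mz_introspection.mz_lir_mapping mlm
JOIN mz_catalog.mz_objects mo
  ON (mlm.global_id = mo.id)
WHERE mo.name IN ('wins_by_item', 'winning_bids')
ORDER BY mlm.global_id, mlm.lir_id DESC;
```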

To examine the attribution query in more detail:

- The query works by finding every dataflow operator in the range
(`mz_lir_mapping.operator_id_start` inclusive to
`mz_lir_mapping.operator_id_end` exclusive) and summing up the time spent
(`SUM(duration_ns)`).

- The query joins with
[`mz_catalog.mz_objects`](/sql/system-catalog/mz_catalog/#mz_objects) to find
the actual name corresponding to the `global_id`. The `WHERE mo.name IN ...`
clause of the query ensures we only see information about this index and view.
If you leave this `WHERE` clause out, you will see information on _every_
view, materialized view, and index on your current cluster.

- The `operator` is indented using [`REPEAT`](/sql/functions/#repeat) and
`mz_lir_mapping.nesting`. The indenting, combined with ordering by
`mz_lir_mapping.lir_id` descending, gives us a tree-shaped view of the LIR
plan.

### Attributing memory usage

{{< tip >}}

If you have not read about [attributing computation time](#attributing-computation-time), please do so first, as it explains some core concepts.

{{< /tip >}}

To find the memory usage of each operator for the index and view, join
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
with
[`mz_introspection.mz_arrangement_sizes`](/sql/system-catalog/mz_introspection/#mz_arrangement_sizes):

```sql
SELECT mo.name AS name, global_id, lir_id, parent_lir_id, REPEAT(' ', nesting * 2) || operator AS operator,
       pg_size_pretty(SUM(size)) AS size
FROM mz_introspection.mz_lir_mapping mlm
LEFT JOIN mz_introspection.mz_arrangement_sizes mas
  ON (mlm.operator_id_start <= mas.operator_id AND mas.operator_id < mlm.operator_id_end)
JOIN mz_catalog.mz_objects mo
  ON (mlm.global_id = mo.id)
WHERE mo.name IN ('wins_by_item', 'winning_bids')
GROUP BY mo.name, global_id, lir_id, operator, parent_lir_id, nesting
ORDER BY global_id, lir_id DESC;
```

The query produces results similar to the following (your specific numbers will
vary):

{{< yaml-table data="query_attribution_memory_usage_output" >}}

The results show:

- The [`TopK`](/transform-data/idiomatic-materialize-sql/top-k/) is
overwhelmingly responsible for memory usage.

- Arranging the `bids` table (`lir_id` 4) and the `auctions` table (`lir_id` 2)
  is cheap in comparison, as is arranging the final output in the `wins_by_item`
  index (`lir_id` 8).

### Attributing `TopK` hints

{{< tip >}}

If you have not read about [attributing computation time](#attributing-computation-time), please do so first, as it explains some core concepts.

{{< /tip >}}

The
[`mz_introspection.mz_expected_group_size_advice`](/sql/system-catalog/mz_introspection/#mz_expected_group_size_advice)
view looks at your running dataflows and suggests query hints you can set. We
can attribute this advice to particular parts of our query using
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping):

```sql
SELECT mo.name AS name, mlm.global_id AS global_id, lir_id, parent_lir_id, REPEAT(' ', nesting * 2) || operator AS operator,
       levels, to_cut, hint, pg_size_pretty(savings) AS savings
FROM mz_introspection.mz_lir_mapping mlm
JOIN mz_introspection.mz_dataflow_global_ids mdgi
  ON (mlm.global_id = mdgi.global_id)
LEFT JOIN mz_introspection.mz_expected_group_size_advice megsa
  ON (megsa.dataflow_id = mdgi.id AND
      mlm.operator_id_start <= megsa.region_id AND megsa.region_id < mlm.operator_id_end)
JOIN mz_catalog.mz_objects mo
  ON (mlm.global_id = mo.id)
WHERE mo.name IN ('wins_by_item', 'winning_bids')
ORDER BY mlm.global_id, lir_id DESC;
```

Each `TopK` operator will have an [associated `DISTINCT ON INPUT GROUP SIZE`
query hint](/transform-data/idiomatic-materialize-sql/top-k/#query-hints-1):

{{< yaml-table data="query_attribution_topk_hints_output" >}}

Here, the hinted `DISTINCT ON INPUT GROUP SIZE` is `255.0`. We can re-create our view and index using the hint as follows:

```sql
DROP VIEW winning_bids CASCADE;

CREATE VIEW winning_bids AS
SELECT DISTINCT ON (a.id) b.*, a.item, a.seller
FROM auctions AS a
JOIN bids AS b
  ON a.id = b.auction_id
WHERE b.bid_time < a.end_time
  AND mz_now() >= a.end_time
OPTIONS (DISTINCT ON INPUT GROUP SIZE = 255) -- use hint!
ORDER BY a.id,
         b.amount DESC,
         b.bid_time,
         b.buyer;

CREATE INDEX wins_by_item ON winning_bids (item);
```

Re-running the `TopK`-hints query will show only `null` hints; there are no
hints because our `TopK` is now appropriately sized. But if we re-run our [query
for attributing memory usage](#attributing-memory-usage), we can see that our
`TopK` operator uses a third of the memory it was using before:

{{< yaml-table data="query_attribution_memory_usage_w_hint_output" >}}

### Localizing worker skew

{{< tip >}}

If you have not read about [attributing computation time](#attributing-computation-time), please do so first, as it explains some core concepts.

{{< /tip >}}

[Worker skew](/transform-data/dataflow-troubleshooting/#is-work-distributed-equally-across-workers)
occurs when your data are not evenly partitioned across workers. Worker skew can
only happen when your cluster has more than one worker. You can query
[`mz_catalog.mz_cluster_replica_sizes`](/sql/system-catalog/mz_catalog/#mz_cluster_replica_sizes)
to determine how many workers a given cluster size has; in our example, there
are 4 workers.
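
For example, a minimal sketch of such a query, assuming your cluster is named
`quickstart` (adjust the name to match your own cluster):

```sql
-- List each replica of the cluster along with its size's process and
-- per-process worker counts; the total worker count is processes * workers.
SELECT c.name AS cluster, r.name AS replica, s.processes, s.workers
FROM mz_catalog.mz_clusters c
JOIN mz_catalog.mz_cluster_replicas r
  ON (r.cluster_id = c.id)
JOIN mz_catalog.mz_cluster_replica_sizes s
  ON (s.size = r.size)
WHERE c.name = 'quickstart';
```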

You can identify worker skew by comparing each worker's elapsed time to the
average elapsed time across all workers:

```sql
SELECT mo.name AS name, global_id, lir_id, REPEAT(' ', 2 * nesting) || operator AS operator,
       worker_id,
       ROUND(SUM(elapsed_ns) / SUM(aebi.total_ns), 2) AS ratio,
       SUM(epw.elapsed_ns) / 1000 * '1 microsecond'::INTERVAL AS elapsed_ns,
       SUM(aebi.total_ns) / 1000 * '1 microsecond'::INTERVAL AS avg_ns
FROM mz_introspection.mz_lir_mapping mlm
-- per-worker elapsed time for the dataflow operators in this LIR node
CROSS JOIN LATERAL (SELECT SUM(elapsed_ns) AS total_ns
                    FROM mz_introspection.mz_scheduling_elapsed_per_worker mse
                    WHERE mlm.operator_id_start <= id AND id < mlm.operator_id_end
                    GROUP BY worker_id) aebi
-- the same per-worker elapsed times, this time keeping the worker_id
CROSS JOIN LATERAL (SELECT worker_id, SUM(elapsed_ns) AS elapsed_ns
                    FROM mz_introspection.mz_scheduling_elapsed_per_worker mse
                    WHERE mlm.operator_id_start <= id AND id < mlm.operator_id_end
                    GROUP BY worker_id) epw
JOIN mz_catalog.mz_objects mo
  ON (mlm.global_id = mo.id)
WHERE mo.name IN ('wins_by_item', 'winning_bids')
GROUP BY mo.name, global_id, lir_id, nesting, operator, worker_id
ORDER BY global_id, lir_id DESC;
```

{{< yaml-table data="query_attribution_worker_skew_output" >}}

The `ratio` column tells you whether a worker is particularly over- or
under-loaded:

- A `ratio` below 1 indicates a worker doing a below-average amount of work.

- A `ratio` above 1 indicates a worker doing an above-average amount of work.

While there will always be some amount of variation, very high ratios indicate a
skewed workload.

### Writing your own attribution queries

Materialize maps LIR nodes to ranges of dataflow operators in
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping).
By combining information from
[`mz_catalog`](/sql/system-catalog/mz_catalog/) and
[`mz_introspection`](/sql/system-catalog/mz_introspection/),
you can better understand your dataflows' behavior. Using the above queries as a
starting point, you can build your own attribution queries (a skeleton template
follows the list below). When building your own, keep the following in mind:

- `mz_lir_mapping.operator` is not stable and **should not be parsed**.

- If you want to traverse the LIR tree, use `mz_lir_mapping.parent_lir_id`.

- To request additional metadata that would be useful for us to provide,
please [contact our team](https://materialize.com/contact/).

- Include `GROUP BY global_id` to avoid mixing `lir_id`s from different
  `global_id`s. Mixing `lir_id`s from different `global_id`s will produce
  nonsense.

- Use `REPEAT(' ', 2 * nesting) || operator` and `ORDER BY lir_id DESC` to
correctly render the LIR tree.

- `mz_lir_mapping.operator_id_start` is inclusive;
`mz_lir_mapping.operator_id_end` is exclusive. If they are equal to each
other, that LIR operator does not correspond to any dataflow operators.

- To only see output for views, materialized views, and indexes you're
interested in, join with `mz_catalog.mz_objects` and restrict based on
`mz_objects.name`. Otherwise, you will see information on everything installed
in your current cluster.
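
Putting these guidelines together, a skeleton attribution query might look like
the following. This is only a sketch: `my_index` and `my_view` are placeholder
names, and you can swap
[`mz_introspection.mz_scheduling_elapsed`](/sql/system-catalog/mz_introspection/#mz_scheduling_elapsed)
for whichever per-operator introspection view you care about:

```sql
SELECT mo.name AS name, mlm.global_id, mlm.lir_id, mlm.parent_lir_id,
       REPEAT(' ', mlm.nesting * 2) || mlm.operator AS operator,
       SUM(mse.elapsed_ns) AS elapsed_ns  -- or any other per-operator metric
FROM mz_introspection.mz_lir_mapping mlm
LEFT JOIN mz_introspection.mz_scheduling_elapsed mse
  ON (mlm.operator_id_start <= mse.id AND mse.id < mlm.operator_id_end)
JOIN mz_catalog.mz_objects mo
  ON (mlm.global_id = mo.id)
WHERE mo.name IN ('my_index', 'my_view')  -- restrict to the objects you care about
GROUP BY mo.name, mlm.global_id, mlm.lir_id, mlm.parent_lir_id, mlm.nesting, mlm.operator
ORDER BY mlm.global_id, mlm.lir_id DESC;
```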

## How do I troubleshoot slow queries?

Materialize stores a (sampled) log of the SQL statements that are issued against