If you've gone through the dataflow troubleshooting and do not want to make any
changes to your query, consider [sizing up your cluster](/sql/create-cluster/#size).
A larger size cluster will provision more memory and CPU resources.

## Which part of my query runs slowly or uses a lot of memory?

{{< public-preview />}}

You can [`EXPLAIN`](/sql/explain-plan/) a query to see how it will be run as a
dataflow. In particular, `EXPLAIN PHYSICAL PLAN` will show the concrete, fully
optimized plan that Materialize will run. That plan is written in our "low-level
intermediate representation" (LIR).

For [indexes](/concepts/indexes) and [materialized
views](/concepts/views#materialized-views), you can use
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
to attribute various performance characteristics to the operators inside your
query.

Every time you create an index or materialized view, Materialize records in
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
how each higher-level LIR operator maps to zero or more lower-level dataflow
operators. You can construct queries that combine information from
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
and other internal views to discover which parts of your query are
computationally expensive (e.g.,
[`mz_introspection.mz_compute_operator_durations_histogram`](/sql/system-catalog/mz_introspection/#mz_compute_operator_durations_histogram), [`mz_introspection.mz_scheduling_elapsed`](/sql/system-catalog/mz_introspection/#mz_scheduling_elapsed))
or consume excessive memory (e.g., [`mz_introspection.mz_arrangement_sizes`](/sql/system-catalog/mz_introspection/#mz_arrangement_sizes)).

To show how you can use
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
to attribute performance characteristics, the attribution examples in this
section reference the `wins_by_item` index (and the underlying `winning_bids`
view) from the [quickstart
guide](/get-started/quickstart/#step-2-create-the-source):

```sql
CREATE SOURCE auction_house
  FROM LOAD GENERATOR AUCTION
  (TICK INTERVAL '1s', AS OF 100000)
  FOR ALL TABLES;

CREATE VIEW winning_bids AS
SELECT DISTINCT ON (a.id) b.*, a.item, a.seller
FROM auctions AS a
JOIN bids AS b
  ON a.id = b.auction_id
WHERE b.bid_time < a.end_time
  AND mz_now() >= a.end_time
ORDER BY a.id, b.amount DESC, b.bid_time, b.buyer;

CREATE INDEX wins_by_item ON winning_bids (item);
```
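
If you want to see the full plan that the attribution examples below refer to,
you can ask Materialize to explain the index directly. A minimal sketch
(assuming your Materialize version supports `EXPLAIN ... FOR INDEX`; the plan
output is elided here):

```sql
-- Show the fully optimized (LIR) plan for the wins_by_item index.
EXPLAIN PHYSICAL PLAN FOR INDEX wins_by_item;
```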

We attribute four different kinds of performance data to parts of the
`wins_by_item` query:

- [computation time](#attributing-computation-time)
- [memory usage](#attributing-memory-usage)
- [`TopK` query sizing hints](#attributing-topk-hints)
- [worker skew](#localizing-worker-skew)

### Attributing computation time

When optimizing a query, it helps to be able to attribute 'cost' to its parts,
starting with how much time is spent computing in each part overall. Materialize
reports the time spent in each _dataflow operator_ in
[`mz_introspection.mz_compute_operator_durations_histogram`](/sql/system-catalog/mz_introspection/#mz_compute_operator_durations_histogram).
By joining it with
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping),
we can attribute the time spent in each operator to the higher-level, more
intelligible LIR operators.

For example, to find out how much time is spent in each operator for the `wins_by_item` index (and the underlying `winning_bids` view), run the following query:

```sql
SELECT mo.name AS name, global_id, lir_id, parent_lir_id, REPEAT(' ', nesting * 2) || operator AS operator,
       SUM(duration_ns)/1000 * '1 microsecond'::INTERVAL AS duration, SUM(count) AS count
FROM mz_introspection.mz_lir_mapping mlm
LEFT JOIN mz_introspection.mz_compute_operator_durations_histogram mcodh
  ON (mlm.operator_id_start <= mcodh.id AND mcodh.id < mlm.operator_id_end)
JOIN mz_catalog.mz_objects mo
  ON (mlm.global_id = mo.id)
WHERE mo.name IN ('wins_by_item', 'winning_bids')
GROUP BY mo.name, global_id, lir_id, operator, parent_lir_id, nesting
ORDER BY global_id, lir_id DESC;
```

The query produces results similar to the following (your specific numbers will
vary):

{{< yaml-table data="query_attribution_computation_time_output" >}}

- The `duration` column shows that the `TopK` operator is where we spend the
bulk of the query's computation time.

- Creating an index on a view starts _two_ dataflows, as denoted by the two
  `global_id`s (see the sketch after this list):
  - `u148` is the dataflow for the `winning_bids` view (installed when the index
    is created), and
  - `u149` is the dataflow for the `wins_by_item` index on `winning_bids` (which
    arranges the results of the `winning_bids` view by the index key).
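
To see which dataflow backs each of these `global_id`s, you can join
[`mz_introspection.mz_dataflow_global_ids`](/sql/system-catalog/mz_introspection/#mz_dataflow_global_ids)
with `mz_catalog.mz_objects` (a minimal sketch):

```sql
SELECT mo.name AS name, mdgi.global_id, mdgi.id AS dataflow_id
FROM mz_introspection.mz_dataflow_global_ids mdgi
JOIN mz_catalog.mz_objects mo
  ON (mdgi.global_id = mo.id)
WHERE mo.name IN ('wins_by_item', 'winning_bids')
ORDER BY mdgi.global_id;
```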

The LIR operators reported in `mz_lir_mapping.operator` are terser than those in
`EXPLAIN PHYSICAL PLAN`. Each operator is restricted to a single line of the
form `OPERATOR [INPUT_LIR_ID ...]`. For example, `lir_id` 4 is the operator
`Arrange 3`: `Arrange` arranges in memory the results of `lir_id` 3
(`GET::PassArrangements u145`), which reads (`GET`s) dataflow `u145` (while not
shown, `u145` is the `bids` table from the `auction_house` source).
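
To look at just the LIR operators and the dataflow operator ranges they map to,
without joining in any performance data, you can query the mapping on its own.
A minimal sketch:

```sql
SELECT mo.name AS name, mlm.global_id, mlm.lir_id, mlm.parent_lir_id,
       REPEAT(' ', mlm.nesting * 2) || mlm.operator AS operator,
       mlm.operator_id_start, mlm.operator_id_end
FROM mz_introspection.mz_lir_mapping mlm
JOIN mz_catalog.mz_objects mo
  ON (mlm.global_id = mo.id)
WHERE mo.name IN ('wins_by_item', 'winning_bids')
ORDER BY mlm.global_id, mlm.lir_id DESC;
```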

To examine the attribution query in more detail:

- The query works by finding every dataflow operator in the range
(`mz_lir_mapping.operator_id_start` inclusive to
`mz_lir_mapping.operator_id_end` exclusive) and summing up the time spent
(`SUM(duration_ns)`).

- The query joins with
[`mz_catalog.mz_objects`](/sql/system-catalog/mz_catalog/#mz_objects) to find
the actual name corresponding to the `global_id`. The `WHERE mo.name IN ...`
clause of the query ensures we only see information about this index and view.
If you leave this `WHERE` clause out, you will see information on _every_
view, materialized view, and index on your current cluster.

- The `operator` is indented using [`REPEAT`](/sql/functions/#repeat) and
`mz_lir_mapping.nesting`. The indenting, combined with ordering by
`mz_lir_mapping.lir_id` descending, gives us a tree-shaped view of the LIR
plan.

### Attributing memory usage

{{< tip >}}

If you have not read about [attributing computation time](#attributing-computation-time), please do so first, as it explains some core concepts.

{{< /tip >}}

To find the memory usage of each operator for the index and view, join
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
with
[`mz_introspection.mz_arrangement_sizes`](/sql/system-catalog/mz_introspection/#mz_arrangement_sizes):

```sql
SELECT mo.name AS name, global_id, lir_id, parent_lir_id, REPEAT(' ', nesting * 2) || operator AS operator,
       pg_size_pretty(SUM(size)) AS size
FROM mz_introspection.mz_lir_mapping mlm
LEFT JOIN mz_introspection.mz_arrangement_sizes mas
  ON (mlm.operator_id_start <= mas.operator_id AND mas.operator_id < mlm.operator_id_end)
JOIN mz_catalog.mz_objects mo
  ON (mlm.global_id = mo.id)
WHERE mo.name IN ('wins_by_item', 'winning_bids')
GROUP BY mo.name, global_id, lir_id, operator, parent_lir_id, nesting
ORDER BY global_id, lir_id DESC;
```

The query produces results similar to the following (your specific numbers will
vary):

{{< yaml-table data="query_attribution_memory_usage_output" >}}

The results show:

- The [`TopK`](/transform-data/idiomatic-materialize-sql/top-k/) is
overwhelmingly responsible for memory usage.

- Arranging the `bids` table (`lir_id` 4) and the `auctions` table (`lir_id` 2)
  is cheap in comparison, as is arranging the final output in the `wins_by_item`
  index (`lir_id` 8).

### Attributing `TopK` hints

{{< tip >}}

If you have not read about [attributing computation time](#attributing-computation-time), please do so first, as it explains some core concepts.

{{< /tip >}}

The
[`mz_introspection.mz_expected_group_size_advice`](/sql/system-catalog/mz_introspection/#mz_expected_group_size_advice)
view looks at your running dataflows and suggests query hints you can set. We
can attribute this advice to particular parts of our query using
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping):

```sql
SELECT mo.name AS name, mlm.global_id AS global_id, lir_id, parent_lir_id, REPEAT(' ', nesting * 2) || operator AS operator,
       levels, to_cut, hint, pg_size_pretty(savings) AS savings
FROM mz_introspection.mz_lir_mapping mlm
JOIN mz_introspection.mz_dataflow_global_ids mdgi
  ON (mlm.global_id = mdgi.global_id)
LEFT JOIN mz_introspection.mz_expected_group_size_advice megsa
  ON (megsa.dataflow_id = mdgi.id AND
      mlm.operator_id_start <= megsa.region_id AND megsa.region_id < mlm.operator_id_end)
JOIN mz_catalog.mz_objects mo
  ON (mlm.global_id = mo.id)
WHERE mo.name IN ('wins_by_item', 'winning_bids')
ORDER BY mlm.global_id, lir_id DESC;
```

Each `TopK` operator will have an [associated `DISTINCT ON INPUT GROUP SIZE`
query hint](/transform-data/idiomatic-materialize-sql/top-k/#query-hints-1):

{{< yaml-table data="query_attribution_topk_hints_output" >}}

Here, the hinted `DISTINCT ON INPUT GROUP SIZE` is `255.0`. We can re-create our view and index using the hint as follows:

```sql
DROP VIEW winning_bids CASCADE;

CREATE VIEW winning_bids AS
SELECT DISTINCT ON (a.id) b.*, a.item, a.seller
FROM auctions AS a
JOIN bids AS b
  ON a.id = b.auction_id
WHERE b.bid_time < a.end_time
  AND mz_now() >= a.end_time
OPTIONS (DISTINCT ON INPUT GROUP SIZE = 255) -- use hint!
ORDER BY a.id,
         b.amount DESC,
         b.bid_time,
         b.buyer;

CREATE INDEX wins_by_item ON winning_bids (item);
```

Re-running the `TopK`-hints query will show only `null` hints; there are no
hints because our `TopK` is now appropriately sized. But if we re-run our [query
for attributing memory usage](#attributing-memory-usage), we can see that our
`TopK` operator uses a third of the memory it was using before:

{{< yaml-table data="query_attribution_memory_usage_w_hint_output" >}}

### Localizing worker skew

{{< tip >}}

If you have not read about [attributing computation time](#attributing-computation-time), please do so first, as it explains some core concepts.

{{< /tip >}}

[Worker skew](/transform-data/dataflow-troubleshooting/#is-work-distributed-equally-across-workers)
occurs when your data are not evenly partitioned across workers. Worker skew can
only happen when your cluster has more than one worker. You can query
[`mz_catalog.mz_cluster_replica_sizes`](/sql/system-catalog/mz_catalog/#mz_cluster_replica_sizes)
to determine how many workers a given cluster size has; in our example, there
are 4 workers.
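
For example, a minimal sketch of such a query, assuming your cluster is named
`quickstart` (adjust the name to match your own cluster):

```sql
-- List each replica of the cluster along with its size's process and
-- per-process worker counts; the total worker count is processes * workers.
SELECT c.name AS cluster, r.name AS replica, s.processes, s.workers
FROM mz_catalog.mz_clusters c
JOIN mz_catalog.mz_cluster_replicas r
  ON (r.cluster_id = c.id)
JOIN mz_catalog.mz_cluster_replica_sizes s
  ON (s.size = r.size)
WHERE c.name = 'quickstart';
```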

You can identify worker skew by comparing each worker's elapsed time to the
average elapsed time across all workers:

```sql
SELECT mo.name AS name, global_id, lir_id, REPEAT(' ', 2 * nesting) || operator AS operator,
       worker_id,
       ROUND(SUM(elapsed_ns) / SUM(aebi.total_ns), 2) AS ratio,
       SUM(epw.elapsed_ns) / 1000 * '1 microsecond'::INTERVAL AS elapsed_ns,
       SUM(aebi.total_ns) / 1000 * '1 microsecond'::INTERVAL AS avg_ns
FROM mz_introspection.mz_lir_mapping mlm
-- per-worker elapsed time for the dataflow operators in this LIR node
CROSS JOIN LATERAL (SELECT SUM(elapsed_ns) AS total_ns
                    FROM mz_introspection.mz_scheduling_elapsed_per_worker mse
                    WHERE mlm.operator_id_start <= id AND id < mlm.operator_id_end
                    GROUP BY worker_id) aebi
-- the same per-worker elapsed times, this time keeping the worker_id
CROSS JOIN LATERAL (SELECT worker_id, SUM(elapsed_ns) AS elapsed_ns
                    FROM mz_introspection.mz_scheduling_elapsed_per_worker mse
                    WHERE mlm.operator_id_start <= id AND id < mlm.operator_id_end
                    GROUP BY worker_id) epw
JOIN mz_catalog.mz_objects mo
  ON (mlm.global_id = mo.id)
WHERE mo.name IN ('wins_by_item', 'winning_bids')
GROUP BY mo.name, global_id, lir_id, nesting, operator, worker_id
ORDER BY global_id, lir_id DESC;
```

{{< yaml-table data="query_attribution_worker_skew_output" >}}

The `ratio` column tells you whether a worker is particularly over- or
under-loaded:

- A `ratio` below 1 indicates a worker doing a below-average amount of work.

- A `ratio` above 1 indicates a worker doing an above-average amount of work.

While there will always be some amount of variation, very high ratios indicate a
skewed workload.

### Writing your own attribution queries

Materialize maps LIR nodes to ranges of dataflow operators in
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping).
By combining information from
[`mz_catalog`](/sql/system-catalog/mz_catalog/) and
[`mz_introspection`](/sql/system-catalog/mz_introspection/),
you can better understand your dataflows' behavior. Using the above queries as a
starting point, you can build your own attribution queries (a skeleton template
follows the list below). When building your own, keep the following in mind:

- `mz_lir_mapping.operator` is not stable and **should not be parsed**.

- If you want to traverse the LIR tree, use `mz_lir_mapping.parent_lir_id`.

- To request additional metadata that would be useful for us to provide,
please [contact our team](https://materialize.com/contact/).

- Include `GROUP BY global_id` to avoid mixing `lir_id`s from different
  `global_id`s. Mixing `lir_id`s from different `global_id`s will produce
  nonsense.

- Use `REPEAT(' ', 2 * nesting) || operator` and `ORDER BY lir_id DESC` to
correctly render the LIR tree.

- `mz_lir_mapping.operator_id_start` is inclusive;
`mz_lir_mapping.operator_id_end` is exclusive. If they are equal to each
other, that LIR operator does not correspond to any dataflow operators.

- To only see output for views, materialized views, and indexes you're
interested in, join with `mz_catalog.mz_objects` and restrict based on
`mz_objects.name`. Otherwise, you will see information on everything installed
in your current cluster.
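
Putting these guidelines together, a skeleton attribution query might look like
the following. This is only a sketch: `my_index` and `my_view` are placeholder
names, and you can swap
[`mz_introspection.mz_scheduling_elapsed`](/sql/system-catalog/mz_introspection/#mz_scheduling_elapsed)
for whichever per-operator introspection view you care about:

```sql
SELECT mo.name AS name, mlm.global_id, mlm.lir_id, mlm.parent_lir_id,
       REPEAT(' ', mlm.nesting * 2) || mlm.operator AS operator,
       SUM(mse.elapsed_ns) AS elapsed_ns  -- or any other per-operator metric
FROM mz_introspection.mz_lir_mapping mlm
LEFT JOIN mz_introspection.mz_scheduling_elapsed mse
  ON (mlm.operator_id_start <= mse.id AND mse.id < mlm.operator_id_end)
JOIN mz_catalog.mz_objects mo
  ON (mlm.global_id = mo.id)
WHERE mo.name IN ('my_index', 'my_view')  -- restrict to the objects you care about
GROUP BY mo.name, mlm.global_id, mlm.lir_id, mlm.parent_lir_id, mlm.nesting, mlm.operator
ORDER BY mlm.global_id, mlm.lir_id DESC;
```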

## How do I troubleshoot slow queries?

Materialize stores a (sampled) log of the SQL statements that are issued against