Skip to content

Commit f1f117e

Browse files
authored
Merge pull request #3680 from ClickHouse/mvs_unions
example for mvs and unions
2 parents 9608b88 + ff7be37 commit f1f117e

File tree

1 file changed

+282
-56
lines changed

1 file changed

+282
-56
lines changed

docs/materialized-view/incremental-materialized-view.md

+282-56
Original file line numberDiff line numberDiff line change
@@ -272,6 +272,104 @@ LIMIT 10
272272

273273
Note we use a `GROUP BY` here instead of using `FINAL`.
274274

275+
## Other applications {#other-applications}
276+
277+
The above focuses primarily on using Materialized Views to incrementally update partial aggregates of data, thus moving the computation from query to insert time. Beyond this common use case, Materialized Views have a number of other applications.
278+
279+
### Filtering and transformation {#filtering-and-transformation}
280+
281+
In some situations, we may wish to only insert a subset of the rows and columns on insertion. In this case, our `posts_null` table could receive inserts, with a `SELECT` query filtering rows prior to insertion into the `posts` table. For example, suppose we wished to transform a `Tags` column in our `posts` table. This contains a pipe delimited list of tag names. By converting these into an array, we can more easily aggregate by individual tag values.
282+
283+
> We could perform this transformation when running an `INSERT INTO SELECT`. The Materialized View allows us to encapsulate this logic in ClickHouse DDL and keep our `INSERT` simple, with the transformation applied to any new rows.
284+
285+
Our Materialized View for this transformation is shown below:
286+
287+
```sql
288+
CREATE MATERIALIZED VIEW posts_mv TO posts AS
289+
SELECT * EXCEPT Tags, arrayFilter(t -> (t != ''), splitByChar('|', Tags)) as Tags FROM posts_null
290+
```
291+
292+
### Lookup table {#lookup-table}
293+
294+
Users should consider their access patterns when choosing a ClickHouse ordering key. Columns which are frequently used in filter and aggregation clauses should be used. This can be restrictive for scenarios where users have more diverse access patterns which cannot be encapsulated in a single set of columns. For example, consider the following `comments` table:
295+
296+
```sql
297+
CREATE TABLE comments
298+
(
299+
`Id` UInt32,
300+
`PostId` UInt32,
301+
`Score` UInt16,
302+
`Text` String,
303+
`CreationDate` DateTime64(3, 'UTC'),
304+
`UserId` Int32,
305+
`UserDisplayName` LowCardinality(String)
306+
)
307+
ENGINE = MergeTree
308+
ORDER BY PostId
309+
310+
0 rows in set. Elapsed: 46.357 sec. Processed 90.38 million rows, 11.14 GB (1.95 million rows/s., 240.22 MB/s.)
311+
```
312+
313+
The ordering key here optimizes the table for queries which filter by `PostId`.
314+
315+
Suppose a user wishes to filter on a specific `UserId` and compute their average `Score`:
316+
317+
```sql
318+
SELECT avg(Score)
319+
FROM comments
320+
WHERE UserId = 8592047
321+
322+
┌──────────avg(Score)─┐
323+
0.18181818181818182
324+
└─────────────────────┘
325+
326+
1 row in set. Elapsed: 0.778 sec. Processed 90.38 million rows, 361.59 MB (116.16 million rows/s., 464.74 MB/s.)
327+
Peak memory usage: 217.08 MiB.
328+
```
329+
330+
While fast (the data is small for ClickHouse), we can tell this requires a full table scan from the number of rows processed - 90.38 million. For larger datasets, we can use a Materialized View to lookup our ordering key values `PostId` for filtering column `UserId`. These values can then be used to perform an efficient lookup.
331+
332+
In this example, our Materialized View can be very simple, selecting only the `PostId` and `UserId` from `comments` on insert. These results are in turn sent to a table `comments_posts_users` which is ordered by `UserId`. We create a null version of the `Comments` table below and use this to populate our view and `comments_posts_users` table:
333+
334+
```sql
335+
CREATE TABLE comments_posts_users (
336+
PostId UInt32,
337+
UserId Int32
338+
) ENGINE = MergeTree ORDER BY UserId
339+
340+
341+
CREATE TABLE comments_null AS comments
342+
ENGINE = Null
343+
344+
CREATE MATERIALIZED VIEW comments_posts_users_mv TO comments_posts_users AS
345+
SELECT PostId, UserId FROM comments_null
346+
347+
INSERT INTO comments_null SELECT * FROM comments
348+
349+
0 rows in set. Elapsed: 5.163 sec. Processed 90.38 million rows, 17.25 GB (17.51 million rows/s., 3.34 GB/s.)
350+
```
351+
352+
We can now use this View in a subquery to accelerate our previous query:
353+
354+
```sql
355+
SELECT avg(Score)
356+
FROM comments
357+
WHERE PostId IN (
358+
SELECT PostId
359+
FROM comments_posts_users
360+
WHERE UserId = 8592047
361+
) AND UserId = 8592047
362+
363+
┌──────────avg(Score)─┐
364+
0.18181818181818182
365+
└─────────────────────┘
366+
367+
1 row in set. Elapsed: 0.012 sec. Processed 88.61 thousand rows, 771.37 KB (7.09 million rows/s., 61.73 MB/s.)
368+
```
369+
370+
### Chaining {#chaining}
371+
372+
Materialized views can be chained, allowing complex workflows to be established. For a practical example, we recommend reading this [blog post](https://clickhouse.com/blog/chaining-materialized-views).
275373

276374
## Materialized Views and JOINs {#materialized-views-and-joins}
277375

@@ -623,29 +721,16 @@ INSERT INTO badges VALUES (53505058, 2936484, 'gingerwizard', now(), 'Gold', 0);
623721

624722
In the above operation, only one row is retrieved from the users table for the user id `2936484`. This lookup is also optimized with a table ordering key of `Id`.
625723

626-
## Other applications {#other-applications}
627-
628-
The above focuses primarily on using Materialized Views to incrementally update partial aggregates of data, thus moving the computation from query to insert time. Beyond this common use case, Materialized Views have a number of other applications.
629-
630-
### Filtering and transformation {#filtering-and-transformation}
631-
632-
In some situations, we may wish to only insert a subset of the rows and columns on insertion. In this case, our `posts_null` table could receive inserts, with a `SELECT` query filtering rows prior to insertion into the `posts` table. For example, suppose we wished to transform a `Tags` column in our `posts` table. This contains a pipe delimited list of tag names. By converting these into an array, we can more easily aggregate by individual tag values.
633-
634-
> We could perform this transformation when running an `INSERT INTO SELECT`. The Materialized View allows us to encapsulate this logic in ClickHouse DDL and keep our `INSERT` simple, with the transformation applied to any new rows.
635-
636-
Our Materialized View for this transformation is shown below:
724+
## Materialized Views and Unions {#materialized-views-and-unions}
637725

638-
```sql
639-
CREATE MATERIALIZED VIEW posts_mv TO posts AS
640-
SELECT * EXCEPT Tags, arrayFilter(t -> (t != ''), splitByChar('|', Tags)) as Tags FROM posts_null
641-
```
726+
`UNION ALL` queries are commonly used to combine data from multiple source tables into a single result set.
642727

643-
### Lookup table {#lookup-table}
728+
While `UNION ALL` is not directly supported in Incremental Materialized Views, you can achieve the same outcome by creating a separate Materialized View for each `SELECT` branch and writing their results to a shared target table.
644729

645-
Users should consider their access patterns when choosing a ClickHouse ordering key. Columns which are frequently used in filter and aggregation clauses should be used. This can be restrictive for scenarios where users have more diverse access patterns which cannot be encapsulated in a single set of columns. For example, consider the following `comments` table:
730+
For our example, we'll use the Stack Overflow dataset. Consider the `badges` and `comments` tables below, which represent the badges earned by a user and the comments they make on posts:
646731

647732
```sql
648-
CREATE TABLE comments
733+
CREATE TABLE stackoverflow.comments
649734
(
650735
`Id` UInt32,
651736
`PostId` UInt32,
@@ -656,68 +741,209 @@ CREATE TABLE comments
656741
`UserDisplayName` LowCardinality(String)
657742
)
658743
ENGINE = MergeTree
659-
ORDER BY PostId
744+
ORDER BY CreationDate
660745

661-
0 rows in set. Elapsed: 46.357 sec. Processed 90.38 million rows, 11.14 GB (1.95 million rows/s., 240.22 MB/s.)
746+
CREATE TABLE stackoverflow.badges
747+
(
748+
`Id` UInt32,
749+
`UserId` Int32,
750+
`Name` LowCardinality(String),
751+
`Date` DateTime64(3, 'UTC'),
752+
`Class` Enum8('Gold' = 1, 'Silver' = 2, 'Bronze' = 3),
753+
`TagBased` Bool
754+
)
755+
ENGINE = MergeTree
756+
ORDER BY UserId
662757
```
663758

664-
The ordering key here optimizes the table for queries which filter by `PostId`.
759+
These can be populated with the following `INSERT INTO` commands:
665760

666-
Suppose a user wishes to filter on a specific `UserId` and compute their average `Score`:
667761

668762
```sql
669-
SELECT avg(Score)
670-
FROM comments
671-
WHERE UserId = 8592047
763+
INSERT INTO stackoverflow.badges SELECT *
764+
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/badges.parquet')
765+
INSERT INTO stackoverflow.comments SELECT *
766+
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/comments/*.parquet')
767+
```
672768

673-
┌──────────avg(Score)─┐
674-
0.18181818181818182
675-
└─────────────────────┘
769+
Suppose we want to create a unified view of user activity, showing the last activity by each user by combining these two tables:
676770

677-
1 row in set. Elapsed: 0.778 sec. Processed 90.38 million rows, 361.59 MB (116.16 million rows/s., 464.74 MB/s.)
678-
Peak memory usage: 217.08 MiB.
771+
```sql
772+
SELECT
773+
UserId,
774+
argMax(description, event_time) AS last_description,
775+
argMax(activity_type, event_time) AS activity_type,
776+
max(event_time) AS last_activity
777+
FROM
778+
(
779+
SELECT
780+
UserId,
781+
CreationDate AS event_time,
782+
Text AS description,
783+
'comment' AS activity_type
784+
FROM stackoverflow.comments
785+
UNION ALL
786+
SELECT
787+
UserId,
788+
Date AS event_time,
789+
Name AS description,
790+
'badge' AS activity_type
791+
FROM stackoverflow.badges
792+
)
793+
GROUP BY UserId
794+
ORDER BY last_activity DESC
795+
LIMIT 10
679796
```
680797

681-
While fast (the data is small for ClickHouse), we can tell this requires a full table scan from the number of rows processed - 90.38 million. For larger datasets, we can use a Materialized View to lookup our ordering key values `PostId` for filtering column `UserId`. These values can then be used to perform an efficient lookup.
798+
Let's assume we have a target table to receive the results of this query. Note the use of the [AggregatingMergeTree](/engines/table-engines/mergetree-family/aggregatingmergetree) table engine and [AggregateFunction](/sql-reference/data-types/aggregatefunction) to ensure results are merged correctly:
682799

683-
In this example, our Materialized View can be very simple, selecting only the `PostId` and `UserId` from `comments` on insert. These results are in turn sent to a table `comments_posts_users` which is ordered by `UserId`. We create a null version of the `Comments` table below and use this to populate our view and `comments_posts_users` table:
800+
```sql
801+
CREATE TABLE user_activity
802+
(
803+
`UserId` String,
804+
`last_description` AggregateFunction(argMax, String, DateTime64(3, 'UTC')),
805+
`activity_type` AggregateFunction(argMax, String, DateTime64(3, 'UTC')),
806+
`last_activity` SimpleAggregateFunction(max, DateTime64(3, 'UTC'))
807+
)
808+
ENGINE = AggregatingMergeTree
809+
ORDER BY UserId
810+
```
811+
812+
Wanting this table to update as new rows are inserted into either `badges` or `comments`, a naive approach to this problem may be to try and create a Materialized View with the previous union query:
684813

685814
```sql
686-
CREATE TABLE comments_posts_users (
687-
PostId UInt32,
688-
UserId Int32
689-
) ENGINE = MergeTree ORDER BY UserId
815+
CREATE MATERIALIZED VIEW user_activity_mv TO user_activity AS
816+
SELECT
817+
UserId,
818+
argMaxState(description, event_time) AS last_description,
819+
argMaxState(activity_type, event_time) AS activity_type,
820+
max(event_time) AS last_activity
821+
FROM
822+
(
823+
SELECT
824+
UserId,
825+
CreationDate AS event_time,
826+
Text AS description,
827+
'comment' AS activity_type
828+
FROM stackoverflow.comments
829+
UNION ALL
830+
SELECT
831+
UserId,
832+
Date AS event_time,
833+
Name AS description,
834+
'badge' AS activity_type
835+
FROM stackoverflow.badges
836+
)
837+
GROUP BY UserId
838+
ORDER BY last_activity DESC
839+
```
690840

841+
While this is valid syntactically, it will produce unintended results - the view will only trigger inserts to the `comments` table. For example:
691842

692-
CREATE TABLE comments_null AS comments
693-
ENGINE = Null
694843

695-
CREATE MATERIALIZED VIEW comments_posts_users_mv TO comments_posts_users AS
696-
SELECT PostId, UserId FROM comments_null
844+
```sql
845+
INSERT INTO comments VALUES (99999999, 23121, 1, 'The answer is 42', now(), 2936484, 'gingerwizard');
697846

698-
INSERT INTO comments_null SELECT * FROM comments
847+
SELECT
848+
UserId,
849+
argMaxMerge(last_description) AS description,
850+
argMaxMerge(activity_type) AS activity_type,
851+
max(last_activity) AS last_activity
852+
FROM user_activity
853+
WHERE UserId = '2936484'
854+
GROUP BY UserId
699855

700-
0 rows in set. Elapsed: 5.163 sec. Processed 90.38 million rows, 17.25 GB (17.51 million rows/s., 3.34 GB/s.)
856+
┌─UserId──┬─description──────┬─activity_type─┬───────────last_activity─┐
857+
2936484 │ The answer is 42 │ comment │ 2025-04-15 09:56:19.000
858+
└─────────┴──────────────────┴───────────────┴─────────────────────────┘
859+
860+
1 row in set. Elapsed: 0.005 sec.
701861
```
702862

703-
We can now use this View in a subquery to accelerate our previous query:
863+
Inserts into the `badges` table will trigger the view, causing `user_activity` to not receive updates:
704864

705865
```sql
706-
SELECT avg(Score)
707-
FROM comments
708-
WHERE PostId IN (
709-
SELECT PostId
710-
FROM comments_posts_users
711-
WHERE UserId = 8592047
712-
) AND UserId = 8592047
866+
INSERT INTO badges VALUES (53505058, 2936484, 'gingerwizard', now(), 'Gold', 0);
713867

714-
┌──────────avg(Score)─┐
715-
0.18181818181818182
716-
└─────────────────────┘
868+
SELECT
869+
UserId,
870+
argMaxMerge(last_description) AS description,
871+
argMaxMerge(activity_type) AS activity_type,
872+
max(last_activity) AS last_activity
873+
FROM user_activity
874+
WHERE UserId = '2936484'
875+
GROUP BY UserId;
717876

718-
1 row in set. Elapsed: 0.012 sec. Processed 88.61 thousand rows, 771.37 KB (7.09 million rows/s., 61.73 MB/s.)
877+
┌─UserId──┬─description──────┬─activity_type─┬───────────last_activity─┐
878+
2936484 │ The answer is 42 │ comment │ 2025-04-15 09:56:19.000
879+
└─────────┴──────────────────┴───────────────┴─────────────────────────┘
880+
881+
1 row in set. Elapsed: 0.005 sec.
719882
```
720883

721-
### Chaining {#chaining}
884+
To solve this, we simply create a Materialized View for each SELECT statement:
722885

723-
Materialized views can be chained, allowing complex workflows to be established. For a practical example, we recommend reading this [blog post](https://clickhouse.com/blog/chaining-materialized-views).
886+
```sql
887+
DROP TABLE user_activity_mv;
888+
TRUNCATE TABLE user_activity;
889+
890+
CREATE MATERIALIZED VIEW comment_activity_mv TO user_activity AS
891+
SELECT
892+
UserId,
893+
argMaxState(Text, CreationDate) AS last_description,
894+
argMaxState('comment', CreationDate) AS activity_type,
895+
max(CreationDate) AS last_activity
896+
FROM stackoverflow.comments
897+
GROUP BY UserId;
898+
899+
CREATE MATERIALIZED VIEW badges_activity_mv TO user_activity AS
900+
SELECT
901+
UserId,
902+
argMaxState(Name, Date) AS last_description,
903+
argMaxState('badge', Date) AS activity_type,
904+
max(Date) AS last_activity
905+
FROM stackoverflow.badges
906+
GROUP BY UserId;
907+
```
908+
909+
Inserting to either table now results in the correct results. For example, if we insert into the `comments` table:
910+
911+
```sql
912+
INSERT INTO comments VALUES (99999999, 23121, 1, 'The answer is 42', now(), 2936484, 'gingerwizard');
913+
914+
SELECT
915+
UserId,
916+
argMaxMerge(last_description) AS description,
917+
argMaxMerge(activity_type) AS activity_type,
918+
max(last_activity) AS last_activity
919+
FROM user_activity
920+
WHERE UserId = '2936484'
921+
GROUP BY UserId;
922+
923+
┌─UserId──┬─description──────┬─activity_type─┬───────────last_activity─┐
924+
2936484 │ The answer is 42 │ comment │ 2025-04-15 10:18:47.000
925+
└─────────┴──────────────────┴───────────────┴─────────────────────────┘
926+
927+
1 row in set. Elapsed: 0.006 sec.
928+
```
929+
930+
Likewise, inserts into the `badges` table are reflected in the `user_activity` table:
931+
932+
```sql
933+
INSERT INTO badges VALUES (53505058, 2936484, 'gingerwizard', now(), 'Gold', 0);
934+
935+
SELECT
936+
UserId,
937+
argMaxMerge(last_description) AS description,
938+
argMaxMerge(activity_type) AS activity_type,
939+
max(last_activity) AS last_activity
940+
FROM user_activity
941+
WHERE UserId = '2936484'
942+
GROUP BY UserId
943+
944+
┌─UserId──┬─description──┬─activity_type─┬───────────last_activity─┐
945+
2936484 │ gingerwizard │ badge │ 2025-04-15 10:20:18.000
946+
└─────────┴──────────────┴───────────────┴─────────────────────────┘
947+
948+
1 row in set. Elapsed: 0.006 sec.
949+
```

0 commit comments

Comments
 (0)