You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: docs/materialized-view/incremental-materialized-view.md
+282-56
Original file line number
Diff line number
Diff line change
@@ -272,6 +272,104 @@ LIMIT 10
272
272
273
273
Note we use a `GROUP BY` here instead of using `FINAL`.
274
274
275
+
## Other applications {#other-applications}
276
+
277
+
The above focuses primarily on using Materialized Views to incrementally update partial aggregates of data, thus moving the computation from query to insert time. Beyond this common use case, Materialized Views have a number of other applications.
278
+
279
+
### Filtering and transformation {#filtering-and-transformation}
280
+
281
+
In some situations, we may wish to only insert a subset of the rows and columns on insertion. In this case, our `posts_null` table could receive inserts, with a `SELECT` query filtering rows prior to insertion into the `posts` table. For example, suppose we wished to transform a `Tags` column in our `posts` table. This contains a pipe delimited list of tag names. By converting these into an array, we can more easily aggregate by individual tag values.
282
+
283
+
> We could perform this transformation when running an `INSERT INTO SELECT`. The Materialized View allows us to encapsulate this logic in ClickHouse DDL and keep our `INSERT` simple, with the transformation applied to any new rows.
284
+
285
+
Our Materialized View for this transformation is shown below:
286
+
287
+
```sql
288
+
CREATE MATERIALIZED VIEW posts_mv TO posts AS
289
+
SELECT* EXCEPT Tags, arrayFilter(t -> (t !=''), splitByChar('|', Tags)) as Tags FROM posts_null
290
+
```
291
+
292
+
### Lookup table {#lookup-table}
293
+
294
+
Users should consider their access patterns when choosing a ClickHouse ordering key. Columns which are frequently used in filter and aggregation clauses should be used. This can be restrictive for scenarios where users have more diverse access patterns which cannot be encapsulated in a single set of columns. For example, consider the following `comments` table:
295
+
296
+
```sql
297
+
CREATETABLEcomments
298
+
(
299
+
`Id` UInt32,
300
+
`PostId` UInt32,
301
+
`Score` UInt16,
302
+
`Text` String,
303
+
`CreationDate` DateTime64(3, 'UTC'),
304
+
`UserId` Int32,
305
+
`UserDisplayName` LowCardinality(String)
306
+
)
307
+
ENGINE = MergeTree
308
+
ORDER BY PostId
309
+
310
+
0 rows inset. Elapsed: 46.357 sec. Processed 90.38 million rows, 11.14 GB (1.95 million rows/s., 240.22 MB/s.)
311
+
```
312
+
313
+
The ordering key here optimizes the table for queries which filter by `PostId`.
314
+
315
+
Suppose a user wishes to filter on a specific `UserId` and compute their average `Score`:
316
+
317
+
```sql
318
+
SELECTavg(Score)
319
+
FROM comments
320
+
WHERE UserId =8592047
321
+
322
+
┌──────────avg(Score)─┐
323
+
│ 0.18181818181818182 │
324
+
└─────────────────────┘
325
+
326
+
1 row inset. Elapsed: 0.778 sec. Processed 90.38 million rows, 361.59 MB (116.16 million rows/s., 464.74 MB/s.)
327
+
Peak memory usage: 217.08 MiB.
328
+
```
329
+
330
+
While fast (the data is small for ClickHouse), we can tell this requires a full table scan from the number of rows processed - 90.38 million. For larger datasets, we can use a Materialized View to lookup our ordering key values `PostId` for filtering column `UserId`. These values can then be used to perform an efficient lookup.
331
+
332
+
In this example, our Materialized View can be very simple, selecting only the `PostId` and `UserId` from `comments` on insert. These results are in turn sent to a table `comments_posts_users` which is ordered by `UserId`. We create a null version of the `Comments` table below and use this to populate our view and `comments_posts_users` table:
333
+
334
+
```sql
335
+
CREATETABLEcomments_posts_users (
336
+
PostId UInt32,
337
+
UserId Int32
338
+
) ENGINE = MergeTree ORDER BY UserId
339
+
340
+
341
+
CREATETABLEcomments_nullAS comments
342
+
ENGINE =Null
343
+
344
+
CREATE MATERIALIZED VIEW comments_posts_users_mv TO comments_posts_users AS
345
+
SELECT PostId, UserId FROM comments_null
346
+
347
+
INSERT INTO comments_null SELECT*FROM comments
348
+
349
+
0 rows inset. Elapsed: 5.163 sec. Processed 90.38 million rows, 17.25 GB (17.51 million rows/s., 3.34 GB/s.)
350
+
```
351
+
352
+
We can now use this View in a subquery to accelerate our previous query:
353
+
354
+
```sql
355
+
SELECTavg(Score)
356
+
FROM comments
357
+
WHERE PostId IN (
358
+
SELECT PostId
359
+
FROM comments_posts_users
360
+
WHERE UserId =8592047
361
+
) AND UserId =8592047
362
+
363
+
┌──────────avg(Score)─┐
364
+
│ 0.18181818181818182 │
365
+
└─────────────────────┘
366
+
367
+
1 row inset. Elapsed: 0.012 sec. Processed 88.61 thousand rows, 771.37 KB (7.09 million rows/s., 61.73 MB/s.)
368
+
```
369
+
370
+
### Chaining {#chaining}
371
+
372
+
Materialized views can be chained, allowing complex workflows to be established. For a practical example, we recommend reading this [blog post](https://clickhouse.com/blog/chaining-materialized-views).
275
373
276
374
## Materialized Views and JOINs {#materialized-views-and-joins}
In the above operation, only one row is retrieved from the users table for the user id `2936484`. This lookup is also optimized with a table ordering key of `Id`.
625
723
626
-
## Other applications {#other-applications}
627
-
628
-
The above focuses primarily on using Materialized Views to incrementally update partial aggregates of data, thus moving the computation from query to insert time. Beyond this common use case, Materialized Views have a number of other applications.
629
-
630
-
### Filtering and transformation {#filtering-and-transformation}
631
-
632
-
In some situations, we may wish to only insert a subset of the rows and columns on insertion. In this case, our `posts_null` table could receive inserts, with a `SELECT` query filtering rows prior to insertion into the `posts` table. For example, suppose we wished to transform a `Tags` column in our `posts` table. This contains a pipe delimited list of tag names. By converting these into an array, we can more easily aggregate by individual tag values.
633
-
634
-
> We could perform this transformation when running an `INSERT INTO SELECT`. The Materialized View allows us to encapsulate this logic in ClickHouse DDL and keep our `INSERT` simple, with the transformation applied to any new rows.
635
-
636
-
Our Materialized View for this transformation is shown below:
724
+
## Materialized Views and Unions {#materialized-views-and-unions}
637
725
638
-
```sql
639
-
CREATE MATERIALIZED VIEW posts_mv TO posts AS
640
-
SELECT* EXCEPT Tags, arrayFilter(t -> (t !=''), splitByChar('|', Tags)) as Tags FROM posts_null
641
-
```
726
+
`UNION ALL` queries are commonly used to combine data from multiple source tables into a single result set.
642
727
643
-
### Lookup table {#lookup-table}
728
+
While `UNION ALL` is not directly supported in Incremental Materialized Views, you can achieve the same outcome by creating a separate Materialized View for each `SELECT` branch and writing their results to a shared target table.
644
729
645
-
Users should consider their access patterns when choosing a ClickHouse ordering key. Columns which are frequently used in filter and aggregation clauses should be used. This can be restrictive for scenarios where users have more diverse access patterns which cannot be encapsulated in a single set of columns. For example, consider the following `comments` table:
730
+
For our example, we'll use the Stack Overflow dataset. Consider the `badges`and `comments` tables below, which represent the badges earned by a user and the comments they make on posts:
646
731
647
732
```sql
648
-
CREATETABLEcomments
733
+
CREATETABLEstackoverflow.comments
649
734
(
650
735
`Id` UInt32,
651
736
`PostId` UInt32,
@@ -656,68 +741,209 @@ CREATE TABLE comments
656
741
`UserDisplayName` LowCardinality(String)
657
742
)
658
743
ENGINE = MergeTree
659
-
ORDER BYPostId
744
+
ORDER BYCreationDate
660
745
661
-
0 rows inset. Elapsed: 46.357 sec. Processed 90.38 million rows, 11.14 GB (1.95 million rows/s., 240.22 MB/s.)
746
+
CREATETABLEstackoverflow.badges
747
+
(
748
+
`Id` UInt32,
749
+
`UserId` Int32,
750
+
`Name` LowCardinality(String),
751
+
`Date` DateTime64(3, 'UTC'),
752
+
`Class` Enum8('Gold'=1, 'Silver'=2, 'Bronze'=3),
753
+
`TagBased` Bool
754
+
)
755
+
ENGINE = MergeTree
756
+
ORDER BY UserId
662
757
```
663
758
664
-
The ordering key here optimizes the table for queries which filter by `PostId`.
759
+
These can be populated with the following `INSERT INTO` commands:
665
760
666
-
Suppose a user wishes to filter on a specific `UserId` and compute their average `Score`:
667
761
668
762
```sql
669
-
SELECTavg(Score)
670
-
FROM comments
671
-
WHERE UserId =8592047
763
+
INSERT INTOstackoverflow.badgesSELECT*
764
+
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/badges.parquet')
765
+
INSERT INTOstackoverflow.commentsSELECT*
766
+
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/comments/*.parquet')
767
+
```
672
768
673
-
┌──────────avg(Score)─┐
674
-
│ 0.18181818181818182 │
675
-
└─────────────────────┘
769
+
Suppose we want to create a unified view of user activity, showing the last activity by each user by combining these two tables:
676
770
677
-
1 row inset. Elapsed: 0.778 sec. Processed 90.38 million rows, 361.59 MB (116.16 million rows/s., 464.74 MB/s.)
678
-
Peak memory usage: 217.08 MiB.
771
+
```sql
772
+
SELECT
773
+
UserId,
774
+
argMax(description, event_time) AS last_description,
775
+
argMax(activity_type, event_time) AS activity_type,
776
+
max(event_time) AS last_activity
777
+
FROM
778
+
(
779
+
SELECT
780
+
UserId,
781
+
CreationDate AS event_time,
782
+
TextAS description,
783
+
'comment'AS activity_type
784
+
FROMstackoverflow.comments
785
+
UNION ALL
786
+
SELECT
787
+
UserId,
788
+
DateAS event_time,
789
+
Name AS description,
790
+
'badge'AS activity_type
791
+
FROMstackoverflow.badges
792
+
)
793
+
GROUP BY UserId
794
+
ORDER BY last_activity DESC
795
+
LIMIT10
679
796
```
680
797
681
-
While fast (the data is small for ClickHouse), we can tell this requires a full table scan from the number of rows processed - 90.38 million. For larger datasets, we can use a Materialized View to lookup our ordering key values `PostId` for filtering column `UserId`. These values can then be used to perform an efficient lookup.
798
+
Let's assume we have a target table to receive the results of this query. Note the use of the [AggregatingMergeTree](/engines/table-engines/mergetree-family/aggregatingmergetree) table engine and [AggregateFunction](/sql-reference/data-types/aggregatefunction)to ensure results are merged correctly:
682
799
683
-
In this example, our Materialized View can be very simple, selecting only the `PostId` and `UserId` from `comments` on insert. These results are in turn sent to a table `comments_posts_users` which is ordered by `UserId`. We create a null version of the `Comments` table below and use this to populate our view and `comments_posts_users` table:
Wanting this table to update as new rows are inserted into either `badges` or `comments`, a naive approach to this problem may be to try and create a Materialized View with the previous union query:
684
813
685
814
```sql
686
-
CREATETABLEcomments_posts_users (
687
-
PostId UInt32,
688
-
UserId Int32
689
-
) ENGINE = MergeTree ORDER BY UserId
815
+
CREATE MATERIALIZED VIEW user_activity_mv TO user_activity AS
816
+
SELECT
817
+
UserId,
818
+
argMaxState(description, event_time) AS last_description,
819
+
argMaxState(activity_type, event_time) AS activity_type,
820
+
max(event_time) AS last_activity
821
+
FROM
822
+
(
823
+
SELECT
824
+
UserId,
825
+
CreationDate AS event_time,
826
+
TextAS description,
827
+
'comment'AS activity_type
828
+
FROMstackoverflow.comments
829
+
UNION ALL
830
+
SELECT
831
+
UserId,
832
+
DateAS event_time,
833
+
Name AS description,
834
+
'badge'AS activity_type
835
+
FROMstackoverflow.badges
836
+
)
837
+
GROUP BY UserId
838
+
ORDER BY last_activity DESC
839
+
```
690
840
841
+
While this is valid syntactically, it will produce unintended results - the view will only trigger inserts to the `comments` table. For example:
691
842
692
-
CREATETABLEcomments_nullAS comments
693
-
ENGINE =Null
694
843
695
-
CREATE MATERIALIZED VIEW comments_posts_users_mv TO comments_posts_users AS
696
-
SELECT PostId, UserId FROM comments_null
844
+
```sql
845
+
INSERT INTO comments VALUES (99999999, 23121, 1, 'The answer is 42', now(), 2936484, 'gingerwizard');
697
846
698
-
INSERT INTO comments_null SELECT*FROM comments
847
+
SELECT
848
+
UserId,
849
+
argMaxMerge(last_description) AS description,
850
+
argMaxMerge(activity_type) AS activity_type,
851
+
max(last_activity) AS last_activity
852
+
FROM user_activity
853
+
WHERE UserId ='2936484'
854
+
GROUP BY UserId
699
855
700
-
0 rows inset. Elapsed: 5.163 sec. Processed 90.38 million rows, 17.25 GB (17.51 million rows/s., 3.34 GB/s.)
To solve this, we simply create a Materialized View for each SELECT statement:
722
885
723
-
Materialized views can be chained, allowing complex workflows to be established. For a practical example, we recommend reading this [blog post](https://clickhouse.com/blog/chaining-materialized-views).
886
+
```sql
887
+
DROPTABLE user_activity_mv;
888
+
TRUNCATE TABLE user_activity;
889
+
890
+
CREATE MATERIALIZED VIEW comment_activity_mv TO user_activity AS
891
+
SELECT
892
+
UserId,
893
+
argMaxState(Text, CreationDate) AS last_description,
894
+
argMaxState('comment', CreationDate) AS activity_type,
895
+
max(CreationDate) AS last_activity
896
+
FROMstackoverflow.comments
897
+
GROUP BY UserId;
898
+
899
+
CREATE MATERIALIZED VIEW badges_activity_mv TO user_activity AS
900
+
SELECT
901
+
UserId,
902
+
argMaxState(Name, Date) AS last_description,
903
+
argMaxState('badge', Date) AS activity_type,
904
+
max(Date) AS last_activity
905
+
FROMstackoverflow.badges
906
+
GROUP BY UserId;
907
+
```
908
+
909
+
Inserting to either table now results in the correct results. For example, if we insert into the `comments` table:
910
+
911
+
```sql
912
+
INSERT INTO comments VALUES (99999999, 23121, 1, 'The answer is 42', now(), 2936484, 'gingerwizard');
0 commit comments