Overview

Clickhouse is quite fast storage, but when your storage is huge enough searching and aggregating in raw data become quite expensive. In this case you would think about optimization some queries. Today I would like to talk about a way where we will use AggregatingMergeTree with Materialized View. Materialized View gets all data by a given query and AggregatingMergeTree aggregates inserted records by sorting key. With this approach, We can group data by some fields and it helps us optimize heavy queries for a long period. Most important things, the optimization rate is a proportion of all records with unique records in the materialized view. It tells how fast your optimization will work.

How to

CREATE TABLE views
(
  id                UInt64,
  manager           UInt64,
  splitTest         UInt64,
  splitTestOption   UInt64,
  client            UInt64,
  schemeType        UInt64,
  campaign          UInt64,
  creative          UInt64,
  source            UInt64,
  domain            UInt64,
  referrer          UInt64,
  country           String,
  device            String,
  os                String,
  viewAt            DateTime,
  viewDate          Date DEFAULT toDate(viewAt),
  platform          UInt64,
  orientation       UInt64,
  pageType          UInt64,
  categoryName      UInt64,
  widgetName        UInt64,
  widgetElement     UInt64,
  logged            UInt8,
  isVisited         UInt8
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/test/{shard}/views', '{replica}', viewDate,
           (viewDate, client), 8192);

We have a huge table with a large amount of data and we want to optimize queries by viewDate, manager, client, campaign, creative, source, country, device, domain columns. We need to create Materialized view with AggregatingMergeTree engine.

CREATE MATERIALIZED VIEW views_mat
  engine = AggregatingMergeTree partition by viewDate order by (
    viewDate,
    manager,
    client,
    campaign,
    creative,
    source,
    country,
    device,
    domain,
    landing)
AS
SELECT viewDate,
       manager,
       client,
       campaign,
       creative,
       source,
       country,
       device,
       domainId,
       landing,
       countState() as amount
FROM views
GROUP BY viewDate,
         manager,
         client,
         campaign,
         creative,
         source,
         country,
         device,
         domain,
         landing

Aggregate function combinators

You can see countState() function. This function doesn’t return the resulting value, but just an intermediate state of the aggregation. You should use countMerge() function for getting the resulting value. More information about combinators you can check (here)[https://clickhouse.yandex/docs/en/query_language/agg_functions/combinators/].

Let’s try queries and check the difference between the raw table and the materialized view.

Raw:

SELECT
    campaign,
    count() AS views
FROM views
WHERE (client IN 7559922) AND (country IN 'in') AND ((viewDate >= '2018-06-01') AND (viewDate <= '2018-12-30'))
GROUP BY campaign;

Row 1:
──────
campaignId:  0
impressions: 7955062

Row 2:
──────
campaignId:  812661
impressions: 11003

Row 3:
──────
campaignId:  1334
impressions: 253350

Row 4:
──────
campaignId:  569467
impressions: 93182604

Row 5:
──────
campaignId:  237410
impressions: 7554

Row 6:
──────
campaignId:  23456
impressions: 3

Row 7:
──────
campaignId:  524556
impressions: 438

7 rows in set. Elapsed: 46.781 sec. Processed 9.18 billion rows, 116.92 GB (196.24 million rows/s., 2.50 GB/s.)

AgregateMergeTree and Materialized view:

SELECT
    campaignId,
    countMerge(amount) AS impressions
FROM impressions_mat
WHERE (userId IN 26) AND (country IN 'in') AND ((impressionDate >= '2018-06-01') AND (impressionDate <= '2018-12-30'))
GROUP BY campaignId

Row 1:
──────
campaignId:  0
impressions: 7955062

Row 2:
──────
campaignId:  812661
impressions: 11003

Row 3:
──────
campaignId:  1334
impressions: 253350

Row 4:
──────
campaignId:  569467
impressions: 93182604

Row 5:
──────
campaignId:  237410
impressions: 7554

Row 6:
──────
campaignId:  23456
impressions: 3

Row 7:
──────
campaignId:  524556
impressions: 438

7 rows in set. Elapsed: 0.116 sec. Processed 9.13 million rows, 448.10 MB (78.78 million rows/s., 3.86 GB/s.)

In other queries, optimization might be more or less.

Populate

CREATE MATERIALIZED VIEW has option POPULATE. It allows you to populate the view with data from request but it has one disadvantage. The view won’t get data which come while population. Therefore, we need to create a new view then populate it after view creation. In our case, we can stop inserting new data, populate view, then start inserting.

Summary

If you need to increase speed for some specific queries you can use Materialized View with AggregatingMergeTree engine. Keep in mind that improving is proportional to the amount of rows with unique values.