Discovering the secrets of Clickhouse
These days, I am having a serious columnar affair with my new friend, ClickHouse.
Since I have some experience with quite a few other data stores, I notice new features that I once wished were present.
Today, I will share with you my reflections on some interesting ClickHouse data types that have grabbed my attention.
Enum data type:
- a highly efficient way to store string values from a predefined fixed set. ClickHouse stores them as integers, but they appear in the queries as strings. The bigger the string, the more this data type is storage-efficient, especially for huge tables. Sorting and filtering on integers are much faster than on strings, and indices are much more compact and efficient.
- There are two types, depending on the number of unique values that you have in your set.
Enum8 (1 byte, 128 unique values)
Enum16 (2 bytes, 32768 unique values)
- When defining an Enum column, we explicitly define the key to string mapping
- When inserting data, you can either send a string or an integer. If you send undefined strings or integers, insert will throw an exception.
- Note that strings are case sensitive; sending Success instead of success will cause insert to fail.
- Adding new values is quite a lightweight process, but updating or deleting values requires a whole table rewrite - so avoid using it for dynamic data.
- MySQL has a similar datatype that behaves almost identically to the Clickhouse one. Postgres handles such data through the CREATE TYPE command, but you cannot assign integers; Postgres manages numbers under the hood
- ClickHouse table creation Example:
CREATE TABLE user_logs (
event_time DateTime,
status Enum8('success' = 1, 'error' = 2, 'pending' = 3)
) ENGINE = MergeTree()
ORDER BY event_time;
LowCardinality data type
- Somewhat more flexible version of the Enum type. ClickHouse creates an internal dictionary of strings where each row has an integer mapped to it. Filtering and grouping are performed on the integers and the integers are being converted into strings when it shows you the results of the query.
- The difference from the Enum data type is that ClickHouse learns new values dynamically.
- Does not require ALTER TABLE to add new values
- Not as fast as Enum, but still very powerful.
- Best for changing/evolving data
- Best practice/ Ideal for less than 10K values
- Unlike Enum, can be Nullable
- Create table example:
CREATE TABLE orders (
order_id UInt64,
country LowCardinality(String),
status Enum8('open' = 1, 'closed' = 2)
) ENGINE = MergeTree() ORDER BY order_id;
AggregateFunction data type
- allows to store the intermediate state of calculation, helps to pre-calculate billions of rows into a few thousand rows without losing the ability to calculate the unique count or average later.
- Instead of calculating the max or average value, Clickhouse stops halfway, allowing it to finish the calculation later
- State contains sum/current total, count of rows and uniq - counter that helps to estimate uniqueness
- The query will combine the states into a final result
- This datatype helps to avoid scanning billions of rows over and over to refresh reports
- This data type preaggregates data and stores partial results into a summary table. To get final results, we will need to aggregate data in the summary table, scanning states, which is fast and cheap.
- Here is an example of a summary table
CREATE TABLE daily_uniques (
event_date Date,
unique_users AggregateFunction(uniq, UInt64) )
ENGINE = AggregatingMergeTree()
ORDER BY event_date;
that will be populated by a Materialised View:
CREATE MATERIALIZED VIEW daily_uniques_mv TO daily_uniques AS SELECT
event_date,
uniqState(user_id) AS unique_users
FROM raw_logs
GROUP BY event_date;
IPv4 and IPv6 data types:
- Those two types store IPs and treat them as integers, therefore making filtering and sorting extremely fast. In addition, ClickHouse provides several useful functions to work with IPs; this one, for instance, allows checking millions of IPs against a CIDR block in milliseconds:
WHERE isIPAddressInRange(source_ip, '192.168.1.0/24');
Map data type:
- Similar to JSON but more optimised for columnar storage and very efficient for key searches
- Internally, ClickHouse stores two arrays, one for keys and one for values
- Maps in ClickHouse are not unique; most functions will return the first occurrence
- If the key does not exist, ClickHouse will return default values, an empty string for Strings and 0 for numerics
- Accessing a key in Map is significantly slower than accessing a regular column. ClickHouse needs to perform a scan for a key array for every row.
- Works best if they have a small number of keys, fewer than 100
- Avoid using a Map for a primary key and for an order by clause.
CREATE TABLE user_attributes (
user_id UInt64,
attributes Map(String, String) )
ENGINE = MergeTree()
ORDER BY user_id;
Nested data type:
- This data type allows to store a table inside a single row. Internally stored as multiple array columns. This datatype eliminates the need for JOINs, which have significant performance bottlenecks.
- In this example, ClickHouse will create 3 arrays, each for a column in the nested βtableβ
CREATE TABLE orders (
order_id UInt64,
items Nested (
product_id UInt32,
price Float64,
quantity UInt32 ) )
ENGINE = MergeTree()
ORDER BY order_id;
- Using a Nested data type is faster than JSON because CH can only scan one of the arrays without touching the rest, instead of scanning the whole JSON blob
In general, interestingly enough, in Clickhouse, if the integer datatype is not defined as Nullable, a missing value would be stored as 0 and for strings, a missing value would be an empty string.
Happy learnings
Maria
Comments
No comments yet.
Leave a comment