Schema Builder Guide

Building Schemas

Introduction

This document provides an in-depth guide on building and optimizing schemas in HyperSec XDR. It is intended for data engineers and developers who need to design efficient, scalable, and performant schemas for handling large volumes of data in real-time analytics.

We will cover how to:

  • Choose appropriate data types for your fields.

  • Select optimal ORDER BY clauses for efficient data retrieval.

  • Decide on index types to improve query performance.

  • Use mapping files and guidelines to standardize schema definitions.

Schema Design Principles

Designing an effective schema in ClickHouse involves understanding both the nature of your data and the querying patterns. Key principles include:

  • Data Compression: ClickHouse offers various codecs for data compression, which can significantly reduce storage requirements.

  • Columnar Storage: As a column-oriented database, ClickHouse stores data by columns, which allows for efficient data processing and retrieval.

  • Partitioning and Ordering: Proper use of ORDER BY and partitioning can improve query performance by enabling data skipping.

  • Indexing: Secondary indexes can be used to speed up queries on non-primary key columns.

Choosing Data Types

Selecting the right data types is crucial for performance and storage efficiency. Below is a mapping of logical types to ClickHouse types, along with comments on intended usage.

Data Type Mappings

| Logical Type | ClickHouse Type | Comment |
|---|---|---|
| string | String CODEC(ZSTD(1)) | Standard strings; ZSTD compression for storage efficiency. |
| string_fast | String CODEC(LZ4) | Heavily queried strings; LZ4 offers faster decompression at the expense of larger storage size. |
| string_lowcardinality | LowCardinality(String) CODEC(ZSTD(1)) | Strings with fewer distinct values (<1 million); improves performance and reduces storage. |
| string_fast_lowcardinality | LowCardinality(String) CODEC(LZ4) | Heavily queried low-cardinality strings; faster access with acceptable storage overhead. |
| text | String CODEC(ZSTD(1)) | Larger strings and text fields that may not be used in exact-match queries. |
| json | String CODEC(ZSTD(1)) | JSON objects stored as strings; move to the native JSON type once it is GA. |
| int8 | Nullable(Int8) CODEC(ZSTD(1)) | 8-bit integers; nullable fields are supported. |
| int16 | Nullable(Int16) CODEC(ZSTD(1)) | 16-bit integers. |
| int32 | Nullable(Int32) CODEC(ZSTD(1)) | 32-bit integers. |
| int64 | Nullable(Int64) CODEC(ZSTD(1)) | 64-bit integers. |
| int128 | Nullable(Int128) CODEC(ZSTD(1)) | 128-bit integers; for very large numbers. |
| int256 | Nullable(Int256) CODEC(ZSTD(1)) | 256-bit integers. |
| float32 | Nullable(Float32) CODEC(ZSTD(1)) | 32-bit floating-point numbers. |
| float64 | Nullable(Float64) CODEC(ZSTD(1)) | 64-bit floating-point numbers. |
| boolean | Nullable(Bool) CODEC(LZ4) | Boolean values; LZ4 codec for fast access. |
| timestamp | DateTime64(3, 'UTC') CODEC(DoubleDelta, LZ4) | Monotonically increasing time-series timestamps (e.g., timestamp, timestamp_load); required and not nullable. |
| datetime | Nullable(DateTime64(3, 'UTC')) CODEC(DoubleDelta, ZSTD(1)) | All other date and timestamp fields; nullable. |
| ipv4 | Nullable(IPv4) CODEC(T64, LZ4) | IPv4 addresses. |
| ipv6 | Nullable(IPv6) CODEC(LZ4) | IPv6 addresses. |
| geo_point | Point CODEC(ZSTD(1)) | Geographical coordinates (longitude/latitude pair); Point is tuple-based, so it cannot be wrapped in Nullable. |
| tuple | Tuple(...) CODEC(ZSTD(1)) | A fixed set of typed elements; composite types cannot be wrapped in Nullable. |
| map | Map(K, V) CODEC(ZSTD(1)) | Key-value pairs; composite types cannot be wrapped in Nullable. |
| uuid | Nullable(UUID) CODEC(ZSTD(1)) | Covers all GUID types mapped from OpenSearch. |

Guidelines for Choosing Data Types

  • Strings:

    • Use string for standard strings where storage efficiency is important.

    • Use string_fast for strings that are heavily queried and can accept greater storage consumption.

    • Use string_lowcardinality when the string field has a low number of distinct values (<1 million).

    • Use string_fast_lowcardinality for heavily queried, low-cardinality strings.

  • Numbers:

    • Choose the smallest integer type that can represent your data (int8, int16, int32, int64, int128, int256).

    • Use float32 or float64 for floating-point numbers.

  • Booleans:

    • Use the boolean type for true/false values.

  • Timestamps and Dates:

    • Use timestamp for timeseries data that increases over time (e.g., event timestamps). These should be not nullable.

    • Use datetime for other date/time fields, which can be nullable.

  • IP Addresses:

    • Use ipv4 and ipv6 for storing IP addresses.

  • Geospatial Data:

    • Use geo_point for geographical coordinates.

  • Complex Types:

    • Use tuple and map for structured data; note that composite types (Tuple, Map, Array) cannot be wrapped in Nullable.

    • For JSON data, store it as a string using the json type until native JSON support is available in ClickHouse.

  • UUIDs:

    • Use the uuid type for universally unique identifiers.
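
Putting these guidelines together, the following is a minimal sketch of how the logical types translate into DDL. The table and field names (example_events, hostname, bytes_sent, session_id) are hypothetical, chosen only to illustrate the mappings:

CREATE TABLE example_events (
    -- timestamp: required, not nullable, DoubleDelta for increasing values
    timestamp DateTime64(3, 'UTC') CODEC(DoubleDelta, LZ4),
    -- string_lowcardinality: few distinct values, storage-efficient
    event_type LowCardinality(String) CODEC(ZSTD(1)),
    -- string_fast: heavily queried, LZ4 for fast reads
    hostname String CODEC(LZ4),
    -- string: standard storage-efficient string
    message String CODEC(ZSTD(1)),
    -- int64: nullable numeric field
    bytes_sent Nullable(Int64) CODEC(ZSTD(1)),
    -- uuid: nullable unique identifier
    session_id Nullable(UUID) CODEC(ZSTD(1))
) ENGINE = MergeTree()
ORDER BY (event_type, timestamp);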

Compression Codecs

  • ZSTD(1):

    • Provides a good balance between compression ratio and speed.

    • Use for fields where storage efficiency is prioritized.

  • LZ4:

    • Faster compression and decompression but with lower compression ratios.

    • Use for fields that are heavily queried and where read performance is critical.

  • DoubleDelta:

    • Use for numeric fields that increase over time, such as timestamps.

  • T64:

    • Use for small numeric types with low maximum values (e.g., error codes).
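
To check which codec a column actually uses, and how well it compresses, the system.columns table can be queried. A sketch, assuming a table named logs_events in the current database:

SELECT
    name,
    type,
    compression_codec,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE database = currentDatabase() AND table = 'logs_events'
ORDER BY data_compressed_bytes DESC;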

Nullable Fields

  • Fields can be defined as Nullable if they may contain null values.

  • For critical fields like timestamp, ensure they are not nullable to maintain data integrity and query performance.

Choosing ORDER BY Clauses

In ClickHouse, the ORDER BY clause is fundamental as it determines both how data is sorted on disk and how the primary index is built. This ordering directly impacts query performance, data compression efficiency, and overall storage requirements.

  • ORDER BY: Specifies the sorting order of data on disk, which influences how effectively ClickHouse can skip over unnecessary data during query execution.

  • PRIMARY KEY: If not explicitly defined, the PRIMARY KEY defaults to the ORDER BY clause. It determines how ClickHouse builds its sparse index to locate data blocks efficiently.

Best Practices for Choosing Columns for ORDER BY and PRIMARY KEY

When selecting columns for the ORDER BY clause, consider the following guidelines based on ClickHouse's official recommendations:

1. Align with Common Query Filters

  • Select Columns Frequently Used in WHERE Clauses: Prioritize columns that are commonly used in query filters. This alignment allows ClickHouse to skip over large portions of data that do not match the query criteria, improving query performance.

  • Exclude a Large Percentage of Data: Choose columns that, when filtered, can exclude a significant portion of the total data. This maximizes the effectiveness of data skipping.

2. Prefer Columns with Low Cardinality Early in the ORDER BY

  • Order Columns by Increasing Cardinality: Generally, place columns with lower cardinality (fewer unique values) earlier in the ORDER BY clause. This ordering can enhance compression ratios and improve query performance for filters on these columns.

  • Balance with Query Needs: While ordering by low cardinality can improve compression, ensure that it does not significantly hinder query performance. If high-cardinality columns are frequently used in queries, they may need to be placed earlier despite potential compression trade-offs.

3. Consider Data Correlation

  • Columns Highly Correlated with Others: Include columns that are likely to be highly correlated with other columns in the table. This correlation can lead to better compression because similar data is stored together.

  • Impact on Aggregations: Columns in the ORDER BY clause can make GROUP BY and ORDER BY operations more efficient if they align with common aggregation patterns.

4. Use Expressions for Efficiency

  • Transform Columns in ORDER BY: Use expressions like toDate(timestamp) to reduce the granularity of the sort key. Grouping rows by day rather than by millisecond keeps similar data together, improving compression and data skipping.

Handling Nullable Fields

  • Avoid Nullable Columns in ORDER BY: Columns used in the ORDER BY clause must be non-nullable. If necessary, provide default values or clean data during ingestion to prevent nulls.
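
A common pattern, sketched below with a hypothetical events table, is to declare the column non-nullable with a DEFAULT so that it can safely participate in the ORDER BY:

CREATE TABLE events_non_null (
    timestamp DateTime64(3, 'UTC') CODEC(DoubleDelta, LZ4),
    -- Non-nullable with a fallback value, so it is valid in ORDER BY
    event_type LowCardinality(String) DEFAULT 'unknown' CODEC(ZSTD(1)),
    message String CODEC(ZSTD(1))
) ENGINE = MergeTree()
ORDER BY (event_type, timestamp);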

Deciding on the Order of Columns

The order of columns in the ORDER BY clause significantly impacts both query performance and compression efficiency.

Ordering by Increasing Cardinality

  • Start with Low-Cardinality Columns: Placing low-cardinality columns first can improve compression because similar values are stored together, which is advantageous for compression algorithms.

  • Example Order:

    ORDER BY (event_type, toDate(timestamp), org_id)

    • event_type: Low cardinality (e.g., a limited set of event types).

    • toDate(timestamp): Reduces timestamp granularity to date level, aiding in partitioning and compression.

    • org_id: Higher cardinality but necessary for query filters.

Balancing with Query Patterns

  • Frequent Query Filters: If high-cardinality columns are frequently used in WHERE clauses, consider placing them earlier in the ORDER BY clause despite potential impacts on compression.

  • Testing Different Orders: It's essential to test different ORDER BY configurations to find the optimal balance between compression and query performance for your specific workload.

Examples

Example 1: Optimized for Compression

CREATE TABLE logs_events (
    event_type LowCardinality(String),
    timestamp DateTime64(3, 'UTC'),
    org_id UInt32,
    user_id UInt64,
    message String CODEC(ZSTD(1)),
    -- Additional fields...
) ENGINE = MergeTree()
ORDER BY (event_type, toDate(timestamp), org_id);

  • Explanation:

    • event_type is placed first due to low cardinality and frequent use in filters.

    • toDate(timestamp) reduces timestamp precision, aiding in partitioning and compression.

    • org_id is included later, balancing between compression and query performance.

Example 2: Optimized for Query Performance

CREATE TABLE logs_events (
    org_id UInt32,
    timestamp DateTime64(3, 'UTC'),
    event_type LowCardinality(String),
    user_id UInt64,
    message String CODEC(ZSTD(1)),
    -- Additional fields...
) ENGINE = MergeTree()
ORDER BY (org_id, timestamp, event_type);

  • Explanation:

    • org_id is placed first because queries frequently filter by organization.

    • timestamp allows efficient time-range queries.

    • event_type is included to further refine data skipping during queries.

Applying ClickHouse's Recommendations

From the official documentation:

  • Select Columns Aligning with Common Filters: Start with columns used in WHERE clauses to maximize data skipping.

  • Prefer Columns Likely to Exclude Large Data Portions: Choose columns that can exclude significant data volumes when filtered.

  • Order Columns by Increasing Cardinality: Generally, low-cardinality columns first, but balance this with the need for efficient filtering on columns that appear later.

Testing and Iteration

  • Benchmark Queries: Test your queries against different ORDER BY configurations to measure performance (see the sketch after this list).

  • Monitor Compression Ratios: Evaluate how changes to the ORDER BY clause affect data compression and storage requirements.

  • Adjust Based on Findings: Iterate on your schema design based on empirical evidence from testing.
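
One lightweight way to see whether a given ORDER BY actually enables data skipping is EXPLAIN with index analysis enabled. A sketch against the logs_events example, with hypothetical filter values:

EXPLAIN indexes = 1
SELECT count()
FROM logs_events
WHERE org_id = 42
  AND timestamp >= now() - INTERVAL 1 DAY;

The output shows how many parts and granules the primary index selects, which can be compared across candidate ORDER BY configurations.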

Summary

Choosing the optimal ORDER BY and PRIMARY KEY in ClickHouse requires balancing multiple factors:

  • Query Patterns: Align the ORDER BY with columns frequently used in filters.

  • Data Cardinality: Consider the cardinality of columns to optimize compression and data skipping.

  • Data Correlation: Leverage columns that correlate with others to enhance compression.

  • Performance Testing: Continuously test and adjust based on query performance and compression outcomes.

By carefully considering these aspects and leveraging ClickHouse's capabilities, you can design schemas that provide efficient storage and high-performance query execution.

Choosing Index Types

ClickHouse offers secondary indexes to accelerate queries that cannot be efficiently served by the primary key alone.

Types of Indexes

  • Data Skipping Indexes:

    • Allow ClickHouse to skip over data that doesn't match query predicates.

    • Types include minmax, set, bloom_filter, tokenbf_v1, and ngrambf_v1.

  • Token Bloom Filters (tokenbf_v1):

    • Use for searching within text fields; the related ngrambf_v1 index supports substring matching.

    • Suitable for fields where you need to perform token or partial matches (e.g., hasToken, LIKE).

Guidelines for Index Selection

  • When to Use Indexes:

    • If you frequently query on fields that are not part of the ORDER BY clause.

    • When queries on certain fields are slow due to full table scans.

  • Choosing Index Type:

    • Use minmax indexes for numeric and date fields.

    • Use bloom_filter for string fields with a large number of distinct values.

    • Use set indexes when the set of possible values is known and relatively small.

Defining an Index

  • Syntax:

    INDEX index_name (column_name) TYPE index_type GRANULARITY N

    • index_name: Name of the index.

    • column_name: The column on which the index is built.

    • index_type: Type of index (e.g., minmax, bloom_filter).

    • GRANULARITY N: The number of data granules aggregated into a single index entry; lower values produce a finer-grained (and larger) index that can skip data more precisely.

Example

  • Creating a Bloom Filter Index:

    INDEX idx_user_agent (user_agent) TYPE bloom_filter(0.01) GRANULARITY 64
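
Indexes can also be added to an existing table; for parts written before the index existed, it must be materialized explicitly. A sketch, assuming the logs_events table and a message column:

ALTER TABLE logs_events
    ADD INDEX idx_message (message) TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4;

-- Build the index for parts that already exist on disk
ALTER TABLE logs_events MATERIALIZE INDEX idx_message;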

Best Practices

  • Data Modeling:

    • Understand your data access patterns; design schemas that align with how data is queried.

    • Normalize data when appropriate but denormalize if it leads to better query performance.

  • Testing and Optimization:

    • Test query performance with different ORDER BY clauses and index configurations.

    • Use the EXPLAIN query to understand how ClickHouse processes your queries.

  • Compression:

    • Choose appropriate codecs based on query frequency and performance requirements.

    • Remember that higher compression reduces storage but may impact read performance.

  • Monitoring:

    • Monitor query performance and adjust schemas as data volume and access patterns evolve.

    • Use ClickHouse system tables to analyze query statistics and optimize accordingly (a sketch follows this list).
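
As a starting point for that analysis, the slowest recent queries can be pulled from system.query_log; the time window and limit here are illustrative:

SELECT
    query,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes) AS read
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 DAY
ORDER BY query_duration_ms DESC
LIMIT 10;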

Examples

Defining a Table with Optimal Data Types and Indexes

CREATE TABLE xdr_12345678.logs_beats_winlogbeat (
    timestamp DateTime64(3, 'UTC') CODEC(DoubleDelta, LZ4),
    org_id String CODEC(ZSTD(1)),
    event_hash String CODEC(ZSTD(1)),
    user_id LowCardinality(String) CODEC(ZSTD(1)),
    event_type String CODEC(ZSTD(1)),
    source_ip IPv4 CODEC(T64, LZ4),
    destination_ip IPv4 CODEC(T64, LZ4),
    message String CODEC(ZSTD(1)),
    -- Additional fields...
    INDEX idx_event_type (event_type) TYPE set(1000) GRANULARITY 4,
    INDEX idx_user_id (user_id) TYPE bloom_filter(0.01) GRANULARITY 64
) ENGINE = MergeTree()
ORDER BY (org_id, timestamp);

Explanation

  • Data Types:

    • timestamp: Using DateTime64(3, 'UTC') with DoubleDelta codec for efficient storage of increasing timestamps.

    • org_id: Stored as String with ZSTD(1) compression for balance between speed and storage.

    • user_id: Low cardinality string as it likely has fewer distinct values.

  • Indexes:

    • idx_event_type: Using a set index on event_type because the number of event types is limited.

    • idx_user_id: Using a bloom_filter index on user_id to speed up queries filtering on user IDs.

  • Order By Clause:

    • Ordered by org_id and timestamp to optimize queries filtering by organization and time range.
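
A query shaped like the following, with hypothetical filter values, exercises all of these choices at once:

SELECT count()
FROM xdr_12345678.logs_beats_winlogbeat
WHERE org_id = 'org-42'                    -- first ORDER BY column: primary index prunes parts
  AND timestamp >= now() - INTERVAL 1 HOUR -- second ORDER BY column: narrows the time range
  AND event_type = 'logon'                 -- set index skips non-matching granules
  AND user_id = 'jdoe';                    -- bloom filter index skips non-matching granules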

Conclusion

Designing schemas in ClickHouse for HyperSec XDR involves careful consideration of data types, compression codecs, ordering, and indexing. By following the guidelines provided in this document, you can build efficient and scalable schemas that meet your real-time analytics needs.
