Schema Builder Guide
Building Schemas
Introduction
This document provides an in-depth guide on building and optimizing schemas in HyperSec XDR. It is intended for data engineers and developers who need to design efficient, scalable, and performant schemas for handling large volumes of data in real-time analytics.
We will cover how to:

- Choose appropriate data types for your fields.
- Select optimal `ORDER BY` clauses for efficient data retrieval.
- Decide on index types to improve query performance.
- Use mapping files and guidelines to standardize schema definitions.
Schema Design Principles
Designing an effective schema in ClickHouse involves understanding both the nature of your data and the querying patterns. Key principles include:

- Data Compression: ClickHouse offers various codecs for data compression, which can significantly reduce storage requirements.
- Columnar Storage: As a column-oriented database, ClickHouse stores data by columns, which allows for efficient data processing and retrieval.
- Partitioning and Ordering: Proper use of `ORDER BY` and partitioning can improve query performance by enabling data skipping.
- Indexing: Secondary indexes can be used to speed up queries on non-primary-key columns.
Choosing Data Types
Selecting the right data types is crucial for performance and storage efficiency. Below is a mapping of logical types to their ClickHouse types, along with comments on when to use each.
Data Type Mappings
| Logical Type | ClickHouse Type | Comments |
| --- | --- | --- |
| `string` | `String CODEC(ZSTD(1))` | Standard strings with compression for storage efficiency. |
| `string_fast` | `String CODEC(LZ4)` | Strings that are heavily queried; LZ4 offers faster decompression at the expense of larger storage size. |
| `string_lowcardinality` | `LowCardinality(String) CODEC(ZSTD(1))` | For strings with fewer distinct values (<1 million); improves performance and reduces storage. |
| `string_fast_lowcardinality` | `LowCardinality(String) CODEC(LZ4)` | Heavily queried low-cardinality strings; faster access with acceptable storage overhead. |
| `text` | `String CODEC(ZSTD(1))` | Larger strings and text fields that may not be used in exact-match queries. |
| `json` | `String CODEC(ZSTD(1))` | Store JSON objects as strings; move to the native JSON type once it is GA. |
| `int8` | `Nullable(Int8) CODEC(ZSTD(1))` | 8-bit integers; nullable fields are supported. |
| `int16` | `Nullable(Int16) CODEC(ZSTD(1))` | 16-bit integers. |
| `int32` | `Nullable(Int32) CODEC(ZSTD(1))` | 32-bit integers. |
| `int64` | `Nullable(Int64) CODEC(ZSTD(1))` | 64-bit integers. |
| `int128` | `Nullable(Int128) CODEC(ZSTD(1))` | 128-bit integers; for very large numbers. |
| `int256` | `Nullable(Int256) CODEC(ZSTD(1))` | 256-bit integers. |
| `float32` | `Nullable(Float32) CODEC(ZSTD(1))` | 32-bit floating-point numbers. |
| `float64` | `Nullable(Float64) CODEC(ZSTD(1))` | 64-bit floating-point numbers. |
| `boolean` | `Nullable(Boolean) CODEC(LZ4)` | Boolean values; LZ4 codec for fast access. |
| `timestamp` | `DateTime64(3, 'UTC') CODEC(DoubleDelta, LZ4)` | Monotonically increasing time-series timestamps (e.g., `timestamp`, `timestamp_load`); required and not nullable. |
| `datetime` | `Nullable(DateTime64(3, 'UTC')) CODEC(DoubleDelta, ZSTD(1))` | All other date and timestamp fields; nullable. |
| `ipv4` | `Nullable(IPv4) CODEC(T64, LZ4)` | IPv4 addresses. |
| `ipv6` | `Nullable(IPv6) CODEC(LZ4)` | IPv6 addresses. |
| `geo_point` | `Point CODEC(ZSTD(1))` | Geographical points (latitude and longitude). Composite types cannot be wrapped in `Nullable`. |
| `tuple` | `Tuple(...) CODEC(ZSTD(1))` | Tuples representing a fixed set of elements. Composite types cannot be wrapped in `Nullable`. |
| `map` | `Map(...) CODEC(ZSTD(1))` | Key-value pairs. Composite types cannot be wrapped in `Nullable`. |
| `uuid` | `Nullable(UUID) CODEC(ZSTD(1))` | Supports all GUID types in OpenSearch. |
Guidelines for Choosing Data Types
Strings:

- Use `string` for standard strings where storage efficiency is important.
- Use `string_fast` for strings that are heavily queried and can accept greater storage consumption.
- Use `string_lowcardinality` when the string field has a low number of distinct values (<1 million).
- Use `string_fast_lowcardinality` for heavily queried, low-cardinality strings.

Numbers:

- Choose the smallest integer type that can represent your data (`int8`, `int16`, `int32`, `int64`, `int128`, `int256`).
- Use `float32` or `float64` for floating-point numbers.

Booleans:

- Use the `boolean` type for true/false values.

Timestamps and Dates:

- Use `timestamp` for time-series data that increases over time (e.g., event timestamps). These fields should not be nullable.
- Use `datetime` for other date/time fields, which can be nullable.

IP Addresses:

- Use `ipv4` and `ipv6` for storing IP addresses.

Geospatial Data:

- Use `geo_point` for geographical coordinates.

Complex Types:

- Use `tuple` and `map` for structured data.
- For JSON data, store it as a string using the `json` type until native JSON support is available in ClickHouse.

UUIDs:

- Use the `uuid` type for universally unique identifiers.
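As a worked illustration of these guidelines, here is a minimal DDL sketch applying the logical-type mappings above. The table and column names are illustrative, not part of any standard schema:

```sql
CREATE TABLE example_events (
    -- timestamp: required, not nullable; DoubleDelta suits increasing values
    timestamp   DateTime64(3, 'UTC') CODEC(DoubleDelta, LZ4),
    -- string_lowcardinality: few distinct values, frequently filtered on
    event_type  LowCardinality(String) CODEC(ZSTD(1)),
    -- string_fast: heavily queried, trades storage for read speed
    hostname    String CODEC(LZ4),
    -- int64: smallest integer type that fits the data
    bytes_sent  Nullable(Int64) CODEC(ZSTD(1)),
    -- ipv4
    source_ip   Nullable(IPv4) CODEC(T64, LZ4),
    -- json: stored as a compressed string until native JSON is GA
    raw_event   String CODEC(ZSTD(1)),
    -- uuid
    session_id  Nullable(UUID) CODEC(ZSTD(1))
) ENGINE = MergeTree()
ORDER BY (event_type, timestamp);
```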
Compression Codecs
- ZSTD(1): Provides a good balance between compression ratio and speed. Use for fields where storage efficiency is prioritized.
- LZ4: Faster compression and decompression, but with lower compression ratios. Use for fields that are heavily queried and where read performance is critical.
- DoubleDelta: Use for numeric fields that increase over time, such as timestamps.
- T64: Use for small numeric types with low maximum values (e.g., error codes).
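To see how these codec choices play out in practice, you can inspect per-column compression via the `system.columns` table. A minimal sketch, assuming a populated table named `logs_events` in the current database:

```sql
-- Compare on-disk (compressed) vs. in-memory (uncompressed) size per column
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = currentDatabase()
  AND table = 'logs_events'
ORDER BY data_compressed_bytes DESC;
```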
Nullable Fields
- Fields can be defined as `Nullable` if they may contain null values.
- For critical fields like `timestamp`, ensure they are not nullable to maintain data integrity and query performance.
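One common way to keep a critical field non-nullable is to supply a default at ingestion time rather than allowing NULLs. A minimal sketch (table and column names are illustrative):

```sql
CREATE TABLE example_defaults (
    timestamp DateTime64(3, 'UTC') CODEC(DoubleDelta, LZ4),  -- required, never NULL
    severity  LowCardinality(String) DEFAULT 'unknown',      -- default instead of Nullable
    note      Nullable(String) CODEC(ZSTD(1))                -- genuinely optional field
) ENGINE = MergeTree()
ORDER BY (severity, timestamp);
```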
Choosing ORDER BY Clauses
In ClickHouse, the `ORDER BY` clause is fundamental, as it determines both how data is sorted on disk and how the primary index is built. This ordering directly impacts query performance, data compression efficiency, and overall storage requirements.

- `ORDER BY`: Specifies the sorting order of data on disk, which influences how effectively ClickHouse can skip over unnecessary data during query execution.
- `PRIMARY KEY`: By default, the `PRIMARY KEY` is the same as the `ORDER BY` clause unless explicitly defined. It defines how ClickHouse builds its sparse index to locate data blocks efficiently.
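When the two are declared separately, the `PRIMARY KEY` must be a prefix of the `ORDER BY` tuple. A minimal sketch of this pattern (the table name is illustrative); the extra sort-key column refines on-disk ordering without growing the in-memory index:

```sql
CREATE TABLE example_split_keys (
    org_id     UInt32,
    timestamp  DateTime64(3, 'UTC'),
    event_hash String
) ENGINE = MergeTree()
PRIMARY KEY (org_id, timestamp)            -- sparse index keeps only the prefix
ORDER BY (org_id, timestamp, event_hash);  -- full on-disk sort order
```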
Best Practices for Choosing Columns for ORDER BY and PRIMARY KEY
When selecting columns for the `ORDER BY` clause, consider the following guidelines, based on ClickHouse's official recommendations:
1. Align with Common Query Filters
- Select Columns Frequently Used in `WHERE` Clauses: Prioritize columns that are commonly used in query filters. This alignment allows ClickHouse to skip over large portions of data that do not match the query criteria, improving query performance.
- Exclude a Large Percentage of Data: Choose columns that, when filtered, exclude a significant portion of the total data. This maximizes the effectiveness of data skipping.
2. Prefer Columns with Low Cardinality Early in the ORDER BY
- Order Columns by Increasing Cardinality: Generally, place columns with lower cardinality (fewer unique values) earlier in the `ORDER BY` clause. This ordering can enhance compression ratios and improve query performance for filters on these columns.
- Balance with Query Needs: While ordering by low cardinality can improve compression, ensure that it does not significantly hinder query performance. If high-cardinality columns are frequently used in queries, they may need to be placed earlier despite potential compression trade-offs.
3. Consider Data Correlation
- Columns Highly Correlated with Others: Include columns that are likely to be highly correlated with other columns in the table. This correlation can lead to better compression because similar data is stored together.
- Impact on Aggregations: Columns in the `ORDER BY` clause can make `GROUP BY` and `ORDER BY` operations more efficient if they align with common aggregation patterns.
4. Use Expressions for Efficiency
- Transform Columns in `ORDER BY`: Use expressions like `toDate(timestamp)` to reduce the granularity of data, effectively partitioning the data and improving query performance.
Handling Nullable Fields
- Avoid Nullable Columns in `ORDER BY`: Columns used in the `ORDER BY` clause must be non-nullable. If necessary, provide default values or clean data during ingestion to prevent nulls.
Deciding on the Order of Columns
The order of columns in the ORDER BY clause significantly impacts both query performance and compression efficiency.
Ordering by Increasing Cardinality
Start with Low-Cardinality Columns: Placing low-cardinality columns first can improve compression because similar values are stored together, which is advantageous for compression algorithms.

Example order:

```sql
ORDER BY (event_type, toDate(timestamp), org_id)
```

- `event_type`: Low cardinality (e.g., a limited set of event types).
- `toDate(timestamp)`: Reduces timestamp granularity to the date level, aiding partitioning and compression.
- `org_id`: Higher cardinality, but necessary for query filters.
Balancing with Query Patterns
- Frequent Query Filters: If high-cardinality columns are frequently used in `WHERE` clauses, consider placing them earlier in the `ORDER BY` clause despite potential impacts on compression.
- Testing Different Orders: It's essential to test different `ORDER BY` configurations to find the optimal balance between compression and query performance for your specific workload.
Examples
Example 1: Optimized for Compression
```sql
CREATE TABLE logs_events (
    event_type LowCardinality(String),
    timestamp DateTime64(3, 'UTC'),
    org_id UInt32,
    user_id UInt64,
    message String CODEC(ZSTD(1))
    -- Additional fields...
) ENGINE = MergeTree()
ORDER BY (event_type, toDate(timestamp), org_id);
```

Explanation:

- `event_type` is placed first due to its low cardinality and frequent use in filters.
- `toDate(timestamp)` reduces timestamp precision, aiding partitioning and compression.
- `org_id` is included later, balancing compression against query performance.
Example 2: Optimized for Query Performance
```sql
CREATE TABLE logs_events (
    org_id UInt32,
    timestamp DateTime64(3, 'UTC'),
    event_type LowCardinality(String),
    user_id UInt64,
    message String CODEC(ZSTD(1))
    -- Additional fields...
) ENGINE = MergeTree()
ORDER BY (org_id, timestamp, event_type);
```

Explanation:

- `org_id` is placed first because queries frequently filter by organization.
- `timestamp` allows efficient time-range queries.
- `event_type` is included to further refine data skipping during queries.
Applying ClickHouse's Recommendations
From the official documentation:
- Select Columns Aligning with Common Filters: Start with columns used in `WHERE` clauses to maximize data skipping.
- Prefer Columns Likely to Exclude Large Data Portions: Choose columns that can exclude significant data volumes when filtered.
- Order Columns by Increasing Cardinality: Generally, low-cardinality columns first, but balance this with the need for efficient filtering on columns that appear later.
Testing and Iteration
- Benchmark Queries: Test your queries against different `ORDER BY` configurations to measure performance. One way to pull timing data for such benchmarks is sketched after this list.
- Monitor Compression Ratios: Evaluate how changes to the `ORDER BY` clause affect data compression and storage requirements.
- Adjust Based on Findings: Iterate on your schema design based on empirical evidence from testing.
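A minimal sketch for gathering benchmark evidence from `system.query_log` (query logging must be enabled; the table filter is illustrative):

```sql
-- Recent finished queries against the candidate table, newest first
SELECT
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes) AS read_bytes,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%logs_events%'
ORDER BY event_time DESC
LIMIT 10;
```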
Conclusion
Choosing the optimal ORDER BY and PRIMARY KEY in ClickHouse requires balancing multiple factors:
- Query Patterns: Align the `ORDER BY` with columns frequently used in filters.
- Data Cardinality: Consider the cardinality of columns to optimize compression and data skipping.
- Data Correlation: Leverage columns that correlate with others to enhance compression.
- Performance Testing: Continuously test and adjust based on query performance and compression outcomes.
By carefully considering these aspects and leveraging ClickHouse's capabilities, you can design schemas that provide efficient storage and high-performance query execution.
Choosing Index Types
ClickHouse offers secondary indexes to accelerate queries that cannot be efficiently served by the primary key alone.
Types of Indexes
- Data Skipping Indexes: Allow ClickHouse to skip over data that doesn't match query predicates. Types include `minmax`, `bloom_filter`, `set`, and `ngrambf_v1`.
- Token Bloom Filters (`tokenbf_v1`): Use for searching within text fields. Suitable for fields where you need to perform partial matches or wildcard-style searches.
Guidelines for Index Selection
When to Use Indexes:

- If you frequently query on fields that are not part of the `ORDER BY` clause.
- When queries on certain fields are slow due to full table scans.

Choosing Index Type:

- Use `minmax` indexes for numeric and date fields.
- Use `bloom_filter` for string fields with a large number of distinct values.
- Use `set` indexes when the set of possible values is known and relatively small.
Defining an Index
Syntax:

```sql
INDEX index_name (column_name) TYPE index_type GRANULARITY N
```

- `index_name`: Name of the index.
- `column_name`: The column on which the index is built.
- `index_type`: Type of index (e.g., `minmax`, `bloom_filter`).
- `GRANULARITY N`: The number of data granules covered by each index entry; lower values produce more index entries and finer-grained skipping at the cost of a larger index.
Example
Creating a Bloom Filter Index:

```sql
INDEX idx_user_agent (user_agent) TYPE bloom_filter(0.01) GRANULARITY 64
```
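Indexes can also be added after table creation. A minimal sketch, assuming the `logs_events` table from the earlier examples and a token bloom filter sized at 10 KiB with 3 hash functions (both parameters are illustrative starting points):

```sql
-- Add a token bloom filter for word-level searches in message text
ALTER TABLE logs_events
    ADD INDEX idx_message (message) TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4;

-- Build the new index for parts written before the index existed
ALTER TABLE logs_events MATERIALIZE INDEX idx_message;
```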
Best Practices
Data Modeling:

- Understand your data access patterns; design schemas that align with how data is queried.
- Normalize data when appropriate, but denormalize if it leads to better query performance.

Testing and Optimization:

- Test query performance with different `ORDER BY` clauses and index configurations.
- Use `EXPLAIN` to understand how ClickHouse processes your queries (see the sketch after this list).

Compression:

- Choose appropriate codecs based on query frequency and performance requirements.
- Remember that higher compression reduces storage but may impact read performance.

Monitoring:

- Monitor query performance and adjust schemas as data volume and access patterns evolve.
- Use ClickHouse system tables to analyze query statistics and optimize accordingly.
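For the `EXPLAIN` step, `EXPLAIN indexes = 1` shows which primary-key ranges and skip indexes a query actually uses. A minimal sketch against the earlier `logs_events` example table (the filter values are illustrative):

```sql
EXPLAIN indexes = 1
SELECT count()
FROM logs_events
WHERE org_id = 42
  AND timestamp >= now() - INTERVAL 1 DAY;
```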
Examples
Defining a Table with Optimal Data Types and Indexes
```sql
CREATE TABLE xdr_12345678.logs_beats_winlogbeat (
    timestamp DateTime64(3, 'UTC') CODEC(DoubleDelta, LZ4),
    org_id String CODEC(ZSTD(1)),
    event_hash String CODEC(ZSTD(1)),
    user_id LowCardinality(String) CODEC(ZSTD(1)),
    event_type String CODEC(ZSTD(1)),
    source_ip IPv4 CODEC(T64, LZ4),
    destination_ip IPv4 CODEC(T64, LZ4),
    message String CODEC(ZSTD(1)),
    -- Additional fields...
    INDEX idx_event_type (event_type) TYPE set(1000) GRANULARITY 4,
    INDEX idx_user_id (user_id) TYPE bloom_filter(0.01) GRANULARITY 64
) ENGINE = MergeTree()
ORDER BY (org_id, timestamp);
```

Explanation

Data Types:

- `timestamp`: `DateTime64(3, 'UTC')` with the `DoubleDelta` codec for efficient storage of increasing timestamps.
- `org_id`: Stored as `String` with `ZSTD(1)` compression for a balance between speed and storage.
- `user_id`: A low-cardinality string, as it likely has fewer distinct values.

Indexes:

- `idx_event_type`: A `set` index on `event_type`, because the number of event types is limited.
- `idx_user_id`: A `bloom_filter` index on `user_id` to speed up queries filtering on user IDs.

ORDER BY Clause:

- Ordered by `org_id` and `timestamp` to optimize queries filtering by organization and time range.
Conclusion
Designing schemas in ClickHouse for HyperSec XDR involves careful consideration of data types, compression codecs, ordering, and indexing. By following the guidelines provided in this document, you can build efficient and scalable schemas that meet your real-time analytics needs.