Schema Skipping Index

Introduction

ClickHouse provides several types of skipping indexes that can significantly improve query performance by allowing ClickHouse to skip reading data blocks that cannot contain matching values. These indexes are particularly useful for queries that cannot efficiently use the primary key.

Important: Understanding ClickHouse Skip Indexes

Skip indexes in ClickHouse work fundamentally differently from traditional database indexes. Before implementing skip indexes, it's crucial to understand these key differences:

  • Skip indexes are not intuitive, especially for users familiar with secondary row-based indexes from RDMS or inverted indexes from document stores.

  • To be beneficial, a skip index must avoid enough granule reads to offset the cost of calculating the index.

  • If a value occurs even once in an indexed block, the entire block must be read into memory and evaluated, potentially negating the index's benefit.

  • Unlike traditional secondary indexes that point to specific rows, skip indexes work at the granule level (blocks of data).

  • The effectiveness of a skip index heavily depends on data distribution and correlation with the primary key.

Index Types

Dimension Index

Used for low cardinality columns where values tend to be grouped together. Best for columns like status, category, or other enumerated values.

column,type,index_type,index_order,os_order,comment
category,string,dimension,2,2,category field

Fulltext Index

Used for text columns where you need to search for words or tokens within the text. Best for descriptions, messages, or other free-form text fields.

column,type,index_type,index_order,os_order,comment
description,string,fulltext,3,3,description field

High Cardinality Index

Used for columns with many unique values. Best for identifiers like user_id, session_id, or other high cardinality fields.

column,type,index_type,index_order,os_order,comment
user_id,string,hc,4,4,user identifier

Range Index

Used for numeric columns where you need to search for values within a range. Best for numeric fields like port numbers, timestamps, or measurements.

column,type,index_type,index_order,os_order,comment
port,int16,range,5,5,port number

Schema Configuration

Skipping indexes can be defined in either the meta schema or derived schema files. The derived schema takes precedence over the meta schema for both index_type and index_order.

Meta Schema Example (meta_schema.csv)

column,type,index_type,index_order,os_order,comment
event_id,string,,1,1,event id
category,string,,2,2,category field
description,string,,3,3,description field
user_id,string,,4,4,user identifier
port,int16,,5,5,port number

Derived Schema Example (derived_schema.csv)

column,type,index_type,index_order,os_order,comment
category,string,dimension,2,2,category field
description,string,fulltext,3,3,description field
user_id,string,hc,4,4,user identifier
port,int16,range,5,5,port number

Schema Merging Rules

  1. The derived schema takes precedence over the meta schema for both index_type and index_order.

  2. If the derived schema specifies an index_type but has an invalid index_order (null or non-numeric), the index_type will still be applied while keeping the original index_order from the meta schema.

  3. When overriding index_order in the derived schema, a warning will be logged to indicate the override.

Example of index_type override with invalid index_order:

# Meta Schema
column,type,index_type,index_order,os_order,comment
category,string,,2,2,category field

# Derived Schema (index_type applied, original index_order kept)
column,type,index_type,index_order,os_order,comment
category,string,dimension,,2,category field

Best Practices

  1. Use dimension indexes for low cardinality columns:

    column,type,index_type,index_order,os_order,comment
    status,string,dimension,1,1,status field
    category,string,dimension,2,2,category field
  2. Use fulltext indexes for text columns that need word/token searching:

    column,type,index_type,index_order,os_order,comment
    description,string,fulltext,3,3,description field
    message,string,fulltext,4,4,message content
  3. Use high cardinality indexes for columns with many unique values:

    column,type,index_type,index_order,os_order,comment
    user_id,string,hc,5,5,user identifier
    session_id,string,hc,6,6,session identifier
  4. Use range indexes for numeric columns that need range queries:

    column,type,index_type,index_order,os_order,comment
    port,int16,range,7,7,port number
    timestamp,datetime,range,8,8,event timestamp
  5. Consider the data distribution when choosing index types:

    • Indexes work best when values are clustered together in blocks

    • A skip index on randomly distributed data may not provide any benefit

    • Consider the correlation between the indexed column and the primary key

  6. Be mindful of index overhead:

    • Each index increases storage space and slows down inserts

    • Test with real data to ensure the performance benefit outweighs the cost

    • Limit the number of indexes to those that provide significant query improvements

  7. Test index performance:

    • Verify that queries actually use the indexes (analyze query plans)

    • Test with realistic data volumes and distributions

    • Monitor insert performance impact

    • Consider the trade-off between query speed and insert speed

Last updated