Schema Overview
Handling Large and Sparsely Populated Schemas at Scale
Handling large sets of unstructured logs with wide, long, and sparsely populated table shapes is inherently complicated. Existing data management practices such as schema-on-read, which are generally used to manage complex and evolving schemas, are slow, expensive, and ineffective at delivering real-time outcomes.
HyperSec has developed a new approach to handling schemas for these wide, long, and sparsely populated datasets. Leveraging years of industry experience and applying first-principles thinking, we've designed a schema management approach that best fits the type of real-time OLAP engines powering XDR today. This document describes how HyperSec leverages a core unified schema and enables customers to evolve the core schemas or derive subschemas from base schemas.
Schema Overlay Approach
HyperSec XDR provides a schema overlay approach to source data, whereby a common set of fields is added to a source or unified schema supplied by a data source or set of transforms. For example, a Beats source such as WinLogBeat is accepted, and XDR common and control fields such as customer ID, unique ID (hash), and type are added upon initial ingest landing. The source data format (enforced JSON, not raw log) is retained.
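For illustration only, a minimal Python sketch of the overlay step (the function name and the exact overlay fields shown are assumptions for the example; the real pipeline runs in the ingest layer, not in ad-hoc Python):

    import hashlib
    import json
    from datetime import datetime, timezone

    def overlay_common_fields(source_event: dict, org_id: str, event_type: str) -> dict:
        """Layer XDR common/control fields on top of a source event; source fields are retained as-is."""
        original_json = json.dumps(source_event, sort_keys=True)
        overlaid = dict(source_event)                  # keep the enforced-JSON source fields
        overlaid.update({
            "org_id": org_id,                          # customer / organisation ID
            "event_hash": hashlib.sha256(original_json.encode()).hexdigest(),  # unique ID
            "tags_event_type": event_type,             # e.g. logs_beats_winlogbeat
            "timestamp_load": datetime.now(timezone.utc).isoformat(),
        })
        return overlaid

    # Example: a (truncated) WinLogBeat event landing at ingest
    winlogbeat_event = {"@timestamp": "2023-03-05T12:00:00+00:00", "winlog": {"event_id": 4625}}
    print(overlay_common_fields(winlogbeat_event, "12345678", "logs_beats_winlogbeat"))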
Component Renaming
Where individual components do not support special characters or nested typing, they are resolved as follows:
ClickHouse
Fields
. (dot) or - (dash) are replaced with _ (underscore). Nested names are flattened.
Example:
fred.nerk.field becomes fred_nerk_field
Tables
Customer ID is used as the database name, prefixed with xdr_.
. (dot) or - (dash) are replaced with _ (underscore). Nested names are flattened.
Example:
An OpenSearch index stream or Kafka topic with org ID 12345678:
logs-beats-winlogbeat-12345678 becomes xdr_12345678.logs_beats_winlogbeat
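A minimal Python sketch of these renaming rules, assuming they are applied before data reaches ClickHouse (the helper names are illustrative, not product code):

    def normalize_name(name: str) -> str:
        """Replace dots and dashes with underscores, per the ClickHouse naming rules."""
        return name.replace(".", "_").replace("-", "_")

    def flatten(event: dict, prefix: str = "") -> dict:
        """Flatten nested JSON names, e.g. fred.nerk.field -> fred_nerk_field."""
        flat = {}
        for key, value in event.items():
            name = f"{prefix}_{key}" if prefix else key
            if isinstance(value, dict):
                flat.update(flatten(value, name))
            else:
                flat[normalize_name(name)] = value
        return flat

    def clickhouse_table(stream: str, org_id: str) -> str:
        """Map an OpenSearch index stream / Kafka topic to an xdr_<org_id>.<table> name."""
        table = stream.rsplit("-" + org_id, 1)[0]      # drop the trailing org ID
        return f"xdr_{org_id}.{normalize_name(table)}"

    assert normalize_name("fred.nerk.field") == "fred_nerk_field"
    assert flatten({"fred": {"nerk": {"field": 1}}}) == {"fred_nerk_field": 1}
    assert clickhouse_table("logs-beats-winlogbeat-12345678", "12345678") == "xdr_12345678.logs_beats_winlogbeat"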
Common Header (Version v001.001.005)
A common header is used for all event and context data supplied and ingested. Fields that require performant query analysis are promoted to dedicated top-level fields rather than being kept solely within the tags JSON control field, as querying nested JSON can be significantly slower in OpenSearch and ClickHouse.
The header is provided as follows:
timestamp
- @timestamp (String in JSON message): 2023-03-05T12:00:00+00:00
- @timestamp in OpenSearch
- timestamp in ClickHouse (DateTime with CODEC(DoubleDelta, LZ4))
Renamed to timestamp within ClickHouse; it remains @timestamp in the JSON message and in OpenSearch. This field is an exception to the underscore prefix rule.
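A minimal Python sketch of the rename when preparing a row for ClickHouse (the function is illustrative; the target column noted in the docstring mirrors the type and codec above):

    from datetime import datetime

    def to_clickhouse_row(json_event: dict) -> dict:
        """Rename @timestamp -> timestamp for ClickHouse; OpenSearch keeps @timestamp.

        Illustrative target column: timestamp DateTime CODEC(DoubleDelta, LZ4)
        """
        row = dict(json_event)
        row["timestamp"] = datetime.fromisoformat(row.pop("@timestamp"))
        return row

    print(to_clickhouse_row({"@timestamp": "2023-03-05T12:00:00+00:00", "org_id": "12345678"}))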
timestamp_load
String (JSON message): 2023-03-05T12:00:00+00:00
The time the event is loaded into the engine (ClickHouse/OpenSearch) or stored (NAS/S3).
log_original
String
The original event, completely raw as received by the HyperCollector. Optional for ingestion into engines due to its size and inefficiency. Useful for:
Sending to Splunk
Re-using public raw log parsers
event_hash
String
Unique ID for the event, generated by Vector's sha2 (or equivalent) of the entire original JSON.
See Vector Documentation.
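An equivalent hash can be reproduced outside Vector, for example in Python; note that the canonicalisation (key order, whitespace) shown here is an assumption and must match whatever serialisation the collector pipeline actually feeds to sha2:

    import hashlib
    import json

    def event_hash(original_event: dict) -> str:
        """SHA-256 of the entire original JSON message (an equivalent of Vector's sha2)."""
        canonical = json.dumps(original_event, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    print(event_hash({"@timestamp": "2023-03-05T12:00:00+00:00", "message": "logon failure"}))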
org_id
String
A unique string identifying the source organization, typically an 8-digit number; it can be alphanumeric for backwards compatibility with XDR 1.x.
Also maps from and has the same value as tags.event.org_id from the source JSON.
tags_event_type
String
Event type in <type>_<source>_<subsource> form, e.g., logs_beats_winlogbeat.
Also maps from and has the same value as tags.event.type from the source JSON (dot-separated field).
tags_event_category
String
Event category in <type>_<source> form, e.g., logs_syslog for logs_syslog_cisco_asa.
Note: Used to select the landing topic by default.
This typically maps directly to <type>_<source> from event_type but is separated for exception handling purposes.
Also maps from and has the same value as tags.event.category from the source JSON.
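A minimal Python sketch of the default derivation, which takes the first two underscore-separated components of the event type; the exception table here is purely illustrative of how granular sources or ingestion exceptions could be handled:

    # Illustrative exceptions for granular sources or ingestion exception handling.
    CATEGORY_EXCEPTIONS = {"logs_beats_winlogbeat": "logs_beats_winlogbeat"}

    def event_category(event_type: str) -> str:
        """Derive tags_event_category (<type>_<source>) from tags_event_type."""
        return CATEGORY_EXCEPTIONS.get(event_type, "_".join(event_type.split("_")[:2]))

    assert event_category("logs_syslog_cisco_asa") == "logs_syslog"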
tags_event_subtype
String
Event subtype providing additional granularity, e.g., authentication_failure.
Derived from specific fields within the event data.
Helps in detailed event classification and routing.
customer_id
String
An alternative identifier for the customer, used interchangeably with org_id in some contexts.
Ensures compatibility with systems that use customer_id as the primary identifier.
source_ip
String (IP Address)
The IP address of the source generating the event. Extracted from the event data if available. Useful for network-related analyses and filtering.
destination_ip
String (IP Address)
The IP address of the destination in the event, if applicable. Extracted from the event data. Useful for tracking communication patterns and potential threats.
event_severity
String or Integer
Indicates the severity level of the event, e.g., low, medium, high, or numeric scales.
Standardized across different event types for consistent severity assessment.
event_description
String
A brief description of the event. Provides human-readable context for easier understanding and analysis.
Note: This updated common header reflects the latest structure as of version v001.001.005, ensuring consistency and completeness across all ingested events.
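Putting the fields above together, a single ingested event might carry a common header like the following (all values are illustrative):

    example_header = {
        "@timestamp": "2023-03-05T12:00:00+00:00",       # becomes timestamp in ClickHouse
        "timestamp_load": "2023-03-05T12:00:05+00:00",   # when the engine/storage loaded it
        "log_original": "<raw syslog line as received by the HyperCollector>",  # optional
        "event_hash": "9f86d081884c7d65...",             # truncated for readability
        "org_id": "12345678",
        "customer_id": "12345678",
        "tags_event_type": "logs_syslog_cisco_asa",
        "tags_event_category": "logs_syslog",
        "tags_event_subtype": "authentication_failure",
        "source_ip": "10.0.0.15",
        "destination_ip": "203.0.113.7",
        "event_severity": "high",
        "event_description": "ASA denied inbound connection attempt",
    }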
Control Fields
Control fields are used to provide event metadata, ingest pathway routing, and debugging information. These are typically not required for analysis or visualization.
Control fields are placed under the tags field as a JSON object. For example, tags.collector.hostname.
Note: The tags field will override any tags field present in the source data (such as the Beats tags field, which is not a JSON object).
Standard fields are as follows:
collector
JSON Object
The collector that received the event.
collector.host
String
The IP address of the collector.
collector.hostname
String
The hostname of the collector.
collector.source
String
How the event was received.
collector.timestamp
String
When the event was received by the collector.
event
JSON Object
Event metadata.
event.origin
String
How the event was transported or the receiving module.
event.type
String
The type of event in event.type form (version 2.0) or naming standard form (version 2.1+): <type>_<source>_<subsource>.
See Naming Standard.
event.category
String
The broad category of the event, used for ingestion routing. This aligns with the Kafka topic, OpenSearch index, and ClickHouse table it is routed to, as well as the ingestion DQ, enrichment, mapping, and additional parsing stages.
Typically maps directly to <type>_<source> but can vary for granular sources (e.g., logs_beats_winlogbeat) or for ingestion exception handling purposes.
event.subtype
String
Provides additional classification of the event for more detailed routing and analysis. Derived from specific event attributes.
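As a worked example, the control fields for a single syslog event might look like the following under tags (hostnames, addresses, and timestamps are illustrative):

    example_tags = {
        "collector": {
            "host": "192.0.2.10",                        # collector IP address
            "hostname": "hypercollector-01",             # collector hostname
            "source": "syslog_udp",                      # how the event was received
            "timestamp": "2023-03-05T12:00:01+00:00",    # when the collector received it
        },
        "event": {
            "origin": "syslog",                          # transport / receiving module
            "type": "logs_syslog_cisco_asa",             # <type>_<source>_<subsource>
            "category": "logs_syslog",                   # drives topic/index/table routing
            "subtype": "authentication_failure",
        },
    }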