Schema Overview
Handling Large and Sparsely Populated Schemas at Scale
Handling large volumes of unstructured logs stored in wide, long, sparsely populated tables is inherently complicated. Existing data management practices such as schema-on-read, which are generally used to manage complex and evolving schemas, are slow, expensive, and ineffective at delivering real-time outcomes.
HyperSec has developed a new approach to handling schemas for these wide, long, and sparsely populated datasets. Leveraging years of industry experience and applying first-principles thinking, we've designed a schema management approach that best fits the type of real-time OLAP engines powering XDR today. This document describes how HyperSec leverages a core unified schema and enables customers to evolve the core schemas or derive subschemas from base schemas.
Schema Overlay Approach
HyperSec XDR provides a schema overlay approach to source data, whereby a common set of fields is added to a source or unified schema supplied by a data source or set of transforms. For example, a Beats source such as WinLogBeat is accepted as supplied, and XDR common and control fields such as customer ID, unique ID (hash), and type are added upon initial ingest landing. The source data format (enforced JSON, not raw log) is retained.
Component Renaming
Where individual components do not support special characters or nested typing, they are resolved as follows:
ClickHouse
Fields
. (dot) or - (dash) are replaced with _ (underscore). Nested names are flattened.
Example: fred.nerk.field becomes fred_nerk_field
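The field renaming rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the XDR implementation: the function names to_column_name and flatten are hypothetical, and the sketch does not handle the @timestamp exception described in the common header.

```python
import re


def to_column_name(field: str) -> str:
    """Replace every dot or dash in a field name with an underscore."""
    return re.sub(r"[.\-]", "_", field)


def flatten(event: dict, prefix: str = "") -> dict:
    """Flatten nested objects into one level of renamed column names.

    Hypothetical helper: nested keys are joined with an underscore,
    then any remaining dots or dashes are replaced as well.
    """
    flat = {}
    for key, value in event.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[to_column_name(name)] = value
    return flat


print(to_column_name("fred.nerk.field"))
print(flatten({"fred": {"nerk": {"field": 1}}, "host-name": "a"}))
```

Two source fields that differ only by a dot versus a dash would collide after renaming; the sketch silently keeps the last one, which a real pipeline would likely want to detect.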
Tables
Customer ID is used as the database, prefixed with xdr_.
. (dot) or - (dash) are replaced with _ (underscore). Nested names are flattened.
Example: An OpenSearch index stream or Kafka topic with org ID 12345678: logs-beats-winlogbeat-12345678 becomes xdr_12345678.logs_beats_winlogbeat
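A minimal sketch of the table mapping follows. It assumes, based only on the single example above, that the org ID is the final dash-separated segment of the stream or topic name and contains no dash itself; the function name to_clickhouse_table is hypothetical.

```python
def to_clickhouse_table(stream: str) -> str:
    """Map a stream or topic name to a '<database>.<table>' identifier.

    Assumption: the org ID is the last dash-separated segment.
    """
    base, sep, org_id = stream.rpartition("-")
    if not sep or not base or not org_id:
        raise ValueError(f"cannot split an org ID from {stream!r}")
    # Dots and dashes in the remaining name become underscores.
    table = base.replace(".", "_").replace("-", "_")
    return f"xdr_{org_id}.{table}"


print(to_clickhouse_table("logs-beats-winlogbeat-12345678"))
```

If the org ID is already known from the event header, it would be safer to strip that known suffix than to infer it from the name.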
Common Header (Version v001.001.005)
A common header is used for all event and context data supplied and ingested. Fields that require performant query analysis are placed as dedicated top-level fields rather than solely within the tags JSON control field, as querying within nested JSON can be significantly slower in OpenSearch and ClickHouse.
Version: v001.001.005
The header is provided as follows:
timestamp
- @timestamp (String in JSON message): 2023-03-05T12:00:00+00:00Z
- @timestamp in OpenSearch
- timestamp in ClickHouse (DateTime with CODEC(DoubleDelta, LZ4))
Renamed to timestamp within ClickHouse; remains @timestamp in the JSON message and OpenSearch. This field is an exception to the underscore _ prefix rule requirement.
timestamp_load
String (JSON message): 2023-03-05T12:00:00+00:00Z
The time the event is loaded into the engine (ClickHouse/OpenSearch) or stored (NAS/S3).
log_original
String
The original (completely raw as received by the HyperCollector) event. Optional for ingestion into engines due to its size and inefficiency. Useful for:
Sending to Splunk
Re-using public raw log parsers
event_hash
String
Unique ID for the event, generated by applying Vector's sha2 function (or an equivalent hash) to the entire original JSON.
See Vector Documentation.
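For pipelines that do not run through Vector, an equivalent hash could be sketched as below. The source only says "sha2 (or equivalent)", so the SHA-256 variant and the hex encoding here are assumptions, and the digest will not necessarily match what a given Vector configuration produces. The function name event_hash is illustrative.

```python
import hashlib


def event_hash(original_json: str) -> str:
    """Return a hex SHA-2 digest of the event exactly as received.

    Assumption: SHA-256 with hex output. The string is hashed as-is;
    re-serializing the JSON first could change whitespace or key order
    and therefore the resulting ID.
    """
    return hashlib.sha256(original_json.encode("utf-8")).hexdigest()


raw = '{"message": "logon failed", "host": "dc01"}'
print(event_hash(raw))
```

Hashing the received text rather than a parsed object keeps the ID stable across components that serialize JSON differently.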
org_id
String
A unique alphanumeric string identifying the source organization, typically an 8-digit number. Can be alphanumeric for backwards compatibility with XDR 1.x.
Also maps from and has the same value as tags.event.org_id from the source JSON.
tags_event_type
String
Event type in <type>_<source>_<subsource> form, e.g., logs_beats_winlogbeat.
Also maps from and has the same value as tags.event.type from the source JSON (dot-separated field).
tags_event_category
String
Event category in <type>_<source> form, e.g., logs_syslog for logs_syslog_cisco_asa.
Note: Used to select the landing topic by default.
This typically maps directly to <type>_<source> from event_type but is separated for exception handling purposes.
Also maps from and has the same value as tags.event.category from the source JSON.
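The default relationship between event type and category can be sketched as follows. The function name resolve_category and the overrides mapping are hypothetical; the overrides argument stands in for the exception handling mentioned above, whose actual mechanism is not described in this document.

```python
from typing import Dict, Optional


def resolve_category(event_type: str,
                     overrides: Optional[Dict[str, str]] = None) -> str:
    """Derive <type>_<source> from <type>_<source>_<subsource>.

    An explicit override, if present, wins over the default derivation.
    """
    if overrides and event_type in overrides:
        return overrides[event_type]
    parts = event_type.split("_")
    if len(parts) < 2:
        raise ValueError(f"expected at least <type>_<source>: {event_type!r}")
    return "_".join(parts[:2])


print(resolve_category("logs_syslog_cisco_asa"))
# A granular source kept in its own category via a hypothetical override.
print(resolve_category("logs_beats_winlogbeat",
                       {"logs_beats_winlogbeat": "logs_beats_winlogbeat"}))
```

Keeping the category as a separate stored field, rather than always deriving it, is what allows such exceptions to be routed to a different landing topic.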
tags_event_subtype
String
Event subtype providing additional granularity, e.g., authentication_failure.
Derived from specific fields within the event data.
Helps in detailed event classification and routing.
customer_id
String
An alternative identifier for the customer, used interchangeably with org_id in some contexts.
Ensures compatibility with systems that use customer_id as the primary identifier.
source_ip
String (IP Address)
The IP address of the source generating the event. Extracted from the event data if available. Useful for network-related analyses and filtering.
destination_ip
String (IP Address)
The IP address of the destination in the event, if applicable. Extracted from the event data. Useful for tracking communication patterns and potential threats.
event_severity
String or Integer
Indicates the severity level of the event, e.g., low, medium, high, or numeric scales.
Standardized across different event types for consistent severity assessment.
event_description
String
A brief description of the event. Provides human-readable context for easier understanding and analysis.
Note: This updated common header reflects the latest structure as of version v001.001.005, ensuring consistency and completeness across all ingested events.
Control Fields
Control fields are used to provide event metadata, ingest pathway routing, and debugging information. These are typically not required for analysis or visualization.
Control fields are placed under the tags field as a JSON object. For example, tags.collector.hostname.
Note: The tags field will override source data meta tags fields (such as Beats tags, which is a string, not JSON).
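The overlay behaviour can be illustrated with a short sketch. It assumes the control object replaces any source-supplied tags value outright and that the header fields org_id, tags_event_type, and tags_event_category are copied from tags.event, as the common header section states. The function name apply_overlay and all sample values are hypothetical.

```python
# Header field name -> key under tags.event it is copied from.
HEADER_FROM_TAGS_EVENT = {
    "org_id": "org_id",
    "tags_event_type": "type",
    "tags_event_category": "category",
}


def apply_overlay(source: dict, control: dict) -> dict:
    """Return a copy of the source event with XDR control fields applied."""
    event = dict(source)
    # The control object overrides a source 'tags' value such as Beats tags.
    event["tags"] = control
    meta = control.get("event", {})
    for header_field, tag_key in HEADER_FROM_TAGS_EVENT.items():
        if tag_key in meta:
            event[header_field] = meta[tag_key]
    return event


beats_event = {"message": "logon failed", "tags": "beats_input_codec_plain"}
control = {
    "collector": {"hostname": "collector-01"},
    "event": {"org_id": "12345678",
              "type": "logs_beats_winlogbeat",
              "category": "logs_beats"},
}
print(apply_overlay(beats_event, control))
```

The source event is copied rather than modified in place, so the original record remains available if it also needs to be kept as received.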
Standard fields are as follows:
collector
JSON Object
The collector that received the event.
collector.host
String
The IP address of the collector.
collector.hostname
String
The hostname of the collector.
collector.source
String
How the event was received.
collector.timestamp
String
When the event was received by the collector.
event
JSON Object
Event metadata.
event.origin
String
How the event was transported or the receiving module.
event.type
String
The type of event in event.type form (version 2.0) or naming standard form (version 2.1+): <type>_<source>_<subsource>.
See Naming Standard.
event.category
String
The broad category of the event, used for ingestion routing. This aligns with the Kafka topic, OpenSearch index, and ClickHouse table it is routed to, as well as the ingestion DQ, enrichment, mapping, and additional parsing stages.
Typically maps directly to <type>_<source> but can vary for granular sources (e.g., logs_beats_winlogbeat) or for ingestion exception handling purposes.
event.subtype
String
Provides additional classification of the event for more detailed routing and analysis. Derived from specific event attributes.