Ingestion Pipeline Overview

What Are Ingestion Pipelines?

Ingestion pipelines are the backbone of our data infrastructure, designed to collect, process, and store data from various sources efficiently. They serve as conduits that ingest data in real time, transform it appropriately, and store it for further analysis or use. Our pipelines support diverse data types, including logs, metrics, and custom events, and are built to offer flexibility in transformation and storage.

Key Characteristics of Our Ingestion Pipelines:

  • Push-Based: Data is actively pushed from sources to the pipeline, enabling real-time data processing with minimal latency.

  • Modular Design: Composed of interchangeable components, our pipelines are highly maintainable and scalable. Each module handles specific tasks, allowing for easy updates and customization.

  • Dynamic Mappings: Configurations and mappings are driven by CSV files. This approach simplifies updates, allowing adjustments without code changes and enabling non-developers to modify pipeline behavior (a short example follows this list).

  • Data Cleansing and Enrichment: Significant effort is invested in cleansing data, normalizing timestamps, and enriching events to ensure high-quality, consistent datasets for downstream processing.
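The snippet below is a minimal sketch of how such a CSV-driven mapping can be loaded and applied. The file name (event_category_mappings.csv), its source and category columns, and the function names are hypothetical placeholders for illustration, not the pipeline's actual configuration.

```python
import csv

def load_category_mappings(path="event_category_mappings.csv"):
    """Load source-to-category mappings from a CSV file.

    Assumes two columns named 'source' and 'category'; the real
    mapping files may use a different layout.
    """
    mappings = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            mappings[row["source"]] = row["category"]
    return mappings

def categorize_event(event, mappings):
    """Attach a category to an event so it can be routed downstream."""
    event["category"] = mappings.get(event.get("source"), "uncategorized")
    return event
```

Because the mapping lives in a plain CSV file, changing how a source is categorized only requires editing a row and reloading the file; no code change or redeployment is needed.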

Main Steps in the Pipeline

Our ingestion pipeline consists of three primary steps:

  1. Receiver: The initial point of data capture. The receiver listens on designated ports for incoming data from various sources. It performs initial processing such as normalizing timestamps, categorizing events based on dynamic mappings, and forwarding data to the next stage. Key functions include:

    • Timestamp Normalization: Ensures all events have consistent and accurate timestamps.

    • Event Category Mapping: Assigns categories to events for correct routing and processing downstream.

    • Security: Implements TLS configurations to secure communications between components.

  2. Finalization: This step performs generic transformations, data cleansing, and enrichment required for all data before it reaches its final destination (a sketch of these transformations follows this list). Key functions include:

    • Data Cleansing: Removes empty or null fields, and standardizes field names to a consistent format (e.g., camelCase).

    • Enrichment: Adds unique identifiers like event_hash for deduplication and tracking, and converts timestamps to a unified format.

    • Transformation: Flattens nested JSON structures and converts specific fields as required.

  3. Load to ClickHouse (Load_CH): The final step loads data into ClickHouse, a high-performance columnar database optimized for analytical queries. This step ensures data is correctly routed to the appropriate tables based on subschemas, handles customer organization IDs (org_id), and applies the transformations needed to align with ClickHouse's schema and data types (see the routing sketch below). Key functions include:

    • Subschema Routing: Categorizes data into subschemas to ensure it is stored in the correct tables.

    • Data Validation: Filters out invalid data and routes problematic events to a Dead Letter Queue (DLQ) for further inspection.

    • Schema Compliance: Applies transformations to conform data to ClickHouse's requirements, enhancing query performance and data integrity.
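To make the Finalization step more concrete, the sketch below combines the cleansing, enrichment, and flattening tasks listed above into a single pass over an event. The epoch-seconds timestamp field, the SHA-256 event_hash, and the dot-delimited flattening are assumptions made for the example; the production pipeline's exact rules may differ.

```python
import hashlib
import json
from datetime import datetime, timezone

def to_camel_case(name):
    """Convert snake_case or kebab-case field names to camelCase."""
    parts = name.replace("-", "_").split("_")
    return parts[0].lower() + "".join(p.title() for p in parts[1:])

def flatten(obj, parent=""):
    """Flatten nested dictionaries into dot-delimited top-level keys."""
    out = {}
    for key, value in obj.items():
        full_key = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            out.update(flatten(value, full_key))
        else:
            out[full_key] = value
    return out

def finalize(event):
    """Cleanse, enrich, and transform a single event (illustrative only)."""
    flat = flatten(event)
    # Cleansing: drop empty/null fields and standardize names to camelCase.
    clean = {to_camel_case(k): v for k, v in flat.items()
             if v not in (None, "", [], {})}
    # Enrichment: normalize the timestamp to a unified ISO 8601 UTC format
    # (assumes an epoch-seconds 'timestamp' field; real sources vary).
    if "timestamp" in clean:
        clean["timestamp"] = datetime.fromtimestamp(
            float(clean["timestamp"]), tz=timezone.utc
        ).isoformat()
    # Enrichment: add a deterministic event_hash for deduplication and tracking.
    clean["event_hash"] = hashlib.sha256(
        json.dumps(clean, sort_keys=True, default=str).encode()
    ).hexdigest()
    return clean
```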

By following this structured approach, our ingestion pipelines efficiently process vast amounts of data, ensuring it is clean, enriched, and ready for analysis or further processing.
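For the final Load_CH step, the following sketch shows the general shape of subschema routing and DLQ handling. The subschema-to-table mapping, the required-field list, and the use of the clickhouse-connect client are illustrative assumptions; the real tables and validation rules come from the pipeline's configuration.

```python
import clickhouse_connect

# Hypothetical subschema-to-table mapping; the real pipeline derives
# this from its CSV-driven configuration.
SUBSCHEMA_TABLES = {
    "network": "events_network",
    "auth": "events_auth",
}
REQUIRED_FIELDS = ("org_id", "timestamp", "subschema")

def load_to_clickhouse(events, client, dlq):
    """Route events into per-subschema tables; park invalid events in the DLQ."""
    batches = {}
    for event in events:
        table = SUBSCHEMA_TABLES.get(event.get("subschema"))
        if table is None or any(f not in event for f in REQUIRED_FIELDS):
            dlq.append(event)  # invalid or unroutable: keep for inspection
            continue
        batches.setdefault(table, []).append(event)
    for table, rows in batches.items():
        # Build a consistent column list for the batch and insert it.
        columns = sorted({key for row in rows for key in row})
        data = [[row.get(col) for col in columns] for row in rows]
        client.insert(table, data, column_names=columns)

if __name__ == "__main__":
    # Example usage with a placeholder connection and an in-memory DLQ.
    client = clickhouse_connect.get_client(host="localhost")
    dlq = []
    load_to_clickhouse(
        [{"subschema": "auth", "org_id": 42,
          "timestamp": "2024-01-01T00:00:00Z", "user": "alice"}],
        client,
        dlq,
    )
```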


Summary of the Main Points:

  • Push-Based: Real-time processing of data actively pushed from various sources.

  • Modular: The pipeline is built from interchangeable components for easy maintenance and scalability.

  • Dynamic Mappings: Uses CSV-driven configurations for flexible and simple updates.

  • Data Cleansing and Enrichment: Prioritizes data quality through cleansing, timestamp normalization, and enrichment tasks.

Our ingestion pipelines are designed to be robust, flexible, and efficient, ensuring that data flows smoothly from sources to storage, ready to deliver value through analysis and insights.
