
Guide to Data Ingestion in RDA Fabric

1. Ingesting Data into RDAF Using Event Gateway

RDA Event Gateway is an RDA Fabric component that can be deployed in the cloud or at on-premises / edge locations to ingest data from various sources.

RDA Event Gateway supports the following endpoint types:

Endpoint Type   Protocols    Description
syslog_tcp      TCP/SSL      Syslog or syslog-like event ingestion via TCP or SSL
syslog_udp      UDP          Syslog or syslog-like event ingestion via UDP
http            HTTP/HTTPS   JSON or plain-text formatted events via webhook. Supports HTTP operations POST & PUT
tcp_json        TCP/SSL      JSON-encoded messages, one message per line
filebeat        HTTP/HTTPS   Elasticsearch Filebeat / Winlogbeat based ingestion of data
file            -            Ingestion of data from one or more file(s) or folder(s)
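
For example, a client can push a test message to a syslog_udp endpoint with any standard syslog library. The sketch below uses Python's logging.handlers.SysLogHandler; the gateway hostname and port 514 are assumptions and should be replaced with the host and port configured for the endpoint.

# Minimal sketch (assumed host/port): send a syslog-style test message over UDP
# to an Event Gateway syslog_udp endpoint.
import logging
import logging.handlers

logger = logging.getLogger("rda-syslog-test")
logger.setLevel(logging.INFO)

# SysLogHandler sends messages over UDP (SOCK_DGRAM) by default
handler = logging.handlers.SysLogHandler(address=("gateway.example.com", 514))
logger.addHandler(handler)

logger.info("test event from application foo")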

RDA Event Gateway Endpoint configuration example for Webhook:

endpoints:
  - name: http_events
    enabled: true
    type: http
    secure: true
    content_type: auto
    port: 516
    stream: my-webhook-data
    attrs:
        site_code: cfx_dc

Explanation of configuration fields:

  • name: Name of the endpoint. Must be unique.
  • enabled: If set to false, Event Gateway shuts down the endpoint.
  • type: Type of endpoint; http in this example.
  • secure: If true, the endpoint runs in HTTPS mode.
  • content_type: Type of content to expect in the incoming payload. Possible values are 'auto', 'json', and 'text'. If set to auto, the endpoint detects the content type from the Content-Type HTTP header.
  • port: TCP port on which to listen for data.
  • stream: Name of the RDA Stream where the data will be published for further consumption by RDA Pipelines or Persistent Streams.
  • attrs: Optional dictionary of attributes that will be added to each message's payload.
  • In addition to attrs, the Event Gateway automatically inserts the following attributes into each message:
    • rda_gw_ep_type: Endpoint Type (in this example: 'http')
    • rda_gw_ep_name: Endpoint Name
    • rda_gw_timestamp: Ingested timestamp in ISO format
    • rda_content_type: HTTP Content-Type header value
    • rda_url: HTTP URL
    • rda_path: Path part of HTTP URL
    • rda_gw_client_ip: IP Address of the client that posted the data
    • rda_user_agent: User-Agent of the client
    • rda_stream: RDA Stream where this message is being forwarded to
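
With the http endpoint above enabled, any HTTP client can push events to the gateway using POST or PUT. The following is a minimal sketch using Python's requests library; the gateway hostname, URL path, and self-signed-certificate setting are assumptions, while port 516 and HTTPS come from the example configuration. Each posted event is published to the my-webhook-data stream with the automatic attributes listed above added to its payload.

# Minimal sketch (assumed hostname and path): post a JSON event to the http
# endpoint defined in the example configuration (secure: true, port: 516).
import requests

event = {"device": "router-01", "severity": "critical", "message": "link down"}

resp = requests.post(
    "https://gateway.example.com:516/events",  # path portion is recorded in rda_path
    json=event,                                # sent as Content-Type: application/json
    verify=False,                              # assumption: self-signed certificate on the gateway
)
resp.raise_for_status()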

Automatic Archival of Data from Event Gateway:

RDA Event Gateway can be configured to automatically archive ingested data using the RDA Log Archive feature.

The following is an example snippet of the main.yml configuration file:

# This is the main configuration for Event Gateway.
# Changes to this file only take effect after the Gateway container is restarted
#

# Name of the site at which the Event Gateway is deployed.
# If not specified, the ENV variable RDA_SITE_NAME is used; a value
# specified here overrides the env variable.
site_name: SITE_NAME

archival:
    enabled: true
    # local directory where log files (JSONs and then .gz files) will be saved
    local_dir: /tmp/log_archive/

    # Name of the archival. Must contain only letters and digits, and optionally _ or -
    name: example_archive

    # Local .gz files are deleted immediately after they are copied to the destination (minio or s3).
    # If files cannot be pushed to minio, this controls how long to keep them in the local directory.
    local_retain_max_hours: 24

    # Archival destination. If not specified, archival will be disabled
    destination_repository: demo_logarchive

Note

Log Archive repository (demo_logarchive) must be pre-created using CLI or RDA Portal.


2. Ingesting Data into RDAF Using Message Queues

RDA Pipelines can continuously ingest data from many types of message queues. Some of the most commonly used approaches are:

See the above pages for the list of bots available for ingesting data from different types of queues.

3. Ingesting Data into RDAF Using Purpose Built Bots

RDA provides an extensive set of bots to retrieve data from various sources. The following are some of the available integrations:

4. Ingesting Data Using Staging Area

RDA Pipelines can continuously ingest data from staging area (for example S3 or minio). Data can be ingested directly from files in a specified bucket and a folder path (or prefix).

Staging area definition specifies where data files are stored so that the data in the files can be ingested into RDA Fabric.

Storage Location

  • Staging area definitions are stored in RDAF Object Storage.
  • The staging area data can be in RDAF Object Storage or in any external storage (S3 or minio). For an external staging area, the user needs to create a credential of type stagingarea-ingest for the RDA platform to access the bucket.

Related Bots

Related RDA Client CLI Commands

    staging-area-add          Add or update staging area
    staging-area-delete       Delete a staging area
    staging-area-get          Get YAML data for a staging area
    staging-area-list         List all staging areas.

See RDA CLI Guide for installation instructions

Sample YAML: For staging area in RDA platform

name: staging-area-platform-sample
description: staging area data in RDAF Object Storage
# platform_config field is used when the staging area is in RDAF Object Storage
platform_config:
  object_prefix: /data/
# file criteria in regex format
filename_pattern: .*
# Optional. Delete data from staging area after ingestion (y/n). Default: n
delete_after_ingestion: n
# Optional. If set, only files newer than the provided 'ingest_after' datetime are
# ingested. The field must be in the UTC time zone and ISO date-time format. Default is null.
ingest_after: "2022-05-10T23:12:03.223067"
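
The ingest_after value must be an ISO-formatted date-time in the UTC time zone; one way to generate a suitable value in Python:

# Produce a UTC timestamp in ISO date-time format for the ingest_after field
from datetime import datetime, timezone

print(datetime.now(timezone.utc).replace(tzinfo=None).isoformat())
# e.g. 2022-05-10T23:12:03.223067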

Sample YAML: For staging area that is external (S3 or minio)

name: staging-area-external-sample
description: staging area data in external S3 or minio
# Name of the predefined credential for the external S3 or minio
external_storage_credential_name: "s3-sa"
# file criteria in regex format 
filename_pattern: sample.json
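
Once this staging area is defined, any object in the external bucket that matches filename_pattern becomes eligible for ingestion. Below is a minimal sketch of staging such a file with boto3; the endpoint URL, bucket name, and access keys are assumptions and must correspond to the bucket referenced by the stagingarea-ingest credential.

# Minimal sketch (assumed endpoint, bucket, and keys): upload sample.json to the
# external S3/minio bucket that the staging area points at.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.example.com:9000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.upload_file("sample.json", "staging-bucket", "sample.json")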

Managing through RDA Portal

  • In the RDA Portal, click on the left menu Data
  • Click on 'View Details' next to Data Staging Area

Managing through RDA Studio

  • Studio does not have any user interface for managing the staging area.

5. Ingesting Data Once from a Location

RDA Pipelines can also ingest data once from a given location (S3 or minio). Data can be ingested directly from files in a specified bucket and a folder path (or prefix).

For an external location, the user needs to create a credential of type stagingarea-ingest for the RDA platform to access the bucket.

Related Bots

6. Ingesting Data from Kafka

Data can be ingested into persistent streams via Kafka.

When creating a persistent stream, you can specify Kafka as the messaging platform to read data from (with the data then written to OpenSearch) by adding the following settings in the Attributes section of the UI:

On the left side menu bar, click Configuration → RDA Administration → Persistent Streams → Add → Attributes (add the code below) → Save

{
  "messaging_platform_settings": {
    "platform": "kafka",
    "credential_name": "mykafka",
    "kafka-params": {
      "topics": [
        "kafka_topic1",
        "kafka_topic2"
      ],
      "auto.offset.reset": "latest",
      "consumer_poll_timeout": 1.0
    }
  }
}

To add kafka-v2 credentials from the UI: click Configuration → RDA Integrations → Credentials → Add → Save

Parameter Name          Description
credential_name         Name of the credential of type kafka-v2
topics                  One or more Kafka topics to receive data from
auto.offset.reset       “earliest” or “latest”. Default: “latest”
consumer_poll_timeout   Milliseconds spent waiting in the poll if data is not available in the buffer. Default is 1.
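
Once the persistent stream is configured with these attributes, any JSON message published to the listed topics is consumed and written to OpenSearch. Below is a minimal producer sketch using the kafka-python library; the broker address is an assumption (an unauthenticated broker is assumed for brevity), while the topic name matches the attributes example above.

# Minimal sketch (assumed broker, no authentication): publish a JSON test message
# to one of the topics the persistent stream subscribes to.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.example.com:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("kafka_topic1", {"device": "router-01", "message": "link down"})
producer.flush()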

Related RDA Client CLI Commands:

pstream-add             Add a new Persistent stream
Sub-option              --messaging_platform_settings
                        JSON file containing Messaging platform settings to
                        read data from (Ex KAFKA) and write to Open Search. If
                        not provided, default platform NATS is used.