
cfxOIA: Operations Intelligence & Analytics

1. What is Operations Intelligence & Analytics

The CloudFabrix AIOps solution is called Operations Intelligence & Analytics (cfxOIA). This solution provides domain-agnostic AIOps capabilities to bring algorithmic decisions to IT operations from several disparate monitoring and other operational data sources. cfxOIA, or OIA, is a software solution that runs as a distributed application using a microservices and containers architecture. OIA is available as an enterprise offering for on-premise or cloud deployment. OIA is also offered as a fully managed SaaS by CloudFabrix or its partners.


2. How it works

cfxOIA works by ingesting IT operational data, such as alerts, events, and traces from multiple performance monitoring tools, log-based alerts from log monitoring tools, and observational data from data lakes, and performs algorithmic correlation of alerts to reduce noise. OIA normalizes every alert with enrichment data established by stitching together CMDB data, service mappings, and asset management data, deriving context-rich data for every alert that is ingested into the platform.

cfxOIA then correlates alerts based on the enriched data. OIA's machine learning engine identifies symptomatic patterns in alert data; these patterns are then provided as recommendations to AIOps administrators, who can use them to group or deduplicate future alerts that match those symptoms. Admins can create additional correlation policies to tune algorithmic correlation behavior, grouping alerts across the entire application stack, within a time window, or within an infrastructure layer.


cfxOIA has an out-of-the-box implementation to correlate well-known operational issues related to alert burst scenarios, alert flapping situations, and transient alerts. This robust correlation engine allows the admin to implement event correlation for any type of situation: the majority of patterns are detected with unsupervised machine learning, combined with the additional flexibility of admin-configurable policies to tune correlation behavior. Groups of correlated alerts are called Alert Groups, and the policies are called Correlation Policies.

Deduplicated and correlated alerts are grouped into an Alert Group that indicates an active operational issue, or an OIA Incident. Every Alert Group has one OIA Incident, which is sent to ITSM systems (such as ServiceNow, PagerDuty, etc.) and to the OIA Incident Room for further incident processing.


Incident Room is a dynamic, incident-centric workbench that provides all the triage data, operational metrics, KPIs, logs, impacted-asset context, collaboration, and diagnostic tools in one place, so that operators can swiftly perform incident root cause analysis and service restoration. This helps reduce incident MTTR.

3. Deployment

cfxOIA is an application that is installed on top of RDA Fabric platform.

Please refer to Setup, Install and Upgrade of OIA Application Services.

4. Data Ingestion & Integrations

cfxOIA operates on IT operational data like alerts, events, traces, and metrics, most of which are generated by monitoring tools and in some cases replicated in an aggregate data lake. OIA supports integrations with many featured vendors using webhooks, APIs, Kafka messages, etc. Custom integrations can be developed and supported by CloudFabrix Professional Services or partners using CloudFabrix-provided developer SDKs.

4.1 Alert Ingestion in RDA AlertWatch Module


Some screenshots below need zooming. Please click on those screenshots to enlarge them; to return to the page, click the back arrow button at the top left.

4.1.1 High Level Flow Diagram


The following broad steps are needed to ingest and process events (alerts / incidents / messages).

1. Add a source endpoint that creates a sink for posting events from the source. For example, to consume alerts from AppDynamics, add a webhook source endpoint.

2. Enable the endpoint to capture initial events. These events are not processed yet but will be recorded in event tracking. The raw event payload can be downloaded from the event tracking report.

3. Add a mapping rule to transform the raw event payload to an internal event model. Use the downloaded event payload as input to test the mapping rule and evaluate the internal event generated via the mapping.

4. Enable the endpoint to process incoming events from the source.
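As a concrete sketch of steps 1 and 2, a source system posts a raw JSON event to the webhook endpoint over HTTP. The payload fields and endpoint path below are hypothetical placeholders for illustration, not a documented schema:

```python
import json

# Hypothetical raw alert payload from a monitoring tool; every field
# name here is an illustrative assumption, not a documented schema.
raw_event = {
    "eventType": "POLICY_OPEN",
    "severity": "CRITICAL",
    "application": "CMS",
    "node": "cms-web-01",
    "eventMessage": "CPU utilization crossed 90%",
}

# The webhook source endpoint receives this body via HTTP POST, e.g.:
#   curl -X POST https://<oia-host>/<endpoint-path> \
#        -H 'Content-Type: application/json' -d @event.json
body = json.dumps(raw_event)
print(body)
```

At this stage (step 2) the event is only recorded in event tracking; it is not processed until a mapping rule exists (step 3).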

4.1.2 Source Event Endpoint

For ingesting events from a source, we need to add a Source endpoint.

Go to Home --> Administration --> Organizations --> Configure --> ALERT ENDPOINTS --> Click Add -->


Alert Endpoints

  • Navigate to the appropriate section based on the type of the incoming event - alert or incident or message.

Endpoint Type

  • Add an endpoint and pick the appropriate type for the endpoint. For instance, choose Webhook HTTP Service to create an HTTP endpoint where events can be posted.


  • After the endpoint is added, as shown in the screenshot below, use the toggle switch under Enabled to enable the endpoint to start ingesting alerts into the system.

Enable Endpoint

4.1.3 Source Event Payload

Go to Home --> User Dashboards --> select OIA Alerts and Incidents --> Click on Tracking --> Select Events

  • Download the raw payload of the event posted to the webhook from the source, as shown in the example screenshot below.

Download Payload

4.1.4 Source Event Mapping

  • The source events will NOT be processed by the system until an appropriate mapping rule translates the raw events into an actionable alert, incident, or message.

To create a mapping rule for translating the source event as an alert, navigate to Home --> Administration --> Organizations --> Configure --> ALERT MAPPINGS --> click on Add

Alert Mapping

  • Select the source endpoint created earlier and select OIA Alert as target to ingest the event into the system as an actionable alert. Click Next

Add Details

  • Select one of the pre-defined mappers or select default.mapper-default-events-inbound-json to create a new JSON mapper. Also select pipeline default.pipeline-inbound for processing the alert via the ingestion pipeline.

Mapper Pipeline

  • Use the contents of the downloaded source payload as input. Create a JSON mapping definition to translate the input into a system-identifiable alert.

For more information on JSON-based alert mapping, please click here.

Use the Run Test option to view the mapping results in the output pane. After testing, save the mapping rule.

Run Test

  • Any incoming events to the system will now be processed as per the defined mapping rule, and an alert will be created in the system.

Configure Enrichment of Metric Columns Using Mapper

  • This adds the attributes required for metric collection.

  "func": {
    "stream_enrich": {
      "name": "oia-selected-metrics",
      "condition": "asset_id is '$assetIpAddress' or asset_name is '$assetName'",
      "enriched_columns": {
        "metric_source": "metric_source",
        "asset_id": "metric_asset_id"
      }
    }
  }
4.1.5 Event Tracking

Go to Home --> User Dashboards --> select OIA Alerts and Incidents --> Click on Tracking --> Select Events

View the alert created from the incoming event by navigating to the OIA Alerts and Incidents dashboard; under the Events tab, select View Alerts for the incoming event.

Event Tracking

  • Alternatively, you can check all the mapped alerts under the Alerts tab.

Event Tracking

4.1.6 Alerts Report

  • View all alerts created in the system by clicking on the Alerts tab

Go to Home --> User Dashboards --> select OIA Alerts and Incidents --> Click on Alerts

Alert Report

Click on the incident link to view the incident details


After the user clicks on the incident link

Incident Id

4.1.7 Incidents Report

View all incidents created in the system by clicking on the Incidents tab

Incident Report

5. Data Analysis and Stitching

Large enterprise environments have a mix of structured and unstructured IT data sources, with many custom IT data parameters defined and implemented across various data sources. For example, IT environments can implement custom attributes like machine type, environment, site code, department name, support group, application ID, etc. Not every tool implements these attributes, making it difficult to understand which operational data sources are relevant for an AIOps implementation and which attributes can be gleaned from which sources to enrich raw alert data. This is where the cfxOIA Data Analysis and Stitching module comes into the picture, helping establish the following:

  • Asset Identities
  • Enrichment Attributes
  • Enrichment Flows
  • Baseline Analysis

This module works off historical alert/event data, ticket data, CMDB data, service mappings, and asset management data, and establishes a data chain that helps select the appropriate data sources and enrichment attributes for an AIOps implementation.

6. Alert Enrichment

Raw alert data contains extremely limited information, often consisting of id, severity, message/description, rule name, asset IP/hostname, etc. This information doesn't provide enough service context (Application or Service name, Environment, machine-type, etc.) or supportability context (NOC id, Site-id, Department, Support-group, etc.), which is essential data for efficient correlation of alerts. cfxOIA performs automated alert data enrichment using a combination of the following approaches:

  • ACE (Automated Context Extraction): Using this method, OIA extracts useful information like IP address, DNS name, and certain identifiable attributes from the source alert's payload. This doesn't require any external integrations; however, in the majority of scenarios, this may not be sufficient for alert correlation.
  • External source lookup: This process looks up information related to the incoming alerts in an external data source (ex: CMDB or inventory system, CSV, etc.) and then adds it as enriched alert attributes. Enriched attributes present more contextual information to the IT Operations user and are also used to correlate the alerts.
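The ACE approach can be sketched as pattern extraction over the raw payload text. This is a minimal illustration of the idea, assuming simple regex-based extraction (the actual implementation is not documented here):

```python
import re

# Minimal sketch of ACE-style context extraction (assumed behavior):
# pull an IP address and a hostname-like token out of a raw alert message.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
HOST_RE = re.compile(r"\b[a-z][a-z0-9-]*\.(?:[a-z0-9-]+\.)+[a-z]{2,}\b")

def extract_context(message: str) -> dict:
    """Return identifiable attributes found in the alert message text."""
    ctx = {}
    ip = IP_RE.search(message)
    host = HOST_RE.search(message)
    if ip:
        ctx["assetIpAddress"] = ip.group()
    if host:
        ctx["assetName"] = host.group()
    return ctx

print(extract_context("Disk latency high on db01.example.com (10.20.1.7)"))
# → {'assetIpAddress': '10.20.1.7', 'assetName': 'db01.example.com'}
```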

6.1. Normalization

Alert notifications are ingested into the cfxOIA application from disparate monitoring tools, and each follows a different format with different alert attributes. The attributes below (not an exhaustive list) are generally important for any incoming alert.

  • Alert Timestamp
  • Alert Status
  • Alert Severity
  • Alert Source
  • Alert Message

Below are three sample alert notification payloads from VMware vROps, Nagios, and AppDynamics. As shown below, the alert attributes are completely different from each other.


In the cfxOIA application, it is a prerequisite to normalize these alert attributes coming from different monitoring tool sources into a common data model. Below is the list of attributes used as part of the alert mapping process. Every ingested alert goes through the alert mapping process, and its payload attributes are mapped to the standard attributes below.


Not all of the attributes below are mandatory to map. The attributes flagged with * are mandatory.

  • alertCategory: An attribute which can be used to categorize the alert
  • alertType: An attribute to classify type of alert
  • assetId: An attribute which can be used to identify the source of alert (Endpoint identity)
  • assetIpAddress: An attribute that is used to identify the IP Address of the end point
  • assetName*: An attribute that is used to identify the AssetName of the end point (ex: Hostname / Devicename)
  • assetType: An attribute that is used to identify type of the Asset or the end point (ex: VM / Server / Storage / CPU / Memory etc)
  • clearedAt*: Alert timestamp that is used to identify when the alert was cleared
  • componentId: An attribute to associate a sub-component ID of an endpoint from which the alert was generated
  • componentName: An attribute to associate a sub-component name of an endpoint from which the alert was generated
  • message*: Alert message that states the symptom or problem which has caused the alert
  • raisedAt*: Alert timestamp that is used to identify when the alert occurred
  • severity*: Alert's severity (Ex: Critical, Warning, Minor etc..)
  • status*: Alert's state (Open / Closed / Active / Recovered / Cancelled)
  • alertkey*: Alert's unique identifier which is used to identify an incoming alert and to apply alert de-duplication process. It can be taken from a single alert attribute or a combination of alert's attributes
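As an illustration of this mapping, here is a minimal sketch that normalizes a raw vROps-style event into the standard attributes above. The raw field names (resourceName, criticality, etc.) are assumptions for illustration, not vROps' actual payload schema:

```python
# Hypothetical mapper: translate a raw vROps-style payload into the
# normalized cfxOIA alert attributes. Raw field names are assumed.
def normalize_vrops(raw: dict) -> dict:
    return {
        "assetName": raw["resourceName"],           # mandatory
        "message": raw["info"],                     # mandatory
        "raisedAt": raw["startTimeUTC"],            # mandatory
        "clearedAt": raw.get("cancelTimeUTC"),      # mandatory; null while open
        "severity": raw["criticality"].title(),     # mandatory
        "status": "Open" if raw["status"] == "ACTIVE" else "Closed",  # mandatory
        # alertkey: unique id built from one or more source attributes
        "alertkey": f'{raw["resourceName"]}:{raw["alertDefinitionName"]}',
        "assetType": raw.get("resourceKind"),
    }

raw = {
    "resourceName": "vm-cms-01", "info": "CPU usage above 90%",
    "startTimeUTC": 1700000000000, "cancelTimeUTC": None,
    "criticality": "CRITICAL", "status": "ACTIVE",
    "alertDefinitionName": "cpu-high", "resourceKind": "VirtualMachine",
}
print(normalize_vrops(raw))
```

Note how alertkey is composed from a combination of source attributes, as described above, so that repeat notifications of the same problem deduplicate onto one alert.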

Alert ingestion with alert mapping & normalization process data flow:


6.2. Enrichment

The flow below illustrates the different stages of alert processing: ingestion, alert attribute mapping, alert enrichment, correlation/suppression, and persistence into the system's database.



In the above illustration, the listed enrichment data sources, such as SNOW (ServiceNow), Nagios, and vROps, are shown for quick reference only. cfxOIA supports many data sources for the enrichment process.

6.3. Enrichment Pipelines

The alert enrichment pipeline has two configuration blocks.

  • Query an external data source (like CMDB, Nagios, vROps, etc.) and save the enriched attributes into a dataset (CSV-style table)

  • Define condition(s) or filter rule(s) for the lookup, taking one or more of the alert's payload attributes (ex: assetName / assetIpAddress, etc.) and querying additional attributes for a matched record from the saved dataset of the external data source


Source alert attributes must first be normalized using the alert mapping configuration for each source before the enrichment process. For more information, please refer to Alert attribute normalization.

Below are a few key alert attributes which can be used to look up enrichment attributes from the saved dataset created from the external data source integration.

  • assetName
  • assetIpAddress
  • componentName

The screen below shows a sample enrichment pipeline extracting additional attributes from the 'Nagios' monitoring tool and configuring the system to use them as part of the alert enrichment process.


Enrichment conditions / filters examples:

In the example below, the saved dataset is from VMware vROps, i.e. vrops-resource-properties.

As a condition rule, multiple attributes are used to look up enriched attributes from the saved dataset. In each condition, the left-hand side is a column within the saved dataset from vROps, and the right-hand ($-prefixed) side is the alert attribute mapped from the source alert using the alert mapping process.

Condition-1: identifier == '$assetId' checks if the assetId attribute from the alert payload matches identifier within the saved dataset.

Condition-2: vmw_name == '$assetName' checks if the assetName attribute from the alert payload matches vmw_name within the saved dataset.

Condition-3: vmw_guest_ipaddress == '$assetIpAddress' checks if the assetIpAddress attribute from the alert payload matches vmw_guest_ipaddress within the saved dataset.


Any attribute which is specified with $ represents alert payload's mapped attribute. For more information, please refer Alert attribute normalization

enrichcolumns: vmw_name, vmw_parent_vcenter, vmw_powerstate, vmw_accessible_status


When no enrich columns are specified, all columns that do not have null or empty values are fetched as enriched attributes.

Condition operator for all of the above conditions: OR (meaning enriched attributes are extracted if any of the conditions finds a matching record within the saved dataset from vROps).

- query: datasetname = 'vrops-resource-properties' & condition = "(identifier == '$assetId' or vmw_name == '$assetName' or vmw_guest_ipaddress == '$assetIpAddress')" & enrichcolumns = 'vmw_name,vmw_parent_vcenter,vmw_powerstate,vmw_accessible_status'

Below are some of the supported operators which can be used while querying the saved dataset of an external source using conditions.

  • == (equals)
  • != (not equals)
  • or
  • and
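The lookup semantics described above can be sketched as follows. The function, dataset rows, and column names are illustrative assumptions that mirror the vROps example; only the == operator with OR conditions is shown:

```python
# Sketch of the enrichment lookup (assumed semantics): find the first
# dataset row where ANY condition matches (OR operator), then copy the
# requested enrich columns onto the alert, skipping null/empty values.
def enrich(alert: dict, dataset: list, conditions: list, enrichcolumns: list) -> dict:
    for row in dataset:
        # conditions are (dataset_column, alert_attribute) pairs compared with ==
        if any(row.get(col) == alert.get(attr) for col, attr in conditions):
            enriched = {c: row[c] for c in enrichcolumns
                        if row.get(c) not in (None, "")}
            return {**alert, **enriched}
    return alert  # no match: alert is returned unenriched

dataset = [{"identifier": "id-9", "vmw_name": "vm-cms-01",
            "vmw_guest_ipaddress": "10.0.0.5",
            "vmw_parent_vcenter": "vc01", "vmw_powerstate": "poweredOn"}]
alert = {"assetId": "unknown", "assetName": "vm-cms-01",
         "assetIpAddress": "10.0.0.9"}
conds = [("identifier", "assetId"), ("vmw_name", "assetName"),
         ("vmw_guest_ipaddress", "assetIpAddress")]
print(enrich(alert, dataset, conds, ["vmw_parent_vcenter", "vmw_powerstate"]))
```

Here only the vmw_name condition matches, which is enough under OR for the enrich columns to be copied onto the alert.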

7. Alert Correlation

Alert correlation is the process of grouping related alerts together to reduce noise and increase the actionability of alerts and events. Correlated alerts are grouped and translated into CFX incidents, which are then routed to ITSM systems for handling by NOC/IT Analysts, who can then log in to cfxOIA's Incident Room module to perform swift triage, diagnosis, and root cause analysis of an incident.

cfxOIA's correlation engine provides recommendations for detecting and grouping new alert patterns. Admins can review and analyze the recommendations and convert them into Correlation Policies, or define new policies altogether. Admins can also implement alert Suppression Policies to suppress alerts that escape maintenance windows. cfxOIA provides out-of-the-box policies to handle well-known operational issues like alert burst and alert flapping scenarios.

7.1 Key Points:

  • Ingested alerts and events are normalized to the OIA alert model, allowing most alert/tool implementations to be addressed
  • Customers can add custom attributes to the alert model using the enrichment process
  • Ingested alerts are enriched with context about application, stack, department, ownership, support-group, etc. using a process called alert enrichment.
  • Enriched alerts are then evaluated for any correlation or suppression to be performed. Suppression policies are used to suppress alerts that escape maintenance windows.
  • Alerts that remain are then evaluated for correlation, which is determined by correlation policies that can be set up in three ways:
    • System defined policies: To address well-known behavior like alert burst and alert flapping situations.
    • ML driven correlation recommendations: OIA uses unsupervised ML clustering to detect alert patterns and provides a list of suggested correlations in the form of Symptom Clusters.
    • Admin defined correlation policies: Administrators can define new correlation policies or customize existing policies to meet their needs. For instance, correlation policies allow admins to group alerts across a full stack or an application instance. Admins can also group alerts across a common infrastructure (like network, storage, etc.) or shared services (ex: SSO, DNS, etc.).

7.2 How correlation policy works

Correlation policies are enabled when created, but they can be disabled. Correlation policies determine how alerts are grouped together. Most correlation policies can be created in an assisted manner from the recommendations provided by cfxOIA's correlation engine via symptom clusters.

A correlation policy can result in one or more instances of alert correlations, each represented by a correlation Alert Group.

The following controls are available to specify correlation behavior.

Minimum Severity of Alert Group:

The severity of a correlated alert group is always determined by the highest severity of the alerts it comprises. However, admin users can configure a minimum severity level for correlated alert groups formed by a correlation policy.

Time Boxing:

Time boxing is the concept of grouping related alerts that fall within a certain time window, like 15 minutes, 30 minutes, or 1 hour. The time window starts when the first matching alert is detected and closes after the window expires. Any new matching alert after the window expires results in a new correlated alert group instance, which leads to a new incident.
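The time-boxing behavior can be sketched as follows, assuming the semantics described above (the first matching alert opens the window; alerts arriving after expiry start a new group):

```python
# Time-boxing sketch: group alert timestamps (in seconds) into windows.
WINDOW = 15 * 60  # a 15-minute window, in seconds

def group_by_time(alert_times: list, window: int = WINDOW) -> list:
    groups, window_start = [], None
    for t in sorted(alert_times):
        if window_start is None or t - window_start > window:
            window_start = t      # first matching alert opens a new window
            groups.append([])
        groups[-1].append(t)
    return groups

# Alerts at 0s, 5min, and 20min: the third arrives after the first
# window expires, so it opens a second group (and a new incident).
print(group_by_time([0, 300, 1200]))   # → [[0, 300], [1200]]
```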


Precedence:

Precedence values help determine which policy takes precedence when conflicts arise, i.e., when an alert matches multiple policies. For example, an alert belonging to symptom cluster prod and application CMS can match both a policy set up to correlate alerts at the application level (app-name == CMS) and one at the symptom cluster level (cluster-name == prod). By giving higher precedence to the application-level policy, alerts will be grouped at the application level.

Precedence is a numeric value starting at 10000; higher values indicate higher precedence and take priority in case of a match. Precedence values are optional; if not provided, the system assigns precedence values automatically based on chronological order, i.e., newly created correlation policies get higher precedence.

A typical approach is to set up wider, broad-scope correlation policies with higher precedence and more specific correlation policies with lower precedence.

Property Filters:

Property filters narrow down the related-alert selection criteria by matching property fields against specified values using conditions like equals, contains, or in a list of values.

Property filters allow fine-grained control of correlation policies to meet organizational processes, administrative domains, or functional groups.
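A property-filter evaluation can be sketched as below. The operator names (equals, contains, in) follow the conditions mentioned above; the filter representation itself is an illustrative assumption:

```python
# Property-filter sketch: an alert matches when ALL filters hold.
def matches(alert: dict, filters: list) -> bool:
    ops = {
        "equals": lambda value, target: value == target,
        "contains": lambda value, target: target in str(value),
        "in": lambda value, target: value in target,  # list-of-values check
    }
    return all(ops[op](alert.get(field), target) for field, op, target in filters)

alert = {"severity": "Critical", "message": "CPU high on cms-web-01", "env": "Prod"}
filters = [("severity", "equals", "Critical"),
           ("message", "contains", "CPU"),
           ("env", "in", ["Prod", "UAT"])]
print(matches(alert, filters))   # → True
```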

Group By:

Related alerts can be grouped by the values of a selected attribute. This works best for attributes that are of an enumeration type, a list of values, or that represent a limited set of identities.

For example, assume two Group By attributes are selected with the following values:

Machine-Type = Application / Server / Network / Storage 
Environment = Prod / UAT

With the two Group By attribute selections indicated above, the correlation engine will automatically form the following alert groups:

"Machine-Type == Application and Environment == Prod" into one group. 
"Machine-Type == Application and Environment == UAT" into one group. 
"Machine-Type == Server and Environment == Prod" into one group. 
"Machine-Type == Server and Environment == UAT" into one group. 
"Machine-Type == Storage and Environment == Prod" into one group. 
"Machine-Type == Storage and Environment == UAT" into one group. 
"Machine-Type == Network and Environment == Prod" into one group. 
"Machine-Type == Network and Environment == UAT" into one group.

7.3 Correlation Group Policy:

7.3.1 Alert Group Correlation Diagram:




7.3.2 Correlation Use Case:

  • Correlation is the process of correlating alerts generated from different sources to reduce alert noise. One way to achieve this is to define a correlation policy so that alerts are correlated into one alert group.

  • Alerts can be filtered using the policy filter and can be grouped using the GROUP BY methods of the correlation policy definition.

Creating and Updating Correlation Policies

Home --> Administration --> Organization --> Click on Configure --> click Correlation Policies --> click Add, then fill in the required details




7.3.3 Policy Definition Attributes:

Precedence:

  • Each policy has to be defined with a precedence. Policy applicability for incoming alerts is based on the defined precedence.

  • Below are the allowed values for precedence:

    a) Minimum Value: 10

    b) Maximum Value: 1000000

Auto Resolve Incident When Alerts are Cleared:

  • When auto resolve is defined as Yes for a policy, the alert group/incident is auto-cleared once all of its children alerts are cleared.

  • Allowed values are Yes/No.

  • When auto resolve is defined as No, the group remains active even though all children in the group are CLEARED.

Group Expiry:

  • Group expiry is defined in minutes and defines the window during which alerts are correlated to an alert group.

  • For example, if group expiry is defined as 15 minutes and a group is created at 10:00, the group window is open until 10:15. All alerts received between 10:00 and 10:15 are correlated to the group created at 10:00.

Below is a sample walkthrough of an alert group with a 10-minute expiry window:

Time     | Event                                                                          | Alerts in Incident | Alert Group State | Incident State
10:03:00 | 2 alerts raised; Alert Group created (valid till 10:13:00); Incident 1 created | 2                  | Open              | Open
10:05:00 | 2 alerts cleared                                                               | 2                  | Open              | Open
10:07:00 | 3 alerts raised                                                                | 5                  | Open              | Open
10:12:00 | 3 alerts cleared                                                               | 5                  | Open              | Open
10:13:00 | Alert Group closed                                                             | 5                  | Closed            | Resolved (if all children are cleared)

In the above sample, Incident 1 is cleared when the group expires and all children in the group are cleared. If any one of the children is in the ACTIVE state, the alert group/incident remains open.
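The resolution rule just described can be sketched as a simple predicate (an illustration of the stated behavior, not the product's code):

```python
# Auto-resolve sketch: an incident resolves only when the group window
# has expired AND every child alert is cleared.
def incident_state(group_expired: bool, child_states: list) -> str:
    all_cleared = all(s == "CLEARED" for s in child_states)
    if group_expired and all_cleared:
        return "Resolved"
    return "Open"

print(incident_state(False, ["CLEARED", "CLEARED"]))           # → Open (window still open)
print(incident_state(True, ["CLEARED", "ACTIVE", "CLEARED"]))  # → Open (a child is active)
print(incident_state(True, ["CLEARED"] * 5))                   # → Resolved
```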

7.3.4 Auto Clear after last update:

  • An incident is auto-cleared if no new alerts get correlated to it within the prescribed auto-clear interval.

  • For example, if a correlation group is created at 10:00 and auto clear is defined as 20 minutes, the group is auto-cleared if no alert is received within the next 20 minutes (i.e., by 10:20).
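This check can be sketched as a comparison against the time of the last correlated alert (a minimal illustration of the stated rule):

```python
# Auto-clear sketch: the group is cleared when no new alert has been
# correlated to it within the configured interval (e.g. 20 minutes).
AUTO_CLEAR_MINUTES = 20

def should_auto_clear(last_alert_minute: int, now_minute: int,
                      interval: int = AUTO_CLEAR_MINUTES) -> bool:
    return now_minute - last_alert_minute >= interval

print(should_auto_clear(0, 20))   # → True  (group created 10:00, now 10:20)
print(should_auto_clear(0, 15))   # → False (still within the interval)
```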

7.3.5 Policy Filter:

  • Policy filters can be defined to filter the ingested alerts for correlation based on certain criteria.

  • Supported filter types:

    a) Basic

    b) Advanced

Basic:

  • A filter defined using the UI-supported filter widget. The default operator AND is used when multiple conditions are defined in a filter.

  • Below is a sample basic filter

Basic

Advanced:

  • An advanced filter can be defined using the CFXQL language, when multiple conditions with OR/AND operations need to be specified.

  • For a reference on the CFXQL query format, click here

  • Characters which need to be escaped in the value of a defined filter:

    Character | Usecase  | Example (with escape character)
    $         | test$123 | Message contains 'Error logging\$123'
    ^         | test^123 | Message contains 'Error logging\^123'
    *         | test*123 | Message contains 'Error logging\*123'
    (         | test(123 | Message contains 'Error logging\(123'
    )         | test)123 | Message contains 'Error logging\)123'
    +         | test+123 | Message contains 'Error logging\+123'
    [         | test[123 | Message contains 'Error logging\[123'
    '         | test'123 | Message contains 'Error logging\'123'
    ?         | test?123 | Message contains 'Error logging\?123'
  • Below is a sample advanced filter using CFXQL
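Separately from the CFXQL sample above, the escaping rules in the table can be sketched as a small helper (a hypothetical utility, not part of the product):

```python
# Escaping sketch: prefix each special character in a filter value with
# a backslash, per the table of characters that need escaping.
SPECIAL = set("$^*()+['?")

def escape_filter_value(value: str) -> str:
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in value)

print(escape_filter_value("test$123"))   # → test\$123
print(escape_filter_value("test(123"))   # → test\(123
```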


7.3.6 Filter Attributes Used to Define Filter:

  • Source Mechanism, Source, Severity, Alert Category, Alert Type, Cluster, Asset Type, IP Address, Asset Name, Component Name, Message.

Using Enriched Attributes as Filter Attributes to Define Policy:

  • Enriched attributes of an alert can be used to define filters by enabling them from the Enriched Attributes management section.

  • Path to manage enriched attributes of alerts: Home --> Administration --> Organization --> Configure --> Alerts --> ENRICHED ATTRIBUTES --> toggle switch under Correlation Policy

Enriched Attributes

7.3.7 Group By:

  • A policy can be defined with multiple Group By attributes; unique incidents are created using the Group By attribute values.

7.3.8 FAQs:

  • Can one incident be created for a policy, with all the alerts correlated to that one incident?

    We can define a policy with zero expiry minutes. Incoming alerts filtered by the policy are correlated to one incident, and the incident is cleared when all the alerts are cleared.

  • Can an incident be active even though all the children are cleared?

    Incidents remain in the active state when the policy is defined with auto resolve set to No, even after all the alerts of the incident are cleared.

8. Incident Management

cfxOIA creates an Incident for every correlated Alert Group and sends it to ITSM tools (such as ServiceNow, PagerDuty, etc.) for further processing by IT Analysts, NOC/SOC Engineers, or Tier-1/Tier-2 Engineers. cfxOIA provides a module called Incident Room that AIOps operators and ITSM operators can use to accelerate incident analysis and resolution. The Incident Room provides all the relevant context, data, insights, and tools in one place for incident resolution.


9. ML Driven Operations

cfxOIA uses machine learning (ML) at its core to intelligently learn patterns from huge volumes of historical as well as streaming data and automate key IT operational activities and decisions at large scale.

Key ML driven operations include:

  • Alert Correlation (uses unsupervised ML)
  • Log Clustering
  • Alert volume seasonality
  • Alert volume anomaly detection
  • Alert volume prediction
  • Incident triage data anomaly detection and noticeable changes
  • Similar incidents

Prediction insights consist of forecasting the alert volume or ingestion rate, providing a perspective on how many alerts the Operations team can expect in the future. cfxOIA can perform this prediction analysis along multiple dimensions, including alerts coming from a certain source, or alerts of a certain application, severity, site, or even symptom. In addition to prediction insights, cfxOIA also provides seasonality and anomaly detection when ML jobs are run; jobs can be executed on demand or scheduled to run periodically, which helps in continuous learning, training, and testing of models.

cfxOIA currently supports three ML pipelines out of the box: Clustering, Classification, and Regression. ML jobs allow hyperparameter tuning through selections made directly in the UI. Advanced customization scenarios allow uploading new ML pipelines.


10. Analytics

cfxOIA provides key analytics to track AIOps-related KPIs like noise reduction efficiency, alert ingestion trends, most chatty alert types, etc. cfxOIA has a unique data exploration feature called Quick Insights that provides an at-a-glance visual clue of the distribution and other characteristics of data. Quick Insights on Incidents provide visual clues about the distribution of Incidents based on priority, Support-group, Incident-age, Environment, Application, Department, etc.


11. UI features

cfxOIA provides a web-based portal that is accessible via a standard browser and uses HTML5 to render the User Interface (UI). There is no need to install any thick client to access the cfxOIA web portal. The cfxOIA portal provides certain advanced UI features for efficient data handling and customization.

Filters: Allow efficient filtering, saving, and reusing of filters.

Table View Customization / Customize Columns: Choose which columns are displayed in the table and change the order of columns.

Exporting Data: Export data like Incidents, Alerts, etc.