cfxOIA: Operations Intelligence & Analytics
1. What is Operations Intelligence & Analytics
The CloudFabrix AIOps solution is called Operations Intelligence & Analytics (cfxOIA). This solution provides domain-agnostic AIOps capabilities that bring algorithmic decisions to IT operations, drawing on several disparate monitoring and other operational data sources. cfxOIA, or OIA, is a software solution that runs as a distributed application using a microservices and containers architecture. OIA is available as an enterprise offering for on-premise or cloud deployment. OIA is also offered as a fully managed SaaS by CloudFabrix or its partners.
2. How it works
cfxOIA works by ingesting IT operational data, such as alerts, events, and traces from multiple performance monitoring tools, log-based alerts from log monitoring tools, and observational data from data lakes, and then algorithmically correlating alerts to reduce noise. OIA normalizes every alert and enriches it by stitching together CMDB data, service mappings, and asset management data, so that every alert ingested into the platform carries rich context.
cfxOIA then correlates alerts based on the enriched data. OIA's machine learning engine identifies symptomatic patterns in alert data. These patterns are then provided as recommendations to AIOps administrators, who can consider grouping or deduplicating future alerts that match those symptoms. Admins can create additional correlation policies to tune algorithmic correlation behavior, grouping alerts across the entire application stack, within a time window, or in an infrastructure layer.
cfxOIA has an out-of-the-box implementation to correlate well-known operational issues related to alert burst scenarios, alert flapping situations, and transient alerts. The correlation engine allows the admin to implement event correlation for any type of situation: the majority of patterns are detected with unsupervised machine learning, and admin-configurable policies provide additional flexibility to tune correlation behavior. Alerts that are correlated are called Alert Groups and the policies are called Correlation Policies.
Deduplicated and correlated alerts are grouped in an Alert Group that indicates an active operational issue or an OIA Incident. Every Alert Group has one OIA Incident, which is sent to the ITSM systems (like ServiceNow, PagerDuty, etc.) and to the OIA Incident Room for further incident processing.
Incident Room is a dynamic, incident-centric workbench that provides all the triage data, operational metrics, KPIs, logs, impacted-asset context, collaboration, and diagnostic tools in one place, so that operators can swiftly perform incident root cause analysis and service restoration. This helps reduce incident MTTR.
cfxOIA is an application that is installed on top of the RDA Fabric platform.
4. Data Ingestion & Integrations
cfxOIA operates on IT operational data such as alerts, events, traces, and metrics, most of which are generated by monitoring tools and in some cases replicated in an aggregate data lake. OIA supports integrations with many featured vendors using Webhooks, APIs, Kafka messages, etc. Custom integrations can be developed and supported by CloudFabrix professional services or partners, using CloudFabrix-provided developer SDKs.
5. Data Analysis and Stitching
Large enterprise environments have a mix of structured and unstructured IT data sources, with many custom IT data parameters defined and implemented across those sources. For example, IT environments can implement custom attributes like machine type, environment, site code, department name, support group, application ID, etc. Not every tool implements these attributes, which makes it difficult to understand which operational data sources are relevant for an AIOps implementation and which attributes can be gleaned from which sources to enrich raw alert data. This is where the cfxOIA Data Analysis and Stitching module comes in, helping to establish the following:
- Asset Identities
- Enrichment Attributes
- Enrichment Flows
- Baseline Analysis
This module works from historical alert/event data, ticket data, CMDB data, service mappings, and asset management data, and establishes a data chain that helps select appropriate data sources and enrichment attributes for the AIOps implementation.
6. Alert Enrichment
Raw alert data contains very limited information, often consisting only of id, severity, message/description, rule name, and asset IP/hostname. This information doesn't provide enough service context (application or service name, environment, machine-type, etc.) or supportability context (NOC id, site-id, department, support-group, etc.), both of which are essential for efficient correlation of alerts. cfxOIA performs automated alert data enrichment using a combination of the following approaches:
- ACE (Automated Context Extraction): This method extracts useful information such as IP address, DNS name, and certain identifiable attributes from the source alert's payload. It doesn't require any external integrations; however, in the majority of scenarios it may not be sufficient for alert correlation.
- External source lookup: This process looks up information related to the incoming alerts in an external data source (e.g., a CMDB or inventory system, CSV, etc.) and then adds it as enriched alert attributes. Enriched attributes present more contextual information to the IT Operations user and are also used to correlate the alerts.
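As a rough illustration of the ACE idea described above, the sketch below pulls IPv4 addresses and DNS-like names out of a free-text alert message using regular expressions. The function name `extract_context`, the output keys, and the patterns are assumptions made for this example; they are not cfxOIA's actual extraction rules.

```python
import ipaddress
import re

# Hypothetical patterns; a real extractor would likely cover many more cases.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
FQDN_CANDIDATE = re.compile(r"\b(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}\b")


def extract_context(message):
    """Return IPv4 addresses and DNS-like names found in an alert message."""
    ips = []
    for candidate in IPV4_CANDIDATE.findall(message):
        try:
            # Rejects dotted values that are not valid addresses, e.g. 999.1.1.1
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        ips.append(candidate)
    names = FQDN_CANDIDATE.findall(message)
    return {"ip_addresses": ips, "dns_names": names}
```

Attributes extracted this way need no external integration, which matches the trade-off noted above: they are cheap to obtain but often too sparse to drive correlation on their own.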
Alert notifications are ingested into the cfxOIA application from disparate monitoring tools, and each tool follows a different format with different alert attributes. The attributes below (among others) are generally important for any incoming alert.
- Alert Timestamp
- Alert Status
- Alert Severity
- Alert Source
- Alert Message
Below are three sample alert notification payloads from VMware vROps, Nagios, and AppDynamics. As shown below, the alert attributes are completely different from each other.
In the cfxOIA application, it is a prerequisite to normalize the alert attributes coming from different monitoring tool sources to a common data model. Below is the list of attributes used in the alert mapping process. Every ingested alert goes through the alert mapping process, in which its payload attributes are mapped to the standard attributes below.
Not all of the attributes below must be mapped. The attributes flagged with * are mandatory.
- alertCategory: An attribute which can be used to categorize the alert
- alertType: An attribute to classify type of alert
- assetId: An attribute which can be used to identify the source of alert (Endpoint identity)
- assetIpAddress: An attribute that is used to identify the IP Address of the end point
- assetName*: An attribute that is used to identify the AssetName of the end point (ex: Hostname / Devicename)
- assetType: An attribute that is used to identify type of the Asset or the end point (ex: VM / Server / Storage / CPU / Memory etc)
- clearedAt*: Alert timestamp that is used to identify when the alert was cleared
- componentId: An attribute to associate a sub-component ID of an endpoint from which the alert was generated
- componentName: An attribute to associate a sub-component name of an endpoint from which the alert was generated
- message*: Alert message that states the symptom or problem which has caused the alert
- raisedAt*: Alert timestamp that is used to identify when the alert occurred
- severity*: Alert's severity (Ex: Critical, Warning, Minor etc..)
- status*: Alert's state (Open / Closed / Active / Recovered / Cancelled)
- alertkey*: Alert's unique identifier which is used to identify an incoming alert and to apply alert de-duplication process. It can be taken from a single alert attribute or a combination of alert's attributes
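The mapping step described above can be sketched roughly as follows. This is a minimal illustration, not cfxOIA's implementation: the source field names in `EXAMPLE_MAPPING` (`host`, `text`, `level`, and so on), the `normalize_alert` function, and the `|` separator used for a composite `alertkey` are all assumptions. Only the standard attribute names and the mandatory flags come from the list above.

```python
# Attributes flagged with * in the list above.
MANDATORY_ATTRIBUTES = {
    "assetName", "clearedAt", "message", "raisedAt",
    "severity", "status", "alertkey",
}

# Hypothetical mapping for a made-up source: standard attribute -> source field(s).
# Several source fields build a composite value, as described for alertkey.
EXAMPLE_MAPPING = {
    "assetName": ["host"],
    "assetIpAddress": ["ip"],
    "message": ["text"],
    "raisedAt": ["start_time"],
    "clearedAt": ["end_time"],
    "severity": ["level"],
    "status": ["state"],
    "alertkey": ["host", "check_name"],
}


def normalize_alert(payload, mapping):
    """Map a source payload onto the standard attribute names."""
    unmapped = MANDATORY_ATTRIBUTES - set(mapping)
    if unmapped:
        raise ValueError(f"mandatory attributes not mapped: {sorted(unmapped)}")
    alert = {}
    for attribute, source_fields in mapping.items():
        values = [payload.get(field) for field in source_fields]
        if len(values) == 1:
            alert[attribute] = values[0]
        else:
            # Composite value built from a combination of source attributes.
            alert[attribute] = "|".join(str(value) for value in values)
    return alert
```

Keeping one mapping table per monitoring tool means each source can have its own field names while downstream enrichment and correlation only ever see the standard attributes.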
Alert ingestion with alert mapping & normalization process data flow:
The flow below illustrates the stages of alert processing: ingestion, alert attribute mapping, alert enrichment, correlation/suppression, and persistence into the system's database.
In the above illustration, the listed enrichment datasources, such as SNOW (ServiceNow), Nagios, and vROps, are shown for quick reference only. cfxOIA supports many datasources for the enrichment process.
6.3. Enrichment Pipelines
An alert enrichment pipeline has two configuration blocks:
- Query an external datasource (such as CMDB, Nagios, vROps, etc.) and save the enriched attributes into a dataset (a CSV-style table).
- Define one or more condition or filter rules for the lookup, taking one or more of the alert's payload attributes (e.g., assetname, assetipaddress) and querying additional attributes for a matched record from the saved dataset of the external datasource.
Source alert attributes should be normalized first, using the alert mapping configuration for each source, before the enrichment process. For more information, refer to Alert attribute normalization.
Below are a few key alert attributes that can be used to obtain enrichment attributes from the saved dataset that was created from the external datasource integration.
The screen below shows a sample enrichment pipeline that extracts additional attributes from the 'Nagios' monitoring tool and configures the system to use them as part of the alert enrichment process.
Enrichment conditions / filters examples:
In the example below, the saved dataset is from VMware vROps, i.e., vrops-resource-properties.
As a condition rule, multiple attributes are used to look up enriched attributes in the above saved dataset.
- identifier == '$assetId': identifier is a column within the saved dataset from vROps, and $assetId is the alert attribute that was mapped from the source alert using the alert mapping process. This checks whether the assetId attribute from the alert payload matches identifier within the saved dataset.
- vmw_name == '$assetName': vmw_name is a column within the saved dataset from vROps, and $assetName is the alert attribute that was mapped from the source alert using the alert mapping process. This checks whether the assetName attribute from the alert payload matches vmw_name within the saved dataset.
- vmw_guest_ipaddress == '$assetIpAddress': vmw_guest_ipaddress is a column within the saved dataset from vROps, and $assetIpAddress is the alert attribute that was mapped from the source alert using the alert mapping process. This checks whether the assetIpAddress attribute from the alert payload matches vmw_guest_ipaddress within the saved dataset.
Any attribute prefixed with $ represents a mapped attribute of the alert payload. For more information, refer to Alert attribute normalization.
Enrich columns: vmw_name, vmw_parent_vcenter, vmw_powerstate, vmw_accessible_status
When no enrich columns are specified, all columns that do not have null or empty values are fetched as enriched attributes.
Condition operator for all of the above conditions: OR (which means enriched attributes are extracted if any of the conditions finds a matching record within the saved dataset from vROps).
Below are some of the supported operators which can be used while querying the saved dataset of an external source using conditions.
- == (equals)
- != (not equals)
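The vROps lookup described above can be sketched as a small function over an in-memory table. This is only an illustration under stated assumptions: the `enrich_alert` function and the representation of the saved dataset as a list of dictionaries are hypothetical, and only equality conditions combined with OR are handled. The column and attribute names are taken from the example above.

```python
# (dataset column, mapped alert attribute) pairs from the example above.
VROPS_CONDITIONS = [
    ("identifier", "assetId"),
    ("vmw_name", "assetName"),
    ("vmw_guest_ipaddress", "assetIpAddress"),
]


def enrich_alert(alert, dataset, conditions, enrich_columns=None):
    """Return a copy of the alert with attributes from the first matching row.

    Conditions are equality checks combined with OR: one match is enough.
    """
    for row in dataset:
        matched = any(
            alert.get(attribute) is not None and row.get(column) == alert.get(attribute)
            for column, attribute in conditions
        )
        if not matched:
            continue
        if enrich_columns:
            extra = {column: row.get(column) for column in enrich_columns}
        else:
            # No enrich columns given: take every non-null, non-empty column.
            extra = {c: v for c, v in row.items() if v not in (None, "")}
        return {**alert, **extra}
    return dict(alert)  # no matching record: alert is returned unchanged
```

Because the conditions are ORed, an alert that carries only a hostname can still be enriched even when its assetId and assetIpAddress are missing.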
7. Alert Correlation
Alert correlation is the process of grouping related alerts together to reduce noise and increase the actionability of alerts and events. Correlated alerts are grouped and translated into CFX incidents, which are then routed to ITSM systems for handling by NOC/IT analysts, who can then log in to cfxOIA's Incident Room module to perform swift triage, diagnosis, and root cause analysis of an incident.
cfxOIA's correlation engine provides recommendations for detecting and grouping new alert patterns. Admins can review and analyze the recommendations and convert them into Correlation Policies, or define new policies altogether. Admins can also implement alert Suppression Policies to suppress alerts that escape during maintenance windows. cfxOIA provides out-of-the-box policies to treat well-known operational issues like alert burst scenarios, flapping scenarios, etc.
7.1 Key Points:
- Ingested alerts and events are normalized to the OIA alert model, which allows most alert and tool implementations to be addressed.
- Customers can add custom attributes to the alert model using the enrichment process.
- Ingested alerts are enriched with context about application, stack, department, ownership, support-group, etc. using a process called alert enrichment.
- Enriched alerts are then evaluated for any correlation or suppression to be performed. Suppression policies are used to suppress alerts that escape maintenance windows.
- Alerts that remain are then evaluated for correlation, which is determined by correlation policies that are set up in three ways:
  - System defined policies: To address well-known behavior like alert burst and alert flapping situations.
  - ML driven correlation recommendations: OIA uses unsupervised ML clustering to detect alert patterns and provides a list of suggested correlations in the form of Symptom Clusters.
  - Admin defined correlation policies: Administrators can define new correlation policies or customize existing policies to meet their needs. For instance, correlation policies allow admins to group alerts across a full stack or an application instance. Admins can also group alerts across a common infrastructure (like network, storage, etc.) or shared services (e.g., SSO, DNS, etc.).
7.2 How correlation policy works
Correlation policies are enabled when created, but they can be disabled. Correlation policies determine how alerts are grouped together. Most correlation policies can be created in an assisted manner, using the symptom-cluster recommendations provided by cfxOIA's correlation engine.
A correlation policy can result in one or more instances of alert correlations, each represented by a correlation Alert Group.
The following controls are available to specify correlation behavior.
Minimum Severity of Alert Group:
The severity of a correlated alert group is always determined by the highest severity among the alerts it comprises. However, admin users can configure a minimum severity level for the correlated alert groups formed by a correlation policy.
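A minimal sketch of this rule follows, assuming a hypothetical severity ranking in which Warning is lowest and Critical is highest; the actual ordering and the `group_severity` function are not taken from cfxOIA.

```python
# Assumed ranking for illustration only; the product may order severities differently.
SEVERITY_RANK = {"Warning": 1, "Minor": 2, "Critical": 3}


def group_severity(alert_severities, minimum=None):
    """Highest severity among the alerts, raised to the policy minimum if set."""
    highest = max(alert_severities, key=SEVERITY_RANK.__getitem__)
    if minimum is not None and SEVERITY_RANK[minimum] > SEVERITY_RANK[highest]:
        return minimum
    return highest
```

The minimum only ever raises the group severity; a group that already contains a more severe alert keeps that higher severity.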
Time boxing is the concept of grouping related alerts that fall within a certain time window, such as 15 minutes, 30 minutes, or 1 hour. The time window starts when the first matching alert is detected and closes when the window expires. Any new matching alert that arrives after the window has expired results in a new correlated alert group instance, which leads to a new incident.
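The time-boxing behavior can be sketched as below. The `time_box` function is hypothetical, and treating an alert that arrives exactly at the window boundary as belonging to a new group is an assumption of this example.

```python
from datetime import datetime, timedelta


def time_box(raised_times, window):
    """Split alert timestamps into groups; each window opens at its first alert."""
    groups = []
    window_end = None
    for timestamp in sorted(raised_times):
        # Assumption: an alert exactly at the boundary starts a new window.
        if window_end is None or timestamp >= window_end:
            groups.append([])
            window_end = timestamp + window
        groups[-1].append(timestamp)
    return groups


start = datetime(2024, 1, 1, 10, 0)
alerts = [start + timedelta(minutes=m) for m in (0, 5, 14, 20, 36)]
boxes = time_box(alerts, timedelta(minutes=15))
```

With a 15-minute window, the alerts at minutes 0, 5, and 14 share one group, while the alerts at minutes 20 and 36 each open a new group and would therefore each lead to a new incident.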
Precedence values help determine which policy takes precedence when conflicts arise, which can happen when an alert matches multiple policies. For example, an alert belonging to symptom cluster prod and application CMS can match both a policy that is set up to correlate alerts at the application level (app-name == CMS) and a policy that correlates at the symptom cluster level (cluster-name == prod). By giving higher precedence to the application-level policy, alerts will be grouped at the application level.
Precedence is a numeric value starting at 10000; higher values indicate higher precedence and take priority when an alert matches multiple policies. Precedence values are optional. If none is provided, the system assigns one automatically based on chronological order, i.e., newly created correlation policies get higher precedence.
A typical approach is to set up broader-scope correlation policies with higher precedence and more specific correlation policies with lower precedence.
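Conflict resolution by precedence can be sketched as follows. The policy dictionaries, their `match` and `precedence` keys, and the `select_policy` function are hypothetical structures for illustration; only the rule that the highest precedence value wins, and the `app-name` / `cluster-name` example, come from the text above.

```python
def select_policy(alert, policies):
    """Return the matching policy with the highest precedence, or None."""
    matching = [
        policy for policy in policies
        if all(alert.get(key) == value for key, value in policy["match"].items())
    ]
    if not matching:
        return None
    return max(matching, key=lambda policy: policy["precedence"])


policies = [
    {"name": "cluster-prod", "match": {"cluster-name": "prod"}, "precedence": 10001},
    {"name": "app-cms", "match": {"app-name": "CMS"}, "precedence": 10002},
]
chosen = select_policy({"app-name": "CMS", "cluster-name": "prod"}, policies)
```

Here the alert matches both policies, and the application-level policy is chosen because it carries the higher precedence value.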
Property filters narrow down the selection criteria for related alerts by matching property fields against specified values, using conditions such as equals, contains, in list of values, etc.
Property filters allow fine-grained control of correlation policies to suit organizational processes, administrative domains, or functional groups.
Related alerts can be grouped by values in a certain attribute. This works best for attributes that are typically of type enumeration, list of values or represent a limited set of identities.
For example, assume the Machine-Type attribute has the following values: Machine-Type = Application / Server / Network / Storage. If Group By selects Machine-Type as the attribute, the correlation engine will automatically group alerts that have the same Machine-Type value.
With two Group By attributes selected, Machine-Type and Environment, the following alert group correlations will be formed:
- "Machine-Type == Application and Environment == Prod" into one group.
- "Machine-Type == Application and Environment == UAT" into one group.
- "Machine-Type == Server and Environment == Prod" into one group.
- "Machine-Type == Server and Environment == UAT" into one group.
- "Machine-Type == Storage and Environment == Prod" into one group.
- "Machine-Type == Storage and Environment == UAT" into one group.
- "Machine-Type == Network and Environment == Prod" into one group.
- "Machine-Type == Network and Environment == UAT" into one group.
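Grouping by Machine-Type and Environment as described above amounts to bucketing alerts by the tuple of their Group By attribute values. The sketch below shows this with a hypothetical `group_alerts` function; the alert dictionaries are made-up sample data.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the tuple of their values for the Group By attributes."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(attribute) for attribute in group_by)
        groups[key].append(alert)
    return dict(groups)


sample_alerts = [
    {"id": 1, "Machine-Type": "Application", "Environment": "Prod"},
    {"id": 2, "Machine-Type": "Application", "Environment": "UAT"},
    {"id": 3, "Machine-Type": "Application", "Environment": "Prod"},
    {"id": 4, "Machine-Type": "Network", "Environment": "Prod"},
]
grouped = group_alerts(sample_alerts, ["Machine-Type", "Environment"])
```

Only combinations that actually occur in the alert data produce a group, so the number of groups is at most the product of the distinct values of each selected attribute.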
7.3 Creating and Updating Correlation Policies:
Follow the steps below to view and manage the alert correlation policies.
Log in as a tenant admin user and click on the OIA application.
8. Incident Management
cfxOIA creates an incident for every correlated Alert Group and sends it to ITSM tools (such as ServiceNow, PagerDuty, etc.) for further processing by IT analysts, NOC/SOC engineers, or Tier-1/Tier-2 engineers. cfxOIA provides a module called Incident Room that AIOps operators and ITSM operators can use to accelerate incident analysis and resolution. The Incident Room provides all the relevant context, data, insights, and tools in one place for incident resolution.
9. ML Driven Operations
cfxOIA uses machine learning (ML) at its core to intelligently learn patterns from huge volumes of historical as well as streaming data and automate key IT operational activities and decisions at large scale.
Key ML driven operations include:
- Alert Correlation (uses unsupervised ML)
- Log Clustering
- Alert volume seasonality
- Alert volume anomaly detection
- Alert volume prediction
- Incident triage data anomaly detection and noticeable changes
- Similar incidents
Prediction insights consist of forecasting alert volume or ingestion rate, giving the Operations team a view of how many alerts to expect in the future. cfxOIA can perform this prediction analysis along multiple dimensions, including alerts coming from a certain source, or alerts of a certain application, severity, site, or even symptom. In addition to prediction insights, cfxOIA also provides seasonality and anomaly detection when ML jobs are run. ML jobs can be executed on demand or scheduled to run periodically, which helps in continuous learning, training, and testing of models.
cfxOIA currently supports three ML pipelines out of the box: Clustering, Classification, and Regression. ML jobs allow hyperparameter tuning through selections made in the UI itself. Advanced customization scenarios allow uploading of new ML pipelines.
cfxOIA provides key analytics to track AIOps related KPIs like noise reduction efficiency, Alert ingestion trends, Most chatty alert types, etc. cfxOIA has a unique data exploration feature called Quick Insights that provides an at-a-glance visual clue of distribution and other characteristics of data. Quick Insights on Incidents provide visual clues about the distribution of Incidents based on priority, Support-group, Incident-age, Environment, Application, Department, etc.
11. UI features
cfxOIA provides a web-based portal that is accessible via a standard browser and uses HTML5 to render the user interface (UI). There is no need to install a thick client to access the cfxOIA web portal. The cfxOIA portal provides certain advanced UI features for efficient data handling and customization.
Filters: Allow efficient filtering, as well as saving and reusing filters.
Table View Customization
Customize Columns: Choose which columns are displayed in the table and change the order of columns.
Exporting Data: Exporting data like Incidents, Alerts, etc.