Rule Engine

A Rule Engine provides a framework for the automated and scalable evaluation and execution of business rules or decision logic. By decoupling business rules from the core application code, this architecture enables independent definition, management, and execution of these rules. When an event stream triggers a match with a defined rule criteria, an "incident" record is generated. A detailed explanation of the "incident" concept follows.

Rule Engines are commonly used in various domains, including finance, insurance, healthcare, e-commerce, cyber security and telecommunications, to automate decision-making processes and enforce business policies.

image not found

An example of rule: let's you want to monitoring computer network firewall and temporarily block traffic if there more than 100MB data exfiltrating to an IP address within a time period of 5 minutes that belong to a certain geographic region (e.g. Russia) as an example.

Avalanchio rule engine run the rules very close to the storage layer providing the system ability to run many rules concurrency with low latency.

Rule Attributes

Attribute	Description
Time Window Type	What is the time span for which the rule will evaluate the data. Two types time window are supported right now. Tumbling Window and Sliding Window. Tumbling windows are discrete time periods, while sliding time windows are the overlapping time windows.
Time Unit	Unit of time for trigger interval and window length.
Trigger Interval	The time frequency at which the rule will trigger.
Time Window Length	Time period for which data are processed in a single trigger. The time filter to work, the table must select the time column in the table configuration. If the time column is not specified, the time filter is applied on that table.
Weight	The weitage for a incident generated due to this rule. It is used to calculate scoreboard.
Severity	Attribute to qualify the importantance of the incident found by this rule.
Product	Product indicates whether the rule is a relevent for certain product such as Network Firewall, Storage Server etc.
Measure confidence	This functionality enables the assessment of confidence in rule-triggered incidents, expressed as a probability score ranging from 0.0 to 1.0. This score provides a measure of certainty regarding the validity and significance of the detected incident.
Reference ID	Any external reference number for the rule. In some cases, the rule is defined as per some regulatory or industry frameworks such as NIST, USA PATRIOT Act (2020) etc.
Description	Use friendly description of the rule.

image not found

Rule Evaluation Funnel

Each rule has a series of checks to identify incidents. At minimum it requires a query statement to perform the first level of the filtering. Optionally, the rule can also apply a set of techniques such as rare event checks, anomaly detection checks etc. Before an incident is created, it must pass through all steps in the funnel.

Query statement

SQL statement is the first filtering step to identify the relevent events. In many cases, the statement alone is sufficient to identify an suspecious event. The statement contains conditions only qualifying the events, and it does not include any time period filtering condition. The time period window operation is performed by the engine backend.

For example, if you want to check whether the total amount of data exfiltration (bytes_out) by destination IP address (dst_ip) from the firewall exceeds 100 MB, you would write this query:

SELECT 
    dst_ip, 
    Sum(bytes_out) / 1048576 AS mb_out
FROM   firewall
GROUP  BY dst_ip
HAVING mb_out > 100

If you schedule the rule having time window = Sliding and trigger interval = 10 minutes, the engine will check the above rule every 10 minutes over the data accumulated on 10 minutes. The engine keep track of the required checkpoints of the time window boundaries.

You can fine tune the above statement to exclude certain IP addresses.

[Optional] Anomaly Detection

Anomaly detection identifies data points, events, or observations that deviate significantly from the expected pattern or norm within a dataset. These deviations, often referred to as anomalies or outliers, can indicate critical insights or issues, such as fraud, system failures, or unusual patterns of behavior. Anomaly detection currently works for only numeric data. All records from the query statement is passed to the model.

If you want to find anomalies on the time series data, make sure you group the data by a timestamp interval using the query statement. For example, you want to find anomalous hourly aggregate of network bytes outs per IP can be created in the following statemet.

SELECT
    dst_ip,
    date_trunc("hour", ts) AS ts,
    Sum(bytes_out) AS bytes_out
FROM   firewall
GROUP  BY 
    dst_ip,
    ts

Attributes

Attributed	Description
Features	Select the numeric attributes using which anomaly detection will be performed. The engine will peform several preprocessing before feeding the data to the model, such null data handling, scaling, mean shifting.
Princuple Components	[Optional] This optional steps allows the model to perform a dimentionasionality reduction as a part of the preprocessing steps.
Grouping Field	[Optional] The query output for each trigger may represent a single entity or multiple entities, contingent upon the SQL query's structure. In the provided example, the query output pertains to a group of dst_ip values. For such scenarios, consider selecting this grouping field to construct separate anomaly detection models for each group. This approach can enhance model accuracy and performance compared to building a single model for all entities collectively.
Contamination	The amount of contamination of the data set, i.e. the proportion of anomalies in the data set. Used when fitting to define the threshold on the decision function. The value varies from (0., 0.5) with the default value 0.1
Algorithm	Select the algorithm for detection of the anomalies. Find the list of the supported algorithms in the following section.

Anomaly Detection Algoriths

Rule engines uses unsupervised outlier detection technique with the following considerations

Training data (unlabelled) contains both normal and anomalous observations.
The model identifies outliers during the fitting process.
This approach is taken when outliers are defined as points that exist in low-density regions in the data.
Any new observations that do not belong to high-density regions are considered outliers.

Following algorithms are supported to detect the anomalies.

Algorithm	Description
Angle-based Outlier Detector	ABOD is an unsupervised outlier detection technique that leverages the concept of angular variations between a data point and its neighbors. The underlying principle is that outliers tend to exhibit significantly larger angular variations compared to normal data points.
Clustering Based Local Outlier Factor	The Clustering Based Local Outlier Factor (CBLOF) operator calculates the outlier score based on cluster-based local outlier factor. CBLOF takes as an input the data set and the cluster model that was generated by a clustering algorithm. It classifies the clusters into small clusters and large clusters using the parameters alpha and beta. The anomaly score is then calculated based on the size of the cluster the point belongs to as well as the distance to the nearest large cluster. By default, kMeans is used for clustering algorithm.
Connectivity-Based Outlier Factor	Connectivity-Based Outlier Factor (COF) COF uses the ratio of average chaining distance of data point and the average of average chaining distance of k nearest neighbor of the data point, as the outlier score for observations.
Histogram-based Outlier Detection	Histogram- based outlier detection (HBOS) is an efficient unsupervised method. It assumes the feature independence and calculates the degree of outlyingness by building histograms. Two versions of HBOS are supported: A) Static number of bins: uses a static number of bins for all features. B) Automatic number of bins: every feature uses a number of bins deemed to be optimal according to the Birge-Rozenblac method.
IsolationForest Outlier Detector	The IsolationForest (IFOREST) ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
Local Outlier Factor	Local Outlier Factor (LOF) measures the local deviation of density of a given sample with respect to its neighbors. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighborhood. Locality is given by k-nearest neighbors, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors.
Local Correlation Integral	LOCI is highly effective for detecting outliers and groups of outliers ( a.k.a.micro-clusters), which offers the following advantages and novelties: (a) It provides an automatic, data-dictated cut-off to determine whether a point is an outlier—in contrast, previous methods force users to pick cut-offs, without any hints as to what cut-off value is best for a given dataset. (b) It can provide a LOCI plot for each point; this plot summarizes a wealth of information about the data in the vicinity of the point, determining clusters, micro-clusters, their diameters and their inter-cluster distances.

image not found

Anomaly Detection Configuration

Anomaly detection configurations have three subsets of configurations.

Feature selection options
Data preprocessing options
Algorithm Options

Feature selection options

Option	Description
Single variate model vs multi-variate model	This allows you mention whether you want to use single variate model or multi variate models. For the single variate model, you can choose which thresholder. For multi variate model, you need to select the algorith and then optionally select the thresholder that is applied on the metric scores.
Time Series Model	For time series model, you need to select a timestamp field. The timestamp field can be Date or Timestamp type.
Time series frequency	Select the time frequency of the observations. For example, daily, hourly or by minute. A number of string aliases are given to useful common time series frequencies. See the table "time series frequecy aliases".
Features	Select the numeric features which would be used to detect the anomalies.
Minimum Observations	Anomaly detection requires a minimum number of observations per time window.

Data preprocessing

Option	Description
Imputer	To fill the missing data, specify an imputer.
Standard Scaler	Use the standard scaler to regularize the feature measurements. It performs mean shift and scaling to make the mean zero and uniform variance for all features.
Principal Components	In case you suspect the features have multi-collinearity or you want to reduce the dimension of the data, it this feaure to. Currently, only linear PCA - components are derived.

Outlier detection technique

Option	Description
Multi Variate Algorithm	The algorithm used for detecting the outliers using multi variate scenario.
Use Thresholder	Indicate whether to use a dynamic thresholder model to find anomalies or identify a fixed ratio of the observations as outliers.
Thresholder	Which thresholder algorithm to use. A thresholder is applicable for both single variate and multi-variate models.
Fixed Threshold	The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function. The range of the value is [0.0 to 1.0], 0.1 is a good starting point.
Use Pre-trained	Whether to use a pre-trained model.
Pre-trained model Id	What is the model ID of the pre-trained model.

Why do we set the frequency of the data in time series analysis?

Regular Intervals: Setting a frequency ensures your data points are evenly spaced in time. This is crucial because many time series algorithms and statistical methods assume regular intervals between observations.
Handling Missing Data: With a set frequency, identify missing time periods and handle them appropriately
Time-Based Operations: It enables straightforward calculations like rolling averages, shifts, and seasonal decomposition.
Resampling: You can easily aggregate or downsample data to different time periods:
Seasonality Analysis: Regular frequencies make it easier to detect and analyze seasonal patterns.
Forecasting: Many forecasting models work better with regular time intervals, as they make assumptions about the temporal structure of the data.

Time series frequecy aliases

Alias	Description
B	business day frequency
W	weekly frequency
MS	month start frequency
ME	month end frequency
h	hourly frequency
min	minutely frequency
s	secondly frequency
ms	milliseconds

Rule Incidents

Incidents are the outcome of the rule execution. Rule engine executes the rule in a continuos manner to process the data in near realtime. Rules are executed asynchronously on data streams directed to a table. If a rule identifies records that meet its criteria, an incident record is generated.

Relationship between a incident and events (table records) maybe one-to-one or one-to-many depending on the initial query defined in the rule condition.

For example, an incident generated by the follow criteria, would corresponds to a single event.

SELECT
    dst_ip,
    ts,
    bytes_in,
    bytes_out,
FROM   firewall
WHERE
    dst_ip like '103.113.68.23' AND bytes_out > 1024

However, an incident generated by the follow criteria, would likely corresponds to multiple events. The incident record, infact, would not capture the actual event ids, but rather, it captures aggregated summary of the underlying events.

SELECT
    dst_ip,
    date_trunc("hour", ts) AS ts,
    Sum(bytes_out) AS bytes_out
FROM   firewall
GROUP  BY 
    dst_ip,
    ts

An incident may be recorded in the alert center, and/or various actions can be triggered, such as sending an email alert, invoking a webhook, or executing a playbook action.

Incident Properties

Following are the core fields on the rule_incident table.

Field	Type	Description
id	Bigint	System generated ID
rule_id	Bigint	ID of the corresponding rule
lower_bound	Timestamp	Lower of the time range.
upper_bound	Timestamp	Upper bound of the time range.
data	Json	Data (single row) in the query_result.

Following field values are copied to the rule_incident table from the corresponding rule description.

Field	Type	Description
priority	Text	Priority as per the rule description
category	Text	Rule category as per the rule description
product	Text	Rule product as per the rule description
ref_id	Text	Rule ref_id as per the rule description
severity	Text	Rule severiry as per the rule description
score	Float	Rule weight as per the rule description

Following field values are copied to the rule_incident table from the query output if they contain these values.

Field	Type	Description
table_name	Text	Table_name which contains the raw events.
event_id	Text	ID of the event if it is relevent for the rule. For aggregate based, rule it is not relevent.
context_id	Text	ID of the context, it may contain device information or user information. This ID may belong to an external system.
event_ts	Timestamp	Use this field to capture the timestamp when the event is generated
event_user	Text	Use this field to capture user information corresponding to the event.

Following fields in rule_incident are used to capture the interaction by the operation team.

Field	Type	Description
sla	Integer
product	Text
comment	Text
close_date	Date
due_date	Date
status	Text
assigned_id	Lookup(user)

Handle False Alarm

A common issue to operationalize the rule engine is the high number of false positive incidents. It occurres mostly due to the incompleteness of the rule criteria. To tackle Avalanchio uses a built in machine learning model to measure confidence score for each incidents. To measure the confidence, we need to capture operation team's feedback on the generated incidents to train the machine learning model. The engine uses the query output and rule metadata as feature to train the model, which is used to preodcting true positive likelihood probability.

Rule Actions

image not found

Below are the actions that can be triggered by the incidents detected by the rule.

image not found

Send to Alert Center

Send the incidents to the Alert Center. The alert center is built on top of the rule_indicator table. Alert center allows you to filter, sort, slice and dice data based on the various conditions.

image not found

Send email alert

Send email alert to the targeted email address. A single email is sent to the recipient containing top 10 records from the query output.

image not found

Call Webhooks

The engine makes web service call to send the rule metadata and the incident records. The endpoint must accept the following request body in the json format.

Field	Description
rule_id	UUID (Guid) of the rule
rule_guid	UUID (Guid) of the rule
rule_name	Name of the rule
rule_severity	Severity of the rule
fields	List of fields in the query output
records	List of json objects as per the query output

The endpoint must be a secure endpoint (SSL/TLS) and accessible from Avalanchio servers and HTTP request method must be "POST".

Example:

href="#__codelineno-4-1">{ "log_id": 154568723187846, "fields": [ { "name": "datetime", "type": "Timestamp" }, { "name": "diffuse_flows", "type": "DOUBLE" }, { "name": "general_diffuse_flows", "type": "DOUBLE" }, { "name": "humidity", "type": "DOUBLE" }, { "name": "id", "type": "Bigint" }, { "name": "name", "type": "Text" }, { "name": "temperature", "type": "DOUBLE" }, { "name": "wind_speed", "type": "DOUBLE" }, { "name": "zone_1_power_consumption", "type": "DOUBLE" }, { "name": "zone_2_power_consumption", "type": "DOUBLE" }, { "name": "zone_3_power_consumption", "type": "DOUBLE" } ], "records": [ { "datetime": "2017-01-01T01:00:00Z", "diffuse_flows": 0.096, "general_diffuse_flows": 0.099, "humidity": 73.7, "id": 1748123545, "name": null, "temperature": 4.92, "wind_speed": 0.083, "zone_1_power_consumption": 21654.68354, "zone_2_power_consumption": 13582.97872, "zone_3_power_consumption": 14480.96386 }, { "datetime": "2017-01-01T01:00:00Z", "diffuse_flows": 0.89, "general_diffuse_flows": 0.809, "humidity": 66.11, "id": 1748123617, "name": null, "temperature": 13.89, "wind_speed": 0.076, "zone_1_power_consumption": 39469.36709, "zone_2_power_consumption": 24251.67173, "zone_3_power_consumption": 25792.77108 } ], "date_range": { "end": 1483254000000, "start" :1483232400000 }, "rule_guid": "d9a741c0-62fc-4de1-a701-f50dc603bb2a", "rule_id":46, "rule_name":"Power Consumption by every hour", "severity":"Informational" }

Authentication Details Webhoook Endpoint

Current the rule engine accept two type authentication mechanisms. A. Basic Authentication
B. Jwot Token based authentication.

Look into the details of Webhook about how to configure and test the webhook.

In some environment, if exposing a web service endpoint is not feasible, alternative is to run a playbook as described in the next section.

Timeout

You can set the timeout of the webhook configuration. Default timeout is 120 seconds.

Example of webhook defined using python flask

import uuid

from flask import Flask, request, jsonify
import random
from time import sleep
app = Flask(__name__)

def verify_payload(payload:dict):
    rule_id = payload.get("rule_id")
    assert rule_id is not None, "rule_id is not found"
    assert isinstance(rule_id, int), "rule_id must be int"
    rule_guid = payload.get("rule_guid")
    assert rule_guid is not None, "rule_guid is not found"
    assert isinstance(rule_guid, str), "rule_guid must be str"
    rule_name = payload.get("rule_name")
    assert rule_name is not None, "rule_name is not found"
    assert isinstance(rule_name, str), "rule_name must be str"
    log_id = payload.get("log_id")
    assert log_id is not None, 'Execution log_id is not found in the request'
    date_range = payload.get("date_range")
    assert date_range is not None
    assert isinstance(date_range, dict), "date_range must be a dict with start and end"
    assert isinstance(date_range.get('start'), int), 'date_rate.start must be int'
    assert isinstance(date_range.get('end'), int), 'date_rate.end must be int'
    fields = payload.get("fields")
    assert fields is not None, 'fields are not found'
    assert isinstance(fields, list), 'fields must be list, but found ' + str(type(fields))
    records = payload.get("records")
    assert records is not None, "No records are found"
    assert isinstance(records, list), 'records must be list, but found ' + str(type(records))

@app.route("/power-consumption/alert-action", methods=['POST'])
def power_consumption_alert_action():
    try:
        payload = request.json
        print(f"Payload: {payload}")
        verify_payload(payload)
    except Exception as e:
        return jsonify({
            'status': 'Failed',
            'message': str(e)
        }), 400

    return jsonify({
        "status": "Success",
        "message": s
    })

if __name__ == "__main__":
    app.run(debug=False, port=4567)

Execute Playbook

You can write your business logic as a Python script to handle the json payload. The structure of the payload is same as that in the webhooks. You need to run an agent application in the backend on a server that you manage, which keeps making long polling requests to the Avalanchio service over secure http looking for any pending tasks in the queue related to the rule.

Example of python playbook:

def process(payload):

    """
    This method is called when a new task is assigned by the rule. 
    @param payload: dictionary, the payload which contains 
                    the metadata about the rule and the query_result  
    @return: str, message indicating status of the processing. 
             The response value is recorded in the rule_exec_log table
             for reporting purpose.
    """
    print(payload)

Note: the method name must be process (case-sensitive). The agent invokes this method from the playbook script.

For information about playbook, look into the playbook documentation. view

Use cases for playbooks:

You can use the playbooks for various automation, for example: - Lock/unlock some user account in an application - Use your own notification infrastructure for sending alerts. - Create a ticket in your internal ITSM system.

Write to table

You can write the query result to a table. The table scheme is automatically managed by the rule engine. In addition to capturing the query result, each record also captures the timestamp range for which the query is evaluated and the rule ID. It is possible that more than one rule writes to a single table to buid timeline profile by an entity.

In case you are collecting output from multiple rule to the same table, you must ensure that out field field alias and data type are exactly same for each rule.

image not found

Limit number of indicidents

You can enforce a limit on the number of incidents triggered actions in a rolling 24 hours period.

Logs and Metrics

Each invocation of the rule is recorded in rule_exec_log table. Following are the fields in the table.

Field	Type	Description
id	Bigint	System generated log id
ts	Timestamp	Rule triggering timestamp
status	Text	Status of the rule trigger
rule_id	Bigint	ID of the corresponding rule
message	Text	Any error while during execution
duration	Integer	Total duration of the rule execution
count	Intger	Number of incidents generated by the rule
lag_period	Intger	Lag period of the rule execution
lower_bound	Timestamp	Lower bound of the timestamp window range of the events
upper_bound	Timestamp	Upper bound of the timestamp window range of the events
query_output	Intger	Number of records in the query result from the SELECT query statement
query_duration	Intger	Duration taken to execute the SELECT query
email_sent	Intger	Number of email sent
email_duration	Intger	Duration taken to send the email
email_error	Text	Any error while sending the email
webhook_duration	Intger	Duration taken to call webhook action
webhook_error	Text	Any error while call the webhook action.
webhook_response	Text	Response from the webhook. First 4KB message is captured as text.
anomaly_duration	Intger	Duration taken by anomaly detection algorith.
anomaly_error	Text	Any error while processing the anomaly detection algorithm.
anomaly_ratio	Float	Ratio of the anomalies with respect to the query output count.
playbook_duration	Intger	Duration taken by the playbook execution.
playbook_error	Text	Any error while executing playbook action.
playbook_response	Text	Response message from the playbook.

image not found

Reset Checkpoint

Rule engine incrementally processes the data in a continuous manner. It uses id value of a table to record to as watermark to track how much data the engine has processed at a given point of time. Note: the "id" field values in a table are system generated, unique, and is monotonotically increasing sequence numbers. When the rule is trigger, it checks the last checkpoint values - that is id and timestamp and to determine the scope of execution. During development and testing you may setup a rule, test the rule, when the test is over, you like to delete all rule output, checkpoints and logs. That is where the "Reset" checkpoint feature comes to play. When you "Reset Checkpoint", it deletes the checkpoint, logs, and incidents corresponding the rule.

image not found

Back Testing

If you want to test certain rule in the historical data or maybe you want to start the rule execution at a given timepoint, you select the option "Edit Checkpoint". You will get three options:

Option	Description
Set watermark to the earliest event	It sets the checkpoint to the earliest record found in the corresponding tables.
Set watermark to the latest event	It sets the checkpoint to the current timestamp and processing any new data that inserted into the table.
Set custom watermark	In this option you set an arbitrary value of the timestamp, where from where you to like start reading data.

Alert Center

Alert center is the cockpit for the operations team view the rule incidents and take actions.

image not found