1. Introduction
Elasticsearch is a powerful and versatile search engine that can handle large amounts of data and perform complex queries. But did you know that Elasticsearch also has a built-in machine learning feature that can help you analyze your data and discover patterns, anomalies, and trends?
In this blog, you will learn how to access and evaluate the results of your machine learning jobs using Elasticsearch APIs and Kibana, and how to use the different types of machine learning features in Elasticsearch effectively.
By the end of this blog, you will be able to:
- Create and run data frame analytics and anomaly detection jobs in Elasticsearch
- Use APIs to retrieve and inspect the results of your machine learning jobs
- Use Kibana to visualize and explore the results of your machine learning jobs
- Evaluate the performance and accuracy of your machine learning jobs
Ready to get started? Let’s dive in!
2. Elasticsearch Machine Learning Overview
Elasticsearch machine learning is a feature that allows you to apply various types of machine learning techniques to your data stored in Elasticsearch. You can use machine learning to perform tasks such as data analysis, anomaly detection, and forecasting.
There are two main groups of machine learning features in Elasticsearch: data frames and analytics, and anomaly detection and forecasting. Let’s take a look at each of them in more detail.
2.1. Data Frames and Analytics
Data frames and analytics are machine learning features in Elasticsearch that allow you to transform and analyze your data in a structured and scalable way. You can use them to perform tasks such as classification, regression, and outlier detection.
A data frame is a tabular representation of your data that consists of rows and columns. You can create one from your existing Elasticsearch indices by using a transform (called a data frame transform in older releases). A transform lets you apply operations such as filtering, grouping, aggregating, and pivoting to your data. You can also make a transform continuous, so that it updates periodically as new data arrives.
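To make this concrete, here is a minimal sketch of defining and starting a continuous pivot transform with the Python client. A recent 7.x client is assumed, and the index names, group-by field, and aggregation are hypothetical placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# A hypothetical pivot transform: average price per city,
# written continuously to a destination index.
es.transform.put_transform(
    transform_id="avg_price_per_city",
    body={
        "source": {"index": "house_price"},
        "dest": {"index": "house_price_by_city"},
        "pivot": {
            "group_by": {"city": {"terms": {"field": "city.keyword"}}},
            "aggregations": {"avg_price": {"avg": {"field": "price"}}},
        },
        # Optional: make the transform continuous (assumes a timestamp field)
        "sync": {"time": {"field": "timestamp", "delay": "60s"}},
    },
)
es.transform.start_transform(transform_id="avg_price_per_city")
```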
An analytics job is a process that reads data from a source index and performs a specific type of analysis. You can create an analytics job by using the create data frame analytics jobs API. There are three types of analytics jobs in Elasticsearch (a creation sketch follows the list):
- Classification: A classification job predicts the value of a categorical field based on the values of other fields. For example, you can use a classification job to predict whether a customer will buy a product or not based on their age, gender, and purchase history.
- Regression: A regression job predicts the value of a numerical field based on the values of other fields. For example, you can use a regression job to predict the price of a house based on its size, location, and features.
- Outlier detection: An outlier detection job identifies the data points that deviate significantly from the rest of the data. For example, you can use an outlier detection job to detect fraudulent transactions or anomalous behavior.
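To make the later examples reproducible, here is a minimal sketch of how the house_price_regression job used in section 3 could be created and started; the source index, destination index, and fields follow the configuration shown there:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# A regression job that predicts "price" from the other fields
# in the "house_price" index.
es.ml.put_data_frame_analytics(
    id="house_price_regression",
    body={
        "source": {"index": "house_price"},
        "dest": {"index": "house_price_regression"},
        "analysis": {
            "regression": {
                "dependent_variable": "price",
                "training_percent": 80,
            }
        },
        "model_memory_limit": "100mb",
    },
)
es.ml.start_data_frame_analytics(id="house_price_regression")
```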
By using data frames and analytics, you can transform and analyze your data in a powerful and efficient way. You can also access and evaluate the results of your analytics jobs using APIs and Kibana, which we will cover in the next sections.
2.2. Anomaly Detection and Forecasting
Anomaly detection and forecasting are the other group of machine learning features in Elasticsearch; they allow you to monitor your data and detect unusual patterns or trends. You can use anomaly detection and forecasting to perform tasks such as system performance monitoring, fraud detection, and demand forecasting.
An anomaly detection job is a process that analyzes a time series of data and identifies the data points that deviate from its normal behavior. You can create an anomaly detection job by using the create anomaly detection jobs API. An anomaly detection job involves three main components (a creation sketch follows the list):
- Datafeed: A datafeed is a configuration that specifies how to retrieve the data from your Elasticsearch indices and what fields to use for the analysis.
- Analysis config: An analysis config is a configuration that defines the type of analysis to perform on the data, such as mean, count, sum, rare, etc. You can also specify the detectors, influencers, and bucket span for the analysis.
- Model snapshot: A model snapshot is a snapshot of the statistical model that the anomaly detection job uses to analyze the data. You can use model snapshots to restore or revert the state of the anomaly detection job.
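As a minimal sketch, assuming a 7.x Python client, here is how the cpu_usage job used later in this blog might be created together with a datafeed; the index pattern and field names are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# An anomaly detection job with one detector: the mean of a CPU
# metric, analyzed in 15-minute buckets.
es.ml.put_job(
    job_id="cpu_usage",
    body={
        "analysis_config": {
            "bucket_span": "15m",
            "detectors": [
                {"function": "mean", "field_name": "system.cpu.total.pct"}
            ],
            "influencers": ["host.name"],
        },
        "data_description": {"time_field": "@timestamp"},
    },
)

# A datafeed that tells the job which indices to read from.
es.ml.put_datafeed(
    datafeed_id="datafeed-cpu_usage",
    body={"job_id": "cpu_usage", "indices": ["metrics-*"]},
)

# Open the job and start the datafeed.
es.ml.open_job(job_id="cpu_usage")
es.ml.start_datafeed(datafeed_id="datafeed-cpu_usage")
```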
A forecast predicts the future values of a time series based on the historical data and the statistical model of an existing anomaly detection job. You can create one by using the forecast API, which requires the ID of an anomaly detection job and a duration for the forecast.
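For example, a minimal sketch of requesting a three-day forecast for the hypothetical cpu_usage job (the job must be open):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Request a forecast covering the next 3 days.
res = es.ml.forecast(job_id="cpu_usage", duration="3d")
print(res)  # the response contains the forecast_id for retrieving results
```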
By using anomaly detection and forecasting, you can monitor and predict your data in a robust and flexible way. You can also access and evaluate the results of your anomaly detection and forecasting jobs using APIs and Kibana, which we will cover in the next sections.
3. Accessing Machine Learning Results with APIs
Once you have created and run your machine learning jobs in Elasticsearch, you might want to access and inspect the results of your jobs. For example, you might want to see the statistics, the progress, the status, or the output of your jobs. You can do this by using the APIs that Elasticsearch provides for each type of machine learning feature.
In this section, we will cover the APIs that you can use to access the results of your data frame analytics and anomaly detection jobs. We will also show you some examples of how to use these APIs with Python and the Elasticsearch Python client.
The APIs that we will cover are:
- Get Data Frame Analytics Stats API: This API allows you to retrieve usage information and statistics for one or more data frame analytics jobs.
- Get Data Frame Analytics API: This API allows you to retrieve configuration information and metadata for one or more data frame analytics jobs.
- Get Anomaly Detection Job Stats API: This API allows you to retrieve usage information and statistics for one or more anomaly detection jobs.
- Get Anomaly Detection Job API: This API allows you to retrieve configuration information and metadata for one or more anomaly detection jobs.
By using these APIs, you can access and evaluate the results of your machine learning jobs in a programmatic and convenient way. You can also use Kibana to visualize and explore the results of your machine learning jobs, which we will cover in the next section.
3.1. Get Data Frame Analytics Stats API
The Get Data Frame Analytics Stats API allows you to retrieve usage information and statistics for one or more data frame analytics jobs. You can use this API to monitor the progress, the status, and the performance of your analytics jobs.
To use this API, you need to specify the ID of the analytics job or a wildcard expression that matches multiple IDs. You can also use the `_all` value to get the statistics for all the analytics jobs in your cluster.
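As a quick sketch, both forms can be passed straight through the Python client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Statistics for every data frame analytics job in the cluster
all_stats = es.ml.get_data_frame_analytics_stats(id="_all")

# Statistics for all jobs whose ID starts with "house_price"
some_stats = es.ml.get_data_frame_analytics_stats(id="house_price*")
```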
The response of this API contains the following information for each analytics job:
- `id`: The unique identifier of the analytics job.
- `state`: The current state of the analytics job, such as `starting`, `analyzing`, `stopping`, or `stopped`.
- `progress`: The percentage of the data that has been analyzed by the analytics job, reported per phase.
- `data_counts`: The number of documents that have been read, analyzed, and written by the analytics job.
- `memory_usage`: The memory usage statistics of the analytics job, such as the peak memory usage and the timestamp of the peak.
- `node`: Information about the node that runs the analytics job, such as the node name, ID, and attributes.
- `assignment_explanation`: The reason why the analytics job is assigned to a certain node.
Here is an example of how to use this API with Python and the Elasticsearch Python client:
```python
# Import the Elasticsearch client library
from elasticsearch import Elasticsearch

# Create a client instance
es = Elasticsearch()

# Get the statistics for a specific analytics job
res = es.ml.get_data_frame_analytics_stats(id="house_price_regression")

# Print the response
print(res)
```
The output of this code snippet is:
{ "count": 1, "data_frame_analytics": [ { "id": "house_price_regression", "state": "stopped", "progress": [ { "phase": "reindexing", "progress_percent": 100 }, { "phase": "loading_data", "progress_percent": 100 }, { "phase": "analyzing", "progress_percent": 100 }, { "phase": "writing_results", "progress_percent": 100 } ], "data_counts": { "training_docs_count": 1460, "test_docs_count": 365, "skipped_docs_count": 0 }, "memory_usage": { "peak_usage_bytes": 65812480, "timestamp": 1611743882000 }, "node": { "id": "4Z4ikX7lQMOZqZW3zFJhow", "name": "node-1", "ephemeral_id": "KrE9qImbQ-iRq8uXJ0cbRw", "transport_address": "127.0.0.1:9300", "attributes": { "ml.machine_memory": "17179869184", "ml.max_open_jobs": "20", "xpack.installed": "true", "transform.node": "true", "ml.max_jvm_size": "1073741824" } }, "assignment_explanation": "persistent task has consistent assignment: [4Z4ikX7lQMOZqZW3zFJhow]" } ] }
As you can see, this API gives you a lot of information about the analytics job and its results. You can use this information to check the status, the progress, and the performance of your analytics job.
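If you want to act on these statistics programmatically, a small sketch like the following (using the response shape shown above) can report the per-phase progress:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

res = es.ml.get_data_frame_analytics_stats(id="house_price_regression")
job = res["data_frame_analytics"][0]

# Print the overall state, then the progress of each phase
print(f"state: {job['state']}")
for phase in job["progress"]:
    print(f"{phase['phase']}: {phase['progress_percent']}%")
```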
3.2. Get Data Frame Analytics API
The Get Data Frame Analytics API allows you to retrieve configuration information and metadata for one or more data frame analytics jobs. You can use this API to inspect the details of your analytics jobs, such as the source and destination indices, the analysis type and parameters, and the model memory limit.
To use this API, you need to specify the ID of the analytics job or a wildcard expression that matches multiple IDs. You can also use the `_all` value to get the information for all the analytics jobs in your cluster.
The response of this API contains the following information for each analytics job:
- `id`: The unique identifier of the analytics job.
- `source`: The source configuration that specifies the index or query from which to fetch the data for the analysis.
- `dest`: The destination configuration that specifies the index in which to store the results of the analysis.
- `analysis`: The analysis configuration that defines the type and parameters of the analysis to perform on the data. The analysis type can be one of `classification`, `regression`, or `outlier_detection`.
- `analyzed_fields`: The fields that are included in or excluded from the analysis.
- `model_memory_limit`: The maximum amount of memory that the analytics job is allowed to use.
- `create_time`: The timestamp of when the analytics job was created.
- `version`: The version of Elasticsearch that the analytics job was created with.
- `allow_lazy_start`: A boolean value that indicates whether the analytics job is allowed to wait in a starting state when there is no node with sufficient memory, instead of failing immediately.
Here is an example of how to use this API with Python and the Elasticsearch Python client:
```python
# Import the Elasticsearch client library
from elasticsearch import Elasticsearch

# Create a client instance
es = Elasticsearch()

# Get the information for a specific analytics job
res = es.ml.get_data_frame_analytics(id="house_price_regression")

# Print the response
print(res)
```
The output of this code snippet is:
{ "count": 1, "data_frame_analytics": [ { "id": "house_price_regression", "source": { "index": [ "house_price" ], "query": { "match_all": {} } }, "dest": { "index": "house_price_regression", "results_field": "ml" }, "analysis": { "regression": { "dependent_variable": "price", "training_percent": 80, "randomize_seed": 42 } }, "analyzed_fields": { "includes": [], "excludes": [] }, "model_memory_limit": "100mb", "create_time": 1611743879000, "version": "7.10.2", "allow_lazy_start": false } ] }
As you can see, this API gives you a lot of information about the analytics job and its configuration. You can use this information to check the details and settings of your analytics job.
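A short sketch of how you might pull a few of these settings out of the response, based on the shape shown above:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

res = es.ml.get_data_frame_analytics(id="house_price_regression")
job = res["data_frame_analytics"][0]

# The "analysis" object has a single key naming the analysis type
analysis_type = next(iter(job["analysis"]))  # e.g. "regression"
print(f"analysis type: {analysis_type}")
print(f"source index:  {job['source']['index']}")
print(f"dest index:    {job['dest']['index']}")
print(f"memory limit:  {job['model_memory_limit']}")
```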
3.3. Get Anomaly Detection Job Stats API
The Get Anomaly Detection Job Stats API allows you to retrieve usage information and statistics for one or more anomaly detection jobs. You can use this API to monitor the progress, the status, and the performance of your anomaly detection jobs.
To use this API, you need to specify the ID of the anomaly detection job or a wildcard expression that matches multiple IDs. You can also use the `_all` value to get the statistics for all the anomaly detection jobs in your cluster.
The response of this API contains the following information for each anomaly detection job:
- `job_id`: The unique identifier of the anomaly detection job.
- `state`: The current state of the anomaly detection job, such as `opened`, `closed`, `failed`, or `closing`.
- `opened_time`: The timestamp of when the anomaly detection job was opened.
- `assignment_explanation`: The reason why the anomaly detection job is assigned to a certain node.
- `node`: Information about the node that runs the anomaly detection job, such as the node name, ID, and attributes.
- `data_counts`: The number of records that have been processed by the anomaly detection job.
- `model_size_stats`: The model size statistics of the anomaly detection job, such as the model memory limit, the model memory usage, the model bytes, and the bucket allocation failures.
- `forecasts_stats`: The forecast statistics of the anomaly detection job, such as the total number of forecasts, the memory usage, the records, and the status.
Here is an example of how to use this API with Python and the Elasticsearch Python client:
```python
# Import the Elasticsearch client library
from elasticsearch import Elasticsearch

# Create a client instance
es = Elasticsearch()

# Get the statistics for a specific anomaly detection job
res = es.ml.get_job_stats(job_id="cpu_usage")

# Print the response
print(res)
```
The output of this code snippet is:
{ "count": 1, "jobs": [ { "job_id": "cpu_usage", "state": "opened", "opened_time": 1611743891000, "assignment_explanation": "persistent task has consistent assignment: [4Z4ikX7lQMOZqZW3zFJhow]", "node": { "id": "4Z4ikX7lQMOZqZW3zFJhow", "name": "node-1", "ephemeral_id": "KrE9qImbQ-iRq8uXJ0cbRw", "transport_address": "127.0.0.1:9300", "attributes": { "ml.machine_memory": "17179869184", "ml.max_open_jobs": "20", "xpack.installed": "true", "transform.node": "true", "ml.max_jvm_size": "1073741824" } }, "data_counts": { "job_id": "cpu_usage", "processed_record_count": 1000, "processed_field_count": 2000, "input_bytes": 8000, "input_field_count": 2000, "invalid_date_count": 0, "missing_field_count": 0, "out_of_order_timestamp_count": 0, "empty_bucket_count": 0, "sparse_bucket_count": 0, "bucket_count": 100, "earliest_record_timestamp": 1611743890000, "latest_record_timestamp": 1611744890000, "last_data_time": 1611744891000, "latest_empty_bucket_timestamp": 0, "latest_sparse_bucket_timestamp": 0 }, "model_size_stats": { "job_id": "cpu_usage", "result_type": "model_size_stats", "model_bytes": 123456, "total_by_field_count": 10, "total_over_field_count": 0, "total_partition_field_count": 1, "bucket_allocation_failures_count": 0, "memory_status": "ok", "log_time": 1611744891000, "timestamp": 1611744890000 }, "forecasts_stats": { "total": 0, "memory_bytes": { "min": 0, "max": 0, "avg": 0, "total": 0 }, "records": { "min": 0, "max": 0, "avg": 0, "total": 0 }, "processing_time_ms": { "min": 0, "max": 0, "avg": 0, "total": 0 }, "status": { "finished": 0, "failed": 0, "canceled": 0 }, "messages": [] } } ] }
As you can see, this API gives you a lot of information about the anomaly detection job and its results. You can use this information to check the status, the progress, and the performance of your anomaly detection job.
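One practical use of these statistics is a health check. Here is a small sketch, based on the response shape above, that warns when the model is under memory pressure:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

res = es.ml.get_job_stats(job_id="cpu_usage")
job = res["jobs"][0]

memory_status = job["model_size_stats"]["memory_status"]
if memory_status != "ok":
    # "soft_limit" or "hard_limit" means the model is constrained
    print(f"WARNING: job {job['job_id']} memory status is {memory_status}")
print(f"processed records: {job['data_counts']['processed_record_count']}")
```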
3.4. Get Anomaly Detection Job API
The Get Anomaly Detection Job API allows you to retrieve the configuration of an anomaly detection job. You can use this API to inspect the details of your anomaly detection jobs, such as the analysis settings, detectors, and influencers.
To use this API, you need to specify the `job_id` of the anomaly detection job that you want to get. You can also use a wildcard expression (`*`) to get multiple jobs at once. For example, the following request gets the configuration of the anomaly detection job with the `job_id` of `farequote`:

```
GET _ml/anomaly_detectors/farequote
```
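With the Python client, the equivalent call is a sketch along these lines (assuming the 7.x client used elsewhere in this blog, where the method is named `get_jobs`):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Equivalent to GET _ml/anomaly_detectors/farequote
res = es.ml.get_jobs(job_id="farequote")
print(res)
```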
The response contains the following fields:
- `count`: The number of anomaly detection jobs that match the request.
- `jobs`: An array of objects that contain the configuration of each anomaly detection job.

Each object in the `jobs` array has the following fields:
- `job_id`: The unique identifier of the anomaly detection job.
- `job_type`: The type of the job. This field is reserved for future use and is currently always `anomaly_detector`.
- `job_version`: The version of Elasticsearch when the anomaly detection job was created.
- `description`: A user-defined description of the anomaly detection job.
- `create_time`: The time when the anomaly detection job was created.
- `finished_time`: The time when the anomaly detection job finished. This field is only present if the job has finished.
- `analysis_config`: The analysis configuration of the anomaly detection job. It contains the following subfields:
  - `bucket_span`: The time interval that the analysis is aggregated into.
  - `detectors`: An array of objects that define the analysis to perform on the data. Each object has the following subfields:
    - `detector_description`: A user-defined description of the detector.
    - `function`: The function that is used to analyze the data. It can be one of the following: `count`, `low_count`, `high_count`, `non_zero_count`, `low_non_zero_count`, `high_non_zero_count`, `distinct_count`, `low_distinct_count`, `high_distinct_count`, `info_content`, `low_info_content`, `high_info_content`, `min`, `max`, `mean`, `median`, `low_mean`, `high_mean`, `metric`, `varp`, `low_varp`, `high_varp`, `time_of_day`, `time_of_week`, `lat_long`, `sum`, `non_null_sum`, `low_non_null_sum`, `high_non_null_sum`, `rare`, or `freq_rare`.
    - `field_name`: The name of the field that is analyzed by the detector. This field is not required if the function is `count`, `time_of_day`, or `time_of_week`.
    - `by_field_name`: The name of the field that is used to split the data by an entity, so that each entity is modeled separately. For example, if the function is `count` and the `by_field_name` is `airline`, the analysis counts the number of documents for each airline.
    - `over_field_name`: The name of the field that is used for population analysis, where each entity is compared against the behavior of the population as a whole. For example, if the function is `count` and the `over_field_name` is `airline`, the analysis compares each airline's document count against the population of airlines in each bucket.
    - `partition_field_name`: The name of the field that is used to segment the data into independent groups, each with its own model. For example, if the function is `count` and the `partition_field_name` is `airline`, the analysis counts the number of documents for each airline separately.
    - `detector_index`: The unique identifier of the detector within the job.
    - `custom_rules`: An array of objects that define custom rules for the detector. Each object has the following subfields:
      - `actions`: An array of actions that are triggered when the rule conditions are met. The actions can be `skip_result`, `skip_model_update`, or both.
      - `scope`: An object that defines the scope of the rule. It contains one or more fields that are used to limit where the rule applies; each field has an object value that references a filter of allowed or disallowed values.
      - `conditions`: An array of objects that define the conditions of the rule. Each object has the following subfields:
        - `applies_to`: The value to which the condition applies. It can be one of the following: `actual`, `typical`, `diff_from_typical`, or `time`.
        - `operator`: The operator that is used to compare the `applies_to` value with the `value`. It can be one of the following: `lt` (less than), `lte` (less than or equal to), `gt` (greater than), or `gte` (greater than or equal to).
        - `value`: The value that is compared with the `applies_to` value.
  - `influencers`: An array of fields that are used to identify the entities that influence the anomalies.
  - `summary_count_field_name`: The name of the field that is used to summarize the document counts. For example, if this field is set to `doc_count`, the analysis uses the value of the `doc_count` field in each document as the document count.
  - `categorization_field_name`: The name of the field that is used to categorize the data. For example, if this field is set to `message`, the analysis groups the documents by the value of the `message` field and assigns them a category.
  - `categorization_filters`: An array of regular expressions whose matches are excluded from the `categorization_field_name` values before categories are defined. For example, if the `categorization_field_name` is `message` and the `categorization_filters` are `["^error"]`, the analysis ignores the leading `error` token when it determines the categories.
  - `per_partition_categorization`: An object that defines whether the categorization is done per partition. It contains the following subfields:
    - `enabled`: A boolean value that indicates whether per-partition categorization is enabled.
    - `stop_on_warn`: A boolean value that indicates whether categorization stops for a partition if the categorization quality warning is triggered.
- `analysis_limits`: The limits that are applied to the analysis. It contains the following subfields:
  - `model_memory_limit`: The approximate maximum amount of memory resources that are required for the analysis.
  - `categorization_examples_limit`: The maximum number of examples that are stored per category.
- `data_description`: The description of the data format and how it is used for the analysis. It contains the following subfields:
  - `time_field`: The name of the field that contains the timestamp of the data.

4. Evaluating Machine Learning Results with Kibana
Kibana is a web-based user interface that allows you to visualize and explore your data stored in Elasticsearch. You can use Kibana to evaluate the results of your machine learning jobs and gain insights into your data.
Kibana provides two main features for evaluating machine learning results: Data Frame Analytics Exploration and Anomaly Detection Exploration. These features enable you to view the results of your data frame analytics and anomaly detection jobs in various charts, tables, and maps. You can also filter, sort, and search the results to focus on the most relevant or interesting aspects of your data.
In this section, you will learn how to use Kibana to evaluate the results of your machine learning jobs. You will learn how to:
- Access the Machine Learning app in Kibana
- Navigate the Data Frame Analytics and Anomaly Detection tabs
- Explore the results of your data frame analytics and anomaly detection jobs
- Interpret the charts, tables, and maps that display the results
- Apply filters, queries, and actions to the results
Let’s get started!
4.1. Data Frame Analytics Exploration
Data Frame Analytics Exploration is a feature in Kibana that allows you to view the results of your data frame analytics jobs in a user-friendly interface. You can use this feature to explore the data frame analytics results and understand the relationships, patterns, and outliers in your data.
To access the Data Frame Analytics Exploration feature, you need to go to the Machine Learning app in Kibana and click on the Data Frame Analytics tab. You will see a list of your data frame analytics jobs and their status, progress, and type. You can click on the View button next to the job that you want to explore and select the Results option.
You will be taken to the Data Frame Analytics Exploration page, where you can see the results of your data frame analytics job in various charts and tables. Depending on the type of your data frame analytics job, you will see different types of results. For example, if your job is a regression or a classification job, you will see the following results:
- A scatterplot that shows the actual versus predicted values of your dependent variable. You can use this plot to assess the accuracy and performance of your model.
- A table that shows the top feature importance values for each document. You can use this table to understand which features have the most influence on the prediction of your dependent variable.
- A table that shows the prediction for each document and, for classification jobs, the prediction probability. You can use this table to see the predicted value and the confidence of your model for each document (see the evaluation sketch after this list).
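Beyond eyeballing the charts, you can quantify model quality with the evaluate data frame API. Here is a minimal sketch for the regression job used earlier; the field names follow the results index layout shown in section 3.2, where the results field is `ml` and the prediction is written to `ml.price_prediction`:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Evaluate predicted vs. actual prices in the results index.
res = es.ml.evaluate_data_frame(
    body={
        "index": "house_price_regression",
        # Restrict the evaluation to the held-out test set
        "query": {"term": {"ml.is_training": False}},
        "evaluation": {
            "regression": {
                "actual_field": "price",
                "predicted_field": "ml.price_prediction",
                "metrics": {"mse": {}, "r_squared": {}},
            }
        },
    }
)
print(res)
```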
If your job is an outlier detection job, you will see the following results:
- A scatterplot matrix that shows the distribution of your features and the outliers in your data. You can use this plot to identify the outliers and the features that contribute to the outlier score.
- A table that shows the outlier score and the feature influence for each document. You can use this table to see the degree of anomaly and the most influential features for each document.
In the next section, you will learn how to use the Anomaly Detection Exploration feature in Kibana.
4.2. Anomaly Detection Exploration
Anomaly Detection Exploration is a feature in Kibana that allows you to view the results of your anomaly detection jobs in a user-friendly interface. You can use this feature to explore the anomaly detection results and understand the anomalies, trends, and influences in your data.
To access the Anomaly Detection Exploration feature, you need to go to the Machine Learning app in Kibana and click on the Anomaly Detection tab. You will see a list of your anomaly detection jobs and their status, progress, and type. You can click on the View button next to the job that you want to explore and select the Anomaly Explorer or the Single Metric Viewer option.
You will be taken to the Anomaly Detection Exploration page, where you can see the results of your anomaly detection job in various charts and tables. Depending on the option that you choose, you will see different types of results. For example, if you choose the Anomaly Explorer option, you will see the following results:
- A swimlane chart that shows the overall anomaly score for each bucket over time. You can use this chart to identify the periods of time that have the highest anomaly scores.
- A swimlane chart that shows the anomaly score for each entity over time. You can use this chart to identify the entities that have the highest anomaly scores.
- A table that shows the top anomalies for each bucket. You can use this table to see the details of each anomaly, such as the actual and typical values, the anomaly score, and the influencer values.
- A chart that shows the actual and typical values of the selected anomaly over time. You can use this chart to see the magnitude and the duration of the anomaly.
If you choose the Single Metric Viewer option, you will see the following results:
- A chart that shows the actual and typical values of the selected metric over time. You can use this chart to see the trends and the anomalies in the metric.
- A table that shows the anomalies for the selected metric. You can use this table to see the details of each anomaly, such as the actual and typical values, the anomaly score, and the influencer values.
- A chart that shows the forecast of the selected metric over time, if you have created a forecast for the job. You can use this chart to see the predicted values of the metric based on the historical data and the model.
In the next section, we will wrap up with a summary of what you have learned.
5. Conclusion
In this blog, you have learned how to access and evaluate the results of your machine learning jobs using Elasticsearch APIs and Kibana. You have learned how to use the following APIs and features:
- The Get Data Frame Analytics Stats API to retrieve the statistics and the progress of your data frame analytics jobs.
- The Get Data Frame Analytics API to retrieve the configuration and metadata of your data frame analytics jobs.
- The Get Anomaly Detection Job Stats API to retrieve the statistics and the progress of your anomaly detection jobs.
- The Get Anomaly Detection Job API to retrieve the configuration and metadata of your anomaly detection jobs.
- The Data Frame Analytics Exploration feature in Kibana to view the results of your data frame analytics jobs in various charts and tables.
- The Anomaly Detection Exploration feature in Kibana to view the results of your anomaly detection jobs in various charts and tables.
By using these APIs and features, you can gain insights into your data and understand the relationships, patterns, anomalies, and trends in your data. You can also evaluate the performance and accuracy of your machine learning models and improve them if needed.
We hope that this blog has been helpful and informative for you. If you have any questions or feedback, please feel free to leave a comment below. Thank you for reading!