Monitor Metrics and Data for Applications and Clusters

Version: 1.0
Last Modified: 12/10/20

The Monitoring Dashboard allows users to view the performance of both high- and low-level components through easy-to-read metrics. The dashboard is designed to support a focused approach when analyzing performance data. For example, the Monitoring Dashboard allows performance monitoring at the cluster level and can be configured to focus on a single Application Instance for a more detailed view. The Monitoring Dashboard also provides a comprehensive view of Clusters and their associated Application Instances, giving the user a picture of overall performance.

Data and analytics are displayed according to your preset filters. Depending on your assigned role (RBAC), some features and options may not be available for monitoring. For more information about RBAC, see Assign role-based access control (RBAC).

Monitoring screen

Kubernetes and Docker container deployments may be monitored with the dashboard. While you may filter by App Inst, Cluster Inst, and Cloudlet, the data and metrics rendered are specific to your organization rather than to an individual application or cluster. Therefore, when filtering with these options, you can view clusters and application instances deployed globally within your organization.

The Monitoring Dashboard provides access to the following information types. Metric information is presented as tiles. Each tile can be enlarged to view the graphical representation for each metric type.

  • Cluster level resource utilization, performance, and status metrics
  • Load balancer (Layer 4) metrics and status
  • Application Instance resource utilization, performance, and status metrics
  • Application Instance event logs, showing state changes and other Application Instance events
  • Distributed Matching Engine (DME) metrics, including location-based metrics for remote users
  • Cloudlet level information including regions, operator, disk and memory usage, and more.

When the monitoring page is first launched, by default, metrics are displayed for App instances only. From this view, administrators can navigate effortlessly to the desired option and switch the view to either Cluster Instance or Cloudlet. Please note, only administrators have full monitoring capability.

Switching views

You can filter using the following options:

  • Organization
  • Time Ranges
  • App Inst, Cluster Inst, or Cloudlet
  • Region
  • Metric Type (CPU, Memory, Disk Usage, Network Sent, Network Received, Active Connections)

You may also refresh your data automatically by specifying a refresh rate in seconds, minutes, or hours. A progress bar at the top of the page indicates the refresh progress.

Map view

The Map view within the monitoring page displays the cloudlets where your app instances are deployed and their locations. The map updates as you filter by region (for example, US or EU).

Cloudlet location

Event information can also be viewed from the Monitoring page. For more information about the different events, see Event Logs.

Monitor events

Event Logs

Historical activities performed by you and others within your organization are logged and can be viewed from the Edge-Cloud Console. Logs are used for diagnostics and error correction and are displayed by date and time. Event logs provide valuable information if you require assistance from MobiledgeX support teams. To forward Event Logs to MobiledgeX, copy the traceid and email it to [email protected]. Other log types are also available; see below.

  • Audit Logs: Logs user activities such as logging in, creating applications, deleting users, creating policies, etc.
  • Event Logs: These are system-generated events and can include services such as auto-provision policy, auto-scaling, application instance, or HA, where our platform will trigger events based on these user policies.
  • Usage Logs: These logs report the status (online or offline) of clusters, application instances, and Cloudlets, as well as maintenance status.

To access the different log types, select the Log icon at the top of the page. The three log types are then displayed for you to choose from.

Log icon

Log menu

Health Check

The MobiledgeX Platform provides a Health Check function that manages autoscaling and failover of applications. The Health Check periodically tests specified ports ensuring that applications are responding correctly and available for service requests.

When creating an application with the Console, mcctl utility, or directly via the API, a Health Check on a per-port and per-protocol basis may be added. It is vital to ensure that the application instance backend is listening and capable of responding on all ports that have Health Check enabled. Otherwise, the Health Check process will report a failure condition when the port is tested.

The current status of the application instance(s) is updated based on the results of the Health Check. The status of application instances may be viewed via the Console, the mcctl utility, or the API directly. Health Check is enabled by default on the Create Apps page.

Health Check types

Two types of Health Checks are available:

  • Non-port specific check: This Health Check verifies the root Load-Balancer (rootLB) is alive and can forward requests to the application instance.
  • Per-port, per-protocol check: The system opens a socket connection to the backend application on the port that is specified for the application instance.
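
Conceptually, the per-port check amounts to opening a TCP connection to the backend and reporting whether it succeeds. The following bash sketch illustrates the idea only; it is not the platform's implementation, and the host and port values are placeholders.

#!/usr/bin/env bash
# Illustration only: a one-shot TCP probe in the spirit of the per-port,
# per-protocol Health Check. HOST and PORT are placeholders.
HOST="app-backend.example.com"
PORT=8080

# /dev/tcp is a bash feature: the redirection succeeds only if a TCP
# connection to HOST:PORT can be established within the timeout.
if timeout 5 bash -c "exec 3<>/dev/tcp/${HOST}/${PORT}" 2>/dev/null; then
  echo "${HOST}:${PORT} is accepting TCP connections"
else
  echo "${HOST}:${PORT} is NOT accepting TCP connections"
fi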

Health Check status

The Health Check process will return one of four status values:

  • HealthCheckOk: The check returned without issues.
  • HealthCheckFailRootlbOffline: The application instance is unreachable because the rootLB for this application is not accessible.
  • HealthCheckFailServerFail: The application instance is not responding. This state indicates the application has either crashed or has exited unexpectedly. Also, this status may indicate a problem with the application instance and should be investigated further.
  • HealthCheckUnknown: This is the default state at the initial startup for application instances. If this value persists, it may indicate a deployment problem with the application.

When Health Check is enabled on multiple ports of an application instance, the application instance will be marked as unhealthy if any of the associated ports fails the Health Check.

Note: A single status will be returned for the combined ports despite a failure occurring on only one of the ports. The health of applications associated with multiple ports is dependent on the health of all associated ports.

How Health Check failures are managed

The MobiledgeX Platform takes action on any health check statuses other than HealthCheckOk. Currently, the actions taken are as follows:

  • Application instances that fail any health check will be removed from the list of viable backends that are returned from the Distributed Matching Engine.
  • Application instances that fail a health check can trigger auto-provisioning policies, once the minimum number of instances is no longer satisfied.

Testing detail

MobiledgeX has tested different failed-state scenarios to ensure that the Health Check feature performs as expected.

Test Scenario 1: HealthCheckFailRootlbOffline status. MobiledgeX simulated a VM issue with a platform service, or a network issue that made the platform unavailable, by shutting down the VM hosting the rootLB Envoy proxy. The VM was then restarted, and it was verified that the VM returned to full operational status and remained in a healthy state.

Test Scenario 2: HealthCheckFailServerFail status. MobiledgeX simulated a fault with the backend application by scaling a Kubernetes-based application down to zero replicas, or by stopping the container of a Docker-based application. The application was then brought back up (scaled to one or more pods for Kubernetes), and it was verified that the application returned to a healthy state.

Note: A test scenario was not performed for the HealthCheckUnknown status. HealthCheckUnknown is a transient state that displays momentarily during the initial startup of application instances. However, if this value persists, issues may exist, and further investigation may be necessary.

Troubleshooting failed Health statuses

In the event of a failed Health Check status, it is recommended to validate the following:

  • The backend process is listening on the defined ports.
  • The defined ports can be reached from within the application instance.
  • The defined ports are reachable from the internet.

If either of the first two cases fails, troubleshooting of the application should be initiated. If the first two cases pass, but the third fails, a support ticket with MobiledgeX must be opened for technical troubleshooting.
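
Assuming standard Linux tooling such as ss and nc (netcat) is available, the first two checks can be sketched as follows; the FQDN and port below are placeholders for your application instance's address and a port with Health Check enabled.

# Placeholders: replace with your application instance's FQDN and port.
APPINST_FQDN="myapp.mycluster.mycloudlet.example.net"
PORT=8080

# From inside the application instance: is the backend process listening?
ss -ltn | grep ":${PORT}"

# From inside the instance: can the port be reached locally?
nc -zv localhost "${PORT}"

# From a machine on the internet: is the port reachable externally?
nc -zv -w 5 "${APPINST_FQDN}" "${PORT}"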

If the connection to the backend drops, you can re-initiate a FindCloudlet call to retrieve the IP address of a working backend to connect your application.

Health Check limitations

As of the current release, the Health Check process has the following limitations, which future releases will address:

  • Only TCP checks are supported; UDP support is actively being developed.
  • The Alerting framework does not support external notifications on Health Check status changes.
  • Occasionally, a healthy application may generate a HealthCheckUnknown status.

When to disable Health Check

There is one case where disabling Health Check is necessary. If an application does not always listen on a specified port (for instance, the port is only opened when a certain condition in the application backend is satisfied), Health Check should be disabled on that particular port. Otherwise, the Health Check mechanism will try to connect to it and will generate a HealthCheckFailServerFail status for that application instance.

Alarms

Within MobiledgeX’s platform, an alarm is triggered by abnormal system behavior (an event) or unexpected result.

Alarms are classified into one of four severity levels based on the nature and impact of the affected component's performance.

  • Critical: Requires immediate attention and reflects conditions that may affect an appliance's performance or signal the loss of a broad category of service. An example would be a network failure taking an entire cloudlet offline.
  • Major: Indicates conditions that should be addressed within 24 hours of the notification. An example would be an unexpected traffic class error.
  • Minor: Denotes performance that may be addressed at your convenience. An example would be a user that has not changed their account’s default password or a degraded disk.
  • Warning: Signifies conditions that may develop into an issue over time. For example, a software version mismatch.

Alerts

Alerts provide notifications of alarms that indicate irregular performance so that issues can be proactively mitigated. For all Critical and Major alerts, a notification will be sent to the user, either through Slack or email, depending on the preferred delivery method configured by the user. When the issue or condition is resolved, an additional notification is sent to the user indicating that the issue has been fixed.

AlertManager

The AlertManager is a global component of the MobiledgeX product and is responsible for distributing alerts to application owners. Alarms are consolidated at the regional level, where each regional controller receives alarms via a notification.

The image below illustrates the AlertManager workflow. A user can create an alert receiver and set up their preferred notification method through the Edge-Cloud Console. Once an alert receiver is created, the receiver is pushed to the MobiledgeX Platform. When an alarm is triggered, the AlertManager within the platform sends an alert notification to the user for mitigation. Currently, alert notifications are sent only for application instances that are down (AppInstDown).

Alert Receiver Workflow

Alert Management

The MobiledgeX platform provides a flexible alerting interface that includes the following:

  • RBAC support for users, roles, and organizations that controls access to alerts. Any user with the ability to view a resource that generates an alert can create or delete an alert receiver for that resource. However, since alerts are raised and cleared by the platform, users cannot create custom alerts.

  • Flexibility to manage the delivery of alerts to different “alert receivers” based on user configuration. We currently support the delivery of alerts to your Slack or email account.

Alert Receiver Types

Alerts may be generated from multiple components within the environment, such as app instances or clusters. For example, an alert notification will be sent if an application instance goes down, or if the health check detects anomalies for a particular application.

There are several different alert receivers you can set up to receive a notification about your application instance. For example, to receive notifications about a specific application instance, you can specify the appname, app-org, and appvers. You can also monitor all of the application instances associated with a particular application across all cloudlets by, again, specifying the appname, app-org, and appvers.

To receive notification about all the application instances that are running on a particular cluster, specify cluster and cluster-org.

Here's an example of what an alert receiver setup may look like for an application instance:

name: DevOrgReceiver2
type: email
severity: errors
user: mexadmin
email: [email protected]
appinst:
  appkey:
    organization: DevOrg
    name: DevOrg SDK Demo
    version: "1.0"
  clusterinstkey:
    clusterkey:
      name: AppCluster
    cloudletkey:
      organization: mexdev
      name: localtest
    organization: DevOrg

Alert Receiver and MobiledgeX APIs

Alert Receivers are designed to be configurable via the MobiledgeX APIs, directly and through the mcctl utility program, providing flexibility for users integrating with their existing monitoring systems.

Action                      API Route
Create an Alert Receiver    api/v1/auth/alertreceiver/create
Delete an Alert Receiver    api/v1/auth/alertreceiver/delete
Show all Alert Receivers    api/v1/auth/alertreceiver/show

For detailed AlertReceiver API examples, please refer to the MCCTL Reference Guide.
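
As an illustration, an alert receiver could be created against the documented create route with curl, as sketched below. The JSON field names simply mirror the YAML receiver example shown earlier, the JWT token and email address are placeholders, and the exact request schema should be verified against the MCCTL Reference Guide.

# Hypothetical sketch: create an alert receiver via the documented API route.
# Field names mirror the YAML example above; verify them against the API docs.
TOKEN="<your-JWT-token>"

curl -s -XPOST 'https://console.mobiledgex.net/api/v1/auth/alertreceiver/create' \
  -H "Authorization: Bearer ${TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{
        "name": "DevOrgReceiver2",
        "type": "email",
        "severity": "errors",
        "email": "<your-email-address>",
        "appinst": {
          "appkey": {
            "organization": "DevOrg",
            "name": "DevOrg SDK Demo",
            "version": "1.0"
          }
        }
      }'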

To set up alert receivers and notification methods through the console

While you can use the mcctl tool and the commands provided to set up your alerts and notification preferences, we recommend using the Edge-Cloud Console to set up your alert receivers for ease-of-use.

  1. Navigate to the Alert Receivers sub-menu and click the + plus sign. The Create Receiver screen opens.

Create Alert Receiver screen

  2. Additional fields appear depending on your selections. Populate all the required fields.

Additional Alert Receiver fields

  3. Your new Alert Receiver will appear on the Alert Receivers page.

Alert Receiver screen

When you click the Alert icon, information about the alert is displayed.

Information about Alerts

Storing Metrics

The MobiledgeX platform provides the ability to retrieve metrics on your applications and clusters via both the Web Console and the MobiledgeX API. MobiledgeX controls the granularity and retention policy for these metrics. If you want more control over your metrics, you can write an ETL pipeline to move the metrics that you are interested in into your own Time Series Database (TSDB).

InfluxDB Example

This example uses InfluxDB as a TSDB to store application metric data.

Exclusions

The example script provided is not suited for production use and is intended solely as a proof of concept. Please also be aware of the following limitations of the script:

  1. Samples are taken at 10-second intervals.
  2. Metrics being sampled are CPU, disk, and memory.
  3. The MobiledgeX-provided timestamp is not being used; instead, InfluxDB creates a timestamp for us.
  4. The InfluxDB installation being used has no security enabled.
  5. The script assumes you have logged into the MobiledgeX console with mcctl and have an active JWT token.
  6. The script relies on data being returned in json format from the mcctl utility.
  7. Please see the script header for additional information.

Assumptions

  • You have the mcctl utility installed.
  • You have an account with access to the application you wish to monitor.
  • You have an InfluxDB installation with a database named mex.
  • You can read/write to/from the InfluxDB database.

Script Flow

The script flow is very simple:

  1. Pull data from the MobiledgeX API.
  2. Transform the data into the InfluxDB line protocol.
  3. Post the data to InfluxDB using cURL.
  4. Sleep for 10 seconds.
  5. Return to step 1.
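
Expressed as a bash skeleton, the loop looks like the following; the three function names are placeholders that are implemented by the concrete mcctl, jq/awk, and curl commands in the sections that follow.

# Skeleton only: the placeholder functions correspond to steps 1-3 above and
# are implemented by the mcctl, jq/awk, and curl commands shown below.
while true; do
  raw_json=$(pull_from_mobiledgex_api)                  # step 1: mcctl
  line_data=$(transform_to_line_protocol "$raw_json")   # step 2: jq + awk
  post_to_influxdb "$line_data"                         # step 3: curl
  sleep 10                                              # step 4, then repeat
done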

Pulling Data

The mcctl command is used to pull data from the MobiledgeX API.

mcctl --addr https://console.mobiledgex.net --output-format json metrics app \
  region=$REGION app-org=$APPORG appname=$APPNAME appvers=$APPVER last=1

You will need to replace REGION, APPORG, APPNAME, and APPVER with the data that corresponds to the application you wish to monitor. The use of last=1 restricts the data returned to the most recently collected metrics. This can be omitted, in which case the API will return multiple rows (unique by timestamp). You can also specify start and end times for metrics. For this example, we will just be using the last collected set of metrics.

Return Format

The data from the above will be returned in json format, and will be presented as follows:


{ "data": [ { "Series": [ { "columns": [ "time", "app", "ver", "cluster", "clusterorg", "cloudlet", "cloudletorg", "apporg", "pod", "cpu" ], "name": "appinst-cpu", "values": [ [ "2020-08-11T14:51:54.687583518Z", "compose-file-test", "10", "autoclustercompose-file-test", "demoorg", "hamburg-main", "TDG", "demoorg", "compose-file-test", 0 ] ] } ] } ] }

The structure is as follows:

  • data: This is the top-level key that all returned data will be presented beneath.
    • Series: This is the level below data and contains information on the metrics you have requested.
      • columns: An array of the columns that are being presented. This occurs once in the series.
      • name: The name of the metric being returned. This can occur several times in the series, depending on the metrics selected.
      • values: An array of the values that correspond to the keys specified in the columns section. This can occur several times in the series, depending on the time/intervals selected.

Converting Data to Line Format

To load this data into a TSDB we will need to transform it into a format that the DB understands. For our example, we will be changing this data into InfluxDB's Line Protocol. To do this, we will need to parse the JSON output. To accomplish this, we will be using the jq utility, along with awk. This could also be accomplished using other JSON and text processing tools if desired.

Note: This document is not intended to guide the usage of jq. The example presented here has been tested and works correctly with the MobiledgeX API's JSON output. This particular example is parsing memory information.

Line Protocol Definition

In its simplest form, the line protocol provides the name of the metric (measurement), an optional list of tag key/value pairs, one or more field key/value pairs, and an optional timestamp. The syntax is defined as:

<measurement>[,<tag_key>=<tag_value>[,<tag_key>=<tag_value>]] <field_key>=<field_value>[,<field_key>=<field_value>] [<timestamp>]  

For our purposes we will be constructing a very basic data payload. The following is an example of what that payload will look like for the memory metric:

mem.app=compose-file-test,ver=10 mem="1990197"  
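
For reference, a hypothetical line that exercises every part of the syntax, with multiple tags, multiple fields, and the optional trailing timestamp (Unix epoch nanoseconds), would look like this:

cpu,host=server01,region=eu-central usage_user=0.64,usage_system=0.22 1597096338602015000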

Conversion

We will use the jq utility to convert our data; the following line will take as input the data returned from the MobiledgeX API and parse the JSON to prepare it for final transformation:

jq -r '.data[0].Series[0] | (.columns | map(.)) as $headers | .values |
  map(. as $row | $headers | with_entries({"key": .value, "value": $row[.key]})) |
  {measurement: "mem", mem: .[].mem | tostring, app: .[].app, ver: .[].ver, timestamp: .[].time } |
  to_entries | map(.value) | @csv'

Breaking down that command, we are doing the following:

  • Telling jq to provide the output in raw format (-r) so we can parse the output with awk.
  • Breaking the data into key/value pairs from the input data provided by the column array and array(s) of values (Lines 1-2).
  • Creating a new data object containing the measurement, application, version, timestamp, and metric value (Line 3).
  • Dumping the new data object to CSV output (Line 4).

This provides us with the following output:

"mem","1990197","compose-file-test","10","2020-08-11T15:15:59.135953533Z"  

The next step is finalizing the conversion. To do this we need to manipulate the data into the Line Protocol format. We will be using awk to complete the transformation:

awk -F, '{gsub("\"","",$0);printf("%s.app=%s,ver=%s mem=\"%s\"\n",$1,$3,$4,$2)}'  

Breaking down that command, we are doing the following:

  1. Using , as our separator character.
  2. Re-ordering the fields, stripping the quotes added by the CSV output, and formatting them into the line protocol string.

The final output to be sent to InfluxDB is:

mem.app=compose-file-test,ver=10 mem="1990197"  

Timestamps

The reason we are allowing the InfluxDB installation to generate a timestamp rather than using the value returned from the API is due to the way that the MobiledgeX API provides the timestamp, and the way that InfluxDB requires timestamps to be presented.

The MobiledgeX API provides timestamps in RFC3339 format, whereas InfluxDB wants the timestamps to be in Unix Epoch format. Although it is possible to convert between these two (for example, using the GNU date program), this has not been done in this POC script to keep the complexity low.
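
If you do want to carry the API timestamp through to InfluxDB, the conversion can be done with GNU date (the BSD date shipped with macOS does not support this syntax). The timestamp below is taken from the sample output earlier in this document:

# Convert an RFC3339 timestamp to Unix epoch nanoseconds with GNU date.
# InfluxDB's write endpoint expects nanosecond precision by default.
TS="2020-08-11T14:51:54.687583518Z"
date -d "$TS" +%s%N
# Prints 1597157514687583518, which could be appended as the optional
# trailing timestamp in the line protocol payload.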

Loading Data to InfluxDB

The InfluxDB API can be used to load the processed data into InfluxDB. The format for inserting data into InfluxDB using curl is:

curl -i -XPOST 'http://localhost:8086/write?db=mex' \
  --data-binary 'measurement-name.tag1=value1,tag2=value2 value=123 1434055562000000000'

Breaking down the command, we are doing the following:

  • Issuing a POST to the server listening on port 8086 on the localhost.
  • Using the --data-binary flag, which enables us to pass data without it being interpreted.
  • The -i flag shows us the return headers from the server (useful in debugging).
  • The string passed conforms to the syntax described above under "Line Protocol".

For this test, we are going to be inserting the following data:

mem.app=compose-file-test,ver=10 mem="1990197"

To do this, we can write the following cURL command:

$ curl -i -XPOST 'http://localhost:8086/write?db=mex' --data-binary  'mem.app=compose-file-test,ver=10 mem="1990197"'
HTTP/1.1 204 No Content
Content-Type: application/json
Request-Id: de38aed6-dc1d-11ea-8002-acde48001122
X-Influxdb-Build: OSS
X-Influxdb-Version: v1.8.1
X-Request-Id: de38aed6-dc1d-11ea-8002-acde48001122
Date: Tue, 11 Aug 2020 21:59:02 GMT  

The 204 return code indicates that the data was accepted.

Verification

There are several ways to verify the data being added to InfluxDB. Visualization tools such as Grafana or Chronograf can be used, as can the influx CLI utility. For this example, we are going to use the CLI.

$ influx
Connected to http://localhost:8086 version v1.8.1
InfluxDB shell version: v1.8.1
> use mex;
Using database mex
> SELECT * FROM "mex"."autogen"."mem.app=compose-file-test" WHERE  "ver"='10' limit 1;
name: mem.app=compose-file-test
time                mem     ver
----                ---     ---
1597096338602015000 1990197 10
>  

Putting it Together

The following script uses all of the components that have been discussed in this document. Again, please note that this is intended as a proof of concept demonstration only and is not intended for production usage.


#!/usr/bin/env bash
###########################################################################
#
# This is a simple shell script to show the process of pulling data from the MeX
# metrics API endpoint and pushing them into a local influxdb data store.
#
# This script is intended as a demonstration of how this process can be
# accomplished. This is not intended to be a script that can be productionized
# without major rewriting.
#
# This script makes the following assumptions:
# 1. You are able to use the `mcctl` program to access the MeX API.
# 2. You have authenticated the `mcctl` program and saved an access token;
#    this script does not authenticate.
# 3. You have an influxdb server running on the standard port (8086).
# 4. There is no security on the influxdb database.
# 5. You have an existing database called `mex` without security.
#
# The script performs the following tasks:
# 1. Connects to the api and pulls the most recent update for the given metric.
# 2. Transforms the returned data using `jq` and `awk` to create influxdb line
#    protocol compatible output.
# 3. Writes the resulting data into the influxdb `mex` database using `curl`.
#
# Notes:
# 1. Influxdb does not accept RFC3339 formatted dates as returned by the MeX API;
#    because of this the example allows influxdb to generate a timestamp. In an
#    actual production implementation you would want to use the MeX provided
#    timestamp, which can be converted to epoch time using either the GNU `date`
#    command, or programmatically.
#
###########################################################################

# General Variables
MCCTL=/usr/local/bin/mcctl
JQ=/usr/local/bin/jq
INFLUXDB=mex
INFLUXURI=http://localhost:8086

# MeX Vars
APPNAME=compose-file-test
APPVER="1.0"
APPORG=demoorg
REGION=EU
CONSOLE="https://console.mobiledgex.net"
MCCTLCONS="$MCCTL --addr $CONSOLE --output-format json metrics app region=$REGION app-org=$APPORG appname=$APPNAME appvers=$APPVER last=1"

# cURL command used to post line protocol data (read from stdin) to InfluxDB
CURLC="curl -X POST -d @- $INFLUXURI/write?db=$INFLUXDB"

# CPU
$MCCTLCONS selector=cpu |
  $JQ -r '.data[0].Series[0] | (.columns | map(.)) as $headers | .values | map(. as $row | $headers | with_entries({"key": .value, "value": $row[.key]})) | {measurement: "cpu", cpu: .[].cpu | tostring, app: .[].app, ver: .[].ver, timestamp: .[].time } | to_entries | map(.value) | @csv' |
  awk -F, '{gsub("\"","",$0);printf("%s.app=%s,ver=%s cpu=\"%s\"\n",$1,$3,$4,$2)}' |
  $CURLC

# MEM
$MCCTLCONS selector=mem |
  $JQ -r '.data[0].Series[0] | (.columns | map(.)) as $headers | .values | map(. as $row | $headers | with_entries({"key": .value, "value": $row[.key]})) | {measurement: "mem", mem: .[].mem | tostring, app: .[].app, ver: .[].ver, timestamp: .[].time } | to_entries | map(.value) | @csv' |
  awk -F, '{gsub("\"","",$0);printf("%s.app=%s,ver=%s mem=\"%s\"\n",$1,$3,$4,$2)}' |
  $CURLC

# NET (received bytes)
$MCCTLCONS selector=network |
  $JQ -r '.data[0].Series[0] | (.columns | map(.)) as $headers | .values | map(. as $row | $headers | with_entries({"key": .value, "value": $row[.key]})) | {measurement: "recvBytes", recvBytes: .[].recvBytes | tostring, app: .[].app, ver: .[].ver, timestamp: .[].time } | to_entries | map(.value) | @csv' |
  awk -F, '{gsub("\"","",$0);printf("%s.app=%s,ver=%s recvBytes=\"%s\"\n",$1,$3,$4,$2)}' |
  $CURLC

# NET (sent bytes)
$MCCTLCONS selector=network |
  $JQ -r '.data[0].Series[0] | (.columns | map(.)) as $headers | .values | map(. as $row | $headers | with_entries({"key": .value, "value": $row[.key]})) | {measurement: "sendBytes", sendBytes: .[].sendBytes | tostring, app: .[].app, ver: .[].ver, timestamp: .[].time } | to_entries | map(.value) | @csv' |
  awk -F, '{gsub("\"","",$0);printf("%s.app=%s,ver=%s sendBytes=\"%s\"\n",$1,$3,$4,$2)}' |
  $CURLC

# Disk
$MCCTLCONS selector=disk |
  $JQ -r '.data[0].Series[0] | (.columns | map(.)) as $headers | .values | map(. as $row | $headers | with_entries({"key": .value, "value": $row[.key]})) | {measurement: "disk", disk: .[].disk | tostring, app: .[].app, ver: .[].ver, timestamp: .[].time } | to_entries | map(.value) | @csv' |
  awk -F, '{gsub("\"","",$0);printf("%s.app=%s,ver=%s disk=\"%s\"\n",$1,$3,$4,$2)}' |
  $CURLC

Other Datastores

The same techniques shown here can be used to write data from the MobiledgeX metrics API to any other datastore, provided you can create an ETL pipeline to load data into your datastore of choice.

Contact support

If you have reviewed our documentation set and FAQ page and are unable to find an answer to your question, you can contact our Support Team.

You can also email the Support Team to assist you in resolving product issues. To help expedite your request, make sure you copy and paste the traceid, which can be found on the audit logs page, into your email with a brief description of your issue.

Where to go from here