System Dashboards

System Menu

Precondition: The System menu is visible only to users with the Administrator role. To open the System dashboards menu, select System from the main menu.

Figure 1: System dashboards menu

System Health

To open the System Health dashboard select:

Figure 2: System Health dashboard

The System Health menu has the following sections:

Alerting Status

The Alerting Status panel is shown at the top of the dashboard.

Figure 3: System Health dashboard - Alerting status

This is a list of all heartbeat alerts and their current status. The alerts are configured based on the heartbeat charts. All heartbeat instruments are also configured to show the alerting status.

Alerting status can be one of the following:

On the heartbeat tables the status is binary:

ANALYSER STATUS

Expand the ANALYSER STATUS panel.

Figure 4: System Health dashboard - Analyser status

The execution table shows a list of all analyzers currently running in the system and their status calculated according to their last execution time. Each analyzer is responsible for a specific group of tasks and runs asynchronously on a configured time interval.
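The way a status can be derived from the last execution time is sketched below. This is only an illustration: the status labels (`OK`, `LATE`), the tolerance factor, and the function name are assumptions, and the rule the dashboard actually applies may differ.

```python
from datetime import datetime, timedelta, timezone


def analyzer_status(last_execution, interval, now, tolerance=2.0):
    """Return a hypothetical status label from the last execution time.

    An analyzer is treated as late when more than `tolerance` intervals
    have passed since it last ran. Labels and tolerance are illustrative.
    """
    elapsed = now - last_execution
    return "OK" if elapsed <= interval * tolerance else "LATE"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
interval = timedelta(minutes=5)
print(analyzer_status(now - timedelta(minutes=4), interval, now))   # OK
print(analyzer_status(now - timedelta(minutes=30), interval, now))  # LATE
```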

HEARTBEAT STATUS

Expand the HEARTBEAT STATUS panel.

Figure 5: System Health dashboard - Heartbeat status graphics

Linux and Windows Gateways

The list of graphics includes:

Figure 6: System Health - Heartbeat status - Gateway status

The selected time period is Last 1 hour (UTC).

DB SYNC Timer

Figure 7: System Health - Heartbeat status - DB SYNC timer

This is an asynchronous, timer-based function in Azure that performs data synchronization and similar tasks. It also sends heartbeats to the back-end services to check that the cloud components work correctly, independently of the gateways on the client premises.

Linux and Windows adapters

Figure 8: System Health - Heartbeat status - LINUX adapters

Figure 9: System Health - Heartbeat status - Windows adapters

For each gateway and gateway module, there is a dedicated chart showing the history of the number of heartbeats sent by it.

The time period is defined by the selected dashboard period.

For a gateway module, the number of heartbeats sent at each interval corresponds to the number of machines that depend on it. For a gateway, this number is always 1 (the gateway itself).

System notifications are configured on these charts and automatically report a heartbeat loss.
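As a rough illustration of the rule above, a heartbeat loss can be detected by comparing the count received in each interval with the expected count. The function name and the sample data are hypothetical; the notification rules configured in the system are not described here.

```python
def find_heartbeat_loss(counts, expected):
    """Return indexes of intervals where fewer heartbeats arrived than expected.

    `counts` is the number of heartbeats per interval; `expected` is the
    number of machines that depend on the module (1 for a gateway).
    """
    return [i for i, count in enumerate(counts) if count < expected]


# A gateway module serving 3 machines: intervals 2 and 3 are short.
module_counts = [3, 3, 2, 0, 3]
print(find_heartbeat_loss(module_counts, expected=3))  # [2, 3]

# A gateway always reports exactly one heartbeat per interval.
gateway_counts = [1, 1, 1]
print(find_heartbeat_loss(gateway_counts, expected=1))  # []
```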

TELEMETRY

Expand the TELEMETRY panel.

PowerBI Embedded Capacity ON

Figure 10: System Health - Telemetry - PowerBI Embedded Capacity

IoT Hub Data Usage (1 Minute)

Figure 11: System Health - Telemetry - IoT Hub data usage

This chart shows:

IoT Hub Telemetry (1 Minute)

Figure 12: System Health - Telemetry - IoT Hub telemetry

External Sensors Calculated Tags Counts

Figure 13: System Health - Telemetry - External sensors calculated tags counts

Events Ingested per Gateway (stacked)

Figure 14: System Health - Telemetry - Events ingested per gateway

This chart shows:

Linux MTConnect Data

Figure 15: System Health - Telemetry - Linux MTConnect

This chart shows Edge MTConnect data for the selected time period.

Windows MTConnect Data

Figure 16: System Health - Telemetry - Windows MTConnect

LOG STATISTICS

This section shows the information on the system logs for the selected period in the dashboard. It contains the following instruments:

Errors and Exceptions

Figure 17: System Health - Log Statistics - Error and Exceptions

This graphic shows the number of errors across system components for the chosen period. Every error should be investigated and mitigated.

Traces

Figure 18: System Health - Log Statistics - Traces

This graphic shows the number of system log messages (debug, informational and warning) across system components for the chosen period. These numbers are informative only; the main purpose of this instrument is to give an overview of how much data the system writes to the log database. If a component starts raising too many traces, it should be changed to reduce storage costs.

Errors and Exceptions List

Figure 19: System Health - Log Statistics - Error and exceptions list

The errors and exceptions list is a table showing the errors and exceptions for the chosen period. Each row shows the timestamp when the error or exception occurred, the generated message, and the resource that generated the message.

Warnings List

The warnings list is a table showing the warnings for the chosen period. Each row shows the timestamp when the warning occurred, the generated message, and the resource that generated the message.

REQUESTS & DEPENDENCIES

This section shows information about requests made from, to and between different system components. A dependency is an external component that is called by a specific component. It's typically a service called using HTTP, or a database, or a file system.

The section contains the following instruments:

Requests Duration

Figure 20: System Health - Requests & Dependencies - Request duration

This chart shows the duration of requests to specific URLs for the chosen period. These statistics usually represent the response duration of parts of our internal/integrational APIs, or other similar HTTP endpoints.

The chart is provided with a summary table for each shown metric where we can find the minimum, maximum and average value of each series shown on the chart.
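The summary table can be thought of as a per-series reduction. The sketch below shows one way to compute it; the URLs and durations are invented sample data, and the function name is hypothetical.

```python
from statistics import mean


def summarize(series):
    """Return min, max and average for each named series."""
    return {
        name: {"min": min(values), "max": max(values), "avg": mean(values)}
        for name, values in series.items()
    }


# Hypothetical request durations in milliseconds, keyed by URL.
durations_ms = {
    "/api/machines": [120, 80, 100],
    "/api/tags": [40, 60],
}
for url, row in summarize(durations_ms).items():
    print(url, row)
```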

Requests Failed

Figure 21: System Health - Requests & Dependencies - Requests failed

This chart shows the number of failed requests to specific URLs for the chosen period. These statistics usually represent the count of failed responses at a specific moment for parts of our internal/integrational APIs, or other similar HTTP endpoints. The chart is provided with a summary table for each shown metric where we can find the minimum, maximum and average value of each series shown on the chart.

Dependencies Duration by Target

Figure 22: System Health - Requests & Dependencies - Requests duration by target

This chart shows the duration of calls to specific dependency URLs for the chosen period. The URLs are usually, but not always, external APIs not managed within our platform. These statistics usually represent the response duration of parts of those APIs, or other similar HTTP endpoints.

The chart data is grouped by the endpoint being called.

The chart is provided with a summary table for each shown metric where we can find the minimum, maximum and average value of each series shown on the chart.

Dependencies Duration by Type

Figure 23: System Health - Requests & Dependencies - Requests duration by type

This chart is the same as the Dependencies Duration by Target. The difference is that endpoints are grouped by Type (HTTP, SQL, IoT Hub, and others).

The chart is provided with a summary table for each shown metric where we can find the minimum, maximum and average value of each series shown on the chart.
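The difference between grouping by target and grouping by type can be illustrated with a small sketch. The record fields (`target`, `type`, `duration_ms`) and the sample endpoints are assumptions made for the example, not the actual telemetry schema.

```python
from collections import defaultdict


def average_duration_by(calls, key):
    """Group dependency calls by `key` and average their durations."""
    groups = defaultdict(list)
    for call in calls:
        groups[call[key]].append(call["duration_ms"])
    return {name: sum(vals) / len(vals) for name, vals in groups.items()}


# Hypothetical dependency calls; field names are illustrative only.
calls = [
    {"target": "api.example.com", "type": "HTTP", "duration_ms": 200},
    {"target": "auth.example.com", "type": "HTTP", "duration_ms": 100},
    {"target": "ReportingDB", "type": "SQL", "duration_ms": 30},
]
print(average_duration_by(calls, "target"))  # one entry per endpoint
print(average_duration_by(calls, "type"))    # {'HTTP': 150.0, 'SQL': 30.0}
```

Grouping by type merges both HTTP endpoints into a single series, which is why the by-type chart shows fewer, broader lines than the by-target chart.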

PERFORMANCE

SQL Server Load (DTU)

Figure 24: System Health - Performance - SQL Server load (DTU)

This graphic shows the load on the SQL Server databases, in DTUs used, for the chosen period. A DTU (Database Transaction Unit) is an Azure SQL specific measure of how many resources a database is using.

SQL Server Data Space Used

Figure 25: System Health - Performance - SQL Server data space used

This graphic shows the percentage of SQL Server Data Space used for ConfigDB and ReportingDB during the selected time period.

SQL Server Successful Connections

Figure 26: System Health - Performance - SQL Server Successful connections

This graphic shows the number of successful SQL Server connections for the selected time interval.

SQL Server Failed Connections

Figure 27: System Health - Performance - SQL Server Failed connections

This graphic shows the number of failed SQL Server connections for the selected time interval.

Service Bus Queues

Figure 28: System Health - Performance - Service bus queues

This chart shows the number of messages in service bus queues across time for the chosen period in the dashboard.

Active queues are those that constantly receive messages, which different system components retrieve and process. If the number of messages rises too much, some component is processing them slowly. If the number rises constantly, some component is not running at all.

Dead-lettered queues correspond to specific active queues and contain messages whose processing has failed a certain number of times. These queues should always contain zero items. If there are any messages here, this indicates a problem that should be mitigated.

The chart is provided with a summary table for each shown metric where we can find the minimum, maximum, average and current number of messages within each queue for the chosen period.
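The two warning signs described for queues can be expressed as simple checks. This is a sketch only: the function name is hypothetical, and "rises at every sample" is an illustrative heuristic rather than the threshold used by the system.

```python
def check_queue(name, depths, dead_letter_count):
    """Return a list of warnings for one queue.

    `depths` is the message count sampled over time, oldest first.
    The 'rises at every sample' rule is an illustrative heuristic.
    """
    warnings = []
    if dead_letter_count > 0:
        warnings.append(f"{name}: {dead_letter_count} dead-lettered message(s)")
    rising = len(depths) > 1 and all(b > a for a, b in zip(depths, depths[1:]))
    if rising:
        warnings.append(f"{name}: depth rises constantly, consumer may be stopped")
    return warnings


print(check_queue("telemetry", [10, 25, 60, 140], dead_letter_count=0))
print(check_queue("alerts", [5, 3, 4, 2], dead_letter_count=2))
```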

Persister Server Metrics

Figure 29: System Health - Performance - Persister Server metrics

Persister server metrics are custom metrics that measure how fast our persister server is processing and persisting telemetry incoming from devices via the IoT Hub.

These metrics are turned off by default and are turned on only when measurements are needed.

The chart is provided with a summary table for each shown metric where we can find the minimum, maximum and average value of each series shown on the chart.

VM ACTIVITY

This section shows the information on the performance and important event logs of the few virtual machines in the system for the selected period in the dashboard.

Figure 30: System Health - VM Activity

Influx Application Logs Pie Chart & List

Figure 31: System Health - VM Activity - Influx Application logs

This pie chart counts the most important event log entries from the Event Log of the Time series DB Server virtual machine for the chosen period in the dashboard. Two Event Log collections are monitored for warnings and errors: Application logs and System logs. If any count is unusually high, the entries should be reviewed, and any problem found should be mitigated.

Figure 32: System Health - VM Activity - Influx application logs messages

This table lists the most important event log entries from the Event Log of the Time series DB Server virtual machine for the chosen period in the dashboard. The same two Event Log collections are monitored for warnings and errors. Each message should be carefully reviewed and all identified problems should be mitigated.

Grafana Application Logs Pie Chart & List

Figure 33: System Health - VM Activity - Grafana application logs

This pie chart counts the most important event log entries from the Event Log of the Visualization virtual machine for the chosen period in the dashboard. Two Event Log collections are monitored for warnings and errors: Application logs and System logs. If any count is unusually high, the entries should be reviewed, and any problem found should be mitigated.

Figure 34: System Health - VM Activity - Grafana application logs messages

This table lists the most important event log entries from the Event Log of the Visualization virtual machine for the chosen period in the dashboard. The same two Event Log collections are monitored for warnings and errors. Each message should be carefully reviewed and all identified problems should be mitigated.

CPU & Memory Usage

Figure 35: System Health - VM Activity - CPU & Memory usage

The charts show the CPU and memory usage of two virtual machines: the Influx DB Server VM and the Grafana VM.

All virtual machine metrics are evaluated and persisted once per minute.

SYSTEM INFORMATION

This section shows a table of all system components with the columns Group, Name, Version, and Reported Time.

Figure 36: System Health - System Information

InfluxDB Metrics

To open the InfluxDB Metrics dashboard select:

Figure 37: InfluxDB Metrics dashboard

HTTP Queries

Figure 38: InfluxDB Metrics - HTTP Queries

This graphic shows the number of queries executed for the selected time interval.

HTTP Errors

Figure 39: InfluxDB Metrics - HTTP Errors

The graphic shows the number of failed queries for the selected time interval. The errors are split into two types: server errors and client errors.

Points Read

Figure 40: InfluxDB Metrics - Points Read

The chart shows the number of points read from the database during the selected time interval.

Points Written

Figure 41: InfluxDB Metrics - Points Written

The chart shows the number of points written to the database over time. The data is split into two types: failed and successful writes. For a more focused view, the failed series is negated and shown towards the negative end of the Y-axis.
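Negating the failed series so that it plots below the axis can be sketched as follows. The function name and the sample counts are hypothetical.

```python
def to_plot_series(successful, failed):
    """Return (successful, negated failed) series for a mirrored chart."""
    return list(successful), [-count for count in failed]


ok, failed = to_plot_series([500, 480, 510], [0, 20, 5])
print(ok)      # [500, 480, 510]
print(failed)  # [0, -20, -5]
```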

HTTP Reads Duration (99th %)

Figure 42: InfluxDB Metrics - HTTP Reads duration (99th %)

The chart shows the duration, in nanoseconds, of the slowest 1% of all read queries for the selected time interval.

HTTP Writes Duration (99th %)

Figure 43: InfluxDB Metrics - HTTP Writes duration (99th %)

The chart shows the duration, in nanoseconds, of the slowest 1% of all write queries for the selected time interval.
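To make the 99th percentile and the nanosecond unit concrete, here is a minimal sketch using the nearest-rank method. The method is an assumption chosen for simplicity; the database may compute its percentile differently. The sample durations are invented.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def ns_to_ms(nanoseconds):
    """Convert nanoseconds to milliseconds for easier reading."""
    return nanoseconds / 1_000_000


# Hypothetical query durations in nanoseconds: 98 fast queries and 2 slow ones.
durations_ns = [2_000_000] * 98 + [50_000_000, 900_000_000]
p99 = percentile_nearest_rank(durations_ns, 99)
print(ns_to_ms(p99))  # 50.0
```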

Number of Series

Figure 44: InfluxDB Metrics - Number of series

The chart shows the number of series per database for the selected time interval.

Runtime

Figure 45: InfluxDB Metrics - Runtime

The chart shows the following runtime statistics over time:

Health Check

To open the Health Check dashboard select:

Purpose: The Health Check dashboard shows, in a table, the execution status of the HealthCheckAPI requests, which are executed every hour. The requests check some basic functions of the Notification server and the Persister server.

Figure 46: System - Health Check dashboard

Statuses from the tests can be:

Heartbeat History

To open the Heartbeat history dashboard select:

Purpose: On this dashboard, an administrator can search for missing heartbeats (2 or 3) in the selected time period. The results are grouped by adapter. Each machine sends a heartbeat every minute.

Figure 47: System - Heartbeat history dashboard

The first chart shows, for each adapter, the total downtime caused by interruptions. For example, FanucRobotAdapter has 4 min of interruptions in the last 3 hours (the time interval is selected in the upper right).

Figure 48: Heartbeat history - Chart Amount of downtime by adapter

For each adapter, a table shows when its interruptions occurred and which machines were affected:

Heartbeat Stopped, Time delta of interruption, Heartbeat Restored, and the list of affected machines.
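The table columns can be derived from a sorted list of heartbeat timestamps, as in the sketch below. It assumes one heartbeat per minute, as stated above; the `min_missing` threshold and the function name are illustrative assumptions, not the dashboard's actual query.

```python
from datetime import datetime, timedelta


def find_interruptions(timestamps, expected=timedelta(minutes=1), min_missing=2):
    """Return (stopped, delta, restored) for each gap in a heartbeat series.

    A gap is reported when at least `min_missing` expected heartbeats are
    absent between two consecutive timestamps. The threshold is illustrative.
    """
    gaps = []
    ordered = sorted(timestamps)
    for previous, current in zip(ordered, ordered[1:]):
        delta = current - previous
        missing = delta // expected - 1
        if missing >= min_missing:
            gaps.append((previous, delta, current))
    return gaps


start = datetime(2024, 1, 1, 8, 0)
# Heartbeats at minutes 0, 1, 2, then silence until minute 6.
beats = [start + timedelta(minutes=m) for m in (0, 1, 2, 6, 7)]
for stopped, delta, restored in find_interruptions(beats):
    print(stopped.time(), delta, restored.time())  # 08:02:00 0:04:00 08:06:00
```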

Figure 49: Heartbeat history - FanucRobotAdapter with interruptions and list of affected machines

Value Analysis

To open the Value Analysis dashboard select:

Purpose: The dashboard shows tag value statistics for the selected time interval (3). Statistics are shown after selecting a machine from the machine list (1) and a tag from the machine tags list (2).

The tag description is shown at the top: ID, Name, Display name, and Value type.

Tag Analytics are grouped in two panels: Simple and Advanced.

Figure 50: Value analysis - selected tag for machine

Expand the Simple analytics panel:

Figure 51: Value Analysis - Simple analytics - 1

Figure 52: Value Analysis - Simple analytics - 2

  1. Number of values - this panel shows the number of values, grouped by 1 h, for the given period.

Figure 53: Simple value analytics - Number of values

  2. Telemetry Stats - this panel lists the following statistics, selectable from the drop-down: Average time for values in sec, Elapsed time, Count of values, and Count distinct for the selected time interval.

Figure 54: Simple value analytics - Telemetry stats

  3. Most frequent values - this panel shows the most frequent values by day, grouped by 1 min, for the selected period.

Figure 55: Simple value analytics - Most frequent values

  4. Aggregations - this panel shows the common aggregations (Min, Max, Mean, Standard deviation, Coefficient of variation).
    • Coefficient of variation = Standard deviation/Average

Figure 56: Simple value analytics - Aggregations

  5. Distribution of values - this panel shows all values for the given tag and period, together with the standard deviation and coefficient of variation of the tag values.

Figure 57: Simple value analytics - Distributions of values

  6. Variations - this panel shows the common variations for the given series of values (Min, Max, Average, Moving Average, Exponential moving average).

Figure 58: Simple value analytics – Variations
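The aggregations and variations listed above can be reproduced with a few lines of code. The sketch below is illustrative: the use of the population standard deviation, the window size, and the smoothing factor `alpha` are assumptions, since the parameters the panels use are not stated here.

```python
from statistics import mean, pstdev


def aggregations(values):
    """Min, max, mean, standard deviation and coefficient of variation."""
    avg = mean(values)
    std = pstdev(values)  # population standard deviation (an assumption)
    return {"min": min(values), "max": max(values), "mean": avg,
            "std": std, "cv": std / avg}


def moving_average(values, window):
    """Simple moving average over a sliding window."""
    return [mean(values[i - window + 1:i + 1])
            for i in range(window - 1, len(values))]


def exponential_moving_average(values, alpha):
    """EMA where each point weights the newest value by `alpha`."""
    result = [values[0]]
    for value in values[1:]:
        result.append(alpha * value + (1 - alpha) * result[-1])
    return result


values = [2, 4, 4, 4, 5, 5, 7, 9]
print(aggregations(values))        # mean 5, std 2.0, cv 0.4
print(moving_average(values, 4))   # [3.5, 4.25, 4.5, 5.25, 6.5]
print(exponential_moving_average(values, 0.5))
```

Note that the coefficient of variation is undefined when the average is zero, so a real implementation would need to guard against that case.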

Expand the Advanced analytics panel:

Figure 59: Advanced analytics - 1

Figure 60: Advanced analytics - 2

  1. Calculated percentage of values that fall below a particular value - this panel calculates, for every percentile/quantile, the corresponding value from the series. The value at percentile X is greater than X% of the values in the series.

Figure 61: Advanced analytics - Calculated percentage of values that fall below a particular value

  2. Outliers - this panel shows values that lie at an abnormal distance from other values, either extremely high or extremely low.

Figure 62: Advanced analytics - Outliers

  3. Outlier trend - this panel shows the outlier trend for the selected period. The graphic shows the tag values together with lines for the high outlier and low outlier values.

Figure 63: Advanced analytics - Outliers trend

  4. Values changing in time - this panel shows the change between consecutive values (how each value changed relative to the value before it) for the given tag and period.

Figure 64: Advanced analytics - values changing in time

  5. Values Histogram - the histogram panel is a graphical representation of the distribution of numerical data, that is, the values of the specified tag. It groups values into buckets (sometimes also called bins) and then counts how many values fall into each bucket. The Min and Max values for the selected period are also shown.

Figure 65: Advanced analytics - Values histogram

  6. Values Heatmap - the heatmap panel is a graphical representation of the distribution of numerical data, that is, the values of the specified tag. It groups values into buckets (sometimes also called bins) and then counts how many values fall into each bucket. The heatmap is like a histogram over time, where each time slice represents its own histogram. Instead of using bar height to represent frequency, it uses cells and colors each cell in proportion to the number of values in the bucket.

Figure 66: Advanced analytics - Values heatmap
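The advanced panels rest on a few standard calculations, sketched below. The outlier limits use the interquartile range (IQR) rule, which is an assumed method; how the dashboard defines its high and low outlier lines is not stated here. All function names and the sample values are hypothetical.

```python
from statistics import quantiles


def percent_below(values, threshold):
    """Percentage of values strictly below `threshold`."""
    return 100 * sum(v < threshold for v in values) / len(values)


def outlier_bounds(values, k=1.5):
    """Low/high outlier limits using the IQR rule (an assumed method)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outliers(values, k=1.5):
    """Values that fall outside the low/high outlier limits."""
    low, high = outlier_bounds(values, k)
    return [v for v in values if v < low or v > high]


def deltas(values):
    """Change of each value relative to the one before it."""
    return [b - a for a, b in zip(values, values[1:])]


def histogram(values, bucket_size):
    """Count values per bucket; keys are the lower bucket edges."""
    counts = {}
    for v in values:
        edge = (v // bucket_size) * bucket_size
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))


values = [10, 12, 11, 13, 12, 11, 40]
print(percent_below(values, 13))
print(outliers(values))        # [40]
print(deltas(values))          # [2, -1, 2, -1, -1, 29]
print(histogram(values, 10))   # {10: 6, 40: 1}
```

A heatmap can be built from the same `histogram` function by applying it separately to the values of each time slice and coloring each bucket by its count.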