Alerts
An alert is a notification that a hardware problem has occurred, such as a critically low fan speed that causes the CPU temperature to rise.
Hardware Sentry defines a set of conditions that trigger alerts when failures are detected. These alerts are sent as OpenTelemetry logs from the Hardware Sentry Agent's internal OTLP Exporter to the OpenTelemetry Collector's internal OTLP Receiver.
Alert Content
The alerts report:
- the host's Fully Qualified Domain Name
- the resource's attributes
- the faulty component with its identifying information (Serial Number, Model, Manufacturer, BIOS Version, Driver Version, Physical Address)
- the parent dependency and its identifying information
- the alert severity (WARN, ALARM)
- the alert rule
- the date at which the alert is triggered
- the metric that triggered the alert
- the status information of the component
- the encountered problem, consequence and recommended action
- a complete hardware health report on the faulty component
Here is an example of an alert triggered by an unplugged cable on a network interface. This alert log has been captured using the OpenTelemetry Logging Exporter:
2022-04-21T14:37:57.034+0200 DEBUG loggingexporter/logging_exporter.go:81 ResourceLog #0
Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
Resource labels:
-> agent.host.name: STRING(hws.internal.sentrysoftware.net)
-> host.id: STRING(netapp9-san)
-> host.name: STRING(netapp9-san.internal.sentrysoftware.net)
-> host.type: STRING(storage)
-> os.type: STRING(storage)
-> site: STRING(data center 1)
ScopeLogs #0
ScopeLogs SchemaURL:
InstrumentationScope netapp9-san
LogRecord #0
Timestamp: 2022-04-21 12:37:47.201 +0000 UTC
Severity: WARN
Body: Hardware problem on netapp9-san.internal.sentrysoftware.net with 0c (FC Port).
Alert Severity : WARN
Alert Rule : hw.network.up == 0
Alert Details
=============
Problem : The network link is down.
Consequence : The network traffic (if any) that was processed by this adapter is no longer being handled, or is overloading another network adapter.
Recommended Action: Check that the network cable (if any) is not unplugged or broken/cut, and that it is properly plugged into the network card. Ensure that the network hub/switch/router is working properly.
Hardware Health Report (2022-04-21T14:37:47.201)
================================================
Monitor : 0c (FC Port)
Type : Network Card
On Host : netapp9-san.internal.sentrysoftware.net
Monitor ID : NetAppREST_networkcard_netapp9-san_netapp9-san-01.0c
Connector Used : NetAppREST
Parent ID : NetAppREST_enclosure_netapp9-san_netapp9-san-01
Physical Address : 50:0a:09:83:80:72:2b:36
This object is attached to: Enclosure: netapp9-san-01 (NetApp FAS2650)
Type : Enclosure
Manufacturer : NetApp
Model : FAS2650
Serial Number : 651652000067
=================================================================
Metric: hw.network.up
-----------------------------------------------------------------
Current Value : 0 (Unplugged)
=================================================================
Metric: hw.status{state="present", hw.type="network"}
-----------------------------------------------------------------
Current Value : 1 (Present)
Attributes:
-> agent.host.name: STRING(hws.internal.sentrysoftware.net)
-> host.id: STRING(netapp9-san)
-> host.name: STRING(netapp9-san.internal.sentrysoftware.net)
-> host.type: STRING(storage)
-> os.type: STRING(storage)
-> site: STRING(data center 1)
Trace ID:
Span ID:
Flags: 0
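If you need to post-process such a log record, the labeled lines of the body can be pulled out with a small parser. The following is only a sketch: the function name `parse_alert_fields` and the sample text are illustrative, and it assumes the body keeps the `Key : Value` layout shown in the example above, which may vary if the template is customized.

```python
import re

# Matches "Key : Value" lines such as "Alert Severity : WARN".
# Keys are limited to letters and spaces so that free text is skipped.
FIELD = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?)\s*:\s+(.+?)\s*$")


def parse_alert_fields(body):
    """Return a dict of the labeled fields found in an alert body.

    Hypothetical helper: when a key appears more than once
    (for example "Type"), only the first occurrence is kept.
    """
    fields = {}
    for line in body.splitlines():
        match = FIELD.match(line)
        if match:
            fields.setdefault(match.group(1), match.group(2))
    return fields


# Excerpt of the alert body shown in the example above.
SAMPLE_BODY = """Hardware problem on netapp9-san.internal.sentrysoftware.net with 0c (FC Port).
Alert Severity : WARN
Alert Rule : hw.network.up == 0
Alert Details
=============
Problem : The network link is down.
"""

fields = parse_alert_fields(SAMPLE_BODY)
print(fields["Alert Severity"], "|", fields["Alert Rule"])
```

Lines without a label, such as the first sentence and the `=====` separators, are ignored.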
Alert Rules
Alert rules are sets of conditions used to identify the alert's severity and whether the alert should be triggered or not. These alert rules apply to Hardware Sentry:
Monitor | Metric Name | Severity | Default Alert Conditions | Attributes |
---|---|---|---|---|
Connector | hardware_sentry.connector.status | ALARM | hardware_sentry.connector.status == 1 | state = failed |
Host | hardware_sentry.host.up | ALARM | hardware_sentry.host.up == 0 | protocol = http |
Host | hardware_sentry.host.up | ALARM | hardware_sentry.host.up == 0 | protocol = ipmi |
Host | hardware_sentry.host.up | ALARM | hardware_sentry.host.up == 0 | protocol = snmp |
Host | hardware_sentry.host.up | ALARM | hardware_sentry.host.up == 0 | protocol = ssh |
Host | hardware_sentry.host.up | ALARM | hardware_sentry.host.up == 0 | protocol = wbem |
Host | hardware_sentry.host.up | ALARM | hardware_sentry.host.up == 0 | protocol = wmi |
Battery | hw.battery.charge | WARN | hw.battery.charge <= 0.5 | |
Battery | hw.battery.charge | ALARM | hw.battery.charge <= 0.3 | |
Battery | hw.status | ALARM | hw.status == 0 | hw.type = battery state = present |
Battery | hw.status | WARN | hw.status == 1 | hw.type = battery state = degraded |
Battery | hw.status | ALARM | hw.status == 1 | hw.type = battery state = failed |
Blade | hw.status | ALARM | hw.status == 0 | hw.type = blade state = present |
Blade | hw.status | WARN | hw.status == 1 | hw.type = blade state = degraded |
Blade | hw.status | ALARM | hw.status == 1 | hw.type = blade state = failed |
CPU | hw.errors | ALARM | hw.errors >= 1 | hw.type = cpu |
CPU | hw.status | WARN | hw.status == 1 | hw.type = cpu state = predicted_failure |
CPU | hw.status | ALARM | hw.status == 0 | hw.type = cpu state = present |
CPU | hw.status | WARN | hw.status == 1 | hw.type = cpu state = degraded |
CPU | hw.status | ALARM | hw.status == 1 | hw.type = cpu state = failed |
CPU Core | hw.status | ALARM | hw.status == 0 | hw.type = cpu_core state = present |
CPU Core | hw.status | WARN | hw.status == 1 | hw.type = cpu_core state = degraded |
CPU Core | hw.status | ALARM | hw.status == 1 | hw.type = cpu_core state = failed |
Disk Controller | hw.status | WARN | hw.status == 1 | hw.type = disk_controller battery_state = degraded |
Disk Controller | hw.status | ALARM | hw.status == 1 | hw.type = disk_controller battery_state = failed |
Disk Controller | hw.status | WARN | hw.status == 1 | hw.type = disk_controller state = degraded |
Disk Controller | hw.status | ALARM | hw.status == 1 | hw.type = disk_controller state = failed |
Disk Controller | hw.status | ALARM | hw.status == 0 | hw.type = disk_controller state = present |
Enclosure | hw.status | ALARM | hw.status == 1 | hw.type = enclosure state = open |
Enclosure | hw.status | ALARM | hw.status == 0 | hw.type = enclosure state = present |
Fan | hw.fan.speed | ALARM | hw.fan.speed == 0 | |
Fan | hw.fan.speed | WARN | hw.fan.speed <= 500 | |
Fan | hw.fan.speed_ratio | ALARM | hw.fan.speed_ratio == 0 | |
Fan | hw.fan.speed_ratio | WARN | hw.fan.speed_ratio <= 0.05 | |
Fan | hw.status | ALARM | hw.status == 0 | hw.type = fan state = present |
Fan | hw.status | WARN | hw.status == 1 | hw.type = fan state = degraded |
Fan | hw.status | ALARM | hw.status == 1 | hw.type = fan state = failed |
GPU | hw.errors | ALARM | hw.errors >= 1 | hw.type = gpu type = corrected |
GPU | hw.errors | ALARM | hw.errors >= 1 | hw.type = gpu type = all |
GPU | hw.gpu.memory.utilization | WARN | hw.gpu.memory.utilization >= 0.9 | |
GPU | hw.gpu.memory.utilization | ALARM | hw.gpu.memory.utilization >= 0.95 | |
GPU | hw.status | WARN | hw.status == 1 | hw.type = gpu state = predicted_failure |
GPU | hw.status | ALARM | hw.status == 0 | hw.type = gpu state = present |
GPU | hw.status | WARN | hw.status == 1 | hw.type = gpu state = degraded |
GPU | hw.status | ALARM | hw.status == 1 | hw.type = gpu state = failed |
LED | hw.status | WARN | hw.status == 1 | hw.type = led state = degraded |
LED | hw.status | ALARM | hw.status == 1 | hw.type = led state = failed |
Logical Disk | hw.errors | ALARM | hw.errors >= 1 | hw.type = logical_disk |
Logical Disk | hw.status | ALARM | hw.status == 0 | hw.type = logical_disk state = present |
Logical Disk | hw.status | WARN | hw.status == 1 | hw.type = logical_disk state = degraded |
Logical Disk | hw.status | ALARM | hw.status == 1 | hw.type = logical_disk state = failed |
LUN | hw.lun.paths | ALARM | hw.lun.paths < 1 | type = available |
LUN | hw.status | ALARM | hw.status == 0 | hw.type = lun state = present |
LUN | hw.status | WARN | hw.status == 1 | hw.type = lun state = degraded |
LUN | hw.status | ALARM | hw.status == 1 | hw.type = lun state = failed |
Memory Module | hw.errors | ALARM | hw.errors >= 1 | hw.type = memory |
Memory Module | hw.status | WARN | hw.status == 1 | hw.type = memory state = predicted_failure |
Memory Module | hw.status | ALARM | hw.status == 0 | hw.type = memory state = present |
Memory Module | hw.status | WARN | hw.status == 1 | hw.type = memory state = degraded |
Memory Module | hw.status | ALARM | hw.status == 1 | hw.type = memory state = failed |
Network Card | hw.network.bandwidth.utilization | WARN | hw.network.bandwidth.utilization >= 0.8 | |
Network Card | hw.network.error_ratio | WARN | hw.network.error_ratio >= 0.2 | |
Network Card | hw.network.error_ratio | ALARM | hw.network.error_ratio >= 0.3 | |
Network Card | hw.network.up | WARN | hw.network.up == 0 | |
Network Card | hw.status | ALARM | hw.status == 0 | hw.type = network state = present |
Network Card | hw.status | WARN | hw.status == 1 | hw.type = network state = degraded |
Network Card | hw.status | ALARM | hw.status == 1 | hw.type = network state = failed |
Other | hw.status | ALARM | hw.status == 0 | hw.type = other_device state = present |
Other | hw.status | WARN | hw.status == 1 | hw.type = other_device state = degraded |
Other | hw.status | ALARM | hw.status == 1 | hw.type = other_device state = failed |
Physical Disk | hw.physical_disk.endurance_utilization | WARN | hw.physical_disk.endurance_utilization <= 0.05 | state = remaining |
Physical Disk | hw.physical_disk.endurance_utilization | ALARM | hw.physical_disk.endurance_utilization <= 0.02 | state = remaining |
Physical Disk | hw.errors | ALARM | hw.errors >= 1 | hw.type = physical_disk |
Physical Disk | hw.status | WARN | hw.status == 1 | hw.type = physical_disk state = predicted_failure |
Physical Disk | hw.status | ALARM | hw.status == 0 | hw.type = physical_disk state = present |
Physical Disk | hw.status | WARN | hw.status == 1 | hw.type = physical_disk state = degraded |
Physical Disk | hw.status | ALARM | hw.status == 1 | hw.type = physical_disk state = failed |
Power Supply | hw.status | ALARM | hw.status == 0 | hw.type = power_supply state = present |
Power Supply | hw.status | WARN | hw.status == 1 | hw.type = power_supply state = degraded |
Power Supply | hw.status | ALARM | hw.status == 1 | hw.type = power_supply state = failed |
Power Supply | hw.power_supply.utilization | WARN | hw.power_supply.utilization >= 0.9 | |
Power Supply | hw.power_supply.utilization | ALARM | hw.power_supply.utilization >= 0.99 | |
Robotics | hw.status | ALARM | hw.status == 0 | hw.type = robotics state = present |
Robotics | hw.status | WARN | hw.status == 1 | hw.type = robotics state = degraded |
Robotics | hw.status | ALARM | hw.status == 1 | hw.type = robotics state = failed |
Tape Drive | hw.errors | ALARM | hw.errors >= 1 | hw.type = tape_drive |
Tape Drive | hw.status | WARN | hw.status == 1 | hw.type = tape_drive state = needs_cleaning |
Tape Drive | hw.status | ALARM | hw.status == 1 | hw.type = tape_drive state = needs_cleaning |
Tape Drive | hw.status | ALARM | hw.status == 0 | hw.type = tape_drive state = present |
Tape Drive | hw.status | WARN | hw.status == 1 | hw.type = tape_drive state = degraded |
Tape Drive | hw.status | ALARM | hw.status == 1 | hw.type = tape_drive state = failed |
Temperature | hw.status | ALARM | hw.status == 0 | hw.type = temperature state = present |
Temperature | hw.status | WARN | hw.status == 1 | hw.type = temperature state = degraded |
Temperature | hw.status | ALARM | hw.status == 1 | hw.type = temperature state = failed |
Virtual Machine | hw.status | ALARM | hw.status == 0 | hw.type = vm state = present |
Virtual Machine | hw.status | WARN | hw.status == 1 | hw.type = vm state = degraded |
Virtual Machine | hw.status | ALARM | hw.status == 1 | hw.type = vm state = failed |
Voltage | hw.status | ALARM | hw.status == 0 | hw.type = voltage state = present |
Voltage | hw.status | WARN | hw.status == 1 | hw.type = voltage state = degraded |
Voltage | hw.status | ALARM | hw.status == 1 | hw.type = voltage state = failed |
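To make the table easier to read, here is a minimal sketch of how a pair of threshold rules could be evaluated against a metric value. It is not the agent's implementation: the helper names are hypothetical, and keeping only the most severe matching rule is an assumption made for illustration.

```python
import operator

# Comparison operators that appear in the Default Alert Conditions column.
OPERATORS = {
    "==": operator.eq,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}

# Ranking used to keep the most severe matching rule (assumption).
SEVERITY_RANK = {"WARN": 1, "ALARM": 2}


def condition_matches(condition, value):
    """Evaluate a condition such as 'hw.battery.charge <= 0.5' against a value."""
    _metric, symbol, threshold = condition.split()
    return OPERATORS[symbol](value, float(threshold))


def evaluate(rules, value):
    """Return the highest severity among the matching rules, or None."""
    matched = [severity for severity, cond in rules if condition_matches(cond, value)]
    return max(matched, key=SEVERITY_RANK.get, default=None)


# The two battery charge rules from the table above.
BATTERY_CHARGE_RULES = [
    ("WARN", "hw.battery.charge <= 0.5"),
    ("ALARM", "hw.battery.charge <= 0.3"),
]

for charge in (0.9, 0.4, 0.2):
    print(charge, evaluate(BATTERY_CHARGE_RULES, charge))
```

With these rules, a charge of 0.4 matches only the WARN condition, while 0.2 matches both and is reported here as ALARM.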
Customizing Alert Content
You can customize the content of alerts by adding macros in the hardwareProblemTemplate parameter in the config/hws-config.yaml file. See the procedure detailed in the Hardware Problem Template section.
The default alert content template is:
Hardware problem on ${FQDN} with ${MONITOR_NAME}.${NEWLINE}${NEWLINE}${ALERT_DETAILS}${NEWLINE}${NEWLINE}${FULLREPORT}
The following macros can be used to obtain more details about the problem. They will be replaced at runtime.
Macro | Description |
---|---|
${MONITOR_NAME} | Name of the monitor that triggered the alert. Example: Fan: 1.1 (CPU1) |
${MONITOR_ID} | Unique identifier of the monitor that triggered the alert. |
${MONITOR_TYPE} | Type of the monitor that triggered the alert. Example: Physical Disk |
${PARENT_ID} | Identifier of the parent that the faulty instance is attached to. |
${METRIC_NAME} | Name of the metric that triggered the alert. Example: hw.status{state="failed", hw.type="battery"} |
${METRIC_VALUE} | Value of the metric that triggered the alert. Example: 1 (Failed) |
${SEVERITY} | Severity of the alert (ALARM, WARN) |
${ALERT_RULE} | Alert conditions that triggered the alert. Example: hw.status{state="failed", hw.type="battery"} == 1 |
${ALERT_DATE} | ISO date time at which the alert triggered. |
${CONSEQUENCE} | Description of the possible consequence of the detected problem. Example: The temperature of the chip, component or device that was cooled by this fan should grow quickly. This can lead to severe hardware damage and system crashes. |
${RECOMMENDED_ACTION} | Recommended action to solve the problem. Example: Check if the fan is no longer cooling the system. If so, replace the fan. |
${PROBLEM} | Description of the problem encountered by the monitor. Example: The speed of this fan is critically low (1503 rpm). |
${ALERT_DETAILS} | Severity, alert rule, problem, consequence and recommended action. |
${FULLREPORT} | Full hardware health report about the monitor that triggered the alert. |
${NEWLINE} | Linefeed. This is useful to produce multi-line information. |
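To preview what a custom template might produce before editing the configuration, the macro replacement can be imitated with Python's standard `string.Template`, which happens to use the same `${NAME}` syntax. This is only an approximation of the runtime substitution, not how Hardware Sentry implements it. The `render_alert` helper and the custom template are hypothetical, and the values are taken from the alert example earlier on this page.

```python
from string import Template


def render_alert(template, values):
    """Replace ${MACRO} placeholders with the given values.

    Hypothetical helper: ${NEWLINE} becomes a linefeed, and macros
    without a value are left untouched (safe_substitute).
    """
    macros = {"NEWLINE": "\n", **values}
    return Template(template).safe_substitute(macros)


# A shorter, single-purpose template built from the macros listed above.
CUSTOM_TEMPLATE = (
    "[${SEVERITY}] ${MONITOR_NAME} on ${FQDN}: ${PROBLEM}"
    "${NEWLINE}Action: ${RECOMMENDED_ACTION}"
)

rendered = render_alert(
    CUSTOM_TEMPLATE,
    {
        "SEVERITY": "WARN",
        "MONITOR_NAME": "0c (FC Port)",
        "FQDN": "netapp9-san.internal.sentrysoftware.net",
        "PROBLEM": "The network link is down.",
        "RECOMMENDED_ACTION": "Check that the network cable is properly plugged in.",
    },
)
print(rendered)
```

Leaving unknown macros untouched makes typos in a template easy to spot in the preview.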
Receiving Alerts
To receive Hardware Sentry's alerts, your Exporter must support the OpenTelemetry logs pipeline.
For troubleshooting purposes, you can add logging in the service:pipelines:logs:exporters section of the otel/otel-config.yaml file:
service:
  # ...
  pipelines:
    # ...
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resourcedetection]
      exporters: [logging] # List here the platform of your choice
Alerts will then be exported to the console.