Alerting

Alerts and pagers for your cluster

Alertmanager

Alertmanager, a tool from the Prometheus stack, handles alerts sent by the Prometheus server. Alertmanager lets you manage alerts flexibly and route them through receiver integrations such as email, Slack, or PagerDuty.

You can learn how to configure your Alertmanager and its integrations from its own documentation section.
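
As a rough illustration of routing through receivers, the sketch below sends critical alerts to PagerDuty, warnings to Slack, and everything else to email. It is not the configuration shipped with the module: every receiver name, address, URL, and key is a placeholder. It uses the long-standing `match` field; recent Alertmanager releases prefer `matchers`.

```yaml
# Illustrative alertmanager.yml; all names, addresses, and keys are placeholders.
global:
  smtp_smarthost: 'smtp.example.com:587'   # assumed SMTP relay
  smtp_from: 'alertmanager@example.com'

route:
  receiver: default-email                  # fallback when no child route matches
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: oncall-pagerduty
    - match:
        severity: warning
      receiver: team-slack

receivers:
  - name: default-email
    email_configs:
      - to: 'ops@example.com'
  - name: oncall-pagerduty
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'
  - name: team-slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/<your-webhook-path>'
        channel: '#alerts'
```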

The Fury Kubernetes Monitoring module ships with a simple, ready-to-use Alertmanager pre-configuration.

Alerts dispatch and support

The Fury Kubernetes Monitoring module comes pre-configured with a set of tested and validated alerts and rules that cover most use cases.

You can add your own alerts on top of ours. Alerts relating to the Fury Kubernetes Cluster can be dispatched to your on-call/SRE team or to the SIGHUP Support team, depending on your Support contract. Alerts relating to your applications can be dispatched to your on-call/dev teams.
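
A custom application alert could be sketched as a `PrometheusRule` resource, assuming the Prometheus Operator CRDs are installed in the cluster. The resource name, namespace, selector labels, metric name, and threshold are all hypothetical. The metadata labels in particular must match whatever `ruleSelector` your Prometheus instance uses. The extra `team` label is one way to let Alertmanager route application alerts to a dev team instead of the cluster on-call.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts            # hypothetical name
  namespace: my-app              # hypothetical namespace
  labels:
    # Assumed selector labels; adjust to your Prometheus ruleSelector.
    prometheus: k8s
    role: alert-rules
spec:
  groups:
    - name: my-app.rules
      rules:
        - alert: MyAppHighErrorRate
          # Fires when more than 5% of requests return HTTP 5xx.
          # http_requests_total is an assumed application metric.
          expr: |
            sum(rate(http_requests_total{job="my-app", code=~"5.."}[5m]))
              /
            sum(rate(http_requests_total{job="my-app"}[5m])) > 0.05
          for: 10m
          labels:
            severity: warning
            team: dev            # hypothetical label used for routing
          annotations:
            summary: my-app is returning more than 5% HTTP 5xx responses.
```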

Alerts

The following alerts, listed by the alert group they belong to, come pre-defined with this package.

kube-state-metrics

| Alert name | Description | Severity |
| --- | --- | --- |
| KubeStateMetricsListErrors | kube-state-metrics is experiencing errors in list operations. | critical |
| KubeStateMetricsWatchErrors | kube-state-metrics is experiencing errors in watch operations. | critical |

node-exporter

| Alert name | Description | Severity |
| --- | --- | --- |
| NodeFilesystemSpaceFillingUp | Filesystem is predicted to run out of space within the next 24 hours. | warning |
| NodeFilesystemSpaceFillingUp | Filesystem is predicted to run out of space within the next 4 hours. | critical |
| NodeFilesystemAlmostOutOfSpace | Filesystem has less than 5% space left. | warning |
| NodeFilesystemAlmostOutOfSpace | Filesystem has less than 3% space left. | critical |
| NodeFilesystemFilesFillingUp | Filesystem is predicted to run out of inodes within the next 24 hours. | warning |
| NodeFilesystemFilesFillingUp | Filesystem is predicted to run out of inodes within the next 4 hours. | critical |
| NodeFilesystemAlmostOutOfFiles | Filesystem has less than 5% inodes left. | warning |
| NodeFilesystemAlmostOutOfFiles | Filesystem has less than 3% inodes left. | critical |
| NodeNetworkReceiveErrs | Network interface is reporting many receive errors. | warning |
| NodeNetworkTransmitErrs | Network interface is reporting many transmit errors. | warning |
| NodeHighNumberConntrackEntriesUsed | Number of conntrack entries is getting close to the limit. | warning |
| NodeTextFileCollectorScrapeError | Node Exporter text file collector failed to scrape. | warning |
| NodeClockSkewDetected | Clock skew detected. | warning |
| NodeClockNotSynchronising | Clock not synchronising. | warning |
| NodeRAIDDegraded | RAID array is degraded. | critical |
| NodeRAIDDiskFailure | Failed device in RAID array. | warning |
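
Several alerts in this group appear twice under the same name with different severities. One way to get that behaviour is to define two rules that share an alert name but use different thresholds, as in the simplified sketch below. This is not the rule shipped with the package: the group name, the `for` duration, and the exact expression are assumptions. Only the 5% and 3% thresholds come from the table above.

```yaml
groups:
  - name: node-exporter-example          # hypothetical group name
    rules:
      - alert: NodeFilesystemAlmostOutOfSpace
        # Warning: less than 5% of the filesystem is still available.
        expr: |
          node_filesystem_avail_bytes{fstype!=""}
            / node_filesystem_size_bytes{fstype!=""} * 100 < 5
        for: 1h                          # assumed hold time
        labels:
          severity: warning
      - alert: NodeFilesystemAlmostOutOfSpace
        # Critical: same condition with a tighter 3% threshold.
        expr: |
          node_filesystem_avail_bytes{fstype!=""}
            / node_filesystem_size_bytes{fstype!=""} * 100 < 3
        for: 1h                          # assumed hold time
        labels:
          severity: critical
```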

kubernetes-apps

| Alert name | Description | Severity |
| --- | --- | --- |
| KubePodCrashLooping | Pod is crash looping. | warning |
| KubePodNotReady | Pod has been in a non-ready state for more than 15 minutes. | warning |
| KubeDeploymentGenerationMismatch | Deployment generation mismatch due to possible rollback. | warning |
| KubeDeploymentReplicasMismatch | Deployment has not matched the expected number of replicas. | warning |
| KubeStatefulSetReplicasMismatch | StatefulSet has not matched the expected number of replicas. | warning |
| KubeStatefulSetGenerationMismatch | StatefulSet generation mismatch due to possible rollback. | warning |
| KubeStatefulSetUpdateNotRolledOut | StatefulSet update has not been rolled out. | warning |
| KubeDaemonSetRolloutStuck | DaemonSet rollout is stuck. | warning |
| KubeContainerWaiting | Pod container has been waiting for longer than 1 hour. | warning |
| KubeDaemonSetNotScheduled | DaemonSet pods are not scheduled. | warning |
| KubeDaemonSetMisScheduled | DaemonSet pods are misscheduled. | warning |
| KubeJobCompletion | Job did not complete in time. | warning |
| KubeJobFailed | Job failed to complete. | warning |
| KubeHpaReplicasMismatch | HPA has not matched the desired number of replicas. | warning |
| KubeHpaMaxedOut | HPA is running at max replicas. | warning |

kubernetes-resources

| Alert name | Description | Severity |
| --- | --- | --- |
| KubeCPUOvercommit | Cluster has overcommitted CPU resource requests. | warning |
| KubeMemoryOvercommit | Cluster has overcommitted memory resource requests. | warning |
| KubeCPUQuotaOvercommit | Cluster has overcommitted CPU resource requests. | warning |
| KubeMemoryQuotaOvercommit | Cluster has overcommitted memory resource requests. | warning |
| KubeQuotaAlmostFull | Namespace quota is going to be full. | info |
| KubeQuotaFullyUsed | Namespace quota is fully used. | info |
| KubeQuotaExceeded | Namespace quota has exceeded the limits. | warning |

kubernetes-storage

| Alert name | Description | Severity |
| --- | --- | --- |
| KubePersistentVolumeFillingUp | PersistentVolume is filling up. | critical |
| KubePersistentVolumeFillingUp | PersistentVolume is filling up. | warning |
| KubePersistentVolumeErrors | PersistentVolume is having issues with provisioning. | critical |

kubernetes-system

| Alert name | Description | Severity |
| --- | --- | --- |
| KubeVersionMismatch | Different semantic versions of Kubernetes components running. | warning |
| KubeClientErrors | Kubernetes API server client is experiencing errors. | warning |

kube-apiserver-slos

| Alert name | Description | Severity |
| --- | --- | --- |
| KubeAPIErrorBudgetBurn | The API server is burning too much error budget. | critical |
| KubeAPIErrorBudgetBurn | The API server is burning too much error budget. | critical |
| KubeAPIErrorBudgetBurn | The API server is burning too much error budget. | warning |
| KubeAPIErrorBudgetBurn | The API server is burning too much error budget. | warning |

kubernetes-system-apiserver

| Alert name | Description | Severity |
| --- | --- | --- |
| KubeClientCertificateExpiration | Client certificate is about to expire. | warning |
| KubeClientCertificateExpiration | Client certificate is about to expire. | critical |
| AggregatedAPIErrors | An aggregated API has reported errors. | warning |
| AggregatedAPIDown | An aggregated API is down. | warning |
| KubeAPIDown | Target disappeared from Prometheus target discovery. | critical |

kubernetes-system-kubelet

| Alert name | Description | Severity |
| --- | --- | --- |
| KubeNodeNotReady | Node is not ready. | warning |
| KubeNodeUnreachable | Node is unreachable. | warning |
| KubeletTooManyPods | Kubelet is running at capacity. | warning |
| KubeNodeReadinessFlapping | Node readiness status is flapping. | warning |
| KubeletPlegDurationHigh | Kubelet Pod Lifecycle Event Generator is taking too long to relist. | warning |
| KubeletPodStartUpLatencyHigh | Kubelet Pod startup latency is too high. | warning |
| KubeletClientCertificateExpiration | Kubelet client certificate is about to expire. | warning |
| KubeletClientCertificateExpiration | Kubelet client certificate is about to expire. | critical |
| KubeletServerCertificateExpiration | Kubelet server certificate is about to expire. | warning |
| KubeletServerCertificateExpiration | Kubelet server certificate is about to expire. | critical |
| KubeletClientCertificateRenewalErrors | Kubelet has failed to renew its client certificate. | warning |
| KubeletServerCertificateRenewalErrors | Kubelet has failed to renew its server certificate. | warning |
| KubeletDown | Target disappeared from Prometheus target discovery. | critical |

kubernetes-system-scheduler

| Alert name | Description | Severity |
| --- | --- | --- |
| KubeSchedulerDown | Target disappeared from Prometheus target discovery. | critical |

kubernetes-system-controller-manager

| Alert name | Description | Severity |
| --- | --- | --- |
| KubeControllerManagerDown | Target disappeared from Prometheus target discovery. | critical |

prometheus

| Alert name | Description | Severity |
| --- | --- | --- |
| PrometheusBadConfig | Failed Prometheus configuration reload. | critical |
| PrometheusNotificationQueueRunningFull | Prometheus alert notification queue predicted to run full in less than 30m. | warning |
| PrometheusErrorSendingAlertsToSomeAlertmanagers | Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager. | warning |
| PrometheusErrorSendingAlertsToAnyAlertmanager | Prometheus encounters more than 3% errors sending alerts to any Alertmanager. | critical |
| PrometheusNotConnectedToAlertmanagers | Prometheus is not connected to any Alertmanagers. | warning |
| PrometheusTSDBReloadsFailing | Prometheus has issues reloading blocks from disk. | warning |
| PrometheusTSDBCompactionsFailing | Prometheus has issues compacting blocks. | warning |
| PrometheusNotIngestingSamples | Prometheus is not ingesting samples. | warning |
| PrometheusDuplicateTimestamps | Prometheus is dropping samples with duplicate timestamps. | warning |
| PrometheusOutOfOrderTimestamps | Prometheus drops samples with out-of-order timestamps. | warning |
| PrometheusRemoteStorageFailures | Prometheus fails to send samples to remote storage. | critical |
| PrometheusRemoteWriteBehind | Prometheus remote write is behind. | critical |
| PrometheusRemoteWriteDesiredShards | Prometheus remote write desired shards calculation wants to run more than configured max shards. | warning |
| PrometheusRuleFailures | Prometheus is failing rule evaluations. | critical |
| PrometheusMissingRuleEvaluations | Prometheus is missing rule evaluations due to slow rule group evaluation. | warning |

alertmanager.rules

| Alert name | Description | Severity |
| --- | --- | --- |
| AlertmanagerConfigInconsistent | | critical |
| AlertmanagerFailedReload | | warning |
| AlertmanagerMembersInconsistent | | critical |

general.rules

| Alert name | Description | Severity |
| --- | --- | --- |
| TargetDown | | warning |
| DeadMansSwitch | | none |
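
In kube-prometheus-style rule sets, DeadMansSwitch is typically an always-firing alert whose only purpose is to prove that the whole alerting pipeline works. If it stops arriving, something between Prometheus and the receiver is broken. Assuming that is also the case here, a common pattern is to route it to an external heartbeat service that raises an alarm when the notifications stop. The sketch below shows such a route; the receiver names and the webhook URL are placeholders, not part of the shipped configuration.

```yaml
route:
  receiver: default                      # placeholder fallback receiver
  routes:
    - match:
        alertname: DeadMansSwitch
      receiver: heartbeat
      group_wait: 0s
      repeat_interval: 5m                # re-send often so gaps are noticed quickly

receivers:
  - name: default
  - name: heartbeat
    webhook_configs:
      # Hypothetical heartbeat endpoint that alerts when pings stop.
      - url: 'https://heartbeat.example.com/ping/<check-id>'
```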

node-network

| Alert name | Description | Severity |
| --- | --- | --- |
| NodeNetworkInterfaceFlapping | | warning |

prometheus-operator

| Alert name | Description | Severity |
| --- | --- | --- |
| PrometheusOperatorListErrors | | warning |
| PrometheusOperatorWatchErrors | | warning |
| PrometheusOperatorReconcileErrors | | warning |
| PrometheusOperatorNodeLookupErrors | | warning |

Last modified 24.09.2020: Updates core modules docs (d30ff5c)