
In this blog post, we will run through how to set up alerting in OpenShift 4. After configuring monitoring and persistent storage in our previous articles, we now have metrics flowing and Prometheus actively scraping our OpenShift cluster. On the surface, this might feel like the job is done, but monitoring alone is only half the story.
If a critical issue occurs at 3 AM and no one is notified, those beautifully collected metrics are effectively useless. Detection without notification does not prevent downtime, reduce impact, or help teams respond faster. This is where alerting becomes the critical bridge between observability and real-world incident response.
How to Set Up Alerting in OpenShift 4
Monitoring tells you what is happening. Alerting ensures that the right people know what is happening, at the right time, with the right context.
In real-world OpenShift operations, alerting must be intentional and well-designed. Not every alert should wake someone up, and not every alert should go to the same destination. Infrastructure failures, application errors, and informational events all require different handling, urgency, and audiences.
In this post, we will complete our monitoring stack by configuring Alertmanager to intelligently route alerts based on severity, source, and ownership. We’ll cover how to:
- Page SRE or platform teams for critical cluster and infrastructure issues
- Notify application teams through collaboration tools for workload-level warnings
- Send low-priority or informational alerts as summaries for visibility without noise
This is the step that transforms OpenShift monitoring from passive visibility into active reliability.
Understanding the OpenShift Alerting Architecture
Before configuring notifications, it’s important to understand how alerts flow through OpenShift and which components are responsible for each step.
┌─────────────────┐
│   Prometheus    │  • Scrapes metrics from targets
│  (Monitoring)   │  • Evaluates PrometheusRule expressions
│                 │  • Determines WHEN an alert should fire
└────────┬────────┘
         │
         │  Sends firing alerts (with labels)
         ▼
┌─────────────────┐
│  Alertmanager   │  • Groups related alerts
│                 │  • Applies routing rules
│                 │  • Handles silences and inhibition
│                 │  • Decides WHERE alerts should go
└────────┬────────┘
         │
         │  Routes alerts to receivers
         ▼
┌─────────────────┐
│    Receivers    │  • Slack channels
│                 │  • PagerDuty services
│                 │  • Email addresses
│                 │  • Webhook endpoints
│                 │  • Defines HOW notifications are delivered
└─────────────────┘
The Three-Step Alert Flow:
Step 1: Detection (Prometheus)
Prometheus continuously evaluates PrometheusRule resources every 30 seconds (by default). Each rule contains a PromQL expression that checks if a problem exists. For example:
expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
This expression checks if available memory is less than 10%. If this condition is true for the duration specified, Prometheus creates a firing alert and sends it to Alertmanager.
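For illustration, here is a minimal sketch of how that expression could be wrapped into a complete alerting rule. The NodeMemoryLow name matches the example labels shown in the next step, while the 10m duration and the annotation text are assumptions for this example rather than cluster defaults:
- alert: NodeMemoryLow
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
  for: 10m                      # condition must hold for 10 minutes before firing (assumed value)
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.instance }} is low on memory"
    description: "Less than 10% of memory has been available for 10 minutes."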
Step 2: Routing (Alertmanager)
Alertmanager receives the firing alert from Prometheus. The alert includes labels such as:
labels:
  severity: critical
  namespace: production
  alertname: NodeMemoryLow
Alertmanager uses these labels to:
- Match routing rules – Which receiver should handle this alert?
- Group related alerts – Should we batch multiple similar alerts together?
- Apply inhibition – Should we suppress lower-priority alerts?
- Check silences – Has someone temporarily muted this alert?
Step 3: Notification (Receivers)
Once Alertmanager determines where the alert should go, it formats the notification and sends it to the configured receiver (Slack, email, etc.).
Key Concepts:
- PrometheusRule CRD:
- Defines when an alert should fire by evaluating metric expressions on a fixed interval.
- Prometheus detects problems but never sends notifications directly.
- Alertmanager:
- Receives firing alerts from Prometheus and decides where they should be sent.
- All notification logic (routing, grouping, silencing, and inhibition) lives here.
- Labels:
- Labels are the contract between Prometheus and Alertmanager.
- Prometheus attaches labels to alerts; Alertmanager uses those labels to match routing rules.
- Routes:
- Routes define the logic that maps alerts to receivers based on label matching.
- For example: severity=critical,team=platform.
- Receivers:
- Receivers define how alerts are delivered to external systems such as Slack, PagerDuty, email, or webhooks.
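To make the route idea concrete, here is a small sketch of what a route matching the example labels above might look like. The receiver name platform-pager is a placeholder for illustration, not something configured in this cluster:
route:
  receiver: Default            # fallback receiver
  routes:
  - matchers:
    - severity = critical
    - team = platform
    receiver: platform-pager   # placeholder receiver name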
Prerequisites
Before starting, ensure you have:
- An OpenShift cluster (for context, we are on version 4.20.8) with cluster-admin access
- Platform monitoring configured (enabled by default)
- User workload monitoring enabled (see previous article)
- Persistent storage configured for Alertmanager (see previous article)
- oc CLI installed and configured
- Access to notification systems:
- Slack workspace with webhook creation permissions
- SMTP server details (for email)
- PagerDuty integration key (optional)
- Webhook endpoint (optional)
Verifying Monitoring Stack
Before configuring receivers and routes, verify that the core platform Alertmanager (and the user workload monitoring stack, if already enabled for your cluster) is running and healthy.
Verify Core Platform Alertmanager. Check Alertmanager pods:
oc get pods -n openshift-monitoring | grep alertmanager
Expected output:
alertmanager-main-0 6/6 Running 0 5d
alertmanager-main-1 6/6 Running 0 5d
Verify persistent volume claims. Persistent volumes are required for silences and alert state to survive pod restarts.
oc get pvc -n openshift-monitoring -o \
custom-columns=NAME:.metadata.name,STATUS:.status.phase,\
VOLUME:.spec.volumeName,CAPACITY:.status.capacity.storage,\
ACCESS_MODES:.spec.accessModes
Expected output:
NAME STATUS VOLUME CAPACITY ACCESS_MODES
alertmanager-main-db-alertmanager-main-0 Bound pvc-3cb4cf2b-40c4-4872-8fce-2189f0a8d553 10Gi [ReadWriteOnce]
alertmanager-main-db-alertmanager-main-1 Bound pvc-d8461170-c5f3-4f4b-9f72-db0fbba16570 10Gi [ReadWriteOnce]
What to verify:
- STATUS: Bound – PVC successfully attached to a persistent volume
- CAPACITY: 10Gi – Default size, sufficient for most clusters
If any PVC shows Pending status, you have a storage provisioning issue that needs to be resolved.
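If that happens, describing the PVC usually reveals the provisioning error in its events. For example, using one of the PVC names from the output above:
oc describe pvc alertmanager-main-db-alertmanager-main-0 -n openshift-monitoring
# Check the Events section at the bottom for messages from the storage provisioner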
Sample core platform metrics (Memory usage) from OpenShift dashboard > Observe > Metrics:

For our user workload monitoring, we are using the core platform Alertmanager. Here are the pods from the user workload monitoring namespace:
oc get pods -n openshift-user-workload-monitoring
Sample output:
NAME READY STATUS RESTARTS AGE
prometheus-operator-86bc45dcfb-dhcn5 2/2 Running 0 23h
prometheus-user-workload-0 6/6 Running 0 23h
prometheus-user-workload-1 6/6 Running 0 23h
thanos-ruler-user-workload-0 4/4 Running 0 23h
thanos-ruler-user-workload-1 4/4 Running 0 23h
Here are sample metrics from our monitoring-demo namespace:

Understanding Alertmanager Configuration
Alertmanager configuration has 4 main sections. In OpenShift, the actual Alertmanager configuration is stored in a secret named alertmanager-main, which contains the alertmanager.yaml file. This file defines how alerts are processed, routed, and notified.
To get the current configuration, run the command below:
oc get secret alertmanager-main -n openshift-monitoring \
--template='{{index .data "alertmanager.yaml"}}' \
| base64 -d;echo
Here is the sample output:
"global":
"http_config":
"proxy_from_environment": true
"inhibit_rules":
- "equal":
- "namespace"
- "alertname"
"source_matchers":
- "severity = critical"
"target_matchers":
- "severity =~ warning|info"
- "equal":
- "namespace"
- "alertname"
"source_matchers":
- "severity = warning"
"target_matchers":
- "severity = info"
"receivers":
- "name": "Default"
- "name": "Watchdog"
- "name": "Critical"
"route":
"group_by":
- "namespace"
"group_interval": "5m"
"group_wait": "30s"
"receiver": "Default"
"repeat_interval": "12h"
"routes":
- "matchers":
- "alertname = Watchdog"
"receiver": "Watchdog"
- "matchers":
- "severity = critical"
"receiver": "Critical"
Where:
Section 1: Global Settings
This section defines global settings like the default receiver and common configurations across all alerts. For example:
global:
  http_config:
    proxy_from_environment: true
- proxy_from_environment: true allows Alertmanager to use proxy settings from the environment, enabling it to send outgoing requests (e.g., notifications) through the defined proxy.
Section 2: Inhibition Rules
Inhibition rules prevent Alertmanager from sending notifications for lower-severity alerts when higher-severity alerts are already active. This is useful when you don’t want to receive redundant notifications.
inhibit_rules:
- equal:
  - namespace
  - alertname
  source_matchers:
  - severity = critical
  target_matchers:
  - severity =~ warning|info
- equal:
  - namespace
  - alertname
  source_matchers:
  - severity = warning
  target_matchers:
  - severity = info
Here, there are two defined inhibition rules:
- The first rule prevents sending warning or informational alerts if a critical alert is already active, but only if they have the same namespace and alertname.
- The second rule prevents sending informational alerts when a warning alert is already active, only if they have the same namespace and alertname.
- source_matchers: Defines which alerts (e.g., with critical severity) will suppress others.
- target_matchers: Defines the alerts (e.g., warning or info) that will be suppressed when the source alert is active.
Section 3: Receivers (Where alerts go)
The receivers section is where you define the destinations for your alerts. In this example, there are three receivers:
- Default: This is the fallback receiver for general alerts.
- Watchdog: Used for health check or "watchdog" alerts.
- Critical: This one is used to route critical alerts.
receivers:
- name: "Default"
- name: "Watchdog"
- name: "Critical"
These receivers could be configured to send alerts to various destinations, such as Slack, email, or PagerDuty, depending on your needs.
In the default config above, the settings are just placeholder receivers with no actual notification methods configured. They’re defined but they don’t send anything anywhere yet.
When we configure Slack and Email, it will look like:
receivers:
- name: "slack-critical"
  slack_configs:
  - channel: '#alerts-critical'
    api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'
- name: "email-management"
  email_configs:
  - to: '[email protected]'
Each receiver can have multiple configurations (e.g., send to both Slack AND email).
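As a small sketch of that idea, a single receiver can carry both a slack_configs and an email_configs entry. The receiver name, channel, and address below are placeholders:
receivers:
- name: "ops-team"                      # placeholder receiver name
  slack_configs:
  - channel: '#ops-alerts'              # placeholder channel
    api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'
  email_configs:
  - to: 'ops-team@example.com'          # placeholder address (SMTP settings omitted; see the full example later in this post)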
Section 4: Routes
The routes section defines how alerts are grouped and routed to receivers. In this example, grouping and delay settings, as well as route-based matching rules are specified:
route:
  group_by:
  - namespace
  group_interval: "5m"
  group_wait: "30s"
  receiver: "Default"
  repeat_interval: "12h"
  routes:
  - matchers:
    - alertname = Watchdog
    receiver: "Watchdog"
  - matchers:
    - severity = critical
    receiver: "Critical"
Explanation:
- group_by: namespace: Alerts are grouped by the namespace label to ensure related alerts from the same namespace are bundled together. Example: If 5 pods crash in the production namespace within 30 seconds and generate alerts with the same alertname and namespace labels, Alertmanager groups them into a single notification instead of sending 5 separate ones. The notification may indicate that multiple alerts are firing in the production namespace.
- group_interval: "5m": Specifies how long Alertmanager should wait before sending grouped notifications. After sending the first notification for a group, wait 5 minutes before sending updates about new alerts added to that group.
- group_wait: "30s": Defines the delay before sending the first notification in a group of alerts. When a new group of alerts starts, wait 30 seconds before sending the first notification. This gives time for related alerts to arrive so they can be batched together.
- repeat_interval: "12h": Configures the time between repeated notifications for the same alert. If an alert is still firing, resend the notification every 12 hours as a reminder.
The routes section then matches specific conditions:
- Alerts with alertname = Watchdog go to the Watchdog receiver.
- Alerts with severity = critical go to the Critical receiver.
Setting Up Notification Receivers
A receiver in Alertmanager is a destination for alerts. It defines where alerts are sent and how they are delivered.
Some of the most common receiver types include:
- Slack
- PagerDuty
- Microsoft Teams
- Webhook
In this example setup, we will demonstrate how to configure Slack and Email notifications.
Configure Slack Webhook
Slack is commonly used for team notifications. To receive alerts in Slack, you need to create an Incoming Webhook for your channels.
A webhook is a special URL that allows one service to send data to another automatically. In this case, Alertmanager uses the webhook to push alert messages to a Slack channel whenever an alert occurs.
Step 1: Configure Slack Webhook
You can create Slack channels to organize alerts by severity or type. For example:
- #alerts-critical: critical or warning alerts
- #alerts-info: informational or resolved alerts
- #alerts-watchdog: health check alerts
In this demo, we’ll use a single channel: openshift-alerts-demo.
Create an Incoming Webhook for the channel:
1. Create a Slack App (only if you don't have one)
- Go to Slack Apps > Create an App.
- Create an App from Scratch.
- Give it a name (e.g., OpenShift Alerts) and select your workspace > Create App.
- If you already have an app that can post to the channel, skip this step.
2. Enable Incoming Webhooks
- Open your Slack App from the Slack Apps page.
- Head over to Features > Incoming Webhooks.
- Toggle Activate Incoming Webhooks to ON.
3. Create a Webhook URL for your channel
- Scroll down to the Webhook URLs for Your Workspace section.
- Click Add New Webhook.
- Select the Slack workspace and your respective channel (e.g., openshift-alerts-demo).
- Review the permissions the app needs to post messages to that channel. Click Allow to authorize the app to post in the selected channel.
- After authorization, Slack will generate a Webhook URL (format: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXX).
- Copy this Webhook URL. You will use it in your Alertmanager configuration.

⚠️ Note: Treat webhook URLs as secrets. Do not commit them to public repositories, as anyone with the URL can post messages to your channel.
Step 2: Test the Slack Webhook
After creating the Slack webhook URL, it’s a good idea to test it to ensure everything is working as expected. I will execute the command below from my bastion host to confirm a message can be posted on the defined Slack channel.
curl -X POST -H 'Content-type: application/json' --data '{"text":"OpenShift Alert: Test message!"}' https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXX
Use your respective webhook URL in place of the example above.
The sample output of the curl command is ok, which confirms the webhook works. If all is good, you should also see the test message in your channel.
If the test fails:
- Verify the webhook URL is correct (no typos)
- Check that the Slack app still has permission to post to the channel
- Ensure your network allows outbound HTTPS to hooks.slack.com
Setting Up Email Notifications
Email notifications are important for:
- Audit trails and compliance
- Management visibility
- Backup notification channel if Slack is down
- Formal incident records
Step 1: Gather SMTP Information
You need these details from your email provider:
- SMTP Server:
- Gmail: smtp.gmail.com
- Office 365: smtp.office365.com
- Port: e.g., port 587
- From Address: e.g., [email protected]
- Username: Your Gmail address
- Password:
- Gmail: App-specific password
- Office 365: Your password. Confirm with your IT team about this.
- TLS Required: Yes
Step 2: Test SMTP Connection
Before configuring Alertmanager, verify you can connect to the SMTP server:
For example, since we are using the Gmail relay in our demo:
nc -vz smtp.gmail.com 587
Expected response:
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Connected to 142.251.127.109:587.
Ncat: 0 bytes sent, 0 bytes received in 0.06 seconds.
If connection fails:
- Port 587 may be blocked by firewall
- SMTP server hostname may be wrong
- Try alternate port (465 for SSL, 25 for legacy)
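If the basic TCP check succeeds and you want to go one step further, you can also confirm that the server negotiates STARTTLS on port 587. This assumes the openssl client is available on your bastion host:
openssl s_client -starttls smtp -connect smtp.gmail.com:587 -crlf
# A successful handshake prints the server certificate chain; type QUIT to close the session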
Applying the Alertmanager Configuration
Now we’ll create a complete Alertmanager configuration file that includes both Slack and Email receivers.
Extract Current Alertmanager Configuration
As we did above under the section "Understanding Alertmanager Configuration", you can extract the current Alertmanager configuration from the alertmanager-main secret in the openshift-monitoring namespace and save it as a local alertmanager.yaml file:
oc get secret alertmanager-main -n openshift-monitoring \
--template='{{index .data "alertmanager.yaml"}}' \
| base64 -d > alertmanager.yaml && echo >> alertmanager.yaml
You will now have the current Alertmanager configuration in a local alertmanager.yaml file.
Customize the Alertmanager Configuration Accordingly
You can then edit the alertmanager.yaml and customize it accordingly.
This is how we have modified our configuration:
global:
  http_config:
    proxy_from_environment: true
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/HERE'
inhibit_rules:
- equal:
  - namespace
  - alertname
  source_matchers:
  - severity = critical
  target_matchers:
  - severity =~ warning|info
- equal:
  - namespace
  - alertname
  source_matchers:
  - severity = warning
  target_matchers:
  - severity = info
receivers:
- name: unified-alerts
  slack_configs:
  - channel: '#openshift-alerts-demo'
    username: 'OpenShift Alertmanager'
    icon_emoji: ':kubernetes:'
    title: '{{ if eq .CommonLabels.severity "critical" }} CRITICAL: {{ end }}{{ .GroupLabels.alertname }}'
    text: |-
      {{ range .Alerts }}
      {{ if eq .Labels.severity "critical" }} *CRITICAL ALERT*{{ end }}
      *Summary:* {{ .Annotations.summary }}
      *Description:* {{ .Annotations.description }}
      *Namespace:* {{ .Labels.namespace }}
      *Severity:* {{ .Labels.severity }}
      *Started:* {{ .StartsAt }}
      {{ end }}
    send_resolved: true
  email_configs:
  - to: '[email protected]'
    from: '[email protected]'
    smarthost: 'smtp.gmail.com:587'
    auth_username: '[email protected]'
    auth_password: 'your-app-password-here'
    require_tls: true
    headers:
      Subject: '[OpenShift Alert] {{ .GroupLabels.alertname }} - {{ .Status }}'
    html: |-
      <!DOCTYPE html>
      <html>
      <body>
        <h2>{{ if eq .CommonLabels.severity "critical" }}CRITICAL {{ end }}OpenShift Alert: {{ .GroupLabels.alertname }}</h2>
        <p><strong>Status:</strong> {{ .Status | toUpper }}</p>
        {{ range .Alerts }}
        <div style="border-left: 4px solid {{ if eq .Labels.severity "critical" }}red{{ else }}blue{{ end }}; padding: 10px; margin: 10px 0;">
          <h3>{{ .Annotations.summary }}</h3>
          <p>{{ .Annotations.description }}</p>
          <p><strong>Namespace:</strong> {{ .Labels.namespace }}</p>
          <p><strong>Severity:</strong> {{ .Labels.severity }}</p>
          <p><strong>Started:</strong> {{ .StartsAt }}</p>
        </div>
        {{ end }}
      </body>
      </html>
    send_resolved: true
route:
  receiver: unified-alerts
  group_by:
  - namespace
  - alertname
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
  - matchers:
    - alertname = Watchdog
    receiver: unified-alerts
    repeat_interval: 5m
    group_wait: 0s
Update the configuration accordingly as per your environment settings.
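Before applying, you can optionally validate the edited file for syntax and schema errors. If you have the amtool binary installed locally (it is not shipped with the oc CLI, so this is an optional extra step), a quick check looks like this:
amtool check-config alertmanager.yaml
# On success it prints a SUCCESS line and a summary of the routes, receivers, and inhibit rules it found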
Once you are confident about the updates you have made, you can apply the new configuration by running the command below:
oc create secret generic alertmanager-main \
-n openshift-monitoring \
--from-file=alertmanager.yaml \
--dry-run=client -o=yaml | \
oc replace secret -n openshift-monitoring --filename=-
Then verify your routing configuration by visualizing the routing tree:
oc exec alertmanager-main-0 -n openshift-monitoring -- \
amtool config routes show --alertmanager.url http://localhost:9093
Routing tree:
.
└── default-route receiver: unified-alerts
└── {alertname="Watchdog"} receiver: unified-alerts
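amtool also has a routes test subcommand that reports which receiver a given set of labels would be routed to. Assuming it can read the configuration from the running instance the same way routes show does above, a check for a critical alert might look like:
oc exec alertmanager-main-0 -n openshift-monitoring -- \
  amtool config routes test --alertmanager.url http://localhost:9093 severity=critical namespace=monitoring-demo
# Prints the name of the matching receiver, e.g. unified-alerts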
You can also check Alertmanager logs:
oc logs alertmanager-main-0 -n openshift-monitoring -c alertmanager --tail=20 -f
Look for a line like this:
...
time=2026-02-10T20:21:42.734Z level=INFO source=coordinator.go:125 msg="Completed loading of configuration file" component=configuration file=/etc/alertmanager/config_out/alertmanager.env.yaml
...
If you see errors instead:
- Check YAML syntax (indentation matters!)
- Verify webhook URL format
- Ensure SMTP credentials are correct
Test with Watchdog Alert
Within 1-2 minutes of applying the configuration, you should receive the Watchdog alert in Slack.
Watchdog is a special alert that fires continuously. It exists specifically to prove your alerting pipeline is working. If you stop receiving Watchdog alerts, you know:
- Either Prometheus stopped evaluating rules
- Or Alertmanager stopped sending notifications
- Or your Slack webhook stopped working
It’s an early warning system for the alerting system itself.
Check Slack channel:
You should see alert messages:

Sample email alerts:

If you don’t see it after 3-5 minutes:
- Check Alertmanager logs for errors:
oc logs alertmanager-main-0 -c alertmanager -n openshift-monitoring | grep -i error
- Verify the webhook URL is correct:
oc get secret alertmanager-main -n openshift-monitoring \
--template='{{index .data "alertmanager.yaml"}}' | base64 -d | grep slack_api_url
- Test the webhook directly again with curl
Creating Custom Alert Rules with PrometheusRule
Now that Alertmanager is configured to send notifications, let’s create custom alert rules that detect actual problems in your applications.
PrometheusRule is a Kubernetes Custom Resource (CRD) that tells Prometheus:
- What metric expression to evaluate (e.g., “CPU > 90%”)
- How often to check it (default: every 30 seconds)
- How long it must be true before firing (e.g., “for 5 minutes”)
- What labels to attach (severity, team, etc.)
- What information to include (summary, description)
PrometheusRule Structure
This is the basic layout of a PrometheusRule resource:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-app-alerts
  namespace: my-namespace
  labels:
    prometheus: user-workload
    role: alert-rules
spec:
  groups:
  - name: app-availability
    interval: 30s
    rules:
    - alert: PodNotRunning
      expr: kube_pod_status_phase{phase!="Running"} == 1
      for: 5m
      labels:
        severity: critical
        team: platform
      annotations:
        summary: "Pod {{ $labels.pod }} is not running"
        description: "Pod has been in {{ $labels.phase }} state for 5 minutes"
Let’s break down each part:
- metadata Section:
  - name: A unique name that identifies this PrometheusRule resource.
  - namespace: The user-defined namespace where the rule is created (your application/project namespace).
  - labels: Optional key-value pairs for organization, filtering, or categorization of the resource.
    - prometheus: user-workload – conventionally used to tag rules for the user workload context.
    - role: alert-rules – purely conventional, for easy filtering, e.g. oc get prometheusrules -l role=alert-rules.
- spec Section: This defines the desired state of the rules.
  - groups – Rules are organized into groups for logical organization.
  - name – Name of this group (appears in the Prometheus UI).
  - interval – How often to evaluate these rules (default: 30s).
- spec.groups[].rules Section: Inside each group is an array of individual rules (alerting or recording). This example has one alerting rule.
  - alert – Name of the alert (shown in notifications). It should be descriptive and, in most cases, is written in upper CamelCase (PodNotRunning).
  - expr – PromQL expression that detects the problem. It returns a value when true and nothing when false. Here, kube_pod_status_phase{phase!="Running"} == 1 checks whether any pod's phase is not "Running"; if it evaluates to true, the condition is met.
  - for – Defines how long the expression must be true before the alert fires.
    - Prevents alerts on transient issues (a pod restart that takes 30 seconds triggers no alert).
    - For critical infrastructure, use shorter durations (1-2 minutes).
    - For less urgent issues, use longer durations (10-15 minutes).
  - labels – Key-value pairs attached to the alert.
    - Alertmanager uses these for routing decisions.
    - severity – critical/warning/info (standard convention).
    - You can add custom labels like team, component, environment.
  - annotations – Human-readable information shown in notifications.
    - summary – A one-line description (appears as the alert title).
    - description – A detailed explanation (appears in the alert body).
    - {{ $labels.pod }} – Template variable, replaced with the actual pod name.
Creating Prometheus Alert Rules
Now that we have seen the format of a PrometheusRule, let’s see how you can create an alert rule to notify in case of any issues.
In real-world monitoring, alerts typically fall into two major categories:
- Resource-level alerts: Is the infrastructure healthy?
- Application or business-level alerts: Is the application behaving correctly?
To demonstrate this, we’ll create two alerts in the monitoring-demo namespace:
- One that checks for high CPU usage on pods.
- Another that monitors transaction failures in our MobilePay application.
Resource Alert: High CPU Usage
A common infrastructure concern is CPU saturation. If a pod consistently consumes more CPU than expected, performance may degrade before the pod even crashes.
In this example, we define two CPU usage thresholds for workloads in the monitoring-demo project:
- A warning alert when CPU usage exceeds 85% of the container’s CPU limit for more than 5 minutes
- A critical alert when CPU usage exceeds 95% of the container’s CPU limit for more than 2 minutes
This tiered approach allows us to detect sustained pressure early (warning) while escalating quickly if the container approaches CPU saturation (critical).
Now consider the deployment resources for our container:
oc get deploy -n monitoring-demo mobilepay-api -o yaml | grep -C10 resources
Example output:
resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 256Mi
The CPU limit is 500m, which equals 0.5 CPU cores.
This means the container cannot use more than 0.5 cores. If it attempts to exceed this limit, Kubernetes will throttle CPU usage.
Based on our alert configuration:
- 85% of 0.5 cores = 0.425 cores: If usage stays above this threshold for 5 continuous minutes, the HighPodCPUUsageWarning alert fires.
- 95% of 0.5 cores = 0.475 cores: If usage exceeds this threshold for more than 2 minutes, the HighPodCPUUsageCritical alert fires, indicating an immediate risk of CPU throttling and potential application performance degradation.
This approach ensures we are not alerted for short-lived spikes, but we are notified quickly when sustained or near-saturation CPU conditions occur.
Here is our sample PrometheusRule for the same:
cat cpu-alerts.yaml
Sample output:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: monitoring-demo-resource-alerts
  namespace: monitoring-demo
spec:
  groups:
  - name: resource-alerts
    rules:
    - alert: HighPodCPUUsageWarning
      expr: |
        sum by (pod, container) (
          rate(container_cpu_usage_seconds_total{
            namespace="monitoring-demo",
            container!="", container!="POD"
          }[5m])
        ) / sum by (pod, container) (
          container_spec_cpu_quota{
            namespace="monitoring-demo",
            container!="", container!="POD"
          } / 100000
        ) * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage warning in pod {{ $labels.pod }} (container {{ $labels.container }})"
        description: |
          CPU usage is {{ printf "%.1f" $value }}% of the container's CPU limit for >5 minutes.
          Namespace: monitoring-demo
          Pod: {{ $labels.pod }}
          Container: {{ $labels.container }}
          Consider checking application load, scaling, or increasing CPU limit to prevent throttling.
    - alert: HighPodCPUUsageCritical
      expr: |
        sum by (pod, container) (
          rate(container_cpu_usage_seconds_total{
            namespace="monitoring-demo",
            container!="", container!="POD"
          }[5m])
        ) / sum by (pod, container) (
          container_spec_cpu_quota{
            namespace="monitoring-demo",
            container!="", container!="POD"
          } / 100000
        ) * 100 > 95
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "CRITICAL CPU usage in pod {{ $labels.pod }} (container {{ $labels.container }})"
        description: |
          CPU usage is {{ printf "%.1f" $value }}% of the container's CPU limit for >2 minutes.
          Namespace: monitoring-demo
          Pod: {{ $labels.pod }}
          Container: {{ $labels.container }}
          Immediate action required: high risk of CPU throttling. Check load, scaling, or increase CPU limit.
One note on the inhibition rules we defined earlier: they only suppress a lower-severity alert when it has the same namespace and alertname as the active critical alert. Because these two rules use different alert names (HighPodCPUUsageWarning and HighPodCPUUsageCritical), the critical alert will not silence the warning out of the box; both notifications can arrive. If you want that suppression behavior, either give both rules the same alert name with different severity labels, or adjust the inhibition rule so its equal list matches on labels the two alerts share (such as namespace and pod).
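For reference, a sketch of an inhibition rule along those lines could look like the following. It matches on namespace and pod instead of alertname, and is not part of the configuration applied in this demo:
inhibit_rules:
- equal:
  - namespace
  - pod
  source_matchers:
  - severity = critical
  target_matchers:
  - severity = warning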
So, let's apply the rule:
oc apply -f cpu-alerts.yaml
Confirm the rule is created:
oc get promrule -n monitoring-demo
Sample output:
NAME AGE
monitoring-demo-resource-alerts 43s
Testing your Alerts
To quickly verify that the rules work as expected, let’s intentionally trigger alerts and observe the pipeline in action.
Since I didn't have a direct way to simulate the situation in my application, I simply logged in to the deployment pods:
oc exec -it mobilepay-api-6b898c6dd5-sln7g -n monitoring-demo -- bash
oc exec -it mobilepay-api-6b898c6dd5-wkpfn -n monitoring-demo -- bash
and executed the following stress script:
for i in {1..8}; do
yes > /dev/null &
done
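A note before moving on: when you are done testing, stop these busy loops from the same shell so CPU usage can drop back and the alerts resolve. A minimal way, assuming the yes processes are the only background jobs in that shell:
kill $(jobs -p)    # stops the background yes processes started in this shell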
I started to see CPU usage on the pods go up:
oc adm top pods -n monitoring-demo
NAME CPU(cores) MEMORY(bytes)
mobilepay-api-6b898c6dd5-sln7g 95m 69Mi
mobilepay-api-6b898c6dd5-wkpfn 38m 82Mi
After some time, the CPU usage was at and above the thresholds.
First alert on Slack:

Critical threshold alerts:

When things ease off, you should receive a resolution alert as well:

In the same way, you can create your own custom application error or business metrics alerts to monitor critical behavior in your applications.
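For example, the transaction-failure alert we mentioned for the MobilePay application could look roughly like the sketch below. The metric names mobilepay_transactions_failed_total and mobilepay_transactions_total are hypothetical; substitute whatever counters your application actually exposes:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: monitoring-demo-business-alerts
  namespace: monitoring-demo
spec:
  groups:
  - name: business-alerts
    rules:
    - alert: HighTransactionFailureRate
      expr: |
        sum(rate(mobilepay_transactions_failed_total{namespace="monitoring-demo"}[5m]))
        /
        sum(rate(mobilepay_transactions_total{namespace="monitoring-demo"}[5m]))
        * 100 > 5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "MobilePay transaction failure rate above 5%"
        description: "More than 5% of transactions failed over the last 5 minutes (hypothetical metrics, for illustration only)."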
You might also have received a lot of Watchdog alerts, which can become a nuisance.
Sample:
OpenShift Alert: Watchdog
Status: FIRING
An alert that should always be firing to certify that Alertmanager is working properly.
This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing, therefore it should always be firing in Alertmanager and always fire against a receiver. There are integrations with various notification mechanisms that send a notification when this alert is not firing. For example the "DeadMansSnitch" integration in PagerDuty.
Namespace: openshift-monitoring
Severity: none
Started: 2026-02-11 17:28:50.411 +0000 UTC
In that case, just edit the alertmanager.yaml above, add a null receiver, and route Watchdog alerts to that receiver so you don't receive them on your channel.
vim alertmanager.yaml
Add a null receiver:
...
receivers:
- name: 'null'
- name: unified-alerts
  slack_configs:
  - channel: '#openshift-alerts-demo'
    username: 'OpenShift Alertmanager'
...
Then route the Watchdog alerts to that receiver:
...
  routes:
  - matchers:
    - alertname = Watchdog
    receiver: 'null'
    repeat_interval: 5m
    group_wait: 0s
...
Save and apply:
oc create secret generic alertmanager-main \
-n openshift-monitoring \
--from-file=alertmanager.yaml \
--dry-run=client -o=yaml | \
oc replace secret -n openshift-monitoring --filename=-
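You can confirm that the Watchdog route now points to the null receiver by re-running the routing-tree command from earlier; the output should look roughly like this:
oc exec alertmanager-main-0 -n openshift-monitoring -- \
  amtool config routes show --alertmanager.url http://localhost:9093
Routing tree:
.
└── default-route  receiver: unified-alerts
    └── {alertname="Watchdog"}  receiver: null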
Troubleshooting Common Alerting Issues
Even with a properly configured PrometheusRule and Alertmanager, alerting can sometimes fail. Here's a quick, high-level overview of common problems and what to check:
- Alerts Not Firing
- Symptom: Alert never appears in Prometheus.
- Check: Ensure the PrometheusRule exists and has the correct prometheus: user-workload label, the metrics exist, and the PromQL expression is valid.
- Common causes: Missing label, typo in metric name, wrong namespace, or PromQL syntax errors.
- Alerts Firing but No Notifications
- Symptom: Alert shows as “firing” in Prometheus but no Slack/email is received.
- Check: Verify Alertmanager received the alert (see the amtool example after this list), check logs, and validate routing and receiver configuration.
- Common causes: Incorrect webhook/SMTP, misconfigured routes, firewall blocking traffic.
- Slack Webhook Not Working
- Symptom: Alertmanager logs indicate failure to send notification.
- Check: Test the webhook manually, verify the URL in secrets, and confirm network connectivity.
- Common causes: Typo in webhook URL, expired webhook, corporate proxy/firewall blocking requests.
- Email Not Sending
- Symptom: No errors in logs, but emails never arrive.
- Check: Confirm SMTP settings, test from a pod, and check spam folder.
- Common causes: Wrong SMTP port, TLS issues, missing app password, firewall restrictions.
- Too Many Notifications
- Symptom: Flooded with duplicate alerts.
- Solutions:
- Adjust grouping (group_by) to consolidate related alerts.
- Increase repeat_interval to reduce the frequency of repeated notifications.
- Use inhibition rules to suppress lower-severity alerts when a critical alert is active.
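For the "alerts firing but no notifications" case, one quick way to confirm whether Alertmanager actually received an alert is to query it with amtool from inside the pod; the label filter below is just an example:
oc exec alertmanager-main-0 -n openshift-monitoring -- \
  amtool alert query --alertmanager.url http://localhost:9093 alertname=HighPodCPUUsageWarning
# If nothing is listed, the alert never reached Alertmanager and the problem is on the Prometheus side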
By systematically checking these areas, you can quickly identify the root cause of alerting issues and keep your Prometheus monitoring reliable.
Conclusion
Alerting transforms OpenShift monitoring from passive observation into active incident response. By configuring Alertmanager with intelligent routing, severity-based receivers, and custom PrometheusRules, you've built a notification pipeline that ensures the right people are informed at the right time. Remember that alerting is iterative: start with critical infrastructure alerts, tune thresholds based on real-world behavior, and continuously refine your routing rules to reduce noise while maintaining coverage.
With Prometheus detecting problems, Alertmanager routing notifications, and receivers delivering them to your team, you now have a complete observability stack that doesn’t just collect metrics but actively protects your applications and infrastructure. As your OpenShift environment grows, regularly review your alert rules, adjust notification channels, and leverage inhibition rules to keep your alerting pipeline both comprehensive and manageable.
