Cluster Monitoring and Alerts

Prerequisites

  • OpenShift 4.6 on VMware vSphere 6.7 U3+ or 7.0
  • VMware Cloud Native Storage (CNS) to support the CSI driver
  • OpenShift installer
    • Node subnet with DHCP pool
    • DNS
    • NTP

OpenShift Monitoring and Alerts

OpenShift Container Platform includes a pre-configured, pre-installed, and self-updating monitoring stack that provides monitoring for core platform components. OpenShift Container Platform delivers monitoring best practices out of the box. A set of alerts are included by default that immediately notify cluster administrators about issues with a cluster. Default dashboards in the OpenShift Container Platform web console include visual representations of cluster metrics to help you to quickly understand the state of your cluster.

The OpenShift Container Platform monitoring stack is based on the Prometheus open source project and its wider ecosystem. The monitoring stack includes the following:

  • Default platform monitoring components. A set of platform monitoring components are installed in the openshift-monitoring project by default during an OpenShift Container Platform installation. This provides monitoring for core OpenShift Container Platform components, including Kubernetes services. The default monitoring stack also enables remote health monitoring for clusters.

  • Components for monitoring user-defined projects. After optionally enabling monitoring for user-defined projects, additional monitoring components are installed in the openshift-user-workload-monitoring project. This provides monitoring for user-defined projects.

Default monitoring targets

In addition to the components of the stack itself, the default monitoring stack monitors the following targets:

  • CoreDNS
  • Elasticsearch (if Logging is installed)
  • etcd
  • Fluentd (if Logging is installed)
  • HAProxy
  • Image registry
  • Kubelets
  • Kubernetes apiserver
  • Kubernetes controller manager
  • Kubernetes scheduler
  • Metering (if Metering is installed)
  • OpenShift apiserver
  • OpenShift controller manager
  • Operator Lifecycle Manager (OLM)

Configuring the monitoring stack

The OpenShift Container Platform 4 installation program provides only a small number of configuration options before installation. Configuring most OpenShift Container Platform framework components, including the cluster monitoring stack, happens post-installation.

You can configure the monitoring stack by creating and updating monitoring config maps.

Procedure

  1. Check whether the cluster-monitoring-config ConfigMap object exists:

    oc -n openshift-monitoring get configmap cluster-monitoring-config
    
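    If the ConfigMap object does not exist, the command returns an error similar to the following:

    Error from server (NotFound): configmaps "cluster-monitoring-config" not found
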
  2. If the ConfigMap object does not exist:

    Create and apply the following YAML manifests. In this example they are piped directly to oc apply; you could instead save them to a file such as cluster-monitoring-config.yaml:

    cat <<EOF | oc apply -f -
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        enableUserWorkload: true
        prometheusK8s: 
          retention: 7d
          externalLabels:
            region: sea
            environment: dev
          volumeClaimTemplate:
            spec:
              storageClassName: thin
              volumeMode: Filesystem
              resources:
                requests:
                  storage: 40Gi
        alertmanagerMain:
          volumeClaimTemplate:
            spec:
              storageClassName: thin
              resources:
                requests:
                  storage: 5Gi
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: user-workload-monitoring-config
      namespace: openshift-user-workload-monitoring
    data:
      config.yaml: |
        prometheus:
          retention: 12h
          resources:
            requests:
              cpu: 200m
              memory: 1Gi
    EOF
    
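    After the config maps are applied, verify that the user workload monitoring components start up; the prometheus-operator, prometheus-user-workload, and thanos-ruler pods should appear in the openshift-user-workload-monitoring project:

    oc -n openshift-user-workload-monitoring get pods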

Managing alerts

In OpenShift Container Platform 4.6, the Alerting UI enables you to manage alerts, silences, and alerting rules.

  • Alerting rules. Alerting rules contain a set of conditions that outline a particular state within a cluster. Alerts are triggered when those conditions are true. An alerting rule can be assigned a severity that defines how the alerts are routed.

  • Alerts. An alert is fired when the conditions defined in an alerting rule are true. Alerts provide a notification that a set of circumstances are apparent within an OpenShift Container Platform cluster.

  • Silences. A silence can be applied to an alert to prevent notifications from being sent when the conditions for an alert are true. You can mute an alert after the initial notification, while you work on resolving the underlying issue.

Understanding alert filters

In the Administrator perspective, the Alerts page in the Alerting UI provides details about alerts relating to default OpenShift Container Platform and user-defined projects. The page includes a summary of severity, state, and source for each alert. The time at which an alert went into its current state is also shown.

You can filter by alert state, severity, and source. By default, only Platform alerts that are Firing are displayed. The following describes each alert filtering option:

  • Alert State filters:

    • Firing. The alert is firing because the alert condition is true and the optional for duration has passed. The alert will continue to fire as long as the condition remains true.
    • Pending. The alert is active but is waiting for the duration that is specified in the alerting rule before it fires.
    • Silenced. The alert is now silenced for a defined time period. Silences temporarily mute alerts based on a set of label selectors that you define. Notifications will not be sent for alerts that match all the listed values or regular expressions.
  • Severity filters:

    • Critical. The condition that triggered the alert could have a critical impact. The alert requires immediate attention when fired and is typically paged to an individual or to a critical response team.
    • Warning. The alert provides a warning notification about something that might require attention in order to prevent a problem from occurring. Warnings are typically routed to a ticketing system for non-immediate review.
    • Info. The alert is provided for informational purposes only.
    • None. The alert has no defined severity.
    • You can also create custom severity definitions for alerts relating to user-defined projects.
  • Source filters:

    • Platform. Platform-level alerts relate only to default OpenShift Container Platform projects. These projects provide core OpenShift Container Platform functionality.
    • User. User alerts relate to user-defined projects. These alerts are user-created and are customizable. User-defined workload monitoring can be enabled post-installation to provide observability into your own workloads.

Sending notifications to external systems

In OpenShift Container Platform 4.6, firing alerts can be viewed in the Alerting UI. Alerts are not configured by default to be sent to any notification systems. You can configure OpenShift Container Platform to send alerts to the following receiver types:

  • PagerDuty
  • Webhook
  • Email
  • Slack

Routing alerts to receivers enables you to send timely notifications to the appropriate teams when failures occur. For example, critical alerts require immediate attention and are typically paged to an individual or a critical response team. Alerts that provide non-critical warning notifications might instead be routed to a ticketing system for non-immediate review.

Set up Test Mail Server and Slack Channel

Before configuring Alertmanager to send email and Slack alerts, we need to run a webmail server with SMTP and set up a Slack channel for testing

  1. Deploy maildev in the default namespace and expose the webmail interface

    oc project default
    oc new-app --docker-image=maildev/maildev --name='maildev'
    oc expose service/maildev --port=80
    

    From this deployment, SMTP is reachable at maildev.default.svc.cluster.local:25, and the webmail interface is reachable at the route host shown by oc get route
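
    To print the webmail route host (assuming the route is named maildev, as created above):

    oc -n default get route maildev -o jsonpath='{.spec.host}{"\n"}'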

  2. Create a Slack channel webhook token, or use a prepared token that you already have

  3. Set up webhook notification with line-notify-gateway. You will need to set up the LINE Notify service to allow LINE notifications in your chat group. After setup you will receive a token to use in the Alertmanager webhook to the line-notify-gateway container, which is provided as an example that receives the Alertmanager webhook JSON and forwards it to the LINE Notify service.

    oc -n default new-app --docker-image=nontster/line-notify-gateway
    oc -n default get svc
    
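    As a hypothetical smoke test, you can post a minimal Alertmanager-style payload to the gateway from inside the cluster. The /v1/alertmanager/payload path and notify_token parameter are taken from the webhook URL used in the Alertmanager configuration below; replace <YOUR_TOKEN> with your LINE Notify token:

    oc -n default run curl-test --rm -i --restart=Never --image=curlimages/curl -- \
      curl -s -X POST \
      'http://line-notify-gateway.default.svc.cluster.local:18081/v1/alertmanager/payload?notify_token=<YOUR_TOKEN>' \
      -H 'Content-Type: application/json' \
      -d '{"status":"firing","alerts":[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"message":"test alert"}}]}'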

Configuring alert receivers

Procedure

  1. In the Administrator perspective, navigate to Administration → Cluster Settings → Global Configuration → Alertmanager.

  2. Apply the following alert receiver configuration example, which includes receivers for SMTP (to the webmail server), Slack, and line-notify-gateway:

     cat <<EOF | oc apply -f -
     kind: Secret
     apiVersion: v1
     metadata:
       name: alertmanager-main
       namespace: openshift-monitoring
     data:
       alertmanager.yaml: >-
         Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KICBzbXRwX2Zyb206IG9jcEBleGFtcGxlLmNvbQogIHNtdHBfc21hcnRob3N0OiAnbWFpbGRldi5kZWZhdWx0LnN2Yy5jbHVzdGVyLmxvY2FsOjI1JwogIHNtdHBfaGVsbG86IGxvY2FsaG9zdAogIHNtdHBfcmVxdWlyZV90bHM6IGZhbHNlCiAgc2xhY2tfYXBpX3VybDogPi0KICAgIGh0dHBzOi8vaG9va3Muc2xhY2suY29tL3NlcnZpY2VzL1QwMUJBNVoxUUczL0IwMUpEMEZTSzVXL01kWGJpTlpxRjVVTlZ1MWtReWE2dEFlRgppbmhpYml0X3J1bGVzOgogIC0gZXF1YWw6CiAgICAgIC0gbmFtZXNwYWNlCiAgICAgIC0gYWxlcnRuYW1lCiAgICBzb3VyY2VfbWF0Y2g6CiAgICAgIHNldmVyaXR5OiBjcml0aWNhbAogICAgdGFyZ2V0X21hdGNoX3JlOgogICAgICBzZXZlcml0eTogd2FybmluZ3xpbmZvCiAgLSBlcXVhbDoKICAgICAgLSBuYW1lc3BhY2UKICAgICAgLSBhbGVydG5hbWUKICAgIHNvdXJjZV9tYXRjaDoKICAgICAgc2V2ZXJpdHk6IHdhcm5pbmcKICAgIHRhcmdldF9tYXRjaF9yZToKICAgICAgc2V2ZXJpdHk6IGluZm8KcmVjZWl2ZXJzOgogIC0gbmFtZTogQ3JpdGljYWwKICAgIGVtYWlsX2NvbmZpZ3M6CiAgICAgIC0gdG86IGFkbWluQGV4YW1wbGUuY29tCiAgLSBuYW1lOiBEZWZhdWx0CiAgICBlbWFpbF9jb25maWdzOgogICAgICAtIHRvOiBhZG1pbkBleGFtcGxlLmNvbQogIC0gbmFtZTogbGluZS13YXJuaW5nCiAgICB3ZWJob29rX2NvbmZpZ3M6CiAgICAgIC0gdXJsOiA+LQogICAgICAgICAgaHR0cDovL2xpbmUtbm90aWZ5LWdhdGV3YXkuZGVmYXVsdC5zdmMuY2x1c3Rlci5sb2NhbDoxODA4MS92MS9hbGVydG1hbmFnZXIvcGF5bG9hZD9ub3RpZnlfdG9rZW49aG1OODlpUU1qTTNubE1IVkRaNnc3Y0t1S0RVaWZwd21tdzhvNGVwMTNTQwogIC0gbmFtZTogbGluZS13YXRjaGRvZwogICAgd2ViaG9va19jb25maWdzOgogICAgICAtIHVybDogPi0KICAgICAgICAgIGh0dHA6Ly9saW5lLW5vdGlmeS1nYXRld2F5LmRlZmF1bHQuc3ZjLmNsdXN0ZXIubG9jYWw6MTgwODEvdjEvYWxlcnRtYW5hZ2VyL3BheWxvYWQ/bm90aWZ5X3Rva2VuPWhtTjg5aVFNak0zbmxNSFZEWjZ3N2NLdUtEVWlmcHdtbXc4bzRlcDEzU0MKICAtIG5hbWU6IHNsYWNrCiAgICBzbGFja19jb25maWdzOgogICAgICAtIGNoYW5uZWw6ICcjYWxlcnQnCiAgLSBuYW1lOiBzbGFjay13YXJuaW5nCiAgICBzbGFja19jb25maWdzOgogICAgICAtIGNoYW5uZWw6ICcjYWxlcnQnCiAgLSBuYW1lOiBzbGFjay13YXRjaGRvZwogICAgc2xhY2tfY29uZmlnczoKICAgICAgLSBjaGFubmVsOiAnI2FsZXJ0JwogIC0gbmFtZTogV2FybmluZwogICAgZW1haWxfY29uZmlnczoKICAgICAgLSB0bzogd2FybmluZ0BleGFtcGxlLmNvbQogIC0gbmFtZTogV2F0Y2hkb2cKICAgIGVtYWlsX2NvbmZpZ3M6CiAgICAgIC0gdG86IHdhdGNoZG9nQGV4YW1wbGUuY29tCnJvdXRlOgogIGdyb3VwX2J5OgogICAgLSBuYW1lc3BhY2UKICBncm91cF9pbnRlcnZhbDogNW0KICBncm91cF93YWl0OiAzMHMKICByZWNlaXZlcjogRGVmYXVsdAogIHJlcGVhdF9pbnRlcnZhbDogMWgKICByb3V0ZXM6CiAgICAtIHJlY2VpdmVyOiBXYXRjaGRvZwogICAgICBtYXRjaDoKICAgICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICAtIHJlY2VpdmVyOiBDcml0aWNhbAogICAgICBtYXRjaDoKICAgICAgICBzZXZlcml0eTogY3JpdGljYWwKICAgIC0gcmVjZWl2ZXI6IFdhcm5pbmcKICAgICAgbWF0Y2g6CiAgICAgICAgc2V2ZXJpdHk6IHdhcm5pbmcKICAgIC0gcmVjZWl2ZXI6IHNsYWNrCiAgICAgIG1hdGNoOgogICAgICAgIHNldmVyaXR5OiBpbmZvCiAgICAtIHJlY2VpdmVyOiBzbGFjay13YXRjaGRvZwogICAgICBtYXRjaDoKICAgICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICAtIHJlY2VpdmVyOiBzbGFjay13YXJuaW5nCiAgICAgIG1hdGNoOgogICAgICAgIHNldmVyaXR5OiB3YXJuaW5nCiAgICAtIHJlY2VpdmVyOiBsaW5lLXdhcm5pbmcKICAgICAgbWF0Y2g6CiAgICAgICAgc2V2ZXJpdHk6IHdhcm5pbmcKICAgIC0gcmVjZWl2ZXI6IGxpbmUtd2F0Y2hkb2cKICAgICAgbWF0Y2g6CiAgICAgICAgYWxlcnRuYW1lOiBXYXRjaGRvZwo=
     type: Opaque
     EOF
    

    The same configuration content in plain text:

     global:
       resolve_timeout: 5m
       smtp_from: ocp@example.com
       smtp_smarthost: 'maildev.default.svc.cluster.local:25'
       smtp_hello: localhost
       smtp_require_tls: false
       slack_api_url: >-
         https://hooks.slack.com/services/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
     inhibit_rules:
       - equal:
           - namespace
           - alertname
         source_match:
           severity: critical
         target_match_re:
           severity: warning|info
       - equal:
           - namespace
           - alertname
         source_match:
           severity: warning
         target_match_re:
           severity: info
     receivers:
       - name: Critical
         email_configs:
           - to: admin@example.com
       - name: Default
         email_configs:
           - to: admin@example.com
       - name: slack
         slack_configs:
           - channel: '#alert'
       - name: slack-warning
         slack_configs:
           - channel: '#alert'
       - name: slack-watchdog
         slack_configs:
           - channel: '#alert'
       - name: Warning
         email_configs:
           - to: warning@example.com
       - name: Watchdog
         email_configs:
           - to: watchdog@example.com
     route:
       group_by:
         - namespace
       group_interval: 5m
       group_wait: 30s
       receiver: Default
       repeat_interval: 1h
       routes:
         - receiver: Watchdog
           match:
             alertname: Watchdog
         - receiver: Critical
           match:
             severity: critical
         - receiver: Warning
           match:
             severity: warning
         - receiver: slack
           match:
             severity: info
         - receiver: slack-watchdog
           match:
             alertname: Watchdog
         - receiver: slack-warning
           match:
             severity: warning
         - receiver: line-warning
           match:
             severity: warning
         - receiver: line-watchdog
           match:
             alertname: Watchdog
    
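    If you later edit the plain-text configuration, you can update the secret directly from a file instead of re-encoding the base64 by hand. A minimal sketch, assuming the configuration is saved locally as alertmanager.yaml:

    oc -n openshift-monitoring create secret generic alertmanager-main \
      --from-file=alertmanager.yaml --dry-run=client -o yaml \
      | oc -n openshift-monitoring replace secret --filename=-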

Testing Alerts

We will demonstrate alerts from Prometheus and Alertmanager using platform and user workload applications, with metrics from CPU, memory, and persistent volume resources.

Procedure

  • Deploy a sample httpd pod with a 1Gi persistent volume mounted at the /httpd-pvc path

    oc new-project user1
    cat << EOF | oc create -f -
    ---
    kind: Deployment
    apiVersion: apps/v1
    metadata:
      name: httpd-24
      namespace: user1
      labels:
        app: httpd-24
        app.kubernetes.io/component: httpd-24
        app.kubernetes.io/instance: httpd-24
        app.kubernetes.io/part-of: httpd-24-app
        app.openshift.io/runtime-namespace: user1
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: httpd-24
      template:
        metadata:
          labels:
            app: httpd-24
            deploymentconfig: httpd-24
        spec:
          volumes:
            - name: httpd-pvc
              persistentVolumeClaim:
                claimName: httpd-pvc
          containers:
            - name: httpd-24
              image: registry.redhat.io/rhel8/httpd-24
              ports:
                - containerPort: 8080
                  protocol: TCP
                - containerPort: 8443
                  protocol: TCP
              volumeMounts:
                - name: httpd-pvc
                  mountPath: /httpd-pvc
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: httpd-pvc
      namespace: user1
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      storageClassName: thin
      volumeMode: Filesystem
    EOF
    
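    Verify that the pod is running and the PVC is bound:

    oc -n user1 get pods,pvc
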
  • Use oc rsh to enter the httpd pod, check the PV size, and use the fallocate command to push PV utilization high (>95%)

    oc rsh $(oc -n user1 get pod | grep httpd | cut -d' ' -f1)
    

    Check volume utilization:

    df -h
    

    Result: /httpd-pvc shows 1% used

    Filesystem                            Size  Used Avail Use% Mounted on
    overlay                               120G   19G  102G  16% /
    tmpfs                                  64M     0   64M   0% /dev
    tmpfs                                 3.9G     0  3.9G   0% /sys/fs/cgroup
    shm                                    64M     0   64M   0% /dev/shm
    tmpfs                                 3.9G   47M  3.9G   2% /etc/passwd
    /dev/sdd                              976M  2.6M  958M   1% /httpd-pvc
    /dev/mapper/coreos-luks-root-nocrypt  120G   19G  102G  16% /etc/hosts
    tmpfs                                 3.9G   28K  3.9G   1% /run/secrets/kubernetes.io/serviceaccount
    tmpfs                                 3.9G     0  3.9G   0% /proc/acpi
    tmpfs                                 3.9G     0  3.9G   0% /proc/scsi
    tmpfs                                 3.9G     0  3.9G   0% /sys/firmware
    

    Use fallocate to create a 950M file:

    cd /httpd-pvc/
    fallocate -l 950M file.tmp
    

    Re-check /httpd-pvc; it now shows 100% used:

    df -h
    

    Result:

    Filesystem                            Size  Used Avail Use% Mounted on
    overlay                               120G   19G  102G  16% /
    tmpfs                                  64M     0   64M   0% /dev
    tmpfs                                 3.9G     0  3.9G   0% /sys/fs/cgroup
    shm                                    64M     0   64M   0% /dev/shm
    tmpfs                                 3.9G   47M  3.9G   2% /etc/passwd
    /dev/sdd                              976M  953M  7.4M 100% /httpd-pvc
    /dev/mapper/coreos-luks-root-nocrypt  120G   19G  102G  16% /etc/hosts
    tmpfs                                 3.9G   28K  3.9G   1% /run/secrets/kubernetes.io/serviceaccount
    tmpfs                                 3.9G     0  3.9G   0% /proc/acpi
    tmpfs                                 3.9G     0  3.9G   0% /proc/scsi
    tmpfs                                 3.9G     0  3.9G   0% /sys/firmware
    
  • For user workload monitoring, we need to add a role binding that grants permission to edit and view PrometheusRule, ServiceMonitor, and PodMonitor resources.

    Let's add the monitoring-edit role to the user ldapuser:

    oc policy add-role-to-user monitoring-edit ldapuser -n user1
    
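    Verify the role binding:

    oc -n user1 get rolebindings | grep monitoring-edit
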
  • Create a PrometheusRule to alert when PV utilization in the user1 namespace exceeds 80%

    cat << EOF | oc create -f -
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: pvfull-alert
      namespace: user1
    spec:
      groups:
      - name: PersistentVolumeFullAlertRule
        rules:
        - alert: PersistentVolumeFull
          expr: kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics"} / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics"} < 0.20
          for: 5m
          labels:
            severity: critical
          annotations:
            message: The PersistentVolume claimed is under 20 percent free.
    EOF
    
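    Before the alert fires, you can evaluate the same expression interactively (Developer perspective > Monitoring > Metrics) to confirm that the metrics are collected; for example, scoped to the test PVC:

    kubelet_volume_stats_available_bytes{namespace="user1", persistentvolumeclaim="httpd-pvc"}
      / kubelet_volume_stats_capacity_bytes{namespace="user1", persistentvolumeclaim="httpd-pvc"}
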
  • Now we can go back to the OpenShift console Developer perspective and navigate to the user1 project > Monitoring > Alerts

  • We will see our user-defined alerting rule and the alert that it triggers
  • You can click on the alert to see its details
  • When the alert is Firing, it is also routed to the external notification receivers defined in Alertmanager
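  • To confirm delivery, open the maildev webmail route in a browser and check the Slack #alert channel (and the LINE group, if configured) for the PersistentVolumeFull notification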
