#1978829 bug 15 months ago ClusterMonitoringOperatorReconciliationErrors is firing during upgrades and should not be RELEASE_PENDING
Bug 1978829: ClusterMonitoringOperatorReconciliationErrors is firing during upgrades and should not be
Status: RELEASE_PENDING
alert ClusterMonitoringOperatorReconciliationErrors fired for 60 seconds with labels: {severity="warning"}
It can be seen in ci-search that this is failing most of our "old-rhcos" job runs:
https://search.ci.openshift.org/?search=ClusterMonitoringOperatorReconciliationErrors&maxAge=336h&context=1&type=bug%2Bjunit&name=release-openshift-origin-installer-old-rhcos-e2e-aws-4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
      - alert: ClusterMonitoringOperatorReconciliationErrors
        annotations:
#1965545 bug 18 months ago Pod stuck in ContainerCreating: Unit ...slice already exists RELEASE_PENDING
alert ClusterMonitoringOperatorReconciliationErrors fired for 1429 seconds with labels: {severity="warning"}
alert ClusterNotUpgradeable fired for 942 seconds with labels: {condition="Upgradeable", endpoint="metrics", name="version", severity="warning"}
#1996755 bug 17 months ago Pod stuck in ContainerCreating: Unit ...slice already exists NEW
alert ClusterMonitoringOperatorReconciliationErrors fired for 1429 seconds with labels: {severity="warning"}
alert ClusterNotUpgradeable fired for 942 seconds with labels: {condition="Upgradeable", endpoint="metrics", name="version", severity="warning"}
#1932624 bug 18 months ago ClusterMonitoringOperatorReconciliationErrors is pending at the end of an upgrade and probably should not be RELEASE_PENDING
Bug 1932624: ClusterMonitoringOperatorReconciliationErrors is pending at the end of an upgrade and probably should not be
Status: RELEASE_PENDING
demonstrates this for ClusterMonitoringOperatorReconciliationErrors, which is pending 1m after the upgrade is complete. I would expect reconciliation not to be pending, because CMO should handle normal disruption errors silently and other components (i.e. the control plane) should not disrupt CMO during upgrade. I *suspect* this is because of the known GCP issue where some API requests are disrupted, so feel free to blame this on https://bugzilla.redhat.com/show_bug.cgi?id=1925698 for now.
#1999148 bug 16 months ago ClusterMonitoringOperatorReconciliationErrors is firing during upgrades and should not be RELEASE_PENDING
Bug 1999148: ClusterMonitoringOperatorReconciliationErrors is firing during upgrades and should not be
Status: RELEASE_PENDING
Comment 15446787 by juzhao@redhat.com at 2021-09-01T05:32:08Z
Checked with the PR; the ClusterMonitoringOperatorReconciliationErrors detail is now:
        - alert: ClusterMonitoringOperatorReconciliationErrors
          annotations:
#1980123 bug 15 months ago 4.7 to 4.8 oVirt update CI failing: monitoring UpdatingkubeStateMetricsFailed with a Prometheus pod stuck in ContainerCreating due to unmounted volume NEW
    alert ClusterMonitoringOperatorReconciliationErrors fired for 6945 seconds with labels: {severity="warning"}
    alert ClusterOperatorDegraded fired for 4958 seconds with labels: {condition="Degraded", endpoint="metrics", instance="192.168.213.132:9099", job="cluster-version-operator", name="monitoring", namespace="openshift-cluster-version", pod="cluster-version-operator-66f9c6c64-7dgf8", reason="UpdatingPrometheusK8SFailed", service="cluster-version-operator", severity="warning"}
#1913398 bug 2 years ago Race condition updating grafana ClusterRoleBinding ASSIGNED
This is why we have the ClusterMonitoringOperatorReconciliationErrors alert [1]. A single reconciliation failure can happen and it shouldn't have to notify anybody IMO.
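For illustration, a minimal sketch of the kind of rule this implies, where a "for" clause keeps one-off reconciliation failures from notifying anyone. The metric name, window, and duration below are assumptions for illustration, not necessarily the exact rule shipped by cluster-monitoring-operator:

      - alert: ClusterMonitoringOperatorReconciliationErrors
        # Assumed expression: the gauge reports whether the last reconciliation
        # succeeded, and the alert only fires if it keeps failing for the whole
        # "for" duration, so a single failed cycle stays silent.
        expr: max_over_time(cluster_monitoring_operator_last_reconciliation_successful[5m]) == 0
        for: 1h
        labels:
          severity: warning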
#1940933 bug 18 months ago [sig-arch] Check if alerts are firing during or after upgrade success: AggregatedAPIDown on v1beta1.metrics.k8s.io RELEASE_PENDING
    alert ClusterMonitoringOperatorReconciliationErrors pending for 1 seconds with labels: {__name__="ALERTS", container="kube-rbac-proxy", endpoint="https", instance="10.128.0.27:8443", job="cluster-monitoring-operator", namespace="openshift-monitoring", pod="cluster-monitoring-operator-6466d67f66-nz94k", service="cluster-monitoring-operator", severity="warning"} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1932624)
From the logs of the other failed job [1], I see that the issue isn't related to the ClusterMonitoringOperatorReconciliationErrors alert. This time, the alert is AggregatedAPIDown for the v1beta1.metrics.k8s.io API (which is prometheus-adapter).
Comment 14916635 by spasquie@redhat.com at 2021-03-29T09:07:54Z
The original description isn't a duplicate of bug 1932624: the former is about the ClusterMonitoringOperatorReconciliationErrors alert while this bug report is about AggregatedAPIDown. Reopening the bug and assigning to Damien.
Comment 14951290 by wking@redhat.com at 2021-04-09T22:39:33Z
    alert AggregatedAPIDown fired for 90 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}
    alert ClusterMonitoringOperatorReconciliationErrors fired for 300 seconds with labels: {severity="warning"}
#1952128 bug 20 months ago Prometheus Pod stuck ContainerCreating NEW
KubePodNotReady
ClusterMonitoringOperatorReconciliationErrors
KubeStatefulSetReplicasMismatch
#1992493 bug 15 months ago 3 alerts have no annotations summary and description RELEASE_PENDING
$ oc get prometheusrules -n openshift-monitoring -oyaml |grep -E -A5 'ClusterMonitoringOperatorReconciliationErrors|AlertmanagerReceiversNotConfigured|MultipleContainersOOMKilled'
      - alert: ClusterMonitoringOperatorReconciliationErrors
        annotations:
Checked with 4.9.0-0.nightly-2021-08-19-184748; the reported alerts now have the summary and description parts:
# oc get prometheusrules -n openshift-monitoring -oyaml |grep -E -A10 'ClusterMonitoringOperatorReconciliationErrors|AlertmanagerReceiversNotConfigured|MultipleContainersOOMKilled'
      - alert: ClusterMonitoringOperatorReconciliationErrors
        annotations:
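For reference, a sketch of the shape such a rule takes once the summary and description annotations are present; the annotation text below is placeholder wording for illustration, not a verbatim copy of the rule shipped in that nightly:

      - alert: ClusterMonitoringOperatorReconciliationErrors
        annotations:
          # Placeholder wording; check the PrometheusRule object in the running
          # cluster for the exact shipped text.
          description: <what is failing and where to look, e.g. the cluster-monitoring-operator logs>
          summary: <one-line statement that CMO reconciliation is hitting errors>
        labels:
          severity: warning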

Found in 0.00% of runs (0.00% of failures) across 241001 total runs and 9217 jobs (20.12% failed).