#1978829 | bug | 15 months ago | ClusterMonitoringOperatorReconciliationErrors is firing during upgrades and should not be | RELEASE_PENDING
Bug 1978829: ClusterMonitoringOperatorReconciliationErrors is firing during upgrades and should not be (Status: RELEASE_PENDING)
alert ClusterMonitoringOperatorReconciliationErrors fired for 60 seconds with labels: {severity="warning"}
It can be seen in ci-search that this is failing most of our "old-rhcos" job runs: https://search.ci.openshift.org/?search=ClusterMonitoringOperatorReconciliationErrors&maxAge=336h&context=1&type=bug%2Bjunit&name=release-openshift-origin-installer-old-rhcos-e2e-aws-4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
- alert: ClusterMonitoringOperatorReconciliationErrors
  annotations:
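For reference, a minimal sketch of what a PrometheusRule carrying this alert could look like. The metric names, expression, threshold, and `for` duration below are illustrative assumptions, not the rule actually shipped by the cluster-monitoring-operator:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-monitoring-operator-rules-example  # hypothetical name
  namespace: openshift-monitoring
spec:
  groups:
  - name: cluster-monitoring-operator
    rules:
    - alert: ClusterMonitoringOperatorReconciliationErrors
      annotations:
        summary: Cluster Monitoring Operator is experiencing reconciliation errors.  # illustrative wording
      expr: |
        # assumption: CMO exposes counters for reconciliation attempts and errors
        rate(cluster_monitoring_operator_reconcile_errors_total[15m])
          / rate(cluster_monitoring_operator_reconcile_attempts_total[15m]) > 0.1
      for: 30m
      labels:
        severity: warning

An error-rate-over-attempt-rate ratio with a generous for: window is one way a rule like this can tolerate the occasional single reconciliation failure (see bug 1913398 below) without paging anyone.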
#1965545 | bug | 18 months ago | Pod stuck in ContainerCreating: Unit ...slice already exists | RELEASE_PENDING
alert ClusterMonitoringOperatorReconciliationErrors fired for 1429 seconds with labels: {severity="warning"}
alert ClusterNotUpgradeable fired for 942 seconds with labels: {condition="Upgradeable", endpoint="metrics", name="version", severity="warning"}
#1996755 | bug | 17 months ago | Pod stuck in ContainerCreating: Unit ...slice already exists | NEW
alert ClusterMonitoringOperatorReconciliationErrors fired for 1429 seconds with labels: {severity="warning"}
alert ClusterNotUpgradeable fired for 942 seconds with labels: {condition="Upgradeable", endpoint="metrics", name="version", severity="warning"}
#1932624 | bug | 18 months ago | ClusterMonitoringOperatorReconciliationErrors is pending at the end of an upgrade and probably should not be | RELEASE_PENDING
Bug 1932624: ClusterMonitoringOperatorReconciliationErrors is pending at the end of an upgrade and probably should not be (Status: RELEASE_PENDING)
... demonstrates this for the ClusterMonitoringOperatorReconciliationErrors alert, which is pending 1m after the upgrade is complete. I would expect reconciliation not to be pending, because CMO should handle normal disruption errors silently and other components (i.e. the control plane) should not disrupt CMO during an upgrade. I *suspect* this is because of the known GCP issue where some API requests are disrupted, so feel free to blame this on https://bugzilla.redhat.com/show_bug.cgi?id=1925698 for now.
#1999148 | bug | 16 months ago | ClusterMonitoringOperatorReconciliationErrors is firing during upgrades and should not be | RELEASE_PENDING
Bug 1999148: ClusterMonitoringOperatorReconciliationErrors is firing during upgrades and should not be (Status: RELEASE_PENDING)
Comment 15446787 by juzhao@redhat.com at 2021-09-01T05:32:08Z
Checked with the PR; the ClusterMonitoringOperatorReconciliationErrors detail is now:
- alert: ClusterMonitoringOperatorReconciliationErrors
  annotations:
#1980123 | bug | 15 months ago | 4.7 to 4.8 oVirt update CI failing: monitoring UpdatingkubeStateMetricsFailed with a Prometheus pod stuck in ContainerCreating due to unmounted volume | NEW
alert ClusterMonitoringOperatorReconciliationErrors fired for 6945 seconds with labels: {severity="warning"}
alert ClusterOperatorDegraded fired for 4958 seconds with labels: {condition="Degraded", endpoint="metrics", instance="192.168.213.132:9099", job="cluster-version-operator", name="monitoring", namespace="openshift-cluster-version", pod="cluster-version-operator-66f9c6c64-7dgf8", reason="UpdatingPrometheusK8SFailed", service="cluster-version-operator", severity="warning"}
#1913398 | bug | 2 years ago | Race condition updating grafana ClusterRoleBinding | ASSIGNED
This is why we have the ClusterMonitoringOperatorReconciliationErrors alert [1]. A single reconciliation failure can happen and it shouldn't have to notify anybody IMO.
#1940933 | bug | 18 months ago | [sig-arch] Check if alerts are firing during or after upgrade success: AggregatedAPIDown on v1beta1.metrics.k8s.io | RELEASE_PENDING
alert ClusterMonitoringOperatorReconciliationErrors pending for 1 seconds with labels: {__name__="ALERTS", container="kube-rbac-proxy", endpoint="https", instance="10.128.0.27:8443", job="cluster-monitoring-operator", namespace="openshift-monitoring", pod="cluster-monitoring-operator-6466d67f66-nz94k", service="cluster-monitoring-operator", severity="warning"} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1932624)
From the logs of the other failed job [1], I see that the issue isn't related to the ClusterMonitoringOperatorReconciliationErrors alert. This time, the alert is AggregatedAPIDown for the v1beta1.metrics.k8s.io API (which is prometheus-adapter).
Comment 14916635 by spasquie@redhat.com at 2021-03-29T09:07:54Z
The original description isn't a duplicate of bug 1932624: the former is about the ClusterMonitoringOperatorReconciliationErrors alert while this bug report is about AggregatedAPIDown. Reopening the bug and assigning to Damien.
Comment 14951290 by wking@redhat.com at 2021-04-09T22:39:33Z
alert AggregatedAPIDown fired for 90 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}
alert ClusterMonitoringOperatorReconciliationErrors fired for 300 seconds with labels: {severity="warning"}
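For context, AggregatedAPIDown fires when an aggregated APIService (here v1beta1.metrics.k8s.io, served by prometheus-adapter) reports as unavailable to the kube-apiserver aggregator. A minimal sketch of such a rule, assuming the apiserver's aggregator_unavailable_apiservice metric; the window and threshold are illustrative, not necessarily the shipped values:

- alert: AggregatedAPIDown
  annotations:
    description: An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has been reporting errors.  # illustrative wording
  expr: |
    # assumed metric: availability of the aggregated APIService over the last 10 minutes
    (1 - max by (name, namespace) (avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 85
  for: 5m
  labels:
    severity: warning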
#1952128 | bug | 20 months ago | Prometheus Pod stuck ContainerCreating | NEW
KubePodNotReady ClusterMonitoringOperatorReconciliationErrors KubeStatefulSetReplicasMismatch
#1992493 | bug | 15 months ago | 3 alerts have no annotations summary and description | RELEASE_PENDING
$ oc get prometheusrules -n openshift-monitoring -oyaml | grep -E -A5 'ClusterMonitoringOperatorReconciliationErrors|AlertmanagerReceiversNotConfigured|MultipleContainersOOMKilled'
- alert: ClusterMonitoringOperatorReconciliationErrors
  annotations:
Checked with 4.9.0-0.nightly-2021-08-19-184748, the reported alerts now have the summary and description parts:
# oc get prometheusrules -n openshift-monitoring -oyaml | grep -E -A10 'ClusterMonitoringOperatorReconciliationErrors|AlertmanagerReceiversNotConfigured|MultipleContainersOOMKilled'
- alert: ClusterMonitoringOperatorReconciliationErrors
  annotations:
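For illustration, this is roughly the shape of the -A10 output one would expect once the annotations are in place; the summary and description wording below is an assumption, not the exact text shipped in 4.9 (expr and for elided here):

- alert: ClusterMonitoringOperatorReconciliationErrors
  annotations:
    description: Errors are occurring during reconciliation cycles in the cluster-monitoring-operator; inspect its logs for the root cause.  # illustrative wording
    summary: Cluster Monitoring Operator is experiencing reconciliation errors.  # illustrative wording
  labels:
    severity: warning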
Found in 0.00% of runs (0.00% of failures) across 241001 total runs and 9217 jobs (20.12% failed).