#OCPBUGS-27113 | issue | 4 weeks ago | Console blips Available=False with RouteHealth_FailedGet and such | Verified
Issue 15716399: Console blips Available=False with RouteHealth_FailedGet and such
Description: This is a clone of issue OCPBUGS-24041. The following is the description of the original issue:
 ---
 h2. Description
 
 Seen in 4.15-related update CI:
 {code:none}
 $ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/console.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False[^:]*: \(.*\)|\1 \2 \3|' | sed 's|[.]apps[.][^ /]*|.apps...|g' | sort | uniq -c | sort -n
       1 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... dial tcp 52.158.160.194:443: connect: connection refused
       1 console RouteHealth_StatusError route not yet available, https://console-openshift-console.apps... returns '503 Service Unavailable'
       2 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... dial tcp: lookup console-openshift-console.apps... on 172.30.0.10:53: no such host
       2 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... EOF
       8 console RouteHealth_RouteNotAdmitted console route is not admitted
      16 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... context deadline exceeded (Client.Timeout exceeded while awaiting headers)
 {code}
 For example [this 4.14 to 4.15 run|https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade/1729324067384725504] had:
 {code:none}
 : [bz-Management Console] clusteroperator/console should not change condition/Available 
 Run #0: Failed 	1h25m23s
 {  1 unexpected clusteroperator state transitions during e2e test run 
 
 Nov 28 03:42:41.207 - 1s    E clusteroperator/console condition/Available reason/RouteHealth_FailedGet status/False RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-op-d2qsp1gp-2a31d.aws-2.ci.openshift.org): Get "https://console-openshift-console.apps.ci-op-d2qsp1gp-2a31d.aws-2.ci.openshift.org": context deadline exceeded (Client.Timeout exceeded while awaiting headers)}
 {code}
 While a timeout for the console Route isn't fantastic, an issue that only persists for 1s is not long enough to warrant [immediate admin intervention|https://github.com/openshift/api/blob/c3f7566f6ef636bb7cf9549bf47112844285989e/config/v1/types_cluster_operator.go#L149-L153]. Teaching the console operator to stay {{Available=True}} through this kind of brief hiccup, while still going {{Available=False}} for issues where [at least part of the component is non-functional and the condition requires immediate administrator intervention|https://github.com/openshift/api/blob/c3f7566f6ef636bb7cf9549bf47112844285989e/config/v1/types_cluster_operator.go#L149-L153], would make it easier for admins and SREs operating clusters to identify when intervention is actually required.
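 
 One way to get there (a minimal sketch, not the actual console-operator change; the probe plumbing and the 30s grace window are illustrative assumptions) is to only flip the condition once a route-health failure has persisted past a grace period:
 {code:go}
 package main
 
 import (
 	"fmt"
 	"time"
 )
 
 // availabilityDebouncer reports Available=False only after route-health
 // probes have been failing continuously for longer than gracePeriod, so
 // one-sync hiccups (DNS blips, connection resets, 1s timeouts) do not
 // flip the condition.
 type availabilityDebouncer struct {
 	gracePeriod  time.Duration
 	failingSince time.Time // zero while the route is healthy
 }
 
 // observe takes one probe result and returns the Available status the
 // operator should report.
 func (d *availabilityDebouncer) observe(healthy bool, now time.Time) bool {
 	if healthy {
 		d.failingSince = time.Time{} // any success resets the clock
 		return true
 	}
 	if d.failingSince.IsZero() {
 		d.failingSince = now // first failure: start the clock
 	}
 	// Stay Available=True until the outage outlives the grace period.
 	return now.Sub(d.failingSince) < d.gracePeriod
 }
 
 func main() {
 	d := &availabilityDebouncer{gracePeriod: 30 * time.Second}
 	start := time.Now()
 	fmt.Println(d.observe(false, start))                     // true: blip just started
 	fmt.Println(d.observe(false, start.Add(time.Second)))    // true: only 1s of failure so far
 	fmt.Println(d.observe(false, start.Add(45*time.Second))) // false: outage outlived the grace period
 }
 {code}
 {{Degraded}} or {{Progressing}} could still surface the failure immediately; only the {{Available}} flip would be debounced, which keeps the "immediate admin intervention" semantics of the condition intact.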
 
 h2. Version-Release number of selected component
 
 At least 4.15.  Possibly other versions; I haven't checked.
 
 h2. How reproducible
 
 {code:none}
 $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/console.*condition/Available.*status/False' | grep 'periodic.*failures match' | sort
 periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 12 runs, 17% failed, 50% of failures match = 8% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-ppc64le (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 12 runs, 17% failed, 100% of failures match = 17% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 12 runs, 25% failed, 33% of failures match = 8% impact
 periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 23% failed, 28% of failures match = 6% impact
 periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 28% failed, 23% of failures match = 6% impact
 periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade (all) - 63 runs, 38% failed, 8% of failures match = 3% impact
 periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade (all) - 60 runs, 73% failed, 11% of failures match = 8% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 70 runs, 7% failed, 20% of failures match = 1% impact
 {code}
 
 It seems like it's primarily minor-version updates that trip this, and in the jobs with high run counts the impact percentage is in the single digits.
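 
 For context on reading those rows, the impact column is the failure rate times the match rate (a back-of-the-envelope check, assuming that is how search.ci computes it). For the 80-run aws-ovn job above:
 {code:go}
 package main
 
 import "fmt"
 
 func main() {
 	// 80 runs, 23% failed, 28% of failures match the search.
 	runs, failed, match := 80.0, 0.23, 0.28
 	fmt.Printf("≈ %.1f matching runs out of %.0f = %.0f%% impact\n",
 		runs*failed*match, runs, failed*match*100)
 	// prints: ≈ 5.2 matching runs out of 80 = 6% impact
 }
 {code}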
 
 h2. Steps to reproduce
 
 There may be a way to reliably trigger these hiccups, but as a reproducer floor, running days of CI and checking whether the impact percentages decrease would be a reasonable way to test fixes post-merge.
 
 h2. Actual results
 
 The {{console}} ClusterOperator frequently blips {{Available=False}} in 4.15 update CI.
 
 h2. Expected results
 
 Console goes {{Available=False}} if and only if immediate admin intervention is appropriate.
Status: Verified
#OCPBUGS-24041 | issue | 2 months ago | Console blips Available=False with RouteHealth_FailedGet and such | Verified
Issue 15644579: Console blips Available=False with RouteHealth_FailedGet and such
Description: This is the original issue that OCPBUGS-27113 clones; its description is identical to the one quoted above.
Status: Verified
periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
#1790791978791735296 | junit | 3 days ago
[bz-Image Registry] clusteroperator/image-registry should not change condition/Available
[bz-Management Console] clusteroperator/console should not change condition/Available
[bz-Node Tuning Operator] clusteroperator/node-tuning should not change condition/Available
#1790791978791735296 | junit | 3 days ago
[bz-Management Console] clusteroperator/console should not change condition/Available
2 unexpected clusteroperator state transitions during e2e test run

Found in 100.00% of runs (100.00% of failures) across 1 total run and 1 job (100.00% failed)