#OCPBUGS-27113 | issue | 4 weeks ago | Console blips Available=False with RouteHealth_FailedGet and such | Verified
Issue 15716399: Console blips Available=False with RouteHealth_FailedGet and such
Description: This is a clone of issue OCPBUGS-24041. The following is the description of the original issue:
 ---
 h2. Description
 
 Seen in 4.15-related update CI:
 {code:none}
 $ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/console.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False[^:]*: \(.*\)|\1 \2 \3|' | sed 's|[.]apps[.][^ /]*|.apps...|g' | sort | uniq -c | sort -n
       1 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... dial tcp 52.158.160.194:443: connect: connection refused
       1 console RouteHealth_StatusError route not yet available, https://console-openshift-console.apps... returns '503 Service Unavailable'
       2 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... dial tcp: lookup console-openshift-console.apps... on 172.30.0.10:53: no such host
       2 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... EOF
       8 console RouteHealth_RouteNotAdmitted console route is not admitted
      16 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... context deadline exceeded (Client.Timeout exceeded while awaiting headers)
 {code}
 For example [this 4.14 to 4.15 run|https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade/1729324067384725504] had:
 {code:none}
 : [bz-Management Console] clusteroperator/console should not change condition/Available 
 Run #0: Failed 	1h25m23s
 {  1 unexpected clusteroperator state transitions during e2e test run 
 
 Nov 28 03:42:41.207 - 1s    E clusteroperator/console condition/Available reason/RouteHealth_FailedGet status/False RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-op-d2qsp1gp-2a31d.aws-2.ci.openshift.org): Get "https://console-openshift-console.apps.ci-op-d2qsp1gp-2a31d.aws-2.ci.openshift.org": context deadline exceeded (Client.Timeout exceeded while awaiting headers)}
 {code}
 While a timeout for the console Route isn't fantastic, an issue that only persists for 1s is not long enough to warrant [immediate admin intervention|https://github.com/openshift/api/blob/c3f7566f6ef636bb7cf9549bf47112844285989e/config/v1/types_cluster_operator.go#L149-L153]. Teaching the console operator to stay {{Available=True}} through this kind of brief hiccup, while still going {{Available=False}} for issues where [at least part of the component is non-functional and the condition requires immediate administrator intervention|https://github.com/openshift/api/blob/c3f7566f6ef636bb7cf9549bf47112844285989e/config/v1/types_cluster_operator.go#L149-L153], would make it easier for admins and SREs operating clusters to identify when intervention is actually required.
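 
 One way to get there (a minimal sketch, not the actual console-operator change; the probe plumbing and the 30s grace window are illustrative assumptions) is to only flip the condition once a route-health failure has persisted past a grace period:
 {code:go}
 package main
 
 import (
 	"fmt"
 	"time"
 )
 
 // availabilityDebouncer reports Available=False only after route-health
 // probes have been failing continuously for longer than gracePeriod, so
 // one-sync hiccups (DNS blips, connection resets, 1s timeouts) do not
 // flip the condition.
 type availabilityDebouncer struct {
 	gracePeriod  time.Duration
 	failingSince time.Time // zero while the route is healthy
 }
 
 // observe takes one probe result and returns the Available status the
 // operator should report.
 func (d *availabilityDebouncer) observe(healthy bool, now time.Time) bool {
 	if healthy {
 		d.failingSince = time.Time{} // any success resets the clock
 		return true
 	}
 	if d.failingSince.IsZero() {
 		d.failingSince = now // first failure: start the clock
 	}
 	// Stay Available=True until the outage outlives the grace period.
 	return now.Sub(d.failingSince) < d.gracePeriod
 }
 
 func main() {
 	d := &availabilityDebouncer{gracePeriod: 30 * time.Second}
 	start := time.Now()
 	fmt.Println(d.observe(false, start))                     // true: blip just started
 	fmt.Println(d.observe(false, start.Add(time.Second)))    // true: only 1s of failure so far
 	fmt.Println(d.observe(false, start.Add(45*time.Second))) // false: outage outlived the grace period
 }
 {code}
 {{Degraded}} or {{Progressing}} could still surface the failure immediately; only the {{Available}} flip would be debounced, which keeps the "immediate admin intervention" semantics of the condition intact.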
 
 h2. Version-Release number of selected component
 
 At least 4.15.  Possibly other versions; I haven't checked.
 
 h2. How reproducible
 
 {code:none}
 $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/console.*condition/Available.*status/False' | grep 'periodic.*failures match' | sort
 periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 12 runs, 17% failed, 50% of failures match = 8% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-ppc64le (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 12 runs, 17% failed, 100% of failures match = 17% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 12 runs, 25% failed, 33% of failures match = 8% impact
 periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 23% failed, 28% of failures match = 6% impact
 periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 28% failed, 23% of failures match = 6% impact
 periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade (all) - 63 runs, 38% failed, 8% of failures match = 3% impact
 periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade (all) - 60 runs, 73% failed, 11% of failures match = 8% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 70 runs, 7% failed, 20% of failures match = 1% impact
 {code}
 
 It seems like it's primarily minor-version updates that trip this, and in the jobs with high run counts the impact percentage is in the single digits.
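 
 For context on reading those rows, the impact column is the failure rate times the match rate (a back-of-the-envelope check, assuming that is how search.ci computes it). For the 80-run aws-ovn job above:
 {code:go}
 package main
 
 import "fmt"
 
 func main() {
 	// 80 runs, 23% failed, 28% of failures match the search.
 	runs, failed, match := 80.0, 0.23, 0.28
 	fmt.Printf("≈ %.1f matching runs out of %.0f = %.0f%% impact\n",
 		runs*failed*match, runs, failed*match*100)
 	// prints: ≈ 5.2 matching runs out of 80 = 6% impact
 }
 {code}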
 
 h2. Steps to reproduce
 
 There may be a way to reliably trigger these hiccups, but as a reproducer floor, running days of CI and checking whether the impact percentages decrease would be a reasonable way to test fixes post-merge.
 
 h2. Actual results
 
 The {{console}} ClusterOperator frequently blips {{Available=False}} in 4.15 update CI.
 
 h2. Expected results
 
 Console goes {{Available=False}} if and only if immediate admin intervention is appropriate.
Status: Verified
#OCPBUGS-24041 | issue | 2 months ago | Console blips Available=False with RouteHealth_FailedGet and such | Verified
Issue 15644579: Console blips Available=False with RouteHealth_FailedGet and such
Description: This is the original issue that OCPBUGS-27113 clones; its description is identical to the one quoted above.
Status: Verified
periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
#1790791978791735296 | junit | 3 days ago
[bz-Image Registry] clusteroperator/image-registry should not change condition/Available
[bz-Management Console] clusteroperator/console should not change condition/Available
[bz-Node Tuning Operator] clusteroperator/node-tuning should not change condition/Available
#1790791978791735296 | junit | 3 days ago
[bz-Management Console] clusteroperator/console should not change condition/Available
2 unexpected clusteroperator state transitions during e2e test run

Found in 100.00% of runs (100.00% of failures) across 1 total run and 1 job (100.00% failed)