Job:
OCPBUGS-22382: Image registry experiencing disruption during vSphere serial jobs (CLOSED)
Issue 15580732: Image registry experiencing disruption during vSphere serial jobs
Description: [Image registry disruption|https://grafana-loki.ci.openshift.org/d/ISnBj4LVk/disruption?orgId=1&var-platform=vsphere&var-percentile=P75&var-backend=image-registry-new-connections&var-backend=image-registry-reused-connections&var-releases=4.15&var-upgrade_type=none&var-networks=ovn&var-networks=sdn&var-topologies=ha&var-architectures=amd64&var-min_job_runs=10&var-lookback=1&var-min_disruption_regression=5&var-min_disruption_job_list=5&var-master_nodes_updated=N&var-master_nodes_updated=&var-master_nodes_updated=Y&from=now-30d&to=now] has surfaced again, specifically on vSphere serial jobs. It does not happen on every run, but it shows up in somewhere between 50% and 75% of runs.
 
 We dug in on TRT-1318 and found that some serial tests taint nodes, and that only one registry replica runs on vSphere. If a test happens to pick the worker where the registry is running, the registry goes down for 10-50s. This is likely why we see disruption in 50-75% of runs; the actual value is probably around 66%, since the tests select from 3 worker nodes and the registry runs on one of them.
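The observed 50-75% rate is consistent with a few independent node picks per run. A back-of-the-envelope sketch (the function name and the assumption of uniform, independent worker selection are mine, not from the investigation):

```python
# Rough model: the single registry replica lives on one of `workers` nodes,
# and each taint test independently picks one worker uniformly at random.
def disruption_probability(workers: int, taint_tests: int) -> float:
    """Chance that at least one taint test lands on the registry's worker."""
    return 1 - ((workers - 1) / workers) ** taint_tests

# With 3 workers: one taint test -> ~33%, two -> ~56%, three -> ~70%,
# which brackets the 50-75% disruption rate seen in the Grafana data.
for n in (1, 2, 3):
    print(n, disruption_probability(3, n))
```

With the two NoExecuteTaintManager tests listed below each picking a node, this lands in the observed range.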
 
 The tests we found running when the disruption occurred were things like: 
 
 [sig-node] NoExecuteTaintManager Single Pod [Serial] eventually evict pod with finite tolerations from tainted nodes [Skipped:SingleReplicaTopology] [Suite:openshift/conformance/serial] [Suite:k8s]
 
 [sig-node] NoExecuteTaintManager Multiple Pods [Serial] only evicts pods without tolerations from tainted nodes [Skipped:SingleReplicaTopology] [Suite:openshift/conformance/serial] [Suite:k8s]
 
 Other clouds run these tests in their serial suites as well, but I found that they were all running two registry replicas.
 
 Similar to https://issues.redhat.com/browse/OCPBUGS-18596, we need a fix to ensure two registry replicas are running in vSphere serial suites.
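For illustration only, scaling the registry can be done through the image registry operator's cluster config (a config-only sketch; the actual fix would presumably change the default replica count for vSphere in the installer/operator or CI setup rather than patch it per-cluster):

```shell
# Ask the image registry operator to run two replicas on this cluster.
oc patch configs.imageregistry.operator.openshift.io/cluster \
  --type merge -p '{"spec":{"replicas":2}}'
```

With two replicas spread across workers, evicting pods from any single tainted node should no longer take the registry backend down.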
 
 To verify, [check this link|https://grafana-loki.ci.openshift.org/d/ISnBj4LVk/disruption?orgId=1&var-platform=vsphere&var-percentile=P75&var-backend=image-registry-new-connections&var-backend=image-registry-reused-connections&var-releases=4.15&var-upgrade_type=none&var-networks=ovn&var-networks=sdn&var-topologies=ha&var-architectures=amd64&var-min_job_runs=10&var-lookback=1&var-min_disruption_regression=5&var-min_disruption_job_list=5&var-master_nodes_updated=N&var-master_nodes_updated=&var-master_nodes_updated=Y&from=now-30d&to=now] a couple of days after the fix merges and goes live. We should see the P75 near 0 (today it's 35s for new connections and about 15s for reused).
Status: CLOSED

Found in 0.00% of runs (0.00% of failures) across 267980 total runs and 11458 jobs (21.33% failed)