infinite ResolveNow() loop on net transients #8154
Labels
Area: Resolvers/Balancers
Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities.
Status: Requires Reporter Clarification
Type: Bug
What version of gRPC are you using?
v1.70.0
What version of Go are you using (go version)?
go version go1.22.11 linux/amd64
What operating system (Linux, Windows, …) and version?
several Linux versions, including: kernel 4.15.0-112-generic on Ubuntu
What did you do?
gRPC client with a custom resolver (using pick_first LB), connecting to a multi-node server cluster. this codebase appeared to be working fine for about 6 years, up to and including gRPC v1.63.2 - or at least the issue below was never brought to our attention before. since upgrading to v1.70.0, the same code appears to occasionally enter an infinite loop calling the custom resolver's ResolveNow() method. the issue is non-deterministic, but has been observed 4-5 times in the past month, AFAICT - always on network transients (impaired connectivity to the gRPC server cluster).
What did you expect to see?
the problem appears some time after the initial conn was established, lost, re-established, etc. after a net hiccup, i'd expect the lower gRPC layers to keep trying to dial the addresses returned by ResolveNow() as per the backoff and connect timeout config - which they normally do, and have been doing for years. an example of this is at the top of the attached log.
What did you see instead?
occasionally (but rarely, despite forcibly injecting various network faults), the gRPC code enters an infinite loop of calling ResolveNow() at a rate of >15 thousand calls/second, pegging several CPU cores (>210% CPU on an otherwise near-idle machine). the process does not recover. the issue appears to manifest somewhere around here:
see attached log, the relevant portion (reproduced above) starts at around PROBLEM MANIFESTS AROUND HERE text.
sample.2025-03-07.12-54-08.log.gz
the only unusual entry in the log that i see before this ResolveNow() infinite loop, but don't normally find in the logs on any other net transients (which are handled gracefully), is this one:
loopyWriter exiting with error: connection error: desc = "keepalive ping failed to receive ACK within timeout"
potentially relevant gRPC client options: