infinite ResolveNow() loop on net transients #8154

Open
lev-lb opened this issue Mar 7, 2025 · 1 comment
Labels
Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. Status: Requires Reporter Clarification Type: Bug

lev-lb commented Mar 7, 2025

What version of gRPC are you using?

v1.70.0

What version of Go are you using (go version)?

go version go1.22.11 linux/amd64

What operating system (Linux, Windows, …) and version?

several Linux versions, including: kernel 4.15.0-112-generic on Ubuntu

What did you do?

gRPC client with a custom resolver (using pick_first LB), connecting to a multi-node server cluster. this codebase appeared to be working fine for about 6 years, up to and including gRPC v1.63.2 - or at least the issue below was never brought to our attention before. since upgrading to v1.70.0, the same code occasionally appears to enter an infinite loop of calls to the custom resolver's ResolveNow() method. the issue is non-deterministic, but AFAICT it has been observed 4-5 times in the past month, always on network transients (impaired connectivity to the gRPC server cluster).

What did you expect to see?

the problem appears some time after the initial conn was established, lost, re-established, etc. after a net hiccup, i'd expect the lower gRPC layers to keep trying to dial the addresses returned by ResolveNow() as per the backoff and connect timeout config - which it normally does, and has been doing for years. an example of this is at the top of the attached log.

What did you see instead?

occasionally (but rarely, despite forcibly injecting various network faults), the gRPC code enters an infinite loop of calling ResolveNow() at a rate of >15 thousand calls/second, pegging several CPU cores (>210% CPU on an otherwise near-idle machine). the process does not recover. the issue appears to manifest somewhere around here:

2025-03-07T12:48:28.072	INFO	[core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE
2025-03-07T12:48:28.072	INFO	resolver.ResolveNow()	{"lbgrpc": "1i855hu", "targets": "192.168.19.217:443,192.168.19.218:443,192.168.19.220:443,192.168.25.23:443,192.168.26.25:443,192.168.29.142:443,192.168.29.144:443,192.168.16.233:443,192.168.17.106:443,192.168.17.107:443,192.168.17.30:443,192.168.19.213:443,192.168.19.214:443,192.168.19.215:443", "calls-last-sec": 0}
2025-03-07T12:48:28.072	INFO	[transport] [client-transport 0xc000bc46c8] loopyWriter exiting with error: received GOAWAY with no active streams
2025-03-07T12:48:28.072	INFO	[pick-first-lb] [pick-first-lb 0xc000a37d70] Received SubConn state update: 0xc0002bb5e0, {ConnectivityState:IDLE ConnectionError:<nil> connectedAddress:{Addr: ServerName: Attributes:<nil> BalancerAttributes:<nil> Metadata:<nil>}}
2025-03-07T12:48:28.073	INFO	[core] blockingPicker: the picked transport is not ready, loop back to repick
2025-03-07T12:48:28.073	INFO	[core] [Channel #1]Resolver state updated: { "Addresses": [ { "Addr": "192.168.19.217:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.218:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.220:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.25.23:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.26.25:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.142:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.144:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.16.233:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.106:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.107:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.30:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.213:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.214:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.215:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null} ()
2025-03-07T12:48:28.073	INFO	[core] [Channel #1]Channel Connectivity change to IDLE
2025-03-07T12:48:28.073	INFO	[transport] [client-transport 0xc000bc46c8] Closing: connection error: desc = "error reading from server: read tcp 10.212.1.6:32844->192.168.19.215:443: use of closed network connection"
2025-03-07T12:48:28.073	INFO	[pick-first-lb] [pick-first-lb 0xc000a37d70] Received new config { "shuffleAddressList": false}, resolver state { "Addresses": [ { "Addr": "192.168.19.217:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.218:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.220:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.25.23:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.26.25:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.142:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.144:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.16.233:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.106:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.107:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.30:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.213:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.214:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.215:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null}
2025-03-07T12:48:28.073	INFO	[core] [Channel #1 SubChannel #2]addrConn: updateAddrs addrs (5 of 14): [{Addr: "192.168.19.217:443", ServerName: "", } {Addr: "192.168.19.218:443", ServerName: "", } {Addr: "192.168.19.220:443", ServerName: "", } {Addr: "192.168.25.23:443", ServerName: "", } {Addr: "192.168.26.25:443", ServerName: "", }]
2025-03-07T12:48:28.073	INFO	[core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2025-03-07T12:48:28.073	INFO	[core] [Channel #1 SubChannel #2]Subchannel picks a new address "192.168.19.215:443" to connect
2025-03-07T12:48:28.073	INFO	[pick-first-lb] [pick-first-lb 0xc000a37d70] Received SubConn state update: 0xc0002bb5e0, {ConnectivityState:CONNECTING ConnectionError:<nil> connectedAddress:{Addr: ServerName: Attributes:<nil> BalancerAttributes:<nil> Metadata:<nil>}}
2025-03-07T12:48:28.073	INFO	[core] [Channel #1]Channel Connectivity change to CONNECTING
2025-03-07T12:48:28.073	INFO	[core] [Channel #1 SubChannel #2]Subchannel picks a new address "192.168.19.217:443" to connect
2025-03-07T12:48:28.073	INFO	[core] Creating new client transport to "{Addr: \"192.168.19.215:443\", ServerName: \"lb-resolver\", }": connection error: desc = "transport: Error while dialing: dial tcp 192.168.19.215:443: operation was canceled"
2025-03-07T12:48:28.073	WARN	[core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "192.168.19.215:443", ServerName: "lb-resolver", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 192.168.19.215:443: operation was canceled"
2025-03-07T12:48:28.073	INFO	resolver.ResolveNow()	{"lbgrpc": "1i855hu", "targets": "192.168.19.218:443,192.168.19.220:443,192.168.25.23:443,192.168.26.25:443,192.168.29.142:443,192.168.29.144:443,192.168.16.233:443,192.168.17.106:443,192.168.17.107:443,192.168.17.30:443,192.168.19.213:443,192.168.19.214:443,192.168.19.215:443,192.168.19.217:443", "calls-last-sec": 1}
2025-03-07T12:48:28.074	INFO	[core] [Channel #1]Resolver state updated: { "Addresses": [ { "Addr": "192.168.19.218:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.220:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.25.23:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.26.25:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.142:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.144:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.16.233:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.106:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.107:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.30:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.213:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.214:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.215:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.217:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null} ()
2025-03-07T12:48:28.074	INFO	[pick-first-lb] [pick-first-lb 0xc000a37d70] Received new config { "shuffleAddressList": false}, resolver state { "Addresses": [ { "Addr": "192.168.19.218:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.220:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.25.23:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.26.25:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.142:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.144:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.16.233:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.106:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.107:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.30:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.213:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.214:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.215:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.217:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null}
2025-03-07T12:48:28.074	INFO	[core] [Channel #1 SubChannel #2]addrConn: updateAddrs addrs (5 of 14): [{Addr: "192.168.19.218:443", ServerName: "", } {Addr: "192.168.19.220:443", ServerName: "", } {Addr: "192.168.25.23:443", ServerName: "", } {Addr: "192.168.26.25:443", ServerName: "", } {Addr: "192.168.29.142:443", ServerName: "", }]
2025-03-07T12:48:28.074	INFO	[core] Creating new client transport to "{Addr: \"192.168.19.217:443\", ServerName: \"lb-resolver\", }": connection error: desc = "transport: Error while dialing: dial tcp 192.168.19.217:443: operation was canceled"
2025-03-07T12:48:28.074	WARN	[core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "192.168.19.217:443", ServerName: "lb-resolver", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 192.168.19.217:443: operation was canceled"
2025-03-07T12:48:28.074	INFO	resolver.ResolveNow()	{"lbgrpc": "1i855hu", "targets": "192.168.19.220:443,192.168.25.23:443,192.168.26.25:443,192.168.29.142:443,192.168.29.144:443,192.168.16.233:443,192.168.17.106:443,192.168.17.107:443,192.168.17.30:443,192.168.19.213:443,192.168.19.214:443,192.168.19.215:443,192.168.19.217:443,192.168.19.218:443", "calls-last-sec": 2}
2025-03-07T12:48:28.074	INFO	[core] [Channel #1 SubChannel #2]Subchannel picks a new address "192.168.19.218:443" to connect
2025-03-07T12:48:28.074	INFO	[core] [Channel #1]Resolver state updated: { "Addresses": [ { "Addr": "192.168.19.220:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.25.23:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.26.25:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.142:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.144:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.16.233:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.106:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.107:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.30:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.213:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.214:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.215:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.217:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.218:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null} ()
2025-03-07T12:48:28.074	INFO	[pick-first-lb] [pick-first-lb 0xc000a37d70] Received new config { "shuffleAddressList": false}, resolver state { "Addresses": [ { "Addr": "192.168.19.220:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.25.23:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.26.25:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.142:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.144:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.16.233:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.106:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.107:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.30:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.213:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.214:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.215:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.217:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.218:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null}
2025-03-07T12:48:28.074	INFO	[core] [Channel #1 SubChannel #2]addrConn: updateAddrs addrs (5 of 14): [{Addr: "192.168.19.220:443", ServerName: "", } {Addr: "192.168.25.23:443", ServerName: "", } {Addr: "192.168.26.25:443", ServerName: "", } {Addr: "192.168.29.142:443", ServerName: "", } {Addr: "192.168.29.144:443", ServerName: "", }]
2025-03-07T12:48:28.075	INFO	[core] Creating new client transport to "{Addr: \"192.168.19.218:443\", ServerName: \"lb-resolver\", }": connection error: desc = "transport: Error while dialing: dial tcp 192.168.19.218:443: operation was canceled"
2025-03-07T12:48:28.075	WARN	[core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "192.168.19.218:443", ServerName: "lb-resolver", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 192.168.19.218:443: operation was canceled"
2025-03-07T12:48:28.075	INFO	resolver.ResolveNow()	{"lbgrpc": "1i855hu", "targets": "192.168.25.23:443,192.168.26.25:443,192.168.29.142:443,192.168.29.144:443,192.168.16.233:443,192.168.17.106:443,192.168.17.107:443,192.168.17.30:443,192.168.19.213:443,192.168.19.214:443,192.168.19.215:443,192.168.19.217:443,192.168.19.218:443,192.168.19.220:443", "calls-last-sec": 3}
2025-03-07T12:48:28.075	INFO	[core] [Channel #1]Resolver state updated: { "Addresses": [ { "Addr": "192.168.25.23:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.26.25:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.142:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.144:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.16.233:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.106:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.107:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.30:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.213:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.214:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.215:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.217:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.218:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.220:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null} ()
2025-03-07T12:48:28.075	INFO	[core] [Channel #1 SubChannel #2]Subchannel picks a new address "192.168.19.220:443" to connect
2025-03-07T12:48:28.075	INFO	[pick-first-lb] [pick-first-lb 0xc000a37d70] Received new config { "shuffleAddressList": false}, resolver state { "Addresses": [ { "Addr": "192.168.25.23:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.26.25:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.142:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.29.144:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.16.233:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.106:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.107:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.17.30:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.213:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.214:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.215:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.217:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.218:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null }, { "Addr": "192.168.19.220:443", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Metadata": null } ], "Endpoints": null, "ServiceConfig": null, "Attributes": null}
2025-03-07T12:48:28.075	INFO	[core] [Channel #1 SubChannel #2]addrConn: updateAddrs addrs (5 of 14): [{Addr: "192.168.25.23:443", ServerName: "", } {Addr: "192.168.26.25:443", ServerName: "", } {Addr: "192.168.29.142:443", ServerName: "", } {Addr: "192.168.29.144:443", ServerName: "", } {Addr: "192.168.16.233:443", ServerName: "", }]

see the attached log; the relevant portion (reproduced above) starts at around the "PROBLEM MANIFESTS AROUND HERE" marker.

sample.2025-03-07.12-54-08.log.gz

the only unusual entry in the log that i see before this ResolveNow() infinite loop but don't normally find in the logs on any other net transients (which are handled gracefully) is this one: loopyWriter exiting with error: connection error: desc = "keepalive ping failed to receive ACK within timeout".

potentially relevant gRPC client options:

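// imports used by this fragment: "time", "google.golang.org/grpc",
// "google.golang.org/grpc/backoff", "google.golang.org/grpc/keepalive"

// keepalive: ping after 10s of inactivity, wait up to 10s for the ACK
kal := keepalive.ClientParameters{
    Time:                10 * time.Second,
    Timeout:             10 * time.Second,
    PermitWithoutStream: true, // send pings even with no active RPCs
}
// reconnect backoff: starts at 10ms, grows 5x per failure, capped at 3s
dialBackoffConfig := backoff.Config{
    BaseDelay:  10 * time.Millisecond,
    Multiplier: 5,
    Jitter:     0.1,
    MaxDelay:   3 * time.Second,
}
cp := grpc.ConnectParams{
    Backoff:           dialBackoffConfig,
    MinConnectTimeout: 1 * time.Second,
}
// customResolver is the custom resolver builder discussed in this issue
opts := []grpc.DialOption{
    grpc.WithBlock(),
    grpc.WithDisableRetry(),
    grpc.WithDefaultCallOptions(grpc.WaitForReady(true)),
    grpc.WithKeepaliveParams(kal),
    grpc.WithConnectParams(cp),
    grpc.WithResolvers(customResolver),
}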
@purnesh42H purnesh42H self-assigned this Mar 10, 2025
@purnesh42H purnesh42H added the Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. label Mar 10, 2025

purnesh42H commented Mar 10, 2025

@lev-lb are you using grpc.Dial or grpc.NewClient? If you have switched to grpc.NewClient, which was released in 1.64, the WithBlock option will be ignored. See the release notes: https://github.com/grpc/grpc-go/releases/tag/v1.64.0

From the logs I am seeing "operation was canceled", which means context cancelation by the caller, either due to a short timeout or manual cancelation. This is likely why the backoff never gets a chance to execute and the connection attempt is aborted immediately. If the cancelation happens while dialing the connection, all addresses are exhausted and the clientconn is reported as TRANSIENT_FAILURE, which triggers the reconnection process.

It's hard to debug further with just the logs. Could you provide your code showing how you are creating the client and how the cancelation is happening?

Also, do you have any backoff implemented in your custom name resolver?
