
panic: assignment to entry in nil map in channelz.(*SubChannel).addChild #8110

Open

Cluas opened this issue Feb 20, 2025 · 34 comments
Assignees: arjan-bal
Labels: Area: Observability · Status: Requires Reporter Clarification · Type: Bug

Comments

Cluas commented Feb 20, 2025


What version of gRPC are you using?

v1.67.1

What version of Go are you using (go version)?

go 1.22

What operating system (Linux, Windows, …) and version?

linux amd64

What did you do?

If possible, provide a recipe for reproducing the error.

This is an occasional bug that occurs when an upstream service we are connected to via a stream restarts.

What did you expect to see?

No panic. Because this happens in a separate goroutine, the panic directly caused a large number of restarts of our prod services. (And what is the cause of the connection abnormality in the first place?)

func (acbw *acBalancerWrapper) Connect() {
	go acbw.ac.connect()
}

What did you see instead?

A panic (stack trace in the comment below).

Cluas (Author) commented Feb 20, 2025

Panic log:

panic: assignment to entry in nil map

goroutine 2220205 [running]:
google.golang.org/grpc/internal/channelz.(*SubChannel).addChild(0x2ed4280?, 0x8fc, {0x3d743e0, 0xc102e1fd10?})
	/go/pkg/mod/google.golang.org/[email protected]/internal/channelz/subchannel.go:64 +0x9f
google.golang.org/grpc/internal/channelz.(*channelMap).addSocket(0x5aab080, 0xc102e1fd10)
	/go/pkg/mod/google.golang.org/[email protected]/internal/channelz/channelmap.go:116 +0x1d9
google.golang.org/grpc/internal/channelz.RegisterSocket(0xc102e1fd10)
	/go/pkg/mod/google.golang.org/[email protected]/internal/channelz/funcs.go:193 +0x48
google.golang.org/grpc/internal/transport.newHTTP2Client({_, _}, {_, _}, {{0xc0007a7c0f, 0x22}, {0xc0007a7c0f, 0x22}, 0x0, 0x0, ...}, ...)
	/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_client.go:359 +0x1771
google.golang.org/grpc/internal/transport.NewClientTransport(...)
	/go/pkg/mod/google.golang.org/[email protected]/internal/transport/transport.go:722
google.golang.org/grpc.(*addrConn).createTransport(_, {_, _}, {{0xc0007a7c0f, 0x22}, {0xc0007a7c0f, 0x22}, 0x0, 0x0, {0x0, ...}}, ...)
	/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:1372 +0x2e5
google.golang.org/grpc.(*addrConn).tryAllAddrs(0xc000f24008, {0x3d61500, 0xc000d2ab40}, {0xc0012bf440, 0x1, 0xc0452c5e00?}, {0xc0db1e7310?, 0xc0f5eeb381?, 0x5aabe20?})
	/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:1319 +0x408
google.golang.org/grpc.(*addrConn).resetTransportAndUnlock(0xc000f24008)
	/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:1253 +0x1cc
google.golang.org/grpc.(*addrConn).connect(0xc000f24008)
	/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:912 +0x145
created by google.golang.org/grpc.(*acBalancerWrapper).Connect in goroutine 480
	/go/pkg/mod/google.golang.org/[email protected]/balancer_wrapper.go:300 +0x56

arjan-bal self-assigned this Feb 20, 2025

arjan-bal (Contributor) commented Feb 20, 2025

The channelz.SubChannel.sockets map that is causing the panic is initialized only when channelz is turned on:

if !IsOn() {
	return sc
}
sc.sockets = make(map[int64]string)

This means that channelz was turned off when the subchannel was created.

The panic happens when registering a socket for the SubChannel. This code path should get triggered only when channelz is turned on:

if IsOn() {
	db.addSocket(skt)
}

Theoretically, this bug can happen if channelz is toggled between subchannel creation and transport creation. I'm trying to repro this by inserting delays in test code.
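
For illustration only, here is a minimal, self-contained Go sketch of the failure mode. This is not the actual gRPC code: the subChannel type and channelzOn flag are made-up stand-ins for channelz.SubChannel and channelz.IsOn().

package main

type subChannel struct {
	sockets map[int64]string // allocated only if the flag is on at creation time
}

var channelzOn bool // stand-in for channelz.IsOn(); flips from false to true at runtime

func newSubChannel() *subChannel {
	sc := &subChannel{}
	if !channelzOn {
		return sc // sockets stays nil
	}
	sc.sockets = make(map[int64]string)
	return sc
}

func main() {
	sc := newSubChannel() // created while the flag is off
	channelzOn = true     // flag toggled on later

	// The write path that only runs when the flag is on:
	if channelzOn {
		sc.sockets[1] = "socket-1" // panic: assignment to entry in nil map
	}
}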

arjan-bal (Contributor) commented Feb 20, 2025

I have a repro for this here: https://github.com/arjan-bal/grpc-go/pull/13/files

Steps to repro:

  1. Create a ClientConn and make it connect.
  2. Turn on channelz after the subchannel (addrConn) is created, but before the http2Client is created.
  3. When the http2Client is created, it panics.

Instead of step 2, it should also be possible to trigger the panic by having the same SubChannel re-connect after channelz is turned on. This would happen more commonly than step 2.

arjan-bal added the Area: Observability label Feb 20, 2025
arjan-bal (Contributor) commented:

From what I can tell, channelz can only be enabled at init time by importing the channelz service:

func init() {
	channelz.TurnOn()
}

So this situation should not be possible in non-test code.

arjan-bal (Contributor) commented:

@Cluas I wanted to check if you've found a way (e.g., forking the code, using linkname, etc.) to toggle Channelz after initialization?

Cluas (Author) commented Feb 21, 2025

Yes:

import (
	"embed"
	"net/http"
	_ "unsafe" // required for go:linkname

	channelz "google.golang.org/grpc/channelz/grpc_channelz_v1"
	_ "google.golang.org/grpc/channelz/service"
)

//go:embed templates
var templates embed.FS

var (
	headerTemplate     = parseTemplate("header.tmpl")
	channelsTemplate   = parseTemplate("channels.tmpl")
	subChannelTemplate = parseTemplate("sub_channels.tmpl")
	channelTemplate    = parseTemplate("channel.tmpl")
	serversTemplate    = parseTemplate("servers.tmpl")
	serverTemplate     = parseTemplate("server.tmpl")
	socketTemplate     = parseTemplate("socket.tmpl")
	footerTemplate     = parseTemplate("footer.tmpl")
)

//go:linkname newCZServer google.golang.org/grpc/channelz/service.newCZServer
func newCZServer() channelz.ChannelzServer

func init() {
	handler := channelzHandler{
		server: newCZServer(),
	}
	http.HandleFunc("/runtime/channelz/", handler.Index)
	http.HandleFunc("/runtime/channelz/channels/", handler.Channels)
	http.HandleFunc("/runtime/channelz/channel/", handler.Channel)
	http.HandleFunc("/runtime/channelz/subchannel/", handler.Subchannel)
	http.HandleFunc("/runtime/channelz/server/", handler.Server)
	http.HandleFunc("/runtime/channelz/socket/", handler.Socket)
}

arjan-bal (Contributor) commented Feb 21, 2025

The snippet has imported the service _ "google.golang.org/grpc/channelz/service". So channelz should be turned on during init time. Are you disabling it later?

If not, are you creating gRPC clients in init functions before channelz is enabled? The fix here would be to import google.golang.org/grpc/channelz/service before any client gets created.

Cluas (Author) commented Feb 21, 2025

The snippet has imported the service _ "google.golang.org/grpc/channelz/service". So channelz should be turned on during init time. Are you disabling it later?

If not, are you creating gRPC clients in init functions before channelz is enabled? The fix here would be to import google.golang.org/grpc/channelz/service before any client gets created.

We didn't get an error at startup. Instead, the error occurred suddenly while the service was running.

Cluas (Author) commented Feb 21, 2025

Moreover, the triggering condition is that the upstream is a gRPC stream: when the upstream service restarts and the connection is dropped, the panic is reported as soon as our service attempts to reconnect.

I wonder if we can provide protection on the program side. At least, don't panic.

arjan-bal (Contributor) commented:

This issue can happen when a subchannel created with channelz off fails and is asked to re-connect when channelz is on. This repro demonstrates this: https://github.com/arjan-bal/grpc-go/pull/13/files

// Channelz off.
cc.Connect()
testutils.AwaitState(ctx, t, cc, connectivity.Ready)

// SubChannel fails
backends.backends[0].lis.Stop()
testutils.AwaitState(ctx, t, cc, connectivity.Idle)
channelz.TurnOn()

// Attempt to reconnect
backends.backends[0].lis.Restart()
cc.Connect() // This will trigger the panic
time.Sleep(time.Second)

arjan-bal (Contributor) commented:

I wonder if we can provide protection on the program side. At least, don't panic.

This is not a condition that gRPC clients expect to be in. The reason for enabling ChannelZ in an init function is to allow gRPC to assume it will not change. If the code ends up in such a situation where the enabling status changed, the program state may become inconsistent, making it unclear how to safely recover. If the code tries to suppress the issue, it may lead to undefined behaviour later on.

arjan-bal (Contributor) commented Feb 21, 2025

The simplest fix would be to import google.golang.org/grpc/channelz/service in your outermost package.
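
For example, a minimal sketch (assuming clients are created in or after main(), not in init functions): the blank import in the top-level package runs the channelz service's init, which calls channelz.TurnOn(), so every ClientConn created from main() onward sees channelz as enabled.

package main

import (
	_ "google.golang.org/grpc/channelz/service" // its init calls channelz.TurnOn()
)

func main() {
	// Create gRPC clients here, after all init functions have run.
}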

Cluas (Author) commented Feb 21, 2025

[image attachment]

Cluas (Author) commented Feb 21, 2025

The panic happens suddenly at runtime. According to what you said above, shouldn't this only be possible during startup? As long as the service starts normally, channelz must already have been imported.

arjan-bal (Contributor) commented Feb 21, 2025

The situation is as follows:

package outer

import (
  "inner1"
  "inner2"
)

func init() {}

package inner1

import (
  "google.golang.org/grpc" // grpc doesn't import channelz automatically
)

func init() {
  cc1, _ := grpc.NewClient("some-target")
  cc1.Connect()
}

package inner2

import (
  _ "google.golang.org/grpc/channelz/service" // enables channelz
)

func init() {}
Dependency graph

       ┌──────────────┐     
       │              │     
       │     outer    │     
       │              │     
       ├──────────────┤     
       │              │     
       │              │     
┌──────▼────┐   ┌─────▼────┐
│           │   │          │
│   inner2  │   │   inner1 │
│           │   │          │
└─────┬─────┘   └──────────┘
      │                     
      │                     
┌─────▼─────┐               
│           │               
│ channelz  │               
│           │               
└───────────┘               

The order of init function execution is as follows:

  1. inner1
  2. channelz/service
  3. inner2
  4. outer

When inner1 creates a client, channelz/service is not initialized and hence channelz is not enabled. cc1 thinks that channelz is disabled, so it doesn't have the required data structures for channelz tracking.

cc1 runs for a couple of hours. Now the connection for cc1 breaks, maybe due to a server restart. cc1 will attempt to connect again. By this time channelz/service is already imported, so cc1 sees channelz as enabled. It will try to access channelz data structures that should have been created at subchannel creation time. It will panic ("assignment to entry in nil map") because these data structures were not created initially.

arjan-bal (Contributor) commented Feb 21, 2025

There are a couple of ways to prevent this issue:

  1. Don't create gRPC clients in init functions (sketched below), or
  2. Move the google.golang.org/grpc/channelz/service import to package outer.
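
A sketch of option 1, assuming the client currently built in an init function can instead be built by an exported setup function that main() calls explicitly (the package name, target, and credentials below are placeholders):

package inner1

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

var cc1 *grpc.ClientConn

// Setup is called from main() instead of from an init function, so every
// package's init (including channelz/service's) has already run by the time
// the ClientConn is created.
func Setup(target string) error {
	var err error
	cc1, err = grpc.NewClient(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	return err
}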

github-actions bot commented:

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

github-actions bot added the stale label Feb 27, 2025
arjan-bal (Contributor) commented:

Closing assuming the query has been answered. Feel free to re-open if this isn't the case.

Cluas (Author) commented Mar 2, 2025

We had misidentified the direction of the problem. Disabling channelz does indeed avoid the problem, but the problem does not lie in the initialization order of channelz. Even if we initialize channelz in main, we will still encounter this problem. The root of the problem is that we directly call func (ccb *ccBalancerWrapper) NewSubConn without having func (ccb *ccBalancerWrapper) NewSubConn. The former initializes the map that causes the panic. I think this may be an implementation problem of the load balancer.

Cluas (Author) commented Mar 2, 2025

This PR, #7663, may solve this problem. We'll upgrade the version and test it.

arjan-bal reopened this Mar 3, 2025
github-actions bot removed the stale label Mar 3, 2025
arjan-bal (Contributor) commented:

Hi @Cluas, #7663 deals with a race condition that happens when the message about SubChannel disconnection is delivered to the LB policy before the Producer. The sockets map that is causing the panic here is set at the time of channel creation and is NEVER reset to nil. So even if there was a race here, it should not cause a panic. Based on this I'd be surprised if #7663 fixes the issue.

arjan-bal (Contributor) commented:

The root of the problem is that we directly call func (ccb *ccBalancerWrapper) NewSubConn without having func (ccb *ccBalancerWrapper) NewSubConn.

This is unclear to me. What do you mean by calling func (ccb *ccBalancerWrapper) NewSubConn without having func (ccb *ccBalancerWrapper) NewSubConn?

I think this may be an implementation problem of the loadbalancer.

After reading through the code, I can't see how this panic can happen without channelz being initially off and then being turned on later. Can you provide a way for me to reproduce the issue locally? Without a repro, I don't know where to look next.

Cluas (Author) commented Mar 3, 2025

In our case, channelz was turned on from the start and running normally. The panic occurred suddenly while the service was running. Moreover, this case cannot be reproduced locally; it is occasional, but it is highly correlated with restarts of upstream services. The panic log also shows that channelz is enabled. What I can confirm is that channelz was enabled from the very beginning.

func (g *grpcServer) initGrpcServer() {
	maxMsgSize := g.getMaxMsgSize()

	gopts := []grpc.ServerOption{
		grpc.MaxRecvMsgSize(maxMsgSize),
		grpc.MaxSendMsgSize(maxMsgSize),
		grpc.UnknownServiceHandler(g.handler),
	}

	if creds := g.getCredentials(); creds != nil {
		gopts = append(gopts, grpc.Creds(creds))
	}

	if opts := g.getGrpcOptions(); opts != nil {
		gopts = append(gopts, opts...)
	}

	g.srv = grpc.NewServer(gopts...)

	healthgrpc.RegisterHealthServer(g.srv, health.NewServer())

	// register reflection service
	g.reflectionServer = RegisterReflection(g.srv, g.handlers)

	// register channelz service
	channelzservice.RegisterChannelzServiceToServer(g.srv)
}

Our gRPC server handles requests normally. The incident involved a gRPC client that connects to a configuration distribution service through a stream. The panic occurred precisely when the configuration distribution service was restarted, the connection got an EOF, and the client initiated a reconnection.

	// Create connection manager
	connectFunc := func(ctx context.Context) (grpc.ClientStream, error) {
		connCtx, connSpan := tracer.Start(ctx, "Invoke Watch Stream")
		defer connSpan.End()

		stream, err := wc.cli.invokeWatch(connCtx, wc.req, grpc.WaitForReady(true))
		if err != nil {
			connSpan.RecordError(err)
			return nil, fmt.Errorf("failed to invoke watch: %w", err)
		}

		// Also update old fields to ensure compatibility
		wc.rwMu.Lock()
		wc.stream = stream
		cancel := func() {
			select {
			case <-ctx.Done():
			default:
				// If context is not closed, we do nothing
				// Actual cancellation will be handled by connectionManager
			}
		}
		wc.cancel = cancel
		wc.rwMu.Unlock()

		return stream, nil
	}

arjan-bal (Contributor) commented:

I read the Go docs about the order of execution of package init functions: https://go.dev/ref/spec#Program_initialization

Given the list of all packages, sorted by import path, in each step the first uninitialized package in the list for which all imported packages (if any) are already initialized is initialized. This step is repeated until all packages are initialized.

This means that if a package with a lexicographically smaller name than "google.golang.org/grpc/channelz/service" is present in the import list, it will be initialized first. I have two questions to check if this could be happening:

  1. Is the grpc client that connects to the configuration distribution service created in an init function or before init time by running a closure?

cc := func() *grpc.ClientConn {
    cc, _ := grpc.NewClient("some-target")
    return cc
}()

  2. What is the name of the imported package in the main module that eventually results in the creation of the config service client? Is it lexicographically smaller than "google.golang.org/grpc/channelz/service"? If yes, the way to force execution of "google.golang.org/grpc/channelz/service" before the config service clients is to add a "google.golang.org/grpc/channelz/service" import in the package that creates the config service clients.
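
For the fix suggested in question 2, a sketch (the configclient package name is hypothetical): because a package's imports are always initialized before the package itself, placing the blank import in the package that creates the config service client guarantees channelz is already on when that package's init runs.

package configclient

import (
	// Imports are initialized before the importing package, so this blank
	// import guarantees channelz.TurnOn() has run before any code in this
	// package (including its init function) creates a gRPC client.
	_ "google.golang.org/grpc/channelz/service"
)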

arjan-bal (Contributor) commented Mar 4, 2025

I discussed this issue with other maintainers, who mentioned that creating gRPC channels during or before package initialization is incorrect. gRPC and libraries that use gRPC perform actions such as registering LB policies and resolvers at init time. These libraries, or gRPC itself, may fail in unexpected ways if gRPC channels are created before they initialize.

Cluas (Author) commented Mar 5, 2025

First, our service runs normally, which indicates that initGrpcServer has been executed; at that point channelzservice.RegisterChannelzServiceToServer(g.srv) has already imported channelz. Then, after the service has run for a while, the upstream service restarts and we hit the panic. To describe the upstream interaction: we Read first and then Watch. The error is not reported as soon as we try to connect. Instead, after the connection has been established, the upstream service restarts, the long-lived connection is dropped with an EOF, and when we try to reconnect, the panic occurs. This is a completely different situation from the one you describe above.

arjan-bal (Contributor) commented:

Can you check the answer to the following question also?

Is the grpc client that connects to the configuration distribution service created in an init function or before init time by running a closure?

Cluas (Author) commented Mar 6, 2025

Can you check the answer to the following question also?

Is the grpc client that connects to the configuration distribution service created in an init function or before init time by running a closure?

We don't have this situation.

arjan-bal (Contributor) commented:

It is not that an error is reported as soon as we try to connect. Instead, after the connection is established, when the upstream service restarts and the long connection is disconnected with EOF, and then we try to reconnect, a panic occurs. This is completely two different situations from what you wanted to discuss above.

I see where the confusion might be. As I mentioned in this comment, the panic occurs specifically when a subchannel reconnects, not at startup. This aligns exactly with the situation you’re describing.

From my analysis, the code should never panic if Channelz is enabled at the time of ClientConn creation. If this were a common issue, we would have likely seen similar reports over the past couple of years, given that the code has been stable for a long time.

The key question we need to address is how a ClientConn could be created before Channelz's init function runs. Understanding this will help us identify a fix.

arjan-bal (Contributor) commented:

In #8110 (comment), there is a function named connectFunc. It gets a stream by calling wc.cli.invokeWatch(connCtx, wc.req, grpc.WaitForReady(true)). When is connectFunc called: during or before package initialization?

Cluas (Author) commented Mar 6, 2025

In #8110 (comment), there is a function named connectFunc. It gets a stream by calling wc.cli.invokeWatch(connCtx, wc.req, grpc.WaitForReady(true)). When is connectFunc called, is it during or before package initialization?

The code where we actually hit the error is not this code; I was just illustrating a similar situation. One point we must be clear about: the error occurs during reconnection, not during the initial connection. If it happened on the initial connection, I could understand that it is related to the initialization order. But this is a reconnection.

arjan-bal (Contributor) commented Mar 6, 2025

The panic occurs at re-connection time because the ClientConn creation was done assuming channelz is off. Quoting from the previous comment:

When inner1 creates a client, channelz/service is not initialized and hence channelz is not enabled. cc1 thinks that channelz is disabled, so it doesn't have the required data structures for channelz tracking.

cc1 runs for a couple of hours. Now the connection for cc1 breaks, maybe due to a server restart. cc1 will attempt to connect again. By this time channelz/service is already imported, so cc1 sees channelz as enabled. It will try to access channelz data structures that should have been created at subchannel creation time. It will panic ("assignment to entry in nil map") because these data structures were not created initially.

Cluas (Author) commented Mar 6, 2025

The panic occurs at re-connection time because the ClientConn creation was done assuming channelz is off. Quoting from the previous comment:

When inner1 creates a client, channelz/service is not initialized and hence channelz is not enabled. cc1 thinks that channelz is disabled, so it doesn't have the required data structures for channelz tracking.
cc1 runs for a couple of hours. Now the connection for cc1 breaks, maybe due to a server restart. cc1 will attempt to connect again. By this time channelz/service is already imported, so cc1 sees channelz as enabled. It will try to access channelz data structures that should have been created at subchannel creation time. It will panic ("assignment to entry in nil map") because these data structures were not created initially.

I understand what you mean. I will ask relevant colleagues for confirmation.

Cluas (Author) commented Mar 6, 2025

I have confirmed with them that channelz is imported before the client is initialized.
