-
Notifications
You must be signed in to change notification settings - Fork 4.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stats/opentelemetry: add trace event for name resolution delay #8074
base: master
Are you sure you want to change the base?
stats/opentelemetry: add trace event for name resolution delay #8074
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #8074 +/- ##
==========================================
- Coverage 82.30% 82.24% -0.06%
==========================================
Files 392 392
Lines 39132 39150 +18
==========================================
- Hits 32207 32199 -8
- Misses 5603 5621 +18
- Partials 1322 1330 +8
🚀 New features to boost your workflow:
|
stats/handlers.go
Outdated
@@ -38,6 +38,8 @@ type RPCTagInfo struct { | |||
// FailFast indicates if this RPC is failfast. | |||
// This field is only valid on client side, it's always false on server side. | |||
FailFast bool | |||
// NameResolutionDelay indicates if the RPC was delayed due to address resolution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// NameResolutionDelay indicates if there was a delay in the name resolution.
// This field is only valid on client side, it's always false on server side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stream.go
Outdated
@@ -212,9 +216,13 @@ func newClientStream(ctx context.Context, desc *StreamDesc, cc *ClientConn, meth | |||
} | |||
// Provide an opportunity for the first RPC to see the first service config | |||
// provided by the resolver. | |||
if err := cc.waitForResolvedAddrs(ctx); err != nil { | |||
isDelayed, err := cc.waitForResolvedAddrs(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: isDelayed -> nameResDelayed/nameResolutionDelayed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stream.go
Outdated
@@ -416,8 +424,9 @@ func (cs *clientStream) newAttemptLocked(isTransparent bool) (*csAttempt, error) | |||
method := cs.callHdr.Method | |||
var beginTime time.Time | |||
shs := cs.cc.dopts.copts.StatsHandlers | |||
isDelayed, _ := ctx.Value(nameResolutionDelayKey).(bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment about naming the variable as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stream.go
Outdated
@@ -212,9 +216,13 @@ func newClientStream(ctx context.Context, desc *StreamDesc, cc *ClientConn, meth | |||
} | |||
// Provide an opportunity for the first RPC to see the first service config | |||
// provided by the resolver. | |||
if err := cc.waitForResolvedAddrs(ctx); err != nil { | |||
isDelayed, err := cc.waitForResolvedAddrs(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returning a bool and error is not good practice for go. It breaks the established pattern of error handling in Go because returned bool indicates success/failure in general. Can we do something better? It might be fine if we can't but we can try to look for better ways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have an approach as follows:
Add a nameResolutionDelay field: Add a new nameResolutionDelay field to the ClientConn struct to store the delay state.
Modify waitForResolvedAddrs: Set the nameResolutionDelay field directly in ClientConn instead of returning a boolean.
Access in newAttemptLocked: Use the nameResolutionDelay field from ClientConn within newAttemptLocked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think we want to add the field to clientconn because clientconn is not restricted to only single rpc. Returning a struct sounds better but we don't have any other field apart from boolean field. Let's keep bool, err for now. But make sure docstring is updated to indicate the bool correctly.
test/clientconn_test.go
Outdated
}() | ||
// Simulate delayed resolution and unblock it via resolutionReady | ||
go func() { | ||
<-resolutionReady |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before <-resolutionReady, you can add the t.LogF("RPC waiting for resolved addresses")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stats/opentelemetry/trace.go
Outdated
@@ -52,6 +52,9 @@ func populateSpan(rs stats.RPCStats, ai *attemptInfo) { | |||
) | |||
// increment previous rpc attempts applicable for next attempt | |||
atomic.AddUint32(&ai.previousRPCAttempts, 1) | |||
if ai.nameResolutionDelayed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is wrong. As per https://github.com/grpc/proposal/blob/master/A72-open-telemetry-tracing.md#tracing-information, "Delayed name resolution" should be an event in the call span not attempt span.
This should be right place https://github.com/grpc/grpc-go/blob/master/stats/opentelemetry/client_tracing.go#L34. Before creating the attempt span, you need to retrieve the current call span using trace.SpanFromContext
and add an event to that span. Before that, also check if that event already exist or not. And only add, if it exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@purnesh42H Could you clarify this part: "Only add if it exists"? Should we check if the event has already been added and skip adding it if it has? Or did you mean we should check ai.nameResolutionDelayed before adding the event? Is there any other condition we should consider?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
test/clientconn_test.go
Outdated
if _, err := client.EmptyCall(ctx, &testpb.Empty{}); err != nil { | ||
t.Fatalf("First RPC failed unexpectedly: %v", err) | ||
} | ||
// Verify that name resolution did not happen. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't mean "name resolution did not happen". Either it should be "name resolution did not happen again" or "Verify that RPC was not blocked on waiting for resolver to return addresses indicating no name resolution delay". I prefer the latter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -30,7 +30,17 @@ import ( | |||
// It creates a new outgoing carrier which serializes information about this | |||
// span into gRPC Metadata, if TextMapPropagator is provided in the trace | |||
// options. if TextMapPropagator is not provided, it returns the context as is. | |||
// Additionally, if name resolution was delayed (ai.nameResolutionDelayed is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment can be within the method instead of docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
func (h *clientStatsHandler) traceTagRPC(ctx context.Context, ai *attemptInfo) (context.Context, *attemptInfo) { | ||
callSpan := trace.SpanFromContext(ctx) | ||
if ai.nameResolutionDelayed && !ai.nameResolutionEventAdded { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think its the right way to maintain another boolean field for checking if event was already added or not. Can't we just check in the list if the event by that name already exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can optionally declare a const with name "Delayed name resolution complete" for readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vinothkumarr227 i discussed this problem with @dfawley and he suggested to use atomic bool here and swap the value whenever we find true in any of the retry attempt and then add the event to call span only if we find it true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stats/opentelemetry/opentelemetry.go
Outdated
// complete" event has already been added to prevent duplicate event | ||
// logging. This field is only applicable on the client side and is always | ||
// false on the server side. | ||
nameResolutionEventAdded bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah that's not a good idea to maintain another field in this struct just for checking existence in the same package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I have now modified it by adding atomic.Bool for nameResolutionEventAdded.
test/clientconn_test.go
Outdated
if _, err := client.EmptyCall(ctx, &testpb.Empty{}); err != nil { | ||
t.Fatalf("First RPC failed unexpectedly: %v", err) | ||
} | ||
// verifying that RPC was blocked on resolver to return addresses |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment seems wrong. it should be "verifying that RPC was not
blocked on resolver indicating there was no delay in name resolution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
test/clientconn_test.go
Outdated
|
||
// TestStatsHandlerDetectsResolutionDelay verifies that if this is the | ||
// first time resolution is happening at the time of making RPC, | ||
// nameResolutionDelayed flag was set indicating there was a delay in name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nameResolutionDelayed is
set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// event is logged only once using an atomic flag, even across retries. | ||
callSpan := trace.SpanFromContext(ctx) | ||
if ai.nameResolutionDelayed && !ai.nameResolutionEventAdded.Swap(true) { | ||
if callSpan.SpanContext().IsValid() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this can be part of the same top level if?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stats/opentelemetry/e2e_test.go
Outdated
@@ -1563,3 +1569,144 @@ func (s) TestRPCSpanErrorStatus(t *testing.T) { | |||
t.Fatalf("got rpc error %s, want %s", spans[0].Status.Description, rpcErrorMsg) | |||
} | |||
} | |||
|
|||
type testStatsHandler struct { | |||
nameResolutionDelayed atomic.Bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even for test, nameResolutionDelayed
shouldn't be part of stats handler because the scope of it is per rpc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stats/opentelemetry/e2e_test.go
Outdated
// HandleRPC tracks if an RPC started with a delayed name resolution. | ||
func (h *testStatsHandler) HandleRPC(_ context.Context, rs stats.RPCStats) { | ||
if _, ok := rs.(*stats.Begin); ok { | ||
h.nameResolutionDelayed.CompareAndSwap(false, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm.... HandleRPC is called for every event in the attempt. nameResolutionDelayed should be always handled in TagRPC which is called for every retry attempt of a call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stats/opentelemetry/e2e_test.go
Outdated
// indicating that the resolver took time to return addresses. | ||
// The test also ensures that retries happen as per the configured policy | ||
// and that the resolution delay event is added to the call span only once. | ||
func TestNameResolutionDelay_WithRetry(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vinothkumarr227 can you write the test similar to how other tests are written? See TestSpan
. You can write exact similar test with only difference being retries and manual resolver for delay. However, you shouldn't verify the name resolution flag in the test because that part is already being tested in the clientconn test. Only thing you need to modify is your expected call span
grpc.testing.TestService.UnaryCalland
grpc.testing.TestService.FullDuplexCall to have "Delayed name resolution complete
as one event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you do above, you wouldn't need a separate test stats handler for this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// delay was detected and the event hasn't been recorded yet. Ensures the | ||
// event is logged only once using an atomic flag, even across retries. | ||
callSpan := trace.SpanFromContext(ctx) | ||
if ai.nameResolutionDelayed && !ai.nameResolutionEventAdded.Swap(true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we do something like !atomic.SwapBool(ai.nameResolutionEventAdded, true)
and nameResolutionEventAdded
can be a regular bool type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that there isn't a method named SwapBool, so I used ai.nameResolutionEventAdded.Swap(true) to achieve the desired functionality
stats/opentelemetry: add trace event for name resolution delay.
RELEASE NOTES: None