Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats/opentelemetry: add trace event for name resolution delay #8074

Open
wants to merge 43 commits into
base: master
Choose a base branch
from

Conversation

vinothkumarr227
Copy link
Contributor

@vinothkumarr227 vinothkumarr227 commented Feb 10, 2025

stats/opentelemetry: add trace event for name resolution delay.

RELEASE NOTES: None

Copy link

codecov bot commented Feb 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.24%. Comparing base (af07815) to head (f092d0f).
Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8074      +/-   ##
==========================================
- Coverage   82.30%   82.24%   -0.06%     
==========================================
  Files         392      392              
  Lines       39132    39150      +18     
==========================================
- Hits        32207    32199       -8     
- Misses       5603     5621      +18     
- Partials     1322     1330       +8     
Files with missing lines Coverage Δ
clientconn.go 90.82% <100.00%> (+0.26%) ⬆️
stats/opentelemetry/client_metrics.go 89.44% <100.00%> (+0.05%) ⬆️
stats/opentelemetry/client_tracing.go 83.33% <100.00%> (+6.86%) ⬆️
stats/opentelemetry/opentelemetry.go 75.00% <ø> (ø)
stream.go 81.55% <100.00%> (+0.03%) ⬆️

... and 27 files with indirect coverage changes

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@arjan-bal arjan-bal requested a review from purnesh42H February 17, 2025 07:16
@arjan-bal arjan-bal added this to the 1.71 Release milestone Feb 17, 2025
@arjan-bal arjan-bal added Type: Feature New features or improvements in behavior Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability labels Feb 17, 2025
@@ -38,6 +38,8 @@ type RPCTagInfo struct {
// FailFast indicates if this RPC is failfast.
// This field is only valid on client side, it's always false on server side.
FailFast bool
// NameResolutionDelay indicates if the RPC was delayed due to address resolution.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// NameResolutionDelay indicates if there was a delay in the name resolution.
// This field is only valid on client side, it's always false on server side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

stream.go Outdated
@@ -212,9 +216,13 @@ func newClientStream(ctx context.Context, desc *StreamDesc, cc *ClientConn, meth
}
// Provide an opportunity for the first RPC to see the first service config
// provided by the resolver.
if err := cc.waitForResolvedAddrs(ctx); err != nil {
isDelayed, err := cc.waitForResolvedAddrs(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: isDelayed -> nameResDelayed/nameResolutionDelayed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

stream.go Outdated
@@ -416,8 +424,9 @@ func (cs *clientStream) newAttemptLocked(isTransparent bool) (*csAttempt, error)
method := cs.callHdr.Method
var beginTime time.Time
shs := cs.cc.dopts.copts.StatsHandlers
isDelayed, _ := ctx.Value(nameResolutionDelayKey).(bool)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about naming the variable as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

stream.go Outdated
@@ -212,9 +216,13 @@ func newClientStream(ctx context.Context, desc *StreamDesc, cc *ClientConn, meth
}
// Provide an opportunity for the first RPC to see the first service config
// provided by the resolver.
if err := cc.waitForResolvedAddrs(ctx); err != nil {
isDelayed, err := cc.waitForResolvedAddrs(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returning a bool and error is not good practice for go. It breaks the established pattern of error handling in Go because returned bool indicates success/failure in general. Can we do something better? It might be fine if we can't but we can try to look for better ways.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an approach as follows:

Add a nameResolutionDelay field: Add a new nameResolutionDelay field to the ClientConn struct to store the delay state.
Modify waitForResolvedAddrs: Set the nameResolutionDelay field directly in ClientConn instead of returning a boolean.
Access in newAttemptLocked: Use the nameResolutionDelay field from ClientConn within newAttemptLocked.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think we want to add the field to clientconn because clientconn is not restricted to only single rpc. Returning a struct sounds better but we don't have any other field apart from boolean field. Let's keep bool, err for now. But make sure docstring is updated to indicate the bool correctly.

}()
// Simulate delayed resolution and unblock it via resolutionReady
go func() {
<-resolutionReady
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before <-resolutionReady, you can add the t.LogF("RPC waiting for resolved addresses")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -52,6 +52,9 @@ func populateSpan(rs stats.RPCStats, ai *attemptInfo) {
)
// increment previous rpc attempts applicable for next attempt
atomic.AddUint32(&ai.previousRPCAttempts, 1)
if ai.nameResolutionDelayed {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is wrong. As per https://github.com/grpc/proposal/blob/master/A72-open-telemetry-tracing.md#tracing-information, "Delayed name resolution" should be an event in the call span not attempt span.

This should be right place https://github.com/grpc/grpc-go/blob/master/stats/opentelemetry/client_tracing.go#L34. Before creating the attempt span, you need to retrieve the current call span using trace.SpanFromContext and add an event to that span. Before that, also check if that event already exist or not. And only add, if it exist.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@purnesh42H Could you clarify this part: "Only add if it exists"? Should we check if the event has already been added and skip adding it if it has? Or did you mean we should check ai.nameResolutionDelayed before adding the event? Is there any other condition we should consider?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if _, err := client.EmptyCall(ctx, &testpb.Empty{}); err != nil {
t.Fatalf("First RPC failed unexpectedly: %v", err)
}
// Verify that name resolution did not happen.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't mean "name resolution did not happen". Either it should be "name resolution did not happen again" or "Verify that RPC was not blocked on waiting for resolver to return addresses indicating no name resolution delay". I prefer the latter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -30,7 +30,17 @@ import (
// It creates a new outgoing carrier which serializes information about this
// span into gRPC Metadata, if TextMapPropagator is provided in the trace
// options. if TextMapPropagator is not provided, it returns the context as is.
// Additionally, if name resolution was delayed (ai.nameResolutionDelayed is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment can be within the method instead of docstring

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

func (h *clientStatsHandler) traceTagRPC(ctx context.Context, ai *attemptInfo) (context.Context, *attemptInfo) {
callSpan := trace.SpanFromContext(ctx)
if ai.nameResolutionDelayed && !ai.nameResolutionEventAdded {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think its the right way to maintain another boolean field for checking if event was already added or not. Can't we just check in the list if the event by that name already exist?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can optionally declare a const with name "Delayed name resolution complete" for readability

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vinothkumarr227 i discussed this problem with @dfawley and he suggested to use atomic bool here and swap the value whenever we find true in any of the retry attempt and then add the event to call span only if we find it true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// complete" event has already been added to prevent duplicate event
// logging. This field is only applicable on the client side and is always
// false on the server side.
nameResolutionEventAdded bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's not a good idea to maintain another field in this struct just for checking existence in the same package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I have now modified it by adding atomic.Bool for nameResolutionEventAdded.

if _, err := client.EmptyCall(ctx, &testpb.Empty{}); err != nil {
t.Fatalf("First RPC failed unexpectedly: %v", err)
}
// verifying that RPC was blocked on resolver to return addresses
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment seems wrong. it should be "verifying that RPC was not blocked on resolver indicating there was no delay in name resolution

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


// TestStatsHandlerDetectsResolutionDelay verifies that if this is the
// first time resolution is happening at the time of making RPC,
// nameResolutionDelayed flag was set indicating there was a delay in name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nameResolutionDelayed is set

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// event is logged only once using an atomic flag, even across retries.
callSpan := trace.SpanFromContext(ctx)
if ai.nameResolutionDelayed && !ai.nameResolutionEventAdded.Swap(true) {
if callSpan.SpanContext().IsValid() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this can be part of the same top level if?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -1563,3 +1569,144 @@ func (s) TestRPCSpanErrorStatus(t *testing.T) {
t.Fatalf("got rpc error %s, want %s", spans[0].Status.Description, rpcErrorMsg)
}
}

type testStatsHandler struct {
nameResolutionDelayed atomic.Bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even for test, nameResolutionDelayed shouldn't be part of stats handler because the scope of it is per rpc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// HandleRPC tracks if an RPC started with a delayed name resolution.
func (h *testStatsHandler) HandleRPC(_ context.Context, rs stats.RPCStats) {
if _, ok := rs.(*stats.Begin); ok {
h.nameResolutionDelayed.CompareAndSwap(false, true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm.... HandleRPC is called for every event in the attempt. nameResolutionDelayed should be always handled in TagRPC which is called for every retry attempt of a call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// indicating that the resolver took time to return addresses.
// The test also ensures that retries happen as per the configured policy
// and that the resolution delay event is added to the call span only once.
func TestNameResolutionDelay_WithRetry(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vinothkumarr227 can you write the test similar to how other tests are written? See TestSpan. You can write exact similar test with only difference being retries and manual resolver for delay. However, you shouldn't verify the name resolution flag in the test because that part is already being tested in the clientconn test. Only thing you need to modify is your expected call span grpc.testing.TestService.UnaryCallandgrpc.testing.TestService.FullDuplexCall to have "Delayed name resolution complete as one event.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you do above, you wouldn't need a separate test stats handler for this test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// delay was detected and the event hasn't been recorded yet. Ensures the
// event is logged only once using an atomic flag, even across retries.
callSpan := trace.SpanFromContext(ctx)
if ai.nameResolutionDelayed && !ai.nameResolutionEventAdded.Swap(true) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we do something like !atomic.SwapBool(ai.nameResolutionEventAdded, true) and nameResolutionEventAdded can be a regular bool type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that there isn't a method named SwapBool, so I used ai.nameResolutionEventAdded.Swap(true) to achieve the desired functionality

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants