-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[consumer] Add new otlp-centric error type #11085
base: main
Are you sure you want to change the base?
[consumer] Add new otlp-centric error type #11085
Conversation
41e8af2
to
b5c90ce
Compare
@@ -13,6 +13,8 @@ type permanent struct { | |||
|
|||
// NewPermanent wraps an error to indicate that it is a permanent error, i.e. an | |||
// error that will be always returned if its source receives the same inputs. | |||
// | |||
// Deprecated: [v0.110.0] use Error instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe NewPermanent
is no longer needed if we use Error
. I marked this deprecated in this PR, but we could do it in a followup PR.
If we agree to do it in this PR I'll update all the usages to fix the linter.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #11085 +/- ##
==========================================
- Coverage 91.85% 91.78% -0.08%
==========================================
Files 416 418 +2
Lines 19925 19996 +71
==========================================
+ Hits 18302 18353 +51
- Misses 1245 1263 +18
- Partials 378 380 +2 ☔ View full report in Codecov by Sentry. |
// Experimental: This API is at the early stage of development and may change | ||
// without backward compatibility |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we going to mark consumer as 1.0 if we have this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the plan from the other PR was to quickly remove it. We could also choose to not start with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#9041 includes a Combine
function to aggregate multiple Error
together. I can add that feature to this PR if it is still necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this approach is becoming too specific to be useful. HTTP and gRPC status codes (and conversion between them) should be protocol agnostic. Is there any way we can design this so that we have a general consumer error that contains HTTP and gRPC status codes that can be extended to add protocol specific functionality as needed?
if e.retryable { | ||
return status.New(codes.Unavailable, e.Error()) | ||
} | ||
return status.New(codes.Internal, e.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we considered codes.Unknown
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am an http fanboy, if there is any improperly used grpc codes please let me know. Is codes.Unknown
the best thing to use here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think codes.Unknown
is probably the best default. It's the default in our HTTP->gRPC code conversion logic, and is the fallback code according to the list that we refer to for conversions.
@mwear the discussion @evan-bradley @bogdandrutu @dmitryax and I had led us to the conclusion that a isolated http status or grpc status, with no knowledge of its source, led to ambiguity. Essentially a protocol-agnostic approach meant we could not be sure what the code meant. For example, a non-OTLP exporter, call it exporter A, may use http protocol to make connections with some backend, but, similar to OTLP, it may be speaking a protocol on top of HTTP. That protocol could return a 500 http status code when it is trying to communicate to its clients that it was their fault. This would be in violation of http's definition of 500, but as long as the sever and client have agreed upon their protocol they'll be able to communicate effectively. In the situation where exporter A wants to propagate this error back up the pipeline, recording a generic HTTP 500 status code will cause problems. In this scenario the 500 status code is representing a client error, but if that code made it back to a receiver than did not speak the same protocol as exporter A, such as the OTLP receiver, it will interpret the code incorrectly. In this scenario the OTLP receiver should return an error that indicates the client had a problem, likely a 400, but it would see the 500 and interpret it incorrectly as a server error. Our solution to this was to force a protocol into the error and the one that made the most sense was OTLP. If we instead included an additional piece of metadata, such as If, in the initial scenario, there is a receiver A that speaks the same protocol, then exporter A could return its own special error type and receiver A could look for that error type. We aren't restricting what types of errors an exporter can send back up the pipeline, only providing a consistent, reliable container that we'll recommend using. |
@TylerHelmuth I would like to add to the list of considerations that there are a couple of existing places where I think we need a protocol-agnostic error status. Discussed in #11183, there are situations where the exporterhelper's Timeout sender and Retry sender may (IMO) want to abort a request because the context deadline is shorter than the configured timeout or backoff interval. The root cause is that Golang's Context is protocol agnostic, and we have failures induced by the context which are not Canceled and not DeadlineExceeded. I'm looking for an Aborted status code. How would you propose to return protocol-agnostic errors from exporterhelper? See #11198. Have we considered using the gRPC error code space, since it was designed as a universal solution, while preserving protocol-specific |
@jmacd I may not be understanding the issue well enough, but the intention is that the exporter in question translate its grpc status to the OTLP equivalent grpc status and use
This is what we do today. It works for the most part, but fails as a transport mechanism for http since translating to and from http status codes and grpc status codes is lossy. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR looks good to me. It currently covers only one type of internal collector error Retryable
and this is the only one we use at the moment. If we want to add something else in the future, this API doesn't restrict us from doing that. Otherwise, we can just use NewOTLPHTTPStatus
for internal collector errors if more specifics is needed.
@bogdandrutu PTAL
return true | ||
} | ||
switch e.httpStatus { | ||
case http.StatusTooManyRequests, http.StatusBadGateway, http.StatusServiceUnavailable, http.StatusGatewayTimeout: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't 500 Internal Server Error
retryable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually we follow the OTLP spec for determining whether an error is retryable or non-retryable, and according to the OTLP spec 500 errors aren't retryable.
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Description
Adds a new otlp-centric error type. This type improves the ability for otlp exporters to report errors to otlp consumers, while also giving any component the ability to properly interpret http/grpc/retriable errors.
With this type, the http/grpc/retriable datasets are always considered according to the OTLP specification. If a non-otlp component receives this error type, that can properly map the http/grpc/retryable error to their form based on OTLP's interpretation of the underlying data.
For example, if the error type stored HTTP status 404 a component checking for the http status code within the error would know that the 404 means Not Found according to OTLP. If the component needed to translate 404 Not Found to their own response code, they would be able to trust the meaning.
Link to tracking issue
Continuation of #9041
Closes #7047
Testing
Added tests
Documentation
Added godoc comments