Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adaptive retries #257

Merged
merged 15 commits into from
Dec 31, 2024
Merged

Adaptive retries #257

merged 15 commits into from
Dec 31, 2024

Conversation

Kamil-Lontkowski
Copy link
Contributor

Closes #252

Implementation of adaptive retries based on implementation from aws-sdk-java-v2. It is backed by simple token bucket, where for every retry we take tokens, and replenish them for every successful operation. This way if we are making for example request for very resource-constrained set of resources, after system failure we don't generate more load by retrying every operation that failed.

@Kamil-Lontkowski
Copy link
Contributor Author

I have a few questions regarding this draft.

  1. AdaptiveRetryConfig added few parameters, I am wondering if we want to go in similar direction as before, providing factory methods for simple cases, or make them with all parameters with sensible defaults.
  2. Currently when there are no tokens for retry we just fail as if we used all retries. I wonder if we can introduce "blocking mode" so client waits for enough tokens in bucket.
  3. There is already implementation of leakyBucket in RateLimiter. I wonder if maybe it would be worth sharing some logic between them, but at the same time it is very simple logic.

@lukaszlenart
Copy link
Member

What about adding token on failure/giving up?

@adamw
Copy link
Member

adamw commented Dec 19, 2024

Isn't the "adaptiveness" or "token bucket" an orthogonal concern to the other pieces of configuration that we have currently: schedule, resultPolicy and onRetry? That is - for any values of the schedule etc., we might add (or not) a token bucket, which needs to be consulted before retries. In that scenario, we wouldn't need separate Config implementations, just extending the current one?

As for your questions:

  1. if we have a single config class, then we should also have factory methods with sensible defaults
  2. I think that would be making the retries kind of a rate limiter - and it's a different mechanism. So let's keep those separate.
  3. if you could extract a Bucket abstraction, great, but not unless straighforward

@Kamil-Lontkowski
Copy link
Contributor Author

Kamil-Lontkowski commented Dec 20, 2024

What about adding token on failure/giving up?

I think that it would contradict what TokenBucketRetryCondition and AdaptiveRetryStrategy do. If we add token on failure we would effectively take token for retry and add it back after it fails again. I don't know if I understood that question correctly.

@adamw
Copy link
Member

adamw commented Dec 20, 2024

Thanks for the changes, but I've got yet another variant to consider. Up until now, retries where stateless - it was just a config saying how many time to retry, regardless of other invocations. With the adaptive variant, retries become stateful - the token bucket must be shared. State is usually hard and we have to be cautious not to introduce "surprising" APIs.

That's why maybe it would be better to make the state possible explicit. That is - keep retry stateless as it currently is. But, introduce an AdaptiveRetry(TokenBucket) class, which would have an apply method, mirroring the current retry. That way, the stateful instance would be made explicit, and that's what users would have to share. Usage would then be:

// this will be defined at the top-level, shared among request handlers etc.
val adaptiveRetry = AdaptiveRetry(TokenBucket(...))

// usage:
adaptiveRetry(RetryConfig.backoff(maxRetries, initialDelay, maxDelay))(...)

As for the implementation, we might use the callbacks in the retry policy (respecting the callbacks that already are there). Also, shouldn't we specify both failureCost and successReward for the token bucket?

What do you think? :)

@Kamil-Lontkowski
Copy link
Contributor Author

This WIP is mostly based on mechanism implemented in aws-sdk. There is only exceptionCost. But we can add also successReward. Maybe with new retry AdaptiveRetry those can be functions on errors and result allowing even more flexibility(in aws-sdk throttling exception were treated differently).

@adamw
Copy link
Member

adamw commented Dec 20, 2024

Hm, I think that's a good feature of the AWS SDK - if the call didn't even make it over the network, then we shouldn't count this as an "adaptive failure". So just as we have "isSuccess" etc in the policy, here we could also have one piece of config for adaptive retries - whether the exception should incur the retry penalty

@Kamil-Lontkowski
Copy link
Contributor Author

Kamil-Lontkowski commented Dec 20, 2024

I am trying two approaches for writing AdaptiveRetry class. One defines methods like failureCost as class params. This approach makes the apply methods take less parameters but it forces us to define E and T params when defining AdaptiveRetry. So this seems like a bad idea if we want to reuse policy for different actions. It also does't allow for retry alternative

def apply(config: RetryConfig[E, T])(operation: => T): T

because we can't narrow T to Throwable.

Second approach pushes definitions of this methods to apply. This also have some drawbacks. We can't have multiple apply methods with the same defaults, so we need to force user to define these, or hardcode defaults. Two argument lists also makes it hard for compiler to pick proper overloaded method. In the version I pushed I needed to specify with config E and T explicitly because without it compiler would pick method that returns T even though operation returns Either. Also the infered type of result was Any(but I don't know if it is from IntelliJ or compiler).

That means that we would probably need to get rid of overloaded apply methods, because if compiler picks wrong one we only find out if it blows up in runtime. These are my thoughts after trying this variant, or maybe there is more ways to define this class.

@adamw
Copy link
Member

adamw commented Dec 23, 2024

But isn't failure / success cost constant? Having them variable seems like a very advanced scenario

@adamw
Copy link
Member

adamw commented Dec 23, 2024

I'd imagine them simply as failureCost: Int, successReward: Int

@Kamil-Lontkowski
Copy link
Contributor Author

These can be easily changed. I was wondering more if we are okey with signature like this:

def apply[E, T, F[_]](
      config: RetryConfig[E, T],
      failureCost: Int,
      successReward: Int,
      isFailure: E => Boolean,
      errorMode: ErrorMode[E, F]
  )(operation: => F[T]): F[T]

Because in this scenario overloading apply methods is not good idea because things like I mentioned in the comment can sneak by and blow out in runtime. But writing few methods, like retryEither should be enough. If so I will write tests and complete docs.

@adamw
Copy link
Member

adamw commented Dec 23, 2024

I think I'd push all the "adaptive" configuration to the AdaptiveRetry constructor, so the: token bucket, success & failure token counts. This is all independent of specific T/E parameters. Not sure if that the problem, though :)

@Kamil-Lontkowski Kamil-Lontkowski changed the title WIP: Adaptive retries Adaptive retries Dec 23, 2024
@Kamil-Lontkowski Kamil-Lontkowski marked this pull request as ready for review December 23, 2024 13:58
@Kamil-Lontkowski
Copy link
Contributor Author

Kamil-Lontkowski commented Dec 23, 2024

I think that would be nice middle ground for how much configuration should be provided at every invocation. Maybe running operations on AdaptiveRetry instance should have the same api that retry.scala provides to avoid confusion with this apply method. But that is just matter of method name.

I didn't add more tests in BackoffRetryTest as they felt redundant. I think all "adaptive" cases where tested with different configs.

*/
def apply[E, T, F[_]](
config: RetryConfig[E, T],
isFailure: E => Boolean = (_: E) => true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we get an error value (E), that's always a failure. The additional flexibility that we provide in retries, is that success values T might also cause a retry - so there's no exception thrown, for example (if E == Throwable), but we get back a value which signals a retry anyway. But if we get an E, that's a failure for sure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed signature to isFailure: T => Boolean

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed changes according to this comment. shouldPayPenaltyCost: T => Boolean = (_: T) => true decides if penalty will be paid for result T that was decided to be retried.

But I am not sure about the use case. What if we use library that has client side throttling and operation ends with error. It was throttling so we didn't make the call to the service so we should not pay for this retry but now we can't handle this case. Maybe it should be Either[E, T] => Boolean. Just a thought, I don't know which case might be more useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you're right, Either[E, T] => Boolean would be much more useful. Also let's rename this to shouldPayFailureCost to keep the vocabulary consistent

@adamw
Copy link
Member

adamw commented Dec 31, 2024

The code examples in the docs didn't have mdoc:compile-only and contained multiple errors (I fixed those)

@adamw adamw merged commit daefc1e into master Dec 31, 2024
5 checks passed
@adamw adamw deleted the adaptive-retries branch December 31, 2024 11:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adaptive retries
3 participants