gRPC deadlines and retries

For transparent retries on RPCs from a gRPC client, configure a deadline for the RPC (e.g., 10 seconds), either programmatically or via a service_config.json file (timeout parameter), and enable automated retries during that period. If the RPC then fails due to, e.g., UNAVAILABLE, the gRPC code retries internally for as long as the deadline has not been reached. If that is not possible (e.g., because a library wrapping the gRPC code does not set this up), the next best thing is to code client-side retries manually for as long as the deadline has not been reached.
Also enable waitForReady in the RPC’s stub call options, to avoid the RPC failing on a transient Channel failure (e.g., a TCP socket drop, which triggers an automated reconnection attempt by the gRPC channel logic). With the default mode of waitForReady=false, RPCs fail immediately in the client if the underlying channel for the stub is reconnecting, which leaves no chance to retry once the channel has reconnected. Note that both withDeadline and waitForReady have to be called each time the stub is used to make a call; it is not enough to call them once on the stub.

Setting deadlines and retries

gRPC encourages clients to set explicit deadlines for RPCs. The intention is to let a client define when an RPC result is no longer useful, so that a server can abandon an expensive computation and fail the RPC back to the client, reducing cost in the server. This is especially important in two scenarios:

(1) A server receiving delayed messages from clients after network congestion. If the RPC deadline is 5 seconds and the messages finally arrive at the server 6 seconds later, the server can fail the RPC back to the client without doing any work. This helps reduce the congestion generated by the network hiccup in the cases where doing the work would have been useless anyway.
(2) An RPC from client C1 to server S1, where S1 in turn needs to make an RPC to server S2 to get data it needs to respond to C1. S1 should consider the time remaining before the RPC deadline when making the RPC to S2, and pass that remaining time as the deadline for the S2 call. Considering this together with (1), it is likely that one RPC needs several steps and has a kind of “write amplification” effect, making the reduction in load from meaningless late work all the more relevant.
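
The arithmetic behind the second scenario can be sketched in plain Java. This is only an illustration: the class DeadlineBudget and its methods are hypothetical names, and the sketch assumes S1 knows the absolute deadline of the incoming RPC. In grpc-java the deadline of an incoming call is typically carried by the current Context, so real code may not need to compute this by hand.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: the time budget S1 may hand to its call to S2. */
final class DeadlineBudget {
    private final Instant deadline; // absolute deadline of the RPC from C1

    DeadlineBudget(Instant deadline) {
        this.deadline = deadline;
    }

    /** Time left before the deadline at instant now; never negative. */
    Duration remainingAt(Instant now) {
        Duration left = Duration.between(now, deadline);
        return left.isNegative() ? Duration.ZERO : left;
    }

    /** True once the deadline has passed, so S1 can fail the RPC without doing any work. */
    boolean isExpiredAt(Instant now) {
        return !now.isBefore(deadline);
    }

    public static void main(String[] args) {
        Instant received = Instant.now();
        // Assumption for the example: C1 allowed 5 seconds for the whole operation.
        DeadlineBudget budget = new DeadlineBudget(received.plusSeconds(5));
        // Assumption for the example: S1 spent 1.2 s on local work before calling S2.
        Instant beforeCallToS2 = received.plusMillis(1200);
        if (budget.isExpiredAt(beforeCallToS2)) {
            System.out.println("Deadline exceeded, skipping the call to S2");
        } else {
            System.out.println("Timeout for the S2 call: " + budget.remainingAt(beforeCallToS2));
        }
    }
}
```

Passing only the remaining time to S2 means that late work is dropped at every hop, rather than only at the first server.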

Deadlines can be defined in two ways:

Programmatically: https://grpc.io/blog/deadlines/#java
Via a service configuration file service_config.json using the timeout parameter: https://www.retinadata.com/blog/configuring-grpc-retries/

Of note:

When using a service_config.json file, automated RPC retries are typically also configured there, which would trigger automated retries for failures that are retriable (typically those matching Status=UNAVAILABLE with any reason), for as long as the timeout period allows.
Using a service_config.json file may not be ideal in all cases: it provides a single value that always applies. It is common to require different values for interactive tools (fail fast) and services (be resilient and retry for a period both RPCs and connections to survive network glitches and quick-ish server restarts).
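
As a rough illustration of the two points above, the sketch below builds, in plain Java, a service config equivalent to a small service_config.json with a timeout and a retry policy for UNAVAILABLE. The class name RetryServiceConfig, the service name, and all numeric values are assumptions chosen for the example. In grpc-java a map of this shape can be supplied through ManagedChannelBuilder.defaultServiceConfig(...) together with enableRetry(). Taking the timeout as a parameter is one possible way to give interactive tools and services different values.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical builder for a service config with a timeout and a retry policy. */
final class RetryServiceConfig {

    /** Config for all methods of one service: given timeout, retries on UNAVAILABLE. */
    static Map<String, Object> build(String serviceName, String timeout) {
        Map<String, Object> retryPolicy = Map.of(
            "maxAttempts", 5.0,            // JSON numbers are represented as doubles
            "initialBackoff", "0.5s",      // illustrative values, not recommendations
            "maxBackoff", "5s",
            "backoffMultiplier", 2.0,
            "retryableStatusCodes", List.of("UNAVAILABLE"));

        Map<String, Object> methodConfig = Map.of(
            "name", List.of(Map.of("service", serviceName)),
            "timeout", timeout,
            "retryPolicy", retryPolicy);

        return Map.of("methodConfig", List.of(methodConfig));
    }

    public static void main(String[] args) {
        // "example.Greeter" is a placeholder service name.
        Map<String, Object> interactive = build("example.Greeter", "2s");  // fail fast
        Map<String, Object> service = build("example.Greeter", "30s");     // be resilient
        System.out.println(interactive);
        System.out.println(service);
    }
}
```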

A note on RPC deadlines and Channel state / disconnections

A channel in gRPC represents a transport to get messages from clients to servers. The association from a channel to a particular transport, e.g., a TCP connection, is explicitly not made. Channels can be in different states, and a channel implemented as a TCP socket connection manages its own reconnection logic, independently of the individual RPCs and application-level requests using the Channel. In this sense, channel parameters for reconnection logic are completely separate and different from RPC retry logic. The actual implementation of channel reconnection logic is explained in this short document. Note that at the time of this writing many parameters are hardcoded in the Java implementation and cannot be changed.
By default, RPCs that are initiated while a channel is in TRANSIENT_FAILURE (e.g., TCP connection dropped, pending reconnection attempt) fail immediately, independent of deadline or retry configuration. There is an option to make RPCs wait for the channel to be in READY state before executing: waitForReady. The wait-for-ready semantics are explained in a bit more detail here.

A note on streaming RPCs

Everything that this page has discussed so far is valid for unary (i.e., simple request-response) RPCs. It does not apply exactly as described to any kind of streaming (client, server, or bidi) RPC. Implementing deadlines for streaming RPCs is a lot more complicated and manual, and requires coding interceptor logic.

The jetcd library and its API in the context of gRPC retries

The jetcd library allows clients to set deadlines through the ClientBuilder API, using the methods retryDelay, retryMaxDelay, and retryMaxDuration. With these three parameters it implements a retry policy equivalent to an exponential backoff: the first retry happens after retryDelay, the delay then doubles on each retry without exceeding retryMaxDelay, after which retries continue linearly at 2*retryMaxDelay, 3*retryMaxDelay, etc., up to a total time of retryMaxDuration. This logic is implemented programmatically in jetcd internals when executing stubs towards the etcd server, using calls to a dependent library (net.jodah.failsafe): https://github.com/etcd-io/jetcd/blob/a6f54e1c4f57e7e00ebd0223f9eab52bd86daf77/jetcd-core/src/main/java/io/etcd/jetcd/impl/Impl.java#L105
Concrete Client implementation objects created by users of the jetcd library know the retry parameters through the ClientBuilder stored inside the ClientConnectionManager in the Client implementation. The specific clients, e.g., KVClient, use the retry parameters when executing stubs.
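
To make the backoff schedule described above concrete, here is a minimal sketch that computes the sequence of delays from the three parameters. It is not jetcd's actual code: the class BackoffSchedule is a hypothetical name, the linear phase is read as a constant delay of retryMaxDelay between retries, and the time spent in the attempts themselves is ignored.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: the delays of a capped exponential backoff. */
final class BackoffSchedule {

    /** Delays between retries: doubling from retryDelay, capped at retryMaxDelay,
     *  for as long as their sum stays within retryMaxDuration. */
    static List<Duration> delays(Duration retryDelay, Duration retryMaxDelay,
                                 Duration retryMaxDuration) {
        if (retryDelay.isZero() || retryDelay.isNegative()
                || retryMaxDelay.isZero() || retryMaxDelay.isNegative()) {
            throw new IllegalArgumentException("delays must be positive");
        }
        List<Duration> result = new ArrayList<>();
        Duration next = min(retryDelay, retryMaxDelay);
        Duration elapsed = Duration.ZERO;
        while (elapsed.plus(next).compareTo(retryMaxDuration) <= 0) {
            result.add(next);
            elapsed = elapsed.plus(next);
            next = min(next.multipliedBy(2), retryMaxDelay); // double, but never above the cap
        }
        return result;
    }

    private static Duration min(Duration a, Duration b) {
        return a.compareTo(b) <= 0 ? a : b;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        System.out.println(delays(Duration.ofMillis(500), Duration.ofMillis(2500),
                                  Duration.ofSeconds(10)));
    }
}
```

With the illustrative values in main, the delays are 0.5 s, 1 s, 2 s, 2.5 s, and 2.5 s; a sixth retry would exceed the 10-second total and is not scheduled.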

This works for most calls, except for jetcd library calls implemented as streaming RPCs. Calls to renew leases use a bidi stream to the server in the gRPC service definition for etcd, so renewing a lease does not use the retry parameters defined in the ClientBuilder object. Moreover, none of the calls in the Lease client API use them: although grant is a unary (not streaming) call, the implementation in LeaseImpl calls into the stub directly. There is an overload that allows the user to specify a deadline for the call, but there is no way to pass a retry policy.
Therefore, retries for lease-related calls have to be implemented manually by code that uses the jetcd client. The general approach is:
Any StatusException or StatusRuntimeException with a status code of UNAVAILABLE can be retried.
An EtcdException with an error code of UNAVAILABLE can be retried.
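
The general approach above can be sketched as a small retry loop in plain Java. This is a minimal sketch under stated assumptions, not jetcd or gRPC code: the class ManualRetry and its parameters are hypothetical, the delay between attempts is fixed for simplicity, and the decision of what is retryable is left to a predicate. In real code that predicate could check, for instance, that io.grpc.Status.fromThrowable(t).getCode() is Status.Code.UNAVAILABLE.

```java
import java.time.Duration;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical helper: retries a call until it succeeds or the time budget runs out. */
final class ManualRetry {

    static <T> T callWithRetry(Supplier<T> call, Predicate<Throwable> retryable,
                               Duration maxDuration, Duration delay) {
        long deadlineNanos = System.nanoTime() + maxDuration.toNanos();
        while (true) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                // Retry only if the failure is retryable and one more delay still fits.
                boolean timeLeft = deadlineNanos - System.nanoTime() > delay.toNanos();
                if (!retryable.test(e) || !timeLeft) {
                    throw e;
                }
                try {
                    Thread.sleep(delay.toMillis());
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    throw e;                            // give up with the last failure
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        // Simulated call: fails twice with a stand-in for UNAVAILABLE, then succeeds.
        String result = callWithRetry(() -> {
            if (++attempts[0] < 3) {
                throw new IllegalStateException("UNAVAILABLE (simulated)");
            }
            return "lease granted (simulated)";
        }, t -> t instanceof IllegalStateException,
           Duration.ofSeconds(5), Duration.ofMillis(100));
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

Non-retryable failures and failures after the time budget is exhausted are rethrown unchanged, so the caller still sees the original status.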

References

gRPC retries design