Sunday, August 4, 2019

Failures in Microservices

As microservices evolve into a tangled mess of synchronous and asynchronous flows with multi-level fanouts, it becomes important to think about failure and resiliency: failure is practically guaranteed when the availability of the whole system is the product of the availability of all its downstream microservices and dependencies.

How does one systematically think about handling load, graceful degradation, and load shedding in the face of impaired operation and sustained high load? Google's SRE books contain excellent high-level advice on handling load and addressing cascading failures. I have prepared an actionable summary of a couple of chapters dealing with resiliency to win in the face of failure. Follow the notes here to create rigor and governance around microservices frameworks and templates, enabling systematic resiliency through circuit breakers and autoscaling for sustainable scale-out of your System of Systems.

Different types of resources can be exhausted

Insufficient CPU > all requests become slower > various secondary effects

  1. Increased number of inflight requests
  2. Excessively long queue lengths
    - steady state rate of incoming requests > rate at which the server can process requests
  3. Thread starvation
  4. CPU or request starvation
  5. Missed RPC deadlines
  6. Reduced CPU caching benefits

Memory Exhaustion - more in-flight requests consume more RAM for request, response, and RPC objects

  1. Dying containers due to OOM Killers
  2. A vicious cycle - an increased rate of GC in Java results in increased CPU usage, which in turn slows requests and drives memory usage even higher
  3. Reduction in app level cache hit rates

Threads (e.g., Tomcat HTTP worker threads)

  1. Thread starvation can directly cause errors or lead to health check failures.
  2. If the server adds threads as needed, thread overhead can use too much RAM (see the bounded-pool sketch after this list).
  3. In extreme cases, thread starvation can also cause you to run out of process IDs.
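
To make the thread risk concrete, here is a minimal sketch of a bounded worker pool in Java. The class name, the sizing heuristic, and the queue length are assumptions for illustration; real numbers should come from load testing your service.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedWorkerPool {
    public static ThreadPoolExecutor create() {
        // Fixed-size pool: core == max, so the server never keeps adding threads under load.
        // The sizing heuristic below is an assumption; derive real numbers from load tests.
        int threads = Runtime.getRuntime().availableProcessors() * 4;
        return new ThreadPoolExecutor(
                threads, threads,
                60, TimeUnit.SECONDS,
                // Small bounded queue relative to the pool size, so excess work is rejected
                // early instead of piling up and consuming RAM.
                new ArrayBlockingQueue<>(threads / 2),
                // Throw RejectedExecutionException when the queue is full; the request layer
                // can translate that into a cheap "overloaded" response.
                new ThreadPoolExecutor.AbortPolicy());
    }
}
```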

File descriptors

  • Running out of file descriptors can lead to the inability to initialize network connections, which in turn can cause health checks to fail.

Dependencies among resources  

  •   Resource exhaustion scenarios feed off one another
  •   DB connections (a negative indicator)

All of this can ultimately lead to service unavailability: resource exhaustion can cause servers to crash, shifting load onto the remaining servers and creating a snowball effect.

How To Prevent Server Overload

  1. Load test the server’s capacity limits
  2. Serve degraded results
  3. Instrument servers to reject requests when overloaded - fail early and cheaply
  4. Instrument higher-level systems to reject requests: at reverse proxies, by limiting the volume of requests by criteria such as IP address; at the load balancers, by dropping requests when the service enters global overload; and at individual tasks (a per-IP limiter sketch follows this list)
  5. Perform capacity planning
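
As a rough illustration of item 4, a fixed-window per-IP admission check might look like the sketch below. The class name, the window, and the limit are assumptions; in practice a reverse proxy or load balancer usually does this for you rather than application code.

```java
import java.util.HashMap;
import java.util.Map;

/** Fixed-window, per-IP admission check (illustrative only). */
public class PerIpRequestLimiter {
    private static final int MAX_REQUESTS_PER_WINDOW = 500; // assumed per-IP limit
    private static final long WINDOW_MS = 1_000;

    private final Map<String, Integer> counts = new HashMap<>();
    private long windowStart = System.currentTimeMillis();

    /** Returns true to admit the request, false to reject it (e.g. with a 429 or 503). */
    public synchronized boolean admit(String clientIp) {
        long now = System.currentTimeMillis();
        if (now - windowStart >= WINDOW_MS) {
            counts.clear();          // start a new fixed window
            windowStart = now;
        }
        int count = counts.merge(clientIp, 1, Integer::sum);
        return count <= MAX_REQUESTS_PER_WINDOW;
    }
}
```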

Load Shedding

Detect When Load Shedding / Graceful Degradation Should Kick In

  • Look at CPU usage, latency, queue length, and the number of threads used
  • Decide whether your service enters degraded mode automatically or whether manual intervention is necessary
  • Graceful degradation shouldn’t trigger very often
  • Monitor and alert when too many servers enter these modes
  • Design a way to quickly turn off complex graceful degradation when you run into emergent behavior (the detector sketch below includes such a kill switch)
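
One possible shape for the detection logic is sketched below in Java, using only the in-flight request count as the signal. The class name and thresholds are assumptions; a real detector would also look at CPU usage, latency, and queue length, and the kill switch would be wired to a configuration flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Flips a degraded-mode flag from a single load signal (in-flight requests). */
public class OverloadDetector {
    private static final int ENTER_THRESHOLD = 150; // assumed; derive from load tests
    private static final int EXIT_THRESHOLD = 100;  // lower than ENTER for hysteresis

    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicBoolean degraded = new AtomicBoolean(false);
    // Kill switch: lets operators turn degradation off quickly if it misbehaves.
    private volatile boolean degradationEnabled = true;

    public void requestStarted()  { evaluate(inFlight.incrementAndGet()); }
    public void requestFinished() { evaluate(inFlight.decrementAndGet()); }

    public boolean isDegraded() { return degradationEnabled && degraded.get(); }

    public void setDegradationEnabled(boolean enabled) { this.degradationEnabled = enabled; }

    private void evaluate(int current) {
        if (current > ENTER_THRESHOLD) {
            degraded.set(true);
        } else if (current < EXIT_THRESHOLD) {
            degraded.set(false);
        }
    }
}
```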

Implement

  • Per-task throttling based on CPU, memory, or queue length. Limit queue length: for a system with fairly steady traffic over time, it is better to have *small queue lengths* relative to the thread pool size, so the server rejects requests early when it can’t sustain the rate of incoming requests
  • Dynamically adjust the number of in-flight task updates based on the volume of requests and available capacity
  • Return 503 Service Unavailable to any incoming request when there are more than a given number of client requests in flight (see the filter sketch after this list)
  • Changing the queuing method from FIFO to LIFO, or using the CoDel algorithm, can reduce load by removing requests that are unlikely to be worth processing
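
A minimal sketch of the "reject early with a 503" idea, assuming a Jakarta Servlet container (e.g., a recent Tomcat). The filter name and the in-flight limit are assumptions; the limit should come from the load tests described earlier.

```java
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.concurrent.Semaphore;

/** Rejects requests early with a 503 once too many are already in flight. */
public class LoadSheddingFilter implements Filter {
    private static final int MAX_IN_FLIGHT = 200; // assumed limit
    private final Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        if (!inFlight.tryAcquire()) {
            // Fail early and cheaply instead of queueing work we cannot finish.
            ((HttpServletResponse) res).sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            return;
        }
        try {
            chain.doFilter(req, res);
        } finally {
            inFlight.release();
        }
    }
}
```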

Graceful degradation

  • Decrease the amount of work or time needed by decreasing the quality of responses
  • What actions should be taken when the server is in degraded mode?
  • Decide whether these strategies need to be implemented at every layer in the stack, or whether it is sufficient to have a high-level choke point (a degraded-mode sketch follows)
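
A minimal sketch of what a high-level choke point for degraded responses might look like. The handler and backend interface here are hypothetical; the point is only that the degraded path does strictly less work than the normal path.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

/** Serves cheaper, lower-quality responses while the overload signal is on. */
public class DegradableSearchHandler {
    private final BooleanSupplier overloaded;  // could be wired to the detector sketch above
    private final SearchBackend backend;       // hypothetical backend interface

    public DegradableSearchHandler(BooleanSupplier overloaded, SearchBackend backend) {
        this.overloaded = overloaded;
        this.backend = backend;
    }

    public List<String> search(String query) {
        if (overloaded.getAsBoolean()) {
            // Degraded mode: fewer candidates, no re-ranking or personalization.
            return backend.topHits(query, 10);
        }
        // Normal mode: full candidate set plus the expensive ranking pass.
        return backend.rankAndPersonalize(backend.topHits(query, 100));
    }

    /** Hypothetical backend; defined only so the sketch is self-contained. */
    public interface SearchBackend {
        List<String> topHits(String query, int limit);
        List<String> rankAndPersonalize(List<String> hits);
    }
}
```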


Circuit Breaker Retry Advice

Retries can amplify the effects seen in server overload (a retry-loop sketch follows this list):
  1. Limit to 3 retries per request. Don’t retry a given request indefinitely
  2. Impose a server-wide retry budget; when the retry budget is exceeded, don’t retry, just fail the request
  3. Examine whether you need to perform retries at a given level at all. Prevent retry fanout
  4. Separate retriable and nonretriable error conditions. Don’t retry permanent errors or malformed requests
  5. Retry with exponential backoff and jitter
  6. All retry behavior should be configurable, so that retries can be turned off entirely
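
Pulling those rules together, a retry wrapper might look roughly like the sketch below. The class name, budget size, and backoff constants are assumptions, and a production budget would be replenished over time rather than only drained.

```java
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative retry loop with a cap, a server-wide budget, backoff with jitter, and a kill switch. */
public class RetryPolicy {
    private static final int MAX_ATTEMPTS = 3;              // at most 3 tries per request
    private static final int SERVER_RETRY_BUDGET = 1_000;   // server-wide budget (assumed size)
    private static volatile boolean retriesEnabled = true;  // configuration kill switch

    // A production budget would be replenished (e.g., proportionally to successful
    // requests); this one only drains, which is enough to show the idea.
    private static final AtomicInteger retryBudget = new AtomicInteger(SERVER_RETRY_BUDGET);
    private static final Random jitter = new Random();

    public static void setRetriesEnabled(boolean enabled) { retriesEnabled = enabled; }

    public static <T> T call(Callable<T> rpc) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return rpc.call();
            } catch (NonRetriableException e) {
                throw e; // permanent errors and malformed requests are never retried
            } catch (Exception e) {
                if (attempt >= MAX_ATTEMPTS
                        || !retriesEnabled
                        || retryBudget.getAndDecrement() <= 0) {
                    throw e; // fail the request instead of piling on more load
                }
                // Exponential backoff with jitter: ~100ms, ~200ms, ...
                long backoffMs = 100L << (attempt - 1);
                Thread.sleep(backoffMs + jitter.nextInt(100));
            }
        }
    }

    /** Hypothetical marker for error conditions that must not be retried. */
    public static class NonRetriableException extends RuntimeException {}
}
```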

Implement Deadline Propagation

  • Pick a deadline
  • Each server/app instance should check the remaining deadline at each stage before attempting to perform any more work on the request
  • Each server in the request tree implements deadline propagation
  • Reduce the outgoing deadline by a few hundred milliseconds to account for network transit times  
  • Set an upper bound for outgoing deadlines
  • Deadlines several orders of magnitude longer than the mean request latency are usually a bad idea (a propagation sketch follows this list)
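
A sketch of deadline propagation over plain HTTP, using Java's built-in HttpClient. The header name, the network margin, and the upper bound are assumptions; gRPC has built-in per-call deadlines that cover much of this for you.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Propagates the remaining request deadline to downstream calls. */
public class DeadlinePropagation {
    private static final String DEADLINE_HEADER = "x-request-deadline-ms"; // assumed convention
    private static final long NETWORK_MARGIN_MS = 200;   // reserve for network transit
    private static final long MAX_OUTGOING_MS = 10_000;  // upper bound on outgoing deadlines

    private final HttpClient client = HttpClient.newHttpClient();

    /** Calls a downstream service with whatever budget remains for this request. */
    public HttpResponse<String> callDownstream(String url, long absoluteDeadlineMillis)
            throws Exception {
        long remaining = absoluteDeadlineMillis - System.currentTimeMillis();
        if (remaining <= NETWORK_MARGIN_MS) {
            // Check the deadline before doing any more work: give up instead of
            // burning resources on a response nobody will wait for.
            throw new IllegalStateException("deadline exceeded before downstream call");
        }
        long outgoing = Math.min(remaining - NETWORK_MARGIN_MS, MAX_OUTGOING_MS);
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofMillis(outgoing))
                .header(DEADLINE_HEADER, Long.toString(System.currentTimeMillis() + outgoing))
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```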

Multi-modal latency requests

For workloads with multi-modal latency, allow only 25% of your threads to be occupied by any one client, to provide fairness when a single misbehaving client generates heavy load. A sketch of such a per-client cap follows.
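
A sketch of that cap, keyed by client ID. The class name is an assumption, 25% is the fraction suggested above, and eviction of idle clients is omitted for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Caps any single client at roughly a quarter of the worker threads. */
public class PerClientLimiter {
    private final int perClientLimit;
    private final Map<String, Semaphore> perClient = new ConcurrentHashMap<>();

    public PerClientLimiter(int workerThreads) {
        // No single client may occupy more than 25% of the pool.
        this.perClientLimit = Math.max(1, workerThreads / 4);
    }

    /** Returns false if the client already holds its share of threads; the caller should reject. */
    public boolean tryAcquire(String clientId) {
        return perClient
                .computeIfAbsent(clientId, id -> new Semaphore(perClientLimit))
                .tryAcquire();
    }

    public void release(String clientId) {
        Semaphore s = perClient.get(clientId);
        if (s != null) {
            s.release();
        }
    }
}
```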

Address Cascading failures

  • Process health checking is relevant to the cluster scheduler, whereas service health checking is relevant to the load balancer (a sketch of the two kinds of endpoints follows this list)
  • To address a cascading failure in progress: increase resources, restart servers, drop traffic, enter degraded modes, eliminate batch load, eliminate bad traffic, and autoscale
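
A minimal sketch of the two kinds of health endpoints using the JDK's built-in HttpServer. The paths and the dependency probe are assumptions; in Kubernetes these would typically back the liveness and readiness probes respectively.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

/** Separates process health (for the scheduler) from service health (for the load balancer). */
public class HealthEndpoints {
    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);

        // Process health: "is the process alive?" The cluster scheduler restarts the
        // container when this fails.
        server.createContext("/healthz", exchange -> respond(exchange, 200, "ok"));

        // Service health: "can this instance actually serve?" The load balancer drains
        // traffic when dependencies (DB, downstream RPCs) are unavailable.
        server.createContext("/ready", exchange -> {
            boolean ready = dependenciesHealthy(); // placeholder dependency probe
            respond(exchange, ready ? 200 : 503, ready ? "ready" : "not ready");
        });

        server.start();
    }

    private static boolean dependenciesHealthy() {
        return true; // placeholder: check DB connection pools, caches, downstream services
    }

    private static void respond(HttpExchange exchange, int code, String body) throws IOException {
        byte[] bytes = body.getBytes();
        exchange.sendResponseHeaders(code, bytes.length);
        try (OutputStream os = exchange.getResponseBody()) {
            os.write(bytes);
        }
    }
}
```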
