Saturday, August 10, 2019
Event Storming - A Pivotal Practice for decomposing applications
FIELDS + GUIDANCE
# Name of method
Event Storming
# What is this method?
Event Storming is a cross-functional facilitation technique for revealing the bounded contexts, microservices, vertical slices, trouble spots, and starting points for a system or business process.
# Phases
Discovery, Kick Off
# Suggested Time
1-2 hours
# Who participates?
SMEs, Core Team (see facilitator notes)
# Why do it?
Event Storming enables decomposing monoliths into microservices. It allows for modeling new flows and ideas, synthesizing knowledge, and facilitating active group participation without conflict - time traveling and ideating the next generation of a software system.
# When to do it?
When you need to make sense of a huge mess and enable cross-perspective communication as a forcing function for clarity.
# What supplies are needed?
People, tools, and supplies needed to conduct an ES session.
# How to Use this Method
Event Storming is a group exercise to scientifically explore the domains and problem areas of a monolithic application. The most concise description of the process of Event Storming comes from Vaughn Vernon's DDD Distilled book, and the color around the process comes from Alberto Brandolini's book Event Storming.
Storm the business process by creating a series of domain events on sticky notes. The most popular color to use for domain events is orange. A domain event is a verb stated in the past tense and represents a state transition in the domain. Write the name of the domain event on an orange sticky note. Place the sticky notes on your modeling surface in time order, from left to right. As you go through the storming session you will find trouble spots in your existing business process; clearly mark these with purple/red sticky notes. Use vertical space to represent parallel processing.
After all the events are posted, experts will post locally ordered sequences of events and enforce a timeline. Enforcing a timeline triggers long-awaited conversations, and eventually STRUCTURE will emerge.
These event clumps or common groupings give us our notional service candidates (actors or aggregates, depending on how rigid the team is with DDD definitions). These will be used during the Boris Exercise. A minimal code illustration of the past-tense event convention follows.
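To make the naming convention concrete, here is a minimal sketch of how past-tense domain events captured on stickies might later translate into code (Java 16+ records; the event names are hypothetical):

import java.math.BigDecimal;
import java.time.Instant;

// Hypothetical domain events from an order-processing storm: past-tense
// names, immutable payloads, each recording a state transition in the domain.
record OrderPlaced(String orderId, Instant occurredAt) {}
record PaymentReceived(String orderId, BigDecimal amount, Instant occurredAt) {}
record OrderShipped(String orderId, String trackingId, Instant occurredAt) {}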
# Success/Expected Outcomes
"You know you are done when…"
- Event Storming generates an immense backlog of user stories
- Perform User Story Mapping to map and organize stories into MVPs
- Define the scope of the problem
- Confirm that you are solving the right problem
# Facilitator Notes & Tips
Event Storming is a technique used to visualize complex systems and processes, ranging from monoliths to value streams. It is a gamestorming technique for harnessing and capturing the information held in a group's minds. It surfaces conflicts and different perspectives of a complex system and bubbles up the top constraints and problem spots. As an Event Storming facilitator you have one job: create a safe environment for the exchange and output of ideas and data. The job is 50% technical facilitation and 50% soft people facilitation, where you are reading body language. A single facilitator can typically orchestrate groups of 15-20; for a group of 30 or more you need two facilitators.
ES is usually conducted in two phases: a high-level event storm to identify the domains, and then a subsequent ES into a top constraint - the core domain. The language of ES is stickies. In its simplest form, ES is basically facilitated group storytelling. The stickies represent domain events - things that happened in the past. Trouble spots are identified with orange/red stickies. The color of the stickies does not matter; what does matter is that you start simple and then add notation incrementally, layering in information. ES can serve many goals: break down a monolith into constituent bounded contexts, create a value stream, onboard employees, etc. There is no ONE correct style of ES. Every session differs based on the desired goals and outcomes, so don't worry about getting it right - just do it and roll your own style.
An ES is only successful if the right people are involved. This is a mix of business domain experts, customer executives, stakeholders, business analysts, software developers, architects, testers, and folks who support the product in production - subject matter experts, product owners, and developers who know and understand the application domain. This process enables cross-perspective conversation throughout the team, as well as a standard definition of the terms used by both technical and non-technical team members.
# Related practices
# Real world example
# Recommended reading
- Gamestorming: A Playbook for Innovators, Rulebreakers, and Changemakers - the motivation behind ES; available on Safari Books Online: https://www.safaribooksonline.com/library/view/gamestorming/9781449391195/ (this is the book we read on the flight to Boston)
- Introducing EventStorming https://leanpub.com/introducing_eventstorming - written by Alberto Brandolini, the father and inventor of EventStorming
- Domain-Driven Design (DDD) - provides the theoretical underpinnings of decomposing monoliths. DDD Distilled is the perfect book for understanding the science of DDD and how ES fits into the grander scheme of things: how the ES artifacts translate into software design, architecture, and an actual backlog.
Thursday, August 8, 2019
Learnings from Implementing enterprise event driven architecture
There are four different types of event-driven architecture [3]. The Docket-Based Choreography pattern is one of our inventions that allows us to design and operationalize an event-driven architecture for a legacy system. It involves event notification but is a specific implementation. Typical event-driven reengineering of a monolith involves both sync and async flows. CQRS and Event Sourcing - one of the en vogue forms of event-driven architecture - is hard, resulting in up to a 50% failure rate in projects. Practicing domain-driven design and carving out bounded contexts and vertical slices is hard. You have to stick to first principles after the system is decomposed to stay true to the event-driven architecture. We leverage Kafka a lot, primarily as a messaging broker. Developers struggle with ACID guarantees in event-driven systems. Online event processing provides a way to make a system eventually consistent [6].
Typical pitfalls encountered when reengineering monoliths to microservices are incidental coupling of microservices and a shared data model across microservices; in some cases the microservices reverted to shared canonical domain models. CDC-driven decomposition of monoliths is rare due to data silos. Successful architecture and app transformation requires a change in culture. The Event Shunting pattern [4] allows for gradual transformation from legacy to a modern stream-based event-driven architecture. Among many things, the pace of change dictates the boundary of the microservice. Leverage past experience to see which features/components change over time. Avoid the glorious central model, where a central authority makes changes to the canonical model. Avoid the trap of building a giant canonical model by staying true to guiding principles. Large messages as events vs. small messages build on this: large canonical messages can force many changes to the model, and managing regulatory, security, PII, and encrypted data at rest is painful. Small messages that trigger data fetches through an API have been much more successful towards a sustainable architecture, as the sketch below illustrates.
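A minimal sketch of the small-message approach: the event carries only an identifier, and the consumer fetches the full record through the owning service's API (the event, class, and endpoint names here are hypothetical):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical thin event: carries just enough to locate the data.
record CustomerChanged(String customerId) {}

class CustomerChangedHandler {
    private final HttpClient http = HttpClient.newHttpClient();

    // On a thin event, fetch current state from the owning service's API
    // instead of shipping a large canonical payload on the message bus.
    void onEvent(CustomerChanged event) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://customer-service.internal/customers/" + event.customerId()))
                .GET()
                .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Latest customer state: " + response.body());
    }
}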
References
1. Evolutionary_design_still_requires_up_front_thinking.html
2. http://nealford.com/memeagora/2015/03/30/architecture_is_abstract_until_operationalized.html
3. https://martinfowler.com/articles/201701-event-driven.html
4. https://medium.com/@KevinHoffman/migrating-apps-to-the-cloud-shunting-the-event-stream-8c2f6f309242
5. https://fs.blog/2018/04/reversible-irreversible-decisions/
6. https://queue.acm.org/detail.cfm?id=3321612
7. https://content.pivotal.io/blog/agile-architecture
Wednesday, August 7, 2019
Death To the Kubernetes YAML! Long Live the Manifest Generators
You are wallowing in a wall of YAML, wondering where your inner-loop developer productivity disappeared. You are wandering the annals of the internet to find the kubectl equivalent of cf push. You are a refugee from the land of Platform-as-a-Service wandering aimlessly in the Container-as-a-Service world. The land of Kubernetes is intimidating for those of us who are used to the higher-order abstractions afforded by platforms like Cloud Foundry or Heroku.
In this series of blog posts, a refugee from PaaS who has crossed the chasm will be your coach as you navigate this perilous journey. We will cover the equivalence of concepts from Cloud Foundry to Kubernetes. We will draw out the distinctions in architecture & developer workflows across K8s and Cloud Foundry. You will get a deep understanding of the tools & the confidence necessary to make YAML your new best friend and conquer K8s to 10x your productivity, developing in the Kubernetes-native way.
So here it goes - episode 1 - Death to YAML long live the manifest generators!
What are the options when it comes to a cf push-like experience on Kubernetes? I am going to keep a running list of K8s YAML template generators here that generate K8s manifests for app deployments. Developers can start with the tools below to escape death by YAML on first contact with Kubernetes.
- Fabric8.io has a Java DSL for Kubernetes - the fabric8.io Kubernetes Java client. Istio and other projects use it. The fabric8 library is a single jar and works in airgapped environments. It provides a typesafe Kubernetes-manifest DSL for JVM-based apps (see the sketch after this list).
- The primary interface to kubectl is YAML. Pulumi exposes a rich, multi-language SDK to create API resources, and additionally supports execution of Kubernetes YAML manifests and Helm charts.
- kf provides cloud foundry users a familiar workflow experience on top of Knative.
- Google Cloud Code - Google Cloud Code provides IDE support for the full development cycle of Kubernetes applications, from creating a cluster to deploying your finished application. https://cloud.google.com/code/docs/intellij/
- Lift - internal hygen-inspired YAML generator (Spring to Cloud). The templating engine is pluggable - based on mustache/handlebars ATM. Pivotal-internal only at the moment.
- Docker Enterprise 3.0 Simplifies Kubernetes Management - It can identify and build, from Docker Hub, the containers needed and creates the Docker Compose and Kubernetes YAML files, Helm charts, and other required configuration settings.
- Develop with Java on Kubernetes using Azure Dev Spaces - Generate the Docker and Helm chart assets for running the application in Kubernetes using the azds prep command.
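To give a taste of the manifest-generator approach, here is a minimal sketch using the fabric8 kubernetes-client model builders to emit a Deployment as YAML (the app name and image are hypothetical; verify the builder API against your fabric8 version):

import io.fabric8.kubernetes.api.model.apps.Deployment;
import io.fabric8.kubernetes.api.model.apps.DeploymentBuilder;
import io.fabric8.kubernetes.client.utils.Serialization;

public class ManifestGenerator {
    public static void main(String[] args) {
        // Build a typed Deployment object instead of hand-writing YAML.
        Deployment deployment = new DeploymentBuilder()
                .withNewMetadata().withName("demo-app").endMetadata()
                .withNewSpec()
                    .withReplicas(2)
                    .withNewSelector().addToMatchLabels("app", "demo-app").endSelector()
                    .withNewTemplate()
                        .withNewMetadata().addToLabels("app", "demo-app").endMetadata()
                        .withNewSpec()
                            .addNewContainer()
                                .withName("demo-app")
                                .withImage("registry.example.com/demo-app:1.0")
                                .addNewPort().withContainerPort(8080).endPort()
                            .endContainer()
                        .endSpec()
                    .endTemplate()
                .endSpec()
                .build();
        // Serialize the typed model to the YAML that kubectl expects.
        System.out.println(Serialization.asYaml(deployment));
    }
}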
In the next blogpost I will cover the automation toolchain for building docker images from source. Tools like jib, pivotal build service, cloud-native-buildpacks, s2i etc fall in that category.
Honorable Mentions
The Future Of Observability and Developer Business Intersect Dashboards
Not sure if y'all came across https://thenewstack.io/observability-a-3-year-retrospective/
In the same vein, I wonder what a perfect app metrics dashboard looks like for the organization?
Here is a sample soup-to-nuts, source-to-business-OKRs dashboard that you should emulate.
I really liked this part; for me it resonated with the value of PCF Metrics. We have a single unified firehose of information available to us, which helps us achieve the things the article mentions, i.e. figure out the unknown unknowns ...
The Future of Observability
Three short years into this ride, I ponder the question; What’s next and where will this movement take us? I believe that in the next ~3 years, all three of those categories — APM, monitoring/metrics, logs, and possibly others — are likely to cease to exist. There will only be one category: observability. And it will contain all the insights you need to understand any state your system can get itself into.
After all, metrics, logs, and traces can trivially be derived from arbitrarily wide structured events; the reverse is not true.
Users are going to start to figure out that they are paying multiple times to store single data sets they should only have to store once. There is no reason to invest budget with separate monitoring vendors, logs vendors, tracing vendors, or APM vendors. If you collect data in arbitrarily wide structured events, you can infer metrics from those, and if you automatically append some simple span identifiers, you can use those same events for tracing views. Not only can you cut spending by 3-4X, but it’s phenomenally more powerful if you can use a single tool and fluidly flip back and forth between the big picture (“there’s a spike”) and drilling down to the exact raw events with the errors. Next, compute what outlier values they have in common, trace one of them, locate where in the trace the problem lives, and figure out who else is impacted by that specific outlier behavior. All conducted in one single solution with all teams getting the same level of visibility.
Right now this is either a) impossible, or b) a human being has to copy-paste an ID from one system to another to the next. This is wasteful, slow, and cumbersome, and extremely frustrating for the teams that have to do this when trying to solve a problem. Tools create silos and siloed teams spend too much time arguing about the nature of reality instead of the problem at hand.
Sunday, August 4, 2019
Failures in Microservices
As microservices evolve into a tangled mess of synchronous and asynchronous flows with multi-level fanouts, it becomes important to think about failure and resiliency, since failure is pretty much a guaranteed outcome: the availability of the whole system is the product of the availabilities of all its downstream microservices and dependencies (for example, a request path touching 30 services, each at 99.9% availability, yields roughly 0.999^30 ≈ 97% end-to-end availability).
How does one systematically think about handling load, graceful degradation, and load shedding in the face of impaired operation and sustained high load? Google's SRE books contain excellent high-level advice as it pertains to handling load and addressing cascading failures. I have prepared an actionable summary of a couple of chapters dealing with resiliency to win in the face of failure. Follow the notes here to create rigor and governance around microservices frameworks and templates, enabling systematic resiliency through circuit breakers and autoscaling for sustainable scale-out of your System of Systems.
Different types of resources can be exhausted:
Insufficient CPU > all requests become slower > various secondary effects:
1. Increased number of inflight requests
2. Excessively long queue lengths
- steady state rate of incoming requests > rate at which the server can process requests
3. Thread starvation
4. CPU or request starvation
5. Missed RPC deadlines
6. Reduced CPU caching benefits
Memory exhaustion - more in-flight requests consume more RAM (request, response, and RPC objects):
1. Dying containers due to OOM Killers
2. A vicious cycle - (Increased rate of GC in Java, resulting in increased CPU usage)
3. Reduction in app level cache hit rates
Threads (Tomcat HTTP)
1. Thread starvation can directly cause errors or lead to health check failures.
2. If the server adds threads as needed, thread overhead can use too much RAM.
3. In extreme cases, thread starvation can also cause you to run out of process IDs.
File descriptors
- Running out of file descriptors can lead to the inability to initialize network connections, which in turn can cause health checks to fail.
Dependencies among resources
- Resource exhaustion scenarios feed from one another
- DB Connections (Negative Indicator)
All this can ultimately lead to service unavailability: resource exhaustion can cause servers to crash, leading to a snowball effect.
How To Prevent Server Overload
- Load test the server’s capacity limits
- Serve degraded results
- Instrument servers to reject requests when overloaded - fail early and cheaply (see the sketch after this list)
- Instrument higher-level systems to reject requests: at reverse proxies, by limiting the volume of requests by criteria such as IP address; at the load balancers, by dropping requests when the service enters global overload; and at individual tasks
- Perform capacity planning
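Here is a minimal sketch of the fail-early idea: a handler guarded by a semaphore that rejects work beyond a fixed in-flight limit (the limit of 100 is hypothetical and should come from load testing):

import java.util.concurrent.Semaphore;

// Reject work beyond a fixed in-flight limit so an overloaded server fails
// early and cheaply instead of queueing itself to death.
class LoadSheddingHandler {
    // Hypothetical capacity; derive the real number from load tests.
    private final Semaphore inFlight = new Semaphore(100);

    String handle(String request) {
        if (!inFlight.tryAcquire()) {
            // Cheap rejection - the equivalent of returning HTTP 503 upstream.
            return "503 Service Unavailable - shedding load";
        }
        try {
            return process(request);
        } finally {
            inFlight.release();
        }
    }

    private String process(String request) {
        return "200 OK"; // stand-in for real request handling
    }
}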
Saturday, August 3, 2019
Making Eureka Service Discovery Responsive on PCF
Eureka service registries and Eureka clients are tuned for cloud-scale deployment of applications. This means tuning their respective server- and client-side caches to account for network brownouts and for self-preservation in the case of network partitions or failed compute or storage. All this has the effect of the service registry sometimes getting stale, especially in autoscaling or auto-descaling scenarios. If scaling up and down happens very fast, the REST or HTTP clients sometimes experience timeouts, since the service IPs are outdated and the service registry has not been updated with the latest set of microservice app instances.
We ran into one such issue on PCF and configured Service Discovery in three ways to eliminate timeouts. Much of this detailed experimentation was done by my colleague Rohit Bajaj.
- Ribbon Ping Configuration
- BOSH DNS Polyglot discovery
- Eureka server configuration to eliminate server timeouts
We determined that the native polyglot service discovery provided by Cloud Foundry is the optimal configuration for service discovery: the error rate drops to 0-1% in the auto-descaling scenarios, as opposed to > 2% with the other settings.
For your average Spring/Spring Boot Java app that requires service discovery, the Eureka Service Registry fits the bill nicely; however, if your workload is highly dynamic and you need a close-to-0 error rate when load balancing across transient service instances, then BOSH DNS is better. When using BOSH DNS you don't use Ribbon, so Custom Circuit Breaker + BOSH DNS replaces Ribbon + Eureka.
1. Ribbon Ping Configuration
2. BOSH DNS Discovery
Polyglot service discovery introduces new capabilities with a familiar workflow. An app developer can configure an internal route to create a DNS entry for their app. This makes the app discoverable on the container network. A DNS lookup on an internal route returns a list of container IPs for applications corresponding to that particular internal route (see docs).
We recommend pairing BOSH service discovery with a robust circuit breaker like Resilience4j; sample code pairing Resilience4j with BOSH polyglot discovery is sketched below.
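A minimal sketch of that pairing, assuming an internal route like demo-app.apps.internal registered via polyglot discovery (the route, port, and path are hypothetical; the CircuitBreaker API comes from the resilience4j-circuitbreaker library):

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.function.Supplier;

class InternalRouteClient {
    private final HttpClient http = HttpClient.newHttpClient();
    // Defaults: the circuit opens once the recent failure rate crosses 50%.
    private final CircuitBreaker breaker = CircuitBreaker.ofDefaults("demo-app");

    String call() {
        Supplier<String> request = () -> {
            try {
                // BOSH DNS resolves the internal route to healthy container IPs.
                HttpRequest req = HttpRequest.newBuilder()
                        .uri(URI.create("http://demo-app.apps.internal:8080/api/ping"))
                        .GET().build();
                return http.send(req, HttpResponse.BodyHandlers.ofString()).body();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        };
        // The breaker fails fast while instances churn during scale up/down.
        return CircuitBreaker.decorateSupplier(breaker, request).get();
    }
}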
3. Eureka Settings
Ideal settings for tuning Eureka clients and servers to be super responsive:
# Ribbon Settings
ribbon.ServerListRefreshInterval = 50
# Eureka Client
eureka.instance.lease-renewal-interval-in-seconds = 10
eureka.client.initialInstanceInfoReplicationIntervalSeconds = 10
eureka.client.instanceInfoReplicationIntervalSeconds = 10
eureka.client.registryFetchIntervalSeconds = 10
# Eureka Server
eureka.instance.lease-expiration-duration-in-seconds = 30
eureka.server.eviction-interval-timer-in-ms = 20 * 1000
eureka.server.responseCacheUpdateIntervalMs = 10 * 1000
eureka.server.getWaitTimeInMsWhenSyncEmpty = 10 * 1000
Typically changing Eureka server side settings is not possible when the Eureka server is provisioned by the Spring Cloud Services tile.
# Eureka Server Self-Preservation
eureka.server.enableSelfPreservation = false
The self-preservation window is evaluated every 15 minutes.
DEFAULT: 30s `lease-renewal-interval`, N = 24 instances. Self-preservation triggers when heartbeats per minute fall below 2 * 24 * 0.85 ≈ 41 heartbeats.
NEW: 10s interval, N = 24 instances. Self-preservation triggers when heartbeats per minute fall below 6 * 24 * 0.85 ≈ 123 heartbeats.
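A quick sanity check of the arithmetic above, using Eureka's default 85% renewal threshold factor (a toy calculation; plug in your own fleet size and interval):

class SelfPreservationThreshold {
    public static void main(String[] args) {
        int instances = 24;           // N registered instances
        int renewalIntervalSec = 10;  // lease-renewal-interval-in-seconds
        double renewalFactor = 0.85;  // Eureka's default renewal threshold factor
        // Heartbeats per minute below which the server enters self-preservation.
        int threshold = (int) Math.ceil(instances * (60.0 / renewalIntervalSec) * renewalFactor);
        System.out.println("Self-preservation triggers below " + threshold + " heartbeats/min");
    }
}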
# References
- Spring Cloud Netflix Eureka - The Hidden Manual https://blog.asarkar.org/technical/netflix-eureka/
- The Mystery of Eureka Self-Preservation https://medium.com/@fahimfarookme/the-mystery-of-eureka-self-preservation-c7aa0ed1b799
- Eureka http://netflix.github.io/ribbon/ribbon-eureka-javadoc/index.html
- Spring Cloud Ribbon https://cloud.spring.io/spring-cloud-netflix/multi/multi_spring-cloud-ribbon.html
- B/g for Eureka https://app-transformation-cookbook-internal.cfapps.io/duplicates/replatforming/blue-green-with-eureka/3e7f94f6c49795ef347b70141a36c134/
# CODE
https://github.com/Netflix/eureka/blob/master/eureka-client/src/main/java/com/netflix/discovery/EurekaClientConfig.java
https://github.com/Netflix/eureka/wiki/Eureka-REST-operations
https://github.com/Netflix/eureka/tree/master/eureka-core/src/main/java/com/netflix/eureka/registry
# Monitoring
https://docs.newrelic.com/docs/apis/get-started/intro-apis/understand-new-relic-api-keys
https://github.com/micrometer-metrics/micrometer/blob/master/implementations/micrometer-registry-new-relic/src/main/java/io/micrometer/newrelic/NewRelicMeterRegistry.java
https://docs.newrelic.com/docs/insights/insights-data-sources/custom-data/send-custom-events-event-api
https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-metrics.html#production-ready-metrics-export-newrelic
https://micrometer.io/docs/registry/new-relic
https://github.com/TechPrimers/spring-boot-1.5-micrometer-prometheus-example
https://github.com/cloudfoundry/java-buildpack-metric-writer/tree/master/java-buildpack-metric-writer-common/src/main/java/org/cloudfoundry/metrics
https://docs.pivotal.io/pivotalcf/2-6/metric-registrar/using.html
# Autoscaling
https://docs.pivotal.io/pivotalcf/2-6/appsman-services/autoscaler/using-autoscaler.html
https://www.toptal.com/devops/scaling-microservices-applications
https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing?slide=3
https://github.com/TechPrimers/spring-boot-1.5-micrometer-prometheus-example/blob/master/pom.xml
https://medium.com/finc-engineering/autoscaling-microservices-on-aws-part-1-c8488c64f6d1
https://docs.pivotal.io/pivotalcf/2-4/appsman-services/autoscaler/using-autoscaler-cli.html
https://nephely-io.github.io/app-autoscaling-calculator/?source=post_page
https://docs.pivotal.io/pivotalcf/2-6/appsman-services/autoscaler/using-autoscaler.html
# References
How to improve the eviction policy in the Eureka Service Registry
https://thepracticaldeveloper.com/2017/06/28/how-to-fix-eureka-taking-too-long-to-deregister-instances/
Working with load balancers
https://github.com/Netflix/ribbon/wiki/Working-with-load-balancers
Autoscaling using HTTP Throughput & Latency metrics
https://community.pivotal.io/s/article/autoscaling-using-http-throughput-latency-metrics
Client Side Load Balancer: Ribbon
https://cloud.spring.io/spring-cloud-netflix/multi/multi_spring-cloud-ribbon.html
Spring Boot Actuator: Production-ready features
https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-metrics.html
Observability
https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing?slide=21
Sunday, July 28, 2019
On Scaling Microservices
THOUGHTS ON SCALING MICROSERVICES
Much of this is a rehash of Susan Fowler's excellent book Production-Ready Microservices, published by O'Reilly Media, Inc., 2016. http://shop.oreilly.com/product/0636920053675.do
# Qualitative and Quantitative growth scales of a Microservice
## Qualitative
Qualitative growth scales allow the scalability of a service to tie in with higher-level business metrics: a microservice may, for example, scale with the number of users, with the number of people who open a phone application (“eyeballs”), or with the number of orders (for a food delivery service). These metrics, these qualitative growth scales, aren’t tied to an individual microservice but to the overall system or product(s).
- Business Metrics
- Number of Health Care Claims Adjudicated
- Number of Insurance claims processed
## Quantitative
If the qualitative growth scale of our microservice is measured in “eyeballs”, and each “eyeball” results in two requests to our microservice and one database transaction, then our quantitative growth scale is measured in terms of requests and transactions, resulting in requests per second and transactions per second as the two key quantities determining our scalability.
- RequestsPerSecond/QueriesPerSecond/TransactionsPerSecond
- HTTP Throughput
- CPU Utilization
- Memory
- Latency
- (negative - scaling) Threadpool saturation
- (negative - scaling) Number of open database connections _is it near the conn limit_
## What To Monitor For Each Microservice
### Infrastructure Metrics
- CPU utilized by the microservice across all containers
- RAM utilized by the microservice across all containers
- The available threads
- The microservice’s open file descriptors (FD)
- The number of database connections that the microservice has to any databases it uses
### Monitor the availability of the service
- Service-level agreement (SLA) of the service
- Latency (of both the service as a whole and its API endpoints)
- Success of API endpoints and responses
- Average response times of API endpoints, and the services (clients) from which API requests originate (along with which endpoints they send requests to)
- Errors and exceptions (both handled and unhandled), and the health and status of dependencies
# Monitoring ADVICE
- A CUSTOM DASHBOARD for each microservice, with alerts for each microservice on its key metrics
- Normal, Warning and Critical Alerts
- On call Runbook procedure for remediating all alerts
- Low level Remediations should be automated
A microservice should never experience the same exact problem twice.
Saturday, July 27, 2019
Spring Boot Microservices Observability > Pivotal PCF Metrics
We are often asked why the PCF platform team should install PCF Metrics when the customer already has New Relic, AppDynamics, Dynatrace, <insert APM tool of choice> and a log aggregator like ELK or Splunk. After all, PCF Metrics is a resource hog, at least in the older versions. Note this has improved and is tunable in later versions.
Now we all know that developers like their shiny toys and, given latitude, will install the internet on the platform. However, there are genuine reasons to install PCF Metrics for app developers. First, a picture is worth 1000 words, so please see the graphic and check if the reasons make sense. I have tried to present a biased bulls-and-bears assessment of PCF Metrics for your developer enterprise needs below.
Wait, but what is PCF Metrics?
Pivotal Cloud Foundry Metrics stores logs, metrics data, and event data from apps running on PCF for the past 14 days. It graphically presents this data to help operators and developers better understand the health and performance of their apps. PCF Metrics will enable development teams to advance microservice resiliency and scalability goals by providing a single pane of glass for logging, tracing, and metrics, giving insight into outages, interruptions, and scaling events.
Wait, but we already have all the data in Kibana and Splunk and NewRelic/AppDynamics?
Here is what PCF Metrics provides that the current monitoring and logging solutions do not:
1. The Java buildpack has inbuilt integration with the Metrics Forwarder tile and gives us the ability to look at Spring Boot Actuator metrics and custom metrics in PCF Metrics without any additional work. We need the ability to see custom metrics and Spring Boot Actuator metrics in a dashboard; see https://content.pivotal.io/blog/how-pcf-metrics-helps-you-reduce-mttr-for-spring-boot-apps-and-save-money-too for a behind-the-scenes look at how all this works.
Configuring the New Relic MeterRegistry, or any of the other specific Micrometer registries, has the benefit of skipping the potentially-lossy Loggregator flow in favor of direct communication. Skipping Loggregator in favor of direct communication with a registry is highly encouraged if your registry supports dimensional data, even if you could sink Loggregator metrics to that registry transparently. A configuration sketch follows.
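As a sketch of that direct path, here is a minimal wiring of a dedicated Micrometer registry, assuming the micrometer-registry-new-relic dependency is on the classpath (the account id and API key are placeholders; verify the config surface against your Micrometer version):

import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.Counter;
import io.micrometer.newrelic.NewRelicConfig;
import io.micrometer.newrelic.NewRelicMeterRegistry;

class DirectMetricsExample {
    public static void main(String[] args) {
        // Minimal config: ship metrics straight to the registry's API,
        // bypassing the Loggregator flow entirely.
        NewRelicConfig config = new NewRelicConfig() {
            @Override public String accountId() { return "YOUR_ACCOUNT_ID"; }    // placeholder
            @Override public String apiKey() { return "YOUR_INSIGHTS_API_KEY"; } // placeholder
            @Override public String get(String key) { return null; }            // use defaults
        };
        NewRelicMeterRegistry registry = new NewRelicMeterRegistry(config, Clock.SYSTEM);

        // A dimensional custom metric, tagged per app.
        Counter orders = Counter.builder("orders.placed")
                .tag("app", "demo-app")
                .register(registry);
        orders.increment();
    }
}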
2. The metrics are correlated with logs, and we can zoom in immediately on the section of logs pertinent to a metric spike. We can also visualize PCF autoscaling events in the dashboard, giving us an accurate picture of when and why scaling occurred and the faults in the system. Both of these capabilities are not provided by ELK/New Relic without significant customization; see https://docs.pivotal.io/pcf-metrics/1-6/using.html. You can also configure alerts in PCF Metrics via webhooks when metric thresholds are crossed or when significant app events take place. This allows developers to build and own what they deliver to the platform and enables developers to engage in SRE practices. Now, you can autoscale without PCF Metrics, but you just won't have as much historical data on which to base these scaling decisions. One could start creating autoscaling rules without referencing that history; PCF Metrics gives you the tools to make better scaling decisions.
3. From a cost perspective this tile is free and you are entitled to long-term support. An example resource configuration to store approximately 14 days of data for a small deployment of about 100 application instances is as follows. The tile can be configured in S, M, L, and XL resource configurations. From a cost perspective most dev teams should be able to absorb this cost. We can start small and then resize the deployment when the development team validates the value. If cost is an issue, the sizing can be further reduced by storing only 7 days.
Now, "it's basically free" should never be used as an argument for choosing the best tool for monitoring your applications in production, in part because tools like ELK and Prometheus also come with no-price-but-the-infrastructure options. Windows like the default two weeks may work in development environments but may not be sufficient for real-life app monitoring, especially in domains like e-commerce that have a distinct rhythm across the day/month/year.
PCF Metrics can be installed in the non-production PCF environment to do a proper cost/benefit, pros/cons analysis. Install the PCF Metrics and Metrics Forwarder tiles to achieve autoscaling and to meet the observability, resiliency, and non-functional requirements of microservices. PCF Metrics also supports the day-2 ops goals of your organization: http://cloud.rohitkelapure.com/2019/02/power-of-pcf-metrics.html
4. PCF Metrics comes with an inbuilt distributed tracing dashboard for all the microservices, providing an end-to-end view of microservice latencies in fanout configurations. This Zipkin-based dashboard that showcases spans and traces is baked into PCF Metrics and is better than the corresponding e2e support in New Relic; see https://content.pivotal.io/blog/distributed-tracing-in-pcf-metrics-breakthrough-insight-for-microservices. The distributed tracing support in PCF Metrics provides a deep understanding of the causality of outages and other interruptions across microservice fanout and hierarchical calls.
PCF Metrics does not displace your enterprise APM and logging tool. It does not provide support today for multi-dimensional metrics and only persists logs for 14 days by default. PCF Metrics does not provide any profiling data. There needs to be a proper telemetry- and APM-focused solution alongside PCF Metrics, and if you have one, ship metrics directly to that system using the right Micrometer registry. The fact that an application cannot publish metrics directly to PCF Metrics in any way, and requires additional platform support (either the Metrics Forwarder or the new Metric Registrar), also means product teams have extra coupling to the teams that maintain the platform or release pipelines. In the absence of PCF Metrics, teams should have an external log aggregator and metrics platform in place so that app teams own their own destiny regarding how and what they monitor about their apps.
If the platform team offers Prometheus for metrics and visualization, developers should use it. Prometheus has fantastic support for polling of metrics. In most places developers need to open tickets to post custom metrics; for instance, in New Relic you need an Insights API key to post custom metrics and create dashboards. NR and ELK are better than PCF Metrics when it comes to specialized features; however, developers don't visit these portals. PCF Metrics also enables the configuration of autoscaling rules on custom metrics and alerts. Developers are lazy and will do the most convenient thing, i.e. pull up the portal from the console, which leads me to the bottom line.
PCF Metrics is the perfect on-boarding portal for developers in non-production environments, enabling them to start the journey of observability and then go deeper with specialized tools. Credit to my colleague Matt Campbell for contributing to this blog and keeping it real!
Sunday, July 21, 2019
Tools To Create Chaos
These are interesting tools that I have come across in the last couple of days to create chaos - one of the key SRE practices - to determine if your production site can handle excess load ...
- ChaosBlade: An easy to use and powerful chaos engineering experiment toolkit from AliBaba. https://github.com/chaosblade-io/chaosblade
- Chaos Toolkit: create chaos in your Spring apps. https://github.com/chaostoolkit-incubator/chaostoolkit-spring and the chaos monkey for Spring Boot apps https://codecentric.github.io/chaos-monkey-spring-boot/ (a minimal property sketch follows this list)
- Gremlin: Chaos as a Service. https://www.gremlin.com/docs/application-layer/attacks/ Resiliency through orchestrated chaos. Worth paying for this service if you have low confidence in the production readiness of your code or if you don't have SRE practices to shock the organization into operational readiness.
- General Load Testing tools: http://cloud.rohitkelapure.com/2019/05/load-testing-tools.html
- Istio Fault Injection: https://istio.io/docs/tasks/traffic-management/fault-injection/
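As a taste of the Spring Boot chaos monkey mentioned above, a minimal property sketch to enable latency assaults, assuming the chaos-monkey-spring-boot dependency is added and the chaos-monkey profile is active (property names per the codecentric docs - verify against your version):

# application.properties - hypothetical chaos-monkey-spring-boot setup
chaos.monkey.enabled = true
chaos.monkey.watcher.rest-controller = true
chaos.monkey.assaults.latency-active = true
chaos.monkey.assaults.latency-range-start = 1000
chaos.monkey.assaults.latency-range-end = 3000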
Now that you have succeeded in creating chaos, how should you instrument and fix the system to deal with it? To understand how to deal with chaos, start with Health Checks and Graceful Degradation in Distributed Systems and Testing in Production - The Safe Way.
To understand the theory and implementation of SRE practices when dealing with chaos, read the chapters on Handling Overload and Addressing Cascading Failures from the SRE books. As a bonus, read the chapter on Non-Abstract Large System Design to understand the process for designing large-scale fault-tolerant systems.
Lastly, if you are in the Bay Area, this looks like an awesome conference: https://chaosconf.io/
Happy SRE Practices!
Wednesday, July 17, 2019
Distributed Cloud Native Transactions?
Does the Java world have a good cloud solution for distributed transactions?
@Wael from the Pivotal AppTx team asked this question, which triggered this blog post ...
The traditional view of distributed transactions in the new cloud-native microservices world is that they are a strict no-no. There are ways to work around distributed transactions via compensation, eventual consistency, and reverting to one-phase-commit transactions. These alternatives, although perfectly acceptable, leave us wanting. Developers want everything, and taking away distributed transactions is like pulling a favorite teddy bear away from a toddler.
But wait - the tech industry evolves so fast that there are other options available now ... Let me explain ...
The Java world NOW does have support for distributed transactions, indirectly: for heterogeneous distributed transactions, via Kafka and MongoDB; and for homogeneous distributed transactions, via hyper-scale, secret-sauce, expensive cloud-provider databases like Spanner and CosmosDB and others ...
Unless you like to pay top dollar to Microsoft or Google, I would stay away from proprietary hyper-scale cloud-provider databases that provide auto-magic two-phase-commit ACID transactions via atomic clocks synced across datacenters. God forbid you need to migrate away from these databases.
The other options are more palatable:
1. MongoDB has recently added ACID transaction support; see https://www.mongodb.com/transactions. Spring Data Mongo supports it via https://spring.io/blog/2018/06/28/hands-on-mongodb-4-0-transactions-with-spring-data and https://www.baeldung.com/spring-data-mongodb-transactions (a minimal configuration sketch follows this list).
2. Kafka - online event processing (OLEP) has enabled achieving consistency where distributed transactions have failed; see https://queue.acm.org/detail.cfm?id=3321612. Support for distributed transactions across heterogeneous storage technologies is either nonexistent or suffers from poor operational and performance characteristics. By building on top of immutable, persistent, ordered event logs, OLEP systems provide good performance and strong consistency guarantees in such settings. Use online event processing to implement distributed transactions. For more concrete guidance around distributed transactions with Kafka, check out https://kafka-summit.org/sessions/simplifying-distributed-transactions-sagas-kafka/, which introduces the new Simple Sagas library: built using Kafka Streams, it provides a scalable, fault-tolerant, event-based transaction processing engine and walks through a use case of coordinating a sequence of complex financial transactions.
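A minimal sketch of point 1 under the stated assumptions - a replica-set-backed MongoDB 4.x and a recent spring-data-mongodb, where registering a MongoTransactionManager activates @Transactional for Mongo (class names per the Spring blog linked above; the service shown is hypothetical):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.mongodb.MongoDatabaseFactory;
import org.springframework.data.mongodb.MongoTransactionManager;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.transaction.annotation.Transactional;

@Configuration
class MongoTxConfig {
    // Without this bean, Spring Data Mongo runs every operation auto-committed.
    @Bean
    MongoTransactionManager transactionManager(MongoDatabaseFactory factory) {
        return new MongoTransactionManager(factory);
    }
}

class TransferService {
    private final MongoTemplate mongo;
    TransferService(MongoTemplate mongo) { this.mongo = mongo; }

    // Both writes commit or roll back together inside one Mongo transaction.
    @Transactional
    void transfer(String fromAccountId, String toAccountId) {
        // debit fromAccountId and credit toAccountId via mongo.updateFirst(...)
    }
}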
The bottom line is that distributed transactions in the cloud are hard due to the heterogeneity of storage technologies, and the transaction managers for ACID transactions were all built for the pre-cloud era. You will need to approach transactions in the cloud-native era with new patterns like sagas, compensation, and eventual consistency, and rely on product characteristics like immutable, persistent, ordered logs - or proprietary features - to realize your dream of distributed transactions in the cloud. A toy sketch of the compensation idea follows.
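A minimal sketch of the saga/compensation pattern itself: each completed step registers an undo action, and a failure replays the compensations in reverse order (the steps are hypothetical; a production saga would be driven by events on a broker such as Kafka):

import java.util.ArrayDeque;
import java.util.Deque;

// Toy orchestration-style saga: run steps, remember their compensations, and
// on failure undo completed steps in reverse order - eventual consistency
// instead of a two-phase commit.
class OrderSaga {
    private final Deque<Runnable> compensations = new ArrayDeque<>();

    void run() {
        try {
            reserveInventory();
            compensations.push(this::releaseInventory);
            chargePayment();
            compensations.push(this::refundPayment);
            shipOrder(); // if this throws, everything above is undone
        } catch (Exception e) {
            while (!compensations.isEmpty()) {
                compensations.pop().run(); // compensate in reverse order
            }
        }
    }

    private void reserveInventory() { /* call inventory service */ }
    private void releaseInventory() { /* compensating action */ }
    private void chargePayment()    { /* call payment service */ }
    private void refundPayment()    { /* compensating action */ }
    private void shipOrder()        { /* call shipping service */ }
}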