Saturday, August 10, 2019
Event Storming - A Pivotal Practice for decomposing applications
FIELDS + GUIDANCE
# Name of method
Event Storming
# What is this method?
Event Storming is a cross-functional facilitation technique for revealing the bounded contexts, microservices, vertical slices, trouble spots, and starting points for a system or business process.
# Phases
Discovery, Kick Off
# Suggested Time
1-2 hours
# Who participates?
SMEs, Core Team (see facilitator notes)
# Why do it?
Event Storming enables decomposing monoliths into microservices. It allows for modeling new flows and ideas, synthesizing knowledge, and facilitating active group participation without conflict - time traveling and ideating the next generation of a software system.
# When to do it?
When you need to make sense of a huge mess and enable cross-perspective communication as a forcing function for clarity.
# What supplies are needed?
People, tools, and supplies needed to conduct an ES session.
# How to Use this Method
Event Storming is a group exercise to scientifically explore the domains and problem areas of a monolithic application. The most concise description of the process of Event Storming comes from Vaughn Vernon's DDD Distilled book, and the color around the process comes from Alberto Brandolini's book Event Storming.
Storm the business process by creating a series of domain events on sticky notes. The most popular color to use for domain events is orange. A domain event is a verb stated in the past tense and represents a state transition in the domain. Write the name of the domain event on an orange sticky note. Place the sticky notes on your modeling surface in time order, from left to right. As you go through the storming session you will find trouble spots in your existing business process; clearly mark these with purple/red sticky notes. Use vertical space to represent parallel processing.
After all the events are posted, experts will post locally ordered sequences of events and enforce a timeline. Enforcing a timeline triggers long-awaited conversations, and eventually STRUCTURE will emerge.
These event clumps or common groupings give us our notional service candidates (actors or aggregates, depending on how rigid the team is with DDD definitions). These will be used during the Boris Exercise. A minimal code illustration of the past-tense event convention follows.
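To make the naming convention concrete, here is a minimal sketch of how past-tense domain events captured on stickies might later translate into code (Java 16+ records; the event names are hypothetical):

import java.math.BigDecimal;
import java.time.Instant;

// Hypothetical domain events from an order-processing storm: past-tense
// names, immutable payloads, each recording a state transition in the domain.
record OrderPlaced(String orderId, Instant occurredAt) {}
record PaymentReceived(String orderId, BigDecimal amount, Instant occurredAt) {}
record OrderShipped(String orderId, String trackingId, Instant occurredAt) {}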
# Success/Expected Outcomes
"You know you are done when…"
- Event Storming generates an immense backlog of user stories
- Perform User Story Mapping to map and organize stories into MVPs
- Define the scope of the problem
- Confirm that you are solving the right problem
# Facilitator Notes & Tips
Event Storming is a technique used to visualize complex systems and processes, ranging from monoliths to value streams. It is a gamestorming technique for harnessing and capturing the information held in a group's minds. It surfaces conflicts and different perspectives of a complex system and bubbles up the top constraints and problem spots. As an Event Storming facilitator you have one job: create a safe environment for the exchange and output of ideas and data. The job is 50% technical facilitation and 50% soft people facilitation, where you are reading body language. A single facilitator can typically orchestrate groups of 15-20; for a group of 30 or more you need two facilitators.
ES is usually conducted in two phases: a high-level event storm to identify the domains, and then a subsequent ES into a top constraint - the core domain. The language of ES is stickies. In its simplest form, ES is basically facilitated group storytelling. The stickies represent domain events - things that happened in the past. Trouble spots are identified with orange/red stickies. The color of the stickies does not matter; what does matter is that you start simple and then add notation incrementally, layering in information. ES can serve many goals: break down a monolith into constituent bounded contexts, create a value stream, onboard employees, etc. There is no ONE correct style of ES. Every session differs based on the desired goals and outcomes, so don't worry about getting it right - just do it and roll your own style.
An ES is only successful if the right people are involved. This is a mix of business domain experts, customer executives, stakeholders, business analysts, software developers, architects, testers, and folks who support the product in production - subject matter experts, product owners, and developers who know and understand the application domain. This process enables cross-perspective conversation throughout the team, as well as a standard definition of the terms used by both technical and non-technical team members.
# Related practices
# Real world example
# Recommended reading
- Gamestorming: A Playbook for Innovators, Rulebreakers, and Changemakers - the motivation behind ES; available on Safari Books Online: https://www.safaribooksonline.com/library/view/gamestorming/9781449391195/ (this is the book we read on the flight to Boston)
- Introducing EventStorming https://leanpub.com/introducing_eventstorming - written by Alberto Brandolini, the father and inventor of EventStorming
- Domain-Driven Design (DDD) - provides the theoretical underpinnings of decomposing monoliths. DDD Distilled is the perfect book for understanding the science of DDD and how ES fits into the grander scheme of things: how the ES artifacts translate into software design, architecture, and an actual backlog.
Thursday, August 8, 2019
Learnings from Implementing enterprise event driven architecture
There are four different types of event-driven architecture [3]. The Docket-Based Choreography pattern is one of our inventions that allows us to design and operationalize an event-driven architecture for a legacy system. It involves event notification but is a specific implementation. Typical event-driven reengineering of a monolith involves both sync and async flows. CQRS and Event Sourcing - one of the en vogue forms of event-driven architecture - is hard, resulting in up to a 50% failure rate in projects. Practicing domain-driven design and carving out bounded contexts and vertical slices is hard. You have to stick to first principles after the system is decomposed to stay true to the event-driven architecture. We leverage Kafka a lot, primarily as a messaging broker. Developers struggle with ACID guarantees in event-driven systems. Online event processing provides a way to make a system eventually consistent [6].
Typical pitfalls encountered when reengineering monoliths to microservices are incidental coupling of microservices and a shared data model across microservices; in some cases the microservices reverted to shared canonical domain models. CDC-driven decomposition of monoliths is rare due to data silos. Successful architecture and app transformation requires a change in culture. The Event Shunting pattern [4] allows for gradual transformation from legacy to a modern stream-based event-driven architecture. Among many things, the pace of change dictates the boundary of the microservice. Leverage past experience to see which features/components change over time. Avoid the glorious central model, where a central authority makes changes to the canonical model. Avoid the trap of building a giant canonical model by staying true to guiding principles. Large messages as events vs. small messages build on this: large canonical messages can force many changes to the model, and managing regulatory, security, PII, and encrypted data at rest is painful. Small messages that trigger data fetches through an API have been much more successful towards a sustainable architecture, as the sketch below illustrates.
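A minimal sketch of the small-message approach: the event carries only an identifier, and the consumer fetches the full record through the owning service's API (the event, class, and endpoint names here are hypothetical):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical thin event: carries just enough to locate the data.
record CustomerChanged(String customerId) {}

class CustomerChangedHandler {
    private final HttpClient http = HttpClient.newHttpClient();

    // On a thin event, fetch current state from the owning service's API
    // instead of shipping a large canonical payload on the message bus.
    void onEvent(CustomerChanged event) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://customer-service.internal/customers/" + event.customerId()))
                .GET()
                .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Latest customer state: " + response.body());
    }
}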
References
1. Evolutionary_design_still_requires_up_front_thinking.html
2. http://nealford.com/memeagora/2015/03/30/architecture_is_abstract_until_operationalized.html
3. https://martinfowler.com/articles/201701-event-driven.html
4. https://medium.com/@KevinHoffman/migrating-apps-to-the-cloud-shunting-the-event-stream-8c2f6f309242
5. https://fs.blog/2018/04/reversible-irreversible-decisions/
6. https://queue.acm.org/detail.cfm?id=3321612
7. https://content.pivotal.io/blog/agile-architecture
Wednesday, August 7, 2019
Death To the Kubernetes YAML! Long Live the Manifest Generators
You are wallowing in a wall of YAML, wondering where your inner-loop developer productivity disappeared. You are wandering the annals of the internet to find the kubectl equivalent of cf push. You are a refugee from the land of Platform-as-a-Service wandering aimlessly in the Container-as-a-Service world. The land of Kubernetes is intimidating for those of us who are used to the higher-order abstractions afforded by platforms like Cloud Foundry or Heroku.
In this series of blog posts, a refugee from PaaS who has crossed the chasm will be your coach as you navigate this perilous journey. We will cover the equivalence of concepts from Cloud Foundry to Kubernetes. We will draw out the distinctions in architecture & developer workflows across K8s and Cloud Foundry. You will get a deep understanding of the tools & the confidence necessary to make YAML your new best friend and conquer K8s to 10x your productivity, developing in the Kubernetes-native way.
So here it goes - episode 1 - Death to YAML long live the manifest generators!
What are the options when it comes to a cf push-like experience on Kubernetes? I am going to keep a running list of K8s YAML template generators here that generate K8s manifests for app deployments. Developers can start with the tools below to escape death by YAML on first contact with Kubernetes.
- Fabric8.io has a Java DSL for Kubernetes - the fabric8.io Kubernetes Java client. Istio and other projects use it. The fabric8 library is a single jar and works in airgapped environments. It provides a typesafe Kubernetes-manifest DSL for JVM-based apps (see the sketch after this list).
- The primary interface to kubectl is YAML. Pulumi exposes a rich, multi-language SDK to create API resources, and additionally supports execution of Kubernetes YAML manifests and Helm charts.
- kf provides cloud foundry users a familiar workflow experience on top of Knative.
- Google Cloud Code - Google Cloud Code provides IDE support for the full development cycle of Kubernetes applications, from creating a cluster to deploying your finished application. https://cloud.google.com/code/docs/intellij/
- Lift - internal hygen-inspired YAML generator (Spring to Cloud). The templating engine is pluggable - based on mustache/handlebars ATM. Pivotal-internal only at the moment.
- Docker Enterprise 3.0 Simplifies Kubernetes Management - It can identify and build, from Docker Hub, the containers needed and creates the Docker Compose and Kubernetes YAML files, Helm charts, and other required configuration settings.
- Develop with Java on Kubernetes using Azure Dev Spaces - Generate the Docker and Helm chart assets for running the application in Kubernetes using the azds prep command.
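To give a taste of the manifest-generator approach, here is a minimal sketch using the fabric8 kubernetes-client model builders to emit a Deployment as YAML (the app name and image are hypothetical; verify the builder API against your fabric8 version):

import io.fabric8.kubernetes.api.model.apps.Deployment;
import io.fabric8.kubernetes.api.model.apps.DeploymentBuilder;
import io.fabric8.kubernetes.client.utils.Serialization;

public class ManifestGenerator {
    public static void main(String[] args) {
        // Build a typed Deployment object instead of hand-writing YAML.
        Deployment deployment = new DeploymentBuilder()
                .withNewMetadata().withName("demo-app").endMetadata()
                .withNewSpec()
                    .withReplicas(2)
                    .withNewSelector().addToMatchLabels("app", "demo-app").endSelector()
                    .withNewTemplate()
                        .withNewMetadata().addToLabels("app", "demo-app").endMetadata()
                        .withNewSpec()
                            .addNewContainer()
                                .withName("demo-app")
                                .withImage("registry.example.com/demo-app:1.0")
                                .addNewPort().withContainerPort(8080).endPort()
                            .endContainer()
                        .endSpec()
                    .endTemplate()
                .endSpec()
                .build();
        // Serialize the typed model to the YAML that kubectl expects.
        System.out.println(Serialization.asYaml(deployment));
    }
}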
In the next blogpost I will cover the automation toolchain for building docker images from source. Tools like jib, pivotal build service, cloud-native-buildpacks, s2i etc fall in that category.
Honorable Mentions
The Future Of Observability and Developer Business Intersect Dashboards
Not sure if y'all came across https://thenewstack.io/observability-a-3-year-retrospective/
In the same vein, I wonder what a perfect app metrics dashboard looks like for the organization?
Here is a sample soup-to-nuts, source-to-business-OKRs dashboard that you should emulate.
I really liked this part; for me it resonated with the value of PCF Metrics. We have a single unified firehose of information available to us, which helps us achieve the things the article mentions, i.e. figure out the unknown unknowns ...
The Future of Observability
Three short years into this ride, I ponder the question; What’s next and where will this movement take us? I believe that in the next ~3 years, all three of those categories — APM, monitoring/metrics, logs, and possibly others — are likely to cease to exist. There will only be one category: observability. And it will contain all the insights you need to understand any state your system can get itself into.
After all, metrics, logs, and traces can trivially be derived from arbitrarily wide structured events; the reverse is not true.
Users are going to start to figure out that they are paying multiple times to store single data sets they should only have to store once. There is no reason to invest budget with separate monitoring vendors, logs vendors, tracing vendors, or APM vendors. If you collect data in arbitrarily wide structured events, you can infer metrics from those, and if you automatically append some simple span identifiers, you can use those same events for tracing views. Not only can you cut spending by 3-4X, but it’s phenomenally more powerful if you can use a single tool and fluidly flip back and forth between the big picture (“there’s a spike”) and drilling down to the exact raw events with the errors. Next, compute what outlier values they have in common, trace one of them, locate where in the trace the problem lives, and figure out who else is impacted by that specific outlier behavior. All conducted in one single solution with all teams getting the same level of visibility.
Right now this is either a) impossible, or b) a human being has to copy-paste an ID from one system to another to the next. This is wasteful, slow, and cumbersome, and extremely frustrating for the teams that have to do this when trying to solve a problem. Tools create silos and siloed teams spend too much time arguing about the nature of reality instead of the problem at hand.
Sunday, August 4, 2019
Failures in Microservices
As microservices evolve into a tangled mess of synchronous and asynchronous flows with multi-level fanouts, it becomes important to think about failure and resiliency, since failure is pretty much a guaranteed outcome: the availability of the whole system is the product of the availabilities of all its downstream microservices and dependencies (for example, a request path touching 30 services, each at 99.9% availability, yields roughly 0.999^30 ≈ 97% end-to-end availability).
How does one systematically think about handling load, graceful degradation, and load shedding in the face of impaired operation and sustained high load? Google's SRE books contain excellent high-level advice as it pertains to handling load and addressing cascading failures. I have prepared an actionable summary of a couple of chapters dealing with resiliency to win in the face of failure. Follow the notes here to create rigor and governance around microservices frameworks and templates, enabling systematic resiliency through circuit breakers and autoscaling for sustainable scale-out of your System of Systems.
Different types of resources can be exhausted:
Insufficient CPU > all requests become slower > various secondary effects:
1. Increased number of inflight requests
2. Excessively long queue lengths
- steady state rate of incoming requests > rate at which the server can process requests
3. Thread starvation
4. CPU or request starvation
5. Missed RPC deadlines
6. Reduced CPU caching benefits
Memory exhaustion - more in-flight requests consume more RAM (request, response, and RPC objects):
1. Dying containers due to OOM Killers
2. A vicious cycle - (Increased rate of GC in Java, resulting in increased CPU usage)
3. Reduction in app level cache hit rates
Threads (Tomcat HTTP)
1. Thread starvation can directly cause errors or lead to health check failures.
2. If the server adds threads as needed, thread overhead can use too much RAM.
3. In extreme cases, thread starvation can also cause you to run out of process IDs.
File descriptors
- Running out of file descriptors can lead to the inability to initialize network connections, which in turn can cause health checks to fail.
Dependencies among resources
- Resource exhaustion scenarios feed from one another
- DB Connections (Negative Indicator)
All this can ultimately lead to service unavailability: resource exhaustion can cause servers to crash, leading to a snowball effect.
How To Prevent Server Overload
- Load test the server’s capacity limits
- Serve degraded results
- Instrument servers to reject requests when overloaded - fail early and cheaply (see the sketch after this list)
- Instrument higher-level systems to reject requests: at reverse proxies, by limiting the volume of requests by criteria such as IP address; at the load balancers, by dropping requests when the service enters global overload; and at individual tasks
- Perform capacity planning
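Here is a minimal sketch of the fail-early idea: a handler guarded by a semaphore that rejects work beyond a fixed in-flight limit (the limit of 100 is hypothetical and should come from load testing):

import java.util.concurrent.Semaphore;

// Reject work beyond a fixed in-flight limit so an overloaded server fails
// early and cheaply instead of queueing itself to death.
class LoadSheddingHandler {
    // Hypothetical capacity; derive the real number from load tests.
    private final Semaphore inFlight = new Semaphore(100);

    String handle(String request) {
        if (!inFlight.tryAcquire()) {
            // Cheap rejection - the equivalent of returning HTTP 503 upstream.
            return "503 Service Unavailable - shedding load";
        }
        try {
            return process(request);
        } finally {
            inFlight.release();
        }
    }

    private String process(String request) {
        return "200 OK"; // stand-in for real request handling
    }
}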
Saturday, August 3, 2019
Making Eureka Service Discovery Responsive on PCF
Eureka service registries and Eureka clients are tuned for cloud-scale deployment of applications. This means tuning their respective server- and client-side caches to account for network brownouts and for self-preservation in the case of network partitions or failed compute or storage. All this has the effect of the service registry sometimes getting stale, especially in autoscaling or auto-descaling scenarios. If scaling up and down happens very fast, the REST or HTTP clients sometimes experience timeouts, since the service IPs are outdated and the service registry has not been updated with the latest set of microservice app instances.
We ran into one such issue on PCF and configured Service Discovery in three ways to eliminate timeouts. Much of this detailed experimentation was done by my colleague Rohit Bajaj.
- Ribbon Ping Configuration
- BOSH DNS Polyglot discovery
- Eureka server configuration to eliminate server timeouts
We determined that the native polyglot service discovery provided by Cloud Foundry is the optimal configuration for service discovery: the error rate drops to 0-1% in the auto-descaling scenarios, as opposed to > 2% with the other settings.
For your average Spring/Spring Boot Java app that requires service discovery, the Eureka Service Registry fits the bill nicely; however, if your workload is highly dynamic and you need a close-to-0 error rate when load balancing across transient service instances, then BOSH DNS is better. When using BOSH DNS you don't use Ribbon, so Custom Circuit Breaker + BOSH DNS replaces Ribbon + Eureka.
1. Ribbon Ping Configuration
2. BOSH DNS Discovery
Polyglot service discovery introduces new capabilities with a familiar workflow. An app developer can configure an internal route to create a DNS entry for their app. This makes the app discoverable on the container network. A DNS lookup on an internal route returns a list of container IPs for applications corresponding to that particular internal route (see docs).
We recommend pairing BOSH service discovery with a robust circuit breaker like Resilience4j; sample code pairing Resilience4j with BOSH polyglot discovery is sketched below.
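A minimal sketch of that pairing, assuming an internal route like demo-app.apps.internal registered via polyglot discovery (the route, port, and path are hypothetical; the CircuitBreaker API comes from the resilience4j-circuitbreaker library):

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.function.Supplier;

class InternalRouteClient {
    private final HttpClient http = HttpClient.newHttpClient();
    // Defaults: the circuit opens once the recent failure rate crosses 50%.
    private final CircuitBreaker breaker = CircuitBreaker.ofDefaults("demo-app");

    String call() {
        Supplier<String> request = () -> {
            try {
                // BOSH DNS resolves the internal route to healthy container IPs.
                HttpRequest req = HttpRequest.newBuilder()
                        .uri(URI.create("http://demo-app.apps.internal:8080/api/ping"))
                        .GET().build();
                return http.send(req, HttpResponse.BodyHandlers.ofString()).body();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        };
        // The breaker fails fast while instances churn during scale up/down.
        return CircuitBreaker.decorateSupplier(breaker, request).get();
    }
}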
3. Eureka Settings
Ideal settings for tuning Eureka clients and servers to be super responsive:
# Ribbon Settings
ribbon.ServerListRefreshInterval = 50
# Eureka Client
eureka.instance.lease-renewal-interval-in-seconds = 10
eureka.client.initialInstanceInfoReplicationIntervalSeconds = 10
eureka.client.instanceInfoReplicationIntervalSeconds = 10
eureka.client.registryFetchIntervalSeconds = 10
# Eureka Server
eureka.instance.lease-expiration-duration-in-seconds = 30
eureka.server.eviction-interval-timer-in-ms = 20 * 1000
eureka.server.responseCacheUpdateIntervalMs = 10 * 1000
eureka.server.getWaitTimeInMsWhenSyncEmpty = 10 * 1000
Typically changing Eureka server side settings is not possible when the Eureka server is provisioned by the Spring Cloud Services tile.
# Eureka Server Self-Preservation
eureka.server.enableSelfPreservation = false
The self-preservation window is evaluated every 15 minutes.
DEFAULT: 30s `lease-renewal-interval`, N = 24 instances. Self-preservation triggers when heartbeats per minute fall below 2 * 24 * 0.85 ≈ 41 heartbeats.
NEW: 10s interval, N = 24 instances. Self-preservation triggers when heartbeats per minute fall below 6 * 24 * 0.85 ≈ 123 heartbeats.
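A quick sanity check of the arithmetic above, using Eureka's default 85% renewal threshold factor (a toy calculation; plug in your own fleet size and interval):

class SelfPreservationThreshold {
    public static void main(String[] args) {
        int instances = 24;           // N registered instances
        int renewalIntervalSec = 10;  // lease-renewal-interval-in-seconds
        double renewalFactor = 0.85;  // Eureka's default renewal threshold factor
        // Heartbeats per minute below which the server enters self-preservation.
        int threshold = (int) Math.ceil(instances * (60.0 / renewalIntervalSec) * renewalFactor);
        System.out.println("Self-preservation triggers below " + threshold + " heartbeats/min");
    }
}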
# References
- Spring Cloud Netflix Eureka - The Hidden Manual https://blog.asarkar.org/technical/netflix-eureka/
- The Mystery of Eureka Self-Preservation https://medium.com/@fahimfarookme/the-mystery-of-eureka-self-preservation-c7aa0ed1b799
- Eureka http://netflix.github.io/ribbon/ribbon-eureka-javadoc/index.html
- Spring Cloud Ribbon https://cloud.spring.io/spring-cloud-netflix/multi/multi_spring-cloud-ribbon.html
- B/g for Eureka https://app-transformation-cookbook-internal.cfapps.io/duplicates/replatforming/blue-green-with-eureka/3e7f94f6c49795ef347b70141a36c134/
# CODE
https://github.com/Netflix/eureka/blob/master/eureka-client/src/main/java/com/netflix/discovery/EurekaClientConfig.java
https://github.com/Netflix/eureka/wiki/Eureka-REST-operations
https://github.com/Netflix/eureka/tree/master/eureka-core/src/main/java/com/netflix/eureka/registry
# Monitoring
https://docs.newrelic.com/docs/apis/get-started/intro-apis/understand-new-relic-api-keys
https://github.com/micrometer-metrics/micrometer/blob/master/implementations/micrometer-registry-new-relic/src/main/java/io/micrometer/newrelic/NewRelicMeterRegistry.java
https://docs.newrelic.com/docs/insights/insights-data-sources/custom-data/send-custom-events-event-api
https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-metrics.html#production-ready-metrics-export-newrelic
https://micrometer.io/docs/registry/new-relic
https://github.com/TechPrimers/spring-boot-1.5-micrometer-prometheus-example
https://github.com/cloudfoundry/java-buildpack-metric-writer/tree/master/java-buildpack-metric-writer-common/src/main/java/org/cloudfoundry/metrics
https://docs.pivotal.io/pivotalcf/2-6/metric-registrar/using.html
# Autoscaling
https://docs.pivotal.io/pivotalcf/2-6/appsman-services/autoscaler/using-autoscaler.html
https://www.toptal.com/devops/scaling-microservices-applications
https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing?slide=3
https://github.com/TechPrimers/spring-boot-1.5-micrometer-prometheus-example/blob/master/pom.xml
https://medium.com/finc-engineering/autoscaling-microservices-on-aws-part-1-c8488c64f6d1
https://docs.pivotal.io/pivotalcf/2-4/appsman-services/autoscaler/using-autoscaler-cli.html
https://nephely-io.github.io/app-autoscaling-calculator/?source=post_page
https://docs.pivotal.io/pivotalcf/2-6/appsman-services/autoscaler/using-autoscaler.html
# References
How to improve the eviction policy in the Eureka Service Registry
https://thepracticaldeveloper.com/2017/06/28/how-to-fix-eureka-taking-too-long-to-deregister-instances/
Working with load balancers
https://github.com/Netflix/ribbon/wiki/Working-with-load-balancers
Autoscaling using HTTP Throughput & Latency metrics
https://community.pivotal.io/s/article/autoscaling-using-http-throughput-latency-metrics
Client Side Load Balancer: Ribbon
https://cloud.spring.io/spring-cloud-netflix/multi/multi_spring-cloud-ribbon.html
Spring Boot Actuator: Production-ready features
https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-metrics.html
Observability
https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing?slide=21
Sunday, July 28, 2019
On Scaling Microservices
THOUGHTS ON SCALING MICROSERVICES
Much of this is a rehash of Susan Fowler's excellent book Production-Ready Microservices, published by O'Reilly Media, Inc., 2016. http://shop.oreilly.com/product/0636920053675.do
# Qualitative and Quantitative growth scales of a Microservice
## Qualitative
Qualitative growth scales allow the scalability of a service to tie in with higher-level business metrics: a microservice may, for example, scale with the number of users, with the number of people who open a phone application (“eyeballs”), or with the number of orders (for a food delivery service). These metrics, these qualitative growth scales, aren’t tied to an individual microservice but to the overall system or product(s).
- Business Metrics
- Number of Health Care Claims Adjudicated
- Number of Insurance claims processed
## Quantitative
If the qualitative growth scale of our microservice is measured in “eyeballs”, and each “eyeball” results in two requests to our microservice and one database transaction, then our quantitative growth scale is measured in terms of requests and transactions, resulting in requests per second and transactions per second as the two key quantities determining our scalability.
- RequestsPerSecond/QueriesPerSecond/TransactionsPerSecond
- HTTP Throughput
- CPU Utilization
- Memory
- Latency
- (negative - scaling) Threadpool saturation
- (negative - scaling) Number of open database connections _is it near the conn limit_
## What To Monitor For Each Microservice
### Infrastructure Metrics
- CPU utilized by the microservice across all containers
- RAM utilized by the microservice across all containers
- The available threads
- The microservice’s open file descriptors (FD)
- The number of database connections that the microservice has to any databases it uses
### Monitor the availability of the service
- Service-level agreement (SLA) of the service
- Latency (of both the service as a whole and its API endpoints)
- Success of API endpoints and responses
- Average response times of API endpoints, and the services (clients) from which API requests originate (along with which endpoints they send requests to)
- Errors and exceptions (both handled and unhandled), and the health and status of dependencies
# Monitoring ADVICE
- A CUSTOM DASHBOARD for each microservice, with alerts for each microservice on its key metrics
- Normal, Warning and Critical Alerts
- On call Runbook procedure for remediating all alerts
- Low level Remediations should be automated
A microservice should never experience the same exact problem twice.
Saturday, July 27, 2019
Spring Boot Microservices Observability > Pivotal PCF Metrics
We are often asked why the PCF platform team should install PCF Metrics when the customer already has New Relic, AppDynamics, Dynatrace, <insert APM tool of choice> and a log aggregator like ELK or Splunk. After all, PCF Metrics is a resource hog, at least in the older versions. Note this has improved and is tunable in later versions.
Now we all know that developers like their shiny toys and, given latitude, will install the internet on the platform. However, there are genuine reasons to install PCF Metrics for app developers. First, a picture is worth 1000 words, so please see the graphic and check if the reasons make sense. I have tried to present a biased bulls-and-bears assessment of PCF Metrics for your developer enterprise needs below.
Wait, but what is PCF Metrics?
Pivotal Cloud Foundry Metrics stores logs, metrics data, and event data from apps running on PCF for the past 14 days. It graphically presents this data to help operators and developers better understand the health and performance of their apps. PCF Metrics will enable development teams to advance microservice resiliency and scalability goals by providing a single pane of glass for logging, tracing, and metrics, giving insight into outages, interruptions, and scaling events.
Wait, but we already have all the data in Kibana and Splunk and NewRelic/AppDynamics?
Here is what PCF Metrics provides that the current monitoring and logging solutions do not:
1. The Java buildpack has inbuilt integration with the Metrics Forwarder tile and gives us the ability to look at Spring Boot Actuator metrics and custom metrics in PCF Metrics without any additional work. We need the ability to see custom metrics and Spring Boot Actuator metrics in a dashboard; see https://content.pivotal.io/blog/how-pcf-metrics-helps-you-reduce-mttr-for-spring-boot-apps-and-save-money-too for a behind-the-scenes look at how all this works.
Configuring the New Relic MeterRegistry, or any of the other specific Micrometer registries, has the benefit of skipping the potentially-lossy Loggregator flow in favor of direct communication. Skipping Loggregator in favor of direct communication with a registry is highly encouraged if your registry supports dimensional data, even if you could sink Loggregator metrics to that registry transparently. A configuration sketch follows.
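As a sketch of that direct path, here is a minimal wiring of a dedicated Micrometer registry, assuming the micrometer-registry-new-relic dependency is on the classpath (the account id and API key are placeholders; verify the config surface against your Micrometer version):

import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.Counter;
import io.micrometer.newrelic.NewRelicConfig;
import io.micrometer.newrelic.NewRelicMeterRegistry;

class DirectMetricsExample {
    public static void main(String[] args) {
        // Minimal config: ship metrics straight to the registry's API,
        // bypassing the Loggregator flow entirely.
        NewRelicConfig config = new NewRelicConfig() {
            @Override public String accountId() { return "YOUR_ACCOUNT_ID"; }    // placeholder
            @Override public String apiKey() { return "YOUR_INSIGHTS_API_KEY"; } // placeholder
            @Override public String get(String key) { return null; }            // use defaults
        };
        NewRelicMeterRegistry registry = new NewRelicMeterRegistry(config, Clock.SYSTEM);

        // A dimensional custom metric, tagged per app.
        Counter orders = Counter.builder("orders.placed")
                .tag("app", "demo-app")
                .register(registry);
        orders.increment();
    }
}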
2. The metrics are correlated with logs, and we can zoom in immediately on the section of logs pertinent to a metric spike. We can also visualize PCF autoscaling events in the dashboard, giving us an accurate picture of when and why scaling occurred and the faults in the system. Both of these capabilities are not provided by ELK/New Relic without significant customization; see https://docs.pivotal.io/pcf-metrics/1-6/using.html. You can also configure alerts in PCF Metrics via webhooks when metric thresholds are crossed or when significant app events take place. This allows developers to build and own what they deliver to the platform and enables developers to engage in SRE practices. Now, you can autoscale without PCF Metrics, but you just won't have as much historical data on which to base these scaling decisions. One could start creating autoscaling rules without referencing that history; PCF Metrics gives you the tools to make better scaling decisions.
3. From a cost perspective this tile is free and you are entitled to long-term support. An example resource configuration to store approximately 14 days of data for a small deployment of about 100 application instances is as follows. The tile can be configured in S, M, L, and XL resource configurations. From a cost perspective most dev teams should be able to absorb this cost. We can start small and then resize the deployment when the development team validates the value. If cost is an issue, the sizing can be further reduced by storing only 7 days.
Now, "it's basically free" should never be used as an argument for choosing the best tool for monitoring your applications in production, in part because tools like ELK and Prometheus also come with no-price-but-the-infrastructure options. Windows like the default two weeks may work in development environments but may not be sufficient for real-life app monitoring, especially in domains like e-commerce that have a distinct rhythm across the day/month/year.
PCF Metrics can be installed in the non-production PCF environment to do a proper cost/benefit, pros/cons analysis. Install the PCF Metrics and Metrics Forwarder tiles to achieve autoscaling and to meet the observability, resiliency, and non-functional requirements of microservices. PCF Metrics also supports the day-2 ops goals of your organization: http://cloud.rohitkelapure.com/2019/02/power-of-pcf-metrics.html
4. PCF Metrics comes with an inbuilt distributed tracing dashboard for all the microservices, providing an end-to-end view of microservice latencies in fanout configurations. This Zipkin-based dashboard that showcases spans and traces is baked into PCF Metrics and is better than the corresponding e2e support in New Relic; see https://content.pivotal.io/blog/distributed-tracing-in-pcf-metrics-breakthrough-insight-for-microservices. The distributed tracing support in PCF Metrics provides a deep understanding of the causality of outages and other interruptions across microservice fanout and hierarchical calls.
PCF Metrics does not displace your enterprise APM and logging tool. It does not provide support today for multi-dimensional metrics and only persists logs for 14 days by default. PCF Metrics does not provide any profiling data. There needs to be a proper telemetry- and APM-focused solution alongside PCF Metrics, and if you have one, ship metrics directly to that system using the right Micrometer registry. The fact that an application cannot publish metrics directly to PCF Metrics in any way, and requires additional platform support (either the Metrics Forwarder or the new Metric Registrar), also means product teams have extra coupling to the teams that maintain the platform or release pipelines. In the absence of PCF Metrics, teams should have an external log aggregator and metrics platform in place so that app teams own their own destiny regarding how and what they monitor about their apps.
If the platform team offers Prometheus for metrics and visualization, developers should use it. Prometheus has fantastic support for polling of metrics. In most places developers need to open tickets to post custom metrics; for instance, in New Relic you need an Insights API key to post custom metrics and create dashboards. NR and ELK are better than PCF Metrics when it comes to specialized features; however, developers don't visit these portals. PCF Metrics also enables the configuration of autoscaling rules on custom metrics and alerts. Developers are lazy and will do the most convenient thing, i.e. pull up the portal from the console, which leads me to the bottom line.
PCF Metrics is the perfect on-boarding portal for developers in non-production environments, enabling them to start the journey of observability and then go deeper with specialized tools. Credit to my colleague Matt Campbell for contributing to this blog and keeping it real!
Sunday, July 21, 2019
Tools To Create Chaos
These are interesting tools that I have come across in the last couple of days to create chaos - one of the key SRE practices - to determine if your production site can handle excess load ...
- ChaosBlade: An easy to use and powerful chaos engineering experiment toolkit from AliBaba. https://github.com/chaosblade-io/chaosblade
- Chaos Toolkit: create chaos in your Spring apps. https://github.com/chaostoolkit-incubator/chaostoolkit-spring and the chaos monkey for Spring Boot apps https://codecentric.github.io/chaos-monkey-spring-boot/ (a minimal property sketch follows this list)
- Gremlin: Chaos as a Service. https://www.gremlin.com/docs/application-layer/attacks/ Resiliency through orchestrated chaos. Worth paying for this service if you have low confidence in the production readiness of your code or if you don't have SRE practices to shock the organization into operational readiness.
- General Load Testing tools: http://cloud.rohitkelapure.com/2019/05/load-testing-tools.html
- Istio Fault Injection: https://istio.io/docs/tasks/traffic-management/fault-injection/
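As a taste of the Spring Boot chaos monkey mentioned above, a minimal property sketch to enable latency assaults, assuming the chaos-monkey-spring-boot dependency is added and the chaos-monkey profile is active (property names per the codecentric docs - verify against your version):

# application.properties - hypothetical chaos-monkey-spring-boot setup
chaos.monkey.enabled = true
chaos.monkey.watcher.rest-controller = true
chaos.monkey.assaults.latency-active = true
chaos.monkey.assaults.latency-range-start = 1000
chaos.monkey.assaults.latency-range-end = 3000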
Now that you have succeeded in creating chaos, how should you instrument and fix the system to deal with it? To understand how to deal with chaos, start with Health Checks and Graceful Degradation in Distributed Systems and Testing in Production - The Safe Way.
To understand the theory and implementation of SRE practices when dealing with chaos, read the chapters on Handling Overload and Addressing Cascading Failures from the SRE books. As a bonus, read the chapter on Non-Abstract Large System Design to understand the process for designing large-scale fault-tolerant systems.
Lastly, if you are in the Bay Area, this looks like an awesome conference: https://chaosconf.io/
Happy SRE Practices!
Wednesday, July 17, 2019
Distributed Cloud Native Transactions?
Does the Java world have a good cloud solution for distributed transactions?
@Wael from the Pivotal AppTx team asked this question, which triggered this blog post ...
The traditional view of distributed transactions in the new cloud-native microservices world is that they are a strict no-no. There are ways to work around distributed transactions via compensation, eventual consistency, and reverting to one-phase-commit transactions. These alternatives, although perfectly acceptable, leave us wanting. Developers want everything, and taking away distributed transactions is like pulling a favorite teddy bear away from a toddler.
But wait - the tech industry evolves so fast that there are other options available now ... Let me explain ...
The Java world NOW does have support for distributed transactions, indirectly: for heterogeneous distributed transactions, via Kafka and MongoDB; and for homogeneous distributed transactions, via hyper-scale, secret-sauce, expensive cloud-provider databases like Spanner and CosmosDB and others ...
Unless you like to pay top dollar to Microsoft or Google, I would stay away from proprietary hyper-scale cloud-provider databases that provide auto-magic two-phase-commit ACID transactions via atomic clocks synced across datacenters. God forbid you need to migrate away from these databases.
The other options are more palatable:
1. MongoDB has recently added ACID transaction support; see https://www.mongodb.com/transactions. Spring Data Mongo supports it via https://spring.io/blog/2018/06/28/hands-on-mongodb-4-0-transactions-with-spring-data and https://www.baeldung.com/spring-data-mongodb-transactions (a minimal configuration sketch follows this list).
2. Kafka - online event processing (OLEP) has enabled achieving consistency where distributed transactions have failed; see https://queue.acm.org/detail.cfm?id=3321612. Support for distributed transactions across heterogeneous storage technologies is either nonexistent or suffers from poor operational and performance characteristics. By building on top of immutable, persistent, ordered event logs, OLEP systems provide good performance and strong consistency guarantees in such settings. Use online event processing to implement distributed transactions. For more concrete guidance around distributed transactions with Kafka, check out https://kafka-summit.org/sessions/simplifying-distributed-transactions-sagas-kafka/, which introduces the new Simple Sagas library: built using Kafka Streams, it provides a scalable, fault-tolerant, event-based transaction processing engine and walks through a use case of coordinating a sequence of complex financial transactions.
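A minimal sketch of point 1 under the stated assumptions - a replica-set-backed MongoDB 4.x and a recent spring-data-mongodb, where registering a MongoTransactionManager activates @Transactional for Mongo (class names per the Spring blog linked above; the service shown is hypothetical):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.mongodb.MongoDatabaseFactory;
import org.springframework.data.mongodb.MongoTransactionManager;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.transaction.annotation.Transactional;

@Configuration
class MongoTxConfig {
    // Without this bean, Spring Data Mongo runs every operation auto-committed.
    @Bean
    MongoTransactionManager transactionManager(MongoDatabaseFactory factory) {
        return new MongoTransactionManager(factory);
    }
}

class TransferService {
    private final MongoTemplate mongo;
    TransferService(MongoTemplate mongo) { this.mongo = mongo; }

    // Both writes commit or roll back together inside one Mongo transaction.
    @Transactional
    void transfer(String fromAccountId, String toAccountId) {
        // debit fromAccountId and credit toAccountId via mongo.updateFirst(...)
    }
}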
The bottom line is that distributed transactions in the cloud are hard due to the heterogeneity of storage technologies, and the transaction managers for ACID transactions were all built for the pre-cloud era. You will need to approach transactions in the cloud-native era with new patterns like sagas, compensation, and eventual consistency, and rely on product characteristics like immutable, persistent, ordered logs - or proprietary features - to realize your dream of distributed transactions in the cloud. A toy sketch of the compensation idea follows.
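A minimal sketch of the saga/compensation pattern itself: each completed step registers an undo action, and a failure replays the compensations in reverse order (the steps are hypothetical; a production saga would be driven by events on a broker such as Kafka):

import java.util.ArrayDeque;
import java.util.Deque;

// Toy orchestration-style saga: run steps, remember their compensations, and
// on failure undo completed steps in reverse order - eventual consistency
// instead of a two-phase commit.
class OrderSaga {
    private final Deque<Runnable> compensations = new ArrayDeque<>();

    void run() {
        try {
            reserveInventory();
            compensations.push(this::releaseInventory);
            chargePayment();
            compensations.push(this::refundPayment);
            shipOrder(); // if this throws, everything above is undone
        } catch (Exception e) {
            while (!compensations.isEmpty()) {
                compensations.pop().run(); // compensate in reverse order
            }
        }
    }

    private void reserveInventory() { /* call inventory service */ }
    private void releaseInventory() { /* compensating action */ }
    private void chargePayment()    { /* call payment service */ }
    private void refundPayment()    { /* compensating action */ }
    private void shipOrder()        { /* call shipping service */ }
}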