About Me

Rohit is an investor, startup advisor and an Application Modernization Scale Specialist working at Google.

Saturday, July 27, 2019

Spring Boot Microservices Observability > Pivotal PCF Metrics

We are often asked why the PCF platform team should install PCF Metrics when the customer already has New Relic, AppDynamics, Dynatrace, <insert APM tool of choice> and a log aggregator like ELK or Splunk. After all, PCF Metrics is a resource hog, at least in older versions (note that this has improved and is tunable in later versions).

Now we all know that developers like their shiny toys and, given latitude, will install the internet on the platform. However, there are genuine reasons for app developers to install PCF Metrics. First, a picture is worth a thousand words, so please see the graphic and check whether the reasons make sense. Below I have tried to present an admittedly biased bulls-and-bears assessment of PCF Metrics for your enterprise developer needs.




Wait, but what is PCF Metrics?
Pivotal Cloud Foundry Metrics stores logs, metrics data, and event data from apps running on PCF for the past 14 days. It graphically presents this data to help operators and developers better understand the health and performance of their apps. PCF Metrics enables development teams to advance microservice resiliency and scalability goals by providing a single pane of glass for logging, tracing and metrics, giving insight into outages, interruptions and scaling events.

Wait, but we already have all the data in Kibana, Splunk and New Relic/AppDynamics?

Here is what PCF Metrics provides that the current monitoring and logging solutions do not:

1. The Java buildpack has built-in integration with the Metrics Forwarder tile, giving us the ability to look at Spring Boot Actuator metrics and custom metrics in PCF Metrics without any additional work. We need the ability to see custom metrics and Spring Boot Actuator metrics in a dashboard; see https://content.pivotal.io/blog/how-pcf-metrics-helps-you-reduce-mttr-for-spring-boot-apps-and-save-money-too for a behind-the-scenes look at how all this works.

Configuring the New Relic MeterRegistry or any of the other specific Micrometer registries has the benefit of skipping the potentially lossy Loggregator flow in favor of direct communication. Skipping Loggregator in favor of direct communication with a registry is highly encouraged if your registry supports dimensional data, even if you could sink Loggregator metrics to that registry transparently (a sketch of registering a custom metric this way follows below).
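To make point 1 concrete, here is a minimal sketch of registering a custom Micrometer metric in a Spring Boot app. The OrderMetrics class and the checkout.orders metric name are illustrative, not from the original post; with the Java buildpack and Metrics Forwarder integration this counter surfaces in PCF Metrics, and with a direct registry such as micrometer-registry-new-relic on the classpath the same code ships it straight to that backend.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Service;

// Minimal sketch: a Spring-managed service that registers and increments a
// custom counter. Spring Boot auto-configures the MeterRegistry bean.
@Service
public class OrderMetrics {

    private final Counter ordersPlaced;

    public OrderMetrics(MeterRegistry registry) {
        this.ordersPlaced = Counter.builder("checkout.orders")   // hypothetical metric name
                .description("Number of orders placed")
                .register(registry);
    }

    public void recordOrder() {
        ordersPlaced.increment();   // shows up as a custom metric in the dashboard
    }
}
```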

2. Metrics are correlated with logs, so we can zoom in immediately on the section of logs pertinent to a metric spike. We can also visualize PCF autoscaling events in the dashboard, giving us an accurate picture of when and why scaling occurred and what faults exist in the system. Neither capability is provided by ELK or New Relic without significant customization; see https://docs.pivotal.io/pcf-metrics/1-6/using.html. You can also configure alerts in PCF Metrics via webhooks when metric thresholds are crossed or when significant app events take place. This allows developers to build and own what they deliver to the platform and to engage in SRE practices. You can autoscale without PCF Metrics, but you won't have as much historical data on which to base scaling decisions; one could create autoscaling rules without referencing that history, but PCF Metrics gives you the tools to make better scaling decisions.


3. From a cost perspective this tile is free and you are entitled to long-term support. The tile can be configured in S, M, L and XL resource configurations; an example configuration stores approximately 14 days of data for a small deployment of about 100 application instances. Most dev teams should be able to absorb this cost. We can start small and resize the deployment once the development team validates the value. If cost is still an issue, the sizing can be reduced further by storing only 7 days of data.



Now "it's basically free" should never be used an argument for choosing the best tool for monitoring your applications in production in part because tools like ELK and Prometheus also come with no-price-but-the-infrastructure options. Windows like the default two weeks may work in development environments but may not be sufficient for real life app monitoring especially in domains like e-commerce that have distinct rhythm across the day/month/year. 

4. PCF Metrics comes with a built-in distributed tracing dashboard for all the microservices, providing an end-to-end view of microservice latencies in fan-out configurations. This Zipkin-based dashboard showcasing spans and traces is baked into PCF Metrics and is better than the corresponding end-to-end support in New Relic; see https://content.pivotal.io/blog/distributed-tracing-in-pcf-metrics-breakthrough-insight-for-microservices. The distributed tracing support in PCF Metrics provides a deep understanding of the causality of outages and other interruptions across microservice fan-out and hierarchical calls.
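The original post does not name a tracing library for the application side; as a hedged sketch, the configuration below assumes Spring Cloud Sleuth (spring-cloud-starter-sleuth) is on the classpath. Sleuth then instruments any RestTemplate built as a bean so outbound calls carry Zipkin-compatible B3 trace and span headers that a Zipkin-based dashboard can stitch into end-to-end traces.

```java
import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

// Sketch only: Sleuth auto-instruments RestTemplate beans, adding
// X-B3-TraceId / X-B3-SpanId headers to calls made through this bean.
@Configuration
public class TracingConfig {

    @Bean
    public RestTemplate tracedRestTemplate(RestTemplateBuilder builder) {
        return builder.build();
    }
}
```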

PCF Metrics can be installed in the non-production PCF environment to do a proper cost/benefit analysis. Install the PCF Metrics and Metrics Forwarder tiles to enable autoscaling and to meet the observability, resiliency and other non-functional requirements of your microservices. PCF Metrics also supports the day-2 ops goals of your organization: http://cloud.rohitkelapure.com/2019/02/power-of-pcf-metrics.html

PCF Metrics does not displace your enterprise APM and logging tools. It does not currently support multi-dimensional metrics, only persists logs for 14 days by default, and does not provide any profiling data. A proper telemetry- and APM-focused solution needs to sit alongside PCF Metrics, and if you have one, ship metrics directly to it using the appropriate Micrometer registry. The fact that an application cannot publish metrics directly to PCF Metrics and requires additional platform support, either the metrics-forwarder or the new metrics-registrar, also means product teams have extra coupling to the teams that maintain the platform or release pipelines. In the absence of PCF Metrics, teams should have an external log aggregator and metrics platform in place so app teams own their own destiny regarding how and what they monitor about their apps.
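As one way to "ship metrics directly" as described above, here is a minimal sketch of wiring Micrometer's New Relic registry by hand. It assumes micrometer-registry-new-relic is on the classpath, and the API key and account id shown are placeholders you would source from your own configuration.

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.newrelic.NewRelicConfig;
import io.micrometer.newrelic.NewRelicMeterRegistry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Sketch only: a registry bean that pushes metrics straight to New Relic
// Insights, bypassing Loggregator. Values below are placeholders.
@Configuration
public class NewRelicMetricsConfig {

    @Bean
    public NewRelicMeterRegistry newRelicMeterRegistry() {
        NewRelicConfig config = new NewRelicConfig() {
            @Override public String apiKey()    { return "INSERT-INSIGHTS-API-KEY"; } // placeholder
            @Override public String accountId() { return "INSERT-ACCOUNT-ID"; }       // placeholder
            @Override public String get(String key) { return null; } // use defaults for everything else
        };
        return new NewRelicMeterRegistry(config, Clock.SYSTEM);
    }
}
```

In practice Spring Boot can auto-configure the same registry from its management.metrics.export.newrelic.* properties; the explicit bean above just makes the direct-shipping path visible.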

If the platform team offers Prometheus for metrics and visualization, developers should use it; Prometheus has fantastic support for polling metrics. In most places developers need to open tickets to post custom metrics. For instance, in New Relic you need an Insights API key to post custom metrics and create dashboards. New Relic and ELK are better than PCF Metrics when it comes to specialized features; however, developers don't visit these portals. PCF Metrics also enables the configuration of autoscaling rules and alerts on custom metrics. Developers are lazy and will do the most convenient thing, i.e. pull up the portal from the console, which leads me to the bottom line:
PCF Metrics is the perfect on-boarding portal for developers in non-production environments, one that lets them start the journey of observability and then go deeper with specialized tools.
Credit to my colleague Matt Campbell for contributing to this blog and keeping it real!

Sunday, July 21, 2019

Tools To Create Chaos

These are interesting tools that I have come across in the last couple of days for creating chaos, one of the key SRE practices for determining whether your production site can handle excess load...

  • Gremlin: Chaos as a Service. https://www.gremlin.com/docs/application-layer/attacks/ Resiliency through orchestrated chaos. Worth paying for this service if you have low confidence in the production readiness of your code or if you don't have SRE practices and need to shock the organization into operational readiness.
Now that you have succeeded in creating chaos, how should you instrument and fix the system to deal with it? To understand how to deal with chaos, start with Health Checks and Graceful Degradation in Distributed Systems and Testing in Production - The Safe Way.
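As a small, hedged illustration of the health-check side of that advice, here is a sketch of a custom Spring Boot Actuator HealthIndicator; the InventoryClient dependency is hypothetical and stands in for whatever downstream system a chaos experiment might degrade.

```java
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// Hypothetical downstream client, used only for illustration.
interface InventoryClient {
    void ping();
}

// Sketch only: reports DOWN with detail when the dependency misbehaves, so
// platform health checks and chaos experiments surface the failure clearly
// instead of letting it cascade silently.
@Component
public class InventoryHealthIndicator implements HealthIndicator {

    private final InventoryClient inventoryClient;

    public InventoryHealthIndicator(InventoryClient inventoryClient) {
        this.inventoryClient = inventoryClient;
    }

    @Override
    public Health health() {
        try {
            inventoryClient.ping();   // cheap probe of the dependency
            return Health.up().build();
        } catch (Exception e) {
            return Health.down(e).withDetail("dependency", "inventory").build();
        }
    }
}
```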

To understand the theory and implementation of SRE practices for dealing with chaos, read the chapters on Handling Overload and Addressing Cascading Failures from the SRE books. As a bonus, read the chapter on Non-Abstract Large System Design to understand the design process for large-scale, fault-tolerant systems.

Lastly, if you are in the Bay Area, this looks like an awesome conference: https://chaosconf.io/

Happy SRE Practices!