All Things Cloud: 2015

Sunday, December 13, 2015

Migrating apps to Cloud Foundry Deigo Runtime

Cloud Foundry has been rebased on a new runtime Deigo

You can follow the steps below to migrate your app with least disruption from the existing DEA VMs to the Deigo cells

(1) Deploy DEAs & Diego Side by Side when deploying the 1.6.x ERT.

(2) Set "Use Diego by default instead of DEAs” = true

(3) Set "cf curl /v2/apps/APP_GUID -d '{"diego": true}' -X PUT “ for pre-exsiting DEA deployed apps

(4) Test/Trigger app restages onto Diego as time permits

If any of your apps use the VCAP_APP_PORT variable. Please refactor to simply use the PORT env var.

Please read this Pivotal Technote that provides details on the migration steps

https://support.run.pivotal.io/entries/105844873-Migrating-Applications-from-DEAs-to-Diego

Evolution of the ESB to the Cloud In 5 Steps

Background

The rising popularity of microservices as an architecture has led to several questions around the role of the ESB. Traditionally the role of the ESB was to fulfill the promise of SOA by providing invocation, routing, mediation, messaging, process choreography, service orchestration, complex event processing, management, agnosticism, support for various message exchange patterns, adapters, transformation, validation, governance, enrichment, support for WS-* standards and abstraction. One can easily see how such a software can easily become the central bottleneck for all things enterprise aka the God object.

The term 'central' can also be used to reference the architectural style, namely everything had to be routed through the ESB. Not having to do that but still support message exchange patterns via Spring Integration's approach was a significant' distinction and is now even more important. The microservices approach favors dumb pipes and smart endpoints whereas an ESB works on an canonical data representation leading to dumb endpoints and smart pipes. There is an inherent conflict between microservices and ESBs. An ESB centric app architecture results in anemic services that are integrated via a smart centralized ESB.

The Microservice architecture pattern is SOA without the commercialization and perceived baggage of WS death star and an ESB. Furthermore creation of a SOA Center Of Excellence for Application Integration promoted a culture of centralized integration in the enterprise. Microservice-based applications favor simpler, lightweight protocols such as REST, rather than WS-*. They also very much avoid using ESBs and instead implement ESB-like functionality in the microservices themselves. The Microservice architecture pattern also rejects other parts of SOA, such as the concept of a canonical schema and an unified domain model across bounded contexts.. [chris-richardson].

Why Microservices

The fundamental tenets of Microservices like modularization, cohesiveness, DRYness, loose coupling, replaceability etc., have been widely recognized in the software industry since the 70s. The recent popularity of microservices stems from lack of independent deployability, independent scalability and independent failure of systems. A new generation of distributed system platforms and programming models has emerged to address these issues i.e. Microservices [martin-fowler]. Most enterprise systems over time evolve to Big Balls of Mud [balls-of-mud]. Beyond a certain scale making modifications to this spaghetti code jungle becomes untenable. Microservices gives us a structure approach to shear these monoliths and develop and deploy them in an agile fashion. Decomposing a monolith requires patterns like the Strangler, Inverse-Conway, Facade, Adapter, smart routing, feature-flags, zero-downtime-deployment and Proxy. Cloud Foundry natively bakes these concerns into the platform.

Integration tools have historically been the domain of the specialists working in the Integration Competency Center. They have required distinct skill sets and domain knowledge – from enterprise application integration and enterprise service buses and services oriented architecture to extract, transformation and load and enterprise data warehousing and business intelligence. The citizen model of integration refers to the process of democratization of integration workflows upending traditional approaches to data and application integration put together by citizen developers with no specialized domain knowledge driven by the demand for self-service from the business. [rise-citizen-integrator]

The emergence of PaaS and iPaaS platforms is a consequence of the need for reduced business software time to value. Features need to be delivered in a time span of weeks instead of years. The Third Platform makes this possible by baking into its foundation microservice management capabilities like service discovery, routing, load balancing, application lifecycle management and operational capabilities like canary releases, zero downtime deployment,automation, monitoring and security. Many of the concerns taken care of by the ESB are now usurped by the PaaS and provided to all the applications transparently at web scale. Moreover these platforms are truly open to extension and can be deployed using next generation tooling with BOSH, Ansible etc.

People & Process

A successful deployment of microservices is contingent on people, process and technology. The development of microservices is done with the agile methodology. The roots of agile can be found in Mel Conway's seminal work "How Do Committees Invent" mel-conway in 1967.

Conway's First Law: A system’s design is a copy of the organization’s communication structure. Conway's first law tells us TEAM SIZE is important. Communication dictates design. Make the teams as small as necessary.

Conway's Second Law: There is never enough time to do something right, but there is always enough time to do it over. Conway's second law tells us PROBLEM SIZE is important. Make the solution as small as necessary.

Conway's Third Law: There is a homomorphism from the linear graph of a system to the linear graph of its design organization. Conway’s Third law tells us CROSS-TEAM INDEPENDENCE is important. A more modern variant of the same covenant is the ‘two pizza rule’ invented by Jeff Bezos i.e. teams shouldn’t be larger than what two pizzas can feed.

Conway's Fourth Law: The structures of large systems tend to disintegrate during development, qualitatively more so than with small systems. Conway’s fourth law tells us that TIME is against LARGE teams therefore it is critical to make release cycles short and small.

The ESB development model of design by committee is at odds with the fundamental tenets of agile and lean development i.e. smaller decentralized teams and faster release cycles. ESB development requires specialized knowledge and does not support rapid continuous integration and development.

The O-Ring Theory when applied to Devops states that when production depends on completing a series of tasks, failure or quality reduction of any task reduces the value of the entire product. [O-ring-theory]. Another interesting finding is that, the better you are the more value you get from improving your weakness and conversely if you are fairly poor across the board you won’t get as high a ROI on an improvement in one specific area. In order to boost productivity it is critical to pursue excellence across all IT processes.

Challenges With Microservices

When a business system is designed as a suite of microservices that prefer choreography over orchestration it behooves the advocates of the architecture to provide best practices and patterns to implement the abstraction. This is where microservices falls short. The programming frameworks for microservices choreography such as event sourcing and CQRS are woefully behind in terms of maturity and production capability with some notable exceptions like axon or akka-persistence.

An ESB provides a visual mode of development that allows domain modelers and business developers an opportunity to design and validate a business process without getting their hands dirty with code. A cross-bar architecture allows architects to enforce a central point of governance and control over design and implementation. DIY Integration frameworks lack the extensive user and process modeling tools that traditional ESB vendors provide. This is a shortcoming in the DIY Integration ecosystem with Spring Integration.

Features of the ESB like business activity monitoring and event correlation now have to be implemented in applications or supplied as cross-cutting libraries to applications. ESB has a whole suite of connector services and adaptors that will need to plugged in for an individual service instead of harvesting from a common core.

A mature enterprise has a fully developed and very mature suite of ESB services that makes development of features with ESBs easy and fast. The pain of development, training and deployment has been amortized over the years to the point where developing features with the ESB may be equivalent to writing a suite of microservices.

Migration from ESBs to Cloud Native Platforms

There are 5 phases to evolving an ESB infrastructure and accompanying business services to the cloud.

Phase 0 : Co-exist

Deploy the ESB alongside the PaaS. ESB services are exposed as user provided external services to the PaaS. This allows the enterprise to deploy apps to a platform like Cloud Foundry and keep its existing backend services in an ESB like TIBCO. The advantages here is the ability to move the web tier to the next generation platform without disrupting existing integration flows.

Phase 1 : Lift And Shift

Deploy the ESB in the PaaS. Increasingly ESB vendors have moved their offerings to the cloud. See TIBCO Business Works Cloud Foundry Edition or mulesoft iPaaS. In fact Gartner defines this as an entirely new category called iPaaS i.e. a suite of cloud services enabling development, execution and governance of integration flows connecting any combination of on premises and cloud-based processes, services, applications and data within individual or across multiple organizations.

The Cloud Foundry edition of TIBCO allows individual TIBCO business processes to be natively scaled and managed by Cloud Foundry. From a Cloud Foundry perspective the TIBCO business process is treated as a new language with its own buildpack. This is a reasonable middle ground between ESBs and Microservices. There are two issues with a cloud enabled ESB like TIBCO, the ESB needs to leverage native messaging capabilities (like FTL) that are not available natively in the PaaS. The ESB needs to leverage the managed data backing services natively available on the platform.
Phase-1.1.png

Phase 2 : Refactor

A refactoring of the ESB services offered by the ESB such that the ESB layer is kept as thin as possible. At this point there are many opportunities to migrate from the ESB to Spring Integration to achieve basic transformation, mediation, and routing. Consumer Services Expose APIs with formats/protocols desired by consumers. Consumer services transform formats provide sync/async, protocols and composite workflows. Provider Services integrate with legacy IT systems, implement required workflow and consume whatever protocol/format is exposed by legacy systems. Refactor services incrementally per bounded context, prioritizing the ones that are new or under active development.

Standardize consumer APIs over REST/JSON and reduce the cross-bar architecture complexity by not entertaining a variety of consumer preferences. Extend bounded contexts to include legacy app dependencies. For high performance data intensive workloads the cost of serialization/deserialization of messages from the enterprise bus may be too high. Consider a binary serialization mechanism like google protocol buffers or Kyro for internal messaging, APIs, caching, etc across services.

Leverage Spring Flo a graphical UI for design and visualization of streaming and batch data pipelines. Flo allows you to create real-time streaming and batch pipelines either with the textual shell or the drag and drop interface Composed Batch Job Orchestration using Flo for Spring XD.

Phase 3 : Replace

Replacement of ESB services using integration pattern frameworks like Spring Cloud Stream. Spring Integration and its cloud successor Spring Cloud Stream provide successively higher layers of abstraction making it possible to write ESB flows and function using cloud native architectural constructs leveraging annotations and interfaces to specify business function. Spring Cloud Stream is a combination of Spring Cloud, Spring Boot and Spring Integration where integration modules are Spring Boot apps. The plumbing of putting together channel adapters and flows over message brokers like Kafka and RabbitMQ is done intelligently by the platform. Spring-cloud-dataflow can also orchestrate jobs so batch-processing and stream-processing are all manageable in a consistent way on a unified platform. Spring cloud "tasks" can handle spring-batch jobs as well as simple one-off runnable tasks. This approach does not rely on IDE tools to design the services and an additional step to deploy. Development of integration services is the same as development of any other business logic which eliminates the traditional impedance mismatch associated with ESBs and allows for a faster iteration of feature code. The citizen integration framework like Spring Integration provides developers a familiar Java developer's way to compose and operate on services.

One language (Java), one runtime (Tomcat) and one CI/CD process results in the holy grail of reduced time to value and increased efficiency. Both Apache Camel and Spring Integration have a vast number of components, modules and adapters to legacy systems that can aid in this transformation. REST APIs from zConnect should be leveraged natively on the mainframe where possible ditching the man-in-middle legacy adapter layer if possible. z/OS Connect enables z/OS systems such as CICS and IMS to provide RESTful APIs and accepts transactional JSON payloads via the WebSphere Liberty app server.

Apache Camel is in the same category as Spring Integration in terms of being a lightweight framework alternative to ESBs. Apache Camel like Spring Integration is a low level toolbox of components that can be used to implement EAI patterns. Unlike Spring Cloud Stream Apache Camel does not provide higher level microservice abstractions for designing flows other than a DSL.

Phase 4 : Transform

After phase 2 and 3 you are leveraging a lot of Integration patterns aka EAI constructs. Any ESB albeit even light weight cloud native ones inevitably result in context bleeding and a proliferation of interchange contexts resulting from sharing a canonical data model. Entities and Aggregates are outside their Bounded Context or even worse replicated outside, endorsed in a more generic representation, where the integration team has put an extra layer of complexity on top of it.enterprise-arch-in-practice.

Phase 4 is doing away with EAI patterns and leveraging a pure event driven messaging architecture. Decomposition of ESB integration services based on purpose alignment model to higher order vertically sliced domain based service. The intent is to make each service independent. The service provides a smart endpoint. The service fulfills a contract for a bounded context and is complete in all respects. Process orchestration is replaced by choreography. The Spring Cloud framework provides an umbrella of Netflix OSS tools that should be used when writing distributed systems. The service implemented with the help of Spring cloud and the platform are responsible for providing all the qualities of service provided earlier by the ESB. A fabric of independently operating coarsely grained microservices that are choreographed to achieve business function is the ultimate manifestation of the Domain Driven microservice philosophy.

In a choreography approach the central controller service is replaced by the asynchronous flow of events over a message bus. These events are picked up by interested services and appropriate actions are taken which may be message enrichment, transformation, etc,. Since business process flows are now transformed to event and activity flows, business activity monitoring becomes critical. Compensation of failed transactions and exception actions should be incorporated into business logic. A user journey now becomes a series of async interactions. Additional work is needed to ensure that you can monitor and track that the right things have happened.

The decomposition of a monolith to microservices is done by breaking the system into a shared kernel and identifying candidate bounded contexts using techniques like event storming and model-storming. Use an Anti-Corruption Layer to ring-fence clearly distinguishable bounded contexts from their neighbors to allow for extraction. Start with the core domain, move according to value. Transform the monolith using approaches such as Strangler, asset-capture, event-interception and reference-data. A transformation is not a refactoring it’s a debt restructuring program. Usual moves include Shared Kernel -> Customer-Supplier -> Open Host Services -> Published Language. References (ian-cooper, implementing-ddd)

Are my apps worthy of the Cloud ?

From time to time we get questions on what approaches to use to determine application suitability to the cloud.

In my view the best way to do this is to form a set of technical or business heuristics that are important to your enterprise and then grade the app portfolio on those criteria. Form a pool of 10 apps and begin app replatforming and modernization.

The most important thing about this scoping exercise is that it should 1. be time scoped to a day and 2. involve all the stakeholders (business, developers, architects, testers).

What you should NOT do is create a spreadsheet and rate apps on various criteria. Such an effort is destined to fail and get mired in analysis paralysis.

App migration is like training for a marathon or weight training. You start with smaller weights and challenges and then ramp up to the longer distances and higher class weights.

In the following screenshots I am going to display a set of heuristics that companies have used successfully to perform app migration to the cloud.

1. Technical Feasibility

http://redmonk.com/jgovernor/2015/11/10/cloud-native-is-nice-and-all-but-how-do-get-there

2. Business Feasibility

Differentiating – Clean Code, Now. Move to Microservices.

Parity – Good-enough software. Run on VMs.

Partner – With a little help from my friends. Outsource Refactoring.

Who Cares – Shut it down, shut it down. Phase out.

Niel Nickolaisen Purpose Alignment Model.

3. Technical and Business Feasibility

Impact here refers to the disruption to the business. In risk averse organization risk to business becomes an important metric. The order of app re-platforming here is Q1, Q2, Q4, Q3.

Wednesday, October 21, 2015

Tuning the default memory setting of the Java Buildpack to avoid OOMs

Credit for this blog post largely goes to Daniel Mikusa from Pivotal Support and others who have debugged and diagnosed a number of OOM issues recently with JDK8 and Cloud Foundry Java Buildpack.

Situation

We have seen significant improvement in the performance of long runs for certain enterprise apps after decreasing the amount of memory allocated to thread stacks and increasing the native i.e. metaspace of the container. If you are seeing your app running OOM then make the following changes to the memory of the Java Buildpack:

1.) Lower the thread stack size. This defaults to right around 1M per thread, which is more than most threads every need. I usually start by lowering this to 228k per thread, which is the JVM minimum. This will work fine unless you've got apps that do lots of recursion. At any rate, if you see any StackOverFlowExceptions, then just bump up the value until they go away.

2.) The default allocation of memory weights is 75 heap and 10 native. I suggest starting with 60 heap and 25 native. That's probably a bit conservative, but it should leave more free room in the container which should help to prevent crashes. Lowering the heap to 60 will lower the total amount of heap space that the app can use. If this causes problems with the app, like OOME's then you might need to raise the memory limit on the container instead. Once adjusted, we should see between 100 and 150M of free memory in the container.

In other words for JDK8,

MEMORY_LIMIT - ( HEAP + METASPACE + 100M ) should be between 100 & 150M.

The problem that most customers hit when moving their Java apps to CF is that they've never had a hard limit on total system memory before. Deploying to WAS/WebLogic/JBOSS (or any other application container) is going to be done on a system with swap space. That means if they don't have their memory settings quite right, the worst thing that will happen is that they use a bit of swap space. If it's just a little swap, they probably won't even notice it happening. With CF, there's much less forgiveness in the system. Exceeding the memory limit by even a byte will result in your app being killed.

How to Fix this ?

Please set the JBP_CONFIG_OPEN_JDK_JRE or the JBP_CONFIG_OPEN_JDK_JRE env var. and restage the app

cf set-env my-app JBP_CONFIG_ORACLE_JRE_MEMORY_HUERISTICS '{heap: 60, native: 25, stack:"228k”}'

cf set-env my-app JBP_CONFIG_OPEN_JDK_JRE: '[memory_calculator: {memory_sizes: {stack: 228k, heap: 225M}}]

cf se my-app JBP_CONFIG_ORACLE_JRE '[memory_calculator: {memory_sizes: {stack: 228k}, memory_heuristics: {heap: 60, native: 25}}]’

For versions of JBP pre 3.3

cf set-env my-app JBP_CONFIG_OPEN_JDK_JRE: '[memory_calculator: {memory_heuristics: {stack: .01, heap: 10, metaspace: 2}}]’

Script For Parsing NMT

If you do happen to enable -XX:NativeMemoryTracking then use the script below to graph and generate charts of memory growth:

https://github.com/dmikusa-pivotal/cf-debug-tools/blob/master/build-graph.py

and here are some quick docs for it.

https://github.com/dmikusa-pivotal/cf-debug-tools#logs

There's one issue at the moment, the process id that the script looks for in the top output is hard coded to 35. I think that it's usually 35, because of the order that the processes start up, but that might not always be the case. You can manually find the pid in the log and change the script for now. I will try to add to the script to automatically put that, but haven't had the time yet.

Monday, October 12, 2015

Heapdumps and CoreDumps on Cloud Foundry containers

SSH into the warden container following instructions here and follow the instructions below ...

Heapdumps:

su vcap

PID=` ps -ef | grep java | grep -v "bash\|grep" | awk '{print $2}'`

jmap -dump:format=b,file=/home/vcap/app/test.hprof $PID

Coredumps:

I played with this some more and the process below worked for generating core dumps:

ulimit -c unlimited

gdb --pid=33

Attaching to process 33
Reading symbols from /home/vcap/app/.java-buildpack/open_jdk_jre/bin/java...(no debugging symbols found)...done.
Reading symbols from /home/vcap/app/.java-buildpack/open_jdk_jre/bin/../lib/amd64/jli/libjli.so...(no debugging symbols found)...done.
Loaded symbols for /home/vcap/app/.java-buildpack/open_jdk_jre/bin/../lib/amd64/jli/libjli.so
...

Reading symbols from /home/vcap/app/.java-buildpack/open_jdk_jre/bin/java...(no debugging symbols found)...done.

Reading symbols from /home/vcap/app/.java-buildpack/open_jdk_jre/bin/../lib/amd64/jli/libjli.so...(no debugging symbols found)...done.

Loaded symbols for /home/vcap/app/.java-buildpack/open_jdk_jre/bin/../lib/amd64/jli/libjli.so

...

(gdb) gcore

(gdb) quit
A debugging session is active.
Inferior 1 [process 33] will be detached.
Quit anyway? (y or n) y
Detaching from program: /home/vcap/app/.java-buildpack/open_jdk_jre/bin/java, process 33
root@18tlhc59f7e:~# ls -al
total 1236312
drwx------ 2 root root 4096 Oct 12 14:19 .
drwxr-xr-x 33 root root 4096 Oct 12 14:19 ..
-rw------- 1 root root 985 Oct 6 04:47 .bash_history
-rw-r--r-- 1 root root 3106 Feb 20 2014 .bashrc
-rw-r--r-- 1 root root 1265958640 Oct 12 14:20 core.33
-rwxr-xr-x 1 root root 213 Jul 24 21:08 firstboot.sh
-rw-r--r-- 1 root root 140 Feb 20 2014 .profile

Followed instructions from http://jagadesh4java.blogspot.com/2013/05/linux-core-and-java-jmap.html

Note kill -SIGABRT 33 did not work.

In miscellaneous notes everyone should read the following article on memory management in CF containers http://fabiokung.com/2014/03/13/memory-inside-linux-containers/

Most of the Linux tools providing system resource metrics were created before cgroups even existed (e.g.: free and top, both from procps). They usually read memory metrics from the proc filesystem: /proc/meminfo,/proc/vmstat, /proc/PID/smaps and others. Unfortunately /proc/meminfo, /proc/vmstat and friends are not containerized.

Most container specific metrics are available at thecgroup filesystem via /path/to/cgroup/memory.stat,/path/to/cgroup/memory.usage_in_bytes,/path/to/cgroup/memory.limit_in_bytes and others.

Saturday, October 10, 2015

Boiler plate apps for all buildpacks in Cloud Foundry

If you are tired of searching for sample apps of different types that exercise all the buildpacks in Cloud Foundry, the list of boilerplate apps below comes to your rescue. Most of these apps need a backing service like MySQL which exercises the service binding code in CF and your app.

List of all the Buildpacks supported by CF: http://docs.cloudfoundry.org/buildpacks/

java_buildpack: Spring-Music : Allows binding based on profiles to mysql, postgres, in-memory, etc., https://github.com/pivotalservices/spring-music

go_buildpack: pong_matcher_go: This is an app to match ping-pong players with each other. It's currently an API only, so you have to use curl to interact with it. Requires mysql
https://github.com/cloudfoundry-samples/pong_matcher_go

nodejs_buildpack: node-tutorial-for-frontend-devs: Node.js sample app with mongodb backend:
https://github.com/cwbuecheler/node-tutorial-for-frontend-devs

php_buildpack: PHPMyAdmin : out-of-the-box implementation of PHPMyAdmin 4.2.2. Requires mysql
https://github.com/dmikusa-pivotal/cf-ex-phpmyadmin

binary_buildpack: pezdispenser: Admin portal for Cloud Foundry
https://github.com/pivotal-pez/pezdispenser

ruby_buildpack: Rails app to match ping-pong players with each other. Requires mysql.

https://github.com/cloudfoundry-samples/pong_matcher_rails

python_buildpack: Buildpack uses pip to install dependencies. Needs a requirement.txt

Flask app: https://github.com/michaljemala/hello-python

PyData app: https://gist.github.com/ihuston/d6aab5e4a811fe582fa7 Does not use pip. Uses conda.

staticfile_buildpack: Put a Staticfile in any directory and do a cf push. If directory browsing is needed Add a line to your Staticfile that begins with directory: visible

Spring Cloud services Tile:
- https://github.com/spring-cloud-samples/fortune-teller
- https://github.com/dpinto-pivotal/cf-SpringBootTrader

.NET sample app: Contoso University

- https://github.com/pivotalservices/contoso-university

Method to get to the root cause of a memory leak in the Cloud

From an application developer perspective our apps are increasingly run in russian doll deployments i.e. app code wrapped in a app server container which is wrapped by a LXC container running within a VM , running on one or multiple hypervisors running on bare metal. In such a scenario determining the root cause of memory leaks becomes difficult. Please find below a process that could be used to get to the eureka moment.

The basic principle to get to root cause is that of eliminating all the variables one by one. We start at the top of the stack and work our way down. Remember the JVM is like an iceberg. There is java heap that is above the water and an unbound native memory portion underneath the surface. Java heap OutOfMemory errors are easier to fix than native memory leaks. Native leaks are generally caused by errant libraries/frameworks, JDKs, app-servers or some unexplained OS-container-JDK interaction.

ok lets get to it ...

First establish a process to measure the JVM heap and native process size perhaps using a dump script like https://github.com/dmikusa-pivotal/cf-debug-tools#use-profiled-to-dump-the-jvm-native-memory. Remember to take heapdumps before during and after the load run. Once the test is close to completion take native process core dumps using kill -6 or kill -11. This procedure of running the test is then repeated as you eliminate each variable below.

1. [app] First look at the application source for the usual memory leak anti-patterns like DirectByteBuffers, threadlocal, Statics, Classloader retention, resource cleanup etc. This is where you will get the maximum bang for the buck. Take a heapdump of the JVM process and analyze in EclipseMemoryAnalyzer of HeapAnalyzer.

2. [jdk] Eliminate the JDK as a factor of the leak by switching JVM implementations i.e. moving from OpenJDK to Hotspot or from OpenJDK to IBM JDK etc .. see the entire list of JVM impls https://en.wikipedia.org/wiki/List_of_Java_virtual_machines.

3. [app-server] If simple eyeballing does not help then switch the app-server i.e. move from tomcat to jetty, undertow to tomcat. If your app runs on WebSphere or WebLogic and cannot be ported then my apologies. Call 1-800-IBM-Support.

4. [container] If your droplet (app + libraries/frameworks + jvm) is running within a container in Cloud Foundry or Docker then try switching out the containers. i.e. If running within the warden container then run the same app within Docker container. Try changing Docker base images and see if the leak goes away.

5. [hypervisor] If running on AWS switch to OpenStack or vSphere and vice versa. You get the idea. Cloud Foundry makes this easy since you can standup the same CF deployment on all three providers.

6. [bare-metal] Run the app on the bare metal server to check if the leak persists.

7. [sweep-under-the-rug] Once you are ready to pull your hair out, resort to tuning the JDK. Start playing with JVM options like -Xms234M -Xmx234M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=128M -Xss228K. In cloud foundry this is set by the memory_calculator that is influenced by setting the memory_heuristics and memory_sizes env vars.

JBP_CONFIG_OPEN_JDK_JRE: '[memory_heuristics: {heap: 55, metaspace: 30}, memory_sizes: {metaspace: 4096m..}]'
JBP_CONFIG_OPEN_JDK_JRE: '[memory_calculator: {memory_sizes: {stack: 228k, heap: 225M}}]
JBP_CONFIG_OPEN_JDK_JRE: '[memory_calculator: {memory_heuristics: {stack: .01, heap: 10, metaspace: 2}}]'

As you can see, the options become increasingly cumbersome as you keep doing down this list. Your best hope is to catch it at 1, 2 or 3. Good Luck Hunting!

Thursday, October 1, 2015

Chasing Cloud Foundry OutOfMemory Errors - OOM

If you are ever unfortunate enough to troubleshoot application, java heap or native process OOM issues in Cloud Foundry follow the playbook below to get to the root cause:

1. Include the attached script dump.sh at the root of your JAR/WAR file. You can edit the LOOP_WAIT variable in the script to configure how often it will dump the Java NMT info. I'd suggest somewhere between 5 and 30 seconds, depending on how long it takes for the problem to occur. If the problem happens pretty quick, go with a lower number. If it takes hours, then go with something higher.

2. Make a .profile.d directory, also in the root of the JAR / WAR file. For a detailed explanation on using .profile.d to profile native memory checkout this note from CF support engineer Daniel Mikusa.

3. In that directory, add this script.
#!/bin/bash
$HOME/dump.sh &

This script will be run before the application starts. It starts the dump.sh script and backgrounds it. The dump.sh script will loop and poll the Java NMT stats, dumping them to STDOUT. As an example see the simple-java-web-for-test application. There is also an accompanying load plan here.

4. Add the following parameters to JAVA_OPTS:
JAVA_OPTS: "-XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:./jvm-gc.log -XX:NativeMemoryTracking=detail"
The -XX:NativeMemoryTracking will enable native OOM analysis [1]. More on this later. The -Xloggc will allow you to pipe all GC output to a log that you can later analyze with tool like PMAT or GCMV

5. Push or restage the application.

6. In a terminal open cf logs app-name > app-name.log. That will dump the app logs & the Java NMT info to the file. Please try to turn off as much application logging as possible as this will make it easier to pick out the Java NMT dumps.

7. Kick off the load tests.

The nice thing about Java NMT is that it will take a snapshot of the memory usage when it first runs and every other time we poll the stats we'll see a diff of the initial memory usage. This is helpful as we really only need the last Java NMT dump prior to the crash to know what parts of the memory have increased and by how much. It's also nice because it gives us insight into non-heap usage. Given the Java NMT info, it should be easier to make some suggestions for tuning the JVM so that it doesn't exceed the memory limit of the application and cause a crash.

8. If you have the ability to ssh into the container use the following commands to trigger heapdumps and coredumps
JVM Process Id:
PID=` ps -ef | grep java | grep -v "bash\|grep" | awk '{print $2}'`
Heapdump:
./jmap -dump:format=b,file=/home/vcap/app/test.hprof $PID

Coredump:

kill -6 $PID should produce a core file and leave the server running

Analyze these dumps using tools like Eclipse Memory Analyzer and IBM HeapDump Analyzer

9. If you have the ability to modify the application then enable the Spring Boot Actuator feature for your app if it is Boot app else integrate a servlet or script like DumpServlet and HeapDumpServlet into the app

Salient Notes:

- The memory statistics reported by CF in cf app is in fact used_memory_in_bytes which is a summation of rss and the active and inactive caches. [2] and [3]. This is the number watched by the cgroup linux OOM killer.
- The Cloud Foundry Java Buildpack by default enables the following parameter -XX:OnOutOfMemoryError=$PWD/.java-buildpack/open_jdk_jre/bin/killjava.sh
Please do NOT be lulled into a false sense of complacency by this parameter. I have never seen this work in production. Your best bet is to be proactive when triggering and pulling dumps using servlets, JMX, kill commands, whatever ...

References:

[1] https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html

[2] https://github.com/cloudfoundry/dea_ng/commit/e5fd523aaf2cddeda01d4134ecd2a929829c4147

[3] https://groups.google.com/a/cloudfoundry.org/d/msg/vcap-dev/6M8BDV_tq7w/VCoJmWtJJncJ
[4] https://blogs.oracle.com/poonam/entry/about_g1_garbage_collector_permanent
[5] http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html
[6] Native Memory Best Practices
[7] http://www-01.ibm.com/support/docview.wss?uid=swg21255223
[8] https://publib.boulder.ibm.com/httpserv/cookbook/cookbook.html

Tuesday, September 29, 2015

Spring Boot Activator metrics collection in a spreadsheet

If your app is a spring boot app that has the actuator enabled then use this nifty script from Greg Turnquist's Learning Spring Boot book with some changes from me to collect all the metrics in a csv

package learningspringboot

@Grab("groovy-all")
import groovy.json.*

package learningspringboot

@Grab("groovy-all")
import groovy.json.*

@EnableScheduling
class MetricsCollector {

def url = "http://fizzbuzz.cfapps.io/metrics"
def slurper = new JsonSlurper()
def keys = slurper.parse(new URL(url)).keySet()
def header = false;
@Scheduled(fixedRate = 1000L)
void run() {
if (!header) {
println(keys.join(','))
header = true
}

def metrics = slurper.parse(new URL(url))

println(keys.collect{metrics[it]}.join(','))
}
}

About Me