If you are ever unfortunate enough to have to troubleshoot application, Java heap, or native process OOM issues in Cloud Foundry, follow the playbook below to get to the root cause:
1. Include the attached script dump.sh at the root of your JAR/WAR file. You can edit the LOOP_WAIT variable in the script to configure how often it will dump the Java NMT info. I'd suggest somewhere between 5 and 30 seconds, depending on how long it takes for the problem to occur. If the problem happens quickly, go with a lower number. If it takes hours, go with something higher.
2. Make a .profile.d directory, also in the root of the JAR/WAR file. For a detailed explanation of using .profile.d to profile native memory, check out this note from CF support engineer Daniel Mikusa.
3. In that directory, add this script.
#!/bin/bash
$HOME/dump.sh &
This script will be run before the application starts. It starts the dump.sh script and backgrounds it. The dump.sh script will loop and poll the Java NMT stats, dumping them to STDOUT. As an example see the simple-java-web-for-test application. There is also an accompanying load plan here.
4. Add the following parameters to JAVA_OPTS:
JAVA_OPTS: "-XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:./jvm-gc.log -XX:NativeMemoryTracking=detail"
The -XX:NativeMemoryTracking option enables Java Native Memory Tracking (NMT), which is what makes native OOM analysis possible [1]. More on this later. The -Xloggc option writes all GC output to a log file that you can later analyze with a tool like PMAT or GCMV.
5. Push or restage the application.
6. In a terminal, run cf logs app-name > app-name.log. That will write the app logs and the Java NMT info to the file. Please turn off as much application logging as possible, as this makes it easier to pick out the Java NMT dumps.
7. Kick off the load tests.
The nice thing about Java NMT is that it takes a snapshot of the memory usage when it first runs, and every subsequent poll shows a diff against that initial snapshot. This is helpful because we really only need the last Java NMT dump before the crash to know which parts of memory have grown and by how much. It also gives us insight into non-heap usage. Given the Java NMT info, it should be easier to suggest JVM tuning so that the JVM does not exceed the application's memory limit and crash.
8. If you have the ability to ssh into the container, use the following commands to trigger heap dumps and core dumps.
JVM Process Id:
PID=$(ps -ef | grep java | grep -v "bash\|grep" | awk '{print $2}')
Heapdump:
./jmap -dump:format=b,file=/home/vcap/app/test.hprof $PID
Coredump:
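kill -6 $PID sends SIGABRT, which should produce a core file if core dumps are enabled in the container. Be aware that SIGABRT normally terminates the JVM, so do not count on the server staying up afterwards.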
Analyze these dumps using tools like Eclipse Memory Analyzer and IBM HeapDump Analyzer.
9. If you have the ability to modify the application, enable the Spring Boot Actuator if it is a Boot app; otherwise integrate a servlet or script such as DumpServlet and HeapDumpServlet into the app.
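The polling described in steps 1 to 3, and the later job of finding the last NMT dump in the captured log, can be sketched roughly as below. This is not the attached dump.sh: the function names poll_nmt and last_nmt_dump, the JCMD variable and the max_polls argument are hypothetical, and the sketch assumes a jcmd binary is reachable inside the container. It relies on the jcmd subcommands VM.native_memory baseline and VM.native_memory summary.diff, and on NMT output starting with a "Native Memory Tracking:" line.

```shell
#!/bin/bash
# Hypothetical stand-in for a dump.sh-style poller; names and defaults are assumptions.
LOOP_WAIT="${LOOP_WAIT:-10}"   # seconds between NMT dumps
JCMD="${JCMD:-jcmd}"           # assumed to be on PATH; override with a full path if needed

# Record one NMT baseline, then print a summary diff every LOOP_WAIT seconds
# until the process exits or max_polls diffs were printed (0 means unlimited).
poll_nmt() {
    local pid="$1"
    local max_polls="${2:-0}"
    local count=0
    "$JCMD" "$pid" VM.native_memory baseline || return 1
    while kill -0 "$pid" 2>/dev/null; do
        "$JCMD" "$pid" VM.native_memory summary.diff
        count=$((count + 1))
        if [ "$max_polls" -gt 0 ] && [ "$count" -ge "$max_polls" ]; then
            break
        fi
        sleep "$LOOP_WAIT"
    done
}

# Print the captured log from the last "Native Memory Tracking:" line to the end,
# i.e. the final NMT dump plus anything logged after it. Prints nothing if no dump exists.
last_nmt_dump() {
    awk '/Native Memory Tracking:/ { buf = ""; seen = 1 }
         seen { buf = buf $0 "\n" }
         END { printf "%s", buf }' "$1"
}
```

Once the JVM is up, poll_nmt "$PID" would run the loop, and after a crash last_nmt_dump app-name.log would print the final dump from the file captured in step 6.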
Salient Notes:
- The memory statistic reported by CF in cf app is in fact used_memory_in_bytes, which is the sum of rss and the active and inactive caches [2], [3]. This is the number watched by the Linux cgroup OOM killer.
- The Cloud Foundry Java Buildpack by default enables the following parameter -XX:OnOutOfMemoryError=$PWD/.java-buildpack/open_jdk_jre/bin/killjava.sh
Please do NOT be lulled into a false sense of complacency by this parameter. I have never seen this work in production. Your best bet is to be proactive when triggering and pulling dumps using servlets, JMX, kill commands, whatever ...
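The first note above can be checked from inside the container. Here is a minimal sketch, assuming a cgroup v1 layout where memory.stat exposes rss and cache counters, and treating cache as the combined active and inactive page cache. The function name container_used_bytes and the default path are assumptions, and the result may not match the figure CF reports exactly.

```shell
# Sum the "rss" and "cache" counters from a cgroup v1 memory.stat-style file.
# The default path is an assumption and differs under cgroup v2.
container_used_bytes() {
    local stat_file="${1:-/sys/fs/cgroup/memory/memory.stat}"
    awk '$1 == "rss" || $1 == "cache" { total += $2 }
         END { printf "%.0f\n", total }' "$stat_file"
}
```

Running container_used_bytes periodically during the load test gives a rough view of how close the container is to its memory limit.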
References:
[3] https://groups.google.com/a/cloudfoundry.org/d/msg/vcap-dev/6M8BDV_tq7w/VCoJmWtJJncJ
[4] https://blogs.oracle.com/poonam/entry/about_g1_garbage_collector_permanent
[5] http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html
[6] Native Memory Best Practices
[7] http://www-01.ibm.com/support/docview.wss?uid=swg21255223
[8] https://publib.boulder.ibm.com/httpserv/cookbook/cookbook.html