This blog post is a step-by-step walkthrough of how to take a core dump of a specific application instance in Cloud Foundry. Please note you WILL need access to the PCF OpsManager. Typically only sysadmins can target the BOSH director of your CF deployment, so if you are reading this as a developer of Node.js apps in PCF, your best bet is to build the diagnostics into the app, for example:
- Use the restify observable REST framework
- Modify the node start command for your app, e.g. node --abort_on_uncaught_exception --expose-gc server.js
- Install heapdump with npm install heapdump and require it in server.js (see the sketch after this list). Use Chrome DevTools to analyze the heap snapshots.
- Develop debugging techniques as explained here.
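A minimal sketch of the developer-side setup, assuming the node-heapdump module (which, on Linux, writes a heap snapshot when the process receives SIGUSR2):

npm install heapdump --save
# in server.js, add: require('heapdump');
# start the app with crash and GC diagnostics enabled
node --abort_on_uncaught_exception --expose-gc server.js &
# ask heapdump to write a heapdump-*.heapsnapshot file into the working directory
kill -USR2 $!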
Step 0: App Listing
cf app word-finder
     state     since                    cpu    memory          disk           details
#0   running   2016-04-09 02:13:26 AM   0.0%   107.4M of 1G    164.8M of 1G
#1   running   2016-04-09 10:31:47 PM   0.0%   96.2M of 1G     44.6M of 1G
#2   running   2016-04-09 10:31:47 PM   0.0%   95.4M of 1G     44.6M of 1G
Step 1: Determine app GUID
cf app word-finder --guid
5e46745d-9849-42c3-9549-be9ba5fe1e57
Step 2: Determine Host and Port of the app instance in question
Let's assume that it is instance #2 that is suffering from the high CPU.
export CF_TRACE=true
cf curl /v2/apps/5e46745d-9849-42c3-9549-be9ba5fe1e57/stats
"2": {
"state": "RUNNING",
"stats": {
"name": "word-finder",
"uris": [
"word-finder.cfapps.paas.pivotal.io"
],
"host": "10.65.187.46",
"port": 60540,
"uptime": 560,
"mem_quota": 1073741824,
"disk_quota": 1073741824,
"fds_quota": 16384,
"usage": {
"time": "2016-04-10T02:41:08.873469312Z",
"cpu": 0.00019609362320671737,
"mem": 103034880,
"disk": 46796800
}
}
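If you do not want to read the whole payload, a jq filter pulls out just the host and port for instance #2 (a sketch, assuming jq is installed on the machine running the cf CLI):

cf curl /v2/apps/5e46745d-9849-42c3-9549-be9ba5fe1e57/stats | jq '."2".stats | {host, port}'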
Step 3: bosh ssh into the Diego Cell VM
SSH into the OpsManager VM, then target the BOSH director after looking up the Director IP address in the OpsManager console, and point it at the correct CF deployment.
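A sketch of the targeting commands, assuming the BOSH v1 CLI installed on the OpsManager VM and a placeholder director IP:

bosh target https://<director-ip>:25555
bosh login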
bosh deployment /var/tempest/workspaces/default/deployments/cf-cdcc8064df9477b983f3.yml
Issue a bosh vms command to determine the name of the Diego cell that houses the app instance container. Note that the cell's IP should match the host from Step 2 (10.65.187.46).
ubuntu@pivotal-ops-manager:~$ bosh vms cf-cdcc8064df9477b983f3 --details
Acting as user 'director' on deployment 'cf-cdcc8064df9477b983f3' on 'p-bosh-1667ec8b8a5e1edbe71b'
| Job/index                                    | State   | Resource Pool                              | IPs          | CID                                     | Agent ID                             | Resurrection |
| diego_cell-partition-e8ca738002f1f150c602/0  | running | diego_cell-partition-e8ca738002f1f150c602  | 10.65.187.46 | vm-b2f2b447-043b-41ec-bad8-9fd950e9f9d7 | f1ec0fc6-9942-4c18-a55e-b507014684e3 | active       |
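In a deployment with many Diego cells, grepping the bosh vms output for the host IP from Step 2 finds the right cell quickly (a sketch):

bosh vms cf-cdcc8064df9477b983f3 --details | grep 10.65.187.46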
ubuntu@pivotal-ops-manager:~$ bosh ssh diego_cell-partition-e8ca738002f1f150c602/0
Step 4: Locate the Garden container that contains app instance #2.
First start in the right directory
cd /var/vcap/data
Install Veritas
wget https://github.com/pivotal-cf-experimental/veritas/releases/download/latest/veritas
chmod a+x veritas
Configure Veritas to point at the local Garden socket, then run it:
root@9148d07d-9e64-4308-88ec-6ec90a04be37:/var/vcap/data# export GARDEN_ADDR=/var/vcap/data/garden/garden.sock
root@9148d07d-9e64-4308-88ec-6ec90a04be37:/var/vcap/data# export GARDEN_NETWORK=unix
root@9148d07d-9e64-4308-88ec-6ec90a04be37:/var/vcap/data# export VITALS_ADDRS=garden:0.0.0.0:17013,rep:0.0.0.0:17008
root@9148d07d-9e64-4308-88ec-6ec90a04be37:/var/vcap/data# ./veritas garden-containers
Veritas prints a mapping of app instances to Garden containers. Find the container whose port mapping includes the host port for instance #2 from Step 2 (60540); a filtered lookup is sketched after the listing below.
b60ba43b-d927-4d92-b3775e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-bf91aa00-fabb-45a9-4311-1e326ad5998e - active @ /var/vcap/data/garden/depot/f8gqc8vnqfl
Memory: 98.797 MB
Disk: Total:802.640 MB 0 Inodes, Exclusive:44.629 MB 0 Inodes
10.254.0.25=>5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-bf91aa00-fabb-45a9-4311-1e326ad5998e: 60542:8080,60543:2222
Running: 2 processes
Properties:
executor:owner: executor
5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-fc296ea9-1e02-4f97-6c95-553388a3370c - active @ /var/vcap/data/garden/depot/f8gqc8vnqfm
Memory: 98.297 MB
Disk: Total:802.640 MB 0 Inodes, Exclusive:44.629 MB 0 Inodes
10.254.0.29=>5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-fc296ea9-1e02-4f97-6c95-553388a3370c: 60540:8080,60541:2222
Running: 2 processes
Properties:
executor:owner: executor
5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-8ce28a9f-a9ef-4aff-73a6-4e095822e0db - active @ /var/vcap/data/garden/depot/f8gqc8vnqbr
Memory: 107.430 MB
Disk: Total:922.765 MB 0 Inodes, Exclusive:164.754 MB 0 Inodes
10.254.0.21=>5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-8ce28a9f-a9ef-4aff-73a6-4e095822e0db: 60538:8080,60539:2222
Running: 2 processes
Properties:
executor:owner: executor
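Rather than scanning the listing by eye, you can filter the Veritas output on the host port from Step 2; -B 3 keeps the handle and depot path lines that precede each port mapping (a sketch):

./veritas garden-containers | grep -B 3 '60540:'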
At this point you have nailed down the path to your specific Garden container, i.e. /var/vcap/data/garden/depot/f8gqc8vnqfm.
Step 5: wsh into the container and issue the gcore command after attaching to the Node.js process
Determine the PID of the Node.js app process
root@f1ec0fc6-9942-4c18-a55e-b507014684e3:~# cd /var/vcap/data/garden/depot/f8gqc8vnqfm
garden/depot/f8gqc8vnqfm$ sudo su -
root@f1ec0fc6-9942-4c18-a55e-b507014684e3:/var/vcap/data/garden/depot/f8gqc8vnqfm# ./bin/wsh
# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 02:31 ? 00:00:00 initd -dropCapabilities=false -title="wshd: f8gqc8vnqfm"
vcap 14 1 0 02:31 ? 00:00:00 node server.js
vcap 17 1 0 02:31 ? 00:00:00 /tmp/lifecycle/diego-sshd -address=0.0.0.0:2222 -hostKey=-----BEGIN RSA PRIVATE KEY----- MIICXAIBAAKBgQCxtuDXcInp3mD2x9s8oI6c6Ca8wgEYvPoFwichBYOklyxJHMSy
root 330 1 0 03:07 pts/0 00:00:00 /bin/sh
root 332 330 0 03:08 pts/0 00:00:00 ps -ef
Attach a debugger to the process
# gdb --pid=14
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Attaching to process 14
Reading symbols from /home/vcap/app/.heroku/node/bin/node...done.
Reading symbols from /lib/x86_64-linux-gnu/libssl.so.1.0.0...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libssl.so.1.0.0
Reading symbols from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
Reading symbols from /lib/x86_64-linux-gnu/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/librt.so.1
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libdl.so.2
Reading symbols from /usr/lib/x86_64-linux-gnu/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/x86_64-linux-gnu/libstdc++.so.6
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libm.so.6
Reading symbols from /lib/x86_64-linux-gnu/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libgcc_s.so.1
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 25]
[New LWP 24]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Loaded symbols for /lib/x86_64-linux-gnu/libpthread.so.0
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00007f8baa3c4859 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
Generate a core dump
(gdb) gcore
warning: target file /proc/14/cmdline contained unexpected null characters
warning: Memory read failed for corefile section, 8192 bytes at 0x7fff6e7d9000.
Saved corefile core.14
(gdb) exit
Undefined command: "exit". Try "help".
(gdb) quit
A debugging session is active.
Inferior 1 [process 14] will be detached.
Quit anyway? (y or n) y
Detaching from program: /home/vcap/app/.heroku/node/bin/node, process 14
# ls
core.14 firstboot.sh
# ls -al
total 114044
drwx------ 2 root root 4096 Apr 10 03:08 .
drwxr-xr-x 31 root root 4096 Apr 10 03:08 ..
-rw-r--r-- 1 root root 3106 Feb 20 2014 .bashrc
-rw-r--r-- 1 root root 140 Feb 20 2014 .profile
-rw-r--r-- 1 root root 116764456 Apr 10 03:08 core.14
-rwxr-xr-x 1 root root 213 Feb 22 19:32 firstboot.sh
#
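If you would rather not drive gdb interactively, the same dump can be produced in one batch invocation (a sketch, assuming the same PID as above):

gdb -p 14 -batch -ex 'gcore /tmp/core.14' -ex detach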
Step 6: Transfer the core dump to a VM that has the mdb debugger installed
After generating the core dump, change ownership of the core to vcap:vcap and move it to the /tmp directory so that it can be scp'ed from the container to your Joyent or SunOS box.
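Inside the container (still in the wsh shell from Step 5), that looks roughly like:

chown vcap:vcap core.14
mv core.14 /tmp/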
Then, from your workstation, get a one-time SSH passcode with cf ssh-code (pbcopy copies it to the clipboard on macOS) and supply it as the password when scp prompts:
cf ssh-code | pbcopy
scp -P 2222 -o User=cf:0b9216bc-5606-4c40-80cc-ae177a072435/0 ssh.run.haas-35.pez.pivotal.io:/tmp/core.16 .
Step 7: Load the core dump on the Solaris VM
Fetch a Solaris VM from the Joyent cloud. My specific machine reports SunOS 6de24ae5-f178-4d11-aa0f-c027c81e9f84 5.11 joyent_20151029T052920Z i86pc i386 i86pc Solaris.
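Before loading the dump you also need the exact Linux node binary that produced it, plus the mdb_v8 dcmd module. A sketch of fetching both onto the Solaris VM, assuming the standard nodejs.org dist layout and the mdb_v8 GitHub releases (adjust the node version to match your app):

curl -O https://nodejs.org/dist/v4.2.2/node-v4.2.2-linux-x64.tar.gz
mkdir node-v4.2.2-linux
gzcat node-v4.2.2-linux-x64.tar.gz | (cd node-v4.2.2-linux && tar xf -)
curl -LO https://github.com/joyent/mdb_v8/releases/download/v1.1.1/mdb_v8_amd64.so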
# mdb ./node-v4.2.2-linux/node-v4.2.2-linux-x64/bin/node ./core.14
> ::load ./mdb_v8_amd64.so
mdb_v8 version: 1.1.1 (release, from 28cedf2)
V8 version: 143.156.132.195
Autoconfigured V8 support from target
C++ symbol demangling enabled
Step 8: Issue mdb commands to introspect the dump for stack traces and data
- ::help <cmd> (show help for any dcmd)
- ::jsstack (print the JavaScript-aware stack trace)
- ::jsprint (print the JavaScript value at a given address)
- ::jssource (show the source of a JavaScript function)
- ::jsconstructor (report the constructor of a JavaScript object)
- ::findjsobjects (scan the heap and group JavaScript objects by constructor and shape)
- ::jsfunctions (list the JavaScript functions in the dump)
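A typical first pass over the dump, assuming the dump and symbols loaded cleanly (object addresses come from the output of earlier commands, not known in advance):

> ::jsstack
> ::findjsobjects
> <object-address>::jsprint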
Step 9: Figure out the exact stack frames that were executing when the crash or high-CPU situation occurred.
Happy hunting ...