
Saturday, April 9, 2016

Debugging Node.js high CPU, crash and memory issues in Cloud Foundry - part 2

This blog post is a step-by-step run-through of how to take a core dump of a specific application instance in Cloud Foundry. Please note you WILL need access to the PCF OpsManager; typically only sysadmins have the ability to target the BOSH director at your CF deployment. So if you are reading this as a developer of Node.js apps on PCF, your best bet is to build the diagnostics into the app like so:

  1. Use the restify REST framework for its built-in observability (request/audit logging)
  2. Modify the node start command for your app like so: node --abort_on_uncaught_exception --expose-gc server.js
  3. Install heapdump with npm install heapdump and require it in server.js (see the sketch after this list). Use Chrome DevTools to analyze the heap dumps.
  4. Develop debugging techniques as explained here.
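
For item 3, here is a minimal sketch of the heapdump wiring (the 800 MB threshold, the /tmp snapshot path and the 30-second poll are illustrative values, not taken from any particular app), assuming the app is started with node --abort_on_uncaught_exception --expose-gc server.js:

// Required from server.js. Requiring heapdump also lets you trigger a
// snapshot from outside the process with `kill -USR2 <pid>`.
var heapdump = require('heapdump');

// Illustrative threshold: snapshot once the JS heap grows past ~800 MB.
var THRESHOLD_BYTES = 800 * 1024 * 1024;
var snapshotTaken = false;

setInterval(function () {
  if (snapshotTaken) { return; }
  if (process.memoryUsage().heapUsed > THRESHOLD_BYTES) {
    // --expose-gc makes global.gc available; collect first so the dump
    // holds only live objects.
    if (global.gc) { global.gc(); }
    heapdump.writeSnapshot('/tmp/' + Date.now() + '.heapsnapshot');
    snapshotTaken = true;
  }
}, 30000);

Load the resulting .heapsnapshot file in the Chrome DevTools Memory tab to hunt for the leak.
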
Step 0: App Listing
cf app word-finder
     state     since                    cpu    memory         disk           details
#0   running   2016-04-09 02:13:26 AM   0.0%   107.4M of 1G   164.8M of 1G
#1   running   2016-04-09 10:31:47 PM   0.0%   96.2M of 1G    44.6M of 1G
#2   running   2016-04-09 10:31:47 PM   0.0%   95.4M of 1G    44.6M of 1G

Step 1: Determine app GUID
cf app word-finder --guid
5e46745d-9849-42c3-9549-be9ba5fe1e57

Step 2: Determine Host and Port of the app instance in question
Let's assume that it is instance #2 that is suffering the high CPU.

export CF_TRACE=true
cf curl /v2/apps/5e46745d-9849-42c3-9549-be9ba5fe1e57/stats

  "2": {
      "state": "RUNNING",
      "stats": {
         "name": "word-finder",
         "uris": [
            "word-finder.cfapps.paas.pivotal.io"
         ],
         "host": "10.65.187.46",
         "port": 60540,
         "uptime": 560,
         "mem_quota": 1073741824,
         "disk_quota": 1073741824,
         "fds_quota": 16384,
         "usage": {
            "time": "2016-04-10T02:41:08.873469312Z",
            "cpu": 0.00019609362320671737,
            "mem": 103034880,
            "disk": 46796800
         }
      }
Step 3: bosh ssh into the Diego Cell VM 

ssh into the OpsManager VM and target the BOSH director, looking up the director IP address from the OpsManager console. Then point BOSH at the CF deployment manifest:

bosh deployment /var/tempest/workspaces/default/deployments/cf-cdcc8064df9477b983f3.yml

Issue a bosh vms command to determine the name of the Diego cell that houses the app instance's container. Note that the cell's IP should match the host from step 2.

ubuntu@pivotal-ops-manager:~$ bosh vms cf-cdcc8064df9477b983f3 --details
Acting as user 'director' on deployment 'cf-cdcc8064df9477b983f3' on 'p-bosh-1667ec8b8a5e1edbe71b'


| Job/index                                   | State   | Resource Pool                              | IPs          | CID                                     | Agent ID                             | Resurrection |
| diego_cell-partition-e8ca738002f1f150c602/0 | running | diego_cell-partition-e8ca738002f1f150c602 | 10.65.187.46 | vm-b2f2b447-043b-41ec-bad8-9fd950e9f9d7 | f1ec0fc6-9942-4c18-a55e-b507014684e3 | active       |

ubuntu@pivotal-ops-manager:~$ bosh ssh diego_cell-partition-e8ca738002f1f150c602/0

Step 4: Locate the Garden container that contains app instance #2

First cd into the right directory:
cd /var/vcap/data

Install Veritas
wget https://github.com/pivotal-cf-experimental/veritas/releases/download/latest/veritas
chmod a+x veritas

Configure Veritas by exporting the Garden socket address and the vitals addresses; ./veritas autodetect prints the latter:

bosh_4auu9hoye@f1ec0fc6-9942-4c18-a55e-b507014684e3:/var/vcap/data$ ./veritas autodetect
export VITALS_ADDRS=garden:0.0.0.0:17013,rep:0.0.0.0:17008

root@9148d07d-9e64-4308-88ec-6ec90a04be37:/var/vcap/data# export GARDEN_ADDR=/var/vcap/data/garden/garden.sock

root@9148d07d-9e64-4308-88ec-6ec90a04be37:/var/vcap/data# export GARDEN_NETWORK=unix

root@9148d07d-9e64-4308-88ec-6ec90a04be37:/var/vcap/data# export VITALS_ADDRS=garden:0.0.0.0:17013,rep:0.0.0.0:17008

root@9148d07d-9e64-4308-88ec-6ec90a04be37:/var/vcap/data# ./veritas garden-containers

This derives a mapping from app instances to Garden containers. Now find the container whose port mapping includes the host port reported for instance #2 in step 2 (60540 here).

b60ba43b-d927-4d92-b3775e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-bf91aa00-fabb-45a9-4311-1e326ad5998e - active @ /var/vcap/data/garden/depot/f8gqc8vnqfl
  Memory: 98.797 MB
  Disk: Total:802.640 MB 0 Inodes, Exclusive:44.629 MB 0 Inodes
  10.254.0.25=>5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-bf91aa00-fabb-45a9-4311-1e326ad5998e: 60542:8080,60543:2222
  Running: 2 processes
  Properties:
    executor:owner: executor

5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-fc296ea9-1e02-4f97-6c95-553388a3370c - active @ /var/vcap/data/garden/depot/f8gqc8vnqfm
  Memory: 98.297 MB
  Disk: Total:802.640 MB 0 Inodes, Exclusive:44.629 MB 0 Inodes
  10.254.0.29=>5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-fc296ea9-1e02-4f97-6c95-553388a3370c: 60540:8080,60541:2222
  Running: 2 processes
  Properties:
    executor:owner: executor

5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-8ce28a9f-a9ef-4aff-73a6-4e095822e0db - active @ /var/vcap/data/garden/depot/f8gqc8vnqbr
  Memory: 107.430 MB
  Disk: Total:922.765 MB 0 Inodes, Exclusive:164.754 MB 0 Inodes
  10.254.0.21=>5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-8ce28a9f-a9ef-4aff-73a6-4e095822e0db: 60538:8080,60539:2222
  Running: 2 processes
  Properties:
    executor:owner: executor

At this point you have nailed down the path to your specific Garden container, i.e. /var/vcap/data/garden/depot/f8gqc8vnqfm (the one mapping host port 60540 to container port 8080).

Step 5: wsh into the container, attach gdb to the node.js process and issue the gcore command

Determine the PID of the node.js app process
root@f1ec0fc6-9942-4c18-a55e-b507014684e3:~# cd /var/vcap/data/garden/depot/f8gqc8vnqfm

garden/depot/f8gqc8vnqfm$ sudo su -

root@f1ec0fc6-9942-4c18-a55e-b507014684e3:/var/vcap/data/garden/depot/f8gqc8vnqfm# ./bin/wsh

# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 02:31 ?        00:00:00 initd -dropCapabilities=false -title="wshd: f8gqc8vnqfm"
vcap        14     1  0 02:31 ?        00:00:00 node server.js
vcap        17     1  0 02:31 ?        00:00:00 /tmp/lifecycle/diego-sshd -address=0.0.0.0:2222 -hostKey=-----BEGIN RSA PRIVATE KEY----- MIICXAIBAAKBgQCxtuDXcInp3mD2x9s8oI6c6Ca8wgEYvPoFwichBYOklyxJHMSy
root       330     1  0 03:07 pts/0    00:00:00 /bin/sh
root       332   330  0 03:08 pts/0    00:00:00 ps -ef

Attach a debugger to the process
# gdb --pid=14
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Attaching to process 14
Reading symbols from /home/vcap/app/.heroku/node/bin/node...done.
Reading symbols from /lib/x86_64-linux-gnu/libssl.so.1.0.0...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libssl.so.1.0.0
Reading symbols from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
Reading symbols from /lib/x86_64-linux-gnu/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/librt.so.1
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libdl.so.2
Reading symbols from /usr/lib/x86_64-linux-gnu/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/x86_64-linux-gnu/libstdc++.so.6
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libm.so.6
Reading symbols from /lib/x86_64-linux-gnu/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libgcc_s.so.1
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 25]
[New LWP 24]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Loaded symbols for /lib/x86_64-linux-gnu/libpthread.so.0
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00007f8baa3c4859 in syscall () from /lib/x86_64-linux-gnu/libc.so.6


Generate a core dump
(gdb) gcore
warning: target file /proc/14/cmdline contained unexpected null characters
warning: Memory read failed for corefile section, 8192 bytes at 0x7fff6e7d9000.
Saved corefile core.14
(gdb) exit
Undefined command: "exit".  Try "help".
(gdb) quit
A debugging session is active.

Inferior 1 [process 14] will be detached.

Quit anyway? (y or n) y
Detaching from program: /home/vcap/app/.heroku/node/bin/node, process 14


# ls
core.14  firstboot.sh
# ls -al
total 114044
drwx------  2 root root      4096 Apr 10 03:08 .
drwxr-xr-x 31 root root      4096 Apr 10 03:08 ..
-rw-r--r--  1 root root      3106 Feb 20  2014 .bashrc
-rw-r--r--  1 root root       140 Feb 20  2014 .profile
-rw-r--r--  1 root root 116764456 Apr 10 03:08 core.14
-rwxr-xr-x  1 root root       213 Feb 22 19:32 firstboot.sh
#

Step 6: Transfer the core dump to a VM that has the mdb debugger installed
After generating the core dump, change ownership of the core to vcap:vcap and move it to the /tmp dir so that it can be scp'ed from the container to your Joyent or SunOS box:

cf ssh-code | pbcopy
scp -P 2222 -o User=cf:0b9216bc-5606-4c40-80cc-ae177a072435/0 ssh.run.haas-35.pez.pivotal.io:/tmp/core.16 .

cf ssh-code prints a one-time authorization code; paste it when scp prompts for a password.

Step 7: Load the core dump on the Solaris VM
Provision a Solaris VM in the Joyent cloud. My specific machine reports: SunOS 6de24ae5-f178-4d11-aa0f-c027c81e9f84 5.11 joyent_20151029T052920Z i86pc i386 i86pc Solaris

# mdb ./node-v4.2.2-linux/node-v4.2.2-linux-x64/bin/node ./core.14

> ::load ./mdb_v8_amd64.so
mdb_v8 version: 1.1.1 (release, from 28cedf2)
V8 version: 143.156.132.195
Autoconfigured V8 support from target
C++ symbol demangling enabled

Step 8: Issue mdb commands to introspect the dump for stack traces and data
::help <cmd>      - usage for any of the dcmds below
::jsstack         - print the JavaScript-level stack trace
::jsprint         - print a JavaScript object given its address
::jssource        - show the source code of a JavaScript function
::jsconstructor   - show the constructor that created an object
::findjsobjects   - scan the heap and group objects by constructor and shape
::jsfunctions     - list the JavaScript functions present in the core

Step 9: Figure out the exact stack frame that was executing when the crash or high-CPU situation occurred.

Happy hunting ... 

Debugging Node.js high CPU, crash and memory issues in Cloud Foundry - part 1

This is the scenario that keeps a DevOps engineer awake at night: an instance of a Node.js app is running at 100% CPU in Cloud Foundry with NO logging or resiliency protection like circuit breakers. It is the third occurrence of the problem in the last month. Management drags you into long, day-long meetings where each side blames the other; the app is restarted, the problem goes away and the cycle repeats.

With apps increasingly wrapped in Russian-doll layers of abstraction [bare metal => VM => container => Node process], getting to the root cause has become increasingly difficult.

So say you are a Java guy with no experience debugging a production Node.js app. After an appropriate amount of coffee and fruity alcoholic drinks, you land on http://techblog.netflix.com/2015/12/debugging-nodejs-in-production.html

The recommendations from Yunong Xiao in the presentation essentially boil down to:
1. Use node-restify as the REST framework in the Node app to log and observe app performance (a minimal sketch follows this list).
2. Use Linux perf events to statistically sample stack traces and construct CPU flame graphs. This technique can reliably find the proverbial needle in the haystack.
3. Take core dumps to capture the app state for post-mortem debugging with the mdb debugger. Unfortunately mdb only runs on Solaris, so you have to transfer the core dump to a debug Solaris VM instance [1] [2].
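
To make recommendation 1 concrete, here is a minimal restify sketch with a bunyan logger (the service name, the route and the restify 4.x-era auditLogger API are illustrative assumptions, not taken from the talk). Every completed request is audit-logged, so latency spikes and errors show up in cf logs without attaching a debugger:

var restify = require('restify');
var bunyan = require('bunyan');

var log = bunyan.createLogger({ name: 'word-finder' });

// Passing the logger to the server makes req.log available in handlers.
var server = restify.createServer({ name: 'word-finder', log: log });

// Audit-log every request after it completes (status, latency, route).
server.on('after', restify.auditLogger({ log: log }));

server.get('/words/:prefix', function (req, res, next) {
  req.log.info({ prefix: req.params.prefix }, 'looking up words');
  res.send({ prefix: req.params.prefix, words: [] });
  return next();
});

// Cloud Foundry injects the port to bind to via the PORT env var.
server.listen(process.env.PORT || 8080, function () {
  log.info('%s listening at %s', server.name, server.url);
});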

Unfortunately, neither option 1 nor option 2 is possible: option 1 because the app-dev team would need to get involved, and they don't look kindly on a DevOps guy advising a REST framework; option 2 because CF does not bake perf events into the OS stemcell. Yikes!!! Looks like good old gcore is the only way to go ...

OK, so with the introduction of Diego, finding your way into the container should be as easy as doing a cf ssh and then issuing the gcore command, right? NOPE, wrong. Once ssh'ed into the container there is no way to elevate privilege and attach gdb to your Node.js process. You will need to find another way to elevate privilege and take a core dump.

That other way is BOSH and Veritas. The gist of the method is that you bosh ssh into the correct Diego cell, identified by first examining the app metadata. On the Diego cell, install veritas, the Garden explorer tool, and use it to map from the app instance to the container instance. Then ./wsh into the Garden container. You can now elevate your privilege with sudo su - and run gdb/gcore against the node.js server process.

Once the dump has been generated, sftp it to a debug Solaris instance and run diagnostics with mdb commands like ::jsstack; see this video from Joyent that explains how to walk the dump.

Part 2 of this blog goes into the gory details of discovering the container and generating the core dump.