Saturday, April 9, 2016

Debugging Node.js high CPU, crash and memory issues in Cloud Foundry - part 2

This blog post is a step-by-step run-through of how to take a core dump of a specific application instance in Cloud Foundry. Please note you WILL need access to the PCF Ops Manager. Typically only sysadmins can target the BOSH director at your CF deployment, so if you are reading this as a developer of Node.js apps on PCF, your best bet is to build the diagnostics into the app like so:

  1. Use the restify REST framework, which is built with observability in mind.
  2. Modify the node start command for your app, e.g. node --abort_on_uncaught_exception --expose-gc server.js
  3. Install heapdump with npm install heapdump and require it in server.js. Use Chrome DevTools to analyze the heap snapshots (see the sketch after this list).
  4. Develop debugging techniques as explained here.
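
For reference, here is a minimal sketch of what those in-app diagnostics could look like in server.js. It assumes restify and heapdump are installed; the route names and the /tmp path are illustrative choices, not part of any standard:

// server.js - minimal diagnostics sketch (assumes: npm install restify heapdump)
// started with: node --abort_on_uncaught_exception --expose-gc server.js
var restify = require('restify');
var heapdump = require('heapdump'); // also writes a snapshot on SIGUSR2 by default

var server = restify.createServer({ name: 'word-finder' });

// Cheap liveness/memory endpoint for spotting growth over time.
server.get('/healthz', function (req, res, next) {
  res.send({ status: 'UP', memory: process.memoryUsage() });
  return next();
});

// On-demand heap snapshot; open the .heapsnapshot file in Chrome DevTools.
server.get('/debug/heapdump', function (req, res, next) {
  heapdump.writeSnapshot('/tmp/' + Date.now() + '.heapsnapshot', function (err, filename) {
    res.send({ written: filename, error: err && err.message });
    return next();
  });
});

server.listen(process.env.PORT || 8080, function () {
  console.log('%s listening at %s', server.name, server.url);
});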
Step 0: App Listing
cf app word-finder
     state     since                    cpu    memory         disk           details
#0   running   2016-04-09 02:13:26 AM   0.0%   107.4M of 1G   164.8M of 1G
#1   running   2016-04-09 10:31:47 PM   0.0%   96.2M of 1G    44.6M of 1G
#2   running   2016-04-09 10:31:47 PM   0.0%   95.4M of 1G    44.6M of 1G

Step 1: Determine app GUID
cf app word-finder --guid
5e46745d-9849-42c3-9549-be9ba5fe1e57

Step 2: Determine Host and Port of the app instance in question
Let's assume that instance #2 is the one suffering from high CPU.

export CF_TRACE=true
cf curl /v2/apps/5e46745d-9849-42c3-9549-be9ba5fe1e57/stats

  "2": {
      "state": "RUNNING",
      "stats": {
         "name": "word-finder",
         "uris": [
            "word-finder.cfapps.paas.pivotal.io"
         ],
         "host": "10.65.187.46",
         "port": 60540,
         "uptime": 560,
         "mem_quota": 1073741824,
         "disk_quota": 1073741824,
         "fds_quota": 16384,
         "usage": {
            "time": "2016-04-10T02:41:08.873469312Z",
            "cpu": 0.00019609362320671737,
            "mem": 103034880,
            "disk": 46796800
         }
      }
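
If you do this often, a small helper along these lines can print the host and port for every instance. This is just a sketch that shells out to the cf CLI (assumed to be installed, logged in and targeted); the file name stats.js is made up:

// stats.js - print host:port for each instance of an app
// usage: node stats.js 5e46745d-9849-42c3-9549-be9ba5fe1e57
var execSync = require('child_process').execSync;

var guid = process.argv[2];
var stats = JSON.parse(execSync('cf curl /v2/apps/' + guid + '/stats', { encoding: 'utf8' }));

Object.keys(stats).forEach(function (idx) {
  var s = stats[idx].stats;
  console.log('#' + idx, stats[idx].state, s.host + ':' + s.port);
});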
Step 3: bosh ssh into the Diego Cell VM 

SSH into the Ops Manager VM, look up the Director IP address in the Ops Manager console, target the BOSH director at it, and then set the deployment:

bosh deployment /var/tempest/workspaces/default/deployments/cf-cdcc8064df9477b983f3.yml

Issue a bosh vms command to determine the name of the Diego cell that hosts the app instance's container. The cell's IP address should match the host from Step 2.

ubuntu@pivotal-ops-manager:~$ bosh vms cf-cdcc8064df9477b983f3 --details
Acting as user 'director' on deployment 'cf-cdcc8064df9477b983f3' on 'p-bosh-1667ec8b8a5e1edbe71b'


| Job/index | State | Resource Pool | IPs | CID | Agent ID | Resurrection |
| diego_cell-partition-e8ca738002f1f150c602/0 | running | diego_cell-partition-e8ca738002f1f150c602 | 10.65.187.46 | vm-b2f2b447-043b-41ec-bad8-9fd950e9f9d7 | f1ec0fc6-9942-4c18-a55e-b507014684e3 | active |

ubuntu@pivotal-ops-manager:~$ bosh ssh diego_cell-partition-e8ca738002f1f150c602/0

Step 4: Locate the Garden container that contains app instance #2.

First change to the right directory:
cd /var/vcap/data

Install Veritas
wget https://github.com/pivotal-cf-experimental/veritas/releases/download/latest/veritas
chmod a+x veritas

Configure Veritas by auto-detecting the vitals addresses and exporting the Garden connection settings:

bosh_4auu9hoye@f1ec0fc6-9942-4c18-a55e-b507014684e3:/var/vcap/data$ ./veritas autodetect
export VITALS_ADDRS=garden:0.0.0.0:17013,rep:0.0.0.0:17008

root@9148d07d-9e64-4308-88ec-6ec90a04be37:/var/vcap/data# export GARDEN_ADDR=/var/vcap/data/garden/garden.sock

root@9148d07d-9e64-4308-88ec-6ec90a04be37:/var/vcap/data# export GARDEN_NETWORK=unix

root@9148d07d-9e64-4308-88ec-6ec90a04be37:/var/vcap/data# export VITALS_ADDRS=garden:0.0.0.0:17013,rep:0.0.0.0:17008

root@9148d07d-9e64-4308-88ec-6ec90a04be37:/var/vcap/data# ./veritas garden-containers

This gives a mapping of app instances to Garden containers. Find the container whose port mapping includes the port exposed for the instance in question (60540 from Step 2).

b60ba43b-d927-4d92-b3775e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-bf91aa00-fabb-45a9-4311-1e326ad5998e - active @ /var/vcap/data/garden/depot/f8gqc8vnqfl
  Memory: 98.797 MB
  Disk: Total:802.640 MB 0 Inodes, Exclusive:44.629 MB 0 Inodes
  10.254.0.25=>5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-bf91aa00-fabb-45a9-4311-1e326ad5998e: 60542:8080,60543:2222
  Running: 2 processes
  Properties:
    executor:owner: executor

5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-fc296ea9-1e02-4f97-6c95-553388a3370c - active @ /var/vcap/data/garden/depot/f8gqc8vnqfm
  Memory: 98.297 MB
  Disk: Total:802.640 MB 0 Inodes, Exclusive:44.629 MB 0 Inodes
  10.254.0.29=>5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-fc296ea9-1e02-4f97-6c95-553388a3370c: 60540:8080,60541:2222
  Running: 2 processes
  Properties:
    executor:owner: executor

5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-8ce28a9f-a9ef-4aff-73a6-4e095822e0db - active @ /var/vcap/data/garden/depot/f8gqc8vnqbr
  Memory: 107.430 MB
  Disk: Total:922.765 MB 0 Inodes, Exclusive:164.754 MB 0 Inodes
  10.254.0.21=>5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-8ce28a9f-a9ef-4aff-73a6-4e095822e0db: 60538:8080,60539:2222
  Running: 2 processes
  Properties:
    executor:owner: executor

At this point you have nailed down the path to your specific Garden container, i.e. /var/vcap/data/garden/depot/f8gqc8vnqfm.
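
If the cell is running many containers, a quick filter over the veritas output saves some scanning. Here is a sketch keyed to the output format shown above; it assumes you have redirected ./veritas garden-containers to a file and run the script wherever Node is available, and the script and file names are made up:

// find-depot.js - print the depot path of the container that maps a given host port
// usage: ./veritas garden-containers > veritas.out && node find-depot.js veritas.out 60540
var fs = require('fs');

var lines = fs.readFileSync(process.argv[2], 'utf8').split('\n');
var port = process.argv[3];
var currentDepot = null;

lines.forEach(function (line) {
  var m = line.match(/active @ (\/var\/vcap\/data\/garden\/depot\/\S+)/);
  if (m) { currentDepot = m[1]; }
  // The NAT line for each container looks like "...: 60540:8080,60541:2222".
  if (currentDepot && line.indexOf(' ' + port + ':') !== -1) {
    console.log(currentDepot);
  }
});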

Step 5: wsh into the container, attach gdb to the node.js process, and issue the gcore command

Determine the PID of the node.js app process
root@f1ec0fc6-9942-4c18-a55e-b507014684e3:~# cd /var/vcap/data/garden/depot/f8gqc8vnqfm

garden/depot/f8gqc8vnqfm$ sudo su -

root@f1ec0fc6-9942-4c18-a55e-b507014684e3:/var/vcap/data/garden/depot/f8gqc8vnqfm# ./bin/wsh

# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 02:31 ?        00:00:00 initd -dropCapabilities=false -title="wshd: f8gqc8vnqfm"
vcap        14     1  0 02:31 ?        00:00:00 node server.js
vcap        17     1  0 02:31 ?        00:00:00 /tmp/lifecycle/diego-sshd -address=0.0.0.0:2222 -hostKey=-----BEGIN RSA PRIVATE KEY----- MIICXAIBAAKBgQCxtuDXcInp3mD2x9s8oI6c6Ca8wgEYvPoFwichBYOklyxJHMSy
root       330     1  0 03:07 pts/0    00:00:00 /bin/sh
root       332   330  0 03:08 pts/0    00:00:00 ps -ef

Attach a debugger to the process
# gdb --pid=14
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Attaching to process 14
Reading symbols from /home/vcap/app/.heroku/node/bin/node...done.
Reading symbols from /lib/x86_64-linux-gnu/libssl.so.1.0.0...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libssl.so.1.0.0
Reading symbols from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
Reading symbols from /lib/x86_64-linux-gnu/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/librt.so.1
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libdl.so.2
Reading symbols from /usr/lib/x86_64-linux-gnu/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/x86_64-linux-gnu/libstdc++.so.6
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libm.so.6
Reading symbols from /lib/x86_64-linux-gnu/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libgcc_s.so.1
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 25]
[New LWP 24]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Loaded symbols for /lib/x86_64-linux-gnu/libpthread.so.0
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00007f8baa3c4859 in syscall () from /lib/x86_64-linux-gnu/libc.so.6


Generate a core dump
(gdb) gcore
warning: target file /proc/14/cmdline contained unexpected null characters
warning: Memory read failed for corefile section, 8192 bytes at 0x7fff6e7d9000.
Saved corefile core.14
(gdb) exit
Undefined command: "exit".  Try "help".
(gdb) quit
A debugging session is active.

Inferior 1 [process 14] will be detached.

Quit anyway? (y or n) y
Detaching from program: /home/vcap/app/.heroku/node/bin/node, process 14


# ls
core.14  firstboot.sh
# ls -al
total 114044
drwx------  2 root root      4096 Apr 10 03:08 .
drwxr-xr-x 31 root root      4096 Apr 10 03:08 ..
-rw-r--r--  1 root root      3106 Feb 20  2014 .bashrc
-rw-r--r--  1 root root       140 Feb 20  2014 .profile
-rw-r--r--  1 root root 116764456 Apr 10 03:08 core.14
-rwxr-xr-x  1 root root       213 Feb 22 19:32 firstboot.sh
#

Step 6: Transfer the core dump to a VM that has the mdb debugger installed
After generating the core dump, change ownership of the core to vcap:vcap and move it to the /tmp directory so that it can be scp'ed from the container to your Joyent or SunOS box. Use cf ssh-code to get a one-time authorization code for the scp (pbcopy puts it on the clipboard), then pull the file down:

cf ssh-code | pbcopy
scp -P 2222 -o User=cf:0b9216bc-5606-4c40-80cc-ae177a072435/0 ssh.run.haas-35.pez.pivotal.io:/tmp/core.16 .

Step 7: Load the core dump on the Solaris VM
Fetch a Solaris VM from the Joyent cloud. My specific machine reports: SunOS 6de24ae5-f178-4d11-aa0f-c027c81e9f84 5.11 joyent_20151029T052920Z i86pc i386 i86pc Solaris

# mdb ./node-v4.2.2-linux/node-v4.2.2-linux-x64/bin/node ./core.14

> ::load ./mdb_v8_amd64.so
mdb_v8 version: 1.1.1 (release, from 28cedf2)
V8 version: 143.156.132.195
Autoconfigured V8 support from target
C++ symbol demangling enabled

Step 8: Issue mdb commands to introspect the dump for stack traces and data
::help <dcmd>      - show help for a specific dcmd
::jsstack          - print the JavaScript stack trace
::jsprint          - print a JavaScript object
::jssource         - show the source of a JavaScript function
::jsconstructor    - show the constructor of a JavaScript object
::findjsobjects    - scan the heap for JavaScript objects
::jsfunctions      - list the JavaScript functions in the dump

Step 9: Figure out the exact stack frame that was executing when the crash or high-CPU situation occurred.

Happy hunting ... 
