
Saturday, April 9, 2016

Debugging Node.js high CPU, crash and memory issues in Cloud Foundry - part 2

This blog post is a step-by-step run-through of how to take a core dump of a specific application instance in Cloud Foundry. Please note you WILL need access to the PCF OpsManager; typically only sysadmins have the ability to target the BOSH director at your CF deployment. So if you are reading this as a developer of Node.js apps on PCF, your best bet is to build the diagnostics into the app like so:

  1. Use the restify REST framework for its built-in observability (request/audit logging)
  2. Modify the node start command for your app like so: node --abort_on_uncaught_exception --expose-gc server.js
  3. Install heapdump with npm install heapdump and require it in server.js (see the sketch after this list). Use Chrome DevTools to analyze the heap dumps.
  4. Develop debugging techniques as explained here.
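
For item 3, here is a minimal sketch of the heapdump wiring (the 800 MB threshold, the /tmp snapshot path and the 30-second poll are illustrative values, not taken from any particular app), assuming the app is started with node --abort_on_uncaught_exception --expose-gc server.js:

// Required from server.js. Requiring heapdump also lets you trigger a
// snapshot from outside the process with `kill -USR2 <pid>`.
var heapdump = require('heapdump');

// Illustrative threshold: snapshot once the JS heap grows past ~800 MB.
var THRESHOLD_BYTES = 800 * 1024 * 1024;
var snapshotTaken = false;

setInterval(function () {
  if (snapshotTaken) { return; }
  if (process.memoryUsage().heapUsed > THRESHOLD_BYTES) {
    // --expose-gc makes global.gc available; collect first so the dump
    // holds only live objects.
    if (global.gc) { global.gc(); }
    heapdump.writeSnapshot('/tmp/' + Date.now() + '.heapsnapshot');
    snapshotTaken = true;
  }
}, 30000);

Load the resulting .heapsnapshot file in the Chrome DevTools Memory tab to hunt for the leak.
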
Step 0: App Listing
cf app word-finder
     state     since                    cpu    memory         disk           details
#0   running   2016-04-09 02:13:26 AM   0.0%   107.4M of 1G   164.8M of 1G
#1   running   2016-04-09 10:31:47 PM   0.0%   96.2M of 1G    44.6M of 1G
#2   running   2016-04-09 10:31:47 PM   0.0%   95.4M of 1G    44.6M of 1G

Step 1: Determine app GUID
cf app word-finder --guid
5e46745d-9849-42c3-9549-be9ba5fe1e57

Step 2: Determine Host and Port of the app instance in question
Let's assume that it is instance #2 that is suffering the high CPU.

export CF_TRACE=true
cf curl /v2/apps/5e46745d-9849-42c3-9549-be9ba5fe1e57/stats

  "2": {
      "state": "RUNNING",
      "stats": {
         "name": "word-finder",
         "uris": [
            "word-finder.cfapps.paas.pivotal.io"
         ],
         "host": "10.65.187.46",
         "port": 60540,
         "uptime": 560,
         "mem_quota": 1073741824,
         "disk_quota": 1073741824,
         "fds_quota": 16384,
         "usage": {
            "time": "2016-04-10T02:41:08.873469312Z",
            "cpu": 0.00019609362320671737,
            "mem": 103034880,
            "disk": 46796800
         }
      }
Step 3: bosh ssh into the Diego Cell VM 

ssh into the OpsManager VM and target the BOSH director, looking up the director IP address from the OpsManager console. Then point BOSH at the CF deployment manifest:

bosh deployment /var/tempest/workspaces/default/deployments/cf-cdcc8064df9477b983f3.yml

Issue a bosh vms command to determine the name of the Diego cell that houses the app instance's container. Note that the cell's IP should match the host from step 2.

ubuntu@pivotal-ops-manager:~$ bosh vms cf-cdcc8064df9477b983f3 --details
Acting as user 'director' on deployment 'cf-cdcc8064df9477b983f3' on 'p-bosh-1667ec8b8a5e1edbe71b'


| Job/index                                   | State   | Resource Pool                              | IPs          | CID                                     | Agent ID                             | Resurrection |
| diego_cell-partition-e8ca738002f1f150c602/0 | running | diego_cell-partition-e8ca738002f1f150c602 | 10.65.187.46 | vm-b2f2b447-043b-41ec-bad8-9fd950e9f9d7 | f1ec0fc6-9942-4c18-a55e-b507014684e3 | active       |

ubuntu@pivotal-ops-manager:~$ bosh ssh diego_cell-partition-e8ca738002f1f150c602/0

Step 4: Locate the Garden container that contains app instance #2

First cd into the right directory:
cd /var/vcap/data

Install Veritas
wget https://github.com/pivotal-cf-experimental/veritas/releases/download/latest/veritas
chmod a+x veritas

Configure Veritas by exporting the Garden socket address and the vitals addresses; ./veritas autodetect prints the latter:

bosh_4auu9hoye@f1ec0fc6-9942-4c18-a55e-b507014684e3:/var/vcap/data$ ./veritas autodetect
export VITALS_ADDRS=garden:0.0.0.0:17013,rep:0.0.0.0:17008

root@9148d07d-9e64-4308-88ec-6ec90a04be37:/var/vcap/data# export GARDEN_ADDR=/var/vcap/data/garden/garden.sock

root@9148d07d-9e64-4308-88ec-6ec90a04be37:/var/vcap/data# export GARDEN_NETWORK=unix

root@9148d07d-9e64-4308-88ec-6ec90a04be37:/var/vcap/data# export VITALS_ADDRS=garden:0.0.0.0:17013,rep:0.0.0.0:17008

root@9148d07d-9e64-4308-88ec-6ec90a04be37:/var/vcap/data# ./veritas garden-containers

This derives a mapping from app instances to Garden containers. Now find the container whose port mapping includes the host port reported for instance #2 in step 2 (60540 here).

b60ba43b-d927-4d92-b3775e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-bf91aa00-fabb-45a9-4311-1e326ad5998e - active @ /var/vcap/data/garden/depot/f8gqc8vnqfl
  Memory: 98.797 MB
  Disk: Total:802.640 MB 0 Inodes, Exclusive:44.629 MB 0 Inodes
  10.254.0.25=>5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-bf91aa00-fabb-45a9-4311-1e326ad5998e: 60542:8080,60543:2222
  Running: 2 processes
  Properties:
    executor:owner: executor

5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-fc296ea9-1e02-4f97-6c95-553388a3370c - active @ /var/vcap/data/garden/depot/f8gqc8vnqfm
  Memory: 98.297 MB
  Disk: Total:802.640 MB 0 Inodes, Exclusive:44.629 MB 0 Inodes
  10.254.0.29=>5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-fc296ea9-1e02-4f97-6c95-553388a3370c: 60540:8080,60541:2222
  Running: 2 processes
  Properties:
    executor:owner: executor

5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-8ce28a9f-a9ef-4aff-73a6-4e095822e0db - active @ /var/vcap/data/garden/depot/f8gqc8vnqbr
  Memory: 107.430 MB
  Disk: Total:922.765 MB 0 Inodes, Exclusive:164.754 MB 0 Inodes
  10.254.0.21=>5e46745d-9849-42c3-9549-be9ba5fe1e57-773eaa66-667c-40b2-863e-51c3cd9570f8-8ce28a9f-a9ef-4aff-73a6-4e095822e0db: 60538:8080,60539:2222
  Running: 2 processes
  Properties:
    executor:owner: executor

At this point you have nailed down the path to your specific Garden container, i.e. /var/vcap/data/garden/depot/f8gqc8vnqfm (the one mapping host port 60540 to container port 8080).

Step 5: wsh into the container, attach gdb to the node.js process and issue the gcore command

Determine the PID of the node.js app process
root@f1ec0fc6-9942-4c18-a55e-b507014684e3:~# cd /var/vcap/data/garden/depot/f8gqc8vnqfm

garden/depot/f8gqc8vnqfm$ sudo su -

root@f1ec0fc6-9942-4c18-a55e-b507014684e3:/var/vcap/data/garden/depot/f8gqc8vnqfm# ./bin/wsh

# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 02:31 ?        00:00:00 initd -dropCapabilities=false -title="wshd: f8gqc8vnqfm"
vcap        14     1  0 02:31 ?        00:00:00 node server.js
vcap        17     1  0 02:31 ?        00:00:00 /tmp/lifecycle/diego-sshd -address=0.0.0.0:2222 -hostKey=-----BEGIN RSA PRIVATE KEY----- MIICXAIBAAKBgQCxtuDXcInp3mD2x9s8oI6c6Ca8wgEYvPoFwichBYOklyxJHMSy
root       330     1  0 03:07 pts/0    00:00:00 /bin/sh
root       332   330  0 03:08 pts/0    00:00:00 ps -ef

Attach a debugger to the process
# gdb --pid=14
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Attaching to process 14
Reading symbols from /home/vcap/app/.heroku/node/bin/node...done.
Reading symbols from /lib/x86_64-linux-gnu/libssl.so.1.0.0...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libssl.so.1.0.0
Reading symbols from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
Reading symbols from /lib/x86_64-linux-gnu/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/librt.so.1
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libdl.so.2
Reading symbols from /usr/lib/x86_64-linux-gnu/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/x86_64-linux-gnu/libstdc++.so.6
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libm.so.6
Reading symbols from /lib/x86_64-linux-gnu/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libgcc_s.so.1
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 25]
[New LWP 24]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Loaded symbols for /lib/x86_64-linux-gnu/libpthread.so.0
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00007f8baa3c4859 in syscall () from /lib/x86_64-linux-gnu/libc.so.6


Generate a core dump
(gdb) gcore
warning: target file /proc/14/cmdline contained unexpected null characters
warning: Memory read failed for corefile section, 8192 bytes at 0x7fff6e7d9000.
Saved corefile core.14
(gdb) exit
Undefined command: "exit".  Try "help".
(gdb) quit
A debugging session is active.

Inferior 1 [process 14] will be detached.

Quit anyway? (y or n) y
Detaching from program: /home/vcap/app/.heroku/node/bin/node, process 14


# ls
core.14  firstboot.sh
# ls -al
total 114044
drwx------  2 root root      4096 Apr 10 03:08 .
drwxr-xr-x 31 root root      4096 Apr 10 03:08 ..
-rw-r--r--  1 root root      3106 Feb 20  2014 .bashrc
-rw-r--r--  1 root root       140 Feb 20  2014 .profile
-rw-r--r--  1 root root 116764456 Apr 10 03:08 core.14
-rwxr-xr-x  1 root root       213 Feb 22 19:32 firstboot.sh
#

Step 6: Transfer the core dump to a VM that has the mdb debugger installed
After generating the core dump, change ownership of the core to vcap:vcap and move it to the /tmp dir so that it can be scp'ed from the container to your Joyent or SunOS box:

cf ssh-code | pbcopy
scp -P 2222 -o User=cf:0b9216bc-5606-4c40-80cc-ae177a072435/0 ssh.run.haas-35.pez.pivotal.io:/tmp/core.16 .

cf ssh-code prints a one-time authorization code; paste it when scp prompts for a password.

Step 7: Load the core dump on the Solaris VM
Provision a Solaris VM in the Joyent cloud. My specific machine reports: SunOS 6de24ae5-f178-4d11-aa0f-c027c81e9f84 5.11 joyent_20151029T052920Z i86pc i386 i86pc Solaris

# mdb ./node-v4.2.2-linux/node-v4.2.2-linux-x64/bin/node ./core.14

> ::load ./mdb_v8_amd64.so
mdb_v8 version: 1.1.1 (release, from 28cedf2)
V8 version: 143.156.132.195
Autoconfigured V8 support from target
C++ symbol demangling enabled

Step 8: Issue mdb commands to introspect the dump for stack traces and data
::help <cmd>      - usage for any of the dcmds below
::jsstack         - print the JavaScript-level stack trace
::jsprint         - print a JavaScript object given its address
::jssource        - show the source code of a JavaScript function
::jsconstructor   - show the constructor that created an object
::findjsobjects   - scan the heap and group objects by constructor and shape
::jsfunctions     - list the JavaScript functions present in the core

Step 9: Figure out the exact stack frame that was executing when the crash or high-CPU situation occurred.

Happy hunting ... 

Debugging Node.js high CPU, crash and memory issues in Cloud Foundry - part 1

This is the scenario that keeps a DevOps engineer awake at night: an instance of a Node.js app is running at 100% CPU in Cloud Foundry with NO logging or resiliency protection like circuit breakers. It is the third occurrence of the problem in the last month. Management drags you into long, day-long meetings where each side blames the other; the app is restarted, the problem goes away and the cycle repeats.

With apps increasingly wrapped in Russian-doll layers of abstraction [bare metal => VM => container => Node process], getting to the root cause has become increasingly difficult.

So say you are a Java guy with no experience debugging a production Node.js app. After an appropriate amount of coffee and fruity alcoholic drinks, you land on http://techblog.netflix.com/2015/12/debugging-nodejs-in-production.html

The recommendations from Yunong Xiao in the presentation essentially boil down to:
1. Use node-restify as the REST framework in the Node app to log and observe app performance (a minimal sketch follows this list).
2. Use Linux perf events to statistically sample stack traces and construct CPU flame graphs. This technique can reliably find the proverbial needle in the haystack.
3. Take core dumps to capture the app state for post-mortem debugging with the mdb debugger. Unfortunately mdb only runs on Solaris, so you have to transfer the core dump to a debug Solaris VM instance [1] [2].
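
To make recommendation 1 concrete, here is a minimal restify sketch with a bunyan logger (the service name, the route and the restify 4.x-era auditLogger API are illustrative assumptions, not taken from the talk). Every completed request is audit-logged, so latency spikes and errors show up in cf logs without attaching a debugger:

var restify = require('restify');
var bunyan = require('bunyan');

var log = bunyan.createLogger({ name: 'word-finder' });

// Passing the logger to the server makes req.log available in handlers.
var server = restify.createServer({ name: 'word-finder', log: log });

// Audit-log every request after it completes (status, latency, route).
server.on('after', restify.auditLogger({ log: log }));

server.get('/words/:prefix', function (req, res, next) {
  req.log.info({ prefix: req.params.prefix }, 'looking up words');
  res.send({ prefix: req.params.prefix, words: [] });
  return next();
});

// Cloud Foundry injects the port to bind to via the PORT env var.
server.listen(process.env.PORT || 8080, function () {
  log.info('%s listening at %s', server.name, server.url);
});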

Unfortunately, neither option 1 nor option 2 is possible: option 1 because the app-dev team would need to get involved, and they don't look kindly on a DevOps guy advising a REST framework; option 2 because CF does not bake perf events into the OS stemcell. Yikes!!! Looks like good old gcore is the only way to go ...

OK, so with the introduction of Diego, finding your way into the container should be as easy as doing a cf ssh and then issuing the gcore command, right? NOPE, wrong. Once ssh'ed into the container there is no way to elevate privilege and attach gdb to your Node.js process. You will need to find another way to elevate privilege and take a core dump.

That other way is BOSH and Veritas. The gist of the method is that you bosh ssh into the correct Diego cell, identified by first examining the app metadata. On the Diego cell, install veritas, the Garden explorer tool, and use it to map from the app instance to the container instance. Then ./wsh into the Garden container. You can now elevate your privilege with sudo su - and run gdb/gcore against the node.js server process.

Once the dump has been generated, sftp it to a debug Solaris instance and run diagnostics with mdb commands like ::jsstack; see this video from Joyent that explains how to walk the dump.

Part 2 of this blog goes into the gory details of discovering the container and generating the core dump.