About Me

My photo
Rohit is an investor, startup advisor and an Application Modernization Scale Specialist working at Google.

Thursday, June 9, 2016

Top 10 KPIs Cloud Foundry

The cloud foundry firehose and syslog streams generate tons of metrics and logs. What should a busy devops engineer look at ? My colleague Ford Donald put this awesome list of the top 10 KPIs of Cloud Foundry.

+----------------------------------+---------+--------------------+
| KPI                              | Description
+----------------------------------+---------+--------------------+
| rep.capacityRemainingMemory      | Available cell memory (sum for all cells)
| rep.capacityRemainingDisk        | Available cell disk (sum for all cells)
| bbs.LRPsRunning                  | Equivalent to # apps running
| bbs.RequestLatency               | CF of API is slow if this rises
| bbs.ConvergingLRPduration        | Apps or staging crashing if this rises
| stager.StagingRequestFailed      | Indicates bad app or buildpack pushes
| auctioneer.LRPauctionsFailed     | CF can’t place an app (out of container space)
| router.totalRoutes               | Size in routes of a PCF install (indicates uptake)
| router.requests                  | Tracks traffic flow thru a PCF
| bosh.healthmonitor.system.healthy| Indicates BOSH/VM health
+----------------------------------+---------+--------------------+

New PCF v1.10 "Monitoring PCF" docs section went live!  - https://docs.pivotal.io/pivotalcf/1-10/monitoring/index.html




Replatforming .NET applications to Cloud Foundry - Migrating .NET apps to Cloud Foundry


A lot of developers do not realize that Cloud Foundry can run a process from any runtime including .NET. Windows applications are managed side-by-side with Linux applications. .NET apps get support for key PCF features like scaling including autoscaling, monitoring, and high availability, logs and metrics, notifications, services including user-provided services and PCF set environment variables. BOSH and OpsManager now supports deployments on Microsoft Azure via pre-built ARM templates.This blog post provides details of the Windows support and provides guidance on migrating .NET apps to Cloud Foundry . This post build extensively on the work done by the Pivotal product engineering and the solutions team particularly Kevin Hoffman, Chris Umbel Zach Brown, Sam Cooper and Aaron Fortener.

There are two ways to run ASP.NET  apps on Cloud Foundry

1. Windows 2012 R2 Stack
 • Support Introduced in Nov. 2015
 • Supports.NET 3.5-4.5+
 • Requires Diego-Windows, Garden-Windows
 • BOSH-Windows support currently underway
 • App is pre-compiled, no detect. Leverage binary_buildpack  to push app
 • cf push mydotNetApp -s windows2012R2 -b binary_buildpack
 • Runs on Garden-Windows container
 • Win 2k12 provides process isolation primitives similar to those found in the Linux kernel
 • IronFrame abstracts these primitives into an API utilized byGarden-Windows
 • Win 2k12’s low-level API Linux kernel’s API
 • Therefore Garden-Windows process isolation functionality ≈ Garden-Linux
 • Developer OS is Windows

2. Linux cflinuxfs2 Stack
 • ASP.NET Core CLR only
 • Runs on Diego or DEA
 • Standard CF Ubuntu Stack
 • Community ASP.NET 5 Buildpack
 • cf push mydotNetApp -s cflinuxfs2 -b <asp.net5 bp>
 • Garden-Linux is the backing container
 • Garden-Linux containers are based on LXC (Linux cgroups)
 • Developer OS is Win, Mac OSX and Linux

ASP.NET Core vs ASP.NET 4.6


 ASP.NET Core
  • Web API
  • ASP.NET MVC
  • No Web Forms (yet)
  • No SignalR
ASP.NET 4.6
  • Battle-tested, hardened
  • Many years of testing and improvement
  • MVC, Web Forms, SignalR, etc.

Migrating Existing Applications

• For Greenfield Applications Try “Core First” approach
• For Replatforming and migrating Existing Apps Probably ASP.NET 4.x
Dependencies and framework requirements will likely dictate choice
• All versions of the .NET SDK supported by Windows Server 2012 R2 are supported on PCF including 3.5, 4.5, 4.5.1 and 4.6

Here are some of the details on how Garden-Windows secures applications


Non Cloud-Native stuff you will want to change

 • Reading/Writing to the Registry.
•  Super old versions of ASP.NET. Upgrade to .NET core or ASP.NET 4.x
 • Reading/Writing to the local disk. File systems are ephemeral. Replace usage of the file system e.g. disk logging even temp files with a S3 compatible blob store or external file store.
 • Does the app use integrated Windows Authentication? Replace  Integrated Windows Auth. with ADFS/OAuth2 to leverage UAA and Pivotal SSO Tiles
 • In-Process Session State / Sticky Sessions. Replace InProc, StateServer with out-of-process data store e.g. Redis, SQL Server
 • Externalize  environment-specific config in web.config into VCAP environment variables. If value changes between environments, don’t get it from web.config. Don’t use externalization methods that only work in Azure. If app relies on machine.config or applicationHost.config, could have issues in PCF.
 • For MSI-Installed services or drivers, Bin deploy dependencies with app.
 • Nearly every app will need refactoring to use Console logging and tracing. Customers who mess with applicationHost.config can interfere with tracing, logging, etc.
 • 32 bit ASP-Net builds/References/Builds/apps depending on 32-bit libraries will not work. Use of 32-bit dependencies or when the app is itself a 32-bit app this is a red flag.
 • Does the app run as a service or a WCF Self-hosted services as opposed to IIS-hosted ? These services will need to be self-hosted in IIS as part of the replatforming to PCF.
 • Apps requiring assemblies in Global Assembly Cache won't work in CF.
 • Backing Service Drivers requiring native installers / DLLs may not work in CF.
 • On Demand Apps and ones triggered by job controller will need to be refactored to run in CF to use CF one-off task support coming in the CC v3 API.
• MSMQ and tight integration with other MSFT server products via on-server dependencies will need to be refactored to use external backing services. .NET Core is now supported in the RabbitMQ .NET client.
• Routing to non-HTTP WCF endpoints and listeners for third party middleware will not work since only HTTP based inbound traffic is routed to the windows containers.
• Reliance on Microsoft-specific infrastructure? (e.g. MSMQ, BizTalk, SharePoint, etc). These apps do NOT make good candidates to port to PCF.

.NET Application Selection Criteria


References

Wednesday, June 8, 2016

Creating Chaos in Cloud Foundry

One of the key tenets of operational readiness is to be prepared for every emergency. The best way to  institutionalize this discipline is by repeatedly creating chaos in your own production deployment and monitor the system recovery. The list below is a listing a tools from the PCF Solutions team @ pivotal and others to cause chaos at all levels in the stack in Cloud Foundry.

Tools, Presentations & Repos:
https://github.com/xchapter7x/chaospeddler
https://github.com/xchapter7x/cf-app-attack
https://github.com/strepsirrhini-army/chaos-lemur
https://github.com/FidelityInternational/chaos-galago
https://github.com/skibum55/chaos-as-a-service
Monkeys & Lemurs and Locusts Oh My - Anti-Fragile Platforms

Type of test/event/task



1. BOSH
* bosh target (director ip)
* bosh login (director username/password obtained from Ops Man)
* bosh download manifest cf-(hash) ~/cf.yml
* bosh deployment ~/cf.yml
* bosh vms/cck
* bosh ssh
* bosh logs
* bosh debug (gives you the job/task logs)
2. VM Recovery
* Terminate a VM by deleting it in vSphere, watch it come back up
3. App Recovery
* Terminate an app by using cf plugin, watch it come back up.
4. Correlate logs?
* Watch logs for steps above
5. Chaos Monkeys
* Execute Chaos Lemur and watch bosh/cf respond
6. Director
* Shut VM down/delete in vCenter
* When its down, what app still runs?
* Once VM is gone, how do you get it back/rebuild?
7. Network switch
8. Hypervisor
9. Credentials that expire:
* Certs that have expiration date
* System Accounts (internal CF system accounts)
* vCenter API Account that CF uses
10. Log Insight goes down
11. Kill container
12. Kill VM
13. Kill DEA
14. Kill Router
15. Kill Health Manager
16. Kill Binary Repository
* Then scale
17. Over-allocate Hardware (how do we do it?)
18. Execute and backout a change to CF
19. Bulid Pack Upgrade and Roll Back
20. Right Apps have right build pack
21. Licensing server scenario (for example, can't connect)
22. Double single components (for example, 2 BOSH's)
23. Kill internal message bus
24. DNS
25. Clock drift



Chaos Testing Procedure: 
Kil vms from vsphere; used bosh tasks —no-filter in a loop to watch resurrector bring them up
bosh ssh and sudo kill -9 -1 are also fun
bosh ssh’d into a dea and killed a container