All Things Cloud: Creating Chaos in Cloud Foundry

Wednesday, June 8, 2016

Creating Chaos in Cloud Foundry

One of the key tenets of operational readiness is to be prepared for every emergency. The best way to institutionalize this discipline is to repeatedly create chaos in your own production deployment and monitor how the system recovers. Below is a list of tools from the PCF Solutions team at Pivotal and others for causing chaos at every level of the Cloud Foundry stack.

Tools, Presentations & Repos:
https://github.com/xchapter7x/chaospeddler
https://github.com/xchapter7x/cf-app-attack
https://github.com/strepsirrhini-army/chaos-lemur

https://github.com/FidelityInternational/chaos-galago
https://github.com/skibum55/chaos-as-a-service
Monkeys & Lemurs and Locusts Oh My - Anti-Fragile Platforms

Types of tests, events, and tasks:

1. BOSH

* bosh target (director ip)

* bosh login (director username/password obtained from Ops Man)

* bosh download manifest cf-(hash) ~/cf.yml

* bosh deployment ~/cf.yml

* bosh vms/cck

* bosh ssh

* bosh logs

* bosh debug (gives you the job/task logs)
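The first few commands above can be strung together into a short session. What follows is only a sketch, assuming the v1 (Ruby) bosh CLI that the list uses. The run helper, the DRY_RUN switch, and the bosh_session function are hypothetical names introduced for illustration; the director IP and deployment name are arguments you supply. bosh ssh, bosh logs, and the debug step are left out because they depend on which job you want to inspect.

```shell
# Sketch only: assumes the v1 (Ruby) bosh CLI used in the list above.
# "run", DRY_RUN and "bosh_session" are hypothetical helpers, not part of bosh.

# Print each command; execute it only when DRY_RUN is not set to 1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Usage: bosh_session DIRECTOR_IP DEPLOYMENT_NAME [MANIFEST_PATH]
bosh_session() {
  director_ip=$1
  deployment=$2                  # the cf-(hash) name of your deployment
  manifest=${3:-$HOME/cf.yml}

  run bosh target "$director_ip"
  run bosh login                 # prompts for the director credentials
  run bosh download manifest "$deployment" "$manifest"
  run bosh deployment "$manifest"
  run bosh vms                   # list VMs and their state
  run bosh cck                   # scan the deployment for problems
}
```

Setting DRY_RUN=1 before calling bosh_session prints the commands without running them, which is a cheap way to check the sequence before pointing it at a real director.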

2. VM Recovery

* Terminate a VM by deleting it in vSphere, watch it come back up

3. App Recovery

* Terminate an app instance using a cf CLI plugin, then watch it come back up.
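One way to do the watching is to poll cf app until every instance reports as running again. The sketch below assumes that cf app prints a line of the form "instances: running/total"; check this against your cf CLI version. The function names are hypothetical, and the kill itself is left to whichever plugin you use.

```shell
# Sketch only: assumes "cf app NAME" prints a line like "instances: 2/2".
# All function names here are hypothetical helpers.

# Extract "running/total" from cf app output read on stdin.
instances_from_cf_app() {
  awk '$1 == "instances:" { print $2; exit }'
}

# Succeed only when the pair looks like N/N with N greater than zero.
all_instances_running() {
  pair=$1
  case $pair in
    */*) ;;
    *) return 1 ;;
  esac
  running=${pair%/*}
  total=${pair#*/}
  [ -n "$running" ] && [ "$running" = "$total" ] && [ "$total" != "0" ]
}

# Usage: wait_for_app APP [MAX_TRIES] [INTERVAL_SECONDS]
# Returns 0 once all instances are running, 1 if the tries run out.
wait_for_app() {
  app=$1
  max_tries=${2:-30}
  interval=${3:-5}
  try=1
  while [ "$try" -le "$max_tries" ]; do
    counts=$(cf app "$app" | instances_from_cf_app)
    echo "$(date '+%H:%M:%S') $app instances: ${counts:-unknown}"
    if all_instances_running "$counts"; then
      return 0
    fi
    try=$((try + 1))
    sleep "$interval"
  done
  return 1
}
```

Start wait_for_app in one terminal, terminate an instance from another, and the timestamps give a rough recovery time.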

4. Correlate logs?

* Watch logs for steps above

5. Chaos Monkeys

* Execute Chaos Lemur and watch bosh/cf respond

6. Director

* Shut down or delete the Director VM in vCenter

* When it's down, which apps still run?

* Once the VM is gone, how do you get it back or rebuild it?

7. Network switch

8. Hypervisor

9. Credentials that expire:

* Certs that have expiration date

* System Accounts (internal CF system accounts)

* vCenter API Account that CF uses
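For the certificate case, it helps to know which certs are close to expiring before you rehearse the failure. The sketch below uses openssl x509 with -checkend, which exits non-zero if the certificate expires within the given number of seconds. The function names and the 30-day default are assumptions for illustration, and it only covers PEM files you already have on disk; where your installation keeps its certs is not covered here.

```shell
# Sketch only: function names and the 30-day default are illustrative.

# Succeed if the PEM certificate in $1 is still valid $2 days from now.
cert_valid_for_days() {
  cert_file=$1
  days=${2:-30}
  openssl x509 -in "$cert_file" -noout -checkend $((days * 86400)) >/dev/null
}

# Usage: check_certs DAYS CERT_FILE...
# Prints one line per certificate; returns non-zero if any is close to expiry
# (or cannot be read).
check_certs() {
  days=$1
  shift
  status=0
  for cert in "$@"; do
    if cert_valid_for_days "$cert" "$days"; then
      echo "OK       $cert"
    else
      echo "EXPIRING $cert ($(openssl x509 -in "$cert" -noout -enddate 2>/dev/null))"
      status=1
    fi
  done
  return "$status"
}
```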

10. Log Insight goes down

11. Kill container

12. Kill VM

13. Kill DEA

14. Kill Router

15. Kill Health Manager

16. Kill Binary Repository

* Then scale

17. Over-allocate Hardware (how do we do it?)

18. Execute and backout a change to CF

19. Buildpack upgrade and rollback

20. Verify that the right apps have the right buildpack

21. Licensing server scenario (for example, can't connect)

22. Duplicate singleton components (for example, two BOSH directors)

23. Kill internal message bus

24. DNS

25. Clock drift

Chaos Testing Procedure:

Killed VMs from vSphere; ran bosh tasks --no-filter in a loop to watch the resurrector bring them back up.
bosh ssh followed by sudo kill -9 -1 is also fun.
Used bosh ssh to get into a DEA and killed a container.
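The loop around bosh tasks --no-filter can be as simple as the sketch below. The function name, the interval, and the optional round count are assumptions added for illustration; only the bosh command itself comes from the procedure above.

```shell
# Sketch only: "watch_bosh_tasks" is a hypothetical helper around the
# bosh tasks --no-filter command mentioned in the procedure.

# Usage: watch_bosh_tasks [INTERVAL_SECONDS] [ROUNDS]
# ROUNDS of 0 (the default) means loop until interrupted with Ctrl-C.
watch_bosh_tasks() {
  interval=${1:-10}
  rounds=${2:-0}
  n=0
  while [ "$rounds" -eq 0 ] || [ "$n" -lt "$rounds" ]; do
    echo "=== $(date '+%H:%M:%S') ==="
    bosh tasks --no-filter     # include resurrector tasks in the listing
    n=$((n + 1))
    sleep "$interval"
  done
}
```

Run it in one terminal, kill a VM in vSphere from another, and the resurrector's tasks should show up in the listing as the VM is recreated.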
