One of the key tenets of operational readiness is to be prepared for every emergency. The best way to institutionalize this discipline is to repeatedly create chaos in your own production deployment and monitor how the system recovers. Below is a list of tools from the PCF Solutions team at Pivotal and others for causing chaos at every level of the Cloud Foundry stack.
Tools, Presentations & Repos:
https://github.com/xchapter7x/chaospeddler
https://github.com/xchapter7x/cf-app-attack
https://github.com/strepsirrhini-army/chaos-lemur
https://github.com/FidelityInternational/chaos-galago
https://github.com/skibum55/chaos-as-a-service
Monkeys & Lemurs and Locusts Oh My - Anti-Fragile Platforms
1. BOSH
* bosh target (director IP)
* bosh login (director username/password obtained from Ops Manager)
* bosh download manifest cf-(hash) ~/cf.yml
* bosh deployment ~/cf.yml
* bosh vms/cck
* bosh ssh
* bosh logs
* bosh debug (gives you the job/task logs)
2. VM Recovery
* Terminate a VM by deleting it in vSphere, watch it come back up
3. App Recovery
* Terminate an app by using a cf plugin, watch it come back up
4. Correlate logs?
* Watch logs for the steps above
5. Chaos Monkeys
* Execute Chaos Lemur and watch bosh/cf respond
6. Director
* Shut the VM down or delete it in vCenter
* When it's down, which apps still run?
* Once the VM is gone, how do you get it back or rebuild it?
7. Network switch
8. Hypervisor
9. Credentials that expire:
* Certs that have an expiration date
* System accounts (internal CF system accounts)
* vCenter API account that CF uses
10. Log Insight goes down
11. Kill container
12. Kill VM
13. Kill DEA
14. Kill Router
15. Kill Health Manager
16. Kill Binary Repository
* Then scale
17. Over-allocate hardware (how do we do it?)
18. Execute and back out a change to CF
19. Buildpack upgrade and rollback
20. Verify the right apps have the right buildpack
21. Licensing server scenario (for example, can't connect)
22. Duplicate single components (for example, two BOSH directors)
23. Kill internal message bus
24. DNS
25. Clock drift
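
For the expiring-credentials scenario (item 9), it helps to know ahead of time which certs are close to expiry. Below is a minimal sketch, not something from the tools listed above: the function name `cert_expires_within` and the file path in the usage comment are hypothetical, and it assumes the `openssl` command line tool is installed and the cert is a PEM file on disk.

```shell
# cert_expires_within CERT_FILE DAYS   (hypothetical helper, for illustration)
# Succeeds (exit 0) if the PEM certificate in CERT_FILE expires within DAYS days.
# Note: an unreadable or invalid file is also reported as "expiring".
cert_expires_within() {
  cert_file=$1
  days=$2
  seconds=$((days * 86400))
  # `openssl x509 -checkend N` exits 0 when the cert is still valid N seconds
  # from now, and non-zero when it will have expired by then.
  if openssl x509 -checkend "$seconds" -noout -in "$cert_file" >/dev/null 2>&1; then
    return 1  # still valid after the window
  fi
  return 0    # expires (or has already expired) within the window
}

# Example usage (the path is an assumption):
#   if cert_expires_within /path/to/router.pem 30; then
#     echo "router cert expires within 30 days"
#   fi
```

Running something like this on a schedule turns "certs that have an expiration date" from a surprise outage into a planned drill.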
bosh tasks --no-filter in a loop to watch the resurrector bring them up.
bosh ssh and sudo kill -9 -1 are also fun.
bosh ssh'd into a DEA and killed a container.
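
The "bosh tasks --no-filter in a loop" idea can be sketched as a small polling helper. This is only an illustration: the function name `poll`, the run count, and the interval are assumptions, and the usage line assumes the bosh CLI is already installed and targeted at a director.

```shell
# poll COUNT INTERVAL CMD [ARGS...]   (hypothetical helper, for illustration)
# Runs CMD COUNT times, sleeping INTERVAL seconds between runs, and prints a
# separator before each run so successive outputs are easy to tell apart.
poll() {
  count=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$count" ]; do
    echo "--- run $i of $count ---"
    # Keep polling even if one invocation fails (e.g. director briefly down).
    "$@" || echo "command exited with status $?" >&2
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}

# Example usage (assumes a targeted bosh CLI):
#   poll 30 10 bosh tasks --no-filter
```

Kick this off in one terminal, kill a VM or a process in another, and watch the resurrector's tasks appear.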