Site Reliability Engineering book notes
The SRE Book
What follows are my notes on the SRE book published by Google.
Troubleshooting
Problem Reports
- automated where possible, including helpful data for troubleshooting in the body of the alert
- submit tickets for everything; you want a paper trail
Triage
- "make the system work as well as it can under the circumstances"
Examine
- monitoring
- logging
- distributed tracing
Logging
- be able to change log levels on the fly
- be able to do statistical sampling
- be able to turn on logging quickly, easily, and selectively
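A minimal sketch of the three logging points above, using Python's standard `logging` module. The `SamplingFilter` class, the `set_level` helper, and the logger names are my own illustrations, not something from the book; a real service would expose the level change through whatever admin interface it already has.

```python
import logging
import random


class SamplingFilter(logging.Filter):
    """Pass each record below `always_level` with probability `rate`."""

    def __init__(self, rate, always_level=logging.WARNING, rng=None):
        super().__init__()
        self.rate = rate
        self.always_level = always_level
        self.rng = rng or random.Random()

    def filter(self, record):
        # Warnings and above are never sampled away.
        if record.levelno >= self.always_level:
            return True
        # Statistical sampling: keep roughly `rate` of the remaining records.
        return self.rng.random() < self.rate


def set_level(logger_name, level_name):
    """Change one logger's level at runtime, leaving other loggers untouched."""
    level = getattr(logging, level_name.upper())
    logging.getLogger(logger_name).setLevel(level)
```

Attaching `SamplingFilter(rate=0.01)` to a handler with `handler.addFilter(...)` keeps about 1% of debug and info records, and calling `set_level("payments", "debug")` (a hypothetical logger name) turns verbose logging on for that one component only.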
Diagnose
- inject test data into each component in a misbehaving system to confirm normal operation
- inject test data meant to expose particular types of suspected errors
- "what, where, and why": figure out what the system is actually doing, not just what it's supposed to be doing
  - what it's doing
  - where its resources are being used
  - why it's doing what it's doing
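One way to picture the "inject test data into each component" idea is the sketch below. It assumes each component can be called in isolation with a known input; `probe_components` and the mappings it takes are hypothetical names of mine, standing in for whatever test hooks the real system offers.

```python
def probe_components(components, probes):
    """Feed known input to each component and compare against expected output.

    components: mapping of name -> callable (stand-ins for real stages)
    probes: mapping of name -> (test_input, expected_output)
    Returns (name, finding) pairs for components that misbehaved.
    """
    suspects = []
    for name, component in components.items():
        test_input, expected = probes[name]
        try:
            actual = component(test_input)
        except Exception as exc:
            # A crash on known-good input is also a finding.
            suspects.append((name, repr(exc)))
            continue
        if actual != expected:
            suspects.append((name, actual))
    return suspects
```

Components that return the expected output are confirmed as behaving normally, which narrows the search to the ones left in the returned list.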
What Touched It Last
- a system in motion stays in motion until something acts on it
- things to log
  - deployments
  - configuration changes
  - packages installed
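If those changes are logged in one place, "what touched it last" becomes a simple query. A rough sketch, assuming a flat list of change records; the `ChangeEvent` fields and the 24-hour default window are my own choices for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    when: datetime
    kind: str    # "deployment", "config", or "package"
    target: str  # which service or host was touched
    detail: str


def changes_before(events, incident_start, window=timedelta(hours=24)):
    """Return events in the window before the incident, most recent first."""
    earliest = incident_start - window
    recent = [e for e in events if earliest <= e.when <= incident_start]
    return sorted(recent, key=lambda e: e.when, reverse=True)
```

The most recent change lands at the top of the list, which is usually the first thing worth checking.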
Testing
- an ideal test should have mutually exclusive alternatives, in order to rule in/out competing hypotheses
- perform tests in decreasing order of likelihood
- keep in mind the side effects of active testing (e.g., giving a process more CPUs may speed it up but also make data races more likely)
- take clear notes of ideas, tests, and results
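The ordering and note-taking points above can be sketched as a small loop. This is only an illustration under my own assumptions: the `Hypothesis` fields, the idea of a numeric likelihood estimate, and the `run_tests` name are not from the book.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    description: str
    likelihood: float          # rough estimate in [0, 1]; a judgment call
    test: Callable[[], bool]   # returns True if the hypothesis is confirmed


def run_tests(hypotheses, notes):
    """Test hypotheses most-likely first, recording every result in `notes`.

    Returns the first confirmed hypothesis, or None if all were ruled out.
    """
    ordered = sorted(hypotheses, key=lambda h: h.likelihood, reverse=True)
    for hypothesis in ordered:
        confirmed = hypothesis.test()
        # Keep a record of what was tried, including the negatives.
        notes.append((hypothesis.description, confirmed))
        if confirmed:
            return hypothesis
    return None
```

Recording ruled-out hypotheses matters as much as the confirmed one, since the notes are what keep you from repeating a test later.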
Cure
- once you've reduced the possible causes to one, try to prove it's the cause
- produce a postmortem covering:
  - what went wrong
  - how you tracked down the problem
  - how you fixed the problem
  - how to prevent it from happening again