Chapter 1Introduction
The Sysadmin Approach to Service Management
Google’s Approach to Service Management: Site Reliability Engineering
Tenets of SRE
The End of the Beginning
Chapter 2The Production Environment at Google, from the Viewpoint of an SRE
Hardware
System Software That “Organizes” the Hardware
Other System Software
Our Software Infrastructure
Our Development Environment
Shakespeare: A Sample Service
Principles
Chapter 3Embracing Risk
Managing Risk
Measuring Service Risk
Risk Tolerance of Services
Motivation for Error Budgets
Chapter 4Service Level Objectives
Service Level Terminology
Indicators in Practice
Objectives in Practice
Agreements in Practice
Chapter 5Eliminating Toil
Toil Defined
Why Less Toil Is Better
What Qualifies as Engineering?
Is Toil Always Bad?
Conclusion
Chapter 6Monitoring Distributed Systems
Definitions
Why Monitor?
Setting Reasonable Expectations for Monitoring
Symptoms Versus Causes
Black-Box Versus White-Box
The Four Golden Signals
Worrying About Your Tail (or, Instrumentation and Performance)
Choosing an Appropriate Resolution for Measurements
As Simple as Possible, No Simpler
Tying These Principles Together
Monitoring for the Long Term
Conclusion
Chapter 7The Evolution of Automation at Google
The Value of Automation
The Value for Google SRE
The Use Cases for Automation
Automate Yourself Out of a Job: Automate ALL the Things!
Soothing the Pain: Applying Automation to Cluster Turnups
Borg: Birth of the Warehouse-Scale Computer
Reliability Is the Fundamental Feature
Recommendations
Chapter 8Release Engineering
The Role of a Release Engineer
Philosophy
Continuous Build and Deployment
Configuration Management
Conclusions
Chapter 9Simplicity
System Stability Versus Agility
The Virtue of Boring
I Won’t Give Up My Code!
The “Negative Lines of Code” Metric
Minimal APIs
Modularity
Release Simplicity
A Simple Conclusion
Practices
Chapter 10Practical Alerting from Time-Series Data
The Rise of Borgmon
Instrumentation of Applications
Collection of Exported Data
Storage in the Time-Series Arena
Rule Evaluation
Alerting
Sharding the Monitoring Topology
Black-Box Monitoring
Maintaining the Configuration
Ten Years On…
Chapter 11Being On-Call
Introduction
Life of an On-Call Engineer
Balanced On-Call
Feeling Safe
Avoiding Inappropriate Operational Load
Conclusions
Chapter 12Effective Troubleshooting
Theory
In Practice
Negative Results Are Magic
Case Study
Making Troubleshooting Easier
Conclusion
Chapter 13Emergency Response
What to Do When Systems Break
Test-Induced Emergency
Change-Induced Emergency
Process-Induced Emergency
All Problems Have Solutions
Learn from the Past. Don’t Repeat It.
Conclusion
Chapter 14Managing Incidents
Unmanaged Incidents
The Anatomy of an Unmanaged Incident
Elements of Incident Management Process
A Managed Incident
When to Declare an Incident
In Summary
Chapter 15Postmortem Culture: Learning from Failure
Google’s Postmortem Philosophy
Collaborate and Share Knowledge
Introducing a Postmortem Culture
Conclusion and Ongoing Improvements
Chapter 16Tracking Outages
Escalator
Outalator
Chapter 17Testing for Reliability
Types of Software Testing
Creating a Test and Build Environment
Testing at Scale
Conclusion
Chapter 18Software Engineering in SRE
Why Is Software Engineering Within SRE Important?
Auxon Case Study: Project Background and Problem Space
Intent-Based Capacity Planning
Fostering Software Engineering in SRE
Conclusions
Chapter 19Load Balancing at the Frontend
Power Isn’t the Answer
Load Balancing Using DNS
Load Balancing at the Virtual IP Address
Chapter 20Load Balancing in the Datacenter
The Ideal Case
Identifying Bad Tasks: Flow Control and Lame Ducks
Limiting the Connections Pool with Subsetting
Load Balancing Policies
Chapter 21Handling Overload
The Pitfalls of “Queries per Second”
Per-Customer Limits
Client-Side Throttling
Criticality
Utilization Signals
Handling Overload Errors
Load from Connections
Conclusions
Chapter 22Addressing Cascading Failures
Causes of Cascading Failures and Designing to Avoid Them
Preventing Server Overload
Slow Startup and Cold Caching
Triggering Conditions for Cascading Failures
Testing for Cascading Failures
Immediate Steps to Address Cascading Failures
Closing Remarks
Chapter 23Managing Critical State: Distributed Consensus for Reliability
Motivating the Use of Consensus: Distributed Systems Coordination Failure
How Distributed Consensus Works
System Architecture Patterns for Distributed Consensus
Distributed Consensus Performance
Deploying Distributed Consensus-Based Systems
Monitoring Distributed Consensus Systems
Conclusion
Chapter 24Distributed Periodic Scheduling with Cron
Cron
Cron Jobs and Idempotency
Cron at Large Scale
Building Cron at Google
Summary
Chapter 25Data Processing Pipelines
Origin of the Pipeline Design Pattern
Initial Effect of Big Data on the Simple Pipeline Pattern
Challenges with the Periodic Pipeline Pattern
Trouble Caused By Uneven Work Distribution
Drawbacks of Periodic Pipelines in Distributed Environments
Introduction to Google Workflow
Stages of Execution in Workflow
Ensuring Business Continuity
Summary and Concluding Remarks
Chapter 26Data Integrity: What You Read Is What You Wrote
Data Integrity’s Strict Requirements
Google SRE Objectives in Maintaining Data Integrity and Availability
How Google SRE Faces the Challenges of Data Integrity
Case Studies
General Principles of SRE as Applied to Data Integrity
Conclusion
Chapter 27Reliable Product Launches at Scale
Launch Coordination Engineering
Setting Up a Launch Process
Developing a Launch Checklist
Selected Techniques for Reliable Launches
Development of LCE
Conclusion
Management
Chapter 28Accelerating SREs to On-Call and Beyond
You’ve Hired Your Next SRE(s), Now What?
Initial Learning Experiences: The Case for Structure Over Chaos
Creating Stellar Reverse Engineers and Improvisational Thinkers
Five Practices for Aspiring On-Callers
On-Call and Beyond: Rites of Passage, and Practicing Continuing Education
Closing Thoughts
Chapter 29Dealing with Interrupts
Managing Operational Load
Factors in Determining How Interrupts Are Handled
Imperfect Machines
Chapter 30Embedding an SRE to Recover from Operational Overload
Phase 1: Learn the Service and Get Context
Phase 2: Sharing Context
Phase 3: Driving Change
Conclusion
Chapter 31Communication and Collaboration in SRE
Communications: Production Meetings
Collaboration within SRE
Case Study of Collaboration in SRE: Viceroy
Collaboration Outside SRE
Case Study: Migrating DFP to F1
Conclusion
Chapter 32The Evolving SRE Engagement Model
SRE Engagement: What, How, and Why
The PRR Model
The SRE Engagement Model
Production Readiness Reviews: Simple PRR Model
Evolving the Simple PRR Model: Early Engagement
Evolving Services Development: Frameworks and SRE Platform
Conclusion
Conclusions
Chapter 33Lessons Learned from Other Industries
Meet Our Industry Veterans
Preparedness and Disaster Testing
Postmortem Culture
Automating Away Repetitive Work and Operational Overhead
Structured and Rational Decision Making
Conclusions
Chapter 34Conclusion
Appendix Availability Table
Appendix A Collection of Best Practices for Production Services
Fail Sanely
Progressive Rollouts
Define SLOs Like a User
Error Budgets
Monitoring
Postmortems
Capacity Planning
Overloads and Failure
SRE Teams
Appendix Example Incident State Document
Appendix Example Postmortem
Lessons Learned
Timeline
Supporting information:
Appendix Launch Coordination Checklist
Appendix Example Production Meeting Minutes
· · · · · · (
收起)