SREs run services (sets of related systems, operated for users, who may be internal or external) and are ultimately responsible for the health of these services.
We can characterize the health of a service in much the same way that Abraham Maslow categorized human needs: from the most basic requirements needed for a system to function as a service at all, to the higher levels of function.
This section addresses the theory and practice of an SRE's day-to-day activity: building and operating large distributed computing systems.
This pyramid diagram represents a structured approach to ensuring system reliability.
It builds from foundational practices at the bottom to higher-level goals at the top, emphasizing the importance of strong fundamentals before addressing complex issues.
Here's a detailed breakdown:

1. Monitoring (Base Level)
- Practices: Practical alerting from time-series data.
- Purpose: Ensures systems are monitored effectively for anomalies, providing visibility to detect and respond to issues.
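As a rough illustration of alerting from time-series data, the sketch below fires only when a metric stays above a threshold for a sustained period, which avoids paging on a single noisy sample. The `Sample` type, the `should_alert` function, and the example numbers are hypothetical and not taken from any particular monitoring system; samples are assumed to be sorted by timestamp.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    timestamp: float  # seconds since some epoch (assumed monotonic)
    value: float      # e.g. an error ratio in [0, 1]


def should_alert(samples: Sequence[Sample], threshold: float, for_seconds: float) -> bool:
    """Return True if the metric exceeded `threshold` for the last `for_seconds`.

    Assumes `samples` is sorted by timestamp, oldest first (hypothetical input shape).
    """
    if not samples:
        return False
    latest = samples[-1].timestamp
    # Without enough history to cover the whole window, stay quiet.
    if latest - samples[0].timestamp < for_seconds:
        return False
    window = [s for s in samples if s.timestamp >= latest - for_seconds]
    # Fire only if every sample in the trailing window breaches the threshold.
    return all(s.value > threshold for s in window)


# Illustrative data: one sample per minute, error ratio climbing over time.
history = [
    Sample(0, 0.01), Sample(60, 0.01), Sample(120, 0.06),
    Sample(180, 0.07), Sample(240, 0.08), Sample(300, 0.09),
]
```

With this data, `should_alert(history, threshold=0.05, for_seconds=120)` is true because the last two minutes are all above 5%, while a four-minute window is not, since it still includes a healthy sample.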
2. Incident Response
- Practices:
  - Being on-call.
  - Effective troubleshooting.
  - Emergency response.
  - Managing incidents.
- Purpose: Focuses on real-time resolution of issues to maintain uptime and system stability.
3. Postmortem / Root Cause Analysis
- Practices:
  - Postmortem culture: Learning from failure.
  - Tracking outages.
- Purpose: Identifies and learns from root causes of incidents, ensuring continuous improvement.
4. Testing
- Practices: Testing for reliability.
- Purpose: Ensures systems are rigorously tested under real-world conditions to identify potential failure points.
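One common way to test for reliability is fault injection: replace a dependency with a test double that fails on purpose and check that the caller copes. The sketch below is a minimal, hypothetical example; `FlakyDependency` and `call_with_retries` are illustrative names, and a real retry loop would normally add backoff between attempts, which is omitted here to keep the example short.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class FlakyDependency:
    """Test double that raises ConnectionError for the first `failures` calls."""

    def __init__(self, failures: int) -> None:
        self.remaining_failures = failures
        self.calls = 0

    def fetch(self) -> str:
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected fault")
        return "payload"


def call_with_retries(fn: Callable[[], T], max_attempts: int) -> T:
    """Call `fn`, retrying on ConnectionError up to `max_attempts` times in total."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ConnectionError:
            # Re-raise once the attempt budget is spent.
            if attempt == max_attempts:
                raise
    raise AssertionError("unreachable")
```

A test can then assert both outcomes: a dependency that fails twice succeeds on the third attempt, and one that fails more often than the budget allows surfaces the error instead of retrying forever.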
5. Capacity Planning
- Practices:
  - Load balancing (frontend and datacenter).
  - Handling overload.
  - Addressing cascading failures.
- Purpose: Prepares systems to handle workloads effectively, mitigating overload and cascading failures.
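Handling overload often comes down to rejecting excess work early instead of letting queues grow until everything slows down. The sketch below shows one simple form of load shedding, a cap on in-flight requests. `LoadShedder`, `handle`, and the status codes returned are assumptions made for illustration, not a description of any specific server framework.

```python
import threading
from typing import Callable, Optional, Tuple


class LoadShedder:
    """Caps the number of requests processed concurrently."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Reserve a slot, or return False immediately if the server is full."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Free a slot previously reserved with try_acquire."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(shedder: LoadShedder, work: Callable[[], str]) -> Tuple[int, Optional[str]]:
    """Run `work` if capacity allows; otherwise shed the request with a 503."""
    if not shedder.try_acquire():
        return 503, None
    try:
        return 200, work()
    finally:
        # Always free the slot, even if `work` raises.
        shedder.release()
```

Failing fast like this keeps latency bounded for the requests that are accepted, and gives upstream load balancers a clear signal to send traffic elsewhere.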
6. Development
- Practices:
  - Managing critical state (distributed consensus).
  - Data integrity.
  - Distributed periodic scheduling.
  - Data processing pipelines.
- Purpose: Focuses on designing robust systems during development, ensuring reliability and integrity.
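One small building block of data integrity is detecting silent corruption by storing a checksum next to the data and verifying it on read. The sketch below uses SHA-256 from Python's standard library; the `seal` and `verify` helpers are hypothetical names chosen for this example, and it covers only accidental corruption, not deliberate tampering, which would need a keyed construction such as an HMAC.

```python
import hashlib
import hmac
from typing import Tuple


def seal(payload: bytes) -> Tuple[bytes, str]:
    """Return the payload together with its SHA-256 digest, to be stored side by side."""
    return payload, hashlib.sha256(payload).hexdigest()


def verify(payload: bytes, expected_digest: str) -> bool:
    """Recompute the digest on read and compare it with the stored one."""
    actual = hashlib.sha256(payload).hexdigest()
    return hmac.compare_digest(actual, expected_digest)


# Illustrative record; the content is made up for the example.
stored_payload, stored_digest = seal(b"ledger row 42")
```

On read, `verify(stored_payload, stored_digest)` succeeds, while any change to the bytes makes the check fail, so the caller can fall back to a replica or a backup.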
7. Product (Top Level)
- Practices: Reliable product launches at scale.
- Purpose: Delivers reliable products to users, leveraging all foundational practices for smooth launches and sustained reliability.