Site Reliability Engineering (SRE)

Production Execution Way

Hierarchy of Reliability

SREs run services—a set of related systems, operated for users, who may be internal or external—and are ultimately responsible for the health of services.

We can characterize the health of a service in much the same way that Abraham Maslow categorized human needs: from the most basic requirements needed for a system to function as a service at all, up to the higher levels of function.

This section addresses the theory and practice of an SRE’s day-to-day activity: building and operating large distributed computing systems.

This pyramid diagram represents a structured approach to ensuring system reliability.

It builds from foundational practices at the bottom to higher-level goals at the top, emphasizing the importance of strong fundamentals before addressing complex issues.

Here's a detailed breakdown:

[Figure: pyramid diagram of the service reliability hierarchy]

1. Monitoring (Base Level)

  1. Practices: Practical alerting from time-series data.
  2. Purpose: Provides visibility into system behavior so that anomalies can be detected and responded to.
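
As a rough illustration of alerting from time-series data, the sketch below fires only when the average of the most recent samples crosses a threshold, so a single noisy data point does not page anyone. The function name `should_alert`, the window size, the threshold, and the sample error rates are all assumptions made up for this example; real monitoring systems express such rules in their own query languages.

```python
from statistics import mean


def should_alert(samples, threshold, window=5):
    """Return True if the mean of the last `window` samples exceeds `threshold`.

    Hypothetical helper for illustration only; not taken from any monitoring tool.
    """
    recent = list(samples)[-window:]
    if len(recent) < window:
        # Not enough data yet to make a judgment.
        return False
    return mean(recent) > threshold


# Made-up per-minute error rates (fraction of failed requests).
error_rates = [0.01, 0.02, 0.01, 0.08, 0.09, 0.11, 0.10, 0.12]

if should_alert(error_rates, threshold=0.05):
    print("ALERT: sustained error rate above 5%")
```

Averaging over a window trades detection speed for fewer false alarms; a shorter window reacts faster but is noisier.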

🔧 2. Incident Response

  1. Practices:
    1. Being on-call.
    2. Effective troubleshooting.
    3. Emergency response.
    4. Managing incidents.
  2. Purpose: Focuses on real-time resolution of issues to maintain uptime and system stability.

📋 3. Postmortem / Root Cause Analysis

  1. Practices:
    1. Postmortem culture: Learning from failure.
    2. Tracking outages.
  2. Purpose: Identifies and learns from root causes of incidents, ensuring continuous improvement.
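
One way to make outage tracking concrete is to record each outage as a start and end time and derive summary figures from those records. The sketch below computes availability and mean time to repair for a reporting period. The outage timestamps, the `summarize` helper, and the one-month period are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical outage log: (start, end) pairs for one month.
outages = [
    (datetime(2025, 1, 3, 10, 0), datetime(2025, 1, 3, 10, 30)),
    (datetime(2025, 1, 17, 22, 15), datetime(2025, 1, 17, 23, 0)),
]


def summarize(outages, period_start, period_end):
    """Return (availability, mean time to repair) for the given period."""
    downtime = sum((end - start for start, end in outages), timedelta())
    period = period_end - period_start
    availability = 1 - downtime / period  # timedelta / timedelta gives a float
    mttr = downtime / len(outages) if outages else timedelta()
    return availability, mttr


availability, mttr = summarize(
    outages, datetime(2025, 1, 1), datetime(2025, 2, 1)
)
print(f"availability: {availability:.4%}, MTTR: {mttr}")
```

Figures like these only summarize impact; the postmortem itself still has to explain why each outage happened.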

🧪 4. Testing

  1. Practices: Testing for reliability.
  2. Purpose: Ensures systems are rigorously tested under real-world conditions to identify potential failure points.

⚖️ 5. Capacity Planning

  1. Practices:
    1. Load balancing (frontend and datacenter).
    2. Handling overload.
    3. Addressing cascading failures.
  2. Purpose: Prepares systems to handle workloads effectively, mitigating overload and cascading failures.
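
A common way to handle overload is load shedding: once a server is at capacity, it rejects new requests quickly instead of queueing them until everything slows down. The sketch below shows a minimal version under assumed names; `LoadShedder`, `try_acquire`, `release`, and the limit of 2 are hypothetical and exist only for this example.

```python
import threading


class LoadShedder:
    """Rejects work once `max_in_flight` requests are being served.

    Illustrative sketch only; real servers usually combine this with
    client-side backoff and per-client quotas.
    """

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Return True if the request may proceed, False if it should be shed."""
        with self._lock:
            if self.in_flight >= self.max_in_flight:
                return False
            self.in_flight += 1
            return True

    def release(self):
        """Mark one request as finished."""
        with self._lock:
            if self.in_flight > 0:
                self.in_flight -= 1


shedder = LoadShedder(max_in_flight=2)
results = [shedder.try_acquire() for _ in range(3)]
print(results)  # the third request is rejected while two are in flight
```

Failing fast keeps an overloaded server responsive for the requests it does accept, which helps stop one saturated backend from dragging down its callers.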

🛠️ 6. Development

  1. Practices:
    1. Managing critical state (distributed consensus).
    2. Data integrity.
    3. Distributed periodic scheduling.
    4. Data processing pipelines.
  2. Purpose: Focuses on designing robust systems during development, ensuring reliability and integrity.

🚀 7. Product (Top Level)

  1. Practices: Reliable product launches at scale.
  2. Purpose: Delivers reliable products to users, leveraging all foundational practices for smooth launches and sustained reliability.