Site Reliability Engineering (SRE)

Production Execution Way

Hierarchy of Reliability

SREs run services (sets of related systems, operated for users who may be internal or external) and are ultimately responsible for the health of those services.

We can characterize the health of a service in much the same way that Abraham Maslow categorized human needs: from the most basic requirements needed for a system to function as a service at all, up to the higher levels of function.

This section addresses the theory and practice of an SRE's day-to-day activity: building and operating large distributed computing systems.

This pyramid diagram represents a structured approach to ensuring system reliability.

It builds from foundational practices at the bottom to higher-level goals at the top, emphasizing the importance of strong fundamentals before addressing complex issues.

Here's a detailed breakdown:


1. Monitoring (Base Level)

  1. Practices: Practical alerting from time-series data.
  2. Purpose: Ensures systems are monitored effectively for anomalies, providing visibility to detect and respond to issues.
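As a rough illustration of alerting from time-series data, the sketch below fires only when a metric has stayed above a threshold for a sustained period, so a single spike does not page anyone. The names `Sample` and `should_alert`, the error-ratio metric, and the threshold values are all assumptions made for this example; real monitoring systems express the same idea in their own rule languages.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds, ascending across the series
    value: float      # hypothetical metric, e.g. an error ratio in [0, 1]


def should_alert(samples: Sequence[Sample], threshold: float, for_seconds: float) -> bool:
    """Return True if the most recent samples have all exceeded `threshold`
    for at least `for_seconds`. Assumes `samples` is sorted by timestamp."""
    breach_start = None
    # Walk backwards from the newest sample until the breach is interrupted.
    for sample in reversed(samples):
        if sample.value <= threshold:
            break
        breach_start = sample.timestamp
    if breach_start is None:
        return False  # no samples, or the newest sample is healthy
    return samples[-1].timestamp - breach_start >= for_seconds


# Illustrative data: one sample per minute, error ratio rising over time.
series = [Sample(0, 0.01), Sample(60, 0.06), Sample(120, 0.07), Sample(180, 0.08)]
spike = [Sample(0, 0.01), Sample(60, 0.01), Sample(120, 0.01), Sample(180, 0.09)]

print(should_alert(series, threshold=0.05, for_seconds=120))
print(should_alert(spike, threshold=0.05, for_seconds=120))
```

Requiring the condition to hold for a duration trades a little detection latency for fewer noisy alerts, which is usually the right trade for anything that pages a human.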

🔧 2. Incident Response

  1. Practices:
    1. Being on-call.
    2. Effective troubleshooting.
    3. Emergency response.
    4. Managing incidents.
  2. Purpose: Focuses on real-time resolution of issues to maintain uptime and system stability.

📋 3. Postmortem / Root Cause Analysis

  1. Practices:
    1. Postmortem culture: Learning from failure.
    2. Tracking outages.
  2. Purpose: Identifies and learns from root causes of incidents, ensuring continuous improvement.
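Tracking outages is easier to reason about with a small worked example. The sketch below summarizes a list of recorded outages into total downtime, mean time to repair, and availability over a reporting period. The `Outage` record and the `summarize_outages` helper are hypothetical names for this illustration, and the calculation assumes outages do not overlap.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Outage:
    start: float  # seconds since the start of the reporting period
    end: float    # seconds since the start of the reporting period

    @property
    def duration(self) -> float:
        return self.end - self.start


def summarize_outages(outages: Sequence[Outage], period_seconds: float) -> Dict[str, float]:
    """Summarize non-overlapping outages over a reporting period."""
    downtime = sum(o.duration for o in outages)
    count = len(outages)
    return {
        "count": count,
        "downtime_seconds": downtime,
        # Mean time to repair: average outage length (0 when there were none).
        "mttr_seconds": downtime / count if count else 0.0,
        # Fraction of the period during which the service was up.
        "availability": 1.0 - downtime / period_seconds,
    }


# Illustrative data: two outages (10 minutes and 5 minutes) in a 25-hour period.
summary = summarize_outages([Outage(0, 600), Outage(1000, 1300)], period_seconds=90_000)
print(summary)
```

Keeping even a simple record like this over time makes it possible to see whether postmortem action items are actually reducing the frequency or length of outages.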

🧪 4. Testing

  1. Practices: Testing for reliability.
  2. Purpose: Ensures systems are rigorously tested under real-world conditions to identify potential failure points.

⚖️ 5. Capacity Planning

  1. Practices:
    1. Load balancing (frontend and datacenter).
    2. Handling overload.
    3. Addressing cascading failures.
  2. Purpose: Prepares systems to handle workloads effectively, mitigating overload and cascading failures.
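One common way to handle overload is load shedding: once a server is at capacity, it rejects new requests quickly instead of queueing them, which also helps stop an overloaded backend from dragging down its callers in a cascading failure. The following is a minimal sketch under that assumption. The `LoadShedder` class and the in-flight limit of 2 are illustrative, not a reference implementation.

```python
import threading


class LoadShedder:
    """Reject new work once `max_in_flight` requests are being served."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()  # protects the in-flight counter

    def try_acquire(self) -> bool:
        """Return True if the request may proceed, False if it should be shed."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Mark one request as finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


# Illustrative use: the third concurrent request is shed, and capacity
# frees up again once an earlier request completes.
shedder = LoadShedder(max_in_flight=2)
results = [shedder.try_acquire(), shedder.try_acquire(), shedder.try_acquire()]
shedder.release()
results.append(shedder.try_acquire())
print(results)
```

A caller that receives a rejection would typically return an explicit "overloaded" error so that clients can back off rather than retry immediately.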

🛠️ 6. Development

  1. Practices:
    1. Managing critical state (distributed consensus).
    2. Data integrity.
    3. Distributed periodic scheduling.
    4. Data processing pipelines.
  2. Purpose: Focuses on designing robust systems during development, ensuring reliability and integrity.

🚀 7. Product (Top Level)

  1. Practices: Reliable product launches at scale.
  2. Purpose: Delivers reliable products to users, leveraging all foundational practices for smooth launches and sustained reliability.