
SRE Principles
SLIs, SLOs, SLAs, error budgets, toil reduction, incident management, on-call, blameless postmortems
1. What is an SLI (Service Level Indicator) in SRE?
Answer
An SLI (Service Level Indicator) is a quantitative metric that measures a specific aspect of the service level provided to users. Typical SLIs include availability (uptime), latency (response time), error rate, or throughput. These indicators are objectively measured by monitoring systems and serve as the foundation for defining SLOs. For example, an availability SLI could be the percentage of successful HTTP requests (2xx codes) out of total requests.
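As a rough illustration of the availability example, the sketch below computes the share of 2xx responses in a list of HTTP status codes. The function name `availability_sli` and the sample data are assumptions made for this example; a real monitoring system would derive the same ratio from its own request counters.

```python
def availability_sli(status_codes):
    """Return the fraction of requests answered with a 2xx status code."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests observed in the measurement window")
    successful = sum(1 for code in codes if 200 <= code < 300)
    return successful / len(codes)


# Hypothetical sample: 997 successful responses and 3 server errors.
sample = [200] * 997 + [500] * 3
print(f"availability SLI: {availability_sli(sample):.3%}")  # 99.700%
```

Counting only 2xx codes as success follows the definition given in the answer; some teams may prefer to treat 3xx or 4xx responses differently, depending on what users actually experience as a failure.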
2. What is the main difference between an SLO and an SLA?
Answer
An SLO (Service Level Objective) is an internal service level target defined by the team to guide SRE efforts, with no legal consequences. An SLA (Service Level Agreement) is a formal contract with the client that includes consequences (refunds, penalties) if targets are not met. The SLO is typically stricter than the SLA to create a safety buffer and avoid SLA violations. For example, an SLO of 99.9% with an SLA of 99.5% provides a margin of safety.
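To make the safety margin concrete, the sketch below converts the two targets from the example into minutes of tolerated downtime. It assumes a 30-day window and treats every breach as full downtime; the helper name `allowed_downtime_minutes` is hypothetical.

```python
def allowed_downtime_minutes(target_percent, window_days=30):
    """Minutes of full downtime a target tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - target_percent) / 100


slo_minutes = allowed_downtime_minutes(99.9)  # internal objective
sla_minutes = allowed_downtime_minutes(99.5)  # contractual commitment
print(f"SLO 99.9% allows {slo_minutes:.1f} min")             # 43.2 min
print(f"SLA 99.5% allows {sla_minutes:.1f} min")             # 216.0 min
print(f"safety buffer: {sla_minutes - slo_minutes:.1f} min")  # 172.8 min
```

Under these assumptions, the team would miss its own SLO after about 43 minutes of downtime, well before the 216 minutes at which the SLA would be breached.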
3. What is an error budget in SRE?
Answer
An error budget is the acceptable amount of failure or unavailability for a service over a given period. It is calculated as the difference between 100% and the SLO. For example, with an SLO of 99.9%, the error budget is 0.1%, which is about 43 minutes of downtime in a 30-day month. The error budget lets the team balance innovation and reliability: as long as budget remains, the team can deploy new features quickly. Once the budget is exhausted, the focus must shift to stability, and releases should be postponed.
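One possible way to track the budget is to count failed requests against the number of failures the SLO permits. The sketch below assumes a request-based SLO; the function `remaining_error_budget` and the traffic figures are hypothetical and only illustrate the arithmetic.

```python
def remaining_error_budget(slo_percent, total_requests, failed_requests):
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Failures the SLO tolerates for this volume of traffic.
    allowed_failures = total_requests * (100 - slo_percent) / 100
    return 1 - failed_requests / allowed_failures


# Hypothetical window: 1,000,000 requests, 400 failures, SLO of 99.9%.
remaining = remaining_error_budget(99.9, 1_000_000, 400)
print(f"remaining error budget: {remaining:.1%}")  # 60.0%
print("releases allowed" if remaining > 0 else "freeze releases, focus on stability")
```

A value at or below zero corresponds to the exhausted state described above, where feature releases are postponed in favor of reliability work.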
How to calculate the remaining error budget for a service?
What to do when a service's error budget is exhausted?