DevOps

Alerting & Incident Response

Runbooks, paging, escalation, incident management, on-call practices, postmortems, troubleshooting alerts

20 interview questions · Mid-Level
1

What is a runbook in the DevOps context?

Answer

A runbook is an operational document containing standardized procedures for handling incidents or recurring maintenance tasks. It lets on-call teams follow predefined, validated steps to resolve known issues quickly, reducing both resolution time and human error. Runbooks can be manual or automated and are a fundamental element of incident management.
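To make the "automated runbook" idea concrete, here is a minimal sketch in which a runbook is an ordered list of steps that halts at the first failure. The `Step` class, the `run_runbook` function, and the "disk full" scenario are all hypothetical names chosen for illustration; the lambdas stand in for real checks and remediation commands.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], bool]  # returns True when the step succeeded


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first failure and return a log."""
    log = []
    for number, step in enumerate(steps, start=1):
        ok = step.action()
        log.append(f"{number}. {step.description}: {'ok' if ok else 'FAILED'}")
        if not ok:
            # A failed step means the known procedure no longer applies.
            log.append("Stopping: escalate to a human operator.")
            break
    return log


# Hypothetical "disk full" runbook; each lambda is a placeholder for a real check.
disk_full_runbook = [
    Step("Confirm disk usage is above threshold", lambda: True),
    Step("Rotate and compress old logs", lambda: True),
    Step("Verify disk usage dropped", lambda: False),
    Step("Close the alert", lambda: True),
]

runbook_log = run_runbook(disk_full_runbook)
for line in runbook_log:
    print(line)
```

Stopping at the first failed step is one possible design choice: it keeps the automation from continuing past a state the runbook author did not anticipate and hands the incident back to a person.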

2

What is paging in the context of incident management?

Answer

Paging is the alerting mechanism that notifies on-call engineers when a critical incident occurs. This system uses various communication channels such as SMS, phone calls, or dedicated applications to ensure the responsible person is alerted quickly, even outside working hours. A good paging system includes automatic escalation policies if the first person does not respond.
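The automatic escalation described above can be sketched as a list of levels, each with its own acknowledgement timeout. This is an illustrative model only, not the API of any particular paging product; `EscalationLevel`, `who_to_page`, the responder names, and the timeout values are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EscalationLevel:
    responder: str
    timeout_minutes: int  # how long to wait for an acknowledgement


def who_to_page(policy: List[EscalationLevel],
                minutes_unacknowledged: int) -> Optional[str]:
    """Return the responder being paged after the given unacknowledged time.

    Returns None once every level in the policy has timed out.
    """
    elapsed = 0
    for level in policy:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.responder
    return None


# Hypothetical three-level policy.
policy = [
    EscalationLevel("primary on-call", 5),
    EscalationLevel("secondary on-call", 10),
    EscalationLevel("engineering manager", 15),
]

for minutes in (0, 5, 15, 30):
    print(minutes, who_to_page(policy, minutes))
```

In this sketch the alert moves to the next level as soon as the previous level's timeout elapses without acknowledgement; a real system would also stop escalating once someone acknowledges the page.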

3

What is a postmortem in the context of incident management?

Answer

A postmortem is a retrospective analysis conducted after a significant incident. Its objective is to understand the root causes of the incident, document the timeline of events, identify corrective actions, and share what was learned with the team. An effective postmortem is blameless: it focuses on improving systems and processes rather than assigning individual fault.

4

What is the difference between an event and an incident?

5

What does MTTA (Mean Time To Acknowledge) mean?
