DevOps面接必須問題集:完全ガイド2026
CI/CD、Kubernetes、Docker、Terraform、SREプラクティスに関する必須問題でDevOps面接を準備しましょう。詳細な回答付き。

DevOpsの面接では、開発・運用・自動化文化という独自のスキルの組み合わせが評価されます。このガイドでは、ドメインごとに整理された最頻出質問を、概念への深い習熟を示す構造化された回答とともにまとめています。
技術的な知識に加え、面接官は複雑な概念をシンプルに説明する能力と、具体的な問題解決経験を共有できるかどうかを評価します。
DevOpsの基礎と文化
最初の質問では、DevOps哲学の全体的な理解が評価されることが多いです。
Q1: DevOpsとは何か、このアプローチはどのような問題を解決するか?
DevOpsは、ソフトウェア開発(Dev)とIT運用(Ops)を統合する文化と実践のセットです。このアプローチは、高品質を維持しながら開発サイクルを短縮することを目的としています。
# devops-principles.yaml
# The pillars of DevOps culture
principles:
collaboration:
description: "Breaking silos between teams"
practices:
- "Shared responsibility for production code"
- "Continuous communication via ChatOps"
- "Blameless post-mortems"
automation:
description: "Automate repetitive tasks"
practices:
- "Infrastructure as Code (IaC)"
- "CI/CD pipelines"
- "Automated testing at all levels"
measurement:
description: "Measure to improve"
metrics:
- "Deployment frequency"
- "Lead time for changes"
- "Mean time to recovery (MTTR)"
- "Change failure rate"
sharing:
description: "Share knowledge"
practices:
- "Documentation as Code"
- "Automated runbooks"
- "Regular knowledge sharing sessions"解決される問題には、遅くてリスクの高いデプロイ、チーム間の可視性の欠如、環境間の不整合などがあります。
Q2: CI、CD(Continuous Delivery)、CD(Continuous Deployment)の違いは何か?
この3つの概念は、デリバリーサイクルの自動化における段階的な進歩を形成しています。
# ci-cd-pipeline-stages.sh
# Illustration of CI/CD stages
# ============================================
# CI (Continuous Integration)
# ============================================
# Goal: Frequently integrate code into a shared repository
# Automation: Build + Tests
echo "CI: Code commit → Build → Unit Tests → Integration Tests"
# ============================================
# CD (Continuous Delivery)
# ============================================
# Goal: Code always deployable to production
# Automation: CI + Staging deployment + Manual approval
echo "CD Delivery: CI → Deploy Staging → Manual Approval → Deploy Prod"
# ============================================
# CD (Continuous Deployment)
# ============================================
# Goal: Automatic deployment to production
# Automation: Entire pipeline without human intervention
echo "CD Deployment: CI → Deploy Staging → Auto Tests → Auto Deploy Prod"主な違いは自動化のレベルにあります。Continuous Deliveryは本番環境へのデプロイ前に手動承認が必要ですが、Continuous Deploymentはプロセス全体を完全に自動化します。
CI/CDとパイプライン
CI/CDの質問は、デリバリーパイプラインの設計・最適化能力をテストします。
Q3: 堅牢なCI/CDパイプラインをどのように構築するか?
適切に設計されたパイプラインは、各レベルにチェックポイントを持つ段階的なステージに従います。
# .gitlab-ci.yml
# Complete CI/CD pipeline with parallel and sequential stages
stages:
- validate
- build
- test
- security
- deploy-staging
- integration-tests
- deploy-production
variables:
DOCKER_IMAGE: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
# ============================================
# Stage 1: Fast validation (< 2 min)
# ============================================
lint:
stage: validate
script:
- npm run lint
- npm run type-check
# Run on every commit
rules:
- if: $CI_PIPELINE_SOURCE == "merge_request_event"
- if: $CI_COMMIT_BRANCH
# ============================================
# Stage 2: Application build
# ============================================
build:
stage: build
script:
- docker build -t $DOCKER_IMAGE .
- docker push $DOCKER_IMAGE
# Cache Docker layers to speed up builds
cache:
key: docker-$CI_COMMIT_REF_SLUG
paths:
- .docker-cache/
# ============================================
# Stage 3: Parallel tests
# ============================================
unit-tests:
stage: test
script:
- npm run test:unit -- --coverage
coverage: '/Lines\s*:\s*(\d+\.?\d*)%/'
artifacts:
reports:
coverage_report:
coverage_format: cobertura
path: coverage/cobertura-coverage.xml
integration-tests:
stage: test
services:
- postgres:16-alpine
- redis:7-alpine
script:
- npm run test:integration
# Parallelization with unit tests
parallel: 3
# ============================================
# Stage 4: Security analysis
# ============================================
sast:
stage: security
script:
- trivy image --exit-code 1 --severity HIGH,CRITICAL $DOCKER_IMAGE
allow_failure: false
dependency-scan:
stage: security
script:
- npm audit --audit-level=high
allow_failure: true # Alert without blocking
# ============================================
# Stage 5: Staging deployment
# ============================================
deploy-staging:
stage: deploy-staging
script:
- kubectl set image deployment/app app=$DOCKER_IMAGE -n staging
- kubectl rollout status deployment/app -n staging --timeout=300s
environment:
name: staging
url: https://staging.example.com
only:
- develop
# ============================================
# Stage 6: E2E tests on staging
# ============================================
e2e-tests:
stage: integration-tests
script:
- npm run test:e2e -- --base-url=https://staging.example.com
artifacts:
when: on_failure
paths:
- cypress/screenshots/
- cypress/videos/
only:
- develop
# ============================================
# Stage 7: Production deployment
# ============================================
deploy-production:
stage: deploy-production
script:
- kubectl set image deployment/app app=$DOCKER_IMAGE -n production
- kubectl rollout status deployment/app -n production --timeout=300s
environment:
name: production
url: https://app.example.com
# Manual deployment with protection
when: manual
only:
- mainこのパイプラインはベストプラクティスを示しています:スピードのための並列ステージ、トレーサビリティのためのアーティファクト、本番環境の保護された環境。
Q4: CI/CDパイプラインでシークレットをどのように管理するか?
シークレット管理には、暗号化、ローテーション、最小権限の原則を組み合わせた多層アプローチが必要です。
# kubernetes-secrets-management.yaml
# Approach 1: External Secrets Operator with HashiCorp Vault
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: app-secrets
namespace: production
spec:
refreshInterval: 1h # Automatic rotation
secretStoreRef:
name: vault-backend
kind: ClusterSecretStore
target:
name: app-secrets
creationPolicy: Owner
data:
# Reference to secret in Vault
- secretKey: DATABASE_PASSWORD
remoteRef:
key: secret/data/production/database
property: password
- secretKey: API_KEY
remoteRef:
key: secret/data/production/api
property: key
---
# SecretStore configuration
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
name: vault-backend
spec:
provider:
vault:
server: "https://vault.example.com"
path: "secret"
version: "v2"
auth:
kubernetes:
mountPath: "kubernetes"
role: "external-secrets"
# Dedicated ServiceAccount with minimal permissions
serviceAccountRef:
name: external-secrets-sa推奨プラクティス:コードにシークレットをプレーンテキストで保存しない、専用のシークレットマネージャー(Vault、AWS Secrets Manager)を使用する、自動ローテーションを有効にする。
ログに表示されるCI/CD環境変数を避けてください。CIプラットフォームのネイティブ機能(マスクされた変数)を使用してシークレットを常にマスクしてください。
Kubernetesとオーケストレーション
Kubernetesの質問では、オーケストレーションの概念の理解と具体的な問題解決能力が評価されます。
Q5: Kubernetesのアーキテクチャと各コンポーネントの役割を説明してください。
Kubernetesは、異なる責任を持つコンポーネントを持つマスター・ノードアーキテクチャに従います。
# kubernetes-architecture.yaml
# Control Plane components (Master)
control_plane:
api_server:
role: "Entry point for all API requests"
responsibilities:
- "Validation and configuration of API objects"
- "Authentication and authorization"
- "REST interface for kubectl and other clients"
etcd:
role: "Distributed key-value database"
responsibilities:
- "Cluster state storage"
- "Source of truth for configuration"
- "Consensus via Raft algorithm"
scheduler:
role: "Assigning Pods to nodes"
responsibilities:
- "Evaluating constraints (resources, affinity)"
- "Selecting the optimal node"
- "Respecting PodDisruptionBudgets"
controller_manager:
role: "Control loops for desired state"
controllers:
- "ReplicaSet Controller"
- "Deployment Controller"
- "Service Controller"
- "Node Controller"
# Worker Node components
worker_nodes:
kubelet:
role: "Agent on each node"
responsibilities:
- "Communication with Control Plane"
- "Pod lifecycle management"
- "Node status reporting"
kube_proxy:
role: "Network proxy on each node"
responsibilities:
- "iptables/IPVS rules for Services"
- "Intra-cluster load balancing"
container_runtime:
role: "Container execution"
options:
- "containerd (recommended)"
- "CRI-O"このアーキテクチャは高可用性を実現します:Control Planeはレプリケートでき、ワークロードはWorker Nodeに分散されます。
Q6: 起動しないPodをどのようにデバッグするか?
Kubernetesのデバッグは、異なるレイヤーを分析する系統的なアプローチに従います。
# kubernetes-debugging.sh
# Workflow for debugging a failing Pod
# Step 1: Check Pod status
kubectl get pod my-app-pod -o wide
# STATUS: CrashLoopBackOff, ImagePullBackOff, Pending, etc.
# Step 2: Pod details and events
kubectl describe pod my-app-pod
# Important sections:
# - Conditions (PodScheduled, Initialized, Ready)
# - Events (scheduling, pull errors, etc.)
# Step 3: Container logs
kubectl logs my-app-pod --previous # Previous crash logs
kubectl logs my-app-pod -c init-container # Init container logs
# Step 4: Interactive execution for debugging
kubectl exec -it my-app-pod -- sh
# Check: env vars, mounted files, network
# Step 5: Check available resources
kubectl describe node <node-name>
# Sections: Allocatable, Allocated resources
# Step 6: Debug with ephemeral Pod (K8s 1.25+)
kubectl debug my-app-pod -it --image=busybox --share-processes一般的な原因には、リソース不足、イメージが見つからない、シークレットの欠如、または誤って設定されたプローブが含まれます。
# pod-debugging-checklist.yaml
# Debugging checklist by status
debugging_by_status:
Pending:
causes:
- "Insufficient resources on nodes"
- "PersistentVolumeClaim not bound"
- "Affinity/Taints not satisfied"
commands:
- "kubectl describe pod <name> | grep -A 20 Events"
- "kubectl get pvc"
- "kubectl describe nodes | grep -A 5 Allocated"
ImagePullBackOff:
causes:
- "Non-existent image or incorrect tag"
- "Private registry without imagePullSecrets"
- "Docker Hub rate limiting"
commands:
- "kubectl get events --field-selector reason=Failed"
- "kubectl get secret <pull-secret> -o yaml"
CrashLoopBackOff:
causes:
- "Application error at startup"
- "Missing configuration (env vars, configmaps)"
- "Liveness probe too aggressive"
commands:
- "kubectl logs <pod> --previous"
- "kubectl describe pod <pod> | grep -A 10 Liveness"
OOMKilled:
causes:
- "Memory limit too low"
- "Memory leak in application"
commands:
- "kubectl describe pod <pod> | grep -A 5 Last State"
- "kubectl top pod <pod>"DevOpsの面接対策はできていますか?
インタラクティブなシミュレーター、flashcards、技術テストで練習しましょう。
Infrastructure as Code
IaCの質問では、プロビジョニングツールとベストプラクティスの習熟度が評価されます。
Q7: TerraformとAnsible:それぞれいつ使うべきか?
この2つのツールは、異なる哲学とユースケースを持っています。
# terraform-example.tf
# Terraform: Infrastructure provisioning (declarative)
# Ideal for: cloud resources, networking, infrastructure state
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
# Remote state for collaboration
backend "s3" {
bucket = "terraform-state-prod"
key = "infrastructure/terraform.tfstate"
region = "eu-west-1"
dynamodb_table = "terraform-locks"
encrypt = true
}
}
# Declarative resource: Terraform manages the lifecycle
resource "aws_eks_cluster" "main" {
name = "production-cluster"
role_arn = aws_iam_role.eks_cluster.arn
version = "1.29"
vpc_config {
subnet_ids = module.vpc.private_subnets
endpoint_private_access = true
endpoint_public_access = false
}
# Implicit dependencies managed by Terraform
depends_on = [
aws_iam_role_policy_attachment.eks_cluster_policy
]
}
# Reusable modules for standardization
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.0.0"
name = "production-vpc"
cidr = "10.0.0.0/16"
azs = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
enable_nat_gateway = true
single_nat_gateway = false # HA: one NAT per AZ
}# ansible-example.yml
# Ansible: Configuration management (procedural/declarative)
# Ideal for: OS configuration, app deployment, orchestration
---
- name: Configure application servers
hosts: app_servers
become: yes
vars:
app_version: "2.5.0"
tasks:
# System package management
- name: Install required packages
ansible.builtin.apt:
name:
- nginx
- python3-pip
- supervisor
state: present
update_cache: yes
# Configuration via Jinja2 templates
- name: Deploy nginx configuration
ansible.builtin.template:
src: templates/nginx.conf.j2
dest: /etc/nginx/sites-available/app
owner: root
group: root
mode: '0644'
notify: Reload nginx
# Application deployment
- name: Deploy application
ansible.builtin.git:
repo: "https://github.com/org/app.git"
dest: /opt/app
version: "v{{ app_version }}"
notify: Restart application
handlers:
- name: Reload nginx
ansible.builtin.service:
name: nginx
state: reloaded
- name: Restart application
ansible.builtin.supervisorctl:
name: app
state: restartedまとめると:Terraformはインフラ(何が存在するか)、Ansibleは設定(どのように設定されているか)のために使います。両方のツールは完全なワークフローで組み合わせて使われることが多いです。
Q8: 大規模な組織向けにTerraformプロジェクトをどのように構成するか?
環境を分離したモジュール構造により、メンテナンスとコラボレーションが容易になります。
# terraform-project-structure
# Recommended structure for enterprise projects
terraform-infrastructure/
├── modules/ # Reusable modules
│ ├── networking/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ └── README.md
│ ├── kubernetes/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ └── database/
│ ├── main.tf
│ ├── variables.tf
│ └── outputs.tf
│
├── environments/ # Per-environment configuration
│ ├── dev/
│ │ ├── main.tf # Calls modules
│ │ ├── variables.tf
│ │ ├── terraform.tfvars # Dev values
│ │ └── backend.tf # Dev state
│ ├── staging/
│ │ ├── main.tf
│ │ ├── terraform.tfvars
│ │ └── backend.tf
│ └── production/
│ ├── main.tf
│ ├── terraform.tfvars
│ └── backend.tf
│
├── shared/ # Shared resources
│ ├── iam/
│ └── dns/
│
└── .github/
└── workflows/
└── terraform.yml # CI/CD pipeline# environments/production/main.tf
# Example of module usage
module "networking" {
source = "../../modules/networking"
environment = "production"
vpc_cidr = var.vpc_cidr
azs = var.availability_zones
enable_flow_logs = true
}
module "kubernetes" {
source = "../../modules/kubernetes"
environment = "production"
cluster_name = "prod-cluster"
vpc_id = module.networking.vpc_id
subnet_ids = module.networking.private_subnet_ids
node_groups = var.node_groups
# Production: HA configuration
cluster_version = "1.29"
enable_cluster_autoscaler = true
}
module "database" {
source = "../../modules/database"
environment = "production"
vpc_id = module.networking.vpc_id
subnet_ids = module.networking.database_subnet_ids
instance_class = "db.r6g.xlarge"
multi_az = true # HA in production
backup_retention = 30
}この構造により、モジュールのバージョン管理、環境ごとの変更レビュー、コードの再利用が可能になります。
モニタリングとObservability
モニタリングの質問では、観測可能なシステムを設計する能力が評価されます。
Q9: Observabilityの3つの柱とは何か?
Observabilityは、システムの内部状態を理解するのに役立つ3つの補完的なデータタイプに依存しています。
# observability-pillars.yaml
# The three pillars of observability
pillars:
metrics:
description: "Numeric data aggregated over time"
characteristics:
- "Low cardinality"
- "Efficient storage"
- "Ideal for alerting"
examples:
- "request_count (counter)"
- "response_time_seconds (histogram)"
- "active_connections (gauge)"
tools:
- "Prometheus"
- "Datadog"
- "CloudWatch"
use_cases:
- "Real-time dashboards"
- "Threshold alerts"
- "Capacity planning"
logs:
description: "Timestamped text events"
characteristics:
- "High cardinality"
- "Detailed context"
- "Large storage"
examples:
- "Application errors"
- "Audit events"
- "Debug information"
tools:
- "Loki"
- "Elasticsearch"
- "CloudWatch Logs"
use_cases:
- "Debugging"
- "Audit compliance"
- "Root cause analysis"
traces:
description: "Request tracking across services"
characteristics:
- "End-to-end view"
- "Context propagation"
- "Bottleneck identification"
examples:
- "Distributed transaction"
- "Service dependencies"
- "Latency breakdown"
tools:
- "Jaeger"
- "Tempo"
- "AWS X-Ray"
use_cases:
- "Performance optimization"
- "Service dependencies"
- "Error propagation"Q10: 効果的なアラートをどのように設定するか?
適切に設計されたアラートは疲労を軽減し、インシデントへの迅速な対応を可能にします。
# prometheus-alerting-rules.yaml
# Prometheus alerting rules with best practices
groups:
- name: application-alerts
rules:
# Alert on symptom, not cause
- alert: HighErrorRate
# Error rate > 1% over 5 minutes
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.01
for: 5m # Avoid false positives
labels:
severity: critical
team: backend
annotations:
summary: "High error rate detected"
description: |
Error rate is {{ $value | humanizePercentage }}
for the last 5 minutes.
runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
# Proactive alert on saturation
- alert: DiskSpaceRunningLow
expr: |
(node_filesystem_avail_bytes / node_filesystem_size_bytes)
* 100 < 20
for: 15m
labels:
severity: warning
annotations:
summary: "Disk space below 20%"
description: |
Node {{ $labels.instance }} has only
{{ $value | humanize }}% disk space remaining.
# SLO-based alerting
- alert: SLOBudgetBurnRate
# Error budget consumed too quickly
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) > (1 - 0.999) * 14.4
for: 5m
labels:
severity: critical
annotations:
summary: "SLO budget burning too fast"
description: |
At current error rate, monthly SLO budget will be
exhausted in less than 2 days.# alertmanager-config.yaml
# AlertManager configuration with intelligent routing
global:
resolve_timeout: 5m
route:
receiver: default
group_by: [alertname, cluster, service]
group_wait: 30s # Wait to group alerts
group_interval: 5m # Interval between grouped notifications
repeat_interval: 4h # Re-alert if not resolved
routes:
# Critical alerts: immediate notification
- match:
severity: critical
receiver: pagerduty-critical
continue: true # Also notify Slack
# Alerts by team
- match:
team: backend
receiver: slack-backend
- match:
team: infrastructure
receiver: slack-infra
receivers:
- name: pagerduty-critical
pagerduty_configs:
- service_key: <pagerduty-key>
severity: critical
- name: slack-backend
slack_configs:
- channel: '#alerts-backend'
send_resolved: true
title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
text: '{{ .CommonAnnotations.description }}'主要原則:原因ではなく症状(ユーザーへの影響)に基づいてアラートを出す、ランブックを含める、SLOに従ってしきい値を調整する。
セキュリティとコンプライアンス
セキュリティの質問では、リスクと対策の理解が評価されます。
Q11: Kubernetesクラスターをどのように保護するか?
Kubernetesのセキュリティは複数のレイヤーをカバーしています:ネットワーク、認証、ワークロード、データ。
# kubernetes-security-policies.yaml
# NetworkPolicy: network isolation between namespaces
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
# Applied to all pods in namespace
podSelector: {}
policyTypes:
- Ingress
- Egress
# No traffic allowed by default
ingress: []
egress: []
---
# Allow only necessary traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-network-policy
namespace: production
spec:
podSelector:
matchLabels:
app: api
policyTypes:
- Ingress
- Egress
ingress:
# Accept only from ingress controller
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
ports:
- protocol: TCP
port: 8080
egress:
# Allow to database
- to:
- podSelector:
matchLabels:
app: postgres
ports:
- protocol: TCP
port: 5432
# Allow DNS
- to:
- namespaceSelector: {}
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53# pod-security-standards.yaml
# PodSecurity: workload restrictions
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
# Enforce: blocks violations
pod-security.kubernetes.io/enforce: restricted
# Warn: warns without blocking
pod-security.kubernetes.io/warn: restricted
# Audit: logs violations
pod-security.kubernetes.io/audit: restricted
---
# Pod compliant with "restricted" standards
apiVersion: v1
kind: Pod
metadata:
name: secure-pod
namespace: production
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
containers:
- name: app
image: myapp:latest
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
resources:
limits:
memory: "256Mi"
cpu: "500m"
requests:
memory: "128Mi"
cpu: "250m"
volumeMounts:
- name: tmp
mountPath: /tmp
volumes:
- name: tmp
emptyDir: {}Kubernetesのセキュリティは複数のレイヤーを組み合わせます:認可のためのRBAC、ネットワーク分離のためのNetworkPolicies、ワークロード制限のためのPodSecurity、保存時のシークレットの暗号化。
Q12: 最小権限の原則とは何か、どのように適用するか?
この原則は、ユーザーまたはシステムがタスクを完了するために必要な最小限の権限のみを持つべきであることを定めています。
# rbac-least-privilege.yaml
# Kubernetes RBAC with minimal permissions
# Role: permissions in a specific namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: production
name: deployment-manager
rules:
# Pod reading (for monitoring)
- apiGroups: [""]
resources: ["pods", "pods/log"]
verbs: ["get", "list", "watch"]
# Deployment management only
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["get", "list", "watch", "update", "patch"]
# No create/delete on deployments
# No access to secrets or sensitive configmaps
---
# RoleBinding: Role <-> ServiceAccount association
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: deployment-manager-binding
namespace: production
subjects:
- kind: ServiceAccount
name: ci-cd-deployer
namespace: production
roleRef:
kind: Role
name: deployment-manager
apiGroup: rbac.authorization.k8s.io
---
# Dedicated ServiceAccount for CI/CD
apiVersion: v1
kind: ServiceAccount
metadata:
name: ci-cd-deployer
namespace: production
annotations:
# Automatic token expiration
kubernetes.io/enforce-mountable-secrets: "true"この原則はAWS IAM、データベース、ネットワークアクセスにも適用されます。
SREと信頼性
SREの質問では、信頼性のプラクティスとインシデント管理の理解が評価されます。
Q13: SLOとは何か、どのように定義するか?
Service Level Objectives(SLO)はサービスの期待される信頼性を定量化し、エンジニアリングの意思決定をガイドします。
# slo-definitions.yaml
# SLO definitions for an API service
service: payment-api
owner: payments-team
slos:
- name: availability
description: "Service responds successfully to requests"
sli:
# SLI: measured metric
type: availability
good_events: "http_requests_total{status=~'2..'}"
total_events: "http_requests_total"
target: 99.9% # SLO: objective
window: 30d # Measurement window
# Error budget: 0.1% = 43.2 minutes/month
error_budget:
monthly_minutes: 43.2
- name: latency
description: "Response time below threshold"
sli:
type: latency
good_events: "http_request_duration_seconds_bucket{le='0.3'}"
total_events: "http_request_duration_seconds_count"
target: 99% # 99% of requests < 300ms
window: 30d
- name: throughput
description: "Ability to process transactions"
sli:
type: throughput
query: "sum(rate(transactions_processed_total[5m]))"
target: ">= 1000 TPS"
# Actions based on error budget
error_budget_policy:
- condition: "remaining > 50%"
actions:
- "Feature development prioritized"
- "Experimentation allowed"
- condition: "remaining 20-50%"
actions:
- "Balance features and reliability"
- "Increase testing coverage"
- condition: "remaining < 20%"
actions:
- "Freeze non-critical deployments"
- "Focus on reliability improvements"
- condition: "exhausted"
actions:
- "Incident response mode"
- "All hands on reliability"SLOは客観的な意思決定を可能にします:新機能をデプロイするか、信頼性を強化するか。
Q14: 効果的なポストモーテムをどのように実施するか?
ブレームレスなポストモーテムは学習を促進し、将来のインシデントを防ぎます。
# postmortem-template.yaml
# Blameless post-mortem template
incident:
id: "INC-2026-0042"
title: "Payment service unavailability"
severity: SEV1
duration: "45 minutes"
date: "2026-01-15"
# Factual timeline
timeline:
- time: "14:32"
event: "Alert: error rate > 5% on payment-api"
actor: "PagerDuty"
- time: "14:35"
event: "Incident declared, team notified"
actor: "On-call engineer"
- time: "14:42"
event: "Cause identified: connection pool exhausted"
actor: "Backend team"
- time: "14:55"
event: "Mitigation: deployment rollback"
actor: "Backend team"
- time: "15:17"
event: "Service restored, monitoring stable"
actor: "Backend team"
# Measurable impact
impact:
users_affected: 12500
transactions_failed: 847
revenue_impact: "~$16,500"
slo_budget_consumed: "2.3 days"
# Root cause analysis (5 Whys)
root_cause_analysis:
- question: "Why was the service unavailable?"
answer: "DB connections were exhausted"
- question: "Why were connections exhausted?"
answer: "A slow query was blocking connections"
- question: "Why was there a slow query?"
answer: "Missing index on a new table"
- question: "Why was the index missing?"
answer: "Incomplete migration deployed"
- question: "Why was the migration incomplete?"
answer: "No execution plan validation in staging"
# Corrective actions
action_items:
- id: "AI-001"
type: "prevent"
description: "Add SQL execution plan validation in CI"
owner: "DBA team"
due_date: "2026-01-22"
priority: P1
- id: "AI-002"
type: "detect"
description: "Alert on connection pool usage > 80%"
owner: "SRE team"
due_date: "2026-01-18"
priority: P1
- id: "AI-003"
type: "mitigate"
description: "Implement circuit breaker on DB queries"
owner: "Backend team"
due_date: "2026-01-29"
priority: P2
# Lessons learned
lessons_learned:
what_went_well:
- "Fast detection thanks to alerting (< 3 min)"
- "Clear communication in incident channel"
- "Rollback completed in less than 15 minutes"
what_went_poorly:
- "No load testing on new endpoint"
- "Staging didn't reflect prod data volume"
lucky:
- "Incident during daytime with full team available"目標はシステムを改善することであり、責任を問うことではありません。アクションは3つのカテゴリに分類されます:防止、検出、軽減。
今すぐ練習を始めましょう!
面接シミュレーターと技術テストで知識をテストしましょう。
まとめ
DevOpsの面接は、文化から技術ツールまで幅広いスキルをカバーしています。成功の鍵は、具体的な実装例で示された概念への深い理解を示すことにあります。
準備チェックリスト
- ✅ CI/CDの概念を習熟し、完全なパイプラインを設計できる
- ✅ Kubernetesのアーキテクチャを理解し、一般的な問題をデバッグできる
- ✅ IaCツール(Terraform、Ansible)とそれぞれのユースケースを知っている
- ✅ モニタリングの設定と関連するアラートの定義方法を知っている
- ✅ セキュリティのベストプラクティス(最小権限、多層防御)を適用できる
- ✅ SREプラクティス(SLO、エラーバジェット、ポストモーテム)を説明できる
- ✅ 具体的な問題解決の例を持っている
- ✅ 複雑な概念をシンプルに説明できる
タグ
共有
関連記事

Kubernetes面接対策:Pod、Service、Deploymentの仕組みと実践的な質問解説
Kubernetesの3大構成要素であるPod、Service、Deploymentを本番レベルのYAMLマニフェストとともに解説。ネットワーキングの内部動作、スケーリング戦略、面接頻出の質問と回答を網羅します。

Docker:開発から本番環境まで
アプリケーションのコンテナ化のための完全なDockerガイド。Dockerfile、Docker Compose、マルチステージビルド、本番環境へのデプロイを実践的な例で解説します。

データアナリストのためのSQL:ウィンドウ関数、CTE、高度なクエリ技法
SQLのウィンドウ関数、CTE(共通テーブル式)、高度な分析クエリをコード例付きで解説。データアナリスト面接対策と実務に直結する必須テクニック。