Portal | Level: L1: Foundations | Topics: DORA Metrics & DevEx | Domain: DevOps & Tooling
DORA Metrics & DevEx — Primer¶
Why This Matters¶
For years, engineering teams debated productivity with gut feel. Was the team shipping fast enough? Were outages getting worse? Was the new deployment process helping? Nobody had a consistent way to measure it, so every argument was anecdotal.
DORA (DevOps Research and Assessment) solved this by identifying four metrics that predict software delivery performance — and, more importantly, business outcomes. Teams that score "Elite" on all four metrics are twice as likely to meet their business goals as "Low" performers. The metrics are not arbitrary KPIs; they are the product of a multi-year research program.
Who made it: DORA (DevOps Research and Assessment) was founded in 2015 by Dr. Nicole Forsgren, Jez Humble, and Gene Kim. The research was published in the book Accelerate (2018) by Forsgren, Humble, and Kim, which won the Shingo Publication Award. Google acquired DORA in 2018. The research spans 33,000+ professionals across 2,000+ organizations over six years.
Understanding DORA metrics gives you a shared language for engineering health conversations: with leadership (who want outcomes), with developers (who want feedback on their process), and with ops teams (who want to reduce toil and firefighting). If you join a team and don't know where to start improving, measure DORA first — the worst metric is your biggest leverage point.
Core Concepts¶
1. The Four DORA Metrics¶
| Metric | What it measures | Elite | High | Medium | Low |
|---|---|---|---|---|---|
| Deployment Frequency | How often code ships to production | On-demand (multiple/day) | Weekly–monthly | Monthly–every 6 months | Fewer than once per 6 months |
| Lead Time for Changes | Commit-to-production time | < 1 hour | 1 day–1 week | 1–6 months | > 6 months |
| Change Failure Rate | % of deploys causing incidents | 0–15% | 16–30% | 16–30% | 16–30% |
| Time to Restore Service (MTTR) | How long to recover from failure | < 1 hour | < 1 day | 1 day–1 week | > 6 months |
Note: These thresholds follow the 2021 State of DevOps report; the 2023 report revised some of them again. The overlapping 16–30% bands for Change Failure Rate are not a typo: the report used the same range for every non-Elite cohort. The gap between Elite and Low performers is enormous: the 2021 report found Elite teams deploying 973× more frequently, with 6,570× faster lead times, than Low performers.
Throughput metrics: Deployment Frequency + Lead Time for Changes. These measure how fast you deliver.
Stability metrics: Change Failure Rate + Time to Restore Service. These measure how safely you deliver.
The key insight: high throughput and high stability are NOT in tension. Elite teams are simultaneously the fastest AND the most stable.
Fun fact: This was the most counterintuitive finding of the DORA research. Before Accelerate, the prevailing assumption was that speed and stability were a tradeoff: move fast and break things, or move slowly and be reliable. The data proved the opposite: the practices that increase speed (CI/CD, trunk-based development, automated testing) are the same practices that increase stability. Teams that sacrifice safety for speed create a vicious cycle — more incidents → more toil → less time for features → more pressure to cut corners.
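As a concrete illustration, the banding in the table above can be sketched as a couple of helper functions. These are nothing official, just the table's thresholds transcribed; band boundaries vary by report year.

```python
def classify_lead_time(hours: float) -> str:
    """Band a typical commit-to-production lead time per the table above."""
    if hours < 1:
        return "Elite"
    if hours <= 7 * 24:        # up to one week
        return "High"
    if hours <= 6 * 30 * 24:   # up to roughly six months
        return "Medium"
    return "Low"

def classify_change_failure_rate(pct: float) -> str:
    """Band a change failure rate; non-Elite bands overlap in the 2021 report."""
    return "Elite" if pct <= 15 else "High/Medium/Low"

print(classify_lead_time(0.5), classify_change_failure_rate(12))  # Elite Elite
```

Run this against your own p50 lead time to get a quick read on where your team sits.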
2. Deployment Frequency¶
What: How often does your team deploy to production?
Why it matters: More frequent deploys mean smaller batch sizes. Smaller batches mean easier rollbacks, lower risk per deploy, and faster feedback. A team deploying once per sprint accumulates 2 weeks of changes per release — one bug can hide in 50 PRs.
How to measure:
# From your CI/CD system (e.g., GitHub Actions):
# count successful production-deploy runs per time period.
# Note: this counts ALL successful runs on main; in practice, scope it to the
# deploy workflow (e.g., via /repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs).
gh api repos/{owner}/{repo}/actions/runs \
--jq '[.workflow_runs[] | select(.conclusion=="success" and .head_branch=="main")] | length'
-- Or from a deployment log table in a data warehouse:
SELECT
DATE_TRUNC('week', deployed_at) AS week,
COUNT(*) AS deployments,
COUNT(DISTINCT DATE(deployed_at)) AS days_with_deploys
FROM deployments
WHERE environment = 'production'
AND status = 'success'
AND deployed_at >= NOW() - INTERVAL '90 days'
GROUP BY 1
ORDER BY 1;
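The same weekly grouping can be done in a few lines of Python over deploy timestamps exported from your CI system; a sketch with illustrative dates:

```python
from collections import Counter
from datetime import date

# Illustrative deploy dates; in practice, export these from your CI system.
deploy_dates = [date(2024, 1, 1), date(2024, 1, 3), date(2024, 1, 3),
                date(2024, 1, 10), date(2024, 1, 17), date(2024, 1, 18)]

# Group by ISO (year, week), mirroring the DATE_TRUNC('week', ...) query above.
per_week = Counter(d.isocalendar()[:2] for d in deploy_dates)
for (year, week), n in sorted(per_week.items()):
    print(f"{year}-W{week:02}: {n} deploys")
```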
Improvement levers:
- Reduce batch size (smaller PRs, shorter-lived feature flags)
- Automate all deployment steps (no manual approval gate on every deploy)
- Build confidence with canary/progressive delivery so deploys feel safe
- Decouple deploy from release (feature flags)
3. Lead Time for Changes¶
What: The time from a commit being merged to it running in production.
Why it matters: Long lead times mean slow feedback loops. A bug introduced on Monday isn't caught until it ships on Friday, and four days of accumulated changes are now in the blast radius. Short lead times mean you find out quickly whether a change worked.
How to measure:
-- From version control + deployment system
SELECT
pr.merged_at,
d.deployed_at,
EXTRACT(EPOCH FROM (d.deployed_at - pr.merged_at))/3600 AS lead_time_hours,
pr.title,
pr.html_url
FROM pull_requests pr
JOIN deployments d ON d.commit_sha = pr.merge_commit_sha
WHERE d.environment = 'production'
AND pr.merged_at >= NOW() - INTERVAL '90 days'
ORDER BY lead_time_hours DESC;
-- Percentile breakdown
SELECT
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY lead_time_hours) AS p50,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY lead_time_hours) AS p75,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY lead_time_hours) AS p95
FROM (
SELECT EXTRACT(EPOCH FROM (d.deployed_at - pr.merged_at))/3600 AS lead_time_hours
FROM pull_requests pr
JOIN deployments d ON d.commit_sha = pr.merge_commit_sha
WHERE d.environment = 'production'
) t;
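If this data isn't in a warehouse yet, the same breakdown can be computed from exported merge/deploy timestamps. A minimal Python sketch (the timestamp format and sample pairs are illustrative):

```python
from datetime import datetime
from statistics import median

def lead_time_hours(merged_at: str, deployed_at: str) -> float:
    """Hours between PR merge and production deploy (ISO 8601 'Z' timestamps)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(deployed_at, fmt) - datetime.strptime(merged_at, fmt)
    return delta.total_seconds() / 3600

# Illustrative (merged_at, deployed_at) pairs, e.g. exported from the GitHub API
# joined against a deploy log.
pairs = [
    ("2024-01-01T10:00:00Z", "2024-01-01T11:30:00Z"),
    ("2024-01-02T09:00:00Z", "2024-01-03T09:00:00Z"),
    ("2024-01-04T08:00:00Z", "2024-01-04T08:45:00Z"),
    ("2024-01-05T12:00:00Z", "2024-01-08T12:00:00Z"),
]
hours = [lead_time_hours(m, d) for m, d in pairs]
print(f"p50 lead time: {median(hours):.1f} h")
```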
Common blockers to short lead time:
1. Long CI pipelines (40+ minute builds)
2. Manual QA gates (waiting for a person)
3. Manual deployment approval at every step
4. A slow or flaky staging environment
5. Infrastructure provisioning as part of the deploy
Improvement levers:
- Parallelize CI jobs and cache dependencies
- Automate acceptance tests (replace the manual QA gate)
- Use feature flags to separate deploy from release
- Deploy to production multiple times per day (even with features flagged off)
4. Change Failure Rate¶
What: The percentage of production deployments that cause a degraded service or require a hotfix/rollback.
Why it matters: High failure rates indicate that the team is shipping low-confidence changes — either tests are inadequate, review is perfunctory, or the deploy process doesn't catch regressions. Every incident costs time, user trust, and on-call energy.
How to measure:
-- Classify each deploy as success or failure
-- "failure" = any deploy followed by a rollback, hotfix, or P1/P2 incident within N hours
SELECT
DATE_TRUNC('month', deployed_at) AS month,
COUNT(*) AS total_deploys,
SUM(CASE WHEN caused_incident THEN 1 ELSE 0 END) AS failed_deploys,
ROUND(
100.0 * SUM(CASE WHEN caused_incident THEN 1 ELSE 0 END) / COUNT(*),
1
) AS change_failure_rate_pct
FROM deployments
WHERE environment = 'production'
GROUP BY 1
ORDER BY 1;
What "caused incident" means:
- A P1 or P2 incident opened within 1 hour of the deploy
- A rollback triggered within 1 hour of the deploy
- A hotfix deployed within 24 hours that references the original deploy
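That attribution rule can be sketched directly: flag a deploy as failed if an incident opens within the window after it. The one-hour window and the in-memory record shapes below are assumptions to adapt to your own pipeline:

```python
from datetime import datetime, timedelta

# Window from the rule above; an assumption to tune per pipeline.
INCIDENT_WINDOW = timedelta(hours=1)

def failed_deploys(deploys, incidents, window=INCIDENT_WINDOW):
    """Deploys followed by an incident opening within `window` of the deploy."""
    return [d for d in deploys if any(d <= i <= d + window for i in incidents)]

def change_failure_rate(deploys, incidents):
    """Failed deploys as a percentage of all deploys."""
    if not deploys:
        return 0.0
    return 100.0 * len(failed_deploys(deploys, incidents)) / len(deploys)

# Four daily deploys; one incident opens 30 minutes after the Jan 2 deploy.
deploys = [datetime(2024, 1, day, 12, 0) for day in (1, 2, 3, 4)]
incidents = [datetime(2024, 1, 2, 12, 30)]
print(change_failure_rate(deploys, incidents))  # 25.0
```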
Improvement levers:
- Improve test coverage (unit, integration, smoke)
- Implement progressive delivery (canary, blue/green) to limit blast radius and auto-roll back
- Reduce PR size (smaller PRs are reviewed more carefully)
- Add pre-deploy smoke tests in CI
- Use feature flags to dark-launch risky changes
5. Time to Restore Service (MTTR)¶
What: The time from service degradation being detected to full restoration.
Why it matters: You will have incidents. The question is whether you can recover in minutes or hours. Short MTTR requires: fast detection (good alerting), fast diagnosis (good observability), and fast rollback mechanisms (automated rollback, known good state).
How to measure:
-- From incident management system (PagerDuty, OpsGenie, Jira)
SELECT
DATE_TRUNC('month', started_at) AS month,
COUNT(*) AS incidents,
AVG(EXTRACT(EPOCH FROM (resolved_at - started_at))/60) AS avg_mttr_minutes,
PERCENTILE_CONT(0.50) WITHIN GROUP (
ORDER BY EXTRACT(EPOCH FROM (resolved_at - started_at))/60
) AS p50_mttr_minutes,
PERCENTILE_CONT(0.95) WITHIN GROUP (
ORDER BY EXTRACT(EPOCH FROM (resolved_at - started_at))/60
) AS p95_mttr_minutes
FROM incidents
WHERE severity IN ('P1', 'P2')
AND started_at >= NOW() - INTERVAL '90 days'
GROUP BY 1
ORDER BY 1;
The detection gap: MTTR includes detection time. If an incident starts at 2am and nobody notices until 6am (because your alerts are lost in noise or aren't covering the right signals), your MTTR is already at least 4 hours before anyone even starts diagnosing.
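One way to make the detection gap visible is to record detected_at separately from started_at and resolved_at and report the split. A small sketch (the field names are assumptions):

```python
from datetime import datetime

def mttr_breakdown(started_at, detected_at, resolved_at):
    """Split restore time into detection and response components (minutes)."""
    detection_min = (detected_at - started_at).total_seconds() / 60
    response_min = (resolved_at - detected_at).total_seconds() / 60
    return {"detection_min": detection_min,
            "response_min": response_min,
            "mttr_min": detection_min + response_min}

# The 2am-to-6am example above: four hours undetected, then 30 minutes to fix.
b = mttr_breakdown(datetime(2024, 1, 1, 2, 0),
                   datetime(2024, 1, 1, 6, 0),
                   datetime(2024, 1, 1, 6, 30))
print(b["mttr_min"], b["detection_min"])  # 270.0 240.0
```

When detection dominates the total, the highest-leverage fix is alerting, not faster rollbacks.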
Improvement levers:
- Improve alerting: alert on symptoms (error rate, latency), not just causes (CPU usage)
- Reduce alert fatigue (fewer noisy alerts means faster response to real ones)
- Build runbooks for known failure modes
- Practice incident response (game days, chaos engineering)
- Enable fast rollback: feature flags, automated canary abort, Argo CD self-heal
6. Instrumenting CI/CD Pipelines for DORA¶
GitHub Actions example — emit DORA events to a metrics store:
# .github/workflows/deploy.yml
name: Deploy to Production
on:
push:
branches: [main]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Record deploy start
id: deploy-start
run: echo "start_time=$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> $GITHUB_OUTPUT
- name: Run tests
run: make test
- name: Deploy
run: make deploy-prod
- name: Record deploy success
if: success()
run: |
curl -X POST https://metrics.internal/dora/deploy \
-H "Content-Type: application/json" \
-d '{
"timestamp": "${{ steps.deploy-start.outputs.start_time }}",
"sha": "${{ github.sha }}",
"environment": "production",
"status": "success",
"duration_seconds": '"$(($(date +%s) - $(date -d "${{ steps.deploy-start.outputs.start_time }}" +%s)))"',
"repo": "${{ github.repository }}",
"actor": "${{ github.actor }}"
}'
- name: Record deploy failure
if: failure()
run: |
curl -X POST https://metrics.internal/dora/deploy \
-H "Content-Type: application/json" \
-d '{
"timestamp": "${{ steps.deploy-start.outputs.start_time }}",
"sha": "${{ github.sha }}",
"environment": "production",
"status": "failure",
"repo": "${{ github.repository }}"
}'
Prometheus metrics for DORA:
# Custom metrics emitted from deploy scripts
# deployment_frequency_total{env, repo, status}
# lead_time_seconds{env, repo}
# change_failure_rate{env, repo}
# Grafana dashboard query for deployment frequency
rate(deployment_frequency_total{env="production",status="success"}[7d]) * 60 * 60 * 24
# = deploys per day (7-day rolling window)
7. Four Keys Project (Google)¶
The Four Keys project is an open-source implementation that:
- Ingests events from GitHub, GitLab, or Cloud Build via Pub/Sub
- Parses commit → deploy → incident chains
- Stores them in BigQuery
- Provides a Looker Studio dashboard showing all four DORA metrics
# Setup sketch (Terraform-based; follow the repo's README, as exact steps have changed over time)
git clone https://github.com/GoogleCloudPlatform/fourkeys
cd fourkeys
terraform init
terraform apply -var project_id=my-gcp-project
Note: the Four Keys repository has since been archived and is no longer actively maintained, though it remains a useful reference implementation. For a simpler route, managed tools such as Sleuth, LinearB, or Cortex integrate with GitHub/GitLab directly, without custom infrastructure.
8. SPACE Framework¶
SPACE (Satisfaction and well-being, Performance, Activity, Communication and collaboration, Efficiency and flow) is a complement to DORA that covers the developer-experience dimensions DORA doesn't address.
| Dimension | Example metrics |
|---|---|
| Satisfaction | Developer NPS, intent to stay, engagement surveys |
| Performance | Code quality (defect rate, test coverage), system reliability |
| Activity | PR throughput, commits per engineer, docs written |
| Communication | Code review turnaround time, PR size, meeting load |
| Efficiency | CI wait time, local build time, toil hours per week |
Gotcha: DORA metrics can be gamed. If you measure deployment frequency, teams can split deployments into trivial micro-deploys. If you measure lead time, teams can merge directly to main without review. Always pair DORA metrics with quality signals (change failure rate, customer satisfaction) to prevent Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure."
Key insight: Activity metrics (commits, PRs) alone are poor proxies for performance. A developer who writes 2 large PRs that ship clean may contribute more than one who writes 20 PRs that each need 3 revision cycles.
9. Developer Experience Metrics¶
Beyond DORA and SPACE, practical DevEx metrics that signal toil:
-- CI pipeline duration trend (target: under 10 minutes)
SELECT
DATE(created_at) AS day,
AVG(duration_minutes) AS avg_ci_duration_minutes,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_minutes) AS p95_ci_duration_minutes
FROM ci_runs
WHERE branch = 'main'
AND created_at >= NOW() - INTERVAL '30 days'
GROUP BY 1
ORDER BY 1;
-- Flaky test rate (tests that fail intermittently)
SELECT
test_name,
COUNT(*) AS total_runs,
SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) AS failures,
ROUND(100.0 * SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) / COUNT(*), 1) AS flake_rate_pct
FROM test_runs
WHERE created_at >= NOW() - INTERVAL '30 days'
GROUP BY 1
-- Postgres can't reference the flake_rate_pct alias in HAVING, so repeat the expression.
-- Always passing and always failing are both excluded: neither is flaky.
HAVING SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) < COUNT(*)
ORDER BY failures DESC;
-- PR review turnaround (time from open to first review; assumes a precomputed hours_to_first_review column)
SELECT
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY hours_to_first_review) AS p50,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY hours_to_first_review) AS p95
FROM pull_requests
WHERE opened_at >= NOW() - INTERVAL '30 days';
10. Improvement Loops — Turning Metrics Into Action¶
Measuring is not improving. The loop that creates improvement:
1. Measure (collect the 4 DORA metrics + DevEx metrics)
2. Identify the worst metric (biggest gap from Elite)
3. Hypothesize causes (blameless post-mortems, team surveys)
4. Run one experiment (change one thing, measure for 4 weeks)
5. Evaluate (did the metric move? in the right direction?)
6. Standardize or rollback
7. Repeat
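Step 5 of the loop can start as simply as comparing the metric's median before and after the experiment window. A minimal sketch (the 10% threshold and sample data are illustrative; a real evaluation should also consider sample size and variance):

```python
from statistics import median

def experiment_moved_metric(before, after, lower_is_better=True, min_change_pct=10.0):
    """Return (improved?, % change in median) for a before/after sample pair."""
    b, a = median(before), median(after)
    change_pct = 100.0 * ((b - a) if lower_is_better else (a - b)) / b
    return change_pct >= min_change_pct, change_pct

# Lead-time samples (hours) for the month before and after parallelizing CI.
before = [30, 42, 28, 55, 36, 48]
after = [22, 30, 19, 41, 25, 33]
improved, pct = experiment_moved_metric(before, after)
print(improved, round(pct, 1))  # True 29.5
```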
Common first wins by worst metric:
| Worst metric | First experiments |
|---|---|
| Low deploy frequency | Break large PRs into smaller ones; daily deploy ceremony |
| Long lead time | Parallelize CI; eliminate manual approval gates |
| High change failure rate | Add smoke tests post-deploy; implement canary deploys |
| High MTTR | Add symptom-based alerting; write runbooks for top 5 incident types |
Interview tip: When asked about measuring engineering productivity, lead with DORA metrics — they are the best-researched standard available. Avoid metrics like "lines of code" or "story points completed": these measure activity, not outcomes. The four DORA metrics correlate with business performance, which is what leadership actually cares about.
Quick Reference¶
# GitHub: count production deploys in the last 30 days
# (gh run list returns only 20 runs by default; raise --limit to cover the window)
gh run list --workflow=deploy.yml --limit 500 --json conclusion,createdAt \
| jq --arg since "$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
     '[.[] | select(.conclusion=="success" and .createdAt >= $since)] | length'
# Estimate lead time from git log
git log --merges --format="%H %ci" | head -20
# Cross-reference with deploy timestamps from your system
# PagerDuty: MTTR via API
curl -X GET "https://api.pagerduty.com/incidents?statuses[]=resolved&since=$(date -d '30 days ago' -Iseconds)&until=$(date -Iseconds)" \
-H "Authorization: Token token=YOUR_PD_TOKEN" \
-H "Accept: application/vnd.pagerduty+json;version=2" \
| jq '[.incidents[] | {
id: .incident_number,
created: .created_at,
resolved: .resolved_at,
mttr_minutes: ((.resolved_at | fromdateiso8601) - (.created_at | fromdateiso8601)) / 60
}]'
# Check Four Keys dashboard queries
# https://github.com/GoogleCloudPlatform/fourkeys/blob/main/queries/
# Run a team survey for satisfaction/devex signal
# Use the DORA survey questions: https://dora.dev/research/
Wiki Navigation¶
Prerequisites¶
- CI/CD Pipelines & Patterns (Topic Pack, L1)