DevOps

70 cards: 🟢 14 easy | 🟡 39 medium | 🔴 17 hard

🟢 Easy (14)

1. What is a git commit and what information does it store?

Show answer

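* In Git, a commit is a snapshot of your repo at a specific point in time. It records the snapshot (a tree), its parent commit(s), the author and committer with timestamps, and the commit message.
* The git commit command saves all staged changes, along with a brief description from the user, as a commit in the local repository.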

2. What is a merge in git and how does it combine branches?

Show answer

* Merging is Git's way of putting a forked history back together again. The git merge command lets you take the independent lines of development created by git branch and integrate them into a single branch.

3. What is caching? How does it work? Why is it important?

Show answer

Caching stores frequently used data that is computationally expensive or I/O-intensive to produce, and that changes rarely, somewhere it can be read back quickly. Cache layers range from CPU caches to distributed cache systems; in-memory caching and distributed caching are the common ones. A cache is typically a key-value data structure, such as a hash table or dictionary, that holds a copy of the data.

Remember: cache layers — CPU L1/L2 (ns), RAM/Memcached (us), Redis (ms), CDN (ms), Disk (ms). Each trades freshness for speed.
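A minimal sketch of the in-memory case, assuming a dictionary keyed by lookup key with a time-to-live per entry. The class name `TTLCache` and its methods are hypothetical, not taken from any library; the expiry is what trades freshness for speed.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache: a dict plus an expiry time per entry (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the example can be exercised without sleeping
        self._store: Dict[Hashable, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]  # fresh entry: skip the expensive call
        self.misses += 1
        value = compute()  # miss or expired entry: do the slow work once
        self._store[key] = (now + self.ttl, value)
        return value
```

A caller wraps the slow operation, for example `cache.get_or_compute("user:42", load_user)`, where `load_user` stands in for any expensive lookup. A shorter TTL gives fresher data at the cost of more misses.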

4. What is the core value often put forward when talking about postmortem?

Show answer

Blamelessness.
Postmortems need to be blameless, and this value should be restated at the beginning of every postmortem. This is the best way to ensure that people work together to find the root cause instead of trying to hide their own possible faults.

5. What is a merge conflict?

Show answer

* A merge conflict is an event that occurs when Git is unable to automatically resolve differences in code between two commits. When all the changes in the code occur on different lines or in different files, Git will successfully merge commits without your help.

Gotcha: conflicts happen when the SAME lines change in two branches. Git marks them with <<<<<<< / ======= / >>>>>>> markers.
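To make the marker layout concrete, here is a small sketch that extracts both sides of each conflict from a file's text. It assumes Git's default two-way conflict style (no `|||||||` base section), and the function name `find_conflicts` is hypothetical.

```python
from typing import List, Tuple


def find_conflicts(text: str) -> List[Tuple[str, str]]:
    """Return (ours, theirs) pairs for each conflict region found in the text."""
    conflicts: List[Tuple[str, str]] = []
    ours: List[str] = []
    theirs: List[str] = []
    state = None  # None = outside a conflict; "ours" / "theirs" = inside one
    for line in text.splitlines():
        if line.startswith("<<<<<<< "):
            state, ours, theirs = "ours", [], []
        elif line.startswith("=======") and state == "ours":
            state = "theirs"
        elif line.startswith(">>>>>>> ") and state == "theirs":
            conflicts.append(("\n".join(ours), "\n".join(theirs)))
            state = None
        elif state == "ours":
            ours.append(line)
        elif state == "theirs":
            theirs.append(line)
    return conflicts


sample = """\
def greet():
<<<<<<< HEAD
    return "hello"
=======
    return "hi"
>>>>>>> feature
"""
```

Everything between `<<<<<<<` and `=======` is the current branch's version; everything between `=======` and `>>>>>>>` is the incoming version. Resolving the conflict means keeping one side (or a blend) and deleting the markers.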

6. What is the role of monitoring in SRE?

Show answer

Google: "Monitoring is one of the primary means by which service owners keep track of a system’s health and availability"

Read more about it [here](https://sre.google/sre-book/introduction)

Remember: Google's four golden signals — Latency, Traffic, Errors, Saturation (LTES). The minimum effective monitoring for any production service.

7. What is DevOps and what problems does it solve?

Show answer

The definition of DevOps from selected companies:

**Amazon**:

"DevOps is the combination of cultural philosophies, practices, and tools that increases an organization’s ability to deliver applications and services at high velocity: evolving and improving products at a faster pace than organizations using traditional software development and infrastructure management processes."

8. What is a postmortem?

Show answer

A postmortem is a process that should take place after an incident. Its purpose is to identify the root cause of the incident and the actions that should be taken to prevent this kind of incident from happening again.

9. What is Version Control?

Show answer

* Version control is the system of tracking and managing changes to software code.
* It helps software teams to manage changes to source code over time.
* Version control also helps developers move faster and allows software teams to preserve efficiency and agility as the team scales to include more developers.

10. What is Toil in the context of SRE and DevOps?

Show answer

Google: Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows

Read more about it [here](https://sre.google/sre-book/eliminating-toil/)

Remember: TOIL = Manual, Repetitive, Automatable, Tactical, No enduring value, Linearly scaling. If work checks all six boxes, automate it.

11. One of your team members suggests setting a goal of "deploying at least 20 times a day" with regard to CD. What is your take on that?

Show answer

A couple of thoughts:

1. Why is it an important goal? Is it affecting the business somehow? Is it one of the KPIs? In other words, does it matter?
2. This might introduce risks such as losing quality in favor of quantity
3. You might want to set a possibly better goal such as "be able to deploy whenever we need to deploy"

12. What is "infrastructure as code"? Which implementations of IaC are you familiar with?

Show answer

IaC (infrastructure as code) is the practice of defining the infrastructure or architecture of a system in version-controlled files, usually declaratively, instead of configuring it by hand. Some implementations are ARM templates for Azure and Terraform, which can work across multiple cloud providers.

Example: Terraform (multi-cloud), CloudFormation (AWS), Pulumi (real code), Ansible (config + provisioning). IaC = reproducible, version-controlled infra.

13. What is a Software Repository?

Show answer

Wikipedia: "A software repository, or “repo” for short, is a storage location for software packages. Often a table of contents is stored, as well as metadata."

Read more [here](https://en.wikipedia.org/wiki/Software_repository)

Example: Docker Hub (containers), PyPI (Python), npm (JavaScript), Maven Central (Java), Crates.io (Rust). Each ecosystem has its own package registry.

14. What is Reliability? How does it fit DevOps?

Show answer

Reliability, when used in DevOps context, is the ability of a system to recover from infrastructure failure or disruption. Part of it is also being able to scale based on your organization or team demands.

Example: reliability = redundancy (multi-AZ) + monitoring (Prometheus) + auto-scaling + circuit breaking + chaos engineering.

🟡 Medium (39)

1. What are some practical implementations or practices of GitOps?

Show answer

* Store Infra files in a version control repository (like Git)
* Apply review/approval process for changes

Example: ArgoCD and Flux watch Git repos and auto-sync cluster state. Terraform configs and Helm values stored in Git — every change goes through code review.

2. Explain stateless vs. stateful

Show answer

Stateless applications don't store any data on the host, which makes them ideal for horizontal scaling and microservices.
Stateful applications depend on storage to save state and data; databases are typically stateful applications.

Example: REST API = stateless. PostgreSQL = stateful. K8s uses Deployments for stateless, StatefulSets for stateful workloads.

3. How to add a new worker node in Jenkins?

Show answer

Log into the Jenkins master and navigate to Manage Jenkins > Manage Nodes > New Node. Enter a name for the new node and select Permanent Agent. Configure SSH and click on Launch.

Gotcha: ensure agent has Java and network access to Jenkins master. Test with curl http://jenkins:8080 before configuring.

4. What are shared modules in Jenkins?

Show answer

Shared modules in Jenkins (the feature is officially called Shared Libraries) are collections of reusable code and resources that can be shared across multiple Jenkins jobs. This allows for easier maintenance, reduced duplication, and improved consistency across multiple build processes.

Example: Git repo with vars/deployToK8s.groovy. Pipelines use @Library('mylib') to import. Eliminates copy-paste across Jenkinsfiles.

5. What are the two main SRE KPIs?

Show answer

Service Level Indicators (SLI) and Service Level Objectives (SLO).

Remember: SLI measures, SLO targets, SLA contracts. Indicator → Objective → Agreement. SLI is the number, SLO is the goal, SLA has business consequences.

6. Can you explain the CI/CD process in your current project? Or can you talk about any CI/CD process that you have implemented?

Show answer

In the current project we use the following tools, orchestrated with Jenkins, to achieve CI/CD:
- Maven, Sonar, AppScan, ArgoCD, and Kubernetes

The entire process takes place in 8 steps:

1. Code Commit: Developers commit code changes to a Git repository hosted on GitHub.

Example: Git commit → Build → Test → Scan (SonarQube/AppScan) → Artifact → Deploy staging → Integration test → Deploy prod (ArgoCD to K8s).

7. What are the benefits of DevOps? What can it help us to achieve?

Show answer

* Collaboration
* Improved delivery
* Security
* Speed
* Scale
* Reliability

Remember: CSSRS — Collaboration, Security, Speed, Reliability, Scale. DevOps = culture + automation + measurement + sharing (CAMS model).

8. What are the anti-patterns of DevOps?

Show answer

A couple of examples:

* One person is in charge of specific tasks. For example there is only one person who is allowed to merge the code of everyone else into the repository.
* Treating production differently from development environment. For example, not implementing security in development environment
* Not allowing someone to push to production on Friday ;)

Gotcha: "Only Bob can deploy" = bus factor 1. "No Friday deploys" = deploy process too scary to trust. Both worth fixing.

9. What is the latest version of Jenkins, or which version of Jenkins are you using?

Show answer

This is a very simple question interviewers ask to understand if you are actually using Jenkins day-to-day, so always be prepared for this.

Gotcha: interviewers ask this to check you actually USE Jenkins daily. Know your version — check bottom-right of Jenkins UI.

10. Would you prefer a "configuration->deployment" model or "deployment->configuration"? Why?

Show answer

Both have advantages and disadvantages.
With "configuration->deployment" model for example, where you build one image to be used by multiple deployments, there is less chance of deployments being different from one another, so it has a clear advantage of a consistent environment.

Example: config→deploy = one image, many envs (12-factor). deploy→config = per-env images. Config→deploy is generally preferred for consistency.

11. What benefits does infrastructure-as-code have?

Show answer

- fully automated process of provisioning, modifying and deleting your infrastructure
- version control for your infrastructure which allows you to quickly rollback to previous versions
- validate infrastructure quality and stability with automated tests and code reviews
- makes infrastructure tasks less repetitive

12. How would you describe a successful DevOps engineer or a team?

Show answer

The answer can focus on:

* Collaboration
* Communication
* Set up and improve workflows and processes (related to testing, delivery, ...)
* Dealing with issues

Things to think about:

* What DevOps teams or engineers should NOT focus on or do?
* Do DevOps teams or engineers have to be innovative or practice innovation as part of their role?

Remember: great DevOps teams automate toil, share on-call, deploy frequently with confidence, treat incidents as learning opportunities.

13. What are the different ways to trigger Jenkins pipelines?

Show answer

Pipelines can be triggered in several ways. Briefly:

- Poll SCM: Jenkins can periodically check the repository for changes and automatically build if changes are detected. This can be configured in the "Build Triggers" section of a job.

Example: Poll SCM, webhook, cron schedule, remote API (curl), upstream job trigger — five main Jenkins trigger mechanisms.

14. What's your philosophy on automation?

Show answer

Automate the boring, repeatable, and dangerous. Not everything.

**Good candidates for automation**:
* Repetitive tasks (deployments, provisioning)
* Error-prone manual steps
* Dangerous operations (safer when scripted)
* Toil that doesn't require judgment

**Bad candidates for automation**:
* One-off tasks (time to automate > time to do)

Remember: automate the boring, repeatable, and dangerous. Not one-off tasks where automation time exceeds manual time.

15. What do you think about the following statement: "100% is the only right availability target for a system"

Show answer

Wrong. No system can guarantee 100% availability, as no system is safe from downtime.
Many systems and services will fall somewhere between 99% and 100% uptime (or at least this is how most systems and services should be).

Fun fact: five nines (99.999%) = 5.26 min/year downtime. Each additional nine is exponentially harder and more expensive. 100% is theoretical impossibility.
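The arithmetic behind the "nines" can be sketched as follows. The function name is hypothetical, and a year is assumed to be 365.25 days, which is what yields the 5.26 minutes quoted above.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, averaging in leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min/year")
```

Each added nine divides the allowed downtime by ten, which is why the cost of reaching the next target grows so quickly.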

16. What's the biggest mistake senior engineers make?

Show answer

Over-engineering before constraints are real.

**Symptoms**:
* Building for scale you don't have
* Abstract frameworks for single use cases
* "We might need this later"
* Complexity without corresponding value
* Not invented here syndrome

**Why it happens**:

Remember: YAGNI — You Aren't Gonna Need It. Build for today's constraints, not imagined future scale. Premature abstraction is technical debt.

17. What's the biggest mistake junior engineers make?

Show answer

Changing things before understanding blast radius.

**Symptoms**:
* "I'll just restart this service" (in production)
* Running commands from Stack Overflow without understanding
* "It worked in dev" mentality
* Not checking current state before changing it

**The fix**:
* Read before write - understand current state

Remember: "Read before write." Understand current state before changing it. Check blast radius before acting. Ask what happens if this fails.

18. Explain Declarative and Procedural styles. Do the technologies you are familiar with (or using) follow a procedural or declarative style?

Show answer

Declarative - You write code that specifies the desired end state
Procedural - You describe the steps to get to the desired end state

Declarative Tools - Terraform, Puppet, CloudFormation, Ansible (its modules describe desired state, although playbook tasks run in order)
Procedural Tools - Chef

To better emphasize the difference, consider creating two virtual instances/servers.

Remember: Declarative = WHAT (Terraform, Puppet). Procedural = HOW (Chef). Declarative is naturally idempotent — run twice, same result.
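The two-servers scenario can be sketched in plain Python. Both functions and the `server-N` naming are hypothetical stand-ins for what a real tool does against a cloud API; the point is only how each style behaves when the same instruction runs twice.

```python
from typing import List


def procedural_add_servers(servers: List[str], count: int) -> List[str]:
    """Procedural: 'create `count` servers'. Every run adds more."""
    start = len(servers)
    return servers + [f"server-{start + i}" for i in range(count)]


def declarative_ensure_servers(servers: List[str], desired: int) -> List[str]:
    """Declarative: 'there should be `desired` servers'. Reconcile towards that state."""
    if len(servers) < desired:
        return procedural_add_servers(servers, desired - len(servers))
    return servers[:desired]  # scale down, or change nothing, to match the desired state


once = procedural_add_servers([], 2)
twice = procedural_add_servers(once, 2)        # four servers: the steps ran again
stable = declarative_ensure_servers(declarative_ensure_servers([], 2), 2)  # still two
```

Running the procedural version twice doubles the fleet, while the declarative version converges on two servers no matter how often it runs.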

19. Can you use Jenkins to build applications with multiple programming languages using different agents in different stages?

Show answer

Yes, Jenkins can be used to build applications with multiple programming languages by using different build agents in different stages of the build process.

Jenkins supports multiple build agents, which can be used to run build jobs on different platforms and with different configurations.

20. Are you familiar with "The Cathedral and the Bazaar models"? Explain each of the models

Show answer

* Cathedral - source code released when software is released
* Bazaar - source code is always available publicly (e.g. Linux Kernel)

Name origin: Eric Raymond's 1997 essay. Cathedral = closed releases. Bazaar = open continuous development (Linux kernel model).

21. How to add a new plugin in Jenkins?

Show answer

Using the CLI,
`java -jar jenkins-cli.jar install-plugin `

Using the UI,

1. Click on the "Manage Jenkins" link in the left-side menu.
2. Click on the "Manage Plugins" link.

Gotcha: test plugins in staging first. Plugin conflicts can break pipelines. Pin plugin versions in production.

22. Two engineers in your team argue about where to put the configuration and infra related files of a certain application. One of them suggests putting them in the same repo as the application and the other one suggests putting them in their own separate repository. What's your take on that?

Show answer

One might say we need more details as to what these configuration and infra files look like exactly and how complex the application and its CI/CD pipeline(s) are, but in general, most of the time you will want to put configuration and infra related files in their own separate repository and not in the repository of the application, for multiple reasons:

23. What are some of the common plugins that you use in Jenkins?

Show answer

Be prepared for this question. You need to have at least 3-4 plugins at the top of your head, so that the interviewer feels you use Jenkins on a day-to-day basis.

Example: Pipeline, Git, Blue Ocean, Credentials Binding, Docker Pipeline, Kubernetes, Slack Notification. Name at least 4 confidently.

24. What deployment strategies are you familiar with or have used?

Show answer

There are several deployment strategies:
* Rolling
* Blue green deployment
* Canary releases
* Recreate strategy

Remember: RBCC — Rolling (gradual), Blue-Green (instant switch), Canary (small % test), Recreate (stop all, start new).
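As an illustration of the canary idea, the sketch below decides whether a request goes to the new version. Hashing a user ID into one of 100 buckets is an assumption chosen for the example, not something a particular platform mandates, and `routes_to_canary` is a hypothetical name.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Send a stable `canary_percent` slice of users to the new version."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # deterministic bucket in 0..99
    return bucket < canary_percent
```

Because the bucket depends only on the user ID, a given user sees the same version on every request, and raising the percentage only ever adds users to the canary group. Setting it to 100 completes the rollout; setting it to 0 is the rollback.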

25. What are some of the advantages of applying GitOps?

Show answer

* It introduces limited/granular access to infrastructure
* It makes it easier to trace who makes changes to infrastructure
* Declarative desired state in git enables drift detection and auto-remediation
* Pull requests provide a review and approval workflow for infrastructure changes
* Git history provides a complete audit trail of every change

Remember: GitOps = 'Git as the single source of truth for infrastructure.' Declarative state in a repo, reconciliation loop keeps reality matching the repo.

Example: ArgoCD and Flux are popular GitOps tools for Kubernetes — they watch a git repo and automatically apply changes to the cluster.

26. What are MTTF (mean time to failure) and MTTR (mean time to repair)? What do these metrics help us evaluate?

Show answer

* MTTF (mean time to failure), otherwise known as uptime, can be defined as how long the system runs before it fails.
* MTTR (mean time to repair), on the other hand, is the amount of time it takes to repair a broken system.
* MTBF (mean time between failures) is the amount of time between failures of the system.
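A short sketch of how the three metrics relate, assuming the common convention that one failure cycle is a period of uptime followed by a repair, so MTBF = MTTF + MTTR. Definitions vary between sources, and the function names here are hypothetical.

```python
def mtbf_hours(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures: one full cycle of running plus repairing."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up over a failure cycle."""
    return mttf_hours / (mttf_hours + mttr_hours)
```

For example, a system that runs 999 hours between failures and takes 1 hour to repair is available about 99.9% of the time. The formula also shows the two levers: fail less often (raise MTTF) or recover faster (lower MTTR).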

27. What do you take into consideration when choosing a tool/technology?

Show answer

A few ideas to think about:

* mature/stable vs. cutting edge
* community size
* architecture aspects - agent vs. agentless, master vs. masterless, etc.
* learning curve

Remember: MCAL — Maturity, Community, Architecture (agent vs agentless), Learning curve. Evaluate against your team's skills, not feature lists.

28. Why are there multiple software distributions? What differences can they have?

Show answer

Different distributions can focus on different things like: focus on different environments (server vs. mobile vs. desktop), support specific hardware, specialize in different domains (security, multimedia, ...), etc. Basically, different aspects of the software and what it supports, get different priority in each distribution.

29. You need to periodically install a package (unless it already exists) on different operating systems (Ubuntu, RHEL, ...). How would you do it?

Show answer

There are multiple ways to answer this question (there is no right and wrong here):

* Simple cron job
* Pipeline with configuration management technology (such as Puppet, Ansible, Chef, etc.)
...
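One way the "install unless it already exists" logic behind either option might look is sketched below. The OS-to-package-manager mapping, the choice of `dnf`, and all function names are assumptions for illustration; a configuration management tool would normally hide this behind its package resource.

```python
from typing import Callable, Dict, List

# Assumed mapping from the ID field of /etc/os-release to a package family.
FAMILY_BY_OS_ID: Dict[str, str] = {
    "ubuntu": "deb", "debian": "deb",
    "rhel": "rpm", "centos": "rpm", "fedora": "rpm",
}


def check_command(os_id: str, package: str) -> List[str]:
    """Command whose zero exit status is treated as 'already installed'."""
    if FAMILY_BY_OS_ID[os_id] == "deb":
        return ["dpkg", "-s", package]
    return ["rpm", "-q", package]


def install_command(os_id: str, package: str) -> List[str]:
    if FAMILY_BY_OS_ID[os_id] == "deb":
        return ["apt-get", "install", "-y", package]
    return ["dnf", "install", "-y", package]  # assumes dnf; older RHEL releases use yum


def ensure_installed(os_id: str, package: str, run: Callable[[List[str]], int]) -> bool:
    """Install only when the check fails. `run` executes a command and returns its exit code."""
    if run(check_command(os_id, package)) == 0:
        return False  # nothing to do, so repeated runs are harmless
    if run(install_command(os_id, package)) != 0:
        raise RuntimeError(f"failed to install {package}")
    return True
```

In real use `run` could be `lambda cmd: subprocess.run(cmd).returncode`, and a cron entry or pipeline schedule would call `ensure_installed` periodically. Keeping `run` injectable makes the decision logic testable without touching the system.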

30. How do you manage build artifacts?

Show answer

Build artifacts are usually stored in a repository. They can be used in release pipelines for deployment purposes. Usually there is a retention period on the build artifacts.

Example: Artifactory, Nexus, GitHub Packages store versioned outputs (Docker images, JARs, npm packages) with retention policies and access control.

31. What is an SRE team responsible for?

Show answer

Google: "the SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their services"

Read more about it [here](https://sre.google/sre-book/introduction)

Remember: SRE owns ALPMECEC — Availability, Latency, Performance, Monitoring, Emergency response, Change mgmt, Efficiency, Capacity planning.

32. What best practices are you familiar with regarding version control?

Show answer

* Use a descriptive commit message
* Make each commit a logical unit
* Incorporate others' changes frequently
* Share your changes frequently
* Coordinate with your co-workers
* Don't commit generated files
* Don't commit binary files

Gotcha: never commit generated files — they cause merge conflicts and bloat. Build artifacts belong in .gitignore, rebuilt by CI/CD.

33. When a repository is referred to as a "GitOps Repository", what does it mean?

Show answer

A repository that doesn't hold the application source code, but the configuration, infra, ... files that are required to test and deploy the application.

Example: app repo = source + Dockerfile. GitOps repo = Helm charts + Kustomize overlays + env values. ArgoCD watches the GitOps repo.

34. What types of tests are you familiar with?

Show answer

Styling, unit, functional, API, integration, smoke, scenario, ...

You should be able to explain those that you mention.

Remember: testing pyramid — Unit (fast, many) → Integration (medium) → E2E (slow, few). Pyramid shape = fast feedback loop.

35. What are the differences between SRE and DevOps?

Show answer

Google: "One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel."

Read more about it [here](https://sre.google/sre-book/introduction)

Remember: "SRE implements DevOps" — adds concrete practices (error budgets, SLOs, toil budgets) on top of DevOps cultural principles.

36. Can you describe which tool or platform you chose to use in some of the following areas and how?

Show answer

This is a more practical version of the previous question where you might be asked additional specific questions on the technology you chose

* CI/CD - Jenkins, Circle CI, Travis, Drone, Argo CD, Zuul
* Provisioning infrastructure - Terraform, CloudFormation
* Configuration Management - Ansible, Puppet, Chef
* Monitoring & alerting - Prometheus, Nagios
* Logging - Logstash, Graylog

37. How to back up Jenkins?

Show answer

Backing up Jenkins is a very easy process. There are multiple default and configured files and folders in Jenkins that you might want to back up.
- Configuration: The `~/.jenkins` folder. You can use a tool like rsync to back up the entire directory to another location.

Remember: back up JENKINS_HOME — config.xml, jobs/, plugins/, secrets/. Use rsync or the ThinBackup plugin.

38. Explain "Software Distribution"

Show answer

Read [this](https://venam.nixers.net/blog/unix/2020/03/29/distro-pkgs.html) fantastic article on the topic.

From the article: "Thus, software distribution is about the mechanism and the community that takes the burden and decisions to build an assemblage of coherent software that can be shipped."

39. What ways are there to distribute software? What are the advantages and disadvantages of each method?

Show answer

* Source - Maintain the build script within the version control system so that users can build your app after cloning the repository. Advantage: users can quickly check out different versions of the application. Disadvantage: requires build tools installed on the user's machine.

🔴 Hard (17)

1. How does a web server work?

Show answer

We can understand web servers from two viewpoints:

(i) Hardware (ii) Software

(i) A web server is a remote computer that stores a website's component files (HTML, CSS and JavaScript files) and the web server software. It connects to the Internet and supports physical data interchange with other devices connected to the web.
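For the software viewpoint, a minimal sketch using Python's standard `http.server` module shows the core loop: listen on a port, map the request path to stored content, and reply with a status line, headers, and a body. The `PAGES` dictionary and the `fetch_once` helper are hypothetical and exist only to keep the example self-contained.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple

# Hypothetical site content; a real server would usually read files from disk.
PAGES = {"/": b"<html><body><h1>Hello</h1></body></html>"}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGES.get(self.path)
        status = 200 if body is not None else 404
        if body is None:
            body = b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_once(path: str = "/") -> Tuple[int, bytes]:
    """Start the server on a free local port, request one path, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: the OS picks a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", path)
        response = conn.getresponse()
        data = response.read()
        conn.close()
        return response.status, data
    finally:
        server.shutdown()
        server.server_close()
```

Production servers such as nginx or Apache add concurrency, TLS, caching, and access control on top of this same request/response cycle.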

2. When should you NOT fix a production issue immediately?

Show answer

When the fix has a larger blast radius than the problem, or when the cause is unclear.

Don't fix when:

1. During partial outage with unclear cause
- 10% of users affected
- Root cause unknown
- "Fix" might affect other 90%
- Worse: mask the real problem

2. When fix has larger blast radius
- Problem: one microservice degraded

Remember: cure must not be worse than disease. 10% affected + risky fix hitting the other 90% = wait, gather evidence, coordinate.

3. What makes a principal/staff Linux engineer different from a senior engineer?

Show answer

Principal engineers anticipate failure modes and design systems that limit blast radius.

Key differentiators:

1. Anticipates failure modes
- Doesn't just fix problems, predicts them
- "What happens when X fails?" before X fails
- Designs for failure, not just success
- Thinks in failure domains and cascades

2. Designs blast-radius limits
- Isolation between components

Remember: seniors fix problems. Principals prevent them with blast-radius limits, failure domains, and cascading failure protections.

4. Explain mutable vs. immutable infrastructure

Show answer

In the mutable infrastructure paradigm, changes are applied on top of the existing infrastructure, and over time
the infrastructure builds up a history of changes. Ansible, Puppet and Chef are examples of tools that
follow the mutable infrastructure paradigm.

Remember: Mutable = update in place (Ansible). Immutable = replace entirely (Terraform/Packer). Immutable prevents snowflake servers that drift over time.

5. What is JNLP and why is it used in Jenkins?

Show answer

In Jenkins, JNLP (Java Network Launch Protocol) is used to allow agents (also known as "slave nodes") to be launched and managed remotely by the Jenkins master instance. This allows Jenkins to distribute build tasks to multiple agents, providing scalability and improving performance.

When a Jenkins agent is launched using JNLP, it connects to the Jenkins master and receives build tasks, which it then executes. The results of the build are then sent back to the master and displayed in the Jenkins user interface.

6. How to deal with a configuration drift?

Show answer

Configuration drift can be avoided with a desired state configuration (DSC) implementation. A desired state configuration can be a declarative file that defines how a system should be. There are tools that enforce the desired state, such as Terraform or Azure DSC. Deployment strategies can be incremental or complete.

7. How do you store/secure/handle secrets in Jenkins?

Show answer

There are multiple ways to achieve this. Here is a brief explanation of the possible options.
- Credentials Plugin: Jenkins provides a credentials plugin that can be used to store secrets such as passwords, API keys, and certificates.

Gotcha: never echo secrets in logs. Use withCredentials block and external vaults (HashiCorp Vault, AWS Secrets Manager) for production.

8. What's the most dangerous assumption in Linux/infrastructure engineering?

Show answer

The system is doing what I think it is.

This assumption kills because:

1. Storage tells the truth
- Drives report success, data not written
- RAID says healthy, silent corruption
- Backup job green, restore fails
- "It said OK" means nothing

2. Monitoring sees everything
- Monitoring shows what you measure

Remember: "Trust but verify." The system does what you OBSERVE, not what you THINK. Every monitoring gap is an unverified assumption.

9. Why are snapshot-based backups dangerous?

Show answer

Snapshots capture crash-consistent state, not application-consistent state.

The illusion:
- "Snapshots are instant backups"
- "Point-in-time recovery"
- "Zero-downtime backups"

The reality:

1. Crash-consistent vs app-consistent
- Crash-consistent: what disk looks like if power cut
- App-consistent: what disk looks like after clean shutdown

Remember: crash-consistent != application-consistent. Snapshot captures mid-write disk state. Always quiesce apps before snapshotting.

10. What is an error budget?

Show answer

Atlassian: "An error budget is the maximum amount of time that a technical system can fail without contractual consequences."

Read more about it [here](https://www.atlassian.com/incident-management/kpis/error-budget)

Example: 99.9% SLO = 43.8 min/month error budget. Error budget = 1 - SLO. Burn it on outages → freeze feature deploys until it refills.
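The budget arithmetic can be sketched as below. The function names are hypothetical, and a month is assumed to be 30.44 days on average, which is what gives roughly 43.8 minutes for a 99.9% SLO.

```python
def error_budget_minutes(slo_percent: float, window_days: float = 30.44) -> float:
    """Minutes of unavailability the SLO tolerates over the window."""
    return (1 - slo_percent / 100) * window_days * 24 * 60


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: float = 30.44) -> float:
    """Fraction of the budget left; a negative value means the SLO was missed."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget
```

A team might gate risky releases on `budget_remaining(...)` staying above zero and freeze feature deploys once it is spent.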

11. How to set up an auto-scaling group for Jenkins in AWS?

Show answer

Here is a high-level overview of how to set up an autoscaling group for Jenkins in Amazon Web Services (AWS):
- Launch EC2 instances: Create an Amazon Elastic Compute Cloud (EC2) instance with the desired configuration and install Jenkins on it. This instance will be used as the base image for the autoscaling group.

12. What is a configuration drift? What problems is it causing?

Show answer

Configuration drift happens when, in an environment of servers with the exact same configuration and software, a certain server
or servers receive updates or configuration changes that the other servers don't get, and over time these servers become
slightly different from all the others.

This situation might lead to bugs that are hard to identify and reproduce.

Example: server A gets an emergency patch that B and C miss. Now A behaves differently under load. For resources managed by Terraform, terraform plan can surface drift before it causes outages.
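Detecting drift boils down to comparing each server's actual state with a shared baseline. The sketch below does that for package versions; the function name, server names, and version strings are all hypothetical.

```python
from typing import Dict, Optional, Tuple

Diff = Dict[str, Tuple[Optional[str], Optional[str]]]


def find_drift(baseline: Dict[str, str], servers: Dict[str, Dict[str, str]]) -> Dict[str, Diff]:
    """Map each drifted server to {setting: (expected, actual)}."""
    drift: Dict[str, Diff] = {}
    for name, actual in servers.items():
        diffs = {
            key: (baseline.get(key), actual.get(key))
            for key in baseline.keys() | actual.keys()
            if baseline.get(key) != actual.get(key)
        }
        if diffs:
            drift[name] = diffs
    return drift


baseline = {"openssl": "3.0.2", "nginx": "1.24"}
fleet = {
    "A": {"openssl": "3.0.13", "nginx": "1.24"},  # received the emergency patch
    "B": dict(baseline),
    "C": dict(baseline),
}
```

Here `find_drift(baseline, fleet)` reports only server A. Either the patch is rolled out to B and C and the baseline updated, or A is rebuilt from the baseline.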

13. Why do backups often succeed but restores fail?

Show answer

Backups test the write path, not the read path or application consistency.

Common restore failures:

1. No restore testing
- Backup job: green checkmarks for years
- First restore attempt: during disaster
- "We've never actually tried restoring"

2. Permission/ownership issues
- Backup as root, restore files owned by root
- Application runs as different user

Remember: untested backups are Schrodinger's backups. Test restores monthly. First restore attempt should never be during a real disaster.

14. A team member of yours suggests replacing the current CI/CD platform used by the organization with a new one. How would you reply?

Show answer

Things to think about:

* What do we gain from doing so? Are there new features in the new platform? Does the new platform deal with some of the limitations of the current platform?
* What is this suggestion based on? In other words, did he/she try out the new platform? Was there extensive technical research?
* What will the switch from one platform to another require from the organization?

Remember: migration cost includes retraining, pipeline rewriting, integration reconfiguration, and potential downtime. Evaluate carefully before switching.

15. Why do "five nines" (99.999%) systems fail catastrophically when they do fail?

Show answer

The same optimizations that achieve high availability create conditions for catastrophic failure.

The paradox:

1. Rare paths never tested
- 99.999% uptime = 5 minutes downtime/year
- Failure recovery code runs 5 min/year
- That code has bugs nobody found
- When it runs, it fails

2. Humans out of practice
- On-call never pages for this system

Fun fact: recovery code running 5 min/year has untested bugs. When it finally runs during an outage, IT fails too.

16. How do you decide build vs buy?

Show answer

Simple framework:

**Build if**:
* It's core value - competitive advantage
* Unique requirements that tools don't solve
* Long-term ownership makes sense
* You have the team to maintain it

**Buy if**:
* It's plumbing - everyone needs it

Remember: Build if core value. Buy if plumbing. Ownership cost often exceeds purchase cost. Most teams should buy observability, build differentiators.

17. Do you have experience with testing cross-project changes? (aka cross-dependency)

Show answer

Note: cross-dependency is when you have two or more changes to separate projects and you would like to test them in a mutual build instead of testing each change separately.