Decision Tree: Managed vs Self-Hosted Service

Category: Architecture Decisions
Starting Question: "Should we use a managed service or self-host?"
Estimated traversal: 3-5 minutes
Domains: infrastructure, cost, operations, compliance, databases

The Tree

Should we use a managed service or self-host?
│
├── Is this service a core differentiator for your business?
│   ├── Yes →
│   │   └── Do you need customizations the managed service cannot provide?
│   │       ├── Yes → DECISION: Self-host (competitive moat justifies cost)
│   │       └── No  → WARNING: Verify that "differentiator" claim; likely still use managed
│   │
│   └── No →
│       └── Is the service stateful? (database, object storage, message queue)
│           ├── Yes →
│           │   └── Do you have compliance requirements about data locality or vendor access?
│           │       ├── Yes →
│           │       │   └── Can a managed service with your-region + BYOK satisfy the requirement?
│           │       │       ├── Yes → DECISION: Use managed service (with compliance controls)
│           │       │       └── No  → DECISION: Self-host (compliance forces it)
│           │       └── No  → DECISION: Use managed service (stateful + no compliance = managed wins)
│           │
│           └── No (stateless) →
│               └── Do you have 24/7 ops capacity to manage the service?
│                   ├── Yes →
│                   │   └── Does managed pricing exceed self-hosted TCO at your scale?
│                   │       ├── Yes →
│                   │       │   └── Is the TCO saving > 20% after eng time is included?
│                   │       │       ├── Yes → DECISION: Self-host (cost justified at scale)
│                   │       │       └── No  → DECISION: Use managed service (margin too thin)
│                   │       └── No  → DECISION: Use managed service (cheaper total cost)
│                   └── No  →
│                       └── Is the managed SLA sufficient for your availability requirement?
│                           ├── Yes → DECISION: Use managed service (ops capacity constraint)
│                           └── No  → WARNING: SLA gap — negotiate or architect multi-region
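The tree above can also be expressed as a small function, which is handy for checking that every branch ends in a terminal node. This is a minimal sketch: the `Answers` class, its field names, and `decide` are illustrative and not part of any existing tool. Each field corresponds to one yes/no question in the tree.

```python
from dataclasses import dataclass


@dataclass
class Answers:
    """Yes/no answers to the tree's questions (field names are hypothetical)."""
    core_differentiator: bool = False
    needs_unavailable_customization: bool = False
    stateful: bool = False
    has_compliance_requirements: bool = False
    managed_satisfies_compliance: bool = False  # region + BYOK is enough
    has_24x7_ops: bool = False
    managed_exceeds_self_hosted_tco: bool = False
    tco_saving_over_20_percent: bool = False
    managed_sla_sufficient: bool = False


def decide(a: Answers) -> str:
    """Walk the tree top to bottom and return the terminal node's label."""
    if a.core_differentiator:
        if a.needs_unavailable_customization:
            return "DECISION: Self-host (competitive moat justifies cost)"
        return "WARNING: Verify the differentiator claim; likely still use managed"

    if a.stateful:
        if not a.has_compliance_requirements:
            return "DECISION: Use managed service (stateful, no compliance constraint)"
        if a.managed_satisfies_compliance:
            return "DECISION: Use managed service (with compliance controls)"
        return "DECISION: Self-host (compliance forces it)"

    # Stateless branch: ops capacity first, then cost or SLA.
    if not a.has_24x7_ops:
        if a.managed_sla_sufficient:
            return "DECISION: Use managed service (ops capacity constraint)"
        return "WARNING: SLA gap; negotiate or architect multi-region"
    if not a.managed_exceeds_self_hosted_tco:
        return "DECISION: Use managed service (cheaper total cost)"
    if a.tco_saving_over_20_percent:
        return "DECISION: Self-host (cost justified at scale)"
    return "DECISION: Use managed service (margin too thin)"


if __name__ == "__main__":
    # A stateful, non-differentiating service with no compliance constraint.
    print(decide(Answers(stateful=True)))
```

Note that the Hybrid terminal action described later is not reachable from the tree itself, so this sketch does not return it.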

Node Details

Check 1: Core Business Differentiator

How to assess: Ask: "If a competitor used the same managed service, would they have equivalent capability in this area?" If yes, it is not a differentiator. True differentiators are things like a proprietary algorithm, unique data pipeline, or specialized model that gives you a competitive moat.

What you're looking for: The service being evaluated must provide unique customer value that cannot be replicated by using the same managed product. Infrastructure (databases, queues, caches, logging pipelines) almost never qualifies.

Common pitfall: Engineering teams frequently call infrastructure a "differentiator" because self-hosting it is technically interesting. Challenge this claim by asking the product team whether customers are paying for this capability specifically.

Check 2: Customization Requirements

How to assess: Write down the specific customization needed. Then check: (a) whether the managed service has a configuration option, extension, or plugin for it; (b) whether the customization is essential for launch or a nice-to-have.

What you're looking for: A concrete, launch-blocking feature gap between the managed offering and your requirement. "We want to modify the internals" is not a requirement; "the managed service does not support X and we need X for compliance" is.

Common pitfall: Conflating "we prefer to control this" with "we need to control this." Preference is not a requirement. Evaluate the actual feature gap, not the desire for control.

Check 3: Stateful vs. Stateless

How to assess: Does the service store data that must persist beyond a single request or container restart? Databases, caches with persistence (Redis with AOF/RDB), message queues, object storage, and search indexes are stateful. Load balancers, API gateways (without session state), and batch processors are stateless.

What you're looking for: Any service that manages durable data. Self-hosted stateful services carry the full burden of backup, restore, point-in-time recovery, replica failover, and schema migrations, all of which managed services have already solved at scale.

Common pitfall: Treating a database as "just another service to deploy." The operational cost of running a highly available, recoverable database is routinely underestimated by a factor of 3-5x in initial planning.

Check 4: Compliance: Data Locality and Vendor Access

How to assess: Review your compliance framework documentation for data residency requirements (must stay in EU, must not leave US-EAST, etc.). Check whether the compliance requirement restricts vendor access to encryption keys (BYOK, HYOK). Consult your security/compliance team for the exact standard.

What you're looking for: Written policy requirements that (a) restrict data to specific geographic regions, or (b) prohibit the cloud provider from having key access, or (c) require audit logs of all data access including by provider staff.

Common pitfall: Assuming compliance forces self-hosting without checking whether the managed provider offers a compliant tier. AWS GovCloud, Azure Government, and equivalent offerings, plus BYOK key management, satisfy many compliance frameworks within a managed context.

Check 5: 24/7 Ops Capacity

How to assess: Determine whether you have an on-call rotation that can respond within your SLO breach window at 3am on a Sunday. Count the number of engineers trained on the service who are in the rotation. Calculate the interrupt load: how many incidents per month does a comparable self-hosted service generate?

What you're looking for: At minimum 3 engineers trained and available for on-call coverage, with headroom in the rotation for incidents. Self-hosted services require proactive capacity management, version upgrades, security patch cycles, and backup verification, all of which consume engineering time.

Common pitfall: "We'll figure out on-call later." You will not figure it out later. The time to establish on-call is before the first 3am page, not during it. If you cannot staff 24/7 coverage today, assume you will not be able to staff it after launch either.

Check 6: TCO Comparison

How to assess: Build a spreadsheet and compute both sides over the same period. Managed cost: published pricing × expected monthly usage × 12 months. Self-hosted cost over the same 12 months: EC2/VM instances + storage + networking + (engineer hours/month × loaded hourly rate) + incident cost. Use 20% of a senior engineer's time as a floor for a non-trivial self-hosted service.

What you're looking for: A clear monetary difference, not a rough feeling. The 20% threshold (managed must cost at least 20% more than self-hosted after engineering time is included) accounts for the variance and risk in TCO estimates.

Common pitfall: TCO calculations that omit engineer time. "The managed service costs $2k/month but we could run it on $400/month of EC2" ignores that running it on EC2 costs 20-40 hours/month of senior engineer time, which at $150/hr loaded is $3,000-$6,000/month. The managed service is almost always cheaper when engineering time is included.
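The spreadsheet arithmetic from this check can be sketched in a few lines. The function names are illustrative, and the sketch assumes the 20% threshold means "managed costs more than 20% above self-hosted TCO"; adjust the comparison if your team defines the saving relative to the managed price instead.

```python
def self_hosted_tco(infra_cost: float, eng_hours: float,
                    loaded_hourly_rate: float,
                    incident_cost: float = 0.0) -> float:
    """Monthly self-hosted cost: infrastructure + engineer time + incidents."""
    return infra_cost + eng_hours * loaded_hourly_rate + incident_cost


def managed_premium(managed_cost: float, self_hosted: float) -> float:
    """Fraction by which managed exceeds self-hosted (negative if cheaper)."""
    return (managed_cost - self_hosted) / self_hosted


def self_host_justified(managed_cost: float, self_hosted: float,
                        threshold: float = 0.20) -> bool:
    """True only when the managed premium clears the threshold."""
    return managed_premium(managed_cost, self_hosted) > threshold


if __name__ == "__main__":
    # Figures from the pitfall above: $2k/month managed vs $400/month of EC2
    # plus 20 hours/month of engineer time at a $150/hr loaded rate.
    tco = self_hosted_tco(infra_cost=400, eng_hours=20, loaded_hourly_rate=150)
    print(f"self-hosted TCO: ${tco:,.0f}/month")
    print(f"managed premium: {managed_premium(2000, tco):.0%}")
    print(f"self-host justified: {self_host_justified(2000, tco)}")
```

With the pitfall's own numbers, self-hosting costs $3,400/month at the low end of the engineer-time range, so the managed service at $2,000/month is the cheaper option.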

Check 7: Managed SLA Sufficiency

How to assess: Find the managed service's SLA document (usually in the provider's legal/terms section). Convert the SLA percentage to a monthly downtime budget, here for a 30-day month: 99.9% = 43.2 min/month, 99.95% = 21.6 min/month, 99.99% = 4.3 min/month. Compare against your service's availability requirement.

What you're looking for: The managed SLA percentage must be equal to or better than your availability requirement, and equal leaves no headroom. Note that SLAs are about credit eligibility, not guaranteed uptime; real availability is typically better than the SLA floor.

Common pitfall: Treating a dependency SLA that merely matches your own service's SLA as sufficient. If your service has a 99.9% SLA to customers, a dependency with a 99.9% managed SLA provides zero headroom. Either require 99.99% for dependencies or architect for degraded-mode operation when the dependency is down.
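The SLA-to-downtime conversion is simple enough to compute rather than memorize. The helper below is a sketch with an illustrative name; it assumes a 30-day month, so the figures shift slightly if you use an average month (about 30.44 days) instead.

```python
def downtime_budget_minutes(sla_percent: float, days: float = 30.0) -> float:
    """Minutes of allowed downtime per period for a given SLA percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.9, 99.95, 99.99):
        print(f"{sla}% -> {downtime_budget_minutes(sla):.1f} min/month")
```

Comparing the budget for the dependency's SLA against the budget for your own target makes the "zero headroom" case visible: when the two budgets are equal, a single provider outage that uses its full allowance consumes your entire allowance as well.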

Terminal Actions

Decision: Use Managed Service

Choose: The managed offering from your primary cloud provider or a SaaS vendor.

Why: For non-differentiating services, especially stateful ones, the total cost of ownership of a managed service is lower than self-hosting once engineering time is honestly accounted for. Managed services provide SLAs, automated backups, point-in-time recovery, security patches, and version upgrades that would each require engineering investment to replicate.

Next step: Evaluate 2-3 managed options on: pricing at your projected scale, SLA, compliance certifications, migration path, and vendor lock-in severity. Prefer providers with data export guarantees.

Decision: Self-Host

Choose: Self-hosted deployment, typically containerized on Kubernetes or as a managed VM fleet.

Why: Self-hosting is justified when: (a) the service is a genuine business differentiator requiring customization not available in managed form, (b) compliance requirements cannot be satisfied by any managed offering, or (c) scale is large enough that managed costs exceed self-hosted TCO by more than 20% after full engineering cost accounting.

Next step: Document the operational runbook before deploying to production: backup/restore procedure, failover procedure, upgrade procedure, security patch SLA. Assign clear ownership. Establish on-call coverage. Set a quarterly review to reassess whether the justification still holds.

Decision: Use Managed Service with Compliance Controls

Choose: Managed service + customer-managed encryption keys (CMEK/BYOK) + region-restricted deployment + audit logging enabled.

Why: Modern managed cloud services support the majority of compliance frameworks without requiring self-hosting. Using managed infrastructure with appropriate controls satisfies data residency, encryption at rest with customer key control, and access logging requirements while retaining the operational benefits of a managed offering.

Next step: Document which specific controls are enabled and how they map to the compliance requirement. Get sign-off from your compliance team or auditor before go-live. Schedule annual review of the compliance posture.

Decision: Hybrid (Managed Control Plane, Self-Hosted Data Plane)

Choose: Use a managed control plane for orchestration, configuration, and metadata while keeping data-plane components that touch sensitive data in self-hosted infrastructure.

Why: Some services split cleanly into a control plane (which manages metadata and configuration) and a data plane (which processes or stores actual data). For compliance scenarios where data locality is required but operational simplicity is also desired, this pattern allows the vendor to manage complexity without touching regulated data.

Next step: Verify that the service you're evaluating actually supports this split (e.g., Confluent Cloud for Kafka supports bring-your-own-Kafka cluster). Not all services support a hybrid model. Validate with the vendor before designing around this assumption.

Warning: "Differentiator" Claim Not Validated

When: An engineering team has labeled a standard infrastructure service (database, cache, queue) as a business differentiator without product team validation.

Risk: Self-hosting a service that provides no competitive advantage consumes engineering capacity that could be spent on actual product differentiation. This is one of the most common forms of engineering over-investment.

Mitigation: Require the product manager or business owner to write one sentence explaining how self-hosting this service creates measurable customer value. If they cannot, use the managed service.

Warning: SLA Gap Between Requirement and Managed Offering

When: Your availability SLO requires 99.99% but the managed service only offers a 99.9% SLA.

Risk: Any managed service incident becomes an SLO breach. SLA credits do not protect you from breaching your own SLO, and you cannot hold the provider accountable for downtime that breaks your SLO but stays within their SLA.

Mitigation: Either architect for multi-region redundancy (run the managed service in two regions with failover), negotiate a higher SLA with the provider, or revise your availability requirement to be realistic for the dependency tier you're using.
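To see why the multi-region mitigation can close the gap, the sketch below estimates the combined availability of redundant deployments. It rests on two loud assumptions that rarely hold exactly: regional failures are statistically independent, and failover is instant and always works. Treat the result as an upper bound, not a forecast. The function name is illustrative.

```python
from typing import Iterable


def combined_availability(availabilities: Iterable[float]) -> float:
    """Availability of redundant replicas, as a fraction between 0 and 1.

    Assumes independent failures and perfect failover: the service is down
    only when every replica is down at the same time.
    """
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


if __name__ == "__main__":
    two_regions = combined_availability([0.999, 0.999])
    print(f"two regions at 99.9% each: {two_regions:.6f}")
    print(f"meets a 99.99% target: {two_regions >= 0.9999}")
```

Correlated outages (a shared control plane, a bad deploy rolled out to both regions) and failover time both reduce the real figure, which is why the mitigation lists negotiating the SLA and revising the requirement as alternatives.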

Edge Cases

Open-source software with a managed offering: When the underlying software is open source (PostgreSQL, Redis, Kafka, Elasticsearch), the "self-hosted" option is lower risk because the software is well-understood and community-supported. In these cases, managed is still recommended for stateful services, but self-hosting is better supported by tooling than it would be for proprietary software.
Rapid prototyping / pre-product-market-fit: Before you have meaningful traffic, the decision criteria shift. Use managed services for everything to minimize operational burden and preserve engineering focus on product discovery. Revisit after you have production load data.
Cost-at-scale inflection points: Managed service costs typically grow linearly with usage; self-hosted costs grow in steps (add another node). There is usually an inflection point where self-hosted becomes cheaper. For most teams, this occurs at $10k+/month of managed spend on a single service. Below that threshold, the engineering cost of self-hosting exceeds the price premium of managed.
Vendor lock-in for critical path services: If a managed service is on the critical path for your revenue-generating system, evaluate the migration cost before adopting. Not all managed services are equally portable. PostgreSQL on RDS can migrate to PostgreSQL on Aurora, to Cloud SQL, or to self-hosted Postgres using standard dump/restore or replication tooling. A proprietary service with no export API creates lock-in that is very costly to reverse.
Managed services in regulated industries (banking, healthcare): Some industries require that you maintain operational control of all infrastructure components that touch regulated data. In these cases, "self-host" may not be a choice — it may be a regulatory requirement. Validate with your compliance officer before defaulting to managed.
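The cost-at-scale inflection point described above can be located numerically once you have your own prices. The sketch below models managed cost as linear in usage and self-hosted cost as a step function (one more node per block of capacity) plus a fixed monthly engineering cost. All function names and all figures in the example are hypothetical placeholders, not benchmarks.

```python
import math
from typing import Optional


def managed_cost(usage: int, price_per_unit: float) -> float:
    """Managed cost grows linearly with usage."""
    return usage * price_per_unit


def self_hosted_cost(usage: int, node_capacity: int,
                     node_cost: float, eng_cost: float) -> float:
    """Self-hosted cost grows in steps: whole nodes plus fixed engineer time."""
    nodes = max(1, math.ceil(usage / node_capacity))
    return nodes * node_cost + eng_cost


def find_inflection(price_per_unit: float, node_capacity: int,
                    node_cost: float, eng_cost: float,
                    max_usage: int, step: int = 1) -> Optional[int]:
    """Smallest sampled usage at which self-hosting is strictly cheaper."""
    for usage in range(step, max_usage + 1, step):
        if (self_hosted_cost(usage, node_capacity, node_cost, eng_cost)
                < managed_cost(usage, price_per_unit)):
            return usage
    return None


if __name__ == "__main__":
    # Hypothetical inputs: $1 per usage unit managed, nodes that each handle
    # 1,000 units at $400/month, and $4,500/month of engineer time.
    point = find_inflection(price_per_unit=1.0, node_capacity=1000,
                            node_cost=400, eng_cost=4500,
                            max_usage=20000, step=100)
    print(f"self-hosting first becomes cheaper at usage = {point}")
```

Because the self-hosted curve is stepped, the comparison can flip back just past a node boundary, so it is worth checking a range of usage values around the first crossover rather than relying on a single point.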

Cross-References

Topic Packs: Infrastructure Fundamentals, Cloud Providers, Security
Related trees: Which Database, Where Should This Run, Monolith vs Microservices