Decision Tree: Where Should This Run?
Category: Architecture Decisions
Starting Question: "Where should we deploy this workload?"
Estimated traversal: 3-5 minutes
Domains: infrastructure, kubernetes, serverless, compute, cloud, on-prem
The Tree
Where should we deploy this workload?
│
├── Does the workload require GPU or specialized hardware?
│ ├── Yes →
│ │ └── Is cloud GPU availability and cost acceptable?
│ │ ├── Yes → DECISION: Cloud GPU VM or K8s node pool (spot/on-demand GPU)
│ │ └── No → DECISION: Bare metal (cost, compliance, or availability driver)
│ │
│ └── No →
│ └── Is the workload stateless and containerized?
│ ├── No (stateful, legacy, requires OS customization) →
│ │ └── Is it a database or durable storage system?
│ │ ├── Yes → DECISION: Managed database service (see which-database.md)
│ │ └── No → DECISION: VM (OS control, stateful, or legacy requirements)
│ │
│ └── Yes (stateless, containerized) →
│ └── Does traffic have a strong spike/zero pattern? (scales to zero required)
│ ├── Yes →
│ │ └── Is the expected request duration < 15 minutes?
│ │ ├── Yes →
│ │ │ └── Does the function need VPC/on-prem network access?
│ │ │ ├── Yes → DECISION: Serverless in VPC (Lambda/Cloud Run + VPC connector)
│ │ │ └── No → DECISION: Serverless (Lambda / Cloud Run / Azure Functions)
│ │ └── No →
│ │ └── Is startup latency (cold start) acceptable?
│ │ ├── Yes → DECISION: Serverless container (Cloud Run, Fargate)
│ │ └── No → DECISION: Container service with min-instances > 0
│ │
│ └── No (steady or predictable traffic) →
│ └── Do you have an existing Kubernetes cluster?
│ ├── Yes →
│ │ └── Does the team have Kubernetes operational expertise?
│ │ ├── Yes → DECISION: Kubernetes (existing cluster, team capable)
│ │ └── No → WARNING: Use managed container service; K8s without expertise is risky
│ └── No →
│ └── Does the workload complexity justify K8s cluster overhead?
│ ├── Yes (multiple services, complex networking, auto-scaling needs)
│ │ → DECISION: Kubernetes (managed: EKS, GKE, AKS)
│ └── No → DECISION: Managed container service (ECS, Cloud Run, App Service)
Node Details
Check 1: GPU or Specialized Hardware

How to assess: List the runtime requirements: CUDA cores required, specific accelerator (A100, H100, TPU, FPGA), InfiniBand networking, custom NIC offload, DPDK, or specific CPU instruction sets (AVX-512).

What you're looking for: Hardware requirements that cannot be satisfied by standard cloud instance types in the regions you operate, or where the cost of cloud GPU spot pricing is prohibitive at the required scale.

Common pitfall: Buying bare metal for GPU workloads without modeling the utilization rate. Below roughly 60-70% sustained GPU utilization, cloud GPU is typically cheaper than owned bare metal; under that threshold, owned hardware sits idle while you keep paying for it. Model your utilization before committing to a purchase.
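The utilization pitfall can be sanity-checked with a simple breakeven model. A minimal sketch; the hourly rate, purchase price, amortization period, and opex below are illustrative assumptions, not real quotes:

```python
# Sketch: breakeven utilization between renting cloud GPUs and owning bare metal.
# All prices are illustrative assumptions -- substitute your real numbers.

HOURS_PER_MONTH = 730

def monthly_cloud_cost(gpu_hourly_rate: float, utilization: float) -> float:
    """Cloud cost for one GPU, paying only for hours actually used."""
    return gpu_hourly_rate * HOURS_PER_MONTH * utilization

def monthly_owned_cost(purchase_price: float, amortization_months: int,
                       monthly_opex: float) -> float:
    """Owned hardware costs the same whether it is busy or idle."""
    return purchase_price / amortization_months + monthly_opex

def breakeven_utilization(gpu_hourly_rate: float, purchase_price: float,
                          amortization_months: int, monthly_opex: float) -> float:
    """Utilization above which owning becomes cheaper than renting."""
    owned = monthly_owned_cost(purchase_price, amortization_months, monthly_opex)
    return owned / (gpu_hourly_rate * HOURS_PER_MONTH)

# Hypothetical numbers: $2.50/hr cloud GPU vs a $25,000 server
# amortized over 36 months with $300/month power/colo opex.
be = breakeven_utilization(2.50, 25_000, 36, 300)
print(f"breakeven utilization: {be:.0%}")  # prints "breakeven utilization: 54%"
```

With these assumed numbers, owning only wins above ~54% sustained utilization, which is why modeling utilization first matters.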
Check 2: Stateless and Containerized

How to assess: Answer these: (a) Can you lose a running instance and restart it with no data loss? (b) Does the workload run in a Docker container without modification? (c) Does the workload require persistent local filesystem state that survives restarts?

What you're looking for: A workload that is fully stateless (no local state that matters) and already containerized or easily containerizable. Stateful workloads that write to local disk (not a mounted volume) are not cleanly containerizable without additional work.

Common pitfall: Confusing "runs in Docker" with "stateless." A containerized database is still stateful. Stateful containerized workloads require persistent volumes, careful scheduling, and graceful shutdown handling; not all platforms handle this equally well.
Check 3: Scales to Zero Requirement

How to assess: What is the traffic pattern? Plot requests per minute over the past 30 days. Is there a sustained baseline (even nights/weekends)? Or does the workload see zero traffic for extended periods?

What you're looking for: Workloads with no minimum traffic baseline: development environments, event-driven processors, infrequently called APIs, batch triggers. If you're paying for idle compute waiting for requests, scales-to-zero is a cost optimization opportunity.

Common pitfall: Using serverless for workloads with a steady baseline. If traffic is consistently 100 req/s 24/7, a fixed set of containers is cheaper than serverless invocation pricing. Run the math: 1M Lambda invocations × 500ms × 1GB memory is ~$8.35/month in compute, but 100 req/s around the clock is ~260M invocations/month, which is roughly $2,200/month on Lambda versus on the order of $50/month for two small ECS Fargate tasks.
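The math above can be reproduced directly. A sketch using Lambda's published on-demand rates at time of writing (verify current pricing); the Fargate figure is an assumed round number:

```python
# Sketch: monthly cost of steady traffic on Lambda vs fixed containers.
# Lambda rates are the published on-demand prices at time of writing;
# check current pricing before relying on the result.

GB_SECOND = 0.0000166667    # Lambda compute charge, $/GB-second
PER_REQUEST = 0.20 / 1e6    # Lambda request charge, $/invocation

def lambda_monthly(req_per_s: float, duration_s: float, memory_gb: float) -> float:
    """Monthly Lambda bill for a steady request rate (30-day month)."""
    invocations = req_per_s * 86_400 * 30
    compute = invocations * duration_s * memory_gb * GB_SECOND
    requests = invocations * PER_REQUEST
    return compute + requests

steady = lambda_monthly(req_per_s=100, duration_s=0.5, memory_gb=1.0)
print(f"Lambda at 100 req/s 24/7: ${steady:,.0f}/month")  # prints "$2,212/month"
print("Two small Fargate tasks:   ~$50/month (assumed figure)")
```

The crossover comes well before 100 req/s; once a baseline exists around the clock, per-invocation pricing loses to fixed capacity.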
Check 4: Request Duration < 15 Minutes

How to assess: Measure P99 execution time for the workload. Account for worst-case inputs: large files, slow external APIs, complex computations.

What you're looking for: Consistent completion within the hard limits of your serverless platform. AWS Lambda: 15 minutes max. Google Cloud Run: 60 minutes max (HTTP), no limit (Cloud Run Jobs). Azure Functions: 10 minutes (Consumption plan), unlimited (Premium).

Common pitfall: Designing a serverless function that runs close to the time limit on average. A function that averages 12 minutes on Lambda will hit the 15-minute limit on any slow execution. Either optimize to finish well under the limit or choose a different platform.
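This check reduces to a one-line guard. The platform limits below come from the figures above; the 0.5 safety factor is an assumed rule of thumb, not a platform requirement:

```python
# Sketch: does a workload fit comfortably under a platform's execution limit?
# Limits are the hard caps listed above; safety_factor=0.5 is an assumption.

PLATFORM_LIMITS_S = {
    "lambda": 15 * 60,                      # AWS Lambda hard limit
    "cloud_run_http": 60 * 60,              # Cloud Run HTTP request timeout max
    "azure_functions_consumption": 10 * 60, # Azure Functions Consumption plan max
}

def fits(platform: str, worst_case_s: float, safety_factor: float = 0.5) -> bool:
    """True if worst-case duration stays under safety_factor * hard limit."""
    return worst_case_s <= PLATFORM_LIMITS_S[platform] * safety_factor

print(fits("lambda", worst_case_s=12 * 60))         # prints False: too close to 15 min
print(fits("cloud_run_http", worst_case_s=12 * 60)) # prints True: ample headroom
```

The example mirrors the pitfall: a 12-minute worst case fails the margin check on Lambda but passes comfortably on Cloud Run.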
Check 5: VPC / On-Premises Network Access

How to assess: Does this workload need to call a service that is not publicly accessible? Examples: RDS in a private VPC, a legacy system in an on-premises data center, a service behind a VPN, or an ElastiCache instance.

What you're looking for: Private network connectivity requirements. Most serverless functions run outside your VPC by default and cannot reach resources on private subnets.

Common pitfall: Building a serverless function that needs database access without planning for VPC connectivity. Lambda in a VPC historically added 1-10 seconds to cold starts for per-invocation ENI attachment; since AWS moved to shared Hyperplane ENIs in 2019 the penalty is much smaller, but still nonzero. Cloud Run's VPC connector carries a fixed cost. Factor both into your decision and latency budget.
Check 6: Cold Start Latency Acceptable

How to assess: What is your P99 latency SLO for the endpoint? What is the expected cold start time for your function runtime? (Node.js/Python: 100-500ms; Java/JVM: 1-10s; container with large image: 2-15s). Multiply by your expected cold start rate given traffic patterns.

What you're looking for: The cold start latency budget must fit within your SLO's acceptable latency window. For user-facing endpoints with < 200ms SLOs, a 2-second cold start is disqualifying. For async background processing with no latency SLO, cold starts do not matter.

Common pitfall: Ignoring cold starts during development (no cold starts in steady-state load testing) and discovering them in production when traffic is low or spiky. Always test cold start behavior explicitly.
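The "multiply by your cold start rate" step has a useful simplification: if more than 1% of requests hit a cold container, the cold path lands inside your P99. A deliberately crude two-state model (warm or cold, ignoring warm-path latency variance):

```python
# Sketch: does cold-start latency show up at the 99th percentile?
# Simplified model: every request is either warm (base latency) or cold
# (base + cold-start penalty). If more than 1% of requests are cold,
# the cold path dominates P99.

def p99_latency_ms(base_ms: float, cold_start_ms: float,
                   cold_fraction: float) -> float:
    """P99 under the two-state warm/cold model."""
    return base_ms + cold_start_ms if cold_fraction > 0.01 else base_ms

# Hypothetical numbers: 80ms warm latency, 2s JVM cold start.
print(p99_latency_ms(80, 2000, cold_fraction=0.02))   # prints 2080 -- a 200ms SLO is blown
print(p99_latency_ms(80, 2000, cold_fraction=0.001))  # prints 80 -- cold starts invisible at P99
```

This is why measuring the cold start *rate* in production matters as much as the cold start duration itself.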
Check 7: Existing Kubernetes Cluster

How to assess: Check whether your organization runs a Kubernetes cluster (EKS, GKE, AKS, or self-managed) that this workload could join. Check the cluster's resource availability, network policies, and whether it is in the same cloud region as this workload's dependencies.

What you're looking for: An existing, well-operated Kubernetes cluster with available capacity, operated by a team with K8s expertise who can provide support.

Common pitfall: Creating a new Kubernetes cluster for a single workload. A Kubernetes cluster itself requires operational maintenance: node upgrades, control plane patches, certificate rotation, etcd backup. The overhead is justified when shared across many workloads; it is not justified for one service.
Check 8: Team Kubernetes Expertise

How to assess: Can your team answer without documentation: How do you debug a Pod stuck in Pending state? What is a PodDisruptionBudget and when do you need one? How do you roll back a failed Deployment? How do you evict a Pod from a draining node?

What you're looking for: At least 2 engineers who can operate a Kubernetes workload through an incident without needing to escalate to a Kubernetes specialist.

Common pitfall: Deploying to Kubernetes because "that's where everything runs" without ensuring the team can operate it. Kubernetes failures without expertise are particularly costly: misconfigured RBAC, resource limits not set, no liveness/readiness probes, missing PodDisruptionBudgets. These are all footguns that require K8s knowledge to avoid.
Terminal Actions
Decision: Kubernetes (Managed)

Choose: A managed Kubernetes cluster (EKS, GKE, AKS) with your workload deployed as a Deployment or StatefulSet.

Why: Kubernetes is the right platform for steady-traffic containerized workloads when you have an existing cluster, team expertise, and workload complexity that benefits from its scheduling, auto-scaling, and service discovery capabilities. It provides the richest set of controls for resource management, network policy, and rolling deployments.

Next step: Define resource requests and limits (required for scheduling). Add liveness and readiness probes. Configure HorizontalPodAutoscaler with CPU/memory metrics. Set PodDisruptionBudget for availability during node drains. Define Deployment strategy (RollingUpdate with maxUnavailable=0 for zero-downtime). Add to cluster monitoring with namespace-scoped dashboards.
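The next steps translate into a manifest shaped roughly like this. A minimal sketch, not a production template; the service name, image, ports, and resource values are all placeholders:

```yaml
# Sketch of the baseline fields called out above; all names and values
# are placeholders to adapt.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # zero-downtime rollout
      maxSurge: 1
  selector:
    matchLabels: {app: example-service}
  template:
    metadata:
      labels: {app: example-service}
    spec:
      containers:
        - name: app
          image: registry.example.com/example-service:1.0.0   # placeholder
          resources:
            requests: {cpu: 250m, memory: 256Mi}   # required for scheduling
            limits: {cpu: "1", memory: 512Mi}
          livenessProbe:
            httpGet: {path: /healthz, port: 8080}
          readinessProbe:
            httpGet: {path: /ready, port: 8080}
```

The HorizontalPodAutoscaler and PodDisruptionBudget are separate objects that select the same `app: example-service` labels.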
Decision: Serverless (Lambda / Cloud Run / Azure Functions)

Choose: A serverless function platform appropriate to your cloud provider.

Why: Serverless is the right choice for event-driven, variable-traffic, short-duration workloads. You pay only for invocations, scale to zero automatically, and eliminate cluster management overhead. Ideal for: API backends with variable traffic, event processors, scheduled jobs, webhook handlers.

Next step: Define function boundaries (one function per discrete operation, not a monolith-in-Lambda). Set memory and timeout values based on measured performance. Configure reserved concurrency to protect downstream services from Lambda auto-scaling overwhelming them. Implement structured logging to CloudWatch/Cloud Logging. Set up DLQ for async invocations.
Decision: Managed Container Service (ECS / Cloud Run / App Service)

Choose: A managed container orchestration service that abstracts cluster management (AWS ECS Fargate, Google Cloud Run, Azure Container Apps).

Why: Managed container services occupy the middle ground: you deploy containers without managing a Kubernetes control plane, but you get auto-scaling, service discovery, and load balancing. Appropriate when Kubernetes overhead is not justified but you want more control than serverless.

Next step: Define task/service resource sizing. Configure auto-scaling policy (target tracking based on CPU or request metrics). Set up health checks. Configure service-to-service networking. For ECS: use Fargate to avoid EC2 fleet management.
Decision: VM

Choose: Cloud virtual machines (EC2, GCE, Azure VMs), managed as an autoscaling group or individually.

Why: VMs are the right choice when: the workload requires OS-level customization (custom kernel modules, specific security hardening), the application is not containerizable, the workload is stateful in a way that does not fit container patterns, or you are running legacy software that requires a specific OS configuration.

Next step: Use managed instance groups / autoscaling groups rather than individually managed VMs. Define a base image with required packages pre-installed. Automate instance provisioning (Packer for image build, Terraform for group configuration). Install SSM Agent / Cloud Shell for access without SSH keys. Set up auto-healing based on health checks.
Decision: Bare Metal

Choose: Physical servers, either owned or via cloud bare-metal offerings (EC2 Bare Metal, GCP Bare Metal Solution, OVH, Equinix).

Why: Bare metal is justified when: GPU/accelerator requirements are best served by dedicated hardware at sustainable utilization, compliance or data sovereignty requirements prohibit shared-tenancy cloud, network performance requirements exceed what hypervisor networking can provide (DPDK, SR-IOV, RDMA), or cost analysis at sustained high utilization shows bare metal TCO below cloud.

Next step: Define the provisioning and lifecycle management tooling (PXE boot, IPMI/iDRAC, Ansible for configuration). Plan for hardware failure (redundant power, RAID or distributed storage). Establish a hardware refresh cycle. Ensure your bare metal is in a co-location facility with the security, power, and connectivity guarantees your compliance framework requires.
Decision: Cloud GPU VM or K8s GPU Node Pool

Choose: GPU-enabled cloud instances (for example, the p4d/p5 instance families on AWS, or the A2/A3 machine families on GCP) as standalone VMs or as a dedicated node pool in Kubernetes.
Why: Cloud GPUs provide the flexibility of cloud (no hardware ownership, scale up/down) with specialized compute. A Kubernetes GPU node pool allows GPU workloads to coexist with CPU workloads in a shared cluster, with GPU resource limits enforced by the scheduler.
Next step: Evaluate spot/preemptible GPU instances for batch workloads (60-90% cost savings at the cost of interruption). Install the NVIDIA device plugin for Kubernetes GPU scheduling. Set resource limits with nvidia.com/gpu: 1 in your Pod spec. Implement checkpointing for long training jobs to survive spot interruptions.
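The `nvidia.com/gpu` limit mentioned above goes in the container's resource spec. An illustrative fragment; the Pod name and image are placeholders, and the toleration only applies if your GPU node pool is tainted (a common but not universal setup):

```yaml
# Sketch: requesting one GPU for a Pod. Requires the NVIDIA device plugin
# to be installed on the cluster; names and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest   # placeholder
      resources:
        limits:
          nvidia.com/gpu: 1      # scheduled onto a node with a free GPU
  tolerations:                   # match the taint on your GPU node pool, if any
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```

GPUs are requested via limits only (no over-subscription), which is why a single `nvidia.com/gpu` entry is sufficient.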
Warning: K8s Without Expertise

When: Deploying to Kubernetes without team members who can operate it under incident conditions.

Risk: Kubernetes failures are opaque without expertise. A misconfigured admission controller, an incorrect RBAC policy, or a broken CSI driver can make a cluster unresponsive in ways that require deep K8s knowledge to diagnose and recover. Without expertise, a K8s incident becomes a major outage.

Mitigation: Use a managed container service (ECS Fargate, Cloud Run) until the team has built K8s expertise through training and incremental exposure. Alternatively, engage a platform team or SRE team to own cluster operations, with a clear escalation path for workload teams during incidents.
Warning: Serverless Cold Starts in Latency-Sensitive Path

When: Using serverless for a user-facing endpoint with a < 500ms P99 SLO and variable traffic (cold starts expected).

Risk: Cold starts in production will violate the SLO during low-traffic windows, after deployments (new version, all cold containers), and after scaling events.

Mitigation: Use provisioned concurrency (Lambda) or minimum instances (Cloud Run) to keep a warm pool. For Lambda, provisioned concurrency costs ~65% of a continuously-running Lambda; compare against running a container service. Measure cold start rate in production and size the warm pool accordingly.
Edge Cases
- Hybrid cloud with on-premises requirements: If data must not leave your data center (regulatory, latency, air-gap), the cloud options are eliminated for that data tier. You may run compute on cloud while storing data on-premises, using a private link or direct connect for the data path. Evaluate whether a managed on-premises solution (AWS Outposts, Azure Stack) is more cost-effective than fully self-managed.
- Edge computing and CDN workers: For workloads that must run close to end users globally (personalization, A/B testing, authentication at the edge), CDN edge workers (Cloudflare Workers, Fastly Compute, Lambda@Edge) are a distinct compute tier not covered by this tree. They have severe resource constraints (128MB memory, 50ms CPU limit) but sub-5ms latency globally.
- Machine learning inference at scale: Inference serving has a specific set of requirements (GPU batching, model loading time, high QPS at low latency) that make it a distinct workload category. Dedicated inference serving platforms (TorchServe, Triton Inference Server, SageMaker Inference) are often more appropriate than general-purpose Kubernetes for this workload.
- Development and staging environments: Scale-to-zero is almost always correct for non-production environments. Even if production runs on Kubernetes with minimum replicas, development should use serverless or scale-to-zero container services to minimize cost. Do not mirror production topology in staging if cost efficiency matters.
- Stateful Kubernetes workloads: StatefulSets with persistent volumes are supported in Kubernetes but add significant operational complexity compared to stateless workloads. PV provisioning, volume binding, graceful pod shutdown ordering, and backup procedures all require specific K8s knowledge. For new stateful workloads, strongly prefer a managed stateful service over StatefulSets unless you have deep K8s expertise.
Cross-References
- Topic Packs: Kubernetes, Serverless, Cloud Infrastructure
- Related trees: Managed vs Self-Hosted, Which Database, Service Mesh