
Production Readiness Review: System Architecture

Overview

You are the newest member of the platform engineering team at Meridian, a mid-scale B2B SaaS company that provides real-time inventory management and order fulfillment for e-commerce retailers. The platform processes approximately 15,000 requests per second at peak, handles $2M+ in daily transaction volume, and serves 400 active tenants across North America and Europe.

The system runs on Kubernetes across three clusters (production, staging, and disaster recovery), backed by PostgreSQL for persistence, Redis for caching and session state, RabbitMQ for async messaging, and S3-compatible object storage for media and reports. Observability is provided by Prometheus, Grafana, Loki, and Tempo. Deployments flow through GitHub Actions into ArgoCD, which reconciles Helm releases against the clusters. Security is enforced through HashiCorp Vault, cert-manager, and Open Policy Agent (OPA). Cloud infrastructure is provisioned with Terraform; a small fleet of bare-metal edge nodes is managed with Ansible.

You are about to start on-call rotations. This document describes the system you will be responsible for.


Architecture Diagram

                            +-----------+
                            |   Users   |
                            +-----+-----+
                                  |
                            +-----v-----+
                            | CloudFront|
                            |   (CDN)   |
                            +-----+-----+
                                  |
                         +--------v--------+
                         | Ingress-NGINX   |
                         | (TLS termination|
                         |  rate limiting) |
                         +--------+--------+
                                  |
                      +-----------v-----------+
                      |   API Gateway (Kong)  |
                      |   - Auth/rate limit   |
                      |   - Request routing   |
                      |   - JWT validation    |
                      +-----------+-----------+
                                  |
           +----------+-----------+-----------+-----------+
           |          |           |           |           |
     +-----v----+ +---v------+ +-v--------+ +v--------+ +v---------+
     |  Auth    | | Order    | | Inventory| | Search  | | Billing  |
     |  Service | | Service  | | Service  | | Service | | Service  |
     +-----+----+ +---+------+ +----+-----+ +----+----+ +----+-----+
           |          |              |            |           |
           |    +-----v------+      |       +----v----+      |
           |    | Fulfillment|      |       | Elastic-|      |
           |    | Service    |      |       | search  |      |
           |    +-----+------+      |       +---------+      |
           |          |              |                        |
      +----v----+  +--v-------+  +--v-------+          +-----v-----+
      | Notifi- |  | Worker   |  | Report   |          | Stripe    |
      | cation  |  | Service  |  | Service  |          | Webhook   |
      | Service |  +--+-------+  +--+-------+          | Handler   |
      +---------+     |              |                  +-----------+
                      |              |
                 +----v----+   +----v----+
                 | RabbitMQ|   |   S3    |
                 | (async  |   | (media, |
                 |  queue) |   | reports)|
                 +---------+   +---------+

           +-------------+    +-----------+
           | PostgreSQL  |    |  Redis    |
           | (primary +  |    | (cache,   |
           |  2 replicas)|    |  sessions)|
           +-------------+    +-----------+

    Observability Stack              CI/CD Pipeline
    ==================              ==============
    Prometheus (metrics)            GitHub Actions (build/test)
    Grafana (dashboards)                |
    Loki (logs)                     Container Registry (GHCR)
    Tempo (traces)                      |
    Alertmanager (paging)           ArgoCD (GitOps sync)
                                        |
    Security                        Helm (release mgmt)
    ========                            |
    Vault (secrets)                 3 Clusters (prod/stg/DR)
    cert-manager (TLS)
    OPA/Gatekeeper (policy)

    Infrastructure
    ==============
    Terraform (AWS: VPC, EKS, RDS, ElastiCache, MSK, S3)
    Ansible (bare-metal edge nodes: 12 servers, 3 sites)
    Calico (CNI + NetworkPolicy enforcement)

Component Details

API Gateway: Kong

  • Purpose: Central entry point for all API traffic. Handles JWT validation, per-tenant rate limiting, request routing, and request/response transformation.
  • Dependencies: Ingress-NGINX (upstream), Auth Service (JWT public keys), Redis (rate-limit counters). Kong runs in DB-less mode: declarative config is stored in a ConfigMap, so there is no database dependency.
  • Failure modes: Pod crash causes 502s from ingress (mitigated by 3 replicas + PDB). Redis unavailability degrades rate limiting to permissive (fail-open) mode. A misconfigured route returns 404 for valid paths.
  • SLO: 99.95% availability, p99 added latency < 15 ms.
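
The per-tenant rate limiting above is backed by Redis counters. The sketch below models the same fixed-window counting in memory; it is an illustrative stand-in, not Kong's actual plugin or key layout (a real deployment would use Redis `INCR` with an `EXPIRE` equal to the window length).

```python
import time

class FixedWindowRateLimiter:
    """In-memory stand-in for Redis-backed per-tenant rate-limit counters."""

    def __init__(self, limit_per_minute, clock=time.time):
        self.limit = limit_per_minute
        self.clock = clock
        self.counters = {}  # (tenant, window) -> request count

    def allow(self, tenant):
        window = int(self.clock()) // 60      # current one-minute window
        key = (tenant, window)
        count = self.counters.get(key, 0) + 1
        self.counters[key] = count
        return count <= self.limit
```

Note the fail-open behavior documented above: if the counter store is unreachable, the gateway admits traffic rather than rejecting it.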

Auth Service

  • Purpose: User authentication (OAuth2/OIDC), API key management, tenant identity, JWT issuance and rotation.
  • Dependencies: PostgreSQL (user/tenant data), Redis (session cache, token blacklist), Vault (signing keys, OIDC client secrets).
  • Failure modes: PostgreSQL unreachable causes login failures. Vault token expiry blocks key rotation. A stale Redis cache can allow revoked tokens for up to 5 minutes (cache TTL).
  • SLO: 99.95% availability, p95 login latency < 200 ms.

Order Service

  • Purpose: Order lifecycle management: creation, validation, state transitions (pending, confirmed, fulfilled, canceled). Publishes events to RabbitMQ for downstream processing.
  • Dependencies: PostgreSQL (order data, ACID transactions), RabbitMQ (event publishing), Inventory Service (stock reservation via synchronous gRPC), Redis (idempotency keys).
  • Failure modes: RabbitMQ unavailable blocks event publishing; orders still persist but downstream processing stalls. Inventory Service timeout causes order creation to hang (circuit breaker trips after 5 s). Database connection pool exhaustion under load spikes.
  • SLO: 99.95% availability, p99 order creation < 500 ms, zero lost orders (data durability).
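
The Redis idempotency keys above let a retried order-creation request be replayed safely instead of creating a duplicate. A minimal in-memory sketch of the pattern (the dict stands in for a Redis `SET ... NX` plus a stored response; names are illustrative):

```python
class IdempotencyStore:
    """First request with a given key wins; retries replay the stored result."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored result

    def execute(self, key, create_order):
        if key in self._results:             # retry: replay, do not re-create
            return self._results[key], False
        result = create_order()
        self._results[key] = result
        return result, True
```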

Inventory Service

  • Purpose: Real-time stock level tracking, reservation management, warehouse sync. Exposes a gRPC API for internal services and a REST API for tenant dashboards.
  • Dependencies: PostgreSQL (stock data with row-level locking), Redis (hot stock cache, write-through), RabbitMQ (consumes fulfillment events to decrement stock).
  • Failure modes: Cache inconsistency can cause overselling (mitigated by DB-level pessimistic locking). Slow warehouse sync causes stale data for up to 15 minutes.
  • SLO: 99.9% availability, stock accuracy within 60 seconds of a warehouse update.
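
The overselling mitigation relies on pessimistic locking in the database. The sketch below shows the check-then-decrement pattern using SQLite's IMMEDIATE transaction as a stand-in for Postgres's `SELECT ... FOR UPDATE` (schema and function names are illustrative, not the service's actual code):

```python
import sqlite3

def reserve_stock(conn, sku, qty):
    """Atomically reserve qty units, or fail without changing anything.
    BEGIN IMMEDIATE takes the write lock before the read, playing the
    role that SELECT ... FOR UPDATE plays in PostgreSQL."""
    try:
        conn.execute("BEGIN IMMEDIATE")
        row = conn.execute(
            "SELECT on_hand FROM stock WHERE sku = ?", (sku,)
        ).fetchone()
        if row is None or row[0] < qty:
            conn.execute("ROLLBACK")
            return False
        conn.execute(
            "UPDATE stock SET on_hand = on_hand - ? WHERE sku = ?", (qty, sku)
        )
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

The connection must be in autocommit mode (`isolation_level=None`) so the explicit `BEGIN` is not shadowed by the driver's implicit transactions.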

Search Service

  • Purpose: Full-text product search, filtering, faceting, and autocomplete for tenant storefronts.
  • Dependencies: Elasticsearch (3-node cluster, 2 replicas per index), PostgreSQL (initial data load and reindex source), RabbitMQ (consumes product update events for near-real-time index updates).
  • Failure modes: Elasticsearch cluster goes yellow/red (reduced redundancy or blocked writes). A full reindex from PostgreSQL takes 45 minutes, during which search results may be stale. Mapping conflicts on schema changes block indexing.
  • SLO: 99.9% availability, p95 search latency < 150 ms, index lag < 30 seconds.

Billing Service

  • Purpose: Tenant subscription management, usage metering, invoice generation, and Stripe integration for payment processing.
  • Dependencies: PostgreSQL (billing records, usage counters), Stripe API (payment processing, webhook receiver), Redis (usage counter aggregation before flush to the DB).
  • Failure modes: Stripe API outage blocks payment processing (invoices queue for retry). Usage counters are lost on a Redis crash (mitigated by a periodic DB flush every 60 s). Webhook signature validation failure silently drops payment confirmations.
  • SLO: 99.9% availability, 100% invoice accuracy, payment processing within 30 s of trigger.
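
The usage-counter pattern above (Redis aggregation, 60 s flush to the DB) bounds the loss window on a counter-store crash to one flush interval. An in-memory model of that trade-off (names are illustrative; the dicts stand in for Redis and PostgreSQL):

```python
class UsageMeter:
    """Fast in-memory increments, periodically flushed to durable storage.
    At most flush_interval seconds of increments are lost on a crash of
    the counter store, matching the documented failure mode."""

    def __init__(self, flush_interval=60):
        self.flush_interval = flush_interval
        self.pending = {}     # tenant -> units since last flush ("Redis")
        self.durable = {}     # tenant -> flushed total ("PostgreSQL")
        self.last_flush = 0.0

    def record(self, tenant, units=1):
        self.pending[tenant] = self.pending.get(tenant, 0) + units

    def maybe_flush(self, now):
        if now - self.last_flush >= self.flush_interval:
            for tenant, count in self.pending.items():
                self.durable[tenant] = self.durable.get(tenant, 0) + count
            self.pending.clear()
            self.last_flush = now
```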

Fulfillment Service

  • Purpose: Coordinates order fulfillment: picks a warehouse, generates shipping labels, tracks shipment status, updates order state.
  • Dependencies: RabbitMQ (consumes order events, publishes fulfillment events), PostgreSQL (fulfillment records), S3 (shipping label storage), external carrier APIs (FedEx, UPS, USPS).
  • Failure modes: Carrier API timeouts delay label generation (retried with exponential backoff). Message processing failures cause redelivery loops (dead-lettered after 3 retries). S3 unavailability blocks label storage.
  • SLO: 99.9% availability, fulfillment processing within 5 minutes of order confirmation.
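
Carrier-API retries use exponential backoff. A sketch of the delay schedule (base, factor, and cap are illustrative defaults, not the service's actual tuning; production code would also add random jitter to avoid synchronized retries):

```python
def backoff_schedule(base=1.0, factor=2.0, cap=60.0, retries=5):
    """Delay (in seconds) before each retry attempt, capped at `cap`."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]
```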

Notification Service

  • Purpose: Sends transactional emails, SMS, and webhook notifications to tenants and end users.
  • Dependencies: RabbitMQ (consumes notification events), Redis (deduplication cache, per-tenant rate limiting), SendGrid (email), Twilio (SMS), PostgreSQL (notification log, template storage).
  • Failure modes: External provider outage delays notifications (failover from SendGrid to SES is configured). Template rendering errors produce garbled notifications. Rate limit misconfiguration can flood a tenant with notifications.
  • SLO: 99.5% delivery rate, email delivery within 60 s, SMS within 30 s.

Worker Service

  • Purpose: Background job processing: data imports, bulk operations, scheduled (cron-based) tasks, tenant data exports.
  • Dependencies: RabbitMQ (job queue), PostgreSQL (job state, tenant data), S3 (export output), Redis (job deduplication and distributed locking).
  • Failure modes: A long-running job blocks its queue consumer (mitigated by separate queues per job type). OOM kills on large data exports (memory limit 2Gi; large exports stream to S3). Cron job overlap on slow executions (a distributed lock prevents double-runs).
  • SLO: Job completion within 2x estimated duration, zero dropped jobs.
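
The distributed lock that prevents cron double-runs follows the Redis `SET key value NX PX ttl` pattern: a lease with a TTL and a per-holder token, so a stale holder cannot release a lease it no longer owns. An in-memory model of the semantics (illustrative, not the production client):

```python
import time

class LeaseLock:
    """Model of a Redis SET NX PX lease lock with token-checked release."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self._locks = {}  # name -> (holder token, expires_at)

    def acquire(self, name, token, ttl):
        now = self.clock()
        held = self._locks.get(name)
        if held and held[1] > now:        # live lease held by someone
            return False
        self._locks[name] = (token, now + ttl)
        return True

    def release(self, name, token):
        held = self._locks.get(name)
        if held and held[0] == token:     # only the holder may release
            del self._locks[name]
            return True
        return False
```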

Report Service

  • Purpose: Generates tenant analytics reports, dashboard data, and scheduled report delivery.
  • Dependencies: PostgreSQL read replica (heavy queries offloaded from the primary), S3 (report PDF/CSV storage), Redis (report cache, 15-minute TTL), RabbitMQ (scheduled report triggers).
  • Failure modes: Read replica lag causes stale report data. Large report generation drives high memory usage (mitigated by streaming pagination). S3 upload timeouts on large reports.
  • SLO: Report generation within 5 minutes, data freshness within 15 minutes of source.
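
Streaming pagination keeps large reports out of memory by fetching and yielding one page at a time. A generator sketch (`fetch_page` is a hypothetical query callback against the read replica; production code would prefer keyset pagination over OFFSET for deep pages):

```python
def paginate(fetch_page, page_size=1000):
    """Yield rows page by page so a large report never sits fully in memory."""
    offset = 0
    while True:
        rows = fetch_page(offset, page_size)
        if not rows:
            return
        yield from rows
        offset += len(rows)
```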

Data Stores

PostgreSQL

  • Version: 15.x, managed via RDS (prod/staging) and local container (dev)
  • Topology: Primary + 2 read replicas (prod), single instance (staging/DR)
  • Databases: meridian_auth, meridian_orders, meridian_inventory, meridian_billing, meridian_platform (shared)
  • Backup: Automated daily snapshots (RDS), WAL archival to S3, point-in-time recovery window of 7 days
  • Connection pooling: PgBouncer sidecar (transaction mode, max 200 connections per pod)
  • Failure modes: Primary failover takes 60-120s (automatic via RDS Multi-AZ). Connection pool exhaustion under load. Replica lag during bulk operations.

Redis

  • Version: 7.x, managed via ElastiCache (prod/staging)
  • Topology: Primary + 1 replica, cluster mode disabled
  • Uses: Session cache, rate limit counters, idempotency keys, hot stock cache, distributed locks, usage counter aggregation
  • Failure modes: Failover causes 10-30s connection interruption. Memory pressure triggers eviction (allkeys-lru policy). Split-brain during network partition.
  • Persistence: AOF disabled (cache-only; authoritative data is in PostgreSQL)

RabbitMQ

  • Version: 3.12.x, self-managed (Helm chart: bitnami/rabbitmq)
  • Topology: 3-node cluster with quorum queues (prod), single node (staging)
  • Exchanges: orders.events, inventory.events, fulfillment.events, notifications, reports.scheduled
  • Dead-letter handling: Failed messages routed to *.dlq queues after 3 retries, monitored via Prometheus exporter
  • Failure modes: Network partition causes cluster split (pause-minority mode configured). Disk alarm blocks publishers. Queue backup during consumer outage causes memory pressure.

Elasticsearch

  • Version: 8.x, self-managed (ECK operator)
  • Topology: 3 data nodes, 2 dedicated masters, 1 coordinator (prod)
  • Indices: Product catalog per tenant (rolling aliases), search analytics
  • Failure modes: Cluster yellow (lost replica), cluster red (lost primary shard). JVM heap pressure causes slow queries. Mapping explosion from uncontrolled dynamic fields.

S3 (AWS S3)

  • Buckets: meridian-media (product images), meridian-reports (generated reports), meridian-backups (DB snapshots, WAL archives), meridian-fulfillment (shipping labels)
  • Access: IAM roles for service accounts (IRSA), pre-signed URLs for tenant access
  • Lifecycle: Media retained indefinitely, reports archived to Glacier after 90 days, backups retained 30 days
  • Failure modes: Rare (S3 is 99.99% available). Pre-signed URL expiry causes download failures. CORS misconfiguration blocks browser uploads.

Observability Stack

Prometheus

  • Deployment: Prometheus Operator (kube-prometheus-stack Helm chart)
  • Retention: 15 days local, Thanos sidecar ships to S3 for long-term (1 year)
  • Scrape targets: All services expose /metrics, node-exporter on all nodes, kube-state-metrics, cAdvisor, RabbitMQ exporter, PostgreSQL exporter, Redis exporter, Elasticsearch exporter
  • Recording rules: Pre-computed SLO burn rates, aggregated request rates, p50/p95/p99 latencies
  • Alert rules: 85 active alert rules across 12 groups
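
The pre-computed SLO burn rates express how fast a service is spending its error budget. The arithmetic, for reference (the 14.4 fast-burn threshold is the common multi-window convention, not necessarily Meridian's tuning):

```python
def burn_rate(error_ratio, slo=0.9995):
    """Burn rate = observed error ratio / error budget.
    1.0 means the budget lasts exactly the SLO window; 14.4 on a
    1-hour window is a common fast-burn paging threshold."""
    budget = 1.0 - slo
    return error_ratio / budget
```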

Alertmanager

  • Routing: Severity-based routing to PagerDuty (critical/warning), Slack (info), email (daily digest)
  • Inhibition: Node-level alerts inhibit pod-level alerts on the same node
  • Silences: Managed via Alertmanager UI, requires justification comment
  • Escalation: Page -> 15 min ack timeout -> escalate to secondary -> 30 min -> escalate to engineering manager

Grafana

  • Dashboards: 24 dashboards organized by service, infrastructure, and SLO
  • Data sources: Prometheus (metrics), Loki (logs), Tempo (traces), PostgreSQL (business metrics)
  • Provisioned: All dashboards as code (ConfigMaps), no manual dashboard creation in prod

Loki

  • Deployment: Simple scalable mode (3 read, 3 write, 1 backend)
  • Retention: 30 days
  • Labels: namespace, pod, container, app, level
  • Storage: S3 backend for chunks, BoltDB Shipper for index

Tempo

  • Deployment: Distributed mode (distributor, ingester, querier, compactor)
  • Retention: 14 days
  • Instrumentation: OpenTelemetry SDK in all services, auto-instrumented HTTP/gRPC/DB
  • Sampling: 10% probabilistic sampling for normal traffic; errors and slow requests (>1s) are always kept (a tail-based decision, since these attributes are only known once the request completes)
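
The sampling policy reduces to a simple keep/drop decision per trace. A sketch of the rule as stated (`rng` is injectable for testing; the real decision lives in OpenTelemetry SDK/collector configuration, not application code):

```python
import random

def keep_trace(is_error, duration_s, rng=random.random):
    """Keep every error and every request slower than 1 s;
    sample 10% of the remaining traffic."""
    if is_error or duration_s > 1.0:
        return True
    return rng() < 0.10
```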

CI/CD Pipeline

Build (GitHub Actions)

Push to main
    |
    v
Lint + Unit Tests (parallel matrix: Python 3.11, Go 1.21)
    |
    v
Security Scan (Trivy container scan, Snyk dependency scan)
    |
    v
Docker Build + Push to GHCR
    |
    v
Helm chart lint + template validation
    |
    v
Update image tag in GitOps repo (argocd-manifests/)

Deploy (ArgoCD + Helm)

GitOps repo updated
    |
    v
ArgoCD detects the repo change (polling interval: 3 min)
    |
    v
Helm template rendered with environment-specific values
    |
    v
Staging auto-sync (immediate)
    |
    v
Prod manual sync (requires approval in ArgoCD UI)
    |        \
    v         v
Rolling update    Canary (Order Service only, via Argo Rollouts)
    |
    v
Post-deploy smoke tests (automated)
    |
    v
DR cluster sync (30 min delay, automated)

Rollback

  • Helm: helm rollback <release> <revision> — instant, reverts to previous manifest set
  • ArgoCD: Revert commit in GitOps repo, ArgoCD auto-syncs
  • Database: Migrations are forward-only; rollback requires a new forward migration
  • Time to rollback: < 5 minutes for application, 15-30 minutes if DB migration is involved

Security

Vault (HashiCorp)

  • Deployment: HA mode (3 pods, Raft storage), auto-unsealed via AWS KMS
  • Secret engines: KV v2 (application secrets), PKI (internal CA), Transit (encryption-as-a-service for PII)
  • Auth methods: Kubernetes auth (pod identity), AppRole (CI/CD), OIDC (human operators)
  • Secret injection: External Secrets Operator syncs Vault secrets to Kubernetes Secrets
  • Rotation: Database credentials rotated every 24h (Vault dynamic secrets), API keys rotated quarterly

cert-manager

  • Issuers: Let's Encrypt (public TLS via DNS-01 challenge), Vault PKI (internal mTLS)
  • Certificates: Ingress TLS (public), inter-service mTLS (internal CA, 90-day rotation)
  • Monitoring: cert-manager Prometheus metrics, alert when certificate expiry < 14 days

OPA / Gatekeeper

  • Policies enforced:
      - All containers must run as non-root
      - All pods must have resource limits
      - No :latest image tags
      - Only approved container registries (ghcr.io/meridian/*)
      - All namespaces must have NetworkPolicies
      - No privileged containers
      - PDB required for deployments with replicas > 1
  • Audit mode: New policies run in audit mode for 7 days before enforcement
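
Real enforcement is Rego evaluated by Gatekeeper at admission. The toy checker below only restates the same rules over a pod-spec dict to make the constraints concrete (field names follow the Kubernetes pod spec, but this is not how OPA is actually invoked):

```python
def violations(pod):
    """Return a list of policy violations for a simplified pod spec."""
    found = []
    for c in pod.get("containers", []):
        image = c.get("image", "")
        sec = c.get("securityContext", {})
        if image.endswith(":latest") or ":" not in image:
            found.append(f"{c['name']}: floating or :latest image tag")
        if not image.startswith("ghcr.io/meridian/"):
            found.append(f"{c['name']}: unapproved registry")
        if not c.get("resources", {}).get("limits"):
            found.append(f"{c['name']}: missing resource limits")
        if sec.get("runAsNonRoot") is not True:
            found.append(f"{c['name']}: must run as non-root")
        if sec.get("privileged"):
            found.append(f"{c['name']}: privileged container")
    return found
```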

Infrastructure

Terraform

  • Provider: AWS (us-east-1 primary, eu-west-1 DR)
  • Modules: VPC (subnetting, NAT, VPN), EKS (cluster, node groups, IRSA), RDS (PostgreSQL), ElastiCache (Redis), S3 (buckets + lifecycle), IAM (roles, policies), Route53 (DNS)
  • State: S3 backend with DynamoDB locking, per-environment state files
  • Workspaces: prod, staging, dr
  • Drift detection: Weekly terraform plan (automated, results posted to Slack)

Ansible

  • Target: 12 bare-metal edge nodes across 3 regional sites (used for CDN origin, local caching)
  • Playbooks: bootstrap.yml (OS setup, Docker, node-exporter), upgrade.yml (OS patches, Docker version), addons.yml (monitoring agents, log shippers)
  • Inventory: Dynamic inventory from CMDB API, grouped by site
  • Execution: Manual via bastion host, scheduled OS patching monthly via cron + Ansible

Networking

  • CNI: Calico (VXLAN mode)
  • Ingress: Ingress-NGINX with ModSecurity WAF rules
  • NetworkPolicies: Default-deny per namespace, explicit allow rules for inter-service communication
  • DNS: CoreDNS (in-cluster), Route53 (external), split-horizon for internal vs external resolution
  • Load balancing: AWS NLB (L4) fronting Ingress-NGINX, internal ClusterIP services

Cluster Topology

Production Cluster (us-east-1)

Node group    Instance type  Count  Purpose
system        m5.xlarge      3      Control plane add-ons, monitoring, ArgoCD
application   m5.2xlarge     6      Application workloads
data          r5.2xlarge     3      PostgreSQL operator, Elasticsearch, RabbitMQ
spot          m5.xlarge      2-8    Worker/Report service burst capacity (Spot instances)
  • Namespaces: meridian-prod, monitoring, argocd, vault, ingress, cert-manager, gatekeeper-system, elastic-system, rabbitmq
  • Total pods: ~120 (steady state)
  • PDBs: All stateful services and core application services have PDBs (minAvailable or maxUnavailable)

Staging Cluster (us-east-1)

Node group  Instance type  Count  Purpose
general     m5.large       3      All workloads (smaller replica counts)
  • Namespaces: Mirror of prod, single-replica deployments
  • Purpose: Pre-production validation, integration testing, performance baseline
  • Auto-sync: ArgoCD auto-syncs on GitOps repo update (no approval required)

DR Cluster (eu-west-1)

Node group   Instance type  Count  Purpose
system       m5.xlarge      2      Control plane add-ons
application  m5.xlarge      3      Application workloads (scaled down)
data         r5.xlarge      2      Read replicas, standby data stores
  • State: PostgreSQL cross-region read replica (async, ~1s lag), Redis not replicated (cold start on failover), RabbitMQ not replicated (messages in-flight are lost on failover)
  • Sync: ArgoCD syncs DR cluster 30 minutes after prod (intentional delay to catch bad deploys)
  • Failover: DNS failover via Route53 health checks (TTL 60s), manual promotion of DB replica
  • RPO: < 1 minute (PostgreSQL WAL streaming), RTO: < 15 minutes (DNS propagation + DB promotion + service verification)

On-Call Expectations

Rotation

  • Schedule: Weekly rotation, 2-person on-call (primary + secondary)
  • Hours: 24/7 for critical alerts, business hours only for warning-level
  • Handoff: Monday 10:00 AM, 30-minute handoff meeting with outgoing on-call
  • Tools: PagerDuty (paging), Slack #incidents (coordination), Grafana (investigation), Zoom (war room)

Responsibilities

The on-call engineer is responsible for:

  1. Acknowledging alerts within the response time for their severity (see Escalation Paths below)
  2. Triaging the issue: determine scope, impact, and urgency
  3. Mitigating the customer impact (even if root cause is not yet known)
  4. Escalating when needed (see escalation paths below)
  5. Communicating status via Slack #incidents and StatusPage (for customer-facing issues)
  6. Documenting actions taken in the incident channel
  7. Writing postmortems for any Sev1 or Sev2 incident within 48 hours

Escalation Paths

  • Sev1 (service down, data loss risk): 5 min ack. First responder: primary on-call. Escalation: secondary on-call (15 min) -> Engineering Manager (30 min) -> VP Engineering (1 hr).
  • Sev2 (degraded performance, partial outage): 15 min ack. First responder: primary on-call. Escalation: secondary on-call (30 min) -> team lead (1 hr).
  • Sev3 (non-critical issue, workaround exists): 1 hr ack. First responder: primary on-call. Ticket created, addressed next business day.
  • Sev4 (cosmetic, informational): next business day. On-call reviews in morning triage; ticket created and prioritized in sprint.

Key Dashboards

Dashboard        URL path            Purpose
System Overview  /d/system-overview  High-level health of all services
SLO Burn Rate    /d/slo-burn-rate    Error budget consumption per service
Order Pipeline   /d/order-pipeline   Order creation through fulfillment
Infrastructure   /d/infra-overview   Node health, resource usage, network
On-Call Summary  /d/oncall-summary   Active alerts, recent incidents, handoff notes

Common Runbook Entry Points

  • Pod CrashLoopBackOff: check logs, resource limits, readiness probes, recent deploys
  • High error rate on API Gateway: check upstream service health, recent deploys, rate limit config
  • Database connection errors: check PgBouncer pools, RDS events, connection count
  • RabbitMQ queue backup: check consumer health, message rate, DLQ count
  • Node NotReady: check kubelet, system resources, network, cloud provider events
  • Certificate expiry alert: check cert-manager logs, issuer status, DNS challenge
  • Vault sealed: check auto-unseal (KMS), pod restarts, storage backend

Network Topology

Internet
    |
[AWS NLB] (TCP 443, TLS passthrough)
    |
[Ingress-NGINX pods] (namespace: ingress, 3 replicas)
    |  - TLS termination (Let's Encrypt certs via cert-manager)
    |  - ModSecurity WAF
    |  - Rate limiting (global)
    |
[Kong pods] (namespace: meridian-prod, 3 replicas)
    |  - Per-tenant rate limiting
    |  - JWT validation
    |  - Request routing
    |
[Application Services] (namespace: meridian-prod)
    |  - ClusterIP services
    |  - Calico NetworkPolicy: default-deny ingress
    |  - Explicit allow from Kong namespace
    |  - mTLS between services (Vault PKI certs)
    |
[Data Stores]
    |  - PostgreSQL: RDS (private subnet, security group)
    |  - Redis: ElastiCache (private subnet, security group)
    |  - RabbitMQ: In-cluster (namespace: rabbitmq, NetworkPolicy restricted)
    |  - Elasticsearch: In-cluster (namespace: elastic-system)
    |  - S3: VPC endpoint (no internet traversal)

DNS

  • External: api.meridian.io -> Route53 -> NLB -> Ingress-NGINX
  • Internal: <service>.meridian-prod.svc.cluster.local -> CoreDNS
  • Split-horizon: db.internal.meridian.io resolves to RDS endpoint internally, NXDOMAIN externally

Firewall Rules (Calico NetworkPolicy summary)

Source             Destination         Ports                  Policy
ingress namespace  Kong pods           8000/TCP               Allow
Kong pods          All app services    8080/TCP, 9090 (gRPC)  Allow
App services       PostgreSQL (RDS)    5432/TCP               Allow (via security group)
App services       Redis (ElastiCache) 6379/TCP               Allow (via security group)
App services       RabbitMQ pods       5672/TCP               Allow
App services       Elasticsearch pods  9200/TCP               Allow
Prometheus         All pods            metrics port/TCP       Allow
All other          All other           *                      Deny (default)