Production Readiness Review: System Architecture
Overview
You are the newest member of the platform engineering team at Meridian, a mid-sized B2B SaaS company that provides real-time inventory management and order fulfillment for e-commerce retailers. The platform processes approximately 15,000 requests per second at peak, handles $2M+ in daily transaction volume, and serves 400 active tenants across North America and Europe.
The system runs on Kubernetes across three clusters (production, staging, and disaster recovery), backed by PostgreSQL for persistence, Redis for caching and session state, RabbitMQ for async messaging, and S3-compatible object storage for media and reports. Observability is provided by Prometheus, Grafana, Loki, and Tempo. Deployments flow through GitHub Actions into ArgoCD, which reconciles Helm releases against the clusters. Security is enforced through HashiCorp Vault, cert-manager, and Open Policy Agent (OPA). Cloud infrastructure is provisioned with Terraform; a small fleet of bare-metal edge nodes is managed with Ansible.
You are about to start on-call rotations. This document describes the system you will be responsible for.
Architecture Diagram
+-----------+
| Users |
+-----+-----+
|
+-----v-----+
| CloudFront|
| (CDN) |
+-----+-----+
|
+--------v--------+
| Ingress-NGINX |
| (TLS termination|
| rate limiting) |
+--------+--------+
|
+-----------v-----------+
| API Gateway (Kong) |
| - Auth/rate limit |
| - Request routing |
| - JWT validation |
+-----------+-----------+
|
+----------+-----------+-----------+-----------+
| | | | |
+-----v----+ +---v------+ +-v--------+ +v--------+ +v---------+
| Auth | | Order | | Inventory| | Search | | Billing |
| Service | | Service | | Service | | Service | | Service |
+-----+----+ +---+------+ +----+-----+ +----+----+ +----+-----+
| | | | |
| +-----v------+ | +----v----+ |
| | Fulfillment| | | Elastic-| |
| | Service | | | search | |
| +-----+------+ | +---------+ |
| | | |
+----v----+ +--v-------+ +--v-------+ +-----v-----+
| Notifi- | | Worker | | Report | | Stripe |
| cation | | Service | | Service | | Webhook |
| Service | +--+-------+ +--+-------+ | Handler |
+---------+ | | +-----------+
| |
+----v----+ +----v----+
| RabbitMQ| | S3 |
| (async | | (media, |
| queue) | | reports)|
+---------+ +---------+
+-------------+ +----------+
| PostgreSQL | | Redis |
| (primary + | | (cache, |
| 2 replicas)| | sessions)|
+-------------+ +----------+
Observability Stack CI/CD Pipeline
================== ==============
Prometheus (metrics) GitHub Actions (build/test)
Grafana (dashboards) |
Loki (logs) Container Registry (GHCR)
Tempo (traces) |
Alertmanager (paging) ArgoCD (GitOps sync)
|
Security Helm (release mgmt)
======== |
Vault (secrets) 3 Clusters (prod/stg/DR)
cert-manager (TLS)
OPA/Gatekeeper (policy)
Infrastructure
==============
Terraform (AWS: VPC, EKS, RDS, ElastiCache, MSK, S3)
Ansible (bare-metal edge nodes: 12 servers, 3 sites)
Calico (CNI + NetworkPolicy enforcement)
Component Details
API Gateway: Kong
Purpose : Central entry point for all API traffic. Handles JWT validation, per-tenant rate limiting, request routing, and request/response transformation.
Dependencies : Ingress-NGINX (upstream), Auth Service (JWT public keys), Redis (rate limit counters). Kong runs in DB-less mode: declarative configuration is stored in a ConfigMap, so there is no database dependency.
Failure modes : Pod crash causes 502s from ingress (mitigated by 3 replicas + PDB). Redis unavailable degrades rate limiting to permissive mode. Misconfigured route returns 404 for valid paths.
SLO : 99.95% availability, p99 added latency < 15ms.
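The per-tenant rate limiting and its degradation behavior can be sketched as follows. This is an illustrative model, not Kong's actual plugin code: a plain dict stands in for the Redis counters, and the window size, limit, and function names are assumptions.

```python
import time

WINDOW_SECONDS = 60
LIMIT_PER_TENANT = 1000  # hypothetical per-tenant limit

_counters = {}  # (tenant_id, window_start) -> request count; stands in for Redis


def allow_request(tenant_id, now=None, redis_available=True):
    """Fixed-window rate limit check, one counter per tenant per window.

    Mirrors the documented degradation: if the counter store (Redis) is
    unreachable, fail open (permissive mode) rather than reject traffic.
    """
    if not redis_available:
        return True  # permissive mode when Redis is down
    now = time.time() if now is None else now
    window = int(now // WINDOW_SECONDS)
    key = (tenant_id, window)
    _counters[key] = _counters.get(key, 0) + 1
    return _counters[key] <= LIMIT_PER_TENANT
```

Failing open is a deliberate trade-off: a Redis outage briefly removes throttling rather than turning into a full API outage.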
Auth Service
Purpose : User authentication (OAuth2/OIDC), API key management, tenant identity, JWT issuance and rotation.
Dependencies : PostgreSQL (user/tenant data), Redis (session cache, token blacklist), Vault (signing keys, OIDC client secrets).
Failure modes : PostgreSQL unreachable causes login failures. Vault token expiry blocks key rotation. Stale Redis cache can allow revoked tokens for up to 5 minutes (TTL).
SLO : 99.95% availability, p95 login latency < 200ms.
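The 5-minute revocation window above is worth internalizing before on-call. The sketch below models it; in-memory dicts stand in for Redis and the PostgreSQL source of truth, and all names are illustrative, not the Auth Service's real API.

```python
BLACKLIST_TTL = 300  # seconds; matches the documented 5-minute cache TTL

_db_blacklist = set()  # authoritative revocation list (stands in for PostgreSQL)
_cache = {}            # token -> (is_revoked, fetched_at); stands in for Redis


def revoke(token):
    # Revocation writes to the source of truth but does NOT invalidate
    # the cache entry -- this is the documented staleness window.
    _db_blacklist.add(token)


def is_token_valid(token, now):
    entry = _cache.get(token)
    if entry is None or now - entry[1] >= BLACKLIST_TTL:
        # Cache miss or expired: consult the source of truth and re-cache.
        revoked = token in _db_blacklist
        _cache[token] = (revoked, now)
        return not revoked
    return not entry[0]  # may be stale for up to BLACKLIST_TTL seconds
```

If an incident requires immediate revocation (e.g. a leaked token), the cache entry must be deleted explicitly rather than waiting out the TTL.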
Order Service
Purpose : Order lifecycle management: creation, validation, state transitions (pending, confirmed, fulfilled, canceled). Publishes events to RabbitMQ for downstream processing.
Dependencies : PostgreSQL (order data, ACID transactions), RabbitMQ (event publishing), Inventory Service (stock reservation via synchronous gRPC), Redis (idempotency keys).
Failure modes : RabbitMQ unavailable blocks event publishing; orders still persist but downstream processing stalls. Inventory Service timeout causes order creation to hang (circuit breaker trips after 5s). Database connection pool exhaustion under load spike.
SLO : 99.95% availability, p99 order creation < 500ms, zero lost orders (data durability).
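The Redis idempotency keys exist so that client retries (common during the failure modes above) never create duplicate orders. A minimal sketch of the pattern, with a dict standing in for the Redis keyspace and all names hypothetical:

```python
_idempotency_store = {}  # idempotency key -> order_id; stands in for Redis
_next_order_id = [1000]  # simple counter standing in for the database sequence


def create_order(idempotency_key, payload):
    """Create an order, or replay the prior result for a retried key."""
    if idempotency_key in _idempotency_store:
        # Retried request: return the original order instead of duplicating.
        return _idempotency_store[idempotency_key]
    order_id = _next_order_id[0]
    _next_order_id[0] += 1
    # ... persist `payload` to PostgreSQL, publish event to RabbitMQ ...
    _idempotency_store[idempotency_key] = order_id
    return order_id
```

In the real service the key would carry a TTL and the store/persist steps must commit atomically; this sketch only shows the control flow.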
Inventory Service
Purpose : Real-time stock level tracking, reservation management, warehouse sync. Exposes a gRPC API for internal services and a REST API for tenant dashboards.
Dependencies : PostgreSQL (stock data with row-level locking), Redis (hot stock cache, write-through), RabbitMQ (consumes fulfillment events to decrement stock).
Failure modes : Cache inconsistency causes overselling (mitigated by DB-level pessimistic locking). Slow warehouse sync causes stale data for up to 15 minutes.
SLO : 99.9% availability, stock accuracy within 60 seconds of warehouse update.
Search Service
Purpose : Full-text product search, filtering, faceting, and autocomplete for tenant storefronts.
Dependencies : Elasticsearch (3-node cluster, 2 replicas per index), PostgreSQL (initial data load and reindex source), RabbitMQ (consumes product update events for near-real-time index updates).
Failure modes : Elasticsearch cluster goes yellow/red (reduced redundancy or blocked writes). A full reindex from PostgreSQL takes 45 minutes, during which search results may be stale. Mapping conflicts on schema changes block indexing.
SLO : 99.9% availability, p95 search latency < 150ms, index lag < 30 seconds.
Billing Service
Purpose : Tenant subscription management, usage metering, invoice generation, Stripe integration for payment processing.
Dependencies : PostgreSQL (billing records, usage counters), Stripe API (payment processing, webhook receiver), Redis (usage counter aggregation before flush to DB).
Failure modes : Stripe API outage blocks payment processing (invoices queue for retry). Usage counter loss on Redis crash (mitigated by periodic DB flush every 60s). Webhook signature validation failure silently drops payment confirmations.
SLO : 99.9% availability, invoice accuracy 100%, payment processing within 30s of trigger.
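The "silently drops payment confirmations" failure mode is the one to fix first when on-call touches this service: a rejected webhook must be logged and counted, never discarded without a trace. The sketch below uses a simplified version of Stripe's v1 scheme (HMAC-SHA256 over "timestamp.payload") for illustration; the secret value and function names are assumptions, and the real handler should also check timestamp freshness.

```python
import hashlib
import hmac
import logging

logger = logging.getLogger("billing.webhooks")


def compute_signature(secret, timestamp, payload):
    """HMAC-SHA256 over '<timestamp>.<payload>', simplified Stripe v1 style."""
    signed = f"{timestamp}.{payload}".encode()
    return hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()


def verify_webhook(secret, timestamp, payload, received_sig):
    expected = compute_signature(secret, timestamp, payload)
    if not hmac.compare_digest(expected, received_sig):
        # Loudly record the rejection so a payment confirmation is
        # never lost without an audit trail.
        logger.warning("webhook signature mismatch (ts=%s)", timestamp)
        return False
    return True
```

An alert on the mismatch log line (via Loki) turns the silent-drop failure mode into a pageable signal.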
Fulfillment Service
Purpose : Coordinates order fulfillment: picks a warehouse, generates shipping labels, tracks shipment status, updates order state.
Dependencies : RabbitMQ (consumes order events, publishes fulfillment events), PostgreSQL (fulfillment records), S3 (shipping label storage), external carrier APIs (FedEx, UPS, USPS).
Failure modes : Carrier API timeout causes label generation delay (retry with exponential backoff). Message processing failure causes redelivery loop (dead-letter queue after 3 retries). S3 unavailability blocks label storage.
SLO : 99.9% availability, fulfillment processing within 5 minutes of order confirmation.
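The exponential-backoff retry mentioned above can be expressed as a delay schedule. This is a sketch with full jitter; the base and cap values are illustrative, not the service's actual configuration.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Delay before retry `attempt` (0-based): uniform in [0, min(cap, base * 2**attempt)).

    Full jitter spreads retries out so a carrier API outage does not
    produce a synchronized thundering herd when it recovers.
    """
    return rng() * min(cap, base * (2 ** attempt))
```

With these defaults, attempts 0..3 draw from windows of at most 1s, 2s, 4s, 8s, capped at 60s thereafter.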
Notification Service
Purpose : Sends transactional emails, SMS, and webhook notifications to tenants and end-users.
Dependencies : RabbitMQ (consumes notification events), Redis (deduplication cache, per-tenant rate limiting), SendGrid (email), Twilio (SMS), PostgreSQL (notification log, template storage).
Failure modes : External provider outage causes notification delay (failover from SendGrid to SES configured). Template rendering error causes garbled notifications. Rate limit misconfiguration causes tenant notification flood.
SLO : 99.5% delivery rate, email delivery within 60s, SMS within 30s.
Worker Service
Purpose : Background job processing: data imports, bulk operations, scheduled (cron-based) tasks, tenant data exports.
Dependencies : RabbitMQ (job queue), PostgreSQL (job state, tenant data), S3 (export output), Redis (job deduplication and distributed locking).
Failure modes : Long-running job blocks its queue consumer (mitigated by separate queues per job type). OOM kill on large data exports (memory limit 2Gi; large exports stream to S3). Cron job overlap on slow execution (distributed lock prevents double-run).
SLO : Job completion within 2x estimated duration, zero dropped jobs.
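The cron-overlap guard works like a lock with a TTL, equivalent in spirit to Redis `SET key value NX EX ttl`. A sketch of the semantics, with a dict standing in for Redis (names illustrative; a production lock also needs an owner token checked on release so one run cannot free another's lock):

```python
import time

_locks = {}  # lock name -> expiry timestamp; stands in for Redis keys with TTL


def acquire(name, ttl, now=None):
    """Try to take the lock; return False if a live holder exists.

    The TTL guarantees a crashed job cannot hold the lock forever:
    the lock self-expires and the next scheduled run proceeds.
    """
    now = time.time() if now is None else now
    expiry = _locks.get(name)
    if expiry is not None and expiry > now:
        return False  # another run still holds the lock
    _locks[name] = now + ttl
    return True
```

The TTL must exceed the job's worst-case runtime, otherwise a slow (but healthy) run loses its lock mid-flight and the double-run returns.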
Report Service
Purpose : Generates tenant analytics reports, dashboard data, and scheduled report delivery.
Dependencies : PostgreSQL read replica (heavy queries offloaded from the primary), S3 (report PDF/CSV storage), Redis (report cache, 15-minute TTL), RabbitMQ (scheduled report triggers).
Failure modes : Read replica lag causes stale report data. Large report generation causes high memory usage (mitigated by streaming pagination). S3 upload timeout on large reports.
SLO : Report generation within 5 minutes, data freshness within 15 minutes of source.
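The streaming-pagination mitigation means rows are fetched and written out one page at a time, so a large report never sits fully in memory. A minimal sketch of the pattern; the page size is an assumption, and in the real service the input would be a keyset-paginated query against the read replica rather than an in-memory iterable:

```python
PAGE_SIZE = 500  # illustrative; tune against replica query cost and memory


def paginate(rows, page_size=PAGE_SIZE):
    """Yield successive fixed-size pages from an iterable of rows.

    Because this is a generator, each page can be appended to the S3
    upload (multipart) and discarded before the next page is fetched.
    """
    page = []
    for row in rows:
        page.append(row)
        if len(page) == page_size:
            yield page
            page = []
    if page:
        yield page  # final partial page
```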
Data Stores
PostgreSQL
Version : 15.x, managed via RDS (prod/staging) and local container (dev)
Topology : Primary + 2 read replicas (prod), single instance (staging/DR)
Databases : meridian_auth, meridian_orders, meridian_inventory, meridian_billing, meridian_platform (shared)
Backup : Automated daily snapshots (RDS), WAL archival to S3, point-in-time recovery window of 7 days
Connection pooling : PgBouncer sidecar (transaction mode, max 200 connections per pod)
Failure modes : Primary failover takes 60-120s (automatic via RDS Multi-AZ). Connection pool exhaustion under load. Replica lag during bulk operations.
Redis
Version : 7.x, managed via ElastiCache (prod/staging)
Topology : Primary + 1 replica, cluster mode disabled
Uses : Session cache, rate limit counters, idempotency keys, hot stock cache, distributed locks, usage counter aggregation
Failure modes : Failover causes 10-30s connection interruption. Memory pressure triggers eviction (allkeys-lru policy). Split-brain during network partition.
Persistence : AOF disabled (cache-only; authoritative data is in PostgreSQL)
RabbitMQ
Version : 3.12.x, self-managed (Helm chart: bitnami/rabbitmq)
Topology : 3-node cluster with quorum queues (prod), single node (staging)
Exchanges : orders.events, inventory.events, fulfillment.events, notifications, reports.scheduled
Dead-letter handling : Failed messages routed to *.dlq queues after 3 retries, monitored via Prometheus exporter
Failure modes : Network partition causes cluster split (pause-minority mode configured). Disk alarm blocks publishers. Queue backup during consumer outage causes memory pressure.
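The dead-letter policy above can be stated as a routing rule. This is a simplified sketch, not broker configuration: with quorum queues RabbitMQ tracks redeliveries in the x-delivery-count header and a delivery-limit policy handles the routing; the retry constant and function here are illustrative.

```python
MAX_RETRIES = 3  # matches the documented dead-letter threshold


def route_failed_message(queue, delivery_count):
    """Return where a failed message goes next: requeue or dead-letter."""
    if delivery_count >= MAX_RETRIES:
        # Give up: park in the DLQ for inspection (monitored via the
        # Prometheus exporter) instead of looping forever.
        return f"{queue}.dlq"
    return queue  # redeliver for another attempt
```

A growing DLQ count is the on-call signal that a consumer is poisoned or a payload is malformed; messages there are never retried automatically.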
Elasticsearch
Version : 8.x, self-managed (ECK operator)
Topology : 3 data nodes, 2 dedicated masters, 1 coordinator (prod)
Indices : Product catalog per tenant (rolling aliases), search analytics
Failure modes : Cluster yellow (lost replica), cluster red (lost primary shard). JVM heap pressure causes slow queries. Mapping explosion from uncontrolled dynamic fields.
S3 (AWS S3)
Buckets : meridian-media (product images), meridian-reports (generated reports), meridian-backups (DB snapshots, WAL archives), meridian-fulfillment (shipping labels)
Access : IAM roles for service accounts (IRSA), pre-signed URLs for tenant access
Lifecycle : Media retained indefinitely, reports archived to Glacier after 90 days, backups retained 30 days
Failure modes : Rare (S3 is 99.99% available). Pre-signed URL expiry causes download failures. CORS misconfiguration blocks browser uploads.
Observability Stack
Prometheus
Deployment : Prometheus Operator (kube-prometheus-stack Helm chart)
Retention : 15 days local, Thanos sidecar ships to S3 for long-term (1 year)
Scrape targets : All services expose /metrics, node-exporter on all nodes, kube-state-metrics, cAdvisor, RabbitMQ exporter, PostgreSQL exporter, Redis exporter, Elasticsearch exporter
Recording rules : Pre-computed SLO burn rates, aggregated request rates, p50/p95/p99 latencies
Alert rules : 85 active alert rules across 12 groups
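The SLO burn-rate recording rules reduce to one piece of arithmetic worth knowing cold during an incident: burn rate is the observed error rate divided by the error budget. A burn rate of 1.0 spends exactly the budget over the SLO window; large multiples mean the budget dies in hours. Sketch (function name illustrative):

```python
def burn_rate(error_ratio, slo):
    """Burn rate = observed error fraction / error budget.

    error_ratio: fraction of failed requests over the lookback window.
    slo: availability target as a fraction, e.g. 0.999.
    """
    budget = 1.0 - slo
    return error_ratio / budget
```

For example, a 99.9% SLO has a 0.1% budget, so a sustained 1.44% error rate is a 14.4x burn, the classic fast-burn paging threshold.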
Alertmanager
Routing : Severity-based routing to PagerDuty (critical/warning), Slack (info), email (daily digest)
Inhibition : Node-level alerts inhibit pod-level alerts on the same node
Silences : Managed via Alertmanager UI, requires justification comment
Escalation : Page -> 15 min ack timeout -> escalate to secondary -> 30 min -> escalate to engineering manager
Grafana
Dashboards : 24 dashboards organized by service, infrastructure, and SLO
Data sources : Prometheus (metrics), Loki (logs), Tempo (traces), PostgreSQL (business metrics)
Provisioned : All dashboards as code (ConfigMaps), no manual dashboard creation in prod
Loki
Deployment : Simple scalable mode (3 read, 3 write, 1 backend)
Retention : 30 days
Labels : namespace, pod, container, app, level
Storage : S3 backend for chunks, BoltDB Shipper for index
Tempo
Deployment : Distributed mode (distributor, ingester, querier, compactor)
Retention : 14 days
Instrumentation : OpenTelemetry SDK in all services, auto-instrumented HTTP/gRPC/DB
Sampling : Head-based sampling at 10% for normal traffic, 100% for errors and slow requests (>1s)
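The sampling policy above can be expressed as a single decision function. This is a sketch of the policy logic, not the actual OpenTelemetry sampler configuration; the constants mirror the documented values and the function name is illustrative.

```python
import random

SLOW_THRESHOLD_S = 1.0  # documented slow-request cutoff
BASE_RATE = 0.10        # documented 10% rate for normal traffic


def keep_trace(is_error, duration_s, rng=random.random):
    """Decide whether a completed trace is kept."""
    if is_error or duration_s > SLOW_THRESHOLD_S:
        return True  # errors and slow requests are always kept
    return rng() < BASE_RATE  # 10% of normal traffic
```

Note the practical consequence for debugging: absence of a trace for a fast, successful request is expected (90% are dropped), while every error should have one.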
CI/CD Pipeline
Build (GitHub Actions)
Push to main
|
v
Lint + Unit Tests (parallel matrix: Python 3.11, Go 1.21)
|
v
Security Scan (Trivy container scan, Snyk dependency scan)
|
v
Docker Build + Push to GHCR
|
v
Helm chart lint + template validation
|
v
Update image tag in GitOps repo (argocd-manifests/)
Deploy (ArgoCD + Helm)
GitOps repo updated
|
v
ArgoCD detects drift (polling interval: 3 min)
|
v
Helm template rendered with environment-specific values
|
v
Staging auto-sync (immediate)
|
v
Prod manual sync (requires approval in ArgoCD UI)
| \
v v
Rolling update Canary (Order Service only, via Argo Rollouts)
|
v
Post-deploy smoke tests (automated)
|
v
DR cluster sync (30 min delay, automated)
Rollback
Helm : helm rollback <release> <revision> — instant, reverts to previous manifest set
ArgoCD : Revert commit in GitOps repo, ArgoCD auto-syncs
Database : Migrations are forward-only; rollback requires a new forward migration
Time to rollback : < 5 minutes for application, 15-30 minutes if DB migration is involved
Security
Vault (HashiCorp)
Deployment : HA mode (3 pods, Raft storage), auto-unsealed via AWS KMS
Secret engines : KV v2 (application secrets), PKI (internal CA), Transit (encryption-as-a-service for PII)
Auth methods : Kubernetes auth (pod identity), AppRole (CI/CD), OIDC (human operators)
Secret injection : External Secrets Operator syncs Vault secrets to Kubernetes Secrets
Rotation : Database credentials rotated every 24h (Vault dynamic secrets), API keys rotated quarterly
cert-manager
Issuers : Let's Encrypt (public TLS via DNS-01 challenge), Vault PKI (internal mTLS)
Certificates : Ingress TLS (public), inter-service mTLS (internal CA, 90-day rotation)
Monitoring : cert-manager Prometheus metrics, alert when certificate expiry < 14 days
OPA / Gatekeeper
Policies enforced :
All containers must run as non-root
All pods must have resource limits
No latest image tags
Only approved container registries (ghcr.io/meridian/*)
All namespaces must have NetworkPolicies
No privileged containers
PDB required for deployments with replicas > 1
Audit mode : New policies deployed in audit mode for 7 days before enforcing
Infrastructure
Terraform
Provider : AWS (us-east-1 primary, eu-west-1 DR)
Modules : VPC (subnetting, NAT, VPN), EKS (cluster, node groups, IRSA), RDS (PostgreSQL), ElastiCache (Redis), S3 (buckets + lifecycle), IAM (roles, policies), Route53 (DNS)
State : S3 backend with DynamoDB locking, per-environment state files
Workspaces : prod, staging, dr
Drift detection : Weekly terraform plan (automated, results posted to Slack)
Ansible
Target : 12 bare-metal edge nodes across 3 regional sites (used for CDN origin, local caching)
Playbooks : bootstrap.yml (OS setup, Docker, node-exporter), upgrade.yml (OS patches, Docker version), addons.yml (monitoring agents, log shippers)
Inventory : Dynamic inventory from CMDB API, grouped by site
Execution : Manual via bastion host, scheduled OS patching monthly via cron + Ansible
Networking
CNI : Calico (VXLAN mode)
Ingress : Ingress-NGINX with ModSecurity WAF rules
NetworkPolicies : Default-deny per namespace, explicit allow rules for inter-service communication
DNS : CoreDNS (in-cluster), Route53 (external), split-horizon for internal vs external resolution
Load balancing : AWS NLB (L4) fronting Ingress-NGINX, internal ClusterIP services
Cluster Topology
Production Cluster (us-east-1)
Node Group    Instance Type   Count   Purpose
system        m5.xlarge       3       Control plane add-ons, monitoring, ArgoCD
application   m5.2xlarge      6       Application workloads
data          r5.2xlarge      3       Elasticsearch, RabbitMQ (in-cluster stateful workloads)
spot          m5.xlarge       2-8     Worker/Report service burst capacity (Spot instances)
Namespaces : meridian-prod, monitoring, argocd, vault, ingress, cert-manager, gatekeeper-system, elastic-system, rabbitmq
Total pods : ~120 (steady state)
PDBs : All stateful services and core application services have PDBs (minAvailable or maxUnavailable)
Staging Cluster (us-east-1)
Node Group   Instance Type   Count   Purpose
general      m5.large        3       All workloads (smaller replicas)
Namespaces : Mirror of prod, single-replica deployments
Purpose : Pre-production validation, integration testing, performance baseline
Auto-sync : ArgoCD auto-syncs on GitOps repo update (no approval required)
DR Cluster (eu-west-1)
Node Group    Instance Type   Count   Purpose
system        m5.xlarge       2       Control plane add-ons
application   m5.xlarge       3       Application workloads (scaled down)
data          r5.xlarge       2       Read replicas, standby data stores
State : PostgreSQL cross-region read replica (async, ~1s lag), Redis not replicated (cold start on failover), RabbitMQ not replicated (messages in-flight are lost on failover)
Sync : ArgoCD syncs DR cluster 30 minutes after prod (intentional delay to catch bad deploys)
Failover : DNS failover via Route53 health checks (TTL 60s), manual promotion of DB replica
RPO : < 1 minute (PostgreSQL WAL streaming), RTO: < 15 minutes (DNS propagation + DB promotion + service verification)
On-Call Expectations
Rotation
Schedule : Weekly rotation, 2-person on-call (primary + secondary)
Hours : 24/7 for critical alerts, business hours only for warning-level
Handoff : Monday 10:00 AM, 30-minute handoff meeting with outgoing on-call
Tools : PagerDuty (paging), Slack #incidents (coordination), Grafana (investigation), Zoom (war room)
Responsibilities
The on-call engineer is responsible for:
Acknowledging alerts within 15 minutes (critical) or 1 hour (warning)
Triaging the issue: determine scope, impact, and urgency
Mitigating the customer impact (even if root cause is not yet known)
Escalating when needed (see escalation paths below)
Communicating status via Slack #incidents and StatusPage (for customer-facing issues)
Documenting actions taken in the incident channel
Writing postmortems for any Sev1 or Sev2 incident within 48 hours
Escalation Paths
Sev1 (service down, data loss risk)
  Response time : 5 min ack
  First responder : Primary on-call
  Escalation : Secondary on-call (15 min) -> Engineering Manager (30 min) -> VP Engineering (1 hr)

Sev2 (degraded performance, partial outage)
  Response time : 15 min ack
  First responder : Primary on-call
  Escalation : Secondary on-call (30 min) -> Team lead (1 hr)

Sev3 (non-critical issue, workaround exists)
  Response time : 1 hr ack
  First responder : Primary on-call
  Escalation : Ticket created, addressed next business day

Sev4 (cosmetic, informational)
  Response time : Next business day
  First responder : On-call reviews in morning triage
  Escalation : Ticket created, prioritized in sprint
Key Dashboards
Dashboard         URL Path             Purpose
System Overview   /d/system-overview   High-level health of all services
SLO Burn Rate     /d/slo-burn-rate     Error budget consumption per service
Order Pipeline    /d/order-pipeline    Order creation through fulfillment
Infrastructure    /d/infra-overview    Node health, resource usage, network
On-Call Summary   /d/oncall-summary    Active alerts, recent incidents, handoff notes
Common Runbook Entry Points
Pod CrashLoopBackOff : Check logs, resource limits, readiness probes, recent deploys
High error rate on API Gateway : Check upstream service health, recent deploys, rate limit config
Database connection errors : Check PgBouncer pools, RDS events, connection count
RabbitMQ queue backup : Check consumer health, message rate, DLQ count
Node NotReady : Check kubelet, system resources, network, cloud provider events
Certificate expiry alert : Check cert-manager logs, issuer status, DNS challenge
Vault sealed : Check auto-unseal (KMS), pod restarts, storage backend
Network Topology
Internet
|
[AWS NLB] (TCP 443, TLS passthrough)
|
[Ingress-NGINX pods] (namespace: ingress, 3 replicas)
| - TLS termination (Let's Encrypt certs via cert-manager)
| - ModSecurity WAF
| - Rate limiting (global)
|
[Kong pods] (namespace: meridian-prod, 3 replicas)
| - Per-tenant rate limiting
| - JWT validation
| - Request routing
|
[Application Services] (namespace: meridian-prod)
| - ClusterIP services
| - Calico NetworkPolicy: default-deny ingress
| - Explicit allow from Kong namespace
| - mTLS between services (Vault PKI certs)
|
[Data Stores]
| - PostgreSQL: RDS (private subnet, security group)
| - Redis: ElastiCache (private subnet, security group)
| - RabbitMQ: In-cluster (namespace: rabbitmq, NetworkPolicy restricted)
| - Elasticsearch: In-cluster (namespace: elastic-system)
| - S3: VPC endpoint (no internet traversal)
DNS
External : api.meridian.io -> Route53 -> NLB -> Ingress-NGINX
Internal : <service>.meridian-prod.svc.cluster.local -> CoreDNS
Split-horizon : db.internal.meridian.io resolves to RDS endpoint internally, NXDOMAIN externally
Firewall Rules (Calico NetworkPolicy summary)
Source              Destination           Ports                  Policy
ingress namespace   Kong pods             8000/TCP               Allow
Kong pods           All app services      8080/TCP, 9090/gRPC    Allow
App services        PostgreSQL (RDS)      5432/TCP               Allow (via Security Group)
App services        Redis (ElastiCache)   6379/TCP               Allow (via Security Group)
App services        RabbitMQ pods         5672/TCP               Allow
App services        Elasticsearch pods    9200/TCP               Allow
Prometheus          All pods (metrics)    */TCP (metrics port)   Allow
All other           All other             *                      Deny (default)