Production Readiness Review: System Architecture
Overview
You are the newest member of the platform engineering team at Meridian, a mid-sized B2B SaaS company that provides real-time inventory management and order fulfillment for e-commerce retailers. The platform processes approximately 15,000 requests per second at peak, handles $2M+ in daily transaction volume, and serves 400 active tenants across North America and Europe.
The system runs on Kubernetes across three clusters (production, staging, and disaster recovery), backed by PostgreSQL for persistence, Redis for caching and session state, RabbitMQ for async messaging, and S3-compatible object storage for media and reports. Observability is provided by Prometheus, Grafana, Loki, and Tempo. Deployments flow through GitHub Actions into ArgoCD, which reconciles Helm releases against the clusters. Security is enforced through HashiCorp Vault, cert-manager, and Open Policy Agent (OPA). Cloud infrastructure is provisioned with Terraform; a small fleet of bare-metal edge nodes is managed with Ansible.
You are about to start on-call rotations. This document describes the system you will be responsible for.
Architecture Diagram
+-----------+
| Users |
+-----+-----+
|
+-----v-----+
| CloudFront|
| (CDN) |
+-----+-----+
|
+--------v--------+
| Ingress-NGINX |
| (TLS termination|
| rate limiting) |
+--------+--------+
|
+-----------v-----------+
| API Gateway (Kong) |
| - Auth/rate limit |
| - Request routing |
| - JWT validation |
+-----------+-----------+
|
+----------+-----------+-----------+-----------+
| | | | |
+-----v----+ +---v------+ +-v--------+ +v--------+ +v---------+
| Auth | | Order | | Inventory| | Search | | Billing |
| Service | | Service | | Service | | Service | | Service |
+-----+----+ +---+------+ +----+-----+ +----+----+ +----+-----+
| | | | |
| +-----v------+ | +----v----+ |
| | Fulfillment| | | Elastic-| |
| | Service | | | search | |
| +-----+------+ | +---------+ |
| | | |
+----v----+ +--v-------+ +--v-------+ +-----v-----+
| Notifi- | | Worker | | Report | | Stripe |
| cation | | Service | | Service | | Webhook |
| Service | +--+-------+ +--+-------+ | Handler |
+---------+ | | +-----------+
| |
+----v----+ +----v----+
| RabbitMQ| | S3 |
| (async | | (media, |
| queue) | | reports)|
+---------+ +---------+
+-------------+ +----------+
| PostgreSQL | | Redis |
| (primary + | | (cache, |
| 2 replicas)| | sessions)|
+-------------+ +----------+
Observability Stack CI/CD Pipeline
================== ==============
Prometheus (metrics) GitHub Actions (build/test)
Grafana (dashboards) |
Loki (logs) Container Registry (GHCR)
Tempo (traces) |
Alertmanager (paging) ArgoCD (GitOps sync)
|
Security Helm (release mgmt)
======== |
Vault (secrets) 3 Clusters (prod/stg/DR)
cert-manager (TLS)
OPA/Gatekeeper (policy)
Infrastructure
==============
Terraform (AWS: VPC, EKS, RDS, ElastiCache, MSK, S3)
Ansible (bare-metal edge nodes: 12 servers, 3 sites)
Calico (CNI + NetworkPolicy enforcement)
Component Details
API Gateway: Kong
Purpose : Central entry point for all API traffic. Handles JWT validation, per-tenant rate limiting, request routing, and request/response transformation.
Dependencies : Ingress-NGINX (upstream), Auth Service (JWT public keys), Redis (rate limit counters). Kong runs in DB-less mode: declarative configuration is stored in a ConfigMap, so there is no database dependency.
Failure modes : Pod crash causes 502s from ingress (mitigated by 3 replicas + PDB). Redis unavailable degrades rate limiting to permissive mode. Misconfigured route returns 404 for valid paths.
SLO : 99.95% availability, p99 added latency < 15ms.
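The per-tenant rate limiting and its degradation behavior can be sketched as follows. This is an illustrative model, not Kong's actual plugin code: a plain dict stands in for the Redis counters, and the window size, limit, and function names are assumptions.

```python
import time

WINDOW_SECONDS = 60
LIMIT_PER_TENANT = 1000  # hypothetical per-tenant limit

_counters = {}  # (tenant_id, window_start) -> request count; stands in for Redis


def allow_request(tenant_id, now=None, redis_available=True):
    """Fixed-window rate limit check, one counter per tenant per window.

    Mirrors the documented degradation: if the counter store (Redis) is
    unreachable, fail open (permissive mode) rather than reject traffic.
    """
    if not redis_available:
        return True  # permissive mode when Redis is down
    now = time.time() if now is None else now
    window = int(now // WINDOW_SECONDS)
    key = (tenant_id, window)
    _counters[key] = _counters.get(key, 0) + 1
    return _counters[key] <= LIMIT_PER_TENANT
```

Failing open is a deliberate trade-off: a Redis outage briefly removes throttling rather than turning into a full API outage.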
Auth Service
Purpose : User authentication (OAuth2/OIDC), API key management, tenant identity, JWT issuance and rotation.
Dependencies : PostgreSQL (user/tenant data), Redis (session cache, token blacklist), Vault (signing keys, OIDC client secrets).
Failure modes : PostgreSQL unreachable causes login failures. Vault token expiry blocks key rotation. Stale Redis cache can allow revoked tokens for up to 5 minutes (TTL).
SLO : 99.95% availability, p95 login latency < 200ms.
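The 5-minute revocation window above is worth internalizing before on-call. The sketch below models it; in-memory dicts stand in for Redis and the PostgreSQL source of truth, and all names are illustrative, not the Auth Service's real API.

```python
BLACKLIST_TTL = 300  # seconds; matches the documented 5-minute cache TTL

_db_blacklist = set()  # authoritative revocation list (stands in for PostgreSQL)
_cache = {}            # token -> (is_revoked, fetched_at); stands in for Redis


def revoke(token):
    # Revocation writes to the source of truth but does NOT invalidate
    # the cache entry -- this is the documented staleness window.
    _db_blacklist.add(token)


def is_token_valid(token, now):
    entry = _cache.get(token)
    if entry is None or now - entry[1] >= BLACKLIST_TTL:
        # Cache miss or expired: consult the source of truth and re-cache.
        revoked = token in _db_blacklist
        _cache[token] = (revoked, now)
        return not revoked
    return not entry[0]  # may be stale for up to BLACKLIST_TTL seconds
```

If an incident requires immediate revocation (e.g. a leaked token), the cache entry must be deleted explicitly rather than waiting out the TTL.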
Order Service
Purpose : Order lifecycle management: creation, validation, state transitions (pending, confirmed, fulfilled, canceled). Publishes events to RabbitMQ for downstream processing.
Dependencies : PostgreSQL (order data, ACID transactions), RabbitMQ (event publishing), Inventory Service (stock reservation via synchronous gRPC), Redis (idempotency keys).
Failure modes : RabbitMQ unavailable blocks event publishing; orders still persist but downstream processing stalls. Inventory Service timeout causes order creation to hang (circuit breaker trips after 5s). Database connection pool exhaustion under load spike.
SLO : 99.95% availability, p99 order creation < 500ms, zero lost orders (data durability).
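The Redis idempotency keys exist so that client retries (common during the failure modes above) never create duplicate orders. A minimal sketch of the pattern, with a dict standing in for the Redis keyspace and all names hypothetical:

```python
_idempotency_store = {}  # idempotency key -> order_id; stands in for Redis
_next_order_id = [1000]  # simple counter standing in for the database sequence


def create_order(idempotency_key, payload):
    """Create an order, or replay the prior result for a retried key."""
    if idempotency_key in _idempotency_store:
        # Retried request: return the original order instead of duplicating.
        return _idempotency_store[idempotency_key]
    order_id = _next_order_id[0]
    _next_order_id[0] += 1
    # ... persist `payload` to PostgreSQL, publish event to RabbitMQ ...
    _idempotency_store[idempotency_key] = order_id
    return order_id
```

In the real service the key would carry a TTL and the store/persist steps must commit atomically; this sketch only shows the control flow.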
Inventory Service
Purpose : Real-time stock level tracking, reservation management, warehouse sync. Exposes a gRPC API for internal services and a REST API for tenant dashboards.
Dependencies : PostgreSQL (stock data with row-level locking), Redis (hot stock cache, write-through), RabbitMQ (consumes fulfillment events to decrement stock).
Failure modes : Cache inconsistency causes overselling (mitigated by DB-level pessimistic locking). Slow warehouse sync causes stale data for up to 15 minutes.
SLO : 99.9% availability, stock accuracy within 60 seconds of warehouse update.
Search Service
Purpose : Full-text product search, filtering, faceting, and autocomplete for tenant storefronts.
Dependencies : Elasticsearch (3-node cluster, 2 replicas per index), PostgreSQL (initial data load and reindex source), RabbitMQ (consumes product update events for near-real-time index updates).
Failure modes : Elasticsearch cluster goes yellow/red (reduced redundancy or blocked writes). A full reindex from PostgreSQL takes 45 minutes, during which search results may be stale. Mapping conflicts on schema changes block indexing.
SLO : 99.9% availability, p95 search latency < 150ms, index lag < 30 seconds.
Billing Service
Purpose : Tenant subscription management, usage metering, invoice generation, Stripe integration for payment processing.
Dependencies : PostgreSQL (billing records, usage counters), Stripe API (payment processing, webhook receiver), Redis (usage counter aggregation before flush to DB).
Failure modes : Stripe API outage blocks payment processing (invoices queue for retry). Usage counter loss on Redis crash (mitigated by periodic DB flush every 60s). Webhook signature validation failure silently drops payment confirmations.
SLO : 99.9% availability, invoice accuracy 100%, payment processing within 30s of trigger.
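The "silently drops payment confirmations" failure mode is the one to fix first when on-call touches this service: a rejected webhook must be logged and counted, never discarded without a trace. The sketch below uses a simplified version of Stripe's v1 scheme (HMAC-SHA256 over "timestamp.payload") for illustration; the secret value and function names are assumptions, and the real handler should also check timestamp freshness.

```python
import hashlib
import hmac
import logging

logger = logging.getLogger("billing.webhooks")


def compute_signature(secret, timestamp, payload):
    """HMAC-SHA256 over '<timestamp>.<payload>', simplified Stripe v1 style."""
    signed = f"{timestamp}.{payload}".encode()
    return hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()


def verify_webhook(secret, timestamp, payload, received_sig):
    expected = compute_signature(secret, timestamp, payload)
    if not hmac.compare_digest(expected, received_sig):
        # Loudly record the rejection so a payment confirmation is
        # never lost without an audit trail.
        logger.warning("webhook signature mismatch (ts=%s)", timestamp)
        return False
    return True
```

An alert on the mismatch log line (via Loki) turns the silent-drop failure mode into a pageable signal.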
Fulfillment Service
Purpose : Coordinates order fulfillment: picks a warehouse, generates shipping labels, tracks shipment status, updates order state.
Dependencies : RabbitMQ (consumes order events, publishes fulfillment events), PostgreSQL (fulfillment records), S3 (shipping label storage), external carrier APIs (FedEx, UPS, USPS).
Failure modes : Carrier API timeout causes label generation delay (retry with exponential backoff). Message processing failure causes redelivery loop (dead-letter queue after 3 retries). S3 unavailability blocks label storage.
SLO : 99.9% availability, fulfillment processing within 5 minutes of order confirmation.
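The exponential-backoff retry mentioned above can be expressed as a delay schedule. This is a sketch with full jitter; the base and cap values are illustrative, not the service's actual configuration.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Delay before retry `attempt` (0-based): uniform in [0, min(cap, base * 2**attempt)).

    Full jitter spreads retries out so a carrier API outage does not
    produce a synchronized thundering herd when it recovers.
    """
    return rng() * min(cap, base * (2 ** attempt))
```

With these defaults, attempts 0..3 draw from windows of at most 1s, 2s, 4s, 8s, capped at 60s thereafter.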
Notification Service
Purpose : Sends transactional emails, SMS, and webhook notifications to tenants and end-users.
Dependencies : RabbitMQ (consumes notification events), Redis (deduplication cache, per-tenant rate limiting), SendGrid (email), Twilio (SMS), PostgreSQL (notification log, template storage).
Failure modes : External provider outage causes notification delay (failover from SendGrid to SES configured). Template rendering error causes garbled notifications. Rate limit misconfiguration causes tenant notification flood.
SLO : 99.5% delivery rate, email delivery within 60s, SMS within 30s.
Worker Service
Purpose : Background job processing: data imports, bulk operations, scheduled (cron-based) tasks, tenant data exports.
Dependencies : RabbitMQ (job queue), PostgreSQL (job state, tenant data), S3 (export output), Redis (job deduplication and distributed locking).
Failure modes : Long-running job blocks its queue consumer (mitigated by separate queues per job type). OOM kill on large data exports (memory limit 2Gi; large exports stream to S3). Cron job overlap on slow execution (distributed lock prevents double-run).
SLO : Job completion within 2x estimated duration, zero dropped jobs.
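The cron-overlap guard works like a lock with a TTL, equivalent in spirit to Redis `SET key value NX EX ttl`. A sketch of the semantics, with a dict standing in for Redis (names illustrative; a production lock also needs an owner token checked on release so one run cannot free another's lock):

```python
import time

_locks = {}  # lock name -> expiry timestamp; stands in for Redis keys with TTL


def acquire(name, ttl, now=None):
    """Try to take the lock; return False if a live holder exists.

    The TTL guarantees a crashed job cannot hold the lock forever:
    the lock self-expires and the next scheduled run proceeds.
    """
    now = time.time() if now is None else now
    expiry = _locks.get(name)
    if expiry is not None and expiry > now:
        return False  # another run still holds the lock
    _locks[name] = now + ttl
    return True
```

The TTL must exceed the job's worst-case runtime, otherwise a slow (but healthy) run loses its lock mid-flight and the double-run returns.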
Report Service
Purpose : Generates tenant analytics reports, dashboard data, and scheduled report delivery.
Dependencies : PostgreSQL read replica (heavy queries offloaded from the primary), S3 (report PDF/CSV storage), Redis (report cache, 15-minute TTL), RabbitMQ (scheduled report triggers).
Failure modes : Read replica lag causes stale report data. Large report generation causes high memory usage (mitigated by streaming pagination). S3 upload timeout on large reports.
SLO : Report generation within 5 minutes, data freshness within 15 minutes of source.
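The streaming-pagination mitigation means rows are fetched and written out one page at a time, so a large report never sits fully in memory. A minimal sketch of the pattern; the page size is an assumption, and in the real service the input would be a keyset-paginated query against the read replica rather than an in-memory iterable:

```python
PAGE_SIZE = 500  # illustrative; tune against replica query cost and memory


def paginate(rows, page_size=PAGE_SIZE):
    """Yield successive fixed-size pages from an iterable of rows.

    Because this is a generator, each page can be appended to the S3
    upload (multipart) and discarded before the next page is fetched.
    """
    page = []
    for row in rows:
        page.append(row)
        if len(page) == page_size:
            yield page
            page = []
    if page:
        yield page  # final partial page
```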
Data Stores
PostgreSQL
Version : 15.x, managed via RDS (prod/staging) and local container (dev)
Topology : Primary + 2 read replicas (prod), single instance (staging/DR)
Databases : meridian_auth, meridian_orders, meridian_inventory, meridian_billing, meridian_platform (shared)
Backup : Automated daily snapshots (RDS), WAL archival to S3, point-in-time recovery window of 7 days
Connection pooling : PgBouncer sidecar (transaction mode, max 200 connections per pod)
Failure modes : Primary failover takes 60-120s (automatic via RDS Multi-AZ). Connection pool exhaustion under load. Replica lag during bulk operations.
Redis
Version : 7.x, managed via ElastiCache (prod/staging)
Topology : Primary + 1 replica, cluster mode disabled
Uses : Session cache, rate limit counters, idempotency keys, hot stock cache, distributed locks, usage counter aggregation
Failure modes : Failover causes 10-30s connection interruption. Memory pressure triggers eviction (allkeys-lru policy). Split-brain during network partition.
Persistence : AOF disabled (cache-only; authoritative data is in PostgreSQL)
RabbitMQ
Version : 3.12.x, self-managed (Helm chart: bitnami/rabbitmq)
Topology : 3-node cluster with quorum queues (prod), single node (staging)
Exchanges : orders.events, inventory.events, fulfillment.events, notifications, reports.scheduled
Dead-letter handling : Failed messages routed to *.dlq queues after 3 retries, monitored via Prometheus exporter
Failure modes : Network partition causes cluster split (pause-minority mode configured). Disk alarm blocks publishers. Queue backup during consumer outage causes memory pressure.
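The dead-letter policy above can be stated as a routing rule. This is a simplified sketch, not broker configuration: with quorum queues RabbitMQ tracks redeliveries in the x-delivery-count header and a delivery-limit policy handles the routing; the retry constant and function here are illustrative.

```python
MAX_RETRIES = 3  # matches the documented dead-letter threshold


def route_failed_message(queue, delivery_count):
    """Return where a failed message goes next: requeue or dead-letter."""
    if delivery_count >= MAX_RETRIES:
        # Give up: park in the DLQ for inspection (monitored via the
        # Prometheus exporter) instead of looping forever.
        return f"{queue}.dlq"
    return queue  # redeliver for another attempt
```

A growing DLQ count is the on-call signal that a consumer is poisoned or a payload is malformed; messages there are never retried automatically.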
Elasticsearch
Version : 8.x, self-managed (ECK operator)
Topology : 3 data nodes, 2 dedicated masters, 1 coordinator (prod)
Indices : Product catalog per tenant (rolling aliases), search analytics
Failure modes : Cluster yellow (lost replica), cluster red (lost primary shard). JVM heap pressure causes slow queries. Mapping explosion from uncontrolled dynamic fields.
S3 (AWS S3)
Buckets : meridian-media (product images), meridian-reports (generated reports), meridian-backups (DB snapshots, WAL archives), meridian-fulfillment (shipping labels)
Access : IAM roles for service accounts (IRSA), pre-signed URLs for tenant access
Lifecycle : Media retained indefinitely, reports archived to Glacier after 90 days, backups retained 30 days
Failure modes : Rare (S3 is 99.99% available). Pre-signed URL expiry causes download failures. CORS misconfiguration blocks browser uploads.
Observability Stack
Prometheus
Deployment : Prometheus Operator (kube-prometheus-stack Helm chart)
Retention : 15 days local, Thanos sidecar ships to S3 for long-term (1 year)
Scrape targets : All services expose /metrics, node-exporter on all nodes, kube-state-metrics, cAdvisor, RabbitMQ exporter, PostgreSQL exporter, Redis exporter, Elasticsearch exporter
Recording rules : Pre-computed SLO burn rates, aggregated request rates, p50/p95/p99 latencies
Alert rules : 85 active alert rules across 12 groups
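The SLO burn-rate recording rules reduce to one piece of arithmetic worth knowing cold during an incident: burn rate is the observed error rate divided by the error budget. A burn rate of 1.0 spends exactly the budget over the SLO window; large multiples mean the budget dies in hours. Sketch (function name illustrative):

```python
def burn_rate(error_ratio, slo):
    """Burn rate = observed error fraction / error budget.

    error_ratio: fraction of failed requests over the lookback window.
    slo: availability target as a fraction, e.g. 0.999.
    """
    budget = 1.0 - slo
    return error_ratio / budget
```

For example, a 99.9% SLO has a 0.1% budget, so a sustained 1.44% error rate is a 14.4x burn, the classic fast-burn paging threshold.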
Alertmanager
Routing : Severity-based routing to PagerDuty (critical/warning), Slack (info), email (daily digest)
Inhibition : Node-level alerts inhibit pod-level alerts on the same node
Silences : Managed via Alertmanager UI, requires justification comment
Escalation : Page -> 15 min ack timeout -> escalate to secondary -> 30 min -> escalate to engineering manager
Grafana
Dashboards : 24 dashboards organized by service, infrastructure, and SLO
Data sources : Prometheus (metrics), Loki (logs), Tempo (traces), PostgreSQL (business metrics)
Provisioned : All dashboards as code (ConfigMaps), no manual dashboard creation in prod
Loki
Deployment : Simple scalable mode (3 read, 3 write, 1 backend)
Retention : 30 days
Labels : namespace, pod, container, app, level
Storage : S3 backend for chunks, BoltDB Shipper for index
Tempo
Deployment : Distributed mode (distributor, ingester, querier, compactor)
Retention : 14 days
Instrumentation : OpenTelemetry SDK in all services, auto-instrumented HTTP/gRPC/DB
Sampling : Head-based sampling at 10% for normal traffic, 100% for errors and slow requests (>1s)
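The sampling policy above can be expressed as a single decision function. This is a sketch of the policy logic, not the actual OpenTelemetry sampler configuration; the constants mirror the documented values and the function name is illustrative.

```python
import random

SLOW_THRESHOLD_S = 1.0  # documented slow-request cutoff
BASE_RATE = 0.10        # documented 10% rate for normal traffic


def keep_trace(is_error, duration_s, rng=random.random):
    """Decide whether a completed trace is kept."""
    if is_error or duration_s > SLOW_THRESHOLD_S:
        return True  # errors and slow requests are always kept
    return rng() < BASE_RATE  # 10% of normal traffic
```

Note the practical consequence for debugging: absence of a trace for a fast, successful request is expected (90% are dropped), while every error should have one.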
CI/CD Pipeline
Build (GitHub Actions)
Push to main
|
v
Lint + Unit Tests (parallel matrix: Python 3.11, Go 1.21)
|
v
Security Scan (Trivy container scan, Snyk dependency scan)
|
v
Docker Build + Push to GHCR
|
v
Helm chart lint + template validation
|
v
Update image tag in GitOps repo (argocd-manifests/)
Deploy (ArgoCD + Helm)
GitOps repo updated
|
v
ArgoCD detects drift (polling interval: 3 min)
|
v
Helm template rendered with environment-specific values
|
v
Staging auto-sync (immediate)
|
v
Prod manual sync (requires approval in ArgoCD UI)
| \
v v
Rolling update Canary (Order Service only, via Argo Rollouts)
|
v
Post-deploy smoke tests (automated)
|
v
DR cluster sync (30 min delay, automated)
Rollback
Helm : helm rollback <release> <revision> — instant, reverts to previous manifest set
ArgoCD : Revert commit in GitOps repo, ArgoCD auto-syncs
Database : Migrations are forward-only; rollback requires a new forward migration
Time to rollback : < 5 minutes for application, 15-30 minutes if DB migration is involved
Security
Vault (HashiCorp)
Deployment : HA mode (3 pods, Raft storage), auto-unsealed via AWS KMS
Secret engines : KV v2 (application secrets), PKI (internal CA), Transit (encryption-as-a-service for PII)
Auth methods : Kubernetes auth (pod identity), AppRole (CI/CD), OIDC (human operators)
Secret injection : External Secrets Operator syncs Vault secrets to Kubernetes Secrets
Rotation : Database credentials rotated every 24h (Vault dynamic secrets), API keys rotated quarterly
cert-manager
Issuers : Let's Encrypt (public TLS via DNS-01 challenge), Vault PKI (internal mTLS)
Certificates : Ingress TLS (public), inter-service mTLS (internal CA, 90-day rotation)
Monitoring : cert-manager Prometheus metrics, alert when certificate expiry < 14 days
OPA / Gatekeeper
Policies enforced :
All containers must run as non-root
All pods must have resource limits
No latest image tags
Only approved container registries (ghcr.io/meridian/*)
All namespaces must have NetworkPolicies
No privileged containers
PDB required for deployments with replicas > 1
Audit mode : New policies deployed in audit mode for 7 days before enforcing
Infrastructure
Terraform
Provider : AWS (us-east-1 primary, eu-west-1 DR)
Modules : VPC (subnetting, NAT, VPN), EKS (cluster, node groups, IRSA), RDS (PostgreSQL), ElastiCache (Redis), S3 (buckets + lifecycle), IAM (roles, policies), Route53 (DNS)
State : S3 backend with DynamoDB locking, per-environment state files
Workspaces : prod, staging, dr
Drift detection : Weekly terraform plan (automated, results posted to Slack)
Ansible
Target : 12 bare-metal edge nodes across 3 regional sites (used for CDN origin, local caching)
Playbooks : bootstrap.yml (OS setup, Docker, node-exporter), upgrade.yml (OS patches, Docker version), addons.yml (monitoring agents, log shippers)
Inventory : Dynamic inventory from CMDB API, grouped by site
Execution : Manual via bastion host, scheduled OS patching monthly via cron + Ansible
Networking
CNI : Calico (VXLAN mode)
Ingress : Ingress-NGINX with ModSecurity WAF rules
NetworkPolicies : Default-deny per namespace, explicit allow rules for inter-service communication
DNS : CoreDNS (in-cluster), Route53 (external), split-horizon for internal vs external resolution
Load balancing : AWS NLB (L4) fronting Ingress-NGINX, internal ClusterIP services
Cluster Topology
Production Cluster (us-east-1)
Node Group    Instance Type   Count   Purpose
system        m5.xlarge       3       Control plane add-ons, monitoring, ArgoCD
application   m5.2xlarge      6       Application workloads
data          r5.2xlarge      3       Elasticsearch, RabbitMQ (in-cluster stateful workloads)
spot          m5.xlarge       2-8     Worker/Report service burst capacity (Spot instances)
Namespaces : meridian-prod, monitoring, argocd, vault, ingress, cert-manager, gatekeeper-system, elastic-system, rabbitmq
Total pods : ~120 (steady state)
PDBs : All stateful services and core application services have PDBs (minAvailable or maxUnavailable)
Staging Cluster (us-east-1)
Node Group   Instance Type   Count   Purpose
general      m5.large        3       All workloads (smaller replicas)
Namespaces : Mirror of prod, single-replica deployments
Purpose : Pre-production validation, integration testing, performance baseline
Auto-sync : ArgoCD auto-syncs on GitOps repo update (no approval required)
DR Cluster (eu-west-1)
Node Group    Instance Type   Count   Purpose
system        m5.xlarge       2       Control plane add-ons
application   m5.xlarge       3       Application workloads (scaled down)
data          r5.xlarge       2       Read replicas, standby data stores
State : PostgreSQL cross-region read replica (async, ~1s lag), Redis not replicated (cold start on failover), RabbitMQ not replicated (messages in-flight are lost on failover)
Sync : ArgoCD syncs DR cluster 30 minutes after prod (intentional delay to catch bad deploys)
Failover : DNS failover via Route53 health checks (TTL 60s), manual promotion of DB replica
RPO : < 1 minute (PostgreSQL WAL streaming), RTO: < 15 minutes (DNS propagation + DB promotion + service verification)
On-Call Expectations
Rotation
Schedule : Weekly rotation, 2-person on-call (primary + secondary)
Hours : 24/7 for critical alerts, business hours only for warning-level
Handoff : Monday 10:00 AM, 30-minute handoff meeting with outgoing on-call
Tools : PagerDuty (paging), Slack #incidents (coordination), Grafana (investigation), Zoom (war room)
Responsibilities
The on-call engineer is responsible for:
Acknowledging alerts within 15 minutes (critical) or 1 hour (warning)
Triaging the issue: determine scope, impact, and urgency
Mitigating the customer impact (even if root cause is not yet known)
Escalating when needed (see escalation paths below)
Communicating status via Slack #incidents and StatusPage (for customer-facing issues)
Documenting actions taken in the incident channel
Writing postmortems for any Sev1 or Sev2 incident within 48 hours
Escalation Paths
Sev1 (service down, data loss risk)
  Response time : 5 min ack
  First responder : Primary on-call
  Escalation : Secondary on-call (15 min) -> Engineering Manager (30 min) -> VP Engineering (1 hr)

Sev2 (degraded performance, partial outage)
  Response time : 15 min ack
  First responder : Primary on-call
  Escalation : Secondary on-call (30 min) -> Team lead (1 hr)

Sev3 (non-critical issue, workaround exists)
  Response time : 1 hr ack
  First responder : Primary on-call
  Escalation : Ticket created, addressed next business day

Sev4 (cosmetic, informational)
  Response time : Next business day
  First responder : On-call reviews in morning triage
  Escalation : Ticket created, prioritized in sprint
Key Dashboards
Dashboard         URL Path             Purpose
System Overview   /d/system-overview   High-level health of all services
SLO Burn Rate     /d/slo-burn-rate     Error budget consumption per service
Order Pipeline    /d/order-pipeline    Order creation through fulfillment
Infrastructure    /d/infra-overview    Node health, resource usage, network
On-Call Summary   /d/oncall-summary    Active alerts, recent incidents, handoff notes
Common Runbook Entry Points
Pod CrashLoopBackOff : Check logs, resource limits, readiness probes, recent deploys
High error rate on API Gateway : Check upstream service health, recent deploys, rate limit config
Database connection errors : Check PgBouncer pools, RDS events, connection count
RabbitMQ queue backup : Check consumer health, message rate, DLQ count
Node NotReady : Check kubelet, system resources, network, cloud provider events
Certificate expiry alert : Check cert-manager logs, issuer status, DNS challenge
Vault sealed : Check auto-unseal (KMS), pod restarts, storage backend
Network Topology
Internet
|
[AWS NLB] (TCP 443, TLS passthrough)
|
[Ingress-NGINX pods] (namespace: ingress, 3 replicas)
| - TLS termination (Let's Encrypt certs via cert-manager)
| - ModSecurity WAF
| - Rate limiting (global)
|
[Kong pods] (namespace: meridian-prod, 3 replicas)
| - Per-tenant rate limiting
| - JWT validation
| - Request routing
|
[Application Services] (namespace: meridian-prod)
| - ClusterIP services
| - Calico NetworkPolicy: default-deny ingress
| - Explicit allow from Kong namespace
| - mTLS between services (Vault PKI certs)
|
[Data Stores]
| - PostgreSQL: RDS (private subnet, security group)
| - Redis: ElastiCache (private subnet, security group)
| - RabbitMQ: In-cluster (namespace: rabbitmq, NetworkPolicy restricted)
| - Elasticsearch: In-cluster (namespace: elastic-system)
| - S3: VPC endpoint (no internet traversal)
DNS
External : api.meridian.io -> Route53 -> NLB -> Ingress-NGINX
Internal : <service>.meridian-prod.svc.cluster.local -> CoreDNS
Split-horizon : db.internal.meridian.io resolves to RDS endpoint internally, NXDOMAIN externally
Firewall Rules (Calico NetworkPolicy summary)
Source              Destination           Ports                  Policy
ingress namespace   Kong pods             8000/TCP               Allow
Kong pods           All app services      8080/TCP, 9090/gRPC    Allow
App services        PostgreSQL (RDS)      5432/TCP               Allow (via Security Group)
App services        Redis (ElastiCache)   6379/TCP               Allow (via Security Group)
App services        RabbitMQ pods         5672/TCP               Allow
App services        Elasticsearch pods    9200/TCP               Allow
Prometheus          All pods (metrics)    */TCP (metrics port)   Allow
All other           All other             *                      Deny (default)