Decision Tree: Sync vs Async Communication

Category: Architecture Decisions
Starting Question: "Should service A communicate with service B synchronously or asynchronously?"
Estimated traversal: 3-4 minutes
Domains: microservices, messaging, architecture, distributed-systems, reliability


The Tree

Should service A communicate with service B synchronously or asynchronously?
├── Does the caller need the result immediately to continue processing?
│   ├── Yes →
│   │   └── Is the operation expected to complete in < 1 second under normal load?
│   │       ├── Yes →
│   │       │   └── Can service B be temporarily unavailable without blocking service A?
│   │       │       ├── No  → DECISION: Synchronous HTTP/gRPC (tight coupling accepted)
│   │       │       └── Yes → WARNING: Consider circuit breaker + sync, or async with polling
│   │       └── No (operation takes > 1s) →
│   │           └── Can the caller use a polling or callback pattern?
│   │               ├── Yes → DECISION: Async with webhook/callback (fire-and-forget + notify)
│   │               └── No  → WARNING: Long sync calls — evaluate queue + frontend polling
│   │
│   └── No (caller does not need immediate result) →
│       └── Is there a fan-out requirement? (one event → multiple independent consumers)
│           ├── Yes →
│           │   └── Does message order matter within each partition/consumer group?
│           │       ├── Yes → DECISION: Async with Kafka/Kinesis (ordered, partitioned stream)
│           │       └── No  → DECISION: Async with SNS/EventBridge (pub-sub fan-out)
│           │
│           └── No (single consumer or simple queue) →
│               └── Does the consumer need to replay or audit past messages?
│                   ├── Yes → DECISION: Async with Kafka (log-based, replayable)
│                   └── No  →
│                       └── Is exactly-once or at-least-once delivery required?
│                           ├── Exactly-once → DECISION: Async with transactional queue
│                           │                  (SQS FIFO + idempotency key, or Kafka transactions)
│                           └── At-least-once → DECISION: Async with SQS/RabbitMQ standard queue

Node Details

Check 1: Caller Needs Immediate Result

How to assess: Trace the code path. After calling service B, does service A immediately use B's response to generate its own response to the user or calling system? If A returns a response to the user that includes data from B, the call is synchronous by requirement.

What you're looking for: A genuine data dependency in the response path. Not "it would be convenient to have the result now," but "the response cannot be constructed without B's data."

Common pitfall: Conflating convenience with requirement. Developers often write synchronous calls because it is simpler to code, not because the architecture requires it. Ask: "Could the user see a loading state while B processes, and we update the UI asynchronously?" If yes, async is viable.

Check 2: Operation Completes in < 1 Second

How to assess: Measure the P99 latency of the operation in service B under expected peak load. If B doesn't exist yet, estimate based on: database query latency (1-50ms), external API call (50-500ms), ML inference (100ms-5s), batch processing (seconds to minutes).

What you're looking for: Consistent sub-second completion at P99, not just average. P99 latency is what your SLO is measured against. A service with 50ms P50 and 5s P99 is a bad synchronous dependency.

Common pitfall: Measuring latency in staging with low concurrency. Synchronous calls under high load create backpressure: slow responses in B cause A's thread pool to fill, leading to cascading failures. Always measure at expected peak load, not at idle.
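The P50-versus-P99 point above can be made concrete with a small sketch. This uses a simple nearest-rank percentile over illustrative latency samples (the numbers are invented, not measurements):

```python
# Sketch: judge a candidate sync dependency by its P99, not its average.
# The latency samples below are illustrative, not real measurements.

def percentile(samples, p):
    """Nearest-rank percentile of a list of latencies (ms)."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

# A service can look fast at the median and still be a bad sync dependency:
latencies_ms = [50] * 98 + [4800, 5200]  # healthy P50, pathological tail

p50 = percentile(latencies_ms, 50)   # 50 ms
p99 = percentile(latencies_ms, 99)   # 4800 ms
sync_viable = p99 < 1000             # the < 1 s threshold from the tree
```

Here the median looks fine while the tail disqualifies the service as a synchronous dependency, which is exactly the failure mode the check is guarding against.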

Check 3: Service B Can Be Temporarily Unavailable

How to assess: What happens to service A if service B is down? Can A return a degraded but useful response without B's data? Can A serve from a cache? Can A defer the operation until B recovers?

What you're looking for: Whether A has a defined degraded mode that does not depend on B's availability. If A's SLO is tied directly to B's availability, you have coupled their availability SLOs.

Common pitfall: Assuming synchronous calls are safe because "B is highly available." Every dependency adds to your total failure budget. A system with 10 synchronous dependencies, each at 99.9% availability, has a combined availability of ~99% — one 9 lower than each individual dependency.
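The availability arithmetic in the pitfall is worth seeing directly: with N synchronous dependencies in the request path, the individual availabilities multiply.

```python
# Sketch: combined availability of a chain of synchronous dependencies.

def combined_availability(dep_availabilities):
    total = 1.0
    for a in dep_availabilities:
        total *= a  # every sync dependency multiplies into the failure budget
    return total

# Ten dependencies, each 99.9% available:
total = combined_availability([0.999] * 10)
# total is ~0.990 -- roughly one "nine" lower than any single dependency
```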

Check 4: Fan-Out Requirement

How to assess: List all systems that need to react to the event. An "OrderPlaced" event might need to trigger: inventory reservation, email notification, analytics pipeline, fraud detection, loyalty points calculation. If 3+ independent systems need to consume the event, fan-out is a significant factor.

What you're looking for: Multiple independent consumers that should react to the same event without coupling to each other. Fan-out via synchronous calls requires the producer to know all consumers, creating N-way coupling. A message bus decouples the producer from consumers.

Common pitfall: Creating a synchronous fan-out chain (A calls B, which calls C, which calls D) instead of a parallel fan-out. Synchronous chains multiply latency and mean one slow consumer blocks all downstream processing.

Check 5: Message Order Requirements

How to assess: Does the processing logic break if messages arrive out of order? Example: "user created" then "user updated" must be processed in order or the update applies to a non-existent user. Contrast with: "email sent" events where order does not matter.

What you're looking for: Stateful consumers where out-of-order delivery creates incorrect state. Note that order requirements are almost always scoped to a single entity (all events for user ID 123 must be ordered) rather than globally ordered across all messages.

Common pitfall: Requiring global ordering when per-entity ordering is sufficient. Kafka provides per-partition ordering, which maps to per-entity ordering when you partition by entity ID. Global ordering eliminates parallelism entirely — avoid it unless truly required.
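The "partition by entity ID" idea can be sketched as follows. This is an illustrative hash-based partitioner, not the exact function real Kafka clients use (they hash the key bytes with murmur2), but the property is the same: every event for one entity lands on one partition and therefore stays ordered.

```python
# Sketch: per-entity ordering via key-based partitioning.
import hashlib

def partition_for(entity_id: str, num_partitions: int) -> int:
    """Map an entity ID to a stable partition. All events for one entity
    land on the same partition, so they stay ordered relative to each other."""
    digest = hashlib.md5(entity_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for user-123 go to the same partition, in order:
p1 = partition_for("user-123", 12)
p2 = partition_for("user-123", 12)
# p1 == p2, while events for other users may land on other partitions
```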

Check 6: Message Replay / Audit Requirements

How to assess: Do you need to reprocess historical events (for a new consumer, for bug correction, for audit)? Does regulatory compliance require an immutable record of events? Is event sourcing part of the design?

What you're looking for: Requirement to access messages after they have been processed. Traditional queues (SQS, RabbitMQ) delete messages after acknowledgment. Log-based queues (Kafka) retain messages for a configurable period and allow replay.

Common pitfall: Using a traditional queue for a workload that will eventually need replay capability, then having to migrate to a log-based queue after the fact. If you are building an event-sourced system or need auditability, Kafka's log retention is a fundamental architectural requirement, not an optional feature.

Check 7: Delivery Guarantee

How to assess: What happens if a message is processed twice? Write down the impact: "send the same email twice" (annoying but recoverable), "charge the customer twice" (critical failure). Can the consumer implement idempotency (same message twice produces the same result)?

What you're looking for: Whether "exactly-once" is a genuine business requirement or whether "at-least-once + idempotency" achieves the same outcome. True exactly-once delivery is expensive and rarely necessary when consumers are idempotent.

Common pitfall: Requesting exactly-once delivery without implementing idempotent consumers. Exactly-once semantics at the queue level (FIFO deduplication) only help if the consumer's processing is also idempotent. Focus on idempotency in the consumer first, then evaluate whether additional queue-level guarantees are needed.
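A minimal sketch of "at-least-once + idempotency": the consumer records processed message IDs and ignores redeliveries. The in-memory set and the payment handler are hypothetical stand-ins; in production the processed-ID check would be a database table or cache with a unique constraint.

```python
# Sketch: idempotent consumer under at-least-once delivery.
processed_ids = set()
charges = []

def handle_payment(message):
    """Charge the customer at most once per message ID,
    even if the queue redelivers the same message."""
    if message["id"] in processed_ids:
        return "duplicate-ignored"
    charges.append(message["amount"])  # the side effect we must not repeat
    processed_ids.add(message["id"])
    return "charged"

msg = {"id": "msg-42", "amount": 19.99}
first = handle_payment(msg)   # "charged"
second = handle_payment(msg)  # redelivery: "duplicate-ignored", no double charge
```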


Terminal Actions

Decision: Synchronous HTTP/gRPC

Choose: Direct HTTP or gRPC call from service A to service B.

Why: When the caller genuinely requires B's response to continue, and the operation is fast and B is highly available, synchronous communication is the simplest and most debuggable option. Avoid introducing async complexity when the call pattern is inherently request-response.

Next step: Implement a circuit breaker (Hystrix, Resilience4j, or built-in Envoy circuit breaker if using a service mesh). Set an explicit timeout that is tighter than your caller's own timeout. Add distributed tracing to make the synchronous call visible in your observability stack.
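The explicit-timeout requirement can be sketched generically. `call_service_b` is a hypothetical stand-in for the real client call; in practice you should prefer the timeout support built into your HTTP/gRPC client, since this wrapper cannot cancel work already running in the worker thread.

```python
# Sketch: enforce an explicit timeout on a sync call instead of relying
# on client-library defaults (which are often unbounded).
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CallTimeout

def call_with_timeout(fn, timeout_s):
    """Run fn, but give up after timeout_s seconds instead of blocking
    the caller's thread indefinitely. Returns None on timeout."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except CallTimeout:
            return None  # caller maps this to a 504 / degraded response

def call_service_b():
    # hypothetical fast call standing in for the real HTTP/gRPC client
    return {"status": "ok"}

result = call_with_timeout(call_service_b, timeout_s=0.5)
```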

Decision: Async with SQS/RabbitMQ Standard Queue

Choose: A standard queue (AWS SQS standard, RabbitMQ, Google Pub/Sub) for a point-to-point work queue pattern.

Why: Standard queues provide durability (messages survive consumer restarts), decoupling (producer does not need to know if the consumer is available), backpressure management (queue depth as a scaling signal), and at-least-once delivery. Best for background jobs, work queues, and single-consumer patterns where ordering is not required.

Next step: Design the consumer to be idempotent — it must handle processing the same message twice without side effects. Set a dead-letter queue (DLQ) for messages that fail after N retries. Monitor queue depth as a key operational metric and autoscale consumers based on it.
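The retry-then-DLQ flow can be sketched in memory. `MAX_RECEIVES` plays the role of an SQS redrive policy's `maxReceiveCount`; the queue and handler here are simulations, not a real client.

```python
# Sketch: at-least-once delivery with a dead-letter queue after N failures.
MAX_RECEIVES = 3
dead_letter_queue = []

def deliver(message, handler):
    """Retry the handler up to MAX_RECEIVES times, then route the
    poison message to the DLQ instead of retrying forever."""
    for attempt in range(1, MAX_RECEIVES + 1):
        try:
            handler(message)
            return "processed"
        except Exception:
            continue  # redelivery on the next attempt
    dead_letter_queue.append(message)  # quarantined for manual inspection
    return "dead-lettered"

def always_fails(msg):
    raise ValueError("malformed payload")

outcome = deliver({"body": "bad"}, always_fails)
```

The DLQ keeps one poison message from blocking the queue, and its depth is itself a metric worth alerting on.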

Decision: Async with Kafka or Kinesis

Choose: Apache Kafka (self-hosted or Confluent Cloud) or AWS Kinesis for ordered, partitioned, replayable event streams.

Why: Kafka is the right choice when you need: ordered delivery within a partition, message replay capability, multiple independent consumer groups processing the same events, or high throughput (100k+ events/sec). It is the foundation of event-driven architectures and event sourcing.

Next step: Define your topic partitioning strategy (partition by entity ID for per-entity ordering). Set a retention period based on your replay requirements. Define consumer group IDs for each independent consumer. Monitor consumer lag as the primary health indicator — lag indicates a consumer falling behind the producer.

Decision: Async with SNS/EventBridge (Pub-Sub Fan-Out)

Choose: AWS SNS, Google Pub/Sub, or AWS EventBridge for fan-out to multiple independent subscribers.

Why: When one event must trigger multiple independent reactions, a pub-sub fan-out decouples the producer from each consumer. The producer emits one event; the bus delivers it to all registered subscribers independently. Adding a new consumer does not require changes to the producer.

Next step: Define a schema registry or at least a documented event schema contract. Consumers must declare the events they consume. Use EventBridge schema discovery to maintain a catalog of event types. Test consumer isolation — one failing consumer should not affect others.
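Consumer isolation is the property worth testing: one failing subscriber must not stop delivery to the others. A minimal sketch, with illustrative subscriber names (a real bus delivers and retries per subscription; this only models the isolation):

```python
# Sketch: pub-sub fan-out with per-subscriber isolation.

def fan_out(event, subscribers):
    """Deliver one event to every subscriber independently; collect
    per-subscriber outcomes instead of failing the whole publish."""
    results = {}
    for name, handler in subscribers.items():
        try:
            handler(event)
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"  # isolated; others still run
    return results

received = []

def fraud_check(event):
    raise RuntimeError("fraud service down")

subscribers = {
    "inventory": lambda e: received.append(("inventory", e["order_id"])),
    "fraud": fraud_check,
    "email": lambda e: received.append(("email", e["order_id"])),
}
results = fan_out({"type": "OrderPlaced", "order_id": "o-1"}, subscribers)
# inventory and email still receive the event despite the fraud failure
```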

Decision: Async with Webhook/Callback

Choose: Fire-and-forget + asynchronous notification via webhook or callback URL.

Why: When a long-running operation (> 1s) is initiated by the caller but the caller cannot block waiting for the result, the webhook pattern allows the caller to continue and be notified upon completion. Common for: payment processing, report generation, ML inference jobs.

Next step: Define the callback contract: what payload the callback will include, how the caller authenticates the callback, and what happens if the callback delivery fails (retry policy, expiry). The caller must store sufficient state to process the callback when it arrives, potentially minutes or hours later.
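One common way the caller authenticates the callback is an HMAC signature over the payload with a shared secret, verified with a constant-time comparison. The secret and payload below are hypothetical; this shows the mechanism, not any particular provider's scheme.

```python
# Sketch: authenticating an incoming webhook via HMAC-SHA256.
import hashlib
import hmac

SHARED_SECRET = b"example-webhook-secret"  # hypothetical; exchanged out of band

def sign(payload: bytes) -> str:
    return hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, signature: str) -> bool:
    """Constant-time comparison; reject callbacks whose signature
    does not match the payload actually received."""
    return hmac.compare_digest(sign(payload), signature)

payload = b'{"job_id": "job-7", "status": "complete"}'
ok = verify(payload, sign(payload))          # genuine callback
tampered = verify(b'{"job_id": "job-7", "status": "failed"}', sign(payload))
```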


Warning: Long Synchronous Calls Under Load

When: An operation takes > 1s and is called synchronously under user-facing load.

Risk: Thread pools are finite. If B is slow under load, A's threads fill waiting for responses, eventually causing A to queue or reject new requests. This is the primary mechanism by which a single slow downstream service cascades into a full system outage.

Mitigation: If the operation must be synchronous (user needs the result), set a strict timeout and return an appropriate error when exceeded. If the operation can be async, use the polling pattern: A initiates a job and returns a job ID; the client polls for completion using a separate status endpoint.
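The polling pattern reduces to three operations: submit (return a job ID immediately), complete (the worker records the result), and poll (the client checks status). A sketch with an in-memory job store standing in for a real database:

```python
# Sketch: job-ID polling pattern for long-running operations.
import uuid

jobs = {}  # in-memory stand-in for a durable job store

def submit_job(payload):
    """Accept the work and return a job ID immediately instead of blocking."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result": None}
    return job_id

def complete_job(job_id, result):
    """Called by the worker when the long-running operation finishes."""
    jobs[job_id] = {"status": "complete", "result": result}

def poll_status(job_id):
    """The client calls this repeatedly until status flips to complete."""
    return jobs.get(job_id, {"status": "unknown", "result": None})

job_id = submit_job({"report": "q3-revenue"})
pending = poll_status(job_id)["status"]   # "pending" while the worker runs
complete_job(job_id, {"rows": 1024})      # worker finishes later
done = poll_status(job_id)["status"]      # "complete"
```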

Warning: Sync Circuit Breaker Not Configured

When: Synchronous calls exist between services but no circuit breaker is configured.

Risk: A single unavailable dependency cascades: requests to A time out waiting for B; thread pools exhaust; A becomes unavailable; A's callers experience the same cascade.

Mitigation: Every synchronous service-to-service call must have: (1) an explicit timeout, (2) a circuit breaker that opens after N failures and stops sending requests to B, (3) a fallback that A serves when B is unavailable. This is table stakes for synchronous inter-service communication.
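The core of a circuit breaker is a small state machine. A minimal sketch of the closed-to-open transition; real implementations (Resilience4j, Envoy) add a half-open probing state and time-based reset, which this deliberately omits:

```python
# Sketch: circuit breaker that opens after N consecutive failures.

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.state = "closed"

    def call(self, fn, fallback):
        """While open, skip the downstream call entirely and serve the fallback."""
        if self.state == "open":
            return fallback()
        try:
            result = fn()
            self.failures = 0  # any success resets the failure counter
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.state = "open"  # stop hammering the failing dependency
            return fallback()

def unavailable_service():
    raise ConnectionError("service B is down")

breaker = CircuitBreaker(failure_threshold=3)
responses = [breaker.call(unavailable_service, lambda: "cached-fallback")
             for _ in range(5)]
state = breaker.state  # "open" after the third consecutive failure
```

Once open, the breaker converts slow timeout-bound failures into fast local fallbacks, which is what breaks the cascade described above.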


Edge Cases

  • Request-reply over async infrastructure: When you need request-reply semantics (caller waits for result) but want the reliability of a queue (durability, decoupling), implement async request-reply: producer sends a message with a correlation ID and a reply-to queue; consumer processes and sends the response to the reply-to queue; producer polls or subscribes to the reply queue. This adds complexity but provides durability.
  • Database as message queue (outbox pattern): For transactional reliability, the outbox pattern — write the event to a local database table in the same transaction as the business operation, then relay it to the message bus — guarantees the event is emitted without distributed transactions. The relay delivers at least once, so consumers should still be idempotent. Use this when you cannot afford to lose events and cannot use distributed transactions.
  • Synchronous calls in a monolith: Within a monolith, "synchronous calls" are in-process function calls, not network calls. The failure modes are completely different. This tree applies specifically to network-based inter-service communication.
  • User-facing real-time features: Synchronous gRPC or HTTP is appropriate for real-time user interactions (search autocomplete, live collaboration, interactive dashboards) where latency is a user-experience requirement. Async patterns introduce minimum latency floors that are incompatible with sub-100ms user experience requirements.
  • Saga pattern for distributed transactions: When a multi-step operation spans multiple services and each step must roll back on failure, use the saga pattern (choreography or orchestration). This is an async pattern for what would be a synchronous database transaction in a monolith. The complexity is significant — document all compensating transactions before implementing.
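The outbox pattern from the edge cases above can be sketched with SQLite. Table and column names are illustrative; the point is that the business row and the event row commit in one local transaction, and a separate relay publishes unpublished rows to the bus.

```python
# Sketch: transactional outbox -- event written atomically with the business
# operation, then relayed to the message bus by a separate process.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
db.execute("""CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT,
    published INTEGER DEFAULT 0)""")

def place_order(order_id, total):
    """Business write and event write commit atomically: the event can
    neither be lost nor emitted without the order existing."""
    with db:  # one transaction for both inserts
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "OrderPlaced", "order_id": order_id}),))

def relay_outbox(publish):
    """Relay process: publish unpublished rows, then mark them.
    At-least-once -- a crash between publish and mark causes a re-send."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

published = []
place_order("o-99", 42.50)
relay_outbox(published.append)
```

Because the relay may re-send after a crash, downstream consumers still need the idempotency discussed in Check 7.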

Cross-References