How We Got Here: Service Communication

Arc: Networking · Eras covered: 5 · Timeline: ~2005-2025 · Read time: ~12 min


The Original Problem

In 2005, if you had two applications that needed to talk to each other, you hard-coded a hostname and port number into a configuration file. Service A called Service B at http://serviceB.internal:8080/api/data. When Service B moved to a new server, you updated every configuration file that referenced it. When Service B needed to scale to three instances, you put a load balancer in front and changed the hostname. When Service B was slow and took Service A down with it, you added a timeout and hoped it was long enough.
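Concretely, that wiring lived in a hand-edited configuration file — something like this hypothetical properties file (names are illustrative):

```properties
# service-a.properties (hypothetical) — updated by hand and redeployed
# every time Service B moved, scaled, or changed ports
serviceB.url=http://serviceB.internal:8080/api/data
# the timeout you hoped was long enough
serviceB.timeoutMillis=30000
```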

There was no service discovery, no automatic load balancing between instances, no circuit breaking, no retries with backoff, no mutual TLS, and no observability into the calls between services. The network was treated as reliable, instantaneous, and secure — and it was none of those things.


Era 1: Direct HTTP and SOAP (~2005-2010)

The Solution

Services communicated via HTTP. Simple services used REST-like patterns (though the term wasn't yet widely used). Enterprise systems used SOAP (Simple Object Access Protocol) with WSDL (Web Services Description Language) for contract definition. Service discovery was a DNS entry or a load balancer VIP managed by the network team.

What It Looked Like

<!-- SOAP request (~2007) -->
POST /OrderService HTTP/1.1
Host: orders.internal.example.com
Content-Type: text/xml
SOAPAction: "CreateOrder"

<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <CreateOrder xmlns="http://example.com/orders">
      <CustomerId>12345</CustomerId>
      <Items>
        <Item ProductId="ABC" Quantity="2"/>
      </Items>
    </CreateOrder>
  </soap:Body>
</soap:Envelope>
// Client-side "resilience" — a try/catch and a prayer
try {
    OrderResponse response = orderClient.createOrder(request);
} catch (Exception e) {
    logger.error("Order service call failed", e);
    throw new ServiceUnavailableException("Please try again later");
}

Why It Was Better

  • Standardized protocol (HTTP) worked across languages and platforms
  • SOAP/WSDL provided strict contract definition and code generation
  • Load balancers (F5, HAProxy) provided basic traffic distribution
  • DNS-based service discovery was simple and universal

Why It Wasn't Enough

  • SOAP was verbose and slow (XML parsing overhead)
  • No client-side resilience (timeouts, retries, circuit breaking)
  • Service discovery was manual (update DNS/config when services moved)
  • Load balancers were hardware appliances — expensive and slow to configure
  • No observability into inter-service communication
  • Cascading failures were common (one slow service took everything down)

Legacy You'll Still See

SOAP persists in banking, insurance, healthcare, and government systems. Many "legacy APIs" are SOAP/WSDL. Direct HTTP with hardcoded endpoints is still the starting point for simple architectures. F5 load balancers are in every large enterprise data center.


Era 2: REST APIs and Client-Side Resilience (~2010-2016)

The Solution

REST (Roy Fielding, 2000, but mainstream adoption ~2010) replaced SOAP with a simpler, JSON-based approach. Netflix open-sourced the libraries that made their microservices architecture work: Eureka for service discovery, Ribbon for client-side load balancing, Hystrix for circuit breaking, and Zuul for API gateway routing. These patterns showed the industry how to build resilient inter-service communication.

What It Looked Like

// Netflix Hystrix circuit breaker (~2014)
@HystrixCommand(
    fallbackMethod = "getDefaultRecommendations",
    commandProperties = {
        @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "20"),
        @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50"),
        @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "3000")
    }
)
public List<Recommendation> getRecommendations(String userId) {
    return restTemplate.getForObject(
        "http://recommendation-service/api/users/{id}/recommendations",
        List.class, userId);
}

public List<Recommendation> getDefaultRecommendations(String userId) {
    return Collections.emptyList(); // graceful degradation
}
# Netflix Eureka client — service registration
eureka:
  client:
    serviceUrl:
      defaultZone: http://eureka:8761/eureka/
  instance:
    preferIpAddress: true
    leaseRenewalIntervalInSeconds: 10

Why It Was Better

  • REST + JSON was simpler, lighter, and faster than SOAP + XML
  • Client-side service discovery eliminated manual DNS management
  • Circuit breakers prevented cascading failures
  • Client-side load balancing distributed traffic without hardware LBs
  • Retry logic with exponential backoff handled transient failures

Why It Wasn't Enough

  • Library-based: every service needed the Netflix stack (Java-centric)
  • Polyglot architectures needed separate implementations per language
  • Developers had to understand and configure resilience patterns correctly
  • Library upgrades required redeploying every service
  • JSON/REST lacked strong typing and efficient serialization
  • No automatic mTLS between services

Legacy You'll Still See

REST APIs are the current default for synchronous service communication. Hystrix (now in maintenance mode) patterns live on in resilience4j and Spring Cloud Circuit Breaker. The circuit breaker, retry, and timeout patterns are fundamental — you need to understand them regardless of the implementation.


Era 3: gRPC and Protocol Buffers (~2015-2020)

The Solution

gRPC (Google, 2015) brought efficient binary serialization (Protocol Buffers), HTTP/2 multiplexing, bidirectional streaming, and code generation to service communication. You defined your API in a .proto file, and gRPC generated client and server code in 10+ languages. Performance was dramatically better than JSON/REST for high-throughput, low-latency communication.

What It Looked Like

// user.proto — API contract
syntax = "proto3";
package user.v1;

import "google/protobuf/timestamp.proto";

service UserService {
  rpc GetUser(GetUserRequest) returns (User);
  rpc ListUsers(ListUsersRequest) returns (stream User);
  rpc CreateUser(CreateUserRequest) returns (User);
}

message GetUserRequest {
  string user_id = 1;
}

message User {
  string user_id = 1;
  string name = 2;
  string email = 3;
  google.protobuf.Timestamp created_at = 4;
}
// Generated Go client usage (error handling abbreviated)
conn, err := grpc.Dial("user-service:50051", grpc.WithInsecure())
if err != nil {
    log.Fatalf("dial user-service: %v", err)
}
defer conn.Close()
client := userpb.NewUserServiceClient(conn)

user, err := client.GetUser(ctx, &userpb.GetUserRequest{
    UserId: "usr-12345",
})
// user.Name, user.Email — strongly typed, no JSON parsing

Why It Was Better

  • Binary serialization: 5-10x smaller payloads than JSON
  • HTTP/2: multiplexed connections, header compression, streaming
  • Strong typing: proto definitions are the contract, code is generated
  • Language-agnostic: one proto file generates clients in any language
  • Streaming: server-side, client-side, and bidirectional

Why It Wasn't Enough

  • Not browser-friendly (gRPC-Web was a workaround, not a solution)
  • Debugging was harder (binary traffic isn't human-readable)
  • Proto backward compatibility required discipline (field numbering rules)
  • Load balancing was different from REST (HTTP/2 persistent connections)
  • Still no built-in service mesh capabilities (mTLS, observability, traffic shaping)
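The field-numbering discipline mentioned above looks like this in practice — a hypothetical later revision of the User message, shown to illustrate the rules rather than taken from any real API:

```protobuf
syntax = "proto3";
package user.v1;

import "google/protobuf/timestamp.proto";

message User {
  // email was removed in this revision; its number and name are reserved
  // so a later change can never reuse them with a different type
  reserved 3;
  reserved "email";
  string user_id = 1;
  string name = 2;
  google.protobuf.Timestamp created_at = 4;
  string display_name = 5;  // new fields always take a fresh number
}
```

Old clients that still send field 3 simply have it ignored; renumbering or reusing field 3 would instead produce garbage on decode — hence the discipline.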

Legacy You'll Still See

gRPC is the standard for internal service-to-service communication in high-performance architectures. Kubernetes uses gRPC internally (etcd, the Container Runtime Interface). Google, Netflix, Uber, and most large tech companies use gRPC internally. If you work on microservices at scale, you will encounter gRPC.


Era 4: Service Mesh (Istio, Linkerd) (~2017-2023)

The Solution

Service meshes moved communication concerns out of the application and into the infrastructure. A sidecar proxy (Envoy for Istio, linkerd2-proxy for Linkerd) was injected alongside every service instance. The proxy handled mTLS, load balancing, retries, circuit breaking, observability, and traffic shaping — without any application code changes. The control plane (Istiod, Linkerd control plane) configured all the proxies centrally.

What It Looked Like

# Istio VirtualService — traffic management without code changes
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: user-service
spec:
  hosts:
    - user-service
  http:
    - route:
        - destination:
            host: user-service
            subset: v1
          weight: 90
        - destination:
            host: user-service
            subset: v2
          weight: 10
      timeout: 3s
      retries:
        attempts: 3
        perTryTimeout: 1s
        retryOn: 5xx,reset,connect-failure
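# Istio DestinationRule — defines the v1/v2 subsets the VirtualService
# routes to (a sketch; without it, the weighted routes above have
# nothing to match against)
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: user-service
spec:
  host: user-service
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2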
# Istio PeerAuthentication — enforce mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT
# Observability comes free — Kiali dashboard shows:
# - Service dependency graph
# - Request rates, error rates, latency (RED metrics)
# - mTLS status for every connection
# - Traffic flow between services

Why It Was Better

  • Zero code changes: mTLS, retries, circuit breaking come from the proxy
  • Language-agnostic: works for any service regardless of language
  • Centralized policy: traffic rules, security, and observability managed as config
  • Automatic mTLS: every service-to-service call encrypted and authenticated
  • Deep observability: request-level metrics and traces without instrumentation

Why It Wasn't Enough

  • Sidecar overhead: CPU, memory, and latency cost per pod
  • Operational complexity: the mesh itself needs monitoring and management
  • Debugging through proxies was harder (proxy logs, proxy configs)
  • Istio's complexity became legendary (too many CRDs, too many knobs)
  • Resource overhead was significant (an extra proxy container in every pod)
  • Not all traffic patterns worked well (non-HTTP protocols, UDP)

Legacy You'll Still See

Istio is the most deployed service mesh but is often seen as too complex. Linkerd is popular for its simplicity. Both are in production at large organizations. The service mesh pattern is established but not universal — many teams decide the overhead isn't worth it for their scale.


Era 5: eBPF-Based Mesh and Ambient Mesh (~2022-2025)

The Solution

Cilium Service Mesh (2022) used eBPF to move mesh functionality into the Linux kernel, eliminating the sidecar proxy for many use cases. Istio's Ambient Mesh (2022) replaced per-pod sidecars with per-node proxies (ztunnels) for L4 processing and optional per-service waypoint proxies for L7. Both aimed to reduce the resource overhead and operational complexity of traditional service meshes.

What It Looked Like

# Cilium Service Mesh — no sidecars needed
# L4 load balancing, mTLS, and network policy via eBPF
# L7 observability via Hubble

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-api
spec:
  endpointSelector:
    matchLabels:
      app: api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
# Istio Ambient Mesh — per-node ztunnel + optional waypoint proxies
# Enable ambient mode for a namespace:
kubectl label namespace production istio.io/dataplane-mode=ambient

# L4 mTLS and authorization: handled by ztunnel (per-node, always on)
# L7 features (retries, traffic splitting): opt-in via waypoint proxy
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: user-service-waypoint
  namespace: production
  labels:
    istio.io/waypoint-for: service
spec:
  gatewayClassName: istio-waypoint
# Hubble (Cilium) — kernel-level observability
hubble observe --namespace production --protocol http
# Shows every HTTP request with source, destination, method, path, status
# Zero application changes, zero sidecar proxies

Why It Was Better

  • No sidecar overhead: eBPF runs in the kernel, ambient uses per-node proxies
  • Lower latency: kernel-level processing avoids proxy hops
  • Simpler operations: fewer components to manage
  • Gradual adoption: start with L4 (mTLS), add L7 only where needed
  • Hubble provides deep network observability without instrumentation

Why It Wasn't Enough

  • eBPF requires Linux kernel 5.x+ (limits older infrastructure)
  • L7 processing in eBPF is limited (complex routing still needs proxies)
  • Ambient Mesh is newer and still maturing (it only reached GA in late 2024)
  • Cilium's scope is growing rapidly (networking + mesh + observability) — complexity is shifting, not disappearing
  • Migration from sidecar mesh to sidecarless is non-trivial

Legacy You'll Still See

This is the current frontier. Cilium is becoming the default CNI for Kubernetes (GKE uses it natively). Ambient Mesh is Istio's future direction. The sidecar model is being phased out for most use cases. Organizations adopting a service mesh today are choosing between Cilium and Istio Ambient.


Where We Are Now

Most organizations are at one of three stages: (1) direct service-to-service HTTP/gRPC with client-side resilience libraries, (2) a service mesh (Istio or Linkerd) for automatic mTLS and observability, or (3) evaluating eBPF-based alternatives. The trend is clearly toward infrastructure-level service communication management — developers write business logic, the platform handles resilience, security, and observability.

Where It's Going

The sidecar proxy model is being replaced by kernel-level (eBPF) and per-node proxy architectures. Service mesh capabilities will become built into the platform (managed Kubernetes offerings will include mesh features by default). The distinction between "networking" and "application platform" will blur — mTLS, traffic shaping, and observability will be expected defaults, not add-ons.

The Pattern

Every generation moves communication concerns further from the application code and deeper into the infrastructure. From hardcoded URLs to client libraries to sidecar proxies to kernel programs — the pattern is always the same: make the right thing the default and make developers opt out of safety rather than opt in.

Key Takeaway for Practitioners

Don't adopt a service mesh until you have a problem that a service mesh solves (mTLS at scale, traffic shaping between services, unified observability). Start with good client-side resilience (timeouts, retries, circuit breakers). Add a mesh when the operational cost of library-based resilience exceeds the operational cost of running the mesh.

Cross-References