Technical Blog

Microservice Communication - Advanced Systems Design

Microservice Communication Patterns - SDE 2 Level Deep Dive

Microservice architecture requires careful consideration of inter-service communication strategies. This document covers synchronous and asynchronous patterns with distributed systems trade-offs suitable for FAANG-level system design.


Table of Contents

  1. Communication Paradigms
  2. Synchronous Communication
  3. Asynchronous Communication
  4. Message Brokers Deep Dive
  5. Distributed System Challenges
  6. Pattern Comparison Matrix
  7. Production Considerations

Communication Paradigms

Overview

In distributed systems, services need to exchange information reliably and efficiently. The choice between synchronous and asynchronous communication impacts:

  • Consistency Model (Strong vs Eventual)
  • Latency SLA (P99, P95 tail latencies)
  • System Coupling (degree of interdependence)
  • Failure Recovery (fault isolation)
  • Scalability (horizontal scaling capabilities)

Synchronous Communication

Definition

A caller service makes a request and blocks/waits for a response before proceeding. The request-response pattern is inherently synchronous.

Implementation Mechanisms

1. REST APIs (HTTP/HTTPS)

  • Protocol: HTTP/1.1 or HTTP/2
  • Serialization: JSON (text-based, 2-3 KB overhead)
  • Latency: 50-500ms per call depending on network/processing
  • Use Case: Simple CRUD operations, public APIs, browser clients

Considerations:

  • Stateless by design (easier horizontal scaling)
  • Connection overhead: TCP 3-way handshake (~5-10ms for distant regions)
  • Polling complexity: consumer must repeatedly ask for updates
  • Retry logic requires idempotency keys to prevent duplicates

Example Scenario:

Order Service → Payment Service (wait for response)
│                    │
└─ Blocks waiting ───┘
If Payment Service down → immediate failure

2. GraphQL

  • Advantage: Eliminates over-fetching and under-fetching of data
  • Latency: Similar to REST but can reduce round-trips
  • Complexity: Query parsing and execution overhead (~2-5ms additional)
  • Use Case: Complex data aggregation, mobile clients with limited bandwidth

Trade-offs:

  • Solves data fetching problems but doesn’t solve inherent synchronous coupling
  • Deeper query complexity can lead to N+1 problems in resolvers

N+1 Problem — Simple Analogy First:

Imagine you are a librarian. Someone asks: “Give me all books, and for each book tell me the author’s hometown.”

  • You fetch the list of 10 books → 1 trip to the shelf
  • Then for each book, you go call the author individually to ask their hometown → 10 separate phone calls
  • Total = 1 + 10 = 11 trips just to answer one question.
  • If there were 1000 books → 1001 trips. That is the N+1 problem.

How it happens in GraphQL:

GraphQL has resolvers — small functions, one per field, each responsible for fetching its own data. They do not know about each other. So when you ask for a list AND a nested field on each item, each nested field fires its own separate database query.

GraphQL Query:
{
  orders {        ← 1 query: fetch all orders
    id
    user {        ← resolver fires for EACH order separately
      name
    }
  }
}

What actually executes in the database:
SELECT * FROM orders                        ← 1 query
SELECT * FROM users WHERE id = 101          ← for order 1
SELECT * FROM users WHERE id = 102          ← for order 2
SELECT * FROM users WHERE id = 103          ← for order 3
...

If there are 1000 orders → 1001 DB queries for one client request.

Why resolvers behave this way: Each resolver is written independently:

// Order resolver
resolveUser(order) {
    return db.query("SELECT * FROM users WHERE id = " + order.userId)
    // It has NO idea 999 other orders are also calling this same function
}

Fix — DataLoader (Batching):

DataLoader collects all the IDs being requested within a single moment, then fires one batched query instead of N individual ones.

Without DataLoader:
→ 1000 separate: SELECT * FROM users WHERE id = X

With DataLoader:
→ 1 batched:     SELECT * FROM users WHERE id IN (101, 102, 103, ...)

Why it matters more in microservices: In microservices, resolvers don’t just hit a DB — they make HTTP or gRPC calls to other services. So N+1 DB queries become N+1 network calls across services, each adding network latency. 1000 orders = 1000 HTTP calls to the User Service, which can overwhelm it.

3. Feign (Spring Cloud) with Service Discovery (Eureka)

  • Purpose: Client-side HTTP client with declarative interface
  • Discovery: Automatic service location via Eureka registry
  • Circuit Breaker Integration: Works with Hystrix/Resilience4j
  • Latency: Similar to REST (~50-500ms) + discovery lookup (~5-20ms)

How it works:

Service A (Feign Client) → Eureka Registry (lookup ServiceB instance)
                        → HTTP call to ServiceB
                        → Blocking wait

Failure Scenario:

  • Service B instances crash → Eureka marks them unhealthy (takes 30-60 seconds)
  • New requests fail → Circuit breaker opens (fail-fast)
  • Cascading failures can occur if timeout not configured properly

4. gRPC (Remote Procedure Call)

  • Protocol: HTTP/2 with multiplexing
  • Serialization: Protocol Buffers (binary, ~300 bytes overhead)
  • Latency: 5-50ms (10x faster than REST due to binary format)
  • Performance: 1000+ concurrent streams per connection
  • Throughput: 100K+ requests per second per instance

Advantages:

  • Bi-directional streaming (client→server, server→client, both directions)
  • Server push capabilities
  • Native deadline/timeout management
  • Load balancing aware (gRPC can balance per request)

Use Cases:

  • Inter-service communication in microservices (no public API)
  • Real-time data streaming
  • High-frequency data updates (time-series, metrics)
  • Internal services where schema evolution is controlled

Example:

service OrderService {
  rpc CreateOrder(OrderRequest) returns (OrderResponse);
  // Bi-directional streaming
  rpc StreamOrders(stream Order) returns (stream OrderStatus);
}

Trade-offs of gRPC:

  • Binary protocol not human-readable (harder debugging with standard tools)
  • Requires code generation (.proto files → language-specific stubs)
  • Not browser-friendly (needs gRPC-web wrapper)
  • Steep learning curve compared to REST

Key Issues with Synchronous Communication

1. Tight Coupling

  • Service A directly depends on Service B’s availability
  • Service B’s latency becomes Service A’s latency
  • Schema changes in B require coordinated deployments

Cascading Failure Example:

Order Service -> Payment Service -> Bank Service
     v              v                  v
1s timeout    + 1s timeout        (down)
     v              v
Result: 3 second timeout for user request
If 100 requests/sec -> 100 connection threads blocked for 3 seconds
-> Rapid thread pool exhaustion -> Denial of Service

Availability Math for the Same Call Chain:

For synchronous serial dependencies, end-user success needs all required services to be available at the same time:

A_total = A_order * A_payment * A_bank

Example:

  • A_order = 99.99% = 0.9999
  • A_payment = 99.9% = 0.999
  • A_bank = 99.9% = 0.999

A_total = 0.9999 * 0.999 * 0.999 = 0.997901099 = 99.7901%

So failure probability is:

1 - 0.997901099 = 0.002098901 = 0.2099%

Annual downtime (525,600 minutes/year):

Downtime = (1 - A_total) * 525,600 = 0.002098901 * 525,600 ~ 1,103 minutes/year (~18.4 hours/year)

2. Availability Amplification

  • Concept: If each service has 99.9% uptime, the composite system availability decreases
  • Math: 99.9% × 99.9% × 99.9% = 99.7% (3 hops)
  • Impact: At 99.7%, you have ~26 hours of downtime per year
  • Mitigation: Circuit breakers (fail fast), timeouts, fallbacks

3. Distributed Transaction Problem

  • ACID transactions impossible across network boundaries
  • Network partitions prevent atomic all-or-nothing outcomes
  • Compensation logic (saga pattern) required for consistency

Example - Distributed Transaction:

Service A writes: user balance -= $100  [Success]
Network fails
Service B writes: order record += 1     [Fails]

Result: Inconsistent state - money taken but no order created

4. Timeout Configuration

Timeouts should be based on latency SLOs, not guesswork.

  • Too aggressive (for example 100ms): healthy but slow tail requests fail.
  • Too lenient (for example 60s): threads/connections stay blocked, increasing blast radius.
  • Practical rule: start with downstream P95 or P99 and add safety margin + jitter.

What P95 means (and other P values):

P means percentile of latency distribution.

  • P50 (median): 50% of requests finish at or below this latency.
  • P90: 90% finish at or below this latency.
  • P95: 95% finish at or below this latency.
  • P99: 99% finish at or below this latency (tail latency).

Example (10,000 Payment requests in 1 minute):

  • P50 = 40ms
  • P90 = 80ms
  • P95 = 120ms
  • P99 = 300ms

Interpretation: 95% of calls complete within 120ms, but the slowest 1% can still take around 300ms.

So when we set timeout to 250ms, we are intentionally allowing normal tail (P95) while still cutting off pathological slow requests before they block resources too long.

Senior guidance for choosing timeout target:

  • User-facing synchronous call: usually start near P99 (or P95 + safety margin) and tune using error budget.
  • High-QPS internal hop: prefer tighter timeout near P95 + margin to protect capacity.
  • Always combine timeout with retries (bounded), jitter, circuit breaker, and idempotency.

Example (timeout budget):

  • End-to-end API SLO: 800ms
  • Call chain: API Gateway -> Order -> Payment -> Bank
  • Budget split: Gateway(80) + Order(120) + Payment(250) + Bank(250) + buffer(100)
  • Set Payment timeout near observed tail (say P95=120ms) with margin, e.g. 250ms, not 60s.

This keeps failures fast and recoverable, and prevents thread pool lockup.

5. Retry Storms

A retry storm happens when many clients retry at the same time after a transient failure, multiplying load on an already unhealthy dependency.

  • If 10k requests fail and each client retries 3 times immediately, backend sees up to 40k attempts (1 original + 3 retries).
  • This can turn a short outage into sustained overload (thundering herd).

Mitigation pattern:

  • Exponential backoff + jitter
  • Retry only idempotent operations
  • Cap attempts and total retry time
  • Combine with circuit breaker and rate limiting

Formula:

delay = min(max_delay, base_delay * 2^attempt) + random(0, jitter)

Example delays (base=100ms, max=2s, jitter=0-100ms):

  • Retry 1: ~100-200ms
  • Retry 2: ~200-300ms
  • Retry 3: ~400-500ms
  • Retry 4: ~800-900ms
  • Retry 5: ~1600-1700ms

This spreads retries over time, reduces synchronized spikes, and improves recovery probability.

When to Use Synchronous Communication

1. Query Operations (Read-Only)

  • Fetching product catalog, user profiles, inventory status
  • No side effects → Safe to retry without concern

2. Transactional Workflows (Atomic Operations)

  • Payment processing: Must validate before committing
  • Inventory reservation: Must check stock before confirming order
  • Requirement: Immediate feedback needed for validity

3. Low-Latency User-Facing Operations

  • Authentication/Authorization: Users expect <100ms response
  • Search queries: P99 latency < 200ms critical for UX
  • API Gateway to origin services: Direct request flow

4. Strongly Consistent Operations

  • Distributed transactions using 2-phase commit (rare, use carefully)
  • When eventual consistency is not acceptable (compliance requirements)

Asynchronous Communication

Definition

A sender publishes a message to a broker and continues immediately without waiting for processing. Processing happens later by one or more consumers.

Architecture Overview

Producer → Message Broker → Consumer
         (fire & forget)    (processes when ready)

Key Properties:

  • Producer decoupled from Consumer
  • Broker guarantees message delivery (if configured)
  • Consumer processes at own pace
  • Enable backpressure (broker buffers messages)

Message Broker Responsibilities

  1. Message Persistence: Store messages durably (disk, replicated)
  2. Routing: Direct messages to correct consumers
  3. Redelivery: Retry if consumer fails to acknowledge
  4. Retention: Keep messages for configured duration
  5. Consumer Offset Management: Track what each consumer has read
  6. Fault Tolerance: Replicate data across multiple brokers

Asynchronous Patterns

Pattern 1: Point-to-Point (Queue)

Producer → Queue (Broker) → Consumer
         Message stays until consumed
         Only ONE consumer processes it
         Message deleted after acknowledgment

Characteristics:

  • Delivery Guarantee: Exactly-once (with idempotent processing)
  • Consumers: One message, one handler
  • Ordering: Per-queue ordering (if single partition)
  • Retention: Until consumed and acknowledged
  • Examples: RabbitMQ, ActiveMQ, AWS SQS

Use Cases:

  • Task distribution: Job queues (background processing)
  • Work distribution: Order processing across multiple workers
  • Rate limiting: Decouple fast producer from slow consumer

Failure Handling:

Message published → Consumer processes → Crashes before ACK
Broker detects no ACK → Retries with another consumer instance

Pattern 2: Publish-Subscribe (Topic)

Publisher → Topic (Broker) ─��→ Subscriber1
                            ├→ Subscriber2
                            └→ Subscriber3
Message goes to ALL active subscribers

Characteristics:

  • Delivery Guarantee: At-least-once per subscriber
  • Consumers: Message goes to all subscribers
  • Ordering: Per-partition ordering (Kafka)
  • Retention: Time-based or size-based
  • Examples: Kafka, RabbitMQ (with fanout), AWS SNS

Use Cases:

  • Event broadcasting: User signed up → multiple services react
  • Data replication: Update propagated to read replicas
  • Real-time notifications: Push to multiple clients

Critical Issue - Subscriber Availability:

Scenario 1: Subscriber offline at publish time
→ Message lost (if not persistent topic)

Scenario 2: Subscriber offline but topic is persistent
→ Subscriber can replay from offset
→ Recovery is possible

Pattern 3: Event-Driven Architecture

Difference from Messaging:

  • Messaging: Explicit message format, tight coupling on schema
  • Events: Publish intent/state change, not contracts
  • Subscribers react to events based on their own business logic

Example:

Event: UserRegisteredEvent {
  userId: "123",
  email: "user@example.com",
  timestamp: "2024-09-13T22:02:37Z"
}

Subscribers:
- Email Service: Send welcome email
- Analytics Service: Track registration metrics
- Recommendation Service: Initialize user profile

Advantages:

  • Loose coupling: Event structure changes don’t affect all subscribers
  • Scalability: New subscribers can be added without modifying publisher
  • Auditability: Event log provides complete history

Message Brokers Deep Dive

Apache Kafka - Distributed Publish-Subscribe

Architecture

Kafka Cluster (Fault Tolerant)
├── Zookeeper Cluster (Cluster Coordination)
│   ├── Leader (writes)
│   ├── Follower (reads)
│   └── Follower (reads)
│
├── Broker 1 [Leader] ─── Broker 2 [Replica] ─── Broker 3 [Replica]
│   ├── Topic-A-P0 (Leader)
│   │   ├── Offset 0: Message1
│   │   ├── Offset 1: Message2
│   │   └── Offset 2: Message3
│   │
│   ├── Topic-A-P1 (Leader)
│   │   └── Offset 0: Message4
│   │
│   └── Topic-B-P0 (Leader)
│       └── Offset 0: Message5

Core Concepts

1. Topic

  • Logical collection of messages on a subject
  • Partitioned for parallel processing
  • Replicated for fault tolerance
  • Messages retained based on policy (time/size)

2. Partition

  • Physical unit of parallelism
  • Ordered sequence of messages (immutable log)
  • Each message has unique offset (0, 1, 2, …)
  • Critical: One message = one offset in one partition

Partitioning Strategy:

Producer: "user:123" → Hash("user:123") % 5 partitions = Partition 3
Producer: "user:456" → Hash("user:456") % 5 partitions = Partition 1

Result: All messages for same user go to same partition (preserves order)

Anti-Pattern (Hotspot):

If all producers send null key or same key
→ All messages go to same partition
→ Single broker becomes bottleneck
→ Data skew (one partition much larger)

3. Offset

  • Sequential ID for each message in a partition
  • Consumer tracks which offset it has read
  • Enables replaying from any point
  • Use Case: Fixing bugs, reprocessing data, catch-up

4. Replication

  • Replication Factor (RF): How many replicas of each partition
  • Leader: Accepts all writes and reads
  • ISR (In-Sync Replicas): Followers caught up with leader
  • Failure Scenario:
Leader fails → One ISR becomes new leader (no data loss if RF=2+)
If non-ISR elected → Data loss possible

5. Zookeeper (Cluster Coordination)

  • Responsibilities:

    • Broker membership management (who’s alive?)
    • Controller election (one broker becomes controller)
    • Leader election for each partition
    • Consumer group coordination
  • Critical: Odd number required (3, 5, 7 for quorum)

  • Zookeeper data:

/brokers/ids: [1, 2, 3]  (active brokers)
/controller: 1           (current controller)
/topics/my-topic/partitions/0/state: {"leader": 1, "isr": [1,2]}
/consumers/group-1/offsets/my-topic/0: 42  (consumer read up to offset 42)

Producer Behavior

Configuration Trade-offs:

ConfigValueLatencyDurabilityThroughput
acks=0No wait for broker0-1msRISKY (data loss)Very High
acks=1Wait for leader5-20msGoodHigh
acks=allWait for all ISR20-100msSAFE (replicated)Moderate

Batching (improves throughput):

  • Buffer messages for batch.size bytes or linger.ms time
  • Send all at once → Reduces round trips
  • Trade-off: Slightly higher latency for much higher throughput

Compression:

  • snappy: Good compression ratio, fast
  • gzip: Better compression, slower
  • lz4: Very fast, moderate compression
  • Can reduce network bandwidth by 50-90%

Idempotency:

Enable: enable.idempotence=true, transactional.id="unique-id"
Effect: Duplicates automatically deduplicated by broker
Cost: Slight performance hit

Consumer Behavior

Consumer Groups:

Topic with 4 partitions:
P0, P1, P2, P3

Consumer Group A (3 instances):
- Consumer A1 reads P0, P1
- Consumer A2 reads P2
- Consumer A3 reads P3 (idle)

If A3 dies:
- Rebalancing: P3 reassigned to A1 or A2
- Stop-the-world: ~30 seconds of paused consumption

Offset Management:

  • Automatic: Consumer group stores offset in __consumer_offsets topic
  • Manual: Application tracks offset in external store
  • Seek: Reset offset to specific point
Consumer can do:
consumer.seek(partition, 0)  // Start from beginning
consumer.seek(partition, offset - 100)  // Replay last 100
consumer.seekToEnd()  // Skip to latest

Important: Offset committed AFTER processing, not before.

Failure Scenarios:

Scenario 1: Message processed, consumer crashes before offset commit
→ Message reprocessed on restart (at-least-once)

Scenario 2: Message processed + offset committed, consumer crashes
→ Message skipped on restart (exactly-once from consumer perspective)

Solution: Idempotent processing (safe to replay)

RabbitMQ - Traditional Message Queue

Architecture

RabbitMQ Cluster (Queue-based)
├── Broker 1 (Master)
│   ├── Queue: order.processing
│   │   ├── Message: OrderID:1001 (unacknowledged)
│   │   ├── Message: OrderID:1002 (ack pending)
│   │   └── Message: OrderID:1003
│   └── Queue: payment.processing
│
├── Broker 2 (Mirror/Replica)
├── Broker 3 (Mirror/Replica)

Core Concepts

1. Queue (Point-to-Point)

Producer → Exchange → Queue → Consumer (ACK) → Message deleted
  • Messages stay until consumed and acknowledged
  • Only one consumer processes each message
  • Ordering: Single queue guarantees order
  • Load Balance: Multiple consumers share work (round-robin)

2. Exchange (Routing)

  • Direct: Exact key matching
"user.created" �� Queue: user-service
  • Topic: Pattern matching with wildcards
"user.*" → Queues: [user-service, email-service, analytics-service]
  • Fanout: Broadcast to all bound queues
"broadcast" → All bound queues receive copy
  • Headers: Match on message headers

3. Message Acknowledgment

Producer sends → Broker stores → Consumer processes → Consumer ACKs
If no ACK within timeout → Broker redelivers to another consumer

Manual vs Auto ACK:

  • Auto: Acknowledge immediately (risky, message loss on crash)
  • Manual: Acknowledge after successful processing (safe, slightly slower)

Comparison with Kafka

AspectRabbitMQKafka
TypeQueue + Pub-SubDistributed Log (Pub-Sub)
PersistenceOptionalDefault (log-based)
OrderingPer-queuePer-partition
ThroughputTypically 10K-100K+ msgs/sec (workload and topology dependent)Typically 100K to millions msgs/sec (partitioning, batching, and hardware dependent)
ReplicationYesYes (partition-level)
Consumer OffsetAuto tracked (deleted on ACK)Manual control (time-based retention)
Message Loss RiskLow if ACK enabled1 in 100M (configurable)
Use CaseTask queues, simple workflowsEvent streams, high-volume logging

Note: Throughput numbers vary heavily by message size, durability settings, ACK mode, replication factor, disk/network limits, and producer/consumer parallelism.


Distributed System Challenges

CAP Theorem & Consistency Models

CAP: Any distributed system can guarantee at most 2 of 3:

  • Consistency: All nodes see same data
  • Availability: System responds to requests
  • Partition Tolerance: System survives network splits

In Practice: Network partitions WILL occur → Choose CP or AP

Eventual Consistency in Asynchronous Systems

State A → Message Sent → Broker → Network Delay → State B
        t=0              t=10ms                     t=100ms

During t=10-100ms: Different services see different state
This window is EVENTUALLY CONSISTENT

Risks:

  • User sees stale data temporarily
  • Multiple writes create conflicts (last-write-wins not always correct)
  • Compensating transactions may be needed

Example - E-Commerce:

User updates shipping address:
1. Address Service: Address updated immediately
2. Shipping Service: Receives update async, 500ms later
3. User starts checkout: Sees new address but shipping rates might be stale

Idempotency & Exactly-Once Semantics

Problem: In distributed systems, retries can cause duplicates

Delivery Guarantees:

  1. At-least-once: Message guaranteed delivered, but may be duplicated
  2. At-most-once: Message delivered once but may be lost
  3. Exactly-once: Message delivered exactly once (hardest to achieve)

Idempotent Processing:

Consumer processes:
if idempotencyKey not in database:
    processMessage()
    mark idempotencyKey as processed
else:
    skip (already processed)

Example:

Payment API:
POST /payment
{
  "idempotencyKey": "req-12345",
  "amount": 100,
  "userId": "user-123"
}

First call: Charges $100, stores idempotencyKey
Retry due to timeout: Uses same idempotencyKey, doesn't charge again

Network Partitions & Byzantine Failures

Scenario: Service A can’t reach Service B

Service A      Network Partition      Service B
(Active)                              (Active)

A can't call B, so:
- Timeout occurs
- Circuit breaker opens
- Fallback logic executes (stale cache, default value)

B processes messages from queue independently
Result: Temporarily inconsistent state (A thinks B failed, B is fine)

Recovery:

  • When network heals, systems resync
  • Data reconciliation may be needed
  • Audit logs help identify inconsistencies

Cascading Failures

Scenario: Service B slow due to database overload

Service A → Service B (100ms, times out at 200ms)
Service C → Service A (100ms, times out at 200ms)
Service D → Service C (100ms, times out at 200ms)

A's 100 threads waiting for B → Exhausted
C's 100 threads waiting for A → Exhausted
D's 100 threads waiting for C → Exhausted
Entire system grinds to halt

Mitigation:

  • Circuit Breaker: Stop calling failing service, fail fast
  • Bulkheads: Isolate thread pools (don’t use shared pool)
  • Timeouts: Kill long-running requests
  • Fallbacks: Graceful degradation

Pattern Comparison Matrix

CharacteristicREST/gRPCKafkaRabbitMQ
Latency50-500ms50-100ms (processing)10-100ms
ThroughputREST: 1K-10KgRPC: 100K+1M+ msgs/sec10K-100K msgs/sec
ConsistencyStrong (sync)EventualEventual
CouplingTightLooseLoose
Ordering GuaranteeNo (unless coded)Per-partition (strong)Per-queue (strong)
Replay CapabilityNo (unless logged)Yes (offset management)No (delete on ACK)
Operational ComplexityLowVery HighMedium
Infrastructure FootprintMinimalLarge cluster requiredMedium cluster
Failure DomainSingle hopBroker clusterBroker cluster
Best ForQuery/transactionalEvent streaming/analyticsTask queues/workflows

Production Considerations

Monitoring & Observability

Key Metrics to Track:

Synchronous:

  • P99, P95, P50 latencies
  • Error rate (4xx, 5xx)
  • Circuit breaker state changes
  • Timeout occurrences

Asynchronous:

  • Consumer lag (offset - latest offset)
  • End-to-end latency (publish to process)
  • Message loss rate
  • Queue depth (buildup indicates slow consumers)
  • Redelivery rate (indicates failure)

Scaling Strategies

Synchronous:

  • Horizontal scaling: Add more instances
  • Caching: Reduce repeated calls
  • Database optimization: Faster responses → shorter timeout needed

Asynchronous:

  • Partition keys: Distribute load evenly
  • Consumer group scaling: Add instances
  • Retention tuning: Balance disk space vs. replay capability

Failure Handling Patterns

1. Retry with Exponential Backoff

Attempt 1: Immediate
Attempt 2: 100ms + random(0-50ms)
Attempt 3: 200ms + random(0-100ms)
Attempt 4: 400ms + random(0-200ms)
Attempt 5: 800ms + random(0-400ms) [then give up]

2. Circuit Breaker

Closed (normal) → Threshold errors exceeded → Open (failing) 
                                              → After timeout → Half-Open (test)
                                              → If success → Closed
                                              → If failure → Open

3. Bulkhead Pattern

Service A
├── Thread Pool 1 (for Service B calls) [10 threads max]
├── Thread Pool 2 (for Service C calls) [10 threads max]
└── Thread Pool 3 (for Service D calls) [10 threads max]

If B fails: Only Pool 1 exhausted, Pool 2 & 3 work fine

4. Saga Pattern (Distributed Transactions)

Order Service → Payment Service → Inventory Service
              → if Payment fails → Compensation (refund)
              → if Inventory fails → Compensation (cancel payment)

Security Considerations

Authentication/Authorization:

  • gRPC: mTLS (mutual TLS) for service-to-service
  • REST: OAuth2, JWT tokens
  • Kafka: ACLs, authentication plugins

Data Encryption:

  • In-transit: TLS/SSL
  • At-rest: Disk encryption, broker-side
  • End-to-end: Application-level encryption

Cost-Benefit Analysis

Choose Synchronous When:

  • ✓ Low latency required (<100ms)
  • ✓ Strong consistency needed
  • ✓ Simple request-response pattern
  • ✓ Limited message volume
  • ✗ Scalability not a constraint

Choose Asynchronous When:

  • ✓ Decoupling required
  • ✓ High throughput needed
  • ✓ Eventual consistency acceptable
  • ✓ Backpressure handling required
  • ✗ Real-time interaction needed
API Gateway (REST) → Services (gRPC for sync, critical paths)
                  → Message Broker (async for non-critical paths)
                  → Databases (cached for reads)

Example (E-commerce Order):
1. REST: Order created (needs immediate response)
2. gRPC: Validate inventory (need strong consistency)
3. Kafka: Async events (payment, notification, analytics)

Summary & Decision Tree

Is response needed immediately?
├─ Yes → Use Synchronous (REST/gRPC)
│  ├─ Internal service? → gRPC (faster, binary)
│  └─ Browser/public API? → REST (widely supported)
│
└─ No → Use Asynchronous (Message Broker)
   ├─ Single consumer per message? → RabbitMQ (task queue)
   └─ Multiple subscribers needed? → Kafka (event stream)
      └─ High volume (1M+ msgs/sec)? → Kafka (designed for this)

Key Takeaways for FAANG Interviews

  1. Synchronous = Tight Coupling + Availability Amplification → Use for critical path only
  2. Asynchronous = Loose Coupling + Operational Complexity → Use for background jobs
  3. gRPC is the future → 10x faster than REST, but less accessible
  4. Kafka is the standard → High throughput, replay capability, event sourcing
  5. Idempotency is non-negotiable → Design all async handlers to be idempotent
  6. Observability is critical → Monitor latencies, lag, error rates, queue depth
  7. CAP theorem is reality → Partition tolerance mandatory, choose CP or AP
  8. Complexity scales with scale → Simple systems can be synchronous, complex ones need async