Microservice Communication - Advanced Systems Design

Sep 13, 2024

Microservice Communication Patterns - SDE 2 Level Deep Dive

Microservice architecture requires careful consideration of inter-service communication strategies. This document covers synchronous and asynchronous patterns with distributed systems trade-offs suitable for FAANG-level system design.

Communication Paradigms
Synchronous Communication
Asynchronous Communication
Message Brokers Deep Dive
Distributed System Challenges
Pattern Comparison Matrix
Production Considerations

Communication Paradigms

Overview

In distributed systems, services need to exchange information reliably and efficiently. The choice between synchronous and asynchronous communication impacts:

Consistency Model (Strong vs Eventual)
Latency SLA (P99, P95 tail latencies)
System Coupling (degree of interdependence)
Failure Recovery (fault isolation)
Scalability (horizontal scaling capabilities)

Synchronous Communication

Definition

A caller service makes a request and blocks/waits for a response before proceeding. The request-response pattern is inherently synchronous.

Implementation Mechanisms

1. REST APIs (HTTP/HTTPS)

Protocol: HTTP/1.1 or HTTP/2
Serialization: JSON (text-based, 2-3 KB overhead)
Latency: 50-500ms per call depending on network/processing
Use Case: Simple CRUD operations, public APIs, browser clients

Considerations:

Stateless by design (easier horizontal scaling)
Connection overhead: TCP 3-way handshake (~5-10ms for distant regions)
Polling complexity: consumer must repeatedly ask for updates
Retry logic requires idempotency keys to prevent duplicates

Example Scenario:

Order Service → Payment Service (wait for response)
│                    │
└─ Blocks waiting ───┘
If Payment Service down → immediate failure

2. GraphQL

Advantage: Eliminates over-fetching and under-fetching of data
Latency: Similar to REST but can reduce round-trips
Complexity: Query parsing and execution overhead (~2-5ms additional)
Use Case: Complex data aggregation, mobile clients with limited bandwidth

Trade-offs:

Solves data fetching problems but doesn’t solve inherent synchronous coupling
Deeper query complexity can lead to N+1 problems in resolvers

N+1 Problem — Simple Analogy First:

Imagine you are a librarian. Someone asks: “Give me all books, and for each book tell me the author’s hometown.”
You fetch the list of 10 books → 1 trip to the shelf
Then for each book, you go call the author individually to ask their hometown → 10 separate phone calls
Total = 1 + 10 = 11 trips just to answer one question.
If there were 1000 books → 1001 trips. That is the N+1 problem.

How it happens in GraphQL:

GraphQL has resolvers — small functions, one per field, each responsible for fetching its own data. They do not know about each other. So when you ask for a list AND a nested field on each item, each nested field fires its own separate database query.

GraphQL Query:
{
  orders {        ← 1 query: fetch all orders
    id
    user {        ← resolver fires for EACH order separately
      name
    }
  }
}

What actually executes in the database:
SELECT * FROM orders                        ← 1 query
SELECT * FROM users WHERE id = 101          ← for order 1
SELECT * FROM users WHERE id = 102          ← for order 2
SELECT * FROM users WHERE id = 103          ← for order 3
...

If there are 1000 orders → 1001 DB queries for one client request.

Why resolvers behave this way: Each resolver is written independently:

// Order resolver
resolveUser(order) {
    return db.query("SELECT * FROM users WHERE id = " + order.userId)
    // It has NO idea 999 other orders are also calling this same function
}

Fix — DataLoader (Batching):

DataLoader collects all the IDs being requested within a single moment, then fires one batched query instead of N individual ones.

Without DataLoader:
→ 1000 separate: SELECT * FROM users WHERE id = X

With DataLoader:
→ 1 batched:     SELECT * FROM users WHERE id IN (101, 102, 103, ...)

Why it matters more in microservices: In microservices, resolvers don’t just hit a DB — they make HTTP or gRPC calls to other services. So N+1 DB queries become N+1 network calls across services, each adding network latency. 1000 orders = 1000 HTTP calls to the User Service, which can overwhelm it.

3. Feign (Spring Cloud) with Service Discovery (Eureka)

Purpose: Client-side HTTP client with declarative interface
Discovery: Automatic service location via Eureka registry
Circuit Breaker Integration: Works with Hystrix/Resilience4j
Latency: Similar to REST (~50-500ms) + discovery lookup (~5-20ms)

How it works:

Service A (Feign Client) → Eureka Registry (lookup ServiceB instance)
                        → HTTP call to ServiceB
                        → Blocking wait

Failure Scenario:

Service B instances crash → Eureka marks them unhealthy (takes 30-60 seconds)
New requests fail → Circuit breaker opens (fail-fast)
Cascading failures can occur if timeout not configured properly

4. gRPC (Remote Procedure Call)

Protocol: HTTP/2 with multiplexing
Serialization: Protocol Buffers (binary, ~300 bytes overhead)
Latency: 5-50ms (10x faster than REST due to binary format)
Performance: 1000+ concurrent streams per connection
Throughput: 100K+ requests per second per instance

Advantages:

Bi-directional streaming (client→server, server→client, both directions)
Server push capabilities
Native deadline/timeout management
Load balancing aware (gRPC can balance per request)

Use Cases:

Inter-service communication in microservices (no public API)
Real-time data streaming
High-frequency data updates (time-series, metrics)
Internal services where schema evolution is controlled

Example:

service OrderService {
  rpc CreateOrder(OrderRequest) returns (OrderResponse);
  // Bi-directional streaming
  rpc StreamOrders(stream Order) returns (stream OrderStatus);
}

Trade-offs of gRPC:

Binary protocol not human-readable (harder debugging with standard tools)
Requires code generation (.proto files → language-specific stubs)
Not browser-friendly (needs gRPC-web wrapper)
Steep learning curve compared to REST

Key Issues with Synchronous Communication

1. Tight Coupling

Service A directly depends on Service B’s availability
Service B’s latency becomes Service A’s latency
Schema changes in B require coordinated deployments

Cascading Failure Example:

Order Service -> Payment Service -> Bank Service
     v              v                  v
1s timeout    + 1s timeout        (down)
     v              v
Result: 3 second timeout for user request
If 100 requests/sec -> 100 connection threads blocked for 3 seconds
-> Rapid thread pool exhaustion -> Denial of Service

Availability Math for the Same Call Chain:

For synchronous serial dependencies, end-user success needs all required services to be available at the same time:

A_total = A_order * A_payment * A_bank

Example:

A_order = 99.99% = 0.9999
A_payment = 99.9% = 0.999
A_bank = 99.9% = 0.999

A_total = 0.9999 * 0.999 * 0.999 = 0.997901099 = 99.7901%

So failure probability is:

1 - 0.997901099 = 0.002098901 = 0.2099%

Annual downtime (525,600 minutes/year):

Downtime = (1 - A_total) * 525,600 = 0.002098901 * 525,600 ~ 1,103 minutes/year (~18.4 hours/year)

2. Availability Amplification

Concept: If each service has 99.9% uptime, the composite system availability decreases
Math: 99.9% × 99.9% × 99.9% = 99.7% (3 hops)
Impact: At 99.7%, you have ~26 hours of downtime per year
Mitigation: Circuit breakers (fail fast), timeouts, fallbacks

3. Distributed Transaction Problem

ACID transactions impossible across network boundaries
Network partitions prevent atomic all-or-nothing outcomes
Compensation logic (saga pattern) required for consistency

Example - Distributed Transaction:

Service A writes: user balance -= $100  [Success]
Network fails
Service B writes: order record += 1     [Fails]

Result: Inconsistent state - money taken but no order created

4. Timeout Configuration

Timeouts should be based on latency SLOs, not guesswork.

Too aggressive (for example 100ms): healthy but slow tail requests fail.
Too lenient (for example 60s): threads/connections stay blocked, increasing blast radius.
Practical rule: start with downstream P95 or P99 and add safety margin + jitter.

What P95 means (and other P values):

P means percentile of latency distribution.

P50 (median): 50% of requests finish at or below this latency.
P90: 90% finish at or below this latency.
P95: 95% finish at or below this latency.
P99: 99% finish at or below this latency (tail latency).

Example (10,000 Payment requests in 1 minute):

P50 = 40ms
P90 = 80ms
P95 = 120ms
P99 = 300ms

Interpretation: 95% of calls complete within 120ms, but the slowest 1% can still take around 300ms.

So when we set timeout to 250ms, we are intentionally allowing normal tail (P95) while still cutting off pathological slow requests before they block resources too long.

Senior guidance for choosing timeout target:

User-facing synchronous call: usually start near P99 (or P95 + safety margin) and tune using error budget.
High-QPS internal hop: prefer tighter timeout near P95 + margin to protect capacity.
Always combine timeout with retries (bounded), jitter, circuit breaker, and idempotency.

Example (timeout budget):

End-to-end API SLO: 800ms
Call chain: API Gateway -> Order -> Payment -> Bank
Budget split: Gateway(80) + Order(120) + Payment(250) + Bank(250) + buffer(100)
Set Payment timeout near observed tail (say P95=120ms) with margin, e.g. 250ms, not 60s.

This keeps failures fast and recoverable, and prevents thread pool lockup.

5. Retry Storms

A retry storm happens when many clients retry at the same time after a transient failure, multiplying load on an already unhealthy dependency.

If 10k requests fail and each client retries 3 times immediately, backend sees up to 40k attempts (1 original + 3 retries).
This can turn a short outage into sustained overload (thundering herd).

Mitigation pattern:

Exponential backoff + jitter
Retry only idempotent operations
Cap attempts and total retry time
Combine with circuit breaker and rate limiting

Formula:

delay = min(max_delay, base_delay * 2^attempt) + random(0, jitter)

Example delays (base=100ms, max=2s, jitter=0-100ms):

Retry 1: ~100-200ms
Retry 2: ~200-300ms
Retry 3: ~400-500ms
Retry 4: ~800-900ms
Retry 5: ~1600-1700ms

This spreads retries over time, reduces synchronized spikes, and improves recovery probability.

When to Use Synchronous Communication

1. Query Operations (Read-Only)

Fetching product catalog, user profiles, inventory status
No side effects → Safe to retry without concern

2. Transactional Workflows (Atomic Operations)

Payment processing: Must validate before committing
Inventory reservation: Must check stock before confirming order
Requirement: Immediate feedback needed for validity

3. Low-Latency User-Facing Operations

Authentication/Authorization: Users expect <100ms response
Search queries: P99 latency < 200ms critical for UX
API Gateway to origin services: Direct request flow

4. Strongly Consistent Operations

Distributed transactions using 2-phase commit (rare, use carefully)
When eventual consistency is not acceptable (compliance requirements)

Asynchronous Communication

Definition

A sender publishes a message to a broker and continues immediately without waiting for processing. Processing happens later by one or more consumers.

Architecture Overview

Producer → Message Broker → Consumer
         (fire & forget)    (processes when ready)

Key Properties:

Producer decoupled from Consumer
Broker guarantees message delivery (if configured)
Consumer processes at own pace
Enable backpressure (broker buffers messages)

Message Broker Responsibilities

Message Persistence: Store messages durably (disk, replicated)
Routing: Direct messages to correct consumers
Redelivery: Retry if consumer fails to acknowledge
Retention: Keep messages for configured duration
Consumer Offset Management: Track what each consumer has read
Fault Tolerance: Replicate data across multiple brokers

Asynchronous Patterns

Pattern 1: Point-to-Point (Queue)

Producer → Queue (Broker) → Consumer
         Message stays until consumed
         Only ONE consumer processes it
         Message deleted after acknowledgment

Characteristics:

Delivery Guarantee: Exactly-once (with idempotent processing)
Consumers: One message, one handler
Ordering: Per-queue ordering (if single partition)
Retention: Until consumed and acknowledged
Examples: RabbitMQ, ActiveMQ, AWS SQS

Use Cases:

Task distribution: Job queues (background processing)
Work distribution: Order processing across multiple workers
Rate limiting: Decouple fast producer from slow consumer

Failure Handling:

Message published → Consumer processes → Crashes before ACK
Broker detects no ACK → Retries with another consumer instance

Publisher → Topic (Broker) ─��→ Subscriber1
                            ├→ Subscriber2
                            └→ Subscriber3
Message goes to ALL active subscribers

Characteristics:

Delivery Guarantee: At-least-once per subscriber
Consumers: Message goes to all subscribers
Ordering: Per-partition ordering (Kafka)
Retention: Time-based or size-based
Examples: Kafka, RabbitMQ (with fanout), AWS SNS

Use Cases:

Event broadcasting: User signed up → multiple services react
Data replication: Update propagated to read replicas
Real-time notifications: Push to multiple clients

Critical Issue - Subscriber Availability:

Scenario 1: Subscriber offline at publish time
→ Message lost (if not persistent topic)

Scenario 2: Subscriber offline but topic is persistent
→ Subscriber can replay from offset
→ Recovery is possible

Pattern 3: Event-Driven Architecture

Difference from Messaging:

Messaging: Explicit message format, tight coupling on schema
Events: Publish intent/state change, not contracts
Subscribers react to events based on their own business logic

Example:

Event: UserRegisteredEvent {
  userId: "123",
  email: "user@example.com",
  timestamp: "2024-09-13T22:02:37Z"
}

Subscribers:
- Email Service: Send welcome email
- Analytics Service: Track registration metrics
- Recommendation Service: Initialize user profile

Advantages:

Loose coupling: Event structure changes don’t affect all subscribers
Scalability: New subscribers can be added without modifying publisher
Auditability: Event log provides complete history

Message Brokers Deep Dive

Architecture

Kafka Cluster (Fault Tolerant)
├── Zookeeper Cluster (Cluster Coordination)
│   ├── Leader (writes)
│   ├── Follower (reads)
│   └── Follower (reads)
│
├── Broker 1 [Leader] ─── Broker 2 [Replica] ─── Broker 3 [Replica]
│   ├── Topic-A-P0 (Leader)
│   │   ├── Offset 0: Message1
│   │   ├── Offset 1: Message2
│   │   └── Offset 2: Message3
│   │
│   ├── Topic-A-P1 (Leader)
│   │   └── Offset 0: Message4
│   │
│   └── Topic-B-P0 (Leader)
│       └── Offset 0: Message5

Core Concepts

1. Topic

Logical collection of messages on a subject
Partitioned for parallel processing
Replicated for fault tolerance
Messages retained based on policy (time/size)

2. Partition

Physical unit of parallelism
Ordered sequence of messages (immutable log)
Each message has unique offset (0, 1, 2, …)
Critical: One message = one offset in one partition

Partitioning Strategy:

Producer: "user:123" → Hash("user:123") % 5 partitions = Partition 3
Producer: "user:456" → Hash("user:456") % 5 partitions = Partition 1

Result: All messages for same user go to same partition (preserves order)

Anti-Pattern (Hotspot):

If all producers send null key or same key
→ All messages go to same partition
→ Single broker becomes bottleneck
→ Data skew (one partition much larger)

3. Offset

Sequential ID for each message in a partition
Consumer tracks which offset it has read
Enables replaying from any point
Use Case: Fixing bugs, reprocessing data, catch-up

4. Replication

Replication Factor (RF): How many replicas of each partition
Leader: Accepts all writes and reads
ISR (In-Sync Replicas): Followers caught up with leader
Failure Scenario:

Leader fails → One ISR becomes new leader (no data loss if RF=2+)
If non-ISR elected → Data loss possible

5. Zookeeper (Cluster Coordination)

Responsibilities:
- Broker membership management (who’s alive?)
- Controller election (one broker becomes controller)
- Leader election for each partition
- Consumer group coordination
Critical: Odd number required (3, 5, 7 for quorum)
Zookeeper data:

/brokers/ids: [1, 2, 3]  (active brokers)
/controller: 1           (current controller)
/topics/my-topic/partitions/0/state: {"leader": 1, "isr": [1,2]}
/consumers/group-1/offsets/my-topic/0: 42  (consumer read up to offset 42)

Producer Behavior

Configuration Trade-offs:

Config	Value	Latency	Durability	Throughput
`acks=0`	No wait for broker	0-1ms	RISKY (data loss)	Very High
`acks=1`	Wait for leader	5-20ms	Good	High
`acks=all`	Wait for all ISR	20-100ms	SAFE (replicated)	Moderate

Batching (improves throughput):

Buffer messages for batch.size bytes or linger.ms time
Send all at once → Reduces round trips
Trade-off: Slightly higher latency for much higher throughput

Compression:

snappy: Good compression ratio, fast
gzip: Better compression, slower
lz4: Very fast, moderate compression
Can reduce network bandwidth by 50-90%

Idempotency:

Enable: enable.idempotence=true, transactional.id="unique-id"
Effect: Duplicates automatically deduplicated by broker
Cost: Slight performance hit

Consumer Behavior

Consumer Groups:

Topic with 4 partitions:
P0, P1, P2, P3

Consumer Group A (3 instances):
- Consumer A1 reads P0, P1
- Consumer A2 reads P2
- Consumer A3 reads P3 (idle)

If A3 dies:
- Rebalancing: P3 reassigned to A1 or A2
- Stop-the-world: ~30 seconds of paused consumption

Offset Management:

Automatic: Consumer group stores offset in __consumer_offsets topic
Manual: Application tracks offset in external store
Seek: Reset offset to specific point

Consumer can do:
consumer.seek(partition, 0)  // Start from beginning
consumer.seek(partition, offset - 100)  // Replay last 100
consumer.seekToEnd()  // Skip to latest

Important: Offset committed AFTER processing, not before.

Failure Scenarios:

Scenario 1: Message processed, consumer crashes before offset commit
→ Message reprocessed on restart (at-least-once)

Scenario 2: Message processed + offset committed, consumer crashes
→ Message skipped on restart (exactly-once from consumer perspective)

Solution: Idempotent processing (safe to replay)

RabbitMQ - Traditional Message Queue

Architecture

RabbitMQ Cluster (Queue-based)
├── Broker 1 (Master)
│   ├── Queue: order.processing
│   │   ├── Message: OrderID:1001 (unacknowledged)
│   │   ├── Message: OrderID:1002 (ack pending)
│   │   └── Message: OrderID:1003
│   └── Queue: payment.processing
│
├── Broker 2 (Mirror/Replica)
├── Broker 3 (Mirror/Replica)

Core Concepts

1. Queue (Point-to-Point)

Producer → Exchange → Queue → Consumer (ACK) → Message deleted

Messages stay until consumed and acknowledged
Only one consumer processes each message
Ordering: Single queue guarantees order
Load Balance: Multiple consumers share work (round-robin)

2. Exchange (Routing)

Direct: Exact key matching

"user.created" �� Queue: user-service

Topic: Pattern matching with wildcards

"user.*" → Queues: [user-service, email-service, analytics-service]

Fanout: Broadcast to all bound queues

"broadcast" → All bound queues receive copy

Headers: Match on message headers

3. Message Acknowledgment

Producer sends → Broker stores → Consumer processes → Consumer ACKs
If no ACK within timeout → Broker redelivers to another consumer

Manual vs Auto ACK:

Auto: Acknowledge immediately (risky, message loss on crash)
Manual: Acknowledge after successful processing (safe, slightly slower)

Comparison with Kafka

Aspect	RabbitMQ	Kafka
Type	Queue + Pub-Sub	Distributed Log (Pub-Sub)
Persistence	Optional	Default (log-based)
Ordering	Per-queue	Per-partition
Throughput	Typically 10K-100K+ msgs/sec (workload and topology dependent)	Typically 100K to millions msgs/sec (partitioning, batching, and hardware dependent)
Replication	Yes	Yes (partition-level)
Consumer Offset	Auto tracked (deleted on ACK)	Manual control (time-based retention)
Message Loss Risk	Low if ACK enabled	1 in 100M (configurable)
Use Case	Task queues, simple workflows	Event streams, high-volume logging

Note: Throughput numbers vary heavily by message size, durability settings, ACK mode, replication factor, disk/network limits, and producer/consumer parallelism.

Distributed System Challenges

CAP Theorem & Consistency Models

CAP: Any distributed system can guarantee at most 2 of 3:

Consistency: All nodes see same data
Availability: System responds to requests
Partition Tolerance: System survives network splits

In Practice: Network partitions WILL occur → Choose CP or AP

Eventual Consistency in Asynchronous Systems

State A → Message Sent → Broker → Network Delay → State B
        t=0              t=10ms                     t=100ms

During t=10-100ms: Different services see different state
This window is EVENTUALLY CONSISTENT

Risks:

User sees stale data temporarily
Multiple writes create conflicts (last-write-wins not always correct)
Compensating transactions may be needed

Example - E-Commerce:

User updates shipping address:
1. Address Service: Address updated immediately
2. Shipping Service: Receives update async, 500ms later
3. User starts checkout: Sees new address but shipping rates might be stale

Idempotency & Exactly-Once Semantics

Problem: In distributed systems, retries can cause duplicates

Delivery Guarantees:

At-least-once: Message guaranteed delivered, but may be duplicated
At-most-once: Message delivered once but may be lost
Exactly-once: Message delivered exactly once (hardest to achieve)

Idempotent Processing:

Consumer processes:
if idempotencyKey not in database:
    processMessage()
    mark idempotencyKey as processed
else:
    skip (already processed)

Example:

Payment API:
POST /payment
{
  "idempotencyKey": "req-12345",
  "amount": 100,
  "userId": "user-123"
}

First call: Charges $100, stores idempotencyKey
Retry due to timeout: Uses same idempotencyKey, doesn't charge again

Network Partitions & Byzantine Failures

Scenario: Service A can’t reach Service B

Service A      Network Partition      Service B
(Active)                              (Active)

A can't call B, so:
- Timeout occurs
- Circuit breaker opens
- Fallback logic executes (stale cache, default value)

B processes messages from queue independently
Result: Temporarily inconsistent state (A thinks B failed, B is fine)

Recovery:

When network heals, systems resync
Data reconciliation may be needed
Audit logs help identify inconsistencies

Cascading Failures

Scenario: Service B slow due to database overload

Service A → Service B (100ms, times out at 200ms)
Service C → Service A (100ms, times out at 200ms)
Service D → Service C (100ms, times out at 200ms)

A's 100 threads waiting for B → Exhausted
C's 100 threads waiting for A → Exhausted
D's 100 threads waiting for C → Exhausted
Entire system grinds to halt

Mitigation:

Circuit Breaker: Stop calling failing service, fail fast
Bulkheads: Isolate thread pools (don’t use shared pool)
Timeouts: Kill long-running requests
Fallbacks: Graceful degradation

Pattern Comparison Matrix

Characteristic	REST/gRPC	Kafka	RabbitMQ
Latency	50-500ms	50-100ms (processing)	10-100ms
Throughput	REST: 1K-10KgRPC: 100K+	1M+ msgs/sec	10K-100K msgs/sec
Consistency	Strong (sync)	Eventual	Eventual
Coupling	Tight	Loose	Loose
Ordering Guarantee	No (unless coded)	Per-partition (strong)	Per-queue (strong)
Replay Capability	No (unless logged)	Yes (offset management)	No (delete on ACK)
Operational Complexity	Low	Very High	Medium
Infrastructure Footprint	Minimal	Large cluster required	Medium cluster
Failure Domain	Single hop	Broker cluster	Broker cluster
Best For	Query/transactional	Event streaming/analytics	Task queues/workflows

Production Considerations

Monitoring & Observability

Key Metrics to Track:

Synchronous:

P99, P95, P50 latencies
Error rate (4xx, 5xx)
Circuit breaker state changes
Timeout occurrences

Asynchronous:

Consumer lag (offset - latest offset)
End-to-end latency (publish to process)
Message loss rate
Queue depth (buildup indicates slow consumers)
Redelivery rate (indicates failure)

Scaling Strategies

Synchronous:

Horizontal scaling: Add more instances
Caching: Reduce repeated calls
Database optimization: Faster responses → shorter timeout needed

Asynchronous:

Partition keys: Distribute load evenly
Consumer group scaling: Add instances
Retention tuning: Balance disk space vs. replay capability

Failure Handling Patterns

1. Retry with Exponential Backoff

Attempt 1: Immediate
Attempt 2: 100ms + random(0-50ms)
Attempt 3: 200ms + random(0-100ms)
Attempt 4: 400ms + random(0-200ms)
Attempt 5: 800ms + random(0-400ms) [then give up]

2. Circuit Breaker

Closed (normal) → Threshold errors exceeded → Open (failing) 
                                              → After timeout → Half-Open (test)
                                              → If success → Closed
                                              → If failure → Open

3. Bulkhead Pattern

Service A
├── Thread Pool 1 (for Service B calls) [10 threads max]
├── Thread Pool 2 (for Service C calls) [10 threads max]
└── Thread Pool 3 (for Service D calls) [10 threads max]

If B fails: Only Pool 1 exhausted, Pool 2 & 3 work fine

4. Saga Pattern (Distributed Transactions)

Order Service → Payment Service → Inventory Service
              → if Payment fails → Compensation (refund)
              → if Inventory fails → Compensation (cancel payment)

Security Considerations

Authentication/Authorization:

gRPC: mTLS (mutual TLS) for service-to-service
REST: OAuth2, JWT tokens
Kafka: ACLs, authentication plugins

Data Encryption:

In-transit: TLS/SSL
At-rest: Disk encryption, broker-side
End-to-end: Application-level encryption

Cost-Benefit Analysis

Choose Synchronous When:

✓ Low latency required (<100ms)
✓ Strong consistency needed
✓ Simple request-response pattern
✓ Limited message volume
✗ Scalability not a constraint

Choose Asynchronous When:

✓ Decoupling required
✓ High throughput needed
✓ Eventual consistency acceptable
✓ Backpressure handling required
✗ Real-time interaction needed

Hybrid Approach (Recommended for Complex Systems)

API Gateway (REST) → Services (gRPC for sync, critical paths)
                  → Message Broker (async for non-critical paths)
                  → Databases (cached for reads)

Example (E-commerce Order):
1. REST: Order created (needs immediate response)
2. gRPC: Validate inventory (need strong consistency)
3. Kafka: Async events (payment, notification, analytics)

Summary & Decision Tree

Is response needed immediately?
├─ Yes → Use Synchronous (REST/gRPC)
│  ├─ Internal service? → gRPC (faster, binary)
│  └─ Browser/public API? → REST (widely supported)
│
└─ No → Use Asynchronous (Message Broker)
   ├─ Single consumer per message? → RabbitMQ (task queue)
   └─ Multiple subscribers needed? → Kafka (event stream)
      └─ High volume (1M+ msgs/sec)? → Kafka (designed for this)

Key Takeaways for FAANG Interviews

Synchronous = Tight Coupling + Availability Amplification → Use for critical path only
Asynchronous = Loose Coupling + Operational Complexity → Use for background jobs
gRPC is the future → 10x faster than REST, but less accessible
Kafka is the standard → High throughput, replay capability, event sourcing
Idempotency is non-negotiable → Design all async handlers to be idempotent
Observability is critical → Monitor latencies, lag, error rates, queue depth
CAP theorem is reality → Partition tolerance mandatory, choose CP or AP
Complexity scales with scale → Simple systems can be synchronous, complex ones need async

Microservice Communication Patterns - SDE 2 Level Deep Dive

Table of Contents

Communication Paradigms

Overview

Synchronous Communication

Definition

Implementation Mechanisms

1. REST APIs (HTTP/HTTPS)

2. GraphQL

3. Feign (Spring Cloud) with Service Discovery (Eureka)

4. gRPC (Remote Procedure Call)

Key Issues with Synchronous Communication

1. Tight Coupling

2. Availability Amplification

3. Distributed Transaction Problem

4. Timeout Configuration

5. Retry Storms

When to Use Synchronous Communication

Asynchronous Communication

Definition

Architecture Overview

Message Broker Responsibilities

Asynchronous Patterns

Pattern 1: Point-to-Point (Queue)

Pattern 2: Publish-Subscribe (Topic)

Pattern 3: Event-Driven Architecture

Message Brokers Deep Dive

Apache Kafka - Distributed Publish-Subscribe

Architecture

Core Concepts

Producer Behavior

Consumer Behavior

RabbitMQ - Traditional Message Queue

Architecture

Core Concepts

Comparison with Kafka

Distributed System Challenges

CAP Theorem & Consistency Models

Eventual Consistency in Asynchronous Systems

Idempotency & Exactly-Once Semantics

Network Partitions & Byzantine Failures

Cascading Failures

Pattern Comparison Matrix

Production Considerations

Monitoring & Observability

Scaling Strategies

Failure Handling Patterns

Security Considerations

Cost-Benefit Analysis

Hybrid Approach (Recommended for Complex Systems)

Summary & Decision Tree

Key Takeaways for FAANG Interviews