Microservice Communication - Advanced Systems Design
Microservice Communication Patterns - SDE 2 Level Deep Dive
Microservice architecture requires careful consideration of inter-service communication strategies. This document covers synchronous and asynchronous patterns with distributed systems trade-offs suitable for FAANG-level system design.
Table of Contents
- Communication Paradigms
- Synchronous Communication
- Asynchronous Communication
- Message Brokers Deep Dive
- Distributed System Challenges
- Pattern Comparison Matrix
- Production Considerations
Communication Paradigms
Overview
In distributed systems, services need to exchange information reliably and efficiently. The choice between synchronous and asynchronous communication impacts:
- Consistency Model (Strong vs Eventual)
- Latency SLA (P99, P95 tail latencies)
- System Coupling (degree of interdependence)
- Failure Recovery (fault isolation)
- Scalability (horizontal scaling capabilities)
Synchronous Communication
Definition
A caller service makes a request and blocks/waits for a response before proceeding. The request-response pattern is inherently synchronous.
Implementation Mechanisms
1. REST APIs (HTTP/HTTPS)
- Protocol: HTTP/1.1 or HTTP/2
- Serialization: JSON (text-based, 2-3 KB overhead)
- Latency: 50-500ms per call depending on network/processing
- Use Case: Simple CRUD operations, public APIs, browser clients
Considerations:
- Stateless by design (easier horizontal scaling)
- Connection overhead: TCP 3-way handshake (~5-10ms for distant regions)
- Polling complexity: consumer must repeatedly ask for updates
- Retry logic requires idempotency keys to prevent duplicates
Example Scenario:
Order Service → Payment Service (wait for response)
│ │
└─ Blocks waiting ───┘
If Payment Service down → immediate failure
2. GraphQL
- Advantage: Eliminates over-fetching and under-fetching of data
- Latency: Similar to REST but can reduce round-trips
- Complexity: Query parsing and execution overhead (~2-5ms additional)
- Use Case: Complex data aggregation, mobile clients with limited bandwidth
Trade-offs:
- Solves data fetching problems but doesn’t solve inherent synchronous coupling
- Deeper query complexity can lead to N+1 problems in resolvers
N+1 Problem — Simple Analogy First:
Imagine you are a librarian. Someone asks: “Give me all books, and for each book tell me the author’s hometown.”
- You fetch the list of 10 books → 1 trip to the shelf
- Then for each book, you go call the author individually to ask their hometown → 10 separate phone calls
- Total = 1 + 10 = 11 trips just to answer one question.
- If there were 1000 books → 1001 trips. That is the N+1 problem.
How it happens in GraphQL:
GraphQL has resolvers — small functions, one per field, each responsible for fetching its own data. They do not know about each other. So when you ask for a list AND a nested field on each item, each nested field fires its own separate database query.
GraphQL Query:
{
orders { ← 1 query: fetch all orders
id
user { ← resolver fires for EACH order separately
name
}
}
}
What actually executes in the database:
SELECT * FROM orders ← 1 query
SELECT * FROM users WHERE id = 101 ← for order 1
SELECT * FROM users WHERE id = 102 ← for order 2
SELECT * FROM users WHERE id = 103 ← for order 3
...
If there are 1000 orders → 1001 DB queries for one client request.
Why resolvers behave this way: Each resolver is written independently:
// Order resolver
resolveUser(order) {
return db.query("SELECT * FROM users WHERE id = " + order.userId)
// It has NO idea 999 other orders are also calling this same function
}
Fix — DataLoader (Batching):
DataLoader collects all the IDs being requested within a single moment, then fires one batched query instead of N individual ones.
Without DataLoader:
→ 1000 separate: SELECT * FROM users WHERE id = X
With DataLoader:
→ 1 batched: SELECT * FROM users WHERE id IN (101, 102, 103, ...)
Why it matters more in microservices: In microservices, resolvers don’t just hit a DB — they make HTTP or gRPC calls to other services. So N+1 DB queries become N+1 network calls across services, each adding network latency. 1000 orders = 1000 HTTP calls to the User Service, which can overwhelm it.
3. Feign (Spring Cloud) with Service Discovery (Eureka)
- Purpose: Client-side HTTP client with declarative interface
- Discovery: Automatic service location via Eureka registry
- Circuit Breaker Integration: Works with Hystrix/Resilience4j
- Latency: Similar to REST (~50-500ms) + discovery lookup (~5-20ms)
How it works:
Service A (Feign Client) → Eureka Registry (lookup ServiceB instance)
→ HTTP call to ServiceB
→ Blocking wait
Failure Scenario:
- Service B instances crash → Eureka marks them unhealthy (takes 30-60 seconds)
- New requests fail → Circuit breaker opens (fail-fast)
- Cascading failures can occur if timeout not configured properly
4. gRPC (Remote Procedure Call)
- Protocol: HTTP/2 with multiplexing
- Serialization: Protocol Buffers (binary, ~300 bytes overhead)
- Latency: 5-50ms (10x faster than REST due to binary format)
- Performance: 1000+ concurrent streams per connection
- Throughput: 100K+ requests per second per instance
Advantages:
- Bi-directional streaming (client→server, server→client, both directions)
- Server push capabilities
- Native deadline/timeout management
- Load balancing aware (gRPC can balance per request)
Use Cases:
- Inter-service communication in microservices (no public API)
- Real-time data streaming
- High-frequency data updates (time-series, metrics)
- Internal services where schema evolution is controlled
Example:
service OrderService {
rpc CreateOrder(OrderRequest) returns (OrderResponse);
// Bi-directional streaming
rpc StreamOrders(stream Order) returns (stream OrderStatus);
}
Trade-offs of gRPC:
- Binary protocol not human-readable (harder debugging with standard tools)
- Requires code generation (
.protofiles → language-specific stubs) - Not browser-friendly (needs gRPC-web wrapper)
- Steep learning curve compared to REST
Key Issues with Synchronous Communication
1. Tight Coupling
- Service A directly depends on Service B’s availability
- Service B’s latency becomes Service A’s latency
- Schema changes in B require coordinated deployments
Cascading Failure Example:
Order Service -> Payment Service -> Bank Service
v v v
1s timeout + 1s timeout (down)
v v
Result: 3 second timeout for user request
If 100 requests/sec -> 100 connection threads blocked for 3 seconds
-> Rapid thread pool exhaustion -> Denial of Service
Availability Math for the Same Call Chain:
For synchronous serial dependencies, end-user success needs all required services to be available at the same time:
A_total = A_order * A_payment * A_bank
Example:
A_order = 99.99% = 0.9999A_payment = 99.9% = 0.999A_bank = 99.9% = 0.999
A_total = 0.9999 * 0.999 * 0.999 = 0.997901099 = 99.7901%
So failure probability is:
1 - 0.997901099 = 0.002098901 = 0.2099%
Annual downtime (525,600 minutes/year):
Downtime = (1 - A_total) * 525,600 = 0.002098901 * 525,600 ~ 1,103 minutes/year (~18.4 hours/year)
2. Availability Amplification
- Concept: If each service has 99.9% uptime, the composite system availability decreases
- Math: 99.9% × 99.9% × 99.9% = 99.7% (3 hops)
- Impact: At 99.7%, you have ~26 hours of downtime per year
- Mitigation: Circuit breakers (fail fast), timeouts, fallbacks
3. Distributed Transaction Problem
- ACID transactions impossible across network boundaries
- Network partitions prevent atomic all-or-nothing outcomes
- Compensation logic (saga pattern) required for consistency
Example - Distributed Transaction:
Service A writes: user balance -= $100 [Success]
Network fails
Service B writes: order record += 1 [Fails]
Result: Inconsistent state - money taken but no order created
4. Timeout Configuration
Timeouts should be based on latency SLOs, not guesswork.
- Too aggressive (for example 100ms): healthy but slow tail requests fail.
- Too lenient (for example 60s): threads/connections stay blocked, increasing blast radius.
- Practical rule: start with downstream
P95orP99and add safety margin + jitter.
What P95 means (and other P values):
P means percentile of latency distribution.
P50(median): 50% of requests finish at or below this latency.P90: 90% finish at or below this latency.P95: 95% finish at or below this latency.P99: 99% finish at or below this latency (tail latency).
Example (10,000 Payment requests in 1 minute):
P50 = 40msP90 = 80msP95 = 120msP99 = 300ms
Interpretation: 95% of calls complete within 120ms, but the slowest 1% can still take around 300ms.
So when we set timeout to 250ms, we are intentionally allowing normal tail (P95) while still cutting off pathological slow requests before they block resources too long.
Senior guidance for choosing timeout target:
- User-facing synchronous call: usually start near
P99(orP95 + safety margin) and tune using error budget. - High-QPS internal hop: prefer tighter timeout near
P95 + marginto protect capacity. - Always combine timeout with retries (bounded), jitter, circuit breaker, and idempotency.
Example (timeout budget):
- End-to-end API SLO:
800ms - Call chain: API Gateway -> Order -> Payment -> Bank
- Budget split:
Gateway(80) + Order(120) + Payment(250) + Bank(250) + buffer(100) - Set Payment timeout near observed tail (say
P95=120ms) with margin, e.g.250ms, not 60s.
This keeps failures fast and recoverable, and prevents thread pool lockup.
5. Retry Storms
A retry storm happens when many clients retry at the same time after a transient failure, multiplying load on an already unhealthy dependency.
- If 10k requests fail and each client retries 3 times immediately, backend sees up to
40kattempts (1 original + 3 retries). - This can turn a short outage into sustained overload (thundering herd).
Mitigation pattern:
- Exponential backoff + jitter
- Retry only idempotent operations
- Cap attempts and total retry time
- Combine with circuit breaker and rate limiting
Formula:
delay = min(max_delay, base_delay * 2^attempt) + random(0, jitter)
Example delays (base=100ms, max=2s, jitter=0-100ms):
- Retry 1: ~
100-200ms - Retry 2: ~
200-300ms - Retry 3: ~
400-500ms - Retry 4: ~
800-900ms - Retry 5: ~
1600-1700ms
This spreads retries over time, reduces synchronized spikes, and improves recovery probability.
When to Use Synchronous Communication
1. Query Operations (Read-Only)
- Fetching product catalog, user profiles, inventory status
- No side effects → Safe to retry without concern
2. Transactional Workflows (Atomic Operations)
- Payment processing: Must validate before committing
- Inventory reservation: Must check stock before confirming order
- Requirement: Immediate feedback needed for validity
3. Low-Latency User-Facing Operations
- Authentication/Authorization: Users expect <100ms response
- Search queries: P99 latency < 200ms critical for UX
- API Gateway to origin services: Direct request flow
4. Strongly Consistent Operations
- Distributed transactions using 2-phase commit (rare, use carefully)
- When eventual consistency is not acceptable (compliance requirements)
Asynchronous Communication
Definition
A sender publishes a message to a broker and continues immediately without waiting for processing. Processing happens later by one or more consumers.
Architecture Overview
Producer → Message Broker → Consumer
(fire & forget) (processes when ready)
Key Properties:
- Producer decoupled from Consumer
- Broker guarantees message delivery (if configured)
- Consumer processes at own pace
- Enable backpressure (broker buffers messages)
Message Broker Responsibilities
- Message Persistence: Store messages durably (disk, replicated)
- Routing: Direct messages to correct consumers
- Redelivery: Retry if consumer fails to acknowledge
- Retention: Keep messages for configured duration
- Consumer Offset Management: Track what each consumer has read
- Fault Tolerance: Replicate data across multiple brokers
Asynchronous Patterns
Pattern 1: Point-to-Point (Queue)
Producer → Queue (Broker) → Consumer
Message stays until consumed
Only ONE consumer processes it
Message deleted after acknowledgment
Characteristics:
- Delivery Guarantee: Exactly-once (with idempotent processing)
- Consumers: One message, one handler
- Ordering: Per-queue ordering (if single partition)
- Retention: Until consumed and acknowledged
- Examples: RabbitMQ, ActiveMQ, AWS SQS
Use Cases:
- Task distribution: Job queues (background processing)
- Work distribution: Order processing across multiple workers
- Rate limiting: Decouple fast producer from slow consumer
Failure Handling:
Message published → Consumer processes → Crashes before ACK
Broker detects no ACK → Retries with another consumer instance
Pattern 2: Publish-Subscribe (Topic)
Publisher → Topic (Broker) ─��→ Subscriber1
├→ Subscriber2
└→ Subscriber3
Message goes to ALL active subscribers
Characteristics:
- Delivery Guarantee: At-least-once per subscriber
- Consumers: Message goes to all subscribers
- Ordering: Per-partition ordering (Kafka)
- Retention: Time-based or size-based
- Examples: Kafka, RabbitMQ (with fanout), AWS SNS
Use Cases:
- Event broadcasting: User signed up → multiple services react
- Data replication: Update propagated to read replicas
- Real-time notifications: Push to multiple clients
Critical Issue - Subscriber Availability:
Scenario 1: Subscriber offline at publish time
→ Message lost (if not persistent topic)
Scenario 2: Subscriber offline but topic is persistent
→ Subscriber can replay from offset
→ Recovery is possible
Pattern 3: Event-Driven Architecture
Difference from Messaging:
- Messaging: Explicit message format, tight coupling on schema
- Events: Publish intent/state change, not contracts
- Subscribers react to events based on their own business logic
Example:
Event: UserRegisteredEvent {
userId: "123",
email: "user@example.com",
timestamp: "2024-09-13T22:02:37Z"
}
Subscribers:
- Email Service: Send welcome email
- Analytics Service: Track registration metrics
- Recommendation Service: Initialize user profile
Advantages:
- Loose coupling: Event structure changes don’t affect all subscribers
- Scalability: New subscribers can be added without modifying publisher
- Auditability: Event log provides complete history
Message Brokers Deep Dive
Apache Kafka - Distributed Publish-Subscribe
Architecture
Kafka Cluster (Fault Tolerant)
├── Zookeeper Cluster (Cluster Coordination)
│ ├── Leader (writes)
│ ├── Follower (reads)
│ └── Follower (reads)
│
├── Broker 1 [Leader] ─── Broker 2 [Replica] ─── Broker 3 [Replica]
│ ├── Topic-A-P0 (Leader)
│ │ ├── Offset 0: Message1
│ │ ├── Offset 1: Message2
│ │ └── Offset 2: Message3
│ │
│ ├── Topic-A-P1 (Leader)
│ │ └── Offset 0: Message4
│ │
│ └── Topic-B-P0 (Leader)
│ └── Offset 0: Message5
Core Concepts
1. Topic
- Logical collection of messages on a subject
- Partitioned for parallel processing
- Replicated for fault tolerance
- Messages retained based on policy (time/size)
2. Partition
- Physical unit of parallelism
- Ordered sequence of messages (immutable log)
- Each message has unique offset (0, 1, 2, …)
- Critical: One message = one offset in one partition
Partitioning Strategy:
Producer: "user:123" → Hash("user:123") % 5 partitions = Partition 3
Producer: "user:456" → Hash("user:456") % 5 partitions = Partition 1
Result: All messages for same user go to same partition (preserves order)
Anti-Pattern (Hotspot):
If all producers send null key or same key
→ All messages go to same partition
→ Single broker becomes bottleneck
→ Data skew (one partition much larger)
3. Offset
- Sequential ID for each message in a partition
- Consumer tracks which offset it has read
- Enables replaying from any point
- Use Case: Fixing bugs, reprocessing data, catch-up
4. Replication
- Replication Factor (RF): How many replicas of each partition
- Leader: Accepts all writes and reads
- ISR (In-Sync Replicas): Followers caught up with leader
- Failure Scenario:
Leader fails → One ISR becomes new leader (no data loss if RF=2+)
If non-ISR elected → Data loss possible
5. Zookeeper (Cluster Coordination)
Responsibilities:
- Broker membership management (who’s alive?)
- Controller election (one broker becomes controller)
- Leader election for each partition
- Consumer group coordination
Critical: Odd number required (3, 5, 7 for quorum)
Zookeeper data:
/brokers/ids: [1, 2, 3] (active brokers)
/controller: 1 (current controller)
/topics/my-topic/partitions/0/state: {"leader": 1, "isr": [1,2]}
/consumers/group-1/offsets/my-topic/0: 42 (consumer read up to offset 42)
Producer Behavior
Configuration Trade-offs:
| Config | Value | Latency | Durability | Throughput |
|---|---|---|---|---|
acks=0 | No wait for broker | 0-1ms | RISKY (data loss) | Very High |
acks=1 | Wait for leader | 5-20ms | Good | High |
acks=all | Wait for all ISR | 20-100ms | SAFE (replicated) | Moderate |
Batching (improves throughput):
- Buffer messages for
batch.sizebytes orlinger.mstime - Send all at once → Reduces round trips
- Trade-off: Slightly higher latency for much higher throughput
Compression:
snappy: Good compression ratio, fastgzip: Better compression, slowerlz4: Very fast, moderate compression- Can reduce network bandwidth by 50-90%
Idempotency:
Enable: enable.idempotence=true, transactional.id="unique-id"
Effect: Duplicates automatically deduplicated by broker
Cost: Slight performance hit
Consumer Behavior
Consumer Groups:
Topic with 4 partitions:
P0, P1, P2, P3
Consumer Group A (3 instances):
- Consumer A1 reads P0, P1
- Consumer A2 reads P2
- Consumer A3 reads P3 (idle)
If A3 dies:
- Rebalancing: P3 reassigned to A1 or A2
- Stop-the-world: ~30 seconds of paused consumption
Offset Management:
- Automatic: Consumer group stores offset in
__consumer_offsetstopic - Manual: Application tracks offset in external store
- Seek: Reset offset to specific point
Consumer can do:
consumer.seek(partition, 0) // Start from beginning
consumer.seek(partition, offset - 100) // Replay last 100
consumer.seekToEnd() // Skip to latest
Important: Offset committed AFTER processing, not before.
Failure Scenarios:
Scenario 1: Message processed, consumer crashes before offset commit
→ Message reprocessed on restart (at-least-once)
Scenario 2: Message processed + offset committed, consumer crashes
→ Message skipped on restart (exactly-once from consumer perspective)
Solution: Idempotent processing (safe to replay)
RabbitMQ - Traditional Message Queue
Architecture
RabbitMQ Cluster (Queue-based)
├── Broker 1 (Master)
│ ├── Queue: order.processing
│ │ ├── Message: OrderID:1001 (unacknowledged)
│ │ ├── Message: OrderID:1002 (ack pending)
│ │ └── Message: OrderID:1003
│ └── Queue: payment.processing
│
├── Broker 2 (Mirror/Replica)
├── Broker 3 (Mirror/Replica)
Core Concepts
1. Queue (Point-to-Point)
Producer → Exchange → Queue → Consumer (ACK) → Message deleted
- Messages stay until consumed and acknowledged
- Only one consumer processes each message
- Ordering: Single queue guarantees order
- Load Balance: Multiple consumers share work (round-robin)
2. Exchange (Routing)
- Direct: Exact key matching
"user.created" �� Queue: user-service
- Topic: Pattern matching with wildcards
"user.*" → Queues: [user-service, email-service, analytics-service]
- Fanout: Broadcast to all bound queues
"broadcast" → All bound queues receive copy
- Headers: Match on message headers
3. Message Acknowledgment
Producer sends → Broker stores → Consumer processes → Consumer ACKs
If no ACK within timeout → Broker redelivers to another consumer
Manual vs Auto ACK:
- Auto: Acknowledge immediately (risky, message loss on crash)
- Manual: Acknowledge after successful processing (safe, slightly slower)
Comparison with Kafka
| Aspect | RabbitMQ | Kafka |
|---|---|---|
| Type | Queue + Pub-Sub | Distributed Log (Pub-Sub) |
| Persistence | Optional | Default (log-based) |
| Ordering | Per-queue | Per-partition |
| Throughput | Typically 10K-100K+ msgs/sec (workload and topology dependent) | Typically 100K to millions msgs/sec (partitioning, batching, and hardware dependent) |
| Replication | Yes | Yes (partition-level) |
| Consumer Offset | Auto tracked (deleted on ACK) | Manual control (time-based retention) |
| Message Loss Risk | Low if ACK enabled | 1 in 100M (configurable) |
| Use Case | Task queues, simple workflows | Event streams, high-volume logging |
Note: Throughput numbers vary heavily by message size, durability settings, ACK mode, replication factor, disk/network limits, and producer/consumer parallelism.
Distributed System Challenges
CAP Theorem & Consistency Models
CAP: Any distributed system can guarantee at most 2 of 3:
- Consistency: All nodes see same data
- Availability: System responds to requests
- Partition Tolerance: System survives network splits
In Practice: Network partitions WILL occur → Choose CP or AP
Eventual Consistency in Asynchronous Systems
State A → Message Sent → Broker → Network Delay → State B
t=0 t=10ms t=100ms
During t=10-100ms: Different services see different state
This window is EVENTUALLY CONSISTENT
Risks:
- User sees stale data temporarily
- Multiple writes create conflicts (last-write-wins not always correct)
- Compensating transactions may be needed
Example - E-Commerce:
User updates shipping address:
1. Address Service: Address updated immediately
2. Shipping Service: Receives update async, 500ms later
3. User starts checkout: Sees new address but shipping rates might be stale
Idempotency & Exactly-Once Semantics
Problem: In distributed systems, retries can cause duplicates
Delivery Guarantees:
- At-least-once: Message guaranteed delivered, but may be duplicated
- At-most-once: Message delivered once but may be lost
- Exactly-once: Message delivered exactly once (hardest to achieve)
Idempotent Processing:
Consumer processes:
if idempotencyKey not in database:
processMessage()
mark idempotencyKey as processed
else:
skip (already processed)
Example:
Payment API:
POST /payment
{
"idempotencyKey": "req-12345",
"amount": 100,
"userId": "user-123"
}
First call: Charges $100, stores idempotencyKey
Retry due to timeout: Uses same idempotencyKey, doesn't charge again
Network Partitions & Byzantine Failures
Scenario: Service A can’t reach Service B
Service A Network Partition Service B
(Active) (Active)
A can't call B, so:
- Timeout occurs
- Circuit breaker opens
- Fallback logic executes (stale cache, default value)
B processes messages from queue independently
Result: Temporarily inconsistent state (A thinks B failed, B is fine)
Recovery:
- When network heals, systems resync
- Data reconciliation may be needed
- Audit logs help identify inconsistencies
Cascading Failures
Scenario: Service B slow due to database overload
Service A → Service B (100ms, times out at 200ms)
Service C → Service A (100ms, times out at 200ms)
Service D → Service C (100ms, times out at 200ms)
A's 100 threads waiting for B → Exhausted
C's 100 threads waiting for A → Exhausted
D's 100 threads waiting for C → Exhausted
Entire system grinds to halt
Mitigation:
- Circuit Breaker: Stop calling failing service, fail fast
- Bulkheads: Isolate thread pools (don’t use shared pool)
- Timeouts: Kill long-running requests
- Fallbacks: Graceful degradation
Pattern Comparison Matrix
| Characteristic | REST/gRPC | Kafka | RabbitMQ |
|---|---|---|---|
| Latency | 50-500ms | 50-100ms (processing) | 10-100ms |
| Throughput | REST: 1K-10KgRPC: 100K+ | 1M+ msgs/sec | 10K-100K msgs/sec |
| Consistency | Strong (sync) | Eventual | Eventual |
| Coupling | Tight | Loose | Loose |
| Ordering Guarantee | No (unless coded) | Per-partition (strong) | Per-queue (strong) |
| Replay Capability | No (unless logged) | Yes (offset management) | No (delete on ACK) |
| Operational Complexity | Low | Very High | Medium |
| Infrastructure Footprint | Minimal | Large cluster required | Medium cluster |
| Failure Domain | Single hop | Broker cluster | Broker cluster |
| Best For | Query/transactional | Event streaming/analytics | Task queues/workflows |
Production Considerations
Monitoring & Observability
Key Metrics to Track:
Synchronous:
- P99, P95, P50 latencies
- Error rate (4xx, 5xx)
- Circuit breaker state changes
- Timeout occurrences
Asynchronous:
- Consumer lag (offset - latest offset)
- End-to-end latency (publish to process)
- Message loss rate
- Queue depth (buildup indicates slow consumers)
- Redelivery rate (indicates failure)
Scaling Strategies
Synchronous:
- Horizontal scaling: Add more instances
- Caching: Reduce repeated calls
- Database optimization: Faster responses → shorter timeout needed
Asynchronous:
- Partition keys: Distribute load evenly
- Consumer group scaling: Add instances
- Retention tuning: Balance disk space vs. replay capability
Failure Handling Patterns
1. Retry with Exponential Backoff
Attempt 1: Immediate
Attempt 2: 100ms + random(0-50ms)
Attempt 3: 200ms + random(0-100ms)
Attempt 4: 400ms + random(0-200ms)
Attempt 5: 800ms + random(0-400ms) [then give up]
2. Circuit Breaker
Closed (normal) → Threshold errors exceeded → Open (failing)
→ After timeout → Half-Open (test)
→ If success → Closed
→ If failure → Open
3. Bulkhead Pattern
Service A
├── Thread Pool 1 (for Service B calls) [10 threads max]
├── Thread Pool 2 (for Service C calls) [10 threads max]
└── Thread Pool 3 (for Service D calls) [10 threads max]
If B fails: Only Pool 1 exhausted, Pool 2 & 3 work fine
4. Saga Pattern (Distributed Transactions)
Order Service → Payment Service → Inventory Service
→ if Payment fails → Compensation (refund)
→ if Inventory fails → Compensation (cancel payment)
Security Considerations
Authentication/Authorization:
- gRPC: mTLS (mutual TLS) for service-to-service
- REST: OAuth2, JWT tokens
- Kafka: ACLs, authentication plugins
Data Encryption:
- In-transit: TLS/SSL
- At-rest: Disk encryption, broker-side
- End-to-end: Application-level encryption
Cost-Benefit Analysis
Choose Synchronous When:
- ✓ Low latency required (<100ms)
- ✓ Strong consistency needed
- ✓ Simple request-response pattern
- ✓ Limited message volume
- ✗ Scalability not a constraint
Choose Asynchronous When:
- ✓ Decoupling required
- ✓ High throughput needed
- ✓ Eventual consistency acceptable
- ✓ Backpressure handling required
- ✗ Real-time interaction needed
Hybrid Approach (Recommended for Complex Systems)
API Gateway (REST) → Services (gRPC for sync, critical paths)
→ Message Broker (async for non-critical paths)
→ Databases (cached for reads)
Example (E-commerce Order):
1. REST: Order created (needs immediate response)
2. gRPC: Validate inventory (need strong consistency)
3. Kafka: Async events (payment, notification, analytics)
Summary & Decision Tree
Is response needed immediately?
├─ Yes → Use Synchronous (REST/gRPC)
│ ├─ Internal service? → gRPC (faster, binary)
│ └─ Browser/public API? → REST (widely supported)
│
└─ No → Use Asynchronous (Message Broker)
├─ Single consumer per message? → RabbitMQ (task queue)
└─ Multiple subscribers needed? → Kafka (event stream)
└─ High volume (1M+ msgs/sec)? → Kafka (designed for this)
Key Takeaways for FAANG Interviews
- Synchronous = Tight Coupling + Availability Amplification → Use for critical path only
- Asynchronous = Loose Coupling + Operational Complexity → Use for background jobs
- gRPC is the future → 10x faster than REST, but less accessible
- Kafka is the standard → High throughput, replay capability, event sourcing
- Idempotency is non-negotiable → Design all async handlers to be idempotent
- Observability is critical → Monitor latencies, lag, error rates, queue depth
- CAP theorem is reality → Partition tolerance mandatory, choose CP or AP
- Complexity scales with scale → Simple systems can be synchronous, complex ones need async