System Design Interview: Real-Time Chat System

Design a real-time chat application similar to Slack or WhatsApp. This question tests understanding of real-time communication, message delivery guarantees, presence systems, and data modeling at scale.

Interview Format (45 minutes)

Time Allocation:

Requirements gathering: 5-8 minutes
High-level design: 10-15 minutes
Deep dive: 15-20 minutes
Scale and edge cases: 5-10 minutes

Step 1: Requirements Gathering (5-8 min)

A strong candidate will clarify the scope before designing anything.

Functional Requirements

Good questions to ask:

Is this 1:1 chat, group chat, or both? (both, groups up to 500 members)
Do we need message history/persistence? (yes, searchable)
What message types? (text, images, files)
Do we need read receipts? (yes)
Online/offline presence? (yes)
Push notifications? (yes, for offline users)
Message editing/deletion? (yes)

Agreed requirements:

1:1 and group messaging (up to 500 members)
Real-time message delivery
Persistent message history with search
Read receipts and typing indicators
Online/offline presence
Push notifications for offline users
Image and file sharing
Message editing and deletion

Non-Functional Requirements

Good questions to ask:

Expected user base? (50M DAU)
Messages per day? (1B messages/day)
Message size limit? (64KB text, 100MB files)
Latency requirements? (<200ms delivery)
Geographic distribution? (global)
Message retention? (forever for paid, 90 days for free)

Agreed requirements:

Low latency (<200ms for message delivery)
High availability (99.99% uptime)
Message ordering guaranteed within a conversation
At-least-once delivery (with deduplication)
End-to-end encryption (stretch goal)

Calculations

Messages:

50M DAU, average 20 messages/day = 1B messages/day
Average message size: 200 bytes
1B x 200 bytes = 200GB/day = 73TB/year

Connections:

50M concurrent WebSocket connections (worst case: every daily active user online at peak)
Each connection: ~10KB memory
50M x 10KB = 500GB RAM for connections alone

QPS:

1B messages/day / 86,400 sec/day = ~11,600 messages/sec, rounded to ~12,000 average
Peak (3x): ~36,000 messages/sec
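
The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the figures already stated in this section; the variable names are illustrative, and decimal units (1 GB = 10^9 bytes) are assumed.

```python
# Back-of-envelope capacity estimate from the agreed figures.
SECONDS_PER_DAY = 24 * 60 * 60

dau = 50_000_000              # daily active users
msgs_per_user_per_day = 20    # average from the requirements
avg_msg_bytes = 200           # average message size
conn_bytes = 10_000           # ~10KB of memory per WebSocket connection
peak_factor = 3               # peak-to-average ratio

msgs_per_day = dau * msgs_per_user_per_day
storage_gb_per_day = msgs_per_day * avg_msg_bytes / 1e9
storage_tb_per_year = storage_gb_per_day * 365 / 1e3
connection_ram_gb = dau * conn_bytes / 1e9
avg_msgs_per_sec = msgs_per_day / SECONDS_PER_DAY
peak_msgs_per_sec = avg_msgs_per_sec * peak_factor

print(f"{msgs_per_day:,} messages/day")
print(f"{storage_gb_per_day:.0f} GB/day, {storage_tb_per_year:.0f} TB/year")
print(f"{connection_ram_gb:.0f} GB RAM for connections")
print(f"{avg_msgs_per_sec:,.0f} msg/s average, {peak_msgs_per_sec:,.0f} msg/s peak")
```

The unrounded average is about 11,574 messages/sec, so the peak comes out slightly below the rounded 36,000 figure; either is acceptable in an interview as long as the candidate states the rounding.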

Red flags if candidate:

Designs only for HTTP polling
Doesn't consider message ordering
Ignores offline scenarios
Doesn't ask about group size limits

Step 2: High-Level Design (10-15 min)

API Design

WebSocket Connection:

wss://chat.example.com/ws?token=<auth_token>

// Client -> Server
{
  "type": "send_message",
  "conversationId": "conv_123",
  "content": "Hello!",
  "clientMessageId": "client_uuid_456"  // for deduplication
}

// Server -> Client
{
  "type": "new_message",
  "messageId": "msg_789",
  "conversationId": "conv_123",
  "senderId": "user_001",
  "content": "Hello!",
  "timestamp": "2026-02-14T10:30:00Z"
}

REST APIs (for non-real-time operations):

GET  /api/conversations                    # List conversations
GET  /api/conversations/:id/messages       # Message history (paginated)
POST /api/conversations                    # Create conversation/group
POST /api/conversations/:id/messages       # Send message (fallback)
PUT  /api/messages/:id                     # Edit message
DELETE /api/messages/:id                   # Delete message
POST /api/upload                           # Upload file/image

Good candidate discusses:

WebSocket vs SSE vs long polling trade-offs
REST fallback for reliability
Client-generated message IDs for deduplication
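
To make the deduplication point concrete, the sketch below shows a client reusing the same clientMessageId on retry and a server-side stand-in that accepts each ID only once. The names `build_send_message` and `DedupingReceiver` are hypothetical, and the in-memory dict stands in for whatever store a real service would use.

```python
import json
import uuid


def build_send_message(conversation_id, content, client_message_id=None):
    """Serialize a send_message frame; pass the original ID when retrying."""
    return json.dumps({
        "type": "send_message",
        "conversationId": conversation_id,
        "content": content,
        "clientMessageId": client_message_id or str(uuid.uuid4()),
    })


class DedupingReceiver:
    """Hypothetical server-side stand-in that stores each client ID once."""

    def __init__(self):
        self.seen = {}      # clientMessageId -> server-assigned messageId
        self.next_id = 1

    def receive(self, frame):
        msg = json.loads(frame)
        client_id = msg["clientMessageId"]
        if client_id in self.seen:
            # Retry of a message that was already stored: re-acknowledge it
            return {"messageId": self.seen[client_id], "duplicate": True}
        message_id = f"msg_{self.next_id}"
        self.next_id += 1
        self.seen[client_id] = message_id
        return {"messageId": message_id, "duplicate": False}


first = build_send_message("conv_123", "Hello!")
# The ack was lost, so the client resends with the same clientMessageId
retry = build_send_message("conv_123", "Hello!",
                           json.loads(first)["clientMessageId"])

server = DedupingReceiver()
ack1 = server.receive(first)
ack2 = server.receive(retry)
```

Because the retry carries the original ID, the server returns the same messageId twice and stores the message once, which is what turns at-least-once delivery into effectively-once from the user's point of view.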

Core Components

┌───────────────┐
│    Clients    │
└───────┬───────┘
        │ WSS
┌───────▼───────────────────────────────────┐
│          WebSocket Gateway                │
│  (Connection management, routing)         │
└───────┬───────────────┬───────────────────┘
        │               │
┌───────▼───────┐ ┌─────▼──────────────┐
│  Chat Service │ │  Presence Service  │
│  (Messages)   │ │  (Online status)   │
└───────┬───────┘ └─────┬──────────────┘
        │               │
┌───────▼───────┐ ┌─────▼──────────────┐
│  Message DB   │ │  Redis Cluster     │
│  (Cassandra)  │ │(Presence + Pub/Sub)│
└───────────────┘ └────────────────────┘

Data Model

Messages (Cassandra / DynamoDB):

sql

-- Partition by conversation, sorted by time
CREATE TABLE messages (
    conversation_id UUID,
    message_id TIMEUUID,
    sender_id UUID,
    content TEXT,
    content_type TEXT,       -- 'text', 'image', 'file'
    media_url TEXT,
    created_at TIMESTAMP,
    edited_at TIMESTAMP,
    deleted BOOLEAN,
    PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

Conversations (PostgreSQL):

sql

CREATE TABLE conversations (
    id UUID PRIMARY KEY,
    type VARCHAR(10),         -- 'direct', 'group'
    name VARCHAR(255),
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

CREATE TABLE conversation_members (
    conversation_id UUID REFERENCES conversations(id),
    user_id UUID,
    role VARCHAR(20) DEFAULT 'member',
    joined_at TIMESTAMP,
    last_read_message_id UUID,
    PRIMARY KEY (conversation_id, user_id)
);

CREATE INDEX idx_user_conversations
    ON conversation_members(user_id);
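
Because messages are clustered newest-first within a conversation, the paginated history endpoint can use the last message ID of a page as the cursor for the next one. The following is a minimal in-memory sketch of that read path; `page_messages` is a hypothetical helper, and integer IDs stand in for time-ordered TIMEUUIDs.

```python
def page_messages(messages, before=None, limit=50):
    """Return one page of history plus the cursor for the next page.

    messages: list of (message_id, body) tuples sorted newest first,
    mirroring the DESC clustering order of the messages table.
    """
    if before is not None:
        # Equivalent to "WHERE message_id < before" within one partition
        messages = [m for m in messages if m[0] < before]
    page = messages[:limit]
    # Only hand out a cursor when older messages remain
    next_cursor = page[-1][0] if len(messages) > limit else None
    return page, next_cursor


history = [(i, f"m{i}") for i in range(10, 0, -1)]   # IDs 10 down to 1

page1, cursor1 = page_messages(history, limit=4)
page2, cursor2 = page_messages(history, before=cursor1, limit=4)
page3, cursor3 = page_messages(history, before=cursor2, limit=4)
```

Cursor-based paging stays stable when new messages arrive while the user scrolls back, which offset-based paging does not.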

Step 3: Deep Dive (15-20 min)

Message Delivery Flow

Sender -> WebSocket Gateway -> Chat Service -> Message DB
                                    |
                                    v
                              Message Queue
                                    |
                    ┌───────────────┼───────────────┐
                    v               v               v
              WS Gateway      WS Gateway      Push Service
              (User A)        (User B)        (Offline Users)

Implementation:

python

class ChatService:
    def handle_message(self, sender_id, conversation_id, content, client_msg_id):
        # 1. Deduplication check: a retry reuses the same client ID, so
        #    acknowledge it again instead of returning nothing
        if self.message_store.exists_by_client_id(client_msg_id):
            return {"status": "duplicate", "clientMessageId": client_msg_id}

        # 2. Validate sender is member of conversation
        if not self.is_member(sender_id, conversation_id):
            raise PermissionError("Not a member")

        # 3. Store message
        message = self.message_store.create(
            conversation_id=conversation_id,
            sender_id=sender_id,
            content=content,
            client_message_id=client_msg_id
        )

        # 4. Get conversation members, excluding the sender
        recipients = [m for m in self.get_members(conversation_id)
                      if m != sender_id]

        # 5. Fan out via pub/sub; only gateways holding a recipient's
        #    connection are subscribed, so offline users receive nothing here
        for member_id in recipients:
            self.pubsub.publish(
                channel=f"user:{member_id}",
                message=message.to_dict()
            )

        # 6. Send push notifications to offline members
        offline_members = [m for m in recipients
                          if not self.presence.is_online(m)]
        self.push_service.notify(offline_members, message)

        # 7. Acknowledge to sender. The message is stored and published at
        #    this point; delivery to recipients is not yet confirmed
        return {"status": "sent", "messageId": message.id}

Presence System

Challenge: Tracking 50M online users in real time

python

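import time

class PresenceService:
    def __init__(self, redis, pubsub, scheduler, server_id):
        # Assumes a redis-py client created with decode_responses=True,
        # so hash fields come back as str rather than bytes
        self.redis = redis
        self.pubsub = pubsub
        self.scheduler = scheduler
        self.server_id = server_id
        self.HEARTBEAT_INTERVAL = 30  # seconds between client heartbeats
        self.TIMEOUT = 90  # seconds; three missed heartbeats expire the key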

    def user_connected(self, user_id):
        self.redis.hset(f"presence:{user_id}", mapping={
            "status": "online",
            "last_seen": time.time(),
            "server_id": self.server_id
        })
        self.redis.expire(f"presence:{user_id}", self.TIMEOUT)

        # Notify contacts
        self._broadcast_status(user_id, "online")

    def heartbeat(self, user_id):
        self.redis.hset(f"presence:{user_id}",
                       "last_seen", time.time())
        self.redis.expire(f"presence:{user_id}", self.TIMEOUT)

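    def user_disconnected(self, user_id):
        # Don't immediately mark offline (might reconnect)
        self.redis.hset(f"presence:{user_id}",
                       "status", "away")
        # Re-apply the TTL: if the key had already expired, HSET recreates
        # it, and without a TTL it would never be cleaned up
        self.redis.expire(f"presence:{user_id}", self.TIMEOUT)

        # Schedule offline check after grace period
        # (_check_still_offline is not shown here)
        self.scheduler.schedule(
            delay=30,
            task=self._check_still_offline,
            args=(user_id,)
        )

    def is_online(self, user_id):
        data = self.redis.hgetall(f"presence:{user_id}")
        # A key recreated by user_disconnected may have no last_seen field
        if not data or "last_seen" not in data:
            return False
        return (time.time() - float(data["last_seen"])) < self.TIMEOUT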

    def _broadcast_status(self, user_id, status):
        # Only broadcast to users who have this user in their contacts
        contacts = self.get_contacts(user_id)
        for contact_id in contacts:
            self.pubsub.publish(
                channel=f"user:{contact_id}",
                message={"type": "presence", "userId": user_id, "status": status}
            )

Strong candidate discusses:

Heartbeat mechanism vs connection-based detection
Grace period before marking offline
Fan-out problem for popular users (hundreds of contacts)
Lazy presence (only check when user opens a conversation)
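
The heartbeat approach can be illustrated without Redis. The sketch below is an in-memory stand-in with an injectable clock, so the timeout behavior is easy to reason about; `InMemoryPresence` and `FakeClock` are hypothetical names, and the 90-second timeout mirrors the value used in the class above.

```python
import time


class InMemoryPresence:
    """Heartbeat-based presence; a dict stands in for Redis keys with TTLs."""

    def __init__(self, timeout=90, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}   # user_id -> time of the last heartbeat

    def heartbeat(self, user_id):
        self.last_seen[user_id] = self.clock()

    def is_online(self, user_id):
        seen = self.last_seen.get(user_id)
        return seen is not None and self.clock() - seen < self.timeout


class FakeClock:
    """Manually advanced clock so the example does not need to sleep."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
presence = InMemoryPresence(timeout=90, clock=clock)

presence.heartbeat("user_001")
clock.advance(30)
online_after_30s = presence.is_online("user_001")   # within the timeout
clock.advance(60)
online_after_90s = presence.is_online("user_001")   # timeout has elapsed
```

With lazy presence, `is_online` would be called only when a user opens a conversation, so no status change needs to be broadcast to every contact.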

Read Receipts and Typing Indicators

python

# Read receipts: persistent (stored in DB)
def mark_read(user_id, conversation_id, message_id):
    db.update("conversation_members",
        set={"last_read_message_id": message_id},
        where={"conversation_id": conversation_id,
               "user_id": user_id})

    # Notify other members
    pubsub.publish(f"conv:{conversation_id}", {
        "type": "read_receipt",
        "userId": user_id,
        "lastReadMessageId": message_id
    })

# Typing indicators: ephemeral (never stored)
def typing_started(user_id, conversation_id):
    pubsub.publish(f"conv:{conversation_id}", {
        "type": "typing",
        "userId": user_id,
        "status": "started"
    })
    # Nothing is stored server-side; receiving clients clear the indicator
    # after 5 seconds without a refresh (in case the stop event is lost)
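
One way to handle the expiry is on the receiving client. The sketch below assumes a hypothetical `TypingTracker` that the client feeds with incoming typing events; the 5-second TTL comes from the comment above, and the clock is injected so the behavior can be checked without waiting.

```python
class TypingTracker:
    """Client-side view of who is typing, with automatic expiry."""

    TTL = 5.0  # seconds an indicator stays visible without a refresh

    def __init__(self, clock):
        self.clock = clock
        self.expires_at = {}   # user_id -> time at which the indicator lapses

    def on_event(self, user_id, status):
        if status == "started":
            self.expires_at[user_id] = self.clock() + self.TTL
        else:
            self.expires_at.pop(user_id, None)

    def currently_typing(self):
        now = self.clock()
        # Drop indicators whose stop event never arrived
        self.expires_at = {u: t for u, t in self.expires_at.items() if t > now}
        return sorted(self.expires_at)


now = [0.0]                       # mutable cell acting as a fake clock
tracker = TypingTracker(clock=lambda: now[0])

tracker.on_event("user_a", "started")
tracker.on_event("user_b", "started")
tracker.on_event("user_b", "stopped")
typing_at_start = tracker.currently_typing()

now[0] = 5.0                      # user_a's stop event was lost
typing_after_ttl = tracker.currently_typing()
```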

Message Ordering

Challenge: Ensuring messages appear in correct order across devices

Approach: Server-assigned timestamps + sequence numbers

python

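import time

class MessageOrderer:
    def __init__(self, redis):
        self.redis = redis

    def assign_order(self, conversation_id, message):
        # INCR is atomic in Redis, so concurrent senders in one conversation
        # always receive distinct, increasing sequence numbers
        seq = self.redis.incr(f"seq:{conversation_id}")
        message.sequence_number = seq
        message.server_timestamp = time.time_ns()
        return message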

    def resolve_conflicts(self, messages):
        # Sort by sequence number (primary)
        # Then by server timestamp (secondary)
        return sorted(messages,
                     key=lambda m: (m.sequence_number, m.server_timestamp))

Strong candidate discusses:

Client-side vs server-side timestamps
Causal ordering vs total ordering
Handling out-of-order delivery on client
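
On the client, per-conversation sequence numbers make out-of-order delivery straightforward to handle: hold back any message that arrives ahead of a gap and release it once the gap is filled. The following is a minimal sketch; `ReorderBuffer` is a hypothetical name, and a production client would also need a timeout that fetches missing messages over REST.

```python
class ReorderBuffer:
    """Releases messages strictly in sequence order for one conversation."""

    def __init__(self, next_seq=1):
        self.next_seq = next_seq   # next sequence number to show the user
        self.pending = {}          # seq -> message that arrived early

    def receive(self, seq, message):
        """Return the messages that became deliverable, in order."""
        if seq < self.next_seq or seq in self.pending:
            return []              # duplicate delivery, ignore
        self.pending[seq] = message
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


buf = ReorderBuffer()
out_early = buf.receive(2, "second")      # held back, seq 1 is missing
out_filled = buf.receive(1, "first")      # gap filled, both released
out_duplicate = buf.receive(2, "second")  # redelivery is dropped
out_next = buf.receive(3, "third")
```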

Step 4: Scale and Edge Cases (5-10 min)

Scaling WebSocket Connections

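Problem: A single gateway server can hold roughly 500K connections (an assumed figure; the real limit depends on memory and kernel tuning), so 50M connections need at least 100 gateways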

Solution: WebSocket Gateway Cluster

┌────────────────────────────────────────────────┐
│               Load Balancer (L4)               │
│          (Sticky sessions by user_id)          │
└──────┬──────────┬──────────┬──────────┬────────┘
       │          │          │          │
  ┌────▼────┐ ┌──▼─────┐ ┌─▼──────┐ ┌▼───────┐
  │  WS GW  │ │ WS GW  │ │ WS GW  │ │ WS GW  │
  │  500K   │ │ 500K   │ │ 500K   │ │ 500K   │
  └────┬────┘ └──┬─────┘ └─┬──────┘ └┬───────┘
       │         │         │         │
       └────────┬┴─────────┴─────────┘
                │
       ┌────────▼──────────┐
       │   Redis Pub/Sub   │
       │   (Message Bus)   │
       └───────────────────┘

Connection registry (which user is on which server):

python

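class ConnectionRegistry:
    def __init__(self, redis, pubsub):
        self.redis = redis
        self.pubsub = pubsub

    def register(self, user_id, server_id):
        # One set per user: an entry for each gateway holding one of the user's devices
        self.redis.sadd(f"connections:{user_id}", server_id)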

    def unregister(self, user_id, server_id):
        self.redis.srem(f"connections:{user_id}", server_id)

    def get_servers(self, user_id):
        return self.redis.smembers(f"connections:{user_id}")

    def route_message(self, user_id, message):
        servers = self.get_servers(user_id)
        for server_id in servers:
            self.pubsub.publish(f"server:{server_id}", {
                "target_user": user_id,
                "message": message
            })
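
The same routing logic can be exercised in memory to show the multi-device case, where one user holds connections on two gateways. This is a sketch under stated assumptions: `InMemoryRegistry` is a hypothetical stand-in for the Redis-backed class, and the `outbox` dict replaces the per-server pub/sub channels.

```python
from collections import defaultdict


class InMemoryRegistry:
    """Tracks which gateways hold connections for each user."""

    def __init__(self):
        self.servers_by_user = defaultdict(set)
        self.outbox = defaultdict(list)   # server_id -> routed envelopes

    def register(self, user_id, server_id):
        self.servers_by_user[user_id].add(server_id)

    def unregister(self, user_id, server_id):
        self.servers_by_user[user_id].discard(server_id)
        if not self.servers_by_user[user_id]:
            del self.servers_by_user[user_id]   # no connections left

    def route_message(self, user_id, message):
        """Queue the message on every gateway serving the user."""
        servers = sorted(self.servers_by_user.get(user_id, ()))
        for server_id in servers:
            self.outbox[server_id].append(
                {"target_user": user_id, "message": message})
        return servers


registry = InMemoryRegistry()
registry.register("user_001", "gw-1")   # phone
registry.register("user_001", "gw-2")   # laptop

both_devices = registry.route_message("user_001", "hi")
registry.unregister("user_001", "gw-1")
laptop_only = registry.route_message("user_001", "still there?")
nobody = registry.route_message("user_999", "hello?")
```

An empty result from `route_message` is the signal to fall back to a push notification. A real registry also needs to clean up entries left behind when a gateway crashes without unregistering its users.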

Group Message Fan-Out

Problem: A message to a 500-person group means 499 deliveries

python

def fan_out_group_message(conversation_id, message):
    # Exclude the sender: a 500-member group needs 499 deliveries
    recipients = [m for m in get_members(conversation_id)
                  if m != message.sender_id]

    if len(recipients) <= 50:
        # Small group: fan-out on write (push to each member)
        for member_id in recipients:
            deliver_to_user(member_id, message)
    else:
        # Large group: fan-out on read (members pull when online)
        store_in_conversation_feed(conversation_id, message)
        # Only push to online + mentioned users
        online = {m for m in recipients if is_online(m)}
        # Ignore mentions of users who are not in the conversation
        mentioned = set(extract_mentions(message.content)) & set(recipients)
        for user_id in online | mentioned:
            deliver_to_user(user_id, message)

Offline Message Sync

python

def sync_messages(user_id):
    """Called when a user comes back online; resumes from each conversation's read cursor"""
    conversations = get_user_conversations(user_id)

    unread = {}
    for conv_id in conversations:
        last_read = get_last_read_message(user_id, conv_id)
        new_messages = get_messages_after(conv_id, last_read,
                                         limit=50)
        if new_messages:
            unread[conv_id] = {
                "messages": new_messages,
                "unread_count": count_unread(conv_id, last_read)
            }

    return unread

Edge Cases

Strong candidates identify:

Network partitions (messages sent but not acknowledged)
Device sync (user on phone and laptop simultaneously)
Large media files (separate upload flow with CDN)
Spam and abuse (rate limiting, content moderation)
Message deletion propagation across all devices
Clock skew between servers
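
For the rate-limiting item, one common option is a token bucket per user. The sketch below is an illustration of that technique rather than something this guide prescribes; `TokenBucket`, its parameters, and the chosen limits are all assumptions.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


now = [0.0]                        # mutable cell acting as a fake clock
bucket = TokenBucket(rate=1, capacity=3, clock=lambda: now[0])

burst = [bucket.allow() for _ in range(4)]        # fourth send is rejected
now[0] = 2.0                                      # two seconds later
after_wait = [bucket.allow() for _ in range(3)]   # two tokens were refilled
```

In a deployment with many gateways the bucket state would have to live in shared storage so that a user cannot bypass the limit by reconnecting to a different server.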

Evaluation Rubric

Strong Performance (Hire)

Chooses WebSockets with proper justification
Designs for message ordering and delivery guarantees
Handles presence efficiently at scale
Considers fan-out strategies for groups
Discusses offline sync and push notifications
Clear separation of real-time vs persistent data
Mentions security (encryption, auth)

Adequate Performance (Maybe)

Functional design with WebSockets
Basic message storage and retrieval
Some scaling considerations
Misses edge cases like offline sync or ordering
Can be guided toward better solutions

Weak Performance (No Hire)

Only considers HTTP polling
No thought given to delivery guarantees
Doesn't address group messaging challenges
Can't reason about connection management at scale
Poor data model choices

Follow-up Questions

For senior candidates:

How would you implement end-to-end encryption?
Design the notification system in detail
How would you handle message search across billions of messages?
How would you implement message reactions and threads?

For staff+ candidates:

Design the infrastructure for global deployment with <100ms latency
How would you handle compliance (message retention, legal holds)?
Design the system for 500M DAU
How would you implement real-time translation?

This question tests real-time systems design, pub/sub patterns, presence management, and data consistency under concurrent writes. A strong candidate will balance latency requirements with delivery guarantees while maintaining clear system boundaries.
