System Design Document Template
A high-level template for documenting system architecture with components, data flow, and trade-offs.
System Design Documents capture the high-level architecture of systems, focusing on components, interactions, and trade-offs. They serve as the canonical reference for how a system works and why it was designed that way.
Why Use System Design Documents?
Benefits:
- Provides a shared understanding of system architecture
- Facilitates onboarding new team members
- Enables informed decision-making for changes
- Documents trade-offs and constraints
- Serves as a foundation for technical discussions
When to write a system design doc:
- New systems or major rewrites
- Significant architectural changes
- System handoffs between teams
- Before scaling or optimization efforts
- As part of technical due diligence
The Template
# System Design: [System Name]
**Version:** [1.0]
**Author:** [Name]
**Last Updated:** [YYYY-MM-DD]
**Status:** [Draft | Current | Deprecated]
## Executive Summary
[2-3 paragraph overview of the system, its purpose, and key characteristics]
## System Context
### Purpose
[What problem does this system solve? What value does it provide?]
### Users
| User Type | Description | Scale |
|-----------|-------------|-------|
| [User 1] | [Description] | [Number] |
| [User 2] | [Description] | [Number] |
### Key Requirements
**Functional:**
- [Requirement 1]
- [Requirement 2]
**Non-Functional:**
- [Performance requirement]
- [Availability requirement]
- [Security requirement]
## Architecture Overview
### High-Level Architecture
[ASCII diagram or description of the overall architecture]
### Key Components
| Component | Responsibility | Technology |
|-----------|----------------|------------|
| [Component 1] | [What it does] | [Tech stack] |
| [Component 2] | [What it does] | [Tech stack] |
### System Boundaries
[What is in scope vs. out of scope for this system]
## Detailed Design
### Component 1: [Name]
**Purpose:** [What this component does]
**Interfaces:**
- Input: [Description]
- Output: [Description]
**Key Behaviors:**
- [Behavior 1]
- [Behavior 2]
**Dependencies:**
- [Dependency 1]
- [Dependency 2]
[Repeat for each major component]
### Data Flow
[Diagram showing how data flows through the system]
**Key Flows:**
1. **[Flow Name]**
- Step 1: [Description]
- Step 2: [Description]
- Step 3: [Description]
### Data Model
**Primary Entities:**
[Entity Relationship Diagram or description]
**Key Tables/Collections:**
| Entity | Description | Key Fields |
|--------|-------------|------------|
| [Entity 1] | [Description] | [Fields] |
| [Entity 2] | [Description] | [Fields] |
### APIs and Interfaces
**External APIs:**
| Endpoint | Method | Purpose |
|----------|--------|---------|
| [Endpoint] | [GET/POST/etc] | [Description] |
**Internal Interfaces:**
[Description of internal communication patterns]
## Infrastructure
### Deployment Architecture
[Deployment diagram showing environments, regions, etc.]
### Technology Stack
| Layer | Technology | Justification |
|-------|------------|---------------|
| Frontend | [Tech] | [Why] |
| Backend | [Tech] | [Why] |
| Database | [Tech] | [Why] |
| Cache | [Tech] | [Why] |
| Message Queue | [Tech] | [Why] |
### Scaling Strategy
**Horizontal Scaling:**
- [Component 1]: [Strategy]
- [Component 2]: [Strategy]
**Vertical Scaling:**
- [Where applicable]
**Auto-scaling Rules:**
- [Rule 1]
- [Rule 2]
## Reliability
### Availability Targets
| Component | Target | Current | SLI |
|-----------|--------|---------|-----|
| [Component] | [99.9%] | [99.85%] | [Metric] |
### Failure Modes
| Failure | Impact | Detection | Recovery |
|---------|--------|-----------|----------|
| [Failure 1] | [Impact] | [How detected] | [Recovery steps] |
### Disaster Recovery
- **RPO (Recovery Point Objective):** [Time]
- **RTO (Recovery Time Objective):** [Time]
- **Backup Strategy:** [Description]
- **DR Procedure:** [Link to runbook]
## Security
### Authentication & Authorization
[Description of auth mechanisms]
### Data Protection
- **Encryption at rest:** [Description]
- **Encryption in transit:** [Description]
- **PII handling:** [Description]
### Security Controls
| Control | Implementation |
|---------|----------------|
| [Control 1] | [How implemented] |
| [Control 2] | [How implemented] |
## Monitoring & Observability
### Key Metrics
| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| [Metric 1] | [Description] | [Threshold] |
| [Metric 2] | [Description] | [Threshold] |
### Logging
- **Log format:** [Description]
- **Retention:** [Duration]
- **Key log events:** [List]
### Tracing
[Description of distributed tracing implementation]
### Dashboards
- [Dashboard 1]: [Purpose and link]
- [Dashboard 2]: [Purpose and link]
## Trade-offs and Decisions
### Decision 1: [Title]
**Context:** [What prompted this decision]
**Options Considered:**
1. [Option 1]
2. [Option 2]
**Decision:** [What was chosen]
**Trade-offs:**
- Pros: [Benefits]
- Cons: [Drawbacks]
[Repeat for key decisions]
## Known Limitations
| Limitation | Impact | Mitigation | Future Plans |
|------------|--------|------------|--------------|
| [Limitation 1] | [Impact] | [Current mitigation] | [Plans to address] |
## Future Considerations
- [Planned improvement 1]
- [Planned improvement 2]
- [Technical debt to address]
## Appendix
### Glossary
| Term | Definition |
|------|------------|
| [Term] | [Definition] |
### References
- [Architecture Decision Records]
- [Related System Design Docs]
- [External Documentation]
### Changelog
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | [Date] | [Author] | Initial version |
Complete Example
# System Design: Order Processing System
**Version:** 2.1
**Author:** Engineering Team
**Last Updated:** 2025-10-10
**Status:** Current
## Executive Summary
The Order Processing System (OPS) handles all customer orders from placement through fulfillment. It processes approximately 50,000 orders daily across web, mobile, and API channels, integrating with inventory, payment, and shipping systems.
The system was redesigned in 2024 to support 10x growth, moving from a monolithic architecture to event-driven microservices. Key improvements include sub-second order confirmation, real-time inventory updates, and 99.95% availability.
This document describes the current production architecture and serves as the canonical reference for the order processing domain.
## System Context
### Purpose
Transform customer purchase intent into fulfilled orders by:
- Validating and accepting orders
- Processing payments
- Coordinating inventory allocation
- Orchestrating fulfillment
- Providing order status and tracking
### Users
| User Type | Description | Scale |
|-----------|-------------|-------|
| Customers | End users placing orders | 500K daily active |
| Customer Service | Internal staff managing orders | 200 users |
| Fulfillment Centers | Warehouse systems | 12 locations |
| Partner Systems | B2B integrations | 15 partners |
### Key Requirements
**Functional:**
- Accept orders from multiple channels (web, mobile, API)
- Process payments securely
- Allocate inventory in real-time
- Generate shipping labels and tracking
- Support order modifications and cancellations
- Provide real-time order status
**Non-Functional:**
- Availability: 99.95% uptime
- Latency: <500ms for order placement (p99)
- Throughput: 1,000 orders/minute sustained, 5,000 peak
- Durability: Zero order loss
- Security: PCI-DSS compliant
## Architecture Overview
### High-Level Architecture

                                        ┌─────────────────┐
                                        │   CloudFront    │
                                        │       CDN       │
                                        └────────┬────────┘
                                                 │
    ┌─────────────┐    ┌─────────────┐   ┌───────▼────────┐
    │     Web     │───►│             │   │                │
    │     App     │    │     API     │◄──│      Load      │
    ├─────────────┤    │   Gateway   │   │    Balancer    │
    │   Mobile    │───►│             │   │                │
    │     App     │    └──────┬──────┘   └────────────────┘
    └─────────────┘           │
                              │
             ┌────────────────┼────────────────┐
             │                │                │
             ▼                ▼                ▼
      ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
      │    Order    │  │   Payment   │  │  Inventory  │
      │   Service   │  │   Service   │  │   Service   │
      └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
             │                │                │
             └───────────────►├◄───────────────┘
                              │
                       ┌──────▼──────┐
                       │    Kafka    │
                       │   Events    │
                       └──────┬──────┘
                              │
             ┌────────────────┼────────────────┐
             │                │                │
             ▼                ▼                ▼
      ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
      │ Fulfillment │  │ Notification│  │  Analytics  │
      │   Service   │  │   Service   │  │   Service   │
      └─────────────┘  └─────────────┘  └─────────────┘

### Key Components
| Component | Responsibility | Technology |
|-----------|----------------|------------|
| API Gateway | Request routing, rate limiting, auth | Kong |
| Order Service | Order lifecycle management | Node.js, PostgreSQL |
| Payment Service | Payment processing, fraud detection | Java, Stripe API |
| Inventory Service | Stock management, allocation | Go, Redis, PostgreSQL |
| Fulfillment Service | Warehouse coordination, shipping | Python, PostgreSQL |
| Notification Service | Customer communications | Node.js, SendGrid, Twilio |
| Event Bus | Async communication | Kafka |
### System Boundaries
**In Scope:**
- Order CRUD operations
- Payment processing
- Inventory allocation
- Fulfillment coordination
- Order notifications
**Out of Scope:**
- Product catalog (separate system)
- Customer accounts (identity service)
- Physical warehouse operations
- Financial reconciliation
## Detailed Design
### Component: Order Service
**Purpose:** Manages the complete order lifecycle from creation through completion.
**Interfaces:**
- Input: REST API, Kafka events
- Output: Kafka events, PostgreSQL
**Key Behaviors:**
- Validates order data and customer eligibility
- Creates order records with unique IDs
- Coordinates with payment and inventory services
- Manages order state transitions
- Handles modifications and cancellations
**State Machine:**

    ┌─────────┐  validate   ┌──────────┐  payment   ┌──────────┐
    │ CREATED │────────────►│ VALIDATED│───────────►│   PAID   │
    └─────────┘             └──────────┘            └────┬─────┘
                                                         │
         ┌───────────────────────────────────────────────┤
         │                                               │
         ▼                                               ▼
    ┌─────────┐    ship     ┌──────────┐  deliver   ┌──────────┐
    │CANCELLED│◄────────────│PROCESSING│───────────►│ SHIPPED  │
    └─────────┘             └──────────┘            └────┬─────┘
                                                         │
                                                         ▼
                                                    ┌──────────┐
                                                    │DELIVERED │
                                                    └──────────┘

**Dependencies:**
- Payment Service (sync call for payment)
- Inventory Service (sync call for allocation)
- Kafka (async event publishing)
- PostgreSQL (persistence)
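
The Order Service state machine can be enforced with a small transition table. The sketch below is a minimal illustration, not the service's actual code: the state names come from the state diagram, but the transition set is an assumption. In particular, `PAID -> PROCESSING` is assumed because the diagram shows no arrow into `PROCESSING`, and cancellation is allowed only from `PAID` and `PROCESSING`.

```python
from enum import Enum


class OrderState(Enum):
    CREATED = "CREATED"
    VALIDATED = "VALIDATED"
    PAID = "PAID"
    PROCESSING = "PROCESSING"
    SHIPPED = "SHIPPED"
    DELIVERED = "DELIVERED"
    CANCELLED = "CANCELLED"


# Assumed transition table: one reading of the state diagram, not a confirmed spec.
ALLOWED_TRANSITIONS = {
    OrderState.CREATED: {OrderState.VALIDATED},
    OrderState.VALIDATED: {OrderState.PAID},
    OrderState.PAID: {OrderState.PROCESSING, OrderState.CANCELLED},
    OrderState.PROCESSING: {OrderState.SHIPPED, OrderState.CANCELLED},
    OrderState.SHIPPED: {OrderState.DELIVERED},
    OrderState.DELIVERED: set(),   # terminal
    OrderState.CANCELLED: set(),   # terminal
}


class InvalidTransition(Exception):
    """Raised when an order is asked to move to a state it cannot reach."""


def transition(current: OrderState, target: OrderState) -> OrderState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    return target
```

Keeping the table as data rather than scattered `if` checks makes it easy to review against the diagram and to update when the diagram changes.
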
### Data Flow
**Order Placement Flow:**

    Customer ──► API Gateway ──► Order Service ──► Payment Service
                                       │                  │
                                       │                  ▼
                                       │            Stripe/PayPal
                                       │                  │
                                       ◄──────────────────┘
                                       │
                                       ▼
                               Inventory Service
                                       │
                                       ▼
                                  Kafka Event
                                       │
                       ┌───────────────┼───────────────┐
                       │               │               │
                       ▼               ▼               ▼
                  Fulfillment    Notification      Analytics

**Sequence:**
1. **Order Created** (10ms)
- Validate request payload
- Check customer eligibility
- Generate order ID
- Persist draft order
2. **Payment Processing** (200-500ms)
- Call Payment Service
- Process via payment provider
- Handle 3DS if required
- Update order with payment ID
3. **Inventory Allocation** (50ms)
- Reserve inventory per item
- Handle partial allocation
- Confirm allocation
4. **Order Confirmed** (5ms)
- Update order status
- Publish OrderConfirmed event
- Return confirmation to customer
5. **Async Processing**
- Fulfillment creates shipment
- Notification sends confirmation email
- Analytics records order metrics
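
The step timings in the sequence can be summed to check them against the 500 ms p99 order placement target. This is a rough worked calculation, assuming the four synchronous steps run strictly one after another and that the quoted figures are typical values rather than measured percentiles. The names `SYNC_STEPS` and `latency_range` are illustrative.

```python
# Per-step latency estimates (min_ms, max_ms) taken from the sequence above.
# Step 5 (async processing) is excluded because it runs after the response.
SYNC_STEPS = {
    "order_created": (10, 10),
    "payment_processing": (200, 500),
    "inventory_allocation": (50, 50),
    "order_confirmed": (5, 5),
}

P99_TARGET_MS = 500  # order placement target from the non-functional requirements


def latency_range(steps):
    """Sum the best-case and worst-case latency across sequential steps."""
    best = sum(low for low, _ in steps.values())
    worst = sum(high for _, high in steps.values())
    return best, worst


best_ms, worst_ms = latency_range(SYNC_STEPS)
print(f"synchronous path: {best_ms}-{worst_ms} ms (target {P99_TARGET_MS} ms)")
print(f"headroom at worst case: {P99_TARGET_MS - worst_ms} ms")
```

With these figures the synchronous path takes 265 to 565 ms, so a payment call at the top of its range overshoots the target by 65 ms. Payment processing is therefore the step whose timeout and tail latency matter most.
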
### Data Model
**Primary Entities:**

    ┌──────────────┐       ┌──────────────┐       ┌──────────────┐
    │    Order     │       │  OrderItem   │       │   Payment    │
    ├──────────────┤       ├──────────────┤       ├──────────────┤
    │ id           │──────<│ order_id     │       │ id           │
    │ customer_id  │       │ product_id   │       │ order_id     │>─┐
    │ status       │       │ quantity     │       │ amount       │  │
    │ total        │       │ price        │       │ status       │  │
    │ created_at   │       │ status       │       │ provider_id  │  │
    └──────────────┘       └──────────────┘       └──────────────┘  │
           │                                                        │
           └────────────────────────────────────────────────────────┘

**Key Tables:**
| Entity | Description | Key Fields |
|--------|-------------|------------|
| orders | Core order record | id, customer_id, status, total, shipping_address |
| order_items | Line items | order_id, product_id, quantity, unit_price |
| payments | Payment records | order_id, amount, provider, status, provider_ref |
| order_events | Audit trail | order_id, event_type, data, timestamp |
| inventory_holds | Temp allocations | order_id, sku, quantity, expires_at |
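
As a rough illustration of how the `orders` and `order_items` tables relate, the sketch below models them as dataclasses. The field names follow the Key Tables rows. Deriving `total` from the line items is an assumption made for the example (the table lists `total` as a stored field), and the type choices (`str` IDs, `Decimal` money) are also assumed.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class OrderItem:
    order_id: str
    product_id: str
    quantity: int
    unit_price: Decimal


@dataclass
class Order:
    id: str
    customer_id: str
    status: str
    shipping_address: str
    items: list[OrderItem] = field(default_factory=list)

    @property
    def total(self) -> Decimal:
        """Derive the order total from its line items (assumed rule)."""
        return sum((i.quantity * i.unit_price for i in self.items), Decimal("0"))


order = Order(id="ord_1", customer_id="cus_1", status="CREATED",
              shipping_address="1 Example St")
order.items.append(OrderItem("ord_1", "sku_a", 2, Decimal("19.99")))
order.items.append(OrderItem("ord_1", "sku_b", 1, Decimal("5.00")))
print(order.total)  # 44.98
```

`Decimal` is used instead of `float` so that currency arithmetic stays exact.
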
### APIs and Interfaces
**Public REST API:**
| Endpoint | Method | Purpose |
|----------|--------|---------|
| /v1/orders | POST | Create new order |
| /v1/orders/{id} | GET | Get order details |
| /v1/orders/{id} | PATCH | Update order |
| /v1/orders/{id}/cancel | POST | Cancel order |
| /v1/orders/{id}/items | GET | List order items |
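
The route table above can be expressed as data and matched against incoming requests. The sketch below is only an illustration of how the `{id}` path templates resolve. In the described system this routing is handled by the API gateway and the service framework, and the `resolve` helper is hypothetical.

```python
import re

# Route table mirroring the public REST API; {id} matches one path segment.
ROUTES = [
    ("POST", "/v1/orders", "Create new order"),
    ("GET", "/v1/orders/{id}", "Get order details"),
    ("PATCH", "/v1/orders/{id}", "Update order"),
    ("POST", "/v1/orders/{id}/cancel", "Cancel order"),
    ("GET", "/v1/orders/{id}/items", "List order items"),
]


def _compile(template: str) -> re.Pattern:
    """Turn '/v1/orders/{id}' into an anchored regex with a named group."""
    pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", template)
    return re.compile(f"^{pattern}$")


COMPILED = [(method, _compile(path), purpose) for method, path, purpose in ROUTES]


def resolve(method: str, path: str):
    """Return (purpose, path_params) for a request, or None if no route matches."""
    for route_method, regex, purpose in COMPILED:
        match = regex.match(path)
        if route_method == method and match:
            return purpose, match.groupdict()
    return None


print(resolve("POST", "/v1/orders/ord_42/cancel"))
# ('Cancel order', {'id': 'ord_42'})
```
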
**Internal Events (Kafka):**
| Topic | Event | Publisher | Consumers |
|-------|-------|-----------|-----------|
| orders | OrderCreated | Order Svc | Analytics, Notification |
| orders | OrderConfirmed | Order Svc | Fulfillment, Notification |
| orders | OrderShipped | Fulfillment | Notification, Analytics |
| payments | PaymentCompleted | Payment Svc | Order Svc |
| inventory | InventoryAllocated | Inventory Svc | Order Svc |
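
The events table can be turned into a small lookup that builds the topic, key, and payload for each event. This is a sketch under stated assumptions: the envelope fields reuse the `order_events` columns (`order_id`, `event_type`, `data`, `timestamp`), the `publisher` field and the JSON encoding are illustrative choices, and no Kafka client is involved. The function only prepares the bytes a producer would send.

```python
import json
from datetime import datetime, timezone

# (topic, publisher, consumers) per event type, copied from the Kafka events table.
EVENT_ROUTES = {
    "OrderCreated": ("orders", "Order Svc", ["Analytics", "Notification"]),
    "OrderConfirmed": ("orders", "Order Svc", ["Fulfillment", "Notification"]),
    "OrderShipped": ("orders", "Fulfillment", ["Notification", "Analytics"]),
    "PaymentCompleted": ("payments", "Payment Svc", ["Order Svc"]),
    "InventoryAllocated": ("inventory", "Inventory Svc", ["Order Svc"]),
}


def build_event(event_type: str, order_id: str, data: dict) -> tuple[str, bytes, bytes]:
    """Return (topic, key, value) ready to hand to a Kafka producer."""
    topic, publisher, _consumers = EVENT_ROUTES[event_type]
    envelope = {
        "event_type": event_type,
        "order_id": order_id,
        "publisher": publisher,  # assumed field, not in the source tables
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "data": data,
    }
    # Keying by order_id keeps one order's events in a single partition,
    # so consumers see them in the order they were produced.
    return topic, order_id.encode(), json.dumps(envelope).encode()


topic, key, value = build_event("OrderConfirmed", "ord_1", {"total": "44.98"})
print(topic, key)  # orders b'ord_1'
```
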
## Infrastructure
### Deployment Architecture

    ┌────────────────────────────────────────────────────────────────┐
    │                         AWS US-EAST-1                          │
    │  ┌──────────────────────────────────────────────────────────┐  │
    │  │                           VPC                            │  │
    │  │  ┌─────────────────┐    ┌─────────────────┐              │  │
    │  │  │      AZ-1a      │    │      AZ-1b      │              │  │
    │  │  │  ┌───────────┐  │    │  ┌───────────┐  │              │  │
    │  │  │  │ EKS Node  │  │    │  │ EKS Node  │  │              │  │
    │  │  │  │   Group   │  │    │  │   Group   │  │              │  │
    │  │  │  └───────────┘  │    │  └───────────┘  │              │  │
    │  │  │  ┌───────────┐  │    │  ┌───────────┐  │              │  │
    │  │  │  │    RDS    │  │    │  │    RDS    │  │              │  │
    │  │  │  │  Primary  │  │    │  │  Replica  │  │              │  │
    │  │  │  └───────────┘  │    │  └───────────┘  │              │  │
    │  │  └─────────────────┘    └─────────────────┘              │  │
    │  │                                                          │  │
    │  │  ┌─────────────────────────────────────────────────────┐ │  │
    │  │  │                 MSK (Kafka) Cluster                 │ │  │
    │  │  └─────────────────────────────────────────────────────┘ │  │
    │  │  ┌─────────────────────────────────────────────────────┐ │  │
    │  │  │             ElastiCache (Redis) Cluster             │ │  │
    │  │  └─────────────────────────────────────────────────────┘ │  │
    │  └──────────────────────────────────────────────────────────┘  │
    └────────────────────────────────────────────────────────────────┘

### Technology Stack
| Layer | Technology | Justification |
|-------|------------|---------------|
| Container Orchestration | EKS (Kubernetes) | Standard, team expertise |
| API Gateway | Kong | Rate limiting, auth, plugins |
| Backend Services | Node.js, Java, Go, Python | Team expertise per domain |
| Primary Database | PostgreSQL (RDS) | ACID, complex queries |
| Cache | Redis (ElastiCache) | Performance, rate limiting |
| Message Queue | Kafka (MSK) | Durability, ordering, scale |
| Search | Elasticsearch | Order search functionality |
| Monitoring | Datadog | Unified observability |
### Scaling Strategy
**Horizontal Scaling:**
- Order Service: Auto-scale 2-20 pods based on CPU/request count
- Payment Service: Auto-scale 2-10 pods
- Inventory Service: Auto-scale 2-15 pods
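
To make the pod ranges concrete, the sketch below applies the core replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: scale the current replica count by the ratio of the observed metric to its target, round up, then clamp to the configured bounds. It deliberately ignores the autoscaler's tolerance band and stabilization window. The example calls use the Order Service bounds (2-20 pods) and the 100 requests-per-second-per-pod target from this document's auto-scaling rules, while the traffic figures are made up.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int, max_replicas: int) -> int:
    """HPA core rule: ceil(current * currentMetric / targetMetric), then clamp."""
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


# 4 pods each seeing 250 req/s against a 100 req/s target: scale out to 10.
print(desired_replicas(4, 250, 100, 2, 20))   # 10
# 10 pods at 300 req/s would need 30, which is clamped to the maximum.
print(desired_replicas(10, 300, 100, 2, 20))  # 20
# 4 pods at 20 req/s would need 1, which is raised to the minimum.
print(desired_replicas(4, 20, 100, 2, 20))    # 2
```
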
**Database Scaling:**
- Read replicas for query distribution
- Connection pooling via PgBouncer
- Vertical scaling for write capacity
**Auto-scaling Rules:**
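```yaml
# Order Service HPA (autoscaling/v2).
# The resource and Deployment names are illustrative assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # percent of requested CPU
    - type: Pods
      pods:
        metric:
          name: requests_per_second  # custom metric; needs a metrics adapter
        target:
          type: AverageValue
          averageValue: 100
```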
## Reliability
### Availability Targets
| Component | Target | Current | SLI |
|---|---|---|---|
| Order API | 99.95% | 99.97% | Successful requests / Total |
| Payment Processing | 99.9% | 99.92% | Successful payments / Attempts |
| Order Confirmation | 99.99% | 99.99% | Orders confirmed < 30s |
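
Each target implies a budget for how much failure is tolerable. The worked calculation below converts the targets into minutes of allowed unavailability per 30-day window. It treats every target as time-based availability, which is a simplification: the SLIs in the table are request and payment ratios, so the real budgets are counted in failed requests rather than minutes.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


def downtime_budget_minutes(target_percent: float,
                            window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Minutes of unavailability a target allows over the window."""
    return round((100 - target_percent) / 100 * window_minutes, 2)


TARGETS = [
    ("Order API", 99.95),
    ("Payment Processing", 99.9),
    ("Order Confirmation", 99.99),
]

for component, target in TARGETS:
    print(f"{component}: {downtime_budget_minutes(target)} min per 30 days")
```

This gives 21.6 minutes for the Order API, 43.2 minutes for Payment Processing, and 4.32 minutes for Order Confirmation.
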
### Failure Modes
| Failure | Impact | Detection | Recovery |
|---|---|---|---|
| Database failure | Orders can't be created | Health checks, Datadog | Failover to replica (<30s) |
| Payment provider down | Payments fail | Circuit breaker trips | Retry queue, manual processing |
| Kafka unavailable | Async processing stops | Consumer lag alerts | Events queued at producer |
| Inventory service down | Can't allocate stock | Health checks | Degrade to async allocation |
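
The circuit breaker named as the detection mechanism for payment provider outages can be sketched as follows. This is a minimal, single-threaded illustration of the pattern, not the Payment Service's implementation. The `failure_threshold` and `reset_timeout_s` defaults are assumed values, and the `clock` parameter exists only to make the behavior testable.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.reset_timeout_s = reset_timeout_s      # assumed value
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("payment provider calls are suspended")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip, or re-trip after a failed trial
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, callers fail fast with `CircuitOpenError` instead of waiting on provider timeouts. That is the point at which a request could be handed to the retry queue listed in the Recovery column.
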
### Disaster Recovery
- **RPO:** 5 minutes (continuous replication)
- **RTO:** 30 minutes (automated failover)
- **Backup Strategy:** Daily snapshots, continuous WAL archiving
- **DR Procedure:** [Link to DR runbook]
## Security
### Authentication & Authorization
- Customer API: JWT tokens via Auth0
- Internal Services: mTLS + service mesh (Istio)
- Admin Access: SSO + RBAC via Okta
### Data Protection
- **Encryption at rest:** AES-256 (RDS, S3)
- **Encryption in transit:** TLS 1.3 everywhere
- **PII handling:** Tokenization for payment data, encryption for addresses
### Security Controls
| Control | Implementation |
|---|---|
| WAF | AWS WAF with OWASP rules |
| DDoS Protection | CloudFront + Shield |
| Secrets Management | AWS Secrets Manager |
| Audit Logging | CloudTrail + application logs |
| Vulnerability Scanning | Snyk in CI/CD |
## Monitoring & Observability
### Key Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| order_creation_latency_p99 | Order placement latency | > 500ms |
| order_error_rate | Failed orders / Total | > 1% |
| payment_success_rate | Successful payments | < 98% |
| inventory_allocation_time | Time to allocate | > 100ms |
| kafka_consumer_lag | Event processing delay | > 1000 messages |
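
The alert thresholds can be read as simple comparison rules. The sketch below encodes the table and evaluates a set of sample readings against it. It is an illustration only: in the described stack the rules would presumably live in the monitoring tool rather than in application code. The units (milliseconds, fractions, message counts) are assumptions inferred from the threshold column.

```python
import operator

# (comparison, threshold) per metric, copied from the Key Metrics table.
# Assumed units: latencies in milliseconds, rates as fractions, lag in messages.
ALERT_RULES = {
    "order_creation_latency_p99": (operator.gt, 500),
    "order_error_rate": (operator.gt, 0.01),
    "payment_success_rate": (operator.lt, 0.98),
    "inventory_allocation_time": (operator.gt, 100),
    "kafka_consumer_lag": (operator.gt, 1000),
}


def firing_alerts(samples: dict) -> list:
    """Return the sorted names of metrics whose value breaches its threshold."""
    firing = []
    for name, value in samples.items():
        compare, threshold = ALERT_RULES[name]
        if compare(value, threshold):
            firing.append(name)
    return sorted(firing)


readings = {
    "order_creation_latency_p99": 620,  # above 500 ms
    "payment_success_rate": 0.991,      # healthy
    "kafka_consumer_lag": 1000,         # at the threshold, not above it
}
print(firing_alerts(readings))  # ['order_creation_latency_p99']
```
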
### Dashboards
## Trade-offs and Decisions
### Decision 1: Event-Driven Architecture
**Context:** Original monolith couldn't scale for 10x growth target.
**Options Considered:**
1. Optimize monolith
2. Microservices with sync communication
3. Event-driven microservices
**Decision:** Event-driven microservices with Kafka
**Trade-offs:**
- Pros: Loose coupling, independent scaling, better resilience
- Cons: Eventual consistency, operational complexity, debugging difficulty
### Decision 2: PostgreSQL over DynamoDB
**Context:** Need transactional consistency for orders.
**Decision:** PostgreSQL with read replicas
**Trade-offs:**
- Pros: ACID transactions, complex queries, team expertise
- Cons: Scaling limits, more operational overhead than DynamoDB
## Known Limitations
| Limitation | Impact | Mitigation | Future Plans |
|---|---|---|---|
| Single region | Higher latency for non-US users | CDN caching for reads | Multi-region in 2026 |
| Sync payment calls | Latency variability | Timeout + retry | Async payment option |
| No order editing | Poor UX | Cancel and reorder | Edit support Q1 2026 |
## Future Considerations
- Multi-region deployment for global latency
- GraphQL API for flexible queries
- Machine learning for fraud detection
- Real-time inventory sync with suppliers
## Appendix
### Glossary
| Term | Definition |
|---|---|
| Order | A customer's request to purchase products |
| Allocation | Reserving inventory for an order |
| Fulfillment | Process of picking, packing, shipping |
### References
- ADR-012: Event-Driven Architecture
- ADR-015: Database Selection
- Payment Service Design Doc
- Inventory Service Design Doc
## Best Practices
### 1. Keep It Current
- Review quarterly
- Update after major changes
- Mark deprecated sections
### 2. Right Audience
- Assume technical readers
- Include context for non-domain experts
- Link to detailed docs for deep dives
### 3. Visualize Complex Concepts
- Architecture diagrams
- Sequence diagrams
- State machines
- Data flow diagrams
### 4. Document Decisions
- Not just "what" but "why"
- Include rejected alternatives
- Reference ADRs
---
*A system design document is a living artifact. It should evolve with the system while maintaining a clear picture of the current architecture.*