System Design Topics
Key system design principles distilled from engineering blogs at top companies, organized by difficulty and topic.
Basic
No prior system design experience needed
Single Primary + Read Replicas
Use a single primary for writes and scale reads horizontally with many replicas across regions.
Caching Reduces Database Load
Put a cache layer in front of your database to serve frequently read data without hitting the DB every time.
Connection Pooling
Use a connection pooler to reuse database connections instead of creating a new one per request.
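As a rough sketch, a pool can be modeled as a bounded queue of open connections that are checked out and returned rather than closed. This toy version uses sqlite3 as a stand-in database; the class and method names are illustrative, not any particular pooler's API:

```python
import queue
import sqlite3

class ConnectionPool:
    """Minimal connection-pool sketch (illustrative, not production code)."""
    def __init__(self, size, db=":memory:"):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            # Open connections once, up front, instead of per request.
            self._pool.put(sqlite3.connect(db, check_same_thread=False))

    def acquire(self):
        return self._pool.get()   # blocks if every connection is in use

    def release(self, conn):
        self._pool.put(conn)      # return the connection instead of closing it

pool = ConnectionPool(size=2)
conn = pool.acquire()
one = conn.execute("SELECT 1").fetchone()[0]
pool.release(conn)
```

Real poolers (PgBouncer, HikariCP, SQLAlchemy's pool) add health checks, timeouts, and connection recycling on top of this basic check-out/check-in loop.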
Rate Limiting Protects Your System
Limit how many requests a client or endpoint can make in a time window to prevent overwhelming the system.
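One common implementation is a token bucket: the bucket refills at a steady rate and each request spends a token, which permits short bursts while capping the sustained rate. The sketch below is illustrative; class and parameter names are assumptions, not a specific library's API:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter sketch: allows bursts up to `capacity`,
    sustains `refill_rate` requests per second after that."""
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate      # tokens added per second
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_rate=1)   # burst of 3, then 1 req/s
results = [bucket.allow() for _ in range(5)]      # 5 back-to-back requests
```

The first three requests drain the burst allowance and the rest are rejected until tokens refill.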
High Availability with Standby Replicas
Keep a synchronized standby ready to take over if the primary server fails, minimizing downtime.
Performance vs Scalability
A performance problem means your system is slow for a single user. A scalability problem means it's fast for one but slow under heavy load.
Latency vs Throughput
Latency is the time to complete one action. Throughput is how many actions complete per unit time. Aim for maximal throughput with acceptable latency.
CAP Theorem
In a distributed system, you can only guarantee two of three: Consistency, Availability, and Partition Tolerance. Since networks fail, you must choose between CP and AP.
CP vs AP Tradeoff
Choose CP when your business requires atomic reads and writes. Choose AP when the system must stay responsive even if data is temporarily stale.
Weak Consistency
After a write, reads may or may not see it. Best-effort delivery is acceptable when losing some data is tolerable.
Eventual Consistency
After a write, reads will eventually see it (typically within milliseconds). Data is replicated asynchronously.
Strong Consistency
After a write, every subsequent read returns the updated value. Data is replicated synchronously.
Availability in Numbers
Availability is measured in 'nines' — 99.9% (three 9s) allows ~8h 46min downtime/year, while 99.99% (four 9s) allows only ~52 minutes.
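The arithmetic behind the nines is a one-liner: allowed downtime is simply the unavailable fraction of a year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

def downtime_minutes_per_year(availability):
    """Allowed downtime per (non-leap) year for a given availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR

three_nines = downtime_minutes_per_year(0.999)    # ~526 min, about 8h 46min
four_nines = downtime_minutes_per_year(0.9999)    # ~53 min
```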
DNS Basics
DNS translates domain names to IP addresses using a hierarchical system of servers. It's the first step in every web request.
Load Balancer Overview
Load balancers distribute incoming requests across multiple servers, preventing overload and eliminating single points of failure.
Horizontal vs Vertical Scaling
Vertical scaling (scale up) means bigger hardware. Horizontal scaling (scale out) means more machines. Horizontal is cheaper and more resilient but adds complexity.
ACID Properties
ACID (Atomicity, Consistency, Isolation, Durability) guarantees that database transactions are reliable even during failures.
Key-Value Stores
Key-value stores offer O(1) reads and writes, backed by memory or SSD. Best for simple data models and rapidly-changing data like caches.
Caching Layers Overview
Caching can happen at every layer: client (browser/OS), CDN, web server (reverse proxy), application (Redis/Memcached), and database.
Cache-Aside (Lazy Loading)
The application checks the cache first. On a miss, it loads from the database, stores the result in cache, then returns it. Only requested data gets cached.
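The cache-aside flow fits in a few lines; here plain dicts stand in for the cache and the database, and the key/value names are illustrative:

```python
db = {"user:1": {"name": "Ada"}}   # stand-in for the database
cache = {}                         # stand-in for Redis/Memcached
misses = 0

def get_user(key):
    """Cache-aside: check the cache, fall back to the DB, populate on miss."""
    global misses
    if key in cache:
        return cache[key]          # cache hit: no DB round trip
    misses += 1
    value = db[key]                # cache miss: load from the database
    cache[key] = value             # store so subsequent reads are hits
    return value

first = get_user("user:1")         # miss: hits the DB and fills the cache
second = get_user("user:1")        # hit: served from the cache
```

A real implementation also sets a TTL on the cached entry and invalidates it when the underlying row changes.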
Sticky Sessions & Centralized Session State
Load-balanced servers break server-local sessions. Fix it with a centralized session store or load-balancer-injected cookies that pin users to a backend.
RAID: Disk Redundancy Levels
RAID combines multiple disks for performance and/or redundancy. RAID 0 stripes for speed, RAID 1 mirrors for safety, RAID 5/6 balance economy and fault tolerance.
Static Content Pre-generation
Accept dynamic input but serve pre-rendered static HTML files. Web servers are extremely fast at serving static content, avoiding per-request computation.
TCP vs UDP
TCP guarantees ordered, reliable delivery via handshakes and retransmission. UDP is connectionless and faster but may lose or reorder packets.
Object Caching vs Query Caching
Caching assembled objects instead of raw query results is easier to invalidate and enables async pre-assembly by worker servers.
Capacity Estimation
Capacity estimation converts product requirements into concrete numbers — DAU, QPS, storage, and bandwidth — so you can size infrastructure before building it.
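A worked back-of-envelope example, where every input number is an assumption chosen for illustration:

```python
# Assumed product requirements (all numbers illustrative):
dau = 10_000_000              # daily active users
requests_per_user = 20        # average requests per user per day

# QPS: spread daily requests over 86,400 seconds, then allow for peaks.
avg_qps = dau * requests_per_user / 86_400    # ~2,315 QPS
peak_qps = avg_qps * 3                        # rule of thumb: peak ~2-3x average

# Storage: assume 10% of requests are ~1 KB writes.
write_ratio = 0.1
bytes_per_write = 1_000
storage_per_day_gb = dau * requests_per_user * write_ratio * bytes_per_write / 1e9
```

With these inputs the system needs to handle roughly 2,300 QPS on average (~7,000 at peak) and accumulates about 20 GB of new data per day.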
Intermediate
For early-career engineers starting to design systems
Cache Stampede Prevention
Use cache locking/leasing so only one request fetches from the DB on a miss — others wait for the repopulated cache.
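A minimal single-process sketch of the idea, with a `threading.Lock` standing in for the lease and a double-check so late arrivals reuse the winner's result:

```python
import threading

cache = {}
lock = threading.Lock()
db_fetches = 0

def expensive_db_load(key):
    """Stand-in for the slow database query we want to run only once."""
    global db_fetches
    db_fetches += 1
    return f"value-for-{key}"

def get_with_lock(key):
    """Stampede prevention: only the lock holder repopulates the cache."""
    if key in cache:
        return cache[key]
    with lock:                   # one request at a time past this point
        if key in cache:         # double-check: another thread may have filled it
            return cache[key]
        cache[key] = expensive_db_load(key)
        return cache[key]

# Ten concurrent requests for the same missing key trigger one DB fetch.
threads = [threading.Thread(target=get_with_lock, args=("hot",)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Distributed versions use the same shape with a lease in the cache itself (for example Redis `SET key NX` with an expiry) instead of an in-process lock.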
Workload Isolation (Noisy Neighbor)
Route low-priority and high-priority workloads to separate instances so one can't degrade the other.
Avoid Complex Joins in OLTP
Multi-table joins are an OLTP anti-pattern. Break them apart and move join logic to the application layer.
Offload Writes to Sharded Systems
Migrate write-heavy, shardable workloads to horizontally scalable systems to protect the primary.
Active-Passive Failover
Heartbeats between an active and passive server detect failures. The passive takes over the active's IP when a heartbeat is missed.
Active-Active Failover
Both servers actively handle traffic, spreading load between them. If one fails, the other absorbs all traffic.
CDN — Push vs Pull
Push CDNs receive content when you upload it (good for low-traffic, rarely changing content). Pull CDNs fetch content on first request (good for high-traffic sites).
Layer 4 vs Layer 7 Load Balancing
Layer 4 routes based on IP/port (fast, simple). Layer 7 routes based on request content like URL, headers, and cookies (flexible, smarter).
Reverse Proxy
A reverse proxy sits in front of backend servers, providing a unified interface while adding security, caching, compression, and SSL termination.
Master-Slave Replication
The master handles all writes and replicates them to one or more slaves that serve read-only traffic.
Federation (Functional Partitioning)
Split databases by function (e.g., users, products, forums) to reduce per-database traffic and improve cache locality.
Denormalization
Store redundant copies of data to avoid expensive joins, trading write complexity for read performance.
SQL Tuning Essentials
Benchmark, profile, then optimize: tighten schemas, add proper indices, avoid expensive joins, and partition hot tables.
Document Stores
Document stores center around JSON/XML documents, providing flexible schemas and APIs to query document internals. Best for semi-structured, occasionally changing data.
SQL vs NoSQL Decision Guide
Choose SQL for structured data, complex joins, and transactions. Choose NoSQL for flexible schemas, massive scale, and high-throughput workloads.
Write-Through Cache
The application writes to the cache, and the cache synchronously writes to the database. Data is never stale, but writes are slower.
Write-Behind (Write-Back) Cache
The application writes to the cache, which asynchronously flushes to the database later. Fast writes, but risk of data loss if the cache crashes.
DNS Round Robin Drawbacks
DNS round robin is the simplest load-distribution scheme — the DNS server cycles through IPs — but caching and lack of health awareness make it unreliable.
Database Partitioning by User Attribute
Split users across servers by a simple attribute (name range, school, geography) for a quick horizontal scaling win before investing in full sharding.
Memcached: Shared In-Memory Cache Tier
Memcached is a dedicated in-memory key-value daemon that multiple web servers share. LRU eviction automatically discards cold entries when memory is full.
Network Security Tiers (Defense in Depth)
Restrict traffic between architecture tiers with port-level firewalls: only HTTP in from the internet, only MySQL between web and DB servers.
MySQL Query Cache
MySQL's built-in query cache returned results for repeated identical queries without re-executing them. Note it was deprecated in MySQL 5.7 and removed in 8.0, so on modern versions use an external cache such as Redis or Memcached instead.
Web Layer vs Application Layer
Separating the web layer from the application (platform) layer lets you scale and configure each independently.
Microservices
A suite of independently deployable, small, modular services. Each runs a unique process and communicates via lightweight protocols.
Service Discovery
Systems like Consul, etcd, and ZooKeeper help services find each other by tracking registered names, addresses, and ports.
Message Queues
Message queues decouple producers from consumers: a publisher posts a job, a worker picks it up and processes it in the background.
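A toy producer/consumer pair using Python's standard-library `queue` as a stand-in for a real broker such as RabbitMQ or SQS; the doubling "work" is purely illustrative:

```python
import queue
import threading

jobs = queue.Queue()
results = []

def worker():
    """Consumer: pull jobs off the queue and process them in the background."""
    while True:
        job = jobs.get()
        if job is None:            # sentinel value shuts the worker down
            break
        results.append(job * 2)    # stand-in for real processing
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()

for n in [1, 2, 3]:
    jobs.put(n)                    # producer posts jobs and moves on immediately

jobs.join()                        # wait until every posted job is processed
jobs.put(None)                     # then stop the worker
t.join()
```

The producer never waits on the consumer; the queue absorbs bursts and the worker drains it at its own pace, which is exactly the decoupling real brokers provide across processes and machines.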
Task Queues
Task queues receive tasks with their data, execute them, and return results. They support scheduling and are ideal for compute-intensive background work.
RPC vs REST
RPC exposes behaviors (actions). REST exposes resources (data). RPC is common for internal services; REST is preferred for public APIs.
REST API Design
RESTful APIs identify resources by URI, modify them with HTTP verbs, signal errors with status codes, and link related resources via hypermedia (HATEOAS).
Types of Load Balancers
Load balancers come in three configuration types (software, hardware, cloud) and three functional types (L4, L7, GSLB), each with distinct cost, flexibility, and performance tradeoffs.
Types of Caching
Caches come in four architectural types — application server, distributed, global, and CDN — each trading off simplicity, scalability, and latency differently.
Types of Databases
Each database type optimizes for a different access pattern and consistency model — RDBMS for transactions and joins, NoSQL for flexible schemas and horizontal scale, NewSQL for global ACID, and time-series for sequential telemetry.
Message Queues Deep Dive
Message queues decouple producers from consumers through an intermediate buffer, enabling asynchronous communication, independent scaling, and fault tolerance across distributed services.
Rate Limiting
Rate limiting controls how many requests a client can make in a given time window, protecting systems from abuse, DoS attacks, and resource exhaustion while ensuring fair access.
Database Indexing
Indexes are auxiliary data structures that speed up reads at the cost of slower writes and extra storage. Choose index type based on query patterns: B-tree for range scans, hash for equality lookups, inverted for full-text search.
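A toy contrast between the two access patterns: a sorted list stands in for a B-tree's ordered keys (efficient range scans), while a dict stands in for a hash index (equality lookups only, no ordering). The data is illustrative:

```python
import bisect

# 15 rows keyed 0, 7, 14, ... 98.
rows = [(i, f"user{i}") for i in range(0, 100, 7)]

btree_keys = [k for k, _ in rows]        # kept sorted, like B-tree leaf keys
hash_index = {k: v for k, v in rows}     # O(1) equality lookup, unordered

def range_scan(lo, hi):
    """B-tree strength: binary-search to [lo, hi) without touching other keys."""
    i = bisect.bisect_left(btree_keys, lo)
    j = bisect.bisect_left(btree_keys, hi)
    return btree_keys[i:j]

in_range = range_scan(10, 40)    # keys between 10 and 40
exact = hash_index.get(21)       # single-key lookup
```

The hash index cannot answer the range query without scanning everything, which is why query patterns, not data volume, should drive index choice.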
Real-Time Communication
Polling is simplest but wasteful, long polling reduces wasted requests, WebSockets provide full-duplex real-time channels, and SSE offers lightweight one-way server push — choose based on directionality, latency, and infrastructure complexity.
Storage Types
Object storage is best for large unstructured blobs like images and videos, block storage provides raw disk volumes for databases and VMs, and file storage offers shared hierarchical access — store metadata in a database and media in object storage.
Reliability and Resilience Patterns
Build resilience by eliminating single points of failure through redundancy, protecting cascading failures with circuit breakers, making retries safe with exponential backoff and idempotency, and designing for graceful degradation under overload.
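As one concrete piece of this, retry delays with exponential backoff and "full jitter" can be computed as below; the function name and default values are illustrative:

```python
import random

def backoff_delays(base=0.1, cap=10.0, attempts=5, seed=0):
    """Full-jitter backoff: delay_n is uniform in [0, min(cap, base * 2**n)].

    Jitter spreads retries out so failed clients don't all hammer the
    service again at the same instant.
    """
    rng = random.Random(seed)   # seeded here only so the sketch is reproducible
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

delays = backoff_delays()
```

Pairing this with idempotent request handling (e.g. client-supplied request IDs) makes the retries themselves safe to execute more than once.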
Observability
Observability rests on three pillars — logs capture discrete events, metrics track numeric aggregates over time, and traces follow a single request across services — together they let you detect, diagnose, and resolve production issues.
Advanced
For mid-to-senior engineers operating at scale
MVCC Tradeoffs in PostgreSQL
PostgreSQL's MVCC copies the entire row on every update, causing write amplification, dead tuple bloat, and vacuum pressure.
Multi-Layer Rate Limiting
Apply rate limiting at every layer — application, connection pooler, proxy, and query — for defense in depth.
Safe Schema Migrations at Scale
Only allow lightweight schema changes in production. Anything that rewrites the table is too dangerous at scale.
Cascading Replication for Replica Scaling
When the primary can't stream WAL to all replicas, use intermediate replicas to relay WAL downstream.
Cascading Failure Prevention
The classic failure loop is: load spike -> latency rise -> timeouts -> retries -> amplified load. Break it at every link.
Master-Master Replication
Both masters serve reads and writes, coordinating with each other. If either goes down, the other continues operating.
Sharding
Distribute data across different databases so each manages only a subset. Reduces traffic, replication, and index size per shard.
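A minimal hash-based shard router, with made-up shard names. Note that plain modulo hashing reshuffles most keys when the shard count changes, which is why production systems often use consistent hashing or directory-based lookup instead:

```python
import hashlib

SHARDS = ["db0", "db1", "db2", "db3"]   # illustrative shard names

def shard_for(user_id):
    """Route a key to a shard via a stable hash (md5 here for determinism)."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key always lands on the same shard, and keys spread across all shards.
assignments = {uid: shard_for(uid) for uid in range(1000)}
per_shard = {s: sum(1 for v in assignments.values() if v == s) for s in SHARDS}
```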
Wide Column Stores
Wide column stores (Bigtable, HBase, Cassandra) use column families with row keys. Built for very large datasets with high availability and scalability.
Graph Databases
Graph databases represent data as nodes and relationships (edges). Optimized for complex many-to-many relationships like social networks.
Refresh-Ahead Cache
The cache automatically refreshes recently accessed entries before their TTL expires, reducing read latency if predictions are accurate.
Multi-Data-Center & Availability Zones
A single data center is a single point of failure. Distribute across availability zones with independent power and networking, using global DNS to route users.
Back Pressure
When queues grow beyond a threshold, reject new work with HTTP 503 and let clients retry with exponential backoff. This preserves throughput for jobs already in the queue.
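The threshold-and-reject behavior can be sketched with a bounded standard-library queue; the job names and the queue size are illustrative:

```python
import queue

work = queue.Queue(maxsize=3)   # the bound is the back-pressure threshold

def submit(job):
    """Admit work while there is room; shed load once the queue is full."""
    try:
        work.put_nowait(job)
        return 202              # Accepted: job enqueued for processing
    except queue.Full:
        return 503              # Service Unavailable: client should back off and retry
    
codes = [submit(f"job-{i}") for i in range(5)]
```

Rejecting at the boundary keeps the queue short, so jobs that were admitted still finish quickly instead of timing out behind an unbounded backlog.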