If you’ve ever walked into a system design interview and immediately started drawing microservices, load balancers, and Kafka clusters—only to have the interviewer stop you 10 minutes in and ask, “Wait, what problem are we actually solving?”—you’re not alone.
This is the #1 mistake candidates make.
The truth is, system design interviews aren’t about memorizing Netflix’s architecture or knowing every AWS service. They’re about demonstrating structured thinking, clear communication, and justified decision-making.
In this guide, I’ll walk you through the exact 7-step framework that works for any system design question—whether you’re designing a URL shortener, Twitter, or a video streaming platform.
Why Most Candidates Fail (And How to Avoid It)
Before we dive into the framework, let’s address the three critical mistakes that cause instant rejection:
Mistake #1: Jumping Straight to Architecture
You hear “Design Instagram” and immediately start drawing boxes labeled “Microservices,” “Kafka,” “Redis,” and “Kubernetes.”
The problem? You haven’t clarified what you’re actually building. Are we focusing on the photo upload flow? The feed generation? Direct messaging? Each would lead to a different architecture.
The fix: Always start with requirements clarification (Step 1 below).
Mistake #2: Over-Engineering Without Justification
You design for 10 billion users when the interviewer mentioned 10 million. You add complex distributed tracing before discussing basic database choices.
The problem? You’re solving problems that don’t exist yet. Interviewers want to see you match complexity to actual requirements.
The fix: Let the scale and requirements drive your design decisions, not the other way around.
Mistake #3: Not Articulating Trade-offs
You pick Cassandra over PostgreSQL, or choose eventual consistency over strong consistency, but you never explain why.
The problem? Every architectural decision involves trade-offs. Not acknowledging them shows a lack of depth.
The fix: For every major decision, explicitly state what you’re gaining and what you’re giving up.
The 7-Step Framework
Here’s the structured approach that will work for any system design interview:
1. Clarify Requirements
2. Back-of-the-Envelope Estimation
3. Define the API
4. High-Level Design
5. Deep Dive into Components
6. Identify Bottlenecks & Scale
7. Wrap Up & Discuss Trade-offs
Let’s break down each step with a concrete example: Designing Instagram.
Step 1: Clarify Requirements (3-5 minutes)
The interviewer gives you something vague like “Design Instagram.” Your job is to turn this into a clear, scoped problem.
Don’t assume. Ask questions. Write down the answers.
I break requirements into two categories:
Functional Requirements
What features are we building?
For Instagram, you might clarify:
- Upload photos
- Follow/unfollow users
- View personalized feed
- Like and comment on photos
Non-Functional Requirements
What are the scale, performance, and reliability expectations?
Examples:
- Scale: 500 million total users
- Latency: Feed loads in under 1 second
- Availability: 99.9% uptime
- Storage: Support for high-resolution photos
Why this matters: If you skip this step, you might design for the wrong problem. The interviewer is testing whether you can handle ambiguity and ask intelligent questions.
Pro tip: Write these down in two columns (Functional | Non-Functional) on your whiteboard. It keeps you organized and shows structured thinking.
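To make a target like 99.9% uptime concrete, it helps to convert it into a downtime budget. Here is a minimal sketch of that arithmetic in Python; the function name is my own, and it assumes a 365-day year.

```python
# Convert an availability target into an allowed-downtime budget.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours per year the system may be unavailable at the given target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(target)
    print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

At 99.9%, the budget works out to roughly 8.76 hours per year, which is a useful number to say out loud when the interviewer asks how reliable the system needs to be.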
Step 2: Back-of-the-Envelope Estimation (3-5 minutes)
This is where many candidates panic, but it’s easier than you think. You’re not doing calculus—you’re just estimating orders of magnitude.
Example Calculation for Instagram:
Users & Activity:
- 500 million total users
- 10% are daily active users (DAU) = 50 million DAU
- Each daily active user uploads 1 photo per day on average
Storage:
- 50 million photos per day
- Average photo size: 2 MB
- Daily storage: 50M × 2 MB = 100 TB/day
- Annual storage: 100 TB × 365 = 36.5 PB/year (~36 PB)
Traffic:
- Read-heavy system (users view far more than they upload)
- Assume 100:1 read-to-write ratio
- 50M writes/day ÷ 86,400 seconds/day = ~580 writes/second
- 5B reads/day (50M × 100) ÷ 86,400 seconds/day = ~58,000 reads/second
What This Tells You:
- ✅ You need distributed storage (can’t use a single database)
- ✅ You need caching (58K reads/sec is too much for DB alone)
- ✅ You need CDN for photo delivery (reduce latency globally)
Why this matters: These calculations guide your architectural decisions. You’re not pulling Redis or Cassandra out of thin air—you’re showing the math that justifies them.
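If you want to sanity-check the numbers above, the same estimate fits in a few lines of Python. This is just the arithmetic from this step, assuming decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes); the variable names are mine.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Inputs taken from the estimation in this step.
total_users = 500_000_000
dau = total_users * 10 // 100          # 10% are daily active users
uploads_per_dau_per_day = 1
photo_size_bytes = 2 * 10**6           # 2 MB, decimal units (assumption)
read_write_ratio = 100                 # 100 reads per write

writes_per_day = dau * uploads_per_dau_per_day
reads_per_day = writes_per_day * read_write_ratio

daily_storage_tb = writes_per_day * photo_size_bytes / 10**12
yearly_storage_pb = daily_storage_tb * 365 / 1000

writes_per_sec = writes_per_day / SECONDS_PER_DAY
reads_per_sec = reads_per_day / SECONDS_PER_DAY

print(f"DAU:            {dau:,}")
print(f"Daily storage:  {daily_storage_tb:,.0f} TB")
print(f"Yearly storage: {yearly_storage_pb:,.1f} PB")
print(f"Writes/second:  {writes_per_sec:,.0f}")
print(f"Reads/second:   {reads_per_sec:,.0f}")
```

In the interview you would round these aggressively. The point is the order of magnitude, not the third significant digit.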
Step 3: Define the API (2-3 minutes)
This step is optional, but I always do it because it forces you to think about data flow and inputs/outputs.
Pick 2-3 core APIs. Don’t try to design every endpoint.
Example for Instagram:
POST /api/v1/upload
Request:
- userId: string
- photo: binary
- caption: string
Response:
- photoId: string
- photoUrl: string (where the stored photo can be fetched)
GET /api/v1/feed
Request:
- userId: string
- pageToken: string (for pagination)
Response:
- photos: [Photo]
- nextPageToken: string
POST /api/v1/follow
Request:
- userId: string
- targetUserId: string
Response:
- success: boolean
Why this matters: It shows you understand the system from the client’s perspective. It also makes the next step (high-level design) easier because you know what data needs to flow where.
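To show how the feed endpoint's pageToken and nextPageToken fit together, here is a small in-memory sketch in Python. Everything in it is hypothetical: the FEEDS store, the page size, and especially the token format, which is just an offset encoded as a string. Real services usually hand out opaque cursors instead.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Photo:
    photo_id: str
    caption: str


@dataclass
class FeedResponse:
    photos: List[Photo]
    next_page_token: Optional[str]  # None when there are no more pages


# Hypothetical in-memory feed store: user_id -> photos, newest first.
FEEDS = {
    "u1": [Photo(f"p{i}", f"caption {i}") for i in range(5)],
}


def get_feed(user_id: str, page_token: Optional[str] = None,
             page_size: int = 2) -> FeedResponse:
    """Sketch of GET /api/v1/feed. The token is an offset stored as a string."""
    start = int(page_token) if page_token else 0
    items = FEEDS.get(user_id, [])
    page = items[start:start + page_size]
    next_start = start + page_size
    next_token = str(next_start) if next_start < len(items) else None
    return FeedResponse(photos=page, next_page_token=next_token)


# Usage: a client walks every page until the server stops returning a token.
token = None
seen = []
while True:
    resp = get_feed("u1", token)
    seen.extend(p.photo_id for p in resp.photos)
    token = resp.next_page_token
    if token is None:
        break
print(seen)
```

Sketching the response shape like this also makes it obvious what the feed store has to return quickly: an ordered slice of photo IDs per user.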
Step 4: High-Level Design (5-7 minutes)
Now comes the fun part: drawing your architecture.
Start Simple, Then Layer Complexity
Step 4a: The Basics
Client → API Server → Database
Step 4b: Add Components Based on Requirements
Based on our estimation:
- Need to handle 58K reads/sec → Add Load Balancer + Cache
- Need to store 36PB/year of photos → Add Object Storage (S3)
- Need low-latency global access → Add CDN
- Need to handle user data + relationships → Add SQL Database
- Need to handle feed data at scale → Add NoSQL Database
Final High-Level Architecture:
                 ┌─────────┐
                 │ Client  │
                 └────┬────┘
                      │
                 ┌────▼─────────┐
                 │Load Balancer │
                 └────┬─────────┘
                      │
           ┌──────────┴──────────┐
           │                     │
      ┌────▼────┐            ┌───▼────┐
      │   API   │            │  API   │
      │ Server 1│            │Server 2│
      └────┬────┘            └───┬────┘
           │                     │
   ┌───────┼─────────────────────┤
   │       │                     │
┌──▼────┐ ┌▼───────┐    ┌────────▼─────┐
│ Cache │ │SQL DB  │    │Object Storage│
│(Redis)│ │(Users, │    │ (S3/Photos)  │
└───────┘ │Follow) │    └──────┬───────┘
          └────────┘           │
                         ┌─────▼─────┐
  ┌──────────┐           │    CDN    │
  │NoSQL DB  │           └───────────┘
  │ (Feeds)  │
  └──────────┘
The Key: Explain why you’re adding each component.
- Load Balancer: Distribute traffic across multiple API servers
- Cache (Redis): Reduce database load for frequently accessed data (hot feeds)
- SQL Database: Store structured data (users, follows, likes) with ACID guarantees
- NoSQL Database: Store feed data that needs horizontal scaling
- Object Storage: Cost-effective storage for binary data (photos)
- CDN: Serve static content from edge locations (reduce latency)
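One way to check that the components fit together is to trace a single upload and a single feed read through them. The sketch below uses plain Python dicts as stand-ins for each store, so it is a toy and not a real implementation. The CDN domain is a placeholder, and pushing new photo IDs into followers' feeds at upload time (fan-out on write) is my own assumption about how the feed store gets populated, one of several possible strategies.

```python
import uuid

# In-memory stand-ins for the components above (all hypothetical).
object_storage = {}   # photo_id -> raw bytes          (stands in for S3)
photo_metadata = {}   # photo_id -> metadata dict      (stands in for the SQL DB)
feed_store = {}       # user_id  -> list of photo_ids  (stands in for the NoSQL DB)
followers = {"alice": ["bob", "carol"]}  # author -> followers (sample data)

CDN_BASE_URL = "https://cdn.example.com"  # placeholder domain


def upload_photo(user_id: str, photo: bytes, caption: str) -> str:
    """Write path: store the bytes, store the metadata, update follower feeds."""
    photo_id = uuid.uuid4().hex
    object_storage[photo_id] = photo
    photo_metadata[photo_id] = {"user_id": user_id, "caption": caption}
    # Assumption: fan-out on write, newest photo first in each follower's feed.
    for follower in followers.get(user_id, []):
        feed_store.setdefault(follower, []).insert(0, photo_id)
    return photo_id


def get_feed(user_id: str) -> list:
    """Read path: look up photo IDs, attach metadata and a CDN URL for each."""
    return [
        {
            "photo_id": pid,
            "caption": photo_metadata[pid]["caption"],
            "url": f"{CDN_BASE_URL}/{pid}.jpg",
        }
        for pid in feed_store.get(user_id, [])
    ]


new_id = upload_photo("alice", b"fake-image-bytes", "Sunset")
print(get_feed("bob"))
```

Notice that the API server never returns photo bytes on the read path. It returns URLs, and the client fetches the image from the CDN.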
Step 5: Deep Dive into Components (5-7 minutes)
The interviewer will pick 1-2 areas and say, “Let’s go deeper.”
This is where trade-offs become critical.
Example Deep Dive: Database Choices
Why SQL (PostgreSQL) for User Data?
- ✅ ACID transactions (critical for follows and likes, where counts must stay in sync)
- ✅ Strong consistency
- ✅ Relationships (users, follows, likes)
- ✅ Well-understood query patterns
- ❌ Harder to scale horizontally
Why NoSQL (Cassandra) for Feed Data?
- ✅ Horizontal scalability (partition by userId)
- ✅ High write throughput
- ✅ Eventual consistency acceptable for feeds
- ❌ No complex joins
- ❌ No ACID guarantees
You might use both. User profiles and relationships in SQL. Activity feeds in NoSQL.
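Here is a minimal sketch of what the ACID point buys you on the SQL side: recording a follow and bumping the follower count either both happen or neither does. I'm using Python's built-in sqlite3 as a stand-in for PostgreSQL, and the table and column names are illustrative, not a recommended schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for PostgreSQL here
conn.executescript("""
    CREATE TABLE users (
        user_id        TEXT PRIMARY KEY,
        follower_count INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE follows (
        follower_id TEXT NOT NULL REFERENCES users(user_id),
        followee_id TEXT NOT NULL REFERENCES users(user_id),
        PRIMARY KEY (follower_id, followee_id)
    );
    INSERT INTO users (user_id) VALUES ('alice'), ('bob');
""")


def follow(follower_id: str, followee_id: str) -> bool:
    """Insert the follow edge and bump the counter in one transaction.

    Returns False (and changes nothing) if the follow already exists.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO follows (follower_id, followee_id) VALUES (?, ?)",
                (follower_id, followee_id),
            )
            conn.execute(
                "UPDATE users SET follower_count = follower_count + 1 "
                "WHERE user_id = ?",
                (followee_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(follow("bob", "alice"))  # first follow
print(follow("bob", "alice"))  # duplicate follow, rejected by the primary key
print(conn.execute(
    "SELECT follower_count FROM users WHERE user_id = 'alice'"
).fetchone()[0])
```

The duplicate follow fails on the primary key, the transaction rolls back, and the counter stays correct. Getting the same guarantee out of an eventually consistent store takes extra application logic, which is the trade-off in the list above.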
Example Deep Dive: Caching Strategy
What to cache?
- User profiles (rarely change)
- Hot feeds (top 100 posts for active users)
- Follower counts
Cache write and expiry strategies:
- Write-through: Write to cache and database simultaneously (slower writes, consistent reads)
- Write-back: Write to cache first, async to database (faster writes, risk of data loss)
- TTL-based: Set time-to-live for cached data (good for feeds that can tolerate staleness)
For Instagram feeds: Use TTL-based caching with 30-second expiry. Users won’t notice if their feed is 30 seconds old, but you’ll massively reduce database load.
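The TTL approach is simple enough to sketch in a few lines. This is a toy in-process cache, not Redis; the class name, the loader callback, and the injectable clock (there so expiry can be demonstrated without sleeping) are all my own choices for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Read-through cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]            # fresh hit: no database call
        value = loader()               # miss or expired: go to the database
        self._entries[key] = (now + self._ttl, value)
        return value


# Usage with a fake clock and a fake database loader.
db_reads = {"count": 0}


def load_feed_from_db():
    db_reads["count"] += 1
    return ["photo-1", "photo-2"]


fake_now = [0.0]
cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])

cache.get_or_load("feed:u1", load_feed_from_db)  # miss, loads from the database
cache.get_or_load("feed:u1", load_feed_from_db)  # hit, served from cache
fake_now[0] = 31.0                               # 31 seconds later
cache.get_or_load("feed:u1", load_feed_from_db)  # expired, loads again
print(db_reads["count"])
```

Three reads cost two database calls here. With a hot feed being requested many times inside each 30-second window, the savings get much larger.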
Step 6: Identify Bottlenecks & Scale (3-4 minutes)
The interviewer will ask: “What happens when you hit 10x traffic?”
Walk through potential failure points:
Bottleneck 1: Database
Problem: At 10x traffic (~580K reads/sec), a single SQL database can’t keep up.
Solutions:
- Read Replicas: Separate read and write traffic
- Sharding: Partition users across multiple databases (shard by userId)
- Cache Layer: Reduce database hits by 80-90%
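Sharding by userId comes down to a stable function from user to shard. Below is a minimal sketch using hash-then-modulo; the shard count of 8 is arbitrary, and I use hashlib rather than Python's built-in hash() because the latter is randomized per process for strings.

```python
import hashlib

NUM_SHARDS = 8  # illustrative; pick based on your capacity estimate


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Usage: every API server computes the same shard for the same user.
for uid in ("user-1", "user-2", "user-3"):
    print(uid, "-> shard", shard_for_user(uid))
```

Modulo sharding is the simplest option, but changing the shard count remaps most users. Consistent hashing is a common way to soften that, at the cost of more moving parts, and it is a good trade-off to mention if the interviewer pushes on resharding.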
Bottleneck 2: API Servers
Problem: Traffic spikes during major events (celebrity posts, etc.)
Solutions:
- Horizontal Scaling: Add more API servers behind load balancer
- Auto-scaling: Use Kubernetes/ECS to scale based on CPU/memory
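To make "scale based on CPU" less hand-wavy, here is a sketch of a proportional scaling rule, the same basic formula Kubernetes documents for its Horizontal Pod Autoscaler: desired = ceil(current × currentMetric / targetMetric). The function name and the min/max bounds are my own illustrative choices, and a real autoscaler adds tolerances and cooldowns on top of this.

```python
import math


def desired_replicas(current_replicas: int, current_cpu_pct: float,
                     target_cpu_pct: float, min_replicas: int = 2,
                     max_replicas: int = 100) -> int:
    """Proportional scaling: keep average CPU near the target utilization."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    # Clamp so we never scale to zero or past what the budget allows.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(10, current_cpu_pct=90, target_cpu_pct=60))  # scale out
print(desired_replicas(10, current_cpu_pct=30, target_cpu_pct=60))  # scale in
```

With 10 servers running at 90% CPU against a 60% target, the rule asks for 15 servers. At 30% CPU it drops to 5.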
Bottleneck 3: Photo Storage
Problem: 36PB/year keeps growing.
Solutions:
- Cold Storage: Move old photos to cheaper storage (S3 Glacier)
- Compression: Optimize images on upload
- Smart Serving: Serve different resolutions based on device
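Two of these ideas reduce to small policy functions. The sketch below is only illustrative: the one-year cold-storage cutoff and the list of pre-generated image widths are assumptions I made up for the example, not numbers from any real system.

```python
from datetime import datetime, timedelta, timezone

COLD_AFTER = timedelta(days=365)          # assumption: archive after one year
VARIANT_WIDTHS = [320, 640, 1080, 2048]   # hypothetical pre-generated widths (px)


def storage_tier(uploaded_at: datetime, now: datetime) -> str:
    """Cold storage: photos older than the cutoff move to the cheaper tier."""
    return "cold" if now - uploaded_at > COLD_AFTER else "hot"


def pick_variant(device_width_px: int) -> int:
    """Smart serving: smallest variant at least as wide as the device."""
    for width in VARIANT_WIDTHS:
        if width >= device_width_px:
            return width
    return VARIANT_WIDTHS[-1]  # nothing big enough, so serve the largest


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(storage_tier(datetime(2023, 6, 1, tzinfo=timezone.utc), now))   # old photo
print(storage_tier(datetime(2024, 12, 1, tzinfo=timezone.utc), now))  # recent photo
print(pick_variant(375))  # a phone-sized screen
```

In practice you would base the cutoff on access patterns rather than age alone, but age is an easy first answer in an interview.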
Why this matters: You’re showing you can think beyond the happy path. Real systems fail, and you need to anticipate where and how.
Step 7: Wrap Up & Discuss Trade-offs (2-3 minutes)
Summarize your design in 30 seconds. Then mention 1-2 trade-offs.
Example Summary:
“We designed a scalable Instagram architecture built on horizontally scaled API servers behind a load balancer. User data and relationships go into PostgreSQL for consistency. Feeds are stored in Cassandra for horizontal scaling. Photos are stored in S3 and served via CDN for low latency. We cache hot feeds in Redis to handle 58K reads/sec.”
Trade-offs to Mention:
1. Eventual Consistency in Feeds
- ✅ Gain: Better latency and scalability
- ❌ Cost: Users might see slightly stale data (a post from 30 seconds ago might not appear immediately)
2. Microservices Architecture (if you split the API layer into separate services)
- ✅ Gain: Independent scaling, team autonomy
- ❌ Cost: Operational complexity, distributed debugging
3. NoSQL for Feeds
- ✅ Gain: Massive write throughput
- ❌ Cost: No ACID, eventual consistency
Why this matters: Acknowledging trade-offs shows maturity. There’s no perfect architecture—only architectures that match your requirements and constraints.
What Interviewers Are Really Evaluating
Remember, the interviewer isn’t looking for the “correct” architecture (there isn’t one). They’re evaluating:
- Can you handle ambiguity? (Step 1: Requirements)
- Can you think quantitatively? (Step 2: Estimation)
- Can you justify decisions? (Steps 4-7: Design + Trade-offs)
- Can you communicate clearly? (All steps)
- Do you know when to stop? (Not over-engineering)
How to Practice This Framework
- Pick a system (Twitter, Uber, Netflix, URL shortener)
- Set a timer for 45 minutes
- Walk through all 7 steps out loud (yes, talk to yourself—it matters)
- Draw on a whiteboard or Excalidraw
- Review: What did you miss? Where did you over-complicate?
Repeat this 10-15 times with different systems. The framework will become second nature.
Common Systems to Practice On
- Easy: URL Shortener, Pastebin, Rate Limiter
- Medium: Twitter, Instagram, Notification System
- Hard: Uber, Google Drive, YouTube, Distributed Cache
Start easy. Build muscle memory. Then tackle the hard ones.
Final Thoughts
System design interviews feel intimidating because they’re open-ended. But that’s exactly why this framework works—it gives you structure in ambiguity.
You’re not memorizing architectures. You’re learning a thinking process that applies to any problem.
Master this framework, and you’ll walk into your next system design interview with confidence.
What’s Next?
If you found this helpful, I’m working on a comprehensive course where we’ll apply this exact framework to 15+ real interview problems—complete with diagrams, deep dives, and trade-off discussions.
For now, practice this framework religiously. Print it out. Pin it to your wall. Use it in mock interviews.
You’ve got this.
Want to go deeper? Watch the full video walkthrough where I demonstrate this framework step-by-step using Excalidraw diagrams: [Link to Video]
Questions or feedback? Drop a comment below or reach out—I read everything.
Happy learning, and good luck with your interviews!
— Nitin