A practical guide to zero-downtime deployments
One of the most common system design questions is:
“How do you deploy changes to production without bringing the system down?”
This question is not really about deployment tools.
It is about how you design systems that remain available while they change.
In real-world systems, deployments happen:
- Frequently
- Under traffic
- With real users online
Downtime is not just inconvenient — it can mean lost revenue, broken trust, and failed SLAs.
This article explains how to handle production deployments safely, predictably, and at scale.
Why Deployments Are Risky
When you deploy a new version of your application, several things can go wrong:
- Servers restart and stop serving traffic
- New code has bugs
- Database migrations break compatibility
- Requests hit a mix of old and new versions
If not handled carefully, a deployment can take the entire system down.
The goal of modern deployments is simple:
Users should not even notice that a deployment happened.
The Core Principle: Never Replace Everything at Once
The biggest mistake in deployments is big-bang replacement:
- Stop all servers
- Deploy new code
- Start everything again
This approach guarantees downtime.
Modern systems follow a different principle:
Change the system gradually while it continues serving traffic.
Load Balancers: The Foundation of Safe Deployments
A load balancer sits in front of your application servers and routes traffic to healthy instances.
This allows you to:
- Remove servers from traffic
- Deploy new versions
- Add them back safely
Basic Architecture
Users
↓
Load Balancer
↓
App Server A (v1)
App Server B (v1)
App Server C (v1)
During deployment, traffic can be shifted instance by instance, instead of all at once.
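The routing idea can be sketched in a few lines of Python. This is a toy model, not a real load balancer; the instance names and the `healthy` flag are illustrative:

```python
import itertools

# Toy model: the load balancer routes requests only to healthy instances.
instances = [
    {"name": "app-a", "version": "v1", "healthy": True},
    {"name": "app-b", "version": "v1", "healthy": True},
    {"name": "app-c", "version": "v1", "healthy": False},  # mid-deployment
]

def healthy_instances():
    return [i for i in instances if i["healthy"]]

_rr = itertools.count()

def route_request():
    pool = healthy_instances()
    if not pool:
        raise RuntimeError("no healthy instances left")
    return pool[next(_rr) % len(pool)]  # round-robin over the healthy pool
```

While `app-c` is marked unhealthy, every request lands on `app-a` or `app-b`. Users never see the instance that is mid-deployment.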
Rolling Deployments
A rolling deployment updates servers one at a time or in small batches.
How It Works
- Take one instance out of the load balancer
- Deploy the new version
- Health check the instance
- Put it back into rotation
- Repeat for the next instance
Diagram
Before:
[A v1] [B v1] [C v1]
Deploy:
[A v2] [B v1] [C v1]
After:
[A v2] [B v2] [C v2]
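The steps above can be sketched as a loop. The `FakeLoadBalancer`, `deploy`, and `health_check` helpers below are stand-ins for your real load balancer API and deploy tooling:

```python
class FakeLoadBalancer:
    """In-memory stand-in for a real load balancer API."""
    def __init__(self, pool):
        self.pool = set(pool)
    def remove(self, instance):
        self.pool.discard(instance)   # drain: stop sending traffic here
    def add(self, instance):
        self.pool.add(instance)       # back into rotation

def rolling_deploy(instances, new_version, lb, deploy, health_check):
    for instance in instances:
        lb.remove(instance)                 # 1. take out of rotation
        deploy(instance, new_version)       # 2. install the new version
        if not health_check(instance):      # 3. verify before re-adding
            raise RuntimeError(f"{instance} failed health check; rollout halted")
        lb.add(instance)                    # 4. re-add, move to the next

# Simulated rollout across three instances.
versions = {"A": "v1", "B": "v1", "C": "v1"}
lb = FakeLoadBalancer(versions)
rolling_deploy(
    list(versions), "v2", lb,
    deploy=lambda i, v: versions.__setitem__(i, v),
    health_check=lambda i: True,
)
```

Note that a failed health check halts the rollout with the bad instance still drained, so the remaining healthy instances keep serving traffic.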
✅ Pros of Rolling Deployments
Rolling deployments ensure that some version of the system is always running, which avoids downtime.
They are simple to implement and work well with most load balancers and orchestration systems.
This approach allows teams to deploy frequently without major operational overhead.
❌ Cons of Rolling Deployments
During deployment, the system runs a mix of old and new versions, which can cause issues if backward compatibility is not handled properly.
Rolling deployments also make instant rollback harder, since changes are already partially applied.
📌 Common Use Cases for Rolling Deployments
Rolling deployments are widely used in monolithic applications, backend APIs, and services with strong backward compatibility guarantees.
Blue-Green Deployments
In a blue-green deployment, two identical environments are maintained:
- Blue → current production
- Green → new version
How It Works
- Deploy the new version to the green environment
- Test it fully
- Switch traffic from blue to green
- Keep blue as a rollback option
Diagram
Users
↓
Load Balancer
↙ ↘
Blue (v1) Green (v2)
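The cutover itself can be modeled as a single pointer update. The environments here are toy callables; in practice the "pointer" is a load balancer target group, a DNS record, or a router config:

```python
class BlueGreenRouter:
    def __init__(self, blue, green):
        self.envs = {"blue": blue, "green": green}
        self.live = "blue"                # all traffic starts on blue

    def handle(self, request):
        return self.envs[self.live](request)

    def switch_to(self, color):
        # The cutover is one pointer update: atomic, and instantly reversible.
        assert color in self.envs
        self.live = color

router = BlueGreenRouter(
    blue=lambda req: f"v1 handled {req}",
    green=lambda req: f"v2 handled {req}",
)
before = router.handle("checkout")        # served by blue (v1)
router.switch_to("green")                 # cut over
after = router.handle("checkout")         # served by green (v2)
```

Rollback is the same operation in reverse: `router.switch_to("blue")` and traffic is back on the old version.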
✅ Pros of Blue-Green Deployments
Blue-green deployments provide near-instant rollback, since the old version remains untouched.
They allow thorough testing of the new version before exposing it to users.
This makes deployments very safe and predictable.
❌ Cons of Blue-Green Deployments
Maintaining two full environments doubles infrastructure cost.
Traffic switching must be handled carefully to avoid session loss or inconsistent state.
📌 Common Use Cases for Blue-Green Deployments
Blue-green deployments are commonly used in high-risk systems, such as financial platforms and enterprise applications.
Canary Deployments
A canary deployment releases the new version to a small subset of users first.
If everything looks good, traffic is gradually increased.
How It Works
90% → v1
10% → v2
Over time:
50% → v2
100% → v2
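One common way to implement the split is deterministic, hash-based bucketing, so a given user consistently sees the same version. A minimal sketch (the flag name and percentages are illustrative):

```python
import hashlib

def assign_version(user_id: str, canary_percent: int) -> str:
    """Deterministically bucket a user into v1 or v2.

    Hash-based assignment keeps each user on the same version for a given
    canary_percent, avoiding flip-flopping between old and new behavior.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = (digest[0] * 256 + digest[1]) % 100   # 0..99
    return "v2" if bucket < canary_percent else "v1"
```

Raising the rollout is then just raising `canary_percent` from 10 toward 100; users already on v2 stay on v2.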
✅ Pros of Canary Deployments
Canary deployments limit blast radius by exposing new code to a small audience first.
They allow teams to detect performance regressions or bugs using real production traffic.
This is one of the safest ways to deploy changes.
❌ Cons of Canary Deployments
Canary deployments require strong monitoring and metrics.
They are harder to implement and reason about, especially when bugs only affect specific users.
📌 Common Use Cases for Canary Deployments
Canary deployments are popular in large-scale consumer applications, SaaS platforms, and systems with heavy traffic.
Database Changes: The Most Common Deployment Killer
Code can be rolled back.
Database changes usually cannot: once data has been written in a new shape, reverting it is hard.
This is why deployments often fail at the database layer.
Safe Database Migration Strategy
Rule 1: Backward compatibility first
- Add new columns (nullable)
- Deploy application code that supports both old and new schema
- Backfill data
- Remove old columns in a later release
Example
ALTER TABLE users ADD COLUMN phone_v2 TEXT;
Old code continues working.
New code starts using the new column.
Why This Works
At no point does the database schema break running code.
This allows rolling and canary deployments without downtime.
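On the application side, the transition period can look like the sketch below. Rows are modeled as plain dicts for illustration; `phone` is the hypothetical old column paired with the `phone_v2` column from the example above:

```python
def read_phone(user_row: dict):
    # Prefer the new column, fall back to the old one during the transition.
    return user_row.get("phone_v2") or user_row.get("phone")

def write_phone(user_row: dict, value: str):
    # Dual-write: keep both columns correct until old code is fully retired.
    user_row["phone"] = value
    user_row["phone_v2"] = value

legacy_row = {"phone": "555-0100"}        # written by old code
migrated_row = {}
write_phone(migrated_row, "555-0199")     # new code dual-writes
```

Once every reader uses `phone_v2` and the backfill is complete, the dual-write and the old column can be dropped in a later release.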
Feature Flags: Deploy Without Releasing
Feature flags allow you to deploy code without enabling it.
Benefits
- Safe experimentation
- Instant rollback
- Controlled rollouts
if (feature_enabled) {
use_new_logic();
}
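A minimal runnable sketch of a flag store follows. The flag name `new_checkout` and the in-memory dict are illustrative; real systems back this with a config service or database:

```python
FLAGS = {"new_checkout": False}           # code is deployed, but switched off

def feature_enabled(name: str) -> bool:
    return FLAGS.get(name, False)

def checkout():
    if feature_enabled("new_checkout"):
        return "new checkout flow"
    return "old checkout flow"

first = checkout()                        # old path: the flag is off
FLAGS["new_checkout"] = True              # "release" without redeploying
second = checkout()                       # new path, instantly
```

Rollback is the same toggle in reverse: flipping the flag off restores the old path without touching a single server.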
Health Checks and Graceful Shutdowns
Health Checks
Load balancers must know when a service is ready.
A server should only receive traffic when:
- It is fully started
- Dependencies are available
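A readiness check therefore verifies dependencies, not just that the process is alive. A sketch (the specific checks are illustrative; real ones would ping the database, cache, and so on):

```python
def readiness(checks: dict):
    """Return an HTTP-style status for a load balancer probe.

    Each check is a zero-argument callable returning True when the
    dependency (database, cache, downstream API, ...) is reachable.
    """
    results = {name: bool(check()) for name, check in checks.items()}
    status = 200 if all(results.values()) else 503
    return status, results

status, detail = readiness({
    "database": lambda: True,    # e.g. a SELECT 1 succeeded
    "cache": lambda: True,       # e.g. a Redis PING succeeded
})
```

The load balancer only routes traffic to the instance while the probe returns 200; any failing dependency flips it to 503 and takes the instance out of rotation.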
Graceful Shutdown
When a server is removed:
- Stop accepting new requests
- Finish in-flight requests
- Then shut down
This prevents dropped requests during deployment.
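The shutdown sequence can be sketched as a small state machine. This is simplified; a real server would trigger `begin_shutdown` from SIGTERM and enforce a drain timeout:

```python
class Server:
    def __init__(self):
        self.accepting = True
        self.in_flight = 0

    def handle(self, request):
        if not self.accepting:
            return "rejected"             # load balancer retries elsewhere
        self.in_flight += 1
        try:
            return f"ok: {request}"
        finally:
            self.in_flight -= 1

    def begin_shutdown(self):
        self.accepting = False            # 1. stop accepting new requests

    def can_exit(self) -> bool:
        return self.in_flight == 0        # 2. wait for in-flight work to finish

server = Server()
first = server.handle("req-1")            # served normally
server.begin_shutdown()
second = server.handle("req-2")           # rejected: shutdown in progress
```

Only once `can_exit()` is true does the process actually terminate, so no in-flight request is dropped.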
Observability: You Can’t Protect What You Can’t See
Safe deployments rely on:
- Metrics (latency, error rate)
- Logs
- Alerts
If you can’t detect problems quickly, even the best deployment strategy will fail.
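These signals can feed deployment decisions directly. A canary analysis can be as simple as comparing error rates between versions; the thresholds below are illustrative defaults, not recommendations:

```python
def should_rollback(baseline_errors: int, baseline_total: int,
                    canary_errors: int, canary_total: int,
                    max_ratio: float = 2.0, min_requests: int = 100) -> bool:
    """Roll back if the canary's error rate is much worse than the baseline."""
    if canary_total < min_requests:
        return False                      # not enough data to judge yet
    baseline_rate = baseline_errors / max(baseline_total, 1)
    canary_rate = canary_errors / max(canary_total, 1)
    # Guard the baseline with a small floor so a perfect baseline
    # does not make any canary error trigger a rollback.
    return canary_rate > max_ratio * max(baseline_rate, 0.001)
```

Wired into a canary pipeline, this check runs on every traffic increase: a pass raises the percentage, a fail triggers an automatic rollback.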
Interview Section: Junior vs Senior vs Staff Engineer Answers
❓ Question
How do you deploy to production without downtime?
👶 Junior Engineer Answer
“We deploy one server at a time so the system stays up.”
This answer shows basic awareness but lacks depth and real-world considerations.
👨‍💻 Senior Engineer Answer
“I use rolling or blue-green deployments behind a load balancer. Instances are taken out of rotation, updated, health-checked, and added back. Database migrations are backward compatible.”
This shows production experience and risk awareness.
🧠 Staff Engineer Answer
“Zero-downtime deployment is a system-wide concern. I combine rolling or canary deployments with feature flags, backward-compatible database changes, health checks, and graceful shutdowns. I also rely on strong observability so we can detect and roll back issues quickly.”
This answer demonstrates end-to-end ownership.
Choosing the Right Deployment Strategy
| Scenario | Recommended Strategy |
|---|---|
| Small backend | Rolling deployment |
| High-risk system | Blue-green |
| Large-scale consumer app | Canary |
| Schema changes | Backward-compatible migrations |
Final Takeaway
Deployments should be boring.
If a deployment feels scary, the system is poorly designed.
Strong systems:
- Change gradually
- Fail safely
- Roll back easily
- Never surprise users
Good deployment design turns production changes into routine events — not emergencies.