Deploy Without Breaking Production
February 1, 2025
Safe Version Deployments: Testing Without User Impact
Deploying a new version of a critical service is nerve-wracking. How do you validate it works in production without breaking things for users? Here's a battle-tested approach I've used successfully.
The Challenge
You need to deploy version 2 of your service, but:
- You can't fully replicate production load in staging
- Real-world edge cases only appear with actual user traffic
- Downtime or errors directly impact users
- Rollback after a bad deployment causes disruption
Solution: Test in production without users noticing.
The Four-Phase Strategy
Phase 1: Shadow Deployment (Week 1-2)
Concept: Run both versions side-by-side. Users see v1 results, but v2 processes the same requests in the background.
Setup:
- Deploy v2 alongside existing v1
- Route production traffic to BOTH versions
- Serve v1 results to users
- Log v2 predictions/responses but don't serve them
What to measure:
- Performance: Latency (p50, p95, p99), CPU, memory usage
- Response differences: How often do v1 and v2 produce different results?
- Error rates: Exceptions, timeouts, crashes in v2
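To make the latency comparison concrete, here's a minimal sketch of summarizing shadow-mode latency samples into p50/p95/p99 with the standard library. The function name and sample numbers are purely illustrative, not from any production system:

import statistics

def latency_percentiles(samples_ms):
    """Summarize per-request latencies (milliseconds) into p50/p95/p99."""
    # method="inclusive" interpolates within the observed range,
    # which reads more sensibly for small illustrative samples
    q = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

# Compare v1 vs v2 latency distributions collected during shadow mode
v1_latencies = [12.1, 13.4, 11.8, 45.0, 12.9]   # illustrative numbers only
v2_latencies = [13.0, 14.1, 12.5, 52.3, 13.6]
print("v1:", latency_percentiles(v1_latencies))
print("v2:", latency_percentiles(v2_latencies))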
Example from Visa:
At Visa, we used a Test-in-Production (TIP) framework when migrating payment APIs. We ran the old and new API versions in parallel, compared responses byte-by-byte, and flagged any discrepancies. This caught:
- Edge cases in currency conversion logic
- Subtle timezone handling differences
- Unexpected null value handling
We found issues that never appeared in our staging environment because they only occurred with specific merchant configurations in production.
Validation:
- Statistical analysis of response differences
- Investigation of high-variance cases
- Performance regression checks (v2 shouldn't be >10% slower)
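Here's a minimal sketch of what that statistical check could look like (not the actual TIP framework): it tallies mismatches from a shadow comparison log and fails the run if the mismatch rate or latency regression crosses a threshold. The log format and threshold values are assumptions for illustration:

def evaluate_shadow_run(comparisons, max_mismatch_rate=0.001, max_latency_regression=0.10):
    """comparisons: list of dicts such as
    {"request_id": "...", "match": True, "v1_ms": 12.3, "v2_ms": 13.1}
    (an assumed log format, not a real schema)."""
    if not comparisons:
        return {"ok": False, "reason": "no shadow traffic recorded"}
    total = len(comparisons)
    mismatches = [c for c in comparisons if not c["match"]]
    mismatch_rate = len(mismatches) / total

    # Rule of thumb from above: v2 shouldn't be more than 10% slower
    avg_v1 = sum(c["v1_ms"] for c in comparisons) / total
    avg_v2 = sum(c["v2_ms"] for c in comparisons) / total
    latency_regression = (avg_v2 - avg_v1) / avg_v1

    return {
        "ok": mismatch_rate <= max_mismatch_rate and latency_regression <= max_latency_regression,
        "mismatch_rate": mismatch_rate,
        "latency_regression": latency_regression,
        # Keep a handful of IDs handy for investigating high-variance cases
        "mismatched_ids": [c["request_id"] for c in mismatches][:20],
    }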
Phase 2: Canary Deployment (Week 3)
Concept: Send a small percentage of real traffic to v2. Monitor closely for issues.
Traffic split: 5% to v2, 95% to v1
Key practices:
- Automated monitoring: Alerts fire if v2 violates SLOs
- Automatic rollback: Script triggers rollback if error rate spikes
- Selective routing: Route traffic by user ID hash for consistency (same user always hits same version)
Critical metrics:
- Error rate: must be ≤ the v1 error rate
- Latency: p99 below the SLO threshold
- Success rate: ≥ the v1 success rate
- Resource usage: within expected bounds
Automated rollback triggers:
- Error rate > 2x baseline
- p99 latency exceeds SLO
- Customer complaints exceed threshold
- Critical functionality fails
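As a sketch of how those triggers can be automated: a periodic check compares the canary's live metrics against the v1 baseline and the SLO, and calls a rollback hook when a trigger fires. The metric dict format, threshold values, and the rollback callable are hypothetical placeholders for whatever your monitoring and deployment tooling provides:

# Assumed thresholds mirroring the triggers above
ERROR_RATE_MULTIPLIER = 2.0   # roll back if error rate > 2x baseline
P99_SLO_MS = 500              # p99 latency SLO (example value)

def check_canary(metrics_v1, metrics_v2, rollback):
    """metrics_* are dicts like {"error_rate": 0.002, "p99_ms": 180}
    pulled from your monitoring system (format assumed for illustration)."""
    reasons = []
    if metrics_v2["error_rate"] > ERROR_RATE_MULTIPLIER * metrics_v1["error_rate"]:
        reasons.append("error rate > 2x baseline")
    if metrics_v2["p99_ms"] > P99_SLO_MS:
        reasons.append("p99 latency exceeds SLO")
    if reasons:
        rollback(reasons)   # e.g., shift v2 traffic back to 0%
        return False
    return True

Complaint volume and functional-check failures would feed in the same way, as additional conditions.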
Phase 3: Graduated Rollout (Week 4-6)
Concept: Gradually increase traffic to v2 while monitoring for issues.
Traffic progression: 5% → 25% → 50% → 100%
Multi-dimensional rollout strategies:
By feature complexity:
- Start with simple, low-risk endpoints
- Move to complex, high-traffic endpoints last
By customer tier:
- Internal users first (your team tests it)
- Beta customers second (opted-in early adopters)
- All customers last
By geographic region:
- Single region first (e.g., US West)
- Expand to other regions gradually
- Full global rollout last
Rollout pause criteria:
- Any automated alert fires → pause and investigate
- Manual quality checks reveal issues → pause
- Customer feedback indicates problems → pause
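The progression itself can be driven by a simple loop over traffic stages that soaks at each step and pauses when a health check fails. This sketch reuses the hypothetical check_canary above and assumes a router object with a v2_traffic_percentage knob like the VersionRouter shown later in this post; the stage list and soak time are examples, not recommendations:

import time

ROLLOUT_STAGES = [5, 25, 50, 100]   # percent of traffic routed to v2
SOAK_SECONDS = 24 * 3600            # how long each stage bakes (assumed)

def graduated_rollout(router, get_metrics, rollback):
    for stage in ROLLOUT_STAGES:
        router.v2_traffic_percentage = stage
        print(f"Rollout stage: {stage}% of traffic to v2")
        deadline = time.time() + SOAK_SECONDS
        while time.time() < deadline:
            metrics_v1, metrics_v2 = get_metrics()
            if not check_canary(metrics_v1, metrics_v2, rollback):
                print("Trigger fired; rollout paused for investigation")
                return False
            time.sleep(60)   # re-check every minute
    return True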
Phase 4: Continuous Validation (Ongoing)
Concept: Even at 100% v2, keep validating to catch degradation over time.
Golden dataset:
- Maintain a reference set of requests with known correct responses
- Run them against v2 regularly (daily/weekly)
- Alert if accuracy drops
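A minimal sketch of that golden-dataset check, assuming each entry pairs a request with its known-correct response (the JSONL path, schema, and threshold are illustrative):

import json

def run_golden_dataset(service, path="golden_requests.jsonl", min_accuracy=0.99):
    """Each line: {"request": {...}, "expected": {...}} - an assumed format."""
    total = correct = 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            total += 1
            if service.process(case["request"]) == case["expected"]:
                correct += 1
    accuracy = correct / total if total else 0.0
    if accuracy < min_accuracy:
        print(f"ALERT: golden-dataset accuracy dropped to {accuracy:.2%}")
    return accuracy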
Drift detection:
- Monitor response patterns over time
- Catch performance degradation early
- Identify data drift (input patterns changing)
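One simple way to implement this is to compare a recent window of some signal (latency, input size, model score) against a baseline window with a two-sample Kolmogorov-Smirnov test. The choice of signal, window, and p-value threshold here are assumptions; SciPy provides the test itself:

from scipy.stats import ks_2samp

def detect_drift(baseline_samples, recent_samples, p_threshold=0.01):
    """Flag drift if the recent distribution differs significantly from baseline."""
    result = ks_2samp(baseline_samples, recent_samples)
    if result.pvalue < p_threshold:
        print(f"Possible drift (KS statistic={result.statistic:.3f}, p={result.pvalue:.4f})")
        return True
    return False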
Feedback loop:
- User-reported issues tracked and analyzed
- Patterns fed back into testing strategy
- Continuous improvement of validation
Implementation Example
Here's a simple traffic routing pattern:
import hashlib

class VersionRouter:
    def __init__(self, service_v1, service_v2,
                 v2_traffic_percentage=0, shadow_mode_enabled=False):
        self.service_v1 = service_v1
        self.service_v2 = service_v2
        self.v2_traffic_percentage = v2_traffic_percentage
        self.shadow_mode_enabled = shadow_mode_enabled

    def process_request(self, request_id, request_data):
        version = self.get_version(request_id)
        if version == "shadow":
            # Shadow: run both versions, but serve the v1 result to the user.
            # (In practice, v2 is often called asynchronously so it adds no latency.)
            v1_result = self.service_v1.process(request_data)
            v2_result = self.service_v2.process(request_data)
            # Log differences for offline analysis
            self.log_comparison(request_id, v1_result, v2_result)
            return v1_result
        elif version == "v2":
            return self.service_v2.process(request_data)
        else:
            return self.service_v1.process(request_data)

    def get_version(self, request_id):
        # Stable hashing so the same user always hits the same version.
        # Python's built-in hash() is salted per process, so use a real digest.
        user_hash = int(hashlib.md5(str(request_id).encode()).hexdigest(), 16) % 100
        if self.shadow_mode_enabled:
            return "shadow"
        elif user_hash < self.v2_traffic_percentage:
            return "v2"
        else:
            return "v1"

    def log_comparison(self, request_id, v1_result, v2_result):
        # Record mismatches for later analysis (stdout here; a real system
        # would write to a metrics or logging pipeline)
        if v1_result != v2_result:
            print(f"MISMATCH {request_id}: v1={v1_result!r} v2={v2_result!r}")
Key Lessons Learned
1. Shadow Testing is Your Safety Net
Production traffic reveals edge cases you can't imagine in testing. Shadow mode lets you discover them risk-free.
2. Automate Rollback
If a human has to decide whether to roll back, you've already lost critical time. Automated triggers save you.
3. Start Small
A 5% canary catches most issues. Don't jump to 50% just because 5% looks good for an hour.
4. Consistent Routing Matters
Hash user IDs for routing so the same user always hits the same version. This prevents weird "it worked yesterday but not today" bugs.
5. Pause is Okay
We paused rollouts twice at Visa. Better to investigate than to push forward blindly.
Common Mistakes to Avoid
❌ Skipping shadow mode → Missing prod-only edge cases ✅ Run shadow mode for at least a week
❌ No automated rollback → Slow response to issues ✅ Set clear thresholds and automate rollback
❌ Jumping traffic percentages → 5% to 100% is too aggressive ✅ Gradual: 5% → 25% → 50% → 100%
❌ Ignoring resource usage → v2 might be functionally correct but too slow/expensive ✅ Monitor latency, CPU, memory, costs
❌ Deploying Friday afternoon → If something breaks, you're working the weekend ✅ Deploy Tuesday-Thursday mornings
Quick Reference: Traffic Split Timeline
Week 1-2: Shadow mode (0% user-facing)
Week 3: Canary (5% real traffic)
Week 4: Expansion (25% real traffic)
Week 5: Majority (50% real traffic)
Week 6: Full rollout (100% real traffic)
Adjust timeline based on:
- Service criticality
- Deployment complexity
- Your organization's risk tolerance
- Regulatory requirements
When to Use This Strategy
Use shadow + canary for:
- Mission-critical services
- High-traffic endpoints
- Complex logic changes
- Services with strict SLAs
- Regulated environments
Skip straight to canary if:
- Low-traffic service
- Simple bug fix
- Internal-only service
- Strong staging environment confidence
Conclusion
Testing new versions in production doesn't mean risking user experience. With shadow deployments, canary releases, and automated validation, you can:
- Catch production-only edge cases
- Deploy with confidence
- Roll back instantly if needed
- Maintain uptime during migrations
The key: Start small, measure everything, automate decisions, and move gradually.
At Visa, this approach let us migrate payment systems handling 2,500 TPS with zero customer impact. It works.
How do you handle risky deployments? What's your rollback strategy? Connect on LinkedIn to discuss.