Multi-CDN Failover Architecture
A resilience architecture that removes a single CDN/edge provider as a single point of failure, using DNS-based traffic management to automatically fail over between two independent CDN providers.
2 independent (active + failover)
CDN providers
4 distinct root causes
Edge-case failure modes diagnosed
Seconds (1-click trigger)
Manual failover time
The Problem
A set of business-critical web applications sat entirely behind one CDN/edge provider. If that provider degraded or had an outage, every one of those applications went down at once — there was no fallback path. The goal was to put a second, independent CDN in place as an automatic failover target, without disrupting existing TLS, routing, or authentication behavior.
My Approach
- Selected a DNS-based traffic manager configured with priority routing — primary CDN → secondary CDN → origin — so failover is automatic based on active health probes rather than manual intervention.
- Stood up a second, independent CDN distribution set for every application, mirrored 1:1 against the existing primary CDN configuration (routing rules, custom domains, TLS).
- Replicated the existing edge security policy (WAF managed rule sets + custom allow/deny rules) on the secondary CDN so failover doesn't also mean a drop in protection.
- Wrote the entire stack as Infrastructure-as-Code so the topology is reproducible, diffable, and auditable rather than click-ops.
- Built a one-click manual failover trigger (serverless webhook + mobile shortcut) so an on-call engineer can force traffic off the primary CDN in seconds during a live incident or planned maintenance, without CLI access.
- Ran a cost/tradeoff analysis on the secondary CDN's security add-ons (bot management, managed rule groups) to right-size protection vs. spend.
Key Engineering Challenges Solved
Health-probe blind spots
The traffic manager's health checks didn't always reflect a real outage (e.g., an intentionally disabled edge endpoint reported as "healthy"), so I identified the correct trigger point for planned failover drills vs. relying on automatic detection alone.
Security policy parity
A managed WAF rule blocked legitimate server-to-server traffic that lacked a User-Agent header — diagnosed via sampled request logs and fixed with a scoped rule override rather than a blanket disable.
Identity/session continuity
Authentication callback URLs depended on which header-forwarding policy the CDN used, which could silently break single sign-on after a failover if misconfigured.
Client-side connection caching
Documented a browser-level HTTP/2 connection-reuse behavior that could make a failover look "stuck" from a single client's point of view, distinguishing it from a real infrastructure problem.
Stack
Cloud & Networking
Security
Infrastructure as Code
Automation
Practices
Skills Demonstrated
- ▸Multi-cloud network architecture and disaster-recovery design
- ▸Infrastructure as Code at production scale
- ▸Security-conscious edge configuration (WAF tuning without sacrificing functionality)
- ▸Rigorous, methodical debugging of distributed-systems failure modes
- ▸Building operational tooling that shortens incident response time
- ▸Clear technical documentation of non-obvious system behavior for future engineers