Cloud Infrastructure & SRE

Multi-CDN Failover Architecture

A resilience architecture that removes a single CDN/edge provider as a single point of failure, using DNS-based traffic management to automatically fail over between two independent CDN providers.

2 independent (active + failover)

CDN providers

4 distinct root causes

Edge-case failure modes diagnosed

Seconds (1-click trigger)

Manual failover time

The Problem

A set of business-critical web applications sat entirely behind one CDN/edge provider. If that provider degraded or had an outage, every one of those applications went down at once — there was no fallback path. The goal was to put a second, independent CDN in place as an automatic failover target, without disrupting existing TLS, routing, or authentication behavior.

My Approach

Selected a DNS-based traffic manager configured with priority routing — primary CDN → secondary CDN → origin — so failover is automatic based on active health probes rather than manual intervention.
Stood up a second, independent CDN distribution set for every application, mirrored 1:1 against the existing primary CDN configuration (routing rules, custom domains, TLS).
Replicated the existing edge security policy (WAF managed rule sets + custom allow/deny rules) on the secondary CDN so failover doesn't also mean a drop in protection.
Wrote the entire stack as Infrastructure-as-Code so the topology is reproducible, diffable, and auditable rather than click-ops.
Built a one-click manual failover trigger (serverless webhook + mobile shortcut) so an on-call engineer can force traffic off the primary CDN in seconds during a live incident or planned maintenance, without CLI access.
Ran a cost/tradeoff analysis on the secondary CDN's security add-ons (bot management, managed rule groups) to right-size protection vs. spend.

Key Engineering Challenges Solved

Health-probe blind spots

The traffic manager's health checks didn't always reflect a real outage (e.g., an intentionally disabled edge endpoint reported as "healthy"), so I identified the correct trigger point for planned failover drills vs. relying on automatic detection alone.

Security policy parity

A managed WAF rule blocked legitimate server-to-server traffic that lacked a User-Agent header — diagnosed via sampled request logs and fixed with a scoped rule override rather than a blanket disable.

Identity/session continuity

Authentication callback URLs depended on which header-forwarding policy the CDN used, which could silently break single sign-on after a failover if misconfigured.

Client-side connection caching

Documented a browser-level HTTP/2 connection-reuse behavior that could make a failover look "stuck" from a single client's point of view, distinguishing it from a real infrastructure problem.

Stack

Cloud & Networking

Azure Front DoorAWS CloudFrontAzure Traffic ManagerAWS Route 53VNetsStatic Web AppsAzure Container AppsApp Service

Security

AWS WAF (managed + custom rules)CloudFront Functions (edge geo-filtering)

Infrastructure as Code

Terraform (multi-provider: azurerm + aws)

Automation

Azure Functions (PowerShell 7)Azure Managed IdentityARM REST APIiOS Shortcuts (mobile incident-response trigger)

Practices

DNS architectureTLS/certificate managementIncident response toolingInfrastructure documentation

Skills Demonstrated

▸Multi-cloud network architecture and disaster-recovery design
▸Infrastructure as Code at production scale
▸Security-conscious edge configuration (WAF tuning without sacrificing functionality)
▸Rigorous, methodical debugging of distributed-systems failure modes
▸Building operational tooling that shortens incident response time
▸Clear technical documentation of non-obvious system behavior for future engineers