Scaling E‑Commerce Infrastructure for a National Retail Chain

Dallas, TX

Executive Summary

A national retail chain operating 1,200+ stores and a rapidly growing e‑commerce platform needed to modernize its cloud infrastructure to handle seasonal traffic surges — particularly during Black Friday, holiday promotions, and flash sales — without over‑provisioning year‑round. The existing monolithic architecture and manual scaling processes led to frequent outages during peak events, costing an estimated $2.1 million in lost revenue per incident. THNKBIG was engaged to design and implement a cloud‑native, auto‑scaling Kubernetes platform on AWS EKS that would deliver resilience during demand spikes while optimizing costs during steady‑state periods.

  • 99.95% Peak Event Uptime
  • 40% Infrastructure Cost Reduction
  • <60s Auto‑Scale Response Time

Solution Implemented

  • AWS EKS multi‑AZ cluster — Deployed production Kubernetes clusters across 3 availability zones with Karpenter for sub‑minute node provisioning during traffic surges.
  • Microservices decomposition — Broke the monolithic application into 24 independently deployable services (catalog, cart, checkout, inventory, payments), enabling targeted scaling.
  • Event‑driven autoscaling — Implemented KEDA (Kubernetes Event‑Driven Autoscaling) to scale services based on queue depth, request rate, and custom business metrics.
  • Infrastructure as Code — Terraformed the entire stack for repeatable, auditable deployments across dev/staging/production environments.

Outcomes Expected

  • Achieve 99.95% uptime during peak shopping events (Black Friday, Cyber Monday, flash sales).
  • Reduce scaling response time from 45 minutes to < 60 seconds using Karpenter + KEDA.
  • Cut baseline infrastructure costs by ≥ 40% through right‑sizing and Spot instance adoption.
  • Enable independent service scaling, allowing checkout to handle 10× traffic spikes without scaling the entire stack.

Challenge

The retailer's legacy infrastructure presented several compounding challenges that threatened both revenue and customer experience:

During Black Friday 2024, the e‑commerce platform experienced 47 minutes of downtime, resulting in $2.1M in lost revenue and significant brand damage across social media. Infrastructure teams needed 45+ minutes to manually provision additional capacity during traffic surges — far too slow for flash‑sale events that generate 10× normal traffic within seconds.

To compensate for slow scaling, the team maintained 3× the required baseline capacity year‑round, wasting approximately $180K per month in idle compute. The legacy monolithic Java application couldn't scale individual services independently, meaning a surge in checkout traffic required scaling the entire application stack — catalog, inventory, payments, and all.

Solution

We designed a cloud‑native, auto‑scaling platform on AWS EKS that addressed each challenge:

Microservices Decomposition

Broke the monolithic application into 24 independently deployable services — catalog, cart, checkout, inventory, payments, recommendations, and more. Each service can now scale based on its own demand profile, so a checkout surge doesn't require scaling the entire stack.

AWS EKS Multi‑AZ Cluster

Deployed production Kubernetes clusters across 3 availability zones with Karpenter for sub‑minute node provisioning. During traffic surges, new compute capacity spins up in under 60 seconds — down from 45 minutes with the previous manual process.
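
As a rough illustration of this pattern, the sketch below shows what a Karpenter NodePool for surge capacity could look like. It assumes the Karpenter v1 API and an EC2NodeClass named `default`; the pool name, limits, and settings are illustrative, not the retailer's actual configuration.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: surge-general          # illustrative name
spec:
  template:
    spec:
      requirements:
        # Keep surge capacity on on-demand instances for reliability;
        # Spot pools for non-critical workloads are defined separately
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default          # assumed EC2NodeClass
  limits:
    cpu: "1000"                # hard cap so a surge cannot scale unbounded
  disruption:
    # Consolidate idle nodes quickly once the surge subsides
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```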

Event‑Driven Autoscaling (KEDA)

Implemented KEDA to scale services based on queue depth, request rate, and custom business metrics like cart‑abandonment rate. This ensures the platform pre‑scales before users experience any degradation.
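
To sketch how such triggers might be wired up, the ScaledObject below scales a hypothetical checkout Deployment on SQS queue depth and Prometheus‑measured request rate. The queue URL, thresholds, namespaces, and the `keda-aws-auth` TriggerAuthentication are placeholders, not the exact triggers used in this engagement.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: checkout-scaler
  namespace: checkout
spec:
  scaleTargetRef:
    name: checkout             # hypothetical Deployment
  minReplicaCount: 4
  maxReplicaCount: 120
  cooldownPeriod: 120          # seconds to wait before scaling back down
  triggers:
    # Scale on order-queue depth (placeholder queue URL)
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/orders
        queueLength: "50"      # target messages per replica
        awsRegion: us-east-1
      authenticationRef:
        name: keda-aws-auth    # assumed TriggerAuthentication
    # Scale on request rate measured by Prometheus
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(http_requests_total{service="checkout"}[1m]))
        threshold: "500"       # requests/sec per replica
```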

Infrastructure as Code

Terraformed the entire stack for repeatable, auditable deployments across dev, staging, and production environments. All changes go through PR review and are applied via GitOps with ArgoCD.
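
A minimal sketch of what one such GitOps definition could look like: an ArgoCD Application that syncs a service's production manifests from Git. The repo URL, project name, and paths are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: checkout-production    # illustrative name
  namespace: argocd
spec:
  project: ecommerce           # hypothetical ArgoCD project
  source:
    repoURL: https://github.com/example/platform-manifests   # placeholder repo
    targetRevision: main
    path: services/checkout/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: checkout
  syncPolicy:
    automated:
      prune: true              # remove resources deleted from Git
      selfHeal: true           # revert out-of-band cluster drift
    syncOptions:
      - CreateNamespace=true
```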

Implementation

The transformation followed a four‑phase approach over 16 weeks:

Phase 1 — Discovery & Architecture (3 weeks): Conducted traffic analysis of two years of seasonal patterns, identified the 6 highest‑traffic services for initial decomposition, and designed the target EKS cluster topology.

Phase 2 — Pilot Migration (4 weeks): Migrated the checkout and payments services to EKS, set up Karpenter node provisioning, and validated auto‑scaling behavior with synthetic load tests simulating Black Friday traffic patterns.

Phase 3 — Full Rollout (6 weeks): Migrated the remaining 22 services to Kubernetes, implemented KEDA event‑driven scaling, and deployed Spot instance pools for non‑critical workloads (catalog indexing, recommendations, analytics), as sketched below.
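
As a sketch of the Spot‑pool pattern (all names illustrative), a separate Karpenter NodePool can allow both Spot and on‑demand capacity (Karpenter prefers Spot when both are permitted and falls back to on‑demand when Spot is unavailable), with a taint so that only non‑critical workloads tolerating it land on these nodes.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-batch             # illustrative name
spec:
  template:
    spec:
      requirements:
        # Allowing both capacity types lets Karpenter prefer Spot
        # and fall back to on-demand when Spot is unavailable
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      # Only workloads that tolerate this taint (catalog indexing,
      # recommendations, analytics) schedule onto these nodes
      taints:
        - key: workload-tier   # hypothetical taint key
          value: non-critical
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default          # assumed EC2NodeClass
```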

Phase 4 — Optimization (3 weeks): Fine‑tuned HPA/VPA policies, configured Kubecost dashboards for real‑time cost attribution by service team, and ran a full Black Friday simulation achieving 12× baseline traffic with zero degradation.
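
For illustration, the HPA below shows the kind of policy tuning described, assuming the autoscaling/v2 API: scale up aggressively with no stabilization window, scale down conservatively after a five‑minute window. The service name and thresholds are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: catalog                # illustrative service
  namespace: catalog
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: catalog
  minReplicas: 3
  maxReplicas: 60
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react immediately to surges
      policies:
        - type: Percent
          value: 100                   # allow doubling every 15s
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300  # scale down conservatively
```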

Results & Impact

Within 90 days of go‑live, the platform delivered measurable improvements across every target metric:

The e‑commerce platform achieved 99.97% uptime during Cyber Weekend 2025 — including Black Friday and Cyber Monday — handling 11.4× normal traffic with zero customer‑facing incidents. Scaling response time dropped from 45 minutes to under 45 seconds, with Karpenter provisioning right‑sized nodes automatically.

Baseline infrastructure costs fell by 43%, saving approximately $180K per month through Spot instance adoption, right‑sizing, and eliminating the 3× over‑provisioning buffer. Individual service teams now have real‑time cost visibility via Kubecost, enabling them to optimize their own resource usage.

Deployment frequency increased from monthly releases to multiple daily deployments, with ArgoCD managing zero‑downtime rollouts and automated rollback on failed health checks.

Key Takeaways

Microservices enable surgical scaling: Breaking the monolith into 24 services meant checkout could handle 10× surges without scaling catalog or inventory.

Sub‑minute provisioning changes everything: Karpenter + KEDA eliminated the 45‑minute scaling gap that caused Black Friday outages.

Spot instances are safe for retail: With proper fallback policies, 60% of non‑critical workloads run on Spot, saving $78K/month alone.

Cost visibility drives accountability: Kubecost dashboards per service team turned cloud spend from a mystery into a managed metric.

Cloud Complexity is a Problem — Until You Have the Right Team

From compliance automation to Kubernetes optimization, we help enterprises transform infrastructure into a competitive advantage.

Talk to a Cloud Expert