Case Study: Certificate Automation at an E-commerce Company

Published on December 29, 2024 by CrtMgr Team • 6 min read

SSL Case Study Automation E-commerce DevOps

What does it actually cost to ignore certificate management? For one mid-sized e-commerce company, the answer turned out to be $47,000 in a single afternoon — plus a weekend of cleanup nobody wanted to spend.

This is the story of a team that went from 3-4 certificate incidents per year and 40 hours of manual work per month to zero incidents and near-full automation in about four and a half months. The tools aren’t exotic: cert-manager for Kubernetes, Certbot for legacy servers, and a monitoring stack that caught problems before users did. What changed was the commitment to treat certificate management as infrastructure, not an afterthought.

Challenge: Certificate Chaos

The company operated across AWS, Azure, and GCP with a complex infrastructure: 15 Kubernetes clusters running 500+ pods, plus legacy VMs still running traditional applications. Their multi-cloud architecture had grown organically over years, and certificate management had become a nightmare.

They were experiencing 3-4 incidents with expired certificates annually. There was no central visibility into what certificates existed or when they would expire. Each renewal required manual intervention, consuming 2-3 hours per month per administrator. Different teams had developed their own practices, and nobody received alerts before certificates expired.

The Breaking Point

July 2023 brought the incident that forced change. Their main checkout flow went down for 2 hours during peak sales hours because of an expired certificate. The damage was severe: $47,000 in lost sales, 23% of customers abandoning their carts, negative SEO impact, and erosion of customer trust.

The root cause? Someone had renewed the certificate but forgot to install it on the load balancer. Classic human error that even basic monitoring could have prevented.

Solution: Three-Phase Transformation

Phase 1: Inventory (2 weeks)

Before automating anything, they needed to understand what they had. They started with automatic scanning using a simple bash script:

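#!/bin/bash
# Discovery script: record subject, validity dates, and issuer for each domain
while IFS= read -r domain; do
  [ -z "$domain" ] && continue   # skip blank lines
  echo "Checking $domain..."
  # -servername sends SNI so servers hosting several sites return the right certificate
  echo | openssl s_client -connect "$domain:443" -servername "$domain" 2>/dev/null | \
    openssl x509 -noout -subject -dates -issuer >> inventory.txt
done < domains.txt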

The results were imported into a spreadsheet with columns for domain, expiration date, issuer (CA), owner/team, environment (prod/staging/dev), and platform (K8s/VM/Serverless).

The audit uncovered 237 certificates, which came as a surprise. Of these, 18 had already expired and 42 had fewer than 30 days left. Together they covered 89 different domains and had been issued by 12 different Certificate Authorities.
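The expired and expiring-soon counts come from comparing each certificate's `notAfter` date with the current time. Below is a minimal Python sketch of that triage, assuming the inventory holds the `notAfter=` lines that `openssl x509 -dates` prints. The function names and the `warn_days` parameter are illustrative, not the team's actual tooling.

```python
from datetime import datetime, timezone

def parse_not_after(line):
    """Parse an openssl line such as 'notAfter=Dec 29 12:00:00 2024 GMT'."""
    value = line.strip().split("=", 1)[1]
    # openssl pads single-digit days with an extra space; collapse whitespace
    value = " ".join(value.split())
    parsed = datetime.strptime(value, "%b %d %H:%M:%S %Y %Z")
    return parsed.replace(tzinfo=timezone.utc)

def classify(not_after, now, warn_days=30):
    """Return 'expired', 'expiring', or 'ok' for a certificate end date."""
    remaining_days = (not_after - now).total_seconds() / 86400
    if remaining_days < 0:
        return "expired"
    if remaining_days < warn_days:
        return "expiring"
    return "ok"
```

Running `classify` over every parsed line and counting the results per category would give the kind of summary described above.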

Phase 2: Quick Wins (1 month)

The goal was stabilization and basic monitoring before attempting full automation.

Consolidating Certificate Authorities

The first quick win was eliminating the $15,300 a year going to four commercial CAs: DigiCert ($8,500), Sectigo ($3,200), GoDaddy ($2,100), and RapidSSL ($1,500). They migrated everything to Let’s Encrypt, dropping their annual certificate costs to zero.

Setting Up Monitoring

They implemented CrtMgr alongside their own monitoring stack. For the Prometheus-based alerting layer, they followed a setup similar to what’s described in the SSL certificate monitoring with Prometheus and Grafana guide — giving them both internal metrics and external visibility:

# Prometheus alerts
- alert: CertificateExpiringIn30Days
  expr: (ssl_cert_not_after - time()) / 86400 < 30
  annotations:
    summary: "Cert expiring for {{ $labels.domain }}"

Slack integration ensured alerts reached the right people. They also created proper procedures and documentation—renewal playbooks, contact lists, escalation procedures, and runbooks all went into Confluence.
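To make the alert expression concrete, here is a small Python sketch that mirrors its arithmetic: seconds until expiry divided by 86400 gives days, and anything under 30 fires. The helper names and the Slack message wording are assumptions for illustration. In a Prometheus setup, Alertmanager's Slack receiver would normally deliver the notification.

```python
import time

SECONDS_PER_DAY = 86400

def days_until_expiry(not_after_epoch, now_epoch=None):
    """Same arithmetic as (ssl_cert_not_after - time()) / 86400."""
    if now_epoch is None:
        now_epoch = time.time()
    return (not_after_epoch - now_epoch) / SECONDS_PER_DAY

def should_alert(not_after_epoch, now_epoch=None, threshold_days=30):
    """True when the certificate has fewer than threshold_days left."""
    return days_until_expiry(not_after_epoch, now_epoch) < threshold_days

def slack_payload(domain, days_left):
    """JSON body for a Slack incoming webhook (plain 'text' message)."""
    return {"text": f"Cert expiring for {domain}: {days_left:.0f} days left"}
```

A certificate with 10 days left triggers the alert, while one with 45 days left does not.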

After just one month, they had zero expired certificates, all teams were receiving alerts, and standard procedures were documented and accessible.

Phase 3: Full Automation (3 months)

Now for the real transformation: eliminating manual certificate management entirely.

Kubernetes (70% of infrastructure)

They installed cert-manager across all 15 clusters:

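# Installation (add the chart repository first; installCRDs installs cert-manager's CRDs)
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.13.0 \
  --set installCRDs=true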

To migrate all existing Ingress resources, they wrote a Python script that added cert-manager annotations automatically:

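#!/usr/bin/env python3
"""Annotate every Ingress so cert-manager issues and renews its certificate."""
import json
import subprocess

# Get all ingresses across namespaces; check=True aborts if kubectl fails
result = subprocess.run(
    ['kubectl', 'get', 'ingress', '-A', '-o', 'json'],
    capture_output=True, text=True, check=True)
ingresses = json.loads(result.stdout)

for ing in ingresses['items']:
    namespace = ing['metadata']['namespace']
    name = ing['metadata']['name']

    # cert-manager only acts on Ingresses that declare a spec.tls section
    if not ing.get('spec', {}).get('tls'):
        print(f"- Skipped {namespace}/{name} (no spec.tls)")
        continue

    # Add cert-manager annotation (assumes a ClusterIssuer named letsencrypt-prod exists)
    annotate = subprocess.run([
        'kubectl', 'annotate', 'ingress', name, '-n', namespace,
        'cert-manager.io/cluster-issuer=letsencrypt-prod',
        '--overwrite'
    ])

    # Report success only when kubectl actually applied the annotation
    if annotate.returncode == 0:
        print(f"✓ Migrated {namespace}/{name}")
    else:
        print(f"✗ Failed {namespace}/{name}")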
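The annotation only has an effect if a ClusterIssuer named `letsencrypt-prod` exists in each cluster. The case study does not show that resource, so the following is a hedged sketch of what it might look like, built as a Python dictionary and printed as JSON (which `kubectl apply` accepts). The contact email, the `nginx` ingress class, the HTTP-01 solver, and the account-key secret name are all assumptions.

```python
import json

def cluster_issuer(name, email, ingress_class):
    """Build a cert-manager ClusterIssuer manifest for Let's Encrypt (ACME HTTP-01)."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "ClusterIssuer",
        "metadata": {"name": name},
        "spec": {
            "acme": {
                # Let's Encrypt production ACME endpoint
                "server": "https://acme-v02.api.letsencrypt.org/directory",
                "email": email,
                # Secret where cert-manager stores the ACME account key
                "privateKeySecretRef": {"name": f"{name}-account-key"},
                # Solve challenges through the cluster's ingress controller
                "solvers": [{"http01": {"ingress": {"class": ingress_class}}}],
            }
        },
    }

if __name__ == "__main__":
    # Hypothetical values; pipe the output to: kubectl apply -f -
    manifest = cluster_issuer("letsencrypt-prod", "ops@example.com", "nginx")
    print(json.dumps(manifest, indent=2))
```

Generating the manifest from one function keeps the issuer identical across all clusters, which matters when the same annotation is applied everywhere.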

Load Balancers (AWS/Azure/GCP)

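For AWS, they leveraged ACM (AWS Certificate Manager), which automatically renews the certificates it issues:

# Request a DNS-validated certificate (attaching it to the load balancer is a separate step)
aws acm request-certificate \
  --domain-name example.com \
  --validation-method DNS \
  --subject-alternative-names "*.example.com"

# ACM renews automatically while the certificate is in use and the DNS validation record stays in place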

Legacy VMs

For the remaining traditional servers, they deployed Ansible playbooks:

# renew-certificates.yml
---
- hosts: web_servers
  become: true
  tasks:
    - name: Install certbot
      apt:
        name: certbot
        state: present
        update_cache: true

    # The deploy hook runs only when a certificate was actually renewed,
    # so nginx is not reloaded on every playbook run
    - name: Renew certificates
      command: certbot renew --quiet --deploy-hook "systemctl reload nginx"
      changed_when: false

    - name: Setup cron job
      cron:
        name: "Certbot renewal"
        minute: "0"
        hour: "3"
        job: 'certbot renew --quiet --deploy-hook "systemctl reload nginx"'

Results After 6 Months

The transformation delivered measurable results:

Metric	Before	After	Change
Incidents/year	3-4	0	-100%
Management time/month	40h	2h	-95%
Automated certificates	0%	98%	+98 pp
Average renewal time	45 min	0 min	-100%
Expiration alerts	0	237	+∞

The financial impact was substantial. Certificate costs dropped by $15,300 annually. The 38-hour monthly reduction in manual work (from 40h to 2h) saved $6,080 per month at an average rate of $160/hour. Combined, that’s $88,260 in annual savings. They also avoided the $47,000+ in losses from downtime that had plagued them previously.
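The savings figure can be checked with a few lines of Python. All inputs are the numbers quoted in this case study. The variable names are illustrative.

```python
# Annual fees previously paid to each commercial CA (USD)
CA_COSTS = {"DigiCert": 8_500, "Sectigo": 3_200, "GoDaddy": 2_100, "RapidSSL": 1_500}
HOURS_BEFORE, HOURS_AFTER = 40, 2   # manual work per month
HOURLY_RATE = 160                   # average rate in USD

certificate_savings = sum(CA_COSTS.values())                         # 15,300 per year
monthly_labor_savings = (HOURS_BEFORE - HOURS_AFTER) * HOURLY_RATE   # 6,080 per month
annual_savings = certificate_savings + 12 * monthly_labor_savings    # 88,260 per year

print(f"Annual savings: ${annual_savings:,}")
```

Labor accounts for $72,960 of the total, which is why the time reduction matters more than the certificate fees.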

Quality improved across the board: 100% SSL/TLS uptime, zero customer complaints about certificates, and better search rankings once certificate errors stopped occurring.

The teams noticed too. DevOps said they stopped worrying about certificates because everything now works automatically. Security finally gained full visibility into what’s happening. And the business appreciated that SSL incidents disappeared, keeping customers happier.

Lessons Learned

Several insights emerged from this transformation. Inventory came first: you can’t automate what you don’t know exists. Quick wins like monitoring provided immediate ROI and built momentum. The phased rollout (Kubernetes first, then cloud load balancers, finally legacy VMs) prevented overwhelming the team. Managing certificates as code, through Helm charts, Ingress annotations, and Ansible playbooks, kept the setup repeatable across clusters and servers. And combining internal monitoring with external tools like CrtMgr provided 24/7 coverage.

Looking back, they’d have started earlier—the cost of delay was high. More staging tests would have caught a few edge cases that surfaced in production. Better documentation would have eased onboarding new team members. And earlier communication with all teams would have prevented some surprised reactions to changes.

Final Architecture

┌─────────────────────────────────────┐
│            Applications             │
│     K8s      │  Cloud  │ Legacy VMs │
└──────────────┴─────────┴────────────┘
       │            │          │
       ▼            ▼          ▼
┌─────────────────────────────────────┐
│       Certificate Management        │
│ cert-manager │   ACM   │  certbot   │
└─────────────────────────────────────┘
       │            │          │
       ▼            ▼          ▼
┌─────────────────────────────────────┐
│            Let's Encrypt            │
└─────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────┐
│        Monitoring & Alerting        │
│  Prometheus  │ Grafana │   CrtMgr   │
└─────────────────────────────────────┘

The entire transformation took 4.5 months with a budget of $12,000 ($2,000 for tools and licenses, $10,000 in DevOps time). The ROI period? Two months, thanks to the immediate savings.
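The two-month payback claim follows from dividing the budget by the monthly savings. Here is a quick check in Python, using only figures from this case study:

```python
import math

BUDGET = 12_000           # $2,000 tools and licenses + $10,000 DevOps time
ANNUAL_SAVINGS = 88_260   # certificate fees plus labor, from the results above

monthly_savings = ANNUAL_SAVINGS / 12        # 7,355 per month
payback_months = BUDGET / monthly_savings    # roughly 1.63

print(f"Payback: {payback_months:.2f} months, "
      f"i.e. within {math.ceil(payback_months)} months")
```

This counts only recurring savings. Including the avoided downtime losses would shorten the payback further.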

SSL/TLS certificate automation eliminates downtime (zero incidents in 6 months), saves time (95% reduction in management overhead), saves money ($88K+ annually), increases security through better visibility and control, and improves morale by letting teams focus on business value instead of certificate firefighting.

If you’re starting a similar transformation, the sequence matters: inventory first (you can’t automate what you don’t know exists), then monitoring, then automation. Start with your Kubernetes workloads using cert-manager — it gives the fastest wins with the least risk. Then tackle cloud load balancers, and leave legacy VMs for last.

The $12,000 investment paid back in two months. The $47,000 afternoon that triggered the whole project? That never happened again.

From chaos to automation in 4.5 months. The hardest part wasn’t the technology; it was doing the inventory.