Living with Let's Encrypt rate limits at scale
We accidentally hit Let's Encrypt's per-week certificate limit. Here's what happened, what we learned about Certificate Transparency logs, and the alerting we wired up afterwards.
What went wrong
A misconfigured deployment loop was deleting and re-issuing certs on every container restart instead of reusing the cached ones, and the container was restarting roughly every 10 minutes because of an unrelated health-check bug. The math: one issuance every 10 minutes is 6 per hour, or 144 per day. At that rate we hit the 50-per-week limit in about 8 hours.
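To make the arithmetic easy to re-run with other numbers, here is a small sketch. The 10-minute interval and the limit of 50 come from the incident above; the function name is ours, purely for illustration.

```python
from datetime import timedelta


def time_to_exhaust(limit: int, interval: timedelta) -> timedelta:
    """Time until `limit` issuances have happened at one issuance per `interval`."""
    return limit * interval


restart_interval = timedelta(minutes=10)  # roughly one restart (and one issuance) per 10 minutes
weekly_limit = 50                         # the per-week limit we ran into

# How many issuances fit in a day, and how many hours until the limit is gone.
per_day = timedelta(days=1) // restart_interval
hours_to_limit = time_to_exhaust(weekly_limit, restart_interval) / timedelta(hours=1)

print(per_day)                   # 144
print(round(hours_to_limit, 1))  # 8.3
```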
By the time we noticed, issuance was blocked until the earliest of those certs aged out of the rolling 7-day window. We had to fall back to a self-signed cert temporarily and tell the small subset of affected users to ignore the browser warning, which is a bad story.
How we caught it (eventually)
Certificate Transparency monitoring would have caught this within 30 minutes, but we had none. We have since set up alerting via crt.sh: any unexpected cert issuance for our domains fires a Slack alert.
The free crt.sh service is fine for small fleets. If you have a lot of domains, pay for a managed CT monitor. They're cheap and worth it.
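For a sense of what a crt.sh-based check can look like, here is a minimal sketch, not our production code. It assumes crt.sh's JSON output (`?q=...&output=json`) and its `serial_number` field. The function names and the idea of keeping a set of already-seen serials are illustrative assumptions. Since a single issuance can show up as more than one log entry, the sketch de-duplicates by serial number.

```python
import json
import urllib.parse
import urllib.request

CRTSH_URL = "https://crt.sh/"


def fetch_crtsh(domain: str, timeout: float = 30.0) -> list:
    """Fetch CT log entries for `domain` from crt.sh as a list of dicts."""
    query = urllib.parse.urlencode({"q": domain, "output": "json"})
    with urllib.request.urlopen(f"{CRTSH_URL}?{query}", timeout=timeout) as resp:
        return json.load(resp)


def new_issuances(entries: list, seen_serials: set) -> list:
    """Return one entry per certificate serial that is not in `seen_serials`.

    Entries without a serial number are skipped; repeated serials within
    `entries` are reported only once (first occurrence wins).
    """
    fresh = {}
    for entry in entries:
        serial = entry.get("serial_number")
        if serial and serial not in seen_serials and serial not in fresh:
            fresh[serial] = entry
    return list(fresh.values())
```

Run on a schedule, something like `new_issuances(fetch_crtsh("example.com"), seen_serials)` returns the certs you have not seen before. Anything in that list that your own renewal service did not log is worth an alert. Persisting `seen_serials` and posting to Slack are left out here.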
What we added afterwards
A metric on the cert-renewal service tracking issuance rate per hour, with an alert that fires above a threshold long before we get anywhere near the LE limit. Embarrassingly simple, and it should have been there from day one.
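The core of such a metric is a sliding-window counter. The sketch below is a simplified stand-in for what runs in our renewal service. The class name and the threshold of 3 per hour are illustrative assumptions, so pick a threshold that fits your normal renewal volume.

```python
from collections import deque


class IssuanceRateMonitor:
    """Counts issuances in a trailing time window and flags when over threshold."""

    def __init__(self, threshold: int, window_seconds: float = 3600.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = deque()  # issuance timestamps, oldest first

    def record(self, timestamp: float) -> bool:
        """Record an issuance at `timestamp` (seconds).

        Returns True if the number of issuances inside the trailing window,
        including this one, now exceeds the threshold.
        """
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have fallen out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.threshold


# Replaying the incident: one issuance every 10 minutes.
monitor = IssuanceRateMonitor(threshold=3)
alerts = [monitor.record(t) for t in (0, 600, 1200, 1800, 2400)]
print(alerts)  # [False, False, False, True, True]
```

With that threshold the alert would have fired on the fourth issuance, 30 minutes in, rather than after the limit was already gone.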
A health check on the container that distinguishes "starting normally" from "thrashing-restart". The health-check bug itself is fixed too.
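One way to tell the two states apart is to count recent restarts. This is a sketch under assumptions, not our exact check: it presumes restart timestamps are persisted somewhere that survives a restart (a mounted volume, for example), and the function name, the 30-minute window and the limit of 2 restarts are all hypothetical.

```python
def classify_startup(restart_times, now, window_seconds=1800.0, max_restarts=2):
    """Classify a startup as "starting" or "thrashing".

    `restart_times` are timestamps (seconds) of this and previous restarts.
    More than `max_restarts` restarts inside the trailing window counts as
    thrashing.
    """
    recent = [t for t in restart_times if now - window_seconds < t <= now]
    return "thrashing" if len(recent) > max_restarts else "starting"


print(classify_startup([0], now=0))                       # starting
print(classify_startup([0, 600, 1200, 1800], now=1800))   # thrashing
```

A container that classifies itself as thrashing can refuse to do expensive or rate-limited work, such as requesting a new cert, until someone has looked at it.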
Lesson
External services that have rate limits as a defense against abuse will, eventually, defend themselves against you when something local goes wrong. Monitor your own usage of those services as if they were a downstream dependency you didn't control. Because they are.