Note: All times referenced below are Coordinated Universal Time (UTC).
Incident Summary and Timeline
21:19 - CloudFlare had an outage affecting DNS resolution in several locations. This resulted in Pingdom detecting outages on thousands of sites, as well as alerts about access to my Pingdom (which relies on CloudFlare for DNS). Pingdom SRE (Site Reliability Engineering) began investigating the issue.
21:45 - Pingdom SRE noticed a buildup in our alerting queue due to the massive number of alerts being sent for the outages; capacity was added to process the alerts faster. A decision was made not to drop or purge any alerts but instead to let them all process.
The uptime view in my Pingdom also showed all uptime checks in the "Unknown" status. Due to the volume of results these status updates were also delayed, but monitoring continued, and the status of each check could still be viewed in its individual uptime report.
22:39 - The average alert delay was around 20 minutes, though some alerts took over 1 hour to process. At this point the queue needed about 25 more minutes to clear completely.
23:05 - All alerts were back to real-time with no delay, and the uptime view in my Pingdom showed the correct status of all checks.
Between 21:19 and 23:05 on Friday, July 17th, a DNS/network issue at CloudFlare (the third-party DNS provider for my Pingdom; this affects access to my Pingdom only, not Pingdom's monitoring of DNS) caused thousands of websites to suffer DNS issues. These issues were all correctly identified as outages by Pingdom, but their sheer number caused a queue to build up in Pingdom alerting. Additionally, the uptime view in my Pingdom could not process the current status of all checks, so they showed as Unknown on the Uptime page in my Pingdom.
Pingdom SRE added extra capacity to process alerts, but even with the added capacity, some Pingdom alerts were delayed by up to 1 hour. By 23:05 the queue had cleared, all alerts caused by the outage had been sent, and new alerts were processed in real-time again.
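As a rough illustration of why adding capacity shortens a backlog like this, the drain time of a queue can be estimated from the arrival rate and per-worker throughput. All numbers and names below are hypothetical, not figures from this incident:

```python
# Back-of-the-envelope model of draining an alert backlog.
# Every number here is hypothetical and for illustration only.

def drain_time_minutes(backlog, arrival_per_min, workers, per_worker_per_min):
    """Minutes to empty the queue, assuming constant rates.

    The queue only shrinks when total processing capacity exceeds the
    incoming alert rate; otherwise it grows without bound.
    """
    capacity = workers * per_worker_per_min
    if capacity <= arrival_per_min:
        return float("inf")  # queue keeps growing
    return backlog / (capacity - arrival_per_min)

# Example: 50,000 queued alerts, 1,000 new alerts arriving per minute.
before = drain_time_minutes(50_000, 1_000, 4, 300)   # 4 workers -> 250 min
after = drain_time_minutes(50_000, 1_000, 10, 300)   # 10 workers -> 25 min
print(before, after)  # → 250.0 25.0
```

The model also shows the failure mode the team avoided: with too few workers, capacity never exceeds the arrival rate and the queue would never have cleared on its own.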
The impact to Pingdom customers was threefold:
- Access to my Pingdom was impossible for anyone who had not already cached the site's DNS records
- Alerts were delayed
- Even those who could access my Pingdom could not see the current status of all their uptime checks
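The first failure mode above, a hostname that simply stops resolving, can be reproduced with a quick resolution check. This is a minimal sketch, not how Pingdom probes DNS, and `example.com` stands in for any site whose DNS was served by CloudFlare:

```python
import socket

def resolves(hostname):
    """Return True if the hostname currently resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

# During the outage this would have returned False for affected sites
# unless the client still had the records cached locally.
print(resolves("example.com"))
```

A client with the records already in its local DNS cache would keep connecting normally even while fresh lookups failed, which is why only some users lost access to my Pingdom.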
Recovery and Lessons Learned
Adding extra capacity significantly increased the number of alerts that could be processed, but the delay was still very long in some cases. Pingdom Engineering and SRE are investigating all affected systems to determine what can be done to handle an incident like this better.
If you are unsure whether you were impacted by this incident, please contact us.