Post Mortem: My Pingdom access, uptime checks list, and delayed alerting issue on Friday, July 17th, 2020

Note: All times referenced below are Coordinated Universal Time (UTC).

Incident Summary and Timeline

Timeline

21:19 - CloudFlare had an outage affecting DNS resolution in several locations. This resulted in Pingdom detecting outages on thousands of sites, as well as issues with access to my Pingdom (which relies on CloudFlare for DNS). Pingdom SRE (Site Reliability Engineering) started to investigate the issue.

21:37 - CloudFlare updated their status page with full details of the ongoing outage.

21:45 - Pingdom SRE noticed a buildup in our alerting queue due to the massive number of alerts being sent for the outages, and capacity was added to process the alerts faster. A decision was made not to drop or purge any alerts, but instead to allow them all to be processed.

The uptime view in my Pingdom also showed all uptime checks in the "Unknown" status. Due to the number of results, the statuses shown there were also delayed, but monitoring continued and the current status could still be viewed in the individual uptime reports.

22:39 - The average alert delay was around 20 minutes, but some alerts took over 1 hour to process. At this point the queue was expected to take another 25 minutes to clear completely.

23:05 - All alerts were back to real-time with no delay, and the uptime view in my Pingdom showed the correct status of all checks.

Summary

Between 21:19 and 23:05 on Friday, July 17th, a DNS/network issue at CloudFlare (the third-party DNS supplier for my Pingdom; this affects access to my Pingdom only, not Pingdom's DNS monitoring) caused thousands of websites to suffer from DNS issues. These issues were all correctly identified as outages by Pingdom, but the sheer number of them caused a queue to build up in Pingdom alerting. Additionally, the uptime view in my Pingdom could not process the current status of all checks, so they showed as Unknown on the Uptime page in my Pingdom.
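
Pingdom has not published how the Uptime page decides when to show "Unknown", but the behavior described above is consistent with a dashboard that falls back to an unknown state whenever a check's most recent processed result is older than some staleness threshold. A minimal illustrative sketch of that idea (the function name and the 5-minute threshold are assumptions, not Pingdom internals):

    from datetime import datetime, timedelta, timezone

    # Hypothetical staleness threshold; the real value used by the Uptime page is not published.
    STALE_AFTER = timedelta(minutes=5)

    def display_status(last_result_status: str, last_result_time: datetime) -> str:
        """Status to show on a dashboard for a single uptime check.

        If the most recent processed result is older than the staleness threshold
        (for example because result processing is backed up), fall back to
        "unknown" rather than showing a possibly outdated up/down state.
        """
        if datetime.now(timezone.utc) - last_result_time > STALE_AFTER:
            return "unknown"
        return last_result_status

    # A result processed 20 minutes ago is shown as unknown even if it said "up".
    print(display_status("up", datetime.now(timezone.utc) - timedelta(minutes=20)))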

Pingdom SRE added extra capacity to process alerts, but even with the added capacity, some Pingdom alerts were delayed by up to 1 hour. By 23:05 the queue had cleared, all alerts caused by the outage had been sent, and new alerts were being processed in real-time again.

Impact

The impact to Pingdom was three-fold:

  1. Access to my Pingdom was impossible for those who had not cached the site's DNS (see the caching sketch after this list)
  2. Alerts were delayed
  3. Even those who could access my Pingdom could not see the current status of all their uptime checks
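
For context on item 1: resolvers and operating systems cache DNS answers for the record's TTL, so anyone who had resolved the my Pingdom hostname shortly before the outage could keep reaching the site until the cached entry expired. A toy sketch of TTL-based caching (all names are hypothetical; this is not CloudFlare's or Pingdom's code):

    import time

    class CachingResolver:
        """Toy resolver that caches answers for the record's TTL."""

        def __init__(self, upstream_lookup):
            self.upstream_lookup = upstream_lookup  # function: hostname -> (ip, ttl_seconds)
            self.cache = {}                         # hostname -> (ip, expires_at)

        def resolve(self, hostname):
            entry = self.cache.get(hostname)
            if entry and entry[1] > time.time():
                return entry[0]                       # cached answer still valid; no upstream query needed
            ip, ttl = self.upstream_lookup(hostname)  # this step fails while the upstream DNS is down
            self.cache[hostname] = (ip, time.time() + ttl)
            return ip

A client with a warm cache keeps getting the stored answer until the TTL expires; a client with a cold cache hits the failing upstream lookup, which matches the access pattern described in item 1.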

Recovery and Lessons learned

Adding extra capacity significantly increased the number of alerts that could be processed, but the delay was still very long in some cases. Pingdom Engineering and SRE are investigating all affected systems to see what can be done to handle an incident like this better.
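
Pingdom has not described the internals of its alerting pipeline, but the mitigation above follows a familiar pattern: alerts accumulate in a queue, and adding consumers raises throughput so the backlog drains faster while nothing is dropped. A minimal sketch of that pattern (all names are illustrative assumptions):

    import queue
    import threading

    alert_queue = queue.Queue()

    def send_alert(alert):
        # Stand-in for the real delivery step (email, SMS, webhook, ...).
        print("sent:", alert)

    def worker():
        while True:
            alert = alert_queue.get()   # blocks until an alert is available
            send_alert(alert)           # every alert is processed; nothing is dropped or purged
            alert_queue.task_done()

    def add_capacity(num_workers):
        # Start extra consumers so a backlog drains faster.
        for _ in range(num_workers):
            threading.Thread(target=worker, daemon=True).start()

    add_capacity(2)                          # normal capacity
    for i in range(1000):                    # a flood of alerts builds a backlog
        alert_queue.put("site-%d is DOWN" % i)
    add_capacity(8)                          # the mitigation: more consumers, faster drain
    alert_queue.join()                       # wait until every queued alert has been sent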

If you are unsure whether you were impacted by this or not, please contact us.
