
Post Mortem: Uptime checks list and delayed alerting issue on Sunday, August 30th, 2020

Note: All times referenced below are Coordinated Universal Time (UTC).

Incident Summary and Timeline

Timeline

10:12 - Pingdom Site Reliability Engineering (SRE) receives an alert about unprocessed Pingdom check results.

10:18 - Additional internal alerts related to several internal services are sent to Pingdom SRE, but at this point it is not believed that Pingdom users are affected.

11:28 - A queue of alerts to Pingdom users has built up, and it becomes apparent that the issue is affecting Pingdom users after all.

From our public status page:

 

11:41:09 UTC

At 10:12 UTC, Pingdom SRE was alerted to an issue with the second opinion functionality in Pingdom Uptime checks.

It later became apparent that this caused an alerting queue to build up, and that alerts from 10:12 UTC onward may be delayed by up to 25 minutes.

The issue has been resolved, the queue is shrinking rapidly, and delays will become shorter.

We apologize for not notifying you about this issue earlier, but we will update you as soon as alerts are being processed in real-time again.

12:57:05 UTC

An additional side effect of this issue is that uptime checks in My Pingdom and through API endpoints (/checks more specifically) will return the status Unknown until real-time processing of alerts and outages is back to normal.

Alert delays are still present, currently estimated at around 25 minutes.

13:15:07 UTC

While services are slowly recovering, we have discovered a chain reaction issue causing some delays in Transaction Check alerts and page speed check results.

No ETA on when services will be fully restored at this time but we will continue to update here as soon as more information is available.

14:32:10 UTC

Due to a major outage to various internet services, the alerting problem has only grown.

At this time, the alert queue has grown to the point where delays exceed 2 hours and 30 minutes. In order to send accurate alerts again, we have decided to purge the alert queue.

Alerting and monitoring are once again processing in real-time.

We continue to monitor the situation and will provide updates.

14:52:20 UTC

All services back to normal.

The API and My Pingdom are reporting the current status of uptime and transaction checks.

Alerting and monitoring are working in real-time again.

A post mortem will be made available once our internal investigation is complete.

 

 

Summary

Between 10:12 and 14:52, Pingdom was affected by a more extensive outage on the Internet. Cloudflare has written an excellent blog post going into the details of what happened and how it impacted them and the rest of the Internet.

Pingdom Site Reliability Engineering was alerted to the issue and attempted to mitigate the problem, but we eventually had to purge a number of alerts to get real-time monitoring working again.

 

Impact

The impact on Pingdom was two-fold:

  1. Alert delays and eventually missed alerts 
  2. Our uptime views showed the status Unknown while results waited to be processed. This also affected the /checks/ endpoint in our API, where the status remained Unknown
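
For API consumers, the second point means that Unknown is best read as "no fresh result yet" rather than as an outage. The sketch below shows one way a client could separate such checks from the rest. It is an illustration only: it works on an already-parsed response, and the payload shape (a `checks` list whose items carry `name` and `status`) and the helper names `normalize_status` and `summarize_checks` are assumptions, not a description of Pingdom's own tooling.

```python
from collections import Counter


def normalize_status(check):
    """Return the check's status in lowercase; a missing status counts as unknown."""
    return str(check.get("status", "unknown")).lower()


def summarize_checks(payload):
    """Count checks per status and list the names of checks reporting unknown.

    `payload` is assumed to be the parsed JSON body of a /checks response,
    shaped like {"checks": [{"name": ..., "status": ...}, ...]}.
    """
    checks = payload.get("checks", [])
    counts = Counter(normalize_status(c) for c in checks)
    stale = [c.get("name", "<unnamed>") for c in checks
             if normalize_status(c) == "unknown"]
    return counts, stale


# Hypothetical sample data, not real Pingdom output.
sample = {"checks": [
    {"name": "homepage", "status": "up"},
    {"name": "api", "status": "unknown"},
    {"name": "checkout", "status": "Unknown"},
]}
counts, stale = summarize_checks(sample)
```

A dashboard built on this could show the checks in `stale` as "awaiting data" instead of raising its own alarm for them.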

 

Recovery and Lessons learned

Adding extra capacity significantly increased the number of alerts that could be processed. However, the delay was still very long, and the alert queue eventually had to be purged, resulting in missed alerts.

Additionally, we have changed our processes so that we notify customers faster about issues that can affect them and update our status page more often.
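
To illustrate the trade-off between late alerts and missed alerts, here is a minimal sketch of an age-based drain. Note that this is a different technique from what happened in this incident, where the whole queue was purged: here only alerts older than a cutoff are dropped, and fresher ones are still delivered. The function name `drain_fresh`, the `(created_at, alert)` tuple layout, and the cutoff value are all hypothetical.

```python
from collections import deque


def drain_fresh(queue, now, max_age_seconds):
    """Empty the queue in FIFO order, splitting alerts by age.

    `queue` holds (created_at, alert) tuples with created_at in seconds.
    Alerts older than `max_age_seconds` are dropped instead of being
    delivered late; the rest are returned for sending.
    """
    to_send, dropped = [], []
    while queue:
        created_at, alert = queue.popleft()
        if now - created_at > max_age_seconds:
            dropped.append(alert)
        else:
            to_send.append(alert)
    return to_send, dropped


# Hypothetical backlog: two stale alerts and one recent alert.
backlog = deque([(0, "a"), (100, "b"), (290, "c")])
to_send, dropped = drain_fresh(backlog, now=300, max_age_seconds=60)
```

Either approach gives up some alerts in exchange for accurate, timely ones; the age-based variant only narrows which alerts are lost.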

We're aware that similar incidents have happened in the past, and the work continues to ensure Pingdom will work properly, even in extreme circumstances.

If you have any additional questions about this, please reach out here
