Post Mortem: Uptime Checks Issue on Wednesday, March 18th, 2020

Note: All times referenced below are Coordinated Universal Time (UTC).

Incident Summary and Timeline

Timeline

13:23 - Deploy of updated uptime probes initiated to support newer TLS versions

13:40 - Deploy complete in production

13:55 - Pingdom SRE and Customer Support determine that a subset of Uptime checks report the status Unknown

14:04 - Rollback of the update determined as the fastest solution to the issue

14:09 - Rollback initiated

14:13 - Rollback complete

14:29 - Root Cause Analysis started internally to determine what went wrong and how to prevent it
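The intervals between these timestamps can be worked out with a few lines of Python. This is only an illustrative sketch: the dictionary keys and the helper names `utc` and `minutes_between` are made up for this example, and the times are copied from the timeline above.

```python
from datetime import datetime


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM timestamp from the timeline (all times are UTC, 2020-03-18)."""
    return datetime.strptime("2020-03-18 " + hhmm, "%Y-%m-%d %H:%M")


# Timestamps copied from the timeline; the key names are hypothetical labels.
TIMELINE = {
    "deploy_initiated": utc("13:23"),
    "deploy_complete": utc("13:40"),
    "issue_confirmed": utc("13:55"),
    "rollback_decided": utc("14:04"),
    "rollback_initiated": utc("14:09"),
    "rollback_complete": utc("14:13"),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two timeline events."""
    delta = TIMELINE[end] - TIMELINE[start]
    return int(delta.total_seconds() // 60)


print(minutes_between("deploy_initiated", "deploy_complete"))      # 17
print(minutes_between("deploy_complete", "issue_confirmed"))       # 15
print(minutes_between("rollback_initiated", "rollback_complete"))  # 4
print(minutes_between("deploy_complete", "rollback_complete"))     # 33
```

The last figure, 33 minutes, is the window from deploy completion to rollback completion.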

Summary

Between 13:40 and 14:13 on Wednesday, March 18th, all Pingdom users with an HTTP check that is redirected from an insecure to a secure protocol saw the status Unknown in their check list or uptime report. The vast majority of these checks target pages that redirect HTTP to HTTPS (port 80 to port 443).

The event was triggered by an update to Uptime monitoring protocols that finished deploying to our production environment at 13:40.

The update extended support to newer TLS versions.

A bug in this code caused checks that redirect from an insecure to a secure page to be interpreted as Unknown by our probe servers.
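To see whether one of your own checks fits the affected pattern, you can classify its redirect chain. The sketch below is a hypothetical helper and not Pingdom's probe code: it assumes you already have the list of URLs visited for a check, in order, and the function name `upgrades_to_https` is invented for this example.

```python
from urllib.parse import urlsplit


def upgrades_to_https(redirect_chain: list[str]) -> bool:
    """Return True if the chain starts on http:// and a later hop uses https://.

    `redirect_chain` is the ordered list of URLs visited, starting with the
    URL configured in the check. This representation is an assumption made
    for illustration only.
    """
    schemes = [urlsplit(url).scheme.lower() for url in redirect_chain]
    if not schemes or schemes[0] != "http":
        return False
    return "https" in schemes[1:]


# A check on an HTTP URL that is redirected to HTTPS matches the pattern.
print(upgrades_to_https(["http://example.com/", "https://example.com/"]))  # True

# A check pointed directly at the secure URL does not.
print(upgrades_to_https(["https://example.com/"]))  # False
```

Checks for which this returns True correspond to the insecure-to-secure redirect setup described above.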

The event was detected by the Pingdom Site Reliability Engineering team and by customers logged in to Pingdom. We confirmed the issue at 13:55.

A decision to roll back the update was made at 14:04, and at 14:13 the rollback was confirmed as complete.

Impact

About 1 in 5 uptime checks in Pingdom have this setup, and for the duration of the incident these checks did not monitor or alert on any outages. We have determined that most customers with this setup had additional monitoring pointed directly at the secure version of the same URLs, so the chance that a real outage was missed and not alerted on is minimal.

The reports for these checks will, however, contain a number of Unknown results during the time period.

Recovery and Lessons Learned

Pingdom Uptime checks run from a global network of probe servers in different datacenters, so both the deploy and the rollback take a few minutes. Once the rollback was complete, each individual uptime check also had to run again before it could report a status other than Unknown. Almost all checks run on a one-minute interval, but those that run less frequently will show the Unknown status for longer in their respective reports.
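As a rough illustration of how the check interval affects the reports, the sketch below computes an upper bound on when a check's Unknown status would have cleared. It rests on simplifying assumptions that are not from the incident data: that the rollback reached every probe at exactly 14:13, and that each check runs on a fixed schedule, so its next run comes at most one interval after the rollback. The function name `latest_unknown_end` is hypothetical.

```python
from datetime import datetime, timedelta

# Rollback completion time from the timeline (UTC).
ROLLBACK_COMPLETE = datetime(2020, 3, 18, 14, 13)


def latest_unknown_end(interval_minutes: int) -> datetime:
    """Upper bound on when a check reports a status other than Unknown.

    Assumes (for illustration only) that the check's last Unknown result was
    recorded just before the rollback completed, so the next run happens at
    most one full interval later.
    """
    return ROLLBACK_COMPLETE + timedelta(minutes=interval_minutes)


for interval in (1, 5, 60):
    cleared_by = latest_unknown_end(interval).strftime("%H:%M")
    print(f"{interval}-minute interval: Unknown cleared by {cleared_by} UTC")
# 1-minute interval: Unknown cleared by 14:14 UTC
# 5-minute interval: Unknown cleared by 14:18 UTC
# 60-minute interval: Unknown cleared by 15:13 UTC
```

Under these assumptions a one-minute check recovers almost immediately, while an hourly check could show Unknown for up to an hour after the rollback.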

This issue has led to new quality assurance protocols as well as new testing scenarios being performed before any update is made to Uptime monitoring.

We appreciate your patience with this. Please subscribe to updates on status.pingdom.com to be alerted to any ongoing Pingdom issues, or visit the page to view the current status.

If you are unsure whether you were impacted by this or not, please contact us.
