Reducing Incident Escalations at Basecamp

Incidents escalated to PagerDuty are generally “all hands on deck”.

tl;dr: We had fewer actual site-down interruptions and fewer false-alarm escalations in 2015.

Here’s a non-exhaustive list of factors that contributed to these improvements over the years:

  • Eliminating scheduled maintenance that would take a site offline
  • Limiting API abuse (and a general decrease in the number of abuse incidents)
  • Automated blocking of other common abuse traffic
  • Fairly generous ongoing hardware refresh with better distribution across cabinets
  • Completely new core and top-of-rack network switches
  • Hiring the right people (and the right number of people)
  • Moving to more stable storage (EMC / Isilon to Cleversafe)
  • Taking control of our public Internet connectivity and routing (our own IP space, our own routers, carefully selected providers, filtering traffic)
  • Right-sizing database hardware for every major application
  • Better development / deployment practices and consistency in following those practices (local tests / CI, staging, rollout, production)
  • Practicing incident response and keeping playbooks up to date
  • Vastly improved metrics and dashboards
  • Better application architecture and design choices with regard to availability and failure modes
  • Being ruthless in our tuning of internal monitoring and alerting (Nagios) to only escalate alerts that really need to be escalated
  • (Full disclosure: we actually had more incidents escalated from our internal monitoring this year. The “quality” of those escalations is higher, though.)
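
To make the point about escalation tuning concrete, here is a minimal sketch of an escalation filter. It is purely illustrative and is not the actual Nagios configuration: the `Alert` fields, the `should_escalate` function, and the threshold of three consecutive failed checks are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert record; field names are assumptions for this sketch."""
    service: str
    severity: str              # "warning" or "critical"
    consecutive_failures: int  # failed checks in a row
    customer_facing: bool      # does the failure affect customers directly?


def should_escalate(alert: Alert, min_failures: int = 3) -> bool:
    """Return True only for alerts that plausibly need a human right now.

    Assumed policy: page someone only when the alert is critical,
    customer-facing, and has persisted across several checks.
    """
    if alert.severity != "critical":
        return False
    if not alert.customer_facing:
        return False
    return alert.consecutive_failures >= min_failures


# A single failed check on an internal host stays out of the pager.
blip = Alert("internal-batch", "critical", 1, customer_facing=False)
# A sustained, customer-facing failure gets escalated.
outage = Alert("web-frontend", "critical", 4, customer_facing=True)

print(should_escalate(blip))    # False
print(should_escalate(outage))  # True
```

The idea is that everything failing the filter is still recorded and visible on dashboards; it just does not page anyone.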