The cost of downtime to the business, its reputation, and customer experience and trust has never been higher. Given the always-on, connected nature of software-driven businesses, customers and users have grown less forgiving and more fickle with their attention. An outage in a single service can impact all of its users. An outage in a multi-tenant platform has a multiplied impact, because it affects the users of every service provider running on the platform.
Balancing preparedness for a black swan event against minor downtime events
As enterprises design their disaster recovery solutions, it is easy to focus only on preventing the big disasters and outages. These are the “black swan” events that have an incredibly large, almost devastating impact on service availability. The impact can be wide ranging: it can extend both the length of time the service is out of commission and the amount of data that is lost. As big as these events are, the impact of minor but frequent downtime cannot be ignored.
Enterprises need to pay attention to identifying, detecting and preventing these smaller outages, which can occur more frequently. These small downtimes can add up over the course of a year and completely topple service availability targets and goals. There are several options available for disaster recovery, ranging from on-premises solutions to cloud-based solutions that leverage infrastructure and platform capabilities offered by major cloud providers such as AWS, GCP and Microsoft Azure.
Cost of small downtime events
The cost of such minor downtimes can easily add up. Frequent downtimes increase the likelihood that a larger number of users are impacted. In addition, the likelihood of the same user being impacted repeatedly across outages also increases. Such frequent downtimes can erode trust in the service. Even if customers do not abandon the service immediately, the impact of repeated downtimes can be felt at renewal time: the customer may decline to expand the size of the engagement, or may even decide not to renew it. SaaS businesses that depend on monthly recurring revenue or annual recurring revenue are extremely susceptible to the impact of frequent, minor downtimes.
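To make the "adds up" point concrete, here is a minimal sketch of the arithmetic. The 99.9% target and the pattern of one 15-minute outage per week are assumptions chosen for illustration, not figures from any survey.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability_target: float) -> float:
    """Minutes of downtime per year allowed by a target such as 0.999."""
    return MINUTES_PER_YEAR * (1.0 - availability_target)


def achieved_availability(outage_minutes: list[float]) -> float:
    """Fraction of the year the service was up, given each outage's length."""
    return 1.0 - sum(outage_minutes) / MINUTES_PER_YEAR


# Hypothetical year: one 15-minute outage every week and no major disaster.
weekly_outages = [15.0] * 52
target = 0.999  # assumed "three nines" availability goal

budget = downtime_budget_minutes(target)
used = sum(weekly_outages)

print(f"Allowed downtime: {budget:.1f} min, actual downtime: {used:.1f} min")
print(f"Achieved availability: {achieved_availability(weekly_outages):.4%}")
print("Target met" if used <= budget else "Target missed")
```

Under these assumptions the service accumulates 780 minutes of downtime against a budget of roughly 526 minutes, so the annual target is missed without a single black swan event.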
Key capabilities for developing resiliency
Enterprises looking to develop resiliency against both major and minor downtime events should focus on developing and maintaining the following capabilities:
All key systems that serve traffic should be continuously backed up. In addition to these services being designed in a RESTful manner, the data they generate, update and maintain should be continuously backed up to a local, centralized or cloud-based disaster recovery system. Backups should be as frequent as possible without degrading the service quality and performance of the system. At the same time, backups should be both incremental and snapshot-based, to offer the flexibility to recover from downtime of any type or size. In addition, backups should be multi-level to ensure that the backup system is not impacted by the same outage that is impacting the primary system.
All key systems that serve traffic should also be continuously monitored. This is critical to ensure that outages are detected as soon as possible and that disaster recovery is put in motion immediately. As with backups, monitoring needs to run on a system that is not impacted by the same outage that has hit the primary service. In parallel, customer feedback channels also need to be monitored for service outage reports. As soon as reports begin arriving or the monitoring system alerts to an outage, the outage should be confirmed and disaster recovery should be put in motion.
Once a disaster has been detected, reported and confirmed, a failover process should be initiated that spins up new servers able to continue servicing traffic. This is done by ensuring that the new servers take on the roles of the servers impacted by the downtime.
The failover servers should be configured to access the backups that contain the state and information required to serve the traffic.
When the downtime is over and the underlying issues in the primary service environment have been diagnosed, fixed and confirmed fixed, a failback process should revert all services to the primary environment. Once the failback has been confirmed successful, the failover servers can be reclaimed and destroyed.
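The detect, confirm, failover and failback steps described above can be sketched as a small state machine. This is only an illustration under stated assumptions: `RecoveryCoordinator`, `FakePrimary`, the health probe, the backup callable and the `web`/`api` roles are all hypothetical names, and real provisioning and traffic routing are replaced by in-memory stand-ins.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Mode(Enum):
    PRIMARY = "primary"
    FAILOVER = "failover"


@dataclass
class RecoveryCoordinator:
    """Hypothetical coordinator, assumed to run outside the primary environment."""

    probe_primary: Callable[[], bool]  # returns True when the primary is healthy
    latest_backup: Callable[[], dict]  # returns the state to restore on failover
    mode: Mode = Mode.PRIMARY
    failover_servers: list = field(default_factory=list)

    def on_signal(self, monitor_alert: bool, customer_reports: int) -> None:
        """React to a monitoring alert or to incoming customer outage reports."""
        if self.mode is not Mode.PRIMARY:
            return  # already serving from failover
        if not (monitor_alert or customer_reports > 0):
            return  # nothing has been reported
        if self.probe_primary():
            return  # confirmation failed: treat the signal as a false alarm
        self._fail_over()

    def _fail_over(self) -> None:
        state = self.latest_backup()
        # Stand-in for provisioning: each new server takes on an impacted role
        # and is given the backed-up state it needs to serve traffic.
        self.failover_servers = [
            {"role": role, "state": state} for role in ("web", "api")
        ]
        self.mode = Mode.FAILOVER

    def fail_back(self) -> bool:
        """Revert to the primary once it is confirmed healthy, then reclaim."""
        if self.mode is not Mode.FAILOVER or not self.probe_primary():
            return False
        self.mode = Mode.PRIMARY
        self.failover_servers.clear()  # reclaim and destroy failover servers
        return True


class FakePrimary:
    """Test double standing in for the primary environment."""

    def __init__(self) -> None:
        self.healthy = True

    def is_healthy(self) -> bool:
        return self.healthy


primary = FakePrimary()
coordinator = RecoveryCoordinator(primary.is_healthy, lambda: {"orders": 42})

primary.healthy = False  # simulate an outage in the primary environment
coordinator.on_signal(monitor_alert=True, customer_reports=0)
print(coordinator.mode, len(coordinator.failover_servers))

primary.healthy = True  # underlying issue diagnosed and fixed
print(coordinator.fail_back(), coordinator.mode)
```

The confirmation probe sits between the signal and the failover so that a single noisy alert or a stray customer report does not trigger a costly switch, and failback is refused until the same probe reports the primary as healthy.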
In a recent survey, it was reported that only 37% of the respondents met their service availability goals. It was also reported that 71% of respondents had experienced a downtime event in the last 12 months, with 41% reporting a downtime event in the last 3 months. This shows that downtimes are not only frequent but also to be expected, and thus require careful planning and design, not only to mitigate them but also to ensure speedy recovery and restoration of service. Enterprises have several options at their disposal and should carefully evaluate and choose the solution that best fits their needs and provides the agility required to detect and recover from unexpected downtimes.