Use site reliability engineering to deal with cloud instability

As remotely managed services, cloud platforms come with Service Level Agreements (SLAs) that guarantee a percentage of uptime or your money back. These SLAs – and the shift of responsibility for maintaining infrastructure from your organization or colocation provider to the clouds your organization uses – have raised the expectation that cloud services will “just work”, even though the reality is often far from that.

IT infrastructure has become faster and cheaper over time, but a server today is not significantly more reliable than one from a decade ago, because the root causes of failures are often environmental or the result of third-party error.

Several outages over the past two years have raised eyebrows because of their origin, effects or circumstances.

The fire that destroyed OVHcloud’s SBG2 facility in Strasbourg in March 2021 was the result of a faulty repair to an uninterruptible power supply system. Cooling systems failed to keep pace with London’s heatwave in July 2022, leading to outages at Google Cloud Platform and Oracle Cloud Infrastructure. Although not cloud-specific, the 2020 Nashville bombing damaged a significant amount of telecommunications equipment, leading to regional outages.

Given an increase in global temperatures due to climate change – and an increase in political temperatures – the potential for disruptions related to climate or extremism is real.

Of course, relatively mundane factors also lead to outages, such as poor software deployments, software supply chain issues, power failures, and network issues ranging in severity from tripped-over cables to fiber cuts. Naturally, no discussion of outages would be complete without a mention of DNS- and BGP-related failures, which have been cited as the root cause of incidents at Microsoft Azure, Salesforce, Facebook, and Rogers Communications over the past two years.

Engineer like a storm is coming

If your application is mission-critical, its deployment and instrumentation should be planned with these failure modes in mind. Consider where the single points of failure are – deploying in a single region of a single cloud provides no redundancy. Using a Content Delivery Network (CDN) can serve cached versions of pages during an outage, keeping relatively static content available, although a CDN alone does not maintain full feature availability.
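
As a rough sketch of that caching approach – assuming a hypothetical origin application and a CDN that honors the stale-if-error directive from RFC 5861 – the origin below marks mostly static pages as safe to serve from cache for up to a day if the origin itself becomes unreachable:

```python
# Minimal origin server (standard library only) whose responses tell a
# CDN it may keep serving a cached copy if the origin starts failing.
from http.server import BaseHTTPRequestHandler, HTTPServer

class OriginHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body>Mostly static landing page</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        # Cache for five minutes; allow a stale copy to be served for up
        # to 24 hours if the origin is erroring or unreachable.
        self.send_header("Cache-Control",
                         "public, max-age=300, stale-if-error=86400")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), OriginHandler).serve_forever()
```

Dynamic features behind those pages will still fail, which is why a CDN is a mitigation rather than a redundancy strategy.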

Deploying across multiple regions within a single cloud is the least burdensome way to improve availability, although designing a scalable application whose constituent components can be distributed involves significant engineering time and infrastructure cost. Operating and maintaining individual service units – including data stores – deployed in geographically separate facilities is a significant undertaking that requires thoughtful planning and institutional support.
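
The simplest expression of that idea is failing over between regional endpoints. The sketch below assumes two hypothetical regional deployments of the same service, with a client (or health-checking router) trying the primary region first and falling back to the secondary on error:

```python
# Minimal client-side failover across two regional deployments of the
# same service. The endpoint URLs are hypothetical placeholders.
import urllib.error
import urllib.request

REGIONAL_ENDPOINTS = [
    "https://eu-west.example.com",  # hypothetical primary region
    "https://us-east.example.com",  # hypothetical secondary region
]

def fetch_with_failover(path: str, timeout: float = 2.0) -> bytes:
    """Try each regional endpoint in turn; return the first good response."""
    last_error = None
    for endpoint in REGIONAL_ENDPOINTS:
        try:
            with urllib.request.urlopen(endpoint + path, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc  # region unreachable or erroring; try the next
    raise RuntimeError(f"all regions failed, last error: {last_error}")

if __name__ == "__main__":
    print(fetch_with_failover("/api/health"))
```

The hard part is not the routing but everything behind it: keeping data stores replicated and consistent across regions so that the secondary can actually serve traffic.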

Arguments could be made here in favor of multicloud: leveraging parallel infrastructure to eliminate a single point of failure is attractive, but it is costly, complex and duplicative, requiring institutional knowledge of two different cloud platforms and treating them as equals at every stage of your production processes.

Similarly, compelling arguments could be made under these circumstances for hybrid cloud, but that too is complex. Some of this complexity can be managed through initiatives such as AWS Outposts, Azure Stack Hub, and IBM Cloud Satellite, which provide consistent operating environments across public and private infrastructure.

Relying on these offerings as the sole protection against outages is short-sighted, however – it trades reliability problems for complexity problems, introducing another avenue through which outages can arise.

You need site reliability engineering

By embracing site reliability engineering (SRE) to create scalable and reliable systems, it is possible to manage complexity meaningfully and increase reliability through careful planning, clearly articulated roles, and well-defined incident management processes.

Site reliability engineers are typically responsible for reducing “toil” – repetitive manual work directly tied to running a service – as well as setting and measuring reliability targets: service level indicators (SLIs) and service level objectives (SLOs) that tie back to the SLAs of a cloud or infrastructure provider. Measuring these, and application performance in general, is done with observability tools, which allow site reliability engineers and other troubleshooters to ask questions of an environment without having to know, before an incident, what will need to be asked.
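
To make the SLI/SLO relationship concrete, here is a minimal sketch of the arithmetic – the figures are illustrative, and in practice the inputs would come from observability tooling rather than hard-coded numbers:

```python
# Illustrative SLI / SLO / error-budget arithmetic with made-up numbers.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully in the measurement window."""
    return good_requests / total_requests

def error_budget_remaining(sli: float, slo: float) -> float:
    """Share of the error budget still unspent (1.0 = untouched, < 0 = blown)."""
    allowed_failure = 1.0 - slo   # e.g. 0.001 for a 99.9% objective
    actual_failure = 1.0 - sli
    return 1.0 - (actual_failure / allowed_failure)

if __name__ == "__main__":
    slo = 0.999  # internal objective: 99.9% of requests succeed
    sli = availability_sli(good_requests=9_993_000, total_requests=10_000_000)
    print(f"SLI: {sli:.4%}")                                                  # 99.9300%
    print(f"Error budget remaining: {error_budget_remaining(sli, slo):.1%}")  # 30.0%
```

Burning through the error budget faster than expected is the signal to prioritize reliability work over new features.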

Although there are different approaches to implementing SRE – and by extension, to defining the responsibilities of reliability engineers – there is a distinction between site reliability engineers and platform teams. Platform teams are in charge of building the infrastructure of an organization’s IT estate; site reliability engineers occupy a multidisciplinary role responsible for ensuring the reliability of the infrastructure, applications, and tools an organization uses to deliver a product or service to customers.

Assume the worst, but hope for the best

The ubiquity of cloud platforms gives them a level of consumer visibility that data center operators never had – services like Downdetector illustrate the relationship between cloud outages and outages at the consumer brands that run on those platforms. Downdetector and, internally, observability tools provide a real-time picture of cloud outages that may not be reflected on a cloud platform’s service status pages.

Dashboards provided by vendors require manual intervention to acknowledge a service degradation or outage, making them an editorial product rather than a real-time, automated view of a cloud platform’s service status. This does not imply wrongdoing – there are legitimate reasons to limit information, particularly to avoid telling threat actors how severely a service is being stressed by an attack.

Cloud platform operators are naturally striving to improve reliability and reduce the effect of outages. Microsoft’s introduction of Azure Availability Zones to physically and logically separate infrastructure within the same region is an attempt to improve overall reliability, and IBM’s work to bolster platform reliability has reduced major incidents by 90% in one year.

Disruptions to cloud platforms, network hiccups – affecting infrastructure or users – and the unpredictable effects of software changes or “code rot” mean there is virtually no way to guarantee an application’s perfect uptime. But thoughtful planning and resource allocation can reduce the severity of incidents. Proactively engineering for instability requires an upfront investment, but it is preferable to emergency firefighting.


James Sanders is Principal Analyst, Cloud and Infrastructure, at CCS Insight.