AWS Product & DevOps SLAs: Cheats & Mistakes to Consider (Part 1)


Promise and trust can’t be detached from the way you run your AWS operations. When customers sign up for your products and solutions, trust is always a factor in delivering a genuine and effective customer experience (CX). As a SaaS leader, you want to keep that trust level high, and to manage it you need indicators to measure and thresholds to set. This is where the concepts of SLA, SLO, and SLI come in.

My company, VeUP, is focused on ISVs (Independent Software Vendors). As an ISV, you offer software products and services and must define their quality characteristics. SLIs, SLOs, and SLAs are the standard way to make quality measurable and to control it. Unfortunately, we often misread SLAs promised by other vendors and make mistakes in our own offers. To address this pain point for AWS ISV partners, I will be writing a series of blog articles on SLA best practices.

Just as our previous post on DevOps role blueprinting and knowledge analysis helps SaaS leaders keep their AWS operations well optimized, this post aims to help them maintain trust through suitable SLA practices. It is the first in the series and reveals SLA mistakes and cheats, supported by examples.

Learn the definitions

Let’s refresh the definitions first:

  • SLI (Service Level Indicator) – a quantitative metric we measure to estimate the quality of the service. Common SLIs are Availability, Latency, Throughput, Error rates, Utilization, etc.
  • SLO (Service Level Objective) – a target value or range for an SLI that defines the appropriate quality level. SLIs of a fully functioning service must stay within their SLOs: at or above the target for metrics such as Availability, at or below it for metrics such as Latency or Error rates.
  • SLA (Service Level Agreement) – a formal agreement between a service provider and internal or external consumers stating what happens when SLOs are met and when they aren’t.
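To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch. The `Slo` class, the `availability_sli` helper, and all the numbers are hypothetical and chosen only for illustration. The point it shows is that "meeting the SLO" depends on the direction of the metric.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """Hypothetical objective for a single SLI."""
    name: str
    target: float
    higher_is_better: bool  # True for availability, False for latency or error rate

    def is_met(self, sli_value: float) -> bool:
        """Return True when the measured SLI satisfies the objective."""
        if self.higher_is_better:
            return sli_value >= self.target
        return sli_value <= self.target


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI as the fraction of successful requests."""
    if total_requests == 0:
        return 1.0  # assumption: a period without traffic counts as available
    return good_requests / total_requests


# Illustrative targets, not taken from any real SLA.
availability_slo = Slo("availability", target=0.999, higher_is_better=True)
latency_slo = Slo("p99 latency (ms)", target=300.0, higher_is_better=False)

sli = availability_sli(good_requests=99_950, total_requests=100_000)
print(availability_slo.is_met(sli))  # True: 0.9995 is above the 0.999 target
print(latency_slo.is_met(420.0))     # False: 420 ms exceeds the 300 ms target
```

An SLA would then be the contract that says what happens when `is_met` returns False for an agreed period.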

 

The definitions are simple. The art and science lie in choosing which metric works best as an SLI, defining a reasonable SLO, and committing to an SLA that is attractive to customers. Importantly, perfection shouldn’t be the goal. SLAs should be only as strict as needed to keep customers happy. Once customers are satisfied with a service’s reliability and performance, an extra nine or a few milliseconds less in the SLA adds little value. It is better to focus your resources on new features or other products.

1. Make the right choice for SLI

According to Google, a good SLI metric should be Meaningful, Proportional, and Actionable. All these characteristics are tied to a service’s user experience.

  • “Meaningful” means that the metric must represent high or low user satisfaction.
  • “Proportional” means that any change in the metric must be proportional to a deviation in user-perceived experience.
  • By “Actionable”, we mean that the metric must provide service owners an insight into why user experience was low over a specific period.

Aim for a maximum of 4-5 metrics for any SLO. Too many SLIs trigger too many false positives. For a good SLA, 1-2 SLIs are reasonable.

2. Read SLAs from the end

Never read or create an SLA starting from the number of nines. Always start with clear definitions of the SLI metrics, their measurement intervals, and key terms like “request,” “uptime,” “error,” etc., as in the definitions shown for Amazon SQS and SNS.

 

Never expect the terms you use to be intuitively clear, and never rely on common sense. It could play a bad joke on you. The screenshot below shows SLA definitions for DigitalOcean Kubernetes, where Unavailability is defined as “all the requests failed for more than 5 minutes.” However, some requests fail not because of service degradation but because clients send incorrect requests that fail with 400 return codes. Are you ready to pay SLA penalties for such cases as well? I suppose not.
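The effect of such a definition can be sketched in a few lines of Python. This is a simplified illustration, not any vendor's actual measurement method. It assumes fixed 5-minute windows, a request log made of `(unix_timestamp, http_status)` pairs, and a hypothetical function name, `unavailable_windows`. A window counts as unavailable when every counted request in it failed.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # fixed 5-minute windows, a simplification of "more than 5 minutes"


def unavailable_windows(requests, exclude_client_errors=True):
    """Return sorted start times of windows in which every counted request failed.

    `requests` is an iterable of (unix_timestamp, http_status) pairs.
    """
    counted = defaultdict(int)
    failed = defaultdict(int)
    for timestamp, status in requests:
        if exclude_client_errors and 400 <= status < 500:
            continue  # the client's mistake, not service degradation
        window = int(timestamp) // WINDOW_SECONDS * WINDOW_SECONDS
        counted[window] += 1
        if status >= 400:
            failed[window] += 1
    return sorted(w for w, n in counted.items() if failed[w] == n)


# Made-up log: one window with only bad client requests, one real outage, one healthy window.
log = [(0, 400), (10, 400), (20, 400), (300, 500), (310, 503), (600, 200), (610, 500)]

print(unavailable_windows(log))                               # [300]
print(unavailable_windows(log, exclude_client_errors=False))  # [0, 300]
```

With the naive definition, the window starting at 0 counts against the provider even though the service itself was healthy.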

 

3. Share and compare SLA and SLO

A formal SLA is not the only thing you should share with customers to build trust in your solutions. In fact, their trust directly depends on how much they know about your architecture and the goals behind its design. Of course, giving away know-how or secrets is not a good idea. Nonetheless, sharing some of the principles and SLOs your products were designed for earns a few extra points of trust. Check the image below to see how AWS shows SLOs and SLAs for S3 storage. A formal SLA is normally less strict than a “designed for” SLO, because the SLA is the level you are prepared to pay penalties for. Don’t hesitate to compare them proudly. The gap between the SLO and SLA demonstrates the safety margin built into your solution.
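One way to make the gap between an SLO and an SLA tangible is to convert both percentages into allowed downtime. The sketch below assumes a 30-day month. The two percentages are illustrative placeholders, not quoted from any vendor's SLA page, and `allowed_downtime_minutes` is a hypothetical helper.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Downtime budget, in minutes, implied by an availability target."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_percent / 100)


# Illustrative values only.
designed_for_slo = 99.99
formal_sla = 99.9

for label, percent in [("designed-for SLO", designed_for_slo), ("formal SLA", formal_sla)]:
    minutes = allowed_downtime_minutes(percent)
    print(f"{label} {percent}%: {minutes:.2f} minutes per 30 days")
```

Under these assumptions the SLO allows about 4.3 minutes of downtime per month, while the SLA allows about 43 minutes. The difference is the margin the provider keeps before penalties apply.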

 

Mature solutions have separate Data Plane and Control Plane components. Define and share SLAs and SLOs for both. It’s bad when your Control Plane fails and users cannot get access to new resources. But it’s much more painful when services that are already running stop functioning. That is why the Data Plane SLO is usually stricter than the Control Plane SLO. Take a look at the example of a Control Plane SLA below.

 

4. Offer SLAs for single- and multi-zonal services

Every service has a placement scope. Some services live in a specific Datacenter (DC), Failure Domain, or Availability Zone (AZ). Components of other services may be distributed across several AZs but are still tied to a particular geographical Region. You can also build services distributed across several Regions or make them Global. Therefore, for clarity, you must explain the placement scope of every service, as shown in the image below.

 

Please be honest about your placement scope definition. The screenshot below shows three vendors offering services in a Multi-Zonal Region. The Availability Zones offered by Vendor 3 are in different Failure Domains, hundreds of kilometers from each other. No doubt, this is a fair approach.

 

Vendor 1 claims to have a Multi-Zonal Region, but it has just three DCs located 60-180 meters from each other. They are effectively in the same Failure Domain, as all three Datacenters could be damaged by the same fire, flood, or power outage.

What about Vendor 2? From my perspective, this is a borderline case. Based on the picture, we cannot tell whether the Availability Zones are in the same Failure Domain. It is plausible that two or all three AZs could be impacted by a larger-scale incident such as a river flood. Therefore, we should ask Vendor 2 for clarification on this risk. The vendor can reject the request by appealing to internal rules and secrets, but it risks losing trust by doing so.

Of course, you want to motivate your users to follow best practices. That is why I recommend defining separate SLAs for service components launched in the same or in different Availability Zones, as in the example below, where the SLA for multi-zonal Google Compute Engine is much higher.
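A rough back-of-the-envelope model shows why a multi-zonal SLA can be higher. The sketch below assumes that the service stays up as long as at least one zone is up and that zone failures are statistically independent. Both are assumptions, and the second is exactly what breaks when the zones share a Failure Domain. The function name and the 99.5% figure are hypothetical.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Availability of a service that survives while at least one zone is up.

    Assumes independent zone failures, which does not hold when zones
    share a Failure Domain (same building, power feed, or flood plain).
    """
    if zones < 1:
        raise ValueError("zones must be at least 1")
    return 1 - (1 - zone_availability) ** zones


single_zone = 0.995  # illustrative per-zone availability

for zones in (1, 2):
    print(f"{zones} zone(s): {combined_availability(single_zone, zones):.4%}")
```

If the zones are not truly independent, the real figure sits much closer to the single-zone number, which is why an honest placement scope matters.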

 

5. Take responsibility

SLAs are always backed by penalties for cases when the commitment is broken. Quite often, such penalties are expressed as Service Credits, a percentage of the service’s monthly bill, as shown in the image below, where the SLA penalties for Amazon SQS and SNS are tied to ranges of Availability. The duration or magnitude of the SLA violation defines how much money is credited to the next month’s bill.
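A tiered credit schedule of this kind can be sketched as a simple lookup. The tiers below are hypothetical and are not the actual Amazon SQS or SNS values. `service_credit_percent` and `service_credit_amount` are illustrative names.

```python
# Hypothetical tiers: (minimum availability %, credit as % of the monthly bill),
# ordered from best to worst availability.
CREDIT_TIERS = [
    (99.9, 0),
    (99.0, 10),
    (95.0, 25),
    (0.0, 100),
]


def service_credit_percent(availability_percent: float, tiers=CREDIT_TIERS) -> int:
    """Return the credit percentage for the first tier the availability reaches."""
    for min_availability, credit in tiers:
        if availability_percent >= min_availability:
            return credit
    raise ValueError("availability must be between 0 and 100")


def service_credit_amount(availability_percent: float, monthly_bill: float,
                          tiers=CREDIT_TIERS) -> float:
    """Money credited to the next bill for the measured availability."""
    return monthly_bill * service_credit_percent(availability_percent, tiers) / 100


print(service_credit_percent(99.95))        # 0: the commitment was met
print(service_credit_amount(99.5, 2000.0))  # 200.0: 10% of the monthly bill
```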

 

Some vendors try to avoid paying their customers back when the SLA is violated. In the example below, Vendor 1 and Vendor 2 offer a 99.5% Availability SLA. Their offerings are slightly different but still fair. In comparison, Vendor 3 offers a higher, seemingly more competitive 99.9% Availability SLA. But looking deeper, we see that its users will get at most 20% in service credits. Also, Unavailability of the service lasting less than 3 minutes is not counted as a problem. So Vendor 3’s SLA ends up offering worse outcomes to its customers.
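To see how much an exclusion for short outages can hide, consider the following sketch. The outage list is made up, and `measured_availability` with its `min_counted_minutes` parameter is a hypothetical helper. Only the 3-minute threshold comes from the Vendor 3 example above.

```python
def measured_availability(outage_minutes, period_days=30, min_counted_minutes=0.0):
    """Availability in percent, ignoring outages shorter than `min_counted_minutes`."""
    period_minutes = period_days * 24 * 60
    counted = sum(m for m in outage_minutes if m >= min_counted_minutes)
    return 100 * (1 - counted / period_minutes)


# Made-up month: forty outages of 2.5 minutes each, 100 minutes of real downtime.
outages = [2.5] * 40

print(f"{measured_availability(outages):.3f}")                         # 99.769
print(f"{measured_availability(outages, min_counted_minutes=3):.3f}")  # 100.000
```

In this made-up month the customer experiences about 99.77% availability, yet the SLA report shows a perfect 100% and no credits are due.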

 

6. Keep the history of SLAs

Your SLA evolves over time. To avoid problems when you and your users refer to different versions of the SLA, keep and share the history of all changes, as AWS does below, where all versions of the Amazon S3 SLA are available for review and reference. You can earn additional trust points by explaining the reasons for these changes.
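One possible way to keep such a history machine-readable is a list of versions sorted by effective date, with a lookup for the version that applied on a given day. The sketch below is only an illustration. The dates, percentages, and change notes are invented, and `sla_in_effect` is a hypothetical helper.

```python
from bisect import bisect_right
from datetime import date

# Invented history, sorted by effective date:
# (effective_from, availability %, reason for the change)
SLA_HISTORY = [
    (date(2021, 1, 1), 99.5, "initial SLA"),
    (date(2022, 6, 1), 99.9, "multi-zone deployment became the default"),
    (date(2023, 9, 15), 99.9, "clarified the definition of a failed request"),
]


def sla_in_effect(on: date, history=SLA_HISTORY):
    """Return the SLA version that applied on the given date."""
    effective_dates = [entry[0] for entry in history]
    index = bisect_right(effective_dates, on) - 1
    if index < 0:
        raise LookupError(f"no SLA was in effect on {on}")
    return history[index]


effective_from, availability, reason = sla_in_effect(date(2022, 1, 10))
print(effective_from, availability, reason)
```

Storing the reason next to each version makes it easy to publish the explanation together with the change.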

 

7. Do not make SLI a target in itself

According to Goodhart’s law, when a measure becomes a target, it ceases to be a good measure. Please focus on your users’ experience rather than just keeping your SLI always green. Maybe you still remember a big Amazon S3 failure back in 2017. The service failed, but the health indicator was still green, making the situation even worse in the eyes of the customers.

 

Failures happen with all products. You cannot avoid them, but you must own up to them when they happen. AWS acknowledged and apologized for the S3 failure and published a detailed post-mortem that explained how one of the SLIs was misconfigured. So, they lost some trust points with that service failure but won some back by being honest with customers.

 

The next blog post in this series will focus on the details of Availability SLAs. We will discuss the different methods for deriving them and common mistakes to avoid.

-Vasily Pantyukhin, VeUP CTO