So you’re starting with SLOs. That’s great! That means you have an SLI. But where do you set the initial SLO? You probably think two, three, four, or even five nines. The challenge is that you don’t know what’s realistic to start with. Shoot too low and it’s easy to meet the SLO. Shoot too high and you’ll miss it, maybe indefinitely. Perhaps you guess to start. Well, you don’t need to guess. We can measure a system’s capabilities using Statistical Process Control (SPC). Those capabilities tell us where to set the initial SLO.
This post will show how to find your SLOs using SPC with telemetry in Datadog.
What is Statistical Process Control?
Statistical Process Control (SPC) is a branch of analytic statistics created by Walter Shewhart and popularized by Dr. Deming. SPC visualizes a process’s capabilities on a control chart. The control chart includes a lower and an upper control limit. Points between these lines are expected. Points outside them are not.
If all the points fall between the control limits, the process is considered “under control”. Its behavior is predictable. If there are points outside the control limits, then investigation is required.
This is a cursory introduction to SPC. See the references for self-study materials.
So how is this any good for finding initial SLOs? Assume the SLI is “the proportion of good requests to total requests.” That produces a number like 97.37, 99.8, or 98.99. SPC measures the system’s current capabilities through the control limits while accounting for normal ups, downs, and unexpected results. These control limits tell us how reasonable the proposed SLO is.
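As a concrete sketch of what SPC computes, here is a minimal base R example. The SLI readings are hypothetical, but the limit calculation is the standard one for an individuals chart: the process mean plus or minus 2.66 times the average moving range.

```r
# Hypothetical daily SLI readings (percent of good requests).
sli <- c(99.71, 99.74, 99.69, 99.73, 99.68, 99.75, 99.70, 99.72)

# Individuals (XmR) chart: limits are the mean +/- 2.66 times the
# average moving range between consecutive readings.
centre <- mean(sli)
mR <- mean(abs(diff(sli)))
unpl <- centre + 2.66 * mR  # upper natural process limit
lnpl <- centre - 2.66 * mR  # lower natural process limit

# A proposed SLO above the upper limit is beyond what the
# process can currently deliver.
proposed <- 99.99
proposed > unpl  # TRUE for this data: the SLO exceeds capability
```

With these readings the limits land at roughly 99.60% and 99.83%, so a 99.99% objective sits well above what the process can produce.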
Using a Control Chart
Ultimately we’ll create a chart like the one below. The control chart shows the SLI of a real production system.
The blue line is the proposed SLO. It’s set to 99.99%. The dashed red lines indicate the upper and lower control limits. The solid black line is the process mean. The SLI is charted on a daily rollup. Red points are outside the control limits. Black points are within the control limits.
So what does this control chart tell us about the SLI and SLO?
First, the system will not achieve the SLO. Setting an SLA at this point would be disastrous! The proposed SLO (blue line) is well above the upper control limits. There is no single reading of the SLI at or above the blue line.
Second, approximately half the points are outside the control limits. This system is not predictable. Maintaining any SLO (unless set sufficiently low) will be challenging.
Third, a special cause shifted the SLI downward around mid-April. The chart cannot tell us what that cause was, or why the SLI later shifted back upward. Was it countermeasures taken by the team, or something else? This run lasted over a month, so don’t expect a quick fix.
Fourth, a realistic SLO for this service is at most ~99.75%. This is the lower control limit. However, the number of out-of-control points indicates that misses are likely.
Consider the alternatives. There are guaranteed misses if the SLO is set to the process mean (the solid black line). Remember that there is always variation. Points are expected to be above and below this value. If the SLO is set to the upper control limit (about 99.90%), there will be even more misses because meeting that SLO means operating above the current level.
You may be thinking: “This SLO is too low!” 99.75% availability means the system may be unavailable for a few minutes each day. This reaction means the voice of the customer and the voice of the process are out of sync.
The control chart is the voice of the process. It says, “I’m capable of this range of outcomes”. That feeling that the SLO is too low is the voice of the customer. It is saying, “I want this level of service!”
Say that this system needs that 99.99% SLO. The only way to meet that objective is to change the system. This is a management decision. Changing the system requires sustained effort, constant purpose, and a continuous improvement operating philosophy.
This is kaizen (or “change for the better”). It carries a profound impact. First, using SPC to model the voice of the process exemplifies Dr. Deming’s Theory of Variation. Second, engaging in empirically verifiable improvement activities exemplifies his Theory of Knowledge.
Start by getting the system under control by removing special cause variation. Then, iterate on improving the system to continually shift the control limits upward. Only then is meeting the 99.99% SLO possible.
Creating the Control Chart
You can find the R sample code in my dojo. The code uses Datadog’s SLO API to pull a historical time series. The data is aggregated, manipulated, and then plotted as a control chart.
The control chart featured above is an individuals chart for continuous one-at-a-time data (i.e. one reading of the SLI per day). The upper and lower control limits are calculated from the average and the average moving range. The constant 2.66 is derived from the standard control chart factors.
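For reference, 2.66 is the three-sigma multiplier 3 divided by the bias-correction constant d2 ≈ 1.128 for moving ranges of two consecutive points:

```r
d2 <- 1.128       # bias-correction constant for ranges of n = 2 points
round(3 / d2, 2)  # 2.66
```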
Here are two key snippets.
First, querying the Datadog API with httr, then calculating the average, moving range, and upper and lower control limits:
library(httr)       # HTTP client for the Datadog API
library(glue)       # string interpolation
library(purrr)      # pluck(), map(), list_c()
library(dplyr)      # data manipulation
library(lubridate)  # date handling

getSLOWeeklyHistory <- function(slo = NULL, to = NULL, from = NULL) {
  apiKey <- Sys.getenv("DATADOG_API_KEY")
  applicationKey <- Sys.getenv("DATADOG_APP_KEY")
  if (apiKey == "" || applicationKey == "") {
    stop("DATADOG_API_KEY or DATADOG_APP_KEY missing")
  }
  r <- GET(
    glue("https://api.datadoghq.com/api/v1/slo/{slo}/history"),
    query = list(from_ts = as.integer(from),
                 to_ts = as.integer(to)),
    add_headers("DD-API-KEY" = apiKey, "DD-APPLICATION-KEY" = applicationKey)
  )
  stop_for_status(r)
  json <- content(r)
  # Datadog returns timestamps in milliseconds.
  timestamps <- json %>%
    pluck("data", "series", "times") %>%
    map(\(x) as_date(as_datetime(x / 1000))) %>%
    list_c()
  numerators <- json %>%
    pluck("data", "series", "numerator", "values") %>%
    list_c()
  denominators <- json %>%
    pluck("data", "series", "denominator", "values") %>%
    list_c()
  data.frame(
    Date = timestamps,
    Numerator = numerators,
    Denominator = denominators
  ) %>%
    mutate(SLI = Numerator / Denominator) %>%
    group_by(Week = floor_date(Date, unit = "week", week_start = 1)) %>%
    summarise(SLI = sum(Numerator) / sum(Denominator)) %>%
    mutate(
      UNPL = mean(SLI) + 2.66 * mean(abs(diff(SLI))),
      LNPL = mean(SLI) - 2.66 * mean(abs(diff(SLI))),
      Mean = mean(SLI)
    ) %>%
    mutate(Violation = SLI > UNPL | SLI < LNPL)
}
Creating the control chart with ggplot2:
library(ggplot2)  # plotting
library(scales)   # label_percent()

ggplot(history, aes(x = Week, y = SLI)) +
  geom_line() +
  geom_point(aes(color = factor(Violation)), show.legend = FALSE) +
  geom_hline(aes(yintercept = UNPL), linetype = "dashed", color = "red") +
  geom_hline(aes(yintercept = LNPL), linetype = "dashed", color = "red") +
  geom_hline(aes(yintercept = Mean), linetype = "solid") +
  geom_hline(yintercept = 0.9999, linetype = "solid", color = "blue") +
  scale_y_continuous(labels = label_percent()) +
  scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
Wrap-Up
Control charts work. It’s up to you to use them. They work in a variety of contexts. I chose finding initial SLOs as the example because it’s a problem many engineering teams have encountered. You can use control charts for latencies, spending, defects, incident rates, and more.
“Control Charts work. They work when nothing else will work. They have been thoroughly proven. They are not on trial. The question is not whether they will work in your area. The only question is whether or not you will, by using these tools and practicing the way of thinking which goes with them, begin to get the most out of your processes and systems. The alternative is to be left behind.”
— Donald Wheeler, “Understanding Variation”
The control chart is the voice of the process. It says what you’ll get from the system. The voice of the customer says what you want from the system. It is management’s job to align the two.
Thank you to John Willis for peer-reviewing this post.
FAQ
Where can I learn more?
What’s special and common cause variation?
Understanding these two types of variation is key to any process improvement. Misattributing one to the other will lead you astray. Fundamentally, common cause variation is built into the system. It cannot be removed. It may only be tuned. Special cause variation is external to the system. It can be removed by changing the system.
Here’s a short example. Consider how long it takes to complete your routine drive to the grocery store.
There are traffic lights along the way. Sometimes you will get all greens. Sometimes you will get all reds. Sometimes you will get a mix. This is an example of common cause variation. You can attempt to tune them (running lights, timing them, etc.), though they are built into the system (driving on the roads).
One day you get a flat tire. This is special cause variation (some refer to this as an assignable cause). You can deploy countermeasures for this, such as proper tire maintenance.
What’s your advice on initial SLOs?
SLOs are not aspirational. They must align with business realities. There must be consequences for meeting and failing to meet them. Setting a lower and consistently achievable SLO is better than inconsistently meeting a higher SLO.
Your system may not need 99.999% (25.9 seconds of downtime in a month). Perhaps 99% (7.2 hours of downtime per month) is acceptable and achievable. Choose the one that makes business sense, then stick to it!
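The downtime figures above follow directly from the availability target. A small helper (hypothetical, assuming a 30-day month) makes the arithmetic explicit:

```r
# Seconds of allowed downtime per 30-day month at a given availability.
downtimeSeconds <- function(availability, days = 30) {
  days * 24 * 60 * 60 * (1 - availability)
}

round(downtimeSeconds(0.99999))  # ~26 seconds per month
downtimeSeconds(0.99) / 3600     # 7.2 hours per month
```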
You want to run with a proposed SLO for at least a month. This is enough time to see if the system can meet it.
Do not start alerting with SLOs. Alerting on SLOs is a challenging problem, and it’s worthless for systems that do not consistently meet their SLOs.
Listen to my podcast with Alex Hidalgo for more advice on working with SLOs.
Where can I get more code examples?
See my dojo. The dojo contains examples of various control charts and working with Datadog. Examples use R and ggplot2. The source code in this post was taken from the dojo.
Can I make a control chart in Datadog?
No. Datadog’s query functions cannot reduce a time series to a single number (say, to calculate the process average) and display that on a time series chart. I’d love it if Datadog had a native control chart widget. You can do much more with ggplot2 than with Datadog (like dynamically coloring points, adding ribbon zones, creating chart grids, etc.). Datadog is fantastic for time series work, but not for SPC.