Calculate SLI and SLO with PromQL

Original link: https://www.kawabangga.com/posts/4822

Use PromQL to query the error budget consumed over the past month, and display the current SLI alongside it.

The difficulty is that a PromQL query returns the values of time series. A query such as memory > 0.6, for example, returns the timestamps and values of every series that satisfies the condition. Getting a duration as the result of a query takes some tricks.

The idea is:

  1. First, define what "up" means at minute granularity, which is effectively the SLO definition: what counts as up within one minute, and what counts as down;
  2. Then write a query that finds how many minutes in a time range were up, and how many were down;
  3. Finally, derive the real-time SLI from those counts.

There are two ways to do this.

The first is to use recording rules. Start by defining a rule, say job:sla_minute_up, whose evaluation result states whether the current minute meets all the conditions for being up. It may be a complex expression with many conditions joined by and.
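As a rough sketch, the expression behind such a rule could look like the following. The metric names http_requests_total and http_request_duration_seconds_bucket, the code label, and both thresholds are assumptions made for illustration; they do not come from the original setup. The sketch uses bool comparisons multiplied together instead of and, so that the rule yields exactly 1 for an up minute and 0 for a down minute.

```promql
# Hypothetical expression for the job:sla_minute_up recording rule.
# Metric names, labels and thresholds below are assumed, not taken from the post.
(
  # condition 1: error ratio over the last minute is below 1%
  sum by (job) (rate(http_requests_total{code=~"5.."}[1m]))
    / sum by (job) (rate(http_requests_total[1m]))
  < bool 0.01
)
*
(
  # condition 2: 99th percentile latency over the last minute is below 0.5s
  histogram_quantile(
    0.99,
    sum by (job, le) (rate(http_request_duration_seconds_bucket[1m]))
  )
  < bool 0.5
)
```

Each bool comparison returns 1 or 0, so the product is 1 only when every condition holds. One caveat of this sketch: if no series matches the 5xx selector at all, the first condition returns nothing rather than 1.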

Then we just add up all the up minutes, i.e. sum_over_time(job:sla_minute_up[30d]). Finally, divide by the number of minutes in a month: sum_over_time(job:sla_minute_up[30d]) / (30 * 24 * 60). That is the final SLI.

One point is especially important here: sum_over_time adds up every sample that falls within the range. For example, with a 15s scrape interval the up metric has 4 samples per minute, so sum_over_time(up[1m]) normally returns 4. With this method, therefore, the evaluation interval of the recording rule must be set to 60s so that there is exactly one sample per minute.
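The dependence on the interval can be checked directly. The queries below are a minimal sketch that assumes the job:sla_minute_up rule described above exists and is evaluated every 60s; each one is meant to be run on its own.

```promql
# Sanity check: number of samples the rule produced in the last minute.
# With a 60s evaluation interval this should be 1; with 15s it would be 4.
count_over_time(job:sla_minute_up[1m])

# Number of up minutes in the last 30 days (one sample per minute assumed).
sum_over_time(job:sla_minute_up[30d])

# SLI as a fraction of the 43200 minutes in 30 days.
sum_over_time(job:sla_minute_up[30d]) / (30 * 24 * 60)
```

If the first query returns anything other than 1, the fraction in the last query is off by that factor.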

The result of this method is tied to how often the rule is evaluated, so it is not ideal. The following method is more clever.

We can look at it differently: directly compute the fraction of the period during which the conditions were met, then multiply that fraction by the length of the period to get the total up time.

To get this fraction, the minute-level up definition is still required. However, what we care about is not the specific value but whether the condition is met, that is, yes or no. With a bool comparison such as > bool, the result is converted to a boolean: 1 if the condition is satisfied, 0 if it is not. We then only need the average over the period to get the SLI as a fraction. For example, if every point is up, every value is 1 and the average over the whole period is 100%. If some points are not up, the average drops in proportion to the share of points that were down, which is also correct.
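A minimal sketch of this approach, reusing the error and total series names from the queries in this post: the 1% threshold and the 1m subquery resolution are assumptions chosen for illustration. Each query is meant to be run on its own.

```promql
# 1 when the error ratio is below the (assumed) 1% threshold, 0 otherwise.
error / total < bool 0.01

# SLI over the last 30 days: the average of those 0/1 values,
# evaluated as a subquery at an assumed 1-minute resolution.
avg_over_time((error / total < bool 0.01)[30d:1m])

# Up time in minutes over the same 30 days.
avg_over_time((error / total < bool 0.01)[30d:1m]) * 30 * 24 * 60
```

Because the average divides by the number of points actually evaluated, changing the subquery resolution changes the precision but not the meaning of the result.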

This method does not depend on the interval. A smaller interval gives more data points and higher precision; a larger interval loses precision. Either way, the SLI it reports is correct up to that precision.

If you want the SLI over one day, and the SLO covers only the error rate, you can use the following query, which returns the up time in seconds:

 (1 - avg_over_time((error/total)[1d:])) * 60 * 60 * 24

It can also be used in Grafana, with the time range selected in the dashboard:

 (1 - avg_over_time((error/total)[$__range:])) * $__range_s
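The same building block can be turned into the error budget figure mentioned at the start. The sketch below assumes an SLO target of 99.9%, which is not stated in the original post, and is written with subquery syntax ([...:]) so that the range applies to the whole error/total expression. Each query is meant to be a separate Grafana panel query.

```promql
# Down time in seconds within the range selected in Grafana.
avg_over_time((error/total)[$__range:]) * $__range_s

# Fraction of the error budget consumed in that range,
# assuming an SLO target of 99.9% (budget = 1 - 0.999).
avg_over_time((error/total)[$__range:]) / (1 - 0.999)
```

A value above 1 in the second query would mean the budget for the selected range has been used up.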

References:

  1. https://stackoverflow.com/questions/54292803/calculate-the-duration-in-which-a-prometheus-metric-had-a-certain-value
  2. https://docs.bitnami.com/tutorials/implementing-slos-using-prometheus
