
Saturday, November 28, 2020

Why "average" as service quality metric sucks

The problem with using the average to analyze "service quality"

Let's imagine a webserver.

That server answers 1000 requests per minute.

980 requests are answered within 200ms. But 20 requests take 10 seconds.

The average is: (200*980 + 20*10000) / 1000 = 396000 / 1000 = 396ms.

396ms does not sound too bad.

But in reality there are 20 customers with extremely long waiting times. If this is a checkout process, they might have canceled their order. You get churn. You get people complaining about your service.
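To make this concrete, here is a minimal Python sketch that reproduces the 396ms average for the hypothetical 980/20 split above:

# Example latencies from above: 980 fast requests, 20 slow ones (values in ms).
latencies_ms = [200] * 980 + [10000] * 20

# The mean comes out to 396ms and gives no hint that 20 requests took 10 seconds.
average_ms = sum(latencies_ms) / len(latencies_ms)
print(f"average: {average_ms:.0f}ms")  # -> average: 396ms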

Solution 1: Max

The max would return 10000ms. And we would immediately see that there's a problem in that time period. Simple to implement and easy to understand.
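A minimal sketch, using the same example latency list as above:

# Same example data as above.
latencies_ms = [200] * 980 + [10000] * 20

# The maximum immediately exposes the 10-second outliers.
print(f"max: {max(latencies_ms)}ms")  # -> max: 10000ms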

Solution 2: Percentile

The percentile would also help to surface problems. The 99th percentile means "99% of requests are below that value, 1% above".
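For the example data, 2% of requests take 10 seconds, so the 99th percentile already lands on 10000ms. A sketch using Python's standard library (statistics.quantiles with n=100 returns the 1st to 99th percentile cut points):

import statistics

# Same example data as above.
latencies_ms = [200] * 980 + [10000] * 20

# The last cut point returned for n=100 is the 99th percentile.
p99_ms = statistics.quantiles(latencies_ms, n=100)[-1]
print(f"p99: {p99_ms:.0f}ms")  # -> p99: 10000ms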

Bottom line

I usually use the max over a given timeframe to see the extreme values. This has helped me quite a lot in the past to make sure I act on potential problems.

Also important: This concept is not limited to webservers. It is also very relevant for e.g. customer support departments where customers call you. The average will hide extreme values - and again - these extreme values represent customers who are likely to churn.