Ideally, we would like anomaly detection algorithms to identify all and only anomalies. But in reality this is easier said than done, as these two desiderata tend to trade against one another. The more aggressively you flag anomalies, the more non-anomalies you end up flagging along with them. And conversely, the choosier you are, the more real anomalies slip by.

Still, the tradeoff isn’t absolute. It’s possible to be exactly right, exactly wrong, or anywhere in between in how well you identify anomalies.

Let’s take a closer look.

Precision and recall

For any given anomaly detection algorithm, we’d like to characterize how well it identifies all and only anomalies. It turns out that there are a couple of key concepts from information retrieval that can help here:

  • Precision: This measures how well our algorithm identifies only anomalies. It’s a percentage, or a rate if you prefer. Say the algorithm returns a set S(t) of purported anomalies for some threshold t. Some of them are real and some of them aren’t. The precision is the percentage of real anomalies in S(t). (For bookings monitoring, high precision means low alert spam.)
  • Recall: This measures how well our algorithm identifies all anomalies. Again it’s a percentage. The idea here is that for any given dataset, some data points are anomalies and some aren’t. Our algorithm identifies a set S(t) of purported anomalies, and the recall is the percentage of the true anomalies that S(t) captures. (For bookings monitoring, high recall means that we’re good at catching outages when they occur.) The code sketch just after this list shows both computations.
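
To make these definitions concrete, here’s a minimal sketch in Python. The function and the example sets below are illustrations of the definitions, not code from the notebook mentioned later in this post:

```python
def precision_recall(flagged, true_anomalies):
    """Compute precision and recall for a set of flagged points.

    flagged:        the set S(t) of points the algorithm returns at threshold t
    true_anomalies: the set of points that really are anomalies
    """
    true_positives = flagged & true_anomalies            # real anomalies we caught
    precision = len(true_positives) / len(flagged)       # how much of S(t) is real
    recall = len(true_positives) / len(true_anomalies)   # how many real anomalies we caught
    return precision, recall

# Example: the algorithm flags 4 points, 2 of which are among the 5 real anomalies.
p, r = precision_recall({1, 2, 3, 4}, {1, 3, 50, 60, 70})
print(p, r)  # 0.5 0.4
```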

Here’s a nice graphic illustrating the above, by Walber:

Precision vs recall

“Recall” is a funny name for finding all anomalies, since we aren’t really recalling them. But it makes more sense if you think about the same problem in the context of information retrieval. There, the user submits a document search query, and the search algorithm returns a set S(t) of search results. The algorithm’s precision tells us what percentage of the search results are relevant, and the algorithm’s recall tells us what percentage of the total relevant documents the search algorithm returned. So it’s the same idea even though “recall” sounds weird for anomaly detection.

Anyway, based on the above, we can visualize an anomaly detection algorithm’s performance along these axes using something called a precision-recall curve.

Precision-recall curves

A precision-recall curve, or PR curve, shows how precision and recall trade against one another for some given algorithm, parameterized by a threshold t. For any given t there’s an associated precision and an associated recall. So if we look across a range of values for t, a curve emerges that shows the tradeoff. This is the PR curve.
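
If you’re working in Python, scikit-learn can trace this curve for you, given ground-truth labels and per-point anomaly scores. A minimal sketch (the labels and scores below are made up purely for illustration):

```python
from sklearn.metrics import precision_recall_curve

# Ground truth: 1 = anomaly, 0 = normal.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# Anomaly scores from some detector: higher means more anomalous.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

# Each element of thresholds plays the role of t: precision[i] and recall[i]
# are what you get by flagging every point whose score is >= thresholds[i].
# (A final precision=1, recall=0 endpoint is appended by convention.)
precision, recall, thresholds = precision_recall_curve(y_true, scores)
```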

Let’s look at an example, due to Charu Aggarwal. Suppose that we have a data set with 100 points that contains five anomalies. Suppose also that we have an Algorithm A that ranks the points from 1 to 100, where 1 is the point the algorithm judges most likely to be an anomaly and 100 is what it deems the least likely. Finally, suppose that the true anomalies are the points that Algorithm A ranks as 1, 5, 8, 15 and 20.

We can generate the precision and recall by starting with recall = 0% and proceeding until we hit recall = 100%. First, here’s a table for Algorithm A, showing each threshold at which the recall increases:

  Threshold t (picks)   Anomalies captured   Precision   Recall
  1                     1                    100.0%      20%
  5                     2                    40.0%       40%
  8                     3                    37.5%       60%
  15                    4                    26.7%       80%
  20                    5                    25.0%       100%

Here the threshold t is the number of picks. (It doesn’t literally have to be a number of picks. It could be some distance threshold that happens to result in a certain number of anomalies being identified.) At one extreme, when t = 1 we have 100% precision and 20% recall. At the other extreme, when t = 20 we have 25% precision and 100% recall.
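
Here’s a minimal sketch of how those rows are computed: sweep the threshold over the number of picks and report each threshold at which a new anomaly is captured.

```python
# Ranks that Algorithm A assigns to the five true anomalies.
true_anomaly_ranks = {1, 5, 8, 15, 20}
total_anomalies = len(true_anomaly_ranks)

for t in range(1, 21):
    captured = sum(1 for rank in true_anomaly_ranks if rank <= t)
    if t in true_anomaly_ranks:  # recall just increased
        print(f"t={t:2d}  precision={captured / t:.1%}  recall={captured / total_anomalies:.0%}")

# t= 1  precision=100.0%  recall=20%
# t= 5  precision=40.0%  recall=40%
# t= 8  precision=37.5%  recall=60%
# t=15  precision=26.7%  recall=80%
# t=20  precision=25.0%  recall=100%
```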

Here’s a PR curve for Algorithm A:

PR curve for Algorithm A

(The Jupyter notebook is available on GitHub if you’re interested. You have to clone the repo and run the notebook locally as GitHub isn’t rendering it for some reason.)

The plot corresponds precisely to the table I posted above. As we increase the threshold, the precision tends to drop, except for the occasional spike that occurs when our pick captures a real anomaly.

That’s how we plot the PR curve for a single algorithm. Now let’s look at how we might compare a couple of algorithms using their PR curves.

Comparing algorithms using PR curves

We can compare two algorithms by comparing their PR curves. To continue with our earlier example, let’s say that we have a competitor, Algorithm B, that we want to evaluate against Algorithm A. On the same anomaly ranking problem, suppose that Algorithm B assigns ranks 3, 7, 11, 13 and 15 to the five true anomalies.

We’ll suppress the table this time, and simply overlay the PR curve for Algorithm B on top of the one we did for Algorithm A:

PR curves for Algorithms A and B
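
If you want to reproduce an overlay like this yourself, here’s a rough sketch with matplotlib (a reconstruction for illustration; the notebook mentioned earlier is the original):

```python
import matplotlib.pyplot as plt

def pr_points(true_anomaly_ranks, max_picks=20):
    """Precision and recall at every threshold t = 1..max_picks."""
    total = len(true_anomaly_ranks)
    recalls, precisions = [], []
    for t in range(1, max_picks + 1):
        captured = sum(1 for rank in true_anomaly_ranks if rank <= t)
        precisions.append(captured / t)
        recalls.append(captured / total)
    return recalls, precisions

# Ranks of the five true anomalies under each algorithm.
recalls_a, precisions_a = pr_points({1, 5, 8, 15, 20})
recalls_b, precisions_b = pr_points({3, 7, 11, 13, 15})

plt.plot(recalls_a, precisions_a, marker="o", label="Algorithm A")
plt.plot(recalls_b, precisions_b, marker="o", label="Algorithm B")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```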

Looking at the plot, we can see that Algorithm A is stronger at lower thresholds whereas Algorithm B is stronger at higher thresholds.

Putting this in terms of the specific examples, we saw that Algorithm A immediately identifies an anomaly on pick 1, and then finds another one on pick 5. It finds real anomalies reasonably quickly, which accounts for its strong precision out of the gate. Algorithm B, on the other hand, finds its first two anomalies on picks 3 and 7. Early on, B’s precision isn’t as good as A’s.

But then Algorithm A requires 20 picks to find all five anomalies, whereas Algorithm B requires only 15. When both algorithms have captured all five anomalies, A’s result set contains more “junk” (false positives) than B’s does: A’s precision at full recall is 5/20 = 25%, versus 5/15 ≈ 33% for B. That’s why B has better precision as the algorithms progress.

So which is better?

In cases where one algorithm completely dominates the other (that is, its precision exceeds the other’s at every recall value), it’s easy to identify a winner.

But in the comparison above it’s less clear, and the answer is often application-dependent. For instance, if the two anomaly detection algorithms are bookings monitoring algorithms, we might decide that recall is more important than precision (i.e., we need to capture most or all real outages, even if that means dealing with some alert spam). In that case we might go with Algorithm B, because B generates less alert spam at the higher thresholds entailed by the high recall requirement.
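
Here’s a small sketch of that decision rule in code: fix a minimum recall requirement, then compare the best precision each algorithm can deliver while meeting it. (The 100% recall requirement below is just an example.)

```python
def best_precision_at_recall(true_anomaly_ranks, min_recall, max_picks=100):
    """Best precision achievable while meeting a minimum recall requirement."""
    total = len(true_anomaly_ranks)
    best = 0.0
    for t in range(1, max_picks + 1):
        captured = sum(1 for rank in true_anomaly_ranks if rank <= t)
        if captured / total >= min_recall:
            best = max(best, captured / t)
    return best

# Require 100% recall (catch every outage), then compare the alert spam.
print(best_precision_at_recall({1, 5, 8, 15, 20}, 1.0))   # Algorithm A: 0.25
print(best_precision_at_recall({3, 7, 11, 13, 15}, 1.0))  # Algorithm B: 0.333...
```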

For another technique, please see Evaluating anomaly detection algorithms with receiver operating characteristic (ROC) curves.

For more information