Area Under the Receiver Operating Characteristic Curve (AU-ROC)
The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. It is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.
- True Positive Rate (TPR), also known as Sensitivity or Recall, is defined as $\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$.
- False Positive Rate (FPR), also known as (1 - Specificity), is defined as $\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}$.
The Area Under the ROC Curve (AU-ROC or AUC) quantifies the overall ability of the classifier to discriminate between positive and negative classes. It represents the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
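This pairwise-ranking interpretation is easy to check numerically. Below is a minimal sketch (the synthetic data and the use of scikit-learn are assumptions of the example, not part of the text) comparing the fraction of correctly ranked positive/negative pairs with `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)        # hypothetical binary labels
scores = rng.normal(size=200) + y_true       # hypothetical prediction scores

pos, neg = scores[y_true == 1], scores[y_true == 0]
# Fraction of (positive, negative) pairs ranked correctly; ties count as half.
pairwise = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()

print(pairwise)                        # empirical P(score_pos > score_neg)
print(roc_auc_score(y_true, scores))   # identical value
```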
Key Characteristics:
- Range: AU-ROC values range from 0 to 1. A model whose predictions are 100% wrong has an AU-ROC of 0; one whose predictions are 100% correct has an AU-ROC of 1. A model that performs no better than random guessing will have an AU-ROC around 0.5.
- Threshold Invariance: AU-ROC measures the quality of the model’s rankings, independent of a specific classification threshold.
- Insensitivity to Class Imbalance: Unlike Average Precision, AU-ROC is generally less sensitive to skews in the class distribution. This makes it a useful metric when the dataset has a significant imbalance between positive and negative samples, and correct classification of both classes is important.
The ROC curve itself is monotonically non-decreasing: as the classification threshold is lowered (i.e., as more instances are classified as positive), both TPR and FPR can only increase or stay the same.
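To make the threshold sweep concrete, here is a small sketch (same synthetic setup as above, scikit-learn assumed) that computes the ROC points and verifies that TPR and FPR never decrease as the threshold is lowered:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
scores = rng.normal(size=200) + y_true

# roc_curve sweeps the decision threshold from high to low and returns the
# FPR/TPR attained at each threshold; both arrays are non-decreasing.
fpr, tpr, thresholds = roc_curve(y_true, scores)
assert np.all(np.diff(fpr) >= 0) and np.all(np.diff(tpr) >= 0)

print(auc(fpr, tpr))   # area under the piecewise-linear ROC curve
```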
Average Precision (AP)
Consider the precision-recall “curve” for a binary classification problem, i.e., precision plotted against recall as the decision threshold is varied.
We can make the following observations:
- Not a curve. If the data distribution is discrete, this is not even a “curve”; it is just a collection of points. To visually turn it into a curve, we would need to interpolate between the points.
- Not monotonically decreasing. Whenever you encounter a TP, precision slightly increases, though its limit as the threshold goes to $0$ (or, equivalently, as recall goes to $1$) is still going to be the positive base rate $\frac{P}{P+N}$, where $P$ and $N$ are the numbers of positive and negative samples.
- Not a graph. This is not the graph of a function: if you lower the threshold a bit and only add FPs and no TPs, the recall stays the same but precision drops, so there is no unambiguous mapping from recall to precision. Since we often care about the area below this “curve”, we usually take, for each recall value, the minimum of the precision values that sit above it in the plot. This means that, in practice, it is enough to consider the thresholds at which recall increases, which are the thresholds at which a TP appears.
Interpolation. The formula typically used for interpolating the Precision-Recall “curve” and computing its area for a discrete data distribution is

$$\operatorname{AP} = \sum_{i} (r_i - r_{i-1}) \, p_{\text{interp}}(r_i), \qquad r_0 = 0.$$

The precision at the recall threshold $r_i$ is, to be precise, usually computed as the interpolated precision

$$p_{\text{interp}}(r_i) = \max_{r \geq r_i} p(r)$$

to ensure monotonicity of the curve; since the interpolated curve is non-increasing and each summand evaluates it at the right end $r_i$ of its recall interval, this will marginally underestimate the area under the interpolated curve. The recall thresholds $r_i$, on the other hand, are derived from the unique prediction scores produced by the classifier on the entire dataset. In nondegenerate cases where no two TPs have the exact same predicted value, $r_i - r_{i-1} = \frac{1}{P}$, where $P$ is the number of positive samples, so you can actually think of $\operatorname{AP}$ as being a bona-fide average/mean over the precisions at the thresholds determined by the TPs.
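A small numpy sketch of this computation (assuming the max-over-higher-recalls interpolation above and no prediction-score ties; the function name `interpolated_ap` is made up for illustration):

```python
import numpy as np

def interpolated_ap(y_true, scores):
    """Sum of recall increments times interpolated precision (no score ties assumed)."""
    order = np.argsort(-scores)                  # sort predictions by descending score
    y_sorted = np.asarray(y_true)[order]
    tp_cum = np.cumsum(y_sorted)                 # number of TPs among the top-k predictions
    precision = tp_cum / np.arange(1, len(y_sorted) + 1)
    recall = tp_cum / y_sorted.sum()

    # Interpolated precision: running maximum from the right,
    # i.e. p_interp(r_i) = max over all recalls >= r_i.
    p_interp = np.maximum.accumulate(precision[::-1])[::-1]

    # Only the thresholds at which recall increases (a TP appears) contribute.
    idx = np.flatnonzero(y_sorted == 1)
    r = recall[idx]
    return np.sum(np.diff(np.concatenate(([0.0], r))) * p_interp[idx])
```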
For discrete data distributions without any prediction score ties, AP can also be expressed as follows:

$$\operatorname{AP} = \frac{1}{P} \sum_{j=1}^{P} \operatorname{precision@} k_j,$$

where $k_j$ is the rank of the $j$-th true positive when the predictions are sorted by descending score, and $\operatorname{precision@} k$ is the precision of the top $k$ predictions.
While technically not the same in case of ties, this is in my opinion ==the best heuristic to think about average precision: order the scores in descending order and then, for each TP, compute the precision over the predictions up to and including it. Average all these precision values et voilà, you have average precision.==
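To see the heuristic in action, here is a short sketch (scikit-learn's `average_precision_score` is used purely as a reference; the data is made up and has no score ties, so the two numbers coincide):

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)
scores = rng.normal(size=500) + 0.8 * y_true     # continuous scores, so no ties

order = np.argsort(-scores)                      # descending order of confidence
y_sorted = y_true[order]
ranks = np.flatnonzero(y_sorted == 1) + 1        # 1-based ranks of the TPs
precision_at_tp = np.cumsum(y_sorted)[ranks - 1] / ranks

print(precision_at_tp.mean())                    # mean precision at the TP ranks
print(average_precision_score(y_true, scores))   # same value when there are no ties
```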
Practical Implications: trusting top predictions
Imagine you have a disease, bird, or whatever detection dataset with, say, 20 positive samples you want to detect and 1,000 negative samples. Imagine 20 negative samples are ranked first, then the 20 positive samples come next, and the 980 remaining negative samples follow. In this case, the AP would be abysmal (below $\frac{1}{2}$), because the precisions of the top $k$ predictions, where $k$ ranges through the positions of the true positives, are all at most $\frac{1}{2}$. If you want to have a detection system in which you can trust the top predictions, this is a very bad scenario. The AU-ROC, on the other hand, would be close to $1$ (namely $0.98$) and thus almost perfect in this case. If we set the final decision threshold directly after the true positives, accuracy would be about $0.98$ and thus also almost perfect, even though for practical use, we wouldn’t know whether to trust the top predictions or not.
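The scenario is easy to reproduce numerically. A minimal sketch (scikit-learn assumed; the counts follow the example above):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, accuracy_score

# Ranking from the example: 20 negatives first, then the 20 positives,
# then the remaining 980 negatives.
y_true = np.concatenate([np.zeros(20, dtype=int), np.ones(20, dtype=int), np.zeros(980, dtype=int)])
scores = np.linspace(1.0, 0.0, num=y_true.size)    # strictly decreasing scores

print(average_precision_score(y_true, scores))     # ~0.32, below 1/2
print(roc_auc_score(y_true, scores))               # 0.98

# Decision threshold placed directly after the last true positive,
# i.e. the top 40 predictions (20 FPs + 20 TPs) are classified as positive.
y_pred = (scores >= scores[39]).astype(int)
print(accuracy_score(y_true, y_pred))              # ~0.98
```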
Summary
Heuristic for AU-ROC
How good is the model at ranking positive samples above negative samples, without regard to how many of each there are?
More formally, it is the probability that, when randomly drawing a positive and randomly drawing a negative, the pair is correctly ranked (the positive scores higher than the negative).
Use when both error directions are equally important and you do not care about class imbalance.
Heuristic for AP
How good is the model at finding the class we want to detect and ranking it at the top?
More formally: on average, if I pick a random true positive (TP) with prediction score $s$, which fraction of the predictions with confidence $\geq s$ are actually correct?
Use when you
- Want to detect one class in particular
- The top ranking matters
- Classes are imbalanced
Multi-class problems: mAP
When considering multiclass single-label problems, we can break them into various binary problems. Suppose we have a predictor $f$ that produces probability distributions over $d$ classes. We can then break $f$ into $d$ individual predictors $f_1, \dots, f_d$ that we can now interpret as binary classifiers. Given a labeled dataset $(x_i, y_i)_{i \leq N}$, the mean average precision ($\operatorname{mAP}$) of $f$ w.r.t. it is

$$\operatorname{mAP}\big(f, (x_i, y_i)_{i\leq N}\big) = \frac{1}{d}\sum_{k=1}^d \operatorname{AP}\big(f_k, (x_i, y_i^{(k)})_{i \leq N}\big)$$

The variable $y_i^{(k)}$ denotes whether sample $x_i$ belongs to class $k$, regardless of whether this is a multilabel or single-label setting.

## Terminology: mAP vs Macro AP vs Micro AP

A common source of confusion in the literature is the distinction between **mAP**, **macro AP**, and **micro AP**. In practice, **mAP and macro AP refer to the exact same computation**: both compute the Average Precision for each class individually and then take the arithmetic mean across all classes. The terminology simply varies by domain: computer vision typically uses "mAP" while NLP and information retrieval often prefer "macro AP."

The meaningful distinction is between **macro averaging** (what we described above) and **micro averaging**. **Micro AP** pools all predictions, true positives, and false positives across classes before computing a single precision-recall curve and corresponding AP value. This approach is heavily influenced by the most frequent classes in imbalanced datasets, whereas macro averaging treats all classes equally regardless of their sample frequency. When you care about detecting minority classes, which is often the case in real-world applications, macro averaging (mAP) is typically the preferred choice.
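A minimal sketch of the macro vs. micro distinction, leaning on scikit-learn's `average` parameter (the class count and data are made up for illustration):

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n_samples, n_classes = 300, 4
labels = rng.integers(0, n_classes, size=n_samples)
y = np.eye(n_classes, dtype=int)[labels]                 # y[i, k] = 1 iff x_i belongs to class k
scores = rng.normal(size=(n_samples, n_classes)) + 2.0 * y

# Macro AP / mAP: one AP per class, then the unweighted mean over classes.
per_class_ap = [average_precision_score(y[:, k], scores[:, k]) for k in range(n_classes)]
print(np.mean(per_class_ap))
print(average_precision_score(y, scores, average="macro"))   # same computation

# Micro AP: pool all (sample, class) decisions into a single PR curve.
print(average_precision_score(y, scores, average="micro"))
```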