Priors in classification

Typically in an astronomical survey, objects of different classes are present in very different proportions. For example, in the Gaia survey, stars outnumber extragalactic objects by a factor of several hundred. When we build a classifier to identify these different classes of object, we must take these different proportions, or class priors, into account.

Any classification model, such as a neural network, random forest, or Gaussian mixture model, has a class prior, even if it is implicit. Often this prior is dictated to some extent by the relative proportions, or class fractions, of the objects of each class in the training data. This is rarely the prior we want, and if we don't actively set it to something appropriate, we will get poor performance. For example, as quasars are very rare, their prior probability must be low; otherwise we will erroneously classify many of the much more common stars as quasars. Simply setting the class fractions in the training data equal to the class prior is at best inconvenient, as it means we need a great many stars if we are to have enough quasars. But it is also generally wrong, because the training set frequencies influence - but do not normally translate directly into - the classifier's prior. Fortunately, we can overcome this. When using a classifier like a Gaussian mixture model that models likelihoods, it is easy to combine these with a class prior via Bayes' theorem. Even with other models it is often possible to compute the model's intrinsic prior and then replace it with our desired prior. This we should always do. We then have a trained classifier that provides appropriate posterior probabilities (corresponding to our class prior) that are not dictated by the class fractions in the training set.
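As a small sketch of the Bayes' theorem step (the function name and array shapes here are my own choices, not from any particular paper or library): given per-class likelihoods from a model such as a Gaussian mixture, we multiply by the desired prior and renormalise.

```python
import numpy as np

def posteriors(log_likelihoods, priors):
    """Combine per-class log-likelihoods with class priors via Bayes' theorem.

    log_likelihoods : shape (n_objects, n_classes), ln p(x | class k)
    priors          : shape (n_classes,), the desired class priors, summing to 1
    Returns posterior probabilities P(class k | x) of shape (n_objects, n_classes).
    """
    log_post = log_likelihoods + np.log(priors)      # ln [ p(x | k) P(k) ]
    log_post -= log_post.max(axis=1, keepdims=True)  # subtract max for numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)    # normalise over classes
```

Note that only likelihood ratios and prior ratios matter: if the likelihoods are equal across classes, the posterior simply equals the prior, which is why a low quasar prior suppresses spurious quasar classifications.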

We always want to test this classifier on a validation data set of objects with known classes. For each object our classifier yields posterior probabilities. We use these to assign a class (e.g. take the largest probability), compute the confusion matrix, then compute the purities and completenesses of the resulting samples. But here we must also be careful. The survey to which we want to apply our classifier (where we do not know the true classes) has a very non-uniform distribution of classes: in our case, stars massively outnumber quasars. We therefore cannot assess the performance of our classifier on a balanced validation data set directly, because such a set contains far too few potential stellar contaminants, and so would give very optimistic purities. To overcome this, we might naively think we must have a validation set with representative class fractions (i.e. equal to the prior). This is inconvenient, because we typically need thousands of known quasars to get statistically significant results, but would then need millions of known stars to get the right balance. But we don't need to do this. We can instead use any class fractions we like in the validation set, compute the posterior probabilities and assign classes, and then adjust the resulting confusion matrix to reflect the class priors. Exactly how this is done is explained in the last two paragraphs of section 3.4 of Bailer-Jones et al. (2019).
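The essence of the adjustment can be sketched as follows (this is my own illustration of the idea, not necessarily the exact procedure of the cited paper): row-normalise the confusion matrix to get per-true-class classification rates, which are independent of the validation set's class fractions, then reweight each true class by its prior before computing purities.

```python
import numpy as np

def adjusted_metrics(confusion, priors):
    """Adjust a confusion matrix, computed on a validation set with arbitrary
    class fractions, to reflect the desired class priors.

    confusion : (K, K) array of counts; confusion[i, j] = number of objects
                of true class i assigned to class j
    priors    : (K,) array of class priors, summing to 1
    Returns (completeness, purity), each of shape (K,).
    """
    # P(assigned class j | true class i): row-normalise the counts.
    # This rate does not depend on the validation set's class fractions.
    rates = confusion / confusion.sum(axis=1, keepdims=True)
    # Completeness is a within-class rate, so it is unaffected by the priors.
    completeness = np.diag(rates)
    # Reweight each true class (row) by its prior, then normalise each
    # assigned class (column): purity_j = P(true j | assigned j) under the priors.
    weighted = priors[:, None] * rates
    purity = np.diag(weighted) / weighted.sum(axis=0)
    return completeness, purity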

Failing to make the above adjustments corresponds to the well-known base rate fallacy. Both using the correct prior and adjusting the confusion matrix are important. A classifier with equal priors would perform worse on the rare objects than a classifier with appropriate priors, because the former would tend to misclassify many stars as being extragalactic. However, we would not notice this if we erroneously assessed the classifier on a balanced validation data set (equal numbers in each class), because such a validation set has an artificially low fraction of stars, and hence far too few potential contaminants. The classifier would perform worse but appear better. This is demonstrated in Table 1 of Bailer-Jones et al. (2019).

Coryn Bailer-Jones
May 2022