Finding rare objects and building pure samples: Probabilistic quasar classification from low resolution Gaia spectra

The Discrete Source Classifier (DSC) is the data processing module responsible for classifying all the objects that Gaia detects. As the name suggests, it assigns objects to discrete classes (e.g. star, galaxy, quasar, binary star), in each case assigning a class probability. Classification is based primarily on the low-resolution BP/RP spectra because, initially at least, there is no morphological information from Gaia. The subsequent stages in the CU8 data processing are concerned with extracting physical parameters for these classes (e.g. stellar temperatures) and classifying the RVS spectra. DSC is based on machine learning methods for pattern recognition, currently a so-called "Support Vector Machine".

One of the challenges of Gaia is to reliably classify rare objects, e.g. the expected half million quasars among one thousand million stars. Standard machine learning methods will often fail to identify them. To address this, the DSC team has developed a method for modifying the output probabilities to accommodate rarity, and has applied it in classification experiments on simulated data. The left-hand figure shows, for three classes of objects, the completeness (blue line) and contamination (red line) of a sample of objects as a function of the adjustable probability threshold used to build the sample. We see that we can achieve a zero-contamination sample of quasars which still has a completeness of 65%, more than sufficient for Gaia. The corresponding probability outputs from the DSC are shown in the right-hand panel. With our method we can control the class priors, which allows a single classification model to be applied to any target population without having to tune the training data and retrain the model.
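
The idea of controlling the class priors and then building a sample with a probability threshold can be illustrated with a minimal sketch. It is not the DSC implementation: it assumes the standard Bayesian prior-replacement rule (divide each class probability by the prior implied by the training set, multiply by the prior expected in the target population, renormalize), and the function names, the two-class setup and the numerical priors are hypothetical choices made for illustration only.

```python
def adjust_priors(probs, train_priors, target_priors):
    """Rescale one object's class probabilities to a new set of class priors.

    Assumption: standard Bayesian prior replacement, not necessarily the
    exact scheme used in DSC. All three arguments list the classes in the
    same order.
    """
    weighted = [p * t / r for p, r, t in zip(probs, train_priors, target_priors)]
    total = sum(weighted)
    return [w / total for w in weighted]


def completeness_contamination(class_probs, is_member, threshold):
    """Completeness and contamination of a sample built with a threshold.

    class_probs: probability of the class of interest, one value per object
    is_member:   True where the object really belongs to that class
    threshold:   objects with probability >= threshold enter the sample
    """
    selected = [m for p, m in zip(class_probs, is_member) if p >= threshold]
    n_true = sum(is_member)       # true class members in the whole data set
    n_selected = len(selected)    # objects that entered the sample
    n_correct = sum(selected)     # true class members inside the sample
    completeness = n_correct / n_true if n_true else 0.0
    contamination = (n_selected - n_correct) / n_selected if n_selected else 0.0
    return completeness, contamination


# Hypothetical two-class example, classes ordered (star, quasar).
# The model is assumed to have been trained on equal numbers of each class,
# while the target population has roughly 1 quasar per 2000 stars.
TRAIN_PRIORS = [0.5, 0.5]
TARGET_PRIORS = [0.9995, 0.0005]

raw = [0.01, 0.99]  # the classifier is fairly sure this object is a quasar
adjusted = adjust_priors(raw, TRAIN_PRIORS, TARGET_PRIORS)
print("adjusted quasar probability:", adjusted[1])
```

In this toy case a quasar probability of 0.99 under equal priors drops to roughly 0.05 once the rarity of quasars is taken into account, which is why only objects with very confident classifications survive a high threshold. Evaluating `completeness_contamination` over a grid of thresholds on labelled simulated data gives curves of the kind described for the figure.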