Modern spectroscopic databases provide a wealth of information about the physical processes and environments associated with astrophysical populations. Techniques such as blind source separation (BSS), in which sets of spectra are decomposed into a number of components, offer the prospect of identifying the signatures of the underlying physical emission processes. Principal Component Analysis (PCA) has been applied with some success but is severely limited by the inherent orthogonality restriction that the components must satisfy. Non-negative matrix factorisation (NMF) is a relatively new BSS technique that incorporates a non-negativity constraint on its components. As a result, the components may reflect the physical emission signatures more closely than is the case with PCA. Through its application to the ~100,000 quasar spectra in the Sloan Digital Sky Survey (SDSS) DR6, we show that NMF is a fast method for generating compact and accurate reconstructions of the spectra. The ability to reconstruct spectra accurately has numerous astrophysical applications. Combined with improved SDSS redshifts, we apply NMF to the problem of defining robust continua for quasars that exhibit strong broad absorption line (BAL) systems. The resulting catalogue of SDSS DR6 BAL quasars is by far the largest available, and the NMF approach allows quantitative error estimates to be derived for the Balnicity Indices as a function of key astrophysical and observational parameters, such as the signal-to-noise ratio of the spectra.
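As an illustration of the non-negativity constraint, the following is a minimal sketch of NMF using the standard multiplicative update rules of Lee and Seung, written in plain Python. It is not the implementation used for the SDSS spectra: the toy "spectra", the function names (`nmf`, `frobenius_error`) and the choices of rank, iteration count and random initialisation are all assumptions made for this example.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Frobenius norm of the residual V - W H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)) ** 0.5

def nmf(v, k, n_iter=200, seed=0, eps=1e-12):
    """Factorise a non-negative matrix V (n x m) as W (n x k) times H (k x m).

    Rows of V play the role of spectra, rows of H the components and
    rows of W the non-negative weights of each spectrum.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start (an assumption; other initialisations exist).
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h

# Toy data: four "spectra" built as non-negative mixtures of two components.
comp_a = [1.0, 0.0, 2.0, 0.0, 1.0, 0.0]
comp_b = [0.0, 1.0, 0.0, 3.0, 0.0, 1.0]
mix = [(1.0, 0.5), (0.2, 2.0), (3.0, 1.0), (0.7, 0.7)]
spectra = [[a * x + b * y for x, y in zip(comp_a, comp_b)] for a, b in mix]

weights, components = nmf(spectra, k=2)
print("reconstruction error:", frobenius_error(spectra, weights, components))
```

Because every update multiplies a non-negative entry by a non-negative ratio, the weights and components can never become negative, which is the property that distinguishes the decomposition from PCA.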
We present different automatic classification schemes for galaxies based on the shapelet decomposition of multicolor imaging, which are benchmarked using data from low- and high-redshift surveys. Furthermore, we use clustering methods to infer statistically significant classes of galaxies in the data sets, without assuming the coincidence with the Hubble classes.
We develop and demonstrate a probabilistic method for classifying rare objects in surveys with the particular goal of building very pure samples. It works by modifying the output probabilities from a classifier so as to accommodate our expectation (priors) concerning the relative frequencies of different classes of objects. We demonstrate our method using the Discrete Source Classifier, a supervised classifier currently based on Support Vector Machines, which we are developing in preparation for the Gaia data analysis. DSC classifies objects using their very low resolution optical spectra. We look in detail at the problem of quasar classification, because identification of a pure quasar sample is necessary to define the Gaia astrometric reference frame. By varying a posterior probability threshold in DSC we can trade off sample completeness and contamination. We show, using our simulated data, that it is possible to achieve a pure sample of quasars (upper limit on contamination of 1 in 40,000) with a completeness of 65% at magnitudes of G=18.5, and 50% at G=20.0, even when quasars have a frequency of only 1 in every 2000 objects. The star sample completeness is simultaneously 99% with a contamination of 0.7%. Including parallax and proper motion in the classifier barely changes the results. We further show that not accounting for class priors in the target population leads to serious misclassifications and poor predictions for sample completeness and contamination. We discuss how a classification model prior may, or may not, be influenced by the class distribution in the training data. Our method controls this prior and so allows a single model to be applied to any target population without having to tune the training data and retrain the model.
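The prior modification described above can be sketched with the standard Bayesian re-weighting of posterior probabilities. This is a generic illustration and not the DSC code: the function name `adjust_for_priors`, the two-class setup and the example numbers are assumptions, with only the 1-in-2000 quasar frequency taken from the abstract.

```python
def adjust_for_priors(model_probs, model_prior, target_prior):
    """Re-weight classifier output probabilities for a new class prior.

    model_probs  : posterior probabilities returned by the classifier
    model_prior  : class prior implicitly assumed by the model
    target_prior : expected class fractions in the target population
    Each posterior is scaled by target_prior / model_prior and the
    result is renormalised so that the probabilities sum to one.
    """
    weighted = {
        c: model_probs[c] * target_prior[c] / model_prior[c] for c in model_probs
    }
    total = sum(weighted.values())
    return {c: w / total for c, w in weighted.items()}

# Hypothetical object: the model (equal priors assumed) says 90% quasar.
model_probs = {"star": 0.1, "quasar": 0.9}
model_prior = {"star": 0.5, "quasar": 0.5}
# Target population in which quasars are 1 in every 2000 objects.
target_prior = {"star": 1999 / 2000, "quasar": 1 / 2000}

adjusted = adjust_for_priors(model_probs, model_prior, target_prior)
print(adjusted)  # the quasar probability drops to well below 1%
```

A threshold applied to the adjusted probabilities, rather than to the raw classifier output, is then what trades completeness against contamination for the rare class.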
The discovery and classification of transients is only the first step in our quest to understand their underlying physics. Follow-up at a wide range of wavelengths is thus an essential component of any survey that aims to study transient objects. Indeed, important breakthroughs may rely on completely different instruments and wavelengths than those used for discovery. In this talk I will use two classes of well-studied transients to demonstrate this point. Gamma-ray bursts remained a complete mystery as long as only gamma-ray observations from BATSE and other surveys were available. The progenitors and physics were only revealed through follow-up observations at optical, radio, and X-ray wavelengths. Supernova physics and the identity of some progenitors are also now accessible through follow-up in the radio and X-rays, providing information that is not available from optical discovery and classification alone. I will summarize some of the most exciting recent observations of GRBs and SNe that highlight this point (e.g. the identity of short GRBs, supernova shock break-out), and will also focus on strategies for effective follow-up in the context of upcoming large surveys (e.g. PanSTARRS, LSST).
The parameter fit from a model grid is limited by our capability to reduce the number of models, taking into account the number of parameters and the nonlinear variation of the models with the parameters. Local linear regression (LLR) algorithms fit the data linearly in a local environment determined by a chosen kernel. The MATISSE algorithm, developed in the context of the estimation of stellar parameters from the Gaia RVS spectra, is connected to this class of estimators. A two-step procedure was introduced. In the first step, a rough parameter estimate is made in order to localize the parameter environment suited to a linear regression with limited bias. The parameters are then estimated by projection onto specific vectors computed for an optimal estimation. In this presentation, the different features associated with the general estimation method are reviewed: the estimation by LLR, the determination of the optimal environment, the kernel choice, the interpolation by objective analysis, the bias correction and the determination of the first parameter set. This procedure can be fruitfully applied to nonlinear parameter estimation if the number of data to be fitted is much greater than the number of models.
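The idea of a kernel-weighted local linear fit can be sketched in one dimension as follows. This is a generic LLR estimator and not the MATISSE algorithm: the Gaussian kernel, the `bandwidth` parameter and the function name `local_linear_fit` are assumptions chosen for the example.

```python
import math

def local_linear_fit(xs, ys, x0, bandwidth):
    """Estimate y at x0 from a straight line fitted by weighted least squares.

    Each sample is weighted by a Gaussian kernel centred on x0, so only
    the local environment of x0 contributes appreciably to the fit.
    """
    ws = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    sw = sum(ws)
    # Kernel-weighted means of x and y.
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    # Weighted slope from the weighted covariance and variance.
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return ybar + slope * (x0 - xbar)

# A nonlinear relation sampled on a regular grid.
xs = [0.1 * i for i in range(64)]
ys = [math.sin(x) for x in xs]
print(local_linear_fit(xs, ys, x0=1.0, bandwidth=0.2), math.sin(1.0))
```

The bandwidth sets the size of the local environment: a smaller value reduces the bias caused by curvature of the underlying relation but leaves fewer effective samples in the fit.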
RAVE is a large spectroscopic survey of the Milky Way, aiming to observe up to one million stars by 2011 and to obtain their radial velocities and atmospheric parameters (see Steinmetz et al. 2006, AJ 132, 1645 and Zwitter et al. 2008, AJ 136, 421). Owing to their medium resolution (R~7500), RAVE spectra are suitable for chemical abundance analysis. For roughly 82000 out of 220000 spectra collected up to April 2008, we could derive abundance estimates for up to 12 elements, which makes RAVE the largest chemical abundance database existing today. RAVE and the ESA mission Gaia share the basic spectral characteristics (medium resolution, wavelength range 8410-8795 Å, low signal-to-noise ratio), so the problems and solutions found for the RAVE project are of general interest for Gaia as well. I will present and discuss some of the most important challenges we faced during the development of the RAVE chemical processing pipeline: (i) selection of a list of absorption lines to measure and their atomic and molecular data, and discussion of their precision and availability in the literature; (ii) the method to measure chemical abundances (equivalent width measurements instead of synthetic spectra matching); (iii) dependence on the atmospheric parameters of the star; (iv) measurement of isolated and blended lines, detection and rejection of bad data; (v) accuracy and reliability, with a critical analysis of the methods adopted. I will present some results of ongoing research projects based on chemical abundances in RAVE as well as future developments.
We will describe features of the LSST science database that are amenable to scientific data mining, object classification, outlier identification, anomaly detection, image quality assurance, and survey science validation. The data mining research agenda includes: scalability (at petabyte scales) of existing machine learning and data mining algorithms; development of grid-enabled parallel data mining algorithms; design of a robust system for brokering classifications from the LSST event pipeline (which may produce 10,000 or more event alerts per night); multi-resolution methods for exploration of petascale databases; visual data mining algorithms for visual exploration of the data; indexing of multi-attribute multi-dimensional astronomical databases (beyond RA-Dec spatial indexing) for rapid querying of petabyte databases; and more.
I present the GRavitational lEnsing Accuracy Testing 2008 (GREAT08) Challenge, which focuses on a problem that is of crucial importance for future observations in cosmology. The shapes of distant galaxies can be used to determine the properties of dark energy and the nature of gravity, because light from those galaxies is bent by gravity from the intervening dark matter. The observed galaxy images appear distorted, although only slightly, and their shapes must be precisely disentangled from the effects of pixelisation, convolution and noise. The worldwide gravitational lensing community has made significant progress in techniques to measure these distortions via the Shear TEsting Program (STEP). Via STEP, we have run challenges within our own community, and come to recognise that this particular image analysis problem is ideally matched to experts in statistical inference, inverse problems and computational learning. Thus, in order to continue the progress seen in recent years, we are seeking an infusion of new ideas from these communities. This document details the GREAT08 Challenge for potential participants. Please visit www.great08challenge.info for the latest information.
I will present an overview of the science capabilities of the Gaia mission and the expected contents of the final catalogue. After putting Gaia in the context of other astrometric and photometric surveys, I will discuss how its data is best exploited for Galactic structure studies. The majority of the stars in the Gaia catalogue will be faint (V>18). Hence we should be prepared to deal with large amounts of astrometric data for which the parallaxes, although accurate in an absolute sense, are relatively poorly determined. In addition the phase space data for the majority of the stars will be incomplete as Gaia will gather radial velocities for the brightest 10^8 stars only. The complementary photometry provided by Gaia and other surveys will be crucial for high precision Galactic structure studies to the faintest magnitude limit possible.
The use of sub-milliarcsecond trigonometric parallaxes shifts the classical problem of calibrating stellar parameters to a new level of complexity. Deriving stellar luminosities from parallaxes is not a straightforward task, since a number of statistical effects, such as the Malmquist and Lutz-Kelker biases, must be taken into account. We show that different methods must be used to derive the parameters of the luminosity function, depending on the nature of the underlying stellar sample. We also demonstrate how biases in luminosity may influence calibration relations between absolute magnitude and other stellar physical parameters, such as colour index, metallicity, radial velocity, pulsation period, etc. We emphasize that any combination of astrometric parameters, mainly parallaxes, and astrophysical ones must be handled carefully to avoid or reduce statistical effects, which may otherwise seriously affect subsequent astrophysical applications.
The determination of the total mass of galaxies, luminous as well as dark, is a difficult problem. Even for those systems for which kinematic observations are possible, carrying out such a determination is not trivial. A Bayesian non-parametric algorithm, CHASSIS, is discussed in this context; CHASSIS constrains the phase space distribution function along with the gravitational potential of the system, using an MCMC optimiser. Applications of this mass evaluation technique to various astrophysical systems will be discussed under the assumption of isotropy in phase space, and the effect of relaxing this assumption will be explored in detail.
It is currently generally assumed that one possible trigger of activity in luminous Active Galactic Nuclei and quasars is a recent merging event, or at least some kind of interaction, in the quasar host galaxy that funnels gas into the center. Although there is circumstantial evidence for this theory, the question of for which stellar masses and under which circumstances it holds is still unanswered. We present the initial steps in a project seeking to address this situation by developing an automatic method of morphological classification of galaxies, with the specific aim of detecting and quantifying merger signatures of galaxies in deep, high-resolution images. We use data from the COSMOS, GEMS, and STAGES Hubble-ACS surveys out to redshifts of z~1.5. We use a two-step process of 1) describing the light distributions of galaxies with a number of parametric and nonparametric descriptors and 2) attempting to classify distinct groups in the n-dimensional parameter space as well as to find degeneracies and dependencies between the different parameters. By comparison with by-eye classifications as well as numerical simulations, we want both to determine the limits of interaction strengths that can be automatically detected and to obtain a solid merger fraction for both quasar host and inactive galaxies, with the aim of constraining the importance of interaction for triggering AGN activity.
A maximum likelihood method for determining the spatial properties of tidal debris and of the Galactic spheroid has been developed and proven to be an accurate and efficient means of analyzing data over small wedge-shaped volumes. Using this method, the Sagittarius Dwarf tidal stream can be characterized using stars with the colors of blue F turnoff stars within SDSS data. Analyzing the Sagittarius Dwarf debris over many different volumes provides the position, direction, and size of the debris as a function of position on the sky. The stellar spheroid is fit simultaneously; therefore, the best-fit values of the spheroid within each volume are obtained. The results of the maximum likelihood method allow the data to be separated into two catalogs: one with the spatial properties of the tidal debris and one with the properties of the spheroid. Combining the respective catalogs from many volumes provides a map of both the Sagittarius tidal stream and the stellar spheroid. These results facilitate studies of tidal stream dynamics and provide a test of the existence of a smooth spheroidal population.
The MACHO project collected ~8 years of time series data on ~40, ~3 and ~45 square degrees in the LMC, SMC and Bulge respectively, monitoring ~70 million stars. About 600K variable stars were identified. I will describe a variety of techniques that were used to automatically classify these variables and their success (or failure). In particular, eclipsing binaries proved to be the hardest periodic variables to automatically identify and classify.
The CoRoT space mission was launched successfully on 27 Dec 2006. The main scientific goal of the mission is twofold: Asteroseismology and search for Exoplanets using the transit method. The latter requires the continuous and precise photometric monitoring of thousands of stars (> 100000 in total). Among this large sample, lots of variable stars of known and unknown types are being discovered as an important by-product of the Exoplanet search. Extracting those for further follow-up studies on short time scales requires the use of automated methods. We present an automated supervised classification method, developed in the framework of the CoRoT mission, but with much broader relevance. The classifier is able to recognize several types of pulsating stars and eclipsing binaries in a reliable way, based on classification attributes derived from their light curves. In an extended version of the classifier, we also include colour information to increase the separability of the classes. An overview of our results from the CoRoT data is presented. We show how the performance of the classifiers, when applied to CoRoT data, was improved in an iterative way. The importance of database-specific systematics and their influence on the classification results is discussed. Finally, we present some planned further improvements and developments.
This contribution describes the automated classification of objects from the DR6 release of the Sloan Digital Sky Survey (SDSS) using support vector machines (SVM). First, the SVM classifier was trained on a dataset comprising the u-g, g-r, r-i and i-z colours of 47,401 stars, 415,634 galaxies and 71,031 quasars with spectral classifications. An analysis of the performance of the classifier resulted in a total classification error of 3.80% and shows that the SVM is able to learn the non-linear, four-dimensional class boundaries efficiently. Afterwards, class membership probabilities for stars, galaxies and quasars were predicted for 12,362,179 objects without spectral classifications which were situated within the inner 90% of the training colour space and had magnitude errors below 10%. The SVM predicted 11,012,775 stars, 1,088,862 galaxies and 260,542 quasars. The relatively high number of galaxies can be explained by our requirement for low magnitude errors, which puts constraints on fainter stars. The result was validated by cross-matching against the FIRST, USNO-B and ROSAT surveys. The cross-match with FIRST resulted in 8,666 radio sources, of which 94.8% were predicted to be either galaxies or quasars, as expected. The cross-match with USNO-B resulted in 9,583,303 matches. It showed that, as expected, 96.3% of the predicted galaxies and 98.5% of the predicted quasars have proper motions less than 20 mas/year. The 5,597 matches with ROSAT X-ray sources did not lead to further conclusions. A comparison with the morphological classification by SDSS revealed significant differences. Of the predicted galaxies, 314,863 were classified as stars by SDSS. We claim that this highlights problems in the morphological classification algorithm of SDSS which can be circumvented when using SVMs in colour space.
The classification of time series from large-scale photometric surveys into variability types and the description of their properties is difficult for various reasons, such as the irregular sampling, the usually small number of available photometric bands, and the diversity of variable objects, to name but a few. In this study, we will review the use of various supervised and unsupervised learning methods, such as Self-Organising Maps and Support Vector Machines, on the Hipparcos, OGLE and ASAS data. We will also present our approach for processing the data resulting from the Gaia space mission. The approach may be divided into the following three broad categories: supervised classification, unsupervised classification, and so-called "extractor" methods, i.e. algorithms that are specialized for a particular type of source.
Although very successful at larger scales, Cold Dark Matter (CDM) models disagree with observations at small scales. The predicted number of low-mass sub-halos exceeds by an order of magnitude the number of dwarf galaxies found around the Milky Way and M31. One of the possible solutions to this "missing satellites problem" is that the majority of the dwarf galaxies were destroyed in the early epochs of galaxy formation, possibly leaving behind faint remnants which have remained undetected until the present time. The increasing availability of data provided by large ground-based telescopes, the Hubble Space Telescope, and homogeneous wide area surveys like the Sloan Digital Sky Survey (SDSS) enabled observational tests of such theoretical models and led to an unprecedented revival of interest in the study of galactic structure. First studies detected a dozen dwarf galaxies in the Milky Way and M31, indicating that the census of these systems could be incomplete. We therefore started a project to search for surviving dwarf galaxies or tidal debris of already completely destroyed galaxies in the outer halo of the Milky Way. We will exploit the homogeneous and deep multi-color SDSS DR6 dataset using reliable techniques to analyse over-densities in the stellar magnitude-color-position space. The study of the spatial structure and kinematics of the most promising over-densities will confirm whether these objects are tidal debris of dwarf galaxies accreted during the hierarchical formation of the Milky Way, and will complete the census of merger events in our Galaxy.
The small sizes of low mass stars provide an opportunity to find earth-like planets and "super earths" in habitable zones via transits. Large area synoptic surveys like Pan-STARRS and LSST will observe large numbers of low mass stars, albeit with irregular (sparse) time sampling relative to the planets' periods and transit durations. I will discuss the numbers of M-stars versus mass and size in the surveys, the methodology for estimating the number of transiting planets, and photometric requirements and approaches for finding transits in sparsely sampled data. Our search for transiting planets and M-star eclipsing binaries in the SDSS-II supernova data will be used to illustrate the problems (and successes) in using sparsely sampled surveys.
We propose an original approach to cluster multi-component datasets. An estimate of the number of clusters and a refinement of the initial conditions for a partitioning method are computed from the construction of a minimal spanning tree with Prim's algorithm, under the assumption that the vertices are approximately distributed according to a Poisson distribution. The number of clusters is estimated by thresholding the Prim's trajectory (the function which records, at each iteration, which vertex is connected and the length of the new edge). The corresponding cluster centroids are then computed in order to initialize the Generalized Lloyd's algorithm (K-means), which circumvents initialization problems. New criteria are derived for setting the false alarm rate (power) of a test over the Prim's trajectory. Metrics used for measuring similarity between multi-dimensional data points are based on symmetrical divergences (e.g. Kullback-Leibler and Rényi). The use of these informational divergences together with the proposed method leads to better results than some other clustering methods applied to astrophysical data such as simulated reflectance spectra, popular surveys such as the Eight Color Asteroid Survey and the Small Main Belt Asteroid Spectroscopic Survey II, and hyper-spectral images.
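The thresholding of the Prim's trajectory can be sketched as follows. This is a simplified illustration and not the authors' method: it uses Euclidean distance rather than a symmetrical divergence, and a user-supplied `threshold` rather than a criterion derived from the Poisson assumption. All function names are hypothetical.

```python
import math

def prim_trajectory(points):
    """Run Prim's algorithm and record, at each iteration, the connected
    vertex, the tree vertex it attaches to, and the length of the new edge."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # shortest known edge from the tree to each vertex
    parent = [None] * n
    best[0] = 0.0              # start the tree at vertex 0
    trajectory = []
    for _ in range(n):
        v = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[v] = True
        trajectory.append((v, parent[v], best[v]))
        for u in range(n):
            if not in_tree[u]:
                d = math.dist(points[v], points[u])
                if d < best[u]:
                    best[u], parent[u] = d, v
    return trajectory

def clusters_from_trajectory(trajectory, threshold):
    """Start a new cluster whenever the new edge is longer than the threshold;
    otherwise the vertex inherits the label of the vertex it attaches to."""
    labels, n_clusters = {}, 0
    for v, parent, length in trajectory:
        if parent is None or length > threshold:
            labels[v] = n_clusters
            n_clusters += 1
        else:
            labels[v] = labels[parent]
    return labels, n_clusters

def centroids(points, labels, n_clusters):
    """Mean position of each cluster, usable as K-means starting centroids."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(n_clusters)]
    counts = [0] * n_clusters
    for i, p in enumerate(points):
        counts[labels[i]] += 1
        for d, x in enumerate(p):
            sums[labels[i]][d] += x
    return [[s / counts[c] for s in sums[c]] for c in range(n_clusters)]

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, k = clusters_from_trajectory(prim_trajectory(points), threshold=3.0)
print(k, centroids(points, labels, k))
```

The number of clusters found this way equals one plus the number of spanning-tree edges longer than the threshold, and the centroids give a deterministic starting point for a K-means run.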
The SuperMACHO project is a 5-year survey to determine the nature of the lens population responsible for the excess microlensing rate toward the Large Magellanic Cloud. The survey probes deeper than earlier surveys, unveiling many more extra-galactic contaminants, particularly type Ia supernovae and AGNs. Using ~10^8 simulated light curves of both microlensing events and type Ia supernovae, we determine selection criteria optimized to maximize the microlensing detection efficiency while minimizing the contamination rate from non-microlensing events. This poster discusses these simulations and the selection criteria.
A dormant supermassive black hole lurking in the center of a galaxy will reveal itself when a star approaches close enough to be torn apart by tidal forces, and some fraction of the stellar debris is accreted by the black hole resulting in a luminous accretion flare. Based on the successful detection of two candidate tidal disruption events by GALEX and CFHTLS in the UV and optical, we predict that the Pan-STARRS 1 Medium Deep Survey will detect 15 events/yr. We will present our strategy for identifying these events with the Transient Classification Server, and discuss our plans for follow-up observations. A large sample of detailed light curves of tidal disruption events will allow us to probe the dormant black hole mass function in normal galaxies, and test models for the coevolution of central black holes and their host galaxy bulges.
I'll describe algorithms and data structures for allowing the most powerful machine learning methods, which often scale quadratically or even cubically with the number of data points, to be performed many orders of magnitude faster than naive implementations. Such techniques can make previously impossible statistical analyses tractable on the scale of entire sky surveys. I will discuss scalable algorithms we have developed for n-point correlations, friends-of-friends, nearest-neighbors, kernel density estimation, nonparametric Bayes classification, principal component analysis, local linear regression, isometric non-negative matrix factorization, hidden Markov models, k-means, support vector machine-like classifiers, Gaussian process regression, and Gaussian graphical model inference, among others. In addition to techniques inspired by computational geometry, fast multipole methods, and Monte Carlo integration, we employ a distributed framework which can be thought of as a higher-order version of Google's MapReduce. Our algorithms have enabled several first-of-a-kind large-scale analyses by our collaborators in astrophysics as well as other fields.
I will discuss the application of the matched-filter technique to wide area photometric surveys. This technique is very efficient for distinguishing between discrete stellar populations and the sea of Galactic foreground stars. To date, seven tidal streams within 50 kpc have been discovered by applying this method to the Sloan Digital Sky Survey. Application of similar techniques to future surveys should enable the detection of substructure throughout much of the Local Group.
I will present work in the context of big HST surveys (like GEMS, STAGES and COSMOS) using extensive and well tested image simulations to derive both the detection completeness of these surveys and the reliability of 2D galaxy fitting codes (GALFIT and GIM2D), which are widely used for galaxy classification through morphology (especially for galaxies at high redshift) but rarely tested thoroughly. We found that, whereas both codes perform similarly well on bright, big galaxies in these surveys, GALFIT is more robust on faint and small galaxies, especially when used in the context of GALAPAGOS, a script that automates the fitting process successfully. I will further show that both codes severely underestimate the true parameter error bars. Although a Sérsic index cut is not an ideal tool to distinguish between early- and late-type galaxies, it is widely used as such. I will point out the danger in using a simple automated cut.
The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) will use gigapixel CCD cameras on multiaperture telescopes to survey the sky in the visible and infrared bands. A single telescope system (PS1) has been deployed on Maui, and a four-telescope system (PS4) will be sited on Mauna Kea on the Big Island of Hawaii. These systems will survey the sky repeatedly and will generate petabytes of image data and catalogs of billions of stars and galaxies. Each set of images will be combined to create a very sensitive multicolor image of the sky, and differences between images will provide for a massive database of "time domain astronomy," including the study of moving objects and transient or variable objects. All data from PS1 will be put into the public domain following its 3.5 year survey. The project faces formidable challenges in processing the image data in near real time and making the catalog data accessible via relational databases. In this talk, I describe the software systems developed by the Pan-STARRS project and how these core systems will be augmented by an assortment of science "servers" being developed by astronomers in the PS1 Science Consortium.
Dark energy is the subject of many very large proposed surveys both from the ground and from space. These surveys will present a number of challenges to infer Dark Energy properties. This talk will cover various issues, from optimising designs, methods of analysis of very large data sets with nearly singular correlation properties, to efficient means of parameter estimation and model selection.
We have an automated calibration system that provides astrometric calibration (image pointing, orientation, and scale) for data of unknown provenance, using the pixel content of the image alone. This automated system is the first step towards building a model of the sky through wavelength and time---that is, a model of the positions, motions, and variability of all stellar sources, plus an intensity map of all cosmological sources---that can generate any astronomical image ever taken at any time with any equipment in any configuration. A generative model of this form, combined with automated classification of digital-imaging anomalies, is the best possible platform for automated discovery, because it is capable of identifying failures of the model at the pixel level. It is also, in some sense, an astronomer's "theory of everything".
The near future of astrophysics involves many large solid-angle, multi-epoch, multi-band imaging surveys. These surveys will, at their faint limits, have data on large numbers of sources that are too faint to detect at any individual epoch. Here we show that it is possible to measure in multi-epoch data the fluxes, positions, parallaxes, and proper motions of sources that are too faint to detect at any individual epoch. The method involves fitting a model of a moving point source simultaneously to all imaging, taking account of the noise and point-spread function in each image. By this method it is possible to make measurements at the minimum possible uncertainty given the information in the dataset. We demonstrate the technique on artificial data and on multi-epoch Sloan Digital Sky Survey imaging of the SDSS Southern Stripe. With the SDSS data we show that it is possible to distinguish very red brown dwarfs from very high-redshift quasars using proper motion and parallax measurements of this kind.
We present the morphological classification in two broad types (late and early) of ~50000 galaxies in the COSMOS field. Galaxies are observed in the near-infrared (Ks band) at a median redshift of z~0.8 using WirCam at CFHT. These data are particularly interesting because K-band data have the advantage of probing old stellar populations in the rest-frame for z<2, enabling a determination of galaxy morphological types unaffected by recent star formation. Moreover, no space data in this wavelength range are available today. Galaxies are classified with a new non-parametric method (GalSVM, http://www.lesia.obspm.fr/~huertas/galsvm.html) based on support vector machines. We show that a qualitative separation in two main morphological types can be obtained with an error lower than 20% even on seeing-limited images. We therefore obtain for the first time rest-frame I band morphologies based on structural parameters up to z~2. We investigate the redshift distribution per morphological type up to z~2 and compare to the same distribution obtained with rest-frame B band morphologies (HST/ACS). We do not observe significant differences between the two distributions indicating that the morphological k-correction effect is not very important at those redshifts.
We present results on morphologically classified merging galaxies in the Red Sequence Cluster Survey 2 (RCS2) data set. The morphological classification is first carried out by a pattern recognition method and then scrutinized by visual identification. We have found more than 10 thousand merging galaxies in our current data set. We also find nine new candidate galaxy clusters by searching for regions with significant density enhancements of merging galaxies. Our results contain a very large number of interacting and merging galaxies based on morphological identification and will be very useful as a uniform base for further photometric and spectroscopic studies of galaxy evolution.
The Large Synoptic Survey Telescope (LSST) will be a large, wide-field ground-based system designed to obtain multiple images covering the sky that is visible from Cerro Pachon in Northern Chile. The current baseline design, with an 8.4m (6.5m effective) primary mirror, a 9.6 sq.deg. field of view, and a 3.2 Gigapixel camera, will allow about 10,000 square degrees of sky to be covered using pairs of 15-second exposures in two photometric bands every three nights on average, with 5-sigma depth for point sources of r~24.5. The survey area will include 30,000 sq.deg. with delta<+34.5 deg, and will be imaged multiple times in six bands, ugrizy, covering the wavelength range 320--1050 nm. About 90% of the observing time will be devoted to a deep-wide-fast survey mode which will observe, starting in 2015, a 20,000 sq.deg. region about 1000 times during the anticipated 10 years of operations (including all six bands). These data will result in databases including 10 billion galaxies and a similar number of stars, and will serve the majority of science programs. We will discuss various measurements that will be automatically performed for these ~20 billion sources, and how they can be used for classification and determination of their physical properties (e.g. taxonomic classification of asteroids, photometric distance and metallicity for stars, and photometric redshifts for galaxies).
The Gaia probe, set to launch in 2011, will measure an estimated billion astronomical objects, producing an enormous amount of data. One of the data analysis tasks will be the identification and classification of measured objects. The vast majority of them will be "ordinary" stars from our Galaxy but a certain percentage will belong to "peculiar" objects, and there will be a subsample of emission line stars (ELS). The characteristic feature of most ELS is the presence of the hydrogen H-alpha line in emission in their spectra. In the case of Gaia measurements, the influence of this line could be detected in the low resolution prismatic spectra which will be recorded in both the blue (BP) and red (RP) spectral regions. In this work, we compare different algorithms for detecting and characterizing H-alpha lines in Gaia spectra. These include a simple, integrated flux ratio-based algorithm, a Gaussian decomposition algorithm and several machine learning algorithms, such as neural networks, support vector machines and support vector regression. We also look at the effect of different preprocessing techniques. In particular, we consider various transformations from the wavelength domain to a frequency (Fourier) domain. In addition, we will study line detection from both single-transit and oversampled end-of-mission data.
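A minimal sketch of what an integrated flux ratio test for H-alpha (656.3 nm) could look like is given below. The window boundaries, the toy spectrum and the function names are assumptions made for illustration; the abstract does not specify the windows or thresholds actually used.

```python
import math


def mean_flux(wavelengths, fluxes, lo, hi):
    """Mean flux of the samples whose wavelength lies in [lo, hi]."""
    selected = [f for w, f in zip(wavelengths, fluxes) if lo <= w <= hi]
    if not selected:
        raise ValueError("no samples fall inside the requested window")
    return sum(selected) / len(selected)


def line_to_continuum_ratio(wavelengths, fluxes, line_window, continuum_windows):
    """Ratio of the mean flux in the line window to the mean continuum flux.

    The continuum level is the average of the mean fluxes in the side windows.
    Values clearly above 1 suggest emission, values below 1 absorption.
    """
    line = mean_flux(wavelengths, fluxes, *line_window)
    side_means = [mean_flux(wavelengths, fluxes, lo, hi)
                  for lo, hi in continuum_windows]
    return line / (sum(side_means) / len(side_means))


# Hypothetical low-resolution sampling from 600 to 720 nm in 2 nm steps.
wavelengths = [600.0 + 2.0 * i for i in range(61)]
flat = [1.0] * len(wavelengths)
# Toy emission line: broad Gaussian bump centred on H-alpha.
emission = [1.0 + 0.5 * math.exp(-0.5 * ((w - 656.3) / 8.0) ** 2)
            for w in wavelengths]

LINE_WINDOW = (640.0, 672.0)                          # assumed window
CONTINUUM_WINDOWS = [(600.0, 630.0), (690.0, 720.0)]  # assumed windows

ratio_flat = line_to_continuum_ratio(wavelengths, flat,
                                     LINE_WINDOW, CONTINUUM_WINDOWS)
ratio_emission = line_to_continuum_ratio(wavelengths, emission,
                                         LINE_WINDOW, CONTINUUM_WINDOWS)
```

A featureless spectrum gives a ratio of 1, while the toy emission line pushes the ratio well above 1. A detection threshold would have to be calibrated against the noise level.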
Extracting galaxy physical parameters from their stellar population spectra is challenging, yet it offers a unique opportunity to classify galaxies according to the physical processes that have led to their formation and subsequent evolution. I will review what we have learned from mega-spectral-surveys about the formation and evolution of galaxies, with a special emphasis on how galaxies can be classified according to their environment.
In this poster I will demonstrate how one can find dwarf galaxies and stellar streams in the SDSS survey. I will show how methods based on convolving stellar densities with a kernel work when applied to the SDSS dataset, and how they allow a large amount of substructure in the MW halo to be discovered with high significance. The kernel-based density methods also work relatively well for other datasets such as 2MASS, where they reveal a large population of open clusters. An important step for all these methods is the subsequent verification of overdensities, using all available photometric information to check whether each overdensity is consistent with a single stellar population localized at a certain distance. Additionally, I will show how the Hough transform applied to the SDSS dataset makes it possible to search for and discover stellar streams of different widths and at different distances, and how the matched filtering technique helps in determining the parameters of the discovered faint streams.
The Concentration, Asymmetry, Clumpiness, Gini's coefficient and the Momentum of the 20% brightest pixels form a set of parameters (CASGM20) that are traditionally used in the morphological classification of galaxies when only a limited number of pixels is available to analyse, such as in HST Deep Field galaxies. The ESA-Gaia space mission will observe millions of galaxies up to its own magnitude 20 (around V=20 for blue objects and V=22 for red ones); however, each of its observations will provide data only for a small region of ~2.3 x 0.3 arc-seconds around the objects. Nonetheless, from the ensemble of all observations of a single object (~70 during the mission), 2D images will be reconstructed using detailed algorithms in the object processing data reduction, which will make it feasible to measure those parameters. In particular, we are interested in the study of the morphology of small galaxies (with r<2.3 arc-seconds), so we adapted the measurement of the CASGM20 parameters to the particularities of the 2D images reconstructed from the ESA-Gaia satellite data. The redefinitions of those parameters in this context will be presented. We are also testing support vector machines (SVM) as a discrete object classifier to draw the decision functions in the CASGM20 space. We will present the current strategy for training the SVM, as well as the first results obtained from a three-class discrimination.
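For orientation, the sketch below computes the Gini coefficient of a set of pixel fluxes using the sorted-rank form that is common in galaxy morphology work. It shows the conventional definition only; the Gaia-specific redefinition mentioned above is not given in the abstract and is not reproduced here. The function name is an illustrative choice.

```python
def gini(pixel_values):
    """Gini coefficient of the absolute pixel fluxes.

    Uses G = sum_i (2i - n - 1) |x_i| / (mean * n * (n - 1)) with the
    values sorted in increasing order and i running from 1 to n.
    G = 0 when all pixels are equal; G = 1 when one pixel holds all the flux.
    """
    values = sorted(abs(v) for v in pixel_values)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pixels")
    mean = sum(values) / n
    if mean == 0.0:
        raise ValueError("all pixel values are zero")
    total = sum((2 * i - n - 1) * v for i, v in enumerate(values, start=1))
    return total / (mean * n * (n - 1))


uniform = gini([1.0, 1.0, 1.0, 1.0])       # flux spread evenly
concentrated = gini([0.0, 0.0, 0.0, 5.0])  # all flux in one pixel
```

The two toy inputs bracket the possible range: evenly spread flux gives 0 and a single bright pixel gives 1.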
We present a multi-wavelength study on the nature of the SDSS galaxies divided into fine classes based on their morphology, color and spectral features. The SDSS galaxies are classified into early-type and late-type; red and blue; passive, HII, Seyfert and LINER, which returns a total of 16 fine classes of galaxies. The properties of galaxies in each fine class are investigated from radio to X-ray, using 2MASS, IRAS, FIRST, GALEX and ROSAT data. The optical - near-infrared colors of red late-type galaxies (RLGs) reveal that their stellar contents may be combinations of old stars and very young stars. Dust extinction may not be the dominant factor to make RLGs red, because they are detected in the mid- and far-infrared bands less than blue late-type galaxies (BLGs). The radio detection-fraction of red early-type galaxies (REGs) shows obvious dependence on their absolute magnitudes, whereas the radio detection-fraction of late-type galaxies mainly depends on their apparent magnitudes rather than their absolute magnitudes, implying the difference in their radio sources. The ultraviolet colors of blue early-type galaxies indicate that those galaxies may have complicated star formation histories. Other multi-wavelength properties in each fine class are investigated, and their implication on galaxy evolution is discussed.
The data obtained by the recent modern sky surveys enable detailed studies of the stellar distribution in the 7-D space from the solar neighborhood all the way out to the outer halo. While these results represent exciting observational breakthroughs, their interpretation is not simple. For example, traditional decomposition of the thin and thick disks predicts a strong correlation in metallicity and kinematics at ~ 1 kpc from the mid-plane; however, recent work has found an absence of this correlation for disk stars (Ivezić 2008). Instead, the variation of the metallicity and rotational velocity distributions can be modeled using non-Gaussian functions that retain their shapes and only shift as the distance from the mid-plane increases. To fully contextualize these recent observational results, a detailed comparison with sophisticated numerical models is necessary. Modern simulations have sufficient resolution and physical detail to study the formation of stellar disks and spheroids over a large baseline of masses and cosmic ages. Initial comparisons of various observed maps and model predictions from simulations by Governato, Quinn and collaborators are encouraging. A comparison of kinematic data with an N-body model of a Milky Way-like galaxy is presented.
We analyze the multi-wavelength temporal observations obtained by SDSS for 2600 spectroscopically confirmed quasars. The rest-frame time lags span the range from 1 day to 10 years, and include over 50 observations per object and each filter (ugriz). We quantify the behavior of the mean variability structure function in the four-dimensional space spanned by wavelength, time-lag, luminosity, and redshift, and study its variation for individual objects as a function of these and other parameters, such as the presence of X-ray and radio emission, and optical and near-infrared colors.
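One common way to estimate a variability structure function for a single light curve is sketched below: the root-mean-square magnitude difference of all pairs of epochs, binned by rest-frame time lag. Several definitions are in use (some subtract a photometric noise term), so this should be read as one plausible variant rather than the estimator used in this study. The function name and binning scheme are assumptions.

```python
import math


def structure_function(times, mags, redshift, bin_edges):
    """RMS magnitude difference of epoch pairs, binned in rest-frame lag.

    times are observed-frame epochs; lags are divided by (1 + redshift).
    Returns one value per bin, or None for bins that contain no pairs.
    """
    n_bins = len(bin_edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(len(times)):
        for j in range(i + 1, len(times)):
            lag = abs(times[j] - times[i]) / (1.0 + redshift)
            for b in range(n_bins):
                if bin_edges[b] <= lag < bin_edges[b + 1]:
                    sums[b] += (mags[j] - mags[i]) ** 2
                    counts[b] += 1
                    break
    return [math.sqrt(s / c) if c else None for s, c in zip(sums, counts)]


# Toy light curve that fades linearly by 0.1 mag per day, at redshift 0.
sf = structure_function([0.0, 1.0, 2.0, 3.0], [0.0, 0.1, 0.2, 0.3],
                        redshift=0.0, bin_edges=[0.5, 1.5, 2.5, 3.5])
```

For the linear toy light curve the structure function grows in proportion to the lag. An ensemble mean would pool the pairs from many quasars before taking the square root.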
Exploration of the time domain is now a vibrant area of research in astronomy, driven by the advent of digital synoptic sky surveys. The time domain contains unique information on a broad range of interesting phenomena, from the exploration of the outer Solar system to extreme relativistic astrophysics. While panoramic surveys can detect variable or transient events, some follow-up observations are typically needed; for short-lived phenomena, a rapid response is essential. The ability to automatically classify and prioritize transient events for follow-up studies becomes critical as the data rates increase. We have been developing such methods using the data streams from the Palomar-Quest survey (http://palquest.org), the Catalina Sky Survey (http://www.lpl.arizona.edu/css) and others, using the VOEventNet framework (http://voeventnet.caltech.edu). The goal is to automatically classify transient events using the new measurements, combined with archival data (previous and multi-wavelength measurements) and contextual information (e.g., Galactic or ecliptic latitude, presence of a possible host galaxy nearby, etc.), and to iterate the classifications dynamically as the follow-up data come in (e.g., light curves or colors). We have been investigating Bayesian methodologies for classification, as well as discriminated follow-up to optimize the use of available resources, including a Naive Bayesian approach and non-parametric Gaussian process regression. We will also be deploying variants of traditional machine learning techniques such as Neural Nets and Support Vector Machines on datasets of reliably classified transients as they build up.
Digital synoptic sky surveys pose several new object classification challenges. In surveys where real-time detection and classification of transient events is a science driver, there is a need for an effective elimination of instrument-related artifacts which can masquerade as transient sources in the detection pipeline, e.g., unremoved large cosmic rays, saturation trails, reflections, crosstalk artifacts, etc. We have implemented such an Artifact Filter, using a supervised neural network, for the real-time processing pipeline in the Palomar-Quest (PQ) survey. After the training phase, for each object it takes as input a set of measured morphological parameters and returns the probability of it being a real object. Despite the relatively low number of training cases for many kinds of artifacts, the overall artifact classification rate is around 90%, with no genuine transients misclassified during our real-time scans. Another question is how to assign an optimal star-galaxy classification in a multi-pass survey, where seeing and other conditions change between different epochs, potentially producing inconsistent classifications for the same object. We have implemented a star/galaxy multi-pass classifier that makes use of external and a priori knowledge to find the optimal classification from the individually derived ones. Both these techniques can be applied to other, similar surveys and data sets.
If we are to constrain cosmology using galaxy cluster catalogs, we must understand the relation between the clusters we identify on the sky and the dark matter halos we predict in theory as precisely as possible. This requirement provides a focus for a new generation of multi-wavelength cluster finding and measurement algorithms. I will discuss a few key issues, report on some recent progress, and consider future directions.
I will give an overview of the shapelet method and show how it can be used as a general framework for morphological analysis and classification of galaxy images. For gravitational lensing measurements I will address modelling of and deconvolution from the PSF and discuss several approaches to estimate the shear.
Baryon Acoustic Oscillations (BAO) result from the coupling of baryons and photons by Thomson scattering in the early universe. This coupling allows sound waves to propagate, leaving a scale imprinted in large galaxy surveys. This scale can be used as a standard ruler to measure the expansion of the Universe and constrain dark energy models. In this talk I will briefly review the physics of BAO, some caveats about their use and measurement from surveys. Current and future observational results will be discussed.
We present preliminary results of a project carried out by the Survey Science Centre of XMM-Newton aiming at the statistical identification of all 2XMM catalogue sources. The 2XMM catalogue has been cross-correlated with various other catalogues such as SDSS DR6 and 2MASS. For that purpose we have developed an original tool, based on a classical Bayesian approach, which provides probabilities of identification without resorting to Monte Carlo simulations. In order to perform supervised classifications, we have built learning samples using the Downes catalogue of cataclysmic variables, and spectroscopic identifications of AGNs, galaxies and stars in the SDSS DR6. The parameter space has been reduced by a principal component analysis. We have compared classifications using a kNN approach and a more elaborate kernel density classification. The original aspect is that we take into account the heteroscedasticity of errors on each parameter. We will summarize the current status of this project and will present some interesting results arising from the cross-correlations and classifications.
My talk will give an overview of what information can be extracted from stellar spectra of all qualities, from rough to detailed observations, and what techniques can be used for this purpose, from traditional methods to more recent and sophisticated proposals. I will also discuss what can be expected from observations with the Gaia Radial Velocity Spectrometer, which will collect high-dispersion spectra for about 100 million stars by 2015.
X-rays in surveys are most often detected from AGN; however, with current and planned sensitive X-ray telescopes, normal/starburst galaxies are detected as well. We will discuss our Bayesian approaches to both source classification and the starburst galaxy and absorbed AGN luminosity functions. Key topics include the distribution of absorption in AGN and its dependence on luminosity, clustering in X-ray versus radio and optically-selected AGN samples, and the correlation of hot ISM and X-ray binaries in galaxies with star-formation rate and stellar mass as a function of galaxy type. The prospects for correlating wide-area X-ray survey data (e.g., 2XMM, XMM Slew Survey and Chandra serendipitous source catalogs) with other wide-area surveys (FIRST, 2MASS, SDSS, Pan-Starrs, etc.) will be discussed, as well as expected results from proposed X-ray survey missions.
The advent of large Galactic surveys is extraordinarily beneficial for Galactic astronomy. However, efficiently extracting scientific information from such huge databases becomes a critical and challenging problem. The classification of the wide variety of objects becoming available requires automated and appropriate multi-dimensional data analysis techniques. We examine and test different (supervised and unsupervised) learning algorithms to classify observed objects and, once detected, to determine precise, accurate and reliable intrinsic properties for each of them; better performance, including robustness to noise and the ability to discover unusual data, is achieved by combining different methods. Currently, our models focus on discriminating between single stars and peculiar objects such as binaries, emission-core stars and fast-rotating stars. We present recent results and discuss various aspects of source classification and physical parametrization (effective temperature, surface gravity and metallicity, in particular) from SDSS-II/SEGUE and RAVE spectroscopy. Looking ahead, the techniques investigated form the basis for the classifiers of future ground and space missions, which are essential for fully exploiting the catalogues with astrophysical information.
One of the biggest challenges in current and future time-domain surveys is to extract the objects of interest from the immense data stream. There are two aspects to achieving this goal: detecting variable sources and classifying them. I will discuss how an automated image reduction pipeline can aid in the process of accurately identifying variable sources and touch upon some of the trade-offs between optimization and automation. Once we've identified true astrophysical variables, we are faced with the challenge of classifying them. For rare events, such as supernovae and microlensing, this challenge is magnified because we must balance having selection criteria that select for the largest number of objects of interest against a high contamination rate. Throughout my talk, I will use examples from the ESSENCE and SuperMACHO projects to demonstrate how we implemented these ideas and the issues we faced.
I describe the creation of a catalog of nearly 1,000,000 quasars from SDSS photometric data using a non-parametric Bayesian classification algorithm that is vastly superior to the standard technique of making cuts in color space. Such algorithms are crucial for taking full advantage of upcoming large area sky surveys. We further describe our efforts to extend this work to other data sets and to use the resulting catalog to determine the luminosity function of quasars to limits that can constrain feedback models of AGN growth.
SEGUE, the Sloan Extension for Galactic Understanding and Exploration, is an imaging and spectroscopic survey of the structure, kinematics and chemical abundance distribution of the old populations of the Milky Way. The survey is designed to map the global structure and stellar population content of the Galactic halo, thin and thick disks, and has the depth, area and velocity accuracy necessary to undertake a systematic search for substructure. The goal is to create a dataset that can be used to investigate the formation and evolution of the Milky Way through study of its merging history, chemical abundance evolution and dynamical properties. The first phase of SEGUE is complete, with nearly 240,000 stars to be included in the public release. I will discuss the methods and results of searches for substructure in the kinematic and chemical abundance distributions of the stellar halo and thick disk of the Galaxy.
We present new results from the SDSS spectroscopic Quasar survey, examining the clustering properties of quasars via the 2-point correlation function. The evolution of quasar bias is discussed and put in context with recent observational measurements at z~2, and comparisons to theoretical models are made. We then look ahead to the high redshift part of the Baryon Oscillation Spectroscopic Survey (BOSS), which will use the existing SDSS telescope with upgraded spectrographs to perform a large galaxy and quasar redshift survey in order to provide a percent level measurement of the expansion history of the Universe at z < 0.7 and z~2.5.
The PanStarrs1 project is on its way to starting science operations in the coming months. In the next 3.5 years it will produce a grizy survey of 3/4 of the sky ~2 mag deeper than Sloan. The Photometric Classification server is responsible for the object classification based on multiband photometry and the accurate delivery of photometric redshifts. Several science projects rely on the output of the server, from transit planet searches to transient detections, the structure of the Milky Way, high redshift quasars, galaxy evolution, cosmological shear, baryonic oscillations and galaxy cluster searches.
Explosive Events in the solar Transition Region are spectroscopic signatures of magnetic reconnection in the solar atmosphere. Due to the limited spatial resolution of currently available spectrographs, the resolution element contains both the reconnection site and a large area of the Sun with quiet network emission. Here we present the work carried out to characterize the explosive events and to create automatic classifiers that gather a statistically significant sample of such observations in the SoHO Archive (1996-), in the hope that this will resolve many of the long-standing controversies regarding the geometrical and physical properties of explosive events.
In this presentation we give an overview of the machine learning algorithm Random Forests as applied to the IBIS/ISGRI dataset in order to ease the production of future soft gamma-ray source catalogues. First we introduce the dataset and the problems encountered when dealing with images obtained using the coded mask technique. The initial step of source candidate searching is introduced and an initial candidate list is created. The feature extraction performed on the initial candidate list is then described, together with the feature merging for these candidates. Three training and testing sets are created and three Random Forests are built: one dealing with faint persistent source recognition, one dealing with strong persistent sources and a final one dealing with transients. With respect to the latter, a new transient detection technique is introduced and described: the Transient Matrix. Finally, the performance of the network is assessed and discussed using the testing set.
We have applied a Learning Vector Quantization (LVQ) algorithm to SDSS DR5 quasar spectra in order to create the largest-ever catalogue of broad absorption line quasars (BALQSOs). We first discuss the problems with BALQSO catalogues constructed using the conventional balnicity and/or absorption indices (BI and AI), and then describe the supervised LVQ network we have trained to recognise BALQSOs. The resulting BALQSO catalogue should be substantially more robust and complete than BI- or AI-based ones. We also discuss the uses of such networks in contemporary astronomical data mining more generally.
In this work we intend to address the problem of unsupervised classification on large datasets, of order 100,000,000 objects. These are variable objects, which make up 10% of the 1,000,000,000 astronomical objects that will be collected by the GAIA/ESA mission. We are building several templates to represent the main classes of variable objects, as well as new classes, to build a synthetic dataset of this dimension. We will run the GAIA satellite scanning law on these templates to obtain a testable large dataset. Moreover, we are testing several unsupervised classification algorithms on known datasets such as the OGLE and Hipparcos catalogs. As a final goal we intend to develop a general Knowledge Discovery Tool for Large Data Sets to support the scientific and technical analysis and visualization of this kind of data, before, during and after the mission. We are also going to study parallelization techniques for the selected algorithms. The tool will enable users to manipulate attributes, test and compare several clustering algorithms, and access specialized visualization techniques to help knowledge discovery. The tool is targeted at astronomers and physicists, to allow them to explore the data and obtain more knowledge about the Universe.
As part of the Gaia mission, detected point sources will be classified into broad astrophysical classes on the basis of photometry (in fact, low resolution prism spectroscopy) obtained by the dedicated BPRP photometer. Astrometric and other information will eventually also be available and must be incorporated. This process needs to be fully automated and able to cope with the Gaia data rate. We describe approaches and progress with this algorithm.
The ESA satellite mission Gaia will acquire spectrophotometric observations of several million unresolved galaxies during its five years of operation. Our objective is to design and implement a classification system for these data. For this purpose we need to build a new library of galaxy spectra which covers the necessary parameter space. Using the evolutionary code PEGASE.2 we have produced a library of 28885 synthetic galaxy spectra at zero redshift covering four general spectral types of galaxies over the wavelength range from 250 to 1050 nm, at a sampling of 1nm or less. The library was also reproduced for 4 random values of redshift in the range of 0-0.2 and it is computed on a random grid of four key astrophysical parameters (3 for SFR and 1 for timescale of the infall of gas). The synthetic library was compared with various photometric and spectroscopic observations (e.g. from SDSS) and found in good agreement with them. Using simulated Gaia photometry of this library we train and test the performance of Support Vector Machine (SVM) classifiers and parametrizers. The first results are promising, indicating that galaxy types can be reliably predicted and several parameters (e.g. redshift, mass to light ratio, present SFR) can be estimated with low bias and variance from Gaia observations.
I will discuss a strategy for parameter estimation in high-dimensional parameter spaces and costly likelihood functions. The strategy combines the power of distributed computing with machine learning and Markov-Chain Monte Carlo techniques for the efficient exploration of a likelihood or chi-square surface. This strategy is particularly successful in cases where computing the likelihood is costly and the number of parameters is moderate or large. I will show a successful implementation of this approach involving our machine learning code Pico and the distributed computing project Cosmology@Home. We apply this technique to the solution of the cosmological parameter estimation problem for ~10 parameters and show that we achieve a reduction of the computational time from years of CPU time to hours.
We have used the oblique decision tree classifier OC1 for a variety of astronomical applications, including cosmic ray identification (Salzberg et al. 1995), quasar candidate selection (White et al. 2000), X-ray source classification (McGlynn et al. 2004), sidelobe flagging in radio surveys (White et al. 1997, White et al. 2005), etc. I will review the algorithm and its advantages and disadvantages for typical astronomical problems. I will also describe some modifications and extensions to the algorithm, including improving accuracy using voting decision trees and accelerating the training process using tree-based data structures.
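The two ideas named above, oblique splits and voting, can be illustrated with a toy sketch. An oblique split tests a linear combination of features rather than a single feature, and a voting ensemble returns the majority label of several classifiers. The code below uses one-node "stumps" with hand-picked weights as stand-ins for trained trees. It does not use or reproduce the OC1 software, and all names, weights and class labels are hypothetical.

```python
from collections import Counter


def oblique_stump(weights, bias, label_pos, label_neg):
    """Build a one-node oblique classifier.

    The split tests sum(w_i * x_i) + bias > 0, i.e. a hyperplane at an
    arbitrary angle instead of a cut on a single feature.
    """
    def classify(features):
        score = sum(w * x for w, x in zip(weights, features)) + bias
        return label_pos if score > 0 else label_neg
    return classify


def vote(classifiers, features):
    """Majority vote over an ensemble (use an odd number to avoid ties)."""
    counts = Counter(classifier(features) for classifier in classifiers)
    return counts.most_common(1)[0][0]


# Hypothetical ensemble of three hand-built stumps on two features.
ensemble = [
    oblique_stump((1.0, -1.0), 0.0, "star", "cosmic_ray"),
    oblique_stump((0.5, 0.5), -1.0, "star", "cosmic_ray"),
    oblique_stump((0.0, 1.0), -1.5, "star", "cosmic_ray"),
]
label = vote(ensemble, (1.0, 2.0))
```

In this example the first stump votes "cosmic_ray" and the other two vote "star", so the ensemble returns "star". In practice each member would be a full tree trained on the data.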
Unveiling the recent star formation history of galaxies is important for linking populations of galaxies at different epochs, and thus revealing the most important physical processes affecting galaxy evolution. I will present the statistical technique developed in Wild et al. (2007) for identifying galaxies which have undergone a recent break in their star formation. I will show how, in a new application to the Vimos-VLT Deep Survey (VVDS), we can uncover directly the details of how the red sequence has grown since a redshift of 1. Comparison of spectroscopic results with large numerical simulations is now just becoming possible, with the output of model "spectra". Through the identical identification and comparison of quenched galaxies in VVDS and SDSS mock surveys, we can inform new and improved feedback prescriptions for the models, test new statistical techniques, and inform decisions about new observations needed.
The Optical Gravitational Lensing Experiment (OGLE) is a long-term observing project aimed at detecting microlensing events in crowded stellar fields. As a natural by-product it collects photometric time-series of millions of variable stars towards the Galactic Centre and in the Magellanic Clouds. In November 2008 the third phase of OGLE will conclude its continuous run since 2001. The huge, high-quality data set it has gathered is one of very few of its kind currently available. Nearly a billion objects towards the Galactic bulge and Magellanic Clouds now need to be investigated and classified into variability classes. The Self-Organizing Map (SOM) is a promising tool for exploring large multi-dimensional datasets. It is quick and convenient to train in an unsupervised process and as an outcome it produces naturally clustered patterns. The application of SOM to the new OGLE-III data set will be presented along with the first preliminary results. The SOM technique, tested on OGLE data, will also be implemented within the Gaia mission's photometry and spectrometry analysis, in particular in classification-based Science Alerts. SOM will be used as a basis of this system, as the changes in brightness and spectral behaviour of a star can be easily and quickly traced on a map trained in advance with simulated and real data from other surveys.