Astrostats 2017 at IUCAA


Computational statistics and astrostatistics

Lecturer: Coryn Bailer-Jones (Email: calj AT mpia.de)
Dates and location: Monday 2 to Tuesday 10 January 2017, IUCAA, Pune, India

Overview

This course is an introduction to using statistics and computational methods to analyse data. I will show that the process of analysing and interpreting data can be done within a simple probabilistic (Bayesian) framework. I will introduce the main concepts and methods as well as the computational tools we need to extract meaning from data in the presence of uncertainty. Many real-world problems in data analysis must be solved with a computer. One of the goals of this course is to show how theoretical ideas can be converted into practical solutions.

The course comprises 7 days of lectures in the mornings, with exercises, student presentations, and discussions in the afternoons.

Pre-course requirements

Summary: you need to install R, RStudio, certain R packages, and Jupyter (including the R kernel).

The computing language of the course is R. This is easy to learn and use, has a convenient environment for plotting, and is widely used within and outside academia. You must install R on your laptop before you arrive at IUCAA. Take time to learn the basics of R in advance; there will be little time to learn this during the course. This is a statistics course, not an R course. You can download R, and also find documentation, guides, and links to tutorials, at the CRAN web site. Install the latest version (3.3.1). Don't leave installing and learning R until the last minute!

In addition to the basic R package, you need to install the following R packages:
fields, gplots, KernSmooth, MASS, mvtnorm, RColorBrewer, splines.
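
One possible way to install these packages from within R is sketched below. It only installs what is missing, and it assumes an internet connection and the default CRAN mirror. The helper name missing_packages is made up for this example. Some of the listed packages (for example splines) normally ship with R itself, which is why the sketch checks what is already present first.

```r
# Packages named in the list above.
course_packages <- c("fields", "gplots", "KernSmooth", "MASS",
                     "mvtnorm", "RColorBrewer", "splines")

# Return the packages in `wanted` that are not in `installed`.
# (missing_packages is a helper name made up for this sketch.)
missing_packages <- function(wanted,
                             installed = rownames(installed.packages())) {
  setdiff(wanted, installed)
}

to_install <- missing_packages(course_packages)

if (length(to_install) == 0) {
  message("All course packages are already installed.")
} else if (interactive()) {
  # Downloads from CRAN, so this needs an internet connection.
  install.packages(to_install)
} else {
  # In a non-interactive session, only report what is missing.
  message("Still to install: ", paste(to_install, collapse = ", "))
}
```

Afterwards, loading a package with, for example, library(MASS) should run without an error.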

I also strongly recommend you install RStudio, a GUI for using R. Install RStudio Desktop version 0.99.903 or later.

I will share code and notes with you using Jupyter notebooks. For this you must install Jupyter on your computer. This is best done by installing Anaconda (version 4.2.0 or later, with Python version 3.5), which is quick and easy. Once you've done that (and after you've installed R), you need to install the R kernel so that Jupyter can run R. This is explained here and takes just three lines in R. You can then run Jupyter (just type "jupyter notebook" in a console and it will open in your browser). You will then be able to download, use, and edit the Jupyter notebooks for the course, which are available via a GitHub repository.

If you have problems with any of this, consult the relevant instructions and documentation on the above web pages, or contact your local IT support.

Schedule and topics

(Ignore the numbers in parentheses after the lecture titles; they are just for my reference)

Morning (lectures): 09:30-10:30, 11:00-12:00
Afternoon: 14:00-15:30, 16:00-17:30

Monday 2
Morning:
  Introduction (1.1, 1.10, 2.2)
  Probability basics (1.2, 1.3, 1.6, 1.9)
Afternoon: Student presentations
  Nand Kumar Chakradhari, Light curves of type Ia supernovae
  Archita Rai, Stellar populations in clusters
  Taniya Parikh, Stellar population parameters
  Rajeshwari Dutta, Correlation among multiple parameters with 21cm data
  Anusree K G, The characteristics of hard X-ray pulsars

Tuesday 3
Morning:
  Introduction to inference (3.1, 3.2)
  Parametric models (3.3, 3.5)
Afternoon: Student presentations
  Akshay Rana & Nisha Rani, Strong gravitational lensing with Gaussian Processes
  Sunil Malik & Ramkishor Sharma, Galactic Faraday rotation using Gaussian Processes
  Prithvi Raj Singh, Long-term cosmic ray intensity variation in relation to solar variability
  Vidushi Sharma & Rajorshi Chandra, Gamma ray bursts
  Priyanka Jalan, Quasar proximity regions

Wednesday 4
Morning:
  Parameter estimation: single parameter (5.1)
  Parameter estimation: multiple parameters (6.1, 6.3)
Afternoon: Exercises

Thursday 5
Morning:
  Density estimation (7.2)
  Monte Carlo methods (8.1, 8.3, 8.4, 8.5)
Afternoon: Exercises

Friday 6
Morning:
  Parameter estimation: Markov Chain Monte Carlo (9)
Afternoon: Exercises

Monday 9
Morning:
  Model comparison (11.1, 11.2, 11.3, 11.5)
  Comparison to frequentist hypothesis testing (10.1, 10.2, 10.6, 11.9)
Afternoon: Exercises

Tuesday 10
Morning:
  Dealing with more complicated problems (12)
Afternoon: Wrap-up and Discussion

Exercises

The exercises will be a mix of pen-and-paper problems and computer-based problems using R. They will all be put on the GitHub repository.

Text books

There are many books on statistics and/or R on the market. There is a huge range in depth and quality. The most suitable book for this course is one which I have written, but it will only be published (by CUP) in 2017 (so you can use it afterwards). Here are some other suggestions.

Crawley, Statistics. An Introduction using R
This text emphasises statistics for biological and to some extent physical (but not social) sciences. It has a reasonable balance between explaining the methods and demonstrating them in R. While there are examples, there is more of an emphasis on principles and the basic maths than there is in Everitt & Hothorn or Dalgaard, for example. Even so, the maths is very basic and many methods are not properly explained (the course will go beyond this level). It presents statistics in a traditional, frequentist way.

Dalgaard, Introductory Statistics with R
An introduction to both R and statistics. The mathematical treatment is limited and it takes a somewhat "recipes"-like approach. As the title suggests, R takes a central role.

Everitt and Hothorn, A Handbook of Statistical Analyses using R
R takes a very central place, with lots of examples and data sets (and perhaps a few too many screen dumps). As the title suggests, this is a guide to using R for statistics rather than a book from which you can learn statistics. Moreover, it covers several topics which are not typical for an introductory statistics course (and which we won't cover). It is as R-centric as Crawley and Dalgaard but a bit more advanced.

Gregory, Bayesian logical data analysis for the physical sciences
A good introduction to both the principles and practical application of Bayesian methods. One of very few books giving a broad introduction and guide for physical scientists (there are lots more such books for social scientists and specific analytic models). He uses Mathematica to illustrate the methods.

Hoff, A first course in Bayesian statistical methods
This book is shorter than Kruschke (see below) but at a higher level. It takes a somewhat formal, mathematical approach, and may not be ideal at the introductory level. This book is entirely Bayesian, so it doesn't compare and contrast with frequentist methods.

Ivezic et al., Statistics, data mining, and machine learning in astronomy
An excellent, well-written compilation of statistical methods and machine learning methods in general, with particular attention to their application in astronomy. Lots of examples and code in python on an accompanying web site.

Jaynes, Probability theory
E.T. Jaynes was one of the main proponents of Bayesian inference. This is a rather unconventional book describing numerous elements of Bayesian probability theory and inference, ranging from the basics through practical examples to fundamental philosophical discussions. It is even polemical in places, and is probably not appropriate for a first exposure to Bayesian inference. But it contains some very thought-provoking discussions.

Kruschke, Doing Bayesian data analysis
This is quite a long book, but in return it covers some topics at a very easy pace. A large part of it is dedicated to generalized linear models, which we won't be covering in this course. It is targeted more at the life sciences than the physical sciences.

MacKay, Information theory, inference and learning algorithms
Not a traditional statistics book, and perhaps not a first book for learning the very basics of Bayesian inference, but a great book for learning about inference both in principle and in practice. He has a good didactic style, and this book contains some very illuminating examples. Also look here for a good introduction to MCMC. MacKay and CUP have done us a great service by making the book available online.

McElreath, Statistical Rethinking
An excellent recent book which puts Bayesian analysis at the centre. It is long and a little detailed in places and, as the author says, is suited more to cover-to-cover reading than to use as a reference book. Includes R code.

Sivia, Data Analysis. A Bayesian Tutorial
The first edition was a great introduction to data analysis from the Bayesian perspective. (A new second edition adds three more chapters.) I recommend it if you really want to understand what statistics is and how it relates to probability theory, rather than just learn a bunch of frequentist recipes. It includes numerous examples which are analytically solvable, but covers less on numerical solutions. It goes beyond the scope of the course, and it does not cover R or other packages.

Venables and Ripley, Modern Applied Statistics with S (MASS)
R is a dialect of the S language, so the material applies to R. This book provides a very good introduction to R and its use for both basic and advanced data analysis. However, it assumes the reader is already reasonably familiar with the techniques, so this is not a book which can be used alone to learn basic statistics. It goes well beyond the course, also covering topics such as GLIMs, neural networks and spatial statistics. The accompanying R package "MASS" contains many functions which will be used in the course.


Coryn Bailer-Jones, calj at mpia.de
Last updated 28 October 2016