Chapter 1 Introduction

As part of the MSc specializing in Data Science, this course introduces the essential techniques for performing exploratory data analysis. These techniques are typically applied before formal modeling begins and allow the researcher to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations. Different types of data will be described, and the appropriate exploratory data analysis techniques for each data type will be introduced. The course will distinguish between univariate non-graphical, multivariate non-graphical, univariate graphical, and multivariate graphical techniques. Special attention will be given to the visualization of large data sets using appropriate software. Some of the topics to be covered include:

  1. Plotting the raw data (such as data traces, histograms, bihistograms, probability plots, lag plots, block plots, and Youden plots).
  2. Plotting simple statistics such as mean plots, standard deviation plots, box plots, and main effects plots of the raw data.
  3. Positioning such plots so as to maximize our natural pattern-recognition abilities, such as using multiple plots per page.
  4. Plotting geocoded data and creating dashboards.
  5. Dimensionality reduction and clustering of similar observations.
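
To make the distinction between univariate and multivariate non-graphical techniques concrete, the following is a minimal sketch rather than course material. It is written in Python using only the standard library so that it runs anywhere; the resources listed for this course use R. The data values and the function names five_number_summary and pearson_correlation are hypothetical, invented purely for illustration.

```python
import statistics

# Hypothetical toy data, invented for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weights_kg = [50.0, 58.0, 69.0, 80.0, 93.0]


def five_number_summary(values):
    """Univariate non-graphical: minimum, quartiles, and maximum of one variable."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def pearson_correlation(xs, ys):
    """Multivariate non-graphical: linear association between two variables."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


summary = five_number_summary(heights_cm)
r = pearson_correlation(heights_cm, weights_kg)
print("five-number summary of heights:", summary)
print("correlation(height, weight):", round(r, 3))
```

The graphical counterparts would be, for example, a box plot of the heights (univariate) and a scatter plot of height against weight (multivariate).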

Resources

Several excellent free online textbooks by well-known and respected teachers cover this area; most of the material we need is drawn from these three sources:

  1. Exploratory Data Analysis with R (Roger Peng = RP): https://bookdown.org/rdpeng/exdata/

  2. STAT545: Data wrangling, exploration, and analysis with R (Jenny Bryan = JB): https://stat545.com/index.html

  3. R for Data Science (Hadley Wickham = HW): https://r4ds.had.co.nz/