Assignment description

Timeline

Topic ideas due Mon, Feb 26

Proposal due Mon, Mar 4

Draft report due Wed, Mar 13

Peer review due Mon, Mar 18

Final report due Wed, Mar 20

Presentation + slides and final Dropbox/GitHub repo due Fri, Mar 22

Presentation comments due Sat, Apr 30

Introduction

Research the application of modern multivariate statistical methods in one of the application areas given below (or suggest your own area).

  • Bioinformatics / Genetics / Systems Biology
  • Brain Imaging
  • Astronomy
  • Sport
  • Chemometrics
  • Finance
  • Ecology
  • Own topic

You should find the latest literature on the methodology used and the computing resources available. Find data sets and analyse them using this methodology and suitable software. You may either focus on a specific application of a specific methodology or give an overview of a variety of techniques applied to the specific area. Write a 10-page summary report that shows your understanding of the application area and the methodology, and that includes an application of the methodology together with the computer programs used and an interpretation of the results.

Logistics

The four primary deliverables for the final project are:

  • A written, reproducible report detailing your analysis
  • A Dropbox/GitHub repository corresponding to your report
  • Classroom presentation + slides
  • Formal peer review of other students’ projects

Topic ideas

Identify 2-3 data sets you’re interested in potentially using for the final project. If you’re unsure where to find data, you can use the list of potential data sources in the Tips + Resources section as a starting point. It may also help to think of topics you’re interested in investigating and find data sets on those topics.

The purpose of submitting project ideas is to give you time to find data for the project and to make sure you have a data set that can help you be successful in the project. Therefore, you must use one of the data sets submitted as a topic idea, unless otherwise notified by the teaching team.

Research question

  • Describe a research question you’re interested in answering using this data.

Exploratory data analysis

  • Provide an overview of each data set

Submit the PDF of the topic ideas to Dropbox. Mark all pages associated with each data set.

Project proposal

The purpose of the project proposal is to help you think about your analysis strategy early.

Include the following in the proposal:

Section 1 - Introduction

The introduction section includes

  • an introduction to the subject matter you’re investigating
  • the motivation for your research question (citing any relevant literature)
  • the general research question you wish to explore
  • your hypotheses regarding the research question of interest.

Section 2 - Data description

In this section, you will describe the data set you wish to explore. This includes

  • description of the observations in the data set,
  • description of how the data was originally collected (not how you found the data but how the original curator of the data collected it).

Section 3 - Analysis approach

In this section, you will provide a brief overview of your analysis approach. This includes:

  • Description of the response variable.
  • Visualization and summary statistics for the response variable.
  • List of variables that will be considered as predictors
  • Regression model technique (multiple linear regression and logistic regression)

Data dictionary (aka code book)

Submit a data dictionary for all the variables in your data set in the README of your project repo, in the data folder. Link to this file from your proposal writeup.
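One way to start a data dictionary is to generate a table skeleton from the header of the data file and then write the descriptions by hand. The sketch below is only an illustration: it assumes the data is stored as a CSV file and that the README is written in Markdown, and the function name `data_dictionary_skeleton` and the example columns are hypothetical, not part of the assignment.

```python
import csv
import io


def data_dictionary_skeleton(csv_text: str) -> str:
    """Return a Markdown table with one row per column in the CSV header."""
    # Only the header row is needed; the data rows are ignored.
    header = next(csv.reader(io.StringIO(csv_text)))
    lines = ["| Variable | Description |", "| --- | --- |"]
    for name in header:
        # Descriptions are left as TODO placeholders to be filled in by hand.
        lines.append(f"| `{name.strip()}` | TODO |")
    return "\n".join(lines)


# Made-up example data (not a course data set).
example = "species,bill_length_mm,body_mass_g\nAdelie,39.1,3750\n"
print(data_dictionary_skeleton(example))
```

The printed table can be pasted into the README and each TODO replaced with the variable's meaning, units, and coding.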

Submission

If you are using GitHub, push all of your final changes to the GitHub repo, and submit the PDF of your proposal to Dropbox.

Introduction/data

This section includes an introduction to the project motivation, data, and research question. Describe the data and give definitions of key variables. It should also include some exploratory data analysis. Not all of the EDA will fit in the paper, so focus on the EDA for the response variable(s) and a few other interesting variables and relationships.

Methodology

This section includes a brief description of your modeling process. Explain the reasoning for the type of model you’re using.

Results

In this section, you will output the final model and include a brief discussion of the model assumptions, diagnostics, and any relevant model fit statistics.

This section also includes initial interpretations and conclusions drawn from the model.

Peer review

We will review others’ work critically in STA5069Z. Each student will be assigned two other students’ projects to review. This will be a blind review.

The peer review will be graded on the extent to which it comprehensively and constructively addresses the components of the reviewed report: the research context and motivation, exploratory data analysis, modeling, interpretations, and conclusions.

  • What to review:
    • Describe the goal of the project.

    • Describe the data used.

    • Describe the approaches and methods used.

    • Provide constructive feedback on how the project might be improved.

    • Provide constructive feedback on any issues with file and/or code organization.

Written report

Your written report must be completed in an RMD file and must be reproducible. Before you finalize your write-up, make sure the printing of code chunks is turned off with the option echo = FALSE.

You will submit the PDF of your final report on Dropbox.

The mandatory components of the report are below. You are free to add additional sections as necessary. The report, including visualizations, should be no more than 10 pages long.

Written report breakdown

The written report is worth 50 points, broken down as follows:

Total 50 pts
Introduction/data 6 pts
Methodology 15 pts
Results 20 pts
Discussion + conclusion 6 pts
Organization + formatting 3 pts

Presentation + slides

Slides

In addition to the written report, you will present your material in the classroom. Your slides should serve as a brief visual addition to your written report and will be graded for content and quality. Each student will have 10 minutes for the presentation and 5 minutes for questions and answers.

For submission, convert these slides to a .pdf document, and submit the PDF of the slides on Dropbox.

Presentation comments

Each student will be assigned two presentations to evaluate. After each presentation, ask questions that show your understanding of the content covered in the course.

Reproducibility + organization

All written work (with the exception of the presentation slides) should be reproducible, and the GitHub repo should be neatly organized.

The GitHub repo should have the following structure:

  • README: Short project description and data dictionary

  • written-report.qmd & written-report.pdf: Final written report

  • /data: Folder that contains the data set for the final project.

  • /previous-work: Folder that contains the topic-ideas and project-proposal files.

  • /presentation: Folder with the presentation slides.

    • If your presentation slides are online, you can put a link to the slides in a README.md file in the presentation folder.

Points for reproducibility + organization will be based on the reproducibility of the written report and the organization of the project GitHub repo. The repo should be neatly organized as described above, there should be no extraneous files, and all text in the README should be easily readable.
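Before submitting, it can help to check the repo layout automatically. The following is a minimal sketch, not an official course tool: the function name `missing_entries` is hypothetical, and the README file name `README.md` is an assumption (the structure above only says "README"). It reports expected files and folders that are missing; it does not detect extraneous files.

```python
from pathlib import Path

# Entries taken from the repo structure listed above.
# "README.md" is an assumed file name for the README.
EXPECTED = [
    "README.md",
    "written-report.qmd",
    "written-report.pdf",
    "data",
    "previous-work",
    "presentation",
]


def missing_entries(repo_root: str) -> list[str]:
    """Return the expected files/folders that are absent from repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED if not (root / name).exists()]


if __name__ == "__main__":
    # Run from the top level of the project repo.
    missing = missing_entries(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected files and folders are present.")
```

Running the script from the top level of the repo prints any missing entries, so they can be fixed before the final push.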

Overall grading

The grade breakdown is as follows:

Total 100 pts
Topic ideas 5 pts
Data 5 pts
Project proposal 5 pts
Peer review 10 pts
Written report 50 pts
Slides + presentation 15 pts
Reproducibility + organization 5 pts
Presentation comments 5 pts

Note: No late project reports will be accepted.

Grading summary

Grading of the project will take into account the following:

  • Content - What is the quality of the research and/or policy question, and how relevant is the data to that question?
  • Correctness - Are statistical procedures carried out and explained correctly?
  • Writing and Presentation - What is the quality of the statistical presentation, writing, and explanations?
  • Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?