Lecture 2-3 Class exercise

NYCFlights13 Dataset Description

On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.

Data frame with columns

  • year, month, day: Date of departure.
  • dep_time, arr_time: Actual departure and arrival times (format HHMM or HMM), local tz.
  • sched_dep_time, sched_arr_time: Scheduled departure and arrival times (format HHMM or HMM), local tz.
  • dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
  • carrier: Two-letter carrier abbreviation. See airlines to get name.
  • flight: Flight number.
  • tailnum: Plane tail number. See planes for additional metadata.
  • origin, dest: Origin and destination. See airports for additional metadata.
  • air_time: Amount of time spent in the air, in minutes.
  • distance: Distance between airports, in miles.
  • hour, minute: Time of scheduled departure broken into hour and minutes.
  • time_hour: Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.
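
As an illustration of the last point, a minimal sketch of joining flights to weather on origin and time_hour might look like this. It assumes the nycflights13 and dplyr packages are installed; the name flights_weather and the choice of weather columns (temp, wind_speed) are only for illustration.

```r
library(dplyr)
library(nycflights13)

# Keep only the join keys and a couple of weather measurements
# (the selection of temp and wind_speed is an illustrative assumption).
weather_small <- weather %>%
  select(origin, time_hour, temp, wind_speed)

# Left join: every flight is kept; weather values are NA where no
# matching hourly observation exists for that origin.
flights_weather <- flights %>%
  left_join(weather_small, by = c("origin", "time_hour"))

head(flights_weather)
```

A left join is used here so that flights without a matching weather record are kept rather than dropped.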

more information

Load the dataset
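
One possible way to load the data, assuming the nycflights13 and dplyr packages are available (the commented install line is only needed once):

```r
# install.packages(c("nycflights13", "dplyr"))  # one-time install, if needed
library(nycflights13)  # provides flights, airlines, airports, planes, weather
library(dplyr)         # filter(), arrange(), select(), mutate(), summarise()

# Quick look at the column names and types before starting the exercises
glimpse(flights)
```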

Exercises

  1. Find all flights that:
     1. Had an arrival delay of two or more hours
     1. Flew to Houston (IAH or HOU)
     1. Were operated by United, American, or Delta
     1. Departed in summer (July, August, and September)
     1. Arrived more than two hours late, but didn’t leave late
     1. Were delayed by at least an hour, but made up over 30 minutes in flight
     1. Departed between midnight and 6am (inclusive)
  1. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
  1. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
  1. Sort the flights dataframe according to day, month and year
  1. Sort the flights dataframe by arrival time in descending order.
  1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).
  1. Sort flights to find the most delayed flights. Find the flights that left earliest.
  1. Sort flights to find the fastest (highest speed) flights.
  1. Which flights travelled the farthest? Which travelled the shortest?
  1. Select all columns in the flights dataframe between year and day (inclusive).
  1. Select all columns except those from year to day (inclusive).
  1. Rename the tailnum variable in the flights dataframe to tail_num.
  1. Using the pipe operator (%>%), do the following:
     1. Select all columns in the flights dataframe between year and day (inclusive).
     1. Select all columns whose names end with delay or time.
     1. Select the distance and air_time variables.
  1. Think about a way of computing the travel time gained or lost by each flight, and create this variable as "gain".
  1. What is the speed of each flight?
  1. Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
  1. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
  1. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
  1. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().
  1. What does 1:3 + 1:10 return? Why?
  1. Group the flights by year, month and day, then summarise each group with the count, mean distance and mean arr_delay. Filter out noisy points and the Honolulu airport (HNL). Keep only groups containing more than 20 flights. Plot distance vs delay using ggplot.
  1. Group flights by destination. Summarise to compute distance, average delay, and number of flights.
  1. Which plane (tailnum) has the worst on-time record?
  1. What time of day should you fly if you want to avoid delays as much as possible?
  1. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.
  1. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.
  1. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
  1. Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
  1. For each plane, count the number of flights before the first delay of greater than 1 hour.