Lecture 2-3 Class exercise
NYCFlights13 Dataset Description
On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.
Data frame with columns
- year, month, day Date of departure.
- dep_time, arr_time Actual departure and arrival times (format HHMM or HMM), local tz.
- sched_dep_time, sched_arr_time Scheduled departure and arrival times (format HHMM or HMM), local tz.
- dep_delay, arr_delay Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
- carrier Two-letter carrier abbreviation. See airlines to get name.
- flight Flight number.
- tailnum Plane tail number. See planes for additional metadata.
- origin, dest Origin and destination. See airports for additional metadata.
- air_time Amount of time spent in the air, in minutes.
- distance Distance between airports, in miles.
- hour, minute Time of scheduled departure broken into hour and minutes.
- time_hour Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.
Load the dataset
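A minimal sketch of loading the data, assuming the nycflights13 and dplyr packages are already installed (the sheet names the dataset and dplyr helpers, but the exact setup used in class is an assumption here):

```r
# Assumes both packages are installed, e.g. with
# install.packages(c("nycflights13", "dplyr")).
library(dplyr)
library(nycflights13)

# flights is a data frame (tibble) with one row per departing flight.
glimpse(flights)
```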
Exercises
- Find all flights that
- Had an arrival delay of two or more hours
- Flew to Houston (IAH or HOU)
- Were operated by United, American, or Delta
- Departed in summer (July, August, and September)
- Arrived more than two hours late, but didn’t leave late
- Were delayed by at least an hour, but made up over 30 minutes in flight
- Departed between midnight and 6am (inclusive)
- Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
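As a hint, the sketch below shows what between() does on a toy data frame. The toy values are made up for illustration and stand in for the real flights data.

```r
library(dplyr)

# Toy stand-in for flights; the month values are invented.
toy <- tibble(month = c(1, 7, 8, 9, 12))

# between(x, left, right) is shorthand for x >= left & x <= right,
# so both bounds are inclusive.
with_comparisons <- filter(toy, month >= 7, month <= 9)
with_between <- filter(toy, between(month, 7, 9))

# Both approaches keep the same rows.
all.equal(with_comparisons, with_between)
```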
- How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
- Sort the flights dataframe by day, month and year.
- Sort the flights dataframe by arrival time in descending order.
- How could you use arrange() to sort all missing values to the start? (Hint: use is.na().)
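One possible approach, sketched on a toy data frame with invented values: is.na() returns TRUE or FALSE, and sorting that logical column in descending order puts the TRUE (missing) rows first.

```r
library(dplyr)

# Toy stand-in for flights; the dep_time values are invented.
toy <- tibble(dep_time = c(517, NA, 2359, NA, 533))

# arrange() normally puts NA last. Sorting on desc(is.na(dep_time))
# first moves the missing rows to the top; the remaining rows are
# then ordered by dep_time.
sorted <- arrange(toy, desc(is.na(dep_time)), dep_time)
sorted
```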
- Sort flights to find the most delayed flights. Find the flights that left earliest.
- Sort flights to find the fastest (highest speed) flights.
- Which flights travelled the farthest? Which travelled the shortest?
- Select all columns in the flights dataframe between year and day (inclusive).
- Select all columns except those from year to day (inclusive).
- Rename the tailnum variable in the flights dataframe to tail_num.
- Using the pipeline operator do the following:
- Select all columns in the flights dataframe between year and day (inclusive).
- Select all columns whose names end with delay or time.
- Select the distance and air_time variables.
- Think about a way of computing the gain/loss in travel time for each flight and create this variable as "gain".
- What is the speed of each flight?
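A small sketch of the pipe with select() helpers, run on a toy data frame (the values are invented; only the column names mirror flights). It covers the column-selection steps only and leaves the gain and speed variables as exercises.

```r
library(dplyr)

# Toy stand-in for flights with a few of its columns.
toy <- tibble(
  year = 2013, month = 1, day = 1,
  dep_time = 517, dep_delay = 2, arr_delay = 11,
  carrier = "UA"
)

# year:day selects the contiguous range of columns, inclusive.
date_cols <- toy %>% select(year:day)

# ends_with() matches on the end of the column name.
delay_time_cols <- toy %>% select(ends_with("delay"), ends_with("time"))

names(date_cols)
names(delay_time_cols)
```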
- Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
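One way to do the conversion, sketched in base R. The helper name hhmm_to_minutes is hypothetical, and treating 2400 as midnight (0 minutes) is an assumption about how the data encodes it.

```r
# Convert an HHMM or HMM number to minutes since midnight.
# hhmm_to_minutes is a hypothetical helper name, not part of dplyr.
hhmm_to_minutes <- function(x) {
  hours <- x %/% 100    # integer division keeps the hour digits
  minutes <- x %% 100   # remainder keeps the minute digits
  # The final %% 1440 maps 2400 (midnight) to 0; NA stays NA.
  (hours * 60 + minutes) %% 1440
}

hhmm_to_minutes(c(517, 1530, 2400))
# 317 930 0
```

Inside a pipeline this could be applied with mutate(), for example to dep_time and sched_dep_time.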
- Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
- Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
- Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().
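To see how ties behave, the sketch below compares three dplyr ranking functions on a small invented vector of delays. Which one suits the exercise depends on how you want to treat ties.

```r
library(dplyr)

# Invented delays in minutes; two flights tie for the largest delay.
delays <- c(10, 30, 30, 5)

# desc() makes the largest delay rank first.
ranks_min <- min_rank(desc(delays))     # ties share the smallest rank, then a gap
ranks_dense <- dense_rank(desc(delays)) # ties share a rank, no gap
ranks_row <- row_number(desc(delays))   # ties broken by position

ranks_min    # 3 1 1 4
ranks_dense  # 2 1 1 3
ranks_row    # 3 1 2 4
```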
- What does 1:3 + 1:10 return? Why?
- Group the flights by year, month and day, then summarise each group with the count, mean distance and mean arr_delay. Filter to remove noisy points and Honolulu airport. Keep only groups with more than 20 flights. Plot distance vs delay using ggplot.
- Group flights by destination. Summarise to compute distance, average delay, and number of flights.
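The group-then-summarise pattern can be sketched on a toy data frame as follows. The values are invented, and the summary column names (count, dist, delay) are illustrative choices rather than names required by the exercise.

```r
library(dplyr)

# Toy stand-in for flights; one arr_delay is missing on purpose.
toy <- tibble(
  dest = c("IAH", "IAH", "MIA"),
  distance = c(1400, 1400, 1089),
  arr_delay = c(10, NA, -5)
)

by_dest <- toy %>%
  group_by(dest) %>%
  summarise(
    count = n(),                            # flights per destination
    dist = mean(distance, na.rm = TRUE),    # mean distance
    delay = mean(arr_delay, na.rm = TRUE)   # mean delay, ignoring NA
  )
by_dest
```

Without na.rm = TRUE, any group containing a missing delay would get NA as its mean.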
- Which plane (tailnum) has the worst on-time record?
- What time of day should you fly if you want to avoid delays as much as possible?
- For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.
- Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.
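A minimal sketch of lag() on a toy data frame with invented values. Grouping by origin, so that each flight is compared with the previous departure from the same airport, is an assumption; the exercise leaves that choice open.

```r
library(dplyr)

# Toy stand-in for flights; values are invented.
toy <- tibble(
  origin = c("JFK", "JFK", "JFK", "LGA", "LGA"),
  dep_time = c(600, 615, 640, 605, 630),
  dep_delay = c(0, 45, 30, 5, 20)
)

lagged <- toy %>%
  arrange(origin, dep_time) %>%
  group_by(origin) %>%
  # lag() shifts the column down by one row within each group,
  # so the first flight of each group has no predecessor (NA).
  mutate(prev_delay = lag(dep_delay)) %>%
  ungroup()
lagged
```

From here, dep_delay can be compared with prev_delay, for example by summarising or plotting one against the other.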
- Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
- Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
- For each plane, count the number of flights before the first delay of greater than 1 hour.