HW 3: Pivots & Joins

Due by 11:59 PM on Sunday, March 21, 2021

Instructions

  • Download the .Rmd template used for this homework here
  • After completing the questions, upload both the .RMD and .pdf OR .HTML files to Canvas.
  • Use tidyverse functions wherever possible.
  • Learning outcomes:
    • Apply tidyr, readr, and dplyr capabilities
    • Select the correct parsing functions
    • Identify keys in relational database tables.
    • Manipulate data from multiple tables/data frames using dplyr functions.

Scoring Rubric

Question.Part: Points Topic
1.1 1.0 Load the data frames & explain untidy
1.2a 2.0 Pivot correctly
1.2b 1.0 week and ranking numeric, NA dropped
1.3 1.0 sort correctly
1.4a 1.0 Parse columns correctly
1.4b 0.5 calculate seconds
1.5 0.5 correct average ranking
1.6 1.0 correct plot and explanation
- - -
2.1 0.5 Identify data frames
2.2 1.5 correct keys identified
2.3 1.0 demonstrates primary keys are unique
- - -
3.1 0.5 Load Data and Data Frames
3.2a 1.0 Correct Salary DF with correct number of rows
3.2b 1.0 Correct Batting Stats data frame with correct number of rows
3.3a 2.0 Correct new DF with batting and salary info and number of rows
3.3b 2.0 Hex plot with variables on correct axes and correct scales. Proper Interpretation
3.3c 2.0 Plot with just OLS lines for each year on one plot. Proper Interpretation
3.4 2.0 Correct data in Correct Order
Total 21.5

Billboard Data

Download the billboard dataset: billboard.csv

  1. a. Use a readr function to import the billboard dataset from a data directory.

    • The values in column wkx are a song’s ranking after x weeks of being released.

    b. what makes billboard untidy?

  2. Convert all the week columns into a row for each week for each song (where there is an entry).

    • Ensure the week and ranking are numeric (hint: names_prefix & names_transform) and unnecessary NAs go away (hint: values_drop_na).
    • You should wind up with 5,307 rows
  3. Sort the data by artist, track and week.

  4. Use a tidyr command to create a minutes and seconds column from the time column. Don’t remove the time column (You will receive a warning which you can ignore). Create a new column called total_seconds with the minutes and seconds columns, calculating the total seconds each song lasts (e.g. if time == “4:22”, then total_seconds == 262).

  5. Calculate the intial average billboard ranking for each song (the average ranking for the first 4 weeks after it was released (week <= 4)). However, make sure you maintain the total_seconds column, we will want that for plotting in question 6!

  6. Does there appear to be a relationship between song length (total_seconds) and intial average billboard ranking?

Identifying Table Keys in the NASA Weather Dataset

  • You might need to install the nasaweather package using the console
  • Read the description of the nasaweather dataset with the below
library(help = "nasaweather")
  1. What are the data frames in this data set?

  2. What are the primary keys in each data frame? (One doesn’t have a key)

  3. Demonstrate the primary keys generate unique rows. Note you may have to use ungroup() with one of them to avoid an error

Lahman’s Baseball Dataset

For this exercise, we’ll use the Master, Batting, Pitching, Fielding, Teams, People, and Salaries data frames from the Lahman package.

  1. Load these seven data frames into R and read about them.

  2. Some players play on multiple teams each year. a. Construct a data frame containing the total salary for each player for each year. Show the number of rows - should be 26,323. b. Construct a second data frame containing the total number of at bats and hits for each player in a year. Show the number of rows - should be 99,402.

  3. The batting average of a player is the number of Hits divided by the number of at bats
    (a larger value is is better).
    a. Using the data frames you created in part 2, create a new data frame with batting average and salary information for only players who had a minimum of 400 at bats in the years from 1985 on (when salary information started being collected). Eliminate any rows with no batting or no salary information. Show the number of rows. N Should be 5,345.
    b. Use a hex plot to explore the association between a player’s batting average (x axis) and their salary (y axis). Use a log scale for salary. Interpret the plot
    c. Use a single (not faceted) plot of just the Ordinary Least Squares lines for batting average and salary to explore if this association has changed over time. Use a log scale for salary. Interpret the plot.

  4. Find the annual salary of all players with first name “John” in even numbered years after 1985.

    • Show only the first ten values arranged in descending order of salary.
    • Your output should start like this:
## # A tibble: 5 x 4
##   yearID nameFirst nameLast   salary
##    <int> <chr>     <chr>       <int>
## 1   2010 John      Lackey   18700000
## 2   2016 John      Lackey   16000000
## 3   2012 John      Lackey   15950000
## 4   2016 John      Danks    15750000
## 5   2014 John      Lackey   15250000