HW 3: Pivots & Joins
Instructions
- Download the .Rmd template used for this homework
here
- After completing the questions, upload both the .RMD and .pdf OR .HTML files to Canvas.
- Use tidyverse functions wherever possible.
- Learning outcomes:
- Apply tidyr, readr, and dplyr capabilities
- Select the correct parsing functions
- Identify keys in relational database tables.
- Manipulate data from multiple tables/data frames using dplyr functions.
Scoring Rubric
Question.Part: | Points | Topic |
---|---|---|
1.1 | 1.0 | Load the data frames & explain untidy |
1.2a | 2.0 | Pivot correctly |
1.2b | 1.0 | week and ranking numeric, NA dropped |
1.3 | 1.0 | sort correctly |
1.4a | 1.0 | Parse columns correctly |
1.4b | 0.5 | calculate seconds |
1.5 | 0.5 | correct average ranking |
1.6 | 1.0 | correct plot and explanation |
- | - | - |
2.1 | 0.5 | Identify data frames |
2.2 | 1.5 | correct keys identified |
2.3 | 1.0 | demonstrates primary keys are unique |
- | - | - |
3.1 | 0.5 | Load Data and Data Frames |
3.2a | 1.0 | Correct Salary DF with correct number of rows |
3.2b | 1.0 | Correct Batting Stats data frame with correct number of rows |
3.3a | 2.0 | Correct new DF with batting and salary info and number of rows |
3.3b | 2.0 | Hex plot with variables on correct axes and correct scales. Proper Interpretation |
3.3c | 2.0 | Plot with just OLS lines for each year on one plot. Proper Interpretation |
3.4 | 2.0 | Correct data in Correct Order |
Total | 21.5 |
Billboard Data
Download the billboard dataset: billboard.csv
-
a. Use a readr function to import the billboard dataset from a data directory.
- The values in column
wkx
are a song’s ranking afterx
weeks of being released.
b. what makes billboard untidy?
- The values in column
-
Convert all the week columns into a row for each week for each song (where there is an entry).
- Ensure the week and ranking are numeric (hint: names_prefix & names_transform) and unnecessary
NA
s go away (hint: values_drop_na). - You should wind up with 5,307 rows
- Ensure the week and ranking are numeric (hint: names_prefix & names_transform) and unnecessary
-
Sort the data by artist, track and week.
-
Use a
tidyr
command to create aminutes
andseconds
column from thetime
column. Don’t remove thetime
column (You will receive a warning which you can ignore). Create a new column calledtotal_seconds
with theminutes
andseconds
columns, calculating the total seconds each song lasts (e.g. iftime
== “4:22”, thentotal_seconds
== 262). -
Calculate the intial average billboard ranking for each song (the average ranking for the first 4 weeks after it was released (week <= 4)). However, make sure you maintain the
total_seconds
column, we will want that for plotting in question 6! -
Does there appear to be a relationship between song length (
total_seconds
) and intial average billboard ranking?
Identifying Table Keys in the NASA Weather Dataset
- You might need to install the nasaweather package using the console
- Read the description of the nasaweather dataset with the below
library(help = "nasaweather")
-
What are the data frames in this data set?
-
What are the primary keys in each data frame? (One doesn’t have a key)
-
Demonstrate the primary keys generate unique rows. Note you may have to use
ungroup()
with one of them to avoid an error
Lahman’s Baseball Dataset
For this exercise, we’ll use the Master
, Batting
, Pitching
, Fielding
, Teams
, People
, and Salaries
data frames from the Lahman
package.
-
Load these seven data frames into R and read about them.
-
Some players play on multiple teams each year. a. Construct a data frame containing the total salary for each player for each year. Show the number of rows - should be 26,323. b. Construct a second data frame containing the total number of at bats and hits for each player in a year. Show the number of rows - should be 99,402.
-
The batting average of a player is the number of Hits divided by the number of at bats
(a larger value is is better).
a. Using the data frames you created in part 2, create a new data frame with batting average and salary information for only players who had a minimum of 400 at bats in the years from 1985 on (when salary information started being collected). Eliminate any rows with no batting or no salary information. Show the number of rows. N Should be 5,345.
b. Use a hex plot to explore the association between a player’s batting average (x axis) and their salary (y axis). Use a log scale for salary. Interpret the plot
c. Use a single (not faceted) plot of just the Ordinary Least Squares lines for batting average and salary to explore if this association has changed over time. Use a log scale for salary. Interpret the plot. -
Find the annual salary of all players with first name “John” in even numbered years after 1985.
- Show only the first ten values arranged in descending order of salary.
- Your output should start like this:
## # A tibble: 5 x 4
## yearID nameFirst nameLast salary
## <int> <chr> <chr> <int>
## 1 2010 John Lackey 18700000
## 2 2016 John Lackey 16000000
## 3 2012 John Lackey 15950000
## 4 2016 John Danks 15750000
## 5 2014 John Lackey 15250000