Lab 2
1. Gapminder
The goal of this section is to use the gapminder dataset to explore the relationship between GDP, population, and life expectancy.
- Use `data()` to load the `gapminder` dataset from the `gapminder` package (hint: you’ll need to install this!). Also load the `tidyverse` package.
- Create a barplot (hint: `geom_bar()`) of total counts for the different continents. Why do you think there are so many countries in each continent? Do some digging through the data’s structure until you’re able to explain why, for example, there are 396 countries in Asia in this dataset.
- Create a scatterplot (hint: `geom_point()`) with year (`year`) on the x-axis, life expectancy (`lifeExp`) on the y-axis, and with the points colored by continent (`continent`). Make this a longitudinal / time-series plot by adding `geom_line()` (this is a new one for us!). Hint: the lines will not connect the right points at first. To fix the incorrect mapping, how will you tell ggplot that the lines should follow the `country` column?
- Create a density plot with life expectancy (`lifeExp`) on the x-axis. Do not map anything to the y-axis. In what ways can we disaggregate this density plot by adding in continent (I can immediately think of 3 we covered in class)? What do you learn about the continents?
  - Add a title, axis labels, and a new theme to your preferred plot.
  - Try to customize the color or fill property with a special scale: either create your own (`scale_fill_manual(values = c())`) or use a predefined scale from `ggplot2` or `ggthemes`.
- Finally, create a scatter plot that explores the relationship between `gdpPercap` (x) and `lifeExp` (y).
  - Map color to continent and add an OLS (`lm`) smooth line for each continent.
  - Because GDP is so skewed, transform the x scale to log10 (hint: `scale_x_log10()`).
  - The log scale has ugly labels. Try using `scale_x_log10(labels = scales::comma)` if you have installed the `scales` package.
  - Try moving the legend for continents into the plot frame, e.g., by adding `+ theme(legend.position = c(0.8, 0.2))`. What do those numbers seem to do?
  - Try changing the theme for this plot to one that you like from either base `ggplot2` or from the `ggthemes` package. Why do you like this theme?
  - Try making a “bubble” plot by mapping the size of each point to the point’s population (`pop`).
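As a starting point for the last exercise, here is a minimal sketch that strings together the hints above. It assumes the `gapminder` and `ggplot2` packages are already installed; the title, axis labels, `alpha` value, and the choice of `theme_minimal()` are illustrative assumptions, not required answers. Recent ggplot2 releases (3.5.0 and later) may warn that a numeric `legend.position` is deprecated in favor of `legend.position = "inside"` together with `legend.position.inside`.

```r
# Assumes gapminder and ggplot2 are installed, e.g. via
# install.packages(c("gapminder", "tidyverse")).
library(ggplot2)
library(gapminder)

data("gapminder")  # load the gapminder data frame into the workspace

p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  # "bubble" plot: point size follows population (alpha is an illustrative choice)
  geom_point(aes(size = pop), alpha = 0.4) +
  # color = continent is inherited, so one OLS line is fit per continent
  geom_smooth(method = "lm", se = FALSE) +
  # log10 x axis with comma-formatted tick labels
  scale_x_log10(labels = scales::comma) +
  # illustrative title and labels
  labs(
    title = "Life expectancy vs. GDP per capita",
    x = "GDP per capita (log10 scale)",
    y = "Life expectancy (years)"
  ) +
  theme_minimal() +
  # placed after theme_minimal() so the complete theme does not reset it
  theme(legend.position = c(0.8, 0.2))

print(p)
```

Try changing the two numbers in `legend.position` and re-running the plot to see how the legend moves.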
As announced in class, the Diamonds part is optional for the version of the lab you turn in.
2. Diamonds
The goal of this section is to use the diamonds dataset to see which characteristics most influence price.
- Use `data()` to load the `diamonds` dataset from `ggplot2`.
- Perform an exploratory data analysis and develop some conjectures about which variables impact price.
- Make plots of price vs. each of the four other variables (`carat`, `cut`, `color`, `clarity`). Use the type of plot appropriate to each type of variable. For instance, is the secondary variable quantitative or categorical?
- Consider whether transforming the variables or scales (for example, a log transformation) might be relevant to one of your axes.
- Can some surprising associations be explained by other variables?
  - For example, if price does not fall as the cut worsens the way you would expect, can that be explained by the carat of the diamond?
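To illustrate matching the plot type to the variable type, here is a rough sketch assuming `ggplot2` is installed: a scatter plot for a quantitative variable (`carat`), a boxplot for a categorical one (`cut`), and a third plot for checking whether carat differs across cuts. The object names (`p_carat`, `p_cut`, `p_cut_carat`), the `alpha` value, and the decision to log-transform both axes are illustrative assumptions, not the only reasonable choices.

```r
# Assumes ggplot2 is installed (scales is installed along with it).
library(ggplot2)

data("diamonds")  # load the diamonds data frame from ggplot2

# Quantitative vs. quantitative: scatter plot, with log10 scales on both axes
p_carat <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +                 # low alpha to reduce overplotting
  scale_x_log10() +
  scale_y_log10(labels = scales::comma)

# Categorical vs. quantitative: one boxplot of price per cut grade
p_cut <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot() +
  scale_y_log10(labels = scales::comma)

# Possible confounder check: how does carat vary across cut grades?
p_cut_carat <- ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_boxplot()

print(p_carat)
print(p_cut)
print(p_cut_carat)
```

The same pattern can be repeated for `color` and `clarity` by swapping the variable mapped to `x`.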