class: center, middle, inverse, title-slide

# Day Ten: Frequency
## SDS 192: Introduction to Data Science
### Lindsay Poirier, Statistical & Data Sciences, Smith College
### Spring 2022

---
class: center, middle

# Review of last Friday's lab.

---

# For Today

* histograms
* binning
* barplots
* stacking/dodging/filling

---
class: center, middle

# The most important take-away from today is that frequency plots (histograms and barplots) involve *counting* the values in a variable.

---

# Histogram

.pull-left[
* Visualizes the *distribution* of a **numeric** variable
    * What are the maximum and minimum values?
    * How spread out are the values?
    * What is the center of the values?
]

.pull-right[
![](Day10-Frequency_files/figure-html/unnamed-chunk-1-1.png)<!-- -->![](Day10-Frequency_files/figure-html/unnamed-chunk-1-2.png)<!-- -->
]

---

# Histogram

.pull-left[
1. Create bins for numbers, each with the same range of values (e.g. 0-10, >10-20, >20-30, and so on)
    * Converts the linear scale to a categorical scale
2. Count the values that fall in each bin
3. Set the height of a bar for that bin to the count

> How does this compare to the `cut()` function we learned in last week's lab?
]

.pull-right[
![](Day10-Frequency_files/figure-html/unnamed-chunk-2-1.png)<!-- -->![](Day10-Frequency_files/figure-html/unnamed-chunk-2-2.png)<!-- -->
]

---

# Barplots

.pull-left[
* Visualizes *counts* of a **categorical** variable
    * Which value appears the most?
    * Which appears the least?
    * How evenly distributed are the counts?
]

.pull-right[
```
## [1] "a" "b" "c" "a" "c" "a" "a" "b" "c" "a" "b" "c"
```

![](Day10-Frequency_files/figure-html/unnamed-chunk-3-1.png)<!-- -->
]

---

# Barplots

.pull-left[
1. Determine the unique values and place them on the x-axis
2. Count the number of times each value appears
3. Set the height of a bar for that category to the count
]

.pull-right[
```
## [1] "a" "b" "c" "a" "c" "a" "a" "b" "c" "a" "b" "c"
```

![](Day10-Frequency_files/figure-html/unnamed-chunk-4-1.png)<!-- -->
]

---

# Today's Dataset

* Spotify has an Application Programming Interface (API) that allows us to access data about music on the platform
* We can access data about specific songs, playlists, and artists
* Today we are going to access data about the tracks for a few different artists
* Variables include acousticness, danceability, speechiness, album information, and key

---

# Data Ethics Questions

1. What assumptions and commitments informed the design of this dataset?
2. Who has had a say in data collection and analysis regarding this dataset? Who has been excluded?
3. What are the benefits and harms of this dataset, and how are they distributed amongst diverse social groups?

---

# Data Import

Today, for lecture, I'm going to ask that you just follow along. You will have an opportunity to practice this in today's lab.

```r
library(tidyverse) # provides %>% and select()
library(spotifyr)  # provides get_artist_audio_features(); needs Spotify API credentials
artist <- get_artist_audio_features(artist = "Janelle Monae") %>%
  select(-c(album_images, artists, available_markets))
```

---

# Distribution of Danceability

.pull-left[
```r
ggplot(artist, aes(x = danceability)) +
  geom_histogram()
```

> What does this message mean?
]

.pull-right[
```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

![](Day10-Frequency_files/figure-html/unnamed-chunk-8-1.png)<!-- -->
]

---

# Distribution of Danceability

.pull-left[
* `binwidth` sets the width of the buckets we'd like to sort our data into.
* `bins` sets the number of buckets to create.
* We choose one or the other when creating histograms.
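
The two options above can be sketched side by side. This is a minimal sketch, not the lecture's own code: it assumes made-up data (the `toy` data frame and its `score` column are hypothetical stand-ins so the code runs without Spotify access) and uses ggplot2's `layer_data()` to look at the counts behind each bar.

```r
library(ggplot2)

# Simulated stand-in for the Spotify data: 200 made-up scores between 0 and 1.
# (Hypothetical values, not real danceability scores.)
set.seed(192)
toy <- data.frame(score = runif(200, min = 0, max = 1))

# Option 1: fix the width of each bin; ggplot2 works out how many bins are needed.
by_width <- ggplot(toy, aes(x = score)) +
  geom_histogram(binwidth = 0.1, color = "white")

# Option 2: fix the number of bins; ggplot2 works out how wide each one must be.
by_count <- ggplot(toy, aes(x = score)) +
  geom_histogram(bins = 10, color = "white")

# layer_data() returns one row per bar, including its edges and its count.
width_bars <- layer_data(by_width)
count_bars <- layer_data(by_count)

sum(width_bars$count) # total number of values counted across all bars
sum(count_bars$count) # same total: only the bin edges differ
```

Either way, every value lands in exactly one bin, so the bar heights always add up to the number of rows in the data.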
]

.pull-right[
```r
ggplot(artist, aes(x = danceability)) +
  geom_histogram(binwidth = 0.1, color = "white")
```

![](Day10-Frequency_files/figure-html/unnamed-chunk-9-1.png)<!-- -->
]

---

# Labels for this plot

.pull-left[
```r
ggplot(artist, aes(x = danceability)) +
  geom_histogram(binwidth = 0.1, color = "white") +
  labs(title = "Distribution of Danceability of Songs by Janelle Monáe, Spotify, 2022",
       x = "Danceability",
       y = "Count of Songs")
```

> How would we describe this distribution?
]

.pull-right[
![](Day10-Frequency_files/figure-html/unnamed-chunk-11-1.png)<!-- -->
]

---

# Acousticness

.pull-left[
```r
ggplot(artist, aes(x = acousticness)) +
  geom_histogram(binwidth = 0.1, color = "white") +
  labs(title = "Distribution of Acousticness of Songs by Janelle Monáe, Spotify, 2022",
       x = "Acousticness",
       y = "Count of Songs")
```

> How would we describe this distribution?

> How would I compare this across Janelle Monáe's albums?
]

.pull-right[
![](Day10-Frequency_files/figure-html/unnamed-chunk-13-1.png)<!-- -->
]

---

# Faceting a Histogram

.pull-left[
```r
ggplot(artist, aes(x = danceability)) +
  geom_histogram(binwidth = 0.1, color = "white") +
  labs(title = "Distribution of Danceability of Songs by Janelle Monáe, Spotify, 2022",
       x = "Danceability",
       y = "Count of Songs") +
  facet_wrap(vars(album_name))
```

> What do we learn from this plot?
]

.pull-right[
![](Day10-Frequency_files/figure-html/unnamed-chunk-15-1.png)<!-- -->
]

---

# Frequency of Key Modes

.pull-left[
```r
ggplot(artist, aes(x = key_mode)) +
  geom_bar() +
  coord_flip()
```
]

.pull-right[
![](Day10-Frequency_files/figure-html/unnamed-chunk-17-1.png)<!-- -->
]

---

# Labels for this plot

.pull-left[
```r
ggplot(artist, aes(x = key_mode)) +
  geom_bar() +
  coord_flip() +
  labs(title = "Frequency of Key Modes in Songs by Janelle Monáe",
       x = "Key Mode",
       y = "Count of Songs")
```

> How might I compare this across Janelle Monáe's albums?
]

.pull-right[
![](Day10-Frequency_files/figure-html/unnamed-chunk-19-1.png)<!-- -->
]

---

# Stacked Bar Plots

> Note `fill = ` gets used for polygons, and `col = ` gets used for points and lines.

.pull-left[
```r
ggplot(artist, aes(x = key_mode, fill = album_name)) +
  geom_bar() +
  coord_flip() +
  labs(title = "Frequency of Key Modes in Songs by Janelle Monáe",
       x = "Key Mode",
       y = "Count of Songs",
       fill = "Album Name") +
  scale_fill_brewer(palette = "Dark2")
```

> How might I compare this across Janelle Monáe's albums?
]

.pull-right[
![](Day10-Frequency_files/figure-html/unnamed-chunk-21-1.png)<!-- -->
]

---

# Dodging

> Note `fill = ` gets used for polygons, and `col = ` gets used for points and lines.

.pull-left[
```r
ggplot(artist, aes(x = key_name, fill = album_name)) +
  geom_bar(position = "dodge") +
  coord_flip() +
  labs(title = "Frequency of Key Modes in Songs by Janelle Monáe",
       x = "Key Mode",
       y = "Count of Songs",
       fill = "Album Name") +
  scale_fill_brewer(palette = "Dark2")
```
]

.pull-right[
![](Day10-Frequency_files/figure-html/unnamed-chunk-23-1.png)<!-- -->
]

---

# Converting to a Percentage Scale

> Setting the position to "fill" stretches every bar to the same height, so the y-axis shows each album's share of the songs in that bar (a proportion from 0 to 1) instead of a count.

.pull-left[
```r
ggplot(artist, aes(x = key_name, fill = album_name)) +
  geom_bar(position = "fill") +
  coord_flip() +
  labs(title = "Key Modes in Songs by Janelle Monáe",
       x = "Key Mode",
       y = "Percentage of Songs",
       fill = "Album Name") +
  scale_fill_brewer(palette = "Dark2")
```
]

.pull-right[
![](Day10-Frequency_files/figure-html/unnamed-chunk-25-1.png)<!-- -->
]

---

# For Wednesday

* Quiz 1
* No reading