class: middle, inverse

# Tidy Data

.font100[
Bon Woo Koo & Subhro Guhathakurta

9/13/2022
]

---
## Content
* Final project proposal
* Categories
* Wide vs. Long forms
* Saving files
* Anonymous function
* Converting an existing data frame into an sf object

---
## Final project proposal

<br>

.center[
Please submit a 300 to 500 word write-up of the project proposal.
]

<br>

**Purpose**: This proposal is to encourage you to start thinking about the project topic & potential data sources.

**Ungraded**: Will not affect your grade in any way.

**Submit by**: Oct 7th (Friday) 11:59 PM.

**Submit through**: Canvas > Assignment > Final Project Proposal.

---
## Categories on Yelp API

Error correction!!

* The argument name in `business_search()` should be "categories", not "category".
* The string you supply must be from the [list that Yelp provides](https://www.yelp.com/developers/documentation/v3/all_category_list).

.footnotesize[
```r
business_search(api_key = Sys.getenv('yelp_api'),
*               categories = 'restaurants',
                latitude = ready_4_yelp$y[which_tract],
                longitude = ready_4_yelp$x[which_tract],
                offset = 0,
                radius = round(ready_4_yelp$radius[which_tract]),
                limit = 50)
```
]

---
## Wide vs. Long forms

.footnotesize[
.pull-left[
with `output="wide"`:

```r
census_wide %>% head() %>% nice_table("350px")
```

<div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:350px; overflow-x: scroll; width:100%; "><table class="table" style="font-size: 12px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> GEOID </th>
   <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> NAME </th>
   <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> hhincomeE </th>
   <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> hhincomeM </th>
   <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> race.totE </th>
   <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> race.totM </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 13121010122 </td>
   <td style="text-align:left;"> Census Tract 101.22, Fulton County, Georgia </td>
   <td style="text-align:right;"> 90586 </td>
   <td style="text-align:right;"> 14002 </td>
   <td style="text-align:right;"> 6383 </td>
   <td style="text-align:right;"> 650 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 13121010123 </td>
   <td style="text-align:left;"> Census Tract 101.23, Fulton County, Georgia </td>
   <td style="text-align:right;"> 77969 </td>
   <td style="text-align:right;"> 7510 </td>
   <td style="text-align:right;"> 5081 </td>
   <td style="text-align:right;"> 716 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 13121010211 </td>
   <td style="text-align:left;"> Census Tract 102.11, Fulton County, Georgia </td>
   <td style="text-align:right;"> 142750 </td>
   <td style="text-align:right;"> 22560 </td>
   <td style="text-align:right;"> 2864 </td>
   <td style="text-align:right;"> 347 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 13121007602 </td>
   <td style="text-align:left;"> Census Tract 76.02, Fulton County, Georgia </td>
   <td style="text-align:right;"> 32500 </td>
   <td style="text-align:right;"> 5264 </td>
   <td style="text-align:right;"> 2570 </td>
   <td style="text-align:right;"> 310 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 13121001700 </td>
   <td style="text-align:left;"> Census Tract 17, Fulton County, Georgia </td>
   <td style="text-align:right;"> 94750 </td>
   <td style="text-align:right;"> 19507 </td>
   <td style="text-align:right;"> 4911 </td>
   <td style="text-align:right;"> 403 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 13121007802 </td>
   <td style="text-align:left;"> Census Tract 78.02, Fulton County, Georgia </td>
   <td style="text-align:right;"> 51388 </td>
   <td style="text-align:right;"> 7637 </td>
   <td style="text-align:right;"> 10961 </td>
   <td style="text-align:right;"> 1177 </td>
  </tr>
</tbody>
</table></div>
]

.pull-right[
with `output="long"`:

```r
census_long %>% head() %>% nice_table("350px")
```

<div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:350px; overflow-x: scroll; width:100%; "><table class="table" style="font-size: 12px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> GEOID </th>
   <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> NAME </th>
   <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> variable </th>
   <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> estimate </th>
   <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> moe </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 13121000100 </td>
   <td style="text-align:left;"> Census Tract 1, Fulton County, Georgia </td>
   <td style="text-align:left;"> race.tot </td>
   <td style="text-align:right;"> 5410 </td>
   <td style="text-align:right;"> 359 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 13121000100 </td>
   <td style="text-align:left;"> Census Tract 1, Fulton County, Georgia </td>
   <td style="text-align:left;"> hhincome </td>
   <td style="text-align:right;"> 168396 </td>
   <td style="text-align:right;"> 18644 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 13121000200 </td>
   <td style="text-align:left;"> Census Tract 2, Fulton County, Georgia </td>
   <td style="text-align:left;"> race.tot </td>
   <td style="text-align:right;"> 6175 </td>
   <td style="text-align:right;"> 448 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 13121000200 </td>
   <td style="text-align:left;"> Census Tract 2, Fulton County, Georgia </td>
   <td style="text-align:left;"> hhincome </td>
   <td style="text-align:right;"> 158011 </td>
   <td style="text-align:right;"> 37856 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 13121000400 </td>
   <td style="text-align:left;"> Census Tract 4, Fulton County, Georgia </td>
   <td style="text-align:left;"> race.tot </td>
   <td style="text-align:right;"> 2047 </td>
   <td style="text-align:right;"> 292 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 13121000400 </td>
   <td style="text-align:left;"> Census Tract 4, Fulton County, Georgia </td>
   <td style="text-align:left;"> hhincome </td>
   <td style="text-align:right;"> 97257 </td>
   <td style="text-align:right;"> 30528 </td>
  </tr>
</tbody>
</table></div>
]
]

---
## Wide vs. Long forms

.footnotesize[
```r
longer <- census_wide %>%
  pivot_longer(cols = hhincomeE:race.totM, # Cols to be affected
               names_to = c("variable"),   # Name for the label column
               values_to = c("value"))     # Name for the value column
longer
```

```
## # A tibble: 816 × 4
##    GEOID       NAME                                        variable   value
##    <chr>       <chr>                                       <chr>      <dbl>
##  1 13121010122 Census Tract 101.22, Fulton County, Georgia hhincomeE  90586
##  2 13121010122 Census Tract 101.22, Fulton County, Georgia hhincomeM  14002
##  3 13121010122 Census Tract 101.22, Fulton County, Georgia race.totE   6383
##  4 13121010122 Census Tract 101.22, Fulton County, Georgia race.totM    650
##  5 13121010123 Census Tract 101.23, Fulton County, Georgia hhincomeE  77969
##  6 13121010123 Census Tract 101.23, Fulton County, Georgia hhincomeM   7510
##  7 13121010123 Census Tract 101.23, Fulton County, Georgia race.totE   5081
##  8 13121010123 Census Tract 101.23, Fulton County, Georgia race.totM    716
##  9 13121010211 Census Tract 102.11, Fulton County, Georgia hhincomeE 142750
## 10 13121010211 Census Tract 102.11, Fulton County, Georgia hhincomeM  22560
## # … with 806 more rows
## # ℹ Use `print(n = ...)` to see more rows
```
]

---
## Wide vs. Long forms

.footnotesize[
```r
wider <- longer %>%
  pivot_wider(id_cols = c(GEOID, NAME),
              names_from = c("variable"),
              values_from = c("value"))
wider
```

```
## # A tibble: 204 × 6
##    GEOID       NAME                              hhinc…¹ hhinc…² race.…³ race.…⁴
##    <chr>       <chr>                               <dbl>   <dbl>   <dbl>   <dbl>
##  1 13121010122 Census Tract 101.22, Fulton Coun…   90586   14002    6383     650
##  2 13121010123 Census Tract 101.23, Fulton Coun…   77969    7510    5081     716
##  3 13121010211 Census Tract 102.11, Fulton Coun…  142750   22560    2864     347
##  4 13121007602 Census Tract 76.02, Fulton Count…   32500    5264    2570     310
##  5 13121001700 Census Tract 17, Fulton County, …   94750   19507    4911     403
##  6 13121007802 Census Tract 78.02, Fulton Count…   51388    7637   10961    1177
##  7 13121007805 Census Tract 78.05, Fulton Count…   31174    5355    3397     633
##  8 13121009700 Census Tract 97, Fulton County, …  208750   66170    3846     316
##  9 13121010206 Census Tract 102.06, Fulton Coun…  192375   49707    5618     361
## 10 13121011303 Census Tract 113.03, Fulton Coun…   45942    4865    9543     666
## # … with 194 more rows, and abbreviated variable names ¹hhincomeE, ²hhincomeM,
## #   ³race.totE, ⁴race.totM
## # ℹ Use `print(n = ...)` to see more rows
```
]

---
## Saving files
### R-Native File formats
**.RData**: Native data storage format for R. Can store multiple objects.

**.RDS**: Native format for a single serialized R object. Can only store one object.

### Read/write RDS (recommended over RData)
**write_rds(), read_rds()** from readr package -> Works the same way as write.csv, read.csv.

### Read/write RData
**save(), save.image()** from base R -> save() can save multiple objects. save.image() saves the entire environment.

**load()** from base R -> The biggest difference from .RDS is that you do not use **<-** with load(). The file stores the original object names, so load() restores each object under the name it was saved with.

---
## Anonymous function
* When using **apply()** or **map()**, you can provide an existing or a custom-made function.
* Similar to lambda in Python, R has anonymous functions.
* An anonymous function is a function that is defined on the fly and disappears after execution.

.footnotesize[
.pull-left[
```r
map(1:5, # input vector
    function(x) x + 1) # anonymous function
```

```
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] 4
## 
## [[4]]
## [1] 5
## 
## [[5]]
## [1] 6
```
]

.pull-right[
```r
map(1:5, # input vector
    function(x){ # anonymous function with {}
      out <- (x + 1)*x
      return(out)
      })
```

```
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 6
## 
## [[3]]
## [1] 12
## 
## [[4]]
## [1] 20
## 
## [[5]]
## [1] 30
```
]
]

---
* map() and its variants have a shorthand syntax that makes the code simpler.
* Instead of declaring `function(x)`, you can use a tilde (~) to indicate that it is an anonymous function.
* Each `x` inside the anonymous function needs to be preceded by a period (`.x`). See the example below.

.footnotesize[
```r
map(1:5, # input
    ~(.x + 1)*.x ) # tilde replaces function().
                   # x is preceded by a period
```

```
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 6
## 
## [[3]]
## [1] 12
## 
## [[4]]
## [1] 20
## 
## [[5]]
## [1] 30
```
]

---
## Existing data frame into an sf object
* You can convert a data frame with lng/lat columns into an sf object. This can be done using **st_as_sf()**.
* The word 'as' indicates that it converts an *existing* object to sf rather than creating one from scratch.

.footnotesize[
```r
# A data frame with XY info
point_df <- data.frame(x = c(-84.3991, -84.4010, -84.3899),
                       y = c(33.7770, 33.7748, 33.7777))

# st_as_sf
point_df %>%
  st_as_sf(coords = c("x", "y"), crs = 4326) %>%
  tm_shape(.) + tm_dots()
```

<img src="Module1_Tidy_Yelp_Slide_files/figure-html/unnamed-chunk-11-1.png" width="100%" />
]

---
## Extension for Mini-assignment 1
* Now due next Tuesday (20).
* Mini-assignment 2 is due next Friday (23).
* Instructions on mini-assignment 2 can be found [here]() or in the syllabus.

---
## Detecting string - str_detect/grepl
* Some people downloaded two Yelp categories at once.
* Rows for categories A and B will be mixed.
* To count how many rows belong to A and how many to B, you need to be able to search strings.

grepl(.red[pattern], .blue[string])

```r
a <- c("yoga studio", "health gym", "pizza", "YoGa")
grepl("yoga", a)
```

```
## [1]  TRUE FALSE FALSE FALSE
```

str_detect(.blue[string], .red[pattern])

```r
str_detect(a, "yoga")
```

```
## [1]  TRUE FALSE FALSE FALSE
```
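
Both calls above miss "YoGa" because matching is case-sensitive by default. A minimal sketch of one way to handle that and then count rows per category, reusing the vector `a` from the example; the object names `yoga_base`, `yoga_str`, `n_yoga`, and `n_gym` are made up for illustration, and real Yelp category strings may need different patterns:

```r
library(stringr)

# Same toy vector as in the example above
a <- c("yoga studio", "health gym", "pizza", "YoGa")

# Base R: ignore.case = TRUE makes grepl() case-insensitive
yoga_base <- grepl("yoga", a, ignore.case = TRUE)

# stringr: wrap the pattern in regex() with ignore_case = TRUE
yoga_str <- str_detect(a, regex("yoga", ignore_case = TRUE))

# A logical vector sums to the number of TRUEs,
# so sum() gives the count of matching rows
n_yoga <- sum(yoga_str)
n_gym  <- sum(str_detect(a, regex("gym", ignore_case = TRUE)))

yoga_str  # TRUE FALSE FALSE TRUE
n_yoga    # 2
n_gym     # 1
```

The same logical vector can also be passed to `filter()` or used with `[` to keep only the rows of one category.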