Tidying Yelp and Census Data

Bonwoo Koo & Subhrajit Guhathakurta

2022-09-13

This R Markdown document has two parts. First, we will go over techniques for cleaning datasets, using simple examples to introduce what each function does. Second, we will clean the Census and Yelp data we downloaded.

Data are messy

When you read textbooks on statistics or online tutorials on Urban Analytics, they often use data that is already nice and clean, and they rarely show you all the hassle they went through to clean it.

Data in the wild are messy and have all sorts of issues. Even when you are using highly systematized data sources (e.g., the Census Bureau), you will need some data cleaning. Data for urban analytics are often much messier. I usually spend between 40% and 60% of my time as a researcher on data cleaning.

What do I mean by ‘messy?’ There isn’t a formal definition of messiness (at least to the best of my knowledge), a point illustrated by Hadley Wickham’s quote of Tolstoy: “Happy families are all alike; every unhappy family is unhappy in its own way”. However, there are some common issues I have encountered frequently, for example:

  1. Rows and columns do not represent observations and variables, respectively.
  2. Duplicated or redundant rows/columns.
  3. Multiple variables in one column.
  4. Missing data.
  5. Unstructured data.
    … and the list goes on.

Why are messy data a problem?

In the end, urban analytics is about extracting useful knowledge from data. This extraction is often done with tools such as visualization, cartography, and statistical analysis. Many of these tools do not handle the issues listed above well. Among the five listed above, the hardest to deal with is perhaps unstructured data.

However, Modules 3 and 4 are devoted to transforming unstructured data into structured data, so let’s focus on the other issues here in this document.

Issue 1 - Rows and columns do not represent observations and variables

There is a standardized framework that illustrates what is not messy – Tidy Data. Note that all of my writing about tidy data is borrowed from Hadley Wickham’s paper and book.

In tidy data,
“Each variable must have its own column.”
“Each observation must have its own row.”
“Each value must have its own cell.”
(from Chapter 12.2 in R for Data Science)

While the definition of an observation is usually clear and easy to understand, the definition of a variable can be much less straightforward. Consider these example tables (they are from this paper).

  • In Table 1, the treatment types form the rows and each column is an observation. This is clearly inconsistent with the principles of tidy data.
  • Table 2 is a transpose of Table 1. Now the rows are observations and the two columns represent the two treatment types. We often describe this structure as the “wide form.”
  • In Table 3, the treatment type is now treated as a variable, and each name is repeated twice. As opposed to “wide,” we often call this structure the “long form.” This is the structure the paper calls ‘tidy.’

Table 1. Typical presentation of data

Treatment John Smith Jane Doe Mary Johnson
treatment_a - 16 3
treatment_b 2 11 1

Table 2. A transpose of Table 1

name treatment_a treatment_b
John Smith - 2
Jane Doe 16 11
Mary Johnson 3 1

Table 3. Same data but variables in columns and obs. in rows

name treatment result
John Smith a -
Jane Doe a 16
Mary Johnson a 3
John Smith b 2
Jane Doe b 11
Mary Johnson b 1

I personally think both Table 2 and Table 3 have their own use cases (let’s not consider Table 1 anymore). For example, if we are interested in comparing treatment A and B (e.g., a t-test, correlation analysis, etc.), the wide form is more intuitive. In contrast, if we are interested in making graphs like these examples, the ggplot2 package in R requires the data to be in long form.
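As a quick base-R sketch of why the wide form suits pairwise comparisons, here is a paired t-test run directly on the two treatment columns (the values mirror Table 2; the test itself is purely illustrative):

```r
# Wide form: one column per treatment, so a paired comparison is direct
wide <- data.frame(name = c("John Smith", "Jane Doe", "Mary Johnson"),
                   treatment_a = c(NA, 16, 3),
                   treatment_b = c(2, 11, 1))

# Pairs with an NA in either column are dropped automatically
tt <- t.test(wide$treatment_a, wide$treatment_b, paired = TRUE)
```

With the long form (Table 3), the same comparison would first require reshaping or grouping by the treatment column.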

In R, there are multiple ways to transform data between wide and long forms. I recommend the pivot_longer() and pivot_wider() functions in the tidyr package. Let’s do some exercises using a toy dataset.

library(tidyverse) # tidyr, dplyr, and purrr are included in the tidyverse package.
library(here)      # for here(), used below to build file paths
library(sf)        # for st_read(), st_as_sf(), st_join(), etc., used later
library(tmap)      # for the interactive map at the end
# Toy dataset
toy_df <- data.frame(name = c("John", "Jane", "Mary"), 
                     treatment_a = c(NA, 16, 3),
                     treatment_b = c(2, 11, 1),
                     treatment_c = c(6, 12, NA))
print(toy_df)
##   name treatment_a treatment_b treatment_c
## 1 John          NA           2           6
## 2 Jane          16          11          12
## 3 Mary           3           1          NA
# pivot longer
(toy_long <- toy_df %>% 
  pivot_longer(cols = treatment_a:treatment_c, 
               names_to = 'treatment', # name (a string) of the new column that stores the old column names
               values_to = 'result')) # name (a string) of the new column that stores the values
## # A tibble: 9 × 3
##   name  treatment   result
##   <chr> <chr>        <dbl>
## 1 John  treatment_a     NA
## 2 John  treatment_b      2
## 3 John  treatment_c      6
## 4 Jane  treatment_a     16
## 5 Jane  treatment_b     11
## 6 Jane  treatment_c     12
## 7 Mary  treatment_a      3
## 8 Mary  treatment_b      1
## 9 Mary  treatment_c     NA
# back to wider
(toy_wide <- toy_long %>% 
  pivot_wider(id_cols = name, # unique identifier
              names_from = treatment, # from which columns should the new column names come?
              values_from = result)) # from which columns should the values come?
## # A tibble: 3 × 4
##   name  treatment_a treatment_b treatment_c
##   <chr>       <dbl>       <dbl>       <dbl>
## 1 John           NA           2           6
## 2 Jane           16          11          12
## 3 Mary            3           1          NA

This issue of defining rows and columns may not seem that important, because we are used to dealing with datasets created by people who have already thought it through.

However, in Urban Analytics, we often create our own datasets from sources that do not come with a well-thought-out data structure. For example, there isn’t a universally accepted way to convert image data into an Excel-like spreadsheet. When converting free-form text (e.g., Twitter data) into a spreadsheet, you will need to think about how to define your observations and variables.
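As a hypothetical illustration of that last point, here is one of many ways to define observations and variables for free-form text; the tweets, the one-row-per-tweet choice, and the word-count variable are all my own assumptions, not from any source data:

```r
# Each tweet is defined as one observation; the variables (id, raw text,
# word count) are design choices we make, not something given by the data.
tweets <- c("data in the wild are messy", "tidy data are easier to analyze")
tweet_df <- data.frame(tweet_id = seq_along(tweets),
                       text     = tweets,
                       n_words  = lengths(strsplit(tweets, " ")))
```

We could just as reasonably have defined one row per word, or per user; the point is that the analyst, not the source, decides what an observation is.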

Issue 2 - Duplicated or redundant rows/columns

Duplicated rows are a common issue when using an API to acquire data. Multiple functions exist in R to deal with duplicates; the most frequently used are duplicated() and distinct().

duplicated()

duplicated() takes a vector and returns a logical vector. When this function sees duplicated values in a vector (e.g., c("A", "A", "A")), it returns FALSE for the first occurrence and TRUE for the rest (e.g., c(FALSE, TRUE, TRUE)). In other words, the first one is not considered a duplicate, but all subsequent ones are. We can flip this logical vector using the negation operator ! and use it as a filter. See below.

dupl_df <- data.frame(name = c("A", "A", "B", "C", "C", "C", "D"),
                      GPA = c(3.5, 3.5, 4.0, 2.0, 3.0, 3.0, 2.0)) 

# Base R
duplicated(dupl_df$name)
## [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
# Duplicates in column "name" removed.
dupl_df[!duplicated(dupl_df$name),]
##   name GPA
## 1    A 3.5
## 3    B 4.0
## 4    C 2.0
## 7    D 2.0

distinct()

This is a powerful function in the dplyr package. It takes variable names as arguments and returns a data frame containing the de-duplicated versions of those variables. Notice that by default, this function drops the columns that are not included in the arguments. To keep all other columns in the output, you need to set the .keep_all = TRUE argument.

You can also specify multiple columns, and the function will consider those multiple columns when determining duplicates.

# Duplicates determined by 'name' only; other columns dropped
dupl_df %>% 
  distinct(name) # Try adding .keep_all = TRUE argument 
##   name
## 1    A
## 2    B
## 3    C
## 4    D
# Duplicates determined by both 'name' and 'GPA'
dupl_df %>% 
  distinct(name, GPA)
##   name GPA
## 1    A 3.5
## 2    B 4.0
## 3    C 2.0
## 4    C 3.0
## 5    D 2.0

Issue 3 - Multiple variables in one column

This issue comes in two broad types.

The first type: Different variables can be concatenated into a long string. A very commonly found example is from Census data: they sometimes return something like “Census Tract 9501, Appling County, Georgia” in a single column. It contains at least three variables: Tract, County, and State.

We can break the string down into pieces using separate() from tidyr package.

  • Notice the two warning messages in the printout below. The first tells you that separate() expects an equal number of pieces in each row after the separation; when a row has more pieces than expected, the extra ones are discarded. The second tells you that the first row does not contain _, so its numeric column is filled with NA.
  • The sep argument sometimes causes issues that are not easily detected. With “Census Tract 9501, Appling County, Georgia”, setting sep="," would be incorrect because a space is also a character in R. The output would be “Census Tract 9501”, ” Appling County”, ” Georgia” – notice the space in front of the county and state names. The correct separator is sep = ", ", with the space after the comma.
# A character vector to split
onecol_df <- data.frame(labels = c('a1','b_2','c_3_2','d_4_1'))
# split the character at _
onecol_df %>% separate(col = "labels", sep = "_", into = c("alphabet", "numeric")) 
## Warning: Expected 2 pieces. Additional pieces discarded in 2 rows [3, 4].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].
##   alphabet numeric
## 1       a1    <NA>
## 2        b       2
## 3        c       3
## 4        d       4
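To illustrate the sep pitfall described above, here is a small sketch using the Census-style string from the text; the column names tract, county, and state are my own choices:

```r
library(tidyr)

# A one-row toy data frame holding the Census-style label
geo <- data.frame(label = "Census Tract 9501, Appling County, Georgia")

# sep = ", " (comma plus space) keeps leading spaces out of the pieces
geo_split <- separate(geo, col = "label", sep = ", ",
                      into = c("tract", "county", "state"))
```

With sep = "," instead, " Appling County" and " Georgia" would each carry a leading space that can silently break later joins and filters.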

The second type: One column in a data frame can contain another data frame or a list. This is the structure used to construct sf objects (remember the sfc class?), and it appears frequently when we convert something like JSON, which can have a nested structure, into a data frame. A list-column isn’t actually an issue per se–it is a legitimate data structure that has many use cases. But since it can be a bit unintuitive for some people, let’s transform it into a more common flat data frame.
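Before we look at the Yelp data, here is a tiny base-R sketch of what a list-column looks like (the column names are made up for illustration):

```r
# A list-column is an ordinary column whose cells each hold a vector or list
df <- data.frame(id = 1:2)
df$tags <- list(c("cafe", "bar"), "bakery")  # cell 1 holds 2 values, cell 2 holds 1
```

Each cell can have a different length or even a different type, which is exactly why list-columns are both flexible and tricky to flatten.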

Let’s use the data we downloaded last week. Notice in the printout below that categories and transactions contain lists, while coordinates and location contain data frames that have yet another nested structure within them. Note that in the printout, coordinates and location are automatically flattened and displayed like normal columns, but that is only in the printout; they are still nested.

# Read a subset of Yelp data we downloaded last week
yelp_subset <- read_rds(here("Lab", "module_1", "week2", "yelp_subset.rds"))
# Print to see what's inside
yelp_subset %>% 
  tibble() %>% 
  print(width = 1000)
## # A tibble: 10 × 5
##    id                     categories   transactions coordinates$latitude
##    <chr>                  <list>       <list>                      <dbl>
##  1 eG-UO83g_5zDk70FIJbm2w <df [1 × 2]> <chr [1]>                    33.8
##  2 lmqNHL01VGjOPMnO_HJ9RQ <df [3 × 2]> <chr [2]>                    33.8
##  3 _iqFvc3zToL08WZrNeFP3Q <df [3 × 2]> <chr [2]>                    33.8
##  4 tDv2qG4N7PsYLN0QYuuaZQ <df [3 × 2]> <chr [2]>                    33.8
##  5 3ehyrexo3WcoTy74c2jDKA <df [3 × 2]> <chr [2]>                    33.8
##  6 hmrRb7qX3K705MuxHHfgNA <df [3 × 2]> <chr [2]>                    33.8
##  7 Gwi9PMVb61nrrpUNa9_wfQ <df [3 × 2]> <chr [1]>                    33.8
##  8 Pl2c7HLD9UgFL2zHP0cZVQ <df [1 × 2]> <chr [2]>                    33.8
##  9 qS2Gg7_FH8iHFiorOVd7hg <df [3 × 2]> <chr [1]>                    33.8
## 10 Z2qMwUhnGt_2pA9uQbS7Uw <df [3 × 2]> <chr [1]>                    33.8
##    $longitude location$address1      $address2 $address3 $city   $zip_code
##         <dbl> <chr>                  <chr>     <chr>     <chr>   <chr>    
##  1      -84.4 1144 Crescent Ave NE   ""        ""        Atlanta 30309    
##  2      -84.4 1100 Peachtree St NE   "Ste 110" ""        Atlanta 30309    
##  3      -84.4 1075 Peachtree St NE   ""        ""        Atlanta 30309    
##  4      -84.4 60 11th St NE          ""        <NA>      Atlanta 30309    
##  5      -84.4 1001 Piedmont Ave      ""        ""        Atlanta 30309    
##  6      -84.4 1065 Peachtree St NE   ""        ""        Atlanta 30309    
##  7      -84.4 1106 Crescent Ave NE   ""        ""        Atlanta 30309    
##  8      -84.4 77TH 12th St NE        "Ste 2"   ""        Atlanta 30309    
##  9      -84.4 1123 Peachtree Walk NE ""        ""        Atlanta 30309    
## 10      -84.4 878 Peachtree St NE    ""        ""        Atlanta 30309    
##    $country $state $display_address
##    <chr>    <chr>  <list>          
##  1 US       GA     <chr [2]>       
##  2 US       GA     <chr [3]>       
##  3 US       GA     <chr [2]>       
##  4 US       GA     <chr [2]>       
##  5 US       GA     <chr [2]>       
##  6 US       GA     <chr [2]>       
##  7 US       GA     <chr [2]>       
##  8 US       GA     <chr [3]>       
##  9 US       GA     <chr [2]>       
## 10 US       GA     <chr [2]>
yelp_subset$coordinates %>% head()
##   latitude longitude
## 1 33.78600 -84.38456
## 2 33.78480 -84.38354
## 3 33.78418 -84.38283
## 4 33.78354 -84.38483
## 5 33.78200 -84.38015
## 6 33.78331 -84.38356

Let’s try flattening the data frames inside the coordinates and location columns. We can do this with the flatten() function in the jsonlite package. Currently, the coordinates column contains a data frame with two columns, latitude and longitude.

After applying this function, we can see new columns, e.g., coordinates.latitude and coordinates.longitude. When flattening, the new column names are automatically generated by concatenating the name of the outer column and the names of the nested columns within it.

yelp_flat <- yelp_subset %>% 
  jsonlite::flatten() %>% 
  as_tibble() 

yelp_flat$coordinates %>% head() # the nested column is gone after flattening; hence the warning
## Warning: Unknown or uninitialised column: `coordinates`.
## NULL

Even after flattening, we still have list-columns remaining. Flattening list-columns can be tricky because, to the best of my knowledge, there is no single function that can flatten any list-column you encounter. Because lists are the most flexible way to store data in R, there can be anything inside a list-column, which makes writing a universally applicable function challenging.

In our case, a closer look at the list-columns showed that transactions and location.display_address have a relatively simple structure: each element of the list contains a character vector. We can simply concatenate them. The output of the code below is one character string per row.

# Concatenate what's inside the list
yelp_concat <- yelp_flat %>% 
  mutate(transactions = transactions %>% 
           map_chr(., function(x) str_c(x, collapse=", ")),
         location.display_address = location.display_address %>% 
           map_chr(., function(x) str_c(x, collapse=", ")))

The categories column is slightly more complicated: each element of the list is a data frame with two columns and a varying number of rows. I think the easiest way to flatten this column is to write a custom function. The course of action is:

  1. Use lapply or map to loop through each element of the list-column.
  2. For each element (which is a data frame), extract the title column.
  3. For the extracted title column (which is now a vector), do the same thing we just did above–concatenate its elements into one string.
# Custom function that takes the data frame in "categories" column in Yelp data
# and returns a character vector
concate_list <- function(x){
  # x is a data frame with columns "alias" and "title" from Yelp$categories
  # returns a single character string of the concatenated titles
  titles <- x[["title"]] %>% str_c(collapse = ", ")
  return(titles)
}

yelp_flat2 <- yelp_concat %>% 
  mutate(categories = categories %>% map_chr(concate_list))

yelp_flat2 %>% print(width = 1000)
## # A tibble: 10 × 13
##    id                     categories                                  
##    <chr>                  <chr>                                       
##  1 eG-UO83g_5zDk70FIJbm2w Southern                                    
##  2 lmqNHL01VGjOPMnO_HJ9RQ Southern, Cocktail Bars, Seafood            
##  3 _iqFvc3zToL08WZrNeFP3Q Steakhouses, American (New), Cocktail Bars  
##  4 tDv2qG4N7PsYLN0QYuuaZQ Spanish, Gastropubs, Tapas Bars             
##  5 3ehyrexo3WcoTy74c2jDKA Breakfast & Brunch, Southern, American (New)
##  6 hmrRb7qX3K705MuxHHfgNA Cafes, Bars, Desserts                       
##  7 Gwi9PMVb61nrrpUNa9_wfQ Seafood, Bars, American (New)               
##  8 Pl2c7HLD9UgFL2zHP0cZVQ Indian                                      
##  9 qS2Gg7_FH8iHFiorOVd7hg Italian, Wine Bars, Pasta Shops             
## 10 Z2qMwUhnGt_2pA9uQbS7Uw Burgers, American (Traditional), Bars       
##    transactions     coordinates.latitude coordinates.longitude
##    <chr>                           <dbl>                 <dbl>
##  1 delivery                         33.8                 -84.4
##  2 delivery, pickup                 33.8                 -84.4
##  3 delivery, pickup                 33.8                 -84.4
##  4 delivery, pickup                 33.8                 -84.4
##  5 delivery, pickup                 33.8                 -84.4
##  6 delivery, pickup                 33.8                 -84.4
##  7 delivery                         33.8                 -84.4
##  8 delivery, pickup                 33.8                 -84.4
##  9 delivery                         33.8                 -84.4
## 10 delivery                         33.8                 -84.4
##    location.address1      location.address2 location.address3 location.city
##    <chr>                  <chr>             <chr>             <chr>        
##  1 1144 Crescent Ave NE   ""                ""                Atlanta      
##  2 1100 Peachtree St NE   "Ste 110"         ""                Atlanta      
##  3 1075 Peachtree St NE   ""                ""                Atlanta      
##  4 60 11th St NE          ""                <NA>              Atlanta      
##  5 1001 Piedmont Ave      ""                ""                Atlanta      
##  6 1065 Peachtree St NE   ""                ""                Atlanta      
##  7 1106 Crescent Ave NE   ""                ""                Atlanta      
##  8 77TH 12th St NE        "Ste 2"           ""                Atlanta      
##  9 1123 Peachtree Walk NE ""                ""                Atlanta      
## 10 878 Peachtree St NE    ""                ""                Atlanta      
##    location.zip_code location.country location.state
##    <chr>             <chr>            <chr>         
##  1 30309             US               GA            
##  2 30309             US               GA            
##  3 30309             US               GA            
##  4 30309             US               GA            
##  5 30309             US               GA            
##  6 30309             US               GA            
##  7 30309             US               GA            
##  8 30309             US               GA            
##  9 30309             US               GA            
## 10 30309             US               GA            
##    location.display_address                        
##    <chr>                                           
##  1 1144 Crescent Ave NE, Atlanta, GA 30309         
##  2 1100 Peachtree St NE, Ste 110, Atlanta, GA 30309
##  3 1075 Peachtree St NE, Atlanta, GA 30309         
##  4 60 11th St NE, Atlanta, GA 30309                
##  5 1001 Piedmont Ave, Atlanta, GA 30309            
##  6 1065 Peachtree St NE, Atlanta, GA 30309         
##  7 1106 Crescent Ave NE, Atlanta, GA 30309         
##  8 77TH 12th St NE, Ste 2, Atlanta, GA 30309       
##  9 1123 Peachtree Walk NE, Atlanta, GA 30309       
## 10 878 Peachtree St NE, Atlanta, GA 30309

Issue 4 - Missing Values

In R, NA is used to represent missing data. Many core functions in R do not work properly when NA is in the vector (e.g., try mean(c(NA, 1, 2, 3))). So it is imperative that NAs are taken care of.
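A quick base-R illustration of that behavior, together with the na.rm argument that many summary functions provide:

```r
x <- c(NA, 1, 2, 3)
m1 <- mean(x)                # NA: the missing value propagates
m2 <- mean(x, na.rm = TRUE)  # 2: computed on the three non-missing values
```

na.rm only affects that one computation; the NA stays in the vector, so the decision of whether to drop it for good is still yours.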

There are many functions that can find NAs and drop them. Base R provides the is.na() function, which returns a logical object: TRUE where the value is NA and FALSE otherwise. You can combine this function with dplyr verbs to drop rows that have NA value(s) in a specific column.

Similarly, the tidyr package has the drop_na() function, but I recommend you exercise caution when using it. Unlike is.na(), drop_na() can work without specifying columns. In that case, it drops every row that has an NA in any column, retaining only the rows with no NA at all. You may end up dropping too many rows for no reason and not even notice it.

A common mistake is dropping rows that have NAs in columns you do not use. If some rows have NAs in columns that you have no intention of using in your analysis, those NAs are harmless (unless they are indicative of other issues)! So do not blindly drop NAs, as you may lose valuable information for no good reason.

# This is the same toy_df from above
toy_df <- data.frame(name = c("John", "Jane", "Mary"), 
                     treatment_a = c(NA, 16, 3),
                     treatment_b = c(2, 11, 1),
                     treatment_c = c(6, 12, NA))

# Dropping NA using is.na()
toy_df %>% 
  filter(!is.na(treatment_a))
##   name treatment_a treatment_b treatment_c
## 1 Jane          16          11          12
## 2 Mary           3           1          NA
# This checks across all columns and drops every row that has at least one NA.
toy_df %>% 
  drop_na()
##   name treatment_a treatment_b treatment_c
## 1 Jane          16          11          12

Tidying Yelp data

We have all the tools we need to clean the Yelp data. Our Yelp data doesn’t seem to have Issue 1, so let’s start from Issue 2.

Issue 2 - Duplicates

We do have many duplicated rows in our data, which is expected. In practice, we normally spend a good amount of time thinking about which column (or combination of columns) should be used to identify duplicates. Luckily, Yelp data provides a unique ID column for each business.

# Read the full data
my_yelp <- read_rds(here("Lab", "module_1", "week2", "yelp_all.rds"))

# Issue 2 ------------------------------
yelp_unique <- my_yelp %>% 
  distinct(id, .keep_all=T)

glue::glue("Before dropping duplicates, there were {nrow(my_yelp)} rows. After dropping them, there are {nrow(yelp_unique)} rows") %>% 
  print()
## Before dropping duplicates, there were 44005 rows. After dropping them, there are 8130 rows

Multiple variables in one column

This is the main issue in Yelp data, caused by the original data format being JSON.

# Issue 3 ------------------------------
yelp_flat <- yelp_unique %>% 
  # 1. Flattening columns with data frame
  jsonlite::flatten() %>% 
  # 2. Handling list-columns
  mutate(transactions = transactions %>% 
           map_chr(., function(x) str_c(x, collapse=", ")),
         location.display_address = location.display_address %>% 
           map_chr(., function(x) str_c(x, collapse=", ")),
         categories = categories %>% map_chr(concate_list)) # concate_list is the custom function

Missing values

We first need to identify whether there are any NA values.

# Issue 4 ------------------------------
yelp_flat %>% 
  map_dbl(., function(x) sum(is.na(x))) 
##                       id                    alias                     name 
##                        0                        0                        0 
##                image_url                is_closed                      url 
##                        0                        0                        0 
##             review_count               categories                   rating 
##                        0                        0                        0 
##             transactions                    price                    phone 
##                        0                     2846                        0 
##            display_phone                 distance     coordinates.latitude 
##                        0                        0                        4 
##    coordinates.longitude        location.address1        location.address2 
##                        4                       71                     1389 
##        location.address3            location.city        location.zip_code 
##                     2581                        0                        0 
##         location.country           location.state location.display_address 
##                        0                        0                        0
  # map_dbl is a variant of map() which outputs 
  # numeric vector rather than a list.

There seem to be many missing values in the ‘price’ and ‘location’ columns. Remember that you do not need to drop rows just because they have NAs in some columns IF YOU DON'T NEED THOSE COLUMNS FOR YOUR ANALYSIS.

In this dataset, missing values in location.address1 are not a problem because we have coordinate information. However, we must drop the four NAs in coordinates.latitude and coordinates.longitude; the sf package cannot handle NAs in coordinates.

Also, missing values in the ‘price’ column can be an issue, as that is one of our main variables. Assuming we will be using the price column, let’s drop NAs in ‘price’ too.

# First, let's verify that the 4 missing values in the lat/long columns are in the same rows.
identical(is.na(yelp_flat$coordinates.latitude),
          is.na(yelp_flat$coordinates.longitude)) # Yes, they are in the same 4 rows.
## [1] TRUE
# Drop them.
yelp_dropna1 <- yelp_flat %>% 
  drop_na(coordinates.longitude)

# Dropping NAs in price
yelp_dropna2 <- yelp_dropna1 %>% 
  drop_na(price)

Additional issue in geographic information in Yelp

The last plot in last week’s R Markdown document was a map of the downloaded Yelp data. There were many points that fell outside of Fulton and DeKalb counties. These are clearly errors, so let’s delete them.

# census boundary
census <- st_read("https://raw.githubusercontent.com/BonwooKoo/UrbanAnalytics2022/main/Lab/module_0/testdata.geojson") 
## Reading layer `testdata' from data source 
##   `https://raw.githubusercontent.com/BonwooKoo/UrbanAnalytics2022/main/Lab/module_0/testdata.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 519 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.85071 ymin: 33.35246 xmax: -84.02371 ymax: 34.18629
## Geodetic CRS:  WGS 84
# Converting yelp_dropna2 into a sf object
yelp_sf <- yelp_dropna2 %>% 
  st_as_sf(coords=c("coordinates.longitude", "coordinates.latitude"), crs = 4326)
  
# sf subsets
yelp_in <- yelp_sf[census %>% 
                     filter(county %in% c("Fulton County", "DeKalb County")) %>% 
                     st_union(), , op = st_intersects]

Comparing messy vs. tidy

Through this tidying, the Yelp data is now ready for analysis. What follows is a summary of what changed and how.

glue::glue("nrow before: {nrow(my_yelp)} -> nrow after: {nrow(yelp_in)} \n
            ncol before: {ncol(my_yelp)} -> ncol after: {ncol(yelp_in)} \n") %>% 
  print()
## nrow before: 44005 -> nrow after: 4271 
## 
## ncol before: 16 -> ncol after: 23
# Visualize
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(yelp_in) + tm_dots(col = "price")

Appending Census Data

Starting next week, we will go over simple statistical analyses to derive important insights. We will need to add Census data to provide social context to the Yelp data. Because Yelp and Census data are created without any reference to each other, the only way that I know to join the two is by the spatial proximity between them.

As we reviewed last week, a spatial join can be done using the st_join() function. This function takes two sf objects in the same CRS and joins them using the spatial relationship defined by the predicate function. Remember that order matters: st_join(x, y) returns x with the attributes of y appended, while st_join(y, x) returns y with the attributes of x appended. Let’s try both.

# Make sure census is an sf object (st_sf() leaves an sf object unchanged).
census_sf <- census %>% st_sf()

# Spatial join
census_yelp <- st_join(census_sf, yelp_in, join = st_intersects)
yelp_census <- st_join(yelp_in, census_sf, join = st_intersects)

# View
census_yelp %>% head()
## Simple feature collection with 6 features and 30 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.70304 ymin: 33.54892 xmax: -84.26339 ymax: 34.05224
## Geodetic CRS:  WGS 84
##         GEOID               tract         county   state hhincome race.tot
## 1 13063040307 Census Tract 403.07 Clayton County Georgia    37064     4939
## 2 13063040512 Census Tract 405.12 Clayton County Georgia    36414     5478
## 3 13063040519 Census Tract 405.19 Clayton County Georgia    32568     4380
## 4 13063040617 Census Tract 406.17 Clayton County Georgia    33182     1418
## 5 13067030224 Census Tract 302.24    Cobb County Georgia    73061     6318
## 6 13067030235 Census Tract 302.35    Cobb County Georgia   104265     4862
##   race.white race.black   id alias name image_url is_closed  url review_count
## 1       1442       1784 <NA>  <NA> <NA>      <NA>        NA <NA>           NA
## 2        351       4996 <NA>  <NA> <NA>      <NA>        NA <NA>           NA
## 3         35       4178 <NA>  <NA> <NA>      <NA>        NA <NA>           NA
## 4        941        147 <NA>  <NA> <NA>      <NA>        NA <NA>           NA
## 5       4019       1262 <NA>  <NA> <NA>      <NA>        NA <NA>           NA
## 6       3904        682 <NA>  <NA> <NA>      <NA>        NA <NA>           NA
##   categories rating transactions price phone display_phone distance
## 1       <NA>     NA         <NA>  <NA>  <NA>          <NA>       NA
## 2       <NA>     NA         <NA>  <NA>  <NA>          <NA>       NA
## 3       <NA>     NA         <NA>  <NA>  <NA>          <NA>       NA
## 4       <NA>     NA         <NA>  <NA>  <NA>          <NA>       NA
## 5       <NA>     NA         <NA>  <NA>  <NA>          <NA>       NA
## 6       <NA>     NA         <NA>  <NA>  <NA>          <NA>       NA
##   location.address1 location.address2 location.address3 location.city
## 1              <NA>              <NA>              <NA>          <NA>
## 2              <NA>              <NA>              <NA>          <NA>
## 3              <NA>              <NA>              <NA>          <NA>
## 4              <NA>              <NA>              <NA>          <NA>
## 5              <NA>              <NA>              <NA>          <NA>
## 6              <NA>              <NA>              <NA>          <NA>
##   location.zip_code location.country location.state location.display_address
## 1              <NA>             <NA>           <NA>                     <NA>
## 2              <NA>             <NA>           <NA>                     <NA>
## 3              <NA>             <NA>           <NA>                     <NA>
## 4              <NA>             <NA>           <NA>                     <NA>
## 5              <NA>             <NA>           <NA>                     <NA>
## 6              <NA>             <NA>           <NA>                     <NA>
##                         geometry
## 1 MULTIPOLYGON (((-84.37445 3...
## 2 MULTIPOLYGON (((-84.44869 3...
## 3 MULTIPOLYGON (((-84.42895 3...
## 4 MULTIPOLYGON (((-84.27173 3...
## 5 MULTIPOLYGON (((-84.65095 3...
## 6 MULTIPOLYGON (((-84.70304 3...
yelp_census %>% head()
## Simple feature collection with 6 features and 30 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -84.38483 ymin: 33.782 xmax: -84.38015 ymax: 33.786
## Geodetic CRS:  WGS 84
##                       id                                alias
## 1 eG-UO83g_5zDk70FIJbm2w south-city-kitchen-midtown-atlanta-2
## 2 lmqNHL01VGjOPMnO_HJ9RQ                     joy-cafe-atlanta
## 3 _iqFvc3zToL08WZrNeFP3Q             stk-steakhouse-atlanta-2
## 4 tDv2qG4N7PsYLN0QYuuaZQ            bulla-gastrobar-atlanta-2
## 5 3ehyrexo3WcoTy74c2jDKA  flying-biscuit-café-midtown-atlanta
## 6 hmrRb7qX3K705MuxHHfgNA    cafe-intermezzo-midtown-atlanta-3
##                            name
## 1    South City Kitchen Midtown
## 2                      Joy Cafe
## 3                STK Steakhouse
## 4               Bulla Gastrobar
## 5 Flying Biscuit Café - Midtown
## 6     Cafe Intermezzo - Midtown
##                                                              image_url
## 1 https://s3-media3.fl.yelpcdn.com/bphoto/L1qX2ttHqvNMqgsw_JQNLQ/o.jpg
## 2 https://s3-media2.fl.yelpcdn.com/bphoto/pk6Vn8NZfZ4SZRBAJqJAiQ/o.jpg
## 3 https://s3-media1.fl.yelpcdn.com/bphoto/cHKMsDThzTVpoRqS39lPXA/o.jpg
## 4 https://s3-media1.fl.yelpcdn.com/bphoto/ZCovd0lMaKQpasok3yivWA/o.jpg
## 5 https://s3-media4.fl.yelpcdn.com/bphoto/9PiLUl2SWUyIQhOECrBcsA/o.jpg
## 6 https://s3-media4.fl.yelpcdn.com/bphoto/9zuh2IfmOwYmkJPnwgH7rw/o.jpg
##   is_closed
## 1     FALSE
## 2     FALSE
## 3     FALSE
## 4     FALSE
## 5     FALSE
## 6     FALSE
##                                                                                                                                                                                                     url
## 1     https://www.yelp.com/biz/south-city-kitchen-midtown-atlanta-2?adjust_creative=b4FQCVifxBm7ICooFs7HiA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=b4FQCVifxBm7ICooFs7HiA
## 2                         https://www.yelp.com/biz/joy-cafe-atlanta?adjust_creative=b4FQCVifxBm7ICooFs7HiA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=b4FQCVifxBm7ICooFs7HiA
## 3                 https://www.yelp.com/biz/stk-steakhouse-atlanta-2?adjust_creative=b4FQCVifxBm7ICooFs7HiA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=b4FQCVifxBm7ICooFs7HiA
## 4                https://www.yelp.com/biz/bulla-gastrobar-atlanta-2?adjust_creative=b4FQCVifxBm7ICooFs7HiA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=b4FQCVifxBm7ICooFs7HiA
## 5 https://www.yelp.com/biz/flying-biscuit-caf%C3%A9-midtown-atlanta?adjust_creative=b4FQCVifxBm7ICooFs7HiA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=b4FQCVifxBm7ICooFs7HiA
## 6        https://www.yelp.com/biz/cafe-intermezzo-midtown-atlanta-3?adjust_creative=b4FQCVifxBm7ICooFs7HiA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=b4FQCVifxBm7ICooFs7HiA
##   review_count                                   categories rating
## 1         2922                                     Southern    4.5
## 2         1148             Southern, Cocktail Bars, Seafood    4.5
## 3         1751   Steakhouses, American (New), Cocktail Bars    3.5
## 4          750              Spanish, Gastropubs, Tapas Bars    4.0
## 5         1800 Breakfast & Brunch, Southern, American (New)    4.0
## 6         1368                        Cafes, Bars, Desserts    3.5
##       transactions price        phone  display_phone   distance
## 1         delivery    $$ +14048737358 (404) 873-7358 242.850935
## 2 delivery, pickup    $$ +14049961377 (404) 996-1377  91.544936
## 3 delivery, pickup  $$$$ +14047930144 (404) 793-0144   4.314353
## 4 delivery, pickup    $$ +14049006926 (404) 900-6926 196.780359
## 5 delivery, pickup    $$ +14048748887 (404) 874-8887 344.764641
## 6 delivery, pickup    $$ +14708783137 (470) 878-3137 118.986629
##      location.address1 location.address2 location.address3 location.city
## 1 1144 Crescent Ave NE                                           Atlanta
## 2 1100 Peachtree St NE           Ste 110                         Atlanta
## 3 1075 Peachtree St NE                                           Atlanta
## 4        60 11th St NE                                <NA>       Atlanta
## 5    1001 Piedmont Ave                                           Atlanta
## 6 1065 Peachtree St NE                                           Atlanta
##   location.zip_code location.country location.state
## 1             30309               US             GA
## 2             30309               US             GA
## 3             30309               US             GA
## 4             30309               US             GA
## 5             30309               US             GA
## 6             30309               US             GA
##                           location.display_address       GEOID           tract
## 1          1144 Crescent Ave NE, Atlanta, GA 30309 13121001100 Census Tract 11
## 2 1100 Peachtree St NE, Ste 110, Atlanta, GA 30309 13121001100 Census Tract 11
## 3          1075 Peachtree St NE, Atlanta, GA 30309 13121001100 Census Tract 11
## 4                 60 11th St NE, Atlanta, GA 30309 13121001100 Census Tract 11
## 5             1001 Piedmont Ave, Atlanta, GA 30309 13121001100 Census Tract 11
## 6          1065 Peachtree St NE, Atlanta, GA 30309 13121001100 Census Tract 11
##          county   state hhincome race.tot race.white race.black
## 1 Fulton County Georgia    99205     4356       3205        356
## 2 Fulton County Georgia    99205     4356       3205        356
## 3 Fulton County Georgia    99205     4356       3205        356
## 4 Fulton County Georgia    99205     4356       3205        356
## 5 Fulton County Georgia    99205     4356       3205        356
## 6 Fulton County Georgia    99205     4356       3205        356
##                     geometry
## 1   POINT (-84.38456 33.786)
## 2  POINT (-84.38354 33.7848)
## 3 POINT (-84.38283 33.78418)
## 4 POINT (-84.38483 33.78354)
## 5   POINT (-84.38015 33.782)
## 6 POINT (-84.38356 33.78331)
tm_shape(census_yelp %>% group_by(GEOID) %>% summarise(rating=mean(rating))) + 
  tm_polygons(col = "rating", style = "quantile")
tm_shape(yelp_census) + tm_dots(col="hhincome")

Other useful functions (Optional)

There are many other tools and functions that are very useful. I introduce some of them here so that you know they exist and have a reference you can come back to in the future should you need them.

When you need to re-code a variable’s value - case_when()

The arguments for case_when() consist of two parts: a condition and an output, joined by a tilde (~). In the example below, review_count > 1000 is the condition, and the part following the tilde ("many") is the output. case_when() evaluates each condition in order and returns the corresponding output for the first condition that is true.

yelp_in %>% 
  # Use mutate bc the re-coded variable is a new variable
  mutate(review_count_binary = case_when(review_count > 1000 ~ "many",
                                         review_count <= 1000 ~ "few")) %>% 
  # Select these two columns to simplify the print out
  select(review_count, review_count_binary) %>% 
  head()
## Simple feature collection with 6 features and 2 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -84.38483 ymin: 33.782 xmax: -84.38015 ymax: 33.786
## Geodetic CRS:  WGS 84
##   review_count review_count_binary                   geometry
## 1         2922                many   POINT (-84.38456 33.786)
## 2         1148                many  POINT (-84.38354 33.7848)
## 3         1751                many POINT (-84.38283 33.78418)
## 4          750                 few POINT (-84.38483 33.78354)
## 5         1800                many   POINT (-84.38015 33.782)
## 6         1368                many POINT (-84.38356 33.78331)
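One detail worth knowing: rows that match none of the conditions (e.g., an NA in review_count) come back as NA. A final TRUE ~ ... condition acts as a catch-all. Here is a minimal sketch using a toy tibble with hypothetical values (not the actual yelp_in data), assuming dplyr is loaded:

```r
library(dplyr)

# Toy data standing in for yelp_in; the values are made up for illustration
df <- tibble(review_count = c(25, 400, 2922, NA))

df <- df %>%
  mutate(review_count_binary = case_when(
    review_count > 1000  ~ "many",
    review_count <= 1000 ~ "few",
    TRUE                 ~ "unknown"  # catch-all for rows no earlier condition matched (e.g., NA)
  ))
```

Without the TRUE line, the fourth row would be NA rather than "unknown".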

When you need to select or modify multiple columns based on common characteristics - across()

The across() function allows us to select multiple columns and apply an operation to all of them. In the code below, where(is.numeric) indicates that I want all numeric columns, and scale() is the operation I want to apply to each of them, standardizing the values into z-scores.

yelp_in %>% 
  mutate(across(where(is.numeric), scale)) %>% 
  select(where(is.numeric))
## Simple feature collection with 4271 features and 3 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -84.74486 ymin: 33.51273 xmax: -84.07477 ymax: 34.16806
## Geodetic CRS:  WGS 84
## First 10 features:
##    review_count    rating   distance                   geometry
## 1      9.599010 1.2416259 -1.1462974   POINT (-84.38456 33.786)
## 2      3.409498 1.2416259 -1.2565221  POINT (-84.38354 33.7848)
## 3      5.513374 0.1277876 -1.3200686 POINT (-84.38283 33.78418)
## 4      2.020871 0.6847067 -1.1798593 POINT (-84.38483 33.78354)
## 5      5.684336 0.6847067 -1.0720545   POINT (-84.38015 33.782)
## 6      4.177081 0.1277876 -1.2365312 POINT (-84.38356 33.78331)
## 7      1.609167 0.6847067 -1.1924662   POINT (-84.38452 33.785)
## 8      1.532408 0.6847067 -1.1680337  POINT (-84.38515 33.7844)
## 9      2.244167 0.6847067 -1.0875976 POINT (-84.38609 33.78532)
## 10     6.001836 0.6847067 -0.8920605 POINT (-84.38444 33.77905)
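across() can also apply several functions at once by passing a named list, which creates one new column per function. Below is a minimal sketch on a toy tibble (hypothetical values, not the yelp_in data), assuming dplyr is loaded; wrapping scale() in as.numeric() keeps the result a plain vector rather than a one-column matrix:

```r
library(dplyr)

# Toy data with two numeric columns and one character column (made-up values)
df <- tibble(review_count = c(100, 300),
             rating       = c(4.0, 4.5),
             name         = c("a", "b"))

# Standardize every numeric column; the character column passes through untouched
out <- df %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))
```

The formula shorthand ~ as.numeric(scale(.x)) is an anonymous function where .x stands for each selected column in turn.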

When you need to count the frequency of each value in a variable - table() or count()

Often you need to check how many occurrences there are of each value in a given variable. This is perhaps used more often to explore a dataset at hand than as part of tidying your data. You can use table() from base R or count() from the dplyr package.

# Using table
yelp_in %>% 
  pull(price) %>% table
## .
##    $   $$  $$$ $$$$ 
## 1914 2173  159   25
# Using count
yelp_in %>% 
  count(price)
## Simple feature collection with 4 features and 2 fields
## Geometry type: MULTIPOINT
## Dimension:     XY
## Bounding box:  xmin: -84.74486 ymin: 33.51273 xmax: -84.07477 ymax: 34.16806
## Geodetic CRS:  WGS 84
##   price    n                       geometry
## 1     $ 1914 MULTIPOINT ((-84.66889 33.5...
## 2    $$ 2173 MULTIPOINT ((-84.73536 33.5...
## 3   $$$  159 MULTIPOINT ((-84.74486 33.5...
## 4  $$$$   25 MULTIPOINT ((-84.3699 33.75...
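Two options are worth remembering here: count(..., sort = TRUE) orders the result by frequency, and table() silently drops NA values unless you ask for them with useNA. A minimal sketch on a toy tibble (hypothetical values), assuming dplyr is loaded:

```r
library(dplyr)

# Toy price column with a missing value (made-up data)
prices <- tibble(price = c("$", "$$", "$$", NA, "$$$"))

# count() with sort = TRUE puts the most frequent value first
freq <- prices %>% count(price, sort = TRUE)

# table() excludes NA by default; useNA = "ifany" adds an NA column when present
tab <- table(prices$price, useNA = "ifany")
```

Note that count() includes NA as its own row by default, so comparing it against a plain table() call can give different totals.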

References

Codd, E. F. (1990). The relational model for database management: version 2. Addison-Wesley Longman Publishing Co., Inc.

Wickham, H., & Grolemund, G. (2016). R for data science: import, tidy, transform, visualize, and model data. O'Reilly Media, Inc.