Intro to Urban Analytics

Bon Woo Koo & Subhrajit Guhathakurta

2022-08-20

To answer the question we need to first determine what data we need

Since we will be only using cross-sectional data (i.e., data from one period of time) we are limited in examining dynamics. You would probably look for change in several variables over the years to do a proper analysis of this questions!

But given this example of using data to seek some preliminary answers to the questions and strengthen a hypothesis, let’s consider the following data: 1. Yoga Studios in Fulton and Dekalb Counties (already downloaded) 2. Neighborhood parameters (household income, racial make up, home values? Change in home values? Rental values?)

In this example I am hypothesizing that household income, racial mix, housing costs, and real estate taxes will have something to say about gentrification So, I will first download my data for the most recent year from Census

FD_tract <- suppressMessages(
  get_acs(geography = "tract", # or "block group", "county", "state" etc.
          state = "GA",
          county = c("Fulton", "Dekalb"),
          variables = c(hhincome = 'B19019_001',
                        race.tot = "B02001_001",
                        race.white = "B02001_002",
                        race.black = "B02001_003",
                        trans.total = "B08006_001",
                        trans.car = "B08006_002",
                        trans.drovealone = "B08006_003",
                        trans.carpooled = "B08006_004", # Notice that I was not interested in 005-007 (2 person/ 4 person carpool etc.)
                        trans.pubtrans = "B08006_008", # Did not want to download any details about the type of public transport (009-0013)
                        trans.bicycle = "B08006_014",
                        trans.walk = "B08006_015",
                        trans.WfH = "B08006_017",
                        med_housexp = "B25104_001",
                        med_realestate_taxes = "B25103_001"
          ),
          year = 2020,
          survey = "acs5", # American Community Survey 5-year estimate
          geometry = TRUE, # returns sf objects
          output = "wide") # wide vs. long
)


The data contains several redundant columns that we will not use. So, let’s subset the data to only have the columns we will use

FD_tract <- FD_tract %>%
  select(GEOID,
         hhincome = hhincomeE, # New name = old name
         race.tot = race.totE,
         race.white = race.whiteE,
         race.black = race.blackE,
         trans.total = trans.totalE,
         trans.car = trans.carE,
         trans.drovealone = trans.drovealoneE,
         trans.carpooled = trans.carpooledE,
         trans.pubtrans = trans.pubtransE,
         trans.bicycle = trans.bicycleE,
         trans.walk = trans.walkE,
         trans.WfH = trans.WfHE,
         Med_HHExp = med_housexpE,
         med_RETaxes = med_realestate_taxesE)

Let’s display what we have!

tmap_mode("view")
## tmap mode set to interactive viewing
RE_Taxes <- tm_shape(FD_tract) + tm_polygons("med_RETaxes")
HH_exp <- tm_shape(FD_tract) + tm_polygons("Med_HHExp")
tmap_arrange(RE_Taxes, HH_exp)

Now download the Yelp data. This has been done for you (given the problems we have been having with Yelp API)

To save time in class, we will use the Yelp data that Bonwoo downloaded for today’s class.

The Yelp data can be found on Canvas > Files > yelp_in.rds. This data is for Fulton and DeKalb County, GA, and contains Yelp data with categories = “yoga”. This data is already cleaned.

To read the data into R, we will use read_rds() function in readr package. Notice that in the code below, only the name of the .rds file is specified. This works only when the file is in the same folder as your .Rmd file. If not, you will need to provide a full path to the file.

# Reading the yelp data
yelp_in <- read_rds("yelp_in.rds")

I like to constantly check my datasets to see if the results look fine

I will check the file and then display it (i.e., the point locations of Yoga Studios) on the census data

skim(yelp_in)
## Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
## provided. Falling back to `character`.
Data summary
Name yelp_in
Number of rows 161
Number of columns 23
_______________________
Column type frequency:
character 19
logical 1
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1.00 22 22 0 161 0
alias 0 1.00 17 58 0 161 0
name 0 1.00 7 49 0 139 0
image_url 0 1.00 0 68 6 156 0
url 0 1.00 174 215 0 161 0
categories 0 1.00 4 47 0 68 0
transactions 0 1.00 0 0 161 1 0
phone 0 1.00 0 12 6 147 0
display_phone 0 1.00 0 14 6 147 0
price 155 0.04 2 2 0 1 0
location.address1 9 0.94 0 30 7 142 0
location.address2 35 0.78 0 11 52 59 0
location.address3 30 0.81 0 26 126 6 0
location.city 0 1.00 6 19 0 21 0
location.zip_code 0 1.00 0 5 1 51 0
location.country 0 1.00 2 2 0 1 0
location.state 0 1.00 2 2 0 1 0
location.display_address 0 1.00 11 76 0 158 0
geometry 0 1.00 21 38 0 160 0

Variable type: logical

skim_variable n_missing complete_rate mean count
is_closed 0 1 0 FAL: 161

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
review_count 0 1 22.85 34.23 1.00 3.00 9.00 25.00 159.00 ▇▁▁▁▁
rating 0 1 4.45 0.77 2.50 4.00 5.00 5.00 5.00 ▂▁▂▃▇
distance 0 1 1943.22 3926.03 58.82 846.53 1247.88 1789.03 43017.73 ▇▁▁▁▁
tm_shape(yelp_in) + tm_dots(col="red") + tm_shape(FD_tract) + tm_borders()

Appending Census data

Joining census data in this case is somewhat different from earlier. We will do the following: 1. make sure the projection (CRS) of both files are correct 2. Calculate how many yoga studios are within each census tract (there will be many tracts with 0 studios) 3. Join the two files with the count of Yoga studios

## Check to see that the CRS for both the sf files are the same
head(FD_tract)
## Simple feature collection with 6 features and 15 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.47101 ymin: 33.65339 xmax: -84.18201 ymax: 33.81229
## Geodetic CRS:  NAD83
##         GEOID hhincome race.tot race.white race.black trans.total trans.car
## 1 13089021908    43191     4829        323       3368        2421      2137
## 2 13089020400   123711     2743       2435         48        1691      1202
## 3 13089023410    41065     3903        131       3562        2030      1562
## 4 13121010601    55479     3673       1057       2514        1699      1290
## 5 13121008302    28214     1891         33       1836         481       401
## 6 13121004200    23906     2675        252       2423        1073       487
##   trans.drovealone trans.carpooled trans.pubtrans trans.bicycle trans.walk
## 1             1759             378             86             0        101
## 2             1069             133             63            62         13
## 3             1339             223            307             0         87
## 4             1131             159            154             0        113
## 5              347              54             57             0          8
## 6              437              50            460            19          0
##   trans.WfH Med_HHExp med_RETaxes                       geometry
## 1        97      1994        2142 MULTIPOLYGON (((-84.20619 3...
## 2       300      1219        4885 MULTIPOLYGON (((-84.34921 3...
## 3        55      1564        1067 MULTIPOLYGON (((-84.29784 3...
## 4       116      1712        2527 MULTIPOLYGON (((-84.46957 3...
## 5         0       638         604 MULTIPOLYGON (((-84.46315 3...
## 6       106      1470        1652 MULTIPOLYGON (((-84.42334 3...
head(yelp_in)
## Simple feature collection with 6 features and 22 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -84.34933 ymin: 33.76259 xmax: -84.19098 ymax: 33.85257
## Geodetic CRS:  WGS 84
## # A tibble: 6 × 23
##   id      alias name  image…¹ is_cl…² url   revie…³ categ…⁴ rating trans…⁵ phone
##   <chr>   <chr> <chr> <chr>   <lgl>   <chr>   <int> <chr>    <dbl> <chr>   <chr>
## 1 NdD6VW… yoga… Yoga… https:… FALSE   http…       1 Yoga, …    5   ""      +147…
## 2 k5wcy4… spa-… Spa … https:… FALSE   http…      11 Massag…    4.5 ""      +140…
## 3 fdK7PX… will… Will… https:… FALSE   http…       8 Yoga, …    4.5 ""      +140…
## 4 scTOiG… the-… The … https:… FALSE   http…       1 Perfor…    5   ""      +167…
## 5 PuGXt2… toug… Toug… https:… FALSE   http…      42 Traine…    5   ""      +140…
## 6 q9gP-C… boho… Boho… https:… FALSE   http…       9 Yoga       5   ""      +140…
## # … with 12 more variables: display_phone <chr>, distance <dbl>, price <chr>,
## #   location.address1 <chr>, location.address2 <chr>, location.address3 <chr>,
## #   location.city <chr>, location.zip_code <chr>, location.country <chr>,
## #   location.state <chr>, location.display_address <chr>, geometry <POINT [°]>,
## #   and abbreviated variable names ¹​image_url, ²​is_closed, ³​review_count,
## #   ⁴​categories, ⁵​transactions
## # ℹ Use `colnames()` to see all variable names
## It turns out they are NOT! So we need to transform their CRS into the same
FD_tract_Geom <- st_transform(FD_tract, crs=4326)
Yelp_in_Geom <- st_transform(yelp_in, crs=4326)

Yoga_in_tract <- st_join(FD_tract_Geom, Yelp_in_Geom, join = st_intersects)
skim(Yoga_in_tract)
## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.
Data summary
Name Yoga_in_tract
Number of rows 578
Number of columns 38
_______________________
Column type frequency:
character 20
logical 1
numeric 17
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
GEOID 0 1.00 11 11 0 530 0
id 417 0.28 22 22 0 161 0
alias 417 0.28 17 58 0 161 0
name 417 0.28 7 49 0 139 0
image_url 417 0.28 0 68 6 156 0
url 417 0.28 174 215 0 161 0
categories 417 0.28 4 47 0 68 0
transactions 417 0.28 0 0 161 1 0
phone 417 0.28 0 12 6 147 0
display_phone 417 0.28 0 14 6 147 0
price 572 0.01 2 2 0 1 0
location.address1 426 0.26 0 30 7 142 0
location.address2 452 0.22 0 11 52 59 0
location.address3 447 0.23 0 26 126 6 0
location.city 417 0.28 6 19 0 21 0
location.zip_code 417 0.28 0 5 1 51 0
location.country 417 0.28 2 2 0 1 0
location.state 417 0.28 2 2 0 1 0
location.display_address 417 0.28 11 76 0 158 0
geometry 0 1.00 193 3750 0 530 0

Variable type: logical

skim_variable n_missing complete_rate mean count
is_closed 417 0.28 0 FAL: 161

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
hhincome 8 0.99 83324.83 48534.12 13577.00 48668.25 73387.50 101281.75 250001.00 ▇▇▃▁▁
race.tot 0 1.00 3435.28 1269.31 0.00 2498.25 3314.00 4260.50 7401.00 ▁▇▇▃▁
race.white 0 1.00 1442.49 1235.59 0.00 257.00 1264.50 2311.75 5039.00 ▇▅▃▂▁
race.black 0 1.00 1546.21 1517.43 0.00 357.50 903.50 2609.25 7021.00 ▇▂▂▁▁
trans.total 0 1.00 1749.09 686.47 0.00 1274.00 1681.00 2213.00 4114.00 ▂▇▇▃▁
trans.car 0 1.00 1332.79 583.89 0.00 943.00 1257.50 1720.00 3342.00 ▂▇▆▂▁
trans.drovealone 0 1.00 1199.21 538.29 0.00 839.00 1132.50 1531.50 3164.00 ▃▇▆▂▁
trans.carpooled 0 1.00 133.58 130.37 0.00 48.00 98.50 181.75 1109.00 ▇▂▁▁▁
trans.pubtrans 0 1.00 116.22 125.95 0.00 25.25 88.50 166.00 1194.00 ▇▁▁▁▁
trans.bicycle 0 1.00 9.56 25.71 0.00 0.00 0.00 0.00 220.00 ▇▁▁▁▁
trans.walk 0 1.00 44.05 92.15 0.00 0.00 12.00 41.00 788.00 ▇▁▁▁▁
trans.WfH 0 1.00 211.10 160.63 0.00 90.00 184.00 299.75 856.00 ▇▆▂▁▁
Med_HHExp 0 1.00 1368.57 494.20 0.00 1020.00 1350.00 1714.00 2877.00 ▁▆▇▅▁
med_RETaxes 44 0.92 3389.94 2371.97 199.00 1464.75 2854.50 4696.50 10001.00 ▇▆▃▂▁
review_count 417 0.28 22.85 34.23 1.00 3.00 9.00 25.00 159.00 ▇▁▁▁▁
rating 417 0.28 4.45 0.77 2.50 4.00 5.00 5.00 5.00 ▂▁▂▃▇
distance 417 0.28 1943.22 3926.03 58.82 846.53 1247.88 1789.03 43017.73 ▇▁▁▁▁
# Now count the Yoga Studios by tract
yoga_count_tract <- count(as_tibble(Yoga_in_tract), GEOID) %>%
  print()
## # A tibble: 530 × 2
##    GEOID           n
##    <chr>       <int>
##  1 13089020100     1
##  2 13089020200     1
##  3 13089020300     4
##  4 13089020400     1
##  5 13089020500     1
##  6 13089020600     1
##  7 13089020700     1
##  8 13089020801     1
##  9 13089020802     1
## 10 13089020901     2
## # … with 520 more rows
## # ℹ Use `print(n = ...)` to see more rows
# Join tract geometry with the number of Yoga studios in tract
test <- st_join(FD_tract_Geom, Yelp_in_Geom %>% mutate(count = 1))
out <- test %>%
  group_by(GEOID) %>%
  summarise(count = sum(count, na.rm = T))

# Lets' check to see if the polygons and the poin data on Yoga Studios match
tm_shape(out) + tm_polygons(col = "count") + tm_shape(Yelp_in_Geom)  + tm_dots()
## OK Now we are ready to join back the counts of Yoga Studios to the Tract data
FD_tract_Geom_yoga <- FD_tract_Geom %>%
  left_join(out %>% st_set_geometry(NULL), by = "GEOID")

Make sure that the final dataset we will be using for our analysis is as expected

skim(FD_tract_Geom_yoga)
## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.
Data summary
Name FD_tract_Geom_yoga
Number of rows 530
Number of columns 17
_______________________
Column type frequency:
character 2
numeric 15
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
GEOID 0 1 11 11 0 530 0
geometry 0 1 193 3750 0 530 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
hhincome 8 0.98 81559.87 49076.11 13577 46780.50 69623.0 100893.50 250001 ▇▇▃▁▁
race.tot 0 1.00 3409.13 1276.00 0 2498.25 3267.0 4243.00 7401 ▁▇▇▃▁
race.white 0 1.00 1337.87 1198.31 0 237.75 1062.5 2146.00 5039 ▇▅▃▁▁
race.black 0 1.00 1624.68 1545.94 0 367.00 996.5 2739.25 7021 ▇▂▂▁▁
trans.total 0 1.00 1718.82 683.32 0 1268.00 1653.5 2140.25 4114 ▂▇▇▂▁
trans.car 0 1.00 1315.87 585.48 0 926.75 1250.0 1690.50 3342 ▂▇▆▂▁
trans.drovealone 0 1.00 1179.78 536.94 0 831.25 1121.0 1525.75 3164 ▃▇▆▂▁
trans.carpooled 0 1.00 136.09 134.39 0 46.25 99.0 191.25 1109 ▇▂▁▁▁
trans.pubtrans 0 1.00 117.38 128.89 0 26.00 87.5 166.00 1194 ▇▁▁▁▁
trans.bicycle 0 1.00 7.84 21.88 0 0.00 0.0 0.00 220 ▇▁▁▁▁
trans.walk 0 1.00 38.72 84.26 0 0.00 8.5 38.00 788 ▇▁▁▁▁
trans.WfH 0 1.00 203.40 161.01 0 83.00 165.0 292.75 856 ▇▅▂▁▁
Med_HHExp 0 1.00 1341.85 488.96 0 1003.75 1330.0 1680.00 2877 ▁▆▇▅▁
med_RETaxes 41 0.92 3245.91 2338.90 199 1397.00 2607.0 4500.00 10001 ▇▅▃▂▁
count 0 1.00 0.30 0.71 0 0.00 0.0 0.00 6 ▇▁▁▁▁
tm_shape(FD_tract_Geom_yoga) + tm_polygons(col="count") +tm_shape(Yelp_in_Geom) +tm_dots()

Let’s clean up the data and drop the “NA”s for important variables like househiold income and real estate taxes

## Now we have the file to work with (I hope!) Let's check again

print(skim(FD_tract_Geom_yoga))
## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.
## ── Data Summary ────────────────────────
##                            Values            
## Name                       FD_tract_Geom_yoga
## Number of rows             530               
## Number of columns          17                
## _______________________                      
## Column type frequency:                       
##   character                2                 
##   numeric                  15                
## ________________________                     
## Group variables            None              
## 
## ── Variable type: character ────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min  max empty n_unique whitespace
## 1 GEOID                 0             1  11   11     0      530          0
## 2 geometry              0             1 193 3750     0      530          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##    skim_variable    n_missing complete_rate      mean        sd    p0     p25
##  1 hhincome                 8         0.985 81560.    49076.    13577 46780. 
##  2 race.tot                 0         1      3409.     1276.        0  2498. 
##  3 race.white               0         1      1338.     1198.        0   238. 
##  4 race.black               0         1      1625.     1546.        0   367  
##  5 trans.total              0         1      1719.      683.        0  1268  
##  6 trans.car                0         1      1316.      585.        0   927. 
##  7 trans.drovealone         0         1      1180.      537.        0   831. 
##  8 trans.carpooled          0         1       136.      134.        0    46.2
##  9 trans.pubtrans           0         1       117.      129.        0    26  
## 10 trans.bicycle            0         1         7.84     21.9       0     0  
## 11 trans.walk               0         1        38.7      84.3       0     0  
## 12 trans.WfH                0         1       203.      161.        0    83  
## 13 Med_HHExp                0         1      1342.      489.        0  1004. 
## 14 med_RETaxes             41         0.923  3246.     2339.      199  1397  
## 15 count                    0         1         0.304     0.715     0     0  
##        p50     p75   p100 hist 
##  1 69623   100894. 250001 ▇▇▃▁▁
##  2  3267     4243    7401 ▁▇▇▃▁
##  3  1062.    2146    5039 ▇▅▃▁▁
##  4   996.    2739.   7021 ▇▂▂▁▁
##  5  1654.    2140.   4114 ▂▇▇▂▁
##  6  1250     1690.   3342 ▂▇▆▂▁
##  7  1121     1526.   3164 ▃▇▆▂▁
##  8    99      191.   1109 ▇▂▁▁▁
##  9    87.5    166    1194 ▇▁▁▁▁
## 10     0        0     220 ▇▁▁▁▁
## 11     8.5     38     788 ▇▁▁▁▁
## 12   165      293.    856 ▇▅▂▁▁
## 13  1330     1680    2877 ▁▆▇▅▁
## 14  2607     4500   10001 ▇▅▃▂▁
## 15     0        0       6 ▇▁▁▁▁
## $character
## 
## ── Variable type: character ────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min  max empty n_unique whitespace
## 1 GEOID                 0             1  11   11     0      530          0
## 2 geometry              0             1 193 3750     0      530          0
## 
## $numeric
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##    skim_vari…¹ n_mis…² compl…³    mean      sd    p0    p25    p50    p75   p100
##  1 hhincome          8   0.985 8.16e+4 4.91e+4 13577 4.68e4 6.96e4 1.01e5 250001
##  2 race.tot          0   1     3.41e+3 1.28e+3     0 2.50e3 3.27e3 4.24e3   7401
##  3 race.white        0   1     1.34e+3 1.20e+3     0 2.38e2 1.06e3 2.15e3   5039
##  4 race.black        0   1     1.62e+3 1.55e+3     0 3.67e2 9.96e2 2.74e3   7021
##  5 trans.total       0   1     1.72e+3 6.83e+2     0 1.27e3 1.65e3 2.14e3   4114
##  6 trans.car         0   1     1.32e+3 5.85e+2     0 9.27e2 1.25e3 1.69e3   3342
##  7 trans.drov…       0   1     1.18e+3 5.37e+2     0 8.31e2 1.12e3 1.53e3   3164
##  8 trans.carp…       0   1     1.36e+2 1.34e+2     0 4.62e1 9.9 e1 1.91e2   1109
##  9 trans.pubt…       0   1     1.17e+2 1.29e+2     0 2.6 e1 8.75e1 1.66e2   1194
## 10 trans.bicy…       0   1     7.84e+0 2.19e+1     0 0      0      0         220
## 11 trans.walk        0   1     3.87e+1 8.43e+1     0 0      8.5 e0 3.8 e1    788
## 12 trans.WfH         0   1     2.03e+2 1.61e+2     0 8.3 e1 1.65e2 2.93e2    856
## 13 Med_HHExp         0   1     1.34e+3 4.89e+2     0 1.00e3 1.33e3 1.68e3   2877
## 14 med_RETaxes      41   0.923 3.25e+3 2.34e+3   199 1.40e3 2.61e3 4.5 e3  10001
## 15 count             0   1     3.04e-1 7.15e-1     0 0      0      0           6
## # … with 1 more variable: hist <chr>, and abbreviated variable names
## #   ¹​skim_variable, ²​n_missing, ³​complete_rate
## # ℹ Use `colnames()` to see all variable names
# Dropping the missing values
yoga_census_dropnaHH2 <- FD_tract_Geom_yoga[!is.na(FD_tract_Geom_yoga$hhincome),]

# Just to check whether the 8 NAs have been dropped from hhincome
print(skim(yoga_census_dropnaHH2))
## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.
## ── Data Summary ────────────────────────
##                            Values               
## Name                       yoga_census_dropnaHH2
## Number of rows             522                  
## Number of columns          17                   
## _______________________                         
## Column type frequency:                          
##   character                2                    
##   numeric                  15                   
## ________________________                        
## Group variables            None                 
## 
## ── Variable type: character ────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min  max empty n_unique whitespace
## 1 GEOID                 0             1  11   11     0      522          0
## 2 geometry              0             1 194 3750     0      522          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##    skim_variable    n_missing complete_rate      mean        sd    p0     p25
##  1 hhincome                 0         1     81560.    49076.    13577 46780. 
##  2 race.tot                 0         1      3434.     1263.      176  2548. 
##  3 race.white               0         1      1349.     1200.        0   240. 
##  4 race.black               0         1      1634.     1553.        0   367  
##  5 trans.total              0         1      1738.      668.       54  1274. 
##  6 trans.car                0         1      1332.      575.       54   944. 
##  7 trans.drovealone         0         1      1194.      528.       54   834. 
##  8 trans.carpooled          0         1       138.      135.        0    47.2
##  9 trans.pubtrans           0         1       119.      129.        0    26.2
## 10 trans.bicycle            0         1         7.89     22.0       0     0  
## 11 trans.walk               0         1        38.4      83.9       0     0  
## 12 trans.WfH                0         1       206.      161.        0    86  
## 13 Med_HHExp                0         1      1352.      478.      146  1011. 
## 14 med_RETaxes             37         0.929  3249.     2338.      199  1397  
## 15 count                    0         1         0.305     0.718     0     0  
##        p50     p75   p100 hist 
##  1 69623   100894. 250001 ▇▇▃▁▁
##  2  3304.    4252    7401 ▁▇▇▃▁
##  3  1094.    2169.   5039 ▇▅▃▁▁
##  4   994.    2765.   7021 ▇▂▂▁▁
##  5  1663     2149.   4114 ▂▇▇▂▁
##  6  1256     1710.   3342 ▂▇▆▂▁
##  7  1130.    1528.   3164 ▃▇▆▂▁
##  8   102.     193.   1109 ▇▂▁▁▁
##  9    88.5    168.   1194 ▇▁▁▁▁
## 10     0        0     220 ▇▁▁▁▁
## 11     9       38     788 ▇▁▁▁▁
## 12   169      293     856 ▇▅▂▁▁
## 13  1332     1684.   2877 ▂▇▇▅▁
## 14  2607     4500   10001 ▇▅▃▂▁
## 15     0        0       6 ▇▁▁▁▁
## $character
## 
## ── Variable type: character ────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min  max empty n_unique whitespace
## 1 GEOID                 0             1  11   11     0      522          0
## 2 geometry              0             1 194 3750     0      522          0
## 
## $numeric
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##    skim_vari…¹ n_mis…² compl…³    mean      sd    p0    p25    p50    p75   p100
##  1 hhincome          0   1     8.16e+4 4.91e+4 13577 4.68e4 6.96e4 1.01e5 250001
##  2 race.tot          0   1     3.43e+3 1.26e+3   176 2.55e3 3.30e3 4.25e3   7401
##  3 race.white        0   1     1.35e+3 1.20e+3     0 2.40e2 1.09e3 2.17e3   5039
##  4 race.black        0   1     1.63e+3 1.55e+3     0 3.67e2 9.94e2 2.76e3   7021
##  5 trans.total       0   1     1.74e+3 6.68e+2    54 1.27e3 1.66e3 2.15e3   4114
##  6 trans.car         0   1     1.33e+3 5.75e+2    54 9.44e2 1.26e3 1.71e3   3342
##  7 trans.drov…       0   1     1.19e+3 5.28e+2    54 8.35e2 1.13e3 1.53e3   3164
##  8 trans.carp…       0   1     1.38e+2 1.35e+2     0 4.72e1 1.01e2 1.93e2   1109
##  9 trans.pubt…       0   1     1.19e+2 1.29e+2     0 2.62e1 8.85e1 1.68e2   1194
## 10 trans.bicy…       0   1     7.89e+0 2.20e+1     0 0      0      0         220
## 11 trans.walk        0   1     3.84e+1 8.39e+1     0 0      9   e0 3.8 e1    788
## 12 trans.WfH         0   1     2.06e+2 1.61e+2     0 8.6 e1 1.69e2 2.93e2    856
## 13 Med_HHExp         0   1     1.35e+3 4.78e+2   146 1.01e3 1.33e3 1.68e3   2877
## 14 med_RETaxes      37   0.929 3.25e+3 2.34e+3   199 1.40e3 2.61e3 4.5 e3  10001
## 15 count             0   1     3.05e-1 7.18e-1     0 0      0      0           6
## # … with 1 more variable: hist <chr>, and abbreviated variable names
## #   ¹​skim_variable, ²​n_missing, ³​complete_rate
## # ℹ Use `colnames()` to see all variable names
# What about the missing values of med_RETaxes? Let's get rid of those NAs too

y_census_dropnaHHTX <- yoga_census_dropnaHH2[!is.na(yoga_census_dropnaHH2$med_RETaxes),]
skim(y_census_dropnaHHTX)
## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.
Data summary
Name y_census_dropnaHHTX
Number of rows 485
Number of columns 17
_______________________
Column type frequency:
character 2
numeric 15
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
GEOID 0 1 11 11 0 485 0
geometry 0 1 194 3750 0 485 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
hhincome 0 1 84346.51 49491.12 13577 49597 72466 104492 250001 ▇▇▃▁▁
race.tot 0 1 3502.86 1226.89 875 2594 3376 4269 7363 ▃▇▆▃▁
race.white 0 1 1406.58 1211.23 0 250 1227 2251 5039 ▇▅▃▂▁
race.black 0 1 1640.36 1580.14 0 357 984 2796 7021 ▇▂▂▁▁
trans.total 0 1 1784.75 648.34 421 1334 1697 2210 4114 ▃▇▆▂▁
trans.car 0 1 1371.62 558.27 140 982 1300 1745 3342 ▂▇▆▂▁
trans.drovealone 0 1 1231.82 509.93 115 861 1160 1553 3164 ▂▇▆▁▁
trans.carpooled 0 1 139.80 134.58 0 49 105 194 1109 ▇▂▁▁▁
trans.pubtrans 0 1 116.98 129.51 0 26 87 166 1194 ▇▁▁▁▁
trans.bicycle 0 1 7.83 22.05 0 0 0 0 220 ▇▁▁▁▁
trans.walk 0 1 35.96 74.50 0 0 8 38 595 ▇▁▁▁▁
trans.WfH 0 1 215.85 161.00 0 98 185 299 856 ▇▆▂▁▁
Med_HHExp 0 1 1383.44 464.09 246 1046 1359 1712 2877 ▂▇▇▃▁
med_RETaxes 0 1 3249.27 2338.31 199 1397 2607 4500 10001 ▇▅▃▂▁
count 0 1 0.31 0.72 0 0 0 0 6 ▇▁▁▁▁
# To check if it is still a sf file
class(y_census_dropnaHHTX)
## [1] "sf"         "data.frame"

Now we are ready to ask some probing questions about the data

  1. Are yoga studios in places with high median household incomes?

We could use the following ggplot command.

ggplot(y_census_dropnaHHTX, aes(x=hhincome, y=count)) +
  geom_point() +
  ylab("Number of Yoga Studios and Household Median Income in Tract")

How about we just look at a binary field that shows the presence or absence of Yoga Studios in tract?

y_census_dropnaHHTX$Yoga <- ifelse(y_census_dropnaHHTX$count>0, 1, 0)

A better way to visualize this categorical variable is a “boxplot”

boxplot(hhincome~Yoga, data=y_census_dropnaHHTX, main="Boxplot of Yoga Studios by Income", xlab="Whether Yoga Studios are present", ylab="Household median income")

# What about a similat boxplot with Real Estate Taxes?

boxplot(med_RETaxes~Yoga, data=y_census_dropnaHHTX, main="Boxplot of Yoga Studios by Real Estate Taxes", xlab="Whether Yoga Studios are present", ylab="Real Estate Taxes in tract")

Finally, let’s try a binary logistic regression

binary_Yoga1 <- glm(Yoga~hhincome, family=binomial, data=y_census_dropnaHHTX)
summary(binary_Yoga1)
## 
## Call:
## glm(formula = Yoga ~ hhincome, family = binomial, data = y_census_dropnaHHTX)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1200  -0.6891  -0.6304  -0.5736   1.9475  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.903e+00  2.223e-01  -8.558  < 2e-16 ***
## hhincome     7.065e-06  2.066e-06   3.420 0.000626 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 509.32  on 484  degrees of freedom
## Residual deviance: 497.93  on 483  degrees of freedom
## AIC: 501.93
## 
## Number of Fisher Scoring iterations: 4
binary_Yoga2 <- glm(Yoga~med_RETaxes, family=binomial, data=y_census_dropnaHHTX)
summary(binary_Yoga2)
## 
## Call:
## glm(formula = Yoga ~ med_RETaxes, family = binomial, data = y_census_dropnaHHTX)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1968  -0.6965  -0.5880  -0.5323   2.0445  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.999e+00  2.040e-01  -9.799  < 2e-16 ***
## med_RETaxes  2.044e-04  4.469e-05   4.574 4.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 509.32  on 484  degrees of freedom
## Residual deviance: 488.41  on 483  degrees of freedom
## AIC: 492.41
## 
## Number of Fisher Scoring iterations: 4

So what do you think about our initial question? Are Yoga Studios associated with gentrified neighborhoods?

As an exercise, download the data on median rents and examine if there is an association of yoga studios with median rents.

```