Intro to Urban Analytics

Bon Woo Koo & Subhrajit Guhathakurta

2022-08-20

To answer the question we need to first determine what data we need

Since we will be only using cross-sectional data (i.e., data from one period of time) we are limited in examining dynamics. You would probably look for change in several variables over the years to do a proper analysis of this questions!

But given this example of using data to seek some preliminary answers to the questions and strengthen a hypothesis, let’s consider the following data: 1. Yoga Studios in Fulton and Dekalb Counties (already downloaded) 2. Neighborhood parameters (household income, racial make up, home values? Change in home values? Rental values?)

In this example I am hypothesizing that household income, racial mix, housing costs, and real estate taxes will have something to say about gentrification So, I will first download my data for the most recent year from Census

FD_tract <- suppressMessages(
  get_acs(geography = "tract", # or "block group", "county", "state" etc.
          state = "GA",
          county = c("Fulton", "Dekalb"),
          variables = c(hhincome = 'B19019_001',
                        race.tot = "B02001_001",
                        race.white = "B02001_002",
                        race.black = "B02001_003",
                        trans.total = "B08006_001",
                        trans.car = "B08006_002",
                        trans.drovealone = "B08006_003",
                        trans.carpooled = "B08006_004", # Notice that I was not interested in 005-007 (2 person/ 4 person carpool etc.)
                        trans.pubtrans = "B08006_008", # Did not want to download any details about the type of public transport (009-0013)
                        trans.bicycle = "B08006_014",
                        trans.walk = "B08006_015",
                        trans.WfH = "B08006_017",
                        med_housexp = "B25104_001",
                        med_realestate_taxes = "B25103_001"
          ),
          year = 2020,
          survey = "acs5", # American Community Survey 5-year estimate
          geometry = TRUE, # returns sf objects
          output = "wide") # wide vs. long
)

The data contains several redundant columns that we will not use. So, let’s subset the data to only have the columns we will use

FD_tract <- FD_tract %>%
  select(GEOID,
         hhincome = hhincomeE, # New name = old name
         race.tot = race.totE,
         race.white = race.whiteE,
         race.black = race.blackE,
         trans.total = trans.totalE,
         trans.car = trans.carE,
         trans.drovealone = trans.drovealoneE,
         trans.carpooled = trans.carpooledE,
         trans.pubtrans = trans.pubtransE,
         trans.bicycle = trans.bicycleE,
         trans.walk = trans.walkE,
         trans.WfH = trans.WfHE,
         Med_HHExp = med_housexpE,
         med_RETaxes = med_realestate_taxesE)

Let’s display what we have!

tmap_mode("view")
## tmap mode set to interactive viewing
RE_Taxes <- tm_shape(FD_tract) + tm_polygons("med_RETaxes")
HH_exp <- tm_shape(FD_tract) + tm_polygons("Med_HHExp")
tmap_arrange(RE_Taxes, HH_exp)

Now download the Yelp data. This has been done for you (given the problems we have been having with Yelp API)

To save time in class, we will use the Yelp data that Bonwoo downloaded for today’s class.

The Yelp data can be found on Canvas > Files > yelp_in.rds. This data is for Fulton and DeKalb County, GA, and contains Yelp data with categories = “yoga”. This data is already cleaned.

To read the data into R, we will use read_rds() function in readr package. Notice that in the code below, only the name of the .rds file is specified. This works only when the file is in the same folder as your .Rmd file. If not, you will need to provide a full path to the file.

# Reading the yelp data
yelp_in <- read_rds("yelp_in.rds")

I like to constantly check my datasets to see if the results look fine

I will check the file and then display it (i.e., the point locations of Yoga Studios) on the census data

skim(yelp_in)

## Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
## provided. Falling back to `character`.

Data summary
Name	yelp_in
Number of rows	161
Number of columns	23
_______________________
Column type frequency:
character	19
logical	1
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique
id	0	1.00	22	22	0	161
alias	0	1.00	17	58	0	161
name	0	1.00	7	49	0	139
image_url	0	1.00	0	68	6	156
url	0	1.00	174	215	0	161
categories	0	1.00	4	47	0	68
transactions	0	1.00	0	0	161	1
phone	0	1.00	0	12	6	147
display_phone	0	1.00	0	14	6	147
price	155	0.04	2	2	0	1
location.address1	9	0.94	0	30	7	142
location.address2	35	0.78	0	11	52	59
location.address3	30	0.81	0	26	126	6
location.city	0	1.00	6	19	0	21
location.zip_code	0	1.00	0	5	1	51
location.country	0	1.00	2	2	0	1
location.state	0	1.00	2	2	0	1
location.display_address	0	1.00	11	76	0	158
geometry	0	1.00	21	38	0	160

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
is_closed	0	1	0	FAL: 161

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
review_count	1	22.85	34.23	1.00	3.00	9.00	25.00	159.00	▇▁▁▁▁
rating	1	4.45	0.77	2.50	4.00	5.00	5.00	5.00	▂▁▂▃▇
distance	1	1943.22	3926.03	58.82	846.53	1247.88	1789.03	43017.73	▇▁▁▁▁

tm_shape(yelp_in) + tm_dots(col="red") + tm_shape(FD_tract) + tm_borders()

Appending Census data

Joining census data in this case is somewhat different from earlier. We will do the following: 1. make sure the projection (CRS) of both files are correct 2. Calculate how many yoga studios are within each census tract (there will be many tracts with 0 studios) 3. Join the two files with the count of Yoga studios

## Check to see that the CRS for both the sf files are the same
head(FD_tract)

## Simple feature collection with 6 features and 15 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.47101 ymin: 33.65339 xmax: -84.18201 ymax: 33.81229
## Geodetic CRS:  NAD83
##         GEOID hhincome race.tot race.white race.black trans.total trans.car
## 1 13089021908    43191     4829        323       3368        2421      2137
## 2 13089020400   123711     2743       2435         48        1691      1202
## 3 13089023410    41065     3903        131       3562        2030      1562
## 4 13121010601    55479     3673       1057       2514        1699      1290
## 5 13121008302    28214     1891         33       1836         481       401
## 6 13121004200    23906     2675        252       2423        1073       487
##   trans.drovealone trans.carpooled trans.pubtrans trans.bicycle trans.walk
## 1             1759             378             86             0        101
## 2             1069             133             63            62         13
## 3             1339             223            307             0         87
## 4             1131             159            154             0        113
## 5              347              54             57             0          8
## 6              437              50            460            19          0
##   trans.WfH Med_HHExp med_RETaxes                       geometry
## 1        97      1994        2142 MULTIPOLYGON (((-84.20619 3...
## 2       300      1219        4885 MULTIPOLYGON (((-84.34921 3...
## 3        55      1564        1067 MULTIPOLYGON (((-84.29784 3...
## 4       116      1712        2527 MULTIPOLYGON (((-84.46957 3...
## 5         0       638         604 MULTIPOLYGON (((-84.46315 3...
## 6       106      1470        1652 MULTIPOLYGON (((-84.42334 3...

head(yelp_in)

## Simple feature collection with 6 features and 22 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -84.34933 ymin: 33.76259 xmax: -84.19098 ymax: 33.85257
## Geodetic CRS:  WGS 84
## # A tibble: 6 × 23
##   id      alias name  image…¹ is_cl…² url   revie…³ categ…⁴ rating trans…⁵ phone
##   <chr>   <chr> <chr> <chr>   <lgl>   <chr>   <int> <chr>    <dbl> <chr>   <chr>
## 1 NdD6VW… yoga… Yoga… https:… FALSE   http…       1 Yoga, …    5   ""      +147…
## 2 k5wcy4… spa-… Spa … https:… FALSE   http…      11 Massag…    4.5 ""      +140…
## 3 fdK7PX… will… Will… https:… FALSE   http…       8 Yoga, …    4.5 ""      +140…
## 4 scTOiG… the-… The … https:… FALSE   http…       1 Perfor…    5   ""      +167…
## 5 PuGXt2… toug… Toug… https:… FALSE   http…      42 Traine…    5   ""      +140…
## 6 q9gP-C… boho… Boho… https:… FALSE   http…       9 Yoga       5   ""      +140…
## # … with 12 more variables: display_phone <chr>, distance <dbl>, price <chr>,
## #   location.address1 <chr>, location.address2 <chr>, location.address3 <chr>,
## #   location.city <chr>, location.zip_code <chr>, location.country <chr>,
## #   location.state <chr>, location.display_address <chr>, geometry <POINT [°]>,
## #   and abbreviated variable names ¹image_url, ²is_closed, ³review_count,
## #   ⁴categories, ⁵transactions
## # ℹ Use `colnames()` to see all variable names

## It turns out they are NOT! So we need to transform their CRS into the same
FD_tract_Geom <- st_transform(FD_tract, crs=4326)
Yelp_in_Geom <- st_transform(yelp_in, crs=4326)

Yoga_in_tract <- st_join(FD_tract_Geom, Yelp_in_Geom, join = st_intersects)
skim(Yoga_in_tract)

## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.

Data summary
Name	Yoga_in_tract
Number of rows	578
Number of columns	38
_______________________
Column type frequency:
character	20
logical	1
numeric	17
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique
GEOID	0	1.00	11	11	0	530
id	417	0.28	22	22	0	161
alias	417	0.28	17	58	0	161
name	417	0.28	7	49	0	139
image_url	417	0.28	0	68	6	156
url	417	0.28	174	215	0	161
categories	417	0.28	4	47	0	68
transactions	417	0.28	0	0	161	1
phone	417	0.28	0	12	6	147
display_phone	417	0.28	0	14	6	147
price	572	0.01	2	2	0	1
location.address1	426	0.26	0	30	7	142
location.address2	452	0.22	0	11	52	59
location.address3	447	0.23	0	26	126	6
location.city	417	0.28	6	19	0	21
location.zip_code	417	0.28	0	5	1	51
location.country	417	0.28	2	2	0	1
location.state	417	0.28	2	2	0	1
location.display_address	417	0.28	11	76	0	158
geometry	0	1.00	193	3750	0	530

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
is_closed	417	0.28	0	FAL: 161

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
hhincome	8	0.99	83324.83	48534.12	13577.00	48668.25	73387.50	101281.75	250001.00	▇▇▃▁▁
race.tot	0	1.00	3435.28	1269.31	0.00	2498.25	3314.00	4260.50	7401.00	▁▇▇▃▁
race.white	0	1.00	1442.49	1235.59	0.00	257.00	1264.50	2311.75	5039.00	▇▅▃▂▁
race.black	0	1.00	1546.21	1517.43	0.00	357.50	903.50	2609.25	7021.00	▇▂▂▁▁
trans.total	0	1.00	1749.09	686.47	0.00	1274.00	1681.00	2213.00	4114.00	▂▇▇▃▁
trans.car	0	1.00	1332.79	583.89	0.00	943.00	1257.50	1720.00	3342.00	▂▇▆▂▁
trans.drovealone	0	1.00	1199.21	538.29	0.00	839.00	1132.50	1531.50	3164.00	▃▇▆▂▁
trans.carpooled	0	1.00	133.58	130.37	0.00	48.00	98.50	181.75	1109.00	▇▂▁▁▁
trans.pubtrans	0	1.00	116.22	125.95	0.00	25.25	88.50	166.00	1194.00	▇▁▁▁▁
trans.bicycle	0	1.00	9.56	25.71	0.00	0.00	0.00	0.00	220.00	▇▁▁▁▁
trans.walk	0	1.00	44.05	92.15	0.00	0.00	12.00	41.00	788.00	▇▁▁▁▁
trans.WfH	0	1.00	211.10	160.63	0.00	90.00	184.00	299.75	856.00	▇▆▂▁▁
Med_HHExp	0	1.00	1368.57	494.20	0.00	1020.00	1350.00	1714.00	2877.00	▁▆▇▅▁
med_RETaxes	44	0.92	3389.94	2371.97	199.00	1464.75	2854.50	4696.50	10001.00	▇▆▃▂▁
review_count	417	0.28	22.85	34.23	1.00	3.00	9.00	25.00	159.00	▇▁▁▁▁
rating	417	0.28	4.45	0.77	2.50	4.00	5.00	5.00	5.00	▂▁▂▃▇
distance	417	0.28	1943.22	3926.03	58.82	846.53	1247.88	1789.03	43017.73	▇▁▁▁▁

# Now count the Yoga Studios by tract
yoga_count_tract <- count(as_tibble(Yoga_in_tract), GEOID) %>%
  print()

## # A tibble: 530 × 2
##    GEOID           n
##    <chr>       <int>
##  1 13089020100     1
##  2 13089020200     1
##  3 13089020300     4
##  4 13089020400     1
##  5 13089020500     1
##  6 13089020600     1
##  7 13089020700     1
##  8 13089020801     1
##  9 13089020802     1
## 10 13089020901     2
## # … with 520 more rows
## # ℹ Use `print(n = ...)` to see more rows

# Join tract geometry with the number of Yoga studios in tract
test <- st_join(FD_tract_Geom, Yelp_in_Geom %>% mutate(count = 1))
out <- test %>%
  group_by(GEOID) %>%
  summarise(count = sum(count, na.rm = T))

# Lets' check to see if the polygons and the poin data on Yoga Studios match
tm_shape(out) + tm_polygons(col = "count") + tm_shape(Yelp_in_Geom)  + tm_dots()

## OK Now we are ready to join back the counts of Yoga Studios to the Tract data
FD_tract_Geom_yoga <- FD_tract_Geom %>%
  left_join(out %>% st_set_geometry(NULL), by = "GEOID")

Make sure that the final dataset we will be using for our analysis is as expected

skim(FD_tract_Geom_yoga)

## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.

Data summary
Name	FD_tract_Geom_yoga
Number of rows	530
Number of columns	17
_______________________
Column type frequency:
character	2
numeric	15
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
GEOID	0	1	11	11	0	530	0
geometry	0	1	193	3750	0	530	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
hhincome	8	0.98	81559.87	49076.11	13577	46780.50	69623.0	100893.50	250001	▇▇▃▁▁
race.tot	0	1.00	3409.13	1276.00	0	2498.25	3267.0	4243.00	7401	▁▇▇▃▁
race.white	0	1.00	1337.87	1198.31	0	237.75	1062.5	2146.00	5039	▇▅▃▁▁
race.black	0	1.00	1624.68	1545.94	0	367.00	996.5	2739.25	7021	▇▂▂▁▁
trans.total	0	1.00	1718.82	683.32	0	1268.00	1653.5	2140.25	4114	▂▇▇▂▁
trans.car	0	1.00	1315.87	585.48	0	926.75	1250.0	1690.50	3342	▂▇▆▂▁
trans.drovealone	0	1.00	1179.78	536.94	0	831.25	1121.0	1525.75	3164	▃▇▆▂▁
trans.carpooled	0	1.00	136.09	134.39	0	46.25	99.0	191.25	1109	▇▂▁▁▁
trans.pubtrans	0	1.00	117.38	128.89	0	26.00	87.5	166.00	1194	▇▁▁▁▁
trans.bicycle	0	1.00	7.84	21.88	0	0.00	0.0	0.00	220	▇▁▁▁▁
trans.walk	0	1.00	38.72	84.26	0	0.00	8.5	38.00	788	▇▁▁▁▁
trans.WfH	0	1.00	203.40	161.01	0	83.00	165.0	292.75	856	▇▅▂▁▁
Med_HHExp	0	1.00	1341.85	488.96	0	1003.75	1330.0	1680.00	2877	▁▆▇▅▁
med_RETaxes	41	0.92	3245.91	2338.90	199	1397.00	2607.0	4500.00	10001	▇▅▃▂▁
count	0	1.00	0.30	0.71	0	0.00	0.0	0.00	6	▇▁▁▁▁

tm_shape(FD_tract_Geom_yoga) + tm_polygons(col="count") +tm_shape(Yelp_in_Geom) +tm_dots()

Let’s clean up the data and drop the “NA”s for important variables like househiold income and real estate taxes

## Now we have the file to work with (I hope!) Let's check again

print(skim(FD_tract_Geom_yoga))

## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.

## ── Data Summary ────────────────────────
##                            Values            
## Name                       FD_tract_Geom_yoga
## Number of rows             530               
## Number of columns          17                
## _______________________                      
## Column type frequency:                       
##   character                2                 
##   numeric                  15                
## ________________________                     
## Group variables            None              
## 
## ── Variable type: character ────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min  max empty n_unique whitespace
## 1 GEOID                 0             1  11   11     0      530          0
## 2 geometry              0             1 193 3750     0      530          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##    skim_variable    n_missing complete_rate      mean        sd    p0     p25
##  1 hhincome                 8         0.985 81560.    49076.    13577 46780. 
##  2 race.tot                 0         1      3409.     1276.        0  2498. 
##  3 race.white               0         1      1338.     1198.        0   238. 
##  4 race.black               0         1      1625.     1546.        0   367  
##  5 trans.total              0         1      1719.      683.        0  1268  
##  6 trans.car                0         1      1316.      585.        0   927. 
##  7 trans.drovealone         0         1      1180.      537.        0   831. 
##  8 trans.carpooled          0         1       136.      134.        0    46.2
##  9 trans.pubtrans           0         1       117.      129.        0    26  
## 10 trans.bicycle            0         1         7.84     21.9       0     0  
## 11 trans.walk               0         1        38.7      84.3       0     0  
## 12 trans.WfH                0         1       203.      161.        0    83  
## 13 Med_HHExp                0         1      1342.      489.        0  1004. 
## 14 med_RETaxes             41         0.923  3246.     2339.      199  1397  
## 15 count                    0         1         0.304     0.715     0     0  
##        p50     p75   p100 hist 
##  1 69623   100894. 250001 ▇▇▃▁▁
##  2  3267     4243    7401 ▁▇▇▃▁
##  3  1062.    2146    5039 ▇▅▃▁▁
##  4   996.    2739.   7021 ▇▂▂▁▁
##  5  1654.    2140.   4114 ▂▇▇▂▁
##  6  1250     1690.   3342 ▂▇▆▂▁
##  7  1121     1526.   3164 ▃▇▆▂▁
##  8    99      191.   1109 ▇▂▁▁▁
##  9    87.5    166    1194 ▇▁▁▁▁
## 10     0        0     220 ▇▁▁▁▁
## 11     8.5     38     788 ▇▁▁▁▁
## 12   165      293.    856 ▇▅▂▁▁
## 13  1330     1680    2877 ▁▆▇▅▁
## 14  2607     4500   10001 ▇▅▃▂▁
## 15     0        0       6 ▇▁▁▁▁

## $character
## 
## ── Variable type: character ────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min  max empty n_unique whitespace
## 1 GEOID                 0             1  11   11     0      530          0
## 2 geometry              0             1 193 3750     0      530          0
## 
## $numeric
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##    skim_vari…¹ n_mis…² compl…³    mean      sd    p0    p25    p50    p75   p100
##  1 hhincome          8   0.985 8.16e+4 4.91e+4 13577 4.68e4 6.96e4 1.01e5 250001
##  2 race.tot          0   1     3.41e+3 1.28e+3     0 2.50e3 3.27e3 4.24e3   7401
##  3 race.white        0   1     1.34e+3 1.20e+3     0 2.38e2 1.06e3 2.15e3   5039
##  4 race.black        0   1     1.62e+3 1.55e+3     0 3.67e2 9.96e2 2.74e3   7021
##  5 trans.total       0   1     1.72e+3 6.83e+2     0 1.27e3 1.65e3 2.14e3   4114
##  6 trans.car         0   1     1.32e+3 5.85e+2     0 9.27e2 1.25e3 1.69e3   3342
##  7 trans.drov…       0   1     1.18e+3 5.37e+2     0 8.31e2 1.12e3 1.53e3   3164
##  8 trans.carp…       0   1     1.36e+2 1.34e+2     0 4.62e1 9.9 e1 1.91e2   1109
##  9 trans.pubt…       0   1     1.17e+2 1.29e+2     0 2.6 e1 8.75e1 1.66e2   1194
## 10 trans.bicy…       0   1     7.84e+0 2.19e+1     0 0      0      0         220
## 11 trans.walk        0   1     3.87e+1 8.43e+1     0 0      8.5 e0 3.8 e1    788
## 12 trans.WfH         0   1     2.03e+2 1.61e+2     0 8.3 e1 1.65e2 2.93e2    856
## 13 Med_HHExp         0   1     1.34e+3 4.89e+2     0 1.00e3 1.33e3 1.68e3   2877
## 14 med_RETaxes      41   0.923 3.25e+3 2.34e+3   199 1.40e3 2.61e3 4.5 e3  10001
## 15 count             0   1     3.04e-1 7.15e-1     0 0      0      0           6
## # … with 1 more variable: hist <chr>, and abbreviated variable names
## #   ¹skim_variable, ²n_missing, ³complete_rate
## # ℹ Use `colnames()` to see all variable names

# Dropping the missing values
yoga_census_dropnaHH2 <- FD_tract_Geom_yoga[!is.na(FD_tract_Geom_yoga$hhincome),]

# Just to check whether the 8 NAs have been dropped from hhincome
print(skim(yoga_census_dropnaHH2))

## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.

## ── Data Summary ────────────────────────
##                            Values               
## Name                       yoga_census_dropnaHH2
## Number of rows             522                  
## Number of columns          17                   
## _______________________                         
## Column type frequency:                          
##   character                2                    
##   numeric                  15                   
## ________________________                        
## Group variables            None                 
## 
## ── Variable type: character ────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min  max empty n_unique whitespace
## 1 GEOID                 0             1  11   11     0      522          0
## 2 geometry              0             1 194 3750     0      522          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##    skim_variable    n_missing complete_rate      mean        sd    p0     p25
##  1 hhincome                 0         1     81560.    49076.    13577 46780. 
##  2 race.tot                 0         1      3434.     1263.      176  2548. 
##  3 race.white               0         1      1349.     1200.        0   240. 
##  4 race.black               0         1      1634.     1553.        0   367  
##  5 trans.total              0         1      1738.      668.       54  1274. 
##  6 trans.car                0         1      1332.      575.       54   944. 
##  7 trans.drovealone         0         1      1194.      528.       54   834. 
##  8 trans.carpooled          0         1       138.      135.        0    47.2
##  9 trans.pubtrans           0         1       119.      129.        0    26.2
## 10 trans.bicycle            0         1         7.89     22.0       0     0  
## 11 trans.walk               0         1        38.4      83.9       0     0  
## 12 trans.WfH                0         1       206.      161.        0    86  
## 13 Med_HHExp                0         1      1352.      478.      146  1011. 
## 14 med_RETaxes             37         0.929  3249.     2338.      199  1397  
## 15 count                    0         1         0.305     0.718     0     0  
##        p50     p75   p100 hist 
##  1 69623   100894. 250001 ▇▇▃▁▁
##  2  3304.    4252    7401 ▁▇▇▃▁
##  3  1094.    2169.   5039 ▇▅▃▁▁
##  4   994.    2765.   7021 ▇▂▂▁▁
##  5  1663     2149.   4114 ▂▇▇▂▁
##  6  1256     1710.   3342 ▂▇▆▂▁
##  7  1130.    1528.   3164 ▃▇▆▂▁
##  8   102.     193.   1109 ▇▂▁▁▁
##  9    88.5    168.   1194 ▇▁▁▁▁
## 10     0        0     220 ▇▁▁▁▁
## 11     9       38     788 ▇▁▁▁▁
## 12   169      293     856 ▇▅▂▁▁
## 13  1332     1684.   2877 ▂▇▇▅▁
## 14  2607     4500   10001 ▇▅▃▂▁
## 15     0        0       6 ▇▁▁▁▁

## $character
## 
## ── Variable type: character ────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min  max empty n_unique whitespace
## 1 GEOID                 0             1  11   11     0      522          0
## 2 geometry              0             1 194 3750     0      522          0
## 
## $numeric
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##    skim_vari…¹ n_mis…² compl…³    mean      sd    p0    p25    p50    p75   p100
##  1 hhincome          0   1     8.16e+4 4.91e+4 13577 4.68e4 6.96e4 1.01e5 250001
##  2 race.tot          0   1     3.43e+3 1.26e+3   176 2.55e3 3.30e3 4.25e3   7401
##  3 race.white        0   1     1.35e+3 1.20e+3     0 2.40e2 1.09e3 2.17e3   5039
##  4 race.black        0   1     1.63e+3 1.55e+3     0 3.67e2 9.94e2 2.76e3   7021
##  5 trans.total       0   1     1.74e+3 6.68e+2    54 1.27e3 1.66e3 2.15e3   4114
##  6 trans.car         0   1     1.33e+3 5.75e+2    54 9.44e2 1.26e3 1.71e3   3342
##  7 trans.drov…       0   1     1.19e+3 5.28e+2    54 8.35e2 1.13e3 1.53e3   3164
##  8 trans.carp…       0   1     1.38e+2 1.35e+2     0 4.72e1 1.01e2 1.93e2   1109
##  9 trans.pubt…       0   1     1.19e+2 1.29e+2     0 2.62e1 8.85e1 1.68e2   1194
## 10 trans.bicy…       0   1     7.89e+0 2.20e+1     0 0      0      0         220
## 11 trans.walk        0   1     3.84e+1 8.39e+1     0 0      9   e0 3.8 e1    788
## 12 trans.WfH         0   1     2.06e+2 1.61e+2     0 8.6 e1 1.69e2 2.93e2    856
## 13 Med_HHExp         0   1     1.35e+3 4.78e+2   146 1.01e3 1.33e3 1.68e3   2877
## 14 med_RETaxes      37   0.929 3.25e+3 2.34e+3   199 1.40e3 2.61e3 4.5 e3  10001
## 15 count             0   1     3.05e-1 7.18e-1     0 0      0      0           6
## # … with 1 more variable: hist <chr>, and abbreviated variable names
## #   ¹skim_variable, ²n_missing, ³complete_rate
## # ℹ Use `colnames()` to see all variable names

# What about the missing values of med_RETaxes? Let's get rid of those NAs too

y_census_dropnaHHTX <- yoga_census_dropnaHH2[!is.na(yoga_census_dropnaHH2$med_RETaxes),]
skim(y_census_dropnaHHTX)

## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.

Data summary
Name	y_census_dropnaHHTX
Number of rows	485
Number of columns	17
_______________________
Column type frequency:
character	2
numeric	15
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
GEOID	0	1	11	11	0	485	0
geometry	0	1	194	3750	0	485	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
hhincome	1	84346.51	49491.12	13577	49597	72466	104492	250001	▇▇▃▁▁
race.tot	1	3502.86	1226.89	875	2594	3376	4269	7363	▃▇▆▃▁
race.white	1	1406.58	1211.23	0	250	1227	2251	5039	▇▅▃▂▁
race.black	1	1640.36	1580.14	0	357	984	2796	7021	▇▂▂▁▁
trans.total	1	1784.75	648.34	421	1334	1697	2210	4114	▃▇▆▂▁
trans.car	1	1371.62	558.27	140	982	1300	1745	3342	▂▇▆▂▁
trans.drovealone	1	1231.82	509.93	115	861	1160	1553	3164	▂▇▆▁▁
trans.carpooled	1	139.80	134.58	0	49	105	194	1109	▇▂▁▁▁
trans.pubtrans	1	116.98	129.51	0	26	87	166	1194	▇▁▁▁▁
trans.bicycle	1	7.83	22.05	0	0	0	0	220	▇▁▁▁▁
trans.walk	1	35.96	74.50	0	0	8	38	595	▇▁▁▁▁
trans.WfH	1	215.85	161.00	0	98	185	299	856	▇▆▂▁▁
Med_HHExp	1	1383.44	464.09	246	1046	1359	1712	2877	▂▇▇▃▁
med_RETaxes	1	3249.27	2338.31	199	1397	2607	4500	10001	▇▅▃▂▁
count	1	0.31	0.72	0	0	0	0	6	▇▁▁▁▁

# To check if it is still a sf file
class(y_census_dropnaHHTX)

## [1] "sf"         "data.frame"

Now we are ready to ask some probing questions about the data

Are yoga studios in places with high median household incomes?

We could use the following ggplot command.

ggplot(y_census_dropnaHHTX, aes(x=hhincome, y=count)) +
  geom_point() +
  ylab("Number of Yoga Studios and Household Median Income in Tract")

How about we just look at a binary field that shows the presence or absence of Yoga Studios in tract?

y_census_dropnaHHTX$Yoga <- ifelse(y_census_dropnaHHTX$count>0, 1, 0)

A better way to visualize this categorical variable is a “boxplot”

boxplot(hhincome~Yoga, data=y_census_dropnaHHTX, main="Boxplot of Yoga Studios by Income", xlab="Whether Yoga Studios are present", ylab="Household median income")

# What about a similat boxplot with Real Estate Taxes?

boxplot(med_RETaxes~Yoga, data=y_census_dropnaHHTX, main="Boxplot of Yoga Studios by Real Estate Taxes", xlab="Whether Yoga Studios are present", ylab="Real Estate Taxes in tract")

Finally, let’s try a binary logistic regression

binary_Yoga1 <- glm(Yoga~hhincome, family=binomial, data=y_census_dropnaHHTX)
summary(binary_Yoga1)

## 
## Call:
## glm(formula = Yoga ~ hhincome, family = binomial, data = y_census_dropnaHHTX)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1200  -0.6891  -0.6304  -0.5736   1.9475  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.903e+00  2.223e-01  -8.558  < 2e-16 ***
## hhincome     7.065e-06  2.066e-06   3.420 0.000626 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 509.32  on 484  degrees of freedom
## Residual deviance: 497.93  on 483  degrees of freedom
## AIC: 501.93
## 
## Number of Fisher Scoring iterations: 4

binary_Yoga2 <- glm(Yoga~med_RETaxes, family=binomial, data=y_census_dropnaHHTX)
summary(binary_Yoga2)

## 
## Call:
## glm(formula = Yoga ~ med_RETaxes, family = binomial, data = y_census_dropnaHHTX)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1968  -0.6965  -0.5880  -0.5323   2.0445  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.999e+00  2.040e-01  -9.799  < 2e-16 ***
## med_RETaxes  2.044e-04  4.469e-05   4.574 4.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 509.32  on 484  degrees of freedom
## Residual deviance: 488.41  on 483  degrees of freedom
## AIC: 492.41
## 
## Number of Fisher Scoring iterations: 4

So what do you think about our initial question? Are Yoga Studios associated with gentrified neighborhoods?

As an exercise, download the data on median rents and examine if there is an association of yoga studios with median rents.

```