To answer the question we need to first determine what data we need
Since we will be only using cross-sectional data (i.e., data from one period of time) we are limited in examining dynamics. You would probably look for change in several variables over the years to do a proper analysis of this questions!
But given this example of using data to seek some preliminary answers to the questions and strengthen a hypothesis, let’s consider the following data: 1. Yoga Studios in Fulton and Dekalb Counties (already downloaded) 2. Neighborhood parameters (household income, racial make up, home values? Change in home values? Rental values?)
In this example I am hypothesizing that household income, racial mix, housing costs, and real estate taxes will have something to say about gentrification So, I will first download my data for the most recent year from Census
FD_tract <- suppressMessages(
get_acs(geography = "tract", # or "block group", "county", "state" etc.
state = "GA",
county = c("Fulton", "Dekalb"),
variables = c(hhincome = 'B19019_001',
race.tot = "B02001_001",
race.white = "B02001_002",
race.black = "B02001_003",
trans.total = "B08006_001",
trans.car = "B08006_002",
trans.drovealone = "B08006_003",
trans.carpooled = "B08006_004", # Notice that I was not interested in 005-007 (2 person/ 4 person carpool etc.)
trans.pubtrans = "B08006_008", # Did not want to download any details about the type of public transport (009-0013)
trans.bicycle = "B08006_014",
trans.walk = "B08006_015",
trans.WfH = "B08006_017",
med_housexp = "B25104_001",
med_realestate_taxes = "B25103_001"
),
year = 2020,
survey = "acs5", # American Community Survey 5-year estimate
geometry = TRUE, # returns sf objects
output = "wide") # wide vs. long
)
The data contains several redundant columns that we will not use. So, let’s subset the data to only have the columns we will use
FD_tract <- FD_tract %>%
select(GEOID,
hhincome = hhincomeE, # New name = old name
race.tot = race.totE,
race.white = race.whiteE,
race.black = race.blackE,
trans.total = trans.totalE,
trans.car = trans.carE,
trans.drovealone = trans.drovealoneE,
trans.carpooled = trans.carpooledE,
trans.pubtrans = trans.pubtransE,
trans.bicycle = trans.bicycleE,
trans.walk = trans.walkE,
trans.WfH = trans.WfHE,
Med_HHExp = med_housexpE,
med_RETaxes = med_realestate_taxesE)
Let’s display what we have!
tmap_mode("view")
## tmap mode set to interactive viewing
RE_Taxes <- tm_shape(FD_tract) + tm_polygons("med_RETaxes")
HH_exp <- tm_shape(FD_tract) + tm_polygons("Med_HHExp")
tmap_arrange(RE_Taxes, HH_exp)
Now download the Yelp data. This has been done for you (given the problems we have been having with Yelp API)
To save time in class, we will use the Yelp data that Bonwoo downloaded for today’s class.
The Yelp data can be found on Canvas > Files > yelp_in.rds. This data is for Fulton and DeKalb County, GA, and contains Yelp data with categories = “yoga”. This data is already cleaned.
To read the data into R, we will use read_rds() function in readr package. Notice that in the code below, only the name of the .rds file is specified. This works only when the file is in the same folder as your .Rmd file. If not, you will need to provide a full path to the file.
# Reading the yelp data
yelp_in <- read_rds("yelp_in.rds")
I like to constantly check my datasets to see if the results look fine
I will check the file and then display it (i.e., the point locations of Yoga Studios) on the census data
skim(yelp_in)
## Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
## provided. Falling back to `character`.
Name | yelp_in |
Number of rows | 161 |
Number of columns | 23 |
_______________________ | |
Column type frequency: | |
character | 19 |
logical | 1 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
id | 0 | 1.00 | 22 | 22 | 0 | 161 | 0 |
alias | 0 | 1.00 | 17 | 58 | 0 | 161 | 0 |
name | 0 | 1.00 | 7 | 49 | 0 | 139 | 0 |
image_url | 0 | 1.00 | 0 | 68 | 6 | 156 | 0 |
url | 0 | 1.00 | 174 | 215 | 0 | 161 | 0 |
categories | 0 | 1.00 | 4 | 47 | 0 | 68 | 0 |
transactions | 0 | 1.00 | 0 | 0 | 161 | 1 | 0 |
phone | 0 | 1.00 | 0 | 12 | 6 | 147 | 0 |
display_phone | 0 | 1.00 | 0 | 14 | 6 | 147 | 0 |
price | 155 | 0.04 | 2 | 2 | 0 | 1 | 0 |
location.address1 | 9 | 0.94 | 0 | 30 | 7 | 142 | 0 |
location.address2 | 35 | 0.78 | 0 | 11 | 52 | 59 | 0 |
location.address3 | 30 | 0.81 | 0 | 26 | 126 | 6 | 0 |
location.city | 0 | 1.00 | 6 | 19 | 0 | 21 | 0 |
location.zip_code | 0 | 1.00 | 0 | 5 | 1 | 51 | 0 |
location.country | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
location.state | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
location.display_address | 0 | 1.00 | 11 | 76 | 0 | 158 | 0 |
geometry | 0 | 1.00 | 21 | 38 | 0 | 160 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
is_closed | 0 | 1 | 0 | FAL: 161 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
review_count | 0 | 1 | 22.85 | 34.23 | 1.00 | 3.00 | 9.00 | 25.00 | 159.00 | ▇▁▁▁▁ |
rating | 0 | 1 | 4.45 | 0.77 | 2.50 | 4.00 | 5.00 | 5.00 | 5.00 | ▂▁▂▃▇ |
distance | 0 | 1 | 1943.22 | 3926.03 | 58.82 | 846.53 | 1247.88 | 1789.03 | 43017.73 | ▇▁▁▁▁ |
tm_shape(yelp_in) + tm_dots(col="red") + tm_shape(FD_tract) + tm_borders()
Appending Census data
Joining census data in this case is somewhat different from earlier. We will do the following: 1. make sure the projection (CRS) of both files are correct 2. Calculate how many yoga studios are within each census tract (there will be many tracts with 0 studios) 3. Join the two files with the count of Yoga studios
## Check to see that the CRS for both the sf files are the same
head(FD_tract)
## Simple feature collection with 6 features and 15 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -84.47101 ymin: 33.65339 xmax: -84.18201 ymax: 33.81229
## Geodetic CRS: NAD83
## GEOID hhincome race.tot race.white race.black trans.total trans.car
## 1 13089021908 43191 4829 323 3368 2421 2137
## 2 13089020400 123711 2743 2435 48 1691 1202
## 3 13089023410 41065 3903 131 3562 2030 1562
## 4 13121010601 55479 3673 1057 2514 1699 1290
## 5 13121008302 28214 1891 33 1836 481 401
## 6 13121004200 23906 2675 252 2423 1073 487
## trans.drovealone trans.carpooled trans.pubtrans trans.bicycle trans.walk
## 1 1759 378 86 0 101
## 2 1069 133 63 62 13
## 3 1339 223 307 0 87
## 4 1131 159 154 0 113
## 5 347 54 57 0 8
## 6 437 50 460 19 0
## trans.WfH Med_HHExp med_RETaxes geometry
## 1 97 1994 2142 MULTIPOLYGON (((-84.20619 3...
## 2 300 1219 4885 MULTIPOLYGON (((-84.34921 3...
## 3 55 1564 1067 MULTIPOLYGON (((-84.29784 3...
## 4 116 1712 2527 MULTIPOLYGON (((-84.46957 3...
## 5 0 638 604 MULTIPOLYGON (((-84.46315 3...
## 6 106 1470 1652 MULTIPOLYGON (((-84.42334 3...
head(yelp_in)
## Simple feature collection with 6 features and 22 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -84.34933 ymin: 33.76259 xmax: -84.19098 ymax: 33.85257
## Geodetic CRS: WGS 84
## # A tibble: 6 × 23
## id alias name image…¹ is_cl…² url revie…³ categ…⁴ rating trans…⁵ phone
## <chr> <chr> <chr> <chr> <lgl> <chr> <int> <chr> <dbl> <chr> <chr>
## 1 NdD6VW… yoga… Yoga… https:… FALSE http… 1 Yoga, … 5 "" +147…
## 2 k5wcy4… spa-… Spa … https:… FALSE http… 11 Massag… 4.5 "" +140…
## 3 fdK7PX… will… Will… https:… FALSE http… 8 Yoga, … 4.5 "" +140…
## 4 scTOiG… the-… The … https:… FALSE http… 1 Perfor… 5 "" +167…
## 5 PuGXt2… toug… Toug… https:… FALSE http… 42 Traine… 5 "" +140…
## 6 q9gP-C… boho… Boho… https:… FALSE http… 9 Yoga 5 "" +140…
## # … with 12 more variables: display_phone <chr>, distance <dbl>, price <chr>,
## # location.address1 <chr>, location.address2 <chr>, location.address3 <chr>,
## # location.city <chr>, location.zip_code <chr>, location.country <chr>,
## # location.state <chr>, location.display_address <chr>, geometry <POINT [°]>,
## # and abbreviated variable names ¹image_url, ²is_closed, ³review_count,
## # ⁴categories, ⁵transactions
## # ℹ Use `colnames()` to see all variable names
## It turns out they are NOT! So we need to transform their CRS into the same
FD_tract_Geom <- st_transform(FD_tract, crs=4326)
Yelp_in_Geom <- st_transform(yelp_in, crs=4326)
Yoga_in_tract <- st_join(FD_tract_Geom, Yelp_in_Geom, join = st_intersects)
skim(Yoga_in_tract)
## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.
Name | Yoga_in_tract |
Number of rows | 578 |
Number of columns | 38 |
_______________________ | |
Column type frequency: | |
character | 20 |
logical | 1 |
numeric | 17 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
GEOID | 0 | 1.00 | 11 | 11 | 0 | 530 | 0 |
id | 417 | 0.28 | 22 | 22 | 0 | 161 | 0 |
alias | 417 | 0.28 | 17 | 58 | 0 | 161 | 0 |
name | 417 | 0.28 | 7 | 49 | 0 | 139 | 0 |
image_url | 417 | 0.28 | 0 | 68 | 6 | 156 | 0 |
url | 417 | 0.28 | 174 | 215 | 0 | 161 | 0 |
categories | 417 | 0.28 | 4 | 47 | 0 | 68 | 0 |
transactions | 417 | 0.28 | 0 | 0 | 161 | 1 | 0 |
phone | 417 | 0.28 | 0 | 12 | 6 | 147 | 0 |
display_phone | 417 | 0.28 | 0 | 14 | 6 | 147 | 0 |
price | 572 | 0.01 | 2 | 2 | 0 | 1 | 0 |
location.address1 | 426 | 0.26 | 0 | 30 | 7 | 142 | 0 |
location.address2 | 452 | 0.22 | 0 | 11 | 52 | 59 | 0 |
location.address3 | 447 | 0.23 | 0 | 26 | 126 | 6 | 0 |
location.city | 417 | 0.28 | 6 | 19 | 0 | 21 | 0 |
location.zip_code | 417 | 0.28 | 0 | 5 | 1 | 51 | 0 |
location.country | 417 | 0.28 | 2 | 2 | 0 | 1 | 0 |
location.state | 417 | 0.28 | 2 | 2 | 0 | 1 | 0 |
location.display_address | 417 | 0.28 | 11 | 76 | 0 | 158 | 0 |
geometry | 0 | 1.00 | 193 | 3750 | 0 | 530 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
is_closed | 417 | 0.28 | 0 | FAL: 161 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
hhincome | 8 | 0.99 | 83324.83 | 48534.12 | 13577.00 | 48668.25 | 73387.50 | 101281.75 | 250001.00 | ▇▇▃▁▁ |
race.tot | 0 | 1.00 | 3435.28 | 1269.31 | 0.00 | 2498.25 | 3314.00 | 4260.50 | 7401.00 | ▁▇▇▃▁ |
race.white | 0 | 1.00 | 1442.49 | 1235.59 | 0.00 | 257.00 | 1264.50 | 2311.75 | 5039.00 | ▇▅▃▂▁ |
race.black | 0 | 1.00 | 1546.21 | 1517.43 | 0.00 | 357.50 | 903.50 | 2609.25 | 7021.00 | ▇▂▂▁▁ |
trans.total | 0 | 1.00 | 1749.09 | 686.47 | 0.00 | 1274.00 | 1681.00 | 2213.00 | 4114.00 | ▂▇▇▃▁ |
trans.car | 0 | 1.00 | 1332.79 | 583.89 | 0.00 | 943.00 | 1257.50 | 1720.00 | 3342.00 | ▂▇▆▂▁ |
trans.drovealone | 0 | 1.00 | 1199.21 | 538.29 | 0.00 | 839.00 | 1132.50 | 1531.50 | 3164.00 | ▃▇▆▂▁ |
trans.carpooled | 0 | 1.00 | 133.58 | 130.37 | 0.00 | 48.00 | 98.50 | 181.75 | 1109.00 | ▇▂▁▁▁ |
trans.pubtrans | 0 | 1.00 | 116.22 | 125.95 | 0.00 | 25.25 | 88.50 | 166.00 | 1194.00 | ▇▁▁▁▁ |
trans.bicycle | 0 | 1.00 | 9.56 | 25.71 | 0.00 | 0.00 | 0.00 | 0.00 | 220.00 | ▇▁▁▁▁ |
trans.walk | 0 | 1.00 | 44.05 | 92.15 | 0.00 | 0.00 | 12.00 | 41.00 | 788.00 | ▇▁▁▁▁ |
trans.WfH | 0 | 1.00 | 211.10 | 160.63 | 0.00 | 90.00 | 184.00 | 299.75 | 856.00 | ▇▆▂▁▁ |
Med_HHExp | 0 | 1.00 | 1368.57 | 494.20 | 0.00 | 1020.00 | 1350.00 | 1714.00 | 2877.00 | ▁▆▇▅▁ |
med_RETaxes | 44 | 0.92 | 3389.94 | 2371.97 | 199.00 | 1464.75 | 2854.50 | 4696.50 | 10001.00 | ▇▆▃▂▁ |
review_count | 417 | 0.28 | 22.85 | 34.23 | 1.00 | 3.00 | 9.00 | 25.00 | 159.00 | ▇▁▁▁▁ |
rating | 417 | 0.28 | 4.45 | 0.77 | 2.50 | 4.00 | 5.00 | 5.00 | 5.00 | ▂▁▂▃▇ |
distance | 417 | 0.28 | 1943.22 | 3926.03 | 58.82 | 846.53 | 1247.88 | 1789.03 | 43017.73 | ▇▁▁▁▁ |
# Now count the Yoga Studios by tract
yoga_count_tract <- count(as_tibble(Yoga_in_tract), GEOID) %>%
print()
## # A tibble: 530 × 2
## GEOID n
## <chr> <int>
## 1 13089020100 1
## 2 13089020200 1
## 3 13089020300 4
## 4 13089020400 1
## 5 13089020500 1
## 6 13089020600 1
## 7 13089020700 1
## 8 13089020801 1
## 9 13089020802 1
## 10 13089020901 2
## # … with 520 more rows
## # ℹ Use `print(n = ...)` to see more rows
# Join tract geometry with the number of Yoga studios in tract
test <- st_join(FD_tract_Geom, Yelp_in_Geom %>% mutate(count = 1))
out <- test %>%
group_by(GEOID) %>%
summarise(count = sum(count, na.rm = T))
# Lets' check to see if the polygons and the poin data on Yoga Studios match
tm_shape(out) + tm_polygons(col = "count") + tm_shape(Yelp_in_Geom) + tm_dots()
## OK Now we are ready to join back the counts of Yoga Studios to the Tract data
FD_tract_Geom_yoga <- FD_tract_Geom %>%
left_join(out %>% st_set_geometry(NULL), by = "GEOID")
Make sure that the final dataset we will be using for our analysis is as expected
skim(FD_tract_Geom_yoga)
## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.
Name | FD_tract_Geom_yoga |
Number of rows | 530 |
Number of columns | 17 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 15 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
GEOID | 0 | 1 | 11 | 11 | 0 | 530 | 0 |
geometry | 0 | 1 | 193 | 3750 | 0 | 530 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
hhincome | 8 | 0.98 | 81559.87 | 49076.11 | 13577 | 46780.50 | 69623.0 | 100893.50 | 250001 | ▇▇▃▁▁ |
race.tot | 0 | 1.00 | 3409.13 | 1276.00 | 0 | 2498.25 | 3267.0 | 4243.00 | 7401 | ▁▇▇▃▁ |
race.white | 0 | 1.00 | 1337.87 | 1198.31 | 0 | 237.75 | 1062.5 | 2146.00 | 5039 | ▇▅▃▁▁ |
race.black | 0 | 1.00 | 1624.68 | 1545.94 | 0 | 367.00 | 996.5 | 2739.25 | 7021 | ▇▂▂▁▁ |
trans.total | 0 | 1.00 | 1718.82 | 683.32 | 0 | 1268.00 | 1653.5 | 2140.25 | 4114 | ▂▇▇▂▁ |
trans.car | 0 | 1.00 | 1315.87 | 585.48 | 0 | 926.75 | 1250.0 | 1690.50 | 3342 | ▂▇▆▂▁ |
trans.drovealone | 0 | 1.00 | 1179.78 | 536.94 | 0 | 831.25 | 1121.0 | 1525.75 | 3164 | ▃▇▆▂▁ |
trans.carpooled | 0 | 1.00 | 136.09 | 134.39 | 0 | 46.25 | 99.0 | 191.25 | 1109 | ▇▂▁▁▁ |
trans.pubtrans | 0 | 1.00 | 117.38 | 128.89 | 0 | 26.00 | 87.5 | 166.00 | 1194 | ▇▁▁▁▁ |
trans.bicycle | 0 | 1.00 | 7.84 | 21.88 | 0 | 0.00 | 0.0 | 0.00 | 220 | ▇▁▁▁▁ |
trans.walk | 0 | 1.00 | 38.72 | 84.26 | 0 | 0.00 | 8.5 | 38.00 | 788 | ▇▁▁▁▁ |
trans.WfH | 0 | 1.00 | 203.40 | 161.01 | 0 | 83.00 | 165.0 | 292.75 | 856 | ▇▅▂▁▁ |
Med_HHExp | 0 | 1.00 | 1341.85 | 488.96 | 0 | 1003.75 | 1330.0 | 1680.00 | 2877 | ▁▆▇▅▁ |
med_RETaxes | 41 | 0.92 | 3245.91 | 2338.90 | 199 | 1397.00 | 2607.0 | 4500.00 | 10001 | ▇▅▃▂▁ |
count | 0 | 1.00 | 0.30 | 0.71 | 0 | 0.00 | 0.0 | 0.00 | 6 | ▇▁▁▁▁ |
tm_shape(FD_tract_Geom_yoga) + tm_polygons(col="count") +tm_shape(Yelp_in_Geom) +tm_dots()
Let’s clean up the data and drop the “NA”s for important variables like househiold income and real estate taxes
## Now we have the file to work with (I hope!) Let's check again
print(skim(FD_tract_Geom_yoga))
## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.
## ── Data Summary ────────────────────────
## Values
## Name FD_tract_Geom_yoga
## Number of rows 530
## Number of columns 17
## _______________________
## Column type frequency:
## character 2
## numeric 15
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 GEOID 0 1 11 11 0 530 0
## 2 geometry 0 1 193 3750 0 530 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25
## 1 hhincome 8 0.985 81560. 49076. 13577 46780.
## 2 race.tot 0 1 3409. 1276. 0 2498.
## 3 race.white 0 1 1338. 1198. 0 238.
## 4 race.black 0 1 1625. 1546. 0 367
## 5 trans.total 0 1 1719. 683. 0 1268
## 6 trans.car 0 1 1316. 585. 0 927.
## 7 trans.drovealone 0 1 1180. 537. 0 831.
## 8 trans.carpooled 0 1 136. 134. 0 46.2
## 9 trans.pubtrans 0 1 117. 129. 0 26
## 10 trans.bicycle 0 1 7.84 21.9 0 0
## 11 trans.walk 0 1 38.7 84.3 0 0
## 12 trans.WfH 0 1 203. 161. 0 83
## 13 Med_HHExp 0 1 1342. 489. 0 1004.
## 14 med_RETaxes 41 0.923 3246. 2339. 199 1397
## 15 count 0 1 0.304 0.715 0 0
## p50 p75 p100 hist
## 1 69623 100894. 250001 ▇▇▃▁▁
## 2 3267 4243 7401 ▁▇▇▃▁
## 3 1062. 2146 5039 ▇▅▃▁▁
## 4 996. 2739. 7021 ▇▂▂▁▁
## 5 1654. 2140. 4114 ▂▇▇▂▁
## 6 1250 1690. 3342 ▂▇▆▂▁
## 7 1121 1526. 3164 ▃▇▆▂▁
## 8 99 191. 1109 ▇▂▁▁▁
## 9 87.5 166 1194 ▇▁▁▁▁
## 10 0 0 220 ▇▁▁▁▁
## 11 8.5 38 788 ▇▁▁▁▁
## 12 165 293. 856 ▇▅▂▁▁
## 13 1330 1680 2877 ▁▆▇▅▁
## 14 2607 4500 10001 ▇▅▃▂▁
## 15 0 0 6 ▇▁▁▁▁
## $character
##
## ── Variable type: character ────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 GEOID 0 1 11 11 0 530 0
## 2 geometry 0 1 193 3750 0 530 0
##
## $numeric
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_vari…¹ n_mis…² compl…³ mean sd p0 p25 p50 p75 p100
## 1 hhincome 8 0.985 8.16e+4 4.91e+4 13577 4.68e4 6.96e4 1.01e5 250001
## 2 race.tot 0 1 3.41e+3 1.28e+3 0 2.50e3 3.27e3 4.24e3 7401
## 3 race.white 0 1 1.34e+3 1.20e+3 0 2.38e2 1.06e3 2.15e3 5039
## 4 race.black 0 1 1.62e+3 1.55e+3 0 3.67e2 9.96e2 2.74e3 7021
## 5 trans.total 0 1 1.72e+3 6.83e+2 0 1.27e3 1.65e3 2.14e3 4114
## 6 trans.car 0 1 1.32e+3 5.85e+2 0 9.27e2 1.25e3 1.69e3 3342
## 7 trans.drov… 0 1 1.18e+3 5.37e+2 0 8.31e2 1.12e3 1.53e3 3164
## 8 trans.carp… 0 1 1.36e+2 1.34e+2 0 4.62e1 9.9 e1 1.91e2 1109
## 9 trans.pubt… 0 1 1.17e+2 1.29e+2 0 2.6 e1 8.75e1 1.66e2 1194
## 10 trans.bicy… 0 1 7.84e+0 2.19e+1 0 0 0 0 220
## 11 trans.walk 0 1 3.87e+1 8.43e+1 0 0 8.5 e0 3.8 e1 788
## 12 trans.WfH 0 1 2.03e+2 1.61e+2 0 8.3 e1 1.65e2 2.93e2 856
## 13 Med_HHExp 0 1 1.34e+3 4.89e+2 0 1.00e3 1.33e3 1.68e3 2877
## 14 med_RETaxes 41 0.923 3.25e+3 2.34e+3 199 1.40e3 2.61e3 4.5 e3 10001
## 15 count 0 1 3.04e-1 7.15e-1 0 0 0 0 6
## # … with 1 more variable: hist <chr>, and abbreviated variable names
## # ¹skim_variable, ²n_missing, ³complete_rate
## # ℹ Use `colnames()` to see all variable names
# Dropping the missing values
yoga_census_dropnaHH2 <- FD_tract_Geom_yoga[!is.na(FD_tract_Geom_yoga$hhincome),]
# Just to check whether the 8 NAs have been dropped from hhincome
print(skim(yoga_census_dropnaHH2))
## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.
## ── Data Summary ────────────────────────
## Values
## Name yoga_census_dropnaHH2
## Number of rows 522
## Number of columns 17
## _______________________
## Column type frequency:
## character 2
## numeric 15
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 GEOID 0 1 11 11 0 522 0
## 2 geometry 0 1 194 3750 0 522 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25
## 1 hhincome 0 1 81560. 49076. 13577 46780.
## 2 race.tot 0 1 3434. 1263. 176 2548.
## 3 race.white 0 1 1349. 1200. 0 240.
## 4 race.black 0 1 1634. 1553. 0 367
## 5 trans.total 0 1 1738. 668. 54 1274.
## 6 trans.car 0 1 1332. 575. 54 944.
## 7 trans.drovealone 0 1 1194. 528. 54 834.
## 8 trans.carpooled 0 1 138. 135. 0 47.2
## 9 trans.pubtrans 0 1 119. 129. 0 26.2
## 10 trans.bicycle 0 1 7.89 22.0 0 0
## 11 trans.walk 0 1 38.4 83.9 0 0
## 12 trans.WfH 0 1 206. 161. 0 86
## 13 Med_HHExp 0 1 1352. 478. 146 1011.
## 14 med_RETaxes 37 0.929 3249. 2338. 199 1397
## 15 count 0 1 0.305 0.718 0 0
## p50 p75 p100 hist
## 1 69623 100894. 250001 ▇▇▃▁▁
## 2 3304. 4252 7401 ▁▇▇▃▁
## 3 1094. 2169. 5039 ▇▅▃▁▁
## 4 994. 2765. 7021 ▇▂▂▁▁
## 5 1663 2149. 4114 ▂▇▇▂▁
## 6 1256 1710. 3342 ▂▇▆▂▁
## 7 1130. 1528. 3164 ▃▇▆▂▁
## 8 102. 193. 1109 ▇▂▁▁▁
## 9 88.5 168. 1194 ▇▁▁▁▁
## 10 0 0 220 ▇▁▁▁▁
## 11 9 38 788 ▇▁▁▁▁
## 12 169 293 856 ▇▅▂▁▁
## 13 1332 1684. 2877 ▂▇▇▅▁
## 14 2607 4500 10001 ▇▅▃▂▁
## 15 0 0 6 ▇▁▁▁▁
## $character
##
## ── Variable type: character ────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 GEOID 0 1 11 11 0 522 0
## 2 geometry 0 1 194 3750 0 522 0
##
## $numeric
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_vari…¹ n_mis…² compl…³ mean sd p0 p25 p50 p75 p100
## 1 hhincome 0 1 8.16e+4 4.91e+4 13577 4.68e4 6.96e4 1.01e5 250001
## 2 race.tot 0 1 3.43e+3 1.26e+3 176 2.55e3 3.30e3 4.25e3 7401
## 3 race.white 0 1 1.35e+3 1.20e+3 0 2.40e2 1.09e3 2.17e3 5039
## 4 race.black 0 1 1.63e+3 1.55e+3 0 3.67e2 9.94e2 2.76e3 7021
## 5 trans.total 0 1 1.74e+3 6.68e+2 54 1.27e3 1.66e3 2.15e3 4114
## 6 trans.car 0 1 1.33e+3 5.75e+2 54 9.44e2 1.26e3 1.71e3 3342
## 7 trans.drov… 0 1 1.19e+3 5.28e+2 54 8.35e2 1.13e3 1.53e3 3164
## 8 trans.carp… 0 1 1.38e+2 1.35e+2 0 4.72e1 1.01e2 1.93e2 1109
## 9 trans.pubt… 0 1 1.19e+2 1.29e+2 0 2.62e1 8.85e1 1.68e2 1194
## 10 trans.bicy… 0 1 7.89e+0 2.20e+1 0 0 0 0 220
## 11 trans.walk 0 1 3.84e+1 8.39e+1 0 0 9 e0 3.8 e1 788
## 12 trans.WfH 0 1 2.06e+2 1.61e+2 0 8.6 e1 1.69e2 2.93e2 856
## 13 Med_HHExp 0 1 1.35e+3 4.78e+2 146 1.01e3 1.33e3 1.68e3 2877
## 14 med_RETaxes 37 0.929 3.25e+3 2.34e+3 199 1.40e3 2.61e3 4.5 e3 10001
## 15 count 0 1 3.05e-1 7.18e-1 0 0 0 0 6
## # … with 1 more variable: hist <chr>, and abbreviated variable names
## # ¹skim_variable, ²n_missing, ³complete_rate
## # ℹ Use `colnames()` to see all variable names
# What about the missing values of med_RETaxes? Let's get rid of those NAs too
y_census_dropnaHHTX <- yoga_census_dropnaHH2[!is.na(yoga_census_dropnaHH2$med_RETaxes),]
skim(y_census_dropnaHHTX)
## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.
Name | y_census_dropnaHHTX |
Number of rows | 485 |
Number of columns | 17 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 15 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
GEOID | 0 | 1 | 11 | 11 | 0 | 485 | 0 |
geometry | 0 | 1 | 194 | 3750 | 0 | 485 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
hhincome | 0 | 1 | 84346.51 | 49491.12 | 13577 | 49597 | 72466 | 104492 | 250001 | ▇▇▃▁▁ |
race.tot | 0 | 1 | 3502.86 | 1226.89 | 875 | 2594 | 3376 | 4269 | 7363 | ▃▇▆▃▁ |
race.white | 0 | 1 | 1406.58 | 1211.23 | 0 | 250 | 1227 | 2251 | 5039 | ▇▅▃▂▁ |
race.black | 0 | 1 | 1640.36 | 1580.14 | 0 | 357 | 984 | 2796 | 7021 | ▇▂▂▁▁ |
trans.total | 0 | 1 | 1784.75 | 648.34 | 421 | 1334 | 1697 | 2210 | 4114 | ▃▇▆▂▁ |
trans.car | 0 | 1 | 1371.62 | 558.27 | 140 | 982 | 1300 | 1745 | 3342 | ▂▇▆▂▁ |
trans.drovealone | 0 | 1 | 1231.82 | 509.93 | 115 | 861 | 1160 | 1553 | 3164 | ▂▇▆▁▁ |
trans.carpooled | 0 | 1 | 139.80 | 134.58 | 0 | 49 | 105 | 194 | 1109 | ▇▂▁▁▁ |
trans.pubtrans | 0 | 1 | 116.98 | 129.51 | 0 | 26 | 87 | 166 | 1194 | ▇▁▁▁▁ |
trans.bicycle | 0 | 1 | 7.83 | 22.05 | 0 | 0 | 0 | 0 | 220 | ▇▁▁▁▁ |
trans.walk | 0 | 1 | 35.96 | 74.50 | 0 | 0 | 8 | 38 | 595 | ▇▁▁▁▁ |
trans.WfH | 0 | 1 | 215.85 | 161.00 | 0 | 98 | 185 | 299 | 856 | ▇▆▂▁▁ |
Med_HHExp | 0 | 1 | 1383.44 | 464.09 | 246 | 1046 | 1359 | 1712 | 2877 | ▂▇▇▃▁ |
med_RETaxes | 0 | 1 | 3249.27 | 2338.31 | 199 | 1397 | 2607 | 4500 | 10001 | ▇▅▃▂▁ |
count | 0 | 1 | 0.31 | 0.72 | 0 | 0 | 0 | 0 | 6 | ▇▁▁▁▁ |
# To check if it is still a sf file
class(y_census_dropnaHHTX)
## [1] "sf" "data.frame"
Now we are ready to ask some probing questions about the data
- Are yoga studios in places with high median household incomes?
We could use the following ggplot command.
ggplot(y_census_dropnaHHTX, aes(x=hhincome, y=count)) +
geom_point() +
ylab("Number of Yoga Studios and Household Median Income in Tract")
How about we just look at a binary field that shows the presence or absence of Yoga Studios in tract?
y_census_dropnaHHTX$Yoga <- ifelse(y_census_dropnaHHTX$count>0, 1, 0)
A better way to visualize this categorical variable is a “boxplot”
boxplot(hhincome~Yoga, data=y_census_dropnaHHTX, main="Boxplot of Yoga Studios by Income", xlab="Whether Yoga Studios are present", ylab="Household median income")
# What about a similat boxplot with Real Estate Taxes?
boxplot(med_RETaxes~Yoga, data=y_census_dropnaHHTX, main="Boxplot of Yoga Studios by Real Estate Taxes", xlab="Whether Yoga Studios are present", ylab="Real Estate Taxes in tract")
Finally, let’s try a binary logistic regression
binary_Yoga1 <- glm(Yoga~hhincome, family=binomial, data=y_census_dropnaHHTX)
summary(binary_Yoga1)
##
## Call:
## glm(formula = Yoga ~ hhincome, family = binomial, data = y_census_dropnaHHTX)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1200 -0.6891 -0.6304 -0.5736 1.9475
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.903e+00 2.223e-01 -8.558 < 2e-16 ***
## hhincome 7.065e-06 2.066e-06 3.420 0.000626 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 509.32 on 484 degrees of freedom
## Residual deviance: 497.93 on 483 degrees of freedom
## AIC: 501.93
##
## Number of Fisher Scoring iterations: 4
binary_Yoga2 <- glm(Yoga~med_RETaxes, family=binomial, data=y_census_dropnaHHTX)
summary(binary_Yoga2)
##
## Call:
## glm(formula = Yoga ~ med_RETaxes, family = binomial, data = y_census_dropnaHHTX)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1968 -0.6965 -0.5880 -0.5323 2.0445
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.999e+00 2.040e-01 -9.799 < 2e-16 ***
## med_RETaxes 2.044e-04 4.469e-05 4.574 4.78e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 509.32 on 484 degrees of freedom
## Residual deviance: 488.41 on 483 degrees of freedom
## AIC: 492.41
##
## Number of Fisher Scoring iterations: 4
So what do you think about our initial question? Are Yoga Studios associated with gentrified neighborhoods?
As an exercise, download the data on median rents and examine if there is an association of yoga studios with median rents.
```