Data for Urban Analytics

class: middle, inverse

# Data for Urban Analytics

.font100[
Bon Woo Koo & Subhro Guhathakurta

8/30/2022
]

---

## Data for Urban Analytics

**Big data is not just about size.** If size were all that made data "big", there would be nothing new about big data; we have had it for a long time. For example,

* National Decennial Census .footnotesize[.gray[(which technically covers the entire population; hundreds of millions)]]
* Survey responses for electoral predictions 
 .footnotesize[.gray[(e.g., the famous 2.4 million survey responses collected by the Literary Digest in 1936, which failed miserably despite the size)]]

Also, **size per observation** can be very different (e.g., text vs. image).

.center[
### The characteristics of modern data that gave rise to urban analytics are more complex.
]
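---
## Size per observation: a rough comparison

The text vs. image contrast can be made concrete with back-of-the-envelope arithmetic. This is a minimal sketch: the example sentence and the 640 x 640 image dimensions are illustrative assumptions, and real images are usually stored compressed, so actual file sizes will be smaller.

```python
# Rough, uncompressed size of one observation for two data types.
# The example sentence and the image dimensions are illustrative assumptions.

text_observation = "Traffic is backed up near the stadium again."
text_bytes = len(text_observation.encode("utf-8"))

# An uncompressed RGB image stores 3 bytes (red, green, blue) per pixel.
width, height, channels = 640, 640, 3
image_bytes = width * height * channels

ratio = image_bytes / text_bytes
print(f"text:  {text_bytes:,} bytes")
print(f"image: {image_bytes:,} bytes")
print(f"one image is about {ratio:,.0f} times larger than one text record")
```

The same number of observations can therefore imply very different storage and processing requirements.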
---
## Characteristics of Big Data

Kitchin (2014) details that big data can be characterized by:

.footnotesize[
* **Volume**, consisting of terabytes or petabytes of data

* **Velocity**, being created in or near real-time .red[**(Key trait)**]

* **Variety**, being structured and unstructured in nature
]

--
.footnotesize[
And goes on to include:

* **Exhaustive** in scope, striving to capture entire populations or systems (n=all) .red[**(Key trait)**]

* Fine-grained in **resolution**, aiming to be as detailed as possible

* **Relational** in nature, containing common fields that can join different data sets

* **Flexible** (can add new fields easily) and **scalable** (can expand in size rapidly)
]

.center[These characteristics affect the choice of analytical methods.]

???
**Volume**: Walmart generates 2.5 petabytes of transaction data **every hour** (Kitchin & McArdle, 2014).  
**Velocity**: Note, however, that frequency of **generation** != **publishing** (Kitchin & McArdle, 2014).  
**Variety**: Unstructured data were used in the past, but **not at the scale we use them now**.

**Exhaustivity**: Twitter data cover the entire population of Twitter users. The definition of "all" can be relative (school vs. student). This is another key trait of big data.

---
## Sources of Big Data

* **Directed** &rarr; "Generated by traditional forms of surveillance, wherein the gaze of the technology is controlled by a ***human operator***" (Kitchin 2014, p.4). &#x25cf; Satellite &#x25cf; CCTV &#x25cf; Google Street View images.

* **Automated** &rarr; "Generated as an inherent, ***automatic function of the device or system***" (Kitchin 2014, p.4). &#x25cf; GPS from cell phones &#x25cf; clickstream on Amazon &#x25cf; tap-in and tap-out records from public transportation systems.

* **Volunteered** &rarr; Generated by ***users***. &#x25cf; Postings on social media &#x25cf; images on Pinterest &#x25cf; reviews on Yelp &#x25cf; crowd-sourced databases such as OpenStreetMap.

---

## Some caveats

In addition to data ethics,

1. When they say "n=all", their "all" and your "all" may be different.
  * We often don't know who's over- and under-represented.
  * Even when we can check, we tend to assume "n=all".
  * Falsely assuming population representativeness can be dangerous.

2. Embedded biases in data and models are often not easily visible.  
.footnotesize[.gray[E.g., [Angry men and happy women](https://psycnet.apa.org/record/2007-00654-002)]]

3. Even with all the data, answering "why" can still be difficult.

4. Correlation != causation. 

 * Orange used cars have the best-kept engines.
 * Passengers who pre-order vegetarian meals usually make their flights.
 * Spikes in the sale of pre-paid phone cards can predict the location of impending massacres in the Congo (Hardy, 2012a, as cited in Hilbert, 2016).

???
Participants were faster and more accurate at detecting angry expressions on male faces and at detecting happy expressions on female faces.
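
---
## Checking "n=all": a sketch

When a reference population is available, the first caveat can be checked by comparing each group's share in the data with its share in that population. The sketch below is illustrative only: the function name `representation_ratios`, the age groups, and all counts are made up for this example and do not describe any real platform or census.

```python
# Compare each group's share in a data set with its share in a reference
# population. All counts below are made-up numbers for illustration.

def representation_ratios(sample_counts, population_counts):
    """Return sample share divided by population share for each group.

    A ratio above 1 means the group is over-represented in the sample;
    a ratio below 1 means it is under-represented.
    """
    sample_total = sum(sample_counts.values())
    population_total = sum(population_counts.values())
    ratios = {}
    for group, population_count in population_counts.items():
        sample_share = sample_counts.get(group, 0) / sample_total
        population_share = population_count / population_total
        ratios[group] = sample_share / population_share
    return ratios

# Hypothetical users of a platform, by age group.
platform_users = {"18-29": 500, "30-49": 350, "50-64": 120, "65+": 30}
# Hypothetical reference population (e.g., from a census), same groups.
census_adults = {"18-29": 200, "30-49": 350, "50-64": 250, "65+": 200}

ratios = representation_ratios(platform_users, census_adults)
for group, ratio in ratios.items():
    print(f"{group}: {ratio:.2f}")
```

With these made-up counts, the youngest group is over-represented and the oldest group is heavily under-represented, even though the platform data technically contain "all" of its users.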

---
## Are we abandoning small data?

* **Certainly not.** There are many caveats in Big Data and we are still learning what those are.

* Small data studies will continue to be valuable because of their utility in answering targeted queries (Kitchin, 2015, 463).

* The best approach at the moment is to **leverage them wisely so that Big and small data complement each other**.

* Be mindful of the **unit/frequency of data**: When merging data sets, the one with the coarser unit or lower frequency (the more small data-like one) determines the resolution of the output.  
.footnotesize[.gray[E.g., joining coordinate-level data with Census Tract-level data, merging real-time feed with quarterly data]]
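
---
## Merging across units and frequencies: a sketch

The point about unit and frequency can be illustrated with a toy merge of a timestamped feed and a quarterly table. This is a minimal sketch under stated assumptions: the tract IDs, ridership counts, rent values, and the helper `quarter_of` are all hypothetical, and each feed record is assumed to carry a tract ID already (in practice, coordinate-level data would first need a spatial join to tracts).

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical high-frequency feed: (timestamp, tract ID, ridership count).
feed = [
    ("2022-01-15 08:00", "T1", 120),
    ("2022-02-20 17:30", "T1", 80),
    ("2022-04-03 09:15", "T1", 100),
    ("2022-01-10 12:00", "T2", 40),
    ("2022-05-22 18:45", "T2", 60),
]

# Hypothetical quarterly table keyed by (tract ID, quarter).
quarterly_median_rent = {
    ("T1", "2022Q1"): 1500,
    ("T1", "2022Q2"): 1550,
    ("T2", "2022Q1"): 1100,
    ("T2", "2022Q2"): 1120,
}

def quarter_of(timestamp):
    """Map a 'YYYY-MM-DD HH:MM' string to a label such as '2022Q1'."""
    moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
    return f"{moment.year}Q{(moment.month - 1) // 3 + 1}"

# Step 1: aggregate the feed up to the coarser unit (tract x quarter).
riders_by_tract_quarter = defaultdict(int)
for timestamp, tract, riders in feed:
    riders_by_tract_quarter[(tract, quarter_of(timestamp))] += riders

# Step 2: join on the shared key. The result has one row per tract-quarter,
# no matter how many feed records went into it.
merged = [
    {"tract": tract, "quarter": quarter, "riders": riders,
     "median_rent": quarterly_median_rent[(tract, quarter)]}
    for (tract, quarter), riders in sorted(riders_by_tract_quarter.items())
]

for row in merged:
    print(row)
```

Five timestamped records collapse into four tract-quarter rows: the within-quarter timing of the feed is lost once it is joined to the quarterly table.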