class: middle, inverse

# Data for Urban Analytics

.font100[
Bon Woo Koo & Subhro Guhathakurta

8/30/2022
]

---
## Data for Urban Analytics

**Big data is not just about size.** If size were all that made data "big", there would be nothing new about big data; we have had large data sets for a long time. For example,

* National Decennial Census <br> .footnotesize[.gray[(which technically covers the entire population; hundreds of millions)]]
* Survey responses for electoral predictions <br> .footnotesize[.gray[(e.g., the famous 2.4 million survey responses collected by Literary Digest in 1936, which, despite their number, failed miserably)]]

Also, **size per observation** can be very different (e.g., text vs. image).

.center[
### The characteristics of modern data that gave rise to urban analytics are more complex.
]

---
## Characteristics of Big Data

Kitchin (2014) details that big data can be characterized by:

.footnotesize[
* **Volume**, consisting of terabytes or petabytes of data
* **Velocity**, being created in or near real-time .red[**(Key trait)**]
* **Variety**, being structured and unstructured in nature
]

--

.footnotesize[
And goes on to include:

* **Exhaustive** in scope, striving to capture entire populations or systems (n=all) .red[**(Key trait)**]
* fine-grained in **resolution**, aiming to be as detailed as possible;
* **relational** in nature, containing common fields that can join different data sets;
* **flexible** (new fields can be added easily) and **scalable** (can expand in size rapidly).
]

.center[These characteristics affect the choice of analytical methods.]

???

**Volume**: Walmart generates 2.5 petabytes of transaction data **every hour** (Kitchin & McArdle, 2014).

**Velocity**: Note, however, that the frequency of **generation** != the frequency of **publishing** (Kitchin & McArdle, 2014).

**Variety**: Unstructured data were used in the past, but **not at the scale we use them now**.

**Exhaustivity**: Twitter data contains the entirety of the Twitter users.
The definition of "all" can be relative (school vs. student). Exhaustivity is another key trait of big data.

---
## Sources of Big Data

* **Directed** → "Generated by traditional forms of surveillance, wherein the gaze of the technology is controlled by a ***human operator***" (Kitchin 2014, p.4). <font color="gray" size=4px>● Satellite ● CCTV ● Google Street View images.</font>

* **Automated** → "Generated as an inherent, ***automatic function of the device or system***" (Kitchin 2014, p.4). <font color="gray" size=4px> ● GPS from cell phones ● clickstream on Amazon ● tap-in and tap-out records from public transportation systems. </font>

* **Volunteered** → Generated by ***users.*** <font color="gray" size=4px> ● Postings on social media ● images on Pinterest ● reviews on Yelp ● crowd-sourced databases such as OpenStreetMap. </font>

---
## Some caveats

In addition to data ethics,

1. When they say "n=all", their "all" and your "all" may be different.
    * We often don't know who is over- and under-represented.
    * Even when we can check, we tend to assume "n=all".
    * Falsely assuming population representativeness can be dangerous.
2. Embedded biases in data and models are often not easily visible. .footnotesize[.gray[E.g., [Angry men and happy women](https://psycnet.apa.org/record/2007-00654-002)]]
3. Even with all the data, answering "why" can still be difficult.
4. Correlation != causation.

<font size=4px color=gray>

* Orange used cars have the best-kept engines.
* Passengers who pre-order vegetarian meals usually make their flights.
* Spikes in the sale of pre-paid phone cards can predict the location of impending massacres in the Congo (Hardy, 2012a, taken from Hilbert, 2016).

</font>

???

Participants were faster and more accurate at detecting angry expressions on male faces and at detecting happy expressions on female faces.

<!-- --- -->
<!-- Many of these big data are unstructured.
Only about 5 percent of all existing data are -->
<!-- structured (that is, tabular data in a spreadsheet or similar formats) while the rest is not -->
<!-- in these formats (Cukier 2010; Gandomi and Haider 2015) -->
<!-- Unstructured data, such as -->
<!-- images, audio, video and unstructured texts, often need to be translated into structured -->
<!-- formats required by analysis and modelling conventions (Gandomi and Haider 2015). -->
<!-- Many data coming from sensors and wireless networks (for example, smartphones) are -->
<!-- inherently spatial and spatiotemporal (Jardak et al. 2014). A study in 2012 noted that Google generates -->
<!-- about 25 petabytes of data per day, and a significant portion of the data has spatiotemporal components (Vatsavai et al. 2012). -->
<!-- In addition to relational data, the spatial dimension -->
<!-- of big data offers important insights and allows researchers to gain greater value from the -->
<!-- data by, for example, joining different datasets that are otherwise disconnected. -->

---
## Are we abandoning small data?

* **Certainly not.** There are many caveats in Big Data, and we are still learning what they are.
* Small data studies will continue to be valuable because of their utility in answering targeted queries (Kitchin, 2015, 463).
* The best approach at the moment is to **leverage them wisely so that Big and small data complement each other**.
* Be mindful of the **unit/frequency of data**: When merging data sets, the coarser, more small data-like data set determines the unit and frequency of the output. .footnotesize[.gray[E.g., joining coordinate-level data with Census Tract-level data, or merging a real-time feed with quarterly data]]
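
The unit/frequency point above can be illustrated with a minimal sketch. It assumes every fine-grained observation already carries the ID of the Census Tract it falls in (the spatial join itself is omitted). All names and values (`point_records`, `tract_income`, `aggregate_to_tract`, `join_on_tract`) are hypothetical and made up for illustration; they are not data or code from this course.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical fine-grained ("big") data: one record per observation,
# already tagged with the Census Tract it falls in (illustrative values).
point_records = [
    ("T1", 10.0), ("T1", 14.0),
    ("T2", 7.0), ("T2", 9.0), ("T2", 11.0),
]

# Hypothetical coarse ("small") data: one value per Census Tract.
tract_income = {"T1": 52000, "T2": 61000}


def aggregate_to_tract(records):
    """Collapse observation-level values to one mean value per tract."""
    grouped = defaultdict(list)
    for tract_id, value in records:
        grouped[tract_id].append(value)
    return {tract_id: mean(values) for tract_id, values in grouped.items()}


def join_on_tract(point_means, tract_table):
    """Join the two sources on tract ID; the result has one row per tract."""
    shared_ids = sorted(point_means.keys() & tract_table.keys())
    return {
        tract_id: {
            "mean_value": point_means[tract_id],
            "income": tract_table[tract_id],
        }
        for tract_id in shared_ids
    }


merged = join_on_tract(aggregate_to_tract(point_records), tract_income)
for tract_id, row in merged.items():
    print(tract_id, row)
```

Five observations go in, but the merged output has only two rows, one per tract: the coarser data set sets the unit of analysis. The same logic applies to time, e.g., a real-time feed would have to be summarized by quarter before it can be merged with quarterly data.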