Twitter Data Analysis

Originally written by Florina Dutt & Clio Andris (modified by Bonwoo Koo)

2022-11-1

# Package names
packages <- c("rtweet", "ggplot2", "dplyr", "tidytext", "tidyverse", "igraph", "ggraph", "tidyr", "wordcloud2", "textdata", "sf", "tmap")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

DISCLAIMER: Because tweets are produced rapidly and are not censored, I cannot guarantee that you will not encounter images that are not suited for work / school, or content that is controversial. This is part of the urban analytics field. That said, I hope no one gets offended or upset by anything we may encounter.

Section 0. Get your API credential

If you don’t have credentials for Twitter API, go to this webpage to find instructions on how to get one for yourself.

Note that there are multiple tiers in the Twitter API - Essential Access, Elevated Access, and Academic Research Access. Each tier has its own caps on the maximum number of Apps per Project, Tweet consumption per month, and other limits (see this page for more details).

The Twitter API provides very well-organized documentation that is great for understanding the structure of the API. Apps and Projects in the Twitter API help you organize your work. Each Project can contain a single App if you have Essential access, and up to three Apps if you have Elevated or greater access. We will be using Elevated Access, which allows up to 3 Apps within a Project and a Tweet consumption cap of 2 million Tweets per month.
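The create_token() code in the next section reads your keys from environment variables (twitter_key, twitter_key_secret, twitter_access_token, twitter_access_token_secret). As a minimal sketch, assuming you keep those same variable names, you can set them for the current session as below, or add the same name=value pairs to your .Renviron file and restart R; the values shown are placeholders.

# Store your credentials as environment variables for the current session.
# Replace the placeholder values with your own keys from the developer portal.
Sys.setenv(twitter_key = "YOUR_API_KEY",
           twitter_key_secret = "YOUR_API_KEY_SECRET",
           twitter_access_token = "YOUR_ACCESS_TOKEN",
           twitter_access_token_secret = "YOUR_ACCESS_TOKEN_SECRET")

# For a permanent setup, put the same name = value pairs (one per line) in your
# .Renviron file (e.g., open it with usethis::edit_r_environ()) and restart R.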

Section 1. Let’s download some Tweets

In this document, we will use two different ways of collecting Twitter data: (1) collecting live tweets and (2) collecting tweets from the timeline of some users (including past tweets).

Step 1-1: Getting Live Tweets

# whatever name you assigned to your created app
appname <- "UrbanAnalytics_tutorial"

# create token named "twitter_token"
# the keys used should be replaced by your own keys obtained by creating the app  

twitter_token <- create_token(
  app = appname,
  consumer_key = Sys.getenv("twitter_key"), 
  consumer_secret = Sys.getenv("twitter_key_secret"),
  access_token = Sys.getenv("twitter_access_token"),
  access_secret = Sys.getenv("twitter_access_token_secret"))
## Warning: `create_token()` was deprecated in rtweet 1.0.0.
## See vignette('auth') for details
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Saving auth to 'C:\Users\bkoo34\AppData\Roaming/R/config/R/rtweet/
## create_token.rds'

We use the keyword Piedmont Park - a popular urban park in Atlanta.

# get/search tweets containing given words or hashtags
# as an example, 'Piedmont Park' is used; the result is saved in the data frame 'park_tweets_all'
# search_tweets() is a function from the 'rtweet' package
# q is the word/hashtag to search for; n is the maximum number of tweets to download at a time. Here it is 200.
park_tweets_all <- search_tweets(q = "Piedmont Park", n = 200) 
park_tweets_all <- bind_cols(park_tweets_all, 
                             users_data(park_tweets_all) %>% select(screen_name, location))

The search_tweets() function can go back up to about 7 days. Let’s plot when the Tweets were made.

# When were these tweets made?
park_tweets_all %>% 
  ggplot(data = .) +
  geom_histogram(mapping = aes(x = created_at), color="gray") + 
  scale_x_datetime(date_labels = " %H:%M on %b %d")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Step 1-2: Tokenization & Barplot

Tokenization is the fundamental starting point in any NLP pipeline. “Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization” (direct quotes from here).

As an example, consider “How are you?” The most common way of tokenizing is to break at spaces, resulting in how-are-you. You can also use character tokens, e.g., hello becomes h-e-l-l-o, or sub-word tokens, e.g., smarter becomes smart-er.
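To make this concrete, here is a tiny sketch (using the tidytext unnest_tokens() function we loaded above) that tokenizes the same kind of example into word and character tokens:

# Toy example: word tokens vs. character tokens
toy <- tibble(text = "How are you?")
toy %>% unnest_tokens(output = word, input = text, token = "words")       # how, are, you
toy %>% unnest_tokens(output = chr, input = text, token = "characters")   # h, o, w, a, r, ...

Note that unnest_tokens() lowercases the tokens and drops punctuation by default.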

Then, if you code each token with a number, you can represent textual data numerically. For example, if we code h: 1, e: 2, l: 3, o: 4, hello would be represented as 12334. The word heel can then be written as 1223, and so on. You can already see the similarity between the two words, 12334 and 1223. This tokenization will be useful later when we do more sophisticated work.

To learn more, I found this video very intuitive.
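And as a toy illustration of the token-to-number coding described above (the codes are made up just for this example):

# Toy example: represent character tokens as numbers (arbitrary codes)
codes <- c(h = 1, e = 2, l = 3, o = 4)
paste(codes[strsplit("hello", "")[[1]]], collapse = "")  # "12334"
paste(codes[strsplit("heel", "")[[1]]], collapse = "")   # "1223"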

# unique screen_name and location of those who tweeted about the park.
# Note that screen_name is less stable over time than user_id.
# You can consider replacing screen_name with user_id.
users <- park_tweets_all %>% 
  group_by(screen_name) %>% 
  summarise(location = first(location))

user_loc <- users$location %>% 
  table() %>% 
  as_tibble()

names(user_loc) <- c("location", "n")

loc_word <- user_loc %>% 
  unnest_tokens(input = location, # name of column in the input data
                output = Location) %>% # name of the column that will be in the output
  group_by(Location) %>% 
  summarize(n = sum(n)) %>%
  arrange(desc(n))

loc_word <- loc_word %>% 
  filter(n > 1 & Location != "")

loc_word %>% 
  mutate(Location = factor(Location, levels = Location)) %>% 
  ggplot(data = .) + 
  geom_col(mapping = aes(x = Location, y = n)) +
  coord_flip()

Step 1-3: Getting Tweets from User Timeline

This section is heavily borrowed from a Medium post.

# Get time lines
obama <- rtweet::get_timeline("BarackObama", n = 3200)
biden <- rtweet::get_timeline("JoeBiden", n=3200)

# Add screen name
obama <- bind_cols(obama, 
                   users_data(obama) %>% select(screen_name, location))

biden <- bind_cols(biden, 
                   users_data(biden) %>% select(screen_name, location))

# Row-bind the two
tweets <- bind_rows(
  obama %>% select(text, screen_name, created_at, retweet_count, favorite_count),
  biden %>% select(text, screen_name, created_at, retweet_count, favorite_count)
  )
ggplot(tweets, aes(x = created_at, fill = screen_name)) +
  geom_histogram(position = "identity", bins = 50, show.legend = FALSE) +
  facet_wrap(~screen_name, ncol = 1) + 
  ggtitle("Tweet Activity")

Let’s do the tokenization and delete URLs and stop words (more on this later) from the text.

# Regex that matches URL-type string
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

# Tidy the string
tidy_tweets_words <- tweets %>% 
  # Drop retweets
  filter(!str_detect(text, "^RT")) %>%
  # Drop URLs
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  # Add id column
  mutate(id = row_number()) %>%
  # Tokenization (word tokens)
  unnest_tokens(
    word, text, token = "words") %>%
  # Drop Stopwords & drop non-alphabet-only strings
  filter(!word %in% stop_words$word, str_detect(word, "[a-z]"))

tidy_tweets_words
## # A tibble: 61,179 × 6
##    screen_name created_at          retweet_count favorite_count    id word     
##    <chr>       <dttm>                      <int>          <int> <int> <chr>    
##  1 BarackObama 2022-11-03 11:21:06           491           2110     1 election 
##  2 BarackObama 2022-11-03 11:21:06           491           2110     1 set      
##  3 BarackObama 2022-11-03 11:21:06           491           2110     1 direction
##  4 BarackObama 2022-11-03 11:21:06           491           2110     1 country  
##  5 BarackObama 2022-11-03 11:21:06           491           2110     1 it’s     
##  6 BarackObama 2022-11-03 11:21:06           491           2110     1 voice    
##  7 BarackObama 2022-11-03 11:21:06           491           2110     1 heard    
##  8 BarackObama 2022-11-03 11:21:06           491           2110     1 voting   
##  9 BarackObama 2022-11-03 11:21:06           491           2110     1 november 
## 10 BarackObama 2022-11-03 11:21:06           491           2110     1 8th      
## # … with 61,169 more rows
## # ℹ Use `print(n = ...)` to see more rows

Finally, let’s do a visualization of the key words used by the two presidents.

frequency <- tidy_tweets_words %>% 
  group_by(screen_name) %>% 
  count(word, sort = TRUE) %>% 
  left_join(tidy_tweets_words %>% 
              group_by(screen_name) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)
## Joining, by = "screen_name"
frequency_wide <- frequency %>% 
  select(screen_name, word, freq) %>% 
  pivot_wider(names_from = screen_name, values_from = freq) %>%
  arrange(desc(BarackObama), desc(JoeBiden))

ggplot(frequency_wide, mapping = aes(x = BarackObama, y = JoeBiden)) +
  geom_jitter(
    alpha = 0.1, size = 2.5, width = 0.15, height = 0.15) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 0) +    
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  geom_abline(color = "red") + theme_bw()
## Warning: Removed 5029 rows containing missing values (geom_point).
## Warning: Removed 5029 rows containing missing values (geom_text).

Try this with other public figures who you think would make a good comparison.

Section 3. Stop Words, Word Cloud, and N-grams

Another important cleaning step is to remove stop words. For fun, let’s use your own keyword from now on. Using this knowledge, try your own search with a keyword of your choice. It is useful to think about how you would design the search if you had a specific research question. For example, if you were interested in what people are saying about ‘gentrification’ or ‘neighborhood change’ in metro Atlanta, what would your keyword, search scope, etc. be?
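As a hedged sketch of what such a targeted search could look like (the keywords, the geocode around downtown Atlanta, and the object name atl_twts are assumptions to adapt to your own question):

# Hypothetical targeted search: OR combines keywords; geocode limits the
# search to a radius around a point (here roughly downtown Atlanta).
atl_twts <- search_tweets(q = 'gentrification OR "neighborhood change"',
                          geocode = "33.749,-84.388,30mi",
                          n = 500, lang = "en", include_rts = FALSE)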

Getting Tweets

# Get Tweets using your key
my_twts <- search_tweets(q = "'Georgia Tech'", 
                         n = 300,
                         lang = "en")

# Add User info
my_twts <- my_twts %>% 
  cbind(my_twts %>% 
          users_data() %>% 
          select(name, screen_name, location, favourites_count))

Clean + Tokenization

# Regex that matches URL-type string
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

# Tidy the string
tidy_tweets_words <- my_twts %>% 
  # Drop retweets
  filter(!str_detect(text, "^RT|^rt")) %>%
  # Drop URLs
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  # Add id column
  mutate(id = row_number()) %>%
  # Tokenization (word tokens)
  unnest_tokens(
    output = word, input = text, token = "words") 

tidy_tweets_words %>% 
  head(n = 3)
##            created_at id              id_str
## 1 2022-11-01 20:33:36  1 1587603797198934016
## 2 2022-11-01 20:33:36  1 1587603797198934016
## 3 2022-11-01 20:33:36  1 1587603797198934016
##                                                                                                                                                                                                                                            full_text
## 1 Celebrate with us 🙌\n\nThis Saturday versus Georgia Tech, we will honor our four lettermen who will be inducted into the @hokiesports Hall of Fame this weekend.\n\n🎟» https://t.co/PCT4EVKC6k\n📝» https://t.co/Jcr3wiB0z7 https://t.co/ZhnTkmcLhY
## 2 Celebrate with us 🙌\n\nThis Saturday versus Georgia Tech, we will honor our four lettermen who will be inducted into the @hokiesports Hall of Fame this weekend.\n\n🎟» https://t.co/PCT4EVKC6k\n📝» https://t.co/Jcr3wiB0z7 https://t.co/ZhnTkmcLhY
## 3 Celebrate with us 🙌\n\nThis Saturday versus Georgia Tech, we will honor our four lettermen who will be inducted into the @hokiesports Hall of Fame this weekend.\n\n🎟» https://t.co/PCT4EVKC6k\n📝» https://t.co/Jcr3wiB0z7 https://t.co/ZhnTkmcLhY
##   truncated display_text_range
## 1     FALSE                213
## 2     FALSE                213
## 3     FALSE                213
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    entities
## 1 NA, NA, NA, NA, hokiesports, HokieSports, 16001276, 16001276, 119, 131, https://t.co/PCT4EVKC6k, https://t.co/Jcr3wiB0z7, http://vthoki.es/1ZTyu, http://vthoki.es/PnGIt, vthoki.es/1ZTyu, vthoki.es/PnGIt, 163, 190, 186, 213, NA, NA, 1587603356721242112, 1587603356721242113, 214, 237, http://pbs.twimg.com/media/FghM12mWYAE8rs4.jpg, https://pbs.twimg.com/media/FghM12mWYAE8rs4.jpg, https://t.co/ZhnTkmcLhY, pic.twitter.com/ZhnTkmcLhY, https://twitter.com/HokiesFB/status/1587603797198934016/photo/1, photo, 1638, 544, 150, 960, 2048, 680, 150, 1200, fit, fit, crop, fit, large, small, thumb, medium, NA
## 2 NA, NA, NA, NA, hokiesports, HokieSports, 16001276, 16001276, 119, 131, https://t.co/PCT4EVKC6k, https://t.co/Jcr3wiB0z7, http://vthoki.es/1ZTyu, http://vthoki.es/PnGIt, vthoki.es/1ZTyu, vthoki.es/PnGIt, 163, 190, 186, 213, NA, NA, 1587603356721242112, 1587603356721242113, 214, 237, http://pbs.twimg.com/media/FghM12mWYAE8rs4.jpg, https://pbs.twimg.com/media/FghM12mWYAE8rs4.jpg, https://t.co/ZhnTkmcLhY, pic.twitter.com/ZhnTkmcLhY, https://twitter.com/HokiesFB/status/1587603797198934016/photo/1, photo, 1638, 544, 150, 960, 2048, 680, 150, 1200, fit, fit, crop, fit, large, small, thumb, medium, NA
## 3 NA, NA, NA, NA, hokiesports, HokieSports, 16001276, 16001276, 119, 131, https://t.co/PCT4EVKC6k, https://t.co/Jcr3wiB0z7, http://vthoki.es/1ZTyu, http://vthoki.es/PnGIt, vthoki.es/1ZTyu, vthoki.es/PnGIt, 163, 190, 186, 213, NA, NA, 1587603356721242112, 1587603356721242113, 214, 237, http://pbs.twimg.com/media/FghM12mWYAE8rs4.jpg, https://pbs.twimg.com/media/FghM12mWYAE8rs4.jpg, https://t.co/ZhnTkmcLhY, pic.twitter.com/ZhnTkmcLhY, https://twitter.com/HokiesFB/status/1587603797198934016/photo/1, photo, 1638, 544, 150, 960, 2048, 680, 150, 1200, fit, fit, crop, fit, large, small, thumb, medium, NA
##      metadata
## 1 popular, en
## 2 popular, en
## 3 popular, en
##                                                                    source
## 1 <a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>
## 2 <a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>
## 3 <a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>
##   in_reply_to_status_id in_reply_to_status_id_str in_reply_to_user_id
## 1                    NA                      <NA>                  NA
## 2                    NA                      <NA>                  NA
## 3                    NA                      <NA>                  NA
##   in_reply_to_user_id_str in_reply_to_screen_name geo coordinates
## 1                    <NA>                    <NA>  NA  NA, NA, NA
## 2                    <NA>                    <NA>  NA  NA, NA, NA
## 3                    <NA>                    <NA>  NA  NA, NA, NA
##                                                            place contributors
## 1 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA           NA
## 2 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA           NA
## 3 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA           NA
##   is_quote_status retweet_count favorite_count favorited retweeted
## 1           FALSE            41            321     FALSE     FALSE
## 2           FALSE            41            321     FALSE     FALSE
## 3           FALSE            41            321     FALSE     FALSE
##   possibly_sensitive lang
## 1              FALSE   en
## 2              FALSE   en
## 3              FALSE   en
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 retweeted_status
## 1 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## 2 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## 3 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
##   quoted_status_id quoted_status_id_str
## 1               NA                 <NA>
## 2               NA                 <NA>
## 3               NA                 <NA>
##                                                                                                                                                                                                                                                        quoted_status
## 1 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## 2 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## 3 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
##   favorited_by scopes display_text_width quoted_status_permalink quote_count
## 1           NA     NA                 NA                      NA          NA
## 2           NA     NA                 NA                      NA          NA
## 3           NA     NA                 NA                      NA          NA
##   timestamp_ms reply_count filter_level query withheld_scope withheld_copyright
## 1           NA          NA           NA    NA             NA                 NA
## 2           NA          NA           NA    NA             NA                 NA
## 3           NA          NA           NA    NA             NA                 NA
##   withheld_in_countries possibly_sensitive_appealable                   name
## 1                    NA                            NA Virginia Tech Football
## 2                    NA                            NA Virginia Tech Football
## 3                    NA                            NA Virginia Tech Football
##   screen_name                      location favourites_count      word
## 1    HokiesFB Lane Stadium - Blacksburg, VA            20427 celebrate
## 2    HokiesFB Lane Stadium - Blacksburg, VA            20427      with
## 3    HokiesFB Lane Stadium - Blacksburg, VA            20427        us

Let’s plot the top 20 words to see which words are used most frequently. You will notice that the plot has an issue - many uninformative words, such as the, to, and, a, for, etc., are included. While it makes sense that these words appear frequently in Tweets, they are not useful for our analysis.

# plot the top 20 words and sort them in order of their counts
tidy_tweets_words %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
      labs(x = "words",
      y = "counts",
      title = "Figure 2: Unique wordcounts found in tweets")
## Selecting by n

#-- Do you observe any problem?

Deleting Stop Words

These so-called stop words can be easily deleted. The tidytext package includes a data set called stop_words, which is a list of common stop words. We can use this data set to filter stop words out of our Tweets.

# load list of stop words - from the tidytext package
data("stop_words")
# view first 6 words
head(stop_words)
## # A tibble: 6 × 2
##   word      lexicon
##   <chr>     <chr>  
## 1 a         SMART  
## 2 a's       SMART  
## 3 able      SMART  
## 4 about     SMART  
## 5 above     SMART  
## 6 according SMART

We can use the anti_join() function, from the dplyr package, for this filtering. It removes the stop words from the tokenized tweet text, and the result is saved as the cleaned tweet words. As with the other join functions, the order of the data frames in the arguments matters: anti_join(A, B) will give you everything from A except the rows that also appear in B, while anti_join(B, A) will give you everything from B except the rows that appear in A.
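Here is a toy example of that behavior (the two small data frames are made up just for illustration):

# Toy example of anti_join(): keep rows of the first data frame that have
# no match in the second data frame, matching on the 'word' column.
A <- tibble(word = c("park", "the", "sunny"))
B <- tibble(word = c("the", "a"))
anti_join(A, B, by = "word")  # "park", "sunny"
anti_join(B, A, by = "word")  # "a"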

Also, because I am interested in what other words appear in conjunction with ‘Georgia Tech’, I will delete ‘georgia’ and ‘tech’ from the plot. I also delete tokens that contain numbers, because they tend to be uninformative here (e.g., scores or dates that happen to appear alongside the keyword).

# remove stop words from your list of words
cleanTokens <- tidy_tweets_words %>% anti_join(stop_words, by = c("word"))

# Check the number of rows after removal of the stop words. There should be fewer words now
print(
  glue::glue("Before: {nrow(tidy_tweets_words)}, After: {nrow(cleanTokens)}")
  )
## Before: 2308, After: 1357
# plot the top 20 words 
cleanTokens %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% c("georgia", "tech")) %>% 
  filter(!str_detect(string = word, pattern = "[0-9]+")) %>% 
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(y = "count",
       x = "words",
       title = "Figure 3: Unique wordcounts found in tweets after applying stop words",
       subtitle = "Stop words removed from the list")
## Selecting by n

Part 4: Word clouds

### You may need these. The 'pal' palette is only needed if you use the
### 'wordcloud' package; the wordcloud2() call below does not use it.
# library(wordcloud)
library(RColorBrewer)
pal <- brewer.pal(8, "Dark2")

#Get some frequency counts for each word
# I am dropping 'georgia' and 'tech' because 
# I am interested in what other words appear with Georgia Tech
freq_df1 <- cleanTokens %>%
  filter(!word %in% c("georgia", "tech")) %>% 
  filter(!str_detect(string = word, pattern = "[0-9]+")) %>% 
  count(word, sort = TRUE) %>%
  top_n(100) %>%
  mutate(word = reorder(word, n))
## Selecting by n
wordcloud2(data = freq_df1, minRotation = 0, maxRotation = 0, ellipticity = 0.8)

Part 5: N-grams

You may want to see words that appear together in the tweets. An n-gram is a sequence of n words appearing together. For example, ‘basketball coach’ and ‘dinner time’ are pairs of words occurring together, so they are called bi-grams. Similarly, ‘the three musketeers’ is a tri-gram, and ‘she was very hungry’ is a 4-gram. We will learn how to extract n-grams from the Tweet text, which will give further insight into the tweet corpus. For advanced text analysis and machine-learning-based labeling, specific tokens and n-grams can be used for feature engineering of the tweets.

N-grams are used to analyze words in context. In (1) “We need to check the details.” and (2) “Can we pay it with a check?”, the word check is used as a verb and as a noun, respectively. We know what ‘check’ means in a sentence based on the other words around it, particularly the words immediately before and after ‘check.’ For example, if ‘check’ follows ‘to’, we can infer that it is used as a verb. You can try bi-grams (2 words), tri-grams (3 words), and so on.

For example, “The result of separating bigrams is helpful for exploratory analyses of the text.” becomes the following paired_words (a short sketch of how to produce this follows the list):
1 the result
2 result of
3 of separating
4 separating bigrams
5 bigrams is
6 is helpful
7 helpful for
8 for exploratory
9 exploratory analyses
10 analyses of
11 of the
12 the text
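A list like this can be produced with the same unnest_tokens() call we are about to apply to the tweets; as a minimal sketch on that example sentence:

# Minimal sketch: bigrams (n = 2) from a single example sentence
tibble(text = "The result of separating bigrams is helpful for exploratory analyses of the text.") %>%
  unnest_tokens(output = paired_words, input = text, token = "ngrams", n = 2)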

# Get n-grams. You may try playing around with the value of n, e.g., n = 3 or n = 4
my_twts_ngram <- my_twts %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  dplyr::select(text) %>%
  unnest_tokens(output = paired_words,
                input = text,
                token = "ngrams",
                n = 2)
#show ngrams with sorted values
my_twts_ngram %>%
  count(paired_words, sort = TRUE) %>% 
  head()
##           paired_words   n
## 1         georgia tech 163
## 2                 to a  85
## 3          a terrified  83
## 4 accidental president  83
## 5        an accidental  83
## 6    because democrats  83

Here we see that the n-grams include stop words such as a, to, etc. Next, we will obtain n-grams that do not contain stop words. We will use the separate() function from the tidyr package to split the paired words into two columns, word1 and word2. Subsequently, we filter out rows containing stop words using the filter() function.

library(tidyr)
#separate the paired words into two columns
my_twts_ngram_pair <- my_twts_ngram %>%
  separate(paired_words, c("word1", "word2"), sep = " ")

# filter rows where there are stop words under word 1 column and word 2 column
my_twts_filtered <- my_twts_ngram_pair %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# Filter out words that are not encoded in ASCII
# To see what's ASCII, google 'ASCII table'
my_twts_filtered <- my_twts_filtered[
  stringi::stri_enc_isascii(my_twts_filtered$word1) &
    stringi::stri_enc_isascii(my_twts_filtered$word2),
  ]

# Sort the new bi-gram (n=2) counts:
my_words_counts <- my_twts_filtered %>%
  count(word1, word2) %>%
  arrange(desc(n))

head(my_words_counts)
##        word1     word2   n
## 1    georgia      tech 163
## 2 accidental president  83
## 3      covid      fear  83
## 4  democrats      sold  83
## 5       fear      porn  83
## 6        joe     biden  83

Using the igraph and ggraph packages, we can visualize the words occurring in pairs as a network. (Note: the edges can’t be drawn at this time.)

# plot word network
my_words_counts %>%
  filter(n >= 2) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_width = n), edge_alpha = 0.6) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8, size = 2.5) +
  labs(title = "Figure 5: Word Network: Tweets using my hashtag",
       subtitle = "Text mining twitter data",
       x = "", y = "")