Sentiment Analysis

Bonwoo Koo and Subhro Guhathakurta

2022-11-08

DISCLAIMER: Because tweets are produced rapidly and are uncensored, I cannot guarantee that you will not encounter some images that are not suited for work or school, or some content that is controversial. This is part of the urban analytics field. That said, I hope no one is offended or upset by anything we may encounter.

Section 0. Get your API credential

If you don’t have credentials for Twitter API, go to this webpage to find instructions on how to get one for yourself.

Note that there are multiple tiers in the Twitter API - Essential Access, Elevated Access, and Academic Research Access. Each tier has its own caps on the maximum number of Apps per Project, Tweet consumption per month, and more (see this page for details).

The Twitter API provides well-organized documentation that is great for understanding the structure of the API. The Twitter API organizes your work into Projects and Apps. Each Project can contain a single App with Essential access, and up to three Apps with Elevated or greater access. We will be using Elevated Access, which allows up to three Apps within a Project and a Tweet consumption cap of 2 million Tweets per month.
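Once you have your credentials, you need to tell R about them before any API call will work. Below is a minimal authentication sketch using rtweet (>= 1.0). The environment variable name TWITTER_BEARER_TOKEN is my own convention, not something rtweet requires - you can store the bearer token under any name (e.g., via usethis::edit_r_environ()).

```r
# Read the App's bearer token from an environment variable.
# TWITTER_BEARER_TOKEN is an assumed name; use whatever you set.
bearer <- Sys.getenv("TWITTER_BEARER_TOKEN")

if (nzchar(bearer) && requireNamespace("rtweet", quietly = TRUE)) {
  # Create an app-based token and make it the default for rtweet calls
  auth <- rtweet::rtweet_app(bearer_token = bearer)
  rtweet::auth_as(auth)
} else {
  message("Set TWITTER_BEARER_TOKEN (and install rtweet) before calling the Twitter API.")
}
```

Keeping the token in an environment variable (rather than in your script) prevents you from accidentally committing your credentials to GitHub.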

Section 1. Setup for Sentiment Analysis

Sentiment analysis refers to the use of natural language processing and related techniques to quantify affective states, often from text data (source). As we experienced with computer vision, recent developments have produced nicely packaged models that we can easily use out of the box. This document introduces two packages for sentiment analysis in R. The first is the SentimentAnalysis package, which uses a dictionary-based approach. On the benefits of the dictionary-based approach, its vignette states,

“On the one hand, machine learning approaches are preferred when one strives for high prediction performance. However, machine learning usually works as a black-box, thereby making interpretations difficult. On the other hand, dictionary-based approaches generate lists of positive and negative words. The respective occurrences of these words are then combined into a single sentiment score. Therefore, the underlying decisions become traceable and researchers can understand the factors that result in a specific sentiment.”
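To make the quoted idea concrete, here is a toy dictionary-based scorer in base R. The word lists below are invented for illustration; real dictionaries such as QDAP (which SentimentAnalysis uses) contain thousands of entries, but the logic - count positive and negative hits and scale by document length - is the same.

```r
# Toy dictionary-based sentiment scorer (illustration only; real
# dictionaries such as QDAP contain thousands of entries).
positive <- c("great", "good", "love", "fantastic")
negative <- c("bad", "terrible", "hate", "poisoning")

dict_score <- function(text) {
  # Lower-case and split into words
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]
  pos <- sum(words %in% positive)
  neg <- sum(words %in% negative)
  # Net count of dictionary hits, scaled by document length
  (pos - neg) / length(words)
}

dict_score("What a great car")          # 1 positive hit / 4 words = 0.25
dict_score("I hate this terrible app")  # 2 negative hits / 5 words = -0.4
```

Because the score is just word counts, you can always trace *why* a text got its score - exactly the traceability the vignette describes - but the approach also misses negation, sarcasm, and context, which is where model-based packages like sentiment.ai come in.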

The second package is sentiment.ai. This package is based on a deep learning architecture and, according to its developers, “… is relatively simple and outperforms the current best offerings on CRAN and even Microsoft’s Azure Cognitive Services.”

Installation

Unlike other packages we’ve used so far, the sentiment.ai package is not easy to install (the developers themselves use the word ‘notorious’ to describe the installation process).

  1. Go to https://docs.conda.io/en/latest/miniconda.html and download the Miniconda3 Windows 64-bit installer. Install Miniconda by clicking ‘Next’ and then ‘Install.’

  2. If your RStudio is currently open, close it and open it again.

  3. Load the package using library(sentiment.ai).

  4. Run the following code. It will install many dependencies and restart the R session at the end. The next time you open RStudio, you will not need to run this code again.

# You need to run it only once.
install_sentiment.ai(envname = "r-sentiment-ai",
                     method = "conda",
                     python_version = "3.8.10")
  5. After the R session restarts, run the following code:
# Sam's solution
init_sentiment.ai(envname = "r-sentiment-ai", method="conda")
## Warning in reticulate::use_condaenv(envname, required = TRUE): multiple Conda environments found; the first-listed will be chosen.
##             name
## 4 r-sentiment-ai
## 6 r-sentiment-ai
##                                                                           python
## 4 C:\\Users\\bonwo\\AppData\\Local\\r-miniconda\\envs\\r-sentiment-ai/python.exe
## 6           C:\\Users\\bonwo\\Documents\\.virtualenvs\\r-sentiment-ai/python.exe
## Warning: The request to `use_python("C:\Users\bonwo\AppData\Local\r-
## miniconda\envs\r-sentiment-ai/python.exe")` will be ignored because
## the environment variable RETICULATE_PYTHON is set to "C:/Users/bonwo/
## Documents/.virtualenvs/r-sentiment-ai/Scripts/python.exe"
## Preparing Model
## Warning: Python 'C:\Users\bonwo\AppData\Local\r-miniconda\envs\r-sentiment-
## ai/python.exe' was requested but 'C:/Users/bonwo/Documents/.virtualenvs/r-
## sentiment-ai/Scripts/python.exe' was loaded instead (see reticulate::py_config()
## for more information)
## <tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject object at 0x000001B85A056640>
check_sentiment.ai()
## NULL

If you are asked whether you want to install Miniconda, hit y. This Miniconda is different from the one you already installed because this one is for use within R. The function may print some error messages, but that is fine as long as the final message says NULL.

  6. Run the following code to test the installation.
sentiment_score(c("This installation process is too complicated!", 
                  "Only if it works in the end.", 
                  "But does it?", 
                  "It does work!"))
## This installation process is too complicated! 
##                                    -0.7417080 
##                  Only if it works in the end. 
##                                    -0.4529068 
##                                  But does it? 
##                                     0.1627681 
##                                 It does work! 
##                                     0.6917385

If you can see associated sentiment scores for each sentence, the installation is successful.

Comparing sentiment analysis results from the two packages

Using some sample texts, let’s compare the two packages. This example is borrowed from the sentiment.ai documentation.

Remember that the next time you close and reopen RStudio, you do not need to run the install_sentiment.ai() and check_sentiment.ai() functions again. Just call library(sentiment.ai) and init_sentiment.ai() to get ready for sentiment analysis with the sentiment.ai package.

# Assuming you've freshly opened RStudio..
library(sentiment.ai)

# Example texts
text <- c(
    "What a great car. It stopped working after a week.",
    "Steve Irwin working to save endangered species",
    "Bob Ross teaching people how to paint",
    "I saw Adolf Hitler on my vacation in Argentina...",
    "the resturant served human flesh",
    "the resturant is my favorite!",
    "the resturant is my favourite!",
    "this restront is my FAVRIT innit!",
    "the resturant was my absolute favorite until they gave me food poisoning",
    "This fantastic app freezes all the time!",
    "I learned so much on my trip to Hiroshima museum last year!",
    "What happened to the people of Hiroshima in 1945",
    "I had a blast on my trip to Nagasaki",
    "The blast in Nagasaki",
    "I love watching scary horror movies",
    "This package offers so much more nuance to sentiment analysis!",
     "you remind me of the babe. What babe? The babe with the power! What power? The power of voodoo. Who do? You do. Do what? Remind me of the babe!"
)

# sentiment.ai
sentiment.ai.score <- sentiment_score(text)

# Sentiment Analysis
sentimentAnalysis.score <- SentimentAnalysis::analyzeSentiment(text)$SentimentQDAP

example <- data.frame(target = text, 
                      sentiment.ai = sentiment.ai.score,
                      sentimentAnalysis = sentimentAnalysis.score)

rownames(example) <- NULL

example %>% 
  kableExtra::kable()
target | sentiment.ai | sentimentAnalysis
What a great car. It stopped working after a week. | -0.6983492 | 0.4000000
Steve Irwin working to save endangered species | 0.2718621 | 0.1666667
Bob Ross teaching people how to paint | 0.2806548 | 0.0000000
I saw Adolf Hitler on my vacation in Argentina… | -0.2888821 | 0.0000000
the resturant served human flesh | -0.3164368 | 0.2500000
the resturant is my favorite! | 0.7953137 | 0.5000000
the resturant is my favourite! | 0.7780578 | 0.0000000
this restront is my FAVRIT innit! | 0.6293544 | 0.0000000
the resturant was my absolute favorite until they gave me food poisoning | -0.3577476 | 0.0000000
This fantastic app freezes all the time! | -0.4124967 | 0.2500000
I learned so much on my trip to Hiroshima museum last year! | 0.6365266 | 0.0000000
What happened to the people of Hiroshima in 1945 | -0.5775906 | 0.0000000
I had a blast on my trip to Nagasaki | 0.7348465 | -0.3333333
The blast in Nagasaki | -0.5073422 | -0.5000000
I love watching scary horror movies | 0.5399646 | 0.0000000
This package offers so much more nuance to sentiment analysis! | 0.7359890 | 0.0000000
you remind me of the babe. What babe? The babe with the power! What power? The power of voodoo. Who do? You do. Do what? Remind me of the babe! | 0.5506284 | 0.3000000

Section 2. Sentiment analysis on Tweets

Now that we have a powerful tool in our hands, let’s apply it to Tweets!

Getting Tweets from user timelines

To acquire a sufficient number of Tweets for the class exercise, we will collect Tweets from the timelines of some famous figures.

This section borrows heavily from a post on Medium.

# Get time lines
obama <- rtweet::get_timeline("BarackObama", n = 3200)
biden <- rtweet::get_timeline("JoeBiden", n=3200)

# Add screen name and location
obama <- bind_cols(obama, 
                   users_data(obama) %>% select(screen_name, location))

biden <- bind_cols(biden, 
                   users_data(biden) %>% select(screen_name, location))

# Row-bind the two
tweets <- bind_rows(
  obama %>% select(text, screen_name, created_at, retweet_count, favorite_count),
  biden %>% select(text, screen_name, created_at, retweet_count, favorite_count)
  )
# Regex that matches URL-type string
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

# Tidy the string
tidy_tweets_words <- tweets %>% 
  # Drop retweets
  filter(!str_detect(text, "^RT")) %>%
  # Drop URLs
  mutate(text = str_replace_all(text, replace_reg, ""),
         text = gsub("@", "", text),
         text = gsub("\n\n", "", text)) %>%
  # Add id column
  mutate(id = row_number())

tidy_tweets_words
## # A tibble: 5,456 × 6
##    text                        scree…¹ created_at          retwe…² favor…³    id
##    <chr>                       <chr>   <dttm>                <int>   <int> <int>
##  1 "Midterm elections matter.… Barack… 2022-11-08 09:26:49    8577   31721     1
##  2 "The kind of slash and bur… Barack… 2022-11-07 15:00:31    5807   23266     2
##  3 "A simple message from Joe… Barack… 2022-11-07 11:30:20   10240   49030     3
##  4 "Let’s get this done. Go t… Barack… 2022-11-05 21:23:26    2004    7445     4
##  5 "Our democracy is on the b… Barack… 2022-11-05 21:23:25   12737   47586     5
##  6 "The more things change, t… Barack… 2022-11-05 20:30:58   13317   85385     6
##  7 "I’m fired up to be in Phi… Barack… 2022-11-05 17:16:46    2554   12052     7
##  8 "Pennsylvania, make sure y… Barack… 2022-11-05 15:31:40    5192   19662     8
##  9 "We only have three days l… Barack… 2022-11-05 11:53:50    2080    8493     9
## 10 "The only way to make our … Barack… 2022-11-04 16:04:25   13195   59571    10
## # … with 5,446 more rows, and abbreviated variable names ¹​screen_name,
## #   ²​retweet_count, ³​favorite_count
## # ℹ Use `print(n = ...)` to see more rows
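Before running the cleaning pipeline on thousands of tweets, it can help to check the regex on a single made-up example. The sketch below replicates the same three substitutions from the pipeline above in base R; the handle and URL are invented for illustration.

```r
# Regex that matches URL-type strings and a few HTML entities
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

# A made-up tweet (handle and URL are invented for illustration)
raw <- "Great rally today with @SomeHandle!\n\nDetails: https://t.co/abc123 &amp; more"

clean <- gsub(replace_reg, "", raw)  # drop URLs and HTML entities
clean <- gsub("@", "", clean)        # drop the @ in mentions
clean <- gsub("\n\n", "", clean)     # drop double line breaks

clean
```

If the URL, the `@`, and the `&amp;` entity are gone from `clean`, the pipeline will behave the same way on the real tweets.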

Applying sentiment analysis and visualizing the results

tidy_tweets_words <- tidy_tweets_words %>% 
  mutate(sentiment_ai = sentiment_score(text),
         sentimentAnaly = SentimentAnalysis::analyzeSentiment(text)$SentimentQDAP)
tidy_tweets_words %>% 
  mutate(ym = format(created_at, "%Y-%m")) %>% 
  group_by(screen_name, ym) %>% 
  summarise(sentiment = mean(sentiment_ai),
            retweet_count = log(mean(retweet_count))) %>% 
  mutate(ym = ym(ym)) %>% 
  ggplot(data = .) +
  geom_line(mapping = aes(x = ym, y = sentiment, color = retweet_count), lwd = 1) +
  facet_wrap(~screen_name) +
  scale_color_gradient(low="red", high="green") +
  labs(x = "Time Line", 
      y = "Sentiment Score \n",
      title = "Sentiment Score of Tweets from US presidents",
      subtitle = "Sentiment Score: -1 (negative) ~ +1 (positive)",
      color = "Retweet Count (logged)")

Try with your own search keywords

One example you can try is the names of different neighborhoods in Atlanta. How can you make sure that the neighborhood names you search for are not from other cities?
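One option is a spatial filter: rtweet’s search_tweets() accepts a geocode argument of the form "latitude,longitude,radius", so you can restrict results to tweets near Atlanta regardless of what the text says. The sketch below is one possible approach, not the only answer; the coordinates are approximately downtown Atlanta, and the query term "Midtown" and the 15-mile radius are arbitrary choices. It assumes you have already authenticated with rtweet.

```r
# "lat,long,radius" string centered roughly on downtown Atlanta
atl_geocode <- "33.749,-84.388,15mi"

if (requireNamespace("rtweet", quietly = TRUE)) {
  # Search for a neighborhood name, limited to the Atlanta area,
  # excluding retweets (assumes you are already authenticated)
  midtown <- rtweet::search_tweets(
    q = "Midtown",
    n = 500,
    geocode = atl_geocode,
    include_rts = FALSE
  )
}
```

Note that only a small share of tweets carry location information, so a geocode filter trades volume for precision; combining the neighborhood name with a city term in the query (e.g., "Midtown Atlanta") is a cruder but higher-volume alternative.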

Of course, feel free to pick whatever Tweets you want to test out.