Journey of Analytics

Deep dive into data analysis tools, theory and projects

Author: JOURNEYOFANALYTICS (page 2 of 2)

Twitter Sentiment Analysis


Today’s post is the first part of a 2-part tutorial series on how to create an interactive Shiny application in R that displays sentiment analysis for various phrases and search terms. The application accepts a search term from the user as input and graphically displays the sentiment analysis.

In keeping with this month’s theme, “API programming”, this project uses the Twitter API to perform a real-time search for tweets containing the user’s input term. A link to the live app on the Shiny website is provided, and a screenshot follows:


Shiny application for Twitter Sentiment Analysis

The project idea may seem simple at first, but will teach you the following skills:

  • Working with the Twitter API and dynamic data streaming (every time the search term changes, the program sends a new request to Twitter for relevant tweets),
  • Building an “interactive”, real-time application in Shiny/R,
  • Data visualization with R.

As always, the entire source code is also available for download on the Projects Page or can be forked from my GitHub account here.


The tutorial is divided into 3 parts:

  1. Introduction
  2. Twitter Connectivity & search
  3. Shiny design


Application Design:

Any good software project begins with design. For this application, the design flowchart is shown below:

Design Flowchart for Shiny app




Twitter Connectivity

This is similar to the August project and mainly consists of two calls to the Twitter API:

  • Authorize the Twitter API to mine data, using the setup_twitter_oauth() function and your Twitter developer keys.

library(twitteR)
consumer_key <- "ckey"
consumer_secret <- "csecret"
access_token <- "atoken"
access_secret <- "asecret"
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

  • Check whether the input search term returns tweets containing the phrase. If the number of tweets is 5 or fewer, return an error message. If it is more than 5, process the tweets and display a sentiment analysis bar chart. A custom function performs this check:

chk_searchterm <- function(term) {
  # look for all tweets containing this search term
  tw_search <- searchTwitter(term, n = 20, since = "2013-01-01")

  if (length(tw_search) <= 5) {
    return_term <- "None/few tweets to analyse for this search term. Please try again!"
  } else {
    return_term <- paste("Extracting max 20 tweets for Input =", term, ".Sentiment graph below ")
  }

  # return the status message (closing lines assumed; the original excerpt ends above)
  return_term
}
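The decision rule described above can be tried without calling Twitter at all. The sketch below is only an illustration, not the app's actual code: status_message() is a hypothetical helper that takes the number of tweets found as an argument instead of calling searchTwitter(), and the threshold of 5 is taken from the description above.

```r
# Hypothetical helper (not part of the app): the same "5 tweets or fewer"
# rule as chk_searchterm(), but taking the tweet count as an argument so it
# can be tried without Twitter credentials.
status_message <- function(term, n_found, min_tweets = 5) {
  if (n_found <= min_tweets) {
    "None/few tweets to analyse for this search term. Please try again!"
  } else {
    paste0("Extracting max 20 tweets for Input = ", term, ". Sentiment graph below")
  }
}

few  <- status_message("obscure term", n_found = 3)
many <- status_message("#rstats", n_found = 20)
print(few)
print(many)
```

Keeping the rule separate from the API call makes the boundary case (exactly 5 tweets) easy to check by hand.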



The bar graph is created by assigning numeric values to each of the positive and negative emotions using the tweet text. The emotions used are anger, anticipation, disgust, joy, sadness, surprise and trust, plus overall positive and negative sentiment.


Shiny webapp

The actual Shiny application design and Twitter connectivity are explained in the next post.

Twitter Analysis – Rio2016 Olympics


Olympics season is in full swing. In keeping with the spirit of this pinnacle of sports, we will use the Twitter API to extract tweets related to Rio2016 and analyze them for insights.

In this post we will perform the tasks described in Steps 1 to 5 below.


Step 1 – Connecting to Twitter API

We will use R to perform the analysis, with Twitter API keys (learn more about how to request these keys here) and the amazing “twitteR” package to obtain authorization for data extraction from the Twitter website.

Code for authorization is below:

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

Step 2 – Search Twitter API for specific tags

We will search Twitter for all tweets with the tag “#TeamUSA”.
Twitter puts some constraints on how much data can be extracted with each API call, so we limit our search to 2000 tweets. To ensure recency, we specify the tweets should have been posted after Aug 1, 2016. Code snippet below:

tw_search = searchTwitter('#TeamUSA', n=2000, since='2016-08-01', geocode='39.9526,-75.1652,50mi')

Note that the “geocode” argument is optional in the above command, but I added it to consider tweets from users located within 50 miles of Philadelphia, ensuring that coverage by NBC/Fox is definitely picked up! We save the tweets in an RDS file for easy access.

 saveRDS(tw_search, 'USteam_olympics.rds')
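As a small illustration of the save-and-reload step, the sketch below writes a toy object to an RDS file and reads it back. The object and the temporary file path are made up for the demo; the real tw_search list would be handled the same way.

```r
# Toy stand-in for the list of tweets returned by searchTwitter();
# the real object would be saved and restored in the same way.
tw_search_demo <- list(
  list(text = "Go #TeamUSA!", isRetweet = FALSE),
  list(text = "RT What a race #TeamUSA", isRetweet = TRUE)
)

rds_path <- file.path(tempdir(), "USteam_olympics_demo.rds")  # assumed demo path
saveRDS(tw_search_demo, rds_path)

# Later (or in another R session) the object comes back unchanged.
restored <- readRDS(rds_path)
print(identical(restored, tw_search_demo))
```

The later steps work on a data frame called tweet_doc. The post does not show how it is built; one plausible route, stated here as an assumption, is twitteR's twListToDF() applied to the saved list.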


Step 3 – Cleaning up and processing the tweets

First, we remove all special characters and emojis from tweets using the sapply() and iconv() functions.

tweet_doc$text <- sapply(tweet_doc$text, function(row) iconv(row, "latin1", "ASCII", sub=""))
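To see what this clean-up does, here is a minimal sketch on two made-up tweets. It assumes the input text is UTF-8 and therefore uses "UTF-8" as the source encoding, which differs from the "latin1" used in the line above; the exact replacement characters can vary by platform, so the output is printed rather than assumed.

```r
# Sample tweets with an accented letter and an emoji (made-up data).
tweets <- c("Caf\u00e9 watch party for #TeamUSA", "Gold! \U0001F947 #Rio2016")

# Assumption: the input is UTF-8, so "UTF-8" is the source encoding here;
# the post's own call uses "latin1".
# sub = "" replaces every byte that cannot be converted to ASCII with nothing.
clean <- sapply(tweets, function(row) iconv(row, "UTF-8", "ASCII", sub = ""),
                USE.NAMES = FALSE)
print(clean)
```

The hashtags survive because "#" and plain letters are already ASCII; only the characters outside ASCII are affected.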

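We convert the created time to the “America/Chicago” time zone (US Central time, 1 hour behind Philadelphia/NYC). Note that Rio de Janeiro itself uses Brasília time (“America/Sao_Paulo”), which was 1 hour ahead of Philadelphia/NYC in August 2016, so the Riotime column below actually holds US Central time.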

tweet_doc$Riotime = with_tz(tweet_doc$created, 'America/Chicago')
tweet_doc$strptime = as.POSIXct(strptime(tweet_doc$Riotime, "%Y-%m-%d %H:%M:%S"))

Storing the time as POSIXct allows us to aggregate tweets by hour, date, minute, etc., which we can derive as below:

tweet_doc$day = as.numeric(format(tweet_doc$strptime, "%d"))
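The same kind of conversion can be sketched in base R, without lubridate. The timestamps below are made up and assumed to be in UTC; “America/Sao_Paulo” is included for comparison because it is the tz database zone that covers Rio de Janeiro. The example also derives an hour column, which the hourly plot later in the post relies on.

```r
# Made-up tweet timestamps, assumed to be in UTC.
created <- as.POSIXct(c("2016-08-06 01:30:00", "2016-08-07 14:05:00"), tz = "UTC")

# format() with a tz argument shows the same instant in another time zone.
hour_chicago <- as.numeric(format(created, "%H", tz = "America/Chicago"))
day_chicago  <- as.numeric(format(created, "%d", tz = "America/Chicago"))
hour_rio     <- as.numeric(format(created, "%H", tz = "America/Sao_Paulo"))

print(data.frame(created, hour_chicago, day_chicago, hour_rio))
```

Note how the first tweet falls on Aug 6 in UTC but on Aug 5 in both local zones, which matters when tweets are grouped by day.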

We add a new variable to determine the type of device used for these tweets, using the device URL Twitter provides in the “statusSource” column.

par(mar = c(3, 3, 3, 2))
tweet_doc$statusSource_new = substr(tweet_doc$statusSource, regexpr('>', tweet_doc$statusSource) + 1,
regexpr('</a>', tweet_doc$statusSource) - 1)
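A quick way to check the extraction is to run it on a couple of sample values. The strings below are made-up examples of the HTML anchor stored in statusSource; the logic keeps the text between the first ">" and the closing "</a>".

```r
# Made-up examples of the HTML anchor Twitter stores in statusSource.
statusSource <- c(
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'
)

# Keep only the text between the first '>' and the closing '</a>'.
first  <- regexpr(">", statusSource) + 1
last   <- regexpr("</a>", statusSource) - 1
device <- substr(statusSource, first, last)
print(device)
```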


Step 4 – Graphical Insight

Plot 1: Tweets by hour of day:

gptime <- ggplot(tweet_doc, aes(hour)) + geom_bar(aes(fill = isRetweet)) + xlab('Tweets by hour')

We notice that the number of tweets increases as the evening progresses, with peak frequency at about 9 pm CDT (graph below).

Bar chart displaying frequency count of #TeamUSA tweets by hour


Plot 2: Tweets by device type:

gp <- ggplot(tweet_doc, aes(x = statusSource_new, fill = isRetweet)) + geom_bar()

The graph clearly shows iPhones dominating the user base.

Tweets by device used



Plot 3: Emotional Valence

We extract the emotional sentiment of tweets using a custom function:

polfn = lapply(orig$text, function(txt) {
# strip sentence enders so each tweet is analyzed as a sentence,
# and +'s which muck up regex
gsub('(\\.|!|\\?)\\s+|(\\++)', ' ', txt) %>%
# strip URLs
gsub(' http[^[:blank:]]+', '', .) %>%
# calculate polarity
Applying this, we get the most positive tweet:

“That looked like a very easy win for #TeamUSA  #beachvolleyball #Rio2016”

and the most negative tweet:

“I think it’s a very odd sport but damn those guys are fit #Rio2016 #waterpolo #TeamUSA”

Last, we plot a graph to display how the emotional valence changes over the day:

Emotional valence change in tweets


Plot 4: Word Cloud:

word cloud for #teamUSA tweets


We use the “text” column from tweet_doc object to create a word dictionary of the tweets after removing punctuation and unwanted characters. The size of the words increases with their frequency of appearance in the tweets. The image alongside shows such a wordcloud with highlighted words indicating high-frequency phrases.

wordCorpus <- Corpus(VectorSource(tweet_doc$text))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordcloud(words = wordCorpus, max.words=500, random.order=FALSE,
rot.per=0.35, use.r.layout=FALSE, colors=pal)
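The word frequencies that drive the word sizes can be sketched in base R. This is a simplified illustration on made-up tweets, not the tm/wordcloud pipeline itself; the lower-casing step is an addition for the demo and is not part of the code above.

```r
# Made-up tweets standing in for tweet_doc$text.
tweets <- c("Go #TeamUSA! Gold in swimming.",
            "Swimming gold again for #TeamUSA",
            "#TeamUSA wins gold")

# Lower-case, strip punctuation, split on whitespace, then count each word.
words <- unlist(strsplit(gsub("[[:punct:]]", "", tolower(tweets)), "\\s+"))
word_freq <- sort(table(words), decreasing = TRUE)
print(head(word_freq, 3))
```

Words at the top of this table are the ones a word cloud would draw largest.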


Plot 5: Sentiment Graph:

We use the get_nrc_sentiment() function from the “syuzhet” library to assign emotional values to each of the 2000 tweets we extracted.

mySentiment <- get_nrc_sentiment(tweet_doc$text)

This assigns a numeric value to each tweet to indicate various emotions expressed in the tweet – anger, anticipation, fear, joy, etc. We then add these values back to the tweet_doc object and compute column totals to derive the overall weight for each emotion. Code and image for overall sentiment scores are shown below:

ggplotly(ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
geom_bar(aes(fill = sentiment), stat = "identity") +
theme(legend.position = "none") +
xlab("Sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score for All Tweets"))
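The plot above expects a sentimentTotals table with a sentiment column and a count column. The post does not show how it is built, so the following is only one plausible sketch: it uses made-up scores shaped like the output of get_nrc_sentiment() (one row per tweet, one column per emotion) and sums each column.

```r
# Made-up scores shaped like the output of get_nrc_sentiment():
# one row per tweet, one column per emotion (only a few columns shown).
mySentiment <- data.frame(
  anger = c(0, 1, 0), joy = c(2, 0, 1), trust = c(1, 0, 1),
  negative = c(0, 2, 0), positive = c(3, 0, 2)
)

# Column totals give one overall count per emotion.
totals <- colSums(mySentiment)
sentimentTotals <- data.frame(sentiment = names(totals),
                              count = as.numeric(totals))
print(sentimentTotals)
```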


Overall Sentiment Scores - #TeamUSA


We can also use the scores to see whether positive/negative sentiments change with time of day or date. For our tag “#TeamUSA” this is clearly the case, as seen in the graph below: positive tweets peaked at noon on Opening Ceremony day (Aug 5) and negative sentiments peaked on the morning of Aug 7.

Sentiment by time



Step 5 – Usage for Brand monitoring.

The steps used in the analysis above can be easily modified for monitoring your brand, blog or product, as explained below:

  • Instead of “#TeamUSA” we can use any other tag, company/blog name, product or other relevant term to mine Twitter for tweets.
  • Periodically monitor the tweets about your product or brand to ensure that your “sentiment graph” always tends towards positive emotions. If not, ensure your staff is working diligently to counter any negative tweets or concerns among your users.
  • The graphical analysis for “Tweets by hour of day” could be used to monitor what time your users/audience are most active. You could use this insight to publish more content during this time and to ensure your customer support is always available during this period to effectively engage your audience.
  • If your “device type” graph shows that a specific device dominates (e.g., specific Android phone brands), make sure your content caters correctly for mobile users.
  • The high-frequency words in the “wordcloud” indicate trending topics, so these can be used as great ideas for new content topics or short-term ads to ride the publicity wave! 🙂


The entire source code for this analysis is available here blog_twitter_olympics or can be forked from the Github page. Please take a look and share your thoughts and feedback. Until next time, adieu! 🙂

August Project Updates

Hello All,

The theme for August is API programming for social media platforms.

working with twitter API

twitter API code with R/ Python

For the August project, I’ve concentrated on working with Twitter API, using both Python and R programming. The code can be downloaded from the Projects Page or forked from my Github account.

Working With APIs:

Before we learn what the code does, please note that you will first need to request Twitter developer tokens (values for consumer_key, consumer_secret, access_key and access_secret) to authorize your account to extract data from the Twitter platform. If you do not have these tokens yet, you can easily learn how to request them using the excellent documentation on the Twitter Developer website. Once you have the tokens, please replace these variables at the beginning of the program with your own values.

Second, you will need to install the appropriate twitter packages for running programs in Python and R. This makes it easy to extract data from Twitter since these packages have pre-written functions for various tasks like Twitter authorization, looking up usernames, posting to Twitter, investigating follower counts, extracting profile data in json format, and much more.

“Tweepy” is the package for Python and “twitteR” for R programs, so please install them locally.


Tracking Twitter Follower Growth:

Although Twitter provides a great way to view your own Twitter follower growth, there is no way to download or track this data locally. The Python program added in this month’s code does just that: it extracts the follower count and stores it in a CSV file that can be opened in Excel. This makes it possible to track the (historical) growth or decline of the Twitter follower count over a period of time, starting from today.

With this program you can monitor your own account and other Twitter handles as well! Of course, you can’t go back in time to view older counts, but hey, at least you have started. Plus, you can manually add values for your own accounts.
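The tracking idea itself is simple enough to sketch in R as well: append one dated row per run to a CSV file. This is not the Python program from the project; log_follower_count() is a hypothetical helper, the handle and counts are made up, and in a real run the count would come from the Twitter API (for example twitteR's getUser(handle)$followersCount) rather than being typed in.

```r
# Hypothetical helper: append one dated follower count to a CSV file.
# In a real run the count would come from the Twitter API; here it is
# passed in directly so the example works offline.
log_follower_count <- function(handle, count, file, date = Sys.Date()) {
  row <- data.frame(date = as.character(date), handle = handle, followers = count)
  new_file <- !file.exists(file)
  # write the header only when the file is first created
  write.table(row, file, sep = ",", row.names = FALSE,
              col.names = new_file, append = !new_file)
  invisible(row)
}

csv_path <- file.path(tempdir(), "follower_counts_demo.csv")  # assumed demo path
unlink(csv_path)  # start the demo from an empty file
log_follower_count("@example_handle", 1200L, csv_path, as.Date("2016-08-01"))
log_follower_count("@example_handle", 1215L, csv_path, as.Date("2016-08-02"))
history <- read.csv(csv_path)
print(history)
```

Running the helper once a day (for example from a scheduled job) would build up the history that Twitter itself does not let you download.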

Track Twitter follower count

File tracking Twitter follower count

(Technically, for twitter handles you do not own, you could get the date of joining of every follower and then deduce when they possibly followed someone. A post for another day, though! )

Extracting Data about Twitter Followers

Follower count is great, but you also want to know the detailed profile of your followers and other interesting twitter accounts. Who are these followers? Where are they located?

There are 2 R programs in the August Project which help you gather this information.

The first (followers_v2.R) extracts a list of all follower ids for a specific Twitter account and stores it in a file. The Twitter API returns at most 5000 follower ids per request, so this program uses cursor pagination to pull the ids in chunks of 5000 per iteration. Think of the list of follower ids like the contents of a book: some books are thicker, so you have to turn more pages! Similarly, if a Twitter account has very few followers, the program completes in 1-2 iterations!

The program example works on the twitter account “@phillydotcom” which has >180k followers.  The cursor iteration process itself is implemented using a simple “while” loop.
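The shape of such a cursor loop can be sketched without touching the API. In the example below, fetch_follower_page() is a hypothetical stand-in for one API call: it serves a made-up list of 12,001 ids in pages of up to 5000 and returns the cursor for the next page. It follows Twitter's convention that a cursor of -1 requests the first page and a next cursor of 0 means there are no more pages; real cursors are opaque values rather than positions.

```r
# Stand-in for one Twitter API call (hypothetical, for illustration only).
all_ids_on_server <- seq_len(12001)
fetch_follower_page <- function(cursor, page_size = 5000) {
  start <- if (cursor == -1) 1 else cursor
  end <- min(start + page_size - 1, length(all_ids_on_server))
  next_cursor <- if (end < length(all_ids_on_server)) end + 1 else 0
  list(ids = all_ids_on_server[start:end], next_cursor = next_cursor)
}

follower_ids <- c()
cursor <- -1          # -1 asks for the first page
iterations <- 0
while (cursor != 0) { # a next cursor of 0 means the last page was reached
  page <- fetch_follower_page(cursor)
  follower_ids <- c(follower_ids, page$ids)
  cursor <- page$next_cursor
  iterations <- iterations + 1
}
cat(length(follower_ids), "ids collected in", iterations, "iterations\n")
```

With 12,001 ids the loop turns three "pages": two full ones of 5000 and a final one of 2001.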

Twitter follower details


The second R program (dets_followers_v2.R) uses the list of follower_ids to pull in detailed information about followers. For the scope of this project I am only deriving screen name, username, location and follower count for all of my followers. Details are stored in a tabular format as shown in the image alongside. You can use this data to geographically segment your Twitter followers, analyze “influencer” followers (users with 25000 or more followers) and lots more.
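As a rough illustration of what can be done with such a table, the sketch below filters “influencer” followers and counts followers per location. The data frame is made up and its column names (screenName, location, followersCount) are assumptions chosen for the demo; the file produced by the project may name them differently.

```r
# Made-up follower details in the tabular shape described above.
followers <- data.frame(
  screenName = c("data_fan", "philly_news", "r_coder", "big_brand"),
  location = c("Philadelphia, PA", "Philadelphia, PA", "London", "New York, NY"),
  followersCount = c(320, 48000, 1500, 250000),
  stringsAsFactors = FALSE
)

# "Influencer" followers: 25000 or more followers of their own.
influencers <- followers[followers$followersCount >= 25000, ]

# Simple geographic segmentation: number of followers per location.
by_location <- sort(table(followers$location), decreasing = TRUE)
print(influencers$screenName)
print(by_location)
```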

Please take a look at the code and provide your valuable feedback and comments in the comments section.

Happy House-Warming!

Welcome to the new Blog homepage for Journey of Analytics.

The old blog is still live and all old content will still be available on the previous site. So if you have bookmarked any links or pages, they will still work. However, new posts will no longer appear on the old site, so please bookmark this page as well.

Thank you for being a loyal reader of Journey of Analytics.

Happy Coding!
