Journey of Analytics

Deep dive into data analysis tools, theory and projects


How to raise money on Kickstarter – extensive EDA and prediction tutorial

In this tutorial, we will explore the characteristics of projects on Kickstarter and try to understand what separates the winners from the projects that failed to reach their funding goals.

Qs for Exploratory Analysis:

We will start our analysis with the aim of answering the following questions:

    1. How many projects were successful on Kickstarter, by year and category?
    2. Which sub-categories raised the most money?
    3. Which countries do projects originate from?
    4. How many projects exceeded their funding goal by 50% or more?
    5. Did any projects reach $100,000 or more? $1,000,000 or higher?
    6. What was the average amount contributed by each backer, and how does this change over time? Does this amount differ with categories?
    7. What is the average funding period?


Predicting success rates:
Using the answers from the above questions, we will try to create a model that can predict which projects are most likely to be successful.

The dataset is available on Kaggle, and you can run this script LIVE using this kernel link. If you find this tutorial useful or interesting, then please do upvote the kernel ! 🙂

Step1 – Data Pre-processing

a) Let us take a look at the input dataset :

The projects are divided into main and sub-categories. The pledged amount “usd_pledged” has an equivalent value converted to USD, called “usd_pledged_real”. However, the goal amount does not have this conversion. So for now, we will use the amounts as is.

We can see how many people are backing each individual project using the column, “backers”.

b) Now let us look at the first 5 records:

The name doesn’t really indicate any specific pattern although it might be interesting to see if longer names have better success rates. Not pursuing that angle at this time, though.

c) Looking for missing values:

Hurrah, a really clean dataset, even after searching for “empty” strings. 🙂
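The check described above can be sketched in base R. The toy data frame below is a made-up stand-in for the Kaggle file (and, unlike the real data, deliberately contains one NA and one empty string so the counts are visible); the column names are assumptions.

```r
# Toy stand-in for the Kickstarter data (hypothetical values)
ks <- data.frame(
  name    = c("Alpha", "", "Gamma"),
  backers = c(10, NA, 3),
  country = c("US", "GB", "US"),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_counts <- colSums(is.na(ks))

# Number of empty strings in each column (NA comparisons are ignored)
empty_counts <- colSums(ks == "", na.rm = TRUE)

print(na_counts)
print(empty_counts)
```

If both vectors are all zeros for the real dataset, there is nothing to impute or drop at this stage.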

 d) Date Formatting and splitting:

We have two dates in our dataset: “launch date” and “deadline date”. We convert them from strings to date format.
We also split these dates into the respective year and month columns, so that we can plot variations over time.
So we will now have 4 new columns: launch_year, launch_month, deadline_year and deadline_month.
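A minimal sketch of this conversion in base R, assuming the raw columns are called launched and deadline and hold strings that start with a YYYY-MM-DD date (both are assumptions about the input file):

```r
# Toy rows; the column names 'launched' and 'deadline' are assumed
ks <- data.frame(
  launched = c("2015-08-11 12:12:28", "2017-09-02 04:43:57"),
  deadline = c("2015-10-09", "2017-11-01"),
  stringsAsFactors = FALSE
)

# Keep only the date part (first 10 characters) and convert to Date
ks$launched <- as.Date(substr(ks$launched, 1, 10), format = "%Y-%m-%d")
ks$deadline <- as.Date(substr(ks$deadline, 1, 10), format = "%Y-%m-%d")

# Split each date into year and month columns
ks$launch_year    <- as.integer(format(ks$launched, "%Y"))
ks$launch_month   <- as.integer(format(ks$launched, "%m"))
ks$deadline_year  <- as.integer(format(ks$deadline, "%Y"))
ks$deadline_month <- as.integer(format(ks$deadline, "%m"))

print(ks)
```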

Exploratory analysis:

a) How many projects are successful?

We see that “failed” and “successful” are the two main categories, comprising ~88% of our dataset.
Sadly we do not know why some projects are marked “undefined” or “canceled”.
“live” projects are those where the deadlines have not yet passed, although a few among them have already achieved their goal.
Surprisingly, some “canceled” projects had also met their goals (pledged_amount >= goal).
Since these other categories are a very small portion of the dataset, we will subset and only consider records with status “failed” or “successful” for the rest of the analysis.
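The share of each status and the subsetting step could look like the sketch below. The column name state and the ten toy rows are assumptions, so the percentages printed here are not the real ones.

```r
# Toy status column (hypothetical values, not the real distribution)
ks <- data.frame(
  state = c("failed", "successful", "failed", "canceled", "live",
            "successful", "failed", "undefined", "failed", "successful"),
  stringsAsFactors = FALSE
)

# Percentage of projects in each status
share <- round(100 * prop.table(table(ks$state)), 1)
print(share)

# Keep only the two main outcomes for the rest of the analysis
ks_main <- ks[ks$state %in% c("failed", "successful"), , drop = FALSE]
print(table(ks_main$state))
```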

b) How many countries have projects on kickstarter?

We see that projects are overwhelmingly from the US. Some records carry the malformed country value N,0" instead of a country code, so we mark them as unknown.

c) Number of projects launched per year:

Looks like some records say dates like 1970, which does not look right. So we discard any records with a launch / deadline year before 2009.
Plotting the counts per year on a graph:
From the graph below, it looks like the count of projects peaked in 2015, then went down. However, this should NOT be taken as an indicator of success rates.
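As a rough sketch, the year filter and the per-year counts can be done in base R as follows. The launch_year and deadline_year columns are the ones derived earlier; the rows are invented.

```r
# Toy rows, including one bogus 1970 record (hypothetical values)
ks <- data.frame(
  launch_year   = c(1970, 2009, 2015, 2015, 2016),
  deadline_year = c(1970, 2009, 2015, 2016, 2016)
)

# Discard records dated before Kickstarter existed (2009)
ks <- ks[ks$launch_year >= 2009 & ks$deadline_year >= 2009, ]

# Count of projects launched per year, ready for a bar chart
projects_per_year <- table(ks$launch_year)
print(projects_per_year)
```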



Drilling down a bit more to see count of projects by main_category.

Over the years, maximum number of projects have been in the categories:

    1. Film & Video
    2. Music
    3. Publishing

 d) Number of projects by sub-category: (Top 20 only)

The Top 5 sub-categories are:

    1. Product Design
    2. Documentary
    3. Music
    4. Tabletop Games (interesting!!!)
    5. Shorts (really?! )

Let us now see “Status” of projects for these Top 5 sub_categories:
From the graph below, we see that for the sub-categories “shorts” and “tabletop games” there are more successful projects than failed ones.

 e) Backers by category and sub-category:

Since there are a lot of sub-categories, let us explore the sub-categories under the main theme “Design” 

Product design is not just the sub-category with the highest count of projects, but also the category with the highest success ratio.
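One way to compute a success ratio per sub-category is a two-way table, sketched below. The category and state column names are assumptions, and the six rows are invented purely to show the mechanics.

```r
# Toy rows for the "Design" theme (hypothetical values)
ks <- data.frame(
  category = c("Product Design", "Product Design", "Product Design",
               "Graphic Design", "Graphic Design", "Architecture"),
  state    = c("successful", "successful", "failed",
               "successful", "failed", "failed"),
  stringsAsFactors = FALSE
)

# Rows = sub-category, columns = outcome
counts <- table(ks$category, ks$state)

# Share of successful projects within each sub-category
success_ratio <- prop.table(counts, margin = 1)[, "successful"]
print(sort(success_ratio, decreasing = TRUE))
```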

 f) Add a flag to see how many projects were funded beyond their goal:

So ~40% of projects reached or surpassed their goal, which matches the number of successful projects.
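The flags for this step, and for the earlier questions about 50% overfunding and the $100,000 and $1,000,000 marks, could be derived as in this sketch. The goal and pledged column names are assumptions and the amounts are invented.

```r
# Toy amounts (hypothetical values)
ks <- data.frame(
  goal    = c(1000, 5000, 20000, 500000),
  pledged = c(1600, 4000, 150000, 1200000)
)

ks$goal_reached <- ks$pledged >= ks$goal        # met or beat the goal
ks$over_50pct   <- ks$pledged >= 1.5 * ks$goal  # beat the goal by 50% or more
ks$over_100k    <- ks$pledged >= 1e5            # raised $100,000 or more
ks$over_1m      <- ks$pledged >= 1e6            # raised $1,000,000 or more

# Percentage of projects for which each flag is TRUE
flag_cols <- c("goal_reached", "over_50pct", "over_100k", "over_1m")
print(100 * colMeans(ks[, flag_cols]))
```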

 g) Calculate average contribution per backer:

From the mean, median and max values we quickly see that the median amount contributed by each backer is only ~$40 whereas the mean is higher due to the extreme positive values. The max amount by a single backer is ~$5000.
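A small sketch of the per-backer calculation, assuming columns named pledged and backers. Projects with zero backers are given a value of 0 here to avoid dividing by zero; that handling is my assumption, not something the tutorial states.

```r
# Toy rows (hypothetical values)
ks <- data.frame(
  pledged = c(2000, 150, 0, 50000),
  backers = c(40, 3, 0, 10)
)

# Average amount per backer; 0 when a project has no backers
ks$pledge_per_backer <- ifelse(ks$backers > 0, ks$pledged / ks$backers, 0)

# Mean, median and max in one call
print(summary(ks$pledge_per_backer))
```

A single large value (5000 in the toy data) pulls the mean well above the median, which is the same effect described above.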

h) Calculate reach_ratio

The amount per backer is a good start, but what if the goal amount itself is only $1000? Then an average contribution per backer of $50 implies we only need 20 backers.
So to better understand the probability of a project’s success, we create a derived metric called “reach_ratio”.
This takes the average user contribution and compares it against the goal fund amount.

We see the median reach_ratio is <1%. Only in the third quartile do we even touch 2%!
Clearly most projects have a very low reach ratio. We could subset for “successful” projects only and check if the reach_ratio is higher.
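The exact formula is not shown in the text, so the following is only one plausible reading: reach_ratio as the average contribution per backer expressed as a percentage of the goal. Using the $1000 goal and $50 contribution from the example above:

```r
# Numbers from the worked example in the text
goal              <- 1000
pledge_per_backer <- 50

# Assumed definition: average contribution as a percentage of the goal
reach_ratio <- pledge_per_backer / goal * 100

# Backers needed if everyone gives the average amount
backers_needed <- ceiling(goal / pledge_per_backer)

print(reach_ratio)     # percentage of the goal covered by one average backer
print(backers_needed)
```

Under this reading, a reach_ratio below 1% means a project needs more than 100 average backers to hit its goal.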

 i) Number of days to achieve goal:
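No code or result survives for this step. One plausible reading, tied to the earlier question about the average funding period, is the number of days between launch and deadline. A sketch under that assumption, with assumed column names and invented dates:

```r
# Toy dates (hypothetical values)
ks <- data.frame(
  launched = as.Date(c("2015-08-11", "2017-09-02")),
  deadline = as.Date(c("2015-10-09", "2017-11-01"))
)

# Funding period in whole days
ks$funding_days <- as.numeric(ks$deadline - ks$launched)

print(ks$funding_days)
print(mean(ks$funding_days))
```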

 Predictive Analytics:

We will apply a very simple decision tree algorithm to our dataset.
Since we do not have a separate “test” set, we will split the input dataframe into 2 parts (70/30 split).
We will use the smaller set to test the accuracy of our algorithm.
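A 70/30 split like this is commonly done by sampling row indices. The sketch below uses a fixed seed so the split is reproducible; the data frame is a placeholder.

```r
set.seed(42)  # fixed seed so the same rows are picked on every run

# Placeholder data frame with 100 rows
ks <- data.frame(id = 1:100, backers = rpois(100, 20))

# Randomly pick 70% of the row numbers for training
train_idx <- sample(seq_len(nrow(ks)), size = floor(0.7 * nrow(ks)))

train <- ks[train_idx, ]   # 70% used to fit the model
test  <- ks[-train_idx, ]  # remaining 30% held out for testing

print(c(train = nrow(train), test = nrow(test)))
```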

Taking a peek at the decision tree rules:

kickstarter success decision tree

Thus we see that “backers” and “reach-ratio” are the main significant variables.
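The tutorial does not say which package produced the tree, so the following is only a sketch that assumes rpart (a recommended package that ships with R). The data are synthetic and the rule that generates the outcome is invented, so the splits printed here will not match the real tree.

```r
library(rpart)

set.seed(1)
n <- 200

# Synthetic predictors (hypothetical values)
toy <- data.frame(
  backers     = rpois(n, 30),
  reach_ratio = runif(n, 0, 10)
)

# Invented rule for the outcome, just so the tree has something to learn
toy$state <- factor(ifelse(toy$backers > 30 & toy$reach_ratio < 5,
                           "successful", "failed"))

# Fit a classification tree and print its rules
fit <- rpart(state ~ backers + reach_ratio, data = toy, method = "class")
print(fit)

# Predicted class for each row
pred <- predict(fit, toy, type = "class")
```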

Re-applying the tree rules to the training set itself, we can validate our model:

From the above tables, we see that the error rate = ~3% and area under curve >= 97%

Finally applying the tree rules to the test set, we get the following stats:

From the above tables, we see that the error rate is still ~3% and the area under the curve is still >= 97%.
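For readers who want to reproduce these two statistics, here is a base R sketch on eight invented predictions. The error rate comes from comparing predicted and actual classes, and the area under the curve uses the rank-based (Mann-Whitney) formula. The 0.5 cut-off is an assumption.

```r
# Invented labels (1 = successful) and predicted probabilities
actual <- c(1, 1, 1, 0, 0, 0, 1, 0)
score  <- c(0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1)

# Classify with a 0.5 cut-off and tabulate against the truth
predicted <- as.integer(score >= 0.5)
confusion <- table(actual = actual, predicted = predicted)
print(confusion)

# Share of rows that were classified wrongly
error_rate <- mean(predicted != actual)

# AUC: probability that a random positive outranks a random negative
r     <- rank(score)
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc   <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(c(error_rate = error_rate, auc = auc))
```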



Thus in this tutorial, we explored the factors that contribute to a project’s success. Main theme and sub-category were important, but the number of backers and “reach_ratio” were found to be most critical.
If a founder wanted to gauge their probability of success, they could measure their “reach-ratio” halfway to the deadline, or perhaps when 25% of the timeline is complete. If the numbers are lower, it means they need to double down and use promotions/social media marketing to get more backers and funding.

If you liked this tutorial, feel free to fork the script. And don’t forget to upvote the kernel! 🙂

Twitter Sentiment Analysis


Today’s post is a 2-part tutorial series on how to create an interactive ShinyR application that displays sentiment analysis for various phrases and search terms. The application accepts a search term from the user as input and graphically displays sentiment analysis.

In keeping with this month’s theme – “API programming”, this project uses the Twitter API to perform real-time search for tweets containing the user input term. Live App Link on Shiny website is provided and screenshot is as follows:

Twitter Sentiment Analysis Shiny

Shiny application for Twitter Sentiment Analysis

The project idea may seem simple at first, but will teach you the following skills:

  • working with Twitter API and dynamic data streaming (every time the search term changes, the program sends a new request to Twitter for relevant tweets),
  • Building an “interactive”, real-time application in Shiny/R,
  • data visualization with R

As always, the entire source code is also available for download on the Projects Page or can be forked from my  Github account here.


The tutorial is divided into  3 parts :

  1. Introduction
  2. Twitter Connectivity & search
  3. Shiny design


Application Design:

Any good software project begins with the design first. For this application, the design flowchart is shown below:

Design Flowchart for Shiny app



Twitter Connectivity

This is similar to the August project and mainly consists of two calls to the Twitter API:

  • authorize twitter api to mine data, using setup_twitter_oauth() function and your Twitter developer keys.

consumer_key = "ckey"
consumer_secret = "csecret"
access_token = "atoken"
access_secret = "asecret"
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

  • check whether the input search term returns tweets containing the phrase. If the number of tweets is <= 5, return an error message. If the number of tweets is > 5, process the tweets and display a sentiment analysis barchart. A custom function performs this check:

chk_searchterm <- function(term) {
  # look for all tweets containing this search term
  tw_search <- searchTwitter(term, n = 20, since = "2013-01-01")

  if (length(tw_search) <= 5) {
    return_term <- "None/few tweets to analyse for this search term. Please try again!"
  } else {
    return_term <- paste("Extracting max 20 tweets for Input =", term, ". Sentiment graph below")
  }

  return_term
}



The bargraph is created by assigning numeric values for each of the positive and negative emotions using the tweet text. Emotions used – anger, anticipation, disgust, joy, sadness, surprise, trust, overall positive and negative sentiment.


Shiny webapp

The actual Shiny application design and twitter connectivity are explained in the next post.

Twitter Analysis – Rio2016 Olympics

Olympics season is in full swing. In keeping up with the spirit of this pinnacle of sports, we will use the Twitter API to extract tweets related to Rio2016 and analyze them to extract insights.

In this post we will perform the following tasks:


Step 1 – Connecting to Twitter API

We will use R programming to perform the analysis using Twitter API keys (learn more about how to request these keys here) and the amazing “twitteR” package to gain permission for data extraction from the Twitter website.

Code for authorization is below:

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

Step 2 – Search Twitter API for specific tags

We will search Twitter for all tweets with the tag “#TeamUSA”.
Twitter puts some constraints on how much data can be extracted with each API call, so we limit our search to 2000 tweets. To ensure recency, we specify the tweets should have been posted after Aug 1, 2016. Code snippet below:

tw_search = searchTwitter('#TeamUSA', n=2000, since='2016-08-01', geocode='39.9526,-75.1652,50mi')

Note, the “geocode” option is optional in the above command, but I added it to consider tweets from users located within 50 miles of Philadelphia, ensuring that reactions to the NBC/Fox coverage are definitely picked up! We save the tweets in an RDS file for easy access.

 saveRDS(tw_search, 'USteam_olympics.rds')


Step 3 – Cleaning up and processing the tweets

First, we remove all special characters and emojis from tweets using the sapply() and iconv() functions.

tweet_doc$text <- sapply(tweet_doc$text, function(row) iconv(row, "latin1", "ASCII", sub=""))
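To see what this call does, here is the same idiom applied to two made-up strings. Any byte that cannot be represented in ASCII is replaced by the empty string given in sub, so symbols and accented letters are dropped rather than transliterated.

```r
# Made-up tweets: one with a symbol and an accented letter, one plain
raw_text <- c("Go #TeamUSA \u2605 caf\u00e9", "plain ascii tweet")

# Same idiom as in the post: drop everything that is not ASCII
clean_text <- sapply(raw_text,
                     function(row) iconv(row, "latin1", "ASCII", sub = ""),
                     USE.NAMES = FALSE)

print(clean_text)
```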

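We convert the created time to US Central time (“America/Chicago”), which is 1 hour behind Philadelphia/NYC. Note that Rio de Janeiro itself is on Brasília time (“America/Sao_Paulo”, UTC-3), which is 1 hour ahead of Philadelphia in August, so the “Riotime” column below actually holds Central time, and the hours reported later in this post are in CDT.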

tweet_doc$Riotime = with_tz(tweet_doc$created, 'America/Chicago')
tweet_doc$strptime = as.POSIXct(strptime(tweet_doc$Riotime, "%Y-%m-%d %H:%M:%S"))

The as.POSIXct() allows us to aggregate tweets by hour/ date / minute, etc. which we can derive as below:

tweet_doc$day = as.numeric(format(tweet_doc$strptime, "%d"))
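The plots further down use an hour column that is not derived in the snippets above. A base R sketch of how hour and day could be extracted, assuming the created timestamp is in UTC; format() with a tz argument is used here in place of lubridate's with_tz(), and the timestamp is invented.

```r
# Invented timestamp, assumed to be in UTC
created <- as.POSIXct("2016-08-05 23:30:00", tz = "UTC")

# Hour and day in US Central time (the zone used in this post)
hour_central <- as.numeric(format(created, "%H", tz = "America/Chicago"))
day_central  <- as.numeric(format(created, "%d", tz = "America/Chicago"))

# Same instant in local Rio de Janeiro time, for comparison
hour_rio <- as.numeric(format(created, "%H", tz = "America/Sao_Paulo"))

print(c(hour_central = hour_central, day_central = day_central, hour_rio = hour_rio))
```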

We add a new variable to determine the type of device used for these tweets, using the device URL Twitter provides under the column “statusSource”.

par(mar = c(3, 3, 3, 2))
tweet_doc$statusSource_new = substr(tweet_doc$statusSource, regexpr('>', tweet_doc$statusSource) + 1,
regexpr('</a>', tweet_doc$statusSource) - 1)
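To make the substr() and regexpr() arithmetic concrete, here it is on two sample strings shaped like the HTML anchor Twitter returns in statusSource (the exact strings are illustrative). The gsub() line is an alternative of my own that strips all tags and gives the same result.

```r
# Illustrative statusSource values
statusSource <- c(
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>'
)

# Text between the first '>' and the closing '</a>'
device <- substr(statusSource,
                 regexpr(">", statusSource) + 1,
                 regexpr("</a>", statusSource) - 1)

# Alternative: remove every HTML tag
device_alt <- gsub("<[^>]+>", "", statusSource)

print(device)
```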


Step 4 – Graphical Insight

Plot 1: Tweets by hour of day:

gptime <- ggplot(tweet_doc, aes(hour)) + geom_bar(aes(fill = isRetweet)) + xlab('Tweets by hour')

We notice that the number of tweets increases as the evening passes, with peak frequency at about 9 pm CDT (graph below).

Bar chart displaying frequency count of #TeamUSA tweets by hour

#TeamUSA tweets by hour

Plot 2: Tweets by device type:

gp <- ggplot(tweet_doc, aes(x= statusSource , fill = isRetweet)) + geom_bar( )

The graph clearly shows iPhones dominating the user base.

Tweets by device used


Plot 3: Emotional Valence

We extract the emotional sentiment of tweets using a custom function:

polfn = lapply(orig$text, function(txt) {
# strip sentence enders so each tweet is analyzed as a sentence,
# and +'s which muck up regex
gsub('(\\.|!|\\?)\\s+|(\\++)', ' ', txt) %>%
# strip URLs
gsub(' http[^[:blank:]]+', '', .) %>%
# calculate polarity

Applying this, we get the most positive tweet:

“That looked like a very easy win for #TeamUSA  #beachvolleyball #Rio2016”

and the most negative tweet:

“I think it’s a very odd sport but damn those guys are fit #Rio2016 #waterpolo #TeamUSA”

Last, we plot a graph to display how emotional valence changes over the day:

Emotional valence change in tweets

Plot 4 : Word Cloud:

word cloud for #teamUSA tweets

We use the “text” column from tweet_doc object to create a word dictionary of the tweets after removing punctuation and unwanted characters. The size of the words increases with their frequency of appearance in the tweets. The image alongside shows such a wordcloud with highlighted words indicating high-frequency phrases.

wordCorpus <- Corpus(VectorSource(tweet_doc$text))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordcloud(words = wordCorpus, max.words=500, random.order=FALSE,
rot.per=0.35, use.r.layout=FALSE, colors=pal)
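The word sizes in the cloud are driven by simple frequency counts. As an illustration that does not need the tm or wordcloud packages, the same counts can be produced in base R; the two tweets below are made up.

```r
# Made-up tweets
tweets <- c("Go TeamUSA! Gold for TeamUSA", "TeamUSA wins gold again")

# Remove punctuation, lower-case, and split on whitespace
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", tweets)), "\\s+"))

# Frequency table, most common words first
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

The words at the top of this table are the ones that would be drawn largest.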


Plot 5 : Sentiment Graph:

We use the get_nrc_sentiment() function from the “syuzhet” library to assign emotional values to each of the 2000 tweets we extracted.

mySentiment <- get_nrc_sentiment(tweet_doc$text)

This assigns a numeric value to each tweet to indicate various emotions expressed in the tweet – anger, anticipation, fear, joy, etc. We then add these values back to the tweet_doc object and compute column totals to derive the overall weight for each emotion. Code and image for overall sentiment scores are shown below:

ggplotly(ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
geom_bar(aes(fill = sentiment), stat = "identity") +
theme(legend.position = "none") +
xlab("Sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score for All Tweets"))
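The sentimentTotals data frame used in the plot is not built in the snippets shown. One way it could be derived from the per-tweet scores is with column totals, as sketched here on a tiny made-up score table with only four of the emotion columns.

```r
# Made-up per-tweet scores, shaped like get_nrc_sentiment() output
mySentiment <- data.frame(
  anger    = c(0, 1, 0),
  joy      = c(2, 0, 1),
  positive = c(3, 0, 2),
  negative = c(0, 2, 0)
)

# One row per emotion with its total across all tweets
sentimentTotals <- data.frame(
  sentiment = names(mySentiment),
  count     = colSums(mySentiment),
  row.names = NULL
)

print(sentimentTotals)
```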


Overall Sentiment Scores - #TeamUSA

We can also use the scores to see if positive/negative sentiments change with time of day or date. For our tag “#TeamUSA” we notice this is patently true as seen in graph below:
Positive tweets peaked at noon on Opening Ceremony day (Aug 5) and negative sentiments peaked on the morning of Aug 7.
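A time breakdown like this can be obtained by summing the positive and negative scores per hour. The sketch below uses aggregate() on invented rows and assumes the scores have already been added to tweet_doc next to an hour column.

```r
# Invented rows: hour of the tweet plus its positive/negative scores
tweet_doc <- data.frame(
  hour     = c(9, 9, 12, 12, 12),
  positive = c(1, 0, 2, 3, 1),
  negative = c(0, 2, 0, 0, 1)
)

# Total positive and negative score for each hour
by_hour <- aggregate(cbind(positive, negative) ~ hour, data = tweet_doc, FUN = sum)
print(by_hour)
```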

Sentiment by time


Step 5 – Usage for Brand monitoring.

The steps used in the analysis above can be easily modified for monitoring your brand, blog or product, as explained below:

  • Instead of “#TeamUSA” we can use any other tag or company/blog  name or product or any other relevant tags to mine Twitter for tweets.
  • Periodically monitor the tweets about your product or brand to ensure that your “sentiment graph” always tends to positive emotions. If not, ensure your staff is working diligently to counter any negative tweets/ concerns among your users.
  • The graphical analysis for “Tweets by hour of day ” could be used to monitor what time your users/ audience is most active. You could use this insight to publish more content during this time and to ensure your customer support is always available during this period to effectively engage your audience.
  • If your “device type” graph shows that a specific device dominates (e.g. specific Android phone brands), make sure your content displays correctly for those mobile users.
  • The high-frequency words in “wordcloud” indicate trending topics, so these can be used as great ideas for new content topics or short-term ads to ride the publicity wave! 🙂


The entire source code for this analysis is available here blog_twitter_olympics or can be forked from the Github page. Please take a look and share your thoughts and feedback. Until next time, adieu! 🙂

August Project Updates

Hello All,

The theme for August is API programming for social media platforms.

working with twitter API

twitter API code with R/ Python

For the August project, I’ve concentrated on working with Twitter API, using both Python and R programming. The code can be downloaded from the Projects Page or forked from my Github account.

Working With APIs:

Before we learn what the code does, please note that you will first need to request Twitter developer tokens (values for consumer_key, consumer_secret, access_key and access_secret) to authorize your account to extract data from the Twitter platform. If you do not have these tokens yet, you can easily learn how to request tokens using the excellent documentation on the Twitter Developer website. Once you have the tokens, please replace the values of these variables at the beginning of the program with your own.

Second, you will need to install the appropriate twitter packages for running programs in Python and R. This makes it easy to extract data from Twitter since these packages have pre-written functions for various tasks like Twitter authorization, looking up usernames, posting to Twitter, investigating follower counts, extracting profile data in json format, and much more.

“Tweepy” is the package for Python and “twitteR” for R programs, so please install them locally.


Tracking Twitter Follower Growth:

Although Twitter provides a great way to view your own Twitter follower growth, there is no way to download or track this data locally. The Python program added in this month’s code does just that: it extracts the follower count and stores it in a CSV file that can be opened in Excel. This makes it possible to track the (historical) growth or decline of a Twitter follower count over a period of time, starting from today.

With this program you can monitor your own account and other Twitter handles as well! Of course, you can’t go back in time to view older counts, but hey, at least you have started. Plus, you can manually add values for your own accounts.
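The actual program is written in Python; purely to illustrate the idea, here is an R sketch of the logging step. The helper log_follower_count() is hypothetical, and the follower counts are passed in by hand rather than fetched from the API.

```r
# Hypothetical helper: append one dated follower count to a CSV file
log_follower_count <- function(path, handle, count, day = Sys.Date()) {
  is_new <- !file.exists(path)  # write the header only for a new file
  row <- data.frame(date = format(day), handle = handle, followers = count)
  write.table(row, file = path, sep = ",", row.names = FALSE,
              col.names = is_new, append = !is_new)
  invisible(row)
}

# Two made-up daily readings written to a temporary file
csv_path <- tempfile(fileext = ".csv")
log_follower_count(csv_path, "phillydotcom", 180000, as.Date("2016-08-01"))
log_follower_count(csv_path, "phillydotcom", 180250, as.Date("2016-08-02"))

# Read the history back and compute day-over-day growth
history <- read.csv(csv_path)
print(history)
print(diff(history$followers))
```

Running the logging step once a day (for example from a scheduler) builds up the history that Twitter itself does not let you download.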

Track Twitter follower count

File tracking Twitter follower count

(Technically, for twitter handles you do not own, you could get the date of joining of every follower and then deduce when they possibly followed someone. A post for another day, though! )

Extracting Data about Twitter Followers

Follower count is great, but you also want to know the detailed profile of your followers and other interesting twitter accounts. Who are these followers? Where are they located?

There are 2 R programs in the August Project which help you gather this information.

The first (followers_v2.R) extracts a list of all follower ids for a specific Twitter account and stores it in a file. The Twitter API returns at most 5000 ids per request for such queries, so this program uses cursor pagination to pull out the ids in chunks of 5000 per iteration. Think of the list of follower ids like the contents of a book: some books are thicker, so you have to turn more pages! Similarly, if a Twitter account has very few followers, the program completes in 1-2 iterations!

The program example works on the twitter account “@phillydotcom” which has >180k followers.  The cursor iteration process itself is implemented using a simple “while” loop.
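The shape of such a cursor loop is sketched below. The fetch_page() function is a hypothetical stand-in for the real API call so that the example runs offline; it follows the Twitter convention of starting at cursor -1 and returning a next cursor of 0 on the last page. Real cursors are opaque values, not row positions as in this mock.

```r
# Mock data: 12001 follower ids, served 5000 at a time
all_ids   <- 1:12001
page_size <- 5000

# Hypothetical stand-in for one API request
fetch_page <- function(cursor) {
  start <- if (cursor == -1) 1 else cursor
  end   <- min(start + page_size - 1, length(all_ids))
  next_cursor <- if (end < length(all_ids)) end + 1 else 0
  list(ids = all_ids[start:end], next_cursor = next_cursor)
}

cursor       <- -1   # -1 asks for the first page
follower_ids <- c()
iterations   <- 0

# Keep turning pages until the API reports there are none left
while (cursor != 0) {
  page         <- fetch_page(cursor)
  follower_ids <- c(follower_ids, page$ids)
  cursor       <- page$next_cursor
  iterations   <- iterations + 1
}

print(c(ids = length(follower_ids), iterations = iterations))
```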

Twitter follower details

The second R program (dets_followers_v2.R) uses the list of follower_ids to pull in detailed information about followers. For the scope of this project I am only deriving screen name, username, location and follower count for all of my followers. Details are stored in a tabular format as shown in the image alongside. You can use this data to geographically segment your Twitter followers, analyze “influencer” followers (users with 25000 or more followers) and lots more.
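Once the details are in a table, both kinds of analysis take only a line or two. The sketch below uses three invented followers, and the column names are assumptions about how the table is laid out.

```r
# Invented follower details
followers <- data.frame(
  screenName     = c("a", "b", "c"),
  location       = c("Philadelphia, PA", "New York, NY", "Philadelphia, PA"),
  followersCount = c(120, 30000, 25000),
  stringsAsFactors = FALSE
)

# "Influencer" followers: 25000 or more followers of their own
influencers <- followers[followers$followersCount >= 25000, ]

# Followers per location, most common first
by_location <- sort(table(followers$location), decreasing = TRUE)

print(influencers)
print(by_location)
```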

Please take a look at the code and provide your valuable feedback and comments in the comments section.

Happy House-Warming!

Welcome to the new Blog homepage for Journey of Analytics.

The old blog is still live and all old content will still be available on the previous site. So if you have bookmarked any links or pages, they will still work. However, new posts will no longer appear on the old site, so please bookmark this page as well.

Thank you for being a loyal reader of Journey of Analytics.

Happy Coding!

Thanks for reading so far! If you liked our content, please share!