Journey of Analytics

Deep dive into data analysis tools, theory and projects

Author: JOURNEYOFANALYTICS (page 1 of 2)

Introduction to Artificial Intelligence

Lately I’ve been exploring deep learning algorithms, and automating systems with Artificial Intelligence. Plus, I received a couple of emails asking me about programming skills for AI. So, with those questions in mind, here is a simple introduction to artificial intelligence. The agenda for this post is to cover the following topics:

  1. What is AI?
  2. Types of AI – practical implementation in Fortune500 companies
  3. Applications of AI
  4. Drawbacks of AI
  5. Programming skills for AI
introduction to AI

What is AI?

AI, or artificial intelligence, is the process of using software to perform human tasks. It is closely related to machine learning (which is usually considered a branch of AI), and sophisticated algorithms are used to do everything from automating repetitive tasks to creating self-learning sentient systems. Sadly, nowadays the term AI is used interchangeably with data science and machine learning, so it is very difficult to draw a line of difference between them all.

Types of AI

In terms of implementation and usage in Fortune500 companies, AI can support three important business needs (Davenport & Ronanki, 2018):

  1. automating business processes,
  2. gaining insight through data analysis, (Cognitive insight)
  3. engaging with customers and employees.

Process automation:

This is perhaps the most prevalent form of AI, where code on a remote server or a sensor sorts information and makes decisions that would otherwise be made by a human. Examples include updating multiple databases with customer address changes or service additions, replacing lost credit or ATM cards, and parsing humongous amounts of legal and contractual documents to extract provisions using natural language processing. These robotic process automation (RPA) technologies are the most popular because they are easy to implement and offer great returns on investment. They are also controversial, because they can sometimes replace low-skilled manual jobs, even though these jobs were always in danger of being outsourced or given to minimum wage workers.

Cognitive insight.

These processes use algorithms to detect patterns in vast volumes of data and interpret their meaning. Using unsupervised machine learning algorithms helps the processes become more efficient over time. Examples include predicting what a customer is likely to buy, identifying credit fraud in real time, mass-personalization of digital ads and so on. Given the amounts of data, cognitive insight applications are typically used for tasks that would be impossible for people, or to augment human decision-making processes.

Cognitive engagement.

These projects are the most complicated and take the most time, and therefore generate the most buzz and are most prone to mismanagement and failures. Examples include intelligent chatbots that are able to process complex questions and improve with every interaction with live customers, voice-to-text reporting solutions, recommendation systems that help providers create customized care plans that take into account individual patients’ health status and previous treatments, recreating customer intimacy with digital customers using digital concierges, etc. However, such AI projects are still not completely mainstream, and companies tend to take a conservative approach in using them for customer-facing systems.

Applications of AI:

There are many applications of AI, and currently startups are racing to build AI chips for data centers, robotics, smartphones, drones and other devices. Tech giants like Apple, Google, Facebook, and Microsoft have already created interesting products by applying AI software to speech recognition, internet search, and classifying images. Amazon’s AI prowess spans cloud-computing services and voice-activated home digital assistants (Alexa, Amazon Echo). Here are some other interesting applications of AI:

  1. Driverless vehicles
  2. Robo-advisors that can recommend investments, re-balance stock/bond ratios and make personalized recommendations for an individual’s portfolio. An interesting extension of this technique is the list of 50 startups with the most potential to grow, identified by the Quid AI CEO in 2009. Today 10 of those companies have reached billion-dollar valuations (Reddy, 2017) and include famous names like Evernote, Spotify, Etsy, Zynga, Palantir, Cloudera, OPOWER. [Personal note – if you have not yet heard of Quid, follow them on Twitter @Quid. They publish some amazing business intelligence reports! ]
  3. Image recognition that can aid law enforcement personnel in identifying criminals.
  4. LexisNexis has a product called PatentAdvisor, which uses data concerning the history of individual patent examiners and how they’ve handled similar patent applications to predict the likelihood of an application being approved. Similarly, there are software applications that use artificial intelligence to help lawyers gather research material for cases, by identifying precedents that will maximize chances for a successful ruling outcome. (Keiser, 2018)

Drawbacks of AI:

There is no doubt that AI has created some amazing opportunities (image recognition to classify malignant tumors) and allowed companies to pass on boring admin tasks to machines. However, since AI systems are created by humans, they do have the following risks and limits:

  1. AI bias: Machine learning models and the algorithms underlying AI systems also have biases. All algorithms use an initial training set to learn how to identify and predict values. So, if the underlying training set is biased, then the predictions will also be biased. Garbage in, garbage out. Moreover, these biases may not appear as an explicit rule but, rather, be embedded in subtle interactions among the thousands of factors considered. Hence heavily regulated industries like banking bar the use of AI in loan approvals, as it may conflict with fair lending laws.
  2. Lack of verification: Unlike regular rule-based systems, the neural networks that are typically used in AI systems deal with statistical truths rather than literal truths. So, such systems may fail in extremely rare cases, as the algorithm will overlook cases that have a very low probability of occurrence. Examples include predicting a Wall Street crash or a sudden natural calamity like a volcanic eruption (think Hawaii). Lack of verification is a major concern in mission-critical applications, such as controlling a nuclear power plant, or when life-or-death decisions are involved.
  3. Hard to correct errors: If the AI system makes an error (as all systems eventually fail), then diagnosing and correcting the system becomes unimaginably complex, as the underlying mathematics is very complicated.
  4. Human creativity and emotions cannot be automated. AI is excellent at mundane tasks, but not so good at things that are intuitive. As the authors state in a book (Davenport & Kirby, 2016), if the logic can be articulated, a rule-based system can be written, and the process can be automated. However, tasks which involve emotions, creative problem-solving and social interactions cannot be automated. Examples include FBI negotiators, soldiers on flood rescue missions, and the inventor who knew the iPod would change the music industry and become a sensation long before anyone expressed a need for it.

Programming Skills for AI:

The skills used to build AI applications are the same as those needed for data scientist and software engineering roles. The top programming skills are Python, R, Java and C++. If you are looking to get started, then three excellent resources are listed below:

  1. Professional Program from Microsoft. The courses are completely free (gasp) although they do charge $99 per course for verified certificates. I took the free versions, and the courses offer a good mix of both practical labs and theory.
  2. Introduction to AI course from Udacity (cs271).
  3. AI and Deep Learning courses by Kirill Eremenko, on Udemy. I’ve taken 4 courses from him and they were all great value for money, and give very real-world, hands-on coding experience.

Please note that all these 3 are honest recommendations, and I am not being paid or compensated in any shape or form for adding these links.




Brynjolfsson, E., McAfee, A. (2017) The business of artificial intelligence: What it can and cannot do for your organization. Harvard Business Review website. Retrieved from

Davenport, T., Kirby, J. (2016) Only Humans Need Apply: winners and losers in the age of smart machines. Harper Business.

Davenport, T., Ronanki, R. (2018) Artificial Intelligence for the Real World. Harvard Business Review.

Keiser, B. (2018) Law library Management and Legal Research meet Artificial Intelligence.

Reddy, S. (2017) A computer was asked to predict which start-ups would be successful. The results were astonishing. World Economic Forum. Retrieved from





How to raise money on Kickstarter – extensive EDA and prediction tutorial

In this tutorial, we will explore the characteristics of projects on Kickstarter and try to understand what separates the winners from the projects that failed to reach their funding goals.

Qs for Exploratory Analysis:

We will start our analysis with the aim of answering the following questions:

    1. How many projects were successful on Kickstarter, by year and category?
    2. Which sub-categories raised the most money?
    3. Which countries do projects originate from?
    4. How many projects exceeded their funding goal by 50% or more?
    5. Did any projects reach $100,000 or more? $1,000,000 or higher?
    6. What was the average amount contributed by each backer, and how does this change over time? Does this amount differ with categories?
    7. What is the average funding period?


Predicting success rates:
Using the answers from the above questions, we will try to create a model that can predict which projects are most likely to be successful.

The dataset is available on Kaggle, and you can run this script LIVE using this kernel link. If you find this tutorial useful or interesting, then please do upvote the kernel ! 🙂

Step1 – Data Pre-processing

a) Let us take a look at the input dataset :

The projects are divided into main and sub-categories. The pledged amount “usd_pledged” has an equivalent value converted to USD, called “usd_pledged_real”. However, the goal amount does not have this conversion. So for now, we will use the amounts as is.

We can see how many people are backing each individual project using the column, “backers”.

b) Now let us look at the first 5 records:

The name doesn’t really indicate any specific pattern although it might be interesting to see if longer names have better success rates. Not pursuing that angle at this time, though.

c) Looking for missing values:

Hurrah, a really clean dataset, even after searching for “empty” strings. 🙂

 d) Date Formatting and splitting:

We have two dates in our dataset – “launch date” and “deadline date”. We convert them from strings to date format.
We also split these dates into the respective year and month columns, so that we can plot variations over time.
So we will now have 4 new columns: launch_year, launch_month, deadline_year and deadline_month.
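The date handling described above can be sketched as follows. This is a minimal Python illustration using only the standard library, not the kernel's own code. The sample rows, the column names "launched" and "deadline", and the date formats are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical sample rows; the column names and formats are assumptions.
rows = [
    {"launched": "2015-08-11 12:12:28", "deadline": "2015-10-09"},
    {"launched": "2016-02-26 13:38:27", "deadline": "2016-04-01"},
]

for row in rows:
    # Convert the strings to datetime objects.
    launched = datetime.strptime(row["launched"], "%Y-%m-%d %H:%M:%S")
    deadline = datetime.strptime(row["deadline"], "%Y-%m-%d")
    # Split each date into the four derived columns named in the text.
    row["launch_year"], row["launch_month"] = launched.year, launched.month
    row["deadline_year"], row["deadline_month"] = deadline.year, deadline.month

print(rows[0]["launch_year"], rows[0]["launch_month"])
```

With year and month stored as plain integers, grouping counts by year or by month becomes a simple tally.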

Exploratory analysis:

a) How many projects are successful?

We see that “failed” and “successful” are the two main categories, comprising ~88% of our dataset.
Sadly we do not know why some projects are marked “undefined” or “canceled”.
“live” projects are those where the deadlines have not yet passed, although a few among them have already achieved their goal.
Surprisingly, some “canceled” projects had also met their goals (pledged_amount >= goal).
Since these other categories are a very small portion of the dataset, we will subset and only consider records with status “failed” or “successful” for the rest of the analysis.
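A small Python sketch of this step, assuming a hypothetical list of status values: it computes the share of each status and then keeps only the two statuses used in the rest of the analysis. The sample data is invented for illustration and does not reproduce the ~88% figure.

```python
from collections import Counter

# Hypothetical status column; the real dataset is far larger.
states = ["failed", "successful", "failed", "canceled",
          "live", "successful", "failed", "undefined"]

counts = Counter(states)
total = len(states)

# Percentage share of each status, rounded to one decimal place.
share = {state: round(100 * n / total, 1) for state, n in counts.items()}

# Keep only the records with a clear outcome.
kept = [s for s in states if s in ("failed", "successful")]

print(share)
print(len(kept), "of", total, "records kept")
```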

b) How many countries have projects on kickstarter?

We see projects are overwhelmingly from the US. Some country values appear as N,0" (a malformed entry), so we mark them as unknown.

c) Number of projects launched per year:

Looks like some records have dates in 1970, which does not look right. So we discard any records with a launch / deadline year before 2009.
Plotting the counts per year on a graph: from the graph below, it looks like the count of projects peaked in 2015, then went down. However, this should NOT be taken as an indicator of success rates.



Drilling down a bit more to see count of projects by main_category.

Over the years, maximum number of projects have been in the categories:

    1. Film & Video
    2. Music
    3. Publishing

 d) Number of projects by sub-category: (Top 20 only)

The Top 5 sub-categories are:

    1. Product Design
    2. Documentary
    3. Music
    4. Tabletop Games (interesting!!!)
    5. Shorts (really?! )

Let us now see “Status” of projects for these Top 5 sub_categories:
From the graph below, we see that for the sub-categories “shorts” and “tabletop games” there are more successful projects than failed ones.

 e) Backers by category and sub-category:

Since there are a lot of sub-categories, let us explore the sub-categories under the main theme “Design” 

Product design is not just the sub-category with the highest count of projects, but also the category with the highest success ratio.

 f) add flag to see how many got funded more than the goal.

So ~40% of projects reached or surpassed their goal, which matches the number of successful projects.
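The flag can be sketched in a few lines of Python. The five projects below are hypothetical and chosen so the arithmetic is easy to follow. The field names "goal" and "pledged" and the extra "over_50pct" flag (for question 4 of the exploratory list) are assumptions for illustration.

```python
# Hypothetical projects; amounts are in the same currency for goal and pledged.
projects = [
    {"goal": 1000.0, "pledged": 1500.0},
    {"goal": 5000.0, "pledged": 1200.0},
    {"goal": 2000.0, "pledged": 2000.0},
    {"goal": 10000.0, "pledged": 300.0},
    {"goal": 800.0, "pledged": 90.0},
]

for p in projects:
    # True when the project reached or surpassed its goal.
    p["goal_met"] = p["pledged"] >= p["goal"]
    # True when the project exceeded its goal by 50% or more.
    p["over_50pct"] = p["pledged"] >= 1.5 * p["goal"]

pct_met = 100 * sum(p["goal_met"] for p in projects) / len(projects)
print(f"{pct_met:.0f}% of projects met their goal")
```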

 g) Calculate average contribution per backer:

From the mean, median and max values we quickly see that the median amount contributed by each backer is only ~$40 whereas the mean is higher due to the extreme positive values. The max amount by a single backer is ~$5000.

h) Calculate reach_ratio

The amount per backer is a good start, but what if the goal amount itself is only $1000? Then an average contribution per backer of $50 implies we only need 20 backers.
So to better understand the probability of a project’s success, we create a derived metric called “reach_ratio”.
This takes the average user contribution and compares it against the goal fund amount.

We see the median reach_ratio is <1%. Only in the third quartile do we even touch 2%!
Clearly most projects have a very low reach ratio. We could subset for “successful” projects only and check if the reach_ratio is higher.
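As a rough sketch, the two derived metrics could be computed like this in Python. The formula (average contribution per backer as a percentage of the goal) is my reading of the description above. The zero guards and the sample amounts are assumptions, not taken from the kernel.

```python
from statistics import mean, median

def reach_ratio(pledged, backers, goal):
    """Average contribution per backer, as a percentage of the goal."""
    if backers == 0 or goal == 0:
        # Assumption: treat projects with no backers or no goal as 0%.
        return 0.0
    per_backer = pledged / backers
    return 100.0 * per_backer / goal

# The example from the text: $1000 goal, 20 backers giving $50 each.
example = reach_ratio(pledged=1000.0, backers=20, goal=1000.0)

# Hypothetical per-backer amounts: one large value pulls the mean
# above the median, as seen in the real data.
per_backer_amounts = [40.0, 25.0, 500.0]
print(example, median(per_backer_amounts), mean(per_backer_amounts))
```

A reach_ratio of 5% means that each backer, on average, covers one twentieth of the goal.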

 i) Number of days to achieve goal:
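A minimal Python sketch of the funding period, assuming it is measured as the number of days between the launch date and the deadline. The function name, date formats and sample dates are hypothetical.

```python
from datetime import datetime
from statistics import mean

def funding_days(launched, deadline):
    """Whole days between the launch date and the deadline date."""
    start = datetime.strptime(launched, "%Y-%m-%d %H:%M:%S")
    end = datetime.strptime(deadline, "%Y-%m-%d")
    return (end.date() - start.date()).days

# Hypothetical campaigns.
durations = [
    funding_days("2015-08-11 12:12:28", "2015-10-09"),
    funding_days("2016-02-26 13:38:27", "2016-04-01"),
]
print(durations, mean(durations))
```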

 Predictive Analytics:

We will apply a very simple decision tree algorithm to our dataset.
Since we do not have a separate “test” set, we will split the input dataframe into 2 parts (70/30 split).
We will use the smaller set to test the accuracy of our algorithm.
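The 70/30 split can be illustrated with a short Python sketch. The helper name, the fixed seed and the use of a random shuffle are assumptions; the kernel may split the data differently.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = rows[:]
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for the project records: 100 row ids.
train, test = train_test_split(list(range(100)))
print(len(train), len(test))
```

The model is then fitted on the larger part only, and the smaller part stays unseen until the final accuracy check.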

Taking a peek at the decision tree rules:

kickstarter success decision tree

Thus we see that “backers” and “reach_ratio” are the main significant variables.

Re-applying the tree rules to the training set itself, we can validate our model:

From the above tables, we see that the error rate = ~3% and area under curve >= 97%

Finally applying the tree rules to the test set, we get the following stats:

From the above tables, we see that still the error rate = ~3% and area under curve >= 97%
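For readers unfamiliar with the error rate, here is a minimal Python sketch of how it is derived from actual and predicted labels. The five labels below are hypothetical and do not reproduce the ~3% figure reported above.

```python
from collections import Counter

def error_rate(actual, predicted):
    """Fraction of records where the prediction differs from the truth."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Hypothetical labels for five test projects.
actual = ["successful", "failed", "failed", "successful", "failed"]
predicted = ["successful", "failed", "successful", "successful", "failed"]

# Confusion table keyed by (actual, predicted) pairs.
confusion = Counter(zip(actual, predicted))
print(confusion)
print("error rate:", error_rate(actual, predicted))
```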



Thus in this tutorial, we explored the factors that contribute to a project’s success. Main theme and sub-category were important, but the number of backers and “reach_ratio” were found to be most critical.
If a founder wanted to gauge their probability of success, they could measure their “reach_ratio” halfway to the deadline, or perhaps when 25% of the timeline is complete. If the numbers are low, it means they need to double down and use promotions/social media marketing to get more backers and funding.

If you liked this tutorial, feel free to fork the script. And don’t forget to upvote the kernel! 🙂

Twitter Sentiment Analysis


Today’s post is the first of a 2-part tutorial series on how to create an interactive Shiny/R application that displays sentiment analysis for various phrases and search terms. The application accepts a search term from the user as input and graphically displays the sentiment analysis.

In keeping with this month’s theme – “API programming”, this project uses the Twitter API to perform real-time search for tweets containing the user input term. Live App Link on Shiny website is provided and screenshot is as follows:

Twitter Sentiment Analysis Shiny

Shiny application for Twitter Sentiment Analysis

The project idea may seem simple at first, but will teach you the following skills:

  • working with Twitter API and dynamic data streaming (every time the search term changes, the program sends a new request to Twitter for relevant tweets),
  • Building an “interactive”, real-time application in Shiny/R,
  • data visualization with R

As always, the entire source code is also available for download on the Projects Page or can be forked from my  Github account here.


The tutorial is divided into  3 parts :

  1. Introduction
  2. Twitter Connectivity & search
  3. Shiny design


Application Design:

Any good software project begins with the design first. For this application, the design flowchart is shown below:

Design Flowchart for Shiny app



Twitter Connectivity

This is similar to the August project and mainly consists of two calls to the Twitter API:

  • authorize twitter api to mine data, using setup_twitter_oauth() function and your Twitter developer keys.

consumer_key = "ckey"
consumer_secret = "csecret"
access_token = "atoken"
access_secret = "asecret"
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

  • check whether the input search term returns tweets containing the phrase. If the number of tweets is <= 5, return an error message. If the number of tweets is > 5, process the tweets and display a sentiment analysis barchart. A custom function performs this check:

chk_searchterm <- function( term )
{
  # look for all tweets containing this search term.
  tw_search = searchTwitter(term, n=20, since='2013-01-01')

  if(length(tw_search) <= 5) {
    return_term <- "None/few tweets to analyse for this search term. Please try again!"
  } else {
    return_term <- paste("Extracting max 20 tweets for Input =", term, ". Sentiment graph below")
  }

  # return the message to display in the app
  return_term
}



The bargraph is created by assigning numeric values for each of the positive and negative emotions using the tweet text. Emotions used – anger, anticipation, disgust, joy, sadness, surprise, trust, overall positive and negative sentiment.
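To make the scoring idea concrete, here is a small Python sketch. It uses a tiny, made-up word list (TOY_LEXICON) in place of a real emotion lexicon, so the words and their emotion tags are purely illustrative. It counts how often each emotion is triggered per tweet and then sums the counts across tweets, which is the kind of total a bar chart would display.

```python
import re
from collections import Counter

# Toy lexicon for illustration only; a real emotion lexicon is far larger.
TOY_LEXICON = {
    "win": {"joy", "positive"},
    "easy": {"positive"},
    "damn": {"anger", "negative"},
}

def score_tweet(text):
    """Count the emotions triggered by the words of one tweet."""
    counts = Counter()
    for word in re.findall(r"[a-z']+", text.lower()):
        counts.update(TOY_LEXICON.get(word, ()))
    return counts

tweets = [
    "That looked like a very easy win",
    "damn those guys are fit",
]

# Column totals: add up the per-tweet counts for every emotion.
totals = sum((score_tweet(t) for t in tweets), Counter())
print(totals)
```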


Shiny webapp

The actual Shiny application design and twitter connectivity are explained in the next post.

Twitter Analysis – Rio2016 Olympics

Twitter Analysis – Rio2016

Olympics season is in full swing. In keeping up with the spirit of this pinnacle of sports, we will use the Twitter API to extract tweets related to Rio2016 and analyze them to extract insights.

In this post we will perform the following tasks:


Step 1 – Connecting to Twitter API

We will use R programming to perform the analysis, using Twitter API keys (learn more about how to request these keys here) and the amazing “twitteR” package to authorize data extraction from the Twitter website.

Code for authorization is below:

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

Step 2 – Search Twitter API for specific tags

We will search Twitter for all tweets with the tag “#TeamUSA”.
Twitter puts some constraints on how much data can be extracted with each API call, so we limit our search to 2000 tweets. To ensure recency, we specify the tweets should have been posted after Aug 1, 2016. Code snippet below:

tw_search = searchTwitter('#TeamUSA', n=2000, since='2016-08-01', geocode='39.9526,-75.1652,50mi')

Note that the “geocode” argument is optional in the above command, but I added it to restrict the search to tweets from users located within 50 miles of Philadelphia, ensuring that coverage by NBC/Fox is definitely picked up! We save the tweets in an RDS file for easy access.

saveRDS(tw_search, 'USteam_olympics.rds')


Step 3 – Cleaning up and processing the tweets

First, we remove all special characters and emojis from the tweets using the sapply() and iconv() functions.

tweet_doc$text <- sapply(tweet_doc$text, function(row) iconv(row, "latin1", "ASCII", sub=""))
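For comparison, the same clean-up idea can be sketched in Python. This is a rough equivalent rather than a translation of the iconv() call: it simply drops every character that is not plain ASCII. The function name and the sample tweet are hypothetical.

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")

# Hypothetical tweet containing an emoji and an accented letter.
sample = "Go #TeamUSA 🥇 in Río!"
cleaned = strip_non_ascii(sample)
print(cleaned)
```

Note that accented letters are removed as well, so this approach suits English-language tweets best.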

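We convert the created time to a local time zone. Note that the code below uses 'America/Chicago' (US Central time, 1 hour behind Philadelphia/NYC), so the hours reported in this post are in CDT. Rio de Janeiro itself is on Brasília time ('America/Sao_Paulo', UTC-3), which was 1 hour ahead of Philadelphia/NYC in August 2016.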

tweet_doc$Riotime = with_tz(tweet_doc$created, 'America/Chicago')
tweet_doc$strptime = as.POSIXct(strptime(tweet_doc$Riotime, "%Y-%m-%d %H:%M:%S"))
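The effect of the time zone conversion can be checked with a short Python sketch. To stay within the standard library it uses fixed UTC offsets that were valid in August 2016 (US Central Daylight Time at UTC-5 and Rio de Janeiro at UTC-3) instead of named time zones. The sample timestamp is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets valid for August 2016 (an assumption that only holds
# for this period; named zones would handle daylight saving changes).
CDT = timezone(timedelta(hours=-5), "CDT")
BRT = timezone(timedelta(hours=-3), "BRT")

# Tweet timestamps from the API are in UTC; this one is made up.
created = datetime(2016, 8, 6, 2, 30, tzinfo=timezone.utc)

central = created.astimezone(CDT)
rio = created.astimezone(BRT)

# Hour and day are the fields later used to aggregate tweets.
print(central.strftime("%Y-%m-%d %H:%M %Z"), central.hour, central.day)
print(rio.strftime("%Y-%m-%d %H:%M %Z"), rio.hour, rio.day)
```

The same tweet lands in the 9 pm bucket in Central time and in the 11 pm bucket in Rio time, so the choice of zone shifts every hourly chart by two hours.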

The as.POSIXct() allows us to aggregate tweets by hour/ date / minute, etc. which we can derive as below:

tweet_doc$day = as.numeric(format(tweet_doc$strptime, "%d"))
tweet_doc$hour = as.numeric(format(tweet_doc$strptime, "%H"))  # used for the hourly plot

We add a new variable to determine the type of device used for each tweet, using the device URL Twitter provides in the column “statusSource”.

par(mar = c(3, 3, 3, 2))
tweet_doc$statusSource_new = substr(tweet_doc$statusSource, regexpr('>', tweet_doc$statusSource) + 1,
regexpr('</a>', tweet_doc$statusSource) - 1)
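The source field is an HTML link, and the device name is the text between the opening tag and the closing tag. A Python sketch of the same extraction, using a regular expression instead of substr() and regexpr(), is shown here. The function name and the sample string are hypothetical.

```python
import re

# Capture the text between the end of the opening tag and "</a>".
ANCHOR_TEXT = re.compile(r">([^<]*)</a>")

def device_from_source(status_source):
    """Return the link text, or the input unchanged if it is not a link."""
    match = ANCHOR_TEXT.search(status_source)
    return match.group(1) if match else status_source

sample = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
device = device_from_source(sample)
print(device)
```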


Step 4 – Graphical Insight

Plot 1: Tweets by hour of day:

gptime <- ggplot(tweet_doc, aes(hour)) + geom_bar(aes(fill = isRetweet)) + xlab('Tweets by hour')

We notice that the number of tweets increases as the evening passes, with peak frequency at about 9 pm CDT (graph below).

Bar chart displaying frequency count of #TeamUSA tweets by hour

#TeamUSA tweets by hour

Plot 2: Tweets by device type:

gp <- ggplot(tweet_doc, aes(x = statusSource_new, fill = isRetweet)) + geom_bar()

The graph clearly shows iPhones dominating the user base.

Tweets by device used


Plot 3: Emotional Valence

We extract the emotional sentiment of tweets using a custom function:

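library(magrittr)  # provides the %>% pipe
library(qdap)      # assumed source of polarity(); the original call is cut off in this post
polfn = lapply(orig$text, function(txt) {
# strip sentence enders so each tweet is analyzed as a sentence,
# and +'s which muck up regex
gsub('(\\.|!|\\?)\\s+|(\\++)', ' ', txt) %>%
# strip URLs
gsub(' http[^[:blank:]]+', '', .) %>%
# calculate polarity
polarity()
})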

Applying this, we get the most positive tweet:

“That looked like a very easy win for #TeamUSA  #beachvolleyball #Rio2016”

most negative tweet:

I think it’s a very odd sport but damn those guys are fit #Rio2016 #waterpolo #TeamUSA

Last, we plot a graph to display how emotional valence changes over the day:

Emotional valence change in tweets

Plot 4 : Word Cloud:

word cloud for #teamUSA tweets

We use the “text” column from tweet_doc object to create a word dictionary of the tweets after removing punctuation and unwanted characters. The size of the words increases with their frequency of appearance in the tweets. The image alongside shows such a wordcloud with highlighted words indicating high-frequency phrases.

wordCorpus <- Corpus(VectorSource(tweet_doc$text))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
# pal must hold a colour palette defined before this call
wordcloud(words = wordCorpus, max.words=500, random.order=FALSE,
rot.per=0.35, use.r.layout=FALSE, colors=pal)


Plot 5 : Sentiment Graph:

We use the get_nrc_sentiment() function from the “syuzhet” library to assign emotional values to each of the 2000 tweets we extracted.

mySentiment <- get_nrc_sentiment(tweet_doc$text)

This assigns a numeric value to each tweet to indicate various emotions expressed in the tweet – anger, anticipation, fear, joy, etc. We then add these values back to the tweet_doc object and compute column totals to derive the overall weight for each emotion. Code and image for overall sentiment scores are shown below:

ggplotly(ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
geom_bar(aes(fill = sentiment), stat = "identity") +
theme(legend.position = "none") +
xlab("Sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score for All Tweets"))


Overall Sentiment Scores - #TeamUSA

We can also use the scores to see if positive/negative sentiments change with time of day or date. For our tag “#TeamUSA” we notice this is patently true, as seen in the graph below:
Positive tweets peaked at noon on Opening Ceremony day (Aug 5) and negative sentiments peaked on the morning of Aug 7.

Sentiment by time


Step 5 – Usage for Brand monitoring.

The steps used in the analysis above can be easily modified for monitoring your brand, blog or product, as explained below:

  • Instead of “#TeamUSA” we can use any other tag or company/blog  name or product or any other relevant tags to mine Twitter for tweets.
  • Periodically monitor the tweets about your product or brand to ensure that your “sentiment graph” always tends to positive emotions. If not, ensure your staff is working diligently to counter any negative tweets/ concerns among your users.
  • The graphical analysis for “Tweets by hour of day ” could be used to monitor what time your users/ audience is most active. You could use this insight to publish more content during this time and to ensure your customer support is always available during this period to effectively engage your audience.
  • If your “device type” graph indicates any specific device (e.g: specific Android phone brands) make sure your content caters correctly for mobile users.
  • The high-frequency words in “wordcloud” indicate trending topics, so these can be used as great ideas for new content topics or short-term ads to ride the publicity wave! 🙂


The entire source code for this analysis is available here blog_twitter_olympics or can be forked from the Github page. Please take a look and share your thoughts and feedback. Until next time, adieu! 🙂

August Project Updates

Hello All,

The theme for August is API programming for social media platforms.

working with twitter API

twitter API code with R/ Python

For the August project, I’ve concentrated on working with Twitter API, using both Python and R programming. The code can be downloaded from the Projects Page or forked from my Github account.

Working With APIs:

Before we learn what the code does, please note that you will first need to request Twitter developer tokens (values for consumer_key, consumer_secret, access_key and access_secret) to authorize your account to extract data from the Twitter platform. If you do not have these tokens yet, you can easily learn how to request them using the excellent documentation on the Twitter Developer website. Once you have the tokens, please replace these variables at the beginning of the program with your own values.

Second, you will need to install the appropriate twitter packages for running programs in Python and R. This makes it easy to extract data from Twitter since these packages have pre-written functions for various tasks like Twitter authorization, looking up usernames, posting to Twitter, investigating follower counts, extracting profile data in json format, and much more.

“Tweepy” is the package for Python and “twitteR” for R programs, so please install them locally.


Tracking Twitter Follower Growth:

Although Twitter provides a great way to view your own twitter follower growth, there is no way to download or track this data locally. The Python program added in this month’s code does just that: it extracts the follower count and stores it in a CSV file that can be opened in Excel. This makes it possible to track (historical) growth or decline of Twitter follower count over a period of time, starting from today.

With this program you can monitor your own account and other twitter handles as well! Of course, you can’t go back in time to view older counts, but hey, at least you have started. Plus, you can manually add values for your own accounts.
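The storage half of such a tracker can be sketched as follows. This is a minimal illustration under stated assumptions, not the program from the August project: the function name, file name, column layout and sample counts are hypothetical, and the counts themselves would come from the Twitter API (for example through Tweepy), which is not shown here.

```python
import csv
import os
import tempfile
from datetime import date

def append_follower_count(path, handle, count, day):
    """Append one dated follower count, writing a header for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as fh:
        writer = csv.writer(fh)
        if is_new:
            writer.writerow(["date", "handle", "followers"])
        writer.writerow([day.isoformat(), handle, count])

# Demo with made-up counts, written to a temporary folder.
log_path = os.path.join(tempfile.mkdtemp(), "follower_counts.csv")
append_follower_count(log_path, "@example_handle", 1200, date(2016, 8, 1))
append_follower_count(log_path, "@example_handle", 1215, date(2016, 8, 2))

with open(log_path, newline="") as fh:
    logged = list(csv.reader(fh))
print(logged)
```

Running the script once a day (for instance from a scheduler) builds up the history one row at a time.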

Track Twitter follower count

File tracking Twitter follower count

(Technically, for twitter handles you do not own, you could get the date of joining of every follower and then deduce when they possibly followed someone. A post for another day, though! )

Extracting Data about Twitter Followers

Follower count is great, but you also want to know the detailed profile of your followers and other interesting twitter accounts. Who are these followers? Where are they located?

There are 2 R programs in the August Project which help you gather this information.

The first (followers_v2.R) extracts a list of all follower ids for a specific twitter account and stores it to a file. The Twitter API returns at most 5000 follower ids per request, so this program uses cursor pagination to pull out the ids in chunks of 5000 in each iteration. Think of the list of follower ids like the pages of a book – some books are thicker, so you have to turn more pages! Similarly, if a twitter account has very few followers, the program completes in 1-2 iterations!

The program example works on the twitter account “@phillydotcom” which has >180k followers.  The cursor iteration process itself is implemented using a simple “while” loop.
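The cursor loop can be sketched in Python as shown here. The fetch function is a stand-in that simulates an account with 12,000 followers, so no API call is made. The convention used (start with cursor -1, stop when the next cursor is 0) follows the Twitter follower-ids endpoint, while the function names and the way the fake cursor is computed are assumptions for the demo. Real cursors are opaque values returned by the API.

```python
def fake_followers_page(cursor, page_size=5000):
    """Stand-in for the API call: returns (ids, next_cursor)."""
    all_ids = list(range(1, 12001))          # simulated follower ids
    start = 0 if cursor == -1 else cursor    # -1 means "first page"
    ids = all_ids[start:start + page_size]
    more = start + page_size < len(all_ids)
    next_cursor = start + page_size if more else 0  # 0 means "no more pages"
    return ids, next_cursor

def collect_follower_ids(fetch_page):
    """Keep turning pages until the API reports there are none left."""
    follower_ids, cursor, pages = [], -1, 0
    while cursor != 0:
        ids, cursor = fetch_page(cursor)
        follower_ids.extend(ids)
        pages += 1
    return follower_ids, pages

ids, pages = collect_follower_ids(fake_followers_page)
print(len(ids), "ids in", pages, "pages")
```

In a real run, each iteration would also need to respect the API rate limit, for example by pausing between requests.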

Twitter follower details

The second R program (dets_followers_v2.R) uses the list of follower_ids to pull in detailed information about followers. For the scope of this project I am only deriving screen name, username, location and follower count for all of my followers. Details are stored in a tabular format as shown in the image alongside. You can use this data to geographically segment your Twitter followers, analyze “influencer” followers (users with 25000 or more followers) and lots more.
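Once the follower details are in a table, the two analyses mentioned above take only a few lines. Here is a hedged Python sketch with hypothetical follower records. The field names mirror the columns described in the text but are assumptions, as is the "unknown" label for blank locations.

```python
from collections import Counter

INFLUENCER_THRESHOLD = 25000  # cut-off used in the text

# Hypothetical follower records.
followers = [
    {"screen_name": "user_a", "location": "Philadelphia, PA", "followers_count": 30000},
    {"screen_name": "user_b", "location": "", "followers_count": 120},
    {"screen_name": "user_c", "location": "Philadelphia, PA", "followers_count": 25000},
    {"screen_name": "user_d", "location": "Austin, TX", "followers_count": 900},
]

# "Influencer" followers: 25000 or more followers of their own.
influencers = [f["screen_name"] for f in followers
               if f["followers_count"] >= INFLUENCER_THRESHOLD]

# Geographic segmentation: count followers per stated location.
by_location = Counter(f["location"] or "unknown" for f in followers)

print(influencers)
print(by_location.most_common())
```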

Please take a look at the code and provide your valuable feedback and comments in the comments section.
