Journey of Analytics

Deep dive into data analysis tools, theory and projects

Author: JOURNEYOFANALYTICS

DataScience Professionals : India vs US | Men vs Women

Introduction

This is an analysis of the Kaggle 2018 survey dataset. In my analysis, I am trying to understand the similarities and differences between male and female users from the US and India, since these are the two biggest segments of the respondent population. The number of respondents who chose something other than Male/Female is quite low, so I excluded that subset as well.

The complete code is available as a kernel on the Kaggle website. If you like this post, do login and upvote! 🙂  This post is a slightly truncated version of the Kernel available on Kaggle.

You can also use the link to go to the dataset and perform your own explorations. Please do feel free to use my code as a starter script.

 

Kaggle users – India vs US

A couple of disclaimers:

  • I am NOT trying to say that one country is better than the other; I am simply exploring the profiles based on what this specific dataset shows.
  • It is quite possible that there is a response bias, and that the differences are due to the nature of the people who are on the Kaggle site and who answered the survey.

With that out of the way, let us get started. If you like the analysis, please feel free to fork the script and extend it further. And do not forget to upvote! 🙂

Analysis

Some questions that the analysis tries to answer are given below:
a. What is the respondent demographic profile for users from the 2 countries – men vs women, age bucket?
b. What is their educational background and major?
c. What are the job roles and coding experience?
d. What is the most popular language of use?
e. What is the programming language people recommend for an aspiring data scientist?

I deliberately did not compare salary because:
a. 16% of the population did not answer and 20% chose “do not wish to disclose”.
b. the lowest bracket is 0-10k USD, so the max limit of 10k translates to about INR 7,00,000 (7 lakhs), which is quite high. A software engineer entering the IT industry probably makes around 4-5 lakhs per annum, and software engineers already earn much more than most others in India. So comparing against US salaries feels like comparing apples to oranges. [Assuming an exchange rate of 1 USD = 70 INR].

Calculations / Data Wrangling:

  1. I’ve aggregated the age buckets into a smaller number of segments, because the number of respondents tapers off in the higher age groups. They are quite self-explanatory, as you will see from the ifelse() clause in the sketch below this list:
  2. Similarly, cleaned up the special characters in the educational qualifications. Also added a tag to the empty values in the following variables – jobrole (Q6), exp_group (Q8), proj(Q40), years in coding (Q24), major (Q5).
  3. I also created some frequency tables using the sqldf() function. You could use summarise() from the dplyr package instead. It really is a matter of choice.
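
For reference, here is a minimal sketch of these wrangling steps. The data frame name (survey) and the question columns (Q1 = gender, Q2 = age, Q3 = country) are illustrative placeholders; the full version is in the Kaggle kernel.

library(sqldf)

# Collapse the survey age buckets into fewer segments.
survey$age_group <- ifelse(survey$Q2 %in% c("18-21", "22-24"), "18-24",
                    ifelse(survey$Q2 == "25-29", "25-29",
                    ifelse(survey$Q2 %in% c("30-34", "35-39"), "30-39",
                    ifelse(survey$Q2 %in% c("40-44", "45-49"), "40-49", "50+"))))

# Frequency counts by country, gender and age group using sqldf()
age_freq <- sqldf("SELECT Q3 AS country, Q1 AS gender, age_group,
                          COUNT(*) AS respondents
                   FROM survey
                   GROUP BY Q3, Q1, age_group")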

Observations

Gender composition:

As seen in the chart below, many more males (~80%) responded to the survey than women (~20%).
Among women, almost 2/3rd are from US, and only ~38% from India.
The men are almost split 50/50 among US and India.

 

Age composition:

There is a definite trend showing that the Indian respondents are quite young, with both men and women showing >54% in the youngest age bucket (18-24), and another ~28% falling in the 25-29 category. So almost 82% of this population is under 30 years of age.
Among US respondents, the women skew a bit younger, with 68% under 30 years compared to ~57% of the men. However, both men and women had a larger segment in the 55+ category (~20% for women and ~25% for men). Compare this with the Indian respondents, where the 55+ group is barely 12%.

 

Educational background:

Overall, this is an educated lot, and most had a bachelors degree or more.
US women were the most educated of the lot, with a whopping 55% with masters degrees and 16% with doctorates.
Among Indians, women had higher levels of education – 10% with a Ph.D. and 43% with a masters degree, compared with men, where ~34% had a masters degree and only 4% had a doctorate.
Among US men, ~47% had a masters degree, and 19% had doctorates.
This is interesting because Indians are younger compared to US respondents, so many more Indians seem to be pursuing advanced degrees.

Undergrad major:

Among Indians, the majority of respondents listed Computer Science as their major.
Maybe this is because:

  1. Indians have to declare a major when they join college, and the choice of majors is not as wide as in the US.
  2. Parents tend to push kids towards majors which are known to translate into a decent-paying job, typically engineering or medicine.
  3. A case of response bias? The survey came from Kaggle, so I am not sure whether non-coding majors would have even bothered to respond.

Among US respondents, the most common major is also computer science, followed by maths & stats for women.
    For men, the second spot was effectively a tie between non-compsci engineering and maths & stats.

 

Job Roles:

Among Indians, the biggest segment is students (30%). Among Indian men, the second-biggest category is software engineer.
Among US women, the biggest category was also “student”, followed quite closely by “data scientist”. Among US men, the biggest category was “data scientist” followed by “student”.
Note that the “other” category is one we created during the data cleaning, so I am not considering it; it is not the biggest category for any sub-group anyway.
CEOs, not surprisingly, are male, 45+ years old, from the US, with a masters degree.

 

Coding Experience:

Among Indians, most answered <1 year of coding experience, which correlates well with the fact that most of them are under 30, with a huge population of students.
Among US respondents, the split is even between 1-2 years and 3-5 years of coding experience.
Men seem to have a bit more coding experience than women, again explained by the fact that the women were slightly younger overall compared to the US men.

 

Most popular programming language:

Python is the most popular language, discounting the number of people who did not answer. However, among US women, R is also popular (16% favoring it).

I found this quite interesting because I’ve always used R at work, at multiple big-name employers (Nasdaq, TD Bank, etc.). Plus, a lot of companies that used SAS seem to have found it easier to move their code to R. Again, this is personal opinion.
Maybe it is also because many colleges teach Python as a starting programming language?

 

Conclusions:

  1. Overall, Indians tended to be younger with more people pursuing masters degrees.
  2. US respondents tended to be older, with stronger coding experience, and many more are practicing data scientists.
    This seems like a great opportunity for Kaggle, if they could match the Indian students with the US data scientists in a sort of mentor-matching service. 🙂

Automated Email Reports with R

R is an amazing tool to perform advanced statistical analysis and create stunning visualizations. However, data scientists and analytics practitioners do not work in silos, so these analyses have to be copied and emailed to senior managers and partner teams. Cut-copy-paste sounds fine, but if it is a daily or periodic task, it is more useful to automate the reports. So in this blogpost, we are going to learn how to do exactly that.

The R-code uses specific library packages to do this:

  • RDCOMClient – to connect to Outlook and send emails. In most offices, Outlook is still the de facto email client, so this is fine. However, if you are using Slack or a different client, this approach may not work.
  • r2excel – To create an excel output file.

The screenshot below shows the final email view:

Email screenshot

As seen in the screenshot, the email contains the following:

  • Custom subject with current date
  • Embedded image
  • Attachments – 1 Excel and 1 pdf report

Code Explanation:

The code and supporting input files are available here, under the Projects page under Nov2018. The code has six parts:

  • Prepare the work space
  • Pull the data from source
  • Cleaning and calculations
  • Create pdf
  • Create Excel file
  • Send email

 

Prepare the work space

I always set the relative paths and working directories at the very beginning, so it is easier to change paths later. You can replace the link with a shared network drive path as well.

Load library packages and custom functions. My code uses the r2excel package, which is not directly available as an R-CRAN package, so you need to install it via devtools using the code below.
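
A minimal installation sketch, assuming the package is hosted in the kassambara/r2excel GitHub repository:

install.packages("devtools")                       # first-time only
devtools::install_github("kassambara/r2excel")     # first-time only

library(r2excel)                                   # needed on every run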

It is possible to do something similar using the “xlsx” package, but r2excel is easier.

Some other notes:

  • you need the first 2 lines of code only for the initial installation. From the second time onwards, you only need to load the library.
  • r2excel seems to work only with 64-bit installations of R and Rstudio.
  • you do need Java installed on your computer. If you see an error about java namespace, then check the path variables. There is a very useful thread on Stackoverflow, so take a look.
  • As always, if you see errors Google it and use the Stack Overflow conversations. In 99% of cases, you will find an answer.

Pull the data from source

This is where we connect to a CSV (or text) file. In practice, most people connect to a database of some kind. The R-script I am using connects to a .csv file, but I have also added code to connect to a SQL database.

That code snippet is commented out, so feel free to substitute your own sql database links. The code will also work for Amazon EC2 cluster.
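
Here is a hedged sketch of the data pull; the file name, DSN and table are placeholders rather than the actual source:

library(data.table)

# local file read - file name is a placeholder
apps <- fread("googleplaystore.csv")

# --- alternative: pull directly from a SQL database (RODBC assumed) ---
# library(RODBC)
# conn  <- odbcConnect("my_dsn", uid = "user", pwd = "password")
# start <- Sys.time()
# apps  <- sqlQuery(conn, "SELECT app, category, reviews, installs FROM app_table")
# Sys.time() - start        # how long did the query take?
# odbcClose(conn)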

Some points to keep in mind:

  • If you are using sqlQuery(), note that if your query has an error, R sadly shows only a generic error message. So test your query on SQL Server first to ensure that you are not missing anything.
  • Some queries do take a long time if you are pulling from a huge dataset. The time taken will also be longer in R compared to a direct SQL Server connection. Calling Sys.time() before and after the query is a handy way to know how long the query took to complete.
  • If you only plan to pull the data occasionally, it may make sense to pull it from SQL Server once and store it locally. Use the fread() function to read those files.
  • If you are using R desktop instead of R Server, the amount of data you can pull may be limited by your system configuration.
  • ALWAYS optimize your query. Even if you have unlimited memory and computation power, only pull the data you absolutely need. Otherwise you end up unnecessarily sorting through irrelevant data.

Cleaning and calculations

For the current data, there are no NAs, so we don’t need to account for those. However, the read.csv() command creates factors, which I personally do not like, as they sometimes cause issues while merging.

Some of the column names contain a “.” where R converted the spaces in the original names, so we will manually replace those with an underscore using the gsub() function.
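
A small sketch of those two clean-up steps, reusing the placeholder file and data frame names from above:

# If you prefer read.csv(), switch off factor conversion; then fix column names.
apps <- read.csv("googleplaystore.csv", stringsAsFactors = FALSE)

# replace the "." that read.csv() inserted into column names with an underscore
names(apps) <- gsub(".", "_", names(apps), fixed = TRUE)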

We will also rank the apps based on categories of interest (a short sketch follows this list), namely:

  • Most Popular Apps – by number of Reviews
  • Most Popular Apps – by number Downloads and Reviews
  • Most Popular Categories – Paid Apps only
  • Most popular apps with 1 billion installations.
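
As an illustration, here is one way to build two of these rankings with dplyr, assuming the placeholder column names App, Category, Reviews and Type; adjust to the actual data:

library(dplyr)

# Most popular apps by number of reviews (top 10)
top_by_reviews <- apps %>%
  arrange(desc(Reviews)) %>%
  select(App, Category, Reviews) %>%
  head(10)

# Most popular categories - paid apps only
paid_categories <- apps %>%
  filter(Type == "Paid") %>%
  count(Category, sort = TRUE)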

Create pdf

We are going to use the pdf() function to send all graphs to a pdf document. Basically, this function writes the graphs to a file rather than showing them on the console. The only thing to remember is that if you are testing graphs, or create an incorrect graph, everything gets written to the pdf until you call dev.off(). Sometimes, if a graph throws an error, you may end up with a blank page, or worse, a corrupt file that cannot be opened.

Currently the code only prints 2 simple graphs using the ggplot() and barplot() functions, but you can include many other plots as well.
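
A minimal sketch of the pdf step, reusing the placeholder objects created above (file name and plot titles are illustrative):

library(ggplot2)

pdf("app_report.pdf", width = 8, height = 6)    # open the pdf device

# ggplot objects must be print()-ed explicitly when writing to a device
print(ggplot(top_by_reviews, aes(x = reorder(App, Reviews), y = Reviews)) +
        geom_col() + coord_flip() +
        labs(title = "Most popular apps by reviews", x = NULL))

# base-R barplot of paid-app counts per category
barplot(paid_categories$n, names.arg = paid_categories$Category,
        las = 2, main = "Paid apps by category")

dev.off()    # close the device; nothing after this goes into the pdf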

 

Create Excel file.

The Excel file is created in the sequence below (a consolidated sketch follows the list):

  • Specify the filename and create an object of type .xlsx. This creates an empty Excel placeholder; it is only complete when you save the workbook using saveWorkbook() at the end of the section.
  • Use sheets() to create the different worksheets within the Excel file.
  • xlsx.addHeader() adds a bold header to each sheet, which helps readers understand the content on the page. The r2excel package has other functions to add more informative text in a smaller (non-header) font as well, if you need to give readers some context. Obviously, this is optional.
  • xlsx.addTable() – this is the crucial function that adds the content, the main “meat” of what you need to show, to the Excel file.
  • saveWorkbook() – this function saves the Excel file to the folder.
  • xlsx.openFile() – this function opens the file so you can view its contents. I typically have the script running in automated mode, so when the Excel file opens I am notified that the script has completed.
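
Putting the sequence together, here is a hedged sketch assuming the r2excel package (which builds on xlsx); the file, sheet and table names are placeholders:

library(r2excel)

filename <- "app_report.xlsx"                      # placeholder file name
wb    <- createWorkbook(type = "xlsx")             # empty Excel placeholder
sheet <- createSheet(wb, sheetName = "Top Apps")   # one worksheet

xlsx.addHeader(wb, sheet, value = "Most Popular Apps - by Reviews", level = 1)
xlsx.addTable(wb, sheet, data = top_by_reviews)

saveWorkbook(wb, filename)                         # nothing is written to disk until this call
xlsx.openFile(filename)                            # pop the file open once the script finishes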

Send email

The email is sent using the following functions (a consolidated sketch follows the list):

  • OutApp() – creates an Outlook object. As I mentioned earlier, you do need Outlook installed and to be signed in for this to work. I use Outlook for work and at home, so I have not explored options for Slack or other email clients.
  • outMail[[“To”]] – specify the people in the “to” field. You could also read email addresses from a file and pass the values here.
  • outMail[[“cc”]] – similar concept, for the cc field.
  • outMail[[“Subject”]] – I have used the paste0() function to add the current date to the subject, so recipients know it is the latest report.
  • outMail[[“HTMLBody”]] – I used the HTML body so that I can embed the image. If you don’t know HTML, no worries! The code is pretty intuitive, and you should be able to follow what I’ve done. The image is basically an attachment which the HTML code forces to be viewed within the body of the email. If you are sending the email to people outside the organization, they may see a small box with a cross on the top left (or right) instead of the image. Usually, when they hover the mouse near the box and right-click, it will ask them to download images. You may have seen similar messages in Gmail, along with a link to “show images” or “always show images from this sender”. You obviously cannot control what the recipient selects, but testing by sending to yourself first helps smooth out potential aesthetic issues.
  • outMail[[“Attachments”]] – function to add attachments.
  • outMail$Send() – until you run this command, the mail will not be sent. If you are using this in an office, you may get a popup asking you to do one of the following. Most of these will generally go away after the first use, but if they don’t, please look up the issue on Stack Overflow or contact your IT support about firewall and other security settings.
    • popup to hit “send”
    • popup asking you to “classify” the attachments (internal / public / confidential) – select as appropriate; for me, this selection is usually “internal”
    • popup asking you to accept “trust” settings
    • popup blocker notifying you to allow a backend app to access Outlook
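
Here is a consolidated sketch of the email step, assuming RDCOMClient; the addresses, file names and the cid-based image embed are illustrative, not the exact code from the original script:

library(RDCOMClient)

OutApp  <- COMCreate("Outlook.Application")        # connect to the installed Outlook client
outMail <- OutApp$CreateItem(0)                    # 0 = standard mail item

outMail[["To"]]      <- "manager@example.com"
outMail[["CC"]]      <- "team@example.com"
outMail[["Subject"]] <- paste0("Daily app report - ", Sys.Date())

# attach the image first, then reference it by file name inside the HTML body
outMail[["Attachments"]]$Add(normalizePath("trend_plot.png"))
outMail[["HTMLBody"]] <- paste0("<p>Hi team,</p><p>Today's report is attached.</p>",
                                "<img src='cid:trend_plot.png'>")

outMail[["Attachments"]]$Add(normalizePath("app_report.xlsx"))
outMail[["Attachments"]]$Add(normalizePath("app_report.pdf"))

outMail$Send()                                     # nothing goes out until this line runs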

 

That is it – and you are done! You have successfully learned how to send an automated email via R.

Introduction to Artificial Intelligence

Lately I’ve been exploring deep learning algorithms and automating systems with artificial intelligence. Plus, I received a couple of emails asking me about programming skills for AI. So, with those questions in mind, here is a simple introduction to artificial intelligence. The agenda for this post is to cover the following topics:

  1. What is AI?
  2. Types of AI – practical implementation in Fortune500 companies
  3. Applications of AI
  4. Drawbacks of AI
  5. Programming skills for AI
Introduction to AI

What is AI?

AI or artificial intelligence is the process of using software to perform human tasks. It is considered to be a branch of machine learning, and sophisticated algorithms are used to do everything from automating repetitive tasks to creating self-learning, sentient systems. Sadly, nowadays the term AI is used almost interchangeably with data science and machine learning, so it is very difficult to draw a line of difference between them all.

Types of AI

In terms of implementation and usage in Fortune 500 companies, AI can address three important business needs (Davenport & Ronanki, 2018):

  1. automating business processes (process automation),
  2. gaining insight through data analysis (cognitive insight),
  3. engaging with customers and employees (cognitive engagement).

Process automation:

This is perhaps the most prevalent form of AI, where code on a remote server or a sensor sorts information and makes decisions that would otherwise be made by a human. Examples include updating multiple databases with customer address changes or service additions, replacing lost credit or ATM cards, and parsing humongous amounts of legal and contractual documents to extract provisions using natural language processing. These robotic process automation (RPA) technologies are the most popular because they are easy to implement and offer great returns on investment. They are also controversial, because they can sometimes replace low-skilled manual jobs, even though these jobs were always in danger of being outsourced or given to minimum-wage workers.

Cognitive insight.

These processes use algorithms to detect patterns in vast volumes of data and interpret their meaning. Using unsupervised machine learning algorithms helps these processes become more efficient over time. Examples include predicting what a customer is likely to buy, identifying credit fraud in real time, mass-personalizing digital ads, and so on. Given the volumes of data involved, cognitive insight applications are typically used for tasks that would be impossible for people, or to augment human decision-making.

Cognitive engagement.

These projects are the most complicated, take the most time, and therefore generate the most buzz and are the most prone to mismanagement and failure. Examples include intelligent chatbots that can process complex questions and improve with every interaction with live customers, voice-to-text reporting solutions, recommendation systems that help providers create customized care plans that take into account individual patients’ health status and previous treatments, and digital concierges that recreate customer intimacy with digital customers. However, such AI projects are still not completely mainstream, and companies tend to take a conservative approach when using them for customer-facing systems.

Applications of AI:

There are many applications of AI and currently startups are racing to build AI chips for data centers, robotics, smartphones, drones and other devices. Tech giants like Apple, Google, Facebook, and Microsoft have already created interesting products by applying AI software to speech recognition, internet search, and classifying images. Amazon.com’s AI prowess spans cloud-computing services and voice-activated home digital assistants (Alexa, Amazon Echo). Here are some other interesting applications of AI:

  1. Driverless vehicles
  2. Robo-advisors that can recommend investments, re-balance stock/bond ratios and make personalized recommendations for an individual’s portfolio. An interesting extension of this technique is the list of 50 startups with the most potential to grow, identified in 2009 using Quid’s AI. Today 10 of those companies have reached billion-dollar valuations (Reddy, 2017), including famous names like Evernote, Spotify, Etsy, Zynga, Palantir, Cloudera and OPOWER. [Personal note – if you have not yet heard of Quid, follow them on Twitter @Quid. They publish some amazing business intelligence reports!]
  3. Image recognition that can aid law enforcement personnel in identifying criminals.
  4. LexisNexis has a product called PatentAdvisor (lexisnexisip.com/products/patent-advisor) which uses data about individual patent examiners and how they’ve handled similar patent applications to predict the likelihood of an application being approved. Similarly, there are software applications that use artificial intelligence to help lawyers gather research material for cases, by identifying precedents that will maximize the chances of a successful ruling. (Keiser, 2018)

Drawbacks of AI:

There is no doubt that AI has created some amazing opportunities (image recognition to classify malignant tumors) and allowed companies to pass on boring admin tasks to machines. However, since AI systems are created by humans, they do have the following risks and limits:

  1. AI bias: The machine learning algorithms underlying AI systems can themselves be biased. All algorithms use an initial training set to learn how to identify and predict values, so if the underlying training set is biased, then the predictions will also be biased. Garbage in, garbage out. Moreover, these biases may not appear as an explicit rule but, rather, be embedded in subtle interactions among the thousands of factors considered. Hence, heavily regulated industries like banking bar the use of AI in loan approvals, as it may conflict with fair-lending laws.
  2. Lack of verification: Unlike regular rule-based systems, the neural networks typically used in AI systems deal with statistical truths rather than literal truths. So such systems may fail in rare, extreme cases, because the algorithm overlooks events that have a very low probability of occurrence – for example, predicting a Wall Street crash or a sudden natural calamity like a volcanic eruption (think Hawaii). Lack of verification is a major concern in mission-critical applications, such as controlling a nuclear power plant, or when life-or-death decisions are involved.
  3. Hard to correct errors: If the AI system makes an error (as all systems eventually fail), then diagnosing and correcting the system becomes unimaginably complex, as the underlying mathematics are very complicated.
  4. Human creativity and emotions cannot be automated. AI is excellent at mundane tasks, but not so good at things that are intuitive. As the authors state in their book (Davenport & Kirby, 2016), if the logic can be articulated, a rule-based system can be written and the process can be automated. However, tasks that involve emotions, creative problem-solving and social interactions cannot be automated. Examples include FBI negotiators, soldiers on flood rescue missions, and the inventor who knew the iPod would change the music industry and become a sensation long before anyone expressed a need for it.

Programming Skills for AI:

The skills used to build AI applications are the same as those needed for data science and software engineering roles. The top programming skills are Python, R, Java and C++. If you are looking to get started, then three excellent resources are listed below:

  1. Professional Program from Microsoft. The courses are completely free (gasp) although they do charge $99 per course for verified certificates. I took the free versions, and the courses offer a good mix of both practical labs and theory. https://academy.microsoft.com/en-us/professional-program/tracks/artificial-intelligence/
  2. Introduction to AI course from Udacity. https://www.udacity.com/course/intro-to-artificial-intelligence–cs271
  3. AI and Deep Learning courses by Kirill Eremenko on Udemy. I’ve taken 4 courses from him, and they were all great value for money, giving very real-world, hands-on coding experience. https://www.udemy.com/artificial-intelligence-az/

Please note that all three of these are honest recommendations; I am not being paid or compensated in any shape or form for adding these links.

 

 

REFERENCES

Brynjolfsson, E., McAfee, A. (2017) The business of artificial intelligence: What it can and cannot do for your organization. Harvard Business Review website. Retrieved from https://hbr.org/cover-story/2017/07/the-business-of-artificial-intelligence

Davenport, T., Kirby, J. (2016) Only Humans Need Apply: winners and losers in the age of smart machines. Harper Business.

Davenport, T., Ronanki, R. (2018) Artificial Intelligence for the Real World. Harvard Business Review.

Keiser, B. (2018) Law Library Management and Legal Research Meet Artificial Intelligence. onlineresearcher.net

Reddy, S. (2017) A computer was asked to predict which start-ups would be successful. The results were astonishing. World Economic Forum. Retrieved from https://www.weforum.org/agenda/2017/07/computer-ai-machine-learning-predict-the-success-of-startups/

 

 

 

 

How to raise money on Kickstarter – extensive EDA and prediction tutorial

In this tutorial, we will explore the characteristics of projects on Kickstarter and try to understand what separates the winners from the projects that failed to reach their funding goals.

Qs for Exploratory Analysis:

We will start our analysis with the aim of answering the following questions:

    1. How many projects were successful on Kickstarter, by year and category?
    2. Which sub-categories raised the most money?
    3. Which countries do projects originate from?
    4. How many projects exceeded their funding goal by 50% or more?
    5. Did any projects reach $100,000 or more? $1,000,000 or higher?
    6. What was the average amount contributed by each backer, and how does this change over time? Does this amount differ across categories?
    7. What is the average funding period?

 

Predicting success rates:
Using the answers from the above questions, we will try to create a model that can predict which projects are most likely to be successful.

The dataset is available on Kaggle, and you can run this script LIVE using this kernel link. If you find this tutorial useful or interesting, then please do upvote the kernel ! 🙂

Step 1 – Data Pre-processing

a) Let us take a look at the input dataset :

The projects are divided into main and sub-categories. The pledged amount “usd_pledged” has an equivalent value converted to USD, called “usd_pledged_real”. However, the goal amount does not have this conversion. So for now, we will use the amounts as is.

We can see how many people are backing each individual project using the column, “backers”.

b) Now let us look at the first 5 records:

The name doesn’t really indicate any specific pattern although it might be interesting to see if longer names have better success rates. Not pursuing that angle at this time, though.

c) Looking for missing values:

Hurrah, a really clean dataset, even after searching for “empty” strings. 🙂

 d) Date Formatting and splitting:

We have two dates in our dataset – “launch date” and “deadline date”. We convert them from strings to date format.
We also split these dates into the respective year and month columns, so that we can plot variations over time.
So we will now have 4 new columns: launch_year, launch_month, deadline_year and deadline_month.
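
A small sketch of this step; 'ks' is an illustrative name for the Kickstarter data frame, and the column names follow the Kaggle file but may differ in your copy:

ks$launched <- as.Date(ks$launched)
ks$deadline <- as.Date(ks$deadline)

ks$launch_year    <- as.numeric(format(ks$launched, "%Y"))
ks$launch_month   <- as.numeric(format(ks$launched, "%m"))
ks$deadline_year  <- as.numeric(format(ks$deadline, "%Y"))
ks$deadline_month <- as.numeric(format(ks$deadline, "%m"))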

Exploratory analysis:

a) How many projects are successful?

We see that “failed” and “successful” are the two main categories, comprising ~88% of our dataset.
Sadly, we do not know why some projects are marked “undefined” or “canceled”.
“Live” projects are those where the deadline has not yet passed, although a few of them have already achieved their goal.
Surprisingly, some “canceled” projects had also met their goals (pledged_amount >= goal).
Since these other categories are a very small portion of the dataset, we will subset and only consider records with status “failed” or “successful” for the rest of the analysis.

b) How many countries have projects on kickstarter?

We see the projects are overwhelmingly US-based. Some country names have the tag “N,0”, so we mark them as unknown.

c) Number of projects launched per year:

It looks like some records have launch dates like 1970, which cannot be right. So we discard any records with a launch or deadline year before 2009.
Plotting the counts per year on a graph: from the chart below, it looks like the number of projects peaked in 2015 and then went down. However, this should NOT be taken as an indicator of success rates.

 

 

Drilling down a bit more, let us see the count of projects by main_category.

Over the years, the maximum number of projects has been in these categories:

    1. Film & Video
    2. Music
    3. Publishing

 d) Number of projects by sub-category: (Top 20 only)


The Top 5 sub-categories are:

    1. Product Design
    2. Documentary
    3. Music
    4. Tabletop Games (interesting!!!)
    5. Shorts (really?! )

Let us now see the “status” of projects for these Top 5 sub-categories:
From the graph below, we see that for the categories “Shorts” and “Tabletop Games” there are more successful projects than failed ones.

 e) Backers by category and sub-category:

Since there are a lot of sub-categories, let us explore the sub-categories under the main theme “Design” 

Product design is not just the sub-category with the highest count of projects, but also the category with the highest success ratio.

 f) Add a flag to see how many projects were funded beyond their goal:

So ~40% of projects reached or surpassed their goal, which matches the number of successful projects.

 g) Calculate average contribution per backer:

From the mean, median and max values we quickly see that the median amount contributed by each backer is only ~$40 whereas the mean is higher due to the extreme positive values. The max amount by a single backer is ~$5000.

h) Calculate reach_ratio

The amount per backer is a good start, but what if the goal amount itself is only $1000? Then an average contribution per backer of $50 implies we only need 20 backers.
So to better understand the probability of a project’s success, we create a derived metric called “reach_ratio”.
This takes the average user contribution and compares it against the goal fund amount.
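
As a sketch, the metric can be derived along these lines (reusing the placeholder data frame 'ks' and expressing the ratio as a percentage of the goal):

# average contribution per backer, then compare it against the goal amount
ks$avg_per_backer <- ifelse(ks$backers > 0, ks$usd_pledged_real / ks$backers, 0)
ks$reach_ratio    <- (ks$avg_per_backer / ks$goal) * 100   # as a percentage

summary(ks$reach_ratio)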

We see the median reach_ratio is <1%. Only in the third quartile do we even touch 2%!
Clearly most projects have a very low reach ratio. We could subset for “successful” projects only and check if the reach_ratio is higher.

 i) Number of days to achieve goal:

 Predictive Analytics:

We will apply a very simple decision tree algorithm to our dataset.
Since we do not have a separate “test” set, we will split the input dataframe into 2 parts (70/30 split).
We will use the smaller set to test the accuracy of our algorithm.
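
A hedged sketch of this step, assuming the rpart package and the placeholder columns used earlier; the exact feature set in the kernel may differ:

library(rpart)

set.seed(42)                                   # reproducible split
train_idx <- sample(seq_len(nrow(ks)), size = floor(0.7 * nrow(ks)))
train <- ks[train_idx, ]
test  <- ks[-train_idx, ]

# 'state' holds the failed/successful label in the Kaggle file
tree_fit <- rpart(factor(state) ~ backers + reach_ratio + main_category + goal,
                  data = train, method = "class")

pred <- predict(tree_fit, newdata = test, type = "class")
table(actual = test$state, predicted = pred)   # confusion matrix on the 30% holdout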

Taking a peek at the decision tree rules:

Kickstarter success decision tree




Thus we see that “backers” and “reach_ratio” are the main significant variables.

Re-applying the tree rules to the training set itself, we can validate our model:

From the above tables, we see that the error rate = ~3% and area under curve >= 97%

Finally applying the tree rules to the test set, we get the following stats:

From the above tables, we see that the error rate is still ~3% and the area under the curve is >= 97%.

 

Conclusion:

Thus, in this tutorial we explored the factors that contribute to a project’s success. Main theme and sub-category were important, but the number of backers and the “reach_ratio” were found to be most critical.
If a founder wanted to gauge their probability of success, they could measure their “reach_ratio” halfway to the deadline, or perhaps when 25% of the timeline is complete. If the numbers are low, it means they need to double down and use promotions and social media marketing to get more backers and funding.

If you liked this tutorial, feel free to fork the script. And don’t forget to upvote the kernel! 🙂

Twitter Sentiment Analysis

Introduction

Today’s post is a 2-part tutorial series on how to create an interactive Shiny/R application that displays sentiment analysis for various phrases and search terms. The application accepts a user search term as input and graphically displays the sentiment analysis.

In keeping with this month’s theme – “API programming” – this project uses the Twitter API to perform a real-time search for tweets containing the user's input term. A live app link on the Shiny website is provided, and a screenshot is shown below:

Shiny application for Twitter Sentiment Analysis

The project idea may seem simple at first, but will teach you the following skills:

  • working with Twitter API and dynamic data streaming (every time the search term changes, the program sends a new request to Twitter for relevant tweets),
  • Building an “interactive”, real-time application in Shiny/R,
  • data visualization with R

As always, the entire source code is also available for download on the Projects Page, or it can be forked from my Github account here.

 

The tutorial is divided into 3 parts:

  1. Introduction
  2. Twitter Connectivity & search
  3. Shiny design

 

Application Design:

Any good software project begins with the design first. For this application, the design flowchart is shown below:

Design Flowchart for Shiny app

 

 

Twitter Connectivity

This is similar to the August project and mainly consists of two calls to the Twitter API:

  • authorize the Twitter API to mine data, using the setup_twitter_oauth() function and your Twitter developer keys:

library(twitteR)
consumer_key    <- "ckey"
consumer_secret <- "csecret"
access_token    <- "atoken"
access_secret   <- "asecret"
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

  • check whether the input search term returns tweets containing the phrase. If the number of tweets is <= 5, return an error message; if it is > 5, process the tweets and display a sentiment-analysis bar chart. A custom function performs this check:

chk_searchterm <- function(term) {
  # look for up to 20 tweets containing this search term
  tw_search <- searchTwitter(term, n = 20, since = '2013-01-01')

  if (length(tw_search) <= 5) {
    return_term <- "None/few tweets to analyse for this search term. Please try again!"
  } else {
    return_term <- paste("Extracting max 20 tweets for Input =", term, ". Sentiment graph below")
  }

  return(return_term)
}

The bar graph is created by assigning numeric scores for each of the positive and negative emotions in the tweet text. Emotions used: anger, anticipation, disgust, joy, sadness, surprise, trust, plus overall positive and negative sentiment.
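
One common way to compute such emotion scores is the NRC lexicon via the syuzhet package; the sketch below is illustrative, and the live app may use a different lexicon or custom scoring:

library(syuzhet)

# pull the raw text out of the twitteR status objects returned by searchTwitter()
tweet_text <- sapply(tw_search, function(t) t$getText())

# NRC lexicon scores: anger, anticipation, disgust, fear, joy, sadness,
# surprise, trust, plus overall negative and positive
emotions <- get_nrc_sentiment(tweet_text)

barplot(colSums(emotions), las = 2,
        main = "Sentiment scores for the search term")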

 

Shiny webapp

The actual Shiny application design and Twitter connectivity code are explained in detail in the next post.
