Journey of Analytics

Deep dive into data analysis tools, theory and projects

Category: Monthly projects (page 1 of 3)

Automated Email Reports with R

R is an amazing tool to perform advanced statistical analysis and create stunning visualizations. However, data scientists and analytics practitioners do not work in silos, so these analyses have to be copied and emailed to senior managers and partner teams. Cut-copy-paste sounds great, but if it is a daily or periodic task, it is more useful to automate the reports. So in this blogpost, we are going to learn how to do exactly that.

The R-code uses specific library packages to do this:

  • RDCOMClient – to connect to Outlook and send emails. In most offices, Outlook is still the de facto email client, so this is fine. However, if you are using Slack or a different mail client, it may not work.
  • r2excel – to create an Excel output file.

The screenshot below shows the final email view:

email screenshot


As seen in the screenshot, the email contains the following:

  • Custom subject with current date
  • Embedded image
  • Attachments – 1 Excel and 1 pdf report

Code Explanation:

The code and supporting input files are available here, on the Projects page under Nov2018. The code has 6 parts:

  • Prepare the work space.
  • Pull the data from source.
  • Cleaning and calculations
  • Create pdf.
  • Create Excel file.
  • Send email.

 

Prepare the work space

I always set the relative paths and working directories at the very beginning, so it is easier to change paths later. You can replace the link with a shared network drive path as well.

Load library packages and custom functions. My code uses the r2excel package, which is not directly available on CRAN, so you need to install it from its source repository using devtools.

It is possible to do something similar using the “xlsx” package, but r2excel is easier.

Some other notes:

  • You need the first 2 lines of code only for the first-time installation. From the second time onwards, you only need to load the library.
  • r2excel seems to work only with 64-bit installations of R and RStudio.
  • You do need Java installed on your computer. If you see an error about the Java namespace, check the path variables. There is a very useful thread on Stack Overflow, so take a look.
  • As always, if you see errors, Google them and use the Stack Overflow conversations. In 99% of cases, you will find an answer.
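
The install-once, load-every-time pattern can be sketched as below. This is only a sketch: the helper name ensure_github_package is hypothetical, and the repository location kassambara/r2excel is an assumption you should verify before running the commented-out lines.

```r
# Hypothetical helper: install a GitHub-hosted package only when it is missing.
ensure_github_package <- function(pkg, repo) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    if (!requireNamespace("devtools", quietly = TRUE)) {
      install.packages("devtools")
    }
    devtools::install_github(repo)
  }
  # Report whether the package is now available.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# First run only (assumed repository location; uncomment to install):
# ensure_github_package("r2excel", "kassambara/r2excel")

# Every run: load the package once it is installed.
# library(r2excel)
```

Because the helper checks requireNamespace() first, it is safe to leave the call in a scheduled script; it does nothing when the package is already present.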

Pull the data from source

This is where we connect to a CSV (or text) file. In practice, most people connect to a database of some kind. The R script I am using connects to a .csv file, but I have added the code to connect to a SQL database.

That code snippet is commented out, so feel free to substitute your own SQL database links. The code will also work for an Amazon EC2 cluster.

Some points to keep in mind:

  • If you are using sqlQuery(), please note that if your query has an error, R sadly shows only a standard error message. So test your query on SQL Server first to ensure that you are not missing anything.
  • Some queries take a long time if you are pulling from a huge dataset. The time taken will also be longer in R compared to a direct SQL Server connection. Calling Sys.time() before and after the query is helpful to know how long the query took to complete.
  • If you are only planning to pull the data occasionally, it may make sense to pull from SQL Server and store the results locally. Use the fread() function from the data.table package to read those files.
  • If you are using R desktop instead of R-server, the amount of data you can pull may be limited by your system configuration.
  • ALWAYS optimize your query. Even if you have unlimited memory and computation power, only pull the data you absolutely need. Otherwise you end up unnecessarily sorting through irrelevant data.
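
As a rough illustration of the timing tip above, the sketch below wraps a data pull in two Sys.time() calls. It reads a small temporary CSV (made-up data) with base read.csv() so that it runs anywhere; in a real script the timed line would be your database query or a data.table::fread() call.

```r
# Build a small stand-in file so the example is self-contained (made-up data).
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, value = rnorm(1000)), csv_path, row.names = FALSE)

start_time <- Sys.time()
app_data <- read.csv(csv_path, stringsAsFactors = FALSE)  # the step being timed
end_time <- Sys.time()

# difftime() keeps the units explicit instead of relying on automatic scaling.
elapsed <- difftime(end_time, start_time, units = "secs")
message("Rows read: ", nrow(app_data),
        "; time taken: ", round(as.numeric(elapsed), 3), " secs")
```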

Cleaning and calculations

For the current data, there are no NAs, so we don’t need to account for those. However, the read.csv() command creates factors by default (in R versions before 4.0), which I personally do not like, as they sometimes cause issues while merging.

Some of the column names have “.” where R converted the space in the names. So we will manually replace those with an underscore using the gsub() function.
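
A minimal sketch of both cleaning steps, using a tiny inline CSV with made-up column names and values rather than the real input file:

```r
# Inline stand-in for the input file; the headers contain spaces on purpose.
csv_text <- "App Name,Content Rating,Reviews\nMaps,Everyone,100\nChat,Teen,250"

# stringsAsFactors = FALSE keeps text columns as character vectors.
apps <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# read.csv() has turned the spaces into dots: App.Name, Content.Rating.
# fixed = TRUE matters here, because "." is a wildcard in a regular expression.
names(apps) <- gsub(".", "_", names(apps), fixed = TRUE)

str(apps)
```

Without fixed = TRUE, gsub() would treat the dot as "any character" and replace every letter in the column names.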

We will also rank the apps based on categories of interest, namely:

  • Most Popular Apps – by number of Reviews
  • Most Popular Apps – by number of Downloads and Reviews
  • Most Popular Categories – Paid Apps only
  • Most popular apps with 1 billion installations.

Create pdf

We are going to use the pdf() function to paste all graphs to a pdf document. Basically what this function does is write the graphs to a file rather than show on the console. So the only thing to remember is that if you are testing graphs or make an incorrect graph, everything will get posted to the pdf until you hit the “dev.off()” function. Sometimes if the graph throws an error you may end up with a blank page, or worse, with a corrupt file that cannot be opened.

Currently, the code prints only 2 simple graphs using the ggplot() and barplot() functions, but you can include many other plots as well.
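
One way to reduce the risk of a corrupt or half-written file is to guarantee that dev.off() always runs. The sketch below does this with tryCatch(); the file name and the plotted numbers are made up, and it uses only base graphics so that it runs without extra packages.

```r
pdf_path <- file.path(tempdir(), "report_graphs.pdf")  # hypothetical file name

pdf(pdf_path, width = 8, height = 6)  # from here on, plots go to the file
tryCatch({
  counts <- c(Games = 12, Tools = 7, Social = 5)  # made-up values
  barplot(counts, main = "Apps per category", ylab = "Count")  # page 1
  plot(1:10, (1:10)^2, type = "b", main = "Second page")       # page 2
}, finally = dev.off())  # always close the device, even if a plot fails
```

Each high-level plotting call starts a new page, so the resulting file has two pages.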

 

Create Excel file.

The Excel is created in the sequence below:

  • Specify the filename and create an object of type .xlsx. This will create an empty Excel placeholder. It is only complete when you save the workbook using saveWorkbook() at the end of the section.
  • Use the sheets() to create different worksheets within the Excel.
  • The  xlsx.addHeader() adds a bold Header to each sheet which will help readers understand the content on the page. The r2excel package has other functions to add more informative text in smaller (non-header) font as well, if you need to give some context to readers. Obviously, this is optional if you don’t want to add them.
  • xlsx.addTable() – this is the crucial function that adds the content to Excel, the main “meat” of what you need to show.
  • saveWorkbook() – this function will save the Excel to the folder.
  • xlsx.openFile() – this function opens the file so you can view contents. I typically have the script running on automated mode, so when the Excel opens I am notified that the script completed.
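
The sequence above might look roughly like the sketch below. It assumes the r2excel function signatures as documented on the package's project page, that loading r2excel also makes the xlsx functions createWorkbook(), createSheet() and saveWorkbook() available, and it uses a made-up file name with the built-in mtcars data as a placeholder table. The whole block is skipped when r2excel is not installed.

```r
if (requireNamespace("r2excel", quietly = TRUE)) {
  library(r2excel)  # assumed to attach the xlsx package it builds on

  xlsx_path <- file.path(tempdir(), "app_report.xlsx")  # hypothetical file name

  wb <- createWorkbook(type = "xlsx")               # empty placeholder workbook
  sheet <- createSheet(wb, sheetName = "Top apps")  # one worksheet

  # Bold header, a blank line, then the data table itself.
  xlsx.addHeader(wb, sheet, value = "Most popular apps by reviews", level = 1)
  xlsx.addLineBreak(sheet, 1)
  xlsx.addTable(wb, sheet, data = head(mtcars), startCol = 1)

  saveWorkbook(wb, xlsx_path)  # nothing is written to disk before this call
  # xlsx.openFile(xlsx_path)   # uncomment to open the file when the script ends
}
```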

Send email

The email is sent using the following functions:

  • OutApp() – creates an Outlook object. As I mentioned earlier, you do need Outlook and need to be signed in for this to work. I use Outlook for work and at home, so I have not explored options for Slack or other email clients.
  • outMail[["To"]] – specify the people in the “to” field. You could also read email addresses from a file and pass the values here.
  • outMail[["CC"]] – similar concept, for the cc field.
  • outMail[["Subject"]] – I have used the paste0() function to add the current date to the subject, so recipients know it is the latest report.
  • outMail[["HTMLBody"]] – I used the HTML body so that I can embed the image. If you don’t know HTML programming, no worries! The code is pretty intuitive, and you should be able to follow what I’ve done. The image is basically an attachment which the HTML code forces to be displayed within the body of the email. If you are sending the email to people outside the organization, they may see a small box with a cross on the top left (or right) instead of the image. Usually, when they hover the mouse near the box and right-click, they are asked whether to download images. You may have seen similar messages in Gmail, along with a link to “show images” or “always show images from this sender”. You obviously cannot control what the recipient selects, but testing by sending to yourself first helps smooth out potential aesthetic issues.
  • outMail[["Attachments"]] – used to add attachments.
  • outMail$Send() – until you run this command, the mail will not be sent. If you are using this in the office, you may get one or more of the popups below. Most of these will generally go away after the first use, but if they don’t, please look up the issue on Stack Overflow or contact your IT support about firewall and other security settings.
    • popup to hit “send”
    • popup asking you to “classify” the attachments (internal / public / confidential). Select as appropriate. For me, this selection is usually “internal”.
    • popup asking you to accept “trust” settings
    • popup blocker notifying you to allow backend app to access Outlook.
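
Put together, the email step could be sketched as follows. The subject and HTML body are plain strings, so they can be built and checked on any machine; the send_report() function is a hypothetical wrapper that only works on Windows with Outlook and the RDCOMClient package installed, and it is defined here but never called. The recipient values, file names and the cid: image reference are assumptions for illustration.

```r
# Subject with the current date, so recipients know it is the latest report.
report_date <- format(Sys.Date(), "%d-%b-%Y")
mail_subject <- paste0("Daily app report - ", report_date)

# HTML body; the cid: reference asks the mail client to show an attached
# image inline. The image file name is hypothetical.
image_file <- "summary_chart.png"
html_body <- paste0(
  "<p>Hello team,</p>",
  "<p>Please find the latest report attached.</p>",
  "<img src='cid:", image_file, "'>"
)

# Hypothetical wrapper around the Outlook COM calls (Windows + Outlook only).
send_report <- function(to, cc, subject, body, attachments) {
  outlook <- RDCOMClient::COMCreate("Outlook.Application")
  mail <- outlook$CreateItem(0)  # 0 = a new mail item
  mail[["To"]] <- to
  mail[["CC"]] <- cc
  mail[["Subject"]] <- subject
  mail[["HTMLBody"]] <- body
  # Attachment paths should be full paths, including the inline image.
  for (path in attachments) mail[["Attachments"]]$Add(path)
  mail$Send()
}
```

A call would then look like send_report("manager@example.com", "team@example.com", mail_subject, html_body, c("C:/reports/summary_chart.png", "C:/reports/app_report.xlsx")), with the addresses and paths replaced by your own.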

 

That is it – and you are done! You have successfully learned how to send an automated email via R.

How to raise money on Kickstarter – extensive EDA and prediction tutorial

In this tutorial, we will explore the characteristics of projects on Kickstarter and try to understand what separates the winners from the projects that failed to reach their funding goals.

Qs for Exploratory Analysis:

We will start our analysis with the aim of answering the following questions:

    1. How many projects were successful on Kickstarter, by year and category.
    2. Which sub-categories raised the most amount of money?
    3. Projects originate from which countries?
    4. How many projects exceeded their funding goal by 50% or more?
    5. Did any projects reach $100,000 or more? $1,000,000 or higher?
    6. What was the average amount contributed by each backer, and how does this change over time? Does this amount differ with categories?
    7. What is the average funding period?

 

Predicting success rates:
Using the answers from the above questions, we will try to create a model that can predict which projects are most likely to be successful.

The dataset is available on Kaggle, and you can run this script LIVE using this kernel link. If you find this tutorial useful or interesting, then please do upvote the kernel ! 🙂

Step1 – Data Pre-processing

a) Let us take a look at the input dataset :

The projects are divided into main and sub-categories. The pledged amount “usd_pledged” has an equivalent value converted to USD, called “usd_pledged_real”. However, the goal amount does not have this conversion. So for now, we will use the amounts as is.

We can see how many people are backing each individual project using the column, “backers”.

b) Now let us look at the first 5 records:

The name doesn’t really indicate any specific pattern although it might be interesting to see if longer names have better success rates. Not pursuing that angle at this time, though.

c) Looking for missing values:

Hurrah, a really clean dataset, even after searching for “empty” strings. 🙂

 d) Date Formatting and splitting:

We have two dates in our dataset – “launch date” and “deadline date”. We convert them from strings to date format.
We also split these dates into the respective year and month columns, so that we can plot variations over time.
So we will now have 4 new columns: launch_year, launch_month, deadline_year and deadline_month.
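
A small sketch of this step on two made-up records. The input column names launched and deadline are assumptions about the dataset; the four output columns are the ones named above.

```r
# Two made-up records; the launch value carries a time stamp as well.
projects <- data.frame(
  launched = c("2015-08-11 12:12:28", "2017-09-02 04:43:57"),
  deadline = c("2015-10-09", "2017-11-01"),
  stringsAsFactors = FALSE
)

# Keep only the date part (first 10 characters) and convert to Date.
projects$launched <- as.Date(substr(projects$launched, 1, 10))
projects$deadline <- as.Date(substr(projects$deadline, 1, 10))

# Split each date into year and month columns for plotting over time.
projects$launch_year    <- as.integer(format(projects$launched, "%Y"))
projects$launch_month   <- as.integer(format(projects$launched, "%m"))
projects$deadline_year  <- as.integer(format(projects$deadline, "%Y"))
projects$deadline_month <- as.integer(format(projects$deadline, "%m"))
```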

Exploratory analysis:

a) How many projects are successful?

We see that “failed” and “successful” are the two main categories, comprising ~88% of our dataset.
Sadly we do not know why some projects are marked “undefined” or “canceled”.
“live” projects are those where the deadlines have not yet passed, although a few among them have already achieved their goal.
Surprisingly, some “canceled” projects had also met their goals (pledged_amount >= goal).
Since these other categories are a very small portion of the dataset, we will subset and only consider records with status “failed” or “successful” for the rest of the analysis.
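
The share of each status and the subset step can be sketched like this, on ten made-up records with an assumed column name state:

```r
# Ten made-up records, just to show the mechanics.
projects <- data.frame(
  state = c("failed", "failed", "successful", "canceled", "failed",
            "successful", "live", "failed", "successful", "failed"),
  stringsAsFactors = FALSE
)

# Percentage of records in each status.
state_share <- round(100 * prop.table(table(projects$state)), 1)
print(state_share)

# Keep only the two statuses used in the rest of the analysis.
main_states <- subset(projects, state %in% c("failed", "successful"))
```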

b) How many countries have projects on kickstarter?

We see projects are overwhelmingly from the US. Some country values carry the tag N,0", so we mark them as unknown.

c) Number of projects launched per year:

Looks like some records show dates like 1970, which does not look right. So we discard any records with a launch / deadline year before 2009.
Plotting the counts per year on a graph:
From the graph below, it looks like the count of projects peaked in 2015, then went down. However, this should NOT be taken as an indicator of success rates.

 

 

Drilling down a bit more to see count of projects by main_category.

Over the years, maximum number of projects have been in the categories:

    1. Film & Video
    2. Music
    3. Publishing

 d) Number of projects by sub-category: (Top 20 only)


The Top 5 sub-categories are:

    1. Product Design
    2. Documentary
    3. Music
    4. Tabletop Games (interesting!!!)
    5. Shorts (really?! )

Let us now see “Status” of projects for these Top 5 sub_categories:
From the graph below, we see that for the categories “shorts” and “tabletop games” there are more successful projects than failed ones.

 e) Backers by category and sub-category:

Since there are a lot of sub-categories, let us explore the sub-categories under the main theme “Design” 

Product design is not just the sub-category with the highest count of projects, but also the category with the highest success ratio.

 f) add flag to see how many got funded more than the goal.

So ~40% of projects reached or surpassed their goal, which matches the number of successful projects.

 g) Calculate average contribution per backer:

From the mean, median and max values we quickly see that the median amount contributed by each backer is only ~$40 whereas the mean is higher due to the extreme positive values. The max amount by a single backer is ~$5000.

h) Calculate reach_ratio

The amount per backer is a good start, but what if the goal amount itself is only $1000? Then an average contribution per backer of $50 implies we only need 20 backers.
So to better understand the probability of a project’s success, we create a derived metric called “reach_ratio”.
This takes the average user contribution and compares it against the goal fund amount.

We see the median reach_ratio is <1%. Only in the third quartile do we even touch 2%!
Clearly most projects have a very low reach ratio. We could subset for “successful” projects only and check if the reach_ratio is higher.
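
To make the metric concrete, here is a sketch on three made-up projects. It assumes reach_ratio is the average pledge per backer expressed as a percentage of the goal; the column names are also assumptions.

```r
# Three made-up projects, each with an average pledge of $50 per backer.
projects <- data.frame(
  goal    = c(1000, 5000, 20000),
  pledged = c(1200, 250, 20000),
  backers = c(24, 5, 400)
)

# Average contribution per backer; guard against projects with no backers.
projects$pledge_per_backer <- ifelse(projects$backers > 0,
                                     projects$pledged / projects$backers, 0)

# Assumed definition: average pledge as a percentage of the goal.
projects$reach_ratio <- 100 * projects$pledge_per_backer / projects$goal

print(projects)
summary(projects$reach_ratio)
```

Under this definition, 100 / reach_ratio is the number of average backers needed to hit the goal: the first project has a reach_ratio of 5, which corresponds to the 20 backers in the $1000 example above.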

 i) Number of days to achieve goal:

 Predictive Analytics:

We will apply a very simple decision tree algorithm to our dataset.
Since we do not have a separate “test” set, we will split the input dataframe into 2 parts (70/30 split).
We will use the smaller set to test the accuracy of our algorithm.
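
A minimal sketch of the split-and-fit workflow. It assumes the rpart package as the decision tree implementation (the text above does not name one) and uses simulated data in place of the Kickstarter records, so the resulting error rate says nothing about the real dataset.

```r
library(rpart)  # assumed tree implementation; ships with standard R installs

set.seed(42)
n <- 500

# Simulated stand-in data: success depends on backers and reach_ratio.
projects <- data.frame(
  backers     = rpois(n, 40),
  reach_ratio = runif(n, 0, 10)
)
projects$state <- factor(
  ifelse(projects$backers * projects$reach_ratio > 150, "successful", "failed")
)

# 70/30 split: sample row numbers for training, use the rest for testing.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- projects[train_idx, ]
test  <- projects[-train_idx, ]

# Fit a classification tree and score the held-out rows.
tree <- rpart(state ~ backers + reach_ratio, data = train, method = "class")
pred <- predict(tree, newdata = test, type = "class")

error_rate <- mean(pred != test$state)
print(table(predicted = pred, actual = test$state))
```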

Taking a peek at the decision tree rules:

kickstarter success decision tree





Thus we see that “backers” and “reach_ratio” are the most significant variables.

Re-applying the tree rules to the training set itself, we can validate our model:

From the above tables, we see that the error rate = ~3% and area under curve >= 97%

Finally applying the tree rules to the test set, we get the following stats:

From the above tables, we see that still the error rate = ~3% and area under curve >= 97%

 

Conclusion:

Thus in this tutorial, we explored the factors that contribute to a project’s success. Main theme and sub-category were important, but the number of backers and “reach_ratio” were found to be most critical.
If a founder wanted to gauge their probability of success, they could measure their “reach_ratio” halfway to the deadline, or perhaps when 25% of the timeline is complete. If the numbers are low, it means they need to double down and use promotions/social media marketing to get more backers and funding.

If you liked this tutorial, feel free to fork the script. And don’t forget to upvote the kernel! 🙂

Who wants to work at Google?

In this tutorial, we will explore the open roles at Google, and try to see what common attributes Google is looking for, in future employees.

 

This dataset comes from the Kaggle site, and contains text information about job location, title, department, minimum and preferred qualifications and the responsibilities of the position. You can download the dataset here, and run the code on the Kaggle site itself here. Using this dataset we will try to answer the following questions:

  1. Where are the open roles?
  2. Which departments have the most openings?
  3. What are the minimum and preferred educational qualifications needed to get hired at Google?
  4. How much experience is needed?
  5. What categories of roles are the most in demand?

Data Preparation and Cleaning:

The data is all in free-form text, so we do need to do a fair amount of cleanup to remove non-alphanumeric characters. Some of the job locations have special characters too, so we remove those using basic string manipulation functions. Once we read in the file, this is the snapshot of the resulting dataframe:

Job Categories:

First we look at which departments have the most open roles. Surprisingly, there are more roles open for the “Marketing and Communications” and “Sales & Account Management” categories, as compared to the traditional technical business units (like Software Engineering or Networking).

Full-time versus internships:

Let us see how many roles are full-time and how many are for students. As expected, only ~13% of roles are for students, i.e. internships. The majority are full-time positions.

Technical Roles:

Since Google is predominantly a technical company, let us see how many positions need technical skills, irrespective of the business unit (job category).

a) Roles related to “Google Cloud”:

To check this, we investigate how many roles have the phrase either in the job title or the responsibilities. As shown in the graph below, ~20% of the roles are related to Cloud infrastructure, clearly showing that Google is making Cloud services a high priority.
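
The phrase search can be sketched with grepl() on four made-up job records. The column names title and responsibilities are assumptions about the dataset.

```r
# Four made-up roles; two of them mention the phrase.
jobs <- data.frame(
  title = c("Google Cloud Sales Engineer", "Marketing Manager",
            "Technical Program Manager", "Recruiter"),
  responsibilities = c("Work with customers on migrations.",
                       "Plan product launch campaigns.",
                       "Drive Google Cloud partner integrations.",
                       "Source candidates for open roles."),
  stringsAsFactors = FALSE
)

pattern <- "google cloud"

# A role counts if the phrase appears in either column, ignoring case.
is_cloud <- grepl(pattern, jobs$title, ignore.case = TRUE) |
  grepl(pattern, jobs$responsibilities, ignore.case = TRUE)

cloud_share <- round(100 * mean(is_cloud), 1)  # percentage of cloud roles
```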

Educational Qualifications:

Here we are basically parsing the “min_qual” and “pref_qual” columns to see the qualifications needed for the role. If we only take the minimum qualifications into consideration, we see that 80% of the roles explicitly ask for a bachelor’s degree. Less than 5% of roles ask for a master’s degree or PhD.

min_qualifications for Google jobs

However, when we consider the “preferred” qualifications, the share of roles asking for a master’s degree or PhD increases to a whopping ~25%. Thus, a fourth of all roles would be more suited to candidates with master’s degrees and above.

Google Engineers:

Google is famous for hiring engineers for all types of roles. So we will read the job qualification requirements to identify what percentage of roles requires a technical degree or degree in Engineering.
As seen from the data, 35% specifically ask for an Engineering or computer science degree, including roles in marketing and non-engineering departments.

Years of Experience:

We see that 30% of the roles require at least 5 years, while 35% of roles need even more experience.
So if you did not get hired at Google after graduation, no worries. You have a better chance after gaining strong experience at other companies.
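
One possible way to pull the years of experience out of free-form qualification text is a regular expression, sketched below on three made-up strings. The pattern is an assumption and will miss phrasings such as “five years”.

```r
# Made-up qualification strings; the last one mentions no experience.
min_qual <- c("BA/BS degree and 5 years of experience in sales",
              "3 years of relevant work experience",
              "Currently enrolled in a degree program")

# Digits that are directly followed by " years" (optionally "+ years").
m <- regexpr("[0-9]+(?=\\+? years)", min_qual, perl = TRUE)

# regmatches() returns only the hits, so fill a vector of NAs by position.
years <- rep(NA_integer_, length(min_qual))
has_match <- as.vector(m) > 0
years[has_match] <- as.integer(regmatches(min_qual, m))

# Share of roles asking for at least 5 years, among roles that state a number.
share_5_plus <- mean(years >= 5, na.rm = TRUE)
```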

Role Locations:

The dataset does not have the geographical coordinates for mapping. However, this is easily overcome by using the geocode() function and the amazing rworldmap package. We are only plotting the locations, so the map does not show that some places have more roles than others. We see open roles in all parts of the world. However, the maximum positions are in the US, followed by the UK, and then Europe as a whole.

Responsibilities – Word Cloud:

Let us create a word cloud to see what skills are most needed for the Cloud engineering roles. We see that words like “partner”, “custom solutions”, “cloud”, “strategy” and “experience” are more frequent than any specific technical skills. This shows that the Google Cloud roles are best filled by senior resources, where leadership and business skills become more significant than expertise in a specific technology.

 

Conclusion:

So who has the best chance of getting hired at Google?

For most of the roles (from this dataset), a candidate with the following traits has the best chance of getting hired:

  1. 5+ years of experience.
  2. Engineering or Computer Science bachelor’s degree.
  3. Masters degree or higher.
  4. Working in the US.

The code for this script and the graphs is available here on the Kaggle website. If you liked it, don’t forget to upvote the script. 🙂

Thanks and happy coding!

August Projects

In this month’s project, we will implement cluster analysis using the “K-means algorithm”.

We use the weather data from 1500+ locations (near airports) to understand temperature patterns by latitude and time of year.

We use 5 clusters and assign letters A through E to locations with similar weather patterns. At the end of the analysis, you should be able to interpret the following insights from the resulting graphs and tables:

  1. Temperature patterns are similar towards the far North and South, just vertically shifted.
  2. The Pacific coast is different from the rest of the nation, where the temperature is static almost throughout the year.
  3. It is interesting to see how states in two different parts of the country show similar weather patterns since they are on the same latitude (see Minnesota and Maine). During peak summer, these two states are hotter than California.
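
The clustering step can be sketched with the base kmeans() function. The temperatures below are simulated for 60 imaginary stations (one row per station, one column per month), so the clusters only illustrate the mechanics, not the real weather data.

```r
set.seed(1)

# Simulated monthly temperatures: warmer and flatter at low latitudes,
# colder with a bigger seasonal swing further north.
months   <- 1:12
seasonal <- -cos(2 * pi * (months - 1) / 12)  # lowest in January
latitude <- runif(60, 25, 48)

temps <- t(sapply(latitude, function(lat) {
  (90 - lat) + (lat / 3) * seasonal + rnorm(12, sd = 1.5)
}))

# K-means with 5 clusters; nstart repeats the random initialisation.
fit <- kmeans(temps, centers = 5, nstart = 20)

# Relabel clusters 1..5 as letters A..E.
cluster_label <- LETTERS[fit$cluster]
print(table(cluster_label))
```

To try the follow-up questions, change centers to 7 or 3 and compare how the stations regroup.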

 

A sample graph from the analysis is shown below.


US states divided into 5 major weather clusters

Data set and code files are available from the main Project site page, under the row for Jul/Aug 2017.

Take a look and play around with the data, to investigate the following:

  1. What happens if you increase cluster size to 7? What happens if you decrease it to 3?
  2. What is the monthly weather pattern for Hawaii (state code = HI) versus New Hampshire (abbreviation = NH) ?
  3. What is the weekly average temperature for a tropical state like Florida (plot a chart with median temperatures for all 52 weeks, by year). Has the average temperature gone up due to global warming?

Please leave your thoughts and comments, or questions if you get stuck on any point.

Happy Coding!

 

 

Monte Carlo Simulations in R

In today’s tutorial, we are going to learn how to implement Monte Carlo Simulations in R.

Logic behind Monte Carlo:


Monte Carlo simulation (also known as the Monte Carlo Method) is a statistical technique that uses repeated random sampling to estimate the possible outcomes of an event and how likely each one is. This makes it extremely helpful in risk assessment and aids decision-making because we can predict the probability of extreme cases coming true. The technique was first used by scientists working on the atom bomb; it was named for Monte Carlo, the Monaco resort town renowned for its casinos. Since its introduction in World War II, Monte Carlo simulation has been used to model a variety of physical and conceptual systems.

Monte Carlo methods are used to identify the probability of an event A happening, among a set of N events. We assume that all the events are independent, so event A happening once does not prevent it from happening again.

For example, assume you have a fair coin and you flip it once. The probability of heads is 0.5, i.e. an equal possibility of heads or tails. You flip the coin again. The probability of heads is still 0.5, irrespective of whether you got heads or tails in the first flip. So we can safely say that if you were to flip the coin 100 times, you would see heads ~50% of the time. The application of Monte Carlo (referred to henceforth in this post as MC) methods comes into play when we want to find out the probability of heads occurring 16 times in a row (or 5 or 3 or any other number).

You can read more about these methods and the theory behind them, using the links below:

  1. Wikipedia – link.
  2. MC methods in Finance, from Investopedia.com – link2
  3. Basics of MC from software provider Palisade. – link3.

Applications:

MC methods are used by professionals in numerous fields ranging from finance, project management, energy, manufacturing, R&D, insurance, biotech, etc. Some real-world applications of Monte Carlo simulations are given below:

  1. Monte Carlo simulations are used in financial services to predict fraudulent credit card transactions. (since 100 genuine transactions do not guarantee the next one will not be fraudulent, even though it is a rare event by itself.)
  2. Risk analysis. Assume a new product was sold at a loss of $300 to 6 users (due to coupons or sales), a profit of $467 to 79 users and a profit of $82 to 119 customers. We can use Monte Carlo simulations to understand what would be the average P/L (profit or loss) if 1000 customers bought our products.
  3. A/B testing to understand page bounce and successful web elements. Assume you changed the payment processing system on your e-commerce site. You are doing an A/B test to see if the upgrade results in improved checkout completion. On the old system, 12 users abandoned their cart, while 19 completed their purchase. On the new system, 147 people abandoned their cart while 320 completed their purchase. Which system works better?
  4. Selection criteria. For example, if we have 7 candidates for a scholarship (Eileen, George, Taher, Ramesis, Arya, Sandra and Mike), what is the probability that Mike will be chosen in three consecutive years? We assume the candidate list is the same and past winners are not barred from receiving the scholarship again.
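
The risk analysis case (example 2) can be sketched as follows. It assumes each future customer falls into one of the three observed outcomes with a probability equal to its observed share (6, 79 and 119 out of 204 users); the number of runs is an arbitrary choice.

```r
set.seed(123)

outcomes <- c(-300, 467, 82)   # P/L per customer, from example 2
counts   <- c(6, 79, 119)      # users observed with each outcome
probs    <- counts / sum(counts)

runs      <- 2000   # number of simulated scenarios (arbitrary)
customers <- 1000   # customers per scenario

# Each run draws 1000 customers and records their average P/L.
avg_pl <- replicate(
  runs,
  mean(sample(outcomes, size = customers, replace = TRUE, prob = probs))
)

# Analytical expectation for comparison: sum of outcome x probability.
expected <- sum(outcomes * probs)

mean(avg_pl)                      # simulated average P/L per customer
quantile(avg_pl, c(0.05, 0.95))   # range covering 90% of scenarios
```

The analytical value is (-300 × 6 + 467 × 79 + 82 × 119) / 204, or about $219.86 per customer; what the simulation adds is the spread around that figure.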

 

Advantages of using MC:

Unlike simple forecasting, Monte Carlo simulation can help with the following:

  • Probabilistic Results – show scenarios and how likely each one is to occur.
  • Graphical Results – The outcomes and their chance of occurring can be easily converted to graphs making it easy to communicate findings to an audience.
  • Sensitivity Analysis – Easier to see which variables impact the outcome the most, i.e. which variables had the biggest effect on bottom-line results.
  • Scenario Analysis: Using Monte Carlo simulation, we can see exactly which inputs had which values together when certain outcomes occurred.
  • Correlation of Inputs. In Monte Carlo simulation, it’s possible to model interdependent relationships between input variables. It’s important for accuracy to represent how, in reality, when some factors go up, others go up or down accordingly.

Code template:

The basic template for MC is as follows:
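
A minimal sketch of such a template, using the names runs and func1 and the coin example from earlier (three heads in a row). The seed and the number of runs are arbitrary choices.

```r
set.seed(2017)

runs <- 100000  # number of trials or iterations

# One trial: flip a fair coin three times and check for three heads.
func1 <- function() {
  all(sample(c("H", "T"), size = 3, replace = TRUE) == "H")
}

# Repeat the trial and take the share of trials where the event occurred.
mc_prob <- sum(replicate(runs, func1())) / runs
mc_prob
```

The exact answer here is 0.5^3 = 0.125, so the estimate should land close to that; for problems without a neat formula, only func1 needs to change.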

 

Let’s look at this code in detail:

  • Runs = number of trials or iterations. For our product profit example (application example 2), runs = 1000.
  • Func1 = this is the formula definition where we will indicate the number of different events, their probability and the selection criteria. For our scholarship candidate example (application number 4) this function would be modified as:

all(sample(c(1:7), size = 3, replace = TRUE) == 7)

where we are assigning numbers 1:7 to the students and hence Mike = 7. The expression is TRUE only when Mike is drawn in all three years; the exact probability is (1/7)^3, or about 0.3%.

Main code:

The code files for this tutorial are available on the 2017 project page. (Link here under Jul/Aug 2017 ) .
