Journey of Analytics

Deep dive into data analysis tools, theory and projects

Who wants to work at Google?

In this tutorial, we will explore the open roles at Google and try to see what common attributes Google is looking for in future employees.

 

This dataset comes from the Kaggle site and contains text information about job location, title, department, minimum and preferred qualifications, and the responsibilities of the position. You can download the dataset here, and run the code on the Kaggle site itself here. Using this dataset, we will try to answer the following questions:

  1. Where are the open roles?
  2. Which departments have the most openings?
  3. What are the minimum and preferred educational qualifications needed to get hired at Google?
  4. How much experience is needed?
  5. What categories of roles are the most in demand?

Data Preparation and Cleaning:

The data is all in free-form text, so we do need to do a fair amount of cleanup to remove non-alphanumeric characters. Some of the job locations have special characters too, so we remove those using basic string manipulation functions. Once we read in the file, this is the snapshot of the resulting dataframe:

Job Categories:

First we look at which departments have the most open roles. Surprisingly, there are more roles open in the “Marketing and Communications” and “Sales & Account Management” categories than in the traditional technical business units (like Software Engineering or Networking).

Full-time versus internships:

Let us see how many roles are full-time and how many are for students. As expected, only ~13% of roles are for students, i.e. internships. The majority are full-time positions.

Technical Roles:

Since Google is a predominantly technical company, let us see how many positions need technical skills, irrespective of the business unit (job category).

a) Roles related to “Google Cloud”:

To check this, we investigate how many roles have the phrase either in the job title or the responsibilities. As shown in the graph below, ~20% of the roles are related to Cloud infrastructure, clearly showing that Google is making Cloud services a high priority.
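
A minimal sketch of this kind of check, assuming hypothetical column names `title` and `responsibilities` and a tiny made-up data frame in place of the Kaggle file (so the share printed here is not the real one):

```r
# Toy stand-in for the job listings; column names and rows are assumptions.
jobs <- data.frame(
  title = c("Technical Account Manager, Google Cloud", "Marketing Manager",
            "Software Engineer", "Account Executive"),
  responsibilities = c("Support enterprise customers.",
                       "Run campaigns for Google Cloud products.",
                       "Build and test search features.",
                       "Manage partner relationships."),
  stringsAsFactors = FALSE
)

phrase <- "google cloud"

# TRUE when the phrase appears in either column (case-insensitive).
jobs$is_cloud <- grepl(phrase, jobs$title, ignore.case = TRUE) |
  grepl(phrase, jobs$responsibilities, ignore.case = TRUE)

# Percentage of roles flagged as Cloud-related.
cloud_share <- mean(jobs$is_cloud) * 100
print(cloud_share)
```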

Educational Qualifications:

Here we parse the “min_qual” and “pref_qual” columns to see which qualifications are needed for the role. If we only take the minimum qualifications into consideration, we see that 80% of the roles explicitly ask for a bachelor’s degree. Less than 5% of roles ask for a master’s or PhD.

min_qualifications for Google jobs

However, when we consider the “preferred” qualifications, the share of roles asking for a master’s or PhD increases to a whopping ~25%. Thus, a fourth of all roles would be more suited to candidates with master’s degrees and above.
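
One way to do this parsing is with simple keyword patterns. The sketch below uses the “min_qual” and “pref_qual” column names from above, but the rows and the two regular expressions are made up for illustration and would likely need tuning on the real text:

```r
# Toy qualification text; the rows are assumptions, not Kaggle data.
quals <- data.frame(
  min_qual = c("BA/BS degree or equivalent practical experience.",
               "Bachelor's degree in Computer Science.",
               "Master's degree or PhD in a quantitative field.",
               "Currently enrolled in a degree program."),
  pref_qual = c("Master's degree in Business.",
                "5 years of sales experience.",
                "PhD in Machine Learning.",
                "Experience with Java."),
  stringsAsFactors = FALSE
)

# Hypothetical keyword patterns for each degree level.
bachelor_pattern <- "\\bBA\\b|\\bBS\\b|bachelor"
advanced_pattern <- "master|\\bPhD\\b"

# Percentage of rows whose text matches the pattern.
share <- function(text, pattern) {
  mean(grepl(pattern, text, ignore.case = TRUE, perl = TRUE)) * 100
}

min_bachelor  <- share(quals$min_qual, bachelor_pattern)
min_advanced  <- share(quals$min_qual, advanced_pattern)
pref_advanced <- share(quals$pref_qual, advanced_pattern)
```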

Google Engineers:

Google is famous for hiring engineers for all types of roles. So we will read the job qualification requirements to identify what percentage of roles require a technical degree or a degree in Engineering.
As seen from the data, 35% specifically ask for an Engineering or Computer Science degree, including roles in marketing and non-engineering departments.

Years of Experience:

We see that 30% of the roles require at least 5 years, while 35% of roles need even more experience.
So if you did not get hired at Google after graduation, no worries. You have a better chance after gaining solid experience at other companies.
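
The years of experience can be pulled out of the free-form text with a regular expression. A minimal sketch, assuming the requirement is phrased as “N years” and using made-up qualification strings:

```r
# Toy minimum-qualification text; the strings are assumptions.
min_qual <- c("5 years of experience in sales.",
              "BA/BS degree.",
              "10 years of relevant work experience.",
              "3 years of experience with Java; 1 year of management experience.")

# Find the first number that is followed by "year" or "years".
hits <- regexpr("[0-9]+(?=\\+?\\s+years?)", min_qual, perl = TRUE)

# Roles that do not mention years stay NA.
years <- rep(NA_integer_, length(min_qual))
years[hits > 0] <- as.integer(regmatches(min_qual, hits))

# Share of all roles asking for 5 or more years.
share_5_plus <- sum(years >= 5, na.rm = TRUE) / length(years) * 100
```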

Role Locations:

The dataset does not have the geographical coordinates needed for mapping. However, this is easily overcome by using the geocode() function and the amazing rworldmap package. We are only plotting the locations, not the number of roles at each, even though some places have more roles than others. We see open roles in all parts of the world. However, most positions are in the US, followed by the UK, and then Europe as a whole.

Responsibilities – Word Cloud:

Let us create a word cloud to see what skills are most needed for the Cloud engineering roles. We see that words like “partner”, “custom solutions”, “cloud”, “strategy” and “experience” are more frequent than any specific technical skills. This shows that the Google Cloud roles are best filled by senior candidates, for whom leadership and business skills become more significant than expertise in a specific technology.

 

Conclusion:

So who has the best chance of getting hired at Google?

For most of the roles (from this dataset), a candidate with the following traits has the best chance of getting hired:

  1. 5+ years of experience.
  2. Engineering or Computer Science bachelor’s degree.
  3. Master’s degree or higher.
  4. Working in the US.

The code for this script and the graphs is available here on the Kaggle website. If you liked it, don’t forget to upvote the script. 🙂

Thanks and happy coding!

August Projects

In this month’s project, we will implement cluster analysis using the “K-means algorithm”.

We use the weather data from 1500+ locations (near airports) to understand temperature patterns by latitude and time of year.

We use 5 clusters and assign the letters A through E to locations with similar weather patterns. At the end of the analysis, you should be able to interpret the following insights from the resulting graphs and tables:

  1. Temperature patterns are similar towards the far North and South, just vertically shifted.
  2. The Pacific coast, where the temperature is almost static throughout the year, is different from the rest of the nation.
  3. It is interesting to see how states in two different parts of the country show similar weather patterns since they are on the same latitude (see Minnesota and Maine). During peak summer, these two states are hotter than California.
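
A minimal sketch of the clustering step, using base R’s kmeans() on simulated monthly temperature profiles. The temperature formula and the 100 locations are made up for illustration; the project data has 1500+ real locations:

```r
set.seed(123)

# Hypothetical locations spread across US latitudes.
n_locations <- 100
latitude <- runif(n_locations, 25, 48)
months <- 1:12

# Made-up temperature model: colder and more seasonal further north.
temps <- t(sapply(latitude, function(lat) {
  (90 - lat) + (lat / 3) * sin((months - 4) * pi / 6) + rnorm(12, sd = 2)
}))
colnames(temps) <- month.abb

# K-means with 5 clusters; nstart reruns with different random starts.
fit <- kmeans(temps, centers = 5, nstart = 20)

# Label the clusters A through E.
cluster_letter <- LETTERS[fit$cluster]

# Average latitude per cluster, to see how clusters line up with latitude.
print(tapply(latitude, cluster_letter, mean))
```

Changing `centers` to 3 or 7 is the quickest way to try the cluster-size experiment suggested further down.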

 

A sample graph from the analysis is shown below.

US states divided into 5 major weather clusters

Data set and code files are available from the main Project site page, under the row for Jul/Aug 2017.

Take a look and play around with the data, to investigate the following:

  1. What happens if you increase cluster size to 7? What happens if you decrease it to 3?
  2. What is the monthly weather pattern for Hawaii (state code = HI) versus New Hampshire (state code = NH)?
  3. What is the weekly average temperature for a tropical state like Florida? (Plot a chart with median temperatures for all 52 weeks, by year.) Has the average temperature gone up over the years?

Please leave your thoughts and comments, or questions if you get stuck on any point.

Happy Coding!

 

 

Monte Carlo Simulations in R

In today’s tutorial, we are going to learn how to implement Monte Carlo Simulations in R.

Logic behind Monte Carlo:

Monte Carlo simulation (also known as the Monte Carlo method) is a statistical technique that estimates the range of possible outcomes of an uncertain event, and how likely each one is, by repeating a random experiment many times. This makes it extremely helpful in risk assessment and decision-making, because we can estimate the probability of extreme cases coming true. The technique was first used by scientists working on the atom bomb; it was named for Monte Carlo, the Monaco resort town renowned for its casinos. Since its introduction during World War II, Monte Carlo simulation has been used to model a variety of physical and conceptual systems.

Monte Carlo methods are used to estimate the probability of an event A happening, among a set of N possible events. We assume that all trials are independent, so event A happening once does not change the chance of it happening again.

For example, assume you have a fair coin and you flip it once. The probability of heads is 0.5, i.e. heads and tails are equally likely. You flip the coin again. The probability of heads is still 0.5, irrespective of whether you got heads or tails in the first flip. We can also safely say that if you were to flip the coin 100 times, you would see heads ~50% of the time. The application of Monte Carlo (referred to henceforth in this post as MC) methods comes into play when we want to find out the probability of heads occurring 16 times in a row (or 5 or 3 or any other number).
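
As a small illustration, here is a sketch that estimates the probability of getting 3 heads in 3 flips and compares it with the exact value 0.5^3. The choice of 3 flips and 100,000 runs is arbitrary; rarer events such as 16 heads in a row need far more runs for a stable estimate:

```r
set.seed(42)

runs <- 100000   # number of simulated experiments
k <- 3           # number of consecutive heads we are looking for

# One experiment: flip a fair coin k times (1 = heads) and
# check whether every flip came up heads.
all_heads <- replicate(runs, sum(sample(c(0, 1), size = k, replace = TRUE)) == k)

estimate <- mean(all_heads)   # simulated probability
exact <- 0.5^k                # theoretical probability for comparison

print(c(estimate = estimate, exact = exact))
```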

You can read more about these methods and the theory behind them, using the links below:

  1. Wikipedia – link.
  2. MC methods in Finance, from Investopedia.com – link2
  3. Basics of MC from software provider Palisade. – link3.

Applications:

MC methods are used by professionals in numerous fields ranging from finance, project management, energy, manufacturing, R&D, insurance, biotech, etc. Some real-world applications of Monte Carlo simulations are given below:

  1. Monte Carlo simulations are used in financial services to predict fraudulent credit card transactions (since 100 genuine transactions do not guarantee the next one will not be fraudulent, even though fraud is a rare event by itself).
  2. Risk analysis. Assume a new product was sold at a loss of $300 to 6 users (due to coupons or sales), at a profit of $467 to 79 users and at a profit of $82 to 119 customers. We can use Monte Carlo simulations to understand what the average P/L (profit or loss) would be if 1000 customers bought our products.
  3. A/B testing to understand page bounce and successful web elements. Assume you changed the payment processing system on your e-commerce site. You are doing an A/B test to see if the upgrade results in improved checkout completion. On the old system, 12 users abandoned their cart, while 19 completed their purchase. On the new system, 147 people abandoned their cart while 320 completed their purchase. Which system works better?
  4. Selection criteria. For example, if we have 7 candidates for a scholarship (Eileen, George, Taher, Ramesis, Arya, Sandra and Mike), what is the probability that Mike will be chosen in three consecutive years? We assume the candidate list stays the same and past winners are not barred from receiving the scholarship again.
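
To make the risk analysis example (number 2) concrete, here is a minimal sketch. It assumes future customers fall into the three profit/loss groups in the same proportions as the 204 observed buyers, which is a simplification:

```r
set.seed(2017)

# Observed profit/loss per sale and how many buyers fell in each group.
outcomes <- c(-300, 467, 82)
buyers <- c(6, 79, 119)
probs <- buyers / sum(buyers)   # assumed probability of each outcome

runs <- 2000        # number of simulated scenarios
customers <- 1000   # customers per scenario

# For each scenario, draw 1000 customers and record the average P/L.
avg_pl <- replicate(runs, mean(sample(outcomes, size = customers,
                                      replace = TRUE, prob = probs)))

# Exact expected P/L per customer, for comparison with the simulation.
expected_pl <- sum(outcomes * probs)

print(mean(avg_pl))                      # simulated average P/L per customer
print(quantile(avg_pl, c(0.05, 0.95)))   # range covering 90% of scenarios
```

The quantiles are where the simulation adds value over a plain weighted average: they show how far the per-customer profit could plausibly swing across scenarios.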

 

Advantages of using MC:

Unlike simple forecasting, Monte Carlo simulation can help with the following:

  • Probabilistic Results – Shows the possible scenarios and how likely each one is to occur.
  • Graphical Results – The outcomes and their chance of occurring can be easily converted to graphs, making it easy to communicate findings to an audience.
  • Sensitivity Analysis – Easier to see which variables impact the outcome the most, i.e. which variables had the biggest effect on bottom-line results.
  • Scenario Analysis – Using Monte Carlo simulation, we can see exactly which inputs had which values together when certain outcomes occurred.
  • Correlation of Inputs – In Monte Carlo simulation, it’s possible to model interdependent relationships between input variables. It’s important for accuracy to represent how, in reality, when some factors go up, others go up or down accordingly.

Code template:

The basic template for MC is as follows:
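
A minimal sketch of such a template, using the names `runs` and `func1` and a placeholder event (more than 3 heads in 10 fair coin flips) that is only an assumption for illustration:

```r
set.seed(1)

# Number of trials or iterations.
runs <- 100000

# One trial: returns TRUE if the event of interest happened.
# Placeholder event: more than 3 heads in 10 flips of a fair coin.
func1 <- function() {
  sum(sample(c(0, 1), size = 10, replace = TRUE)) > 3
}

# Run the trial many times; the fraction of TRUE results is the estimate.
mc_estimate <- sum(replicate(runs, func1())) / runs

# Exact answer for this placeholder event, from the binomial distribution.
exact <- 1 - pbinom(3, size = 10, prob = 0.5)

print(c(mc_estimate = mc_estimate, exact = exact))
```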

 

Let’s look at this code in detail:

  • Runs = number of trials or iterations. For our product profit example (application example 2), runs = 1000.
  • Func1 = the function definition where we indicate the number of different events, their probability and the selection criteria. For our scholarship candidate example (application number 4) this function would be modified as:

sum(sample(c(1:7), size = 3, replace = TRUE)) == 21

where we assign the numbers 1:7 to the students, so Mike = 7. The sum equals 21 only when Mike is drawn in all three years.
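
A sketch of the full scholarship simulation, assuming the winner is drawn uniformly at random each year and Mike is labelled 7. Mike wins three years running only when all three draws equal 7, and the exact probability (1/7)^3 is printed alongside for comparison:

```r
set.seed(99)

runs <- 100000

# One trial: draw a winner for each of 3 years (with replacement, since
# past winners can win again) and check whether Mike (7) won every time.
mike_wins_all_three <- function() {
  all(sample(1:7, size = 3, replace = TRUE) == 7)
}

mc_estimate <- sum(replicate(runs, mike_wins_all_three())) / runs
exact <- (1 / 7)^3

print(c(mc_estimate = mc_estimate, exact = exact))
```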

Main code:

The code files for this tutorial are available on the 2017 project page (link here, under Jul/Aug 2017).

Parallel Programming with R

In this month’s project we will implement parallel programming in R:

We achieve this by using the following packages, so please install them before running the code in RStudio:

  • foreach
  • doParallel
  • parallel (ships with R, so no separate installation is needed)

 

 Concept of parallel programming:

For beginners, the concept of parallel programming is simple: instead of calculating outputs (from inputs) one at a time, we divide our computation and allocate it to multiple worker connections that work concurrently (to retrieve data, process it, calculate outputs, etc.). At the end we collate the results into a single dataset (list or dataframe).
This concept assumes that the computations are independent of one another, i.e. no computation needs the output of another.

Example:
As a simple example, let us assume you are calculating prime factors for an input dataset with a million random, unsorted numbers. We have a function to calculate factors that takes n seconds to process each number.
In a sequential (non-parallel) scenario, it would take 1,000,000 * n seconds to process the entire dataset.
However, if we had 7 parallel connections, then we could divide the input set into 7 chunks and process 7 numbers (or datasets) at a time. This reduces the time by up to a factor of 7, minus the overhead of distributing the work and collecting the results.
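
A scaled-down sketch of this example, using only the base `parallel` package. It assumes 2 workers instead of 7 and 1,000 numbers instead of a million so that it runs quickly, and the factoring function is a simple trial-division helper written for this illustration:

```r
library(parallel)

# Simple trial-division prime factorization (illustrative helper).
prime_factors <- function(n) {
  n <- as.numeric(n)
  factors <- c()
  d <- 2
  while (n > 1) {
    while (n %% d == 0) {       # divide out d as many times as possible
      factors <- c(factors, d)
      n <- n / d
    }
    d <- d + 1
    if (d * d > n && n > 1) {   # whatever is left must itself be prime
      factors <- c(factors, n)
      break
    }
  }
  factors
}

set.seed(10)
numbers <- sample(2:100000, 1000)

# Sequential: one number at a time.
sequential_result <- lapply(numbers, prime_factors)

# Parallel: the same work split across 2 worker processes.
cl <- makeCluster(2)
parallel_result <- parLapply(cl, numbers, prime_factors)
stopCluster(cl)

# Both approaches should give exactly the same answers.
print(identical(sequential_result, parallel_result))
```

For a job this small the parallel version may well be slower, because starting the workers takes longer than the calculation itself. The gain shows up only when each task is expensive.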

 

Benefits:

Parallel programming is not really useful if your computation is quick or if you are exploring stuff. But it is a very powerful tool to speed things up when it comes to simulations, machine learning and even time-consuming calculations or data retrievals.

For example, I was recently working on a project where I was looking at billions of transaction records for sub-optimal price charges (where the actual transaction price was x% or more above the first bid). The logic itself is simple:
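
A sketch of that logic on a toy data frame. The column names (`first_bid`, `actual_price`) and the 20% threshold are assumptions for illustration only:

```r
# Toy transactions; columns and values are made up.
transactions <- data.frame(
  txn_id = 1:5,
  first_bid = c(100, 200, 50, 80, 120),
  actual_price = c(105, 260, 50, 100, 121)
)

x_pct <- 20   # flag prices that are 20% or more above the first bid

# TRUE when the actual price is at least x% above the first bid.
transactions$overcharged <-
  transactions$actual_price >= transactions$first_bid * (1 + x_pct / 100)

flagged <- transactions[transactions$overcharged, ]
print(flagged)
```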

 

However, because of the sheer amount of data, the program was taking hours (yes, hours) to complete for a single month, even after optimizing the SQL queries. And I had to look at 6 months’ worth of data! Like looking for a needle in a haystack. So I implemented parallel worker connections and bingo! Time reduced by 80%.

Similarly, I have used parallelization to run multiple modeling scenarios at the same time (what happens when the input price changes to m and the competition’s price changes to y) and reduce computation time.

 

Code Structure:

The skeleton code for implementation is as follows:
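
A minimal sketch of such a skeleton, assuming 2 workers and a stand-in calculation in place of the real database query. The connection lines are shown only as comments, with a made-up DSN, username and password, to mark where a per-worker connection would go:

```r
library(parallel)
library(foreach)
library(doParallel)

# Start the workers and register them with foreach.
cl <- makeCluster(2)
registerDoParallel(cl)

input_ids <- 1:8

# Each iteration runs on a worker; rbind stacks the per-iteration rows.
results <- foreach(id = input_ids, .combine = rbind) %dopar% {
  # In a real job, open the database connection HERE so that every worker
  # gets its own connection (hypothetical credentials):
  # con <- RODBC::odbcConnect("my_dsn", uid = "fake_user", pwd = "fake_password")
  # rows <- RODBC::sqlQuery(con, paste("SELECT ... WHERE id =", id))

  # Stand-in calculation so the sketch runs without a database.
  data.frame(id = id, value = id^2)
}

# Always release the workers when done.
stopCluster(cl)

print(results)
```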

In the above code, you can also embed the SqlQuery() call in a function, along with other calculations, and call that function inside the %dopar% block.

Note, the odbc() call uses a fake username-password combination to show the format. We need to place it inside the foreach() loop so each worker gets its own database connection to retrieve data in parallel. Assigning it outside as a global variable will NOT work.

We use a custom function to collate the results as follows:
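
A sketch of what such a function could look like. The name `collate_results` and the handling of empty (NULL) worker outputs are assumptions; it is demonstrated here on a plain list so it runs without any workers:

```r
# Stack any number of data frames, skipping workers that returned nothing.
collate_results <- function(...) {
  pieces <- list(...)
  pieces <- Filter(Negate(is.null), pieces)
  do.call(rbind, pieces)
}

# Pretend these came back from three workers; the second found no rows.
worker_outputs <- list(
  data.frame(id = 1, total = 10),
  NULL,
  data.frame(id = 3, total = 30)
)

combined <- do.call(collate_results, worker_outputs)
print(combined)
```

With foreach, a function like this can be passed as `.combine = collate_results`; adding `.multicombine = TRUE` lets foreach hand it several results in one call rather than two at a time.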

 

Implementation:

The attached code file shows 3 cases for implementation, using the above code structure:

  1. calling a simple SQL query inside the foreach()
  2. calling a simple math-calculation function, without sql queries.
  3. more complex function to perform both sql data retrieval and math functions to compute results.

You can download the code files from the Projects Page link here, under Apr 2017.

Points To Note:

  • Run the query on a single value first to make sure that the SQL query itself is correct. Unfortunately, R gives a very generic “failed to execute” type error irrespective of whether you missed a comma in the select statement or the variable name is incorrect. This makes debugging very annoying.
  • If you have SQL queries, please optimize them so they only pick up the minimum number of records/columns needed.
  • Parallel programming does NOT mean instant!! Processing a million rows may still take an hour or more if you have loads of calculations, though it will be tons faster than sequential processing.
  • A cluster with 1000 or more parallel worker connections is NOT effective, unless you have a supercomputer of course. If using a laptop, don’t try more than 10.
  • Don’t include a query that writes to a table. You will lock the table and the process will NEVER ever complete.
  • For a sanity check, try inserting a print statement to keep tabs that the function is actually working.

 

Until next time, happy coding.

Predictive analytics using Ames Housing Data (Kaggle Starter Script)

Hello All,

In today’s tutorial we will apply 5 different machine learning algorithms to predict house sale prices using the Ames Housing Data.

This dataset is also available as an active Kaggle competition for the next month, so you can use this as a Kaggle starter script (in R). Use the output from the models to generate submission files for the Kaggle platform and see how well you fare on the public leaderboard. This is also a perfect simulation of a real-world analytics problem, where the final results are validated by a customer, client or third party.

This tutorial is divided into 4 parts:

  • Data Load & Cleanup
  • Feature Selection
  • Apply algorithms
  • Using arithmetic / geometric means from multiple algorithms to increase accuracy

 

Problem Statement:

Before we begin, let us understand the problem statement :

Predict Home SalePrice for the Test Data Set with the lowest possible error.

The Kaggle competition evaluates the leaderboard score based on the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sale price. Lower is better, and a perfect model would score 0.

The evaluation uses log values so that errors on expensive houses and cheap houses are penalized equally. So the % deviation from the real value matters, not just the $ value of the error.
E.g.: if the predicted home price was $42,000 but the actual value was $37,000, the $-value error is only $5,000, which doesn’t seem a lot. However, the error % is ~13.51%.
In contrast, imagine a home with real sale price = $389,411, which we predicted to be $410,000. The $-value difference is $20,589, yet the % error is only ~5.29%, which is a better prediction.
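
The two examples can be checked with a few lines of R. The `rmsle` helper below is a sketch of the metric as described in this section (RMSE on log prices), not code taken from the competition itself:

```r
# RMSE between the log of the predictions and the log of the actual prices.
rmsle <- function(predicted, actual) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

predicted <- c(42000, 410000)
actual <- c(37000, 389411)

dollar_error <- abs(predicted - actual)          # error in $
pct_error <- dollar_error / actual * 100         # error as % of actual price
log_error <- abs(log(predicted) - log(actual))   # what the metric "sees"

print(round(pct_error, 2))
print(log_error)
print(rmsle(predicted, actual))
```

The cheaper house has the smaller $ error but the larger log error, which is why it hurts the score more.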

In real life too, you will frequently face situations where the sensitivity of predictions is as important as value accuracy.

As always, you can download the code and data files from the Projects Page here, under Feb 2017.

 

Data Load & Cleanup:

In this step we perform the following tasks:

  1. Load the test and training set.
  2. Check training set for missing values.
  3. Delete columns with more than 40% missing records. These include the variables Alley (1369 empty), FireplaceQu (690), Fence (1179), PoolQC (1453) and MiscFeature (1406).
  4. The other variables with missing values identified in step 2 are:
    • LotFrontage, MasVnrType, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, GarageType, GarageCond.
  5. For columns with very few missing values, we choose one of 3 options:
    • For categorical values, create a new level “unk” to indicate missing data.
    • For columns (numeric or categorical) where the data falls overwhelmingly in one category, fill missing values with that category.
    • For numeric data, mark missing values with -1.
    • NOTE: we apply these rules to the test set as well, for consistency, irrespective of whether it has missing values or not.
  6. Repeat steps 2-5 for the test set columns, since there may be empty cells in test set columns which did not show up in the training set. The variables we now identify include:
    • MSZoning, Utilities, Exterior1st, Exterior2nd, MasVnrArea,
    • BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF,
    • BsmtFullBath, BsmtHalfBath, KitchenQual,
    • GarageYrBlt, GarageQual, GarageFinish, GarageCars, GarageArea,
    • SaleType.
  7. We write the modified training and test sets to Excel so we can apply our models on the corrected data. (This is especially helpful in real life if you need to fine-tune your model over a few days.)
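
A minimal sketch of steps 2 to 5 on a toy data frame. The three columns borrow names from the Ames data but the values are made up, and only the “unk” and -1 rules are shown:

```r
# Toy training set with missing values (values are made up).
train <- data.frame(
  LotFrontage = c(65, NA, 80, 60, 70),
  Alley = c(NA, NA, NA, "Grvl", NA),
  BsmtQual = c("Gd", "TA", NA, "Gd", "Ex"),
  stringsAsFactors = FALSE
)

# Step 2: fraction of missing values per column.
missing_frac <- colMeans(is.na(train))
print(missing_frac)

# Step 3: drop columns with more than 40% missing records.
train <- train[, missing_frac <= 0.40, drop = FALSE]

# Step 5: fill what is left; "unk" for categorical, -1 for numeric.
for (col in names(train)) {
  gaps <- is.na(train[[col]])
  if (is.character(train[[col]])) {
    train[[col]][gaps] <- "unk"
  } else {
    train[[col]][gaps] <- -1
  }
}

print(train)
```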

 

Feature Selection:

The easiest way to check if a relationship exists is to use statistical tests: chi-square, ANOVA or correlation.

Our target variable is “SalePrice” (a numeric value), so we test every other predictor variable against it to see whether a relationship exists and how strongly it affects SalePrice.
For categorical predictors we use the chi-square test, whereas for numeric predictors we use correlation.

Our dataset has 74 predictive factors (excluding SalePrice and Id), so we run a for loop to do a rough check. If the predictor column is of type integer/numeric, we apply correlation. If the column is of type “character”, we apply chi-square and treat a relationship as present only if the p-value is < 0.05.

We also add a column to identify variables of interest, using the following rules:

  • If correlation value falls below -0.75 (high negative correlation) or above 0.75 (high positive correlation) then we mark the variable as “match”.
  • If p-val from chisquare test falls below 0.05 then we mark it as “match”.
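
A rough sketch of such a screening loop on simulated data. Two things here are assumptions rather than part of the description above: the toy columns, and binning SalePrice into quartiles before the chi-square test so that the contingency table is not one column per distinct price:

```r
set.seed(1)
n <- 300

# Simulated homes: price depends on area and on being in neighborhood "A".
homes <- data.frame(
  GrLivArea = runif(n, 800, 3000),
  Neighborhood = sample(c("A", "B", "C"), n, replace = TRUE),
  LotShape = sample(c("Reg", "IR1"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
homes$SalePrice <- 100 * homes$GrLivArea +
  ifelse(homes$Neighborhood == "A", 60000, 0) + rnorm(n, sd = 10000)

# Assumption: bin the numeric target into quartiles for the chi-square test.
price_band <- cut(homes$SalePrice,
                  breaks = quantile(homes$SalePrice, probs = seq(0, 1, 0.25)),
                  include.lowest = TRUE)

predictors <- setdiff(names(homes), "SalePrice")
screen <- data.frame(variable = predictors, test = NA_character_,
                     value = NA_real_, flag = "", stringsAsFactors = FALSE)

for (i in seq_along(predictors)) {
  x <- homes[[predictors[i]]]
  if (is.numeric(x)) {
    # Numeric predictor: correlation with SalePrice.
    screen$test[i] <- "correlation"
    screen$value[i] <- cor(x, homes$SalePrice, use = "complete.obs")
    if (abs(screen$value[i]) > 0.75) screen$flag[i] <- "match"
  } else {
    # Character predictor: chi-square p-value against the price bands.
    screen$test[i] <- "chisquare"
    screen$value[i] <- suppressWarnings(chisq.test(table(x, price_band))$p.value)
    if (screen$value[i] < 0.05) screen$flag[i] <- "match"
  }
}

print(screen)
```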

Using this “quack” approach, we quickly identify 19 variables of interest. These include Neighborhood, Building Type (single family, townhome, etc.), YearBuilt, Year Remodeled, Basement type (finished / unfinished), house area for the first floor / second floor / garage, number of bathrooms (both full and half), number of cars the garage can accommodate, sale type and sale condition.

NOTE: it is always a good idea to visualize these relationships graphically, as the above approach may miss predictors with a non-linear relationship.

Median House SalePrice by Neighborhood

 

Apply Algorithms:

We apply the following machine learning algorithms:

  1. Linear Regression Model.
  2. Classification Tree Model.
  3. Neural Network Model.
  4. Random Forest algorithm model.
  5. Generalized Linear Model (GLM)

 

We follow the same steps for all 5 models:
(Note, code and functions shown only for Linear Regression Model. Detailed functions and variables used for the other models are available in the R program files.)

  1. Use the training set to fit the model from a formula.
  2. Apply the fitted model to predict values for the validation set.
  3. Check the deviation from true home prices in terms of both median $-value and % error.
  4. Apply the fitted model to the test set and check the rank/score on the leaderboard.
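
A minimal sketch of steps 1 to 3 for a linear regression, on simulated data with two predictors. The data, the 70/30 split and the choice of predictors are assumptions for illustration; step 4 needs the real Kaggle test file, so it is left out:

```r
set.seed(7)
n <- 500

# Simulated homes (made-up relationship between features and price).
homes <- data.frame(
  GrLivArea = runif(n, 800, 3000),
  YearBuilt = sample(1950:2010, n, replace = TRUE)
)
homes$SalePrice <- 50000 + 90 * homes$GrLivArea +
  500 * (homes$YearBuilt - 1950) + rnorm(n, sd = 15000)

# Hold out 30% of the rows as a validation set.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- homes[train_idx, ]
valid <- homes[-train_idx, ]

# Step 1: fit the model on the training set.
fit <- lm(SalePrice ~ GrLivArea + YearBuilt, data = train)

# Step 2: predict prices for the validation set.
valid$predicted <- predict(fit, newdata = valid)

# Step 3: median error in $ and in %.
abs_error <- abs(valid$predicted - valid$SalePrice)
median_dollar_error <- median(abs_error)
median_pct_error <- median(abs_error / valid$SalePrice) * 100

print(c(median_dollar_error = median_dollar_error,
        median_pct_error = median_pct_error))
```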

 

The error rates for all 5 models are given in the table below:

home price table showing model error %

Combining Algorithms to Improve Accuracy:

Many scientific papers show that combining answers from multiple models can improve accuracy. A simple but excellent explanation (with respect to Kaggle) by the MLwave.com founder, a past Kaggle winner, is provided here. Basically, combining results from unrelated models can improve accuracy, even if the individual models are NOT highly accurate on their own.

Similar to the explanation provided in the link, we take the arithmetic mean of the predictions from all 5 models.
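
A small sketch of the averaging step, with made-up predictions for three houses from five models. The geometric mean is included as well, since it is the other averaging option mentioned at the start of this tutorial:

```r
# Made-up predictions: one row per house, one column per model.
preds <- data.frame(
  linear        = c(200000, 150000, 310000),
  tree          = c(210000, 140000, 290000),
  neural_net    = c(190000, 160000, 330000),
  random_forest = c(205000, 155000, 300000),
  glm           = c(195000, 145000, 320000)
)

# Arithmetic mean across the 5 models, per house.
arithmetic_mean <- rowMeans(preds)

# Geometric mean: average the logs, then transform back.
geometric_mean <- exp(rowMeans(log(preds)))

print(data.frame(arithmetic_mean, geometric_mean))
```

The geometric mean is never larger than the arithmetic mean, so it pulls the blend slightly towards the lower predictions.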

 

Summary:

We learnt how to clean and process a real-life dataset, select features of interest (and gauge their impact on the target variable), and apply 5 different machine learning algorithms.

The code for this tutorial is available here on the Projects page, under month of Feb.

Please take a look and feel free to comment with your own feedback or models.

 
