Journey of Analytics

Deep dive into data analysis tools, theory and projects

Category: Machine Learning (page 1 of 2)

Sberbank Machine Learning Series – Post 2 – Mind maps & Hypothesis

This is the second post of the Sberbank Russia housing set analysis, where we will narrow down the variables of interest and create a roadmap to understand which factors significantly impact the target variable (price_doc).

You can read the introductory first post here.


Analysis Roadmap:

This Kaggle dataset has ~290 variables, so having a clear direction is important. In the initial phase, we do not yet know which variables are significant, so we will read through the data dictionary and logically select variables of interest. Using these, we create our hypotheses, i.e. the expected relationship of each variable with the target variable (home price), and test the strength of that relationship.

The dataset also includes macroeconomic variables, so we will also create derived variables to test interactions between variables.

A simple mindmap for this dataset is as below:

home price analysis mindmap


Hypothesis Qs:

The hypothesis Qs and predictor variables of interest are listed below:

Target Variable: (TV)

“price_doc” is the variable to predict. Henceforth this will be referred to as “TV”.


Predictor variables:

These are the variables that affect the target variable, although we do not know which one is more significant over the others, or indeed if two or more variables interact together to make a bigger impact.

For the Sberbank set, we have predictor variables from 3 categories:

  1. Property details,
  2. Neighborhood characteristics,
  3. Macroeconomic factors

(Note: predictors marked with a # in the mindmap are derived or calculated variables.)


Property details:

  1. Timestamp –
    1. We will use both the timestamp (d/m/y) as well as extract the month-year values to assess relationship with TV.
    2. We will also check if any of the homes have multiple timestamps, which means the house passed through multiple owners. If yes, does this correlate with a specific sub_area?
  2. Single-family and bigger homes also have patios, yards, lofts, etc., which create a difference between living area and full home area. So we take the ratio of life_sq to full_sq and check whether a home with a bigger ratio plus a larger full_sq gets a better price.
  3. Kitch_sq – Do homes with larger kitchens command a better price? We will take the ratio kitch_sq / life_sq and check its impact on house price.
  4. Sub_area – does this affect price?
  5. Build_year –
    1. Logically, newer homes should fetch a better price.
    2. Also check if there is an interaction with full_sq, i.e. do larger, newer homes get a better price?
    3. Check inter-relationship with sub_area.
  6. Material – how does this affect TV?
  7. Floor/max_floor –
    1. Create this ratio and check how it affects price. Note, we need to work out how single-family homes are flagged in the data, since they would have to be analyzed as a separate subset.
    2. Does a higher floor increase price? In a specific sub_area? For example, certain top floor apartments in Chicago and NYC command a better price since tenants get an amazing view of the skyline, and there is limited real estate in such areas.
  8. Product_type – Investment or ownership. Check if investment properties fetch a better price.
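
The derived property variables above can be sketched in R as follows. This is a minimal illustration on a made-up three-row data frame, not the project code. It assumes the timestamp is stored as a yyyy-mm-dd string (adjust the format string otherwise), and the new column names (mth_yr, life_full_ratio, kitch_life_ratio, floor_ratio) are placeholders of our own.

```r
# Toy rows standing in for the Sberbank training set (all values are made up).
train <- data.frame(
  timestamp = c("2014-03-15", "2014-03-28", "2015-01-07"),
  full_sq   = c(60, 45, 80),
  life_sq   = c(30, 27, 60),
  kitch_sq  = c(6, 9, 12),
  floor     = c(3, 9, 1),
  max_floor = c(9, 9, 5),
  stringsAsFactors = FALSE
)

# Month-year label extracted from the timestamp.
train$ts_date <- as.Date(train$timestamp, format = "%Y-%m-%d")
train$mth_yr  <- format(train$ts_date, "%m-%Y")

# Derived ratios from the hypothesis list.
train$life_full_ratio  <- train$life_sq  / train$full_sq
train$kitch_life_ratio <- train$kitch_sq / train$life_sq
train$floor_ratio      <- train$floor    / train$max_floor

# Zero or missing denominators would give Inf/NaN, so set those to NA.
ratio_cols <- c("life_full_ratio", "kitch_life_ratio", "floor_ratio")
for (col in ratio_cols) {
  train[[col]][!is.finite(train[[col]])] <- NA
}
```

Each ratio can then be plotted or correlated against price_doc to test the corresponding hypothesis.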


Neighborhood details:

  1. Full_all – Total population in the area. Denser population should correlate with higher sale price.
  2. Male_f / female_f – Derived variable. If the ratio is skewed it may indicate military zones or special communities, which may affect price.
  3. Kid friendly neighborhood – Calculate the ratio x13_all / full_all, i.e. the ratio of the population under 13 to the overall population. A high ratio indicates a family-friendly neighborhood or residential suburb, which may be better for home sale price. Also correlate with sub_area.
  4. Similar to the above, calculate the ratio of teens to the overall population. Correlate with sub_area.
  5. Proximity to public transport: Calculate normalized scores for the following:
    1. Railroad_stn_walk_min,
    2. Metro_min_avto,
    3. Public_transport_walk
    4. Add all to get a weighted score. Lower values should hopefully correlate with higher home prices.
  6. Entertainment amenities: Easy access to entertainment options should be higher in densely populated areas with higher standards of living, and these areas presumably should command better home values. Hence we check relationship of TV with the following variables:
    1. Fitness_km,
    2. Bigmarket_km
    3. Stadium_km,
    4. Shoppingcentres_km,
  7. Proximity to office: TV versus normalized values for :
    1. Office_count_500,
    2. Office_count_1000,
    3. Logically, the more offices nearby, the better the price.
  8. Similarly, calculate normalized values for the number of industries in the vicinity, i.e. prom_part_500 / prom_part_5000. Here, however, the hypothesis is that nearby houses will have lower sale prices, since industries lead to noise and pollution and do not make for an ideal residential neighborhood. (Optional: check if sub_areas with a high number of industries have fewer standalone homes, i.e. single-family homes, townhomes, etc.)
  9. Ratio of premium cafes to inexpensive ones in the neighborhood, i.e. cafe_count_5000_price_high / cafe_count_price_500. If the ratio is high, then do the houses in these areas have an increased sale price? Also correlate with sub_area.
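
As a rough sketch of the neighborhood variables above, the snippet below computes the under-13 population ratio and an equally weighted public transport score from min-max normalized columns. The data values are made up, the column names follow the spelling used in this post (the Kaggle file may spell them differently), and min_max, kid_ratio and transport_score are helper names of our own. Equal weights are an assumption; any other weighting could be substituted.

```r
# Toy neighborhood rows (all values are made up).
nbhd <- data.frame(
  full_all              = c(10000, 20000, 5000),
  x13_all               = c(1500, 2000, 1000),
  railroad_stn_walk_min = c(10, 30, 20),
  metro_min_avto        = c(5, 15, 10),
  public_transport_walk = c(2, 6, 4)
)

# Min-max normalization to the 0-1 range.
# (Returns NaN if a column is constant, so check for that on real data.)
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Share of the population under 13.
nbhd$kid_ratio <- nbhd$x13_all / nbhd$full_all

# Equal-weight transport score: lower means closer to public transport.
transport_cols <- c("railroad_stn_walk_min", "metro_min_avto",
                    "public_transport_walk")
normalized <- sapply(nbhd[transport_cols], min_max)
nbhd$transport_score <- rowSums(normalized)
```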


Macro Variables:

These are overall numbers for the entire country, so they remain fairly constant for a whole year. However, we will merge these variables to the training and test set, to get a more holistic view of the real estate market.

The reasoning is simple: if overall mortgage rates are excessive (let’s say 35% interest rates), then it is highly unlikely that there will be a large number of home purchases, thus forcing a reduction in overall home sale prices. Similarly, factors like inflation and income per person also affect home prices.

  1. Ratio of Income_per_Cap and real_disposable_income: ideally the economy is doing better if both numbers are high, thus making it easier for homebuyers to get home loans and consequently pursue the house of their dreams.
  2. Mortgage_value: We will use a normalized value, to see how much this number changes over the years. If the number is lower, our hypothesis is that more people took larger loans, and hence sale prices for the year should be higher.
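  3. Usdrub: how well the Ruble (Russian currency) is faring against the dollar. Note that usdrub is quoted as rubles per dollar, so a lower number means a stronger Ruble. A stronger Ruble should indicate better stability and economy, and we will check how strongly this correlates with TV. (We will ignore the relationship with Euros for now.)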
  4. Cpi: normalized value over the years.
  5. GDP: we take a ratio of gdp_annual_growth/ gdp_annual, since both numbers should be high in a good economy.
  6. Unemployment ratio: Unemployment / employment. The hypothesis is an inverse relationship with TV.
  7. Population_migration: We will try to see the interaction with TV, while taking sub_area into consideration.
  8. Museum_visits_per_100_cap: Derive values to see if numbers have increased or decreased from the previous year, indicating higher/lower disposable income.
  9. Construction_value: normalized value.
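
Merging the macro variables onto the training set and deriving a ratio could look like the sketch below. It assumes both files share a timestamp key, uses made-up values, and lowercases the macro column names from the list above; unemp_ratio is a placeholder name of our own.

```r
# Toy training rows and macro rows (all values are made up).
train <- data.frame(
  timestamp = c("2014-01-10", "2014-01-11", "2014-02-01"),
  price_doc = c(5e6, 6e6, 7e6),
  stringsAsFactors = FALSE
)
macro <- data.frame(
  timestamp    = c("2014-01-10", "2014-01-11", "2014-02-01", "2014-02-02"),
  unemployment = c(5, 5, 6, 6),
  employment   = c(50, 50, 40, 40),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every training row; macro dates with no sale are dropped.
merged <- merge(train, macro, by = "timestamp", all.x = TRUE)

# Derived macro variable: unemployment relative to employment.
merged$unemp_ratio <- merged$unemployment / merged$employment
```

The same merge would be applied to the test set so that both sets carry identical macro columns.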


In the next posts, we will (a) use these hypothesis Qs to understand how the target variable is affected by the predictors, and (b) apply the variables in different algorithms to predict TV.

Sberbank Machine Learning Series – Post 1 – Project Introduction

For this month’s tutorials, we are going to work on the Kaggle Sberbank housing set, to forecast house prices in Russia. This dataset from Sberbank, an old and eminent institution in Russia, is unique in that it provides macroeconomic information along with the training and test data. The macro data includes variables like average salary, GDP, average mortgage rates by year, strength of the Russian ruble versus the Euro/Dollar, etc. by month and year. This allows us to incorporate relevant political and economic factors that may create volatility in housing prices.

You can view more detailed information about the dataset, and download the files from the Kaggle website link here.

House price predictions


We are going to use this dataset in a series of posts to perform the following:

  1. Mindmaps for both data exploration and solution framework. In this dataset, there are 291 variables in the training set, and 100 variables in the macro set. So for this project, we are going to use both Tableau and R for exploring the data.
  2. Initial Hypothesis testing to check for variable interactions, and help create meaningful derived variables.
  3. Baseline prediction models using 5 different machine learning algorithms.
  4. Internal and external validation. Internal validation by comparing models by sensitivity, accuracy and specificity. External validation by comparing scores on the Kaggle leaderboard.
  5. Ensemble (hybrid) models using combination of the baseline models.
  6. Final model upload to Kaggle.


Until next time, happy Coding!

Predictive analytics using Ames Housing Data (Kaggle Starter Script)

Hello All,

In today’s tutorial we will apply 5 different machine learning algorithms to predict house sale prices using the Ames Housing Data.

This dataset is also available as an active Kaggle competition for the next month, so you can use this as a Kaggle starter script (in R). Use the output from the models to generate submission files for the Kaggle platform and view how well you fare on the public leaderboard. This is also a good simulation of a real-world analytics problem, where the final results are validated by a customer, client or third party.

This tutorial is divided into 4 parts:

  • Data Load & Cleanup
  • Feature Selection
  • Apply algorithms
  • Using arithmetic / geometric means from multiple algorithms to increase accuracy


Problem Statement:

Before we begin, let us understand the problem statement :

Predict Home SalePrice for the Test Data Set with the lowest possible error.

The Kaggle competition evaluates the leaderboard score based on the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. Lower is better, so a perfect model would have a score of 0.

The evaluation considers log values so that the model is penalized equally for errors on expensive houses AND cheap houses. So the % deviation from the real value matters, not just the $ value of the error.
E.g.: if the predicted home price was 42000$ but the actual value was 37000$, then the $-value error is only 5000$, which doesn’t seem like a lot. However, the error % is ~13.51%.
In contrast, imagine a home with real saleprice = 389411$, which we predicted to be 410000$. The $ value difference is 20589$, yet the % error is only ~5.29%, which is a better prediction.
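
To make the arithmetic concrete, here is a small sketch of the two error measures discussed above. The function names rmsle and pct_error are our own, and the metric is written as the plain RMSE of log values, as described in the problem statement.

```r
# RMSE between log(predicted) and log(actual).
rmsle <- function(predicted, actual) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Absolute error as a percentage of the actual price.
pct_error <- function(predicted, actual) {
  abs(predicted - actual) / actual * 100
}

cheap_home     <- pct_error(42000, 37000)    # 5000$ off on a 37000$ home
expensive_home <- pct_error(410000, 389411)  # 20589$ off on a 389411$ home

# The log-based score also ranks the cheap-home error as the worse one.
cheap_score     <- rmsle(42000, 37000)
expensive_score <- rmsle(410000, 389411)
```

Because the score works on log values, the 5000$ miss on the cheaper home costs more than the 20589$ miss on the expensive one.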

In real-life too, you will frequently face situations where the sensitivity of predictions  is as important as value-accuracy.

As always, you can download the code and data files from the Projects Page here, under Feb 2017.


Data Load & Cleanup:

In this step we perform the following tasks:

  1. Load the test and training set.
  2. Check training set for missing values.
  3. Delete columns with more than 40% missing records. These include the variables Alley (1369 empty), FireplaceQu (690), Fence (1179), PoolQC (1453), MiscFeature (1406).
  4. The other variables identified in step 2 are:
    • LotFrontage, Alley, MasVnrType, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, GarageType, GarageCond.
  5. For columns with very few missing values, we choose one of 3 options:
    • For categorical values, create a new level “unk” to indicate missing data.
    • For columns (numeric or categorical) where the data tends to fall overwhelmingly in one category, mark missing values with this option.
    • For numeric data, mark missing values with -1.
    • NOTE, we will apply these rules on the test set also, for consistency, irrespective of whether there are missing values or not.
  6. Repeat steps 2-5 for test set columns, since there may be empty cells for columns in the test set which did not show up in the training set. The variables we now identify include:
    • MSZoning, Utilities, Exterior1st, Exterior2nd, MasVnrArea,
    • BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF.
    • BsmtFullBath, BsmtHalfBath, KitchenQual,
    • GarageYrBlt, GarageQual, GarageFinish, GarageCars, GarageArea,
    • SaleType.
  7. We write the modified training and test sets to Excel so we can apply our models on the corrected data. (This is especially helpful in real life if you need to fine-tune your model over a few days’ time.)
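
The cleanup rules above can be sketched in a few lines of base R. This is a simplified illustration on a made-up five-row data frame: it applies only the 40% drop rule, the “unk” level and the -1 marker, and leaves out the "most common category" option, which needs a per-column decision.

```r
# Toy training rows (values are made up; NA marks a missing cell).
train <- data.frame(
  Alley       = c(NA, NA, NA, "Grvl", NA),
  GarageType  = c("Attchd", NA, "Detchd", "Attchd", "Attchd"),
  LotFrontage = c(65, NA, 80, 70, 60),
  SalePrice   = c(208500, 181500, 223500, 140000, 250000),
  stringsAsFactors = FALSE
)

# Fraction of missing values per column.
miss_frac <- colMeans(is.na(train))

# Drop columns with more than 40% missing records.
train <- train[, miss_frac <= 0.40, drop = FALSE]

# Fill what is left: "unk" for categorical columns, -1 for numeric columns.
for (col in names(train)) {
  missing <- is.na(train[[col]])
  if (is.character(train[[col]])) {
    train[[col]][missing] <- "unk"
  } else if (is.numeric(train[[col]])) {
    train[[col]][missing] <- -1
  }
}
```

The same loop would be run on the test set so that both sets follow identical rules.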


Feature Selection:

The easiest way to check if a relationship exists is to use statistical functions: chisquare, anova or correlation.

Our target variable is “SalePrice” (numeric value). So we will test all the other predictor variables against it to see if any relation exists and how much it affects the SalePrice.
For categorical predictors, we will use the chisquare test, whereas for numeric predictors, we will use correlation.

Our dataset has 74 predictive factors (excluding SalePrice and Id), so we run a for loop to do a rough check. For the chisquare test, we treat a relation as present only if the p-value is < 0.05.
If the predictor column is of type integer/numeric, we apply correlation. If the column is of type “character”, we apply chisquare.

We also add a column to identify variables of interest using the code below:

  • If correlation value falls below -0.75 (high negative correlation) or above 0.75 (high positive correlation) then we mark the variable as “match”.
  • If p-val from chisquare test falls below 0.05 then we mark it as “match”.
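
A minimal sketch of such a rough-check loop is shown below, on made-up data with two numeric and two character predictors. One assumption of ours is flagged clearly: since SalePrice is numeric, we bin it into quartiles before running the chisquare test. The original script may handle this differently.

```r
# Toy data: 40 homes with prices from 100000 to 295000 (all values made up).
train <- data.frame(SalePrice = seq(100000, 295000, by = 5000))
train$GrLivArea    <- train$SalePrice / 100 + rep(c(-50, 50), times = 20)
train$LotShapeNum  <- rep(c(1, 2, 3, 4), times = 10)
train$Neighborhood <- rep(c("A", "B"), each = 20)
train$Street       <- rep(c("Pave", "Grvl"), times = 20)

# Assumption: bin the numeric target into quartiles for the chisquare test.
price_bin <- cut(train$SalePrice,
                 breaks = quantile(train$SalePrice, probs = 0:4 / 4),
                 include.lowest = TRUE)

predictors <- setdiff(names(train), c("Id", "SalePrice"))
results <- data.frame(variable = predictors, stat = NA_real_, flag = "",
                      stringsAsFactors = FALSE)

for (i in seq_along(predictors)) {
  x <- train[[predictors[i]]]
  if (is.numeric(x)) {
    # Numeric predictor: correlation, flagged beyond +/- 0.75.
    r <- cor(x, train$SalePrice, use = "complete.obs")
    results$stat[i] <- r
    if (abs(r) > 0.75) results$flag[i] <- "match"
  } else {
    # Character predictor: chisquare p-value, flagged below 0.05.
    p <- suppressWarnings(chisq.test(table(x, price_bin))$p.value)
    results$stat[i] <- p
    if (p < 0.05) results$flag[i] <- "match"
  }
}
```

In this toy set, GrLivArea and Neighborhood are built to track price and get flagged, while LotShapeNum and Street are not.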

Using this “quack” approach, we quickly identify 19 variables of interest. These include Neighborhood, Building Type (single family, townhome, etc.), YearBuilt, Year Remodeled, Basement type (finished / unfinished), house area for first floor / second floor / garage, number of bathrooms (both full / half), number of cars the garage can accommodate, sale type and sale condition.

NOTE, it is always a good idea to graphically visualize correlation functions as the above approach may miss out predictors with a non-linear relationship.

Median House SalePrice by Neighborhood



Apply Algorithms:

We apply the following machine learning algorithms:

  1. Linear Regression Model.
  2. Classification Tree Model.
  3. Neural Network Model.
  4. Random Forest algorithm model.
  5. Generalized Linear Model (GLM)


We follow the same steps for all 5 models:
(Note, code and functions shown only for Linear Regression Model. Detailed functions and variables used for the other models are available in the R program files.)

  1. Use the training set to create a formula.
  2. Apply formula to predict values for validation set.
  3. Check the deviation from true homeprices in terms of both median $-value and % error.
  4. Apply formula on test set and check rank/score on leaderboard.
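
Steps 1 to 3 can be illustrated for the linear regression case with the sketch below. The data is made up (a single predictor with a fixed noise pattern), and holding out every fourth row as the validation set is our own simplification, not the split used in the downloadable script.

```r
# Toy data: price rises with living area, plus a fixed noise pattern.
n <- 60
homes <- data.frame(GrLivArea = seq(1000, 2475, by = 25))
homes$SalePrice <- 50000 + 100 * homes$GrLivArea +
  rep(c(-3000, 0, 3000), times = 20)

# Hold out every fourth row as the validation set.
val_idx <- seq(4, n, by = 4)
train <- homes[-val_idx, ]
valid <- homes[val_idx, ]

# Step 1: fit the model on the training set.
fit <- lm(SalePrice ~ GrLivArea, data = train)

# Step 2: predict prices for the validation set.
valid$pred <- predict(fit, newdata = valid)

# Step 3: deviation from true prices, as median $ value and median % error.
err <- valid$pred - valid$SalePrice
median_abs_err <- median(abs(err))
median_pct_err <- median(abs(err) / valid$SalePrice) * 100
```

Step 4 repeats the predict() call on the test set and writes the Id and predicted SalePrice columns to a submission file.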


The error rates for all 5 models are given in the table below:

home price table showing model error %


Combining Algorithms to Improve Accuracy:

There are many scientific papers which show that combining answers from multiple models can improve accuracy. A simple but excellent explanation (with respect to Kaggle), given by a founder and past Kaggle winner, is provided here. Basically, combining results from unrelated models can improve accuracy, even if the individual models are NOT 100% accurate on their own.

Similar to the explanation provided in the link, we calculate the arithmetic mean to average the results from all 5 models.
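
Averaging is a one-liner once the predictions from each model sit side by side. The sketch below uses made-up predictions from three hypothetical models for two homes, and shows both the arithmetic and the geometric mean.

```r
# Made-up predictions for two homes from three models (one column per model).
preds <- data.frame(
  lm_pred   = c(200000, 150000),
  tree_pred = c(210000, 140000),
  nnet_pred = c(190000, 160000)
)
pred_matrix <- as.matrix(preds)

# Arithmetic mean across models, one value per home.
arith_mean <- rowMeans(pred_matrix)

# Geometric mean: average the logs, then transform back.
geo_mean <- exp(rowMeans(log(pred_matrix)))
```

The geometric mean is never larger than the arithmetic mean, so it is pulled less by a single model that predicts far too high.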



We learnt how to clean and process a real-life dataset, select features of interest (and assess their impact on the target variable) and apply 5 different machine learning algorithms.

The code for this tutorial is available here on the Projects page, under month of Feb.

Please take a look and feel free to comment with your own feedback or models.


Machine Learning Model for Predictive Analytics in 6 easy steps

In this post, we are going to learn how to apply a machine learning model for predictive analytics. We will create 5 models using different algorithms and test the results to compare which model gives the most accurate results. You can use this approach to compete on Kaggle or make predictions using your own datasets.

Dataset – For this experiment, we will use the birth_weight dataset from the Delaware State Open Data Portal, which includes data from infants born in the period 2009-2016, including place of delivery (hospital / birthing center / home), gestation period (premature / normal) and details about the mother’s health conditions. You can download the data directly from the Open Data Link (or use the file provided with the code).

Step 1 – Prepare the Workspace.

  1. We clean up the memory of the current R session and load some standard library packages (data.table, ggplot2, sqldf, etc.).
  2. We load the dataset “Births.csv”.



Step 2 – Data Exploration.

  1. This step helps us understand the dataset – the range of values for variables, most common occurrences, etc. For our dataset, we look at a summary of birth years, birth weight and number of unique values.

  2. This is the point where we handle missing values and decide whether to ignore them (drop an entire column with a large number of missing values), delete them (when very few records are affected) or replace them with median values. In this set, however, there are no missing values that need to be processed.

  3. Check how many unique values exist for each column.



Step 3 – Test and Training Set

If you’ve ever competed on Kaggle, you will realize that the “training” set is the datafile used to create the machine learning model and the “test” set is the one where we use our model to predict the target variables.

In our case, we only have 1 file, so we will manually divide it into 3 sets: one training set and two test sets (70%, 15%, 15% split). Why 2 test sets? Because it helps us better understand how the model reacts to new data. You can work with just one if you like: just use one sequence command and stop with the testdf command.
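
One way to produce such a 70/15/15 split is sketched below on a stand-in data frame of 100 rows. The names births, traindf, test1df and test2df are placeholders, and the seed value is arbitrary; the downloadable script may build the split differently.

```r
# Stand-in for the births data: 100 rows with an id column only.
births <- data.frame(id = 1:100)

set.seed(270)                      # arbitrary seed, for a repeatable shuffle
n   <- nrow(births)
idx <- sample(seq_len(n))          # shuffled row numbers

n_train <- round(0.70 * n)
n_test1 <- round(0.15 * n)

# First 70% of the shuffled rows for training, then two test sets.
traindf <- births[idx[1:n_train], , drop = FALSE]
test1df <- births[idx[(n_train + 1):(n_train + n_test1)], , drop = FALSE]
test2df <- births[idx[(n_train + n_test1 + 1):n], , drop = FALSE]
```

To work with a single test set, stop after test1df and extend its range to the last row.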


Step 4 – Hypothesis Testing

statistical functions


In this step, we try to understand which predictors most affect our target variable using statistical functions such as ANOVA, chisquare, correlation, etc. The exact function you use can be determined using the table alongside.

Irrespective of which function we use, we assume the following hypotheses:
a) Ho (null hypothesis) – no relation exists. Ho is retained (we fail to reject it) if the p-value is >= 0.05.
b) Ha (alternate hypothesis) – relation exists. Ha is accepted if the p-value is < 0.05. If Ha is found true, then we conduct posthoc tests (for Anova and chisquare tests ONLY) to understand which sub-categories show significant differences in the relationship.
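
The test-then-posthoc flow can be sketched with base R alone. The counts below are made up (100 mothers per group, two weight classes) and only loosely mirror the ethnicity example that follows; the pairwise chisquare tests with a Bonferroni adjustment are one common way to run the posthoc step, not necessarily the one used in the downloadable script.

```r
# Made-up counts: rows are groups, columns are birth weight classes.
tab <- matrix(c(30, 70,
                 8, 92,
                 7, 93),
              nrow = 3, byrow = TRUE,
              dimnames = list(race   = c("unknown", "other", "white"),
                              weight = c("low", "normal")))

# Overall test: is birth weight related to group at all?
overall <- chisq.test(tab)

# Posthoc: test every pair of groups, then adjust the p-values (Bonferroni).
pairs <- combn(rownames(tab), 2)
raw_p <- apply(pairs, 2, function(pr) chisq.test(tab[pr, ])$p.value)
adj_p <- p.adjust(raw_p, method = "bonferroni")
names(adj_p) <- paste(pairs[1, ], pairs[2, ], sep = " vs ")
```

Here the overall p-value is below 0.05, and the adjusted pairwise values show that only the pairs involving the “unknown” group differ.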


(1) Relation between birth_weight and mom’s_ethnicity exists since p-value < 0.05.

Using BONFERRONI adjustment and posthoc tests, we realize that mothers with “unknown” race are more likely to have babies with low birth weight, as compared to women of other races.

We also see this from the frequency table (below). Clearly only 70% of babies born to mothers of “unknown” race are of normal weight (2500 gms or above) compared to 92% babies from “other” race moms and 93% babies of White-race origins.

mom ethnicity


(2) Relation between birth_weight and when prenatal_care started (first trimester, second, third or none): although we see p-value < 0.05, Ha cannot be accepted because the posthoc tests do NOT show significant differences among the prenatal care subsets.


(3) Relation between birth_weight and gestation period:

Posthoc tests show that babies in the groups POSTTERM 42+ WKS and TERM 37-41 WKS are similar and have higher birth weights than premature babies.

(4) We perform similar tests between birth_weight and multiple-babies (single, twins or triplets) and gender.



Step 5 – Model Creation

We create 5 models:

  • LDA (linear discriminant analysis) model with just 3 variables.
  • LDA model with 7 variables.
  • Decision tree model.
  • Model using Naïve Bayes theorem.
  • Model using a Neural Network algorithm.


(1) Simple LDA model:

Model formula:

Make predictions with test1 file.

Examine how well the model performed.

lda_model prediction-accuracy


From the table alongside, we see that the proportion of correct predictions (highlighted in green)

= (32 + 166 + 4150) / 5000

= 4348 / 5000

= 0.8696

Thus, 86.96% predictions were correctly identified for test1! (Note, we will use the same process for checking all 5 models.)
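
The calculation above is the sum of the diagonal of the validation table divided by the total number of records. In the sketch below, only the three diagonal counts and the total of 5000 come from the table; the off-diagonal counts and the first two class labels are placeholders we made up so that the totals add up.

```r
# Validation table: rows are predicted classes, columns are actual classes.
# Diagonal counts (32, 166, 4150) are from the post; the rest are placeholders.
classes <- c("class_1", "class_2", "2500+")
cm <- matrix(c( 32,  10,   20,
                15, 166,   57,
               150, 400, 4150),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = classes, actual = classes))

correct  <- sum(diag(cm))       # correctly predicted records
total    <- sum(cm)             # all records in test1
accuracy <- correct / total
```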

Using a similar process, we get 88.4% correct predictions for test2.


(2) LDA model with just 7 variables:


Make predictions for test1 and test2 files:

We get 87.6% correct predictions for test1 file and 88.57% correct for test2.


(3) Decision Tree Model

For the tree model, we first modify the birth weight variable to be treated as a “factor” rather than a string variable.

Model Formula:

Make predictions for test1 and test2 files:

We get 91.16% correct predictions for the test1 file and 91.6% correct for test2. However, the sensitivity of this model is low, since it has predicted that all babies will be of normal weight, i.e. the “2500+” category. This is one of the disadvantages of tree models: if the target variable has a highly popular option which accounts for 80% or more of the records, then the model may simply assign everyone to it (a sort of brute-force approach).
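
The gap between accuracy and sensitivity is easy to see with a small sketch. We assume test1 has 5000 records, so 91.16% accuracy corresponds to 4558 babies in the “2500+” class; the split of the remaining 442 records between two placeholder classes is made up.

```r
classes <- c("class_1", "class_2", "2500+")

# Actual classes: 4558 normal-weight babies, 442 in the two smaller classes.
actual <- factor(rep(classes, times = c(142, 300, 4558)), levels = classes)

# A model that assigns every baby to the most common class.
predicted <- factor(rep("2500+", length(actual)), levels = classes)

cm <- table(predicted, actual)

# Overall accuracy looks good...
accuracy <- sum(diag(cm)) / sum(cm)

# ...but per-class sensitivity (recall) is zero for the smaller classes.
sensitivity <- diag(cm) / colSums(cm)
```

Accuracy comes out at 0.9116, yet none of the low-weight babies are identified, which is exactly the case we most want the model to catch.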


(4) Naive Bayes Theorem :

Model Formula:

Make predictions for test1 and test2 files:

Again we get model accuracy of 91.16% and 91.6% for the test1 and test2 files respectively. However, this model also suffers from a “brute-force” approach and has marked all babies with normal weight, i.e. the “2500+” category. This reminds us that we must be careful about both accuracy and sensitivity of the model when applying an algorithm for forecasting purposes.


(5) Neural Net Algorithm Model :

Model Formula:

In the above formula, the “maxit” argument specifies the maximum number of iterations, so that the program stops instead of looping indefinitely while trying to converge. Since we have set the seed to 270, our model converges after 330 iterations. With another seed value this number may be higher or lower.

Make predictions for test1 file:

Validation table (below) shows that total number of correct observations = 4592. Hence model forecast accuracy = 91.84%



Test with second file:

Thus, the Neural Net model is accurate at 91.84% and 92.57% for test1 and test2 respectively.



Step 6 – Comparison of models

We take a quick look at how our models fared using a tabular comparison. We conclude that the neural network algorithm gives us the best accuracy and sensitivity.

compare data models



The code and datafiles for this tutorial are added to the New Projects page under “Jan” section. If you found this useful, please do share with your friends and colleagues. Feel free to share your thoughts and feedback in the comments section.


Stock Price Analysis – Linear Regression Model in 5 simple steps

In this post we are going to analyze stock prices for company Facebook and create a linear regression model.


Code Overview:

Our code performs the following functions. You can download the code here.

  1. Load original dataset.
  2. Add data to benchmark against S&P500 data.
  3. Create derived variables. We create variables to store calculations for the following:
    • Date values: split composite dates into day, month and year.
    • Daily volatility : Price change between daily high and low prices or intraday change in price. A bigger difference signifies heavy volatility.
    • Inter-day price volatility : Price change from previous day
  4. Data visualization to validate relationship between target variable and predictors.
  5. Create Linear Regression Model


Step 1&2 – Load Datasets

We load stock prices for Facebook and S&P500. Note, the S&P500 index prices begin from 2004, whereas Facebook was listed as a public company only in May 2012.
We specify “all.x = TRUE” in the merge command to keep every date in the Facebook file and drop the S&P500 dates that are not present in it.

fbknew = merge(fbkdf, sp5k, by = "Date", all.x = TRUE)

Note, we obtained this data from the Yahoo! Finance homepage using the tab “Historical Data”.


Step 3 – Derived Variables

a) Date Values:
The as.Date() function is an excellent choice to break down the date variable into day, month and year. The beauty of this function is that it allows you to specify the format of the date in the original data, since different regions format it differently (mmddyyyy / ddmmyy / ...).

fbknew$Date2 = as.Date(fbknew$Date, "%Y-%m-%d")
fbknew$mthyr = paste(as.numeric(format(fbknew$Date2, "%m")), "-",
as.numeric(format(fbknew$Date2, "%Y")), sep = "")
fbknew$mth = as.numeric(format(fbknew$Date2, "%m"))
fbknew$year = as.numeric(format(fbknew$Date2, "%Y"))

b) Intraday price:

fbknew$prc_change = fbknew$High - fbknew$Low

We have only calculated the difference between the High and Low, but since data is available for “Open” and “Close” you can calculate the maximum range as a self-exercise.
Another interesting exercise would be to calculate the average of both, or create a weighted variable to capture the impact of all 4 variables.

c) Inter-day price:
We first sort by date, using the order() function, and then use a “for loop” to calculate the price difference from the previous day. Note, since we are working with dataframes we cannot use a simple subtraction command like x = var[i] - var[i-1]. Feel free to try, the error message is really beneficial in understanding how arrays and dataframes differ!
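
As an alternative to the for loop, base R's diff() computes the same day-over-day change in one step once the rows are sorted. The sketch below uses three made-up rows; we assume the change is taken on the Open price and stored in volt_chg (the column name used later in the model formula), which may not match the downloadable script exactly.

```r
# Toy rows, deliberately out of date order (prices are made up).
fbk <- data.frame(
  Date = as.Date(c("2012-05-22", "2012-05-18", "2012-05-21")),
  Open = c(41, 40, 38)
)

# Sort by date so that each row follows the previous trading day.
fbk <- fbk[order(fbk$Date), ]

# Change from the previous day; the first day has no previous value.
fbk$volt_chg <- c(NA, diff(fbk$Open))
```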


Step 4 – Data Visualization

Before we create any statistical model, it is always good practice to visually explore the relationships between target variable (here “opening price”) and the predictor variables.

With a linear regression model this matters even more, since it helps identify whether any variables show a non-linear (exponential, parabolic) relationship. If you see such patterns, you can still use linear regression after transforming the data, for example with a log function.
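
To see why a log transform helps, the sketch below fits a straight line to made-up data that grows exponentially, first on the raw values and then on the logged values. All names and numbers here are illustrative only.

```r
# Made-up exponential relationship with a small fixed wobble.
x <- 1:20
y <- 5 * exp(0.3 * x) * rep(c(0.95, 1.05), times = 10)

# Straight line on the raw values: a poor description of the curve.
fit_raw <- lm(y ~ x)

# Straight line on log(y): the exponential curve becomes linear.
fit_log <- lm(log(y) ~ x)

r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
growth_rate <- coef(fit_log)[["x"]]   # close to the 0.3 used to build y
```

Predictions from the logged model are on the log scale, so apply exp() to bring them back to the original units.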

Here are some of the charts from such an exploration between “Open” price (OP) and other variables. Note there may be multiple explanations for the relationship, here we are only looking at the patterns, not the reason behind it.

 a) OP versus Trade Volume:

From chart below it looks like the volume is inversely related to price. There is also one outlier (last data point) . We use the lines() and lowess() functions to add a smooth trendline.
An interesting self-exercise would be to identify the date when this occurred and match it to a specific event in the company history (perhaps the stock split?)

Facebook Stock – Trading Volume versus Opening Stock Price

b) OP by Month & Year:

We see that the stock price has been steadily increasing year on year.


Facebook price by month and year

c) OP versus S&P500 index price:

Clearly the Facebook stock price has a linear relationship with the S&P500 index price (logical too!)


Facebook stock price versus S&P500 performance

d) Daily volatility:

This is slightly scattered, although the central clustering of data points indicates that the price is fairly stable.

Note, we use the sqldf package to aggregate data by monthyear / month for some of these charts.

Step 5 – Linear Regression Model

Our model formula is as follows:

lmodelfb = lm(Open ~ High + Volume + SP500_price + prc_change + mthyr + mth + year + volt_chg, data = fbknew)

We use the summary function to view the coefficients and identify the predictors with the biggest impact, as shown in the table below:
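
If you prefer to pull the significant predictors out programmatically rather than read them off the printed table, the coefficient matrix returned by summary() can be filtered on its p-value column. The sketch below uses a made-up toy model (y, x1, x2 are placeholder names), not the Facebook data.

```r
# Toy data: y depends on x1; x2 is an unrelated alternating column.
x1 <- 1:20
x2 <- rep(c(1, -1), times = 10)
y  <- 3 + 2 * x1 + rep(c(0.5, 0.5, -0.5, -0.5), times = 5)
toy <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = toy)

# Coefficient table: estimate, standard error, t value and p-value per term.
coefs <- summary(fit)$coefficients

# Terms with p-value below 0.05.
sig <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
```

The same two lines applied to lmodelfb would list the terms that are significant at the 0.05 level.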

predictor variables

We see that price is affected overall by S&P500, interday volatility and trade volume. Some months in 2013 also showed a significant impact.
This was our linear regression data model. Again, feel free to download the code and experiment for yourself.

Feel free to add your own observations in the comments section.
