Journey of Analytics

Deep dive into data analysis tools, theory and projects

Tag: machine learning blog series1

Sberbank Machine Learning Series – Post 2 – Mind maps & Hypothesis

This is the second post of the Sberbank Russia housing set analysis, where we will narrow down the variables of interest and create a roadmap to understand which factors significantly impact the target variable (price_doc).

You can read the introductory first post here.


Analysis Roadmap:

This Kaggle dataset has ~290 variables, so having a clear direction is important. In the initial phase, we obviously do not know which variable is significant, and which one is not, so we will just read through the data dictionary and logically select variables of interest. Using these we create our hypothesis, i.e the relationship with target variable (home price) and test the strength of the relationship.

The dataset also includes macroeconomic variables, so we will also create derived variables to test interactions between variables.

A simple mindmap for this dataset is as below:

home price analysis mindmap

home price analysis mindmap

Hypothesis Qs:

The hypothesis Qs and predictor variables of interest are listed below:

Target Variable: (TV)

“price_doc” is the variable to predict. Henceforth this will be referred to as “TV”.


Predictor variables:

These are the variables that affect the target variable, although we do not know which one is more significant over the others, or indeed if two or more variables interact together to make a bigger impact.

For the Sberbank set, we have predictor variables from 3 categories:

  1. Property details,
  2. Neighborhood characteristics,
  3. Macroeconomic factors

(Note, all the predictors in the mindmap, marked with a # indicate derived or calculated variables).


Property details:

  1. Timestamp –
    1. We will use both the timestamp (d/m/y) as well as extract the month-year values to assess relationship with TV.
    2. We will also check if any of the homes have multiple timestamps, which means the house passed through multiple owners. If yes, does this correlate with a specific sub_area?
  2. Single family and bigger homes also have patios, yards, lofts, etc which creates a difference between living area and full home area. So we take a ratio between life_sq and full_sq and check if a home with bigger ratio plus larger full_sq gets better price.
  3. Kitch_sq – Do homes with larger kitchens command better price? So, we will take a ratio of kitch_sq / life_sq and check impact on house price.
  4. Sub_area – does this affect price?
  5. Build_year –
    1. Logically newer homes should have better price.
    2. Also check if there is interaction with full_sq i.e larger, newer homes gets better price?
    3. Check inter-relationship with sub_area.
  6. Material – how does this affect TV?
  7. Floor/max_floor –
    1. create this ratio and check affected price. Note, we need to identify how single-family homes are identified, since they would have to be excluded as a separate subset.
    2. Does a higher floor increase price? In specific sub_area? For example, certain top floor apartments in Chicago and NYC command better price since tenants get an amazing view of the skyline, and there is limited real estate in such areas.
  8. Product_type – Investment or ownership. Check if investment properties have better price.


Neighborhood details:

  1. Full_all – Total population in the area. Denser population should correlate with higher sale price.
  2. Male_f / female_f – Derived variable. If the ratio is skewed it may indicate military zones or special communities, which may possibly affect price.
  3. Kid friendly neighborhood – Calculate ratio of x13_all / full_all , i.e ratio of total population under 13 to overall population. A high ratio indicates a family-friendly neighborhood or residential suburb which may be better for home sale price. Also correlate with sub_area.
  4. Similar to above, calculate ratio of teens to overall population. Correlate with sub_area.
  5. Proximity to public transport: Calculate normalized scores for the following:
    1. Railroad_stn_walk_min,
    2. Metro_min_avto,
    3. Public_transport_walk
    4. Add all to get a weighted score. Lower values should hopefully correlate with higher home prices.
  6. Entertainment amenities: Easy access to entertainment options should be higher in densely populated areas with higher standards of living, and these areas presumably should command better home values. Hence we check relationship of TV with the following variables:
    1. Fitness_km,
    2. Bigmarket_km
    3. Stadium_km,
    4. Shoppingcentres_km,
  7. Proximity to office: TV versus normalized values for :
    1. Office_count_500,
    2. Office_count_1000,
    3. Logically the more number of offices nearby, better price value.
  8. Similarly, calculate normalized values for number of industries in the vicinity, i.e. prom_part_500 / prom_part_5000. However, here the hypothesis is that houses nearby will have lower sale prices, since industries lead to noise/pollution, and does not make an ideal residential neighborhood. (optional, check if sub_areas with high number of industries, have lower number of standalone homes (single-family/townhomes, etc).
  9. Ratio of premium cafes to inexpensive ones in the neighborhood i.e café_count_5000_price_high/ café_count_price_500. If the ratio is high, then do the houses in these areas have increased sale price? Also correlate with sub_area.


Macro Variables:

These are overall numbers for the entire country, so they remain fairly constant for a whole year. However, we will merge these variables to the training and test set, to get a more holistic view of the real estate market.

The reasoning is simple, if the overall mortgage rates are excessive (let’s say 35% interest rates) then it is highly unlikely there will be large number of home prices, thus forcing a reduction the overall home sale prices. Similarly, factors like inflation, income per person also affect home prices.

  1. Ratio of Income_per_Cap and real_disposable_income: ideally the economy is doing better if both numbers are high, thus making it easier for homebuyers to get home loans and consequently pursue the house of their dreams.
  2. Mortgage_value: We will use a normalized value, to see how much this number changes over the years. If the number is lower, our hypothesis is that more number of people took larger loans, and hence sale prices for the year should be higher.
  3. Usdrub: how well is the Ruble (Russian currency) faring against the dollar. Higher numbers should indicate better stability and economy and a stronger correlation with TV. (we will ignore the relationship with Euros for now).
  4. Cpi: normalized value over the years.
  5. GDP: we take a ratio of gdp_annual_growth/ gdp_annual, since both numbers should be high in a good economy.
  6. Unemployment ratio: Uemployment/ employment. Hypothesis is to look for an inverse relationship with TV.
  7. Population_migration: We will try to see the interaction with TV, while taking sub_area into consideration.
  8. Museum_visits_per_100_cap: Derive values to see if numbers have increased or decreased from the previos year, indicating higher/lower disposable income.
  9. Construction_value: normalized value.


In the next posts, we will use a) these hypothesis Qs to understand how the target variable is affected by the variables. (b) Apply the variables in different algorithms to calculate TV.

Sberbank Machine Learning Series – Post 1 – Project Introduction

For this month’s tutorials, we are going to work on the Kaggle Sberbank housing set, to forecast house price prices in Russia. This is a unique dataset from the Sberbank, an old and eminent institution in Russia, in that they have provided macroeconomic information along with the training and test data. The macro data includes variables like avg salary information, GDP, average mortgage rates by year, strength of Russian ruble versus Euro/Dollar, etc by month and year. This allows us to incorporate relevant political and economic factors that may create volatility in housing prices.

You can view more detailed information about the dataset, and download the files from the Kaggle website link here.

House price predictions

House price predictions

We are going to use this dataset in a series of posts to perform the following:

  1. Mindmaps for both Data exploration and solution framework.  In this dataset, there are 291 variables  in the training set, and 100 variables in the macro set. So for this project, we are going to use both Tableau and R for exploring the data.
  2. Initial Hypothesis testing to check for variable interactions, and help create meaningful derived variables.
  3. Baseline prediction models using 5 different machine learning algorithms.
  4. Internal and external validation. Internal validation by comparing models by sensitivity, accuracy and specificity . External validation by comparing scores on the Kaggle leaderboard.
  5. Ensemble (hybrid) models using combination of the baseline models.
  6. Final model upload to Kaggle.


Until next time, happy Coding!