Deep dive into data analysis tools, theory and projects

Author: anu - Journey of Analytics Team (Page 2 of 4)

Sberbank Machine Learning Series – Post 2 – Mind maps & Hypothesis

This is the second post of the Sberbank Russia housing set analysis, where we will narrow down the variables of interest and create a roadmap to understand which factors significantly impact the target variable (price_doc).

You can read the introductory first post here.

 

Analysis Roadmap:

This Kaggle dataset has ~290 variables, so having a clear direction is important. In the initial phase, we obviously do not know which variable is significant, and which one is not, so we will just read through the data dictionary and logically select variables of interest. Using these we create our hypothesis, i.e the relationship with target variable (home price) and test the strength of the relationship.

The dataset also includes macroeconomic variables, so we will also create derived variables to test interactions between variables.

A simple mindmap for this dataset is as below:

home price analysis mindmap

home price analysis mindmap

Hypothesis Qs:

The hypothesis Qs and predictor variables of interest are listed below:

Target Variable: (TV)

“price_doc” is the variable to predict. Henceforth this will be referred to as “TV”.

 

Predictor variables:

These are the variables that affect the target variable, although we do not know which one is more significant over the others, or indeed if two or more variables interact together to make a bigger impact.

For the Sberbank set, we have predictor variables from 3 categories:

  1. Property details,
  2. Neighborhood characteristics,
  3. Macroeconomic factors

(Note, all the predictors in the mindmap, marked with a # indicate derived or calculated variables).

 

Property details:

  1. Timestamp –
    1. We will use both the timestamp (d/m/y) as well as extract the month-year values to assess relationship with TV.
    2. We will also check if any of the homes have multiple timestamps, which means the house passed through multiple owners. If yes, does this correlate with a specific sub_area?
  2. Single family and bigger homes also have patios, yards, lofts, etc which creates a difference between living area and full home area. So we take a ratio between life_sq and full_sq and check if a home with bigger ratio plus larger full_sq gets better price.
  3. Kitch_sq – Do homes with larger kitchens command better price? So, we will take a ratio of kitch_sq / life_sq and check impact on house price.
  4. Sub_area – does this affect price?
  5. Build_year –
    1. Logically newer homes should have better price.
    2. Also check if there is interaction with full_sq i.e larger, newer homes gets better price?
    3. Check inter-relationship with sub_area.
  6. Material – how does this affect TV?
  7. Floor/max_floor –
    1. create this ratio and check affected price. Note, we need to identify how single-family homes are identified, since they would have to be excluded as a separate subset.
    2. Does a higher floor increase price? In specific sub_area? For example, certain top floor apartments in Chicago and NYC command better price since tenants get an amazing view of the skyline, and there is limited real estate in such areas.
  8. Product_type – Investment or ownership. Check if investment properties have better price.

 

Neighborhood details:

  1. Full_all – Total population in the area. Denser population should correlate with higher sale price.
  2. Male_f / female_f – Derived variable. If the ratio is skewed it may indicate military zones or special communities, which may possibly affect price.
  3. Kid friendly neighborhood – Calculate ratio of x13_all / full_all , i.e ratio of total population under 13 to overall population. A high ratio indicates a family-friendly neighborhood or residential suburb which may be better for home sale price. Also correlate with sub_area.
  4. Similar to above, calculate ratio of teens to overall population. Correlate with sub_area.
  5. Proximity to public transport: Calculate normalized scores for the following:
    1. Railroad_stn_walk_min,
    2. Metro_min_avto,
    3. Public_transport_walk
    4. Add all to get a weighted score. Lower values should hopefully correlate with higher home prices.
  6. Entertainment amenities: Easy access to entertainment options should be higher in densely populated areas with higher standards of living, and these areas presumably should command better home values. Hence we check relationship of TV with the following variables:
    1. Fitness_km,
    2. Bigmarket_km
    3. Stadium_km,
    4. Shoppingcentres_km,
  7. Proximity to office: TV versus normalized values for :
    1. Office_count_500,
    2. Office_count_1000,
    3. Logically the more number of offices nearby, better price value.
  8. Similarly, calculate normalized values for number of industries in the vicinity, i.e. prom_part_500 / prom_part_5000. However, here the hypothesis is that houses nearby will have lower sale prices, since industries lead to noise/pollution, and does not make an ideal residential neighborhood. (optional, check if sub_areas with high number of industries, have lower number of standalone homes (single-family/townhomes, etc).
  9. Ratio of premium cafes to inexpensive ones in the neighborhood i.e café_count_5000_price_high/ café_count_price_500. If the ratio is high, then do the houses in these areas have increased sale price? Also correlate with sub_area.

 

Macro Variables:

These are overall numbers for the entire country, so they remain fairly constant for a whole year. However, we will merge these variables to the training and test set, to get a more holistic view of the real estate market.

The reasoning is simple, if the overall mortgage rates are excessive (let’s say 35% interest rates) then it is highly unlikely there will be large number of home prices, thus forcing a reduction the overall home sale prices. Similarly, factors like inflation, income per person also affect home prices.

  1. Ratio of Income_per_Cap and real_disposable_income: ideally the economy is doing better if both numbers are high, thus making it easier for homebuyers to get home loans and consequently pursue the house of their dreams.
  2. Mortgage_value: We will use a normalized value, to see how much this number changes over the years. If the number is lower, our hypothesis is that more number of people took larger loans, and hence sale prices for the year should be higher.
  3. Usdrub: how well is the Ruble (Russian currency) faring against the dollar. Higher numbers should indicate better stability and economy and a stronger correlation with TV. (we will ignore the relationship with Euros for now).
  4. Cpi: normalized value over the years.
  5. GDP: we take a ratio of gdp_annual_growth/ gdp_annual, since both numbers should be high in a good economy.
  6. Unemployment ratio: Uemployment/ employment. Hypothesis is to look for an inverse relationship with TV.
  7. Population_migration: We will try to see the interaction with TV, while taking sub_area into consideration.
  8. Museum_visits_per_100_cap: Derive values to see if numbers have increased or decreased from the previos year, indicating higher/lower disposable income.
  9. Construction_value: normalized value.

 

In the next posts, we will use a) these hypothesis Qs to understand how the target variable is affected by the variables. (b) Apply the variables in different algorithms to calculate TV.

Sberbank Machine Learning Series – Post 1 – Project Introduction

For this month’s tutorials, we are going to work on the Kaggle Sberbank housing set, to forecast house price prices in Russia. This is a unique dataset from the Sberbank, an old and eminent institution in Russia, in that they have provided macroeconomic information along with the training and test data. The macro data includes variables like avg salary information, GDP, average mortgage rates by year, strength of Russian ruble versus Euro/Dollar, etc by month and year. This allows us to incorporate relevant political and economic factors that may create volatility in housing prices.

You can view more detailed information about the dataset, and download the files from the Kaggle website link here.

House price predictions

House price predictions

We are going to use this dataset in a series of posts to perform the following:

  1. Mindmaps for both Data exploration and solution framework.  In this dataset, there are 291 variables  in the training set, and 100 variables in the macro set. So for this project, we are going to use both Tableau and R for exploring the data.
  2. Initial Hypothesis testing to check for variable interactions, and help create meaningful derived variables.
  3. Baseline prediction models using 5 different machine learning algorithms.
  4. Internal and external validation. Internal validation by comparing models by sensitivity, accuracy and specificity . External validation by comparing scores on the Kaggle leaderboard.
  5. Ensemble (hybrid) models using combination of the baseline models.
  6. Final model upload to Kaggle.

 

Until next time, happy Coding!

Tutorials – Dashboards with R programming

In this blogpost we are going to implement dashboards using R programming, using the latest R library package “flexdashboard”.

R programming already offers some good features for graphs and charts (packages like ggplot, leaflet, etc). Plus there is always the option to create web applications using the Shiny library and presentations with RMarkdown documents.

However, this new library leverages these libraries and allows us to create some stunning dashboards, using interactive graphs and text. What I loved the most, was the “storyboard” feature that allows me to present content in Tableau-style frames. Please note that for this you need to create RMarkdown (.Rmd) files and insert the code using the R chunks as needed.

Do I think it will replace Tableau or any other enterprise BI dashboard tool? Not really (at least in the near future). But I do think it offers the following advantages:

  1. great alternative to static presentations since your audience can interact with the data.
  2. RMarkdown allows both programming and regular text content to be pooled together in a single document.
  3. Open source (a very big plus, in my opinion!)
  4. Storyboard format allows you to logically move the audience through the analysis : problem statement, raw data and exploration, different parts of the models/simulations/ number crunching, patterns in data, final summary and recommendations. Presenting the patterns that allow you to accept or reject a hypothesis has never been easier.

So, without further ado, let us look at the dashboard implementation with two examples:

  1. Storyboard dashboard.
  2. Simple dashboard with Shiny elements.

Library Installation instructions:

To start off, please install the “flexdashboard” package in your RStudio IDE. If installation is completed correctly, you will see the flex-dashboard feature when you create a new RMarkdown document, as shown in images below:

Step1 image:

Step 2 image:

Storyboard Dashboard:

Instead of analyzing a single dataset, I have chosen to present different interactive graph types using the storyboard feature. This will allow you to experience the range of options possible with this package.

An image of the storyboard is shown below, but you can also view the live document here (without source code or data files) .  The complete data and source code files are available for download here, simply search for “Dashboard Codefiles” on the Projects page.

The storyboard elements are described below:

  • Element 1 – Click on each frame to see the graph and explanation associated with that story point. (click element 5 to see Facebook stock trends)
  • Element 2 – This is the location for your graphs, tables, etc. One below each story point.
  • Element 3 – This explanation column in the right can be omitted, if required. However, my personal opinion is that this is a good way to highlight certain facts about the graph or place instructions, hyperlinks, etc. Like a webpage sidebar.
  • Element 4 – My tutorial only has 4 story elements, but if you have more flexdashboard automatically provides left-right arrow for navigation. (Just like Tableau).
  • Element 6 – Title bar for the project. Notice the social sharing button on the far right.

Shiny Style Simple Dashboard:

Here I have used Shiny elements to allow user to select the variables to plot on a graph. This also shows how you can divide your page into columns and rows, to show different content in different panes.

See image below for some of the features incorporated in this dashboard:

Feature explanation:

  • Element 1 – Title pane. The red color comes from the “cerulean” theme I used. You can change colors to match your business logo or personal preferences.
  • Element 2 – Shiny inputs. The dropdown is populated automatically from the dataset, so you don’t have to specify values separately.
  • Element 3 – Output graph, based on choices from element 1.
  • Element 4 – notice how this is separated vertically from element3, and horizontally from element 5.
  • Element 5 – Another graph. You can render text, image or pull in web content using the appropriate Shiny commands.
  • Element 6 – you can embed social share button, and also the source code. Click the code icon “</>” to view the code. You can also download the data and program files here, from my  Projects page.

Hope you found these dashboard implementations useful. Please add your valuable comments and feedback in the comments section.

Until next time, happy coding! ?