Journey of Analytics

Deep dive into data analysis tools, theory and projects

Top US Cities with Highest Rent

In this post, we will use the Zillow rent dataset to perform  exploratory and inferential statistics. Our main goal is to identify the most expensive real estate cities in US.


Input Files:

The Kaggle dataset contains two files with rental prices for 13000+ cities across the time frame Nov 2010 – Jan 2017. One file contains values for rent, the other has price per square foot.

Additionally, we use a public dataset to map geographical coordinates to the city names. The main analysis does not need the latitude, longitude values, so you can proceed without this file, except for the last map. Although, having these values helps to create some stunning visuals.

Feel free to use the location data file with other datasets or projects, as it contains coordinate information for cities in numerous countries. 


Note of caution:

The location data file is quite large, so the fread() to read it and the merge() later will take a minute or so.


Analysis Qs:

To give some structure to our analysis, these are the main goals for the project:

  1. Most expensive cities in US, by rent.
  2. Most expensive cities by price per square foot.
  3. Which states have a higher concentration of such cities?
  4. Rent trends over time.

Please note that the datafiles and R-program code are available on the Projects page under Aug 2017.

Data Cleansing:

The Kaggle files are quite clean, without many missing values. However, to use them for analyzing trends over time, we still need to process them.  In this case, the rent for each month is in a separate column, so we need to aggregate those together.  We achieve this by using a custom for-loop.


On a side note, if you are trying to massage data for reporting formats, say similar to a pivot table in Excel, then using similar for-loops can save you tons of time doing manual steps.

We will also merge the latitude & longitude data at this step. Some of the city names don’t match exactly so we will use some string manipulation functions to make a perfect match.

This is how the data frame looks after the data processing step:

transformed data object

transformed data object


Rent Analysis:

We will use the Jan 2017 month to do a ranking for parameters like population density, rent amount and price per square ft.


a) Most expensive cities in US, by rent:

We use Jan data to sort the cities by rent amount, then assign a title similar to “Num. City_Name” . Take the list of top 10 cities and then merge with the original rent dataframe, to view rent trend over time.

This gives us the list below:

US cities with highest rent

US cities with highest rent


If we plot the rent values since Nov 2010, we get a chart as shown below:

We notice that Jupiter island and Westlake see some intra-yearly rent patterns indicating seasonal shifts in demand/supply.


b) Cities with highest price by area:

Using the price per square foot dataset, we can also identify cities with the highest price per square foot area. The city list for this analysis is as follows:


Notice that the city names in the two lists are not identical. Jupiter island which was first in list 1, has moved down to spot 4.  Similarly, a 2000 sq.ft home in Malibu CA would set you back by $9,000 per month! We also see that most cities in this list are predominantly in California or Florida.


c) Cities with small area but huge rent!!

Let us investigate which cities make you shell out tons of money for very small homes. We can calculate area using the price per sq. ft. and rent amount.

small home, big rent

small home, big rent


d) Ranking cities with higher population density:

Similar code gives us the list below:

rent in cities with large population

rent in cities with large population

Not surprisingly, we see names like New York, Los Angeles and Chicago heading the list.



Mapping Cities & Rent:

We’ve added the geographical coordinates to our dataset, so let us try to plot the cities and their median rent. We will add a column for the text we want to display and use leaflet() function to create the map.

Note the maps look a little blurred at first, after 10 seconds the areas look lot clearer as the maps load up. So you can see national & state highway, city names and other details. The zoom feature allows users to zoom in and out.

Images for Hawaii are shown below:

US city map with clusters

US city map with clusters

Zooming to the left and down to view Hawaii.

Hawaii map

Hawaii map


Zooming further to check the Kailua island of Hawaii:

Median rent in Kailua, HI

Median rent in Kailua, HI


Data Insights:

  1. Top 10 most expensive cities seem to be concentrated in CA and TX. (California & Texas)
  2. In such cities you have to pay $10,000+ as rent.
  3. For the cities where you pay a lot for homes smaller than 900 sq ft, we notice that Hawaii cities have a seasonal trend. Perhaps due to tourist cycles and the torrential rains.
  4. The most populous cities are not always the most expensive, although it probably means a lot more competition for the same few homes.
  5. Median rent in most populous cities is ~$1300

What other insights did you pick up?


Next Steps:

You can play around with the data and code to see other rankings or create your visualizations. Here are some pointers to get you started:

  1. Rank cities by highest rent price for some random months – Jan 2014, July 2015, Mar 2012, Aug 2013, Nov 2016, July 2011, Sep 2015. Do the top 20 lists remain the same? Different?
  2. Collect the list of city names from all the above and view trend over time? Identify which city has the maximum price % increase, where price % =[ (Jan2017 rent – Nov 2010 rent) / Nov 2010 rent ]
  3. Which state has the highest number of such expensive cities? If the answer is CA, which is the second most expensive state?
  4. Repeat steps 1-3 for price per square foot.
  5. Select a midwestern state like Kansas, Oklahoma, North Dakota or Mississippi and repeat the analysis at a state level.


Please feel free to download the code files and datasets from the Projects Page under Aug 2017.

August Projects

In this month’s project, we will implement cluster analysis using the “K-means algorithm”.

We use the weather data from 1500+ locations (near airports) to understand temperature patterns by latitude and time of year.

We use cluster = 5 and assign letter A through E to locations with similar weather patterns. At the end of the analysis, you should be able to interpret the following insights from the resulting graphs and tables:

  1. Temperature patterns are similar towards the far North and South, just vertically shifted.
  2. The Pacific coast is different from the rest of the nation, where the temperature is static almost throughout the year.
  3. It is interesting to see how states in two different parts of the country show similar weather patterns since they are on the same latitude (see Minnesota and Maine). During peak summer, these two states are hotter than California.


A sample graph from the analysis is shown below.

US states by 5 major weather clustersUS states by 5 major weather clusters

US states divided into 5 major weather clusters

Data set and code files are available from the main Project site page, under the row for Jul/Aug 2017.

Take a look and play around with the data, to investigate the following:

  1. What happens if you increase cluster size to 7? What happens if you decrease it to 3?
  2. What is the monthly weather pattern for Hawaii (state code = HI) versus New Hampshire (abbreviation = NH) ?
  3. What is the weekly average temperature for a tropical state like Florida (plot a chart with median temperatures for all 52 weeks, by year). Has the average temperature gone up due to global warming?

Please leave your thoughts and comments, or questions if you get stuck on any point.

Happy Coding!



Monte Carlo Simulations in R

In today’s tutorial, we are going to learn how to implement Monte Carlo Simulations in R.

Logic behind Monte Carlo:

Monte Carlo Simulations in R

Monte Carlo Simulations in R

Monte Carlo simulation (also known as the Monte Carlo Method) is a statistical technique that allows us to compute all the possible outcomes of an event. This makes it extremely helpful in risk assessment and aids decision-making because we can predict the probability of extreme cases coming true. The technique was first used by scientists working on the atom bomb; it was named for Monte Carlo, the Monaco resort town renowned for its casinos. Since its introduction in World War II, Monte Carlo simulation has been used to model a variety of physical and conceptual systems.

Monte Carlo methods are used to identify the probability of an event A happening, among a set of N events. We assume that all the events are independent, and the probability of event A happening once does not prevent the occurrence again.

For example, assume you have a fair coin and you flip it once. The probability of heads is 0.5 i.e. equal possibility of heads or tails. You flip the coin again. The possibility of heads is still 0.5, irrespective of whether we got heads or tails in the first flip. However, we can safely say that if we were to flip the coin 100 times, you would see heads ~50% of the times. The application of Monte Carlo (referred henceforth in this post as MC) methods comes to play when we want to find out the probability of heads occurring 16 times in a row. (or 5 or 3 or any other number.)

You can read more about these methods and the theory behind them, using the links below:

  1. Wikipedia – link.
  2. MC methods in Finance, from – link2
  3. Basics of MC from software provider Palisade. – link3.


MC methods are used by professionals in numerous fields ranging from finance, project management, energy, manufacturing, R&D, insurance, biotech, etc. Some real-world applications of Monte Carlo simulations are given below:

  1. Monte Carlo simulations are used in financial services to predict fraudulent credit card transactions. (since 100 genuine transactions do not guarantee the next one will not be fraudulent, even though it is a rare event by itself.)
  2. Risk analysis. Assume a new product was sold at a loss of $300 to 6 users (due to coupons or sales), a profit of $467 in 79 users and a profit of $82 to 119 customers. We can use Monte Carlo simulations to understand what would be the average P/L (profit or loss) if 1000 customers bought our products.
  3. A/B testing to understand page bounce and success web elements. Assume you changed the payment processing system on your e-commerce site. You are doing an A/B test to see if the upgrade results in improved checkout completion. On the old system, 12 users abandoned their cart, while 19 completed their purchase. On the new system, 147 people abandoned their cart while 320 completed their purchase. Which system works better?
  4. Selection criteria. Example if we have 7 candidates for a scholarship (Eileen, George, Taher, Ramesis, Arya, Sandra and Mike) what is the probability that Mike will be chosen in three consecutive years? Assuming the candidate list is the same and past winners are not barred from receiving the scholarship again.


Advantages of using MC:

Unlike simple forecasting, Monte Carlo simulation can help with the following:

  • Probabilistic Results – show scenarios and how the occurrence likelihood.
  • Graphical Results – The outcomes and their chance of occurring can be easily converted to graphs making it easy to communicate findings to an audience.
  • Sensitivity Analysis – Easier to see which variables impact the outcome the most, i.e. which variables had the biggest effect on bottom-line results.
  • Scenario Analysis: Using Monte Carlo simulation, we can see exactly which inputs had which values together when certain outcomes occurred.
  • Correlation of Inputs. In Monte Carlo simulation, it’s possible to model interdependent relationships between input variables. It’s important for accuracy to represent how, in reality, when some factors goes up, others go up or down accordingly.

Code template:

The basic template for MC is as follows:


Let’s look at this code in detail:

  • Runs = no of trials or iterations. For our product profit example (application example 2), runs = 1000.
  • Func1 = this is the formula definition where we will indicate number of different events, their probability and the selection criteria. For our scholarship candidate example (application number 4) this function would be modified as:

sum(sample(c(1:7), size =3, replace = T)) > 6

where we are assigning number 1:7 to each student and hence Mike = 7.

Main code:

The code files for this tutorial are available on the 2017 project page. (Link here under Jul/Aug 2017 ) .

Sberbank Machine Learning Series – Post 2 – Mind maps & Hypothesis

This is the second post of the Sberbank Russia housing set analysis, where we will narrow down the variables of interest and create a roadmap to understand which factors significantly impact the target variable (price_doc).

You can read the introductory first post here.


Analysis Roadmap:

This Kaggle dataset has ~290 variables, so having a clear direction is important. In the initial phase, we obviously do not know which variable is significant, and which one is not, so we will just read through the data dictionary and logically select variables of interest. Using these we create our hypothesis, i.e the relationship with target variable (home price) and test the strength of the relationship.

The dataset also includes macroeconomic variables, so we will also create derived variables to test interactions between variables.

A simple mindmap for this dataset is as below:

home price analysis mindmap

home price analysis mindmap

Hypothesis Qs:

The hypothesis Qs and predictor variables of interest are listed below:

Target Variable: (TV)

“price_doc” is the variable to predict. Henceforth this will be referred to as “TV”.


Predictor variables:

These are the variables that affect the target variable, although we do not know which one is more significant over the others, or indeed if two or more variables interact together to make a bigger impact.

For the Sberbank set, we have predictor variables from 3 categories:

  1. Property details,
  2. Neighborhood characteristics,
  3. Macroeconomic factors

(Note, all the predictors in the mindmap, marked with a # indicate derived or calculated variables).


Property details:

  1. Timestamp –
    1. We will use both the timestamp (d/m/y) as well as extract the month-year values to assess relationship with TV.
    2. We will also check if any of the homes have multiple timestamps, which means the house passed through multiple owners. If yes, does this correlate with a specific sub_area?
  2. Single family and bigger homes also have patios, yards, lofts, etc which creates a difference between living area and full home area. So we take a ratio between life_sq and full_sq and check if a home with bigger ratio plus larger full_sq gets better price.
  3. Kitch_sq – Do homes with larger kitchens command better price? So, we will take a ratio of kitch_sq / life_sq and check impact on house price.
  4. Sub_area – does this affect price?
  5. Build_year –
    1. Logically newer homes should have better price.
    2. Also check if there is interaction with full_sq i.e larger, newer homes gets better price?
    3. Check inter-relationship with sub_area.
  6. Material – how does this affect TV?
  7. Floor/max_floor –
    1. create this ratio and check affected price. Note, we need to identify how single-family homes are identified, since they would have to be excluded as a separate subset.
    2. Does a higher floor increase price? In specific sub_area? For example, certain top floor apartments in Chicago and NYC command better price since tenants get an amazing view of the skyline, and there is limited real estate in such areas.
  8. Product_type – Investment or ownership. Check if investment properties have better price.


Neighborhood details:

  1. Full_all – Total population in the area. Denser population should correlate with higher sale price.
  2. Male_f / female_f – Derived variable. If the ratio is skewed it may indicate military zones or special communities, which may possibly affect price.
  3. Kid friendly neighborhood – Calculate ratio of x13_all / full_all , i.e ratio of total population under 13 to overall population. A high ratio indicates a family-friendly neighborhood or residential suburb which may be better for home sale price. Also correlate with sub_area.
  4. Similar to above, calculate ratio of teens to overall population. Correlate with sub_area.
  5. Proximity to public transport: Calculate normalized scores for the following:
    1. Railroad_stn_walk_min,
    2. Metro_min_avto,
    3. Public_transport_walk
    4. Add all to get a weighted score. Lower values should hopefully correlate with higher home prices.
  6. Entertainment amenities: Easy access to entertainment options should be higher in densely populated areas with higher standards of living, and these areas presumably should command better home values. Hence we check relationship of TV with the following variables:
    1. Fitness_km,
    2. Bigmarket_km
    3. Stadium_km,
    4. Shoppingcentres_km,
  7. Proximity to office: TV versus normalized values for :
    1. Office_count_500,
    2. Office_count_1000,
    3. Logically the more number of offices nearby, better price value.
  8. Similarly, calculate normalized values for number of industries in the vicinity, i.e. prom_part_500 / prom_part_5000. However, here the hypothesis is that houses nearby will have lower sale prices, since industries lead to noise/pollution, and does not make an ideal residential neighborhood. (optional, check if sub_areas with high number of industries, have lower number of standalone homes (single-family/townhomes, etc).
  9. Ratio of premium cafes to inexpensive ones in the neighborhood i.e café_count_5000_price_high/ café_count_price_500. If the ratio is high, then do the houses in these areas have increased sale price? Also correlate with sub_area.


Macro Variables:

These are overall numbers for the entire country, so they remain fairly constant for a whole year. However, we will merge these variables to the training and test set, to get a more holistic view of the real estate market.

The reasoning is simple, if the overall mortgage rates are excessive (let’s say 35% interest rates) then it is highly unlikely there will be large number of home prices, thus forcing a reduction the overall home sale prices. Similarly, factors like inflation, income per person also affect home prices.

  1. Ratio of Income_per_Cap and real_disposable_income: ideally the economy is doing better if both numbers are high, thus making it easier for homebuyers to get home loans and consequently pursue the house of their dreams.
  2. Mortgage_value: We will use a normalized value, to see how much this number changes over the years. If the number is lower, our hypothesis is that more number of people took larger loans, and hence sale prices for the year should be higher.
  3. Usdrub: how well is the Ruble (Russian currency) faring against the dollar. Higher numbers should indicate better stability and economy and a stronger correlation with TV. (we will ignore the relationship with Euros for now).
  4. Cpi: normalized value over the years.
  5. GDP: we take a ratio of gdp_annual_growth/ gdp_annual, since both numbers should be high in a good economy.
  6. Unemployment ratio: Uemployment/ employment. Hypothesis is to look for an inverse relationship with TV.
  7. Population_migration: We will try to see the interaction with TV, while taking sub_area into consideration.
  8. Museum_visits_per_100_cap: Derive values to see if numbers have increased or decreased from the previos year, indicating higher/lower disposable income.
  9. Construction_value: normalized value.


In the next posts, we will use a) these hypothesis Qs to understand how the target variable is affected by the variables. (b) Apply the variables in different algorithms to calculate TV.

Sberbank Machine Learning Series – Post 1 – Project Introduction

For this month’s tutorials, we are going to work on the Kaggle Sberbank housing set, to forecast house price prices in Russia. This is a unique dataset from the Sberbank, an old and eminent institution in Russia, in that they have provided macroeconomic information along with the training and test data. The macro data includes variables like avg salary information, GDP, average mortgage rates by year, strength of Russian ruble versus Euro/Dollar, etc by month and year. This allows us to incorporate relevant political and economic factors that may create volatility in housing prices.

You can view more detailed information about the dataset, and download the files from the Kaggle website link here.

House price predictions

House price predictions

We are going to use this dataset in a series of posts to perform the following:

  1. Mindmaps for both Data exploration and solution framework.  In this dataset, there are 291 variables  in the training set, and 100 variables in the macro set. So for this project, we are going to use both Tableau and R for exploring the data.
  2. Initial Hypothesis testing to check for variable interactions, and help create meaningful derived variables.
  3. Baseline prediction models using 5 different machine learning algorithms.
  4. Internal and external validation. Internal validation by comparing models by sensitivity, accuracy and specificity . External validation by comparing scores on the Kaggle leaderboard.
  5. Ensemble (hybrid) models using combination of the baseline models.
  6. Final model upload to Kaggle.


Until next time, happy Coding!

« Older posts