In this post, we will use the Zillow rent dataset to perform  exploratory and inferential statistics. Our main goal is to identify the most expensive real estate cities in US.


Input Files:

The Kaggle dataset contains two files with rental prices for 13000+ cities across the time frame Nov 2010 – Jan 2017. One file contains values for rent, the other has price per square foot.

Additionally, we use a public dataset to map geographical coordinates to the city names. The main analysis does not need the latitude, longitude values, so you can proceed without this file, except for the last map. Although, having these values helps to create some stunning visuals.

Feel free to use the location data file with other datasets or projects, as it contains coordinate information for cities in numerous countries. 


Note of caution:

The location data file is quite large, so the fread() to read it and the merge() later will take a minute or so.


Analysis Qs:

To give some structure to our analysis, these are the main goals for the project:

  1. Most expensive cities in US, by rent.
  2. Most expensive cities by price per square foot.
  3. Which states have a higher concentration of such cities?
  4. Rent trends over time.

Please note that the datafiles and R-program code are available on the Projects page under Aug 2017.

Data Cleansing:

The Kaggle files are quite clean, without many missing values. However, to use them for analyzing trends over time, we still need to process them.  In this case, the rent for each month is in a separate column, so we need to aggregate those together.  We achieve this by using a custom for-loop.


On a side note, if you are trying to massage data for reporting formats, say similar to a pivot table in Excel, then using similar for-loops can save you tons of time doing manual steps.

We will also merge the latitude & longitude data at this step. Some of the city names don’t match exactly so we will use some string manipulation functions to make a perfect match.

This is how the data frame looks after the data processing step:

transformed data object

transformed data object


Rent Analysis:

We will use the Jan 2017 month to do a ranking for parameters like population density, rent amount and price per square ft.


a) Most expensive cities in US, by rent:

We use Jan data to sort the cities by rent amount, then assign a title similar to “Num. City_Name” . Take the list of top 10 cities and then merge with the original rent dataframe, to view rent trend over time.

This gives us the list below:

US cities with highest rent

US cities with highest rent


If we plot the rent values since Nov 2010, we get a chart as shown below:

We notice that Jupiter island and Westlake see some intra-yearly rent patterns indicating seasonal shifts in demand/supply.


b) Cities with highest price by area:

Using the price per square foot dataset, we can also identify cities with the highest price per square foot area. The city list for this analysis is as follows:


Notice that the city names in the two lists are not identical. Jupiter island which was first in list 1, has moved down to spot 4.  Similarly, a 2000 sq.ft home in Malibu CA would set you back by $9,000 per month! We also see that most cities in this list are predominantly in California or Florida.


c) Cities with small area but huge rent!!

Let us investigate which cities make you shell out tons of money for very small homes. We can calculate area using the price per sq. ft. and rent amount.

small home, big rent

small home, big rent


d) Ranking cities with higher population density:

Similar code gives us the list below:

rent in cities with large population

rent in cities with large population

Not surprisingly, we see names like New York, Los Angeles and Chicago heading the list.



Mapping Cities & Rent:

We’ve added the geographical coordinates to our dataset, so let us try to plot the cities and their median rent. We will add a column for the text we want to display and use leaflet() function to create the map.

Note the maps look a little blurred at first, after 10 seconds the areas look lot clearer as the maps load up. So you can see national & state highway, city names and other details. The zoom feature allows users to zoom in and out.

Images for Hawaii are shown below:

US city map with clusters

US city map with clusters

Zooming to the left and down to view Hawaii.

Hawaii map

Hawaii map


Zooming further to check the Kailua island of Hawaii:

Median rent in Kailua, HI

Median rent in Kailua, HI


Data Insights:

  1. Top 10 most expensive cities seem to be concentrated in CA and TX. (California & Texas)
  2. In such cities you have to pay $10,000+ as rent.
  3. For the cities where you pay a lot for homes smaller than 900 sq ft, we notice that Hawaii cities have a seasonal trend. Perhaps due to tourist cycles and the torrential rains.
  4. The most populous cities are not always the most expensive, although it probably means a lot more competition for the same few homes.
  5. Median rent in most populous cities is ~$1300

What other insights did you pick up?


Next Steps:

You can play around with the data and code to see other rankings or create your visualizations. Here are some pointers to get you started:

  1. Rank cities by highest rent price for some random months – Jan 2014, July 2015, Mar 2012, Aug 2013, Nov 2016, July 2011, Sep 2015. Do the top 20 lists remain the same? Different?
  2. Collect the list of city names from all the above and view trend over time? Identify which city has the maximum price % increase, where price % =[ (Jan2017 rent – Nov 2010 rent) / Nov 2010 rent ]
  3. Which state has the highest number of such expensive cities? If the answer is CA, which is the second most expensive state?
  4. Repeat steps 1-3 for price per square foot.
  5. Select a midwestern state like Kansas, Oklahoma, North Dakota or Mississippi and repeat the analysis at a state level.


Please feel free to download the code files and datasets from the Projects Page under Aug 2017.