## Journey of Analytics

### Deep dive into data analysis tools, theory and projects

#### Category: Monthly projects (page 2 of 2)

Hello All,

In today’s tutorial we will apply 5 different machine learning algorithms to predict house sale prices using the Ames Housing Data.

This dataset is also available as an active Kaggle competition for the next month, so you can use this as a Kaggle starter script (in R). Use the output from the models to generate submission files for the Kaggle platform and see how well you fare on the public leaderboard. This is also a perfect simulation of a real-world analytics problem, where the final results are validated by a customer / client / third party.

This tutorial is divided into 4 parts:

• Data cleaning
• Feature selection
• Applying algorithms
• Using arithmetic / geometric means from multiple algorithms to increase accuracy

# Problem Statement:

Before we begin, let us understand the problem statement:

Predict Home SalePrice for the Test Data Set with the lowest possible error.

The Kaggle competition evaluates the leaderboard score based on the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sale price, so the best possible model would have a score of 0.

The evaluation considers log values, so the model is penalized for incorrectly predicting both expensive houses AND cheap houses. What matters is the % deviation from the real value, not just the \$-value of the homes.
E.g.: if the predicted home price was \$42,000 but the actual value was \$37,000, then the \$-value error is only \$5,000, which doesn’t seem like a lot. However, the error % is ~13.52%.
On the contrary, imagine a home with a real sale price of \$389,411 which we predicted to be \$410,000. The \$-value difference is \$20,589, yet the error % is only ~5.28%, which makes it the better prediction.

In real-life too, you will frequently face situations where the sensitivity of predictions  is as important as value-accuracy.

As always, you can download the code and data files from the Projects Page here, under Feb 2017.

# Data Cleaning:

In this section we perform the following tasks:

1. Load the test and training set.
2. Check training set for missing values.
3. Delete columns with more than 40% missing records. These include the variables Alley (1369 empty), FireplaceQu (690), Fence (1179), PoolQC (1453) and MiscFeature (1406).
4. The other variables with missing values identified in step 2 are:
• LotFrontage, Alley, MasVnrType, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, GarageType, GarageCond.
5. For columns with very few missing values, we choose one of 3 options:
• For categorical values, create a new level “unk” to indicate missing data.
• For columns (numeric or categorical) where the data falls overwhelmingly in one category, impute missing values with that dominant value.
• For numeric data, mark missing values with -1.
• NOTE: we will apply these rules on the test set as well, for consistency, irrespective of whether it has missing values or not.
6. Repeat steps 2-5 for test set columns, since there may be empty cells for columns in test set which did not show up in the training set.  The variables we now identify include:
• MSZoning, Utilities, Exterior1st, Exterior2nd, MasVnrArea,
• BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF.
• BsmtFullBath, BsmtHalfBath, KitchenQual,
• GarageYrBlt, GarageQual, GarageFinish, GarageCars, GarageArea,
• SaleType.
7. We write the modified training and test sets to Excel so we can apply our models on the corrected data. (This is especially helpful in real life if you need to fine-tune your model over a few days.)
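The missing-value rules above can be sketched in a few lines of R. The data frame below is a toy stand-in built for illustration – the actual column treatments are in the downloadable code files:

```r
# Toy stand-in for the training set; the real columns come from the Ames data
train <- data.frame(
  BsmtQual    = c("Gd", NA, "TA", "Gd"),   # categorical column with a missing value
  LotFrontage = c(65, NA, 80, 70),         # numeric column with a missing value
  stringsAsFactors = FALSE
)

# Option 1: categorical -> create a new level "unk"
train$BsmtQual[is.na(train$BsmtQual)] <- "unk"

# Option 3: numeric -> mark missing values with -1
train$LotFrontage[is.na(train$LotFrontage)] <- -1

# Option 2 (for columns dominated by one category): impute with the most
# frequent value; intended for categorical columns
impute_mode <- function(x) {
  mode_val <- names(sort(table(x), decreasing = TRUE))[1]
  x[is.na(x)] <- mode_val
  x
}
```

The same three helpers are then applied to the test set too, keeping the treatment consistent across both files.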

# Feature Selection:

The easiest way to check if a relationship exists is to use statistical functions: chi-square, ANOVA or correlation.

Our target variable is “SalePrice” (a numeric value), so we will test all the other predictor variables against it to see whether a relation exists and how strongly it affects SalePrice.
For categorical predictors we will use the chi-square test, whereas for numeric predictors we will use correlation.

Our dataset has 74 predictive factors (excluding SalePrice and Id), so we run a for loop to do a rough check. For the chi-square tests, a relation exists only if the p-value < 0.05.
If the predictor column is of type integer/numeric we apply correlation; if the column is of type “character”, we apply the chi-square test.

We also add a column to identify variables of interest using the code below:

• If correlation value falls below -0.75 (high negative correlation) or above 0.75 (high positive correlation) then we mark the variable as “match”.
• If p-val from chisquare test falls below 0.05 then we mark it as “match”.
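A rough version of this screening loop is sketched below. The toy data frame, the injected relationship and the binning of SalePrice for the chi-square test are all illustrative assumptions, not the exact code from the project files:

```r
set.seed(42)
# Toy stand-in for the training data, with one numeric and one categorical predictor
df <- data.frame(
  GrLivArea    = rnorm(100, 1500, 400),
  Neighborhood = sample(c("NAmes", "OldTown", "CollgCr"), 100, replace = TRUE),
  stringsAsFactors = FALSE
)
# Inject a strong linear relationship so the numeric check has something to find
df$SalePrice <- 50000 + 200 * df$GrLivArea + rnorm(100, 0, 30000)

result <- data.frame()
for (colname in setdiff(names(df), c("SalePrice", "Id"))) {
  if (is.numeric(df[[colname]])) {
    r     <- cor(df[[colname]], df$SalePrice)
    match <- ifelse(abs(r) > 0.75, "match", "")
    result <- rbind(result, data.frame(var = colname, stat = r, match = match))
  } else {
    # chisq.test needs two factors, so SalePrice is binned here (an assumption)
    p     <- chisq.test(table(df[[colname]], cut(df$SalePrice, 4)))$p.value
    match <- ifelse(p < 0.05, "match", "")
    result <- rbind(result, data.frame(var = colname, stat = p, match = match))
  }
}
```

Printing `result` shows one row per predictor, with the “match” column flagging the variables worth keeping.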

Using this quick-and-dirty approach, we quickly identify 19 variables of interest. These include Neighborhood, Building Type (single family, townhome, etc.), YearBuilt, Year Remodeled, Basement type (finished/unfinished), house area for first floor / second floor / garage, number of bathrooms (both full and half), number of cars the garage can accommodate, sale type and sale condition.

NOTE: it is always a good idea to visualize the relationships graphically, as the above approach may miss predictors with a non-linear relationship.

Median House SalePrice by Neighborhood

# Apply Algorithms:

We apply the following machine learning algorithms:

1. Linear Regression Model
2. Classification Tree Model
3. Neural Network Model
4. Random Forest Model
5. Generalized Linear Model (GLM)

We follow the same steps for all 5 models:
(Note, code and functions shown only for Linear Regression Model. Detailed functions and variables used for the other models are available in the R program files.)

1. Use the training set to create a formula.
2. Apply formula to predict values for validation set.
3. Check the deviation from true homeprices in terms of both median \$-value and % error.
4. Apply formula on test set and check rank/score on leaderboard.
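For the linear regression model, the four steps above might look like the sketch below. The toy data, the 80/20 split and the two-variable formula are placeholders; the full formula with all selected features is in the R program files:

```r
set.seed(7)
# Toy stand-in for the cleaned Ames data
homes <- data.frame(
  SalePrice  = 0,
  GrLivArea  = rnorm(200, 1500, 300),
  GarageCars = sample(1:3, 200, replace = TRUE)
)
homes$SalePrice <- 50000 + 100 * homes$GrLivArea +
  15000 * homes$GarageCars + rnorm(200, 0, 10000)

# 1. Split into training and validation sets, then fit the formula
idx <- sample(nrow(homes), 0.8 * nrow(homes))
fit <- lm(SalePrice ~ GrLivArea + GarageCars, data = homes[idx, ])

# 2. Predict values for the validation set
pred  <- predict(fit, newdata = homes[-idx, ])
truth <- homes$SalePrice[-idx]

# 3. Check deviation in both median $-value and % error
median_dollar_err <- median(abs(pred - truth))
median_pct_err    <- median(abs(pred - truth) / truth) * 100
```

Step 4 then simply reruns `predict()` on the Kaggle test set and writes the submission file.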

The error rates for all 5 models are given in the table below:

home price table showing model error %

# Combining Algorithms to Improve Accuracy:

There are many scientific papers showing that combining answers from multiple models greatly improves accuracy. A simple but excellent explanation (with respect to Kaggle) by the founder of MLwave.com, a past Kaggle winner, is provided here. Basically, combining results from unrelated models can improve accuracy even if the individual models are not highly accurate on their own.

Similar to the explanation provided in the link, we calculate the arithmetic mean to average the results from all 5 models.
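The averaging step itself is a one-liner once the predictions sit side by side. The column names and values below are placeholders for the five model outputs:

```r
# Placeholder predictions from the 5 models for 3 test homes
preds <- data.frame(
  lm_pred   = c(200000, 150000, 310000),
  tree_pred = c(195000, 160000, 300000),
  nnet_pred = c(210000, 148000, 305000),
  rf_pred   = c(205000, 152000, 315000),
  glm_pred  = c(198000, 151000, 308000)
)

# Arithmetic mean across models, row by row
preds$ensemble <- rowMeans(preds)

# A geometric mean is an alternative when errors are multiplicative
preds$ensemble_geo <- apply(preds[, 1:5], 1, function(x) exp(mean(log(x))))
```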

# Summary:

We learnt how to clean and process a real-life dataset, select features of interest (and measure their impact on the target variable) and apply 5 different machine learning algorithms.

The code for this tutorial is available here on the Projects page, under month of Feb.

Please take a look and feel free to comment with your own feedback or models.

Hello All,

First of all, Happy New Year! Wishing you all a fantastic year in 2017 and hope you achieve all your goals for this year, and much more! 🙂

Most people’s New Year resolutions are related to health, whether it is going to the gym, eating healthy, walking more, reducing that stubborn belly fat or something similar. Since I bought a Fitbit Charge2 fitness tracker late last year, I thought it would be an interesting idea to base this month’s project on its data.
The entire codebase, images and datafiles are available at this link on a new Projects Page.

# Project Overview:

The project consists of 3 parts:

1. Scraping the Fitbit site: for “sleep quality” data. If you log in to the Fitbit site, it does allow export of exercise, sleep duration and some other data. However, crucial data like heartrate during activities, number of movements during the night, duration of restless sleep, etc. are completely missing! I realize not everyone has a Fitbit, so I’ve added some datafiles for you to experiment with. Since the scraper works with my login credentials, you can reuse the same logic to scrape other sites in a similar fashion. (Similar to the API programming explained in these posts on the Twitter and Yelp APIs.)
2. Aggregating downloaded data: we also download the data freely available on the website itself and then aggregate everything together, selecting only the data we want. This step is important because in the real world, data is rarely found in a single repository. Data cleansing, derived variables and other processing steps happen in this section.
3. Hypothesis testing: in this part, we try to understand what factors affect sleep quality. Does it depend on movements during the night? Is there better sleep on weekend nights? Does exercising more improve sleep quality?

# Section 1:

Scraping the Fitbit site was made extremely easy thanks to the package “fitbitScraper”. In our program file “fitbit_scraper.R”, we extract sleep related data for the month of Nov and Dec 2016.
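The core of fitbit_scraper.R is just a login followed by a query for the date window. The sketch below uses the fitbitScraper package’s documented functions; the e-mail, password and output filename are placeholders you would replace with your own, and the exact structure of the returned object is worth inspecting before saving:

```r
library(fitbitScraper)

# Log in with your own Fitbit credentials (placeholders shown here)
cookie <- login(email = "you@example.com", password = "your_password")

# Pull sleep-related data for Nov-Dec 2016
sleep_data <- get_sleep_data(cookie,
                             start_date = "2016-11-01",
                             end_date   = "2016-12-31")

# Inspect the returned object and save the piece you need
str(sleep_data)
```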

sleep_datafile

# Section 2:

We combine the data from the web scraper, heartrate and exercise datafiles. We now have data for 2 months regarding the following variables:

• sleep duration / start/ end time, sleep quality
• number of movements during the night, number of times awake, duration of both.
• Calories_burnt/ day, number of minutes performing light/ moderate/ heavy exercise,
• weekday, date , month.

Fitbit dataset

final datafile Fitbit tracker

# Section 3:

Using the above data, we use hypothesis testing methods (anova, correlation and chi-square testing ) to understand patterns in our data.
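Each of the three tests is a one-line call in R. The toy data frame below mimics the tracker variables listed in Section 2, so the snippet runs without the actual Fitbit files; the bucket boundaries for the chi-square test are an assumption for illustration:

```r
set.seed(1)
# Toy stand-in for the aggregated tracker data; column names mirror the variables above
fitbit <- data.frame(
  sleep_quality = rnorm(60, 80, 10),
  num_movements = rpois(60, 15),
  weekday       = sample(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                         60, replace = TRUE),
  daily_steps   = round(runif(60, 2000, 12000))
)

# Correlation: do night movements track sleep quality?
ct <- cor.test(fitbit$num_movements, fitbit$sleep_quality)

# ANOVA: does sleep quality differ by day of the week?
av <- summary(aov(sleep_quality ~ weekday, data = fitbit))

# Chi-square: step-count bucket vs above/below-median sleep quality
steps_bucket <- cut(fitbit$daily_steps, breaks = c(0, 4000, 7000, Inf))
good_sleep   <- fitbit$sleep_quality > median(fitbit$sleep_quality)
chi <- chisq.test(table(steps_bucket, good_sleep))
```

On the real dataset, the p-values from these three calls are what drive the conclusions listed below.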

Once you run the code, you will observe the following results:

1. The number of times awake increases when daily steps are between 4000-7000.
2. Weekends do NOT equate to better sleep, even though sleep duration is higher.
3. Sleep quality is WORST when the number of movements during the night is <10. This may seem counter-intuitive, but I know from personal experience that on the days when I am stressed out, I sleep like a robot in one position throughout the night. The data seems to support this theory as well. 🙂
4. The number of calories burnt is highest during weekends (unsurprising), followed by Tuesday.

Apart from the statistical tests, we also use data visualizations to double-check our analysis. Some plots are given below:

correlation diagram

steps versus sleep_quality

anova

diagram to view relationships between variables

Hello All,

Just to notify that the code for monthly projects has been uploaded to the “Projects Page”.

This month’s code focuses on text analytics and includes code for:

1. Identifying string patterns and word associations.
2. string searches and string manipulations.
3. Text processing and cleaning (remove emojis, punctuation marks, etc)
4. weighted ranking

word association

1. Text_analysis code: a detailed explanation is given under the link.
2. Code – pwd strength: an explanation is given under this blog post.

Happy Coding! 🙂

In this post we will learn how to apply our data science skills to solve a business problem – namely, why passwords get stolen or hijacked.
This post is inspired by a blog entry on Data Science Central, where the solution was coded in Python. (Our analysis will use R programming and extend the original idea.)

In this tutorial, we will explore the following questions:

1. What are the most common patterns found in passwords?
2. How many passwords are banking-grade “strong” combinations (containing special characters, length > 8)?
3. How many passwords make excessive use of repetitive characters, like “1111”, “007”, “aaabbbccc” or similar?

Remember, this is a “real-world” dataset and this type of list is often used to create password dictionaries. You can also use it to develop your own password strength checker.

Overall, this tutorial will cover the following topics:

1. Basic string functions: string length, string search, etc.
2. Data visualization using pie charts and histograms.
3. Color-coded HTML tables (similar to Excel) – a great feature if you plan to create Shiny webapps with tables.
4. Weighted ranking.

So let’s get started:

## What makes a “Strong” password?

First let us take a look at the minimum requirements of  an ideal password:

1. Minimum 8 characters in length.
2. Contains 3 out of 4 of the following items:
• Uppercase Letters
• Lowercase Letters
• Numbers
• Symbols
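These two rules translate directly into a small checker function – a sketch written for this post, not the scoring code from the zip file:

```r
# Returns TRUE if a password meets the minimum requirements above:
# at least 8 characters and at least 3 of the 4 character classes
is_strong <- function(pwd) {
  classes <- c(
    upper  = grepl("[A-Z]", pwd),
    lower  = grepl("[a-z]", pwd),
    number = grepl("[0-9]", pwd),
    symbol = grepl("[^A-Za-z0-9]", pwd)
  )
  nchar(pwd) >= 8 && sum(classes) >= 3
}
```

For example, is_strong("Passw0rd!") returns TRUE, while is_strong("password") fails the character-class rule and is_strong("Ab1!") fails the length rule.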

## Analysis Procedure:

1. Load the dataset (fread is from the data.table package):

TFscores = data.frame(fread("C:/anu/ja/dec2016/passwords_data.txt", stringsAsFactors = FALSE, sep = '\n', skip = 16))

2. Calculate the length of each password (str_length is from the stringr package):

TFscores\$len = str_length(TFscores\$password)

3. Plot histogram to see frequency distribution of password lengths. Note, we use a custom for-loop to generate labels for the histogram.

hist(TFscores\$len, col = "blue", ylim = c(0, 150000),
main = "Frequency Distribution – password length",
xlab = "Password Length", ylab = "Count / Frequency", labels = lendf\$labelstr)

4. a. Calculate the number of digits in each password:

TFscores\$strmatch = gsub(pattern = "[[:digit:]]", replacement = "", TFscores\$password)

TFscores\$numberlen = TFscores\$len - str_length(TFscores\$strmatch)

b. Similarly calculate number of characters from other character classes:

• Upper case alphabets
• Lower case alphabets
• Special characters – ! " # % & ' ( ) * + , - . / : ;

5. Assign 1 point as password strength “rank” for every character class present in the password.  As mentioned earlier, an ideal password should have at least 3 character classes.
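The four per-class ranks summed in the next line can be computed with one grepl indicator per character class. The snippet below is a sketch on a toy sample, consistent with the urank/lrank/nrank/srank naming but not necessarily identical to the code in the program files:

```r
# Toy sample; the real TFscores is built from passwords_data.txt
TFscores <- data.frame(password = c("abc123", "PASSWORD", "Pa55w*rd"),
                       stringsAsFactors = FALSE)

# One point for each character class present in the password
TFscores$urank <- as.integer(grepl("[A-Z]", TFscores$password))         # uppercase
TFscores$lrank <- as.integer(grepl("[a-z]", TFscores$password))         # lowercase
TFscores$nrank <- as.integer(grepl("[0-9]", TFscores$password))         # digits
TFscores$srank <- as.integer(grepl("[^A-Za-z0-9]", TFscores$password))  # symbols
```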

TFscores\$rank = TFscores\$urank + TFscores\$lrank + TFscores\$nrank + TFscores\$srank

Let us take a look to see how the passwords in our list stack up:

pie(piedfchar\$Freq, labels = labelarrchar, col = rainbow(9), main = "no. of character classes in password")

6. Count the number of unique characters in each password:

Note, this function is resource intensive and takes a couple of hours to complete due to the size of the dataset.
To reduce the time/effort, the calculated values are added to the zipfolder, titled “pwd_scores.csv”.

7. Assign a password strength category based on rank and length:

TFscores\$pwdclass = "weak"   # default

TFscores\$pwdclass[TFscores\$len < 5 | TFscores\$rank == 1] = "very weak"

TFscores\$pwdclass[TFscores\$len >= 8 & TFscores\$rank >= 2] = "medium"

TFscores\$pwdclass[TFscores\$len >= 12] = "strong"

TFscores\$pwdclass[TFscores\$len >= 12 & TFscores\$rank == 4] = "very strong"

Based on these criteria, we get the following frequency distribution:

We can derive the following insights from steps 5 and 6:

• 77.68% of passwords are weak or very weak!
• ~3% of passwords have fewer than 5 characters.
• ~72% of passwords use only 1 type of character class.
• 0.5% of passwords have 8+ characters, yet fewer than 30% of those characters are unique.
• ~0.9% of passwords have fewer than 4 unique characters.
• 72% of passwords contain only digits.

8. Let’s see if there are any patterns repeated in the passwords, like “12345”, “abcde”, “1111”, etc:

• 1.2% of passwords contain pattern “12345”.
• 0.01% of passwords contain pattern “abcde”.
• 0.3% of passwords contain pattern “1111”.
• 0.02% of passwords contain pattern “1234”.
• 15% of passwords contain year notations like “198*”, “197*”, “199*”, “200*”. The sample shown alongside clearly shows that many people use important years from their life in their passwords. (Logically true!)
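All of these pattern percentages come from simple grepl matches over the password column. A quick sketch on a toy vector:

```r
# Toy password vector; the real analysis runs over TFscores$password
passwords <- c("12345abc", "qwerty", "john1985", "1111pass", "secret")

# Share of passwords containing a given substring or regex pattern
pct_with <- function(pattern, pwds) {
  100 * mean(grepl(pattern, pwds))
}

pct_with("12345", passwords)                 # fixed sequence
pct_with("1111", passwords)                  # repeated character run
pct_with("19[0-9]{2}|20[0-9]{2}", passwords) # year notations like 198*, 199*, 200*
```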

9. View the password strength visually. We use the “condformat” package to create an HTML table that is easy to assimilate:

condformat(testsampledf) + rule_fill_discrete(password, expression = rank < 2, colours = c("TRUE" = "red")) +
rule_fill_discrete(len, expression = (len >= 12), colours = c("TRUE" = "gold")) +
rule_fill_discrete(pwdclass, expression = (rank > 2 & len >= 8), colours = c("TRUE" = "green"))

Hello All,

The theme for August is API programming for social media platforms.

twitter API code with R/ Python

For the August project, I’ve concentrated on working with Twitter API, using both Python and R programming. The code can be downloaded from the Projects Page or forked from my Github account.

## Working With APIs:

Before we learn what the code does, please note that you will first need to request Twitter developer tokens (values for consumer_key, consumer_secret, access_key and access_secret) to authorize your account before extracting data from the Twitter platform. If you do not have these tokens yet, you can easily learn how to request them using the excellent documentation on the Twitter Developer website. Once you have the tokens, please modify the variables at the beginning of the program with your own values.

Second, you will need to install the appropriate Twitter packages for running the programs in Python and R. These packages make it easy to extract data from Twitter, since they have pre-written functions for various tasks like Twitter authorization, looking up usernames, posting to Twitter, investigating follower counts, extracting profile data in JSON format, and much more.

“Tweepy” is the package for Python and “twitteR” for R programs, so please install them locally.
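In the R programs, the tokens from the previous step feed straight into twitteR’s authorization call. The token values here are placeholders for your own keys:

```r
library(twitteR)

# Replace these placeholders with your own developer tokens
consumer_key    <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
access_key      <- "YOUR_ACCESS_KEY"
access_secret   <- "YOUR_ACCESS_SECRET"

# Authorize this R session against the Twitter API
setup_twitter_oauth(consumer_key, consumer_secret, access_key, access_secret)
```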

Although Twitter provides a great way to view your own follower growth, there is no way to download or track this data locally. The Python program (twitter_follower_ct_ver4.py) added in this month’s code does just that – it extracts the follower count and stores it to a CSV file. This makes it possible to track the (historical) growth or decline of a Twitter follower count over a period of time, starting from today.

With this program you can monitor your own account and other Twitter handles as well! Of course, you can’t go back in time to view older counts, but hey, at least you have started. Plus, you can manually add values for your own accounts.

(Technically, for Twitter handles you do not own, you could get the join date of every follower and then deduce when they possibly followed the account. A post for another day, though!)

Follower count is great, but you also want to know the detailed profile of your followers and other interesting twitter accounts. Who are these followers? Where are they located?

There are 2 R programs in the August Project which help you gather this information.

The first (followers_v2.R) extracts a list of all follower ids for a specific Twitter account and stores it to a file. The Twitter API has a rate limit of 5000 ids for such queries, so this program uses cursor pagination to pull out the information in chunks of 5000 per iteration. Think of the list of follower ids like the pages of a book – some books are thicker, so you have to turn more pages! Similarly, if a Twitter account has very few followers, the program completes in 1-2 iterations!

The program example works on the twitter account “@phillydotcom” which has >180k followers.  The cursor iteration process itself is implemented using a simple “while” loop.
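The cursor loop itself is independent of the API: keep requesting pages until the next-cursor value signals the end. The sketch below mocks the page-fetching step with a hypothetical get_follower_page() helper so the loop logic is runnable without API access – in followers_v2.R, the real call goes to Twitter instead:

```r
# Mocked page fetch: returns up to 5000 ids per call plus a next_cursor,
# standing in for the real Twitter followers/ids request
all_ids <- 1:12000
get_follower_page <- function(cursor) {
  start <- cursor
  end   <- min(cursor + 4999, length(all_ids))
  list(ids         = all_ids[start:end],
       next_cursor = if (end == length(all_ids)) 0 else end + 1)
}

# Cursor pagination: loop until the next_cursor signals the last page (0 here,
# mirroring the Twitter convention)
follower_ids <- integer(0)
cursor <- 1
while (cursor != 0) {
  page         <- get_follower_page(cursor)
  follower_ids <- c(follower_ids, page$ids)
  cursor       <- page$next_cursor
}
```

With 12000 mock ids, the loop runs exactly three iterations – just as a 180k-follower account would take around 37 pages of 5000.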