Journey of Analytics

Deep dive into data analysis tools, theory and projects

Data Science Professionals: India vs US | Men vs Women

Introduction

This is an analysis of the Kaggle 2018 survey dataset. I try to understand the similarities and differences between men and women users from the US and India, since these are the two biggest segments of the respondent population. The number of respondents who chose something other than Male/Female is quite low, so I excluded that subset as well.

The complete code is available as a kernel on the Kaggle website. If you like this post, do log in and upvote! 🙂 This post is a slightly truncated version of the kernel available on Kaggle.

You can also use the link to go to the dataset and perform your own explorations. Please do feel free to use my code as a starter script.

 

Kaggle users – India vs US

A couple of disclaimers:

I am NOT intending to say one country is better than the other. I am just trying to explore the profiles based on what this specific dataset shows.
It is very much possible that there is a response bias, and that the differences are due to the nature of the people who are on the Kaggle site and who answered the survey.
With that out of the way, let us get started. If you like the analysis, please feel free to fork the script and extend it further. And do not forget to upvote! 🙂

Analysis

Some questions that the analysis tries to answer are given below:
a. What is the respondent demographic profile for users from the 2 countries – men vs women, age bucket?
b. What is their educational background and major?
c. What are the job roles and coding experience?
d. What is the most popular language of use?
e. What is the programming language people recommend for an aspiring data scientist?

I deliberately did not compare salary because:
a. 16% of the population did not answer and 20% chose “do not wish to disclose”.
b. The lowest bracket is 0-10k USD, so its upper limit of 10k translates to about INR 7,00,000 (7 lakhs), which is quite high. A software engineer entering the IT industry probably makes around 4-5 lakhs per annum, and they earn much more than others in India. So comparing against US salaries feels like comparing apples to oranges. [Assuming an exchange rate of 1 USD = 70 INR.]

Calculations / Data Wrangling:

  1. I’ve aggregated the age buckets into fewer segments, because the number of respondents tapers off in the higher age groups. The groupings are quite self-explanatory, as the ifelse clauses in the kernel show.
  2. Similarly, I cleaned up the special characters in the educational qualifications, and added a tag to the empty values in the following variables: jobrole (Q6), exp_group (Q8), proj (Q40), years in coding (Q24), major (Q5).
  3. I also created some frequency tables using the sqldf() function. You could use summarise() from the dplyr package instead. It really is a matter of choice.
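The wrangling steps above can be sketched roughly as follows. The kernel itself is written in R (ifelse and sqldf); this is a plain-Python illustration only. The raw age labels are my assumption about how the survey codes age, the grouped labels other than 18-24, 25-29 and 55+ are hypothetical, and the "not_answered" tag and the toy respondents are made up for the example.

```python
from collections import Counter

# Assumed raw age labels mapped to coarser groups.
# Only 18-24, 25-29 and 55+ are named in the post; the rest are hypothetical.
AGE_GROUPS = {
    "18-21": "18-24", "22-24": "18-24",
    "25-29": "25-29",
    "30-34": "30-34",
    "35-39": "35-44", "40-44": "35-44",
    "45-49": "45-54", "50-54": "45-54",
    "55-59": "55+", "60-69": "55+", "70-79": "55+", "80+": "55+",
}

def tag_empty(value, tag="not_answered"):
    """Replace a missing or blank answer with an explicit tag."""
    if value is None or not value.strip():
        return tag
    return value.strip()

def age_group(raw_label):
    """Collapse a raw age label into its aggregated bucket."""
    return AGE_GROUPS.get(tag_empty(raw_label), "not_answered")

# Toy respondents, invented purely for illustration.
respondents = [
    {"country": "India", "gender": "Female", "age": "18-21"},
    {"country": "India", "gender": "Female", "age": "22-24"},
    {"country": "US", "gender": "Male", "age": "60-69"},
    {"country": "US", "gender": "Male", "age": ""},
]

# Frequency table, similar in spirit to a GROUP BY query or summarise().
counts = Counter(
    (r["country"], r["gender"], age_group(r["age"])) for r in respondents
)
for key, n in sorted(counts.items()):
    print(key, n)
```

Tagging blanks explicitly keeps non-responses visible in every frequency table instead of silently dropping them.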

Observations

Gender composition:

As seen in the chart below, many more men (~80%) responded to the survey than women (~20%).
Among women, almost two-thirds are from the US and ~38% are from India.
The men are split almost 50/50 between the US and India.

 

Age composition:

There is a definite trend showing that the Indian respondents are quite young, with both men and women showing >54% in the youngest age bucket (18-24), and another ~28% falling in the 25-29 category. So almost 82% of the population is under 30 years of age.
Among US respondents, the women seem a bit younger, with 68% under 30 years, compared to ~57% of men. However, both men and women had a larger segment in the 55+ category (~20% for women and 25% for men). Compare this with Indians, where the 55+ group is barely 12%.

 

Educational background:

Overall, this is an educated lot, and most had a bachelor's degree or more.
US women were the most educated of the lot, with a whopping 55% holding master's degrees and 16% holding doctorates.
Among Indians, women had higher levels of education: 10% with a Ph.D. and 43% with a master's degree, compared with men, where ~34% had a master's degree and only 4% had a doctorate.
Among US men, ~47% had a master's degree, and 19% had doctorates.
This is interesting because Indians are younger compared to US respondents, so many more Indians seem to be pursuing advanced degrees.

Undergrad major:

Among Indians, the majority of respondents listed Computer Science as their major.
Maybe because:
(a) Indians have to declare a major when they join, and the choice of majors is not as wide as in the US.
(b) Parents tend to force kids towards majors which are known to translate into a decent-paying job, which means engineering or medicine.
(c) A case of response bias? The survey came from Kaggle, so I am not sure non-coding majors would even have bothered to respond.

Among US respondents, the top major is also computer science, followed by maths & stats for women.
For men, second place was close to a tie: non-compsci engineering, followed by maths & stats.

 

Job Roles:

Among Indians, the biggest segment is students (30%). Among Indian men, the second category is software engineer.
Among US women, the biggest category was also “student”, followed quite closely by “data scientist”. Among US men, the biggest category was “data scientist”, followed by “student”.
Note that the “other” category is something we created ourselves, so we are not considering it. It is not the biggest category for any sub-group anyway.
CEOs, not surprisingly, are male, 45+ years old, from the US, with a master's degree.

 

Coding Experience:

Among Indians, most answered <1 year of coding experience, which correlates well with the fact that most of them are under 30, with a huge population of students.
Among US respondents, the split is even between 1-2 years of coding and 3-5 years of coding.
Men seem to have a bit more coding experience than women, again explained by the fact that US women were slightly younger overall than US men.

 

Most popular programming language:

Python is the most popular language, discounting the number of people who did not answer. However, among US women, R is also popular (16% favoring it).

I found this quite interesting because I’ve always used R at work, at multiple big-name employers (Nasdaq, TD Bank, etc.). Plus, a lot of companies that used SAS seem to have found it easier to move code to R. Again, this is a personal opinion.
Maybe it is also because many colleges teach Python as a starting programming language?

 

Conclusions:

  1. Overall, Indians tended to be younger, with more people pursuing master's degrees.
  2. US respondents tended to be older with stronger coding experience, and many more are practicing data scientists.
    This seems like a great opportunity for Kaggle, if they could match the Indian students with the US data scientists, in a sort of mentor-matching service. 🙂

Predictive analytics using Ames Housing Data (Kaggle Starter Script)

Hello All,

In today’s tutorial we will apply 5 different machine learning algorithms to predict house sale prices using the Ames Housing Data.

This dataset is also available as an active Kaggle competition for the next month, so you can use this as a Kaggle starter script (in R). Use the output from the models to generate submission files for the Kaggle platform and view how well you fare on the public leaderboard. This is also a perfect simulation of a real-world analytics problem, where the final results are validated by a customer, client or third party.

This tutorial is divided into 4 parts:

  • Data Load & Cleanup
  • Feature Selection
  • Apply algorithms
  • Using arithmetic / geometric means from multiple algorithms to increase accuracy

 

Problem Statement:

Before we begin, let us understand the problem statement :

Predict Home SalePrice for the Test Data Set with the lowest possible error.

The Kaggle competition evaluates the leaderboard score based on the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. Lower is better, so a perfect model would score 0.
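As a rough illustration of this metric, here is a minimal sketch in plain Python (the tutorial scripts themselves are in R). The function name rmsle and the toy prices are my own, and the natural logarithm is an assumption about how the logs are taken.

```python
import math

def rmsle(actual, predicted):
    """Root-mean-squared error between log(predicted) and log(actual)."""
    squared_log_errors = [
        (math.log(p) - math.log(a)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Toy sale prices, invented for illustration.
actual_prices = [120000, 250000, 410000]
perfect = rmsle(actual_prices, actual_prices)               # identical predictions
off_by_10pct = rmsle(actual_prices, [p * 1.10 for p in actual_prices])

print(perfect)
print(off_by_10pct)
```

Because every house in the second call is off by the same 10%, each one contributes the same log error regardless of its price.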

The evaluation uses log values so that the model is penalized for incorrectly predicting both expensive houses AND cheap houses. So the % deviation from the real value matters, not just the $ value of the error.
E.g.: if the predicted home price was $42,000 but the actual value was $37,000, then the $-value error is only $5,000, which doesn’t seem like a lot. However, the error % is ~13.51%.
In contrast, imagine a home with a real sale price of $389,411, which we predicted to be $410,000. The $-value difference is $20,589, yet the % error is only ~5.29%, which makes it the better prediction.
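The two examples above can be checked with a few lines of arithmetic. This is a plain-Python sketch (the tutorial code is in R). The helper name error_summary is hypothetical, and the percentage is taken relative to the actual price, which is my reading of the example.

```python
import math

def error_summary(actual, predicted):
    """Return the dollar error, the % error relative to the actual price,
    and the absolute difference of the natural logs."""
    dollar_error = abs(predicted - actual)
    pct_error = 100 * dollar_error / actual
    log_error = abs(math.log(predicted) - math.log(actual))
    return dollar_error, pct_error, log_error

cheap_home = error_summary(actual=37000, predicted=42000)
expensive_home = error_summary(actual=389411, predicted=410000)

for label, (dollars, pct, log_err) in [
    ("cheap home", cheap_home),
    ("expensive home", expensive_home),
]:
    print(f"{label}: ${dollars} off, {pct:.2f}% error, log error {log_err:.4f}")
```

The cheaper home has the smaller dollar error but the larger log error, which is the ordering the leaderboard metric rewards.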

In real-life too, you will frequently face situations where the sensitivity of predictions  is as important as value-accuracy.

As always, you can download the code and data files from the Projects page here, under Feb 2017.

 

Data Load & Cleanup:

In this step we perform the following tasks:

  1. Load the test and training set.
  2. Check training set for missing values.
  3. Delete columns with more than 40% missing records. These include the variables Alley (1369 empty), FireplaceQu (690), Fence (1179), PoolQC (1453), MiscFeature (1406).
  4. The other variables with missing values identified in step 2 are:
    • LotFrontage, MasVnrType, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, GarageType, GarageCond.
  5. For columns with very few missing values, we choose one of 3 options:
    • For categorical values, create a new level “unk” to indicate missing data.
    • For columns (numeric or categorical) where the data falls overwhelmingly in one category, fill missing values with that category.
    • For numeric data, mark missing values with -1.
    • NOTE: we will apply these rules on the test set also, for consistency, irrespective of whether there are missing values or not.
  6. Repeat steps 2-5 for test set columns, since there may be empty cells for columns in the test set which did not show up in the training set. The variables we now identify include:
    • MSZoning, Utilities, Exterior1st, Exterior2nd, MasVnrArea,
    • BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF,
    • BsmtFullBath, BsmtHalfBath, KitchenQual,
    • GarageYrBlt, GarageQual, GarageFinish, GarageCars, GarageArea,
    • SaleType.
  7. We write the modified training and test sets to Excel so we can apply our models on the corrected data. (This is especially helpful in real life if you need to fine-tune your model over a few days.)
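The cleanup rules above can be sketched like this. The actual scripts are in R; this is a plain-Python illustration on made-up rows, and the function names columns_to_drop and fill_missing are hypothetical. The "dominant category" option is left out to keep the sketch short.

```python
def is_missing(value):
    """Treat None and empty strings as missing."""
    return value is None or value == ""

def columns_to_drop(rows, columns, threshold=0.40):
    """Columns whose share of missing records exceeds the threshold."""
    return [
        c for c in columns
        if sum(is_missing(r.get(c)) for r in rows) / len(rows) > threshold
    ]

def fill_missing(rows, keep_columns, numeric_columns):
    """Keep only the chosen columns; fill numeric gaps with -1 and
    categorical gaps with the new level "unk"."""
    cleaned = []
    for r in rows:
        new_row = {}
        for c in keep_columns:
            value = r.get(c)
            if is_missing(value):
                value = -1 if c in numeric_columns else "unk"
            new_row[c] = value
        cleaned.append(new_row)
    return cleaned

# Toy rows, invented for illustration.
columns = ["LotFrontage", "PoolQC", "GarageType"]
numeric = {"LotFrontage"}
train = [
    {"LotFrontage": 65, "PoolQC": None, "GarageType": "Attchd"},
    {"LotFrontage": None, "PoolQC": None, "GarageType": None},
    {"LotFrontage": 80, "PoolQC": "Gd", "GarageType": "Detchd"},
]
test = [{"LotFrontage": 70, "PoolQC": "Ex", "GarageType": ""}]

dropped = columns_to_drop(train, columns)            # decided on the training set
kept = [c for c in columns if c not in dropped]
train_clean = fill_missing(train, kept, numeric)
test_clean = fill_missing(test, kept, numeric)       # same rules on the test set
print(dropped, train_clean, test_clean)
```

Deciding which columns to drop on the training set and then reusing that decision on the test set keeps both files in the same shape.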

 

Feature Selection:

The easiest way to check if a relationship exists is to use statistical functions: chi-square, ANOVA or correlation.

Our target variable is “SalePrice” (numeric value) . So we will test all the other predictor variables against this factor to see if any relation exists and how much it affects the SalePrice.
For categorical predictors, we will use chisquare test, whereas for numeric predictors, we will use correlation.

Our dataset has 74 predictive factors (excluding SalePrice and Id), so we run a for loop to do a rough check. We treat a relationship as present only if the p-value is < 0.05.
If the predictor column is of type integer/numeric, we apply correlation. If the column is of type “character”, we apply chi-square.

We also add a column to identify variables of interest, using the rules below:

  • If correlation value falls below -0.75 (high negative correlation) or above 0.75 (high positive correlation) then we mark the variable as “match”.
  • If p-val from chisquare test falls below 0.05 then we mark it as “match”.

Using this “quack” approach, we quickly identify 19 variables of interest. These include Neighborhood, Building Type (single family, townhome, etc.), YearBuilt, Year Remodeled, Basement type (finished / unfinished), house area for first floor / second floor / garage, number of bathrooms (both full / half), number of cars the garage can accommodate, sale type and sale condition.

NOTE, it is always a good idea to graphically visualize correlation functions as the above approach may miss out predictors with a non-linear relationship.
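The numeric half of this screening (the 0.75 correlation rule) might look like the sketch below. It is plain Python rather than the R used in the tutorial scripts, the column values are invented, and the helper names pearson and screen_numeric are hypothetical. The chi-square half for categorical columns is not shown.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def screen_numeric(numeric_columns, target, threshold=0.75):
    """Mark a column as "match" when its correlation with the target
    is below -threshold or above +threshold."""
    flags = {}
    for name, values in numeric_columns.items():
        r = pearson(values, target)
        flags[name] = (r, "match" if abs(r) > threshold else "")
    return flags

# Toy data, invented for illustration (prices in thousands).
sale_price = [100, 150, 200, 250, 300]
numeric_columns = {
    "GrLivArea": [900, 1200, 1500, 2000, 2400],
    "MoSold": [3, 1, 4, 1, 5],
}

flags = screen_numeric(numeric_columns, sale_price)
for name, (r, flag) in flags.items():
    print(f"{name}: r = {r:.2f} {flag}")
```

A fixed cutoff like this only captures linear relationships, which is why the note about plotting the data still applies.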

Median House SalePrice by Neighborhood

 

Apply Algorithms:

We apply the following machine learning algorithms:

  1. Linear Regression Model.
  2. Classification Tree Model.
  3. Neural Network Model.
  4. Random Forest algorithm model.
  5. Generalized Linear Model (GLM)

 

We follow the same steps for all 5 models:
(Note, code and functions shown only for Linear Regression Model. Detailed functions and variables used for the other models are available in the R program files.)

  1. Use the training set to create a formula.
  2. Apply formula to predict values for validation set.
  3. Check the deviation from true homeprices in terms of both median $-value and % error.
  4. Apply formula on test set and check rank/score on leaderboard.
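Step 3 above (checking the deviation on the validation set) could be sketched as follows. This is a plain-Python illustration with invented prices, not the R code from the tutorial files. The function name validation_report is hypothetical, and the % error is measured against the true price.

```python
from statistics import median

def validation_report(actual, predicted):
    """Median absolute $ error and median % error on a validation set."""
    abs_errors = [abs(p - a) for a, p in zip(actual, predicted)]
    pct_errors = [100 * err / a for err, a in zip(abs_errors, actual)]
    return {
        "median_abs_error": median(abs_errors),
        "median_pct_error": median(pct_errors),
    }

# Toy validation set, invented for illustration.
true_prices = [100000, 200000, 300000]
model_predictions = [110000, 190000, 330000]

report = validation_report(true_prices, model_predictions)
print(report)
```

Using the median rather than the mean keeps one badly mispredicted house from dominating the summary.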

 

The error rates for all 5 models are given in the table below:

home price table showing model error %

Combining Algorithms to Improve Accuracy:

There are many scientific papers which show that combining answers from multiple models can greatly improve accuracy. A simple but excellent explanation (with respect to Kaggle) by the MLwave.com founder and past Kaggle winner is provided here. Basically, combining results from unrelated models can improve accuracy, even if the individual models are NOT 100% accurate on their own.

Similar to the explanation provided in the link, we calculate the arithmetic mean to average the results from all 5 models .
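A minimal sketch of this averaging step is shown below, in plain Python rather than the R used in the tutorial scripts. The predictions are invented, the model keys are just labels, and the function name combine is hypothetical. The geometric mean is included because the tutorial outline mentions both arithmetic and geometric means.

```python
from statistics import mean, geometric_mean

def combine(predictions_by_model, how="arithmetic"):
    """Average the per-house predictions across all models."""
    if how == "arithmetic":
        combiner = mean
    elif how == "geometric":
        combiner = geometric_mean
    else:
        raise ValueError("how must be 'arithmetic' or 'geometric'")
    # zip(*...) regroups the lists so each tuple holds one house's predictions.
    return [combiner(house) for house in zip(*predictions_by_model.values())]

# Toy predictions for two houses, invented for illustration.
predictions_by_model = {
    "linear_regression": [200000, 150000],
    "tree": [210000, 140000],
    "neural_network": [190000, 160000],
    "random_forest": [205000, 155000],
    "glm": [195000, 145000],
}

arithmetic = combine(predictions_by_model, "arithmetic")
geometric = combine(predictions_by_model, "geometric")
print(arithmetic)
print(geometric)
```

The geometric mean is never larger than the arithmetic mean, so it pulls the combined price slightly towards the lower predictions.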

 

Summary:

We learnt how to clean and process a real-life dataset, select features of interest (and gauge their impact on the target variable), apply 5 different machine learning algorithms, and average their predictions.

The code for this tutorial is available here on the Projects page, under the month of Feb.

Please take a look and feel free to comment with your own feedback or models.

 
