This is an analysis of the Kaggle 2018 survey dataset. In my analysis I am trying to understand the similarities and differences between men and women users from US and India, since these are the two biggest segments of the respondent population. The number of respondents who chose something other than Male/Female is quite low, so I excluded that subset as well.
The complete code is available as a kernel on the Kaggle website. If you like this post, do login and upvote! 🙂 This post is a slightly truncated version of the Kernel available on Kaggle.
You can also use the link to go to the dataset and perform your own explorations. Please do feel free to use my code as a starter script.
Couple of disclaimers:
NOT intending to say one country is better than the other. Instead just trying to explore the profiles based on what this specific dataset shows.
It is very much possible that there is a response bias and that the differences are due to the nature of the people who are on the Kaggle site, and who answered the survey.
With that out of the way, let us get started. If you like the analysis, please feel to fork the script and extend it further. And do not forget to Upvote! 🙂
Some questions that the analysis tries to answer are given below:
a. What is the respondent demographic profile for users from the 2 countries – men vs women, age bucket?
b. What is their educational background and major?
c. What are the job roles and coding experience?
d. What is the most popular language of use?
e. What is the programming language people recommend for an aspiring data scientist?
I deliberately did not compare salary because:
a. 16% of the population did not answer and 20% chose “do not wish to disclose”.
b. the lowest bracket is 0-10k USD, so the max limit of 10k translates to about INR 7,00,000 (7 lakhs) which is quite high. A software engineer, entering the IT industry probably makes around 4-5 lakhs per annum, and they earn much more than others in India. So comparing against US salaries feels like comparing apples to oranges. [Assuming an exchange rate of 1 USD = 70 INR].
Calculations / Data Wrangling:
- I’ve aggregated the age buckets into lesser number of segments, because the number of respondents tapers off in the higher age groups. They are quite self-explanatory, as you will see from the ifesle clause below:
if-else codeR12345# age - groupings:ctry_cmp$age_bucket <- ifelse( ctry_cmp$Q2 %in% c("18-21", "22-24"), "age1_[18-24]",ifelse(ctry_cmp$Q2 %in% c("25-29", "35-39"), "age2_[25-39]",ifelse(ctry_cmp$Q2 %in% c("35-39", "40-44"), "age3_[35-44]",ifelse(ctry_cmp$Q2 %in% c("45-49", "50-54"), "age4_[45-54]", "age5_[55+]"))))
- Similarly, cleaned up the special characters in the educational qualifications. Also added a tag to the empty values in the following variables – jobrole (Q6), exp_group (Q8), proj(Q40), years in coding (Q24), major (Q5).
- I also created some frequency using the sqldf() function. You could use the summarise() from the dplyr package. It really is a matter of choice.
As seen in the chart below, many more males (~80%) responded to the survey than women (~20%).
Among women, almost 2/3rd are from US, and only ~38% from India.
The men are almost split 50/50 among US and India.
There is a definite trend showing that the Indian respondents are quite young, with both men and women showing >54% in the youngest ge bucket (18-24), and another ~28% falling in the (25-29) category. So almost 82% of the population is under 30 years of age.
Among US respondents, the women seem a bit more younger, with 68% under 30 years, compared to ~57% men of women. However, both men and women had a larger segment in the 55+ category (~20% for women, and 25% for men. Compare it with Indians, where the 55+ group is barely 12%.
Overall, this is an educated lot, and most had a bachelors degree or more.
US women were the most educated of the lot, with a whopping 55% with masters degrees and 16% with doctorates.
Among Indians, women had higher levels of education – 10% with Ph.D, 43% masters degree, compared with men where ~34% had a masters degree and only 4% had a doctorate.
Among US men, ~47% had a masters degree, and 19% had doctorates.
This is interesting because Indians are younger compared to US respondents, so many more Indians seem to be pursuing advanced degrees.
Among Indians, the majority of respondents added Computer Science as their major.
Maybe because :
(a) Indians have to declare a major when they join, and the choice of majors is not as wide as in the US. ,
- Parents tend to force kids towards majors which are known to translate into a decent paying job, which is engineering or medicine.
- A case of response bias? The survey came from Kaggle, so not sure if non-coding majors would have even bothered to respond.Among US respondents, the major is also computer science, but followed by maths & stats for women.
For men, the second category was a tie between non-compsci Engg , followed by maths&stats.
Among Indians, the biggest segment are predominantly students (30%). Among Indian men, the second category is software engineer.
Among US women, the biggest category was also “student” but followed quite closely by “data scientist”. Among US men , the biggest category was “data scientist” followed by “student”.
Note, “other” category is something we created now, so not considering those. They are not the biggest category for any sub-group anyway.
CEOs, not surprisingly are male, 45+ years from the US, with a masters degree.
Among Indians, most answered <1 year of coding experience , which correlates well with the fact that most of them are under 30, with a huge population of students.
Among US respondents, the split is even between 1-2 years of coding and 3-5 years of coding.
Men seem to have a bit more coding experience than women, again explained by the fact that women were slightly younger overall, compared to US men.
Most popular programming language:
Python is the most popular language, discounting the number of people who did not answer. However, among US women, R is also popular (16% favoring it).
I found this quite interesting because I’ve always used R at work, at multiple big-name employers. (Nasdaq, Td bank, etc.) Plus, a lot of companies that used SAS seem to have found it easier to move code to R. Again this is personal opinion.
Maybe it is also because many colleges teach Python as a starting programming language?
- Overall, Indians tended to be younger with more people pursuing masters degrees.
- US respondents tended to older with stronger coding experience, and many more are practicing data scientists.
This seems like a great opportunity for Kaggle, if they could match the Indian students with the US data scientists, in a sort of mentor-matching service. 🙂