Journey of Analytics

Deep dive into data analysis tools, theory and projects

Author: anu - Journey of Analytics Team

Top 10 Most Valuable Data Science Skills in 2020

The first month of the new decade is almost at an end. It’s also “job-hunting” time when students start looking for internships and employees think about switching roles and companies, in search of better salaries and opportunities. If you fall into one of these categories, then here are the Top 10 skills your resume absolutely needs to include, to get noticed by employers and land your dream job.

Data Science Skills for 2020

Methodology:

I looked at 200 job descriptions for jobs posted on LinkedIn in 7 major US/Canada cities: San Francisco, Seattle, Chicago, New York, Philadelphia, Atlanta, and Toronto. Let’s face it – LinkedIn is the go-to platform for job seekers and recruiters, so it made sense to focus there.

The job listings included many of the top global brands in tech (Microsoft, Amazon, etc.), product (Airbnb, Uber, Visa), consulting (Deloitte, Accenture), banking (JP Morgan, Capital One) and so on. I only considered jobs with the title “Data Scientist” or “Data Analyst”, with 150+ in the former category. It took a while, but doing this manually also allowed me to exclude repetitive postings, since some companies post the same role for multiple locations.

Ultimately, this allowed me to quickly identify patterns and repeated skills, which I am presenting in this blog post.

I’ve categorized the skills into 2 parts: Core and Advanced. Core skills are the absolute minimum you should have; recruiters and automated job application systems will simply disqualify you without them. Advanced skills are those “preferred” competencies that make you look more valuable as a candidate, so make sure to highlight them with examples on your resume. So, if you are trying to transition to a career in Data Science, I would highly recommend learning the core skills first, and then jumping into the others. Needless to say, everyone working in (or entering) this field needs to have a portfolio of projects.

Disclaimer – having all 10 skills does NOT guarantee a job, but it vastly improves your chances. You’ll still need to do some legwork to get considered, and my book “Data Science Jobs” can help you shorten this process. The book is also on SALE for $0.99 this weekend, Jan 25th to Jan 28th, at a 92% discount.

Core Skills:

Minimum qualifications for Data Scientist roles

[1] Programming (R/Python): This is a no-brainer: you need to be an expert in either R or Python. Some jobs will list SAS or other less common languages, but R or Python was a constant, mandatory requirement in every job I parsed.

I am not going to argue the merits of one over the other in this post, but I will emphasize that R is still very much an in-demand skill. Plus, for most entry-level roles, a candidate who knows only Python is not going to be considered more favorably than someone who knows only R, nor will the R-only candidate be declined. In fact, at my current and previous 2 roles, R was the preferred language of choice. If you’d like to know my true views on the R vs Python debate, read this post.

[2] SQL: Most colleges and bootcamps do not teach this, but it is inordinately valuable. You cannot find insights without data, and 99% of companies predominantly use SQL databases of some kind. Fancy stuff like MongoDB, NoSQL or Hadoop are excellent keywords to add to your bio, but SQL is the baseline. You don’t need stored procedures or admin-level expertise, but please learn the basics of SQL: pulling in data with filters and optimizing table joins. SQL is mandatory to thrive as a data scientist.
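
To make this concrete, here is a tiny, hedged illustration of the bread-and-butter pattern (filter, aggregate, group-by), run from R via the DBI and RSQLite packages. The table and numbers are made up for the example:

    library(DBI)

    # toy in-memory database with a made-up orders table
    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbWriteTable(con, "orders",
                 data.frame(id     = 1:4,
                            region = c("east", "west", "east", "west"),
                            amount = c(120, 80, 200, 150)))

    # pull data with a filter, an aggregate and a group-by
    dbGetQuery(con, "SELECT region, SUM(amount) AS total
                     FROM orders
                     WHERE amount > 100
                     GROUP BY region")

    dbDisconnect(con)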

[3] Basic math & stats: By this I mean basic high-school-level material, like confidence intervals and profit-and-loss calculations. If you cannot distinguish between mean and median, then no self-respecting manager will trust your numbers, or believe your insights have excluded those pesky outliers. Profit and incremental benefit in dollars are other useful formulas to know, so brush up on your business math.
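
As a quick refresher, here is a small, hedged example in R (the salary numbers are invented) showing why the mean/median distinction matters, plus a one-line confidence interval:

    # one outlier drags the mean far above the median
    salaries <- c(52, 58, 61, 64, 66, 71, 75, 900) * 1000

    mean(salaries)     # about 168k, inflated by the outlier
    median(salaries)   # 65k, a fairer "typical" value

    t.test(salaries)$conf.int   # a 95% confidence interval for the mean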

[4] Machine Learning Algorithms: Knowing how to code the algorithms is expected, but so is knowing the logic behind them. If you cannot explain an algorithm in plain English, you really don’t know what you are talking about!

[5] Data Visualization: Tableau is the preferred technology, although I’ve seen people find success with Excel charts (Excel will never die!) and R libraries, too. However, I definitely see Tableau dominating everything else in the coming years.

Advanced Skills:

Advanced Data Science Skills that make you indispensable!

[6] Communication skills: A picture is worth 1000 words, and being able to present data in meaningful, concise ways is crucial. Too many newbies get lost in the analysis itself, or get hyper-focused on their beautiful code. Most managers want recommendations and insights that they can apply in practice! So being able to think like a “consultant” is crucial, whether you are entry-level or the lead data scientist.

Good presentation skills (written and verbal) are important, even more so for dashboards and visualization reports, and I don’t mean color palettes or chart types. Instead, make sure your dashboards are not “data-vomit”, a very practical (and apt!) term coined by Avinash Kaushik. If users cannot make head or tail of a dashboard without your handholding, or if the most important takeaway is not obvious within 5 seconds, then you’ve done a poor job.

[7] Cloud services: Most companies have moved their databases to AWS/Azure, and many are implementing production models in the cloud. So learn the basics of Docker, containers, and deploying your models and code to the cloud. This is still a niche skill, so having it will definitely help you stand apart as more companies make the move towards automation.

[8] Software engineering: You don’t need to become a software engineer, but knowing basic architecture and data-flow Qs will help you troubleshoot better and write code that moves easily to production. Some Qs to start with – what is the data about, and where (all) is it coming from? Learn about scheduler jobs and report automation; these have helped me automate the most boring repetitive tasks and look like a superstar to my managers! The infrastructure teams do extremely valuable work (keeping things running smoothly), so learn their “rules” and expectations, and make sure your code conforms to them. I always do, and my requests are treated much better! 😉

[9] Automated ML: This is slowly getting popular as companies try to cut costs and improve efficiency with automation. H2O.ai and DataRobot are just 2 names off the top of my head, but there are many more vendors in the market. If possible, learn how to work with these tools, as they can reduce your analysis time and speed up production deployment. They won’t replace good data scientists, but they do magnify the disparity between someone who mindlessly copy-pastes code and a truly efficient data scientist. So make sure your “core” skills are impeccable.

[10] Domain expertise: Nothing beats experience, but even if you are new to the company (or field) learn as much as you can from senior colleagues and partner teams. Find out the “why/how/what” Qs – who is using the analysis results, why do they truly want it? How will it be applied? How does it save the company money or increase profits? How can I do it faster while maintaining accuracy, and also adding to the bottom line? What metric does the end user (or my manager) really care about?

As machine learning software adds more automation and features, this blend of technology and domain expertise will ensure you are never a casualty of layoffs or cost-cutting! I’ve put this at the end, but really you should be thinking about it from DAY ONE!

For example, my current role involves models for credit card fraud prediction. However, once I learned the end-to-end card customer lifecycle (incoming application, review, collections, payments, etc.), my models became much better. Plus, I now have a deeper understanding of fair banking and privacy laws, which can prevent many demographic variables from being used in models. Similarly, a friend working in the petrochemical industry realized that his boss cared more about preventing false negatives (overlooking, and hence NOT maintaining, end-of-life or faulty sensors that can potentially cause leaks or explosions) than false positives (unnecessary maintenance for good sensors), even though two such models can show similar overall accuracy.

So build these skills, and see your career and salary potential sky-rocket in 2020!

Who wants to work at Google?

In this tutorial, we will explore the open roles at Google and try to see what common attributes Google is looking for in future employees.


This dataset comes from the Kaggle site, and contains text information about job location, title, department, minimum and preferred qualifications, and the responsibilities of each position. You can download the dataset here, and run the code on the Kaggle site itself here. Using this dataset, we will try to answer the following questions:

  1. Where are the open roles?
  2. Which departments have the most openings?
  3. What are the minimum and preferred educational qualifications needed to get hired at Google?
  4. How much experience is needed?
  5. What categories of roles are the most in demand?

Data Preparation and Cleaning:

The data is all free-form text, so we need to do a fair amount of cleanup to remove non-alphanumeric characters. Some of the job locations have special characters too, so we remove those using basic string manipulation functions. Once we read in the file, this is a snapshot of the resulting dataframe:
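
A minimal sketch of this cleanup step, assuming the Kaggle file is named job_skills.csv and using simplified column names:

    # read the Kaggle file (file and column names are assumptions)
    jobs <- read.csv("job_skills.csv", stringsAsFactors = FALSE)

    # strip special / non-alphanumeric characters from location and title
    jobs$Location <- gsub("[^[:alnum:] ,.-]", "", jobs$Location)
    jobs$Title    <- gsub("[^[:alnum:] ,.-]", "", jobs$Title)

    str(jobs)   # snapshot of the resulting dataframe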

Job Categories:

First we look at which departments have the most open roles. Surprisingly, there are more roles open in the “Marketing and Communications” and “Sales & Account Management” categories than in the traditional technical business units (like Software Engineering or Networking).

Full-time versus internships:

Let us see how many roles are full-time and how many are for students. As expected, only ~13% of the roles are for students, i.e. internships; the majority are full-time positions.

Technical Roles:

Since Google is a predominantly technical company, let us see how many positions need technical skills, irrespective of the business unit (job category).

a) Roles related to “Google Cloud”:

To check this, we investigate how many roles have the phrase “Google Cloud” either in the job title or in the responsibilities. As shown in the graph below, ~20% of the roles are related to Cloud infrastructure, clearly showing that Google is making Cloud services a high priority.
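
The check itself can be a simple pattern match; a hedged sketch, with column names carried over from the cleanup step above:

    # flag roles mentioning "google cloud" in the title or the responsibilities
    is_cloud <- grepl("google cloud", tolower(jobs$Title)) |
                grepl("google cloud", tolower(jobs$Responsibilities))

    mean(is_cloud)   # proportion of cloud-related roles, ~20% here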

Educational Qualifications:

Here we parse the “min_qual” and “pref_qual” columns to see the qualifications needed for each role. If we only take the minimum qualifications into consideration, we see that 80% of the roles explicitly ask for a bachelor’s degree; less than 5% of roles ask for a master’s or PhD.

min_qualifications for Google jobs

However, when we consider the “preferred” qualifications, the master’s/PhD ratio increases to a whopping ~25%. Thus, a fourth of all roles would be better suited to candidates with a master’s degree or above.
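
The qualification checks follow the same pattern-matching idea; a hedged sketch, assuming min_qual and pref_qual hold the free-form qualification text:

    min_bachelors <- grepl("bachelor", tolower(jobs$min_qual))
    min_advanced  <- grepl("master|phd", tolower(jobs$min_qual))
    pref_advanced <- grepl("master|phd", tolower(jobs$pref_qual))

    mean(min_bachelors)   # ~80% ask for a bachelor's degree at minimum
    mean(min_advanced)    # <5% require a master's or PhD
    mean(pref_advanced)   # ~25% prefer a master's or PhD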

Google Engineers:

Google is famous for hiring engineers for all types of roles. So we will read the job qualification requirements to identify what percentage of roles require a technical degree, i.e. a degree in Engineering or computer science.
As seen from the data, 35% specifically ask for an Engineering or computer science degree, including roles in marketing and other non-engineering departments.

Years of Experience:

We see that 30% of the roles require at least 5 years of experience, while 35% of roles need even more.
So if you did not get hired at Google right after graduation, no worries. You have a better chance after gaining solid experience at other companies.

Role Locations:

The dataset does not include the geographical coordinates needed for mapping. However, this is easily overcome by using the geocode() function and the excellent rworldmap package. Note that we are only plotting the locations, not the number of openings per location, so keep in mind that some places have many more roles than others. Overall, we see open roles in all parts of the world; however, the most positions are in the US, followed by the UK, and then Europe as a whole.
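
A hedged sketch of the mapping step (geocode() comes from the ggmap package, which nowadays requires a Google API key; getMap() is from rworldmap):

    library(ggmap)
    library(rworldmap)

    # look up latitude/longitude for each unique job location
    coords <- geocode(unique(jobs$Location))

    # plot a world map and overlay the role locations
    newmap <- getMap(resolution = "coarse")
    plot(newmap, main = "Google open roles by location")
    points(coords$lon, coords$lat, col = "red", pch = 19, cex = 0.7)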

Responsibilities – Word Cloud:

Let us create a word cloud to see what skills are most needed for the Cloud engineering roles. We see that words like “partner”, “custom solutions”, “cloud”, “strategy” and “experience” are more frequent than any specific technical skills. This shows that the Google Cloud roles are best filled by senior resources, where leadership and business skills become more significant than expertise in a specific technology.
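
A hedged sketch of the word cloud step, using the tm and wordcloud packages and reusing the is_cloud flag from the earlier sketch:

    library(tm)
    library(wordcloud)
    library(RColorBrewer)

    # build and clean a corpus from the Cloud roles' responsibilities text
    corpus <- VCorpus(VectorSource(jobs$Responsibilities[is_cloud]))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    wordcloud(corpus, max.words = 100, random.order = FALSE,
              colors = brewer.pal(8, "Dark2"))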


Conclusion:

So who has the best chance of getting hired at Google?

For most of the roles (from this dataset), a candidate with the following traits has the best chance of getting hired:

  1. 5+ years of experience.
  2. Engineering or Computer Science bachelor’s degree.
  3. Master’s degree or higher.
  4. Working in the US.

The code for this script and the graphs is available here on the Kaggle website. If you liked it, don’t forget to upvote the script. 🙂

Thanks and happy coding!

Top US Cities with Highest Rent

In this post, we will use the Zillow rent dataset to perform exploratory and inferential statistics. Our main goal is to identify the most expensive real-estate cities in the US. We will use the leaflet package to create interactive drill-down maps.


Input Files:

The Kaggle dataset contains two files with rental prices for 13,000+ cities across the time frame Nov 2010 – Jan 2017. One file contains rent values; the other has price per square foot.

Additionally, we use a public dataset to map geographical coordinates to the city names. The main analysis does not need the latitude/longitude values, so you can proceed without this file, except for the last map. That said, having these values helps create some stunning visuals.

Feel free to use the location data file with other datasets or projects, as it contains coordinate information for cities in numerous countries. 

Note of caution:

The location data file is quite large, so the fread() call to read it and the merge() later on will each take a minute or so.
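
A minimal sketch of the load step (the file names are assumptions):

    library(data.table)

    rent <- fread("price.csv")           # median rent per city, one column per month
    sqft <- fread("pricepersqft.csv")    # price per square foot, same layout
    geo  <- fread("world_cities.csv")    # large public file: city names + lat/lng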

Analysis Qs:

To give some structure to our analysis, these are the main goals for the project:

  1. Most expensive cities in US, by rent.
  2. Most expensive cities by price per square foot.
  3. Which states have a higher concentration of such cities?
  4. Rent trends over time.

Please note that the data files and R code are available on the Projects page, under “Rent Analysis” for Aug 2017.

P.S.: If you are interested in a career in data science, be sure to check out our ebook “Data Science Jobs” on Amazon – US link here.

Data Cleansing:

The Kaggle files are quite clean, without many missing values. However, to use them for analyzing trends over time, we still need to process them. In this case, the rent for each month is in a separate column, so we need to gather those columns together, which we do with a custom for-loop, sketched below.
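
The wide-to-long step looks roughly like this (a hedged sketch; the column positions and names are assumptions):

    # months sit in separate columns ("November 2010" ... "January 2017"),
    # so loop over them and stack into one (city, state, month, rent) frame
    month_cols <- names(rent)[7:ncol(rent)]   # assume metadata in the first 6 columns
    rent_long  <- data.frame()

    for (m in month_cols) {
      tmp <- data.frame(city  = rent$City,
                        state = rent$State,
                        month = m,
                        rent  = rent[[m]])
      rent_long <- rbind(rent_long, tmp)
    }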

On a side note, if you are trying to massage data into reporting formats, similar to a pivot table in Excel, such for-loops can save you tons of time on manual steps.

We will also merge in the latitude & longitude data at this step. Some of the city names don’t match exactly, so we use some string manipulation functions to make a perfect match.
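
A minimal sketch of the matching and merge (column names are assumptions):

    # normalize city names on both sides before merging
    rent_long$city_key <- tolower(trimws(rent_long$city))
    geo$city_key       <- tolower(trimws(geo$city))

    rent_geo <- merge(rent_long, geo[, c("city_key", "lat", "lng")],
                      by = "city_key", all.x = TRUE)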

This is how the data frame looks after the data processing step:

transformed data object

Rent Analysis:

We will use the Jan 2017 data to rank cities on parameters like population density, rent amount and price per square foot.

a) Most expensive cities in US, by rent:

We use the Jan 2017 data to sort the cities by rent amount, then assign each a label of the form “Num. City_Name”. We then take the top 10 cities and merge them with the original rent dataframe, to view the rent trend over time.
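
A hedged sketch of this ranking step, reusing the rent_geo frame from the merge above (the month label format is an assumption):

    # filter to Jan 2017, sort descending by rent, and label the top 10
    jan17 <- rent_geo[rent_geo$month == "January 2017", ]
    jan17 <- jan17[order(-jan17$rent), ]

    top10 <- head(jan17, 10)
    top10$label <- paste0(seq_len(nrow(top10)), ". ", top10$city)

    # merge back with the long frame to view each city's rent trend over time
    top10_trend <- merge(rent_long, top10[, c("city", "label")], by = "city")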

This gives us the list below:

US cities with highest rent

If we plot the rent values since Nov 2010, we get a chart as shown below:

We notice that Jupiter Island and Westlake show some intra-year rent patterns, indicating seasonal shifts in demand/supply.

b) Cities with highest price by area:

Using the price per square foot dataset, we can also identify the cities with the highest price per square foot. The city list for this analysis is as follows:

Notice that the city names in the two lists are not identical. Jupiter Island, which was first in list 1, has moved down to spot 4. Meanwhile, a 2,000 sq. ft. home in Malibu, CA would set you back $9,000 per month! We also see that the cities in this list are predominantly in California or Florida.

c) Cities with small area but huge rent!!

Let us investigate which cities make you shell out tons of money for very small homes. We can back out the home area using the price per sq. ft. and the rent amount.
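
A hedged sketch of the calculation, assuming the Jan 2017 price-per-sqft values have been merged into jan17 as price_sqft:

    # rent ($/month) divided by price per sq ft ($/sq ft/month) = implied area
    jan17$area_sqft <- jan17$rent / jan17$price_sqft

    # cities charging the most for homes under 900 sq ft
    small_pricey <- jan17[jan17$area_sqft < 900, ]
    head(small_pricey[order(-small_pricey$rent), ], 10)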

small home, big rent

d) Ranking cities by population density:

Similar code gives us the list below:

rent in cities with large population

Not surprisingly, we see names like New York, Los Angeles and Chicago heading the list.

Mapping Cities & Rent:

We’ve added the geographical coordinates to our dataset, so let us plot the cities and their median rent. We will add a column for the popup text we want to display, and use the leaflet() function to create the map.
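
A minimal leaflet sketch, assuming jan17 carries the lat/lng columns from the earlier merge:

    library(leaflet)

    # popup text shown when a marker is clicked
    jan17$popup_text <- paste0(jan17$city, ", ", jan17$state,
                               " - median rent $", jan17$rent)

    leaflet(jan17) %>%
      addTiles() %>%                    # OpenStreetMap base tiles
      addMarkers(lng = ~lng, lat = ~lat, popup = ~popup_text,
                 clusterOptions = markerClusterOptions())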

Note that the maps look a little blurred at first; after a few seconds, as the tiles load, the areas look a lot clearer, and you can see national and state highways, city names and other details. The zoom feature allows users to zoom in and out.

Images for Hawaii are shown below:

US city map with clusters

Zooming to the left and down to view Hawaii.

Hawaii map

Zooming in further to check Kailua, Hawaii:

Median rent in Kailua, HI

Data Insights:

  1. The top 10 most expensive cities seem to be concentrated in CA and TX (California and Texas).
  2. In such cities you have to pay $10,000+ in rent.
  3. Among the cities where you pay a lot for homes smaller than 900 sq ft, we notice that the Hawaii cities show a seasonal trend, perhaps due to tourist cycles (uptick) and the torrential rains (low tourism).
  4. The most populous cities are not always the most expensive, although high population probably means a lot more competition for the same few homes.
  5. Median rent in the most populous cities is ~$1,300.

What other insights did you pick up?

Next Steps:

You can play around with the data and code to explore other rankings or create your own visualizations. Here are some pointers to get you started:

  1. Rank cities by highest rent for some random months – Jan 2014, July 2015, Mar 2012, Aug 2013, Nov 2016, July 2011, Sep 2015. Do the top 20 lists remain the same, or do they change?
  2. Collect the city names from all the months above and view their trends over time. Identify which city has the maximum price increase, where price increase % = (Jan 2017 rent − Nov 2010 rent) / Nov 2010 rent.
  3. Which state has the highest number of such expensive cities? If the answer is CA, which is the second most expensive state?
  4. Repeat steps 1-3 for price per square foot.
  5. Select a state like Kansas, Oklahoma, North Dakota or Mississippi and repeat the analysis at the state level.

Please feel free to download the code files and datasets from the Projects Page under “Rent Analysis” .

August Projects

In this month’s project, we will implement cluster analysis using the “K-means algorithm”.

We use the weather data from 1500+ locations (near airports) to understand temperature patterns by latitude and time of year.

We use k = 5 clusters and assign letters A through E to locations with similar weather patterns; a minimal sketch of the clustering step follows the list below. At the end of the analysis, you should be able to interpret the following insights from the resulting graphs and tables:

  1. Temperature patterns are similar towards the far North and South, just vertically shifted.
  2. The Pacific coast is different from the rest of the nation; the temperature there stays almost static throughout the year.
  3. It is interesting to see how states in two different parts of the country show similar weather patterns when they sit on the same latitude (see Minnesota and Maine). During peak summer, these two states are hotter than California.
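
Here is the promised sketch of the clustering step (hedged: the weather data frame and its monthly temperature columns are assumptions):

    # one row per location, monthly average temperatures in temp_1 ... temp_12
    set.seed(42)   # k-means depends on random starting points
    km <- kmeans(weather[, paste0("temp_", 1:12)], centers = 5)

    # label the 5 clusters A through E, as in the post
    weather$cluster <- LETTERS[km$cluster]
    table(weather$cluster)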


A sample graph from the analysis is shown below.

US states divided into 5 major weather clusters

Data set and code files are available from the main Project site page, under the row for Jul/Aug 2017.

Take a look and play around with the data, to investigate the following:

  1. What happens if you increase cluster size to 7? What happens if you decrease it to 3?
  2. What is the monthly weather pattern for Hawaii (state code = HI) versus New Hampshire (abbreviation = NH) ?
  3. What is the weekly average temperature for a tropical state like Florida? (Plot a chart with median temperatures for all 52 weeks, by year.) Has the average temperature gone up due to global warming?

Please leave your thoughts and comments, or questions if you get stuck on any point.

Happy Coding!


Monte Carlo Simulations in R

In today’s tutorial, we are going to learn how to implement Monte Carlo Simulations in R.

Logic behind Monte Carlo:


Monte Carlo simulation (also known as the Monte Carlo Method) is a statistical technique that lets us simulate the full range of possible outcomes of an event. This makes it extremely helpful in risk assessment and aids decision-making, because we can estimate the probability of extreme cases coming true. The technique was first used by scientists working on the atom bomb; it was named for Monte Carlo, the Monaco resort town renowned for its casinos. Since its introduction in World War II, Monte Carlo simulation has been used to model a variety of physical and conceptual systems.

Monte Carlo methods are used to estimate the probability of an event A happening among a set of N events. We assume that all the events are independent, so the probability of event A happening once does not change the chance of it occurring again.

For example, assume you have a fair coin and you flip it once. The probability of heads is 0.5, i.e. an equal chance of heads or tails. You flip the coin again; the probability of heads is still 0.5, irrespective of whether we got heads or tails on the first flip. However, we can safely say that if we were to flip the coin 100 times, we would see heads ~50% of the time. Monte Carlo methods (referred to henceforth in this post as MC) come into play when we want to find the probability of heads occurring 16 times in a row (or 5, or 3, or any other number).
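
A minimal sketch of that exact question in R, estimating the chance of 16 heads in a row by brute-force simulation:

    set.seed(7)
    runs <- 1e6   # number of simulated 16-flip sequences

    all_heads <- replicate(runs, all(sample(c("H", "T"), 16, replace = TRUE) == "H"))
    mean(all_heads)   # estimate; the exact answer is 0.5^16, about 1.5e-05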

You can read more about these methods and the theory behind them, using the links below:

  1. Wikipedia – link.
  2. MC methods in Finance, from Investopedia.com – link2
  3. Basics of MC from software provider Palisade. – link3.

Applications:

MC methods are used by professionals in numerous fields: finance, project management, energy, manufacturing, R&D, insurance, biotech, and more. Some real-world applications of Monte Carlo simulations are given below:

  1. Monte Carlo simulations are used in financial services to predict fraudulent credit card transactions (since 100 genuine transactions do not guarantee the next one will not be fraudulent, even though fraud is a rare event by itself).
  2. Risk analysis. Assume a new product was sold at a loss of $300 to 6 users (due to coupons or sales), at a profit of $467 to 79 users, and at a profit of $82 to 119 customers. We can use Monte Carlo simulations to estimate the average P/L (profit or loss) if 1000 customers bought our product; see the simulation sketch after this list.
  3. A/B testing to understand page bounces and the success of web elements. Assume you changed the payment processing system on your e-commerce site, and you are doing an A/B test to see if the upgrade results in improved checkout completion. On the old system, 12 users abandoned their cart while 19 completed their purchase. On the new system, 147 people abandoned their cart while 320 completed their purchase. Which system works better?
  4. Selection criteria. For example, if we have 7 candidates for a scholarship (Eileen, George, Taher, Ramesis, Arya, Sandra and Mike), what is the probability that Mike will be chosen in three consecutive years? Assume the candidate list stays the same and past winners are not barred from receiving the scholarship again.
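
Here is the promised simulation sketch for the risk-analysis example, resampling the observed per-customer outcomes:

    set.seed(99)
    # observed outcomes: 6 customers at -$300, 79 at +$467, 119 at +$82
    outcomes <- c(rep(-300, 6), rep(467, 79), rep(82, 119))

    # simulate the total P/L for 1000 customers, many times over
    sim_totals <- replicate(10000, sum(sample(outcomes, size = 1000, replace = TRUE)))

    mean(sim_totals)                      # expected total P/L
    quantile(sim_totals, c(0.05, 0.95))   # a 90% range for planning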


Advantages of using MC:

Unlike simple forecasting, Monte Carlo simulation can help with the following:

  • Probabilistic Results – results show not only what could happen, but how likely each outcome is.
  • Graphical Results – the outcomes and their chances of occurring can easily be converted to graphs, making it easy to communicate findings to an audience.
  • Sensitivity Analysis – it is easier to see which variables impact the outcome the most, i.e. which variables have the biggest effect on bottom-line results.
  • Scenario Analysis – using Monte Carlo simulation, we can see exactly which inputs had which values together when certain outcomes occurred.
  • Correlation of Inputs – in Monte Carlo simulation, it’s possible to model interdependent relationships between input variables. It’s important for accuracy to represent how, in reality, when some factors go up, others go up or down accordingly.

Code template:

The basic template for MC is as follows:

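Here is a minimal sketch of the template (a hedged reconstruction, using a simple fair-coin trial as a stand-in for func1):

    runs <- 1000

    # func1 simulates one trial; returns TRUE if the event of interest occurred
    func1 <- function() {
      sample(c("H", "T"), size = 1) == "H"
    }

    # estimated probability = successful trials / total trials
    mc_prob <- sum(replicate(runs, func1())) / runs
    mc_prob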

Let’s look at this code in detail:

  • runs = the number of trials or iterations. For our product profit example (application 2), runs = 1000.
  • func1 = the function where we define the different events, their probabilities and the selection criterion. For our scholarship candidate example (application 4), this function would be modified to:

sum(sample(c(1:7), size = 3, replace = TRUE)) == 21

where we assign the numbers 1 through 7 to the students, so Mike = 7. Since each draw is at most 7, the sum can equal 21 only when all three draws are 7, i.e. only when Mike is chosen in all three years.
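
Putting it together for the scholarship question (a hedged sketch):

    func1 <- function() {
      # Mike (= 7) must be drawn in all three years, i.e. the sum must be 21
      sum(sample(c(1:7), size = 3, replace = TRUE)) == 21
    }

    mean(replicate(100000, func1()))   # should approach (1/7)^3, about 0.0029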

Main code:

The code files for this tutorial are available on the 2017 projects page (link here, under Jul/Aug 2017).
