This is just a short note to specify that the list of FREE datasets is updated for 2020. There are 50+ sites and links to the newly released Google Dataset search engine. So, have fun exploring these data repositories to master programming, create stunning visualizations and build your own unique project portfolios.
Some starter projects with these datafiles are available on the Projects page, using R-programming.
The first month of the new decade is almost at an end. It’s also “job-hunting” time when students start looking for internships and employees think about switching roles and companies, in search of better salaries and opportunities. If you fall into one of these categories, then here are the Top 10 skills your resume absolutely needs to include, to get noticed by employers and land your dream job.
I looked at 200 job descriptions for jobs posted on LinkedIn in 7 major US/Canada cities – San Francisco, Seattle, Chicago, New York, Philadelphia, Atlanta, Toronto. Let’s face it – LinkedIn is the go-to platform for job seekers and recruiters. So looking at any other site seems a waste of time.
The job listings included many of the top Global brands in tech (Microsoft, Amazon, etc.), product (AirBnb, Uber, Visa), consulting (Deloitte, Accenture), banks (JP Morgan, Capital One) and so on. I only considered jobs with the title “Data scientist” or “Data Analyst”, with 150+ in the former. It took a while, but doing this manually also allowed me to exclude repetitive postings, since some companies post same role for multiple locations.
Ultimately, this allowed me to quickly identify patterns and
repeated skills, which I am presenting in this blogpost.
I’ve categorized the skills into 2 parts: Core and Advanced. Core skills are the absolute minimum you should have, recruiters and automated job application systems will simply disqualify you without them. Advanced skills are those “preferred” competencies that make you look more valuable as a candidate, so make sure to highlight them with examples on your resume. So, if you are trying to transition to a career in Data Science, then I would highly recommend learning these first, and then jumping into the others. Needless to say, everyone working (or entering) this field needs to have a portfolio of projects.
Disclaimer – having all the 10 skills does NOT guarantee a job but vastly improves your chances. You’ll still need to do some legwork, to get considered and my book “Data Science Jobs” can help you shorten this process. The book is also on SALE for $0.99 this weekend, Jan 25th to Jan 28th, at a 92% discount.
 Programming (R/Python): This is a no-brainer, you need to be an expert in either R/Python. Some jobs will list SAS or other obscure languages, but R or Python was a constant and mandatory requirement in 100% of all the jobs I parsed.
I am not going to argue the merits of one over the other in this post, but I will emphasize that R is still very much a in-demand skill. Plus, for most entry level roles, a candidate with only Python is not going to be considered more favorably (or declined!) over someone who knows only R. In fact, at my current and previous 2 roles, R-programming was the preferred language of choice. If you’d like to know my true views on the R vs Python debate, read this post.
 SQL: Most colleges and bootcamps do not teach this, but it is inordinately valuable. You cannot find insights without data, and 99% of companies predominantly use SQL databases of some kind. Fancy stuff like MongoDb, NoSQL or Hadoop are excellent keywords to add to your bio, but SQL is the baseline. You don’t need to know stored procedures or admin level expertise, but please learn basics of SQL for pulling in data with filters and optimizing table joins. SQL is mandatory to thrive as a data scientist.
 Basic math & Stats: By this I mean basic high-school stuff, like calculating confidence intervals and profit-loss calculations. If you cannot distinguish between mean and median, then no self-respecting manager will trust your numbers, or believe your insights have excluded those pesky outliers. Profits, incremental benefits in $ are other useful formulae you should know too, so brush up on your business math.
 Machine Learning Algorithms: Knowing to code algorithms is expected, but so is knowing the logic behind them. If you cannot explain it in plain English, you really don’t know what you are talking about!
 Data Visualization: Tableau is the preferred technology, although I’ve seen people find success with Excel charts ( Excel will never die! ) and R libraries, too. However, I definitely see Tableau dominating everything else in the coming years.
 Communication skills: A picture is worth 1000 words; and being able to present data in meaningful, concise ways is crucial. Too many newbies get lost in the analysis itself, or hyper-focused on their beautiful code. Most managers want to see recommendations and insights that they can apply in practice! So being able to think like a “consultant” is crucial whether you are entry-level or the lead data scientist.
Good presentation skills (written and verbal) are important, more so for any dashboards or visualization reports, and I don’t mean color palettes or chart-types. Instead, make sure your dashboards are not “data-vomit”, a very practical (and apt!) term coined by Avinash Kaushik. If users cannot make head or tail of the dashboard without your handholding, or if the most important take-away is not obvious within 5 seconds, then you’ve done a poor job.
 Cloud services: Most companies have moved databases to AWS/Azure, and many are implementing production models in the cloud. So, learn those basics about Docker, containers, and deploying your models and code to the cloud. This is still a niche skill, so having it will definitely help you stand apart as most companies make the move towards automation.
 Software engineering: You don’t need to become a software engineer but knowing basic architecture and data flow Qs will help you troubleshoot better, write better code that is easily moved to production. Some Qs to start – what is the data about, where (all) is it coming from? Learn about scheduler jobs and report automation, these have helped me automate the most boring repetitive tasks and look like a superstar to my managers! The infrastructure teams do extremely valuable work (keeping things running smoothly) so learn about “rules” and expectations, and make sure your code conforms to everything. I always do, and my requests are treated much better! 😉
 Automated ML: This is slowly getting popular, as companies try to cut costs and improve efficiencies with automation. H20.ai and DataRobot are just 2 names off my head, but there are many more vendors in the market. If possible, learn how to work with those, as they can reduce your time for analysis and speed up production deployment. They won’t replace good data scientists, but they do magnify the disparity between someone who is mindless copy/pasting code and the truly efficient data scientists. So make sure your “core” skills are impeccable.
 Domain expertise: Nothing beats experience, but even if you are new to the company (or field) learn as much as you can from senior colleagues and partner teams. Find out the “why/how/what” Qs – who is using the analysis results, why do they truly want it? How will it be applied? How does it save the company money or increase profits? How can I do it faster while maintaining accuracy, and also adding to the bottom line? What metric does the end user (or my manager) really care about?
As Machine learning software add more automation and features, this blend of technology and domain expertise will ensure you are never a casualty of layoffs or cost-cutting! I’ve put this at the end, but really you should be thinking about this from DAY ONE!
For example, my current role involves models for credit card fraud prediction. However, once I learned the end-to-end process of card customer lifecycle (incoming application, review, collections, payments, etc.) my models have become much better. Plus, I have deeper understanding of Fair banking and privacy laws which can prevent many demographic variables from being used in modes. Similarly, a friend working in the petrochemical industry realized that his boss cared more about preventing true negatives (Overlooking or NOT maintaining end-of-life or faulty sensors that can potentially cause leaks or explosions ) than false positives (unnecessary maintenance for good sensors), even though both models can give you similar accuracy.
So build these skills, and see your career and salary potential sky-rocket in 2020!
November is the month of Thanksgiving, and vacations and of course deals galore! As part of saying thanks to my loyal readers, here are some deals specific to data science professionals and students, that you should definitely not miss on.
If you are exploring Data Science careers or preparing for interviews for a winter graduation, then take a look at my ebook “Data Science Jobs“. It is currently part of a Kindle countdown deal and priced 50% off from its normal price. Currently only $2.99 and prices will keep increasing until Friday morning when it goes back to full price.
Want a FREE book on Statistics, as related to R-programming and machine learning algorithms? I am currently looking to giveaway FREE advanced reviewer copies (ARC) . You can look at the book contents here, and if it seems interesting then please sign up here to be a reviewer.
If you are deploying machine learning models on the cloud, then chances are you work with Kubernetes or have at least heard of it. If you haven’t and you are an aspiring data scientist/ engineer, then you should compulsorily learn about tho
The R-programming project for November is a sentiment analysis on song lyrics by different artists. There is lots of data wrangling involved to aggregate different lyrics, and compare the lyrics favored by 2 different artists. The code repository is added to the Projects page here. I’ve written the main code in R, and used Tableau to generate some of the visuals, but this can be easily tweaked to create an awesome Shiny dashboard to add to a data science portfolio.
In this month’s we are going to look at data analysis and visualization of social networks using R programming.
Friendster Networks Mapping
Friendster was a yesteryear social media network, something akin to Facebook. I’ve never used it but it is one of those easily available datasets where you have a list of users and all their connections. So it is easy to create a viz and look at whose networks are strong and whose are weak, or even the bridge between multiple networks.
The dataset and code files are added on the Projects Page here , under “social network viz”.
For this analysis, we will be using the following library packages:
Load the datafiles. The list of users is given in the file named “nodes” as each user is a node in the graph. The connection list is given in the file named “edges” as a 1-to-1 mapping. So if user Miranda has 10 friends, there would be 10 records for Miranda in the “edges” file, one for each friend. The friendster datafile has been anonymized, so there are numbers (id) rather than names.
Convert the dataframes into a very specific format. We do some prepwork so that we can directly use the graph visualization functions.
Create a graph object. This will also help to create clusters. Since the dataset is anonymized it might seem irrelevant, but imagine this in your own social network. You might have one cluster of friends who are from your school, another bunch from your office, one set who are cousins and family members and some random folks. Creating a graph object allows us to look at where those clusters lie automatically.
Visualize using functions specific to graph objects. The first function is visNetwork() which generates an interactive color coded cluster graph. When you click on any of th nodes (colored circles), it will highlight all the connections radiating from the node. (In the image below, I have highlighted node for user 17.
You can also use the same function with a bunch of different parameters, as shown below:
In the image below you can see the 3 colored clusters and the central (light blue) node. The connections in blue are the ones that do not have a lot of direct connections. The yellow and red clusters are tigher, indicating they have internal connections with each other. (similar to a bunch of classmates who all know each other)
Feel free to play around with the code. One extensions of this idea would be to download Facebook or LinkedIn data (premium account needed) and create similar visualizations.
Or if you have a list of airports and routes, you could create something like this as a flight network map, to know the minimum number of hops between 2 destinations and alternative routes.
You could also do a counter to see which nodes have the most number of friends and increase the size of the circle. This would make it easier to view which nodes are the most well-connected.
Of course, do not be over-mesmerized by the data. In real-life, the strength of the relationship also matters. This is hard to quantify or collect, even though its easy to depict once you have the data in hand/ For example, I have a 1000 connections who I’ve met at conferences or random events. If I needed a job, most may not really be useful. But my friend Sarah has only 300 but super-loyal friends who literally found her a job in 2 days when she had to move back to her hometown to take care of a sick parent.
With that thought, do take a look at the code and have fun coding! 🙂
/ JOURNEYOFANALYTICS / Comments Off on DataScience Portfolio Ideas for Students & Beginners
A lot has been written on the importance of a portfolio if you are looking for a DataScience role. Ideally, you should document your learning journey so that you can reuse code, write well-documented code and also improve your data storytelling skills.
However, most students and beginners get stumped on what to include in their portfolio, as their projects are all the same that their classmates, bootcamp associates and seniors have created. So, in this post I am going to tell you what projects you should have in your portfolio kitty, as well as a list of ideas that you can use to construct a collection of projects that will help you stand out on LinkedIn, Github and in the eyes of prospective hiring managers.
In this post though, I am classifying projects based on skill level along with sample ideas for DIY projects that you can attempt on your own.
On that note, if you are already looking for a job, or about to do so, do take a look at my book “DataScience Jobs“, available on Amazon. This book will help you reduce your job search time and quickly start a career in analytics.
Since I prefer R over Python, all the project lists in this post will be coded in R. However, feel free to implement these ideas in Python, too!
a. Entry-level / Rookie Stage
If you are just starting out, and are not very comfortable with even syntax, your main aim is to learn how to code along with DataScience concepts. At this stage, just try to write simple scripts in R that can pull data, clean it up and calculate mean/median and create basic exploratory graphs. Pick up any competition dataset on Kaggle.com and look at the highest voted EDA script. Try to recreate it on your own, read through and understand the hows and whys of the code. One excellent example is the Zillow EDA by Philipp Spachtholz.
This will not only teach you the code syntax, but also how to approach a new dataset and slice/dice it to identify meaningful patterns before any analysis can begin.
Once you are comfortable, you can move on to machine learning algorithms. Rather than Titanic, I actually prefer the Housing Prices Dataset. Initially, run the sample submission to establish a baseline score on the leaderboard. Then apply every algorithm you can look up and see how it works on the dataset. This is the fastest way to understand why some algorithms work on numerical target variables versus categorical versus time series.
Next, look at the kernels with decent leaderboard score and replicate them. If you applied those algorithms but did not get the same result, check why there was a mismatch.
Now pick a new dataset and repeat. I prefer competition datasets since you can easily see how your score moves up or down. Sometimes simple decision trees work better than complex Bayesian logic or Xgboost. Experimenting will help you figure out why.
Sample ideas –
Survey analysis: Pick up a survey dataset like the Stack overflow developer survey and complete a thorough EDA – men vs women, age and salary correlation, cities with highest salary after factoring in currency differences and cost of living. Can your insights also be converted into an eye-catching Infographic? Can you recreate this?
Simple predictions: Apply any algorithms you know on the Google analytics revenue predictor dataset. How do you compare against the baseline sample submission? Against the leaderboard?
Automated reporting: Go for end-to-end reporting. Can you automate a simple report, or create a formatted Excel or pdf chart using only R programming? Sample code here.
b. Senior Analyst/Coder
At this stage simple competitions should be easy for you. You dont need to be in the top 1%, even being in the Top 30-40% is good enough. Although, if you can win a competition even better!
Now you can start looking at non-tabular data like NLP sentiment analysis, image classification, API data pulls and even dataset mashup. This is also the stage when you probably feel comfortable enough to start applying for roles, so building unique projects are key.
For sentiment analysis, nothing beats Twitter data, so get the API keys and start pulling data on a topic of interest. You might be limited by the daily pull limits on the free tier, so check if you need 2 accounts and aggregate data over a couple days or even a week. A starter example is the sentiment analysis I did during the Rio Olympics supporting Team USA.
You should also start dabbling in RShiny and automated reports as these will help you in actual jobs where you need to present idea mockups and standardizing weekly/ daily reports.
Sample ideas –
Twitter Sentiment Analysis: Look at the Twitter sentiments expressed before big IPO launches and see whether the positive or negative feelings correlated with a jump in prices. There are dozens of apps that look at the relation between stock prices and Twitter sentiments, but for this you’d need to be a little more creative since the IPO will not have any historical data to predict the first day dips and peaks.
API/RShiny Project: Develop a RShiny dashboard using Yelp API, showing the most popular restaurants around airports. You can combine a public airport dataset and merge it with filtered data from the Yelp API. A similar example (with code) is included in this Yelp College App dashboard.
Lyrics Clustering: Try doing some text analytics using song lyrics from this dataset with 50,000+ songs. Do artists repeat their lyrics? Are there common themes across all artists? Do male singers use different words versus female solo tracks? Do bands focus on a totally different theme? If you see your favorite band or lead singer, check how their work has evolved over the years.
By now, you should be fairly comfortable with analyzing data from different datasource types (image, text, unstructured), building advanced recommender systems and implementing unsupervised machine learning algorithms. You are now moving from analyze stage to build stage.
You may or may not already have a job by now. If you do, congratulations! Remember to keep learning and coding so you can accelerate your career further.
If you have not, check out my book on how to land a high-paying ($$$) Data Science job job within 90 days.
Look at building Deep learning using keras and apps using artificial intelligence. Even better, can you fully automate your job? No, you wont “downsize” yourself. Instead your employer will happily promote you since you’ve shown them a superb way to improve efficiency and cut costs, and they will love to have you look at other parts of the business where you can repeat the process.
Sample project ideas –
Build an App: College recommender system using public datasets and web scraping in R. (Remember to check terms of service as you do not want to violate any laws!) Goal is to recreate a report like the Top 10 cities to live in, but from a college perspective.
Start thinking about what data you need – college details (names, locations, majors, size, demographics, cost), outlook (Christian/HBCU/minority), student prospects (salary after graduation, time to graduate, diversity, scholarship, student debt ) , admission process (deadlines, average scores, heavy sports leaning) and so on. How will you aggregate this data? Where will you store it? How can you make it interactive and create an app that people might pay for?
Upwork Gigs: Look at Upwork contracts tagged as intermediate or expert, esp. the ones with $500+ budgets. Even if you dont want to bid, just attempt the project on your own. If you fail, you will know you still need to master some more concepts, if you succeed then it will be a superb confidence booster and learning opportunity.
Audio Processing: Use the VOX celebrity dataset to identify the speaker based on audio/speech dataset. Audio files are an interesting datasource with applications in customer recognition (think bank call centers to prevent fraud), parsing for customer complaints, etc.
Build your own package: Think about the functions and code you use most often. Can you build a package around it? The most trending R-packages are listed here. Can you build something better?
Do you have any other interesting ideas? If so, feel free to contact me with your ideas or send me a link with the Github repo.