Deep dive into data analysis tools, theory and projects


How to Become a Data Scientist in 2020

Despite the spike in interest in Data Science and Machine Learning roles and courses, it is still possible to become a fully functional data scientist with minimal resources.

Some caveats: (1) be committed to investing hours of effort in building your expertise, and (2) the job market has gotten quite competitive, so be mentally prepared to work strategically and accept that finding a job will require sweat equity.

Note, the title of this post is “Data Scientist”, but the steps below apply even if your aim is to become a data analyst, data engineer, analytics consultant or machine learning engineer.

Data Science job types

Steps to Data Science Expertise

At its core, becoming a data scientist will require three steps (in sequence):

  1. Learn the skills
  2. Build your portfolio
  3. Apply to jobs strategically.

Step 1 – Learn the skills.

The skills below are mandatory.

  • Programming in R or Python.
  • Programming in SQL. Most courses never talk about SQL, but it is critical.
  • Machine learning algorithms. Know the code, and also which algorithm fits which use case (see the short sketch after this list).
  • If you search Google, you will find free courses and books on all the above topics. Or go for a low-cost option from Udemy. Essentially you can learn the skills for <$100, even now in 2020.
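To give a sense of the “know the code” bar for the algorithms bullet above, here is a minimal sketch in Python, assuming scikit-learn is installed; the toy dataset and the choice of logistic regression are purely illustrative.

```python
# Minimal sketch: load a toy dataset, fit one classifier, and check accuracy.
# Assumes scikit-learn is installed; dataset and algorithm choice are illustrative only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)   # swap in whichever algorithm fits the use case
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

If you can write, explain and adapt a snippet like this for any dataset, you have the coding baseline covered.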

Step 2 – Build your portfolio.

  • You can list 100 certifications, but you also need to showcase your learning through projects. Use GitHub to host your projects or create a free WordPress website. If you have the capacity, explore low-cost website hosting from Wix or Squarespace.
  • The project should be unique to you. Pick any free public dataset, apply your own perspective to slice and dice the data, and extract insights. This is what will set you apart from the 10,000 other candidates who completed the same free bootcamp or Coursera class. A sample list of project ideas, organized by beginner and advanced skill levels, is available here.
  • Free tutorials are available on many sites, including my own, journeyofanalytics.

Step 3 – Apply to jobs strategically.

  • The job market is heating up as people enter this field by the thousands. Getting job leads is hard; getting to the interview stage is even harder.
  • Make sure your profile on LinkedIn is “all-star”, with at least 500 connections.
  • You can significantly improve your odds by leveraging niche job sites, and by hunting through the LinkedIn content tabs and Twitter. Both are highly manual, which is why they work! No one else wants to pursue those methods! 🙂 A detailed how-to guide, a full list of niche job boards and interview question sets are all available in my job search book, which I update every quarter. These strategies work, hence the blatant plug!
  • Be prepared to face a lot of rejections, especially for landing the first job. In the beginning, don’t be afraid to accept a low-paying job or work internships. It is easier to get a job when you are already hired!
  • Initially you may be hired as a “data analyst” – accept! A lot of companies use the terms analyst and scientist interchangeably, or use the “data scientist” title to designate more experienced hires.
  • Note, there are other job types in the data science domain apart from “data scientist”, so check whether you can leverage your previous experience for other role types.

Book Offer:

Note, I realize a lot of students are graduating soon and the global pandemic is making it hard to find jobs. Some employers are already reneging on confirmed offers, which increases pressure on students. Hence I’ve reduced my ebook price to $0.99 for the month of May 2020.

Note, the book will NOT be marked free to deter folks who just download books and guides but do not intend to put in any effort!

All the best for an exciting new career!

This question was previously answered (by me) on Quora under the question – “Is it possible to become a data scientist in 2020 with only a few resources?”

Top 10 Most Valuable Data Science Skills in 2020

The first month of the new decade is almost at an end. It’s also “job-hunting” time when students start looking for internships and employees think about switching roles and companies, in search of better salaries and opportunities. If you fall into one of these categories, then here are the Top 10 skills your resume absolutely needs to include, to get noticed by employers and land your dream job.

Data Science Skills for 2020

Methodology:

I looked at 200 job descriptions for jobs posted on LinkedIn in 7 major US/Canada cities – San Francisco, Seattle, Chicago, New York, Philadelphia, Atlanta, Toronto. Let’s face it – LinkedIn is the go-to platform for job seekers and recruiters. So looking at any other site seems a waste of time.

The job listings included many of the top global brands in tech (Microsoft, Amazon, etc.), product (Airbnb, Uber, Visa), consulting (Deloitte, Accenture), banks (JP Morgan, Capital One) and so on. I only considered jobs with the title “Data Scientist” or “Data Analyst”, with 150+ in the former. It took a while, but doing this manually also allowed me to exclude repetitive postings, since some companies post the same role in multiple locations.

Ultimately, this allowed me to quickly identify patterns and repeated skills, which I am presenting in this blogpost.

I’ve categorized the skills into 2 parts: Core and Advanced. Core skills are the absolute minimum you should have; recruiters and automated job application systems will simply disqualify you without them. Advanced skills are those “preferred” competencies that make you look more valuable as a candidate, so make sure to highlight them with examples on your resume. If you are trying to transition to a career in Data Science, I would highly recommend learning the core skills first, and then jumping into the others. Needless to say, everyone working in (or entering) this field needs to have a portfolio of projects.

Disclaimer – having all 10 skills does NOT guarantee a job, but it vastly improves your chances. You’ll still need to do some legwork to get considered, and my book “Data Science Jobs” can help you shorten this process. The book is also on SALE for $0.99 this weekend, Jan 25th to Jan 28th, at a 92% discount.

Core Skills:

Minimum qualifications for Data Scientist roles

[1] Programming (R/Python): This is a no-brainer: you need to be an expert in either R or Python. Some jobs will list SAS or other obscure languages, but R or Python was a constant and mandatory requirement in 100% of the jobs I parsed.

I am not going to argue the merits of one over the other in this post, but I will emphasize that R is still very much an in-demand skill. Plus, for most entry-level roles, a candidate who knows only Python is not going to be favored (or rejected!) over someone who knows only R. In fact, at my current and previous 2 roles, R programming was the preferred language of choice. If you’d like to know my true views on the R vs Python debate, read this post.

[2] SQL: Most colleges and bootcamps do not teach this, but it is inordinately valuable. You cannot find insights without data, and 99% of companies predominantly use SQL databases of some kind. Fancy stuff like MongoDB, NoSQL or Hadoop are excellent keywords to add to your bio, but SQL is the baseline. You don’t need stored procedures or admin-level expertise, but please learn the basics of SQL for pulling in data with filters and optimizing table joins. SQL is mandatory to thrive as a data scientist.
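To make the “filters and joins” point concrete, here is a small, self-contained sketch using Python’s built-in sqlite3 module; the table and column names are invented purely for illustration.

```python
# Self-contained illustration of the SQL basics above: filter rows and join two tables.
# Uses Python's built-in sqlite3 module; table/column names are made up for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha', 'Chicago'), (2, 'Ben', 'Seattle');
    INSERT INTO orders VALUES (10, 1, 120.0), (11, 1, 35.5), (12, 2, 80.0);
""")

# A filtered join: total spend per customer, restricted to one city.
query = """
    SELECT c.name, SUM(o.amount) AS total_spend
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE c.city = 'Chicago'
    GROUP BY c.name;
"""
for row in conn.execute(query):
    print(row)
```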

[3] Basic math & Stats: By this I mean basic high-school stuff, like confidence intervals and profit-loss calculations. If you cannot distinguish between mean and median, then no self-respecting manager will trust your numbers, or believe your insights have excluded those pesky outliers. Profit and incremental benefit (in $) are other useful formulas you should know, so brush up on your business math.
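A quick illustration of the mean-vs-median point and a simple confidence interval, using only Python’s standard library; the revenue numbers are made up.

```python
# Mean vs. median with one outlier, plus a normal-approximation confidence interval.
# Uses only the standard library; the revenue numbers are invented for illustration.
import math
import statistics

revenue = [100, 102, 98, 101, 99, 500]          # one pesky outlier
print("mean:  ", statistics.mean(revenue))      # dragged upward by the outlier
print("median:", statistics.median(revenue))    # robust to the outlier

n = len(revenue)
m = statistics.mean(revenue)
se = statistics.stdev(revenue) / math.sqrt(n)
print("95% CI for the mean:", (round(m - 1.96 * se, 1), round(m + 1.96 * se, 1)))
```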

[4] Machine Learning Algorithms: Knowing how to code the algorithms is expected, but so is knowing the logic behind them. If you cannot explain an algorithm in plain English, you really don’t know what you are talking about!

[5] Data Visualization: Tableau is the preferred technology, although I’ve seen people find success with Excel charts ( Excel will never die! ) and R libraries, too. However, I definitely see Tableau dominating everything else in the coming years.
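If Tableau isn’t handy, even a few lines of code cover the basics; below is a minimal sketch with matplotlib (assumed installed), with invented numbers. The point is the chart, not the tool.

```python
# Minimal chart sketch with matplotlib (assumed installed); numbers are illustrative.
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 135, 150, 160]                  # $k, made-up figures

plt.bar(quarters, revenue)
plt.title("Revenue by quarter")
plt.ylabel("Revenue ($k)")
plt.tight_layout()
plt.show()
```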

Advanced Skills:

Advanced Data Science Skills that make you indispensable!

[6] Communication skills: A picture is worth 1000 words; and being able to present data in meaningful, concise ways is crucial. Too many newbies get lost in the analysis itself, or hyper-focused on their beautiful code. Most managers want to see recommendations and insights that they can apply in practice! So being able to think like a “consultant” is crucial whether you are entry-level or the lead data scientist.

Good presentation skills (written and verbal) are important, more so for any dashboards or visualization reports, and I don’t mean color palettes or chart-types. Instead, make sure your dashboards are not “data-vomit”, a very practical (and apt!) term coined by Avinash Kaushik. If users cannot make head or tail of the dashboard without your handholding, or if the most important take-away is not obvious within 5 seconds, then you’ve done a poor job.  

[7] Cloud services: Most companies have moved databases to AWS/Azure, and many are implementing production models in the cloud. So learn the basics of Docker, containers, and deploying your models and code to the cloud. This is still a niche skill, so having it will definitely help you stand apart as most companies make the move towards automation.
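As a rough sketch of what “deploying a model to the cloud” looks like in practice, here is a bare-bones prediction endpoint using Flask (assumed installed); the model is a stand-in stub, and this is the kind of app you would then wrap in a Docker container and run on AWS/Azure.

```python
# Bare-bones model-serving sketch with Flask (assumed installed). The "model" is a stub;
# in practice you would load a trained model from disk, containerize this app, and deploy.
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict(features):
    # stand-in for a real trained model
    return sum(features) > 10

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    features = request.get_json().get("features", [])
    return jsonify({"prediction": bool(predict(features))})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```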

[8] Software engineering: You don’t need to become a software engineer, but knowing basic architecture and data-flow questions will help you troubleshoot better and write code that moves easily to production. Some questions to start with: what is the data about, and where (all) is it coming from? Learn about scheduler jobs and report automation; these have helped me automate the most boring repetitive tasks and look like a superstar to my managers! The infrastructure teams do extremely valuable work (keeping things running smoothly), so learn their “rules” and expectations, and make sure your code conforms. I always do, and my requests are treated much better! 🙂
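As one example of the report-automation idea, here is a tiny sketch that summarizes a table and writes a dated file; schedule it with cron, Windows Task Scheduler or Airflow and the boring daily report takes care of itself. The data and file names are made up, and pandas is assumed.

```python
# Tiny report-automation sketch: summarize a table and write a dated CSV, then let a
# scheduler (cron / Task Scheduler / Airflow) run it daily. Data and paths are made up.
from datetime import date
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "South"],
    "amount": [120.0, 80.0, 35.5, 60.0],
})

summary = sales.groupby("region", as_index=False)["amount"].sum()
outfile = f"daily_sales_{date.today()}.csv"
summary.to_csv(outfile, index=False)
print(f"Wrote {outfile}")
```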

[9] Automated ML: This is slowly getting popular, as companies try to cut costs and improve efficiency with automation. H2O.ai and DataRobot are just 2 names off the top of my head, but there are many more vendors in the market. If possible, learn how to work with these tools, as they can reduce your analysis time and speed up production deployment. They won’t replace good data scientists, but they do magnify the disparity between someone who is mindlessly copy-pasting code and the truly efficient data scientists. So make sure your “core” skills are impeccable.
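For a flavor of what these tools do, below is an illustrative sketch using H2O’s open-source AutoML (one of the vendors mentioned above). It follows H2O’s documented interface as I recall it, so treat the exact names as an assumption and check the current docs; the toy dataset is only for demonstration.

```python
# Illustrative AutoML sketch with H2O's open-source AutoML (assumed installed as `h2o`).
# Argument and method names follow H2O's documented interface; verify against current docs.
import h2o
from h2o.automl import H2OAutoML
from sklearn.datasets import load_breast_cancer

h2o.init()

data = load_breast_cancer(as_frame=True).frame      # toy dataset, illustrative only
frame = h2o.H2OFrame(data)
frame["target"] = frame["target"].asfactor()        # mark the target as categorical

aml = H2OAutoML(max_models=5, seed=1)               # trains and ranks several models
aml.train(y="target", training_frame=frame)
print(aml.leaderboard.head())
```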

[10] Domain expertise: Nothing beats experience, but even if you are new to the company (or field) learn as much as you can from senior colleagues and partner teams. Find out the “why/how/what” Qs – who is using the analysis results, why do they truly want it? How will it be applied? How does it save the company money or increase profits? How can I do it faster while maintaining accuracy, and also adding to the bottom line? What metric does the end user (or my manager) really care about?

As machine learning software adds more automation and features, this blend of technology and domain expertise will ensure you are never a casualty of layoffs or cost-cutting! I’ve put this at the end, but really you should be thinking about it from DAY ONE!

For example, my current role involves models for credit card fraud prediction. However, once I learned the end-to-end card customer lifecycle (incoming application, review, collections, payments, etc.), my models became much better. Plus, I have a deeper understanding of fair banking and privacy laws, which can prevent many demographic variables from being used in models. Similarly, a friend working in the petrochemical industry realized that his boss cared more about preventing false negatives (overlooking, or NOT maintaining, end-of-life or faulty sensors that can potentially cause leaks or explosions) than false positives (unnecessary maintenance for good sensors), even though both models can give you similar accuracy.
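The sensor example is easy to see in code: two models with identical accuracy can have very different false-negative and false-positive profiles. A small made-up illustration with scikit-learn:

```python
# Two classifiers with identical accuracy but different error profiles: model A misses a
# faulty sensor (false negative), model B flags good sensors (false positives).
# Labels and predictions are invented for illustration; assumes scikit-learn is installed.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = faulty sensor, 0 = good sensor
model_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]   # one missed faulty sensor
model_b = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]   # two unnecessary maintenance calls

for name, preds in [("A", model_a), ("B", model_b)]:
    tn, fp, fn, tp = confusion_matrix(y_true, preds).ravel()
    print(f"Model {name}: accuracy={accuracy_score(y_true, preds):.1f}, FN={fn}, FP={fp}")
```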

So build these skills, and see your career and salary potential sky-rocket in 2020!

How to Become a Data Scientist

This question and its variations are among the most-searched topics on Google. As a practicing data science professional, and a manager to boot, I get asked this question by dozens of people every week.

This post is my honest and detailed answer.

Step 1 – Coding & ML skills

  • You need to master programming in either R or Python. If you don’t know which to pick, pick R, or toss a coin. [Or listen to me and pick R, as it is used at top firms like NASDAQ, JPMorgan, and many more.] Also, when I say master, you need to know more than writing a simple calculator or “Hello World” function. You should be able to perform complex data wrangling, pull data from databases, write custom functions and apply algorithms, even if someone wakes you up at midnight (a quick wrangling sketch follows this list).
  • By ML, I mean the logic behind machine learning algorithms. When presented with a problem, you should be able to identify which algorithm to apply and write the code snippet to do this.
  • Resources – Coursera, Udacity, Udemy. There are countless others, but these 3 are my favorites. Personal recommendation: basic R from Coursera (JHU) and machine learning fundamentals from Kirill’s course on Udemy.
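As a benchmark for the “complex data wrangling” bar mentioned in the first bullet, here is a quick sketch in Python/pandas (the equivalent is just as easy in R with dplyr); the data and the custom function are invented for illustration.

```python
# Quick data-wrangling sketch: group, aggregate, apply a custom function, sort.
# Shown in pandas for illustration; the data and segment rule are made up.
import pandas as pd

orders = pd.DataFrame({
    "customer": ["A", "A", "B", "C", "C", "C"],
    "amount":   [20.0, 35.0, 15.0, 50.0, 5.0, 12.0],
})

def label_value(total):
    """Custom business rule: flag customers who spent more than $40 in total."""
    return "high" if total > 40 else "standard"

summary = (
    orders.groupby("customer", as_index=False)["amount"].sum()
          .assign(segment=lambda df: df["amount"].apply(label_value))
          .sort_values("amount", ascending=False)
)
print(summary)
```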

Step 2 – Build your portfolio.

  • Recruiters and hiring managers don’t know you exist, and having an online portfolio is the best way to attract their attention. Also, once employers do come calling, they will want to evaluate your technical expertise, so a portfolio helps.
  • The best way to showcase your value to potential employers is to establish your brand via projects on Github, LinkedIn and your website.
  • If you do not have your own website, create one for free using WordPress or Wix.
  • Stumped on what to post in your project portfolio?
  • Step 1 – Start by browsing the Kernels section on www.kaggle.com; there are tons of folks who have leveraged free datasets to create interesting visualizations. Also enroll in any active competitions and navigate to the discussion forums. You will find very generous folks who have posted starter scripts and detailed exploratory analysis. Fork a script and try to replicate the solution. My personal recommendation would be to begin with the Titanic contest or the housing prices set (a bare-bones Titanic starter follows this list). My professional website journeyofanalytics also houses some interesting project tutorials, if you want to take a look.
  • Step 2 – Pick a similar dataset from Kaggle or any other open-source site, and apply the code to the new dataset. Bingo, a totally new project and ample practice for you.
  • Step 3 – Work your way up to image recognition and text processing.
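To show how small a first Kaggle project can be, here is a bare-bones Titanic starter of the kind you will find in public kernels. It assumes you have downloaded train.csv from the Titanic competition page, and the feature choices are deliberately minimal; treat it as a starting point, not a solution.

```python
# Bare-bones Titanic starter (assumes train.csv downloaded from kaggle.com/c/titanic).
# Minimal feature prep, one model, cross-validated accuracy; a starting point only.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")

train["Sex"] = (train["Sex"] == "female").astype(int)      # encode sex as 0/1
train["Age"] = train["Age"].fillna(train["Age"].median())  # fill missing ages
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]

model = RandomForestClassifier(n_estimators=200, random_state=42)
scores = cross_val_score(model, train[features], train["Survived"], cv=5)
print("Cross-validated accuracy:", scores.mean().round(3))
```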

Step 3 – Apply for jobs strategically.

  • Please don’t randomly apply to every single data science job in the country. Be strategic: use LinkedIn to reach out to hiring managers. Remember, it’s better to hear “NO” directly from the hiring manager than to apply online and wait for eternity.
  • Competition is getting fierce, so be methodical. Books like “Data Science Jobs” will help you pinpoint the best jobs in your target city, and also connect with hiring managers for jobs that are not posted anywhere else.
  • Yes, I wrote the book listed above – this is the book I wish I had when I started in this field! Unlike other books on the market full of random generalizations, this book is written specifically for jobseekers in the data science field. Plus, I’ve successfully helped a dozen folks land lucrative jobs (data analyst/data scientist roles) using the strategies outlined in it. This book will help you cut your data science job search time in half!
  • Upwork is a fabulous site for getting gigs to tide you over until you are hired full-time. It is also a fabulous way of being unique and standing out to potential employers! As a recruiter once told me, “it is easier to hire someone who already has a job, than to evaluate someone who doesn’t!”
  • If your first job is not your dream job, do not despair. Earn and learn: every company, big or small, will teach you valuable skills that will help you get better and snag your ideal role next year. I do recommend staying in a role for at least 12 months before switching, otherwise you won’t have anything impactful to discuss in the next interview.

Step 4 – Continuous learning.

  • Even if you’ve landed the “data scientist” job you always wanted, you cannot afford to rest on your laurels. Keep your skills current by attending online classes and conferences and reading up on tech changes. Udemy, again, is my go-to resource for staying abreast of technical skills.
  • Network with others to know how roles are changing, and what skills are valuable.

Finally, being in this field is a rewarding experience, and also quite lucrative. However, no one can get to the top without putting in sufficient effort and time. So master the basics and apply your skills, and you will definitely meet with success.

If you are looking to establish a career in data science, then don’t forget to take a look at my book, “Data Science Jobs”, now available on Amazon.

August Project Updates

Hello All,

The theme for August is API programming for social media platforms.

Twitter API code with R / Python

For the August project, I’ve concentrated on working with Twitter API, using both Python and R programming. The code can be downloaded from the Projects Page or forked from my Github account.

Working With APIs:

Before we look at what the code does, please note that you will first need to request Twitter developer tokens (values for consumer_key, consumer_secret, access_key and access_secret) to authorize your account to extract data from the Twitter platform. If you do not have these tokens yet, you can easily learn how to request them using the excellent documentation on the Twitter Developer website. Once you have the tokens, please modify these variables at the beginning of the program with your own values.

Second, you will need to install the appropriate twitter packages for running programs in Python and R. This makes it easy to extract data from Twitter since these packages have pre-written functions for various tasks like Twitter authorization, looking up usernames, posting to Twitter, investigating follower counts, extracting profile data in json format, and much more.

“Tweepy” is the package for Python and “twitteR” for R programs, so please install them locally.
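For reference, a minimal Tweepy authorization sketch is below (the classic OAuth 1.0a flow); replace the placeholder strings with your own tokens. The verify_credentials call is just a sanity check that the keys work.

```python
# Minimal Tweepy authorization sketch; replace the placeholders with your own tokens.
import tweepy

consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_key = "YOUR_ACCESS_KEY"
access_secret = "YOUR_ACCESS_SECRET"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

me = api.verify_credentials()            # sanity check that the tokens are valid
print(me.screen_name, me.followers_count)
```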

 

Tracking Twitter Follower Growth:

Although Twitter provides a great way to view your own follower growth, there is no way to download or track this data locally. The Python program (twitter_follower_ct_ver4.py) added in this month’s code does just that – it extracts the follower count and stores it to a CSV file. This makes it possible to track the (historical) growth or decline of a Twitter follower count over a period of time, starting from today.

With this program you can monitor your own account and other Twitter handles as well! Of course, you can’t go back in time to view older counts, but hey, at least you have started. Plus, you can manually add values for your own accounts.
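Below is a minimal sketch of the same idea (not the actual twitter_follower_ct_ver4.py script): look up the current follower count for a few handles and append it, with today’s date, to a CSV. The handles and token placeholders are illustrative.

```python
# Minimal follower-count tracker: append today's counts for a few handles to a CSV.
# Not the post's original script; handles and tokens are placeholders.
import csv
from datetime import date
import tweepy

auth = tweepy.OAuthHandler("YOUR_CONSUMER_KEY", "YOUR_CONSUMER_SECRET")
auth.set_access_token("YOUR_ACCESS_KEY", "YOUR_ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

handles = ["phillydotcom", "your_handle_here"]

with open("follower_counts.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for handle in handles:
        user = api.get_user(screen_name=handle)
        writer.writerow([date.today().isoformat(), handle, user.followers_count])
```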

File tracking Twitter follower count

(Technically, for twitter handles you do not own, you could get the date of joining of every follower and then deduce when they possibly followed someone. A post for another day, though! )

Extracting Data about Twitter Followers

Follower count is great, but you also want to know the detailed profile of your followers and other interesting twitter accounts. Who are these followers? Where are they located?

There are 2 R programs in the August Project which help you gather this information.

The first (followers_v2.R) extracts a list of all follower ids for a specific Twitter account and stores it to a file. The Twitter API returns at most 5,000 ids per request for such queries, so this program uses cursor pagination to pull out the information in chunks of 5,000 per iteration. Think of the list of follower ids like the pages of a book – some books are thicker, so you have to turn more pages! Similarly, if a Twitter account has very few followers, the program completes in 1-2 iterations!

The program example works on the twitter account “@phillydotcom” which has >180k followers.  The cursor iteration process itself is implemented using a simple “while” loop.
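The Python equivalent of that cursor loop, using Tweepy, looks roughly like this; Tweepy’s Cursor object handles the 5,000-id pages for you. The method name has shifted across Tweepy versions (followers_ids in 3.x, get_follower_ids in 4.x), and the `api` object is the one authorized earlier.

```python
# Cursor pagination in Tweepy: collect all follower ids, 5,000 per page.
# Assumes the authorized `api` object from the earlier sketch; method name is
# followers_ids in Tweepy 3.x (get_follower_ids in 4.x).
import tweepy

follower_ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name="phillydotcom").pages():
    follower_ids.extend(page)            # each page holds up to 5,000 ids
print(len(follower_ids), "follower ids collected")
```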

Twitter follower details

The second R program (dets_followers_v2.R) uses the list of follower_ids to pull in detailed information about the followers. For the scope of this project I am only deriving the screen name, username, location and follower count for all of my followers. Details are stored in a tabular format as shown in the image alongside. You can use this data to geographically segment your Twitter followers, analyze “influencer” followers (users with 25,000 or more followers) and lots more.
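A rough Python counterpart of that detail-extraction step (the original is the R script dets_followers_v2.R): hydrate the follower ids in batches of 100, which is the limit for the users/lookup endpoint, and keep only the fields mentioned above. The argument name user_ids matches Tweepy 3.x; check your version’s docs, and note this assumes the `api` object and `follower_ids` list from the earlier sketches.

```python
# Hydrate follower ids in batches of 100 and keep screen name, name, location and
# follower count. Assumes `api` and `follower_ids` from the earlier sketches;
# Tweepy 3.x argument shown (user_ids).
import pandas as pd

rows = []
for i in range(0, len(follower_ids), 100):
    for user in api.lookup_users(user_ids=follower_ids[i:i + 100]):
        rows.append({
            "screen_name": user.screen_name,
            "name": user.name,
            "location": user.location,
            "followers_count": user.followers_count,
        })

pd.DataFrame(rows).to_csv("follower_details.csv", index=False)
```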

Please take a look at the code and provide your valuable feedback and comments in the comments section.
