[Updated as of Jan 31, 2020]
There is no doubt that having a project portfolio is one of the best ways to master Data Science, whether you aspire to be a data analyst, machine learning expert or data visualization ninja! In fact, students and job seekers who showcase their skills with a unique portfolio find it easier to land lucrative jobs faster than their peers! (For project ideas, check this post; for job search advice, look here.)
To create a custom portfolio, you need good data. So this post presents a list of the Top 50 websites for gathering datasets to use in your projects in R, Python, SAS, Tableau or other software. Best part: these datasets are all free, free, free! (Some might require you to create a login.)
The datasets are divided into broad categories as below:
- Government & UN/ Global Organizations
- Academic Websites
- Kaggle & Data Science Websites
- Curated Lists
Government and UN/World Bank websites:
-  US government database with 190k+ datasets – link. These include county-level data on demographics, education/schools and economic indicators; lists of museums and recreational areas across the country; agriculture, weather and soil data; and so much more!
-  UK government database with 25k+ datasets. Similar to the US site, but from the UK government.
-  Canada government database. Data for Canada.
-  Centers for Disease Control and Prevention (CDC) – link
-  Bureau of Labor Statistics – link
-  NASA datasets – link
-  World Bank Data – link
-  World Economic Forum. I like the white paper style reports on this site too. They teach you how to think about which questions to answer using different datasets, as well as how to present results in a meaningful way! This is an important skill for senior data scientists, academics and analytics consultants, so take a look.
-  UN database with 34 datasets and 60 million records – link. Data by country and region.
-  European Commission open data – link
-  NIST – link
-  U.S. National Epidemiological Survey on Alcohol and Related Conditions (NESARC) – a dataset from a survey designed to determine the magnitude of alcohol use and psychiatric disorders in the U.S. population. The dataset and descriptive codebook are available here.
-  Plants Checklist from US Department of Agriculture – link.
Academic websites:
-  Yelp academic data – link
-  Univ of California, Irvine Machine Learning Repository – link
-  Harvard Univ: link
-  Harvard Dataverse database: link
-  MIT – link1 and link2
-  Univ of North Carolina, adolescent health – link
-  Mars Crater Study, a global database that includes over 300,000 Mars craters 1 km or larger. Link to Descriptive guide and dataset.
-  Click Dataset from Indiana University (~2.5TB dataset) – link.
-  Pew Research Data – Pew Research is an organization that conducts research on topics of public interest. Their studies gauge trends in multiple areas such as internet and technology, global attitudes, religion and social/demographic trends. Astonishingly, they not only publish these reports but also make all their datasets publicly available for download!
-  Million Song Dataset from Columbia University, including data related to the song tracks and their artists/composers.
Kaggle & Data Science resources:
A few of my favorite datasets from the Kaggle website are listed here. Please note that Kaggle recently announced an Open Data platform, so you may see many new datasets there in the coming months.
-  Walmart recruiting at stores – link
-  Airbnb new user booking predictions – link
-  US Dept. of Education scorecard – link
-  Titanic Survival Analysis – link
-  edX – link
-  Enron email information and data – link
-  Quandl – an excellent source for stock data. This site has both FREE and paid datasets.
-  Gapminder – link
Curated lists:
-  KDnuggets provides a great list of datasets from almost every field imaginable – space, music, books, etc. It may repeat some datasets from the list above – link
-  Reddit datasets – Users have posted an eclectic mix of datasets about gun ownership, NYPD crime rates, college student study habits and caffeine concentrations in popular beverages.
-  Data Science Central has also curated many datasets for free – link
-  List of open datasets from DataFloq – link
-  Sammy Chen's (@transwarpio) curated list of datasets. This list is categorized by topic, so definitely take a look.
-  DataWorld – This site also has a list of paid and FREE datasets. I have not used the site, but have heard good reviews of the community.
-  MRI brain scan images and data – link
-  Internet Usage Data from the Center for Applied Internet Data Analysis – link.
-  Google repository of digitized books and ngram viewer – link.
-  Database with geographical information – link
-  Yahoo offers some interesting datasets, the caveat being that you need to be affiliated with an accredited educational organization (as a student or professor). You can view the datasets here.
-  Google Public Data – Google has a search engine specifically for searching publicly available data. This is a good place to start, as you can search a large number of datasets in one place. Of course, there is a NEWER link that went live a couple of days ago! 🙂
-  Public datasets from Amazon – see link.
Make sure you attribute the datasets to their original source sites. Happy vizzing and coding! 🙂