Here are the top 50 websites to gather datasets for your data science projects in R, Python, SAS, Excel or any other programming language or statistical software. Best part: these are all free, free, free!
The datasets are divided into the broad categories below:
- Government & UN/Global Organizations
- Academic Websites
- Kaggle & Data Science Websites
- Curated Lists
Government and UN/World Bank websites:
- US government database with 190k+ datasets – link
- UK government database with 25k+ datasets
- Canada government database
- FBI crime statistics
- Centers for Disease Control and Prevention (CDC) – link
- Bureau of Labor Statistics – link
- NASA datasets – link
- World Bank Data – link
- World Economic Forum
- UN database with 34 sets and 60 million records – link
- European Commission open data – link
- NIST – link
- National Center for Education Statistics – link
- U.S. National Epidemiological Survey on Alcohol and Related Conditions (NESARC) – a survey dataset designed to determine the magnitude of alcohol use and psychiatric disorders in the U.S. population. The dataset and descriptive codebook are available here.
- Plants Checklist from the US Department of Agriculture – link
Academic websites:
- Yelp academic data – link
- University of California, Irvine – link
- Harvard University – link
- Harvard Dataverse database – link
- MIT – link1 and link2
- University of North Carolina, adolescent health – link
- Mars Crater Study, a global database that includes over 300,000 Mars craters 1 km or larger. Link to Descriptive guide and dataset.
- Click Dataset from Indiana University (~2.5TB dataset) – link
- Pew Research Data – Pew Research is an organization focused on research on topics of public interest. Their studies gauge trends in multiple areas such as internet, technology trends, global attitudes, religion and social/ demographic trends. Astonishingly, they not only publish these reports but also make all their datasets publicly available for download!
- Million Song Dataset from Columbia University, including data related to the song tracks and their artists/composers.
Kaggle & Data Science resources:
A few of my favourite datasets from the Kaggle website are listed here. Please note that Kaggle recently announced an Open Data platform, so you may see many new datasets there in the coming months.
- Walmart recruiting at stores – link
- Airbnb new user booking predictions – link
- US Department of Education scorecard – link
- Titanic Survival Analysis – link
- Databits.io – link
- edX – link
- Airbnb – link
- Datasets on climate information, human genome data, Enron email information, etc. – link
- Gapminder – link
Curated lists:
- KDnuggets provides a great list of datasets from almost every field imaginable – space, music, books, etc. It may repeat some datasets from the list above – link
- An eclectic mix of datasets about gun ownership, NYPD crime rates, college student study habits and caffeine concentrations in popular beverages – link
- Data Science Central has also curated many datasets for free – link
- List of open datasets from DataFloq – link
- Sammy Chen's (@transwarpio) curated list of datasets. This list is categorized by topic, so definitely take a look.
- MRI brain scan images and data – link
- Economic, education, health and other datasets from Quandl. Please note this site also offers a premium version with additional datasets.
- Google repository of digitized books and ngram viewer – link
- Database with geographical information – link
- Loan information from Lending Club – link
- Google Public Data – Google has a search engine specifically for searching publicly available data. This is a good place to start, as you can search a large number of datasets in one place.
- Statista – This site aggregates thousands of data sets and offers access as a paid service. However, some of the data sets are available for free.
- Internet Usage Data from the Center for Applied Internet Data Analysis – link
- Yahoo offers some interesting datasets, the caveat being that you need to be affiliated with an accredited educational organization (as a student or professor). You can view the datasets here.
- Enron Emails aggregated as a dataset.
- Public datasets from Amazon – see link.