Journey of Analytics

Deep dive into data analysis tools, theory and projects


50+ free Datasets for Data Science Projects

[Updated as on Jan 31, 2020]


There is no doubt that building a project portfolio is one of the best ways to master data science, whether you aspire to be a data analyst, machine learning expert or data visualization ninja! In fact, students and job seekers who showcase their skills with a unique portfolio land lucrative jobs faster than their peers. (For project ideas, check this post; for job search advice, look here.)

To create a custom portfolio, you need good data. So this post presents a list of 50+ sources of datasets to use for your projects in R, Python, SAS, Tableau or other software. Best part: these datasets are all free, free, free! (Some may require you to create a login.)

The datasets are divided into 5 broad categories as below:

  1. Government & UN/ Global Organizations
  2. Academic Websites
  3. Kaggle & Data Science Websites
  4. Curated Lists
  5. Miscellaneous

Government and UN/World Bank websites:

  • [1] US government database with 190k+ datasets – link. These include county-level data on demographics, education/schools and economic indicators; lists of museums & recreational areas across the country; agriculture, weather and soil data; and much more!
  • [2] UK government database with 25k+ datasets. Similar to the US site, but for the UK.
  • [3] Canadian government database, with data for Canada.

  • [4] Center for Disease Control – link
  • [5] Bureau of Labor Statistics – link
  • [6] NASA datasets – link 
  • [7] World Bank Data – link 
  • [8] World Economic Forum. I like the white-paper style reports on this site too. They teach you how to decide which questions to answer using different datasets, as well as how to present results in a meaningful way. This is an important skill for senior data scientists, academics and analytics consultants, so take a look.
  • [9] UN database with 34 sets and 60 million records – link . Data by country and region.
  • [10] EU commission open data – link
  • [11] NIST – link
  • [12] U.S. National Epidemiological Survey on Alcohol and Related Conditions (NESARC) – dataset from survey to determine magnitude of alcohol use and psychiatric disorders in the U.S. population. The dataset and descriptive codebook are available here.
  • [13] Plants Checklist from US Department of Agriculture – link .

Academic websites:

  • [14] Yelp academic data – link
  • [15] Univ of California, Irvine Machine Learning Repository – link
  • [16] Harvard Univ: link
  • [17] Harvard Dataverse database: link 
  • [18] MIT – link1 and  link2
  • [19] Univ of North Carolina, adolescent health – link
  • [20] Mars Crater Study, a global database that includes over 300,000 Mars craters 1 km or larger. Link to Descriptive guide and dataset.
  • [31] Click Dataset from Indiana University (~2.5TB dataset) – link .
  • [32] Pew Research Data – Pew Research is an organization that studies topics of public interest. Their studies gauge trends in areas such as the internet and technology, global attitudes, religion, and social/demographic trends. Astonishingly, they not only publish these reports but also make all their datasets publicly available for download!
  • [33] Million Song Dataset from Columbia University, including data on the song tracks and their artists/composers.

Kaggle & Datascience resources:

A few of my favorite datasets from the Kaggle website are listed here. Please note that Kaggle recently announced an Open Data platform, so you may see many new datasets there in the coming months.

  • [34] Walmart recruiting at stores – link
  • [35] Airbnb new user booking predictions – link
  • [36] US dept of education scorecard – link
  • [37] Titanic Survival Analysis – link
  • [38] Edx – link
  • [39] Enron email information and data – link
  • [40] Quandl – an excellent source for stock data. This site has both FREE and paid datasets.
  • [41] Gapminder – link

Curated Lists:


  • [42] KDnuggets provides a great list of datasets from almost every field imaginable – space, music, books, etc. Some datasets may repeat from the list above – link.
  • [43] Reddit datasets – Users have posted an eclectic mix of datasets about gun ownership, NYPD crime rates, college student study habits and caffeine concentrations in popular beverages.
  • [44] Data Science Central has also curated many datasets for free – link
  • [45] List of open datasets from DataFloq – link
  • [46] Sammy Chen's (@transwarpio) curated list of datasets. The list is categorized by topic, so definitely take a look.
  • [47] DataWorld – This site also has a mix of paid and FREE datasets. I have not used it myself, but have heard good reviews about the community.

Others:

  • [48] MRI brain scan images and data – link
  • [49] Internet Usage Data from the Center for Applied Internet Data Analysis –link .
  • [50] Google repository of digitized books and ngram viewer – link.
  • [51] Database with geographical information – link
  • [52] Yahoo offers some interesting datasets; the caveat is that you need to be affiliated with an accredited educational organization (as a student or professor). You can view the datasets here.
  • [53] Google Public Data – Google has a search engine specifically for publicly available data. This is a good place to start, as you can search a large number of datasets in one place. Of course, there is a NEWER link that went live a couple of days ago! 🙂
  • [54] Public datasets from Amazon – see link.

Make sure you attribute the datasets to their original source sites. Happy vizzing and coding! 🙂

Yelp College Search – Shiny Based App

As part of this month’s API theme, we will work with the Yelp API using R and Shiny to create a college search app to explore colleges near a specific city or zip code.

(You can view the live Shiny app here: https://anupamaprv.shinyapps.io/yelp_collegeapp/ )

Yelp College Search App

As you are all aware, Yelp is a platform that lets you search for myriad businesses (restaurants, theme parks, colleges, auto repair shops, professional services, etc.) by name or location and, importantly, view honest reviews from customers who have used those services. With 135 million monthly visitors and 95 million reviews, it is equivalent to (if not better than) Google reviews – a LinkedIn of sorts for businesses, if you think of it that way.

With the new academic year almost upon us, it makes sense that students and/or parents would benefit from using this site to explore their options, although it should NOT be relied upon as the only source of truth for educational or career decisions!

With that in mind, we will create a web application that will accept two user inputs and display results in an output window with three panes.
Inputs:

  • City name or zip code
  • Search radius

Output panes:

  • Tab pane 1 – Display user selection as text output
  • Tab pane 2 – Map view of the selected location with markers for each college.
  • Tab pane 3 – Tabular view displaying college name, number of yelp reviews, overall yelp_rating and phone number of the college.

 

With that said, this tutorial will walk through the following tasks:

Step 1 – Working with Yelp API

Yelp uses the OAuth 1.0a method for authentication, which is explained best on their developer website itself; a link is provided here. As with all API access requests, you will need a Yelp developer account and a dummy “app” to request permission keys from this developer page. Note: if you have never created a Yelp account, please do so now.

Unlike the Facebook or Twitter APIs, we do not use a package specific to Yelp. Instead, we use the httr package, which allows us to send a GET() request to the API. You can easily work out how to build queries by exploring the API console itself; however, to analyze or process the results, you do need a script. A sample query to search for colleges near Philadelphia is shown below:

https://api.yelp.com/v2/search/?location=philadelphia&radius_filter=10000&category_filter=collegeuniv

The steps to receive Yelp authorization are as follows:

  • Store the access tokens – Consumer_key, Consumer_Secret, Token and Token_Secret – and request clearance using the code below (key/secret values come from your Yelp developer page):

myapp <- oauth_app("YELP", key = consumerKey, secret = consumerSecret)   # register the dummy app
sig <- sign_oauth1.0(myapp, token = token, token_secret = token_secret)  # sign requests with OAuth 1.0a
resultsout <- GET(yelpurl, sig)   # yelpurl holds a search query such as the sample above
  • Process the results into json format and then convert to a usable dataframe

We want the query to change based on the user inputs, so we will put the query inside a search function, as below. The function also converts the returned results from a nested list into a readable dataframe. To limit the results, we drop educational institutions with fewer than 3 reviews.

yelp_srch <- function(radius_miles, locn, n)
{
  # convert search radius from miles into meters
  radius_meters <- radius_miles * 1609.34

  # create composite Yelp API query
  # (api_part1/api_part2/api_part3 hold the fixed URL fragments of the search query)
  querycomposite2 <- paste0(api_part1, locn, api_part2, radius_meters, api_part3)
  resultsout <- GET(querycomposite2, sig)

  # parse the JSON response into a dataframe
  collegeDataContent = content(resultsout)
  collegelist = jsonlite::fromJSON(toJSON(collegeDataContent))
  collegeresultsp <- data.frame(collegelist)

  colnames(collegeresultsp) = c("lat_delta", "long_delta", "latitude", "longitude",
                                "total", "claim_biz", "yelp_rating", "mobile_url",
                                "image_url", "No_of_reviews", "College", "image_url_small",
                                "main_weblink", "categories", "phone", "short_text",
                                "biz_image_url", "snippet_url", "display_phone",
                                "rating_image_url", "biz_id", "closed", "location_city")

  # keep only the college name, rating, review count and display phone
  varseln <- c(11, 7, 10, 19)
  collegeset <- subset(collegeresultsp, select = varseln)

  # drop institutions with too few reviews and re-index the rows
  tk <- subset(collegeset, No_of_reviews > n)
  rownames(tk) <- seq_len(nrow(tk))

  return(tk)
}
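Assuming the URL fragments (api_part1/api_part2/api_part3) and the sig object from Step 1 are defined, a call might look like this (the argument values are illustrative):

philly_colleges <- yelp_srch(radius_miles = 6, locn = "philadelphia", n = 3)
head(philly_colleges)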

 

At this time, we are not adding any error handling, since the processing occurs only when the user hits the “Analyze” button.

 

Step 2 – Creating the Shiny Application

Like all Shiny applications, the ui.R file specifies the layout of the web application. As described in the introduction, we have the input pane on the left and a three-pane tabbed output on the right. The server.R file implements the logic and function calls to collate and format the data needed to populate the three panes.

The input pane uses a text input and a slider input, both of which are straightforward implementations based on the code provided in the official Shiny widget gallery.

On the output pane, we display the Yelp logo and review-star images to indicate that the data is being pulled from Yelp, and to comply with the display requirements under their Terms of Use. The first “Tab1” tab is also pretty simple and echoes the user input once the “Analyze” button is clicked.
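A minimal sketch of what such a ui.R layout might look like (the widget and output ids are assumed, not the app's actual names):

library(shiny)
library(leaflet)

ui <- fluidPage(
  titlePanel("Yelp College Search"),
  sidebarLayout(
    sidebarPanel(
      textInput("locn", "City name or zip code", value = "philadelphia"),
      sliderInput("radius", "Search radius (miles)", min = 1, max = 10, value = 5),
      actionButton("analyze", "Analyze")
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("Tab1", textOutput("selection")),
        tabPanel("Plot", leafletOutput("collegemap")),
        tabPanel("Table", tableOutput("collegetable"))
      )
    )
  )
)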

Shiny app – input pane & output tab1

The second “Plot” pane creates a map view of the selected location using the leaflet package. The markers may appear individually or clustered, depending on how many results are returned. Note that the Yelp API limits the number of results, so some queries may be truncated; this is also why the search radius slider offers only a small range. The data for this pane is pulled directly from Yelp, using a search function similar to the code provided at the beginning of this post.
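A rough server-side sketch for this pane, assuming a variant of the search function (here called yelp_srch_geo, a hypothetical name) that also retains the latitude and longitude columns:

output$collegemap <- renderLeaflet({
  # hypothetical search variant that keeps latitude/longitude
  colleges <- yelp_srch_geo(input$radius, input$locn, 3)
  leaflet(data = colleges) %>%
    addTiles() %>%
    addMarkers(~longitude, ~latitude, popup = ~College,
               clusterOptions = markerClusterOptions())
})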

Shiny app output pane2

If you are interested in exploring more options with the leaflet view, like adding zoom capabilities or custom messages on the popup markers, then please take a look at this other post on our old blog – Graphical Data Exploration.

The third pane is a tabular view of search results using the yelp_srch() function added above.
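On the server side, this pane can be as simple as the following sketch (the output id is assumed):

output$collegetable <- renderTable({
  yelp_srch(input$radius, input$locn, 3)
})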

Shiny app output pane3 – tabular view of results

Step 3 – Re-purposing the code

Honestly, this Yelp API code can be re-purposed for myriad other uses, as below:

  • Create a more detailed search app that allows users to add more inputs to search for other business categories or locations.
  • Add data from other APIs, like Facebook, to create an even more data-rich search directory with social proof from Facebook likes, Yelp star ratings, etc.
  • If you run a hotel website, you could embed a Yelp search app to help users see a map view of restaurants and places of interest. This would help users realize how close they are to historical sites, major highway routes, the city downtown, entertainment options, etc.
  • A travel website could plot a map view of interactive itineraries, so users could select options based on whether they are travelling with kids/seniors/students, etc., while taking comfort in the knowledge that the places are truly worth visiting! 🙂
  • Most APIs use OAuth methods for authentication, so you could easily modify this code to access data from other sites in an easy, legal way. (Please do read the Terms of Service for any such usage, though.)

 

As always, the entire source code for this analysis is FREELY available as yelp_api_project or can be forked from the link on GitHub. Please take a look and share your thoughts and feedback.

Until next time, adieu! 🙂

Twitter Analysis – Rio2016 Olympics


Olympics season is in full swing. In keeping with the spirit of this pinnacle of sport, we will use the Twitter API to extract tweets related to Rio2016 and analyze them for insights.

In this post, we will work through the following steps:

 

Step 1 – Connecting to Twitter API

We will use R to perform the analysis, with Twitter API keys (learn more about how to request these keys here) and the excellent “twitteR” package to gain authorization for data extraction from the Twitter platform.

Code for authorization is below:

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

Step 2 – Search Twitter API for specific tags

We will search Twitter for all tweets with the tag “#TeamUSA”.
Twitter puts some constraints on how much data can be extracted with each API call, so we limit our search to 2000 tweets. To ensure recency, we specify that the tweets should have been posted after Aug 1, 2016. Code snippet below:

tw_search = searchTwitter('#TeamUSA', n=2000, since='2016-08-01', geocode='39.9526,-75.1652,50mi')

Note, the “geocode” option in the command above is optional, but I added it to favor tweets from users whose profile location is near Philadelphia, ensuring that local coverage by NBC/Fox is definitely picked up! We save the tweets in an RDS file for easy access.

saveRDS(tw_search, 'USteam_olympics.rds')

 

Step 3 – Cleaning up and processing the tweets

First, we remove all special characters and emojis from the tweets using the sapply() and iconv() functions.

# convert the list of status objects into a data frame (twListToDF is from the twitteR package)
tweet_doc <- twListToDF(tw_search)
tweet_doc$text <- sapply(tweet_doc$text, function(row) iconv(row, "latin1", "ASCII", sub=""))

We then convert the created time to a common timezone so that tweets can be aggregated consistently; the code below uses US Central time (one hour behind Philadelphia/NYC).

# with_tz() is from the lubridate package
tweet_doc$Riotime = with_tz(tweet_doc$created, 'America/Chicago')
tweet_doc$strptime = as.POSIXct(strptime(tweet_doc$Riotime, "%Y-%m-%d %H:%M:%S"))

The as.POSIXct() conversion allows us to aggregate tweets by hour, date, minute, etc., which we can derive as below:

tweet_doc$day = as.numeric(format(tweet_doc$strptime, "%d"))
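The hour variable used in the plots below can be derived the same way (the format string is assumed):

tweet_doc$hour = as.numeric(format(tweet_doc$strptime, "%H"))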

We add a new variable to capture the digital device type used for these tweets, using the device URL Twitter provides in the “statusSource” column.

par(mar = c(3, 3, 3, 2))
tweet_doc$statusSource_new = substr(tweet_doc$statusSource, regexpr('>', tweet_doc$statusSource) + 1,
                                    regexpr('</a>', tweet_doc$statusSource) - 1)

 

Step 4 – Graphical Insight

Plot 1: Tweets by hour of day:

gptime <- ggplot(tweet_doc, aes(hour)) + geom_bar(aes(fill = isRetweet)) + xlab('Tweets by hour')
ggplotly(gptime)

We notice that the number of tweets increases as the evening progresses, with peak frequency at about 9 pm CDT (graph below).

#TeamUSA tweets by hour

Plot 2: Tweets by device type:

gp <- ggplot(tweet_doc, aes(x = statusSource_new, fill = isRetweet)) + geom_bar()
ggplotly(gp)

The graph clearly shows iPhones dominating the user base.

Tweets by device used

 

Plot 3: Emotional Valence

We extract the emotional sentiment of tweets using a custom function:

# polarity() is from the qdap package; %>% is the magrittr pipe.
# orig is assumed to hold the original (non-retweet) tweets.
polfn = lapply(orig$text, function(txt) {
  # strip sentence enders so each tweet is analyzed as a single sentence,
  # and +'s which muck up the regex
  gsub('(\\.|!|\\?)\\s+|(\\++)', ' ', txt) %>%
    # strip URLs
    gsub(' http[^[:blank:]]+', '', .) %>%
    # calculate polarity
    polarity()
})

Applying this, we get the most positive tweet:

“That looked like a very easy win for #TeamUSA  #beachvolleyball #Rio2016”

The most negative tweet:

“I think it’s a very odd sport but damn those guys are fit #Rio2016 #waterpolo #TeamUSA”

Last, we plot a graph to display how the emotional valence changes over the day:
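A rough sketch of how such a plot could be built from the polarity output (the column handling is assumed; the original script may differ):

# pull the average polarity score out of each polarity() result
orig$emotionalValence <- sapply(polfn, function(p) mean(p$all$polarity))

ggplot(orig, aes(x = Riotime, y = emotionalValence)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  xlab("Time of day") + ylab("Emotional valence")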

Emotional valence change in tweets

Plot 4: Word Cloud:

Word cloud for #TeamUSA tweets

We use the “text” column from the tweet_doc object to create a word dictionary of the tweets, after removing punctuation and unwanted characters. The size of each word increases with its frequency of appearance in the tweets. The image alongside shows such a wordcloud, with the highlighted words indicating high-frequency phrases.

# Corpus() and tm_map() are from the tm package; wordcloud() is from the wordcloud package
pal <- brewer.pal(8, "Dark2")   # colour palette from RColorBrewer; any palette works here
wordCorpus <- Corpus(VectorSource(tweet_doc$text))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordcloud(words = wordCorpus, max.words = 500, random.order = FALSE,
          rot.per = 0.35, use.r.layout = FALSE, colors = pal)

 

Plot 5: Sentiment Graph:

We use the “syuzhet” library to assign an emotional value to each of the 2000 tweets we extracted, using the get_nrc_sentiment() function.

mySentiment <- get_nrc_sentiment(tweet_doc$text)

This assigns numeric values to each tweet indicating the various emotions expressed in it – anger, anticipation, fear, joy, etc. We then add these values back to the tweet_doc object and compute column totals to derive the overall weight for each emotion. The plotting code and resulting image for the overall sentiment scores are shown below:
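The plot below assumes a sentimentTotals data frame with sentiment and count columns; a minimal sketch of how it might be derived from the get_nrc_sentiment() output:

# total up each emotion column returned by get_nrc_sentiment()
sentimentTotals <- data.frame(
  sentiment = names(mySentiment),
  count = colSums(mySentiment)
)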

ggplotly(ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score for All Tweets"))

 

Overall Sentiment Scores – #TeamUSA

We can also use these scores to see whether positive/negative sentiments change with the time of day or the date. For our tag “#TeamUSA” this is patently true, as seen in the graph below: positive tweets peak at noon on Opening Ceremony day (Aug 5), while negative sentiment peaks on the morning of Aug 7.
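A rough sketch of how such a view could be derived, assuming the nrc score columns were added back onto tweet_doc (column names assumed):

library(dplyr)

sentiment_by_hour <- tweet_doc %>%
  group_by(hour) %>%
  summarise(positive = sum(positive), negative = sum(negative))

ggplot(sentiment_by_hour, aes(x = hour)) +
  geom_line(aes(y = positive), colour = "darkgreen") +
  geom_line(aes(y = negative), colour = "red") +
  xlab("Hour of day") + ylab("Total sentiment score")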

Sentiment by time

 

Step 5 – Using this for brand monitoring

The steps used in the analysis above can be easily modified for monitoring your brand, blog or product, as explained below:

  • Instead of “#TeamUSA”, we can use any other tag, company/blog name, product, or other relevant tags to mine Twitter for tweets.
  • Periodically monitor the tweets about your product or brand to ensure that your “sentiment graph” always trends toward positive emotions. If not, ensure your staff is working diligently to address any negative tweets or concerns among your users.
  • The “Tweets by hour of day” analysis could be used to see what time your users/audience are most active. You could use this insight to publish more content during that window and to ensure your customer support is available during this period to effectively engage your audience.
  • If your “device type” graph indicates a skew toward specific devices (e.g., particular Android phone models), make sure your content renders correctly for those mobile users.
  • The high-frequency words in the “wordcloud” indicate trending topics, so they can serve as great ideas for new content topics or short-term ads to ride the publicity wave! 🙂

 

The entire source code for this analysis is available as blog_twitter_olympics or can be forked from the GitHub page. Please take a look and share your thoughts and feedback. Until next time, adieu! 🙂

August Project Updates

Hello All,

The theme for August is API programming for social media platforms.

Twitter API code with R/Python

For the August project, I've concentrated on working with the Twitter API, using both Python and R. The code can be downloaded from the Projects page or forked from my GitHub account.

Working With APIs:

Before we dig into what the code does, please note that you will first need to request Twitter developer tokens (values for consumer_key, consumer_secret, access_key and access_secret) to authorize your account to extract data from the Twitter platform. If you do not have these tokens yet, you can easily learn how to request them using the excellent documentation on the Twitter Developer website. Once you have the tokens, please update these variables at the beginning of the program with your own values.

Second, you will need to install the appropriate Twitter packages for running the programs in Python and R. These make it easy to extract data from Twitter, since the packages provide pre-written functions for tasks like Twitter authorization, looking up usernames, posting to Twitter, investigating follower counts, extracting profile data in JSON format, and much more.

“Tweepy” is the package for Python and “twitteR” for R programs, so please install them locally.

 

Tracking Twitter Follower Growth:

Although Twitter provides a great way to view your own follower growth, there is no way to download or track this data locally. The Python program (twitter_follower_ct_ver4.py) included in this month's code does just that – it extracts the follower count and stores it in a CSV file. This makes it possible to track the growth or decline of a Twitter follower count over time, starting from today.

With this program, you can monitor your own account and other Twitter handles as well! Of course, you can't go back in time to view older counts, but hey, at least you have started. Plus, you can manually add values for your own accounts.
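The post's script uses Python/Tweepy, but the same idea can be sketched in R with the twitteR package (the handles and file name below are purely illustrative, not from the original script):

# look up the current follower count for a few handles and append a row to a CSV log
handles <- c("journeyofanalytics", "phillydotcom")   # illustrative handles
counts <- sapply(handles, function(h) getUser(h)$followersCount)
log_rows <- data.frame(date = Sys.Date(), handle = handles, followers = counts)
write.table(log_rows, "follower_counts.csv", sep = ",",
            append = TRUE, row.names = FALSE, col.names = FALSE)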

File tracking Twitter follower count

(Technically, for twitter handles you do not own, you could get the date of joining of every follower and then deduce when they possibly followed someone. A post for another day, though! )

Extracting Data about Twitter Followers

Follower count is great, but you also want to know the detailed profile of your followers and other interesting twitter accounts. Who are these followers? Where are they located?

There are 2 R programs in the August Project which help you gather this information.

The first (followers_v2.R) extracts a list of all follower ids for a specific Twitter account and stores it in a file. The Twitter API returns at most 5000 ids per such query, so this program uses cursor pagination to pull the information in chunks of 5000 per iteration. Think of the list of follower ids like the pages of a book – some books are thicker, so you have to turn more pages! Similarly, if a Twitter account has very few followers, the program completes in 1-2 iterations.

The example works on the Twitter account “@phillydotcom”, which has >180k followers. The cursor iteration process itself is implemented using a simple “while” loop, roughly as sketched below.
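The general pattern looks like this; get_follower_ids_page() is a hypothetical wrapper around Twitter's followers/ids endpoint, not a function from the twitteR package:

all_ids <- c()
cursor <- -1              # Twitter uses -1 to request the first page
while (cursor != 0) {     # a next_cursor of 0 means there are no more pages
  page <- get_follower_ids_page("phillydotcom", cursor = cursor)  # hypothetical helper
  all_ids <- c(all_ids, page$ids)
  cursor <- page$next_cursor
}
writeLines(as.character(all_ids), "phillydotcom_follower_ids.txt")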

Twitter follower details

The second R program (dets_followers_v2.R) uses the list of follower ids to pull detailed information about those followers. For the scope of this project I am only extracting the screen name, username, location and follower count for each of my followers. The details are stored in tabular format, as shown in the image alongside. You can use this data to geographically segment your Twitter followers, analyze “influencer” followers (users with 25,000 or more followers) and lots more.
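A minimal sketch of this step using twitteR's lookupUsers(); the file names are assumptions, while the column selection and the 25,000-follower cutoff mirror the description above:

follower_ids <- readLines("phillydotcom_follower_ids.txt")   # ids saved by the first script (assumed file name)
details <- twListToDF(lookupUsers(follower_ids))
follower_profile <- details[, c("screenName", "name", "location", "followersCount")]
influencers <- subset(follower_profile, followersCount >= 25000)
write.csv(follower_profile, "follower_details.csv", row.names = FALSE)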

Please take a look at the code and provide your valuable feedback and comments in the comments section.

Happy House-Warming!

Welcome to the new Blog homepage for Journey of Analytics.

The old blog is still live and all old content will still be available on the previous site. So if you have bookmarked any links or pages, they will still work. However, new posts will no longer appear on the old site, so please bookmark this page as well.

Thank you for being a loyal reader of Journey of Analytics.

Happy Coding!
