Deep dive into data analysis tools, theory and projects


Crime Density Area Contour Map

Hello All,

Today's post is about geographical heat maps – where a specific variable (say ethnic groups, art colleges or crime categories) is color coded to show areas of high or low concentration.

The dataset is from the Philadelphia crime database, generously posted on Kaggle. I'm using the geographical coordinates available in this file to plot crime density maps for 4 specific crime categories. A simple function takes the crime category as input and returns a contour map, built with the ggmap library.
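As a flavour of what such a function can look like, here is a rough sketch using ggmap and ggplot2. The column names (Lon, Lat, Text_General_Code) follow the Kaggle Philadelphia crime file, but the exact map source, zoom level and aesthetics in the original script may differ.

library(ggmap)
library(ggplot2)

# Rough sketch only: column names follow the Kaggle Philadelphia crime file;
# get_map() now requires a registered API key for most map sources.
crime_density_map <- function(crime_df, crime_category) {
  subset_df <- subset(crime_df, Text_General_Code == crime_category)
  base_map  <- get_map(location = "Philadelphia, PA", zoom = 11)

  ggmap(base_map) +
    stat_density2d(data = subset_df,
                   aes(x = Lon, y = Lat, fill = ..level.., alpha = ..level..),
                   geom = "polygon") +
    scale_fill_gradient(low = "yellow", high = "red") +
    guides(alpha = "none") +
    ggtitle(paste("Crime density:", crime_category))
}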

Detailed instructions are already posted as an RMarkdown file on the RPubs website. Please take a look at the link here.

The entire source code is also available as philly_crime_density_maps, a zipped file that includes the R program (easy to modify and play with the data!) and the RMarkdown file. Please remember to download the dataset .csv file from the Kaggle website and store it in the same directory.

Philly Burglary-prone area maps

Burglary crime density area maps for Philadelphia

If you liked this post and would like to receive updates on similar projects, please do sign up for our blog updates. New projects are also added on our parent site at the beginning of every month, so do subscribe! If you think others may find this site useful, please do share this link on Twitter and other social media. Thank you.

We love hearing feedback and questions. If you have any tips or would have taken a different approach please do share your thoughts in the comments section.

Happy Coding!

Twitter Sentiment Analysis

Introduction

Today's post is a 2-part tutorial series on how to create an interactive Shiny/R application that displays sentiment analysis for various phrases and search terms. The application accepts a user search term as input and graphically displays the sentiment analysis.

In keeping with this month's theme – "API programming" – this project uses the Twitter API to perform a real-time search for tweets containing the user's input term. A link to the live app on the Shiny website is provided, and a screenshot follows:

Twitter Sentiment Analysis Shiny

Shiny application for Twitter Sentiment Analysis

The project idea may seem simple at first, but will teach you the following skills:

  • Working with the Twitter API and dynamic data streaming (every time the search term changes, the program sends a new request to Twitter for relevant tweets),
  • Building an "interactive", real-time application in Shiny/R,
  • Data visualization with R

As always, the entire source code is also available for download on the Projects Page or can be forked from my Github account here.

 

The tutorial is divided into 3 parts:

  1. Introduction
  2. Twitter Connectivity & search
  3. Shiny design

 

Application Design:

Any good software project begins with design. For this application, the design flowchart is shown below:

Design Flowchart for Shiny app

 

 

Twitter Connectivity

This is similar to the August project and mainly consists of two calls to the Twitter API:

  • Authorize the Twitter API to mine data, using the setup_twitter_oauth() function and your Twitter developer keys.

library(twitteR)

# replace these placeholders with your own Twitter developer credentials
consumer_key <- "ckey"
consumer_secret <- "csecret"
access_token <- "atoken"
access_secret <- "asecret"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

  • Check whether the input search term returns tweets containing the phrase. If the number of tweets is <= 5, return an error message; if it is > 5, process the tweets and display a sentiment analysis bar chart. A custom function performs this check:

chk_searchterm <- function(term) {
  # look for all tweets containing this search term (max 20)
  tw_search <- searchTwitter(term, n = 20, since = '2013-01-01')

  if (length(tw_search) <= 5) {
    return_term <- "None/few tweets to analyse for this search term. Please try again!"
  } else {
    return_term <- paste("Extracting max 20 tweets for Input =", term, ". Sentiment graph below.")
  }

  return(return_term)
}

The bar graph is created by assigning numeric values to each of the positive and negative emotions in the tweet text. The emotions used are anger, anticipation, disgust, joy, sadness, surprise and trust, plus overall positive and negative sentiment.
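If you want to experiment with this outside the app, a minimal sketch using the syuzhet package is shown below; the app's own implementation may differ.

library(syuzhet)

# two illustrative tweets, purely for demonstration
sample_tweets <- c("I love this team, what a great win!",
                   "This result is terrible and makes me angry.")
emotion_scores <- get_nrc_sentiment(sample_tweets)

# total weight per emotion/sentiment across all tweets
barplot(colSums(emotion_scores), las = 2, main = "Sentiment scores")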

 

Shiny webapp

The actual Shiny application design, and how it ties together with the Twitter connectivity above, is explained in the next post.

50+ free Datasets for Data Science Projects

[Updated as on Jan 31, 2020]

50+ free-datasets for your DataScience project portfolio

There is no doubt that having a project portfolio is one of the best ways to master Data Science, whether you aspire to be a data analyst, machine learning expert or data visualization ninja! In fact, students and job seekers who showcase their skills with a unique portfolio find it easier to land lucrative jobs faster than their peers. (For project ideas, check this post; for job search advice, look here.)

To create a custom portfolio, you need good data. So this post presents a list of 50+ websites where you can gather datasets for your projects in R, Python, SAS, Tableau or other software. Best part: these datasets are all free, free, free! (Some might need you to create a login.)

The datasets are divided into 5 broad categories as below:

  1. Government & UN/ Global Organizations
  2. Academic Websites
  3. Kaggle & Data Science Websites
  4. Curated Lists
  5. Miscellaneous

Government and UN/World Bank websites:

  • [1] US government database with 190k+ datasets – link. These include county-level data on demographics, education/schools and economic indicators; a list of museums & recreational areas across the country; agriculture, weather and soil data; and so much more!
  • [2] UK government database with 25k+ datasets . Similar to the US site, but from the UK government.
  • [3] Canada government database. Data for Canada.

  • [4] Center for Disease Control – link
  • [5] Bureau of Labor Statistics – link
  • [6] NASA datasets – link 
  • [7] World Bank Data – link 
  • [8] World Economic Forum. I like the white-paper-style reports on this site too. They teach you how to think about which questions to answer using different datasets, as well as how to present results in a meaningful way! This is an important skill for senior data scientists, academics and analytics consultants, so take a look.
  • [9] UN database with 34 sets and 60 million records – link . Data by country and region.
  • [10] EU commission open data – link
  • [11] NIST – link
  • [12] U.S. National Epidemiological Survey on Alcohol and Related Conditions (NESARC) – dataset from survey to determine magnitude of alcohol use and psychiatric disorders in the U.S. population. The dataset and descriptive codebook are available here.
  • [13] Plants Checklist from US Department of Agriculture – link .

Academic websites:

  • [14] Yelp academic data – link
  • [15] Univ of California, Irvine Machine Learning Repository – link
  • [16] Harvard Univ: link
  • [17] Harvard Dataverse database: link 
  • [18] MIT – link1 and  link2
  • [19] Univ of North Carolina, adolescent health – link
  • [20] Mars Crater Study, a global database that includes over 300,000 Mars craters 1 km or larger. Link to Descriptive guide and dataset.
  • [31] Click Dataset from Indiana University (~2.5TB dataset) – link .
  • [32] Pew Research Data – Pew Research is an organization focused on research on topics of public interest. Their studies gauge trends in multiple areas such as  internet, technology trends, global attitudes, religion  and social/ demographic trends. Astonishingly, they not only publish these reports but also make all their datasets publicly available for download!
  • [33] Million Song Dataset from Columbia University , including data related to the song tracks and their artist/ composers.

Kaggle & Datascience resources:

A few of my favorite datasets from the Kaggle website are listed here. Please note that Kaggle recently announced an Open Data platform, so you may see many new datasets there in the coming months.

  • [34] Walmart recruiting at stores – link
  • [35] Airbnb new user booking predictions – link
  • [36] US dept of education scorecard – link
  • [37] Titanic Survival Analysis – link
  • [38] Edx – link
  • [39] Enron email information and data – link
  • [40] Quandl – an excellent source for stock data. This site has both FREE and paid datasets.
  • [41] Gapminder – link

Curated Lists:


  • [42] KDnuggets provides a great list of datasets from almost every field imaginable – space, music, books, etc. It may repeat some datasets from the lists above. link
  • [43] Reddit datasets – Users have posted an eclectic mix of datasets about gun ownership, NYPD crime rates, college student study habits and caffeine concentrations in popular beverages.
  • [44] Data Science Central has also curated many datasets for free – link
  • [45] List of open datasets from DataFloq – link
  • [46] Sammy Chen's (@transwarpio) curated list of datasets. This list is categorized by topic, so definitely take a look.
  • [47] DataWorld – This site also has a list of paid and FREE datasets. I have not used the site, but have heard good reviews about the community.

Others:

  • [48] MRI brain scan images and data – link
  • [49] Internet Usage Data from the Center for Applied Internet Data Analysis – link.
  • [50] Google repository of digitized books and ngram viewer – link.
  • [51] Database with geographical information – link
  • [52] Yahoo offers some interesting datasets, the caveat being that you need to be affiliated with an accredited educational organization (as a student or professor) – you can view the datasets here.
  • [53] Google Public Data – Google has a search engine specifically for publicly available data. This is a good place to start, as you can search a large number of datasets in one place. Of course, there is a NEWER link that went live a couple of days ago! 🙂
  • [54] Public datasets from Amazon – see link.

Make sure you attribute the datasets to their original sources. Happy vizzing and coding! 🙂

Yelp College Search – Shiny Based App

As part of this month’s API theme, we will work with the Yelp API using R and Shiny to create a college search app to explore colleges near a specific city or zip code.

(Link to view the Shiny app: https://anupamaprv.shinyapps.io/yelp_collegeapp/ )

Yelp College Search App

As you are all aware, Yelp is a platform which allows you to search for myriad businesses (restaurants, theme parks, colleges, auto repair shops, professional services, etc.) by name or location and, importantly, view honest reviews from customers who have used those services. With 135 million monthly visitors and 95 million reviews, it is equivalent to (if not better than) Google reviews; a LinkedIn of sorts for businesses, if you consider it that way.

With the new academic year almost upon us, it makes sense that students and/or parents would benefit from using this site to explore their options, although it should NOT be relied upon as the only source of truth for educational or career decisions!

With that in mind, we will create a web application that will accept two user inputs and display results in an output window with three panes.
Inputs:

  • City name or zip code
  • Search radius

Output panes:

  • Tab pane 1 – Display user selection as text output
  • Tab pane 2 – Map view of the selected location with markers for each college.
  • Tab pane 3 – Tabular view displaying college name, number of yelp reviews, overall yelp_rating and phone number of the college.

 

With that said, this tutorial will walk through the following tasks:

Step 1 – Working with Yelp API

Yelp uses the OAuth 1.0a method for authentication, which is explained best on their developer website itself; a link is provided here. Like all API access requests, you will need a Yelp developer account, and you will need to create a dummy "app" to request permission keys from the developer page. Note: if you have never created a Yelp account, please do so now.

Unlike with the Facebook or Twitter APIs, we do not use a package specific to Yelp. Instead we use the httr package, which allows us to send a GET() request to the API. You can easily understand how to send queries by exploring the API console itself; however, to analyze or process the results, you do need a script. A sample query to search for colleges near Philadelphia is shown below:

https://api.yelp.com/v2/search/?location=philadelphia&radius_filter=10000&category_filter=collegeuniv

The steps to receive Yelp authorization are as follows:

  • Store the access tokens – Consumer_key, Consumer_Secret, Token and Token_Secret – and request clearance using the code below:

library(httr)

# consumer_key, consumer_secret, token and token_secret hold your Yelp keys
myapp <- oauth_app("YELP", key = consumer_key, secret = consumer_secret)
sig <- sign_oauth1.0(myapp, token = token, token_secret = token_secret)
resultsout <- GET("https://api.yelp.com/v2/search/?location=philadelphia&radius_filter=10000&category_filter=collegeuniv", sig)
  • Process the results into json format and then convert to a usable dataframe

We want the query to change based on user inputs, so we will put it inside a search function, as below. The function also converts the returned results from a nested list into a readable dataframe. To limit results, we drop educational institutions with fewer than 3 reviews.

yelp_srch <- function(radius_miles, locn, n) {
  # convert search radius into metres
  radius_meters <- radius_miles * 1609.34

  # create the composite Yelp API query and send it
  querycomposite2 <- paste0(api_part1, locn, api_part2, radius_meters, api_part3)
  resultsout <- GET(querycomposite2, sig)

  # parse the returned JSON into a data frame
  collegeDataContent <- content(resultsout)
  collegelist <- jsonlite::fromJSON(toJSON(collegeDataContent))
  collegeresultsp <- data.frame(collegelist)

  colnames(collegeresultsp) <- c("lat_delta", "long_delta", "latitude", "longitude",
                                 "total", "claim_biz", "yelp_rating", "mobile_url",
                                 "image_url", "No_of_reviews", "College", "image_url_small",
                                 "main_weblink", "categories", "phone", "short_text",
                                 "biz_image_url", "snippet_url", "display_phone",
                                 "rating_image_url", "biz_id", "closed", "location_city")

  # keep only the college name, rating, review count and display phone
  varseln <- c(11, 7, 10, 19)
  collegeset <- subset(collegeresultsp, select = varseln)

  # drop institutions with too few reviews
  tk <- subset(collegeset, No_of_reviews > n)
  rownames(tk) <- seq_len(nrow(tk))

  return(tk)
}
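The api_part1, api_part2 and api_part3 fragments referenced inside yelp_srch() are not shown in this post. Based on the sample query earlier, they could plausibly be defined as follows (an assumption, not the exact strings from the original script):

# assumed query fragments, pieced together from the sample query shown earlier
api_part1 <- "https://api.yelp.com/v2/search/?location="
api_part2 <- "&radius_filter="
api_part3 <- "&category_filter=collegeuniv"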

 

At this time, we are not adding any error handling functions since the processing occurs only when the user hits the “Analyze” button.

 

Step 2 – Creating the Shiny Application

Like all Shiny applications, the ui.R file specifies the layout of the web application. As described in the introduction, we have the input tab on the left and a 3-paned tabbed output on the right. The server.R file implements the logic and function calls to collate and format the data needed to populate the three panes.

The input pane uses a text input and a slider input, both of which are straightforward implementations using the code provided in the official Shiny widget gallery.

On the output pane, we display the Yelp logo and review-star images to indicate that the data is being pulled from Yelp, and to comply with the display requirements under their Terms of Use. The first "Tab1" tab is also pretty simple and echoes the user input once the "Analyze" button is clicked.

Shiny app output pane1

Shiny app – input pane & output tab1
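Putting the input and output panes together, a minimal ui.R sketch could look like the following; the widget IDs, labels and slider range here are illustrative rather than the exact ones used in the live app.

library(shiny)
library(leaflet)

# illustrative layout: sidebar with text + slider inputs, tabbed output panel
ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("locn", "City name or zip code", value = "Philadelphia"),
      sliderInput("radius", "Search radius (miles)", min = 1, max = 15, value = 5),
      actionButton("analyze", "Analyze")
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("Tab1", textOutput("selection")),
        tabPanel("Plot", leafletOutput("collegemap")),
        tabPanel("Table", tableOutput("collegetable"))
      )
    )
  )
)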

The second "Plot" pane creates a map of the location using the leaflet package. The markers may be single or clustered depending on how many results are returned. Note that the Yelp API limits results, so some queries may be truncated; this is also why the search radius slider offers a fairly small range. The data for this pane is pulled directly from Yelp using a search function similar to the code provided at the beginning of this post.

Shiny app output pane2

If you are interested in exploring more options with the leaflet view, like adding zoom capabilities or custom messages on the popup markers, then please take a look at this other post on our old blog – Graphical Data Exploration.
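If you want to experiment with the map locally before wiring it into Shiny, here is a minimal standalone sketch of a clustered leaflet map; the data frame and coordinates are illustrative, not pulled from Yelp.

library(leaflet)

# illustrative data frame with two made-up colleges near Philadelphia
colleges <- data.frame(
  College   = c("Example College A", "Example College B"),
  latitude  = c(39.9526, 39.9810),
  longitude = c(-75.1650, -75.1550)
)

leaflet(colleges) %>%
  addTiles() %>%
  addMarkers(lng = ~longitude, lat = ~latitude, popup = ~College,
             clusterOptions = markerClusterOptions())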

The third pane is a tabular view of the search results, built using the yelp_srch() function shown above.

Shiny app output pane3

Step 3 – Re-purposing the code

Honestly, this Yelp API code can be re-purposed for myriad other uses, as below:

  • Create a more detailed search app that allows users to add more inputs to search for other business categories or locations.
  • Add data from other APIs, like Facebook, to create an even more data-rich search directory with social proof such as Facebook likes, Yelp star ratings, etc.
  • If you run a hotel site, you could embed a Yelp search app on your website to help users see a map view of restaurants and places of interest. This would help users realize how close they are to historical sites, major highway routes, the city downtown, amazing entertainment options, etc.
  • A travel website could plot a map view of interactive itineraries, so users could select options based on whether they are travelling with kids/seniors/students, etc., while taking comfort in the knowledge that the places are truly worth visiting! 🙂
  • Most APIs use OAuth methods for authentication, so you could easily modify the code to access data from other sites in an easy, legal way. (Please do read the Terms of Service, though, for any such usage.)

 

As always, the entire source code for this analysis is FREELY available as yelp_api_project or can be forked from the link on Github. Please take a look and share your thoughts and feedback.

Until next time, adieu! 🙂

Twitter Analysis – Rio2016 Olympics

Twitter Analysis – Rio2016

Olympics season is in full swing. In keeping with the spirit of this pinnacle of sports, we will use the Twitter API to extract tweets related to Rio2016 and analyze them for insights.

In this post we will work through the following steps:

 

Step 1 – Connecting to Twitter API

We will use R to perform the analysis, with Twitter API keys (learn more about how to request these keys here) and the excellent "twitteR" package to gain authorization for data extraction from Twitter.

Code for authorization is below:

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

Step 2 – Search Twitter API for specific tags

We will search Twitter for all tweets with the tag "#TeamUSA". Twitter puts some constraints on how much data can be extracted with each API call, so we limit our search to 2000 tweets. To ensure recency, we specify that the tweets should have been posted after Aug 1, 2016. Code snippet below:

tw_search = searchTwitter('#TeamUSA', n=2000, since='2016-08-01', geocode='39.9526,-75.1652,50mi')

Note, the "geocode" option in the above command is optional, but I added it to include tweets from users whose profile location is Philadelphia, so that local coverage by NBC/Fox is definitely picked up. We save the tweets in an RDS file for easy access.

saveRDS(tw_search, 'USteam_olympics.rds')

 

Step 3 – Cleaning up and processing the tweets

First, we remove all special characters and emojis from the tweets using the sapply() and iconv() functions.

tweet_doc$text <- sapply(tweet_doc$text, function(row) iconv(row, "latin1", "ASCII", sub=""))

We then convert the created time with the with_tz() function from the lubridate package; note that the code below uses the 'America/Chicago' timezone, i.e. 1 hour behind Philadelphia/NYC.

tweet_doc$Riotime = with_tz(tweet_doc$created, 'America/Chicago')
tweet_doc$strptime = as.POSIXct(strptime(tweet_doc$Riotime, "%Y-%m-%d %H:%M:%S"))

The as.POSIXct() conversion allows us to aggregate tweets by hour, date, minute, etc., which we can derive as below:

tweet_doc$day = as.numeric(format(tweet_doc$strptime, "%d"))
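The hour variable used in Plot 1 below is not shown in the original snippet; presumably it can be derived the same way (an assumption on my part):

# assumed derivation of the hour-of-day column used in Plot 1
tweet_doc$hour = as.numeric(format(tweet_doc$strptime, "%H"))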

We add a new variable to capture the digital device type used for these tweets, using the device URL that Twitter provides in the "statusSource" column.

par(mar = c(3, 3, 3, 2))
tweet_doc$statusSource_new = substr(tweet_doc$statusSource, regexpr('>', tweet_doc$statusSource) + 1,
                                    regexpr('</a>', tweet_doc$statusSource) - 1)

 

Step 4 – Graphical Insight

Plot 1: Tweets by hour of day:

gptime <- ggplot(tweet_doc, aes(hour)) + geom_bar(aes(fill = isRetweet)) + xlab('Tweets by hour')
ggplotly(gptime)

We notice that the number of tweets increases as the evening progresses, with peak frequency at about 9 pm CDT (graph below).

Bar chart displaying frequency count of #TeamUSA tweets by hour

#TeamUSA tweets by hour

Plot 2: Tweets by device type:

gp <- ggplot(tweet_doc, aes(x = statusSource, fill = isRetweet)) + geom_bar()
ggplotly(gp)

The graph clearly shows iPhones dominating the user base.

Tweets by device used

Tweets by device used

 

Plot 3: Emotional Valence

We extract the emotional sentiment of tweets using a custom function:

# polarity() comes from the qdap package; %>% is the magrittr pipe
library(qdap)
library(magrittr)

polfn = lapply(orig$text, function(txt) {
  # strip sentence enders so each tweet is analyzed as one sentence,
  # and +'s which muck up the regex
  gsub('(\\.|!|\\?)\\s+|(\\++)', ' ', txt) %>%
    # strip URLs
    gsub(' http[^[:blank:]]+', '', .) %>%
    # calculate polarity
    polarity()
})

Applying this, we get the most positive tweet:

“That looked like a very easy win for #TeamUSA  #beachvolleyball #Rio2016”

and the most negative tweet:

I think it’s a very odd sport but damn those guys are fit #Rio2016 #waterpolo #TeamUSA

Last, we plot a graph to display how the emotional valence changes over the day:

Emotional valence change in tweets

Plot 4: Word Cloud:

Word cloud for #TeamUSA tweets

We use the "text" column from the tweet_doc object to create a word dictionary of the tweets after removing punctuation and unwanted characters. The size of each word increases with its frequency of appearance in the tweets; the image above shows such a wordcloud, with the most prominent words indicating high-frequency phrases.

library(tm)
library(wordcloud)

# pal is a colour palette; any palette works, this is just an example
pal <- RColorBrewer::brewer.pal(8, "Dark2")

wordCorpus <- Corpus(VectorSource(tweet_doc$text))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordcloud(words = wordCorpus, max.words = 500, random.order = FALSE,
          rot.per = 0.35, use.r.layout = FALSE, colors = pal)

 

Plot 5: Sentiment Graph:

We use the "syuzhet" library to assign emotional values to each of the 2000 tweets we extracted, using the get_nrc_sentiment() function.

mySentiment <- get_nrc_sentiment(tweet_doc$text)

This assigns a numeric value to each tweet for each of the various emotions expressed in it – anger, anticipation, fear, joy, etc. We then add these values back to the tweet_doc object and compute column totals to derive the overall weight for each emotion. Code and image for the overall sentiment scores are shown below:
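The sentimentTotals data frame used in the plotting code below is not shown above; one plausible way to build it from mySentiment (the original script may reshape the data differently) is:

# assumed construction of sentimentTotals from the NRC emotion columns
sentimentTotals <- data.frame(
  sentiment = names(mySentiment),
  count     = colSums(mySentiment)
)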

ggplotly(ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score for All Tweets"))

 

Overall Sentiment Scores – #TeamUSA

We can also use the scores to see whether positive/negative sentiment changes with time of day or date. For our tag "#TeamUSA" this is patently true, as seen in the graph below: positive tweets peak at noon on Opening Ceremony day (Aug 5), while negative sentiment peaked on the morning of Aug 7.
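As a rough sketch, here is one way such a time-of-day view could be produced, assuming the NRC positive/negative columns have been bound back onto tweet_doc; the original script may differ.

library(ggplot2)

# attach the positive/negative scores to the tweets (assumed join by row order)
tweet_doc$positive <- mySentiment$positive
tweet_doc$negative <- mySentiment$negative

ggplot(tweet_doc, aes(x = strptime)) +
  geom_smooth(aes(y = positive, colour = "positive"), se = FALSE) +
  geom_smooth(aes(y = negative, colour = "negative"), se = FALSE) +
  labs(x = "Time", y = "Average sentiment score", colour = "Sentiment")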

Sentiment by time

 

Step 5 – Usage for brand monitoring

The steps used in the analysis above can be easily modified for monitoring your brand, blog or product, as explained below:

  • Instead of "#TeamUSA", we can use any other tag, company/blog name, product, or other relevant tags to mine Twitter for tweets.
  • Periodically monitor the tweets about your product or brand to ensure that your "sentiment graph" always tends towards positive emotions. If not, ensure your staff is working diligently to address any negative tweets or concerns among your users.
  • The "Tweets by hour of day" analysis can be used to monitor what time your users/audience are most active. You could use this insight to publish more content during that window and to ensure your customer support is available during this period to effectively engage your audience.
  • If your "device type" graph indicates that any specific device dominates (e.g. particular Android phone brands), make sure your content caters correctly to mobile users.
  • The high-frequency words in the "wordcloud" indicate trending topics, so these can be used as great ideas for new content topics or short-term ads to ride the publicity wave! 🙂

 

The entire source code for this analysis is available here as blog_twitter_olympics or can be forked from the Github page. Please take a look and share your thoughts and feedback. Until next time, adieu! 🙂
