
Crime Density Area Contour Map

Hello All,

Today’s post is about geographical heat maps – where a specific variable (say ethnic groups, art colleges or crime categories) is color coded to show areas of high or low concentration.

The dataset is from the Philadelphia crime database, generously posted on Kaggle. I’m using the geographical coordinates available in this file to plot crime density maps for 4 specific crime categories. A simple function takes the “crime category” as input and returns a contour map, built with the ggmap library; a minimal sketch is shown below.
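Here is a sketch of what such a function could look like. The column names (Lon, Lat, Text_General_Code) and the category value are my assumptions based on the Kaggle file, and newer ggmap versions require you to register a map-service API key before calling get_map():

library(ggmap)
library(ggplot2)

# the Kaggle dataset, stored in the same directory (column names assumed)
philly_crime <- read.csv("crime.csv")

crime_density_map <- function(crime_category) {
  crime_sub <- subset(philly_crime, Text_General_Code == crime_category)
  basemap <- get_map(location = "Philadelphia", zoom = 11)
  ggmap(basemap) +
    stat_density2d(data = crime_sub,
                   aes(x = Lon, y = Lat, fill = ..level.., alpha = ..level..),
                   geom = "polygon") +
    scale_fill_gradient(low = "yellow", high = "red") +
    guides(alpha = "none") +
    ggtitle(paste("Crime density:", crime_category))
}

crime_density_map("Burglary Residential")  # category value assumed from the dataset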

Detailed instructions are already posted as an RMarkdown file on the RPubs website. Please take a look at the link here.

The entire source code is also available as a zipped file, philly_crime_density_maps, which includes the R program (easy to modify and play with!) and the RMarkdown file. Please remember to download the dataset .csv file from the Kaggle website and store it in the same directory.

Burglary crime density area maps for Philadelphia

If you liked this post and would like to receive updates on similar projects, please do sign up for our blog updates. New projects are also added on our parent site at the beginning of every month, so do subscribe! If you think others may find this site useful, please do share the link on Twitter and other social media. Thank you!

We love hearing feedback and questions. If you have any tips, or would have taken a different approach, please share your thoughts in the comments section.

Happy Coding!

Twitter Sentiment Analysis

Introduction

Today’s post is a 2-part tutorial series on how to create an interactive Shiny/R application that displays sentiment analysis for various phrases and search terms. The application accepts a user search term as input and graphically displays the sentiment analysis.

In keeping with this month’s theme – “API programming” – this project uses the Twitter API to perform a real-time search for tweets containing the user’s input term. A live app link on the Shiny website is provided, and a screenshot is shown below:

Shiny application for Twitter Sentiment Analysis

The project idea may seem simple at first, but it will teach you the following skills:

  • working with the Twitter API and dynamic data streaming (every time the search term changes, the program sends a new request to Twitter for relevant tweets),
  • building an “interactive”, real-time application in Shiny/R,
  • data visualization with R.

As always, the entire source code is also available for download on the Projects Page or can be forked from my Github account here.


The tutorial is divided into 3 parts:

  1. Introduction
  2. Twitter Connectivity & search
  3. Shiny design


Application Design:

Any good software project begins with design. For this application, the design flowchart is shown below:

Design flowchart for the Shiny app


Twitter Connectivity

This is similar to the August project and mainly consists of two calls to the Twitter API:

  • authorize the Twitter API to mine data, using the setup_twitter_oauth() function and your Twitter developer keys:

library(twitteR)

# replace the placeholder strings with your own Twitter developer keys
consumer_key <- "ckey"
consumer_secret <- "csecret"
access_token <- "atoken"
access_secret <- "asecret"
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

  • check whether the input search term returns tweets containing the phrase. If the number of tweets is <= 5, return an error message; if it is > 5, process the tweets and display a sentiment analysis bar chart. A custom function performs this check:

chk_searchterm <- function(term) {
  # look for up to 20 tweets containing this search term
  tw_search <- searchTwitter(term, n = 20, since = '2013-01-01')

  if (length(tw_search) <= 5) {
    return_term <- "None/few tweets to analyse for this search term. Please try again!"
  } else {
    return_term <- paste("Extracting max 20 tweets for Input =", term, ". Sentiment graph below")
  }
  return(return_term)
}
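For example, with valid credentials already configured:

chk_searchterm("data science")
# returns either the "None/few tweets..." message or the extraction confirmation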

The bar graph is created by assigning numeric scores for each of the positive and negative emotions found in the tweet text. Emotions used: anger, anticipation, disgust, joy, sadness, surprise, trust, plus overall positive and negative sentiment.
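These categories line up with the NRC emotion lexicon, so one possible way to compute such scores – a sketch using the syuzhet package, which may differ from the exact code used here (NRC also includes a fear category) – is:

library(syuzhet)

tweet_texts <- c("I love this, amazing news!", "This is terrible and very disappointing")
scores <- get_nrc_sentiment(tweet_texts)   # one row per tweet, one column per emotion
totals <- colSums(scores)                  # aggregate scores across all tweets
barplot(totals, las = 2, col = rainbow(length(totals)), main = "Sentiment scores")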


Shiny webapp

The actual Shiny application design is explained in the next post.

50+ free Datasets for Data Science Projects

[Updated as of Jan 31, 2020]

50+ free datasets for your data science project portfolio

There is no doubt that having a project portfolio is one of the best ways to master Data Science, whether you aspire to be a data analyst, machine learning expert or data visualization ninja! In fact, students and job seekers who showcase their skills with a unique portfolio find it easier to land lucrative jobs than their peers. (For project ideas, check this post; for job search advice, look here.)

To create a custom portfolio, you need good data. So this post presents a list of 50+ websites where you can gather datasets for your projects in R, Python, SAS, Tableau or other software. Best part: these datasets are all free, free, free! (Some might need you to create a login.)

The datasets are divided into 5 broad categories:

  1. Government & UN/ Global Organizations
  2. Academic Websites
  3. Kaggle & Data Science Websites
  4. Curated Lists
  5. Miscellaneous

Government and UN/World Bank websites:

  • [1] US government database with 190k+ datasets – link. These include county-level data on demographics, education/schools and economic indicators; a list of museums & recreational areas across the country; agriculture, weather and soil data; and so much more!
  • [2] UK government database with 25k+ datasets. Similar to the US site, but from the UK government.
  • [3] Canada government database. Data for Canada.

  • [4] Center for Disease Control – link
  • [5] Bureau of Labor Statistics – link
  • [6] NASA datasets – link 
  • [7] World Bank Data – link 
  • [8] World Economic Forum. I like the white-paper-style reports on this site too. They teach you how to decide which questions to answer using different datasets, as well as how to present the results in a meaningful way! This is an important skill for senior data scientists, academics and analytics consultants, so take a look.
  • [9] UN database with 34 sets and 60 million records – link . Data by country and region.
  • [10] EU commission open data – link
  • [11] NIST – link
  • [12] U.S. National Epidemiological Survey on Alcohol and Related Conditions (NESARC) – dataset from survey to determine magnitude of alcohol use and psychiatric disorders in the U.S. population. The dataset and descriptive codebook are available here.
  • [13] Plants Checklist from US Department of Agriculture – link .

Academic websites:

  • [14] Yelp academic data – link
  • [15] Univ of California, Irvine Machine Learning Repository – link
  • [16] Harvard Univ: link
  • [17] Harvard Dataverse database: link 
  • [18] MIT – link1 and  link2
  • [19] Univ of North Carolina, adolescent health – link
  • [20] Mars Crater Study, a global database that includes over 300,000 Mars craters 1 km or larger. Link to Descriptive guide and dataset.
  • [31] Click Dataset from Indiana University (~2.5TB dataset) – link .
  • [32] Pew Research Data – Pew Research is an organization focused on research on topics of public interest. Their studies gauge trends in multiple areas such as internet and technology trends, global attitudes, religion and social/demographic trends. Astonishingly, they not only publish these reports but also make all their datasets publicly available for download!
  • [33] Million Song Dataset from Columbia University , including data related to the song tracks and their artist/ composers.

Kaggle & Datascience resources:

A few of my favorite datasets from the Kaggle website are listed here. Please note that Kaggle recently announced an Open Data platform, so you may see many new datasets there in the coming months.

  • [34] Walmart recruiting at stores – link
  • [35] Airbnb new user booking predictions – link
  • [36] US dept of education scorecard – link
  • [37] Titanic Survival Analysis – link
  • [38] Edx – link
  • [39] Enron email information and data – link
  • [40] Quandl – an excellent source for stock data. This site has both FREE and paid datasets.
  • [41] Gapminder – link

Curated Lists:

curated-datasets

  • [42] KDnuggets provides a great list of datasets from almost every field imaginable – space, music, books, etc. May repeat some datasets from the list above. link
  • [43] Reddit datasets – Users have posted an eclectic mix of datasets about gun ownership, NYPD crime rates, college student study habits and caffeine concentrations in popular beverages.
  • [44] Data Science Central has also curated many datasets for free – link
  • [45] List of open datasets from DataFloq – link
  • [46] Sammy Chen’s (@transwarpio) curated list of datasets. This list is categorized by topic, so definitely take a look.
  • [47] DataWorld – this site also has a list of paid and FREE datasets. I have not used the site myself, but have heard good reviews about the community.

Others:

  • [48] MRI brain scan images and data – link
  • [49] Internet Usage Data from the Center for Applied Internet Data Analysis – link.
  • [50] Google repository of digitized books and ngram viewer – link.
  • [51] Database with geographical information – link
  • [52] Yahoo offers some interesting datasets, the caveat being that you need to be affiliated with an accredited educational organization. (student or professor) – you can view the datasets here.
  • [53] Google Public Data – Google has a search engine specifically for searching publicly available data. This is a good place to start as you can search a large amount of datasets in one place. Of course, there is a NEWER link that went live a couple days ago! 🙂
  • [54] Public datasets from Amazon – see link.

Make sure you attribute the datasets to their origin sites. Happy vizzing and coding! 🙂

Yelp College Search – Shiny Based App

As part of this month’s API theme, we will work with the Yelp API using R and Shiny to create a college search app to explore colleges near a specific city or zip code.

(Link to view the Shiny app: https://anupamaprv.shinyapps.io/yelp_collegeapp/)

Yelp College Search App

As you are all aware, Yelp is a platform that allows you to search for myriad businesses (restaurants, theme parks, colleges, auto repair shops, professional services, etc.) by name or location and (IMPORTANTLY) view honest reviews from customers who have used those services. With 135 million monthly visitors and 95 million reviews, it is equivalent to (if not better than) Google reviews; a LinkedIn of sorts for businesses, if you think of it that way.

With the new academic year almost upon us, it makes sense that students and/or parents would benefit from using this site to explore their options, although it should NOT be relied on as the only source of truth for educational or career decisions!

With that in mind, we will create a web application that will accept two user inputs and display results in an output window with three panes.
Inputs:

  • City name or zip code
  • Search radius

Output panes:

  • Tab pane 1 – Display user selection as text output
  • Tab pane 2 – Map view of the selected location with markers for each college.
  • Tab pane 3 – Tabular view displaying college name, number of yelp reviews, overall yelp_rating and phone number of the college.


With that said, this tutorial will walk through the following tasks:

Step 1 – Working with Yelp API

Yelp uses the OAuth 1.0a method to process authentication, which is explained best on their developer website itself; a link is provided here. Like all API access requests, you will need a Yelp developer account and a dummy “app” to request permission keys from this developer page. Note: if you have never created a Yelp account, please do so now.

Unlike the Facebook or Twitter APIs, we do not use a package specific to Yelp. Instead, we use the httr package, which allows us to send a GET() request to the API. You can easily work out how to build queries by exploring the API console itself; however, to analyze or process the results, you do need a script. A sample query to search for colleges near Philadelphia is shown below:

https://api.yelp.com/v2/search/?location=philadelphia&radius_filter=10000&category_filter=collegeuniv

The steps to receive Yelp authorization are as follows:

  • Store the access tokens – Consumer_key, Consumer_Secret, Token and Token_Secret – and request clearance with code along these lines (a sketch using httr’s OAuth 1.0 helpers):

# register the app and sign the request with the stored tokens
myapp <- oauth_app("YELP", key = consumer_key, secret = consumer_secret)
sig <- sign_oauth1.0(myapp, token = token, token_secret = token_secret)
resultsout <- GET("https://api.yelp.com/v2/search/?location=philadelphia&radius_filter=10000&category_filter=collegeuniv", sig)
  • Process the results into JSON format and then convert them into a usable dataframe.

We want the query to change based on user inputs, so we will wrap it inside a search function, shown below. The function also converts the returned results from a nested list into a readable dataframe. To limit the results, we drop educational institutions with fewer than 3 reviews.

# requires httr and jsonlite (loaded earlier in the script)
yelp_srch <- function(radius_miles, locn, n) {
  # convert search radius into metres
  radius_meters <- radius_miles * 1609.34

  # create the composite Yelp API query
  # (api_part1/api_part2/api_part3 hold the fragments of the query URL shown earlier)
  querycomposite2 <- paste0(api_part1, locn, api_part2, radius_meters, api_part3)
  resultsout <- GET(querycomposite2, sig)

  # parse the response into JSON, then flatten it into a dataframe
  collegeDataContent <- content(resultsout)
  collegelist <- jsonlite::fromJSON(jsonlite::toJSON(collegeDataContent))
  collegeresultsp <- data.frame(collegelist)

  colnames(collegeresultsp) <- c("lat_delta", "long_delta", "latitude", "longitude",
                                 "total", "claim_biz", "yelp_rating", "mobile_url",
                                 "image_url", "No_of_reviews", "College", "image_url_small",
                                 "main_weblink", "categories", "phone", "short_text",
                                 "biz_image_url", "snippet_url", "display_phone",
                                 "rating_image_url", "biz_id", "closed", "location_city")

  # keep college name, rating, review count and display phone
  varseln <- c(11, 7, 10, 19)
  collegeset <- subset(collegeresultsp, select = varseln)

  # drop institutions with n or fewer reviews and renumber the rows
  tk <- subset(collegeset, No_of_reviews > n)
  rownames(tk) <- seq_len(nrow(tk))

  return(tk)
}
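For example, with sig and the query URL fragments already defined, a 6-mile (~10 km) search around Philadelphia that keeps colleges with more than 3 reviews would be:

philly_colleges <- yelp_srch(6, "philadelphia", 3)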


At this time, we are not adding any error-handling functions, since the processing occurs only when the user hits the “Analyze” button.


Step 2 – Creating the Shiny Application

Like all shiny applications, the ui.R file specifies the layout of the web application. As described in the introduction, we have the input tab on the left and 3-paned tabbed output on the right. The server.R file implements the logic and function calls to collate and format the data needed to populate the three panes.

The input pane uses a text input and a slider input, both of which are straightforward implementations using the code provided in the official Shiny widget gallery.

On the output pane, we display the Yelp logo and review-star images to indicate that the data is being pulled from Yelp, and to comply with the display requirements in their Terms of Use. The first “Tab1” tab is also pretty simple and echoes the user input once the “Analyze” button is pressed.

Shiny app – input pane & output tab 1
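For illustration, a minimal ui.R sketch of this layout could look as follows (the widget IDs and labels here are mine, not necessarily the app’s exact ones):

library(shiny)
library(leaflet)

ui <- fluidPage(
  titlePanel("Yelp College Search"),
  sidebarLayout(
    sidebarPanel(
      textInput("locn", "City name or zip code", value = "philadelphia"),
      sliderInput("radius", "Search radius (miles)", min = 1, max = 15, value = 5),
      actionButton("analyze", "Analyze")
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("Tab1", textOutput("user_selection")),
        tabPanel("Plot", leafletOutput("college_map")),
        tabPanel("Table", tableOutput("college_table"))
      )
    )
  )
)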

The second “Plot” pane creates a map view of the location using the leaflet package. Markers may appear singly or clustered, depending on how many results are returned. Note that the Yelp API limits the number of results, so some queries may be truncated; this is also why the search-radius slider offers only a small range. The data for this pane is pulled directly from Yelp using a search function similar to the code provided earlier (a sketch of the map logic follows the screenshot below).

Shiny app output pane 2
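A hedged sketch of that map logic, assuming a variant of the search function that also retains the latitude/longitude columns (the sample data frame below is purely illustrative):

library(leaflet)

# illustrative result set with coordinates retained
college_df <- data.frame(
  College = c("Univ A", "Univ B"),
  latitude = c(39.9526, 39.9812),
  longitude = c(-75.1652, -75.1554)
)

leaflet(college_df) %>%
  addTiles() %>%
  addMarkers(lng = ~longitude, lat = ~latitude, popup = ~College,
             clusterOptions = markerClusterOptions())

# inside the app, this sits in a renderLeaflet({ ... }) call in server.R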

If you are interested in exploring more options with the leaflet view, like adding zoom capabilities or custom messages on the popup markers, then please take a look at this other post on our old blog – Graphical Data Exploration.

The third pane is a tabular view of the search results, using the yelp_srch() function shown above.

Shiny app output pane 3

Step 3 – Re-purposing the code

Honestly, this Yelp API code can be re-purposed for myriad other uses:

  • Create a more detailed search app that allows users to add more inputs to search for other business categories or locations.
  • Add data from other APIs, like Facebook, to create an even more data-rich search directory with social proof using Facebook likes, Yelp star ratings, etc.
  • If you run a hotel site, you could embed a Yelp search app on your website to help users see a map view of restaurants and places of interest. This would help users realize how close they are to historical sites, major highway routes, the city downtown, amazing entertainment options, etc.
  • A travel website could plot a map view of interactive itineraries, so users could select options based on whether they are travelling with kids, seniors, students, etc., while taking comfort in the knowledge that the places are truly worth visiting! 🙂
  • Most APIs use OAuth methods for authentication, so you could easily modify the code to access data from other sites in an easy, legal way. (Please do read the Terms of Service, though, before any such usage.)


As always, the entire source code for this analysis is FREELY available as yelp_api_project or can be forked from the link on Github. Please take a look and share your thoughts and feedback.

Until next time, adieu! 🙂
