Deep dive into data analysis tools, theory and projects

Author: anu – Journey of Analytics Team

Machine Learning Algorithms

In the last few posts, we looked at standalone analytics projects: performing sentiment analysis, visually exploring large datasets for insights, and creating interesting Shiny applications.

In the coming months, however, we will cover how to implement machine learning algorithms in depth. For each algorithm, we will explore the underlying concepts (why and how the formula works), implement it using a real-world Kaggle dataset, and discuss its advantages and limitations.


What algorithms will we cover?

There are many algorithms to choose from, and this infographic from ThinkBigData provides an excellent and comprehensive list. Feel free to use it as a handout or print one for your cubicle! For our purposes, we will cover two algorithms from each category.

Categories of machine learning algorithms. Source – ThinkBigData.com, by author Anubhav Srivastava.

Quick FAQ – selecting an algorithm in practice

Many readers ask, “How do I know which algorithm to select?” This is also where new programmers often get stuck.

The long-winded answer is that there is no secret sauce; the choice usually comes down to experience or the problem definition itself.

The above answer is not very satisfying, so here are two “cheat-sheet” answers:

  1. A good approximation is given by this infographic from Microsoft Azure. Download it from the link here.
  2. Regression is a very common and flexible model, so the table below gives you an idea for creating a base model, based on whether your target variable is quantitative (numeric) or categorical (e.g. gender or country); a small R sketch follows the table.

Regression algorithms based on target variables
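
To make this concrete, here is a minimal base-model sketch in R. The data frame df and target column y are placeholder names for illustration only:

# Base-model sketch: df and y are hypothetical placeholder names
if (is.numeric(df$y)) {
  # quantitative (numeric) target: start with linear regression
  base_model <- lm(y ~ ., data = df)
} else {
  # categorical target (e.g. gender): start with logistic regression
  # (family = binomial assumes a two-level target)
  base_model <- glm(as.factor(y) ~ ., data = df, family = binomial)
}
summary(base_model)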


US Presidential Elections – Roundup of Final Forecasts

With barely 48 hours remaining until the US Presidential Election, I thought a roundup post curating the “forecasts” seemed inevitable.

So here are the analyses from three top forecasters, known for their accurate predictions:

US Presidential Elections 2016

(1) Nate Silver, FiveThirtyEight:

This website has been giving a running status of the election and has accounted for the numerous pendulum-swing (and shocking) changes that have characterized this race. Currently, it shows Hillary Clinton as the clear winner, with a ~70% chance of being the next President. You can check out the state-wise stats and electoral vote breakdown on their webpage here. If you are interested, you can also view their forecasts using three different models (polls-only, polls-plus and now-cast, i.e. current sentiment) and how they have changed over the last 12 months.

Their analytics are pretty amazing, so do take a look as a learning exercise, even if you do not agree with the forecast itself!


(2) 270towin:

Predictions and forecasts from Larry Sabato and the team at the University of Virginia Center for Politics. The final forecast from this team also puts Ms. Clinton as the clear winner. They also expect Democrats to take control of the Senate. You can view their state-wise electoral vote predictions here.


(3) Dr. Lichtman’s 13-key system:

Unlike other statistical teams and political analysts, this distinguished professor of history at American University rose to fame using a simplified 13-key system for predicting the presidential elections. According to Dr. Allan J. Lichtman’s theory, if six or more of the thirteen keys count against the party holding the White House, it will be toppled from power. His system has proven right for the past 30 years, so please do take a look at it before you scoff that it lacks the mathematical proofs and complex computations touted by media houses and political analytics teams. Dr. Lichtman predicts Trump to be the winner, as he shows six of the keys currently going against the incumbent party. Read more about the system and his analysis here.


Overall: 

Finally, looking at the overall sentiment on Twitter and news media, it does look like Hillary’s win is imminent.

But until the final vote is cast, who knows what may change?

Crime Density Area Contour Map

Hello All,

Today’s post is about geographical heat maps, where a specific variable (say, ethnic groups, art colleges or crime categories) is color-coded to show areas of high or low concentration.

The dataset is from the Philadelphia crime database, generously posted on Kaggle. I’m using the geographical coordinates available in this file to plot crime density maps for 4 specific crime categories. A simple function takes the “crime category” as input and returns a contour map, using the ggmap library.
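
For illustration, here is a minimal sketch of such a function. The column names (Lon, Lat, Text_General_Code) and the category string are assumptions, so please verify them against the actual Kaggle file; recent versions of ggmap may also require a map-provider API key for get_map().

library(ggmap)
library(ggplot2)

# assumed file and column names; check against the Kaggle download
crime <- read.csv("crime.csv")

crime_contour <- function(category) {
  # keep rows for the chosen category that have valid coordinates
  sub_df <- crime[crime$Text_General_Code == category &
                    !is.na(crime$Lon) & !is.na(crime$Lat), ]
  phl_map <- get_map(location = "Philadelphia, PA", zoom = 11)  # may need an API key
  ggmap(phl_map) +
    stat_density2d(data = sub_df,
                   aes(x = Lon, y = Lat, fill = ..level.., alpha = ..level..),
                   geom = "polygon") +
    ggtitle(paste("Crime density:", category))
}

crime_contour("Burglary Residential")  # category label is illustrative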

Detailed instructions are already posted as an RMarkdown file on the RPubs website. Please take a look at the link here.

The entire source code is also available as a zipped file (philly_crime_density_maps), which includes the R program (easy to modify and play with the data!) and the RMarkdown file. Please remember to download the dataset .csv file from the Kaggle website and store it in the same directory.

Burglary crime density area maps for Philadelphia

If you liked this post and would like to receive updates for similar projects, then please do sign up for our blog updates. New projects are also added on our parent site at the beginning of every month, so do subscribe! If you think others may find this site useful, then please do share this link on Twitter and other social media. Thank you!

We love hearing feedback and questions. If you have any tips, or would have taken a different approach, please do share your thoughts in the comments section.

Happy Coding!

50+ free Datasets for Data Science Projects

[Updated as of Jan 31, 2020]

50+ free datasets for your data science project portfolio

There is no doubt that having a project portfolio is one of the best ways to master data science, whether you aspire to be a data analyst, machine learning expert or data visualization ninja! In fact, students and job seekers who showcase their skills with a unique portfolio land lucrative jobs faster than their peers. (For project ideas, check this post; for job search advice, look here.)

To create a custom portfolio, you need good data. So this post presents a list of 50+ websites to gather datasets for your projects in R, Python, SAS, Tableau or other software. Best part: these datasets are all free, free, free! (Some might need you to create a login.)

The datasets are divided into 5 broad categories as below:

  1. Government & UN/ Global Organizations
  2. Academic Websites
  3. Kaggle & Data Science Websites
  4. Curated Lists
  5. Miscellaneous

Government and UN/World Bank websites:

  • [1] US government database with 190k+ datasets – link. These include county-level data on demographics, education/schools and economic indicators; lists of museums & recreational areas across the country; agriculture, weather and soil data; and so much more!
  • [2] UK government database with 25k+ datasets . Similar to the US site, but from the UK government.
  • [3] Canada government database. Data for Canada.

  • [4] Center for Disease Control – link
  • [5] Bureau of Labor Statistics – link
  • [6] NASA datasets – link 
  • [7] World Bank Data – link 
  • [8] World Economic Forum. I like the white-paper-style reports on this site too. They teach you how to think about which questions to answer using different datasets, as well as how to present results in a meaningful way. This is an important skill for senior data scientists, academics and analytics consultants, so take a look.
  • [9] UN database with 34 sets and 60 million records – link . Data by country and region.
  • [10] EU commission open data – link
  • [11] NIST – link
  • [12] U.S. National Epidemiological Survey on Alcohol and Related Conditions (NESARC) – dataset from survey to determine magnitude of alcohol use and psychiatric disorders in the U.S. population. The dataset and descriptive codebook are available here.
  • [13] Plants Checklist from US Department of Agriculture – link .

Academic websites:

  • [14] Yelp academic data – link
  • [15] Univ of California, Irvine Machine Learning Repository – link
  • [16] Harvard Univ: link
  • [17] Harvard Dataverse database: link 
  • [18] MIT – link1 and  link2
  • [19] Univ of North Carolina, adolescent health – link
  • [20] Mars Crater Study, a global database that includes over 300,000 Mars craters 1 km or larger. Link to Descriptive guide and dataset.
  • [31] Click Dataset from Indiana University (~2.5TB dataset) – link .
  • [32] Pew Research Data – Pew Research is an organization focused on research on topics of public interest. Their studies gauge trends in multiple areas, such as the internet, technology trends, global attitudes, religion and social/demographic trends. Astonishingly, they not only publish these reports but also make all their datasets publicly available for download!
  • [33] Million Song Dataset from Columbia University, including data related to the song tracks and their artists/composers.

Kaggle & Datascience resources:

A few of my favorite datasets from the Kaggle website are listed here. Please note that Kaggle recently announced an Open Data platform, so you may see many new datasets there in the coming months.

  • [34] Walmart recruiting at stores – link
  • [35] Airbnb new user booking predictions – link
  • [36] US dept of education scorecard – link
  • [37] Titanic Survival Analysis – link
  • [38] Edx – link
  • [39] Enron email information and data – link
  • [40] Quandl – an excellent source for stock data. This site has both FREE and paid datasets.
  • [41] Gapminder – link

Curated Lists:


  • [42] KDnuggets provides a great list of datasets from almost every field imaginable: space, music, books, etc. It may repeat some datasets from the lists above – link.
  • [43] Reddit datasets – Users have posted an eclectic mix of datasets about gun ownership, NYPD crime rates, college student study habits and caffeine concentrations in popular beverages.
  • [44] Data Science Central has also curated many datasets for free – link
  • [45] List of open datasets from DataFloq – link
  • [46] Sammy Chen’s (@transwarpio) curated list of datasets. This list is categorized by topic, so definitely take a look.
  • [47] DataWorld – this site has a list of both paid and FREE datasets. I have not used the site myself, but have heard good reviews about the community.

Others:

  • [48] MRI brain scan images and data – link
  • [49] Internet Usage Data from the Center for Applied Internet Data Analysis – link.
  • [50] Google repository of digitized books and ngram viewer – link.
  • [51] Database with geographical information – link
  • [52] Yahoo offers some interesting datasets, the caveat being that you need to be affiliated with an accredited educational organization (as a student or professor). You can view the datasets here.
  • [53] Google Public Data – Google has a search engine specifically for searching publicly available data. This is a good place to start, as you can search a large number of datasets in one place. Of course, there is a NEWER link that went live a couple of days ago! 🙂
  • [54] Public datasets from Amazon – see link.

Make sure you attribute the datasets to the appropriate origin sites. Happy vizzing and coding! 🙂

Yelp College Search – Shiny-Based App

As part of this month’s API theme, we will work with the Yelp API, using R and Shiny to create a college search app that explores colleges near a specific city or zip code.

(Link to view the Shiny app: https://anupamaprv.shinyapps.io/yelp_collegeapp/)

Yelp College Search App

As you are all aware, Yelp is a platform that allows you to search for myriad businesses (restaurants, theme parks, colleges, auto repair shops, professional services, etc.) by name or location and, importantly, view honest reviews from customers who have used those services. With 135 million monthly visitors and 95 million reviews, it is equivalent (if not superior) to Google reviews; a LinkedIn of sorts for businesses, if you consider it that way.

With the new academic year almost upon us, it makes sense that students and/or parents would benefit from using this site to explore their options, although it should NOT be relied upon as the only source of truth for educational or career decisions!

With that in mind, we will create a web application that will accept two user inputs and display results in an output window with three panes.
Inputs:

  • City name or zip code
  • Search radius

Output panes:

  • Tab pane 1 – Display user selection as text output
  • Tab pane 2 – Map view of the selected location with markers for each college.
  • Tab pane 3 – Tabular view displaying college name, number of yelp reviews, overall yelp_rating and phone number of the college.


With that said, this tutorial will walk through the following tasks:

Step 1 – Working with the Yelp API

Yelp uses the OAuth 1.0a method to process authentication, which is explained best on their developer website itself; a link is provided here. Like all API access requests, you will need a Yelp developer account, and you must create a dummy “app” to request permission keys from this developer page. Note: if you have never created a Yelp account, please do so now.

Unlike Facebook or Twitter API usage, we do not use any package specific to Yelp. Instead, we use the httr package, which allows us to send a GET() request to the API. You can easily work out how to construct queries by exploring the API console itself; however, to analyze or process the results, you do need a script. A sample query to search for colleges near Philadelphia is shown below:

https://api.yelp.com/v2/search/?location=philadelphia&radius_filter=10000&category_filter=collegeuniv

The steps to receive Yelp authorization are as follows:

  • Store the access tokens (Consumer_key, Consumer_Secret, Token and Token_Secret) and request clearance using code along the lines below. This is a sketch of the standard httr OAuth 1.0 pattern, keeping the object names myapp, sig and resultsout; query_url stands for a search query such as the sample above:

library(httr)

# register a dummy app and sign requests with the OAuth 1.0 tokens
myapp <- oauth_app("YELP", key = Consumer_key, secret = Consumer_Secret)
sig <- sign_oauth1.0(myapp, token = Token, token_secret = Token_Secret)
resultsout <- GET(query_url, sig)

  • Process the results into JSON format and then convert them into a usable dataframe.

We want the query to change based on the user inputs, so we will put it inside a search function, as below. The function also converts the returned results from a nested list into a readable dataframe. To limit results, we drop educational institutions with fewer than a threshold number of reviews (the n argument; the app uses 3).

library(jsonlite)   # for toJSON()/fromJSON(); httr was loaded earlier

yelp_srch <- function(radius_miles, locn, n)
{
  # convert search radius into metres
  radius_meters <- radius_miles * 1609.34

  # create the composite Yelp API query
  # (api_part1/api_part2/api_part3 are the fixed query fragments)
  querycomposite2 <- paste0(api_part1, locn, api_part2, radius_meters, api_part3)
  resultsout <- GET(querycomposite2, sig)

  # parse the JSON response and flatten it into a dataframe
  collegeDataContent <- content(resultsout)
  collegelist <- jsonlite::fromJSON(toJSON(collegeDataContent))
  collegeresultsp <- data.frame(collegelist)

  colnames(collegeresultsp) <- c("lat_delta", "long_delta", "latitude", "longitude",
                                 "total", "claim_biz", "yelp_rating", "mobile_url",
                                 "image_url", "No_of_reviews", "College", "image_url_small",
                                 "main_weblink", "categories", "phone", "short_text",
                                 "biz_image_url", "snippet_url", "display_phone",
                                 "rating_image_url", "biz_id", "closed", "location_city")

  # keep only college name, rating, review count and display phone
  varseln <- c(11, 7, 10, 19)
  collegeset <- subset(collegeresultsp, select = varseln)

  # drop institutions with too few reviews and reset the row numbering
  tk <- subset(collegeset, No_of_reviews > n)
  rownames(tk) <- seq_len(nrow(tk))

  return(tk)
}


At this time, we are not adding any error handling functions since the processing occurs only when the user hits the “Analyze” button.


Step 2 – Creating the Shiny Application

Like all Shiny applications, the ui.R file specifies the layout of the web application. As described in the introduction, we have the input pane on the left and the 3-paned tabbed output on the right. The server.R file implements the logic and function calls to collate and format the data needed to populate the three panes.

The input pane uses a text input and a slider input, both of which are straightforward implementations using the code provided in the official Shiny widget gallery.
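
For illustration, here is a minimal ui.R sketch of this layout. The widget IDs, labels and output names below are placeholders, not the app’s actual identifiers:

library(shiny)
library(leaflet)

# minimal ui.R sketch; IDs and labels are illustrative
shinyUI(fluidPage(
  titlePanel("Yelp College Search App"),
  sidebarLayout(
    sidebarPanel(
      textInput("locn", "City name or zip code", value = "Philadelphia"),
      sliderInput("radius", "Search radius (miles)", min = 1, max = 15, value = 10),
      actionButton("analyze", "Analyze")
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("Tab1", textOutput("selection_text")),
        tabPanel("Plot", leafletOutput("college_map")),
        tabPanel("Table", tableOutput("college_table"))
      )
    )
  )
))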

On the output pane, we display the Yelp logo and review-star images to indicate that the data is being pulled from Yelp, and to comply with the display requirements in their Terms of Use. The first “Tab1” tab is also pretty simple and echoes the user input once the “Analyze” button is clicked.

Shiny app – input pane & output tab1

The second “Plot” pane creates a map view of the selected location using the leaflet package. The markers may appear singly or clustered, depending on how many results are returned. Note that the Yelp API limits the number of results, so some queries may be truncated; this is also why the search-radius slider offers a relatively small range. The data for this pane is pulled directly from Yelp using a search function similar to the code provided at the beginning of this post.
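
Here is a sketch of how the map pane might be rendered in server.R. It assumes a hypothetical variant yelp_srch_geo() that keeps the latitude/longitude columns (the yelp_srch() version above drops them), so treat the names below as assumptions:

# server.R sketch; yelp_srch_geo() is a hypothetical variant of yelp_srch()
# that retains the latitude and longitude columns
output$college_map <- renderLeaflet({
  input$analyze                                   # re-render when Analyze is clicked
  df <- isolate(yelp_srch_geo(input$radius, input$locn, 3))
  leaflet(df) %>%
    addTiles() %>%
    addMarkers(lng = ~as.numeric(longitude), lat = ~as.numeric(latitude),
               popup = ~College,
               clusterOptions = markerClusterOptions())   # cluster nearby markers
})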

Shiny app output pane 2

If you are interested in exploring more options with the leaflet view, like adding zoom capabilities or custom messages on the popup markers, then please take a look at this other post on our old blog – Graphical Data Exploration.

The third pane is a tabular view of the search results, using the yelp_srch() function added above.
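
This pane can be wired up in server.R along these lines (the output name college_table is a placeholder):

# server.R sketch for the table pane
output$college_table <- renderTable({
  input$analyze
  isolate(yelp_srch(input$radius, input$locn, 3))
})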

Shiny app output pane 3 – tabular view of results

Step 3 – Re-purposing the code

Honestly, this Yelp API code can be adapted for myriad other uses, as below:

  • Create a more detailed search app that allows users to add more inputs, to search for other business categories or locations.
  • Add data from other APIs, like Facebook, to create an even more data-rich search directory with social proof (Facebook likes, Yelp star ratings, etc.).
  • If you run a hotel site, you could embed a Yelp search app on your website to help users see a map view of restaurants and places of interest. This would help users realize how close they are to historical sites, major highway routes, the city downtown, amazing entertainment options, etc.
  • A travel website could plot a map view of interactive itineraries, so users could select options based on whether they are travelling with kids, seniors or students, while taking comfort in the knowledge that the places are truly worth visiting! 🙂
  • Most APIs use OAuth methods for authentication, so you could easily modify the code to access data from other sites in an easy, legal way. (Please do read the Terms of Service, though, for any such usage.)


As always, the entire source code for this analysis is FREELY available as yelp_api_project, or can be forked from the link on GitHub. Please take a look and share your thoughts and feedback.

Until next time, adieu! 🙂
