Journey of Analytics

Deep dive into data analysis tools, theory and projects

Introduction to Artificial Intelligence

Lately I've been exploring deep learning algorithms and automating systems with artificial intelligence. Plus, I received a couple of emails asking me about programming skills for AI. So, with those questions in mind, here is a simple introduction to artificial intelligence. The agenda for this post is to cover the following topics:

  1. What is AI?
  2. Types of AI – practical implementation in Fortune500 companies
  3. Applications of AI
  4. Drawbacks of AI
  5. Programming skills for AI

What is AI?

AI, or artificial intelligence, is the practice of using software to perform tasks that would otherwise require human effort. Machine learning is generally considered a branch of AI, and sophisticated algorithms are used for everything from automating repetitive tasks to building self-learning systems. Sadly, the term AI is nowadays used interchangeably with data science and machine learning, so it is difficult to draw a clear line between them.

Types of AI

In terms of implementation and usage in Fortune 500 companies, AI can address three important business needs (Davenport & Ronanki, 2018):

  1. automating business processes (process automation),
  2. gaining insight through data analysis (cognitive insight), and
  3. engaging with customers and employees (cognitive engagement).

Process automation:

This is perhaps the most prevalent form of AI, where code on a remote server or a sensor sorts information and makes decisions that would otherwise be made by a human. Examples include updating multiple databases with customer address changes or service additions, replacing lost credit or ATM cards, and using natural language processing to parse humongous volumes of legal and contractual documents to extract provisions. These robotic process automation (RPA) technologies are the most popular because they are easy to implement and offer great returns on investment. They are also controversial, because they can sometimes replace low-skilled manual jobs, even though those jobs were always at risk of being outsourced or given to minimum-wage workers.

Cognitive insight:

These processes use algorithms to detect patterns in vast volumes of data and interpret their meaning. Using unsupervised machine learning algorithms helps these processes become more efficient over time. Examples include predicting what a customer is likely to buy, identifying credit fraud in real time, and mass-personalizing digital ads. Given the amounts of data involved, cognitive insight applications are typically used for tasks that would be impossible for people, or to augment human decision-making.

Cognitive engagement:

These projects are the most complicated and take the most time, and therefore generate the most buzz and are the most prone to mismanagement and failure. Examples include intelligent chatbots that can process complex questions and improve with every live customer interaction, voice-to-text reporting solutions, recommendation systems that help healthcare providers create customized care plans based on an individual patient's health status and previous treatments, and digital concierges that recreate customer intimacy with digital customers. However, such AI projects are still not completely mainstream, and companies tend to take a conservative approach when using them in customer-facing systems.

Applications of AI:

There are many applications of AI and currently startups are racing to build AI chips for data centers, robotics, smartphones, drones and other devices. Tech giants like Apple, Google, Facebook, and Microsoft have already created interesting products by applying AI software to speech recognition, internet search, and classifying images. Amazon.com’s AI prowess spans cloud-computing services and voice-activated home digital assistants (Alexa, Amazon Echo). Here are some other interesting applications of AI:

  1. Driverless vehicles
  2. Robo-advisors that can recommend investments, re-balance stock/bond ratios and make personalized recommendations for an individual's portfolio. An interesting extension of this technique is the list of 50 startups that the CEO of Quid identified in 2009 as having the most potential to grow. Today 10 of those companies have reached billion-dollar valuations (Reddy, 2017), including famous names like Evernote, Spotify, Etsy, Zynga, Palantir, Cloudera and OPOWER. [Personal note – if you have not yet heard of Quid, follow them on Twitter @Quid. They publish some amazing business intelligence reports!]
  3. Image recognition that can aid law enforcement personnel in identifying criminals.
  4. LexisNexis has a product called PatentAdvisor (lexisnexisip.com/products/patent-advisor), which uses data on individual patent examiners and how they have handled similar patent applications to predict the likelihood of an application being approved. Similarly, there are software applications that use artificial intelligence to help lawyers gather research material for cases by identifying precedents that maximize the chances of a successful ruling. (Keiser, 2018)

Drawbacks of AI:

There is no doubt that AI has created some amazing opportunities (image recognition to classify malignant tumors) and allowed companies to pass on boring admin tasks to machines. However, since AI systems are created by humans, they do have the following risks and limits:

  1. AI bias: The machine learning algorithms underlying AI systems have biases. All algorithms use an initial training set to learn how to identify and predict values, so if the training set is biased, the predictions will also be biased. Garbage in, garbage out. Moreover, these biases may not appear as an explicit rule but may instead be embedded in subtle interactions among the thousands of factors the model considers. Hence, heavily regulated industries like banking bar the use of AI in loan approvals, as it may conflict with fair-lending laws.
  2. Lack of verification: Unlike regular rule-based systems, the neural networks typically used in AI systems deal with statistical truths rather than literal truths. Such systems may fail in extremely rare cases, because the algorithm overlooks events with a very low probability of occurrence, such as a Wall Street crash or a sudden natural calamity like a volcanic eruption (think Hawaii). This lack of verifiability is a major concern in mission-critical applications, such as controlling a nuclear power plant, or when life-or-death decisions are involved.
  3. Hard to correct errors: If an AI system makes an error (as all systems eventually do), diagnosing and correcting it becomes unimaginably complex, because the underlying mathematics is very complicated.
  4. Human creativity and emotions cannot be automated: AI is excellent at mundane tasks, but not so good at things that are intuitive. As the authors state in their book (Davenport & Kirby, 2016), if the logic can be articulated, a rule-based system can be written and the process can be automated. However, tasks that involve emotions, creative problem-solving and social interactions cannot be automated. Examples include FBI negotiators, soldiers on flood rescue missions, and the inventor who knew the iPod would change the music industry long before anyone expressed a need for it.

Programming Skills for AI:

The skills used to build AI applications are the same as those needed for data science and software engineering roles. The most in-demand programming languages are Python, R, Java and C++. If you are looking to get started, three excellent resources are listed below:

  1. Professional Program from Microsoft. The courses are completely free (gasp) although they do charge $99 per course for verified certificates. I took the free versions, and the courses offer a good mix of both practical labs and theory. https://academy.microsoft.com/en-us/professional-program/tracks/artificial-intelligence/
  2. Introduction to AI course from Udacity. https://www.udacity.com/course/intro-to-artificial-intelligence–cs271
  3. AI and Deep Learning courses by Kirill Eremenko on Udemy. I've taken 4 courses from him and they were all great value for money, giving very real-world, hands-on coding experience. https://www.udemy.com/artificial-intelligence-az/

Please note that all three of these are honest recommendations, and I am not being paid or compensated in any shape or form for adding these links.

 

 

REFERENCES

Brynjolfsson, E., McAfee, A. (2017) The business of artificial intelligence: What it can and cannot do for your organization. Harvard Business Review website. Retrieved from https://hbr.org/cover-story/2017/07/the-business-of-artificial-intelligence

Davenport, T., Kirby, J. (2016) Only Humans Need Apply: winners and losers in the age of smart machines. Harper Business.

Davenport, T., Ronanki, R. (2018) Artificial Intelligence for the Real World. Harvard Business Review.

Keiser, B. (2018) Law library Management and Legal Research meet Artificial Intelligence. onlineresearcher.net

Reddy, S. (2017) A computer was asked to predict which start-ups would be successful. The results were astonishing. World Economic Forum. Retrieved from https://www.weforum.org/agenda/2017/07/computer-ai-machine-learning-predict-the-success-of-startups/

 

 

 

 

How to raise money on Kickstarter – extensive EDA and prediction tutorial

In this tutorial, we will explore the characteristics of projects on Kickstarter and try to understand what separates the winners from the projects that failed to reach their funding goals.

Qs for Exploratory Analysis:

We will start our analysis with the aim of answering the following questions:

    1. How many projects were successful on Kickstarter, by year and category?
    2. Which sub-categories raised the most amount of money?
    3. Which countries do the projects originate from?
    4. How many projects exceeded their funding goal by 50% or more?
    5. Did any projects reach $100,000 or more? $1,000,000 or higher?
    6. What was the average amount contributed by each backer, and how does this change over time? Does this amount differ with categories?
    7. What is the average funding period?

 

Predicting success rates:
Using the answers from the above questions, we will try to create a model that can predict which projects are most likely to be successful.

The dataset is available on Kaggle, and you can run this script LIVE using this kernel link. If you find this tutorial useful or interesting, then please do upvote the kernel ! 🙂

Step 1 – Data Pre-processing

a) Let us take a look at the input dataset:

The projects are divided into main and sub-categories. The pledged amount “usd_pledged” has an equivalent value converted to USD, called “usd_pledged_real”. However, the goal amount does not have this conversion. So for now, we will use the amounts as is.

We can see how many people are backing each individual project using the column, “backers”.
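To make the steps reproducible, here is a minimal sketch of the loading step in R, assuming the Kaggle file is named "ks-projects-201801.csv" (adjust the path and name to match your download):

library(data.table)

# Read the Kickstarter projects file and inspect its structure
kick <- fread("ks-projects-201801.csv", stringsAsFactors = FALSE)

str(kick)   # column names and types: category, main_category, goal, pledged, backers, ...
dim(kick)   # number of projects and columns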

b) Now let us look at the first 5 records:

The name doesn’t really indicate any specific pattern although it might be interesting to see if longer names have better success rates. Not pursuing that angle at this time, though.

c) Looking for missing values:
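A minimal sketch of this check, counting both NA values and "empty" strings in every column:

# Count missing values per column
sapply(kick, function(x) sum(is.na(x)))

# Also count empty strings, which would not show up as NA
sapply(kick, function(x) sum(trimws(as.character(x)) == "", na.rm = TRUE))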

Hurrah, a really clean dataset, even after searching for “empty” strings. 🙂

d) Date formatting and splitting:

We have two dates in our dataset – "launch date" and "deadline date". We convert them from strings to date format.
We also split these dates into their respective year and month columns, so that we can plot variations over time.
So we will now have 4 new columns: launch_year, launch_month, deadline_year and deadline_month.
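A minimal sketch of this step, assuming the date columns are named "launched" and "deadline" as in the Kaggle file:

library(lubridate)

# Convert the string columns to Date objects
kick$launched <- as.Date(kick$launched)
kick$deadline <- as.Date(kick$deadline)

# Split out the year and month components for plotting over time
kick$launch_year    <- year(kick$launched)
kick$launch_month   <- month(kick$launched)
kick$deadline_year  <- year(kick$deadline)
kick$deadline_month <- month(kick$deadline)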

Exploratory analysis:

a) How many projects are successful?

We see that “failed” and “successful” are the two main categories, comprising ~88% of our dataset.
Sadly we do not know why some projects are marked “undefined” or “canceled”.
"Live" projects are those whose deadlines have not yet passed, although a few of them have already achieved their goals.
Surprisingly, some 'canceled' projects had also met their goals (pledged_amount >= goal).
Since these other categories are a very small portion of the dataset, we will subset and only consider records with status "failed" or "successful" for the rest of the analysis.
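A minimal sketch of the status breakdown and the subsetting step, assuming the status column is called "state" as in the Kaggle file:

# Percentage share of each project status
round(prop.table(table(kick$state)) * 100, 1)

# Keep only the two main outcomes for the rest of the analysis
kick <- kick[kick$state %in% c("failed", "successful"), ]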

b) How many countries have projects on Kickstarter?

We see that projects are overwhelmingly US-based. Some country names carry the garbled tag N,0", so we mark those as "unknown".

c) Number of projects launched per year:

Some records show launch dates like 1970, which does not look right, so we discard any records with a launch or deadline year before 2009.
Plotting the counts per year, we see from the graph below that the number of projects peaked in 2015 and then declined. However, this should NOT be taken as an indicator of success rates.

 

 

Drilling down a bit more to see the count of projects by main_category.

Over the years, the largest number of projects have been in the categories:

    1. Film & Video
    2. Music
    3. Publishing

d) Number of projects by sub-category (Top 20 only):


The Top 5 sub-categories are:

    1. Product Design
    2. Documentary
    3. Music
    4. Tabletop Games (interesting!!!)
    5. Shorts (really?! )

Let us now see the status of projects for these Top 5 sub-categories:
From the graph below, we see that for the categories "Shorts" and "Tabletop Games" there are more successful projects than failed ones.

e) Backers by category and sub-category:

Since there are a lot of sub-categories, let us explore the sub-categories under the main theme "Design".

Product design is not just the sub-category with the highest count of projects, but also the category with the highest success ratio.

f) Add a flag to see how many projects were funded beyond their goal:

So ~40% of projects reached or surpassed their goal, which matches the number of successful projects.
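A minimal sketch of the flag calculation; "pledged" and "goal" are the assumed column names (swap in usd_pledged_real / usd_goal_real if you prefer the converted amounts):

# Flag projects whose pledged amount met or exceeded the goal
kick$funded <- ifelse(kick$pledged >= kick$goal, 1, 0)

mean(kick$funded)   # share of projects that reached or surpassed their goal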

g) Calculate average contribution per backer:

From the mean, median and max values we quickly see that the median amount contributed by each backer is only ~$40, whereas the mean is higher due to extreme positive values. The maximum amount from a single backer is ~$5,000.

h) Calculate reach_ratio

The amount per backer is a good start, but what if the goal amount itself is only $1,000? Then an average contribution per backer of $50 implies we only need 20 backers.
So to better understand the probability of a project’s success, we create a derived metric called “reach_ratio”.
This takes the average user contribution and compares it against the goal fund amount.

We see the median reach_ratio is <1%. Only in the third quartile do we even touch 2%!
Clearly most projects have a very low reach ratio. We could subset for “successful” projects only and check if the reach_ratio is higher.
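A minimal sketch of the two derived metrics, guarding against projects with zero backers or a zero goal:

# Average contribution per backer
kick$avg_backer_amt <- ifelse(kick$backers > 0, kick$pledged / kick$backers, 0)

# reach_ratio: average contribution as a percentage of the goal amount
kick$reach_ratio <- ifelse(kick$goal > 0, 100 * kick$avg_backer_amt / kick$goal, NA)

summary(kick$reach_ratio)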

i) Number of days to achieve goal:

Predictive Analytics:

We will apply a very simple decision tree algorithm to our dataset.
Since we do not have a separate "test" set, we will split the input dataframe into 2 parts (70/30 split).
We will use the smaller set to test the accuracy of our algorithm.
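A minimal sketch of this step using rpart; the predictor list is an assumption, and the exact variables used in the kernel may differ:

library(rpart)

# Make the outcome a factor and create a 70/30 train/test split
kick$state <- factor(kick$state)
set.seed(42)
idx   <- sample(nrow(kick), size = 0.7 * nrow(kick))
train <- kick[idx, ]
test  <- kick[-idx, ]

# Fit a classification tree and score the held-out 30%
tree_fit <- rpart(state ~ backers + reach_ratio + main_category + launch_year,
                  data = train, method = "class")
pred <- predict(tree_fit, newdata = test, type = "class")

table(predicted = pred, actual = test$state)   # confusion matrix on the test set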

Taking a peek at the decision tree rules:

kickstarter success decision tree





Thus we see that "backers" and "reach_ratio" are the main significant variables.

Re-applying the tree rules to the training set itself, we can validate our model:

From the above tables, we see that the error rate is ~3% and the area under the curve is >= 97%.

Finally applying the tree rules to the test set, we get the following stats:

From the above tables, we see that the error rate is still ~3% and the area under the curve remains >= 97%.

 

Conclusion:

Thus, in this tutorial we explored the factors that contribute to a project's success. Main theme and sub-category were important, but the number of backers and the "reach_ratio" were found to be most critical.
If founders want to gauge their probability of success, they could measure their "reach_ratio" halfway to the deadline, or perhaps when 25% of the timeline is complete. If the numbers are low, it means they need to double down and use promotions and social media marketing to get more backers and funding.

If you liked this tutorial, feel free to fork the script. And don't forget to upvote the kernel! 🙂

Who wants to work at Google?

In this tutorial, we will explore the open roles at Google and try to see what common attributes Google looks for in future employees.

 

This dataset comes from the Kaggle site and contains text information about job location, title, department, minimum and preferred qualifications, and the responsibilities of each position. You can download the dataset here, and run the code on the Kaggle site itself here. Using this dataset, we will try to answer the following questions:

  1. Where are the open roles?
  2. Which departments have the most openings?
  3. What are the minimum and preferred educational qualifications needed to get hired at Google?
  4. How much experience is needed?
  5. What categories of roles are the most in demand?

Data Preparation and Cleaning:

The data is all free-form text, so we need to do a fair amount of cleanup to remove non-alphanumeric characters. Some of the job locations have special characters too, so we remove those using basic string manipulation functions. Once we read in the file, this is a snapshot of the resulting dataframe:
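A minimal sketch of that cleanup, assuming the Kaggle file is "job_skills.csv" with seven columns in the order shown (the shorter names are my own renaming):

jobs <- read.csv("job_skills.csv", stringsAsFactors = FALSE)
names(jobs) <- c("company", "title", "category", "location",
                 "responsibilities", "min_qual", "pref_qual")

# Strip unusual symbols and collapse repeated whitespace
clean_text <- function(x) {
  x <- gsub("[^[:alnum:][:space:],&/()-]", " ", x)
  trimws(gsub("\\s+", " ", x))
}

jobs$location         <- clean_text(jobs$location)
jobs$responsibilities <- clean_text(jobs$responsibilities)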

Job Categories:

First, we look at which departments have the most open roles. Surprisingly, there are more roles open in the "Marketing and Communications" and "Sales & Account Management" categories than in the traditional technical business units (like Software Engineering or Networking).

Full-time versus internships:

Let us see how many roles are full-time and how many are for students. As expected, only ~13% of roles are for students, i.e. internships. The majority are full-time positions.

Technical Roles:

Since Google is a predominantly technical company, let us see how many positions need technical skills, irrespective of the business unit (job category).

a) Roles related to “Google Cloud”:

To check this, we investigate how many roles have the phrase "Google Cloud" either in the job title or in the responsibilities. As shown in the graph below, ~20% of the roles are related to cloud infrastructure, clearly showing that Google is making cloud services a high priority.
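A minimal sketch of that check, using a case-insensitive search over the renamed columns from the cleanup step:

# Flag roles that mention "Google Cloud" in the title or the responsibilities
cloud_role <- grepl("google cloud", jobs$title, ignore.case = TRUE) |
              grepl("google cloud", jobs$responsibilities, ignore.case = TRUE)

round(100 * mean(cloud_role), 1)   # approximate share of cloud-related roles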

Educational Qualifications:

Here we parse the "min_qual" and "pref_qual" columns to see the qualifications needed for each role. If we only take the minimum qualifications into consideration, we see that 80% of the roles explicitly ask for a bachelor's degree, and less than 5% ask for a master's degree or PhD.

min_qualifications for Google jobs

However, when we consider the "preferred" qualifications, that share increases to a whopping ~25%. Thus, a fourth of all roles would be better suited to candidates with a master's degree or above.
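A rough sketch of this parsing, treating a few keyword matches as a proxy for the degree requirement (the patterns are assumptions and will not catch every phrasing):

# Degree keywords in the minimum and preferred qualification text
has_bachelors <- grepl("BA/BS|Bachelor", jobs$min_qual,  ignore.case = TRUE)
min_advanced  <- grepl("Master|MBA|PhD", jobs$min_qual,  ignore.case = TRUE)
pref_advanced <- grepl("Master|MBA|PhD", jobs$pref_qual, ignore.case = TRUE)

round(100 * c(bachelors        = mean(has_bachelors),
              min_masters_phd  = mean(min_advanced),
              pref_masters_phd = mean(pref_advanced)), 1)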

Google Engineers:

Google is famous for hiring engineers for all types of roles. So we will read the job qualification requirements to identify what percentage of roles require a technical degree or a degree in Engineering.
As seen from the data, 35% specifically ask for an Engineering or Computer Science degree, including roles in marketing and other non-engineering departments.

Years of Experience:

We see that 30% of the roles require at least 5 years of experience, while 35% of roles need even more.
So if you did not get hired at Google right after graduation, no worries. You have a better chance after gaining strong experience at other companies.
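A rough heuristic for the experience requirement: pull the first "N years" mention out of the minimum qualifications (roles with no explicit number are left as NA):

m   <- regexpr("[0-9]+(?= years)", jobs$min_qual, perl = TRUE)
yrs <- rep(NA_integer_, length(m))
yrs[m > 0] <- as.integer(regmatches(jobs$min_qual, m))

table(yrs, useNA = "ifany")
mean(yrs >= 5, na.rm = TRUE)   # share of roles asking for 5+ years of experience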

Role Locations:

The dataset does not include geographical coordinates for mapping. However, this is easily overcome by using the geocode() function and the amazing Rworldmap package. We are only plotting the locations, so some places will have more roles than others. We see open roles in all parts of the world; however, the maximum number of positions are in the US, followed by the UK, and then Europe as a whole.

Responsibilities – Word Cloud:

Let us create a word cloud to see what skills are most needed for the cloud engineering roles. We see that words like "partner", "custom solutions", "cloud", "strategy" and "experience" are more frequent than any specific technical skill. This suggests the Google Cloud roles are best filled by senior resources for whom leadership and business skills matter more than expertise in a specific technology.
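A minimal word-cloud sketch with the tm and wordcloud packages, reusing the cloud_role flag from the earlier step:

library(tm)
library(wordcloud)

# Build a corpus from the responsibilities of the cloud-related roles
docs <- VCorpus(VectorSource(jobs$responsibilities[cloud_role]))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))

# Term frequencies, then the cloud itself
tdm  <- TermDocumentMatrix(docs)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 75, colors = brewer.pal(8, "Dark2"))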

 

Conclusion:

So who has the best chance of getting hired at Google?

For most of the roles (from this dataset), a candidate with the following traits has the best chance of getting hired:

  1. 5+ years of experience.
  2. An Engineering or Computer Science bachelor's degree.
  3. A master's degree or higher.
  4. Based in the US.

The code for this script and graphs are available here on the Kaggle website. If you liked it, don’t forget to upvote the script. 🙂

Thanks and happy coding!

Top US Cities with Highest Rent

In this post, we will use the Zillow rent dataset to perform exploratory and inferential statistics. Our main goal is to identify the most expensive real-estate cities in the US.

 

Input Files:

The Kaggle dataset contains two files with rental prices for 13,000+ cities across the time frame Nov 2010 – Jan 2017. One file contains the rent values; the other has the price per square foot.

Additionally, we use a public dataset to map geographical coordinates to the city names. The main analysis does not need the latitude and longitude values, so you can proceed without this file, except for the final map. However, having these values helps create some stunning visuals.

Feel free to use the location data file with other datasets or projects, as it contains coordinate information for cities in numerous countries. 

 

Note of caution:

The location data file is quite large, so the fread() call to read it and the merge() later will take a minute or so.

 

Analysis Qs:

To give some structure to our analysis, these are the main goals for the project:

  1. Most expensive cities in US, by rent.
  2. Most expensive cities by price per square foot.
  3. Which states have a higher concentration of such cities?
  4. Rent trends over time.

Please note that the datafiles and R-program code are available on the Projects page under Aug 2017.

Data Cleansing:

The Kaggle files are quite clean, without many missing values. However, to use them for analyzing trends over time, we still need to process them. In this case, the rent for each month is in a separate column, so we need to gather those columns together. We achieve this with a custom for-loop, sketched below.
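A minimal sketch of that loop, assuming the rent file is "price.csv" and the monthly columns start at column 7 (adjust the names and index to match your download):

library(data.table)

rent_wide  <- fread("price.csv", stringsAsFactors = FALSE)
month_cols <- names(rent_wide)[7:ncol(rent_wide)]   # one column per month

# Stack the monthly columns into a long data frame: one row per city per month
rent_long <- NULL
for (m in month_cols) {
  tmp <- data.frame(City  = rent_wide$City,
                    State = rent_wide$State,
                    month = m,
                    rent  = rent_wide[[m]],
                    stringsAsFactors = FALSE)
  rent_long <- rbind(rent_long, tmp)
}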

 

On a side note, if you are trying to massage data into reporting formats, say similar to a pivot table in Excel, similar for-loops can save you tons of time on manual steps.

We also merge in the latitude and longitude data at this step. Some of the city names don't match exactly, so we use string manipulation functions to get a clean match.

This is how the data frame looks after the data processing step:

transformed data object


 

Rent Analysis:

We will use the Jan 2017 data to rank cities on parameters like population density, rent amount and price per square foot.

 

a) Most expensive cities in the US, by rent:

We use the Jan 2017 data to sort the cities by rent amount, then assign a label similar to "Num. City_Name". We take the list of the top 10 cities and merge it with the original rent dataframe to view the rent trend over time.
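A minimal sketch of the ranking step, assuming the Jan 2017 column was labelled "January 2017" in the long data frame built earlier:

# Rank cities by the Jan 2017 rent and label the top 10
jan17 <- rent_long[rent_long$month == "January 2017", ]
jan17 <- jan17[order(-jan17$rent), ]

top10 <- head(jan17, 10)
top10$label <- paste0(seq_len(10), ". ", top10$City)

# Merge back with the full series to plot the rent trend for these 10 cities
top10_trend <- rent_long[rent_long$City %in% top10$City, ]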

This gives us the list below:

US cities with highest rent


 

If we plot the rent values since Nov 2010, we get a chart as shown below:

We notice that Jupiter Island and Westlake show some intra-year rent patterns, indicating seasonal shifts in demand and supply.

 

b) Cities with highest price by area:

Using the price per square foot dataset, we can also identify cities with the highest price per square foot area. The city list for this analysis is as follows:

 

Notice that the city names in the two lists are not identical. Jupiter Island, which was first in list 1, has moved down to spot 4. Similarly, a 2,000 sq. ft. home in Malibu, CA would set you back $9,000 per month! We also see that the cities in this list are predominantly in California or Florida.

 

c) Cities with small area but huge rent!!

Let us investigate which cities make you shell out tons of money for very small homes. We can calculate the implied area using the price per sq. ft. and the rent amount.
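A minimal sketch of the area calculation, assuming the price-per-square-foot values have already been merged into jan17 as a hypothetical column price_per_sqft:

# Implied home size in square feet = monthly rent / price per square foot
jan17$area_sqft <- jan17$rent / jan17$price_per_sqft

# Small homes with big rent
small_pricey <- jan17[jan17$area_sqft < 900, ]
head(small_pricey[order(-small_pricey$rent), ], 10)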

small home, big rent


 

d) Ranking cities with higher population density:

Similar code gives us the list below:

rent in cities with large population


Not surprisingly, we see names like New York, Los Angeles and Chicago heading the list.

 

 

Mapping Cities & Rent:

We've added the geographical coordinates to our dataset, so let us plot the cities and their median rent. We add a column for the popup text we want to display and use the leaflet() function to create the map.
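A minimal leaflet sketch, assuming a merged data frame city_rent with hypothetical columns lat, lon, City and median_rent:

library(leaflet)

rent_map <- leaflet(city_rent) %>%
  addTiles() %>%                                   # base map tiles
  addCircleMarkers(lng = ~lon, lat = ~lat,
                   popup = ~paste0(City, ": $", median_rent),
                   clusterOptions = markerClusterOptions())

rent_map   # renders the interactive, zoomable map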

Note that the maps look a little blurred at first; after about 10 seconds the areas look a lot clearer as the map tiles load. You can then see national and state highways, city names and other details, and the zoom feature lets you zoom in and out.

Images for Hawaii are shown below:

US city map with clusters


Zooming to the left and down to view Hawaii.

Hawaii map


 

Zooming in further to check the Kailua area of Hawaii:

Median rent in Kailua, HI


 

Data Insights:

  1. The top 10 most expensive cities seem to be concentrated in CA and TX (California and Texas).
  2. In such cities you have to pay $10,000+ as rent.
  3. For the cities where you pay a lot for homes smaller than 900 sq ft, we notice that the Hawaiian cities show a seasonal trend, perhaps due to tourist cycles and the torrential rains.
  4. The most populous cities are not always the most expensive, although it probably means a lot more competition for the same few homes.
  5. Median rent in the most populous cities is ~$1,300.

What other insights did you pick up?

 

Next Steps:

You can play around with the data and code to see other rankings or create your visualizations. Here are some pointers to get you started:

  1. Rank cities by highest rent for some random months – Jan 2014, July 2015, Mar 2012, Aug 2013, Nov 2016, July 2011, Sep 2015. Do the top 20 lists remain the same, or do they differ?
  2. Collect the list of city names from all of the above and view their trends over time. Identify which city has the maximum price % increase, where price % = [(Jan 2017 rent – Nov 2010 rent) / Nov 2010 rent].
  3. Which state has the highest number of such expensive cities? If the answer is CA, which is the second most expensive state?
  4. Repeat steps 1-3 for price per square foot.
  5. Select a midwestern state like Kansas, Oklahoma, North Dakota or Mississippi and repeat the analysis at a state level.

 

Please feel free to download the code files and datasets from the Projects Page under Aug 2017.

August Projects

In this month’s project, we will implement cluster analysis using the “K-means algorithm”.

We use weather data from 1,500+ locations (near airports) to understand temperature patterns by latitude and time of year.

We use cluster = 5 and assign letters A through E to locations with similar weather patterns (a minimal clustering sketch follows the list below). At the end of the analysis, you should be able to draw the following insights from the resulting graphs and tables:

  1. Temperature patterns are similar towards the far North and South, just vertically shifted.
  2. The Pacific coast is different from the rest of the nation, where the temperature is static almost throughout the year.
  3. It is interesting to see how states in two different parts of the country show similar weather patterns since they are on the same latitude (see Minnesota and Maine). During peak summer, these two states are hotter than California.
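A minimal clustering sketch, assuming "temps" is a matrix with one row per location and one column per week (or month) of average temperature:

set.seed(123)

# Five clusters of locations with similar temperature profiles
km <- kmeans(temps, centers = 5, nstart = 20)

# Map cluster numbers 1-5 to the letters A-E used in the charts
cluster_letter <- LETTERS[km$cluster]
table(cluster_letter)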

 

A sample graph from the analysis is shown below.


US states divided into 5 major weather clusters

Data set and code files are available from the main Project site page, under the row for Jul/Aug 2017.

Take a look and play around with the data, to investigate the following:

  1. What happens if you increase cluster size to 7? What happens if you decrease it to 3?
  2. What is the monthly weather pattern for Hawaii (state code = HI) versus New Hampshire (abbreviation = NH) ?
  3. What is the weekly average temperature for a tropical state like Florida? (Plot a chart with median temperatures for all 52 weeks, by year.) Has the average temperature gone up due to global warming?

Please leave your thoughts and comments, or questions if you get stuck on any point.

Happy Coding!

 

 
