Journey of Analytics

Deep dive into data analysis tools, theory and projects


PHLAI – Philly Artificial Intelligence Conference Summary

I had the opportunity to attend the third annual PHLAI 2019 conference on Artificial Intelligence and Machine Learning. Organized by Comcast, this was an amazing full-day event (on Aug 15th) with a theme of using AI to improve customer experience.

AI for customer service

For those who are not in the Greater Philadelphia area, I have to point out that most AI/Machine learning or even tech conferences typically happen either on the West Coast or in NYC. So the cost of travel/lodging can quickly add up if you want to attend as an individual.

It has always been a point of contention for me that there aren’t any interesting conferences held in the Philly area. After all, we are home to a large number of startups (RJMetrics, Blackfynn, Stitch, Seer Interactive), Fortune 500 employers like Nasdaq, Comcast, Vanguard and most of the big banks (JPMorgan, Barclays, etc.)

So here is a big shoutout to Comcast (esp. Ami Ehlenberger) for making the magic happen in Philly!

PHLAI 2019 Conference tracks and schedule

Coming back to the conference, here is a summary of my takeaways of the event:
1) Keynote – morning & evening
2) Speaker & Workshop Sessions – DataRobot, H2O, Humane AI, Tableau and more
3) Networking
4) Key Takeaways
5) Overall

1. Keynote

There were 2 keynotes – morning and evening. Both were brilliant, engaging and filled with useful information.

Morning Keynote

This was given by Nij Chawla of Cognitive Scale. Snippets from his talk that resonated most with me:

  • Practical tips on how to use AI to create individual customization, especially to improve conversions on business websites.
  • For getting the most ROI on AI investments, begin with smaller AI projects and deploy/iterate. Most projects fail because they aim at a massive overhaul that takes so long to move to production that half the assumptions and data become invalid. Divide and conquer.
  • Don’t use AI just for the sake of it.
  • Real-time data is crucial. A 360-degree view built by aggregating multiple sources will help drive maximum effectiveness, especially in banking and finance.
Morning Keynote – Cognitive Scale

He also presented some interesting examples of AI projects that are already in use today in various domains:

  • Early screening of autism and Parkinson’s using the front camera of a phone. Apparently the typing/tapping on the phone is different for such patients, and when combined with facial recognition technology, the false positives have been very low.
  • Drones in insurance. One company used drones to take pictures to assess damage after Hurricane Harvey and proactively filed claims on behalf of its customers. This meant that the paperwork was done and claim money was released quickly for affected customers. Home repair contractors are in heavy demand after such natural disasters, but this company’s customers had no trouble getting help because everything was already set up! What better example of extraordinary customer service than helping your customers rebuild their lives and homes? WOW!
  • Drone taxi in Dubai. Basically a driverless helicopter. Scary yet fascinating!
  • VR glasses for shopping. Allows customers to interact with different products and choose color/ design/ style. Ultimate customization without the hassles of carrying gigantic amounts of inventory.


Evening Keynote

This was presented by Piers Lingle and David Monnerat from Comcast. They spoke about the AI projects implemented at Comcast, including projects that proactively present options to customers to improve customer experience and reduce call center volumes and complaints. I loved the idea of using automation to present a 2-way dialog using multiple means of communication. They also spoke about chatbots and strategies that help chatbots learn at scale.

Using AI to augment customer service & wow them!

Some other salient points:

  • Automation is a part of the experience, not an add-on or an extra bother
  • The voice remote for Comcast TV users allows customers to fast-forward through commercial breaks. Cool! 🙂
  • Think beyond NLP. AI works better in troubleshooting tasks when customers are struggling with describing problems, for example troubleshooting internet connectivity issues.
  • Measure from the beginning. Measure the right thing. Easy to say, hard to do.
  • Done correctly, AI does seem like magic. However, it takes a lot of work and science to manage the complexity. So appreciate it! 🙂 Above all, trust the tech experts hired for this role instead of pigeonholing them with hyper-narrow constraints but high expectations.


2. Speaker & Workshop sessions

The afternoon had 3 different tracks of speaker sessions with 3 talks each – Product, Customer Service & Real-World Experience. Attendees were allowed to attend talks from all 3 tracks, which is what I did. The speaker lineup was so spectacular that I noticed a lot of folks spoilt for choice, scratching their heads after every talk about where to go next! 🙂

I am only writing about the ones I attended. However, I did hear positive feedback overall about all the other sessions too! (No offense to the folks who presented but are not mentioned here!)

Workshop by DataRobot

This is an interesting software product that helps you explore data, automate data cleansing and, most importantly, test by running multiple models in parallel. So you could test with hundreds of models and algorithms. The workshop was presented by the VP of Data Science, Gourab De, and was extremely appealing.

It does have features for tweaking parameters, and excellent documentation, so personally I think it would greatly augment the work of an experienced data scientist. Given the shortage of talent, tools like these would allow smaller teams to become faster and more efficient.

The availability of logs is another fabulous feature for banks (since I work in one) due to the stringent regulatory requirements around compliance, ethics and model governance.

Chaos Engineering & Machine Learning

Presented by Sudhir Borra of Comcast. The talk was mainly about the concept of chaos engineering, used to test the resiliency of systems. This is quite crucial for serverless AI and automated CI/CD pipelines where models are tested and deployed in the cloud. The talk was quite technical yet practical for teams and companies looking to implement robust AI systems within their organizations.

Chaos Engineering – initially developed at Netflix

Data Stories Tableau Session

Presented by Vidya Setlur of Tableau, who spoke about some interesting features in Tableau. For example, did you know Tableau assigns logical color palettes where it makes sense, rather than default colors? For a dataviz on vegetables, broccoli is assigned green, whereas carrots would be assigned orange. She also spoke about the “Stroop effect”, which I urge every data scientist and Tableau developer to look up and understand.

There were also some interesting Qs about expanding the natural language questions to be activated via voice. The answer was that it is currently not available because the folks who would truly find this useful are senior executives who would possibly only use it on a phone (not desktop), which would also require touchscreen capabilities for the drill-down features.

This was a perspective I had not thought about, so I was quite impressed at how far the company has thought about features users might want far into the future. Like Apple, Tableau’s elegant design, ease of use and intuitive features make it a class apart in terms of functionality and customer experience.

The Humane Machine: Building AI Systems with Emotional Intelligence

Extremely engaging talk by Arpit Mathur of Comcast on some interesting work that Comcast is doing.

He also spoke of the LIWC (Linguistic Inquiry and Word Count) library and how to apply it to natural language data to gauge sentiments, similar to Plutchik’s wheel of emotions. Also on how to use these for creating more “sentient” chatbots and applications, since context matters as much as words when we (humans) communicate. For example, some cultures use words literally, whereas others could use them to mean the exact opposite!

He also discussed ethics in AI, which naturally was a “hot” topic and sparked multiple questions and discussions.

Hands-on Labs – H2O – Automated AI/ML

I’d heard many rave reviews about this product, so it was wonderful to attend a hands-on workshop on how to use the tool, its available features, and the effectiveness of the models it can generate. Personally, I found it to have just the right mix of automated model generation (great for newbies) and model customization (for advanced users).


3. Networking

The conference included two rounds of networking time: first over free breakfast, and second at a 5:15 pm formal networking reception. Both were great opportunities to meet other attendees and speakers.

I was not attending with any aim for job hunting, but I did see a list of open roles at Comcast. Plus, I did see folks talking about roles in their org, and describing their work, so certainly the event was an excellent venue to have informational interviews and reach out for potential collaboration.

Not to mention the new contacts will be useful in the future, should I choose to look.

I met a wide variety of folks – from students to software developers to product leads and even senior engineering leaders who wanted to learn best practices on how to get ROI on future AI investments. There were folks who were just interested in learning more about AI, some looking for their next big idea, and everyone in between. It was not just tech folks, either; attendees came from domains like adtech, marketing, automobiles, healthcare and of course banking.

Amazing to see the turnout and community. Overall, I think job seekers should attend such events!

Oh, and I have to mention the gorgeous Comcast building – amazing working space and the central atrium was both futuristic and functional. I definitely regretted not taking the Comcast tour! (In fairness, the workshop I attended was fabulous too! )

Of course there was also time to chitchat between sessions and I am thankful to everyone who interacted with me! Thanks for your sparkling conversations.

Networking opportunities galore


4. Key Takeaways

  • Focus on the customer and the end goal. Technology is only a tool to enable better outcomes and efficiencies. AI tools have matured, so make full use of them.
  • For the best returns and impact from AI projects, quick action is key. Choose a specific use case, choose a metric to evaluate success and then proceed. If the model works, BRAVO! If it does not, you get valuable feedback on your initial assumptions, so you can make better decisions in the future.
  • Clean, reliable data is paramount. Garbage in, garbage out!
  • Collaboration is important. You do need to get the blessings from multiple stakeholders like IT, business, product, customer service, etc. But do not let collaboration equate to decision paralysis.
  • Done right, AI can be magical and live up to the hype. The critical component is getting started.
  • Data and user preferences change quickly, so think about automating model deployments and/or implementing CI/CD using technologies like Kubernetes or the myriad tools and software available on the market to get to production quicker.
  • There is a huge talent shortage of experienced machine learning and artificial intelligence developers, so think about “upskilling” existing employees who already know the business and customer pain points.


5. Overall

Overall, I found the event to be incredibly useful and I plan to attend again next year. I am also looking forward to using the takeaways in my own role and evangelizing AI products within my org.


DataScience Portfolio Ideas for Students & Beginners

A lot has been written on the importance of a portfolio if you are looking for a DataScience role. Ideally, you should document your learning journey so that you can reuse code, write well-documented code and also improve your data storytelling skills.

DataScience Portfolio Ideas

However, most students and beginners get stumped on what to include in their portfolio, as their projects are all the same ones their classmates, bootcamp associates and seniors have created. So, in this post I am going to tell you what projects you should have in your portfolio kitty, along with a list of ideas you can use to construct a collection of projects that will help you stand out on LinkedIn, GitHub and in the eyes of prospective hiring managers.

Job Search Guide

You can find many interesting projects on the “Projects” page of my website JourneyofAnalytics. I’ve also listed 50+ sources for free datasets in this blogpost.

In this post though, I am classifying projects based on skill level along with sample ideas for DIY projects that you can attempt on your own.

On that note, if you are already looking for a job, or about to do so, do take a look at my book “DataScience Jobs”, available on Amazon. This book will help you reduce your job search time and quickly start a career in analytics.

Since I prefer R over Python, all the project lists in this post will be coded in R. However, feel free to implement these ideas in Python, too!

a. Entry-level / Rookie Stage

  1. If you are just starting out, and are not very comfortable with even the syntax, your main aim is to learn how to code along with DataScience concepts. At this stage, just try to write simple scripts in R that can pull data, clean it up, calculate mean/median and create basic exploratory graphs. Pick up any competition dataset on Kaggle.com and look at the highest-voted EDA script. Try to recreate it on your own, and read through to understand the hows and whys of the code. One excellent example is the Zillow EDA by Philipp Spachtholz.
  2. This will not only teach you the code syntax, but also how to approach a new dataset and slice/dice it to identify meaningful patterns before any analysis can begin.
  3. Once you are comfortable, you can move on to machine learning algorithms. Rather than Titanic, I actually prefer the Housing Prices Dataset. Initially, run the sample submission to establish a baseline score on the leaderboard. Then apply every algorithm you can look up and see how it works on the dataset. This is the fastest way to understand why some algorithms work on numerical target variables versus categorical versus time series.
  4. Next, look at the kernels with decent leaderboard score and replicate them. If you applied those algorithms but did not get the same result, check why there was a mismatch.
  5. Now pick a new dataset and repeat. I prefer competition datasets since you can easily see how your score moves up or down. Sometimes simple decision trees work better than complex Bayesian logic or Xgboost. Experimenting will help you figure out why.
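The load-clean-summarize loop described above boils down to just a few lines. Here is a toy illustration (in Python/pandas, although this post's examples are in R; the column names mimic the Housing Prices dataset and the values are invented):

```python
import pandas as pd

# Toy stand-in for a Kaggle CSV; in practice you would use pd.read_csv("train.csv")
df = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
    "LotArea":   [8450, 9600, 11250, 9550, 14260],
})

# Basic cleanup and summary statistics -- the core of any first-pass EDA
df = df.dropna()
summary = {
    "mean_price":      df["SalePrice"].mean(),
    "median_price":    df["SalePrice"].median(),
    "price_area_corr": df["SalePrice"].corr(df["LotArea"]),
}
```

From here, a histogram (`df["SalePrice"].hist()`) or a scatter plot against `LotArea` completes the basic exploratory pass.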

Sample ideas –

  • Survey analysis: Pick up a survey dataset like the Stack Overflow developer survey and complete a thorough EDA – men vs. women, age and salary correlation, cities with the highest salaries after factoring in currency differences and cost of living. Can your insights also be converted into an eye-catching infographic? Can you recreate this?
  • Simple predictions: Apply any algorithms you know on the Google analytics revenue predictor dataset. How do you compare against the baseline sample submission? Against the leaderboard?
  • Automated reporting: Go for end-to-end reporting. Can you automate a simple report, or create a formatted Excel or pdf chart using only R programming? Sample code here.

b. Senior Analyst/Coder

  1. At this stage, simple competitions should be easy for you. You don’t need to be in the top 1%; even being in the top 30-40% is good enough. Although if you can win a competition, even better!
  2. Now you can start looking at non-tabular data like NLP sentiment analysis, image classification, API data pulls and even dataset mashups. This is also the stage when you probably feel comfortable enough to start applying for roles, so building unique projects is key.
  3. For sentiment analysis, nothing beats Twitter data, so get the API keys and start pulling data on a topic of interest. You might be limited by the daily pull limits on the free tier, so check if you need 2 accounts and aggregate data over a couple of days or even a week. A starter example is the sentiment analysis I did during the Rio Olympics supporting Team USA.
  4. You should also start dabbling in RShiny and automated reports, as these will help you in actual jobs where you need to present idea mockups and standardize weekly/daily reports.
Yelp College Search App

Sample ideas –

  • Twitter Sentiment Analysis: Look at the Twitter sentiments expressed before big IPO launches and see whether the positive or negative feelings correlated with a jump in prices. There are dozens of apps that look at the relation between stock prices and Twitter sentiments, but for this you’d need to be a little more creative since the IPO will not have any historical data to predict the first day dips and peaks.
  • API/RShiny Project: Develop a RShiny dashboard using Yelp API, showing the most popular restaurants around airports. You can combine a public airport dataset and merge it with filtered data from the Yelp API. A similar example (with code) is included in this Yelp College App dashboard.
  • Lyrics Clustering: Try doing some text analytics using song lyrics from this dataset with 50,000+ songs. Do artists repeat their lyrics? Are there common themes across all artists? Do male singers use different words versus female solo tracks? Do bands focus on a totally different theme? If you see your favorite band or lead singer, check how their work has evolved over the years.
  • Image classification starter tutorial is here. Can you customize the code and apply to a different image database?

c. Expert Data Scientist

DataScience Expert portfolio
  1. By now, you should be fairly comfortable with analyzing data from different data source types (image, text, unstructured), building advanced recommender systems and implementing unsupervised machine learning algorithms. You are now moving from the analyze stage to the build stage.
  2. You may or may not already have a job by now. If you do, congratulations! Remember to keep learning and coding so you can accelerate your career further.
  3. If you have not, check out my book on how to land a high-paying ($$$) Data Science job within 90 days.
  4. Look at building deep learning models using Keras and apps using artificial intelligence. Even better, can you fully automate your job? No, you won’t “downsize” yourself. Instead, your employer will happily promote you since you’ve shown them a superb way to improve efficiency and cut costs, and they will love to have you look at other parts of the business where you can repeat the process.

Sample project ideas –

  • Build an App: College recommender system using public datasets and web scraping in R. (Remember to check terms of service as you do not want to violate any laws!) Goal is to recreate a report like the Top 10 cities to live in, but from a college perspective.
  • Start thinking about what data you need – college details (names, locations, majors, size, demographics, cost), outlook (Christian/HBCU/minority), student prospects (salary after graduation, time to graduate, diversity, scholarship, student debt ) , admission process (deadlines, average scores, heavy sports leaning) and so on. How will you aggregate this data? Where will you store it? How can you make it interactive and create an app that people might pay for?
  • Upwork Gigs: Look at Upwork contracts tagged as intermediate or expert, esp. the ones with $500+ budgets. Even if you don’t want to bid, just attempt the project on your own. If you fail, you will know you still need to master some more concepts; if you succeed, it will be a superb confidence booster and learning opportunity.
  • Audio Processing: Use the VoxCeleb dataset to identify the speaker from audio/speech data. Audio files are an interesting data source, with applications in customer recognition (think bank call centers to prevent fraud), parsing customer complaints, etc.
  • Build your own package: Think about the functions and code you use most often. Can you build a package around it? The most trending R-packages are listed here. Can you build something better?

Do you have any other interesting ideas? If so, feel free to contact me with your ideas or send me a link with the Github repo.

How to Become a Data Scientist

This question and its variations are the most searched topics on Google. As a practicing datascience professional, and manager to boot, dozens of people ask me this question every week.

This post is my honest and detailed answer.

Step 1 – Coding & ML skills

  • You need to master programming in either R or Python. If you don’t know which to pick, pick R, or toss a coin. [Or listen to me and pick R, as it is used at top firms like Nasdaq, JPMorgan, and many more.] Also, when I say master, you need to know more than how to write a simple calculator or “Hello World” function. You should be able to perform complex data wrangling, pull data from databases, write custom functions and apply algorithms, even if someone wakes you up at midnight.
  • By ML, I mean the logic behind machine learning algorithms. When presented with a problem, you should be able to identify which algorithm to apply and write the code snippet to do this.
  • Resources – Coursera, Udacity, Udemy. There are countless others, but these 3 are my favorites. Personal recommendation, basic R from Coursera (JHU) and Machine learning fundamentals from Kirill’s course on Udemy.

Step 2 – Build your portfolio.

  • Recruiters and hiring managers don’t know you exist, and having an online portfolio is the best way to attract their attention. Also, once employers do come calling, they will want to evaluate your technical expertise, so a portfolio helps.
  • The best way to showcase your value to potential employers is to establish your brand via projects on Github, LinkedIn and your website.
  • If you do not have your own website, create one for free using WordPress or Wix.
  • Stumped on what to post in your project portfolio?
  • Step 1 – Start by looking at the Kernels section of www.kaggle.com; tons of folks have leveraged free datasets to create interesting visualizations. Also enroll in any active competitions and navigate to the discussion forums. You will find very generous folks who have posted starter scripts and detailed exploratory analysis. Fork a script and try to replicate the solution. My personal recommendation would be to begin with the Titanic contest or the Housing Prices set. My professional website journeyofanalytics also houses some interesting project tutorials, if you want to take a look.
  • Step 2 – Pick a similar dataset from Kaggle or any other open-source site, and apply the code to the new dataset. Bingo, a totally new project and ample practice for you.
  • Step 3 – Work your way up to image recognition and text processing.

Step 3 – Apply for jobs strategically.

  • Please don’t randomly apply to every single datascience job in the country. Be strategic: use LinkedIn to reach out to hiring managers. Remember, it’s better to hear “NO” directly from the hiring manager than to apply online and wait for eternity.
  • Competition is getting fierce, so be methodical. Books like “Data Science Jobs” will help you pinpoint the best jobs in your target city, and also connect with hiring managers for jobs that are not posted anywhere else.
  • Yes, I wrote the book listed above – this is the book I wished I had when I started in this field! Unlike other books on the market with random generalizations, this book is written specifically for jobseekers in the datascience field. Plus, I’ve successfully helped a dozen folks land lucrative jobs (data analyst/data scientist roles) using the strategies outlined in this book. This book will help you cut your datascience job search time in half!
  • Upwork is a fabulous site to get gigs to tide you until you get hired full-time. It is also a fabulous way of being unique and standing out to potential employers! As a recruiter once told me, “it is easier to hire someone who already has a job, than to evaluate someone who doesn’t!”
  • If your first job is not at your dream job, do not despair. Earn and learn, every company, big or small, will teach you valuable skills that will help you get better and snag your ideal role next year. I do recommend staying at roles for at least 12 months, before switching, otherwise you won’t have anything impactful to discuss in the next interview.

Step 4 – Continuous learning.

  • Even if you’ve landed the “data scientist” job you always wanted, you cannot afford to rest on your laurels. Keep your skills current by attending online classes and conferences and reading up on tech changes. Udemy, again, is my go-to resource for staying abreast of technical skills.
  • Network with others to know how roles are changing, and what skills are valuable.

Finally, being in this field is a rewarding experience, and also quite lucrative. However, no one can get to the top without putting in sufficient effort and time. So master the basics and apply your skills, and you will definitely meet with success.

If you are looking to establish a career in datascience, then don’t forget to take a look at my book – “Data Science Jobs”, now available on Amazon.

How to raise money on Kickstarter – extensive EDA and prediction tutorial

In this tutorial, we will explore the characteristics of projects on Kickstarter and try to understand what separates the winners from the projects that failed to reach their funding goals.

Qs for Exploratory Analysis:

We will start our analysis with the aim of answering the following questions:

    1. How many projects were successful on Kickstarter, by year and category?
    2. Which sub-categories raised the most money?
    3. Which countries do projects originate from?
    4. How many projects exceeded their funding goal by 50% or more?
    5. Did any projects reach $100,000 or more? $1,000,000 or higher?
    6. What was the average amount contributed by each backer, and how does this change over time? Does this amount differ across categories?
    7. What is the average funding period?

 

Predicting success rates:
Using the answers from the above questions, we will try to create a model that can predict which projects are most likely to be successful.

The dataset is available on Kaggle, and you can run this script LIVE using this kernel link. If you find this tutorial useful or interesting, then please do upvote the kernel ! 🙂

Step 1 – Data Pre-processing

a) Let us take a look at the input dataset:

The projects are divided into main and sub-categories. The pledged amount “usd_pledged” has an equivalent value converted to USD, called “usd_pledged_real”. However, the goal amount does not have this conversion. So for now, we will use the amounts as is.

We can see how many people are backing each individual project using the column, “backers”.

b) Now let us look at the first 5 records:

The name doesn’t really indicate any specific pattern although it might be interesting to see if longer names have better success rates. Not pursuing that angle at this time, though.

c) Looking for missing values:

Hurrah, a really clean dataset, even after searching for “empty” strings. 🙂

d) Date Formatting and splitting:

We have two dates in our dataset – “launch date” and “deadline date”. We convert them from strings to date format.
We also split these dates into their respective year and month columns, so that we can plot variations over time.
So we will now have 4 new columns: launch_year, launch_month, deadline_year and deadline_month.
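A sketch of this step (in Python/pandas for illustration; the original kernel is in R, and the column names follow the Kaggle file, with two invented sample records):

```python
import pandas as pd

# Two sample records with the date columns from the Kickstarter file
df = pd.DataFrame({
    "launched": ["2015-08-11 12:12:28", "2017-09-02 04:43:57"],
    "deadline": ["2015-09-10", "2017-10-02"],
})

# Convert strings to datetimes, then split out year/month for plotting
df["launched"] = pd.to_datetime(df["launched"])
df["deadline"] = pd.to_datetime(df["deadline"])

df["launch_year"]    = df["launched"].dt.year
df["launch_month"]   = df["launched"].dt.month
df["deadline_year"]  = df["deadline"].dt.year
df["deadline_month"] = df["deadline"].dt.month
```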

Exploratory analysis:

a) How many projects are successful?

We see that “failed” and “successful” are the two main categories, comprising ~88% of our dataset.
Sadly we do not know why some projects are marked “undefined” or “canceled”.
“live” projects are those whose deadlines have not yet passed, although a few of them have already achieved their goal.
Surprisingly, some “canceled” projects had also met their goals (pledged_amount >= goal).
Since these other categories form a very small portion of the dataset, we will subset and only consider records with status “failed” or “successful” for the rest of the analysis.
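That subsetting step might look like this (a Python/pandas sketch; the real kernel is in R, and the toy status column below is made up):

```python
import pandas as pd

# Toy status column; the real file has hundreds of thousands of rows
df = pd.DataFrame({"state": ["failed", "successful", "canceled",
                             "live", "failed", "successful"]})

# Share of records in the two main categories
main = df["state"].isin(["failed", "successful"])
share = main.mean()

# Keep only failed/successful projects for the rest of the analysis
df = df[main]
```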

b) How many countries have projects on kickstarter?

We see projects are overwhelmingly US-based. Some country names have the garbled tag N,0“”, so we mark them as unknown.

c) Number of projects launched per year:

Looks like some records have launch dates in 1970, which cannot be right, so we discard any records with a launch/deadline year before 2009.
Plotting the counts per year on a graph: from the graph below, it looks like the count of projects peaked in 2015, then declined. However, this should NOT be taken as an indicator of success rates.

 

 

Drilling down a bit more to see count of projects by main_category.

Over the years, the maximum number of projects have been in these categories:

    1. Film & Video
    2. Music
    3. Publishing

 d) Number of projects by sub-category: (Top 20 only)


The Top 5 sub-categories are:

    1. Product Design
    2. Documentary
    3. Music
    4. Tabletop Games (interesting!!!)
    5. Shorts (really?! )

Let us now see “Status” of projects for these Top 5 sub_categories:
From the graph below, we see that for the categories “shorts” and “tabletop games” there are more successful projects than failed ones.

 e) Backers by category and sub-category:

Since there are a lot of sub-categories, let us explore the sub-categories under the main theme “Design” 

Product design is not just the sub-category with the highest count of projects, but also the category with the highest success ratio.

 f) add flag to see how many got funded more than the goal.

So ~40% of projects reached or surpassed their goal, which matches the proportion of successful projects.
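Deriving that flag is a one-liner (a Python/pandas sketch with made-up numbers; the kernel itself is in R):

```python
import pandas as pd

# Hypothetical pledged/goal amounts for five projects
df = pd.DataFrame({
    "usd_pledged_real": [1500, 200, 5000, 40, 980],
    "goal":             [1000, 1000, 4000, 500, 1000],
})

# Flag projects that met or beat their funding goal
df["goal_reached"] = df["usd_pledged_real"] >= df["goal"]
pct_reached = df["goal_reached"].mean()  # fraction of funded projects
```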

 g) Calculate average contribution per backer:

From the mean, median and max values we quickly see that the median amount contributed by each backer is only ~$40 whereas the mean is higher due to the extreme positive values. The max amount by a single backer is ~$5000.

h) Calculate reach_ratio

The amount per backer is a good start, but what if the goal amount itself is only $1000? Then an average contribution per backer of $50 implies we only need 20 backers.
So to better understand the probability of a project’s success, we create a derived metric called “reach_ratio”.
This takes the average user contribution and compares it against the goal fund amount.

We see the median reach_ratio is <1%. Only in the third quartile do we even touch 2%!
Clearly most projects have a very low reach ratio. We could subset for “successful” projects only and check if the reach_ratio is higher.
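The derivation of reach_ratio can be sketched as follows (a Python/pandas illustration with three invented projects; the original analysis is in R):

```python
import pandas as pd

# Three hypothetical projects
df = pd.DataFrame({
    "usd_pledged_real": [5000.0, 300.0, 12000.0],
    "backers":          [100,    10,    400],
    "goal":             [10000.0, 50000.0, 15000.0],
})

# Average contribution per backer, compared against the goal amount
df["amt_per_backer"] = df["usd_pledged_real"] / df["backers"]
df["reach_ratio"]    = df["amt_per_backer"] / df["goal"] * 100  # in percent

median_reach = df["reach_ratio"].median()
```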

 i) Number of days to achieve goal:

Predictive Analytics:

We will apply a very simple decision tree algorithm to our dataset.
Since we do not have a separate “test” set, we will split the input dataframe into 2 parts (70/30 split).
We will use the smaller set to test the accuracy of our algorithm.
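The split-then-score sequence might look like this (a Python/scikit-learn sketch; the original kernel is in R, and the toy frame below is invented so that backers and reach_ratio separate the classes):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Invented frame where 'backers' and 'reach_ratio' drive success,
# mirroring the variables the tutorial found significant
df = pd.DataFrame({
    "backers":     [5, 10, 500, 800, 3, 1200, 7, 950, 20, 600],
    "reach_ratio": [0.1, 0.2, 2.5, 3.0, 0.05, 4.0, 0.15, 2.8, 0.3, 2.2],
    "successful":  [0, 0, 1, 1, 0, 1, 0, 1, 0, 1],
})

X, y = df[["backers", "reach_ratio"]], df["successful"]

# 70/30 split, as in the tutorial; stratify keeps the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
accuracy = tree.score(X_test, y_test)
```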

Taking a peek at the decision tree rules:

kickstarter success decision tree

Thus we see that “backers” and “reach_ratio” are the main significant variables.

Re-applying the tree rules to the training set itself, we can validate our model:

From the above tables, we see that the error rate = ~3% and area under curve >= 97%

Finally applying the tree rules to the test set, we get the following stats:

From the above tables, we see that the error rate is still ~3% and the area under the curve is still >= 97%
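For readers new to these metrics, the error rate falls straight out of the confusion-matrix counts. The counts below are made up to match a ~3% error rate; they are not the kernel's actual output:

```python
# Illustrative confusion-matrix counts (not the kernel's real numbers).
tp, tn, fp, fn = 270, 700, 15, 15
total = tp + tn + fp + fn            # 1000 predictions in this toy example

error_rate = (fp + fn) / total       # misclassified / total
accuracy = 1 - error_rate
print(error_rate, accuracy)
```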

 

Conclusion:

Thus in this tutorial, we explored the factors that contribute to a project’s success. Main theme and sub-category were important, but the number of backers and “reach_ratio” were found to be most critical.
If a founder wanted to gauge their probability of success, they could measure their “reach_ratio” halfway to the deadline, or perhaps when 25% of the timeline is complete. If the numbers are low, it means they need to double down and use promotions/social media marketing to get more backers and funding.

If you liked this tutorial, feel free to fork the script. And don’t forget to upvote the kernel! 🙂

Sberbank Machine Learning Series – Post 2 – Mind maps & Hypothesis

This is the second post of the Sberbank Russia housing set analysis, where we will narrow down the variables of interest and create a roadmap to understand which factors significantly impact the target variable (price_doc).

You can read the introductory first post here.

 

Analysis Roadmap:

This Kaggle dataset has ~290 variables, so having a clear direction is important. In the initial phase, we obviously do not know which variables are significant and which are not, so we will just read through the data dictionary and logically select variables of interest. Using these we create our hypotheses, i.e. the relationship with the target variable (home price), and test the strength of each relationship.

The dataset also includes macroeconomic variables, so we will also create derived variables to test interactions between variables.

A simple mindmap for this dataset is as below:

home price analysis mindmap


Hypothesis Qs:

The hypothesis Qs and predictor variables of interest are listed below:

Target Variable: (TV)

“price_doc” is the variable to predict. Henceforth this will be referred to as “TV”.

 

Predictor variables:

These are the variables that affect the target variable, although we do not know which ones are more significant than the others, or indeed whether two or more variables interact to make a bigger impact.

For the Sberbank set, we have predictor variables from 3 categories:

  1. Property details,
  2. Neighborhood characteristics,
  3. Macroeconomic factors

(Note: all predictors in the mindmap marked with a # are derived or calculated variables.)

 

Property details:

  1. Timestamp –
    1. We will use both the timestamp (d/m/y) as well as extract the month-year values to assess relationship with TV.
    2. We will also check if any of the homes have multiple timestamps, which means the house passed through multiple owners. If yes, does this correlate with a specific sub_area?
  2. Single-family and bigger homes also have patios, yards, lofts, etc., which creates a difference between living area and full home area. So we take the ratio of life_sq to full_sq and check whether a home with a bigger ratio plus a larger full_sq gets a better price.
  3. Kitch_sq – do homes with larger kitchens command a better price? We will take the ratio kitch_sq / life_sq and check its impact on house price.
  4. Sub_area – does this affect price?
  5. Build_year –
    1. Logically newer homes should have better price.
    2. Also check if there is interaction with full_sq i.e larger, newer homes gets better price?
    3. Check inter-relationship with sub_area.
  6. Material – how does this affect TV?
  7. Floor/max_floor –
    1. Create this ratio and check its effect on price. Note: we need to identify single-family homes, since they would have to be excluded as a separate subset.
    2. Does a higher floor increase price? In specific sub_area? For example, certain top floor apartments in Chicago and NYC command better price since tenants get an amazing view of the skyline, and there is limited real estate in such areas.
  8. Product_type – Investment or ownership. Check if investment properties have better price.
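The derived ratios from the property hypotheses above can be sketched as a small helper. The column names follow the data dictionary, but the helper itself and its sample row are mine:

```python
def derived_ratios(row):
    """Derived predictors from the property-details hypotheses.

    Guards against missing or zero denominators, which the raw
    Sberbank data is known to contain.
    """
    full, life, mx = row.get("full_sq"), row.get("life_sq"), row.get("max_floor")
    return {
        "living_ratio":  row["life_sq"] / full if full else None,   # life_sq / full_sq
        "kitchen_ratio": row["kitch_sq"] / life if life else None,  # kitch_sq / life_sq
        "floor_ratio":   row["floor"] / mx if mx else None,         # floor / max_floor
    }

# Toy row for illustration.
home = {"full_sq": 60, "life_sq": 40, "kitch_sq": 10, "floor": 3, "max_floor": 10}
print(derived_ratios(home))
```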

 

Neighborhood details:

  1. Full_all – Total population in the area. Denser population should correlate with higher sale price.
  2. Male_f / female_f – Derived variable. If the ratio is skewed it may indicate military zones or special communities, which may possibly affect price.
  3. Kid friendly neighborhood – Calculate ratio of x13_all / full_all , i.e ratio of total population under 13 to overall population. A high ratio indicates a family-friendly neighborhood or residential suburb which may be better for home sale price. Also correlate with sub_area.
  4. Similar to above, calculate ratio of teens to overall population. Correlate with sub_area.
  5. Proximity to public transport: Calculate normalized scores for the following:
    1. Railroad_stn_walk_min,
    2. Metro_min_avto,
    3. Public_transport_walk
    4. Add all to get a weighted score. Lower values should hopefully correlate with higher home prices.
  6. Entertainment amenities: Easy access to entertainment options should be higher in densely populated areas with higher standards of living, and these areas presumably should command better home values. Hence we check relationship of TV with the following variables:
    1. Fitness_km,
    2. Bigmarket_km
    3. Stadium_km,
    4. Shoppingcentres_km,
  7. Proximity to office: TV versus normalized values for :
    1. Office_count_500,
    2. Office_count_1000,
    3. Logically, the more offices nearby, the better the price.
  8. Similarly, calculate normalized values for the number of industries in the vicinity, i.e. prom_part_500 / prom_part_5000. However, here the hypothesis is that houses nearby will have lower sale prices, since industries lead to noise/pollution and do not make for an ideal residential neighborhood. (Optional: check if sub_areas with a high number of industries have a lower number of standalone homes (single-family/townhomes, etc.).)
  9. Ratio of premium cafes to inexpensive ones in the neighborhood, i.e. cafe_count_5000_price_high / cafe_count_price_500. If the ratio is high, then do the houses in these areas have increased sale price? Also correlate with sub_area.
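The normalize-and-add scoring in item 5 above can be sketched as follows. Here min-max scaling stands in for "normalized scores" (the post does not say which normalization it intends), and the travel times are toy values:

```python
def min_max(values):
    """Scale a column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

# Toy walk/ride times (minutes) for three homes; column names from the post.
railroad_stn_walk_min = [5, 20, 35]
metro_min_avto        = [10, 15, 40]
public_transport_walk = [3, 25, 30]

cols = [min_max(c) for c in (railroad_stn_walk_min,
                             metro_min_avto,
                             public_transport_walk)]
# Sum the normalized columns row-wise; equal weights assumed.
transport_score = [sum(vals) for vals in zip(*cols)]
print(transport_score)  # lower score should mean better transport access
```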

 

Macro Variables:

These are overall numbers for the entire country, so they remain fairly constant for a whole year. However, we will merge these variables to the training and test set, to get a more holistic view of the real estate market.

The reasoning is simple: if overall mortgage rates are excessive (let’s say 35% interest), it is highly unlikely there will be a large number of home purchases, forcing a reduction in overall home sale prices. Similarly, factors like inflation and income per person also affect home prices.

  1. Ratio of Income_per_Cap and real_disposable_income: ideally the economy is doing better if both numbers are high, thus making it easier for homebuyers to get home loans and consequently pursue the house of their dreams.
  2. Mortgage_value: We will use a normalized value, to see how much this number changes over the years. If the number is lower, our hypothesis is that more people took larger loans, and hence sale prices for the year should be higher.
  3. Usdrub: how well is the Ruble (Russian currency) faring against the dollar. Higher numbers should indicate better stability and economy and a stronger correlation with TV. (we will ignore the relationship with Euros for now).
  4. Cpi: normalized value over the years.
  5. GDP: we take a ratio of gdp_annual_growth/ gdp_annual, since both numbers should be high in a good economy.
  6. Unemployment ratio: unemployment / employment. The hypothesis is an inverse relationship with TV.
  7. Population_migration: We will try to see the interaction with TV, while taking sub_area into consideration.
  8. Museum_visits_per_100_cap: Derive values to see if numbers have increased or decreased from the previous year, indicating higher/lower disposable income.
  9. Construction_value: normalized value.
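The year-over-year derivation in item 8 above can be sketched as below. The series values are made up purely for illustration; only the variable name comes from the data dictionary:

```python
# Year-over-year change for a macro series, e.g. museum_visits_per_100_cap.
# Values are invented for illustration.
years  = [2012, 2013, 2014, 2015]
visits = [55.0, 58.0, 57.0, 61.0]

# First year has no prior value, so its change is None.
yoy = [None] + [(curr - prev) / prev for prev, curr in zip(visits, visits[1:])]
for y, v, chg in zip(years, visits, yoy):
    print(y, v, chg)   # positive change hints at rising disposable income
```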

 

In the next posts, we will (a) use these hypothesis Qs to understand how the target variable is affected by the predictors, and (b) apply the variables in different algorithms to predict TV.
