Stock Price Analysis – Linear Regression Model in 5 simple steps

In this post we analyze Facebook's stock prices and build a linear regression model.


Code Overview:

Our code performs the following functions. You can download the code here.

  1. Load original dataset.
  2. Add data to benchmark against S&P500 data.
  3. Create derived variables. We create variables to store calculations for the following:
    • Date values: split composite dates into day, month and year.
    • Daily volatility: price change between the daily high and low prices, i.e. the intraday change in price. A bigger difference signifies heavier volatility.
    • Inter-day price volatility: price change from the previous day.
  4. Data visualization to validate relationship between target variable and predictors.
  5. Create Linear Regression Model


Steps 1 & 2 – Load Datasets

We load stock prices for Facebook and the S&P500. Note that the S&P500 index prices begin in 2004, whereas Facebook was listed as a public company only in May 2012.
We specify “all.x = TRUE” in the merge command to keep every date in the Facebook file and drop dates that are not present in it.

fbknew = merge(fbkdf, sp5k, by = "Date", all.x = TRUE)
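
To see what “all.x = TRUE” does, here is a minimal sketch on two tiny made-up data frames. The names fbk_toy and sp_toy and all the values are placeholders for illustration, not the real price files.

```r
# Toy stand-ins for the two price files (all values are invented).
fbk_toy <- data.frame(Date = c("2012-05-18", "2012-05-21"),
                      Open = c(10, 11),
                      stringsAsFactors = FALSE)
sp_toy  <- data.frame(Date = c("2012-05-17", "2012-05-18", "2012-05-21"),
                      SP500_price = c(100, 101, 102),
                      stringsAsFactors = FALSE)

# all.x = TRUE keeps every row of the first (left) data frame.
# Dates that appear only in sp_toy are dropped from the result.
merged_toy <- merge(fbk_toy, sp_toy, by = "Date", all.x = TRUE)
print(merged_toy)
```

Only the two Facebook dates survive the merge; the extra S&P500 date is discarded.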

Note, we obtained this data from the Yahoo! Finance homepage using the tab “Historical Data”.


Step 3 – Derived Variables

a) Date Values:
The as.Date() function parses the date variable, after which format() can break it down into day, month and year. The beauty of as.Date() is that it lets you specify the format of the date in the original data, since different regions format dates differently (mmddyyyy, ddmmyy, ...).

fbknew$Date2 = as.Date(fbknew$Date, "%Y-%m-%d")
fbknew$mthyr = paste(as.numeric(format(fbknew$Date2, "%m")), "-",
                     as.numeric(format(fbknew$Date2, "%Y")), sep = "")
fbknew$mth = as.numeric(format(fbknew$Date2, "%m"))
fbknew$year = as.numeric(format(fbknew$Date2, "%Y"))

b) Intraday price:

fbknew$prc_change = fbknew$High - fbknew$Low

We have only calculated the difference between the High and Low, but since data is available for “Open” and “Close” you can calculate the maximum range as a self-exercise.
Another interesting exercise would be to calculate the average of both, or create a weighted variable to capture the impact of all 4 variables.

c) Inter-day price:
We first sort by date, using the order() function, and then use a “for loop” to calculate the price difference from the previous day. Note, since we are working with dataframes, a simple subtraction command like x = var[i] - var[i-1] will not work as written. Feel free to try it; the error message is really helpful in understanding how arrays and dataframes differ!
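
As a loop-free alternative to the approach described above, the day-over-day change can be sketched with base R's diff(). This is not the code from the download. The data frame prices and the column day_change are hypothetical names, and the values are made up.

```r
# Made-up closing prices, deliberately out of date order.
prices <- data.frame(
  Date  = as.Date(c("2016-01-06", "2016-01-04", "2016-01-05")),
  Close = c(103, 100, 102)
)

# Sort by date first so that "previous row" really means "previous day".
prices <- prices[order(prices$Date), ]

# diff() returns n - 1 differences; the first day has no previous day,
# so we pad the front with NA to keep the column length equal to nrow().
prices$day_change <- c(NA, diff(prices$Close))
print(prices)
```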


Step 4 – Data Visualization

Before we create any statistical model, it is always good practice to visually explore the relationships between target variable (here “opening price”) and the predictor variables.

With a linear regression model this matters even more, because it helps identify variables that show a non-linear (exponential, parabolic) relationship with the target. If you see such patterns you can often still use linear regression after transforming the data, for example with a log function.
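
A small sketch of the log-transform idea, using simulated data rather than the stock prices. The exponential trend, the noise level and the variable names are all assumptions chosen for illustration.

```r
set.seed(1)
x <- 1:50
# Simulated target that grows exponentially in x, with multiplicative noise.
y <- exp(0.1 * x) * exp(rnorm(50, sd = 0.05))

fit_raw <- lm(y ~ x)        # straight line fitted to a curved relationship
fit_log <- lm(log(y) ~ x)   # after the log transform the relationship is linear

# Compare how much variance each fit explains.
print(summary(fit_raw)$r.squared)
print(summary(fit_log)$r.squared)
```

On data like this, the fit on log(y) explains far more of the variance, and its slope recovers the growth rate used in the simulation.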

Here are some of the charts from such an exploration between “Open” price (OP) and other variables. Note there may be multiple explanations for the relationship, here we are only looking at the patterns, not the reason behind it.

 a) OP versus Trade Volume:

From the chart below, it looks like volume is inversely related to price. There is also one outlier (the last data point). We use the lines() and lowess() functions to add a smooth trendline.
An interesting self-exercise would be to identify the date when this occurred and match it to a specific event in the company history (perhaps the stock split?)

Facebook Stock – Trading Volume versus Opening Stock Price
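
A chart of this kind could be produced roughly as follows. This is a sketch on simulated data, so the variables volume and open_price and their values are placeholders, not the Facebook series.

```r
set.seed(42)
# Simulated data with a negative relationship between volume and price.
volume     <- runif(100, min = 1e6, max = 5e7)
open_price <- 120 - 1.5e-6 * volume + rnorm(100, sd = 5)

plot(volume, open_price,
     xlab = "Trade volume", ylab = "Opening price",
     main = "Volume versus opening price (simulated)")

# lowess() returns a list with sorted x values and smoothed y values,
# which lines() can draw directly on top of the scatter plot.
trend <- lowess(volume, open_price)
lines(trend, col = "red", lwd = 2)
```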

b) OP by Month & Year:

We see that the stock price has been steadily increasing year on year.

Facebook price by month and year

c) OP versus S&P500 index price:

Clearly the Facebook stock price has a linear relationship with the S&P500 index price (logical too!).


Facebook stock price versus S&P500 performance

d) Daily volatility:

This is slightly scattered, although the central clustering of data points indicates that daily volatility is fairly stable.

Note, we use the sqldf package to aggregate data by monthyear / month for some of these charts.
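
The post's code uses sqldf for this aggregation. As a stand-in that needs no extra package, the same kind of monthly summary can be sketched with base R's aggregate(). The data frame daily and its values are made up for illustration.

```r
# Made-up daily opening prices spanning two months.
daily <- data.frame(
  Date = as.Date(c("2016-01-04", "2016-01-05", "2016-02-01", "2016-02-02")),
  Open = c(100, 102, 110, 114)
)

# Build a year-month key, then average the opening price within each key.
daily$mthyr <- format(daily$Date, "%Y-%m")
monthly <- aggregate(Open ~ mthyr, data = daily, FUN = mean)
print(monthly)
```

With sqldf the same result would be written as a GROUP BY query on the month-year column.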

Step 5 – Linear Regression Model

Our model formula is as follows:

lmodelfb = lm(Open ~ High + Volume + SP500_price + prc_change + mthyr + mth + year + volt_chg, data = fbknew)

We use the summary() function to view the coefficients and identify the predictors with the biggest impact, as shown in the table below:

predictor variables
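
To illustrate how such a table can be read programmatically, here is a sketch on simulated data. The data frame toy and its columns are invented; only one of the two predictors has a real effect by construction.

```r
set.seed(7)
toy <- data.frame(sp500  = rnorm(200, mean = 2000, sd = 50),
                  volume = rnorm(200, mean = 3e7, sd = 5e6))
# The target depends on sp500 only; volume is pure noise here.
toy$open <- 0.05 * toy$sp500 + rnorm(200, sd = 1)

fit <- lm(open ~ sp500 + volume, data = toy)

# summary()$coefficients is a matrix with one row per term and the columns
# Estimate, Std. Error, t value and Pr(>|t|).
coefs <- summary(fit)$coefficients

# Sort the terms by p-value so the strongest predictors come first.
print(coefs[order(coefs[, "Pr(>|t|)"]), ])
```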

We see that price is affected overall by S&P500, interday volatility and trade volume. Some months in 2013 also showed a significant impact.
This was our linear regression data model. Again, feel free to download the code and experiment for yourself.

Feel free to add your own observations in the comments section.

Dec 2016 – Project Updates

Hello All,

password analysis – text processing

Just a quick note that the code for the monthly projects has been uploaded to the “Projects Page”.

This month’s code focuses on text analytics and includes code for:

  1. Identifying string patterns and word associations.
  2. String searches and string manipulations.
  3. Text processing and cleaning (remove emojis, punctuation marks, etc.)
  4. Weighted ranking.

word association

There are 2 projects, both under the header “TEXT ANALYTICS”, so you need to download two zipped folders using the appropriate download buttons:

  1. Text_analysis code: Detailed explanation given under link.
  2. Code – pwd strength. An explanation is given under this blog post.

Happy Coding! 🙂

Password Strength Analysis – a Tutorial on Text Analysis & String Manipulation

In this post we will learn how to apply our data science skills to a business problem: why do passwords get stolen or hijacked?
This post is inspired by a blog entry on Data Science Central, where the solution was coded in Python. (Our analysis uses R and extends the original idea.)

In this tutorial, we will explore the following questions:

  1. What are the most common patterns found in passwords?
  2. How many passwords are banking-type “strong” combinations (containing special characters, length > 8)?
  3. How many passwords make excessive use of repetitive characters, like “1111”, “007”, “aaabbbccc” or similar.


Remember, this is a “real-world” dataset and this type of list is often used to create password dictionaries. You can also use it to develop your own password strength checker.


Overall, this tutorial will cover the following topics:

  1. Basic string functions: string length, string search, etc.
  2. Data visualization using pie charts and histograms.
  3. Color coded HTML tables (similar to Excel) – a great feature if you plan to create Shiny Webapps with Tables.
  4. Weighted ranking.


So let’s get started:


What makes a “Strong” password?

First let us take a look at the minimum requirements of  an ideal password:

  1. Minimum 8 characters in length.
  2. Contains 3 out of 4 of the following items:
    • Uppercase Letters
    • Lowercase Letters
    • Numbers
    • Symbols


Analysis Procedure:


  1. Load input (password data) file:

library(data.table)   # provides fread()
TFscores = data.frame(fread("C:/anu/ja/dec2016/passwords_data.txt", stringsAsFactors = FALSE, sep = "\n", skip = 16))


2. Calculate length of each password:

library(stringr)   # provides str_length()
TFscores$len = str_length(TFscores$password)


3. Plot histogram to see frequency distribution of password lengths. Note, we use a custom for-loop to generate labels for the histogram.

hist(TFscores$len, col = "blue", ylim = c(0, 150000),
     main = "Frequency Distribution - password length",
     xlab = "Password Length", ylab = "Count / Frequency",
     labels = lendf$labelstr)
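
The loop that builds the bar labels is not shown above. One way it might look is sketched below on simulated lengths. The percentage format of the labels and the names len, h and labelstr are assumptions, not the original code.

```r
set.seed(3)
# Simulated password lengths between 4 and 12 characters.
len <- sample(4:12, 500, replace = TRUE)

# Compute the bins without drawing, so the counts are available for labels.
breaks <- seq(3.5, 12.5, by = 1)
h <- hist(len, breaks = breaks, plot = FALSE)

# Build one label per bar: the share of passwords in that bin, in percent.
labelstr <- character(length(h$counts))
for (i in seq_along(h$counts)) {
  labelstr[i] <- paste0(round(100 * h$counts[i] / sum(h$counts), 1), "%")
}

# Draw the histogram with the custom labels on top of the bars.
hist(len, breaks = breaks, col = "blue", labels = labelstr,
     main = "Password length (simulated)", xlab = "Password Length")
```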

Histogram for password lengths



4. Count the characters in each character class:

a. Calculate number of digits in each password.

number of digits in password

TFscores$strmatch = gsub(pattern = "[[:digit:]]", replacement = "", TFscores$password)
TFscores$numberlen = TFscores$len - str_length(TFscores$strmatch)

b. Similarly calculate number of characters from other character classes:

  • Uppercase letters
  • Lowercase letters
  • Special characters: ! " # % & ' ( ) * + , - . / : ;


5. Assign 1 point as password strength “rank” for every character class present in the password. As mentioned earlier, an ideal password should have at least 3 character classes.

TFscores$rank = TFscores$urank + TFscores$lrank + TFscores$nrank + TFscores$srank
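
The per-class counts and the resulting rank can be sketched in one go with base R. The helper count_class and the sample passwords are hypothetical, and [[:punct:]] is used as an approximation of the special-character list above.

```r
pw <- c("123456", "Passw0rd!", "qwerty", "AbC123")

# Count characters of one class: delete them and compare lengths,
# the same trick used for digits above.
count_class <- function(x, class) {
  nchar(x) - nchar(gsub(class, "", x))
}

upper   <- count_class(pw, "[[:upper:]]")
lower   <- count_class(pw, "[[:lower:]]")
digits  <- count_class(pw, "[[:digit:]]")
special <- count_class(pw, "[[:punct:]]")

# One point for every character class that occurs at least once.
rank <- (upper > 0) + (lower > 0) + (digits > 0) + (special > 0)
print(data.frame(pw, upper, lower, digits, special, rank))
```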

Let us take a look to see how the passwords in our list stack up:

pie(piedfchar$Var1, labels = labelarrchar, col = rainbow(9), main = "no. of Character classes in password")


password strength analysis

6. Count number of unique characters in password :


Note, this function is resource intensive and takes a couple of hours to complete due to the size of the dataset.
To reduce the time and effort, the calculated values are included in the zip folder as “pwd_scores.csv”.

length(unique(strsplit(tempx$password, "")[[1]]))
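
The expression above handles one password at a time. A sketch of applying the same idea to a whole vector is shown below; the function name count_unique_chars is hypothetical, and no claim is made about its running time on the full dataset.

```r
# strsplit() with an empty separator splits each string into single
# characters; vapply() then counts the distinct characters per password.
count_unique_chars <- function(x) {
  vapply(strsplit(x, ""),
         function(chars) length(unique(chars)),
         integer(1))
}

uniq <- count_unique_chars(c("aaaa", "abcabc", "Passw0rd!"))
print(uniq)
```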


7. Assign password strength category based on rank and length:

TFscores$pwdclass = "weak"   # default
TFscores$pwdclass[TFscores$len < 5 | TFscores$rank == 1] = "very weak"
TFscores$pwdclass[TFscores$len >= 8 & TFscores$rank >= 2] = "medium"
TFscores$pwdclass[TFscores$len >= 12] = "strong"
TFscores$pwdclass[TFscores$len >= 12 & TFscores$rank == 4] = "very strong"
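
Because the assignments run in sequence, later rules overwrite earlier ones. The sketch below wraps the same rules in a hypothetical function classify() and applies them to a few made-up length and rank pairs to show the effect.

```r
classify <- function(len, rank) {
  cls <- rep("weak", length(len))                  # default
  cls[len < 5 | rank == 1]    <- "very weak"
  cls[len >= 8 & rank >= 2]   <- "medium"
  cls[len >= 12]              <- "strong"          # overrides "very weak" too
  cls[len >= 12 & rank == 4]  <- "very strong"
  cls
}

result <- classify(len  = c(4, 6, 9, 12, 14),
                   rank = c(2, 2, 3, 1, 4))
print(result)
```

Note the fourth case: a 12-character password with a single character class is first marked “very weak” and then overwritten as “strong”, because the length rule runs later. Whether that is desirable depends on how much weight you want to give length over variety.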

Based on these criteria, we get the following frequency distribution:

password strength

We can derive the following insights from the steps above:

  • 77.68% of passwords are weak or very weak!
  • ~3% of passwords have fewer than 5 characters.
  • ~72% of passwords have only 1 type of character class.
  • 0.5% of passwords have 8+ characters, yet fewer than 30% of those characters are unique.
  • ~0.9% of passwords have fewer than 4 unique characters.
  • 72% of passwords contain only digits.

8. Let’s see if there are any patterns repeated in the passwords, like “12345”, “abcde”, “1111”, etc:

TFscores$strmatch = regexpr("12345", TFscores$password)

password with year prefixes.

  • 1.2% of passwords contain pattern “12345”.
  • 0.01% of passwords contain pattern “abcde”.
  • 0.3% of passwords contain pattern “1111”.
  • 0.02% of passwords contain pattern “1234”.
  • 15% of passwords contain year notations like “198*”, “197*”, “199*”, “200*”. The sample shown alongside makes it clear that many people use important years from their lives in their passwords (logically true!).
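
Percentages like these can be computed by turning a pattern match into a proportion. The sketch below uses grepl(), which returns TRUE/FALSE, instead of regexpr(), which returns the match position or -1. The sample passwords, the helper share() and the exact regular expressions are assumptions for illustration.

```r
pw <- c("12345abc", "john1987", "aaaa2003", "qwerty", "x1111y")

# Share of passwords that match a pattern (mean of TRUE/FALSE values).
share <- function(pattern) mean(grepl(pattern, pw, perl = TRUE))

seq_share    <- share("12345")               # fixed sequence
year_share   <- share("(19[789]|200)[0-9]")  # years 1970 to 2009
repeat_share <- share("(.)\\1{3}")           # same character 4 times in a row

print(c(sequence = seq_share, years = year_share, repeats = repeat_share))
```

The backreference \\1 in the last pattern catches any repeated character, so “1111” and “aaaa” are found with a single expression.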


9. View the password strength visually. We use the “condformat” function to create an HTML table that is easy to assimilate:

condformat(testsampledf) +
  rule_fill_discrete(password, expression = rank < 2, colours = c("TRUE" = "red")) +
  rule_fill_discrete(len, expression = (len >= 12), colours = c("TRUE" = "gold")) +
  rule_fill_discrete(pwdclass, expression = (rank > 2 & len >= 8), colours = c("TRUE" = "green"))

Password strength HTML table

Machine Learning Algorithms

In the last few posts, we saw standalone analytics projects to perform sentiment analysis, visually explore large datasets for insights and create interesting Shiny applications.

In the coming months, however, we will cover how to implement machine learning algorithms in depth. We will explore the underlying concepts behind each algorithm (why and how the formula works), implement it using a real-world Kaggle dataset and also learn about its limitations and advantages.


What algorithms will we cover?

There are many algorithms to choose from, and this infographic from ThinkBigData provides an excellent and comprehensive list. Feel free to use as a handout or print one for your cubicles! For our purposes, we will cover two algorithms from each category.

Categories of machine learning algorithms. Source: ThinkBigData, by author Anubhav Srivastava.


Quick FAQ – selecting algorithm in practice

Many readers ask, “How do I know which algorithm to select?” This is also where new programmers often get stuck.

The long-winded answer is that there is no secret sauce; unfortunately, the choice often comes from experience or from the problem definition itself.

The above answer is not very satisfying, so here are two “cheat-sheet” answers:

  1. A good approximation is given by this infographic from Microsoft Azure. Download it from the link here.
  2. Regression is a very common and flexible model, so the table below gives an idea of how to create a base model depending on whether your target variable is quantitative (numeric) or categorical (e.g. gender or country).

    regression algorithms based on target variables
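
As one common pairing (an assumption here, since the table itself is not reproduced in the text), a numeric target is often modelled with linear regression via lm(), and a two-level categorical target with logistic regression via glm(). A sketch on simulated data:

```r
set.seed(11)
n <- 300
x <- rnorm(n)

# Numeric target: linear in x plus noise.
y_num <- 2 + 3 * x + rnorm(n)

# Binary categorical target: probability of "1" rises with x.
y_bin <- rbinom(n, size = 1, prob = plogis(-0.5 + 2 * x))

fit_num <- lm(y_num ~ x)                       # linear regression
fit_bin <- glm(y_bin ~ x, family = binomial)   # logistic regression

print(coef(fit_num))
print(coef(fit_bin))
```

Targets with more than two categories need other models, which this sketch does not cover.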


US Presidential Elections – Roundup of Final Forecasts

With barely 48 hours remaining until the US Presidential Elections, a roundup post curating the “forecasts” seemed inevitable.

So here are the analyses from 3 top forecasters, known for their accurate predictions:

US Presidential Elections 2016


(1) Nate Silver, FiveThirtyEight:

This website has been giving a running status of the elections and has been accounting for the numerous pendulum swings (and shocking changes) that have characterized this election. Currently, it shows Hillary Clinton as the likely winner, with a ~70% chance of being the next President. You can check out the state-wise stats and electoral vote breakdown on their webpage here. If you are interested, you can also view their forecasts under 3 different models, polls-only, polls-plus and now-cast (current sentiment), and see how they have changed over the last 12 months.

Their analytics are pretty amazing, so do take a look as a learning exercise, even if you do not agree with the forecast itself!


(2) 270towin:

Predictions and forecasts from Larry Sabato and the team at the University of Virginia Center for Politics. The final forecast from this team also puts Ms. Clinton as the clear winner. They also expect Democrats to take control of the Senate. You can view their state-wise electoral vote predictions here.


(3) Dr. Lichtman’s 13-key system:

Unlike other statistical teams and political analysts, this distinguished professor of history at American University rose to fame using a simplified 13-key system for predicting the Presidential Elections. According to Dr. Allan J. Lichtman’s theory, the keys are true/false statements, and if six or more of them go against the party holding the White House, that party will be toppled from power. His system has been proven right for the past 30 years, so please do take a look at it before you scoff that it does not contain the mathematical proof and complex computations touted by media houses and political analytics teams. Dr. Lichtman predicts Trump to be the winner, as he shows six of the keys currently going against the incumbent party. Read more about this system and the analysis here.



Finally, looking at the overall sentiment on Twitter and news media, it does look like Hillary’s win is imminent.

But until the final vote is cast, who knows what may change?

