Journey of Analytics

Deep dive into data analysis tools, theory and projects

Dec 2016 – Project Updates

Hello All,

password analysis – text processing

Just a note that the code for this month’s projects has been uploaded to the “Projects Page”.

This month’s code focuses on text analytics and includes code for:

  1. Identifying string patterns and word associations.
  2. String searches and string manipulations.
  3. Text processing and cleaning (removing emojis, punctuation marks, etc.).
  4. Weighted ranking.

word association

There are 2 projects, both under the header “TEXT ANALYTICS”, so you need to download two zipped folders using the appropriate download buttons:

  1. Text_analysis code: Detailed explanation given under link.
  2. Code – pwd strength. An explanation is given under this blog post.

Happy Coding! 🙂

Password Strength Analysis – a Tutorial on Text Analysis & String Manipulation

In this post we will learn how to apply our data science skills to a business problem: why do passwords get stolen or hijacked?
This post is inspired by a blog entry on Data Science Central, where the solution was coded in Python. (Our analysis uses R and extends the original idea.)

In this tutorial, we will explore the following questions:

  1. What are the most common patterns found in passwords?
  2. How many passwords are banking-type “strong” combinations (containing special characters, length > 8)?
  3. How many passwords make excessive use of repetitive characters, like “1111”, “007”, “aaabbbccc” or similar?


Remember, this is a “real-world” dataset and this type of list is often used to create password dictionaries. You can also use it to develop your own password strength checker.


Overall, this tutorial will cover the following topics:

  1. Basic string functions: string length, string search, etc.
  2. Data visualization using pie charts and histograms.
  3. Color-coded HTML tables (similar to Excel) – a great feature if you plan to create Shiny web apps with tables.
  4. Weighted ranking.


So let’s get started:


What makes a “Strong” password?

First, let us take a look at the minimum requirements of an ideal password:

  1. Minimum 8 characters in length.
  2. Contains 3 out of 4 of the following items:
    • Uppercase Letters
    • Lowercase Letters
    • Numbers
    • Symbols
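The two requirements above can be sketched as a small check in base R. This is only a minimal illustration: the helper name `is_strong` and the sample passwords are made up for this example, and “symbols” is assumed to mean the POSIX `[[:punct:]]` class.

```r
# Hypothetical helper: TRUE when a password meets both rules listed above
# (at least min_len characters and at least min_classes character classes).
is_strong <- function(pwd, min_len = 8, min_classes = 3) {
  classes <- cbind(
    upper  = grepl("[[:upper:]]", pwd),
    lower  = grepl("[[:lower:]]", pwd),
    digit  = grepl("[[:digit:]]", pwd),
    symbol = grepl("[[:punct:]]", pwd)   # assumption: symbols = punctuation
  )
  nchar(pwd) >= min_len & rowSums(classes) >= min_classes
}

# Made-up sample passwords, not taken from the dataset.
sample_pwds  <- c("password", "Passw0rd", "P@ss1", "Tr1cky!Pass")
strong_flags <- is_strong(sample_pwds)
print(data.frame(password = sample_pwds, strong = strong_flags))
```

Here “password” fails because it uses a single character class, and “P@ss1” fails on length even though it mixes four classes.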


Analysis Procedure:


  1. Load input (password data) file:

library(data.table)  # provides fread()
TFscores = data.frame(fread("C:/anu/ja/dec2016/passwords_data.txt", stringsAsFactors = FALSE, sep = "\n", skip = 16))


2. Calculate length of each password:

library(stringr)  # provides str_length()
TFscores$len = str_length(TFscores$password)


3. Plot a histogram to see the frequency distribution of password lengths. Note that we use a custom for-loop to generate the labels for the histogram.

hist(TFscores$len, col = "blue", ylim = c(0, 150000),
     main = "Frequency Distribution - password length",
     xlab = "Password Length", ylab = "Count / Frequency", labels = lendf$labelstr)

Histogram for password lengths
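The label loop itself is not shown in this post, so here is one way it might look. Treat it as a sketch under assumptions: the layout of `lendf` (a `count` column plus a `labelstr` column) and the percentage format are guesses, and the lengths are toy values standing in for `TFscores$len`.

```r
# Toy password lengths; the real input would be TFscores$len.
len <- c(6, 6, 7, 8, 8, 8, 9, 10, 10, 12)

# Compute the bins without drawing, so the counts can be turned into labels.
h <- hist(len, breaks = seq(5, 12, by = 1), plot = FALSE)

# lendf mirrors the name used in the post; its structure here is assumed.
lendf <- data.frame(count = h$counts, labelstr = "", stringsAsFactors = FALSE)
for (i in seq_len(nrow(lendf))) {
  pct <- round(100 * lendf$count[i] / sum(lendf$count), 1)
  lendf$labelstr[i] <- paste0(pct, "%")   # one label per histogram bar
}
print(lendf)
```

The resulting vector has one entry per bar, which is the shape the `labels` argument of `hist()` expects.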



4a. Calculate the number of digits in each password:

number of digits in password

# Strip all digits, then compare lengths to get the digit count
TFscores$strmatch = gsub(pattern = "[[:digit:]]", replacement = "", TFscores$password)
TFscores$numberlen = TFscores$len - str_length(TFscores$strmatch)

4b. Similarly, calculate the number of characters from the other character classes:

  • Uppercase letters
  • Lowercase letters
  • Special characters – ! " # % & ' ( ) * + , - . / : ;
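The same “strip and compare lengths” trick used for digits extends to the other classes. The sketch below uses base R only; the helper `count_class`, the column names and the sample passwords are invented for illustration, and special characters are assumed to be covered by `[[:punct:]]`.

```r
# Count how many characters of each password match a regex character class:
# remove the matching characters and compare the lengths before and after.
count_class <- function(pwd, class_pattern) {
  nchar(pwd) - nchar(gsub(class_pattern, "", pwd))
}

# Made-up sample passwords, not taken from the dataset.
pwds <- c("abc123", "Hello!World", "2016", "P@ss-w0rd")
counts <- data.frame(
  password   = pwds,
  upperlen   = count_class(pwds, "[[:upper:]]"),
  lowerlen   = count_class(pwds, "[[:lower:]]"),
  numberlen  = count_class(pwds, "[[:digit:]]"),
  speciallen = count_class(pwds, "[[:punct:]]"),  # assumption: specials = punctuation
  stringsAsFactors = FALSE
)
print(counts)
```

Because `gsub()` and `nchar()` are vectorized, this runs over a whole column at once without an explicit loop.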


5. Assign 1 point as the password strength “rank” for every character class present in the password. As mentioned earlier, an ideal password should have at least 3 character classes.

TFscores$rank = TFscores$urank + TFscores$lrank + TFscores$nrank + TFscores$srank
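The four indicator columns are not defined in this post. One plausible way to build them, assuming each is a 0/1 flag for “class present”, is shown below on a toy data frame (`scores` and its passwords are hypothetical).

```r
# Made-up sample passwords, not taken from the dataset.
pwds   <- c("123456", "password1", "Passw0rd!", "HELLO")
scores <- data.frame(password = pwds, stringsAsFactors = FALSE)

# as.integer() turns TRUE/FALSE into the 1/0 points described above.
scores$urank <- as.integer(grepl("[[:upper:]]", scores$password))
scores$lrank <- as.integer(grepl("[[:lower:]]", scores$password))
scores$nrank <- as.integer(grepl("[[:digit:]]", scores$password))
scores$srank <- as.integer(grepl("[[:punct:]]", scores$password))

# Same sum as in the post: one point per character class present.
scores$rank <- scores$urank + scores$lrank + scores$nrank + scores$srank
print(scores)
```

A rank of 4 means all four classes are present; a digits-only password such as “123456” scores 1.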

Let us take a look to see how the passwords in our list stack up:

pie(piedfchar$Var1, labels = labelarrchar, col = rainbow(9), main = "no. of Character classes in password")


password strength analysis
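The post does not show how `piedfchar` and `labelarrchar` are built. A rough sketch of one possibility, assuming they come from a frequency table of the rank column, is given here with toy ranks; the label wording is also an assumption.

```r
# Toy ranks (number of character classes per password), not real data.
ranks <- data.frame(rank = c(1, 1, 1, 2, 2, 3, 4, 1, 2, 1))

# table() on a column expression yields the default column names Var1 and Freq.
piedfchar <- as.data.frame(table(ranks$rank), stringsAsFactors = FALSE)

# Percentage labels for the slices, one per distinct rank.
pct <- round(100 * piedfchar$Freq / sum(piedfchar$Freq), 1)
labelarrchar <- paste0(piedfchar$Var1, " class(es): ", pct, "%")
print(labelarrchar)
```

With this particular layout the counts live in `Freq` and the rank values in `Var1`; the data frame used in the downloadable code may be organised differently.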

6. Count the number of unique characters in each password:


Note that this function is resource intensive and takes a couple of hours to complete due to the size of the dataset.
To save time and effort, the calculated values are included in the zipped folder, in a file titled “pwd_scores.csv”.

length(unique(strsplit(tempx$password, "")[[1]]))
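If you would rather recompute the values yourself, a vectorized version is worth trying. The following is a sketch, not the code from the zipped folder: it splits every password once and applies the count with `vapply()`, which is usually much quicker than growing results row by row, though the actual speed-up on the full dataset has not been measured here. The column names are hypothetical.

```r
# Made-up sample passwords, not taken from the dataset.
pwds <- c("1111", "aaabbbccc", "password", "Tr1cky!Pass")

# strsplit() is vectorized: it returns one character vector per password.
chars <- strsplit(pwds, "")

# Number of distinct characters in each password.
uniquechars <- vapply(chars, function(ch) length(unique(ch)), integer(1))

# Share of distinct characters relative to password length.
uniqueratio <- uniquechars / nchar(pwds)
print(data.frame(password = pwds, uniquechars, uniqueratio))
```

The ratio column is the kind of measure behind statements such as “fewer than 30% of the characters are unique”.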


7. Assign a password strength category based on rank and length:

TFscores$pwdclass = "weak"   # default
TFscores$pwdclass[TFscores$len < 5 | TFscores$rank == 1] = "very weak"
TFscores$pwdclass[TFscores$len >= 8 & TFscores$rank >= 2] = "medium"
TFscores$pwdclass[TFscores$len >= 12] = "strong"
TFscores$pwdclass[TFscores$len >= 12 & TFscores$rank == 4] = "very strong"

Based on these criteria, we get the following frequency distribution:

password strength
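To see how the rules from step 7 interact, it helps to wrap them in a function and run them on a handful of cases. The function name `classify_pwd` and the toy values are invented for this sketch; the rules are applied in the same order as in the post, so later assignments overwrite earlier ones.

```r
# Hypothetical wrapper around the step 7 rules; later rules overwrite earlier ones.
classify_pwd <- function(len, rank) {
  pwdclass <- rep("weak", length(len))             # default
  pwdclass[len < 5 | rank == 1]    <- "very weak"
  pwdclass[len >= 8 & rank >= 2]   <- "medium"
  pwdclass[len >= 12]              <- "strong"
  pwdclass[len >= 12 & rank == 4]  <- "very strong"
  pwdclass
}

# Toy length/rank combinations, not real data.
toy <- data.frame(len = c(4, 6, 6, 9, 12, 14), rank = c(2, 1, 2, 3, 1, 4))
toy$pwdclass <- classify_pwd(toy$len, toy$rank)
print(toy)

# Percentage of passwords in each category.
print(round(100 * prop.table(table(toy$pwdclass)), 1))
```

Note one consequence of the ordering: a 12-character password with a single character class ends up as “strong”, because the length rule is applied after the “very weak” rule.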

We can derive the following insights from the steps above:

  • 77.68% of passwords are weak or very weak!
  • ~3% of passwords have fewer than 5 characters.
  • ~72% of passwords contain only 1 character class.
  • 0.5% of passwords have 8+ characters, yet fewer than 30% of those characters are unique.
  • ~0.9% of passwords have fewer than 4 unique characters.
  • 72% of passwords contain only digits.

8. Let’s see if there are any patterns repeated in the passwords, like “12345”, “abcde”, “1111”, etc:

TFscores$strmatch = regexpr("12345", TFscores$password)

password with year prefixes.

  • 1.2% of passwords contain pattern “12345”.
  • 0.01% of passwords contain pattern “abcde”.
  • 0.3% of passwords contain pattern “1111”.
  • 0.02% of passwords contain pattern “1234”.
  • 15% of passwords contain year notations like “197*”, “198*”, “199*”, “200*”. The sample shown alongside makes it clear that many people use important years from their lives in their passwords. (Logically true!)
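Percentages like the ones in this list can be computed by turning each pattern match into TRUE/FALSE and taking the mean. The sketch below swaps `regexpr()` for `grepl()`, which returns a logical directly; the helper `pattern_share`, the year regex and the sample passwords are all assumptions made for illustration.

```r
# Made-up sample passwords, not taken from the dataset.
pwds <- c("12345678", "abcde1", "1111", "john1985", "maria2004", "qwerty")

# Share (in %) of passwords containing a pattern.
# fixed = TRUE treats the pattern as a literal string rather than a regex.
pattern_share <- function(pwd, pattern, fixed = TRUE) {
  round(100 * mean(grepl(pattern, pwd, fixed = fixed)), 1)
}

share_12345 <- pattern_share(pwds, "12345")
share_1111  <- pattern_share(pwds, "1111")

# Assumed year regex: 197x, 198x, 199x or 200x.
share_year  <- pattern_share(pwds, "(19[789]|200)[0-9]", fixed = FALSE)

print(c(seq12345 = share_12345, rep1111 = share_1111, year = share_year))
```

A regex like this will also flag digit runs that merely look like years, so the year share is best read as an upper bound.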


9. View the password strength visually. We use the “condformat” function to create an HTML table that is easy to assimilate:

library(condformat)  # provides condformat() and rule_fill_discrete()
condformat(testsampledf) +
  rule_fill_discrete(password, expression = rank < 2, colours = c("TRUE" = "red")) +
  rule_fill_discrete(len, expression = (len >= 12), colours = c("TRUE" = "gold")) +
  rule_fill_discrete(pwdclass, expression = (rank > 2 & len >= 8), colours = c("TRUE" = "green"))

password strength HTML table

Crime Density Area Contour Map

Hello All,

Today’s post is related to geographical heat maps – where a specific variable (say ethnic groups, art colleges or crime category) is color coded to show areas of high or low concentration.

The dataset is from the Philadelphia crime database, generously posted on Kaggle. I’m using the geographical coordinates available in this file to plot crime density maps for 4 specific crime categories. A simple function is created which takes the “crime category” as input and returns a contour map, using the ggmap library.
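To give a feel for the idea without a map tile service, here is a rough sketch that swaps ggmap for `MASS::kde2d()`: it filters the data by crime category and estimates a 2D density grid that can be drawn as contours. This is not the function from the RMarkdown file. The function name `crime_density`, the column names `category`, `lon` and `lat`, and the simulated coordinates are all assumptions.

```r
library(MASS)  # kde2d(); MASS ships with standard R installations

# Hypothetical helper: density grid for one crime category.
# Assumed columns: category (text), lon and lat (numeric coordinates).
crime_density <- function(crimes, category, n = 50) {
  rows <- crimes[crimes$category == category, ]
  kde2d(rows$lon, rows$lat, n = n)   # list with x, y and an n x n matrix z
}

# Simulated points roughly around Philadelphia, not the Kaggle data.
set.seed(42)
crimes <- data.frame(
  category = rep(c("Burglary", "Theft"), each = 200),
  lon = c(rnorm(200, -75.16, 0.02), rnorm(200, -75.20, 0.03)),
  lat = c(rnorm(200,  39.95, 0.02), rnorm(200,  40.00, 0.03))
)

dens <- crime_density(crimes, "Burglary")
str(dens)
```

Calling `contour(dens, xlab = "Longitude", ylab = "Latitude")` draws the contour lines on a plain axis; the ggmap version in the post overlays the same kind of density on a street map.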

Detailed instructions are already posted as an RMarkdown file on the RPubs website. Please take a look at the link here.

The entire source code is also available for philly_crime_density_maps as a zipped file, which includes the R program (easy to modify and play with the data!) and the RMarkdown file. Please remember to download the dataset .csv file from the Kaggle website and store it in the same directory.

Burglary crime density area maps for Philadelphia

If you liked this post and would like to receive updates on similar projects, then please do sign up for our blog updates. New projects are also added on our parent site at the beginning of every month, so do subscribe! If you think others may find this site useful, then please do share this link on Twitter and other social media! Thank you.

We love hearing feedback and questions. If you have any tips or would have taken a different approach, please do share your thoughts in the comments section.

Happy Coding!
