In this post, we are going to learn how to apply machine learning models for predictive analytics. We will create 5 models using different algorithms and compare their results to see which model gives the most accurate predictions. You can use this approach to compete on Kaggle or to make predictions on your own datasets.

Dataset – For this experiment, we will use the birth_weight dataset from the Delaware State Open Data Portal, which covers infants born between 2009 and 2016 and includes the place of delivery (hospital/birthing center/home), gestation period (premature/normal) and details about the mother’s health conditions. You can download the data directly from the Open Data portal (https://data.delaware.gov/browse) or use the file provided with the code.

## Step 1 – Prepare the Workspace.

- We clean up the memory of the current R session and load some standard library packages (data.table, ggplot2, sqldf, etc.).
- We load the dataset “Births.csv”.

```r
library(data.table)  # for fread()
birthdf = data.frame(fread("Births.csv"), stringsAsFactors = FALSE)
```

## Step 2 – Data Exploration.

1. This step helps us understand the dataset – the range of values for variables, the most common occurrences, etc. For our dataset, we look at a summary of birth years, birth weight and the number of unique values.

```r
summary(birthdf[, c(1, 18)])   # birth year and birth weight columns
table(birthdf$BIRTH.WEIGHT)
# categories:      <1500  1500-2499  2500+
# num of records:    986       3769  50916
```

2. This is the point where we process missing values and decide whether to ignore them (an entire column with a large amount of missing data), delete them (very few records) or replace them with median values. In this dataset, however, there are no missing values to process.

```r
sapply(birthdf, function(x) sum(is.na(x)))
```
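None of the three strategies is needed here, but for reference they might look like this (the data frame and column names `colA`/`colB`/`colC` are hypothetical):

```r
# hypothetical data frame with missing values in three columns
df = data.frame(colA = c(NA, NA, NA, 4), colB = c(1, NA, 3, 4), colC = c(1, 2, NA, 4))

df$colA = NULL                                           # ignore: drop a mostly-missing column
df = df[!is.na(df$colB), ]                               # delete: drop the few incomplete records
df$colC[is.na(df$colC)] = median(df$colC, na.rm = TRUE)  # replace: fill in the median value
```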

3. Check how many unique values exist for each column.

```r
sapply(birthdf, function(x) length(unique(x)))
```

## Step 3 – Test and Training Set

If you’ve ever competed on Kaggle, you will know that the “training” set is the datafile used to create the machine learning model, and the “test” set is the one on which we use our model to predict the target variable.

In our case, we only have one file, so we will manually divide it into three sets – one training set and two test sets (a 70%/15%/15% split). Why two test sets? Because it helps us better understand how the model reacts to new data. You can work with just one if you like – use a single sample() call and stop at the testdf assignment.

```r
set.seed(270)  # ensures reproducibility: the same random numbers are generated each time
seq_index = sample(seq_len(nrow(birthdf)), size = 8000)
traindf = birthdf[-seq_index, ]
testdf = birthdf[seq_index, ]

# divide the test set into two
set.seed(1986)
seq_index2 = sample(seq_len(nrow(testdf)), size = 5000)
test1 = testdf[seq_index2, ]
test2 = testdf[-seq_index2, ]
```

## Step 4 – Hypothesis Testing

In this step, we try to understand which predictors most affect our target variable, using statistical tests such as ANOVA, chi-square and correlation. Which test to use depends on the types of the predictor and target variables (for example, chi-square for two categorical variables, ANOVA for a categorical predictor with a continuous target, and correlation for two continuous variables).

Irrespective of which function we use, we assume the following hypothesis:

a) Ho (null hypothesis) – no relation exists. We fail to reject Ho if the p-value is >= 0.05.

b) Ha (alternate hypothesis) – a relation exists. Ha is accepted if the p-value is < 0.05. If Ha is found true, we then conduct post-hoc tests (for ANOVA and chi-square tests ONLY) to understand which sub-categories show significant differences in the relationship.

(1) A relation between birth_weight and the mother’s race exists, since the p-value < 0.05.

```r
chisq.test(traindf$BIRTH.WEIGHT, traindf$MOM.s.RACE)
# Pearson's Chi-squared test
# data:  BIRTH.WEIGHT and MOM.s.RACE
# X-squared = 600, df = 6, p-value < 2e-16
```

Using the Bonferroni adjustment and post-hoc tests, we find that mothers of “unknown” race are more likely to have babies with low birth weight, compared to women of other races.

```r
# post-hoc tests:
# Bonferroni-adjusted cutoff = 0.05/6 ≈ 0.0083 (6 comparisons)

# a) between "black" and "other"
tdf = subset(traindf, MOM.s.RACE == "BLACK" | MOM.s.RACE == "OTHER")
p = prop.table(table(tdf$BIRTH.WEIGHT, tdf$MOM.s.RACE), 2) * 100
chisq.test(p)

# b) between "unknown" and "white"
tdf = subset(traindf, MOM.s.RACE == "UNKNOWN" | MOM.s.RACE == "WHITE")
p = prop.table(table(tdf$BIRTH.WEIGHT, tdf$MOM.s.RACE), 2) * 100
chisq.test(p)  # p-val = 0.0003, relation exists
```

We also see this in the frequency table: only 70% of babies born to mothers of “unknown” race are of normal weight (2500 gms or above), compared to 92% of babies born to “other”-race mothers and 93% of babies born to White mothers.
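The percentages quoted above come from a column-wise frequency table, which can be reproduced with `prop.table`:

```r
# percentage of each birth-weight category within each race (columns sum to 100)
round(prop.table(table(traindf$BIRTH.WEIGHT, traindf$MOM.s.RACE), 2) * 100, 1)
```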

(2) Relation between birth_weight and when prenatal_care started (first, second or third trimester, or none). Although the p-value is < 0.05, Ha cannot be accepted because the post-hoc tests do NOT show significant differences among the prenatal-care subsets.

```r
chisq.test(traindf$BIRTH.WEIGHT, traindf$START.OF.PRENATAL.CARE)

# post-hoc tests:
# Bonferroni-adjusted cutoff = 0.05/10 = 5e-3 (10 comparisons)

# a) between "DK" and "NO PRENATAL CARE"
tdf = subset(traindf, START.OF.PRENATAL.CARE == "DK" |
                      START.OF.PRENATAL.CARE == "NO PRENATAL CARE")
p = prop.table(table(tdf$BIRTH.WEIGHT, tdf$START.OF.PRENATAL.CARE), 2) * 100
chisq.test(p)  # p-val = 0.9

# b) between "FIRST" and "THIRD"
tdf = subset(traindf, START.OF.PRENATAL.CARE == "FIRST" |
                      START.OF.PRENATAL.CARE == "THIRD")
p = prop.table(table(tdf$BIRTH.WEIGHT, tdf$START.OF.PRENATAL.CARE), 2) * 100
chisq.test(p)  # p-val = 0.5
```

(3) Relation between birth_weight and gestation period:

```r
chisq.test(traindf$BIRTH.WEIGHT, traindf$GESTATION)
# p-value < 2e-16
```

Post-hoc tests show that babies in the POSTTERM 42+ WKS and TERM 37-41 WKS groups are similar, and both have higher birth weights than premature babies.

(4) We perform similar tests between birth_weight and plurality (singleton, twins or triplets) and gender.

## Step 5 – Model Creation

We create 5 models:

- LDA (linear discriminant analysis) model with just 3 variables
- LDA model with 7 variables
- Decision tree model
- Naive Bayes model
- Neural network model

**(1) Simple LDA model:**

Model formula:

```r
library(MASS)  # for lda()
gmodel = lda(BIRTH.WEIGHT ~ MOM.s.RACE + GESTATION + PLURALITY, data = traindf)
```

Make predictions with test1 file.

```r
ldaprediction = predict(gmodel, test1)
newvals = ldaprediction$class
```

Examine how well the model performed.

```r
chkvalid = data.frame(new = newvals, old = test1$BIRTH.WEIGHT)
table(chkvalid$new, chkvalid$old)
```

From the confusion table, the proportion of correct predictions (the diagonal entries) is:

= (32 + 166 + 4150) / 5000

= 4348 / 5000

= 0.8696

Thus, **86.96%** of predictions were correctly identified for test1! (Note: we will use the same process to check all 5 models.)

Using a similar process, we get 88.4% correct predictions for test2.
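This accuracy check can also be scripted directly, by summing the diagonal of the confusion table; a minimal sketch, assuming `chkvalid` holds the predicted and actual classes as above:

```r
# accuracy = correctly classified records / total records
accuracy = sum(diag(table(chkvalid$new, chkvalid$old))) / nrow(chkvalid)
round(accuracy * 100, 2)
```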

#### (2) LDA model with just 7 variables:

Formula:

```r
ldamodel = lda(BIRTH.WEIGHT ~ MOM.s.RACE + GESTATION + PLURALITY + MOM.S.AGE +
                 START.OF.PRENATAL.CARE + MOM.S.MARITAL.STATUS + MODE.OF.DELIVERY,
               data = traindf)
```

Make predictions for test1 and test2 files:

```r
ldaprediction2 = predict(ldamodel, test1)
newvals2 = ldaprediction2$class

ldaprediction3 = predict(ldamodel, test2)
newvals3 = ldaprediction3$class
```

We get 87.6% correct predictions for test1 file and 88.57% correct for test2.

#### (3) Decision Tree Model

For the tree model, we first convert the birth weight variable from a string into a numeric code (via a factor), so the tree can work with it.

```r
traindf$bw = as.integer(as.factor(traindf$BIRTH.WEIGHT))
test1$bw = as.integer(as.factor(test1$BIRTH.WEIGHT))
test2$bw = as.integer(as.factor(test2$BIRTH.WEIGHT))
```

Model Formula:

```r
library(tree)  # for tree()
tree1 <- tree(bw ~ MOM.s.RACE + GESTATION + PLURALITY + MOM.S.EDUCATION +
                MOM.S.AGE + START.OF.PRENATAL.CARE + MOM.S.MARITAL.STATUS +
                MODE.OF.DELIVERY, data = traindf)
```

Make predictions for test1 and test2 files:

```r
Pred1 <- predict(tree1, test1)
Pred2 <- predict(tree1, test2)
```

We get 91.16% correct predictions for the test1 file and 91.6% for test2. However, the sensitivity of this model is a little low, since it has predicted that all babies will be of normal weight, i.e. the “2500+” category. This is one of the disadvantages of tree models: if one value of the target variable accounts for 80% or more of the records, the model tends to assign every record to it (a sort of brute-force approach).
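A quick way to spot this degenerate behaviour is to look at the distribution of predicted classes; a sketch, assuming `Pred1` holds the tree’s numeric predictions for test1, rounded to the nearest class code:

```r
# if nearly 100% of predictions fall in one class,
# sensitivity for the remaining classes is close to zero
prop.table(table(round(Pred1))) * 100
```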

#### (4) Naive Bayes Theorem :

Model Formula:

```r
library(e1071)  # for naiveBayes()
fit2 <- naiveBayes(bw ~ MOM.s.RACE + GESTATION + PLURALITY + MOM.S.AGE +
                     START.OF.PRENATAL.CARE + MOM.S.MARITAL.STATUS +
                     MODE.OF.DELIVERY, data = traindf)
```

Make predictions for test1 and test2 files:

```r
x = test1[, c("MOM.s.RACE", "GESTATION", "PLURALITY", "MOM.S.AGE",
              "START.OF.PRENATAL.CARE", "MOM.S.MARITAL.STATUS", "MODE.OF.DELIVERY")]
prednb <- predict(fit2, x)

x = test2[, c("MOM.s.RACE", "GESTATION", "PLURALITY", "MOM.S.AGE",
              "START.OF.PRENATAL.CARE", "MOM.S.MARITAL.STATUS", "MODE.OF.DELIVERY")]
prednb2 <- predict(fit2, x)
```

Again we get model accuracies of 91.16% and 91.6% for the test1 and test2 files respectively. However, this model also suffers from the same “brute-force” behaviour and has marked all babies as normal weight, i.e. the “2500+” category. This reminds us that we must watch both the accuracy and the sensitivity of a model when applying an algorithm for forecasting purposes.

#### (5) Neural Net Algorithm Model :

Model Formula:

```r
library(nnet)  # for nnet()
set.seed(270)  # for reproducibility
# as.factor(bw) makes this a classification network,
# so predict(..., type = "class") works below
fit2n <- nnet(as.factor(bw) ~ MOM.s.RACE + GESTATION + PLURALITY + MOM.S.AGE +
                START.OF.PRENATAL.CARE + MOM.S.MARITAL.STATUS + MODE.OF.DELIVERY,
              data = traindf, size = 4, decay = 0.0001, maxit = 500)
```

In the formula above, the maxit argument caps the number of training iterations, so that the program doesn’t loop indefinitely trying to converge. Since we set the seed to 270, our model converges after 330 iterations; with a different seed value this number may be higher or lower.

Make predictions for test1 file:

```r
x = test1[, c("MOM.s.RACE", "GESTATION", "PLURALITY", "MOM.S.AGE",
              "START.OF.PRENATAL.CARE", "MOM.S.MARITAL.STATUS", "MODE.OF.DELIVERY")]
prednn <- predict(fit2n, x, type = "class")
```

The validation table shows that the total number of correct observations = 4592, hence the model's forecast accuracy = 4592 / 5000 = 91.84%.

Test with second file:

```r
x = test2[, c("MOM.s.RACE", "GESTATION", "PLURALITY", "MOM.S.AGE",
              "START.OF.PRENATAL.CARE", "MOM.S.MARITAL.STATUS", "MODE.OF.DELIVERY")]
prednn2 <- predict(fit2n, x, type = "class")
```

Thus, the neural net model is 91.84% and 92.57% accurate for test1 and test2 respectively.

## Step 6 – Comparison of models

We take a quick look at how our models fared:

| Model | Accuracy (test1) | Accuracy (test2) |
| --- | --- | --- |
| LDA (3 variables) | 86.96% | 88.4% |
| LDA (7 variables) | 87.6% | 88.57% |
| Decision tree | 91.16% | 91.6% |
| Naive Bayes | 91.16% | 91.6% |
| Neural network | 91.84% | 92.57% |

We conclude that the neural network algorithm gives us the best combination of accuracy and sensitivity.

The code and datafiles for this tutorial are available on the New Projects page under the “Jan” section. If you found this useful, please share it with your friends and colleagues, and feel free to leave your thoughts and feedback in the comments section.