---
title: "The Word"
subtitle: "A textual analysis of the King James Version, American Standard Version, and World English Bible."
author: "Wesley Chioh"
date: "April 9, 2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# loading libraries
library(quanteda)
library(tidyverse)
library(ggplot2)
library(factoextra)
library(caret)
library(doParallel)
library(cluster)
library(kableExtra)
# reading in the Bible
all_bibles <- read_csv("bibles.csv")
# filtering for gospels
gospels <- all_bibles %>%
  filter(str_detect(Verse, "Matthew|Mark|John|Luke")) %>%
  # word() is vectorised; str_split(Verse, " ")[[1]][1] would give every row the first verse's book
  mutate(Book = word(Verse, 1))
# extracting the name of each book
# extracting the KJV
# paste and collapse each verse grouped by book
all_bibles <- all_bibles %>%
mutate(Verse = str_replace_all(Verse, c("1 " = "First", "2 " = "Second", "3 " = "Third")))
# vectorised extraction of each verse's book name and chapter label
books <- word(all_bibles$Verse, 1)
chapters <- map_chr(str_split(all_bibles$Verse, ":"), 1)
all_bibles$Book <- books
all_bibles$Chapter <- chapters
kjv_collapsed <- all_bibles %>%
select(Book, `King James Bible`) %>%
group_by(Book) %>%
summarise(Verse = paste(`King James Bible`, collapse = " "))
kjv_collapsed_vec <- kjv_collapsed$Verse
# group_by() returns the books in alphabetical order, so take the names from the
# summarised data frame itself rather than from the canonical book order
names(kjv_collapsed_vec) <- kjv_collapsed$Book
# create DFM
# convert to matrix
kjv_books_dfm <- dfm(kjv_collapsed_vec, tolower = T, remove_punct = T, remove_numbers = T, stem = T)
kjv_books_mat <- convert(kjv_books_dfm, to = "matrix")
## ASV
asv_gospels_collapsed <- all_bibles %>%
filter(Book %in% c("Matthew", "Mark", "Luke", "John")) %>%
select(Book, `American Standard Version`) %>%
group_by(Book) %>%
summarise(Verse = paste(`American Standard Version`, collapse = " "))
asv_collapsed_vec <- asv_gospels_collapsed$Verse
# summarise() returns the books alphabetically (John, Luke, Mark, Matthew)
names(asv_collapsed_vec) <- asv_gospels_collapsed$Book
# create DFM
# convert to matrix
asv_books_dfm <- dfm(asv_collapsed_vec, tolower = T, remove_punct = T, remove_numbers = T, stem = T)
## WEB
web_gospels_collapsed <- all_bibles %>%
filter(Book %in% c("Matthew", "Mark", "Luke", "John")) %>%
select(Book, `World English Bible`) %>%
group_by(Book) %>%
summarise(Verse = paste(`World English Bible`, collapse = " "))
web_collapsed_vec <- web_gospels_collapsed$Verse
names(web_collapsed_vec) <- web_gospels_collapsed$Book
# create DFM
# convert to matrix
web_books_dfm <- dfm(web_collapsed_vec, tolower = T, remove_punct = T, remove_numbers = T, stem = T)
```
#### Introduction
The New York Times bestseller list changes every week. But on the time scale of centuries and millennia, the longstanding, copyright-free global bestseller has not changed: it is, most probably, the Bible. The Bible commonly used today is quite different in style and phrasing from that of a mere century ago, as it undergoes revisions for clarity and ease of comprehension whenever linguistic norms change. Its subject matter and the meaning of its verses, however, have not changed. Or have they?
The Bible is a "multi-parallel corpora" (Xia and Yarowsky, 2017, p. 448), with multiple versions of what is essentially a highly similar corpus. There are 66 books in the Bible, each distinct from the others in terms of authorship, stylometry, and topics. Some, such as the four Gospels of Matthew, Mark, Luke, and John, are among the most similar in terms of content and topics, despite being written by different authors. The first three are also known as the Synoptic Gospels (Linmans, 1998; Murai, 2006), with corresponding sections across all four books. Murai (2013) argues that, from a network analysis perspective, these correspondences can be characterized as a series of multiple one-to-many relationships.
Thus, this paper tests the hypothesis that, despite controlling for topical similarity, the four Gospel books remain stylistically distinct from one another, and that this distinction is retained across the ages and the various versions of the Bible. The King James Version (KJV), American Standard Version (ASV), and World English Bible (WEB) serve as the basis of comparison.
#### Similarity
First, a spreadsheet of various versions of the Bible was downloaded from BibleHub, and the KJV, ASV, and WEB were selected. Preprocessing involved lower-casing and stemming each word, and removing punctuation and numerals. Stopwords were not removed because the English stopwords available through `quanteda` are contemporaneous with English language norms today, but not with the Elizabethan English of the KJV and ASV (Hamlin, 2011). Furthermore, removing stopwords would likely produce document-feature matrices (DFMs) dominated by high-frequency proper nouns in the modern WEB while leaving archaic stopwords in place in the older KJV and ASV. The resulting imbalance might adversely affect a true assessment of their similarities and differences.
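As a quick, illustrative check of this point (the specific words tested here are chosen for illustration), `quanteda`'s default English stopword list covers modern function words but not their Elizabethan counterparts:
```{r stopword check, eval=FALSE}
# modern function words appear in the default (Snowball) English stopword list...
c("the", "and", "my") %in% stopwords("en")
# ...but archaic KJV/ASV forms such as these do not, so stopword removal
# would treat the three versions unevenly
c("thou", "thee", "hath") %in% stopwords("en")
```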
From a bag-of-words perspective, word choice and frequency carry latent information. Cosine similarity thus hints at the relative similarity of the four books in terms of topics and sentiment.
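Formally, for the term-frequency vectors $\mathbf{a}$ and $\mathbf{b}$ of two documents, cosine similarity is

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2} \, \sqrt{\sum_{i} b_i^2}},$$

which lies between 0 and 1 for non-negative term counts and is reported below as a percentage.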
```{r KJV cosine_similarity, echo=FALSE}
# pairwise cosine similarity among the four gospels, selected by document name
# rather than by position so the labels cannot drift out of sync
gospel_order <- c("Matthew", "Mark", "Luke", "John")
cosine_similarity_mat <- round(as.matrix(textstat_simil(kjv_books_dfm[gospel_order, ], method = "cosine")), 3) * 100
cosine_similarity_mat %>%
  kable(caption = "Table 1: Cosine Similarity of KJV Gospel Books") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kjv_sim <- paste(round(mean(cosine_similarity_mat), 2), "%", sep = "")
```
```{r ASV cosine Similarity, echo = FALSE}
# pairwise cosine similarity among the ASV gospels
cosine_similarity_mat <- round(as.matrix(textstat_simil(asv_books_dfm[gospel_order, ], method = "cosine")), 3) * 100
cosine_similarity_mat %>%
  kable(caption = "Table 2: Cosine Similarity of ASV Gospel Books") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
asv_sim <- paste(round(mean(cosine_similarity_mat), 2), "%", sep = "")
```
```{r WEB Cosine Similarity, echo = FALSE}
# pairwise cosine similarity among the WEB gospels
cosine_similarity_mat <- round(as.matrix(textstat_simil(web_books_dfm[gospel_order, ], method = "cosine")), 3) * 100
cosine_similarity_mat %>%
  kable(caption = "Table 3: Cosine Similarity of WEB Gospel Books") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
web_sim <- paste(round(mean(cosine_similarity_mat), 2), "%", sep = "")
```
From the tables above, the average cosine similarity of the four Gospel books is `r kjv_sim`, `r asv_sim`, and `r web_sim` for the KJV, ASV, and WEB respectively. These averages are fairly high, but not identical. The degree of similarity between the books is much higher for the ASV and WEB than for the KJV. The ASV and WEB agree on the relative similarity of Matthew, Mark, and Luke, but the KJV does not. For example, in the KJV, Matthew is most similar to Mark, but in both the ASV and WEB, Matthew is most similar to John. Furthermore, the ASV and WEB suggest that Mark and Luke are more than 98% similar, whereas in the KJV it is Matthew that is almost identical to Mark, not Luke. The three versions do agree, however, that John is most similar to Mark and Luke.
#### Separability
The second part of this paper builds on the understanding that although the three versions of the Bible and the four Gospels might very well share similar content, they are ultimately distinct. This can be attributed to stylistic differences among the authors, and to translations and word choices contemporaneous with the day and age of each version.
To test this hypothesis, random forest classifiers were fitted on a sample of verses from each of the four books across all three versions. The training and test data split takes into account the slight class imbalance shown below by ensuring that every class is sampled according to its distribution in the population of verses; a minimal sketch of this stratified split follows the list. Three random forest classifiers were fitted:
1. Classify verses based on the version and the book
2. Classify verses based on the version
3. Classify verses based on the book
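The sketch below, separate from the analysis chunks in this paper, illustrates how `caret::createDataPartition()` performs such a stratified split; the per-gospel verse counts are approximate and used purely for illustration:
```{r stratified split sketch, eval=FALSE}
library(caret)
set.seed(1)
# toy labels with approximate verse counts per gospel (illustrative only)
y <- factor(rep(c("Matthew", "Mark", "Luke", "John"),
                times = c(1071, 678, 1151, 879)))
# sample 70% within each class, so train and test mirror the class distribution
idx <- createDataPartition(y, p = 0.7, list = FALSE, times = 1)
round(prop.table(table(y)), 3)       # class shares in the full data
round(prop.table(table(y[idx])), 3)  # near-identical shares in the training fold
```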
```{r random forest prep, echo = FALSE}
gospels_class <- all_bibles %>%
filter(Book %in% c("Matthew", "Mark", "Luke", "John")) %>%
select(Book, `American Standard Version`, `King James Bible`, `World English Bible`) %>%
pivot_longer(., cols = c(`American Standard Version`, `King James Bible`, `World English Bible`), names_to = "Version") %>%
group_by(Book, Version) %>%
summarise(Count = n()) %>%
pivot_wider(.,names_from = "Version", values_from = "Count")
gospels_class %>%
kable(caption = "Table 4: Class Imbalance - Number of verses in each book and version") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
```{r random forest1, echo = FALSE}
set.seed(1728)
## creating variable class
gospels <- all_bibles %>%
filter(Book %in% c("Matthew", "Mark", "Luke", "John")) %>%
select(Book, `American Standard Version`, `King James Bible`, `World English Bible`) %>%
pivot_longer(., cols = c(`American Standard Version`, `King James Bible`, `World English Bible`), names_to = "Version") %>%
mutate(Class = paste(substr(Version, 1,1), substr(Book, 1, 3), sep = "_"))
# creating DFM
# partition data with factor to preserve class distribution
gospels_dfm <- dfm(gospels$value, stem = T, remove_punct = T, tolower = T) %>%
dfm_trim(min_termfreq = 5, min_docfreq = 3) %>%
convert("matrix")
gospels_dfm <- scale(gospels_dfm)
gospels$Class <- as.factor(gospels$Class)
ids_train <- createDataPartition(gospels$Class, p = 0.7, list = F, times = 1)
train_x1 <- gospels_dfm[ids_train,] %>%
as.data.frame()
train_y1 <- gospels$Class[ids_train] %>%
as.factor()
test_x1 <- gospels_dfm[-ids_train,] %>%
as.data.frame()
test_y1 <- gospels$Class[-ids_train] %>%
as.factor()
## tuning random forest
# mtry <- sqrt(ncol(train_x1))
# ntree <- 101
# trainControl <- trainControl(method = "cv", number = 10, search = "grid" )
# metric <- "Accuracy"
# tunegrid <- expand.grid(.mtry = mtry)
# cl <- makePSOCKcluster(5)
# registerDoParallel(cl)
# running rf
# gospels_rf <- train(x = train_x1, y = train_y1,
# method = "rf", metric = metric, tuneGrid = tunegrid, trControl = trainControl, ntree = ntree, doParallel = T)
# stopCluster(cl)
#
# # save RDS and then load to speed things up
# saveRDS(gospels_rf, "./models/gospels_rf.RDS")
gospels_rf <- readRDS("./models/gospels_rf.RDS")
# print model
# print(gospels_rf)
# predict and confusion matrix
gospels_rf_predict <- predict(gospels_rf, newdata = test_x1)
gospels_rf_cm <- confusionMatrix(gospels_rf_predict, reference = test_y1, mode = "prec_recall")
gospels_rf_cm$byClass[,c(11,5:7)] %>%
kable(caption = "Table 5: Model (Book and Version) Results") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
With 12 classes to predict, the out-of-sample accuracy is `r round(gospels_rf_cm$overall[1],3)`. It is a slight improvement over the base rate of `r round(gospels_rf_cm$overall[5],3)`, which is obtained by simply predicting the most commonly occurring class in the dataset. The confusion matrix (Table A1) suggests that verses from the ASV and KJV are highly likely to be misclassified as each other. Of the four books, verses from John appear the least likely to be misattributed, whereas verses from Luke and Mark are especially prone to misattribution. This analysis can be broken down further into the versions and the books, with fewer classes in each case.
```{r random forest 2, echo = FALSE}
set.seed(1728)
## creating variable class
gospels_reduced_class <- all_bibles %>%
filter(Book %in% c("Matthew", "Mark", "Luke", "John")) %>%
select(Book, `American Standard Version`, `King James Bible`, `World English Bible`) %>%
pivot_longer(., cols = c(`American Standard Version`, `King James Bible`, `World English Bible`), names_to = "Version")
# creating DFM
# partition data with factor to preserve class distribution
gospels_dfm <- dfm(gospels_reduced_class$value, stem = T, remove_punct = T, tolower = T) %>%
dfm_trim(min_termfreq = 5, min_docfreq = 3) %>%
convert("matrix")
gospels_dfm <- scale(gospels_dfm)
gospels_reduced_class$Book <- as.factor(gospels_reduced_class$Book)
gospels_reduced_class$Version <- as.factor(gospels_reduced_class$Version)
ids_train2 <- createDataPartition(gospels_reduced_class$Version, p = 0.7, list = F, times = 1)
train_x2 <- gospels_dfm[ids_train2,] %>%
as.data.frame()
train_y2 <- gospels_reduced_class$Version[ids_train2] %>%
as.factor()
test_x2 <- gospels_dfm[-ids_train2,] %>%
as.data.frame()
test_y2 <- gospels_reduced_class$Version[-ids_train2] %>%
as.factor()
## tuning random forest
# mtry <- sqrt(ncol(train_x2))
# tunegrid <- expand.grid(.mtry = mtry)
# cl <- makePSOCKcluster(5)
# registerDoParallel(cl)
# running rf
# gospels_rf_version <- train(x = train_x2, y = train_y2,
# method = "rf", metric = metric, tuneGrid = tunegrid, trControl = trainControl, ntree = ntree, doParallel = T)
# stopCluster(cl)
# save RDS and then load to speed things up
# saveRDS(gospels_rf_version, "./models/gospels_rf_version.RDS")
gospels_rf_version <- readRDS("./models/gospels_rf_version.RDS")
# print model
# print(gospels_rf_version)
# predict and confusion matrix
gospels_rf_version_predict <- predict(gospels_rf_version, newdata = test_x2)
gospels_rf_version_cm <- confusionMatrix(gospels_rf_version_predict, reference = test_y2, mode = "prec_recall")
gospels_rf_version_cm$byClass[,c(11, 5:7)] %>%
kable(caption = "Table 6: Model (Version) Results") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
In the previous model, it was observed that the poor performance of the classifier could be attributed to errors in separating the versions. Training another random forest model to classify verses by Bible version alone, the out-of-sample accuracy of `r round(gospels_rf_version_cm$overall[1],3)` is similar to that of the previous model. Further, the precision and recall rates are especially poor for the ASV and KJV as compared to the WEB. This confirms the notion that the KJV and ASV are highly similar and difficult to tell apart.
```{r random forest 3, echo = FALSE}
##### Book
ids_train3 <- createDataPartition(gospels_reduced_class$Book, p = 0.7, list = F, times = 1)
train_x3 <- gospels_dfm[ids_train3,] %>%
as.data.frame()
train_y3 <- gospels_reduced_class$Book[ids_train3] %>%
as.factor()
test_x3 <- gospels_dfm[-ids_train3,] %>%
as.data.frame()
test_y3 <- gospels_reduced_class$Book[-ids_train3] %>%
as.factor()
# ## tuning random forest
# mtry <- sqrt(ncol(train_x3))
# tunegrid <- expand.grid(.mtry = mtry)
# cl <- makePSOCKcluster(5)
# registerDoParallel(cl)
#
# # running rf
# gospels_rf_book <- train(x = train_x3, y = train_y3,
# method = "rf", metric = metric, tuneGrid = tunegrid, trControl = trainControl, ntree = ntree, doParallel = T)
# stopCluster(cl)
#
# save RDS and then load to speed things up
# saveRDS(gospels_rf_book, "./models/gospels_rf_book.RDS")
gospels_rf_book <- readRDS("./models/gospels_rf_book.RDS")
# print model
# print(gospels_rf_book)
# predict and confusion matrix
gospels_rf_book_predict <- predict(gospels_rf_book, newdata = test_x3)
gospels_rf_book_cm <- confusionMatrix(gospels_rf_book_predict, reference = test_y3, mode = "prec_recall")
gospels_rf_book_cm$byClass[,c(11,5:7)] %>%
kable(caption = "Table 7: Model (Book) Results") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
The out-of-sample accuracy for predicting the book given a verse is `r round(gospels_rf_book_cm$overall[1], 3)`. Of the three classifiers trained and tested thus far, this is the highest. However, given the class imbalance, the F1 score shown above might be the more appropriate metric. The high precision in book classification suggests that distinctions between books are relatively strong and retained across the various versions. However, the book of Mark has an uncharacteristically low recall score of `r round(gospels_rf_book_cm$byClass[3,6], 3)`, suggesting that verses in Mark tend to be misattributed to other books.
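For reference, the reported per-class metrics are defined in terms of each class's true positives (TP), false positives (FP), and false negatives (FN):

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$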
#### Clusters
To test this further, k-means clustering was performed, both to find the optimal number of clusters and to see whether the algorithm can separate the corpus into four distinct clusters, one per book. Verses were collapsed into chapters, since individual verses can be relatively short and easily taken out of context; a chapter provides a better approximation of meaning, and stylometric signatures should be stronger at the chapter level.
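Concretely, k-means assigns the chapter vectors $x_1, \ldots, x_n$ to clusters $C_1, \ldots, C_k$ with centroids $\mu_1, \ldots, \mu_k$ so as to minimize the within-cluster sum of squares:

$$\underset{C_1, \ldots, C_k}{\arg\min} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2.$$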
```{r PCA, echo = FALSE, fig.height=5, fig.width=6}
web_collapsed_pca <- all_bibles %>%
filter(Book %in% c("Mark", "John", "Luke", "Matthew")) %>%
select(Book, Chapter, `World English Bible`) %>%
group_by(Book, Chapter) %>%
summarise(Verse = paste(`World English Bible`, collapse = " "))
web_collapsed_pca_vec <- web_collapsed_pca$Verse %>%
unlist()
names(web_collapsed_pca_vec) <- web_collapsed_pca$Chapter
# create DFM
# convert to matrix
web_chapter_dfm <- dfm(web_collapsed_pca_vec, tolower = T, remove_punct = T, remove_numbers = T, stem = T)
web_chapter_mat <- convert(web_chapter_dfm, to = "matrix")
web_chapter_mat_scale <- scale(web_chapter_mat)
# PCA on the chapter-level DFM
web_chapter_pca <- prcomp(web_chapter_mat, center = TRUE, scale. = TRUE)
web_chapter_pca_df <- web_chapter_pca$x %>%
as.data.frame()
web_chapter_pca_df$Chapter <- web_collapsed_pca$Chapter
web_chapter_pca_df$Book <- web_collapsed_pca$Book
fviz_eig(web_chapter_pca, addlabels = T,
main = "Screeplot of WEB DFM's Principal Components",
caption = "Plot 1: Screeplot of principal components")
```
The screeplot above suggests that dimensionality cannot be meaningfully reduced: the first two components account for less than 7% of the variability. Thus, principal components analysis is not particularly useful in this case, as the trade-off between dimensionality and retained variance is extremely steep.
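Each bar in the screeplot reports the proportion of variance explained by the $k$-th principal component,

$$\mathrm{PVE}_k = \frac{\lambda_k}{\sum_{j} \lambda_j},$$

where $\lambda_k$ is the $k$-th eigenvalue of the covariance matrix of the scaled DFM; a profile as flat as this one means no small set of components captures the corpus.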
```{r chapter cosine similarity, echo=FALSE, fig.width=10, fig.height=8}
# chapter-by-chapter cosine similarity, computed in a single call
cosine_similarity_mat_chapter <- round(as.matrix(textstat_simil(web_chapter_dfm, method = "cosine")), 3) * 100
# reorder the chapters by hierarchical clustering so that similar chapters sit together
reorder_cormat <- function(cormat){
  # similarities are on a 0-100 scale, so use (100 - similarity) as the distance
  dd <- as.dist(100 - cormat)
  hc <- hclust(dd)
  cormat <- cormat[hc$order, hc$order]
  return(cormat)
}
# blank out the upper triangle so that each pair is plotted only once
get_upper_tri <- function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}
cosine_similarity_mat_chapter_reorder <- reorder_cormat(cosine_similarity_mat_chapter) %>%
  get_upper_tri()
melted_cosine_sim <- reshape2::melt(cosine_similarity_mat_chapter_reorder, na.rm = T)
min_value <- floor(min(melted_cosine_sim$value))
midpoint <- floor((100 - min_value)/2) + min_value
ggplot(melted_cosine_sim, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "#FC4E07", high = "#00AFBB", mid = "white", midpoint = midpoint, limit = c(min_value, 100), name = "Cosine \nSimilarity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 7, hjust = 1),
        axis.text.y = element_text(size = 7)) +
  labs(title = "Cosine Similarity of Each Chapter",
       caption = "Plot 2: Cosine similarity matrix",
       x = "Chapters",
       y = "Chapters")
```
As a baseline of comparison, the cosine similarity between every pair of chapters was calculated and plotted. For the most part, books and chapters are distinct from one another, with the exception of certain chapters that echo each other content-wise, such as Mark 11 and Matthew 21. Curiously, Luke 19 and 20 are dissimilar, possibly because the same content is split between the two chapters. It also appears that Luke 3, Matthew 1, and John 13 through 17 are relatively distinct from the rest.
```{r silhouette score,fig.height=5, fig.width=7, echo=FALSE}
fviz_nbclust(web_chapter_mat_scale, kmeans, method = "silhouette", k.max = 28) +
theme(axis.title = element_text(size = 8),
axis.text = element_text(size = 8),
title = element_text(size = 11)) +
labs(title = "Silhouette Score",
caption = "Plot 3: Optimal Number of Clusters")
```
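For reference, the silhouette width of a chapter $i$ compares its mean distance $a(i)$ to the other chapters in its own cluster with its mean distance $b(i)$ to the chapters of the nearest neighbouring cluster:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}.$$

The optimal number of clusters in Plot 3 is the $k$ that maximizes the average $s(i)$ across chapters.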
```{r kmeans, echo=FALSE, fig.height=8, fig.width=10}
# KMeans
set.seed(1728)
# k5 <- kmeans(web_chapter_mat_scale, 5, nstart = 25)
# saveRDS(k5, "./models/kmeans.RDS")
k5 <- readRDS("./models/kmeans.RDS")
fviz_cluster(k5, data = web_chapter_mat_scale, labelsize = 8) +
theme(axis.title = element_text(size = 8),
axis.text = element_text(size = 8),
title = element_text(size = 11)) +
labs(title = "Cluster Plot: Chapters",
caption = "Plot 4: KMeans Clustering of Chapters")
```
K-means clustering was then performed without PCA to see whether the books can be distinguished from one another, in which case there ought to be four clusters. Alternatively, the books could cluster by content similarity, in which case there should be anywhere between 16 and 28 clusters, given that the shortest book has 16 chapters and the longest has 28; this assumes that content order and chapter delimitation are highly similar. The silhouette score plot suggests that the ideal number of clusters is 5. Plotting the five clusters using the first two principal components, the following observations about the clusters can be made:
1. Cluster 1 contains chapters on the crucifixion of Jesus.
2. Cluster 3 contains chapters on the betrayal of Jesus by Judas and the Last Supper.
3. Cluster 2 is an outlier; Luke 3 was identified above as one of the chapters least cosine-similar to the rest.
4. Cluster 5 is potentially an outlier as well.
5. Cluster 4 is most probably an amalgamation of other chapters, given that most chapters resemble one another to some degree, as the cosine similarity matrix suggests.
Surprisingly, the k-means cluster plot does not echo the ability of the random forest models to separate each of the Gospel books from one another.
#### Discussion
Having performed the analyses detailed above, it is possible to argue that, from a bag-of-words perspective, the three versions are highly similar in terms of word choice and, by extension, meaning. This can be seen from the difficulty the random forest models have in distinguishing between the KJV and the ASV, even though they were published three centuries apart. This is because the "ASV maintained close ties to Elizabethan English" (Hamlin, 2011, pp. 165-166), an approach that came to be known as "formal equivalence" (p. 166). On the other hand, even though the WEB is an update of the ASV, it is written in modern-day informal English, which sets it apart from the KJV and ASV. This implies that the English language has evolved enough over the years for a supervised machine learning technique to tell the versions apart.
On the issue of authorship, the ability of a random forest model to separate the four books across versions implies that verses in all four books retain unique stylometric traces of authorship that have been preserved over the years. In the case of the WEB, of the verses that were misclassified as those of other versions, 87% were attributed to the correct book, albeit the wrong version.
However, the book of Mark is often misattributed to Luke and Matthew. This lends some support to the theological argument of Marcan Priority, in which the Gospels of Matthew and Luke are based on Mark (Goodacre, 2000; Murai, 2006), and are therefore similar to it. Extending this argument a little further, one might expect to find four distinct clusters when performing k-means analysis on verses collapsed into chapters, with some chapters from Mark, Luke, and Matthew mixed together.
However, this expectation was not supported. Instead, some clusters were topical in nature. Murai & Tokosumi (2006) argued that the Gospels contain verses that are echoed in one another in synoptic fashion. Thus, various chapters across the four Gospel books can exhibit high degrees of similarity after controlling for Bible version. Furthermore, as the k-means algorithm minimizes distance across the dimensions, individually similar observations such as the synoptic chapters are clustered together. This is why the identified clusters mirror, to a certain extent, groups of highly similar cells in the cosine similarity matrix. A random forest model, on the other hand, is not necessarily minimizing the distance between observations. Furthermore, the random forest model was trained on verses, not on chapters as the k-means clustering was. There might be broader patterns that the random forest model picked up which are not observable at the level of individual verses or in the chapter-level aggregate.
However, it is curious that this was not the case for all topics discussed in the Gospel books, as is evident from the chapter similarity plot and the optimal number of k-means clusters. This could be because of chapter delimitation, where content covered in one chapter of a book can be spread over two in another, reducing the similarity of each chapter as a document. Further, of the two outliers, Luke 1 is uncharacteristically similar to the other outlier, Luke 3, but otherwise typical of the other chapters as measured by cosine similarity. In a high-dimensional space this sets the two chapters apart from the others, resulting in their separate classification, even though Luke 1 appears relatively close to the other clusters when decomposed into the first two principal components.
#### Conclusion
Thus, what is observed here is an inability to satisfactorily explain why a random forest model is better able to classify and predict than a k-means model. Neither can the hypothesis that the four Gospels are stylometrically distinct be conclusively supported. An extension of this work would ideally include some form of topic modeling across chapters or verses to test topical similarity directly. Further, if there are common topics and proper nouns among the books, then burstiness can be used to explore intensity; normalized for length, and relative to one another, burstiness should be identical if the books cover the same issues over the same timeline. In the end, it appears fitting for a text that has been studied so extensively for millennia to retain inexplicable features that continue to capture collective interest.
#### References
Goodacre, M. (2000). A Monopoly on Marcan Priority? Fallacies at the Heart of Q. In Society of Biblical Literature Seminar Papers 2000 (pp. 538-622).
Hamlin, H. (2011). The King James Bible after 400 years: Literary, linguistic, and cultural influences. Cambridge University Press.
Linmans, A. J. M. (1998). Correspondence analysis of the Synoptic Gospels. Literary and Linguistic Computing, 13(1), 1-13.
McDonald, D. (2014). A text mining analysis of religious texts. The Journal of Business Inquiry, 13(1), 27-47.
Murai, H., & Tokosumi, A. (2006). Synoptic Network Analysis of the Four Gospels. In SCIS & ISIS 2006 (pp. 1590-1595). Japan Society for Fuzzy Theory and Intelligent Informatics.
Murai, H. (2013). Exegetical Science for the Interpretation of the Bible: Algorithms and Software for Quantitative Analysis of Christian Documents. In Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (pp. 67-86). Springer, Heidelberg.
Xia, P., & Yarowsky, D. (2017, November). Deriving Consensus for Multi-Parallel Corpora: an English Bible Study. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers) (pp. 448-453).
#### Appendix
```{r Appendix, echo=FALSE}
# rf 1
gospels_rf_cm$table %>%
kable(caption = "Table A1: Predicting Book and Version with Random Forests | Prediction(V), Reference(H)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
pack_rows("ASV", 1,4) %>%
pack_rows("KJV", 5,8) %>%
pack_rows("WEB",9,12)
# rf 2
gospels_rf_version_cm$table %>%
kable(caption = "Table A2: Predicting Version with Random Forests | Prediction(V), Reference(H)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
# rf 3
gospels_rf_book_cm$table %>%
kable(caption = "Table A3: Predicting Book with Random Forests | Prediction(V), Reference(H)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```