---
title: "The Word"
subtitle: "A textual analysis of the King James Version, American Standard Version, and World English Bible."
author: "Wesley Chioh"
date: "April 9, 2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# loading libraries
library(quanteda)
library(tidyverse)
library(ggplot2)
library(factoextra)
library(caret)
library(doParallel)
library(cluster)
library(kableExtra)
# reading in the Bible
all_bibles <- read_csv("bibles.csv")
# filtering for gospels
gospels <- all_bibles %>%
  filter(str_detect(Verse, "Matthew|Mark|John|Luke")) %>%
  # word() is vectorised; str_split(Verse, " ")[[1]][1] would give every row the first verse's book
  mutate(Book = word(Verse, 1))
# extracting the name of each book
# extracting the KJV
# paste and collapse each verse grouped by book
all_bibles <- all_bibles %>%
mutate(Verse = str_replace_all(Verse, c("1 " = "First", "2 " = "Second", "3 " = "Third")))
# vectorised extraction of each verse's book name and chapter label
books <- word(all_bibles$Verse, 1)
chapters <- map_chr(str_split(all_bibles$Verse, ":"), 1)
all_bibles$Book <- books
all_bibles$Chapter <- chapters
kjv_collapsed <- all_bibles %>%
select(Book, `King James Bible`) %>%
group_by(Book) %>%
summarise(Verse = paste(`King James Bible`, collapse = " "))
kjv_collapsed_vec <- kjv_collapsed$Verse
# group_by() returns the books in alphabetical order, so take the names from the
# summarised data frame itself rather than from the canonical book order
names(kjv_collapsed_vec) <- kjv_collapsed$Book
# create DFM
# convert to matrix
kjv_books_dfm <- dfm(kjv_collapsed_vec, tolower = T, remove_punct = T, remove_numbers = T, stem = T)
kjv_books_mat <- convert(kjv_books_dfm, to = "matrix")
## ASV
asv_gospels_collapsed <- all_bibles %>%
filter(Book %in% c("Matthew", "Mark", "Luke", "John")) %>%
select(Book, `American Standard Version`) %>%
group_by(Book) %>%
summarise(Verse = paste(`American Standard Version`, collapse = " "))
asv_collapsed_vec <- asv_gospels_collapsed$Verse
# summarise() returns the books alphabetically (John, Luke, Mark, Matthew)
names(asv_collapsed_vec) <- asv_gospels_collapsed$Book
# create DFM
# convert to matrix
asv_books_dfm <- dfm(asv_collapsed_vec, tolower = T, remove_punct = T, remove_numbers = T, stem = T)
## WEB
web_gospels_collapsed <- all_bibles %>%
filter(Book %in% c("Matthew", "Mark", "Luke", "John")) %>%
select(Book, `World English Bible`) %>%
group_by(Book) %>%
summarise(Verse = paste(`World English Bible`, collapse = " "))
web_collapsed_vec <- web_gospels_collapsed$Verse
names(web_collapsed_vec) <- web_gospels_collapsed$Book
# create DFM
# convert to matrix
web_books_dfm <- dfm(web_collapsed_vec, tolower = T, remove_punct = T, remove_numbers = T, stem = T)
```
#### Introduction
The New York Times bestseller list changes every week. But on the time scale of centuries and millennia, the longstanding, copyright-free global bestseller has not changed: it is, most probably, the Bible. The Bible commonly used today is quite different in style and phrasing from that of a mere century ago, as it undergoes revisions for clarity and ease of comprehension whenever linguistic norms change. Its subject matter and the meaning of its verses, however, have not changed. Or have they?
The Bible is a "multi-parallel corpora" (Xia and Yarowsky, 2017, p. 448), with multiple versions of what is essentially a highly similar corpus. There are 66 books in the Bible, each distinct from the others in terms of authorship, stylometry, and topics. Some, such as the four Gospels of Matthew, Mark, Luke, and John, are among the most similar in terms of content and topics, despite being written by different authors. The first three are also known as the Synoptic Gospels (Linmans, 1998; Murai, 2006), with corresponding sections across all four books. Murai (2013) argues that, from a network analysis perspective, these correspondences can be characterized as a series of multiple one-to-many relationships.
Thus, this paper tests the hypothesis that, despite controlling for topical similarity, the four Gospel books remain stylistically distinct from one another, and that this distinction is retained across the ages and the various versions of the Bible. The King James Version (KJV), American Standard Version (ASV), and World English Bible (WEB) serve as the basis of comparison.
#### Similarity
First, a spreadsheet of various versions of the Bible was downloaded from BibleHub, and the KJV, ASV, and WEB were selected. Preprocessing involved lower-casing and stemming each word, and removing punctuation and numerals. Stopwords were not removed because the English stopwords available through `quanteda` are contemporaneous with English language norms today, but not with the Elizabethan English of the KJV and ASV (Hamlin, 2011). Furthermore, removing stopwords would likely produce document-feature matrices (DFMs) dominated by high-frequency proper nouns in the modern WEB while leaving archaic stopwords in place in the older KJV and ASV. The resulting imbalance might adversely affect a true assessment of their similarities and differences.
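As a quick, illustrative check of this point (the specific words tested here are chosen for illustration), `quanteda`'s default English stopword list covers modern function words but not their Elizabethan counterparts:
```{r stopword check, eval=FALSE}
# modern function words appear in the default (Snowball) English stopword list...
c("the", "and", "my") %in% stopwords("en")
# ...but archaic KJV/ASV forms such as these do not, so stopword removal
# would treat the three versions unevenly
c("thou", "thee", "hath") %in% stopwords("en")
```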
From a bag-of-words perspective, word choice and frequency carry latent information. Cosine similarity thus hints at the relative similarity of the four books in terms of topics and sentiment.
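Formally, for the term-frequency vectors $\mathbf{a}$ and $\mathbf{b}$ of two documents, cosine similarity is

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2} \, \sqrt{\sum_{i} b_i^2}},$$

which lies between 0 and 1 for non-negative term counts and is reported below as a percentage.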
```{r KJV cosine_similarity, echo=FALSE}
# pairwise cosine similarity among the four gospels, selected by document name
# rather than by position so the labels cannot drift out of sync
gospel_order <- c("Matthew", "Mark", "Luke", "John")
cosine_similarity_mat <- round(as.matrix(textstat_simil(kjv_books_dfm[gospel_order, ], method = "cosine")), 3) * 100
cosine_similarity_mat %>%
  kable(caption = "Table 1: Cosine Similarity of KJV Gospel Books") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
kjv_sim <- paste(round(mean(cosine_similarity_mat), 2), "%", sep = "")
```
```{r ASV cosine Similarity, echo = FALSE}
# pairwise cosine similarity among the ASV gospels
cosine_similarity_mat <- round(as.matrix(textstat_simil(asv_books_dfm[gospel_order, ], method = "cosine")), 3) * 100
cosine_similarity_mat %>%
  kable(caption = "Table 2: Cosine Similarity of ASV Gospel Books") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
asv_sim <- paste(round(mean(cosine_similarity_mat), 2), "%", sep = "")
```
```{r WEB Cosine Similarity, echo = FALSE}
# pairwise cosine similarity among the WEB gospels
cosine_similarity_mat <- round(as.matrix(textstat_simil(web_books_dfm[gospel_order, ], method = "cosine")), 3) * 100
cosine_similarity_mat %>%
  kable(caption = "Table 3: Cosine Similarity of WEB Gospel Books") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
web_sim <- paste(round(mean(cosine_similarity_mat), 2), "%", sep = "")
```
From the tables above, the average cosine similarity of the four Gospel books is `r kjv_sim`, `r asv_sim`, and `r web_sim` for the KJV, ASV, and WEB respectively. These averages are fairly high, but not identical. The degree of similarity between the books is much higher for the ASV and WEB than for the KJV. The ASV and WEB agree on the relative similarity of Matthew, Mark, and Luke, but the KJV does not. For example, in the KJV, Matthew is most similar to Mark, but in both the ASV and WEB, Matthew is most similar to John. Furthermore, the ASV and WEB suggest that Mark and Luke are more than 98% similar, whereas in the KJV it is Matthew that is almost identical to Mark, not Luke. The three versions do agree, however, that John is most similar to Mark and Luke.
#### Separability
The second part of this paper builds on the understanding that although the three versions of the Bible and the four Gospels might very well share similar content, they are ultimately distinct. This can be attributed to stylistic differences among the authors, and to translations and word choices contemporaneous with the day and age of each version.
To test this hypothesis, random forest classifiers were fitted on a sample of verses from each of the four books across all three versions. The training and test data split takes into account the slight class imbalance shown below by ensuring that every class is sampled according to its distribution in the population of verses; a minimal sketch of this stratified split follows the list. Three random forest classifiers were fitted:
1. Classify verses based on the version and the book
2. Classify verses based on the version
3. Classify verses based on the book
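The sketch below, separate from the analysis chunks in this paper, illustrates how `caret::createDataPartition()` performs such a stratified split; the per-gospel verse counts are approximate and used purely for illustration:
```{r stratified split sketch, eval=FALSE}
library(caret)
set.seed(1)
# toy labels with approximate verse counts per gospel (illustrative only)
y <- factor(rep(c("Matthew", "Mark", "Luke", "John"),
                times = c(1071, 678, 1151, 879)))
# sample 70% within each class, so train and test mirror the class distribution
idx <- createDataPartition(y, p = 0.7, list = FALSE, times = 1)
round(prop.table(table(y)), 3)       # class shares in the full data
round(prop.table(table(y[idx])), 3)  # near-identical shares in the training fold
```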
```{r random forest prep, echo = FALSE}
gospels_class <- all_bibles %>%
filter(Book %in% c("Matthew", "Mark", "Luke", "John")) %>%
select(Book, `American Standard Version`, `King James Bible`, `World English Bible`) %>%
pivot_longer(., cols = c(`American Standard Version`, `King James Bible`, `World English Bible`), names_to = "Version") %>%
group_by(Book, Version) %>%
summarise(Count = n()) %>%
pivot_wider(.,names_from = "Version", values_from = "Count")
gospels_class %>%
kable(caption = "Table 4: Class Imbalance - Number of verses in each book and version") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
```{r random forest1, echo = FALSE}
set.seed(1728)
## creating variable class
gospels <- all_bibles %>%
filter(Book %in% c("Matthew", "Mark", "Luke", "John")) %>%
select(Book, `American Standard Version`, `King James Bible`, `World English Bible`) %>%
pivot_longer(., cols = c(`American Standard Version`, `King James Bible`, `World English Bible`), names_to = "Version") %>%
mutate(Class = paste(substr(Version, 1,1), substr(Book, 1, 3), sep = "_"))
# creating DFM
# partition data with factor to preserve class distribution
gospels_dfm <- dfm(gospels$value, stem = T, remove_punct = T, tolower = T) %>%
dfm_trim(min_termfreq = 5, min_docfreq = 3) %>%
convert("matrix")
gospels_dfm <- scale(gospels_dfm)
gospels$Class <- as.factor(gospels$Class)
ids_train <- createDataPartition(gospels$Class, p = 0.7, list = F, times = 1)
train_x1 <- gospels_dfm[ids_train,] %>%
as.data.frame()
train_y1 <- gospels$Class[ids_train] %>%
as.factor()
test_x1 <- gospels_dfm[-ids_train,] %>%
as.data.frame()
test_y1 <- gospels$Class[-ids_train] %>%
as.factor()
## tuning random forest
# mtry <- sqrt(ncol(train_x1))
# ntree <- 101
# trainControl <- trainControl(method = "cv", number = 10, search = "grid" )
# metric <- "Accuracy"
# tunegrid <- expand.grid(.mtry = mtry)
# cl <- makePSOCKcluster(5)
# registerDoParallel(cl)
# running rf
# gospels_rf <- train(x = train_x1, y = train_y1,
# method = "rf", metric = metric, tuneGrid = tunegrid, trControl = trainControl, ntree = ntree, doParallel = T)
# stopCluster(cl)
#
# # save RDS and then load to speed things up
# saveRDS(gospels_rf, "./models/gospels_rf.RDS")
gospels_rf <- readRDS("./models/gospels_rf.RDS")
# print model
# print(gospels_rf)
# predict and confusion matrix
gospels_rf_predict <- predict(gospels_rf, newdata = test_x1)
gospels_rf_cm <- confusionMatrix(gospels_rf_predict, reference = test_y1, mode = "prec_recall")
gospels_rf_cm$byClass[,c(11,5:7)] %>%
kable(caption = "Table 5: Model (Book and Version) Results") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
With 12 classes to predict, the out-of-sample accuracy is `r round(gospels_rf_cm$overall[1],3)`. It is a slight improvement over the base rate of `r round(gospels_rf_cm$overall[5],3)`, which is obtained by simply predicting the most commonly occurring class in the dataset. The confusion matrix (Table A1) suggests that verses from the ASV and KJV are highly likely to be misclassified as each other. Of the four books, verses from John appear the least likely to be misattributed, whereas verses from Luke and Mark are especially prone to misattribution. This analysis can be broken down further into the versions and the books, with fewer classes in each case.
```{r random forest 2, echo = FALSE}
set.seed(1728)
## creating variable class
gospels_reduced_class <- all_bibles %>%
filter(Book %in% c("Matthew", "Mark", "Luke", "John")) %>%
select(Book, `American Standard Version`, `King James Bible`, `World English Bible`) %>%
pivot_longer(., cols = c(`American Standard Version`, `King James Bible`, `World English Bible`), names_to = "Version")
# creating DFM
# partition data with factor to preserve class distribution
gospels_dfm <- dfm(gospels_reduced_class$value, stem = T, remove_punct = T, tolower = T) %>%
dfm_trim(min_termfreq = 5, min_docfreq = 3) %>%
convert("matrix")
gospels_dfm <- scale(gospels_dfm)
gospels_reduced_class$Book <- as.factor(gospels_reduced_class$Book)
gospels_reduced_class$Version <- as.factor(gospels_reduced_class$Version)
ids_train2 <- createDataPartition(gospels_reduced_class$Version, p = 0.7, list = F, times = 1)
train_x2 <- gospels_dfm[ids_train2,] %>%
as.data.frame()
train_y2 <- gospels_reduced_class$Version[ids_train2] %>%
as.factor()
test_x2 <- gospels_dfm[-ids_train2,] %>%
as.data.frame()
test_y2 <- gospels_reduced_class$Version[-ids_train2] %>%
as.factor()
## tuning random forest
# mtry <- sqrt(ncol(train_x2))
# tunegrid <- expand.grid(.mtry = mtry)
# cl <- makePSOCKcluster(5)
# registerDoParallel(cl)
# running rf
# gospels_rf_version <- train(x = train_x2, y = train_y2,
# method = "rf", metric = metric, tuneGrid = tunegrid, trControl = trainControl, ntree = ntree, doParallel = T)
# stopCluster(cl)
# save RDS and then load to speed things up
# saveRDS(gospels_rf_version, "./models/gospels_rf_version.RDS")
gospels_rf_version <- readRDS("./models/gospels_rf_version.RDS")
# print model
# print(gospels_rf_version)
# predict and confusion matrix
gospels_rf_version_predict <- predict(gospels_rf_version, newdata = test_x2)
gospels_rf_version_cm <- confusionMatrix(gospels_rf_version_predict, reference = test_y2, mode = "prec_recall")
gospels_rf_version_cm$byClass[,c(11, 5:7)] %>%
kable(caption = "Table 6: Model (Version) Results") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
In the previous model, it was observed that the poor performance of the classifier could be attributed to errors in separating the versions. Training another random forest model to classify verses by Bible version alone, the out-of-sample accuracy of `r round(gospels_rf_version_cm$overall[1],3)` is similar to that of the previous model. Further, the precision and recall rates are especially poor for the ASV and KJV as compared to the WEB. This confirms the notion that the KJV and ASV are highly similar and difficult to tell apart.
```{r random forest 3, echo = FALSE}
##### Book
ids_train3 <- createDataPartition(gospels_reduced_class$Book, p = 0.7, list = F, times = 1)
train_x3 <- gospels_dfm[ids_train3,] %>%
as.data.frame()
train_y3 <- gospels_reduced_class$Book[ids_train3] %>%
as.factor()
test_x3 <- gospels_dfm[-ids_train3,] %>%
as.data.frame()
test_y3 <- gospels_reduced_class$Book[-ids_train3] %>%
as.factor()
# ## tuning random forest
# mtry <- sqrt(ncol(train_x3))
# tunegrid <- expand.grid(.mtry = mtry)
# cl <- makePSOCKcluster(5)
# registerDoParallel(cl)
#
# # running rf
# gospels_rf_book <- train(x = train_x3, y = train_y3,
# method = "rf", metric = metric, tuneGrid = tunegrid, trControl = trainControl, ntree = ntree, doParallel = T)
# stopCluster(cl)
#
# save RDS and then load to speed things up
# saveRDS(gospels_rf_book, "./models/gospels_rf_book.RDS")
gospels_rf_book <- readRDS("./models/gospels_rf_book.RDS")
# print model
# print(gospels_rf_book)
# predict and confusion matrix
gospels_rf_book_predict <- predict(gospels_rf_book, newdata = test_x3)
gospels_rf_book_cm <- confusionMatrix(gospels_rf_book_predict, reference = test_y3, mode = "prec_recall")
gospels_rf_book_cm$byClass[,c(11,5:7)] %>%
kable(caption = "Table 7: Model (Book) Results") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
The out-of-sample accuracy for predicting the book given a verse is `r round(gospels_rf_book_cm$overall[1], 3)`. Of the three classifiers trained and tested thus far, this is the highest. However, given the class imbalance, the F1 score shown above might be the more appropriate metric. The high precision in book classification suggests that distinctions between books are relatively strong and retained across the various versions. However, the book of Mark has an uncharacteristically low recall score of `r round(gospels_rf_book_cm$byClass[3,6], 3)`, suggesting that verses in Mark tend to be misattributed to other books.
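For reference, the reported per-class metrics are defined in terms of each class's true positives (TP), false positives (FP), and false negatives (FN):

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$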
#### Clusters
To test this further, k-means clustering was performed, both to find the optimal number of clusters and to see whether the algorithm can separate the corpus into four distinct clusters, one per book. Verses were collapsed into chapters, since individual verses can be relatively short and easily taken out of context; a chapter provides a better approximation of meaning, and stylometric signatures should be stronger at the chapter level.
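Concretely, k-means assigns the chapter vectors $x_1, \ldots, x_n$ to clusters $C_1, \ldots, C_k$ with centroids $\mu_1, \ldots, \mu_k$ so as to minimize the within-cluster sum of squares:

$$\underset{C_1, \ldots, C_k}{\arg\min} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2.$$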
```{r PCA, echo = FALSE, fig.height=5, fig.width=6}
web_collapsed_pca <- all_bibles %>%
filter(Book %in% c("Mark", "John", "Luke", "Matthew")) %>%
select(Book, Chapter, `World English Bible`) %>%
group_by(Book, Chapter) %>%
summarise(Verse = paste(`World English Bible`, collapse = " "))
web_collapsed_pca_vec <- web_collapsed_pca$Verse %>%
unlist()
names(web_collapsed_pca_vec) <- web_collapsed_pca$Chapter
# create DFM
# convert to matrix
web_chapter_dfm <- dfm(web_collapsed_pca_vec, tolower = T, remove_punct = T, remove_numbers = T, stem = T)
web_chapter_mat <- convert(web_chapter_dfm, to = "matrix")
web_chapter_mat_scale <- scale(web_chapter_mat)
# PCA on the chapter-level DFM
web_chapter_pca <- prcomp(web_chapter_mat, center = TRUE, scale. = TRUE)
web_chapter_pca_df <- web_chapter_pca$x %>%
as.data.frame()
web_chapter_pca_df$Chapter <- web_collapsed_pca$Chapter
web_chapter_pca_df$Book <- web_collapsed_pca$Book
fviz_eig(web_chapter_pca, addlabels = T,
main = "Screeplot of WEB DFM's Principal Components",
caption = "Plot 1: Screeplot of principal components")
```
The screeplot above suggests that dimensionality cannot be meaningfully reduced: the first two components account for less than 7% of the variability. Thus, principal components analysis is not particularly useful in this case, as the trade-off between dimensionality and retained variance is extremely steep.
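Each bar in the screeplot reports the proportion of variance explained by the $k$-th principal component,

$$\mathrm{PVE}_k = \frac{\lambda_k}{\sum_{j} \lambda_j},$$

where $\lambda_k$ is the $k$-th eigenvalue of the covariance matrix of the scaled DFM; a profile as flat as this one means no small set of components captures the corpus.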
```{r chapter cosine similarity, echo=FALSE, fig.width=10, fig.height=8}
# chapter-by-chapter cosine similarity, computed in a single call
cosine_similarity_mat_chapter <- round(as.matrix(textstat_simil(web_chapter_dfm, method = "cosine")), 3) * 100
# reorder the chapters by hierarchical clustering so that similar chapters sit together
reorder_cormat <- function(cormat){
  # similarities are on a 0-100 scale, so use (100 - similarity) as the distance
  dd <- as.dist(100 - cormat)
  hc <- hclust(dd)
  cormat <- cormat[hc$order, hc$order]
  return(cormat)
}
# blank out the upper triangle so that each pair is plotted only once
get_upper_tri <- function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}
cosine_similarity_mat_chapter_reorder <- reorder_cormat(cosine_similarity_mat_chapter) %>%
  get_upper_tri()
melted_cosine_sim <- reshape2::melt(cosine_similarity_mat_chapter_reorder, na.rm = T)
min_value <- floor(min(melted_cosine_sim$value))
midpoint <- floor((100 - min_value)/2) + min_value
ggplot(melted_cosine_sim, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "#FC4E07", high = "#00AFBB", mid = "white", midpoint = midpoint, limit = c(min_value, 100), name = "Cosine \nSimilarity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 7, hjust = 1),
        axis.text.y = element_text(size = 7)) +
  labs(title = "Cosine Similarity of Each Chapter",
       caption = "Plot 2: Cosine similarity matrix",
       x = "Chapters",
       y = "Chapters")
```
As a baseline of comparison, the cosine similarity between every pair of chapters was calculated and plotted. For the most part, books and chapters are distinct from one another, with the exception of certain chapters that echo each other content-wise, such as Mark 11 and Matthew 21. Curiously, Luke 19 and 20 are dissimilar, possibly because the same content is split between the two chapters. It also appears that Luke 3, Matthew 1, and John 13 through 17 are relatively distinct from the rest.
```{r silhouette score,fig.height=5, fig.width=7, echo=FALSE}
fviz_nbclust(web_chapter_mat_scale, kmeans, method = "silhouette", k.max = 28) +
theme(axis.title = element_text(size = 8),
axis.text = element_text(size = 8),
title = element_text(size = 11)) +
labs(title = "Silhouette Score",
caption = "Plot 3: Optimal Number of Clusters")
```
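For reference, the silhouette width of a chapter $i$ compares its mean distance $a(i)$ to the other chapters in its own cluster with its mean distance $b(i)$ to the chapters of the nearest neighbouring cluster:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}.$$

The optimal number of clusters in Plot 3 is the $k$ that maximizes the average $s(i)$ across chapters.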
```{r kmeans, echo=FALSE, fig.height=8, fig.width=10}
# KMeans
set.seed(1728)
# k5 <- kmeans(web_chapter_mat_scale, 5, nstart = 25)
# saveRDS(k5, "./models/kmeans.RDS")
k5 <- readRDS("./models/kmeans.RDS")
fviz_cluster(k5, data = web_chapter_mat_scale, labelsize = 8) +
theme(axis.title = element_text(size = 8),
axis.text = element_text(size = 8),
title = element_text(size = 11)) +
labs(title = "Cluster Plot: Chapters",
caption = "Plot 4: KMeans Clustering of Chapters")
```
K-means clustering was then performed without PCA to see whether the books can be distinguished from one another, in which case there ought to be four clusters. Alternatively, the books could cluster by content similarity, in which case there should be anywhere between 16 and 28 clusters, given that the shortest book has 16 chapters and the longest has 28; this assumes that content order and chapter delimitation are highly similar. The silhouette score plot suggests that the ideal number of clusters is 5. Plotting the five clusters using the first two principal components, the following observations about the clusters can be made:
1. Cluster 1 contains chapters on the crucifixion of Jesus.
2. Cluster 3 contains chapters on the betrayal of Jesus by Judas and the Last Supper.
3. Cluster 2 is an outlier; Luke 3 was identified above as one of the chapters least cosine-similar to the rest.
4. Cluster 5 is potentially an outlier as well.
5. Cluster 4 is most probably an amalgamation of other chapters, given that most chapters resemble one another to some degree, as the cosine similarity matrix suggests.
Surprisingly, the k-means cluster plot does not echo the ability of the random forest models to separate each of the Gospel books from one another.
#### Discussion
Having performed the analyses detailed above, it is possible to argue that, from a bag-of-words perspective, the three versions are highly similar in terms of word choice and, by extension, meaning. This can be seen from the difficulty the random forest models have in distinguishing between the KJV and the ASV, even though they were published three centuries apart. This is because the "ASV maintained close ties to Elizabethan English" (Hamlin, 2011, pp. 165-166), an approach that came to be known as "formal equivalence" (p. 166). On the other hand, even though the WEB is an update of the ASV, it is written in modern-day informal English, which sets it apart from the KJV and ASV. This implies that the English language has evolved enough over the years for a supervised machine learning technique to tell the versions apart.
On the issue of authorship, the ability of a random forest model to separate the four books across versions implies that verses in all four books retain unique stylometric traces of authorship that have been preserved over the years. In the case of the WEB, of the verses that were misclassified as those of other versions, 87% were attributed to the correct book, albeit the wrong version.
However, the book of Mark is often misattributed to Luke and Matthew. This lends some support to the theological argument of Marcan Priority, in which the Gospels of Matthew and Luke are based on Mark (Goodacre, 2000; Murai, 2006), and are therefore similar to it. Extending this argument a little further, one might expect to find four distinct clusters when performing k-means analysis on verses collapsed into chapters, with some chapters from Mark, Luke, and Matthew mixed together.
However, this expectation was not supported. Instead, some clusters were topical in nature. Murai & Tokosumi (2006) argued that the Gospels contain verses that are echoed in one another in synoptic fashion. Thus, various chapters across the four Gospel books can exhibit high degrees of similarity after controlling for Bible version. Furthermore, as the k-means algorithm minimizes distance across the dimensions, individually similar observations such as the synoptic chapters are clustered together. This is why the identified clusters mirror, to a certain extent, groups of highly similar cells in the cosine similarity matrix. A random forest model, on the other hand, is not necessarily minimizing the distance between observations. Furthermore, the random forest model was trained on verses, not on chapters as the k-means clustering was. There might be broader patterns that the random forest model picked up which are not observable at the level of individual verses or in the chapter-level aggregate.
However, it is curious that this was not the case for all topics discussed in the Gospel books, as is evident from the chapter similarity plot and the optimal number of k-means clusters. This could be because of chapter delimitation, where content covered in one chapter of a book can be spread over two in another, reducing the similarity of each chapter as a document. Further, of the two outliers, Luke 1 is uncharacteristically similar to the other outlier, Luke 3, but otherwise typical of the other chapters as measured by cosine similarity. In a high-dimensional space this sets the two chapters apart from the others, resulting in their separate classification, even though Luke 1 appears relatively close to the other clusters when decomposed into the first two principal components.
#### Conclusion
Thus, what is observed here is an inability to satisfactorily explain why a random forest model is better able to classify and predict than a k-means model. Neither can the hypothesis that the four Gospels are stylometrically distinct be conclusively supported. An extension of this work would ideally include some form of topic modeling across chapters or verses to test topical similarity directly. Further, if there are common topics and proper nouns among the books, then burstiness can be used to explore intensity; normalized for length, and relative to one another, burstiness should be identical if the books cover the same issues over the same timeline. In the end, it appears fitting for a text that has been studied so extensively for millennia to retain inexplicable features that continue to capture collective interest.
#### References
Goodacre, M. (2000). A Monopoly on Marcan Priority? Fallacies at the Heart of Q. In Society of Biblical Literature Seminar Papers 2000 (pp. 538-622).
Hamlin, H. (2011). The King James Bible after 400 years: Literary, linguistic, and cultural influences. Cambridge University Press.
Linmans, A. J. M. (1998). Correspondence analysis of the Synoptic Gospels. Literary and Linguistic Computing, 13(1), 1-13.
McDonald, D. (2014). A text mining analysis of religious texts. The Journal of Business Inquiry, 13(1), 27-47.
Murai, H., & Tokosumi, A. (2006). Synoptic Network Analysis of the Four Gospels. In SCIS & ISIS 2006 (pp. 1590-1595). Japan Society for Fuzzy Theory and Intelligent Informatics.
Murai, H. (2013). Exegetical Science for the Interpretation of the Bible: Algorithms and Software for Quantitative Analysis of Christian Documents. In Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (pp. 67-86). Springer, Heidelberg.
Xia, P., & Yarowsky, D. (2017, November). Deriving Consensus for Multi-Parallel Corpora: an English Bible Study. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers) (pp. 448-453).
#### Appendix
```{r Appendix, echo=FALSE}
# rf 1
gospels_rf_cm$table %>%
kable(caption = "Table A1: Predicting Book and Version with Random Forests | Prediction(V), Reference(H)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
pack_rows("ASV", 1,4) %>%
pack_rows("KJV", 5,8) %>%
pack_rows("WEB",9,12)
# rf 2
gospels_rf_version_cm$table %>%
kable(caption = "Table A2: Predicting Version with Random Forests | Prediction(V), Reference(H)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
# rf 3
gospels_rf_book_cm$table %>%
kable(caption = "Table A3: Predicting Book with Random Forests | Prediction(V), Reference(H)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```