Final Report 2011: Who wrote the Letter to the Hebrews?


Acknowledgement

The team wishes to extend our thanks to Professor Derek Abbott and Dr Brian Ng for their support and guidance throughout the course of this project. In addition, the team would also like to thank the researchers who previously worked on this project; it could not have progressed well without their support and assistance.

Executive Summary

The Letter to the Hebrews, a book of the New Testament, is one of the most famous texts whose authorship is widely debated. Continuous research on authorship attribution over the years has produced a variety of algorithms and methods that can be applied to the Letter to the Hebrews. In this project, two feature extraction methods and three text classifiers are selected to study the controversy over the authorship of Hebrews. The extraction methods are Maximal Frequent Word Sequence and Common N-gram, and the classifiers are Naïve Bayes, Support Vector Machine and Dissimilarity Calculation. The team will evaluate the performance of the chosen methods and provide unbiased statistical evidence to aid the study of the authorship of the Letter to the Hebrews.



Introduction

Project Objective

This project aims to analyse the controversy over the authorship of the Letter to the Hebrews. To do so, the project team will study techniques from the data mining field and apply them to authorship attribution problems. Two extraction methods and three text classifiers are proposed for this project. The project team will implement the proposed methods and evaluate their competence as authorship classification tools. At the end of the project, the team will provide a conclusion on the authorship of the Letter to the Hebrews based on an unbiased approach and the collected statistical evidence.

Project Approach

Researchers have studied authorship attribution for many years, and a number of techniques have been created and tested on authorship problems. Some of the techniques focus on the content level while others focus on the lexical level. The project team understands the importance of providing unbiased results and of selecting algorithms compatible with non-English languages, so the algorithms chosen by the team are based on the pattern-extraction principle. The two proposed extraction methods, Maximal Frequent Word Sequence and Common N-gram, both use a pattern-extraction methodology: they are language independent and can provide unbiased statistical data. Three text classifiers are used in conjunction with the extraction methods. These methods will be tested on a number of documents before finally being applied to the New Testament in Greek; this approach lets the project team adjust the algorithms accordingly and minimise errors in them.

Previous studies on authorship attribution

The earliest study in authorship attribution was by Mendenhall [1], who used the characteristic curve of composition. The approach was later heavily criticised by Florence (1904) on the grounds that the algorithm is biased by the language in which the text is written. Results showed that only small differences can be observed between the characteristic curves of different English writers, indicating that two authors with completely different writing styles, yet writing in the same language, will probably have roughly the same characteristic curve. The conclusion was thus drawn that characteristic curves fail to capture an author's writing style.


A successful case of using statistical evidence in authorship attribution was made by Mosteller and Wallace [2] in 1964 over the controversy of the Federalist. The Federalist consists of 85 papers from three authors: Alexander Hamilton, James Madison and John Jay. While 73 of the papers have known authors, the other 12 were claimed by both Hamilton and Madison. In 1964 Mosteller and Wallace published the book ‘Inference and Disputed Authorship: The Federalist’, which analysed the authorship problem of the Federalist and concluded that James Madison is the author of the 12 disputed papers. In their tests Mosteller and Wallace examined the frequencies of 70 ‘marker words’ used by these authors in their writing and found that James Madison shared the same characteristics as the 12 papers, which were thus attributed to him.


In the 1990s conclusions made by several researchers led authorship attribution research into a new domain. In 1990 Hilton [3] pointed out that simple feature extraction such as word frequency occurrence can perform very well in authorship attribution. In 1995 Holmes and Forsyth [4] used Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) to show that vocabulary richness variables can provide a good set of discriminators for authorship attribution.


A major breakthrough in authorship attribution came in 2001, when Stamatatos et al. [5] created a new method for authorship attribution using Natural Language Processing (NLP). The researchers applied existing NLP tools to analysis-level style markers, which contain useful stylistic information, and achieved excellent results; their method outperformed the lexically based methods existing at that time. Authorship attribution had entered the statistical and computational domain.


In 2002, Baayen [6] conducted a series of experiments using the most frequent function words for feature extraction and PCA and LDA as text classifiers. Baayen later concluded that LDA is more suitable than PCA for authorship attribution, as PCA performed poorly in the experiments. In 2003, Baayen collaborated with Juola [7], this time using a method called cross-entropy for authorship attribution. The method quantifies the difference between two data samples by measuring the distance between them. Juola claimed that the cross-entropy method has better accuracy than LDA and can be applied efficiently to shorter texts. The method can also be applied to a variety of linguistic and text-analysis problems, since the measured distance between two samples is a simple numerical measure whose results can be compared with other scalar measurements for different types of documents.


In 2004, Sabordo [8] applied the data compression technique and Word Recurrence Interval (WRI) to the Letter to the Hebrews in the New Testament. While Sabordo concluded that the data compression technique performed poorly in authorship attribution, the WRI method proved useful, as the technique could identify similarities in the styles of texts written by the same author. In 2010, Jie Dong et al. [9] applied Function Words Analysis, WRI and a Trigram Markov model with Support Vector Machine to the Letter to the Hebrews. While each method was able to capture some of the writers’ styles, the final results produced by the three methods were affected by the small amount of training data available from the New Testament and did not come to an agreement.

Motivation

In today’s world of overwhelming information, data mining has become an increasingly important area of study. The requirements for extracting useful information from this giant sea of information have become ever stricter in terms of accuracy and computing speed. It is inevitable that a purely statistical technique will no longer produce satisfactory results on highly noisy and complex data structures. Techniques from data mining are therefore combined in order to filter out noise and extract useful, relevant information that is well hidden within large volumes of data.

The project provides the team an excellent chance to study techniques from data mining and apply them to real classification problems. The algorithms used in the project could be further improved and applied to various applications such as plagiarism detection, online search engines and malware detection.

With regard to authorship attribution, findings from the project will contribute to the study of the New Testament and provide a valuable reference for researchers who continue the study of the Letter to the Hebrews.

Corpus

In order to evaluate the performance of the selected algorithms and tune them to their optimal performance in authorship attribution, four text corpora were obtained from the Project Gutenberg archives. The four sets of data are: the English Texts, the Federalist, the King James Version of the New Testament and the Koine Greek New Testament.

The English Texts contain 155 articles from 6 well-known English writers: Henry James, Zane Grey, B.M. Bower, Sir Arthur Conan Doyle, Richard Harding Davis and Charles Dickens. Each author has 26 articles (Henry James has only 25). A complete list of articles is in the appendix.

The Federalist contains 85 papers written by Alexander Hamilton, James Madison and John Jay: 51 papers by Hamilton, 14 by Madison, 5 by John Jay and 3 written in collaboration between Hamilton and Madison. The last 12 papers, No. 49-58 and No. 62-63, were originally disputed but have been attributed to Madison. In order to achieve an accurate classification model, the papers written jointly by Hamilton and Madison are not used. A complete list of the articles used is in the appendix.

The King James Version of the New Testament is an English translation of the original texts, the work of 47 scholars from the Church of England. Translation itself, however, might add noise to the data and hence affect the results of authorship attribution, so it is worth comparing the results obtained from the King James Version with those from the original Greek New Testament. A complete list of articles is in the appendix.

The New Testament was originally written in Koine Greek, and analysing it in its original form preserves the original fingerprint of each author. Findings obtained from the Koine Greek New Testament should therefore be more realistic and accurate, closer to the true author of the Letter to the Hebrews. Two more Greek articles, from Barnabas and Clement of Rome, are added to the collection, as Biblical scholars have claimed them as potential authors of the Letter to the Hebrews. A complete list of articles is in the appendix.

Background

Background of New Testaments

The Bible has 66 books divided into two sections: the Old Testament, consisting of 39 books, and the New Testament, consisting of 27 books. The table below shows the authors of the 27 books of the New Testament. Books whose authorship is not widely agreed upon are not included in the corpus.

[Table: authors of the 27 books of the New Testament]

Background of the Letter to the Hebrews

The Letter to the Hebrews, also known as the Epistle to the Hebrews, is said to have been written in late AD 63 or early AD 64. Traditional scholars have held that the letter was written to the Jews of that time. Since the author’s name is not included in the letter, the authorship has become a mystery, and the debate over it has continued for over 1,800 years.

In the 4th century, Jerome and Augustine of Hippo supported Paul’s authorship, and the Letter to the Hebrews was counted as the 14th letter written by Paul (Fonck, 1910). However, George Milligan [10] commented that Paul was unlikely to be the author of Hebrews because of the inconsistencies in content and writing style compared with other letters written by Paul. Clement of Alexandria, on the other hand, claimed that Paul could be the author of the original Hebrews, written in Greek, with the current letter later translated by Luke, thus explaining the differences in writing style. The reason for Paul not leaving his name on the letter would be that the Hebrews had conceived a prejudice against him at that time.

Over the years Biblical scholars have created a list of potential authors for the Letter to the Hebrews. Some of the authors were added based on their knowledge of the subjects treated in the letter: Barnabas was added as a potential author because he was from the tribe of Levi, and the major theme of the Letter to the Hebrews is the Levitical law and the priesthood; Clement of Rome was selected due to his close association with Paul. The list of potential authors for the Letter to the Hebrews is below:

[Table: list of potential authors of the Letter to the Hebrews]

Pre-process Text

All training corpora, testing texts and disputed texts are pre-processed prior to feature extraction. This decision is based on previous studies and on the project team's objective of implementing a statistical approach to authorship attribution: it was determined beforehand that non-alphabetic characters do not contribute to the accuracy of the classification results.

Remove redundant information

Most of the English texts contain redundant or content-related information, such as the article title and the subtitles of paragraphs. This redundant information does not contribute to the results and may introduce errors into the testing, so the project team decided to remove all of it.

Remove Punctuations

The project team utilises the algorithms used by the previous group [9]. Texts with punctuation removed fall into the following two situations:

[Figure: the two situations for punctuation removal]

Remove Non-English ASCII Symbols and Control Characters

The project team used Java as the programming language for the extraction methods. When the program encounters non-English ASCII symbols or control characters, it removes them from the text, leaving output texts consisting only of alphabetic characters.

Mirror Koine Greek to English Alphabets

Since Java only reads plain text files (.txt) and does not support the encoding of the Koine Greek texts, the Koine Greek characters are replaced with corresponding English alphabet characters. It is important to note that mirroring Greek onto the English alphabet does not alter the content of the Greek texts, so all information usable for authorship attribution is preserved.

[Table: mapping between Koine Greek characters and English alphabet characters]
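As an illustration of the pre-processing steps above, the sketch below strips punctuation, control characters and other non-alphabetic symbols and mirrors a few Koine Greek letters onto English ones. The mapping shown is a small illustrative sample, not the team's full correspondence table.

```java
import java.util.Map;

/** Minimal sketch of the pre-processing step: strips non-letter
 *  characters and mirrors Koine Greek letters onto English ones.
 *  The mapping below is illustrative only; the full table is the
 *  one shown above. */
public class PreProcessor {

    // Hypothetical one-to-one mapping for a few Greek letters.
    private static final Map<Character, Character> GREEK_TO_LATIN = Map.of(
            'α', 'a', 'β', 'b', 'γ', 'g', 'δ', 'd', 'ε', 'e');

    public static String clean(String raw) {
        StringBuilder out = new StringBuilder();
        for (char ch : raw.toLowerCase().toCharArray()) {
            if (GREEK_TO_LATIN.containsKey(ch)) {
                out.append(GREEK_TO_LATIN.get(ch)); // mirror Greek letter
            } else if (Character.isLetter(ch) && ch < 128) {
                out.append(ch);                     // keep plain ASCII letters
            } else if (Character.isWhitespace(ch)) {
                out.append(' ');                    // normalise whitespace
            }
            // punctuation, control characters and other symbols are dropped
        }
        return out.toString().replaceAll(" +", " ").trim();
    }
}
```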

Maximal Frequent Word Sequence & Naïve Bayes classifier

A common practice in text classification is to view a text as sequences of words grouped into sentences and to apply algorithms that extract frequent patterns from the text. These extracted patterns are used as style markers representing an author’s writing habits; a disputed text is then assigned to the author who holds the most of the style markers extracted from it.

The idea of Maximal Frequent Word Sequence (MFWS) extraction is to extract word sequences that appear multiple times in the same text collection. The algorithm generates a set of word sequences that normally contain both function words and content words [11]; features from both can be utilised as style markers for classification problems. Since it is a pattern-extraction method, the algorithm is language independent: it can be applied not only to English texts but also to non-English texts such as Chinese and Japanese.

Definition of Maximal Frequent Word Sequence

Given a document collection consisting of N articles, a word sequence (of at least 2 words) is an MFWS if it appears in n or more of these articles, where n is a given threshold greater than 1 and less than or equal to the total number of articles N. An important point is that multiple occurrences of a word sequence within one article are counted only once.

It is clear that different values of the threshold n affect the selection of MFWS, and the definition places strict requirements on the document collection: a collection with a small or unbalanced number of articles can only use small values of n. For MFWS to work appropriately, a large database is therefore required. However, the New Testament provides only a small collection for each author; the author James, for instance, has only one article.

In order to implement MFWS in this project, a less strict definition of MFWS is used. Under the new definition the algorithm no longer seeks word sequences that repeat between articles; instead it extracts frequent word sequences within an article, and there is no limit on the value of the threshold n. A lower value of n allows the algorithm to extract more features, whereas a higher value of n extracts fewer features and tends to produce shorter word sequences. This approach is, however, more sensitive to the length of an article.
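A minimal sketch of the relaxed definition is given below: all word sequences of length 2 or more are counted within a single article, sequences meeting the threshold n are kept, and non-maximal sequences (those contained in a longer frequent sequence) are discarded. This illustrates the idea rather than reproducing the team's exact program.

```java
import java.util.*;

/** Sketch of the relaxed MFWS extraction used in this project:
 *  frequent word sequences are sought within a single article, and
 *  maximality is enforced by discarding any frequent sequence that
 *  is contained in a longer frequent one. */
public class MfwsExtractor {

    public static Set<String> extract(String article, int threshold, int maxLen) {
        String[] words = article.trim().split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        // Count every word sequence of length 2..maxLen.
        for (int len = 2; len <= maxLen; len++) {
            for (int i = 0; i + len <= words.length; i++) {
                String seq = String.join(" ", Arrays.copyOfRange(words, i, i + len));
                counts.merge(seq, 1, Integer::sum);
            }
        }
        // Keep sequences meeting the frequency threshold n.
        List<String> frequent = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= threshold) frequent.add(e.getKey());
        // Keep only maximal sequences (not contained in a longer one).
        Set<String> maximal = new HashSet<>();
        for (String s : frequent) {
            boolean contained = false;
            for (String t : frequent)
                if (!t.equals(s) && (" " + t + " ").contains(" " + s + " "))
                    contained = true;
            if (!contained) maximal.add(s);
        }
        return maximal;
    }
}
```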

Introduction of Naïve Bayes Classifier

Naïve Bayes is a probabilistic classifier based on Bayes’ rule. Despite its simplicity, Naïve Bayes has proved quite competitive for text classification and is frequently used in text categorization [12].

Given a set of features $F = \{f_1, f_2, \ldots, f_{|F|}\}$ from a category $c$, the probability of a text $d$ belonging to $c$ can be represented using Bayes' rule:

$$P(c \mid d) = \frac{P(c)\,P(d \mid c)}{P(d)}$$

By assuming statistical independence of the features, that is, for $i \neq j$, $f_i$ and $f_j$ are conditionally independent given $c$, the equation can be simplified to:

$$P(c \mid d) = \frac{P(c)\prod_{i=1}^{|F|} P(f_i \mid c)}{P(d)}$$

Since $P(d)$ is a constant for every category $c$, the equation can be written as:

$$P(c \mid d) \propto P(c)\prod_{i=1}^{|F|} P(f_i \mid c)$$
Using the maximum likelihood estimate method (MLE), $P(c)$ can be written as:

$$\hat{P}(c) = \frac{N_c}{N}$$

where $N_c$ is the number of documents of category $c$ and $N$ is the number of documents in the whole collection. $P(f_i \mid c)$ is the conditional probability, estimated as the relative frequency of feature $f_i$ in the documents belonging to category $c$:

$$\hat{P}(f_i \mid c) = \frac{N_{c f_i}}{N_c}$$

where $N_{c f_i}$ is the number of documents of category $c$ containing feature $f_i$, and $|F|$ is the total number of features.

MLE will return a zero probability for a document whenever the document does not contain all the features. To eliminate the zero-probability problem, Laplace smoothing is added to the formula:

$$\hat{P}(f_i \mid c) = \frac{N_{c f_i} + \alpha}{N_c + \alpha\,|F|}$$

where $\alpha$ is a user-defined smoothing parameter: $\alpha = 0$ corresponds to the unsmoothed maximum likelihood estimate and $\alpha = 1$ corresponds to Laplace add-one smoothing.

The goal of authorship attribution is to find the most likely class for the disputed document. Following the rules above, the Naïve Bayes classifier returns the class with the largest probability:

$$c^{*} = \arg\max_{c}\;\hat{P}(c)\prod_{i=1}^{|F|}\hat{P}(f_i \mid c)$$

In the Java implementation the product is evaluated as a sum of logarithms to avoid floating-point underflow:

$$c^{*} = \arg\max_{c}\left[\log \hat{P}(c) + \sum_{i=1}^{|F|}\log \hat{P}(f_i \mid c)\right]$$
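The scoring above can be sketched in Java as follows, assuming the document counts $N_c$ and the per-feature counts $N_{c f_i}$ have already been collected from the training database; the class and method names are illustrative.

```java
import java.util.List;
import java.util.Map;

/** Sketch of the Naïve Bayes scoring above, computed in log space
 *  with Laplace smoothing. docCounts maps each candidate author c
 *  to N_c; featureDocCounts maps each author to the per-feature
 *  document counts N_cfi. */
public class NaiveBayes {

    public static String classify(List<String> features,
                                  Map<String, Integer> docCounts,
                                  Map<String, Map<String, Integer>> featureDocCounts,
                                  int totalDocs, double alpha) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        int vocab = features.size();                          // |F|
        for (String author : docCounts.keySet()) {
            int nC = docCounts.get(author);
            double score = Math.log((double) nC / totalDocs); // log P(c)
            for (String f : features) {
                int nCf = featureDocCounts.get(author).getOrDefault(f, 0);
                // log P(f_i | c) with Laplace smoothing
                score += Math.log((nCf + alpha) / (nC + alpha * vocab));
            }
            if (score > bestScore) { bestScore = score; best = author; }
        }
        return best;
    }
}
```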

Problems with Laplace add-one smoothing

Laplace add-one smoothing is very easy to implement and eliminates the zero-probability problem. However, the smoothing reduces the probability of each seen feature when $|F|$ is large, giving a large probability mass to unseen features (noise). For example, suppose a document collection D has only 10 examples, a test article has 100 features ($|F| = 100$), and one feature can be found in 9 of the 10 examples of D. According to MLE the probability of the test article belonging to the collection, given that feature, is 0.9; with Laplace smoothing the probability is reduced to 0.0909 because of the large value of $|F|$.
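The reduction follows directly from the smoothed estimate given earlier:

$$\hat{P}(f \mid c) = \frac{N_{cf} + 1}{N_c + |F|} = \frac{9 + 1}{10 + 100} = \frac{10}{110} \approx 0.0909 \quad\text{versus}\quad \frac{N_{cf}}{N_c} = \frac{9}{10} = 0.9$$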

This project will evaluate the performance of the Naïve Bayes classifier with Laplace add-one smoothing. If the results show poor performance, a better smoothing technique may be used.

Method Implementation

Figure 2 illustrates the classification process using MFWS.

[Figure 2: classification process using MFWS]

The MFWS algorithm is first used to extract features from both the disputed text and the training corpus. Features extracted from the training corpus are then filtered to form a training database that matches the features extracted from the disputed text. Both the disputed text’s features and the training database are then fed into the classifier to find the most likely author.

Software development

During the project several updates were made to the MFWS algorithm and the Naïve Bayes classifier, as shown in tables 3 and 4:

[Tables 3 & 4: development history of the MFWS algorithm and the Naïve Bayes classifier]

For the detailed program design, please refer to the stage 2 report. An example summary of extracted features and an example of Result.txt are included in Appendices F & G.

Result and discussion

Result from English Text

As described in the previous section, a set of 155 articles was obtained from the Project Gutenberg archives; the 6 authors AD, BB, CD, HJ, RD and ZG contributed to the collection. All 155 articles are used as the training corpus to assess the performance of the MFWS algorithm and Naïve Bayes classifier.

In order to obtain a more accurate measure of the methods' performance, the project team decided to run tests on all 155 articles. The test is arranged as follows:

Each author contributes 26 of their own articles as training data (HJ only has 25), and the 155-document collection is used as the testing database. Each test picks 1 article from the 155 as a ‘disputed text’ and removes that article from the training database before performing the classification, continuing until all 155 articles have been tested. For each test, the threshold n used for MFWS is increased from 2 to 10 in increments of 1 to analyse the effect of different threshold values. The upper limit of n is set to 10 because no features can be extracted from most training texts beyond that threshold.
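The protocol amounts to a leave-one-out loop over the collection. The fragment below is a sketch assuming that corpus maps article names to pre-processed text and that the helpers named in the comments wrap the components sketched earlier; none of these names come from the team's actual program.

```java
// Sketch of the leave-one-out protocol, assuming hypothetical helpers:
//   extractFeatures(text, n) -> MFWS features of one article (sketched earlier)
//   buildDatabase(corpus, n) -> filtered training database for the classifier
//   classify(features, db)   -> most likely author (Naïve Bayes, sketched earlier)
//   authorOf(name)           -> true author label of an article
for (int n = 2; n <= 10; n++) {                        // MFWS threshold
    int correct = 0;
    for (String article : corpus.keySet()) {
        Map<String, String> training = new HashMap<>(corpus);
        training.remove(article);                      // leave the disputed text out
        String predicted = classify(extractFeatures(corpus.get(article), n),
                                    buildDatabase(training, n));
        if (predicted.equals(authorOf(article))) correct++;
    }
    System.out.printf("n=%d accuracy=%.1f%%%n", n, 100.0 * correct / corpus.size());
}
```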

Table 5 shows the average number of features per article extracted from each author for the threshold values n = 2, 5, 8 and 10. As shown, the number of features generally decreases as the threshold increases, and the number of features extracted from each author is roughly balanced.

[Table 5: average number of features per article for each author at n = 2, 5, 8 and 10]

A small threshold n allows more features to be extracted, and the average length of these features is longer than that of features extracted with a larger threshold. Careful examination of the extracted features for different thresholds also shows that features extracted with a small threshold contain more content words, whereas most features extracted with a large threshold contain only function words.

The test is run on all 155 articles. A table containing the complete results is attached in the appendix; the results are summarised in table 6.

[Table 6: classification accuracy for each author at different threshold values]

Table 6 shows the percentage of test files that MFWS and the Naïve Bayes classifier assign to the correct author for different threshold values. In most cases the best result is achieved at threshold n=2. Notice that the results obtained using the combination of features from n=2 to 10 have slightly worse accuracy; the decrease might be introduced by features from large thresholds.

The table also shows that classification for author RD gives the poorest results, whereas classification for author HJ performs best. Studying the classifier outputs shows that articles by RD are most often wrongly assigned to HJ, which suggests two possibilities: RD and HJ share some common writing styles, or the features extracted from HJ dominate the test and skew the results for authors whose writing characteristics are not as strong as HJ's.

[Figure 3A: overall classification accuracy for n = 2 to 10]

Overall, after 155 tests the results show that the performance of MFWS and the Naïve Bayes classifier is sufficient. As shown in figure 3A, a peak accuracy of 83% is achieved at threshold n=2, and the average accuracy is approximately 70% for n = 2 to 10.

Result from Federalist

The test is arranged in a similar manner to the test on the 155 English texts. As described in the previous section, the Federalist contains 85 articles from three authors, labelled H, J and M: 51 articles are written by H, 5 by J, 3 jointly by H and M, and the remaining 26 by M.

Based on previous studies, the project team decided to conduct tests only between H and M; J is not considered due to his small number of texts. The test is carried out on the 77 texts by H and M as the test database; articles written by J and articles written jointly by H and M are not included. The training texts used for M are his 26 articles, while the training texts for H (51 articles) are divided into two parts of 26 texts each, so the test is divided into test A and test B.

Table 7 contains the average number of features per article extracted from both H and M’s training data for test A and B at threshold n = 2, 5 and 10.

[Table 7: average number of training features per article for H and M in tests A and B at n = 2, 5 and 10]

For test A, the number of training features for H is considerably smaller than for M; the variation in feature count amounts to approximately 43% (59.92/138.19) of H's total training data size. In contrast, test B has a much more balanced training data size for both authors, with a maximum variation of 5.04, which is insignificant compared to the training data size.

Figures 3 and 4 show the test results for H in tests A and B respectively. In both cases the test articles from M are all correctly assigned to M. The information obtained for H, on the other hand, clearly shows the negative influence that a seriously unbalanced training data size has on classification results with the Naïve Bayes classifier.

[Figures 3 & 4: classification results for H in tests A and B]

According to figure 3, most test articles by H are wrongly classified as M in test A, whereas figure 4 shows a significant improvement in classification accuracy for H in test B, with approximately 80% of the test articles correctly classified. Figure 5 compares the overall accuracy of tests A and B: a 38% increase in accuracy is achieved when a balanced training database is used.

[Figure 5: overall accuracy of tests A and B]

The observations from these tests suggest that accurate text classification with MFWS and the Naïve Bayes classifier requires a well-structured and balanced training database, and that reasonable performance is most likely to be achieved using threshold n = 2 or the combination of features extracted for n = 2 to 10.

Results from New Testaments King James Version

There are three major reasons for testing on the English King James Version (KJV) of the New Testament. Firstly, the results can be used to evaluate the performance of MFWS and the Naïve Bayes classifier on a small data collection. Secondly, the experience gained from testing on the KJV New Testament can later be used for testing on the Koine Greek version. Finally, it is also worth studying whether the English KJV shares common characteristics with the Koine Greek version and whether both produce the same author for the Letter to the Hebrews.

The test is conducted on the Gospel of Luke and the Acts of the Apostles. It is commonly agreed that the Gospel of Luke and Acts were written by the same person, so the test result can provide accurate feedback on the method and classifier. Based on the previous test results, the threshold values used for extracting MFWS are set to n=2 and n=11 (a combination of features from n=2 to 10). In order to construct a balanced training database, the training articles are divided into small sections. Table 8 shows the number of features extracted for the divided texts.

[Table 8: number of features extracted from the divided texts]

It can be seen that the articles from James, Jude and Peter cannot produce a sufficient number of features, so it was decided to conduct tests only for the authors John, Luke, Mark, Matthew and Paul. The articles highlighted in yellow in the table are used as training texts; each author has 3 articles to keep the training database balanced. The last file is used as the testing text. The same structure is used for testing with n=11.

Figure 6 shows the results for both tests. With n=2 the classifier wrongly assigns the testing file to Mark, whereas with n=2 to 10 it assigns the file to Luke. Results are normalised to the most likely author to compare the differences. The figure also shows some similarity between Luke's and Mark's results, as the distance between them is quite narrow; this could indicate that Luke and Mark share some writing characteristics. Another thing to note is that if the training data used for Luke is the Gospel of Luke and the test file is the Acts of the Apostles, the classifier performs correctly for both n=2 and n=2 to 10. The same behaviour was observed by the previous group [9]. An explanation is that the features held by the Acts of the Apostles are less significant and representative than those held by the Gospel of Luke, and are hard to distinguish from authors with similar writing habits.

[Figure 6: classification results for the Acts of the Apostles at n = 2 and n = 2 to 10]

Figure 7 shows the classification results for the KJV Letter to the Hebrews for n=2 and n=2 to 10. In both cases the classifier predicts Paul as the most likely author of Hebrews. The second most likely author is Luke, and the third most likely alternates between Mark and Matthew across the two tests.

[Figure 7: classification results for the KJV Letter to the Hebrews]

The test results from the KJV New Testament indicate that the project's approach to small collections of articles is on the right track; the experience will be used for the Koine Greek New Testament. The classification results also suggest that the most likely authors for the KJV Letter to the Hebrews are, in descending order, Paul, Luke, Mark or Matthew, and John.

Result from New Testaments Koine Greek

The tests on the Koine Greek version of the New Testament are similar to the tests on the KJV New Testament. Two more candidates, Clement of Rome (CL) and Barnabas (BA), are added to the list, as they are potential authors of the Letter to the Hebrews. The test is divided into two sections: in the first section, test A, the whole text of each author is used as training data; in the second section, test B, the longer training texts are divided into small paragraphs of roughly equal length.

Test A

Figure 8 shows the test result on the Acts of the Apostles using the uncut Greek texts. The score is normalised to Luke to show the differences. The classifier correctly assigns the Acts of the Apostles to Luke. It also indicates that Matthew shares some common features with Luke, as his score is the closest to Luke's; Mark and Paul come third, and John is in fourth place. Comparing this result with the data obtained from the KJV test, some parts do not agree: Mark is ranked the second most likely author in the KJV test, while Matthew and Paul do not rank as highly there, being in the fourth and fifth positions. The variation may be introduced by the translation of the original Greek into English.

[Figure 8: test result for the Acts of the Apostles using uncut Greek texts]

Figure 9 shows the test result for the Letter to the Hebrews using the uncut Greek text. The most likely author is Paul, followed by Luke, Matthew, Mark and John. It is surprising to find that the result matches the KJV test result. While the differences between the authors in the KJV test are not significant, the result obtained here shows larger dissimilarities: Luke and Matthew share the second rank and Mark and John share the third.

[Figure 9: test result for the Letter to the Hebrews using uncut Greek text]

Test B

In test B the longer training texts are divided into shorter texts so that every article has roughly the same length. The test is once again conducted on the Acts of the Apostles and the Letter to the Hebrews. Results are normalised to the most likely author.

Figure 10 shows the result for the Acts of the Apostles. Despite slight changes for some authors, the overall result is the same as in test A: the tested article is correctly assigned to Luke and the ranks of the other authors remain unchanged.

[Figure 10: test B result for the Acts of the Apostles]

Figure 11 contains the classification result for the Letter to the Hebrews. Once again the result is consistent with test A and the KJV test. Luke's rank rises closer to Paul's, and John is now ranked the same as Mark. The reduction in dissimilarity is probably caused by double registration of the same features: since longer texts are divided into smaller ones, features that used to be registered only once may be registered several times in the divided texts, reducing the gaps between the authors. However, the overall result is unchanged. Based on tests A and B, the likely authors of the Greek Letter to the Hebrews are, in order, Paul, Luke, Matthew, Mark, John, James, BA, CL, Peter and Jude.

[Figure 11: test B result for the Letter to the Hebrews]

Conclusion

Based on the tests and results, it was found that good classification accuracy can be achieved using MFWS features extracted at threshold n=2. In addition, the classification accuracy improves further when the number of training texts and the number of extracted features are balanced between the authors. An important point is that the amount of training data does affect classification accuracy: while the approach can be applied to small text collections, the result is rather limited, and a large amount of training data is always preferred if more accurate classification is to be achieved.

MFWS does a fairly good job of capturing the authors' writing characteristics, and the Naïve Bayes classifier gives reasonable classification accuracy. The results obtained from the KJV and the Koine Greek New Testaments are also consistent, which suggests that MFWS is adequate as a feature extraction algorithm and is not affected by the translation of the text. On the other hand, the performance of the Naïve Bayes classifier is limited by Laplace smoothing, and a more sophisticated smoothing algorithm is needed for more accurate classification. Given the imperfections of the approach, further development of the algorithm is encouraged in order to obtain a reliable result.

Since MFWS is language independent, further testing on other non-English texts such as Chinese and Japanese is recommended. The pattern-extraction nature of MFWS can also be developed further for applications such as spam detection and internet search engines.

Common N-gram and Support Vector Machine

Introduction of Common Ngram

The Common N-gram (CNG) method is a character-level N-gram approach and is likewise language independent. It was first introduced by Keselj et al. [19]. The method extracts segments of length n that represent the author's writing style; the algorithm extracts each feature together with its frequency of occurrence.

The CNG extraction algorithm was implemented in Java, with the pre-processed text used as its input. For example, if one of the 155 English texts is chosen, the output of the extraction algorithm is as shown in Table 9 below:

[Table 9: features extracted from RD_MC with n = 4, ranked by frequency]

Table 9 shows the features extracted from the text RD_MC for n=4, ranked by frequency in descending order.
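A minimal sketch of the extraction is shown below: a window of length n slides over the pre-processed text, each character segment is counted, and the segments are ranked by descending frequency. The example text and class name are illustrative, not taken from the team's program.

```java
import java.util.*;

/** Minimal sketch of character-level Common N-gram extraction:
 *  count every character segment of length n and rank the segments
 *  by descending frequency of occurrence. */
public class CngExtractor {

    public static List<Map.Entry<String, Integer>> extract(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++)
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((a, b) -> b.getValue() - a.getValue()); // descending frequency
        return ranked;
    }

    public static void main(String[] args) {
        for (var e : extract("the quick brown fox the quick", 4))
            System.out.println(e.getKey() + "\t" + e.getValue());
    }
}
```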

Introduction of Support Vector Machine

As mentioned in the earlier progress reports, the Support Vector Machine (SVM) was chosen as one of the classification models to follow the Common N-gram extraction algorithm. The SVM first builds a classifier from large sets of training data collected from the authors' extracted features; after the training phase is finished, the testing data (the disputed text) is fed to the SVM classifier to test its accuracy.

[Figure: overview of SVM training and classification]

The SVM was implemented in Matlab. Since the SVM cannot directly read the output produced by Java (shown in Table 9), a specific format for the output file is required. The output is therefore converted to a standard data-file format, which produces feature vectors of the same dimension for both training and testing data and is used as the SVM input. Since the SVM reads numbers, the frequencies of occurrence of the extracted features are used, arranged in order within each author profile.

[Figure 13: example of the standard SVM input format, n = 4]

Figure 13 shows an example of the standard SVM input format for n=4. This conversion is performed in Java. The input format contains four main parts (a sketch of the conversion follows the list):

  • Number of training and testing texts
  • Number of testing texts (the disputed texts)
  • Data dimensions: for Common N-gram, the number of extracted features above the frequency threshold. Matlab does not consider an extracted feature whose frequency is below the threshold, which is set to 10 in this project. In the figure, a zero means the frequency of the extracted feature is below the threshold.
  • The data matrix: each row holds the frequencies of the extracted features for one text, covering both training and testing data, and starts with a text label carrying the author’s name (A1, A2, A3, ..., A120) or the disputed text’s label (D1).
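The conversion step can be sketched as below. The exact layout is the one shown in Figure 13; here the header counts, the dimension line and the labelled rows are written in a plausible plain-text form, with the labels (A1...A120, D1) and frequency rows supplied by the caller. This is an assumption-laden illustration, not the team's actual writer.

```java
import java.io.PrintWriter;
import java.util.List;

/** Hedged sketch of the Java step that writes the SVM input file:
 *  header counts, the data dimension, then one labelled row of
 *  feature frequencies per text. */
public class SvmInputWriter {

    public static void write(String path, List<String> labels,
                             List<int[]> rows, int numTesting) throws Exception {
        try (PrintWriter out = new PrintWriter(path)) {
            out.println(rows.size());          // number of training + testing texts
            out.println(numTesting);           // number of disputed texts
            out.println(rows.get(0).length);   // data dimension (features above threshold)
            for (int i = 0; i < rows.size(); i++) {
                StringBuilder line = new StringBuilder(labels.get(i)); // A1..A120 or D1
                for (int f : rows.get(i)) line.append(' ').append(f);  // 0 = below threshold
                out.println(line);
            }
        }
    }
}
```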

Kernel Function

The SVM has four kernel functions: linear, quadratic, Gaussian Radial Basis Function (RBF) and polynomial. Each kernel function achieves a different accuracy, and the important task is to establish which function gives reasonable accuracy on the different text files.

Output Ranking System

The SVM is a binary machine, designed to classify between two authors, whereas the project requires multi-author (more than two) classification: there are three sets of text files and the maximum number of authors is eight. Pairwise classification is therefore used to compare the candidates and judge the preferred author.

For instance, given four authors AD, BB, CD and HJ and a disputed text, the SVM performs six tests: AD vs BB, AD vs CD, AD vs HJ, BB vs CD, BB vs HJ and CD vs HJ. The author predicted in each test is recorded.
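The voting logic can be sketched as follows, with classifyPair standing in for the Matlab SVM decision between two authors: each pairwise win contributes one vote, and the authors are ranked by vote count. The names are illustrative.

```java
import java.util.*;

/** Sketch of the pairwise (one-vs-one) ranking used with the binary
 *  SVM: every pair of authors is tested once, each test gives one
 *  vote to its winner, and authors are ranked by votes. */
public class PairwiseRanking {

    interface PairClassifier { String classifyPair(String a, String b); }

    public static List<String> rank(List<String> authors, PairClassifier svm) {
        Map<String, Integer> votes = new HashMap<>();
        for (String a : authors) votes.put(a, 0);
        for (int i = 0; i < authors.size(); i++)
            for (int j = i + 1; j < authors.size(); j++) {
                String winner = svm.classifyPair(authors.get(i), authors.get(j));
                votes.merge(winner, 1, Integer::sum);  // one vote per pairwise test
            }
        List<String> ranked = new ArrayList<>(authors);
        ranked.sort((a, b) -> votes.get(b) - votes.get(a));
        return ranked;                                 // most likely author first
    }
}
```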

[Figure 14: pairwise classification results for authors AD, BB, CD and HJ]

Figure 14 shows predicted results indicating that BB is the most likely author of the disputed text; the second and third are CD and AD respectively, and HJ is the most unlikely author.


If two authors receive the same score in the predicted results, so that we cannot conclude exactly which one is the most likely author, we can still identify the most unlikely author, or determine the most likely author through further tests.

[Figure 15: SVM output for the four kernel functions, n = 4]

Figure 15 shows the output of the SVM classifier with the four kernel functions for n=4. The testing file, named SVM_Output_4_AD_AB, comes from the 155 English Texts and its author is AD. The linear, quadratic and polynomial kernel functions all return the correct author AD, labelled A1.

Software development

The Common N-gram algorithm was implemented in Java. The versions of the Common N-gram algorithm and the Support Vector Machine classifier are presented in table 10:

[Table 10: version history of the Common N-gram algorithm and SVM classifier]

Result and discussion

Result from English Text

The English text set contains 155 articles from six authors: AD, BB, CD, HJ, RD and ZG. Five of the authors contribute 26 articles each and the remaining author (HJ) contributes 25. In the testing, all 155 articles are used to help evaluate the accuracy of the Common N-gram algorithm. One article is chosen for testing each time and the remaining articles are treated as training data. For instance, if an article by author AD is used as the disputed text, the remaining articles by AD and all the other authors' articles (154 articles) form the training data. After this "disputed" article is tested and the classifier produces its result, the program moves on to the next article until all 155 articles are tested. The number of training texts (154 articles) used to build the SVM classifier does not change between tests.

Since the Common N-gram algorithm extracts features for n from 2 to 10, all 155 English texts are processed 9 times, giving 1395 tests in total. Four kernel functions (linear, quadratic, RBF and polynomial) are applied in the classification; the accuracy for each author obtained from the SVM classifier at n=2 is given in table 11:

[Table 11: per-author accuracy for the four kernel functions, n = 2]

Table 11 shows that for author AD, 25 out of 26 articles are classified correctly by the linear kernel function (96%), 17 out of 26 by the quadratic, none by the RBF, and 22 out of 26 by the polynomial. Overall, the linear kernel function shows higher accuracy than the others for AD's articles, as well as for the other authors BB, CD, HJ, RD and ZG (HJ has 25 articles in total). The average accuracy obtained with the linear kernel function is 96% at n=2, while the average accuracies of the quadratic and polynomial functions are 34% and 55% respectively.

However, no article is classified correctly by the RBF, which means that for features extracted at n=2 the RBF cannot produce the desired result.

[Figure 16: per-author accuracy for the four kernel functions, n = 2]

Figure 16 shows the accuracies for the six authors under the four kernel functions. The quadratic and polynomial lines show that the accuracies of both functions are unstable and vary almost randomly at n=2, while the trend of the linear kernel function is quite smooth. The accuracies from this testing are above 85%, which is high enough for the method to be used to test the authorship of the Letter to the Hebrews. This is the team's initial assumption, which the following tests may confirm or change.

For the tables and figures of accuracies from n=3 to n=10 using the different functions, please refer to the Appendix. From n=3 to n=10 the accuracy of the linear kernel function decreases slightly, but it still shows the highest accuracy among the functions.

[Table 12: per-author accuracy for the four kernel functions, n = 10]

Table 12 shows that the average accuracy over the six authors with the linear kernel function is around 62% at n=10. The average accuracy of the RBF kernel becomes non-zero after n=8; at n=10 it reaches 61%, which is very close to the linear kernel's. The accuracies of the quadratic and polynomial functions are very low at n=10, so these two functions are not appropriate for identifying the author of the disputed text.

[Figure 17: per-author accuracy for the four kernel functions, n = 10]

Figure 17 shows that the trendlines of the linear and RBF kernels approach each other at n=10. We then add up the number of correctly classified texts from each author and divide by the total of 155 English texts for n=2 to n=10; the results are shown in Table 13 and plotted in Figure 18.

[Table 13: overall accuracy for each kernel function, n = 2 to 10]

[Figure 18: overall accuracy for each kernel function, n = 2 to 10]

Figure 18 shows that the linear kernel function has the highest accuracy of all kernel functions from n=2 to n=10, even though it decreases as n increases. In addition, the accuracy of the RBF increases significantly at n=9 and n=10; in other words, the linear and RBF curves converge at n=9 and n=10.

For the quadratic kernel function the overall accuracy is below 35%, which is quite low on the 155 English Texts. The accuracy of the polynomial kernel function is also unstable from n=2 to n=10 on the 155 English Texts. Using these two kernel functions to identify the author of the disputed text would therefore give poor results.

Overall, for the 155 English Texts, the linear kernel function is an appropriate choice for SVM classification and shows high accuracy. The RBF could also be considered when the author's features are extracted with larger n (e.g. n=8, 9, 10).

Result from Federalist

The Federalist contains 85 articles from three authors, H, J and M: H wrote 51 articles, J wrote 5, M wrote 26, and the last 3 were written in collaboration by H and M. Since the purpose is to decide whether the disputed text's author is H or M, the three collaborative articles are not considered here and are removed from the database, leaving 82 articles to allocate to the training and testing databases.

Since J has only 5 articles, they are used solely as training data. We also need to consider how to balance the number of training texts from each author, since otherwise the prediction would be biased towards the author with the larger training data. In this testing, H's 51 articles are split into two parts to match M's 26 articles; each part has 26 articles, with the 26th article used twice. Each part then contains H's 26 articles, J's 5 articles and M's 26 articles as training data, and for each part the testing data consists of H's 51 articles and M's 26 articles.

[Table 14: test results for author M, testing part 1]

Tables 14 & 15 show the test results for author M in the two parts of testing from n=2 to n=10. In both tables the linear kernel function shows high accuracy and almost always assigns the texts to M; the predicted author is without doubt M. Moreover, the accuracy in testing part 2 is more stable, all above 96%, and its average is 2% higher than that of testing part 1 with the linear kernel function, owing to the different training data used.

[Table 15: test results for author M, testing part 2]


[Figures 19 & 20: accuracy for author M under the four kernel functions]


Figures 19 & 20 show the high accuracy of the linear kernel function; its results are also significantly different from those of the other functions in the classification from n=2 to n=10. The linear kernel is therefore the most appropriate function for classification. From the observations on the Federalist papers, using a balanced number of texts, with different texts from each author in the training database, is the optimal condition for an accurate classification model.

[Table 16: test results for author H, testing part 1]

Tables 16 & 17 show the test results for author H in the two parts of testing from n=2 to n=10. The accuracy is highest at n=2 and decreases as the value of n increases. Overall, the linear kernel function shows the highest accuracy compared with the other functions, but it falls from 86% at n=2 to 41% at n=10. The linear function does not perform well at larger n (n=8, 9, 10), as the number of the author's features decreases.

[Table 17: test results for author H, testing part 2]


[Figure: accuracy for author H under the four kernel functions]


It is observed that the RBF has similar accuracy at n=9 and n=10, where it converges with the linear kernel function, which is quite close to the conclusion drawn from the 155 English texts. Based on these observations, we rely on the SVM results produced by the linear kernel function, since it has the highest accuracy of all. For large values of n, the results produced by the RBF kernel can also be used as a reference, because the figures show it has reasonable accuracy there.

Results from New Testaments King James Version

The King James Version is a translated version of the New Testament text. The purpose of using it is to examine the effectiveness of the Common N-gram and SVM classifier on a small data collection; a similar approach will then be used on the original Greek version.

For an accurate classification model, a balanced number of texts from each author must be used, so longer texts are chopped into sets of equally sized small texts. The disputed texts are the Acts of the Apostles and the Gospel of Luke, both written by the same person, Luke. The first test uses the Gospel of Luke as part of the training data and the Acts of the Apostles as the testing data; the second test uses the Acts of the Apostles as training data to test the Gospel of Luke. The training data contains eight authors, each with balanced training data. Each test runs from n=2 to n=10.

[Table 19: results for testing the Acts of the Apostles, n = 2 to 10]

Table 19 shows the results for testing the Acts of the Apostles from n=2 to n=10. The linear kernel function assigns the testing file correctly to Luke from n=3 to n=6, and the RBF from n=5 to n=8. Overall, the linear kernel function is the preferred function for determining the author in this case, with the RBF as a reference.

[Figure 23: author rankings for the Acts of the Apostles from the linear and RBF kernels]

Figure 23 shows the ranking system: Luke is the most likely author of the Acts of the Apostles with the highest rank, Mark is second and Matthew third according to the linear kernel function. The RBF performs slightly differently in the ranking system, although Luke is still the most likely author.

[Table 20: results for testing the Gospel of Luke, n = 2 to 10]

Table 20 shows the results for testing the Gospel of Luke from n=2 to n=10. Only the RBF and polynomial kernel functions assign it correctly to Luke, at n=9 and n=2 respectively. In this case the linear kernel function assigns the testing file wrongly from n=2 to n=10, even though it achieved high accuracy in the other tests, such as the 155 English texts and the 82 Federalist papers.

Using the Acts of the Apostles as training data thus fails to correctly attribute the Gospel of Luke. This could be caused by the Acts of the Apostles producing insufficient author features, differing too little from the other authors' writing, or by the imposition of the translator's style on the original Greek, which makes the authors difficult to distinguish.

Since the linear kernel function assigns the test file wrongly from n=2 to n=10, it would give a biased result in the ranking system; no ranking is therefore shown for the case where the Acts of the Apostles is used as training data.

Result from New Testaments Koine Greek

The original Greek version of the New Testament uses the same database, and the Greek version is tested in the same way as the English version. However, two more authors, Barnabas (BA) and Clement of Rome (CL), are added to the training data for the test of who wrote the Letter to the Hebrews, as they are potential authors.

[Table 21: results for the Greek Acts of the Apostles, n = 2 to 10]

Table 21 shows the results of testing the Greek version of the Acts of the Apostles from n=2 to n=10, with the Gospel of Luke used as training data. The linear kernel function assigns the Acts of the Apostles correctly to Luke; in this test case, however, the RBF does not predict correctly. Overall, the linear kernel function remains the preferred function for predicting the author, as before.

[Figure 24: author ranking for the Greek Acts of the Apostles]

Figure 24 shows that in the ranking system the likely authors of the Acts of the Apostles are the same as in the English version. The only difference is that the RBF predicts everything wrongly and so contributes nothing to the ranking system.

[Table 22: results for the Greek Gospel of Luke, n = 2 to 10]

Table 22 shows the results for testing the Greek version of the Gospel of Luke from n=2 to n=10. There is only one correct prediction assigning it to Luke, by the linear kernel function at n=4, when the Acts of the Apostles is used as training data and the Gospel of Luke as testing data. The performance is as poor as in the English version.

Since the linear kernel function gives only one correct result across n=2 to n=10, the Acts of the Apostles used as training data has insufficient author features to predict the author, and it would give a biased result in the ranking system.

The Gospel of Luke is therefore the more appropriate choice of training data for predicting who wrote the Letter to the Hebrews. Authors BA and CL are also added to the list of training data. The linear kernel function is the preferred function for predicting the author in this case, with the RBF as a reference, especially at n=9 and n=10.


[Figure 25: author ranking for the Greek Letter to the Hebrews]

Figure 25 shows that Paul is the most likely author of the Letter to the Hebrews in the Greek version of the New Testament, using the Common N-gram extraction algorithm and the SVM classifier with the linear kernel function and the RBF. The second most likely author is John and the third is Luke. The English version of the New Testament predicts the same first three most likely authors, Paul, John and Luke, as the Greek version.

Conclusion

The approach of chopping long texts into small collections is the right way to ensure a balanced number of texts for the training data. The other factor that changes the accuracy of the predicted author is the number of extracted features obtained from an article: longer articles have sufficient extracted features to represent the author's behaviour, so the length of the articles affects the accuracy of the prediction.

The number of extracted features falls as the value of n increases, and the accuracy of the linear kernel function decreases slightly even though it remains high. The RBF can be considered as a reference for predicting the author, as it converges with the linear kernel function at n=9 and n=10 in some cases.

Overall, the Common N-gram extraction algorithm is an efficient method for extracting an author's writing style, and the SVM classifier with the linear kernel function performs well, giving reasonable accuracy.

Common N-gram and Dissimilarity Calculation

Introduction of Dissimilarity Calculation

Dissimilarity calculation is a text classification algorithm used in this project. Introduced by Bennett in 1976, it evaluates how different two sets of data are. Dissimilarity can be defined as the distance between two samples under some criterion; in other words, the concept of dissimilarity is the pairwise difference between two samples. As a classification algorithm, dissimilarity calculation is widely used in bioinformatics, for example to align protein sequences and identify regions of similarity that may be a consequence of relationships between the sequences.

In the field of authorship detection, dissimilarity calculation is an effective classifier for determining to whom a text belongs. The occurrence of each feature is evaluated using the CNG algorithm, and these per-feature occurrences become the inputs to the dissimilarity calculation, which works out the difference between two texts. The overall difference between the two texts is the average over all features. During the calculation, features with very high occurrence are discarded; for example, the occurrence of the letter "a" is always very high, so it cannot be considered a distinguishing feature of an author. Similarly, features with very low occurrence need not be considered.
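
As a concrete illustration of the occurrence counting that feeds the classifier, the following is a minimal sketch in Java (the language of the team's CNG implementation). It assumes character-level n-grams and relative frequencies; the class and method names are hypothetical.

import java.util.HashMap;
import java.util.Map;

public class CngProfile {

    // Returns a map from each character n-gram in the text to its
    // relative frequency (occurrence count divided by total n-grams).
    public static Map<String, Double> profile(String text, int n) {
        Map<String, Double> counts = new HashMap<>();
        int total = text.length() - n + 1;
        if (total <= 0) return counts;           // text shorter than n
        for (int i = 0; i < total; i++) {
            counts.merge(text.substring(i, i + n), 1.0, Double::sum);
        }
        // Normalise so that texts of different lengths are comparable.
        for (Map.Entry<String, Double> e : counts.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return counts;
    }
}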

Clustering

Features and their occurrences are obtained using CNG feature extraction, but the raw outputs are unordered and differ significantly from text to text. Clustering them into a matrix is therefore a necessary step before the dissimilarity calculation can begin.

Cluster analysis groups objects based on the information found in the data describing them: the goal is that objects within a group are similar to one another and different from objects in other groups. In this project, the objects are the features extracted from the texts, and each object corresponds to an occurrence count. Objects are usually represented as points in a multi-dimensional space, where each dimension represents a distinct occurrence describing the object. Thus, a set of objects is represented as an m-by-n matrix.
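
One reasonable reading of this step is sketched below (the class and method names are hypothetical, and Java is used for consistency with the other sketches): the union of features over all texts fixes the m rows, each text contributes one of the n columns, and features absent from a text are filled with zero occurrence.

import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class FeatureMatrix {

    // Aligns per-text CNG profiles into one m-by-n matrix:
    // m = number of distinct features over all texts, n = number of texts.
    public static double[][] build(List<Map<String, Double>> profiles) {
        SortedSet<String> features = new TreeSet<>();
        for (Map<String, Double> p : profiles) {
            features.addAll(p.keySet());
        }
        double[][] matrix = new double[features.size()][profiles.size()];
        int row = 0;
        for (String f : features) {
            for (int col = 0; col < profiles.size(); col++) {
                // A feature absent from a text gets occurrence 0.
                matrix[row][col] = profiles.get(col).getOrDefault(f, 0.0);
            }
            row++;
        }
        return matrix;
    }
}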

Algorithm

For this algorithm, the two profiles under test, A and B, form two sets of data:

A = {(x1, f1A), (x2, f2A), ..., (xL, fLA)} and

B = {(x1, f1B), (x2, f2B), ..., (xL, fLB)}

where x represents a feature and f represents its occurrence.

The per-feature dissimilarity is then calculated as shown in Equation 1 below.

D_i = 2|f_iA - f_iB| / (f_iA + f_iB),  i = 1, 2, ..., L      (Equation 1)

Using the above algorithm, a set of dissimilarity values {D1, D2, ..., DL} is obtained, and the final result is the average of these values. From Equation 1, the domain of each dissimilarity value is 0 to 2: when f_iA and f_iB are equal the dissimilarity is 0, meaning the two texts agree completely on that feature; when a feature never appears in one of the texts the dissimilarity is 2, meaning the two texts are totally different. A reasonable dissimilarity threshold must therefore be established by testing a large amount of training data.
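
A minimal sketch of Equation 1 follows, in Java rather than the team's Matlab so that all sketches in this report share one language. Two assumptions are flagged in the comments: features present in only one profile are skipped, matching the implementation note in the conclusion of this section, and the class and method names are hypothetical.

import java.util.Map;

public class Dissimilarity {

    // Average of the per-feature dissimilarity D_i = 2|f_iA - f_iB| / (f_iA + f_iB)
    // over the features shared by both profiles.
    public static double between(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        int shared = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double fb = b.get(e.getKey());
            if (fb == null) continue;   // assumption: features in only one profile are omitted
            double fa = e.getValue();
            sum += 2.0 * Math.abs(fa - fb) / (fa + fb);   // each D_i lies in [0, 2]
            shared++;
        }
        return shared == 0 ? 2.0 : sum / shared;   // no shared features: totally different
    }
}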

Software development

The CNG algorithm is implemented in Java and the text classifier in Matlab. The processing flow is shown in Table 23:

[Table 23]

Results and discussion

In order to evaluate the effectiveness of Common N-gram combined with Dissimilarity Calculation, the training and testing data include the English texts, the Federalist Papers, the King James Version of the New Testament and the Koine Greek version of the New Testament.

A number of texts are used as training data and the remaining texts as testing data. Accuracy is calculated as the number of texts predicted correctly divided by the total number of tested texts. The results are then ranked by accuracy, and the first-ranked author is the author predicted by the technique.

Results from English Texts

In this section, the algorithm is applied to 156 English texts written by six authors, twenty-six texts each. The authors are abbreviated AD, BB, CD, HJ, RD and ZG, namely Sir Arthur Conan Doyle, B. M. Bower, Charles Dickens, Henry James, Richard Harding Davis and Zane Grey respectively. All 156 English texts are tested with n increasing from 2 to 10, so the CNG algorithm produces nine sets of output for each text. These outputs are then fed to the dissimilarity calculation for each value of n. Because the accuracy depends on the chosen dissimilarity threshold, the accuracy over a range of thresholds is plotted in Figure 27 below.

[Figure 27]

From the figure above, relatively high accuracy is observed when the threshold equals 0.45 across the range of n values, so 0.45 is selected as the threshold. The following tables show the dissimilarity values for each author's first five texts and the accuracy achieved on the test data for each author.

[Tables: dissimilarity values for each author's first five texts and per-author accuracy]

As the tables above show, the dissimilarity calculation algorithm analyses the English texts effectively. Among the six sets of tests, texts written by Arthur Conan Doyle (AD) and Richard Harding Davis (RD) achieve higher accuracy than the other authors', while texts written by Charles Dickens and Zane Grey achieve lower accuracy. This may be attributed to the number of features extracted from the original texts: longer texts allow more features to be captured, and a larger feature set can improve accuracy. Differences in text length may therefore cause the differences in accuracy.
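
To make the threshold selection concrete, the sketch below sweeps candidate thresholds across the [0, 2] domain of the dissimilarity and scores each one. The decision rule is an assumption, since the report does not spell it out: a comparison between two texts by the same author should fall below the threshold, and a comparison between texts by different authors should fall at or above it. The input arrays are hypothetical.

public class ThresholdSweep {

    // sameAuthorD: averaged dissimilarities for same-author text pairs;
    // diffAuthorD: averaged dissimilarities for different-author pairs.
    public static double bestThreshold(double[] sameAuthorD, double[] diffAuthorD) {
        double best = 0.0;
        double bestAccuracy = -1.0;
        int total = sameAuthorD.length + diffAuthorD.length;
        for (int step = 1; step <= 40; step++) {
            double t = 0.05 * step;                 // candidate thresholds 0.05 ... 2.0
            int correct = 0;
            for (double d : sameAuthorD) if (d < t) correct++;   // should be accepted
            for (double d : diffAuthorD) if (d >= t) correct++;  // should be rejected
            double accuracy = (double) correct / total;
            if (accuracy > bestAccuracy) {
                bestAccuracy = accuracy;
                best = t;
            }
        }
        return best;
    }
}

Under this rule the accuracy peaks at an intermediate threshold, which is consistent with the 0.45 value read off Figure 27.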

Results from the Federalist Papers

The Federalist Papers corpus comprises sixty-five texts written by three authors: Alexander Hamilton (H), John Jay (J) and James Madison (M). Forty-six texts are written by Hamilton, only five by Jay and the remaining fourteen by Madison. Because Jay wrote only five texts, his texts are not used as training data. Tests have been run on all of Hamilton's and Madison's texts with varying n values, using the same method as the English texts testing. The results are shown in Figure 28 below.

[Figure 28]

As the figure above shows, the most accurate results are obtained when the threshold is 0.45, the same value as for the English texts.

Results from the New Testament, King James Version

The testing process for the King James Version is similar to the English texts and Federalist Papers testing: the n value is adjusted from 2 to 10, and the accuracy is computed for distinct thresholds. The Acts of the Apostles and the Gospel of Luke in this corpus are traditionally attributed to Luke. In this testing, 26 training texts written by different authors are used to examine who wrote the Acts of the Apostles and who wrote the Gospel of Luke. The threshold accuracy results are shown first, in Figure 29 and Figure 30.

[Figure 29]
[Figure 30]

The most accurate thresholds can be identified from both figures: when the threshold is 0.5, the average accuracy is slightly higher than at other thresholds. This result differs from the previous two tests. As mentioned before, the King James Version of the Bible is a translated edition rather than an original text edition, so the slight difference could be due to changes in sentence structure or wording introduced by translation.

[Figure 31]
[Figure 32]

Although the translated edition can affect the threshold selection, the results show a clear distinction: the predicted author is Luke, whether the corpus is the "Acts of the Apostles" or the "Gospel of Luke".

Compared with the results derived in previous testing, these results are broadly coherent. The conclusion from this corpus testing is that the CNG and dissimilarity calculation algorithms are effective enough to capture features and analyse authorship.

Results from the New Testament, Koine Greek

In this testing, two sets of 18 texts written by different authors are used as training data, and the two texts written by Luke are used as test data, one for each set. The threshold selection results are shown in Figure 33 and the figure following.

[Figure 33]
[Figure 34]

Both figures show that the higher accuracy is associated with a threshold of 0.45, as established in the previous testing.

[Figure 35]
[Figure 36]

The two figures above show that the proposed author is still Luke, the same as in the King James Version testing. As expected, Paul is ranked second and Matthew third for the "Acts" corpus. For the "Gospel of Luke" corpus, Luke is again concluded to be the author, with Matthew second and Paul third.

The CNG feature extraction method and the dissimilarity calculation algorithm have thus proven highly effective across various corpora. At this stage, 18 texts written by different authors are used as training data and the Letter to the Hebrews as testing data, to address the question "who wrote the Letter to the Hebrews?". The result, shown in Figure 37, is that the Letter to the Hebrews is attributed to Matthew; the second most probable author is Luke and the third is Peter.

[Figure 37]

Conclusion

To sum up, the Common N-gram feature extraction method combined with the dissimilarity calculation classifier is effective. However, compared with the results from the other team members, the author cannot yet be conclusively identified. One reason is the length of the texts: testing experience shows that longer texts give higher accuracy, so text length is a contributing factor. In addition, the implementation of the dissimilarity calculation considers only features that appear in both texts; a feature that appears in only one of the texts is omitted. Although this omission does not affect the results significantly, the results for longer texts might differ slightly. In further investigation, accounting for text length and for all feature occurrences will be necessary.

Project Management

Work Breakdown Structure

Tasks of the project are divided based on the Work Breakdown Structure (WBS) for the project. Referring to the WBS, each team member is responsible for one particular approach to maximize time and work efficiency: programming of Maximal Frequent Word Sequence & Naïve Bayes was completed by Kai He, programming of Common N-gram & Support Vector Machine by Yan Xie, and programming of Dissimilarity Calculation by Zhaokun Wang. A Gantt chart was created according to the WBS and includes milestones to help the project team monitor progress.

In addition, a WBS was created for the writing of the Final Report (refer to Appendix E). After the completion of the individual write-ups, the remaining tasks were divided as follows: proofreading by Kai He, formatting and editing by Yan Xie, and uploading of the document onto the Wiki by Kai He and Yan Xie.

[Work Breakdown Structure diagram]

Milestones

Before the start of the project, a Gantt chart was set up to monitor and control progress. The milestones correspond to the Gantt chart and show the important events and due dates. Additional internal milestones were added to the table to help the team manage the project.

[Milestones table]

Project Budget

A total of $750 was allocated for the project. At the start of stage one the project team estimated that the total expense would be approximately $200, covering printing of research material, making the poster and video, etc. However, the team soon realised that, because the project involves no hardware implementation and documents are submitted in electronic format, spending would be much less than estimated, so the estimate was adjusted to $30 for the entire project.

The actual amount spent on this project was $0; the cost of printing research documents was covered by the team members' school printing allowances.

Risk Assessment

Because no hardware implementation is involved in this project, the project's risks are confined to schedule and data management.

[Risk assessment table]

Project Conclusion

In summary, the features extracted by Maximal Frequent Word Sequence and Common N-gram have proven to be sufficient style markers to capture the authors' writing characteristics. The three classifiers used in this project also present reasonable accuracy, with the Support Vector Machine and its linear kernel function performing best of all. Naïve Bayes is relatively easy to implement but suffers from its simplicity; a more sophisticated smoothing algorithm would certainly improve its overall performance. Dissimilarity Calculation is likewise an easy-to-understand concept, but the threshold value needs to be carefully established to achieve optimal performance.

Figure 38 shows the predicted authors of the Koine Greek Letter to the Hebrews from the three approaches. The results show both agreement and disagreement. Paul is supported as the most likely author of Hebrews by both the MFWS & Naïve Bayes and the Common N-gram & SVM approaches, whereas the Common N-gram & Dissimilarity Calculation approach suggests that Matthew is the most likely author. Note that the results also disagree on the rank of Peter: in the first two approaches Peter is the least likely author, while Dissimilarity Calculation ranks him as the third most likely author of the Letter to the Hebrews. Barnabas and Clement, however, are deemed unlikely authors in all cases. The remaining possible authors are Luke, Mark and John, whose positions vary slightly across the results. The variations are possibly caused by the length of the training texts and the way the training database is constructed. Interestingly, the second and third approaches used similar N-gram features with different classifiers yet produced vastly different results, which suggests that different classifiers can produce completely different outcomes even when similar features are used in the training database.

At this point, even though the results cannot reach a definite conclusion about the authorship of the Letter to the Hebrews, the outcomes do contradict the claim that Paul is not its author. The project adds further evidence to Clement of Alexandria's theory that Paul is possibly the author of the original Letter to the Hebrews and that Luke might be the author of the translated version, as both are ranked highly in the results.


[Figure 38]


Future Work

Because of their pattern extraction nature and language independence, Maximal Frequent Word Sequence and Common N-gram can be further developed for applications such as online search engines and DNA analysis tools. Since the algorithms capture authors' writing styles, it would be relatively simple to find a book or a piece of lyrics by an author using only a single chapter of text. The algorithms could also be applied to authorship attribution in other languages, such as disputed Chinese or Japanese texts.

References

[1] Eddy H. T., The Characteristic Curves of Composition, <http://www.jstor.org/stable/1763509>, viewed September 2011

[2] Mosteller, Frederick and Wallace, David L. Applied Bayesian and Classical Inference: The Case of the Federalist Papers. New York: Springer-Verlag, 1984.

[3] Hilton, J. L., On Verifying Wordprint Studies: Book of Mormon Authorship, BYU Studies, vol. 30, 1990.

[4] Holmes, D. I., & Forsyth, R. S., The 'Federalist' Revisited: New Directions in Authorship Attribution, Literary and Linguistic Computing, 10, 111-127, 1995.

[5] Stamatatos, E., Fakotakis, N. & Kokkinakis, G., Automatic Text Categorization in Terms of Genre and Author, Computational Linguistics, vol. 26, no. 4, pp. 471-495(25), December 2000

[6] Baayen, H., Halteren, H. V., Neijt, A. & Tweedie, F., An experiment in authorship attribution, 6th JADT, 2002

[7] Juola, P. & Baayen, H., A controlled corpus experiment in authorship attribution by crossentropy, Proceedings of ACH/ALLC- 2003, 2003.

[8] Sabordo,M., Shong, C. Y., Berryman, M. J. & Abbott, D., Who Wrote the Letter to the Hebrews? – Data Mining for Detection of Text Authorship, SPIE vol. 5649 pp. 513 – 524, 2004.

[9] Jie, D., Leng, Y. T. & Tien-en, J. P., Who Wrote the Letter to the Hebrews? – Data Mining for Detection of Text Authorship, University of Adelaide, 2010.

[10] George Milligan, The Vocabulary of the Greek New Testament, Eerdmans, Grand Rapids, 1954.

[11] Rosa M. C., Luis V. P., Manuel M. G., Paolo R., Authorship Attribution using Word Sequences, Universidad Politécnica de Valencia.

[12] Antunes, C., Oliveira, A., "Generalization of Pattern-growth Methods for Sequential Pattern Mining with Gap Constraints", Third IAPR Workshop on Machine Learning and Data Mining MLDM 2003, LNCS 2734, 2003, pp. 239-251.

[13] Chaski, C.E. (1997). “Who wrote it? Steps toward a Science of Authorship Identification.” National Institute of Justice Journal, pp. 15-22. September.

[14] García-Hernández, R., Martínez-Trinidad, F., and Carrasco-Ochoa, A. (2006). A New Algorithm for Fast Discovery of Maximal Sequential Patterns in a Document Collection. International Conference on Computational Linguistics and Text Processing, CICLing-2006, Mexico City, Mexico, 2006.

[15] Gastrich, Jason, The Authorship of the Epistle to the Hebrews, 1998, <https://jcsm.org/Education/authorshipofHebrews.htm> viewed August 2011

[16] Jian Pei, Jiawei Han, et al., "Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach", IEEE Transactions on Knowledge and Data Engineering, 16(10), 2004, pp. 1424-1440.

[17] J. Vernon McGee, The Authorship of Hebrews or Did Paul Write Hebrews?, Thru the Bible Radio Network <http://www.thruthebible.org/atf/cf/%7B91e2424c-636c-40c2-9c55-890588e90ece%7D/AUTHORSHIP%20OF%20HEBREWS.PDF> viewed August 2011.

[18] Krsul, I., and Spafford, E. H. (1996). “Authorship analysis: identifying the author of a program.” Technical Report TR-96-052.

[19] Keselj, V., Peng, F., Cercone, N., Thomas, C. (2003). “N-gram based author profiles for authorship attribution.” In Proc. Pacific Association for Computational Linguistics.

[20] Lin, M. Y. and Lee, S. Y., "Efficient Mining of Sequential Patterns with Time Constraints by Delimited Pattern-Growth", Knowledge and Information Systems, 7(4), 2005, pp. 499-514.

[21] Christopher J.C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, Vol. 2, pp. 121-167, 1998

[22] Noble, W. S., What is a support vector machine?, Nature Biotechnology, Nature Publishing Group, 2006, available: http://www.nature.com/naturebiotechnology.

[23] Sloin, Alba and Burshtein, David, Support Vector Machine Training for Improved Hidden Markov Modeling, IEEE Transactions on Signal Processing, vol. 56, no. 1, January 2008.

[24] Anderson, C. P., The Epistle to the Hebrews and the Pauline Letter Collection, The Harvard Theological Review, Cambridge University Press, vol. 59, no. 4, 1966.

[25] J. Vernon McGee, The Authorship of Hebrews or Did Paul Write Hebrews?, Thru the Bible Radio Network <http://www.thruthebible.org/atf/cf/%7B91e2424c-636c-40c2-9c55-890588e90ece%7D/AUTHORSHIP%20OF%20HEBREWS.PDF> viewed April 2011

[26] W. Gary Crampton, Hebrews: Who is the Author?, First Presbyterian Church of Rowlett < http://www.fpcr.org/blue_banner_articles/Who-Wrote-Hebrews.htm> viewed May 2011

[27] Gastrich, Jason, The Authorship of the Epistle to the Hebrews, 1998, <http://jcsm.org/Education/authorshipofHebrews.htm> viewed April 2011

[28] Putnins, T. J., Signoriello, D. J., Jain, S. Berryman, M. J., & Abbott, D., Who wrote the Letter to the Hebrews? Data mining for detection of text authorship, University of Adelaide, 2005

[29] Marton, Y., Wu, N. & Hellerstein, L., "On Compression-Based Text Classification. Advances in Information Retrieval: 27th European Conference on IR Research", Springer LNCS 3408, pp. 300-314, 2005.

[30] Smith, M. W. A., Recent experience and new developments of methods for the determination of authorship, ALLC Bulletin, 11:73–82, 1983.

[31] Holmes, D. I., & Forsyth, R. S., The 'Federalist' Revisited: New Directions in Authorship Attribution, Literary and Linguistic Computing, 10, 111-127, 1995.

[32] Hilton, J. L., On Verifying Wordprint Studies: Book of Mormon Authorship, BYU Studies, vol. 30, 1990.

Appendices

Appendix A: English Texts (Six authors)


Appendix B: Federalist Papers (Three authors)


Appendix C: King James Version of the New Testament


Appendix D: Koine Greek text of the New Testament


Appendix E: Work Breakdown Structure for the Final Report


Appendix F: A sample summary output from MFWS extraction algorithm


Appendix G: A sample summary output from Naïve Bayes Classifier


Appendix H: Test results for English Texts (MFWS+Naïve Bayes)

Result for Author AD:


Result for Author BB:


Result for Author CD:


Result for Author HJ:


Result for Author RD:


Result for Author ZG:


Appendix I: Test results for Author Hamilton from the Federalist Papers (MFWS + Naïve Bayes)


Appendix J: Test results when n=3 for English Texts (Common Ngram & SVM)


Appendix K: Test results when n=4 for English Texts (Common Ngram & SVM)


Appendix L: Test results when n=5 for English Texts (Common Ngram & SVM)


Appendix M: Test results when n=6 for English Texts (Common Ngram & SVM)


Appendix N: Test results when n=7 for English Texts (Common Ngram & SVM)


Appendix O: Test results when n=8 for English Texts (Common Ngram & SVM)


Appendix P: Test results when n=9 for English Texts (Common Ngram & SVM)


Appendix Q: Gantt Chart
