Final report 2010: Who wrote the Letter to the Hebrews?


Acknowledgement

The team wishes to extend its thanks to Professor Derek Abbott and Dr Brian Ng for their support and guidance throughout the course of this project. In addition, the team would also like to thank Dr Matthew Berryman for his input and guidance. The project could not have progressed well without their support and assistance.

Executive Summary

The New Testament of the Bible contains a number of texts with disputed or unknown authorship. One of the most widely debated is the letter to the Hebrews. A number of authorship attribution methods have been developed and refined over the years, and this project aims to implement these methods to address the question of who wrote the letter to the Hebrews. Three methods were selected, researched and enhanced for authorship attribution; their effectiveness is analysed and they are then applied to the letter to the Hebrews. The three methods are Function Word Analysis, Word Recurrence Interval and the Trigram Markov model. In addition, a Support Vector Machine is implemented to develop a classification model that uses the feature vector extracted from a given text. The results obtained from these methods shed light on the possible authors and eliminate other suspects. This information is essential in identifying the rightful author of the letter to the Hebrews.

Introduction

Project Objectives

This project aims to provide a non-biased approach, using authorship attribution algorithms, to analysing the letter to the Hebrews. In doing so, the project team will examine the results obtained, identify the likely authors of the letter and eliminate other authors from the list of possible candidates. The project further enhances three feature extraction algorithms in order to identify the author of the letter: frequency of occurrence of function words, Word Recurrence Interval and the Trigram Markov model. In addition, a Support Vector Machine will be implemented to develop a classification model. The Support Vector Machine has been demonstrated to be accurate in classification (as discussed in section 4) and can contribute significant evidence regarding the author of the letter to the Hebrews. This project also aims to apply authorship attribution algorithms to languages other than English.

Project Approach

There are a number of algorithms and techniques that have been used in authorship attribution. Some techniques involve analysing the other works of an author and comparing them with the disputed text to see whether the writing of the disputed text is similar to that of those works. Another technique involves analysing the author's character and knowledge of the subject matter and judging whether the disputed text could be attributed to the author on that basis. However, these approaches often produce results that scholars view as biased, and such results frequently end up being debated. In this report, the team focuses on a non-biased approach: analysing and observing key features in a text that can distinguish one author from another. These algorithms have been shown in the past to be effective markers in identifying the rightful author of a disputed text (see section 1.4). A number of algorithms exist, such as Word Recurrence Interval, the Kolmogorov-Smirnov test, the Trigram Markov model, the Gutman (LZ preprocessing) method and the frequency of occurrence of function words. This project focuses on three of these algorithms to provide evidence on who wrote the letter to the Hebrews: frequency of occurrence of function words, Word Recurrence Interval and the Trigram Markov model.

Structure of This Report

In this report, the background of the letter to the Hebrews is briefly discussed, followed by a brief overview of the Bible, in particular the New Testament. The aims of this project and the approach taken are provided in sections 1.1 and 1.2 respectively. A short discussion of past studies conducted in the field of authorship attribution is provided in section 1.4. The significance of the project and the data set used to analyse the effectiveness of the algorithms are provided in sections 1.5 and 1.6. Thereafter, this report presents the task of processing the text into a suitable form before specific features, called feature vectors, are extracted and presented to the Support Vector Machine to develop a classification model. A discussion of the Support Vector Machine is presented in section 4 for readers who would like to know more about it. The results from the three algorithms are presented in their respective sections in the following order: Function Word Analysis, Word Recurrence Interval and the Trigram Markov method. The data used in the analysis are English texts, the Federalist Papers, the translated version of the New Testament and the Koine Greek text of the New Testament. The project management aspects of this project are then presented, and a conclusion is given in the last section.

Previous Studies

Over the past centuries, many researchers have worked on authorship detection, resulting in a large and diverse body of work. The earliest study in this subject was conducted by Mendenhall [4], who used the characteristic curve of composition. However, it was heavily criticised by Florence in 1904, who argued that the technique was principally controlled by the language in which the text is written. Florence pointed out that there were only very small differences in the characteristic curves between various English writers; hence it is presumable that completely different authors writing in the same language would produce approximately the same characteristic curves. This led to the conclusion that characteristic curves are too narrow a measure to properly distinguish an author's style of writing.

In 1964, Mosteller and Wallace published a book titled 'Inference and Disputed Authorship: The Federalist', providing statistical evidence which led to a conclusion on who wrote the twelve disputed papers of the Federalist. The Federalist Papers were written in 1787-1788 by Alexander Hamilton, James Madison and John Jay. The authors of these 85 papers were unknown at the time, and the papers were signed off as "Publius". In 1807, a Philadelphia periodical received a list, said to have been left by Hamilton prior to his death in 1804, that assigned the various papers to their authors, namely Alexander Hamilton, James Madison and John Jay. However, in 1818, James Madison claimed that he wrote twelve of the Federalist papers that Hamilton had ascribed to himself. In a quest to settle the dispute over the Federalist Papers, Mosteller and Wallace examined the use of 'marker words' that were used with very different frequencies by the two authors, Hamilton and Madison. In total, 70 function words were applied and the results presented. The twelve disputed papers were then attributed to James Madison.

In 1983, Smith [5] conducted research in an attempt to obtain evidence by using word and sentence length as discriminators for authorship attribution. He used the chi-square measure as the method of detection for the measurements in the research. However, the problems that occurred during the interpretation of the chi-square measure led him to the conclusion that authorship attribution based on word count and sentence length was not feasible, as it led to incorrect results.

In 1990, Hilton [12] used word-pattern ratios on 'The Book of Mormon', such as the number of times 'a' appears as the first word of a sentence divided by the number of sentences in a paragraph. He observed that both non-contextual word frequencies and word-pattern ratios have relatively good differentiating power, suggesting that they would be a good first choice in authorship attribution. However, Hilton also mentioned that vocabulary richness measurements are not a good method for differentiating texts: they provide considerable information but have several limitations in detecting differences and similarities between texts. In conclusion, Hilton's results highlighted the fact that simple feature extraction such as word frequency of occurrence is a good choice.

In 1995, Holmes and Forsyth [15] used Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) for classification, with vocabulary richness and word frequency analysis for data preparation. The main idea was the transformation of the observed variables into a new set of uncorrelated variables. This is done with the intention of reducing the dimensionality of the problem and also of finding new information in the variables for easier interpretation. These components are linear combinations of the observed variables, so it was anticipated that the first few components would carry most of the variation of the original texts. Based on their results, it was shown that using vocabulary richness for feature extraction and LDA for data classification gives excellent results, suggesting that vocabulary richness variables provide a good set of discriminators for authorship attribution. Another conclusion was that it is advisable to use large sets of common words (e.g. 50 or more) if word frequency analysis is used, in order to obtain meaningful results, which concurs with Burrows' research [16].

In 2001, Stamatatos et al. [8] introduced a new method of text categorization by genre or author using Natural Language Processing (NLP). They concluded that the method achieved relatively high classification accuracy. They attempted to take advantage of existing NLP tools by using analysis-level style markers that provide useful stylistic information. Their method outperformed the existing lexically based methods in authorship attribution at that time. The results achieved show that stylistic differences are better suited to text genre detection than to authorship attribution, as stylistic differences are clearer among text genres.

In 2002, Baayen [18] conducted an experiment on students' essays using the most frequent function words for feature extraction and PCA and LDA for data classification. He concluded that LDA is a more appropriate technique than PCA for authorship attribution, because PCA with the highest-frequency function words fails to detect the authorial structure of the texts and does not lead to insightful clustering. In addition, it was found that the simple inclusion of punctuation marks in the analysis enhanced the classification accuracy, suggesting that punctuation marks may prove to be effective style markers, especially for texts that have not been altered or modified editorially for publication.

In 2003, Baayen collaborated with Juola [19] on a method named cross-entropy in authorship attribution. The technique takes a "measurement on unpredictability of a given event with a specific model of events and expectations". In other words, the method can quantify the difference between two data sets and measure the distance between them. Juola argued that the cross-entropy method performs the task more accurately than PCA or LDA, highlighting that the method can be effectively applied to shorter texts. Juola claimed that the method could accurately determine the authorship of a disputed text using less than a page of data. Furthermore, the cross-entropy method can be applied widely to a variety of linguistic and text-analysis problems, which suggests that the "distance" it measures is a numerical measure in the same sense as other scalar measurements and can be compared with them across different types of document.

In 2004, Sabordo [20] applied data compression techniques and the Word Recurrence Interval (WRI) to the Letter to the Hebrews in the New Testament. Sabordo concluded that the Prediction by Partial Matching (PPM) and GZip compression techniques were not effective in analysing the similarities or differences in the patterns or relationships between texts. More specifically, the GZip and PPM compression techniques produced graphical results with poor discrimination ability due to overlapping standard deviations. However, the WRI algorithm proved to be useful and successful, as the technique could identify similarities in the styles of texts written by the same author.

In 2005, Putnins et al. [21] used function word frequency and a Trigram Markov model with Multiple Discriminant Analysis (MDA), as well as Word Recurrence Interval. Based on the results, they showed that function word frequency and the Trigram Markov model give relatively high accuracy in authorship attribution and thus have the ability to provide statistical evidence for authorship attribution problems. However, they also showed that WRI performs poorly in authorship attribution.

Motivation

Since the invention of the Internet in the 1960s, it has developed at such a rapid rate that it has changed people's lifestyles throughout the world. An increasing number of economic and intellectual activities use the global Internet as a medium, and scrutinising the authenticity of documents has become more challenging. In addition, the data available is often vast and noisy, implying that it may be imprecise and that its structure is complex. A purely statistical technique would not produce satisfactory results; data mining was therefore developed. It is a technique that filters out noisy data and extracts useful and relevant information hidden within large volumes of data.

Data mining techniques perform well in a wide range of applications. Three major fields of application are plagiarism analysis, authorship identification and near-duplicate detection. In plagiarism analysis, the technique can determine the similarity between objects, such as source code, articles and music, and hence provide statistical evidence of plagiarism. Authorship identification is another important application, through which existing controversies such as the Shakespeare authorship question and the Letter to the Hebrews may be unravelled. The motivation of this project is to provide information pertaining to the author of the letter to the Hebrews.

Corpus

For the purpose of evaluating the accuracy of the algorithms and identifying a set of conditions that would be optimal for authorship attribution, a corpus was obtained from the Project Gutenberg archives. Four sets of data were used for this purpose, namely English texts, the Federalist Papers, the King James Version of the Bible, and the Koine Greek text of the New Testament.

The English texts were used to develop and evaluate the effectiveness of the algorithms. These texts were chosen from the Project Gutenberg archives, which provide many texts by various authors. In this project, a set of twenty-six texts from each author was obtained from Project Gutenberg. Four texts from each author were set aside and labelled as disputed texts, while the remaining twenty-two texts were used to develop the classification model of the Support Vector Machine. In total, there were one hundred and thirty-two training texts and twenty-four disputed texts.

The Federalist Papers comprise eighty-five papers written by Alexander Hamilton, James Madison and John Jay. Of the eighty-five papers, fifty-one were written by Hamilton, fourteen by Madison, five by John Jay, three were a collaboration between Hamilton and Madison, and twelve papers, Nos 49-58 and Nos 62-63, were disputed. For the purpose of an accurate classification model, the three papers written in collaboration between Hamilton and Madison were removed. The Federalist Papers provide an idealised model for the New Testament: as shown in Table 1, most of the books of the New Testament were written by Paul, just as most of the Federalist papers were written by Alexander Hamilton.

The King James Version of the Bible is an English translation of the original texts, Hebrew for the Old Testament and Koine Greek for the New Testament, which was started in 1604 and completed in 1611. The translation was the work of 47 scholars from the Church of England. However, translation itself might affect the accuracy of authorship attribution, as word-for-word translation was occasionally impossible due to the complexity of languages, such as differences in sentence structure and grammar. The New Testament was originally written in Koine Greek. Analysing the New Testament in its original form, Koine Greek, preserves the original fingerprint of the author and removes the fingerprint of the translators. This approach provides a more accurate classification model for analysing the letter to the Hebrews.

Background

Background of the Letter to Hebrews

According to the Catholic Encyclopedia, the letter to the Hebrews, also known as the epistle to the Hebrews, is said to have been written in late 63 AD or early 64 AD. Traditional scholars have said that the letter was written to the Jews of that time. The authorship of the letter to the Hebrews has been in question since the time of Origen (185 - 256 AD). Numerous authorship attribution techniques have been applied to the text, but the results have often been inconclusive or have only been able to show that Paul of Tarsus, also known as the Apostle Paul, was most likely not the author of Hebrews.

In the 4th century, Jerome and Augustine of Hippo supported Paul's authorship, and the letter to the Hebrews was identified as the fourteenth letter of Paul (Fonck, 1910). However, George Milligan commented that it was unlikely that Paul was the author of Hebrews, because the anonymity of the letter was not consistent with Paul's pattern and the style of writing also differs from that of Paul. Conversely, Clement of Alexandria claimed that Paul did not indicate his name because the Hebrews had conceived a prejudice against him. Furthermore, Clement of Alexandria commented that Paul was likely the author of the original letter but not of the Greek version; the latter is the work of Luke, who translated Paul's letter.

A list of potential authors has been generated over the years by Biblical scholars, based on criteria such as the author's knowledge of the subject of the letter to the Hebrews. Barnabas was seen as a potential author of the letter because he was from the tribe of Levi and the major themes in the letter to the Hebrews are the Levitical law and the priesthood; the Levites are often recognised as the tribe chosen for the priesthood. Others claim that Priscilla and Aquila might have been the rightful authors, as the letter to the Hebrews makes reference to a person called Timothy whom Priscilla and Aquila knew. Paul is the most widely debated candidate, as the writing in this letter is similar to that of his other works. Clement of Rome was put forward as a candidate due to his close association with Paul; that is, if Paul did not write the letter, it was likely that Clement did. Peter was also listed as a potential author by some Biblical scholars, as he was seen as one of the leaders of the Church at that time and had a close association with the Jews, or Hebrews.

Background of the Bible

The Bible itself consists of 66 books written by various authors. It is divided into two sections, namely the Old Testament and the New Testament. The Old Testament consists of 39 books and the New Testament of 27 books. The various authors of the 27 books of the New Testament are shown in Table 1. In addition, most of the books are said to have been written in Koine Greek.

Table 1 : List of Authors of the Bible
Texts Authors
The Gospel of Matthew Matthew
The Gospel of Mark Mark
The Gospel of Luke Luke
The Gospel of John John
The Acts of the Apostles Luke
The General Epistle of James James
The First Epistle of Peter Peter
The Second Epistle of Peter Peter
The First Epistle of John John (might be disputed)
The Second Epistle of John John (might be disputed)
The Third Epistle of John John (might be disputed)
The General Epistle of Jude Jude
The Book of Revelation John (might be disputed)
The Epistle to the Romans Paul
The First Epistle to the Corinthians Paul
The Second Epistle to the Corinthians Paul
The Epistle to the Galatians Paul
The Epistle to the Philippians Paul
The Epistle to Philemon Paul
The Epistle to Titus Paul (might be disputed)
The First Epistle to Timothy Paul (might be disputed)
The Second Epistle of Paul to Timothy Paul (might be disputed)
The First Epistle to the Thessalonians Paul
The Second Epistle to the Thessalonians Paul (might be disputed)
The Epistle to the Ephesians Paul (might be disputed)
The Epistle to the Colossians Paul (might be disputed)
The Epistle to the Hebrews Unknown

There are a number of books in the New Testament that have disputed or unknown authorship. Those whose authors are not widely agreed upon are set aside and not included in the corpus in this project.

Pre-process Text

The Greek and English texts must be pre-processed before the feature extraction algorithms are applied to them. This filtering is done so that each text contains only the information needed to properly attribute it to its corresponding author. Since one of the project objectives was to approach authorship attribution statistically, the team decided to remove all non-alphabetic characters from the text, having concluded that these characters do not contribute to authorship attribution. The following paragraphs discuss the different scenarios that were encountered by the team and the solutions implemented for pre-processing the text. Compared with the English texts, the Greek texts do not contain punctuation and are therefore simpler to pre-process.

Text Divider

A significant issue detected by the team was that the lengths of the known texts for each author differed. It is important that each known text has approximately the same length so that the results of authorship attribution are not skewed. For this reason, the team developed a program named "chopper.java" which separates a text file into a set of text files of approximately the same length.
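
A minimal sketch of such a splitter is shown below. The class and method names (Chopper, splitIntoChunks) and the word-count-based chunking are illustrative assumptions; the actual chopper.java source is not reproduced in this report.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Illustrative sketch (not the original chopper.java): splits one text file
    // into pieces of approximately equal word count so that every training text
    // has a comparable length.
    public class Chopper {
        public static void splitIntoChunks(String inputFile, int wordsPerChunk) throws IOException {
            String text = new String(Files.readAllBytes(Paths.get(inputFile)));
            String[] words = text.trim().split("\\s+");
            int chunkIndex = 0;
            for (int start = 0; start < words.length; start += wordsPerChunk) {
                int end = Math.min(start + wordsPerChunk, words.length);
                String chunk = String.join(" ", java.util.Arrays.copyOfRange(words, start, end));
                Path out = Paths.get(inputFile.replace(".txt", "") + "_part" + (++chunkIndex) + ".txt");
                Files.write(out, chunk.getBytes());
            }
        }

        public static void main(String[] args) throws IOException {
            // Example: cut a long text into pieces of roughly 5000 words each.
            splitIntoChunks("matthew.txt", 5000);
        }
    }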

Punctuation

In the pre-processing algorithm, the program automatically reads in the text and removes all of the punctuation that occurs in it. This effectively removes the punctuation marks in each sentence.

However, upon further research the team found that not all punctuation marks should be removed in this way. The table below shows the punctuation marks that needed to be handled differently.

Table 2 : Punctuation Marks Requiring Special Handling and the Corresponding Method of Implementation
Punctuation Mark | Method to Resolve | Example
Apostrophes, hyphens | Remove the apostrophe or hyphen and join the remaining characters | "Don't" becomes "Dont"; "hand-craft" becomes "handcraft"
Parentheses, brackets, quotation marks | Upon encountering the opening mark, scan for the matching closing mark and remove both together | "hello world." becomes hello world.

Non-English ASCII Symbols and Control Characters

It was verified that the programming language used, Java, supports a substantial number of different text encodings [22].

For this reason, the program reads most non-English ASCII symbols and control characters correctly. Upon encountering these characters, the program removes them from the text, as the Greek and English texts should consist only of letters. Hence it was not necessary to implement any exception handlers for this, as Java handles it competently.

Carriage Returns and Line-feeds

The pre-processing algorithm scans the text and removes carriage returns and line feeds. The algorithm reads each word in the text, converts it to lowercase and appends it to one long sentence. Upon completion of the filtering of a text file, the algorithm prints the sentence into a newly created text file named "modified_filename.txt". These modified text files are the output of the pre-processing algorithm and are used as the input to the feature extraction algorithms.
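
A minimal sketch of this pre-processing step, assuming the exceptions in Table 2 are handled as described, is given below. The class name Preprocessor and the file-handling details are illustrative and not the project's actual implementation.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Illustrative pre-processing sketch: lowercases the text, joins words broken
    // by apostrophes or hyphens, strips all other punctuation, carriage returns
    // and line feeds, and writes the result as one long line to
    // "modified_<filename>".
    public class Preprocessor {
        public static String clean(String raw) {
            String text = raw.toLowerCase();
            text = text.replaceAll("['\\-]", "");        // "don't" -> "dont", "hand-craft" -> "handcraft"
            text = text.replaceAll("[^a-z\\s]", " ");    // drop brackets, quotes and all other non-letters
            return text.trim().replaceAll("\\s+", " ");  // collapse line breaks and repeated spaces
        }

        public static void main(String[] args) throws IOException {
            String raw = new String(Files.readAllBytes(Paths.get("luke.txt")));
            Files.write(Paths.get("modified_luke.txt"), clean(raw).getBytes());
        }
    }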

Koine Greek

The New Testament text was written in Koine Greek. In order to process the Koine Greek, it is necessary to map the language into a form that can be processed by Java. Figure 1 shows an example of what the Koine Greek looks like. The left column shows the Greek letters in both upper and lower case. In this form, Java was not able to recognise the format in which the Koine Greek text was encoded.

Figure 1 : Mapping of the Greek Letter

The right column of Figure 1 shows the mapping of each Greek letter to its Latin equivalent, also known as Beta Code. Java is able to process the Beta Code equivalent, treating it simply as ASCII characters.
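
A minimal sketch of such a mapping is shown below. The exact mapping table used in the project is not reproduced here; the standard Beta Code equivalents shown are an assumption consistent with Figure 1, and the class name BetaCode is illustrative.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: maps lower-case Greek letters to their Beta Code
    // equivalents so that the text can be handled as plain ASCII.
    public class BetaCode {
        private static final Map<Character, Character> MAP = new HashMap<>();
        static {
            MAP.put('α', 'a'); MAP.put('β', 'b'); MAP.put('γ', 'g'); MAP.put('δ', 'd');
            MAP.put('ε', 'e'); MAP.put('ζ', 'z'); MAP.put('η', 'h'); MAP.put('θ', 'q');
            MAP.put('ι', 'i'); MAP.put('κ', 'k'); MAP.put('λ', 'l'); MAP.put('μ', 'm');
            MAP.put('ν', 'n'); MAP.put('ξ', 'c'); MAP.put('ο', 'o'); MAP.put('π', 'p');
            MAP.put('ρ', 'r'); MAP.put('σ', 's'); MAP.put('ς', 's'); MAP.put('τ', 't');
            MAP.put('υ', 'u'); MAP.put('φ', 'f'); MAP.put('χ', 'x'); MAP.put('ψ', 'y');
            MAP.put('ω', 'w');
        }

        // Converts a Greek string to Beta Code, leaving unmapped characters unchanged.
        public static String toBetaCode(String greek) {
            StringBuilder sb = new StringBuilder();
            for (char c : greek.toCharArray()) {
                sb.append(MAP.getOrDefault(c, c));
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(toBetaCode("θεος")); // prints "qeos"
        }
    }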

Support Vector Machine

Background of SVM

As mentioned in the previous section, the Support Vector Machine (SVM) is the classification model used in this project. It is a classification technique that was first invented by Vladimir Vapnik, and the current standard incarnation was proposed by Vapnik and Corinna Cortes in 1995. When the classification algorithm first appeared, it was used in fields of bioinformatics such as DNA sequence analysis. It has since become a very promising tool for machine learning and data classification.

Since the SVM is a classifier, given a set of training data with each sample marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new testing data to one of these two groups. Intuitively, the data can be regarded as points located in space; the SVM looks for support vectors (the data points circled in red in Figure 2) that form a linear boundary for each group. The gap between the two boundaries is referred to as the margin, and the SVM tries to find the optimal separating hyperplane that maximises the margin. This is the reason Support Vector Machines are also called maximum margin classifiers. When a new testing point comes into the system, it is assigned to one of the two groups based on which side of the gap it falls on.

Figure 2: Principle of the Support Vector Machine

If the data are distributed in the way shown in Figure 2, then they are called linearly separable and it is easy to fit a straight line between the two categories. However, in practice the data points are not so perfectly distributed; in other words, they are non-linearly separable, as in the situation in Figure 3, and finding a linear gap can be extremely difficult.

Figure 3: Non-linear Separable Distribution

In order to solve this problem, it is essential to map the non-linearly separable cases into linearly separable ones before applying the SVM. The choice of mapping function is fairly important, and an inappropriate choice may lead to a situation in which the data is hardly separable. An example of an efficient mapping function is shown in Figure 4. Consider a set of N input data points where each point has two coordinate values (x1, x2). The positive samples of this set lie inside a circular region while the negative samples lie outside. Clearly, this problem is not linearly separable. However, each data point can be expanded using the function f below, which defines a mapping operation from 2-dimensional space to 3-dimensional space.

Equation 1 : Mapping Operation from 2D to 3D Space

These transformed data points can be linearly separated by a plane, as shown in Figure 4. When new testing data is fed into the system, the data is mapped using the same transformation, and the group it belongs to is predicted by observing which side of the plane it lies on.
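
The equation image (Equation 1) is not reproduced in this text; one standard mapping of this kind, assumed here for illustration and consistent with the quadratic kernel of section 4.2, is

    f(x_1, x_2) = \left( x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2 \right)

so that points inside the circle \(x_1^2 + x_2^2 = r^2\) and points outside it become separable by a plane in the new 3-dimensional space.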

Figure 4 : Wrapping Non-linear Data

Classification Kernel Functions

Section 4.1 showed how significantly the choice of mapping function influences the SVM's performance. In practice, a kernel function is used to define the mapping function implicitly. Past research has shown that by computing the inner product in the higher-dimensional space defined by the mapping function, it is possible to implicitly obtain the non-linear warping effect of f. Using the same example as in Figure 4, observe the equivalence of the following two ways of computing the inner product of the mapping results:

Method 1 - Explicit Mapping

Equation 2 : Explicit Mapping

Method 2 - Implicit Mapping

Equation 3 : Implicit Mapping

In explicit mapping, the data are first mapped before the inner product is performed, whereas in implicit mapping the inner product is performed first in the original space and the result is squared. Hence kernel functions allow us to evaluate the inner product in the higher-dimensional feature space defined by f without having to compute the mapping function f explicitly.

Equation 4 : Kernel Function
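
The equation images (Equations 2 to 4) are not reproduced in this text; for the mapping assumed above, the two computations and the resulting kernel can be written as

    \text{explicit:}\quad f(p) \cdot f(q) = p_1^2 q_1^2 + 2\, p_1 p_2 q_1 q_2 + p_2^2 q_2^2
    \text{implicit:}\quad (p \cdot q)^2 = (p_1 q_1 + p_2 q_2)^2
    \text{kernel:}\quad K(p, q) = \langle f(p), f(q) \rangle = (p \cdot q)^2

Both routes give the same value, so the inner product in the 3-dimensional feature space is obtained without ever computing f explicitly.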

There are many types of kernel functions, each defining a different mapping. Below is a list of some kernel functions available in the Matlab Bioinformatics Toolbox, which are used as the choices of kernel function for the SVM classification model. One can also define one's own kernel function for particular purposes.

Linear

The linear kernel is the simplest kernel function; it takes the inner product of the feature vectors in their original feature space, plus an optional constant c. Kernel algorithms using a linear kernel are often equivalent to their non-kernel counterparts.

Equation 5 : Linear Kernel Function

Gaussian Radial Basis Function (RBF)

A common choice of positive radial basis function (RBF) kernel in machine learning is the Gaussian kernel. The geometry of RBF kernels is such that the transformed points f(p) in the feature space induced by a positive semi-definite RBF kernel are equidistant from the origin and thus all lie on a hypersphere around the origin.

Equation 6 : Gaussian Radial Basis Kernel Function (RBF)

Quadratic

The quadratic kernel function is a specific type of polynomial kernel function with d = 2. The example shown in section 4.2 is the quadratic kernel in two dimensions. It therefore has the form shown in the equation below.

Equation 7 : Quadratic Kernel Function

Polynomial

The polynomial kernel is a non-stationary kernel. Polynomial kernels are widely used and well suited to problems where all the training data is normalised. The kernel involves an adjustable slope a, a constant term c and the polynomial degree d.

Equation 8 : Polynomial Kernel Function
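
The kernel equation images (Equations 5 to 8) are not reproduced in this text. The sketch below implements the four kernels in the standard forms described above for feature vectors stored as double arrays; the exact parameter values used by the Matlab toolbox are not assumed here.

    // Illustrative sketch of the four kernel functions described above.
    // Parameter names follow the text: constant c, slope a, degree d,
    // and Gaussian width sigma.
    public class Kernels {
        static double dot(double[] x, double[] y) {
            double s = 0.0;
            for (int i = 0; i < x.length; i++) s += x[i] * y[i];
            return s;
        }

        // Linear kernel: inner product plus an optional constant c.
        static double linear(double[] x, double[] y, double c) {
            return dot(x, y) + c;
        }

        // Gaussian radial basis function kernel.
        static double gaussianRbf(double[] x, double[] y, double sigma) {
            double squaredDistance = 0.0;
            for (int i = 0; i < x.length; i++) {
                double diff = x[i] - y[i];
                squaredDistance += diff * diff;
            }
            return Math.exp(-squaredDistance / (2.0 * sigma * sigma));
        }

        // Polynomial kernel with slope a, constant c and degree d.
        static double polynomial(double[] x, double[] y, double a, double c, int d) {
            return Math.pow(a * dot(x, y) + c, d);
        }

        // Quadratic kernel: the polynomial kernel with degree d = 2.
        static double quadratic(double[] x, double[] y, double a, double c) {
            return polynomial(x, y, a, c, 2);
        }
    }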

Operation of SVM

The implementation of the Support Vector Machine classifier in this project is based on Matlab, since it already integrates several built-in SVM functions in the Bioinformatics Toolbox. The two functions used are svmtrain and svmclassify.

In most cases we would like to evaluate the predictor by how well it predicts the labels for new data. This implies, however, that we would have to deploy the system first before evaluating it. Ideally it would be better to evaluate it before deployment. This is achieved by the following steps:

  • Collect a large set of training data
  • Randomly divide it into two non-overlapping subsets: the training set and the testing set
  • Apply the SVM to the training set, producing a classifier
  • Measure the percentage of samples in the testing set that are correctly labelled by the classifier (see Equation 9, reconstructed below)
Equation 9 : Accuracy Equation
  • Repeat the steps above for different sizes of training set
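
The accuracy equation image is not reproduced in this text; the measure described in the step above is simply

    \text{accuracy} = \frac{\text{number of correctly labelled testing samples}}{\text{total number of testing samples}} \times 100\%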

Figure 5 demonstrates the construction of the SVM model and how its efficiency is verified in the project. The aim of the project is to determine the authorship of "the letter to the Hebrews". In this case, the Letter to the Hebrews is the disputed text C in Figure 5. The other groups, A and B, represent other texts in the New Testament whose authors we already know. Texts in the same group are known to have been written by one person. Firstly, some of the texts from groups A and B, labelled A1 and B1, are chosen to train the model. Then, in order to find out how accurate the model is, we pretend that the rest of the texts from A and B are disputed. The predicted results are then compared with their real authors to calculate the model's accuracy.

When the confidence in the model reaches a certain level, we apply the SVM to C, which is "the Letter to the Hebrews", and find out who the most probable author is.

Figure 5: Classifier Construction and Validation

Multigroup Classification and Ranking System

Although the SVM is an efficient tool for classification, it is designed to classify between only two groups. However, the data set described previously consists of texts written by more than two authors. In order for the SVM to keep performing well in this situation, it is essential to develop an algorithm for multi-group classification. In this project, this is achieved by pairwise classification, which means comparing authors in pairs to judge which author of each pair is preferred.

Suppose a data set contains texts each written by one of four authors, A, B, C and D, and there is also a disputed text which needs to be assigned to one of them. The SVM then needs to perform the six tests listed in Table 3 and record the predicted result for each round.

Table 3: Pairwise Comparison Test
Test Texts by Texts by Predict Result
1 A B A
2 A C C
3 A D D
4 B C C
5 B D D
6 C D C
Figure 6: Rank Plot

After all the tests are executed, C appears most often among the predicted results. As a result, it can be concluded that C is the most likely author of the disputed text while B is the most unlikely author. Represented as rank numbers, author C has a rank of 1 while B has a rank of 4. A plot of the ranking system is shown in Figure 6. The way the rank system is used and interpreted is explained further in a later chapter.
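
A minimal sketch of this vote-counting and ranking step, assuming the six pairwise predictions are already available from the SVM, is given below; the class and method names (PairwiseRanking, rank) are illustrative.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch: counts how often each author wins a pairwise SVM test
    // and converts the counts into ranks (1 = most likely author).
    public class PairwiseRanking {
        public static Map<String, Integer> rank(List<String> pairwiseWinners, List<String> authors) {
            // Count wins per author.
            Map<String, Integer> wins = new HashMap<>();
            for (String author : authors) wins.put(author, 0);
            for (String winner : pairwiseWinners) wins.put(winner, wins.get(winner) + 1);

            // Rank authors by descending number of wins.
            List<String> ordered = new ArrayList<>(authors);
            ordered.sort((a, b) -> wins.get(b) - wins.get(a));
            Map<String, Integer> ranks = new HashMap<>();
            for (int i = 0; i < ordered.size(); i++) ranks.put(ordered.get(i), i + 1);
            return ranks;
        }

        public static void main(String[] args) {
            // The six predictions from Table 3: C wins three tests, D two, A one, B none.
            List<String> winners = List.of("A", "C", "D", "C", "D", "C");
            System.out.println(rank(winners, List.of("A", "B", "C", "D")));
            // Expected ranks: C=1, D=2, A=3, B=4, so C is the most likely author.
        }
    }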

Format Input and Output

As previously mentioned, a feature extraction algorithm produces feature vectors with the same dimensions for both training texts and disputed texts. These vectors should be organised in a way that the SVM classifier can understand and use for prediction. Therefore, it is crucial to set a standard format for the data file, which is produced by the extraction algorithm and used as input for the SVM.

Figure 7: Standard SVM Input Format

The structure of the SVM input file is shown in Figure 7. This file is generated by Java and contains two major parts (a sketch of a writer for this format is given after the list below):

  • Three header lines:
    • Number of texts (including both disputed texts and training texts)
    • Number of disputed texts
    • Vector dimensions:
      • For the function words algorithm, this is the number of function words used
      • For Word Recurrence Interval (WRI), it is the number of key words whose recurrence intervals are used
      • For the Trigram Markov model, it is the number of trigrams (the same set of trigrams for all texts) used to characterise each text
  • Data Matrix
    • In this matrix, every row holds the features of one text. It starts with a string giving the author's name or "unknown" and is followed by numbers (whose meaning differs between methods). The matrix is divided into two parts:
      • The upper matrix holds all the known-author texts
      • The lower matrix holds all the disputed texts
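
A minimal sketch of a writer for this format is given below; the actual Java implementation and the exact field separators within each row are assumptions based only on the description above.

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.List;

    // Illustrative sketch: writes feature vectors in the SVM input format described
    // above (three header lines, then one row per text starting with the author's
    // name or "unknown").
    public class SvmInputWriter {
        public static void write(String fileName, List<String> authors, List<double[]> trainingVectors,
                                 List<double[]> disputedVectors) throws IOException {
            try (PrintWriter out = new PrintWriter(fileName)) {
                int dimensions = trainingVectors.get(0).length;
                out.println(trainingVectors.size() + disputedVectors.size()); // total number of texts
                out.println(disputedVectors.size());                          // number of disputed texts
                out.println(dimensions);                                      // vector dimensions
                // Upper matrix: known-author texts.
                for (int i = 0; i < trainingVectors.size(); i++) {
                    out.println(authors.get(i) + " " + toRow(trainingVectors.get(i)));
                }
                // Lower matrix: disputed texts.
                for (double[] v : disputedVectors) {
                    out.println("unknown " + toRow(v));
                }
            }
        }

        private static String toRow(double[] vector) {
            StringBuilder sb = new StringBuilder();
            for (double value : vector) sb.append(value).append(' ');
            return sb.toString().trim();
        }
    }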

With this SVM input, classification procedure is listed as below:

  • According to the header lines, separate the known-author texts from the disputed texts. The upper matrix forms the set of entries used for SVM training and the lower part is used as testing data. The Matlab program converts all author strings into numerical labels; texts by the same author have identical labels.
  • Create an SVM object with a kernel function (linear, quadratic, Gaussian radial basis function or polynomial)
  • Feed each row of the training matrix into the SVM using the svmtrain() function to learn each author's characteristics
  • After all training entries are fed into the SVM object, it is ready for classification
  • Predict each test entry individually, compare the predicted author with the actual author and compute the accuracy

Function Words

Texts are made up of a combination of content and function words. Function words (also known as grammatical words) are words that have little lexical meaning and are used to express grammatical relationships with other words within a sentence. Function words can be prepositions, pronouns, determiners, conjunctions, auxiliary verbs or particles, as shown in Table 4. Each function word either gives grammatical information about other words in a sentence or clause and cannot be isolated from those words, or indicates the speaker's mental attitude towards what is being said [1].

Table 4: Examples of Function Words
Function Words Examples
Prepositions of, at, in, without, between
Pronouns he, they, anybody, it, one
Determiners the, a, that, my, more, much, either, neither
Conjunctions and, that, when, while, although, or
Modal verbs can, must, will, should, ought, need, used
Auxiliary verbs be (is, am, are), have, got, do
Particles no, not, nor, as

In contrast, content words are highly correlated with the document topics and might not be suitable for authorship attribution, as two authors writing on the same topic or about the same event may use many similar words and phrases [3]. Examples of content words are shown in Table 5.

Table 5: Example of Content Words
Content Words Examples
Nouns John, room, answer, Selby
Adjectives happy, new, large, grey
Full verbs search, grow, hold, have
Adverbs really, completely, very, also, enough
Numerals one, thousand, first
Interjections eh, ugh, phew, well
Yes/No answers yes, no (as answers)

In authorship attribution, the use of function words is appealing because they form part of the writing style of an author. The incidence of function words contributes to authorial style and is not affected by the content of the text [3]. The frequency of occurrence of a particular function word differs from author to author, and the choice of function words used in developing a text also differs. Hence each author has a unique set of function words that appear a characteristic number of times.
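
As a simple illustration of this feature, the sketch below counts the occurrences of a small list of function words in a pre-processed text and normalises them per thousand words; the word list and the normalisation are illustrative assumptions, not the project's exact feature definition.

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch: builds a frequency-of-occurrence feature vector for a
    // small set of function words, expressed as a rate per thousand words.
    public class FunctionWordFeatures {
        public static Map<String, Double> ratePerThousand(String preprocessedText, List<String> functionWords) {
            String[] words = preprocessedText.split("\\s+");
            Map<String, Double> counts = new LinkedHashMap<>();
            for (String fw : functionWords) counts.put(fw, 0.0);
            for (String w : words) {
                if (counts.containsKey(w)) counts.put(w, counts.get(w) + 1.0);
            }
            for (String fw : functionWords) {
                counts.put(fw, counts.get(fw) * 1000.0 / words.length);
            }
            return counts;
        }

        public static void main(String[] args) {
            String text = "upon the whole the power of the people is supreme while the law is upheld";
            System.out.println(ratePerThousand(text, List.of("upon", "while", "whilst", "on")));
        }
    }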

In 2005, a student from the University of Adelaide identified the thirty most frequently occurring words, counted the appearance of these words in texts by different authors in the New Testament, and applied them to attribute the author of the letter to the Hebrews.

Table 6: List of Words used by Putnins
and the of to they that
he in him unto them a
was with when i paul which
for all had were god said
his we this from but not

In addition, Mosteller and Wallace [2] identified about 70 function words and applied them in the analysis of the Federalist Papers, producing conclusive results that attributed the twelve disputed papers to James Madison.

Table 7 : List of Function Words used by Mosteller and Wallace
a do is or this all down it our to
also even its shall up an every may should upon
and for more so was any from must some were
are had my such what as has no than when
at have not that which be her now the who
been his of their will but if on then with
by in one there would can into only things your

Automated Function Words

The challenge with Function Word Analysis is selecting a set of function words that is distinctive enough to attribute a disputed text to its rightful author. The choice of the set of function words is the key to an accurate classification of a disputed text. It is necessary to obtain a set of features from a set of texts that can distinguish one author from another. In the Federalist Papers, Mosteller and Wallace noticed that the number of occurrences of certain words used by Hamilton differed from that of Madison, as shown in Table 8. The usage of the words "while" and "whilst" was particularly distinctive between these two authors.

Table 8: Unique Words
Rates per Thousand Words
On Upon While Whilst
Hamilton 3.28 3.35 0.28 0.00
Madison 7.83 0.14 0.02 0.48

However, one of the aims of this project was to provide a non-biased approach to authorship attribution. This means that the software, the algorithm and the choice of feature vectors should not be skewed towards any particular author. This led to the development of an automated function word extraction algorithm, which identifies the most frequently occurring words and filters the function words from the content words using the word recurrence interval (WRI). The distinctiveness of function words in a text can be expressed through their frequency of occurrence and the mean of their word recurrence interval.

As stated in section 5.1, function words serve to express grammatical relationships with other words within a sentence. Hence, their frequencies are much higher than those of content words. As shown in Table 9, in analysing the Gospel of Matthew, the most frequently occurring words are "and", "the" and "of", with the frequencies of occurrence shown.

Table 9: Function Words in the Gospel of Matthew
Function Words Frequency of Occurrence
and 1552
the 1405
of 673

Another characteristic of function words observed in a text is the word recurrence interval between successive occurrences of the word. Table 10 shows the WRI of the words "and", "the", "of" and "Jesus". The mean of these intervals was calculated, and it was observed that the mean value for function words such as "and", "the" and "of" was smaller than that for the content word (noun) "Jesus", whose mean of 138.70 is about four to nine times larger than the means obtained for the function words.

Table 10: Word Recurrence Intervals in the Gospel of Matthew
Word Mean Word Recurrence Intervals
and 14.28 3 3 2 3 3 3 3 3 3 3 5 5 3 5 3
the 15.86 2 4 3 65 3 8 59 61 14 12 9 10 27 34 31
of 34.23 2 4 3 21 25 5 17 6 121 1 46 24 37 12 39
Jesus 138.70 214 43 110 95 2 894 30 24 54 144 61 38 96 16 2573

With these two parameters, the most frequently occurring words and the word recurrence interval, a program was developed to identify the set of feature vectors used in authorship attribution. Each text is analysed and the top five most frequently occurring words are listed. In addition, only words that have an average word recurrence interval of fifty or less are selected. These strict conditions ensure that content words are not included in the set of feature vectors. This feature vector is labelled Set A.
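
A minimal sketch of this selection step is shown below. Only the two criteria described above (highest frequency, and a mean recurrence interval of fifty or less) are implemented; the handling of ties and the exact top-five cut-off are assumptions rather than the project's actual code.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch: from one pre-processed text, keep the five most frequent
    // words whose mean word recurrence interval is fifty or less. These words form
    // the automatically generated feature set (Set A) for that text.
    public class AutomatedFunctionWords {
        public static List<String> select(String preprocessedText) {
            String[] words = preprocessedText.split("\\s+");

            // Frequency of each word and the positions at which it occurs.
            Map<String, Integer> frequency = new HashMap<>();
            Map<String, List<Integer>> positions = new HashMap<>();
            for (int i = 0; i < words.length; i++) {
                frequency.merge(words[i], 1, Integer::sum);
                positions.computeIfAbsent(words[i], k -> new ArrayList<>()).add(i);
            }

            // Order words by descending frequency.
            List<String> byFrequency = new ArrayList<>(frequency.keySet());
            byFrequency.sort((a, b) -> frequency.get(b) - frequency.get(a));

            // Keep the top five words whose mean recurrence interval is <= 50.
            List<String> selected = new ArrayList<>();
            for (String word : byFrequency) {
                if (selected.size() == 5) break;
                if (meanRecurrenceInterval(positions.get(word)) <= 50.0) selected.add(word);
            }
            return selected;
        }

        // Mean gap (in words) between successive occurrences of the same word.
        private static double meanRecurrenceInterval(List<Integer> occurrencePositions) {
            if (occurrencePositions.size() < 2) return Double.MAX_VALUE;
            double sum = 0.0;
            for (int i = 1; i < occurrencePositions.size(); i++) {
                sum += occurrencePositions.get(i) - occurrencePositions.get(i - 1);
            }
            return sum / (occurrencePositions.size() - 1);
        }
    }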

In order to provide a benchmark for the automatically generated function words, the 70 function words used by Mosteller and Wallace in the Federalist Papers dispute are used to analyse the effectiveness of the algorithm. This set of function words, or feature vector, is labelled Set B. In addition, a list of 172 commonly used function words is used and labelled Set C. The feature vector Set A is analysed with respect to Sets B and C to compare the effectiveness and distinctiveness of this set of features in authorship attribution. Note that Sets B and C are not used in the analysis of the Koine Greek text, because the feature vectors in Sets B and C are in English and would require translation before they could be used. Using a translated set of feature vectors would not provide an accurate result, as the imposition of the translator or the effects of translation might cause inaccurate classification.

Results and Discussion

Results from English Text

In this section, a set of English texts obtained from the Project Gutenberg archives was used to test and evaluate the effectiveness of the algorithm. A set of twenty-six texts was obtained for each author, as discussed in section 1.6. The texts were then divided into two sets: one set was used as training data to develop the classification model, and the other set was labelled as disputed texts. The texts labelled as disputed were used to evaluate the accuracy of the algorithm. The number of training texts used to develop the classification model was gradually increased from five texts per author to ten, fifteen, twenty and twenty-two.

Figure 8 : Function Word Analysis on English Text

Figure 8 shows the accuracy of the classification model for the respective numbers of training texts. We can observe that with Set A, using five texts per author as training data to develop the classification model, an accuracy of 83.33% was obtained over the twenty-four disputed texts. This means that twenty of the disputed texts were classified correctly and the remaining four were incorrectly classified. Using Set B, the accuracy was 91.67%, which corresponds to twenty-two of the twenty-four disputed texts being classified correctly. Set C produced results similar to Set B.

As the number of training texts used to develop the classification model was gradually increased from five per author to ten, the accuracy of Set A dropped from 83.33% to 79.17%, with nineteen of the twenty-four disputed texts correctly classified. The opposite trend was noticed in Set C, where the accuracy increased from 91.67% to 100%, with all twenty-four disputed texts correctly classified. As the number of training texts was increased further, from ten to fifteen, Figure 8 shows that the accuracy of each individual set increases. However, with twenty texts, the accuracy remains constant, that is, it has reached saturation. With these observations, we proceed to the ranking system in the Support Vector Machine to analyse the texts.

As shown in Table 11, the texts that were classified wrongly are highlighted. The first column in Table 11 shows the classification result from the SVM, the second column shows the ranking system and the last column shows the rightful author of the text. For example, the first incorrectly classified text belongs to the author labelled BB but was classified as RD. This incorrect classification is due to the set of feature vectors chosen to attribute the text to the author. The feature vector Set A, which consists of 18 function words, is used quite similarly by both authors BB and RD: the proportions of these function words appearing in texts written by the two authors are quite similar. With Set B, which consists of 70 function words, the incorrect classification was corrected. The feature vector in Set B is about three times the size of Set A, providing a more distinctive set of features that can distinguish one author from another. In addition, with Set C, which consists of 172 function words, the issue of the distinctiveness of the feature set was addressed and all the texts labelled as disputed were correctly classified.

Table 11: Ranking System of SVM (English Text, Set A, 20 text)
Classification Results Ranking (AD,BB,CD,HJ,RD,ZG) True Result
AD 5 2 3 1 3 1 AD
AD 5 2 2 2 4 0 AD
AD 5 3 3 2 2 0 AD
AD 5 0 4 2 3 1 AD
BB 2 5 1 0 3 4 BB
RD 1 4 2 0 5 3 BB
BB 1 5 2 0 3 4 BB
BB 1 5 2 0 3 4 BB
CD 4 1 5 2 3 0 CD
CD 2 4 5 0 1 3 CD
CD 3 4 5 0 2 1 CD
CD 4 2 5 3 1 0 CD
HJ 1 4 3 4 2 1 HJ
HJ 2 2 0 5 3 3 HJ
HJ 3 3 3 3 1 2 HJ
HJ 1 2 2 5 4 1 HJ
RD 1 4 3 0 5 2 RD
RD 1 3 1 1 5 4 RD
RD 4 1 4 2 4 0 RD
RD 3 4 2 0 5 1 RD
BB 1 5 2 0 4 3 ZG
ZG 0 4 1 2 3 5 ZG
ZG 1 4 0 2 3 5 ZG
ZG 1 2 1 2 4 5 ZG

On further observation of the ranking system, adjacent ranks differ only by one. This signifies that although the SVM attributed the disputed text to the wrong author, the feature vector Set A was able to rule out the unlikely authors for that particular text. For example, in the first incorrect classification, the most likely author is RD, followed by BB and ZG, the top three ranks. From this observation, it is very unlikely that the other authors, AD, CD and HJ, were the rightful authors of this text, as these three authors hold the lowest ranks in the ranking system. Likewise, the second incorrect classification, which attributed a text that is rightfully ZG's to BB, listed BB, RD and ZG as likely authors while AD, CD and HJ had the lowest ranks, ruling out the possibility that AD, CD or HJ could have been the author. This principle will be applied in the analysis of the letter to the Hebrews to eliminate unlikely authors from the list of potential authors generated over the years.

We notice that with Set C, the peak of the classification model occurs with ten texts from each author; with Set B, the peak occurs with twenty texts; and with Set A, with twenty-two. This shows that the accuracy of the classification model is determined by two factors, namely the number of training texts and the set of function words used (that is, the number of function words chosen). For a relatively small set of feature vectors, a larger set of training data helps to improve the accuracy of the classification model. Likewise, for a relatively large set of feature vectors, a small set of training data is sufficient to give accurate classification results.

As observed in the results presented in this section, the advantage of using function words in authorship attribution is clear: function words are able to distinguish one author from another. In addition, with the implementation of the Support Vector Machine in authorship attribution, the ranking system was able to help identify the likely authors of a given text and eliminate other authors as the rightful author. It was also observed that the feature vector Set A was able to attain an accuracy of 83.33% with just five training texts per author. However, Set A was limited to only 18 function words due to the parameters imposed, as discussed in section 5.2. As shown in the results, this feature vector is more likely to be effective in eliminating unlikely authors of a disputed text than in attributing the disputed text to its rightful author, unless it is used with a large set of training data. Additional results for the English texts using training data of five, ten, fifteen, twenty and twenty-two texts are provided in the Appendix in section 11.6.

Results from the Federalist

In this section, an analysis of the Federalist Papers was conducted to assess the effectiveness of the algorithm. The Federalist Papers were obtained from the Project Gutenberg archives and a set of texts by each author was set aside for developing the classification model. Since the Federalist Papers are similar to the New Testament texts in that the maximum size of a balanced training set is limited by the author with the fewest texts, the results from the Federalist Papers are an essential guide to working with a small corpus.

Figure 9 : Function Word Analysis on the Federalist Paper

The approach taken in analysing the Federalist Papers is similar to that for the English texts. The number of training texts used to develop the classification model was gradually increased and the results analysed to obtain a set of optimal conditions to be used in the authorship attribution analysis of the New Testament.

In the first set of results, using two texts from each author to develop the classification model, an accuracy of 52.94% was obtained using the feature vector in Set A, 70.59% using Set B and 82.35% using Set C. As the number of training texts was increased from two per author to three, the accuracy for Sets A and C increased. However, a drop in accuracy for Set A was observed when the number of training texts was increased from three per author to four. To further analyse the results obtained, the ranking system is evaluated.

Table 12: Ranking System of SVM (The Federalist Paper, Set A, 4 Text)
Classification Results Ranking (H,J,M) True Result
H 2 0 1 H
H 2 0 1 H
H 2 0 1 H
H 2 0 1 H
H 2 0 1 H
M 1 0 2 M
M 1 0 2 M
M 1 0 2 M
M 1 0 2 M
H 2 0 1 M
M 1 0 2 M
H 2 0 1 M
H 2 0 1 M
H 2 0 1 M
M 1 0 2 M
H 2 0 1 M
M 1 0 2 M

As shown in Table 12, the texts that were incorrectly attributed to Hamilton (H) do share similarities with Madison (M) under the feature vector Set A: Madison was ranked as the second most likely author of those texts. Although the feature vector of Set A, which consists of only 13 function words, was not able to classify the texts correctly, it was able to rule out the unlikely author, John Jay.

Likewise, for the Federalist Papers it was observed that the saturation point of the classification model for Sets B and C occurs when five texts from each author are used to develop the classification model. When the number of training texts was increased further, to forty-six by Hamilton, fourteen by Madison and five by John Jay, the accuracy for Sets B and C dropped. The training data was highly skewed towards Hamilton, with about three times more texts by Hamilton than by Madison and nine times more than by John Jay. This caused the classification model to be skewed towards Hamilton as well, attributing texts in his favour.

From the observations obtained in this section, the optimal condition for an accurate classification model is one in which a balanced set of texts from each author is used. This principle will be applied in the analysis of the New Testament, as the majority of the texts in the New Testament are attributed to Paul. In section 5.3.1, Results from English Text, it was observed that the accuracy of the classification model is determined by two factors, namely the number of training texts and the feature vector. For a relatively small feature vector, a large set of training texts from each author helps to improve the accuracy of the classification model. Likewise, for a relatively large feature vector, a small set of training texts is sufficient to give accurate classification results. Applying this principle to the New Testament texts, the approach taken is to "chop" a text, for example the Gospel of Matthew, into a set of smaller texts, as shown in Figure 10. This approach was taken by Talis in 2005 and has proven to be an effective way of expanding the amount of training data.

Figure 10 : Creating Smaller Text File

Results from King James Version

In this section, we analyse the effects of translation using the New Testament text that has been translated into English as the King James Version. The aim of this test is to identify the effectiveness of the algorithm when dealing with texts that have been translated. The disputed texts used in this section are the Gospel of Luke and the Acts of the Apostles. It is widely agreed by Biblical scholars that both the Gospel of Luke and the Acts of the Apostles were written by the same person, Luke. In the first test case, the Acts of the Apostles is used as the training data and the Gospel of Luke is labelled as the disputed text. In the second test case, the Gospel of Luke is taken as the training data and the Acts of the Apostles is labelled as the disputed text.

As observed in the previous sections, the accuracy of the algorithm increases with a larger set of training data. However, the New Testament itself contains only twenty-seven books, with most of them written by Paul. This poses a challenge in developing an accurate classification model. To address the lack of training data, an approach similar to that of Talis was taken, in which a long text was broken into several smaller texts as shown in Figure 10. This approach provided a larger sample to help develop an accurate classification model. To ensure that a text was not chopped in the middle of a sentence or paragraph, the chapter numbers were used as markers to chop the text accordingly.

Figure 11 : Predicted Author for the Gospel of Luke

In analysing the predicted author of the Gospel of Luke, the ranking system in SVM is used. As shown in Figure 11, using the King James Version, the English translation of the Koine Greek New Testament, the most likely author predicted for the Gospel of Luke is Matthew, followed by Mark and Luke. It appears that the Gospel of Matthew and the Gospel of Mark share many similarities in their usage of the function words.

Although the same person, Luke, wrote both the Acts of the Apostles and the Gospel of Luke, the training data, which includes the Acts of the Apostles, was unable to attribute the Gospel of Luke to him. This could be due to the stability of the feature vectors chosen for attribution, or to the translator's style distorting the original words. However, the ranking system does rule out the possibility that James, Jude, Paul or Peter wrote the Gospel of Luke, as these four authors hold the lowest ranks in comparison to Matthew, Mark, Luke and John.

Figure 12 : Predicted author for the Acts of the Apostles

In analysing the predicted author of the Acts of the Apostles, the ranking system in SVM is used. As shown in Figure 12, using the King James Version, the most likely author predicted for the Acts of the Apostles is Luke, followed by Mark and Matthew. In this case, with the Gospel of Luke as the training text for the classification model and the Acts of the Apostles as the disputed text, the disputed text was attributed to its rightful author, Luke. This could be because the feature vector of the Gospel of Luke is more stable than that of the Acts of the Apostles.

From Figure 12, it was observed that the Gospel of Mark and the Acts of the Apostles are quite similar, as Mark is ranked the second most likely author for the Acts of the Apostles. It was also observed that Set A and Set B make very similar predictions, producing results that are identical or differ by one rank. This observation is important because, when analysing the New Testament in Koine Greek, only Set A can be utilized: the feature vectors in Set B and Set C are limited to English text. Set A is a feature vector that is generated automatically, using the most frequently occurring words and their word recurrence intervals as parameters.

Results from Koine Greek

In this section, we analyse the effectiveness of the algorithm using the Koine Greek text of the New Testament and the feature vector Set A. The results here are compared with those obtained in section 5.3.3 Results from King James Version. By using the author's original text, the Koine Greek New Testament, the imposition of the translator's style is removed. For this section, only Set A, the automatically generated set of function words, is used.

Figure 13 : Predicted author for the Gospel of Luke

As shown in Figure 13, the likely author of the Gospel of Luke is predicted to be Luke or Matthew, with Mark ranked third. This result differs from that of Figure 11, where the English translation of the Koine Greek text, the King James Version, was used, and illustrates the effect of translation on the accuracy of authorship attribution. Baayen observed that the imposition of an editor's or publisher's style can distort the original words of the author; here, the same applies to the translator.

In addition, it was observed that the Gospel of Matthew and the Gospel of Luke share similarities in terms of the feature vector used to develop the classification model. However, the classification results ruled out the possibility that the author of the Gospel of Luke is James or Paul, who had the lowest ranks. Based on this observation, the training texts from James and Paul were removed from the classification model, as they contributed noise and distorted the classification results. The algorithm was run again, this time without the texts from James and Paul, and the result is shown in Figure 14. This result is more accurate, and the likely author of the Gospel of Luke is now Luke.

Figure 14 : Predicted author for the Gospel of Luke (Filtered data)

In the next set of results, the Acts of the Apostles was labelled as the disputed text, with the Gospel of Luke taking its place as the training data for the author Luke. The results from this classification are shown in Figure 15. As observed, the classification model attributed the Acts of the Apostles to Luke, with Matthew as the second most likely author. These results differ from those of the translated New Testament, the King James Version, which showed Mark as one of the likely authors of the Acts of the Apostles. Using the Koine Greek text, Mark was ruled out as the second most likely author of the Acts of the Apostles.

Figure 15 : Predicted author for the Acts of the Apostles

Likewise, the ranking system rules out the possibility that Paul, Peter or Jude wrote the Acts of the Apostles. The same observation was made that the likely authors of the Acts of the Apostles are Luke or Matthew.

Hence, in analysing the letter to the Hebrews, the Koine Greek text of the New Testament is required to provide an accurate and conclusive result, as it reduces the imposition of the translator's style.

Results using Koine Greek for the Letter to Hebrews

In this section, the letter to the Hebrews was labelled as the disputed text. Various texts from the New Testament were included in the corpus, along with the Epistle of Barnabas and the First Epistle of Clement of Rome. The corpus used was in Beta Code, a representation of the actual Koine Greek.

Figure 16 : Predicted author of the Letter to Hebrews

As shown in Figure 16, the predicted author of the letter to the Hebrews is Peter or Matthew, followed by Paul, and thereafter by Mark or Luke. The result in Figure 16 also eliminates the possibility that Barnabas, Clement of Rome or John wrote the letter to the Hebrews, removing Barnabas and Clement from the list of potential authors that has been built up over the years. Barnabas had been named as a possible author because of his knowledge of the subject matter of the letter; as shown in Figure 16, however, this is highly unlikely.

As the amount of training data was limited by the minimum number of texts available for any one author, a strong conclusion could not be formed. However, as shown in the previous results, despite the small training set, the feature vector Set A was still able to make a distinction between similar texts and to eliminate unlikely authors.

Conclusion

In summary, the optimal condition for authorship attribution is when a large amount of training data is available to characterise each author accurately. In addition, it is essential that the training data is not highly skewed towards a particular author; a balanced set of texts from each author is required to prevent the classification model from being skewed, resulting in higher accuracy. It was also noted that the effects of translation, or the imposition of an editorial style, can affect the accuracy of the results, as observed in the analysis of the King James Version and the Koine Greek New Testament.

In the analysis of the New Testament texts with the function word analysis algorithm, it is concluded that Barnabas, Clement of Rome and John are unlikely to be the author of the letter to the Hebrews, as shown in Figure 16. However, the likelihood that Peter might be the author of the letter to the Hebrews rises, together with Matthew and Paul.

As shown in the sections above, the use of function words in authorship attribution is appealing because the occurrence of these words is driven by the author's style and is independent of the content of the writing. Many of the claims regarding the authorship of the letter to the Hebrews are based on the theme of the letter or the content the author is writing about. For example, Paul was listed as a potential author of the letter to the Hebrews due to his previous writings; Barnabas was listed due to his knowledge and close association with the Apostle Paul, and Luke for his writing style (Jason G, 1998). However, as shown in the results in Figure 16, Barnabas has been eliminated as a potential author. Function word analysis provides a non-biased approach, independent of the content of the text, for attributing a disputed text to its rightful author when the optimal conditions are met. Even with a small set of training data or a small feature vector, function word analysis is still able to eliminate unlikely authors.

Function word analysis is also applicable to a wide range of genres, as shown with the English texts, the Federalist papers and the New Testament texts. Being independent of the content of the text makes it suitable for authorship attribution, where two authors writing on a similar topic may share many similar words or phrases \[3\]. In addition, function word analysis is applicable to languages other than English, as shown in its application to the Koine Greek text. It was also observed that function word analysis works with relatively short texts, of approximately a thousand words, as shown with the Federalist papers. Function word analysis is computationally inexpensive as well, allowing it to be applied to a large amount of training data at little computational cost.

A future improvement to function word analysis would be to use the probability of occurrence of a function word in the feature vector, instead of the proportion of the text made up by that function word. Future applications might include applying function word analysis to authorship attribution of disputed Chinese texts. Another possible implementation of this algorithm is a search engine that produces a list of books by a particular author given a sample text. As electronic books become ever more popular, it would be relatively simple to find books by an author with only a single chapter of a book as the input.

Word Recurrence Interval

A more statistical approach to authorship detection was desired; hence the data extraction algorithm Word Recurrence Interval (WRI) was chosen. In addition, the method is able to analyse texts regardless of their language.

For this algorithm, the WRI is defined as the number of words between successive occurrences of a keyword. A set of keywords, {x1, ..., xk}, is selected from the text based on the number of times each keyword appears, ordered from smallest to largest. A set of scaled standard deviations is then obtained for the chosen keywords using the equation below.

Equation 10 : Standard Deviation Equation

In the context of authorship attribution, this algorithm was chosen because it removes the dependence on word frequency that characterises word distributions, giving a more statistical approach to the analysis of the text. The resulting set of standard deviations {s1, ..., sk} can then be used to compare words directly between different texts, or within the same text.
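The following sketch illustrates the WRI feature for a single keyword. It is a minimal illustration only; the scaling used here (standard deviation divided by the mean interval) is an assumption standing in for Equation 10, whose exact form may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the WRI feature for one keyword: intervals are the
// number of words between successive occurrences of the keyword. The
// scaling (standard deviation divided by the mean interval) is an
// assumption standing in for Equation 10.
public class WriFeature {

    static double scaledStdDev(String[] words, String keyword) {
        List<Integer> intervals = new ArrayList<>();
        int lastIndex = -1;
        for (int i = 0; i < words.length; i++) {
            if (words[i].equalsIgnoreCase(keyword)) {
                if (lastIndex >= 0) {
                    intervals.add(i - lastIndex - 1);  // words in between
                }
                lastIndex = i;
            }
        }
        if (intervals.size() < 2) return 0.0;
        double mean = intervals.stream().mapToInt(Integer::intValue).average().orElse(0);
        double var = 0.0;
        for (int d : intervals) var += (d - mean) * (d - mean);
        double stdDev = Math.sqrt(var / intervals.size());
        return mean > 0 ? stdDev / mean : 0.0;   // scaled standard deviation
    }

    public static void main(String[] args) {
        String text = "the cat sat on the mat and the dog sat by the door";
        System.out.println(scaledStdDev(text.split("\\s+"), "the"));
    }
}
```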

Research on a similar project conducted by Putnins et al. [21] concluded that using WRI for data extraction and plotting scaled standard deviation of WRI against log10 rank does not give satisfactory results; another type of data classification should be incorporated instead. For this reason, the Support Vector Machine (SVM) was utilized for this project, having shown relatively good results in previous years' research.

Choice of Keywords

The keywords for the WRI method are selected based on the number of occurrences of each word in the text, which raises the question of how many occurrences should be required. For this reason, a limit named the "threshold" is introduced, which defines the minimum number of occurrences a word must have to be considered a keyword. Viewed more broadly, this threshold value is a parameter indicating how restrictive the text categorization process is: a very large threshold value will attribute a text to its corresponding author very distinctively and specifically, but the same author's other texts then stand a good chance of not being attributed to him.
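A minimal sketch of this threshold-based keyword selection is shown below. The word counting and the ordering from smallest to largest follow the description above; the simple whitespace tokenisation is an assumption.

```java
import java.util.*;

// Minimal sketch of threshold-based keyword selection: a word is kept as a
// keyword only if it occurs at least 'threshold' times in the text. The
// threshold value itself is a tuning parameter, not fixed by the report.
public class KeywordSelector {

    static List<String> selectKeywords(String[] words, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w.toLowerCase(), 1, Integer::sum);
        }
        List<String> keywords = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= threshold) keywords.add(e.getKey());
        }
        // Order keywords by occurrence count, smallest first, as described above.
        keywords.sort(Comparator.comparingInt(counts::get));
        return keywords;
    }

    public static void main(String[] args) {
        String text = "to be or not to be that is the question to be";
        System.out.println(selectKeywords(text.split("\\s+"), 2));  // [to, be]
    }
}
```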

Choice of Data Dimension

The number of data dimensions, which in this scenario is the number of standard deviations per text, is required for the SVM classification and therefore plays an important role in automated authorship attribution. The number of data dimensions needs to be chosen carefully, as it determines how much data from each text is input into the SVM. A small number of data dimensions provides inadequate data for training, giving a low prediction accuracy for the disputed texts. However, a very large number of data dimensions also gives a low accuracy, as the team found that the additional dimensions are treated as noise by the SVM.

Implementation of Method

The Word Recurrence Interval algorithm was implemented in the programming language Java. During the course of the project, several versions were developed and implemented, as shown in Table 13.

Table 13 : Version Description of Word Recurrence Interval
Version Number Description
V 1.0 Calculates the standard deviations for keywords chosen based on their number of occurrences in the text
V 2.0 Keywords are chosen based on 70 function words and their corresponding word recurrence intervals and standard deviations in the text
V 3.0 Further development of WRI V1.0 with the added ability to choose appropriate texts for training data

The effectiveness of the WRI algorithm was evaluated, and a benchmark established, using the large corpus of English fictional texts as the test data. The team chose this corpus because it is a large data set and because the texts can be verified as being written by their corresponding authors. The first test of the WRI algorithm used the first five texts from each author as training data and two texts from each author as test data. Initially five data dimensions were used, increasing in steps of five. The results obtained are shown in Appendix G: Additional Word Recurrence Interval Results.

From the results, it can be seen that the algorithm has a low accuracy rate of approximately 60%. Upon deeper investigation, it was discovered that the keywords chosen were mostly function words, suggesting that combining WRI with function words might give a better result. This motivated the team to further develop and enhance the algorithm, leading to Word Recurrence Interval Version 2.0.

To analyse the text differently, in an attempt to improve the accuracy, the focus was placed on the keywords chosen from the text. This version retains the same calculation of the word recurrence interval and its corresponding standard deviation, the difference being that the keywords are chosen from a list of seventy function words. The seventy function words used are shown in Table 14; Mosteller and Wallace [23] identified these function words as good candidates for authorship attribution studies.

Table 14 : 70 Function Words that are Chosen for the Algorithm
A Do Is Or This All Down It Our
To Also Even Its Shall Up An Every May
Should Upon And For More So Was Any From
Must Some Were Are Had My Such What As
Has No Than When At Have Not That Which
Be Her Now Who Been His Of Their Will
But If On Then With By In One There
Would Can Into Only Things Your

Since the keywords were chosen from the 70 function words shown above, the algorithm has only the number of data dimensions as its parameter, simplifying it significantly.

The algorithm was applied to the English texts, the same data set used before. The results are shown in Appendix G: Additional Word Recurrence Interval Results, under section 11.7.2: Additional Results for Word Recurrence Interval Version 2.0.

From the results, the algorithm has an even lower overall accuracy. Furthermore, it loses the consistency that Version 1.0 has, and the linear property of the results no longer remains. This indicates that the keywords should not be chosen based on function words alone, which is not coherent with the findings of Argamon and Levitan \[24\], who concluded that content words are not a good indicator for authorship attribution. A possible explanation is that the low accuracy is partly due to the differing text lengths of the training data.

With this in mind, the team concluded that WRI needs the ability to choose training data based on text length, so that each training text has approximately the same length; this was implemented in Word Recurrence Interval Version 3.0.

In this version, the same word recurrence interval extraction was used, with a modification so that the algorithm chooses training data of appropriate text length. The idea is to accept only texts whose length lies within a number of standard deviations of the corpus mean. This gives the algorithm the ability to automatically choose suitable training data based on the average text length of the corpus.

The method is implemented by first calculating the average text length and the standard deviation of text length for the corpus. The acceptable range is then calculated and compared against each text in the corpus via Equation 11. The algorithm then takes as training data only those texts whose length lies within the range.

Equation 11 : Range to Choose the Training Data
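The following sketch illustrates the filtering behind Equation 11. The band of plus or minus one standard deviation around the corpus mean is an assumption; the exact number of standard deviations used in the project may differ.

```java
import java.util.*;

// Minimal sketch of the training-data filter behind Equation 11: a text is
// accepted only if its length falls within a band around the corpus mean.
// The plus/minus one standard deviation band is an assumption.
public class LengthFilter {

    static List<Integer> acceptedIndices(int[] lengths, double numStdDevs) {
        double mean = Arrays.stream(lengths).average().orElse(0);
        double var = Arrays.stream(lengths)
                           .mapToDouble(l -> (l - mean) * (l - mean))
                           .average().orElse(0);
        double std = Math.sqrt(var);
        double low = mean - numStdDevs * std;
        double high = mean + numStdDevs * std;
        List<Integer> accepted = new ArrayList<>();
        for (int i = 0; i < lengths.length; i++) {
            if (lengths[i] >= low && lengths[i] <= high) accepted.add(i);
        }
        return accepted;
    }

    public static void main(String[] args) {
        int[] lengths = {8500, 9000, 2500, 8800, 45000, 9100};
        System.out.println(acceptedIndices(lengths, 1.0));  // excludes the 45000-word outlier
    }
}
```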

The algorithm was applied to the English texts, the same data set used before. The results are recorded in Appendix G: Additional Word Recurrence Interval Results. They show that the ability to select appropriate training data increases the overall accuracy by 10%. Furthermore, the algorithm retains the linear property where increasing the amount of training data increases the accuracy, while also keeping the consistency of the results. However, removing training data in this way is not ideal, since the discarded texts might carry important information, which motivated the team to continue using Word Recurrence Interval Version 1.0.

The tests and results discussed in the next sections are based on Word Recurrence Interval Version 1.0. Four different corpora were used to verify the efficiency and performance of the algorithm. The procedure for the tests is to first find the threshold and data dimension values that give the highest accuracy, and then to use these values on the test data.

Results and Discussion

Results from the English Text

The full corpus used in this test contains a total of 156 texts, 26 by each of the six authors. In the results below, the authors are labelled AD, BB, CD, HJ, RD and ZG, corresponding to Sir Arthur Conan Doyle, B. M. Bower, Charles Dickens, Henry James, Richard Harding Davis and Zane Grey respectively. The team decided to use 22 texts per author as training data, since increasing the amount of training data increases the accuracy, as shown in section 11.7.3. Looking at the results, the linear kernel function in the SVM clearly gives the best consistency, and its prediction accuracy is proportional to the number of data dimensions. The other kernel functions give very erratic predictions, making the results difficult to interpret. The results obtained are shown below.

Figure 17 : Word Recurrence Interval on English Text

It can be seen from the graph that the highest accuracy is achieved with a threshold value of ten and a data dimension of seventy. Hence these values were input into the SVM, and the predictions and results are shown in Table 15.

Table 15 : Ranking System of SVM (English Text)
Accuracy : 58.33% Ranking
Test Data Predicted Author AD BB CD HJ RD ZG
AD_AB AD 4 0 3 4 3 1
AD_AC AD 5 0 3 4 1 2
AD_AD AD 5 0 4 2 2 2
AD_AG RD 1 1 2 3 4 4
HJ_AA HJ 3 0 2 5 4 1
HJ_AD HJ 1 4 3 5 2 0
HJ_AM HJ 2 0 2 5 3 3
HJ_AP HJ 3 1 2 5 4 0
RD_BB RD 1 3 2 0 5 4
RD_BT RD 3 3 1 3 4 1
RD_CC RD 1 4 0 2 5 3
RD_CL BB 4 5 3 0 2 1
ZG_BE RD 4 2 2 0 5 2
ZG_BL HJ 1 1 4 5 1 3
ZG_BZ ZG 3 1 4 0 2 5
ZG_CC RD 2 2 2 0 5 4

Based on the table above, it can be seen that the WRI algorithm may have captured and correctly categorized the works of Sir Arthur Conan Doyle (AD). However, for the text AD_AG it predicts AD to be the least likely author, which is incorrect and contradictory. The same pattern occurs for the works of Richard Harding Davis (RD), which were predicted correctly with the exception of the text RD_CL, where RD is placed as only the fourth most likely author. For the works of Zane Grey (ZG), most of the texts were predicted wrongly. Initially it was thought that the large differences in text length were the cause of the incorrect predictions, as the misclassified texts differ substantially in length from other texts by the same author. However, upon further investigation, the works of Henry James (HJ) were predicted correctly even though the lengths of his texts also differ greatly.

These results suggest that WRI may have captured some of the characteristic differences between authors, but inconsistently. These inconsistent results show that more testing is required before a conclusion can be drawn.

Results from the Federalist

The full corpus used in this test contains sixty-five texts: forty-six written by Alexander Hamilton (H), five by John Jay (J), and fourteen by James Madison (M). Testing was done first by using one text from each author as training data, increasing by one text per author as the tests proceeded. The results are shown in section 11.7.4. From the results, it is advisable that the training data be as balanced as possible across the authors so that the prediction is not biased towards one author. Further analysis was therefore done on the Federalist papers using balanced data with a threshold value of five and a data dimension of forty-five, since these gave the best accuracy. The prediction results are shown below.

Table 16 : Ranking System of SVM (Federalist Paper)
Accuracy : 47.06% Ranking
Test Data Predicted Author H J M
H M 0 1 2
H H 1 1 1
H J 0 2 1
H H 2 1 0
H H 2 0 1
M J 0 2 1
M J 0 2 1
M M 1 0 2
M M 1 0 2
M J 0 2 1
M H 2 0 1
M J 0 2 1
M J 1 2 0
M M 1 0 2
M H 2 1 0
M H 2 1 0
M H 2 0 1

Looking at the results above, the algorithm may have captured some of the characteristics of H, giving it the ability to correctly identify texts written by H. However, most of the works by M could not be classified correctly: they were attributed to J, with M as the second most likely author. This shows that the margin of error is small.

Comparing with the previous results, the accuracy of 47% for the Federalist papers is significantly lower than the approximately 58% accuracy obtained on the English texts. The result is poorer because the average text length of the Federalist papers is approximately 2500 words, whereas the average text length of the English texts is approximately 8500 words. This shows that longer texts are much more favourable for the WRI algorithm.

In conclusion, these results add further weight to the observation that WRI may have captured the characteristics of different authors, but inconsistently.

Results from King James Version

A similar test was conducted using the King James Version corpus. It is known that Luke and Acts were both written by Luke; hence they are used as the test data. Twenty-six training texts corresponding to the different authors were input as training data, and the Acts of the Apostles, held to be written by Luke, was used as the disputed text. The result of the test is shown in Figure 18.

Figure 18 : Word Recurrence Interval on Acts of the Apostles

From the graph it can be seen that WRI performs very well, correctly attributing the Acts of the Apostles to Luke regardless of the threshold value and the number of data dimensions. The prediction and ranking for the test data are shown in Figure 19.

Figure 19 : Predicted Author for the Acts of the Apostles

The figure above shows the most likely authors in descending order to be Luke, Mark, Matthew, Paul, John, James, Peter and lastly Jude. WRI performs very well in categorising this text written by Luke. As this is the only conclusion that can be drawn from this result, another test was conducted to verify whether WRI could capture the characteristics of the author Luke.

Another test was done using the Gospel of Luke as the test data, which is also held to be written by Luke. The result is shown in Figure 20.

Figure 20 : Word Recurrence Interval on Gospel of Luke

From the graph it can be seen that WRI performs very well, and that the threshold value and number of data dimensions do not affect the result, showing the desired consistency. The prediction and ranking for the test data are shown in Figure 21.

Figure 21 : Predicted author for the Gospel of Luke

The figure above shows the most likely authors in descending order to be Luke, John, Matthew, Mark, Paul, James and Peter. For this test WRI also performs very well, which adds weight to the suggestion that WRI can capture the characteristics of the author Luke.

The results from these tests show that WRI does have the ability to attribute authorship with fairly good consistency. However, they are not coherent with the results of the previous sections: the results from the English texts and the Federalist papers suggested that WRI may capture authors' characteristics only inconsistently, whereas these tests suggest that WRI is a suitable and consistent technique for authorship attribution.

It is worth highlighting that a significant factor that may have led to this result is that the King James Version is a direct English translation of the original text, suggesting that the translation itself may have affected the result.

Results from Koine Greek

Since the results from the King James Version may have been affected by its translation from the original text, Koine Greek texts were used in the next test.

Eighteen texts from different authors were used as the training data, and the Acts of the Apostles, written by Luke, was used as the test data. The result is shown in Figure 22.

Figure 22 : Word Recurrence Interval on Acts of the Apostles

The graph shows that the text is classified correctly regardless of the threshold value and the number of data dimensions. A more detailed result is shown in Figure 23.

Figure 23 : Predicted author for Acts of the Apostles

The figure above shows the most likely authors in descending order to be Luke, Paul, Matthew, John, Mark, Peter, James and lastly Jude. WRI again performs very well in categorising this text written by Luke. The result is almost the same as before, except that Paul is now the second most likely author, whereas for the King James Version Mark was the second most likely author.

Another test was done using the Gospel of Luke, also held to be written by Luke, as the test data. The result is shown in Figure 24.

Figure 24 : Word Recurrence Interval on Gospel of Luke

The WRI classifies the text correctly and consistently as anticipated. The ranking table is shown in Figure 25.

Figure 25 : Predicted author for the Gospel of Luke

The figure above shows the most likely authors in descending order to be Luke, Matthew, Paul, John, Mark, Peter, James and lastly Jude.

Based on all of these results, the three most likely authors for Luke and Acts are Luke, Matthew and Paul. Although the WRI algorithm predicted the author to be Luke, there is a slight chance that it may classify the texts as Matthew's or Paul's. Upon further investigation of the training data, it was found that the standard deviation values for Luke, Matthew and Paul are very similar to one another. This shows that WRI may not have completely distinguished the characteristics of each author.

Results using Koine Greek for the Letter to Hebrews

The test for "the Letter to Hebrews" was done base on the Koine Greek Corpus. 18 texts from different authors were used as the training to classifies the Letter to Hebrews as the disputed text. The ranking result is obtained and shown in Figure 26.

Figure 26 : Predicted author for the Letter to Hebrews

Looking at the result, there is a high probability that the letter to the Hebrews was written by Luke or Matthew, and subsequently Paul. This result supports the claim that Luke could be a possible author of the letter to the Hebrews. An interesting observation is that, because both Luke and Paul have high ranking values, the result also lends support to the claim that Paul might be the original author of the letter and that Luke might have translated Paul's original letter to the Hebrews. The WRI algorithm also indicates a very low probability that Jude, James or Peter wrote the letter to the Hebrews.

Conclusion

Based on the tests and results, it was found that the classification accuracy increases as the number of data dimensions increases, up to a certain point where it decreases due to model overfitting. In addition, the classification accuracy increases when the number of training texts is balanced between authors. It is also suggested to maximise the text length in each corpus, as this further increases the classification accuracy. An important finding is that the accuracy of the Word Recurrence Interval algorithm does not appear to be affected by the translation of the text, as seen in the analysis of the King James Version and the Koine Greek New Testament.

WRI gives fairly decent classification accuracy, and the algorithm may have captured the characteristics of the authors. However, the results generated were too inconsistent to be used as meaningful evidence for authorship attribution; further development of the algorithm is required for it to be more useful.

Word Recurrence Interval can be applied to different languages, and further testing of its performance on other languages, such as Mandarin, is recommended. Word Recurrence Interval also offers versatility in choosing the keywords for feature extraction, which could be further exploited in an internet search engine.

Trigram Markov Model

Markov chains are widely used in a variety of areas of mathematics and engineering. Previous studies show that they are a useful tool in stochastic text generation. In particular, Markov n-gram models are very powerful in statistical natural language processing, and have been shown through abundant experiments to be extremely effective in creating language models, which are a core component of modern statistical language applications.

Technically, a Markov chain is defined by a set of states and transitions. It has the memoryless property, meaning that the occurrence of future states does not depend on past states, but only on the current one. A trigram Markov chain is a particular example of this class, in which the occurrence of the coming state depends only on its previous two states.

In the context of authorship attribution, a state can be a letter, a word, or a character in some other languages. In this project, the smallest element in a text is considered to be a word. As a result, the trigram Markov chain uses a pair of words as a state and assumes that the probability of the next word (the transition) depends only on the two words before it. Mathematically, it is represented as:

Equation 12 : Probability Equation for the Next Transition

The probability of a transition following a given state can be calculated by taking the ratio of the number of occurrences of the whole trigram to the number of occurrences of the state throughout a text. Using the example of "Xn, Xn-1, Xn-2" above, this probability can be expressed by the following formula.

Equation 13 : Probability Equation with the Ratio Occurrence
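A minimal sketch of this ratio calculation is shown below; it simply counts the two-word state and the full trigram in a text and divides the counts, as described by Equation 13.

```java
import java.util.*;

// Minimal sketch of the trigram transition probability from Equation 13:
// count of the full trigram divided by the count of its two-word state.
public class TrigramProbability {

    static double probability(String[] words, String w1, String w2, String w3) {
        int stateCount = 0, trigramCount = 0;
        for (int i = 0; i + 1 < words.length; i++) {
            if (words[i].equals(w1) && words[i + 1].equals(w2)) {
                stateCount++;
                if (i + 2 < words.length && words[i + 2].equals(w3)) trigramCount++;
            }
        }
        return stateCount == 0 ? 0.0 : (double) trigramCount / stateCount;
    }

    public static void main(String[] args) {
        String[] words = "it was the best of times it was the worst of times".split(" ");
        System.out.println(probability(words, "it", "was", "the"));   // 1.0
        System.out.println(probability(words, "was", "the", "best")); // 0.5
    }
}
```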

With all probabilities calculated for each text document, a vector containing the state and transition information can be formed. It is believed that texts written by the same author have similar vectors; these characteristic vectors can therefore be used for classification.

Suppose there is a language with a vocabulary of only two words, 'x' and 'y'. Then there are four potential states, 'xx', 'xy', 'yx' and 'yy', and each state has two possible transitions, 'x' and 'y'. In a real language, however, the vocabulary can be very large: with a vocabulary of n words, there are n² possible states, and each state is followed by n possible transitions. This leads to an extremely high dimensional probability vector for every text. Hence, a main challenge of this method is selecting useful states and transitions, and calculating their corresponding probabilities, so as to characterise texts.

Implementation of Method

The trigram Markov extraction algorithm was implemented in Java. During the course of this project, several versions were implemented. They are listed in the following table:

Table 17 : Version Description of the Trigram Markov Model
Version Algorithm Description Implementation Environment
V0.1 Calculates probability information for all trigrams appearing in the input text Standard Java class
V0.2 Applies the Hidden Markov Model to v0.1, taking into account the effect of bigrams and unigrams Standard Java class, makefile
V1.0 & V1.1 Based on v0.2, selects significant trigrams among all input trigrams and calculates their probabilities to form feature vectors Eclipse
V1.2 Considers the position of a word in the sentence, especially the beginning and end of a sentence Eclipse

The first prototype of this algorithm assumes that only the previous two words affect the probability of the next word. However, when this model was applied, it showed that there were almost no common trigrams among the texts. Past research reports the same problem with sparse data. In other words, suppose the relevant statistics were collected for our trigram model and the model was then applied to a new text in which a trigram occurs that never appeared in the training corpus. The third word following the first two would then have a probability of zero, resulting in very poor cross-entropy.

In this case, a second prototype was developed as a solution to this problem to smooth the probabilities by using both bigram and unigram probabilities. It is defined using the formula:

Equation 14 : Probability Equation with both Bigram and Unigram

where pe denotes the unigram, bigram and trigram probabilities, each calculated in the same way as defined in the equation above. The three non-negative coefficients λ1, λ2 and λ3 sum to one. If we assume that most of the time we do have trigrams, and that they yield a more accurate assessment of the probabilities than bigrams or unigrams, then the trigram coefficient should be much higher than the other two so that it dominates the probability calculation. The coefficient values can be determined by a scheme called hidden Markov models (HMM). In this project, values of 0.1, 0.3 and 0.6 were assigned to them.
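The sketch below illustrates this interpolation. The weights 0.1, 0.3 and 0.6 follow the values quoted above; assigning the largest weight to the trigram term is an assumption consistent with the surrounding discussion, not a detail confirmed by the report.

```java
// Minimal sketch of the smoothed probability behind Equation 14: a weighted
// sum of unigram, bigram and trigram estimates. The weights 0.1, 0.3 and 0.6
// follow the values quoted in the report; giving the largest weight to the
// trigram term is an assumption consistent with the surrounding discussion.
public class InterpolatedTrigram {

    static double smoothedProbability(double pUnigram, double pBigram, double pTrigram) {
        final double LAMBDA_UNI = 0.1, LAMBDA_BI = 0.3, LAMBDA_TRI = 0.6;  // sum to one
        return LAMBDA_UNI * pUnigram + LAMBDA_BI * pBigram + LAMBDA_TRI * pTrigram;
    }

    public static void main(String[] args) {
        // Even if the trigram was never seen (probability zero), the smoothed
        // estimate stays non-zero thanks to the unigram and bigram terms.
        System.out.println(smoothedProbability(0.02, 0.10, 0.0));  // 0.032
    }
}
```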

When the program was run with an input of thirty text corpuses, the number of common trigrams among texts was still not satisfactory. The statistics available to train the support vector machine were therefore insufficient, resulting in inaccurate predictions for the disputed texts. In response, the choice of trigrams used to form the feature vector was no longer limited to trigrams common to all texts: all trigrams appearing in the different texts were included, and if a text did not contain a specific trigram it was assigned a probability of zero. However, there were two major problems with this method. First, it was computationally expensive: as shown in the table below, the running time to extract all trigram information from both the chopped English texts and the King James New Testament was quite long. Second, without feature selection there was a high percentage of low-occurrence states in every text. The pie chart in Figure 27 shows the state distribution in one of the Federalist papers: the total number of states in this text is 1406, and about 91% of them occur only once. A trigram generated from such a state therefore has an occurrence probability of one. Since such trigrams rarely appear in the text, they were not only unhelpful for authorship attribution, but also introduced a lot of noise into the classification process.

Table 18: Average Length of Text for Different Data Set
Data Set Approximate Average Length of Text Running Time
Federalist Paper (25) 1000 2 mins
English Text (84) 5000 30+mins
King James New Testament 5000 20+mins
Figure 27: Occurrence Count of States in Federalist Paper

Therefore, a selection algorithm called threshold selection was introduced in Version 1.1. The new parameter "threshold" represents the minimum number of occurrences a state must have to be kept for forming the feature vectors. Trigrams generated from states that occur fewer times than the threshold are not considered.

The final version of the trigram Markov model was inherited from Version 1.1. It inserts a "\#" at the beginning of each sentence and a "$" at the end. For example, a text with the two sentences "Today is a good day. I want to go to picnic." becomes "\# Today is a good day $ \# I want to go to picnic $" after editing. The motivation for this modification is that each sentence exists relatively independently of the others: it is not necessary to calculate the probability of "I" appearing after the bigram "good day". Instead, it is more meaningful, for characterising an author's writing habits, to know the probability of "I" appearing at the start of a sentence, i.e. after the bigram "$ \#". Likewise, the probability of a word appearing at the end of a sentence is important to know. The delimiter "." was used to determine the beginning and end of each sentence.
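A minimal sketch of this preprocessing step is shown below; it uses "." as the only sentence delimiter, as described above, and ignores abbreviations and other punctuation.

```java
// Minimal sketch of the sentence-boundary preprocessing: "#" marks the start
// of a sentence and "$" marks its end, using "." as the only delimiter. Real
// corpora would need extra handling for abbreviations and other punctuation.
public class SentenceMarker {

    static String markSentences(String text) {
        StringBuilder out = new StringBuilder();
        for (String sentence : text.split("\\.")) {
            String trimmed = sentence.trim();
            if (!trimmed.isEmpty()) {
                out.append("# ").append(trimmed).append(" $ ");
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(markSentences("Today is a good day. I want to go to picnic."));
        // -> "# Today is a good day $ # I want to go to picnic $"
    }
}
```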

The performance and results of the trigram Markov model discussed in the following sections are based on Version 1.1. Four different data sets were tested to verify the efficiency of this algorithm.

Results and Discussion

To evaluate the effectiveness of the trigram Markov model in combination with the Support Vector Machine, tests were run on the data sets introduced in the previous section: the English texts, the Federalist papers, the King James Version of the New Testament and the Koine Greek version of the New Testament.

The benchmarks used for testing the algorithm's performance are accuracy and average rank. Accuracy was calculated as the ratio of the number of texts predicted correctly to the total number of disputed texts. The concept of average rank was inherited from the Talis report. When texts were misclassified, the extent of the misclassification was recorded: if a text was correctly classified it was given a rank of one, if the true author was predicted as the second most likely candidate it was given a rank of two, and so on, down to the worst-case misclassification scoring a rank of six. The ranks for texts by the same author were then averaged to give the average rank of the author. The closer the average rank is to one, the more accurate the authorship attribution technique has been.
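The two benchmarks can be illustrated with the short sketch below; the example ranks are hypothetical and only show how accuracy and average rank are computed from the ranks assigned to the true authors of the disputed texts.

```java
import java.util.*;

// Minimal sketch of the two benchmarks used here: accuracy (fraction of
// disputed texts classified correctly) and average rank (mean rank assigned
// to the true author, where rank one means a correct prediction).
public class Benchmarks {

    // rankOfTrueAuthor[i] = rank given to the true author of disputed text i
    static double accuracy(int[] rankOfTrueAuthor) {
        int correct = 0;
        for (int r : rankOfTrueAuthor) if (r == 1) correct++;
        return (double) correct / rankOfTrueAuthor.length;
    }

    static double averageRank(int[] rankOfTrueAuthor) {
        return Arrays.stream(rankOfTrueAuthor).average().orElse(0);
    }

    public static void main(String[] args) {
        int[] ranks = {1, 1, 2, 1, 6};          // hypothetical ranks for five texts
        System.out.println(accuracy(ranks));    // 0.6
        System.out.println(averageRank(ranks)); // 2.2
    }
}
```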

Results from English Text

The full set of English texts used was written by six different authors, with each author contributing twenty-six texts. For a full list of these texts, see Appendix A: English Texts. For convenience, each author is labelled by the initials of their given name and family name. Thus AD, BB, CD, HJ, RD and ZG correspond to Sir Arthur Conan Doyle, B. M. Bower, Charles Dickens, Henry James, Richard Harding Davis and Zane Grey respectively.

All 156 English texts were divided into two groups: 132 training texts (22 from each author) and 24 disputed texts (4 from each author). Initially, the threshold value was kept constant and 5 texts from each author were selected for training. The training set was then increased to 10, 15, 20 and 22 texts per author. Subsequently, the tests were repeated with threshold values from 30 up to 90 in steps of 20. These different tests allow the question of which configuration is most effective overall to be examined.

From these tests, the first conclusion that can be drawn is that using the SVM with quadratic, RBF or polynomial kernel functions did not give any consistent results. No clear classification pattern could be observed in their predictions, and the classification accuracy stayed at a very low level of around 17% regardless of how the size of the training set and the threshold were changed. Hence, the result analysis discussed below is based entirely on the SVM with a linear kernel function.

As mentioned, the initial tests had a constant threshold value of ten with a varying amount of training data. The plots in Figure 28 and Figure 29 show the relationship between total average rank and data size, and between accuracy and data size.

Figure 28 : Total Average Rank versus Size of Training Data with Threshold equals to ten
Figure 29: Accuracy versus Size of Training data with Threshold equals to ten

From Figure 28, it is clear that the total average rank across the six authors decreases as the amount of training data increases. As introduced previously, the closer the average rank is to one, the more effective the extraction algorithm is. In addition, the classification accuracy increased from 41.67% when using five texts per author up to 83.33% when all twenty-two texts from each author were used for training. Since kernel algorithms using a linear kernel function are often equivalent to their non-kernel counterparts, this indicates that the trigram Markov model itself can characterise an author's writing style more precisely with a larger training set.

This conclusion is consistent across different threshold values. The following graphs illustrate the situation when the threshold equals thirty; figures and detailed statistical data tables for all tests can be found in Appendix H - Trigram Markov Model Classification Result for English Text.

Figure 30: Total Average Rank versus Size of Training Data with Threshold equals to thirty
Figure 31: Accuracy versus Size of Training Data with Threshold equals to thirty

While running the testing program, it was also noticed that system performance varies with the threshold value. Although the trend was generally upward, the accuracy dropped after the threshold value reached 30 or 50. Figure 32 below shows the effect of increasing the threshold.

Figure 32: Classification Accuracy versus Threshold Value

The same effect can be seen in Figure 33: the total average rank started to drift away from one after the threshold reached 30 or 50. It is therefore clear that threshold selection plays an important role in filtering out most of the unwanted trigrams from the English texts. However, once the dimension of the feature vector has been reduced to a certain level, continuing to apply threshold selection also blocks out some vital characteristic trigrams, leading to unsatisfactory system performance. As a result, it can be concluded that threshold selection can only be used as a general feature selection tool; a more precise selection algorithm should be developed to prevent the loss of valuable data.

Figure 33: Total Average Rank versus Threshold Value

Results from the Federalist

The set of Federalist papers used here was written by three authors, Alexander Hamilton, James Madison and John Jay. Of these texts, 10 were written by Hamilton, 17 by Madison and the remaining 5 by Jay. Five texts from each author were selected to form the training data pool, leaving the remaining 17 as disputed texts. The three authors are labelled by the first letters of their family names: H, M and J respectively.

Similar tests were performed as ones for English text. The difference was that size of training data increased from 2 texts per author up to 5 texts per author with each time increased by 1, and threshold values used were 10, 20, 30, 40, 50 and 60.

Nevertheless, the results this time were poor and inconsistent. With the linear kernel function, the highest accuracy was 64.71%, obtained when 5 texts per author were used for training and the threshold value was 30. In most of the tests the classification accuracy fluctuated below 50%. Furthermore, the conclusions drawn from the English text tests were not clearly reproduced here. Table 19 shows the classification accuracy for all tests using the SVM with a linear kernel function.

Table 19 : Classification Accuracy for Federalist Paper with Linear Kernel Function
Threshold
Texts per Author 10 20 30 40 50 60
2 17.65% 11.77% 11.77% 11.77% 23.53% 23.53%
3 0% 5.88% 29.41% 17.65% 23.53% 23.53%
4 0% 5.88% 41.18% 35.29% 23.53% 23.53%
5 0% 23.53% 64.71% 47.06% 35.29% 17.65%

To find out why the trigram Markov model performed so poorly, each text in the Federalist data set was assessed. It was realised that the texts in the Federalist papers are all very short: the average text length was 2268 words, while the English texts had an average length of 6866 words (example text lengths for the Federalist papers and the English texts are shown in Table 20 and Appendix E: Text Length of Training Data).

Table 20: Example Text Length of Federalist Paper and English Text
Federalist Paper Text Length English Text Text Length
H_FEDERALIST_1.txt 1625 AD_BC.txt 5461
H_FEDERALIST_7.txt 2325 AD_CP.txt 10711
H_FEDERALIST_21.txt 2014 AD_DD.txt 5784
H_FEDERALIST_22.txt 3582 AD_DL.txt 5910
H_FEDERALIST_23.txt 1834 AD_DR.txt 6161
H_FEDERALIST_24.txt 2025 AD_DU.txt 7602

Taking H_FEDERALIST_7.txt as an example, it has a length of 2325 words, as shown in the table above. A total of 2383 distinct trigrams were found throughout the text. On the other hand, as shown in Figure 34, the most frequent trigram occurred only 20 times, and the rest appeared no more than 5 times. Such a high percentage of low-occurrence trigrams cannot precisely reflect an author's writing style.

Figure 34: Trigram Distribution for H_FEDERALIST_7.txt

The tests on the Federalist papers therefore show that text length has a great impact on system performance. The trigram Markov model is not an effective feature extraction algorithm for very short texts.

Results from King James Version

Among the texts of the King James Version of the New Testament, it is known that the Gospel of Luke and Acts were written by the same person. The aim of using the King James Version is therefore to examine the performance of the trigram Markov model. Two tests were performed:

  • Using the Gospel of Luke as the disputed text and all other texts as training data, the trigram Markov model should predict its author to be Luke.
  • Using the Acts of the Apostles as the disputed text and all other texts as training data, the trigram Markov model should predict its author to be Luke.
If the prediction for both tests is Luke, then the two texts were written by the same person, and the trigram Markov model can be shown to work effectively and correctly.

Since only one text needs to be classified this time, average rank is no longer used as a benchmark. Instead, the ranking weight is used to examine the algorithm's performance. For example, suppose there are six potential authors for a disputed text: by applying pairwise comparisons, the most likely author obtains the highest ranking weight. In other words, the higher the ranking weight an author receives, the more likely he is to be the author.
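The sketch below illustrates one simple way a ranking weight could be derived from pairwise comparisons, by counting one vote per pairwise winner. This is an assumption about the mechanics; the exact weighting used by the SVM ranking in this project may differ.

```java
import java.util.*;

// Minimal sketch of a ranking weight from pairwise (one-vs-one) comparisons:
// each pairwise classifier casts one vote, and an author's ranking weight is
// the number of votes received. The exact scheme used by the project's SVM
// ranking may differ.
public class PairwiseRanking {

    static Map<String, Integer> rankingWeights(List<String> pairwiseWinners) {
        Map<String, Integer> weights = new HashMap<>();
        for (String winner : pairwiseWinners) {
            weights.merge(winner, 1, Integer::sum);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Hypothetical winners of the pairwise comparisons for one disputed text.
        List<String> winners = Arrays.asList("Luke", "Luke", "Mark", "Luke", "Matthew", "Mark");
        System.out.println(rankingWeights(winners));  // {Luke=3, Mark=2, Matthew=1}
    }
}
```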

In the first test, where the Gospel of Luke was treated as the disputed text, the classification results were relatively consistent across the different threshold values. In Figure 35, different colours represent different threshold values. Although the figure shows that Luke was not predicted to be the most likely author of the Gospel of Luke, he was still listed as the second most likely author. The three most likely authors in this test are Mark, Luke and Matthew.

Figure 35: Author Ranking for Gospel of Luke with Different Threshold

In the second test, where the Acts of the Apostles was treated as the disputed text, the classification results showed Luke to be the most likely author, and the three most likely authors for Acts were Luke, Mark and Matthew, similar to the first test.

Figure 36: Author Ranking for Acts of Apostles with Different Threshold Values

Even though the most likely authors for the Gospel of Luke and the Acts of the Apostles did not exactly match across the two tests, the ranking weight distributions in Figure 35 and Figure 36 were extremely similar. Hence, it can be said with confidence from the classification results that these two texts were written by the same person. The performance of the trigram Markov model was therefore acceptable.

Results from Koine Greek

In the last section, the King James Version of the New Testament was used to evaluate the trigram Markov model's performance. In this section, the same testing procedures are applied to the Koine Greek version of the New Testament. The purpose of this test is to see whether the translation from Koine Greek to English affects the classification result. Since the Barnabas text we have is only available in English, if the translation can be shown to have no impact on the original author's writing style, the English Barnabas text can then be included in our classification model.

The first test used the Gospel of Luke in Koine Greek as the "disputed" text. The main difference from the previous section is that the number of trigrams in the Koine Greek version is quite limited: compared with the English version, the dimension of the feature vector is reduced by approximately a factor of five. Hence, smaller threshold values were used in this part. Figure 37 shows the classification results for threshold values of 5, 10, 15, 20 and 25. As can be seen, the most likely author for this text was predicted to be Luke; in the corresponding King James Version test, Luke had been ranked second behind Mark.

Figure 37: Author Rank for Koine Greek Gospel of Luke with Different Threshold Values

Additionally, tests on the Koine Greek version of the Acts of the Apostles showed a similar ranking weight distribution to the English version. Consequently, the translation from Koine Greek to English does not change the original author's writing style significantly, and the trigram Markov model is able to effectively capture important features from both versions of the New Testament.

Figure 38: Author Ranking for Koine Greek Acts of Apostle with Different Threshold Values

Results using Koine Greek for the Letter to Hebrews

With the conclusions obtained from all previous tests, the trigram Markov model was applied to the letter to the Hebrews. Hebrews was treated as the disputed text, and all other texts from the Koine Greek New Testament were used as training data. Since all of the texts used had a length of around 5000 words, text length should not introduce a bias into the prediction result. The program produced the ranking shown in Figure 39. From the graph, the most likely author of the letter to the Hebrews would be Matthew, and the second most likely author would be Paul.

Figure 39: Author Ranking for the Letter to Hebrews

Conclusion

In conclusion, the trigram Markov model is generally an effective feature extraction algorithm when combined with the support vector machine. However, it performs very poorly on short texts; to ensure the trigram Markov model's accuracy, a minimum text length must be satisfied. In addition, there is a clear trend that classification accuracy using the trigram Markov model increases as the size of the training data increases, and the way useful features are selected for classification should be further improved.

One recommendation for future implementations of the trigram Markov model is to use a feature selection algorithm based purely on statistics. In this project, the algorithm for selecting useful features is based on the choice of trigrams. An alternative method is to generate all trigram probabilities and then apply a technique for selecting a subset of relevant features for building robust learning models. A popular family of feature selection algorithms recommended for this is stepwise regression. With carefully selected features, the SVM may achieve improved performance.

Project Management

At the start of this project, the team identified the key milestones shown in Table 21. Several milestones have been met thus far, such as the proposal seminar, the stage 1 design document and the peer review of the stage 1 design document. In addition to these key deliverables to the school, the team members have added further internal milestones.

Table 21 : List of Deliverables
Events Date Action By
Proposal Seminar 11th August 2010 ALL
Stage 1 Design Document 23rd August 2010 ALL
Peer Review of Stage 1 Design 30th August 2010 ALL
Progress Report 22nd Oct 2010 ALL
Interim Performance 29th Oct 2010 ALL
Final Seminar 2nd May 2011 ALL
Final Performance 23rd May 2011 ALL
Project Exhibition 3rd June 2011 ALL

Project Schedule

Overall, the project proceeded successfully and according to the initial plans. Some tasks even began earlier than their planned start dates, as the development of the feature extraction algorithms was completed sooner than expected. However, there were a few changes to the project compared with what was described in the project proposal. The main changes were on the technical side, namely the three feature extraction algorithms. These changes were essential because the initial versions of the techniques were not satisfactory in terms of effectiveness; hence modifications and improvements to the techniques were made by the team members.

These modifications led to a slight delay beyond the expected completion dates of some tasks, as indicated in Appendix M: Gantt Chart. The delayed tasks took longer than initially planned because of the modification and improvement of the feature extraction algorithms. However, these delays were not critical, and the deliverables and milestones of the project were still met.

Information Management

Each team member is responsible for reporting and updating their own findings on the online wiki - [[Authorship detection: 2010 group ]]. In addition, the team has been meeting fortnightly to update our progress, analyse the current stage of the project and plan the next steps. Minutes of the meetings can be obtained from the online wiki - [[Minutes of Meeting 2010: Who wrote the Letter to the Hebrews? ]]

Work Breakdown Structure

The workload for this project was divided among the team members based on the Work Breakdown Structure shown in Appendix L: Work Breakdown Structure. Referring to the work breakdown structure, each team member is responsible for one of the data extraction techniques, namely function word analysis, word recurrence interval and trigram Markov. This approach gave the team the flexibility to undertake three tasks in parallel, maximizing time and work efficiency. Furthermore, a Gantt chart (Section 11.13) was produced that includes the additional internal milestones decided by the team to monitor the project's progress.

In addition to the work breakdown structure, the writing of the various reports, namely the Stage 1 Critical Design Document, the Progress Report and the Final Report, as well as the preparation of the Project Exhibition, was divided equally among the team members, with each handling a particular section. Furthermore, after completion of the individual write-ups, each team member was given a follow-up task: compilation of the different sections by Leng Yang Tan, proofreading and editing by Tien-en Joel Phua, and upload of the documents onto the wiki by Leng Yang Tan and Jie Dong.

Project Budget and Resources

An amount of two hundred and fifty dollars was allocated to each student for this project, resulting in a total budget of seven hundred and fifty dollars.

Table 22 : Expected and Actual Use of Project Budget
Total allocated budget $750
Expected Expenses:
Printing of research documents $200
Purchase of resources $200
Additional resources $100
Total Expected Expenses $500
Actual Expenses:
A book titled "Statistical Language Learning" $20
Total Actual Expenses $20

Printing of research documents would consist of printing past research carried out by various institutions, for the project team to analyse and evaluate the work done to date.

Purchase of resources would include books written about the authorship of the Letter to the Hebrews. Additional resources such as online books would be purchased for use as training and testing data to measure the accuracy of our classification model.

Additional resources include the purchase of storage devices, such as compact discs, to store data, software programs, handbooks and reports.

Comments on Variations

Referring back to Table 22, almost none of the expected expenses were incurred, as most of the required materials were obtained via online journals and e-resources. This shows that the team successfully minimised the usage of the allocated project budget. Furthermore, the project itself did not require any hardware, so no electrical or electronic components needed to be purchased, resulting in a low usage of the project budget.

Risk Management

Occupational Health and Safety risks are managed to reduce the overall cost of the project and to improve both team morale and productivity. It is important to provide a safe working environment for team members and also to raise the members' awareness of risk. Due to the nature of this project, the team's area of risk is confined to indoor work.

The following terminology will be used in this section:

  • Hazard
    • A potential source of injury or ill-health
  • Risk
    • A measure which combines the probability (likelihood) and possible severity (or consequences) of a hazard causing injury, illness or property damage
Table 23 : Risk Analysis as made from the Proposal Plan
Hazard Preventive Measures Probability Rating / 10 Impact Rating / 10 Priority Score / 100
Suffer from back and neck injury due to sitting in a bad posture Ensure an upright sitting position and use a comfortable chair 10 6 60
Develop hand and leg soreness due to lack of rest Regularly stand up and walk around to exercise the limbs 5 4 20
Inadequate sleep resulting in headache or migraine Rest when required 6 5 30
Suffer from depression and anxiety due to inability to solve program code Seek help if a mental roadblock occurs 9 4 36
Eye strain due to staring at the computer screen for too long Look away from the monitor every 5 minutes after working for 30 minutes 3 10 30

The risk management plan set out in the project proposal proved very effective: during the period of the project, none of the team members suffered from any of the identified hazards, and the project was able to run smoothly and according to plan. Furthermore, the hazards from these risks were reduced even further by the cooperation and help among the team members themselves. Most of the hazards were avoided by team members getting adequate rest and keeping a good and healthy diet.

Conclusion

In summary, all three feature extraction algorithms support the finding that classification accuracy generally increases when large and balanced training data are used. It is also noticed that, for all three algorithms, classification accuracy increases as the data dimension increases up to a certain point, after which model overfitting begins to reduce the accuracy. It is also observed that Function Word Analysis works relatively well for texts that are short in length. However, the opposite holds for the other two feature extraction algorithms, Word Recurrence Interval and the Trigram Markov Model, where a certain minimum text length is required for the algorithm to be accurate. The most likely author of the Letter to the Hebrews according to each algorithm is shown below.

Table 24 : Results for the most likely author of the Letter to the Hebrews, in descending order of likelihood
Feature Extraction Algorithm Most Likely Author in Descending Order
Function Word Analysis Peter, Matthew, Paul, Mark, Luke, John, Clement, Barnabas
Word Recurrence Interval Luke, Matthew, Paul, Clement, Mark, John, Barnabas, Peter
Trigram Markov Model Matthew, Paul, Luke, John, Mark, Peter, Clement, Barnabas

Comparing the results from the different feature extraction algorithms, it is noticed that Barnabas, John and Clement are the least likely authors of the Letter to the Hebrews. Peter was classified as the most likely author by the Function Word Analysis algorithm, but is ranked as very unlikely by the Word Recurrence Interval algorithm and the Trigram Markov Model. The reason is that the texts from Peter contain only approximately 1000 words each, whereas Word Recurrence Interval and the Trigram Markov Model require at least 4000 words for their training data. Thus, Peter was listed as an unlikely author by WRI and the Trigram Markov Model. However, the Function Word Analysis algorithm is not greatly affected by the length of a text, and hence Peter was strongly considered as one of the possible authors.

Hence the remaining possible authors are Peter, Matthew, Paul, Mark and Luke. At this point, no conclusive results could be obtained. This is because there is insufficient training data for each author, and consequently the Letter to the Hebrews cannot be attributed to a specific author. However, as shown in the results for the three algorithms, the algorithms were able to eliminate unlikely authors given the small training set.

In conclusion, our results disagree with the claim of Fonck, L. and Milligan, G. that Paul is not the author of the Letter to the Hebrews. The results further add weight to the claim made by Clement of Alexandria that Paul and Luke might be the authors of the original version and the Greek version of the Letter to the Hebrews respectively.

Future Work

A future improvement would be to integrate the algorithms into a search engine that, given a text, produces a list of books by the same author. The increasing use of electronic books motivates this implementation, as it would then be relatively simple to find books by an author using only a single chapter of a book as the input. In addition, the algorithms could also be applied to authorship attribution in languages other than English, such as disputed Chinese texts.

Another recommendation for future implementation is to utilise stepwise regression as the feature selection step for all three algorithms. This could be implemented by obtaining the feature vectors produced by each algorithm, applying stepwise regression to select a subset of relevant features for building a robust learning model, and then passing the selected features into the SVM for data classification. The SVM classification accuracy might improve with such careful feature selection; a brief sketch of this pipeline is given below.
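As a hedged sketch of this pipeline, the example below concatenates placeholder feature matrices standing in for the outputs of the three algorithms, applies scikit-learn's forward sequential feature selection with a linear SVM as the scorer (a practical stand-in for stepwise regression), and reports the cross-validated accuracy on the selected subset. The data, dimensions and parameter values are illustrative assumptions, not results from this project.

  import numpy as np
  from sklearn.svm import SVC
  from sklearn.feature_selection import SequentialFeatureSelector
  from sklearn.model_selection import cross_val_score

  rng = np.random.default_rng(1)
  # Placeholder matrices standing in for the feature vectors produced by the three
  # algorithms (function word frequencies, WRI statistics, trigram probabilities).
  X_fw, X_wri, X_tri = rng.random((40, 50)), rng.random((40, 20)), rng.random((40, 60))
  y = np.repeat(np.arange(4), 10)                    # four authors, ten texts each

  X = np.hstack([X_fw, X_wri, X_tri])                # one combined feature vector per text
  selector = SequentialFeatureSelector(SVC(kernel="linear"), n_features_to_select=15,
                                       direction="forward", cv=5)
  X_sel = selector.fit_transform(X, y)
  print(cross_val_score(SVC(kernel="linear"), X_sel, y, cv=5).mean())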

References

  1. Rosa M. C., Luis V. P., Manuel M. G., Paolo R., Authorship Attribution using Word Sequences, Universidad Politécnica de Valencia
  2. Fung G, Mangasarian O, The disputed Federalist Papers: SVM Feature Selection via Concave Minimization, 14 April 2003
  3. Zhao, Y, Zobel, J, Effective and Scalable Authorship Attribution Using Function Words, RMIT University
  4. Eddy H. T., The Characteristic Curves of Composition, <http://www.jstor.org/stable/1763509 >, viewed August 2010
  5. Smith, M. W. A., Recent experience and new developments of methods for the determination of authorship, ALLC Bulletin, 11:73-82, 1983.
  6. Hilton, J. L., On Verifying Wordprint Studies: Book of Mormon Authorship, BYU Studies, vol. 30, 1990.
  7. Holmes, D. I., & Forsyth, R. S., The 'Federalist' Revisited: New Directions in Authorship Attribution, Literary and Linguistic Computing, 10, 111-127, 1995.
  8. Stamatatos, E., Fakotakis, N. & Kokkinakis, G., Automatic Text Categorization in Terms of Genre and Author, Computational Linguistics, vol. 26, no. 4, pp. 471-495(25), December 2000
  9. Baayen, H., Halteren, H. V., Neijt, A. & Tweedie, F., An experiment in authorship attribution, 6th JADT, 2002
  10. Juola, P. & Baayen, H., A controlled corpus experiment in authorship attribution by crossentropy, Proceedings of ACH/ALLC- 2003, 2003.
  11. Sabordo,M., Shong, C. Y., Berryman, M. J. & Abbott, D., Who Wrote the Letter to the Hebrews? - Data Mining for Detection of Text Authorship, SPIE vol. 5649 pp. 513 - 524, 2004.
  12. Smith, M. W. A., Recent experience and new developments of methods for the determination of authorship, ALLC Bulletin, 11:73-82, 1983
  13. Hilton, J. L., On Verifying Wordprint Studies: Book of Mormon Authorship, BYU Studies, vol. 30, 1990
  14. Holmes, D. I., & Forsyth, R. S., The 'Federalist' Revisited: New Directions in Authorship Attribution, Literary and Linguistic Computing, 10, 111-127, 1995
  15. Holmes, D. I., & Forsyth, R. S., The 'Federalist' Revisited: New Directions in Authorship Attribution, Literary and Linguistic Computing, 10, 111-127, 1995
  16. Burrows, J. F. (1992). Not Unless You Ask Nicely: The Interpretative Nexus between Analysis and Information. Literary and Linguistic Computing, 7(2): 91-10
  17. Baayen, H., Halteren, H. V., Neijt, A. & Tweedie, F., An experiment in authorship attribution, 6th JADT, 2002
  18. Juola, P. & Baayen, H., A controlled corpus experiment in authorship attribution by crossentropy, Proceedings of ACH/ALLC- 2003, 2003
  19. Sabordo,M., Shong, C. Y., Berryman, M. J. & Abbott, D., Who Wrote the Letter to the Hebrews? - Data Mining for Detection of Text Authorship, SPIE vol. 5649 pp. 513 - 524, 2004
  20. Putnins, T. J., Signoriello, D. J., Jain, S. Berryman, M. J., & Abbott, D., Who wrote the Letter to the Hebrews? Data mining for detection of text authorship, University of Adelaide, 2005
  21. Oracle, Supported Encodings <http://download.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html >, viewed May 2011
  22. F. Mosteller and D. L. Wallace. Inference and Disputed Authorship: The Federalist. Addison-Wesley, Massachusetts, series in behavioural science: quantitative methods edition, 1964.
  23. Argamon, S. & Levitan, S. (2005) "Measuring the Usefulness of Function Words for Authorship Attribution." Association for Literary and Linguistic Computing / Association for Computers and the Humanities, University of Victoria, Canada.
  24. Christopher J.C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, Vol. 2, pp. 121-167, 1998
  25. Noble, W. S., What is a support vector machine?, Nature Biotechnology, available: http://www.nature.com/naturebiotechnology, Nature Publishing Group, 2006.
  26. Sloin, Alba, Burshtein, David, Support Vector Machine Training for Improved Hidden Markov Modeling, IEEE Transactions on Signal Processing, vol. 56, no. 1, January 2008
  27. Khmelev, Dmitri V., Tweedie, Fiona J., Using Markov Chains for Identification of Writers, Literary and Linguistic Computing, vol. 16, no. 3, Oxford University Press, 2001
  28. Altun, Yasemin, Tsochantaridis, Ioannis, Hofmann, Thomas, Hidden Markov Support Vector Machines, Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003
  29. Saul, Lawrence, Pereira, Fernando, Aggregate and mixed-order Markov models for statistical language processing, AT&T Labs, 1997
  30. Sanderson, Conrad, Guenter, Simon, Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation, Australian National University, National ICT Australia, 2005
  31. Charniak, Eugene, Hidden Markov Models and Two Applications, Statistical Language Learning, pp. 39-51, The MIT Press, 1996
  32. Talis J. Putnins, Domenic J. Signoriello, Samant Jain, Matthew J. Berryman and Derek Abbott, "Advanced text authorship detection methods and their application to biblical texts", Proc. SPIE: Complex Systems 6039 ed. Axel Bender, Brisbane, Qld., Australia, December 11-14, 2005.
  33. Oberlander, J. & Brew, C., Stochastic text generation, Philosophical Transactions of the Royal Society of London, Series A, 358, pp. 1373-1385, 2000.
  34. Joachims, T., Making large-scale support vector machine learning practical, MIT Press, Cambridge, 1998.
  35. Martin, Sven, Liermann, Jörg, Ney, Hermann, Algorithms for bigram and trigram word clustering, Lehrstuhl für Informatik VI, University of Technology, Germany, 1997
  36. Gastrich, Jason, The Authorship of the Epistle to the Hebrews, 1998, <http://jcsm.org/Education/authorshipofHebrews.htm>, viewed April 2011
  37. Crampton, W. Gary, Hebrews: Who is the Author?, First Presbyterian Church of Rowlett, <http://www.fpcr.org/blue_banner_articles/Who-Wrote-Hebrews.htm>, viewed May 2011
  38. McGee, J. Vernon, The Authorship of Hebrews or Did Paul Write Hebrews?, Thru the Bible Radio Network, <http://www.thruthebible.org/atf/cf/%7B91e2424c-636c-40c2-9c55-890588e90ece%7D/AUTHORSHIP%20OF%20HEBREWS.PDF>, viewed April 2011
  39. TLG Beta Code Quick Reference Guide <http://unicodegreek.org/doc/beta-code-quick.pdf >, viewed May 2011
  40. Sequence Publishing, Function Words, <http://www.sequencepublishing.com/academic.html >, viewed Dec 2010
  41. Lake, K., Apostolic Fathers, Christian Classics Ethereal Library
  42. Anderson, C. P., The Epistle to the Hebrews and the Pauline Letter Collection, The Harvard Theological Review, Cambridge University Press, vol. 59, no. 4, 1966
  43. Fung, G. and Mangasarian, O., The Disputed Federalist Papers: SVM Feature Selection via Concave Minimization, 2003

Appendices

Appendix A: English Texts

James, Henry (1843-1916)

  • The Altar of the Dead (English)
  • The Ambassadors (English)
  • The American (English)
  • The Aspern Papers (English)
  • The Awkward Age (English)
  • The Beast in the Jungle (English)
  • The Beldonald Holbein (English)
  • A Bundle of Letters (English)
  • The Chaperon (English)
  • Confidence (English)
  • The Coxon Fund (English)
  • Daisy Miller (English)
  • The Death of the Lion (English)
  • The Diary of a Man of Fifty (English)
  • Eugene Pickering (English)
  • The Europeans (English)
  • The Figure in the Carpet (English)
  • Glasses (English)
  • Greville Fane (English)
  • An International Episode (English)
  • In the Cage (English)
  • The Jolly Corner (English)
  • The Lesson of the Master (English)
  • Louisa Pallant (English)
  • Madame De Mauves (English)
  • The Madonna of the Future (English)

Grey, Zane (1872-1939)

  • Betty Zane (English)
  • The Border Legion (English)
  • The Call of the Canyon (English)
  • The Day of the Beast (English)
  • Desert Gold (English)
  • The Desert of Wheat (English)
  • Heritage of the Desert (English)
  • The Last of the Plainsmen (English)
  • The Last Trail (English)
  • Light of the Western Stars (English)
  • The Man of the Forest (English)
  • The Mysterious Rider (English)
  • The Rainbow Trail (English)
  • The Redheaded Outfield (English)
  • Riders of the Purple Sage (English)
  • The Rustlers of Pecos County (English)
  • The Spirit of the Border (English)
  • Tales of Lonely Trails (English)
  • To the Last Man (English)
  • The U. P. Trail (English)
  • Wildfire (English)
  • The Young Forester (English)
  • The Border Legion End (English)
  • Light of the Western Stars End (English)
  • To the Last Man End (English)

Bower, B. M. (1871-1940)

  • Cabin Fever (English)
  • Casey Ryan (English)
  • Chip, of the Flying U (English)
  • Cow-Country (English)
  • The Flying U Ranch (English)
  • The Flying U's Last Stand (English)
  • Good Indian (English)
  • The Gringos (English)
  • The Happy Family (English)
  • The Heritage of the Sioux (English)
  • Her Prairie Knight (English)
  • Jean of the Lazy A (English)
  • Lonesome Land (English)
  • The Lonesome Trail and Other Stories (English)
  • The Long Shadow (English)
  • The Phantom Herd (English)
  • The Range Dwellers (English)
  • Rowdy of the Cross L (English)
  • Starr, of the Desert (English)
  • The Thunder Bird (English)
  • The Trail of the White Mule (English)
  • The Uphill Climb (English)
  • Good Indian (English)
  • The Gringos (English)
  • Good Indian End (English)
  • The Gringos End (English)

Doyle, Arthur Conan, Sir (1859-1930)

  • The Adventure of the Bruce-Partington Plans (English)
  • The Adventure of the Cardboard Box (English)
  • The Adventure of the Devil's Foot (English)
  • The Adventure of the Dying Detective (English)
  • The Adventure of the Red Circle (English)
  • The Adventure of Wisteria Lodge (English)
  • The Adventures of Gerard (English)
  • The Adventures of Sherlock Holmes (English)
  • Beyond the City (English)
  • The Captain of the Polestar (English)
  • The Doings of Raffles Haw (English)
  • A Duet: a duologue (English)
  • The Exploits of Brigadier Gerard (English)
  • The Firm of Girdlestone (English)
  • The Green Flag (English)
  • His Last Bow (English)
  • The Hound of the Baskervilles (English)
  • The Lost World (English)
  • The Mystery of Cloomber (English)
  • The Parasite (English)
  • The Poison Belt (English)
  • Round the Red Lamp (English)
  • The Sign of the Four (English)
  • The Valley of Fear (English)
  • The Adventure of the Red Circle (English)
  • The Adventure of Wisteria Lodge (English)
  • The Adventures of Gerard (English)
  • The Adventures of Sherlock Holmes (English)
  • The Captain of the Polestar (English)
  • The Green Flag (English)
  • His Last Bow (English)
  • The Poison Belt (English)

Davis, Richard Harding (1864-1916)

  • The Amateur (English)
  • Billy and the Big Stick (English)
  • Captain Macklin (English)
  • A Charmed Life (English)
  • Cinderella And Other Stories (English)
  • The Congo and Coasts of Africa (English)
  • The Consul (English)
  • The Deserter (English)
  • The Frame Up (English)
  • Gallegher and Other Stories (English)
  • In the Fog (English)
  • The King's Jackal (English)
  • Lion and the Unicorn (English)
  • The Log of the Jolly Polly (English)
  • The Lost House (English)
  • The Lost Road (English)
  • The Make-Believe Man (English)
  • The Man Who Could Not Lose (English)
  • The Messengers (English)
  • My Buried Treasure (English)
  • The Nature Faker (English)
  • Peace Manoeuvres (English)
  • The Princess Aline (English)
  • A Question of Latitude (English)
  • Ranson's Folly (English)
  • The Red Cross Girl (English)

Dickens, Charles (1812-1870)

  • Barnaby Rudge (English)
  • The Battle of Life (English)
  • Bleak House (English)
  • The Chimes (English)
  • A Christmas Carol (English)
  • Dombey and Son (English)
  • George Silverman's Explanation (English)
  • Going into Society (English)
  • Great Expectations (English)
  • Hard Times (English)
  • A House to Let (English)
  • Hunted Down (English)
  • The Lamplighter (English)
  • Lazy Tour of Two Idle Apprentices (English)
  • Little Dorrit (English)
  • The Loving Ballad of Lord Bateman (English)
  • Martin Chuzzlewit (English)
  • Master Humphrey's Clock (English)
  • A Message from the Sea (English)
  • Mrs. Lirriper's Legacy (English)
  • Mugby Junction (English)
  • Oliver Twist (English)
  • The Holly-Tree (English)
  • A House to Let (English)
  • Hunted Down (English)
  • Mrs. Lirriper's Lodgings (English)

Appendix B: Federalist Papers

No. Title Author
1 General Introduction Alexander Hamilton
2 Concerning Dangers from Foreign Force and Influence John Jay
3 The Same Subject Continued: Concerning Dangers from Foreign Force and Influence John Jay
4 The Same Subject Continued: Concerning Dangers from Foreign Force and Influence John Jay
5 The Same Subject Continued: Concerning Dangers from Foreign Force and Influence John Jay
6 Concerning Dangers from Dissensions Between the States Alexander Hamilton
7 The Same Subject Continued: Concerning Dangers from Dissensions Between the States Alexander Hamilton
8 The Consequences of Hostilities Between the States Alexander Hamilton
9 The Union as a Safeguard Against Domestic Faction and Insurrection Alexander Hamilton
10 The Same Subject Continued: The Union as a Safeguard Against Domestic Faction and Insurrection James Madison
11 The Utility of the Union in Respect to Commercial Relations and a Navy Alexander Hamilton
12 The Utility of the Union In Respect to Revenue Alexander Hamilton
13 Advantage of the Union in Respect to Economy in Government Alexander Hamilton
14 Objections to the Proposed Constitution From Extent of Territory Answered James Madison
15 The Insufficiency of the Present Confederation to Preserve the Union Alexander Hamilton
16 The Same Subject Continued: The Insufficiency of the Present Confederation to Preserve the Union Alexander Hamilton
17 The Same Subject Continued: The Insufficiency of the Present Confederation to Preserve the Union Alexander Hamilton
18 The Same Subject Continued: The Insufficiency of the Present Confederation to Preserve the Union James Madison & Alexander Hamilton
19 The Same Subject Continued: The Insufficiency of the Present Confederation to Preserve the Union James Madison & Alexander Hamilton
20 The Same Subject Continued: The Insufficiency of the Present Confederation to Preserve the Union James Madison & Alexander Hamilton
21 Other Defects of the Present Confederation Alexander Hamilton
22 The Same Subject Continued: Other Defects of the Present Confederation Alexander Hamilton
23 The Necessity of a Government as Energetic as the One Proposed to the Preservation of the Union Alexander Hamilton
24 The Powers Necessary to the Common Defense Further Considered Alexander Hamilton
25 The Same Subject Continued: The Powers Necessary to the Common Defense Further Considered Alexander Hamilton
26 The Idea of Restraining the Legislative Authority in Regard to the Common Defense Considered Alexander Hamilton
27 The Same Subject Continued: The Idea of Restraining the Legislative Authority in Regard to the Common Defense Considered Alexander Hamilton
28 The Same Subject Continued: The Idea of Restraining the Legislative Authority in Regard to the Common Defense Considered Alexander Hamilton
29 Concerning the Militia Alexander Hamilton
30 Concerning the General Power of Taxation Alexander Hamilton
31 The Same Subject Continued: Concerning the General Power of Taxation Alexander Hamilton
32 The Same Subject Continued: Concerning the General Power of Taxation Alexander Hamilton
33 The Same Subject Continued: Concerning the General Power of Taxation Alexander Hamilton
34 The Same Subject Continued: Concerning the General Power of Taxation Alexander Hamilton
35 The Same Subject Continued: Concerning the General Power of Taxation Alexander Hamilton
36 The Same Subject Continued: Concerning the General Power of Taxation Alexander Hamilton
37 Concerning the Difficulties of the Convention in Devising a Proper Form of Government James Madison
38 The Same Subject Continued, and the Incoherence of the Objections to the New Plan Exposed James Madison
39 The Conformity of the Plan to Republican Principles James Madison
40 The Powers of the Convention to Form a Mixed Government Examined and Sustained James Madison
41 General View of the Powers Conferred by the Constitution James Madison
42 The Powers Conferred by the Constitution Further Considered James Madison
43 The Same Subject Continued: The Powers Conferred by the Constitution Further Considered James Madison
44 Restrictions on the Authority of the Several States James Madison
45 The Alleged Danger From the Powers of the Union to the State Governments Considered James Madison
46 The Influence of the State and Federal Governments Compared James Madison
47 The Particular Structure of the New Government and the Distribution of Power Among Its Different Parts James Madison
48 These Departments Should Not Be So Far Separated as to Have No Constitutional Control Over Each Other James Madison
49 Method of Guarding Against the Encroachments of Any One Department of Government James Madison (was disputed)
50 Periodic Appeals to the People Considered James Madison (was disputed)
51 The Structure of the Government Must Furnish the Proper Checks and Balances Between the Different Departments James Madison (was disputed)
52 The House of Representatives James Madison (was disputed)
53 The Same Subject Continued: The House of Representatives James Madison (was disputed)
54 The Apportionment of Members Among the States James Madison (was disputed)
55 The Total Number of the House of Representatives James Madison (was disputed)
56 The Same Subject Continued: The Total Number of the House of Representatives James Madison (was disputed)
57 The Alleged Tendency of the New Plan to Elevate the Few at the Expense of the Many James Madison (was disputed)
58 Objection That The Number of Members Will Not Be Augmented as the Progress of Population Demands Considered James Madison (was disputed)
59 Concerning the Power of Congress to Regulate the Election of Members Alexander Hamilton
60 The Same Subject Continued: Concerning the Power of Congress to Regulate the Election of Members Alexander Hamilton
61 The Same Subject Continued: Concerning the Power of Congress to Regulate the Election of Members Alexander Hamilton
62 The Senate James Madison (was disputed)
63 The Senate Continued James Madison (was disputed)
64 The Powers of the Senate John Jay
65 The Powers of the Senate Continued Alexander Hamilton
66 Objections to the Power of the Senate To Set as a Court for Impeachments Further Considered Alexander Hamilton
67 The Executive Department Alexander Hamilton
68 The Mode of Electing the President Alexander Hamilton
69 The Real Character of the Executive Alexander Hamilton
70 The Executive Department Further Considered Alexander Hamilton
71 The Duration in Office of the Executive Alexander Hamilton
72 The Same Subject Continued, and Re-Eligibility of the Executive Considered Alexander Hamilton
73 The Provision For The Support of the Executive, and the Veto Power Alexander Hamilton
74 The Command of the Military and Naval Forces, and the Pardoning Power of the Executive Alexander Hamilton
75 The Treaty Making Power of the Executive Alexander Hamilton
76 The Appointing Power of the Executive Alexander Hamilton
77 The Appointing Power Continued and Other Powers of the Executive Considered Alexander Hamilton
78 The Judiciary Department Alexander Hamilton
79 The Judiciary Continued Alexander Hamilton
80 The Powers of the Judiciary Alexander Hamilton
81 The Judiciary Continued, and the Distribution of the Judicial Authority Alexander Hamilton
82 The Judiciary Continued Alexander Hamilton
83 The Judiciary Continued in Relation to Trial by Jury Alexander Hamilton
84 Certain General and Miscellaneous Objections to the Constitution Considered and Answered Alexander Hamilton
85 Concluding Remarks Alexander Hamilton

Appendix C: King James Version of the New Testament

  • The Epistle of James
  • The Epistle of Jude
  • The Gospel of John
  • The Gospel of Luke
  • The Gospel of Mark
  • The Gospel of Matthew
  • The First Epistle to the Corinthians
  • The Second Epistle to the Corinthians
  • The Epistle to the Romans
  • The First Epistle of Peter
  • The Second Epistle of Peter

Appendix D: Koine Greek Text of the New Testament

  • The Epistle of James
  • The Epistle of Jude
  • The Gospel of John
  • The Gospel of Luke
  • The Gospel of Mark
  • The Gospel of Matthew
  • The First Epistle to the Corinthians
  • The Second Epistle to the Corinthians
  • The Epistle to the Romans
  • The First Epistle of Peter
  • The Second Epistle of Peter
  • The Epistle of Barnabas
  • The First Epistle of Clement to the Corinthians

Appendix E: Text Length of Training Data

Text Length for the English Texts

File Name Text Length
AD_BC.txt 5461
AD_CP.txt 10711
AD_DD.txt 5784
AD_DL.txt 5910
AD_DR.txt 6161
AD_DU.txt 7602
AD_EB.txt 6012
AD_FG.txt 7203
AD_GF.txt 9122
AD_HB.txt 5571
AD_LB.txt 6076
AD_LW.txt 5401
AD_MC.txt 6855
AD_MD.txt 7605
AD_PB.txt 8843
AD_RC.txt 7295
AD_RL.txt 6314
AD_SF.txt 6510
AD_SH.txt 7122
AD_TP.txt 8064
AD_VF.txt 5336
AD_WL.txt 6410
BB_FU.txt 5368
BB_GE.txt 9625
BB_GI.txt 5303
BB_HF.txt 5383
BB_HS.txt 6368
BB_IE.txt 9961
BB_JL.txt 6484
BB_LD.txt 5629
BB_LL.txt 5282
BB_LM.txt 5822
BB_LO.txt 6970
BB_LS.txt 6248
BB_LT.txt 5479
BB_PH.txt 6510
BB_PK.txt 5296
BB_RC.txt 5282
BB_RD.txt 5525
BB_SD.txt 7007
BB_TB.txt 4095
BB_TG.txt 6586
BB_UC.txt 8915
BB_WM.txt 6692
CD_CH.txt 6724
CD_DC.txt 5518
CD_DM.txt 8825
CD_DS.txt 6431
CD_GE.txt 7622
CD_GI.txt 7345
CD_GS.txt 7838
CD_HD.txt 6224
CD_HL.txt 8182
CD_HM.txt 6534
CD_HO.txt 7267
CD_HR.txt 6756
CD_HT.txt 4176
CD_LD.txt 6343
CD_LL.txt 7836
CD_LT.txt 7991
CD_MC.txt 8931
CD_MH.txt 6952
CD_ML.txt 9160
CD_MS.txt 6258
CD_OT.txt 4606
CD_TC.txt 7228
HJ_BH.txt 8031
HJ_BJ.txt 7296
HJ_BL.txt 9678
HJ_CF.txt 8052
HJ_CO.txt 8404
HJ_DL.txt 7102
HJ_DM.txt 7615
HJ_EP.txt 7835
HJ_FC.txt 6395
HJ_GF.txt 7270
HJ_GL.txt 6954
HJ_IC.txt 8920
HJ_IE.txt 6699
HJ_JC.txt 8203
HJ_LH.txt 9450
HJ_LM.txt 8365
HJ_MA.txt 9140
HJ_MD.txt 10522
HJ_MF.txt 6953
HJ_TA.txt 8072
HJ_TC.txt 6944
HJ_TE.txt 6902
RD_CM.txt 6838
RD_CO.txt 6207
RD_FU.txt 6404
RD_GO.txt 6947
RD_IF.txt 7457
RD_JP.txt 6052
RD_KJ.txt 7764
RD_LH.txt 7010
RD_LR.txt 5277
RD_LU.txt 5901
RD_MB.txt 7053
RD_MC.txt 6196
RD_NF.txt 4648
RD_PA.txt 6898
RD_PM.txt 6461
RD_QL.txt 7121
RD_RC.txt 6185
RD_RF.txt 6810
RD_TA.txt 5260
RD_TC.txt 6354
RD_TD.txt 7143
RD_TM.txt 4746
ZG_DB.txt 5643
ZG_DG.txt 6173
ZG_DW.txt 6009
ZG_HD.txt 6612
ZG_LE.txt 5304
ZG_LM.txt 9398
ZG_LP.txt 5636
ZG_LS.txt 7160
ZG_LT.txt 6846
ZG_LW.txt 6424
ZG_ME.txt 4894
ZG_MF.txt 6767
ZG_MR.txt 5876
ZG_PC.txt 6880
ZG_RO.txt 7897
ZG_RP.txt 4965
ZG_RT.txt 6736
ZG_SB.txt 8390
ZG_TL.txt 7187
ZG_UP.txt 5791
ZG_WF.txt 6410
ZG_YF.txt 6379
ZZ_AD_AB.txt 5011
ZZ_AD_AC.txt 6463
ZZ_AD_AD.txt 6166
ZZ_AD_AG.txt 8324
ZZ_BB_CC.txt 5704
ZZ_BB_CF.txt 8024
ZZ_BB_CR.txt 6608
ZZ_BB_CU.txt 5923
ZZ_CD_BH.txt 7662
ZZ_CD_BL.txt 7753
ZZ_CD_BR.txt 6500
ZZ_CD_CC.txt 5930
ZZ_HJ_AA.txt 7612
ZZ_HJ_AD.txt 6082
ZZ_HJ_AM.txt 5909
ZZ_HJ_AP.txt 9553
ZZ_RD_BB.txt 5748
ZZ_RD_BT.txt 7808
ZZ_RD_CC.txt 8279
ZZ_RD_CL.txt 4788
ZZ_ZG_BE.txt 6772
ZZ_ZG_BL.txt 6707
ZZ_ZG_BZ.txt 7489
ZZ_ZG_CC.txt 9496

Text Length for the Federalist Papers

File Name Text Length
H_FEDERALIST_1.txt 1625
H_FEDERALIST_17.txt 1593
H_FEDERALIST_21.txt 2014
H_FEDERALIST_22.txt 3582
H_FEDERALIST_23.txt 1834
H_FEDERALIST_24.txt 2025
H_FEDERALIST_25.txt 2016
H_FEDERALIST_26.txt 2409
H_FEDERALIST_27.txt 1488
H_FEDERALIST_28.txt 1632
H_FEDERALIST_29.txt 2253
H_FEDERALIST_30.txt 1990
H_FEDERALIST_31.txt 1760
H_FEDERALIST_32.txt 1513
H_FEDERALIST_33.txt 1714
H_FEDERALIST_34.txt 2241
H_FEDERALIST_35.txt 2275
H_FEDERALIST_36.txt 2765
H_FEDERALIST_59.txt 1941
H_FEDERALIST_6.txt 2054
H_FEDERALIST_60.txt 2280
H_FEDERALIST_61.txt 1552
H_FEDERALIST_65.txt 2042
H_FEDERALIST_66.txt 2296
H_FEDERALIST_67.txt 1673
H_FEDERALIST_68.txt 1528
H_FEDERALIST_69.txt 3016
H_FEDERALIST_7.txt 2325
H_FEDERALIST_70.txt 3142
H_FEDERALIST_71.txt 1768
H_FEDERALIST_72.txt 2066
H_FEDERALIST_73.txt 2394
H_FEDERALIST_74.txt 1032
H_FEDERALIST_75.txt 1961
H_FEDERALIST_76.txt 2324
H_FEDERALIST_77.txt 2002
H_FEDERALIST_78.txt 3077
H_FEDERALIST_79.txt 1051
H_FEDERALIST_8.txt 2078
H_FEDERALIST_80.txt 2473
H_FEDERALIST_81.txt 3938
H_FEDERALIST_82.txt 1561
H_FEDERALIST_83.txt 5859
H_FEDERALIST_84.txt 4217
H_FEDERALIST_85.txt 2678
H_FEDERALIST_9.txt 2011
J_FEDERALIST_2.txt 1686
J_FEDERALIST_3.txt 1466
J_FEDERALIST_4.txt 1654
J_FEDERALIST_5.txt 1365
J_FEDERALIST_64.txt 2332
M_FEDERALIST_10.txt 3030
M_FEDERALIST_14.txt 2168
M_FEDERALIST_37.txt 2742
M_FEDERALIST_38.txt 3352
M_FEDERALIST_39.txt 2629
M_FEDERALIST_40.txt 3053
M_FEDERALIST_41.txt 3575
M_FEDERALIST_42.txt 2806
M_FEDERALIST_43.txt 3463
M_FEDERALIST_44.txt 2925
M_FEDERALIST_45.txt 2146
M_FEDERALIST_46.txt 2641
M_FEDERALIST_47.txt 2778
M_FEDERALIST_48.txt 1897
Z_FEDERALIST_11.txt 2523
Z_FEDERALIST_12.txt 2184
Z_FEDERALIST_13.txt 984
Z_FEDERALIST_15.txt 3106
Z_FEDERALIST_16.txt 2068
Z_FEDERALIST_49.txt 1688
Z_FEDERALIST_50.txt 1132
Z_FEDERALIST_51.txt 1951
Z_FEDERALIST_52.txt 1870
Z_FEDERALIST_53.txt 2194
Z_FEDERALIST_54.txt 2026
Z_FEDERALIST_55.txt 2066
Z_FEDERALIST_56.txt 1601
Z_FEDERALIST_57.txt 2249
Z_FEDERALIST_58.txt 2113
Z_FEDERALIST_62.txt 2402
Z_FEDERALIST_63.txt 3055



Text Length for the King James Version of the New Testament

File Name Text Length
JOHN_1.txt 4585
JOHN_2.txt 4915
JOHN_3.txt 5452
JOHN_4.txt 5021
LUKE_1.txt 5917
LUKE_2.txt 5543
LUKE_3.txt 5631
LUKE_4.txt 5393
LUKE_5.txt 4606
LUKE_Acts_1.txt 6245
LUKE_Acts_2.txt 5877
LUKE_Acts_3.txt 5201
LUKE_Acts_4.txt 5191
LUKE_Acts_5.txt 2739
MARK_1.txt 5763
MARK_2.txt 5018
MARK_3.txt 5063
MTT_1.txt 5545
MTT_2.txt 5177
MTT_3.txt 4597
MTT_4.txt 4852
MTT_5.txt 4584
PAUL_1_Cor_1.txt 4715
PAUL_1_Cor_2.txt 4747
PAUL_2_Cor.txt 6065
PAUL_Rom_1.txt 4898
PAUL_Rom_2.txt 4524
PETER_1.txt 2580
PETER_2.txt 1614

Text Length for the Koine Greek Text of the New Testament

File Name Text Length
BA_barnabas_01.txt 3222
BA_barnabas_02.txt 3479
CL_clement_01.txt 3261
CL_clement_02.txt 3565
CL_clement_03.txt 3247
JO_john_1.txt 5029
JO_john_2.txt 5014
JO_john_3.txt 5597
LU_luke_1.txt 6823
LU_luke_2.txt 6389
LU_luke_3.txt 6278
MA_mark_1.txt 5013
MA_mark_2.txt 6293
MT_matt_1.txt 6284
MT_matt_2.txt 6296
MT_matt_3.txt 5768
PA_1cor.txt 6832
PA_2cor.txt 4478
PA_romn.txt 7111
PE_1pet.txt 1684
PE_2pet.txt 1101
ZZ_hebrews.txt 4955

Appendix F: Additional Results from Function Word Analysis

Result of English Text using Feature Vector, Set A

Number of Training Data
5 10 15 20 22 True Result
AD AD AD AD AD AD
AD AD AD AD AD AD
RD AD AD AD AD AD
AD RD AD AD AD AD
BB BB BB BB BB BB
BB BB BB RD RD BB
BB BB BB BB BB BB
BB BB BB BB BB BB
AD CD CD CD CD CD
CD CD CD CD CD CD
BB BB CD CD CD CD
CD CD CD CD CD CD
HJ HJ HJ BB HJ HJ
HJ HJ HJ HJ HJ HJ
AD CD CD AD HJ HJ
HJ HJ HJ HJ HJ HJ
RD RD RD RD RD RD
RD RD RD RD RD RD
AD AD RD AD RD RD
RD RD RD RD RD RD
BB BB BB BB BB ZG
HJ ZG ZG ZG ZG ZG
ZG ZG ZG ZG ZG ZG
HJ ZG RD ZG ZG ZG

Result of English Text using Feature Vector, Set B

Number of Training Data
5 10 15 20 22 True Result
AD AD AD AD AD AD
AD AD AD AD AD AD
AD AD AD AD AD AD
AD AD AD AD AD AD
BB BB BB BB BB BB
BB BB BB BB BB BB
BB BB BB BB BB BB
BB BB BB BB BB BB
AD AD AD CD CD CD
CD CD CD CD CD CD
CD CD CD CD CD CD
CD CD CD CD CD CD
HJ HJ HJ HJ HJ HJ
HJ HJ HJ HJ HJ HJ
HJ HJ HJ HJ HJ HJ
HJ HJ HJ HJ HJ HJ
RD RD RD RD RD RD
RD RD RD RD RD RD
RD RD RD RD RD RD
RD RD RD RD RD RD
ZG ZG ZG ZG ZG ZG
ZG ZG ZG ZG ZG ZG
AD AD CD CD CD ZG
ZG ZG ZG ZG ZG ZG

Result of English Text using Feature Vector, Set C

Number of Training Data
5 10 15 20 22 True Result
AD AD AD AD AD AD
AD AD AD AD AD AD
AD AD AD AD AD AD
AD AD AD AD AD AD
BB BB BB BB BB BB
BB BB BB BB BB BB
BB BB BB BB BB BB
BB BB BB BB BB BB
CD CD CD CD CD CD
CD CD CD CD CD CD
CD CD CD CD CD CD
CD CD CD CD CD CD
HJ HJ HJ HJ HJ HJ
HJ HJ HJ HJ HJ HJ
AD HJ HJ HJ HJ HJ
HJ HJ HJ HJ HJ HJ
RD RD RD RD RD RD
RD RD RD RD RD RD
RD RD RD RD RD RD
RD RD RD RD RD RD
ZG ZG ZG ZG ZG ZG
ZG ZG ZG ZG ZG ZG
AD ZG ZG ZG ZG ZG
ZG ZG ZG ZG ZG ZG

Results of Federalist Papers using Feature Vector, Set A

Number of Training Data
Training 2 Training 3 Training 4 Training 5 Training All True Result
H H H H H H
H H H H H H
H H H H H H
H H H H H H
M H H H H H
M M M M M M
M M M M M M
M M M M M M
H M M M M M
H M H M M M
M M M M M M
H H H H H M
H H H H M M
H M H M H M
M M M M M M
H H H H M M
H M M M M M

Results of Federalist Papers using Feature Vector, Set B

Number of Training Data
Training 2 Training 3 Training 4 Training 5 Training All True Result
H H H H H H
M H H H H H
H H H H M H
H H H H H H
M H H H H H
M M M M M M
M H H M M M
M M M M M M
M M M M M M
H H M M M M
M H M M M M
H H H H H M
M M M M M M
H H M M M M
M M M M M M
M M H M M M
M M M M M M

Results of Federalist Papers using Feature Vector, Set C

Number of Training Data
Training 2 Training 3 Training 4 Training 5 Training All True Result
H H H H H H
M H H H M H
H H H H H H
H H H H H H
M H H H H H
M M M M M M
M M M M M M
M M M M M M
M M M M H M
M M M M M M
M M M M M M
M M M M M M
M M M M M M
H H M M M M
M M M M M M
M M M M M M
M M M M M M

Appendix G: Additional Word Recurrence Interval Results

Additional Results for Word Recurrence Interval Version 1.0

Table 24 : Accuracy rate of WRI for a threshold value of 0 with 5 training data for each author
Threshold Value : 0
Data Dimension Kernel Functions
Linear Quad Rbf Poly
5 0.41667 0.375 0.33333 0.375
10 0.458322 0.25 0.16667 0.45833
15 0.16667 0.16667 0.125 0.33333
20 0.16667 0.25 0.29167 0.20833
25 0.25 0.25 0.16667 0.20833
30 0.20833 0.25 0.20833 0.29167
35 0.33333 0.20833 0.25 0.20833
40 0.33333 0.25 0.25 0.375
45 0.41667 0.16667 0.25 0.25
50 0.41667 0.20833 0.33333 0.33333
55 0.54167 0.125 0.33333 0.25
60 0.29167 0.16667 0.375 0.16667
65 0.45833 0.125 0.20833 0.25
70 0.45833 0.125 0.25 0.20833

The graph for the linear kernel function is plotted and shown in Figure 40.

Figure 40 : Results for 5 training data for each author with the Linear kernel function

The plot shows that at the initial stage, varying the threshold value produces the same accuracy results, which shows consistency in the WRI. However, as the data dimension increases, the overall accuracy increases until a point where the threshold value starts to affect the outcome. It is also noticed that the accuracy increases with the data dimension up to a point, after which it decreases. This is due to the model overfitting the data as too many features are input into the SVM, which then treats the extra features as noise. It can be seen from the graph that the highest achievable prediction accuracy is approximately 55%. However, the results are very inconsistent at that point, as can be seen on the graph at a data dimension of 55. Hence it is preferable to use a data dimension of 10, where the threshold parameter does not affect the result.
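A minimal sketch of how such a sweep over the data dimension could be reproduced is given below, assuming the WRI features are already arranged as a matrix. The synthetic data and the linear-kernel SVM are stand-ins, so the printed accuracies are illustrative only and will not match Figure 40.

  import numpy as np
  from sklearn.svm import SVC
  from sklearn.model_selection import cross_val_score

  rng = np.random.default_rng(2)
  X_full = rng.random((120, 70))            # stand-in for the WRI feature matrix (120 texts, 70 features)
  y = np.repeat(np.arange(6), 20)           # six authors, twenty texts each

  # Sweep the data dimension and record the cross-validated accuracy of a linear SVM;
  # on real features the accuracy typically rises, peaks, then falls once overfitting sets in.
  for dim in range(5, 75, 5):
      acc = cross_val_score(SVC(kernel="linear"), X_full[:, :dim], y, cv=5).mean()
      print(f"dimension={dim:3d}  accuracy={acc:.3f}")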

Figure 41 : Results for 22 training data for each author with the Linear kernel function

The plot above shows that as the data dimension and the threshold value increase, the accuracy remains consistent in the initial phase, the same behaviour as in the previous result. Furthermore, it is noted that the overall prediction accuracy increases slightly as the amount of training data increases. In addition, once the data dimension becomes large, the overall accuracy decreases regardless of the threshold value. This is because, as stated above, overfitting of the model occurs and the SVM treats the extra features as noise. Since the amount of training data increases in this test, the noise eventually increases as well, hence the decrease in accuracy.

Additional Results for Word Recurrence Interval Version 2.0

Figure 42 : Results for WRI with Function Words

It can be seen from the plot that increasing the number of training data for each author does not increase the overall accuracy. Furthermore, varying the training data does not increase the accuracy but instead introduces more inconsistency into the authorship attribution. Additionally, increasing the number of training data has little effect on the results because the text lengths of the training data are not approximately the same, which might explain the low prediction accuracy.

With this in mind, the team concluded that WRI needs the ability to choose the training data based on the text length of the data itself, so that each training text has approximately the same length; this is implemented in the following section. A small sketch of such length-based selection is shown below.
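The helper below is a hypothetical illustration of length-based selection and is not the project's implementation; the word counts are taken from Appendix E and the target length and tolerance are assumed values.

  def select_by_length(text_lengths, target_len, tolerance=0.2):
      """Keep only texts whose word count lies within +/- tolerance of target_len."""
      lo, hi = target_len * (1 - tolerance), target_len * (1 + tolerance)
      return [name for name, words in text_lengths.items() if lo <= words <= hi]

  # Toy word counts from Appendix E; only texts close to the target length are retained.
  corpus_lengths = {"PE_1pet.txt": 1684, "MA_mark_1.txt": 5013, "LU_luke_1.txt": 6823, "ZZ_hebrews.txt": 4955}
  print(select_by_length(corpus_lengths, target_len=5000))   # keeps MA_mark_1.txt and ZZ_hebrews.txt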

Additional Results for Word Recurrence Interval Version 3.0

Figure 43 : Results for 5 Training data for each author for WRI with selective Training Data

The plot above shows that as the data dimension increases, the overall accuracy increases up to a point and then decreases, indicating model overfitting. Furthermore, the same trend and pattern occur as in the previous results, where the results obtained are consistent in the initial phase. However, the overall accuracy was not satisfying, as the highest achievable accuracy was only approximately 35%.

Figure 44 : Results for 22 Training data for each author for WRI with selective training data

Looking at the results above, as the number of training data increases, the overall accuracy increases. Furthermore, the graph shows that the results tend to be more consistent regardless of the threshold value, except in the later stage where the data dimension is approximately 55.

Additional Results for the Federalist Papers (WRI)

Figure 45 : Result for Federalist Paper with 1 Training Data from Each Author

Looking at the figure above, the overall accuracy tends to be very low. This is due to the insufficient training data input to the algorithm. Hence the next test was to increase the number of training data to 5 for each author.

Figure 46 : Results for Federalist Paper with 5 Training Data from Each Author

Looking at the graph above, as the number of training data increases, the overall accuracy increases.

Figure 47 : Results for Federalist Paper with all Training Data being used

The results show that the overall accuracy actually decreases. This is because the training data is not balanced: it consists of 46 works written by Hamilton (H), 5 by Jay (J) and 14 by Madison (M). The training data is therefore biased towards the works written by Hamilton, causing the algorithm to predict most of the works as written by Hamilton. One possible mitigation is sketched below.
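One possible way to reduce this bias, not implemented in this project, is to weight the SVM classes inversely to their frequency so that the majority class does not dominate the decision boundary. The sketch below uses scikit-learn's class_weight option on placeholder data shaped like the 46/5/14 split described above; the data and accuracies are illustrative only.

  import numpy as np
  from sklearn.svm import SVC
  from sklearn.model_selection import cross_val_score

  rng = np.random.default_rng(3)
  # Imbalanced placeholder data mimicking 46 Hamilton, 5 Jay and 14 Madison papers.
  X = rng.random((65, 40))
  y = np.array(["H"] * 46 + ["J"] * 5 + ["M"] * 14)

  plain    = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
  weighted = cross_val_score(SVC(kernel="linear", class_weight="balanced"), X, y, cv=5).mean()
  print(plain, weighted)    # balanced class weights penalise always predicting "H"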

PDF Version

File:Authorship Detection 2010 Final Report.pdf
