Semester B Final Report 2014 - Cracking the Voynich code


Executive Summary

The Voynich Manuscript is a 15th century manuscript written in an unknown alphabet. Its mysterious nature has attracted linguists and cryptographers over the last hundred years, each attempting to "crack the Voynich code". However, despite a great deal of research into the text and illustrations of the manuscript, no one has ever conclusively deciphered a single word. This project expands upon past research into the linguistic features of the manuscript. Computational analysis techniques such as Word Recurrence Intervals and N-gram relationships, along with supervised learning algorithms such as Support Vector Machines (SVM) and Multiple Discriminant Analysis (MDA), are applied to an electronic transcription of the text. The team evaluates the use of these classification methods, and also develops new ways to identify grammar and syntax in the Voynich language.

Acknowledgement

The team wishes to extend their thanks to Professor Derek Abbott, Dr Brian Ng and Maryam Ebrahimpour for their guidance, support and ideas throughout the course of this project. Without their help, this project would not have gone as smoothly as it did.


Introduction

Background

The Voynich Manuscript is a mysterious book written in an unknown alphabet. So little is known about the nature and origins of the manuscript that it has been named after Wilfrid Michael Voynich, the collector who purchased it in 1912 from a castle in Italy. The manuscript has been verified by radiocarbon dating (at the University of Arizona) as belonging to the early 15th century[1], and appears to be a herbal (or medicinal) manual from this time period in Europe. However, despite a great deal of research into the text and illustrations of the manuscript, no one has ever conclusively deciphered a single word. The mysterious nature of the manuscript has attracted linguists and cryptographers over the course of the last hundred years, each attempting to "crack the Voynich code".

The manuscript itself is made up of several folios, numbered from f1 to f116. Each folio consists of two pages (labelled r and v), with the exception of ten foldouts of up to six pages. Although the page numbering was clearly added after the manuscript was written, there are gaps in the numbering which indicate missing folios, and indications that some folios have been reordered long after their completion. These oddities have led scholars to believe that the manuscript may have had several owners. Certain pages also contain a "key-like" sequence of characters, which is why many leading cryptographers of the time (and indeed, today) believe it to be a key cipher. [2]

The first major work into an electronic transcription of the manuscript was performed by famous cryptologist William Friedman, and a team called the ‘First Study Group’ (FSG). The FSG developed an early method with which they were able to convert between the characters in the manuscript and computer-readable text. Since this first work, several other research groups have developed alternate transcriptions of the text, each of them attempting to correct the errors of previous work and create a definitive data set. [2]

One of these transcriptions was developed by Captain Prescott Currier, who was the first to identify, based on handwriting and word usage, the possibility that the manuscript had two or more authors (he noted up to six distinct handwriting styles) [3] [4]. Currier identified two main ‘languages’ (which have since been referred to as Currier A and Currier B) and divided most of the pages in the manuscript into one of these two categories. This division of languages has since been supported by several experiments using computational cluster analysis. [5] [6]

Between 1996 and 1998, several academics began work on what they hoped would be a complete database of Voynich transcriptions. This database was known as the “Interlinear Archive of Electronic Transcriptions in EVA” (referred to as the Interlinear Archive), and came to include several partial transcriptions, such as those developed by Currier and Friedman, along with a complete transcription by Takeshi Takahashi. [4] All the transcriptions in this file were converted to a newly developed alphabet called EVA (the European Voynich Alphabet), which attempted to correct the errors of previous alphabets (which often oversimplified the characters in the manuscript and ignored rare characters) by using ligature analysis on the handwriting. [7]

The authorship of the manuscript has been heavily disputed. Some popular candidates are Roger Bacon, John Dee, and Leonardo da Vinci.[8]

See the Voynich Data Table (Appendix A) for more information about the sectioning, language, and page order of the Voynich Manuscript as it currently exists within the Yale Beinecke Library.

Technical Background

This project is heavily dependent on data mining techniques, as it involves the analysis and extraction of large quantities of data from different sources. Data mining can be used to analyse and predict behaviours, find hidden patterns and meanings or compare different data sets for commonalities or correlation. Authorship detection (stylometry) involves a subset of data mining techniques that help determine the authenticity of works and the possible authors of undocumented texts. The techniques involved in this project are also used in applications such as search engine development, code analysis, language processing, and plagiarism detection.

Part of the complexity involved in decoding the Voynich Manuscript (VMS) is the difficulty of transcription. The VMS has been transcribed many times by many different scholars, each attempting to create a universally acceptable transcription, but each has had to make limiting decisions about the structure of the manuscript. Due to the nature of the writing, what one scholar reads as two characters may be interpreted by another as a single combined character, and the spacing between words in the manuscript is notoriously vague. This ambiguity reduces the effectiveness of standard cryptographic and linguistic techniques for observing the relationships between characters and between separate word tokens.

Previous Studies

The fame of the Voynich Manuscript has led to a large amount of research into its origin, although this has been of varying quality. In the past, notable code-breakers (including the NSA) have attempted to crack the manuscript, but recent work has been done primarily by a small group of academics and amateurs, who have focused on data mining various electronic transcriptions. [2] In this section, we will detail a few experiments which relate to our own analysis.

Mary D'Imperio's 1978 Paper

Prescott Currier’s work and that of many other code-breaking groups were collected and curated by Mary D’Imperio, who worked on the manuscript for several years during her time as an NSA researcher. [3] D’Imperio’s paper, titled “The Voynich Manuscript: An Elegant Enigma”, which collected and analyzed the research to date, has become the most cited reference work on the Voynich Manuscript since it was first published in 1978. [2] In this document, D'Imperio states that most research attempting to match the herbal drawings to real plants has produced disappointingly vague results, but also claims that some plants have been indisputably identified as European. D'Imperio further highlighted the need for strict adherence to the experimental method when dealing with computational analysis of the manuscript, claiming that this was necessary to avoid vague and meaningless conclusions.[3]

Reddy and Knight's 2011 paper titled "What We Know About the Voynich Manuscript" provides a useful summary of linguistic analysis into the manuscript. In particular, Reddy and Knight used an unsupervised classification algorithm (based on Hidden Markov Models) to separate vowels and consonants. They found that, in this area, the text is similar to "abjad" languages (such as modern Hebrew) which do not have vowels in the conventional sense of the term.[6] Jorge Stolfi, on the other hand, did an experiment looking at the word length distribution of the Voynich and of various other languages and concluded that, in this area, the text is similar to East Asian languages such as Chinese and Vietnamese. [9] We used the experiments of these researchers as part of the basis for our comparison corpus, but it is clear from the two examples above that different classification methods produce vastly different conclusions about the nature of the language in the Voynich.

Another researcher, Rene Zandbergen, recently developed an experiment to determine whether the language used on a page is related to the basic illustration type.[5] He concluded that, although the illustration types can be separated by supervised learning algorithms which look only at the words on the page, this doesn't necessarily mean that the text relates to the illustrations themselves. He drew attention to the fact that pages in the Currier A language appeared (to his algorithm) as separate from those in the Currier B language, even when the pages had similar illustrations.

This is the first year that a project attempting to decode the manuscript has been run at Adelaide University, but similar projects have looked at text classification in previous years. Notable examples include ‘Cipher Cracking’ and ‘Authorship Detection: Who wrote the letters to the Hebrews?’. Both of these projects developed techniques which may be applicable to the VMS, including textual comparison features such as Common N-grams and Word Recurrence Intervals (WRI) along with the use of supervised machine learning algorithms such as Support Vector Machines (SVM).

Project Objectives

As this is the first year in which an Honours Project has investigated the Voynich Manuscript, the long term objectives have been flexible and subject to change. Furthermore, given the large body of work already in existence on the manuscript, it was unlikely that even a partial decoding could be produced by our team within a year. Instead the primary focus of this project was to research and understand some features of the manuscript, and effectively analyse our data. We aimed to develop ideas and code which could contribute to the overall project outcome and aid future students with their research. The broad goals of our project included:

  • Developing possible methods and statistics that could be used to compare an unknown language with a known language or data set
  • Comparing the linguistic characteristics and features of the Voynich Manuscript with relevant languages and authors
  • Theorising as to whether the language contained within the manuscript is real, a code or a hoax
  • Developing a code base, documentation, and clear analysis to aid future projects

That being said, we had certain concrete milestones that we hoped to achieve this year, and certain research areas which we considered to be most worthwhile. These included:

  • Word Recurrence Interval as a language independent statistic
  • Identification of words which may relate to the illustrations in the manuscript
  • SVM and MDA text classification

Approach and Stages

In the early planning stages, we looked at our goals (concrete and flexible) and decided to divide our work into five stages. Each stage was dependent on areas from the previous stages and allowed continuous development and improvement as our understanding of the manuscript progressed. Early stages involved basic characterisation of the manuscript, and later stages involved text classification algorithms and research-based strategies. The stages were split evenly between team members (detailed in the Project Management section), with the split designed to take advantage of the unique skills of each team member. The five stages are listed in more detail within the body of this report.

Stage 1: Text Processing and Early Characterisation

The Interlinear Archive

The Interlinear Archive, mentioned in the Previous Studies section above, was our primary source of text for the Voynich Manuscript. The IA is designed with the maximum amount of detail in mind, and includes 19 partial transcriptions of the Voynich Manuscript. All transcriptions have been converted into the EVA alphabet, and are displayed in interlinear format for ease of comparison.

The IA also contains a large amount of metadata for each page. This metadata includes a description of the format and illustration of each page and commentary on interesting features of the text, along with sectioning data such as the Currier language, page type, and illustration type. We used this sectioning data to compare partial segments of the Voynich, particularly in the early stages of the project.

For reference, the details of the Interlinear Archive format and an example page are displayed at Interlinear Archive Example (Appendix D).

Although the IA is extremely detailed, it is a difficult candidate for computational analysis. For the majority of our analysis, we selected a single transcription from the file and processed it to remove line labels and illustrative labels. We also joined characters which were separated by “ambiguous” spacing in the manuscript, and removed characters which were marked as unknown in the transcription. A diagram showing the stages of processing for a single ‘paragraph’ of text (from the Takahashi transcription) is shown below.

Text processing stages

This process simplified the text on each page into a group of lines, with each line consisting of one or more space separated tokens, and allowed us to easily look at word and character relationships. However, it is undeniable that the text we analysed is only an approximate picture of the full Voynich Manuscript itself. It is clear from our research into previous decoding efforts that the transcription difficulties inherent to an old text like the Voynich present a major problem for all computational analysis.
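To make these processing steps concrete, the following is a minimal Python sketch of the kind of cleanup described above, assuming the usual Interlinear Archive conventions (a leading <...> locator, {...} transcriber comments, '.' for a word space, ',' for an ambiguous space, and '?'/'*' for unreadable characters). The function name and example line are illustrative rather than taken verbatim from our project code.

    import re

    def clean_eva_line(raw_line):
        """Reduce one interlinear transcription line to a list of word tokens."""
        line = re.sub(r'^<[^>]*>', '', raw_line)   # strip the page/line locator tag
        line = re.sub(r'\{[^}]*\}', '', line)      # strip inline transcriber comments
        line = line.strip().rstrip('-=')           # drop end-of-line markers
        line = line.replace(',', '')               # join words across ambiguous spaces
        line = re.sub(r'[?*!%]', '', line)         # remove unknown/filler characters
        return [t for t in line.split('.') if t]   # '.' separates word tokens

    # Illustrative only (not a verbatim archive line):
    # clean_eva_line('<f1r.P1.1;H> fachys.ykal.ar,ataiin.shol.shory-')
    # -> ['fachys', 'ykal', 'arataiin', 'shol', 'shory']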

Transcriptions

Figure 1.1: Transcription Completeness

In order to select the best transcription for our purposes and to aid future projects, we developed a few metrics to compare the 19 transcriptions available in the Interlinear Archive.

We first looked at the “completeness” of each transcription, which was measured by the transcribed percentage of all lines included in the EVT. The results of this test can be seen in Figure 1.1. By this metric, the Takahashi transcription was a clear victor, as it included data for 97.2% of the 5239 lines in the manuscript (21% more than any other transcription). Of the lines not included in the Takahashi transcription, nearly all were short labels from the astronomical and cosmological folios that were transcribed by Stolfi and/or Grove.

We also looked at the accuracy of each transcription, as recorded in the EVT. This was difficult to test rigorously without developing our own transcription from scans of the manuscript, so we used an approximate method that tested the number of characters for which the transcription agrees with the ‘majority rule’ (the character selected by the most other transcriptions). Takahashi again leads in this metric, with an “accuracy” of 99.1%.
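The following is a minimal sketch of both metrics, assuming each transcription is held as a dictionary mapping line identifiers to transcribed text (a simplification of the real interlinear format); the function names are ours.

    from collections import Counter

    def completeness(transcription, all_line_ids):
        """Fraction of all manuscript lines for which this transcription has text."""
        return len(set(transcription) & set(all_line_ids)) / len(all_line_ids)

    def majority_rule_accuracy(transcription, other_transcriptions):
        """Fraction of characters agreeing with the 'majority rule' character,
        i.e. the character chosen by most other transcriptions at that position."""
        agree = total = 0
        for line_id, text in transcription.items():
            rivals = [o[line_id] for o in other_transcriptions if line_id in o]
            for pos, ch in enumerate(text):
                votes = Counter(r[pos] for r in rivals if pos < len(r))
                if not votes:
                    continue
                majority = votes.most_common(1)[0][0]
                agree += (ch == majority)
                total += 1
        return agree / total if total else 0.0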

Based on these tests, and our research into other work, we decided to use the Takahashi transcription for all of our analysis.

For more information on our metrics, see Electronic Voynich Transcriptions (Appendix C).

The UN Declaration of Human Rights (UDHR)

The initial plan for the project was to use the UN Declaration of Human Rights (abbreviated UDHR), a short statement about human rights with several accurate translations, to compare a broad selection of languages with the unknown text of the Voynich Manuscript. The UDHR is currently available as a corpus of 382 translations, and can be used as a proverbial “rosetta stone” for language comparisons. We hoped that statistical methods applied to the UDHR could provide a shortlist of languages similar to the VMS, which we could then use to build a more detailed corpus for authorship testing.

However, we found that the length of these translations (which had an average of 1800 word tokens) wasn’t long enough for rigorous comparisons of features such as Word Recurrence Interval, N-grams, and Word Length Distribution. Furthermore, the statement enclosed in each translation was extremely structured, and did not match the format of text in the Voynich manuscript. We used the UDHR for certain tests over the course of the year, but also built another corpus for the majority of our analysis.

The UDHR corpus is entirely encoded in UTF-8. A full table of languages available in the UDHR corpus can be found in UDHR Full Language List (Appendix D)

The Comparison Corpus

We compiled a corpus of texts for comparison against the Voynich Manuscript. The aim of this corpus was to test the data mining methods we developed, and to provide a more rigorous comparison than the UDHR. We generally selected languages that had been linked to the manuscript by previous researchers. We attempted to select texts which would provide comparable characteristics of line length and content, but were sometimes limited by what we could find available. After retrieving the texts from online sources, we cut them down to approximately 38,000 words and preprocessed them for use with our existing code. The selected languages and files are shown in the table below.

No. | Language | Author | Title | Token Count | Source
1 | English | Sir Arthur Conan Doyle | The Adventures of Sherlock Holmes | 37,932 | Project Gutenberg
2 | English | Jane Austen | Pride and Prejudice | 38,164 | Project Gutenberg
3 | English | Unknown | KJV Bible: Exodus and Ezekiel Books | 38,423 | Project Gutenberg
4 | Latin | Cicero | Selected Works | 38,026 | http://www.thelatinlibrary.com/cic.html
5 | Hungarian | Zoltán Ambrus | Álomvilág Elbeszélések | 37,516 | Project Gutenberg
6 | Hebrew | Unknown | Old Testament: Exodus and Ezekiel Books | 36,000 | http://www.tanach.us/Tanach.xml
7 | Russian | Fyodor Dostoyevsky | Crime and Punishment | 38,101 | http://az.lib.ru/d/dostoewskij_f_m/text_0060.shtml
8 | Italian | Dante | The Divine Comedy | 38,113 | http://world.std.com/~wij/dante/
9 | Chinese | Unknown | Hand Spaced Training Corpus | 40,215 | http://sighan.cs.uchicago.edu/bakeoff2005/
10 | Chinese (Pinyin) | Unknown | Hand Spaced Training Corpus | 38,086 | http://sighan.cs.uchicago.edu/bakeoff2005/

The comparison corpus is entirely encoded in UTF-8.

Words, Sections, and Alphabet

Some basic statistics of the text are shown in the table below. In particular, it's worth noting that certain sections are almost entirely written in Currier Language A or Currier Language B, with the Herbal section written in a combination of the two. We split the text into sections using Jorge Stolfi's indicators from the Interlinear Archive (which relate to language and illustration type), and used Currier's original categorisation for language A and B.

The full EVA alphabet available in the Takahashi transcription contains 47 distinct “letters”, which are used for all 191,666 characters in the transcription. Many of these letters, however, appear very rarely in the manuscript. With these “weirdo” letters removed, the alphabet can be reduced to just 21 “common” characters. For comparison, “The Adventures of Sherlock Holmes” (in English) has a common alphabet of 53 characters and a full alphabet of 65 characters, and Hebrew without vowel-like accents as seen in selected books of the Old Testament has a common alphabet of 27 characters, with no rare characters.

Section | Primary Currier Language | Number of Pages | Number of Tokens | Number of Word Types | Words Per Page (Avg.) | Full Alphabet Length | Common Alphabet Length
Cosmological | Unknown | 20 | 3008 | 1521 | 150 | 27 | 24
Biological | B | 20 | 6917 | 1549 | 346 | 21 | 18
Herbal A | A | 97 | 7956 | 2492 | 82 | 32 | 21
Herbal B | B | 32 | 3442 | 1349 | 108 | 23 | 20
Recipes | B | 25 | 11417 | 3328 | 457 | 29 | 19
Pharmacological | A | 18 | 2573 | 1139 | 143 | 21 | 19
Zodiac | Unknown | 12 | 1331 | 808 | 111 | 20 | 19
Unknown | Unknown | 12 | 1276 | 708 | 106 | 28 | 24
Missing | — | 20 | 0 | 0 | 0 | 0 | 0
Full Manuscript | — | 256 | 37945 | 8105 | 161 | 47 | 21

Notes: The page count for the full manuscript includes missing pages, but the average words per page does not. The common alphabet is the set of characters that makes up ~99.95% of all characters.
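A minimal sketch of how a "common alphabet" of this kind can be computed, assuming the ~99.95% coverage threshold quoted in the notes above; the function name is ours.

    from collections import Counter

    def common_alphabet(text, coverage=0.9995):
        """Smallest set of characters whose occurrences cover `coverage` of the text."""
        counts = Counter(ch for ch in text if not ch.isspace())
        total = sum(counts.values())
        kept, running = [], 0
        for ch, n in counts.most_common():
            kept.append(ch)
            running += n
            if running / total >= coverage:
                break
        return kept

    # common alphabet length = len(common_alphabet(text));
    # full alphabet length = number of distinct non-space characters in the text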

Zipf's Law

Figure 1.2: Zipf's Law on Natural Languages for Russian

Zipf’s law states that, in a natural language corpus, the frequency of any word is approximately inversely proportional to its rank in the frequency table, so the rank–frequency plot falls away rapidly as rank increases (roughly a straight line on log–log axes). Conformance to Zipf’s law is therefore often taken as evidence that a corpus is natural language. We used Zipf’s law to assess where sections of the Voynich Manuscript sit on the spectrum between a natural corpus and a possible hoax. Because the law is author independent, it could be applied to the manuscript without introducing authorship bias. Normalised frequencies were used so that the statistics were independent of text length, ensuring a fair comparison regardless of how long each text is. It should be noted that this is only one method of assessing whether a corpus is natural. In our results, Zipf’s law approximately holds for the Voynich Manuscript, although less strongly than for the eight selected comparison texts. The closest match was Hebrew; however, the peak frequency for the Voynich text was roughly 25% lower than for Hebrew, with a sharp drop-off more similar to Chinese.
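A minimal sketch of the normalised rank–frequency calculation behind these plots; the file name is a placeholder and the plotting choices are illustrative.

    from collections import Counter
    import matplotlib.pyplot as plt

    def zipf_curve(tokens):
        """Return (ranks, normalised frequencies), sorted by descending frequency."""
        freqs = sorted(Counter(tokens).values(), reverse=True)
        total = sum(freqs)                                   # normalise out text length
        return list(range(1, len(freqs) + 1)), [f / total for f in freqs]

    if __name__ == '__main__':
        tokens = open('voynich_takahashi.txt').read().split()    # placeholder file name
        ranks, norm_freq = zipf_curve(tokens)
        plt.loglog(ranks, norm_freq)            # Zipf's law: roughly a straight line here
        plt.xlabel('Rank'); plt.ylabel('Normalised frequency')
        plt.show()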

For the complete graphs of Zipf's Law, see: Zipf's Law and Test Languages (Appendix E)

Word Length Distribution

Figure 1.3: Word Length versus Frequency for 8 comparison texts

We looked at the word length distribution of the Voynich Manuscript and eight texts from our comparison corpus. Each text was run through a custom Matlab function which calculated the length of every word token and counted the frequency of each word length. The resulting data (Figure 1.3) shows, for example, that five character words are most common in the Voynich Manuscript. Chinese was difficult to use as a comparison, as it tends to have many single and double character words, which is reflected in the peak at two characters. Interestingly, the Voynich Manuscript stands out as having a single peak followed by a smooth roll-off for longer words. Of the remaining seven test texts, the closest match seems to be Hebrew, whose peak is shifted by one character length but has a similar peak frequency.

The number of words compared for each text was similar to the number of words contained within the manuscript itself, which ensures that our statistics are directly comparable.

The same analysis was run for all 382 languages in the UDHR, with the frequency normalized by the number of words present for each language and converted into a percentage, allowing similar comparisons. The results show that some languages appear statistically similar to the Voynich text, but none closely enough to support a firm conclusion.
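A minimal sketch of the word length distribution calculation, normalised to percentages as described above; the function name is ours.

    from collections import Counter

    def word_length_distribution(tokens):
        """Percentage of word tokens at each word length."""
        lengths = Counter(len(t) for t in tokens)
        total = sum(lengths.values())
        return {n: 100.0 * c / total for n, c in sorted(lengths.items())}

    # e.g. word_length_distribution('daiin ol chedy qokeedy'.split())
    # -> {2: 25.0, 5: 50.0, 7: 25.0}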

The full data for the test on the UDHR can be seen here: File:UDHR Word Length.xls

Stage 2: Picture Association and Frequency

The results of Stage 2 were somewhat inconclusive with respect to finding picture descriptors in the text. The Zandbergen picture sections proved to be too large to produce much meaningful data about which words correspond to which picture types. The Herbal section, for example, contains over 2000 word types which don't appear in any other section. Finding the unique words on each page could provide more meaningful conclusions, but the average number of unique words per page was still too high for proper analysis. Perhaps all of these words relate to the picture on the page, but it is impossible to tell which (if any) without further information.

Figure 2.1: Unique words per page

The testing did produce other information about the VMS, however. When the unique words per page are normalised against the number of words on the page, spikes can be seen around folios 66-73 and 86-90, both of which correspond to Astronomical and Cosmological sections of the manuscript. This leads us to believe that the words in those sections are predominantly labels, names, or acronyms of other words in the manuscript. There's also no drop in unique words per page in the "Recipes" section, which has no distinct pictures. This may be an indication that not many of the unique words on a page relate to the image, but further testing is needed to be sure.

We also looked at words which "burst" (are used much more often) on pages with a certain illustration type, by ranking words with a scoring method called TF-IDF. TF-IDF has previously been used in search engine algorithms to find the most relevant words on webpages, but has no established track record in linguistic analysis. We concluded from this test that certain words only "burst" in certain sections; the word "qol", for example, appears to have an important relationship to the Biological section. We could not, however, develop any meaningful analysis based on this data without more knowledge about the usefulness of TF-IDF in this area.
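A minimal sketch of the TF-IDF scoring used to rank "bursting" words, treating each illustration section as a single document. The particular weighting used here (relative term frequency times log inverse document frequency) is one common variant and is an assumption on our part, as is the data layout.

    import math
    from collections import Counter

    def tfidf_by_section(sections):
        """sections: dict mapping section name -> list of word tokens.
        Returns, per section, a {word: tf-idf score} dictionary, treating
        each section as a single document."""
        doc_freq = Counter()
        for tokens in sections.values():
            doc_freq.update(set(tokens))
        n_docs = len(sections)

        scores = {}
        for name, tokens in sections.items():
            tf = Counter(tokens)
            scores[name] = {w: (c / len(tokens)) * math.log(n_docs / doc_freq[w])
                            for w, c in tf.items()}
        return scores

    # sorted(scores['Biological'].items(), key=lambda kv: -kv[1])[:20] would list
    # the 20 most strongly "bursting" words in the Biological section.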

Stage 3: WRI vs Rank Investigation

WRI versus Rank plot for the 8 tested languages

Word Recurrence Interval (WRI) is the interval between successive occurrences of a repeated word. It is a language-independent statistic commonly used to compare specific texts with a corresponding author. Our method is modified: instead of comparing authors, we tested the languages commonly hypothesised as possible origins of the Voynich Manuscript. The scaled standard deviation of WRI was used, with a minimum of three recurrences as a baseline index, and the top 100 ranked words were used as the dataset, which was later expanded to the UDHR languages for comparison purposes. The output is scaled to remove the dependency on text length, keeping the analysis strictly independent of length and dependent only on the characteristics of the language. [10]
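A minimal sketch of the scaled WRI statistic, under the assumption that "scaled" means the standard deviation of a word's recurrence intervals divided by its mean interval (the exact scaling follows [10]); the function name is ours.

    import statistics

    def scaled_wri_std(tokens, min_recurrences=3):
        """Scaled standard deviation of word recurrence intervals for each word.
        Only words with at least `min_recurrences` occurrences are scored; the
        standard deviation of the intervals is divided by their mean so that the
        score does not depend on overall text length."""
        positions = {}
        for i, w in enumerate(tokens):
            positions.setdefault(w, []).append(i)

        scores = {}
        for word, pos in positions.items():
            if len(pos) < min_recurrences:
                continue
            intervals = [b - a for a, b in zip(pos, pos[1:])]
            scores[word] = statistics.stdev(intervals) / statistics.mean(intervals)
        return scores

    # The WRI-versus-rank plots use the top 100 words by this score:
    # top100 = sorted(scaled_wri_std(tokens).values(), reverse=True)[:100]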

A plot for the eight most plausible languages is shown here; the x-axis uses a logarithmic scale to better show the relationships, and the y-axis shows the scaled standard deviation of the WRI. A longer tail on a curve indicates increased variability in the words within a text, while greater height indicates larger differences in the distances between word recurrences. There appears to be a possible statistical relationship between the Voynich text and natural corpora, or more specifically this group of test texts.

The test was later extended to the UDHR language set to determine whether there was any significant statistical relationship to any of the languages contained within it, and to provide a starting point for expanding our tests at a later date. The resulting data set can be seen in the attached file. The best match found across the entire test was for the Recipes section (in Currier language B), which pointed towards a European language. However, due to the low percentage of matching, even this cannot be stated conclusively, and further investigation is needed.

File:UDHR Stats a.xls

Stage 4: Other Ideas and Research

Language Research

Time during the semester break was dedicated to studying the language and linguistics of texts written in a similar time period. It was hoped that this language analysis would lead to a better understanding of the intricacies of 15th century writing. Over the course of a few weeks, we collaborated with research librarians and studied a specific text written in the 1540s. This text was chosen because of the nature of its content, which discusses the medicinal uses of certain herbs (similar subject matter to what one might find in the Herbal section of the Voynich Manuscript). [11]

A sample of text from scans of a 1540s herbal manual

These findings were later used for further discussion and research. Some of the common issues found were:

  1. Writing from this period was quite relaxed about grammar. Words would be abbreviated, written in shorthand, or spelt in multiple ways.
  2. Letters were written differently depending on their position in the word. The structure of a letter at the beginning of the word was remarkably different to that of the same letter in the middle of a word.
  3. Letters were substituted with symbols depending on the author who wrote the text.
  4. It was often difficult to distinguish the spaces between words without the context of surrounding words.
  5. Capital letters were sometimes replaced with lower case equivalents and vice versa
  6. Words could be duplicated on either side of paragraph and page ends.
  7. Some words were continued onto the next line and a special indicator mark was used to show that the word was split. This mark was not used in all cases and depended on the free space left on a page or if the author felt like including it.

Entropy

Shannon entropy is the average amount of information contained in each token of a data stream, and characterizes the uncertainty, or lack of predictability, of that stream. The highest entropy for a given data stream occurs when the token values are uniformly distributed.

For a fixed-length text, the Shannon entropy in bits is equivalent to the minimum average number of bits needed to encode each textual token. For n-grams, it can be calculated with the formula:

H_n = -\sum_{s_1,\dots,s_n} p^{(n)}(s_1,\dots,s_n) \, \log_2 p^{(n)}(s_1,\dots,s_n)

where p^(n)(s_1, ..., s_n) is the probability of an n-gram occurring, estimated as the number of occurrences of that n-gram divided by the total number of n-grams in the text. For unigram word entropy, this is simply the number of occurrences of a word s_1 divided by the total token count. [12]
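A minimal sketch of this n-gram word entropy calculation using empirical probabilities; the function name is ours.

    import math
    from collections import Counter

    def ngram_entropy(tokens, n=1):
        """Shannon entropy (in bits) of word n-grams, using empirical probabilities."""
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total = len(grams)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(grams).values())

    # unigram, bigram and trigram word entropy of a token list:
    # h1, h2, h3 = (ngram_entropy(tokens, n) for n in (1, 2, 3))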


Figure 4.1: Shannon Entropy Comparison

We looked at the ngram word entropy of the full VMS and compared it to that of similar length texts (Note that Shannon entropy is indirectly dependent on text length for natural language texts[12]). This data can be seen in Figure 4.1. We found that, for unigram word entropy, the VMS had a “middle of the road” value of 10.46 bits, which is close to the average of 10.25 bits. For bigrams, however, the VMS had an entropy value of 14.81 bits, which is close to the highest value of 14.83 bits (Simplified Chinese). The relatively high bigram entropy of Voynichese, especially given its comparatively normal unigram entropy value, indicates that there are relatively few repeated bigrams in the manuscript, and that the language has a relatively weak “word order” (predictability of a word given the previous word). Trigram entropy was of limited use, as all comparison texts had similar values.

N-Grams

A collocation is a word combination that occurs more often than would be expected by chance. An oft-cited English example of a collocation is the phrase “strong tea”; just based on the definitions of the words, it would be expected that “powerful tea” is a valid substitution, but this is not the case in empirical analysis.

We used a scoring method called Pointwise Mutual Information (PMI) to look at collocations in the VMS and in known comparison texts. The PMI of a combination of words is given by the formula:

PMI(x, y) = \log_2 \frac{p(x, y)}{p(x) \, p(y)}

where p(x) and p(y) are the probabilities of word x and word y appearing in the text, and p(x, y) is the probability of their co-occurrence. PMI is one of a number of accepted methods that quantify the likelihood of word association. Although generally used for bigram association, it can be generalized to trigrams and quadgrams.

We searched for collocations by collecting all of the n-grams in a text, filtering out combinations of rare words (rare words are known to have biased PMI scores), and ranking the rest by score.
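A minimal sketch of this search for bigram collocations; the rare-word threshold of five occurrences is an illustrative choice, not the exact cut-off we used.

    import math
    from collections import Counter

    def bigram_pmi(tokens, min_count=5):
        """Pointwise Mutual Information for bigrams, skipping rare words
        (rare words are known to produce inflated PMI scores)."""
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        n = len(tokens)

        scores = {}
        for (x, y), c_xy in bigrams.items():
            if unigrams[x] < min_count or unigrams[y] < min_count:
                continue
            p_x, p_y = unigrams[x] / n, unigrams[y] / n
            p_xy = c_xy / (n - 1)
            scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
        return scores

    # ranked = sorted(bigram_pmi(tokens).items(), key=lambda kv: -kv[1])
    # lists the strongest collocations first.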

In the known natural languages used for comparison, highly ranked n-grams are often indications of a group of proper nouns, such as “Tottenham Court Road” or “Scotland Yard” or numerical values (“50 guinea fee”) as well as indications of an idiom (“friendly footing”, “saucer of milk”, or “door flew open”).

A comparison of the bigrams in the Voynich against our 10 known texts is shown in Figure 4.2. While bigram collocations have varying significance in the known languages, they are clearly of far lower significance in the VMS, which shows a steep roll-off and lower rankings overall. This supports our analysis of the bigram entropy in the VMS, and again indicates the unusually weak word order in the manuscript (weaker even than Hungarian, which is generally considered to have a weak word order[6]).

Figure 4.2: Bigram Ranking Comparison

This could indicate that the manuscript is gibberish (a hoax) or some type of code. Ciphers are often designed to have weak word order and high entropy in order to prevent decryption.

Punctuation

In past experiments, researchers have looked for punctuation as symbols that occur “only at word edges, whose removal from the word results in an existing word”, and discovered that there is likely no punctuation by this definition in the Voynich Manuscript. [6]

Based on an observation by Nick Pelling that ‘am’ character groupings appear particularly often at the end of lines (EOL), and his theory that these characters may be used to join words across line boundaries in a similar manner to hyphens in English[13], we looked at the EOL character groupings in both the VMS and our comparison texts. The results for single characters in the VMS and in Hungarian are shown in Figure 4.3. The Hungarian chart provides a good example of what we saw for all the comparison texts, which was that only punctuation characters (if they existed) were used significantly at the end of a line.

Figure 4.3a: EOL Characters in Hungarian
Figure 4.3b: EOL Characters in the Voynich Manuscript

In the VMS, on the other hand, the characters which appear significantly in line-end positions generally appear to be part of the word itself. The characters ‘m’ and ‘g’ are single characters that are used at the boundary of a line more than two-thirds of the time. We further noticed that the relationship between these characters and line breaks was strengthened when we looked only at the "Currier B" language.

While we were unable to determine the cause of this behaviour in the Voynich Manuscript, it is definitely anomalous when compared to our corpus. It’s possible the VMS has non-standard punctuation which is masquerading as part of the language. Our transcription of a 1500s herbal manuscript supports this conclusion; punctuation in the herbal text did not follow strict rules and often appeared at a glance to be a part of the word. When we ran this test on our herbal transcription, we saw similarly significant EOL behaviour from the known punctuation characters. Alternately, this could be more evidence for the language as a code, which would possibly use characteristics of the line to mask plaintext. We’d like to test the behavior of a known renaissance cipher in this area, but haven’t been able to find any electronic transcriptions. Finally, we speculated that the VMS could be written in verse and have corresponding rules for line structure, but the lack of any pattern in the physical formatting and line lengths makes this unlikely.
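One simple way to quantify the end-of-line behaviour discussed above is to compute, for each character, the fraction of its occurrences that fall at the end of the last word on a line. The sketch below is a minimal illustration of that measurement, not our exact code.

    from collections import Counter

    def eol_character_rates(lines):
        """For each character, the fraction of its occurrences that fall at the
        very end of a line (the last character of the last word)."""
        total, at_eol = Counter(), Counter()
        for line in lines:
            words = line.split()
            if not words:
                continue
            for w in words:
                total.update(w)
            at_eol[words[-1][-1]] += 1
        return {ch: at_eol[ch] / total[ch] for ch in at_eol}

    # Characters with rates well above their overall usage (e.g. 'm' and 'g' in the
    # VMS) stand out as anomalous end-of-line characters.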

Stage 5: SVM and MDA Authorship

In this stage, the focus was primarily on statistical classification using both Support Vector Machines (SVM) and Multiple Discriminant Analysis (MDA).

SVM Analysis

As briefly described earlier, an SVM is a binary classifier based on a kernel function, which can be as simple as a linear kernel. It works by fitting a hyperplane between two classes to find the best separating boundary between the data sets. In our case, SVMs were used to classify multi-class data sets into distinct classes; the unknown text, the Voynich Manuscript, was then added to see which language class it most closely related to. The output was intended to give a percentage accuracy (estimated using cross-validation techniques) and to assign the unknown dataset to one of the known groups. Because Matlab's built-in SVM only supports two-class models, we instead imported LIBSVM, a multi-class SVM library developed at National Taiwan University.[14]

Creating and using an SVM involves two distinct stages: training on known data, and classifying an unknown set of values. In the training step, known language corpora were fed into the classifier; the standard 8 test languages were used as training data. A transcribed version of the Voynich Manuscript was then presented to the classifier to see which class it most closely matched.

The outputs from our SVM training and classification did not yield results as conclusive as we had hoped.
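Our SVM experiments were run in Matlab with the LIBSVM library; purely as an illustration, the sketch below does the equivalent in Python using scikit-learn's SVC (which is itself built on LIBSVM), with a word length histogram standing in for the feature vector. The feature choice, chunk size and function names are assumptions on our part.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    def length_features(tokens, max_len=15):
        """Word length histogram (percentages) as a fixed-length feature vector."""
        counts = np.zeros(max_len)
        for t in tokens:
            counts[min(len(t), max_len) - 1] += 1
        return 100.0 * counts / counts.sum()

    def chunks(tokens, size=2000):
        return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]

    def classify_voynich(training_texts, voynich_tokens):
        """training_texts: dict mapping language name -> list of word tokens."""
        X, y = [], []
        for language, tokens in training_texts.items():
            for chunk in chunks(tokens):
                X.append(length_features(chunk))
                y.append(language)
        clf = SVC(kernel='linear')              # multi-class handled internally (one-vs-one)
        accuracy = cross_val_score(clf, X, y, cv=5).mean()   # cross-validated accuracy
        clf.fit(X, y)
        return clf.predict([length_features(voynich_tokens)])[0], accuracy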

Difficulties
MDA classification and output for test corpuses and the Voynich Manuscript, Word Length
MDA classification and output for test corpuses and the Voynich Manuscript, WRI


An issue we encountered at the beginning was that Matlab's prebuilt SVM training and classification functions only allowed the comparison of 2 classes, which hampered their use when comparing larger data sets. This was resolved by using a user-built library (LIBSVM) allowing for multi-class SVM models. Another issue was that if the unknown data set did not belong to any of the trained classes, the classifier had no way of indicating this. For an SVM to be used to its full potential, the training data must therefore include a class that genuinely matches the unknown text, which is impossible to guarantee in our case. With this in mind, and given that Leave-One-Out Cross Validation returned poor results (failing to classify sections of the Voynich correctly to a reasonable degree), we determined that the SVM model was not well suited to our goal of classifying unknown text.

MDA Analysis

MDA is another statistical classification technique, similar to regression analysis, which constructs boundaries around clusters of data. Like SVM, it allows an unknown sample to be compared against a large set of training data, but it does so by reducing the data to a small number of discriminant functions in order to classify it. The MDA analysis was performed with the IBM program SPSS. The data used for MDA consisted of texts broken into files of exactly 2000 words, which were fed into a Matlab function to generate both word length and WRI statistics. Each of the 8 languages compared had 8 test files, for a total of 64 training points. The two Voynich sections were then compared with the known group data set, and the results determined which cluster each most closely belonged to. [15]

The stepwise method (using Mahalanobis distance) removed redundant test points, leaving only those which contributed to the final results. Leave-One-Out and casewise classification were used to give a final output in terms of percentage accuracy of classification into the clusters. (A sketch of this classification step is given after the group list below.)

The groups which were used as test corpuses were:

  1. Chinese
  2. English (Sherlock Holmes)
  3. Hebrew
  4. Hungarian
  5. Italian
  6. Latin
  7. Pinyin
  8. Russian
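The MDA runs themselves were carried out in SPSS with stepwise (Mahalanobis) variable selection; as a rough Python analogue only, the sketch below uses scikit-learn's LinearDiscriminantAnalysis with leave-one-out cross-validation on the same 8 × 8 training layout. It omits the stepwise selection, and the variable names are placeholders.

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    def mda_classify(X_train, y_train, X_voynich):
        """X_train: 64 feature rows (8 languages x 8 files of 2000 words),
        y_train: matching language labels,
        X_voynich: feature rows for the unknown Voynich sections."""
        lda = LinearDiscriminantAnalysis()
        loo_accuracy = cross_val_score(lda, X_train, y_train, cv=LeaveOneOut()).mean()
        lda.fit(X_train, y_train)
        return lda.predict(X_voynich), loo_accuracy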

The data obtained through MDA indicated that, based on the clusters, the four unknown cases fell into the following language sets:

  • MDA Word Length (Set A: Herbal A): Classified to Hebrew
  • MDA Word Length (Set B: Biological): Classified to English

For this classification, Leave-One-Out Cross Validation (LOO-CV) gave an accuracy of 82.3% for classification into the correct groups. This result is relatively high and suggests a good statistical relationship between the training clusters and the unknown language.

  • MDA WRI (Set A: Herbal A): Classified to English
  • MDA WRI (Set B: Biological): Classified to English

The WRI-based classification achieved poor LOO-CV results, with an overall accuracy of only 55.6%. As seen in the figures provided, the data clusters for several groups lie close together, which reduces the classification accuracy in this case.

Further expansion of the set of languages tested may yield better results and help determine the correct language set.

Project Management

Schedule and Task Management

As explained earlier, the project was split into 5 distinct stages. Each team member was in charge of specific stages, but there was a large degree of collaboration between team members throughout the year. The formal stages are listed below.

  1. Stage 1: Characterise the Text
  2. Stage 2: Associate Pictures with Word Frequency
  3. Stage 3: WRI versus Rank Plots
  4. Stage 4: Other Ideas and Research
  5. Stage 5: SVM and MDA Authorship Techniques

These stages, along with project management tasks, were allocated as follows:

Peter Roush:

  • Python Coding
    • Phase 2 (Associate Pictures with Word Frequency)
    • Phase 4 (Other Ideas and Research)
  • Preprocessing files
  • Compilation of testing material
  • Research into previous work on the Voynich Manuscript


Bryce Shi:

  • Matlab Coding
    • Phase 3 (WRI versus Rank Plots)
    • Phase 5 (SVM and MDA Authorship Techniques)
  • Research into linguistics and previous work on the Voynich Manuscript
  • Budget control

Budget and Purchases

No. | Details | Cost
1 | The Voynich Manuscript by Gerry Kennedy and Rob Churchill | AUD 15.53
2 | The Curse of the Voynich by Nicholas Pelling | AUD 140.54
3 | The Voynich Manuscript: An Elegant Enigma by Mary D'Imperio | AUD 15.26
4 | Lamination of A3 Handouts | AUD 0.00
5 | Lamination/Binding of Reference Book | AUD 31.50
Total | | AUD 203.43

Project Deliverables

Deliverable | Due Date | File Location (If Applicable)
Proposal Seminar | 21st March 2014 | File:Proposal Seminar, Group 44, 2014.pdf
Progress Report | 6th June 2014 | Wiki report
Final Seminar | 16th October 2014 | File:Final Seminar, Group 44, 2014.pdf; Final Seminar Video, G44
Final Report | 24th October 2014 | This page
Exhibition Poster | 27th October 2014 | File:G44 Poster Final.pdf
Youtube Teaser | 30th October 2014 | YouTube video
Ingenuity Exhibition | 30th October 2014 | Adelaide Convention Centre

Extra Deliverables (Due by Closeout):
  • Bound Voynich book
  • Documented Python and MATLAB code
  • All documentation and data

Risk Management

Project Risk Management

The table below shows the five most significant risks to the project.

Risk No. | Description | Likelihood | Consequence | Risk | Mitigation
1 | Inaccurate allocation of time and resources to a particular section of work | Likely | Major | V. High | Balance the load evenly between team members, with regular updates. If needed, reduce the scope and alter the current path of the project; contact supervisors and work out a new plan.
2 | Files and working copies lost | Rare | Major | Medium | Keep local copies on computer, update the wiki site regularly, keep copies of correspondence via e-mail for both parties, and share files via Dropbox or other storage services for backup.
3 | Not understanding the project correctly and the processes required | Almost Certain | Moderate | V. High | Keep in contact with group members and CC supervisors when asking questions or confirming problems. Ensure the team keeps in regular contact with supervisors through meetings and weekly reports.
4 | UofA Electrical Engineering server down for unknown reasons | Unlikely | Moderate | Medium | Complete tasks ahead of deadlines, keep copies of work on other servers, and contact ITS if the outage is long or unscheduled. Continue to work and update when resources are available again. Notify supervisors.
5 | Not being able to solve the Voynich Manuscript code | Almost Certain | Negligible | Medium | Ensure that the work has been documented properly and all resources are cited for presentation regardless.

OH&S Risk Management

As the project was entirely code based, only one OH&S risk was identified.

Risk No. | Description | Likelihood | Consequence | Risk | Mitigation
1 | Health issues due to long periods of sitting and working at a PC | Likely | Moderate | High | Take breaks every hour and dedicate time to other activities without impacting work. Notify supervisors and team members of any significant issues. Seek medical advice if necessary.

Outcomes and Pathways

Conclusions and Summary

With a research project of this type, it can be very difficult to come to a concrete conclusion, especially for a problem that has remained unsolved for centuries. The aim of the project was to develop a base of research (and code) that could be used for future analysis. However, we are confident enough in our findings to conclude with the following points:

  1. There is a relationship between the language and sections themselves, but this may not have anything to do with illustrations presented in that section.
  2. Based on characteristics such as word length distribution and WRI, the text appears similar to languages such as Hebrew and Latin.
  3. Based on the line characteristics of comparison languages, the manuscript may contain punctuation characters.
  4. The text shows a distinctly weak word order in comparison to known languages. This might be evidence of a code in the manuscript.
  5. WRI and Word Length Distribution are not useful statistics for supervised learning algorithm analysis of the manuscript.

A project like this is never truly finished; fresh and inquisitive minds build upon what others have found, making continual progress until, one day, the mystery surrounding the manuscript is finally unveiled. This project has taught us concepts and techniques for approaching large unknown data sets, improved our coding skills in both Python and Matlab, and allowed us to develop skills in data interpretation, data mining, and mathematical and linguistic analysis. These are important skills to have, as data analysis is now an integral part of engineering, from the design of networks to simulations and development plans.

Future Pathways and Development

  • Test these characteristics against transcriptions of known 15th century codes to develop cipher theories
  • Expand research into the word/illustration relationship with better burst metrics and Hidden Markov Model clustering algorithms
  • Test the effect of modified alphabets on statistics such as word length distribution and line break characteristics
  • Expand research into authorship if possible
  • Develop a rule-based grammar for the manuscript

Appendices

Reference List

  1. UA Experts Determine Age of Book 'Nobody Can Read'. (2011, February 9). States News Service. Retrieved October 24, 2014, from http://www.highbeam.com/doc/1G1-248748860.html?
  2. Rene Zandbergen. (2014). History of research of the manuscript. Accessed on 24/10/14 from http://www.voynich.nu/solvers.html
  3. D'Imperio, M. E. (1976). The Voynich Manuscript: An Elegant Enigma. Aegean Park Press.
  4. Notes from the Interlinear Archive file.
  5. Rene Zandbergen. (2014). Currier Languages Revisited. Accessed on 24/10/14 from http://www.voynich.nu/extra/curabcd.html
  6. Sravana Reddy and Kevin Knight. (2011). What we know about the Voynich manuscript. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 78-86.
  7. Rene Zandbergen. (2014). An Analysis of the Writing. Accessed on 24/10/14 from http://www.voynich.nu/writing.html
  8. Schuster, J. (2013). Haunting Museums. New York: Tor.
  9. Jorge Stolfi. (2012). Chinese Theory Redux. Accessed on 24/10/14 from http://www.ic.unicamp.br/~stolfi/voynich/02-01-18-chinese-redux/
  10. M. J. Berryman, A. Allison, and D. Abbott, "Statistical techniques for text classification based on word recurrence intervals," Fluctuation and Noise Letters, Vol. 3, No. 1, pp. L1–L12, 2003.
  11. Wyer, R. (1540). Hereafter foloweth the knowledge, properties, and the vertues of herbes. London: Imprynted by me Robert Wyer, dwellynge in saynt Martyns parysshe, at the sygne of saynt Iohn Euangelyst, besyde Charynge Crosse. http://catalogue.nla.gov.au/Record/5881778
  12. Kalimeri, M. et al. (n.d.). Entropy Analysis of Word-Length Series of Natural Language Texts: Effects of Text Language and Genre. International Journal of Bifurcation and Chaos, Art. No. 1250223.
  13. Nicholas Pelling. (2013). Knight & Reddy on the Voynich, & the limits of statistical analysis. Accessed on 24/10/14 from http://www.ciphermysteries.com/2013/03/09/this-week-a-talk-at-stanford-on-the-voynich-manuscript
  14. Chih-Chung Chang and Chih-Jen Lin. (2014). LIBSVM for Multiclass SVM comparisons. Accessed on 24/10/14 from http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  15. M. Ebrahimpour, T. J. Putniņš, M. J. Berryman, A. Allison, B. W.-H. Ng, and D. Abbott, "Automated authorship attribution using advanced signal classification techniques," PLoS ONE, Vol. 8, No. 2, Art. No. e54998, 2013, http://dx.doi.org/10.1371/journal.pone.0054998

External Links

Previous Studies

[1] M. Ebrahimpour, T. J. Putniņš, M. J. Berryman, A. Allison, B. W.-H. Ng, and D. Abbott, "Automated authorship attribution using advanced signal classification techniques," PLoS ONE, Vol. 8, No. 2, Art. No. e54998, 2013, http://dx.doi.org/10.1371/journal.pone.0054998

[2] M. J. Berryman, A. Allison, and D. Abbott, "Statistical techniques for text classification based on word recurrence intervals," Fluctuation and Noise Letters, Vol. 3, No. 1, pp. L1–L12, 2003.
