Projects:2016s1-141 Cracking the Voynich manuscript code
- 1 Topic
- 2 Supervisors
- 3 Team members
- 4 Project Introduction
- 5 Proposed method
- 6 Phase 1: Characterisation and text investigation
- 7 Phase 2: Illustration investigation
- 8 Phase 3: Marginal symbols research
- 9 Conclusion
- 10 Reference
Cracking the Voynich manuscript code
Prof. Derek Abbott Dr. Brian Ng
Ruihang Feng a1674940 Yaxin Hu a1672395
The Voynich manuscript was created in the first half of the fifteenth century (probably between 1404 and 1438). No one today knows what it says or who wrote it. The book is written in an unknown alphabet. In 1912, a book collector named Wilfrid Voynich found it in an Italian Jesuit college. Since the book cannot be read, it is divided into six sections according to the style and content of the illustrations:
- Each page shows one or more plants, a format typical of European herbals.
- Circular diagrams containing suns, moons and stars suggest that this part concerns astronomy or astrology.
- Drawings, mostly of naked women, suggest that this part is a biological section.
- Circular diagrams of an obscure nature mark this part as a cosmological section.
- Drawings of isolated plant parts and objects resembling apothecary jars suggest that this part concerns pharmaceuticals.
- This part consists of full pages of text in short paragraphs.
Investigating the language and linguistic structure of an unreadable book with statistical methods is an ambitious undertaking. Any features, relationships or patterns found in the Voynich manuscript could help to decode a text written in an unknown language, and may contribute to decoding at least part of the book. The outcomes can also inform further work on language analysis and decryption, such as information decoding, search engines and data mining, and are relevant to applications such as Google, Turnitin, Google Translate, Yahoo and Grammarly.
The aim of this project is to search the text using statistical methods and determine whether it has any features that can be used to decode the Voynich manuscript. This requires an investigation of the language and linguistic properties of the unknown text. A further aim is to identify the initial digits of the Voynich manuscript and determine which letters may stand for digits. It is not necessary to decode the manuscript fully, since that cannot be done in a one-year project.
There are many conjectures about the Voynich manuscript. Because of its long history, many historians believe that its mysterious alphabet is related to ancient civilisations. If the manuscript can be cracked, it will help historians to explore the culture of ancient societies.
In addition, the statistical methods used in this project are also useful in other fields, such as engineering, finance and architecture. Text comparison is likewise widely used, for example in Turnitin, Google Translate, Grammarly and Bing.
The main technique applied in this project is data mining, which is an effective way to find patterns in large amounts of data. Two data mining methods are used: statistics and comparison. Statistics is used to count how often particular words occur. Comparison is used to find relationships between two languages.
In the field of linguistics, the European Voynich Alphabet (EVA) is a representative scheme for transcribing the Voynich manuscript into digital form. The Japanese linguist Takahashi transcribed the whole Voynich manuscript using EVA.
Therefore, the main data for this project will be extracted from Takahashi's transcription.
Other resources will also be considered, such as expressions from some representative ancient languages.
Because the Voynich manuscript contains a large amount of data, the project requires skill in data processing and software programming; however, no one in the project team has dealt with this much data before. Team members therefore need to develop their data processing and programming skills.
The project also requires knowledge of statistics, so members must be adept at sorting data.
The technical challenges of this project fall into two areas.
First, it is very difficult to infer which language the author used. The language of the manuscript does not belong to any known language, and it may even be extinct. Moreover, because of the manuscript's long history, some important information cannot be found, such as reliable information about the author. It is therefore not possible to infer the language from the author's nationality. To address this problem, members must study many different languages as references and compare them with the language of the manuscript.
Second, references on cracking the Voynich manuscript are limited. Because of the unknown language and the mysterious illustrations, it is difficult to crack the whole manuscript. Although researchers claim to have cracked a few words, no one can guarantee that these results are right, and linguists have not accepted any decipherment of the Voynich manuscript as correct. Reliable references are therefore hard to find, so members must search a range of sources and select those that are sufficiently accurate.
As shown in Figure 1, the proposed method is divided into three phases.
Phase 1: Text investigation
There are two parts in this phase: words and digits.
For the word research, Matlab will be used as the essential tool. Team members will look for patterns in three aspects:
- The total number of words in the Voynich manuscript.
- The characters and words that may stand for digits in some paragraphs of the manuscript.
- The frequency of particular characters and words.
For the digit investigation, team members will collect different known ways of writing digits and compare them with the words in the Voynich manuscript. For example, Roman numerals are shown in Figure 2.
The word shown in Figure 3 is extracted from the Voynich manuscript. Its form is ‘*##’, that is, one character followed by two copies of another character. By the comparison method described above, this word may correspond to the Roman numeral for seven (VII).
Phase 2: Illustrations investigation
An illustration extracted from the Voynich manuscript is shown in Figure 4.
In this phase, illustrations will be analysed using Matlab. There are three tasks to complete:
- Count the number of different elements in the illustrations.
- Identify the characters that may stand for digits.
- Match the characters to digits.
Phase 3: Marginal symbols research
A page that contains marginal symbols is shown in Figure 5.
This phase also requires proficiency in Matlab programming. It covers four main tasks:
- Record the ordering and quantitative features of the marginal symbols on each page.
- Search for the characters that may stand for digits.
- Identify the differences between the marginal symbols on each page.
- Match the characters to digits and draw inferences about the relationship between them.
Phase 1: Characterisation and text investigation
Characterisation of the Voynich manuscript
Figure 6 shows the letter frequency in the Voynich manuscript. There are 24 letters in the Voynich manuscript. As the figure shows, o, e, h and y are the four most frequent letters, and S, z, v and x are the four least frequent letters. The blue line shows the trend across all the letters.
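The letter count behind a chart like Figure 6 can be sketched as follows. This is a minimal illustration in Python rather than the Matlab used in the project, and the sample string is a made-up EVA-style fragment, not data from the transcription. Counting is case-sensitive on the assumption that upper-case and lower-case symbols denote different letters.

```python
from collections import Counter


def letter_frequencies(text):
    """Return (letter, relative frequency) pairs, most frequent first."""
    # Keep alphabetic characters only; spaces and punctuation are ignored.
    letters = [ch for ch in text if ch.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return [(ch, n / total) for ch, n in counts.most_common()]


# Hypothetical sample text, for illustration only.
sample = "daiin okeey qokeedy chol daiin shedy"
for letter, freq in letter_frequencies(sample)[:4]:
    print(f"{letter}: {freq:.2%}")
```

Sorting the result by frequency gives the kind of descending curve that the trend line in the figure describes.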
Six languages are used for the letter frequency comparison: English, Latin, French, German, Greek and Spanish.
Figure 7 shows the letter frequency of English. There are 26 letters in total. The most frequent letters are e, t, a and o, and the least frequent letters are z, q, j and x.
Figure 8 shows the letter frequency of Latin. There are 23 letters in total. The most frequent letters are i, e, a and u, and the least frequent letters are z, y, x and h.
Figure 9 shows the letter frequency of French. There are 38 letters in total. The most frequent letters are e, s, a and i, and the least frequent letters are ï, ë, œ and ô.
Figure 10 shows the letter frequency of German. There are 30 letters in total. The most frequent letters are e, n, s and r, and the least frequent letters are q, x, y and j.
Figure 11 shows the letter frequency of Greek. There are 24 letters in total. The most frequent letters are A, E, O and I, and the least frequent letters are Ψ, Z, Ξ and B.
Figure 12 shows the letter frequency of Spanish. There are 33 letters in total. The most frequent letters are e, a, o and s, and the least frequent letters are k, ü, w and ú.
Using Matlab, correlations were calculated between the letter frequency trend of the Voynich manuscript and those of English, Latin, French, German, Greek and Spanish. The correlation between the Voynich manuscript and English is 98.04%. The correlation between the Voynich manuscript and Latin is 98.66%. The correlation between the Voynich manuscript and French is 94.55%. The correlation between the Voynich manuscript and German is 94.81%. The correlation between the Voynich manuscript and Greek is 98.34%. The correlation between the Voynich manuscript and Spanish is 96.09%.
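A correlation of this kind can be sketched as below. The text does not state exactly how the trends were aligned, so this example assumes that each language's frequencies are sorted in descending order, that the longer list is truncated to the length of the shorter one, and that the Pearson coefficient is used. It is written in Python rather than Matlab, and the function names and numbers are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def trend_correlation(freqs_a, freqs_b):
    """Correlate two rank-ordered frequency curves."""
    a = sorted(freqs_a, reverse=True)
    b = sorted(freqs_b, reverse=True)
    # Assumption: alphabets of different sizes are compared over the
    # shared number of ranks by truncating the longer curve.
    n = min(len(a), len(b))
    return pearson(a[:n], b[:n])


# Hypothetical frequency values, not taken from the figures.
curve_a = [0.30, 0.25, 0.20, 0.15, 0.10]
curve_b = [0.40, 0.22, 0.18, 0.12, 0.05, 0.03]
print(f"{trend_correlation(curve_a, curve_b):.2%}")
```

Because every sorted frequency curve decreases from left to right, correlations computed this way tend to be high for any pair of languages, which is consistent with all six values above falling between 94% and 99%.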
Comparing the Voynich manuscript with English, Latin, French, German, Greek and Spanish, the number of letters suggests that the closest match is Greek, because both have 24 letters. The letter frequency of the Voynich manuscript is also similar to that of Greek, and the correlation between the two is high. Greek can therefore be considered a possible language of the Voynich manuscript. However, this is not strong evidence that the Voynich manuscript is written in Greek. In conclusion, there is no specific evidence that the Voynich manuscript is written in any of these six languages; Greek is only one possible candidate.
Figure 13 shows the word frequency in the Voynich manuscript. There are 37104 words in the whole manuscript, of which 8486 are unique. Of the unique words, 2472 appear more than once and 6014 appear only once. 515 words appear more than 10 times, and these words account for 65.66% of the total words in the Voynich manuscript.
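The word counts above can be reproduced from a tokenised transcription along the following lines. This is a sketch in Python rather than Matlab; the function name, the dictionary keys and the sample word list are all hypothetical.

```python
from collections import Counter


def word_statistics(words, threshold=10):
    """Summarise word frequencies for a list of word tokens."""
    counts = Counter(words)
    unique = len(counts)
    once = sum(1 for n in counts.values() if n == 1)
    # Words that appear more than `threshold` times.
    frequent = {w: n for w, n in counts.items() if n > threshold}
    return {
        "total": len(words),
        "unique": unique,
        "appear_once": once,
        "appear_more_than_once": unique - once,
        "frequent_words": len(frequent),
        # Share of all tokens covered by the frequent words.
        "frequent_coverage": sum(frequent.values()) / len(words),
    }


# Tiny made-up example with a lower threshold so the output is non-trivial.
sample_words = "a b a c a b d".split()
print(word_statistics(sample_words, threshold=2))
```

With the full transcription and the default threshold of 10, the `frequent_coverage` value corresponds to the 65.66% figure quoted above.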
In Figure 14, the 50 most frequent words are taken for comparison with English.
Comparing word frequency in the Voynich manuscript and in English, the correlation between the trends of the two curves is 93.65%, which suggests that there may be a relationship between the Voynich manuscript and English. However, this alone is not strong evidence of any relationship between the Voynich manuscript and English.
Statistical Comparison of Letters and Words
This section gives a brief statistical comparison between the Voynich manuscript and three books in English, French and German. Three measures were compared: the percentage of unique words among total words, the word length, and the percentage of unique words that appear more than once.
Figure 16 shows the percentage of unique words among total words. There is a significant difference between the Voynich manuscript and the English book (47.9%) and the French book (27.7%). The difference between the Voynich manuscript and the German book is smaller (13.6%).
Figure 17 shows the word length in the Voynich manuscript, English, French and German. There is a small difference in word length between the Voynich manuscript and English (6.7%) and French (6.0%), and almost no difference between the Voynich manuscript and German (0.1%).
Figure 18 shows the percentage of unique words that appear more than once. There is a large difference between the Voynich manuscript and English (41.0%), French (38.9%) and German (22.8%). However, the difference from the German book is the smallest of the three.
On all three measures, German is the closest of the languages compared, so it can be considered a possible language of the Voynich manuscript.
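The three measures compared in this section can be computed as sketched below. The text does not say how the percentage differences were defined, so `relative_difference` is only one plausible definition and is an assumption of this example, as are the function names and the sample word lists. The code is Python rather than Matlab.

```python
from collections import Counter


def text_metrics(words):
    """Compute the three measures compared in Figures 16 to 18."""
    counts = Counter(words)
    unique = len(counts)
    repeated = sum(1 for n in counts.values() if n > 1)
    return {
        "unique_ratio": unique / len(words),
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "repeated_ratio": repeated / unique,
    }


def relative_difference(value, reference):
    """Assumed definition: absolute difference relative to the reference."""
    return abs(value - reference) / reference


# Made-up word lists standing in for the manuscript and a comparison book.
manuscript = text_metrics("aa bb aa cccc".split())
book = text_metrics("the cat and the dog and the bird".split())
for name in manuscript:
    diff = relative_difference(manuscript[name], book[name])
    print(f"{name}: {diff:.1%}")
```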
Specific Pattern Words
This section gives a brief analysis of the current results for words with specific patterns. The first numeral system considered is Roman numerals. In Roman numerals, VII stands for 7 and VIII stands for 8. These two numerals have obvious patterns that are easy to search for in the Voynich manuscript. Words that follow the VII pattern and the VIII pattern have been found. The next step is to search for all possible numerical words corresponding to the Roman numerals from I to XX, and for several numerals with obvious patterns such as XX, XXX, C, CC and CCC.
Figure 19 shows some of the VII pattern words. The numbers below each word are the locations of that word; for example, 534 means that the 534th word in the Voynich manuscript contains aii. Because several words have a large number of locations, the figure cannot show them all.
From the locations, 562 aii, 201 kee, 77 oee, 72 tee, 51 oii, 30 qoo, 27 dee, 18 qee, 10 see, 4 lee, 3 yee and 2 ree were found. These results suggest that *ii, *ee and *oo are three patterns that may be numerical words for VII.
Figure 20 shows some of the VIII pattern words. From the locations, 44 aiii, 25 oeee, 22 keee, 72 tee, 11 oiii, 7 deee, 6 qeee, 5 teee, 3 seee, 2 leee, 2 reee and 1 yeee were found. These results suggest that *iii and *eee are two patterns that may be numerical words for VIII.
Taken together, *ii (*iii) and *ee (*eee) are two patterns that may be numerical words for VII (VIII). Comparing all possible VII words and VIII words, e, i and o can be considered possible numerical characters.
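A search for these patterns can be sketched with regular expressions. The example below is in Python rather than Matlab, and it assumes that a VII pattern means one character followed by exactly two copies of a different character, and a VIII pattern means one character followed by exactly three copies. The constant names, the function name and the sample words are hypothetical.

```python
import re

# One character, then exactly two copies of a different character (like VII).
# The outer lookahead lets matches overlap within a word.
VII_PATTERN = re.compile(r"(?=((.)(?!\2)(.)\3(?!\3)))")
# One character, then exactly three copies of a different character (like VIII).
VIII_PATTERN = re.compile(r"(?=((.)(?!\2)(.)\3\3(?!\3)))")


def find_pattern(words, pattern):
    """Map each matched substring to the 1-based positions of words containing it."""
    locations = {}
    for position, word in enumerate(words, start=1):
        for match in pattern.finditer(word):
            locations.setdefault(match.group(1), []).append(position)
    return locations


# Made-up word list, for illustration only.
words = ["daiin", "qokeedy", "aiii", "chol"]
print(find_pattern(words, VII_PATTERN))
print(find_pattern(words, VIII_PATTERN))
```

The length of each list of positions gives a count per pattern, comparable to the counts quoted above for aii, kee and the other forms.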
For comparison with triple letters in other languages, such as English, there is a list of triple letter words in English:
In English, triple ‘l’ and triple ‘s’ appear inside many of these words. Since ‘i’ and ‘e’ are the most common triple letters in the VIII pattern words of the Voynich manuscript, and ‘l’ and ‘s’ are the most common triple letters in English, there may be some relationship between ‘i’ and ‘e’ in the Voynich manuscript and ‘l’ and ‘s’ in English.
There are also some triple letter words in other languages, such as German and Russian.
In German, triple ‘e’ appears inside words.
In Russian, triple ‘o’ appears inside words, and triple ‘e’ appears at the end of words.
As discussed above, German shows the closest statistical relationship to the Voynich manuscript. Since ‘e’ can appear three times in a row in German, there is a possible relationship between ‘i’ and ‘e’ in the Voynich manuscript and ‘e’ in German. Further searching is needed to see whether this offers a breakthrough in the text investigation.
Phase 2: Illustration investigation
Searching initial numbers and possible numerical words inside images
The first part of this section is to find all initial numbers inside the images of the whole Voynich manuscript. Some of the initial numbers are listed below:
In order to compare and map the initial numbers to the Voynich manuscript, all possible words that may stand for numbers were collected. Some of the possible words are listed below:
Mapping all initial numbers and numerical words
When the initial numbers and possible words are compared, some potential relationships can be seen. For example, ‘s’ and ‘2’ often appear on the same page (54 pairs), as do ‘o’ and ‘1’ (24 pairs) and ‘ol’ and ‘10’ (14 pairs). To simplify the comparison, a mapping between initial numbers and possible words was made to show whether there is any relationship between them. There is a list of mapping pairs for the letters ‘o’ and ‘r’ below:
To make the most likely relationships easier to find, the most frequent pairs for each word were selected and placed in a new list, which is shown below:
It can easily be seen that ‘o’ and ‘1’ appear together 24 times. There are also many occurrences of ‘ol’ and ‘10’ (14 times), ‘ol’ and ‘13’ (12 times), ‘ol’ and ‘12’ (11 times), ‘or’ and ‘10’ (19 times), ‘or’ and ‘12’ (13 times), ‘or’ and ‘13’ (12 times), and ‘os’ and ‘19’ (11 times). In all of these pairs the word begins with ‘o’ and the number begins with ‘1’, so there is a potential relationship between ‘o’ in the Voynich manuscript and the number ‘1’.
Furthermore, ‘r’ appears with ‘1’ 48 times, with ‘2’ 26 times and with ‘3’ 21 times; ‘s’ appears with ‘2’ 54 times, with ‘1’ 46 times, with ‘3’ 41 times and with ‘5’ 32 times; and ‘y’ appears with ‘2’ 36 times, with ‘1’ 30 times, with ‘3’ 29 times and with ‘5’ 20 times. There may be potential relationships among them, which need further investigation.
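The pair counts in this section can be produced by a simple co-occurrence tally. The sketch below is in Python rather than Matlab and assumes that a pair is counted once for every page on which both the word and the number appear; the function name and the page data are made up for illustration.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(pages):
    """Count, for each (word, number) pair, the pages on which both appear.

    `pages` is an iterable of (words, numbers) tuples, one per page.
    """
    counts = Counter()
    for words, numbers in pages:
        # Sets ensure a pair is counted at most once per page (assumption).
        for pair in product(set(words), set(numbers)):
            counts[pair] += 1
    return counts


# Hypothetical pages: candidate words and the numbers found in the images.
pages = [
    (["o", "s"], [1, 2]),
    (["o"], [1]),
    (["s"], [2, 3]),
]
for (word, number), n in cooccurrence_counts(pages).most_common(3):
    print(word, number, n)
```

Sorting the tally with `most_common` corresponds to selecting the most frequent pairs for each word, as done for the shortened list above.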
For ease of comparison, all the possible pairs are listed below:
Phase 3: Marginal symbols research
As outlined in chapter 5, this phase is divided into three parts: statistics for the marginal stars of each page, digit mining, and conclusion. This phase was completed by Ruihang Feng.
Statistics for Marginal stars of each page
There are 15 pages that contain marginal stars in the Voynich manuscript. Following the analysis in chapter 5, an example is shown in Figure 5. The results of this part are shown in Figure 28.
Figure 28 shows that there are two kinds of marginal stars in the Voynich manuscript: white stars and coloured stars. Figure 28 also gives detailed information about the number of stars, their arrangement and their location in the text.
In this phase, the number of marginal stars on each page is counted first. Then the letters that may stand for digits are extracted. An example (page number: f58r) is shown in Figure 29.
On this page there are 3 white stars (according to Figure 28), and the single letters that may stand for digits are m, o, r and s. All 25 pages are then counted in this way.
As a result, these 25 pages involve 16 different digits: 1, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17 and 19. Some of them are the total number of stars on a page; others are the number of white stars or the number of coloured stars on a page. The detailed information is shown in Figure 28.
The results of this phase are shown in Figure 30. The first column lists the 16 digits, with the number of pages that involve each digit in brackets (for example, the entry for the digit ‘5’ is 3 pages, meaning that 3 pages involve ‘5’). The red mark indicates the three letters with the highest occurrence frequency. The second column lists the pages that involve each digit, and the last column lists the letters that may stand for digits.
Based on sections 8.1 and 8.2, the conclusion of the marginal symbol research phase is shown in Figure 31.
The letters in the first column are extracted according to the red marks in Figure 30. The fourth column gives the occurrence frequency of each letter. For example, the occurrence frequency of y=5 is 3/18 = 16.67%, where ‘3’ is the number of pages that involve ‘y=5’ (according to Figure 30) and ‘18’ is the number of pages that involve ‘y’.
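The occurrence frequency described above is a simple ratio of page counts, which can be sketched as follows. The example is in Python rather than Matlab; the function name is hypothetical and the page records are constructed only to reproduce the 3/18 example, not taken from the figures.

```python
def occurrence_frequency(pages, letter, digit):
    """Fraction of the pages containing `letter` that also involve `digit`.

    `pages` is a list of (letters, digits) tuples, one per page.
    Assumes that `letter` appears on at least one page.
    """
    with_letter = [digits for letters, digits in pages if letter in letters]
    with_pair = [digits for digits in with_letter if digit in digits]
    return len(with_pair) / len(with_letter)


# Constructed data: 18 pages contain 'y', and 3 of them involve the digit 5.
pages = [({"y"}, {5})] * 3 + [({"y"}, {7})] * 15 + [({"o"}, {5})] * 2
print(f"{occurrence_frequency(pages, 'y', 5):.2%}")  # 16.67%
```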
As a result, according to the figures above, there are potential relationships between:
- ‘y’ and ‘6’
- ‘y’ and ‘7’
- ‘l’ and ‘7’
- ‘l’ and ‘5’
- ‘l’ and ‘8’
- ‘r’ and ‘5’
- ‘s’ and ‘6’
- ‘o’ and ‘6’
- ‘o’ and ‘1’
- ‘ar’ and ‘13’
- ‘ar’ and ‘15’
- ‘al’ and ‘13’
- ‘al’ and ‘10’
- ‘al’ and ‘15’
- ‘or’ and ‘13’
- ‘ol’ and ‘13’
- ‘am’ and ‘12’
- ‘am’ and ‘19’
- ‘dy’ and ‘14’
- ‘dy’ and ‘19’
- ‘om’ and ‘16’
Results and Analysis
This project is divided into three phases: text investigation, illustration research and marginal symbol investigation. Most of the work in this project was carried out by computer.
The goals of this project fall into three parts:
- Use statistical methods in Matlab to search for linguistic patterns in the Voynich manuscript.
- Search for patterns in the illustrations from the perspective of digits.
- Investigate patterns in the marginal symbols from the perspective of digits.
Over the past two semesters, all phases have been finished. From the analysis in chapter 6, we infer that the language used in the Voynich manuscript may be a branch of German.
The most likely digits:
- ‘o’ and ‘1’
- ‘ol’ and ‘13’
- ‘or’ and ‘13’
Some other possible digits:
- ‘s’ and ‘2’, ‘6’
- ‘y’ and ‘2’, ‘6’
- ‘a’ and ‘1’
- ‘r’ and ‘1’
 Schmeh, Klaus (January–February 2011). "The Voynich Manuscript: The Book Nobody Can Read". Skeptical Inquirer. Retrieved 2013-09-05.
 Shailor, Barbara A., Beinecke MS 408, Yale University, Beinecke Rare Book and Manuscript Library, General Collection of Rare Books and Manuscripts, Medieval and Renaissance Manuscripts, accessed 24 June 2013.
 Stojko, John, Letters to God’s Eye: The Voynich Manuscript for the first time deciphered and translated into English. New York: Vantage Press, 1978.
 Joachim Dathe, The EVA-Transcription [Online]. Available: https://voynich2arabic.wordpress.com/eva-transcription/
 Vladimir Sazonov, Voynich Manuscript [Online]. Available: http://voynich.naobum.de/
 Reed Johnson (2013, July 9), The Unread: The Mystery Of The Voynich Manuscript [Online]. Available: http://www.newyorker.com/books/page-turner/the-unread-the-mystery-of-the-voynich-manuscript