Final report 2009: Who killed the Somerton man?

From Derek

Supervisors

Honours students

Due date

This assignment is due by 2pm on Thursday of Week 9, Semester 2, 2009. Whilst anyone can keep editing this page at will for the next century, your marks will be based on the state of this page at the deadline. Also by the due date you must hand up individual CDs of your complete project directory with all your software and output. You also need to hand in the signed coversheet with your CD.

View the video briefly outlining the project

Abstract

This project investigates different possibilities for discovering the meaning behind the obscure string of letters associated with the unknown man found dead at Somerton beach in 1948. The project is structured so as to selectively rule out different possibilities for the code's meaning, in an effort to get a better idea of what the code may be. Specifically, the possibilities of a simple transposition cipher in many languages, an initialism in many languages, and a few cipher schemes in English have been investigated. The cipher schemes include the Playfair cipher, the Vigenere cipher and the one-time pad system. The claim that the string of letters has no meaning (cryptographically) has also been investigated.

Background to the case

The dead body and its circumstances

Perhaps one of South Australia's most bizarre and longest standing mysteries is that of the Somerton Man. His body was found resting against the rock wall on Somerton beach opposite the Crippled Children's home, shortly before 6:45am on the 1st of December, 1948. According to the coroner's inquest, the officer in charge, Constable John Moss, "looked to see if there was any disturbance of the sand and the body, and was sure there had not been." According to John Matthew Dwyer, who performed the postmortem, "the death (of the man) could not have been natural" (Coronial Inquest[1] page 11) due to severe congestion in many of the organs, most notably the liver and spleen. Such congestion is consistent with poisoning, but not a trace was found. There was an unlit cigarette just above his ear when he was found, and a partly smoked portion of a cigarette on the right collar of his coat. There were cigarettes on the body, which were in a packet" (Coronial Inquest[1] page 4) along with a quarter full box of matches. There was also a metal comb and some chewing gum. Also in his possession were "a railway ticket to Henley Beach, also a bus ticket, a tramway bus ticket" and a slip of paper bearing the words "Tamám Shud".

The railway ticket was for the 10:45am train to Henley Beach for the 30th of November, but had not been used. The Tramways bus ticket on the other hand, had been used, and according to the Claims Officer's Assistant, was purchased "somewhere between the railway station on North Terrace and the intersection of West Tce and South Tce while the bus was en route to St. Leonard's departing from the Railway Station at 11.15a.m. ... There may have been a Somerton bus before this, but this would be the first St. Leonard's bus to leave after the 10.45 a.m. train to Henley Beach" (Coronial Inquest[1] page 5).

Before leaving the railway station, the man deposited a suitcase in the cloakroom. This case contained various items of clothing, of which some bore the name "T. Keane", while others followed the trend set by the clothes that the man was wearing, in that they bore no name, many having had the name tag torn out. There were several other items in the suitcase, including a stencilling brush, scissors and knife in a sheath, and 3 pencils. However, these, like the name T. Keane, proved fruitless for useful information.

Here is a photo of the original code found in 1949 in the back of a copy of the Rubaiyat of Omar Khayyam

The code

In early 1949, a small scrap of paper bearing the words "Tamám Shud", which means "finished", was found in the coat pocket of the deceased man. This scrap of paper was identified as coming from the last page of the book of poems called "The Rubaiyat", written by the famous Persian poet Omar Khayyam.

The police put out an announcement through the media that they were searching for a copy of The Rubaiyat with the "Tamám Shud" phrase, if not the entire last page, removed. Shortly afterwards, a local Somerton man came forth with a copy of the book, claiming that it had been tossed onto the back seat of his car on the 30th of November the previous year.

Two things of significance were quickly noticed about this particular copy of The Rubaiyat: firstly, there was a phone number pencilled in on the back cover, which led police to a nurse (whose real name continues to be suppressed) who is known as Jestyn; and secondly, pencilled into the book was the short code that can be seen below. So far, this code has never been deciphered. It is unknown whether it is, indeed, a code or simply a meaningless array of letters. The original language of the message is also unknown.

The Inquest

Due to the suspicious nature of the case, the coroner was called to perform an inquest. This inquest looked at the circumstances surrounding the body, both in the lead-up to the death and in its aftermath, and at the body itself. The inquest report contains detailed information about the state of the body, through witness reports and through the post mortem examination conducted by John Dwyer, and information is also given regarding the suitcase left at the Adelaide Railway Station cloakroom.

One of the most intriguing things about the body was that "The heart was of normal size, and normal in every way", yet "small vessels not commonly observed in the brain were easily discernible with congestion. There was congestion of the pharynx, and the gullet was covered with whitening of superficial layers of the mucosa with a patch of ulceration in the middle of it. The stomach was deeply congested...There was congestion in the 2nd half of the duodenum". "...acute gastritis haemorrhage, extensive congestion of the liver and spleen, and the congestion to the brain." This analysis, conducted by Dr Dwyer, led him to conclude "I am quite convinced the death could not have been natural" and to say "the poison I suggested was a barbiturate or a soluble hypnotic." and yet that "There are other poisons which do come into the picture which would have decomposed very early after death" (Coronial Inquest[1] pages 12-13).

However, this seems to be in conflict with the findings of Dr Robert Cowan, who analysed in detail a section of the stomach (and its contents), a section of the liver, a section of muscle, some blood and some urine taken from the body. In this report, Dr Cowan states:

"I feel quite satisfied that if the death were caused by any common poison, my examination would have revealed its nature. If he did die from poison, I think it would have been a vary rare poison. ... I think that the death is more likely to have been due to natural causes than poisoning."

This contradiction baffles experts still as, by the last significant look at the case in 1978, no matching poison had been uncovered[2] to fit the description given by the 2 doctors.

Also, of interest to the case, the clothing the man was wearing had the labels removed, so that no name was present on him, and nor did he possess any other form of identification. The suitcase that was found belonging to the deceased man did contain some labels "bearing the name 'T Keane'" (Coronial Inquest[1] page 22), while other items also had the labels torn out. This name, along with fingerprints and photographs of the deceased were sent around the world to all States in the Commonwealth and New Zealand, and also the important fingerprint bureaus overseas. The reply was "the person is not known to us" (Coronial Inquest[1] page 25).

This is odd, as no one anywhere seemed to know who the man was. Yet according to the Inside Story report[2] in 1978, someone kept leaving fresh flowers on his grave site every spring.

Operation Venona: a brief history

Towards the end of the Second World War, in 1943, an operation was launched by the US and UK to spy on the Soviet Union, which had formed the Soviet Bloc (leading to widespread fear of a Third World War and to the Cold War). This operation, known as Operation Venona, was a code-breaking operation targeting encrypted Soviet diplomatic communications. It revealed that sensitive Australian government information had been leaked to the Soviet Union from an Australian source. Operation Venona uncovered a spy-ring in Australia being run from the Soviet Embassy, and over the next few years, during the Cold War, the UK Security Service made numerous investigative trips to Australia and reports to the Australian Government as to the security situation. Was the Somerton Man a member of that spy-ring, or a member of the UK Security Service? Was he an unsuspecting victim of the Cold War? Was it all coincidental, or is there something more behind it all? [3]

Previous attempts at cracking the code

Previous professional attempts to gain insight into this code were limited because they were carried out a long time ago, without the benefit of modern techniques and databases. Another problem was that they appeared to make fixed assumptions about the characters in the code. They did not appear to take into account that some of the symbols are ambiguous. For example, it is almost impossible to discern whether the first character on either the first or the third line is an M or a W (or indeed perhaps something else entirely), and a similar case is present for the fifth line's first character (is it an "I", a "V" or something else?). Also, the second line has been omitted entirely in previous cracking attempts, as it is assumed to be crossed out in error, but there is no definite reason to believe that it is indeed an error. And what of the "X" above the "O" in the fourth line?

The other problem, with amateur attempts by members of the public and astrologers, is that it has been statistically demonstrated that if you make the mistake of 'cherry picking' you can read virtually anything you want into a sequence of letters. A good example of this is the Bible code controversy[4]. The Bible code is a simple 'skip code'.

"Skip codes involve a different way to read a text than we normally do, usually we read one letter at a time, the first letter, the second letter and the third letter and so forth. But with a skip code we might start with the third letter and then skip ahead ten to the thirteenth letter and then to the twenty third letter and so forth and maybe that would spell out a new word jumping ten letters at a time. Here’s an example, there’s a sentence here that says my way of showing a skip code is encrypted in the very words I put down here. Let’s take the first letter and jump every 14 and make that red. It’s very hard to read as it is written here you can see the letter m-a-r-y–h-a-d, it’s not so legible. So to make it look nicer and to understand it better, what we usually do is we break the line before each red letter. So now we have each line starting with the red letter and now the red letters are in a column and it’s very easy to read. Mary had a little lamb."[5]

The real 'code' concept arises from what this skip code produces. Such skip codes run on the Bible have been credited with predicting various events in the recent past, such as the global economic collapse supposedly set to start in the year 2002 (which did come true[5]), and it is therefore claimed that, by reading such codes, it would be possible to discern and prepare for the future.

However, there is substantial evidence present to disprove the theory of the Bible Code concept. One article in particular, "Solving the Bible Code Puzzle"[4], states:

"A paper of Witztum, Rips and Rosenberg in this journal in 1994 made the extraordinary claim that the Hebrew text of the Book of Genesis encodes events which did not occur until millennia after the text was written. In reply, we argue that Witztum, Rips and Rosenberg's case is fatally defective, indeed that their result merely reflects on the choices made in designing their experiment and collecting the data for it. We present extensive evidence in support of that conclusion. We also report on many new experiments of our own, all of which failed to detect the alleged phenomenon."

"One appellation (out of 102) is so influential that it contributes a factor of 10 to the result by itself. Removing the five most influential appellations hurts the result by a factor of 860. Again, these appellations are not more common or more important than others in the list in any previously recognized sense. It should be obvious from these facts that a small change in the data definition (or in the judgment or diligence of the data collector) might have a dramatic effect. More generally, the result of the experiment is extraordinarily sensitive to many apparently minor aspects of the experiment design, ... These properties of the experiment make it exceptionally susceptible to systematic bias. ... there appears to be good reason for this concern."

The article gives many more examples of why this type of code analysis is flawed.

There have been several 'decryptions' by amateur code-breakers nation-wide since the Somerton Man's code was released and published in newspapers.

"Results ranged from, Go & wait by PO. Box L1 1am T TG to Wm. Regrets. Going off alone. B.A.B. decieved me too. But I've made peace and now expect to pay. My life is a bitter cross over nothing. Also I'm quite confident this time I've made Tamam Shud a mystery. St. G.A.B."[6]


Simon Singh, an expert on codes, commented that the Somerton code "looks simple".

Background and existing theory

There are two main categories of cryptography. They are transposition and substitution[7]. In transposition, the letters are as they should be but are reordered (systematically) creating an anagram of the original text. In substitution (Note: the word cipher technically refers to a substitution but is often also used in reference to a transposition) the letters of the original text are systematically substituted for others (often using a keyword).

Transposition Schemes

There are many types of transposition schemes such as rail fence, route, columnar, double, among others. No transposition schemes have been specifically investigated in this project. However, some parts of this project have investigated the Somerton Man's code treating the order of letters as irrelevant. This has allowed us to investigate the code without assuming that a transposition has not occurred.

Substitution Schemes (Ciphers)

There exists a vast number of cipher schemes in modern literature. For this reason, we have chosen to investigate three cipher schemes that were common knowledge or in common use in 1948 when the body was discovered. They are the Vigenere Cipher, the Playfair Cipher and the One-time pad system, which is essentially a Vigenere Cipher with inherent assumptions about the cipher key (such as its length being the same as the length of ciphertext).

Vigenere Cipher

The Vigenere Cipher was invented in 1553. It is a substitution cipher scheme that uses a variable letter shift based on a key word[8]. Each letter is shifted along the alphabet and substituted with the corresponding letter. For example, with a shift of 5, the letter C would be substituted with the letter H.

The shift value is calculated from letters of the cipher keyword (A-Z meaning 0-25 respectively). The keyword is repeated until it is the same length as the ciphertext and each consecutive letter is used as the shift for each consecutive letter of the ciphertext.

For example, to encipher the phrase THIS TEXT IS SECRET with the keyword lemon, we shift the first letter by 11 (L) to get E, the second by 4 (E) to get L, and so on. The full resultant ciphertext is ELUGGPBFWFDIOFRE.

Plaintext: THISTEXTISSECRET
Keyword: LEMONLEMONLEMONL
Ciphertext: ELUGGPBFWFDIOFRE
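The worked example above can be reproduced with a short Python sketch. The function names `vigenere_encipher` and `vigenere_decipher` are illustrative choices, not part of the project's software, and the sketch assumes that both text and keyword use only the letters A-Z (anything else in the text is dropped).

```python
from itertools import cycle


def vigenere_encipher(plaintext, keyword):
    """Shift each plaintext letter by the matching keyword letter (A=0 ... Z=25)."""
    letters = [ch for ch in plaintext.upper() if "A" <= ch <= "Z"]
    # The keyword is repeated for as long as there are letters to encipher.
    shifts = cycle(ord(k) - ord("A") for k in keyword.upper())
    return "".join(
        chr((ord(p) - ord("A") + s) % 26 + ord("A"))
        for p, s in zip(letters, shifts)
    )


def vigenere_decipher(ciphertext, keyword):
    """Undo vigenere_encipher by shifting each letter back."""
    letters = [ch for ch in ciphertext.upper() if "A" <= ch <= "Z"]
    shifts = cycle(ord(k) - ord("A") for k in keyword.upper())
    return "".join(
        chr((ord(c) - ord("A") - s) % 26 + ord("A"))
        for c, s in zip(letters, shifts)
    )


print(vigenere_encipher("THIS TEXT IS SECRET", "lemon"))  # ELUGGPBFWFDIOFRE
```

Deciphering with the same keyword recovers THISTEXTISSECRET, since each shift is simply reversed modulo 26.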

Playfair Cipher

The Playfair cipher was invented in 1854 by Charles Wheatstone and promoted by Lord Playfair. It is a cipher that uses a 5-by-5 grid set up using a keyword usually containing 5 different letters (for example using the words death or plain). This keyword is placed along the top row of the 5-by-5 grid (and continued to subsequent rows if necessary and omitting repeat letters in the keyword) with the rest of the alphabet (omitting one letter of choice not in the phrase to be encrypted) filling up the remaining spaces in alphabetical order (left to right, then top to bottom).

For example, suppose we use the same phrase THISTEXTISSECRET as before, and the same keyword LEMON, and omit the letter Q. The 5-by-5 grid will be as follows:

L E M O N
A B C D F
G H I J K
P R S T U
V W X Y Z

The Playfair cipher is then implemented using each successive pair of letters. If there is an odd number of letters then a random letter is added to the end of the phrase, and if the two letters of any pair are the same then a letter must be added between them (usually an X). Each pair of letters is used to form a rectangle on the grid and the pair of letters is substituted for the opposing corners of the rectangle.

For example: The first pair in the above example is "TH":

L E M O N
A B C D F
G|H I J|K
P|R S T|U
V W X Y Z

So the letters "TH" would be substituted for the letters "RJ"

In the event that a given pair of letters are in the same row or column, the letters are substituted with the letters immediately below if they are in the same column or immediately to the right if they are in the same row (If the letter is on the edge then the letter on the opposing side is used).

For example: The next pair of letters in the above example are "IS". It can quickly be observed that these two letters fall in the same column, so the letters are substituted by the letters immediately below in the table to avoid the letters being the same as in the unciphered text. Hence, the pair "IS" is substituted with the pair "SX" as they are the letters immediately below.

Implementing this process on the entire phrase THISTEXTISSECRET yields RJSXROYSSXRMBSOR.
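The rules described above can be sketched in Python as follows. This is a minimal illustration rather than the project's own implementation: the names `build_grid`, `prepare_pairs` and `playfair_encipher` are hypothetical, a fixed filler letter X is used for padding an odd-length phrase (instead of a random letter) so that the output is reproducible, and the omitted letter is assumed not to occur in the phrase.

```python
import string


def build_grid(keyword, omit="Q"):
    """Return the 25 grid letters in row-major order: keyword first, then the rest."""
    grid = []
    for ch in keyword.upper() + string.ascii_uppercase:
        if ch != omit and ch not in grid:
            grid.append(ch)
    return grid


def prepare_pairs(text, filler="X"):
    """Split text into pairs, separating doubled letters and padding an odd length."""
    letters = [ch for ch in text.upper() if ch in string.ascii_uppercase]
    pairs, i = [], 0
    while i < len(letters):
        a = letters[i]
        if i + 1 < len(letters) and letters[i + 1] != a:
            pairs.append((a, letters[i + 1]))
            i += 2
        else:
            # Doubled letter or last unpaired letter: pair it with the filler.
            pairs.append((a, filler))
            i += 1
    return pairs


def playfair_encipher(text, keyword, omit="Q"):
    grid = build_grid(keyword, omit)
    pos = {ch: divmod(i, 5) for i, ch in enumerate(grid)}  # letter -> (row, col)
    out = []
    for a, b in prepare_pairs(text):
        (ra, ca), (rb, cb) = pos[a], pos[b]
        if ra == rb:      # same row: letter immediately to the right, wrapping
            out += [grid[ra * 5 + (ca + 1) % 5], grid[rb * 5 + (cb + 1) % 5]]
        elif ca == cb:    # same column: letter immediately below, wrapping
            out += [grid[((ra + 1) % 5) * 5 + ca], grid[((rb + 1) % 5) * 5 + cb]]
        else:             # rectangle: keep the row, take the other letter's column
            out += [grid[ra * 5 + cb], grid[rb * 5 + ca]]
    return "".join(out)


print(playfair_encipher("THISTEXTISSECRET", "LEMON"))  # RJSXROYSSXRMBSOR
```

With the keyword LEMON and Q omitted, `build_grid` produces the same grid as shown above, and the first two pairs TH and IS become RJ and SX as in the worked example.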

One Time Pad

A one time pad (OTP) is conceptually identical to the Vigenere cipher, except that the cipher key used is exactly the length of the plaintext. This is achieved by using a pad as the key. A pad can be a book or any sufficient length of text that is accessible to both the sender and the receiver of the ciphertext. The OTP system is theoretically unbreakable provided that the key (pad) is a random sequence (each letter is independent and identically distributed). Provided that both the plaintext and the key (pad) are random sequences (or close enough to random), the resultant ciphertext should have an equal distribution of letters. The same key (pad) can be used only once without compromising the security of the message (if two ciphertexts enciphered with the same pad were concatenated, the result would effectively be a Vigenere cipher with a repeating key). Effectively, the only way to break the code is to know the key.

The one time pad presents as a real possibility for the scheme used in the Somerton Man's code. Since the code was found in a rare copy of The Rubaiyat of Omar Khayyam, this yields the obvious idea that a certain poem in that edition of the book was used as the pad. The OTP system was commonly used during the Cold War by Soviet and other spies. It was also common to use a verse from the Bible: either the King James Version (KJV) or the Revised Standard Version (RSV) was used, as they were easily accessible (for example, in hotel rooms). For these reasons, the possibility of a One Time Pad system being used to generate the Somerton Man's code from KJV, RSV and the Rubaiyat of Omar Khayyam has been investigated. This will be discussed further below.

Markov Chains

A Markov chain models an order of variables (in our case a string of text) as a stochastic process where the next variable (character) is treated as depending only on a fixed number of previous values (characters in our case). The number of previous characters defines the order of the Markov chain. In this project, we will only consider first and second order Markov Chains for reasons of practicality in estimating probabilities.

Given knowledge of all population conditional probabilities (eg. [math]p(A|B) =[/math] probability that an A will be the next letter given that B was the last) we can calculate a Markov probability (MP) defined as follows:

Given a sequence of random variables [math]\{X_1, \ldots, X_n\}[/math]

[math]{\rm MP\ (first\ order)} = p(X_1) p(X_2|X_1) p(X_3|X_2) \ldots p(X_n|X_{n-1})[/math]
[math]{\rm MP\ (second\ order)} = p(X_1) p(X_2|X_1) p(X_3|X_2,X_1) \ldots p(X_n|X_{n-1},X_{n-2})[/math]

In this project, the sequence used is the Somerton Man's code and the population probabilities have been estimated from various languages and ciphers (and initial letters only of some languages) to determine the likelihood (Markov probability) of the sequence (code) coming from that particular language/cipher (as each represents a unique stochastic process).

Given the category/language/cipher of interest, an eBook of such category (or a very long string of ciphered text for the case of a cipher being tested) has been used to estimate the probabilities of each alphabet letter following each alphabet letter (conditional probabilities). From there, a Markov Probability has been calculated as described above.

The Markov probability estimated for each language (or category) has been assessed against a benchmark probability of 1/26^44 = 5.51027E-63. This represents the probability that each letter of the code is the outcome of a uniformly random process (each letter of the alphabet having a 1/26 chance of occurring in each of the 44 positions of the sequence). With reference to the HMMER User's Guide[9], we define the HMMER score as below. We consider this as a useful comparison between results.

[math]{\rm HMMER\ Score} = {\rm log_2}(\frac{{\rm Markov\ Probability}}{\frac{1}{26^{44}}})[/math]

An HMMER score of zero indicates equal likelihood of the Somerton Man's code originating from a uniformly random process and originating from the relevant language or category.
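As an illustration of the first-order calculation, the following Python sketch estimates the conditional probabilities from a reference text and computes the score in log space, which avoids underflow for probabilities as small as 1/26^44. The function names are hypothetical, and the additive smoothing parameter `alpha` is an assumption of this sketch (the report does not state how letter pairs that never occur in the reference text are handled).

```python
import math
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def letters_only(text):
    """Upper-case the text and keep only the letters A-Z."""
    return [ch for ch in text.upper() if ch in ALPHABET]


def first_order_log2_prob(sequence, corpus, alpha=1.0):
    """log2 of p(X1) p(X2|X1) ... p(Xn|Xn-1), with probabilities estimated from corpus.

    alpha is additive smoothing (an assumption of this sketch) so that letter
    pairs unseen in the corpus do not force the whole product to zero.
    """
    corp = letters_only(corpus)
    seq = letters_only(sequence)
    k = len(ALPHABET)
    singles = Counter(corp)                   # counts of each letter
    pairs = Counter(zip(corp, corp[1:]))      # counts of (previous, next) pairs
    prev_counts = Counter(corp[:-1])          # how often each letter has a successor
    logp = math.log2((singles[seq[0]] + alpha) / (len(corp) + alpha * k))
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log2(
            (pairs[(prev, cur)] + alpha) / (prev_counts[prev] + alpha * k)
        )
    return logp


def hmmer_score(log2_markov_prob, length):
    """log2(MP / (1 / 26**length)), evaluated as log2(MP) + length * log2(26)."""
    return log2_markov_prob + length * math.log2(26)


# Benchmark probability for a 44-letter sequence under the uniform model.
print(26.0 ** -44)
```

With an empty reference text every smoothed probability reduces to 1/26, so the score is zero, matching the interpretation given above. A positive score means the sequence is more likely under the estimated model than under the uniform one.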

Methodology and preliminary results

The Somerton Man's Code

The code (shown below) is subject to some interpretation. Most obviously, there is a line of text that looks to be crossed out (struck out). There are also a few ambiguous letters, described as follows:

  • The first letter of both the first and second lines (not counting the struck-out line) looks as if it could be either an M or a W.
  • There is a cross on top of the O (this could indicate an error and, hence, to discount the character)
  • The third to last letter looks like a G but could be a C as the horizontal line that is part of the next character extends into this character.
  • The fifth to last letter looks like an S but with a line through the middle. This is inconsistent with the other S.

Each of these single-letter ambiguities is represented by error bars when graphs of letter frequency distribution are shown (except for the S with a line through it, as we have no alternative theory on what the character is). As you can see in the frequency plot below, the ambiguity of the first letter of the first and second lines (M or W) is represented by green, the ambiguity of the first letter in the last line (I or V) is represented in yellow, and the ambiguity of the third to last letter (G or C) is represented in purple. These colours will be kept consistent throughout this report. Note that exactly two of the green portions must be included and exactly one yellow and one purple must be included in any interpretation of the code.

A photo of the original code found in 1949 in the back of a copy of the Rubaiyat of Omar Khayyam

Letter frequency distribution plot of the most likely intended sequence of letters

We consider the most likely intended sequence of characters (from pure first observation of the above picture) to be as follows:
First line: MRGOABABD
Second line: MTBIMPANETP
Third line: MLIABOAIAQC
Fourth line: ITTMTSAMSTGAB
Struck-out line: MLIAOI
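The letter counts behind the frequency plot can be checked with a few lines of Python. This sketch uses the most likely reading of the four lines listed above and leaves out the crossed-out line, which is a choice made for this illustration; the names `CODE_LINES` and `letter_frequencies` are hypothetical.

```python
from collections import Counter

# Most likely reading of the four lines, as listed above.
CODE_LINES = ["MRGOABABD", "MTBIMPANETP", "MLIABOAIAQC", "ITTMTSAMSTGAB"]


def letter_frequencies(lines):
    """Return (counts, percentages) for the letters of the concatenated lines."""
    text = "".join(lines)
    counts = Counter(text)
    total = len(text)
    percentages = {ch: 100.0 * n / total for ch, n in counts.items()}
    return counts, percentages


counts, percentages = letter_frequencies(CODE_LINES)
print(sum(counts.values()))              # 44 letters in total
print(counts["A"], counts["R"], counts["S"], counts["Z"])
```

Under this reading the code contains 44 letters, with A the most frequent letter and no Z at all.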

Structure of common methodology implemented by software

The following modules have been used at various points in the project to calculate statistical information.

Streams

A punctuation filter stream was programmed to filter punctuation from an eBook or any text input. This module takes a text file input and outputs a stream (which can be used as inputs to other modules or outputting directly to a file) matching the text input, but removing punctuation characters (and turning hyphens into spaces to preserve the word structure).

Another stream was programmed with the purpose of finding only the initial letters of words. This is implemented as counting the first character after a space, newline, or tab as the first letter of a word. It assumes the input comes from a punctuation filter stream (as described above) so that this letter is never a punctuation character.
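The two streams can be sketched as chained Python generators. This is an illustration under stated assumptions, not the project's code: the function names are hypothetical, anything that is neither alphanumeric nor whitespace is treated as punctuation, and the very first character of the input is also counted as an initial letter.

```python
def punctuation_filter(chars):
    """Yield the input characters with punctuation removed; hyphens become spaces."""
    for ch in chars:
        if ch == "-":
            yield " "          # preserve the word boundary
        elif ch.isalnum() or ch.isspace():
            yield ch


def initial_letters(chars):
    """Yield the first character after each space, newline, or tab.

    Expects input that has already passed through punctuation_filter.
    """
    at_word_start = True       # assumption: the stream starts at a word boundary
    for ch in chars:
        if ch in " \n\t":
            at_word_start = True
        elif at_word_start:
            yield ch
            at_word_start = False


text = "well-known, isn't it?"
print("".join(punctuation_filter(text)))
print("".join(initial_letters(punctuation_filter(text))))
```

Because both functions are generators, the output of one can be fed directly into the other, or into a letter counter, without holding a whole eBook in memory.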


Letter counter

The following process is common to many parts of the project and is used to generate either (or both) a letter frequency distribution or a Markov probability and HMMER score (as described in the Markov Chains section above). It relies on the punctuation filter stream and initial letter stream as described above (the initial letter stream is only included if the process is modelling only the first letters of words). Each order of Markov counter has been implemented independently but has only been implemented for 1st and 2nd order in this project.

Counter Process.png

Results and interpretation

The significance of our findings can be modelled by the following hypotheses on the nature of the code:

  1. The code is meaningless
  2. The letters are as they are supposed to be (not substituted)
    1. The code is an anagram in English or another language
    2. The code represents the initial letters of a list of words (treating the order as irrelevant)
    3. The code represents the initial letters of words in a sentence or phrase (treating the order as relevant)
  3. The original letters have been substituted for others in a systematic way
    1. Vigenere Cipher
    2. Playfair Cipher
    3. One-Time Pad

This set of hypotheses includes both transposition and substitution cipher schemes in both English and foreign languages, with hypothesis 1 being a "catch all" since we define meaningless to be that the string of letters does not represent meaningful information. We consider the set of hypotheses 1, 2 and 3 to be exhaustive (exactly one must be true) if and only if the approach we take for hypothesis 3 does not assume that there is no transposition.


Hypothesis 1: The code is meaningless

One important hypothesis that we have investigated is that the code is just a series of random letters and is meaningless. By meaningless, we mean that it does not contain any hidden message nor can any origin be found for it. Under this hypothesis we assert the following possible scenarios of how the string of letters was written:

  1. The letters were deliberately intended to be random (either intended to confuse people or otherwise)
  2. The Somerton Man (or anyone else) was intoxicated and the letters are a result of delusion.

It is important to note that this investigation will be done with English speaking people and hence assumes that the random series of letters (as we are assuming it to be) was invented by an English speaking person. This should be investigated further with non-English speaking people but will not be done in this project.

Unintoxicated Random Samples

The graph below shows the distribution of letters (as a percentage) of 37 samples of approximately 50 random English letters taken from various people while they were (to the best of our knowledge) suffering no form of intoxication or delusion:

Soberbox.png

Of particular interest in the above boxplot is the frequency of the letters "R", "S" and "A", which occur noticeably more than any other letters. This, combined with the fact that "R" only appears once in the Somerton Man's code and "S" only twice, suggests the code is not a meaningless random set of letters.

One point that must be noted, however, is that a large percentage (all but 5 test subjects) had a high (but varying) degree of difficulty in writing random letters. Some test subjects took well over 5 minutes to write their 50 letters in an attempt to be random, yet despite this, in a few instances entire words were written down, even when the subject was consciously trying not to do so. While this fact does not rule out the random letters theory, it does make it seem less probable.

Intoxicated Random Samples

For legal, moral and ethical reasons, the influence of any probable poison that may have been ingested by the Somerton Man could not be tested, so other means had to be used. It was decided that alcohol would be the best achievable substitute for the poison, as it is legal, easily accessible and, most importantly, ingested willingly and regularly by potential test subjects, by their own admission.

However, as alcohol is frequently ingested by potential test subjects, each individual has developed a different level of tolerance to it, and hence determining when that subject has reached a level of intoxication that is sufficient to replicate the effects of a poison proved very difficult to judge, and impossible to regulate or measure.

The graph below is a sample of 18 random letter distributions produced by intoxicated persons. While 18 is still a small sample, and it will be important to continue to extend the data accordingly, some trends are still visible.

Drunkbox.png

Of greatest interest in the above boxplot is the surprisingly high frequency of the letter "Z", which is noticeably higher than any other letter. This is in complete contrast to the Somerton Man's code, which does not contain "Z" at all. This is a fairly strong indicator that, based on the sample population, it is highly unlikely that the code was written by someone who was intoxicated at that time. However, much more work is required to be more certain of this conclusion (see 'Future Studies' below).

Below is an image used purely for comparative purposes between the unintoxicated samples (top left graph), the intoxicated samples (top right graph) and the Somerton Man's code letter distributions (in alphabetical order).


Compare.png

Future Studies

This section of the analysis is still incomplete, however, as the vast majority of the sample population used were of Western European heritage, with English as their first language, whilst the Somerton Man was described as being of Eastern European appearance, and therefore may have had a different natural language predominance. Accordingly, future studies should incorporate people of many different backgrounds, and should not be restricted to English letters. However, it will be important to ensure that each different language is treated both as a separate sample and combined with all languages to get the most appropriate distributions to compare with the Somerton Man's code.

Also, there are many different legal drugs other than alcohol that can affect an individual's mental functionality, and hence affect the distributions of random letters they generate, likely in a different manner to the effects of alcohol. Therefore, in future studies it will be important to test different drugs as well, to see what effects they have.

Hypothesis 2: Letters have not been substituted

This hypothesis assumes that each letter is as it is intended but that the letters may be in a different order. To assess this hypothesis, we have investigated the following possibilities:

  1. The string of letters is an anagram (the letters themselves are correct but their order has been rearranged) in English or another language.
  2. Each letter represents the first letter of a word in a list. This list could be periodic elements, place names, names of people, train station names, etc. This assumes that the order of the letters is irrelevant. This hypothesis has been only briefly investigated in English.
  3. Each letter represents the first letter of a word in a sentence. This is distinct from the last possibility by treating the order of the letters as relevant. This has been investigated in a range of languages.

Anagram or transposition cipher

The first possibility (anagram) has been easily tested by comparing the letter frequency distribution of the code to the letter frequency distribution of text of possible languages. The texts used are described in Appendix A. The letter frequencies of these languages are shown below. A plot of the frequency of letters in the Somerton Man's code is also shown for comparison.

Frequency of letters in an English eBook with the frequency of letters in the code. Note the huge differences

The first point to note (considering only the graph on the left-hand side) is that the frequency of letters does not differ dramatically between languages. The difference is larger for the vowels than for the consonants, but there is definitely a general trend.

Comparing this trend to the letter frequency of the Somerton Man's code, there are far fewer Es and Ns and many more As in the code than are likely to arise if the code were an anagram. It is therefore unlikely that the Somerton Man's code is an anagram in any of the languages considered. The simple fact that there is a Q but no U in the code is also a good indication that this is not the case (for English at least). From this, we can rule out any transposition cipher not also accompanied by a substitution cipher.
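The comparison described above rests on simple letter counts. The following Java sketch shows one way such counts could be produced. It is an illustration only, not the project's software: the class and method names are hypothetical, and the code string is the 44-letter reading (without the crossed-out line) that appears later in this report.

```java
class LetterFrequencySketch {
    // Counts occurrences of each letter A-Z, ignoring case and any non-letters.
    static int[] counts(String text) {
        int[] counts = new int[26];
        for (char raw : text.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c >= 'A' && c <= 'Z') {
                counts[c - 'A']++;
            }
        }
        return counts;
    }

    // Converts the counts into relative frequencies that sum to 1 (or all zeros for empty input).
    static double[] frequencies(String text) {
        int[] c = counts(text);
        int total = 0;
        for (int n : c) {
            total += n;
        }
        double[] f = new double[26];
        for (int i = 0; i < 26; i++) {
            f[i] = total == 0 ? 0.0 : (double) c[i] / total;
        }
        return f;
    }

    public static void main(String[] args) {
        String code = "MRGOABABDMTBIMPANETPMLIABOAIAQCITTMTSAMSTGAB";
        int[] c = counts(code);
        for (int i = 0; i < 26; i++) {
            System.out.println((char) ('A' + i) + ": " + c[i]);
        }
    }
}
```

Running the same two methods over a language's reference text gives the distribution to compare against; for this reading of the code the counts include eight As, one E, one Q and no U.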

Initial letters of an unordered list

To test the second possibility (initial letters of a category), a similar approach has been taken but by counting initial letters of words of certain categories. The counted letter frequency plots with comparison to the Somerton Man's code are shown below:

Initial Letters of English text vs the Somerton Man's Code
Initial Letters of Periodic Elements vs the Somerton Man's Code
Initial Letters of Australian Cities vs the Somerton Man's Code

The last two graphs (initial letters of periodic elements and of Australian cities) show too many large differences from the Somerton Man's code to stand out as likely possibilities. The first graph, however, of initial letters of general English text, shows a strong similarity to the code. While there are differences (no H's or F's in the code and a few more A's and B's in the code than in the English text), the similarities far outweigh them. It therefore presents as a strong possibility that the Somerton Man's code represents the initial letters of some English words (ordered or not).
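Counting initial letters differs from counting all letters only in that one letter per word is tallied. A minimal Java sketch of that step follows. The names are hypothetical, and it assumes that words are runs of ASCII letters and apostrophes; texts with accented initials would need a wider alphabet.

```java
class InitialLetterCountSketch {
    // Tallies the first letter of every word in the text (A-Z only).
    static int[] initialCounts(String text) {
        int[] counts = new int[26];
        // Split on anything that is not a letter or an apostrophe.
        for (String word : text.split("[^A-Za-z']+")) {
            int i = 0;
            // Skip leading apostrophes such as opening quote marks.
            while (i < word.length() && word.charAt(i) == '\'') {
                i++;
            }
            if (i < word.length()) {
                char first = Character.toUpperCase(word.charAt(i));
                counts[first - 'A']++;
            }
        }
        return counts;
    }
}
```

For example, `initialCounts("The apple and the tree")` tallies three words under T and two under A.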

If it is the case that the code is an initialism, we can start to make some judgements on the nature of the ambiguous letters. For instance, it seems more likely that the first letters of the first and second lines of the code are W's instead of M's, as the statistics indicate that there should be more W's in the code. Similarly (but less definitely), the third to last letter of the code seems more likely to be a C than a G. These judgements are based purely on the probabilities of initial letters, and any studies based on analysis of the original handwriting should be considered a better indication of this.

Initial Letters of a sentence

The third possibility has been investigated by treating the sequence as a Markov chain. This treats the order as relevant by finding the probability of each transition (one letter following another). The calculation of Markov probabilities and the HMMER score is described in section 6.3 (above). The following results were obtained for the most likely sequence of the Somerton Man's code, using transition probabilities of initial letters in sentences of various languages estimated from the eBooks described in Appendix A. The 'Corrected Zeroes' value is the number of transitions estimated to have conditional probability p=0, for which p=0.0001 was used instead so that they do not rule the sequence out entirely:

Language    Order   Markov Probability       Corrected Zeroes  HMMER Score
English     First   5.755746003335865E-56    0                 23.316377212971148
English     Second  1.0496288884966237E-55   0                 24.18318171171002
Italian     First   6.089831612262369E-59    0                 13.431992337096512
Italian     Second  1.0949986922100798E-57   0                 17.6003753
Portuguese  First   2.2658166834014361E-60   0                 8.683693049158332
Portuguese  Second  9.986807259493189E-70    4                 -22.395595515749598
French      First   1.960919656262944E-61    0                 5.153264236029422
French      Second  8.83846402382603E-70     3                 -22.571823368605646
Spanish     First   3.204888821639082E-63    0                 -0.7818480694202504
Spanish     Second  8.869612541473453E-71    3                 -25.888676055317255
Swedish     First   2.6806903224847996E-64   2                 -4.361445908011734
Swedish     Second  9.55816081723446E-69     4                 -19.13695790770939
Dutch       First   6.9654365323502316E-68   1                 -16.27154908302583
Dutch       Second  1.9749729570078271E-69   2                 -21.41185805018986
German      First   7.441017498436695E-73    1                 -32.78590341697
German      Second  1.6757717490935368E-89   10                -88.08742718870606
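As a rough illustration of how figures like those listed above could be produced, the following Java sketch estimates first order transition probabilities from a training sequence and multiplies them along a test sequence, substituting p=0.0001 for zero-probability transitions and counting how many were corrected. It is not the project's software. The class and method names are hypothetical, the input is assumed to be upper-case A-Z only, and the probability of the first letter is left out of the product (section 6.3 may treat this differently).

```java
class MarkovSketch {
    // Value substituted for transitions never seen in training, as described in the text.
    static final double FLOOR = 0.0001;

    static final class Result {
        final double probability;
        final int correctedZeroes;

        Result(double probability, int correctedZeroes) {
            this.probability = probability;
            this.correctedZeroes = correctedZeroes;
        }
    }

    // counts[a][b] is the number of times letter b directly follows letter a.
    static int[][] countTransitions(String training) {
        int[][] counts = new int[26][26];
        for (int i = 0; i + 1 < training.length(); i++) {
            counts[training.charAt(i) - 'A'][training.charAt(i + 1) - 'A']++;
        }
        return counts;
    }

    // Multiplies the estimated transition probabilities along the sequence.
    static Result score(String sequence, int[][] counts) {
        double probability = 1.0;
        int corrected = 0;
        for (int i = 0; i + 1 < sequence.length(); i++) {
            int from = sequence.charAt(i) - 'A';
            int to = sequence.charAt(i + 1) - 'A';
            int rowTotal = 0;
            for (int n : counts[from]) {
                rowTotal += n;
            }
            double p = rowTotal == 0 ? 0.0 : (double) counts[from][to] / rowTotal;
            if (p == 0.0) {
                p = FLOOR; // corrected zero
                corrected++;
            }
            probability *= p;
        }
        return new Result(probability, corrected);
    }
}
```

Here the training sequence would be the string of initial letters taken from a language's reference text, and the test sequence would be the code itself. A second order model would condition on the two preceding letters instead of one.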

The language that results in the greatest Markov probability (and hence HMMER score) is English. It is also much more likely than the next highest alternative (Italian), whose first order HMMER score is around 10 lower and whose Markov probability is around 1000 times lower. The HMMER score for English text is also much greater than zero, indicating that the Somerton Man's code is much more likely to represent the initial letters of an English phrase than to have originated from a uniformly random process (with a 1/26 chance of each letter occurring at each of the 44 characters).
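The listed scores appear consistent with a base-2 log-odds ratio of the Markov probability against the uniform null model just described, in which every one of the 44 letters has probability 1/26. The Java sketch below shows that calculation. This reading is inferred from the published numbers rather than taken from section 6.3, and the class and method names are hypothetical.

```java
class LogOddsSketch {
    // log2( P(sequence | model) / P(sequence | uniform null) ) for a sequence of the given length.
    static double log2Odds(double modelProbability, int length) {
        double log2Model = Math.log(modelProbability) / Math.log(2.0);
        double log2Null = length * (Math.log(1.0 / 26.0) / Math.log(2.0));
        return log2Model - log2Null;
    }

    public static void main(String[] args) {
        // First order English probability from the results above, 44 characters.
        System.out.println(log2Odds(5.755746003335865E-56, 44));
    }
}
```

Under this reading, a score above zero means the language model explains the sequence better than uniform random letters, and each unit of score is a factor of two in likelihood. A gap of about 10 between English and Italian therefore corresponds to a factor of roughly 1000.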

Ordered ranking of languages for the Somerton Man's Code representing initial letters of the given language

Using the result of this investigation (i.e. assuming that the Somerton Man's code is, in fact, the initial letters of a phrase), efforts have been made to find the source of this phrase. The Rubaiyat of Omar Khayyam and the Holy Bible (King James Version and Revised Standard Version) were searched (using purpose-written software and the streams described in section 7.2.1) for the sequence of initial letters RGOABABD (this being an unambiguous subsequence of the Somerton Man's code). No matches were found, from which we conclude that the Somerton Man's code is not the initial letters of a phrase from either the Rubaiyat of Omar Khayyam or these editions of the Holy Bible.
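A search of this kind can be sketched in Java as follows: reduce the text to the initial letter of each word, then look for the target sequence in the resulting string. This is an illustrative sketch and not the stream-based software of section 7.2.1. The names are hypothetical, and it assumes that an apostrophe inside a word does not start a new word.

```java
class InitialLetterSearchSketch {
    // Reduces a text to the upper-case initial letter of each word.
    static String initials(String text) {
        StringBuilder sb = new StringBuilder();
        boolean inWord = false;
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                if (!inWord) {
                    sb.append(Character.toUpperCase(c));
                }
                inWord = true;
            } else if (c != '\'') {
                // Any character other than an apostrophe ends the current word.
                inWord = false;
            }
        }
        return sb.toString();
    }

    // Returns the word index at which the initial-letter pattern starts, or -1 if absent.
    static int find(String text, String pattern) {
        return initials(text).indexOf(pattern);
    }
}
```

For example, `find(bookText, "RGOABABD")` would return -1 for a book in which no run of eight consecutive words has those initials.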

Hypothesis 3: Letters have been substituted

This hypothesis has been given the most attention as there are a number of different possibilities. We have investigated only the following possible cipher schemes and have only considered English as the original plaintext for each.

  1. Playfair Cipher
  2. Vigenere Cipher
  3. One-Time Pad
  4. First Order Substitution Cypher

Playfair cipher

Using software written in Java, the Playfair cipher (see section 6.2.2) was implemented. The concatenated series of English text described in Appendix A was enciphered using a Playfair cipher with a cipherkey of 'LEMON'. The resultant ciphertext was then used as an input to the Markov counter process described in section 7.2.2 (without the initial letter stream) and the following Markov probability (see 6.3) was calculated:
Markov Probability: 5.213910076344393E-69
HMMER Score: -20.01132524791302

This overwhelmingly low probability (HMMER score much less than 0) indicates beyond reasonable doubt that the Playfair cipher was not used to generate the Somerton Man's code from an English plaintext. More formally, it shows that it is much more likely that each letter was picked purely at random with a 1/26 chance of each letter of the alphabet occurring.
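For readers unfamiliar with the scheme, a compact Java sketch of Playfair encipherment is given below. It is not the project's implementation. It assumes the common conventions of merging J into I, inserting X between doubled letters and padding an odd final letter with X; the variant described in section 6.2.2 may differ, and the class and method names are hypothetical.

```java
class PlayfairSketch {
    // Builds the 5x5 key square as a 25-character string in row-major order (J merged into I).
    static String square(String key) {
        StringBuilder sq = new StringBuilder();
        String source = key.toUpperCase() + "ABCDEFGHIKLMNOPQRSTUVWXYZ";
        for (char c : source.toCharArray()) {
            if (c == 'J') {
                c = 'I';
            }
            if (c >= 'A' && c <= 'Z' && sq.indexOf(String.valueOf(c)) < 0) {
                sq.append(c);
            }
        }
        return sq.toString();
    }

    static String encipher(String plaintext, String key) {
        String sq = square(key);
        // Keep letters only, upper-cased, with J replaced by I.
        StringBuilder letters = new StringBuilder();
        for (char c : plaintext.toUpperCase().toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                letters.append(c == 'J' ? 'I' : c);
            }
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < letters.length()) {
            char a = letters.charAt(i);
            char b;
            if (i + 1 < letters.length() && letters.charAt(i + 1) != a) {
                b = letters.charAt(i + 1);
                i += 2;
            } else {
                b = 'X'; // filler for a doubled letter or an odd final letter
                i += 1;
            }
            int ra = sq.indexOf(a) / 5, ca = sq.indexOf(a) % 5;
            int rb = sq.indexOf(b) / 5, cb = sq.indexOf(b) % 5;
            if (ra == rb) {
                // Same row: take the letter to the right, wrapping around.
                out.append(sq.charAt(ra * 5 + (ca + 1) % 5));
                out.append(sq.charAt(rb * 5 + (cb + 1) % 5));
            } else if (ca == cb) {
                // Same column: take the letter below, wrapping around.
                out.append(sq.charAt(((ra + 1) % 5) * 5 + ca));
                out.append(sq.charAt(((rb + 1) % 5) * 5 + cb));
            } else {
                // Rectangle: keep the row, swap the columns.
                out.append(sq.charAt(ra * 5 + cb));
                out.append(sq.charAt(rb * 5 + ca));
            }
        }
        return out.toString();
    }
}
```

With the cipherkey 'LEMON' the key square reads LEMON / ABCDF / GHIKP / QRSTU / VWXYZ, and `encipher("HELLO", "LEMON")` splits the text into HE, LX, LO before substituting each pair.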

Vigenere cipher

Similarly, the Vigenere Cipher (see section 6.2.1) was implemented and the same set of English text was enciphered using a cipherkey of 'LEMON'. The resultant ciphertext was used as an input to the Markov counter process described in section 7.2.2 (without the initial letter stream) and the following Markov probability (see 6.3) was calculated:
Markov Probability: 1.646391769425068E-70
HMMER Score: -24.99631136880728

Once again, this overwhelmingly low probability (HMMER score much less than 0) indicates beyond reasonable doubt that the Vigenere cipher was not used to generate the Somerton Man's code from an English plaintext.
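The Vigenere encipherment step used to build the training ciphertext can be sketched in Java as below. This is a minimal illustration rather than the project's code. The names are hypothetical, non-letters in the plaintext are assumed to be dropped, and the key is assumed to contain letters only.

```java
class VigenereSketch {
    // Adds the repeating key to the plaintext letter by letter, modulo 26.
    static String encipher(String plaintext, String key) {
        String k = key.toUpperCase();
        StringBuilder out = new StringBuilder();
        int used = 0; // number of key letters consumed so far
        for (char raw : plaintext.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c < 'A' || c > 'Z') {
                continue; // skip spaces and punctuation
            }
            int shift = k.charAt(used % k.length()) - 'A';
            out.append((char) ('A' + (c - 'A' + shift) % 26));
            used++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encipher("ATTACK AT DAWN", "LEMON")); // prints LXFOPVEFRNHR
    }
}
```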

One-time pad

The circumstances in which the code was discovered (being scribbled on a page of a very common book) give rise to the possibility of a one-time pad system being used. The Rubaiyat of Omar Khayyam and the Holy Bible (both the King James Version (KJV) and Revised Standard Version (RSV)) have been investigated as possible pads. It is important to note here that, while the circumstances may suggest a one-time pad, the letter frequency distribution of the Somerton Man's code is far from flat, as a one-time pad encryption is likely to give (although, as the assumption of a random sequence as a pad breaks down, so does the expectation that the ciphertext has an even letter distribution).


Using the Holy Bible as a pad

Both the King James Version (KJV) and the Revised Standard Version (RSV) of the Holy Bible have been considered as possible pads for the possibility of a one-time pad system being used to produce the Somerton Man's code. Each verse of each of these versions was in turn treated as the cipherkey. The ciphertext was considered to be the Somerton Man's code. Using each Ciphertext/cipherkey pair, a Vigenere cipher implementation was used (the Vigenere cipher is identical to the one-time pad system if the cipherkey is at least as long as the ciphertext) to generate the plaintext needed to give the Somerton Man's code as a ciphertext. For example, using Daniel 1:2 as the cipherkey (in the KJV):

Ciphertext: MRGOABABDMTBIMPANETPMLIABOAIAQCITTMTSAMSTGAB
Cipherkey: AndtheLordgaveJehoiakimkingofJudahintohishan
Plaintext: MEDVTXPNMJNBNIGWGQLPCDWQTBUUVHIFTMEGZMFKBZAO

There are too many resultant plaintexts to show in this report, but from observation of them (checking each one individually to determine whether any English words are present) it is clear that, if a one-time pad system was used in this fashion, no verse from either of these versions of the Bible was used as the cipherkey (pad).
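The decipherment step described above amounts to subtracting the pad from the ciphertext letter by letter, modulo 26. A minimal Java sketch follows. The names are hypothetical, and the choice to reject a pad shorter than the ciphertext is an assumption, since the report does not say how short verses were handled.

```java
class PadSketch {
    // Keeps only the letters of the pad text, upper-cased.
    static String lettersOnly(String s) {
        StringBuilder sb = new StringBuilder();
        for (char raw : s.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c >= 'A' && c <= 'Z') {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Subtracts the pad from the ciphertext letter by letter, modulo 26.
    static String decipher(String ciphertext, String padText) {
        String key = lettersOnly(padText);
        if (key.length() < ciphertext.length()) {
            throw new IllegalArgumentException("pad is shorter than the ciphertext");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < ciphertext.length(); i++) {
            int c = ciphertext.charAt(i) - 'A';
            int k = key.charAt(i) - 'A';
            out.append((char) ('A' + (c - k + 26) % 26));
        }
        return out.toString();
    }
}
```

For instance, `decipher("MRGOABABD", "And the Lord gave Jehoiakim")` uses the pad letters ANDTHELOR and yields MEDVTXPNM. Calling it once per verse, with the full code as the ciphertext, produces the candidate plaintexts to be inspected.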

Using the Rubaiyat of Omar Khayyam as a pad

Given the circumstances of the discovery of the code, the other obvious choice of pad is The Rubaiyat of Omar Khayyam. If this book was used as a pad, it is likely that a particular verse was used as the cipherkey. Since the Somerton Man's code consists of four lines (discounting the struck-out line) and The Rubaiyat of Omar Khayyam consists of four-line stanzas, each line of a poem may correspond to a line of the code. Hence, two possible structures for using the Rubaiyat of Omar Khayyam as a pad have been investigated.

Treating the code as a single cipher
Treating the code as a single cipher, the ciphertext becomes the concatenation of each line of the code. Similarly, the cipherkey is the concatenation of each line of a poem of the Rubaiyat of Omar Khayyam. For example, using the first poem of the book:

Ciphertext: MRGOABABDMTBIMPANETPMLIABOAIAQCITTMTSAMSTGAB
Cipherkey: AWAKEFORMORNINGINTHEBOWLOFNIGHTHASFLUNGTHEST
Plaintext: MVGEWWMKRYCOAZJSALMLLXMPNJNAUJJBTBHIYNGZMCII

This has been done for each poem in the book and none of the resultant plaintexts were observed to contain any English words.


Treating the code as four separate ciphers
In this variation, each line of the four-line poem was paired with the corresponding line of the Somerton Man's code (also four lines if you don't count the crossed out line) giving 4 separate ciphers for each poem. For example, using poem 1:

Awake! For Morning In The Bowl Of Night
Has Flung The Stone That Puts The Stars To Flight:
And Lo! The Hunter Of The East Has Caught
The Sultan's Turret In A Noose Of Light.

Cipher key #1: AWAKEFORM
Cipher key #2: HASFLUNGTHE
Cipher key #3: ANDLOTHEHUN
Cipher key #4: THESULTANSTUR

Ciphertext #1: MRGOABABD
Ciphertext #2: MTBIMPANETP
Ciphertext #3: MLIABOAIAQC
Ciphertext #4: ITTMTSAMSTGAB

Once again, the resultant plaintexts were observed to not contain any English words.
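The line-by-line variant can be sketched in Java as below: each stanza line is stripped to its letters, truncated to the length of the matching code line and subtracted from it. This is a hedged illustration with hypothetical names, not the software used for the results above, and it assumes each stanza line has at least as many letters as its code line.

```java
class StanzaKeySketch {
    // Keeps only the letters of a stanza line, upper-cased.
    static String lettersOnly(String s) {
        StringBuilder sb = new StringBuilder();
        for (char raw : s.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c >= 'A' && c <= 'Z') {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Deciphers each code line with the matching stanza line as its own key.
    static String[] decipherByLine(String[] codeLines, String[] stanzaLines) {
        String[] plain = new String[codeLines.length];
        for (int line = 0; line < codeLines.length; line++) {
            String cipher = codeLines[line];
            // The key is implicitly truncated because only cipher.length() letters are read.
            String key = lettersOnly(stanzaLines[line]);
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < cipher.length(); i++) {
                int c = cipher.charAt(i) - 'A';
                int k = key.charAt(i) - 'A';
                out.append((char) ('A' + (c - k + 26) % 26));
            }
            plain[line] = out.toString();
        }
        return plain;
    }
}
```

With the four code lines and the four lines of poem 1 as inputs, the first element of the result is the first code line deciphered against AWAKEFORM.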

First Order Substitution Cypher

A first order substitution cypher is a cypher in which each letter is substituted based on its value alone. Two of the most common cyphers of this form are the Alphabet Reversal Cypher and the Caesar Cypher.

Alphabet Reversal Cypher

As its name suggests, the Alphabet Reversal Cypher involves reversing the alphabet and taking the letter in the same position as the letter it is replacing.

For the English alphabet:
A B C D E F ... X Y Z becomes
Z Y X W V U ... C B A

Therefore, a phrase such as THIS TEXT IS SECRET becomes GSRH GVCG RH HVXIVG
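The reversal can be written in a few lines of Java. The sketch below is illustrative only; the class and method names are hypothetical, and non-letters are assumed to pass through unchanged.

```java
class AlphabetReversalSketch {
    // Maps A<->Z, B<->Y, and so on, leaving other characters untouched.
    static String apply(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                out.append((char) ('Z' - (c - 'A')));
            } else if (c >= 'a' && c <= 'z') {
                out.append((char) ('z' - (c - 'a')));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(apply("THIS TEXT IS SECRET")); // prints GSRH GVCG RH HVXIVG
    }
}
```

Because the mapping is its own inverse, applying the method twice returns the original text.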

Caesar Cypher

Arguably the most well-known cypher, the Caesar Cypher shifts each letter by the same preset value, wrapping around to the start of the alphabet for letters that are shifted beyond its end.

Again looking at English; for a shift of 2:
A B C D E F ... X Y Z becomes
C D E F G H ... Z A B

In this case, the phrase THIS TEXT IS SECRET becomes VJKU VGZV KU UGETGV

Yet for a shift of 22:
A B C D E F ... X Y Z becomes
W X Y Z A B ... T U V

In this case, the phrase THIS TEXT IS SECRET becomes PDEO PATP EO OAYNAP
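The shift with wrap-around is a single modular addition per letter, as the following Java sketch shows. The names are hypothetical, only upper-case letters are shifted, and negative shifts are assumed to mean shifting backwards.

```java
class CaesarSketch {
    // Shifts each upper-case letter forward by the given amount, wrapping past Z.
    static String shift(String text, int amount) {
        int s = ((amount % 26) + 26) % 26; // normalise to the range 0..25
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                out.append((char) ('A' + (c - 'A' + s) % 26));
            } else {
                out.append(c); // spaces and punctuation are left as they are
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(shift("THIS TEXT IS SECRET", 2));  // prints VJKU VGZV KU UGETGV
        System.out.println(shift("THIS TEXT IS SECRET", 22)); // prints PDEO PATP EO OAYNAP
    }
}
```

Since there are only 26 possible shifts, a Caesar cypher can be tested exhaustively by trying every value and inspecting the results.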

Testing First Order Substitution Cyphers By Pattern Matching

To test whether the Somerton Man's code is a first order substitution cypher, certain patterns that exist within the code were cross-referenced with an extensive list of English words[10].

Several patterns were identified within the code and, for each pattern, code was written to cross-reference that pattern with a given word in a way that ignores what the letters actually are, as long as they fit the pattern. For example:

The pattern "ABAB" could be "ABAB", or it could be "ERER" or "ININ", or any other combination of letters that have the same letter corresponding to "A" in each (and every) instance of the letter "A", and another letter that is the same for each and every instance of the letter "B".
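One way to compare patterns independently of the actual letters is to rewrite both the pattern and each candidate stretch of a word in a canonical form, where the first distinct letter becomes A, the second B, and so on. The Java sketch below does this. It is not the code written for the project; the names are hypothetical, and it assumes that different pattern letters must stand for different letters in the word.

```java
import java.util.HashMap;
import java.util.Map;

class PatternSketch {
    // Rewrites a string so its first distinct letter is A, the second is B, and so on.
    static String canonical(String s) {
        Map<Character, Character> seen = new HashMap<>();
        StringBuilder sb = new StringBuilder();
        for (char c : s.toUpperCase().toCharArray()) {
            Character mapped = seen.get(c);
            if (mapped == null) {
                mapped = (char) ('A' + seen.size());
                seen.put(c, mapped);
            }
            sb.append(mapped);
        }
        return sb.toString();
    }

    // True if any stretch of the word has the same letter pattern as the given pattern.
    static boolean containsPattern(String word, String pattern) {
        String target = canonical(pattern);
        int n = pattern.length();
        for (int i = 0; i + n <= word.length(); i++) {
            if (canonical(word.substring(i, i + n)).equals(target)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, "ERER" and "ININ" both reduce to "ABAB", so `containsPattern("BANANA", "ABAB")` is true because of the stretch "ANAN". Restricting the check to a fixed position and word length gives the stricter match.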

The following table displays the results of cross-referencing each noticeable pattern with each word in the aforementioned word list:

Table.png

The second column shows the number of words of exactly the same length as the line containing that pattern in the Somerton Man's code, and with the pattern in the same position within the word as it is in the code. The third column, however, shows the number of words of any length with the pattern occurring anywhere within the word.

Of greatest interest in this is the second row of the table, which states that there are no words that contain the pattern in question at all, regardless of the length.

Upon closer investigation, it was discovered that no word in the word list contains the pattern "TTMT". This is a very strong indicator that the code is not a first order substitution cypher in English: for it to be one, a word boundary would have to fall somewhere within that pattern and, given the absence of spaces in the code, that is unlikely.

Further work is needed in this area: while the code has been shown with very high certainty not to be an English first order substitution cypher, this has not yet been tested for any other language.

Project management

Process Structure, Task Allocation and Schedule

Shown below on the left is the rough planned task allocation as stated in the Critical Design Review report. The diagram on the right shows the distribution of tasks performed.

Task Allocation.png Task Allocation Final.png

The following Gantt chart shows the rough time allocation of each task as stated in the critical design review report. The actual progression of the project has differed slightly from the proposed timeline. The initial letter searcher was not originally considered but became necessary given some preliminary results. A Markov model was implemented and studied in place of the planned entropy and compression analysis. Investigation of structural features was not performed.


CDR gannt chart.png

Risk management

As this project mostly involved coding of software, there was minimal risk to be concerned with. There was minimal process risk as there were few planned tasks that depended on the outcome of other tasks.

Budget

  • Cake for police historians so that they will talk to us - $24.95
  • DVD of 1978 ABC documentary - $88.00
  • Retrieval from the National Archives - $30.00
  • Coronial Inquest - $0 (it was free)

Summary and conclusions

Every attempt was made in the structure of this investigation to not discount any possibility as to the meaning of the Somerton Man's code. This includes the possibility that the letters were invented randomly (as a ruse or as the result of delusion). The main other possibilities investigated are that the code is an anagram or simple transposition cipher in many languages, that the code is an initialism of a sentence or an unordered list, and that the code is a cipher (specifically either a Playfair, Vigenere, or one-time pad cipher).

The possibility of a Playfair or Vigenere cipher has been ruled out beyond reasonable doubt by showing that the outputs of these ciphers show letter statistics very inconsistent with the Somerton Man's code. The possibility of a simple transposition cipher was ruled out in a similar way. The possibility of a one-time pad was investigated using the Holy Bible and the Rubaiyat of Omar Khayyam as possible pads. No specific key could be found that deciphers the code to an English text. A suggested follow-up investigation would be to perform post-processing on these results to determine whether they have any meaning in languages other than English. If the reader wishes to do so, the software and these outputs are available for download in Appendix B.

By far the most promising result found from this investigation is that the code is much more likely to be an initialism of a sentence of text than to have been produced by a uniformly random process. Many languages were investigated and English was found to be far more likely than any other. Deducing what the initialism stands for, however, presents as a very difficult task. The sequence was searched for in the Rubaiyat of Omar Khayyam and the Holy Bible and not found.

If we assume that the code is as our conclusion suggests (an initialism of English text) then the following presents as the most likely intended sequence of letters in the code:


WRGOABABD
WTBIMPANETP
MLIABOAIAQC
ITTMTSAMSTCAB


This is based purely on the statistics obtained about English text and not the handwriting itself. Further studies could be performed by a handwriting expert into the meaning of these ambiguous characters.

There are also many more cipher schemes that could be investigated in a follow-up study. This investigation indicates, however, that further work treating the code as an initialism is much more likely to prove fruitful, if a method of investigating this possibility can be devised.

Appendices

Appendix A: Texts used for analysis

A compilation of texts was made for each language for use as inputs to software modules to estimate certain conditional probabilities about the languages. Below is a list of texts for each language that have been concatenated in the given order into a single text file for each language. Most of these texts were obtained from Project Gutenberg (www.gutenberg.org) and all of these texts are free from copyright (expired).

All of the texts used in this investigation are available here [1] (RAR archive 6.19MB).

English

  1. Alice's Adventures in Wonderland - Lewis Carroll
  2. Pride and Prejudice - Jane Austen
  3. Dracula - Bram Stoker

French

  1. Les Orientales - Victor Hugo
  2. Nouvelles mille et une nuits (New Arabian Nights) - Robert-Louis Stevenson
  3. Excelsior. Roman parisien - Léonce de Larmandie

German

  1. MÄRCHEN FÜR KINDER (Fairytales for children) - Hans Christian Andersen
  2. Alice's Abenteuer im Wunderland - Lewis Carroll
  3. Der Weihnachtsabend. Eine Geistergeschichte (A Christmas Carol) - Charles Dickens

Spanish

  1. Joan Burity (A Sociology article about religion)
  2. Pensamiento y accion por el socialismo (Thought and action for socialism) - Julio C. Gambina
  3. Mi tio y mi cura (My uncle and my priest) - Alice Cherbonnel
  4. La gloria de don Ramiro una vida en tiempos de Felipe segundo (The Glory of Don Ramiro in a life time of Philip the Second) - Enrique Larreta

Italian

  1. ALICE NEL PAESE DELLE MERAVIGLIE (Alice in Wonderland) - Lewis Carroll
  2. Pinocchio - Carlo Collodi

Portuguese

  1. A Revolução Portugueza: O 5 de Outubro (Lisboa 1910) - Jorge de Abreu
  2. Theophilo Braga e a lenda do Crisfal - Delfim Guimarães

Dutch

  1. Aan de Zuidpool From "De Aarde en haar volken," Jaargang 1913 - Roald Amundsen
  2. Mijnheer Snepvangers - Lode Baekelmans

Swedish

  1. Arbetets Herravaelde - Andrew Carnegie
  2. Folkungatraedet - Verner von Heidenstam

Appendix B: Software

Software written for use in this investigation can be downloaded below (outputs are also included). Also available is the archive of texts used to support the software. If these two archives are extracted into the same folder, the software should run error free. Feel free to use or modify this code for other purposes. Be sure to let us know if you find anything interesting to do with the code or the case in general.

  • Software and outputs [2] (RAR archive 1.71MB)
  • Supporting texts [3] (RAR archive 6.19MB)
  • Software and outputs #2 (RAR archive 325kB)
  • Input data data.rar (RAR archive 164kB)

References

  1. Inquest Into the Death of a Body Located at Somerton on 1st December 1948, State Records of South Australia, GX/0A/0000/1016/0B, 17th & 21st June 1949.
  2. S. Littlemore, "The Somerton Beach Mystery" (Documentary), Inside Story, 24-08-1978.
  3. "About ASIO. Significant Events in ASIO's History", 10 May 2009, <http://www.asio.gov.au/About/content/History.aspx>
  4. B. McKay, D. Bar-Natan, M. Bar-Hillel, and G. Kalai, "Solving the Bible code puzzle," Statistical Science, Vol. 14, No. 2, pp. 150–173, 1999.
  5. "BBC: Horizon: The Bible Code - transcript", 11 May 2009, <http://www.bbc.co.uk/science/horizon/2003/biblecodetrans.shtml>
  6. S. Orr, "Riddle of the end", The Sunday Mail, 11 January 2009, p. 76.
  7. Simon Singh, "The Code Book", ISBN 1-85702-889-9, p. 7, 1999.
  8. R. Morelli, "The Vigenere [sic] Cipher," Historical Cryptography Web Site, Trinity College, <http://starbase.trincoll.edu/~crypto/historical/vigenere.html> (6 September 2007)
  9. HMMER User's Guide, <ftp://selab.janelia.org/pub/software/hmmer/CURRENT/Userguide.pdf>, p. 43.
  10. "The Corncob list of more than 58 000 English words", 4 August 2009, <http://www.mieliestronk.com/wordlist.html>
