Semester A Progress Report 2012


Thomas Stratfold

June 1, 2012

Introduction

This progress report is a summary of the work that Aidan Duffy and I have achieved over the first semester of our Final Year Honours Project for the School of Electrical and Electronic Engineering, under the supervision of Derek Abbott and Matthew Berryman.

The overall goal of this project is to decipher the code found in association with the Somerton Man, and to use this to identify the victim and ultimately solve the case. To do this we must first determine what the code is, what cipher (if any) was used, what language it was written in, or whether it is just a series of random letters.

The secondary aspect of the project is the identification of the Somerton Man; this is a new aspect of the project and will be achieved by creating a 3D model of the victim.

The techniques and programs we have been using have been designed to be general, so that they could easily be applied to other cases and used in situations beyond the aim of this project.

Background

The Case

At 6:30am on December 1st 1948, a man was found deceased on Somerton Beach, South Australia, resting on the rock wall at the top of the beach. The victim carried no form of identification, and his fingerprints and dental records didn't match any international registries. The only items on his body were some cigarettes, chewing gum, a comb, an unused train ticket and a used bus ticket.

The report from the autopsy identified that the man's stomach and kidneys were congested and that there was excess blood in his liver; this suggested that his death was unnatural, and most likely caused by an unknown poison. In 1994, during a review of the case, it was suggested that the death was consistent with digitalis poisoning.

A month and a half later, a suitcase believed to belong to the victim was found left at Adelaide Railway Station. However, none of the items in it provided any further clues to the identity of the man or his killer.

The Code

One other item was found on the victim’s body: inside a sewn-up pocket of his trousers was a small piece of paper torn from a book, bearing the words "Tamam Shud". Translated from Persian this means "ended" or "finished", and the phrase can be found on the last page of The Rubaiyat of Omar Khayyam. On November 30th 1948, a man in Glenelg found a copy of this book left on the back seat of his car; later testing confirmed that the paper found on the victim matched this copy.

In the back of this book, written in pencil were five lines of capital letters, with the second line crossed out:


WRGOABABD

MLIAOI

WTBIMPANETP

MLIABOAIAQC

ITTMTSAMSTGAB


The similarity between the second line and the fourth line indicates that a mistake was made, which increases the likelihood that the lines are in fact a code. However, no one has yet been able to determine the meaning or purpose of the code.

Over the years there have been multiple attempts at identifying and deciphering the code, but none has yet produced an acceptable result. One notable attempt was made in 1978 by the Australian Defence Force, who analysed the code and stated that there were not enough symbols to provide a pattern, and that the symbols could be a complex substitution code or the meaningless response of a disturbed mind; ultimately they were unable to provide a satisfactory answer.

This is the motivation behind this project: after more than 60 years, including three years of projects run by the University, the code remains unsolved.

Previous Year's Work

This is now the fourth year of the project. The previous three groups have provided some valuable insight into the case, and this is the basis on which we have built.

In 2009, the group established the letters were not random, the code wasn't a transposition cipher, and the code was consistent with representing the first letter of words.

In 2010, the group continued along the first letter path, and compared the code against particular texts. However they were unsuccessful in their results, with the largest matches coming from The Rubaiyat. They also developed a simple web application and pattern matcher, which was designed to download and search the contents of webpages looking for patterns; this was then compared with the Somerton code.

In 2011, the group expanded the web application and created the web crawler that would search the Internet by itself. They also focused on various ciphers and cryptographic techniques that may have been used to generate the code.

All three groups also worked on a Cipher Cross-off List; this list contains ciphers and encryption methods that have been systematically tested and crossed off using frequency distribution and decoding methods. More than 30 ciphers have currently been disproved, with the list showing the method used for each.

Group Members

This year the project has two members: myself, Thomas Stratfold, a Bachelor of Engineering (Telecommunications) student, and Aidan Duffy, a double degree Bachelor of Engineering (Electrical and Electronic Engineering) and Bachelor of Economics student. We decided to work together on the new aspect of the project, the 3D reconstruction. For the other aspects we decided to split the work: Aidan would focus on the Web Crawler while I would focus on the language analysis and cipher cross-off.

2012 Progress

This year the main focus of the project is on the identification of the victim through a 3D reconstruction of a bust taken months after the victim's death. The other main focus of this year is on verifying and improving on the work of previous years; this has been done through expanding the Web Crawler and testing the code against more languages.


3D Reconstruction

The 3D reconstruction is a new aspect of the project, which we have been exploring in the hope of providing a reconstructed image of the victim's face and finally identifying him.

This aspect of the project involves scanning the bust, creating a model on the computer, and modifying the model to undistort the image and correct any changes to the face caused by the post-mortem. After this, it involves adding colour to the image, so that the model looks more realistic and makes it easier to identify the victim.

The first part of the semester involved determining how we would create the model; we began by testing a few easily available 3D modelling software packages, such as 123D and PM Scanner. Neither of these was very efficient or easy to use, and the examples did not appear to be of very high quality. As a result we decided to instead use a low-cost “David Laserscanner”, which is an easy to use kit that allows for 3D modelling. Since this kit involves the use of a 5mW laser line, which could cause harm if handled incorrectly, we were required to create a Risk Assessment and Standard Operating Procedure, included in the appendix, before we were allowed to purchase the kit.

My contribution to this aspect of the project involved meeting, along with Aidan, with one of the School's research engineers, Henry Ho, to create the risk assessment that would be used in conjunction with the laser scanner.

We are still waiting for the David Laserscanner kit to arrive, so we have been unable to make any further progress in this area.


Web Crawler

The web crawler created in 2011 is already functional; however, it takes a very long time to scan web pages, and as a result would take months to check even a fraction of the Internet. The idea behind this aspect of the project was to modify the crawler so that it would use pre-indexed data, the data that search engines use to speed up searches.

Aidan did the majority of the work conducted in this area.

Language Verification

In previous years there has not been a lot of focus on what language the code could be written in. In 2009 the group tested 10 languages and concluded that English fit the best. This year we have decided to verify this conclusion by testing more languages and performing further analysis. This has been my main contribution to the project.

This analysis began with a review of previous years' work; previously, languages were tested by performing a frequency analysis of the most used letters and comparing this with the code. To test this we first went through the dictionary and counted the number of words that begin with each letter of the alphabet. These counts were then converted into frequencies and compared with the Somerton code. The results were not very promising, as there was very little agreement between the two sets of frequencies. The results are shown in figure 1 of the appendix.
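The dictionary step can be sketched in Java as follows. This is only an illustration under stated assumptions, not the program actually used: the class name `DictionaryInitials`, the method `initialFrequencies` and the four-word sample list are hypothetical, and only unaccented Latin letters A-Z are counted.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: share of dictionary words that begin with each letter A-Z. */
class DictionaryInitials {

    /** Returns 26 relative frequencies, one per letter, based on the first letter of each word. */
    static double[] initialFrequencies(List<String> words) {
        int[] counts = new int[26];
        int total = 0;
        for (String word : words) {
            if (word.isEmpty()) {
                continue;
            }
            char first = Character.toUpperCase(word.charAt(0));
            // Assumption: words starting with anything other than A-Z are skipped.
            if (first >= 'A' && first <= 'Z') {
                counts[first - 'A']++;
                total++;
            }
        }
        double[] freq = new double[26];
        for (int i = 0; i < 26; i++) {
            freq[i] = total == 0 ? 0.0 : (double) counts[i] / total;
        }
        return freq;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a full dictionary word list.
        List<String> words = Arrays.asList("apple", "anchor", "boat", "Melbourne");
        double[] freq = initialFrequencies(words);
        for (int i = 0; i < 26; i++) {
            if (freq[i] > 0) {
                System.out.printf("%c %.3f%n", (char) ('A' + i), freq[i]);
            }
        }
    }
}
```

In practice the word list would be read from a dictionary file rather than hard-coded.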

Considering the previous years had mainly focused on Western-European languages, and records had shown the victim was most likely of Eastern-European descent, we decided to expand the analysis and consider more languages. I was able to find a web site which had the Tower of Babel bible passage translated into over a hundred different languages. Using this site, I was able to get 85 different languages with approximately 1000 characters in each translation. This allowed frequency analysis to be performed by running a Java program, which I had modified to calculate the number of times each letter appeared in the texts. The analysis was then repeated, counting the number of times each letter was the initial letter of a word.
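The two counts described above, every letter and initial letters only, can be sketched as below. This is a minimal illustration rather than the program used in the project: the class and method names are hypothetical, a word is assumed to be a run of letters (with an apostrophe allowed inside it), and only A-Z are tallied, so accented initials are ignored.

```java
import java.util.Locale;

/** Sketch: count all letters, or only word-initial letters, in a passage of text. */
class LetterCounter {

    /** Counts every occurrence of A-Z, ignoring case. */
    static int[] countAllLetters(String text) {
        int[] counts = new int[26];
        for (char c : text.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                counts[c - 'A']++;
            }
        }
        return counts;
    }

    /** Counts only the first letter of each word, ignoring case. */
    static int[] countInitialLetters(String text) {
        int[] counts = new int[26];
        boolean inWord = false;
        for (char c : text.toUpperCase(Locale.ROOT).toCharArray()) {
            // An apostrophe inside a word does not start a new word ("didn't" is one word).
            boolean wordChar = Character.isLetter(c) || (inWord && c == '\'');
            if (wordChar && !inWord && c >= 'A' && c <= 'Z') {
                counts[c - 'A']++;
            }
            inWord = wordChar;
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "The tower of Babel";
        int[] all = countAllLetters(sample);
        int[] initials = countInitialLetters(sample);
        for (int i = 0; i < 26; i++) {
            if (all[i] > 0) {
                System.out.printf("%c all=%d initial=%d%n", (char) ('A' + i), all[i], initials[i]);
            }
        }
    }
}
```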

The results from this were quite interesting. We compared the initial letter frequency results with the Somerton code by computing the difference in frequency and the standard deviation. The results showed that English was the third closest, behind two Philippine languages, Ilocano and Tagalog. This helps to support the theory that the code is in English, since English matches the letter frequencies of the code so closely. A table of these results is included within the appendix.
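The report does not spell out the exact formula used for this comparison, so the following Java sketch shows one plausible reading: the mean absolute difference between two frequency distributions, and the standard deviation of the per-letter differences. All names are hypothetical. The `main` method compares the code (assuming the four uncrossed lines are used) against a uniform baseline purely as a demonstration; in the analysis each language's initial-letter frequencies would take the place of the baseline, and the languages would be ranked by the resulting values.

```java
import java.util.Arrays;

/** Sketch: two simple distance measures between letter-frequency distributions. */
class FrequencyComparison {

    /** Converts raw counts into relative frequencies. */
    static double[] normalise(int[] counts) {
        long total = 0;
        for (int c : counts) {
            total += c;
        }
        double[] freq = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            freq[i] = total == 0 ? 0.0 : (double) counts[i] / total;
        }
        return freq;
    }

    /** Mean of |a[i] - b[i]| over all letters. */
    static double meanAbsoluteDifference(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum / a.length;
    }

    /** Population standard deviation of the per-letter differences a[i] - b[i]. */
    static double stdDevOfDifferences(double[] a, double[] b) {
        double mean = 0;
        for (int i = 0; i < a.length; i++) {
            mean += a[i] - b[i];
        }
        mean /= a.length;
        double variance = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i] - mean;
            variance += d * d;
        }
        return Math.sqrt(variance / a.length);
    }

    public static void main(String[] args) {
        // Assumption: the crossed-out second line is excluded from the code.
        String code = "WRGOABABD" + "WTBIMPANETP" + "MLIABOAIAQC" + "ITTMTSAMSTGAB";
        int[] codeCounts = new int[26];
        for (char c : code.toCharArray()) {
            codeCounts[c - 'A']++;
        }
        double[] codeFreq = normalise(codeCounts);

        // Uniform baseline used only for demonstration.
        double[] uniform = new double[26];
        Arrays.fill(uniform, 1.0 / 26);

        System.out.printf("mean abs diff = %.4f%n", meanAbsoluteDifference(codeFreq, uniform));
        System.out.printf("std dev       = %.4f%n", stdDevOfDifferences(codeFreq, uniform));
    }
}
```

A smaller value on either measure indicates a closer match between a language and the code.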

Further analysis needs to be done in this area; the best next step would be to take the top 20 languages and perform more frequency analysis using longer texts to refine the results. Using the data we have collected, we will construct a Language Cross-Off List, similar to the cipher version, that will explain why we have discounted each of the languages tested.

Overall Progress

Overall we have made a solid start on the project, and have started to get back on track with our original schedule. Currently the only delay we are facing comes from accessing the laser scanner, so we have been unable to achieve much in the 3D reconstruction. This will be our main focus for next semester, along with expanding the web crawler to be more user friendly and hopefully to be able to use indexed data to search for patterns on the Internet. We will continue looking into the languages and will also try to analyse and disprove more ciphers from the list.

References and useful resources

- Frequency analysis of texts

- Previous work - Languages tested

- Previous work - Web based application

- Previous work - Web crawler - Pattern Matcher - Cipher GUI

Aidan Duffy

June 9, 2012

Executive Summary

Working on a cold case over 60 years old, our project has been designed to apply modern cryptographic and computational techniques in order to try to solve the mystery of the Somerton Man. After 12 weeks, reasonable progress has been made in expanding upon prior years' work on the language of the code, whilst efforts to create a 3D model have met with unforeseen delays that should hopefully be overcome shortly.

Introduction

This progress report is designed as a record of the work done by myself and Thomas Stratfold during the first semester on our Final Year Honours Project for Electrical Engineering under the supervision of Derek Abbott and Matthew Berryman. The ultimate aim of our project is to solve the mystery of the Somerton Man case from 1948. In order to do this, we have been tasked to investigate and attempt to unravel a code that was found upon the body - to figure out the language the code is in, what (if any) cipher was used for encoding, or whether the "code" is in fact simply a series of random letters from the mind of a man possibly under the influence of poison. We have also been set to produce a 3-dimensional model of the victim's head from a bust that was made at the time, in order to provide a method for identifying the unknown man. Whilst the techniques and programs developed in this assignment are being used to try to solve the mystery of the Somerton Man, they are designed to have a broader application beyond the scope of just this case.

Background

At 6.30am on the 1st December 1948, a dead man was discovered on Somerton Beach, South Australia, resting against the seawall. No identification was found on the body, the only items he carried being some chewing gum, cigarettes, a comb, an unused train ticket, and a used bus ticket - for a bus stop just 250 metres from where his body was found. The pathologist performing the autopsy found that the man's stomach and kidneys were deeply congested and the liver contained excess blood. He suggested the victim died from poisoning, but was unable to identify the specific poison used. A review of the case in 1994 concluded that it was likely the man died from digitalis poisoning.

The identity of the man remains a mystery to this day, and establishing it is one of the goals of this project. He was described as being Eastern European in appearance, mid-40s, in top physical condition and with his hands showing no signs of physical labour. He was clean-shaven, dressed in a fashionable European suit and his boots were polished, but all the name tags from his clothing had been removed and no record of his fingerprints or dental structure was found in international registries. By February 1949, there had been eight different "positive" identifications of the body by members of the public.

A month and a half after the discovery of the body, a brown suitcase believed to belong to the man was found at Adelaide Railway Station. It contained various items of clothing - again with no name tags - shaving items, and tools such as scissors, a screwdriver and stenciling equipment. The only identifying marks were "T. Keane" on a tie, "Keane" on a laundry bag and "Kean" on a singlet, along with three dry-cleaning marks; these have never successfully been linked to anyone.

Many theories have been put forward as to the identity of the mystery man. One of the more popular is that he was a spy, with the lack of identification and the mysterious poisoning pointed to as evidence for this theory.

The Code

Around the time of the inquest, a tiny piece of paper was found deep in a fob pocket sewn within the dead man's trouser pocket. On it were printed the words "Tamam Shud", and public library officials called in identified the phrase as meaning "ended" or "finished", found on the last page of a book called The Rubaiyat of Omar Khayyam. A nation-wide search was then conducted to find a matching copy of the book, but this was unsuccessful until a man revealed he had found a rare first edition copy of the translation by Edward FitzGerald on the backseat of his unlocked car on Jetty Road the night of the Somerton Man's death. This copy was missing the Tamam Shud on the last page, and microscopic tests indicated that the piece of paper found on the body was torn from this book. In the back of the book were found faint pencil markings of five lines of capital letters, with the second line crossed out. This line's similarity to the fourth indicates a mistake was made, and adds to the likelihood the lines are a code.

WRGOABABD

MLIAOI

WTBIMPANETP

MLIABOAIAQC

ITTMTSAMSTGAB

However, there is some debate over several of the letters: it is unclear whether the first and third lines begin with an 'M' or a 'W'; the struck-through second line could be an attempt to underline; and the 'I' of the last line could possibly be a very narrow 'V'. Code experts brought in to analyse the lines in 1978 concluded:

  • There are insufficient symbols to provide a pattern
  • The symbols could be a complex substitute code or the meaningless response to a disturbed mind
  • It is not possible to provide a satisfactory answer

Our aim is to provide more conclusive answers than these, through the power of modern technology and the vast swathes of data available on the World Wide Web.

Previous Year's Work

This project is now into its fourth year under Derek Abbott and Matthew Berryman's supervision, and the three previous groups to attempt to solve the code have all provided valuable insights for us to build upon. The group from 2009 were able to establish that:

  • The letters are not random - they mean something; they contain information
  • The code is not a transposition cipher - the letters are not simply shifted in position
  • The results are consistent with an English initialism - the letter distribution is consistent with the letter distribution of the first letter of English words

In 2010, the group compared the code's letter distribution to particular texts. Whilst they were unsuccessful in their endeavour, they did generate a large amount of pattern-matching data, and also discovered, somewhat surprisingly, that The Rubaiyat contained few, if any, matches. They also sought to harness the huge collection of data on the internet by developing and running a simple web application and pattern matcher in order to download and screen the contents of websites for patterns.

The 2011 group focused on two main aspects - expanding upon the functionality of the web crawler and pattern matcher, and investigating various cryptographic methodologies that could have been used to generate the code, then determining whether they were possible based on a comparison with the code itself. Both aspects were broadened so that they were applicable beyond the scope of the Somerton Man case.

All three previous groups have worked on a "Cipher Cross-off List" - eliminating potential methods for encoding the letters that were available at the time, based upon their similarity to the frequency distribution of the letters in the code. Currently, more than 30 possible encryption schemes have been disproved, with the use of a One-Time Pad being a notable exception due to its (virtually) infinite number of possible permutations.

This Year's Progress

The focus of this year's project is upon the identification of the victim through a 3D reconstruction of a bust that was taken after death. This involves investigating techniques for scanning in the bust in order to create a model on the computer, undistorting the image to correct for any changes that may have occurred after death and in the 60-plus years since the bust was taken, and then colourising the model to accurately portray what the man may have looked like when he was still alive.

Time early on in the project was spent looking into various means of creating the initial 3D model, with several software programs tested and 3D scanning cameras considered. With the budget available for the project we were limited in the options we could afford, and so we settled for a low-cost "DAVID Laserscanner" package which seemed to offer the equipment and software we required in one easy to use kit. However, this kit involves the use of a laser line, which is potentially harmful due to the power of the laser used (approximately 5mW). In order to be approved for the use of this kit, we were required to discuss and produce a Risk Assessment and Standard Operating Procedure applicable to our project. After meeting with one of the Engineering School's Research Engineers, Henry Ho, it fell to me to draft the initial Risk Assessment and Standard Operating Procedure. Once this was done, we again met with Henry to refine the documents, then submitted them for approval in order to begin scanning the bust.

As of today we are still waiting for approval or for the kit to be released to us, and this unforeseen delay has pushed back the scanning until the second half of the year. However we are confident that we should not be held up much longer, and may be able to make a start during the mid-year break.

We have also sought to expand the range of languages considered, as the original investigation was limited to primarily western-European languages. In order to speed up this process, I wrote a simple Java program which took in a text file and output a list of the letters of the alphabet with a value alongside each indicating the number of times that letter occurred in the file. The aim was then to write a "batch file" for Windows in order to process a large number of languages in one go, and then analyse the frequency distributions of each to compare with the code. This was then adjusted by Tom to be done entirely within Java using a driver program, with the output then imported into Microsoft Excel for the statistical analysis. I also modified the program to only take the first letters of each word and count the occurrences of these, since this was suggested as more similar to the code by the group from 2009.
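A driver of the kind described above might look something like the following sketch. It is not the project's actual code: the class name `BatchLetterCount`, its methods, the UTF-8 encoding and the comma-separated output layout are all assumptions chosen so that the result could be opened in a spreadsheet.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

/** Sketch: count letters in several text files and print one CSV row per file. */
class BatchLetterCount {

    /** Counts every occurrence of A-Z in the text, ignoring case. */
    static int[] countLetters(String text) {
        int[] counts = new int[26];
        for (char c : text.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                counts[c - 'A']++;
            }
        }
        return counts;
    }

    /** Reads a whole file (assumed to be UTF-8) and counts its letters. */
    static int[] countLettersInFile(Path file) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return countLetters(text);
    }

    /** Formats a label and its counts as one comma-separated row. */
    static String toCsvRow(String label, int[] counts) {
        StringBuilder row = new StringBuilder(label);
        for (int c : counts) {
            row.append(',').append(c);
        }
        return row.toString();
    }

    public static void main(String[] args) throws IOException {
        // Header row: file,A,B,...,Z
        StringBuilder header = new StringBuilder("file");
        for (char c = 'A'; c <= 'Z'; c++) {
            header.append(',').append(c);
        }
        System.out.println(header);

        // One row per file named on the command line.
        for (String name : args) {
            System.out.println(toCsvRow(name, countLettersInFile(Paths.get(name))));
        }
    }
}
```

Run as, for example, `java BatchLetterCount english.txt tagalog.txt > counts.csv` (the file names here are hypothetical); the resulting file has one row per language.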

Appendix

Frequency Analysis

Dictionary Freq Anal.png

Figure 1 - Results from analysis of dictionary


Initial Freq.png

Figure 2 - Results from analysis of initial letters from text

Standard Operating Procedure

File:Standard Operating Procedure.pdf

Risk Assessment

File:Risk Assessment.pdf
