Progress Report 2010

==Due Date==
This report is due on the 3rd of June, 2010 (Thursday, Week 12).


==Executive Summary==
For decades the written code linked to the “Somerton Man Mystery” (or [http://en.wikipedia.org/wiki/Taman_Shud_Case Taman Shud Case]) has remained undeciphered and has delivered no solid information about the man's history or identity. This project aims to shed light on the unsolved mystery by tapping into the wealth of information available on the internet.


A recent collection of random letter samples has reinforced previous claims that the code is unlikely to be just a randomly generated sequence of letters. Along with thorough revision and checking of the code previously used to produce results, this supports the conclusion that the Somerton Man code is most likely an initialism.


The code found with the dead man will be used in conjunction with a suitable “Web Crawler” and specific pattern matching algorithms. The aim is to target websites of interest and extract data from them that may hold answers to the Somerton Man code. Such data refers to the huge amount of raw text available on websites. This text will be filtered in search of character patterns that are also present in the Somerton code. Once some solid results are obtained, frequency analysis will hopefully enable a solid hypothesis about possible meanings behind the code.


So far there has been substantial progress towards achieving the goal of the project. Verification of past results has been completed, along with extensive research into, and experimentation with, web crawling devices and pattern matching algorithms. The project has undergone some specification changes and still has a significant amount of work remaining, along with some small problems to overcome. Despite this, the project is well underway and should be completed successfully and on time.


==Aims and Objectives==
The aim of the project is to make progress on the Somerton Man case through the use of engineering techniques such as information theory, problem solving, statistics, encryption, decryption and software coding. Using these techniques, the results from the previous year's project will be revised and used to solidify the pathway for the rest of the project. Once the past results have been properly revised, the next goal is to utilise a web crawling device to access information stored as raw text on websites of interest. Specifically designed algorithms will then be applied to the extracted text to search for patterns and initialism sequences that may relate to the contents of the Somerton Man code.


==Project Background==
===The Case===
The project is based upon attempts to reveal the identity of a middle aged man discovered dead on a metropolitan beach in South Australia in 1948. With little solid evidence apart from the body itself, the man's death has remained a mystery surrounded by controversy for decades. He was found resting against a rock wall on Somerton Beach, opposite a home for crippled children, shortly before 6:45 am on the 1st of December 1948. A post-mortem revealed that there was no possibility that the death was natural, due to severe congestion in his organs. The most likely cause of death was decided to be poison, which in 1994 was identified by forensic science as digitalis<ref name="Taman Shud Wiki">http://en.wikipedia.org/wiki/Taman_Shud_Case</ref>.


[[Image:Suitcase.jpg|thumb|170px|right|Suitcase and effects, found at Adelaide Railway Station.]]
Among the man's possessions were cigarettes, matches, a metal comb, chewing gum, a railway ticket to Henley Beach, a bus ticket and a tram ticket. A brown suitcase belonging to the man was later discovered at the Adelaide Railway Station. It contained various articles of clothing, a screwdriver, a stencilling brush, a table knife, scissors and other such utensils. Authorities found the name “T. Keane” on some of the items, but they were unable to find a missing person with that name. The Somerton man has since been incorrectly identified as a missing stable hand, a worker on a steamship, and a Swedish man<ref name="Taman Shud Wiki"/>.


A spy theory also began circulating due to the circumstances of the Somerton man's death. In 1947, the US Army's Signal Intelligence Service, as part of Operation Venona, discovered that top secret material had been leaked from Australia's Department of External Affairs to the Soviet embassy in Canberra. Three months before the death of the Somerton man, on the 16th of August 1948, an overdose of digitalis was reported as the cause of death of US Assistant Treasury Secretary Harry Dexter White, who had been accused of Soviet espionage under Operation Venona. Due to the similar cause of death, rumours that the unknown man was a Soviet spy began to spread.




===The Code===
The most puzzling of his belongings was a slip of paper bearing the words “Tamam Shud”, which translate to “ended” or “finished”. The fragment of paper was identified as being from the last page of a book of poems, [http://en.wikipedia.org/wiki/Rubaiyat_of_Omar_Khayyam The Rubaiyat by Omar Khayyam], a famous Persian poet.
 
A police announcement was made through the media searching for a copy of The Rubaiyat with the “Tamam Shud” phrase removed. Shortly afterward, a local person came forward to reveal that he had found a rare first edition of the book in the back seat of his unlocked car in Glenelg on the 30th of November 1948.
 
Inside that particular copy of The Rubaiyat, a phone number was penciled in on the back cover, which led police to a nurse known as Jestyn (her real name is withheld). The woman has since denied all knowledge of the dead man. Also penciled into the book was a short code.
 
 
[[Image:Somertoncode.jpg|thumb|left|The handwriting of the Somerton Man showing his pencil markings in the back of a book of poetry by Omar Khayyam.]]
 
The code, a sequence of roughly 40 to 50 letters, has been dismissed as unsolvable for some time due to the quality of the handwriting and the small quantity of letters available. The first character on the first and third lines looks like an “M” or “W”, and the fifth line's first character looks like an “I” or a “V”. The second line is crossed out (and is omitted entirely in previous cracking attempts), and there is an “X” above the “O” on the fourth line. Due to the ambiguity of some of these letters and lines, possibly wrong assumptions could be made as to what is and is not part of the code.


Professional attempts at unlocking this code were largely limited by the lack of modern techniques and strategies, as they were carried out decades ago. When the code was analysed by experts in the Australian Department of Defence in 1978, they made the following statements regarding the code:


* There are insufficient symbols to provide a pattern.
* The symbols could be a complex substitute code or the meaningless response to a disturbed mind.
* It is not possible to provide a satisfactory answer.




<center>ITTMTSAMSTCAB</center>


===Significance===
Probably since the beginning of recognizable human behaviour, coding has been fundamental to all human groups. It was a way to enable communication when ordinary spoken or written language was difficult or impossible. For as long as communication has been needed, concealment seems to have been just as common in human societies. Secret languages and gestures are characteristic of many human groups, serving as a means of concealing messages.

A cipher system is the systematic concealment of a message by substituting words or phrases with a different set of characters. Cipher systems began to be widely used in the 16th century, but they date back as far as the 6th century BC<ref name="Code Book">The Secrets of Codes: Understanding the World of Hidden Messages by Paul Lunde</ref>.

[[Image:Caesarshift.jpg|thumb|250px|right|A three letter [http://en.wikipedia.org/wiki/Caesar_cipher Caesar Shift]. There are 26 possible shifts in the English language]]
Coding and cipher systems have historically been limited to espionage or warfare, but the recent transformation of communication technology has added new opportunities and challenges for the budding cryptologist. In an age of computerised communication, the most valuable commodity is code, and the second most valuable is the private information that it gives access to. To guard this information we use encryption systems that have to be continually updated, as every encryption system is eventually foiled. How to preserve privacy in an electronic age is one of the burning questions of our time<ref name="Code Book"/>.

Engineers now have the knowledge and skills in information theory and statistics to better analyse the data of previously unsolvable codes. We believe that cracking this code holds the key to the identity of the Somerton man. Solving this mystery could potentially bring closure to the man's family, as well as to others who have become obsessed with the case.

==Project Requirements and Specifications==
The entire project can be broken into 3 main modules:
*Verification of past results
*Design/utilisation of a web crawler
*Pattern matching algorithm

Each module has a set of requirements specific to its design and application with respect to the project.
===Software Requirements===


As stated, the main goal of the project is to create a Java based web crawler. This web crawler is planned to be quite versatile and to have many applications outside of this project. For instance, it could be used for further cipher or code related searching not exclusive to the Somerton man's code, for searching for different text patterns and language comparisons, or even for identifying plagiarism. Furthermore, it could be used as a web searcher, like Google, with additional functions such as searching with wildcard letters (i.e. a search for “be*r”, where the * is a wildcard standing for any letter). The code is to be written in Java because that is the language the group is most comfortable with; it is one of the most common languages available, free to download and very accessible, and so can be easily maintained and modified. The code must also be well documented and commented so that others can understand it, and if it is to act as a standalone program it must have a well defined interface that can switch between parsing single characters or initial letters, and between exact or similar patterns, at the press of a button.
The following block diagram shows the designed software layout for traversing a website, filtering through the raw text and storing all important findings. This method uses a breadth first traversal, which is currently being revised and may be changed to depth first; for more information see the "Web Crawler Plans" section.
[[Image:Breadthfirstblock.jpg|600px|centre]]


====Web Crawling Requirements====
The web crawling device will be used to visit a website of interest and retrieve data that may contain information relevant to the project.  The basic requirements of the web crawler are as follows:


*Take in a base URL from the user
*Traverse the base website
*Pass forward any useful data to the pattern matching algorithm
*Finish after traversing entire website or reaching a specified limit




The code must handle sub-links and web pages within a main page, that is, web pages with multiple frames or source files. It is important that the code can handle this, because otherwise precious data can be missed and left unparsed or unrecorded.
The crawler will be required to detect and handle the following circumstances:
*Bad I/O
*Found a valid HTML URL
*Found a Non HTML URL
*Found a URL linking to an external site
*Found an Invalid URL


There must be special cases for when errors occur. This includes Input/Output errors that cause a web page to fail to load, as well as invalid URL errors. For both cases, we have to decide whether to re-attempt parsing the page or simply skip it. Another special case must be considered for non-HTML sources such as PDF files or Microsoft Word documents; these should probably be parsed, because the data they contain is likely to be useful for our results. Raw data files should probably be skipped because they will mostly be code and not helpful.
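
To make these requirements concrete, the following is a minimal, illustrative sketch of the kind of crawler loop described above, written against the standard Java library only. It is not the project's actual crawler: the class name, the base URL, the page limit and the very simple link-extraction pattern are all assumptions made for the example.
<syntaxhighlight lang="java">
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal crawler skeleton: breadth-first traversal of one site, bounded by a page limit.
public class SimpleCrawler {
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        String base = args.length > 0 ? args[0] : "http://www.example.com/"; // base URL from the user
        int limit = 50;                              // stop after this many pages
        Queue<String> queue = new ArrayDeque<>();    // pages still to visit
        Set<String> seen = new HashSet<>();          // pages already queued
        queue.add(base);
        seen.add(base);

        while (!queue.isEmpty() && seen.size() <= limit) {
            String url = queue.poll();
            String page;
            try {
                page = fetch(url);                   // bad I/O: skip the page
            } catch (Exception e) {
                continue;
            }
            // A pattern matching algorithm would be handed the page text here.
            System.out.println("Visited: " + url);

            Matcher m = LINK.matcher(page);
            while (m.find()) {
                String link = m.group(1);
                // Only follow links that stay on the base site and are not yet queued.
                if (link.startsWith(base) && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
    }

    // Download a page as a single string.
    private static String fetch(String url) throws Exception {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            String line;
            while ((line = in.readLine()) != null) sb.append(line).append('\n');
        }
        return sb.toString();
    }
}
</syntaxhighlight>
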
====Pattern Matching Algorithm Requirements====
The pattern matching code must be able to determine the difference between HTML code and raw text. The HTML code will be largely useless and probably repetitive across multiple sites, which would not be useful for the results we are after (in fact, if its frequency were high enough it could skew our results). How the algorithm determines what is and is not HTML code is very important, because it will reduce both the amount of processing and the time needed to complete the methods in the algorithm. The comments inside the HTML may still be useful samples for our results; we may choose to parse them, but if comments cannot be parsed while the rest of the HTML code is ignored, they should be skipped.


The code must be able to switch between parsing every character and parsing just the initial letters of words. This is mostly for added functionality: it has been concluded that the Somerton man's code is probably an initialism, but we want our web crawler to have uses outside of this project. How the code chooses the initial letter of a word is also critical. In most cases it is simply the character following a space, but there will also have to be special cases (such as words following a period with incorrect punctuation, like “hello.there”).


Moreover, we want the code to be able to search for exact patterns as well as “similar” patterns, that is, text that has the same pattern as the given input but with different characters (for example, if the input is “ABAB” and the text contains “XYXY”, it would be considered a “similar” pattern and noted down). The code should also handle non-letter characters such as punctuation.
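
A minimal sketch of one way such a “similar pattern” comparison could work is shown below; it is an illustration only, not the project's code. Each string is reduced to the order in which its distinct characters first appear, so that “ABAB” and “XYXY” share the same shape.
<syntaxhighlight lang="java">
import java.util.HashMap;
import java.util.Map;

// Illustrative "similar pattern" check: two strings match if their letters
// repeat in the same positions, e.g. ABAB and XYXY share the shape 0.1.0.1.
public class SimilarPattern {
    static String shape(String s) {
        Map<Character, Integer> ids = new HashMap<>();
        StringBuilder out = new StringBuilder();
        for (char c : s.toLowerCase().toCharArray()) {
            ids.putIfAbsent(c, ids.size());          // number characters by first appearance
            out.append(ids.get(c)).append('.');
        }
        return out.toString();
    }

    static boolean similar(String a, String b) {
        return a.length() == b.length() && shape(a).equals(shape(b));
    }

    public static void main(String[] args) {
        System.out.println(similar("ABAB", "XYXY"));  // true
        System.out.println(similar("ABAB", "XYXX"));  // false
    }
}
</syntaxhighlight>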


For every “found” pattern, the code should put the text (the sentence or phrase that the pattern was found in), and possibly the URL, into some sort of list or array. This is the most significant part of the algorithm because it will provide the basis of our results. From this we can build frequency logs of certain words to better determine what the initialisms in the Somerton man's code could possibly stand for.


==Progress So Far==
===Verification of Past Results===
[[Image:randomletter.jpg|thumb|right|Completed Random Letter Sample form.]]
In order to decide on a direction for the use of the web crawler, it was first important to look at the previous students' work and decide whether their conclusions are accepted.


The verification of their results was undertaken in two sections:


*Collection of random letter samples for comparison to the Somerton man code
*Analysis and Review of their code


A template for collecting random letter samples was created and used to collect samples of 45 random letters from a variety of subjects. 45 was chosen as it is roughly the number of letters in the Somerton man code. The details taken from each subject include the following, and can be seen in the completed form to the right.


*Intox. Yes/No - the physical state of each subject was recorded to keep samples taken from intoxicated subjects separate from those that were, to the best of our knowledge, not under the influence of any substance
*Age – each subject was required to supply their age for future use
*M/F – the sex of each subject was recorded for future use


The graphs below show the recently collected samples next to the samples collected by the previous students and the letter frequency plot of the Somerton man code.  


[[Image:Comparefreq.JPG|center|800px]]


[[Image:Random_letter_sample.jpg|thumb|170px|left|Sample sheet used to collect random samples from straight and intoxicated subjects]]


Although the results are not identical to the previous students' results, both are very different from the graph depicting the Somerton man code. It is clear that in both sets of results there is not one letter with a frequency of less than 1. This leads us to agree with the previous assumption that the code was not produced randomly.
So far there have not been enough samples from intoxicated subjects to produce a useful result. This collection is ongoing and will be finalised when there are enough samples to compile a graph that can be considered useful.
It was initially planned to use the age and sex data that has been collected to subdivide the results into different categories, in order to conclude whether the letters in the Somerton man code correlate better with a specific age group or sex. It was then realised that this would involve collecting many more samples than are needed to verify the past students' results, and it has therefore not been included in the results. Collecting more samples and splitting the data into different categories may provide useful results for future studies of the case.
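
For illustration, a letter-frequency count of the kind used for these comparisons can be produced with a few lines of Java. This is not the code used to generate the graphs above; the sample string (one line of the code, as reproduced earlier in this report) is just an example input.
<syntaxhighlight lang="java">
import java.util.Map;
import java.util.TreeMap;

// Illustrative letter-frequency count used to compare a sample against the code.
public class LetterFrequency {
    public static void main(String[] args) {
        String sample = "ITTMTSAMSTCAB";              // example input only
        Map<Character, Integer> freq = new TreeMap<>();
        for (char c : sample.toUpperCase().toCharArray()) {
            if (Character.isLetter(c)) {
                freq.merge(c, 1, Integer::sum);       // count each letter
            }
        }
        freq.forEach((letter, count) -> System.out.println(letter + ": " + count));
    }
}
</syntaxhighlight>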


The previous year's code was used to analyse the Somerton man's code using several well known cipher techniques. Upon reviewing the software modules they created, it was decided that the algorithms used would perform the proper functions in attempting to prove or disprove their hypotheses. Their code was then compiled and run to generate a number of output files, which were examined and compared with the output files that the previous project group generated. None of the outputs produced any legible words, so it can be assumed that the Somerton man's code is not a Vigenere, Playfair or Caesar cipher. In fact, the results suggest that the mystery code resembles a set of initialisms, which may or may not be using substituted letters.
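
The simplest of these checks, a Caesar shift, can be illustrated as follows. This sketch is not the previous group's code; it simply brute-forces the 26 possible shifts of one line of the code and, consistent with the results described above, none of the shifts produces legible English.
<syntaxhighlight lang="java">
// Illustrative Caesar-shift test: print all 26 shifts of one line of the code.
public class CaesarCheck {
    public static void main(String[] args) {
        String line = "ITTMTSAMSTCAB";                    // last line of the Somerton code
        for (int shift = 0; shift < 26; shift++) {
            StringBuilder out = new StringBuilder();
            for (char c : line.toCharArray()) {
                out.append((char) ('A' + (c - 'A' + shift) % 26));
            }
            System.out.println(shift + ": " + out);
        }
    }
}
</syntaxhighlight>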


===Text Pattern Algorithms===


The most recent revision of the pattern finding algorithm and testing text file can be found [[Media:FindMatch.rar|here]].


The text pattern algorithm was written from scratch in Java. The main purpose of the algorithm is to search through a text file and return an output indicating whether a word or initialism has been found. This was done because we were not yet aware of what the web crawlers were capable of using their initial functions (that is, using a web crawler without modifying any of its code).


To start, a flow chart of the basic function of the algorithm was made to determine how to begin coding in the most efficient way possible. This is shown below:


[[Image:FindMatchFlowChart.jpg|500px|centre]]


To explain the function as simply as possible, the basic algorithm was designed to do the following (an illustrative sketch in Java is given after the list):
*Accept a user entered input from the command prompt and place it into a string
*Open a text file and place a line from that file into a string
*Compare the file string with the user’s string character by character, printing an output if the entire user entered string was found
*Repeat comparison process for every line until the end of the file is reached.
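
An illustrative sketch of this exact-match search is shown below. It is not the project's FindMatch code: the class name and the test file name are examples, and the character-by-character comparison is simplified, but the behaviour (ignoring case and reporting the line and character position of each match) follows the description above.
<syntaxhighlight lang="java">
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

// Illustrative exact-match search over a text file, reporting match positions.
public class FindExactSketch {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner console = new Scanner(System.in);
        System.out.print("Word to search for: ");
        String word = console.nextLine().trim().toLowerCase();   // case is ignored

        Scanner file = new Scanner(new File("test.txt"));        // file name is an example
        int lineNumber = 0;
        while (file.hasNextLine()) {
            String line = file.nextLine().toLowerCase();
            lineNumber++;
            int index = line.indexOf(word);
            while (index >= 0) {                                  // report every match on the line
                System.out.println("Found at line " + lineNumber
                        + ", character " + (index + 1));
                index = line.indexOf(word, index + 1);
            }
        }
        file.close();
    }
}
</syntaxhighlight>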


The algorithm utilises the Scanner class from the util package, as well as the File and FileReader classes from the io package. These classes are important because they allow the algorithm to open the required text file and scan through it. They were easily found in the Java API<ref>http://java.sun.com/j2se/1.5.0/docs/api/</ref> and are fairly simple to use. For this algorithm, the output reports the line and character number of the first letter of the matched word found in the text file.


To simplify the searching process, both the text file string and the user string were converted to lower case letters. This helps the functionality of the algorithm because it removes cases where two characters are the same letter but differ in case. Since it does not matter for our results whether characters are upper or lower case, this has no negative effect on the final outcome of the project.


To implement searching for initialisms, the area circled in red was modified so that the swCount variable only changes if the previous character in the file string was a white space. The output reports the line and character number as in the previous method, as well as the sentence that the initialism represents. This was done because the aim of the project is to find what the initialisms of the Somerton code could possibly represent. This seemed to work, although several errors in the output were observed in some special cases.


For example, if there were two white spaces in a row in the file string, the output would be incorrect (the result would print the wrong character position and the wrong set of words for that initialism). Problems also occurred when the input string itself contained spaces (the method incorrectly treated the space as part of the initialism).
Several attempts were made to debug and fix these problems. The final solution was to create another method (used by the initialism finding algorithm) to remove the white spaces from the input string. One more method was also created to print the initialism sentence properly for every case that was thought of.


For testing purposes, a main method was created which again utilises the Scanner class. The method accepts a user entered input to determine which algorithm will be used for the current search (FindExact or FindInitialism), and then accepts another user input string representing the word or initialism to be searched for. It is currently unknown whether this will be the interface for the final web crawler, although the final interface is likely to work in a similar way.
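
The following is an illustrative sketch of an initialism search of this kind, not the project's FindInitialism code. The input initialism and the test file name are example values; the sketch reports the line number and the word sequence whose initial letters match the input.
<syntaxhighlight lang="java">
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

// Illustrative initialism search: a run of words matches if their first letters
// equal the input initialism, ignoring case.
public class FindInitialismSketch {
    public static void main(String[] args) throws FileNotFoundException {
        String initials = "mst";                           // example input, spaces already removed
        Scanner file = new Scanner(new File("test.txt"));  // file name is an example
        int lineNumber = 0;
        while (file.hasNextLine()) {
            String[] words = file.nextLine().trim().toLowerCase().split("\\s+");
            lineNumber++;
            for (int start = 0; start + initials.length() <= words.length; start++) {
                boolean match = true;
                StringBuilder sentence = new StringBuilder();
                for (int i = 0; i < initials.length(); i++) {
                    String w = words[start + i];
                    if (w.isEmpty() || w.charAt(0) != initials.charAt(i)) {
                        match = false;
                        break;
                    }
                    sentence.append(w).append(' ');
                }
                if (match) {
                    System.out.println("Line " + lineNumber + ": " + sentence.toString().trim());
                }
            }
        }
        file.close();
    }
}
</syntaxhighlight>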


Some of the simulations and results from testing can be found below.




[[Image:javaresults2.jpg|560px|centre]]




The image above demonstrates the FindExact method. From the word being searched for and the testing text file used, it is possible to see that the algorithm works in the expected manner. As stated earlier, the code treats upper and lower case characters as the same; this is illustrated above by the searches for “TEST” and “test” returning the same result. The method correctly determines the line and character number of the beginning of the word. Furthermore, the method treats “  hello” as “hello” (without the leading spaces) and can correctly find its location in the text file.


Although the code ignores leading spaces in the search string, it can correctly handle “  t e s t” as “t e s t”, keeping the spaces between characters but not those before the first non-whitespace character. Since the FindExact method can search for multiple words, this behaviour was felt to be a requirement of the code. To achieve it, an algorithm was used to delete white space from the start of the user input string until a non-whitespace character was met.


Additionally, the main method returns an input error when a letter other than ‘i’ or ‘e’ is used.




[[Image:javaresults1.jpg|560px|centre]]
 
 
This image shows that the FindInitialism method returns the same result for every different search for the initialism “test” (with spaces in different places in the string). Incredible amounts of debugging and modification were needed for this to work correctly. Extensive use of the isWhitespace() method helped to achieve a working algorithm; it was used to determine how many words of the initialism had been printed so far, as well as whether a character was the beginning of a word.
 
The FindInitialism method is the one that will be developed further for the final web crawler. It has a function specific to the final outcome of the project: generating a list of potential sentences for an input initialism by reading through the source code of a web site.
 
===Web Crawling===
An open source Java based web crawling device called Arachnid<ref>http://arachnid.sourceforge.net/</ref> has been the main focus of producing the web crawler so far. This crawler is proving to be quite useful for the project due to the following traits:
 
 
*Java based
*Basic example provided - allows a reasonably quick start, instead of having to become familiar with a crawler from scratch
*Specific case handling methods provided - methods for some URL cases are supplied (with no handling code, but they provide a starting point for the solution)
*Highly modifiable
 
 
So far, after modifying the supplied crawler and exploiting methods used in the supplied example, the crawler is capable of the following:
 
 
*Compiling successfully (a huge bonus, as it allows actual tests to take place)
*Taking in a base URL from the console
*Testing for the correct protocol (we only want to look at HTTP websites for now)
*Testing for a null URL
*Testing for non-HTML URLs
*Testing for URLs that link to external websites
*Running a handleLink() method
*Iterating through the website and producing a list of found URLs and their associated names, along with an incrementing counter
 
The screen print below shows the command lines printed at the beginning of a search on www.adelaide.edu.au, the university website. As can be seen, the results display each found URL with a number and a title. The “title = null” lines are explained later in this report.
 
[[Image:arachnidscreenprint.jpg|600px|centre]]
 
 
While using Arachnid is proving to be quite useful, some issues have also been encountered (see the "Approach for Remainder of the Project" section). Because of this, alternative crawling methods have been researched in parallel with Arachnid.
 
The main alternative being investigated for retrieving data from websites is Wget<ref>http://www.gnu.org/software/wget/</ref>. This program would be highly useful for the project for the following reasons:
 
*Freely available
*Supports downloading via HTTP
*Able to act as a web crawler to download recursively until reaching a limit or end of web site
 
 
The third point is extremely valuable to the goal of this project.  Wget is able to mirror websites for offline viewing which will also allow for easier extraction and searching of data.
 
==Approach for Remainder of the Project==
===Web Crawler Plans===
With the web crawler design based on the Arachnid crawler, there are currently two major issues to be resolved:
 
*Character encoding differences with HTML URLs
*Traversal method - breadth first vs depth first
 
 
The image below shows a cut out from the command prompt displaying the current issue regarding character encoding. The text under the numeral “1” is from a successful URL discovery, where the blue underline shows the URL ‘type’ and the red underline shows the found ‘title’ of the URL. In the second section, under the numeral “2”, the type is shown as UTF-8 and, as can be seen, the title being displayed is null, meaning that no valid information was retrieved from the URL found at that point.
 
[[Image:charencoding.jpg|400px|centre]]
 
 
It is important to overcome this problem because UTF-8 is the international “standard” encoding for URLs and appears to be the most common type found. This will be attempted by first converting URLs to UTF-16 before handling them with Arachnid. The reason for doing this is that Arachnid is currently handling UTF-16 quite well, and it may be simpler to convert each found URL to the encoding that is already working rather than altering the crawler to deal with a whole new type of character encoding. To do this, the Java URLEncoder class will be experimented with.
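
As a small, illustrative starting point for that experiment (and not the project's code), the snippet below shows the standard URLEncoder call together with a re-encoding of the same text as UTF-16 bytes. Note that URLEncoder percent-encodes individual query components rather than converting a whole URL between encodings, so it can only be part of the solution; the page title used is a made-up example.
<syntaxhighlight lang="java">
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative encoding experiment with a non-ASCII string.
public class EncodingExperiment {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String title = "Tamán Shud";                          // hypothetical non-ASCII page title
        String encoded = URLEncoder.encode(title, "UTF-8");   // e.g. Tam%C3%A1n+Shud
        System.out.println(encoded);

        byte[] utf16 = title.getBytes("UTF-16");              // the same text as UTF-16 bytes
        System.out.println(utf16.length + " bytes in UTF-16");
    }
}
</syntaxhighlight>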
 
The other issue, involving traversal order, is less critical but will be important in terms of data collection and storage. The types of websites that are of interest are information sites containing large amounts of text, most likely arranged on the website in a structured, hierarchical manner. This means it is quite important to be able to move down one avenue of information on a site, storing similar items of interest so that they can be easily collected and analysed later. Breadth first traversal is the simplest approach to filtering through a website, however it results in found data being stored in an unorganised manner with respect to the aim of the project. Depth first is slightly harder to implement but, if it can be achieved, will result in all results of a similar nature being stored in an ordered fashion. The following figures show the difference between breadth and depth first traversal and the resulting order of stored data. Each node can be thought of as a different web page on the same website, where the numbers relate to the order in which each page would be searched.
 
 
[[Image:breadthfirst.jpg|800px|centre]]
 
 
It is clear from the diagrams that depth first traversal allows data from the same category in a hierarchy to be searched in order, meaning that the patterns and details found are also stored in a usable order. Depth first traversal is therefore the method most useful to the project and will be implemented as well as possible.
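
The difference between the two traversal orders can also be illustrated with a toy site map (the page names below are invented for the example). Swapping queue behaviour for stack behaviour in the same loop switches the crawler from breadth first to depth first.
<syntaxhighlight lang="java">
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy site map: the same links visited breadth first (queue) and depth first (stack).
public class TraversalOrder {
    static final Map<String, List<String>> SITE = new HashMap<>();
    static {
        SITE.put("home", Arrays.asList("history", "code"));
        SITE.put("history", Arrays.asList("victim", "suitcase"));
        SITE.put("code", Arrays.asList("letters"));
    }

    static void traverse(boolean depthFirst) {
        Deque<String> pending = new ArrayDeque<>();
        pending.add("home");
        while (!pending.isEmpty()) {
            // Stack behaviour gives depth first, queue behaviour gives breadth first.
            String page = depthFirst ? pending.removeLast() : pending.removeFirst();
            System.out.print(page + " ");
            for (String link : SITE.getOrDefault(page, Collections.emptyList())) {
                pending.addLast(link);
            }
        }
        System.out.println();
    }

    public static void main(String[] args) {
        traverse(false);   // home history code victim suitcase letters
        traverse(true);    // home code letters history suitcase victim
    }
}
</syntaxhighlight>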
 
 
In order to minimise any large delays to the project due to issues with Arachnid, more time will be spent with Wget, using it as an effective web crawler. As Wget is written in a C based language, there may be some issues with translating the pattern searching code so that it can use the data extracted by Wget. This means that researching Wget and experimenting with it as soon as possible is important, in order to provide as much time as possible for combining the web crawler and the pattern matching algorithm.
 
===Pattern Algorithm Plans===
 
The major remaining areas of the pattern matching algorithm yet to be completed are the implementation of ignoring HTML, and the ability to search for exact and ‘similar’ initialisms as described in the Stage 1 Design. Ignoring HTML will likely be done by ignoring characters in the text file once a “<” is encountered, and resuming the character comparison after a “>” is found.
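
A minimal sketch of that tag-skipping pass is shown below; it is illustrative only and not the project's code. Everything between a "<" and the next ">" is dropped, so only the visible text would be handed to the pattern matcher.
<syntaxhighlight lang="java">
// Illustrative tag-stripping pass: characters inside <...> are ignored.
public class StripTags {
    static String stripTags(String html) {
        StringBuilder text = new StringBuilder();
        boolean insideTag = false;
        for (char c : html.toCharArray()) {
            if (c == '<') {
                insideTag = true;          // stop copying until the tag closes
            } else if (c == '>') {
                insideTag = false;
            } else if (!insideTag) {
                text.append(c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Tamam <b>Shud</b> means finished</p>"));
        // prints: Tamam Shud means finished
    }
}
</syntaxhighlight>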
 
The method for finding ‘similar’ initialisms has not been settled yet, although some ideas have been formed. The approach will likely be similar to the FindInitialism method, relying on careful logic rather than the isWhitespace() method alone. Moreover, we would also like to implement wildcard searches using ' * '. This will likely be done by incrementing the swCount variable whenever the ' * ' character is compared with any character in the text file string.
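
A minimal, illustrative version of such a single-character wildcard check (not the project's code) is shown below, using the “be*r” example given earlier in this report.
<syntaxhighlight lang="java">
// Illustrative wildcard check: '*' in the search word matches any one character.
public class WildcardMatch {
    static boolean matches(String pattern, String word) {
        if (pattern.length() != word.length()) return false;
        for (int i = 0; i < pattern.length(); i++) {
            char p = Character.toLowerCase(pattern.charAt(i));
            char w = Character.toLowerCase(word.charAt(i));
            if (p != '*' && p != w) return false;   // '*' matches any character
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("be*r", "bear"));   // true
        System.out.println(matches("be*r", "beer"));   // true
        System.out.println(matches("be*r", "bears"));  // false (different length)
    }
}
</syntaxhighlight>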
 
Additionally, further implementation could include saving the found initialisms into another text file for later inspection. Beyond that, more testing and debugging should be done and any unforeseen errors fixed as soon as possible. Once that is completed, the code will be ready to be integrated into the web crawler. This area of the project is almost finished, and the future lies with the combined web crawler and pattern matching algorithm.


===Obtaining and Using Results===


The main sites of interest will be web sites with large amounts of raw text that can easily be parsed using the pattern matching algorithm. If possible, a variety of different sites should be searched to widen the results as much as possible. Using both the web crawler and the pattern matching algorithm, we can then produce a list of potential matches for the initialisms in the Somerton man's code. With this list, we can create frequency graphs to narrow down the possibilities and hopefully solve the mystery code.
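
As a small illustration of how such a frequency log could be built (not the project's code, and with made-up example sentences), the sketch below counts which word most often supplies the first initial across a set of matched sentences.
<syntaxhighlight lang="java">
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative word-frequency tally over matched sentences (hypothetical data).
public class CandidateWords {
    public static void main(String[] args) {
        // Sentences a matcher might have returned for an initialism starting with "t".
        List<String> matches = Arrays.asList(
                "turned back towards the beach",
                "talked briefly to the nurse",
                "turned slowly towards the sea");
        Map<String, Integer> firstWord = new HashMap<>();
        for (String sentence : matches) {
            String word = sentence.split("\\s+")[0];          // word supplying the first initial
            firstWord.merge(word, 1, Integer::sum);
        }
        firstWord.forEach((word, count) -> System.out.println(word + ": " + count));
        // e.g. turned: 2, talked: 1
    }
}
</syntaxhighlight>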


==Project Management==


===Project Alterations===


As the project has progressed there have been a few things that have changed with respect to the original project plan.


It was originally planned to search the entire internet for patterns of interest. It was quickly realised that this would be unachievable within the time frame and resources of this project; the internet is simply too large and changes too quickly for our techniques and equipment. For this reason the project goal has been scaled down to searching through one website at a time. That is, there are now constraints and limits on the size of the search that the software will be subjected to. The specifications now state that the software will be required to take in a starting URL and search through that website, and that website only. This has introduced some extra challenges, such as dealing with URLs that link to external websites, however it also makes the project goal much more manageable and realistic.


===Work Breakdown===


The work breakdown for the project is the same as in the Stage 1 Design<ref>[[Stage 1 Design Document 2010]]</ref>. At this point in time, the two tasks being completed are testing/modifying the web crawler and testing/modifying the pattern matching algorithm. This more or less matches the expected time line from the [[Media:Cipher_gantt.jpg|Gantt Chart]] that was previously created. Due to end of semester assignments and exams, it is likely that very little work will be done over the next few weeks and the project may fall behind. However, during the mid-semester break there will be sufficient time to catch up and potentially get ahead of schedule.


[[File:WorkloadBreakdown blockdiagram.jpg|250px|centre]]


===Budget & Resources===
Each student is provided with a $250 budget<ref> Project Handbook 2010 [[File:Project_handbook_2010_V-20100310.pdf]]</ref>, giving this two-person project a total budget of $500. As stated when the project was started, the majority of the work is software based. The software being used is Java based and can be accessed for free through the university's resources and the internet, so it was estimated that very little of the $500 budget would be needed. Although barely any of the budget has been spent so far, it remains available should a situation arise where it is needed (such as funding alcohol for further random letter samples from intoxicated subjects).


Costs so far:
*The Secrets of Codes: Understanding the World of Hidden Messages by Paul Lunde - $50




===Risk Management===
As stated in the Stage 1 Design, the project has very little risk involved, so the risk assessment is very similar to the previous one. Since we are now more familiar with the project, we are more aware of the risks and potential dangers that could cause the project to fail. These include not being able to sufficiently code the pattern matching algorithm or use the web crawler. While both of these are unlikely, there is a possibility that such problems could delay the project. Additionally, due to the current winter weather, there is a greater chance of a project member becoming ill.


The new risk assessment is shown below. Ratings are given from 1 to 10, with 10 being the highest probability/impact on the project.




{| class="wikitable"
! Risk !! Likelihood !! Impact !! Consequence
|-
|  Project member falls sick
|  9
|  8
|  Could leave 1 member with a huge amount of work to complete
|-
|  Unable to code pattern matching algorithm
|  2
|  2
|  Additional time might have to be spent, delaying the project
|-
|  Unable to contact project supervisors
|  
|  
|  If it is at a critical time problems may arise
|-
|  Crawlers not effective at parsing data
|  3
|  5
|  A different web crawler may have to be used, delaying the project
|-
|  Run out of finance
|  
|  
|  
|}




 
Most of these risks can be minimised with proper time management. If a project member hits a road block in the project's progression, a supervisor can be approached for help; if the supervisors are unavailable, e-mails can be sent and a reply will typically be given as soon as possible. Sickness should not delay the project very much, because the work is almost entirely software based and can be done from home.
===Occupational Health & Safety===
The two main Occupational Health and Safety hazards identified are psychological and ergonomic hazards<ref>Hazards and Risk Assessment Powerpoint Presentation, Honours Project 2010</ref>. Both are quite likely to occur over the length of the project and can be potentially severe if not handled properly.
The ergonomic hazards that apply to this project include:
 
*Work placement layout
*Work place equipment
*Job design
*Individual work practices
*Training/Experience
 
The University of Adelaide has official practices for managing these ergonomic hazards<ref>Work Station Ergonomics http://74.125.95.132/search?q=cache:SofKcuyfEPYJ:www.adelaide.edu.au</ref>. The majority of the work in this project will take place on computers at home or at the University, so these strategies can be applied throughout the project to minimise the risks and help the project run more smoothly.
 
Throughout the project, frequent short rest breaks should be taken. This is to help prevent fatigue and permanent muscular, joint and ligament damage. This decreases both the severity of the hazard and the likelihood that it will occur.
 
Additionally, the University has a Health Safety and Wellbeing Policy<ref> Health Safety and Wellbeing Policy http://www.adelaide.edu.au/hr/ohs/handbook/</ref>. Following these guidelines and principles will help avoid the psychological hazards such as:
 
*Stress
*Repetitive tasks
*Little training or instruction
 
The listed hazards are highly probable, but with proper consideration we can minimise their consequences or hopefully prevent them altogether. The high workload of the project can lead to stress; this can be managed by breaking the project into smaller, manageable tasks, dividing the work between members and taking short breaks during long periods of work. Due to the nature of the project, we have to perform repetitive tasks such as testing and modifying code and analysing similar results. We can reduce this hazard by having group members alternate between the tasks they are currently assigned.
 
Finally, the project is largely independent and we have little instruction on how to complete it. This hazard is not so significant because we have two supervisors that we can keep in contact with through e-mail and regularly scheduled meetings.
 
With proper mitigation strategies in place, the probability of OH&S hazards occurring can be minimised. Without these management strategies in place, the potential for risk could be considered too high. However, because the risks can be effectively managed, it is reasonable to undertake the project.


==Conclusion==


Overall there has been significant progress towards achieving the aim of the project. The previous results have been accepted and there has been advancement in the areas of both web crawling and producing pattern matching algorithms. After some minor issues are overcome in both areas, it is planned to integrate the two modules of the project together. This will allow for extensive searching and result compilation for further analysis.


==References==


==See also==
* [[Stage 1 Design Document 2010]]
* [[Cipher Cracking 2010]]
* [[Final report 2009: Who killed the Somerton man?]]


Project Background

The Case

The project is based upon attempts to reveal the identity of a middle-aged man discovered dead on a metropolitan beach in South Australia in 1948. With little solid evidence apart from the body itself, the man's death has remained a mystery surrounded by controversy for decades. He was found resting against a rock wall on Somerton Beach, opposite a home for crippled children, shortly before 6:45 am on the 1st of December 1948. A post-mortem of the body revealed severe congestion in his organs, ruling out the possibility of a natural death. Poisoning was decided to be the most likely cause of death, and in 1994 forensic science identified the poison as digitalis[1].

Among the man's possessions were cigarettes, matches, a metal comb, chewing gum, a railway ticket to Henley Beach, a bus ticket and a tram ticket. A brown suitcase belonging to the man was later discovered at the Adelaide Railway Station. It contained various articles of clothing, a screwdriver, a stencilling brush, a table knife, scissors and other such utensils. Authorities found the name "T. Keane" on some of the items, but they were unable to find a missing person with that name. The Somerton man has since been incorrectly identified as a missing stable hand, a worker on a steamship, and a Swedish man[1].

A spy theory also began circulating due to the circumstances of the Somerton man's death. In 1947, the US Army's Signal Intelligence Service, as part of Operation Venona, discovered that top secret material had been leaked from Australia's Department of External Affairs to the Soviet embassy in Canberra. Three months prior to the death of the Somerton man, on the 16th of August 1948, an overdose of digitalis was reported as the cause of death of US Assistant Treasury Secretary Harry Dexter White, who had been accused of Soviet espionage under Operation Venona. Due to the similar causes of death, rumours that the unknown man was a Soviet spy began to spread.

By far the most intriguing piece of evidence related to the case was a small piece of paper bearing the words "Tamam Shud", which translates to "ended" or "finished". This paper was soon identified as being from the last page of a book of poems called The Rubaiyat by Omar Khayyam, a famous Persian poet[2]. After a police announcement was made in search of the missing copy of The Rubaiyat, a local Glenelg resident came forward with a copy that he had found in the back seat of his car. It was in the back of this copy of The Rubaiyat that the mystery code was discovered.

The Code

The Somerton Man Code.

The code found in the back of The Rubaiyat revealed a sequence of 40 to 50 letters. It has long been dismissed as unsolvable due to the quality of the handwriting and the limited number of letters available. The first character on the first and third lines looks like an "M" or "W", and the fifth line's first character looks like an "I" or a "V". The second line is crossed out (and is omitted entirely in previous cracking attempts), and there is an "X" above the "O" on the fourth line. Due to the ambiguity of some of these letters and lines, incorrect assumptions could easily be made as to what is and is not part of the code. Professional attempts at unlocking the code, carried out decades earlier, were largely limited by the lack of modern techniques and strategies. When the code was analysed by experts in the Australian Department of Defence in 1978, they made the following statements regarding the code:


  • There are insufficient symbols to provide a pattern.
  • The symbols could be a complex substitute code or the meaningless response to a disturbed mind.
  • It is not possible to provide a satisfactory answer.


Last year's code cracking project group ran a number of tests to determine which encryption method was most likely used. In their final report, they determined that the Somerton man's code was not a transposition or substitution cipher. They tested the Playfair Cipher, the Vigenere Cipher and the one-time pad[3]. The group concluded that the code was most likely an initialism of a sentence of text and, using frequency analysis, determined that the most likely intended sequence of letters in the code was:


WRGOABABD
WTBIMPANETP
MLIABOAIAQC
ITTMTSAMSTCAB

Project Requirements and Specifications

The entire project can be broken into 3 main modules. These are:

  • Verification of past results
  • Design/utilisation of a web crawler
  • Pattern matching algorithm

Each module has a set of requirements specific to its design and application with respect to the project.

Software Requirements

The following block diagram shows the designed software layout for traversing a website, filtering through raw text and storing all important findings. This method uses a breadth-first traversal, which is currently being revised and may be changed to depth-first. For more information, see the "Web Crawler Plans" section.

Web Crawling Requirements

The web crawling device will be used to visit a website of interest and retrieve data that may contain information relevant to the project. The basic requirements of the web crawler are as follows:

  • Take in a base URL from the user
  • Traverse the base website
  • Pass forward any useful data to the pattern matching algorithm
  • Finish after traversing entire website or reaching a specified limit


The crawler will be required to detect and handle the following circumstances (a generic sketch of these checks follows the list):

  • Bad I/O
  • Found a valid HTML URL
  • Found a Non HTML URL
  • Found a URL linking to an external site
  • Found an Invalid URL
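
Below is a generic, hypothetical sketch in plain Java of how these cases could be classified before a page is fetched. It is not Arachnid code; the base URL, the page limit and the file-extension test are illustrative assumptions, and page fetching and link extraction are omitted.

 import java.net.MalformedURLException;
 import java.net.URL;
 import java.util.ArrayDeque;
 import java.util.Deque;
 import java.util.HashSet;
 import java.util.Set;
 
 // Generic sketch of the case handling listed above (not Arachnid code): each
 // discovered link is classified before it would be fetched, so the crawler only
 // follows valid HTTP pages belonging to the same site.
 public class CrawlerCaseHandling {
     public static void main(String[] args) throws MalformedURLException {
         URL base = new URL("http://www.adelaide.edu.au/");   // base URL taken from the user
         Deque<String> toVisit = new ArrayDeque<String>();
         Set<String> seen = new HashSet<String>();
         toVisit.add(base.toString());
 
         while (!toVisit.isEmpty() && seen.size() < 100) {     // finish at a specified limit
             String link = toVisit.poll();
             if (link == null || !seen.add(link)) {
                 continue;                                     // null URL or already visited
             }
             URL url;
             try {
                 url = new URL(link);
             } catch (MalformedURLException e) {
                 System.out.println("Invalid URL: " + link);   // found an invalid URL
                 continue;
             }
             if (!"http".equals(url.getProtocol())) {
                 continue;                                     // wrong protocol
             }
             if (!url.getHost().equals(base.getHost())) {
                 continue;                                     // URL linking to an external site
             }
             String path = url.getPath().toLowerCase();
             if (path.contains(".") && !path.endsWith(".html") && !path.endsWith(".htm")) {
                 continue;                                     // non-HTML URL (e.g. .pdf or .jpg)
             }
             System.out.println("Would fetch and parse: " + url);
             // ...fetch the page here, pass its text to the pattern matcher,
             // and add any newly found links to toVisit...
         }
     }
 }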

Pattern Matching Algorithm Requirements

The pattern matching code must be able to distinguish HTML markup from raw text and data. The HTML markup is largely useless and is likely to be repeated across multiple sites, so it would not contribute useful results (in fact, if its frequency were high it could skew our results). How the algorithm decides what is and is not HTML is important, because it reduces both the amount of processing and the time needed to complete the methods in the algorithm.

The code must be able to switch between matching every character and matching only the initial letters of words. Since it has been concluded that the Somerton man's code is probably an initialism, initial-letter matching is the primary mode; full-character matching is included as added functionality so that the web crawler has uses outside of this project.

Moreover, we want the code to be able to search for exact patterns as well as "similar" patterns. That is, text that has the same pattern as our given input but with different characters (for example, if our input is "ABAB" and the text contains "XYXY", it would be counted as a "similar" pattern and noted down). The code should also handle non-letter characters such as punctuation.

For every matched pattern, the code should record the surrounding text (the sentence or phrase in which the pattern was found) and possibly the URL in a list or array. This is the most significant part of the algorithm because it will provide the basis of our results. From this, we can build frequency logs of certain words to better determine what the initialisms in the Somerton man's code could possibly stand for.
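
As an illustration of the "similar" pattern idea above, the following hypothetical helper tests whether a candidate string has the same repetition structure as the search input (so "XYXY" matches the shape of "ABAB"). It is a sketch only, not part of the project code.

 // Hypothetical sketch: tests whether 'candidate' repeats its characters in the
 // same positions as 'pattern', which is one way to define a "similar" pattern.
 public class SimilarPattern {
     static boolean sameShape(String pattern, String candidate) {
         if (pattern.length() != candidate.length()) {
             return false;
         }
         for (int i = 0; i < pattern.length(); i++) {
             for (int j = i + 1; j < pattern.length(); j++) {
                 boolean samePattern = pattern.charAt(i) == pattern.charAt(j);
                 boolean sameCandidate = candidate.charAt(i) == candidate.charAt(j);
                 if (samePattern != sameCandidate) {
                     return false;   // repetition structure differs
                 }
             }
         }
         return true;
     }
 
     public static void main(String[] args) {
         System.out.println(sameShape("ABAB", "XYXY")); // true  - similar pattern
         System.out.println(sameShape("ABAB", "XYZX")); // false - different shape
     }
 }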

Progress So Far

Verification of Past Results

Completed Random Letter Sample form.

In order to decide on a direction for the use of the web crawler, it was first important to look at the previous students' work and decide whether their conclusions should be accepted.

The verification of their results was undertaken in two sections:

  • Collection of random letter samples for comparison to the Somerton man code
  • Analysis and Review of their code

A template for collecting random letter samples was created and used to collect 45 random letter samples from a variety of subjects. 45 was chosen as it is roughly the number of letters in the Somerton man code. Details taken from each subject include the following, and can be seen in the completed form to the right.

  • Intox. Yes/no - the Physical state of each subject was recorded to keep samples taken from intoxicated subjects separate from those that were, to the best of our knowledge, not under the influence of any substance.
  • Age – each subject was required to supply their age for future use
  • M/F – the sex of each subject was recorded for future use

The graphs below show the recently collected samples next to the samples collected by the previous students and the letter frequency plot of the Somerton man code.


Although the results are not identical to the previous students' results, both are very different from the graph depicting the code from the Somerton man. It is clear that in both sets of random samples every letter has a frequency of at least one. This leads us to agree with the previous conclusion that the code was not produced randomly.

So far there have not been enough samples from intoxicated subjects to produce a useful result. This is an ongoing collection and will be finalised when there are enough samples to compile a graph that can be considered useful. It was initially planned to use the collected age and sex data to subdivide the results into different categories, in order to determine whether the letters in the Somerton man code correlate better with a specific age group or sex. It was then realised that this would require collecting many more samples than are needed to verify the past students' results, and it has therefore not been included in the results. Collecting more samples and splitting the data into categories may provide useful results for future studies into the case.
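
For reference, the letter-frequency tallying behind graphs of this kind can be done with a few lines of Java. This is a hypothetical sketch, not the analysis code actually used; the sample string is the accepted reading of the code letters listed earlier in this report.

 // Hypothetical sketch of letter-frequency counting for comparison graphs.
 public class LetterFrequency {
     public static void main(String[] args) {
         String sample = "WRGOABABD" + "WTBIMPANETP" + "MLIABOAIAQC" + "ITTMTSAMSTCAB";
         int[] counts = new int[26];
         for (char c : sample.toUpperCase().toCharArray()) {
             if (c >= 'A' && c <= 'Z') {
                 counts[c - 'A']++;                    // tally each letter A-Z
             }
         }
         for (int i = 0; i < 26; i++) {
             System.out.println((char) ('A' + i) + ": " + counts[i]);
         }
     }
 }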

The previous year's code was used to analyse the Somerton man's code using several well-known cipher techniques. Upon reviewing the software modules they created, it was decided that the algorithms used would perform the proper functions in attempting to prove or disprove their hypotheses. Their code was then compiled and run to generate a number of output files. These output files were then examined and compared to the output files that the previous project group generated. None of the generated files contained any legible words, so it can be assumed that the Somerton man's code is neither a Vigenere, Playfair nor Caesar cipher. In fact, the results suggest that the mystery code resembles a set of initialisms which may or may not use substituted letters.

Text Pattern Algorithms

The most recent revision of the pattern finding algorithm and testing text file can be found here.

The text pattern algorithm was written from scratch in Java. The main purpose of the algorithm is to search through a text file and return an output saying whether a word or initialism has been found. This was done because we were not aware of what the web crawlers were capable of using their initial functions (that is, without modifying any of their code).

To start, a flow chart of the basic function of the algorithm was made to determine how to begin coding in the most efficient way possible. This is shown below:



To explain the function as simply as possible, the basic algorithm was designed to:

  • Accept a user entered input from the command prompt and place it into a string
  • Open a text file and place a line from that file into a string
  • Compare the file string with the user’s string character by character, printing an output if the entire user entered string was found
  • Repeat comparison process for every line until the end of the file is reached.

The algorithm utilises the Scanner class from the java.util package as well as the File and FileReader classes from the java.io package. These classes are important because they allow the algorithm to open the needed text file and scan through it. They were easily found in the Java API[4] and are fairly simple to use. For this algorithm, the output reports the line and character number of the first letter of the matched word found in the text file.

To simplify the searching process, both the text file string and the user string were converted to lower case. This improves the usefulness of the algorithm because it avoids cases where two characters represent the same letter but differ only in case. Since case does not matter for our results, this has no negative effect on the final outcome of the project.
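
A minimal sketch of this exact-match behaviour is shown below. It is illustrative only and not the project's FindExact source: it uses indexOf for brevity where the real code compares character by character, and the file name is a placeholder.

 import java.io.File;
 import java.io.FileNotFoundException;
 import java.util.Scanner;
 
 // Illustrative sketch of the exact-match search: scan a text file line by line,
 // lower-case both strings, and report the line and character position of matches.
 public class FindExactSketch {
     public static void main(String[] args) throws FileNotFoundException {
         String target = "test".toLowerCase();                 // user-entered search string
         Scanner file = new Scanner(new File("testing.txt"));  // hypothetical test file name
         int lineNumber = 0;
         while (file.hasNextLine()) {
             lineNumber++;
             String line = file.nextLine().toLowerCase();
             int index = line.indexOf(target);
             while (index >= 0) {
                 System.out.println("Found \"" + target + "\" at line " + lineNumber
                         + ", character " + (index + 1));
                 index = line.indexOf(target, index + 1);      // keep searching this line
             }
         }
         file.close();
     }
 }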

To implement searching for initialisms, the area circled in red was modified to only change the swCount variable if the previous character in the file string was a white space. The output reports the line and character number as in the previous method, as well as the sentence that the initialism represents. This was done because the aim of the project is to find what the initialisms of the Somerton code could possibly represent. This seemed to work, although several errors in the output were observed in some special cases.

For example, if the file string contained two white spaces in a row, the output would be incorrect (the result would print the wrong character position and the wrong set of words matching the initialism). Problems also occurred when the input string had spaces in it (the method incorrectly assumed that each space represented an initialism as well). Several attempts were made to debug and fix these problems. The final solution was to create another method (used by the initialism finding algorithm) to remove the white spaces in the input string. One more method was also created to print the initialism sentence properly for every case that was considered.
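
The following is an illustrative sketch of the initialism search described above, again not the project's actual FindInitialism code. It strips the spaces from the search string, collapses repeated white space when splitting each line into words, and prints the phrase that a matched initialism stands for; the search string and file name are placeholders.

 import java.io.File;
 import java.io.FileNotFoundException;
 import java.util.ArrayList;
 import java.util.List;
 import java.util.Scanner;
 
 // Illustrative sketch of initialism matching: compare the search letters against
 // consecutive word initials and print the phrase that the initialism stands for.
 public class FindInitialismSketch {
     public static void main(String[] args) throws FileNotFoundException {
         String input = "i t t s".replaceAll("\\s+", "").toLowerCase(); // spaces stripped from the search string
         Scanner file = new Scanner(new File("testing.txt"));           // hypothetical test file name
         int lineNumber = 0;
         while (file.hasNextLine()) {
             lineNumber++;
             String line = file.nextLine().trim();
             if (line.isEmpty()) {
                 continue;                                    // skip blank lines
             }
             String[] words = line.split("\\s+");             // repeated spaces collapse into one split
             for (int start = 0; start + input.length() <= words.length; start++) {
                 boolean match = true;
                 for (int i = 0; i < input.length(); i++) {
                     if (Character.toLowerCase(words[start + i].charAt(0)) != input.charAt(i)) {
                         match = false;
                         break;
                     }
                 }
                 if (match) {
                     List<String> phrase = new ArrayList<String>();
                     for (int i = 0; i < input.length(); i++) {
                         phrase.add(words[start + i]);
                     }
                     System.out.println("Line " + lineNumber + ": " + String.join(" ", phrase));
                 }
             }
         }
         file.close();
     }
 }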

For testing purposes, a main method was created which utilises the scanner class again. For this, the method accepts a user entered input to determine which algorithm will be used for the current search (FindExact or FindInitialism) and then accepts another user inputted string that represents the word or initialism that will be searched for. It is currently unknown whether this will be used as the interface for the final web crawler, although it is likely that the final interface will have a similar function.

Some of the simulations and results from testing can be found below.



The image above demonstrates the FindExact method. From the word being searched for and the testing text file used, it is possible to see that the algorithm works as intended. As stated earlier, the code treats upper and lower case characters as the same; this is illustrated above by the searches for "TEST" and "test" returning the same result. The method correctly determines the line and character number of the beginning of the word. Furthermore, the method interprets the word " hello" as "hello" (without the leading spaces) and can correctly find its location in the text file.

Although the code ignores leading spaces in the search string, it correctly interprets " t e s t" as "t e s t"; spaces between characters are kept, but spaces before the first non-whitespace character are removed. Since the FindExact method can search for multiple words, it was felt that this behaviour was a requirement of the code. To achieve it, an algorithm was used to delete the white spaces in the user input string until a non-whitespace character was met.

Additionally, the main method returns an input error when a letter other than ‘i’ or ‘e’ is used.



This image shows that the FindInitialism method returns the same result for every different search of the initialism test (with spaces in different places in the string). A considerable amount of debugging and modification was required for this to work correctly. Extensive use of the isWhitespace() method helped to achieve a working algorithm. This method was very important because it was used to determine how many words of the initialism had been printed, as well as whether a character was the beginning of a word.

The FindInitialism method is the method that will be developed for the final web crawler. It has a function specific to the final outcome for the project: to generate a list of potential sentences for an inputted initialism by reading through the source code of a web site.

Web Crawling

An open source, Java based web crawler called Arachnid[5] has so far been the main focus in terms of producing the web crawler. This crawler is proving to be quite useful for the project due to the following traits:


  • Java based
  • Basic example provided - greatly reducing the learning curve compared with becoming familiar with a crawler from scratch
  • Specific Case handling methods provided - methods for some URL cases are provided (with no handling code however it has provided a starting point to the solution)
  • Highly Modifiable


So far, after modifying the supplied crawler and exploiting methods used in the supplied example, the crawler is capable of the following:


  • Compiling successfully (this is a huge bonus, as it allows actual tests to take place)
  • Taking in a base URL from the console
  • Testing for the correct protocol (we only want to look at HTTP websites for now)
  • Testing for a null URL
  • Testing for non-HTML URLs
  • Testing for URLs that link to external websites
  • Running handleLink()
  • Iterating through the website and producing a list of found URLs and their associated titles, along with an incrementing counter

The screen print below shows the command lines printed out at the beginning of a search on www.adelaide.edu.au, the university website. As can be seen, the results display each found URL with a number and a title. The title = null lines are explained later in this report.


While using Arachnid is proving to be quite useful, there have also been some issues encountered (see “Approach for Remainder of the Project” section). Because of this, other alternative crawling methods have been researched in parallel to Arachnid.

The main alternative being investigated for retrieving data from websites is Wget[6]. This program would be highly useful for the project for the following reasons:

  • Freely available
  • Supports downloading via HTTP
  • Able to act as a web crawler to download recursively until reaching a limit or end of web site


The third point is extremely valuable to the goal of this project. Wget is able to mirror websites for offline viewing which will also allow for easier extraction and searching of data.
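
As an illustration only (the exact options should be checked against the Wget manual), a depth-limited recursive crawl of a single site, restricted to HTML pages, could be requested with a command along the following lines, where the depth limit and target site are placeholders:

 wget --recursive --level=5 --no-parent --accept html,htm --convert-links --wait=1 http://www.adelaide.edu.au/

The downloaded copy of the site could then be searched offline by the pattern matching algorithm.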

Approach for Remainder of the Project

Web Crawler Plans

With the web crawler design based on the Arachnid crawler, there are currently two major issues to be resolved:

  • Character encoding differences with HTML URLs
  • Traversal method - breadth first vs depth first


The image above shows a cut-out from the command prompt displaying the current issue regarding character encoding. The text under the numeral "1" is from a successful URL discovery, where the blue underline shows the URL 'type' and the red underline shows the found 'title' of the URL. In the second section, under the numeral "2", the type is shown as UTF-8 and, as can be seen, the title being displayed is null, meaning no valid information is retrieved from the URL found at that point.


It is important to overcome this problem because UTF-8 is the international "standard" for encoding URLs and appears to be the most common type found. The plan is to first try converting URLs to UTF-16 before handling them with Arachnid. The reason for this is that Arachnid currently handles UTF-16 quite well, and it may be simpler to convert each found URL to the encoding that already works rather than altering the crawler to deal with a whole new character encoding. To do this, the Java URLEncoder class will be experimented with.
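
A minimal sketch of such an experiment is shown below. Note that URLEncoder is intended for encoding query strings rather than whole URLs, so this only illustrates where the character set is chosen; it is a starting point for the encoding experiments, not a solution, and the example string is invented.

 import java.io.UnsupportedEncodingException;
 import java.net.URLEncoder;
 
 // Minimal experiment with URLEncoder: the second argument selects the character
 // encoding used when percent-encoding, which is where UTF-8 and UTF-16 behaviour
 // can be compared. The example string is a hypothetical link title.
 public class EncodingExperiment {
     public static void main(String[] args) throws UnsupportedEncodingException {
         String title = "Café page";
         System.out.println(URLEncoder.encode(title, "UTF-8"));
         System.out.println(URLEncoder.encode(title, "UTF-16"));
     }
 }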

The other issue, involving traversal order, is less critical but will be important for data collection and storage. The websites of interest will be information sites containing large amounts of text, most likely arranged in a structured, hierarchical manner. It is therefore important to be able to move down one avenue of information on a site, storing similar items of interest together so that they can be easily collected and analysed later. Breadth-first traversal is the simplest approach to filtering through a website, however it results in the data found being stored in an unorganised manner with respect to the aim of the project. Depth-first is slightly harder to implement but, if it can be achieved, will result in all results of a similar nature being stored in an ordered fashion. The following figures show the difference between breadth-first and depth-first traversal and the resulting order of stored data. Each node can be thought of as a different web page on the same website, where the numbers relate to the order in which each page would be searched.



It is clear from the diagrams that depth-first traversal allows data from the same category in a hierarchy to be searched in order, meaning that the patterns and details found are also stored in a usable order. Depth-first traversal is therefore the method most useful to the project and will be implemented as well as possible.
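
The difference between the two orders can be illustrated with a small, made-up site map; this sketch is not crawler code, it only shows that taking pages from the back or the front of the frontier switches between depth-first and breadth-first order.

 import java.util.ArrayDeque;
 import java.util.Arrays;
 import java.util.Collections;
 import java.util.Deque;
 import java.util.HashMap;
 import java.util.HashSet;
 import java.util.List;
 import java.util.Map;
 import java.util.Set;
 
 // Illustrative comparison of depth-first and breadth-first traversal on an
 // invented site map; page names are placeholders.
 public class TraversalOrder {
     static Map<String, List<String>> links = new HashMap<String, List<String>>();
 
     public static void main(String[] args) {
         links.put("home", Arrays.asList("poetry", "history"));
         links.put("poetry", Arrays.asList("rubaiyat"));
         links.put("history", Arrays.asList("1948"));
 
         traverse("home", true);   // depth-first:   home history 1948 poetry rubaiyat
         traverse("home", false);  // breadth-first: home poetry history rubaiyat 1948
     }
 
     static void traverse(String start, boolean depthFirst) {
         Deque<String> frontier = new ArrayDeque<String>();
         Set<String> visited = new HashSet<String>();
         frontier.add(start);
         while (!frontier.isEmpty()) {
             String page = depthFirst ? frontier.removeLast() : frontier.removeFirst();
             if (!visited.add(page)) {
                 continue;                                  // skip pages already searched
             }
             System.out.print(page + " ");
             for (String child : links.getOrDefault(page, Collections.<String>emptyList())) {
                 frontier.addLast(child);
             }
         }
         System.out.println();
     }
 }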


In order to minimise any large delays in the project due to issues with Arachnid, more time will be spent with Wget, using it as an effective web crawler. As Wget is written in C, there may be some issues in adapting the pattern searching code to use data extracted by Wget. This means that researching Wget and experimenting with it as soon as possible is important, in order to provide as much time as possible for combining the web crawler and the pattern matching algorithm.

Pattern Algorithm Plans

The major remaining areas of the pattern matching algorithm yet to be completed are the implementation of ignoring HTML, and the ability to search for exact and 'similar' initialisms as described in the Stage 1 Design. Ignoring HTML will likely be done by ignoring characters in the text once a "<" is encountered, and then resuming the character comparison after a ">" is found.
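
A hypothetical sketch of this HTML-skipping idea is shown below; it simply drops everything between "<" and ">" so that only the visible text would be passed to the pattern matcher.

 // Hypothetical sketch of the HTML-skipping idea: ignore characters inside tags.
 public class StripTags {
     static String stripHtml(String line) {
         StringBuilder text = new StringBuilder();
         boolean insideTag = false;
         for (char c : line.toCharArray()) {
             if (c == '<') {
                 insideTag = true;          // stop copying characters
             } else if (c == '>') {
                 insideTag = false;         // resume copying after the tag closes
             } else if (!insideTag) {
                 text.append(c);
             }
         }
         return text.toString();
     }
 
     public static void main(String[] args) {
         System.out.println(stripHtml("<p>It is the <b>test</b> sentence</p>"));
         // prints: It is the test sentence
     }
 }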

The method for finding 'similar' initialisms has not yet been settled, although some ideas have been formed. The approach will likely be similar to the FindInitialism method, relying on careful logic rather than just the isWhitespace() method. Moreover, we would also like to implement wildcard searches when ' * ' is used. This will likely be done by incrementing the swCount variable whenever the ' * ' character is compared against any character in the text file string.

Additionally, further implementation could include saving the found initialisms into another text file for inspection later. Beyond that, more testing and debugging should be done and any unforeseen errors that occur should be fixed as soon as possible. Once that is completed, the code will be ready to be integrated into the web crawler. This area of the project is almost finished and the future lies with the combined web crawler-pattern matching algorithm.

Obtaining and Using Results

The main sites of interest will be websites with large amounts of raw data that can easily be parsed using the pattern matching algorithm. If possible, a variety of different sites should be searched to widen the results as much as possible. Using both the web crawler and the pattern matching algorithm, we can then produce a list of potential expansions of the initialisms in the Somerton man's code. From this list, we can create frequency graphs to narrow down the candidates and hopefully solve the mystery code.
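
The frequency analysis step could be as simple as tallying how often each candidate phrase returned by the initialism search occurs, so that the most common expansions can be ranked. The sketch below is hypothetical and the phrases in it are invented placeholders, not real results.

 import java.util.Arrays;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
 
 // Hypothetical sketch of the frequency analysis step: count how often each
 // candidate phrase occurs so the most common expansions can be ranked.
 public class PhraseFrequency {
     public static void main(String[] args) {
         List<String> found = Arrays.asList(
                 "it took me to", "in this moment the", "it took me to"); // placeholder phrases
         Map<String, Integer> counts = new HashMap<String, Integer>();
         for (String phrase : found) {
             Integer current = counts.get(phrase);
             counts.put(phrase, current == null ? 1 : current + 1);       // increment tally
         }
         for (Map.Entry<String, Integer> entry : counts.entrySet()) {
             System.out.println(entry.getValue() + "  " + entry.getKey());
         }
     }
 }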

Project Management

Project Alterations

As the project has progressed there have been a few things that have changed with respect to the original project plan.

It was originally planned to search the entire internet for patterns of interest. It was quickly realised that this would be unachievable within the time frame and resources of this project; the internet is simply too large and changes too quickly for our techniques and equipment. For this reason the project goal has been downsized to searching through one website at a time. That is, there are now constraints and limits on the size of the search that the software will perform. The specifications now state that the software will take in a starting URL and search through that website and that website only. This has introduced some extra challenges, such as dealing with URLs that link to external websites, however it also makes the project goal much more manageable and realistic.

Work Breakdown

The work breakdown for the project is the same as in the Stage 1 Design[7]. At this point in time, the two tasks being completed are testing/modifying the web crawler and testing/modifying the pattern matching algorithm. This more or less matches the expected timeline from the Gantt chart that was previously created. Due to end of semester assignments and exams, it is likely that very little work will be done over the next few weeks and the project may fall behind. However, during the mid-semester break there will be sufficient time to catch up and potentially get ahead of schedule.

Budget & Resources

As stated when this project was started, the majority of the work in this project is software based. The software being utilised is Java based and can be readily accessed for free through the university's resources and on the internet. It was estimated that very little of the $500 budget would be used. Although barely any of the budget has been spent, it is still available should a situation arise where it is needed (such as funding alcohol for further random letter samples from intoxicated subjects).

Costs so far:

  • The Secrets of Codes: Understanding the World of Hidden Messages by Paul Lunde - $50

Risk Management

As stated in the Stage 1 Design, the project has very little risk involved, so the risk assessment is very similar to the previous assessment. Since we are now more familiar with the project, we are more aware of the risks and potential dangers that could cause the project to fail. These include not being able to sufficiently code the pattern matching algorithm or use the web crawler. While both of these are unlikely, there is a possibility that such problems could cause a delay. Additionally, due to the current winter weather, there is a greater chance of a project member becoming ill.

The new risk assessment is below.


 Risk                                      | Probability | Impact | Comments
 Project member falls sick                 | 4           | 6      | Could leave 1 member with a huge amount of work to complete
 Unable to code pattern matching algorithm | 2           | 8      | Additional time might have to be spent, delaying the project
 Unable to contact project supervisors     | 2           | 6      | If it is at a critical time problems may arise
 Crawlers not effective at parsing data    | 3           | 5      | A different web crawler may have to be used, delaying the project
 Run out of finance                        | 1           | 7      | Has the potential to be a problem if software needs to be purchased
 Somebody else cracks the code             | .009        | 5      | The project may lose its "x" factor but the web crawler will still be useful


Most of these risks can be minimised with proper time management. If a project member hits a road block in the project's progression, a supervisor can be approached for help. If the supervisors are unavailable, e-mails can be sent and a reply will typically be given as soon as possible. Sickness should not delay the project significantly because the work is largely software based and can be done from home.

Conclusion

Overall there has been significant progress towards achieving the aim of the project. The previous results have been accepted and there has been advancement in the areas of both web crawling and producing pattern matching algorithms. After some minor issues are overcome in both areas, it is planned to integrate the two modules of the project together. This will allow for extensive searching and result compilation for further analysis.
