==Background==

===Previous Studies===

Over the years there have been many attempts to shed light on the code's meaning. Of particular interest are the analysis performed by Government cryptanalysts in Canberra and two previous projects completed by students at the University of Adelaide.

In 1978, the code was examined by Government cryptanalysts in the Department of Defence. They reached three conclusions:
# There are insufficient symbols to provide a pattern.
# The symbols could be a complex substitute code or the meaningless response of a disturbed mind.
# It is not possible to provide a satisfactory answer<ref name=InsideStory>''Inside Story'', presented by Stuart Littlemore, ABC TV, 1978.</ref>.

These results appear discouraging; however, the study was done in 1978, prior to the technological advances this project aims to utilise. In light of this, the University of Adelaide student projects of the past two years are more relevant as a starting point for our own investigations.

In [[Final report 2009: Who killed the Somerton man?|2009]], the students concentrated on a cipher investigation and a structural investigation. They concluded that the Somerton code is not random; it contains a message. They also drew several conclusions regarding ciphers, such as that a transposition cipher was not used (the letters are not simply shifted in position) and that the code is consistent with an English initialism, a sequence of initial letters of English words<ref name=FinalReport2009>''Final Report 2009'', Bihari, Denley and Turnbull, Andrew, 2009, https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_report_2009:_Who_killed_the_Somerton_man%3F</ref>.

The students examining the code in [[Final Report 2010|2010]] concentrated on a statistical pattern investigation, analysing the everyday occurrence of the patterns evident in the code, and created web analysis software able to archive web pages and analyse their contents<ref name=FinalReport2010>''Final Report 2010'', Ramirez, Kevin and Lewis-Vassallo, Michael, 2010, https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Final_Report_2010</ref>.

===Software Tools===

This section lists the software tools used in the development of the project. All code development was completed using the Java programming language version 1.6.0_26 and the Java Standard Edition Runtime Environment version 1.6.0_26<ref name=OracleJavaSE>''Java SE Overview'', Oracle Technology Network, http://www.oracle.com/technetwork/java/javase/overview/index.html</ref>. Testing was conducted across three operating systems: [http://www.apple.com/macosx/ Mac OS X Lion], [http://windows.microsoft.com/en-AU/windows7/products/home Windows 7] and [http://www.ubuntu.com/ Ubuntu Linux] version 9.10. The [http://netbeans.org/ NetBeans] integrated development environment, version 7.0.1, was used for graphical user interface development in combination with the [http://www.eclipse.org/ Eclipse] integrated development environment, versions Galileo, Helios and Indigo. Multiple Wikipedia webpages were used for documentation and result co-ordination.

===Cipher Methodology and Analysis===

A cipher is any general system for hiding the meaning of a message by replacing each letter in the original message with another letter<ref name=TheCodeBook>''The Code Book'', Simon Singh, The Fourth Estate, 2000.</ref>. The two general types of cipher are substitution ciphers and transposition ciphers.
Substitution ciphers are those in which each letter of the plaintext is replaced by another letter to form the ciphertext, while transposition ciphers are those in which the letters of the message retain their values but change position<ref name=TheCodeBook/>.

A simple example of a substitution cipher is the Caesar cipher. The cipher is formed by shifting each plaintext letter three places along the alphabet to form the ciphertext letter, as shown in the figure below.

<center>[[File:Caesar Cipher.png|Caesar Cipher]]</center>
<center>'''Figure 2 - Caesar cipher encryption process'''</center>

A simple example of a Caesar cipher encryption would be to encrypt the word "face". Moving each letter along three places (refer to the figure above), the plaintext letter 'f' goes to ciphertext 'I', 'a' goes to 'D', 'c' to 'F' and 'e' to 'H'. Thus the plaintext "face" is transformed into the ciphertext "IDFH". Decryption is performed by simply reversing the process.

Ciphers typically involve a general method that specifies what sort of algorithm is used in encrypting the plaintext, and a key that specifies the exact values used in the algorithm<ref name=TheCodeBook/>. In the Caesar cipher above, the algorithm is a shift of the alphabet and the key is 3, giving the specific instruction of a shift of three places.

Cipher analysis on substitution ciphers is traditionally performed through a process called [http://en.wikipedia.org/wiki/Frequency_analysis frequency analysis]. This process uses linguistics and statistics, recognising that each letter in a language has its own characteristics that can be used to identify it. For example, in the English language the letter 'e' occurs most commonly, on average 13% of the time<ref name=letterfreq>Lewand, Robert, ''English Letter Frequencies'', http://pages.central.edu/emp/LintonT/classes/spring01/cryptography/letterfreq.html</ref>. Thus it would make sense to replace the most commonly occurring ciphertext letter with the plaintext letter 'e'.
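To make the two techniques concrete, the short sketch below reproduces the Caesar shift of the "face" example and a basic letter-frequency count of the kind used in frequency analysis. It is an illustrative sketch only; the class and method names are our own and are not taken from the project code.

<syntaxhighlight lang="java">
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: a Caesar shift and a simple letter-frequency count.
public class CipherSketch {

    // Shift each alphabetic character 'shift' places along the alphabet.
    // Output is upper case, mirroring the ciphertext convention in Figure 2.
    static String caesarEncrypt(String plaintext, int shift) {
        StringBuilder ciphertext = new StringBuilder();
        for (char c : plaintext.toUpperCase().toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                ciphertext.append((char) ('A' + (c - 'A' + shift) % 26));
            } else {
                ciphertext.append(c); // leave spaces and punctuation untouched
            }
        }
        return ciphertext.toString();
    }

    // Count how often each letter occurs: the raw data for frequency analysis.
    static Map<Character, Integer> letterFrequencies(String text) {
        Map<Character, Integer> counts = new TreeMap<Character, Integer>();
        for (char c : text.toUpperCase().toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                Integer current = counts.get(c);
                counts.put(c, current == null ? 1 : current + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(caesarEncrypt("face", 3));   // prints IDFH
        System.out.println(letterFrequencies("IDFH"));  // prints {D=1, F=1, H=1, I=1}
    }
}
</syntaxhighlight>

On a longer ciphertext, the frequency counts would be compared against the known English letter frequencies (for example the 13% figure for 'e') to propose plaintext substitutions.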
===Web Crawler===

A web crawler is an automated software tool used to traverse the internet exhaustively. The set of pages to be visited is not specified prior to execution but grows dynamically as operation progresses. A seed webpage specifies the origin of the web crawler's traversal; all instances of the term seed in the remainder of this report are used in this sense. The web crawler analyses the content of the seed webpage and extracts and stores the hyperlinks to all web pages referenced from that page. The process is then repeated with the newly acquired pages in a recursive manner<ref name=Rodham>Rodham, Ken, ''Web Crawler'', Brigham Young University Computer Science, http://faculty.cs.byu.edu/~rodham/cs240/crawler/index.html</ref>. The crawl, or traversal, can be continued indefinitely, limited only by the available computer resources.

The traversal process can be visualised with the figure below. The largest node represents the seed web page and the four connected nodes represent four links the web crawler has extracted from the seed. As the recursive process continues, the search expands further over the internet.

<center>[[File:Tree.png|Web Crawler visualisation]]</center>
<center>'''Figure 3 - Web crawler traversal visualisation'''</center>

There are two possible traversal methods a web crawler can use: breadth first and depth first. In breadth-first traversal, all newly acquired web pages are examined before the hyperlinks to the next level of depth are followed; in the example above, this is equivalent to a circle radiating outwards from the seed node. In depth-first traversal, after the hyperlinks have been acquired each is followed to the maximum depth before reverting back through the hierarchy; in the example above, this is equivalent to moving from the seed node down to a smallest node and traversing each branch fully before moving back toward the seed. The 2011 web crawling application uses a breadth-first traversal mechanism.

While conceptually the web crawler can continue to traverse indefinitely, in practice there are limitations. The Robots Exclusion Protocol specifies how web crawlers must interact with web service providers, the hosts of web pages; crawlers that do not adhere to this protocol risk serious repercussions, including litigation. Secure web pages also do not permit access to web crawlers, and the traversal software must account for this.

The ability of web crawlers to traverse the web autonomously is very powerful, and in 2011 it will be exploited in combination with pattern recognition software. A top-level overview of the steps performed by a generic web crawler is shown in the flowchart below. Step 2 shows the capability of analysing the data contained on a web page; this is where the 2011 project plans to integrate its pattern matching software. The complete system forms a personalised internet search engine.

<center>[[File:Crawler explain.png|Web Crawler process]]</center>
<center>'''Figure 4 - Web crawler operational process overview'''</center>
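The sketch below illustrates the breadth-first traversal and the analysis step of Figure 4 in simplified form. It is not the 2011 application itself: the seed address, page limit and link-extraction pattern are placeholders, and the Robots Exclusion Protocol and secure-page handling discussed above are omitted for brevity.

<syntaxhighlight lang="java">
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of a breadth-first crawl: a FIFO queue of pages to visit
// and a set of pages already examined.
public class CrawlerSketch {

    // Placeholder link-extraction pattern; a real crawler needs a proper HTML parser.
    private static final Pattern LINK =
            Pattern.compile("href=\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        Queue<String> frontier = new LinkedList<String>();
        Set<String> visited = new HashSet<String>();
        frontier.add("http://example.com/");  // hypothetical seed page
        int limit = 50;                       // stop after examining 50 pages

        while (!frontier.isEmpty() && visited.size() < limit) {
            String page = frontier.poll();
            if (!visited.add(page)) {
                continue;                     // page already examined
            }
            String html = fetch(page);
            // Step 2 of Figure 4: analyse the page content here
            // (the 2011 project integrates its pattern matching software at this point).
            Matcher m = LINK.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (!visited.contains(link)) {
                    frontier.add(link);       // breadth first: enqueue, examine later
                }
            }
        }
    }

    // Download a page and return its raw HTML; unreachable pages yield an empty string.
    private static String fetch(String address) {
        StringBuilder html = new StringBuilder();
        try {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(address).openStream()));
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
            in.close();
        } catch (Exception e) {
            // secure or unreachable pages are simply skipped in this sketch
        }
        return html.toString();
    }
}
</syntaxhighlight>

Because newly found links are appended to the end of the queue, every page at one level of depth is examined before any page at the next level, which is the breadth-first behaviour chosen for the 2011 application.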