Structuring the Taman Shud code cracking process

Here are some preliminary guidelines on how to structure the investigation into the code. The idea of the project is not to crack the code as such. The idea is to eliminate what the code is not in a structured and well-documented way. Of course, if you crack the whole thing open that will be the cherry on the cake, but we are marking you on the engineering methodology you put into the cake and not the cherry.

In any engineering problem, what we always do is make simplified assumptions to make the problem tractable and then add extra complexity later (if needed). The same trick holds for analyzing this code. Break it up into a number of tasks that are based on reasonable simplified assumptions. Investigate these simplified cases thoroughly, and then the results will suggest where to go next. So long as you do each step methodically that is what matters. We don't care if you don't crack the code. If you run out of time to add on further complexities, you can suggest what needs to be done in the future in your conclusions. So the conclusion of your final report should present a structured plan on what a future team needs to do.

Management

Carefully study how the tasks have been logically split up below. Fill in the deliberately missing blanks. Figure out the best sequence to do them in. Figure out how to allocate the tasks between you two so that there can be two streams of complementary work going on in parallel. This will then form the guts of your Critical Design Review (CDR).

Hypothesis 1: the code is gibberish

• Assume the code is in fact a meaningless string of letters. This assumes the Somerton man was normally an English speaker, but was drunk or so badly poisoned with hallucinogens that he was writing a delusional string of letters.
• Think of ways to test this hypothesis.
• Get 10 native English speakers to write a string of 50 random letters before and after a fixed number of beers. They must try to be random only using their mind and not use computers or external devices. Better to chose friends who study courses where they don't teach you what randomness is. Otherwise you may get the odd friend who tries to be too smart and not go with the game. Arts students will be perfect victims.
• Then think of ways you can statistically compare the Somerton code to these gibberish sequences. Plot letter frequencies of gibberish with error bars. Make counts of letter pair frequencies. Are there letters of the alphabet people consistently missed out and how does this compare to the code? How do the most frequent letters compare?
• Calculate the average information in bits per symbol $H(x)$. You do this by summing up all $H(x_i)$ over the code, where $x_i$ is the $i$-th symbol. So let's say $x_1$ is the symbol 'R', you count up how many times R appears in the code and divide it by the total letters to give the probability $P(x_1)$ of there being an 'R'. Then by definition, $H(x_i) = P(x_i){\rm log_2}(P(x_i))$. Do this for all the symbols and up all the $H$'s. This is what is called the Shannon average information and has the units of 'bits'. Do this for both the gibberish and the code.

Hypothesis 2: the code is in English, but the letters have been substituted

• Make a big long list of coding techniques. Try to eliminate some based on the date they were invented. Or ones that would require more computing power than was available in 1948.
• It is not reasonable to assume he did the code in his head. A computer of the time could have been used. He could have been hurriedly copying down the code that someone else prepared. Maybe he ripped the code off someone else when they weren't looking.
• Then take your reduced list of coding techniques, and code up an e-book written in English. You should sub-sample chunks of 50 letters 10 times, and generate error bars. Then look at the letter frequencies before and after coding. Compare this with the Somerton code and you should be able to eliminate some.
• Then you can repeat the above using different statistical features such a letter pairs. Also try calculating the probability of a symbol $x_i$ appearing after a symbol $x_j$. This is called the transition probability. Comparing transition probabilities can give more clues.

Vigenère cipher

• Try the Kasiski test and the Friedman test to deduce possible key lengths.
• Once the key length is known, you can then write out the code in rows of that length, and then do a frequency analysis on each of the columns—since each column corresponds to a single letter, then a column is the result of a separate substitution cipher. You can then reuse your software for the substitution cipher to try and decrypt.
• More details here. You can try this out on your own coded up known-text as a sanity check, before you apply it to the code.

• A one-time pad is like the Vigènere cipher except that the key is the same length as the plaintext, and ideally uniformly random (i.e. all letters occur with equal probability in the key). This then generates a ciphertext with letters of equal probability (i.e. a flat probability distribution).
• In practice, he may have used some piece of text he had on him as the key, so the key would not have had letters of equal probability.
• You have already found the probability distribution is non-uniform, so it seems likely that if it was a one-time pad, he used a piece of text as the key.
• Possible keys:
• Quatrains from the Rubaiyat. You should generate all possible variants of the Fitzgerald translation using the list of differences, and try all possible quatrains (i.e. all quatrains from all versions). You can rule out those that are shorter than the ciphertext, unless you consider concatenation of successive pairs of quatrains. There are certain quatrains associated with the case, it would be interesting if it was one of those, so try them first.
• KJV version of the bible. This was the version that Gideons distributed to hotels in 1948, and we can be pretty sure the dead guy was a traveller. You would need to know where to start taking the key from, however. One possible suggestion is that some of the first letters specify book, chapter and verse. The BAB on the top line could be a disguised 3,4,3 or 5,4,3 (the first B does look a little strange). So that could be where to start from in the bible, and the following letters would be the ciphertext. Alternatively you could decode the first three letters (A=1, B=2, etc.). Starting from one (probably) as the chapters and verses are numbered starting with one. Normally you use A=0, B=1 for decoding, however, to match up with the modulo arithmetic used.
• Even if none of these work, you may still be able to crack part of the code. Since e is the most common letter in English texts, then you would get 'e'+'e' appearing most frequently in the ciphertext (if the plaintext and key were both in English), and etc. for TOAIN... However the code is too short to be able to use this (statistically) reliably to decrypt, so it's doubtful this approach will work.

External and internal information is useful

• You can use lateral thinking and think hard about the internal information in the code and the external information of the case. How do these things help you?
• Internal information: What does the presence of a 'Q' tell you? It tells you there wasn't a straight substitution with a rotary phone. The absence of a 'U' to go with the 'Q' means it is highly unlikely the code is English letters without some kind of substitution. It could be that 'Q' is an initial.
• External information: Could the name 'Jestin' be in the code itself? Or could the word 'enormoz' be in there? This was the word used by Cold War spies to refer to a nuke. Think of half a dozen significant words, from what you know about the case. Write software to apply a sliding window across the code and assume a substitution with 'Jestin' at each point. 'Jestin' has six unique characters so there will only be a few decrypted outputs. Only six letters will be decrypted, and for the rest just print a "*". Then visually we can easily eliminate the outputs that look like rubbish. Then do this for 'enormoz' and for half a dozen fun words that you come up with.
• Could 'Jestin' or 'Tamam shud' be the cryptographic keyword? Assume some basic coding schemes that use keywords and write software to check if your guessed keywords work or not.

Hypothesis 3: the code is in English and the letters are as they are supposed to be

• If the letters are as they are supposed to be, you can easily eliminate the idea that it is one big English anagram as there is only one 'e' in the whole code. So that is highly unlikely. You can demonstrate this by slicing up an English e-book into 50 letter blocks. Then count up how many 50-letter blocks in the whole book have only one 'e'. If it is zero or a small number then qed. If it is a number bigger then expected, then see Derek later on how to explore the anagram idea more deeply.
• Do letter frequency counts of a whole e-book and compare initial letter versus all the letters. Then test to see if it is likely the Somerton code is statistically similar to initial letters.
• Then we can repeat all the above tests again, but instead of an e-book try: (1) a big e-list of girl's names, (2) a big list of place names, (3) a big list of Australian train station names, (4) a list of seaports, (5) a list of chemical names, (6) a list of scientific terms to do with making nuclear bombs, and (7) any other fun thing you can think of.

Hypothesis 4: the code is in a foreign language and the letters are as they are supposed to be

• Select some foreign language e-books. Do letter frequency counts, pair counts, transition probabilities etc. Use the frequency counts to get a simple measure of Shannon entropy. Compare. Do initial letters and letters from full words. Compare.
• Another trick is to put the code phrase by phrase into an online foreign language anagram server. Then count up how many anagrams you get for each language. The one with the most wins. This may be an indicator.
• Another trick is to find a standard paragraph translated into many languages (such as the UN Declaration of Human Rights). Then append the Somerton code to each of these paragraphs. Then zip each file up. Then plot a histogram where the y-axis is the compression ratio and the x-axis is the ranked language (highest compression ratio first, lowest last). The highest compression ratios represent the most likely languages.

Hypothesis 5: the code is in a foreign language and the letters have been substituted

• This is the hardest one and should be left to the end to see if something easier turns up. I don't expect you to worry about this case for the project.
• At the end of the project the least you can do is suggest some ideas on how future students might approach this. But save your judgment until you've gained some experience with the other techniques first.

Hypothesis 6: the structure of the code can tell us something

• This is Derek's favourite one that he wants you to get stuck in as early as possible.
• It is very powerful because it assumes nothing about the letters, only their sequence structure. It also is language independent. It is also tolerant to the ambiguous letters that can be treated as wildcards.
• Write a piece of sofware that looks for regular expressions. For example ABAB is a repeat pair. Look for repeat pairs. MLIABO is a sexet with all unique characters. TMT is a palindrome triplet. Look carefully and find other characteristic structural features. Make a list of them.
• Get your software to search for these features within your own list of girls' names, names of chemicals, names of train stations, nuclear terms etc. Look for these regular expressions in e-books that specifically talk about the history of the Cold War and details of Operation Venona.
• If possible, write the software so it can search the whole www for these regular expressions. Try and find a search engine that supports regular expressions first. Otherwise you may have to write software that turns a regular expression into a set of search engine queries.