==Technical Background==

=== Hamming Distance and Levenshtein Distance ===

Both concepts come from information theory, and both describe the amount of difference between two strings. The Levenshtein Distance applies to strings of letters, while the Hamming Distance compares two binary strings of the same length.

The Hamming Distance measures the minimum number of operations (substitutions, precisely) required to transform string A into string B. For example, the Hamming distance between "1010" and "0010" is 1, as it requires substituting the first bit '1' of the first string with '0'. Since the calculation operates on pure binary strings, the Hamming distance can be expressed simply as the number of 1-bits in a XOR b (the population count of a XOR b).

The Levenshtein Distance, also known as edit distance, is a generalisation of the Hamming Distance: it counts not only substitutions but also insertions and deletions. The Levenshtein distance between two words is the smallest number of substitutions, insertions, and deletions of symbols needed to transform one string into the other. The example below demonstrates the calculation of the Levenshtein distance, where s marks a substitution, d a deletion, and i an insertion.

String1: INTENSION, String2: EXECUSION.

 INTE#NSION
 ||||||||||
 #EXECUSION
 ||||||||||
 dss-is----

Table 1. Levenshtein distance alignment

According to Table 1, the minimum cost to turn String1 into String2 is 5: 3 substitutions, 1 deletion and 1 insertion. Because the Levenshtein Distance considers three kinds of operations, its computation is inherently more complex than that of the Hamming Distance.
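As a concrete illustration of the two measures, the short Python sketch below computes both distances; the function names and the dynamic-programming formulation are illustrative choices, not code taken from the report.

<syntaxhighlight lang="python">
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length binary strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein_distance(s: str, t: str) -> int:
    """Minimum number of substitutions, insertions and deletions needed
    to turn s into t (standard dynamic-programming formulation)."""
    m, n = len(s), len(t)
    # dist[i][j] = edit distance between s[:i] and t[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        dist[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[m][n]


print(hamming_distance("1010", "0010"))                # 1
print(levenshtein_distance("INTENSION", "EXECUSION"))  # 5, matching Table 1
</syntaxhighlight>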
=== Simhash Algorithms ===

==== Brief Introduction of Simhash ====

The Simhash algorithm was originally proposed by Moses Charikar to estimate similarities within large volumes of data [13]. Google later adopted Simhash as its duplicate-removal algorithm for dealing with massive web data. Charikar's algorithm has been shown to be practically useful for identifying near-duplicates in web documents belonging to a multi-billion page repository [14]. The idea behind Simhash is extremely condensed; it is even simpler than the algorithm for finding all fingerprints within Hamming Distance k described in the Google paper mentioned above. As the algorithm performs well in similarity checks, it is adapted here to calculate the similarities of 2-gram strings in the tests.

==== Why Simhash? ====

Traditional similarity-check algorithms use the Vector Space Model: a document is separated into individual terms, and each term is assigned its own dimension in a multidimensional space. The value along each dimension is calculated by a specific algorithm, usually based on the term's frequency of occurrence, so high-frequency terms are assigned relatively larger values. The weighting algorithms may differ, but the Vector Space Model itself is fixed. After modelling, a text is transformed into a set of vectors.

[[File:SFigure4.jpg|thumb|500px|center|Figure 4. Vector Space Model]]

Figure 4 illustrates the Vector Space Model: d1 and d2 represent two texts that have already been modelled. The similarity of the two documents can be found by calculating the cosine similarity of their corresponding vectors, that is, the dot product of the two vectors divided by the product of their Euclidean lengths: Similarity(d1, d2) = cos(theta(d1, d2)) = (d1 · d2) / (|d1| |d2|). Note that the two-dimensional picture is only for illustration; in practice the number of dimensions can be much higher, but the principle is the same.

The Vector Space Model has been shown to give accurate similarity estimates, but at the expense of exceptionally high complexity in both time and space. Modelling long texts into vectors takes up a large amount of storage, and every similarity value requires computing the cosine of the angle between two vectors. It is therefore not a wise choice for data containing a large number of terms. The Simhash algorithm reduces the complexity significantly while preserving an accurate estimate of the similarities.

Before explaining Simhash, it is necessary to introduce the hash function. A hash function maps data of arbitrary size to hash values of a fixed size; after hash mapping, each term is allocated a hash value as its fingerprint. A well-designed hash function should be collision-resistant, meaning it is computationally infeasible to find two inputs that produce the same hash value. It should also be sensitive to trivial changes: even if the input changes by only one bit, the hash value becomes completely different.

The Simhash algorithm is explained in detail below. Figure 5 demonstrates the general process of calculating the Simhash fingerprint of a given document.

# First, the original document (the large blue box) is separated into n individual terms (the small blue bars). The separation rule can be customised for different cases; in the tests that follow, the document is separated into 2-gram letter groups.
# For each term, calculate its corresponding weight (w1, w2, ..., wn). Normally the weight is determined by the term's frequency of occurrence, but it can also be customised to meet specific requirements.
# For each term and its weight: apply a hash mapping to the term (the length n of the hash can be chosen as required; Google used 64-bit hashes in its web-page duplicate-removal program), then multiply the n-bit hash result by the weight, treating each '0' bit as -1 and leaving each '1' bit unchanged.
# Add all the results from the previous step together, giving a vector of n numbers. The final fingerprint is an n-bit binary string: for each number, if it is positive the corresponding bit is set to '1', otherwise it is set to '0'.
# The similarity of two documents is then estimated by calculating the Hamming Distance between their Simhash fingerprints.

[[File:SFigure5.jpg|thumb|500px|center|Figure 5. Simhash procedure]]

This is how the Simhash algorithm maps a given long text into an n-bit fingerprint.
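The following Python sketch follows the five steps above, using the 2-gram separation and frequency weights described in the report; the 64-bit hash derived from MD5 and the helper names are illustrative assumptions rather than the report's actual implementation, so the resulting bit patterns will differ from the fingerprints quoted later in this section.

<syntaxhighlight lang="python">
import hashlib
from collections import Counter


def _hash64(term: str) -> int:
    """64-bit hash of a term, taken from the first 8 bytes of its MD5 digest
    (an illustrative choice; any well-mixed n-bit hash would do)."""
    return int.from_bytes(hashlib.md5(term.encode("utf-8")).digest()[:8], "big")


def simhash(text: str, n: int = 64) -> int:
    """n-bit Simhash fingerprint of text, built from overlapping 2-gram terms
    weighted by their frequency of occurrence."""
    terms = [text[i:i + 2] for i in range(len(text) - 1)]  # step 1: 2-gram separation
    weights = Counter(terms)                               # step 2: frequency weights
    v = [0] * n
    for term, w in weights.items():                        # step 3: hash and weight
        h = _hash64(term)
        for bit in range(n):
            if (h >> bit) & 1:
                v[bit] += w      # a '1' bit contributes +w
            else:
                v[bit] -= w      # a '0' bit is treated as -1, contributes -w
    fingerprint = 0
    for bit in range(n):                                   # step 4: keep only the signs
        if v[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(f1: int, f2: int) -> int:
    """Step 5: Hamming distance between two fingerprints."""
    return bin(f1 ^ f2).count("1")


# Near-duplicate strings should give a much smaller distance than unrelated ones.
print(hamming(simhash("the quick brown fox"), simhash("the quick brown fix")))
</syntaxhighlight>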
After the previous illustration, the geometric meaning of the algorithm is fairly explicit. The hash mapping places the long text in an n-dimensional space: each individual term is transformed into a vector, and because of the hash mapping these vectors can be regarded as nearly uniformly distributed. Multiplying by the weights and summing produces a "sum-vector". The sum-vector is then compressed by mapping positive components to 1 and the others to 0; this operation preserves the quadrant information of the sum-vector. With n = 64, a 64-bit fingerprint can distinguish as many as 2<sup>64</sup> quadrants, which appears to be enough to characterise a specific document.

Theoretically the algorithm is reasonable, but why an n-bit fingerprint is able to manifest the similarity between documents remains unexplained, and Charikar (the inventor of the Simhash algorithm) did not give a justification either. Nevertheless, the tests that have been done show that the Simhash algorithm does a good job of estimating similarity. Here is a simple example of similarity estimation based on the Simhash algorithm and the Hamming Distance. String A (the initialism of the lyrics of "Bohemian Rhapsody" by Queen), String B (copied from String A with 3 letters modified) and String C (an arbitrary string) were tested with the Simhash algorithm:

 String A: ITTRLITJFCIALNEFROYELUTTSASIJAPBINNSBIECEGALHLLATWBDRMTMTM
 String B: ITTRQITJFCIALNEFRQYELUTTSASIJAPBINNSBIECEZALHLLATWBDRMTMTM
 String C: KAYCDRKBPQOGVTAACDUQKXJNZNZMXCBNUKPHVODWUUSQGJZFFYUKHBDMFY

After the 2-gram separation, the Simhash of each string is shown below (the differing bits are highlighted in red in the original report):

 SimHash of A: 100111001000100001111000011010110111100011011110001001
 SimHash of B: 100111001100100001111000011110110111100011011110001001

 SimHash of A: 100111001000100001111000011010110111100011011110001001
 SimHash of C: 100110101011010101101100011001001110000111111111001010

The Hamming Distance between the Simhash strings of A and B is 3, which is low enough to indicate that the two strings are extremely similar; indeed they are, as only 3 letters out of 59 differ. The Hamming Distance between A and C is 21, indicating that strings A and C are unlikely to have any similarity.

===Data analysis===

Hair data are recorded by an Inductively Coupled Plasma Mass Spectrometer (ICP-MS) and supplied in the form of Excel tables. Matlab is used to plot the figures that present the element-comparison results clearly; it has good graphing capabilities and can be applied to making engineering plots [3]. In this project the hair-element data are plotted with the Matlab command 'scatter', and the command 'hold on' is then used to place the Somerton man's hair data and the control samples' data on the same figure, making the comparison clear.
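For readers without Matlab, a rough Python/matplotlib sketch of the same overlay idea is given below; the file name, sheet names and column labels are hypothetical placeholders for illustration only, not the report's actual ICP-MS spreadsheets.

<syntaxhighlight lang="python">
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file, sheet and column names, used only to illustrate the overlay idea.
population = pd.read_excel("hair_elements.xlsx", sheet_name="controls")
somerton = pd.read_excel("hair_elements.xlsx", sheet_name="somerton_man")

# Control samples as the background scatter
plt.scatter(population["position_mm"], population["zinc_ppm"],
            s=10, alpha=0.4, label="Control samples")

# Equivalent of Matlab's 'hold on': a second scatter on the same axes
plt.scatter(somerton["position_mm"], somerton["zinc_ppm"],
            s=30, color="red", label="Somerton man")

plt.xlabel("Position along hair (mm)")
plt.ylabel("Element concentration (ppm)")
plt.legend()
plt.show()
</syntaxhighlight>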