==== Preparation Test ====
As before, the preparation test was designed to check the performance of the algorithm. Unlike in the Levenshtein test, grouping the tests by unit length is meaningless here because of the nature of the Simhash algorithm. For details, please refer to section 3.2.2.
The preparation test was based on the UDHR. The first test compared 50 units extracted from the English version of the UDHR with the whole text of the UDHR (also formatted as 2-grams) in English and in other languages. Each unit was first converted to its corresponding Simhash string, and the Hamming distance between the two Simhash strings was then calculated. The second test used the same method and layout to compare 50 units with the whole text of the UDHR in the same language. In short, test 1 is a cross-language test and test 2 is a same-language test. The results are presented in the two box plots below (the x-axis categories are not in the same order in the two plots, so please compare boxes by the column names under each box):
[[File:Group3.jpg|thumb|600px|center|Simhash Preparation Test Result]]
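
The procedure described above (2-grams, Simhash, then Hamming distance) can be illustrated with a minimal sketch. It is not the code used in the project: the function names, the use of character-level 2-grams, the 64-bit fingerprint length, and MD5 as the per-feature hash are all assumptions made for illustration, and the fingerprint is kept as an integer rather than a bit string.

```python
import hashlib


def char_bigrams(text):
    """Split text into overlapping 2-character grams (assumed 2-gram format)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def simhash(features, bits=64):
    """Return a Simhash fingerprint of the given features as an integer.

    Each feature is hashed (MD5 here, an illustrative choice); every bit of
    the hash votes +1 or -1 for the corresponding fingerprint position.
    """
    weights = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1

    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:  # positions with a positive vote total become 1-bits
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Usage: compare one unit against a longer reference text.
unit = "all human beings are born free"
reference = "all human beings are born free and equal in dignity and rights"
distance = hamming_distance(simhash(char_bigrams(unit)),
                            simhash(char_bigrams(reference)))
```

A smaller distance means the two fingerprints agree in more bit positions. With 64-bit fingerprints the distance always lies between 0 and 64.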

Within the cross-language group (Figure 20), the “English versus English” data set (the more transparent box located at the bottom-right of the first figure) is considerably lower than the other cross-language data sets (the colored boxes in the first figure).

Comparing across the two figures (Figures 20 and 21), the cross-language group has generally higher results than the same-language group (the color schemes in the two figures are not the same, so please refer to the column names when making comparisons). The median values of the boxes in the same-language group are all significantly lower than those in the cross-language group. Every third quartile (Q3) in the same-language group is lower than the first quartile (Q1) of the corresponding language in the cross-language group.
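
The quartile comparison above can be expressed as a simple check. The following is only a sketch: the helper names are hypothetical, the "inclusive" quantile method is an assumption (the plotting tool used for the figures may compute quartiles differently), and the sample numbers are made up for illustration and are not results from the test.

```python
import statistics


def quartiles(values):
    """Return (Q1, median, Q3) of a list of Hamming distances."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


def boxes_separated(same_language, cross_language):
    """True if Q3 of the same-language box lies below Q1 of the cross-language box."""
    _, _, same_q3 = quartiles(same_language)
    cross_q1, _, _ = quartiles(cross_language)
    return same_q3 < cross_q1


# Illustrative values only, not data from the thesis.
same_sample = [1, 2, 3, 4, 5]
cross_sample = [10, 11, 12, 13, 14]
separated = boxes_separated(same_sample, cross_sample)
```

When this check holds for a language, the interquartile ranges of its two boxes do not overlap at all.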

In addition, the data sets in the figures above are less biased than those in the Levenshtein test, and the distribution of data in each box plot is quite compact. The behaviour of the Simhash algorithm also does not depend strongly on which particular language is tested. These facts give extra credibility to the Simhash algorithm.
Based on the aforementioned observations, it is reasonable to conclude that the Simhash algorithm has an excellent ability to distinguish between different languages.