Final Report/Thesis 2018

===DNA Degradation===
Degradation in this case refers to the random removal of DNA sequences from the raw data. SNPs are removed from the DNA at different percentages, and the results can then be used to determine how much DNA is required before the sample becomes unidentifiable. MATLAB was used to remove different percentages of SNPs from the sample. Figure 27 shows the code that was used to remove the SNPs from the raw data.
[[File:soFigure27.jpg|thumb|500px|center|Figure 27. Matlab code to remove SNPs]]
The code works as follows. It first reads the raw data, which is a text file. A portion of the raw data is shown in Figure 28 and is used in the demonstration below.
[[File:soFigure28.jpg|thumb|500px|center|Figure 28. A portion of the raw data]]
The raw data has 638468 lines. The code removes a chosen percentage of those lines at random (as shown in Figure 27, it is set to remove 90% of the lines). The next part of the code deletes the blank lines left behind by the removed lines, because the data cannot be imported into GEDmatch Genesis while they are present. A demonstration of how the code works is shown in Figure 29.
[[File:soFigure29.jpg|thumb|500px|center|Figure 29. Demonstration of removing lines from the raw data]]
In the demonstration, 10 lines are present in the raw data (left image), 2 lines are then removed (middle image), and finally the blank lines left by the removal are deleted, leaving 8 lines in total (right image).
Lastly, the code writes the result back to a text file, which can be imported into GEDmatch Genesis. SNPs were removed at levels from 10% to 90%, and the whole experiment was repeated five times so that the results could be averaged and the influence of outliers reduced.
Two different tasks were completed using the degraded DNA. One task involved heritage analysis and the other involved comparing the DNA against the database.
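The removal procedure described above can be sketched in a few lines. The following is a minimal Python illustration, not the Matlab code shown in Figure 27. The function names <code>remove_random_lines</code> and <code>degrade_file</code>, the <code>seed</code> parameter and the demo line labels are hypothetical, and the sketch assumes that every line of the text file is one SNP record.

```python
import random


def remove_random_lines(lines, fraction, seed=None):
    """Return `lines` with a random `fraction` of entries removed.

    Surviving lines keep their original order and no blank placeholders
    are left behind, which mirrors the "delete the blank lines" step.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so that a run can be repeated
    n_remove = round(len(lines) * fraction)
    # Choose the indices to drop, without replacement.
    dropped = set(rng.sample(range(len(lines)), n_remove))
    return [line for i, line in enumerate(lines) if i not in dropped]


def degrade_file(in_path, out_path, fraction, seed=None):
    """Read a raw-data text file, remove lines, and write a new text file."""
    with open(in_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    kept = remove_random_lines(lines, fraction, seed)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(kept) + "\n")


# Mirrors Figure 29: 10 lines in, 20% removed, 8 lines out.
demo = [f"snp_line_{i}" for i in range(1, 11)]
kept = remove_random_lines(demo, 0.20, seed=1)
print(len(kept))  # 8
```

If the raw-data file contains header or comment lines, they would need to be set aside before sampling and written back afterwards; the sketch does not handle that case.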

====Heritage====
The first task was to investigate the heritage of the DNA sample. Several genetic ancestry projects were available, so each was researched to find the one likely to give the most accurate results. Eurogenes seemed the most appropriate, considering that the sample is of European descent. The next step was to select the model used to calculate the heritage of the DNA. After further research, Eurogenes K13 was chosen, as this model is best suited to samples with mixed heritage.
The heritage results for the original case (no SNPs removed) are shown in Figure 30. The sample has strong heritage in the North Atlantic, Baltic and West Mediterranean regions.
[[File:soFigure30.jpg|thumb|500px|center|Figure 30. Heritage results]]
The analysis considered only the North Atlantic and Baltic regions, as these had the highest percentages, 29.13% and 42.11% respectively.
The other cases, with 10% to 90% of SNPs removed, were then completed. A graph was produced to visualise the effect of removing the SNPs, which can be seen in Figure 31.
[[File:soFigure31.jpg|thumb|500px|center|Figure 31. Heritage test, with SNPs removed]]
The x-axis represents the percentage of SNPs removed from the sample and the y-axis represents the percentage of the sample attributed to each heritage. The graph shows that the results are relatively steady up to about 60% removal and then fluctuate from 70% onwards. This suggests that removing around 70% or more of the SNPs leads to inaccurate results. A single run is not enough to support this conclusion, so five more tests were completed to see whether the trend was similar in each case. Figure 32 shows the average of the heritage tests completed.
[[File:soFigure32.jpg|thumb|500px|center|Figure 32. Average of the Heritage Test]]
The average results are all relatively linear, except for some points: North Atlantic at 40%, 80% and 90%, and Baltic at 80% and 90%. The error bars, however, get larger as the percentage of SNPs removed increases. This indicates that the DNA sample starts to lose its structure as SNPs are removed.
From 10% to 40% of SNPs removed, the standard deviation increases to approximately 1, which shows that the DNA is still robust. From 50% to 90% of SNPs removed, the results start to vary significantly, as shown by the large error bars and the much larger standard deviation.
From these results it can be concluded that the DNA sample is robust until about 50% of the SNPs are removed, and that beyond this point the DNA becomes unidentifiable.
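The averages and error bars discussed above come from repeating each removal level several times. A minimal Python sketch of that calculation is given below. It assumes the error bars are the sample standard deviation across repeated runs, which the report does not state explicitly. The function name <code>summarise_runs</code> is hypothetical and the numbers are placeholders, not the results from this project.

```python
from statistics import mean, stdev


def summarise_runs(runs):
    """Map each removal percentage to (mean, sample standard deviation)."""
    return {pct: (mean(values), stdev(values)) for pct, values in runs.items()}


# Placeholder values for illustration only; these are NOT the thesis results.
# Each list holds one heritage percentage per repeated run.
example_runs = {
    10: [40.0, 41.0, 42.0, 43.0, 44.0],
    90: [30.0, 36.0, 42.0, 48.0, 54.0],
}

summary = summarise_runs(example_runs)
for pct, (avg, sd) in summary.items():
    print(f"{pct}% removed: mean={avg:.2f}, sd={sd:.2f}")
```

In this made-up example both removal levels have the same mean, but the larger spread at 90% would show up as a longer error bar, which is the pattern described for Figure 32.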

====Database====
The second task was to compare the DNA sample against a database of other people's DNA samples. Using the database, we were able to efficiently identify how closely related the DNA sample was to those of other people. Refer back to Figure 26 to see how the database displays the results.
The match list for the original sample was compared with the match lists for the cases with 10% to 90% of SNPs removed. The objective was to find false positives and false negatives. Here, a false positive is a kit that appears in the match list of a degraded sample but not in the match list of the original sample, and a false negative is a kit that appears in the match list of the original sample but is missing from that of the degraded sample. An example is given below for clarity.
Original: A B C D E
10% of SNP removed: C D G H I
False positives: G H I; false negatives: A B E
As there are thousands of kits to compare, a sample of 30 kits was used for the comparison. Figure 33 shows the false positives and false negatives for different percentages of SNPs removed.
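The comparison in this example amounts to two set differences. The following short Python sketch reproduces it; the function name <code>compare_matches</code> is hypothetical and the kit labels are the letters from the example above, not real kit numbers.

```python
def compare_matches(original, degraded):
    """Compare two match lists and classify the kits."""
    original_set, degraded_set = set(original), set(degraded)
    return {
        # Kits found in both lists.
        "matches": sorted(original_set & degraded_set),
        # Kits that appear only after SNP removal.
        "false_positives": sorted(degraded_set - original_set),
        # Kits that were in the original list but have disappeared.
        "false_negatives": sorted(original_set - degraded_set),
    }


result = compare_matches(
    original=["A", "B", "C", "D", "E"],
    degraded=["C", "D", "G", "H", "I"],
)
print(result["false_positives"])  # ['G', 'H', 'I']
print(result["false_negatives"])  # ['A', 'B', 'E']
```

When both lists have the same length, every kit that drops out is replaced by a new one, so the two counts are always equal.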
[[File:soFigure33.jpg|thumb|500px|center|Figure 33. False positives and false negatives]]
The x-axis represents the percentage of SNPs removed from the sample and the y-axis represents the number of false positives or false negatives found. In this case the number of false positives equalled the number of false negatives at every level. This is expected when both match lists contain the same number of kits (30 here), because every kit that disappears from the list is replaced by a new one.
The counts are high throughout: even at 10% removal there were only 6 matches with the original case. This indicates that altering the DNA by even a small amount can significantly change the matches returned. As the removal approaches 50%, there are more false positives and false negatives and fewer matches to the original case. At 50% or more of SNPs removed there are 30 false positives and 30 false negatives, which means that none of the kits matched those of the original case.