Complete computational sequence characterization of mobile element variations in the human genome using meta-personal genome data
While many methods have been developed to detect genome sequence variations such as single nucleotide polymorphisms (SNPs) and small indels, comparatively few exist for finding structural variants (SVs), and in particular mobile element insertions (MEIs). Moreover, almost all of these methods detect only the breakpoints of an SV, sometimes only approximately, and do not provide the complete sequence of the variant. The main objective of our research is to develop a set of computer algorithms that provide complete sequence characterization of insertional structural variants in human genomes via local de novo sequence assembly or progressive assembly using discordant and concordant read pairs and split reads. An essential component of our approach is the use of all personal genome data available in the public domain, rather than the standard practice of using a single set of personal genome sequences. The developed tool is the first system that provides full sequence characterization of SVs. Overall, the characterization success rate for Alu is 75.03%, with the mean number of discordant and split reads higher than 94. For SVA, it is 71.43% with a threshold of 363 reads, and for L1 the values are 77.78% and 355 reads, respectively. The results show that SV characterization depends on allele frequency and is influenced by the repetitiveness of the flanking regions. Addressing these two factors is therefore key to further improvement.
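The read categories named in the abstract (concordant pairs, discordant pairs, and split reads) and the idea of pooling evidence across many personal genomes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `ReadPair` record, the `classify` and `pool_support` functions, and the `max_insert`, `min_clip`, and `window` values are all hypothetical names and parameters chosen for illustration, and a real pipeline would read alignments from BAM files rather than from in-memory records.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Hypothetical, simplified record of one aligned read pair."""
    sample: str         # personal genome the pair came from
    chrom1: str         # chromosome of the first mate
    pos1: int           # alignment start of the first mate
    chrom2: str         # chromosome of the second mate
    pos2: int           # alignment start of the second mate
    clipped_bases: int  # soft-clipped bases in either mate


def classify(pair, max_insert=600, min_clip=20):
    """Label a pair as 'split', 'discordant', or 'concordant'.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    if pair.clipped_bases >= min_clip:
        return "split"
    if pair.chrom1 != pair.chrom2 or abs(pair.pos2 - pair.pos1) > max_insert:
        return "discordant"
    return "concordant"


def pool_support(pairs, chrom, pos, window=500):
    """Pool read evidence from all samples around one candidate insertion.

    Returns the per-category counts and the set of samples contributing
    discordant or split reads, i.e. the reads that hint at an insertion.
    """
    counts = Counter()
    supporting_samples = set()
    for p in pairs:
        near = (p.chrom1 == chrom and abs(p.pos1 - pos) <= window) or (
            p.chrom2 == chrom and abs(p.pos2 - pos) <= window
        )
        if not near:
            continue
        kind = classify(p)
        counts[kind] += 1
        if kind != "concordant":
            supporting_samples.add(p.sample)
    return counts, supporting_samples
```

Calling `pool_support` with read pairs drawn from several samples gives the pooled number of discordant and split reads at a locus. That pooled count is the kind of quantity that could then be compared against a minimum read requirement before attempting local assembly.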