BpMatch: an efficient algorithm for segmenting sequences, calculating genomic distance and counting repeats.

Felicioli, C. and Marangoni, Roberto (2008) BpMatch: an efficient algorithm for segmenting sequences, calculating genomic distance and counting repeats. Technical Report del Dipartimento di Informatica. Università di Pisa, Pisa, IT.

Available under License Creative Commons Attribution No Derivatives.

Official URL: http://compass2.di.unipi.it/TR/Files/TR-08-21.pdf....

Abstract

There are several important reasons (biological, evolutionary, clinical, etc.) to give a segment-based description of genomic sequences and, in particular, to detect repeated segments, whether they appear as direct copies or as complemented inverted copies. In some applications, particularly in medical genomics, it is also necessary to count the number of occurrences of a segment. Moreover, by detecting the segments shared by two different sequences, it is possible to define a kind of genomic distance between them. Here we propose BpMatch, an algorithm that, working on a suitably modified suffix-tree data structure, achieves all three goals (identifying repeated segments, including their complemented inverted copies; counting the number of repeats; and calculating genomic distance) quickly and efficiently. BpMatch identifies exact copies (and complemented inverted copies) of a segment. The operator should define a priori the minimum length a string must have in order to be considered a segment, and the minimum number of occurrences, so that only segments occurring more often than this minimum are considered significant. BpMatch is very efficient; we determined the complexity in time to calculate the self-covering of a string, the alphabet dimension. In the worst case, assuming the alphabet dimension is a constant, the time required to calculate the coverage is O. On the average, using the time required to calculate the coverage is only O. It is important to note that this estimate includes the time required to complete all of the tasks: identifying copied segments, localizing them, counting their occurrences, and evaluating the sequence coverage.
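
As an illustration of the repeat-counting task described in the abstract, the following is a minimal sketch only. It is not the BpMatch algorithm and does not use a suffix tree: it counts fixed-length windows naively with a hash table, pooling each window with its complemented inverted copy. The function names and the parameters `min_len` and `min_occ` are hypothetical, chosen here to mirror the two operator-defined thresholds, and the alphabet is assumed to be A, C, G, T.

```python
from collections import Counter

# Assumption: plain DNA alphabet; the report itself is not tied to this table.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complemented inverted copy of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_repeats(seq: str, min_len: int, min_occ: int) -> dict[str, int]:
    """Count windows of length min_len, pooling direct and inverted copies.

    Naive stand-in for illustration: each window is mapped to a canonical
    form (the lexicographically smaller of the window and its reverse
    complement), so both orientations share one counter. Only segments
    occurring more than min_occ times are returned.
    """
    counts: Counter[str] = Counter()
    for i in range(len(seq) - min_len + 1):
        window = seq[i:i + min_len]
        canonical = min(window, reverse_complement(window))
        counts[canonical] += 1
    return {segment: n for segment, n in counts.items() if n > min_occ}


if __name__ == "__main__":
    # "GGTT" is the complemented inverted copy of "AACC",
    # so it is counted together with the two direct copies.
    print(count_repeats("AACCAAGGTTAACC", min_len=4, min_occ=1))
```

This sketch only handles segments of exactly `min_len` characters and does not report positions or compute a coverage; it is meant to make the roles of the two thresholds concrete, nothing more.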

Item Type:	Book
Uncontrolled Keywords:	algorithmic distance, genomic distance, repeats counting, repeats locating
Subjects:	Area01 - Scienze matematiche e informatiche > INF/01 - Informatica
Divisions:	Dipartimenti (until 2012) > DIPARTIMENTO DI INFORMATICA
Depositing User:	dott.ssa Sandra Faita
Date Deposited:	04 Dec 2014 14:19
Last Modified:	04 Dec 2014 14:19
URI:	http://eprints.adm.unipi.it/id/eprint/2214
