Luis Coelho,

D – Departamento de Engenharia Informática, Instituto Superior Técnico, Universidade de Lisboa

Abstract:

The problem we address is text indexing for approximate matching. We consider that we are given a text T which undergoes some preprocessing to generate an index. We can later query this index to identify the places where a string occurs up to a certain number of allowed errors k, where by error we mean the substitution, deletion or replacement of one character (edition or Levenstein distance).

We present a structure for indexing which occupies space O(n log^k n) in the average case, independent of alphabet size, n being the text size. This structure can be used to report the existence of a match with k errors in O (3k mk+1 ) and to report the occurrences in O(3^k m^{k+1} + ed) time, where m is the length of the pattern and where ed the number of matching edit scripts. These bounds are independent of alphabet size. The construction of the structure has time bound by O (k N |S|), where N is the number of nodes in the index and |S| the alphabet size.

 

Date: 2006-Jul-06     Time: 16:30:00     Room: 336


For more information: