A Real-Time Algorithm for Searching for Substrings of a Pattern in Online Text Stream: An Alternative to Suffix Trees and Finite Automaton
DOI:
https://doi.org/10.14738/tecs.114.15065Keywords:
Approximate String Matching, Similarity Search, Substring SearchAbstract
In the field of similarity search, very little effort has been made to process the text without using suffix trees or automata. Most of the algorithms developed in the past use data structures and methods outlined above to accelerate the search process. However, their costly maintenance has always been a cause of concern. Let, T be a text of size N, and P be a pattern of size M. We present highly practical algorithms to process a real-time stream of text characters. The algorithm reports all substrings of pattern P in the text stream T as they appear in the stream. This is highly useful in detecting patterns in the data stream in real-time, for example, viruses and malware can be detected as they appear in the data stream. Additionally, the algorithm can be used to find the longest common substring(s) between two given strings. We achieved the stated goals without using any sophisticated data structure such as a suffix tree or finite automaton. For an alphabet size σ the algorithm runs in O(M) space, and O (MN/σ) expected time, which simplifies to O(N) for M ≤ σ. We also provide experimental results to support our claim.
Downloads
Published
How to Cite
Issue
Section
License
Copyright (c) 2023 Vinod Prasad
This work is licensed under a Creative Commons Attribution 4.0 International License.