String searching algorithms are an important class of string algorithms that try to find a place where one or several strings (also called patterns) occur within a larger string or text. Here is a bibliography that can help you get started in this area (see also String Searching Algorithms by Graham A. Stephen).
Exact Text Searching
Aho A.V., Algorithms for finding patterns in strings, Chapter 5 (pp. 255-300) of Leeuwen J. van (ed.) Handbook of Theoretical Computer Science, Elsevier Science Publishers, Amsterdam.
Abrahamson K., Generalized string matching, SIAM Journal on Computing, 16(6), 1039-1051, 1987.
Amir A., Landau G.M., and Vishkin U., Efficient pattern matching with scaling, Journal of Algorithms, 13(1), 2-32, 1992.
Apostolico A., and Giancarlo R., The Boyer-Moore-Galil string searching strategies revisited, SIAM Journal on Computing, 15(1), 98-105, 1986.
Baeza-Yates R., and Gonnet G.H., A new approach to text searching, Communications of the ACM, 35(10), 74-82, 1992.
Baeza-Yates R., and Regnier M., Average running time of the Boyer-Moore-Horspool algorithm, Theoretical Computer Science, 92(1), 19-31, 1992.
Baeza-Yates R., Choffrut C., and Gonnet G., On Boyer Moore Automata, Algorithmica, 12(4-5), 268-292, 1994.
Baeza-Yates R., and Fuentes L., A framework to animate string algorithms, Information Processing Letters, 59(5), 241-244, 1996.
Baeza-Yates R., Improved string matching, Software-Practice and Experience, 19(3), 257-271, 1989.
Baeza-Yates R.A., Algorithms for String Searching: A Survey, ACM SIGIR Forum, 23(3-4), 34-58, 1989.
Baeza-Yates R., Text Retrieval: Theory and Practice, in Proc. of the 12th IFIP World Computer Congress, 465-476, (Madrid, Spain), North-Holland, 1992.
Baeza-Yates R., A unified view of string matching algorithms, Theory and Practice of Informatics, 23rd Seminar, 1-15, 1996.
Blumer A., Blumer J., Haussler D., Ehrenfeucht A., Chen M.T., and Seiferas J., The smallest automaton recognizing the subwords of a text, Theoretical Computer Science, 40(1), 31-55, 1985.
Boyer R.S., and Moore J.S., A fast string searching algorithm, Communications of the ACM, 20(10), 762-772, 1977.
Breslauer D., Colussi L., and Toniolo L., Tight comparison bounds for the string prefix matching problem, Information Processing Letters, 47(1), 51-57, 1993.
Breslauer D., Saving comparisons in the Crochemore-Perrin string matching, Theoretical Computer Science, 158(1-2), 177-192, 1996.
Bruyere V., Baeza-Yates R., Dergrange O., and Scheihing R., About the size of Boyer-Moore automata, in Proc. of the 3rd South American Workshop on String Processing, 31-46, Carleton University Press, 1996.
Charras C., and Lecroq T., Exact string matching algorithms, Technical Report, 1997.
Cole R., Hariharan R., Zwick U., and Paterson M.S., Tighter lower bounds on the exact complexity of string matching, SIAM Journal on Computing, 24(1), 30-45, 1995.
Cole R., Tight bounds on the complexity of the Boyer-Moore pattern matching, SIAM Journal on Computing, 23(5), 1075-1091, 1994.
Colussi L., Correctness and Efficiency of pattern matching algorithm, Information and Computation, 95(2), 225-251, 1991.
Colussi L., Fastest pattern matching in strings, Journal of Algorithms, 16(2), 163-189, 1994.
Crochemore M., and Rytter W., Text Algorithms, Oxford University Press, 1994.
Crochemore M., Czumaj A., Gasieniec L., Jarominek S., Lecroq T., Plandowski W., and Rytter W., Speeding Up Two String Matching Algorithms, Algorithmica, 12(4-5), 247-267, 1994.
Crochemore M., and Lecroq T., Tight bounds on the complexity of the Apostolico-Giancarlo algorithm, Information Processing Letters, 63(4), 195-203, 1997.
Crochemore M., and Lecroq T., Pattern matching and text compression algorithms, in The Computer Science and Engineering Handbook, Tucker A.B., ed., CRC Press, Boca Raton, Chapter 8, 162-202, 1997.
Crochemore M., and Hancart C., Automata for Matching Patterns, in Handbook of Formal Languages, Rozenberg G., and Salomaa A., eds, 2, 399-462, Springer-Verlag, 1997.
Crochemore M., and Hancart C., Pattern matching in strings, in Algorithms and Theory of Computation Handbook, Atallah M.J., ed., CRC Press, Boca Raton, 1998.
Crochemore M., String matching on ordered alphabets, Theoretical Computer Science, 92(1), 33-47, 1992.
Crochemore M., Off-line serial exact string searching, in Pattern Matching Algorithms, Apostolico A., Z. Galil, ed., Oxford University Press, New York, Chapter 1, 1-53, 1997.
Davies G., and Bowsher S., Algorithms for pattern matching, Software-Practice and Experience, 16(6), 575-601, 1986.
Ferragina P., and Grossi R., Optimal On-line search and sublinear time update in strings, SIAM Journal on Computing, 27(3), 713-736, 1998.
Galil Z., and Giancarlo R., On the exact complexity of string matching: Lower Bounds, SIAM Journal on Computing, 20(6), 1008-1020, 1991.
Galil Z., and Giancarlo R., On the exact complexity of string matching: Upper Bound, SIAM Journal on Computing, 21(3), 407-437, 1992.
Galil Z., On improving the worst case running time of the Boyer-Moore algorithm, Communications of the ACM, 22(9), 505-508, 1979.
Gasieniec L., Plandowski W., and Rytter W., The zooming method: a recursive approach to time-space algorithms, Theoretical Computer Science, 147(1-2), 19-30, 1995.
Gasieniec L., Plandowski W., and Rytter W., Sequential sampling: a new approach to constant space pattern matching, in Proc. of the 6th Annual Symposium on Combinatorial Pattern Matching, LNCS 937, 78-89, Springer-Verlag, Berlin, 1995.
Gonnet G.H., and Baeza-Yates R.A., An analysis of the Karp-Rabin string matching algorithm, Information Processing Letters, 34(5), 271-274, 1990.
Gonnet G.H., and Baeza-Yates R., Handbook of Algorithms and Data Structures in Pascal and C, 2nd edition, Addison-Wesley, Wokingham, 251-288, 1991.
Gusfield D., Algorithms on strings, trees and sequences, Cambridge University Press, 1997.
Hancart C., On Simon's string searching algorithm, Information Processing Letters, 47(2), 95-99, 1993.
Horspool R.N., Practical fast searching in strings, Software-Practice and Experience, 10(6), 501-506, 1980.
Hume A., and Sunday D., Fast string searching, Software-Practice and Experience, 21(11), 1221-1248, 1991.
Knuth D.E., Morris J.H., and Pratt V.R., Fast pattern matching in strings, SIAM Journal on Computing, 6(2), 323-350, 1977.
Lecroq T., A variation on the Boyer-Moore algorithm, Theoretical Computer Science, 92(1), 119-144, 1992.
Lecroq T., Experimental results on string matching algorithms, Software-Practice and Experience, 25(7), 727-765, 1995.
Lecroq T., Experiments on string matching in memory structures, Software-Practice and Experience, 28(5), 561-568, 1998.
Liu Z., Du X., and Ishii N., An improved adaptive string searching algorithm, Software-Practice and Experience, 28(2), 191-198, 1998.
Manolopoulos Y., and Faloutsos C., Experimenting with pattern matching algorithms, Information Sciences, 90(1-4), 75-89, 1996.
Perleberg C., Single character searching methods and the shift-or pattern matching algorithm, Information Processing Letters, 50(5), 269-275, 1994.
Raita T., Tuning the Boyer-Moore-Horspool string searching algorithm, Software-Practice and Experience, 22(10), 879-884, 1992.
Reingold E., Urban K.J., and Gries D., K-M-P string matching revisited, Information Processing Letters, 64(5), 217-223, 1997.
Sedgewick R., Algorithms, 2nd edition, Addison Wesley, Reading Mass, 277-303, 1988.
Smit G. De V., A Comparison of Three String Matching Algorithms, Software-Practice and Experience, 12(1), 57-66, 1982.
Smith P., Experiments with a very fast substring search algorithm, Software-Practice and Experience, 21(10), 1065-1074, 1991.
Smith P., On tuning the Boyer-Moore-Horspool string searching algorithm, Software-Practice and Experience, 24(4), 425-436, 1994.
Stephen G.A., String Search, Technical Report, School of Electronic Engineering Science, 1992.
Sunday D., A very fast substring search algorithm, Communications of the ACM, 33(8), 132-142, 1990.
Tarhio J., and Peltola H., String matching in the DNA alphabet, Software-Practice and Experience, 27(7), 851-861, 1997.
Watson B., The performance of single and multiple keyword pattern matching algorithms, in Proc. of the 3rd South American Workshop on String Processing, 280-294, Carleton University Press, 1996.
Watson B., SPARE Parts: A C++ toolkit for string pattern recognition, in Proc. of the Second Prague Stringology Club Workshop, 47-60, 1997.
Wu S., and Manber U., Fast text searching allowing errors, Communications of the ACM, 35(10), 83-91, 1992.
Approximate Text Searching
Aho A., Corasick M., Efficient string matching: an aid to bibliographic search, Communications of the ACM, 18(6), 333-340, 1975.
Arlazarov V.L., Dinic E.A., Kronrod M.A., Faradzev I.A., On economic construction of the transitive closure of a directed graph, Dokl. Akad. Nauk SSSR, 194, 487-488, 1970 (in Russian). English translation in Soviet Math. Dokl., 11,1209-1210, 1975.
Aho A.V., Algorithms for finding patterns in strings, Chapter 5 (pp. 255-300) of Leeuwen J. van (ed.) Handbook of Theoretical Computer Science, Elsevier Science Publishers, Amsterdam.
Aoe J., Computer Algorithms - String Pattern Matching Strategies, IEEE Computer Society Press, Los Alamitos, California, 1994.
Blumer A., Blumer J., Haussler D., Ehrenfeucht A., Chen M.T., Seiferas J., The smallest automaton recognizing the subwords of a text, Theoretical Computer Science, 40, 31-55, 1985.
Baeza-Yates R.A., Text Retrieval: Theory and practice, in Proc. 12th IFIP World Computer Congress, (Madrid, Spain), 1992.
Baeza-Yates R.A., A unified view of string matching algorithms, In SOFSEM 96: Theory and Practice of Informatics, 1-15, Springer-Verlag, 1996.
Baeza-Yates R.A., Gonnet G.H., A new approach to text searching, Communications of the ACM, 35(10), 74-82, 1992.
Baeza Yates R.A., Gonnet G., Fast string matching with mismatches, Information and Computation, 108(2), 187-199, 1994.
Baeza-Yates R.A., Navarro G., A fast heuristic for approximate string matching, in Proc WSP 96, 47-63, Carleton University Press, 1996.
Baeza-Yates R.A., Navarro G., A faster algorithm for approximate string matching, In Proc Combinatorial Pattern Matching 96, 1-23, Springer-Verlag, 1996.
Baeza-Yates R.A., Perleberg C.H., Fast and practical approximate pattern matching, in Proc Combinatorial Pattern Matching, CPM 92, 185-192, 1992.
Baeza-Yates R.A., Perleberg C.H., Fast and practical approximate string matching, Information Processing Letters, 59, 21-27, 1996.
Chang W.I., Lampe J., Theoretical and Empirical Comparisons of Approximate String Matching Algorithms, In Proc. 3rd Combinatorial Pattern Matching, 175-184, University of Arizona, USA, 1992.
Chang W., Lawler E., Sublinear approximate string matching and biological applications, Algorithmica, 12(4/5), 327-344, 1994.
Chang W, Marr T., Approximate string matching and local similarity, In Proc Combinatorial Pattern Matching 94, 259-273, Springer-Verlag, 1994.
Crochemore M., String matching with constraints, in Proc. MFCS’ 88, 324, 44-58, 1988.
Commentz-Walter B., A string matching algorithm fast on the average, in Proc. ICALP 79, 118-132, Springer-Verlag, 1979.
Dermouche A., A fast algorithm for string matching with mismatches, Information Processing Letters, 55, 105-110, 1995.
Galil Z., Giancarlo R., Improved string matching with k mismatches, Sigact News, 17, 52-54, 1986.
Galil Z., Giancarlo R., Data Structures and Algorithms for Approximate String Matching, Journal of Complexity, 4, 33-72, 1988.
Giegerich R., Kurtz S., Hischke F., Ohlebusch E., A general technique to improve filter algorithms for approximate string matching, In Proc WSP 97, 38-52, Carleton University Press, 1997.
Grossi R., Luccio F., Simple and efficient string matching with k mismatches, Information Processing Letters, 33, 113-120, 1989.
Galil Z., Park K., An improved Algorithm for Approximate String Matching, SIAM Journal of Computing, 19(6), 989-999, 1990.
Hall P., Dowling G., Approximate string matching, ACM Computing Surveys, 12(4), 381-402, 1980.
Harel D., Tarjan R.E., Fast algorithms for finding nearest common ancestors, SIAM Journal on Computing, 13(2), 338-355, 1984.
Jokinen P., Tarhio J., Ukkonen E., A Comparison of Approximate String Matching Algorithms, Software-Practice and Experience, 26(12), 1439-1458, 1996.
Kurtz S., Approximate string searching under weighted edit distance, in Proc. WSP 96, 156-170, Carleton University Press, 1996.
Landau G.M., Vishkin U., Efficient String Matching with k Mismatches, Theoretical Computer Science, 43, 239-249, 1986.
Landau G.M., Vishkin U., Fast String Matching with k Differences, Journal of Computer and System Sciences, 37, 63-78, 1988.
Landau G.M., Vishkin U., Fast Parallel and Serial Approximate String Matching, Journal of Algorithms, 10, 157-169, 1989.
McCreight E., A space-economical suffix tree construction algorithm, Journal of the ACM, 23(2), 262-272, 1976.
Michailidis P., Margaritis K., String Matching Algorithms, Technical Report, Dept. of Applied Informatics, University of Macedonia, 1999 (in Greek).
Masek W.J., Paterson M.S., A Faster Algorithm Computing String Edit Distances, Journal of Computer and System Sciences, 20, 18-31, 1980.
Myers G., A fast bit-vector algorithm for approximate pattern matching based on dynamic programming. In Proc Combinatorial Pattern Matching 98, 1-13, Springer-Verlag, 1998.
Navarro G., Multiple approximate string matching by counting, in Proc WSP 97, 125-139, Carleton University Press, 1997.
Navarro G., A partial deterministic automaton for approximate string matching, in Proc WSP 97, 112-124, Carleton University Press, 1997.
Navarro G., Approximate Text Searching, Ph.D. Thesis, University of Chile, Dept. of Computer Science, 1998.
Navarro G., Raffinot M., A bit-parallel approach to suffix automata: Fast extended string matching, in Proc Combinatorial Pattern Matching 98, 14-33, Springer-Verlag, 1998.
Sellers P.H., The Theory and Computation of Evolutionary Distance: Pattern Recognition, Journal of Algorithms, 1, 359-373, 1980.
Sutinen E., Tarhio J., On using q-gram locations in approximate string matching, In Proc ESA 95, 327-340, Springer-Verlag, 1995.
Stephen G. A., String Searching Algorithms, World Scientific Press, 1994.
Takaoka T., Approximate pattern matching with samples, In Proc. ISAAC 94, 234-242, Springer-Verlag, 1994.
Tarhio J., Ukkonen E., Approximate Boyer-Moore String Matching, SIAM Journal on Computing, 22(2), 243-260, 1993.
Ukkonen E., Algorithms for Approximate String Matching, Information and Control, 64, 100-118, 1985.
Ukkonen E., Finding Approximate Patterns in Strings, Journal of Algorithms, 6, 132-137, 1985.
Ukkonen E., Approximate string matching with q-grams and maximal matches, Theoretical Computer Science, 92(1), 191-211, 1992.
Ukkonen E., Wood D., Approximate String Matching with Suffix Automata, Algorithmica, 10, 353-364, 1993.
Weiner P., Linear pattern matching algorithms, in Proc. 14th IEEE Symposium on Switching and Automata Theory, 1-11, 1973.
Wagner R.A., Fischer M.J., The String to String Correction Problem, Journal of the Association for Computing Machinery, 21(1), 168-173, 1974.
Wu S., Manber U., Fast text searching allowing errors, Communications of the ACM, 35(10), 83-91, 1992.
Wu S., Manber U., Myers G., A Subquadratic algorithm for approximate limited expression matching, Algorithmica, 15, 50-67, 1996.
Wright A., Approximate string matching using within-word parallelism, Software-Practice and Experience, 24(4), 337-362, 1994.
Mar 08, 2011 | Slashdot
alphadogg writes "Just because you send an email anonymously doesn't mean people can't figure out who you are anymore. A new technique developed by researchers at Concordia University in Quebec could be used to unmask would-be anonymous emailers by sniffing out patterns in their writing style from use of all lowercase letters to common typos. Their research, published in the journal Digital Investigation, describes techniques that could be used to serve up evidence in court, giving law enforcement more detailed information than a simple IP address can produce."
Slashdot
Zothecula writes "We've seen it in numerous TV shows and movies – the witness to a crime looks through a book of mug shots, then works with a police sketch artist to come up with a likeness of the nasty person they saw. After looking through hundreds of mug shots, however, it's possible that the tired-brained witness could look right at a photo of the guilty party and not recognize them. It's also possible that there is a mug shot of the criminal on a database somewhere out there, but that this particular witness will never see it. A computer system being pioneered at Michigan State University, however, could be the solution to such problems – it automatically matches faces in police sketches to mug shots."
Dr. Dobb's Journal August, 1996
I think that I shall never see
A poem lovely as a tree.
Poems are made by fools like me,
But only God can make a tree.
Joyce Kilmer

A tree's a tree. How many more do you need to look at?
Ronald Reagan
Matching string sequences is a problem that computer programmers face on a regular basis. Some programming tasks, such as data compression or DNA sequencing, can benefit enormously from improvements in string matching algorithms. This article discusses a relatively unknown data structure, the suffix tree, and shows how its characteristics can be used to attack difficult string matching problems.
The problem
Imagine that you've just been hired as a programmer working on a DNA sequencing project. Researchers are busy slicing and dicing viral genetic material, producing fragmented sequences of nucleotides. They send these sequences to your server, which is then expected to locate the sequences in a database of genomes. The genome for a given virus can have hundreds of thousands of nucleotide bases, and you have hundreds of viruses in your database. You are expected to implement this as a client/server project that gives real-time feedback to the impatient Ph.D.s. What's the best way to go about it?
It is obvious at this point that a brute force string search is going to be terribly inefficient. This type of search would require you to perform a string comparison at every single nucleotide in every genome in your database. Testing a long fragment that has a high hit rate of partial matches would make your client/server system look like an antique batch processing machine. Your challenge is to come up with an efficient string matching solution.
The intuitive solution
Since the database that you are testing against is invariant, preprocessing it to simplify the search seems like a good idea. One preprocessing approach is to build a search trie. For searching through input text, a straightforward approach to a search trie yields a thing called a suffix trie. (The suffix trie is just one step away from my final destination, the suffix tree.) A trie is a type of tree that has N possible branches from each node, where N is the number of characters in the alphabet. The word 'suffix' is used in this case to refer to the fact that the trie contains all of the suffixes of a given block of text (perhaps a viral genome.)
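To make the idea concrete, here is a minimal sketch (ours, not from the article) of a naive suffix trie over the DNA alphabet: every suffix of the text is inserted character by character, and a fragment occurs in the text exactly when it can be walked from the root. Built this way the trie costs O(n^2) time and space, which is the blow-up that the more compact suffix tree avoids.

/* Naive suffix trie sketch (illustrative only, not the article's code). */
#include <stdlib.h>
#include <string.h>

#define ALPHABET 4                      /* A, C, G, T */

struct trie_node {
    struct trie_node *child[ALPHABET];
};

static int code(char c)                 /* map a nucleotide to 0..3 */
{
    switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    default:  return 3;                 /* 'T' */
    }
}

static struct trie_node *new_node(void)
{
    return calloc(1, sizeof(struct trie_node));
}

/* Insert every suffix of text into the trie rooted at root. */
static void build_suffix_trie(struct trie_node *root, const char *text)
{
    size_t i, j, n = strlen(text);
    for (i = 0; i < n; i++) {                   /* suffix starting at i */
        struct trie_node *p = root;
        for (j = i; j < n; j++) {
            int c = code(text[j]);
            if (p->child[c] == NULL)
                p->child[c] = new_node();
            p = p->child[c];
        }
    }
}

/* A fragment occurs somewhere in the text iff it can be walked from the root. */
static int contains(const struct trie_node *root, const char *fragment)
{
    for (; *fragment; fragment++) {
        root = root->child[code(*fragment)];
        if (root == NULL) return 0;
    }
    return 1;
}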
A different approach is hashing: the following C fragment (apparently the Karp-Rabin style search from the Gonnet and Baeza-Yates handbook cited in the bibliography above) compares a rolling hash of the current text window against the hash of the pattern, and verifies with strncmp only when the hashes agree.

/* Karp-Rabin style search using a rolling hash with base B. */
#include <string.h>

#define B    131
#define EOS  '\0'
#define TRUE 1

char *search( pat, text )
char *pat, *text;
{
    int hpat, htext, Bm, j, m;

    if( pat[0]==EOS ) return( text );
    Bm = 1;
    hpat = htext = 0;
    /* hash the pattern and the first m text characters; Bm becomes B^m */
    for( m=0; text[m] != EOS && pat[m] != EOS; m++ ) {
        Bm *= B;
        hpat = hpat*B + pat[m];
        htext = htext*B + text[m];
    }
    if( text[m]==EOS && pat[m]!=EOS ) return( NULL );
    for( j=m; TRUE; j++ ) {
        /* equal hashes: confirm with a real comparison */
        if( hpat==htext && strncmp(text+j-m,pat,m)==0 )
            return( text+j-m );
        if( text[j]==EOS ) return( NULL );
        /* slide the window: drop text[j-m], append text[j] */
        htext = htext*B - text[j-m]*Bm + text[j];
    }
}
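A hypothetical driver (ours, not part of the original fragment; it assumes it is compiled together with the search() definition above) showing how the routine might be called:

#include <stdio.h>

int main(void)
{
    char text[] = "here is a simple example";
    char *hit = search("example", text);

    if (hit != NULL)
        printf("found at offset %ld\n", (long)(hit - text));   /* prints 17 */
    else
        printf("not found\n");
    return 0;
}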
The simplest algorithm for searching for one string (the "needle") in another larger string (the "haystack") looks like this:
for i = 0 to len(haystack)-len(needle):
    matched = YES
    for j = 0 to len(needle)-1:
        if haystack[i+j] != needle[j]:
            matched = NO
            break
        end if
    end for
    if matched = YES:
        return i
end for
return "not found"
This has time complexity O(H*N) (where H is the length of the haystack, and N is the length of the needle). Of course, you don't usually need to test all the characters of the needle for each value of i, but it can happen in the worst case (consider searching for the needle "aaaaaaaaaaab" in a haystack made of one million repetitions of "a").

A well-known optimisation is to start by converting the needle into a finite state machine. This reduces the cost of the search to O(H), independent of N, at the expense of a setup phase that does depend on N. For a single fixed needle and a huge amount of haystack text, this is a worthwhile tradeoff.

On the other hand, what happens if you have a single fixed haystack and a large number of different needles you want to find?
Construct the list of all terminal substrings of the haystack. (A terminal substring is one which appears at the end of the haystack. In other words, a substring you can get by taking characters off the beginning.) For example, with the haystack string "foobar", the terminal substrings would be

foobar
oobar
obar
bar
ar
r

Now sort this list into order, so we get:

ar
bar
foobar
obar
oobar
r

This concludes the setup phase. Now, when given a needle, we need only perform a binary search through this list!
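As a minimal sketch (ours, not the author's), the setup and lookup might look like this in C, storing only the starting offsets of the terminal substrings:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *haystack;            /* shared with the comparator */

static int cmp_suffix(const void *a, const void *b)
{
    return strcmp(haystack + *(const int *)a, haystack + *(const int *)b);
}

/* Setup phase: sort the start offsets of all terminal substrings. */
static int *build_suffix_list(const char *text, int *n)
{
    int i, len = (int)strlen(text);
    int *pos = malloc(len * sizeof *pos);

    haystack = text;
    for (i = 0; i < len; i++)
        pos[i] = i;
    qsort(pos, len, sizeof *pos, cmp_suffix);
    *n = len;
    return pos;
}

/* Lookup: binary search for an offset where needle occurs, or -1. */
static int find(const char *needle, const int *pos, int n)
{
    int lo = 0, hi = n - 1;
    size_t m = strlen(needle);

    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int c = strncmp(haystack + pos[mid], needle, m);
        if (c == 0) return pos[mid];
        if (c < 0) lo = mid + 1; else hi = mid - 1;
    }
    return -1;
}

int main(void)
{
    int n, *pos = build_suffix_list("foobar", &n);
    printf("%d\n", find("oba", pos, n));        /* prints 2 */
    free(pos);
    return 0;
}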
The storage cost of this algorithm is high. The whole haystack must be kept in memory (or somewhere reasonably accessible) during the setup phase, and in addition to the haystack text itself we also need an array of offsets into the text, which is likely to take up 4 or 8 times as much space as the haystack itself. (We don't need to store all the actual terminal substrings; we need only store the position within the haystack where each one starts.)
The setup phase consists of a sort. Sorting is well known to be possible in O(n log n) comparisons, but when the comparisons are between strings, the cost of each comparison is non-trivial as well. In the absolute worst case (every character in the haystack is identical), a comparison would take O(H) time, so the setup phase could take O(H^2 log H) in the worst case. On the other hand, any reasonably normal haystack is likely to take not much more than O(H log H) to sort.

The searching procedure is similar. A binary search takes O(log n) comparisons, so in the worst case the search can take O(N log H). Assuming the needle is reasonably small, this seems entirely acceptable for the massive gain of not having an O(H) term!

No clear applications immediately spring to mind. Search engines and indexing is a related area, and it seems not impossible there might be an application there.
**** Pattern Matching Pointers -- very good
Dick Grune's Annotated Literature Lists
EXACT STRING MATCHING ALGORITHMS
[PDF] COSC 348 String matching
Introduction to PL-I, Algorithms, and Structured Programming, 3rd Edition
Teaching Binary Tree Algorithms by Programming Proof and Animation
Quick string search
Citations- Improved string searching - Baeza-Yates (ResearchIndex)
Knuth-Morris-Pratt algorithm - Encyclopedia Fablis articles about ...
Knuth-Morris-Pratt algorithm - Wikipedia, the free encyclopedia
NIST: Knuth-Morris-Pratt algorithm
The Knuth-Morris-Pratt Algorithm
Knuth-Morris-Pratt algorithm - encyclopedia article about Knuth ...
The Boyer-Moore algorithm is one of the fastest known string searching algorithms in practice, at least in its best case. For a text of length n and a fixed pattern of length m, its best-case cost is about n/m character comparisons, while the worst case of the simple variant is on the order of n*m. The algorithm is very widely used in software engineering. Its notable feature, apparent from the running-time formula, is that the longer the pattern we are looking for, the faster the algorithm will usually be able to find it.
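As an illustration of the skip idea, here is a minimal sketch (ours) of the simplified Boyer-Moore-Horspool variant, which keeps only the bad-character shift; the full Boyer-Moore algorithm adds a second (good-suffix) shift rule on top of this.

/* Boyer-Moore-Horspool sketch (illustrative only): bad-character shifts.
   Returns the offset of the first occurrence of pat in text, or -1. */
#include <stddef.h>
#include <string.h>

long horspool(const unsigned char *pat, size_t m,
              const unsigned char *text, size_t n)
{
    size_t shift[256], i, j;

    if (m == 0) return 0;
    if (m > n)  return -1;

    for (i = 0; i < 256; i++)
        shift[i] = m;                     /* default: shift past the window */
    for (i = 0; i + 1 < m; i++)
        shift[pat[i]] = m - 1 - i;        /* distance to the last column */

    for (j = 0; j + m <= n; j += shift[text[j + m - 1]]) {
        if (text[j + m - 1] == pat[m - 1] &&
            memcmp(text + j, pat, m - 1) == 0)
            return (long)j;
    }
    return -1;
}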
String searching algorithm - Wikipedia, the free encyclopedia
Boyer-Moore String Matching over Ziv-Lempel Compressed Text ...
The Boyer-Moore Fast String Searching Algorithm
This algorithm, which Bob Boyer and I invented in about 1975, is the basis of the fastest known ways to find one string of characters in another. How might you look for the following pattern in the text below?

pattern: EXAMPLE
text:    HERE IS A SIMPLE EXAMPLE

We can show you the Knuth-Pratt-Morris algorithm and we can show you the Boyer-Moore algorithm. Our algorithm has the peculiar property that, roughly speaking, the longer the pattern is, the faster the algorithm goes. Furthermore, the algorithm is ``sublinear'' in the sense that it generally looks at fewer characters than it passes. The algorithm is described in
- A Fast String Searching Algorithm, with R.S. Boyer. Communications of the Association for Computing Machinery, 20(10), 1977, pp. 762-772.
Frodo's Workshop - Boyer-Moore Algorithm
Data Structures and Algorithms II Lecture 3- String Search ...
Boyer-Moore-Horspool String Searching
SUNDAY QUICKSEARCH ALGORITHM
By Leonard Morgenstern
Forth Scientific Library Algorithm #41
Adapted from Daniel Sunday. Communications of the ACM 33:132 1990.
QUICKSEARCH ( c-addrp up c-addrt ut -- c-addr)
Locate the first instance of a search-pattern with coordinates c-addrp up within a
text-string c-addrt ut. Return the address of the first match in the text if found,
or NIL if not, where NIL is a user-defined impossible address. An ambiguous
condition exists if the search-pattern (up) is longer than the text-string (ut).

The search algorithm
Before beginning the search, the algorithm constructs an array called a
shift-table, having the same number of cells as the number of possible characters,
normally 256. The contained quantity is up +1 (where up is the length of the
search-pattern) for characters not present in the search-pattern, and the number
of characters from the end of the string for characters that are present, the
final character in the string being counted as 1. If duplicates are present, the
lowest number is used. For example, if the pattern string is "THAT", then T=1,
A=2, H=3, and all others are 5. If two characters are equivalent, they are
assigned the same shift-values; in the example, T and t would both have the value
1 if case is to be ignored. The shift-table is the only data structure needed by
the algorithm, except, of course, the text and pattern strings themselves.

At the start, the first
character of the text and up characters are compared. (The choice of a comparison
algorithm will be discussed below.) If there is a mismatch, the text character
just beyond the pattern is examined, and the corresponding shift obtained from the
shift-table. The pattern is then advanced that number of characters. The process
is repeated until the pattern is matched or the text-string is exhausted.

An important characteristic of Sunday's algorithm is that the distance advanced
depends only on the text character just beyond the search-pattern; it does not
depend on the character that failed to match, its position, or the order in which
the pattern and text are compared. The algorithm may search front-to-back,
back-to-front, or a scrambled order based on letter-frequencies. Naturally, the
choice affects performance, as will be discussed in the next section.

The comparison algorithm

The algorithm for comparing the search-pattern and text (the "comparator") is a
deferred word with the stack diagram:

COMPARATOR ( c-addrp up c-addrt ut -- n)
where n=0 for a match and non-zero for a mismatch. The stack diagram is the same
as that of the word COMPARE in the ANSI Standard String Word Set.

COMPARATOR should be carefully selected to optimize performance. The standard word
COMPARE is a good choice in most cases. Situations where another comparator should
be used include: 1) when certain sets of characters are regarded as equivalent,
for example, when case is to be ignored; 2) where wild-cards are allowed; and 3)
where character frequencies dictate a more efficient search order. In general, the
search is fastest when rare characters are matched first.

Efficiency
The efficiency of the search can be measured by the number of comparisons. The most
favorable case is seen when the first character in the pattern is never present in
the text and the text character just beyond the pattern never happens to be
present in the pattern; the number of comparisons is then ut/(up+1). At the other
extreme, there are degenerate cases in which the number of comparisons approaches
ut*up. This would occur when using a front-to-back comparator if the text and
search patterns contain long runs of the final character in the search pattern,
for example, searching for AAAABA in AAA...AAAAABA. This is not merely a
theoretical problem, as it might occur in searching for a string embedded in
spaces, or in searching graphics files. In the example, performance would be
improved by matching the odd character first, but the number of comparisons could
not be reduced much below ut. If there is a significant chance that patterns like
these might occur, Sunday's algorithm probably should not be used.

Requirements
This is an ANS Forth program requiring:
1. A standard system with Core and Core extension sets.
2. String extension set if COMPARE is used to compare strings.
3. Uses the words V: and DEFINES from the FSL utilities to
manage vectored words
4. The compilation of the test code is controlled by the
VALUE TEST-CODE? and the conditional compilation words in
the Programming-Tools wordset
5. The test code uses the String wordset word /STRING
6. The test code uses the (non ANS) words:
TAB to emit a tab character and,
DUP>R which is equivalent to:
: DUP>R DUP >R ;
This code is released to the public domain. Leonard Morgenstern, March 1995.

[THEN]
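For readers who do not use Forth, here is a minimal C sketch (ours) of the same QuickSearch idea, using the shift-table construction described above; the Forth source follows.

/* Sunday QuickSearch sketch (illustrative only).
   Returns a pointer to the first match of pat in text, or NULL. */
#include <stddef.h>
#include <string.h>

char *quicksearch(const unsigned char *pat, size_t up,
                  const unsigned char *text, size_t ut)
{
    size_t shift[256], i, j;

    if (up == 0) return (char *)text;
    if (up > ut) return NULL;

    /* Shift table: up+1 for absent characters, distance from the end
       (the last character counted as 1) for characters in the pattern. */
    for (i = 0; i < 256; i++)
        shift[i] = up + 1;
    for (i = 0; i < up; i++)
        shift[pat[i]] = up - i;

    for (j = 0; j + up <= ut; j += shift[text[j + up]]) {
        if (memcmp(text + j, pat, up) == 0)
            return (char *)(text + j);
        if (j + up == ut)                 /* no character just past the window */
            break;
    }
    return NULL;
}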
\ F-PC IMPLEMENTATION OF SUNDAY QUICKSEARCH
0 [IF]
\ PRELUDE to convert F-PC to ANSI
: CELLS 2* ;
: CELL+ 2+ ;
: VALUE 0 VALUE ;
' =: ALIAS TO IMMEDIATE
' ASCII ALIAS [CHAR] IMMEDIATE
' " ALIAS S" IMMEDIATE
: COMPARE ( c-addr1 u1 c-addr2 u2 -- n)
rot min COMP ;
\ END OF PRELUDE

INCLUDE QUIKSRCH
[THEN]
1 CELLS CONSTANT CELL
\ SUNDAY QUICKSEARCH ALGORITHM
\ ANSI STANDARD
\ REQUIRES CORE, CORE EXTENSION, AND STRING EXTENSION
\
\ SETTING PARAMETERS
\
-1 CONSTANT NIL                  \ Impossible address; returned if no match found
256 CONSTANT #SYMBOLS            \ Number of possible symbols

CREATE SHIFT-TABLE #SYMBOLS CELLS ALLOT \ Make shift-table

V: COMPARATOR
' COMPARE DEFINES COMPARATOR     \ Default COMPARATOR
\
\ Initializing the shift table.
\
: INIT-SHIFT-TABLE ( c-addr1 u -- )
DUP 1+
SHIFT-TABLE #SYMBOLS CELLS OVER + SWAP
DO DUP I ! CELL +LOOP
DROP
DUP 0 2SWAP OVER + SWAP
DO
I C@ CELLS SHIFT-TABLE +
>R 2DUP - R> !
1+
LOOP
2DROP
;
\
\ QUICKSEARCH
\
\ In this version, certain quantities are stored in VALUEs
\ For many applications, local variables would be better
0 VALUE QS-TA 0 VALUE QS-TL
0 VALUE QS-PA 0 VALUE QS-PL

: QUICKSEARCH ( c-addr[p] u[p] c-addr[t] u[t] -- c-addr[m])
\ Enter with addr & len of pattern & text
\ Return address of match or NIL
TO QS-TL TO QS-TA \ Store text coordinates
2DUP INIT-SHIFT-TABLE \ Initialize shift-table
TO QS-PL TO QS-PA \ Store pattern coordinates
NIL \ NIL on stack
QS-TA QS-TL QS-PL - 1+ \ Maximum number of comparisons
OVER + SWAP \ Substitute BOUNDS if defined
DO
I QS-PL QS-PA QS-PL
COMPARATOR 0= \ Make a comparison
IF
DROP I LEAVE
\ Leave if found
THEN
I QS-PL + C@ \ Extract char just past pattern
CELLS SHIFT-TABLE + @
\ Extract shift
+LOOP
;

\ END OF ALGORITHM
\ BELOW HERE ARE SEVERAL TEST UTILITIES, ETC.
\ =======================================================================

TEST-CODE? [IF] \ test code =============================================
: DUMP-ST ( DUMP SHIFT TABLE)
#SYMBOLS 0
DO
I 16 MOD 0= IF CR I EMIT SPACE THEN
I CELLS SHIFT-TABLE + @ . SPACE
I 128 MOD 0= IF KEY DROP THEN
LOOP
;

CREATE pattern$ 40 CHARS ALLOT
CREATE text$ 40 CHARS ALLOT

: Show-result ( a -- ) \ After performing search show result
CR 1 tab
DUP nil =
IF
DROP ." NO MATCH!"
ELSE
>R \ \ stash match addr on r.s.
QS-TA QS-TL \ ta tl \ get coordinates of text
OVER R> OVER - \ ta tl ta len1 \ get coordx of 1st part
dup>r
TYPE [CHAR] ^ EMIT \ ta tl
R> /STRING \ a' l' \ index to match
OVER QS-PL
TYPE [CHAR] ^ EMIT \ a' l' \ type matching part
QS-PL /STRING TYPE \ \ type tail
THEN
CR
;
: QTEST ( -- a ) \ Ask for strings, compare, & return result
CR ." Look for: "
pattern$ 40 EXPECT pattern$ SPAN @
CR ." in: "
text$ 40 EXPECT text$ SPAN @
QUICKSEARCH
Show-result
;

[THEN]
Dictionary of Algorithms and Data Structures (NIST), www.nist.gov/dads/ -- cites Daniel M. Sunday, A Very Fast Substring Search Algorithm, Communications of the ACM, 33(8):132-142, August 1990.

stbm.c, www.grouse.com.au/ggrep/stbm.c.html -- C source from ggrep that references the Boyer-Moore paper (Communications of the ACM, 20(10), October 1977, pp. 762-772) and Sunday's "A Very Fast Substring Search Algorithm" (Communications of the ACM, 33(8), 1990).

[PDF] COSC 348 String matching -- lecture notes citing Daniel Sunday, A very fast substring search algorithm, Communications of the Association for Computing Machinery 33 (1990).
Daniel M. Sunday (East Stroudsburg Univ. of Pennsylvania), A Very Fast Substring Search Algorithm, Communications of the ACM, Volume 33, Issue 8 (August 1990), pages 132-142, ISSN 0001-0782, ACM Press, New York, NY, USA.
This article describes a substring search algorithm that is faster than the Boyer-Moore algorithm. This algorithm does not depend on scanning the pattern string in any particular order. Three variations of the algorithm are given that use three different pattern scan orders. These include: (1) a "Quick Search" algorithm; (2) a "Maximal Shift" and (3) an "Optimal Mismatch" algorithm.
Abstract
A fast substring search algorithm is presented here. This algorithm is faster than a simple left to right scanning algorithm. Startup overhead is taken into consideration to avoid the use of a large lookup table. This algorithm is easily implementable in Java.
Introduction
The Java class 'String' provides substring searching capabilities. The method String.indexOf(pattern_string) will search for the location of the first occurrence of pattern string in the source string. For example, "Who is Tomas".indexOf("Tomas") would return 7.
A typical implementation would search for the character 'T' from left to right. Once 'T' is found, 'o', 'm', 'a', and 's' would be matched in turn. If a mismatch happens, we resume searching for 'T' until we hit the end of the source string.
This is pseudo code implementing the simple algorithm.
outer_loop:
for indx = 0 to length(source) - length(pattern) {
    // match first pattern character
    if pattern[0] == source[indx] then {
        for indx2 = 1 to length(pattern) - 1 {
            // match subsequent pattern characters
            if pattern[indx2] != source[indx + indx2] then
                continue outer_loop;
        }
        return indx; // everything matched
    }
}
return did_not_match;

We can do much better than the above algorithm.
Looking For Impossible Characters
The key to a faster scanning algorithm is to look at the last character of the pattern string first. If the last pattern character matches, then we continue to search for pattern characters from left to right, until the second last pattern character. (remember the last pattern character already matches)
"Who is Tomas".indexOf("Tomas") v Who i........ <- source Tomas <- pattern ^In the above example, we found a character 'i' where a character 's' is required to make a match. Because 'i' does not occur anywhere in the string 'Tomas', we can advance the search position 5 characters! While we are matching the last character of the pattern string, if we found an "impossible character" in the search string, we are safe to advance the search pointer by the length of the pattern string.
How do we program this test? We can use a 64 bit long integer for a quick test.
// initialization
long cache = 0L;
for i = 0 to length(pattern) - 1 {
    cache |= 1L << (pattern[i] & 63);
}
...
// test for last pattern character; pattern string is "Tomas"
if (source[i] != 's') {
    // last character did not match
    if ((cache & (1L << (source[i] & 63))) == 0L) {
        // got an impossible char
        i += 5; // skip 5 characters
    }
}

Being able to skip 5 characters instead of 1 character at a time gives us a speed boost.
If the implementation language does not have good support for 64 bit integer arithmetic, a 32 bit integer cache can be used in a pinch.
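A minimal C sketch (ours) of the same bitmap trick; note that because 'ch & 63' hashes every character into one of 64 buckets, a set bit only means "possibly in the pattern", while a clear bit guarantees an impossible character.

/* 64-bit "may occur in the pattern" filter (illustrative only). */
#include <stdint.h>
#include <string.h>

uint64_t build_cache(const char *pattern)
{
    uint64_t cache = 0;
    size_t i, m = strlen(pattern);

    for (i = 0; i < m; i++)
        cache |= (uint64_t)1 << (pattern[i] & 63);
    return cache;
}

/* Zero result => the character cannot appear anywhere in the pattern. */
int possibly_in_pattern(uint64_t cache, char c)
{
    return (cache & ((uint64_t)1 << (c & 63))) != 0;
}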
Search Order Discussion
After the last character of the pattern string matches with the source string character under the scan pointer, we scan the rest of the pattern string from left to right. The question is 'why'? Some other choices include right to left, and random order. The answer has to do with worst case performance estimate.
If we find an "impossible character" in the subsequent pattern string matches, we can still skip more than one characters. For example, if we find the fifth character is "impossible", then we can increase search pointer by 5.
stre stress   <- source
stress        <- pattern
234561        <- match order

          v
stre stress   <- source
     stress   <- pattern
          ^

By scanning from left to right, we ensure the amount we skip is roughly proportional to the number of characters we scanned. If we scan from right to left, then it is possible for a long scan to yield a small skip amount.
MD2 Optimization
Optimization of md2 offset is used when the last character of the pattern string matches, but some other pattern characters do not match.
Take the example of "James Tomas".indexOf("Tomas")
In this example, the 5th character of the search string is 's', so it matches the last character of the pattern string. We then match 'J' against 'T', which does not match. At this point, we can again skip 5 characters! Why?
This is because in the string "Tomas", 's' only occurs as the last character and no where else. We know the 5th character is 's' but the front part did not match. If we advance the search position one character, then the 's' character would be eventually matched against the 4th character of the pattern string ('a'). But we already said 's' does not occur in the pattern string from the first position up to the 4th position.
    v
James....   <- source
Tomas       <- pattern
    ^

So before we do anything we can already predict any shift less than 5 would produce an ultimate mismatch.
The safe shift distance for subsequent mismatches is called 'md2'. One can calculate md2 by looking for 's' starting from the 2nd last pattern character to the first pattern character
For the pattern string "Tomss", md2 would be 1.
For the pattern string "Tosas", md2 would be 2.
For the pattern string "Tsmas", md2 would be 3.
For the pattern string "somas", md2 would be 4.Pseudo Code for Full Implementation
Pseudo Code for Full Implementation

// initialize the cache
long cache = 0L;
for i = 0 to length(pattern) - 1 {
    cache |= 1L << (pattern[i] & 63);
}

// calculate md2: distance from the rightmost earlier occurrence of the
// last pattern character to the end of the pattern (see the examples above)
last_pattern = pattern[ length(pattern) - 1 ];
md2 = length(pattern);
for i = 0 to length(pattern) - 2 {
    if last_pattern == pattern[i] then {
        md2 = length(pattern) - 1 - i;
    }
}

// note: i is advanced explicitly inside the loop
scan_loop: for i = length(pattern) - 1 to length(source) - 1 {
    if last_pattern == source[i] then {
        // last character matched, match the rest
        for i2 = 0 to length(pattern) - 2 {
            if pattern[i2] != source[i - length(pattern) + 1 + i2] then {
                // see if the mismatching character is "impossible"
                if ((cache & (1L << (source[i - length(pattern) + 1 + i2] & 63))) == 0L)
                    altskip = i2 + 1;
                else
                    altskip = 1;
                // skip the maximum of md2 and the impossible-character skip
                i += max(md2, altskip);
                continue scan_loop;
            }
        }
        return i - length(pattern) + 1; // everything matched
    }
    if ((cache & (1L << (source[i] & 63))) == 0L)
        i += length(pattern); // got an impossible character
    else
        i += 1;
}
return not_found;

Performance Estimate
The new string searching mechanism can potentially speed up string searching by a factor of 'p', where p is the length of the pattern string. The downside is a pre-processing cost of about 2p instructions for calculating the impossible-character bitmap and md2. Once we have encountered two impossible characters in the source string, that setup cost has already paid for itself.
This performance profile compares favorably with traditional fast substring scanning algorithms such as Boyer-Moore. Since Java's char is 16 bits wide, a naive implementation of Boyer-Moore would require on the order of 65536 machine operations just to initialize its lookup table, which is clearly not acceptable when the string being scanned is shorter than roughly 70000 characters.
References
Timo Raita, "On Guards and Symbol Dependencies in Substring Search", Software-Practice and Experience 29(11), pg 931-941 (1999)
Daniel M. Sunday, "A Very Fast Substring Search Algorithm", CACM 33(8), pg 133-142 (1990)
R. Boyer and S. Moore, "A Fast String Searching Algorithm", CACM 20(10), pg 762-772 (1977)
Rabin-Karp string search algorithm

The Rabin-Karp algorithm is a string searching algorithm that seeks a pattern, i.e. a substring, within a text by using hashing. It is not widely used for single pattern matching, but is of considerable theoretical importance and is very effective for multiple pattern matching. Historically, it predates the very popular Knuth-Morris-Pratt algorithm. For a text of length n and a pattern of length m, its average and best case running time is O(n) (meaning proportional to n), but the worst case is O(n*m), which is one of the reasons why it is not widely used.

Let Σ be an alphabet (finite set). Formally, both the pattern and the searched text are concatenations of elements of Σ. Σ may be the usual human alphabet (A-Z); other applications may use a binary alphabet (Σ = {0,1}) or the DNA alphabet (Σ = {A,C,G,T}) in bioinformatics.