**The Aho–Corasick algorithm** is a substring search algorithm developed by Alfred Aho and Margaret Corasick in 1975 ^{[1]} . It searches a given string for occurrences of multiple substrings taken from a dictionary.

It is widely used in system software; for example, it is employed in the grep search utility ^{[2]} .


## Principle of Operation

The algorithm builds a finite-state machine and then feeds the search string to it. The machine consumes the characters of the string one by one and moves along the corresponding edges. Whenever the machine reaches a final state, the corresponding dictionary string is present in the searched string at the current position.

Several search strings can be combined into a search tree, the so-called trie (prefix tree). The trie is a state machine that recognizes any one of the $m$ dictionary strings — but only provided that the match starts at the beginning of the input.
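As an illustration, such a trie can be built in a few lines of Python. This is a minimal sketch; the `Node` class and the `needle` field are my own naming, not fixed by the algorithm:

```python
class Node:
    def __init__(self):
        self.children = {}   # edge character -> child Node
        self.needle = None   # dictionary word ending at this node, if any

def build_trie(words):
    """Combine the search strings into one prefix tree (trie)."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.needle = word   # this node is a final state for `word`
    return root

# walking "aab" from the root ends in a final state of the trie
root = build_trie(["aab", "aabab", "ba"])
node = root
for ch in "aab":
    node = node.children[ch]
assert node.needle == "aab"
```

Words sharing a prefix share the corresponding chain of nodes, which is what makes the combined machine compact.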

The first task in the algorithm is to teach the automaton to "recover" after a mismatch. Simply sending the automaton back to its initial state on any unsuitable letter will not do, since that can cause an occurrence to be missed. For example, when searching for the string `aabab` in the text `aabaabab`, resetting the automaton to the initial state after reading the fifth character loses the occurrence starting at the fourth character; the correct behaviour would have been to move to the state `a` and then process the fifth character again.

To let the automaton recover, suffix links labelled with the empty symbol ⌀ are added to it (which turns the deterministic automaton into a non-deterministic one). For example, if the string read so far is `aaba`, its proper suffixes are `aba`, `ba`, and `a`. A suffix link is a link to the node corresponding to the longest suffix that does not lead the trie into a dead end (in this case `a`).
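The failure of the hard-reset strategy on the example above can be checked directly with a short Python sketch. The function name `naive_scan` is mine, purely for illustration:

```python
def naive_scan(needle, haystack):
    """Scan for a single word; on any mismatch, hard-reset the match position."""
    found, j = [], 0                  # j = number of matched characters so far
    for i, c in enumerate(haystack, 1):
        if c == needle[j]:
            j += 1
            if j == len(needle):      # full match: report its 1-based start
                found.append(i - j + 1)
                j = 0
        else:
            j = 0                     # hard reset: this is exactly the bug
    return found

# the text does contain "aabab" (starting at position 4)...
assert "aabab" in "aabaabab"
# ...but the resetting scanner never finds it
assert naive_scan("aabab", "aabaabab") == []
```

After the mismatch on the fifth character, the scanner should have kept the partial match `a` and reprocessed that character; suffix links are what make this recovery automatic.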

For the root node, the suffix link is a loop. For the other nodes the rule is: if the last recognized character is $c$ , follow the parent's suffix link; if there is an arc labelled $c$ from there, the suffix link points to the node that arc leads to. Otherwise the algorithm follows suffix links again and again until it either passes through the root's loop or finds an arc labelled $c$ .

```
 * ···⌀···> * ···⌀···> * ···⌀···> *
 |                               |
 c                               c
 ↓                               ↓
[*] ···········⌀··············> *
        new suffix link
```

This automaton is non-deterministic. Converting a non-deterministic finite automaton into a deterministic one generally leads to a significant blow-up in the number of vertices. This automaton, however, can be made deterministic without creating any new vertices: if vertex $v$ has no transition on symbol $c$ , follow suffix links again and again until either the root is reached or a vertex with an outgoing arc labelled $c$ is found.
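This fallback rule can be sketched in Python on a hand-built two-node automaton for the single pattern `ab`; the field names `goto` and `suff` are my own, not from the article:

```python
class Node:
    def __init__(self):
        self.goto = {}    # trie edges: character -> Node
        self.suff = None  # suffix link

# hand-built trie for the single pattern "ab"
root = Node(); root.suff = root       # the root's suffix link is a loop
a = Node(); a.suff = root
b = Node(); b.suff = root
root.goto = {"a": a}
a.goto = {"b": b}

def transition(node, c, root):
    """Deterministic transition on symbol c, implemented via suffix links."""
    while c not in node.goto and node is not root:
        node = node.suff              # climb the suffix links
    return node.goto.get(c, root)     # no edge even at the root: stay there

# from the final state "ab", reading "a" climbs to the root
# and then takes the root's edge back into state "a"
assert transition(b, "a", root) is a
```

No new vertices are created: the determinized transition is computed by walking links that already exist in the trie.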

It is convenient to perform all these computations recursively, caching the results. For example, for the suffix link:

```
alg SuffLink(v)
    if v.cachedSuffLink ≠ ∅              // for the root, initially root.cachedSuffLink = root
        return v.cachedSuffLink
    u := v.parent
    c := v.character
    repeat
        u := SuffLink(u)
    until (u = root) or (there is a transition u —c→ w)
    if there is a transition u —c→ w
        v.cachedSuffLink := w
    else
        v.cachedSuffLink := root
    return v.cachedSuffLink
```
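The pseudocode above can be rendered in Python roughly as follows. This is a sketch: the trie construction and the `w is v` guard for depth-one nodes are my additions.

```python
class Node:
    def __init__(self, parent=None, char=None):
        self.goto = {}          # trie edges: character -> Node
        self.parent = parent
        self.char = char        # character on the edge from the parent
        self.cached_suff = None

def build_trie(words):
    root = Node()
    root.cached_suff = root     # for the root, the suffix link is a loop
    for word in words:
        node = root
        for ch in word:
            if ch not in node.goto:
                node.goto[ch] = Node(node, ch)
            node = node.goto[ch]
    return root

def suff_link(v, root):
    """Recursively compute and cache the suffix link of v."""
    if v.cached_suff is not None:
        return v.cached_suff
    u, c = v.parent, v.char
    while True:
        u = suff_link(u, root)
        if u is root or c in u.goto:
            break
    w = u.goto.get(c)
    # a depth-one node would find the root's edge leading back to itself,
    # so its suffix link must be the root instead
    v.cached_suff = root if w is None or w is v else w
    return v.cached_suff

# the article's example: in the trie of "aaba", the suffix link of the
# node for "aaba" points to the node for "a"
root = build_trie(["aaba"])
node_a = root.goto["a"]
node_aaba = node_a.goto["a"].goto["b"].goto["a"]
assert suff_link(node_aaba, root) is node_a
```

Because every result is cached, each link is computed once, even though the recursion may revisit the same ancestors.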

Determinization increases the number of final vertices: if the suffix links starting from vertex $v$ lead to a final vertex $u$ , then $v$ itself is also declared final. To handle this, so-called end links are created: an end link points to the nearest final vertex reachable via suffix links, and traversing the chain of end links yields all the strings that match at the current position.

```
alg OutputResult(v, i)
    print "Found " + v.needle + " at position " + (i − v.depth + 1)
```

```
alg MainSearch
    state := root
    for i = 1 .. |haystack|
        state := Transition(state, haystack[i])
        if state.needle ≠ ∅
            OutputResult(state, i)
        tmpState := state
        while EndLink(tmpState) ≠ ∅
            tmpState := EndLink(tmpState)
            OutputResult(tmpState, i)
```
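Putting the pieces together, the whole search phase can be sketched as a self-contained Python implementation. Here the suffix and end links are precomputed breadth-first rather than lazily, and all names (`build_automaton`, `search`, the field names) are my own:

```python
from collections import deque

class Node:
    def __init__(self):
        self.goto = {}      # trie edges: character -> Node
        self.suff = None    # suffix link
        self.end = None     # end link: nearest final state along suffix links
        self.needle = None  # dictionary word ending here, if any
        self.depth = 0

def build_automaton(words):
    root = Node()
    root.suff = root
    for word in words:                      # build the trie
        node = root
        for ch in word:
            nxt = node.goto.get(ch)
            if nxt is None:
                nxt = Node()
                nxt.depth = node.depth + 1
                node.goto[ch] = nxt
            node = nxt
        node.needle = word
    queue = deque()                         # BFS fills links level by level
    for child in root.goto.values():
        child.suff = root
        queue.append(child)
    while queue:
        v = queue.popleft()
        for c, w in v.goto.items():
            u = v.suff
            while c not in u.goto and u is not root:
                u = u.suff
            w.suff = u.goto[c] if c in u.goto else root
            w.end = w.suff if w.suff.needle else w.suff.end
            queue.append(w)
    return root

def search(root, haystack):
    """Return (word, 1-based position) for every dictionary match."""
    out, state = [], root
    for i, c in enumerate(haystack, 1):
        while c not in state.goto and state is not root:
            state = state.suff              # recover via suffix links
        state = state.goto.get(c, root)
        if state.needle:
            out.append((state.needle, i - state.depth + 1))
        t = state.end
        while t is not None:                # walk end links for shorter matches
            out.append((t.needle, i - t.depth + 1))
            t = t.end
    return out

matches = search(build_automaton(["aab", "ab", "b"]), "aabab")
print(matches)  # [('aab', 1), ('ab', 2), ('b', 3), ('ab', 4), ('b', 5)]
```

Note how at position 3 the end links report `ab` and `b` in addition to `aab`: overlapping matches ending at the same character are all found.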

Suffix links and end links can be computed lazily, on demand, already during the search phase. The side transitions may be recomputed on the fly without any caching, cached for all nodes, or cached only for the most frequently visited ones; none of these choices affects the asymptotic complexity of the algorithm.

## Computational complexity

The computational complexity of the algorithm depends on the organization of the data. For example:

- If the machine's transition table is stored as an indexed array, the memory consumption is $O(n\sigma)$ and the computational complexity is $O(n\sigma + H + k)$ , where $H$ is the length of the text being searched, $n$ is the total length of all words in the dictionary, $\sigma$ is the size of the alphabet, and $k$ is the total length of all matches.
- If the transition table is stored as a red–black tree, the memory consumption drops to $O(n)$ , but the computational complexity rises to $O((H+n)\log\sigma + k)$ .

## Notes

1. ↑ *Alfred V. Aho, Margaret J. Corasick.* Efficient string matching: An aid to bibliographic search // Communications of the ACM. — 1975. — Vol. 18, no. 6. — P. 333–340. — DOI: 10.1145/360825.360855.
2. ↑ grep-2.26 released [stable]. www.mail-archive.com. Retrieved October 4, 2016.
*Alfred V. Aho, Margaret J. Corasick.*Efficient string matching: An aid to bibliographic search // Communications of the ACM . - 1975. - T. 18 , No. 6 . - S. 333-340 . - DOI : 10.1145 / 360825.360855 . - ↑ grep-2.26 released [stable ] . www.mail-archive.com. Date of treatment October 4, 2016.