Thursday, March 22, 2012

Text Indexing with Aho-Corasick

The Aho-Corasick string matching algorithm is a dictionary-matching search algorithm. It was originally proposed as an alternative to indexing as a means of speeding up bibliographic search. That was in 1975, before the World Wide Web and the ensuing information explosion made indexing in some form necessary for practical real-time information retrieval. The Aho-Corasick algorithm nevertheless has some interesting properties that make it attractive for use as an indexing scanner.

The algorithm constructs a state machine from a collection of dictionary words. The state machine is, in effect, a reduced-grammar regular expression parser and can be used to scan text for the dictionary words in a single pass. The machine's state transitions (edges) trigger on encountering a specific letter in the input stream. Machine states (nodes) can emit one or more dictionary words if the path leading to the state encodes all of the letters of the dictionary word in order. Failure edges lead from a state that has no outgoing edge matching the next letter in the input stream to a state from which it may still be possible to match a dictionary word, given the letters already encountered in the stream.
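The construction described above can be sketched roughly as follows. This is a minimal illustration and not the implementation benchmarked in this post: the class name AhoCorasickSketch, the State fields, and the "word@startIndex" match format are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch; names and match format are assumptions, not the post's code.
class AhoCorasickSketch {
    // One machine state: letter-labelled edges, a failure edge, and emitted words.
    static final class State {
        final Map<Character, State> next = new HashMap<>();
        State fail;
        final List<String> output = new ArrayList<>();
    }

    private final State root = new State();

    AhoCorasickSketch(List<String> dictionary) {
        // Phase 1: build a trie of the dictionary words (the letter-triggered edges).
        for (String word : dictionary) {
            State s = root;
            for (char c : word.toCharArray()) {
                s = s.next.computeIfAbsent(c, k -> new State());
            }
            s.output.add(word);
        }
        // Phase 2: breadth-first pass that sets failure edges and merges outputs.
        Queue<State> queue = new ArrayDeque<>();
        root.fail = root;
        for (State child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            State s = queue.remove();
            for (Map.Entry<Character, State> e : s.next.entrySet()) {
                State child = e.getValue();
                // Follow the parent's failure chain until a state accepts this letter.
                State f = s.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                State target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                // Words ending at the failure state also end at this state.
                child.output.addAll(child.fail.output);
                queue.add(child);
            }
        }
    }

    // Scans the text in a single pass and reports each match as "word@startIndex".
    List<String> scan(String text) {
        List<String> matches = new ArrayList<>();
        State s = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (s != root && !s.next.containsKey(c)) {
                s = s.fail; // no matching edge: take the failure edge
            }
            s = s.next.getOrDefault(c, root);
            for (String word : s.output) {
                matches.add(word + "@" + (i - word.length() + 1));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        AhoCorasickSketch machine =
                new AhoCorasickSketch(Arrays.asList("he", "she", "his", "hers"));
        System.out.println(machine.scan("ushers")); // prints [she@1, he@2, hers@2]
    }
}
```

In this sketch, scanning "ushers" reaches the state for "she", which emits both "she" and "he" because its failure edge points at the state for "he". The letter "r" then has no edge from "she", so the scan follows that failure edge and continues on to "hers".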

The time taken to construct the state machine is proportional to the sum of the lengths of all dictionary words. This cost, however, can be amortized over the life of the state machine, and a single state machine can be used to parse multiple texts concurrently if the implementation uses independent iterators to track state transitions through the machine. The number of state transitions required for an Aho-Corasick state machine to scan a document is independent of the size of the dictionary. This means that the Aho-Corasick method scales very well to large dictionaries, the limiting factor being the space required to hold the state machine in memory.
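One way to picture the independent iterators is a machine that is never modified after construction, plus a small cursor object per scan that holds only the current state. The sketch below is an assumption about how this could look, not the design used in our implementation; SharedMachineSketch, Cursor, feed and countMatches are hypothetical names, and states are stored as integers purely to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a read-only machine shared by several scanning cursors.
class SharedMachineSketch {
    private final List<Map<Character, Integer>> edges = new ArrayList<>();
    private final List<Integer> fail = new ArrayList<>();
    private final List<Integer> hits = new ArrayList<>(); // words emitted per state

    SharedMachineSketch(List<String> dictionary) {
        newState(); // state 0 is the root
        for (String word : dictionary) { // trie of letter-labelled edges
            int s = 0;
            for (char c : word.toCharArray()) {
                Integer t = edges.get(s).get(c);
                if (t == null) {
                    t = newState();
                    edges.get(s).put(c, t);
                }
                s = t;
            }
            hits.set(s, hits.get(s) + 1);
        }
        // Breadth-first pass for failure edges; depth-1 states keep failure edge 0.
        ArrayDeque<Integer> queue = new ArrayDeque<>(edges.get(0).values());
        while (!queue.isEmpty()) {
            int s = queue.remove();
            for (Map.Entry<Character, Integer> e : edges.get(s).entrySet()) {
                int f = fail.get(s);
                while (f != 0 && !edges.get(f).containsKey(e.getKey())) {
                    f = fail.get(f);
                }
                int child = e.getValue();
                fail.set(child, edges.get(f).getOrDefault(e.getKey(), 0));
                hits.set(child, hits.get(child) + hits.get(fail.get(child)));
                queue.add(child);
            }
        }
    }

    private int newState() {
        edges.add(new HashMap<>());
        fail.add(0);
        hits.add(0);
        return edges.size() - 1;
    }

    // Per-scan iterator: the only mutable data lives here, not in the machine.
    final class Cursor {
        private int state = 0;
        private long matchCount = 0;

        void feed(char c) {
            while (state != 0 && !edges.get(state).containsKey(c)) {
                state = fail.get(state);
            }
            state = edges.get(state).getOrDefault(c, 0);
            matchCount += hits.get(state);
        }

        long matchCount() {
            return matchCount;
        }
    }

    // Counts dictionary-word occurrences in one text using a fresh cursor.
    long countMatches(String text) {
        Cursor cursor = new Cursor();
        for (int i = 0; i < text.length(); i++) {
            cursor.feed(text.charAt(i));
        }
        return cursor.matchCount();
    }

    public static void main(String[] args) {
        SharedMachineSketch machine =
                new SharedMachineSketch(Arrays.asList("he", "she", "his", "hers"));
        List<String> texts = Arrays.asList("ushers", "this is his", "she sells seashells");
        // Texts are scanned in parallel; every task builds its own cursor.
        texts.parallelStream()
             .forEach(t -> System.out.println(t + " -> " + machine.countMatches(t)));
    }
}
```

Because the machine's fields are final and only read after the constructor returns, the cursors do not need locks in this sketch. The construction cost is paid once, however many texts are scanned.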

As a proof of concept, we implemented the Aho-Corasick algorithm in Java and ran some benchmark tests. For debugging purposes we implemented a method to dump a state machine to Graphviz DOT format. The visualization of a state machine constructed with dictionary [he, she, his, hers] is shown in Figure 1. The background image for this blog title is the visualization of a state machine constructed with a 100-word dictionary, which is not very practical to follow but makes an interesting graphic.
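For readers who want to try something similar, a DOT dump might look roughly like the sketch below. It is not our dump method: the class DotDumpSketch, the integer node names, and the styling (double circles for states that complete a word, dashed lines for failure edges) are assumptions, and failure edges back to the root are left out to reduce clutter. It also assumes dictionary words contain only plain letters, so no escaping is done in labels.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a Graphviz DOT dump; node names and styling are assumptions.
class DotDumpSketch {
    static String toDot(List<String> dictionary) {
        List<Map<Character, Integer>> edges = new ArrayList<>();
        List<Integer> fail = new ArrayList<>();
        List<String> word = new ArrayList<>(); // word ending at each state, or null
        edges.add(new TreeMap<>());
        fail.add(0);
        word.add(null);
        for (String w : dictionary) { // build the trie; state 0 is the root
            int s = 0;
            for (char c : w.toCharArray()) {
                Integer t = edges.get(s).get(c);
                if (t == null) {
                    t = edges.size();
                    edges.get(s).put(c, t);
                    edges.add(new TreeMap<>());
                    fail.add(0);
                    word.add(null);
                }
                s = t;
            }
            word.set(s, w);
        }
        // Breadth-first pass for failure edges; depth-1 states fail to the root.
        ArrayDeque<Integer> queue = new ArrayDeque<>(edges.get(0).values());
        while (!queue.isEmpty()) {
            int s = queue.remove();
            for (Map.Entry<Character, Integer> e : edges.get(s).entrySet()) {
                int f = fail.get(s);
                while (f != 0 && !edges.get(f).containsKey(e.getKey())) {
                    f = fail.get(f);
                }
                int child = e.getValue();
                fail.set(child, edges.get(f).getOrDefault(e.getKey(), 0));
                queue.add(child);
            }
        }
        StringBuilder dot = new StringBuilder("digraph aho_corasick {\n");
        for (int s = 0; s < edges.size(); s++) {
            if (word.get(s) != null) { // state that completes a dictionary word
                dot.append("  ").append(s).append(" [shape=doublecircle, xlabel=\"")
                   .append(word.get(s)).append("\"];\n");
            }
            for (Map.Entry<Character, Integer> e : edges.get(s).entrySet()) {
                dot.append("  ").append(s).append(" -> ").append(e.getValue())
                   .append(" [label=\"").append(e.getKey()).append("\"];\n");
            }
            if (fail.get(s) != 0) { // failure edges to the root are omitted
                dot.append("  ").append(s).append(" -> ").append(fail.get(s))
                   .append(" [style=dashed];\n");
            }
        }
        return dot.append("}\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(toDot(Arrays.asList("he", "she", "his", "hers")));
    }
}
```

Saving the output to a file and running Graphviz on it, for example with dot -Tpng machine.dot -o machine.png, renders the graph.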

Figure 1: Aho-Corasick state machine for dictionary [he, she, his, hers]

Figure 2 shows how the time taken to construct the state machine varies with the number of dictionary words. Only three data points were taken, but they are consistent with the expected linear relationship.

Figure 2: Aho-Corasick state machine construction time

Figure 3 shows how the performance of the Aho-Corasick implementation varies as the size of the corpus increases. The relationship appears linear and, for the most part, insensitive to the dictionary size. Deviations are likely attributable to poor sampling and high variance between test runs.

Source code for this implementation is available here.
