Tuesday, March 27, 2012

Data-Intensive Text Processing with MapReduce

This book is compact and intense but is an insightful and powerful demonstration as to how a problem may be decomposed to fit the MapReduce paradigm. Equally important, it describes the types of problem that are not suited to decomposition as MapReduce jobs. It covers in detail the use of MapReduce in text indexing, graph algorithms, and expectation maximization, but the techniques described could easily be applied to a wide range of applications. I was able to turn the pseudo code snippets, together with Hadoop: The Definitive Guide, into working examples in a relatively short time.

For me, this book filled in the blanks with respect to how to apply MapReduce to my own algorithms and data.

Thursday, March 22, 2012

Text Indexing with Aho-Corasick

The Aho-Corasick string matching algorithm is a kind of dictionary matching search algorithm. It was originally proposed as an alternative to indexing as a means of speeding up bibliographic search. That was back in 1975 before the World Wide Web and ensuing information explosion demanding indexing in some form or other to make real-time information retrieval practical. The Aho-Corasick algorithm, however, has some interesting properties which make it attractive for use as an indexing scanner.

The algorithm constructs a state machine from a collection of dictionary words. The state machine is in-effect, a reduced-grammar regular expression parser and can be used to scan text for the dictionary words in a single pass. The machine state transitions (edges) trigger on encountering a specific letter in the input stream. Machine states (nodes) can emit one or more dictionary words if the path leading to the state encodes all of the letters of the dictionary word in order. Failure edges transition from a state for which no outgoing edge matches the next next letter in the input stream, to a state from which it it still may be possible to match a dictionary word given the letters already encountered in the stream.

The time taken to construct the state machine is proportional to the sum of the lengths of all dictionary words. This cost however can be amortized over the life of the state machine and a single state machine can be used to parse multiple texts concurrently if the implementation uses independent iterators to track state transitions through the machine. The number of state transitions required for an Aho-Corasick state machine to scan a document is independent of the size of the dictionary. This means that Aho-Corasick method scales very well to large dictionaries, the limiting factor being the space required to hold the state machine in memory.

As a proof-of-concept, we implemented the Aho-Corasick algorithm in Java and ran some benchmark tests. For debugging puposes we implemented a method to dump a state machine to Graphviz DOT format. The visualization of a state machine constructed with dictionary [he, she, his, hers] is shown in Figure 1. The background image for this blog title is the visualization of a state machine constructed with a 100 word dictionary - not very practical to follow but makes an interesting graphic.

Figure 1: Aho-Corasick state machine for dictionary [he, she, his, hers]

Figure 2 shows how time taken to construct the state machine varies with the number of dictionary words. Only 3 data points were taken but the relationship is clearly linear.

Figure 2: Aho-Corasick state machine construction time

Figure 3 shows how the performance of the Aho-Corasick implementation varies as the size of the corpus increases. The relationship appears linear and, for the most part, insensitive to the dictionary size. Deviations are likely attributable to poor sampling and high variance between test runs.

Source code for this implementation is available here.

Wednesday, March 21, 2012

Hadoop: The Definitive Guide

This is a great introduction to MapReduce, Hadoop, and the HDFS. A programmer with basic Java knowledge could have most of the the code examples up and running in a few hours. That said, it is a broad topic and impossible to cover in the scope of a single book. I would have preferred more coverage of the MapReduce paradigm and briefer coverage of the Hadoop add-on projects like Pig, Hive, and ZooKeeper. Also, the book left a few gaps for me with respect to preparing input data to leverage the distributed filesystem.

All in all, a well written and very informative book. I found Data-Intensive Text Processing with MapReduce an excellent companion to this book for more detail on MapReduce.