Overview

Visual MOA is a continuing set of experiments exploring alternative ways to interact with a digitized corpus. Instead of concentrating on full-text search, Visual MOA focuses on ways to reveal historical connections.

The body of work used for this demonstration comes from the Cornell University Library's ambitious project "The Making of America." Cornell digitized over 900,000 pages of US publications from the 19th century. This demonstration uses a subset of that corpus: 955 journal volumes, or around 700,000 pages. Each journal was deconstructed to the sentence level and analyzed. The resulting dataset contains over 21 million indexed sentences.

A brief introduction to the tools developed so far, with screenshots:

Sentence Search

Find connections between terms at the sentence level. The tool searches the more than 21 million sentences of the corpus and displays a term's use over time (n-gram). When a year is selected, the tool finds connected nouns and verbs occurring in that year and lets you view the sentences in which the connected terms occur. You can then expand the context of a sentence to see more of the article's text, or view the sentence on the MOA archive website.
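As a rough illustration of the idea, the sketch below counts how many sentences per year contain a term and then tallies the words that share a sentence with it in one selected year. It is not the tool's actual implementation: the sample records and the function names are hypothetical, and a small stopword list stands in for the noun and verb filtering, which would need a part-of-speech tagger.

```python
from collections import Counter

# Hypothetical sample records: (publication year, sentence text).
SENTENCES = [
    (1861, "The railroad reached the river town."),
    (1861, "A new railroad bridge crossed the river."),
    (1865, "The railroad carried soldiers home."),
]

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase a sentence and strip surrounding punctuation from each word."""
    stripped = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in stripped if w]


def yearly_counts(sentences, term):
    """Count, per year, the sentences that contain the term."""
    counts = Counter()
    for year, text in sentences:
        if term in tokenize(text):
            counts[year] += 1
    return dict(counts)


def connected_terms(sentences, term, year,
                    stopwords=frozenset({"the", "a", "new"})):
    """Tally words that share a sentence with the term in the given year.

    The stopword set is an assumption standing in for noun/verb filtering.
    """
    co_occurring = Counter()
    for sentence_year, text in sentences:
        words = set(tokenize(text))
        if sentence_year == year and term in words:
            co_occurring.update(words - {term} - stopwords)
    return co_occurring
```

Calling `yearly_counts(SENTENCES, "railroad")` gives the per-year series that a timeline could plot, and `connected_terms(SENTENCES, "railroad", 1861)` gives the candidate connections for that year.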

Article Tree

Using the term frequency of nouns, the article tree finds connections at the article level. Starting from an initial search word, the tool locates articles that are about the initial term and other connecting subjects. You can expand a secondary term to view the articles that are about both terms. Over 65,000 articles have been indexed, and using a scoring algorithm the tool displays what it thinks is the core, or most succinct, sentence of each article. The tool also lets you link to the MOA archive to read the article.
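The scoring algorithm is not described here, so the following is only one plausible sketch of how a "core" sentence might be chosen: each sentence is scored by the average article-wide frequency of its words, and the highest-scoring sentence wins. The stopword list, the sample article, and the function names are assumptions made for illustration.

```python
from collections import Counter

# Assumed stopword list; a real system would likely keep nouns only.
STOPWORDS = frozenset({"the", "a", "of", "and", "in", "to", "was", "is", "it"})


def content_words(sentence):
    """Return the lowercased, punctuation-stripped, non-stopword tokens."""
    cleaned = (w.strip(".,;:!?\"'()").lower() for w in sentence.split())
    return [w for w in cleaned if w and w not in STOPWORDS]


def core_sentence(sentences):
    """Pick the sentence whose words are, on average, most frequent in the article."""
    term_freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        tokens = content_words(sentence)
        if not tokens:
            return 0.0
        return sum(term_freq[w] for w in tokens) / len(tokens)

    return max(sentences, key=score)


# Hypothetical three-sentence article.
ARTICLE = [
    "The canal opened in spring.",
    "Canal trade enriched the canal towns.",
    "Farmers welcomed it.",
]
```

Averaging rather than summing keeps long sentences from winning on length alone, which fits the goal of finding a succinct sentence.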

Author Network

This tool provides a way to explore the corpus through the authors of the articles. Adding an author to the network also adds the subjects that author wrote about as child nodes. You can expand the subject nodes to reveal other authors who wrote on the same topics. The tool lets you view the author-filtered articles on a subject and links to them in the MOA archive.
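The network can be thought of as a two-way index between authors and subjects. The sketch below shows that structure under simple assumptions: the author names, subjects, and function names are all hypothetical, and how the tool actually assigns subjects to articles is not covered.

```python
from collections import defaultdict

# Hypothetical (author, subject) pairs extracted from indexed articles.
PAIRS = [
    ("A. Smith", "railroads"),
    ("A. Smith", "canals"),
    ("B. Jones", "railroads"),
    ("C. Lee", "canals"),
    ("C. Lee", "cotton"),
]


def build_index(pairs):
    """Build author -> subjects and subject -> authors lookups."""
    by_author = defaultdict(set)
    by_subject = defaultdict(set)
    for author, subject in pairs:
        by_author[author].add(subject)
        by_subject[subject].add(author)
    return by_author, by_subject


def add_author(by_author, author):
    """Return the subjects that would appear as child nodes of an author."""
    return sorted(by_author.get(author, ()))


def expand_subject(by_subject, subject, exclude=None):
    """Return the other authors who wrote on a subject."""
    return sorted(a for a in by_subject.get(subject, ()) if a != exclude)
```

For example, after `by_author, by_subject = build_index(PAIRS)`, adding "A. Smith" yields the child nodes "canals" and "railroads", and expanding "railroads" while excluding "A. Smith" reveals "B. Jones".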

100K Word Frequency

This tool provides a way to cycle through the 100,000 most frequently used words in the corpus. You can filter the list and expand the context of a word to see an example of its use in a sentence.
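A frequency list with an example sentence per word can be sketched in a few lines. This is a minimal illustration, not the tool's code: the function name is hypothetical, tokenization is a naive whitespace split, and the first sentence in which a word appears is kept as its example.

```python
from collections import Counter


def top_words(sentences, n):
    """Return (word, count, example sentence) for the n most frequent words."""
    counts = Counter()
    example = {}
    for sentence in sentences:
        for raw in sentence.lower().split():
            word = raw.strip(".,;:!?\"'()")
            if word:
                counts[word] += 1
                # Keep the first sentence seen as the usage example.
                example.setdefault(word, sentence)
    return [(word, count, example[word]) for word, count in counts.most_common(n)]
```

With a real corpus, `n` would be 100,000 and the example lookup would more likely be a database query than an in-memory dictionary.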

OCR Correction Interface

This tool is the interface to a system that attempts to correct OCR errors throughout the corpus with the aid of user input. Over 7 million unique words are found in the corpus, the vast majority of them the result of OCR errors. The interface prompts the user to identify whether a suspected OCR error is a false positive and, if not, prompts for the corrected word. Further details on the system can be read in the Notes section above.
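The review loop described above might look something like the sketch below. It assumes a suspected error is simply a token missing from a known-word list, and it uses Python's standard `difflib.get_close_matches` to offer candidate corrections; the actual detection method is described in the Notes, not here. The word list, function names, and record layout are hypothetical.

```python
import difflib

# Hypothetical stand-in for a dictionary of known-good words.
KNOWN_WORDS = {"the", "river", "rose"}


def suspected_errors(tokens, known=KNOWN_WORDS):
    """Flag tokens that do not appear in the known-word list."""
    return sorted({t for t in tokens if t not in known})


def suggest(word, known=KNOWN_WORDS):
    """Offer up to three similar known words as candidate corrections."""
    return difflib.get_close_matches(word, sorted(known), n=3, cutoff=0.6)


def review(word, is_false_positive, correction=None):
    """Record a user's decision on one suspected error."""
    if is_false_positive:
        # The word is real; it was wrongly flagged.
        return {"word": word, "status": "valid", "correction": None}
    return {"word": word, "status": "corrected", "correction": correction}
```

A flagged token such as "tbe" would be shown to the user along with the output of `suggest("tbe")`, and the user's answer would be stored with `review`.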

About

Visual MOA is a project by Matthew Miller, an Information Science student at Pratt Institute.