Challenges of Early Modern Texts

The humanist scholar faces unique problems when automating the study of early modern texts. This section details the historical and technological complications VEP has faced while generating curated data sets of early modern texts and building visual analytic tools to explore topic models of corpora.

Early Modern Typographical Features

One cannot dispute the technological and cultural significance of the printing press. What is generally called the "early modern" period in English history coincides with the era of moveable type, from 1450 to 1800. During this time, print conventions emerged from those of the manuscript, after which print, manuscript, and codex features dynamically contributed to the printed books we are familiar with today.

Nonstandard Spelling

English speakers and readers often lament the complexity of English orthography, its standard spelling system. This frustration results in part from sound changes that began in the mid-14th century, when English underwent the Great Vowel Shift, a systematic change in the pronunciation of vowels that continued until the eighteenth century. The introduction of the printing press allowed written forms of the English language to reach a wider audience. As a result, spelling, which had varied regionally, began to standardize in the 15th and 16th centuries. The orthographic system English writers and readers have inherited is rooted in Middle English spellings that grappled with the pronunciation changes of the Great Vowel Shift.

Spelling variety of the kind found in early modern texts complicates automated reading. Standard text analysis assumes standardized spelling, so pre-standardized conventions defeat it: while a human reader can recognize ivory, ivorie, and juorie as the same word, standard text analysis treats each spelling as a separate entity. An individual who wants to study the connotations of ivory in Early Modern England would therefore need to know the typographical variants, such as the interchangeability of i and j, in order to search a corpus with nonstandard spellings.
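The effect on word counts can be illustrated with a minimal Python sketch. The variant table and the function name below are hypothetical and exist only for this illustration; they do not represent any particular tool's method, and a real project would draw its variants from a curated dictionary.

```python
from collections import Counter

# Hypothetical variant table for illustration only: it maps each
# early modern spelling to the modern form it should be counted under.
VARIANTS = {
    "ivorie": "ivory",
    "juorie": "ivory",
}


def count_tokens(tokens, variants=None):
    """Count lowercased tokens, mapping known variants to a modern form first."""
    variants = variants or {}
    return Counter(variants.get(t.lower(), t.lower()) for t in tokens)


sample = "Ivory and ivorie and juorie".split()

# Without a variant table, each spelling is counted as a different word.
raw = count_tokens(sample)

# With the table, the three spellings are counted as one word.
merged = count_tokens(sample, VARIANTS)
```

Here raw records ivory, ivorie, and juorie once each, while merged records ivory three times. A lookup table only covers spellings someone has already listed, which is why larger corpora call for dedicated modernization tools.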

Recent efforts like the CIC CLI Virtual Modernisation Project (CIC) have begun to address the difficulties of nonstandard spelling. CIC made texts in the Early English Books Online (EEBO) database easier to search by programming the database to handle spelling variants at search time for the user. For the scholar working with text files of early modern writing, however, nonstandard spelling remains a problem for textual analysis. To address the issue for our data set, we used VARD, a Java program that modernizes Early Modern English spellings into Standard British English spellings. This modernization tool helps researchers improve the accuracy of text analysis of historical corpora.

Typography

The printed book emerged from ornate manuscript and manuscript codex traditions. Designers modeled type fonts after popular handwriting styles (gothic, chancery, italic, and secretary). The evolution of fonts and codex conventions poses challenges for automated reading. Below are several difficulties of dealing with early modern typography.

Font: The variety of fonts in early modern texts poses problems for optical character recognition and hand-keying. Different fonts signal varying registers of meaning and prestige. For example, blackletter is often found in printed legal documents. Texts also rely heavily on roman and italic type.

Special Characters: Early modern texts display a range of special characters and ornate symbols, which require decisions on how to include them in the text. Common examples are the pilcrow ( ¶ ) and the long s ( ſ ).

Letter Interchangeability: The English alphabet did not always contain 26 letters. Letters I, J, U, V, and W have a complex history.

The sound represented by the contemporary W came into written English when monks began to write Anglo-Saxon phonetically with the roman alphabet. Since the roman alphabet did not contain a character for the sound, monks borrowed the Anglo-Saxon rune wynn ( ƿ ). After the French invaded England in 1066, the rune was replaced by UU or VV, which the invaders used instead.

I and J were not distinguished as separate letters and sounds until relatively late, and the same is true of U and V. J and I were used interchangeably in writing and print, as were U and V. These letters began to be used separately during the 16th century. U and V did not fully diverge into separate letters until the 18th century.
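The special characters and letter interchangeability described above can be handled at search time with a short Python sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the long s is mapped to a plain s, the pilcrow is simply dropped, and i/j and u/v are treated as interchangeable. A project might reasonably make different decisions, such as preserving the pilcrow as a paragraph marker.

```python
import re

# Assumed policy for this sketch: long s becomes s, the pilcrow is removed.
SPECIAL = str.maketrans({"ſ": "s", "¶": ""})

# Letters treated as interchangeable when searching (an assumption;
# the UU/VV forms of w are not handled here).
GROUPS = {"i": "ij", "j": "ij", "u": "uv", "v": "uv"}


def normalize_special(text):
    """Replace or drop the special characters listed in SPECIAL."""
    return text.translate(SPECIAL)


def variant_pattern(word):
    """Build a regex matching `word` with i/j and u/v interchanged."""
    parts = []
    for ch in word.lower():
        group = GROUPS.get(ch)
        parts.append(f"[{group}]" if group else re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


def find_variants(word, text):
    """Return every match of `word` or its letter variants in `text`."""
    return variant_pattern(word).findall(normalize_special(text))
```

For example, find_variants("love", "I loue thee and love thee") returns both loue and love. The pattern covers only letter interchange: a spelling such as juorie for ivory also differs in its ending, so it would still need a variant dictionary or a modernization tool.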