Pipeline

VEP Scripts
div-divider:
- processes TEI-formatted XML files and extracts all DIV objects from a file, dividing them into independent files named after their DIV types
- allows extraction of specific DIV types (e.g., "play")
- naming rule for output files: [name]_[global_DIV_No.]_[type]_[type_DIV_No.]_[level]
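A minimal Python sketch of how the naming rule above could be applied. It is not the actual div-divider code. It assumes that DIV elements use tags such as DIV1 or DIV2 with a TYPE attribute, that both counters start at 1 and follow document order, and that [level] is the DIV nesting depth. The function name div_output_names and the sample file name A12345 are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: DIV elements are tagged DIV, DIV1, DIV2, ... as in TCP files.
DIV_TAG = re.compile(r"^DIV\d*$", re.IGNORECASE)


def div_output_names(xml_text, name, wanted_types=None):
    """Return (output_name, element) pairs for each DIV, in document order."""
    root = ET.fromstring(xml_text)
    per_type = defaultdict(int)  # running count for each DIV type
    results = []
    global_no = 0                # running count across all DIVs

    def walk(elem, level):
        nonlocal global_no
        if DIV_TAG.match(elem.tag):
            level += 1
            global_no += 1
            div_type = elem.get("TYPE", "unknown").replace(" ", "-")
            per_type[div_type] += 1
            # Counters advance for every DIV; the filter only limits output.
            if wanted_types is None or div_type in wanted_types:
                out = f"{name}_{global_no}_{div_type}_{per_type[div_type]}_{level}"
                results.append((out, elem))
        for child in elem:
            walk(child, level)

    walk(root, 0)
    return results


sample = """<TEXT><BODY>
<DIV1 TYPE="play"><DIV2 TYPE="act"><P>one</P></DIV2><DIV2 TYPE="act"><P>two</P></DIV2></DIV1>
<DIV1 TYPE="poem"><P>three</P></DIV1>
</BODY></TEXT>"""

all_names = [n for n, _ in div_output_names(sample, "A12345")]
play_names = [n for n, _ in div_output_names(sample, "A12345", {"play"})]
```

Each returned element could then be written to its own file with ET.ElementTree(element).write(...). Passing a set such as {"play"} corresponds to extracting only specific DIV types.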
div-merger:
- sequentially merges XML elements in an input folder and outputs an XML file that contains all of the XML elements within a <COLLECTION> tag
- user must specify name for output file
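The merge step can be sketched in Python as follows. This is an illustration under stated assumptions, not the div-merger source: "sequentially" is taken to mean sorted file-name order, only files ending in .xml are read, and the function name merge_xml_folder and the sample file names are hypothetical.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def merge_xml_folder(input_dir, output_path):
    """Wrap the root element of every .xml file in input_dir in <COLLECTION>."""
    out_resolved = Path(output_path).resolve()
    collection = ET.Element("COLLECTION")
    for path in sorted(Path(input_dir).glob("*.xml")):
        if path.resolve() == out_resolved:
            continue  # do not merge a previous output file into itself
        collection.append(ET.parse(str(path)).getroot())
    ET.ElementTree(collection).write(
        str(output_path), encoding="utf-8", xml_declaration=True
    )
    return len(collection)  # number of merged elements


# Small demonstration in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a_1.xml").write_text(
        "<DIV1 TYPE='play'><P>first</P></DIV1>", encoding="utf-8"
    )
    (folder / "a_2.xml").write_text(
        "<DIV1 TYPE='poem'><P>second</P></DIV1>", encoding="utf-8"
    )
    out = folder / "merged.xml"
    count = merge_xml_folder(folder, out)
    merged = ET.parse(str(out)).getroot()
    merged_types = [child.get("TYPE") for child in merged]
```

The output path is a required argument here, mirroring the requirement that the user name the output file.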
pre-VARDer:
- prepares TCP XML files for VARD
- removes XML reserved characters (<, >, &, %)
- removes XML comments and TEI XML tags that can interrupt words: <SEG>, <SUB>, <SUP>
- transforms non-ASCII characters into ASCII alternatives (e.g., "naïve" to "naive")
- removes TCP end-of-line hyphen characters (vertical bar: |, broken vertical bar: ¦)
- replaces TCP illegible characters (bullet: •) with carets (^)
- replaces TCP unrecognizable punctuation (small black square: ▪) with asterisks (*)
- replaces non-ASCII characters not given ASCII equivalents (e.g., pilcrow: ¶) with at signs (@)
- replaces TCP missing word symbol (lozenge in brackets: ◊) with ellipses in parentheses ((...))
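The character-level substitutions listed above can be sketched in Python. This is a rough illustration, not the pre-VARDer implementation: the order of the steps, the use of Unicode NFKD decomposition to find ASCII alternatives, and the names TCP_SYMBOLS, to_ascii, and pre_vard are all assumptions. Removal of reserved characters is left out because the list does not say whether it applies to markup or only to text content, and only the lozenge itself is replaced, not any surrounding brackets.

```python
import re
import unicodedata

# TCP transcription symbols and their replacements, as listed above.
TCP_SYMBOLS = {
    "|": "",            # end-of-line hyphen marker (vertical bar)
    "\u00a6": "",       # end-of-line hyphen marker (broken vertical bar)
    "\u2022": "^",      # bullet: illegible character
    "\u25aa": "*",      # small black square: unrecognizable punctuation
    "\u25ca": "(...)",  # lozenge: missing word
}


def to_ascii(ch):
    """Return an ASCII alternative for ch, or '@' if none can be derived."""
    if ord(ch) < 128:
        return ch
    # Assumption: strip diacritics via NFKD, e.g. U+00EF (i with diaeresis) -> i.
    base = unicodedata.normalize("NFKD", ch).encode("ascii", "ignore").decode("ascii")
    return base or "@"


def pre_vard(text):
    # Remove XML comments.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    # Remove word-interrupting tags but keep the text they enclose.
    text = re.sub(r"</?(?:SEG|SUB|SUP)\b[^>]*>", "", text)
    # Replace TCP symbols before the generic ASCII pass.
    for symbol, replacement in TCP_SYMBOLS.items():
        text = text.replace(symbol, replacement)
    return "".join(to_ascii(ch) for ch in text)


cleaned = pre_vard("na\u00efve wor|d <SUP>th</SUP>e \u2022 \u25aa \u00b6 \u25ca")
```

The TCP symbols are handled before the ASCII pass so that they receive their specific replacements rather than the generic at sign.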
tei-decoder:
- flexibly eliminates XML tags, their attributes, and their content to produce plain text
- uses a config file that specifies behavior for XML tags, i.e., determines what text to print to a new file
- generates a CSV of TEI metadata for the corpus
- EEBO-TCP cheat sheet for TEI XML tags
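The config-driven behavior of tei-decoder can be illustrated with a short Python sketch. The real config file format is not described here, so the rules dictionary below, its "keep" and "drop" actions, and the function name extract_text are hypothetical stand-ins for whatever the tool actually reads.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: tags mapped to "drop" lose their content entirely;
# any tag not listed has its markup removed but its text kept.
RULES = {"NOTE": "drop", "STAGE": "drop"}


def extract_text(elem, rules, default="keep"):
    """Return the plain text of elem according to per-tag rules."""
    if rules.get(elem.tag, default) == "drop":
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(extract_text(child, rules, default))
        # Tail text follows the child's closing tag and belongs to the parent,
        # so it is kept even when the child itself is dropped.
        parts.append(child.tail or "")
    return "".join(parts)


sample = (
    "<TEXT><P>To be<NOTE>gloss</NOTE>, or not <HI>to be</HI></P>"
    "<STAGE>Exit</STAGE></TEXT>"
)
root = ET.fromstring(sample)
plain = extract_text(root, RULES)
everything = extract_text(root, {})
```

With the sample rules, notes and stage directions are left out of the plain text; with an empty rule set, all text content is printed.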
VARD:
- software that standardizes Early Modern English spelling across corpora
- version 2.5.4