Pipeline

VEP Scripts

div-divider:

processes TEI-formatted XML files, extracting every DIV object from a file into its own independent file named after its DIV type
allows extraction of specific DIV types (e.g., "play")
naming rule for output files: [name]_[global_DIV_No.]_[type]_[type_DIV_No.]_[level]
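
The extraction and naming rule above can be sketched in Python. This is a minimal illustration, not the script's actual code. It assumes that DIV elements are tagged DIV, DIV1, DIV2, and so on, that the type lives in a TYPE attribute, and that [level] means nesting depth (1 for an outermost DIV). The function name divide_divs and its parameters are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: DIV elements are tagged DIV, DIV1, DIV2, ... (case-insensitive).
DIV_TAG = re.compile(r"^DIV\d*$", re.IGNORECASE)


def divide_divs(xml_text, name, wanted_types=None):
    """Return {output_name: xml_string} for each DIV, in document order.

    Output names follow [name]_[global_DIV_No.]_[type]_[type_DIV_No.]_[level].
    If wanted_types is given, only DIVs of those types are returned, but the
    counters still advance for every DIV encountered (an assumption).
    """
    root = ET.fromstring(xml_text)
    outputs = {}
    per_type = Counter()   # running count of DIVs seen per type
    global_no = 0          # running count of all DIVs seen

    def walk(elem, level):
        nonlocal global_no
        if DIV_TAG.match(elem.tag):
            level += 1
            global_no += 1
            # Assumption: the DIV type is stored in a TYPE attribute.
            div_type = elem.get("TYPE", "unknown").replace(" ", "-")
            per_type[div_type] += 1
            if wanted_types is None or div_type in wanted_types:
                key = f"{name}_{global_no}_{div_type}_{per_type[div_type]}_{level}"
                outputs[key] = ET.tostring(elem, encoding="unicode")
        for child in elem:
            walk(child, level)

    walk(root, 0)
    return outputs


sample = (
    '<TEXT><DIV1 TYPE="play">'
    '<DIV2 TYPE="act"><P>a</P></DIV2>'
    '<DIV2 TYPE="act"><P>b</P></DIV2>'
    '</DIV1></TEXT>'
)
for output_name in divide_divs(sample, "A00001"):
    print(output_name)
# prints:
# A00001_1_play_1_1
# A00001_2_act_1_2
# A00001_3_act_2_2
```

Passing wanted_types={"play"} would keep only the first entry, which mirrors extracting a specific DIV type.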

div-merger:

sequentially merges the XML files in an input folder into a single XML file that wraps all of their elements in a <COLLECTION> tag
user must specify name for output file
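
A rough Python sketch of this merge step follows. It assumes that "sequentially" means sorted filename order, that the inputs are *.xml files encoded as UTF-8, and that any XML declaration at the top of an input is dropped so the merged file stays well-formed. The function name merge_folder is hypothetical.

```python
import re
from pathlib import Path

# Matches a leading XML declaration such as <?xml version="1.0"?>.
XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>")


def merge_folder(input_dir, output_path):
    """Concatenate every *.xml file in input_dir inside one <COLLECTION> tag.

    Returns the number of files merged. Files are taken in sorted filename
    order (an assumption about what "sequentially" means).
    """
    out_path = Path(output_path).resolve()
    files = sorted(
        p for p in Path(input_dir).glob("*.xml")
        if p.resolve() != out_path  # never merge a previous output into itself
    )
    pieces = ["<COLLECTION>"]
    for path in files:
        text = path.read_text(encoding="utf-8")
        # Drop the per-file XML declaration; it is only legal at the very top.
        pieces.append(XML_DECL.sub("", text, count=1).strip())
    pieces.append("</COLLECTION>")
    out_path.write_text("\n".join(pieces) + "\n", encoding="utf-8")
    return len(files)
```

A call such as merge_folder("divs", "merged.xml") would write merged.xml with each input's root element as a child of <COLLECTION>.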

pre-VARDer:

prepares TCP XML files for VARD
removes XML reserved characters (<, >, &, %)
removes XML comments and TEI XML tags that can interrupt words: <SEG>, <SUB>, <SUP>
transforms non-ASCII characters into ASCII alternatives (e.g., "naïve" to "naive")
removes TCP end-of-line hyphen characters (vertical bar: |, broken vertical bar: ¦)
replaces TCP illegible characters (bullet: •) with carets (^)
replaces TCP unrecognizable punctuation (small black square: ▪) with asterisks (*)
replaces non-ASCII characters not given ASCII equivalents (e.g., pilcrow: ¶) with at signs (@)
replaces TCP missing word symbol (lozenge in brackets: ◊) with an ellipsis in parentheses ((...))
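
The character rules above can be approximated in a few lines of Python. This is a sketch under stated assumptions rather than the script's real logic: it treats its input as a text fragment, applies the rules in the order shown in the code (the list above does not say how the real script orders them or how it keeps other markup intact), and uses Unicode NFKD decomposition as a stand-in for whatever ASCII mapping table the script actually uses. The names pre_vard and to_ascii are hypothetical.

```python
import re
import unicodedata

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Tags that can interrupt a word; only the tags go, their text is kept.
WORD_TAGS = re.compile(r"</?(?:SEG|SUB|SUP)\b[^>]*>", re.IGNORECASE)

SYMBOLS = {
    "|": "",            # TCP end-of-line hyphen (vertical bar)
    "\u00a6": "",       # TCP end-of-line hyphen (broken vertical bar)
    "\u2022": "^",      # bullet: illegible character
    "\u25aa": "*",      # small black square: unrecognizable punctuation
    "\u25ca": "(...)",  # lozenge: missing word
}


def to_ascii(ch):
    """Map one character to ASCII, or to '@' if no ASCII form is found."""
    if ord(ch) < 128:
        return ch
    # Assumption: NFKD decomposition stands in for the real mapping table.
    decomposed = unicodedata.normalize("NFKD", ch)
    ascii_part = "".join(c for c in decomposed if ord(c) < 128)
    return ascii_part or "@"


def pre_vard(text):
    text = COMMENT.sub("", text)        # XML comments
    text = WORD_TAGS.sub("", text)      # <SEG>, <SUB>, <SUP>
    text = re.sub(r"[<>&%]", "", text)  # reserved characters listed above
    for symbol, replacement in SYMBOLS.items():
        text = text.replace(symbol, replacement)
    return "".join(to_ascii(ch) for ch in text)


print(pre_vard("na\u00efve wo<SUP>r</SUP>d ex|ample \u00b6"))
# prints: naive word example @
```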

tei-decoder:

flexibly eliminates XML tags, their attributes, and their content to produce plain text
uses a config file that specifies behavior for XML tags, i.e., determines what text to print to a new file
generates a CSV of TEI metadata for the corpus
EEBO-TCP cheat sheet for TEI XML tags
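
The config-driven tag handling can be pictured with the Python sketch below. The config format, the "keep" and "drop" action names, and the function decode are assumptions made for illustration. The script's real config file may look quite different, and the metadata CSV step is not covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical config: tag name -> action.
#   "keep": print the element's text and recurse into its children
#   "drop": discard the element together with all of its content
CONFIG = {"SPEAKER": "drop", "NOTE": "drop", "STAGE": "drop"}


def decode(elem, config, default="keep"):
    """Return the plain text of elem, honoring per-tag actions in config."""
    parts = []
    if config.get(elem.tag, default) == "keep":
        if elem.text:
            parts.append(elem.text)
        for child in elem:
            parts.append(decode(child, config, default))
            # Text after a child's closing tag belongs to the parent,
            # so it survives even when the child itself is dropped.
            if child.tail:
                parts.append(child.tail)
    return "".join(parts)


xml_text = (
    "<TEXT><SP><SPEAKER>Ham.</SPEAKER>"
    "<L>To be, <NOTE>gloss</NOTE>or not to be</L></SP></TEXT>"
)
print(decode(ET.fromstring(xml_text), CONFIG))
# prints: To be, or not to be
```

Keeping the tag behavior in a plain mapping means a different corpus only needs a different config, not a change to the traversal.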