Pipeline

 

VEP Scripts

div-divider:

  • processes TEI-formatted XML files and extracts all DIV objects from a file, dividing them into independent files named after their DIV types
  • allows extraction of specific DIV types (e.g., "play")
  • naming rule for output files: [name]_[global_DIV_No.]_[type]_[type_DIV_No.]_[level]
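
The naming rule above can be illustrated with a minimal Python sketch. This is not the actual div-divider code. It assumes un-namespaced TCP-style tags (DIV, DIV1, DIV2, ...) with a TYPE attribute, and it interprets [global_DIV_No.] as a running count over all DIVs, [type_DIV_No.] as a running count per DIV type, and [level] as nesting depth. The function name `name_divs` and its parameters are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: DIV elements are tagged DIV, DIV1, DIV2, ... without namespaces.
DIV_TAG = re.compile(r"^DIV\d*$", re.IGNORECASE)


def name_divs(xml_text, name, wanted_types=None):
    """Return (output_name, element) pairs for the DIVs found in xml_text.

    Output names follow [name]_[global_DIV_No.]_[type]_[type_DIV_No.]_[level].
    If wanted_types is given, only DIVs of those types are returned, but the
    counters still run over every DIV in the file (an assumption).
    """
    root = ET.fromstring(xml_text)
    per_type = defaultdict(int)  # running count per DIV type
    global_no = 0                # running count over all DIVs
    results = []

    def walk(elem, level):
        nonlocal global_no
        if DIV_TAG.match(elem.tag):
            level += 1
            global_no += 1
            div_type = elem.get("TYPE", "unknown").replace(" ", "-")
            per_type[div_type] += 1
            if wanted_types is None or div_type in wanted_types:
                out_name = f"{name}_{global_no}_{div_type}_{per_type[div_type]}_{level}"
                results.append((out_name, elem))
        for child in elem:
            walk(child, level)

    walk(root, 0)
    return results
```

For a file named A00001 holding one "play" DIV with two nested "act" DIVs, this sketch yields A00001_1_play_1_1, A00001_2_act_1_2, and A00001_3_act_2_2. Each returned element could then be written to its own file with `ET.ElementTree(elem).write(...)`.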

div-merger:

  • sequentially merges XML elements in an input folder and outputs an XML file that contains all of the XML elements within a <COLLECTION> tag
  • user must specify name for output file
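
A rough sketch of the merge step, not the actual div-merger code: it assumes the inputs are well-formed *.xml files, and it takes "sequentially" to mean filename order. The function name `merge_folder` is hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def merge_folder(input_dir, output_file):
    """Wrap the root element of every *.xml file in input_dir in <COLLECTION>.

    Files are merged in sorted filename order (an assumption).
    Returns the number of elements merged.
    """
    collection = ET.Element("COLLECTION")
    for path in sorted(Path(input_dir).glob("*.xml")):
        collection.append(ET.parse(str(path)).getroot())
    ET.ElementTree(collection).write(
        str(output_file), encoding="utf-8", xml_declaration=True
    )
    return len(collection)
```

In this sketch the output file should be written outside the input folder, so that a later run does not pick up the merged file as one of its inputs.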

pre-VARDer:

  • prepares TCP XML files for VARD
  • removes XML reserved characters (<, >, &) and the percent sign (%)
  • removes XML comments and TEI XML tags that can interrupt words: <SEG>, <SUB>, <SUP>
  • transforms non-ASCII characters into ASCII alternatives (e.g., "naïve" to "naive")
  • removes TCP end-of-line hyphen characters (vertical bar: |, broken vertical bar: ¦)
  • replaces TCP illegible characters (bullet: •) with carets (^)
  • replaces TCP unrecognizable punctuation (small black square: ▪) with asterisks (*)
  • replaces non-ASCII characters not given ASCII equivalents (e.g., pilcrow: ¶) with at signs (@)
  • replaces TCP missing word symbol (lozenge in brackets: ◊) with ellipses in parentheses ((...))
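
The character rules above might look roughly like the following Python sketch. It is not the actual pre-VARDer code. It assumes the function is applied to a text fragment in which the only markup is comments and the SEG, SUB, and SUP tags, and it uses Unicode decomposition as a stand-in for whatever ASCII mapping table the real script uses. The names `pre_vard` and `TCP_SYMBOLS` are hypothetical.

```python
import re
import unicodedata

COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
# Opening and closing SEG/SUB/SUP tags, with or without attributes.
WORD_TAGS = re.compile(r"</?(?:SEG|SUB|SUP)\b[^>]*>", re.IGNORECASE)

TCP_SYMBOLS = {
    "|": "",        # end-of-line hyphen (vertical bar)
    "\u00a6": "",   # end-of-line hyphen (broken vertical bar)
    "\u2022": "^",  # illegible character (bullet)
    "\u25aa": "*",  # unrecognizable punctuation (small black square)
    "\u25ca": "(...)",  # missing word (lozenge)
}


def pre_vard(text):
    """Apply the clean-up rules listed above to one text fragment."""
    text = COMMENTS.sub("", text)
    text = WORD_TAGS.sub("", text)
    for ch in "<>&%":
        text = text.replace(ch, "")
    out = []
    for ch in text:
        if ch in TCP_SYMBOLS:
            out.append(TCP_SYMBOLS[ch])
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Assumption: strip diacritics via NFKD; anything that still has
            # no ASCII form becomes an at sign.
            folded = unicodedata.normalize("NFKD", ch)
            folded = folded.encode("ascii", "ignore").decode("ascii")
            out.append(folded or "@")
    return "".join(out)
```

With this sketch, `pre_vard("naïve")` gives "naive", `pre_vard("for|tune")` gives "fortune", and `pre_vard("y<SUP>e</SUP>")` gives "ye". Characters such as æ have no Unicode decomposition, so here they fall through to "@"; a real mapping table would presumably give them ASCII equivalents.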

tei-decoder:

  • flexibly eliminates XML tags, their attributes, and their content to produce plain text
  • uses a config file that specifies behavior for XML tags, i.e., determines what text to print to a new file
  • generates a CSV of TEI metadata for the corpus
  • EEBO-TCP cheat sheet for TEI XML tags
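
As an illustration of config-driven tag handling, here is a minimal Python sketch. It is not the actual tei-decoder, and its config format is invented for the example: a dictionary mapping a tag name to "keep" (print the tag's text) or "drop" (discard the tag and its content). The names `decode` and `CONFIG` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical config: tags not listed here default to "keep".
CONFIG = {"SPEAKER": "drop", "NOTE": "drop"}


def decode(elem, config, default="keep"):
    """Return the plain text of elem, skipping tags the config marks 'drop'."""
    if config.get(elem.tag, default) != "keep":
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(decode(child, config, default))
        # Text after a child's closing tag belongs to the parent, so it is
        # kept even when the child itself is dropped.
        parts.append(child.tail or "")
    return "".join(parts)
```

For example, decoding `<SP><SPEAKER>Ham.</SPEAKER><L>To be<NOTE>gloss</NOTE>, or not</L></SP>` with `CONFIG` returns "To be, or not": the tags and attributes are removed everywhere, and the content of SPEAKER and NOTE is dropped as well.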

VARD

  • software that standardizes Early Modern English spelling across corpora
  • version 2.5.4