Ubiqu+Ity 1.2 changelog
A Visualizing English Print application from the University of Wisconsin–Madison
Email Feedback About Ubiqu+Ity
DocuScope is a text analysis environment with a suite of interactive visualization tools for corpus-based rhetorical analysis. […] David created what we call the generic (default) dictionary, consisting of over 40 million linguistic patterns of English classified into over 100 categories of rhetorical effects. Suguru designed and implemented the analysis and visualization software, which can annotate a corpus of text against any dictionary of regular strings that are classified into a hierarchy of rhetorical effects.
While prior versions of DocuScope are proprietary, version 3.21 has recently been made open source and publicly available. You can download it here from GitHub.
You can learn more about DocuScope, David, and Suguru here:
The investigators and researchers of the Visualizing English Print project at the University of Wisconsin–Madison have partnered with David and Suguru to utilize DocuScope's dictionary of rules in our research and our tools, such as this one. We will keep our available dictionaries updated as further versions are released.
If you don't want to use the DocuScope dictionaries, Ubiqu+Ity allows you to create your own simple rules. Rules are declared in a CSV file that looks like this:
Ubiqu+Ity will look for exact instances of the rules that are specified. Column headings (i.e., Words, Rule on the first line of the CSV file) are optional, but recommended.
Here's an example:
Words, Rule
I have a question, GenericQuestionQuery
Have you had, PresentPerfectQuery
Stand ho, ExclamatoryStatement
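To make the rule format concrete, here is a minimal sketch of how exact-match rules like those above could be loaded and applied. The function names and matching logic are our illustration, not Ubiqu+Ity's actual implementation.

```python
import csv
import io

# The same example rules CSV shown above.
RULES_CSV = """Words, Rule
I have a question, GenericQuestionQuery
Have you had, PresentPerfectQuery
Stand ho, ExclamatoryStatement
"""

def load_rules(csv_text):
    """Read (phrase, tag) pairs, skipping the optional header row."""
    rules = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row or [cell.strip() for cell in row[:2]] == ["Words", "Rule"]:
            continue
        rules.append((row[0].strip(), row[1].strip()))
    return rules

def tag_text(text, rules):
    """Return (phrase, tag) for every rule phrase that occurs verbatim in the text."""
    return [(phrase, tag) for phrase, tag in rules if phrase in text]

rules = load_rules(RULES_CSV)
print(tag_text("Stand ho! Who is there?", rules))
# → [('Stand ho', 'ExclamatoryStatement')]
```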
In the future, we will allow users to create more complex, hierarchical rule structures.
The chunk feature allows for the comparison of tag frequencies within a text. It divides a text into chunks of a user-specified number of tokens and gives each chunk its own row in the final spreadsheet.
The "chunk distance" setting allows you to control how much the chunks overlap. Picking a distance equal to the chunk length will give you completely disjoint pieces, and picking a distance of half the chunk length will give you chunks with 50% overlap.
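The interaction of chunk length and chunk distance can be sketched as a sliding window. This is our reading of the setting described above, not the tool's own code; in particular, how Ubiqu+Ity handles a short trailing chunk is an assumption here.

```python
def chunk_tokens(tokens, chunk_length, chunk_distance):
    """Start a new chunk every `chunk_distance` tokens; each chunk spans
    up to `chunk_length` tokens. distance == length gives disjoint chunks;
    distance == length // 2 gives 50% overlap. (Keeping the short trailing
    chunk is an assumption of this sketch.)"""
    chunks = []
    for start in range(0, len(tokens), chunk_distance):
        chunk = tokens[start:start + chunk_length]
        if chunk:
            chunks.append(chunk)
    return chunks

tokens = list(range(10))
print(chunk_tokens(tokens, 4, 4))  # disjoint: [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(chunk_tokens(tokens, 4, 2))  # 50% overlap: [[0, 1, 2, 3], [2, 3, 4, 5], ...]
```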
This is designed for cases where DocuScope's rules may not give the desired results for specific words in a corpus. For example, a word like "bear" can have many distinct meanings within a single text that can't necessarily be captured through DocuScope. Blacklisting stops Ubiqu+Ity from tagging semantically ambiguous input.
If a word within a rule is blacklisted, then that word is tagged as "!BLACKLISTED" and any longer rules that contain that word are broken down into smaller rules. Note that blacklisting only works on individual words, not phrases.
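The decomposition described above can be illustrated with a small sketch. The function and its splitting strategy are hypothetical, modeling only the stated behavior: a blacklisted word becomes its own "!BLACKLISTED" unit, and the surrounding runs of words become smaller rules.

```python
def apply_blacklist(rule_words, blacklist):
    """Split a multi-word rule around blacklisted words.
    Illustrative only; not Ubiqu+Ity's internals."""
    pieces, run = [], []
    for word in rule_words:
        if word.lower() in blacklist:
            if run:                      # close off the run of ordinary words
                pieces.append(run)
                run = []
            pieces.append("!BLACKLISTED")
        else:
            run.append(word)
    if run:
        pieces.append(run)
    return pieces

print(apply_blacklist(["exit", "pursued", "by", "a", "bear"], {"bear"}))
# → [['exit', 'pursued', 'by', 'a'], '!BLACKLISTED']
```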
| Column | Description |
| --- | --- |
| text_name | file names of uploaded text files |
| chunk_index | an integer that specifies the text chunk number for chunked files; starts at 0 and increases sequentially |
| !UNRECOGNIZED | percentage of the input text that is not recognized by the DocuScope dictionary, often proper nouns or obscure words |
| !UNTAGGED | percentage of the input text that appears in at least one DocuScope rule but is left untagged because the surrounding text matches no specific rule; often common words like 'so' that have little or ambiguous meaning without neighboring words |
| !BLACKLISTED | percentage of the input text that matches blacklisted content |
| &lt;Word Tokens&gt; | total number of word tokens in a text or text chunk |
| &lt;Punctuation Tokens&gt; | number of punctuation tokens in a text or text chunk |
| &lt;Tokens&gt; | total number of tokens (word and punctuation) in a text or text chunk |
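The output described above is an ordinary CSV, so it can be read with standard tooling. The rows below are invented for illustration; only the column names come from the table above.

```python
import csv
import io

# Hypothetical output in the shape described above; the numbers are made up.
OUTPUT_CSV = """text_name,!UNRECOGNIZED,!UNTAGGED,<Word Tokens>,<Punctuation Tokens>,<Tokens>
romeo_and_juliet.txt,4.2,11.7,25712,5300,31012
hamlet.txt,3.9,12.1,29844,6120,35964
"""

for row in csv.DictReader(io.StringIO(OUTPUT_CSV)):
    # <Tokens> should equal word tokens plus punctuation tokens.
    total = int(row["<Word Tokens>"]) + int(row["<Punctuation Tokens>"])
    print(row["text_name"], total == int(row["<Tokens>"]))
```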
Here is a live sample of what normal output looks like: Ubiqu+Ity_Shakespeare-Plays_Example_Output.csv
And chunked output looks like this: Ubiqu+Ity_Shakespeare-Plays_Example_Output_Chunked.csv
You can see an example text viewer instance of Shakespeare's Romeo and Juliet that has been tagged with the DocuScope dictionary here.
Ubiqu+Ity makes use of a number of different services, including the Celery distributed task queue and the RabbitMQ message broker and results store. If you are interested in setting up your own installation, feel free to contact us.
Please send any bug reports right to us:
We are also logging any system errors that might come up.
Defect statistics are only compatible with files processed through our TEI-P4 processing pipeline. The XML files from the TCP use special characters
and XML tags to indicate illegible characters, illegible punctuation, and missing words or pages.
Our pipeline translates these XML tags and Unicode symbols into ^, *, and (...) respectively in the resulting plain-text files.
The defect statistics option then uses these characters to generate additional statistics in the CSV about the textual defects of each text in the corpus. Because it recognizes only these character patterns, this option only works with plain-text files generated by our pipeline.
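The counting implied above can be sketched in a few lines. This models the kind of tallying the defect statistics option performs on the ^, *, and (...) markers; it is not the option's actual code, and the dictionary keys are our names.

```python
def defect_stats(text):
    """Count the defect markers the pipeline writes into plain text:
    '^' for illegible characters, '*' for illegible punctuation,
    '(...)' for missing words or pages. Illustrative sketch only."""
    gaps = text.count("(...)")
    # Strip the gap markers first so their characters aren't double-counted.
    remainder = text.replace("(...)", "")
    return {
        "illegible_characters": remainder.count("^"),
        "illegible_punctuation": remainder.count("*"),
        "missing_passages": gaps,
    }

print(defect_stats("To be^ or not to be* (...) that is the question"))
# → {'illegible_characters': 1, 'illegible_punctuation': 1, 'missing_passages': 1}
```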
SlimTV tokens files have several improvements over the HTML viewer files: the HTML files were clunky and large, while tokens files are smaller and fix known bugs in the HTML output. The tokens loading file directs you to the most up-to-date version of the Serendip SlimTV.