Ubiqu+Ity 1.1
A Visualizing English Print application from the University of Wisconsin–Madison
DocuScope is a text analysis environment with a suite of interactive visualization tools for corpus-based rhetorical analysis. […] David created what we call the generic (default) dictionary, consisting of over 40 million linguistic patterns of English classified into over 100 categories of rhetorical effects. Suguru designed and implemented the analysis and visualization software, which can annotate a corpus of text against any dictionary of regular strings that are classified into a hierarchy of rhetorical effects.
You can learn more about DocuScope, David, and Suguru here:
The investigators and researchers of the Visualizing English Print project at the University of Wisconsin–Madison have partnered with David and Suguru to utilize DocuScope's dictionary of rules in our research and our tools, such as this one. At the moment, we have two versions of the dictionary available (3.83 and 3.21). We will keep our available dictionaries updated as further versions are released.
If you don't want to use the DocuScope dictionaries, Ubiqu+Ity allows you to create your own simple rules. Rules are declared in a CSV file that looks like this:
Ubiqu+Ity will look for exact instances of the rules that are specified. Column headings (i.e. Words, Rule on the first line of the CSV file) are optional, but recommended.
Here's an example:
Words, Rule
I have a question, GenericQuestionQuery
Have you had, PresentPerfectQuery
Stand ho, ExclamatoryStatement
In the future, we will allow users to create more complex, hierarchical rule structures.
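As a rough illustration of the rule format above, the following Python sketch parses a rules CSV and counts exact occurrences of each rule's words in a text. It is not Ubiqu+Ity's implementation: the function names `load_rules` and `count_matches` are hypothetical, and plain case-sensitive substring matching is an assumption standing in for whatever tokenization the tool actually applies.

```python
import csv
import io

def load_rules(csv_text):
    """Parse 'words, rule' rows, skipping an optional 'Words, Rule' header."""
    rules = {}
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        words, rule = row[0].strip(), row[1].strip()
        if (words.lower(), rule.lower()) == ("words", "rule"):
            continue  # the optional header line
        rules[words] = rule
    return rules

def count_matches(text, rules):
    """Count exact, case-sensitive occurrences of each rule's words (assumed matching)."""
    counts = {}
    for words, rule in rules.items():
        counts[rule] = counts.get(rule, 0) + text.count(words)
    return counts

RULES_CSV = """Words, Rule
I have a question, GenericQuestionQuery
Have you had, PresentPerfectQuery
Stand ho, ExclamatoryStatement
"""

rules = load_rules(RULES_CSV)
counts = count_matches("Stand ho! Have you had quiet guard? Stand ho!", rules)
print(counts)
```

Because the header row is detected by its content, the same loader accepts files with or without column headings.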
The chunk feature allows for the comparison of tag frequencies within a text. It divides a text into chunks of a user-specified number of tokens and gives each chunk its own row in the final spreadsheet.
The "chunk distance" setting allows you to control how much the chunks overlap. Picking a distance equal to the chunk length will give you completely disjoint pieces, and picking a distance of half the chunk length will give you chunks with 50% overlap.
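The relationship between chunk length and chunk distance can be sketched as a sliding window over the token list. This is a minimal illustration, not the tool's own code: the function name `chunk_tokens` is hypothetical, and keeping a shorter final chunk when the text runs out is an assumption.

```python
def chunk_tokens(tokens, chunk_length, chunk_distance):
    """Return (chunk_index, chunk) pairs; chunk i starts at token i * chunk_distance."""
    if chunk_length <= 0 or chunk_distance <= 0:
        raise ValueError("chunk_length and chunk_distance must be positive")
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_length])
        if start + chunk_length >= len(tokens):
            break  # this chunk already reaches the end of the text
        start += chunk_distance
    return list(enumerate(chunks))  # chunk_index starts at 0

tokens = list(range(8))  # stand-in for eight word tokens

# Distance equal to the chunk length: disjoint chunks.
disjoint = chunk_tokens(tokens, chunk_length=4, chunk_distance=4)

# Distance of half the chunk length: consecutive chunks share half their tokens.
overlapping = chunk_tokens(tokens, chunk_length=4, chunk_distance=2)

print(disjoint)
print(overlapping)
```

Each returned pair corresponds to one row of chunked output, with the first element playing the role of chunk_index.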
The blacklist feature is designed for instances where DocuScope's rules may not give the desired results for specific words in a corpus. For example, words like "bear" can have many distinct meanings within a single text that can't necessarily be captured through DocuScope. Blacklisting stops Ubiqu+Ity from tagging semantically ambiguous input.
If a word within a rule is blacklisted, then that word is tagged as "!BLACKLISTED" and any longer rules that contain that word are broken down into smaller rules. Note that blacklisting only works on individual words, not phrases.
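One way to picture the blacklisting behavior described above is a greedy longest-match tagger that refuses any rule spanning a blacklisted word. The sketch below is an assumption about how such a tagger could work, not a description of Ubiqu+Ity's internals: `tag_tokens`, the toy rule dictionary, and its category names are all hypothetical, and unmatched tokens are simply given `None`.

```python
def tag_tokens(tokens, rules, blacklist):
    """Greedy longest-match tagging; rules maps tuples of words to a category."""
    max_len = max((len(key) for key in rules), default=1)
    tagged = []
    i = 0
    while i < len(tokens):
        if tokens[i] in blacklist:
            # A blacklisted word is never part of a rule match.
            tagged.append(((tokens[i],), "!BLACKLISTED"))
            i += 1
            continue
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in rules and not blacklist.intersection(span):
                tagged.append((span, rules[span]))
                i += n
                break
        else:
            tagged.append(((tokens[i],), None))  # no rule matched here
            i += 1
    return tagged

# Hypothetical rules and categories, for illustration only.
rules = {
    ("the", "bear", "market"): "HypotheticalLongRule",
    ("the",): "HypotheticalArticle",
    ("market",): "HypotheticalPlace",
}
tokens = ["the", "bear", "market"]

without_blacklist = tag_tokens(tokens, rules, set())
with_blacklist = tag_tokens(tokens, rules, {"bear"})
print(without_blacklist)
print(with_blacklist)
```

With "bear" blacklisted, the three-word rule no longer applies, so the surrounding words fall back to the shorter rules that match them.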
|text_name||file names of uploaded text files|
|text_key||file name, transformed into lowercase characters without file extension|
|html_name||file name of HTML viewer file for respective text file|
|chunk_index||an integer that specifies the text chunk number for chunked files; starts at 0 and increases sequentially|
|!UNRECOGNIZED||percentage of text input that is not recognized by the DocuScope dictionary, often proper nouns or obscure words|
|!UNTAGGED||percentage of text input that is contained within at least one rule in the DocuScope dictionary but doesn't get tagged because the text around it matches no specific rule, often common words like 'so' that have little or ambiguous meaning without neighboring words|
|!BLACKLISTED||percentage of text input that matches blacklisted words|
|<Word Tokens>||total number of word tokens in a text or text chunk|
|<Punctuation Tokens>||number of punctuation tokens in a text or text chunk|
|<Tokens>||total number of tokens (Word and Punctuation) in a text or text chunk|
Here is a sample of what normal output looks like: Ubiqu+Ity_Shakespeare-Plays_Example_Output.csv
Chunked output looks like this: Ubiqu+Ity_Shakespeare-Plays_Example_Output_Chunked.csv
Here is an example HTML text viewer of Shakespeare's King Henry IV Part I, tagged with the DocuScope dictionary: Ubiqu+Ity_1_KING_HENRY_IV_Docuscope_Example_Output.html
Yes! The text viewer has a set of default colors that it uses when you toggle categories on and off, but specific colors can be specified through the URL.
To specify your own colors, add a question mark (?) to the end of your URL, followed by pairs of the following form:
category-name=color&
For instance, if we were to take our example tagged version of King Henry IV and make the Emotion_Negativity tags red and the Emotion_Positivity tags blue, we would add
?Emotion_Negativity=red&Emotion_Positivity=blue&
to the end of our URL, like so:
You can also specify hex colors:
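The query string described above can also be assembled programmatically. The following Python sketch builds one from a mapping of category names to colors, using the category-name=color& form given earlier. The helper name `add_tag_colors` is hypothetical, and the example sticks to named colors because the exact form the viewer expects for hex values is not shown here.

```python
from urllib.parse import quote

def add_tag_colors(viewer_url, colors):
    """Append category-name=color& pairs to a text viewer URL."""
    pairs = "".join(
        f"{quote(category)}={quote(color)}&" for category, color in colors.items()
    )
    return f"{viewer_url}?{pairs}"

colored_url = add_tag_colors(
    "Ubiqu+Ity_1_KING_HENRY_IV_Docuscope_Example_Output.html",
    {"Emotion_Negativity": "red", "Emotion_Positivity": "blue"},
)
print(colored_url)
```

The pairs appear in the order the dictionary lists them, so the result ends with the same `?Emotion_Negativity=red&Emotion_Positivity=blue&` suffix used in the example above.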
Ubiqu+Ity makes use of a number of different services, including the Celery distributed task queue and the RabbitMQ message broker and results store. If you are interested in setting up your own installation, feel free to contact us.
Please send those right to us:
We are also logging any system errors that might come up.
The defect statistics option is only compatible with files processed through our TEI-P4 processing pipeline. The XML files from TCP use special characters and XML tags to indicate illegible characters, punctuation, and missing words or pages.
Our pipeline translates these XML tags or Unicode symbols into ^, *, and (...) respectively in the resulting plain text files.
The defect statistics option then uses these characters to generate additional statistics in the CSV about the textual defects of each text in the corpus. Because it recognizes only these character patterns, this option works only with plain text files generated by our pipeline.
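To make the marker convention concrete, here is a minimal Python sketch that counts the three defect markers in a plain text file's contents. The function name `defect_counts` and the dictionary keys are hypothetical; the statistics and column names that Ubiqu+Ity actually writes to the CSV may differ.

```python
def defect_counts(text):
    """Count the pipeline's defect markers: ^, * and (...)."""
    return {
        "illegible_characters": text.count("^"),      # ^ marks an illegible character
        "punctuation_defects": text.count("*"),       # * marks a punctuation defect
        "missing_words_or_pages": text.count("(...)"),  # (...) marks missing words or pages
    }

sample = "Th^ king* spake (...) and le^t"
stats = defect_counts(sample)
print(stats)
```

A text that did not come from the pipeline may use these characters for other purposes, which is why counts like these are only meaningful for pipeline-generated files.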