Ubiqu+Ity 1.2 changelog

A Visualizing English Print application from the University of Wisconsin–Madison

Ubiqu+Ity generates statistics and web-based tagged text views
for your text(s), using the DocuScope dictionary or your own rules.


Upload Files

  • Break texts into fixed-size chunks
      Chunk length (in tokens)
      Distance between consecutive chunks (in tokens)
  • Text defect statistics in CSV (TCP pipeline files only)
  • Enable blacklist
  • Generate n-gram CSV
      Select up to 1-, 2-, and 3-grams, inclusive.
      Include punctuation in n-grams.
  • Generate rule metadata CSV
      Enable per-document rule metadata
  • Generate token CSV representation (for use with SerendipSlim)



Ubiqu+Ity FAQ

You can upload any of the following:

  • A single plain-text file (though it seems like that would be boring)
  • A ZIP archive containing multiple plain-text files (now we're talking)
  • Multiple plain-text files and/or multiple ZIP archives (with HTML5 multi-file uploading)

You can upload up to 50 individual files and folders and up to 50 MB of total content to Ubiqu+Ity at a time. Batch and zip larger jobs.

DocuScope was created by David Kaufer and Suguru Ishizaki in the Department of English at Carnegie Mellon University (emphasis ours):

DocuScope is a text analysis environment with a suite of interactive visualization tools for corpus-based rhetorical analysis. […] David created what we call the generic (default) dictionary, consisting of over 40 million linguistic patterns of English classified into over 100 categories of rhetorical effects. Suguru designed and implemented the analysis and visualization software, which can annotate a corpus of text against any dictionary of regular strings that are classified into a hierarchy of rhetorical effects.

While prior versions of DocuScope are proprietary, a revision of version 3.21 has recently been made open source and available to the public. You can download it from GitHub here.

You can learn more about DocuScope, David, and Suguru here:

The investigators and researchers of the Visualizing English Print project at the University of Wisconsin–Madison have partnered with David and Suguru to utilize DocuScope's dictionary of rules in our research and our tools, such as this one. We will keep our available dictionaries updated as further versions are released.

If you don't want to use the DocuScope dictionaries, Ubiqu+Ity allows you to create your own simple rules. Rules are declared in a CSV file that looks like this:

  • Column 1: Whitespace-Separated Words and Punctuation
  • Column 2: The Rule's Name

Ubiqu+Ity will look for exact instances of the rules that are specified. Column headings (i.e., Words, Rule on the first line of the CSV file) are optional but recommended.

Here's an example:

Words, Rule
I have a question, GenericQuestionQuery
Have you had, PresentPerfectQuery
Stand ho, ExclamatoryStatement
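A rules file like the one above can also be generated programmatically. Here is a minimal Python sketch using the standard csv module (the file name rules.csv and the rule list are illustrative, not anything Ubiqu+Ity requires):

    import csv

    # Each rule is a (words, name) pair, matching the two-column layout above.
    rules = [
        ("I have a question", "GenericQuestionQuery"),
        ("Have you had", "PresentPerfectQuery"),
        ("Stand ho", "ExclamatoryStatement"),
    ]

    with open("rules.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Words", "Rule"])  # header row: optional, but recommended
        writer.writerows(rules)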

In the future, we will allow users to create more complex, hierarchical rule structures.

The chunk feature allows for the comparison of tag frequencies within a text. It divides a text into chunks of a user-specified number of tokens and gives each chunk its own row in the final spreadsheet.

The "chunk distance" setting allows you to control how much the chunks overlap. Picking a distance equal to chunk length will give you completely disjoint pieces, and picking a distance half the chunk length with give you chunks with 50% overlap.

The blacklist feature is designed for instances where DocuScope's rules may not give the desired results for specific words in a corpus. For example, a word like "bear" can have many distinct meanings within a single text that can't necessarily be captured through DocuScope. Blacklisting stops Ubiqu+Ity from tagging semantically ambiguous input.

If a word within a rule is blacklisted, then that word is tagged as "!BLACKLISTED" and any longer rules that contain that word are broken down into smaller rules. Note that blacklisting only works on individual words, not phrases.
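As a rough illustration of that behavior (this mimics the description above and is not Ubiqu+Ity's actual implementation):

    # Hypothetical blacklist: "bear" is too ambiguous to tag reliably.
    blacklist = {"bear"}

    def tag_token(token, rule_tag):
        # A blacklisted word receives "!BLACKLISTED" instead of its rule's tag;
        # longer rules containing it would be re-matched as smaller rules.
        return "!BLACKLISTED" if token.lower() in blacklist else rule_tag

    print(tag_token("bear", "Animal"))  # !BLACKLISTED
    print(tag_token("wolf", "Animal"))  # Animal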

COLUMN                EXPLANATION
text_name             file names of uploaded text files
chunk_index           an integer that specifies the text chunk number for chunked files; starts at 0 and increases sequentially
!UNRECOGNIZED         percentage of text input that is not recognized by the DocuScope dictionary, often proper nouns or obscure words
!UNTAGGED             percentage of text input that appears in at least one rule in the DocuScope dictionary but doesn't get tagged because the surrounding text matches no specific rule; often common words like 'so' that have little or ambiguous meaning without neighboring words
!BLACKLISTED          percentage of text input that matches blacklisted content
<Word Tokens>         total number of word tokens in a text or text chunk
<Punctuation Tokens>  number of punctuation tokens in a text or text chunk
<Tokens>              total number of tokens (word and punctuation) in a text or text chunk

Here is a live sample of what normal output looks like: Ubiqu+Ity_Shakespeare-Plays_Example_Output.csv

Chunked output looks like this: Ubiqu+Ity_Shakespeare-Plays_Example_Output_Chunked.csv
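If you want to inspect such an output file programmatically, here is a minimal Python sketch using the sample file above (this assumes the file has been downloaded to the working directory; the column names follow the table in this FAQ):

    import csv

    with open("Ubiqu+Ity_Shakespeare-Plays_Example_Output.csv", newline="") as f:
        for row in csv.DictReader(f):
            # Each row describes one text (or one chunk, in chunked output).
            print(row["text_name"], row["!UNRECOGNIZED"], row["<Tokens>"])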

You can see an example text viewer instance of Shakespeare's Romeo and Juliet that has been tagged with the DocuScope dictionary here.

Yes. Ubiqu+Ity is available under the BSD license, and can be forked from GitHub.

Ubiqu+Ity makes use of a number of different services, including the Celery distributed task queue and the RabbitMQ message broker and results store. If you are interested in setting up your own installation, feel free to contact us.
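For anyone exploring such a setup, a minimal Celery application wired to a RabbitMQ broker looks roughly like this (the app name, task, and broker URL are illustrative, not Ubiqu+Ity's actual code):

    from celery import Celery

    # RabbitMQ acts as the message broker; the rpc backend stores task results.
    app = Celery("ubiquity_demo", broker="amqp://localhost", backend="rpc://")

    @app.task
    def tag_text(text):
        # A long-running tagging job would be dispatched to a worker like this.
        return len(text.split())

    # Enqueue from client code: result = tag_text.delay("Stand ho"); result.get()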

Please send any bug reports right to us:

We are also logging any system errors that might come up.

Available DocuScope dictionary versions (the number after each is its count of rhetorical categories):

  • Default: 115
  • v4.00: 2426
  • v3.91: 1878
  • v3.85: 170
  • v3.83: 137
  • v3.82: 137
  • v3.80: 137
  • v3.7: 131
  • v3.6: 123
  • v3.5: 121
  • v3.3: 116
  • v3.21: 115

Defect statistics are only compatible with files processed through our TEI-P4 processing pipeline. The XML files from the TCP use special characters and XML tags to indicate illegible characters, illegible punctuation, and missing words or pages. Our pipeline translates these XML tags and Unicode symbols into ^, *, and (...), respectively, in the resulting plain-text files.

The defect statistics option then uses these characters to generate additional statistics in the CSV about the textual defects of each text in the corpus. Because it recognizes only these patterns of characters, this option only works with plain-text files generated by our pipeline.
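As a rough illustration, counts of these defect markers can be reproduced in a few lines of Python (the file name is hypothetical; the marker conventions are the ones described above):

    import re

    with open("some_tcp_text.txt", encoding="utf-8") as f:
        text = f.read()

    defects = {
        "illegible_characters": text.count("^"),
        "illegible_punctuation": text.count("*"),
        "missing_words_or_pages": len(re.findall(r"\(\.\.\.\)", text)),
    }
    print(defects)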

SlimTV tokens files offer several improvements over the HTML viewer files. The HTML files were clunky and large; SlimTV tokens files are not only smaller but also fix bugs known to exist in the HTML files. The tokens loading file directs you to the most up-to-date version of Serendip's SlimTV.