Ubiqu+Ity 1.2 changelog

A Visualizing English Print application from the University of Wisconsin–Madison

Ubiqu+Ity generates statistics and web-based tagged text views
for your text(s), using the DocuScope dictionary or your own rules.


Upload Files

  • Break texts into fixed-size chunks
      Chunk length (in tokens)
      Distance between consecutive chunks (in tokens)
  • Text defect statistics in CSV (TCP pipeline files only)
  • Enable blacklist
  • Generate n-gram CSV
      Select up to 1-, 2-, and 3-grams, inclusive.
      Include punctuation in n-grams.
  • Generate rule metadata CSV
      Enable per-document rule metadata
  • Generate token CSV representation (for use with SerendipSlim)



Ubiqu+Ity FAQ

You can upload any of the following:

  • A single plain-text file (though it seems like that would be boring)
  • A ZIP archive containing multiple plain-text files (now we're talking)
  • Multiple plain-text files and/or multiple ZIP archives (with HTML5 multi-file uploading)

You can upload up to 50 individual files and folders and up to 50 MB of total content to Ubiqu+Ity at a time. Batch and zip larger jobs.

DocuScope was created by David Kaufer and Suguru Ishizaki in the Department of English at Carnegie Mellon University (emphasis ours):

DocuScope is a text analysis environment with a suite of interactive visualization tools for corpus-based rhetorical analysis. […] David created what we call the generic (default) dictionary, consisting of over 40 million linguistic patterns of English classified into over 100 categories of rhetorical effects. Suguru designed and implemented the analysis and visualization software, which can annotate a corpus of text against any dictionary of regular strings that are classified into a hierarchy of rhetorical effects.

While prior versions of DocuScope are proprietary, a revision of version 3.21 has recently been made open source and available to the public. You can download it from GitHub here.

You can learn more about DocuScope, David, and Suguru here:

The investigators and researchers of the Visualizing English Print project at the University of Wisconsin–Madison have partnered with David and Suguru to utilize DocuScope's dictionary of rules in our research and our tools, such as this one. We will keep our available dictionaries updated as further versions are released.

If you don't want to use the DocuScope dictionaries, Ubiqu+Ity allows you to create your own simple rules. Rules are declared in a CSV file that looks like this:

  • Column 1: Whitespace-Separated Words and Punctuation
  • Column 2: The Rule's Name

Ubiqu+Ity will look for exact instances of the rules that are specified. Column headings (i.e., Words, Rule on the first line of the CSV file) are optional but recommended.

Here's an example:

Words, Rule
I have a question, GenericQuestionQuery
Have you had, PresentPerfectQuery
Stand ho, ExclamatoryStatement
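A rules file like the one above can also be generated programmatically. Here is a minimal Python sketch using the standard csv module (the file name rules.csv and the rule list are illustrative, not anything Ubiqu+Ity requires):

    import csv

    # Each rule is a (words, name) pair, matching the two-column layout above.
    rules = [
        ("I have a question", "GenericQuestionQuery"),
        ("Have you had", "PresentPerfectQuery"),
        ("Stand ho", "ExclamatoryStatement"),
    ]

    with open("rules.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Words", "Rule"])  # header row: optional, but recommended
        writer.writerows(rules)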

In the future, we will allow users to create more complex, hierarchical rule structures.

The chunk feature allows for the comparison of tag frequencies within a text. It divides a text into chunks of a user-specified number of tokens and gives each chunk its own row in the final spreadsheet.

The "chunk distance" setting allows you to control how much the chunks overlap. Picking a distance equal to chunk length will give you completely disjoint pieces, and picking a distance half the chunk length with give you chunks with 50% overlap.

The blacklist feature is designed for instances where DocuScope's rules may not give the desired results for specific words in a corpus. For example, a word like "bear" can have many distinct meanings within a single text that can't necessarily be captured through DocuScope. Blacklisting stops Ubiqu+Ity from tagging semantically ambiguous input.

If a word within a rule is blacklisted, then that word is tagged as "!BLACKLISTED" and any longer rules that contain that word are broken down into smaller rules. Note that blacklisting only works on individual words, not phrases.
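As a rough illustration of that behavior (this mimics the description above and is not Ubiqu+Ity's actual implementation):

    # Hypothetical blacklist: "bear" is too ambiguous to tag reliably.
    blacklist = {"bear"}

    def tag_token(token, rule_tag):
        # A blacklisted word receives "!BLACKLISTED" instead of its rule's tag;
        # longer rules containing it would be re-matched as smaller rules.
        return "!BLACKLISTED" if token.lower() in blacklist else rule_tag

    print(tag_token("bear", "Animal"))  # !BLACKLISTED
    print(tag_token("wolf", "Animal"))  # Animal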

COLUMN                EXPLANATION
text_name             file names of uploaded text files
chunk_index           an integer that specifies the text chunk number for chunked files; starts at 0 and increases sequentially
!UNRECOGNIZED         percentage of text input that is not recognized by the DocuScope dictionary, often proper nouns or obscure words
!UNTAGGED             percentage of text input that appears in at least one rule in the DocuScope dictionary but doesn't get tagged because the surrounding text matches no specific rule; often common words like 'so' that have little or ambiguous meaning without neighboring words
!BLACKLISTED          percentage of text input that matches blacklisted content
<Word Tokens>         total number of word tokens in a text or text chunk
<Punctuation Tokens>  number of punctuation tokens in a text or text chunk
<Tokens>              total number of tokens (word and punctuation) in a text or text chunk

Here is a live sample of what normal output looks like: Ubiqu+Ity_Shakespeare-Plays_Example_Output.csv

Chunked output looks like this: Ubiqu+Ity_Shakespeare-Plays_Example_Output_Chunked.csv
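If you want to inspect such an output file programmatically, here is a minimal Python sketch using the sample file above (this assumes the file has been downloaded to the working directory; the column names follow the table in this FAQ):

    import csv

    with open("Ubiqu+Ity_Shakespeare-Plays_Example_Output.csv", newline="") as f:
        for row in csv.DictReader(f):
            # Each row describes one text (or one chunk, in chunked output).
            print(row["text_name"], row["!UNRECOGNIZED"], row["<Tokens>"])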

You can see an example text viewer instance of Shakespeare's Romeo and Juliet that has been tagged with the DocuScope dictionary here.

Yes. Ubiqu+Ity is available under the BSD license, and can be forked from GitHub.

Ubiqu+Ity makes use of a number of different services, including the Celery distributed task queue and the RabbitMQ message broker and results store. If you are interested in setting up your own installation, feel free to contact us.
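For anyone exploring such a setup, a minimal Celery application wired to a RabbitMQ broker looks roughly like this (the app name, task, and broker URL are illustrative, not Ubiqu+Ity's actual code):

    from celery import Celery

    # RabbitMQ acts as the message broker; the rpc backend stores task results.
    app = Celery("ubiquity_demo", broker="amqp://localhost", backend="rpc://")

    @app.task
    def tag_text(text):
        # A long-running tagging job would be dispatched to a worker like this.
        return len(text.split())

    # Enqueue from client code: result = tag_text.delay("Stand ho"); result.get()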

Please send any bug reports right to us:

We are also logging any system errors that might come up.

Available DocuScope dictionary versions (the number after each is its count of rhetorical categories):

  • Default: 115
  • v4.00: 2426
  • v3.91: 1878
  • v3.85: 170
  • v3.83: 137
  • v3.82: 137
  • v3.80: 137
  • v3.7: 131
  • v3.6: 123
  • v3.5: 121
  • v3.3: 116
  • v3.21: 115

Defect statistics are only compatible with files processed through our TEI-P4 processing pipeline. The XML files from the TCP use special characters and XML tags to indicate illegible characters, illegible punctuation, and missing words or pages. Our pipeline translates these XML tags and Unicode symbols into ^, *, and (...), respectively, in the resulting plain-text files.

The defect statistics option then uses these characters to generate additional statistics in the CSV about the textual defects of each text in the corpus. Because it recognizes only these patterns of characters, this option only works with plain-text files generated by our pipeline.
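As a rough illustration, counts of these defect markers can be reproduced in a few lines of Python (the file name is hypothetical; the marker conventions are the ones described above):

    import re

    with open("some_tcp_text.txt", encoding="utf-8") as f:
        text = f.read()

    defects = {
        "illegible_characters": text.count("^"),
        "illegible_punctuation": text.count("*"),
        "missing_words_or_pages": len(re.findall(r"\(\.\.\.\)", text)),
    }
    print(defects)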

SlimTV tokens files offer several improvements over the HTML viewer files. The HTML files were clunky and large; SlimTV tokens files are not only smaller but also fix bugs known to exist in the HTML files. The tokens loading file directs you to the most up-to-date version of Serendip's SlimTV.