Longer documents can muddy the concept of document co-occurrence. Should words still be considered to be co-occurring if they occur in different paragraphs, or pages, or chapters? Intuition would say that longer chunks will create more general models, but it can be hard to predict the effects a priori.

We built two 25-topic models on the works of Shakespeare: one in which the documents were divided into 1000-word chunks and stitched back together after being tagged by the model, and one in which each document was treated as a whole. The parallel buddy plots show a trend: although the ordering of document distances does not change much, documents appear much farther apart in the un-chunked model.
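As a rough illustration of the chunk-and-stitch step, here is a minimal Python sketch. The function names `chunk_words` and `stitch` are hypothetical, and the stitching rule (a length-weighted average of per-chunk topic proportions) is an assumption made for illustration; the procedure actually used may differ. The topic proportions below are made up and stand in for whatever a topic model would assign to each chunk.

```python
from typing import List, Sequence


def chunk_words(text: str, size: int = 1000) -> List[List[str]]:
    """Split text on whitespace into consecutive chunks of at most `size` words."""
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words), size)]


def stitch(chunk_vectors: Sequence[Sequence[float]],
           chunk_lengths: Sequence[int]) -> List[float]:
    """Combine per-chunk topic proportions into one document vector.

    Assumption: a length-weighted average, so a short final chunk
    counts for less than a full-sized one.
    """
    total = sum(chunk_lengths)
    n_topics = len(chunk_vectors[0])
    return [
        sum(vec[k] * n for vec, n in zip(chunk_vectors, chunk_lengths)) / total
        for k in range(n_topics)
    ]


# Toy example: a 2500-word "document" splits into chunks of 1000, 1000, and 500 words.
chunks = chunk_words("word " * 2500, size=1000)
lengths = [len(c) for c in chunks]

# Made-up topic proportions for each chunk (3 topics), one topic per chunk.
fake_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# The stitched vector spreads weight over all three topics,
# in proportion to chunk length.
doc_vector = stitch(fake_vectors, lengths)
```

Even though each toy chunk is dominated by a single topic, the stitched document vector covers all three, which is the kind of broadening the chunked model would be expected to produce.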

This makes sense, since stitching small chunks back together would tend to create document vectors that cover a wider range of topics and therefore lie closer to one another. Because the relative ordering of distances changes little, the relationships found in the un-chunked (and correspondingly much faster to build) model are probably trustworthy. For further discussion, refer to the paper.
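One way to check both observations numerically (how far apart documents are, and whether the ordering of distances is preserved) is sketched below in Python. This is an illustration under assumptions: the source does not say which distance metric was used, so Euclidean distance is assumed here, and Spearman rank correlation is one possible measure of order agreement. The helper names are hypothetical, and the two sets of document vectors are made-up toy values, not results from the Shakespeare models.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def pairwise_distances(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Euclidean distance for every unordered pair of document vectors."""
    return [
        sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
        for u, v in combinations(vectors, 2)
    ]


def ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks.

    Undefined (division by zero) if all values in x or in y are tied.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Made-up topic vectors for three documents under each model (3 topics).
whole = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.0, 0.1, 0.9]]
chunked = [[0.6, 0.3, 0.1], [0.3, 0.6, 0.1], [0.2, 0.3, 0.5]]

d_whole = pairwise_distances(whole)
d_chunked = pairwise_distances(chunked)

# Mean distance is larger for the peaked "whole" toy vectors.
mean_whole = sum(d_whole) / len(d_whole)
mean_chunked = sum(d_chunked) / len(d_chunked)

# Rank agreement is close to 1.0 here because the toy distances
# keep the same ordering in both sets.
rho = spearman(d_whole, d_chunked)
```

A rank correlation near 1 together with a clear gap in mean distance would match the pattern described above: distances change in scale but not in order.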