This is a "baked" version of the notebook - it is not editable, but it doesn't require the Jupyter software to be installed. The scatterplots (and some other) visualizations have links to interactive versions.

Maps of Core Drama

This notebook contains some initial experiments to create "maps" of the core drama corpus.

The pictures inline in the notebook are simple plots made with python's plotting libraries - to really explore the maps, you will probably want to look at the interactive versions.

Setup

This makes sure we're in the right directory, and then loads in a bunch of python libraries. All stuff to do before we get started.

In [1]:
# this setup is meant to be common across all the demos...

# since the notebooks may not go where the module code is...
# change directory to the root - since that's where everything goes
import sys
import os

if os.path.basename(os.getcwd()) == "Notebooks":
    print("Changing directory from:",os.getcwd())
    os.chdir("..")
    print("                     to:",os.getcwd())
    
### Magic to make notebooks work nicely
%matplotlib inline

### standard useful python stuff
from collections import Counter, defaultdict
from importlib import reload

### plotting stuff
import pylab
import seaborn

### numerical stuff
import numpy
from scipy import stats

### my CorpusModeler package stuff
import Corpus.DataMatrix as DM
import Corpus.reduce 
import Corpus.cull 
import Corpus.linear
import Corpus.nonlinear
import Tests.plays as Plays
Changing directory from: C:\Users\gleic\Projects\corpusmodeler\Notebooks
                     to: C:\Users\gleic\Projects\corpusmodeler

This next block sets up some of the basic data structures - loading the corpus, and creating some useful versions of it.

Warning - this uses the wrong form of normalization. I should re-run the experiments with the right normalization, but that would require updating the narrative.

In [2]:
p = Plays.readCore()
Plays.addNames(p)
r = Corpus.reduce.ngrams(p,200)
nr = r.normalize()
genres = p.getMetaDataArray("genre")
dates = p.getMetaDataArray("date")
authors = p.getMetaDataArray("author")
corpus of (554, 179064) loaded in 0.310314416885376
Density after reduction is 0.9848104693140795 - making dense
Reduce from 179064 to 200 words in 0.3668496608734131

A first map - the most common words!

Here is a first simple map...

Using the 2 most common words!

The first thing that is striking is the outlier - there's a tragedy that has 12% of its words being "the". If you are looking at the interactive version of the plot, you can mouse over it to see what it is. Instead, I will list all documents that have more than 8% of their words being "the" (or scoring over .08 on the first feature).

Warning - this is based on normalizing things based on the number of words in the top 200. (so 12% means "12% of the words in the top 200" - not 12% of the overall document). This probably isn't the right thing to do, as it brings in effects of the number of unusual words.
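For reference, a minimal sketch of what that normalization means - assuming r.matrix holds the raw counts of the top-200 words, r.normalize() presumably divides each play's counts by that play's total over those 200 words:

# hypothetical sketch of the top-200 normalization described above - not the
# actual implementation of r.normalize()
row_sums = r.matrix.sum(axis=1, keepdims=True)   # total top-200 words per play
norm = r.matrix / row_sums                       # 0.12 = 12% of a play's top-200 words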

In [26]:
reload(Plays)
Plays.uplot(nr,fname="core_the_i",oc=list(range(2,10)),tagWords=nr.terms[:10])
nr.printTopDocs(0,pairs=True,metaData="name",thresh=.08)
../VEP_Scatter/index.html#data/core_the_i.csv?x=the&y=i&grp=genre&grayCol=Other
                A59170.headed.txt :  0.121 : TR:1648:Seneca:Medea
                A31023.headed.txt :  0.090 : TR:1655:Baron:Mirza
                A02262.headed.txt :  0.088 : TR:1640:Grotius:Christ's Passion
A11909.headed_048_tragedy_006.txt :  0.087 : TR:1558:Seneca:Troas
A11909.headed_069_tragedy_008.txt :  0.086 : TR:1566:Seneca:Agamemnon
A11909.headed_025_tragedy_004.txt :  0.083 : TR:1567:Seneca:Hippolytus
                A59169.headed.txt :  0.081 : TR:1651:Seneca:Hippolytus

The next thing we see is that the tragedies seem to tend toward the right (and bottom) while the comedies are in the upper left. A different view of the data might help us get a better sense of that.

Using a python statistical graphics library (seaborn), I can make some plots that might help me get a sense of this. Here are beeswarm, violin, and boxplots.

In [5]:
pylab.clf()
seaborn.swarmplot(x=genres,y=nr.matrix[:,0],palette=Plays.genreColors_)
pylab.show()
seaborn.violinplot(x=genres,y=nr.matrix[:,0],palette=Plays.genreColors_)
pylab.show()
seaborn.boxplot(x=genres,y=nr.matrix[:,0],palette=Plays.genreColors_)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1be0a950d68>

This gives me some intuition that comedies use "the" less than histories and tragedies do. I might want to ask if this is statistically significant (i.e., what is the likelihood of this happening by chance). Technically, I should do an ANOVA or some more complicated statistical test. But as a first pass, I can try a basic T-Test to see if comedies use "the" less than tragedies.

In [6]:
com = [v for i,v in enumerate(nr.matrix[:,0]) if genres[i]=="CO"]
tra = [v for i,v in enumerate(nr.matrix[:,0]) if genres[i]=="TR"]
print(len(com),len(tra))
print(numpy.mean(com),numpy.mean(tra))
stats.ttest_ind(com,tra,equal_var=False)
266 154
0.0419095157017 0.0523951818446
Out[6]:
Ttest_indResult(statistic=-8.3566899509183994, pvalue=6.479455675497188e-15)

OK, so the use of the word "the" is unlikely to have come from the same distribution.

For completeness, here are the graphs for the word "I".

In [7]:
pylab.clf()
seaborn.swarmplot(x=genres,y=nr.matrix[:,1],palette=Plays.genreColors_)
pylab.show()
seaborn.violinplot(x=genres,y=nr.matrix[:,1],palette=Plays.genreColors_)
pylab.show()
seaborn.boxplot(x=genres,y=nr.matrix[:,1],palette=Plays.genreColors_)
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x1be09137390>

Classifiers

Now, we see that (on average) different genres use the words "the" and "i" (the two most common words) differently. So, comedies will tend to be in the upper left of our map, and tragedies will tend to be lower and more to the right.

We can think about using a measurement (e.g., the percentage of words that are "the") to tell us what genre something is. Simply, if this amount is high, we expect it's a tragedy or history.

To assess this, consider a simple game: I pick two plays at random - if one of them is a tragedy, what is the probability of its "score" being higher? Or put differently, if I consider all pairs of plays that contain exactly one tragedy, for what percentage of those pairs does the tragedy have the higher score? If all tragedies scored higher than all other plays, this would be 100%. If all tragedies scored less than the others, it would be 0%. If the measure didn't tell us anything, we'd expect it to be 50% (half the time yes, half the time no).

In machine learning, this metric of success is called area under the curve, or more specifically, area under the receiver operating characteristic curve. The reason for this funny name is complicated - but it's best to think of the simple explanation: what is the chance that, given a pair of elements, the one "in-class" has the higher value.
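To make the pair-counting definition concrete, here is a minimal sketch of computing it directly with numpy (the tables below come from nr.aucTable in the CorpusModeler package; this is just an illustration of the definition):

def pairwise_auc(scores, labels, target):
    # fraction of (target, non-target) pairs where the target play scores higher;
    # ties count as half a win
    scores = numpy.asarray(scores, dtype=float)
    labels = numpy.asarray(labels)
    pos = scores[labels == target]          # e.g. the tragedies
    neg = scores[labels != target]          # everything else
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# e.g. pairwise_auc(nr.matrix[:, 0], genres, "TR") should match the "the"/TR entry below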

Here are the AUC (area under curve) scores for the first few words. Notice that a random history has a 75% chance of having a higher percentage of "the" than a random non-history. (Or, for 74.8% of the pairs of plays, the history has the higher percentage of "the"). Similarly, comedies are low in "the" (they only have more than some other play 28.5% of the time). But comedies are high in "I" and "A".

In [8]:
nr.aucTable(genres,[0,1,2,3,4])
   :      CO       TR       TC       HI 
the:   0.285    0.692    0.493    0.748 
  i:   0.765    0.245    0.480    0.313 
and:   0.374    0.584    0.526    0.667 
 to:   0.292    0.663    0.620    0.535 
  a:   0.758    0.310    0.400    0.315 

There are a few lessons to follow up on here (methodological lessons - what it means that these simple words behave this way is for someone else to figure out).

First, notice that tragedies and histories tend to do similar things. These words don't distinguish them well. For a fairer test, we might check how well things work at telling these two groups (together) apart from comedies. With these words, the tragicomedies aren't going to be too distinctive.

The next thing is to put features together... If "the" and "to" are both good for picking out tragedies and histories, maybe counting both will be even better. And... while we're at it, we could subtract the counts of words that are good for comedies.

I've just built a linear classifier (or scoring function): 1 times the number of "the", -1 times the number of "i", etc. In this case, the simple thing works pretty well. But you can imagine, I could do something much fancier to pick which things to measure (that's called feature selection) and to figure out what combination of those features to use (the values for the multipliers).
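As a minimal numpy sketch of what such a scoring function is (independent of the Corpus.linear.fromWords helper used below), assuming the columns of nr.matrix are in the order of the table above (the, i, and, to, a):

# hand-built "tragedy/history" score: +1*the +1*to -1*i -1*a
X = nr.matrix
trag_score = X[:, 0] + X[:, 3] - X[:, 1] - X[:, 4]
# a high trag_score should mean "more likely tragedy/history", a low one "more likely comedy"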

In [9]:
nr.aucTable(["TR+HI" if g=="TR" or g=="HI" else g for g in genres],[0,1,3,4])
manualTrag = Corpus.linear.fromWords(nr,[ ["the","to"], ["i","a"], ["the","to",("i",-1),("a",-1)]])
manualTrag.terms = ["the+to","i+a","the+to-i-a"]
manualTrag.aucTable(genres,[0,1,2])
manualMap = Corpus.linear.fromWords(nr,[["the","to",("i",-1),("a",-1)],["king","lord"] ])
Plays.uplot(manualMap,fname="core_manual_map",tagWords=["the","to","i","a","king","lord"])
   :      CO    TR+HI       TC 
the:   0.285    0.741    0.493 
  i:   0.765    0.221    0.480 
 to:   0.292    0.654    0.620 
  a:   0.758    0.279    0.400 
          :      CO       TR       TC       HI 
    the+to:   0.241    0.715    0.555    0.708 
       i+a:   0.825    0.224    0.428    0.262 
the+to-i-a:   0.180    0.771    0.570    0.736 

Feature Selection

So far, we just took the first 5 words. What if we looked at all 200 and asked - which is best?

In [10]:
for genre in ["CO","TR","HI","TC"]:
    com = [1 if g==genre else 0 for g in genres]
    auc = nr.auc(com)
    aucMag = [abs(a-.5) for a in auc]
    aucOrd = numpy.argsort(aucMag)[::-1]
    print("Best Words for "+genre)
    nr.aucTable(genres,aucOrd[:5])
Best Words for CO
      :      CO       TR       TC       HI 
 death:   0.147    0.843    0.551    0.682 
 blood:   0.194    0.802    0.547    0.638 
  from:   0.200    0.758    0.597    0.642 
     i:   0.765    0.245    0.480    0.313 
master:   0.761    0.272    0.365    0.497 
Best Words for TR
     :      CO       TR       TC       HI 
death:   0.147    0.843    0.551    0.682 
blood:   0.194    0.802    0.547    0.638 
 from:   0.200    0.758    0.597    0.642 
    i:   0.765    0.245    0.480    0.313 
  you:   0.749    0.255    0.500    0.303 
Best Words for HI
     :      CO       TR       TC       HI 
 king:   0.252    0.640    0.554    0.899 
 lord:   0.376    0.603    0.420    0.833 
  can:   0.537    0.445    0.652    0.196 
could:   0.453    0.581    0.603    0.205 
  our:   0.247    0.689    0.548    0.782 
Best Words for TC
     :      CO       TR       TC       HI 
 find:   0.525    0.399    0.701    0.279 
 wife:   0.661    0.452    0.318    0.431 
 well:   0.733    0.311    0.328    0.564 
since:   0.391    0.519    0.671    0.488 
 good:   0.698    0.338    0.329    0.613 

Building classifiers by hand is tricky, since we need to consider how common the words are to get the right scaling factors. Just being +/- 1 is pretty limiting. I'm trying to pick words that not only promote each genre, but distinguish similar ones (tragedies and histories are hard).

In [11]:
myGenres = Corpus.linear.fromWords(nr,[
                                   ["find",("wife",-1),("well",-1),"since"],
                                   ["king","lord",("could",-1)],
                                   ["i","master",("death",-1),("blood",-1)],
                                   ["death","blood",("i",-1),("you",-1)]
                                  ])
myGenres.terms = ["TC","HI","CO","TR"]
myGenres.aucTable(genres,[0,1,2,3])
Plays.uplot(myGenres,fname="core_manual_genres",tagWords=["the","to","i","a","king","lord"],oc=[2,3])
  :      CO       TR       TC       HI 
TC:   0.266    0.630    0.762    0.424 
HI:   0.339    0.591    0.464    0.913 
CO:   0.813    0.203    0.457    0.310 
TR:   0.193    0.793    0.521    0.726 

Here we try a standard Machine Learning approach for classifier construction: Support Vector Machines (SVMs). The parameter (C) balances between concise answers (fewer words used) and better correctness. If you set C really big, you get classifiers that always pick out the target class - but they use all the words, and weird combinations of them.
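For reference, here is a rough sketch of what an L1-penalized linear SVM looks like in scikit-learn. This is an assumption about what Corpus.linear.SVM wraps - the actual implementation may differ.

from sklearn.svm import LinearSVC

# one-vs-rest linear SVMs with an L1 penalty; a large C favors correctness over sparsity
clf = LinearSVC(C=500, penalty="l1", dual=False)
clf.fit(nr.matrix, genres)
# clf.coef_ has one row of word weights per genre; the L1 penalty pushes many weights to 0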

In [29]:
svm = Corpus.linear.SVM(nr,genres,l1=True,svmparams={"C":500})
print(svm)
svm.aucTable(genres,[0,1,2,3])
print(svm.topWords(0))
Plays.uplot(svm,fname="core_svm",oc=list(range(2,svm.shape[1])),tagWords=svm.topWords(0))
Build svm model in  1.43
<Corpus.DataMatrix.DataMatrix 554x4>
  :      CO       TR       TC       HI 
CO:   0.999    0.115    0.294    0.225 
HI:   0.410    0.476    0.461    1.000 
TC:   0.326    0.422    1.000    0.317 
TR:   0.167    1.000    0.420    0.412 
[('eyes', 651.65616218888022), ('day', 414.78632329486123), ('long', 412.7276171760189), ('very', 353.32054000930935), ('an', 341.69960270369921), ('though', 338.62091818070354), ('soul', -317.38916558648862), ('death', -336.13708841153516), ('blood', -351.71059621507192), ('better', -368.18832234624335), ('when', -373.05535423671347), ('way', -377.28059660684636), ('both', -391.80174284575617), ('without', -407.14330187802602), ('take', -412.19831934575001), ('who', -428.39114905052554), ('hand', -450.61914867651717), ('hope', -511.050968540599), ('keep', -668.54928102106896), ('might', -675.8669414773733)]

PCA

Our success at crafting a "classifier" so easily suggests that standard statistical methods might find this structure on their own.

Here, we apply "Principal Components Analysis" to reduce from 200 dimensions to 2 dimensions.

For comparison, I show the AUC table along with the "manual classifier" (using 4 words) we derived above. The PCA does a reasonable job of distinguishing genres.
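For reference, a sketch of the same reduction using scikit-learn (an assumption about what Corpus.linear.PCA wraps; the package may differ in details such as centering or scaling):

from sklearn.decomposition import PCA

# project the 200 word-frequency columns onto their directions of greatest variance
pts = PCA(n_components=2).fit_transform(nr.matrix)
# pts[:, 0] and pts[:, 1] correspond roughly to the PCA(0)/PCA(1) axes plotted below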

In [30]:
pca = Corpus.linear.PCA(nr,10)
Plays.uplot(pca,fname="core_pca")
pca.aucTable(genres,[0,1,2])
manualTrag.aucTable(genres,[0,1,2])
Build PCA model in  0.01
../VEP_Scatter/index.html#data/core_pca.csv?x=PCA(0)&y=PCA(1)&grp=genre&grayCol=Other
      :      CO       TR       TC       HI 
PCA(0):   0.207    0.764    0.523    0.756 
PCA(1):   0.593    0.440    0.458    0.424 
PCA(2):   0.621    0.390    0.468    0.449 
          :      CO       TR       TC       HI 
    the+to:   0.241    0.715    0.555    0.708 
       i+a:   0.825    0.224    0.428    0.262 
the+to-i-a:   0.180    0.771    0.570    0.736 

RCA (Supervised Dimensionality Reduction)

In [14]:
rca = Corpus.linear.RCA(nr,genres,dims=3)
rca.aucTable(genres,[0,1,2])
Plays.uplot(rca,fname="core_rca",oc=[2])
rcaTSNE = Corpus.nonlinear.tsne(rca)
Plays.uplot(rcaTSNE,fname="core_rca_tsne")
Build RCA model in  0.13
      :      CO       TR       TC       HI 
RCA(0):   0.966    0.061    0.455    0.168 
RCA(1):   0.646    0.261    0.352    0.997 
RCA(2):   0.355    0.304    0.969    0.634 

Neighborhoods

In addition to these compass directions, we might wonder about the neighborhoods... If I take a point on the map, are its neighbors going to be of the same genre?

What I am going to do is to "predict" the genre of each play based on its neighbors on the map. For each play, I will pick the 5 closest neighbors (this actually means itself, plus 4 real neighbors) - and have this group of 5 "vote". So, if at least 2 of the 4 real neighbors are of the same genre as the play, the play will be labeled with its own genre. If more than 2 of the closest neighbors are different, the play will be labeled with a different genre. We can see how often this predicts correctly.

In the plot, I highlight ones that are "wrong" (plays whose genre is generally not the same as their neighbors').

The table gives a sense of how well this worked.
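A sketch of this neighbor-voting experiment using scikit-learn (Tests.knn.knnExp presumably does something along these lines on the first two map coordinates, plus the plotting):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# 5 nearest neighbors on the "the"/"i" map; since each play is in the training data,
# it is its own nearest neighbor and effectively votes for itself
knn = KNeighborsClassifier(n_neighbors=5).fit(nr.matrix[:, :2], genres)
pred = knn.predict(nr.matrix[:, :2])
print(classification_report(genres, pred))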

In [15]:
import Tests.knn as KNN
KNN.knnExp(nr,genres,nd=2,uplotArgs={"fname":"core_i_the_neighbors","tagWords":["the","i"]})
             precision    recall  f1-score   support

         CO       0.65      0.91      0.76       266
         HI       0.48      0.31      0.38        39
         TC       0.43      0.14      0.21        95
         TR       0.68      0.56      0.61       154

avg / total       0.61      0.64      0.60       554

Dates

The fact that the first thing we tried (the most common words) is correlated with genre might make us wonder if we just got lucky. Let's see if something else is correlated with these words. The obvious thing to try is date, since we have that handy... Here's a plot of "the" vs. date.

In [16]:
seaborn.regplot(numpy.array(nr.getMetaDataArray("date"),dtype=float),nr.matrix[:,0])
pylab.show()
seaborn.regplot(numpy.array(nr.getMetaDataArray("date"),dtype=float),nr.matrix[:,1])
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1be090059b0>
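To put a number on "doesn't look correlated", one could also compute the Pearson correlation directly - a quick sketch using the scipy.stats imported above (not something run in the original notebook):

# correlation of date with the fractions of "the" (column 0) and "i" (column 1)
d = numpy.array(nr.getMetaDataArray("date"), dtype=float)
for col, word in [(0, "the"), (1, "i")]:
    r_val, p_val = stats.pearsonr(d, nr.matrix[:, col])
    print(word, ": r =", round(r_val, 3), " p =", p_val)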

Short answer - the use of "the" and "i" doesn't look correlated to date. Although it is probably worth asking whether or not genre is correlated with date. Again, rather than something statistical, I'll make a picture.

In [17]:
dates = numpy.array(nr.getMetaDataArray("date"))
pylab.clf()
seaborn.swarmplot(x=genres,y=dates,palette=Plays.genreColors_)
pylab.show()
seaborn.violinplot(x=genres,y=dates,palette=Plays.genreColors_)
pylab.show()
seaborn.boxplot(x=genres,y=dates,palette=Plays.genreColors_)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1be0990f710>

For tragedies and comedies, the dates don't seem to vary so much. Histories and Tragi-Comedies certainly do change. This might explain what we see in "the" and "I".

And just to show that correlations do happen, let's see that "the" and "i" are correlated (or anti-correlated, to be precise).

In [18]:
seaborn.regplot(nr.matrix[:,0],nr.matrix[:,1])
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x1be098e0a20>

Before and After Shakespeare

As an experiment, let's see how/if the usage of the top 200 words changed after Shakespeare. I divide the plays into 3 groups: before Shakespeare, during Shakespeare, and after Shakespeare. I will then do "supervised dimensionality reduction" to pick 2 dimensions that best separate before from after. The specific method I am choosing is called "Relevant Components Analysis" (or RCA).
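RCA comes from the CorpusModeler package; as a point of comparison only, here is a sketch of Linear Discriminant Analysis, another supervised projection available in scikit-learn, using the same before/during/after grouping as the next cell. This is not what Corpus.linear.RCA computes.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# same grouping as the cell below: 1 = before 1590, 2 = after 1613, 0 = during
grp = [1 if d < 1590 else 2 if d > 1613 else 0 for d in p.getMetaDataArray("date")]
lda = LinearDiscriminantAnalysis(n_components=2)
pts = lda.fit_transform(nr.matrix, grp)   # 2 supervised coordinates per play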

In [19]:
reload(Plays)
dates = p.getMetaDataArray("date")
dix = [1 if d<1590 else 2 if d>1613 else 0 for d in dates]
print(Counter(dix))
dixRCA = Corpus.linear.RCA(nr,dix,skipZeros=True)
Plays.uplot(dixRCA,grpCol=dix,fname="core_before_after_shak")
Plays.uplot(dixRCA,fname="core_before_after_shake_genre")

Of course, the first question to ask is "what words" are making that nice division along the x axis. The answer is all of them, but we can look at the top 20. Positive means "pulls to the right" (i.e., is seen more often in the earlier documents), negative means "pulls to the left".

In [20]:
dixRCA.topWords(0)
Out[20]:
[('nothing', 548.83857158780677),
 ('without', 440.94773001611782),
 ('might', 403.55806043656344),
 ('made', 342.36398187731703),
 ('down', 274.36614200198341),
 ('tell', 270.7539349420619),
 ('life', 256.348595709118),
 ('whose', 247.42406857362087),
 ('long', 243.10819007745889),
 ('faith', 234.37471525861764),
 ('true', -226.532496703937),
 ('one', -229.96028866025685),
 ('hope', -230.20822981745184),
 ('those', -262.96526379719995),
 ('blood', -271.80890853072157),
 ('eyes', -285.27223137246995),
 ('sure', -294.59430906903475),
 ('way', -308.60032642370356),
 ('nor', -311.80160481141468),
 ('find', -398.4702999824334)]

Do the weightings matter? Here I force them to be -2, -1, 0, 1, 2. And it works almost as well.
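One plausible way such a quantization could work (a sketch only - quantizeWords may do something different): scale the weights so the largest magnitude maps to 2, then round to the nearest integer.

def quantize_weights(w, levels=2):
    # scale so the largest-magnitude weight becomes `levels`, then round to integers
    w = numpy.asarray(w, dtype=float)
    return numpy.round(w * levels / numpy.abs(w).max())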

In [21]:
dixRCAq = dixRCA.quantizeWords(2)
Plays.uplot(dixRCAq,grpCol=dix,fname="core_before_after_shak_quant")
dixRCAq.topWords(0)
Out[21]:
[('nothing', 2.0),
 ('without', 2.0),
 ('such', 1.0),
 ('long', 1.0),
 ('done', 1.0),
 ('does', 1.0),
 ('whose', 1.0),
 ('shall', 1.0),
 ('are', 1.0),
 ('up', -1.0),
 ('cannot', -1.0),
 ('hope', -1.0),
 ('true', -1.0),
 ('way', -1.0),
 ('find', -1.0),
 ('how', -1.0),
 ('any', -1.0),
 ('nor', -1.0),
 ('blood', -1.0),
 ('only', -1.0)]

Do we need all 200 words? How about just the 10 most important?
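The idea behind topNCol, sketched in plain numpy (an assumption about what it does): keep only the N weights with the largest magnitude and zero out the rest.

def keep_top_n(w, n=10):
    # zero out all but the n largest-magnitude weights
    w = numpy.asarray(w, dtype=float)
    keep = numpy.argsort(numpy.abs(w))[::-1][:n]
    out = numpy.zeros_like(w)
    out[keep] = w[keep]
    return out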

In [22]:
dixRCAn = dixRCA.topNCol(10)
print(dixRCAn.topWords(0))
Plays.uplot(dixRCAn,grpCol=dix,fname="core_before_after_shak_n10",tagWords=dixRCAn.topWords(0))
[('nothing', 548.83857158780677), ('without', 440.94773001611782), ('might', 403.55806043656344), ('made', 342.36398187731703), ('down', 274.36614200198341), ('eyes', -285.27223137246995), ('sure', -294.59430906903475), ('way', -308.60032642370356), ('nor', -311.80160481141468), ('find', -398.4702999824334)]

Looks like we need more than 10 words. Picking 50 does much better.

In [23]:
dixRCAn = dixRCA.topNCol(50)
print(dixRCAn.topWords(0))
Plays.uplot(dixRCAn,grpCol=dix,fname="core_before_after_shak_n50",tagWords=dixRCAn.topWords(0,50))
[('nothing', 548.83857158780677), ('without', 440.94773001611782), ('might', 403.55806043656344), ('made', 342.36398187731703), ('down', 274.36614200198341), ('tell', 270.7539349420619), ('life', 256.348595709118), ('whose', 247.42406857362087), ('long', 243.10819007745889), ('faith', 234.37471525861764), ('true', -226.532496703937), ('one', -229.96028866025685), ('hope', -230.20822981745184), ('those', -262.96526379719995), ('blood', -271.80890853072157), ('eyes', -285.27223137246995), ('sure', -294.59430906903475), ('way', -308.60032642370356), ('nor', -311.80160481141468), ('find', -398.4702999824334)]

Semi-Supervised

How about.... training on Shakespeare, and seeing what happens to the other plays?

In [24]:
reload(Plays.PL)
sg = Plays.shakGenres(nr, nonGroup=-1)
shakRCA = Corpus.linear.RCA(nr,sg,shiftCov=.001)
Plays.uplot(shakRCA,grpCol="genre",highlight=Plays.shakespeareVec(r),fname="shak-super-example",tagWords=shakRCA.topWords(0,10))