This is a "baked" version of the notebook - it is not editable, but it doesn't require the Jupyter software to be installed. The scatterplots (and some other) visualizations have links to interactive versions.

# Maps of Core Drama¶

This notebook contains some initial experiments to create "maps" of the core drama corpus.

The pictures inline in the notebook are simple plots made using python plotting - to really explore the maps, you will probably want to look at the interactive versions

## Setup¶

This makes sure we're in the right directory, and then loads in a bunch of python libraries. All stuff to do before we get started.

In [1]:
# this setup is meant to be common across all the demos...

# since the notebooks may not go where the module code is...
# change directory to the root - since that's where everything goes
import sys
import os

if os.path.basename(os.getcwd()) == "Notebooks":
print("Changing directory from:",os.getcwd())
os.chdir("..")
print("                     to:",os.getcwd())

### Magic to make notebooks work nicely
%matplotlib inline

### standard useful python stuff
from collections import Counter, defaultdict
from importlib import reload

### plotting stuff
import pylab
import seaborn

### numerical stuff
import numpy
from scipy import stats

### my CorpusModeler package stuff
import Corpus.DataMatrix as DM
import Corpus.reduce
import Corpus.cull
import Corpus.linear
import Corpus.nonlinear
import Tests.plays as Plays

Changing directory from: C:\Users\gleic\Projects\corpusmodeler\Notebooks
to: C:\Users\gleic\Projects\corpusmodeler


This next block sets up some of the basic data structues - loading the corpus, and creating some useful versions of it.

Warning - this uses the wrong form of normalization. I should re-run the experiments with the right normalziation, but that would require updating the narrative.

In [2]:
p = Plays.readCore()
r = Corpus.reduce.ngrams(p,200)
nr = r.normalize()

corpus of (554, 179064) loaded in 0.310314416885376
Density after reduction is 0.9848104693140795 - making dense
Reduce from 179064 to 200 words in 0.3668496608734131


## A first map - the most common words!¶

Here is a first simple map...

Using the 2 most common words!

The first thing that is striking is the outlier - there's a tragedy that has 12% of its words being "the". If you are looking at the interactive version of the plot, you can mouse over it to see what it is. Instead, I will list all documents that have more than 8% of their words being "the" (or scoring over .08 on the first feature).

Warning - this is based on normalizing things based on the number of words in the top 200. (so 12% means "12% of the words in the top 200" - not 12% of the overall document). This probably isn't the right thing to do, as it brings in effects of the number of unusual words.

In [26]:
reload(Plays)
Plays.uplot(nr,fname="core_the_i",oc=list(range(2,10)),tagWords=nr.terms[:10])

../VEP_Scatter/index.html#data/core_the_i.csv?x=the&y=i&grp=genre&grayCol=Other
                A59170.headed.txt :  0.121 : TR:1648:Seneca:Medea
A31023.headed.txt :  0.090 : TR:1655:Baron:Mirza
A02262.headed.txt :  0.088 : TR:1640:Grotius:Christ's Passion
A11909.headed_048_tragedy_006.txt :  0.087 : TR:1558:Seneca:Troas
A11909.headed_069_tragedy_008.txt :  0.086 : TR:1566:Seneca:Agamemnon
A11909.headed_025_tragedy_004.txt :  0.083 : TR:1567:Seneca:Hippolytus
A59169.headed.txt :  0.081 : TR:1651:Seneca:Hippolytus


The next thing we see if that it seems like the tragedies tend to be to the right (and bottom) while the comedies are in the upper left. A different view of the data might help us get a better sense of that.

Using a python statistical graphics library (seaborn), I can make some plots that might help me get a sense of this. Here are beeswarm, violin, and boxplots.

In [5]:
pylab.clf()
seaborn.swarmplot(x=genres,y=nr.matrix[:,0],palette=Plays.genreColors_)
pylab.show()
seaborn.violinplot(x=genres,y=nr.matrix[:,0],palette=Plays.genreColors_)
pylab.show()
seaborn.boxplot(x=genres,y=nr.matrix[:,0],palette=Plays.genreColors_)

Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1be0a950d68>

This gives me some intuition that comedies use less "the" than histories and tragedies. I might want to ask if this is statistically significant (i.e., what is the likelihood of this happening by chance). Technically, I should do an ANOVA or some complicated statistical test. But at first, I can try a basic T-Test to see if comedies are less than tragedies.

In [6]:
com = [v for i,v in enumerate(nr.matrix[:,0]) if genres[i]=="CO"]
tra = [v for i,v in enumerate(nr.matrix[:,0]) if genres[i]=="TR"]
print(len(com),len(tra))
print(numpy.mean(com),numpy.mean(tra))
stats.ttest_ind(com,tra,equal_var=False)

266 154
0.0419095157017 0.0523951818446

Out[6]:
Ttest_indResult(statistic=-8.3566899509183994, pvalue=6.479455675497188e-15)

OK, so the use of the word "the" is unlikely to have come from the same distribution.

For completeness, here are the graphs for the word "I"

In [7]:
pylab.clf()
seaborn.swarmplot(x=genres,y=nr.matrix[:,1],palette=Plays.genreColors_)
pylab.show()
seaborn.violinplot(x=genres,y=nr.matrix[:,1],palette=Plays.genreColors_)
pylab.show()
seaborn.boxplot(x=genres,y=nr.matrix[:,1],palette=Plays.genreColors_)

Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x1be09137390>

## Classifiers¶

Now, we see that (on average) different genres use the words "the" and "i" (the two most common words) differently. So, comedies will tend to be in the upper left of our map, and tragedies will tend to be lower and more to the right.

We can think about using a measurement (e.g., the percentage of words that are "the") to tell use what genre something is. Simply, if this amount is high, we expect it's a tragedy or history.

To assess this, consider a simple game: I pick two plays at random - if one of them is a tragedy, what is the probability of it's "score" being higher? Or put differently, if I consider all pairs of plays, what percentage of those that have one tragedy does the tragedy have the higher score? If all tragedies scored higher than all other plays, this would be 100%. If all tragedies scored less than the others, it would be 0%. If the measure didn't tell us anything, we'd expect it to be 50% (half the time yes, half the time no).

In machine learning, this metric of success is call area under the curve, or more specifically, area under the receiver operating characteristic curve. The reason for this funny name is complicated - but its best to think of the simple explanation: what is the chance that given a pair of elements, the one "in-class" has the higher value.

Here are the AUC (area under curve) scores for the first few words. Notice that a random history has a 75% chance of having a higher percentage of "the" than a random non-history. (Or, for 74.8% of the pairs of plays, the history has the higher percentage of "the"). Similarly, comedies are low in "the" (they only have more than some other play 28.5% of the time). But comedies are high in "I" and "A".

In [8]:
nr.aucTable(genres,[0,1,2,3,4])

   :      CO       TR       TC       HI
the:   0.285    0.692    0.493    0.748
i:   0.765    0.245    0.480    0.313
and:   0.374    0.584    0.526    0.667
to:   0.292    0.663    0.620    0.535
a:   0.758    0.310    0.400    0.315


There are a few lessons to follow up on here (methodological lessons, what it means for these simple words is for someone else to figure out).

First, notice that tragedies and histories tend to do similar things. These words don't distinguish them well. For a fairer test, we might check to see how well things work to tell these two groups apart from comedies. With these words, the tragicomedies aren't going to be too distinctive.

The next thing is to put features together... If "the" and "to" are both good for telling tragedies and histories, maybe counting both will be even better. And... while we're at it, we could subtract the counts of words that are good for comedies.

I've just built a linear classifier (or scoring function): 1 times the number of the, -1 times the numer of i, etc. In this case, the simple thing works pretty well. But you can imagine, I could do something much fancier to pick which things to measure (that's called feature selection) and figuring out what combination of those features to use (the values for the multiplers).

In [9]:
nr.aucTable(["TR+HI" if g=="TR" or g=="HI" else g for g in genres],[0,1,3,4])
manualTrag = Corpus.linear.fromWords(nr,[ ["the","to"], ["i","a"], ["the","to",("i",-1),("a","-1")]])
manualTrag.terms = ["the+to","i+a","the+to-i-a"]
manualTrag.aucTable(genres,[0,1,2])
manualMap = Corpus.linear.fromWords(nr,[["the","to",("i",-1),("a","-1")],["king","lord"] ])
Plays.uplot(manualMap,fname="core_manual_map",tagWords=["the","to","i","a","king","lord"])

   :      CO    TR+HI       TC
the:   0.285    0.741    0.493
i:   0.765    0.221    0.480
to:   0.292    0.654    0.620
a:   0.758    0.279    0.400
:      CO       TR       TC       HI
the+to:   0.241    0.715    0.555    0.708
i+a:   0.825    0.224    0.428    0.262
the+to-i-a:   0.180    0.771    0.570    0.736


# Feature Selection¶

So far, we just took the first 5 words. What if we looked at all 200 and asked - which is best?

In [10]:
for genre in ["CO","TR","HI","TC"]:
com = [1 if g==genre else 0 for g in genres]
auc = nr.auc(com)
aucMag = [abs(a-.5) for a in auc]
aucOrd = numpy.argsort(aucMag)[::-1]
print("Best Words for "+genre)
nr.aucTable(genres,aucOrd[:5])

Best Words for CO
:      CO       TR       TC       HI
death:   0.147    0.843    0.551    0.682
blood:   0.194    0.802    0.547    0.638
from:   0.200    0.758    0.597    0.642
i:   0.765    0.245    0.480    0.313
master:   0.761    0.272    0.365    0.497
Best Words for TR
:      CO       TR       TC       HI
death:   0.147    0.843    0.551    0.682
blood:   0.194    0.802    0.547    0.638
from:   0.200    0.758    0.597    0.642
i:   0.765    0.245    0.480    0.313
you:   0.749    0.255    0.500    0.303
Best Words for HI
:      CO       TR       TC       HI
king:   0.252    0.640    0.554    0.899
lord:   0.376    0.603    0.420    0.833
can:   0.537    0.445    0.652    0.196
could:   0.453    0.581    0.603    0.205
our:   0.247    0.689    0.548    0.782
Best Words for TC
:      CO       TR       TC       HI
find:   0.525    0.399    0.701    0.279
wife:   0.661    0.452    0.318    0.431
well:   0.733    0.311    0.328    0.564
since:   0.391    0.519    0.671    0.488
good:   0.698    0.338    0.329    0.613


Building classifiers by hand is tricky, since we need to consider how common the words are to get the right scaling factors. Just being +/- 1 is pretty limiting. I'm trying to pick words that not only promote each genre, but distinguish similar ones (tragedies and histories are hard).

In [11]:
myGenres = Corpus.linear.fromWords(nr,[
["find",("wife",-1),("well",-1),"since"],
["king","lord",("could",-1)],
["i","master",("death",-1),("blood",-1)],
["death","blood",("i",-1),("you",-1)]
])
myGenres.terms = ["TC","HI","CO","TR"]
myGenres.aucTable(genres,[0,1,2,3])
Plays.uplot(myGenres,fname="core_manual_genres",tagWords=["the","to","i","a","king","lord"],oc=[2,3])

  :      CO       TR       TC       HI
TC:   0.266    0.630    0.762    0.424
HI:   0.339    0.591    0.464    0.913
CO:   0.813    0.203    0.457    0.310
TR:   0.193    0.793    0.521    0.726


Here we try to use a standard Machine Learning approach for classifier construction (Support Vector Machines - SVMs) to build classifiers. The parameter (C) balances between concise answers (fewer words used) and better correctness. If you set C really big, you get classifiers that always pick out the target class. But they use all the words, and weird combinations of them.

In [29]:
svm = Corpus.linear.SVM(nr,genres,l1=True,svmparams={"C":500})
print(svm)
svm.aucTable(genres,[0,1,2,3])
print(svm.topWords(0))
Plays.uplot(svm,fname="core_svm",oc=list(range(2,svm.shape[1])),tagWords=svm.topWords(0))

Build svm model in  1.43
<Corpus.DataMatrix.DataMatrix 554x4>
:      CO       TR       TC       HI
CO:   0.999    0.115    0.294    0.225
HI:   0.410    0.476    0.461    1.000
TC:   0.326    0.422    1.000    0.317
TR:   0.167    1.000    0.420    0.412
[('eyes', 651.65616218888022), ('day', 414.78632329486123), ('long', 412.7276171760189), ('very', 353.32054000930935), ('an', 341.69960270369921), ('though', 338.62091818070354), ('soul', -317.38916558648862), ('death', -336.13708841153516), ('blood', -351.71059621507192), ('better', -368.18832234624335), ('when', -373.05535423671347), ('way', -377.28059660684636), ('both', -391.80174284575617), ('without', -407.14330187802602), ('take', -412.19831934575001), ('who', -428.39114905052554), ('hand', -450.61914867651717), ('hope', -511.050968540599), ('keep', -668.54928102106896), ('might', -675.8669414773733)]


# PCA¶

Our success at crafting a "classifier" so easily suggests that statisics might just cause this to play out.

Here, we apply "Principle Components Analysis" to reduce from 200 dimensions to 2 dimensions.

For comparison, I show the AUC table along with the "manual classifier" (using 4 words) we derived above. The PCA does a reasonable job of distinguishing genres.

In [30]:
pca = Corpus.linear.PCA(nr,10)
Plays.uplot(pca,fname="core_pca")
pca.aucTable(genres,[0,1,2])
manualTrag.aucTable(genres,[0,1,2])

Build PCA model in  0.01