This notebook contains some initial experiments to create "maps" of the core drama corpus.
The pictures inline in the notebook are simple plots made with Python's plotting libraries - to really explore the maps, you will probably want to look at the interactive versions.
This makes sure we're in the right directory, and then loads in a bunch of python libraries. All stuff to do before we get started.
# this setup is meant to be common across all the demos...
# since the notebooks may not go where the module code is...
# change directory to the root - since that's where everything goes
import sys
import os
if os.path.basename(os.getcwd()) == "Notebooks":
    print("Changing directory from:", os.getcwd())
    os.chdir("..")
    print(" to:", os.getcwd())
### Magic to make notebooks work nicely
%matplotlib inline
### standard useful python stuff
from collections import Counter, defaultdict
from importlib import reload
### plotting stuff
import pylab
import seaborn
### numerical stuff
import numpy
from scipy import stats
### my CorpusModeler package stuff
import Corpus.DataMatrix as DM
import Corpus.reduce
import Corpus.cull
import Corpus.linear
import Corpus.nonlinear
import Tests.plays as Plays
This next block sets up some of the basic data structures - loading the corpus and creating some useful versions of it.
Warning - this uses the wrong form of normalization. I should re-run the experiments with the right normalization, but that would require updating the narrative.
p = Plays.readCore()
Plays.addNames(p)
r = Corpus.reduce.ngrams(p,200)
nr = r.normalize()
genres = p.getMetaDataArray("genre")
dates = p.getMetaDataArray("date")
authors = p.getMetaDataArray("author")
Here is a first simple map...
Using the 2 most common words!
The first thing that is striking is the outlier - there's a tragedy that has 12% of its words being "the". If you are looking at the interactive version of the plot, you can mouse over it to see what it is. Instead, I will list all documents that have more than 8% of their words being "the" (or scoring over .08 on the first feature).
Warning - this normalizes by the number of words that fall into the top 200 (so 12% means "12% of the top-200 words" - not 12% of the overall document). This probably isn't the right thing to do, as it brings in effects of the number of unusual words.
reload(Plays)
Plays.uplot(nr,fname="core_the_i",oc=list(range(2,10)),tagWords=nr.terms[:10])
nr.printTopDocs(0,pairs=True,metaData="name",thresh=.08)
The next thing we see is that the tragedies tend to be to the right (and bottom) while the comedies are in the upper left. A different view of the data might help us get a better sense of that.
Using a python statistical graphics library (seaborn), I can make some plots that might help me get a sense of this. Here are beeswarm, violin, and boxplots.
pylab.clf()
seaborn.swarmplot(x=genres,y=nr.matrix[:,0],palette=Plays.genreColors_)
pylab.show()
seaborn.violinplot(x=genres,y=nr.matrix[:,0],palette=Plays.genreColors_)
pylab.show()
seaborn.boxplot(x=genres,y=nr.matrix[:,0],palette=Plays.genreColors_)
This gives me some intuition that comedies use "the" less than histories and tragedies. I might want to ask if this is statistically significant (i.e., what is the likelihood of this happening by chance?). Technically, I should do an ANOVA or some more sophisticated statistical test. But at first, I can try a basic t-test to see if comedies score lower than tragedies.
com = [v for i,v in enumerate(nr.matrix[:,0]) if genres[i]=="CO"]
tra = [v for i,v in enumerate(nr.matrix[:,0]) if genres[i]=="TR"]
print(len(com),len(tra))
print(numpy.mean(com),numpy.mean(tra))
stats.ttest_ind(com,tra,equal_var=False)
OK, so the use of the word "the" is unlikely to have come from the same distribution.
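For the record, the ANOVA mentioned above is nearly a one-liner with scipy. Here is a sketch comparing "the" usage across all four genres (assuming the genre codes CO/TR/HI/TC used elsewhere in this notebook):
from scipy import stats
# one value list per genre, for the "the" column
groups = [[v for i,v in enumerate(nr.matrix[:,0]) if genres[i]==g]
          for g in ("CO","TR","HI","TC")]
print(stats.f_oneway(*groups))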
For completeness, here are the graphs for the word "I"
pylab.clf()
seaborn.swarmplot(x=genres,y=nr.matrix[:,1],palette=Plays.genreColors_)
pylab.show()
seaborn.violinplot(x=genres,y=nr.matrix[:,1],palette=Plays.genreColors_)
pylab.show()
seaborn.boxplot(x=genres,y=nr.matrix[:,1],palette=Plays.genreColors_)
Now, we see that (on average) different genres use the words "the" and "i" (the two most common words) differently. So, comedies will tend to be in the upper left of our map, and tragedies will tend to be lower and more to the right.
We can think about using a measurement (e.g., the percentage of words that are "the") to tell us what genre something is. Simply put, if this amount is high, we expect it's a tragedy or history.
To assess this, consider a simple game: I pick two plays at random - if one of them is a tragedy, what is the probability of its "score" being higher? Or put differently, if I consider all pairs of plays that contain exactly one tragedy, in what percentage of them does the tragedy have the higher score? If all tragedies scored higher than all other plays, this would be 100%. If all tragedies scored lower than the others, it would be 0%. If the measure didn't tell us anything, we'd expect it to be 50% (half the time yes, half the time no).
In machine learning, this metric of success is called area under the curve, or more specifically, area under the receiver operating characteristic curve. The reason for this funny name is complicated - but it's best to think of the simple explanation: what is the chance that, given a pair of elements, the one "in-class" has the higher value?
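Before looking at the table, here is a minimal sketch of that pairwise definition (a hypothetical helper, not part of the Corpus package - nr.aucTable below does the real work):
from itertools import product

def pairwise_auc(scores, in_class):
    # fraction of (in-class, out-of-class) pairs where the in-class item
    # scores higher; ties count as half a win
    pos = [s for s,c in zip(scores,in_class) if c]
    neg = [s for s,c in zip(scores,in_class) if not c]
    wins = sum(p > n for p,n in product(pos,neg))
    ties = sum(p == n for p,n in product(pos,neg))
    return (wins + .5*ties) / (len(pos)*len(neg))

# e.g. the AUC of the "the" column for tragedies:
# pairwise_auc(nr.matrix[:,0], [g=="TR" for g in genres])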
Here are the AUC (area under curve) scores for the first few words. Notice that a random history has a 75% chance of having a higher percentage of "the" than a random non-history (for 74.8% of the history/non-history pairs, the history has the higher percentage of "the"). Similarly, comedies are low in "the" (the comedy has the higher value in only 28.5% of its pairs). But comedies are high in "I" and "A".
nr.aucTable(genres,[0,1,2,3,4])
There are a few lessons to follow up on here (methodological lessons - what these simple words mean is for someone else to figure out).
First, notice that tragedies and histories tend to do similar things. These words don't distinguish them well. For a fairer test, we might check to see how well things work to tell these two groups apart from comedies. With these words, the tragicomedies aren't going to be too distinctive.
The next thing is to put features together... If "the" and "to" are both good for picking out tragedies and histories, maybe counting both will be even better. And while we're at it, we could subtract the counts of words that are good for comedies.
I've just built a linear classifier (or scoring function): 1 times the number of "the", -1 times the number of "i", etc. In this case, the simple thing works pretty well. But you can imagine I could do something much fancier to pick which things to measure (that's called feature selection) and to figure out what combination of those features to use (the values for the multipliers).
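Here is a minimal sketch of that scoring function in plain numpy (assuming nr.matrix is documents x terms and nr.terms lists the column words; Corpus.linear.fromWords below is the package's real version):
def score_by_words(matrix, terms, weighted_words):
    # weighted sum of selected word columns, e.g. [("the",1),("i",-1)]
    score = numpy.zeros(matrix.shape[0])
    for word, weight in weighted_words:
        score += weight * matrix[:, list(terms).index(word)]
    return score

tragScore = score_by_words(nr.matrix, nr.terms,
                           [("the",1),("to",1),("i",-1),("a",-1)])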
nr.aucTable(["TR+HI" if g=="TR" or g=="HI" else g for g in genres],[0,1,3,4])
manualTrag = Corpus.linear.fromWords(nr,[ ["the","to"], ["i","a"], ["the","to",("i",-1),("a",-1)]])
manualTrag.terms = ["the+to","i+a","the+to-i-a"]
manualTrag.aucTable(genres,[0,1,2])
manualMap = Corpus.linear.fromWords(nr,[["the","to",("i",-1),("a",-1)],["king","lord"] ])
Plays.uplot(manualMap,fname="core_manual_map",tagWords=["the","to","i","a","king","lord"])
So far, we just took the first 5 words. What if we looked at all 200 and asked - which is best?
for genre in ["CO","TR","HI","TC"]:
    com = [1 if g==genre else 0 for g in genres]
    auc = nr.auc(com)
    aucMag = [abs(a-.5) for a in auc]
    aucOrd = numpy.argsort(aucMag)[::-1]
    print("Best Words for "+genre)
    nr.aucTable(genres,aucOrd[:5])
Building classifiers by hand is tricky, since we need to consider how common the words are to get the right scaling factors. Just being +/- 1 is pretty limiting. I'm trying to pick words that not only promote each genre, but distinguish similar ones (tragedies and histories are hard).
myGenres = Corpus.linear.fromWords(nr,[
    ["find",("wife",-1),("well",-1),"since"],
    ["king","lord",("could",-1)],
    ["i","master",("death",-1),("blood",-1)],
    ["death","blood",("i",-1),("you",-1)]
])
myGenres.terms = ["TC","HI","CO","TR"]
myGenres.aucTable(genres,[0,1,2,3])
Plays.uplot(myGenres,fname="core_manual_genres",tagWords=["the","to","i","a","king","lord"],oc=[2,3])
Here we try to use a standard Machine Learning approach for classifier construction (Support Vector Machines - SVMs) to build classifiers. The parameter (C) balances between concise answers (fewer words used) and better correctness. If you set C really big, you get classifiers that always pick out the target class. But they use all the words, and weird combinations of them.
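For reference, here is roughly what that looks like with scikit-learn directly (a sketch under the assumption that Corpus.linear.SVM wraps something similar; the parameter names are sklearn's, not the package's):
from sklearn.svm import LinearSVC

clf = LinearSVC(penalty="l1", dual=False, C=500, max_iter=10000)
clf.fit(nr.matrix, genres)
# L1 regularization drives most coefficients to exactly zero, so each
# genre's classifier ends up using only a few of the 200 words
for row, genre in zip(clf.coef_, clf.classes_):
    print(genre, sum(1 for w in row if w != 0), "nonzero words")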
svm = Corpus.linear.SVM(nr,genres,l1=True,svmparams={"C":500})
print(svm)
svm.aucTable(genres,[0,1,2,3])
print(svm.topWords(0))
Plays.uplot(svm,fname="core_svm",oc=list(range(2,svm.shape[1])),tagWords=svm.topWords(0))
Our success at crafting a "classifier" so easily suggests that standard statistical machinery might find this structure on its own.
Here, we apply "Principal Components Analysis" (PCA) to reduce from 200 dimensions down to a few (the map shows the first 2).
For comparison, I show the AUC table along with the "manual classifier" (using 4 words) we derived above. The PCA does a reasonable job of distinguishing genres.
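As a cross-check, the equivalent reduction is available in scikit-learn (a sketch; Corpus.linear.PCA below is the package's own wrapper):
from sklearn.decomposition import PCA as SKPCA

skpca = SKPCA(n_components=10)
coords = skpca.fit_transform(nr.matrix)        # documents x 10
print(skpca.explained_variance_ratio_[:2])     # variance captured by the 2 plotted dims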
pca = Corpus.linear.PCA(nr,10)
Plays.uplot(pca,fname="core_pca")
pca.aucTable(genres,[0,1,2])
manualTrag.aucTable(genres,[0,1,2])
For contrast, here is a supervised reduction - RCA - which uses the genre labels directly to pick its dimensions (more on this method below).
rca = Corpus.linear.RCA(nr,genres,dims=3)
rca.aucTable(genres,[0,1,2])
Plays.uplot(rca,fname="core_rca",oc=[2])
Running t-SNE on the RCA dimensions gives a nonlinear map that emphasizes local neighborhoods.
rcaTSNE = Corpus.nonlinear.tsne(rca)
Plays.uplot(rcaTSNE,fname="core_rca_tsne")
In addition to these compass directions, we might wonder about the neighborhoods... If I take a point on the map, are its neighbors going to be of the same genre?
What I am going to do is to "predict" the genre of each play based on its neighbors on the map. For each play, I will pick the 5 closest neighbors (this actually means itself, plus 4 real neighbors) and have this group of 5 "vote". So, if 2 of the 4 real neighbors share the play's genre, the play will be labeled with its own genre (its own vote breaks the tie). If more than 2 of the closest neighbors are different, the play will be labeled with a different genre. We can see how often this predicts correctly.
In the plot, I highlight ones that are "wrong" (plays whose genre is not the same as that of most of their neighbors).
The table gives a sense of how well this worked.
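Here is a sketch of that voting scheme with scikit-learn (Tests.knn below is the package's own version; fitting and predicting on the same points makes each play one of its own 5 neighbors):
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(nr.matrix[:,:2], genres)              # the 2-D "the"/"i" map
predicted = knn.predict(nr.matrix[:,:2])      # each play votes for itself
right = sum(p == g for p,g in zip(predicted, genres))
print("fraction of plays whose neighbors predict their genre:", right/len(genres))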
import Tests.knn as KNN
KNN.knnExp(nr,genres,nd=2,uplotArgs={"fname":"core_i_the_neighbors","tagWords":["the","i"]})
The fact that the first features we tried (the most common words) are correlated with genre might make us wonder if we just got lucky. Let's see if something else is correlated with these words. The obvious thing to try is date, since we have that handy... Here's a plot of "the" vs. date.
seaborn.regplot(numpy.array(nr.getMetaDataArray("date"),dtype=float),nr.matrix[:,0])
pylab.show()
seaborn.regplot(numpy.array(nr.getMetaDataArray("date"),dtype=float),nr.matrix[:,1])
Short answer - the use of "the" and "i" doesn't look correlated with date. It is probably worth asking, though, whether genre is correlated with date. Again, rather than something statistical, I'll make a picture.
dates = numpy.array(nr.getMetaDataArray("date"))
pylab.clf()
seaborn.swarmplot(x=genres,y=dates,palette=Plays.genreColors_)
pylab.show()
seaborn.violinplot(x=genres,y=dates,palette=Plays.genreColors_)
pylab.show()
seaborn.boxplot(x=genres,y=dates,palette=Plays.genreColors_)
For tragedies and comedies, it doesn't seem to vary so much. Histories and Tragi-Comedies certainly do change. This might explain what we see in "the" and "I".
And just to show that correlations do happen, let's see that "the" and "i" are correlated (or anti-correlated, to be precise).
seaborn.regplot(nr.matrix[:,0],nr.matrix[:,1])
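To put a number on that relationship, here is a quick check (not part of the original analysis):
from scipy import stats
r, pvalue = stats.pearsonr(nr.matrix[:,0], nr.matrix[:,1])
print("correlation of 'the' and 'i':", r, "p =", pvalue)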
As an experiment, let's see how/if the usage of the top 200 words changed after Shakespeare. I divide the plays into 3 groups: before Shakespeare, during Shakespeare, and after Shakespeare. I will then do "supervised dimensionality reduction" to pick the 2 dimensions that best separate before from after. The specific method I am choosing is called "Relevant Components Analysis" (or RCA).
reload(Plays)
dates = p.getMetaDataArray("date")
dix = [1 if d<1590 else 2 if d>1613 else 0 for d in dates]
print(Counter(dix))
dixRCA = Corpus.linear.RCA(nr,dix,skipZeros=True)
Plays.uplot(dixRCA,grpCol=dix,fname="core_before_after_shak")
Plays.uplot(dixRCA,fname="core_before_after_shake_genre")
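RCA itself isn't in scikit-learn, but as a sanity check, a closely related supervised reduction - Linear Discriminant Analysis - finds the same kind of separating axis. This is a sketch using a swapped-in method, not the RCA used above:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

labeled = [i for i,d in enumerate(dix) if d != 0]   # drop the "during" plays
y = numpy.array([dix[i] for i in labeled])
lda = LinearDiscriminantAnalysis(n_components=1)    # 2 classes -> 1 axis
scores = lda.fit_transform(nr.matrix[labeled], y)
print("before-Shakespeare mean:", scores[y==1].mean())
print("after-Shakespeare mean:", scores[y==2].mean())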
Of course, the first question to ask is "what words" are making that nice division along the x axis. The answer is all of them, but we can look at the top 20. Positive means "pulls to the right" (i.e., is seen more often in the before-Shakespeare plays); negative means it pulls to the left.
dixRCA.topWords(0)
Do the weightings matter? Here I force them to be -2, -1, 0, 1, or 2. And it works almost as well.
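For intuition, here is one plausible way such a quantization could work (an assumption - the real quantizeWords may differ): rescale so the largest weight has magnitude 2, then round to the nearest integer.
def quantize_weights(w, levels=2):
    # hypothetical sketch: map weights onto the integers -levels..levels
    w = numpy.asarray(w, dtype=float)
    return numpy.rint(w / numpy.abs(w).max() * levels).astype(int)

print(quantize_weights([.8, -.3, .05, -.9]))   # -> [ 2 -1  0 -2]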
dixRCAq = dixRCA.quantizeWords(2)
Plays.uplot(dixRCAq,grpCol=dix,fname="core_before_after_shak_quant")
dixRCAq.topWords(0)
Do we need all 200 words? How about just the 10 most important?
dixRCAn = dixRCA.topNCol(10)
print(dixRCAn.topWords(0))
Plays.uplot(dixRCAn,grpCol=dix,fname="core_before_after_shak_n10",tagWords=dixRCAn.topWords(0))
Looks like we need more than 10 words. Picking 50 does much better.
dixRCAn = dixRCA.topNCol(50)
print(dixRCAn.topWords(0))
Plays.uplot(dixRCAn,grpCol=dix,fname="core_before_after_shak_n50",tagWords=dixRCAn.topWords(0,50))
How about.... training on Shakespeare, and seeing what happens to the other plays?
reload(Plays.PL)
sg = Plays.shakGenres(nr, nonGroup=-1)
shakRCA = Corpus.linear.RCA(nr,sg,shiftCov=.001)
Plays.uplot(shakRCA,grpCol="genre",highlight=Plays.shakespeareVec(r),fname="shak-super-example",tagWords=shakRCA.topWords(0,10))