Some map making experiments using the core drama corpus, focused on authorship.
If you are new to these demos, it's probably better to start with one of the simpler notebooks.
# this setup is meant to be common across all the demos...
# since the notebooks may not go where the module code is...
# change directory to the root - since that's where everything goes
import sys
import os
if os.path.basename(os.getcwd()) == "Notebooks":
    print("Changing directory from:",os.getcwd())
    os.chdir("..")
    print(" to:",os.getcwd())
### Magic to make notebooks work nicely
%matplotlib inline
### standard useful python stuff
from collections import Counter, defaultdict
from importlib import reload
### plotting stuff
import pylab
import seaborn
### numerical stuff
import numpy as N
from scipy import stats
### my CorpusModeler package stuff
import Corpus.DataMatrix as DM
import Corpus.reduce
import Corpus.cull
import Corpus.linear
import Corpus.nonlinear
import Tests.plays as Plays
p = Plays.readCore()
Plays.addNames(p)
r = Corpus.reduce.ngrams(p,200)
nr = r.normalize()
pca = Corpus.linear.PCA(nr,10)
genres = p.getMetaDataArray("genre")
dates = p.getMetaDataArray("date")
authors = p.getMetaDataArray("author")
We typically play these games for genre, but let's try them for authors. Author and genre are not uncorrelated, since most authors focused on a small set of genres. Unlike genres, there are a lot of authors, so we'll focus on the top few.
reload(Plays)
reload(Plays.PL)
authors = p.getMetaDataArray("author")
## get the top N authors
authorCount = Counter(authors)
## let's see how they look on some of our favorite plots
top6 = [a[0] for a in authorCount.most_common(6)]
Plays.uplot(nr,grpCol="author",onlyGroups=top6, fname="coredrama_author_0")
Plays.uplot(pca,grpCol="author",onlyGroups=top6, fname="coredrama_author_pca")
This sets up to do some machine learning.
We create a vector that encodes each play's author as a number. We only use the 6 most common authors (we remove anonymous, because it doesn't make sense to think of "Anon." as a single author).
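# take the 7 most common "authors" and drop Anon. (this leaves 6 if Anon. is among the top 7)
authorToSeparate = [ a[0] for a in authorCount.most_common(7) if a[0] != "Anon."]
# number the chosen authors 1..k; 0 is reserved for "everyone else"
authorNumDict = { a:i+1 for i,a in enumerate(authorToSeparate)}
# one number per play (the dict is already 1-based, so no extra +1 is needed)
authorNums = [authorNumDict[a] if a in authorNumDict else 0 for a in authors]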
authorLDA = Corpus.linear.LinearDiscrim(nr,targets=authorNums,skipZeros=True)
Plays.uplot(authorLDA,grpCol="author",onlyGroups=authorToSeparate,fname="coredrama_author_lda.csv")
What this has done is "learned" a new projection from the 200 dimensions of the initial data (we took the top 200 words) down to 2 dimensions, chosen to separate the groups as much as possible. Other dimensions (these are just the first 2) further separate the classes.
The method I used is "Linear Discriminant Analysis", which is an "old school" machine learning / statistics method. Simple, but it works well here.
Remember, this was trained to distinguish the top 6 authors. We don't expect it to do well on other authors, but we can look...
Plays.uplot(authorLDA,grpCol="author",onlyGroups=[a[0] for a in authorCount.most_common(13)[7:]],fname="coredrama_author_lda_nt.csv")
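To make the idea of a discriminant projection concrete, here is a minimal sketch of Fisher's LDA in plain numpy on made-up data. It is not the code inside `Corpus.linear.LinearDiscrim` (which isn't shown here); the function name `lda_projection` and the toy data are assumptions for illustration only. It finds directions that make the spread between group means large relative to the spread within groups.

```python
import numpy as np

def lda_projection(X, labels, n_dims=2):
    """Hypothetical helper: return an (n_features, n_dims) Fisher LDA projection."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)
    Sw = np.zeros((n_features, n_features))  # within-group scatter
    Sb = np.zeros((n_features, n_features))  # between-group scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        Sw += centered.T @ centered
        diff = (mean_c - overall_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    Sw += 1e-6 * np.eye(n_features)  # small ridge so Sw is safely invertible
    # Solve Sb w = lambda Sw w by whitening with the Cholesky factor of Sw
    L = np.linalg.cholesky(Sw)
    Linv = np.linalg.inv(L)
    eigvals, eigvecs = np.linalg.eigh(Linv @ Sb @ Linv.T)
    order = np.argsort(eigvals)[::-1][:n_dims]  # largest separation first
    return Linv.T @ eigvecs[:, order]

# Toy data: three groups of 30 points in 5 dimensions (made up for this sketch)
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0, 0], [6, 0, 0, 0, 0], [0, 6, 0, 0, 0]], dtype=float)
X = np.vstack([rng.normal(c, 1.0, size=(30, 5)) for c in centers])
labels = np.repeat([1, 2, 3], 30)

W = lda_projection(X, labels)
projected = X @ W  # each row is a point in the learned 2D space
```

With at most k groups there are only k-1 useful discriminant directions, which is why the 6-author model above has a few more dimensions beyond the 2 that get plotted.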
We can look at what words it uses to make the X axis. Here, this seems to tell us that Shirley uses been, every, poor, ... a lot, while everyone else uses being, look, only, ... a lot (those weights are negative, so having those words pulls documents to the left).
Be careful with this game though: this is actually using all 200 words. And if a word is rare, it might still be useful.
To help sort this out, I show the weightings (just the 20 biggest ones), and then the same list sorted by which words make the biggest "contributions" (multiplying frequency by weight).
print("Weightings")
print(authorLDA.topWords(0))
print("Contributions")
print(authorLDA.topWords(0,contrib=True))
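The difference between the two lists can be sketched with a toy example. This is not how `topWords` is actually implemented (that code isn't shown here); the helper `top_words`, the choice of mean frequency as the "frequency", and all the numbers are assumptions for illustration.

```python
import numpy as np

def top_words(terms, weights, mean_freqs=None, k=3):
    """Hypothetical helper: rank terms by |weight|, or by |weight * mean frequency|."""
    weights = np.asarray(weights, dtype=float)
    if mean_freqs is None:
        scores = weights
    else:
        scores = weights * np.asarray(mean_freqs, dtype=float)
    order = np.argsort(-np.abs(scores))[:k]  # biggest magnitude first
    return [(terms[i], float(scores[i])) for i in order]

# Made-up numbers: "poor" has a big weight but is rare; "the" has a tiny weight but is common
terms = ["the", "been", "poor", "look"]
weights = [0.1, 2.0, -3.0, 1.5]
mean_freqs = [0.50, 0.02, 0.001, 0.03]

by_weight = top_words(terms, weights, k=2)
by_contrib = top_words(terms, weights, mean_freqs, k=2)
print("Weightings", by_weight)
print("Contributions", by_contrib)
```

In this toy case the weight ranking is led by the rare words, while the contribution ranking is led by the common ones, which is why it helps to look at both lists.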
You might wonder what happens if, rather than using those crazy specific weightings, I just used simpler numbers. Here, I've quantized the weights to integers between -10 and +10. It isn't as clear a picture, but it's generally right.
authorLDAq = authorLDA.quantizeCols(10)
Plays.uplot(authorLDAq,grpCol="author",onlyGroups=authorToSeparate,fname="coredrama_author_ldaq.csv")
print("Weightings")
print(authorLDAq.topWords(0))
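One plausible way to do this kind of quantization is sketched below. It is only a guess at the idea, not the actual `quantizeCols` code; the function name `quantize_weights` and the sample weights are made up. It rescales a column of weights so the largest magnitude becomes 10, then rounds everything to the nearest integer.

```python
import numpy as np

def quantize_weights(weights, levels=10):
    """Hypothetical helper: scale so the biggest |weight| is `levels`, then round."""
    weights = np.asarray(weights, dtype=float)
    biggest = np.max(np.abs(weights))
    if biggest == 0:
        return np.zeros_like(weights, dtype=int)
    return np.rint(weights / biggest * levels).astype(int)

weights = np.array([0.734, -0.291, 0.052, -0.018])  # made-up weights for one axis
quantized = quantize_weights(weights)
print(quantized)
```

Small weights collapse to 0, so the quantized axis effectively uses fewer words, which is one reason the picture gets a little blurrier.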
This uses a different machine learning technique called "Relevant Component Analysis", which achieves a similar effect in a different manner (it tries to minimize the variance within the groups, rather than maximize the variance between groups). It gets similar results.
reload(Corpus.linear)
authorRCA = Corpus.linear.RCA(nr,authorNums,skipZeros=True)
Plays.uplot(authorRCA,grpCol="author",onlyGroups=authorToSeparate,fname="coredrama_author_rca.csv")
Plays.uplot(authorRCA,grpCol="author",onlyGroups=[a[0] for a in authorCount.most_common(13)[7:]],fname="coredrama_author_rca_nt.csv")
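The core step of RCA can be sketched in a few lines of numpy: average the covariance within each labeled group, then apply the whitening transform that turns that within-group covariance into the identity. This is a sketch under assumptions, not the code in `Corpus.linear.RCA` (which isn't shown, and which also reduces to 10 dimensions); the helper names and toy data are made up, and label 0 is treated as "unlabeled" to mirror `skipZeros=True`.

```python
import numpy as np

def within_group_cov(X, labels):
    """Covariance of points around their own group mean (label 0 is skipped)."""
    groups = [X[labels == c] for c in np.unique(labels) if c != 0]
    centered = np.vstack([g - g.mean(axis=0) for g in groups])
    return centered.T @ centered / len(centered)

def rca_whitening(X, labels):
    """Hypothetical helper: return W = C^(-1/2), C being the within-group covariance."""
    C = within_group_cov(X, labels)
    eigvals, eigvecs = np.linalg.eigh(C)
    eigvals = np.maximum(eigvals, 1e-12)  # guard against division by zero
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

# Toy data: two groups that differ along axis 1 but are very noisy along axis 0
rng = np.random.default_rng(0)
noise = rng.normal(size=(60, 2)) * np.array([5.0, 0.5])
labels = np.repeat([1, 2], 30)
X = noise + np.where(labels[:, None] == 1, [0.0, -2.0], [0.0, 2.0])

W = rca_whitening(X, labels)
X_rca = X @ W  # within-group spread is now the same in every direction
```

After the transform, directions where the groups were internally noisy get squashed, so differences between groups stand out more in distance-based views.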
Part of what is going on here is that the groups are separated more in higher dimensions (RCA is producing a 10 dimensional space).
We can use a technique called t-SNE to show what is close to each other in the higher dimensional space. In t-SNE, the positions themselves don't mean anything: points are positioned so that they are close to the things they are close to in the high-dimensional space.
authorRCATSNE = Corpus.nonlinear.tsne(authorRCA)
Plays.uplot(authorRCATSNE,grpCol="author",onlyGroups=authorToSeparate,fname="coredrama_author_rca_tsne.csv")
Plays.uplot(authorRCATSNE,grpCol="author",onlyGroups=[a[0] for a in authorCount.most_common(13)[7:]],fname="coredrama_author_rca_tsne_nt.csv")
Look at the interactive version of the plot to see which plays look like Shakespeare (or Heywood, or ...). Mike and Jonathan tell me that's interesting.
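# For each word, ask how well its frequency alone separates Shakespeare's plays from the rest
svec = ["Shakespeare" in author for author in authors]
# (assuming nr.auc returns one ROC AUC value per word)
auc = nr.auc(svec)
# an AUC near 0 or 1 discriminates in one direction or the other; 0.5 is chance
aucAbs = [abs(a-.5) for a in auc]
aucOrder = N.argsort(aucAbs)[::-1]
# print the 10 most discriminative words
for idx in aucOrder[:10]:
    print(nr.terms[idx],auc[idx])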
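For readers unfamiliar with AUC, here is a tiny sketch of what such a per-word score would mean, assuming `nr.auc` computes the standard ROC AUC (its definition isn't shown here). The function and the numbers below are made up for illustration. The AUC is the probability that a randomly chosen Shakespeare play uses the word more often than a randomly chosen non-Shakespeare play, with ties counting half.

```python
def auc(values, is_positive):
    """ROC AUC by direct pair counting: P(positive value > negative value)."""
    pos = [v for v, p in zip(values, is_positive) if p]
    neg = [v for v, p in zip(values, is_positive) if not p]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Made-up frequencies of one word in five plays
freq = [0.9, 0.8, 0.7, 0.2, 0.1]
is_shakespeare = [True, True, False, True, False]
score = auc(freq, is_shakespeare)
print(score)
```

A score well above 0.5 means the word is used more by Shakespeare, and a score well below 0.5 means it is used less, so sorting by the distance from 0.5 picks up both kinds of marker words.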