Some map making experiments using the core drama corpus, focused on authorship.
It's probably better to start with one of the simpler notebooks.
# this setup is meant to be common across all the demos...
# since the notebooks may not go where the module code is...
# change directory to the root - since that's where everything goes
import sys
import os
if os.path.basename(os.getcwd()) == "Notebooks":
    print("Changing directory from:", os.getcwd())
    # move up to the project root (assumed to be the parent of Notebooks)
    os.chdir("..")
    print(" to:", os.getcwd())
### Magic to make notebooks work nicely
%matplotlib inline
### standard useful python stuff
from collections import Counter, defaultdict
from importlib import reload
### plotting stuff
import pylab
import seaborn
### numerical stuff
import numpy as N
from scipy import stats
### my CorpusModeler package stuff
import Corpus.DataMatrix as DM
import Corpus.reduce
import Corpus.cull
import Corpus.linear
import Corpus.nonlinear
import Tests.plays as Plays
p = Plays.readCore()
r = Corpus.reduce.ngrams(p,200)
nr = r.normalize()
pca = Corpus.linear.PCA(nr,10)
genres = p.getMetaDataArray("genre")
dates = p.getMetaDataArray("date")
authors = p.getMetaDataArray("author")
We typically do these games for genre, but let's try doing things for authors. These are not uncorrelated, since most authors focused on a small set of genres. Unlike genres, there are a lot of authors, so we'll focus on the top few.
authors = p.getMetaDataArray("author")
## get the top N authors
authorCount = Counter(authors)
## let's see how they look on some of our favorite plots
top6 = [a[0] for a in authorCount.most_common(6)]
Plays.uplot(nr,grpCol="author",onlyGroups=top6, fname="coredrama_author_0")
Plays.uplot(pca,grpCol="author",onlyGroups=top6, fname="coredrama_author_pca")
This sets up to do some machine learning.
We create a vector of authors that uses numbers. We only use the 6 most common authors (but we remove anonymous, because it doesn't make sense to think of that as "an author").
authorToSeparate = [ a[0] for a in authorCount.most_common(7) if a[0] != "Anon."]
authorNumDict = { a:i+1 for i,a in enumerate(authorToSeparate)}
authorNums = [authorNumDict[a]+1 if a in authorNumDict else 0 for a in authors]
authorLDA = Corpus.linear.LinearDiscrim(nr,targets=authorNums,skipZeros=True)
What this has done is "learned" a new projection from the 200 dimensions of the initial data (we took the top 200 words), and reduced it to 2 dimensions in a way that tries to separate the groups as much as possible. Other dimensions (these are just the first 2) further separate classes.
The method I used is "Linear Discriminant Analysis", which is an "old school" machine learning / statistics method. Simple, but it works well here.
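To make the idea concrete, here is a minimal sketch of the textbook form of linear discriminant analysis on made-up data. It is not the code inside `Corpus.linear.LinearDiscrim` (which isn't shown here); the function name `lda_projection` and the toy data are assumptions for illustration only.

```python
import numpy as N

def lda_projection(X, labels, n_dims=2):
    """Return a (features x n_dims) matrix whose columns separate the classes."""
    X = N.asarray(X, dtype=float)
    labels = N.asarray(labels)
    overall_mean = X.mean(axis=0)
    n_features = X.shape[1]
    within = N.zeros((n_features, n_features))   # scatter inside each class
    between = N.zeros((n_features, n_features))  # scatter of the class means
    for c in N.unique(labels):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        within += centered.T @ centered
        diff = (mean_c - overall_mean).reshape(-1, 1)
        between += len(Xc) * (diff @ diff.T)
    # directions with the most between-class scatter relative to within-class scatter
    eigvals, eigvecs = N.linalg.eig(N.linalg.pinv(within) @ between)
    order = N.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_dims]]

# made-up data: 3 "authors", 30 "plays" each, 4 "word frequencies"
rng = N.random.default_rng(0)
centers = N.array([[0, 0, 0, 0], [6, 0, 0, 0], [0, 6, 0, 0]])
X = N.vstack([rng.normal(c, 1.0, size=(30, 4)) for c in centers])
labels = N.repeat([1, 2, 3], 30)

W = lda_projection(X, labels)   # 4 dimensions -> 2 dimensions
projected = X @ W
```

With 3 classes there are at most 2 useful directions; with 6 authors there would be up to 5, which is why the plot only shows the first 2.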
Remember, this was trained to distinguish the top 6 authors. We don't expect it to do well on other authors, but we can look...
Plays.uplot(authorLDA,grpCol="author",onlyGroups=[a[0] for a in authorCount.most_common(13)[7:]],fname="coredrama_author_lda_nt.csv")
We can look at what words it uses to make the X axis. Here, this seems to tell us that Shirley uses been, every, poor, ... a lot, while everyone else uses being, look, only, ... a lot (those have negative weights, so having those words pulls documents to the left).
Be careful with this game though: this is actually using all 200 words. And if a word is rare, it might still be useful.
To help sort this out, I show the weightings (just the 20 biggest ones), and I also sort the list by which words make the top "contributions" (multiplying frequency by weight).
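Here is a small sketch of the difference between the two orderings. The words are the ones mentioned above, but the weights and frequencies are made-up numbers, not values from the corpus.

```python
import numpy as N

words = ["been", "every", "poor", "being", "look", "only"]
# made-up axis weights and mean relative frequencies, for illustration only
weights = N.array([4.0, 2.5, 9.0, -3.0, -1.5, -6.0])
freqs = N.array([0.020, 0.010, 0.001, 0.015, 0.012, 0.002])

# a word's "contribution" to the axis is its weight times how often it occurs
contrib = weights * freqs

# rank by size of weight alone, then by size of contribution
by_weight = [words[i] for i in N.argsort(N.abs(weights))[::-1]]
by_contrib = [words[i] for i in N.argsort(N.abs(contrib))[::-1]]

print("by weight:      ", by_weight)
print("by contribution:", by_contrib)
```

In this toy case "poor" has the biggest weight but is so rare that it lands last by contribution, while the common word "been" moves to the top.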
You might wonder what happens if, rather than using those crazy specific weightings, I just used simpler numbers. Here, I've quantized things to integers in the range +/- 10. It isn't as clear a picture, but it's generally right.
authorLDAq = authorLDA.quantizeCols(10)
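One plausible way to do this kind of quantization is sketched below. This is an assumption about the general idea, not the actual implementation of `quantizeCols`; the helper `quantize_cols` and the example weights are hypothetical.

```python
import numpy as N

def quantize_cols(weights, levels=10):
    """Scale each column so its largest |weight| maps to `levels`, then round."""
    weights = N.asarray(weights, dtype=float)
    peak = N.abs(weights).max(axis=0)
    peak[peak == 0] = 1.0   # leave an all-zero column as zeros
    return N.rint(weights / peak * levels).astype(int)

# made-up projection weights: 3 words (rows) by 2 axes (columns)
W = N.array([[0.8312, -0.02],
             [-0.4127, 0.05],
             [0.0391, -0.10]])
Wq = quantize_cols(W)
print(Wq)
```

Small weights round to 0, so the quantized axis effectively ignores words that barely mattered in the first place.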
This is using a different machine learning technique called "Relevant Components Analysis", which achieves similar effects in a different manner (it is trying to minimize the variance within the groups, rather than maximize the variance between groups). It gets similar results.
authorRCA = Corpus.linear.RCA(nr,authorNums,skipZeros=True)
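The core of the idea can be sketched as a "whitening" by the within-group covariance: directions in which plays by the same author vary a lot get shrunk. This is a generic sketch on made-up data, not the code inside `Corpus.linear.RCA`; the function names are hypothetical, the treatment of label 0 is my guess at what `skipZeros` means, and the sketch keeps all dimensions rather than reducing to 10.

```python
import numpy as N

def within_group_cov(X, labels):
    """Covariance of points around their own group mean (label 0 = no group)."""
    pieces = []
    for c in N.unique(labels[labels != 0]):
        Xc = X[labels == c]
        pieces.append(Xc - Xc.mean(axis=0))
    centered = N.vstack(pieces)
    return centered.T @ centered / len(centered)

def rca_transform(X, labels):
    """Return the matrix C^(-1/2), where C is the within-group covariance."""
    eigvals, eigvecs = N.linalg.eigh(within_group_cov(X, labels))
    eigvals = N.maximum(eigvals, 1e-12)   # guard against tiny or zero variance
    return eigvecs @ N.diag(eigvals ** -0.5) @ eigvecs.T

# made-up data: two groups stretched along x, separated along y, plus 5 unlabeled points
rng = N.random.default_rng(1)
spread = N.array([[3.0, 0.0], [0.0, 0.3]])
group_a = rng.normal(size=(40, 2)) @ spread
group_b = rng.normal(size=(40, 2)) @ spread + N.array([0.0, 2.0])
other = rng.normal(size=(5, 2)) @ spread
X = N.vstack([group_a, group_b, other])
labels = N.array([1] * 40 + [2] * 40 + [0] * 5)

W = rca_transform(X, labels)
Xw = X @ W
cov_after = within_group_cov(Xw, labels)   # should be close to the identity
```

After the transform each group is roughly round, so the gap between the groups is large compared to the spread inside them.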
Plays.uplot(authorRCA,grpCol="author",onlyGroups=[a[0] for a in authorCount.most_common(13)[7:]],fname="coredrama_author_rca_nt.csv")
Part of what is going on here is that the groups are separated more in higher dimensions (RCA is producing a 10 dimensional space).
We can use a technique called TSNE to show what is close to each other in the higher dimensional space. In TSNE, the absolute positions don't mean anything: points are placed so that they end up near the things they are near in the high-dimensional space.
authorRCATSNE = Corpus.nonlinear.tsne(authorRCA)
Plays.uplot(authorRCATSNE,grpCol="author",onlyGroups=[a[0] for a in authorCount.most_common(13)[7:]],fname="coredrama_author_rca_tsne_nt.csv")
Look at the interactive version of the plot to see what plays look like Shakespeare (or Heywood, or ...). Mike and Jonathan tell me that's interesting.
svec = ["Shakespeare" in author for author in authors]
auc = nr.auc(svec)
aucAbs = [abs(a-.5) for a in auc]
aucOrder = N.argsort(aucAbs)[::-1]
for idx in aucOrder[:10]: