This is a "baked" version of the notebook - it is not editable, but it doesn't require the Jupyter software to be installed. The scatterplots (and some other visualizations) have links to interactive versions.

Authorship in Core Drama

Some map making experiments using the core drama corpus, focused on authorship.

It's probably better to start with one of the simpler notebooks.

In [1]:
# this setup is meant to be common across all the demos...

# since the notebooks may not go where the module code is...
# change directory to the root - since that's where everything goes
import sys
import os

if os.path.basename(os.getcwd()) == "Notebooks":
    print("Changing directory from:",os.getcwd())
    os.chdir("..")
    print("                     to:",os.getcwd())
### Magic to make notebooks work nicely
%matplotlib inline

### standard useful python stuff
from collections import Counter, defaultdict
from importlib import reload

### plotting stuff
import pylab
import seaborn

### numerical stuff
import numpy as N
from scipy import stats

### my CorpusModeler package stuff
import Corpus.DataMatrix as DM
import Corpus.reduce 
import Corpus.cull 
import Corpus.linear
import Corpus.nonlinear
import Tests.plays as Plays
Changing directory from: C:\Users\gleicher\Projects\CorpusModeler\Notebooks
                     to: C:\Users\gleicher\Projects\CorpusModeler
In [2]:
p = Plays.readCore()
r = Corpus.reduce.ngrams(p,200)
nr = r.normalize()
pca = Corpus.linear.PCA(nr,10)
genres = p.getMetaDataArray("genre")
dates = p.getMetaDataArray("date")
authors = p.getMetaDataArray("author")
corpus of (554, 179064) loaded in 0.21792387962341309
Density after reduction is 0.9848104693140795 - making dense
Reduce from 179064 to 200 words in 0.22760534286499023
Build PCA model in  0.02


We typically do these games for genre, but let's try doing things by author. These are not uncorrelated, since most authors focused on a small set of genres. Unlike genres, there are a lot of authors, so we'll focus on the top few.

In [3]:
authors = p.getMetaDataArray("author")

## get the top N authors 
authorCount = Counter(authors)

## let's see how they look on some of our favorite plots
top6 = [a[0] for a in authorCount.most_common(6)]
Plays.uplot(nr,grpCol="author",onlyGroups=top6, fname="coredrama_author_0")
Plays.uplot(pca,grpCol="author",onlyGroups=top6, fname="coredrama_author_pca")

This sets up to do some machine learning.

We create a vector of authors that uses numbers. We only use the 6 most common authors (taking the 7 most common and dropping "Anon.", since it doesn't make sense to treat anonymous plays as sharing an author).

In [4]:
authorToSeparate = [ a[0] for a in authorCount.most_common(7) if a[0] != "Anon."]
authorNumDict = { a:i+1 for i,a in enumerate(authorToSeparate)}
authorNums = [authorNumDict[a] if a in authorNumDict else 0 for a in authors]
authorLDA = Corpus.linear.LinearDiscrim(nr,targets=authorNums,skipZeros=True)
Build LinearDiscrim model in  0.01
C:\Anaconda\envs\python3\lib\site-packages\sklearn\ UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")

What this has done is "learned" a new projection from the 200 dimensions of the initial data (we took the top 200 words) down to 2 dimensions, in a way that tries to separate the groups as much as possible. Other dimensions (these are just the first 2) separate the classes further.

The method I used is "Linear Discriminant Analysis," which is an "old school" machine learning / statistics method. Simple, but it works well here.
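The same idea can be sketched with scikit-learn's off-the-shelf LDA (this is a stand-in on made-up data, not the notebook's `Corpus.linear.LinearDiscrim`; all variable names here are placeholders):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
word_freqs = rng.random((30, 8))          # 30 documents x 8 word frequencies
author_labels = np.repeat([0, 1, 2], 10)  # 3 fake "authors", 10 plays each

# Learn a projection to 2 dimensions that best separates the author groups
lda = LinearDiscriminantAnalysis(n_components=2)
coords = lda.fit_transform(word_freqs, author_labels)
print(coords.shape)  # (30, 2): each document lands on 2 discriminant axes
```

With k classes, LDA can produce at most k-1 discriminant axes, which is why the notebook's 6-author model has extra dimensions beyond the 2 that are plotted.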

Remember, this was trained to distinguish the top 6 authors. We don't expect it to do well on other authors, but we can look...

In [5]:
Plays.uplot(authorLDA,grpCol="author",onlyGroups=[a[0] for a in authorCount.most_common(13)[7:]],fname="coredrama_author_lda_nt.csv")

We can look at what words it uses to make the X axis - here, this seems to tell us that Shirley uses "been", "every", "poor", ... a lot, while everyone else uses "being", "look", "only", ... a lot (those weights are negative, so having those words pulls documents to the left).

Be careful with this game though: the projection is actually using all 200 words, and even a rare word can carry a useful weight.

To help sort this out, I show the weightings (just the 20 biggest ones), but I also sort the list by which words make the biggest "contributions" (frequency multiplied by weight).
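The contribution ranking can be sketched like this - the words, weights, and frequencies below are made-up illustration values, not the notebook's actual numbers:

```python
import numpy as np

# Hypothetical toy data: 5 words, their axis weights, their corpus frequencies
words = ["been", "every", "being", "look", "only"]
weights = np.array([12.4, 9.8, -11.2, -10.0, -9.9])    # projection weights
freqs = np.array([128.0, 156.0, 132.0, 131.0, 132.0])  # summed frequencies

# A word's "contribution" to the axis is its frequency times its weight
contributions = freqs * weights

# Sort by absolute contribution, biggest first
order = np.argsort(np.abs(contributions))[::-1]
for i in order:
    print(words[i], contributions[i])
```

This is why a heavily weighted but rare word can still matter less than a moderately weighted common one.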

In [6]:
[('been', 1587.6677288612718), ('every', 1532.1769511523369), ('poor', 1503.22470223706), ('does', 1168.2573425295545), ('into', 1121.7014758572118), ('death', 1112.5631326222338), ('pray', 1095.3152826744918), ('done', 1083.0040557412115), ('down', 994.63628986760682), ('call', 928.16393490217706), ('first', -793.45079770248708), ('speak', -916.64652688110152), ('night', -918.61504929054922), ('little', -1005.5591358002055), ('could', -1056.5680753927857), ('were', -1059.260099661635), ('both', -1092.0108800776907), ('only', -1306.7976474086063), ('look', -1316.3144215671555), ('being', -1481.1695366656775)]
[('be', 2424.5310282006335), ('not', 2421.1229007960351), ('with', 2089.1119979155678), ('but', 1944.7034630327919), ('to', 1932.2329809657072), ('all', 1881.8402351436641), ('a', 1750.7734210226045), ('does', 1502.492630310114), ('at', 1241.3736985755577), ('come', -1187.5503071762719), ('in', -1306.3299544757083), ('see', -1313.3578575151664), ('no', -1636.1265500912014), ('for', -1651.7445934504528), ('then', -1781.0386927269403), ('were', -1813.8598398823106), ('would', -1818.8934847507064), ('it', -1913.0360268168395), ('and', -3523.8869391148487), ('that', -4396.8141687677899)]

You might wonder what happens if, rather than using those crazy specific weightings, I just used simpler numbers. Here, I've quantized things to integers in +/- 10. It isn't as clear a picture, but it's generally right.
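One plausible way to do this quantization (a guess at the idea behind `quantizeCols`, not its actual implementation) is to scale each column so its largest magnitude maps to the level count, then round:

```python
import numpy as np

def quantize(weights, levels=10):
    """Scale so the largest |weight| maps to +/-levels, then round to integers."""
    scale = levels / np.max(np.abs(weights))
    return np.round(weights * scale)

w = np.array([1587.7, 1532.2, -1481.2, -1316.3])
print(quantize(w))  # rounds to [10., 10., -9., -8.] for these weights
```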

In [7]:
authorLDAq = authorLDA.quantizeCols(10)
In [8]:
[('been', 10.0), ('every', 10.0), ('poor', 9.0), ('pray', 7.0), ('into', 7.0), ('death', 7.0), ('done', 7.0), ('does', 7.0), ('down', 6.0), ('call', 6.0), ('first', -5.0), ('little', -6.0), ('speak', -6.0), ('night', -6.0), ('could', -7.0), ('both', -7.0), ('were', -7.0), ('look', -8.0), ('only', -8.0), ('being', -9.0)]

This is using a different machine learning technique called "Relevant Components Analysis," which achieves a similar effect in a different manner (it tries to minimize the variance within the groups, rather than maximize the variance between groups). It gets similar results.
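The core of RCA can be sketched as whitening the data by the within-group covariance - this is a simplified illustration of the idea, not the notebook's `Corpus.linear.RCA`:

```python
import numpy as np

def rca_transform(X, labels):
    """Whiten X by the within-class covariance (the core move of RCA)."""
    classes = np.unique(labels)
    # Center each class on its own mean, then pool the centered rows
    centered = np.vstack([X[labels == c] - X[labels == c].mean(axis=0)
                          for c in classes])
    within_cov = centered.T @ centered / len(X)
    # Inverse square root of the within-class covariance, via eigendecomposition
    vals, vecs = np.linalg.eigh(within_cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return X @ W

rng = np.random.default_rng(1)
X = rng.random((40, 5))           # 40 fake documents x 5 features
labels = np.tile([0, 1, 2, 3], 10)  # 4 fake groups
print(rca_transform(X, labels).shape)  # (40, 5)
```

After this transform, directions in which the groups vary internally are shrunk, so the between-group differences dominate.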

In [9]:
authorRCA = Corpus.linear.RCA(nr,authorNums,skipZeros=True)
Plays.uplot(authorRCA,grpCol="author",onlyGroups=[a[0] for a in authorCount.most_common(13)[7:]],fname="coredrama_author_rca_nt.csv")

Part of what is going on here is that the groups are separated more in higher dimensions (RCA is producing a 10-dimensional space).

We can use a technique called TSNE to show what is close to what in the higher-dimensional space. In TSNE, the absolute positions don't mean anything: points are placed so they are near the points they are close to in the high-dimensional space.
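A minimal TSNE sketch with scikit-learn (random stand-in data here, not the RCA coordinates from the notebook):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
high_dim = rng.random((50, 10))  # 50 points in a 10-dimensional space

# Embed into 2-D so that nearby points stay near each other
embedding = TSNE(n_components=2, perplexity=10,
                 init="random", random_state=2).fit_transform(high_dim)
print(embedding.shape)  # (50, 2): only relative closeness is meaningful
```

Because only neighborhoods are preserved, distances between far-apart clusters in a TSNE plot shouldn't be read too literally.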

In [10]:
authorRCATSNE = Corpus.nonlinear.tsne(authorRCA)
Plays.uplot(authorRCATSNE,grpCol="author",onlyGroups=[a[0] for a in authorCount.most_common(13)[7:]],fname="coredrama_author_rca_tsne_nt.csv")

Look at the interactive version of the plot to see which plays look like Shakespeare (or Heywood, or ...). Mike and Jonathan tell me that's interesting.

Shakespeare's Words

In [11]:
svec = ["Shakespeare" in author for author in authors]
auc = nr.auc(svec)
aucAbs = [abs(a-.5) for a in auc]
aucOrder = N.argsort(aucAbs)[::-1]
for idx in aucOrder[:10]:
    # (the loop body was lost in the baked export; assuming the DataMatrix
    # exposes its column words via .words, this prints each word and its AUC)
    print(nr.words[idx], auc[idx])
can 0.166411668707
say 0.830426356589
only 0.180844553244
speak 0.804518563851
hope 0.230212158303
go 0.761270909833
may 0.246277029784
sure 0.254742962056
lord 0.737352101183
first 0.268257853937
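The per-word AUC game above can be sketched with scikit-learn's `roc_auc_score` - the data below is synthetic stand-in data, not the corpus:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
freqs = rng.random((60, 4))                # 60 fake plays x 4 word frequencies
is_shakespeare = np.array([0, 1] * 30)     # fake yes/no label per play

# AUC per word: how well that word's frequency alone separates the two groups
aucs = [roc_auc_score(is_shakespeare, freqs[:, j]) for j in range(freqs.shape[1])]

# Distance from 0.5 measures discriminativeness in either direction,
# which is why the notebook sorts by abs(auc - 0.5)
order = np.argsort([abs(a - 0.5) for a in aucs])[::-1]
print([round(aucs[j], 3) for j in order])
```

Scores near 1 mean the word is more frequent in Shakespeare, near 0 mean less frequent, and near 0.5 mean it tells you nothing.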
In [ ]: