This is a "baked" version of the notebook - it is not editable, but it doesn't require the Jupyter software to be installed. The scatterplots (and some other visualizations) have links to interactive versions.

# Authorship in Core Drama¶

Some map making experiments using the core drama corpus, focused on authorship.

In [1]:
# this setup is meant to be common across all the demos...

# since the notebooks may not go where the module code is...
# change directory to the root - since that's where everything goes
import sys
import os

if os.path.basename(os.getcwd()) == "Notebooks":
    print("Changing directory from:",os.getcwd())
    os.chdir("..")
    print("                     to:",os.getcwd())

### Magic to make notebooks work nicely
%matplotlib inline

### standard useful python stuff
from collections import Counter, defaultdict

### plotting stuff
import pylab
import seaborn

### numerical stuff
import numpy as N
from scipy import stats

### my CorpusModeler package stuff
import Corpus.DataMatrix as DM
import Corpus.reduce
import Corpus.cull
import Corpus.linear
import Corpus.nonlinear
import Tests.plays as Plays

Changing directory from: C:\Users\gleicher\Projects\CorpusModeler\Notebooks
to: C:\Users\gleicher\Projects\CorpusModeler

In [2]:
p = Plays.readCore()
r = Corpus.reduce.ngrams(p,200)
nr = r.normalize()
pca = Corpus.linear.PCA(nr,10)

corpus of (554, 179064) loaded in 0.21792387962341309
Density after reduction is 0.9848104693140795 - making dense
Reduce from 179064 to 200 words in 0.22760534286499023
Build PCA model in  0.02
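For readers without the CorpusModeler package, the same pipeline can be sketched with scikit-learn (the toy documents here are made up; `readCore`, `reduce.ngrams`, `normalize`, and `linear.PCA` are the package's own versions of these steps):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA

docs = ["to be or not to be", "all the world is a stage", "once more unto the breach"]

# keep only the most frequent words (5 here; the notebook keeps 200)
vec = CountVectorizer(max_features=5)
counts = vec.fit_transform(docs).toarray()

# row-normalize so plays of different lengths are comparable
freqs = normalize(counts, norm="l1")

# project down to a handful of principal components (10 in the notebook)
pca = PCA(n_components=2)
coords = pca.fit_transform(freqs)
print(coords.shape)  # (3, 2)
```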


# Authors¶

We typically do these games for genre, but let's try doing them for authors. Genre and author are not uncorrelated, since most authors focused on a small set of genres. Unlike genres, there are a lot of authors, so we'll focus on the top few.

In [3]:
from importlib import reload  # in Python 3, reload lives in importlib
reload(Plays)

## get the top N authors
## (authors holds one author label per play, from the corpus metadata)
authorCount = Counter(authors)

## let's see how they look on some of our favorite plots
top6 = [a[0] for a in authorCount.most_common(6)]
Plays.uplot(nr,grpCol="author",onlyGroups=top6, fname="coredrama_author_0")
Plays.uplot(pca,grpCol="author",onlyGroups=top6, fname="coredrama_author_pca")


This sets up to do some machine learning.

We create a vector of author labels that uses numbers. We only use the 6 most common named authors (we take the top 7 and drop "Anon.", because it doesn't make sense to treat "anonymous" as an author).

In [4]:
authorToSeparate = [ a[0] for a in authorCount.most_common(7) if a[0] != "Anon."]
authorNumDict = { a:i+1 for i,a in enumerate(authorToSeparate)}
authorNums = [authorNumDict.get(a, 0) for a in authors]  # 0 = not one of the target authors
authorLDA = Corpus.linear.LinearDiscrim(nr,targets=authorNums,skipZeros=True)
Plays.uplot(authorLDA,grpCol="author",onlyGroups=authorToSeparate,fname="coredrama_author_lda.csv")

Build LinearDiscrim model in  0.01

C:\Anaconda\envs\python3\lib\site-packages\sklearn\discriminant_analysis.py:389: UserWarning: Variables are collinear.
warnings.warn("Variables are collinear.")


What this has done is "learned" a new projection from the initial 200 dimensions (we took the top 200 words) down to 2, in a way that tries to separate the groups as much as possible. The other dimensions (these are just the first 2) further separate the classes.

The method I used is "Linear Discriminant Analysis" (LDA), which is an "old school" machine learning / statistics method. Simple, but it works well here.
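The same kind of projection can be sketched with scikit-learn's LinearDiscriminantAnalysis; the clusters below are synthetic, not the play corpus:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# three synthetic "author" clusters in 10 dimensions
X = np.vstack([rng.normal(loc=i, scale=0.5, size=(30, 10)) for i in range(3)])
y = np.repeat([0, 1, 2], 30)

# LDA yields at most (n_classes - 1) discriminant axes;
# we keep 2, just as the plots above use the first two dimensions
lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X, y)
print(Z.shape)  # (90, 2)
```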

Remember, this was trained to distinguish the top 6 authors. We don't expect it to do well on other authors, but we can look...

In [5]:
Plays.uplot(authorLDA,grpCol="author",onlyGroups=[a[0] for a in authorCount.most_common(13)[7:]],fname="coredrama_author_lda_nt.csv")


We can look at what words it uses to make the X axis - here, this seems to tell us that Shirley uses been, every, poor, ... a lot - while everyone else uses being, look, only, ... a lot (those weights are negative, so those words pull documents to the left).

Be careful with this game though: this is actually using all 200 words. And a word might get a big weight even though it is rare.

To help sort this out, I show the weightings (just the 20 biggest ones), but I also sort the list by which words make the top "contributions" (multiplying each word's frequency by its weight).

In [6]:
print("Weightings")
print(authorLDA.topWords(0))
print("Contributions")
print(authorLDA.topWords(0,contrib=True))

Weightings
[('been', 1587.6677288612718), ('every', 1532.1769511523369), ('poor', 1503.22470223706), ('does', 1168.2573425295545), ('into', 1121.7014758572118), ('death', 1112.5631326222338), ('pray', 1095.3152826744918), ('done', 1083.0040557412115), ('down', 994.63628986760682), ('call', 928.16393490217706), ('first', -793.45079770248708), ('speak', -916.64652688110152), ('night', -918.61504929054922), ('little', -1005.5591358002055), ('could', -1056.5680753927857), ('were', -1059.260099661635), ('both', -1092.0108800776907), ('only', -1306.7976474086063), ('look', -1316.3144215671555), ('being', -1481.1695366656775)]
Contributions
[('be', 2424.5310282006335), ('not', 2421.1229007960351), ('with', 2089.1119979155678), ('but', 1944.7034630327919), ('to', 1932.2329809657072), ('all', 1881.8402351436641), ('a', 1750.7734210226045), ('does', 1502.492630310114), ('at', 1241.3736985755577), ('come', -1187.5503071762719), ('in', -1306.3299544757083), ('see', -1313.3578575151664), ('no', -1636.1265500912014), ('for', -1651.7445934504528), ('then', -1781.0386927269403), ('were', -1813.8598398823106), ('would', -1818.8934847507064), ('it', -1913.0360268168395), ('and', -3523.8869391148487), ('that', -4396.8141687677899)]
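The contribution sort above explains why the two lists differ: a rare word can carry a huge weight but a small contribution, while a common function word with a modest weight can dominate the projection. A toy illustration (the weights and frequencies here are made up):

```python
# hypothetical LDA weights and mean relative frequencies for two words
weights = {"been": 1500.0, "that": -40.0}
mean_freq = {"been": 0.001, "that": 0.05}

# contribution = frequency * weight; the common word "that" wins
# despite its much smaller weight
contrib = {w: weights[w] * mean_freq[w] for w in weights}
print(contrib)
```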


You might wonder what happens if, rather than using those crazy specific weightings, I just used simpler numbers. Here, I've quantized things to integers between -10 and +10. It isn't as clear a picture, but it's generally right.

In [7]:
authorLDAq = authorLDA.quantizeCols(10)
Plays.uplot(authorLDAq,grpCol="author",onlyGroups=authorToSeparate,fname="coredrama_author_ldaq.csv")

In [8]:
print("Weightings")
print(authorLDAq.topWords(0))

Weightings
[('been', 10.0), ('every', 10.0), ('poor', 9.0), ('pray', 7.0), ('into', 7.0), ('death', 7.0), ('done', 7.0), ('does', 7.0), ('down', 6.0), ('call', 6.0), ('first', -5.0), ('little', -6.0), ('speak', -6.0), ('night', -6.0), ('could', -7.0), ('both', -7.0), ('were', -7.0), ('look', -8.0), ('only', -8.0), ('being', -9.0)]
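Here is a sketch of what the quantization step appears to do, judging from the integer weights printed above: rescale so the largest absolute weight becomes ±10, then round. (quantizeCols is the package's own implementation; this reading of it is an assumption.)

```python
import numpy as np

def quantize(weights, levels=10):
    # scale so max |weight| maps to `levels`, then round to integers
    w = np.asarray(weights, dtype=float)
    return np.round(w * levels / np.abs(w).max())

# a few of the real weights from the cell above
q = quantize([1587.67, 1503.22, -793.45, -1481.17])
print(q)
```

This reproduces the 10, 9, -5, -9 values shown for been, poor, first, and being.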


This uses a different machine learning technique called "Relevant Components Analysis" (RCA), which achieves a similar effect in a different manner (it tries to minimize the variance within the groups, rather than maximize the variance between groups). It gets similar results.

In [9]:
from importlib import reload  # no-op if already imported above
reload(Corpus.linear)
authorRCA = Corpus.linear.RCA(nr,authorNums,skipZeros=True)
Plays.uplot(authorRCA,grpCol="author",onlyGroups=authorToSeparate,fname="coredrama_author_rca.csv")
Plays.uplot(authorRCA,grpCol="author",onlyGroups=[a[0] for a in authorCount.most_common(13)[7:]],fname="coredrama_author_rca_nt.csv")
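For the curious, here is a small numpy sketch of RCA as I understand it - whitening by the pooled within-class covariance, so directions of within-group variance are shrunk and between-group structure stands out. The data and names below are mine, not the package's:

```python
import numpy as np
from scipy import linalg

def rca(X, y):
    X = np.asarray(X, dtype=float)
    # center each group on its own mean, then pool
    centered = np.vstack([X[y == c] - X[y == c].mean(axis=0)
                          for c in np.unique(y)])
    C = centered.T @ centered / len(X)    # pooled within-class covariance
    W = linalg.inv(linalg.sqrtm(C)).real  # whitening transform C^(-1/2)
    return X @ W.T

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(i, 1.0, size=(50, 3)) for i in range(2)])
y = np.repeat([0, 1], 50)
Z = rca(X, y)
print(Z.shape)  # (100, 3)
```

After the transform, the pooled within-group covariance is (numerically) the identity, which is the "minimize within-group variance" effect described above.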


Part of what is going on here is that the groups are separated more in higher dimensions (RCA is producing a 10-dimensional space).

We can use a technique called t-SNE to show what is close to what in the higher-dimensional space. In t-SNE, the absolute positions don't mean anything: points are placed so that they end up near the points they are close to in the high-dimensional space.

In [10]:
authorRCATSNE = Corpus.nonlinear.tsne(authorRCA)
Plays.uplot(authorRCATSNE,grpCol="author",onlyGroups=authorToSeparate,fname="coredrama_author_rca_tsne.csv")
Plays.uplot(authorRCATSNE,grpCol="author",onlyGroups=[a[0] for a in authorCount.most_common(13)[7:]],fname="coredrama_author_rca_tsne_nt.csv")
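A minimal scikit-learn sketch of the same idea, on synthetic clusters rather than the plays:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
# two well-separated clusters in 10 dimensions
X = np.vstack([rng.normal(0, 0.3, (40, 10)), rng.normal(3, 0.3, (40, 10))])

# t-SNE preserves neighborhoods, not axes: only "what is near what" survives
emb = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)
print(emb.shape)  # (80, 2)
```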


Look at the interactive version of the plot to see which plays look like Shakespeare (or Heywood, or ...). Mike and Jonathan tell me that's interesting.

# Shakespeare's Words¶

In [11]:
svec = ["Shakespeare" in author for author in authors]
auc = nr.auc(svec)
aucAbs = [abs(a-.5) for a in auc]
aucOrder = N.argsort(aucAbs)[::-1]
for idx in aucOrder[:10]:
    print(nr.terms[idx],auc[idx])

can 0.166411668707
say 0.830426356589
only 0.180844553244
speak 0.804518563851
hope 0.230212158303
go 0.761270909833
may 0.246277029784
sure 0.254742962056
lord 0.737352101183
first 0.268257853937
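The nr.auc call appears to compute a per-word ROC AUC against the Shakespeare/not-Shakespeare split: values near 1 mean Shakespeare plays tend to use the word more, near 0 less, and near 0.5 no separation (which is why the ranking uses the distance from 0.5). A toy version with scikit-learn's roc_auc_score, on made-up frequencies:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
is_shak = np.array([True] * 20 + [False] * 20)

# hypothetical per-play frequencies: "say" higher in Shakespeare,
# "can" higher everywhere else
freq_say = np.where(is_shak, rng.normal(0.8, 0.1, 40), rng.normal(0.4, 0.1, 40))
freq_can = np.where(is_shak, rng.normal(0.3, 0.1, 40), rng.normal(0.7, 0.1, 40))

for word, freq in [("say", freq_say), ("can", freq_can)]:
    print(word, roc_auc_score(is_shak, freq))
```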

In [ ]: