Saturday, November 22, 2014

Visualizing Word Embeddings in Pride and Prejudice

It is a truth universally acknowledged that a weekend web hack can be a lot of work, actually. After my last blog post, I thought I'd do a fast word2vec text experiment for #NaNoGenMo. It turned into a visualization hack, not too surprisingly. The results were mixed, though they might be instructive to someone out there.

Overall, the project as launched consists of the text of Pride and Prejudice, with the nouns replaced by the most similar word from a model trained on the text of all of Jane Austen's novels. The resulting text is pretty nonsensical. The blue words are the replaced words, shaded by how close a "match" they are to the original word; if you mouse over them, you see a little tooltip telling you the original word and the score.


Meanwhile, the graph shows the 2D reduction of the words, original and replacement, with a line connecting them:

The graph builds up a trace of the words you moused over, a kind of self-created word cloud report.


The final project lives here. The github repo is here: mostly Python processing in an IPython (Jupyter) notebook plus a javascript front-end. This is a blog post about how it started and how it ended.

Data Maneuvers

In a summary that is less meandering than how it really happened, the actual steps to process the data were these:

  1. I downloaded the texts for all Jane Austen novels from Project Gutenberg and reduced the files to just the main book text (no table of contents, etc.).
  2. I then pre-processed them to keep just the nouns (not proper nouns!) using pattern.py's tagger. Those nouns were used to train a word2vec model using gensim. I also later trained on all words, and that turned out to be a better model for the vis.
  3. Then I replaced all nouns inside Pride and Prejudice with their closest match according to the model's similarity function. "Closest" here means closest based on how the words are used across the whole Austen oeuvre! (A rough sketch of steps 2-4 appears after this list.)
  4. I used a python t-SNE library to reduce the 200 feature dimensions for each word to 2 dimensions and plotted them in matplotlib. I saved out the x/y coordinates for each word in the book, so that I can show those words on the graph as you mouse over the replaced (blue) words.
  5. The interaction uses a "fill in the word cloud" mechanism that leaves a trace of where you've been so that eventually you see theme locations on the graph. (Maybe.) Showing all the words to start is too much, and even after a while of playing with it, I wanted them to either fade or go away--so I added a "clear" button above the graph till I can treat this better.
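For the curious, here is a rough sketch of steps 2 through 4--not the exact notebook code. It assumes the Gutenberg texts are already stripped to their main body, uses pattern.en's tag() for the noun filtering and gensim for the word2vec model, and swaps in scikit-learn's TSNE for the Python t-SNE library I actually used; file names and parameters are illustrative.

import csv
import numpy as np
from pattern.en import tag
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

def noun_sentences(text):
    # One list of lower-cased 'NN' nouns per (crudely split) sentence.
    for sentence in text.split('.'):
        nouns = [word.lower() for word, pos in tag(sentence) if pos == 'NN']
        if nouns:
            yield nouns

austen = open('austen_all_novels.txt').read()   # illustrative file name
sentences = list(noun_sentences(austen))

# 200-dimensional vectors, as in the post; newer gensim spells this vector_size
# and puts the lookup methods on model.wv.
model = Word2Vec(sentences, size=200, min_count=5)

def closest_replacement(noun):
    # Swap a noun for its nearest neighbor in the model, keeping the similarity score.
    try:
        replacement, score = model.most_similar([noun.lower()])[0]
        return replacement, score
    except KeyError:
        return noun, None   # not in the vocabulary: leave the word alone

# Reduce the word vectors to 2D and save the coordinates for the front-end.
words = list(model.vocab)
vectors = np.array([model[word] for word in words])
Y = TSNE(n_components=2).fit_transform(vectors)
with open('word_coords.csv', 'w') as f:
    writer = csv.writer(f)
    for word, (x, y) in zip(words, Y):
        writer.writerow([word, x, y])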

The UI uses the novel text preprocessed in Python (where I wrote the 'span' tag around each noun with attributes of the score, former word, and current word), a csv file for the word locations on the graph, and a PNG with dots for all word locations on a transparent background. The D3 SVG works on top of that (this is the coolest hack in the project, IMO--see below for a few more details).
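The span-wrapping itself amounts to something like the sketch below; the class and attribute names here are my guesses, not necessarily the ones the launched page uses.

def wrap_noun(original, replacement, score):
    # Wrap the replacement word in a span carrying the original word and the similarity score.
    return ('<span class="replaced" data-original="{o}" data-score="{s:.3f}">{r}</span>'
            .format(o=original, r=replacement, s=score))

# wrap_noun('truth', 'sister', 0.71) ->
# '<span class="replaced" data-original="truth" data-score="0.710">sister</span>'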

Word Similarity Results

The initial goal was to take inspiration from the observation that "distances" in word2vec are nicely regular: the offset between "man" and "woman" is roughly parallel to the offset between "king" and "queen." I thought I might get interesting word-swap phenomena using this property, like gender swaps, etc. When I included pronouns and proper nouns in my experiment, I got even limper word salad, so I finally stuck with just the 'NN' noun tag in the pattern.py ptag parser output. (You will notice some errors in the text output; I didn't try to fix the tagging issues.)
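In gensim, that regularity is what the vector-arithmetic form of most_similar probes; a query like the one below is the standard way to ask it (purely illustrative here, since the results depend entirely on the trained model, and a small Austen-only corpus may not cooperate at all):

model.most_similar(positive=['king', 'woman'], negative=['man'], topn=5)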

I was actually about to launch a different version--a model trained on just the nouns in Austen--but the results left me vaguely dissatisfied. The 2D graph looked like this, including the very crowded lower-left tip that's the most popular replacement zone (in a non-weekend-hacky project, this would need some better treatment in the vis, maybe a fisheye or rescaling...):

Because the closest word to most words is one of the most "central" words in the model--e.g., "brother" and "family"--the results are pretty dull: lots of sentences with the same words over-used, like "It is a sister universally acknowledged, that a single brother in retirement of a good man, must be in time of a man."

Right before I put up all the files, I tried training the model on all words in Austen, but still replacing only the nouns in the text. The results are much more interesting in the text as well as the 2D plot; while there is no obvious clustering effect visually, you can start seeing related words together, as at the bottom of the plot:

There are also some interesting similarity results for gendered words in this model:

model.most_similar(['wife'])
[(u'son', 0.7893723249435425),
 (u'reviving', 0.7113327980041504),
 (u'daughter', 0.7054953575134277),
 (u'admittance', 0.6823280453681946),
 (u'attentions', 0.658092737197876),
 (u'warmed', 0.6542254090309143),
 (u'niece', 0.6514275074005127),
 (u'addresses', 0.6490938663482666),
 (u'proposals', 0.647223174571991),
 (u'behaviour', 0.6413060426712036)]

model.most_similar(['husband'])
[(u'nerves', 0.8918779492378235),
 (u'lifting', 0.7963227033615112),
 (u'wishes', 0.7679949998855591),
 (u'nephew', 0.7674976587295532),
 (u'senses', 0.7639766931533813),
 (u'daughter', 0.7601332664489746),
 (u'ladyship', 0.7527087330818176),
 (u'daughters', 0.7525165677070618),
 (u'thoughts', 0.7426179647445679),
 (u'mother', 0.7310776710510254)]

However, the closest match for "man" is "woman" and vice versa. I should note that in Radim's gensim demo for the Google News text, "man : woman :: woman : girl," and "husband : wife :: wife : fiancée."

And while most of the text is garbage, with some fun gender riffs here and there, in one version I got this super sentence: "I have been used to consider furniture the estate of man." (Originally: "poetry the food of love.") Unfortunately, in this version of the model and replacements, we get "I have been used to consider sands as the activity of wise."

I saved out the json of the word replacements and scores for future projects. I should also note that gensim recently added doc2vec (document to vector), promising even more relationship fun.

A Note on Using the Python Graph as SVG Background

To make a dot image background for the graph, I just plotted the t-SNE graph in matplotlib, like this (see the do_tsne_files function) with the axis off:

import matplotlib.pyplot as plt
# Y holds the 2D t-SNE coordinates, one row per word
plt.figure(figsize=(15, 15))
plt.axis('off')
plt.scatter(Y[:,0], Y[:,1], s=10, color='gray', alpha=0.2)

After doing this, I right-clicked the inline image to "save image" from my IPython notebook, and that became the background for drawing the dots, lines, and words for the mouseovers. Using axis('off') makes the saved image entirely transparent except for the marks on top, it turns out, so the background color set in the CSS works fine, too:

#graph {
  position: fixed;
  top: 150px;
  right: 20px;
  overflow: visible;
  background: url('../data/pride_NN_tsne.png');
  background-color: #FAF8F5;
  background-size: 600px 600px;
  border: 1px #E1D8CF solid;
}

There was a little jiggering by hand of the edge limits in the CSS to make sure the scaling worked right in the D3, but in the end it looks approximately right. My word positioning suffers from a simplification--the dots appear at the point of the word coordinates, but the words are offset from the dots, and I don't re-correct them after the line moves. This means that you can sometimes see a purple and blue word that are the same word, in different spots on the graph. Exercise for the future!
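As a small aside on that background PNG: instead of right-clicking the inline image, it can also be written straight from the notebook. A minimal sketch (the file name and dpi are just placeholders):

plt.savefig('pride_NN_tsne.png', transparent=True, bbox_inches='tight', dpi=100)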

I also borrowed some R code and adapted it for my files, to check the t-SNE output there. One of the functions will execute a graphic callback every N iterations, so you can see a plot of the status of the algorithm. To run this (code in my repo), you'll need to make sure you paste (in the unix sense) the words and coordinates files together and then load them into R. The source for that code is this nice post.

The Original Plan and Its Several Revisions

If I were really cool, I would just say this is what I intended to build all along.

My stages of revision were not pretty, but maybe educational:

  • "Let's just replace the words with closest matches in the word2vec model and see what we get! Oh, it's a bit weird. Also, the text is harder to parse and string replace than I expected, so, crud."
  • ...Lots of experimenting with what words to train the model with, one book or all of them, better results with more data but maybe just nouns...
  • "Maybe I can make a web page view with the replacements highlighted. And maybe add the previous word and score." (You know, since the actual text is itself sucky.)
  • ...A long bad rabbit hole with javascript regular expressions and replacements that were time-consuming for me and the web page to load...
  • "What if I try to visualize the distances between words in the model, since I have this similarity score. t-SNE is what the clever kids are using, let's try that."
  • "Cool, I can output a python plot and draw on top of it in javascript! I'll draw a crosshair on the coordinates for the current word in the graph."
  • "Eh, actually, the original word and the replacement might be interesting in the graph too: Let's regenerate the data files with both words, and show both on the plot."
  • "Oh. The 'close' words in the model aren't close on the 2D plot from the nouns model. I guess that figures. Bummer. This was kind of a dead-end."
  • Post-hoc rationalization via eye-candy: "Still, better to have a graph than just text. Add some D3 dots, a line between them, animate them so it looks cooler." (Plus tweaks like setting the opacity of the line based on the closeness score; if I do enough of these, no one will notice the crappy text?)
  • Recap: "Maybe this is a project showing results of a bad text replacement, and the un-intuitive graph that goes along with it?"
  • "Well, it's some kind of visualization of some pretty abstract concepts, might be useful to someone. Plus, code."
  • ...Start writing up the steps I took and realize I was doing some of them twice (in Python and JS) and refactor...
  • "Now I still have to solve all the annoying 'final' details like CSS, ajax loading of text parts on scroll, fixing some text replacement stuff for non-words and spaces, making a github with commented code and notebook, add a button to clear the graph since it gets crowded, etc."
  • Then, just as I was about to launch today: "Oh, why don't I just show what the graph looks like based on a model of all the words in Austen, not just nouns. Hey, wait, this is actually more interesting and the close matches are usually actually close on the graph too!"

There were equal amounts of Python hacking and Javascript hacking in this little toy. Building a data interactive requires figuring out the data structures that are best for UI development, which often means going back to the data processing side and doing things differently there. Bugs in the vis itself turned up data issues, too. For a long time I didn't realize I had a newline in a word string that broke importing of the coordinates file after that point; this meant the word "truth" wasn't getting a highlight. That's one of the first words in the text, of course!

And obviously I replaced my word2vec model right at the last second, too. Keep the pipeline for experiments as simple as possible, and it'll all be okay.