Sunday, October 26, 2014

A Roundup of Recent Text Analytics and Vis Work

Some really exciting things in text analysis and visualization have crossed my Twitter feed recently; I thought I'd pull together some pointers in case you missed any of my tweetspam about one of my favorite subjects. Maybe posts like this will become a regular thing!

Shiffman's P5.js and Javascript Text Tutorials

Dan Shiffman, famous for his excellent books and lessons on Processing, is doing a course for ITP that includes a lot of text analytics work done in JavaScript and p5.js (the new JavaScript Processing lib). The git repo for his course content (code and tutorials) is here. He includes accessible content on TF-IDF, Markov chains, Naive Bayes, parsing, and text layout for the web.
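If you haven't met TF-IDF before, here's a minimal sketch of the idea in plain Python. It is not taken from Shiffman's course (his material is in JavaScript), and the function names, the toy documents, and the simple tf * log(N / df) weighting are my own assumptions; real libraries usually add smoothing and normalization on top of this.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(".,;:!?\"'()") for word in text.lower().split())
    return [t for t in tokens if t]


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus, made up for illustration.
docs = [
    "the whale swam under the ship",
    "the ship sailed past the island",
    "the island was quiet",
]
scores = tf_idf(docs)
top_term = max(scores[0], key=scores[0].get)
```

A word like "the" appears in every document, so its idf is log(1) = 0 and it drops out, while words unique to one document float to the top.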

Topic Modeling News

David Mimno updated Mallet, the Java reference package for LDA, with labeled LDA (topics within labeled documents) and stop word regular expressions. There's a blog post with some explanation here.

Alan Riddell released a Python implementation of LDA with an interface inspired by scikit-learn. He points to an interesting semi-supervised topic modeling package also in Python, zLabel-LDA.

I liked this paper by Maiya and Rolfe with ideas for improving labeling of topics as compared to using raw LDA results. (Every time I teach topic modeling I confront the "but what do these mean" question, and the notion of post-processing the results for more meaningful representation gets a pretty short answer, because we've usually run out of time.)

Here's a nice recent project release from Peter Organisciak for making time-series charts of topics across digital books in the HathiTrust Digital Library. There are full instructions for the Python and R package. Here's a section of his example of some topic distributions across The Scarlet Letter:

Words in Space (Multidimensional Scaling)

I was rather excited when Mario Klingemann posted his evolving project on visualizing the topics of the images in the Internet Archive's Book Collection -- a giant zoomable map of related subjects crunched with t-SNE. The links open the related images collections on flickr (e.g., here's "playing cards"). If you like old book images, especially woodcuts, this is a trap you may never escape from! I got lost in occult symbols and finally had to shut the tab.

Related to topic modeling, Lee and Mimno posted a paper on drawing 2D (or 3D) convex hulls around "anchor words" to outline topics in their co-occurrence spaces, such as those produced by t-SNE (t-Distributed Stochastic Neighbor Embedding) or PCA. From their paper:

Meanwhile, David McClure has an interesting post about creating something like these algorithms "by hand" and generating network diagrams from the results. (Thanks to Ted Underwood for passing this on.) Here's his hand-labeled map of War and Peace:

Other words-in-space multidimensional scaling projects of recent note include word2vec, which has a nice Python gensim implementation (see the great blog post and demo by Radim Řehůřek), and GloVe, which claims to improve on word2vec but looks similar to me from the usage perspective (here's a "maybe buggy" Python implementation). t-SNE also has implementations in lots of languages, including Python and R, all listed on its page. Also see a nice overview explanation of word embeddings with t-SNE visual examples by Chris Olah here, and his demo of dimensionality reduction and t-SNE here.

Narrative Vis

In a fascinating project, Georgia Panagiotidou and Anne Pasanen visualize the oscillation of characters between good and evil in the Finnish Kalevala epic. Really lovely and worth a browse in full screen.

Nick Beauchamp's Plot Mapper: Paste in a text and a complex PCA visualization reduces it to something amazingly simple. He says,

The text is chopped into N chunks, and each "chapter" is plotted in a 2-dimensional space (connected by lines) along with the top X words in the text. You can see how the trajectory of the text moves through the space of words, emphasizing different themes at different stages of the work.

Here's a surprisingly sweet Peter Pan:

Note: Keep the options for the number of words to generate low, or you may get an error. Thanks to David Mimno (@dmimno) for passing that one on!
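To make the chunk-and-project idea concrete, here's a rough sketch in plain Python. This is my guess at the general shape, not Beauchamp's actual code: the function names are hypothetical, the "top X words" are just the most frequent ones, and the PCA is done with a simple power iteration so the example needs no libraries (in practice you'd probably reach for a library PCA).

```python
import math
from collections import Counter


def chunk_vectors(text, n_chunks, top_x):
    """Split text into n_chunks and measure the top_x most frequent words in each."""
    tokens = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    tokens = [t for t in tokens if t]
    vocab = [w for w, _ in Counter(tokens).most_common(top_x)]
    bounds = [i * len(tokens) // n_chunks for i in range(n_chunks + 1)]
    rows = []
    for start, end in zip(bounds, bounds[1:]):
        counts = Counter(tokens[start:end])
        size = max(end - start, 1)  # guard against empty chunks
        rows.append([counts[w] / size for w in vocab])
    return vocab, rows


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_component(cov, iterations=200):
    """Power iteration: approximate the dominant eigenvector of a symmetric matrix."""
    vec = [1.0 / (i + 1) for i in range(len(cov))]  # fixed, non-uniform start vector
    for _ in range(iterations):
        nxt = [dot(row, vec) for row in cov]
        norm = math.sqrt(dot(nxt, nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    return vec


def pca_2d(rows):
    """Project each row onto the first two principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(2):
        vec = top_component(cov)
        eigenvalue = dot(vec, [dot(row, vec) for row in cov])
        components.append(vec)
        # Deflate: subtract the component just found so the next pass finds another.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return [[dot(r, c) for c in components] for r in centered]


# Made-up "novel" whose vocabulary shifts across three acts.
text = ("sea ship whale " * 20 + "island sand palm " * 20
        + "city street tower " * 20)
vocab, rows = chunk_vectors(text, n_chunks=3, top_x=9)
points = pca_2d(rows)  # one (x, y) pair per chunk, in reading order
```

Connecting the entries of points in order gives the trajectory of the text; the words themselves could be placed using their weights in the two component vectors, which I've left out to keep this short.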

Text Generator Art

Darius Kazemi (@tinysubversions) is doing NaNoGenMo (National Novel Generation Month) again this year; the repo and rules are here. Let's all work on text generation in November! Instead of, you know, actually writing that novel by hand, like an animal.

It was a few weeks ago, but it still makes me giggle: the Vogon Poetry Generator, which uses Google Search to build a poem based on the title (which you can edit in-page).

Wrapping Up

I really love text visualization projects that combine great analytics with great applications. Keep sending me pointers (@arnicas on Twitter) and maybe I'll do more of these roundups when the awesome gets to me enough. For more inspirational links, try my Pinterest board of text vis, my Twitter list of text vis, art, and NLP folks (who talk about a lot of other things, so YMMV), and this hopefully growing index of academic work from the ISOVIS folks.