Sunday, March 08, 2015

Teaching News

Overdue for a blog post, and I guess my news needs an official announcement!

I'm happy to announce that I have accepted a visiting post at the University of Miami for 9 months, beginning August 2015 and running through the academic year. This post is financially possible thanks to the generous Knight Foundation, which supports various faculty positions in journalism throughout the country. I’ll be helping Alberto Cairo get his new Data Visualization and Journalism track in the Interactive Media MFA off to a running start; I’ll be teaching data visualization and data analysis, including D3.js. I'll probably keep some side contract work going at the same time. Here's my favorite version of the news on Twitter:

I’ve always been wary of trying to teach D3 in any short workshop format — I’ve been asked and said “no” many times. However, the first class I’ll teach is a semester long, so it seems more feasible. To help prepare for this, I’ll also be a TA in this spring’s online Data Visualization and Infographics with D3 course co-taught by Alberto and Scott Murray (@alignedleft, screen-capped above), who is the author of a very nice introductory D3 book, Interactive Data Visualization for the Web. (If you’re reading about it now for the first time, the class filled up quickly to the cap set at 500 people. Maybe they can do it again if it’s successful.)

In other, more minor teaching news, I did a guest lecture at CMU in Golan Levin’s STUDIO for Creative Inquiry on NLP (natural language processing) in Python; the files are all here. The most “interesting” part from Twitter’s perspective is the Bayesian detection of sex scenes in Fifty Shades of Grey (because spam is boring). I first did this cocktail-party stunt at OpenVis Conf in 2013, and now I’ve finally released the data and code for it. These introductory lectures cover concepts that would be useful in any more advanced text visualization context; I hope to get a chance to expand on that subject while in Miami, too.
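For the curious, that kind of Bayesian detection boils down to a multinomial Naive Bayes classifier over text chunks. Here is a minimal hand-rolled sketch; the training snippets and labels below are toy stand-ins, not the actual lecture data (which is in the released files):

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (label, text) pairs. Returns class priors,
    per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify(text, class_counts, word_counts, vocab):
    """Pick the label maximizing log P(label) + sum of log P(word|label)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the score
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy labeled "chunks" standing in for real labeled passages:
docs = [
    ("racy", "he kissed her slowly"),
    ("racy", "she kissed him back"),
    ("boring", "the contract was signed monday"),
    ("boring", "the meeting was rescheduled monday"),
]
model = train(docs)
print(classify("they kissed", *model))         # racy
print(classify("the meeting monday", *model))  # boring
```

The real version works on overlapping windows of sentences rather than single lines, but the scoring math is the same.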

I’m also putting together a class, Introduction to Data Analysis with Pandas, although I’ve been doing it veeerrrryyyy slowwwwlllyyyy.

Finally, related to teaching, I’m co-chair of OpenVis Conf this year. We are not quite sold out yet (as of this post), and I think you should come. This is a conference about how the visualization sausage is made — lots of educational talks!

I had planned to write 3 more sections on learning, teaching, and making, but there were some minefields in there about gender and sexism in tech. Not ready for prime time. No navel-gazing for now!

Tuesday, December 30, 2014

A Silly Text Visualization Toy

This little text-to-image replacement toy made me laugh, so I decided to put it up in case it makes you laugh too. In my last project, I did part-of-speech tagging in Python and used that to replace nouns with other nouns (see post and demo); in this one, I did the part-of-speech tagging all in Javascript using the terrific RiTa.js library!

With RiTa, you get the same slightly noisy results I got in the tagging I did before: not all the nouns are good "nouns." The API for tagging is super easy:

>RiTa.getPosTagsInline("Silent night, holy night")
>"Silent/jj night/nn , holy/rb night/nn"

After generating the parts of speech, I filtered for just the nouns ("/nn" and "/nns"). I replaced those with words in "span" tags, and then used an ajax call to search for each spanned text in Google's image search API. The whole operation is outlined here, with the logic for getting the local text selected first:

      $.get("texts/" + file_name, function (text) {
        lines = text.split('\n');
      })
      .then(function () {
        return processLines(lines);
      })
      .done(function () {
        $("span.replace").each(function (i, val) {
          // ...look up and swap in an image for each spanned noun...
        });
      });
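The noun-filtering step itself is simple string work on RiTa's inline-tag format. The project's actual filtering happens in Javascript; here's a sketch of the same logic in Python, just for illustration:

```python
def extract_nouns(tagged):
    """Given RiTa-style inline tags ("word/tag word/tag ..."),
    keep only singular and plural common nouns (/nn, /nns)."""
    nouns = []
    for token in tagged.split():
        word, _, tag = token.rpartition("/")
        if tag in ("nn", "nns"):
            nouns.append(word)
    return nouns

print(extract_nouns("Silent/jj night/nn , holy/rb night/nn"))
# ['night', 'night']
```

Untagged tokens like the bare comma simply fall through, since they have no `/nn` or `/nns` suffix.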

It turns out (of course) that there's a lot of repetition in certain words, especially for holiday songs and poems; so I introduced some random picking of the image thumbnails for variety.

Here's more from "Night Before Christmas" (which is really called "A Visit from St. Nicholas") -- yes, that's Microsoft Word:

This is the first sentence of Pride & Prejudice; it ends with the single man getting the Good Wife:

And the Road Not Taken:

I think the Night Before Christmas is the best one, but they all have their moments. Try it. Suggestions for other well-known (short) texts to try?

Saturday, November 22, 2014

Visualizing Word Embeddings in Pride and Prejudice

It is a truth universally acknowledged that a weekend web hack can be a lot of work, actually. After my last blog post, I thought I'd do a fast word2vec text experiment for #NaNoGenMo. It turned into a visualization hack, not too surprisingly. The results were mixed, though they might be instructive to someone out there.

Overall, the project as launched consists of the text of Pride and Prejudice, with the nouns replaced by the most similar word in a model trained on all of Jane Austen's books' text. The resulting text is pretty nonsensical. The blue words are the replaced words, shaded by how close a "match" they are to the original word; if you mouse over them, you see a little tooltip telling you the original word and the score.

Meanwhile, the graph shows the 2D reduction of the words, original and replacement, with a line connecting them:

The graph builds up a trace of the words you moused over, a kind of self-created word cloud report.

The final project lives here. The github repo is here, mostly Python processing in an IPython (Jupyter) notebook and then a javascript front-end. This is a blog post about how it started and how it ended.

Data Maneuvers

In a (less meandering than how it really happened) summary, the actual steps to process the data were these:

  1. I downloaded the texts for all Jane Austen novels from Project Gutenberg and reduced the files to just the main book text (no table of contents, etc.).
  2. I then pre-processed them to convert to just nouns (not proper nouns!) using a part-of-speech tagger. Those nouns were used to train a word2vec model using gensim. I also later trained on all words, and that turned out to be a better model for the vis.
  3. Then I replaced all nouns inside Pride and Prejudice with their closest match according to the model's similarity function. This means closest based on use of words in the whole Austen oeuvre!
  4. I used a python t-SNE library to reduce the 200 feature dimensions for each word to 2 dimensions and plotted them in matplotlib. I saved out the x/y coordinates for each word in the book, so that I can show those words on the graph as you mouse over the replaced (blue) words.
  5. The interaction uses a "fill in the word cloud" mechanism that leaves a trace of where you've been so that eventually you see theme locations on the graph. (Maybe.) Showing all the words to start is too much, and even after a while of playing with it, I wanted them to either fade or go away--so I added a "clear" button above the graph till I can treat this better.
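The similarity lookup at the heart of steps 2 and 3 is a nearest-neighbor search in vector space. Here is a self-contained sketch with toy 3-dimensional vectors standing in for the 200-dimension gensim model (in the real project, the model's own similarity function does this work):

```python
import math

# Toy word vectors standing in for a trained word2vec model;
# the real project used 200-dimensional gensim vectors.
vectors = {
    "man":    (0.9, 0.1, 0.0),
    "woman":  (0.8, 0.2, 0.1),
    "estate": (0.1, 0.9, 0.2),
    "house":  (0.2, 0.8, 0.3),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a)) *
            math.sqrt(sum(y * y for y in b)))
    return dot / norm

def closest_match(word):
    """Return (replacement, score): the most similar *other* word,
    which is what gets swapped in for each noun."""
    candidates = ((other, cosine(vectors[word], vec))
                  for other, vec in vectors.items() if other != word)
    return max(candidates, key=lambda pair: pair[1])

print(closest_match("man"))  # "woman" is the nearest neighbor here
```

The score that comes back is what drives the blue shading and the tooltip in the final page.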

The UI uses the novel text preprocessed in Python (where I wrote the 'span' tag around each noun with attributes of the score, former word, and current word), a csv file for the word locations on the graph, and a PNG with dots for all word locations on a transparent background. The D3 SVG works on top of that (this is the coolest hack in the project, IMO--see below for a few more details).
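Writing those 'span' tags in the Python preprocessing step can be sketched like this; the class and data-attribute names are my own illustration, not necessarily what the repo uses:

```python
import html

def wrap_noun(original, replacement, score):
    """Wrap a replaced noun in a span carrying the data the tooltip
    needs: the former word and the similarity score.
    (Class and attribute names here are hypothetical.)"""
    return ('<span class="replace" data-orig="{}" data-score="{:.3f}">{}</span>'
            .format(html.escape(original, quote=True), score,
                    html.escape(replacement)))

print(wrap_noun("truth", "doubt", 0.7123))
# <span class="replace" data-orig="truth" data-score="0.712">doubt</span>
```

On the Javascript side, the mouseover handler just reads those attributes back off the span to populate the tooltip and the graph highlight.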

Word Similarity Results

The basic goal initially was to take inspiration from the observation that "distances" in word2vec are nicely regular; the distance between "man" and "woman" is analogous to the distance between "king" and "queen." I thought I might get interesting word-swap phenomena using this property, like gender swaps, etc. When I included pronouns and proper nouns in my experiment, I got even limper word salad, so I finally stuck with just the 'NN' noun tag in the ptag parser output. (You will notice some errors in the text output; I didn't try to fix the tagging issues.)

I was actually about to launch a different version--a model trained on just the nouns in Austen--but the results left me vaguely dissatisfied. The 2D graph looked like this, including the very crowded lower left tip that's the most popular replacement zone (in a non-weekend-hacky project, that region would need some better treatment in the vis, maybe a fisheye or rescaling):

Because the closest words to most words are the most "central" words for the model--e.g., "brother" and "family"--the results are pretty dull: lots of sentences with the same words over-used, like "It is a sister universally acknowledged, that a single brother in retirement of a good man, must be in time of a man."

Right before I put up all the files, I tried training the model on all words in Austen, but still replacing only the nouns in the text. The results are much more interesting in the text as well as the 2D plot; while there is no obvious clustering effect visually, you can start seeing related words together, like the bottom:

There are also some interesting similarity results for gendered words in this model:

[(u'son', 0.7893723249435425),
 (u'reviving', 0.7113327980041504),
 (u'daughter', 0.7054953575134277),
 (u'admittance', 0.6823280453681946),
 (u'attentions', 0.658092737197876),
 (u'warmed', 0.6542254090309143),
 (u'niece', 0.6514275074005127),
 (u'addresses', 0.6490938663482666),
 (u'proposals', 0.647223174571991),
 (u'behaviour', 0.6413060426712036)]

[(u'nerves', 0.8918779492378235),
 (u'lifting', 0.7963227033615112),
 (u'wishes', 0.7679949998855591),
 (u'nephew', 0.7674976587295532),
 (u'senses', 0.7639766931533813),
 (u'daughter', 0.7601332664489746),
 (u'ladyship', 0.7527087330818176),
 (u'daughters', 0.7525165677070618),
 (u'thoughts', 0.7426179647445679),
 (u'mother', 0.7310776710510254)]

However, the closest match for "man" is "woman" and vice versa. I should note that in Radim's gensim demo for the Google News text, "man : woman :: woman : girl," and "husband : wife :: wife : fiancée."

And while most of the text is garbage, with some fun gender riffs here and there, in one version I got this super sentence: "I have been used to consider furniture the estate of man." (Originally: "poetry the food of love.") Unfortunately, in this version of the model and replacements, we get "I have been used to consider sands as the activity of wise."

I saved out the json of the word replacements and scores for future different projects. I should also note that recently gensim added doc2vec (document to vector), promising even more relationship fun.

A Note on Using the Python Graph as SVG Background

To make a dot image background for the graph, I just plotted the t-SNE graph in matplotlib, like this (see the do_tsne_files function) with the axis off:

import matplotlib.pyplot as plt

plt.figure(figsize=(15, 15))
# Y holds the 2D t-SNE coordinates, one row per word
plt.scatter(Y[:,0], Y[:,1], s=10, color='gray', alpha=0.2)
plt.axis('off')

After doing this, I right-clicked the inline image to "save image" from my IPython notebook, and that became the background for drawing the dots, lines, and words for the mouseovers. Using the axis('off') makes it entirely transparent except for the marks on top, it turns out. So the background color works fine, too:

#graph {
  position: fixed;
  top: 150px;
  right: 20px;
  overflow: visible;
  background: url('../data/pride_NN_tsne.png');
  background-color: #FAF8F5;
  background-size: 600px 600px;
  border: 1px #E1D8CF solid;
}

There was a little jiggering by hand of the edge limits in the CSS to make sure the scaling worked right in the D3, but in the end it looks approximately right. My word positioning suffers from a simplification--the dots appear at the point of the word coordinates, but the words are offset from the dots, and I don't re-correct them after the line moves. This means that you can sometimes see a purple and blue word that are the same word, in different spots on the graph. Exercise for the future!

I also borrowed some R code and adapted it for my files, to check the t-SNE output there. One of the functions will execute a graphic callback every N iterations, so you can see a plot of the status of the algorithm. To run this (code in my repo), you'll need to make sure you paste (in the unix sense) the words and coordinates files together and then load them into R. The source for that code is this nice post.

The Original Plan and Its Several Revisions

If I were really cool, I would just say this is what I intended to build all along.

My stages of revision were not pretty, but maybe educational:

  • "Let's just replace the words with closest matches in the word2vec model and see what we get! Oh, it's a bit weird. Also, the text is harder to parse and string replace than I expected, so, crud."
  • ...Lots of experimenting with what words to train the model with, one book or all of them, better results with more data but maybe just nouns...
  • "Maybe I can make a web page view with the replacements highlighted. And maybe add the previous word and score." (You know, since the actual text is itself sucky.)
  • ...A long bad rabbit hole with javascript regular expressions and replacements that were time-consuming for me and the web page to load...
  • "What if I try to visualize the distances between words in the model, since I have this similarity score. t-SNE is what the clever kids are using, let's try that."
  • "Cool, I can output a python plot and draw on top of it in javascript! I'll draw a crosshair on the coordinates for the current word in the graph."
  • "Eh, actually, the original word and the replacement might be interesting in the graph too: Let's regenerate the data files with both words, and show both on the plot."
  • "Oh. The 'close' words in the model aren't close on the 2D plot from the nouns model. I guess that figures. Bummer. This was kind of a dead-end."
  • Post-hoc rationalization via eye-candy: "Still, better to have a graph than just text. Add some D3 dots, a line between them, animate them so it looks cooler." (Plus tweaks like opacity of the line based on closeness score, if I do enough of these no one will notice the crappy text?)
  • Recap: "Maybe this is a project showing results of a bad text replacement, and the un-intuitive graph that goes along with it?"
  • "Well, it's some kind of visualization of some pretty abstract concepts, might be useful to someone. Plus, code."
  • ...Start writing up the steps I took and realize I was doing some of them twice (in Python and JS) and refactor...
  • "Now I still have to solve all the annoying 'final' details like CSS, ajax loading of text parts on scroll, fixing some text replacement stuff for non-words and spaces, making a github with commented code and notebook, add a button to clear the graph since it gets crowded, etc."
  • Then, just as I was about to launch today: "Oh, why don't I just show what the graph looks like based on a model of all the words in Austen, not just nouns. Hey, wait, this is actually more interesting and the close matches are usually actually close on the graph too!"

There were equal amounts of Python hacking and Javascript hacking in this little toy. Building a data interactive requires figuring out the data structures that are best for UI development, which often means going back to the data processing side and doing things differently there. Bugs in the vis itself turned up data issues, too. For a long time I didn't realize I had a newline in a word string that broke importing of the coordinates file after that point; this meant the word "truth" wasn't getting a highlight. That's one of the first words in the text, of course!

And obviously I replaced my word2vec model right at the last second, too. Keep the pipeline for experiments as simple as possible, and it'll all be okay.

Sunday, October 26, 2014

A Roundup of Recent Text Analytics and Vis Work

Some really exciting things in text analysis and visualization have crossed my Twitter feed recently; I thought I'd pull together some pointers in case you missed any of my tweetspam about one of my favorite subjects. Maybe posts like this will become a regular thing!

Shiffman's P5.js and Javascript Text Tutorials

Dan Shiffman, famous for his excellent books and lessons on Processing, is doing a course for ITP that includes a lot of text analytics work done in javascript and p5.js (the new javascript Processing lib). The git repo for his course content (code and tutorials) is here. He includes accessible content on TF-IDF, Markov chains, Naive Bayes, parsing, and text layout for the web.

Topic Modeling News

David Mimno updated Mallet, the Java reference package for LDA, with labeled LDA (topics within labeled documents) and stop word regular expressions. Blog post with some explanation here.

Alan Riddell released a Python implementation of LDA with an interface inspired by scikit-learn. He points to an interesting semi-supervised topic modeling package also in Python, zLabel-LDA.

I liked this paper by Maiya and Rolfe with ideas for improving labeling of topics as compared to using raw LDA results. (Every time I teach topic modeling I confront the "but what do these mean" question, and the notion of post-processing the results for more meaningful representation gets a pretty short answer, because we've usually run out of time.)

Here's a nice recent project release from Peter Organisciak for making timeseries charts of topics across digital books in the Hathitrust Digital archive. Full instructions for the python and R package. Here's a section of his example of some topic distributions across The Scarlet Letter:

Words in Space (Multidimensional Scaling)

I was rather excited when Mario Klingemann posted his evolving project on visualizing the topics of the images in the Internet Archive's Book Collection -- a giant zoomable map of related subjects crunched with t-SNE. The links open the related images collections on flickr (e.g., here's "playing cards"). If you like old book images, especially woodcuts, this is a trap you may never escape from! I got lost in occult symbols and finally had to shut the tab.

Related to topic modeling, Lee and Mimno posted a paper on drawing convex 2d (or 3d) hulls around "anchor words" to outline topics in their co-occurrence spaces, such as from t-SNE (t-Distributed Stochastic Neighbor Embedding) or PCA. From their paper:

Meanwhile, David McClure has an interesting post about creating something like these algorithms "by hand" and generating network diagrams from the results. (Thanks to Ted Underwood for passing this on.) Here's his hand-labeled map of War and Peace:

Other words-in-space multidimensional scaling projects of recent note include word2vec, which has a nice Python gensim implementation (see great blog post and demo by Radim Řehůřek);
and GloVe, which claims to improve on word2vec but looks similar to me from the usage perspective (here's a "maybe buggy" Python implementation). t-SNE also has implementations in lots of languages including Python and R, all listed on their page. Also see a nice overview explanation of word embeddings with t-SNE visual examples by Chris Olah here and his demo of dimensionality reduction and t-SNE here.

Narrative Vis

In a fascinating project, Georgia Panagiotidou and Anne Pasanen visualize the oscillation of characters between good and evil in the Finnish Kalevala epic. Really lovely and worth a browse in full screen.

Nick Beauchamp's Plot Mapper: Paste in a text and a complex PCA visualization reduces it to something amazingly simple. He says,

The text is chopped into N chunks, and each "chapter" is plotted in a 2-dimensional space (connected by lines) along with the top X words in the text. You can see how the trajectory of the text moves through the space of words, emphasizing different themes at different stages of the work.

Here's a surprisingly sweet Peter Pan:

Note: Keep options for words to generate low, or you may get an error. Thanks to David Mimno (@dmimno) for passing that one on!

Text Generator Art

Darius Kazemi (@tinysubversions) is doing NaNoGenMo (National Novel Generation Month) again this year - repo and rules here. Let's all work on text generation in November! Instead of, you know, actually writing that novel by hand, like an animal.

It was a few weeks ago, but it still makes me giggle - the Vogon Poetry Generator that uses Google Search to build something based on the title (which you can edit in-page).

Wrapping Up

I really love text visualization projects that combine great analytics with great applications. Keep sending me pointers (@arnicas on Twitter) and maybe I'll do more of these roundups when the awesome gets to me enough. For more inspirational links, try my Pinterest board of text vis, my twitter list of text vis, art, nlp folks (who talk about a lot of other things so YMMV), and this hopefully growing index of academic work from the ISOVIS folks.

Sunday, May 11, 2014

Data Characters in Search of An Author

My last post on Implied Stories was about how we fill in the blanks to create story contexts in even very short works, like Hemingway's example of "the shortest story ever told": "For sale: Baby shoes, never worn." In that post, I used Pixar's 22 Rules of Storytelling and Emma Coats' talk about them at Tapestry Conference, plus some sociology, to frame my points about how audiences find implied stories.

I closed that post with some concerns about how this applies to data visualization, as we "read" the stories implied in visuals and look for causation, for example. Our brains are telling stories even when they might not be there; as a designer or journalist you might want to head them off at the pass, or face a stampede of weird conclusions. You get those with correlation plots, which everyone reads as causation (after all, you must be implying something, right?). Some great new examples of spurious correlations came up this week in a popular linkmeme, the Spurious Correlation site. I dare you not to try to create a story in your head to try to rationalize this one:

Per capita consumption of cheese and number of people who died becoming tangled in their bedsheets. From tylervigen.

It's at least a well-known "myth" that you shouldn't eat cheese late at night.

Over at Eagereyes, Robert Kosara argued with me and Hemingway that the baby shoes ad isn't a story, because it lacks the formal elements of narrative structure. Part of my point was in how much we bring to the interpretation independent of what is written or shown explicitly. Our brains look for stories and remember stories, as was noted in the recent excellent Data Stories podcast on the topic. But I do think a lot (or most) of data visualization — including the most successful work — lacks story element completeness, and the metaphor is weak as a result.

This is part 2 of my post on implied stories, suggesting that good data visualization is often about characters. What we as readers or as designers do with those characters to fill in the story around them isn't the focus here. But readers are hooked by characters. 8 out of the 22 Pixar Rules focus on character. While the story metaphor for visualization might be weak in places, I think it works when we look for characters in successful visualizations, especially with respect to data outliers.

Heroes & Villains

Sometimes the data provide the heroes and villains of the story, and the rest of the work is finding out the setting and events that got them there. Often that reporting is at least partly in text, not in another data visual.

This is just one recent visual from the many news stories currently analyzing the depth of America’s health care problem, from the Atlantic. The outlier, who in this case is the hero (for readers of the Atlantic), is here cast as the villain, making a pretty compelling point:

Image from Atlantic article.

Underdog Heroes (in Search of an Author)

Another case, one of my favorites, is hidden in the movie data released for the Information is Beautiful movie data contest a few years ago. I didn’t enter, but looked at the data out of curiosity. It turns out there is an extreme outlier in profitability, Paranormal Activity, which cost almost nothing to make in 2009 and racked up 1289040% of its cost in profit (or 1311200%, depending on how you calculate). The next closest profitability is Insidious in 2011 with 6467% profitability. Paranormal Activity blows away the scale unless you revert to logarithmic. It looks like this, otherwise:

Graphing an extreme outlier without a log scale.

Since the data was released as part of a contest, I checked the entrants to see how they had handled this. A lot of them just filtered it out, didn’t deal with it at all. There’s a giant story in that data point, if you ask me what’s interesting in that data set, and it wasn’t told in most entries. In a few interactives, it was, but far fewer than you’d expect. Surely the point of interaction is the ability to dynamically change scales, add explanations, zoom and filter? Instead, the “outlier” that most folks found convenient to report is Avatar, which fits on a non-log scale. Here’s the default view of James Fisher’s entry, showing raw profit, not % profitability. That particular outlier is Avatar:

James Fisher's Hollywood vis.

McCandless (or his blogger) says that this entry “Encourages the user to draw their own conclusions with highly customizable elements and hundreds of data combinations.” A lot of the interactive visualizations do this, showing off their app building skills, creating exploratory tools rather than finding and highlighting that interesting data point or points. To Fisher’s credit, he does revert to a log scale view and doesn’t hide that amazing outlier, if you can find the controls to display by profit and then see the tiny dot that is my hero:

James Fisher's Hollywood vis in action.

As an aside, I found McCandless's brief elevator pitch text for each shortlist entry really interesting; what's the take-away pitch for your vis? "It's a tool to allow exploration" vs. "It shows that X" or "When you compare X and Y you see that..."? More of McCandless's intro text: "Sometimes bubble charts are all about color, aren’t they? Choose the right ones and let your brain and eyes do the rest." (Hmmm.)

But how about "Did you know that the average audience rating for love stories is highest around spring, summer holidays and christmas?" Now that's kind of interesting and definitely makes me want to explore. The entry in question, Confluence by Blimp Design, used a clever and funny labeling trick to handle the Paranormal Activity scale problem — although it might not be clear to folks without a footnote (look to the right end point of the second scale slider):

Gordon Chan’s entry ("sometimes bubble charts are all about color") does some nice things with the spacing between y-axis gridlines to compress and expand where room is needed, but he gives up on Paranormal Activity as a lost cause. You can see Paranormal Activity 2, though, the top red blob here!

If I were a journalist looking for a story about money in this data, I’d probably be more interested in the underdog heroes of the Paranormal franchise than the James Cameron Avatar success story. The Paranormal data point sent me to Wikipedia where I learned that indeed, it is the most profitable film ever made based on return on investment. Just because it’s inconveniently extreme doesn’t mean it’s not an important data point to showcase in a visualization, especially an interactive. (Admittedly not everyone entering the contest was focused on profit, however, as their angle of choice.)

Our Hero Joins a Gang

Related to the heroes and villains is the character “fell in with a bad crowd” and “joined a great band” posse story. The technique here is to show a group your readers associate something negative (or positive) with and how your hero/villain can be seen as a member of that group. Here’s Russia along with other scary outliers in mortality rates:

Combining characters into groups that behave similarly is a nice technique for reducing the visual noise of these types of plots. Regions like “EU” or “Northeast” make for good obvious groupings. Sometimes non-obvious groupings are the story, and we’re back to the “Band/Gang” theme, where we learn about a character by the company it keeps; look for the U.S. in the upper left quadrant here:

There may not be an obvious bad or good gang, but groups help you tell a cleaner story anyway.
Rule 5 in the Pixar rules starts “Simplify. Focus. Combine characters.” We're all familiar with the technique of re-coding our data in groups, like turning 12 months into 4 seasons, to highlight patterns that may be seasonal. Usually the data determines the reasonable groupings for you, like this nice illustration from NOAA of what months constitute "dry" vs. "rainy" seasons in Florida:
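That kind of recode is a tiny mapping function; the month boundaries below are illustrative, not NOAA's exact Florida definitions:

```python
# Recoding months into seasons to surface seasonal patterns.
# The wet/dry boundary months here are illustrative stand-ins.
def season(month):
    """Map a month number (1-12) to a Florida-style wet/dry label."""
    return "rainy" if 6 <= month <= 9 else "dry"

print([season(m) for m in (1, 7, 12)])
# ['dry', 'rainy', 'dry']
```

The same pattern works for any grouping the data suggests: regions into "EU"/"Northeast", months into quarters, and so on.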

6 (or 60) Characters In Search of an Author

I’m also a fan of small multiples, but sometimes we're presented with exploratory visualization with no analysis applied. Here’s an example where I have no idea what to take away from it — if you squint you’ll see a small green blip that might’ve been a story here, but I don’t know.

Here’s a nice small multiples example, though, from a good blog post:

Image and discussion about storytelling charts found on XLCubed Blog post.

Some reasons it’s good: no hunting for the good/bad cases from the reporter’s perspective (the color helps); trendlines annotate the important cells; high points of interest are labelled; the mean of the data is in the background of the small cells so you can see the relative values being shown; text helps guide the interpretation too. This is the well-known "annotation layer" at work for us. (Yes, there are cosmetic things I don’t love, but it’s overall very nice.)

Wrapping Up (So I Can Go Eat)

Without a lot of supporting explanation (or performance, in the case of Hans Rosling), I don't believe a visualization can tell you the "whole story," in terms of who, what, when, where, why, and how; a visualization usually implies only a few of those, and our brains leap to conclusions the author of the visualization may or may not expect. Spurious correlations that suggest causation are a great example. If the author shows the data, it must be important and mean something, right? But sometimes visualization creators make poor choices, as in any design activity.

One of Emma Coats’ most resonant points at Tapestry was that audiences for movies want to be enthralled, delighted, connected — they don’t necessarily want to learn, but they are curious. Presenting a visualization hook of a hero or villain outlier and the company they keep might make a reader curious — curious enough to explore the visualization or topic further. But I still think some authorship is needed; otherwise, you risk your reader leaving your piece in frustration after hitting a few buttons and, if you're lucky, admiring the work you did re-inventing Excel online. If your authorship amounts to having screened out a really interesting outlier because it was inconvenient for your code, well, I'm not giving your story 5 stars.

Notes: Heroes and Villains is a book by the late Angela Carter, and 6 Characters in Search of an Author is a play by Pirandello, some of which involves the characters arguing about what their drama is about. You really should listen to the Data Stories podcast on the storytelling in vis debate.

Sunday, March 23, 2014

Implied Stories (and Data Vis)

At the excellent Tapestry Conference in February in Annapolis, Emma Coats (@lawnrocket) spoke about storytelling, the theme of the conference. Her talk was based on her internet-famous 22 Rules of Storytelling developed while she was at Pixar.

Lacking the video of her talk ([ETA: here it is!]), I cracked open the ebook based on her principles by Stephan Bugaj, Pixar’s 22 Rules of Story (That Aren’t Really Pixar’s) (which, incidentally, Emma says was written without her permission and without her involvement. Caveat lector).

Pixar Rule 4:

Once upon a time there was a ______. Every day, ________. One day ________. Because of that, _______. Because of that, _______. Until finally ________.

Bugaj points out this is a summary of a basic plotting structure, the “story spine,” suggested in many books on writing fiction: setup, change through conflict, resolution. The details make it a good story, of course (character, context, conflict…).

Emma talked about confounding the expectations of an audience: The ghost of what they expected should remain at the end, but your story arc should win (and convincingly). Related was an important point: the implied story line. You suggest a shape to what will or might happen (or has happened), and the audience fills it in. Her pithy example was Hemingway’s “shortest story ever told,” a six-worder:

“For sale: baby shoes, never worn.”

There are lots of ways the story here can be filled in, all of them sad. The reader brings the detail and does most of the work, but the author set it up very well to allow this.

Another Short Story

I’d like to offer another example, a very short story deconstructed in a series of lectures by sociologist Harvey Sacks (Lectures on Conversation) — which coincidentally also features a baby:

“The baby cried. The mommy picked it up.”

Maybe it’s not as GOOD a story as Hemingway’s, but Sacks argues it’s a story, based on having a recognisable beginning and end, the way stories do. There’s a dramatic moment, and a resolution. And while you may think we can read less into the plot than into Hemingway’s, Sacks spends 2 lectures (plus book appendices) on this story and how we understand it the way we do.

OK, let's accept that it’s a story. Second, we infer that the baby and the mommy are related: it’s the baby’s mommy. “Characters appear on cue” in stories, he says; the mommy is not a surprise in the normal setting conjured in our heads, and she doesn’t feel deus ex machina, like cheating.

Notice the story didn’t say “his mommy” or “her mommy” or “the baby’s mommy.” Juxtaposition of category terms often used in family contexts helps us infer this, Sacks argues. It’s clearly possible the baby was abandoned outside a supermarket and someone else’s mother picked it up to comfort it, as I hope one would! It’s not the simplest reading, though. Notice that we also assume they are humans, not apes or cats. Our human context draws that story, an Occam’s Razor kind of principle of reading.

Third, Sacks notes we read the story as having cause and effect. Again, this is related to the juxtaposition and to assumptions about normal family roles. That’s partly the expected story spine at work, too: conflict, resolution! Cause and effect, NOT just correlation.

Fourth: the action in this story is believable and interpretable, unlike “colorless green ideas sleep furiously.” (That’s an old linguistics chestnut.) Babies cry; babies who cry should probably be picked up. Sacks notes that a mother can plausibly say, “You may be 40 years old but you’re still my baby.” In that case we don’t expect the crying 40-year-old to be picked up, even if he’s “acting like a baby.” We fill in the blanks in this story in the most consistent way possible for the details we’ve been given, which means a lot of assumptions based on what we know and expect about social and human behavior.


A thing I didn’t tell you right away is that this is a story told by a two-year-old, which Sacks got from a book called Children Tell Stories. Sacks spends a fair number of words on how this works as a story given that it comes from a child: the drama is a child’s, and the resolution is a child’s happy ending. Sacks suggests that children, as speakers, might start a story with a dramatic moment as a method of getting the floor. He says the dramatic problem here is a valid child’s talk opener, like “Hey, did you notice your computer is smoking?” would be for a stranger addressing you in a coffee shop while you’re getting a napkin. The ending is a valid ending, because for a child, being picked up is a resolution. For this story to have a tidy ending, we infer that being picked up results in a non-crying, or at least happy, child. But that non-crying denouement is only implied, because Mommy has done the expected thing.

The child’s story is arguably less sophisticated than Hemingway’s, but notice that it’s more of a classic, plotted story in that two events occur: the crisis and the resolution. I hope I’ve convinced you that it’s still quite sophisticated in terms of how much we bring to it when we read it, and how it successfully carries us along despite being terse. Hemingway’s is a suggestion of events behind a public for-sale ad, and all the action and characters and emotion occur in your head.

Story, Discourse, Visuals

What does this have to do with data visualization? Emma Coats wasn’t quite sure how to relate her storytelling principles to vis design, but left it to us as adult vis creators to make that connection. I’m going to spell out some of what I take from the Pixar and Sacks points, as well as a little more storytelling thinking.

First, a useful distinction in terms, from Dino Felluga's General Introduction to Narratology:

"Story" refers to the actual chronology of events in a narrative; discourse refers to the manipulation of that story in the presentation of the narrative. [...] Story refers, in most cases, only to what has to be reconstructed from a narrative; the chronological sequence of events as they actually occurred in the time-space ... universe of the narrative being read.

(This isn't necessarily the way a linguist would define discourse, but it'll do for now.) Discourse encompasses all the similes, metaphors, style devices used to convey the story, and in a film, all the cutting, blocking, music, etc. The story is what is conveyed through these devices when the discourse has succeeded. (So, for Felluga, telling "non-linear" stories is an attribute of the discourse, not the story itself.)

Hemingway's short story's discourse structure is very different from a two-year-old's. The artistry lies in the discourse choices as well as in the stories they picked to tell.

Felluga illustrates how stories can be told in a visual discourse form with a Dürer woodcut:

(Woodcut to Wie der Würffel auff ist Kumen (Nuremberg: Max Ayrer, 1489). Reprinted in and courtesy of The Complete Woodcuts of Albrecht Dürer, ed. Willi Kurth (New York: Dover, 1963).)

The story goes something like this: 1) The first "frame" of the sequence is the right-hand half of the image, in which a travelling knight is stopped by the devil, who holds up a die to tempt the knight to gamble; 2) the second "frame" is the bottom-left-hand corner of the image, where a quarrel breaks out at the gambling table; 3) the third "frame" is the top-left-hand corner of the image, where the knight is punished by death on the wheel. By having the entire sequence in a single two-dimensional space, the image comments on the fact that narrative, unlike life, is never a gamble but always stacks the deck towards some fulfilling structural closure. (A similar statement is made in the Star Trek episode I analyze under Lesson Plans.) [Note from Lynn: Love this guy.]

George Kampis drew these lessons from this example, for his own introductory course:

  • Narratives can be visual
  • Time is Space here
  • Actions and events are consequences (causation), not just occurrences in a sequence.
  • Narrative is therefore offering "explanation" — why did things happen?
  • But order has been imposed.

I would not argue that the woodcut is easy to read, at least for most of us. Reading this story requires background in themes and socio-cultural contexts that many modern viewers no longer have. It's not as simple as "the baby cried," or even the Hemingway for-sale discourse format.

Causation in Vis

We look for cause and effect in sequences of events, which is why I suspect there’s so much confusion over correlation and causation in data reporting. Charlotte Linde, in Life Stories, talks about this as "narrative presupposition." She offers us the following two examples, which we read differently:

1. I got flustered and I backed the car into a tree.
2. I backed the car into a tree and I got flustered.

Linde toys with the idea that this is related to cognition, but falls back to suggesting it's a fact about English (and possibly related languages') storytelling discourse and morphology. Regardless, it is a "bias" we bring to bear when interpreting sparse, juxtaposed details. If a data reporter chooses details that juxtapose the rise of one thing with the rise (or fall) of another, the average reader will assume the reporter implies causation.

What's an example of a simple causation story in data vis? A timeseries of measures is a good candidate, but without added context, it's often just "X, then Y." Filling in explanatory context on timelines has become standard, at least in journalism. The labels here help us contextualize the data, and arguably to infer some causation:

(Image by Ritchie King in a Quartz article.)

Here the designer has imposed order by suggesting causation, or at least relevant correlation, between the measures shown over time and the labelled events. Some of the labels may be merely "informational," like the recent presidencies. For readers who know about the Clinton-era economy versus the Reagan and Bush economies, the annotations carry more meaning. Regardless, by choosing to annotate in this way, the reporter very deliberately suggests relationships in the reader's mind. Less clearly related events also happened during those labelled periods (births, deaths, scientific discoveries), and yet their relevance wouldn't be so "obvious," or so easy to accept at a glance as reasonable. Economy and war go together like babies and mommies.
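Mechanically, an annotation layer like the one in the Quartz chart is just a set of (time, label) pairs joined to the measured values, so each label lands on the data point for its year. Here's a toy Python sketch of that join; all the numbers and event names are invented for illustration, not taken from the chart:

```python
# Toy sketch (all data and event labels invented): join event annotations to a
# timeseries so each label can be drawn at the value measured in that year.
series = {year: 50 + 3 * (year - 1980) for year in range(1980, 2013)}
events = {1989: "policy change", 2001: "recession begins", 2008: "financial crisis"}

# Keep only events that fall inside the measured range, paired with their values.
annotations = [(year, series[year], label)
               for year, label in sorted(events.items())
               if year in series]

for year, value, label in annotations:
    print(f"{year}: {value} -- {label}")
```

The editorial power, of course, is entirely in which events go into `events`; the join itself is trivial.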

Because readers assume the author has juxtaposed items on purpose, suggesting odd relationships in your discourse automatically evokes weird stories in your reader's heads. These might be entertaining from an artistic perspective, of course...

(A super example from this paper on fallacy, summarized on Steve's Politics Blog.)

It's a little unlikely that lemon imports over time have a direct causal relation to accident rate, although we immediately want to figure out how they could!
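To make that concrete, here is a minimal, stdlib-only Python sketch (my own illustration, with invented numbers, not data from the paper): two series that share nothing but opposite time trends produce a Pearson correlation near -1, while correlating their year-over-year changes instead makes the apparent relationship largely vanish.

```python
import random

random.seed(42)  # deterministic, for illustration

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def diffs(s):
    """Year-over-year changes."""
    return [b - a for a, b in zip(s, s[1:])]

years = range(20)
# Invented numbers: imports rise over time, fatalities fall, independently.
lemon_imports = [200 + 15 * t + random.gauss(0, 10) for t in years]
fatality_rate = [16 - 0.4 * t + random.gauss(0, 0.3) for t in years]

r_raw = pearson(lemon_imports, fatality_rate)                 # close to -1
r_diff = pearson(diffs(lemon_imports), diffs(fatality_rate))  # near 0
print(f"raw r = {r_raw:.2f}, differenced r = {r_diff:.2f}")
```

Detrending (or differencing) before correlating is a standard first check when two timeseries seem suspiciously in step.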

Artistic &/or Journalistic

Is journalism better served by two-year-old storytelling with simple discourse forms ("X, then Y")? Maybe, for some purposes. Even so, there are a lot of unwritten implications behind every chart, from what's reported to how it's reported. It's easy to classify some work as simple propaganda; see Media Matters' History of Dishonest Fox Charts for many examples of apparently intentional misleading by implication.

Periscopic’s Stolen Lives gun deaths visualization was criticized by some for being un-journalistic, and yet, it makes its implications quite explicit and well-marked in the discourse (gray lines). The visualization walks the viewer through the interpretation with a slow intro, to show exactly where the artistic license begins to deviate from the data source.

(Visual from Periscopic's work.)

This work may be more like Hemingway's for-sale story than a two-year-old's, although it leaves less to the imagination even as it veers further from traditional journalism. Yet this is still data visualization taking an artistic narrative risk, for the sake of activism.

Wrapping Up (So I Can Watch TV)

Even very simple stories, whatever the discourse form, rely on the reader filling in a lot of invisible holes. Some of the interpretation we do is so "obvious" that only sociologists or cognitive scientists can make explicit the jumps we don't notice we're wired to make. Choice of structure, of juxtaposition, of annotation, of what's implied versus made explicit: these are discourse maneuvers that can clarify, mislead, open up possibilities, or even evoke emotion in surprising ways.

A willingness to borrow insights from other disciplines' thinking about these subjects was one of the reasons I liked Tapestry's programming. Emma Coats made me get out some old books, and writing this up helped tune my thinking a little bit. Good conference, and hopefully a thought-provoking post for a few readers.

Incidentally, some recent related articles: Periscopic's A Framework for Talking About Data Narration and Jen Christiansen's article "Don't Just Visualize Data — Visceralize It." [ETA: Also, a followup to this post by Robert Kosara at eagereyes.]

Saturday, November 16, 2013

Data Vis Consulting: Advice for Newbies

Every time I give a talk or introduce myself at a conference, someone gets really interested in what I do. When I think I’ve scored a potential client, it turns out they just want to know if they could do what I do. Some folks are direct: One woman at a Python Data Science conference said, “So, you know that job ad list you run for data-vis jobs? How do I get one of those jobs?”

Backing up a tick, in case you didn’t know: I curate a low-traffic list for job ads in data-vis (short for “data visualization”). I think she knew about it from me on Twitter. If you don’t follow me on Twitter, you might think about it; I share a lot of links related to data science and visualization: @arnicas.

This post is about getting yourself to a place to get those jobs, plus money and client issues. I’m not the poster child for successful consulting, but I have been doing it a few years now and I am living off it. I work as an individual contributor, writing code for data analysis and creating interactive visualizations. To balance out my own skew on things, I asked for input from some fairly famous names, mostly folks I know via Twitter (see People Resources).

Note: I’m not talking in this post about skills needed or training resources--there are books, MOOCs, and plenty of ways to find that stuff by now; see especially Andy Kirk’s site Visualising Data.

The Kinds of Data Vis Work Out There

In my experience, there are a few key varieties of deliverables for freelancers in data visualization:

  • "Cool data set" visualization: Client wants someone to explore a data set and produce a static or interactive graphic they can feature as a PR move, as part of a news story or article, or for an internal business report. This work is probably most well-known, since it’s the core of the most famous/artistic vis work.
  • Dashboards: A lot of organizations want analytics (or consumer) dashboards reporting multiple key metrics in an attractive and useful display. They seek help determining those metrics and the best way to present them. (I’m putting these between the “cool data set” and “tools” because as a problem, they combine aspects of both. Make sure you read the Red Flag section for this work!)
  • Tool building: Client is building tools for users (internal or external) to view data of some kind, which means they and you are not starting from a specific data set, but an idealized version of one; and you are helping design/create infrastructure for data exploration.
  • Teaching: Teach principles of design, visualization, how to produce graphics and interactives, basic or advanced stats; usually in workshops.

Work roles range from part-time as-they-need-it work, through sub-contracting for big names or projects, contract-to-hire gigs, and single-project work, to retainer work for regular needs; these jobs often entail some mix of design and development. I myself maintain a mix of these roles and work types so that I keep busy. (However, I charge slightly differently depending on whether I’m doing primarily design or development. I consider UX design to be “harder” and more painful due to the people-politics involved.)

Usually print work doesn’t pay well, but can be excellent for PR and the portfolio. Most of the “famous” artistic vis folks do some percentage of print work and win awards for it, even the ones who also do interactive work.

In general, a lot of us (in my People Resources) started doing freelancing in other areas, before getting more focused on the data visualization field. I was a UX designer for many years at (too) many companies before I went independent. Through some lucky breaks, I was able to do more and more data-related projects in my UX work, until I switched entirely to data work a few years ago. Others from whom I solicited input said they started in general web design or development contract work before moving into 100% data work.

Client (Mis)Understandings About What We Do

I’m hearing more often from start-ups and funded research groups (in universities and companies) that they have plenty of back-end data analytics people, but are in need of front-end folks for data in particular. Front-end people have always been in short supply, and ones who can do good work in the latest visualization tools are even scarcer. But beware, “front-end” for a data shop or an un-savvy product manager may mean any or all of: UI designer, website builder (all of it, from scratch, including login modules, preferences, menus, etc), interactive or static data visualization builder (e.g., d3, or just ordinary charts/graphs). Also, sometimes database janitor and stats person, depending on who else is on staff.

I try to explain that I’m no longer a broad UX designer for site architectures/workflows, that I’m working in the data area only now. Here’s a chart from Alberto Cairo’s book The Functional Art that broadly captures the data design specialization, although it doesn’t try to capture skills, overlaps, and tools at all:

Given the excitement over big data's promises and analytics-driven business goals, the popularity of infographics, and the excellence of today's interactive data journalism (at places like The New York Times and The Guardian), data visualization is hot for consulting right now. Unfortunately, there can be a lot of noise amidst the signal from legitimate clients. Sometimes I can't tell if I'm reading a spam broadcast or a real email; to wit, today: "Mr. [Redacted] is requesting contact information for anyone/company with experience in store cluster analysis, at a reasonable price."

Kim Rees says to beware of the client saying, “We’re really excited to do some datavis! But we’re not sure how to get started.” “These people have no idea what datavis is. Conversations will be confusing, nebulous, and full of far more questions than answers. Tell them to get back to you when they have a project in mind that involves data. Or give them a budget just to explore ideas with them. The only deliverable of that phase will be: Project Idea write-up no longer than one page.” Likewise, coming up with a visualization appropriate for the data and users IS the job, and doing pre-contract work to determine what you will build is not workable for a sustainable consulting business.

"These people have no idea what datavis is."
(Kim Rees)

Tiffany Farrant-Gonzalez notes that “lots of clients are attracted to heavily visual infographics that have become popular, and it’s sometimes hard work to educate them about good visualization practices.” She says, they sometimes want you to “simply make their data ‘look cool’ or ‘more interesting’ without really understanding what this means or the process involved.” Certainly even in development jobs, I have to explain that there is a data exploration, analysis, and design phase BEFORE the building starts -- just as in other design spaces.

Moritz Stefaner, who most frequently works on what I'd class as "cool data set vis" projects, tells me he usually requests a data sample, plus answers to a few questions clarifying the context and basic motivation of the project, before starting:

  • Why are we doing this?
  • What are you hoping to achieve?
  • Who are we targeting?
  • How is the end product going to be used?
  • How are we publishing?
  • What data do we have available?
  • Which other existing materials should we take into account?
  • Which constraints do we have?
  • Who is responsible for what?
  • Who else is doing something similar?

For Moritz, answers to these questions help him understand why the client thinks a data visualization is important, and also help define success criteria for the project. He says, “Often, both the client and I realize that half of these questions cannot be answered yet, but that's fine, as long as we make sure to answer them along the way.”

Moritz shares a workflow diagram he uses with clients to illustrate the process and iterative stages of the work:

Moritz Stefaner noted in his excellent interview on FILWD that you need to educate a client to move along with you, so they see the value and thought process, the pros and cons of various design approaches. All design involves tradeoffs, and you need to illuminate these to help the client evolve their own thinking about what’s important to show. Designers of work other than data vis need to do this as well, of course. Remember this impacts your billable hours: producing presentations or documentation materials around your work is time-consuming.

For work that is closer to tool-building, I would also suggest these kinds of questions, at least for solo consultants like me:

"Do you actually get any say over the presentation and design?"
(Me, wondering)
  • Do they want you to build it all? Try to get some notion of what "all" means for them.
  • Do they have others who can do the generic site code around the visualization piece? Building a whole site to host a vis project is usually non-trivial work!
  • Do they have a designer on staff now who does CSS/visual design (useful if you aren’t superb at this; or problematic if you are and they’re not, or they aren't clueful about the visuals in visualization)
  • Do you get to touch any of the data yourself (because you need to understand it to build something smart)? Who has the data, how can you get to it? (SQL, API call, samples available as CSV...)
  • Do you actually get any say over the presentation and design, or are you a code monkey in their eyes?
  • Are they looking for work "like" anything in particular? (They usually have an inspiration, whether something in the paper or a tool they use or a competitor.)
  • Do they need anything as fancy as d3.js or advanced charts, or would Highcharts and/or a general javascript person be good enough for them?

Early on in your consulting days, you may need to take more jobs that involve some unpleasantness or non-specialist work, but as you get more successful, you can be more choosy.

Getting Started: Do the Work

Suppose you aren't even at the point of talking to clients about doing data vis work, and you're wondering how to transition into it.

As Bill Shander says, you have to “Do the work.” Scott Murray suggests, “Find data stories that are interesting to you, and create them.” (If you have trouble finding data stories you’re interested in, rethink this as a career path, perhaps?) Get a fun data set, analyze it, do some visualization, post about it on a blog. Create the kinds of things you’d like to get paid for. There’s remarkable correlation of opinion across the folks I asked for input: Do projects, even (or especially) for free, that set an example of what you can do and would like to do. Then publicize them on Twitter, on your blog, in a presentation or talk (such as at a local tech Meetup or conference).

“Make, make, make, make.” (Jer Thorp)

Entering visualization contests is another good way to get some experience and attention, although the bar can be quite high for winning. Visualising.Org has some nice challenges with significant prize money. Bill Shander’s entry in a recent contest was picked up in reporting coverage of the contest and got him client leads. Jan Willem Tulp also praises contests, and offers: “A nice side-effect is that you're actually practicing creating data visualizations for a fictional client. Additionally, you’re already building on your portfolio this way!”

Jer Thorp says, “Make, make, make, make. Reduce the preciousness of your work so that you can make more of it and get further faster.” Amen.

The Required Portfolio

Having a portfolio is critical. You always need something you can show, that you yourself made, because every single client will look for evidence that you can do the work. Note that, unfortunately, a lot of jobs don’t produce work that can be shown in public (whether for NDA reasons, or because it was for an internal tool or demo). I myself have a hidden portfolio that I produce on request, because I’m still tuning my self-presentation and collecting items that can be made public.

Anna Powell-Smith says, “The most effective thing people can do to get hired is to create good projects by themselves. Clients love to see that you can both come up with a good idea, and execute it. And if it's all your own code, they know exactly how good you are.” Put aside long weekends or the Christmas break!

"Be selective in constructing your portfolio." (Everyone)

Jeff Clark says his website projects have gotten him valuable input and forced him to think about a personal work brand. “I think almost every project I've done for pay has started with someone seeing some work I've done on my website and they have contacted me through email.”

Dominik Baur suggests doing visually interesting projects that appeal at first sight, to get you a second, closer glance. The power of the visual can help, and might get you PR from other folks (on Pinterest, for example).

Both Moritz and Jan Willem advise being very selective in your portfolio choices. Most vis folks don’t put all their work in their portfolio, and tune it regularly. Jan Willem says, “Make sure that you show the work that you want to do more of. Don't show everything, don't show that you are also good at many other things, unless you want to get work in that direction as well. It might therefore be better to show only 3 really good projects in your portfolio that represent the kind of projects you would like to do rather than showing everything you've done so far.”

Tiffany echoed this, “Take out work that isn’t work you want to be producing, or suggesting you can produce (especially if it was joint work with others). … It's tempting to list out all of your skills (no matter how strong you are at them) and display all of your previous work on your site or resume, but it really helps if you narrow down to the core services that you want to provide, and hopefully this will help you get the work best suited to you.”

General Self-Promotion: Twitter, Tutorials, Teaching

As Jan Willem Tulp says, “People have to know you exist, that you do data visualization, and that you’re good.” It’s critical to have a web site with your portfolio, a presence on LinkedIn, and possibly a blog too. Being active on Twitter can help too. Jeff Clark suggests curating good work “in public” such as on Twitter, Pinterest, or a blog, and mixing in your own work occasionally.

I do get work via Twitter connections, but I also put a lot of work into Twitter. I get a lot of professional value out of it. Twitter is where I find out what other people think is good work (I save links to delicious and Pinterest), listen to arguments/discussions among experts, hear about good blog posts, find out about conferences where I can meet people who are doing good work and learn new things.

Kim Rees also values Twitter for network connections. She says one way to get her notice is to follow her on Twitter (she reads all follower bios, which admittedly not everyone does, ahem), say interesting things about data, visualization, or design, and post an insightful comment on one of the Periscopic blog posts. Also, she loves to get paper-mail presents. (Hint!)

"Prepping talks takes time, usually unpaid." (Me, after a lot)

Give talks in which you show your work and tell people you do consulting. However, take care: prepping talks takes time, which is usually unpaid work. Make sure you talk in places that can benefit you, and try to keep track of “leads” after each one, to better assess which audiences are good for your business. Don’t forget that giving talks at conferences is also about the networking, though; the benefit of a drink at the bar with someone is often as high as the value of the talk you give (and costs less). Post your slides later, with full contact details (and your website link) in them!

Writing tutorials and teaching can be a good way to get business; Jim Vallandingham got several contract jobs from online tutorials he did, including on the popular site. They can also help you make your own knowledge more concrete: Teaching is a great way to learn. A couple of popular D3 sites and self-published books have been started by people learning as they produce materials that they take payment or donation for (see, e.g., this interesting post by D3 Noob about sales and PR effect of his book). Teaching workshops can also lead to consulting follow-ups, as Andy Kirk notes.

When you do gigs in person, always carry a lot of cards. My business cards say what I do, not just my business name and email. I like to think having a fuller business card will help people remember later why they’ve got it.

Red Flags, or Gigs to Think Twice About

Gigs with no data should be avoided, yet they are surprisingly common. Sometimes the client has none yet, or they can't get it to you, for various reasons that are themselves red flags. Why is this bad? Because you can’t produce a good design without data investigation first, and it’s a mistake to start without it. One of my clients had me drawing fake dashboards in Illustrator for a couple of months before we mutually parted ways. It stopped feeling "creative" to be "making it up" pretty damn fast.

Another client had problems getting me both the real data and any design input. When the design input finally came, it consisted of a mockup that had been created without any data investigation at all. When I looked at the real data, I discovered that a large percentage of one data field was garbage, and of course the design had to change once we realized it was unusable. Data investigation is a crucial step in the design process that can’t be skipped. Ideally you are involved in both the data investigation and the design stages.

Tool development is often hard because you may be responsible for finding or developing your own test data sets, which takes solid time. I ultimately had to let one project go for a start-up that was taking more time than estimated; it was the perfect storm of debugging and improving someone's very difficult algorithm, plugging into a complicated dev environment (a weekend lost to Git merge despair), and data set collection/creation/testing. I still have nightmares of #Fail from this one.

"I still have nightmares of #Fail from this one." (Me, with regrets)

Other gigs to be wary of: Debugging other people’s code. Just don’t take them. You’ll spend a huge amount of time that isn’t visible as “producing” something, and it will be frustrating to you and the client.

Anna Powell-Smith says to talk carefully with clients who want “something amazing” but aren't more specific. “It's so dependent on how interesting your data is.” She points people to this awesome Quora thread about data analysis on the OkCupid blog:

“OkCupid's blog worked because we had sexy data. [And] we had Christian Rudder writing the blog. … His posts were great because he's such an amazing writer, not because he's awesome at math. (He's certainly the best writer I know.) The posts each took 4-8 weeks of full-time work for him to write. Plus another 2-4 weeks of dedicated programming time from someone else on the team. It's easy to look at an OkTrends post, with all its simple graphs and casual writing style and think someone just threw it together, but it probably had 50 serious revisions. And we threw out a lot of research that didn't turn into good posts. Your start-up probably can't afford to do this. It shouldn't waste like 10 man weeks of effort/focus/money on writing a blog post.”

--Chris Coyne on Quora: “How Important Was Blogging to OKCupid’s Success”

Which brings me to another red-flag client type: the start-up that wants to hire someone to do “viral” datavis posts for their blog, but hasn’t read Chris Coyne’s post or realized how much work goes into a good data dive and report. I did a “sample post” for one once with no brief on content, spent about three times as long as they were actually paying for, and they still didn’t think it was punchy enough. (In my defense, I was given a data set of developer questions, not dating preferences.)

Some clients want eye-candy in the “cool” category: either data art or lots of bubbles with animations. I'm not saying don't take these jobs; just be sure both you and the client know what they want! One client wanted both a salesy “eye-candy” cool piece and a serious dashboard tool for concrete internal business problems; unsurprisingly, this turned into two projects. (No, I didn't work on both.)

Kim Rees suggests avoiding start-ups that are attracted to data visualization, unless they want to make you a co-founding partner. For them, visualization isn’t an add-on; it’s a core fundamental. These can be similar to the “make something cool” clients, and their pivots and lack of money usually make for risky working relationships.

"Dashboard design can be a pit of political misery." (Me)

A special red flag callout for dashboard design jobs: Like a lot of fundamental UX work, dashboard design can be a pit of political misery. Your role often ends up as an analytics counselor for a company that usually hasn’t settled on simple key metrics, which need to be determined before you can produce an attractive and useful design. You iterate quickly on ugly mockup after ugly mockup, trying to help them get internal agreement on their business goals and how to measure them via design artifacts. Highly stressful for you and the client stakeholders! (I tend to charge more for these political wrangling jobs, based on sad experience.) I should clarify that I do like dashboard work, but I now structure the project timing and money to take the politics and analytics discussions into account.

Wes Grubbs says bluntly, “Don't get stuck doing shit you don't enjoy.” Moritz advises avoiding boring or painful jobs at low pay. It might sound obvious, but we’ve all had them. Factor a “pain coefficient” into your rate for a job (a range, an upper limit, etc.). Calculate your rate by estimating the time and the value to you -- learning something new, liking the client or the data subject, the possibility of portfolio material at the end of it.

“Don't get stuck doing shit you don't enjoy.” (Wes Grubbs)

Wes Grubbs suggests your contract terms allow you to drop a job if you become uncomfortable with the data or with client requests.
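To make that rate arithmetic concrete, here’s a toy sketch -- every function name, weight, and dollar figure below is invented for illustration, not a formula anyone quoted:

```python
# Toy rate calculator: scale a base hourly rate by a "pain coefficient"
# and discount it for non-cash value like learning or portfolio material.
# All names and numbers are made up for illustration.

def adjusted_rate(base_rate, pain=1.0, learning=0.0, portfolio=0.0):
    """base_rate: your normal hourly rate in dollars.
    pain: multiplier >= 1.0 for boring, painful, or political work.
    learning, portfolio: small discounts (say 0.0-0.2) for jobs that
    pay you partly in new skills or portfolio pieces."""
    return base_rate * pain * (1.0 - learning - portfolio)

# A dull dashboard job with political wrangling: charge more.
print(adjusted_rate(100, pain=1.5))                     # 150.0
# A fun data set you can show off later: maybe charge a bit less.
print(adjusted_rate(100, learning=0.1, portfolio=0.1))  # 80.0
```

The point isn’t the exact multipliers; it’s that pain and non-cash value should show up explicitly somewhere in your quote rather than being absorbed silently.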

Real Billable Time

Moritz Stefaner tracks his time carefully and reports that one year he averaged only 18 hours of billable work a week. As he says, consulting requires a lot of administration (business calls and email, book-keeping), PR work, learning, and keeping up with tools and the industry.
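A quick, illustrative consequence of that number: if only 18 of 40 weekly hours are billable, your hourly rate has to cover the whole week. The income target below is made up:

```python
# If only ~18 hours of a 40-hour week end up billable, the rate you
# charge has to cover the whole week. The weekly target is illustrative.
weekly_target = 3600             # dollars needed per week (made up)
naive_rate = weekly_target / 40  # pretends every hour is billable: 90.0
real_rate = weekly_target / 18   # based on actual billable hours: 200.0
print(naive_rate, real_rate)
```

That gap between the naive rate and the real one is a big part of why freelance rates look “high” to salaried observers.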

Please be aware: freelance sites like oDesk show a world of work being done by kids without mortgages and health insurance, or by folks in countries with lower costs of living. The rates offered and accepted on those sites aren’t realistic views of what this work normally pays, or of what a consultant requires to stay in business long-term.

I prefer to work hourly because project pricing almost always underestimates the true time to successful completion, especially when development is involved. When I track time on pieces of a project, I find that up to 50% of a dev project can go to just plugging into someone else’s dev process and environment, and it’s not unheard of for another 30-40% to be taken up with design alternatives and analysis stages. That doesn’t leave a lot of time for the core development work!
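To make that arithmetic concrete, here are those rough percentages applied to an illustrative 100-hour budget (the numbers are ballpark, not real project data):

```python
# Where a fixed-price dev project's hours can go, applying the rough
# percentages above to an illustrative 100-hour budget.
total = 100                      # hours in the budget
integration = total * 50 // 100  # plugging into the client's dev environment
design = total * 35 // 100       # design alternatives and analysis stages
core_dev = total - integration - design
print(core_dev)  # 15 hours left for the core visualization work
```

A fixed project price quoted against the “core work” alone quietly assumes those 15 hours are the whole job.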

Red flag for project coordination: as Jim Vallandingham notes from one painful project, if one person is doing the data, another is doing the design, and you’re doing the coding, it’s hard to sync up. This is time-consuming and potentially hurts both the quality and the pain coefficient of the work. Also, when you work remotely and do a lot of handing-off, you lose the opportunity to hear client discussion and critique in person. You’re at greater risk of being handed a possibly foolish directive rather than being part of the process of coming up with a new design solution.

As Jérôme Cukier notes: “There is always a problem with the data and multiple feedback loops that can increase the time up to ten fold.” This should be built into your timing and estimates, realistically.

Cash Flow and Some Nitty Gritty

Get a financial adviser, or at least advice from someone who knows about money and small businesses. For both taxes and savings planning, you need professional input. I had to take an evening accounting class early on, because I had no idea what to do, even in QuickBooks “SimpleStart.” My tax accountant helps me understand when I’m likely to be writing off too much (conference travel, software, machines, books…) and raising red flags at the IRS.

Bill your clients with Net-15 terms (don’t let them talk you out of it; I did, and ended up with a bureaucratic client owing me $17K for a couple of months before they processed the paperwork). “Net-15” means payment is due 15 days after you submit an invoice. (Ideally, you attach a clause saying they will be penalized for paying after the due date, but I have no idea how you realistically enforce this.)
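The due-date arithmetic itself is trivial; a minimal sketch (the invoice date is arbitrary):

```python
from datetime import date, timedelta

def net_due(invoice_date, net_days=15):
    """Due date under Net-N terms: N days after the invoice is submitted."""
    return invoice_date + timedelta(days=net_days)

print(net_due(date(2015, 3, 1)))  # 2015-03-16
```

The hard part is never computing the date; it's getting the client's accounts-payable process to respect it.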

From a billing perspective, I’ve had better luck being paid on time by mid-sized or small companies; giant corporations with paperwork complexity can delay your pay cycle by months, and this is serious to a cash-flow-driven business like consulting. Always bill sooner rather than later. Some people tell me they require part payment in advance, to mitigate these possible delays. I don't (yet) have the balls for this myself.

"Taking a part-time long-term gig will help tremendously with cash flow." (Me)

Taking a part-time long-term gig will help tremendously with cash flow and reduce your anxiety over finding work in dry periods. To do this, try staying with a client on a retainer or hourly basis for a couple of days a week after doing significant project work for them, or start with one client for a long-term period and scale back later. I have a part-time retainer job with a local university, and although the pay is lower than most client work, the flexibility in hours, as well as the chance to learn new things on the job, makes it a reliable gem for me. Teaching workshops is another way to pick up income, especially if you re-use materials. (If you don’t re-use, it’s a lousy way to pick up income. Trust me.)

Other, more mythical options for cash flow are passive income sources: writing a best-selling app, writing a best-selling book, or maybe winning the lottery. If you write an app, remember you have to do support and field customer requests (cf. Moritz’s interview on FILWD). I’m not really serious about the book: I have no good evidence that one can count on royalties or gigs coming from one, and the work required is enormous.

A grim reminder from Jérôme Cukier: don’t count on the “sure things” in your inbox. Any one of them can fall through and leave you with nothing incoming.

"Don't count on 'sure things' in your inbox." (Jérôme Cukier)

Keep an eye on your bank balance at all times: it’s possible to stay busy working on talks, contests, your web site, favors for friends, and learning new tools -- and to forget entirely about the need for billable hours. Especially when you’re self-motivated, not just employer-motivated, and have zillions of “personal projects” you want to be doing as well!

A Few Pros and Cons of Consulting

Several friends of mine have quit consulting and taken full-time jobs. The uncertainty of the income and the need to do business development, book-keeping, and administration wear some people down. The loneliness can be bad, too; or just the lack of a team, and the occasional inability to see a project all the way through if you're a cog in a bigger effort.

Once I delivered some super code to a start-up, they paid me, and I never heard from them again. That’s normal, for a consultant. But later I went looking for an example in that code I’d sent them, and realized I had forgotten to attach the file! Now that’s depressing. No one works for just the money. If you’re in a company, you get to make sure your stuff is passed on to the right people and can advocate for it seeing the light of day when no one else cares as much.

But the pros of consulting for me outweigh the downsides pretty dramatically:

  • Not being in an office/commute 5 days a week (I get a lot more done when I write code from home; UX work, however, is much harder without regular face-time).
  • Paid hourly most of the time, so long weekends aren’t “free” work like they were at my full-time salaried jobs
  • I’m in charge of my own conference/training decisions: where I send myself and what to learn are HUGE issues for my career, and never were for employers
  • Vacation time is mine to determine (after living in Europe, I can’t do without serious time off)

If you’re a consultant, you’re in charge of your career. No one else is!

Wrapping up: I hope this whole post didn't sound too negative. I admit I have never worked so hard for so little money as I did in 2013, due to many of the red flags and unwise gigs. But I also had loads of fun, met a lot of great folks in the same field, and had some amazing clients and data projects. I'd probably do most of it again.

Resources: Places to Find Data Vis Jobs

People Resources

With email or bar chat input from (in no order):

Also, some recommended posts/threads: