Sunday, May 11, 2014

Data Characters in Search of An Author

My last post on Implied Stories was about how we fill in the blanks to create story contexts in even very short works, like Hemingway's example of "the shortest story ever told": "For sale: Baby shoes, never worn." In that post, I used Pixar's 22 Rules of Storytelling and Emma Coats' talk about them at Tapestry Conference, plus some sociology, to frame my points about how audiences find implied stories.

I closed that post with some concerns about how this applies to data visualization, as we "read" the stories implied in visuals and look for causation, for example. Our brains are telling stories even when they might not be there; as a designer or journalist you might want to head them off at the pass, or face a stampede of weird conclusions. You get those with correlation plots, which everyone reads as causation (after all, you must be implying something, right?). Some great new examples of spurious correlations came up this week in a popular linkmeme, the Spurious Correlation site. I dare you not to try to create a story in your head to try to rationalize this one:

Per capita consumption of cheese and number of people who died becoming tangled in their bedsheets. From tylervigen.

It's at least a well-known "myth" that you shouldn't eat cheese late at night.

Over at Eagereyes, Robert Kosara argued with me and Hemingway that the baby shoes ad isn't a story, because it lacks the formal elements of narrative structure. Part of my point was how much we bring to the interpretation, independent of what is written or shown explicitly. Our brains look for stories and remember stories, as was noted in the recent excellent Data Stories podcast on the topic. But I do think a lot (or most) of data visualization, including the most successful work, lacks story element completeness, and the metaphor is weak as a result.

This is part 2 of my post on implied stories, suggesting that good data visualization is often about characters. What we as readers or as designers do with those characters to fill in the story around them isn't the focus here. But readers are hooked by characters. 8 out of the 22 Pixar Rules focus on character. While the story metaphor for visualization might be weak in places, I think it works when we look for characters in successful visualizations, especially with respect to data outliers.

Heroes & Villains

Sometimes the data provide the heroes and villains of the story, and the rest of the work is finding out the setting and events that got them there. Often that reporting is at least partly in text, not in another data visual.

This is just one recent visual from the many news stories currently analyzing the depth of America’s health care problem, from the Atlantic. The outlier, who in this case is the hero (for readers of the Atlantic), is here cast as the villain, making a pretty compelling point:

Image from Atlantic article.

Underdog Heroes (in Search of an Author)

Another case, one of my favorites, is hidden in the movie data released for the Information is Beautiful movie data contest a few years ago. I didn’t enter, but looked at the data out of curiosity. It turns out there is an extreme outlier in profitability, Paranormal Activity, which cost almost nothing to make in 2009 and racked up 1289040% of its cost in profit (or 1311200%, depending on how you calculate). The next most profitable film is Insidious in 2011, at 6467%. Paranormal Activity blows away the scale unless you switch to a logarithmic axis. Without one, it looks like this:

Graphing an extreme outlier without a log scale.
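
To put a rough number on how badly the linear scale squashes everything else, here is a minimal sketch using only the two profitability figures quoted above. The helper names and the choice to start the log axis at 1% are my own assumptions for illustration, not something taken from the contest data or any entry.

```python
import math

# Profitability figures (percent of cost) quoted in the post.
paranormal_activity = 1_289_040.0
insidious = 6_467.0


def linear_position(value, axis_max):
    """Fraction of the axis length where `value` lands on a linear scale from 0."""
    return value / axis_max


def log_position(value, axis_min, axis_max):
    """Fraction of the axis length on a log10 scale from axis_min to axis_max."""
    span = math.log10(axis_max) - math.log10(axis_min)
    return (math.log10(value) - math.log10(axis_min)) / span


# Assumption: the axis tops out at the outlier, and the log axis starts at 1%.
linear = linear_position(insidious, paranormal_activity)
logged = log_position(insidious, 1.0, paranormal_activity)

print(f"linear scale: Insidious sits {linear:.1%} of the way along the axis")
print(f"log scale:    Insidious sits {logged:.1%} of the way along the axis")
```

On the linear axis the runner-up lands well under 1% of the way along, which is why every other film collapses into a smear at the bottom. On the log axis it moves to somewhere past the halfway mark, so both the hero and the rest of the cast stay visible.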

Since the data was released as part of a contest, I checked the entrants to see how they had handled this. A lot of them just filtered it out and didn’t deal with it at all. There’s a giant story in that data point, if you ask me what’s interesting in that data set, and it wasn’t told in most entries. In a few interactives, it was, but far fewer than you’d expect. Surely the point of interaction is the ability to dynamically change scales, add explanations, zoom and filter? Instead, the “outlier” that most folks found convenient to report is Avatar, which fits on a non-log scale. Here’s the default view of James Fisher’s entry, showing raw profit, not % profitability. That particular outlier is Avatar:

James Fisher's Hollywood vis.

McCandless (or his blogger) says that this entry “Encourages the user to draw their own conclusions with highly customizable elements and hundreds of data combinations.” A lot of the interactive visualizations do this, showing off their app building skills, creating exploratory tools rather than finding and highlighting that interesting data point or points. To Fisher’s credit, he does revert to a log scale view and doesn’t hide that amazing outlier, if you can find the controls to display by profit and then see the tiny dot that is my hero:

James Fisher's Hollywood vis in action.

As an aside, I found McCandless's brief elevator pitch text for each shortlist entry really interesting; what's the take-away pitch for your vis? "It's a tool to allow exploration" vs. "It shows that X" or "When you compare X and Y you see that..."? More of McCandless's intro text: "Sometimes bubble charts are all about color, aren’t they? Choose the right ones and let your brain and eyes do the rest." (Hmmm.)

But how about "Did you know that the average audience rating for love stories is highest around spring, summer holidays and christmas?" Now that's kind of interesting and definitely makes me want to explore. The entry in question, Confluence by Blimp Design, used a clever and funny labeling trick to handle the Paranormal Activity scale problem — although it might not be clear to folks without a footnote (look to the right end point of the second scale slider):

Gordon Chan’s entry ("sometimes bubble charts are all about color") does some nice things with the spacing between y-axis gridlines to compress and expand where room is needed, but he gives up on Paranormal Activity as a lost cause. You can see Paranormal Activity 2, though, the top red blob here!

If I were a journalist looking for a story about money in this data, I’d probably be more interested in the underdog heroes of the Paranormal franchise than the James Cameron Avatar success story. The Paranormal data point sent me to Wikipedia where I learned that indeed, it is the most profitable film ever made based on return on investment. Just because it’s inconveniently extreme doesn’t mean it’s not an important data point to showcase in a visualization, especially an interactive. (Admittedly not everyone entering the contest was focused on profit, however, as their angle of choice.)

Our Hero Joins a Gang

Related to heroes and villains is the posse story, in which the character “fell in with a bad crowd” or “joined a great band.” The technique here is to show a group your readers associate with something negative (or positive), and then show how your hero or villain can be seen as a member of that group. Here’s Russia along with other scary outliers in mortality rates:

Combining characters into groups that behave similarly is a nice technique for reducing the visual noise of these types of plots. Regions like “EU” or “Northeast” make for good obvious groupings. Sometimes non-obvious groupings are the story, and we’re back to the “Band/Gang” theme, where we learn about a character by the company it keeps; look for the U.S. in the upper left quadrant here:

There may not be an obvious bad or good gang, but groups help you tell a cleaner story anyway.

Rule 5 in the Pixar rules starts “Simplify. Focus. Combine characters.” We're all familiar with the technique of re-coding our data in groups, like turning 12 months into 4 seasons, to highlight patterns that may be seasonal. Usually the data determines the reasonable groupings for you, like this nice illustration from NOAA of what months constitute "dry" vs. "rainy" seasons in Florida:
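
For anyone who hasn't done the months-into-seasons recoding by hand, a minimal sketch might look like the following. The season boundaries are an assumption (Northern Hemisphere meteorological seasons), the function name is made up for illustration, and the input values are placeholders chosen only to show the mechanics, not real measurements.

```python
from collections import defaultdict

# Assumption: Northern Hemisphere meteorological seasons.
# Data-driven groupings (like dry vs. rainy months) would replace this mapping.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def mean_by_season(monthly_values):
    """Average a {month_number: value} mapping into one value per season."""
    buckets = defaultdict(list)
    for month, value in monthly_values.items():
        buckets[SEASON_BY_MONTH[month]].append(value)
    return {season: sum(vals) / len(vals) for season, vals in buckets.items()}


# Placeholder values (each month's value is just its number), not real data.
example = {month: float(month) for month in range(1, 13)}
seasonal = mean_by_season(example)
print(seasonal)
```

Twelve characters become four, which is the "combine characters" move in miniature. The catch is the one noted above: a fixed calendar mapping is only a starting point, and the data may argue for different groups.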

6 (or 60) Characters In Search of an Author

I’m also a fan of small multiples, but sometimes we're presented with an exploratory visualization with no analysis applied. Here’s an example where I have no idea what to take away from it. If you squint you’ll see a small green blip that might’ve been a story, but I don’t know.

Here’s a nice small multiples example, though, from a good blog post:

Image and discussion about storytelling charts found on XLCubed Blog post.

Some reasons it’s good: there's no hunting for the good/bad cases from the reporter’s perspective (the color helps); trendlines annotate the important cells; high points of interest are labelled; the mean of the data is in the background of the small cells, so you can see each cell relative to it; and text helps guide the interpretation too. This is the well-known "annotation layer" at work for us. (Yes, there are cosmetic things I don’t love, but it’s overall very nice.)

Wrapping Up (So I Can Go Eat)

Without a lot of supporting explanation (or performance, in the case of Hans Rosling), I don't believe a visualization can tell you the "whole story," in terms of who, what, when, where, why, and how; a visualization usually implies only a few of those, and our brains leap to conclusions the author of the visualization may or may not expect. Spurious correlations that suggest causation are a great example. If the author shows the data, it must be important and mean something, right? But sometimes visualization creators make poor choices, as in any design activity.

One of Emma Coats’ most resonant points at Tapestry was that audiences for movies want to be enthralled, delighted, connected — they don’t necessarily want to learn, but they are curious. Presenting a visualization hook of a hero or villain outlier and the company they keep might make a reader curious — curious enough to explore the visualization or topic further. But I still think some authorship is needed; otherwise, you risk your reader leaving your piece in frustration after hitting a few buttons and, if you're lucky, admiring the work you did re-inventing Excel online. If your authorship amounts to having screened out a really interesting outlier because it was inconvenient for your code, well, I'm not giving your story 5 stars.

Notes: Heroes and Villains is a book by the late Angela Carter, and Six Characters in Search of an Author is a play by Pirandello, some of which involves the characters arguing about what their drama is about. You really should listen to the Data Stories podcast on the storytelling in vis debate.
