Sunday, November 19, 2006

Data and Infovis and "Art"

I've been thinking about data a lot, since the Infovis 2006 symposium. At this conference was a strange mix of scientists, mathematicians, and a few artists, or those with an artistic bent.

My friends Martin Wattenberg and Fernanda Viegas from IBM Reseach Cambridge secured funding, invited submissions, reviewed, set up the equipment for, and then sat guard over (missing talks in which they were cited) an art show of infovis applications. They were specifically featuring artistic displays of real data (I'm paraphrasing what I think they said were their selection criteria. One was Golan Levin's The Dumpster, which I blogged about a while ago.)

To introduce this art show, they gave an excellent talk that I'd summarize as "What's Going On Out There in the Real World That You Might Not Know About." A bunch of us saw a lot of people in the audience noting down the existence of Ben Fry's Processing Toolkit that makes programming datavis apps accessible to artists and ordinary people who aren't postdocs in mathematics. Sadly, it reminded me of 5 or even 10 years ago when the CHI and CSCW research communities realized web startups had already made community apps that worked and they weren't made by researchers in labs. Where's the actual innovation happening? More often than not, it's students or other clever people with time on their hands and a willingness to play around.

But back to data: When I was doing my dissertation, data was a sticky subject. Collecting data on "human subjects" was overseen by strict board reviews and ethical examination, and I had to go through this as an early internet researcher with a Human Subjects Board who didn't know what to do with this kind of data.

The community I "studied" reacted strongly to some of the data that I collected, post-processed, analysed, and reported, regardless of the reviews I went through. My data said some things that they didn't want made visible, or suggested things they didn't like simply reducible to graphs and charts. (The book is available here, the last chapter discusses this problem in some detail.) Anyone who looks at or exposes recorded human behavior is going to hit this: for example, people who don't think they talk much and discover they talk all the time often don't like knowing this, however measurable it is and however potential this exposure might be for them. Which brings up the questi0n of why and when should you turn something into data? And analyse it?

So, thinking now about how the research and infovis worlds have evolved since then, and the new inevitability of data mining on behavior from the traces we leave behind us, I see these data source dimensions:

  1. Data sets that exist and are known to exist-- census data, weather data, stock market data.
  2. Data that "happens" but isn't necessarily assumed captured or turned into a set that's easily analysable: email, chat, mobile phone records, my retrievals from ATMs, where I walk and what I eat.
  3. Data that we set out to measure, because we're looking for something: experimental data, NSA tapping us, etc.
  4. Data we have (from any of the above means) and we converted to another form of data: e.g., turning activity logs into summaries of time on tasks, turning gene sequences into musical notes, turning video of your cats into a single overlayed image, turning text into images, etc.

The really creative apps for infovis often seem to lie in item 4), because transformation of data into other modalities is a trick of visualisation that might give us insights we didn't have before. Some of them are just elegant visualisations of data we wouldn't have thought of visualising (like Ben Fry's zipcode applet that Martin called an infovis "haiku"). The "insight" part is still tricky to handle; human perception differs, and reasoning skills differ, and that makes drawing conclusions from visualisations tricky too. (Untutored people generally make more of statistical tests than they should, too.)

Martin and Fernanda stayed safely away from defining "art" but I still thought about the artistic component of data mining. The value of data mining and the ability to form and then test hypotheses from different views of data is a skill, perhaps even an art in itself. An event occurs: I capture it, I capture multiple instances of it, and I look for patterns in different views of it, and then I learn from it or measure it some more or in another way to progress towards some truth.

Or, for the more artistic data visualiser: she captures it and events like it, she presents it in a novel and beautiful way, hopefully with some elegant interactivity, and other people learn something. The might learn something ineffable or impossible to reduce to words. But that doesn't make it less important. Scientific creativity still springs from the indescribable ideas you have about the world before proof and publishing.

No comments :