Sunday, May 13, 2012

My Most Influential One Pixel Line


I thought I'd contribute one story to the "telling stories with data" genre, even if it's a silly one. It's silly 'cuz it features such a silly graph, which I shoved into an appendix of a presentation for a client a few years ago. Here's an anonymized version:


I put that animation with the arrow in there on purpose, because when I presented it, I had to point out the skinny line on the top. More graphs than you'd expect come with a "performance" part and in some contexts, I think this is just fine. Afterwards, one exec at the company referred to it often as "that chart with the one pixel line." (Okay, technically it had about 2 or 3 pixels. Not as punchy if you refer to it as "that chart with the 3 pixel line" or "that chart with the thin red line.")

I'm sure there are other, better, ways to present this red-and-orange tower. The point is: It was remembered. It had an impact. This graph led to more graphs being created! Roughly, we saw these steps:

  1. Acknowledged and admitted: The one pixel red line was considered to be a problem (or rather, the un-analyzed orange bar was).
  2. More descriptive graphs were made: This is key -- an influential graph/chart always leads to more data investigation, with more graphs. Describe the size of the problem, delve further. The giant orange segment was tackled: How could it be made manageable? What patterns existed inside it?
  3. Sensemaking/interpretation: What could we do, what couldn't we do? What should we prioritize or safely ignore? What tools were needed? Who owned what parts of this orange bar?
  4. Data tools sprouted: A series of ad hoc and then longer-term tools were built: Excel reports with Perl/Python/VBA, then a Flex tool for intermediate data dives, then a dashboard in Flex for tracking larger-picture trends.

Do It Well, and Do It "In-House"

It's an old analytics saw that you can't improve what you don't measure. Well, I think you won't improve what you don't measure meaningfully and then pay attention to. The client had collected the data, but then did nothing with it, because no one had made understanding it a priority. Data for data's sake is pointless and will be ignored. At the time of my one-pixel bar, an analytics cheerleader in the company described our primary data system as "buggy, opaque, brittle, esoteric, confusing." I'd add, "understaffed," and as a result of all that, usually ignored, which is how the one pixel red line came to be.

We took a brief detour in which we considered "outsourcing" the data problem to another company to do the top-level reporting for us, but our (mostly my) investigations suggested we couldn't do the fine-grained, raw-to-dashboard (ETL) reporting and analysis we needed without owning the entire pipeline ourselves. Because in all these organizational, data-driven settings, the reasoning goes like this:

  • What's going on? Now, and as a result of previous behaviors/changes. Do we have the right data? Trends, alerts, important KPIs.
  • Why is that going on? Drill in. Question if we have the right data and instruments to diagnose. A deep dive occurs, often all the way back to RAW data. This is normal! And this is necessary.
  • How might we change the bad things? This is a complicated question, never simple and often not just quantitative. This is where the profound thinking happens, when the cross-disciplinary methods and teams pull together to interpret and chop data. Sense-making and interpretation require lots of checks on data, reasoning, and context.

Cross-Disciplinary Success

Our ultimate data team was a cross-company, somewhat ad hoc group of people who cared about the same thing, but didn't report together anywhere: Customer Support, UI development management, directors of development and the API team, a couple of database gurus. Oh yeah, let's not forget the database gurus: I couldn't have even made that bar chart without badgering the database guys for info on their tables, so I could do some SQL on it.

In a year, we had achieved measurable, significant improvements via that cross-disciplinary team, without outsourcing our important data in any way. The short-term tools paid off almost immediately, and I hope the long-term ones are still evolving. One of the team members won an award for the tool he developed for exploring important raw data (and I did contribute to the design). None of this was done under official reporting structures. But the organization was flexible enough to support the networking, collaboration, and skills needed.

I Did Other Stuff, Too...

Since that graph is so silly, here's a little montage of other exploratory data and design work I did while I was with that client. Lots of tools were involved, from R to Tableau to Flex to Python to Excel to Illustrator. Vive la toolset!

Sunday, March 18, 2012

Digging Into NetworkX and D3


For Boston's Predictive Analytics Meetup in February, I gave a short talk on using the Python library NetworkX to analyze social network link data, illustrated with some simple D3.js visuals of the results. I've since spruced up the slides to stand on their own a bit better, extended a few of the examples, and moved it all online.

Here's a link to the zip file of the ppt, heavily commented code samples, and the network edgelist I used (from Moritz Stefaner's and my previous look at Twitter Infovis folks in mid-2011). Or you can browse the slides below (the links should work fine).


A few comments, if you made it through the deck... The network stats are doubtless out of date, since I know there has been some movement in who-follows-whom among the Infovis crowd on Twitter. The overall workflow proposal is this:

  1. Read your edgelist into NetworkX (or a json file if you already have one)
  2. Convert to a NetworkX graph object
  3. Calculate stats and save them as attributes on the graph nodes (my code shows you how/does it for you)
  4. Optionally here: Filter the network by some important attribute.
  5. Write out JSON of the network to use elsewhere (e.g., D3)
  6. Visualize (in D3) and explore what you got
  7. Optionally here: Filter the network further in the interactive visualization.
  8. Go back to (1) and add more stats. Or filter some more.
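
Steps 1 through 5 above can be sketched in a few lines of Python. This is a minimal sketch, not the code from the zip file: the tiny edge list and the attribute names are made up for illustration, and degree and betweenness simply stand in for whichever stats you care about.

```python
import json

import networkx as nx
from networkx.readwrite import json_graph

# Hypothetical follower edges (who -> whom). A real run would load a file,
# e.g. nx.read_edgelist("your_edgelist.txt", create_using=nx.DiGraph).
edges = [("ann", "bob"), ("bob", "ann"), ("cat", "ann"),
         ("dan", "ann"), ("dan", "bob"), ("eve", "dan")]

# Steps 1-2: build a directed NetworkX graph from the edge list.
G = nx.DiGraph()
G.add_edges_from(edges)

# Step 3: calculate stats and store them as attributes on the nodes.
degree = dict(G.degree())                     # in-degree + out-degree
betweenness = nx.betweenness_centrality(G)
nx.set_node_attributes(G, values=degree, name="degree")
nx.set_node_attributes(G, values=betweenness, name="betweenness")

# Step 4 (optional): filter the network by some attribute.
keep = [node for node, d in degree.items() if d >= 2]
H = G.subgraph(keep).copy()

# Step 5: serialize to node-link JSON; save this text to a file for D3.
data = json_graph.node_link_data(H)
json_text = json.dumps(data, indent=2)
```

The node attributes travel along into the JSON, so the visualization side can size or color nodes by them without recalculating anything.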

In my previous post using Gephi to analyse the infovis network, I labelled one subcommunity "The Processing" crowd, another one "The Researchers" and another one "The Authorities." In my current analysis, where I find 6 subcommunities (or "partitions"), you can see them as roughly the green partition (Processing folks and infovis artists), the orange partition (the research/analytics group), and the blue partition (with high-degree authorities like infosthetics and flowingdata).

The different demos make different things clear about this data, as you might expect!

  • The adjacency matrix of the top 88 chosen by eigenvector centrality reveals that the orange partition, or the Researchers, have more members with high eigenvector centrality than the other subgroups. This is quite clear when you sort by Partition. Other partitions are barely represented here. (NB: It was only 88 because it was all that fit easily; for the other demos, I use a subset of 100 out of the full 1644 nodes.)


  • The chord diagram, which allows you to toggle between the 100 person top eigenvector scoring subset vs. all 1644, shows a striking difference between the subset of 100 and the full set. It's even more obvious here how the orange partition (the "Researchers") overwhelms the top eigenvector subset, and how little of the large green group is represented in this subset. We can certainly speculate why this is...


  • The force network of nodes allows you to see some individual following patterns, at least among the "most important" top nodes. In this example, I filter the graph inside the javascript code, instead of in the NetworkX code. The graph shows a union of the top nodes selected for eigenvector centrality, betweenness, and degree. Some of them near the edges don't follow anyone in the "top" network sample, but they made this cut by being high in one or more measures of degree, eigenvector centrality, or betweenness. You can resize by each variable, or click on one to see the individual's values.
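
The "union of the top nodes" filter in that last demo is easy to express outside the browser, too. Here is a minimal Python sketch with invented scores and a hypothetical `top_n` helper (the demo itself does this step in JavaScript):

```python
def top_n(scores, n):
    """Return the set of the n highest-scoring node ids."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n])

# Made-up scores for five nodes; real values would come from the network stats.
degree = {"a": 50, "b": 40, "c": 10, "d": 5, "e": 3}
eigenvector = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.1, "e": 0.05}
betweenness = {"a": 0.3, "b": 0.1, "c": 0.05, "d": 0.4, "e": 0.0}

# A node makes the cut if it is in the top 2 on ANY of the three measures.
selected = top_n(degree, 2) | top_n(eigenvector, 2) | top_n(betweenness, 2)
print(sorted(selected))
```

A node like "d" gets in on betweenness alone, which is the same reason some nodes in the force layout sit near the edges with few links inside the sample.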

Once you get started making these visuals, you want to tinker forever... I hope the code samples and comments help you get started, if you want to try to do something in this line! Once again, talk slides plus source are in this zip file. Be sure to note my warnings and gotchas if you tinker yourself.

For a recent and different analysis of talk among the Twitter Infovis crowd, visit @JeffClark's posts here and here and here. (He's an orange, top N member in my graphs.) He identifies "red" and "blue" groups based on their interactions and words used. His two primary groups seem to correspond to the processing/artists (green) and researchers/analytics (orange) distinctions I found in this older data.

Sunday, November 20, 2011

A Kindle Fire Review (from a Media Fan)



I'm a Kindle fan, and an Amazon fan. I really like their media content: I buy Amazon music, Amazon Kindle books, TV shows, Android apps. So when my Kindle Fire came, it was pretty much pre-loaded, and that was really nice. All my stuff is sitting there with a little "download to device" arrow, which rocks.

I got this thing because of upcoming travel over the holidays (I don't own an iPad, I think they're too big). I was never intending to take the Fire instead of my reading Kindle, and after 5 days, I still wouldn't. Partly that's battery-life-related; I adore my reading Kindle for the everlasting, never-needing-to-charge-it, one-handed reading wonder that it is. The Kindle Fire battery supposedly lasts about 8 hours, and that may not be true with video watching and wifi on (I haven't tested that part yet).

So, this is not a Kindle-killer, any more than it's an iPad killer, 'nuff said there.

More specifically, I got the Fire for video watching, web browsing/email/twitter, PDF reading, and light app use (Solitaire, Angry Birds, etc), in about that order of priority. So let's hit those, with some UI observations along the way, because that's where the chance for the Fire's improvements really lies. Then I'll finish up with a few comments on major navigational issues, e.g., scrolling, selecting, typing, which permeate the product.

Video Watching and Disk Space


The Fire seems to want you to mostly stream, which doesn't surprise me. The 8GB drive, and the free Amazon Prime (streaming only) support this. Netflix and Hulu Plus work on it (install their free apps from the store). If you have ever paid for a TV show ep (I sure have!) from Amazon, THOSE can be downloaded to your device.  (Browse to a show you have bought episodes for, and they tell you they're still yours, and you can download to your device now!)

Why does this matter? If you're wanting to use it on an airplane, or in iffy hotels off the grid, which I do, you need to download to your device.  And if you want to load video you already have, I did the research: It only recognizes MP4, so you need to convert stuff. (I'm using AVS Video Converter; my version does only one file at a time, which is proving to be a giant slow babysitting process.)

You can load videos (or PDFs, or mobi files, or anything else) when you attach your device by USB cable. Drag them into the Videos folder.

But don't expect them to show up in the Videos section of the UI, reachable by the top tabs! They will be found in the rather hidden pre-installed "Gallery" app, which is where your photos and videos live. And then you may be surprised by how poor the UI is for the videos (I am praying they fix this, it's un-manageable!) They appear as a tiny thumbnail with no text; you must select, and then choose "Properties," in order to figure out which one is which. This will get old fast, not just because selection is so funky on this device (more on that later). Here's the videos display with 2 videos:

Video in Gallery

A short season of one show could run just over 3GB. The actual disk space available to you is not 8GB, because of the OS etc; it's really 6GBish. To find out what you're using, you need to hunt a bit. There is no disk meter in the top accessory bar where Wifi, battery, and other settings live. Tap that bar, and you'll see options like volume. (Yes, it says "Lynn's 5th Kindle," I don't want to talk about it.)




You need to hit "More" and then click into "Device" to see the disk usage. That's really annoying for a device with such a small drive. I wouldn't be hoarding content on it, but for non-wifi situations, having downloaded content seems pretty important to me. I'm really befuddled by this one.

Disk usage

This said, my MP4 videos do look and sound nice. I'll be spending the evening getting ready for that trip.

Web Browsing, Email, Twitter, Etc.


I am guardedly pleased with this so far. I had some issues getting the built-in email app to recognize a Verizon Yahoo address, but the Yahoo mail app worked fine. Tweetcaster works nicely, and I even get a tiny cute beep in the notifications bar when someone @ mentions me, which is nice (same cute beep for email I receive).

The web browser does support tabs, which is great; but the favorites/bookmarks have one major issue: There seems to be no way to delete one. Huh? So it came built in with ESPN.com and a few others I never use, and I can't remove them. If this was a UI design mistake, it's shocking; if it was policy in exchange for some payment by partners, that's unlike Amazon in so many ways, since they are usually all about the customer.

Please fix this, Amazon.

Web pages also allow you to remember passwords, which - thank goodness. Typing is such a damn pain (see below).

Since web pages look good, and play video (including flash), this is a real plus on the device. Selection of links is funky, and I sometimes don't know if the selection problems I am having are due to the OS, touchscreen, or some web loading/processing issue.

PDF Files


PDFs on the e-ink "reading" Kindles are terrible - when they took away the text reflowing option for PDF docs, it became impossible to really read them, requiring too much zooming, scrolling, etc, and any images take forever to load and are, of course, B&W.

Most of this is awesome on the Kindle Fire! Definitely a reasonable PDF reader. The documents look great, and my only issue is the weird scroll-down, then to the right, for navigating a large document. It would be nice to have an option for "just scroll down" to get through a PDF document, instead of trying to use the book/paper metaphor of flipping pages. Here is how pretty PDFs look (yes, this is fanfiction, deal with it; of course I tested academic articles too).

Gorgeous Color PDF

Here's a page in portrait, with arrows suggesting how I need to scroll (down to get to bottom of the page, then flick to the right to "flip" to next page).


Thumbs up on PDF reading. PDFs appear in your Documents folder as you would expect, and you don't need to send them to your device for conversion, they just "work." There isn't a Kindle Fire Instapaper app, but since you can save the site in your bookmarks and read text only, or download as Mobi files, you are all set there.

App Use: Angry Birds


Angry Birds is great. So is Solitaire. I haven't tried to install any apps that aren't in the Amazon app store, although you can (instructions abound on this). Note: I installed these on my Android phone and can't use them on it, screen is too small to really do it right. This form-factor is just fine for games that need a wider field of vision, or for people who are getting older and blinder.

I also installed a drawing app, but I don't much want to draw with my finger, so.

I like, and have always liked, the Amazon Android Apps store experience. In some ways, it's better than Google's app store. I'd expect that from Amazon UI, but it's nice to see on Amazon's first dedicated Android device.

Typing, Scrolling, Selecting, Turning the Page


The use of the touch screen is my biggest peeve. It's just buggy! If it's software, I expect a fix update -- Amazon is always good about updating Kindle software. If it's hardware, it's just a damn shame, and I'm kind of shocked it shipped this way.

Typing: The on-screen keyboard behaves very badly in portrait mode. My space bar and the letter "c" seem to be hyperactive for any key I pick on the right side of the keyboard. It's so bad, I will just switch to landscape for anything I ever need to type. The typing issues make the device less fun for email/twitter than I hoped. I am very sad about this.

Scrolling and Selecting: I have had so much trouble trying to scroll vs. selecting what's under my finger that I even looked it up in the help, and watched some demo videos online to see if they were doing it differently. It's not obvious. I have similar problems on my Android phone, which either means the OS itself is to blame, or making good touch screens is really really hard and Amazon's hardware providers failed. I spend a lot of time hitting the "back" button to undo a selection I didn't mean to make while I tried to scroll, especially in Tweetcaster or email.

If I were creating an Android app, I might consider making a dedicated scroll bar, just because it would offer some (admittedly old-school) way around this crappiness in the UI.

Incidentally, scrolling is very important in the apps that Amazon built for showing your bookshelf, your music, your videos... so this problem is quite profound.

Turn the Page: For books, without the e-ink hardware buttons, you need to flick or tap to turn the page. It's a slightly delicate maneuver, since it's easy to hit too hard and bring up the menu bars etc. Also, I am not so convinced this is a read-with-one-hand device. I'm not convinced by the reviews of the Kindle Touch either; if you're holding it in your left hand, tapping on the left side goes to the previous page. This is another surprising gaffe on the UI side, for me. I'm not left handed, but I read left-handed about half the time.

Summary


I am very pleased by the PDF experience, and mostly like the apps and Web experience. I didn't buy this to replace my reading Kindle, so no real comment on that side.

I am shocked by these things, and expect software updates to fix some, if not all:

  • Lack of a disk usage meter on the top info bar. Related to having very little storage on the device -- I admit, I wondered how hard it would be to crack it open and install a larger hard drive. We all did that with our first TiVos for years...
  • Touch screen badness - for typing, selecting, scrolling... If this is hardware, we're rather screwed, I bet.
  • Inability to delete web bookmarks (sheesh, seriously, Amazon?)
  • Better UI for seeing your installed videos on the device. Option to see what the darn video is, without having to select it and go into "properties" first. Which is hard because of the touch screen issues.
  • Possible option to just use a down-scroll on PDF docs, rather than flick-right to turn the page.

My quibbles aside, I do like it, especially for PDFs and apps. I'm looking forward to the Fire evolution and expect to see software updates (or at least good apps) addressing some of these problems very soon.


Sunday, October 30, 2011

A Personal Take on Infovis 2011

I haven't had time to go thru the papers I liked and didn't like yet, but I have been musing on some other aspects of Infovis that I thought I'd recap. To situate this, I usually go to Infovis every other year, and have been doing so since the mid-2000s, I guess.

Who Went, Who Didn't; Design vs. "Science"


Partly due to irritable blog exchanges in the past couple years, and partly due to perceived relevance of papers and audience, many of the artistic practitioners of infovis did not come. Or, if they did, I didn't know they were there. By this I mean academic artistic sorts like Golan Levin and Casey Reas and Dan Shiffman, and the practitioners like Stamen, Moritz Stefaner, Jan Willem Tulp, Jer Thorp, Wes Grubbs, Ben Fry and Fathom, David McCandless, etc. (Kim Rees from Periscopic did attend. I wish I'd gotten a chance to talk with her.)

Martin Wattenberg and Fernanda Viegas, who are successful straddlers of artistic, industrial, and academic infovis, didn't make it either. They weren't boycotting, it was due to work and personal reasons. (Google+ Ripples, a project of theirs, launched while we were sitting in paper sessions.) I mention them because a handful of years ago they tried to bridge the communities (with Golan) in starting an art track. I don't think the momentum has been entirely conserved. Certainly the papers didn't reflect great focus on emotional, artistic, or design processes. The one most focused on design as process was a very dry and obvious overview of how to do "user-centered design for beginners" that caused an industrial colleague of mine to observe "the bar for acceptance seems very low here." (It's not, but that one did make me raise my eyebrows.)

Again, this said, Amanda Cox's brilliant capstone talk, which was largely about design process and decisions at the NYT, was a huge success. As was Jessica Hullman's talk on visual engagement methods (or "chart junk, the sequel," as someone noted--Jerome Cukier, possibly).

I know some members of the program committee are trying to figure out how to get more industrial attendance. CHI has been through this for years, and added various case study tracks, panels dedicated to industrial talks, alt.chi for less mainstream academic works, among other strategies. Infovis could use some of this, but attracting people who have successful careers already, and convincing them there is value in attending given the pricetag, needs some more thinking through. I see value for them in the algorithm side of many of the papers -- but that might not be worth the cost of attendance for them.

Maybe the drinking would? I know some of us talked about the artistic non-attendees over drinks, since they weren't there to participate. More on this below...

One more contingent: there were a lot of folks from the intelligence communities, DoD, the government in general. My perception is that this has increased. And I think they asked smarter questions this year; they certainly weren't shy about going to the mic.


Paper Experience Sure Differs, Depending on Your Perspective


During a bunch of papers, the demo or video had some astoundingly beautiful angle or process moment that just wasn't the published "point" -- it was almost incidental. I'm thinking especially of the beautiful organic edge bundling videos from "Skeleton-Based Edge Bundling for Graph Visualization" by Ozan Ersoy et al. (see this page for some recap). My comment to Jen Lowe was that Jer Thorp and the Processing crowd would have loved this, and with the algorithm detail in the paper, would be able to implement and tweak quite easily. I can't find their videos anywhere, though! (Note: Even the first questioner afterwards said "I could watch your videos forever," but it was kind of in an undertone, not her point either. Let's have more talks where creating beautiful effects is a part of the point, perhaps?)

Ersoy et al. Skeletonization

Mike Bostock's D3.js talk was fascinating to those of us who had read his slides from SVG beforehand but hadn't heard his commentary on them, and who knew the DataMarket protovis-vs-d3 history online. It was also nerve-wracking, worrying about who would ask what afterwards, given some of that historical controversy. Apparently not so for other attendees, I heard later! I find Mike's arguments convincing, although I have not tried to build anything sizable in D3 yet.

Jo Wood et al.'s BallotMaps talk about name-order biases in voting districts was a wonderful "process" talk on using their HIVE system to visually test hypotheses. (For general info, see their org page.) I feel that the talk with demo of stages of visual exploration was important in making the story work, and the paper isn't as easy or fun to grok. Aidan Slingsby et al.'s talk on showing uncertainty in cluster results was similar (and surprisingly, the paper seems to differ quite a bit in the system design shown).

Ballotmaps

Program Committee: I'd like to see more videos in the proceedings!

Student Distractions: To Finish or Not?


As an ex-research type myself, I'm always interested in what grad students are going through now, what topics they and their advisors find valuable to study, and what my friends are facing as advisors. Stanford and Berkeley students seem to have a lot more distractions from start-ups given the "big data" and "data science" world we're in now. At the Stanford-sponsored party, I actually found myself recapping all the reasons to finish a Ph.D. to some poor guy who had no intention of quitting his. (Sorry, S, too many drink tickets.)

I don't necessarily use my own Ph.D. (except maybe socially at conference parties), but I have certainly concluded that spending years in a university surrounded by other smart people is not a bad thing. After all, the business world is usually not as smart, face it. And you will have many years to work a 9-7 job after school, so why rush out? The chance to sit in on other departments' classes, even when it's not a requirement, is a chance you don't usually get after graduation. Infovis, like HCI, is (or should be) interdisciplinary; being able to be in stats courses, graphic design courses, programming courses, psychology courses... well, if I were a student now, I'd want to take advantage of those wonderful distractions. (I did when I was finishing up, but did NOT take enough stats. Luckily this is fixable with online courses, to some degree.)

Overall, More Drinking Than Usual


I definitely had more fun drinking with people who knew a lot about drinks than I have in previous years. They knew about whisky, cocktails, wine, vodka infusions. Beer too. I was humbled by their depth of alcohol knowledge. Doesn't this convince you to come next year? Stanford threw a good party too, to try to improve the conference party scene.

Maybe you'll come next year.

Wednesday, September 07, 2011

Combing Through the Infovis Twitter Network Hairball

A month or two ago, Moritz Stefaner posted this image of "infovis" folks on twitter, with nodes sized by number of followers ("in-degree"):


I dropped him a note wondering if he'd tried any social network analysis methods to simplify it, or otherwise break it down -- so he sent me the data and said "have a go!" If I had crawled twitter links myself, I might not have used his criteria or seed set, but I was curious if I could make any more sense of his data set as is. (So I've neither re-crawled, nor added any info such as frequency or content of tweets to this data set).

I compared some of the measures calculated by the Python library NetworkX with measures calculated by Gephi. The two tools produce slightly different scores for some metrics, an interesting fact which I have not investigated deeply. I've made my spreadsheet of the calculated stats available for you on Google Docs. (Variables prefixed with "NX" were calculated with NetworkX, and those prefixed with "Gi" were calculated by Gephi.)

First, some overall stats on the network in Moritz's dataset:


  • 1644 twitter id's are represented, and there are 145,382 edges, or links, between id's.
  • Gephi reports the average path length is 2.5.
  • Gephi and NetworkX say this is a connected graph; Gephi reports 1 weakly and 5 strongly connected components.
  • The average degree is 89.9, but the median is 51. There is a long tail here, meaning that some nodes have very high degree (see below) but most do not.
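
If the mean-vs-median gap in that last bullet seems abstract, here is a tiny Python illustration. The degree counts are made up (they are not from Moritz's data); the point is only how a single hub drags the mean away from the median.

```python
from statistics import mean, median

# Made-up degree counts: most nodes are modest, one is a hub.
degrees = [10, 20, 30, 40, 50, 60, 70, 900]

avg = mean(degrees)     # pulled upward by the single high-degree node
mid = median(degrees)   # closer to what a "typical" node looks like

print(avg, mid)
```

When the mean sits well above the median like this, a long right tail is usually the reason.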

Derek Green's excellent tutorial for NetworkX suggests doing community detection using another python library for the Louvain method. At superficial review, it's similar to the Gephi modularity class detection algorithm, but I got slightly different results from the two methods. [Update: The method is non-deterministic and results will vary depending on starting values used]. NetworkX generally finds 5 communities, and Gephi alternates between finding 4 or 5. Here is one confusion matrix, showing the differences between node allocations assigned by NetworkX and Gephi; in this chart, squares are sized according to number of nodes assigned to each group:

So, interpreting this: In one run, Gephi split up the folks who are in NetworkX's community 0 into Gephi's communities A, B, C, D. Gephi's community C mostly overlaps NetworkX's community 1.
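
A confusion matrix like this one is just a count of nodes for each pair of labels. A minimal Python sketch, using hypothetical labels for six nodes rather than the real assignments from the spreadsheet:

```python
from collections import Counter

# Hypothetical community labels for the same six nodes from the two tools.
networkx_labels = {"n1": 0, "n2": 0, "n3": 0, "n4": 1, "n5": 1, "n6": 2}
gephi_labels = {"n1": "A", "n2": "B", "n3": "A", "n4": "C", "n5": "C", "n6": "B"}

# Count how many nodes land in each (NetworkX group, Gephi group) pair.
confusion = Counter(
    (networkx_labels[node], gephi_labels[node]) for node in networkx_labels
)

for (nx_group, gephi_group), count in sorted(confusion.items()):
    print(nx_group, gephi_group, count)
```

In this toy case NetworkX group 0 is split across Gephi's A and B, while group 1 maps cleanly onto C, which is the kind of pattern the chart above makes visible at a glance.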

For the rest of this post, I'll illustrate from NetworkX's community divisions, which I spent more time investigating. When I looked at the force-directed layouts and stats for the community members, I decided on these approximate group names, based on what I knew of the id's in each group:
  • Group 0: The Authorities
  • Group 1: The Researchers
  • Group 2: The Processing Crowd
  • Group 3: The Small NYT Group
  • Group 4: MSLima's Crowd
These are a bit arbitrary as names - based on who I myself recognized among the high degree members. (I myself live in group 1, way down the list-- feel free to check out where you live, in the spreadsheet!)

To make sensible (i.e., less hairy) plots, I filtered for the top 5% by degree calculation. "Degree" corresponds to the sum of in-degree and out-degree edges; in other words, how many other nodes a node (or twitter id, in this case) is followed by, plus how many it follows. High "in-degree" count usually implies someone is a perceived authority in the network. High "out-degree" might suggest a social media corporate type. Well, not necessarily - but it means they follow a lot of folks, and could themselves be a useful information source if they also have high centrality and share their information. (Like I said, I didn't look at who said what or how often they tweeted, which would be important measures of health in this network.)
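
For the curious, the degree bookkeeping and the "top 5%" cut can be sketched in plain Python. The edge list and the `top_fraction` helper are hypothetical, and this is not necessarily how my spreadsheet numbers were produced:

```python
from collections import Counter

# Hypothetical (follower, followed) pairs.
edges = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "a"), ("a", "c")]

out_degree = Counter(src for src, _ in edges)   # how many ids each node follows
in_degree = Counter(dst for _, dst in edges)    # how many ids follow each node
nodes = set(out_degree) | set(in_degree)
degree = {n: in_degree[n] + out_degree[n] for n in nodes}

def top_fraction(scores, fraction):
    """Keep the highest-scoring ids for the given fraction, at least one."""
    k = max(1, int(len(scores) * fraction))
    return sorted(scores, key=scores.get, reverse=True)[:k]

top = top_fraction(degree, 0.05)
print(top)
```

With 1644 ids, a 5% cut by this rule would keep 82 of them.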


Here's a plot of Gephi's authority calculation vs. degree, strongly correlated (you may see why I named Community 0 "The Authorities"):




Sorting by degree, the top players are these (pulled from the spreadsheet):



Label            NetworkX Community  Gephi Class  Degree  Closeness Centrality  Betweenness Centrality
flowingdata      0                   D            1394    0.446930423           0.043313537
datavis          0                   A            1376    0.482190168           0.072294856
infosthetics     0                   A            1362    0.435290991           0.034115345
infobeautiful    0                   D            1074    0.391210891           0.007498017
blprnt           2                   B            932     0.410115173           0.02337346
ben_fry          2                   B            882     0.365625              0.006936445
moritz_stefaner  0                   B            870     0.452361226           0.028942168
eagereyes        1                   C            861     0.455126424           0.031837862
mslima           4                   A            828     0.433448002           0.014404322
VizWorld         4                   A            828     0.524495677           0.08984938

Showing edges in hairball graphs makes things really complicated. For the following network graphs, I've limited the displayed nodes and edges to the top 5% by degree measure. Here's an animation of the difference between all edges visible vs. just community-internal edges (I know it's subtle, sorry; the id names are sized by relative degree):



Non-animated, larger versions: With All Edges, Only Intra-Community Edges

The largest names are purple, community 0, or "The Authorities" (a proxy for degree in this case).

Since I chose "degree" for relative sizing, it's worth seeing that in- and out-degree are not always correlated. Here you can see that some "true" authorities have much higher in-degree than out-degree. In particular, VizWorld has very high out-degree but rather smaller in-degree. And by NetworkX's community assignment, he does not end up grouped with the purple community 0. (Click for larger view.)



However, when we look at betweenness-centrality, VizWorld scores quite high. Betweenness-centrality roughly measures how often a node sits on the shortest paths between other nodes, so a high score suggests a bridge between otherwise separate parts of the larger graph.
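
To see why a "bridge" node scores high on betweenness, here is a toy NetworkX example. The graph is invented and undirected for simplicity (it is not the Twitter data): two tight triangles connected only through one node.

```python
import networkx as nx

# Toy undirected graph: two triangles joined only through "bridge".
G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),     # first cluster
    ("x", "y"), ("y", "z"), ("x", "z"),     # second cluster
    ("c", "bridge"), ("bridge", "x"),       # the only route between them
])

# Fraction of shortest paths between other node pairs that pass through each node.
scores = nx.betweenness_centrality(G)
busiest = max(scores, key=scores.get)
print(busiest, round(scores[busiest], 2))
```

The bridge node has only two links, yet every shortest path between the clusters runs through it, so low degree and high betweenness can easily coexist.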


If you'd like to inspect the internal linkage structure corresponding to each community subgroup, click on the small images below to view. I've filtered out all but the top 5% by degree, to highlight the authorities in each sub-group.  (Note that this was insufficient for community 3 -- so I expanded it a bit more.)  The curved edges indicate "mutual" follow relations, while the straight edges indicate uni-directional, or one-way follow relations.

community 0, The Authorities

community 1, the Researchers

community 2, the Processing Crowd

community 3, the small NYT group

community 4, MSLima's crowd

Notice that community 0, the purple one, has a surprising number of unidirectional links, as does community 3. The others seem to be dominated by curved lines, a high degree of mutuality. (Hopefully I can explore this later!)
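
One way to put a number on the curved-versus-straight impression is to split follow relations into mutual pairs and one-way links and compare the counts. This is only a sketch in plain Python, again assuming (follower, followed) pairs; the function name `split_mutual_and_one_way` and the data are hypothetical, and I have not run this on the real communities.

```python
def split_mutual_and_one_way(follows):
    """Separate follow edges into mutual pairs and one-way links."""
    edge_set = set(follows)
    mutual, one_way = set(), set()
    for src, dst in edge_set:
        if src == dst:
            continue  # ignore self-follows
        if (dst, src) in edge_set:
            # Store a mutual pair once, regardless of direction.
            mutual.add(frozenset((src, dst)))
        else:
            one_way.add((src, dst))
    return mutual, one_way

# Toy follow edges inside one community.
follows = [("a", "b"), ("b", "a"), ("c", "a"), ("b", "c"), ("c", "b")]

mutual, one_way = split_mutual_and_one_way(follows)
mutual_share = len(mutual) / (len(mutual) + len(one_way))
print(len(mutual), len(one_way), round(mutual_share, 2))
```

Running this per community, on only the edges inside that community, would give one mutuality share per group to compare against what the pictures suggest.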

Depending on what you know about the players in these graphs, you will probably see things I don't see. I myself have very little familiarity with the names in communities 3 and 4, while I admit to being surprised or entertained by the links and organization in the other 3 graphs. For example, in community 0, the placement of Visually, and its straight line uni-directional links, is especially interesting to me. (Remember this graph represents the top 5% by degree -- so Visually at this time scored high on degree, and was classified as a member of the "Authorities" group 0 by the community algorithms, but was not itself closely followed by the others in this elite group.) Green community 2 is also interesting; certainly the artistic folks are there, including the founders, authors, and teachers of Processing courses (ben_fry, REAS, shiffman, blprnt, toxi, mariuswatz, ...); but this group also includes Brainpicker and well-known design firms like Stamen and PitchInteractiv.

Wrapping up, these are the tools I used for the analysis, charts, and graphs: Excel, Tableau (for scatterplots), Python, R (correlation plots which weren't shown here), Gephi, Google Docs, Illustrator and Photoshop. It took more time than I expected, in part because of Gephi's alpha status, and having to adjust a lot of the plots by hand in Illustrator! Hopefully the need for hand-tweaking will disappear as Gephi becomes more mature.

Postscript: While I was working on this, MS Lima's new book, Visual Complexity, shipped from Amazon.  It's a beautiful collection of network visualizations.


Sunday, March 20, 2011

PyCon 2011 - Data, Men, and Me

In the past couple years, I've switched from sending myself to research conferences (like CHI) to more down-and-dirty developery conferences. I'm looking for skills development and tools I can use day-to-day. This spring I went to PyCon in Atlanta, since I've been using Python more and more for data analysis problems. (Complete talk videos are here on blip.tv.)

The initial draw was the tutorials. I aimed for cloud data and machine learning. Olivier Grisel's tutorial Applied Machine Learning in Python with scikit-learn was a definite high point of the conference for me. His talk on text analysis was very good as well (slides here, and video here). His French accent was very nice, but I kept mishearing "scikit-learn" as "psychic learn." :-) I also really enjoyed the talk on Genetic Algorithms by Eric Floehr, a fellow who seems to do weather prediction consulting. His slides and a bunch of other interesting supporting material (including code) are up on his site.

There were a lot of talks on data, big data, cloud data, and scaling Python (to handle big data and data problems). Other examples: A talk on Pypes by Eric Gaumer included a good reminder that big data problems existed in the search engine space long before other kinds of big data became "hot" to work on. Pypes is a quasi-visual toolkit for doing data processing inspired by Yahoo Pipes. (The gist being that since a lot of data handling involves discrete steps to clean and transform, you can put these steps into little modules that allow you to view the big picture of what's going on with your data munging.)

Hilary Mason's excellent keynote made a lot of us data geeks happy; she called for programming language evolution to get closer to the data problems, and to be less cryptic when it comes to support for multi-threading and map-reduce strategies needed these days. (I loved her "WTF?" comment on her multithreading code example.) Yelp's "mrjob" library for the cloud might answer some of her issues, but I missed that talk for some reason!

Another talk on big data that was well-tweeted was C. Titus Brown's "Handling Ridiculous Amounts of Data with Probabilistic Data Structures." Slides here - probably requires the video to fully interpret this, at least it does for me (yes, I missed this one too).

Not all talks were excellent, of course. My linguistics degrees got grouchy during one on the linguistics of twitter -- or maybe it was my geeky side asking "what can I do with this?" Some talks were nice surprises, too, kind of the point of going to conferences! Based on lunch table happenstance, I ended up going to a Blender API talk by Chris Allan Webber, a subject about which I knew zilch. Blender is apparently beefing up its API for external calls and automation; as a visualization person, I'm interested in tools that I can "drive" with data as input. I have big hopes for the evolution of processing.py and Nodebox2, two pythonic visualization options, but I am not sure they're there yet for me as a data vis person.

My sad female nerd note: I was one of 3 women in the Machine Learning tutorial. Out of perhaps 40? I later heard via Twitter a guess that there were only 8% women at the conference as a whole, based on t-shirt orders. I loved Hilary's talk, but was a bit bummed out by the Dropbox keynote that featured the social network of "friends of Arash" who started that company -- yeah, all men.

A final comment for any UX folks reading this: This would've been a great audience for a talk on UI design in open source, or UI design for Python UIs. There were a lot of companies presenting: Dropbox did their "we use Python" talk; Evite apparently has rewritten their entire Java backend in Python; Threadless, a sponsor, is all Python... One of the reasons for Python's growth at these companies is the ease of writing things fast; the "prototype and iterate" philosophy showed up over and over in various presentations as a real strength of Python. As a light coder myself, I can't agree more. I was there as a data-oriented geek, but I saw UX opportunity everywhere, for the right kinds of UX folks.

Sunday, August 08, 2010

Fan Video Editing Community and Copyright

In April, I gave a talk at UIUC's HCI department on fan video remix artists, or "vidders," as they are known within the fan media community. To build the talk, I drew on several years of LiveJournal network data, and a large 2-part survey I did in the spring of 2010 to document current attitudes, trends, and self-reported demographics of the community. Afterwards, I made my slides available in an annotated deck for the vidders themselves, as I had promised I would -- there were some interesting comments, including disagreement with certain aspects of the technical commentary (whether meta-data is really useful and available for management of clip collections) and whether the quote I pulled about "political correctness" as a dampener on some fans' "fun" was fair and balanced as a critique of the recent years' vidding discussions on issues of race and gender in vids. I haven't updated the annotations or the deck -- I'm posting it as I posted it to them; if someone is interested in hearing more about the community discussion, I'm happy to reply in comments or email.

I'm posting about this now because of two great things happening for this group of video editing fans -- this weekend is the annual meeting of Vividcon, a fan-run con all about vidding and vids as art and fun. I'm following the tweets with great jealousy -- I never made it off the entry waitlist this year.

The second great thing of recent days is the passage of the new DMCA exemptions from copyright-infringement laws for vidders (and other video artists) using copyright materials for artistic purposes. Since Internet sharing began, fans have regularly had their videos removed from many media sharing sites by copyright police. Some still post on password-protected private servers, rather than making them public and findable by "The Powers That Be."

Francesca Coppa posted on the blog of the Organization for Transformative Works that the case for the copyright-law exemption had been made in part based on the artists' need for high quality original source material for their remix works.

That said (and it's true), it's ironic to me that my own history goes back to the pre-Internet-sharing days, when we borrowed n-th generation tapes and made fuzzy vids with stone knives and bear skins. Check out my slide deck (pdf) for more on this. My talk includes some network analysis, one slide of which shows the "age" effect for when a vidder started vidding, and whose work they admire -- the VCR-era folks (including myself) are now off to the edges on the right and top. Fortuitously, right after my talk, Mimi Ito's article on anime fan editors came out in First Monday. I had already exchanged mail with her about her anime research, and it influenced my second round of survey questions to the vidders. Anime editors differ enormously from the vidder community; one major difference is that fan vidders are mostly women, while anime is more mixed, tending towards more male, and anime editors seem to be younger or to have started earlier, from what Mimi found. In my network graphs and quotes from the community, I show some points of overlap between anime and fan vidders, points and nodes which have increased in the past few years as the two groups learn about each other online and at cons like Vividcon.

Anyway, here are my slides: "Vidding Evolution: Community Change Among Fan Video Editors" (2010).