Sunday, March 18, 2012

Digging Into NetworkX and D3


For Boston's Predictive Analytics Meetup in February, I gave a short talk on using the Python library NetworkX to analyze social network link data, illustrated with some simple D3.js visuals of the results. I've since spruced up the slides to stand on their own a bit better, extended a few of the examples, and moved it all online.

Here's a link to the zip file of the ppt, heavily commented code samples, and the network edgelist I used (from Moritz Stefaner's and my previous look at Twitter Infovis folks in mid-2011). Or you can browse the slides below (the links should work fine).


A few comments, if you made it through the deck... The network stats are doubtless out of date, since I know there has been some movement in who-follows-whom among the Infovis crowd on Twitter. The overall workflow I propose is this:

  1. Read your edgelist into NetworkX (or a json file if you already have one)
  2. Convert to a NetworkX graph object
  3. Calculate stats and save them as attributes on the graph nodes (my code shows you how/does it for you)
  4. Optionally here: Filter the network by some important attribute.
  5. Write out JSON of the network to use elsewhere (e.g., D3)
  6. Visualize (in D3) and explore what you got
  7. Optionally here: Filter the network further in the interactive visualization.
  8. Go back to (1) and add more stats. Or filter some more.
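Steps 1 through 5 can be sketched roughly as follows. This is a minimal illustration, not the code from the zip file: the function names and the toy `EDGE_LINES` data are hypothetical, and it assumes a directed "follower followed" edgelist. With a real file you would call `nx.read_edgelist(path, create_using=nx.DiGraph)` instead of parsing a list of lines.

```python
import json

import networkx as nx
from networkx.readwrite import json_graph

# Hypothetical stand-in for an edgelist file: one "follower followed" pair per line.
EDGE_LINES = [
    "alice bob",
    "bob carol",
    "carol alice",
    "dave alice",
    "alice dave",
    "erin bob",
    "bob erin",
]


def load_graph(lines):
    """Steps 1-2: parse "source target" lines into a directed graph."""
    return nx.parse_edgelist(lines, create_using=nx.DiGraph)


def add_stats(graph):
    """Step 3: compute node statistics and store them as node attributes."""
    stats = {
        "degree": dict(graph.degree()),
        "betweenness": nx.betweenness_centrality(graph),
        "eigenvector": nx.eigenvector_centrality(graph, max_iter=1000),
    }
    for name, values in stats.items():
        nx.set_node_attributes(graph, values, name)
    return graph


def filter_top(graph, attribute, n):
    """Step 4: keep only the n nodes with the highest value of `attribute`."""
    ranked = sorted(graph.nodes, key=lambda v: graph.nodes[v][attribute], reverse=True)
    return graph.subgraph(ranked[:n]).copy()


def to_json(graph):
    """Step 5: serialize nodes (with their attributes) and edges for use in D3.

    The name of the key that holds the edge list depends on the NetworkX version.
    """
    return json.dumps(json_graph.node_link_data(graph))


graph = add_stats(load_graph(EDGE_LINES))
subset = filter_top(graph, "eigenvector", 3)
payload = to_json(subset)
```

Because the stats live on the nodes, they travel with the JSON, so the visualization can size, color, or filter by any of them without recomputing anything.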

In my previous post using Gephi to analyse the infovis network, I labelled one subcommunity "The Processing" crowd, another one "The Researchers" and another one "The Authorities." In my current analysis, where I find 6 subcommunities (or "partitions"), you can see them as roughly the green partition (Processing folks and infovis artists), the orange partition (the research/analytics group), and the blue partition (with high-degree authorities like infosthetics and flowingdata).
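For readers who want to try partitioning themselves, here is a small sketch of how community labels can be attached to nodes. It uses NetworkX's built-in `greedy_modularity_communities` as a stand-in; that is an assumption for illustration and not necessarily the algorithm behind the six partitions described here. The toy graph and the `label_partitions` helper are hypothetical.

```python
import networkx as nx
from networkx.algorithms import community

# Hypothetical toy network: two tight groups of four joined by a single edge.
toy = nx.Graph()
toy.add_edges_from(nx.complete_graph(["a1", "a2", "a3", "a4"]).edges)
toy.add_edges_from(nx.complete_graph(["b1", "b2", "b3", "b4"]).edges)
toy.add_edge("a1", "b1")


def label_partitions(graph, attribute="partition"):
    """Detect communities and store each node's community index as an attribute."""
    # Work on an undirected view of the ties; the direction of following is ignored here.
    groups = community.greedy_modularity_communities(graph.to_undirected())
    labels = {node: index for index, group in enumerate(groups) for node in group}
    nx.set_node_attributes(graph, labels, attribute)
    return groups


groups = label_partitions(toy)
```

Storing the label as a node attribute means it can be written out with the rest of the node data and used for coloring or sorting in the visuals.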

The different demos make different things clear about this data, as you might expect!

  • The adjacency matrix of the top 88 chosen by eigenvector centrality reveals that the orange partition, or the Researchers, has more members with high eigenvector centrality than the other subgroups. This is quite clear when you sort by Partition. Other partitions are barely represented here. (NB: It was only 88 because that was all that fit easily; for the other demos, I use a subset of 100 out of the full 1644 nodes.)


  • The chord diagram, which allows you to toggle between the 100-person subset with the top eigenvector scores and all 1644 nodes, shows a striking difference between the two. It's even more obvious here how the orange partition (the "Researchers") overwhelms the top eigenvector subset, and how little of the large green group is represented in this subset. We can certainly speculate why this is...


  • The force network of nodes allows you to see some individual following patterns, at least among the "most important" top nodes. In this example, I filter the graph inside the javascript code, instead of in the NetworkX code. The graph shows a union of the top nodes selected for eigenvector centrality, betweenness, and degree. Some of the nodes near the outer edge of the layout don't follow anyone in the "top" network sample, but they made this cut by ranking high on at least one of degree, eigenvector centrality, or betweenness. You can resize by each variable, or click on one to see the individual's values.
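The two kinds of selection behind these demos (a top-N subset on one measure, and a union of top-N subsets across several measures) can also be done on the Python side. The sketch below is illustrative only: the force network demo does this filtering in javascript, the helper names are hypothetical, and the scores are made-up toy values. It also shows one way to order an adjacency matrix by an attribute such as partition.

```python
import networkx as nx


def top_nodes(graph, attribute, n):
    """Return the set of n nodes with the highest value of `attribute`."""
    ranked = sorted(graph.nodes, key=lambda v: graph.nodes[v][attribute], reverse=True)
    return set(ranked[:n])


def union_of_top(graph, attributes, n):
    """Subgraph of the nodes that rank in the top n on at least one attribute."""
    keep = set()
    for attribute in attributes:
        keep |= top_nodes(graph, attribute, n)
    return graph.subgraph(keep).copy()


def ordered_adjacency(graph, sort_key):
    """Adjacency matrix as nested lists, rows and columns ordered by `sort_key`."""
    order = sorted(graph.nodes, key=lambda v: (graph.nodes[v][sort_key], str(v)))
    matrix = [[1 if graph.has_edge(u, v) else 0 for v in order] for u in order]
    return order, matrix


# Hypothetical toy network with made-up scores, only to exercise the helpers.
demo = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")])
nx.set_node_attributes(demo, {
    "a": {"eigenvector": 0.9, "betweenness": 0.5, "degree": 3, "partition": 1},
    "b": {"eigenvector": 0.4, "betweenness": 0.6, "degree": 2, "partition": 0},
    "c": {"eigenvector": 0.3, "betweenness": 0.1, "degree": 2, "partition": 1},
    "d": {"eigenvector": 0.0, "betweenness": 0.0, "degree": 1, "partition": 0},
})

core = union_of_top(demo, ["eigenvector", "betweenness", "degree"], 1)
order, matrix = ordered_adjacency(core, "partition")
```

A node can enter the union by topping just one measure, which is why some members of such a subset have few or no ties to the rest of it.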

Once you get started making these visuals, you want to tinker forever... I hope the code samples and comments help you get started, if you want to try something along these lines! Once again, talk slides plus source are in this zip file. Be sure to note my warnings and gotchas if you tinker yourself.

For a recent and different analysis of talk among the Twitter Infovis crowd, visit @JeffClark's posts here and here and here. (He's an orange, top N member in my graphs.) He identifies "red" and "blue" groups based on their interactions and words used. His two primary groups seem to correspond to the processing/artists (green) and researchers/analytics (orange) distinctions I found in this older data.