Sunday, September 13, 2015

Knight Projects for the Year


I am installed in Miami for the academic year as a Visiting Knight Chair in the Journalism department; I've been busy (frantically, insanely busy) trying to put together class materials for the semester, grade stuff, produce talks and workshops, and keep up with Twitter.

As a nice benefit — or responsibility — I have project money to spend on activities or products that will improve the lives of the journalists of the future. Or of the now, if I do it right. Apart from some conference organization with Alberto Cairo, I'm thinking hard about how I'd like to spend that money. Here are a few things I tweeted about a week ago that I think would be of great benefit to data journalists, which don't yet exist fully:

"A few of my Wish List items for improving work, probably out of my project $ and scope:"

  1. "A data-wrangler tool like Trifacta, easy to get/use."
  2. "A customizable, comprehensive interactive vis lib with easy basics - like Vega 2 but maybe more baked? Vega in a year?"
  3. "A non-programming tool for visualization creation that outputs code you can tweak. Lyra, basically, baked."
  4. "A Shiny Server and similar paradigm for Python."
  5. "HTMLwidgets for Python -- we need one ring to bind them, or something. Soooo many attempts to make notebook vis graphics."
  6. "One more - tools/methods for making training and sharing entity recognizers easier. HUGE problem in text analysis."
A few of these tools are under active development in the University of Washington's Interactive Data Lab, particularly Vega and Lyra. (I recommend this video of Arvind Satyanarayan demoing Lyra at OpenVis Conf.) One, Trifacta, is a spin-off company and product from Jeff Heer (Director of the IDL) and his student Sean Kandel, who created Data Wrangler. If you want to see some of the excellent tools coming out of UW's IDL, Jeff Heer's keynote at OpenVis this year was outstanding.

And apparently there's more goodness in the works addressing my need for IPython notebook interactive widgets: a sub-Vega project on GitHub (pointed out by Rob Story), currently called ipython-vega. Also on the Python front, Rob Story suggests we might want to look at Pyxley from Stitch Fix, but to me that still looks like a lot of programming and manual setup for a non-programmery analyst. Shiny apps, by contrast, are dead-simple for data analysts with a little gumption to throw up and share with folks right from their RStudio environment.

The future looks great about 5+ years out, when all the grad students have finished and productized (or gotten significant coding support). But right now there is still a lot of pain, especially when you're trying to teach folks and recommend tools that are stable, documented, and tested (by people, not just unit tests, although those too). Trifacta, of course, is not open-source. A competitor product, Alteryx, looks nice and has an academic license scheme, but the non-academic version is $4K! For both students and data journalists, enterprise-level pricing for data-wrangling tools is looking scary.

Aside on Entity Recognizers

Oh, a little note on item #6, entity recognition tools... Anyone who is trying to do named entity recognition (NER) in text files has a horrible slog getting good results. NER means things like looking up all the people, places, products, or companies in a text. It's hard because many different strings can refer to the same entity. To get results that are any good, especially on dynamic recent data (like news!), you need to train a recognizer with labeled text. (This is because the "out of the box" models and tools like Stanford NER etc. are almost always inadequate for what you really want.) The tools to do the labeling, and the labeling itself, pretty much suck. (Although I admit I haven't looked at the most recent one recommended to me by the Caerus folks.) I know a lot of grad students are suffering with this when doing research on text in highly specific domains.
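To make the problem concrete, here's a toy dictionary-based recognizer in Python (purely illustrative, with a made-up gazetteer; this is not how any of the trained tools above work). Exact string matching only finds the surface forms you happened to list, which is exactly why labeled training data matters:

```python
# Toy "entity recognizer": exact string matching against a hand-made list.
# Real NER needs a trained model precisely because this approach fails
# the moment an entity appears under a name you didn't anticipate.
GAZETTEER = {
    "International Business Machines": "ORG",
    "IBM": "ORG",
    "Miami": "PLACE",
}

def tag_entities(text):
    """Return sorted (surface string, label) pairs found by exact match."""
    return sorted(
        (surface, label)
        for surface, label in GAZETTEER.items()
        if surface in text
    )

# "Big Blue" also refers to IBM, but an exact-match list can't know that;
# a trained recognizer generalizes from labeled examples instead.
print(tag_entities("IBM, a.k.a. Big Blue, opened an office in Miami."))
```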

I'd love to see a marketplace for trained models customized for different domains, and easy-peasy tools for updating them and sharing improvements. I wish someone's NLP student would tackle this as a startup. Or, I suppose, I could do it with my project money and some help.

Instead, Text Analysis and Vis How-To's?

In the realm of things I can deliver without a corporate team of developers, I'm thinking about doing an online repo ("book") of text analysis and visualization methods. This will be a combination of NLP and corpus analysis methods (in R and Python, I hope) as well as a handbook of visualization methods for text (with sample D3 code). The audience would be journalists with text to analyze, digital humanists with corpora, and linguists wanting to get more visual with their work. Because my time is shockingly limited, I'll probably recruit an external helper with my project money to create code samples. If you've seen my epic collection of text vis on Pinterest and wondered "how do I make those?", I hope I'll be able to help you all.
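As a flavor of the kind of recipe such a repo might open with (a sketch I'm inventing here, not actual course material): term frequencies are the starting point behind most text vis, from bar charts to word clouds.

```python
import re
from collections import Counter

# Illustrative stopword list; a real recipe would use a much fuller one.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "to", "is"})

def top_terms(text, n=5):
    """Lowercase, tokenize on letters/apostrophes, drop stopwords, count."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS).most_common(n)

print(top_terms("The data of the news is the data of the people."))
```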


How does this sound? Useful?

Any other ideas from folks out there? I'm chatting with my pals at Bocoup (Irene, Jim, Yannick) about other options for collaborations between us.


Local Workshops on Data Journalism Topics

One of my contributions to the local community at U of Miami is a series of workshops on topics hopefully of interest to data journalists (that I am qualified to teach). The first was a well-attended one on Excel Data Analysis (files here), and upcoming topics include:
  • Excel Charts and Graphs
  • Just What is Big Data (and Data Science) Anyway?
  • Intro to Web Analytics: A/B Testing and Tracking
  • Intro to Tableau
  • Python and R: What Are They Good For?
  • Text Mining with Very Little Programming
  • Visualizing Network Data

I'd like to do one on command line data analysis, and some more on Python and R tools, but am not sure yet where the group wants to go. Stay tuned for more links!

Sunday, March 08, 2015

Teaching News

Overdue for a blog post, and I guess my news needs an official announcement!

I'm happy to announce that I have accepted a visiting post at the University of Miami for 9 months, beginning August 2015 and running through the academic year. This post is financially possible thanks to the generous Knight Foundation, which supports various faculty positions in journalism throughout the country. I’ll be helping Alberto Cairo get his new Data Visualization and Journalism track in the Interactive Media MFA off to a running start; I’ll be teaching data visualization and data analysis, including D3.js. I'll probably keep some side contract work going at the same time. Here's my favorite version of the news on Twitter:




I’ve always been wary of trying to teach D3 in any short workshop format — I’ve been asked and said “no” many times. However, the first class I’ll teach is a semester long, so it seems more feasible. To help prepare, I’ll also be a TA in this spring’s online Data Visualization and Infographics with D3 course co-taught by Alberto and Scott Murray (@alignedleft, screen-capped above), the author of a very nice introductory D3 book, Interactive Data Visualization for the Web. (If you’re hearing about it now for the first time, sorry: the class filled quickly to its 500-person cap. Maybe they’ll run it again if it’s successful.)

In other, more minor teaching news, I did a guest lecture at CMU in Golan Levin’s STUDIO for Creative Inquiry on NLP (natural language processing) in Python; the files are all here. The most “interesting” part from Twitter’s perspective is the Bayesian detection of sex scenes in Fifty Shades of Grey (because spam is boring). I first did this cocktail-party stunt at OpenVis Conf in 2013, and now I’ve finally released the data and code for it. These introductory lectures cover concepts that would be useful in any more advanced text visualization context; I hope to get a chance to expand on that subject while in Miami, too.
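For anyone curious how that style of Bayesian text classification works in principle, here's a minimal sketch (toy data and labels invented for illustration; this is not the released Fifty Shades code): score a passage under each class by summing log-probabilities of its words, with add-one smoothing so unseen words don't zero everything out.

```python
import math
from collections import Counter

# Toy labeled training data, invented for this sketch.
train = [
    ("ham", "meeting agenda budget report"),
    ("ham", "lunch schedule office email"),
    ("spam", "win free prize money now"),
    ("spam", "free money click now"),
]

# Per-class word counts.
counts = {c: Counter() for c in ("ham", "spam")}
for label, text in train:
    counts[label].update(text.split())

vocab = set(w for c in counts.values() for w in c)

def log_prob(text, label):
    """Log P(words | label) with add-one smoothing; uniform class prior."""
    total = sum(counts[label].values()) + len(vocab)
    return sum(
        math.log((counts[label][w] + 1) / total)
        for w in text.split()
    )

def classify(text):
    """Pick the class whose model makes the text most likely."""
    return max(("ham", "spam"), key=lambda c: log_prob(text, c))

print(classify("free prize money"))  # spam
print(classify("budget meeting"))    # ham
```

The same scheme scales to "sex scene vs. not" by swapping in that labeled text; the math doesn't change, only the training data.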

I’m also putting together a lynda.com class, Introduction to Data Analysis with Pandas, although I’ve been doing it veeerrrryyyy slowwwwlllyyyy.

Finally, related to teaching, I’m co-chair of OpenVis Conf this year. We are not quite sold out yet (as of this post), and I think you should come. This is a conference about how the visualization sausage is made — lots of educational talks!

I had planned to write 3 more sections on learning, teaching, and making, but there were some minefields in there about gender and sexism in tech. Not ready for prime time. No navel-gazing for now!