PyData NYC Nov. 2013

PyData 2013 NYC was a pretty great time.  It is always fun to meet folks as passionate about your favorite tools as you are.  There’s too much to cover in full, but I definitely want to throw together a few of my thoughts and ideas.  Without further ado …

Some of the talks I went to:

  • Travis talking about conda (and blog post and blog post).  While I’m an admitted gentoo fanboy (actually, I don’t fan at all; I just use it), having a lighter-weight option for the Python ecosystem (across *nix (including OSX) and Windows) is really nice.  If I had realized a few things about conda last year (I’m not sure how far along it was at the time), I might have used it for some internal code deployment.
  • Yves talking about Performance Python (and an ipython notebook of the same; some other talk material is at his website).  Not much here was new to me, but being reminded of the fundamentals and the low-hanging fruit is always good.
  • Dan Blanchard talking about skll (and a link to the talk).  skll seems to take care of several procedural meta-steps in scikit-learn programs:  train/test/CV splits and model parameter grid searches.
  • Thomas Wiecki talking about pymc3 (most of the talk material shows up in the pymc3 docs; he also mentioned Quantopian’s zipline project and he has a few interesting git repos).
  • Peter Wang’s keynote was insightful, thought provoking, and not the typical painful keynote that has you checking email the whole time.  He mentioned a Jim Gray paper that seems worthwhile.  By reputation, everything Jim Gray did was worthwhile.  [Gray disappeared while sailing a few years back.]
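A toy illustration of the kind of low-hanging fruit from the Performance Python talk (my own sketch, not material from the talk): replacing a pure-Python loop with a vectorized numpy call.

```python
import numpy as np

def py_sum_of_squares(xs):
    # pure-Python loop: interpreter overhead on every element
    total = 0.0
    for x in xs:
        total += x * x
    return total

def np_sum_of_squares(xs):
    # the same computation pushed down into compiled numpy code
    return float(np.dot(xs, xs))

xs = np.arange(100_000, dtype=float)
# the two agree; the numpy version is typically orders of magnitude faster
assert abs(py_sum_of_squares(xs) - np_sum_of_squares(xs)) < 1e-6 * np_sum_of_squares(xs)
```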
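For a sense of the meta-steps skll wraps, here is roughly what they look like done by hand in scikit-learn (a sketch using the current scikit-learn API; module paths have moved around since 2013, and the estimator and parameter grid are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# synthetic data stands in for a real corpus
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# meta-step 1: train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# meta-step 2: cross-validated grid search over model parameters
grid = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)           # chosen parameter setting
print(grid.score(X_test, y_test))  # held-out accuracy
```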

A thought that I’ve had over the years, and that I’d love to see carried through (and kept running), is some sort of CI job (continuous integration) that grabs the main Python learning systems, builds them, and runs [some|many|most|all] of the learning algorithms on synthetic, random, and/or standard (UCI, kaggle, etc.) datasets.  Of course, we would measure resource usage (time/memory) and error rates.  While the time performance is what would really get most people interested (and also cause the most dissent:  you weren’t fair to XYZ), I’m more interested in verifying that, say, random forests in scikit-learn and Orange give broadly similar results.  Throwing in some R and matlab options would give a comparison to the outside world as well.
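A minimal sketch of what one cell of that CI matrix might look like, with two scikit-learn learners standing in for the cross-library case (the learners, dataset, and measurements here are placeholders, not a finished harness):

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# a synthetic dataset; a real harness would also pull UCI/kaggle data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for name, model in [("random_forest", RandomForestClassifier(random_state=0)),
                    ("logistic", LogisticRegression(max_iter=1000))]:
    # measure wall-clock training time and held-out error rate
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    error = 1.0 - model.score(X_te, y_te)
    results[name] = (elapsed, error)

for name, (elapsed, error) in results.items():
    print(f"{name}: {elapsed:.3f}s, error={error:.3f}")
```

Memory usage is trickier to capture portably; something like resource accounting around each fit would be a natural extension.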

Doing these comparisons the right way has a number of difficulties, as I discussed with Jake VanderPlas.  In just a few minutes, we came up with several worries: data format differences (less of an issue for numpy-based libraries; Orange uses its own ExampleTable, which can be converted to and from numpy arrays), default and hard-coded parameters (which may make it impossible to compare truly equivalent models), and social issues.