Python for Data Analysis

Sat, Mar 16, 2013

Part 1

I am reading Python for Data Analysis, and it is really good. Everyone who wants to do data analysis should read this book and consider using these tools. It presents NumPy and SciPy for numeric and vectorized operations, matplotlib for fast and programmatic plotting, and Pandas for a robust data structure framework. It also goes over some data formats and tools for parsing them.

Part 2

I have read and thoroughly enjoyed the book by Wes McKinney. Ever since, I have been using Python for all of my coding, except for the CUDA and Cilk code required for my High Performance Computing algorithms class. The Pandas library has been a wonderful tool. It builds atop and plays nicely with the NumPy, SciPy, pyplot stack for interactive data analysis. Making a quick plot to see a trend in some data is really easy, and the same code scales up to publication-quality plots. I have been developing a workflow for a paper that I am writing with David Ediger and Rob McColl. The Python code is an integral part of the paper because it is the record of how I analysed the data and prepared the figures. My goal for the final paper is to distribute a tarball with Bash code for gathering data; Stinger C code for graph experiments; Python with Pandas code for data analysis; LaTeX for document preparation; and a Makefile that runs it all as a pipeline to make the paper. That way, if we get a new dataset, all of the hard work that I have been doing over the past months can be replicated with a single command. This will be a real step forward for reproducible research.

I think that I also have a clearer picture of how one should organize code for using IPython and interactively exploring some data. I started off with a little script that would load the data, analyse it, and make the figures. Then I realized that lots of the features I had were being reused, so I collapsed them into functions with some general parameters. I tried to write functions directly, but that makes debugging them a lot harder because you don’t end up in their namespace when they are done executing. At the start I also don’t really know where functionality should live or what data is going to be reused later. So writing code in a static style, where you plan first and then code, is not going to work for exploratory data analysis. As this script grew larger, it accumulated a lot of computation that was performed every time I ran it. But the IO takes most of the time, and I do not need to reload the data every time I make a small change to a feature.
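One way to avoid paying the IO cost on every small change is to cache the loaded data for the life of the session. The following is only a minimal sketch of that idea, not the code from my project: it assumes the data sits in a CSV file, and the names `load_table` and `mean_of` are made up for illustration.

```python
import csv
from functools import lru_cache


@lru_cache(maxsize=None)
def load_table(path):
    """Read a CSV file once; later calls with the same path reuse the result."""
    with open(path, newline="") as handle:
        # A tuple of rows, so the cached sequence cannot be appended to
        # or reordered by accident (hypothetical design choice).
        return tuple(csv.DictReader(handle))


def mean_of(rows, column):
    """Example computation that can be edited and re-run without touching IO."""
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)
```

The cache only helps if the loader lives in a module that is imported once and then left alone. Re-running a script redefines everything in it, including the cache, so the function being edited should live in a separate module that can be refreshed with `importlib.reload`.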

This brings us to the file layout that I am now using. The first file does all of the data loading and cleans up afterwards by closing files. The second file does the computation and defines a bunch of functions that can be used at the IPython prompt. The third file does all of the visualization in a general way, i.e., with no string constants. The paper_figures file uses constants to make the final figures that will go into the paper and writes them to files where LaTeX can find them. This allows you to iterate on the computation file without running the IO every time you change a computation. By separating these tasks we can make interactive data analysis more fun and responsive. The data-loading and paper_figures files are really for the batch mode that comes later, once the data has been explored and we are distributing the results to other researchers and decision makers. The files in the middle evolve into a little library of things you do with your data that could be reused next time.
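To make the layout concrete, here is a rough sketch of the four pieces collapsed into one listing. It is an illustration under assumptions rather than my actual code: every file name except paper_figures is hypothetical, as are the function names, the column names `threads` and `seconds`, and the output file name.

```python
import os

import matplotlib
matplotlib.use("Agg")  # render to files; no display is needed in batch mode
import matplotlib.pyplot as plt
import pandas as pd


# --- load_data.py (hypothetical name): all IO happens here ---
def load_data(path):
    # Given a path, read_csv opens and closes the file itself.
    return pd.read_csv(path)


# --- compute.py (hypothetical name): functions for the IPython prompt ---
def mean_by(frame, key, value):
    """Mean of `value` for each distinct `key`, as a Series indexed by key."""
    return frame.groupby(key)[value].mean()


# --- visualize.py (hypothetical name): general plotting, labels passed in ---
def plot_series(series, xlabel, ylabel, title):
    fig, ax = plt.subplots()
    ax.plot(series.index, series.values)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    return fig


# --- paper_figures.py: the only place where constants appear ---
def make_paper_figures(frame, outdir):
    # Column names, labels, and the file name are made up for illustration.
    series = mean_by(frame, "threads", "seconds")
    fig = plot_series(series, "Threads", "Mean time (s)", "Scaling")
    out = os.path.join(outdir, "scaling.pdf")
    fig.savefig(out)
    plt.close(fig)
    return out
```

In an interactive session only the middle two pieces change often, so they are the ones worth reloading; `load_data` runs once and `make_paper_figures` runs at the end, or from the Makefile.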

Let me know if you have had a similar experience or wisdom to share on this subject.