In my PhD course in econometrics I used to ask students not only for their end-of-year paper but also for the full replication code, in whatever language they used. Being able to fully replicate a submitted paper's tables is key, as most top journals now require submission of such code upon acceptance; data availability and replication policies became the norm in academia after the well-documented Hoxby v. Rothstein debate.
Looking at my students' code has been revealing. Here, rather than pointing out flaws (whose work doesn't have flaws?), I'll describe how I process data and write models:
- More than two-thirds of the work on an empirical paper will involve merging and recoding data from different sources. Many PhD students repeat the same code lines 10 or 20 times. While simple software like Stata makes it easy to do a merge, it is extremely bad at abstracting code. For most heavy projects, data processing can be done in under 100 lines, because data preparation instructions are very repetitive. Hence advice #1: use R, and abstract your code as much as possible. For instance, creating a dummy for missing values of variables 'var1'-'var100' in 5 different files shouldn't take 100 lines; it should take one. There are many more elaborate ways of abstracting even the most complex operations, and functional programming in R makes abstracting fast and easy. So learn lapply, sapply, and tapply, and use functions as much as possible (see the first sketch after this list).
- For GIS operations, which most urban economists need, I use the R libraries rgeos, sp, and rgdal as much as possible, given their flexibility. But these libraries can be slow. For large operations, use ogr2ogr, which performs an order of magnitude faster but is harder to handle. Write Makefiles that call ogr2ogr, so that your GIS operations can be replicated (see the second sketch below).
- I never start writing source code without writing the literate (natural-language) version of the code. First, write the plain-English description as comments and insert the required equations; second, write the code in between the comment lines. Some people use R Markdown to intersperse R code and comments, but I prefer plain R scripts with #' comments, which can be either rendered as PDF/HTML or executed as R (see the third sketch below). Donald Knuth, the author of The Art of Computer Programming, has long been a proponent of literate programming, and it does help in clarifying code and in writing code as a work of art.
- For theory models, Mathematica has been a great help. The notebook interface really is much better suited to theory work (e.g., solving a symbolic system of equations) than to numerical/empirical work. Mathematica is well suited to functional programming, so switching between R and Mathematica shouldn't be conceptually hard (there is also a Mathematica package called RLink). When simulating a model under different sets of parameters, I have written R Shiny user interfaces to play with the model's parameters and check the nature of the equilibria (see the last sketch below).
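Here is a minimal sketch of the one-line idea from the first bullet. The file and variable names are hypothetical; the point is that one lapply call replaces 100 copy-pasted lines:

```r
# Hypothetical file and variable names: flag missing values of var1-var100
# in 5 files with one lapply call instead of 100 repeated lines.
files <- paste0("data/file", 1:5, ".csv")
vars  <- paste0("var", 1:100)

add_missing_dummies <- function(df, vars) {
  # one dummy per variable: 1 if the value is missing, 0 otherwise
  df[paste0(vars, "_miss")] <- lapply(df[vars], function(x) as.integer(is.na(x)))
  df
}

datasets <- lapply(files, function(f) add_missing_dummies(read.csv(f), vars))
```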
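For the GIS bullet, a sketch under the same hedges: the shapefile names and the EPSG code are made up, but readOGR/spTransform and the ogr2ogr reprojection call are standard usage:

```r
library(rgdal)  # readOGR
library(sp)     # spTransform, CRS

# Small layer: read and reproject within R.
tracts <- readOGR(dsn = "shapefiles", layer = "census_tracts")
tracts <- spTransform(tracts, CRS("+init=epsg:4326"))

# Large layer: the same reprojection, an order of magnitude faster via
# ogr2ogr (this is the command a Makefile rule would wrap).
system("ogr2ogr -t_srs EPSG:4326 roads_wgs84.shp roads.shp")
```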
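The #' style from the third bullet looks like this: a toy gravity regression on made-up numbers, runnable as plain R and renderable with knitr::spin():

```r
#' # A toy gravity equation
#' We estimate $\ln T_{ij} = \alpha + \beta \ln d_{ij} + \varepsilon_{ij}$,
#' where $T_{ij}$ is a trade flow and $d_{ij}$ a distance (toy data below).
trade <- data.frame(flow = c(120, 45, 300), dist = c(200, 850, 90))

#' OLS on the log-log specification:
fit <- lm(log(flow) ~ log(dist), data = trade)
summary(fit)

#' Render this file with knitr::spin("gravity.R") or
#' rmarkdown::render("gravity.R") for an HTML/PDF report;
#' source it to just run the code.
```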
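And a minimal Shiny sketch for the last bullet. The "model" is a hypothetical linear supply/demand pair, there only to show the slider-to-plot loop:

```r
library(shiny)

ui <- fluidPage(
  sliderInput("beta", "Demand slope:", min = 0.1, max = 2, value = 1),
  plotOutput("eq")
)

server <- function(input, output) {
  output$eq <- renderPlot({
    p      <- seq(0, 10, by = 0.1)
    demand <- 10 - input$beta * p  # hypothetical demand curve
    supply <- 2 + p                # hypothetical supply curve
    plot(p, demand, type = "l", xlab = "price", ylab = "quantity")
    lines(p, supply, lty = 2)      # equilibrium: where the curves cross
  })
}

shinyApp(ui, server)
```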
Disclaimer: Stata has been and still is a great pedagogical tool. Its output is clear, and the software almost trains its users: it displays the right statistics for publication and provides useful warnings. But that simplicity can become a straitjacket. In addition, there is simply no GIS support in Stata (apart from the very basic spmap and shp2dta): good luck running a regression discontinuity design at a border in Stata.