#+Title: Active Documents with Org-mode
#+Author: Eric Schulte and Dan Davison
#+LATEX_HEADER: \usepackage{attrib}
#+Options: ^:nil toc:nil
#+Startup: hideblocks

\begin{abstract}
Org-mode is a simple, plain-text markup language for hierarchical documents allowing intermingled data, code and prose. An entire research project, including initial note taking, planning, task management, experimentation, analysis, and publication may take place within a single Org-mode document. This article introduces Org-mode with an overview of syntax, a working \emph{reproducible} example of embedded data analysis, and a summary of the features that make Org-mode a particularly useful tool for the scientific researcher.
\end{abstract}

* Introduction
Org-mode is implemented as a part of the Emacs text editor \cite{emacs}. It was initially developed as a simple outlining tool intended for note taking and brainstorming, and was later augmented with task management tools---enabling notes to be transformed into tasks with deadlines and priorities---and with syntax for the inclusion of tables, data blocks, and active code blocks. Users new to Org-mode often start with its simple plain-text note taking system, then move on to increasingly sophisticated features as their comfort level permits.

Reproducible Research (RR) is the practice of publishing scientific results along with the software environment and data required for reproduction of all computational analyses presented in the publication \cite{cise-rr}. Reproducibility is essential to peer-reviewed research; however, scientific publications often lack the information required for reviewers to reproduce the analysis described in the work.

#+begin_quote
An article about computational science in a scientific publication is *not* the scholarship itself, it is merely *advertising* of the scholarship. The actual scholarship is the complete software development environment and complete set of instructions which generated the figures.
\attrib{Donoho \cite{donoho}}
#+end_quote

Org-mode supports RR with syntax for including in-line data and code, mechanisms for evaluating embedded code, and publishing functionality that may be used to automate the computational analysis and generation of figures.

This article will focus on the features of Org-mode that support the practice of RR; information on other aspects of Org-mode can be found in the manual \cite{org-manual} [fn:2] and the community wiki [fn:3]. The plain text Org-mode source of this article is available for download [fn:4]; a user with the requisite open-source software can execute the source code examples, which analyze a dataset and create graphics, and export the complete paper to one of several output formats.

* Syntax
** Outlines
Org-mode documents are organized using a hierarchical outline. The outline can be folded and expanded, hiding or exposing as much of the document as desired. Using this facility, even very large documents can be comfortably navigated in a manner similar to that of a file system. Headlines are indicated by leading =*='s as shown below in the folded view of this article from within Org-mode.

#+source: folded-org
#+headers: :exports results
#+begin_src sh :var this=(buffer-file-name) :results output
  cat $this|grep "^\*"|sed 's/$/.../g'
#+end_src

#+label: fig:folded-org
#+results: folded-org

The =...='s at the end of each line indicate that the content of the heading is hidden from view. Notice that the heading beginning with the keyword =COMMENT= is not included in the exported document. Org-mode uses many such keywords for associating information with headlines.

** Code and Data
Using a simple `block' syntax, both code and data can be embedded in Org-mode documents, as follows.

#+begin_src org :exports code
  ,First a data block.
  ,#+begin_example
  ,  raw textual data
  ,#+end_example
  ,Second a code block.
  ,#+begin_src sh
  ,  echo "shell script code"
  ,#+end_src
#+end_src

Code and data blocks can be named, allowing their contents to be referenced from elsewhere in the Org-mode file, as illustrated in the following example where the shell script references the contents of the data block.

#+begin_src org :exports code
  ,First a data block.
  ,#+results: raw-data
  ,#+begin_example
  ,  raw textual data
  ,#+end_example
  ,Second a code block.
  ,#+begin_src sh :var text=raw-data
  ,  echo $text|wc
  ,#+end_src
  ,#+results:
  ,: 1 3 17
#+end_src

Cross references between the code and data elements of an Org-mode file turn Org-mode into a powerful multilingual programming environment, in which data and code expressed in many different programming languages may interact.

* Evaluation
Code and data references make possible strings of /chained evaluation/. Figure \ref{fig:chained-evaluation} shows the series of actions which result when the =analyze= code block (fig. \ref{fig:chained-evaluation}, 1) is evaluated interactively or during export.

#+label: fig:chained-evaluation
#+Caption: Active Org-mode Document
#+attr_latex: width=\textwidth
[[file:chained-evaluation.pdf]]

1. The =analyze= code block is evaluated. The =:var data=data= header argument causes Org-mode to evaluate the =data= reference.
2. To resolve this reference the =data= code block is located in the Org-mode file and is evaluated.
3. The =:var raw=raw= header argument causes Org-mode to resolve the =raw= reference.
4. The =raw= code block is evaluated. The =:var url="http://data.org"= header argument is a literal value, which is assigned to the =url= variable and passed to the shell script. The shell script then downloads data from the external URL and makes these data available to Org-mode.
5. The results of the shell script are assigned to the =raw= variable, which is passed to the Python code in the body of the =data= code block.
6.
   This code is passed to an external Python interpreter which evaluates the Python code and returns its result to Org-mode.
7. The results of the =data= code block are then assigned to the =data= variable and passed to the R code in the body of the =analyze= code block.
8. This code is then passed to an external R interpreter, which generates a figure that is written to the file specified in =:file fig.pdf=.
9. A reference to this figure is then passed from the =analyze= code block back to Org-mode, which inserts a link marked by double square brackets into the body of the Org-mode document.

On export to HTML, ASCII, LaTeX, or another format supported by Org-mode, the linked figure will be embedded into the exported document.

* Example Application
The application of Org-mode to RR is illustrated with an analysis of baseball statistics. The ordered nature of baseball games makes them particularly amenable to statistical analysis. The performance of baseball players, and the course of baseball games, are routinely captured in a small number of statistics that are comparable across space and time.

In this example we analyze the correlation of several common offensive statistics with the attendance at Major League Baseball (MLB) games in the 2010 season. We hypothesize what every baseball fan wants to believe: that large crowds spur the home team to superior levels of performance. The offensive statistic that has the largest correlation with high attendance is found and reported.

** Download External Data
This example will show correlation of home team offensive statistics with attendance for the src_sh[:var season=season]{echo $season} MLB season.

#+results: season
: 2010

The first code block, named =url=, translates the numerical season shown above into the URL for the =retrosheet.org= [fn:1] website, a website devoted to the collection and curation of major league baseball statistics.
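
As a minimal sketch of this translation, the same URL can be built in Python. The function name =gamelog_url= is hypothetical and not part of the article's code; only the URL pattern is taken from the =url= code block.

#+begin_src python
def gamelog_url(season):
    """Return the Retrosheet game-log archive URL for one season."""
    # gamelog_url is an illustrative name; the pattern mirrors the url block.
    return "http://www.retrosheet.org/gamelogs/gl{0}.zip".format(season)

print(gamelog_url(2010))
#+end_src

In this pattern only the four-digit season in the file name varies from one season to the next.
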
#+source: url
#+begin_src sh :var season=season :exports none
  echo "http://www.retrosheet.org/gamelogs/gl$season.zip"
#+end_src

With the =raw-data= shell code block, the zip file of statistics located at the specified URL is downloaded and its contents are unpacked into a local text file named =2010.csv=. The =:cache yes= header argument ensures that this code block is only run once and the data are not downloaded again every time the results of the code block are referenced.

#+source: raw-data
#+headers: :exports none
#+begin_src sh :cache yes :var url=url :file 2010.csv
  wget $url && \
      unzip -p gl2010.zip > 2010.csv && \
      rm gl2010.zip
#+end_src

Next, the =stat-headers= Python code block returns a list of the names of the offensive statistics that will be tested for correlation with attendance.

#+source: stat-headers
#+headers: :exports none
#+begin_src python :results list :cache yes :return fields
  # Python 2 code: urllib2 was replaced by urllib.request in Python 3
  import urllib2
  url = 'http://www.retrosheet.org/gamelogs/glfields.txt'
  fp = urllib2.urlopen(url)
  fields = []
  for line in fp:
      if line.find('Visiting team offensive statistics') != -1:
          line = fp.readline()
          while line.find('Visiting team pitching statistics') == -1:
              if line[13] != ' ':
                  fields.append(line.strip().split('.')[0].split('(')[0])
              line = fp.readline()
#+end_src

#+results[97fdb2368b66e48faa6afb8b6eff34e00f05633b]: stat-headers
- at-bats
- hits
- doubles
- triples
- homeruns
- RBI
- sacrifice hits
- sacrifice flies
- hit-by-pitch
- walks
- intentional walks
- strikeouts
- stolen bases
- caught stealing
- grounded into double plays
- awarded first on catcher's interference
- left on base

** Parsing
The next two shell code blocks, =offensive-stats= and =attendance=, collect the offensive statistics and the attendance from the raw data file produced by the =raw-data= code block.
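
For readers less familiar with awk, the field selection performed by these two blocks can be sketched in Python. This is an illustration under assumptions, not the article's code: the function name =parse_game_logs= and the synthetic input row are invented here, while the field numbers (18 for attendance, 50 to 66 for the home team's offensive statistics) are those used by the awk commands.

#+begin_src python
import csv
import io

def parse_game_logs(lines):
    """Return (offensive_stats, attendance) from game-log CSV lines.

    Field numbers in the comments are 1-indexed, as in awk.
    """
    stats, attendance = [], []
    for row in csv.reader(lines):
        stats.append(row[49:66])      # awk fields $50 through $66
        attendance.append(row[17])    # awk field $18
    return stats, attendance

# Synthetic one-row input: every field holds its own 1-indexed position.
sample = io.StringIO(",".join(str(i) for i in range(1, 101)) + "\n")
stats, attendance = parse_game_logs(sample)
print(attendance)
print(stats[0][0], stats[0][-1], len(stats[0]))
#+end_src

One difference from the shell version is worth noting: =csv.reader= removes the double quotes around quoted fields and keeps commas inside them, whereas awk with =FS=","= splits on every comma and leaves the quotes in place.
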
#+source: offensive-stats
#+headers: :exports none
#+begin_src sh :var file=raw-data
  awk '{for (x=50; x<=66; x++) { printf "%s ", $x } printf "\n" }' FS="," \
      < $file
#+end_src

#+source: attendance
#+headers: :exports none
#+begin_src sh :var file=raw-data
  awk '{ print $18 }' FS="," < $file
#+end_src

** Analysis
The =analysis= code block uses the =R= statistical programming language to calculate correlations between the outputs of the =offensive-stats= and =attendance= code blocks, whose values are saved into the =stats= and =attendance= variables respectively.

#+source: analysis
#+headers: :var headers=stat-headers :var stats=offensive-stats
#+begin_src R :var attendance=attendance :exports none
  # apply the headers to the list
  colnames(stats) <- headers
  ## The following lines are required because parsing bugs are causing
  ## corrupt data in these two rows.
  badrows <- c(141, 674)
  stats <- stats[-badrows,]
  attendance <- attendance[-badrows,]
  attendance <- as.integer(attendance)
  # perform a simple correlation of each column with the attendance
  corrln <- cor(stats, attendance)
  # return the name of the most correlated column
  rownames(corrln)[which.max(corrln)]
#+end_src

The most correlated column, namely src_sh[:var stat=analysis]{echo $stat}, can be mentioned in the text using an inline code block. The Org-mode syntax for an inline block can be seen in this paragraph.

These results indicate that the fans' belief in the effect of large crowds is shared by the visiting team, which chooses to walk a dangerous home team hitter rather than take the chance that the large crowd will spur him to a potentially damaging performance.

** Display
Using gnuplot we can plot the number of forced walks and the attendance for the five games with the most forced walks (see Figure \ref{fig:top-5}).
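
The selection of those five games can be sketched in Python as follows. The names =top_games= and =fake_row= and the three sample rows are invented for illustration; the field numbers (60 for the walk count, 18 for attendance, 7 and 4 for the home and visiting teams) follow the awk usage in this article. The sketch assumes both numeric fields are present in every row, and ties keep their input order, which may differ from what =sort -rn= produces.

#+begin_src python
import csv
import io

def top_games(lines, n=5):
    """Return the n games with the most walks as (walks, attendance, label)."""
    games = []
    for row in csv.reader(lines):
        walks = int(row[59])            # awk field $60
        attendance = int(row[17])       # awk field $18
        label = row[6] + "-" + row[3]   # awk fields $7 and $4
        games.append((walks, attendance, label))
    # Stable sort on the walk count only, largest first.
    return sorted(games, key=lambda g: g[0], reverse=True)[:n]

def fake_row(visitor, home, attendance, walks):
    """Build a synthetic 66-field CSV row (hypothetical test data)."""
    row = ["0"] * 66
    row[3], row[6] = visitor, home
    row[17], row[59] = str(attendance), str(walks)
    return ",".join(row)

sample = io.StringIO("\n".join([
    fake_row("BOS", "NYA", 45000, 1),
    fake_row("CHN", "SLN", 40000, 4),
    fake_row("SFN", "LAN", 50000, 2),
]))
for game in top_games(sample, n=2):
    print(game)
#+end_src

Each returned tuple carries the same three values that the plot uses: the walk count for the first axis, the attendance for the second axis, and the team label for the tick marks.
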
#+source: top-8
#+begin_src sh :var data=raw-data :exports none
  cat $data|awk '{print $60,$18,$7"-"$4}' FS=","|sed 's/"//g'|sort -rn |head -5
#+end_src

#+source: figure
#+begin_src gnuplot :var data=top-8 :file plot.png :exports results
  # set term tikz
  # set output 'plot.tex'
  set yrange [0:6]
  set y2range [0:50000]
  set key above
  set y2tics border
  set ylabel 'forced walks'
  set y2label 'attendance'
  set style fill pattern
  set style data histogram
  set style histogram clustered
  set auto x
  set xtic rotate by -45 scale 0
  plot data using 1:xtic(3) title 'forced walks', \
       data using 2 axes x1y2 title 'attendance'
#+end_src

#+label: fig:top-5
#+attr_latex: width=0.8\textwidth
#+Caption: Top 5 games by forced walks, with forced walks and attendance shown.
#+results: figure
[[file:plot.png]]

Commingling code and prose, as demonstrated in this example, makes it possible for the author to collect all relevant information into a single place. This practice benefits the reader, who can reproduce the calculations performed in the work, and also extend the analysis, possibly within Org-mode itself. For example, the reader of this article can re-run the analysis for another season by simply changing the value of the =season= code block above and re-exporting the file.

* Conclusion
There are a number of features of Org-mode that make it a good choice for reproducible research; some of these are /essential/ for any RR tool, and others alleviate common burdens of practicing RR.

Of the /essential/ properties, arguably the most important is that as part of Emacs, the Org-mode copyright is owned by the Free Software Foundation \cite{fsf}. This ensures that Org-mode is now and will always be free and open source software. This is directly related to two of the goals of RR. First, any user can install Org-mode free of charge on any system, ensuring access to the software environment required for reproduction.
Second, the source code specifying the inner workings of Org-mode is open to inspection, ensuring that the mechanisms through which Org-mode generates scientific results are open to review and verification.

In addition to its open source pedigree, Org-mode benefits in other ways from its development as part of Emacs. Emacs is one of the most widely ported pieces of software in existence, with versions that run on all major operating systems. This ensures that Org-mode documents can be incorporated into almost any computer working environment. Emacs is also widely used by the scientific community for editing both prose documents and source code. By leveraging existing Emacs editing support, Org-mode is able to offer its users a comfortable and familiar editing environment for all types of content. Finally, due to Org-mode's implementation in the Emacs extension language, /Emacs Lisp/ \cite{elisp}, it is possible for users to customize the behavior of Org-mode to their particular needs and to add support for arbitrary new programming languages---Org-mode currently has support for over thirty programming languages.

Org-mode addresses many common problems in the practice of RR. Given that a single Org-mode document can be used for every stage of a research project, from brainstorming, through software development and experimentation, to publication, the author is largely relieved of the burden of tracking resources required for reproduction of the work. Such large amounts of information can result in extremely large files; however, the hierarchical folding of Org-mode documents enables users to comfortably read and edit such files. The files themselves are encoded in plain text, which enhances their portability and makes them easy to integrate with version control systems, allowing for revision tracking and collaboration \cite{cise-vc}.
Org-mode documents can run the gamut from simple collections of plain-text notes, to complex laboratories housing data and analysis mechanisms, to publishing desks with facilities for the display and export of scientific results.

There is a friendly community of Org-mode users and developers who communicate on the Org-mode mailing list [fn:5]; through answering questions and helping each other to master Org-mode's many features, this community helps to solve one of the largest hurdles posed by any RR tool, namely learning how to use it.

#+begin_LaTeX
  \bibliographystyle{plain}
  \bibliography{babel}
#+end_LaTeX

* COMMENT How to Export this Document
- Requirements :: Ensure that you have both recent versions of [[http://www.gnu.org/software/emacs/][Emacs]] (23 or greater) and [[http://orgmode.org/][Org-mode]] (7.5 +or greater+) installed on your system. To evaluate the code blocks in this paper the relevant programming languages must be installed on your system. These include:
  - [[http://www.python.org/][Python]]
  - [[http://www.r-project.org/][R]] and [[http://ess.r-project.org/][ESS]]
  - [[http://www.gnuplot.info/][gnuplot]] and [[http://www.emacswiki.org/emacs/GnuplotMode][gnuplot-mode]]
- Configuration :: Evaluate the following emacs-lisp code block to configure Org-mode for export of this paper.
  #+source: configuration
  #+begin_src emacs-lisp :results silent
    ;; first it is necessary to ensure that Org-mode loads support for the
    ;; languages used by code blocks in this article
    (org-babel-do-load-languages
     'org-babel-load-languages
     '((sh . t)
       (org . t)
       (emacs-lisp . t)
       (python . t)
       (R . t)
       (gnuplot .
        t)))
    ;; then we'll remove the need to confirm evaluation of each code
    ;; block, NOTE: if you are concerned about execution of malicious code
    ;; through code blocks, then comment out the following line
    (setq org-confirm-babel-evaluate nil)
    ;; finally we'll customize the default behavior of Org-mode code blocks
    ;; so that they can be used to display examples of Org-mode syntax
    (setf org-babel-default-header-args:org '((:exports . "code")))
  #+end_src
- Export :: After installing all required software the following steps can be used to export this paper to a number of different backends.
  1. Open this document in Emacs.
  2. Evaluate the "Configuration" =emacs-lisp= code block immediately preceding this item. To do so, press =C-c C-v p= to jump to the previous code block, then =C-c C-c= to evaluate it. Here =C-c= means press "c" while holding the control key, =C-v= means press "v" while holding the control key, and so forth.
  3. Next use =C-c C-e= to open the Org-mode export dialog, which displays a number of backend options and the key which should be used to export to each backend. For example, press "d" to export this document to a =.pdf= and open the resulting file in your document reader, or press "b" to export this document to =.html= and open the resulting file in your web browser.

* Footnotes

[fn:1] The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "www.retrosheet.org".

[fn:2] http://orgmode.org/manual/

[fn:3] http://orgmode.org/worg/

[fn:4] https://github.com/eschulte/CiSE/raw/master/org-mode-active-doc.org

[fn:5] http://lists.gnu.org/mailman/listinfo/emacs-orgmode