research documents for rMarkdown #111

Merged
merged 10 commits into from
Mar 1, 2021
30 changes: 30 additions & 0 deletions data-analysis/README.md
@@ -0,0 +1,30 @@
# Literate Analysis and RMarkdown

This directory records best practices for writing literate analysis reports and using
[RMarkdown](https://rmarkdown.rstudio.com/authoring_quick_tour.html) to do it.

Literate analysis is a style of writing that keeps the text and the analysis code in one document. Its major benefits are keeping your numbers and figures
aligned with your text, consolidating your work sanely, and self-documenting
your analysis code. See [Hannah's write-up for more depth](https://source.opennews.org/articles/black-box-be-gone-tools-human-optimized-data-analy/).
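
To make the idea concrete, here is a minimal sketch in plain Python (not RMarkdown itself) of what "keeping your numbers aligned with your text" means: the figure quoted in the sentence is computed from the data when the report is rendered, not typed by hand. The data values and the `render_report` name are hypothetical and exist only for illustration.

```python
from statistics import mean

# Hypothetical data; in a real report this would come from your analysis.
case_durations_days = [12, 30, 45, 7, 21]


def render_report(durations):
    """Build the report text from the data so the prose cannot drift from the numbers."""
    average = mean(durations)
    return (
        f"We reviewed {len(durations)} cases. "
        f"The average case lasted {average:.1f} days."
    )


print(render_report(case_durations_days))
```

If the data changes, re-rendering updates the sentence automatically. RMarkdown provides the same guarantee through inline code and code chunks.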

## Contents

- README
- [Research](./research/)
  - [Comparisons with existing tools](./research/comparisons-with-existing-tools.md)
  - [Recommendation of adoption](./research/recommendation-of-adoption.md)

## When to use Literate Analysis

When you have to write code to generate figures, charts, or graphics to include in
a research report, you should write a literate analysis document.

## How to use RMarkdown for Literate Analysis

Look to the [Courts Transparency cookiecutter](https://github.com/datamade/cookiecutter-court-transparency) for inspiration in getting started.

If this is your first project, we strongly recommend using [RStudio](https://rstudio.com/), which has fabulous support for RMarkdown.

## Resources for learning

* https://rmarkdown.rstudio.com/lesson-1.html
54 changes: 54 additions & 0 deletions data-analysis/research/comparisons-with-existing-tools.md
@@ -0,0 +1,54 @@
# Comparing rMarkdown with existing tools

How does rMarkdown compare with existing tools in DataMade's stack and with possible alternatives?

## Pweave

Like rMarkdown, [Pweave](http://mpastell.com/pweave/) is an implementation of [noweb](https://en.wikipedia.org/wiki/Noweb), but one that primarily targets Python instead of R.

The main advantage of Pweave is that it is Python.

While rMarkdown does allow Python code chunks, there is typically some setup code that has to be written in R. With Pweave, it's all Python.

That is really the only advantage.

Like rMarkdown, Pweave requires an additional runtime beyond standard Python. rMarkdown requires R and Pweave requires
[IPython](https://ipython.org/).

Pweave is not actively maintained, and has not been updated
in three years.

rMarkdown has better editor support than Pweave. For the following editors, support for rMarkdown is at least as good as, and usually better
than, support for Pweave, if any Pweave support exists at all.

* [sublime](https://packagecontrol.io/packages/knitr)
* [emacs](https://ess.r-project.org/)
* [atom](http://www.goring.org/resources/atom_and_r.html)
* [vscode](https://marketplace.visualstudio.com/items?itemName=Ikuyadeu.r)

rMarkdown also has its own IDE, [RStudio](https://rstudio.com/).

Beyond active development and editor support, Pweave is missing many features that rMarkdown has. The two of greatest consequence are chunk-specific caching and support for multiple languages, particularly SQL.

Chunk-specific caching can dramatically reduce build times, which is critical for speed of development.

Our past experience suggests that SQL will be a common language in our literate reports, so first-class
support for it is valuable.

## Jupyter Notebook

Jupyter Notebooks overlap in functionality with rMarkdown. The main difference is that Notebooks are intended to be
an interactive exploration tool, while rMarkdown is intended to be a documentation and document creation tool.

I have not used Notebooks extensively, but three attributes
make them less attractive.

1. While possible, it is more difficult to generate attractive documents from Notebooks.
2. The Notebook file format is JSON rather than readable plain text and is not natively diffable by GitHub or GitLab, which makes PRs difficult to review.
3. Notebooks are primarily intended to be used interactively. Scripting them is possible, but it is
a bit of a mismatch with our ETL philosophy.
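
The diffing problem can be sketched in Python. A `.ipynb` file stores execution counts and outputs alongside the code, so re-running a notebook changes the file even when no code has changed. The notebook below is a simplified, hand-written stand-in for a real file, and the `as_lines` helper is hypothetical.

```python
import copy
import difflib
import json

# A simplified notebook with a single code cell and its stored output.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["42\n"]}
            ],
            "source": ["print(answer)"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

# Simulate re-running the notebook: the code is identical, but the
# execution count and the stored output differ.
rerun = copy.deepcopy(notebook)
rerun["cells"][0]["execution_count"] = 7
rerun["cells"][0]["outputs"][0]["text"] = ["43\n"]


def as_lines(nb):
    """Serialize a notebook the way it would appear on disk, one line per entry."""
    return json.dumps(nb, indent=1, sort_keys=True).splitlines()


diff = list(difflib.unified_diff(as_lines(notebook), as_lines(rerun), lineterm=""))
# Keep only added/removed lines, skipping the "---"/"+++" file headers.
changed = [
    line
    for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print("\n".join(changed))
```

Every changed line here is execution metadata or output, which is the noise a reviewer has to read past in a notebook PR. An rMarkdown source file contains only text and code, so its diffs show only what the author changed.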

## Manual integration

We can, and do, generate statistics and graphs in one tool and then copy the data or graphics into Google Docs or a markdown file. Sometimes this is the appropriate approach, as described in
the recommendation document.
35 changes: 35 additions & 0 deletions data-analysis/research/recommendation-of-adoption.md
@@ -0,0 +1,35 @@
# Recommendation of Adoption

We recommend RMarkdown for authoring literate research reports when all of the following conditions hold:

1. The report is for a client.
2. The report contains graphs or statistics.
3. We use code to generate the graphs or statistics. If we are doing a quick analysis in Excel because that is what a client needs, then a literate research report would not be a useful approach.

RMarkdown should be used even if the report seems like it will be quick and lightweight. Experience tells us that it is not easy to predict when an analysis will grow in complexity or when a client may return months later to ask about a detail in a quick analysis.

## Proof of concept and pilot

RMarkdown has been the tool of choice for authoring reports in the Courts project. DataMade staff familiar with Pweave have picked it up quickly, and journalists without a deep background in programming have also been able to use it successfully (within the RStudio environment).

## Prerequisite Skills

RMarkdown's interleaving of text and code adds another layer between the author and the code. As such, we advise that staff not be introduced to RMarkdown until they are familiar with the programming language they will be using in the report. If the report will depend on SQL code, the developer should be familiar with how to write and debug SQL code in the terminal or by writing SQL scripts.

If something is not working within an RMarkdown file, it's very useful to be able to work on the code in a familiar environment in order to narrow down the possible causes while debugging.

Experience with the R programming language is not a prerequisite, unless that's the language that most of the analysis will be done in.

## Maintenance outlook

It is already DataMade's experience that literate research reports are more maintainable than alternative report authoring workflows.

As for RMarkdown in particular, the long-term outlook for this tool is excellent.

1. RMarkdown is maintained by RStudio, the major commercial player in R.
2. The R community has settled on RMarkdown (and RStudio) as not just a report authoring tool, but as their notebooking tool. Any possible successor to RMarkdown will be under significant pressure to be backwards compatible.
3. RMarkdown, as a file format, is very lightweight and convertible.

## Editors

[RStudio](https://rstudio.com/) is an excellent IDE for RMarkdown. We recommend that people new to RMarkdown start with using RStudio.