### Getting Started with Aequitas

Aequitas is a Python package that can be used in several ways: as an
imported module in your code or Jupyter notebooks, as a command-line
utility, or served through a web interface. To help orient you to its
interface and explore some of the concepts covered in this chapter in
more detail, we have developed a tutorial notebook around the COMPAS
(Correctional Offender Management Profiling for Alternative Sanctions)
case study described in Section [COMPAS case](#sec:compascase).^[See
https://dssg.github.io/aequitas/.]
Each of the tutorial's sections is described briefly below.
### Requirements

To work with the Aequitas tutorial, you will need a Jupyter notebook
server running a Python 3.6 (or higher) kernel with the `pandas` and
`seaborn` packages installed. In addition, you will need to install the
`aequitas` module:

```bash
pip install aequitas
```

Alternatively, you can install from source with:

```bash
git clone https://github.com/dssg/aequitas.git
cd aequitas
python setup.py install
```
### Data Preparation

Included with the tutorial notebook is a sample dataset,
`compas_for_aequitas.csv`, representing two years of COMPAS recidivism
predictions and subsequent arrest outcomes from Broward County, Florida.
The data provided here are the same data used in the ProPublica
analysis described above, lightly cleaned and structured for use with
Aequitas.

The example dataset has been preprocessed into the format required by
Aequitas. To use Aequitas with your own data, your input should be
provided at the individual entity level and contain the following
columns:
- A `score` column, which may be either a binary (0 or 1) or
  continuous (between 0.0 and 1.0) value representing the output of a
  predictive model (in the continuous case, you will need to provide a
  classification threshold for the analysis). For the example data
  included here, the COMPAS scores have been mapped to a binary score
  based on ProPublica's interpretation of Northpointe's practitioner
  guide, with 0 representing a `low` COMPAS score and 1 representing a
  `medium` or `high` score.
- A `label_value` column, provided as a binary (0 or 1) value,
representing the actual outcome for each entity. Again following
ProPublica, the example dataset defines a recidivism outcome
(`label_value` = 1) as a new arrest within two years.
- One or more attribute columns for which you would like to evaluate
  predictive fairness and disparities. Here, you can consider any
  attributes of interest for your particular application. For
  instance, the example dataset includes the columns `race`, `sex`,
  and `age_cat`, which will be used throughout the tutorial (a quick
  check of this format is sketched below).
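As a sanity check before running Aequitas, you can load the sample
dataset and confirm that it matches this format (a minimal sketch using
only `pandas`; the column names follow the description above):

```python
import pandas as pd

# Load the sample dataset shipped with the tutorial notebook
df = pd.read_csv("compas_for_aequitas.csv")

# Aequitas expects a (binary or thresholded) score and a binary label
required = {"score", "label_value"}
assert required.issubset(df.columns), "missing required columns"

# The remaining columns of interest (race, sex, age_cat) are the
# attributes defining the subgroups to audit
print(df[["score", "label_value", "race", "sex", "age_cat"]].head())
```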
As you follow along in the tutorial notebook, we start with some
descriptive data exploration to get a feel for the data. Figure
\@ref(fig:tutorial-explore) shows the large difference in the
distribution of COMPAS scores across race. The notebook guides you
through several additional initial analyses, and you should feel free
to explore the data further as well.
```{r tutorial-explore, out.width = '100%', fig.align = 'center', echo = FALSE, fig.cap = 'Data exploration screenshot from the Aequitas tutorial'}
knitr::include_graphics("ChapterBias/figures/tutorial_explore.png")
```
Applying Aequitas programmatically is a three-step process, represented
by three Python classes that are described in the following sections:

- `Group()`: Define groups
- `Bias()`: Calculate disparities
- `Fairness()`: Assert fairness
### Working with Bias Metrics
In the second section of the notebook, you will learn how to use
Aequitas to understand the distribution of your data and outcomes, as
well as measure, visualize, and interpret bias metrics of interest. To
perform these analyses, you'll make use of the `Group()` and `Plot()`
classes, which can be imported with:
```python
from aequitas.group import Group
from aequitas.plotting import Plot
```
Aequitas's `Group()` class enables researchers to evaluate biases across
all subgroups in their dataset by assembling a confusion matrix for each
subgroup and calculating commonly used metrics such as the false
positive rate and false omission rate, as well as counts by group and
group prevalence among the sample population.
Following the notebook, after constructing a `Group` object, you can use
the `Group.get_crosstabs()` method to calculate a confusion matrix for
each subgroup in your data. This method expects as input a
`pandas.DataFrame` formatted with the columns described above, and it
will infer the names and distinct values of the attribute columns
defining the subgroups in your data. Calling `get_crosstabs` with your
input data returns a `pandas.DataFrame` at the subgroup level with
confusion matrix counts and ratios. The tutorial notebook walks through
some exploration of this dataframe and how to interpret biases across
groups.
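A minimal sketch of this step (the tuple unpacking assumes that
`get_crosstabs` also returns the list of attribute columns it inferred,
as in the tutorial notebook, and the printed column names are
illustrative):

```python
from aequitas.group import Group

g = Group()

# Compute confusion-matrix counts and rates for every subgroup; the
# attribute columns (race, sex, age_cat) are inferred from df
xtab, attr_cols = g.get_crosstabs(df)

# One row per subgroup, with metric columns such as fpr and fnr
print(xtab[["attribute_name", "attribute_value", "fpr", "fnr"]])
```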
Absolute group bias metrics from the crosstabs dataframe created by the
`Group.get_crosstabs()` method can be visualized with two methods of the
Aequitas `Plot()` class. After instantiating a `Plot` object, you can
plot the results for a single metric by passing the crosstabs dataframe
and a metric name to `Plot.plot_group_metric()`, optionally specifying a
minimum group size below which small, potentially noisy groups are
excluded. The tutorial notebook walks through an example of visualizing
the false negative rates of groups using:
```python
aqp = Plot()
fnr = aqp.plot_group_metric(xtab, 'fnr', min_group_size=0.05)
```
Additionally, you can visualize several metrics at the same time in
small multiples using the `Plot.plot_group_metric_all()` method. Figure
\@ref(fig:tutorial-plot-crosstabs) shows an example of this method from
the notebook. In it, you can see that the largest `race` group,
`African American`, is predicted positive more often than any other race
group (predicted positive rate $PPR$ of 0.66) and is more likely to be
incorrectly classified as 'high' risk (false positive rate $FPR$ of
0.45) than incorrectly classified as 'low' or 'medium' risk (false
negative rate $FNR$ of 0.28).
```{r tutorial-plot-crosstabs, out.width = '100%', fig.align = 'center', echo = FALSE, fig.cap="Visualizing metrics at the same time using small multiples can help identify differences between groups."}
knitr::include_graphics("ChapterBias/figures/tutorial_plot_crosstabs.png")
```
Data exploration screenshot from the Aequitas tutorial workbook, showing the predicted positive rate $PPR$, predicted prevalence $PPrev$, false negative rate $FNR$, and false positive rate $FPR$ across subgroups in the COMPAS data.
To graph specific metrics of interest, you can pass a list to
`plot_group_metric_all()` using the `metrics` keyword. Alternatively,
you can pass `'all'` to visualize all calculated metrics, or omit the
keyword to plot the default metrics (a usage sketch follows the list):
- Predicted Prevalence (`'pprev'`)
- Predicted Positive Rate (`'ppr'`)
- False Discovery Rate (`'fdr'`)
- False Omission Rate (`'for'`)
- False Positive Rate (`'fpr'`)
- False Negative Rate (`'fnr'`)
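For instance, a panel of plots like Figure
\@ref(fig:tutorial-plot-crosstabs) can be produced with a call along
these lines (a sketch; the `ncols` and `min_group_size` keywords follow
the tutorial's usage):

```python
# Small multiples of selected absolute metrics, one panel per metric;
# groups below 5% of the sample are excluded as potentially noisy
p = aqp.plot_group_metric_all(
    xtab,
    metrics=['ppr', 'pprev', 'fnr', 'fpr'],
    ncols=4,
    min_group_size=0.05,
)
```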
You can explore these options in more detail in the tutorial notebook.
### Measuring Disparities

We use the Aequitas `Bias()` class to calculate disparities between
groups based on the crosstabs returned by the `Group.get_crosstabs()`
method described above. Disparities are calculated as the ratio of a
metric for a group of interest to the same metric for a reference
group. For example, the false negative rate disparity for black
defendants vis-à-vis white defendants is:

$$Disparity_{FNR} = \frac{FNR_{black}}{FNR_{white}}$$
Aequitas provides several options for determining the reference group
for each attribute's disparity calculations:
`Bias.get_disparity_predefined_groups()` allows you to specify the
reference groups directly, `Bias.get_disparity_major_group()` chooses
the largest group as the reference, and
`Bias.get_disparity_min_metric()` uses the group with the smallest
value of the metric being calculated. Note that in this last case,
different reference groups may be used for different metrics.
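A brief sketch of the two automatic reference-group choices (continuing
with the `xtab` and `df` objects from above; passing `original_df` here
mirrors the predefined-groups call shown below):

```python
from aequitas.bias import Bias

b = Bias()

# Reference group per attribute = the largest group in the sample
bdf_major = b.get_disparity_major_group(xtab, original_df=df)

# Reference group per metric = the group with the smallest metric
# value, so disparities are >= 1 and the reference can differ by metric
bdf_min = b.get_disparity_min_metric(xtab, original_df=df)
```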
In the tutorial notebook, you can walk through a couple of examples of
disparity calculations and see how your choice of reference groups
affects the results. Each of the `get_disparity_` methods returns a
`pandas.DataFrame` containing the results of the disparity calculations
as well as (optionally) tests of statistical significance. For
instance, the call below will calculate disparities relative to the
specified groups and determine statistical significance at a level of
$\alpha = 0.05$ (`mask_significance` means values of `True` or `False`
will be returned rather than the p-values themselves):
```python
b = Bias()
bdf = b.get_disparity_predefined_groups(
    xtab,
    original_df=df,
    ref_groups_dict={
        'race': 'Caucasian', 'sex': 'Male', 'age_cat': '25 - 45'
    },
    alpha=0.05,
    check_significance=True,
    mask_significance=True
)
```
Notice that because disparities are calculated as ratios, the reference
group will always have a disparity of 1.0. Disparity values can be
interpreted as how much more prone the model is to making a certain
type of mistake for one group relative to the reference group. For
instance, calculating the `fpr_disparity` values by race with the
COMPAS data indicates that black people are falsely identified as being
high or medium risk at 1.9 times the rate for white people, while the
`fdr_disparity` values are much closer to 1.
The Aequitas `Plot()` class provides methods to visualize the results of
your disparity calculations, with an interface similar to that of the
absolute-metric plotting methods described above.
`Plot.plot_disparity()` plots a single specified disparity metric for a
single attribute, while `Plot.plot_disparity_all()` plots multiple
disparities and attributes at once as small multiples.
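For example, a treemap like the one in Figure
\@ref(fig:tutorial-plot-disparity) can be produced with a call along
these lines (a sketch; the keyword names follow the tutorial's usage):

```python
# Treemap of FPR disparities by race; groups whose disparities are
# statistically significant at significance_alpha are starred
tm = aqp.plot_disparity(
    bdf,
    group_metric='fpr_disparity',
    attribute_name='race',
    significance_alpha=0.05,
)
```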
Figure \@ref(fig:tutorial-plot-disparity) shows an example of using the
`plot_disparity()` method from the tutorial notebook. Each disparity is
plotted as a treemap, with the size of the rectangle representing the
size of the group and color indicating the level of disparity (with
values over 1.0 in orange and those under 1.0 in blue). Notice in the
figure that the reference group is colored gray and labeled `(Ref)`.
Statistically significant disparities (at the level specified with
`significance_alpha`) will be labeled with two asterisks (\*\*), as seen
for the `African-American` group in Figure \@ref(fig:tutorial-plot-disparity).
```{r tutorial-plot-disparity, out.width = '100%', fig.align = 'center', echo = FALSE, fig.cap = 'Plotting disparity can show any significant difference, with two asterisks denoting statistical significance.'}
knitr::include_graphics("ChapterBias/figures/tutorial_plot_disparity.png")
```
Data exploration screenshot from the Aequitas tutorial workbook, showing racial disparities on the false positive rate $FPR$. Note that the reference group, Hispanic, is indicated in gray and a statistically significant disparity for African-Americans is labeled with two asterisks (\*\*).
The tutorial notebook walks through several additional examples of using
`plot_disparity()` and `plot_disparity_all()`. When visualizing more
than one disparity using `plot_disparity_all()`, you can specify a list
of disparity metrics, pass `'all'` for all disparity metrics, or use the
Aequitas default disparity metrics by not supplying an argument (a
usage sketch follows the list):
- Predicted Positive Group Rate Disparity (`pprev_disparity`)
- Predicted Positive Rate Disparity (`ppr_disparity`)
- False Discovery Rate Disparity (`fdr_disparity`)
- False Omission Rate Disparity (`for_disparity`)
- False Positive Rate Disparity (`fpr_disparity`)
- False Negative Rate Disparity (`fnr_disparity`)
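A sketch of plotting two of these disparity metrics for a single
attribute (assuming the `metrics` and `attributes` keywords, following
the tutorial's usage):

```python
# Small multiples of FPR and FNR disparity treemaps, restricted to race
m = aqp.plot_disparity_all(
    bdf,
    metrics=['fpr_disparity', 'fnr_disparity'],
    attributes=['race'],
    significance_alpha=0.05,
)
```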
### Assessing Model Fairness

Finally, the Aequitas `Fairness()` class provides three methods that
give a high-level summary of fairness. This class builds on the
dataframe returned from one of the three `Bias.get_disparity_` methods.
By specifying a threshold within which you consider disparities to meet
a reasonable level of fairness, this class allows you to evaluate
fairness at the group, attribute, and overall levels. For example,
evaluating group-level FPR fairness with the default thresholds checks:

$$0.8 < Disparity_{FPR} = \frac{FPR_{group}}{FPR_{base\_group}} < 1.25$$
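The two bounds come from a single threshold: with a fairness threshold
of $\tau = 0.8$, a disparity is considered fair when it falls between
$\tau$ and $1/\tau = 1.25$. A minimal standalone sketch of this check
(illustrative only, not the Aequitas implementation):

```python
def is_fair(disparity: float, tau: float = 0.8) -> bool:
    """Return True if a disparity ratio falls within the fairness band.

    With tau = 0.8 the band is (0.8, 1.25), matching the FPR example
    above; stricter fairness requirements correspond to tau closer to 1.
    """
    return tau < disparity < 1 / tau

# e.g., an FPR disparity of 1.9 fails, while 1.1 passes
assert not is_fair(1.9)
assert is_fair(1.1)
```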
Calling `Fairness.get_group_value_fairness()` with your bias dataframe
as an argument will return a `pandas.DataFrame` with boolean results
indicating where your fairness criteria are met for each of the
disparity metrics, as well as:
- **Type I Parity:** Fairness in both FDR Parity and FPR Parity
- **Type II Parity:** Fairness in both FOR Parity and FNR Parity
- **Equalized Odds:** Fairness in both FPR Parity and TPR Parity
- **Unsupervised Fairness:** Fairness in both Statistical Parity and
Impact Parity
- **Supervised Fairness:** Fairness in both Type I and Type II Parity
- **Overall Fairness:** Fairness across all parities for all
attributes
You can also assess whether each of these metrics meets your fairness
threshold for all groups across each attribute at the same time using
`Fairness.get_group_attribute_fairness()`. That is, this method returns
a boolean at the level of each attribute (e.g., `race`, `sex`, or
`age_cat`) indicating whether the criteria are met for every subgroup
defined by that attribute. For a further roll-up across all attributes,
you can use `Fairness.get_overall_fairness()` to see a high-level
assessment of unsupervised fairness, supervised fairness, and overall
fairness. Below is a quick example of the usage of each of these
methods, which you can explore further in the tutorial notebook:
```python
from aequitas.fairness import Fairness

f = Fairness()

# group-level fairness: input is the result of a Bias.get_disparity_ method
fdf = f.get_group_value_fairness(bdf)

# attribute-level fairness: input is the group-level result from above
gaf = f.get_group_attribute_fairness(fdf)

# overall fairness: input is the group-level result from above
gof = f.get_overall_fairness(fdf)
```
In the tutorial notebook, you can calculate these metrics using the
COMPAS data. There, the African-American false omission and false
discovery rates are within the bounds of fairness (when assessed using
Aequitas's default thresholds), which should be expected because COMPAS
is calibrated. On the other hand, African-Americans are roughly twice
as likely to have false positives and roughly 40 percent less likely to
have false negatives. In real terms, 44.8% of African-Americans who did
not recidivate were marked high or medium risk (with potential for
associated penalties), compared with 23.4% of Caucasian non-reoffenders.
This result doesn't meet the fairness threshold and thus returns
`False` in the resulting dataframe. These findings mark an inherent
trade-off between FPR Fairness, FNR Fairness, and calibration, which is
present in any decision system where base rates are not equal, as
discussed in Section [COMPAS case](#sec:compascase). Aequitas helps
bring this trade-off to the forefront with clear metrics and asks
system designers to make a reasoned decision based on their use case.
The Aequitas `Plot()` class provides four methods for visualizing the
results of these various fairness calculations (a usage sketch follows
the list):

- Using `Plot.plot_fairness_group()`, you can plot a single fairness
  metric across different groups, showing the absolute metric values.
  Example usage is shown in Figure \@ref(fig:tutorial-plot-fairness1).
- `Plot.plot_fairness_group_all()` allows you to plot small multiples
  of several fairness metrics' absolute values at the group level.
- With `Plot.plot_fairness_disparity()`, you can plot a treemap of the
  fairness results (similar to the disparity plot in Figure
  \@ref(fig:tutorial-plot-disparity)), showing the values of
  disparities relative to a base group together with the fairness
  results. Figure \@ref(fig:tutorial-plot-fairness2) shows example
  usage of this method.
- `Plot.plot_fairness_disparity_all()` allows you to plot small
  multiples of several disparity treemaps across different groups and
  metrics.
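A sketch of the first and third methods, corresponding to Figures
\@ref(fig:tutorial-plot-fairness1) and
\@ref(fig:tutorial-plot-fairness2) (the keyword names are assumed to
follow the tutorial's usage):

```python
# Absolute PPR values per group, colored green where the group passes
# the fairness criteria and red where it fails
fg = aqp.plot_fairness_group(fdf, group_metric='ppr', title=True)

# Treemap of FPR disparities by race, colored by fairness outcome
fd = aqp.plot_fairness_disparity(fdf, group_metric='fpr',
                                 attribute_name='race')
```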
```{r tutorial-plot-fairness1, out.width = '100%', fig.align = 'center', echo = FALSE, fig.cap = 'You can plot a single fairness metric across different groups, such as age category, sex, and race.'}
knitr::include_graphics("ChapterBias/figures/tutorial_plot_fairness1.png")
```
Data exploration screenshot from the Aequitas tutorial workbook,
showing the fairness results for the predicted positive rate $PPR$
across subgroups in the COMPAS data. Absolute values of $PPR$ are
plotted for each group, with bars colored by the result of applying the
fairness criteria: groups meeting the criteria are shown in green and
those not meeting them in red.
```{r tutorial-plot-fairness2, out.width = '100%', fig.align = 'center', echo = FALSE, fig.cap = 'You can plot a treemap of fairness results. Here, the disparity by race is shown.'}
knitr::include_graphics("ChapterBias/figures/tutorial_plot_fairness2.png")
```
Data exploration screenshot from the Aequitas tutorial workbook,
showing a treemap representing $FPR$ disparities across racial groups in
the COMPAS data. The size of each square represents the size of the
group, with labels indicating the disparity values and color indicating
whether these values meet specified fairness thresholds (green for those
meeting the criteria, red for those not meeting the criteria, and
reference groups shown in gray).
The graphs generated by these methods generally show the same
information as the plots of absolute metric values and disparities
described above; however, the fairness criteria are overlaid on these
plots to indicate whether each group meets the specified criteria for a
given metric (those meeting the threshold are shown in green and those
failing to meet it in red). The tutorial notebook will walk you through
several examples of using each of these methods, and you should feel
free to explore their usage further, including how the fairness results
change with the application of different thresholds.