$ pip install traintestdiff
You can find the documentation at https://traintestdiff.readthedocs.io and a Jupyter notebook in example.
traintestdiff provides a simple way to explore differences between your train, validation and test data. Its main entry point is the class TrainTestDiff, whose only argument is a dict of the datasets you would like to explore.
In this example we're going to explore the tips dataset provided by Seaborn:
import pandas as pd
import seaborn as sns
from traintestdiff import TrainTestDiff
tips = sns.load_dataset("tips")
# Let's split our data into train (80%) and test (the remaining 20%)
train = tips.sample(frac=0.8, random_state=0)
test = tips.drop(train.index)
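The split above relies on sample keeping the original index, so that drop(train.index) removes exactly the sampled rows. Here is a minimal sketch to check this. It uses a small made-up DataFrame (toy and its values are hypothetical stand-ins for tips) so it runs without downloading anything, and it assumes the index is unique:

```python
import pandas as pd

# Hypothetical stand-in for the tips dataset: 10 rows with a default RangeIndex.
toy = pd.DataFrame({
    "total_bill": [10.0, 12.5, 8.0, 20.0, 15.5, 9.0, 30.0, 11.0, 14.0, 18.0],
    "smoker": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes", "No"],
})

# Same pattern as above: sample 80% for train, use the remaining rows as test.
train = toy.sample(frac=0.8, random_state=0)
test = toy.drop(train.index)

# The two parts should be disjoint and together cover every row.
overlap = train.index.intersection(test.index)
print(len(train), len(test), len(overlap))
```

If the index contained duplicates, drop could remove more rows than intended, so resetting the index before splitting is a reasonable precaution.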
Once you have your train and test sets, you're ready to use TrainTestDiff:
datasets = {'train': train, 'test': test}
ttd = TrainTestDiff(datasets)
The two main methods are plot_cat_diff and plot_cont_diff: the first one produces a plot of categorical features, and the second one a plot of continuous features.
long_form, fig1 = ttd.plot_cat_diff(features=['smoker', 'day', 'time'])
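To give an idea of what a tidy comparison of categorical features can look like, here is a pandas-only sketch. It does not reproduce the internals of plot_cat_diff or the exact columns of the DataFrame it returns: the helper category_shares and the column names dataset, feature, category and share are assumptions made for illustration.

```python
import pandas as pd

def category_shares(datasets, features):
    """Return one row per (dataset, feature, category) with its relative frequency."""
    rows = []
    for name, df in datasets.items():
        for feature in features:
            # normalize=True turns counts into proportions that sum to 1 per feature.
            shares = df[feature].value_counts(normalize=True)
            for category, share in shares.items():
                rows.append({"dataset": name, "feature": feature,
                             "category": category, "share": share})
    return pd.DataFrame(rows)

# Tiny hypothetical train/test frames standing in for the tips split.
train = pd.DataFrame({"smoker": ["Yes", "No", "No", "No"]})
test = pd.DataFrame({"smoker": ["Yes", "Yes"]})

shares = category_shares({"train": train, "test": test}, ["smoker"])
print(shares)
```

Comparing proportions rather than raw counts keeps datasets of different sizes on the same scale.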
With plot_cont_diff we can explore the continuous features of the datasets:
longform_cont1, fig2 = ttd.plot_cont_diff(features=["total_bill", "size", "tip"], kind="box")
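A box plot summarises each feature by its median and spread, so it can help to look at the same numbers directly. The following is a pandas-only sketch under stated assumptions: the data is made up, the dataset column name is hypothetical, and this is not how plot_cont_diff computes its output.

```python
import pandas as pd

# Tiny hypothetical train/test frames with one continuous feature.
train = pd.DataFrame({"tip": [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({"tip": [2.0, 6.0]})

# Stack the datasets and record which one each row came from.
combined = pd.concat({"train": train, "test": test}, names=["dataset"])
combined = combined.reset_index(level="dataset")

# One row per dataset with a few of the statistics a box plot displays.
summary = combined.groupby("dataset")["tip"].agg(["median", "min", "max"])
print(summary)
```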
As you can see from the code, both plot_cat_diff and plot_cont_diff return two values: a pandas.core.frame.DataFrame and a matplotlib.figure.Figure.
The DataFrame lets you explore the data in a tidy format, and the Figure lets you tweak how the plot looks. For example, let's change the title:
fig1.suptitle("The same graph with other title")
fig1
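Since the returned object is a regular matplotlib Figure, other standard matplotlib calls should work on it as well. The sketch below builds a stand-in figure (fig here is hypothetical, playing the role of fig1) and uses only plain matplotlib, so nothing in it is specific to traintestdiff:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Stand-in figure; in the example above this would be fig1.
fig, ax = plt.subplots()
ax.bar(["train", "test"], [0.8, 0.2])

# Typical tweaks: title, size and axis label.
fig.suptitle("A custom title")
fig.set_size_inches(8, 4)
ax.set_ylabel("share of rows")

# Render to an in-memory PNG; pass a file name instead to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150)
```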