For the server and backend setup of the system, check out the `server` branch.
For the UI and front-end implementation, check out the `ui` branch.
- Lives in `ui/`
- Implemented in React using Create React App and TypeScript
- To install dependencies, run `yarn install` or `npm install` in the `ui` directory.
- To run, enter `yarn start` or `npm start` in the `ui` directory.
- Lives in `server/`
- Implemented in Python with Flask-RESTful
- To run, enter the following commands in the `server` directory: `pip install -r requirements.txt`, then `python api.py`
- To run BART to inject errors, ensure Postgres 11 is running on your machine and, in `server`, run `./BART/Bart_Engine/run.sh <path to XML egtask configuration file>`
- To run the post-analysis evaluation script, run `python eval_h.py <real | sim> <scenario | user> <scenario # | user #>` in `server`
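The endpoints follow the usual Flask-RESTful resource pattern. A minimal sketch of that pattern (the resource class and route below are illustrative, not the actual definitions in `api.py`):

```python
# Minimal Flask-RESTful sketch; the resource and route are illustrative only.
from flask import Flask
from flask_restful import Api, Resource

app = Flask(__name__)
api = Api(app)

class Scenario(Resource):
    def get(self, scenario_id):
        # Look up and return the scenario definition as JSON
        return {'scenario_id': scenario_id}, 200

api.add_resource(Scenario, '/api/scenario/<string:scenario_id>')

if __name__ == '__main__':
    app.run(debug=True)
```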
- Datasets
- Violation configuration files
- CFD discovery module (from this paper and this repo)
- Not included in the repo; make sure the same subdirectories found in the `plots/` folder in this Drive also live in the repo under `plots/`
- Contains data from individual scenario runs
- Contains information about the users (is empty at first)
- Contains top-level logic for the server
- To run: `python api.py`
- Script that builds and runs the Docker container of the backend
- Post-analysis of empirical study results
- Plots and result files are output into `plots/`
- Nearly all functions called by `api.py` live in here
- Model logic, user feedback handling, and tuple sampling live here
- Convert pickle files needed for post-analysis to JSON files for easier parsing
- Prepare scenarios before having users work through them
- This should be run before having ANY users work with the system
- Defines which scenarios in `scenario.json` will be utilized in the study
- Base master definitions for all scenarios
- Master definitions for all scenarios after preprocessing is done
- This is what the backend reads when initializing new scenarios for the user to do
- Run simulations of user interactions
- Contains various statistical test functions, e.g. the Mann-Kendall test
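For reference, a minimal self-contained sketch of the Mann-Kendall trend test (a textbook implementation without tie correction; the function name and signature are illustrative, not this module's actual API):

```python
import math

def mann_kendall(xs, alpha=0.05):
    """Two-sided Mann-Kendall trend test (no tie correction)."""
    n = len(xs)
    # S statistic: concordant pairs minus discordant pairs
    s = sum((xs[j] > xs[i]) - (xs[j] < xs[i])
            for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    # Continuity-corrected z-score against the standard normal
    z = 0.0 if s == 0 else (s - (1 if s > 0 else -1)) / math.sqrt(var_s)
    # Two-sided p-value from the standard normal CDF
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2))))
    if p < alpha:
        return ('increasing' if z > 0 else 'decreasing'), z, p
    return 'no trend', z, p
```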
- Calculates rewards for the model under four schemes: pure match, MRR (mean reciprocal rank), pure match with subset/superset credit, and MRR with subset/superset credit
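As a rough sketch, the MRR-style reward reduces to the reciprocal rank of the target hypothesis (function and argument names here are hypothetical; the subset/superset variants additionally give partial credit to related hypotheses):

```python
def mrr_reward(ranked_fds, target_fd):
    """Reciprocal-rank reward: 1/rank of the target FD in the model's
    ranked hypothesis list, 0 if absent. Illustrative sketch only;
    the subset/superset variants also credit hypotheses whose
    attribute sets are subsets/supersets of the target's.
    """
    for rank, fd in enumerate(ranked_fds, start=1):
        if fd == target_fd:
            return 1.0 / rank
    return 0.0
```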
- Stores user labeling activity in each iteration
- Updates Beta distributions of FDs based on labeling activity (for Bayesian only)
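This is the standard conjugate Beta-Bernoulli update; a minimal sketch, assuming violations the user marks count as successes and violations left unmarked count as failures (parameter names are illustrative, not the module's actual signature):

```python
def update_beta(alpha, beta, marked_vios, unmarked_vios):
    """Conjugate Beta-Bernoulli update for one FD's distribution:
    marked violations increment alpha, unmarked ones increment beta.
    """
    return alpha + marked_vios, beta + unmarked_vios
```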
- Builds a new sample to show the user in the next iteration, ensuring violations of the target and alternative hypotheses are present
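A minimal sketch of such violation-aware sampling, assuming violation pairs for the target and alternative FDs have already been computed (all names here are hypothetical; the real builder balances hypotheses more carefully):

```python
import random

def build_sample(dirty_rows, target_vio_pairs, alt_vio_pairs, size=10, seed=None):
    """Seed the sample with one violating tuple pair for the target FD
    and one for an alternative FD, then fill the rest uniformly at
    random from the remaining row ids.
    """
    rng = random.Random(seed)
    sample = set()
    for pairs in (target_vio_pairs, alt_vio_pairs):
        if pairs:
            sample.update(rng.choice(pairs))  # a pair of row ids
    remaining = [rid for rid in dirty_rows if rid not in sample]
    fill = min(max(0, size - len(sample)), len(remaining))
    sample.update(rng.sample(remaining, fill))
    return sample
```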
- Takes an FD, dirty dataset, and clean dataset, and calculates the support (i.e. how many tuples this FD applies to) and violations of the FD in the dirty dataset
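Conceptually, support and violations can be computed by grouping rows on the LHS attributes; a self-contained sketch under that assumption (names are illustrative, and the real helper also consults the clean dataset):

```python
from itertools import combinations

def support_and_vios(dirty_rows, lhs, rhs):
    """dirty_rows: dict mapping row id -> dict of attribute -> value.
    Support: every row the FD applies to. Violation: a pair of rows
    agreeing on all LHS attributes but disagreeing on some RHS attribute.
    """
    groups = {}
    for rid, row in dirty_rows.items():
        key = tuple(row[a] for a in lhs)
        groups.setdefault(key, []).append(rid)
    support = [rid for rids in groups.values() for rid in rids]
    vios = []
    for rids in groups.values():
        for r1, r2 in combinations(rids, 2):
            if any(dirty_rows[r1][a] != dirty_rows[r2][a] for a in rhs):
                vios.append((r1, r2))
    return support, vios
```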
- Helper that supports `getSupportAndVios`
- Turns an FD into a set of CFDs so that the dirty data can be parsed for violations of the patterns relevant to the FD
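One way to picture this expansion: bind each LHS pattern observed in the dirty data to its most frequent RHS pattern, yielding one constant CFD per pattern. A hypothetical sketch (names and representation are assumptions):

```python
from collections import Counter

def fd_to_cfds(dirty_rows, lhs, rhs):
    """Expand an FD into constant CFDs, one per LHS pattern seen in
    the dirty data, each bound to its most common RHS pattern.
    """
    patterns = {}
    for row in dirty_rows.values():
        key = tuple(row[a] for a in lhs)
        patterns.setdefault(key, Counter())[tuple(row[a] for a in rhs)] += 1
    # One constant CFD per LHS pattern: (pattern -> dominant RHS value)
    return {key: counts.most_common(1)[0][0]
            for key, counts in patterns.items()}
```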
- Takes output from the cfddiscovery module and ensures compositions and combinations of FDs are also added to the viable hypothesis space definition, e.g. if A → B and A → C, then make sure A → BC is also in the hypothesis space
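A sketch of that closure step, assuming FDs are represented as `(lhs, rhs)` pairs of attribute frozensets (the representation is an assumption, and a full closure would enumerate every RHS combination rather than only the total union):

```python
def close_under_rhs_union(fds):
    """For FDs sharing a LHS, also add the FD whose RHS is the union
    of their RHSs, e.g. A -> B and A -> C yield A -> BC.
    """
    by_lhs = {}
    for lhs, rhs in fds:
        by_lhs.setdefault(lhs, set()).update(rhs)
    closed = set(fds)
    for lhs, rhs_union in by_lhs.items():
        closed.add((lhs, frozenset(rhs_union)))
    return closed
```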
- Derives the initial shape parameters (α and β) for the Beta distribution of an FD using the supplied mean and variance values
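This is the standard method-of-moments fit for a Beta distribution; a minimal sketch:

```python
def beta_params_from_moments(mean, variance):
    """Method-of-moments Beta(alpha, beta) initialization:
        alpha = mean * (mean * (1 - mean) / variance - 1)
        beta  = (1 - mean) * (mean * (1 - mean) / variance - 1)
    Requires 0 < mean < 1 and variance < mean * (1 - mean).
    """
    common = mean * (1 - mean) / variance - 1
    if common <= 0:
        raise ValueError('variance too large for a Beta distribution')
    return mean * common, (1 - mean) * common
```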
- Find all violation pairs for an FD in the provided dataset
- Calculates the violations marked, violations found, and total violations with respect to an FD in the sample provided
- Converts all pickle files for a given project ID (i.e. individual scenario-user run) to JSON files
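A minimal sketch of such a conversion, assuming the pickled objects are JSON-serializable and stored as `.pkl` files in a per-project directory (paths and extensions are illustrative):

```python
import json
import pickle
from pathlib import Path

def pickles_to_json(project_dir):
    """Convert every .pkl file under a project's directory to a
    sibling .json file for easier downstream parsing.
    """
    for pk_path in Path(project_dir).glob('*.pkl'):
        with open(pk_path, 'rb') as f:
            obj = pickle.load(f)
        with open(pk_path.with_suffix('.json'), 'w') as f:
            json.dump(obj, f, indent=2, default=str)
```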
- Check if a terminating condition for the interaction has been met, i.e. if changes in user labeling trends are sufficiently small
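A sketch of one plausible criterion, stopping when labeling accuracy changes by less than a small epsilon over a window of iterations (the threshold and the exact trend measure are assumptions, not the module's actual rule):

```python
def should_terminate(accuracy_history, window=3, epsilon=0.01):
    """Return True once every step-to-step change in labeling accuracy
    over the last `window` iterations is below epsilon.
    """
    if len(accuracy_history) < window + 1:
        return False
    recent = accuracy_history[-(window + 1):]
    return all(abs(b - a) < epsilon for a, b in zip(recent, recent[1:]))
```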
- Calculate a wide variety of metrics and statistics that will be utilized in `eval_h.py` during post-analysis of empirical study results