PIICatcher is a scanner for PII and PHI. It finds PII in your databases and file systems and tracks critical data. PIICatcher uses two techniques to detect PII:
- Match regular expressions with column names
- Match regular expressions and use NLP libraries against sample data in columns.
Read more in the blog post on both these strategies.
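The two strategies can be illustrated with a minimal, self-contained sketch. It is not PIICatcher's implementation: the pattern tables and the function names `detect_from_name` and `detect_from_samples` are hypothetical, and the NLP half of the second strategy is left out to keep the example dependency-free.

```python
import re

# Hypothetical patterns for illustration only; not PIICatcher's rule set.
COLUMN_NAME_PATTERNS = {
    "EMAIL": re.compile(r"e?mail", re.IGNORECASE),
    "PHONE": re.compile(r"phone|mobile", re.IGNORECASE),
}
DATA_PATTERNS = {
    "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "PHONE": re.compile(r"^\+?\d[\d\s().-]{6,}\d$"),
}


def detect_from_name(column_name):
    """Strategy 1: match regular expressions against the column name."""
    for pii_type, pattern in COLUMN_NAME_PATTERNS.items():
        if pattern.search(column_name):
            return pii_type
    return None


def detect_from_samples(samples):
    """Strategy 2: match regular expressions against sampled column values."""
    for value in samples:
        for pii_type, pattern in DATA_PATTERNS.items():
            if pattern.match(value):
                return pii_type
    return None


print(detect_from_name("user_email"))
print(detect_from_samples(["n/a", "jane@example.com"]))
```

Matching on names is cheap because it never reads rows, while matching on sampled data can catch PII stored in columns with uninformative names such as `a` or `b`.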
PIICatcher is batteries-included with a growing set of plugins to scan column metadata as well as column data. For example, piicatcher_spacy uses spaCy to detect PII in column data.
PIICatcher supports incremental scans, which scan only columns that are new or have not yet been scanned. This makes scans easy to schedule. It also provides options to include or exclude schemas and tables to manage compute resources.
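As a rough sketch of how incremental scanning and include/exclude filters might combine, the hypothetical helper below picks the tables a scan would visit. The name `select_tables` and its parameters are assumptions made for illustration and are not part of PIICatcher's API.

```python
import re


def select_tables(tables, include=None, exclude=None, already_scanned=frozenset()):
    """Return the tables a scan would visit under the given filters.

    Hypothetical helper: `include` and `exclude` are lists of regular
    expressions, and `already_scanned` holds names seen by earlier scans.
    """
    selected = []
    for name in tables:
        if name in already_scanned:
            continue  # incremental: skip tables scanned in an earlier run
        if include and not any(re.search(p, name) for p in include):
            continue  # an include list is given and nothing in it matches
        if exclude and any(re.search(p, name) for p in exclude):
            continue  # explicitly excluded
        selected.append(name)
    return selected


tables = ["users", "orders", "tmp_users"]
print(select_tables(tables, include=["users"], exclude=["^tmp_"],
                    already_scanned={"orders"}))
```

Applying the filters before any data is sampled is what saves compute: excluded or previously scanned tables are never queried.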
Ingestion functions for both Datahub and Amundsen tag columns and tables as PII and record the type of PII.
- AWS Glue & Lake Formation Privilege Analyzer for an example of how piicatcher is used in production.
- Two strategies to scan data warehouses
PIICatcher is available as a docker image or command-line application.
Docker:
alias piicatcher='docker run -v ${HOME}/.config/tokern:/config -u $(id -u ${USER}):$(id -g ${USER}) -it --add-host=host.docker.internal:host-gateway tokern/piicatcher:latest'
PyPI:
# Install development libraries for compiling dependencies.
# On Amazon Linux
sudo yum install mysql-devel gcc gcc-devel python-devel
python3 -m venv .env
source .env/bin/activate
pip install piicatcher
# Install Spacy plugin
pip install piicatcher_spacy
# add a sqlite source
piicatcher catalog add_sqlite --name sqldb --path '/db/sqldb'
# run piicatcher on a sqlite db and print report to console
piicatcher detect --source-name sqldb
╭─────────────┬─────────────┬─────────────┬─────────────╮
│ schema │ table │ column │ has_pii │
├─────────────┼─────────────┼─────────────┼─────────────┤
│ main │ full_pii │ a │ 1 │
│ main │ full_pii │ b │ 1 │
│ main │ no_pii │ a │ 0 │
│ main │ no_pii │ b │ 0 │
│ main │ partial_pii │ a │ 1 │
│ main │ partial_pii │ b │ 0 │
╰─────────────┴─────────────┴─────────────┴─────────────╯
from dbcat.api import open_catalog, add_postgresql_source
from piicatcher.api import scan_database
# PIICatcher uses a catalog to store its state.
# The easiest option is to use a sqlite memory database.
# For production usage, see https://tokern.io/docs/data-catalog
catalog = open_catalog(app_dir='/tmp/.config/piicatcher', path=':memory:', secret='my_secret')
with catalog.managed_session:
    # Add a postgresql source
    source = add_postgresql_source(catalog=catalog, name="pg_db", uri="127.0.0.1", username="piiuser",
                                   password="p11secret", database="piidb")
    output = scan_database(catalog=catalog, source=source)
print(output)
# Example Output
[['public', 'sample', 'gender', 'PiiTypes.GENDER'],
['public', 'sample', 'maiden_name', 'PiiTypes.PERSON'],
['public', 'sample', 'lname', 'PiiTypes.PERSON'],
['public', 'sample', 'fname', 'PiiTypes.PERSON'],
['public', 'sample', 'address', 'PiiTypes.ADDRESS'],
['public', 'sample', 'city', 'PiiTypes.ADDRESS'],
['public', 'sample', 'state', 'PiiTypes.ADDRESS'],
['public', 'sample', 'email', 'PiiTypes.EMAIL']]
PIICatcher can be extended by creating new detectors. PIICatcher supports two scanning techniques:
- Metadata
- Data
Plugins can be created for either of these two techniques. Plugins are then registered using an API or using Python Entry Points.
To create a new detector, simply create a new class that inherits from MetadataDetector or DatumDetector.
In the new class, define a function detect that will return a PIIType.
If you are detecting a new PII type, then you can define a new class that inherits from PIIType.
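The shape of such a plugin can be sketched as follows. To keep the example runnable on its own, `PIIType` and `MetadataDetector` are defined here as stand-ins; in a real plugin they would be imported from PIICatcher. The `detect` signature (taking a column name as a string), the `LoyaltyCard` type, and the detector itself are assumptions for illustration, so check the plugin docs for the actual base classes and arguments.

```python
from typing import Optional


# Stand-ins so the sketch runs by itself; a real plugin would import the
# base classes from PIICatcher instead of defining them.
class PIIType:
    name = "PII"


class MetadataDetector:
    def detect(self, column_name: str) -> Optional[PIIType]:
        raise NotImplementedError


# A new PII type, defined by inheriting from PIIType (hypothetical example).
class LoyaltyCard(PIIType):
    name = "LoyaltyCard"


# A new metadata detector: it inspects only the column name, not the data.
class LoyaltyCardDetector(MetadataDetector):
    def detect(self, column_name: str) -> Optional[PIIType]:
        if "loyalty" in column_name.lower():
            return LoyaltyCard()
        return None


detector = LoyaltyCardDetector()
match = detector.detect("Loyalty_Card_No")
print(match.name if match else None)
```

A detector that inherits from DatumDetector would follow the same pattern but inspect sampled values rather than the column name.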
For detailed documentation, check piicatcher plugin docs.
PIICatcher supports the following databases:
- Sqlite3 v3.24.0 or greater
- MySQL 5.6 or greater
- PostgreSQL 9.4 or greater
- AWS Redshift
- AWS Athena
- Snowflake
For advanced usage, refer to the PIICatcher Documentation.
Please take this survey if you use or are considering using PIICatcher. The responses will help prioritize improvements to the project.
For contribution guidelines, see the PIICatcher Developer documentation.