Refactoring pii_redactor as its own dpk_ named module #895

Open
wants to merge 8 commits into
base: dev
Original file line number Diff line number Diff line change
@@ -9,6 +9,7 @@ RUN pip install --no-cache-dir pytest
RUN useradd -ms /bin/bash dpk
USER dpk
WORKDIR /home/dpk

ARG DPK_WHEEL_FILE_NAME

# Copy and install data processing libraries
@@ -18,20 +19,9 @@ RUN pip install data-processing-dist/${DPK_WHEEL_FILE_NAME}

# END OF STEPS destined for a data-prep-kit base image

COPY --chown=dpk:root src/ src/
COPY --chown=dpk:root pyproject.toml pyproject.toml
COPY --chown=dpk:root dpk_pii_redactor/ dpk_pii_redactor/
COPY --chown=dpk:root requirements.txt requirements.txt
RUN pip install --no-cache-dir -e .

# copy transform main() entry point to the image
COPY ./src/pii_redactor_transform_python.py .

# copy some of the samples in
COPY ./src/pii_redactor_local.py local/

# copy test
COPY test/ test/
COPY test-data/ test-data/
RUN pip install -r requirements.txt

# Set environment
ENV PYTHONPATH /home/dpk
@@ -1,37 +1,22 @@
ARG BASE_IMAGE=docker.io/rayproject/ray:2.24.0-py310

FROM ${BASE_IMAGE}

RUN pip install --upgrade --no-cache-dir pip

# install pytest
RUN pip install --no-cache-dir pytest
ARG PIP_INSTALL_EXTRA_ARGS
ARG DPK_WHEEL_FILE_NAME

# Copy and install data processing libraries
# These are expected to be placed in the docker context before this is run (see the make image).
COPY --chown=ray:users data-processing-dist data-processing-dist
RUN pip install data-processing-dist/${DPK_WHEEL_FILE_NAME}[ray]

## Copy the python version of the tansform
COPY --chown=ray:users python-transform/ python-transform/
RUN cd python-transform && pip install --no-cache-dir -e .

#COPY requirements.txt requirements.txt
#RUN pip install --no-cache-dir -r requirements.txt

COPY --chown=ray:users src/ src/
COPY --chown=ray:users pyproject.toml pyproject.toml
RUN pip install --no-cache-dir -e .

# copy the main() entry point to the image
COPY ./src/pii_redactor_transform_ray.py .

# copy some of the samples in
COPY ./src/pii_redactor_local_ray.py local/

# copy test
COPY test/ test/
COPY test-data/ test-data/
COPY --chown=ray:users dpk_pii_redactor/ dpk_pii_redactor/
COPY --chown=ray:users requirements.txt requirements.txt
RUN pip install -r requirements.txt

# Grant non-root users the necessary permissions to the ray directory
RUN chmod 755 /home/ray
18 changes: 18 additions & 0 deletions transforms/language/pii_redactor/Makefile
@@ -0,0 +1,18 @@
REPOROOT=../../..
# Use make help, to see the available rules
include $(REPOROOT)/transforms/.make.cicd.targets

#
# This is intended to be included across the Makefiles provided within
# a given transform's directory tree, so must use compatible syntax.
#
################################################################################
# This defines the name of the transform and is used to match against
# expected files and is used to define the transform's image name.
TRANSFORM_NAME=$(shell basename `pwd`)

################################################################################


publish:
@echo "Skip... do nothing! pushing CI/CD over a cliff with OSError on text_encoder "
79 changes: 0 additions & 79 deletions transforms/language/pii_redactor/Makefile.disable

This file was deleted.

112 changes: 102 additions & 10 deletions transforms/language/pii_redactor/README.md
@@ -1,13 +1,105 @@


# PII Redactor Transform

* [python](python/README.md) - provides the base python-based transformation
implementation.
* [ray](ray/README.md) - enables the running of the base python transformation
in a Ray runtime
* [kfp](kfp_ray/README.md) - enables running the ray docker image
in a kubernetes cluster using a generated `yaml` file.
<!-- Consider commenting out since we do not have a spark transform for this.
* [spark](spark/README.md) - enables the running of a spark-based transformation
in a Spark runtime.
-->
This transform redacts Personally Identifiable Information (PII) from the input data.

The transform leverages the [Microsoft Presidio SDK](https://microsoft.github.io/presidio/) for PII detection and uses the Flair recognizer for entity recognition.


## Contributors

- Sowmya.L.R ([email protected])


### Supported Entities

The transform detects the following PII entities by default:
- **PERSON**: Names of individuals
- **EMAIL_ADDRESS**: Email addresses
- **ORGANIZATION**: Names of organizations
- **DATE_TIME**: Dates and times
- **PHONE_NUMBER**: Phone numbers
- **CREDIT_CARD**: Credit card numbers

To change which entities are detected, pass the required entities with the **--pii_redactor_entities** argument.
For the full list of supported entity types, see the Presidio [supported entities](https://microsoft.github.io/presidio/supported_entities/) page.

### Redaction Techniques

Two redaction techniques are supported:
- **replace**: Replaces detected PII with a placeholder (default)
- **redact**: Removes the detected PII from the text

You can choose the redaction technique with the **--pii_redactor_operator** argument.
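
The difference between the two techniques can be sketched in plain Python. This is only an illustration, not the transform's implementation: the `apply_operator` helper and the hard-coded character span are assumptions made for the example, and in the transform the detection and anonymization are handled through Presidio.

```python
from typing import List, Tuple

# (start, end, entity_type) character spans. In the real transform these would
# come from the PII analyzer; here they are hard-coded for illustration.
Span = Tuple[int, int, str]


def apply_operator(text: str, spans: List[Span], operator: str = "replace") -> str:
    """Apply a redaction operator to every detected span in `text` (hypothetical helper)."""
    if operator not in ("replace", "redact"):
        raise ValueError(f"unsupported operator: {operator}")
    # Work right to left so earlier offsets stay valid after each edit.
    for start, end, entity in sorted(spans, reverse=True):
        filler = f"<{entity}>" if operator == "replace" else ""
        text = text[:start] + filler + text[end:]
    return text


text = "My name is John Doe"
spans = [(11, 19, "PERSON")]  # covers "John Doe"
replaced = apply_operator(text, spans, "replace")
redacted = apply_operator(text, spans, "redact")
print(replaced)
print(repr(redacted))
```

With `replace` the span is swapped for a placeholder naming the entity type, while `redact` drops the span and leaves the surrounding text (including the space before it) in place.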

## Input and Output

### Input

The input data should be a `pyarrow.Table` with a column containing the text to which PII detection and redaction will be applied. By default, this column is named `contents`.

**Example Input Table Structure:** (Table 1: sample input to the PII redactor transform)

| contents | doc_id |
|---------------------|--------|
| My name is John Doe | doc001 |
| I work at apple | doc002 |


### Output

The output table includes the original columns plus two additional columns: a column holding the redacted text (`new_contents` in the examples here; the name is configurable) and a `detected_pii` column listing the types of PII entities detected in that document.

**Example Output Table Structure for replace operator:**

| contents | doc_id | new_contents | detected_pii |
|---------------------|--------|--------------------------|------------------|
| My name is John Doe | doc001 | My name is `<PERSON>` | `[PERSON]` |
| I work at apple | doc002 | I work at `<ORGANIZATION>` | `[ORGANIZATION]` |

When the `redact` operator is chosen, the output looks like this:

**Example Output Table Structure for redact operator**

| contents | doc_id | new_contents | detected_pii |
|---------------------|--------|--------------------------|------------------|
| My name is John Doe | doc001 | My name is | `[PERSON]` |
| I work at apple | doc002 | I work at | `[ORGANIZATION]` |
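
The shape of the output can be sketched as follows. To keep the example dependency-free, a plain dict of columns stands in for the Arrow table, and `add_redaction_columns` is a hypothetical helper written for this illustration, not a function of the transform.

```python
from typing import Dict, List

# A plain dict of equal-length columns stands in for the Arrow table here.
Table = Dict[str, list]


def add_redaction_columns(
    table: Table,
    redacted_texts: List[str],
    detected_types: List[List[str]],
    output_column: str = "new_contents",
) -> Table:
    """Return a copy of `table` with the redacted-text and `detected_pii` columns appended."""
    num_rows = len(table["contents"])
    if len(redacted_texts) != num_rows or len(detected_types) != num_rows:
        raise ValueError("every new column needs one value per input row")
    result = dict(table)  # original columns are kept unchanged
    result[output_column] = redacted_texts
    result["detected_pii"] = detected_types
    return result


source = {
    "contents": ["My name is John Doe", "I work at apple"],
    "doc_id": ["doc001", "doc002"],
}
output = add_redaction_columns(
    source,
    ["My name is <PERSON>", "I work at <ORGANIZATION>"],
    [["PERSON"], ["ORGANIZATION"]],
)
print(list(output))
```

The original columns pass through untouched, and the name of the redacted-text column is a parameter, mirroring the configurable column name described above.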

### Launched Command Line Options
The following command line arguments are available in addition to
the options provided by
the [python launcher](../../../data-processing-lib/doc/python-launcher-options.md).

```
  --pii_redactor_entities PII_ENTITIES
                        List of PII entities to be captured, for example: ["PERSON", "EMAIL_ADDRESS"]
  --pii_redactor_operator REDACTOR_OPERATOR
                        Redaction technique to apply. Two techniques are supported: replace (default) and redact
  --pii_redactor_transformed_contents PII_TRANSFORMED_CONTENT_COLUMN_NAME
                        Name of the column to which the transformed contents will be added. This argument is required.
  --pii_redactor_score_threshold SCORE_THRESHOLD
                        Minimum confidence score required for a detected entity to be considered a match.
                        Provide a value above 0.6
```
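
As a rough illustration of how these options fit together, the sketch below rebuilds them with the standard-library `argparse` module. It is not the transform's own argument handling: the parser, the `keep_detection` helper, the list-literal parsing of the entities value, and the `0.6` default threshold are all assumptions made for the example.

```python
import argparse
import ast

# Entities listed as the defaults in the "Supported Entities" section.
DEFAULT_ENTITIES = [
    "PERSON", "EMAIL_ADDRESS", "ORGANIZATION",
    "DATE_TIME", "PHONE_NUMBER", "CREDIT_CARD",
]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the four documented options."""
    parser = argparse.ArgumentParser()
    # Assumption: the entity list is passed as a Python/JSON-style list literal.
    parser.add_argument("--pii_redactor_entities", type=ast.literal_eval,
                        default=DEFAULT_ENTITIES)
    parser.add_argument("--pii_redactor_operator",
                        choices=["replace", "redact"], default="replace")
    parser.add_argument("--pii_redactor_transformed_contents", required=True)
    # Assumption: 0.6 is used as the default only for this sketch.
    parser.add_argument("--pii_redactor_score_threshold", type=float, default=0.6)
    return parser


def keep_detection(entity: str, score: float, args: argparse.Namespace) -> bool:
    """Keep a detection only if its type was requested and its score meets the threshold."""
    return (entity in args.pii_redactor_entities
            and score >= args.pii_redactor_score_threshold)


args = build_parser().parse_args([
    "--pii_redactor_entities", '["PERSON", "EMAIL_ADDRESS"]',
    "--pii_redactor_transformed_contents", "new_contents",
    "--pii_redactor_score_threshold", "0.7",
])
print(args.pii_redactor_operator, args.pii_redactor_entities)
print(keep_detection("PERSON", 0.85, args))
```

A higher threshold trades recall for precision: low-confidence matches are dropped, so fewer spans are redacted.
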
## PII Redactor Ray Transform
Please see the set of
[transform project conventions](../../README.md#transform-project-conventions)
for details on general project conventions, transform configuration,
testing and IDE set up.

## Summary
This project wraps the pii redactor transform with a Ray runtime.

### Launched Command Line Options
In addition to those available to the transform as defined here,
the set of
[ray launcher options](../../../data-processing-lib/doc/ray-launcher-options.md) are available.

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
@@ -0,0 +1 @@
from .transform import *
@@ -12,7 +12,7 @@
import os

from data_processing.data_access import DataAccessLocal
from pii_redactor_transform import (
from dpk_pii_redactor.transform import (
PIIRedactorTransform,
doc_transformed_contents_key,
supported_entities_key,
@@ -15,8 +15,8 @@

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from pii_redactor_transform import doc_transformed_contents_cli_param
from pii_redactor_transform_python import PIIRedactorPythonTransformConfiguration
from dpk_pii_redactor.transform import doc_transformed_contents_cli_param
from dpk_pii_redactor.transform_python import PIIRedactorPythonTransformConfiguration


# create parameters
@@ -12,7 +12,7 @@
import logging

import spacy
from flair_recognizer import FlairRecognizer
from dpk_pii_redactor.flair_recognizer import FlairRecognizer
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider

Empty file.
@@ -15,7 +15,7 @@

from data_processing.utils import ParamsUtils
from data_processing_ray.runtime.ray import RayTransformLauncher
from pii_redactor_transform_ray import PIIRedactorRayTransformConfiguration
from dpk_pii_redactor.ray.transform import PIIRedactorRayTransformConfiguration


# create parameters
@@ -15,7 +15,7 @@

from data_processing.utils import ParamsUtils
from data_processing_ray.runtime.ray import RayTransformLauncher
from pii_redactor_transform_ray import PIIRedactorRayTransformConfiguration
from dpk_pii_redactor.ray.transform import PIIRedactorRayTransformConfiguration


print(os.environ)
@@ -24,7 +24,7 @@
from data_processing_ray.runtime.ray.runtime_configuration import (
RayTransformRuntimeConfiguration,
)
from pii_redactor_transform import PIIRedactorTransformConfiguration
from dpk_pii_redactor.transform import PIIRedactorTransformConfiguration


logger = get_logger(__name__)
@@ -21,8 +21,8 @@
import pyarrow as pa
from data_processing.transform import AbstractTableTransform, TransformConfiguration
from data_processing.utils import CLIArgumentProvider, TransformUtils, get_logger
from pii_analyzer import PIIAnalyzerEngine
from pii_anonymizer import PIIAnonymizer
from dpk_pii_redactor.pii_analyzer import PIIAnalyzerEngine
from dpk_pii_redactor.pii_anonymizer import PIIAnonymizer


short_name = "pii_redactor"
@@ -13,7 +13,7 @@
PythonTransformRuntimeConfiguration,
)
from data_processing.utils import get_logger
from pii_redactor_transform import PIIRedactorTransformConfiguration
from dpk_pii_redactor.transform import PIIRedactorTransformConfiguration


log = get_logger(__name__)