Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
bring back validate_great_expectations
1 parent af65ec0, commit 6849c44
Showing 8 changed files with 1,590 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
# Great Expectations Validation | ||
![Great Expectations Logo](doc/great-expectations-logo-full-size.png) | ||
|
||
Run data validation with Great Expectations. This function validates a given dataset against a given expectation suite, then logs the validation result and the output HTML data doc in MLRun. | ||
|
||
## Prerequisites | ||
|
||
See [1_set_expectations.ipynb](1_set_expectations.ipynb) for a full example. | ||
|
||
- Initialized a Great Expectations project | ||
- Configured at least one Datasource, e.g. `my_datasource` | ||
- Created at least one Expectation Suite, e.g. `my_suite` | ||
- Created a Checkpoint, e.g. `my_checkpoint` | ||
|
||
## Usage | ||
|
||
See [2_validate_expectations.ipynb](2_validate_expectations.ipynb) for a full example. | ||
|
||
```python | ||
import mlrun | ||
|
||
fn = mlrun.import_function("hub://validate_great_expectations") | ||
run = fn.run( | ||
inputs={"data": "https://s3.wasabisys.com/iguazio/data/iris/iris.data.raw.csv"}, | ||
params={ | ||
"expectation_suite_name": "test_suite", | ||
"data_asset_name": "iris_dataset", | ||
}, | ||
) | ||
``` | ||
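Once the run finishes, the handler logs a `validated` result and a `validation_results` artifact. The following is a minimal sketch of summarizing them, assuming the run's outputs are available as a dict-like mapping (as `run.outputs` is expected to be in MLRun). The helper name `summarize_validation` and the example artifact URI are hypothetical and used only for illustration.

```python
def summarize_validation(outputs: dict) -> str:
    """Build a one-line summary from a run's outputs mapping."""
    validated = outputs.get("validated")
    if validated is None:
        return "validation result missing"
    # Fall back to a placeholder when no data doc artifact was logged.
    doc = outputs.get("validation_results", "<no data doc logged>")
    status = "passed" if validated else "failed"
    return f"validation {status}; data doc: {doc}"


# Stand-in for `run.outputs`; the URI below is a made-up example value.
example_outputs = {
    "validated": False,
    "validation_results": "store://artifacts/demo/validation_results",
}
print(summarize_validation(example_outputs))
```

Checking `validated` explicitly matters because the job itself may complete successfully even when the data fails its expectations.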
|
||
## All Configuration | ||
Inputs | ||
```rst | ||
:param data: Data to validate. Can be local or remote link. | ||
``` | ||
|
||
Parameters | ||
```rst | ||
:param expectation_suite_name: Name of expectation suite to validate against. | ||
:param data_asset_name: Name of dataset in Great Expectations. | ||
:param datasource_name: Name of datasource to use for validation. | ||
:param data_connector_name: Name of data connector to use for validation. | ||
:param datasource_config: Full configuration for datasource. For use with custom | ||
data sources other than the default pandas datasource. | ||
:param batch_identifiers: Custom metadata for identifying particular batches of | ||
data. For use when not using the default batch identifiers. | ||
:param root_directory: Path to underlying Great Expectations project. Defaults to | ||
/v3io/projects/<project>/great_expectations if not specified. | ||
:param checkpoint_name: Name of checkpoint to use for validation. | ||
:param checkpoint_config: Full configuration for checkpoint. For use with custom | ||
checkpoint config other than the default. | ||
``` |
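Several of the optional parameters fall back to derived values when omitted. The sketch below mirrors the fallback logic in `validate_expectations` as a standalone function so the effective values are easy to inspect. `resolve_defaults` is a hypothetical helper written for illustration, not part of the function itself, and the project name `demo` is an assumed example.

```python
from typing import Optional


def resolve_defaults(
    project: str,
    data_asset_name: str,
    root_directory: Optional[str] = None,
    checkpoint_name: Optional[str] = None,
    batch_identifiers: Optional[dict] = None,
) -> dict:
    """Return the values the handler would use for omitted parameters."""
    return {
        # Default Great Expectations project location for the MLRun project.
        "root_directory": root_directory
        or f"/v3io/projects/{project}/great_expectations",
        # Default checkpoint name is derived from the data asset name.
        "checkpoint_name": checkpoint_name or f"{data_asset_name}_checkpoint",
        # Matches the identifier declared by the default data connector.
        "batch_identifiers": batch_identifiers
        or {"default_identifier_name": "default_identifier"},
    }


resolved = resolve_defaults(project="demo", data_asset_name="iris_dataset")
for key, value in resolved.items():
    print(f"{key}: {value}")
```

As in the handler, the fallback is based on truthiness, so an empty string or empty dict is treated the same as an omitted value.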
Binary file added: validate_great_expectations/doc/great-expectations-logo-full-size.png (+62.8 KB)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,170 @@ | ||
kind: job | ||
metadata: | ||
name: validate-great-expectations | ||
tag: '' | ||
hash: 82d0b647d443eb6e643d9dbfc8c0a650d74da018 | ||
project: '' | ||
labels: | ||
author: nicks | ||
framework: great-expectations | ||
categories: | ||
- data-validation | ||
- data-analysis | ||
spec: | ||
command: '' | ||
args: [] | ||
image: '' | ||
build: | ||
functionSourceCode: import os
import shutil

import mlrun

from great_expectations.core.batch import RuntimeBatchRequest
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import (
    DataContextConfig,
    FilesystemStoreBackendDefaults,
)
from great_expectations.checkpoint.types.checkpoint_result import CheckpointResult


def get_default_datasource_config(
    datasource_name: str, data_connector_name: str
) -> dict:
    """
    Convenience function to get the default pandas datasource config
    for use in validating expectations.

    :param datasource_name:     Name of datasource.
    :param data_connector_name: Name of data connector.

    :returns: Configuration for default datasource.
    """
    default_datasource_config = {
        "name": f"{datasource_name}",
        "class_name": "Datasource",
        "module_name": "great_expectations.datasource",
        "execution_engine": {
            "module_name": "great_expectations.execution_engine",
            "class_name": "PandasExecutionEngine",
        },
        "data_connectors": {
            f"{data_connector_name}": {
                "class_name": "RuntimeDataConnector",
                "module_name": "great_expectations.datasource.data_connector",
                "batch_identifiers": ["default_identifier_name"],
            },
        },
    }
    return default_datasource_config


def get_default_checkpoint_config(checkpoint_name: str) -> dict:
    """
    Convenience function to get the default checkpoint config for
    use in validating expectations.

    :param checkpoint_name: Name of checkpoint.

    :returns: Configuration for default checkpoint.
    """
    return {
        "name": checkpoint_name,
        "config_version": 1.0,
        "class_name": "SimpleCheckpoint",
        "run_name_template": "%Y%m%d-%H%M%S-my-run-name-template",
    }


def get_data_doc_path(checkpoint_result: CheckpointResult) -> str:
    """
    Convenience function to get the path of the output
    data doc from a checkpoint result.

    :param checkpoint_result: Great Expectations checkpoint result.

    :returns: Absolute path to new data doc.
    """
    result_id = checkpoint_result.list_validation_result_identifiers()[0]
    data_doc_path = checkpoint_result["run_results"][result_id]["actions_results"][
        "update_data_docs"
    ]["local_site"]
    data_doc_path = data_doc_path.replace("file://", "")
    return data_doc_path


def validate_expectations(
    context: mlrun.MLClientCtx,
    data: mlrun.DataItem,
    expectation_suite_name: str,
    data_asset_name: str,
    datasource_name: str = "pandas_datasource",
    data_connector_name: str = "default_runtime_data_connector_name",
    datasource_config: dict = None,
    batch_identifiers: dict = None,
    root_directory: str = None,
    checkpoint_name: str = None,
    checkpoint_config: dict = None,
) -> None:
    """
    Main function to validate an input dataset, datasource, data connector,
    and expectation suite.

    Runs the Great Expectations validation and logs
    whether the validation was a success as well as the output page
    of the data docs.

    :param context:                MLRun context.
    :param data:                   Data to validate. Can be local or remote link.
    :param expectation_suite_name: Name of expectation suite to validate against.
    :param data_asset_name:        Name of dataset in Great Expectations.
    :param datasource_name:        Name of datasource to use for validation.
    :param data_connector_name:    Name of data connector to use for validation.
    :param datasource_config:      Full configuration for datasource. For use with custom
                                   data sources other than the default pandas datasource.
    :param batch_identifiers:      Custom metadata for identifying particular batches of
                                   data. For use when not using the default batch identifiers.
    :param root_directory:         Path to underlying Great Expectations project. Defaults to
                                   /v3io/projects/<project>/great_expectations if not specified.
    :param checkpoint_name:        Name of checkpoint to use for validation.
    :param checkpoint_config:      Full configuration for checkpoint. For use with custom
                                   checkpoint config other than the default.
    """

    # Get data
    df = data.as_df()

    # Use default root directory for project if not specified
    root_directory = (
        root_directory
        if root_directory
        else f"/v3io/projects/{context.project}/great_expectations"
    )

    # Load great expectations context
    ge_context = BaseDataContext(
        project_config=DataContextConfig(
            store_backend_defaults=FilesystemStoreBackendDefaults(
                root_directory=root_directory
            )
        )
    )

    # Look up the expectation suite up front so a missing suite fails early
    ge_context.get_expectation_suite(expectation_suite_name=expectation_suite_name)

    # Add default data source if not specified
    datasource_config = (
        datasource_config
        if datasource_config
        else get_default_datasource_config(datasource_name, data_connector_name)
    )
    ge_context.add_datasource(**datasource_config)

    # Get data batch
    batch_identifiers = (
        batch_identifiers
        if batch_identifiers
        else {"default_identifier_name": "default_identifier"}
    )
    batch_request = RuntimeBatchRequest(
        datasource_name=datasource_name,
        data_connector_name=data_connector_name,
        data_asset_name=data_asset_name,
        runtime_parameters={"batch_data": df},
        batch_identifiers=batch_identifiers,
    )

    # Get validator
    validator = ge_context.get_validator(
        batch_request=batch_request,
        expectation_suite_name=expectation_suite_name,
    )

    # Use default checkpoint name and config if not specified
    checkpoint_name = (
        checkpoint_name if checkpoint_name else f"{data_asset_name}_checkpoint"
    )
    checkpoint_config = (
        checkpoint_config
        if checkpoint_config
        else get_default_checkpoint_config(checkpoint_name)
    )

    # Add checkpoint
    ge_context.add_checkpoint(**checkpoint_config)

    # Run expectation suite on checkpoint
    checkpoint_result = ge_context.run_checkpoint(
        checkpoint_name=checkpoint_name,
        validations=[
            {
                "batch_request": batch_request,
                "expectation_suite_name": expectation_suite_name,
            }
        ],
    )

    # Log success
    context.log_result("validated", checkpoint_result["success"])

    # Log data doc
    data_doc_path = get_data_doc_path(checkpoint_result)
    context.log_artifact("validation_results", target_path=data_doc_path)
 | ||
base_image: mlrun/mlrun | ||
commands: | ||
- python -m pip install great-expectations==0.15.41 | ||
code_origin: https://github.com/igz-us-sales/functions.git#c7b44af35294494a531a014f3d02a28eff3f4105:/User/functions/validate_great_expectations/validate_great_expectations.py | ||
origin_filename: /User/functions/validate_great_expectations/validate_great_expectations.py | ||
entry_points: | ||
get_default_datasource_config: | ||
name: get_default_datasource_config | ||
doc: 'Convenience function to get the default pandas datasource config | ||
for use in validating expectations.' | ||
parameters: | ||
- name: datasource_name | ||
type: str | ||
doc: Name of datasource. | ||
default: '' | ||
- name: data_connector_name | ||
type: str | ||
doc: Name of data connector. | ||
default: '' | ||
outputs: | ||
- default: '' | ||
doc: Configuration for default datasource. | ||
type: dict | ||
lineno: 15 | ||
get_default_checkpoint_config: | ||
name: get_default_checkpoint_config | ||
doc: 'Convenience function to get the default checkpoint config for | ||
use in validating expectations.' | ||
parameters: | ||
- name: checkpoint_name | ||
type: str | ||
doc: Name of checkpoint. | ||
default: '' | ||
outputs: | ||
- default: '' | ||
doc: Configuration for default checkpoint. | ||
type: dict | ||
lineno: 46 | ||
get_data_doc_path: | ||
name: get_data_doc_path | ||
doc: 'Convenience function to get the path of the output | ||
data doc from a checkpoint result.' | ||
parameters: | ||
- name: checkpoint_result | ||
type: CheckpointResult | ||
doc: Great Expectations checkpoint result. | ||
default: '' | ||
outputs: | ||
- default: '' | ||
doc: Absolute path to new data doc. | ||
type: str | ||
lineno: 63 | ||
validate_expectations: | ||
name: validate_expectations | ||
doc: 'Main function to validate an input dataset, datasource, data connector, | ||
and expectation suite. | ||
Runs the Great Expectations validation and logs | ||
whether the validation was a success as well as the output page | ||
of the data docs.' | ||
parameters: | ||
- name: context | ||
type: MLClientCtx | ||
doc: MLRun context. | ||
default: '' | ||
- name: data | ||
type: DataItem | ||
doc: Data to validate. Can be local or remote link. | ||
default: '' | ||
- name: expectation_suite_name | ||
type: str | ||
doc: Name of expectation suite to validate against. | ||
default: '' | ||
- name: data_asset_name | ||
type: str | ||
doc: Name of dataset in Great Expectations. | ||
default: '' | ||
- name: datasource_name | ||
type: str | ||
doc: Name of datasource to use for validation. | ||
default: pandas_datasource | ||
- name: data_connector_name | ||
type: str | ||
doc: Name of data connector to use for validation. | ||
default: default_runtime_data_connector_name | ||
- name: datasource_config | ||
type: dict | ||
doc: Full configuration for datasource. For use with custom data sources other | ||
than the default pandas datasource. | ||
default: null | ||
- name: batch_identifiers | ||
type: dict | ||
doc: Custom metadata for identifying particular batches of data. For use when | ||
not using the default batch identifiers. | ||
default: null | ||
- name: root_directory | ||
type: str | ||
doc: Path to underlying Great Expectations project. Defaults to /v3io/projects/<project>/great_expectations | ||
if not specified. | ||
default: null | ||
- name: checkpoint_name | ||
type: str | ||
doc: Name of checkpoint to use for validation. | ||
default: null | ||
- name: checkpoint_config | ||
type: dict | ||
doc: Full configuration for checkpoint. For use with custom checkpoint config | ||
other than the default. | ||
default: null | ||
outputs: | ||
- default: '' | ||
lineno: 80 | ||
description: Validate a dataset using Great Expectations | ||
default_handler: validate_expectations | ||
disable_auto_mount: false | ||
env: [] | ||
resources: | ||
requests: | ||
memory: 1Mi | ||
cpu: 25m | ||
limits: | ||
memory: 20Gi | ||
cpu: '2' | ||
priority_class_name: igz-workload-medium | ||
preemption_mode: prevent | ||
affinity: | ||
nodeAffinity: | ||
requiredDuringSchedulingIgnoredDuringExecution: | ||
nodeSelectorTerms: | ||
- matchExpressions: | ||
- key: app.iguazio.com/lifecycle | ||
operator: NotIn | ||
values: | ||
- preemptible | ||
- key: eks.amazonaws.com/capacityType | ||
operator: NotIn | ||
values: | ||
- SPOT | ||
- key: node-lifecycle | ||
operator: NotIn | ||
values: | ||
- spot | ||
tolerations: null | ||
security_context: {} | ||
verbose: false |
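The `get_data_doc_path` helper in the function source above walks the checkpoint result down to the `local_site` entry and strips the `file://` prefix with `str.replace`. The sketch below shows the same traversal on a mock dictionary and swaps in `urllib.parse` for the prefix removal, which also decodes percent-escaped characters. The mock result, its `"id-1"` key, and the helper name `local_path_from_file_url` are assumptions for illustration; a real `CheckpointResult` is keyed by validation result identifiers rather than plain strings.

```python
from urllib.parse import unquote, urlparse


def local_path_from_file_url(url: str) -> str:
    """Convert a file:// URL into a local filesystem path."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError(f"expected a file:// URL, got {url!r}")
    # Decode percent-escapes such as %20 that may appear in the path.
    return unquote(parsed.path)


# Mock of the nested structure the handler reads (illustrative values only).
mock_result = {
    "run_results": {
        "id-1": {
            "actions_results": {
                "update_data_docs": {"local_site": "file:///tmp/ge/index.html"}
            }
        }
    }
}

# Take the first (and here only) validation result, as the handler does.
result_id = next(iter(mock_result["run_results"]))
site_url = mock_result["run_results"][result_id]["actions_results"][
    "update_data_docs"
]["local_site"]
print(local_path_from_file_url(site_url))
```

Rejecting non-`file` schemes makes it explicit that the logged artifact path is only meaningful when the data docs site is stored on a local or mounted filesystem.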
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
apiVersion: v1 | ||
categories: | ||
- data-validation | ||
- data-analysis | ||
description: Validate a dataset using Great Expectations | ||
doc: '' | ||
example: validate_great_expectations.ipynb | ||
generationDate: 2022-04-26:12-28 | ||
hidden: false | ||
icon: '' | ||
labels: | ||
author: nicks | ||
framework: great-expectations | ||
maintainers: [] | ||
marketplaceType: '' | ||
mlrunVersion: 1.1.0 | ||
name: validate-great-expectations | ||
platformVersion: 3.5.2 | ||
spec: | ||
filename: validate_great_expectations.py | ||
handler: validate_expectations | ||
image: mlrun/mlrun | ||
kind: job | ||
requirements: [great-expectations==0.15.41] | ||
url: '' | ||
version: 1.1.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
great-expectations==0.15.41 |