bring back validate_great_expectations
Eyal-Danieli committed Sep 8, 2024
1 parent af65ec0 commit 6849c44
Showing 8 changed files with 1,590 additions and 0 deletions.
53 changes: 53 additions & 0 deletions validate_great_expectations/README.md
@@ -0,0 +1,53 @@
# Great Expectations Validation
![Great Expectations Logo](doc/great-expectations-logo-full-size.png)

Run data validation with Great Expectations. The function validates a given dataset against a given expectation suite and logs the resulting HTML data doc in MLRun.

## Prerequisites

See [1_set_expectations.ipynb](1_set_expectations.ipynb) for a full example.

- An initialized Great Expectations project
- At least one configured Datasource, e.g. `my_datasource`
- At least one Expectation Suite, e.g. `my_suite`
- A Checkpoint, e.g. `my_checkpoint`
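
The prerequisites above can be checked before a run with a small standard-library sketch. It assumes the default file-backed layout of a Great Expectations project (a `great_expectations/` directory holding `great_expectations.yml`, `expectations/<suite>.json`, and `checkpoints/<checkpoint>.yml`). `missing_prerequisites` and `project_dir` are hypothetical names, not part of this function, and the Datasource is not checked because it is defined inside the YAML file rather than in a file of its own.

```python
from pathlib import Path


def missing_prerequisites(project_dir, suite_name, checkpoint_name):
    """Return the expected project files that do not exist yet.

    Assumption: the project uses the default file-backed stores; projects
    with custom store backends will not match this layout.
    """
    ge_dir = Path(project_dir) / "great_expectations"
    expected = [
        ge_dir / "great_expectations.yml",                  # project initialized
        ge_dir / "expectations" / f"{suite_name}.json",     # expectation suite
        ge_dir / "checkpoints" / f"{checkpoint_name}.yml",  # checkpoint
    ]
    return [str(path) for path in expected if not path.is_file()]
```

An empty list means the files for the project, the suite, and the checkpoint were all found under `project_dir`.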

## Usage

See [2_validate_expectations.ipynb](2_validate_expectations.ipynb) for a full example.

```python
import mlrun

fn = mlrun.import_function("hub://validate_great_expectations")
run = fn.run(
    inputs={"data": "https://s3.wasabisys.com/iguazio/data/iris/iris.data.raw.csv"},
    params={
        "expectation_suite_name": "test_suite",
        "data_asset_name": "iris_dataset",
    },
)
```
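
The optional parameters listed under "All Configuration" go into the same `params` dictionary. The following is a minimal sketch of one way to assemble that dictionary while catching misspelled parameter names early. `build_params` and `KNOWN_PARAMS` are hypothetical helpers introduced for illustration; only the parameter names themselves come from this README.

```python
# Parameter names documented for the validate_expectations handler.
KNOWN_PARAMS = {
    "expectation_suite_name",
    "data_asset_name",
    "datasource_name",
    "data_connector_name",
    "datasource_config",
    "batch_identifiers",
    "root_directory",
    "checkpoint_name",
    "checkpoint_config",
}


def build_params(expectation_suite_name, data_asset_name, **optional):
    """Build the `params` dict for fn.run(), dropping unset optional values."""
    unknown = set(optional) - KNOWN_PARAMS
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    params = {
        "expectation_suite_name": expectation_suite_name,
        "data_asset_name": data_asset_name,
    }
    # Leave out None so the handler falls back to its own defaults.
    params.update({k: v for k, v in optional.items() if v is not None})
    return params
```

The result can then be passed as `params=build_params("test_suite", "iris_dataset", checkpoint_name="my_checkpoint")` in the `fn.run(...)` call.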

## All Configuration
Inputs
```rst
:param data: Data to validate. Can be local or remote link.
```

Parameters
```rst
:param expectation_suite_name: Name of expectation suite to validate against.
:param data_asset_name: Name of dataset in Great Expectations.
:param datasource_name: Name of datasource to use for validation.
:param data_connector_name: Name of data connector to use for validation.
:param datasource_config: Full configuration for datasource. For use with custom
    data sources other than the default pandas datasource.
:param batch_identifiers: Custom metadata for identifying particular batches of
    data. For use when not using the default batch identifiers.
:param root_directory: Path to underlying Great Expectations project. Defaults to
    MLRun project artifact path if not specified.
:param checkpoint_name: Name of checkpoint to use for validation.
:param checkpoint_config: Full configuration for checkpoint. For use with a custom
    checkpoint config other than the default.
```
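
The `datasource_config` and `checkpoint_config` parameters take full Great Expectations configurations as dictionaries. The sketch below shows what such dictionaries commonly look like for a pandas runtime datasource and a simple checkpoint in the 0.15.x block-config style. The function's actual defaults are not shown in this README, so the helper names and the specific values (in particular the batch identifier name and the run name template) are assumptions for illustration.

```python
def pandas_datasource_config(
    datasource_name="pandas_datasource",
    data_connector_name="default_runtime_data_connector_name",
):
    """Sketch of a pandas datasource with a runtime data connector."""
    return {
        "name": datasource_name,
        "class_name": "Datasource",
        "module_name": "great_expectations.datasource",
        "execution_engine": {
            "module_name": "great_expectations.execution_engine",
            "class_name": "PandasExecutionEngine",
        },
        "data_connectors": {
            data_connector_name: {
                "class_name": "RuntimeDataConnector",
                "module_name": "great_expectations.datasource.data_connector",
                # Assumed identifier name; must match `batch_identifiers` keys.
                "batch_identifiers": ["default_identifier_name"],
            },
        },
    }


def simple_checkpoint_config(checkpoint_name):
    """Sketch of a minimal checkpoint configuration."""
    return {
        "name": checkpoint_name,
        "config_version": 1.0,
        "class_name": "SimpleCheckpoint",
        # Assumed template; any strftime-style pattern can be used here.
        "run_name_template": "%Y%m%d-%H%M%S-validation",
    }
```

With a connector configured this way, a matching `batch_identifiers` value would be a dictionary keyed by the same identifier name, e.g. `{"default_identifier_name": "my_batch"}`.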
170 changes: 170 additions & 0 deletions validate_great_expectations/function.yaml
@@ -0,0 +1,170 @@
kind: job
metadata:
  name: validate-great-expectations
  tag: ''
  hash: 82d0b647d443eb6e643d9dbfc8c0a650d74da018
  project: ''
  labels:
    author: nicks
    framework: great-expectations
  categories:
  - data-validation
  - data-analysis
spec:
  command: ''
  args: []
  image: ''
  build:
    functionSourceCode: 
    base_image: mlrun/mlrun
    commands:
    - python -m pip install great-expectations==0.15.41
    code_origin: https://github.com/igz-us-sales/functions.git#c7b44af35294494a531a014f3d02a28eff3f4105:/User/functions/validate_great_expectations/validate_great_expectations.py
    origin_filename: /User/functions/validate_great_expectations/validate_great_expectations.py
  entry_points:
    get_default_datasource_config:
      name: get_default_datasource_config
      doc: 'Convenience function to get the default pandas datasource config
        for use in validating expectations.'
      parameters:
      - name: datasource_name
        type: str
        doc: Name of datasource.
        default: ''
      - name: data_connector_name
        type: str
        doc: Name of data connector.
        default: ''
      outputs:
      - default: ''
        doc: Configuration for default datasource.
        type: dict
      lineno: 15
    get_default_checkpoint_config:
      name: get_default_checkpoint_config
      doc: 'Convenience function to get the default checkpoint config for
        use in validating expectations.'
      parameters:
      - name: checkpoint_name
        type: str
        doc: Name of checkpoint.
        default: ''
      outputs:
      - default: ''
        doc: Configuration for default checkpoint.
        type: dict
      lineno: 46
    get_data_doc_path:
      name: get_data_doc_path
      doc: 'Convenience function to get the path of the output
        data doc from a checkpoint result.'
      parameters:
      - name: checkpoint_result
        type: CheckpointResult
        doc: Great Expectations checkpoint result.
        default: ''
      outputs:
      - default: ''
        doc: Absolute path to new data doc.
        type: str
      lineno: 63
    validate_expectations:
      name: validate_expectations
      doc: 'Main function to validate an input dataset, datasource, data connector,
        and expectation suite.
        Runs the Great Expectation validation and logs
        whether the validation was a success as well as the output page
        of the data docs.'
      parameters:
      - name: context
        type: MLClientCtx
        doc: MLRun context.
        default: ''
      - name: data
        type: DataItem
        doc: Data to validate. Can be local or remote link.
        default: ''
      - name: expectation_suite_name
        type: str
        doc: Name of expectation suite to validate against.
        default: ''
      - name: data_asset_name
        type: str
        doc: Name of dataset in Great Expectations.
        default: ''
      - name: datasource_name
        type: str
        doc: Name of datasource to use for validation.
        default: pandas_datasource
      - name: data_connector_name
        type: str
        doc: Name of data connector to use for validation.
        default: default_runtime_data_connector_name
      - name: datasource_config
        type: dict
        doc: Full configuration for datasource. For use with custom data sources other
          than the default pandas datasource.
        default: null
      - name: batch_identifiers
        type: dict
        doc: Custom metadata for identifying particular batches of data. For use when
          not using the default batch identifiers.
        default: null
      - name: root_directory
        type: str
        doc: Path to underlying Great Expectations project. Defaults to MLRun project
          artifact path if not specified.
        default: null
      - name: checkpoint_name
        type: str
        doc: Name of checkpoint to use for validation.
        default: null
      - name: checkpoint_config
        type: dict
        doc: Full configuration for checkpoint. For use with a custom checkpoint config
          other than the default.
        default: null
      outputs:
      - default: ''
      lineno: 80
  description: Validate a dataset using Great Expectations
  default_handler: validate_expectations
  disable_auto_mount: false
  env: []
  resources:
    requests:
      memory: 1Mi
      cpu: 25m
    limits:
      memory: 20Gi
      cpu: '2'
  priority_class_name: igz-workload-medium
  preemption_mode: prevent
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: app.iguazio.com/lifecycle
            operator: NotIn
            values:
            - preemptible
          - key: eks.amazonaws.com/capacityType
            operator: NotIn
            values:
            - SPOT
          - key: node-lifecycle
            operator: NotIn
            values:
            - spot
  tolerations: null
  security_context: {}
verbose: false
26 changes: 26 additions & 0 deletions validate_great_expectations/item.yaml
@@ -0,0 +1,26 @@
apiVersion: v1
categories:
- data-validation
- data-analysis
description: Validate a dataset using Great Expectations
doc: ''
example: validate_great_expectations.ipynb
generationDate: 2022-04-26:12-28
hidden: false
icon: ''
labels:
  author: nicks
  framework: great-expectations
maintainers: []
marketplaceType: ''
mlrunVersion: 1.1.0
name: validate-great-expectations
platformVersion: 3.5.2
spec:
  filename: validate_great_expectations.py
  handler: validate_expectations
  image: mlrun/mlrun
  kind: job
  requirements: [great-expectations==0.15.41]
url: ''
version: 1.1.0
1 change: 1 addition & 0 deletions validate_great_expectations/requirements.txt
@@ -0,0 +1 @@
great-expectations==0.15.41