Add support for unique validation in PySpark #1396
Conversation
Signed-off-by: Filipe Oliveira <[email protected]>
Signed-off-by: Filipe Oliveira <[email protected]>
Signed-off-by: Filipe Oliveira <[email protected]>
Signed-off-by: Filipe Oliveira <[email protected]>
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #1396      +/-   ##
==========================================
+ Coverage   93.92%   94.08%   +0.16%
==========================================
  Files          91       91
  Lines        6787     6802      +15
==========================================
+ Hits         6375     6400      +25
+ Misses        412      402      -10

☔ View full report in Codecov by Sentry.
def _coerce_df_dtype(obj: DataFrame) -> DataFrame:
    if schema.dtype is None:
        raise ValueError(
            "dtype argument is None. Must specify this argument "
            "to coerce dtype"
        )

    try:
        return schema.dtype.try_coerce(obj)
    except ParserError as exc:
        raise SchemaError(
            schema=schema,
            data=obj,
            message=(
                f"Error while coercing '{schema.name}' to type "
                f"{schema.dtype}: {exc}\n{exc.failure_cases}"
            ),
            failure_cases=exc.failure_cases,
            check=f"coerce_dtype('{schema.dtype}')",
        ) from exc
Different from the pandas namespace, this function was not being used inside the coercion functions. Removed.
@@ -89,17 +105,12 @@ def wrapper(self, *args, **kwargs):
            return func(self, *args, **kwargs)
        else:
            warnings.warn(
-               "Skipping Execution of function as parameters set to DATA_ONLY ",
+               f"Skipping execution of function {func.__name__} as validation depth is set to DATA_ONLY ",
Improving warning messages
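For readers outside the pyspark backend, here is a minimal, self-contained sketch of the decorator pattern this warning lives in; the enums, the `CONFIG` object, and the skip logic are simplified stand-ins for pandera's internals, not the actual implementation:

```python
import warnings
from enum import Enum
from functools import wraps


class ValidationScope(Enum):
    SCHEMA = "schema"
    DATA = "data"


class ValidationDepth(Enum):
    SCHEMA_ONLY = "SCHEMA_ONLY"
    DATA_ONLY = "DATA_ONLY"
    SCHEMA_AND_DATA = "SCHEMA_AND_DATA"


class _Config:
    # stand-in for the global validation config referenced as CONFIG in the diffs
    validation_depth = ValidationDepth.SCHEMA_AND_DATA


CONFIG = _Config()


def validate_scope(scope: ValidationScope):
    """Skip the wrapped check (with a warning) when the configured depth excludes its scope."""

    def decorator(func):
        @wraps(func)
        def wrapper(self, *args, **kwargs):
            skip = (
                scope == ValidationScope.SCHEMA
                and CONFIG.validation_depth == ValidationDepth.DATA_ONLY
            ) or (
                scope == ValidationScope.DATA
                and CONFIG.validation_depth == ValidationDepth.SCHEMA_ONLY
            )
            if skip:
                warnings.warn(
                    f"Skipping execution of function {func.__name__} as "
                    f"validation depth is set to {CONFIG.validation_depth.name}"
                )
                return None
            return func(self, *args, **kwargs)

        return wrapper

    return decorator
```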
@@ -53,7 +53,7 @@ def test_schema_only(self, spark, sample_spark_schema):
        CONFIG.validation_enabled = True
        CONFIG.validation_depth = ValidationDepth.SCHEMA_ONLY

-       pandra_schema = DataFrameSchema(
+       pandera_schema = DataFrameSchema(
Typo
@validate_scope(scope=ValidationScope.DATA)
def _data_checks(
A separate function was defined to run DATA-related validations when this validation depth is enabled, given that the existing one (`_schema_checks`) is designated for SCHEMA-related validations.
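To make the split concrete, a rough sketch of the two scoped methods, reusing the `validate_scope`/`ValidationScope` stand-ins from the sketch above; the backend class name and method signatures are assumptions for illustration, not pandera's actual code:

```python
class DataFrameSchemaBackend:
    @validate_scope(scope=ValidationScope.SCHEMA)
    def _schema_checks(self, check_obj, schema, error_handler):
        """Metadata-level validations: expected columns, dtypes, coercion."""
        ...

    @validate_scope(scope=ValidationScope.DATA)
    def _data_checks(self, check_obj, schema, error_handler):
        """Data-level validations: column checks and uniqueness of values."""
        ...
```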
Great @filipeo2-mck! I like how we're catching the SchemaError raised by the `unique` method, but I think it's a bit weird that we're raising a SchemaError when there's potentially nothing wrong with the schema. @cosmicBboy, does it make sense to introduce a `DataError` in `pandera.errors` to make sure we distinguish between different types of errors, or is there any reason that we don't want it?
Semantically, what does the `SchemaError` object in `pandera.errors` stand for?

- The segregation between the two available validation types (schema vs. data validations)? If this is the meaning, we should probably add data-related `DataError` exceptions to this namespace, to be specific about the kind of issue being raised.
- The pandera objects or components, at a macro level (a schema, some data - the df, the declared checks, the existing columns...)? If this is the meaning, I see no issue with calling it `SchemaError`.
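For illustration only, a hypothetical sketch of what such a `DataError` could look like; `pandera.errors` does not define this class, and the attributes simply mirror the `SchemaError` fields used in the diff above:

```python
class DataError(Exception):
    """Hypothetical error for failures in data values rather than schema metadata."""

    def __init__(self, schema, data, message, failure_cases=None, check=None):
        super().__init__(message)
        self.schema = schema
        self.data = data
        self.failure_cases = failure_cases
        self.check = check
```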
Semantically, `SchemaError(s)` stand for anything that's wrong with the data or metadata of a validated object. That includes metadata (column names, types, etc.) and data (actual data values contained in the object).

I think for clarity we should rename `ValidationScope.SCHEMA` to `ValidationScope.METADATA` to clarify the difference in pandera (I understand that the term "schema" often refers to what I'm calling metadata here, i.e. columns and their types, but pandera takes a slightly broader view of what a schema is).

> but I think it's a bit weird that we're raising a Schema error when there's potentially nothing wrong with the schema. @cosmicBboy, does it make sense to introduce a DataError in pandera.errors to make sure we distinguish between different types of errors, or is there any reason that we don't want it?

I'm not sure what the concern is here: if the `unique` method is the function that raises a `SchemaError` that's caught elsewhere in the validation pipeline, doesn't that mean, by definition, that there's something wrong with the data (under pandera's definition of a "schema")?
for value in kwargs.values():
    if isinstance(value, pyspark.sql.DataFrame):
        return value
Code coverage does not reach this because we are not passing `check_obj` as a kwarg; we use a positional arg instead.
- Can you confirm if there's any scenario where a DataFrame might be passed as a kwarg?
- For our tests related to this decorator, are we ensuring that we pass the pyspark.sql.DataFrame both as a positional and as a keyword argument?
- If we anticipate that the DataFrame will always be passed as a positional argument and never as a keyword argument, would it make sense to refactor the decorator to remove the kwargs check for simplicity?
I wrongly assumed that this decorator was used in the pandas and other namespaces too, and I wanted to keep the pattern, but checking now, there is no such thing in the other integrations. I'm changing it to remove the kwargs capability, as it's not used anywhere. Thank you for the input.
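A rough sketch of the simplification being described, assuming the DataFrame is only ever passed positionally; the helper name is hypothetical, not the function touched in this PR:

```python
import pyspark.sql


def _first_dataframe(*args):
    """Return the first positional argument that is a pyspark DataFrame, if any."""
    for value in args:
        if isinstance(value, pyspark.sql.DataFrame):
            return value
    return None
```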
The decorator was added to pyspark only. I hope eventually pandas will follow suit...
Signed-off-by: Filipe Oliveira <[email protected]>
"""Run the checks related to data validation and uniqueness.""" | ||
|
||
# uniqueness of values | ||
check_obj = self.unique( |
Can't remember how this was implemented, but wouldn't it make sense to catch the error which `unique` would raise and append it to the `ErrorHandler`? Similar pattern to the SchemaError above?
Essentially catching a `DataError` or similar. @NeerajMalhotra-QB
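A minimal sketch of the pattern being suggested, with the error collected instead of raised immediately; the helper name, the `backend.unique` signature, and the plain list standing in for the `ErrorHandler` are assumptions for illustration:

```python
from pandera.errors import SchemaError


def _run_unique_check(backend, check_obj, schema, collected_errors):
    """Run the uniqueness check, collecting its error rather than failing fast."""
    try:
        check_obj = backend.unique(check_obj, schema=schema)
    except SchemaError as exc:
        collected_errors.append(exc)  # handed to the error handler afterwards
    return check_obj
```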
I agree
Oh, I understood the issue now, it's about the use of a `SchemaError` exception in data validation, not the point where it's raised?
Looks great! I added a bunch of comments.
Signed-off-by: Filipe Oliveira <[email protected]>
LGTM but I am wondering if it could have been more intuitive for the user to define uniqueness as a check in `Field`.
@NeerajMalhotra-QB this is just to be consistent with the pandas namespace. These options are currently exposed in the `Config` class. Would eventually like to support a syntax like this:

    class Model(DataFrameModel, unique=[...], **kwargs):
        ...

But for now this is how it is.
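For concreteness, a sketch of how the option is declared today through the `Config` class; the column names are illustrative, and the class-keyword form in the trailing comment is the aspiration discussed above, not current behavior:

```python
import pyspark.sql.types as T
import pandera.pyspark as pa
from pandera.pyspark import DataFrameModel


class ProductSchema(DataFrameModel):
    year: T.IntegerType() = pa.Field()
    month: T.IntegerType() = pa.Field()

    class Config:
        unique = ["year", "month"]  # the option this PR wires up for pyspark


# eventual goal discussed above (not supported yet):
# class ProductSchema(DataFrameModel, unique=["year", "month"]): ...
```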
Both approaches are valid from a usage perspective. I didn't implement it because:

EDIT: I'll add this distinction of what is being supported in the PR cover.
Sounds good to me. Maybe in the future we can enhance `Field` to include a unique check too, but for now I agree this is the way it is done. :)
The `unique` config option expects the list of columns whose combined values should be unique, right?
Exactly. The uniqueness of values will be tested for the columns listed in the `unique` config option.
…ue functions Signed-off-by: Filipe Oliveira <[email protected]>
Update: Added a missing condition/test to check if the …
LGTM!!!
Thank you @filipeo2-mck!
Currently, the `unique` property in a `Config` class for `DataFrameModel` (PySpark) does not work, given no support for it was added in the initial pyspark support in release 0.16.0. This PR adds support for it: test cases and a warning in the docs about a possible performance hit when using it were added.

The Column/Field-based `unique` capability remained untouched. This PR is open for improvement suggestions.
Test code I

Output
Considering `year` column only, two rows have duplicated values:

Test code II
If two columns are used:

Output
Considering `year` and `month` columns, just one row has duplicated values:
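The collapsed test-code and output sections are not reproduced above; as a rough, hypothetical stand-in for "Test code I" (the model name, column types, and data are invented for illustration), usage would look something like this:

```python
import pyspark.sql.types as T
from pyspark.sql import SparkSession

import pandera.pyspark as pa
from pandera.pyspark import DataFrameModel

spark = SparkSession.builder.getOrCreate()


class ReportSchema(DataFrameModel):
    year: T.IntegerType() = pa.Field()
    month: T.IntegerType() = pa.Field()

    class Config:
        unique = ["year"]  # flag rows whose `year` value is duplicated


# two rows share year=2023, so the uniqueness check should report them
df = spark.createDataFrame(
    [(2023, 1), (2023, 2), (2024, 1)],
    schema=T.StructType(
        [
            T.StructField("year", T.IntegerType(), False),
            T.StructField("month", T.IntegerType(), False),
        ]
    ),
)

# pyspark validation collects errors instead of raising; inspect them here
df_out = ReportSchema.validate(df)
print(dict(df_out.pandera.errors))
```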