The TransformUtils class provides several methods that simplify the implementation of transformers. Currently, it includes the following methods:
deep_get_size
returns the complete size of a Python object (based on https://www.askpython.com/python/built-in-methods/variables-memory-size-in-python). It supports the Python structures list, tuple and set
normalize_string
normalizes a string by converting it to lowercase and removing spaces, punctuation and carriage returns (CR)
str_to_hash
converts a string to a 256-bit hash
str_to_int
returns an integer representing a string, computed from the string's hash
validate_columns
checks whether the required columns exist in the table
add_column
adds a column to the table while avoiding duplicates. If a column with the given name already exists, it is removed before the new one is added
validate_path
cleans up an S3 path: removes white space from the input/output paths; removes the scheme prefix (s3://, http://, https://) if present; adds the "/" character at the end if it is missing; and removes URL encoding
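The behavior described for the size, string and path helpers can be illustrated with a small standalone sketch. This is not the TransformUtils implementation: the function names below (deep_size, normalize, sha256_hex, hash_to_int, clean_s3_path) are hypothetical, SHA-256 is an assumed choice of hash, "removes white spaces" is assumed to mean stripping leading and trailing white space, and the order of the path clean-up steps is a guess.

```python
import hashlib
import string
import sys
from urllib.parse import unquote


def deep_size(obj, seen=None):
    """Sum sys.getsizeof over an object and the members of any
    nested list, tuple or set (hypothetical helper)."""
    if seen is None:
        seen = set()
    if id(obj) in seen:  # count each shared object only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, (list, tuple, set)):
        size += sum(deep_size(item, seen) for item in obj)
    return size


def normalize(text):
    """Lowercase the text and drop white space (including CR/LF)
    and ASCII punctuation."""
    drop = str.maketrans("", "", string.punctuation + string.whitespace)
    return text.lower().translate(drop)


def sha256_hex(text):
    """Hex digest of the UTF-8 encoded text (assumes SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hash_to_int(text):
    """Integer derived from the hash of the text."""
    return int(sha256_hex(text), 16)


def clean_s3_path(path):
    """Strip white space, drop a known scheme prefix, undo URL
    encoding, and make sure the path ends with '/'."""
    path = path.strip()
    for prefix in ("s3://", "http://", "https://"):
        if path.startswith(prefix):
            path = path[len(prefix):]
            break
    path = unquote(path)
    if not path.endswith("/"):
        path += "/"
    return path


if __name__ == "__main__":
    print(normalize("Hello, World!\r\n"))
    print(clean_s3_path("  s3://bucket/my%20data "))
    print(len(sha256_hex("abc")) * 4, "bits")
    print(deep_size([[1, 2], (3, 4)]))
```

Normalizing before hashing means that two strings differing only in case, spacing or punctuation map to the same hash and the same integer, which is the usual reason to combine these helpers.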
It also contains two variables:
RANDOM_SEED
a number used by methods that require a seed
LOCAL_TO_DISK
rough ratio of local size to size on disk/S3
This class should be extended with additional methods that are generally useful across multiple transformers, and their documentation should be added here.