Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.
- Boris Lublinsky ([email protected])
This transform assigns unique identifiers to the documents in a dataset and supports the following annotations to the original data:
- Adding a Document Hash to each document. The unique hash-based ID is generated using
  `hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To store this hash in the data, specify the
  desired column name using the `hash_column` parameter.
- Adding an Integer Document ID to each document. The integer ID is unique across all rows and
  tables processed by the `transform()` method. To store this ID in the data, specify the desired
  column name using the `int_id_column` parameter.
Document IDs are essential for tracking annotations linked to specific documents. They are also required for processes like fuzzy deduplication, which depend on the presence of integer IDs. If your dataset lacks document ID columns, this transform can be used to generate them.
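As a rough illustration of the two annotations (not the transform's actual implementation), a hash-based ID and an integer ID can be added to a pyarrow table as sketched below; the `add_ids` helper and the column names are hypothetical:

```python
import hashlib
import pyarrow as pa


def add_ids(table: pa.Table, doc_column: str, hash_column: str,
            int_id_column: str, start_id: int = 0) -> pa.Table:
    """Hypothetical helper: append a SHA-256 hash column and a unique integer ID column."""
    docs = table.column(doc_column).to_pylist()
    hashes = [hashlib.sha256(doc.encode("utf-8")).hexdigest() for doc in docs]
    int_ids = list(range(start_id, start_id + len(docs)))
    table = table.append_column(hash_column, pa.array(hashes))
    return table.append_column(int_id_column, pa.array(int_ids, type=pa.uint64()))


table = pa.table({"contents": ["first document", "second document"]})
table = add_ids(table, doc_column="contents", hash_column="doc_hash", int_id_column="doc_id")
print(table.to_pydict())
```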
| Input Column Name | Data Type | Description |
|---|---|---|
| Column specified by the contents_column configuration argument | str | Column that stores document text |
| Output Column Name | Data Type | Description |
|---|---|---|
| hash_column | str | Unique hash assigned to each document |
| int_id_column | uint64 | Unique integer ID assigned to each document |
The set of dictionary keys holding the DocIDTransform configuration values is as follows:
- doc_column - specifies the name of the column containing the document text (required for ID generation)
- hash_column - specifies the name of the column created to hold the string document id; if None, this id is not generated
- int_id_column - specifies the name of the column created to hold the integer document id; if None, this id is not generated
- start_id - the integer id from which the ID generator starts

At least one of `hash_column` or `int_id_column` must be specified.
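For illustration, a configuration dictionary built from these keys might look like the sketch below (the column names are arbitrary examples, not defaults of the transform):

```python
# Example configuration for the Document ID transform; column names are arbitrary.
doc_id_config = {
    "doc_column": "contents",    # column holding the document text
    "hash_column": "doc_hash",   # column to create for the hash-based id
    "int_id_column": "doc_id",   # column to create for the integer id
    "start_id": 0,               # value from which the integer id generator starts
}
```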
When running the transform with the Ray launcher (i.e., TransformLauncher), the following command line arguments are available in addition to the options provided by the launcher.
```
  --doc_id_doc_column DOC_ID_DOC_COLUMN
                        doc column name
  --doc_id_hash_column DOC_ID_HASH_COLUMN
                        Compute document hash and place in the given named column
  --doc_id_int_column DOC_ID_INT_COLUMN
                        Compute unique integer id and place in the given named column
  --doc_id_start_id DOC_ID_START_ID
                        starting integer id
```
These correspond to the configuration keys described above.
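For example, a run might be launched as shown below; the launcher script name (`doc_id_transform_ray.py`) and the column values are illustrative only, not prescribed by the transform:

```bash
# Hypothetical invocation; the script name and column names are examples only.
python doc_id_transform_ray.py \
    --doc_id_doc_column contents \
    --doc_id_hash_column doc_hash \
    --doc_id_int_column doc_id \
    --doc_id_start_id 0
```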
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.
Testing follows the testing strategy of data-processing-lib.
This project wraps the Document ID transform with a Ray runtime.
Document ID configuration and command line options are the same as for the base python transform.
A Dockerfile is provided that can be used to build the docker image for the Ray runtime; you can use `make build` to build it.
When running the transform with the Ray launcher (i.e., RayTransformLauncher), in addition to the Python command line options, the options provided by the Ray launcher are also available.
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.
This transform assigns a unique integer ID to each row in a Spark DataFrame. It relies on the `monotonically_increasing_id` PySpark function to generate the unique integer IDs. As described in the documentation of this function:

> The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
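A minimal sketch of this approach, assuming a local Spark session and an example contents column (the column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.appName("doc_id_sketch").getOrCreate()
df = spark.createDataFrame([("first document",), ("second document",)], ["contents"])

# Append a unique (monotonically increasing, but not consecutive) integer ID column.
df_with_ids = df.withColumn("doc_id", monotonically_increasing_id())
df_with_ids.show(truncate=False)
```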
Document ID configuration and command line options are the same as for the base python transform.
You can run `doc_id_local.py` (the Spark-based implementation) to transform the `test1.parquet` file in the test input data to an output directory. The directory will contain both the new annotated `test1.parquet` file and the `metadata.json` file.
When running the transform with the Spark launcher (i.e., SparkTransformLauncher), the following command line arguments are available in addition to the options provided by the python launcher.
```
  --doc_id_column_name DOC_ID_COLUMN_NAME
                        name of the column that holds the generated document ids
```
```
(venv) cma:src$ python doc_id_local.py
18:32:13 INFO - data factory data_ is using local data access: input_folder - /home/cma/de/data-prep-kit/transforms/universal/doc_id/spark/test-data/input output_folder - /home/cma/de/data-prep-kit/transforms/universal/doc_id/spark/output at "/home/cma/de/data-prep-kit/data-processing-lib/ray/src/data_processing/data_access/data_access_factory.py:185"
18:32:13 INFO - data factory data_ max_files -1, n_sample -1 at "/home/cma/de/data-prep-kit/data-processing-lib/ray/src/data_processing/data_access/data_access_factory.py:201"
18:32:13 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'] at "/home/cma/de/data-prep-kit/data-processing-lib/ray/src/data_processing/data_access/data_access_factory.py:214"
18:32:13 INFO - pipeline id pipeline_id at "/home/cma/de/data-prep-kit/data-processing-lib/ray/src/data_processing/runtime/execution_configuration.py:80"
18:32:13 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'} at "/home/cma/de/data-prep-kit/data-processing-lib/ray/src/data_processing/runtime/execution_configuration.py:83"
18:32:13 INFO - spark execution config : {'spark_local_config_filepath': '/home/cma/de/data-prep-kit/transforms/universal/doc_id/spark/config/spark_profile_local.yml', 'spark_kube_config_filepath': 'config/spark_profile_kube.yml'} at "/home/cma/de/data-prep-kit/data-processing-lib/spark/src/data_processing_spark/runtime/spark/spark_execution_config.py:42"
24/05/26 18:32:14 WARN Utils: Your hostname, li-7aed0a4c-2d51-11b2-a85c-dfad31db696b.ibm.com resolves to a loopback address: 127.0.0.1; using 192.168.1.223 instead (on interface wlp0s20f3)
24/05/26 18:32:14 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/26 18:32:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:32:17 INFO - files = ['/home/cma/de/data-prep-kit/transforms/universal/doc_id/spark/test-data/input/test_doc_id_1.parquet', '/home/cma/de/data-prep-kit/transforms/universal/doc_id/spark/test-data/input/test_doc_id_2.parquet'] at "/home/cma/de/data-prep-kit/data-processing-lib/spark/src/data_processing_spark/runtime/spark/spark_launcher.py:184"
24/05/26 18:32:23 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
```
The metadata generated by the Spark `doc_id` transform contains the following statistics:
- `total_docs_count`, `total_columns_count`: total number of documents (rows) and columns in the input table, before the `doc_id` transform ran
- `docs_after_doc_id`, `columns_after_doc_id`: total number of documents (rows) and columns in the output table, after the `doc_id` transform ran
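Purely as an illustration (all values below are made up), these statistics have the following shape:

```python
# Illustrative only: made-up values showing the shape of the statistics above.
doc_id_metadata_stats = {
    "total_docs_count": 1000,    # rows in the input table
    "total_columns_count": 2,    # columns in the input table
    "docs_after_doc_id": 1000,   # rows in the output table
    "columns_after_doc_id": 3,   # columns in the output table (ID column added)
}
```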
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.