Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE setup.
This module is designed to detect and remove license and copyright information from code files. It leverages the ScanCode Toolkit to accurately identify and process licenses and copyrights in various programming languages.
After locating the license or copyright text in the input code, the module removes those lines and returns the updated code as a parquet file.
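The removal step described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual implementation: it assumes a detector (ScanCode in the real transform) has already reported the 1-based, inclusive line ranges of each license or copyright match, and `remove_line_ranges` is a hypothetical helper name.

```python
from typing import Iterable, Tuple


def remove_line_ranges(code: str, ranges: Iterable[Tuple[int, int]]) -> str:
    """Drop every line that falls inside one of the given ranges.

    Each range is (start_line, end_line), 1-based and inclusive. The ranges
    are assumed to come from a license/copyright detector; this helper is
    hypothetical and only illustrates the removal step.
    """
    to_drop = set()
    for start, end in ranges:
        to_drop.update(range(start, end + 1))
    # keepends=True preserves the original line terminators of the kept lines.
    lines = code.splitlines(keepends=True)
    kept = [line for number, line in enumerate(lines, start=1) if number not in to_drop]
    return "".join(kept)


sample = "# Copyright 2024 Example Corp\n# Licensed under Apache-2.0\nprint('hello')\n"
# Assume the detector flagged line 1 (copyright) and line 2 (license).
cleaned = remove_line_ranges(sample, [(1, 1), (2, 2)])
```

Collecting the flagged line numbers in a set first means overlapping license and copyright matches are handled without removing any line twice.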
The transform is configured through the following dictionary keys:
- contents_column_name - name of the input column that holds the code. Default value is 'contents'.
- license - set to 'true' to remove license text from the input data, or 'false' to keep it. Default value is 'true'.
- copyright - set to 'true' to remove copyright notices from the input data, or 'false' to keep them. Default value is 'true'.
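As an illustration, the keys above could be collected in a plain dictionary like the one below. The key names and defaults come from the list above; the `build_config` helper is hypothetical, and representing the flags as the strings 'true' and 'false' (rather than booleans) is an assumption based on the wording of this document.

```python
# Defaults as documented above; the string form of the flags is an assumption.
DEFAULT_CONFIG = {
    "contents_column_name": "contents",
    "license": "true",
    "copyright": "true",
}


def build_config(**overrides):
    """Return a copy of the defaults with the given keys overridden.

    Hypothetical helper: it rejects keys that are not documented above so
    that typos are caught early.
    """
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        raise KeyError(f"unknown configuration keys: {sorted(unknown)}")
    return {**DEFAULT_CONFIG, **overrides}


# Read code from a column named 'code' and keep license text,
# while still removing copyright notices (the default).
config = build_config(contents_column_name="code", license="false")
```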
You can run header_cleanser_local.py (python-only implementation) or header_cleanser_local_ray.py (ray-based implementation) to transform the test1.parquet file in the test input data into an output directory. The directory will contain both the new annotated test1.parquet file and the metadata.json file.
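A quick way to check what a run produced is to list the parquet files and load metadata.json from the output directory. The sketch below uses only the standard library; `summarize_output` is a hypothetical helper, and nothing is assumed about the contents of metadata.json beyond it being valid JSON.

```python
import json
from pathlib import Path


def summarize_output(output_dir):
    """List the parquet files in output_dir and load metadata.json if present.

    Returns (parquet_file_names, metadata), where metadata is None when the
    directory or the metadata.json file does not exist.
    """
    out = Path(output_dir)
    if not out.is_dir():
        return [], None
    parquet_files = sorted(p.name for p in out.glob("*.parquet"))
    metadata_path = out / "metadata.json"
    metadata = json.loads(metadata_path.read_text()) if metadata_path.exists() else None
    return parquet_files, metadata
```

Calling `summarize_output("output")` after a local run should then report test1.parquet alongside the parsed metadata, assuming the output directory is named `output`.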
When running the transform with the Ray launcher (i.e. TransformLauncher), the following command line arguments are available in addition to the options provided by the python launcher.
- --header_cleanser_contents_column_name - set the contents_column_name configuration key.
- --header_cleanser_document_id_column_name - set the document_id_column_name configuration key.
- --header_cleanser_license - set the license configuration key.
- --header_cleanser_copyright - set the copyright configuration key.
- --header_cleanser_n_processes - set the n_processes configuration key.
- --header_cleanser_tmp_dir - set the tmp_dir configuration key.
- --header_cleanser_timeout - set the timeout configuration key.
- --header_cleanser_skip_timeout - set the skip_timeout configuration key.
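Each flag above is the configuration key with a `--header_cleanser_` prefix, so an argument list can be derived mechanically from a configuration dictionary. The sketch below shows that mapping; `to_cli_args` is a hypothetical helper, and passing each value as a separate string token after its flag is an assumption about how the launcher parses its arguments.

```python
def to_cli_args(config, prefix="header_cleanser"):
    """Map configuration keys to command line flags.

    For example {'license': 'false'} becomes
    ['--header_cleanser_license', 'false']. Hypothetical helper.
    """
    args = []
    for key, value in config.items():
        args.extend([f"--{prefix}_{key}", str(value)])
    return args


args = to_cli_args(
    {"contents_column_name": "contents", "license": "true", "copyright": "false"}
)
```

The resulting list can be appended to whatever command starts the launcher, for instance via `subprocess.run`.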
To run the samples, use the following make targets:
- run-cli-sample - runs src/header_cleanser_transform_python.py using command line args
- run-local-python-sample - runs src/header_cleanser_local_python.py
- run-local-sample - runs src/header_cleanser_local.py
These targets will activate the virtual environment and set up any configuration needed.
Use the -n option of make to see the details of what is done to run the sample.
For example, run
    make run-cli-sample
    ...
and then
    ls output
to see the results of the transform.
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.