A Python package to manage Google Cloud Data Catalog helper commands and scripts.
Disclaimer: This is not an officially supported Google product.
Group | Command | Description | Documentation Link | Code Repo |
---|---|---|---|---|
tags | create | Load Tags from CSV file. | GO | GO |
tags | delete | Delete Tags from CSV file. | GO | GO |
tags | export | Export Tags to CSV file. | GO | GO |
tag-templates | create | Load Templates from CSV file. | GO | GO |
tag-templates | delete | Delete Templates from CSV file. | GO | GO |
tag-templates | export | Export Templates to CSV file. | GO | GO |
filesets | create | Create GCS filesets from CSV file. | GO | GO |
filesets | enrich | Enrich GCS filesets with Tags. | GO | GO |
filesets | clean-up-templates-and-tags | Cleans up the Fileset Template and their Tags. | GO | GO |
filesets | delete | Delete GCS filesets from CSV file. | GO | GO |
filesets | export | Export Filesets to CSV file. | GO | GO |
object-storage | create-entries | Create Entries for each Object Storage File. | GO | GO |
object-storage | delete-entries | Delete Entries that belong to the Object Storage Files. | GO | GO |
- 0. Executing in Cloud Shell from PyPi
- 1. Environment setup for local build
- 2. Load Tags from CSV file
- 3. Export Tags to CSV file
- 4. Load Templates from CSV file
- 5. Export Templates to CSV file
- 6. Filesets Commands
- 7. Export Filesets to CSV file
- 8. DataCatalog Object Storage commands
- 9. Data Catalog Templates Examples
To run this tool directly in Cloud Shell, install it from PyPI:
# Set your service account credentials; for instructions, see 1.3. Auth credentials
# This name is just a suggestion; feel free to follow your own naming conventions
export GOOGLE_APPLICATION_CREDENTIALS=~/credentials/datacatalog-util-sa.json
# Install datacatalog-util
pip3 install --upgrade datacatalog-util --user
# Add to your PATH
export PATH=~/.local/bin:$PATH
# Look for available commands
datacatalog-util --help
Using virtualenv is optional, but strongly recommended unless you use Docker.
git clone https://github.com/mesmacosta/datacatalog-util
cd ./datacatalog-util
All paths starting with ./ in the next steps are relative to the datacatalog-util folder.
pip install --upgrade virtualenv
python3 -m virtualenv --python python3 env
source ./env/bin/activate
pip install --upgrade .
Docker may be used as an alternative to run the script. In this case, please disregard the Virtualenv setup instructions.
- Data Catalog Admin
- Storage Admin
This name is just a suggestion; feel free to follow your own naming conventions:
./credentials/datacatalog-util-sa.json
This step may be skipped if you're using Docker.
export GOOGLE_APPLICATION_CREDENTIALS=~/credentials/datacatalog-util-sa.json
Tags are composed of as many lines as required to represent all of their fields. The columns are described as follows:
Column | Description | Mandatory |
---|---|---|
linked_resource | Full name of the asset the Entry refers to. | Y |
template_name | Resource name of the Tag Template for the Tag. | Y |
column | Attach Tags to a column belonging to the Entry schema. | N |
field_id | Id of the Tag field. | Y |
field_value | Value of the Tag field. | Y |
TIPS
- sample-input/create-tags for reference;
- Data Catalog Sample Tags (Google Sheets) may help to create/export the CSV.
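The CSV layout described above can also be produced programmatically. The following is a minimal Python sketch, not part of datacatalog-util: the helper names `tag_rows` and `write_tags_csv`, the sample resource names, and the field ids are all hypothetical. It also assumes that every row repeats `linked_resource` and `template_name`; check the files under sample-input/create-tags for the exact conventions the tool accepts.

```python
import csv
import sys

# Column order taken from the table above.
TAG_COLUMNS = ["linked_resource", "template_name", "column", "field_id", "field_value"]


def tag_rows(linked_resource, template_name, fields, column=""):
    """Return one CSV row (as a dict) per Tag field.

    Assumption: each row repeats linked_resource and template_name.
    """
    return [
        {
            "linked_resource": linked_resource,
            "template_name": template_name,
            "column": column,  # empty means the Tag is attached to the Entry itself
            "field_id": field_id,
            "field_value": str(value),
        }
        for field_id, value in fields.items()
    ]


def write_tags_csv(rows, handle):
    """Write the rows, preceded by a header line, to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=TAG_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical resource names and field ids, for illustration only.
    rows = tag_rows(
        linked_resource="//bigquery.googleapis.com/projects/my-project/datasets/my_dataset/tables/my_table",
        template_name="projects/my-project/locations/us-central1/tagTemplates/my-template",
        fields={"owner": "data-team", "has_pii": True},
    )
    write_tags_csv(rows, sys.stdout)
```

Redirecting the output to a file gives a candidate input for `datacatalog-util tags create --csv-file CSV_FILE_PATH`.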
- Python + virtualenv
datacatalog-util tags create --csv-file CSV_FILE_PATH
- Docker
docker build --rm --tag datacatalog-util .
docker run --rm --tty \
--volume CREDENTIALS_FILE_FOLDER:/credentials --volume CSV_FILE_FOLDER:/data \
datacatalog-util tags create --csv-file /data/CSV_FILE_NAME
- Python + virtualenv
datacatalog-util tags delete --csv-file CSV_FILE_PATH
A summary file with statistics about each template is also created in the same directory.
The columns for the summary file are described as follows:
Column | Description |
---|---|
template_name | Resource name of the Tag Template for the Tag. |
tags_count | Number of tags found from the template. |
tagged_entries_count | Number of tagged entries with the template. |
tagged_columns_count | Number of tagged columns with the template. |
tag_string_fields_count | Number of used String fields on tags of the template. |
tag_bool_fields_count | Number of used Bool fields on tags of the template. |
tag_double_fields_count | Number of used Double fields on tags of the template. |
tag_timestamp_fields_count | Number of used Timestamp fields on tags of the template. |
tag_enum_fields_count | Number of used Enum fields on tags of the template. |
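As an illustration of how the summary columns might be consumed, here is a short Python sketch that ranks templates by one of the count columns. It assumes the summary is a CSV file with a header row matching the column names above and integer counts; the function name `rank_templates` and the sample data are made up for the example.

```python
import csv
import io


def rank_templates(handle, key="tags_count"):
    """Return (template_name, count) pairs sorted by descending count.

    Assumption: the handle yields CSV text whose header matches the
    summary columns listed above.
    """
    reader = csv.DictReader(handle)
    pairs = [(row["template_name"], int(row[key])) for row in reader]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample data; a real run would open the exported summary file.
    sample = io.StringIO(
        "template_name,tags_count,tagged_entries_count\n"
        "projects/my-project/locations/us-central1/tagTemplates/my-template,4,3\n"
        "projects/my-project/locations/us-central1/tagTemplates/my-template-2,9,5\n"
    )
    for name, count in rank_templates(sample):
        print(count, name)
```

Passing `key="tagged_columns_count"` or any other count column ranks by that column instead.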
The columns for each template file are described as follows:
Column | Description |
---|---|
relative_resource_name | Full resource name of the asset the Entry refers to. |
linked_resource | Full name of the asset the Entry refers to. |
template_name | Resource name of the Tag Template for the Tag. |
tag_name | Resource name of the Tag. |
column | Attach Tags to a column belonging to the Entry schema. |
field_id | Id of the Tag field. |
field_type | Type of the Tag field. |
field_value | Value of the Tag field. |
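Because each Tag spans one line per field, a consumer of the exported file usually has to regroup the lines. The sketch below shows one way to do that in Python. It is an assumption-laden example rather than the tool's own logic: the name `group_tag_fields` is hypothetical, and the code tolerates both layouts it might encounter (the `tag_name` repeated on every line, or left blank on continuation lines).

```python
import csv
import io


def group_tag_fields(handle):
    """Map each tag_name to a dict of {field_id: field_value}.

    Assumption: lines with an empty tag_name continue the previous Tag.
    """
    tags = {}
    current = None
    for row in csv.DictReader(handle):
        current = row["tag_name"] or current
        if current is None:
            continue  # no Tag seen yet, nothing to attach the field to
        tags.setdefault(current, {})[row["field_id"]] = row["field_value"]
    return tags


if __name__ == "__main__":
    # Made-up sample limited to the columns this sketch reads.
    sample = io.StringIO(
        "tag_name,field_id,field_value\n"
        "tagA,owner,alice\n"
        ",has_pii,true\n"
        "tagB,owner,bob\n"
    )
    for tag_name, fields in group_tag_fields(sample).items():
        print(tag_name, fields)
```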
- Python + virtualenv
datacatalog-util tags export --project-ids my-project --dir-path DIR_PATH
- Python + virtualenv
datacatalog-util tags export --project-ids my-project \
--dir-path DIR_PATH \
--tag-templates-names projects/my-project/locations/us-central1/tagTemplates/my-template,\
projects/my-project/locations/us-central1/tagTemplates/my-template-2
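In the command above, the template names end up as a single comma-separated argument with no spaces, since the shell removes the backslash-newline pair. When the list of templates is generated by a script, the same argument can be assembled in Python. This is only a sketch: `build_tags_export_command` is a hypothetical helper, and it uses only the flags shown in the examples above.

```python
import shlex


def build_tags_export_command(project_ids, dir_path, template_names=()):
    """Return the argument list for `datacatalog-util tags export`."""
    command = [
        "datacatalog-util", "tags", "export",
        "--project-ids", project_ids,
        "--dir-path", dir_path,
    ]
    if template_names:
        # Join without spaces so the names form one shell argument.
        command += ["--tag-templates-names", ",".join(template_names)]
    return command


if __name__ == "__main__":
    command = build_tags_export_command(
        "my-project",
        "DIR_PATH",
        [
            "projects/my-project/locations/us-central1/tagTemplates/my-template",
            "projects/my-project/locations/us-central1/tagTemplates/my-template-2",
        ],
    )
    # shlex.join (Python 3.8+) renders the list as a copy-pasteable command line.
    print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` once the tool is installed and credentials are configured.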
Templates are composed of as many lines as required to represent all of their fields. The columns are described as follows:
Column | Description | Mandatory |
---|---|---|
template_name | Resource name of the Tag Template for the Tag. | Y |
display_name | Display name of the Tag Template. | Y |
field_id | Id of the Tag Template field. | Y |
field_display_name | Display name of the Tag Template field. | Y |
field_type | Type of the Tag Template field. | Y |
enum_values | Values for the Enum field. | N |
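Before loading a Templates file, it can help to check that the mandatory columns are filled in. The Python sketch below does a strict check, assuming that every line must carry every column marked Y in the table above; the tool itself may be more lenient, and the name `find_missing_cells` is hypothetical.

```python
import csv
import io

# Columns marked "Y" in the table above.
MANDATORY_COLUMNS = (
    "template_name",
    "display_name",
    "field_id",
    "field_display_name",
    "field_type",
)


def find_missing_cells(handle, mandatory=MANDATORY_COLUMNS):
    """Return (line_number, column) for every empty mandatory cell.

    Line numbers are 1-based and count the header as line 1.
    """
    problems = []
    for line_number, row in enumerate(csv.DictReader(handle), start=2):
        for column in mandatory:
            if not (row.get(column) or "").strip():
                problems.append((line_number, column))
    return problems


if __name__ == "__main__":
    # Made-up sample: the second data line has no field_type.
    sample = io.StringIO(
        "template_name,display_name,field_id,field_display_name,field_type,enum_values\n"
        "projects/my-project/locations/us-central1/tagTemplates/my-template,My Template,owner,Owner,STRING,\n"
        "projects/my-project/locations/us-central1/tagTemplates/my-template,My Template,has_pii,Has PII,,\n"
    )
    for line_number, column in find_missing_cells(sample):
        print(f"line {line_number}: missing {column}")
```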
- Python + virtualenv
datacatalog-util tag-templates create --csv-file CSV_FILE_PATH
- Python + virtualenv
datacatalog-util tag-templates delete --csv-file CSV_FILE_PATH
TIPS
- See sample-input/create-tag-templates for reference.
Templates are composed of as many lines as required to represent all of their fields. The columns are described as follows:
Column | Description |
---|---|
template_name | Resource name of the Tag Template for the Tag. |
display_name | Display name of the Tag Template. |
field_id | Id of the Tag Template field. |
field_display_name | Display name of the Tag Template field. |
field_type | Type of the Tag Template field. |
enum_values | Values for the Enum field. |
- Python + virtualenv
datacatalog-util tag-templates export --project-ids my-project --file-path CSV_FILE_PATH
Filesets are composed of as many lines as required to represent all of their fields. The columns are described as follows:
Column | Description | Mandatory |
---|---|---|
entry_group_name | Entry Group Name. | Y |
entry_group_display_name | Entry Group Display Name. | N |
entry_group_description | Entry Group Description. | N |
entry_id | Entry ID. | Y |
entry_display_name | Entry Display Name. | Y |
entry_description | Entry Description. | N |
entry_file_patterns | Entry File Patterns. | Y |
schema_column_name | Schema column name. | N |
schema_column_type | Schema column type. | N |
schema_column_description | Schema column description. | N |
schema_column_mode | Schema column mode. | N |
Please note that schema_column_type is an open string field and accepts any value. If you want to use your fileset with Dataflow SQL, follow the data types listed in the official docs.
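To make the one-line-per-schema-column layout concrete, here is a small Python sketch that expands a fileset description into CSV rows. It is illustrative only: `fileset_rows` is a hypothetical helper, the bucket, ids, column types, and modes are invented sample values, and the sketch assumes the entry-level columns are repeated on every line.

```python
import csv
import sys

# Column order taken from the table above.
FILESET_COLUMNS = [
    "entry_group_name", "entry_group_display_name", "entry_group_description",
    "entry_id", "entry_display_name", "entry_description", "entry_file_patterns",
    "schema_column_name", "schema_column_type",
    "schema_column_description", "schema_column_mode",
]


def fileset_rows(entry_group_name, entry_id, entry_display_name,
                 entry_file_patterns, schema=()):
    """Return one row per schema column, or a single row when there is no schema.

    `schema` is an iterable of (name, type, description, mode) tuples.
    Optional entry-level columns are left empty in this sketch.
    """
    base = dict.fromkeys(FILESET_COLUMNS, "")
    base.update(
        entry_group_name=entry_group_name,
        entry_id=entry_id,
        entry_display_name=entry_display_name,
        entry_file_patterns=entry_file_patterns,
    )
    if not schema:
        return [base]
    rows = []
    for name, column_type, description, mode in schema:
        row = dict(base)
        row.update(
            schema_column_name=name,
            schema_column_type=column_type,
            schema_column_description=description,
            schema_column_mode=mode,
        )
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    rows = fileset_rows(
        "projects/my_project/locations/us-central1/entryGroups/my_entry_group",
        "my_fileset",
        "My fileset",
        "gs://my_bucket/*.csv",
        schema=[("id", "INT64", "Row identifier", "REQUIRED"),
                ("name", "STRING", "Customer name", "NULLABLE")],
    )
    writer = csv.DictWriter(sys.stdout, fieldnames=FILESET_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```

Calling the helper without `schema` yields a single line with empty schema columns, which corresponds to the fileset-without-schema case.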
- Python + virtualenv
datacatalog-util filesets create --csv-file CSV_FILE_PATH
TIPS
- sample-input/create-filesets for reference;
- If you want to create filesets without schema: sample-input/create-filesets/fileset-entry-opt-1-all-metadata-no-schema.csv for reference;
- Python + virtualenv
datacatalog-util filesets create --csv-file CSV_FILE_PATH --validate-dataflow-sql-types
Users can choose the Tag fields from the list provided at Tags.
datacatalog-util filesets enrich --project-id my-project
6.3.1 Enrich all fileset entries using Tag Template from a different Project (Good way to reuse the same Template)
If you are using a different Project, make sure the Service Account has the following permissions on that Project or that Template:
- Data Catalog TagTemplate Creator
- Data Catalog TagTemplate User
datacatalog-util filesets \
--project-id my_project \
enrich --tag-template-name projects/my_different_project/locations/us-central1/tagTemplates/fileset_enricher_findings
Cleans up the Template and Tags from the Fileset Entries. Running the main command will recreate them.
datacatalog-util filesets clean-up-templates-and-tags --project-id my-project
- Python + virtualenv
datacatalog-util filesets delete --csv-file CSV_FILE_PATH
Filesets are composed of as many lines as required to represent all of their fields. The columns are described as follows:
Column | Description | Mandatory |
---|---|---|
entry_group_name | Entry Group Name. | Y |
entry_group_display_name | Entry Group Display Name. | Y |
entry_group_description | Entry Group Description. | Y |
entry_id | Entry ID. | Y |
entry_display_name | Entry Display Name. | Y |
entry_description | Entry Description. | Y |
entry_file_patterns | Entry File Patterns. | Y |
schema_column_name | Schema column name. | N |
schema_column_type | Schema column type. | N |
schema_column_description | Schema column description. | N |
schema_column_mode | Schema column mode. | N |
- Python + virtualenv
datacatalog-util filesets export --project-ids my-project --file-path CSV_FILE_PATH
datacatalog-util \
object-storage sync-entries --type cloud_storage \
--project-id my_project \
--entry-group-name projects/my_project/locations/us-central1/entryGroups/my_entry_group \
--bucket-prefix my_bucket
datacatalog-util \
object-storage delete-entries --type cloud_storage \
--project-id my_project \
--entry-group-name projects/my_project/locations/us-central1/entryGroups/my_entry_group