merge 0.4.0
Merge branch '0.4.0'
sonalgoyal committed Aug 29, 2024
2 parents fad22e5 + 3be49c4 commit 38f24a6
Showing 63 changed files with 1,430 additions and 322 deletions.
Binary file removed .DS_Store
Binary file not shown.
6 changes: 3 additions & 3 deletions .readthedocs.yaml
@@ -30,6 +30,6 @@ sphinx:
# Optional but recommended, declare the Python requirements required
# to build your documentation
# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
# python:
# install:
# - requirements: docs/requirements.txt
python:
install:
- requirements: python/requirements.txt
5 changes: 2 additions & 3 deletions docs/README.md
@@ -2,13 +2,12 @@
description: Hope you find us useful :-)
---

# Welcome to Zingg
# Welcome To Zingg

![](https://static.scarf.sh/a.png?x-pxid=d6dda06e-06c7-4e4a-99c9-ed9f6364dfeb)

This is the latest documentation for Zingg. Release wise documentation can be accessed through:

* [v0.4.1 ](https://docs.zingg.ai/zingg0.4.1/)
* [v0.4.0 ](https://docs.zingg.ai/zingg0.4.0/)
* [v0.3.4 ](https://docs.zingg.ai/zingg0.3.4/)
* [v0.3.3](https://docs.zingg.ai/zingg0.3.3/)
@@ -25,4 +24,4 @@ Zingg is a quick and scalable way to build a single source of truth for core bus

## Book Office Hours

If you want to schedule a 30-min call with our team to help you get set up, please select some time directly [here](https://calendly.com/sonalgoyal/30min).
If you want to schedule a 30-min call with our team to help you get set up, please select a slot directly [here](https://calendly.com/sonalgoyal/30min).
37 changes: 20 additions & 17 deletions docs/SUMMARY.md
@@ -1,13 +1,13 @@
# Table of contents

* [Welcome to Zingg](README.md)
* [Welcome To Zingg](README.md)
* [Step-By-Step Guide](stepByStep.md)
* [Installation](setup/installation.md)
* [Docker](stepbystep/installation/docker/README.md)
* [Sharing custom data and config files](stepbystep/installation/docker/sharing-custom-data-and-config-files.md)
* [Shared locations](stepbystep/installation/docker/shared-locations.md)
* [File read/write permissions](stepbystep/installation/docker/file-read-write-permissions.md)
* [Copying Files To and From the Container](stepbystep/installation/docker/copying-files-to-and-from-the-container.md)
* [Sharing Custom Data And Config Files](stepbystep/installation/docker/sharing-custom-data-and-config-files.md)
* [Shared Locations](stepbystep/installation/docker/shared-locations.md)
* [File Read/Write Permissions](stepbystep/installation/docker/file-read-write-permissions.md)
* [Copying Files To And From The Container](stepbystep/installation/docker/copying-files-to-and-from-the-container.md)
* [Installing From Release](stepbystep/installation/installing-from-release/README.md)
* [Single Machine Setup](stepbystep/installation/installing-from-release/single-machine-setup.md)
* [Spark Cluster Checklist](stepbystep/installation/installing-from-release/spark-cluster-checklist.md)
@@ -19,23 +19,25 @@
* [Zingg Command Line](stepbystep/zingg-command-line.md)
* [Configuration](stepbystep/configuration/README.md)
* [Configuring Through Environment Variables](stepbystep/configuration/configuring-through-environment-variables.md)
* [Data Input and Output](stepbystep/configuration/data-input-and-output/README.md)
* [Data Input And Output](stepbystep/configuration/data-input-and-output/README.md)
* [Input Data](stepbystep/configuration/data-input-and-output/data.md)
* [Output](stepbystep/configuration/data-input-and-output/output.md)
* [Field Definitions](stepbystep/configuration/field-definitions.md)
* [Deterministic Matching](deterministicMatching.md)
* [Model Location](stepbystep/configuration/model-location.md)
* [Tuning Label, Match And Link Jobs](stepbystep/configuration/tuning-label-match-and-link-jobs.md)
* [Telemetry](stepbystep/configuration/telemetry.md)
* [Working With Training Data](setup/training/createTrainingData.md)
* [Finding Records For Training Set Creation](setup/training/findTrainingData.md)
* [Labeling Records](setup/training/label.md)
* [Find And Label](setup/training/findAndLabel.md)
* [Using pre-existing training data](setup/training/addOwnTrainingData.md)
* [Using Pre-existing Training Data](setup/training/addOwnTrainingData.md)
* [Updating Labeled Pairs](updatingLabels.md)
* [Exporting Labeled Data](setup/training/exportLabeledData.md)
* [Building and saving the model](setup/train.md)
* [Finding the matches](setup/match.md)
* [Linking across datasets](setup/link.md)
* [Building And Saving The Model](setup/train.md)
* [Finding The Matches](setup/match.md)
* [Adding Incremental Data](runIncremental.md)
* [Linking Across Datasets](setup/link.md)
* [Data Sources and Sinks](dataSourcesAndSinks/connectors.md)
* [Zingg Pipes](dataSourcesAndSinks/pipes.md)
* [Databricks](dataSourcesAndSinks/databricks.md)
@@ -51,19 +53,20 @@
* [BigQuery](dataSourcesAndSinks/bigquery.md)
* [Exasol](dataSourcesAndSinks/exasol.md)
* [Working With Python](working-with-python.md)
* [Running Zingg on Cloud](running/running.md)
* [Running on AWS](running/aws.md)
* [Running on Azure](running/azure.md)
* [Running on Databricks](running/databricks.md)
* [Python API](python/markdown/index.md)
* [Running Zingg On Cloud](running/running.md)
* [Running On AWS](running/aws.md)
* [Running On Azure](running/azure.md)
* [Running On Databricks](running/databricks.md)
* [Zingg Models](zModels.md)
* [Pre-trained models](pretrainedModels.md)
* [Pre-Trained Models](pretrainedModels.md)
* [Improving Accuracy](improving-accuracy/README.md)
* [Ignoring Commonly Occuring Words While Matching](accuracy/stopWordsRemoval.md)
* [Defining Domain Specific Blocking And Similarity Functions](accuracy/definingOwn.md)
* [Documenting The Model](generatingdocumentation.md)
* [Interpreting Output Scores](scoring.md)
* [Reporting bugs and contributing](contributing.md)
* [Setting Zingg Development Environment](settingUpZingg.md)
* [Reporting Bugs And Contributing](contributing.md)
* [Setting Up Zingg Development Environment](settingUpZingg.md)
* [Community](community.md)
* [Frequently Asked Questions](faq.md)
* [Reading Material](reading.md)
22 changes: 10 additions & 12 deletions docs/accuracy/definingOwn.md
@@ -3,27 +3,27 @@ nav_order: 6
description: To add blocking functions and how they work
---

# Defining Own Functions
# Defining Domain Specific Blocking And Similarity Functions

You can add your own [blocking functions](https://github.com/zinggAI/zingg/tree/main/common/core/src/main/java/zingg/common/core/hash) which will be evaluated by Zingg to build the [blocking tree.](../zModels.md)

The blocking tree works on the matched records provided by the user as part of the training. At every node, it selects the hash function and the field on which it should be applied so that there is the least elimination of the matching pairs. Say we have data like this:
The blocking tree works on the matched records provided by the user as part of the training. At every node, it selects the hash function and the field on which it should be applied so that there is the least elimination of the matching pairs. \
\
Say we have data like this:

| Pair 1 | firstname | lastname |
| :------: | :-------: | :------: |
| Record A | john | doe |
| Record B | johnh | d oe |

****
***

| Pair 2 | firstname | lastname |
| :-------: | :-------: | :------: |
| Record A | mary | ann |
| Record B | marry | |



Let us assume we have hash function first1char and we want to check if it is a good function to apply to firstname:
Let us assume we have hash function **first1char** and we want to check if it is a good function to apply to **firstname**:

| Pair | Record | Output |
| :--: | :------: | ------ |
@@ -34,9 +34,7 @@ Let us assume we have hash function first1char and we want to check if it is a g

There is no elimination in the pairs above, hence it is a good function.



Now let us try last1char on firstname
Now let us try **last1char** on **firstname:**

| Pair | Record | Output |
| :--: | :------: | ------ |
@@ -45,12 +43,12 @@ Now let us try last1char on firstname
| 2 | Record A | y |
| 2 | Record B | y |

Pair 1 is getting eliminated above, hence last1char is not a good function. 
Pair 1 is getting eliminated above, hence **last1char** is not a good function.

So, first1char(firstname) will be chosen. This brings near similar records together - in a way, clusters them to break the cartesian join.
So, **first1char**(**firstname**) will be chosen. This brings near similar records together - in a way, clusters them to break the cartesian join.

These business-specific blocking functions go into [Hash Functions](https://github.com/zinggAI/zingg/tree/main/common/core/src/main/java/zingg/common/core/hash) and must be added to [HashFunctionRegistry](../../common/core/src/main/java/zingg/common/core/hash/HashFunctionRegistry.java) and [hash functions config](../../common/core/src/main/resources/hashFunctions.json).

Also, for similarity, you can define your own measures. Each dataType has predefined features, for example, [String](../../common/core/src/main/java/zingg/common/core/feature/StringFeature.java) fuzzy type is configured for Affine and Jaro.
Also, for similarity, you can define your own measures. Each **dataType** has predefined features, for example, [String](../../common/core/src/main/java/zingg/common/core/feature/StringFeature.java) fuzzy type is configured for Affine and Jaro.

You can define your own [comparisons](https://github.com/zinggAI/zingg/tree/main/common/core/src/main/java/zingg/common/core/similarity/function) and use them.
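
The selection logic described in this page can be illustrated with a small Python sketch. This is not Zingg's implementation (the real hash functions are Java classes registered in `HashFunctionRegistry`); the helper `eliminated_pairs` and the Python versions of `first1char` and `last1char` are hypothetical and only replay the example tables above.

```python
# Illustrative sketch only: checks which labeled matching pairs a
# candidate hash function would split into different blocks.

def first1char(value: str) -> str:
    """Return the first character of the value (empty string if blank)."""
    return value[:1]

def last1char(value: str) -> str:
    """Return the last character of the value (empty string if blank)."""
    return value[-1:]

# firstname values of the two matching pairs from the tables above
PAIRS = [("john", "johnh"), ("mary", "marry")]

def eliminated_pairs(hash_fn, pairs):
    """Return the 1-based numbers of pairs whose records hash differently."""
    return [
        number
        for number, (record_a, record_b) in enumerate(pairs, start=1)
        if hash_fn(record_a) != hash_fn(record_b)
    ]

# A function that eliminates fewer matching pairs is the better choice.
best = min((first1char, last1char), key=lambda fn: len(eliminated_pairs(fn, PAIRS)))
```

With the sample data, `first1char` keeps both pairs together while `last1char` separates Pair 1, which is why the former would be preferred for `firstname`.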
11 changes: 4 additions & 7 deletions docs/accuracy/stopWordsRemoval.md
@@ -1,20 +1,18 @@
# Ignoring Commonly Occuring Words While Matching

Common words like Mr, Pvt, Av, St, Street etc do not add differential signal and confuse matching. These words are called stopwords and matching is more accurate when stopwrods are ignored.
Common words like Mr, Pvt, Av, St, Street etc. do not add differential signals and confuse matching. These words are called **stopwords** and matching is more accurate when stopwords are ignored.

In order to remove stopwords from a field, configure 
The stopwords can be recommended by Zingg by invoking:

The stopwords can be recommended by Zingg by invoking

`./scripts/zingg.sh --phase recommend --conf <conf.json> --columns <list of columns to generate stop word recommendations>`&#x20;
`./scripts/zingg.sh --phase recommend --conf <conf.json> --columns <list of columns to generate stop word recommendations>`

By default, Zingg extracts 10% of the high-frequency unique words from a dataset. If the user wants a different selection, they should set up the following property in the config file:

```
stopWordsCutoff: <a value between 0 and 1>
```

Once you have verified the above stop words, you can configure them in the JSON variable **stopWords** with the path to the CSV file containing them. Please ensure while editing the CSV or building it manually that it should contain one word per row.
Once you have verified the above stop words, you can configure them in the JSON variable **stopWords** with the path to the CSV file containing them. When editing the CSV or building it manually, please ensure that it contains _one word per row_.

```
"fieldDefinition":[
@@ -26,4 +24,3 @@ Once you have verified the above stop words, you can configure them in the JSON
"stopWords": "models/100/stopWords/fname.csv"
},
```
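
As a rough illustration of the idea behind `stopWordsCutoff`, the sketch below picks the most frequent fraction of unique words from a column and writes them one word per row. It is an assumption-laden approximation, not Zingg's `recommend` phase: the function names `recommend_stopwords` and `to_stopwords_csv` are hypothetical, and whether Zingg expects a header row in the CSV is not stated here.

```python
import csv
import io
from collections import Counter

def recommend_stopwords(values, cutoff=0.1):
    """Return the top `cutoff` fraction of unique words, most frequent first.

    Hypothetical helper: approximates 'extract 10% of the high-frequency
    unique words'; Zingg's own selection logic may differ.
    """
    counts = Counter(word.lower() for value in values for word in value.split())
    keep = max(1, int(len(counts) * cutoff))  # always keep at least one word
    return [word for word, _ in counts.most_common(keep)]

def to_stopwords_csv(words):
    """Serialize the words as CSV text with exactly one word per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for word in words:
        writer.writerow([word])
    return buffer.getvalue()

sample = ["Mr John Doe", "Mr Jane Roe", "Mr Sam Poe", "Mr A B"]
stopwords = recommend_stopwords(sample, cutoff=0.1)
csv_text = to_stopwords_csv(stopwords)
```

The resulting text could be saved to a file and referenced from the `stopWords` property of the relevant field definition after manual review.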

4 changes: 2 additions & 2 deletions docs/connectors/jdbc/mysql.md
@@ -1,6 +1,6 @@
# MySQL

## Reading from MySQL database:
## Reading From MySQL Database:

```json
"data" : [{
@@ -16,4 +16,4 @@
}],
```

Please replace \<db\_name> with the name of the database in addition to other props. For more details, refer to the [spark documentation](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).
Please replace `<db_name>` with the _name_ of the database in addition to other props. For more details, refer to the [Spark documentation](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).
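
For readers who generate the configuration programmatically, here is a minimal Python sketch of assembling a JDBC pipe. The helper names `mysql_url` and `jdbc_pipe` are hypothetical, the `"jdbc"` format value is an assumption, and the property keys (`url`, `dbtable`, `driver`, `user`, `password`) are the standard Spark JDBC options; the host, port, table, and credentials are placeholders.

```python
import json

def mysql_url(host: str, port: int, db_name: str) -> str:
    """Build a MySQL JDBC URL; db_name replaces the <db_name> placeholder."""
    return f"jdbc:mysql://{host}:{port}/{db_name}"

def jdbc_pipe(name, url, table, driver, user, password):
    """Return a pipe dictionary using standard Spark JDBC option names."""
    return {
        "name": name,
        "format": "jdbc",  # assumed format name for JDBC sources
        "props": {
            "url": url,
            "dbtable": table,
            "driver": driver,
            "user": user,
            "password": password,
        },
    }

pipe = jdbc_pipe(
    name="customers",                      # placeholder pipe name
    url=mysql_url("localhost", 3306, "zingg"),
    table="customers",                     # placeholder table
    driver="com.mysql.cj.jdbc.Driver",     # MySQL Connector/J driver class
    user="zingg_user",                     # placeholder credentials
    password="change-me",
)
config_fragment = json.dumps({"data": [pipe]}, indent=4)
```

Serializing with `json.dumps` keeps the fragment well-formed, which avoids hand-editing mistakes when the database name or credentials change.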
4 changes: 2 additions & 2 deletions docs/connectors/jdbc/postgres.md
@@ -1,6 +1,6 @@
# Postgres

## JSON Settings for reading data from Postgres database:
## JSON Settings For Reading Data From Postgres Database:

```json
"data" : [{
@@ -16,4 +16,4 @@
}],
```

Please replace \<db\_name> with the name of the database in addition to other props. For more details, refer to the [spark documentation](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).
Replace `<db_name>` with the _name_ of the database in addition to other props. For more details, refer to the [Spark documentation](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).
10 changes: 5 additions & 5 deletions docs/dataSourcesAndSinks/bigquery.md
@@ -14,7 +14,7 @@ In addition, the following property needs to be set
spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
```

If Zingg is run from outside the Google cloud, it requires authentication, please set the following env variable to the location of the file containing the service account key. A service account key can be created and downloaded in JSON format from the [Google Cloud console](https://cloud.google.com/docs/authentication/getting-started).
If Zingg is run from outside the Google cloud, it requires authentication, please set the following _environment variable_ to the location of the file containing the _service account key_. A service account key can be created and downloaded in JSON format from the [Google Cloud console](https://cloud.google.com/docs/authentication/getting-started).

```bash
export GOOGLE_APPLICATION_CREDENTIALS=path to google service account key file
@@ -24,7 +24,7 @@ Connection properties for BigQuery as a data source and data sink are given belo

## Properties for reading data from BigQuery:

The property **"credentialsFile"** should point to the google service account key file location. This is the same path that is used to set variable **GOOGLE\_APPLICATION\_CREDENTIALS**. The **"table"** property should point to a BigQuery table that contains source data. The property **"viewsEnabled"** must be set to true only.
The property `credentialsFile` should point to the Google service account key file location. This is the same path that is used to set variable `GOOGLE_APPLICATION_CREDENTIALS`. The `table` property should point to a BigQuery table that contains source data. The property `viewsEnabled` must be set to **true** only.

```json
"data" : [{
@@ -38,9 +38,9 @@ The property **"credentialsFile"** should point to the google service account ke
}],
```
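
Since `credentialsFile` is meant to be the same path as `GOOGLE_APPLICATION_CREDENTIALS`, one way to keep the two in sync is to derive the pipe from the environment. The sketch below is only an illustration: `bigquery_read_pipe` is a hypothetical helper, the `"bigquery"` format value is an assumption, and the table name is a placeholder; the property names come from the description above.

```python
import os

def bigquery_read_pipe(name, table, env=None):
    """Build a BigQuery input pipe whose credentialsFile mirrors the env variable."""
    env = os.environ if env is None else env
    credentials = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not credentials:
        raise ValueError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    return {
        "name": name,
        "format": "bigquery",  # assumed format name
        "props": {
            "credentialsFile": credentials,
            "table": table,
            "viewsEnabled": True,  # must be true, per the text above
        },
    }

# Example with an explicit environment mapping instead of the real one
example_env = {"GOOGLE_APPLICATION_CREDENTIALS": "/keys/service-account.json"}
pipe = bigquery_read_pipe("source", "project.dataset.table", env=example_env)
```

Failing early when the variable is missing makes the authentication problem visible before the Spark job starts.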

## Properties for writing data to BigQuery:
## Properties For Writing Data To BigQuery:

To write to BigQuery, a bucket needs to be created and assigned to the **"temporaryGcsBucket"** property.
To write to BigQuery, a bucket needs to be created and assigned to the `temporaryGcsBucket` property.

```json
"output" : [{
@@ -57,7 +57,7 @@ To write to BigQuery, a bucket needs to be created and assigned to the **"tempor
## Notes:

* The library **"gcs-connector-hadoop2-latest.jar"** can be downloaded from [Google](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar) and the library **"spark-bigquery-with-dependencies\_2.12-0.24.2"** from the [maven repo](https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies\_2.12/0.24.2/spark-bigquery-with-dependencies\_2.12-0.24.2.jar).
* A typical service account key file looks like the below. The format of the file is JSON.
* A typical service account key file looks like below (JSON).

```json
{
7 changes: 4 additions & 3 deletions docs/dataSourcesAndSinks/connectors.md
@@ -6,9 +6,10 @@ has_children: true

# Data Sources and Sinks

Zingg connects, reads, and writes to most on-premise and cloud data sources.

Zingg can read and write to Databricks, Snowflake, Cassandra, S3, Azure, Elastic, Exasol, major RDBMS, and any Spark-supported data sources. Zingg also works with all major file formats like Parquet, Avro, JSON, XLSX, CSV, TSV, etc. This is done through the Zingg [pipe](pipes.md) abstraction.
Zingg _connects, reads,_ and _writes_ to most on-premise and cloud data sources.

Zingg can read and write to **Databricks, Snowflake, Cassandra, S3, Azure, Elastic, Exasol**, major **RDBMS**, and any **Spark**-supported data sources. \
\
Zingg also works with all major file formats like Parquet, Avro, JSON, XLSX, CSV, TSV, etc. This is done through the Zingg [Pipe](pipes.md) abstraction.

![](../../assets/zinggOSS.png)
4 changes: 2 additions & 2 deletions docs/dataSourcesAndSinks/databricks.md
@@ -1,5 +1,5 @@
# Databricks

As a Spark based application, Zingg Open Source works seamlessly on Databricks. Zingg leverages Databricks Spark environment, and can access all the supported data sources like parquet and the delta file format.
As a Spark-based application, Zingg Open Source works seamlessly on Databricks. Zingg leverages Databricks' Spark environment, and can access all the supported data sources like parquet and the delta file format.

Please check the various ways in which you can run Zingg On Databricks [here](../running/databricks.md)
Please check the various ways in which you can run Zingg on Databricks [here](../running/databricks.md)
38 changes: 19 additions & 19 deletions docs/dataSourcesAndSinks/exasol.md
@@ -26,7 +26,7 @@ For example:
spark.jars=spark-connector_2.12-1.3.0-spark-3.3.2-assembly.jar
```

If there are more than one jar files, please use comma as separator. Additionally, please change the version accordingly so that it matches your Zingg and Spark versions.
If there is more than one jar file, please use a _comma_ as the separator. Additionally, please change the version accordingly so that it matches your Zingg and Spark versions.

## Connector Settings

@@ -52,28 +52,28 @@ For example:
...
```

Similarly, for output:
Similarly, for output:

```json
...
```json
...
"output": [
{
"name": "output",
"format": "com.exasol.spark",
"props": {
"host": "10.11.0.2",
"port": "8563",
"username": "sys",
"password": "exasol",
"create_table": "true",
"table": "DB_SCHEMA.ENTITY_RESOLUTION",
},
"mode": "Append"
}
{
"name": "output",
"format": "com.exasol.spark",
"props": {
"host": "10.11.0.2",
"port": "8563",
"username": "sys",
"password": "exasol",
"create_table": "true",
"table": "DB_SCHEMA.ENTITY_RESOLUTION"
},
"mode": "Append"
}
],
...
```
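
If the output section is generated rather than hand-written, a short Python sketch like the one below can produce it. The values are copied from the example above; building the fragment with `json.dumps` is merely one possible approach (not something the Zingg docs prescribe) that keeps the JSON strictly valid.

```python
import json

# Output pipe mirroring the Exasol example above
output_pipe = {
    "name": "output",
    "format": "com.exasol.spark",
    "props": {
        "host": "10.11.0.2",   # first internal node's IPv4 address
        "port": "8563",
        "username": "sys",
        "password": "exasol",
        "create_table": "true",
        "table": "DB_SCHEMA.ENTITY_RESOLUTION",
    },
    "mode": "Append",
}

# Serialize as the "output" section of a Zingg configuration file
config_fragment = json.dumps({"output": [output_pipe]}, indent=4)
```

Parsing the result back with `json.loads` is a cheap sanity check before passing the configuration to Zingg.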

Please note that, the `host` parameter should be the first internal node's IPv4 address.
Please note that the `host` parameter should be the first internal node's **IPv4** **address**.

As Zingg uses [Exasol Spark connector](https://github.com/exasol/spark-connector) underneath, please also check out the [user guide](https://github.com/exasol/spark-connector/blob/main/doc/user_guide/user_guide.md) and [configuration options](https://github.com/exasol/spark-connector/blob/main/doc/user_guide/user_guide.md#configuration-options) for more information.
As Zingg uses [Exasol Spark connector](https://github.com/exasol/spark-connector) underneath, please also check out the [user guide](https://github.com/exasol/spark-connector/blob/main/doc/user\_guide/user\_guide.md) and [configuration options](https://github.com/exasol/spark-connector/blob/main/doc/user\_guide/user\_guide.md#configuration-options) for more information.
7 changes: 3 additions & 4 deletions docs/dataSourcesAndSinks/jdbc.md
@@ -1,12 +1,11 @@
# Jdbc
# JDBC

Zingg can connect to various databases such as Mysql, DB2, MariaDB, MS SQL, Oracle, PostgreSQL, etc. using JDBC. One just needs to download the appropriate driver and made it accessible to the application.
Zingg can connect to various databases such as MySQL, DB2, MariaDB, MS SQL, Oracle, PostgreSQL, etc. using JDBC. One just needs to download the appropriate driver and make it accessible to the application.

To include the JDBC driver for your particular database on the Spark classpath, please add the property **spark.jars** in [Zingg's runtime properties.](../stepbystep/zingg-runtime-properties.md)

```
spark.jars=<location of jdbc driver jar>
```

Connection details are given in the following sections for a few common JDBC sources.&#x20;

Connection details are given in the following sections for a few common JDBC sources.