Update readme info
mormigil committed Oct 10, 2024
1 parent 998477b commit 4e3949d
Showing 12 changed files with 1,260 additions and 9 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -22,3 +22,8 @@
# virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml
hs_err_pid*
replay_pid*
**/target/
**/.idea/
*.iws
*.iml
*.ipr
11 changes: 8 additions & 3 deletions AnalyticsForEveryone/README.md
@@ -1,7 +1,12 @@
# Analytics for Everyone

Here you can find reference material for a Druid summit talk https://druidsummit.org/agenda/
Here you can find reference material for a Druid Summit 2024 talk given by Willis Kennedy: https://druidsummit.org/agenda/

This directory holds two plugins we use at Roblox for the usage of Approximate Datasketches. Our data platform primarily uses Spark and Trino so we built custom udaf extensions for apache datasketches to ensure a binary compatible datasketch that we can share from our C++ engine code all the way to Spark jobs or Trino queries. You can find out plenty more about these sketches at https://datasketches.apache.org/
## Datasketches

Additionally in the druid_workflow piece you can find example data and usage of datasketch data in druid.
This directory holds two plugins we use at Roblox that wrap approximate DataSketches. Our data platform primarily uses Spark and Trino, so we built custom UDAF (user-defined aggregate function) extensions for Apache DataSketches to ensure a binary-compatible sketch that we can share from our C++ engine code all the way to Spark jobs or Trino queries. You can find out more about these sketches at https://datasketches.apache.org/


## Druid Workflow

In the workflow directory you can find an example Docker Compose file that stands up Druid, along with details on how to ingest sample data using the SQL statements in this repo. Feel free to adapt or reuse these SQL statements for your own queries or ingestion.
@@ -0,0 +1,28 @@
# Installation

Follow the standard Trino [plugin deployment](https://trino.io/docs/current/develop/spi-overview.html) process. The plugin must be available on every node, so include it in whatever mechanism you use to deploy your Trino cluster. We deploy it with the Trino version specified in the `pom.xml` in this repo.
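
Once the plugin is deployed, one quick way to check that the functions are registered is Trino's `SHOW FUNCTIONS` statement. This is a minimal sketch that assumes the function names documented in this repo; the patterns are only illustrative.

```sql
-- List registered functions whose names match each pattern.
-- An empty result suggests the plugin was not picked up by the cluster.
SHOW FUNCTIONS LIKE 'hll%';
SHOW FUNCTIONS LIKE '%items_sketch%';
```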

# DataSketches UDFs

[Apache DataSketches](https://datasketches.apache.org/) is a library of highly efficient data streaming algorithms for running approximate queries on very large datasets.
These algorithms create data structures called "sketches", which aggregate data to be queried efficiently.
These UDFs provide interfaces to HLL, Theta, and KLL sketches. For more information on DataSketches, see the [official documentation](https://datasketches.apache.org/docs/Background/TheChallenge.html).

We provide:
* HLL Sketches - Count Distinct Elements (Fast)
* Theta Sketches - Count Distinct Elements w/ Set Operations (Sometimes less fast)
* KLL Floats Sketches - Compute Quantiles / Ranks (Less precision, more efficient)
* KLL Doubles Sketches - Compute Quantiles / Ranks (More precision, less efficient)
* String Items Sketches - Estimate frequency of String entries
* Double Items Sketches - Estimate frequency of Double / Real / Float entries
* Long Items Sketches - Estimate frequency of Long / Int entries

## DataSketches UDF Documentation

- [HLL Sketch](./datasketches-udfs/hll-sketch)
- [Theta Sketch](./datasketches-udfs/theta-sketch)
- [KLL Floats Sketch](./datasketches-udfs/kll-floats-sketch)
- [KLL Doubles Sketch](./datasketches-udfs/kll-doubles-sketch)
- [String Items Sketch](./datasketches-udfs/string-items-sketch)
- [Double Items Sketch](./datasketches-udfs/double-items-sketch)
- [Long Items Sketch](./datasketches-udfs/long-items-sketch)
@@ -0,0 +1,154 @@
# Double Items Sketches

A Double Items sketch is an [Items Sketch](https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html) that stores Double values. It can be used to estimate the frequency of items in a dataset.

Note that `DOUBLE` and `REAL` items should not be used together in a Double Items sketch. The difference in precision causes items that should be the same to be considered distinct (e.g. `DOUBLE 4 != REAL 4`).


(double_items_sketch_1)=
## `double_items_sketch(column)`

Parameters:
* `column` (`DOUBLE`, `REAL`, `VARBINARY`): The column of values to create the sketch from. If inputs are `VARBINARY`, they are assumed to be serialized sketches which are unioned to produce the output sketch.

Returns:
* (`VARBINARY`): The serialized Double Items sketch.

Notes:
* This is an aggregation function, so a column will be reduced to a single `VARBINARY` value.
* When creating a new sketch from values, the output sketch will use a maximum map size of 64. To customize this, use [](double_items_sketch_2).
* When aggregating existing sketches, the output sketch will use the same maximum map size as one of the input sketches. Ensure all input sketches have the same maximum map size to avoid errors.

Examples:
```sql
-- item_price is a DOUBLE column
SELECT double_items_sketch(item_price) AS item_price_sketch
-- Output: 0x... (VARBINARY)
```

```sql
-- item_price_items_sketch is a VARBINARY column
SELECT double_items_sketch(item_price_items_sketch) AS item_price_sketch
-- Output: 0x... (VARBINARY)
```


(double_items_sketch_2)=
## `double_items_sketch(column, size)`

Parameters:
* `column` (`DOUBLE`, `REAL`): The column of values to create the sketch from.
* `size` (`BIGINT`): The desired sketch's maximum map size. Must be a power of 2.

Returns:
* (`VARBINARY`): The serialized Double Items sketch with the specified maximum map size.

Notes:
* This is an aggregation function, so a column will be reduced to a single `VARBINARY` value.

Examples:
```sql
-- item_price is a DOUBLE column
SELECT double_items_sketch(item_price, 128) AS item_price_sketch
-- Output: 0x... (VARBINARY)
```
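
The two forms of `double_items_sketch` can be combined into a two-stage rollup: build one sketch per group with an explicit size, then union the serialized sketches with the single-argument form. The following is a sketch under stated assumptions: the `sales` data, the `region` column, and the aliases are hypothetical and stand in for a real table.

```sql
-- Hypothetical inline data standing in for a real sales table.
WITH sales (region, item_price) AS (
    VALUES
        ('us', DOUBLE '3.99'),
        ('us', DOUBLE '3.99'),
        ('eu', DOUBLE '3.99'),
        ('eu', DOUBLE '7.49')
),
-- Stage 1: one sketch per region, all built with the same maximum map size.
regional AS (
    SELECT region, double_items_sketch(item_price, 128) AS price_sketch
    FROM sales
    GROUP BY region
)
-- Stage 2: the single-argument form unions the serialized (VARBINARY) sketches.
SELECT double_items_sketch(price_sketch) AS global_price_sketch
FROM regional;
```

Using the same size in every stage-1 sketch follows the note above about keeping the maximum map size consistent across input sketches.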


(double_items_sketch_estimate)=
## `double_items_sketch_estimate(sketch, item)`

Parameters:
* `sketch` (`VARBINARY`): A serialized Double Items sketch.
* `item` (`DOUBLE`, `REAL`, `ARRAY[DOUBLE]`, `ARRAY[REAL]`): The item or list of items to estimate the frequency for.

Returns:
* (`BIGINT` or `ARRAY[BIGINT]`): The estimated frequency or list of frequencies of `item` in `sketch`.

Examples:
```sql
-- item_price is a DOUBLE column
SELECT double_items_sketch_estimate(double_items_sketch(item_price), 3.99) AS price1_frequency
-- Output: 14 (BIGINT)
```

```sql
-- item_price is a DOUBLE column
SELECT double_items_sketch_estimate(double_items_sketch(item_price), ARRAY[3.99, 7.49, 12.95]) AS price_frequencies
-- Output: [14, 9, 11] (ARRAY[BIGINT])
```


(double_items_sketch_estimate_lb)=
## `double_items_sketch_estimate_lb(sketch, item)`

Parameters:
* `sketch` (`VARBINARY`): A serialized Double Items sketch.
* `item` (`DOUBLE`, `REAL`, `ARRAY[DOUBLE]`, `ARRAY[REAL]`): The item or list of items to estimate the lower bound of the frequency for.

Returns:
* (`BIGINT` or `ARRAY[BIGINT]`): The estimated lower bound of the frequency or list of frequencies of `item` in `sketch`.

Examples:
```sql
-- item_price is a DOUBLE column
SELECT double_items_sketch_estimate_lb(double_items_sketch(item_price), 3.99) AS price1_frequency_lb
-- Output: 13 (BIGINT)
```

```sql
-- item_price is a DOUBLE column
SELECT double_items_sketch_estimate_lb(double_items_sketch(item_price), ARRAY[3.99, 7.49, 12.95]) AS price_frequencies_lb
-- Output: [13, 8, 9] (ARRAY[BIGINT])
```


(double_items_sketch_estimate_ub)=
## `double_items_sketch_estimate_ub(sketch, item)`

Parameters:
* `sketch` (`VARBINARY`): A serialized Double Items sketch.
* `item` (`DOUBLE`, `REAL`, `ARRAY[DOUBLE]`, `ARRAY[REAL]`): The item or list of items to estimate the upper bound of the frequency for.

Returns:
* (`BIGINT` or `ARRAY[BIGINT]`): The estimated upper bound of the frequency or list of frequencies of `item` in `sketch`.

Examples:
```sql
-- item_price is a DOUBLE column
SELECT double_items_sketch_estimate_ub(double_items_sketch(item_price), 3.99) AS price1_frequency_ub
-- Output: 16 (BIGINT)
```

```sql
-- item_price is a DOUBLE column
SELECT double_items_sketch_estimate_ub(double_items_sketch(item_price), ARRAY[3.99, 7.49, 12.95]) AS price_frequencies_ub
-- Output: [16, 11, 14] (ARRAY[BIGINT])
```
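
The estimate and its two bounds are often easier to read side by side. The query below is a minimal sketch that builds one sketch in a CTE and queries it three times; the inline `prices` data and the column aliases are hypothetical.

```sql
-- Hypothetical inline data standing in for a real DOUBLE column.
WITH prices (item_price) AS (
    VALUES DOUBLE '3.99', DOUBLE '3.99', DOUBLE '7.49'
),
-- Build the sketch once so every function reads the same VARBINARY value.
s AS (
    SELECT double_items_sketch(item_price) AS sketch
    FROM prices
)
SELECT
    double_items_sketch_estimate_lb(sketch, DOUBLE '3.99') AS frequency_lb,
    double_items_sketch_estimate(sketch, DOUBLE '3.99')    AS frequency_estimate,
    double_items_sketch_estimate_ub(sketch, DOUBLE '3.99') AS frequency_ub
FROM s;
```

The `DOUBLE '3.99'` literals keep the item type aligned with the sketch's `DOUBLE` inputs, in line with the note at the top of this page about not mixing `DOUBLE` and `REAL`.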


(double_items_sketch_frequent_items)=
## `double_items_sketch_frequent_items(sketch, false_positives)`

Parameters:
* `sketch` (`VARBINARY`): A serialized Double Items sketch.
* `false_positives` (`BOOLEAN`): Whether to include potential false positives in the output. If this is true, the output will include all items that may be frequent. If this is false, the output will only include items that are guaranteed to be frequent.

Returns:
* (`ARRAY[DOUBLE]`, `ARRAY[REAL]`): The frequent items in the sketch.

Examples:
```sql
-- item_price is a DOUBLE column
SELECT double_items_sketch_frequent_items(double_items_sketch(item_price), true) AS frequent_prices
-- Output: [3.99, 7.49, 12.95, 0.99]
```

```sql
-- item_price is a DOUBLE column
SELECT double_items_sketch_frequent_items(double_items_sketch(item_price), false) AS frequent_prices
-- Output: [3.99, 7.49]
```
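
Since the function returns an array, the frequent items can be expanded into rows and paired with their estimated frequencies. This is a hedged sketch using Trino's `UNNEST`; the inline `prices` data and all aliases are hypothetical.

```sql
-- Hypothetical inline data standing in for a real DOUBLE column.
WITH prices (item_price) AS (
    VALUES DOUBLE '3.99', DOUBLE '3.99', DOUBLE '7.49'
),
s AS (
    SELECT double_items_sketch(item_price) AS sketch
    FROM prices
)
-- One output row per frequent item, with its estimated frequency.
SELECT
    u.item,
    double_items_sketch_estimate(s.sketch, u.item) AS estimated_frequency
FROM s
CROSS JOIN UNNEST(double_items_sketch_frequent_items(s.sketch, false)) AS u (item)
ORDER BY estimated_frequency DESC;
```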


@@ -0,0 +1,105 @@
# HLL Sketches

An [HLL sketch](https://datasketches.apache.org/docs/HLL/HLL.html), or HyperLogLog sketch, is a sketch that can estimate the count of distinct values in a dataset.

(hll_sketch_1)=
## `hll_sketch(column)`

Parameters:
* `column` (`BIGINT`, `DOUBLE`, `REAL`, `VARCHAR`, `VARBINARY`): The column of values to create the sketch from. If inputs are `VARBINARY`, they are assumed to be serialized sketches which are unioned to produce the output sketch.

Returns:
* (`VARBINARY`): The serialized HLL sketch.

Notes:
* This is an aggregation function, so a column will be reduced to a single `VARBINARY` value.
* When creating a new sketch from values, the output sketch will use a `lg_k` of 12. To customize this, use [](hll_sketch_2).
* When aggregating existing sketches, the output sketch will use the same `lg_k` as one of the input sketches. Ensure all input sketches have the same `lg_k` to avoid errors.

Examples:
```sql
-- user_id is a VARCHAR column
SELECT hll_sketch(user_id) AS users_sketch
-- Output: 0x... (VARBINARY)
```

```sql
-- user_id_hll is a VARBINARY column
SELECT hll_sketch(user_id_hll) AS users_sketch
-- Output: 0x... (VARBINARY)
```


(hll_sketch_2)=
## `hll_sketch(column, lg_k)`

Parameters:
* `column` (`BIGINT`, `DOUBLE`, `REAL`, `VARCHAR`): The column of values to create the sketch from.
* `lg_k` (`BIGINT`): The log2 of the desired sketch's `k` parameter. `lg_k` can be between 4 and 21.

Returns:
* (`VARBINARY`): The serialized HLL sketch with the specified `lg_k`.

Notes:
* This is an aggregation function, so a column will be reduced to a single `VARBINARY` value.

Examples:
```sql
-- user_id is a VARCHAR column
SELECT hll_sketch(user_id, 14) AS users_sketch
-- Output: 0x... (VARBINARY)
```
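
A common pattern is to pre-aggregate one HLL sketch per time bucket with a fixed `lg_k`, then union the stored sketches at query time. The following is a sketch under stated assumptions: the `logins` data, the `login_date` column, and the aliases are hypothetical.

```sql
-- Hypothetical inline data standing in for a real login table.
WITH logins (login_date, user_id) AS (
    VALUES
        (DATE '2024-10-01', 'alice'),
        (DATE '2024-10-01', 'bob'),
        (DATE '2024-10-02', 'alice'),
        (DATE '2024-10-02', 'carol')
),
-- Stage 1: one sketch per day, all built with the same lg_k.
daily AS (
    SELECT login_date, hll_sketch(user_id, 14) AS users_hll
    FROM logins
    GROUP BY login_date
)
-- Stage 2: union the daily sketches, then estimate distinct users overall.
SELECT hll_count_distinct(hll_sketch(users_hll)) AS num_distinct_users
FROM daily;
```

Keeping `lg_k` identical across the daily sketches follows the note on `hll_sketch(column)` about unioning sketches with matching `lg_k`.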


(hll_count_distinct)=
## `hll_count_distinct(sketch)`

Parameters:
* `sketch` (`VARBINARY`): A serialized HLL sketch.

Returns:
* (`BIGINT`): The estimated count of distinct values in `sketch`.

Examples:
```sql
-- user_id is a VARCHAR column
SELECT hll_count_distinct(hll_sketch(user_id)) AS num_distinct_users
-- Output: 9996 (BIGINT)
```


(hll_count_distinct_lb)=
## `hll_count_distinct_lb(sketch, num_std_dev)`

Parameters:
* `sketch` (`VARBINARY`): A serialized HLL sketch.
* `num_std_dev` (`BIGINT`): The number of standard deviations to use for the lower bound.

Returns:
* (`BIGINT`): The lower bound on the count of distinct values in `sketch` to `num_std_dev` standard deviations.

Examples:
```sql
-- user_id is a VARCHAR column
SELECT hll_count_distinct_lb(hll_sketch(user_id), 2) AS num_distinct_users_lb
-- Output: 9751 (BIGINT)
```


(hll_count_distinct_ub)=
## `hll_count_distinct_ub(sketch, num_std_dev)`

Parameters:
* `sketch` (`VARBINARY`): A serialized HLL sketch.
* `num_std_dev` (`BIGINT`): The number of standard deviations to use for the upper bound.

Returns:
* (`BIGINT`): The upper bound on the count of distinct values in `sketch` to `num_std_dev` standard deviations.

Examples:
```sql
-- user_id is a VARCHAR column
SELECT hll_count_distinct_ub(hll_sketch(user_id), 2) AS num_distinct_users_ub
-- Output: 10338 (BIGINT)
```
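
The estimate and both bounds can be read from the same sketch in one query, which gives an approximate confidence interval around the distinct count. This is a minimal sketch; the inline `logins` data and the aliases are hypothetical.

```sql
-- Hypothetical inline data standing in for a real VARCHAR column.
WITH logins (user_id) AS (
    VALUES 'alice', 'bob', 'alice', 'carol'
),
-- Build the sketch once so all three functions read the same value.
s AS (
    SELECT hll_sketch(user_id) AS users_hll
    FROM logins
)
SELECT
    hll_count_distinct_lb(users_hll, 2) AS num_distinct_users_lb,
    hll_count_distinct(users_hll)       AS num_distinct_users,
    hll_count_distinct_ub(users_hll, 2) AS num_distinct_users_ub
FROM s;
```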