Skip to content

Commit

Permalink
[Doc] v3.4 Multi-level Expression Partitioning (backport #54509) (#54929
Browse files Browse the repository at this point in the history
)

Co-authored-by: 絵空事スピリット <[email protected]>
  • Loading branch information
mergify[bot] and EsoragotoSpirit authored Jan 10, 2025
1 parent cfd9e33 commit b95f511
Show file tree
Hide file tree
Showing 8 changed files with 275 additions and 29 deletions.
66 changes: 60 additions & 6 deletions docs/en/table_design/data_distribution/Data_distribution.md
Original file line number Diff line number Diff line change
Expand Up @@ -163,8 +163,8 @@ The partitioning method divides a table into multiple partitions. Partitioning p
| **Partitioning method** | **Scenarios** | **Methods to create partitions** |
| ------------------------------------- | ------------------------------------------------------------ | ------------------------------------------ |
| Expression partitioning (recommended) | Previously known as automatic partitioning. This partitioning method is more flexible and easy-to-use. It is suitable for most scenarios including querying and managing data based on continuous date ranges or enum values. | Automatically created during data loading |
| Range partitioning | The typical scenario is to store simple, ordered data that is often queried and managed based on continuous date/numeric ranges. For instance, in some special cases, historical data needs to be partitioned by month, while recent data needs to be partitioned by day. | Created manually, dynamically, or in batch |
| List partitioning | A typical scenario is to query and manage data based on enum values, and a partition needs to include data with different values for each partitioning column. For example, if you frequently query and manage data based on countries and cities, you can use this method and select `city` as the partitioning column. So a partition can store data for multiple cities belonging to the same country. | Created manually |
| Range partitioning (Legacy) | The typical scenario is to store simple, ordered data that is often queried and managed based on continuous date/numeric ranges. For instance, in some special cases, historical data needs to be partitioned by month, while recent data needs to be partitioned by day. | Created manually, dynamically, or in batch |
| List partitioning (Legacy) | A typical scenario is to query and manage data based on enum values, and a partition needs to include data with different values for each partitioning column. For example, if you frequently query and manage data based on countries and cities, you can use this method and select `city` as the partitioning column. So a partition can store data for multiple cities belonging to the same country. | Created manually |

##### How to choose partitioning columns and granularity

Expand Down Expand Up @@ -194,11 +194,65 @@ The number of buckets: By default, StarRocks automatically sets the number of bu

> **NOTICE**
>
> Since v3.1, StarRocks's shared-data mode supports the time function expression and does not support the column expression.
> StarRocks's shared-data mode supports the time function expression from v3.1.0 and the column expression from v3.1.1.
Since v3.0, StarRocks has supported [expression partitioning](expression_partitioning.md)](./expression_partitioning.md) (previously known as automatic partitioning) which is more flexible and easy to use. This partitioning method is suitable for most scenarios such as querying and managing data based on continuous date ranges or ENUM values.
Since v3.0, StarRocks has supported [expression partitioning](expression_partitioning.md) (previously known as automatic partitioning) which is more flexible and easy to use. This partitioning method is suitable for most scenarios such as querying and managing data based on continuous date ranges or ENUM values.

You only need to configure a partition expression (a time function expression or a column expression) at table creation, and StarRocks will automatically create partitions during data loading. You no longer need to manually create numerous partitions in advance, nor configure dynamic partition properties.
You only need to configure a partition expression at table creation, and StarRocks will automatically create partitions during data loading. You no longer need to manually create numerous partitions in advance, nor configure dynamic partition properties.

From v3.4 onwards, expression partitioning is further optimized to unify all partitioning strategies and supported more complex solutions. It is recommended in most cases, and will replace the other partitioning strategies in future releases.

Example 1: Use a simple time function expression with a DATETIME column.

```SQL
CREATE TABLE site_access(
event_day DATETIME NOT NULL,
site_id INT DEFAULT '10',
city_code VARCHAR(100),
user_name VARCHAR(32) DEFAULT '',
pv BIGINT DEFAULT '0'
)
DUPLICATE KEY(event_day, site_id, city_code, user_name)
PARTITION BY time_slice(event_day, INTERVAL 7 day)
DISTRIBUTED BY HASH(event_day, site_id)
```

Example 2: Use a column expression with multiple columns.

```SQL
CREATE TABLE t_recharge_detail1 (
id bigint,
user_id bigint,
recharge_money decimal(32,2),
city varchar(20) not null,
dt varchar(20) not null
)
DUPLICATE KEY(id)
PARTITION BY dt,city
DISTRIBUTED BY HASH(`id`);
```

Example 3: Use a complex time function expression with a Unix timestamp column.

```SQL
CREATE TABLE orders (
ts BIGINT NOT NULL,
id BIGINT NOT NULL,
city STRING NOT NULL
)
PARTITION BY from_unixtime(ts,'%Y%m%d');
```

Example 4: Use a mixed expression of time function expression and column expression.

```SQL
CREATE TABLE orders (
ts BIGINT NOT NULL,
id BIGINT NOT NULL,
city STRING NOT NULL
)
PARTITION BY from_unixtime(ts,'%Y%m%d'), city;
```

#### Range partitioning

Expand Down Expand Up @@ -530,7 +584,7 @@ Multiple partitions can be created in batch at and after table creation. You can

StarRocks stores data in the corresponding partitions based on the explicit mapping of the predefined value list for each partition.

### Manage partitions
### Manage partitions

#### Add partitions

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,16 @@ displayed_sidebar: docs
sidebar_position: 30
---

# Dynamic partitioning
# Dynamic partitioning (Legacy)

StarRocks supports dynamic partitioning, which can automatically manage the time to life (TTL) of partitions, such as partitioning new input data in tables and deleting expired partitions. This feature significantly reduces maintenance costs.

:::note

Please note that from v3.4 onwards, [expression partitioning](./expression_partitioning.md) is further optimized to unify all partitioning strategies and supported more complex solutions. It is recommended in most cases, and will replace the dynamic partitioning strategy in future releases.

:::

## Enable dynamic partitioning

Take table `site_access` as an example. To enable dynamic partitioning, you need to configure the PROPERTIES parameter. For information about the configuration items, see [CREATE TABLE](../../sql-reference/sql-statements/table_bucket_part_index/CREATE_TABLE.md).
Expand Down
73 changes: 65 additions & 8 deletions docs/en/table_design/data_distribution/expression_partitioning.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,9 +8,11 @@ sidebar_position: 10

Since v3.0, StarRocks has supported expression partitioning (previously known as automatic partitioning), which is more flexible and user-friendly. This partitioning method is suitable for most scenarios such as querying and managing data based on continuous time ranges or ENUM values.

You only need to specify a simple partition expression (either a time function expression or a column expression) at table creation. During data loading, StarRocks will automatically create partitions based on the data and the rule defined in the partition expression. You no longer need to manually create numerous partitions at table creation, nor configure dynamic partition properties.
You only need to specify a simple partition expression at table creation. During data loading, StarRocks will automatically create partitions based on the data and the rule defined in the partition expression. You no longer need to manually create numerous partitions at table creation, nor configure dynamic partition properties.

## Partitioning based on a time function expression
From v3.4 onwards, expression partitioning is further optimized to unify all partitioning strategies and supported more complex solutions. It is recommended in most cases, and will replace the other partitioning strategies in future releases.

## Partitioning based on a simple time function expression

If you frequently query and manage data based on continuous time ranges, you only need to specify a date type (DATE or DATETIME) column as the partition column and specify year, month, day, or hour as the partition granularity in the time function expression. StarRocks will automatically create partitions and set the partitions' start and end dates or datetime based on the loaded data and partition expression.

Expand All @@ -33,7 +35,7 @@ expression ::=
#### `expression`

**Required**: YES<br/>
**Description**: Currently, only the [date_trunc](../../sql-reference/sql-functions/date-time-functions/date_trunc.md) and [time_slice](../../sql-reference/sql-functions/date-time-functions/time_slice.md) functions are supported. If you use the function `time_slice`, you do not need to pass the `boundary` parameter. It is because in this scenario, the default and valid value for this parameter is `floor`, and the value cannot be `ceil`. <br/>
**Description**: A simple time function expression that use the [date_trunc](../../sql-reference/sql-functions/date-time-functions/date_trunc.md) or [time_slice](../../sql-reference/sql-functions/date-time-functions/time_slice.md) functions. If you use the function `time_slice`, you do not need to pass the `boundary` parameter. It is because in this scenario, the default and valid value for this parameter is `floor`, and the value cannot be `ceil`. <br/>

#### `time_unit`

Expand All @@ -57,7 +59,7 @@ expression ::=
- StarRocks sets the default maximum number of automatically created partitions for one load to 4096, which can be configured by the FE parameter `auto_partition_max_creation_number_per_load`. This parameter can prevent you from accidentally creating too many partitions.
- The naming rule for partitions is consistent with the naming rule for dynamic partitioning.

### **Examples**
### Examples

Example 1: Suppose you frequently query data by day. You can use the partition expression `date_trunc()` and set the partition column as `event_day` and the partition granularity as `day` at table creation. Data is automatically partitioned based on dates during loading. Data of the same day is stored in one partition and partition pruning can be used to significantly improve query efficiency.

Expand Down Expand Up @@ -139,7 +141,7 @@ PARTITION BY expression
...

expression ::=
( partition_columns )
partition_columns

partition_columns ::=
<column>, [ <column> [,...] ]
Expand Down Expand Up @@ -172,7 +174,7 @@ CREATE TABLE t_recharge_detail1 (
dt varchar(20) not null
)
DUPLICATE KEY(id)
PARTITION BY (dt,city)
PARTITION BY dt,city
DISTRIBUTED BY HASH(`id`);
```

Expand Down Expand Up @@ -212,13 +214,68 @@ LastConsistencyCheckTime: NULL
1 row in set (0.00 sec)
```

## Partitioning based on a complex time function expression (since v3.4)

From v3.4.0 onwards, expression partitioning supports any expressions that return DATE or DATETIME types to accommodate to even more complex partitioning scenarios.

For example, you can define a Unix timestamp column, and use from_unixtime() directly against the column in the partition expression to define the partition key, instead of define a generated DATE or DATETIME column with the function. For more about the usage, see [Examples](#examples-2).

### Examples

Example 1: Suppose you assign a Unix timestamp to each data row and frequently query data by day. You can use the timestamp column with the function from_unixtime() in the expression to define the timestamp as the partition column and the partition granularity to a day at table creation. Data of each day is stored in one partition and partition pruning can be used to improve query efficiency.

```SQL
CREATE TABLE orders (
ts BIGINT NOT NULL,
id BIGINT NOT NULL,
city STRING NOT NULL
)
PARTITION BY from_unixtime(ts,'%Y%m%d');
```

Example 2: Suppose you assign an irregular STRING type timestamp to each data row and frequently query data by day. You can use the timestamp column with the functions cast() and date_parse() in the expression to transform the timestamp into the DATE type, set it as the partition column, and the partition granularity to a day at table creation. Data of each day is stored in one partition and partition pruning can be used to improve query efficiency.

```SQL
CREATE TABLE orders_new (
ts STRING NOT NULL,
id BIGINT NOT NULL,
city STRING NOT NULL
)
PARTITION BY CAST(DATE_PARSE(CAST(ts AS VARCHAR(100)),'%Y%m%d') AS DATE);
```

### Usage notes

Partition pruning is applicable to cases of partitioning based on a complex time function expression:

- If the partition clause is `PARTITION BY from_unixtime(ts)`, queries with filters in the format `ts>1727224687` can be pruned to corresponding partitions.
- If the partition clause is `PARTITION BY CAST(DATE_PARSE(CAST(ts AS VARCHAR(100)),'%Y%m%d') AS DATE)`, queries with filters in the format `ts = "20240506"` can be pruned.
- The above cases are also applicable to [Partitioning based on mixed expression](#partitioning-based-on-the-mixed-expression-since-v34).

## Partitioning based on the mixed expression (since v3.4)

From v3.4.0 onwards, expression partitioning supports multiple partition columns with one of them being a time function expression.

### Examples

Example 1: Suppose you assign a Unix timestamp to each data row and frequently query data by day and a specific city. You can use the timestamp column (with the from_unixtime() function) and the city column as the partition columns at table creation. Data of each day in each city is stored in one partition and partition pruning can be used to improve query efficiency.

```SQL
CREATE TABLE orders (
ts BIGINT NOT NULL,
id BIGINT NOT NULL,
city STRING NOT NULL
)
PARTITION BY from_unixtime(ts,'%Y%m%d'), city;
```

## Manage partitions

### Load data into partitions

During data loading, StarRocks will automatically create partitions based on the loaded data and partition rule defined by the partition expression.

Note that if you use expression partitioning at table creation and need to use [INSERT OVERWRITE](../../loading/InsertInto.md#overwrite-data-via-insert-overwrite-select) to overwrite data in a specific partition, whether the partition has been created or not, you currently need to explicitly provide a partition range in `PARTITION()`. This is different from [Range Partitioning../Data_distribution.mdmd#range-partitioning) or [List Partitioning../list_partitioning.mdmd), which allow you only to provide the partition name in `PARTITION (<partition_name>)`.
Note that if you use expression partitioning at table creation and need to use [INSERT OVERWRITE](../../loading/InsertInto.md#overwrite-data-via-insert-overwrite-select) to overwrite data in a specific partition, whether the partition has been created or not, you currently need to explicitly provide a partition range in `PARTITION()`. This is different from [Range Partitioning](./Data_distribution.md#range-partitioning) or [List Partitioning](./list_partitioning.md), which allow you only to provide the partition name in `PARTITION (<partition_name>)`.

If you use a time function expression at table creation and want to overwrite data in a specific partition, you need to provide the starting date or datetime of that partition (the partition granularity configured at table creation). If the partition does not exist, it can be automatically created during data loading.

Expand Down Expand Up @@ -251,7 +308,7 @@ MySQL > SHOW PARTITIONS FROM t_recharge_detail1;

## Limits

- Since v3.1.0, StarRocks's shared-data mode supports the [time function expression](#partitioning-based-on-a-time-function-expression). And since v3.1.1, StarRocks's shared-data mode further supports the [column expression](#partitioning-based-on-the-column-expression-since-v31).
- Since v3.1.0, StarRocks's shared-data mode supports the [time function expression](#partitioning-based-on-a-simple-time-function-expression). And since v3.1.1, StarRocks's shared-data mode further supports the [column expression](#partitioning-based-on-the-column-expression-since-v31).
- Currently, using CTAS to create tables configured expression partitioning is not supported.
- Currently, using Spark Load to load data to tables that use expression partitioning is not supported.
- When the `ALTER TABLE <table_name> DROP PARTITION <partition_name>` statement is used to delete a partition created by using the column expression, data in the partition is directly removed and cannot be recovered.
Expand Down
8 changes: 7 additions & 1 deletion docs/en/table_design/data_distribution/list_partitioning.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,16 @@ displayed_sidebar: docs
sidebar_position: 20
---

# List partitioning
# List partitioning (Legacy)

Since v3.1, StarRocks supports list partitioning. Data is partitioned based on a predefined value list for each partition, which can accelerate queries and facilitate management according to enumerated values.

:::note

Please note that from v3.4 onwards, [expression partitioning](./expression_partitioning.md) is further optimized to unify all partitioning strategies and supported more complex solutions. It is recommended in most cases, and will replace the list partitioning strategy in future releases.

:::

## Introduction

You need to explicitly specify the column values list in each partition. These values do not need to be continuous, unlike the continuous time or numeric range required in Range Partitioning. During data loading, StarRocks will store the data in the corresponding partition based on the mapping between the data's partitioning column values and the predefined column values for each partition.
Expand Down
Loading

0 comments on commit b95f511

Please sign in to comment.