[FLINK-30043] Some example sqls in flink table store rescale-bucket document are incorrect

This closes apache#385
houhang1005 authored Nov 17, 2022
1 parent 4d6bc72 commit c2df467
Showing 1 changed file with 20 additions and 5 deletions.
25 changes: 20 additions & 5 deletions docs/content/docs/development/rescale-bucket.md
@@ -88,14 +88,29 @@ WITH (
'bucket' = '16'
);

+-- like from a kafka table
+CREATE TEMPORARY TABLE raw_orders(
+    trade_order_id BIGINT,
+    item_id BIGINT,
+    item_price BIGINT,
+    gmt_create STRING,
+    order_status STRING
+) WITH (
+    'connector' = 'kafka',
+    'topic' = '...',
+    'properties.bootstrap.servers' = '...',
+    'format' = 'csv'
+    ...
+);
+
-- streaming insert as bucket num = 16
INSERT INTO verified_orders
SELECT trade_order_id,
item_id,
item_price,
DATE_FORMAT(gmt_create, 'yyyy-MM-dd') AS dt
FROM raw_orders
-WHERE order_status = 'verified'
+WHERE order_status = 'verified';
```
The pipeline has been running well for the past few weeks. However, the data volume has grown fast recently,
and the job's latency keeps increasing. To improve the data freshness, users can
@@ -110,7 +125,7 @@
- Increase the bucket number
```sql
-- scaling out
-ALTER TABLE verified_orders SET ('bucket' = '32')
+ALTER TABLE verified_orders SET ('bucket' = '32');
```
- Switch to the batch mode and overwrite the current partition(s) to which the streaming job is writing
```sql
@@ -122,7 +137,7 @@
item_id,
item_price
FROM verified_orders
-WHERE dt = '2022-06-22' AND order_status = 'verified'
+WHERE dt = '2022-06-22';

-- case 2: there are late events updating the historical partitions, but the range does not exceed 3 days
INSERT OVERWRITE verified_orders
@@ -131,7 +146,7 @@
item_price,
dt
FROM verified_orders
-WHERE dt IN ('2022-06-20', '2022-06-21', '2022-06-22') AND order_status = 'verified'
+WHERE dt IN ('2022-06-20', '2022-06-21', '2022-06-22');
```
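The batch-mode switch named in this step is not shown in the hunk above. A minimal sketch of it, assuming the Flink SQL client and its standard `execution.runtime-mode` option:
```sql
-- Run the INSERT OVERWRITE statements above in a batch session,
-- not in the still-running streaming job's session.
SET 'execution.runtime-mode' = 'batch';
```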
- After the overwrite job finishes, switch back to streaming mode. Now the parallelism can be increased alongside the bucket number to restore the streaming job from the savepoint
(see [Start a SQL Job from a savepoint](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/table/sqlclient/#start-a-sql-job-from-a-savepoint))
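As a hedged illustration of that restore step, assuming the SQL client options described in the linked page (the savepoint path and parallelism value below are hypothetical):
```sql
-- Switch the session back to streaming mode.
SET 'execution.runtime-mode' = 'streaming';
-- Hypothetical path to the savepoint taken when the streaming job was stopped.
SET 'execution.savepoint.path' = '/tmp/flink-savepoints/savepoint-09ab29';
-- Raise the default parallelism alongside the new bucket number.
SET 'parallelism.default' = '32';
```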
@@ -145,5 +160,5 @@
item_price,
DATE_FORMAT(gmt_create, 'yyyy-MM-dd') AS dt
FROM raw_orders
-WHERE order_status = 'verified'
+WHERE order_status = 'verified';
```
