
This document collects requirements for Apache CarbonData


Priority  Features
High      Java API, S3 Table, Data exchange between S3 and HDFS, Advisor
Medium    Datamap to accelerate S3 table, Different kinds of Datamap, Segment Status
Low       Others

Improving usability

  1. Advisor for SORT_COLUMNS
  2. Advisor for pre-aggregate table
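
As a sketch of what the advisor's output could look like (the table, column, and datamap names here are illustrative; the advisor output format itself is not defined yet), it might emit DDL such as:

// advisor suggests placing frequently filtered columns first in SORT_COLUMNS
CREATE TABLE sales(order_id bigint, city string, amount double)
STORED BY 'carbondata'
TBLPROPERTIES ('SORT_COLUMNS'='city, order_id')

// advisor suggests a pre-aggregate table for a frequent GROUP BY pattern
CREATE DATAMAP sales_agg ON TABLE sales
USING 'preaggregate'
AS SELECT city, sum(amount) FROM sales GROUP BY city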

Typical Carbon Usage on Cloud

  1. Support CarbonTable on S3
  2. Support loading to local CarbonTable from CSV on S3
  3. Support DataMap on local HDFS
  4. Support Java API to write and read CarbonTable/CarbonFile on S3
  5. Support adding external segments
    • Support external segments with a specified path, file format, and partition values
    • Support implementing static partitions using segments. Users can use ALTER TABLE ADD PARTITION to add existing files to a partition (segment)
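
A possible usage sketch for items 1, 2, and 5 (bucket names and paths are illustrative, and the ALTER TABLE ADD PARTITION form for external segments is proposed syntax, not an existing feature):

// create a carbon table whose data resides on S3
CREATE TABLE fact_s3(c1 int, c2 string)
STORED BY 'carbondata'
LOCATION 's3a://bucket/fact_s3'

// load a local carbon table from a CSV file stored on S3
LOAD DATA INPATH 's3a://bucket/input/data.csv' INTO TABLE fact_local

// proposed: register existing files as an external segment (static partition)
ALTER TABLE fact_local ADD PARTITION (dt='2018-01-01')
LOCATION 's3a://bucket/existing/dt=2018-01-01'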

Support non-carbon file formats in Carbon

  1. Support pre-aggregate datamap for Parquet/ORC
  2. Support compaction for Parquet/ORC
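
A sketch of both items (proposed syntax; the Parquet table name is illustrative, and today the preaggregate datamap and COMPACT command apply only to carbon tables):

// proposed: pre-aggregate datamap on an existing Parquet table
CREATE DATAMAP parquet_agg ON TABLE parquet_fact
USING 'preaggregate'
AS SELECT c2, sum(c1) FROM parquet_fact GROUP BY c2

// proposed: compact the small files of a Parquet table
ALTER TABLE parquet_fact COMPACT 'MINOR'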

Datamap to accelerate S3 table

  1. Support datamap on HDFS with the fact table on S3. The scenario: a user creates an external table on S3 and a datamap on HDFS in EC2, improving query performance while keeping cost low. It can be used in the following scenarios:
    • Pre-agg table in HDFS, fact table in S3
    • Index in HDFS, fact table in S3
    • Cache table in HDFS, fact table in S3
    • Datamap in HDFS, fact table using parquet in S3
  2. Support a new type of datamap: table cache. It holds a subset of the fact table's data, and queries are routed to it when their filter hits this subset (for example, a time range). Datamap options should include event_time to specify the event-time column and expiration to specify the cache expiration time.
  3. Support assigning a location when creating a datamap.
  4. Support creating a datamap on a non-carbon table.
  5. For batch load, support loading the datamap and fact table synchronously.
  6. For streaming ingest, support ingesting into the cache datamap first, then saving to both the datamap and the fact table when handoff happens.

Example Usage:

// create a fact table stored in S3
CREATE TABLE fact(c1 int, c2 string) 
STORED BY 'carbondata'   // can be others
LOCATION 's3a://xx'

// create a cache table on fact table 
CREATE DATAMAP fact_cache ON TABLE fact 
USING 'cache'
LOCATION 'hdfs://yy'
DMPROPERTIES ('event_time'='c2', 'expiration'='1day')

// create a preaggregate table on fact table 
CREATE DATAMAP fact_agg ON TABLE fact 
USING 'preaggregate'
LOCATION 'hdfs://yy'
AS SELECT c2, sum(c1) FROM fact GROUP BY c2

// create a text index on fact table
CREATE DATAMAP fact_index ON TABLE fact 
USING 'lucene_index'
LOCATION 'hdfs://yy'
DMPROPERTIES ('index_column'='c2')

Different kinds of Datamap

  1. Support lucene datamap for text index (this depends on UDF pushdown)
  2. Support R-tree datamap for geospatial analytics (see the sketch after this list)
  3. Support cache table as datamap (repeated from above)
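
A sketch of item 2, following the CREATE DATAMAP pattern above (the 'rtree_index' provider name and its properties are illustrative, not existing syntax):

// proposed: spatial index datamap backed by an R tree
CREATE DATAMAP poi_spatial ON TABLE poi
USING 'rtree_index'
DMPROPERTIES ('index_columns'='longitude, latitude')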

File Level Input/Output

  1. Support a file-level OutputFormat and spark/hive/presto integration, so that spark/hive/presto can write carbon files just like other formats such as Parquet/ORC. This feature does not support dictionary.
  2. Support a file-level InputFormat and spark/hive/presto integration. This feature can leverage the index, but it does not support features that need optimizer enhancements, such as dictionary and datamap.
  3. Support using the file-level integration with Hive partition syntax, so users can configure a Hive partition to use the carbon file format (see the sketch after this list)
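
A sketch of item 3 (proposed; whether Hive registers the format via STORED AS carbondata or a storage handler is an open question, and the table is illustrative):

// proposed: a Hive partitioned table whose files use the carbon file format
CREATE TABLE events(ts bigint, msg string)
PARTITIONED BY (dt string)
STORED AS carbondata

// writes go through the file-level OutputFormat, reads through the InputFormat
INSERT INTO TABLE events PARTITION (dt='2018-01-01') SELECT ts, msg FROM staging
SELECT count(*) FROM events WHERE dt = '2018-01-01'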

Advisor

TODO: add a diagram

  1. Support a workload analyzer, which takes the create-table script, query scripts, and statistics as input
  2. Support an advisor for CREATE TABLE (main index, dictionary, etc.) and CREATE DATAMAP (index, pre-agg, etc.)
  3. Support evaluating the effect of the advisor's output in a virtual environment via the EXPLAIN command, i.e. what-if analysis on a virtual datamap (see the sketch after this list)
  4. Support visualization of the workload to make it easier to tune
  5. Support continuously monitoring and collecting workload plans, saving them in a store such as ES or a JSON file in HDFS
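
A sketch of the what-if flow in item 3 (the 'virtual' datamap property is illustrative, not existing syntax; the idea is that EXPLAIN shows whether the plan would hit an advised datamap before it is actually built):

// proposed: register the advised datamap as virtual (not materialized)
CREATE DATAMAP fact_agg_virtual ON TABLE fact
USING 'preaggregate'
DMPROPERTIES ('virtual'='true')
AS SELECT c2, sum(c1) FROM fact GROUP BY c2

// EXPLAIN shows whether the plan would be rewritten to use fact_agg_virtual
EXPLAIN SELECT c2, sum(c1) FROM fact WHERE c2 = 'beijing' GROUP BY c2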

Graph

  1. Support a new format (carbongraph) storing an adjacency matrix for high-performance graph traversal
  2. Implement an RPC-based distributed graph compute framework
  3. Support prefetching data to leverage the advantage of carbongraph's format
  4. Integrate CarbonData with Apache TinkerPop as a TinkerPop-enabled data system provider. Expose the Gremlin language to users as the counterpart of SQL in the data warehouse domain.
  5. Provide a Graph SQL extension like Azure SQL

Segment Status

  1. Support a segment interface and store segment-related metadata in the Hive metastore
  2. Support handling metadata correctly in all commands, for cloud environments where data is in S3 and metadata is in the metastore

CarbonStore for higher performance

Since carbon now has pre-aggregation, many queries can transform a GROUP BY into a point query or range query. To make these faster, we should optimize point query performance, for both single and concurrent queries (see the sketch after the list below).

  1. Implement a long-running service based on YARN containers or k8s/docker
  2. Implement an RPC-based execution engine for simple queries (projection and filter only). After generating the physical plan, invoke the RPC and execute in the CarbonStore process instead of converting to an RDD and executing on Spark executors.
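
For example (the rewritten column names below are only illustrative of how a pre-aggregate child table is named), a group-by over the fact table can be answered as a point query on its pre-aggregate datamap, which the CarbonStore RPC engine could serve without launching Spark tasks:

// user query on the fact table
SELECT c2, sum(c1) FROM fact WHERE c2 = 'beijing' GROUP BY c2

// internally rewritten to a point query on the pre-aggregate table fact_agg
SELECT fact_c2, fact_sum_c1 FROM fact_agg WHERE fact_c2 = 'beijing'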

Timeseries Table

  1. Support data temperature on the cloud for S3 and HDFS. CarbonStore automatically manages data according to the segment time range. Support a data retention policy so that old data automatically spills to cooler storage, for example the most recent month in HDFS and the rest in S3.
  2. In memory segment
  3. Streaming pre-aggregate
  4. Support loading pre-aggregate tables in a rollup manner to improve loading speed, for example rolling up to the month table from the day table instead of from the fact table (see the sketch after this list)
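
A sketch of item 4 in the style of the timeseries datamap (the event-time column and granularity property names are illustrative and may differ from the final syntax; loading the month table by rolling up the day table is the proposed part):

// day-granularity pre-aggregate, loaded from the fact table
CREATE DATAMAP agg_day ON TABLE fact
USING 'timeseries'
DMPROPERTIES ('event_time'='order_time', 'day_granularity'='1')
AS SELECT order_time, sum(c1) FROM fact GROUP BY order_time

// month-granularity pre-aggregate, proposed to be loaded by rolling up agg_day
CREATE DATAMAP agg_month ON TABLE fact
USING 'timeseries'
DMPROPERTIES ('event_time'='order_time', 'month_granularity'='1')
AS SELECT order_time, sum(c1) FROM fact GROUP BY order_time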

Streaming Table

  1. Compaction: auto compact and close stream (see the sketch after this list)
  2. Integrate with flink
  3. Integrate with kafka-connect
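
A sketch of item 1 (the compaction type keyword is an assumption based on the streaming handoff design; the table name is illustrative):

// hand off streaming segments to the columnar format and close the stream
ALTER TABLE stream_table COMPACT 'close_streaming'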

UDF for index column

  1. Support MATCH filter UDF push down
    1. vector feature matching
    2. lucene index matching
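
A sketch of how the pushed-down filter UDF could look (the UDF name and argument format are illustrative):

// text search pushed down to the lucene index datamap
SELECT * FROM fact WHERE TEXT_MATCH('c2:carbon*')
// vector feature matching would follow the same MATCH-style UDF pattern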

AI integration

  1. Support feature vectors in the carbon/spark integration, stored as a separate data file or datamap
  2. Support similar functionality for DL/Graph

Misc

  1. Merge index for global sort tables
  2. Support a timestamp64 data type for columns that store millisecond-level timestamps
  3. Make the default value of tempCSV 'false' for the DataFrame writer

Image Content Search

  1. CREATE TABLE on pictures, sound, binary data, etc. CREATE DATAMAP on this binary data to extract metadata and build a secondary index, so that users can search pictures by metadata such as time, place, and camera type (see the sketch after this list).
  2. CarbonStore should accept JPG pictures as a segment file format
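
A sketch of item 1 (all of this is proposed syntax; the binary column type and the 'image_meta' datamap provider do not exist today):

// proposed: table holding raw pictures plus a metadata-extraction datamap
CREATE TABLE photos(id bigint, img binary)
STORED BY 'carbondata'

CREATE DATAMAP photos_meta ON TABLE photos
USING 'image_meta'
DMPROPERTIES ('binary_column'='img', 'extract'='time, place, camera_type')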
