Priority level | Feature
---|---
High | Java API, S3 table, data exchange between S3 and HDFS, Advisor
Medium | DataMap to accelerate S3 tables, different kinds of DataMap, segment status
Low | Others
- Advisor for SORT_COLUMNS
- Advisor for pre-aggregate table
- Support CarbonTable on S3
- Support loading to local CarbonTable from CSV on S3
- Support DataMap on local HDFS
- Support Java API to write and read CarbonTable/CarbonFile on S3
- Support adding external segments
- Support external segments with a specified path, file format, and partition value
- Support implementing static partitions using segments. Users can use ALTER TABLE ADD PARTITION to add existing files to a partition (segment); see the sketch after the Example Usage block below
- Support pre-aggregate datamap for Parquet/ORC
- Support compaction for Parquet/ORC
- Support datamap on HDFS while the fact table is on S3. The scenario: a user creates an external table on S3 and a datamap on HDFS in EC2, to improve query performance while keeping cost low. It can be used in the following scenarios:
- Pre-agg table in HDFS, fact table in S3
- Index in HDFS, fact table in S3
- Cache table in HDFS, fact table in S3
- Datamap in HDFS, fact table using parquet in S3
- Support a new type of datamap: table cache. It holds a subset of the fact table's data, and queries are routed to it when the query filter hits this subset (for example, a time range). Datamap options should include: event_time to specify the event time column, and expiration to specify the cache expiration time.
- Support assigning a location when creating a datamap
- Support creating a datamap on non-carbon tables
- For batch load, support loading the datamap and the fact table synchronously
- For streaming ingest, support ingesting into the cache datamap first, then saving to both the datamap and the fact table when handoff happens
Example Usage:
-- create a fact table stored in S3
CREATE TABLE fact(c1 int, c2 string)
STORED BY 'carbondata' -- can be other formats
LOCATION 's3a://xx'

-- create a cache table on the fact table
CREATE DATAMAP fact_cache ON TABLE fact
USING 'cache'
LOCATION 'hdfs://yy'
DMPROPERTIES ('event_time'='c2', 'expiration'='1day')

-- create a pre-aggregate table on the fact table
CREATE DATAMAP fact_agg ON TABLE fact
USING 'preaggregate'
LOCATION 'hdfs://yy'
AS SELECT c2, sum(c1) FROM fact GROUP BY c2

-- create a text index on the fact table
CREATE DATAMAP fact_index ON TABLE fact
USING 'lucene_index'
LOCATION 'hdfs://yy'
DMPROPERTIES ('index_column'='c2')
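A minimal sketch of the external-segment / static-partition idea referenced in the list above; the table, column, and path names are placeholders, and mapping a partition to a segment is the proposal itself, not current behavior.

-- hypothetical: a partitioned carbon table
CREATE TABLE events(c1 int)
PARTITIONED BY (dt string)
STORED BY 'carbondata'

-- attach existing files as a new partition, tracked as a segment with the given path
ALTER TABLE events ADD PARTITION (dt='2018-01-01')
LOCATION 's3a://bucket/existing/data/dt=2018-01-01'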
- Support lucene datamap for text index (it requires UDF push-down to proceed)
- Support R-tree datamap for geospatial analytics
- Support cache table as datamap (repeated)
- Support file-level OutputFormat and Spark/Hive/Presto integration, so that Spark/Hive/Presto can write carbon files just like other formats such as Parquet/ORC. This feature does not support dictionary.
- Support file-level InputFormat and Spark/Hive/Presto integration. This feature can leverage the index, but does not support any feature that needs optimizer enhancement, such as dictionary and datamap.
- Support using the file-level integration with Hive partition syntax. Users can configure a Hive partitioned table to use the carbon file format, as sketched below.
TODO: add a diagram
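A minimal sketch of the Hive-partition usage, assuming the file-level integration registers the format under the name carbondata; the format name and the exact DDL are assumptions.

-- hypothetical: a Hive partitioned table whose data files use the carbon file format
CREATE TABLE sales(id int, amount double)
PARTITIONED BY (dt string)
STORED AS carbondata

INSERT INTO sales PARTITION (dt='2018-01-01') VALUES (1, 9.9)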
- Support a workload analyzer, which takes the create-table script, the query script, and statistics as input
- Support an advisor for CREATE TABLE (main index, dictionary, etc.) and CREATE DATAMAP (index, pre-agg, etc.)
- Support evaluating the effect of the advisor's output in a virtual environment via the EXPLAIN command, i.e. what-if analysis for a virtual datamap (see the sketch after this list)
- Support visualization of the workload to make it easier to tune
- Support continuous monitoring and collection of workload plans, saved to a store such as ES or a JSON file in HDFS
- Support a new format (carbongraph) for an adjacency matrix, enabling high-performance graph traversal
- Implement an RPC-based distributed graph compute framework
- Support prefetching of data to leverage the advantage of carbongraph's format
- Integrate CarbonData with Apache TinkerPop as a TinkerPop-enabled data system provider. Expose the Gremlin language to users as the counterpart of SQL in the data warehouse domain.
- Provide a Graph SQL extension similar to Azure SQL
- Support a segment interface and store segment-related metadata in the Hive metastore
- Support handling metadata correctly in all commands, for cloud environments where data is in S3 and metadata is in the metastore
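A minimal sketch of the EXPLAIN-based what-if flow mentioned in the advisor items above, reusing the fact table from the Example Usage block. Whether the advisor surfaces candidates this way is an assumption; in a real what-if analysis the datamap would only be simulated (virtual), not materialized.

-- a pre-aggregate datamap proposed by the advisor
CREATE DATAMAP agg_candidate ON TABLE fact
USING 'preaggregate'
AS SELECT c2, sum(c1) FROM fact GROUP BY c2

-- check whether the optimizer would rewrite the query to use the candidate datamap
EXPLAIN SELECT c2, sum(c1) FROM fact GROUP BY c2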
Since carbon now has pre-aggregation, many queries can be transformed from a GROUP BY into a point query or range query. To make this faster, we should optimize point-query performance for both single and concurrent queries.
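A small illustration using the fact table and fact_agg datamap from the Example Usage block: once the GROUP BY is served by the pre-aggregate table, a filtered aggregate reduces to a point lookup on the datamap.

-- without pre-aggregation this scans and groups the fact table;
-- with fact_agg it can be answered as a point query on the datamap
SELECT c2, sum(c1) FROM fact
WHERE c2 = 'some_value'
GROUP BY c2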
- Implement a long-running service based on YARN containers or k8s/docker
- Implement an RPC-based execution engine for simple queries (projection and filter only). After generating the physical plan, invoke the RPC and execute in the CarbonStore process instead of converting to an RDD and executing in Spark executors.
- Support data temperature on the cloud for S3 and HDFS. CarbonStore automatically manages data according to the segment time range. Support a data retention policy so that old data automatically spills to cooler storage, e.g. the most recent month in HDFS and the rest in S3.
- In-memory segment
- Streaming pre-aggregate
- Support loading pre-aggregate tables in a rollup manner to improve loading speed, e.g. roll up the month table from the day table instead of from the fact table
- Compaction: auto compaction and closing the stream
- Integrate with Flink
- Integrate with Kafka Connect
- Support MATCH filter UDF push-down
- Vector feature matching
- Lucene index matching
- Support feature vectors in the carbon/spark integration, as a separate data file or datamap
- Support similar functionality for DL/Graph
- Merge index for global sort tables
- Support a timestamp64 datatype for columns that store millisecond-level timestamps
- Make the default value of tempCSV 'false' for DataFrameWriter
- CREATE TABLE on pictures, sound, binary data, etc., and CREATE DATAMAP on this binary data to extract metadata and build a secondary index, so that users can search these pictures by metadata such as time, place, and camera type (see the sketch below)
- CarbonStore should accept JPG pictures as a segment file format
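A hedged sketch of the picture scenario, following the CREATE DATAMAP style from the Example Usage block. The binary column type, the 'metadata_extract' provider, its DMPROPERTIES, and querying the datamap directly are all assumptions for illustration, not existing syntax.

-- hypothetical: a table holding raw pictures as binary data
CREATE TABLE photos(id string, content binary)
STORED BY 'carbondata'

-- hypothetical datamap that extracts metadata (time, place, camera type)
-- from the binary column and builds a secondary index on it
CREATE DATAMAP photo_meta ON TABLE photos
USING 'metadata_extract'
DMPROPERTIES ('source_column'='content', 'fields'='time,place,camera_type')

-- hypothetical: search pictures through the extracted metadata in the datamap
SELECT id FROM photo_meta WHERE camera_type = 'xx'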