Can I cluster by/bucket a table created via "CREATE TABLE AS SELECT....." in Hive?
Came across this question and saw there was no answer provided. I looked further and found the answer in the Hive documentation.
This will never work, because of the following restrictions on CTAS:
- The target table cannot be a partitioned table.
- The target table cannot be an external table.
- The target table cannot be a list bucketing table.
Source: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect%28CTAS
Furthermore https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
...
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
...
[AS select_statement];
Clustering requires the column to be defined and then the cfg goes to the As select_statement Therefore at this time it is not possible.
Optionally, you can ALTER the table and add buckets, but this does not change existing data.
CREATE TABLE BUCKET_TABLE
STORED AS ORC AS
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll limit 0;
ALTER TABLE BUCKET_TABLE CLUSTERED BY (key) INTO 1000 BUCKETS;
ALTER TABLE BUCKET_TABLE SET TBLPROPERTIES ('transactional'='true');
INSERT INTO BUCKET_TABLE
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll;