Hive External Table Skip First Row

Header rows in data are a perpetual headache in Hive. Short of modifying the Hive source, I believe you can't get away without an intermediate step. (Edit: This is no longer true, see update below)

Unfortunately, that answers you question. I'll throw in some ideas for the intermediate step for completeness.

You can get away without an extra step in your data load if you are willing to filter out the header row on every query that touches the table. Unfortunately this adds an extra set just about everywhere else. And you will have to get clever/messy when the header row violates your schema. If you go with this approach, you might consider writing a custom SerDe that makes this row easier to filter. Unfortunately, SerDe's cannot remove the row entirely (or that might form a possible solution), they must return something like null. I've never seen this approach taken in practice to deal with header rows since it makes reading a pain, and reading tends to be much more common than writing. It might have a place if you are dealing with one-of tables or if the header row is just one row among many malformed rows.

You could do this filtering once with variations on deleting that first row in data load. A WHERE clause in an INSERT statement would do it. You could use utilities like sed to get rid of it. I've seen both approaches taken. There are trade-offs between which approach you take and neither is the one true way to deal with header rows. Unfortunately, both these approaches take time and require temporary duplication of the data. If you absolutely need the header row for another application, the duplication would be permanent.

Update:

From Hive v0.13.0, you can use skip.header.line.count. You could also specify the same while creating the table. For example:

create external table testtable (name string, message string)
row format delimited 
fields terminated by '\t' 
lines terminated by '\n' 
location '/testtable'
tblproperties ("skip.header.line.count"="1");

While you have your answer from Daniel, here are some customizations possible using OpenCSVSerde:

CREATE EXTERNAL TABLE `mydb`.`mytable`(
    `product_name` string,
    `brand_id` string,
    `brand` string,
    `color` string,
    `description` string,
    `sale_price` string)
PARTITIONED BY (
    `seller_id` string)
ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = '\t',
    'quoteChar' = '"',
    'escapeChar' = '\\')
STORED AS INPUTFORMAT
    'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
    'hdfs://namenode.com:port/data/mydb/mytable'
TBLPROPERTIES (
    'serialization.null.format' = '',
    'skip.header.line.count' = '1')

With this, you have total control over the separator, quote character, escape character, null handling and header handling.

Look here and here.


Just append below property in your query and the first header or line int the record will not load or it will be skipped.

Try this

tblproperties ("skip.header.line.count"="1");

Tags:

Hive

Cloudera