how to get input file name as column within hive query
Every table in Hive has two virtual columns. They are
INPUT__FILE__NAME
BLOCK__OFFSET__INSIDE__FILE
INPUT__FILE__NAME
gives the name of the file.
BLOCK__OFFSET__INSIDE__FILE
is the current global file position.
Suppose if we want to find the name of the file corresponding to each record in a file. We can use the INPUT__FILE__NAME
column. This feature is available from Hive versions above 0.8. A small example is given below.
Query
select INPUT__FILE__NAME, name from customer_data;
This will give us the file name corresponding to each record. If you want to get the file names corresponding to a hive table, the below query will help you.
select distinct(INPUT__FILE__NAME) from customer_data;
Yes, you can retrieve the file the record was found in using the virtual column named INPUT__FILE__NAME
, for example:
select INPUT__FILE__NAME, id, name from users where ...;
yields something like:
hdfs://localhost.localdomain:8020/user/hive/warehouse/users/users1.txt 2 user2
hdfs://localhost.localdomain:8020/user/hive/warehouse/users/users2.txt 42 john.doe
If necessary, use the provided string functions to trim the host and directories from the uri.
You can find the documentation on virtual columns here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns