Search/Find a file and file content in Hadoop
You can use the Solr HdfsFindTool (org.apache.solr.hadoop.HdfsFindTool); it is faster than 'hdfs dfs -ls -R' and more useful.
hadoop jar search-mr-job.jar org.apache.solr.hadoop.HdfsFindTool -find /user/hive/tmp -mtime 7
Usage: hadoop fs [generic options]
[-find <path> ... <expression> ...]
[-help [cmd ...]]
[-usage [cmd ...]]
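If you don't have the Solr jar on hand, newer Hadoop releases also ship a built-in -find command; it only supports name-based expressions (-name, -iname, -print), not -mtime, but needs no extra jar. A minimal sketch (the path and glob below are just examples):
# Recursively find files under /user/hive/tmp whose names match the glob
hdfs dfs -find /user/hive/tmp -name '*.orc' -print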
- You can do this:
hdfs dfs -ls -R / | grep [search_term]
- It sounds like a MapReduce job might be suitable here. Here's something similar, but for text files. However, if these documents are small, you may run into inefficiencies: each file will be assigned to one map task, and if the files are small, the overhead of setting up the map task may be significant compared to the time needed to process the file.
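If you do want to go the MapReduce route, one low-effort option is the grep program bundled with the Hadoop MapReduce examples jar; the jar path, input/output directories, and regex below are illustrative and depend on your distribution:
# Runs a MapReduce job that extracts strings matching the regex from every file
# under the input directory, counts them, and writes the counts to the output
# directory (which must not already exist).
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    grep /user/hive/tmp /tmp/grep-out 'search_term'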
Usually when I'm searching for files in Hadoop, as stated by ajduff574, it's done with:
hdfs dfs -ls -R $path | grep "$file_pattern" | awk '{print $8}'
This simply prints out the path of each file matching the pattern, and the output can then be manipulated further in case you wish to search within the content of the files. For example:
hdfs dfs -cat $(hdfs dfs -ls -R $path | grep "$file_pattern" | awk '{print $8}') | grep "$search_pattern"
search_pattern: The content that you are looking for within the file
file_pattern: The file that you are looking for.
path: The path for the search to look into recursively; this includes subfolders as well.
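As a convenience, the two commands above can be wrapped in a small shell function; the name hdfs_grep and the sed prefixing of file names are my own additions, not part of the original answer:
# hdfs_grep <path> <file_pattern> <search_pattern>
hdfs_grep() {
    local path="$1" file_pattern="$2" search_pattern="$3"
    hdfs dfs -ls -R "$path" | grep "$file_pattern" | awk '{print $8}' |
    while read -r f; do
        # Prefix each matching line with the file it came from
        hdfs dfs -cat "$f" | grep "$search_pattern" | sed "s|^|$f: |"
    done
}
For example, hdfs_grep /user/hive/tmp 'access_log' 'ERROR' would print every line containing ERROR from files whose listing matches access_log, prefixed with the file path.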
Depending on how the data is stored in HDFS, you may need to use the -text option to dfs for a string search. In my case I had thousands of messages stored daily in a series of HDFS sequence files in Avro format. From the command line on an edge node, this script:
- Searches the /data/lake/raw directory at its first level for a list of files.
- Passes the result to awk, which outputs columns 6 & 8 (date and file name)
- Grep outputs lines with the file date in question (2018-05-03)
- Passes those lines with two columns to awk, which outputs only column 2, the list of files.
- That is read with a while-loop, which takes each file name and extracts it from HDFS as text.
- Each line of the file is grepped for the string "7375675".
- Lines meeting that criterion are output to the screen (stdout)
There is a Solr jar-file implementation that is supposedly faster, but I have not tried it.
hadoop fs -ls /data/lake/raw | awk '{print $6" "$8}' | grep 2018-05-03 | awk '{print $2}' | while read f; do hadoop fs -text "$f" | grep 7375675 && echo "$f"; done
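For readability, here is the same pipeline broken across lines with comments; it is functionally equivalent, and 2018-05-03 and 7375675 are the example date and search string from above:
hadoop fs -ls /data/lake/raw |   # list the first level of the directory
  awk '{print $6" "$8}' |        # keep the date and path columns
  grep 2018-05-03 |              # keep only files dated 2018-05-03
  awk '{print $2}' |             # keep only the path column
  while read -r f; do
    # Decode each file (sequence file / Avro) to text, search it, and
    # print the file name if any line matched
    hadoop fs -text "$f" | grep 7375675 && echo "$f"
  done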