Copy files to local from multiple directories in HDFS for last 24 hours
You can make it simpler by using "find" in combination with "cp", for example:
find /path/to/directory/ -type f -name "*.csv" | xargs cp -t /path/to/copy
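To restrict the copy to files from the last 24 hours, as the question asks, you can add a time test to find; a sketch assuming GNU find and GNU cp:
# -mmin -1440 matches files modified within the last 24 hours
find /path/to/directory/ -type f -name "*.csv" -mmin -1440 | xargs cp -t /path/to/copy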
If you want to clean your directory of files older than 24 hours, you can use:
find /path/to/files/ -type f -name "*.csv" -mmin +1440 | xargs rm -f
(-mmin +1440 matches files last modified more than 24 hours ago; -mtime +1 would only match files older than two full days, because find counts whole 24-hour periods)
Maybe you can implement these as a script, then schedule it as a cron job.
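For example, a crontab entry along these lines would run such a script once a day (the script name and log path below are placeholders):
# run the copy/cleanup script every day at 01:00
0 1 * * * /path/to/copy_csv.sh >> /var/log/copy_csv.log 2>&1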
Note: I was unable to test this, but you can check each step by looking at its output.

Normally I would say Never parse the output of ls, but with Hadoop you don't have a choice here, as there is no real equivalent to find. (Since 2.7.0 there is a find, but according to the documentation it is very limited.)
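For completeness, that built-in find can only match on name patterns and has no modification-time test, which is why it does not help with the 24-hour requirement; a small example of what it does offer:
$ hadoop fs -find /path/to/folder/ -name '*.csv' -print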
Step 1: recursive ls
$ hadoop fs -ls -R /path/to/folder/
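Each line of that listing has the following fields, which is what the awk field references below rely on (exact spacing and values vary):
$1 permissions, $2 replication, $3 owner, $4 group, $5 size,
$6 date (yyyy-MM-dd), $7 time (HH:mm), $8 = $NF path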
Step 2: use awk to pick regular files only and CSV files only. Directories are recognized by their permissions, which start with d, so we have to exclude those, and the CSV files are recognized by the last field ending with ".csv":
$ hadoop fs -ls -R /path/to/folder/ | awk '!/^d/ && /\.csv$/'
Make sure you do not end up with stray lines here that are empty or contain just a directory name.
Step 3: continue using awk to process the time. I am assuming you have a standard awk, so I will not use GNU extensions. Hadoop outputs the time in the format yyyy-MM-dd HH:mm. This format sorts lexicographically and is located in fields 6 and 7 (the date -d '-24 hours' in the command below does, however, require GNU date):
$ hadoop fs -ls -R /path/to/folder/ \
| awk -v cutoff="$(date -d '-24 hours' '+%F %H:%M')" \
'(!/^d/) && /\.csv$/ && (($6" "$7) > cutoff)'
Step 4: Copy files one by one:
First, check the command you are going to execute:
$ hadoop fs -ls -R /path/to/folder/ \
| awk -v cutoff="$(date -d '-24 hours' '+%F %H:%M')" \
'(!/^d/) && /\.csv$/ && (($6" "$7) > cutoff) {
print "migrating", $NF
cmd="hadoop fs -get "$NF" /path/to/local/"
print cmd
# system(cmd)
}'
(remove the # in front of system(cmd) if you want to actually execute the copy)
or
$ hadoop fs -ls -R /path/to/folder/ \
| awk -v cutoff="$(date -d '-24 hours' '+%F %H:%M')" \
'(!/^d/) && /\.csv$/ && (($6" "$7) > cutoff) {
print $NF
}' | xargs -I{} echo hadoop fs -get '{}' /path/to/local/
(remove the echo if you want to actually execute the copy)
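If you want to go the cron route suggested in the other answer, here is a sketch of a wrapper script built from the commands above (all paths are placeholders, and it assumes GNU date; untested):
#!/bin/sh
# copy_csv.sh -- copy CSV files modified in the last 24 hours from HDFS to a local directory
SRC=/path/to/folder/
DEST=/path/to/local/
CUTOFF="$(date -d '-24 hours' '+%F %H:%M')"   # requires GNU date

hadoop fs -ls -R "$SRC" \
  | awk -v cutoff="$CUTOFF" '(!/^d/) && /\.csv$/ && (($6" "$7) > cutoff) { print $NF }' \
  | xargs -I{} hadoop fs -get '{}' "$DEST"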