Using wget to recursively fetch a directory with arbitrary files in it
To download a directory recursively, rejecting index.html* files and saving the result without the hostname and the leading parent directories (dir1/dir2) in the local path:
wget -r -nH --cut-dirs=2 --no-parent --reject="index.html*" http://mysite.com/dir1/dir2/data
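To see what -nH and --cut-dirs do to the saved path, here is a rough sketch that mimics the mapping with plain string manipulation. The local_path function and the file.txt name are made up for illustration; it does not call wget or touch the network, and it only approximates wget's behaviour for simple URLs.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): predicts the local path that
# wget -nH --cut-dirs=N would use for a simple URL.
local_path() {
    url=$1
    cut=$2
    # Drop the scheme and the hostname, which is the effect of -nH.
    path=${url#*://}
    path=${path#*/}
    # Drop up to $cut leading directory components, like --cut-dirs.
    i=0
    while [ "$i" -lt "$cut" ]; do
        case $path in
            */*) path=${path#*/} ;;   # remove one leading directory
            *) break ;;               # only the file name is left
        esac
        i=$((i + 1))
    done
    printf '%s\n' "$path"
}

local_path http://mysite.com/dir1/dir2/data/file.txt 2   # prints data/file.txt
local_path http://mysite.com/dir1/dir2/data/file.txt 0   # prints dir1/dir2/data/file.txt
```

So with --cut-dirs=2 the files land under data/ in the current directory rather than under mysite.com/dir1/dir2/data/.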
For anyone else having similar issues: Wget follows robots.txt, which might not allow you to grab the site. No worries, you can turn it off:
wget -e robots=off http://www.example.com/
http://www.gnu.org/software/wget/manual/html_node/Robot-Exclusion.html
You have to pass the -np/--no-parent option to wget (in addition to -r/--recursive, of course), otherwise it will follow the link in the directory index on my site to the parent directory. So the command would look like this:
wget --recursive --no-parent http://example.com/configs/.vim/
To avoid downloading the auto-generated index.html files, use the -R/--reject option:
wget -r -np -R "index.html*" http://example.com/configs/.vim/
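The options from the answers above can be combined into one command. The following is only a sketch: fetch_dir and the DRY_RUN variable are names invented here, and whether you want robots=off or a non-zero --cut-dirs depends on the site you are mirroring.

```shell
#!/bin/sh
# Hypothetical wrapper around wget that combines the options discussed above.
# Usage: fetch_dir URL [CUT_DIRS]
# Set DRY_RUN=1 to print the command instead of running it.
fetch_dir() {
    url=$1
    cut=${2:-0}   # number of leading directories to drop locally
    set -- wget --recursive --no-parent --no-host-directories \
        --cut-dirs="$cut" --reject "index.html*" -e robots=off "$url"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"   # show what would be executed
    else
        "$@"                 # run wget with the assembled options
    fi
}

# Print the command for the example URL without downloading anything.
DRY_RUN=1
fetch_dir http://example.com/configs/.vim/ 1
```

Unset DRY_RUN (or set it to 0) to actually perform the download.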