How to: Download a page from the Wayback Machine over a specified interval
Wayback Machine URLs are formatted as follows:
$BASEURL/$TIMESTAMP/$TARGET
Here BASEURL is usually http://web.archive.org/web (I say usually because I am unsure whether it is the only BASEURL).
TARGET is self-explanatory (in your case http://nature.com, or some similar URL).
TIMESTAMP is the time the capture was made (in UTC), formatted as YYYYmmddHHMMss:
YYYY: Year
mm: Month (2 digits, 01 to 12)
dd: Day of month (2 digits, 01 to 31)
HH: Hour (2 digits, 00 to 23)
MM: Minute (2 digits, 00 to 59)
ss: Second (2 digits, 00 to 59)
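As a rough illustration of the format, here is a small sketch that assembles a capture URL from an epoch time. The function names epoch_to_timestamp and wayback_url are made up for this example, and the BSD/GNU split assumes the two date variants shipped with macOS and Linux.

```shell
#!/usr/bin/env bash
# Sketch: build a Wayback Machine URL from an epoch time and a target URL.
# epoch_to_timestamp and wayback_url are illustrative names, not part of any tool.
BASEURL='http://web.archive.org/web'

# Convert epoch seconds to a 14-digit UTC timestamp (YYYYmmddHHMMss).
epoch_to_timestamp() {
    case "$(uname -s)" in
        Darwin|*BSD) date -u -r "$1" '+%Y%m%d%H%M%S' ;;   # BSD date: epoch via -r
        *)           date -u -d "@$1" '+%Y%m%d%H%M%S' ;;  # GNU date: epoch via -d @
    esac
}

# Print $BASEURL/$TIMESTAMP/$TARGET for epoch time $1 and target URL $2.
wayback_url() {
    printf '%s/%s/%s\n' "$BASEURL" "$(epoch_to_timestamp "$1")" "$2"
}

# 1325419200 is Jan 1 2012 12:00:00 UTC.
wayback_url 1325419200 'http://nature.com'
# prints http://web.archive.org/web/20120101120000/http://nature.com
```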
If you request a capture time that doesn't exist, the Wayback Machine redirects to the closest capture for that URL, whether in the future or the past.
You can use that feature to collect one URL per day by sending a curl -I (HTTP HEAD) request for each day and reading the redirect target:
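BASEURL='http://web.archive.org/web'
TARGET="SET_THIS"
START=1325419200 # Jan 1 2012 12:00:00 UTC (Noon)
END=1356998400 # Tue Jan 1 00:00:00 UTC 2013

# Print the UTC timestamp (YYYYmmddHHMMss) for an epoch time.
# BSD date (macOS) takes the epoch via -r, GNU date (Linux) via -d @.
if uname -s | grep -q 'Darwin'; then
    to_timestamp() { date -u -r "$1" '+%Y%m%d%H%M%S'; }
elif uname -s | grep -q 'Linux'; then
    to_timestamp() { date -u -d "@$1" '+%Y%m%d%H%M%S'; }
else
    echo "Unsupported platform: $(uname -s)" >&2
    exit 1
fi

while [[ $START -lt $END ]]; do
    TIMESTAMP=$(to_timestamp "$START")
    # Header names are case-insensitive and header lines end in CR, so handle both.
    REDIRECT="$(curl -sI "$BASEURL/$TIMESTAMP/$TARGET" |
        awk 'tolower($1) == "location:" {sub(/\r$/, "", $2); print $2}')"
    if [[ -z "$REDIRECT" ]]; then
        # No redirect: a capture exists at exactly this timestamp.
        echo "$BASEURL/$TIMESTAMP/$TARGET"
    else
        echo "$REDIRECT"
    fi
    START=$((START + 86400)) # add 24 hours
done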
This gets you the URLs that are closest to noon on each day of 2012. Just remove the duplicates and download the pages.
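The deduplicate-and-download step could look roughly like the sketch below. It assumes the script's output was saved one URL per line to a file; the function names, the urls.txt and captures names, the one-file-per-timestamp naming, and the one-second pause are all choices made for this example.

```shell
#!/usr/bin/env bash
# Sketch: deduplicate a list of capture URLs and download each one once.
# All function names here are illustrative.

# Print each distinct URL once (input on stdin, one URL per line).
unique_urls() {
    sort -u
}

# Derive a local file name from the 14-digit timestamp in a capture URL.
# URLs without a recognisable timestamp all map to unknown.html.
capture_filename() {
    local ts
    ts=$(printf '%s\n' "$1" | sed -n 's#.*/web/\([0-9]\{14\}\)[^/]*/.*#\1#p')
    printf '%s.html\n' "${ts:-unknown}"
}

# Download every distinct URL listed in file $1 into directory $2.
download_all() {
    local url
    mkdir -p "$2"
    unique_urls < "$1" | while IFS= read -r url; do
        [ -n "$url" ] || continue
        # -L follows any further redirect to the final capture.
        curl -sL -o "$2/$(capture_filename "$url")" "$url"
        sleep 1   # pause between requests to go easy on the archive
    done
}

# Usage (hypothetical file and directory names):
#   download_all urls.txt captures
```

Naming files by capture timestamp keeps one file per distinct capture, which matches the deduplicated list.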
Note: The script above can probably be greatly improved to jump forward when the REDIRECT is for a URL more than 1 day in the future, but that requires deconstructing the returned URL and adjusting START to the correct date value.
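One possible way to do that jump is sketched below. It has not been run against the live service; it assumes the redirect URL contains a 14-digit timestamp after /web/, and the helper names url_timestamp, timestamp_to_epoch and next_start are invented for this example.

```shell
#!/usr/bin/env bash
# Sketch: compute the next START so that days already covered by a
# future capture are skipped. Helper names are illustrative.

# Extract the 14-digit timestamp from a capture URL (empty if none is found).
url_timestamp() {
    printf '%s\n' "$1" | sed -n 's#.*/web/\([0-9]\{14\}\)[^/]*/.*#\1#p'
}

# Convert YYYYmmddHHMMss (UTC) to epoch seconds.
timestamp_to_epoch() {
    local ts=$1
    case "$(uname -s)" in
        Darwin|*BSD)
            date -u -j -f '%Y%m%d%H%M%S' "$ts" '+%s' ;;
        *)
            date -u -d "${ts:0:4}-${ts:4:2}-${ts:6:2} ${ts:8:2}:${ts:10:2}:${ts:12:2}" '+%s' ;;
    esac
}

# Given the current START ($1) and the redirect URL ($2), print the next START.
next_start() {
    local start=$1 redirect=$2 ts captured next
    next=$((start + 86400))
    ts=$(url_timestamp "$redirect")
    if [ -n "$ts" ]; then
        captured=$(timestamp_to_epoch "$ts")
        # Advance a day at a time until the next request falls after the capture.
        while [ "$next" -le "$captured" ]; do
            next=$((next + 86400))
        done
    fi
    printf '%s\n' "$next"
}

# In the main loop, START=$((START + 86400)) would become:
#   START=$(next_start "$START" "$REDIRECT")
```

Skipping is safe here because every noon between START and the returned capture would be closer to that same capture than to any other, so those requests would only produce duplicates.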
There is also a Ruby gem on GitHub: https://github.com/hartator/wayback-machine-downloader