Download HTTPS website available only through username and password with wget?
The session information is probably saved in a cookie to allow you to navigate to other pages after you have logged in.
If this is the case, you could do this in two steps:
- Use wget's --save-cookies mycookies.txt and --keep-session-cookies options on the login page of the website, along with the --user and --password options (wget spells the username option --user, not --username).
- Use wget's --load-cookies mycookies.txt option on the subsequent pages you are trying to retrieve.
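The two steps above might look like the following sketch. The URLs, the cookie file name, and the function names login and fetch_page are placeholders, not values from the actual site. Keep in mind that --user and --password only help if the site uses HTTP authentication rather than a login form.

```shell
#!/bin/sh
# Placeholder values (assumptions): replace with your site's URLs.
login_url='https://example.com/login'
page_url='https://example.com/private/page.html'
cookie_jar='mycookies.txt'

# Step 1: request the login page and store every cookie it sets.
# --keep-session-cookies is needed because session cookies carry no expiry
# date and wget would otherwise leave them out of the file.
login() {
    wget --save-cookies "$cookie_jar" --keep-session-cookies \
         --user "$1" --password "$2" \
         -O /dev/null "$login_url"
}

# Step 2: send the stored cookies back when fetching a protected page.
fetch_page() {
    wget --load-cookies "$cookie_jar" "$page_url"
}
```

Calling login 'foo' 'bar' && fetch_page would then run both steps in order.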
EDIT
If the --user and --password options don't work (they only cover HTTP authentication, not a login form), you must find out what the login page sends to the server and mimic it:
- For a GET request, you can add the GET parameters directly to the address wget must fetch (make sure you properly quote the &, = and other special characters). The URL would probably look something like https://the_url?user=foo&pass=bar.
- For a POST request, you can use wget's --post-data=the_needed_info option to send the needed login info with the POST method.
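Quoting matters in both cases: a & or = inside a value has to be percent-encoded, or the server will read it as a separator. Here is a minimal sketch of a pure-shell encoder, assuming ASCII input; urlencode and build_login_data are made-up helper names, and user/pass are the example field names from the URL above.

```shell
#!/bin/sh
# Percent-encode a string for use in a query string or form body.
# Sketch for ASCII input only; letters, digits and "-._~" are left as-is.
# The body is a subshell, "( ... )", so the LC_ALL change stays local.
urlencode() (
    LC_ALL=C
    s=$1
    out=''
    while [ -n "$s" ]; do
        rest=${s#?}        # everything after the first character
        c=${s%"$rest"}     # the first character itself
        s=$rest
        case $c in
            [A-Za-z0-9._~-]) out="$out$c" ;;
            *) out="$out$(printf '%%%02X' "'$c")" ;;
        esac
    done
    printf '%s\n' "$out"
)

# Build the login data from two already-known values.
build_login_data() {
    printf 'user=%s&pass=%s\n' "$(urlencode "$1")" "$(urlencode "$2")"
}

data=$(build_login_data 'foo' 'b&r=1')
echo "$data"    # prints user=foo&pass=b%26r%3D1

# GET:  wget "https://the_url?$data"
# POST: wget --post-data="$data" 'https://the_url'
```

The two wget lines are left as comments because https://the_url is only a stand-in for the real address.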
EDIT 2
It seems that you indeed need the POST method with j_username and j_password set. Try passing the --post-data='j_username=yourusername&j_password=yourpassword' option to wget.
EDIT 3
With the page of origin, I was able to understand a little more of what is happening. That being said, I cannot make sure that it works because, well, I don't have (nor do I want) valid credentials.
Here is what's happening:
- The page https://progtest.fit.cvut.cz/ sets a PHPSESSID cookie and presents you with login options.
- Clicking the login button sends a request to https://progtest.fit.cvut.cz/shibboleth-fit.php, which takes the PHPSESSID cookie (not sure if it uses it) and redirects you to the SSO engine with a specially crafted URL just for you, which looks like this: https://idp2.civ.cvut.cz/idp/profile/SAML2/Redirect/SSO?SAMLRequest=SOME_VERY_LONG_AND_UNIQUE_ID
- The SSO response sets a new cookie named _idp_authn_lc_key and redirects you to the page https://idp2.civ.cvut.cz:443/idp/AuthnEngine, which redirects you again to https://idp2.civ.cvut.cz:443/idp/Authn/UserPassword (the real login page).
- You enter your credentials and send the POST data j_username and j_password along with the cookie from the SSO response.
- ???
The first four steps can be done with wget like this:
origin='https://progtest.fit.cvut.cz/'
# Get the PHPSESSID cookie
wget --save-cookies phpsid.cki --keep-session-cookies "$origin"
# Get the _idp_authn_lc_key cookie
wget --load-cookies phpsid.cki --save-cookies sso.cki --keep-session-cookies --header="Referer: $origin" 'https://progtest.fit.cvut.cz/shibboleth-fit.php'
# Send your credentials
wget --load-cookies sso.cki --save-cookies auth.cki --keep-session-cookies --post-data='j_username=usr&j_password=pwd' 'https://idp2.civ.cvut.cz/idp/Authn/UserPassword'
Note that wget follows redirects on its own, which helps us quite a bit in this case.
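Because these commands could not be tested with valid credentials, one way to see how far the chain got is to check whether each cookie file contains the cookie its step was supposed to produce. A minimal sketch, assuming the tab-separated Netscape cookie file format that wget writes (cookie name in the sixth field); has_cookie and check_step are made-up helper names.

```shell
#!/bin/sh
# has_cookie FILE NAME: succeed if FILE (Netscape cookie format, as written
# by wget --save-cookies) holds a cookie called NAME.
has_cookie() {
    awk -F '\t' -v name="$2" '
        /^#/ { next }                 # skip comment lines
        $6 == name { found = 1 }      # field 6 is the cookie name
        END { exit !found }
    ' "$1"
}

# Print whether the cookie file from one step contains the expected cookie.
check_step() {
    if [ -f "$1" ] && has_cookie "$1" "$2"; then
        echo "$1: found $2"
    else
        echo "$1: $2 missing"
    fi
}

check_step phpsid.cki PHPSESSID
check_step sso.cki _idp_authn_lc_key
```

If the second check fails, the redirect to the SSO engine most likely never happened, so there is little point in sending the credentials yet.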
Why are you playing around with wget? It would be better to use a headless browser to automate this task.
What is a headless browser, you ask?
A headless browser is a web browser without a graphical user interface. They provide automated control of a web page in an environment similar to popular web browsers, but are executed via a command line interface or using network communication.
Two popular headless browsers are phantomjs (JavaScript) and Ghost.py (Python).
Solution using phantomjs
First, you will need to install phantomjs. On Ubuntu-based systems, you can install it using the package manager, or you can build it from the source available on its home page.
sudo apt-get install phantomjs
After this, write a JavaScript script and run it using phantomjs:
phantomjs script.js
That's it.
Now, to learn how to implement it for your case, head over to its quickstart guide. As an example, to log in to Facebook automatically and take a snapshot, one could use the gist provided here:
// This code logs in to your Facebook account and takes a snapshot of it.
var page = require('webpage').create();

// Runs inside the page: fill in the login form and submit it.
var fillLoginInfo = function () {
    var frm = document.getElementById("login_form");
    frm.elements["email"].value = 'your fb email/username';
    frm.elements["pass"].value = 'password';
    frm.submit();
};

page.onLoadFinished = function () {
    // Still on the login page: submit the credentials and wait for the next load.
    if (page.title == "Welcome to Facebook - Log In, Sign Up or Learn More") {
        page.evaluate(fillLoginInfo);
        return;
    }
    // Any other page: take the snapshot and quit.
    page.render('./screens/some.png');
    console.log("completed");
    phantom.exit();
};

page.open('https://www.facebook.com/');
Look around the documentation to implement it for your specific case. If you run into trouble with your HTTPS website because of SSL errors, run your script like this:
phantomjs --ssl-protocol=any script.js
Solution using Ghost.py
To install Ghost.py, you will need pip:
sudo apt-get install python-pip #On a Debian based system
sudo pip install Ghost.py
Ghost.py is now installed. To use it inside a Python script, follow the documentation on its home page. I tried using Ghost.py on an HTTPS website, but it didn't work for me. Do try it and see if it works for you.
UPDATE: GUI-based solution
You can also use tools like Selenium to automate the login process and retrieve the information. It is pretty easy to use: you just need to install a plugin for your browser from here, and then you can record your process and replay it later on.
Try using curl:
curl --data "j_username=value1&j_password=value2" https://idp2.civ.cvut.cz/idp/Authn/UserPassword
You may need to look at the response type and set the Content-Type header to match (e.g. XML, JSON, etc.).
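A somewhat fuller sketch of the same request: it keeps cookies between calls, follows redirects, and lets curl URL-encode the credentials. The cookie file name and the curl_login function are assumptions; the URL and field names are the ones from the command above. Plain --data is sent as application/x-www-form-urlencoded, so an explicit Content-Type header (via -H) should only be needed if the server expects something else.

```shell
#!/bin/sh
login_url='https://idp2.civ.cvut.cz/idp/Authn/UserPassword'
jar='cookies.txt'   # assumed file name

# curl_login USER PASSWORD
#   --location        follow the redirects after the POST
#   --cookie          send cookies already stored in the jar
#   --cookie-jar      write cookies back to the jar when curl exits
#   --data-urlencode  percent-encode each value before sending it
curl_login() {
    curl --location \
         --cookie "$jar" --cookie-jar "$jar" \
         --data-urlencode "j_username=$1" \
         --data-urlencode "j_password=$2" \
         "$login_url"
}
```

It would be called as curl_login 'value1' 'value2'; whether the server accepts the login without the earlier SSO cookies is not something this sketch can guarantee.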