How to get site's 'real' URL (before being rewritten)
I'm not sure what your question is. Let's say you have a simple rewrite rule that redirects requests for http://www.example.com/old.html to http://www.example.com/descriptive-directory/new.html, something like:
RewriteRule ^old\.html$ http://www.example.com/descriptive-directory/new.html [R=301,L]
A user's web browser then sends an HTTP GET request to www.example.com to fetch old.html:
GET /old.html HTTP/1.1
Host: www.example.com
The web server matches this against its rewrite rules and sends back an HTTP response like:
HTTP/1.1 301 Moved Permanently
Location: http://www.example.com/descriptive-directory/new.html
Your browser then fetches whatever content is at http://www.example.com/descriptive-directory/new.html as if you had originally typed in the rewritten URL.
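The exchange above can be reproduced by hand. The following is a minimal sketch, not a definitive tool: it issues each GET without following redirects automatically and records every hop, so you can see both the URL you asked for and the one you were sent to. The names `fetch_once` and `trace_redirects` are made up for this illustration.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_once(url):
    """Issue one GET without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("GET", target)
        resp = conn.getresponse()
        resp.read()  # drain the body before closing the connection
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def trace_redirects(url, fetch=fetch_once, max_hops=10):
    """Return a list of (url, status) pairs, one per hop in the redirect chain."""
    chain = []
    for _ in range(max_hops):
        status, location = fetch(url)
        chain.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return chain
```

With the rewrite rule from this answer in place, calling `trace_redirects("http://www.example.com/old.html")` would be expected to yield the 301 hop followed by the 200 hop. The `fetch` parameter exists only so the network call can be swapped out, for example when testing.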
So what is your question? You presumably know (and can easily log) the web addresses your browser requested before they were rewritten. At the very least you can capture the GET requests by following the TCP stream with a tool like Wireshark.
You also know where the redirect rule ultimately sent you, since that location is now displayed in your web browser. If you have access to the Apache logs on the web server side, you'll see something like:
127.0.0.1 - - [2/Feb/2012:12:36:17 -0400] "GET /old.html HTTP/1.0" 301 315 "-" "Mozilla/5.0"
127.0.0.1 - - [2/Feb/2012:12:36:17 -0400] "GET /descriptive-directory/new.html HTTP/1.0" 200 1702 "-" "Mozilla/5.0"
though you could just as easily look in the Apache configuration to find the actual rewrite rules.
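If you only have the access log, the pairs can be pulled out mechanically. This is a rough sketch that assumes log lines shaped like the two above. The access log does not record the Location header, so pairing a 3xx entry with the next request from the same client is a heuristic, not proof of what the rewrite rule says. `LOG_LINE` and `find_redirect_pairs` are names invented for this example.

```python
import re

# Matches the leading fields of the log lines shown above.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def find_redirect_pairs(lines):
    """Pair each 3xx entry with the next request made by the same client."""
    pending = {}  # client host -> path that was answered with a redirect
    pairs = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        host, path, status = m["host"], m["path"], int(m["status"])
        if host in pending:
            pairs.append((pending.pop(host), path))
        if 300 <= status < 400:
            pending[host] = path
    return pairs


sample = [
    '127.0.0.1 - - [2/Feb/2012:12:36:17 -0400] "GET /old.html HTTP/1.0" 301 315 "-" "Mozilla/5.0"',
    '127.0.0.1 - - [2/Feb/2012:12:36:17 -0400] "GET /descriptive-directory/new.html HTTP/1.0" 200 1702 "-" "Mozilla/5.0"',
]
pairs = find_redirect_pairs(sample)
```

For the two sample lines, `pairs` holds a single tuple linking `/old.html` to `/descriptive-directory/new.html`.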
Note: none of this has anything to do with where the content is stored on the web server. There may be no directory called descriptive-directory, and no files called new.html or old.html, on the web server. The web server could handle a request to http://www.example.com/descriptive-directory/new.html entirely in code and return a dynamically generated HTML page. For example, the following simple web.py code acts as a web server without any HTML files existing.
# call this file silly_website.py
import web

# Map the URL path to the handler class named 'new'.
urls = (
    '/descriptive-directory/new.html', 'new',
)

class new(object):
    def GET(self):
        # The page is built in code; no new.html exists on disk.
        return "<html><head><title>Hello</title></head><body>World! from new</body></html>"

app = web.application(urls, globals())

if __name__ == '__main__':
    app.run()
This can then be run as python silly_website.py [your_ip], giving you a running web server that returns a very simple web page for a request to /descriptive-directory/new.html. As such, there's no generic way of finding out where content returned by the web server is actually stored on the server (even relative to the web server's root directory).
This is not possible unless you know the rewrite rule. In some cases, direct access to the "real" file is forbidden entirely.
Other than that, you could try using DirBuster with a custom directory list, such as one built from the SEO-friendly URLs. Being hackers, we all know how to write code, so this is pretty trivial.
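One way to build such a list is sketched below. It assumes that the words in the friendly URLs hint at the underlying file names, which may well not hold for a given site. The function name `candidate_names` and the default extensions are made up for the example. The output is a plain list of names, one per entry, that could be written to a text file and used as the word list on a site you are allowed to test.

```python
from urllib.parse import urlsplit


def candidate_names(urls, extensions=("", ".html", ".php")):
    """Derive candidate file and directory names from friendly URL paths."""
    seen = set()
    names = []
    for url in urls:
        for segment in urlsplit(url).path.split("/"):
            # Drop a trailing extension such as ".html" to get the bare slug.
            stem = segment.rsplit(".", 1)[0]
            if not stem:
                continue
            # Try the whole slug first, then each hyphen-separated word.
            stems = [stem] + [word for word in stem.split("-") if word]
            for s in stems:
                for ext in extensions:
                    name = s + ext
                    if name not in seen:
                        seen.add(name)
                        names.append(name)
    return names


wordlist = candidate_names(
    ["http://www.example.com/descriptive-directory/new.html"],
    extensions=("", ".html"),
)
```

Order is preserved and duplicates are dropped, so the most literal guesses (the full slug) come before the individual words.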