Syntax for apache RewriteRule to match %-encoded URLs? (to fix character encoding issues; windows-1252 <=> utf-8 )
Solution 1:
You can't "convert encodings" as such using only mod_rewrite, however, you can search for that specific sequence of characters in the requested URL and "correct it".
http://localhost:60151/load?file=http://example.org/project²/some/data/file.bam
RewriteRule (.*)project%B2/(.*) $1project²/$2 [NE]
Note that project²
appears as part of the query string in the example URL you posted, however, the RewriteRule
pattern (which you are using above) matches against the %-decoded URL-path only (which excludes the query string). To match against the query string you need to use an additional RewriteCond
directive and match against the QUERY_STRING
(or THE_REQUEST
) server variable instead.
Note that the QUERY_STRING
(and THE_REQUEST
) server variable is %-encoded (or rather, as sent from the client) - they have not been %-decoded.
Try the following instead:
RewriteCond %{QUERY_STRING} (.+)/project%B2/(.*)
RewriteRule ^(load)$ $1?%1/project%C2%B2/%2 [NE,L]
The backreferences %1
and %2
in the substitution string refer to the preceding CondPattern - the parts before and after the troublesome /project%B2/
part.
$1
is simply a backreference to the URL-path (to save repetition), which I assume is always load
.
The NE
flag prevents the %
itself (when used as part of the URL-encoded characters) being URL encoded.
UPDATE: I'm afraid my original question was unclear about who GETs which URL, so the "query-string" part of your answer doesn't apply...
If you need to match the %-encoded URL-path then you should match against THE-REQUEST
server variable instead. THE_REQUEST
contains the first line of the HTTP request header and is not %-decoded. It contains the full URL-path (and query string) as sent from the client (as well as the request method and protocol version). For example, in the case of the malformed request, a string of the form:
GET /project%B2/some/data/file.bam HTTP/1.1
Which you could match and correct as follows:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,7}\s(/project)%B2([^\s]+)
RewriteRule ^/?project %1%B2%C2%2 [NE,L]
%1
and %2
are backreferences to the captured subpatterns in the preceding CondPattern.
The RewriteRule
pattern, on the other hand, matches against a pre-processed %-decoded URL-path only (as mentioned above). So, %B2
is whatever that decodes to; assuming a UTF-8 encoding. Unfortunately, this is a non-printable character so would need to be represented by the hex character sequence in the regex, ie. \xb2
(this is PCRE syntax representing a single byte sequence).
Solution 2:
Solution
RewriteRule
s must use \x
instead of %
in order to match %-encoded URLs! (PCRE syntax for byte sequences)
mod_rewrite
-config uses PCRE regex syntax, and operates on decoded URLs, so typing a %
-encoding in a RewriteRule
pattern causes it to look for the literal %
-character, not an encoded value.
The correct escape-character in RewriteRules is \x
, so the URLencoded value %B2
can be matched using \xb2
(or \xB2
, it's case-insensitive).
Note that RewriteRule
is a hacky solution for character encoding issues, that only works when there is exactly one specific wrong-encoded character is in a specific, predictable place.
For a general solution for multiple wrong-encoded characters in arbitrary places, please see Can Apache .htaccess convert the percent-encoding in encoded URIs from Win-1252 to UTF-8? , which suggests a general solution using RewriteMap
coupled to an external program in a full-featured programming language.
The proper solution is still to prevent this from the source, using explicit %-encoding throughout the entire chain. This avoids OS-dependent encoding accidentally happening 'somewhere in the middle', outside of your control. (assuming no client along the paths does double-encoding, which should be a punishable offense..)
How I got here
Getting desperate, I upped the server-wide logging using LogLevel Warn rewrite:trace3
as suggested in mod_rewrite docs. This is warned to (heavily) impact server performance, but was manageable because this is a low-traffic server, and there were no pre-existing rewrites.
The additional logging is emitted into (ssl_
)error_log
.
This gave me insight into how exactly the matching was attempted, and what the internal representations for rules and URIs are in mod_rewrite
.
excerpt from ssl_error_log
(many columns ommitted for brevity),
with rule RewriteRule (.*)project%B2/(.*) $1project²/$2 [NE,L]
[rewrite:trace3] applying pattern '(.*)project%B2/(.*)' to uri 'project\xb2/'
[rewrite:trace1] pass through /var/www/html/example.org/project\xb2
Note that the request-uri from client is written \xb2
, but my pattern uses %B2
.
Matching the rule-syntax to the uri-syntax, with rule RewriteRule (.*)project\xB2/(.*) $1project²/$2 [NE,L]
[rewrite:trace3] applying pattern '(.*)project\\xb2/(.*)' to uri 'project\xb2/'
[rewrite:trace2] rewrite 'project\xb2/' -> 'project%c2%b2/'
[rewrite:trace1] internal redirect with /auth-test/project\xc2\xb2/ [INTERNAL REDIRECT]
success! As we can see, we are now matching!
Why no [R]
/[R=302]
flag?
As this is a character-encoding issue, I do not think doing an extra HTTP-round-trip will add value; Every link fed into the client will run into the same issue again, unless I fix the encoding issue before feeding it into the client-side java-program.
Don't forget RewriteBase
Please note that this shortened version omits setting the correct RewriteBase
, which can screw up the rewritten path, depending on where in your conf
it is written (e.g. <Directory>
vs <Location>
). Without RewriteBase
I accidentally redirected to ❌https://example.org/var/www/html/rewrite-testing/project²
instead of ✅https://example.org/rewrite-testing/project²
)