Which special characters are safe to use in a URL?
The following characters have special meaning in the path component of your URL (the path component is everything before the '?'):
";" | "/" | "?"
In addition to those, the following characters have special meaning in the query part of your URL (everything after '?'). Therefore, if they appear after the '?', you need to escape them:
":" | "@" | "&" | "=" | "+" | "$" | ","
For a more in-depth explanation, see the RFC.
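If a parameter value has to carry one of these reserved characters, percent-encoding it sidesteps the problem entirely. Here is a minimal sketch using Python's standard urllib.parse (the example value is made up):

    from urllib.parse import quote

    # Percent-encode a query value so the reserved characters listed
    # above ('&', '=', '+', '$', ',', ...) can't clash with the ones
    # structuring the query string. safe="" makes '/' get encoded too.
    value = "rate=5&tip=20%,+$"
    print(quote(value, safe=""))
    # rate%3D5%26tip%3D20%25%2C%2B%24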
The safe characters are a-z, A-Z, 0-9, and underscore and hyphen (_ and -), plus the reserved characters above when they are used for their intended purpose (e.g. separating parameters).
Other characters will cause problems to some degree. For example, if one parameter is an array, ?param=array[content], IE will show the URL with the square brackets percent-encoded, which looks ugly and is impossible to dictate.
But the problem is not only that it's ugly. Say you have a JPG whose name contains a character outside the safe set: many times the browser will be unable to download it and will get a 404 instead. This is a problem with older browsers and some mobile browsers.
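One way to avoid that class of 404s is to reduce file names to the safe set before publishing. A minimal sketch in Python; safe_filename is a hypothetical helper of mine, not part of any library:

    import re
    from pathlib import PurePosixPath

    def safe_filename(name: str) -> str:
        # Collapse every run of characters outside a-z, A-Z, 0-9,
        # underscore and hyphen into a single hyphen, keeping the
        # file extension so the browser still recognises the type.
        p = PurePosixPath(name)
        stem = re.sub(r"[^a-zA-Z0-9_-]+", "-", p.stem).strip("-")
        return f"{stem}{p.suffix}" if p.suffix else stem

    print(safe_filename("café menu (v2).jpg"))  # caf-menu-v2.jpg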
How to test this?
- Put a bunch of images/JS/CSS files whose names contain the characters you want to test on a public page with many visitors.
- Make the 404 page send you an email every time it gets a hit (a sketch follows below).
I have an inbox with 14000 emails proving my point.
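For what it's worth, the second step doesn't need much code. A rough sketch of how one could wire it up, assuming a Python/Flask app and a local SMTP relay (the addresses are placeholders):

    import smtplib
    from email.message import EmailMessage

    from flask import Flask, request

    app = Flask(__name__)

    @app.errorhandler(404)
    def notify_on_404(error):
        # Mail the missing path and the visitor's browser, so broken
        # characters can be traced back to a specific client.
        msg = EmailMessage()
        msg["Subject"] = f"404 hit: {request.path}"
        msg["From"] = "noreply@example.com"
        msg["To"] = "me@example.com"
        msg.set_content(f"User-Agent: {request.headers.get('User-Agent', '?')}")
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)
        return "Not Found", 404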
This question popped up first, of course, when I googled "URL safe characters", as it would for most people. I think it's worth putting up a straightforward answer to a concise question. From the horse's... ugh, RFC2396... I mean, Sir Timothy's mouth:
2.3. Unreserved Characters

Data characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include upper and lower case letters, decimal digits, and a limited set of punctuation marks and symbols.

    unreserved = alphanum | mark
    mark       = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

Unreserved characters can be escaped without changing the semantics of the URI, but this should not be done unless the URI is being used in a context that does not allow the unescaped character to appear.
"Upper and lower case letters" in this context are understood as defined earlier in the section 1.6 of the same standard:
The following definitions are common to many elements:

    alpha    = lowalpha | upalpha
    lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
               "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
               "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
    upalpha  = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
               "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
               "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
    digit    = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
    alphanum = alpha | digit
So the answer is, URL-safe characters are good old ASCII-7 Latin characters A through Z in lower and upper case, decimal digits 0 through 9, and a handful of non-alphanumerics explicitly enumerated in the mark production rule of the grammar in Sec. 2.3.
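The two production rules are small enough to transcribe and check mechanically. A quick sketch in Python (MARK, UNRESERVED and is_unreserved are my names, not a library API):

    import string

    # Transcription of the RFC2396 grammar quoted above.
    MARK = set("-_.!~*'()")
    ALPHANUM = set(string.ascii_letters + string.digits)
    UNRESERVED = ALPHANUM | MARK

    def is_unreserved(ch: str) -> bool:
        # True if ch may appear in a URI without percent-encoding,
        # per RFC2396 Sec. 2.3.
        return ch in UNRESERVED

    print([c for c in "Hello_world-1.0!(beta)" if not is_unreserved(c)])  # []
    print([c for c in "a b[c]" if not is_unreserved(c)])  # [' ', '[', ']']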
If the question is to be understood as being about HTTP/HTTPS URLs (note that RFC2396 defines the URI), the semantic treatment of the RFC2396 syntax as resource locators for the HTTP[S] protocol is currently standardised by RFC7230, Sec. 2.7. Nevertheless, it would not be future-proof to infer that the set of "URL-safe" characters is larger than the one defined by RFC2396 merely because the extra characters are not treated specially in RFC7230 Sec. 2.7: a possible future update to RFC7230 may ascribe semantics to more characters outside of the "URL-safe" RFC2396 set, rendering such an inference ex statu quo invalid.
TL;DR: the safest and most future-proof approach is to treat the set of URL-safe characters defined in RFC2396 as the largest possible and non-extensible, and not to extend it with characters that are currently okay/safe/non-special per RFC7230: that may change. The RFC2396 set, in contrast, cannot.