What are the safe characters for making URLs?
There are two sets of characters you need to watch out for: reserved and unsafe.
The reserved characters are:
- ampersand ("&")
- dollar ("$")
- plus sign ("+")
- comma (",")
- forward slash ("/")
- colon (":")
- semi-colon (";")
- equals ("=")
- question mark ("?")
- 'At' symbol ("@")
- pound ("#")
The characters generally considered unsafe are:
- space (" ")
- less than and greater than ("<>")
- open and close brackets ("[]")
- open and close braces ("{}")
- pipe ("|")
- backslash ("\")
- caret ("^")
- percent ("%")
I may have forgotten one or more, which leads me to echo Carl V's answer. In the long run you are probably better off using a whitelist of allowed characters and percent-encoding everything else, rather than trying to stay abreast of every character that servers and systems disallow.
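As a rough illustration of the whitelist-then-encode approach, here is a minimal Python sketch using the standard library's urllib.parse.quote. The function name encode_component is just something I made up for the example; passing safe="" assumes you are encoding a single component (one path segment or one query value), so even "/" gets encoded.

```python
from urllib.parse import quote


def encode_component(text: str) -> str:
    """Percent-encode everything except letters, digits, and - . _ ~

    quote() always leaves those characters alone; safe="" removes its
    default exemption for "/", so nothing else passes through unencoded.
    """
    return quote(text, safe="")


print(encode_component("a b&c/d"))  # a%20b%26c%2Fd
```

If you are encoding a whole path rather than one segment, you would probably want to keep "/" in the safe parameter instead.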
To quote section 2.3 of RFC 3986:
Characters that are allowed in a URI, but do not have a reserved purpose, are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
Note that RFC 3986 lists fewer unreserved punctuation marks than the older RFC 2396, which also treated ! * ' ( ) as unreserved.
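If you only want to check whether a string already sticks to the unreserved set, a regular expression is probably enough. This is a small sketch; the name is_unreserved_only is hypothetical, and the character class is just the RFC 3986 unreserved set written out by hand.

```python
import re

# RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED_RE = re.compile(r"[A-Za-z0-9\-._~]*")


def is_unreserved_only(text: str) -> bool:
    """Return True if every character in text is an unreserved character."""
    return UNRESERVED_RE.fullmatch(text) is not None


print(is_unreserved_only("my-file_v1.2~bak"))  # True
print(is_unreserved_only("a b"))               # False
```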
Always Safe
In theory and by the specification, these are safe basically anywhere, except the domain name. Percent-encode anything not listed, and you're good to go.
A-Z a-z 0-9 - . _ ~ ( ) ' ! * : @ , ;
Sometimes Safe
Only safe when used within specific URL components; use with care.
Paths: + & =
Queries: ? /
Fragments: ? / + & =
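One way to apply the per-component lists is to give urllib.parse.quote a different safe set for each component. The sketch below is only an illustration: the constant and function names are made up, and the safe sets are copied from the lists above rather than derived from the RFC grammar. The one deliberate difference is that "#" is left out of the fragment set, since the RFC 3986 fragment grammar does not include it.

```python
from urllib.parse import quote

# Punctuation from the "Always Safe" list that quote() would otherwise encode.
ALWAYS_SAFE = "()'!*:@,;"

PATH_SAFE = ALWAYS_SAFE + "+&="        # one path segment, so "/" is still encoded
QUERY_SAFE = ALWAYS_SAFE + "?/"        # "&" and "=" stay encoded inside a value
FRAGMENT_SAFE = ALWAYS_SAFE + "?/+&="  # "#" omitted on purpose (see note above)


def encode_path_segment(text: str) -> str:
    return quote(text, safe=PATH_SAFE)


def encode_query_value(text: str) -> str:
    return quote(text, safe=QUERY_SAFE)


def encode_fragment(text: str) -> str:
    return quote(text, safe=FRAGMENT_SAFE)


print(encode_path_segment("a+b=c d/e"))  # a+b=c%20d%2Fe
print(encode_query_value("a&b=c?d/e"))   # a%26b%3Dc?d/e
print(encode_fragment("sec 1?x=1&y"))    # sec%201?x=1&y
```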
Never Safe
According to the URI specification (RFC 3986), all other characters must be percent-encoded. This includes:
<space> <control-characters> <extended-ascii> <unicode>
% < > [ ] { } | \ ^
If maximum compatibility is a concern, limit the character set to A-Z a-z 0-9 - _ . (with periods only for filename extensions).
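For the maximum-compatibility case, a lossy "slug" is often simpler than percent-encoding. Here is a minimal sketch, assuming it is acceptable to replace every run of disallowed characters with a single hyphen; to_compatible_name is a hypothetical helper, and it does not enforce the "periods only for extensions" advice.

```python
import re


def to_compatible_name(text: str) -> str:
    """Keep only A-Z a-z 0-9 - _ . and collapse everything else to "-"."""
    cleaned = re.sub(r"[^A-Za-z0-9_.-]+", "-", text)
    # Drop hyphens left over at either end of the name.
    return cleaned.strip("-")


print(to_compatible_name("summer photos #1.jpg"))  # summer-photos-1.jpg
```

Unlike percent-encoding, this cannot be reversed, so it only suits names you generate yourself.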
Keep Context in Mind
Even if valid per the specification, a URL can still be "unsafe" depending on context. Examples include a file:/// URL containing invalid filename characters, or a query component containing "?", "=", and "&" when they are not used as delimiters. Correct handling of these cases is generally up to your scripts and can be worked around, but it's something to keep in mind.
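The query case is the one that bites most often. Here is a short Python sketch showing how urllib.parse.urlencode keeps "?", "=", and "&" inside a value from being mistaken for delimiters; the build_query wrapper and the parameter names are made up for the example.

```python
from urllib.parse import parse_qs, urlencode


def build_query(params: dict) -> str:
    """Encode key/value pairs so delimiter characters inside values are escaped."""
    return urlencode(params)


query = build_query({"q": "a=b&c?d", "page": "2"})
print(query)            # q=a%3Db%26c%3Fd&page=2
print(parse_qs(query))  # {'q': ['a=b&c?d'], 'page': ['2']}
```

Building the same string by hand with "q=" + value would have produced three apparent parameters instead of two.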