How do I correctly decode unicode parameters passed to a servlet
You are nearly there. EncodeURIComponent correctly encodes to UTF-8, which is what you should always use in a URL today.
The problem is that the submitted query string is getting mutilated on the way into your server-side script, because getParameter() uses ISO-8559-1 instead of UTF-8. This stems from Ancient Times before the web settled on UTF-8 for URI/IRI, but it's rather pathetic that the Servlet spec hasn't been updated to match reality, or at least provide a reliable, supported option for it.
(There is request.setCharacterEncoding in Servlet 2.3, but it doesn't affect query string parsing, and if a single parameter has been read before, possibly by some other framework element, it won't work at all.)
So you need to futz around with container-specific methods to get proper UTF-8, often involving stuff in server.xml. This totally sucks for distributing web apps that should work anywhere. For Tomcat see https://cwiki.apache.org/confluence/display/TOMCAT/Character+Encoding and also What's the difference between "URIEncoding" of Tomcat, Encoding Filter and request.setCharacterEncoding.
I got the same problem and solved it by decoding Request.getQueryString()
using URLDecoder(), and after extracting my parameters.
String[] Parameters = URLDecoder.decode(Request.getQueryString(), 'UTF-8')
.splitat('&');
There is way to do it in java (no fiddling with server.xml
)
Do not work :
protected static final String CHARSET_FOR_URL_ENCODING = "UTF-8";
String uname = request.getParameter("name");
System.out.println(uname);
// Ã÷óÃÃÃÃ÷
uname = request.getQueryString();
System.out.println(uname);
// name=%CF%84%CE%B7%CE%B3%CF%81%CF%84%CF%83%CF%82%CE%B7
uname = URLDecoder.decode(request.getParameter("name"),
CHARSET_FOR_URL_ENCODING);
System.out.println(uname);
// Ã÷óÃÃÃÃ÷ // !!!!!!!!!!!!!!!!!!!!!!!!!!!
uname = URLDecoder.decode(
"name=%CF%84%CE%B7%CE%B3%CF%81%CF%84%CF%83%CF%82%CE%B7",
CHARSET_FOR_URL_ENCODING);
System.out.println("query string decoded : " + uname);
// query string decoded : name=ÏηγÏÏÏÏη
uname = URLDecoder.decode(new String(request.getParameter("name")
.getBytes()), CHARSET_FOR_URL_ENCODING);
System.out.println(uname);
// Ã÷óÃÃÃÃ÷ // !!!!!!!!!!!!!!!!!!!!!!!!!!!
Works :
final String name = URLDecoder
.decode(new String(request.getParameter("name").getBytes(
"iso-8859-1")), CHARSET_FOR_URL_ENCODING);
System.out.println(name);
// ÏηγÏÏÏÏη
Worked but will break if default encoding != utf-8 - try this instead (omit the call to decode() it's not needed):
final String name = new String(request.getParameter("name").getBytes("iso-8859-1"),
CHARSET_FOR_URL_ENCODING);
As I said above if the server.xml
is messed with as in :
<Connector connectionTimeout="20000" port="8080" protocol="HTTP/1.1"
redirectPort="8443" URIEncoding="UTF-8"/>
(notice the URIEncoding="UTF-8"
) the code above will break (cause the getBytes("iso-8859-1")
should read getBytes("UTF-8")
). So for a bullet proof solution you have to get the value of the URIEncoding
attribute. This unfortunately seems to be container specific - even worse container version specific. For tomcat 7 you'd need something like :
import javax.management.AttributeNotFoundException;
import javax.management.InstanceNotFoundException;
import javax.management.MBeanException;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.ReflectionException;
import org.apache.catalina.Server;
import org.apache.catalina.Service;
import org.apache.catalina.connector.Connector;
public class Controller extends HttpServlet {
// ...
static String CHARSET_FOR_URI_ENCODING; // the `URIEncoding` attribute
static {
MBeanServer mBeanServer = MBeanServerFactory.findMBeanServer(null).get(
0);
ObjectName name = null;
try {
name = new ObjectName("Catalina", "type", "Server");
} catch (MalformedObjectNameException e1) {
e1.printStackTrace();
}
Server server = null;
try {
server = (Server) mBeanServer.getAttribute(name, "managedResource");
} catch (AttributeNotFoundException | InstanceNotFoundException
| MBeanException | ReflectionException e) {
e.printStackTrace();
}
Service[] services = server.findServices();
for (Service service : services) {
for (Connector connector : service.findConnectors()) {
System.out.println(connector);
String uriEncoding = connector.getURIEncoding();
System.out.println("URIEncoding : " + uriEncoding);
boolean use = connector.getUseBodyEncodingForURI();
// TODO : if(use && connector.get uri enc...)
CHARSET_FOR_URI_ENCODING = uriEncoding;
// ProtocolHandler protocolHandler = connector
// .getProtocolHandler();
// if (protocolHandler instanceof Http11Protocol
// || protocolHandler instanceof Http11AprProtocol
// || protocolHandler instanceof Http11NioProtocol) {
// int serverPort = connector.getPort();
// System.out.println("HTTP Port: " + connector.getPort());
// }
}
}
}
}
And still you need to tweak this for multiple connectors (check the commented out parts). Then you would use something like :
new String(parameter.getBytes(CHARSET_FOR_URI_ENCODING), CHARSET_FOR_URL_ENCODING);
Still this may fail (IIUC) if parameter = request.getParameter("name");
decoded with CHARSET_FOR_URI_ENCODING was corrupted so the bytes I get with getBytes() were not the original ones (that's why "iso-8859-1" is used by default - it will preserve the bytes). You can get rid of it all by manually parsing the query string in the lines of:
URLDecoder.decode(request.getQueryString().split("=")[1],
CHARSET_FOR_URL_ENCODING);
I am still looking for the place in the docs where it is mentioned that request.getParameter("name")
does call URLDecoder.decode()
instead of returning the %CF%84%CE%B7%CE%B3%CF%81%CF%84%CF%83%CF%82%CE%B7
string ? A link in the source would be much appreciated.Also how can I pass as the parameter's value the string, say, => see comment : %CE
?parameter=%25CE