How do I correctly decode unicode parameters passed to a servlet

You are nearly there. encodeURIComponent correctly encodes to UTF-8, which is what you should always use in a URL today.

The problem is that the submitted query string is getting mutilated on the way into your server-side script, because getParameter() decodes it as ISO-8859-1 instead of UTF-8. This stems from Ancient Times before the web settled on UTF-8 for URI/IRI, but it's rather pathetic that the Servlet spec hasn't been updated to match reality, or at least provide a reliable, supported option for it.

(There is request.setCharacterEncoding in Servlet 2.3, but it doesn't affect query string parsing, and if a single parameter has been read before, possibly by some other framework element, it won't work at all.)
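As a rough illustration of the mangling described above, the sketch below turns the first two percent-encoded characters of a query value into raw bytes and decodes those bytes once as UTF-8 and once as ISO-8859-1. The class and method names are made up for the example, and it only mimics the effect; it is not what any container literally runs internally.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    /** Percent-decodes to raw bytes. Handles only %XX escapes and plain ASCII, which is enough here. */
    static byte[] percentDecodeToBytes(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    /** What the browser meant: the bytes are UTF-8. */
    static String asUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** What an ISO-8859-1 decoder sees: one char per byte. */
    static String asLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = percentDecodeToBytes("%CF%84%CE%B7");
        System.out.println("bytes: " + raw.length);
        System.out.println("chars when read as UTF-8: " + asUtf8(raw).length());
        System.out.println("chars when read as ISO-8859-1: " + asLatin1(raw).length());
    }
}
```

Four bytes become two Greek letters under UTF-8 but four unrelated characters under ISO-8859-1, which is the garbage that shows up in the parameter value.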

So you need to futz around with container-specific methods to get proper UTF-8, often involving settings in server.xml. This totally sucks for distributing web apps that should work anywhere. For Tomcat see https://cwiki.apache.org/confluence/display/TOMCAT/Character+Encoding and also the related question "What's the difference between "URIEncoding" of Tomcat, Encoding Filter and request.setCharacterEncoding".

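I had the same problem and solved it by taking the raw request.getQueryString(), splitting it into parameters, and decoding each piece with URLDecoder.decode(). Split before decoding; otherwise an encoded %26 inside a value turns into a real & and is mistaken for a separator.

String[] parameters = request.getQueryString().split("&");
for (int i = 0; i < parameters.length; i++) {
    parameters[i] = URLDecoder.decode(parameters[i], "UTF-8");
}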
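A slightly fuller sketch of the same idea, in case you want name/value pairs rather than an array of strings. QueryStringParser is a hypothetical helper, not part of the Servlet API. It assumes the client encoded the query as UTF-8, keeps only the last value when a name is repeated, and inherits URLDecoder's form-style handling of + as a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryStringParser {
    /** Decodes one percent-encoded component as UTF-8. */
    static String decode(String component) {
        try {
            return URLDecoder.decode(component, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Splits the raw query on '&' and '=' first, then decodes each name and value. */
    static Map<String, String> parse(String rawQuery) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(name), decode(value)); // last value wins for repeated names
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = parse("name=%CF%84%CE%B7&q=a%26b%3Dc");
        System.out.println(params.keySet());
        System.out.println(params.get("q"));
    }
}
```

In a servlet you would call it as QueryStringParser.parse(request.getQueryString()), since getQueryString() returns the query still percent-encoded.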

There is a way to do it in Java (no fiddling with server.xml).

These do not work:

protected static final String CHARSET_FOR_URL_ENCODING = "UTF-8";

// 1. The container has already decoded the value as ISO-8859-1.
String uname = request.getParameter("name");
System.out.println(uname);
// ÏÎ·Î³ÏÏÏÏÎ·
// 2. The raw query string is still intact.
uname = request.getQueryString();
System.out.println(uname);
// name=%CF%84%CE%B7%CE%B3%CF%81%CF%84%CF%83%CF%82%CE%B7
// 3. Decoding the parameter again does not help: the percent escapes are already gone.
uname = URLDecoder.decode(request.getParameter("name"),
        CHARSET_FOR_URL_ENCODING);
System.out.println(uname);
// ÏÎ·Î³ÏÏÏÏÎ· // !!!!!!!!!!!!!!!!!!!!!!!!!!!
// 4. Decoding the raw query string itself does work.
uname = URLDecoder.decode(
        "name=%CF%84%CE%B7%CE%B3%CF%81%CF%84%CF%83%CF%82%CE%B7",
        CHARSET_FOR_URL_ENCODING);
System.out.println("query string decoded : " + uname);
// query string decoded : name=τηγρτσςη
// 5. getBytes() and new String(byte[]) both use the platform default charset,
//    so this round trip at best changes nothing.
uname = URLDecoder.decode(new String(request.getParameter("name")
        .getBytes()), CHARSET_FOR_URL_ENCODING);
System.out.println(uname);
// ÏÎ·Î³ÏÏÏÏÎ· // !!!!!!!!!!!!!!!!!!!!!!!!!!!

~~Works~~ (only when the platform default charset is UTF-8):

final String name = URLDecoder
        .decode(new String(request.getParameter("name").getBytes(
                "iso-8859-1")), CHARSET_FOR_URL_ENCODING);
System.out.println(name);
// τηγρτσςη

This worked, but it breaks if the platform default encoding is not UTF-8. Try this instead (the call to decode() is omitted because it is not needed):

final String name = new String(request.getParameter("name").getBytes("iso-8859-1"),
        CHARSET_FOR_URL_ENCODING);
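The round trip can be checked outside a container. The sketch below simulates the misreading (UTF-8 bytes decoded as ISO-8859-1) and then reverses it with the same getBytes/new String pair; the class and method names are hypothetical. It uses the StandardCharsets constants rather than charset names, which avoids both the checked UnsupportedEncodingException and any dependence on the default charset.

```java
import java.nio.charset.StandardCharsets;

class ReDecode {
    /** Simulates a container that reads UTF-8 bytes as ISO-8859-1. */
    static String simulateLatin1Misread(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Recovers the original bytes via ISO-8859-1, then decodes them as UTF-8. */
    static String fixLatin1Misread(String misread) {
        return new String(misread.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u03c4\u03b7\u03b3\u03c1\u03c4\u03c3\u03c2\u03b7"; // the Greek test value
        String misread = simulateLatin1Misread(original);
        System.out.println("misread equals original: " + misread.equals(original));
        System.out.println("fixed equals original: " + fixLatin1Misread(misread).equals(original));
    }
}
```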

However, if server.xml has been changed, as in:

<Connector connectionTimeout="20000" port="8080" protocol="HTTP/1.1"
                     redirectPort="8443"  URIEncoding="UTF-8"/>

(note the URIEncoding="UTF-8"), the code above will break, because getBytes("iso-8859-1") would then have to read getBytes("UTF-8"). So for a bulletproof solution you have to read the value of the URIEncoding attribute. Unfortunately this seems to be container specific and, even worse, container-version specific. For Tomcat 7 you'd need something like:

import javax.management.AttributeNotFoundException;
import javax.management.InstanceNotFoundException;
import javax.management.MBeanException;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.ReflectionException;

import javax.servlet.http.HttpServlet;

import org.apache.catalina.Server;
import org.apache.catalina.Service;
import org.apache.catalina.connector.Connector;

public class Controller extends HttpServlet {

    // ...
    static String CHARSET_FOR_URI_ENCODING; // the `URIEncoding` attribute
    static {
        // Takes the first MBean server registered in this JVM.
        MBeanServer mBeanServer = MBeanServerFactory.findMBeanServer(null).get(
            0);
        try {
            ObjectName name = new ObjectName("Catalina", "type", "Server");
            Server server = (Server) mBeanServer.getAttribute(name,
                "managedResource");
            for (Service service : server.findServices()) {
                for (Connector connector : service.findConnectors()) {
                    System.out.println(connector);
                    String uriEncoding = connector.getURIEncoding();
                    System.out.println("URIEncoding : " + uriEncoding);
                    boolean use = connector.getUseBodyEncodingForURI();
                    // TODO : if(use && connector.get uri enc...)
                    // With several connectors the last one visited wins.
                    CHARSET_FOR_URI_ENCODING = uriEncoding;
                    // ProtocolHandler protocolHandler = connector
                    // .getProtocolHandler();
                    // if (protocolHandler instanceof Http11Protocol
                    // || protocolHandler instanceof Http11AprProtocol
                    // || protocolHandler instanceof Http11NioProtocol) {
                    // int serverPort = connector.getPort();
                    // System.out.println("HTTP Port: " + connector.getPort());
                    // }
                }
            }
        } catch (MalformedObjectNameException | AttributeNotFoundException
                | InstanceNotFoundException | MBeanException
                | ReflectionException e) {
            // Fail fast instead of carrying on with a null server reference.
            throw new ExceptionInInitializerError(e);
        }
    }
}

Even then you need to tweak this for multiple connectors (check the commented-out parts). Then you would use something like:

new String(parameter.getBytes(CHARSET_FOR_URI_ENCODING), CHARSET_FOR_URL_ENCODING);

This may still fail (IIUC) if the value returned by request.getParameter("name") was corrupted when the container decoded it with CHARSET_FOR_URI_ENCODING, so that getBytes() no longer yields the original bytes (that's why "iso-8859-1" is used by default: it preserves every byte). You can get rid of all this by parsing the query string manually, along the lines of:

// Only valid when the query string holds a single name=value pair.
URLDecoder.decode(request.getQueryString().split("=")[1],
        CHARSET_FOR_URL_ENCODING);
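The byte-preservation point made above can be checked directly. The sketch below (hypothetical class name) decodes a byte array with a given charset, encodes it back, and reports whether the original bytes survived. The lone byte 0xE9 is used as an assumed example of input that is valid ISO-8859-1 but malformed UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripCheck {
    /** True if decoding raw with cs and encoding it back yields the same bytes. */
    static boolean survivesRoundTrip(byte[] raw, Charset cs) {
        return Arrays.equals(raw, new String(raw, cs).getBytes(cs));
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xE9}; // a complete character in ISO-8859-1, a truncated sequence in UTF-8
        System.out.println("ISO-8859-1 preserves bytes: "
                + survivesRoundTrip(raw, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 preserves bytes: "
                + survivesRoundTrip(raw, StandardCharsets.UTF_8));
    }
}
```

ISO-8859-1 maps every byte to exactly one character, so nothing is lost. A UTF-8 decoder replaces malformed input with U+FFFD, and after that the original bytes cannot be recovered with getBytes().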

I am still looking for the place in the docs where it is mentioned that request.getParameter("name") calls URLDecoder.decode() instead of returning the raw %CF%84%CE%B7%CE%B3%CF%81%CF%84%CF%83%CF%82%CE%B7 string. A link to the source would be much appreciated.
Also, how can I pass the literal string %CE, say, as a parameter's value? See the comment: parameter=%25CE
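The %25CE answer can be reproduced with java.net.URLEncoder: a literal percent sign is itself escaped as %25, so one round of decoding gives back %CE rather than trying to interpret it as a byte. The wrapper class below is hypothetical and exists only to hide the checked exception.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class PercentLiteral {
    /** Encodes a parameter value for use in a query string, as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Decodes a percent-encoded value, as UTF-8. */
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = encode("%CE");
        System.out.println(encoded);         // %25CE
        System.out.println(decode(encoded)); // %CE
    }
}
```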
