Who is to blame: parsing UTF8 encoded JSON HTTPResponse fails
I'm Riccardo, current developer of URLRead in WL and I have some experience working with encoding in WL.
I would like to inform you that this is not a bug.
In modern versions of Mathematica we have ByteArray, which is a representation of bytes. But for decades strings have been both bytes and "unicode" at the same time.
The problem here is that all Import functions are expecting bytes as input and all Export functions are producing bytes as Output.
Let's take your example, <|"a" -> "\[Dash]"|>, and produce JSON from it using ExportString.
In[9]:= ExportString[<|"a" -> "\[Dash]"|>, "RawJSON", "Compact" -> True]
Out[9]= "{\"a\":\"â\"}"
What you get out is a string, but the string has been encoded in UTF-8 and is now "unreadable". The output of ExportString is always a string that contains bytes in the range {0, 255}.
If you do the opposite operation, ImportString, you get back an association with the decoded string:
In[12]:= ImportString[ExportString[<|"a" -> "\[Dash]"|>, "RawJSON", "Compact" -> True], "RawJSON"]
Out[12]= <|"a" -> "\[Dash]"|>
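One way to check that the exported string really holds UTF-8 bytes rather than readable text is to look at its character codes. This is only a small sketch; the variable name json is an illustrative choice, not something from the original example.

```wolfram
(* Export the association; each character of the result stands for one UTF-8 byte *)
json = ExportString[<|"a" -> "\[Dash]"|>, "RawJSON", "Compact" -> True];

(* Check that every character code fits in a single byte *)
Max[ToCharacterCode[json]] <= 255

(* Reinterpret the same codes as UTF-8 to get the readable text back *)
FromCharacterCode[ToCharacterCode[json], "UTF-8"]
```

The last line is a manual decode: it treats the character codes as bytes and interprets them as UTF-8.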
Calling ImportString on something that has already been decoded won't work.
In fact, this is the "bug" you are experiencing:
In[14]:= ImportString["{\"a\":\"\[Dash]\"}", "RawJSON"]
During evaluation of In[14]:= General::jsonoutofrangeunicode: Out of range unicode code point encountered.
During evaluation of In[14]:= Import::jsoninvalidtoken: Invalid token found.
During evaluation of In[14]:= Import::jsonhintposition: An error occurred at line 1:8
Out[14]= $Failed
You are trying to import from a string that was already decoded. Unfortunately there is NO WAY to distinguish between a string that holds decoded text and a string that holds bytes, which is why the import fails.
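The two kinds of string can be compared side by side using only ToCharacterCode and FromCharacterCode. The names dash, bytes, and byteString below are illustrative, chosen for this sketch.

```wolfram
(* A decoded string: one character, code point 8211 *)
dash = FromCharacterCode[8211];

(* Its UTF-8 encoding as a list of bytes *)
bytes = ToCharacterCode[dash, "UTF-8"]       (* {226, 128, 147} *)

(* A "byte string": one character per byte, which is what ImportString expects *)
byteString = FromCharacterCode[bytes];

(* Both are ordinary strings; only their contents differ *)
StringLength /@ {dash, byteString}           (* {1, 3} *)

(* Decoding the byte string as UTF-8 gives the original character back *)
FromCharacterCode[ToCharacterCode[byteString], "UTF-8"] === dash   (* True *)
```

Both values have head String, so a function that receives one of them cannot tell from the type alone whether it holds bytes or text.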
Now, let's talk about URLRead. URLRead[..., "Body"] returns the decoded body of the response, which is what you would expect this method to do.
In[17]:= co = CloudDeploy@
Delayed[HTTPResponse[
ByteArray[
Join[ToCharacterCode["[\"", "UTF-8"],
ToCharacterCode[FromCharacterCode[8211], "UTF-8"],
ToCharacterCode["\"]", "UTF-8"]]], <|
"ContentType" -> "application/json; charset=utf-8"|>]];
In[18]:= URLRead[co, "Body"]
Out[18]= "[\"\[Dash]\"]"
Now, as I was explaining, the problem here is that you are importing a decoded string. Calling ImportString on the Body fails because Body is not a string representing bytes. What you should do is use bytes, not a decoded string:
In[20]:= ImportString[FromCharacterCode[URLRead[co, "BodyBytes"]], "JSON"]
Out[20]= {"\[Dash]"}
In 11.2 there are functions to import and export using a byte array instead of a string. Until then you need to use BodyBytes; starting from 11.2 you can import from a ByteArray directly.
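For 11.2 and later, a byte-array import might look like the sketch below. It assumes ImportByteArray is the function meant here and reuses the co object deployed above; treat it as an illustration rather than the definitive recommended call.

```wolfram
(* Assumes version 11.2 or later, where ImportByteArray is available *)
(* Wrap the raw response bytes in a ByteArray; no intermediate string is built *)
rawBody = ByteArray[URLRead[co, "BodyBytes"]];

(* Import the JSON directly from the bytes *)
ImportByteArray[rawBody, "RawJSON"]
```

Working from bytes avoids the ambiguity entirely, because a ByteArray can never be mistaken for decoded text.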
Another small note: charset=... is something you should specify only for text/* content types. Everything else (application, image) is not a text format and does not accept a charset.
For JSON the default encoding is UTF-8 (RFC 7159 also permits UTF-16 and UTF-32).
https://tools.ietf.org/html/rfc7159
Note: No "charset" parameter is defined for this registration. Adding one really has no effect on compliant recipients.
I hope this is helpful to you. Let me know if you have any other questions.
Riccardo Di Virgilio
Tracing the evaluation (Mathematica 11.1.1) shows that the string is passed to Developer`ReadRawJSONStream, which is what actually produces the messages:
Developer`ReadRawJSONStream[StringToStream@string, "IssueMessagesAs" -> Import]
General::jsonoutofrangeunicode: Out of range unicode code point encountered.
Import::jsoninvalidtoken: Invalid token found.
Import::jsonhintposition: An error occurred at line 1:4
$Failed
Since the first message says that "out of range unicode code point encountered", we can try to find out which Unicode code points are allowed:
Cases[Table[{n, Quiet@Developer`ReadRawJSONStream[
StringToStream["[\"" <> FromCharacterCode[n] <> "\"]"],
"IssueMessagesAs" -> Import]}, {n, 0, 50000}], {n_, Except[$Failed]} :> n] // MinMax
{1, 254}
Evidently even the single-byte range is not fully supported (code points 0 and 255 fail), and no character above 254 is accepted in the input stream.
But considering the workaround you already found, we can conclude that it is an encoding issue: for some reason the input must use a single-byte encoding:
ToString[body, OutputForm, CharacterEncoding -> "UTF8"] // FullForm
"[\"\[AHat]\.80\.93\"]"
We can achieve the same more simply and (probably) faster via ToCharacterCode/FromCharacterCode:
ImportString[FromCharacterCode@ToCharacterCode[body, "UTF-8"], "RawJSON"]
{"–"}
I still strongly suspect that the current behavior is a bug and recommend reporting it to tech support.
It is also worth noting that in version 8.0.4 ImportString[body, "JSON"] works, but in version 11.1.1 it fails, which supports the view that there is a bug in the current importer.
Alexey Popkov's answer is correct. So as to who is to blame, it's Mathematica. The standard requires applications to accept all Unicode characters in strings.
Quote from RFC 7159:
Section 7:
All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
Section 9:
A JSON parser MUST accept all texts that conform to the JSON grammar.