Facebook JSON badly encoded

I would like to extend @Geekmoss' answer with the following recursive code snippet, which I used to decode my Facebook data.

import json

def parse_obj(obj):
    # Strings in the export are mojibake: re-encoding as Latin-1
    # recovers the original UTF-8 bytes, which then decode correctly.
    if isinstance(obj, str):
        return obj.encode('latin_1').decode('utf-8')

    # Recurse into lists and dicts so nested structures are fixed too.
    if isinstance(obj, list):
        return [parse_obj(o) for o in obj]

    if isinstance(obj, dict):
        return {key: parse_obj(item) for key, item in obj.items()}

    return obj

with open('message_1.json') as f:  # the downloaded Facebook JSON file
    decoded_data = parse_obj(json.load(f))

I noticed this works better, because the Facebook data you download might contain lists of dicts, in which case those dicts would simply be returned 'as is' by the identity lambda used in that answer.
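A quick check on a made-up sample (the name is taken from the answer further down this page):

sample = json.loads(r'{"sender_name": "Rados\u00c5\u0082aw", "messages": [{"content": "ok"}]}')
print(parse_obj(sample))
# {'sender_name': 'Radosław', 'messages': [{'content': 'ok'}]}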


Here is a command-line solution with jq and iconv, tested on Linux. It works because jq decodes the \u00HH escapes into the code points U+0000-U+00FF and prints them as UTF-8, after which iconv converts each of those code points back into a single Latin-1 byte, restoring the original UTF-8 byte sequence.

cat message_1.json | jq . | iconv -f utf8 -t latin1 > m1.json
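
For comparison, a rough Python equivalent of that pipeline could look like this (a sketch under the same assumptions; the file names are the ones from the command above):

import json

# json.load() turns the \u00HH escapes into code points U+0000-U+00FF,
# just like `jq .` does.
with open('message_1.json', encoding='utf-8') as f:
    obj = json.load(f)

# ensure_ascii=False keeps those code points literal; encoding them as
# Latin-1 then restores the original UTF-8 bytes, like `iconv` does.
# (This assumes every string in the file is mojibake-encoded.)
with open('m1.json', 'wb') as f:
    f.write(json.dumps(obj, ensure_ascii=False).encode('latin-1'))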


I can indeed confirm that the Facebook download data is incorrectly encoded; a Mojibake. The original data is UTF-8 encoded but was decoded as Latin-1 instead. I’ll make sure to file a bug report.

What this means is that any non-ASCII character in the string data was encoded twice: first to UTF-8, and then the UTF-8 bytes were encoded again by interpreting each byte as a Latin-1 character (Latin-1 maps exactly 256 characters to the 256 possible byte values) and writing it out in the \uHHHH JSON escape notation (a literal backslash, a literal lowercase letter u, followed by 4 hex digits, 0-9 and a-f). Because the second step only encoded byte values in the range 0-255, the result is a series of \u00HH sequences (a literal backslash, a literal lowercase letter u, two zero digits and two hex digits).

E.g. the Unicode character U+0142 LATIN SMALL LETTER L WITH STROKE in the name Radosław was encoded to the UTF-8 byte values C5 and 82 (in hex notation), and then encoded again to \u00c5\u0082.
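
You can reproduce that second step in a Python session (just an illustration of the round-trip, presumably the equivalent of what the export pipeline did):

>>> import json
>>> 'ł'.encode('utf-8')
b'\xc5\x82'
>>> json.dumps('ł'.encode('utf-8').decode('latin-1'))
'"\\u00c5\\u0082"'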

You can repair the damage in two ways:

  1. Decode the data as JSON, then re-encode any string values as Latin-1 binary data, and then decode again as UTF-8:

     >>> import json
     >>> data = r'"Rados\u00c5\u0082aw"'
     >>> json.loads(data).encode('latin1').decode('utf8')
     'Radosław'
    

    This would require a full traversal of your data structure to find all those strings, of course; a sketch of such a traversal follows below.
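
    A minimal sketch of that traversal (essentially the parse_obj snippet from the top of this page, extended to fix dictionary keys as well; it reuses the data value from the REPL example above):

     def fix_strings(obj):
         # Re-encode every string as Latin-1 bytes, then decode as UTF-8
         if isinstance(obj, str):
             return obj.encode('latin-1').decode('utf-8')
         if isinstance(obj, list):
             return [fix_strings(item) for item in obj]
         if isinstance(obj, dict):
             return {fix_strings(k): fix_strings(v) for k, v in obj.items()}
         return obj

     repaired = fix_strings(json.loads(data))  # -> 'Radosław'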

  2. Load the whole JSON document as binary data, replace all \u00hh JSON sequences with the byte the last two hex digits represent, then decode as JSON:

     import json
     import os
     import re
     from functools import partial

     # Replace each \u00HH escape with the single byte its hex digits encode
     fix_mojibake_escapes = partial(
         re.compile(rb'\\u00([\da-f]{2})').sub,
         lambda m: bytes.fromhex(m[1].decode()),
     )

     # 'subdir' and 'file' locate the downloaded JSON file on disk
     with open(os.path.join(subdir, file), 'rb') as binary_data:
         repaired = fix_mojibake_escapes(binary_data.read())
     data = json.loads(repaired)
    

    (If you are using Python 3.5 or older, you'll have to decode the repaired bytes object from UTF-8, so use json.loads(repaired.decode())).

    From your sample data this produces:

     {'content': 'No to trzeba ostatnie treningi zrobić xD',
      'sender_name': 'Radosław',
      'timestamp': 1524558089,
      'type': 'Generic'}
    

    The regular expression matches all \u00HH sequences in the binary data and replaces them with the bytes they represent, so that the data can be decoded correctly as UTF-8. That second decoding step is taken care of by the json.loads() function when it is given binary data.
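
    For instance, a quick sanity check with the name from the sample data should give:

     >>> fix_mojibake_escapes(rb'"Rados\u00c5\u0082aw"')
     b'"Rados\xc5\x82aw"'
     >>> json.loads(fix_mojibake_escapes(rb'"Rados\u00c5\u0082aw"'))
     'Radosław'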