Example 1: object of type set is not json serializable
JSON notation has only a handful of native datatypes (objects, arrays, strings, numbers, booleans, and null), so anything serialized in JSON needs to be expressed as one of these types.
As shown in the json module docs, this conversion can be done automatically by a JSONEncoder and JSONDecoder, but then you would be giving up some other structure you might need (if you convert sets to a list, then you lose the ability to recover regular lists; if you convert sets to a dictionary using dict.fromkeys(s) then you lose the ability to recover dictionaries).
A more sophisticated solution is to build-out a custom type that can coexist with other native JSON types. This lets you store nested structures that include lists, sets, dicts, decimals, datetime objects, etc.:
from json import dumps, loads, JSONEncoder, JSONDecoder
import pickle
class PythonObjectEncoder(JSONEncoder):
def default(self, obj):
if isinstance(obj, (list, dict, str, unicode, int, float, bool, type(None))):
return JSONEncoder.default(self, obj)
return {'_python_object': pickle.dumps(obj)}
def as_python_object(dct):
if '_python_object' in dct:
return pickle.loads(str(dct['_python_object']))
return dct
Here is a sample session showing that it can handle lists, dicts, and sets:
>>> data = [1,2,3, set(['knights', 'who', 'say', 'ni']), {'key':'value'}, Decimal('3.14')]
>>> j = dumps(data, cls=PythonObjectEncoder)
>>> loads(j, object_hook=as_python_object)
[1, 2, 3, set(['knights', 'say', 'who', 'ni']), {u'key': u'value'}, Decimal('3.14')]
Alternatively, it may be useful to use a more general purpose serialization technique such as YAML, Twisted Jelly, or Python's pickle module. These each support a much greater range of datatypes.
Example 2: object of type bytes is not json serializable
You are creating those bytes objects yourself:
item['title'] = [t.encode('utf-8') for t in title]
item['link'] = [l.encode('utf-8') for l in link]
item['desc'] = [d.encode('utf-8') for d in desc]
items.append(item)
Each of those t.encode(), l.encode() and d.encode() calls creates a bytes string. Do not do this, leave it to the JSON format to serialise these.
Next, you are making several other errors; you are encoding too much where there is no need to. Leave it to the json module and the standard file object returned by the open() call to handle encoding.
You also don't need to convert your items list to a dictionary; it'll already be an object that can be JSON encoded directly:
class W3SchoolPipeline(object):
def __init__(self):
self.file = open('w3school_data_utf8.json', 'w', encoding='utf-8')
def process_item(self, item, spider):
line = json.dumps(item) + '\n'
self.file.write(line)
return item
I'm guessing you followed a tutorial that assumed Python 2, you are using Python 3 instead. I strongly suggest you find a different tutorial; not only is it written for an outdated version of Python, if it is advocating line.decode('unicode_escape') it is teaching some extremely bad habits that'll lead to hard-to-track bugs. I can recommend you look at Think Python, 2nd edition for a good, free, book on learning Python 3.