Reading a JSON file in PySpark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# You can configure the SparkContext before creating it
conf = SparkConf()
conf.set('spark.local.dir', '/remote/data/match/spark')
conf.set('spark.sql.shuffle.partitions', '2100')
# Driver/executor memory must be set before the JVM starts, hence setSystemProperty
SparkContext.setSystemProperty('spark.executor.memory', '10g')
SparkContext.setSystemProperty('spark.driver.memory', '10g')
sc = SparkContext(appName='mm_exp', conf=conf)
sqlContext = SQLContext(sc)
data = sqlContext.read.json('file.json')  # the path must be a quoted string
I feel that the original answer missed an important part of the read sequence: you have to initialize a SparkContext first.
When you start a SparkContext, it also spins up a web UI on port 4040, accessible at http://localhost:4040. That is a useful place to check the progress of all your computations.
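As a quick sanity check, you can confirm from the running context that the settings took effect. A minimal sketch (sc.uiWebUrl is available from Spark 2.1 onward):
# Inspect the live configuration of the SparkContext created above
print(sc.getConf().get('spark.sql.shuffle.partitions'))  # '2100'
print(sc.uiWebUrl)  # e.g. 'http://localhost:4040' (Spark 2.1+)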
First of all, the JSON is invalid: a comma is missing after the header object.
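You can confirm that with plain Python before involving Spark at all. A minimal sketch using only the standard library, with 'file.json' standing in for the questioner's file:
import json

with open('file.json') as f:
    json.load(f)  # an invalid file raises JSONDecodeError, pointing at the missing comma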
That being said, let's take this JSON:
{"header":{"platform":"atm","version":"2.0"},"details":[{"abc":"3","def":"4"},{"abc":"5","def":"6"},{"abc":"7","def":"8"}]}
This can be processed by (note that jsonFile was deprecated in favour of read.json, and in Spark 2.x you reach the RDD API through df.rdd):
>>> df = sqlContext.read.json('test.json')
>>> df.first()
Row(details=[Row(abc='3', def='4'), Row(abc='5', def='6'), Row(abc='7', def='8')], header=Row(platform='atm', version='2.0'))
>>> details = df.rdd.flatMap(lambda row: row['details'])
>>> details.collect()
[Row(abc='3', def='4'), Row(abc='5', def='6'), Row(abc='7', def='8')]
>>> details.map(lambda entry: (int(entry['abc']), int(entry['def']))).collect()
[(3, 4), (5, 6), (7, 8)]
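The same flattening can also be done entirely in the DataFrame API with explode, avoiding the round trip through Python RDDs. A sketch, assuming Spark 2.x and the same test.json:
from pyspark.sql.functions import explode, col

# explode() emits one output row per element of the details array
flat = df.select(explode(df.details).alias('d'))
flat.select(col('d.abc').cast('int').alias('abc'),
            col('d.def').cast('int').alias('def')).show()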
Hope this helps!
With a recent Spark version (2.0+), the SparkSession entry point replaces SQLContext, and reading the file is a one-liner:
df = spark.read.json('test.json')
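In a standalone script where no spark object exists yet, you create the session yourself. A minimal sketch (the app name is arbitrary):
from pyspark.sql import SparkSession

# SparkSession replaces the SparkContext + SQLContext pair from older versions
spark = SparkSession.builder.appName('mm_exp').getOrCreate()

# Pass multiLine=True (Spark 2.2+) if the JSON is pretty-printed across lines
df = spark.read.json('test.json')
df.printSchema()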