Parsing large JSON file in .NET
This might still be relevant to some now that the "new" System.Text.Json
is out.
await using FileStream file = File.OpenRead("files/data.json");
var options = new JsonSerializerOptions {
PropertyNamingPolicy = JsonNamingPolicy.CamelCase
};
// Switch the JsonNode type with one of your own if
// you have a specific type you want to deserialize to.
IAsyncEnumerable<JsonNode?> enumerable = JsonSerializer.DeserializeAsyncEnumerable<JsonNode>(file, options);
await foreach (JsonNode? obj in enumerable) {
var firstname = obj?["firstname"]?.GetValue<string>();
}
If you're interested in more, such as how to parse zipped JSON, there's this blog post that I wrote: Parsing 60GB Json Files using Streams in .NET.
As you've correctly diagnosed in your update, the issue is that the JSON has a closing ]
followed immediately by an opening [
to start the next set. This format makes the JSON invalid when taken as a whole, and that is why Json.NET throws an error.
Fortunately, this problem seems to come up often enough that Json.NET actually has a special setting to deal with it. If you use a JsonTextReader
directly to read the JSON, you can set the SupportMultipleContent
flag to true
, and then use a loop to deserialize each item individually.
This should allow you to process the non-standard JSON successfully and in a memory efficient manner, regardless of how many arrays there are or how many items in each array.
using (WebClient client = new WebClient())
using (Stream stream = client.OpenRead(stringUrl))
using (StreamReader streamReader = new StreamReader(stream))
using (JsonTextReader reader = new JsonTextReader(streamReader))
{
reader.SupportMultipleContent = true;
var serializer = new JsonSerializer();
while (reader.Read())
{
if (reader.TokenType == JsonToken.StartObject)
{
Contact c = serializer.Deserialize<Contact>(reader);
Console.WriteLine(c.FirstName + " " + c.LastName);
}
}
}
Full demo here: https://dotnetfiddle.net/2TQa8p
I have done a similar thing in Python for the file size of 5 GB. I downloaded the file in some temporary location and read it line by line to form an JSON object similar on how SAX works.
For C# using Json.NET, you can download the file, use a stream reader to read the file, and pass that stream to JsonTextReader and parse it to JObject using JTokens.ReadFrom(your JSonTextReader object)
.
Json.NET supports deserializing directly from a stream. Here is a way to deserialize your JSON using a StreamReader
reading the JSON string one piece at a time instead of having the entire JSON string loaded into memory.
using (WebClient client = new WebClient())
{
using (StreamReader sr = new StreamReader(client.OpenRead(stringUrl)))
{
using (JsonReader reader = new JsonTextReader(sr))
{
JsonSerializer serializer = new JsonSerializer();
// read the json from a stream
// json size doesn't matter because only a small piece is read at a time from the HTTP request
IList<Contact> result = serializer.Deserialize<List<Contact>>(reader);
}
}
}
Reference: JSON.NET Performance Tips