Convert JSON file to a CSV file using R
There are several issues with your text file. As you already noticed, you need to remove the lines of the form /* 0 */
. What results is still not valid json. If you want to have several json objects in a file, you need to store them in an array. The json objects are the parts that are closed in curly brakets, e.g.,
{
"_id" : "93ccbdb6-8947-4687-8e12-edf4e40d6650",
...
"totalRecords" : 0,
"status" : "SUCCESS"
}
The structure of an array of objects is as follows:
[
{
...
},
{
...
}
]
To get your file in shape, you need to add a comma between the objects and add the square brackets. You could do this as follows:
raw <- readLines("mydata.txt")
# get rid of the "/* 0 */" lines
json <- grep("^/\\* [0-9]* \\*/", raw, value = TRUE, invert = TRUE)
# add missing comma after }
n <- length(json)
json[-n] <- gsub("^}$", "},", json[-n])
# add brakets at the beginning and end
json <- c("[", json, "]")
This can be read by fromJSON()
, so I assume it is valid json:
library(jsonlite)
table <- fromJSON(json)
The table is nested, that is, some of the tables cells contain a data frame or a list themselves. For example,
table[1,2]
## travelDate travelDuration shopperDuration oneWay userId queryId
## 1 20151206 7 30 FALSE ATP1KKP 93ccbdb6-8947-4687-8e12-edf4e40d6650
## subRequests
## 1 WAS, LON, AA, , 1,2
You could use flatten()
from the jsonlite
package, to get a table with one level of nesting less
flatten(table)[1:3, c(1, 6, 12)]
## _id uiSearchRequest.travelDate uiSearchRequest.subRequests
## 1 93ccbdb6-8947-4687-8e12-edf4e40d6650 20151206 WAS, LON, AA, , 1,2
## 2 b736c374-b8ae-4e99-8073-9c54517fecd5 20151206 WAS, LON, AA, , 1,2
## 3 3312605f-8304-4ab8-96d6-6e1a03cfbd9e 20151206 LON, IAD, AA, , 1,2
The last column is still a list. There are many ways you could handle this. One possibility, is to create a row per subrequest, where the contents of all the other columns (X_id
, downloadCount
, etc.) are repeated. (This is almost the form that you give in your question, with the only difference that you left cells empty in the reapeated columns, while I repeat the contents.) This is how it can be done:
table <- flatten(fromJSON(json))
tab_list <- lapply(1:nrow(table),
function(i) data.frame(table[i, -12], table[i, 12],
stringsAsFactors = FALSE))
library(dplyr)
flat_table <- bind_rows(tab_list)
The second line creates a list of data frames. These are combined into a single data frame using bind_rows()
from dpylr
. (To be more precise, flat_table
will be a tbl_df
, but the difference to a data.frame
is small.) This can then be written to a csv file in the usual way:
write.csv(flat_table, file = "mydata.csv")