Converting large XML file to relational database
I had task with xml files as you wrote. This are principals I used:
- All incoming files I stored as is in DB (XMLTYPE), because I need a source file info;
- All incoming files parsed with XSL transformation. For example, I see that it is three entity here: fileInfo, fileCases, fileClassification. You can write XSL transformation to compile source file info in 3 entity types (in tags FileInfo, FileCases, FileClassification);
- When you have output transformed XML you can make 3 procedures, that inserts data into DB (each entity in DB area).
All steps could certainly be accomplished using node.js
. There are modules available that will help you with each of these tasks:
- node-cron: lets you easily set up cron tasks in your node program. Another option would be to set up a cron task on your operating system (lots of resources available for your favourite OS).
- download: module to easily download files from a URL.
xml-stream: allows you to stream a file and register events that fire when the parser encounters certain XML elements. I have successfully used this module to parse KML files (granted they were significantly smaller than your files).
node-postgres: node client for PostgreSQL (I am sure there are clients for many other common RDBMS, PG is the only one I have used so far).
Most of these modules have pretty great examples that will get you started. Here's how you would probably set up the XML streaming part:
var XmlStream = require('xml-stream');
var xml = fs.createReadStream('path/to/file/on/disk'); // or stream directly from your online source
var xmlStream = new XmlStream(xml);
xmlStream.on('endElement case-file', function(element) {
// create and execute SQL query/queries here for this element
});
xmlStream.on('end', function() {
// done reading elements
// do further processing / query database, etc.
});
Are you sure you need to put the data in a relational database, or do you just want to search it in general?
There don't seem to be any actual relations in the data, so it might be simpler to put it in a document search index such as ElasticSearch.
Any automatic XML to JSON converter would probably produce suitable output. The large file size is an issue. This library, despite its summary saying "not streaming", is actually streaming if you inspect the source code, so it would work for you.