Import XML files to PostgreSQL
Extending @stefan-steiger's excellent answer, here is an example that extracts XML elements from child nodes that contain multiple siblings (e.g., multiple <synonym>
elements, for a particular <synomyms>
parent node).
I encountered this issue with my data and searched quite a bit for a solution; his answer was the most helpful, to me.
Example data file, hmdb_metabolites_test.xml
:
<?xml version="1.0" encoding="UTF-8"?>
<hmdb>
<metabolite>
<accession>HMDB0000001</accession>
<name>1-Methylhistidine</name>
<synonyms>
<synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid</synonym>
<synonym>1-Methylhistidine</synonym>
<synonym>Pi-methylhistidine</synonym>
<synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate</synonym>
</synonyms>
</metabolite>
<metabolite>
<accession>HMDB0000002</accession>
<name>1,3-Diaminopropane</name>
<synonyms>
<synonym>1,3-Propanediamine</synonym>
<synonym>1,3-Propylenediamine</synonym>
<synonym>Propane-1,3-diamine</synonym>
<synonym>1,3-diamino-N-Propane</synonym>
</synonyms>
</metabolite>
<metabolite>
<accession>HMDB0000005</accession>
<name>2-Ketobutyric acid</name>
<synonyms>
<synonym>2-Ketobutanoic acid</synonym>
<synonym>2-Oxobutyric acid</synonym>
<synonym>3-Methyl pyruvic acid</synonym>
<synonym>alpha-Ketobutyrate</synonym>
</synonyms>
</metabolite>
</hmdb>
Aside: the original XML file had a URL in the Document Element
<hmdb xmlns="http://www.hmdb.ca">
that prevented xpath
from parsing the data. It will run (without error messages), but the relation/table is empty:
[hmdb_test]# \i /mnt/Vancouver/Programming/data/hmdb/sql/hmdb_test.sql
DO
accession | name | synonym
-----------+------+---------
Since the source file is 3.4GB, I decided to edit that line using sed
:
sed -i '2s/.*hmdb xmlns.*/<hmdb>/' hmdb_metabolites.xml
[Adding the 2
(instructs sed
to edit "line 2") also -- coincidentally, in this instance -- doubling the sed
command execution speed.]
My postgres data folder (PSQL: SHOW data_directory;
) is
/mnt/Vancouver/Programming/RDB/postgres/postgres/data
so, as sudo
, I needed to copy my XML data file there and chown
it for use in PostgreSQL:
sudo chown postgres:postgres /mnt/Vancouver/Programming/RDB/postgres/postgres/data/hmdb_metabolites_test.xml
Script (hmdb_test.sql
):
DO $$DECLARE myxml xml;
BEGIN
myxml := XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('hmdb_metabolites_test.xml'), 'UTF8'));
DROP TABLE IF EXISTS mytable;
-- CREATE TEMP TABLE mytable AS
CREATE TABLE mytable AS
SELECT
(xpath('//accession/text()', x))[1]::text AS accession
,(xpath('//name/text()', x))[1]::text AS name
-- The "synonym" child/subnode has many sibling elements, so we need to
-- "unnest" them,otherwise we only retrieve the first synonym per record:
,unnest(xpath('//synonym/text()', x))::text AS synonym
FROM unnest(xpath('//metabolite', myxml)) x
;
END$$;
-- select * from mytable limit 5;
SELECT * FROM mytable;
Execution, output (in PSQL
):
[hmdb_test]# \i /mnt/Vancouver/Programming/data/hmdb/hmdb_test.sql
accession | name | synonym
-------------+--------------------+----------------------------------------------------------
HMDB0000001 | 1-Methylhistidine | (2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid
HMDB0000001 | 1-Methylhistidine | 1-Methylhistidine
HMDB0000001 | 1-Methylhistidine | Pi-methylhistidine
HMDB0000001 | 1-Methylhistidine | (2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate
HMDB0000002 | 1,3-Diaminopropane | 1,3-Propanediamine
HMDB0000002 | 1,3-Diaminopropane | 1,3-Propylenediamine
HMDB0000002 | 1,3-Diaminopropane | Propane-1,3-diamine
HMDB0000002 | 1,3-Diaminopropane | 1,3-diamino-N-Propane
HMDB0000005 | 2-Ketobutyric acid | 2-Ketobutanoic acid
HMDB0000005 | 2-Ketobutyric acid | 2-Oxobutyric acid
HMDB0000005 | 2-Ketobutyric acid | 3-Methyl pyruvic acid
HMDB0000005 | 2-Ketobutyric acid | alpha-Ketobutyrate
[hmdb_test]#
I would try a different approach: read the XML file directly into variable inside a plpgsql function and proceed from there. Should be a lot faster and a lot more robust.
CREATE OR REPLACE FUNCTION f_sync_from_xml()
RETURNS boolean AS
$BODY$
DECLARE
myxml xml;
datafile text := 'path/to/my_file.xml';
BEGIN
myxml := pg_read_file(datafile, 0, 100000000); -- arbitrary 100 MB max.
CREATE TEMP TABLE tmp AS
SELECT (xpath('//some_id/text()', x))[1]::text AS id
FROM unnest(xpath('/xml/path/to/datum', myxml)) x;
...
You need superuser privileges, and file must be local to the DB server, in an accessible directory.
Complete code example with more explanation and links:
- XML data to PostgreSQL database
Necromancing: For those that need a working example:
DO $$
DECLARE myxml xml;
BEGIN
myxml := XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('MyData.xml'), 'UTF8'));
DROP TABLE IF EXISTS mytable;
CREATE TEMP TABLE mytable AS
SELECT
(xpath('//ID/text()', x))[1]::text AS id
,(xpath('//Name/text()', x))[1]::text AS Name
,(xpath('//RFC/text()', x))[1]::text AS RFC
,(xpath('//Text/text()', x))[1]::text AS Text
,(xpath('//Desc/text()', x))[1]::text AS Desc
FROM unnest(xpath('//record', myxml)) x
;
END$$;
SELECT * FROM mytable;
Or with less noise
SELECT
(xpath('//ID/text()', myTempTable.myXmlColumn))[1]::text AS id
,(xpath('//Name/text()', myTempTable.myXmlColumn))[1]::text AS Name
,(xpath('//RFC/text()', myTempTable.myXmlColumn))[1]::text AS RFC
,(xpath('//Text/text()', myTempTable.myXmlColumn))[1]::text AS Text
,(xpath('//Desc/text()', myTempTable.myXmlColumn))[1]::text AS Desc
,myTempTable.myXmlColumn as myXmlElement
FROM unnest(
xpath
( '//record'
,XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('MyData.xml'), 'UTF8'))
)
) AS myTempTable(myXmlColumn)
;
With this example XML file (MyData.xml):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<data-set>
<record>
<ID>1</ID>
<Name>A</Name>
<RFC>RFC 1035[1]</RFC>
<Text>Address record</Text>
<Desc>Returns a 32-bit IPv4 address, most commonly used to map hostnames to an IP address of the host, but it is also used for DNSBLs, storing subnet masks in RFC 1101, etc.</Desc>
</record>
<record>
<ID>2</ID>
<Name>NS</Name>
<RFC>RFC 1035[1]</RFC>
<Text>Name server record</Text>
<Desc>Delegates a DNS zone to use the given authoritative name servers</Desc>
</record>
</data-set>
Note:
MyData.xml needs to be in the PG_Data directory (the parent-directory of the pg_stat directory).
e.g. /var/lib/postgresql/9.3/main/MyData.xml
This requires PostGreSQL 9.1+
Overall, you can achive it fileless, like this:
SELECT
(xpath('//ID/text()', myTempTable.myXmlColumn))[1]::text AS id
,(xpath('//Name/text()', myTempTable.myXmlColumn))[1]::text AS Name
,(xpath('//RFC/text()', myTempTable.myXmlColumn))[1]::text AS RFC
,(xpath('//Text/text()', myTempTable.myXmlColumn))[1]::text AS Text
,(xpath('//Desc/text()', myTempTable.myXmlColumn))[1]::text AS Desc
,myTempTable.myXmlColumn as myXmlElement
-- Source: https://en.wikipedia.org/wiki/List_of_DNS_record_types
FROM unnest(xpath('//record',
CAST('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<data-set>
<record>
<ID>1</ID>
<Name>A</Name>
<RFC>RFC 1035[1]</RFC>
<Text>Address record</Text>
<Desc>Returns a 32-bit IPv4 address, most commonly used to map hostnames to an IP address of the host, but it is also used for DNSBLs, storing subnet masks in RFC 1101, etc.</Desc>
</record>
<record>
<ID>2</ID>
<Name>NS</Name>
<RFC>RFC 1035[1]</RFC>
<Text>Name server record</Text>
<Desc>Delegates a DNS zone to use the given authoritative name servers</Desc>
</record>
</data-set>
' AS xml)
)) AS myTempTable(myXmlColumn)
;
Note that unlike in MS-SQL, xpath text() returns NULL on a NULL value, and not an empty string.
If for whatever reason you need to explicitly check for the existence of NULL, you can use [not(@xsi:nil="true")]
, to which you need to pass an array of namespaces, because otherwise, you get an error (however, you can omit all namespaces but xsi).
SELECT
(xpath('//xmlEncodeTest[1]/text()', myTempTable.myXmlColumn))[1]::text AS c1
,(
xpath('//xmlEncodeTest[1][not(@xsi:nil="true")]/text()', myTempTable.myXmlColumn
,
ARRAY[
-- ARRAY['xmlns','http://www.w3.org/1999/xhtml'], -- defaultns
ARRAY['xsi','http://www.w3.org/2001/XMLSchema-instance'],
ARRAY['xsd','http://www.w3.org/2001/XMLSchema'],
ARRAY['svg','http://www.w3.org/2000/svg'],
ARRAY['xsl','http://www.w3.org/1999/XSL/Transform']
]
)
)[1]::text AS c22
,(xpath('//nixda[1]/text()', myTempTable.myXmlColumn))[1]::text AS c2
--,myTempTable.myXmlColumn as myXmlElement
,xmlexists('//xmlEncodeTest[1]' PASSING BY REF myTempTable.myXmlColumn) AS c1e
,xmlexists('//nixda[1]' PASSING BY REF myTempTable.myXmlColumn) AS c2e
,xmlexists('//xmlEncodeTestAbc[1]' PASSING BY REF myTempTable.myXmlColumn) AS c1ea
FROM unnest(xpath('//row',
CAST('<?xml version="1.0" encoding="utf-8"?>
<table xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<row>
<xmlEncodeTest xsi:nil="true" />
<nixda>noob</nixda>
</row>
</table>
' AS xml)
)
) AS myTempTable(myXmlColumn)
;
You can also check if a field is contained in an XML-text, by doing
,xmlexists('//xmlEncodeTest[1]' PASSING BY REF myTempTable.myXmlColumn) AS c1e
for example when you pass an XML-value to a stored-procedure/function for CRUD. (see above)
Also, note that the correct way to pass a null-value in XML is <elementName xsi:nil="true" />
and not <elementName />
or nothing. There is no correct way to pass NULL in attributes (you can only omit the attribute, but then it gets difficult/slow to infer the number of columns and their names in a large dataset).
e.g.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<table>
<row column1="a" column2="3" />
<row column1="b" column2="4" column3="true" />
</table>
(is more compact, but very bad if you need to import it, especially if from XML-files with multiple GB of data - see a wonderful example of that in the stackoverflow data dump)
SELECT
myTempTable.myXmlColumn
,(xpath('//@column1', myTempTable.myXmlColumn))[1]::text AS c1
,(xpath('//@column2', myTempTable.myXmlColumn))[1]::text AS c2
,(xpath('//@column3', myTempTable.myXmlColumn))[1]::text AS c3
,xmlexists('//@column3' PASSING BY REF myTempTable.myXmlColumn) AS c3e
,case when (xpath('//@column3', myTempTable.myXmlColumn))[1]::text is null then 1 else 0 end AS is_null
FROM unnest(xpath('//row', '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<table>
<row column1="a" column2="3" />
<row column1="b" column2="4" column3="true" />
</table>'
)) AS myTempTable(myXmlColumn)