"ERROR: extra data after last expected column" when using PostgreSQL COPY
An empty table won't do. You need table that matches the structure of input data. Something like:
CREATE TABLE raw_data (
col1 int
, col2 int
...
);
You don't need to declare tab
as DELIMITER
since that's the default:
COPY raw_data FROM '/home/Projects/TestData/raw_data.txt';
800 columns you say? That many columns would typically indicate a problem with your design. Anyway, there are ways to half-automate the CREATE TABLE
script.
Automation
Assuming simplified raw data
1 2 3 4 -- first row contains "column names"
1 1 0 1 -- tab separated
1 0 0 1
1 0 1 1
Define a different DELIMITER
(one that does not occur in the import data at all), and import to a temporary staging table with a single text
column:
CREATE TEMP TABLE tmp_data (raw text);
COPY tmp_data FROM '/home/Projects/TestData/raw_data.txt' WITH (DELIMITER '§');
This query creates the CREATE TABLE
script:
SELECT 'CREATE TABLE tbl (col' || replace (raw, E'\t', ' bool, col') || ' bool)'
FROM (SELECT raw FROM tmp_data LIMIT 1) t;
A more generic & safer query:
SELECT 'CREATE TABLE tbl('
|| string_agg(quote_ident('col' || col), ' bool, ' ORDER BY ord)
|| ' bool);'
FROM (SELECT raw FROM tmp_data LIMIT 1) t
, unnest(string_to_array(t.raw, E'\t')) WITH ORDINALITY c(col, ord);
Returns:
CREATE TABLE tbl (col1 bool, col2 bool, col3 bool, col4 bool);
Execute after verifying validity - or execute dynamically if you trust the result:
DO
$$BEGIN
EXECUTE (
SELECT 'CREATE TABLE tbl (col' || replace(raw, ' ', ' bool, col') || ' bool)'
FROM (SELECT raw FROM tmp_data LIMIT 1) t
);
END$$;
Then INSERT
the data with this query:
INSERT INTO tbl
SELECT (('(' || replace(replace(replace(
raw
, '1', 't')
, '0', 'f')
, E'\t', ',')
|| ')')::tbl).*
FROM (SELECT raw FROM tmp_data OFFSET 1) t;
Or simpler with translate()
:
INSERT INTO tbl
SELECT (('(' || translate(raw, E'10\t', 'tf,') || ')')::tbl).*
FROM (SELECT raw FROM tmp_data OFFSET 1) t;
The string is converted into a row literal, cast to the newly created table row type and decomposed with (row).*
.
All done.
You could put all of that into a plpgsql function, but you'd need to safeguard against SQL injection. (There are a number of related solutions here on SO. Try a search.
db<>fiddle here
Old SQL Fiddle