Encoding problems with ogr2ogr and a PostGIS/PostgreSQL database
Magnus is right and I will discuss the solution here.
I have seen the option for informing PostgreSQL about the character encoding, options='-c client_encoding=xxx', used in many places, but it does not seem to have any effect. If someone knows how this part works, feel free to elaborate.
Magnus suggested setting the environment variable PGCLIENTENCODING to LATIN1. According to a mailing list I queried, this can be done by modifying the call to ogr2ogr:
ogr2ogr --config PGCLIENTENCODING LATIN1 -f PostgreSQL PG:"host=hostname user=username dbname=databasename password=password" inputfile
This didn't do anything for me. What worked was to run the following before calling ogr2ogr:
SET PGCLIENTENCODING=LATIN1
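For scripted imports, the same idea (put PGCLIENTENCODING into the environment before ogr2ogr starts) could be sketched in Python roughly as follows. This is a minimal sketch rather than a tested recipe: the helper names are made up for illustration, the connection string and file name are placeholders, and it assumes ogr2ogr is on the PATH.

```python
import os
import subprocess


def build_ogr2ogr_call(input_file, pg_connection, client_encoding="LATIN1"):
    """Return (command, environment) for an ogr2ogr import into PostgreSQL.

    Hypothetical helper: it only prepares the call, so the result can be
    inspected before anything is run.
    """
    # Copy the current environment and add the variable that the
    # PostgreSQL client library reads to choose the client encoding.
    env = dict(os.environ)
    env["PGCLIENTENCODING"] = client_encoding
    command = ["ogr2ogr", "-f", "PostgreSQL", "PG:" + pg_connection, input_file]
    return command, env


def run_import(input_file, pg_connection, client_encoding="LATIN1"):
    """Run ogr2ogr with the client encoding set only for this one process."""
    command, env = build_ogr2ogr_call(input_file, pg_connection, client_encoding)
    subprocess.run(command, env=env, check=True)
```

Calling run_import("inputfile", "host=hostname user=username dbname=databasename password=password") would then start ogr2ogr with the variable set, without changing the environment of the surrounding shell.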
It would be great to hear more details from experienced users, and I hope this can help others :)
That does sound like it would set the client encoding to LATIN1. Exactly what error do you get?
Just in case ogr2ogr doesn't pass it along properly, you can also try setting the environment variable PGCLIENTENCODING to latin1.
I suggest you double-check that the input files are actually LATIN1. Simply running the file command on one will give you a good idea, assuming the encoding is consistent within the file. You can also try sending it through iconv to convert it to either LATIN1 or UTF8.
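A rough Python counterpart to the file and iconv check is sketched below. It assumes only two candidate encodings, UTF-8 and LATIN1, and the function names are made up for illustration. Any byte sequence decodes as LATIN1 without error, so the check can only tell whether the data is valid UTF-8 or not; pure ASCII input is reported as UTF-8.

```python
def guess_encoding(data: bytes) -> str:
    """Guess between UTF-8 and LATIN1 for a chunk of bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8; LATIN1 is assumed as the only other candidate.
        return "latin-1"
    return "utf-8"


def to_utf8(data: bytes) -> bytes:
    """Recode to UTF-8, similar in spirit to `iconv -f LATIN1 -t UTF-8`."""
    if guess_encoding(data) == "utf-8":
        # Already valid UTF-8: leave the bytes untouched.
        return data
    return data.decode("latin-1").encode("utf-8")
```

For a real file, the bytes could be read with open(path, "rb").read() and passed to guess_encoding before deciding which client encoding to use.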
Currently, OGR from GDAL does not perform any recoding of character data during translation between vector formats. The team has prepared RFC 23.1: Unicode support in OGR, a document that discusses support for recoding in OGR drivers. RFC 23 was adopted and the core functionality was released in GDAL 1.6.0. However, most OGR drivers, including the Shapefile driver, have not been updated.
For the time being, I would describe OGR as encoding agnostic and ignorant. That is, OGR takes what it gets and sends it out without any processing. OGR uses the char type to manipulate textual data. This is fine for handling multi-byte encoded strings (like UTF-8): it's just a plain stream of bytes stored as an array of char elements.
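To illustrate what this pass-through behaviour means in practice, here is a small Python sketch (it is not OGR code). The same bytes are read correctly or incorrectly depending only on which encoding the receiving side assumes; the sample string is arbitrary.

```python
text = "Troms\u00f8"  # arbitrary sample value containing one non-ASCII letter

latin1_bytes = text.encode("latin-1")  # the letter becomes the single byte 0xF8
utf8_bytes = text.encode("utf-8")      # the letter becomes the two bytes 0xC3 0xB8

# A pass-through layer hands the bytes on unchanged.
passed_on = latin1_bytes

# If the receiver assumes LATIN1, the text survives.
latin1_ok = passed_on.decode("latin-1") == text

# If the receiver assumes UTF-8, the lone 0xF8 byte is rejected.
try:
    passed_on.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The reverse mismatch does not fail; it silently yields two wrong characters.
mojibake = utf8_bytes.decode("latin-1")
```

This is why telling the receiving side which encoding the bytes are in matters when nothing in between recodes them.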
Developers of OGR drivers are advised to return attribute values as UTF-8 encoded strings; however, this rule has not been widely adopted across OGR drivers, so this functionality is not yet ready for end users.