How do I read a text file's hidden characters?
An easy way to view this kind of stuff in Windows is to use the "type" command.
I would do something like this:
type filename.txt | more
Well, I'm using Notepad++ and I can't see that at all! What is the best text file reader for this kind of problem?
The problem is, a ‘good’ text editor should be able to load all text encodings transparently — even stupid broken ones like UTF-8-plus-BOM — which would prevent you from seeing the problem. Sure, a good text editor should save UTF-8 without the bogus-BOM, or at least give you the option to do so, but you won't know to re-save it if you don't see the faux-BOM there.
The reason you see the three high-bytes at the start of the file in TextMate is actually because TextMate has got it wrong and guessed the encoding as Latin-1 instead of UTF-8. This presumably reproduces the behaviour of the service you're sending to, which doesn't know about Unicode, but it's not really a desirable feature in itself. It's also why the æs and øs haven't come out.
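To make the mis-guess concrete, here is a small Python sketch (the sample text is made up for illustration, not taken from your file). It encodes a BOM plus two letters as UTF-8 and then decodes the bytes as Latin-1, which is roughly what an editor does when it guesses wrong:

```python
# Hypothetical sample: a BOM (U+FEFF) followed by two non-ASCII letters.
data = "\ufeffæø".encode("utf-8")

# Decoding UTF-8 bytes as Latin-1 maps every byte to one character,
# so the 3-byte BOM and each 2-byte letter show up as separate symbols.
as_latin1 = data.decode("latin-1")
print(as_latin1)  # ï»¿Ã¦Ã¸

# Decoding with the right codec (and letting it drop the BOM) recovers the text.
as_utf8 = data.decode("utf-8-sig")
print(as_utf8)  # æø
```

The three characters at the front are the BOM bytes EF BB BF read one byte at a time, and each letter turns into a two-character pair for the same reason.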
If you want to see every byte in the file explicitly, what you want isn't really a text editor, but a hex editor. There are lots to choose from, e.g. XVI32 on Windows.
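If you'd rather not install anything, a few lines of Python can stand in for a hex editor when all you need is a read-only look at the bytes. This is only a rough sketch; the function name hex_dump and the sample bytes are my own, not from any particular tool:

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return an offset / hex / ASCII dump of `data`, `width` bytes per row."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Printable ASCII is shown as-is; every other byte becomes a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return "\n".join(rows)


# Hypothetical sample: a UTF-8 BOM followed by "æø" and a newline.
sample = b"\xef\xbb\xbf" + "æø\n".encode("utf-8")
print(hex_dump(sample))
```

For a real file, read it in binary mode first, e.g. hex_dump(open("filename.txt", "rb").read()). A row starting with ef bb bf is the BOM.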
And then fix your application to not produce bogus BOMs; they have no place in a UTF-8 file anyway, never mind the problems they cause for non-Unicode applications. [I don't know what the application is written in, but a common cause of unwanted BOMs is using .NET's Encoding.UTF8 encoding. A new UTF8Encoding(false) would be preferable.]
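If you can't change the producing application straight away, stripping the BOM after the fact is a possible stopgap. Here's a minimal Python sketch, assuming the file really is UTF-8; strip_utf8_bom is a name I made up, while codecs.BOM_UTF8 and the "utf-8-sig" codec are part of Python's standard library:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Python's "utf-8-sig" codec writes a BOM, much as a BOM-emitting encoder would.
with_bom = "æø".encode("utf-8-sig")
without_bom = strip_utf8_bom(with_bom)

print(with_bom.hex(" "))     # ef bb bf c3 a6 c3 b8
print(without_bom.hex(" "))  # c3 a6 c3 b8
```

Files that never had a BOM pass through unchanged, so it's safe to run on everything you send.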
Whether the service you're sending to wants UTF-8 or some other encoding is in any case something you'll have to ask the operators of that service. If they're already describing the high-bytes for æ et al in your file as inherently ‘invalid’, you may be facing a situation where they don't support any non-ASCII characters at all, in which case you'll have to consider transliterating characters appropriately for the target language, eg. æ -> ae.
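If it does come to transliteration, something along these lines might do as a starting point. It's a sketch under assumptions: only æ -> ae comes from the advice above, the other entries in the table are my guess at common Danish/Norwegian conventions, and the names ASCII_FALLBACKS and to_ascii are made up. Check the mapping against what the service operators actually expect.

```python
import unicodedata

# Assumed mapping: only "æ" -> "ae" is from the answer above; the rest are
# illustrative guesses and should be adjusted for the target language.
ASCII_FALLBACKS = {
    "æ": "ae", "ø": "oe", "å": "aa",
    "Æ": "Ae", "Ø": "Oe", "Å": "Aa",
}


def to_ascii(text: str) -> str:
    """Transliterate `text` to plain ASCII using the table, then strip accents."""
    replaced = text.translate(str.maketrans(ASCII_FALLBACKS))
    # NFKD splits accented letters into base letter + combining mark;
    # encoding with errors="ignore" then drops whatever is still non-ASCII.
    decomposed = unicodedata.normalize("NFKD", replaced)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Smørrebrød"))  # Smoerrebroed
```

Note that anything without a table entry or an ASCII base letter is silently dropped, which may or may not be what you want.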
Frhed comes to mind; it's a very nice tool. And as Arjan pointed out, you're saving the file as a UTF-8 encoded document.