What are the differences between Linux and Windows .txt files (Unicode encoding)

"Unicode" on Windows usually means UTF-16LE, where each code point takes 2 or 4 bytes. Linux uses UTF-8, where each code point takes between 1 and 4 bytes.
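
To make the byte counts concrete, here is a minimal Python sketch that encodes a few sample characters both ways. The helper name `encoded_sizes` and the sample string are just for illustration.

```python
def encoded_sizes(text):
    """Return (character, UTF-8 byte count, UTF-16LE byte count) per code point."""
    return [
        (ch, len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
        for ch in text
    ]

# Sample characters: A, e-acute, the euro sign, and an emoji outside the BMP.
sample = "A\u00e9\u20ac\U0001F600"

for ch, utf8_len, utf16_len in encoded_sizes(sample):
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, UTF-16LE {utf16_len} bytes")
```

Characters outside the Basic Multilingual Plane (such as the emoji) are the ones that need 4 bytes in UTF-16, because they are stored as a surrogate pair.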

Recommended reading: Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"


Line breaks

Windows uses CRLF (\r\n, 0D 0A) line endings while Unix just uses LF (\n, 0A).
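
As a rough sketch, converting between the two conventions is a byte-level replace. The function names below are made up for this example, and the approach assumes an ASCII-compatible encoding such as UTF-8 or a legacy code page. It would corrupt UTF-16 data, where 0D and 0A are each followed by a 00 byte.

```python
def to_lf(data: bytes) -> bytes:
    """Convert Windows CRLF line endings to Unix LF."""
    return data.replace(b"\r\n", b"\n")

def to_crlf(data: bytes) -> bytes:
    """Convert to CRLF; normalize to LF first so existing CRLFs are not doubled."""
    return to_lf(data).replace(b"\n", b"\r\n")

print(to_lf(b"one\r\ntwo\r\n"))
print(to_crlf(b"one\ntwo\r\n"))
```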

Character Encoding

Most modern (i.e., since 2004 or so) Unix-like systems make UTF-8 the default character encoding.

Windows, however, has historically lacked native support for UTF-8. It works internally in UTF-16 and assumes that char-based strings are in a legacy code page. Fortunately, Notepad can read UTF-8 files; unfortunately, "ANSI" encoding was the default for a long time and is still what many existing files use.
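
Here is a small Python sketch of what goes wrong when UTF-8 bytes are read as "ANSI", plus one common workaround: try strict UTF-8 first and fall back to the legacy code page. It assumes the ANSI code page is Windows-1252 (`cp1252`), which depends on the system locale; `decode_text` is a hypothetical helper, not a standard function.

```python
# Assumption: the legacy "ANSI" code page is Windows-1252 (cp1252). This is
# common on US and Western European systems; other locales use other pages.
ANSI_CODE_PAGE = "cp1252"

def decode_text(data: bytes) -> str:
    """Try strict UTF-8 first, then fall back to the assumed ANSI code page."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(ANSI_CODE_PAGE, errors="replace")

utf8_bytes = "caf\u00e9".encode("utf-8")     # the word "cafe" with e-acute
misread = utf8_bytes.decode(ANSI_CODE_PAGE)  # UTF-8 bytes interpreted as ANSI

print(repr(misread))                   # mojibake: two characters replace the e-acute
print(repr(decode_text(utf8_bytes)))   # decoded correctly as UTF-8
print(repr(decode_text(b"caf\xe9")))   # invalid UTF-8, so the fallback is used
```

The fallback works reasonably well in practice because text in a legacy code page that contains non-ASCII characters is rarely valid UTF-8, but it is a heuristic, not a guarantee.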

Problematic Special Characters

U+001A SUBSTITUTE

Windows (rarely) uses Ctrl+Z as an end-of-file character. For example, if you display a file with the "type" command at the command prompt, the output will be truncated at the first 1A byte.

On Unix, a 1A byte in a file is nothing special.
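
To illustrate the truncation described above, this sketch emulates a reader that treats the first 1A byte as end-of-file. It only mimics the behavior; `truncate_at_ctrl_z` is a made-up helper, not a Windows API.

```python
CTRL_Z = b"\x1a"  # U+001A SUBSTITUTE

def truncate_at_ctrl_z(data: bytes) -> bytes:
    """Return the bytes before the first 1A byte, or all of data if there is none."""
    head, _sep, _tail = data.partition(CTRL_Z)
    return head

print(truncate_at_ctrl_z(b"visible\x1ahidden"))
print(truncate_at_ctrl_z(b"no marker here"))
```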

U+FEFF ZERO WIDTH NO-BREAK SPACE (Byte-Order Mark)

On Windows, UTF-8 files often start with a "byte order mark" EF BB BF to distinguish them from ANSI files.

On Linux, the BOM is discouraged because it breaks things like shebang lines in shell scripts. Plus, it'd be pointless to have a UTF-8 signature when UTF-8 is the default encoding anyway.
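
If you need to move such files to Linux, stripping the signature is straightforward. This Python sketch shows a byte-level check using `codecs.BOM_UTF8` and the built-in `utf-8-sig` codec, which skips a leading BOM if one is present. The helper name `strip_utf8_bom` is just for illustration.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 byte order mark (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

# With the BOM in place the file does not start with "#!", so the kernel
# will not recognize the shebang line.
print(script.startswith(b"#!"))
print(strip_utf8_bom(script).startswith(b"#!"))

# When decoding to text, "utf-8-sig" accepts files with or without a BOM.
print(repr(script.decode("utf-8-sig")))
```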


One difference I've heard of is the use of \r\n (Windows) vs. \n (Linux) for line breaks.

Yes. Most UNIX text editors will handle this automatically, and Windows programmers' editors may handle it, but general text editors (basic Notepad) will not.
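
Your own code can be just as tolerant as those editors. As a sketch, Python's text layer with `newline=None` (universal newlines mode) translates \r\n, \r, and \n all to \n on read. The function name `read_text_universal` and the UTF-8 default are assumptions for this example.

```python
import io

def read_text_universal(data: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes and translate every line-ending style to a plain LF."""
    # newline=None enables universal newlines mode on read.
    with io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline=None) as stream:
        return stream.read()

mixed = b"windows\r\nunix\nold mac\rend"
print(repr(read_text_universal(mixed)))
```

The same `newline=None` behavior is the default for the built-in `open()` in text mode, so files opened that way are normalized without extra work.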

Windows seems to also need the EOF (Ctrl-Z) as END OF FILE in some contexts, whereas you'll probably never see it on UNIX.

Remember that Mac OS X is now UNIX underneath, so it uses UNIX line endings. Before OS X (Mac OS 9 and below), the Mac had its own line ending (\r).
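
To find out which convention a given file uses, you can count the three styles. This is a minimal sketch with a made-up helper name, and it assumes an ASCII-compatible encoding (UTF-8 or a legacy code page, not UTF-16).

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF (Windows), lone LF (Unix), and lone CR (classic Mac OS) endings."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,  # LFs that are not part of a CRLF
        "CR": data.count(b"\r") - crlf,  # CRs that are not part of a CRLF
    }

print(count_line_endings(b"a\r\nb\nc\rd\r\n"))
```

A file where more than one count is nonzero has mixed line endings, which usually means it was edited on different systems.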

EDIT: in other format CR and LF: