Specification of source charset encoding in MSVC++, like gcc "-finput-charset=CharSet"

For those who subscribe to the motto "better late than never", Visual Studio 2015 (version 19 of the compiler) now supports this.

The new /source-charset command line switch allows you to specify the character set encoding used to interpret source files. It takes a single parameter, which can be either the IANA or ISO character set name:

/source-charset:utf-8

or the decimal identifier of a particular code page (preceded by a dot):

/source-charset:.65001

The official documentation is here, and there is also a detailed article describing these new options on the Visual C++ Team Blog.

There is also a complementary /execution-charset switch that works in exactly the same way but controls how narrow character- and string-literals are generated in the executable. Finally, there is a shortcut switch, /utf-8, that sets both /source-charset:utf-8 and /execution-charset:utf-8.

These command-line options are incompatible with the old #pragma setlocale and #pragma execution-character-set directives, and they apply globally to all source files.

For users stuck on older versions of the compiler, the best option is still to save your source files as UTF-8 with a BOM (as other answers have suggested, the IDE can do this when saving). The compiler will automatically detect this and behave appropriately. So, too, will GCC, which also accepts a BOM at the start of source files without choking to death, making this approach functionally portable.

Open File->Advances Save Options... Select Unicode(UTF-8 with signature) - Codepage 65001 in Encoding combo. Compiler will use selected encoding automatically.

According to Microsoft answer here:

if you want non-ASCII characters then the "official" and portable way to get them is to use the \u (or \U) hex encoding (which is, I agree, just plain ugly and error prone).

The compiler when faced with a source file that does not have a BOM the compiler reads ahead a certain distance into the file to see if it can detect any Unicode characters - it specifically looks for UTF-16 and UTF-16BE - if it doesn't find either then it assumes that it has MBCS. I suspect that in this case that in this case it falls back to MBCS and this is what is causing the problem.

Being explicit is really best and so while I know it is not a perfect solution I would suggest using the BOM.

Jonathan Caves
Visual C++ Compiler Team.

Good solution will be placing text strings in resource files. It is convenient and portable way. You could use localization libraries, such as gettext to manage translations.

The flow we used: save files as UTF8-with BOM, share the same source between linux and windows, for linux: preprocess the source files on compilation command in order to remove the BOM, run g++ on the intermediate non-BOM file.

Specification of source charset encoding in MSVC++, like gcc "-finput-charset=CharSet"

Tags:

C++

Visual C++

Unicode

Command Line Arguments

Character Encoding

Related

Recent Posts