#pragma execution_character_set("utf-8")

    Question

  • Hello,

    In Visual Studio 2010, the pragma execution_character_set was available to define the encoding of string literals in the source code.

    When this pragma was set, the following statement created a UTF-8 encoded string when compiled:

    const char* name = "Hügel";

    Visual Studio 2012 no longer recognizes this pragma.

    Is there a new way of doing this?

    Regards,

    Ralf

    Update:

    Thanks for all the answers about the general concept of Unicode below.

    But my specific question is why Microsoft removed, in Visual Studio 2012, a feature that had been introduced as a hotfix for Visual Studio 2010.

    My expectation would be that it is superseded by a better solution.


    • Edited by Ralf Tobel Monday, December 03, 2012 7:41 AM
    Thursday, November 29, 2012 9:45 AM

Answers

All replies

  • In the latest versions of GCC, the C++11 Unicode encoding prefixes are already available, for example:

    const char* s1 = u8R"(Raw UTF-8 encoded string literal)";

    Here u8"" (and its raw variant u8R"()") creates a UTF-8 encoded string. The prefix is still unavailable in VC++ 2012.

    But in any case the compiler has to know the encoding of the source file, and that is what Microsoft's pragma directive is for. See this discussion for some enlightenment on the subject: How to use utf8 character arrays in c++?
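As a sketch of what those prefixes buy you on a compiler that already supports them (GCC 4.5+ or Clang, not VC++ 2012), using a \u escape so the snippet stays valid even in an ASCII-encoded source file:

```cpp
#include <cstring>

// C++11: the u8 prefix guarantees the literal is stored as UTF-8 in the
// binary, independent of the source file's encoding and of the execution
// character set. \u00FC is the universal character name for 'ü', so the
// source file itself can stay plain ASCII.
constexpr auto kName = u8"H\u00FCgel";

// U+00FC encodes as the two bytes 0xC3 0xBC in UTF-8, so the literal is
// six bytes plus the terminating NUL.
static_assert(sizeof(u8"H\u00FCgel") == 7, "UTF-8 takes 6 bytes + NUL here");
```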


    Thursday, November 29, 2012 1:50 PM
  • > But in any case the compiler has to know the encoding of the source file
    > and that is what Microsoft's pragma directive is for.

    No, that's not what pragma execution_character_set is for.

    When the compiler reads a UTF-8 encoded source file, it recognizes the encoding from the byte order mark of the file. But the compiler does not generate UTF-8 encoded strings. Instead, when the compiler sees a multi-byte UTF-8 sequence inside a string, it converts it according to the ACP (the system's ANSI code page) into a single byte. Which is okay for a character constant: what should the compiler do with 'ö'? It needs to generate a single char value; a multi-byte sequence would not make sense there. In contrast, generating multi-byte encoded characters inside a string often makes a lot of sense (even if the source file itself is not UTF-8 but ANSI or UTF-16 encoded).
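To make the difference concrete, here is a small sketch with both byte sequences spelled out explicitly (assuming a Western-European Windows-1252 ACP for the single-byte case):

```cpp
// 'ü' is code point U+00FC. Under a Windows-1252 ACP the compiler stores it
// as the single byte 0xFC; its UTF-8 encoding is the two bytes 0xC3 0xBC.
constexpr char acpName[]  = "H\xFCgel";      // 5 characters + NUL (ACP result)
constexpr char utf8Name[] = "H\xC3\xBCgel";  // 6 bytes + NUL (desired UTF-8)

static_assert(sizeof(acpName)  == 6, "one byte per character");
static_assert(sizeof(utf8Name) == 7, "U+00FC takes two bytes in UTF-8");
```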

    I too need a solution for this problem. Sadly Microsoft still does not support u8"", which would be the perfect solution. To tell you the truth, I am quite angry with Microsoft for removing the feature without providing an alternative solution.


    • Edited by Charcoalgrin Wednesday, December 05, 2012 10:22 AM typo fixed
    Wednesday, December 05, 2012 9:57 AM
  • > But in any case the compiler has to know the encoding of the source file
    > and that is what Microsoft's pragma directive is for.

    No, that's not what pragma execution_character_set is for.

    When the compiler reads a UTF-8 encoded source file, it recognizes the encoding from the byte order mark of the file.


    When you state something like this, please provide the source of the information.

    Any file can be supplied as a source to the compiler, with or without a BOM.

    Wednesday, December 05, 2012 11:52 AM
    Any file can be supplied as a source to the compiler, with or without a BOM.

    Most UTF-8 encoded files I have seen had a BOM.

    And yes, I think it would still be nice to have a way to define the encoding of the source file, although I would prefer to define it in the file properties in Visual Studio, not in the source code itself. If you put the definition into the file as a pragma, the compiler must read and analyze the file before it can determine the encoding, and for that it has to know whether the file is ANSI or UTF-16 anyway ...

    But that is not the point. My problem is that I need the compiler to generate UTF-8 strings, no matter whether I have to save the source code as ANSI, UTF-8 or UTF-16 for this to work. I could live with any of these encodings, if only the compiler could generate UTF-8 encoded strings. I have many strings with umlauts, and writing them as sequences of octal numbers is impractical.


    Regarding the source of the information: there is no source. I just tried it. If I am not totally mistaken, what I described is what the Microsoft compiler actually does.
    • Edited by Charcoalgrin Wednesday, December 05, 2012 2:13 PM something added
    Wednesday, December 05, 2012 2:06 PM
  • You're absolutely right!

    We tried this too. But saving the file as UTF-8 doesn't mean that the compiler generates UTF-8 encoded strings.

    For the following code there is no UTF-8 sequence generated for the character 'ü':

    const char* name = "Hügel";

    Instead, as you mentioned, the ACP is used to convert it to a single byte.
    Wednesday, December 05, 2012 3:53 PM
  • Additional info on the pragma: http://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/#li-comment-756

    Let me know if I can clear up any questions after reading my previous link.

    gg

    Wednesday, December 05, 2012 4:34 PM
  • Thank you for the link. The information it provides is interesting, but does not solve the problem. The problem is not about writing UTF-8 into a stream or onto the console. The problem is to create a string literal that is UTF-8 encoded in the first place.
    Thursday, December 06, 2012 8:22 AM
  • Thank you for the link. The information it provides is interesting, but does not solve the problem. The problem is not about writing UTF-8 into a stream or onto the console. The problem is to create a string literal that is UTF-8 encoded in the first place.

    After reading and understanding that link, you'll know what it takes to create a string literal in a particular encoding - or if it's even possible with MSVC/GCC. It also straightens out other misinformation present in this thread (and in other links presented in this thread). 

    gg

    Thursday, December 06, 2012 4:33 PM
  • On 12/6/2012 3:22 AM, Charcoalgrin wrote:

    Thank you for the link. The information it provides is interesting, but does not solve the problem. The problem is not about writing UTF-8 into a stream or onto the console. The problem is to create a string literal that is UTF-8 encoded in the first place.

    "H\xC3\xBCgel"

    Igor Tandetnik

    Thursday, December 06, 2012 5:41 PM
  • Of course I know that I can use hexadecimal or octal notation to write any byte sequence into a string literal. But as I said before: This is impractical.

    The project I am working on has many thousands of strings. Many of them include umlauts and the sharp S (ß). Finding and changing them all would be expensive and error-prone, and it would make the strings completely unreadable.

    What I do not get is why on earth Microsoft would remove the pragma execution_character_set feature. It would have been okay if they had implemented the u8"" feature in Visual Studio 2012. Instead they put a lot of effort into ruining the user interface. Grey icons ... who comes up with such ideas? I fear what they will do in the next version. Maybe they will come up with white letters on a white background ...

    Friday, December 07, 2012 2:43 PM
  • On 12/7/2012 9:43 AM, Charcoalgrin wrote:

    Of course I know that I can use hexadecimal or octal notation to write any byte sequence into a string literal. But as I said before: This is impractical.

    Well, that's the only reliable, portable way I know of.

    The project I am working on has many thousands of strings.

    Do they have to be represented as string literals? A string table resource, or even a plain old text file, could be a more practical alternative. As an added benefit, if you decide one day that you want to ship localized versions of your application, you'd have all strings in need of translation handily in one place.


    Igor Tandetnik

    Friday, December 07, 2012 6:11 PM
  • Well, that's the only reliable, portable way I know of.

    The best way I can think of would be to use C++11 UTF-8 string literals: const char* s = u8"Hügel";

    But the pragma did the trick as well.

    Regarding the thousands of strings in our project: we are already using string resources for all strings that need to be localized (and we have thousands of those too). The strings I was talking about must not be localized. They are stored in large arrays. The array element type is a structure with about 15 members, and only one of them is a string. The array definitions are spread over more than 100 source files. So, in theory we could load the strings from a file into memory, but how would we put the right string into the right element of the right array? As an alternative we could read each array from a separate file, reading all the struct members from the file. But then we would have to parse the file and detect all kinds of errors. Currently the compiler tells us at compile time if we make a syntax error, use the wrong data type or mistype an enumeration value.

    PS: Sorry for picking on Microsoft earlier. The development teams at Microsoft mostly do a good job. It's just that when I first read about the VS 2012 user interface, I was surprised that someone could come up with so many obviously bad ideas. Later I was frustrated that the "designers" mostly ignored the user complaints. It is like buying a car, only to find out on delivery that the windshield is painted black. When you complain, the salesman just says: "But it looks cool!"


    Friday, December 07, 2012 8:51 PM
  • On 12/7/2012 3:51 PM, Charcoalgrin wrote:

    Regarding the thousands of strings in our project: we are already using string resources for all strings that need to be localized (and we have thousands of those too). The strings I was talking about must not be localized. They are stored in large arrays. The array element type is a structure with about 15 members, and only one of them is a string. The array definitions are spread over more than 100 source files. So, in theory we could load the strings from a file into memory, but how would we put the right string into the right element of the right array?

    How about something like this. You gather all these strings into a text file, containing lines like:

    kHuegel=Hügel

    (doesn't have to be this exact way - whatever you find convenient). You then write a tool that consumes this file and generates two source files:

    // MyStrings.h
    extern const char* kHuegel;
    
    // MyStrings.cpp
    const char* kHuegel = "H\xC3\xBCgel";     // Hügel

    In the rest of your code, just include MyStrings.h and use kHuegel in place of "Hügel".
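A sketch of the escaping step such a generator tool might use (the function name is mine). Note one subtlety: after a \x escape the quotes must be closed and reopened, because a following character such as 'e' or 'a' is itself a hex digit and would be swallowed into the escape by the compiler:

```cpp
#include <cstdio>
#include <string>

// Converts a UTF-8 string into C++ string-literal source text. Every byte
// outside printable ASCII is written as a \xNN escape; after an escape the
// literal is split ("\xBC" "e" rather than "\xBCe") so that a following
// hex-digit character cannot be absorbed into the escape sequence.
std::string escapeToLiteral(const std::string& utf8)
{
    std::string out = "\"";
    bool afterEscape = false;
    for (unsigned char c : utf8) {
        if (c < 0x20 || c >= 0x7F) {
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\x%02X", c);
            out += buf;
            afterEscape = true;
        } else {
            if (afterEscape)
                out += "\" \"";          // break the literal after an escape
            if (c == '"' || c == '\\')
                out += '\\';             // escape quote and backslash
            out += static_cast<char>(c);
            afterEscape = false;
        }
    }
    out += '"';
    return out;
}
```

So "Hügel" comes out as "H\xC3\xBC" "gel", which the compiler concatenates back into the original six bytes.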


    Igor Tandetnik

    Friday, December 07, 2012 9:13 PM
  • It might be an idea to use

    wchar_t* str;

    as a container, since that will handle all symbols properly


    Windows MVP, XP, Vista, 7 and 8. More people have climbed Everest than having 3 MVP's on the wall.

    Hardcore Games, Legendary is the only Way to Play

    Developer | Windows IT | Chess | Economics | Vegan Advocate | PC Reviews

    Friday, December 07, 2012 9:40 PM
  • Thank you for your suggestions.

    Defining the strings as wide character strings and converting them into UTF-8, either on use or during program startup, is a possible way to go. We are still using Visual Studio 2010 for now (mainly because we need to support Windows XP), but we are considering moving to VS 2012 a few months from now. I am still hoping that Microsoft will provide a solution by then (for example, the pragma).
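A minimal sketch of such a startup conversion (the function name is mine; it handles only BMP code points, no surrogate pairs, to keep it short; on Windows one would normally just call WideCharToMultiByte with CP_UTF8). It uses char16_t rather than wchar_t so the unit size is fixed across platforms:

```cpp
#include <string>

// Converts a UTF-16 string to UTF-8. Only BMP code points are handled here;
// characters outside the BMP (surrogate pairs) are not supported in this sketch.
std::string toUtf8(const std::u16string& in)
{
    std::string out;
    for (char16_t c : in) {
        if (c < 0x80) {                                   // 1-byte sequence
            out += static_cast<char>(c);
        } else if (c < 0x800) {                           // 2-byte sequence
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {                                          // 3-byte sequence
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```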

    Friday, December 07, 2012 11:46 PM
  • wchar_t is a standard component of C++ and it is supported by all major vendors

    XP and up can use wchar_t, but remember XP is now nearing the end of extended support

    no need for a #pragma



    Saturday, December 08, 2012 12:13 AM
  • no need for a #pragma

    I disagree. UTF-8 and wchar_t are not the same thing. Sure, I can convert one into the other, but why should I be forced to do that?

    Imagine having the data type double but no way to write double literals in your source code. You say that is silly? Why? You can always use strings instead and convert them into double values using the atof() function. So there is no need for double literals, right?

    In my opinion, not having UTF-8 literals is very much the same. And by the way: why would Microsoft have implemented the pragma in VC++ 2010 in the first place if there were no need for it?
    • Edited by Charcoalgrin Saturday, December 08, 2012 11:02 AM
    Saturday, December 08, 2012 10:43 AM
  • You will need to overload your output, such as cout <<,

    with a converter that turns the wide strings into UTF-8 or whatever other encoding you need

    I am sure there is a converter around on the net somewhere, try Google



    Saturday, December 08, 2012 12:33 PM
  • As I said before: This is not about sending UTF-8 to a stream or the console, or about converting Unicode into UTF-8 (which is quite simple). It is about making the compiler generate UTF-8 encoded strings.

    Tuesday, December 11, 2012 8:51 AM
  • Hi,

    Here is the link in connect.microsoft.com:

    http://connect.microsoft.com/VisualStudio/feedback/details/773186/pragma-execution-character-set-utf-8-didnt-support-in-vc-2012

    Have a nice day.

    Regards,


    Elegentin Xie
    MSDN Community Support | Feedback to us
    Develop and promote your apps in Windows Store
    Please remember to mark the replies as answers if they help and unmark them if they provide no help.

    Wednesday, December 12, 2012 6:23 AM
    Moderator