BStrings

  • General discussion

  • Hi
    I've mentioned BSTRINGs, ASCII and Unicode a few times. So for those of you interested in what happens under the hood, here's an old MSDN article that I personally think is very interesting (especially when handling the Win32 APIs in BASIC):

    Article 3. Strings the OLE Way

    Bruce McKinney

    April 18, 1996


    The difference between Microsoft® Visual Basic® strings and Visual C++® strings is the difference between "I'll do it" and "You do it." The C++ way is fine for who it's for, but there aren't many programmers around anymore who get a thrill out of allocating and destroying their own string buffers. In fact, most C++ class libraries (including the Microsoft Foundation Classes, or MFC) provide string classes that work more or less on the Basic model, which is similar to the model of Pascal and FORTRAN.

    When you manage an array of bytes (or an array of books or beer bottles or babies), there are two ways of maintaining the lengths. The marker system puts a unique marker at the end of the array. Everything up to the marker is valid. The count system adds a special array slot containing the number of elements. You have to update the count every time you resize the array. Both systems have their advantages and disadvantages. The marker system assumes you can find some unique value that will never appear in the array. The count system requires tedious bookkeeping to keep the count accurate.
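    The two bookkeeping schemes can be sketched in a few lines of portable C++. This is only an illustration; the names markerLength, CountedBytes, and makeCounted are made up for the example and do not come from the article.

```cpp
#include <cstddef>

// Marker system: walk the array until the reserved marker value appears.
// Assumes the marker is guaranteed to be present somewhere in the array.
std::size_t markerLength(const char* data, char marker = '\0') {
    std::size_t n = 0;
    while (data[n] != marker) ++n;
    return n;
}

// Count system: the length travels with the data (a non-owning view here)
// and has to be updated by hand whenever the array is resized.
struct CountedBytes {
    std::size_t count;
    const char* data;
};

CountedBytes makeCounted(const char* src, std::size_t n) {
    CountedBytes c = { n, src };
    return c;
}
```

    With the bytes 'a', 'b', 0, 'c', 'd', 0, markerLength stops at the first zero and reports 2, while a CountedBytes built with a count of 5 still covers the embedded zero and the two bytes after it.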

    The C language and most of its offspring use the marker system for storing strings, with the null character as the marker. All the other languages I know use the count system. You might argue that the majority indicates the better choice, but even if you buy that, C still gets the last laugh. Many of the leading operating systems of the world (all the flavors of Unix®, Windows®, and OS/2®, for example) expect strings passed to the system to be null-terminated. As a result, languages such as Pascal and FORTRAN support a special null-terminated string type for passing strings to the operating system. Basic doesn't have a separate type for null-terminated strings, but it has features that make passing null-terminated strings easy.

    As a language-independent standard, OLE can't afford to take sides. It must accommodate languages in which null is not a special character, but it must also be able to output null-terminated strings for its host operating system. More importantly, OLE recognizes that requiring the operating system to manage strings is inherently more stable and reliable in a future computing world where strings may be transferred across process, machine, and eventually Internet boundaries. I've been told that the name BSTR is a compression of Basic STRing, but in fact a BSTR looks a lot more like a Pascal string than like the strings Basic old-timers remember.

    In any case, C++ programmers have some unlearning to do when it comes to writing strings for OLE. But before you can get into BSTR details, you need to clearly understand the difference between Unicode™ and ANSI strings.

    Unicode vs. ANSI

    Stringwise, we are cursed to live in interesting times. The world according to Microsoft (and many other international companies) is moving from ANSI to Unicode characters, but the transition isn't exactly a smooth one.

    Most of the Unicode confusion comes from the fact that we are in the midst of a comprehensive change in the way characters are represented. The old way uses the ANSI character set for the first 256 bytes, but reserves some characters as double-byte character prefixes so that non-ANSI character sets can be represented. This is very efficient for the cultural imperialists who got there first with Latin characters, but it's inefficient for those who use larger character sets. Unicode represents all characters in two bytes. This is inefficient for the cultural imperialists (although they still get the honor of claiming most of the first 128 characters with zero in the upper byte), but it's more efficient (and more fair) for the rest of the world.

    Different Views of Unicode

    Eventually, everybody will use Unicode, but nobody seems to agree on how to deal with the transition.

    • Windows 3.x—Doesn't know a Unicode from a dress code, and never will.

    • 16-bit OLE—Ditto.

    • Windows NT®—Was written from the ground up first to do the right thing (Unicode) and secondly to be compatible (ANSI). All strings are Unicode internally, but Windows NT also completely supports ANSI by translating internal Unicode strings to ANSI strings at run time. Windows NT programs that use Unicode strings directly can be more efficient by avoiding frequent string translations, although Unicode strings take about twice as much data space.

    • Windows 95—Uses ANSI strings internally. Furthermore, it doesn't support Unicode strings even indirectly in most contexts—with one big exception.

    • 32-bit OLE—Was written from the ground up to do the right thing (Unicode) and doesn't do ANSI. The OLE string types—OLESTR and BSTR—are Unicode all the way. Any 32-bit operating system that wants to do OLE must have at least partial support for Unicode. Windows 95 has just enough Unicode support to make OLE work.

    • Visual Basic—The designers had to make some tough decisions about how they would represent strings internally. They might have chosen ANSI, because it's the common subset of Windows 95 and Windows NT, and converted to Unicode whenever they needed to deal with OLE. But since Visual Basic 4.0 is OLE inside and out, they chose Unicode as the internal format, despite potential incompatibilities with Windows 95. The Unicode choice caused many problems and inefficiencies both for the developers of Visual Basic and for Visual Basic developers—but the alternative would have been worse.

    • The Real World—Most existing data files use ANSI. The .WKS, .DOC, .BAS, .TXT, and most other standard file formats use ANSI. If a system uses Unicode internally but needs to read from or write to common data formats, it must do Unicode-to-ANSI conversion. Someday there will be Unicode data file formats, but today they're pretty rare.

    What does this mean for you? It means you must make choices about any program you write:

    • If you write using Unicode internally, your application will run only on Windows NT, but it will run faster. Everything is Unicode, inside and out. There are no string translations—except when you need to write string data to standard file formats that use ANSI. An application written this way won't be out-of-date when some future iteration of Windows 9x gets Unicode.

    • If you write using ANSI internally, your application will run on either Windows NT or Windows 95, but it will run slower under Windows NT because there are a lot of string translations going on in the background. An application written this way will someday be outdated when the whole world goes Unicode, but it may not happen in your lifetime.

    The obvious choice for most developers is to use the ANSI version because it works right now for all 32-bit Windows platforms. But I'd like to urge you to take a little extra time to build both versions.

    If you choose to write your application using both ANSI and Unicode, Win32® and the C run-time library both provide various types and macros to make it easier to create portable programs from the same source. To use them, define the symbol _UNICODE for your Unicode builds and the symbol _MBCS for your ANSI builds. The samples already have these settings for the Microsoft Developer Studio.

    Note   As far as this article is concerned, there is no difference between double-byte character strings—DBCS—and multi-byte character strings—MBCS. Similarly, "wide character" and "Unicode" are synonymous in the context of this article.

    A WCHAR Is a wchar_t Is an OLECHAR

    Just in case you're not confused enough about ANSI and Unicode strings, everybody seems to have a different name for them. Furthermore, there's a third type of string called a single-byte character string (SBCS), which we will ignore in this article.

    In the Win32 API, ANSI normally means MBCS. The Win32 string functions (lstrlenA, lstrcpyA, and so on) assume multi-byte character strings, as do the ANSI versions of all application programming interface (API) functions. You also get Unicode versions (lstrlenW, lstrcpyW). Unfortunately, these aren't implemented in Windows 95, so you can't use them on BSTRs. Finally, you get generic macro versions (lstrlen, lstrcpy) that depend on whether you define the symbol UNICODE.

    The C++ run-time library is even more flexible. For each string function, it supports a single-byte function (strlen), a multi-byte function (_mbslen), a wide-character function (wcslen), and a generic macro version (_tcslen) that depends on whether you define _UNICODE, _MBCS, or _SBCS. Notice that the C run-time library tests _UNICODE while Win32 tests UNICODE. We get around this by defining these to be equivalent in OLETYPE.H.
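    The generic-macro idea can be imitated in miniature without any Windows headers. The sketch below is not the contents of TCHAR.H; SKETCH_UNICODE, XCHAR, XTEXT, and xcslen are hypothetical names standing in for _UNICODE, _TCHAR, _T, and _tcslen.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

#ifdef SKETCH_UNICODE                 // hypothetical symbol, plays the role of _UNICODE
typedef wchar_t XCHAR;                // plays the role of _TCHAR
#define XTEXT(s) L##s                 // plays the role of _T(s)
inline std::size_t xcslen(const XCHAR* s) { return std::wcslen(s); }
#else
typedef char XCHAR;
#define XTEXT(s) s
inline std::size_t xcslen(const XCHAR* s) { return std::strlen(s); }
#endif

// Code written against the generic names compiles unchanged in either build.
inline std::size_t prefixLength() {
    const XCHAR* prefix = XTEXT("VB");
    return xcslen(prefix);
}
```

    Only the one symbol chosen at build time decides whether prefixLength works on char or wchar_t data; the calling code stays the same.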

    Win32 provides the MultiByteToWideChar and WideCharToMultiByte functions for converting between ANSI and Unicode. The C++ run-time library provides the mbstowcs and wcstombs functions for the same purpose. The Win32 functions are more flexible, but not in any way that matters for this article. We'll use the simpler run-time versions.
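    As a rough sketch of the run-time route, the helpers below wrap the standard mbstowcs and wcstombs calls. The names widen and narrow are invented for this example, the conversion follows whatever C locale is active (plain ASCII works in the default "C" locale), and a failed conversion is reported simply as an empty string.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <string>

// Multi-byte (ANSI) to wide. A wide string never needs more characters
// than the source has bytes, so strlen gives a safe upper bound.
std::wstring widen(const char* sz) {
    if (sz == nullptr) return std::wstring();
    const std::size_t cb = std::strlen(sz);
    std::wstring ws(cb, L'\0');
    const std::size_t cch = std::mbstowcs(&ws[0], sz, cb);
    if (cch == static_cast<std::size_t>(-1)) return std::wstring();
    ws.resize(cch);
    return ws;
}

// Wide to multi-byte. Each wide character needs at most MB_CUR_MAX bytes.
std::string narrow(const wchar_t* wsz) {
    if (wsz == nullptr) return std::string();
    const std::size_t cbMax = std::wcslen(wsz) * MB_CUR_MAX;
    std::string s(cbMax, '\0');
    const std::size_t cb = std::wcstombs(&s[0], wsz, cbMax);
    if (cb == static_cast<std::size_t>(-1)) return std::string();
    s.resize(cb);
    return s;
}
```

    Both helpers treat a null pointer as an empty string, which matches the BSTR convention described later in the article.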

    Types also come in Unicode and ANSI versions, but to add to the confusion, OLE adds its own types to those provided by Win32 and ANSI. Here are some of the types and type coercion macros you need to be familiar with:

    Type                    Description
    char                    An 8-bit signed character (an ANSI character).
    wchar_t                 A typedef to a 16-bit unsigned short (a Unicode character).
    CHAR                    The Win32 version of char.
    WCHAR                   The Win32 version of wchar_t.
    OLECHAR                 The OLE version of wchar_t.
    _TCHAR                  A generic character that maps to char or wchar_t.
    LPSTR, LPCSTR           A Win32 character pointer. The version with C is const.
    LPWSTR, LPCWSTR         A Win32 wide character pointer.
    LPOLESTR, LPCOLESTR     An OLE wide character pointer.
    LPTSTR, LPCTSTR         A Win32 generic character pointer.
    _T(str), _TEXT(str)     Identical macros to create generic constant strings.
    OLESTR(str)             OLE macro to create generic constant strings.

    Do you notice a little redundancy here? A little inconsistency? The sample code uses the Win32 versions of these types, except when there isn't any Win32 version or the moon is full.

    In normal C++ programming, you should use the generic versions of functions and types as much as possible so that your strings will work in either Unicode or ANSI builds. In this series, the String class hides a lot of the detail of making things generic. Generally it provides overloaded ANSI and Unicode versions of functions rather than using generic types. When you have a choice, you should use Unicode strings rather than ANSI or generic strings. You'll see how and why this nontypical coding style works later.

    Note   Versions of Visual C++ before 4.0 had a DLL called OLE2ANSI that automatically translated OLE Unicode strings to ANSI strings behind the scenes. This optimistic DLL made OLE programming simpler than previously possible. It was indeed pleasant to have the bothersome details taken care of, but performance-wise, users were living in a fool's paradise. OLE2ANSI is history now, although conditional symbols for it still exist in the OLE include files. The OLECHAR type, rather than the WCHAR type, was used in OLE prototypes so that it could be transformed into the CHAR type by this DLL. Do not define the symbol OLE2ANSI in the hopes that OLE strings will magically transform themselves into ANSI strings. There is no Santa Claus.

    What Is a BSTR?

    The BSTR type is actually a typedef, which in typical Windows include file fashion, is made up of more typedefs and defines. You can follow the twisted path yourself, but here's what it boils down to:

    typedef wchar_t * BSTR; 

    Hmmm. A BSTR is actually a pointer to Unicode characters. Does that look familiar? In case you don't recognize this, let me point out a couple of similar typedefs:

    typedef wchar_t * LPWSTR;
    typedef char * LPSTR;

    So if a BSTR is just a pointer to characters, how is it different from the null-terminated strings that C++ programmers know so well? Internally, the difference is that there's something extra at the start and end of the string. The string length is maintained in a long variable just before the start address being pointed to, and the string always has an extra null character after the last character of the string. This null isn't part of the string, and you may have additional nulls embedded in the string.
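    The layout described above can be imitated in portable C++ to make it concrete. This is a sketch under stated assumptions, not the OLE allocator: the Sketch names are hypothetical, char16_t stands in for the 16-bit OLE character, a 32-bit byte count stands in for the long prefix, and an uninitialized request is zero-filled here for simplicity. Real BSTRs must still go through the Sys functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

typedef char16_t* SketchBstr;          // hypothetical stand-in for BSTR

static const std::size_t kPrefix = sizeof(std::uint32_t);

// Allocate [byte length][cch characters][null] and return a pointer to the
// first character. Assumes src, when non-null, holds at least cch characters.
SketchBstr SketchAllocLen(const char16_t* src, std::uint32_t cch) {
    const std::size_t cb = static_cast<std::size_t>(cch) * sizeof(char16_t);
    unsigned char* block = static_cast<unsigned char*>(
        std::malloc(kPrefix + cb + sizeof(char16_t)));
    if (block == nullptr) return nullptr;
    const std::uint32_t byteLen = static_cast<std::uint32_t>(cb);
    std::memcpy(block, &byteLen, kPrefix);          // length sits before the data
    char16_t* chars = reinterpret_cast<char16_t*>(block + kPrefix);
    if (src != nullptr) std::memcpy(chars, src, cb);
    else std::memset(chars, 0, cb);
    chars[cch] = 0;                                  // extra null, not counted
    return chars;
}

// Read the byte length stored just before the characters; null counts as zero.
std::uint32_t SketchByteLen(const char16_t* bs) {
    if (bs == nullptr) return 0;
    std::uint32_t byteLen = 0;
    std::memcpy(&byteLen,
                reinterpret_cast<const unsigned char*>(bs) - kPrefix, kPrefix);
    return byteLen;
}

std::uint32_t SketchLen(const char16_t* bs) {
    return static_cast<std::uint32_t>(SketchByteLen(bs) / sizeof(char16_t));
}

void SketchFree(SketchBstr bs) {
    if (bs != nullptr) std::free(reinterpret_cast<unsigned char*>(bs) - kPrefix);
}
```

    Allocating five characters from u"ab\0cd" gives a string whose reported length is 5 even though a null-terminated scan would stop after "ab". That gap between the counted length and the first null is exactly what the rules later in the article are about.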

    That's the technical difference. The philosophical difference is that the contents of BSTRs are sacred. You're not allowed to modify the characters except according to very strict rules that we'll get to in a minute. OLE provides functions for allocating, reallocating, and destroying BSTRs. If you own an allocated BSTR, you may modify its contents as long as you don't change its size. Because every BSTR is, among other things, a pointer to a null-terminated string, you may pass one to any string function that expects a read-only (const) C string. The rules are much tighter for passing BSTRs to functions that modify string buffers. Usually, you can only use functions that take a string buffer argument and a maximum length argument.

    All the rules work on the honor system. A BSTR is a BSTR by convention. Real types can be designed to permit only legal operations. Later we'll define a C++ type called String that does its best to enforce the rules. The point is that BSTR servers are honor-bound to follow the rules so that BSTR clients can use strings without even knowing that there are rules.

    The BSTR System Functions

    My descriptions of the OLE BSTR functions are different from and, in my opinion, more complete than the descriptions in OLE documentation. I had to experiment to determine some behavior that was scantily documented, and I checked the include files to get the real definitions, so I am confident that my descriptions are valid and will work for you.

    For consistency with the rest of the article, the syntax used for code in this section has been normalized to use Win32 types such as LPWSTR and LPCWSTR. The actual prototypes in OLEAUTO.H use const OLECHAR FAR * (ignoring the equivalent LPCOLESTR types). The original reasons for using OLECHAR pointers rather than LPCWSTRs don't matter for this article.

    You need to read this section only if you want to fully understand how the String class (presented later) works. But you don't really need to understand BSTRs in order to use the string class.

    BSTR SysAllocString(LPCWSTR wsz);

    Given a null-terminated wide character string, allocates a new BSTR of the same length and copies the string to the BSTR. This function works for empty and null strings. If you pass in a null string, you get back a null string. You also get back a null string if there isn't enough memory to allocate the given string.


    // Create BSTR containing "Text"
    bs = SysAllocString(L"Text");

    BSTR SysAllocStringLen(LPCWSTR wsz, unsigned len);

    Given a null-terminated wide-character string and a maximum length, allocates a new BSTR of the given length and copies up to that length of characters from the string to the BSTR. If the length of the copied string is less than the given maximum length, a null character is written after the last copied character. The rest of the requested length is allocated, but not initialized (except that there will always be a null character at the end of the BSTR). Thus the string will be doubly null-terminated—once at the end of the copied characters and once at the end of the allocated space. If NULL is passed as the string, the whole length is allocated, but not initialized (except for the terminating null character). Don't count on allocated but uninitialized strings to contain null characters or anything else in particular. It's best to fill uninitialized strings as soon after allocation as possible.


    // Create BSTR containing "Te"
    bs = SysAllocStringLen(L"Text", 2);
    // Create BSTR containing "Text" followed by \0 and a junk character
    bs = SysAllocStringLen(L"Text", 6);

    BSTR SysAllocStringByteLen(LPSTR sz, unsigned len);

    Given a null-terminated ANSI string, allocates a new BSTR of the given length and copies up to that length of bytes from the string to the BSTR. The result is a BSTR with two ANSI characters crammed into each wide character. There is very little you could do with such a string, and therefore not much reason to use this function. It's there for string conversion operations such as Visual Basic's StrConv function. What you really want is a function that creates a BSTR from an ANSI string, but this isn't it (we'll write one later). The function works like SysAllocStringLen if you pass a null pointer or a length greater than the length of the input string.
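    To see why the result is of little use as text, the effect of cramming bytes into wide characters can be mimicked in portable C++. PackBytes is a hypothetical helper written for this illustration; it is not an OLE function and only imitates the byte-for-byte copy described above.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Copy cb raw bytes into 16-bit units, two bytes per unit, without any
// character conversion. An odd trailing byte shares its unit with a zero.
std::u16string PackBytes(const char* sz, std::size_t cb) {
    std::u16string packed((cb + 1) / 2, u'\0');
    if (sz != nullptr && cb > 0) std::memcpy(&packed[0], sz, cb);
    return packed;
}
```

    Packing the four bytes of "Text" produces two 16-bit units, neither of which is the character 'T'. The original bytes are still there if you read the memory back as char, which is why the approach only makes sense for byte-level conversion routines.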

    BOOL SysReAllocString(BSTR * pbs, LPWSTR wsz);

    Allocates a new BSTR of the same length as the given wide-character string, copies the string to the BSTR, frees the BSTR pointed to by the first pointer, and resets the pointer to the new BSTR. Notice that the first parameter is a pointer to a BSTR, not a BSTR. Normally, you'll pass a BSTR pointer with the address-of operator.


    // Reallocate BSTR bs as "NewText"
    f = SysReAllocString(&bs, L"NewText");

    BOOL SysReAllocStringLen(BSTR * pbs, LPWSTR wsz, unsigned len);

    Allocates a new BSTR of the given length, and copies as many characters as fit of the given wide-character string to the new BSTR. It then frees the BSTR pointed to by the first pointer and resets the pointer to the new BSTR. Often the new pointer will be the same as the old pointer, but you shouldn't count on this. You can give the same BSTR for both arguments if you want to truncate an existing BSTR. For example, you might allocate a BSTR buffer, call an API function to fill the buffer, and then reallocate the string to its actual length.


    // Create uninitialized buffer of length MAX_BUF.
    BSTR bsInput = SysAllocStringLen(NULL, MAX_BUF);
    // Call API function to fill the buffer and return actual length.
    cch = GetTempPathW(MAX_BUF, bsInput);
    // Truncate string to actual length.
    BOOL f = SysReAllocStringLen(&bsInput, bsInput, cch);

    unsigned SysStringLen(BSTR bs);

    Returns the length of the BSTR in characters. This length does not include the terminating null. This function will return zero as the length of either a null BSTR or an empty BSTR.


    // Get character length of string.
    cch = SysStringLen(bs);

    unsigned SysStringByteLen(BSTR bs);

    Returns the length of the BSTR in bytes, not including the terminating null. This information is rarely of any value. Note that if you look at the length prefix of a BSTR in a debugger, you'll see the byte length (as returned by this function) rather than the character length.

    void SysFreeString(BSTR bs);

    Frees the memory assigned to the given BSTR. The contents of the string may be completely freed by the operating system, or they may just sit there unchanged. Either way, they no longer belong to you and you had better not read or write to them. Don't confuse a deallocated BSTR with a null BSTR. The null BSTR is valid; the deallocated BSTR is not.


    // Deallocate a string.
    SysFreeString(bs);

    BSTR SysAllocStringA(LPCSTR sz);

    The same as SysAllocString, except that it takes an ANSI string argument. OLE doesn't provide this function; it's declared in BString.H and defined in BString.Cpp. Normally, you should only use this function to create Unicode BSTRs from ANSI character string variables or function return values. It works for ANSI string literals, but it's wasted effort because you could just declare Unicode literals and save yourself some run-time processing.


    // Create BSTR containing "Text".
    bs = SysAllocStringA(sz);

    BSTR SysAllocStringLenA(LPCSTR sz, unsigned len);

    The same as SysAllocStringLen, except that it takes an ANSI string argument. This is my enhancement function, declared in BString.H.


    // Create BSTR containing six characters, some or all of them from sz.
    bs = SysAllocStringLenA(sz, 6);

    The Eight Rules of BSTR

    Knowing what the BSTR functions do doesn't mean you know how to use them. Just as the BSTR type is more than its typedef implies, the BSTR functions require more knowledge than documentation states. Those who obey the rules live in peace and happiness. Those who violate them live in fear—plagued by the ghosts of bugs past and future.

    The trouble is, these rules are passed on in the oral tradition; they are not carved in stone. You're just supposed to know. The following list is an educated attempt—based on scraps of ancient manuscripts, and revised through trial and error—to codify the oral tradition. Remember, it is just an attempt.

    Rule 1: Allocate, destroy, and measure BSTRs only through the OLE API (the Sys functions).

    Those who use their supposed knowledge of BSTR internals are doomed to an unknowable but horrible fate in future versions. (You have to follow the rules if you don't want bugs.)

    Rule 2: You may have your way with all the characters of strings you own.

    The last character you own is the last character reported by SysStringLen, not the last non-null character. You may fool functions that believe in null-terminated strings by inserting null characters in BSTRs, but don't fool yourself.

    Rule 3: You may change the pointers to strings you own, but only by following the rules.

    In other words, you can change those pointers with SysReAllocString or SysReAllocStringLen. The trick with this rule (and rule 2) is determining whether you own the strings.

    Rule 4: You do not own any BSTR passed to you by value.

    The only thing you can do with such a string is copy it or pass it on to other functions that won't modify it. The caller owns the string and will dispose of it according to its whims. A BSTR passed by value looks like this in C++:

    void DLLAPI TakeThisStringAndCopyIt(BCSTR bsIn); 

    The BCSTR is a typedef that should have been defined by OLE, but wasn't. I define it like this in OleType.H:

    typedef const wchar_t * const BCSTR; 

    If you declare input parameters for your functions this way, the C++ compiler will enforce the law by failing on most attempts to change either the contents or the pointer.
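    A small portable sketch shows what that compiler enforcement looks like. The typedef is the one quoted above; the function CopyForLater is a made-up example, and it assumes the input has no embedded nulls because it copies up to the first null character.

```cpp
#include <string>

typedef const wchar_t * const BCSTR;   // as quoted from OleType.H above

// Hypothetical by-value input handler: all it may do is copy the string.
std::wstring CopyForLater(BCSTR bsIn) {
    // bsIn[0] = L'X';   // would not compile: the characters are const
    // bsIn = nullptr;   // would not compile: the pointer itself is const
    return bsIn ? std::wstring(bsIn) : std::wstring();
}
```

    Uncommenting either of the two marked lines turns a rule violation into a compile error, and the null check keeps the function safe when the caller passes a null BSTR.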

    The Object Description Language (ODL) statement for the same function looks like this:

    void WINAPI TakeThisStringAndCopyIt([in] BCSTR bsIn); 

    The BCSTR type is simply an alias for BSTR because MKTYPLIB doesn't recognize const. The [in] attribute allows MKTYPLIB to compile type information indicating the unchangeable nature of the BSTR. OLE clients such as Visual Basic will see this type information and assume you aren't going to change the string. If you violate this trust, the results are unpredictable.

    Rule 5: You own any BSTR passed to you by reference as an in/out parameter.

    You can modify the contents of the string, or you can replace the original pointer with a new one (using SysReAlloc functions). A BSTR passed by reference looks like this in C++:

    void DLLAPI TakeThisStringAndGiveMeAnother(BSTR * pbsInOut); 

    Notice that the parameter doesn't use BCSTR because both the string and the pointer are modifiable. In itself the prototype doesn't turn a reference BSTR into an in/out BSTR. You do that with the following ODL statement:

    void WINAPI TakeThisStringAndGiveMeAnother([in, out] BSTR * pbsInOut); 

    The [in, out] attribute tells MKTYPLIB to compile type information indicating that the string will have a valid value on input, but that you can modify that value and return something else if you want. For example, your function might do something like this:

    // Copy input string.
    bsNew = SysAllocString(*pbsInOut);
    // Replace input with different output.
    f = SysReAllocString(pbsInOut, L"Take me home");
    // Use the copied string for something else.
    UseString(bsNew);

    Rule 6: You must create any BSTR passed to you by reference as an out string.

    The string parameter you receive isn't really a string—it's a placeholder. The caller expects you to assign an allocated string to the unallocated pointer, and you'd better do it. Otherwise the caller will probably crash when it tries to perform string operations on the uninitialized pointer. The prototype for an out parameter looks the same as one for an in/out parameter, but the ODL statement is different:

    void WINAPI TakeNothingAndGiveMeAString([out] BSTR * pbsOut); 

    The [out] attribute tells MKTYPLIB to compile type information indicating that the string has no valid input but expects valid output. A container such as Visual Basic will see this attribute and will free any string assigned to the passed variable before calling your function. After the return the container will assume the variable is valid. For example, you might do something like this:

    // Allocate an output string.
    *pbsOut = SysAllocString(L"As you like it");

    Rule 7: You must create a BSTR in order to return it.

    A string returned by a function is different from any other string. You can't just take a string parameter passed to you, modify the contents, and return it. If you did, you'd have two string variables referring to the same memory location, and unpleasant things would happen when different parts of the client code tried to modify them. So if you want to return a modified string, you allocate a copy, modify the copy, and return it. You prototype a returned BSTR like this:

    BSTR DLLAPI TransformThisString(BCSTR bsIn); 

    The ODL version looks like this:

    BSTR WINAPI TransformThisString([in] BSTR bsIn); 

    You might code it like this:

    // Make a new copy.
    BSTR bsRet = SysAllocString(bsIn);
    // Transform copy (uppercase it).
    _wcsupr(bsRet);
    // Return copy.
    return bsRet;

    Rule 8: A null pointer is the same as an empty string to a BSTR.

    Experienced C++ programmers will find this concept startling because it certainly isn't true of normal C++ strings. An empty BSTR is a pointer to a zero-length string. It has a single null character to the right of the address being pointed to, and a long integer containing zero to the left. A null BSTR is a null pointer pointing to nothing. There can't be any characters to the right of nothing, and there can't be any length to the left of nothing. Nevertheless, a null pointer is considered to have a length of zero (that's what SysStringLen returns).

    When dealing with BSTRs, you may get unexpected results if you fail to take this into account. When you receive a string parameter, keep in mind that it may be a null pointer. For example, Visual Basic 4.0 makes all uninitialized strings null pointers. Many C++ run-time functions that handle empty strings without any problem fail rudely if you try to pass them a null pointer. You must protect any library function calls:

    if (bsIn != NULL) { wcsncat(bsRet, bsIn, SysStringLen(bsRet)); } 

    When you call Win32 API functions that expect a null pointer, make sure you're not accidentally passing an empty string:

    cch = SearchPath(wcslen(bsPath) ? bsPath : (BSTR)NULL, bsBuffer, wcslen(bsExt) ? bsExt : (BSTR)NULL, cchMax, bsRet, pBase); 
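    One way to make that pattern safe for both cases is a tiny helper that maps null and empty alike to a null pointer. NullIfEmpty is a hypothetical name used only for this sketch. Checking for null first matters, because calling wcslen on a null pointer is undefined behavior.

```cpp
// Return a null pointer for both a null string and an empty string,
// otherwise return the string unchanged. The null test comes first so
// that the first character is never read through a null pointer.
inline const wchar_t* NullIfEmpty(const wchar_t* s) {
    return (s != nullptr && *s != L'\0') ? s : nullptr;
}
```

    A call site could then pass NullIfEmpty(bsPath) wherever the API treats a null pointer as "use the default".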

    When you return strings (either in return values or through out parameters), keep in mind that the caller will treat null pointers and empty strings the same. You can return whichever is most convenient. In other words, you have to clearly understand and distinguish between null pointers and empty strings in your C++ functions so that callers can ignore the difference in Basic.

    In Visual Basic, a null pointer (represented by the constant vbNullString) is equivalent to an empty string. Therefore, the following statement prints True:

    Debug.Print vbNullString = "" 

    If you need to compare two strings in a function designed to be called from Visual Basic, make sure you respect this equality.
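    A comparison that respects this equality might look like the sketch below. SameString is a hypothetical helper, and it assumes plain wide strings without embedded nulls, since wcscmp stops at the first null character.

```cpp
#include <cwchar>

// Treat a null pointer exactly like an empty string before comparing,
// so that null and "" compare equal as they do in Visual Basic.
inline bool SameString(const wchar_t* a, const wchar_t* b) {
    if (a == nullptr) a = L"";
    if (b == nullptr) b = L"";
    return std::wcscmp(a, b) == 0;
}
```

    For strings that may contain embedded nulls, the counted lengths would have to be compared as well; that case is left out of this sketch.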

    Those are the rules. What is the penalty for breaking them? If you do something that's clearly wrong, you may just crash. But if you do something that violates the definition of a BSTR (or a VARIANT or SAFEARRAY, as we'll learn later) without causing an immediate failure, results vary.

    When you're debugging under Windows NT (but not under Windows 95) you may hit a breakpoint in the system heap code if you fail to properly allocate or deallocate resources. You'll see a message box saying "User breakpoint called from code at 0xXXXXXXX" and you'll see an int 3 instruction pop up in the disassembly window with no clue as to where you are or what caused the error. If you continue running (or if you run the same code outside the debugger or under Windows 95), you may or may not encounter a fate too terrible to speak of. This is not my idea of a good debugging system. An exception or an error dialog box would be more helpful, but something is better than nothing, which is what you get under Windows 95.

    A BSTR Sample

    The Test.Cpp module contains two functions that test BSTR arguments. They're the basis of much of what I just passed on as the eight rules. The TestBStr function exercises each of the BSTR operations. This function doesn't have any output or arguments, but you can run it in the C++ debugger to see exactly what happens when you allocate and reallocate BSTRs. The TestBStrArgs function tests some legal and illegal BSTR operations. The illegal ones are commented out so that the sample will compile and run. This article is about the String class, not raw BSTR operations, so I'll leave you to figure out these functions on your own. It's probably more interesting to study this code than to run it, but the BSTR button in the Cpp4VB sample program does call these functions.

    Before you start stepping through this sample with the Microsoft Developer Studio, you'll have to tell the debugger about Unicode. You must decide whether you want arrays of unsigned shorts to be displayed as integer arrays or as Unicode strings. The choice is pretty obvious for this project, but you'll be up a creek if you happen to have both unsigned short arrays and Unicode strings in some other project. The debugger can't tell the difference. You probably won't have this kind of problem if your compiler and debugger interpret wchar_t as an intrinsic type.

    To get the Microsoft debugger to display wchar_t arrays as Unicode, you must open the Tools menu and select Options. Click the Debug tab and enable Display Unicode Strings. (Note that this applies to Visual C++ versions 5 and later.)

    For Visual C++ version 5, you can use the su format specifier on all Unicode variables in your watch window (although this won't help you in the locals window). To get a little ahead of ourselves, you can add the following line to make the String class described in the next section display its internal BSTR member as a Unicode string:

    ; from BString.h
    String =<m_bs,su>

    Comments in AUTOEXP.DAT explain the syntax of format definitions. You don't need to do this for Visual C++ version 6.

    The String Class

    One reason for writing server DLLs is to hide ugly details from clients. We'll take care of all the Unicode conversions in the server so that clients don't have to, but handling those details in every other line of code would be an ugly way to program. C++ provides classes so that we can hide ugly details even deeper. The String class is designed to make BSTR programming look almost as easy as programming with Basic's String type. Unfortunately, structural problems (or perhaps lack of imagination on my part) make this worthy goal unachievable. Still, I think you'll find the String type useful.

    Note   I know that it's presumptuous of me to name my BSTR class wrapper String, my VARIANT class wrapper Variant, and my SAFEARRAY class wrapper SafeArray. Most vendors of classes have the courtesy to use some sort of class naming convention that avoids stealing the most obvious names from the user's namespace. But I've been using the Basic names for other OLE types through typedefs. Why not use them for the types that require classes? After all, the goal is to make my classes look and work as much like intrinsic types as possible. The include filename, however, is BString.H because the name string.h is already used by the C++ run-time library.

    Rather than getting into String theory, let's just plunge into a sample. The goal is to implement the Visual Basic GetTempFile function. If you read my earlier book, Hardcore Visual Basic, you may remember this function. It's a thin wrapper for the Win32 GetTempFileName function. Like most API functions, GetTempFileName is designed for C programmers. GetTempFile is designed for Basic programmers. You call it the obvious way:

    sTempFile = GetTempFile("C:\TMP", "VB") 

    The first argument is the directory where you want to create the temporary file, the second is an optional prefix for the file name, and the return value is a full file path. You might get back a filename such as C:\TMP\VB6E.TMP. This name is guaranteed to be unique in its directory. You can create the file and fill it with data without being concerned about it overwriting any other temporary file or even a permanent file that happens (incredibly) to have the same name.

    A String API Sample

    It would probably be easier to write the GetTempFile wrapper in Visual Basic, but we're going to do it in C++ to prove a point. Besides, some of the more complex samples we'll be looking at later really do need C++.

    The GetTempFile function is tested by the event handler attached to the Win32 button in the sample program. This code also tests other Win32 emulation functions and, when possible, the raw Win32 functions from which they are created. You can study the Basic code in Cpp4VB.Frm and the C++ code in Win32.Cpp.

    Here's the GetTempFile function:

    BSTR DLLAPI GetTempFile(
        BSTR bsPathName,
        BSTR bsPrefix
        )
    {
      try {
        String sPathName = bsPathName;
        String sPrefix = bsPrefix;
        String sRet(ctchTempMax);
        if (GetTempFileName(sPathName, sPrefix, 0, Buffer(sRet)) == 0) {
            throw (Long)GetLastError();
        }
        sRet.ResizeZ();
        return sRet;
      } catch(Long e) {
        ErrorHandler(e);
        return BNULL;
      }
    }

    Exception Handling

    This function, like many of the other functions you'll see in this series of articles, uses C++ exception handling. I'm not going to say much about this except that all the normal code goes in the try block and the catch block gets all the exceptions. The ErrorHandler function is purposely elsewhere so that we can change the whole error system just by changing this function. For now, we're only interested in the normal branch of the code.

    You can see that if the GetTempFileName API function returns zero, we throw an error having the value of the last API error. This will transfer control to the catch block where the error will be handled. What you can't see (yet) is that constructors and methods of the String class can also throw exceptions, and when they do, the errors will bubble up through as many levels of nested String code as necessary and be handled by this outside catch block. Instead of handling errors where they happen, you defer them to one place in client code.

    In other words, C++ exception handling works a lot like Visual Basic's error handling. Throwing an exception in C++ is like calling the Raise method of the Err object, and catching an exception is like trapping an error with Basic's On Error statement. We'll revisit exception handling in Article 4.
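
    The throw-anywhere, catch-in-one-place pattern can be sketched in portable C++. This is not the article's code: the Long typedef, the body of ErrorHandler, the g_lastError variable, and the helper functions are all assumptions made for illustration.

```cpp
#include <cassert>

typedef long Long;             // stand-in for the article's Long typedef

static Long g_lastError = 0;   // hypothetical place where the handler records errors

// Central handler: changing this one function changes the whole error policy.
void ErrorHandler(Long e) {
    assert(e != 0);            // an error code of zero would mean "no error"
    g_lastError = e;
}

// A nested helper that reports failure by throwing an error code.
int ParsePositive(int n) {
    if (n <= 0) throw Long(87);   // 87 is an arbitrary illustrative code
    return n;
}

// Exported-style wrapper: normal code in the try block, all failures in the catch block.
int SafeDouble(int n) {
    try {
        return 2 * ParsePositive(n);
    } catch (Long e) {
        ErrorHandler(e);
        return -1;                // sentinel, playing the role of BNULL
    }
}
```

    Calling SafeDouble(0) never reaches the multiplication. The exception thrown inside ParsePositive unwinds to the single catch block, which is the only place that knows about the error policy.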

    Initializing String Variables

    The first thing we do is assign the BSTR parameters to String variables. This is an unfortunate requirement. It would be much nicer if we could just pass String parameters. Unfortunately, a String variable requires more storage than a BSTR variable and you can't just use the two interchangeably. You'll understand this later when you get a brief look inside the String type, but for now just be aware that the performance and size overhead for this assignment is very low and well worth the cost, especially on functions that are larger than GetTempFile.

    The second thing we do is initialize the sRet variable, which will be the return value. The String type has several constructors and one of them creates an empty buffer of a given length. The constant ctchTempMax is the maximum Win32 file length—256 characters. That's a lot more than you'll need for a temporary filename on most disks, but we're being safe. If you watch the code in a debugger, you'll see that in debug builds the buffer is filled with an unusual padding character—the @ sign. The only purpose is so that you can see exactly what's going on. The data is left uninitialized in release builds.

    In the extremely unlikely case that you don't have 256 bytes of memory left in your system, the initialization will fail and throw an out-of-memory exception.

    Buffers for Output Strings

    Now we're ready to call the Win32 GetTempFileName function. The sPathName and sPrefix arguments provide the input, and the sRet argument is a buffer that the function will fill. There's only one problem. Strings are Unicode internally, but GetTempFileName will usually be GetTempFileNameA and will expect ANSI string arguments. Of course if you're building for Windows NT only, you can do a Unicode build and call GetTempFileNameW. Either way, the String type should do the right thing, and do it automatically.

    Well, that worthy goal isn't as easy as you might expect. It's not too bad for the input arguments because the String type has a conversion operator that knows how to convert the internal Unicode character string to a separate internal ANSI character string and return the result. The conversion just happens automatically. But the buffer in the sRet variable is a little more difficult because the conversion must be two-way.

    The API function has to get an ANSI string, and the ANSI string created by the function must be converted back to a Unicode BSTR. That's why we pass the Buffer object rather than passing the sRet argument directly. You might think from the syntax that Buffer is a function. Wrong! Buffer is a class that has a constructor taking a String argument. A temporary Buffer object is constructed on the stack when GetTempFileName is called. This temporary object is destroyed when GetTempFileName returns. And that's the whole point of the object. The destructor for the Buffer object forces the automatic Unicode conversion.

    Let's step through what happens if you're doing an ANSI build. You call the GetTempFileName function. The Buffer object is constructed on the stack by assigning the sRet String variable to an internal variable inside the Buffer object. But sRet contains a Unicode string and GetTempFileName expects an ANSI string. No problem. Buffer provides a conversion operator that converts the Unicode string to an ANSI string and returns it for access by the ANSI API function. GetTempFileName fills this buffer with the temporary filename. Now the ANSI copy is right, but the Unicode buffer is untouched. That's OK because when GetTempFileName returns, the destructor for the Buffer object will convert the ANSI copy of the buffer to Unicode in the real string buffer. Sounds expensive, but all these operations are actually done with inline functions and the cost is acceptable. You can check out the details in BString.H.

    Now, what happens during a Unicode build? Pretty much nothing. The Buffer constructor is called, but it just stores the BSTR pointer. The Buffer class also has a conversion operator that makes the buffer return a Unicode character buffer, but it just returns the internal BSTR pointer. The destructor checks to see if anything needs to be converted to Unicode, but nothing does. That's pretty much how the String type works throughout. It performs Unicode conversion behind the scenes only when necessary.
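
    The temporary-object trick can be sketched without any Win32 dependency. The following is a simplified illustration under stated assumptions, not the Buffer class from BString.H: NarrowBuffer, FakeAnsiApi, and CallThroughBuffer are hypothetical names, std::wstring stands in for the BSTR, and the conversion handles ASCII only.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical stand-in for an ANSI-only API that fills a caller-supplied buffer.
void FakeAnsiApi(char* out, std::size_t cap) {
    std::strncpy(out, "C:\\TMP\\VB6E.TMP", cap - 1);
    out[cap - 1] = '\0';
}

// Temporary adapter: hands out a narrow buffer, widens it back on destruction.
class NarrowBuffer {
public:
    explicit NarrowBuffer(std::wstring& target)
        : m_target(target), m_narrow(target.size(), '\0') {
        assert(!target.empty());   // caller must pre-size the wide buffer
    }
    ~NarrowBuffer() {
        // Copy back character by character (ASCII-only assumption).
        for (std::size_t i = 0; i < m_target.size(); ++i)
            m_target[i] = static_cast<wchar_t>(static_cast<unsigned char>(m_narrow[i]));
    }
    operator char*() { return &m_narrow[0]; }
private:
    std::wstring& m_target;
    std::string m_narrow;
};

// Calls the narrow API through a temporary NarrowBuffer and returns the wide result.
std::wstring CallThroughBuffer() {
    std::wstring wide(32, L'@');                   // pre-sized and padded
    FakeAnsiApi(NarrowBuffer(wide), wide.size());  // temporary dies at the end of this statement
    std::size_t end = wide.find(L'\0');
    if (end != std::wstring::npos) wide.resize(end);   // truncate at the first null
    return wide;
}
```

    The key point is timing: the destructor runs after the API call returns but before the next statement, so by the time the string is truncated the wide buffer already holds the converted result.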

    String Returns

    Let's continue with the rest of the function. After calling GetTempFileName, the GetTempFile function has the filename followed by a null in the sRet variable. That's what a C program would want, but it's not what a Basic program wants because the length of sRet is still 256. If you passed the variable back as is, you'd see a whole lot of junk characters following the null in the Visual Basic debugger. So we first call the ResizeZ method to truncate the string to its first null. Later we'll see a Resize method that truncates to a specified length. Unlike most API string functions, GetTempFileName doesn't return the length, so we have to figure it out from the position of the null.

    Finally, we return the sRet variable and exit from the GetTempFile function. The destructors for all three String variables are called. At this point, all the temporary ANSI buffers are destroyed, but the destructors don't destroy the internal Unicode strings because they're owned by the caller. Visual Basic will destroy those strings when it gets good and ready, and it wouldn't be very happy if our String destructor wiped them out first—especially the return value.

    If this seems hopelessly complicated, don't sweat it. You don't have to understand the implementation to use the String type. It's a lot simpler (and shorter) than doing the same operations with the BSTR type. You just have to understand a few basic principles.

    A String Warm-up

    Let's take a look at some of the specific things you can do with strings. The most important thing you'll be doing with them is passing them to and receiving them from functions. Here's how it's done in the mother of all String functions, TestString. TestString puts the String class through its paces, testing all the methods and operators and writing the results to a returned string for analysis.

    BSTR DLLAPI TestString( BCSTR bsIn, BSTR * pbsInOut, BSTR * pbsOut) { 

    This doesn't mean much without its ODL definition:

    [
    entry("TestString"),
    helpstring("Returns modified BSTR manipulated with String type"),
    ]
    BSTR WINAPI TestString([in] BSTR bsIn,
                           [in, out] BSTR * pbsInOut,
                           [out] BSTR * pbsOut);

    We talked about in and out parameters in Article 1, but at that point they were primarily documentation. With the BSTR type (as well as with VARIANT and SAFEARRAY) you had better code your DLL functions to match their ODL declarations. Otherwise, your disagreements with Visual Basic can have drastic consequences.

    The purpose of an in parameter (such as bsIn) is to pass a copy of a string for you to read or copy. It's not yours, so don't mess with the contents. The purpose of an in/out parameter (such as pbsInOut) is to pass you some input and receive output from you. Do what you want with it. Modify its contents, copy it to another String, or pass a completely different string back through its pointer. The purpose of an out parameter (such as pbsOut) is to receive an output string from you. There's nothing there on input, but there had better be something there (if only a NULL) when you leave the function, because Visual Basic will be counting on receiving something.

    String Constructors

    Once you receive your BSTRs, you need to convert them to String. You can also create brand new Strings to return through the return value or out parameters, or just to serve as temporary work space. The String constructors create various kinds of strings:

    // Constructors
    String sTmp;                    // Uninitialized
    String sIn = bsIn;              // In argument from BSTR
    String sCopy = *pbsInOut;       // In/out argument from BSTR
    String sString = sIn;           // One String from another
    String sChar(1, WCHAR('A'));    // A single character
    String sChars(30, WCHAR('B'));  // A filled buffer
    String sBuf(30);                // An uninitialized buffer
    String sWide = _W("Wide");      // From Unicode string
    String sNarrow = "Narrow";      // From ANSI string
    String sNative = _T("Native");  // From native string
    String sRet;

    Most of these speak for themselves, but notice the WCHAR casts and the use of the _W macro to initialize with a wide-character constant. When initializing Strings with constants, you should always use Unicode characters or strings. The String type will just have to convert your ANSI strings to Unicode anyway. Conversion is a necessary evil if you have an ANSI character string variable, but if you have a constant, you can save run-time processing by making it a Unicode string to start with.

    Unfortunately, you can't just initialize a String with a Unicode string like this:

    String s = L"Test"; 

    The problem is that the String type has a BSTR constructor and a LPCWSTR constructor, but what you'll get here is an LPWSTR and there's no separate constructor for that. There can't be a separate constructor because to C++, a BSTR looks the same as an LPWSTR, but of course internally it's very different. Any time you assign a wide character string to a String, you must cast it to an LPCWSTR so that it will go through the right constructor. The _W macro casts to LPCWSTR unobtrusively. C++ is a very picky language, and the String class seems to hit the edges of the pickiness in a lot of places. You have to develop very careful habits to use it effectively.
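
    The overload problem can be reproduced in portable C++ with a hypothetical MiniString type whose two constructors play the roles of the BSTR and LPCWSTR constructors. One caveat: in current standard C++ a wide literal is already const, so this sketch uses a writable array (as the assignment samples later do) to show the non-const overload being chosen. All names here are assumptions for illustration.

```cpp
#include <cassert>

enum Source { FromBstrLike, FromConstWide };

// Hypothetical wrapper with the same pair of overloads the article describes.
struct MiniString {
    Source source;
    // Plays the role of String(BSTR): takes a non-const pointer.
    MiniString(wchar_t* bs) : source(FromBstrLike) { assert(bs != nullptr); }
    // Plays the role of String(LPCWSTR): takes a const pointer.
    MiniString(const wchar_t* wsz) : source(FromConstWide) { assert(wsz != nullptr); }
};

// Reports which constructor a writable wide array selects, with or without a cast.
Source PickForArray(bool castToConst) {
    wchar_t wsz[] = L"Wide";
    if (castToConst) {
        MiniString s(static_cast<const wchar_t*>(wsz));   // forced through the const overload
        return s.source;
    }
    MiniString s(wsz);   // array decays to wchar_t*, the better match for the first overload
    return s.source;
}
```

    Without the cast the compiler silently picks the BSTR-like constructor, which is exactly the mistake the _W macro exists to prevent.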

    Note   Many of the problems in writing a String class are caused by Unicode confusion, and much of that confusion comes from the fact that in most current compilers the wchar_t type (called WCHAR in this article) is a typedef to an unsigned short rather than an intrinsic type. Overloaded functions are a critical part of designing a safe, convenient class in C++, but when overloading, C++ considers a typedef to be a simple alias rather than a unique type. A constructor overloaded to take a WCHAR type actually sees an unsigned short, which may conflict with other overloaded integer constructors. Debuggers won't know whether to display a WCHAR pointer as a string or as an array of unsigned shorts. Compile-time error messages will display confusing errors showing unsigned short rather than the character type you thought you were using. If you're fortunate enough to use a compiler that provides wchar_t as an intrinsic type, you won't see these problems. Unfortunately, Microsoft Visual C++ is not yet among those compilers.

    String Assignment

    As you already know (or had better find out soon if you're going to program in C++), initialization is a very different thing from assignment, even though the syntax may look similar. The String type provides the assignments you expect through the operator= function:

    // Assignment
    WCHAR wsz[] = L"Wide";
    char sz[] = "Narrow";
    sTmp = sIn;             // From another String variable
    sTmp = _W("Wide");      // From Unicode literal string
    sTmp = WCHAR('W');      // From Unicode character
    sTmp = LPCWSTR(wsz);    // From Unicode string variable
    sTmp = LPCSTR(sz);      // From ANSI string variable

    Again, you have to jump through some hoops to make sure your wide-character string assignments go through the proper const operator. C++ can't tell the difference between a wide-character string and a BSTR, so you have to tell it. Generally, you should avoid doing anything with ANSI character strings. The String type can handle ANSI strings, but you just end up sending a whole lot of zeros to and from nowhere. The only reason to use ANSI strings is to pass them to API functions or to C run-time functions, and you normally shouldn't do the latter either, because it's much more efficient to use the wcsxxx versions of the run-time functions.

    String Returns

    Let's skip all the cool things you can do to massage String variables and go to the end of TestString where you return your Strings:

        // Return through out parameters.
        sTmp = _W("...send me back");
        *pbsInOut = sTmp;
        *pbsOut = _B("Out of the fire");
        // Return value
        return sRet;
      } catch(Long err) {
        ErrorHandler(err);
        return BNULL;
      }
    }

    The first two lines assign a wide string to the temporary variable (sTmp) and then assign sTmp to the BSTR in/out parameter (pbsInOut). A BSTR conversion operator in the String type makes that second assignment possible. The assignment to pbsOut does the same thing, but uses the _B macro to create and destroy a temporary String variable on the stack. The _B macro uses a double typecast and token paste to hide the following atrocity:

    *pbsOut = String(LPCWSTR(L"Out of the fire")); 

    Finally, the return value is set to the sRet variable containing the string that we'll build in the next section. Internally, the return works exactly like the assignment to an out parameter and in fact calls the same BSTR conversion operator. Think of the Basic syntax:

    TestString = sRet 

    This gives you a better picture of what actually happens in a C++ return statement.


    Friday, August 22, 2008 3:06 PM

All replies

  • The other half of the article on BStrings:

    A String Workout

    There's a lot more to the String type than initialization and assignment. It's designed to be a full-featured string package, duplicating most of the functions you find in the C run-time library or in popular string classes such as MFC's CString. You won't find everything you could ever need, but conversion operators make it easy to pass Strings to run-time string functions. Or better yet, whenever you want to do something that isn't directly supported, add it to the library and send me the code. Be sure to use the wcsxxx version of run-time library calls.

    The TestString function uses the iostream library to build a formatted string that tests the String methods and operators, and then assigns that string to the return value. Here's how it works:

    ostrstream ostr;
    ostr << endcl << "Test length and resize:" << endcl;
    sTmp = _W("Yo!");
    ostr << "sTmp = _W(\"Yo!\"); // sTmp==\"" << sTmp << "\", "
         << "sTmp.Length()==" << sTmp.Length() << endcl;
    .
    .
    .
    ostr << ends;
    char * pch = ostr.str();
    sRet = pch;
    delete[] pch;

    The String class defines an iostream insertion operator (<<) so that you can easily insert ANSI character strings (converting from Unicode BSTRs) into an output stream. Notice that I also use a custom endcl manipulator rather than the standard endl manipulator. My version inserts a carriage return/line feed sequence rather than the standard line feed only.
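
    The article doesn't show how endcl is written, so here is one plausible way to define such a manipulator in standard C++. This is a sketch under assumptions: it uses std::ostringstream rather than the older ostrstream (so there is no buffer to delete), and BuildReport is a hypothetical helper, not part of the sample.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Stream manipulator that writes a carriage return/line feed pair.
// Unlike std::endl, this sketch does not flush the stream.
std::ostream& endcl(std::ostream& os) {
    return os << "\r\n";
}

// Builds a small report the way TestString does, one line per result.
std::string BuildReport(int length) {
    std::ostringstream ostr;
    ostr << "Test length and resize:" << endcl
         << "sTmp.Length()==" << length << endcl;
    return ostr.str();
}
```

    Any function with the signature std::ostream& f(std::ostream&) can be inserted into a stream this way, which is all a manipulator is.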

    You can study up on iostream and check the code if this isn't clear. The point here is to show off String features, not the iostream library. The rest of this section will show chunks of output that put the String type through its paces.

    Length Methods

    We'll start with the length-related methods:

    sTmp = _W("Yo!");   // sTmp=="Yo!", sTmp.Length()==3
    sTmp.Resize(20);    // sTmp=="Yo!", sTmp.Length()==20, sTmp.LengthZ()==3
    sTmp.ResizeZ();     // sTmp=="Yo!", sTmp.Length()==3

    The Length() method always returns the real length of the String regardless of nulls, while LengthZ() returns the length to the first null. Normally you'll Resize to truncate a string to a specified length, but you can also expand a string to create a buffer, then truncate back to the first null after passing the buffer to an API function.
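
    The same distinction between the real length and the length to the first null can be shown with std::wstring, which also stores its length separately from its contents. The free functions LengthZ and ResizeZ below are hypothetical equivalents written for this sketch, not the String methods themselves.

```cpp
#include <cassert>
#include <string>

// Length to the first null, or the full length if there is none.
std::size_t LengthZ(const std::wstring& s) {
    std::size_t pos = s.find(L'\0');
    return pos == std::wstring::npos ? s.size() : pos;
}

// Truncate to the first null, as you would after an API call fills a buffer.
void ResizeZ(std::wstring& s) {
    s.resize(LengthZ(s));
    assert(s.find(L'\0') == std::wstring::npos);   // no embedded nulls remain
}
```

    Growing a std::wstring with resize(20) pads it with null characters, so size() reports 20 while LengthZ still reports 3, mirroring the output above.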

    Empty Strings and Comparisons

    Internally, a String, like a BSTR, can be either a NULL string or an empty string, although Basic treats these the same. The String type provides methods to test and set this state:

    sTmp = "Empty";     // sTmp=="Empty",sTmp.IsEmpty==0, sTmp.IsNull==0
    sTmp.Empty();       // sTmp=="",sTmp.IsEmpty==1, sTmp.IsNull==0
    sTmp.Nullify();     // sTmp=="",sTmp.IsEmpty==1, sTmp.IsNull==1

    In the Basic tradition, the IsEmpty() method returns True if the string is either null or empty. That's generally all you need to know. Many C++ run-time functions can't handle null strings, and some API functions can't handle empty strings. So you can use the IsNull() function to identify a null string. There's no direct way to identify what C++ thinks of as an empty string, but the following expression will work:

    sTmp.IsEmpty() && !sTmp.IsNull() 
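
    The three states can be modeled with a bare pointer. WideRef below is a hypothetical type for illustration only: a null pointer is the null string, and a pointer to a lone terminator is the empty string.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical string handle used only to illustrate the two kinds of "empty".
struct WideRef {
    const wchar_t* p;
    // True only when there is no string at all.
    bool IsNull() const { return p == nullptr; }
    // Basic's view: null and zero-length are the same thing.
    bool IsEmpty() const { return p == nullptr || p[0] == L'\0'; }
    // The C++ view of "empty": a real string with no characters in it.
    bool IsEmptyButNotNull() const { return IsEmpty() && !IsNull(); }
};
```

    Checking IsNull first matters in IsEmpty, because reading p[0] through a null pointer is exactly the crash that makes many run-time functions unable to handle null strings.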

    Of course, you can test equality to empty or any other value with comparison operators. If sTmp is empty (in either sense), the String == operator will return True for (sTmp == BNULL) or for (sTmp == _B("")). Notice how cast macros are used to convert literals to Strings before comparison. You can also test comparisons with expressions such as:

    (sNarrow >= sWide) 

    String Indexing

    The String class provides an indexing operator to insert or extract characters in strings. For example:

    // sWide=="Wide", i==2, wch=='n'
    sWide[i] = wch;         // sWide=="Wine"
    wch = sWide[i - 1];     // wch=='i'
    sWide[0] = 'F';         // sWide=="Fine"

    There's nothing to prevent you from enhancing the index operator so that you could insert a string with it or even extract one. I'll leave that to you.

    String Concatenation

    Any string type worth its salt must be capable of concatenation, and String does it as you would expect—with the + and += operators. It can append characters or strings:

    // sChar=="A", sIn=="Send me in"
    sChar += sIn;           // sChar=="ASend me in"
    sChar += WCHAR('F');    // sChar=="ASend me inF"
    sChar += 'G';           // sChar=="ASend me inFG"
    sChar += _W("Wide");    // sChar=="ASend me inFGWide"
    sChar += "Narrow";      // sChar=="ASend me inFGWideNarrow"
    sTmp = sNarrow + sNative + _W("Slow") + "Fast" + WCHAR('C') + 'D';
    // sTmp=="NarrowNativeSlowFastCD"

    Some of the String methods look and act like Visual Basic string functions. Don't forget that Visual Basic strings are 1-based, not 0-based like C++ strings:

    sChar = sTmp.Mid(7, 6);     // sChar=="Native"
    sChar = sTmp.Mid(7);        // sChar=="NativeSlowFastCD"
    sChar = sTmp.Left(6);       // sChar=="Narrow"
    sChar = sTmp.Right(6);      // sChar=="FastCD"

    An additional challenge (left as an exercise for the reader) is to add the Visual Basic Mid statement to insert characters into a string.
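
    For C++ programmers used to 0-based substr calls, the 1-based Mid, Left, and Right functions map onto std::wstring as follows. These free functions are a sketch written for this illustration, not the String methods from BString.H.

```cpp
#include <cassert>
#include <string>

// 1-based Mid: start == 1 is the first character. Omitting count means "to the end".
std::wstring Mid(const std::wstring& s, std::size_t start,
                 std::size_t count = std::wstring::npos) {
    assert(start >= 1);                        // position 0 does not exist in Basic
    if (start > s.size()) return std::wstring();
    return s.substr(start - 1, count);         // shift to C++'s 0-based index
}

// The first count characters.
std::wstring Left(const std::wstring& s, std::size_t count) {
    return s.substr(0, count);
}

// The last count characters, or the whole string if it is shorter than count.
std::wstring Right(const std::wstring& s, std::size_t count) {
    return count >= s.size() ? s : s.substr(s.size() - count);
}
```

    The only real translation work is the start - 1 in Mid. Left and Right take counts, not positions, so they are the same in both numbering schemes.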

    String Transformations

    The String class has some transformation functions in both method and function versions:

    // sWide=="Fine"
    sWide.UCase();              // sWide=="FINE"
    sWide.LCase();              // sWide=="fine"
    sWide.Reverse();            // sWide=="enif"
    sChar = UCase(sWide);       // sChar=="ENIF", sWide=="enif"
    sChar = LCase(sWide);       // sChar=="enif", sWide=="enif"
    sChar = Reverse(sWide);     // sChar=="fine", sWide=="enif"

    There are also similar versions of the Trim, LTrim, and RTrim functions:

    sChar = Trim(sTmp);     // sChar=="Stuff", sTmp==" Stuff "
    sTmp.Trim();            // sTmp=="Stuff"

    String Searching

    I always found Basic's InStr function confusing, so I called the equivalent String function Find. It can find characters or strings, searching forward or backward, with or without case sensitivity.

    // sTmp="A string in a String in a String in a string"
    //      "12345678901234567890123456789012345678901234567890"
    i = sTmp.Find('S');                                 // Found at position: 15
    i = sTmp.Find('S', ffReverse);                      // Found at position: 27
    i = sTmp.Find('S', ffIgnoreCase);                   // Found at position: 3
    i = sTmp.Find('S', ffReverse | ffIgnoreCase);       // Found at position: 39
    i = sTmp.Find('Z');                                 // Found at position: 0
    i = sTmp.Find("String");                            // Found at position: 15
    i = sTmp.Find("String", ffReverse);                 // Found at position: 27
    i = sTmp.Find("String", ffIgnoreCase);              // Found at position: 3
    i = sTmp.Find("String", ffIgnoreCase | ffReverse);  // Found at position: 39
    i = sTmp.Find("Ztring");                            // Found at position: 0

    This method is 1-based, so C++ programmers may need to make a mental adjustment when using it.
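
    A Find with the same observable behavior can be sketched on top of std::string. This is an illustration under assumptions, not the String implementation: the flag values are invented here, the function works on narrow strings, and case folding uses the C locale's tolower.

```cpp
#include <cassert>
#include <cctype>
#include <string>

enum FindFlags { ffNone = 0, ffReverse = 1, ffIgnoreCase = 2 };   // values are assumptions

// Lowercase copy, used to implement case-insensitive matching.
static std::string Lower(std::string s) {
    for (std::size_t i = 0; i < s.size(); ++i)
        s[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(s[i])));
    return s;
}

// Returns the 1-based position of needle in hay, or 0 if it is not found.
int Find(std::string hay, std::string needle, int flags = ffNone) {
    assert(!needle.empty());
    if (flags & ffIgnoreCase) {
        hay = Lower(hay);
        needle = Lower(needle);
    }
    std::size_t pos = (flags & ffReverse) ? hay.rfind(needle) : hay.find(needle);
    return pos == std::string::npos ? 0 : static_cast<int>(pos) + 1;
}
```

    Returning 0 for "not found" is what makes the 1-based convention convenient: the result can be tested directly as a Boolean, just like Basic's InStr.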

    It's not too difficult to think of enhancements for the String type. Just look through the Basic and C++ run-time functions and add anything that looks interesting. It's easy to map existing C++ functions to a natural String format, and it's not much harder to write your own functions that provide string features that C++ lacks. But before you spend a lot of time on this, consider how String is used. In most DLLs, you'll be using the constructors, the conversion operators, and maybe a few logical or assignment operators. Basic already provides its own string functionality, so unless you want to replace it with your own more powerful string library, there's not much point in having a full-featured String type. On the other hand, maybe Basic does need a more powerful string library. Be my guest.

    How the String Class Works

    We've talked a lot about how to use the String type, but not much about how it is implemented. This article is not about how to write class libraries in C++, so I haven't explained the internals. However, you'll probably feel a little more comfortable using the class (and it will certainly be easier to enhance it) if you have some idea how String works, so let's take a look under the hood.

    class String
    {
        friend class Buffer;
    public:
    // Constructors
        String();
        String(const String& s);
        // Danger! C++ can't tell the difference between BSTR and LPWSTR. If
        // you pass LPWSTR to this constructor, you'll get very bad results,
        // so don't. Instead, cast to constant before assigning.
        String(BSTR bs);
        // Any non-const LPSTR or LPWSTR should be cast to LPCSTR or LPCWSTR
        // so that it comes through here.
        String(LPCSTR sz);
        String(LPCWSTR wsz);
        // Filled with given character (default -1 means uninitialized allocate).
        String(int cch, WCHAR wch = WCHAR(-1));
    // Destructor
        ~String();
    .
    .
    .
    private:
        BSTR m_bs;              // The Unicode data
        LPSTR m_pch;            // ANSI representation of it
        Boolean m_fDestroy;     // Destruction flag
        // Implementation helpers
        void Concat(int c, LPCWSTR wsz);
        void Destroy();
        void DestroyA();
    };

    String Construction

    A String consists of three pieces of data: the internal BSTR, a pointer to an array of ANSI characters, and a flag indicating how the String should be destroyed. You can see how this works by looking at a few constructors.

    inline String::String()
        : m_bs(SysAllocString(NULL)), m_pch(NULL), m_fDestroy(True)
    {
    }
    inline String::String(const String& s)
        : m_bs(SysAllocString(s.m_bs)), m_pch(NULL), m_fDestroy(True)
    {
    }
    // Convert BSTR to String.
    inline String::String(BSTR bs)
        : m_bs(bs), m_pch(NULL), m_fDestroy(False)
    {
    }
    inline String::String(LPCWSTR wsz)
        : m_bs(SysAllocString(wsz)), m_pch(NULL), m_fDestroy(True)
    {
    }
    inline String::String(LPCSTR sz)
        : m_bs(SysAllocStringA(sz)), m_pch(NULL), m_fDestroy(True)
    {
    }

    The constructors do nothing but initialize the three members. Notice that the ANSI string constructor (the one with the LPCSTR argument) uses the SysAllocStringA function to create a Unicode BSTR from an ANSI string. Another important point is that the constructors that create an internal BSTR set the m_fDestroy flag so that the BSTR will be destroyed by the destructor. The constructor that takes a BSTR just wraps an existing BSTR (usually one passed in as a parameter). The String doesn't own this BSTR and has no right to destroy it, so the m_fDestroy flag is set to false.
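
    The own-or-borrow rule is independent of BSTRs and can be sketched with ordinary heap buffers. Everything here is hypothetical and simplified: OwnedOrBorrowed stands in for String, new[] and delete[] stand in for SysAllocString and SysFreeString, and g_frees exists only so the behavior can be observed.

```cpp
#include <cassert>
#include <cwchar>

static int g_frees = 0;   // counts buffers released, for illustration only

// Hypothetical wrapper: owns buffers it allocates, merely borrows ones handed in.
class OwnedOrBorrowed {
public:
    // Borrow: wrap a writable buffer that belongs to the caller.
    explicit OwnedOrBorrowed(wchar_t* borrowed)
        : m_p(borrowed), m_fDestroy(false) {
        assert(borrowed != nullptr);
    }
    // Own: copy a constant string into a fresh allocation.
    explicit OwnedOrBorrowed(const wchar_t* src)
        : m_p(new wchar_t[std::wcslen(src) + 1]), m_fDestroy(true) {
        std::wcscpy(m_p, src);
    }
    // Free the buffer only if this object allocated it.
    ~OwnedOrBorrowed() {
        if (m_fDestroy) { delete[] m_p; ++g_frees; }
    }
    const wchar_t* get() const { return m_p; }
private:
    OwnedOrBorrowed(const OwnedOrBorrowed&);              // copying omitted in this sketch
    OwnedOrBorrowed& operator=(const OwnedOrBorrowed&);
    wchar_t* m_p;
    bool m_fDestroy;
};
```

    When both kinds of object go out of scope, only the owning one releases memory. The borrowed buffer is left intact for its real owner, which is what Visual Basic expects of the BSTRs it passes in.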

    String Translation

    The m_pch member is initialized to null, and it stays that way until someone asks to translate the BSTR to an ANSI string. The translation mechanism is the LPCSTR conversion operator, which is called automatically whenever you pass a String to a parameter that expects an LPCSTR. It looks like this: 

    String::operator LPCSTR()
    {
        if ((m_pch == NULL) && (m_bs != NULL)) {
            // Check size.
            unsigned cmb = wcstombs(NULL, m_bs, SysStringLen(m_bs)) + 1;
            // Allocate string buffer and translate ANSI string into it.
            m_pch = new CHAR[cmb];
            wcstombs(m_pch, m_bs, cmb);
        }
        return m_pch;
    }

    If the internal BSTR is not NULL, this operator allocates an ANSI buffer of the proper size and copies a translated string to it. This ANSI string will be maintained for reuse by subsequent calls to LPCSTR until the String is destroyed or until some member function changes the contents of the internal BSTR, thus invalidating the ANSI buffer. Any such member should destroy or update the ANSI buffer.
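
    The convert-once, invalidate-on-change idea can be shown in portable C++. CachedNarrow is a hypothetical class written for this sketch. It sizes its scratch buffer from MB_CUR_MAX instead of calling wcstombs with a null destination, and it assumes the text is convertible in the current C locale (plain ASCII always is).

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical class: the wide text is the source of truth, the narrow text a lazy cache.
class CachedNarrow {
public:
    explicit CachedNarrow(const std::wstring& w)
        : m_wide(w), m_valid(false), m_conversions(0) {}

    // Build the narrow copy on first request, then reuse it until the text changes.
    const char* narrow() {
        if (!m_valid) {
            // Worst case: every wide character needs MB_CUR_MAX bytes, plus a terminator.
            std::vector<char> buf(m_wide.size() * MB_CUR_MAX + 1, '\0');
            std::size_t n = std::wcstombs(&buf[0], m_wide.c_str(), buf.size());
            assert(n != static_cast<std::size_t>(-1));   // unconvertible character
            m_cache.assign(&buf[0], n);
            m_valid = true;
            ++m_conversions;
        }
        return m_cache.c_str();
    }

    // Any mutation must invalidate the cache, as DestroyA does in the String class.
    void assign(const std::wstring& w) { m_wide = w; m_valid = false; }

    int conversions() const { return m_conversions; }
private:
    std::wstring m_wide;
    std::string m_cache;
    bool m_valid;
    int m_conversions;
};
```

    The conversions counter is there only to make the caching visible: two calls to narrow() in a row cost one conversion, and a call after assign() costs one more.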

    String Destruction

    Here are the String destruction functions:

    void String::Destroy()
    {
        if (m_fDestroy) {
            SysFreeString(m_bs);
        }
        DestroyA();
    }
    inline String::~String()
    {
        Destroy();
    }
    // Invalidate ANSI buffer.
    inline void String::DestroyA()
    {
        delete[] m_pch;
        m_pch = NULL;
    }

    The destruction job is broken into parts so that member functions can destroy the whole String or just invalidate the ANSI buffer.

    For example, here's how the operator= members handle Unicode and ANSI strings.

    const String& String::operator=(LPCSTR sz)
    {
        Destroy();
        m_bs = SysAllocStringA(sz);
        return *this;
    }
    const String& String::operator=(LPCWSTR wsz)
    {
        DestroyA();
        if (SysReAllocString(&m_bs, wsz) == 0) throw E_OUTOFMEMORY;
        return *this;
    }

    One way or another, an operator= function must replace the previous contents of the object with the new contents being assigned. The LPCSTR version destroys the whole member and creates a new one, while the LPCWSTR version just destroys the ANSI buffer and reallocates the BSTR member. The only reason for this difference is that I didn't write a SysReAllocStringA function.

    A String Method

    Once you get construction, destruction, and ANSI conversion figured out, the methods and overloaded operators are easy. Most of them are simply calls to the wcs versions of C++ run-time functions. For example, let's look at the UCase method, which comes in two versions.

    Here's the member function:

    const String & String::UCase()
    {
        DestroyA();     // Invalidate ANSI buffer.
        wcsupr(m_bs);
        return *this;
    }

    It simply calls the wcsupr function (which you may know as strupr) to modify the internal BSTR member. Here's the function version:

    String UCase(String& s)
    {
        String sRet = s;
        sRet.UCase();
        return sRet;
    }

    The version above uses a String argument (which it leaves unchanged) and returns a modified String copy. Its implementation creates a new string and uses the method version of UCase on it.
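
    The method-plus-function pairing works the same way in portable C++. In this sketch the hypothetical Text struct stands in for String, and std::towupper is applied per character in place of wcsupr.

```cpp
#include <cassert>
#include <cwctype>
#include <string>

// Hypothetical text holder with an in-place UCase method.
struct Text {
    std::wstring value;
    // Method version: modifies this object and returns it.
    Text& UCase() {
        for (std::size_t i = 0; i < value.size(); ++i)
            value[i] = static_cast<wchar_t>(std::towupper(value[i]));
        return *this;
    }
};

// Function version: copies, transforms the copy, and leaves the argument alone.
Text UCase(const Text& s) {
    Text copy = s;
    copy.UCase();
    return copy;
}
```

    Writing the function in terms of the method keeps the real work in one place, so a fix to the method automatically applies to both forms.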

    A Challenge

    Before I leave you to figure out the rest of the String internals, let me pose a challenge. If the String class had only one member, m_bs, it would be the same size and have the same contents as a BSTR parameter. You could pass a BSTR in from Basic and receive a String in your C++ DLL. But this still wouldn't save you from doing Unicode conversion or from cleaning up correctly in destructors. You'd need to use the equivalents of m_pch and m_fDestroy without actually putting them in the class. How are you going to manage that?

    Well, here's an idea. Create a static class member that is an array of data structures, each of which contains a buffer for ANSI conversion and a flag for destruction. Every time you create a String object, you insert one of these items into the array. When you need to use the ANSI buffer, you look up the item in the array and allocate or use the ANSI buffer. You'll probably want to insert each item in sorted order (maybe by the value of the BSTR pointer) for faster lookup. Whenever you destroy a String, you must find and remove its corresponding data structure from the array.

    Performance would suffer, but probably not by much because you're not going to have that many String variables active at any one time. You would end up with a more intuitive String class. From what I understand, this is how the old OLE2ANSI DLL used to work. Is it worth the extra work? I didn't think it was for this article, but perhaps it would be for your projects.
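
    One possible shape for that side table, sketched in portable C++. This is only a starting point under stated assumptions, not a solution to the challenge: SlimString and Extra are hypothetical names, std::map replaces the hand-sorted array, and the sketch ignores the case of two wrappers sharing one pointer.

```cpp
#include <cassert>
#include <map>
#include <string>

// Per-string bookkeeping kept outside the object, keyed by the string pointer.
struct Extra {
    std::string ansi;   // cached narrow copy
    bool destroy;       // whether the wrapper owns the string
};

// Hypothetical one-pointer wrapper: the same size as the raw pointer it holds.
class SlimString {
public:
    SlimString(const wchar_t* p, bool owns) : m_p(p) {
        assert(p != nullptr);
        Extra e;
        e.destroy = owns;
        table()[m_p] = e;                      // register on construction
    }
    ~SlimString() { table().erase(m_p); }      // unregister on destruction
    Extra& extra() { return table()[m_p]; }    // look up this object's bookkeeping
    static std::size_t live() { return table().size(); }
private:
    SlimString(const SlimString&);             // copying omitted in this sketch
    SlimString& operator=(const SlimString&);
    static std::map<const wchar_t*, Extra>& table() {
        static std::map<const wchar_t*, Extra> t;   // kept sorted by pointer value
        return t;
    }
    const wchar_t* m_p;
};
```

    Because the object holds nothing but the pointer, it has the same layout as the raw string parameter. The cost is a map lookup every time the ANSI copy or the destruction flag is needed.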

    Friday, August 22, 2008 3:08 PM