Internals of String type and String.Intern()

  • Question

  • I've lately been analyzing (and trying to improve) performance of a string intensive application. When checking the way strings are stored in memory in the .NET framework I came across this funny memory layout for a string reference in memory:
    (example compiled on an IA32 with VS2010SP1 targeting .NET 3.5)

    79330c6c    00000005    00000004    0054 0065 0073 0074    0000 0000           80000000
    V-Table(?)  Length+1    Length      T    e    s    t       null-char           ??? unclear
    for String                                                 (4-byte aligned?)

    I'd like to know the following:

    -  Why do we have the string's Length and additionally seemingly Length+1 stored in the structure? Shouldn't the latter be computable? Am I getting something wrong?

    -  How are strings aligned in memory, i. e. how many trailing null-chars do we get for a length of 3, 4, 5, 6, 7, 8 chars?

    -  What is the ominous 0x80000000? I didn't find this when e. g. checking an array of strings built up at runtime. Is that a garbage collection pin or something? Just random data? ( I never found anything else but 0x00000000 and 0x80000000 in this place...)

    -  Since strings are immutable, I would actually have expected to find an Int32 for the hash code of the string in the structure, for example instead of Length+1. However, it seems GetHashCode() has no early exit for strings, i. e. hash codes are not cached. Wouldn't caching them be a good idea?


    A second mystery to me is the way String.Intern works. The documentation tells us:

    If you are trying to reduce the total amount of memory your application allocates, keep in mind that interning a string has two unwanted side effects. First, the memory allocated for interned String objects is not likely to be released until the common language runtime (CLR) terminates. The reason is that the CLR's reference to the interned String object can persist after your application, or even your application domain, terminates.

    Could somebody please be less cryptic and define the "not likely" and the "can" in the documentation quoted above? I would like clarification on the circumstances under which an interned string is actually released once it is no longer used or referenced by the application/process that interned it. Will it ever be released? Can we have details for each framework version in case the behaviour differs? (Regardless of the difference concerning String.Empty; that is understood.)

    When the application finally terminates, under which circumstances will the string still live on? Will it live on forever? Will it live only if a second application/process references the same interned text? Is it ever collected at all?

    In my evaluation tests, at least, it seemed the memory is never collected again during the lifetime of the application, which makes interning completely useless for our use cases. It's also unclear why the framework should behave this way. Any good reasons?
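One rough way to probe this empirically is to keep only a WeakReference to a freshly interned string and then force a collection. The sketch below uses a hypothetical helper class, `InternProbe` (not a framework type). It is an experiment rather than documented behaviour, and results may differ between runtime versions and between Debug and Release builds.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical helper for the experiment; not part of the framework.
public static class InternProbe
{
    // Creates a unique string, optionally interns it, and hands back only a weak reference.
    // NoInlining keeps the strong local reference out of the caller's stack frame.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static WeakReference Create(bool intern)
    {
        string s = Guid.NewGuid().ToString();   // unique, so not yet in the intern table
        if (intern)
        {
            s = string.Intern(s);
        }
        return new WeakReference(s);
    }

    // Returns true if the string is still reachable after a forced full collection.
    public static bool SurvivesCollection(bool intern)
    {
        WeakReference weak = Create(intern);
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
        return weak.IsAlive;
    }
}
```

Calling `InternProbe.SurvivesCollection(true)` and `InternProbe.SurvivesCollection(false)` side by side shows whether interning alone keeps a string alive once the application has dropped every reference to it.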

    Anyway, after evaluating this, I had the idea of simply implementing a string table of my own for interning purposes, using HashSet<WeakString>, where WeakString is a struct holding a WeakReference to a string plus an int for the string's hash code. That turned out to be impossible, because HashSet<T> offers nothing resembling a Get(T item) method. This may not matter for plain value types, but for the intended purpose it makes HashSet<T> useless, even though internally it definitely has everything it takes. Whatever references you put into a HashSet, you will never be able to retrieve them as quickly as HashSet lets you check whether an element is contained. This could (*cough*) be improved in a future .NET version.

    Any good ideas for implementing an application level string pool without wasting too much memory are welcome. To me it seemed one will have to reinvent the wheel here because HashSet lacks a method that probably could be introduced in a few keystrokes...
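One possible workaround for the missing Get(T item) is to key a Dictionary on the string's hash code and keep weak references in small buckets. The following is a minimal sketch under that assumption: `WeakStringPool`, `Intern` and `Trim` are hypothetical names, the class is not thread-safe, and it uses the non-generic WeakReference so that it also compiles against .NET 3.5.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical application-level pool; not a framework type and not thread-safe.
public sealed class WeakStringPool
{
    // Buckets keyed by hash code. Each bucket holds weak references,
    // so pooled strings can still be garbage collected when nobody uses them.
    private readonly Dictionary<int, List<WeakReference>> buckets =
        new Dictionary<int, List<WeakReference>>();

    // Returns the pooled instance equal to value, or pools value itself.
    public string Intern(string value)
    {
        if (value == null) throw new ArgumentNullException("value");

        int hash = value.GetHashCode();
        List<WeakReference> bucket;
        if (!buckets.TryGetValue(hash, out bucket))
        {
            bucket = new List<WeakReference>();
            buckets.Add(hash, bucket);
        }

        // Walk backwards so dead entries can be removed while scanning.
        for (int i = bucket.Count - 1; i >= 0; i--)
        {
            string pooled = bucket[i].Target as string;
            if (pooled == null)
            {
                bucket.RemoveAt(i);   // target has been collected
            }
            else if (string.Equals(pooled, value, StringComparison.Ordinal))
            {
                return pooled;
            }
        }

        bucket.Add(new WeakReference(value));
        return value;
    }

    // Drops dead weak references and empty buckets; call this occasionally.
    public void Trim()
    {
        List<int> emptyKeys = new List<int>();
        foreach (KeyValuePair<int, List<WeakReference>> pair in buckets)
        {
            pair.Value.RemoveAll(w => !w.IsAlive);
            if (pair.Value.Count == 0) emptyKeys.Add(pair.Key);
        }
        foreach (int key in emptyKeys) buckets.Remove(key);
    }
}
```

Each pooled string costs one WeakReference plus a list slot, so this only pays off when duplicates are common or the strings are reasonably long. A server would also need a lock around `Intern` and `Trim`.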




    • Edited by Xaver111 Wednesday, June 5, 2013 3:40 PM sketched out some more details, enhanced questions
    Wednesday, June 5, 2013 1:23 PM

Answers

  • 6c 0c 33 79 05 00 00 00 04 00 00 00 54 00 65 00 73  00 74 …

    The first four bytes (6c 0c 33 79, shown in bold in the original post) are a type handle (which is an IntPtr, i.e. a native-size integer). Since you are dealing with a string object, it should be the type handle of the System.String type. To verify this, repeat your experiment, then compare those bytes against the value of typeof(string).TypeHandle.Value. The two should match.
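A small sketch for obtaining the value to compare against; `TypeHandleDump` is a hypothetical helper name. Keep in mind that the memory window shows bytes in little-endian order, so 0x79330c6c appears there as 6c 0c 33 79.

```csharp
using System;

// Hypothetical helper; prints the value to compare with the first bytes of the dump.
public static class TypeHandleDump
{
    // Formats the System.String type handle as hexadecimal.
    public static string StringTypeHandleHex()
    {
        IntPtr handle = typeof(string).TypeHandle.Value;
        return handle.ToInt64().ToString("X");
    }
}
```

For example, `Console.WriteLine(TypeHandleDump.StringTypeHandleHex());` run inside the debugged process gives the number to look for at the start of the string object.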

    The CLI specification says (but I might be simplifying here) that value types are simple bit patterns, while class types are "self-describing". As far as I can tell, this means for one thing that with instances of class types, a type token gets stored along with their actual value, while that type token will be missing for value types. Since System.String is a class type, the type token is present. But if you repeated your experiment with a struct instance (let's say a System.Int32), you should find the type handle to be missing.


    Finally, if you use Reflection on the System.String type:

    //using System.Reflection;
    typeof(string).GetFields(BindingFlags.Instance | BindingFlags.NonPublic)

    You will find that it has three private fields: System.Int32 m_arrayLength, System.Int32 m_stringLength, System.Char m_firstChar. This is obviously not all that makes up a string value (I don't know about the rest), but it might be a good indication what the two numbers after the type handle mean.
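As a runnable version of that reflection call, the sketch below lists whatever private instance fields the current runtime reports. `StringFields` is a hypothetical helper name, and the field names and count are an implementation detail that varies between runtime versions.

```csharp
using System;
using System.Reflection;

// Hypothetical helper; lists the private instance fields of System.String.
public static class StringFields
{
    // Returns one "FieldType Name" entry per private instance field.
    public static string[] Describe()
    {
        FieldInfo[] fields = typeof(string).GetFields(
            BindingFlags.Instance | BindingFlags.NonPublic);

        string[] lines = new string[fields.Length];
        for (int i = 0; i < fields.Length; i++)
        {
            lines[i] = fields[i].FieldType.FullName + " " + fields[i].Name;
        }
        return lines;
    }
}
```

Printing the result with `foreach (string line in StringFields.Describe()) Console.WriteLine(line);` once per target framework makes layout differences between versions visible.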


    At least one of the trailing zero values seems to be related to System.String being a class type. I do not observe any terminating zeros for value types.


    Finally, I suspect that the 0x80000000 is not part of the string object at all.

    • Edited by stakx Friday, June 14, 2013 6:28 AM expanded
    • Marked as answer by Mike FengModerator Monday, June 24, 2013 2:44 PM
    Thursday, June 13, 2013 11:02 PM

All replies

  • Here's a good article that explains all the practical aspects of string interning.  http://broadcast.oreilly.com/2010/08/understanding-c-stringintern-m.html

    Wednesday, June 5, 2013 2:59 PM
  • You might enjoy reading http://blogs.msdn.com/b/ericlippert/archive/2011/07/19/strings-immutability-and-persistence.aspx

    1. You're probably getting something wrong. When I dump a string object in Windbg I see that it has three fields: m_stringLength, m_firstChar, Empty

    2. However many the compiler feels like putting there, and it's not guaranteed to ever be the same number twice. The compiler is free to do whatever it wants to internally.

    3. I don't know what the ominous value is

    4. Caching the hash would be a good idea if most strings had GetHashCode called on them, but most don't. Since the hash of the average string is never calculated, and given how many string instances may exist in the average program, it would be a giant use of memory that most would consider "bloat".

    5. I believe that if the .NET team hasn't made the internals of string interning public, it may be because they want to be free to change the implementation in the future. To speculate, I wouldn't be surprised if there are multiple underlying collections for interned strings, some associated with the AppDomain and others with the process.

    6. Pools are good ideas when you want to repurpose objects that are either 1) created really frequently or 2) expensive to create. But you can't repurpose a string, because strings are immutable. I imagine that a StringCollection will do what you want.
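Regarding point 4: an application that does hash the same strings over and over can pay for the cache only where it needs it, by wrapping those strings. The sketch below assumes a hypothetical `HashedString` wrapper (not a framework type) that computes the hash once at construction.

```csharp
using System;

// Hypothetical wrapper; caches the hash code of the wrapped string.
public sealed class HashedString : IEquatable<HashedString>
{
    private readonly string value;
    private readonly int hash;

    public HashedString(string value)
    {
        if (value == null) throw new ArgumentNullException("value");
        this.value = value;
        this.hash = value.GetHashCode();   // computed once, reused on every lookup
    }

    public string Value
    {
        get { return value; }
    }

    public bool Equals(HashedString other)
    {
        // Compare the cached hashes first; only equal hashes need a character comparison.
        return other != null
            && hash == other.hash
            && string.Equals(value, other.value, StringComparison.Ordinal);
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as HashedString);
    }

    public override int GetHashCode()
    {
        return hash;
    }
}
```

Used as a dictionary key, such a wrapper trades one extra object per string for skipping the hash computation on repeated lookups.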

    Wednesday, June 5, 2013 9:26 PM
  • I think I was being too unspecific.

    Some further tests showed that it actually is different in at least .NET 3.5 and .NET 4.0. Consider the following program:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    
    namespace StringTest
    {
        class Program
        {
            static void Main(string[] args)
            {
                string s1 = "Test";
                StringBuilder sb = new StringBuilder(4);
                string s2 = sb.Append('T').Append('e').Append('s').Append('t').ToString();
                string s3 = string.Intern(s2);
                sb.Length = 0;
                string s4 = sb.Append('t').Append('e').Append('s').Append('t').ToString();
                string s5 = sb.Append('t').Append('e').Append('s').Append('t').ToString();
                string s6 = string.Intern(s4);
                Console.WriteLine("s1 <--> s3: {0}", object.ReferenceEquals(s1, s3));
                Console.WriteLine("s4 <--> s6: {0}", object.ReferenceEquals(s4, s6));
                Console.ReadKey(true);
            }
        }
    }

    Compile it targeting .NET 3.5, platform Any CPU, in Release mode.

    You'll need a memory window to verify the results from my post. (In VS2010, set a breakpoint, start debugging, then open e.g. Debug->Windows->Memory->Memory1.) You can enter a variable name there to jump to its memory address. In the above example, viewing s1 after initialization will yield something similar to:

    6c 0c 33 79 05 00 00 00 04 00 00 00 54 00 65 00 73  00 74 00 00 00 00 00 00 00 00 00

    with the first four bytes (bold in the original post) being different; the rest should be the same for the "Test" string. No guarantee for the last four bytes: these are the ones I called 'ominous' in my first post, since in that example they weren't 0-byte fillers but seemed to contain 0x80000000.

    In .NET 4.0 I was unable to find a trace of the second four bytes (i. e. Length+1) in the example in this post. Only the 'true' length of 4 characters is present there it seems.

    I would really like to have someone from Microsoft comment on the questions in my original post, since it's not quite clear what is going on there and the previous replies have been of little or no help.

    I'd like to comment on jader3rd's answers:

    The article you posted is at least interesting. However, concerning the BSTR theory there, I was unable to find concrete evidence supporting it. As far as I know, BSTRs are prefixed with a byte(!) count before their actual character data (which are similar or identical to two-byte .NET Chars). None of the numbers found in memory correlates to a byte count, which implies that marshaling a .NET string to COM and vice versa most probably involves some form of conversion anyway.

    1. Not sure what you looked at, but see the amendments in this reply.

    2. I can live with that, though I'm still interested how it's done. Since assemblies targeting any CPU should be 'portable', the layout shouldn't be random but is probably thought out well.

    3. In the above mini-example it didn't seem to occur. I might investigate this further at a later point in time.

    4. I agree in some way, but was rather astonished by the bloating with Length+1. That's when I thought, rather use those four bytes to store the hash code...

    5. I can't accept this. The functionality must be documented, because you need to know exactly what you are doing when using String.Intern(). You might well harm your application by interning instead of helping it.

    6. I imagine that StringCollection is a deprecated type that can be replaced by List<string> at any time. It wouldn't be helpful at all. Our application is server based, reading in more or less volatile input (let's call that a script) from users and computing things for them, more often than not based on other running scripts (maybe even from different users). The input will contain strings needed for computations and can be characterized like this: they are likely to be identifiers, users are likely to use the same identifiers for certain things, and there are likely to be thousands of different identifiers (maybe millions). It's very probable that pooling these will save loads of memory on the server, since in at least half of the cases identifiers will be long-lived (as are the scripts containing them), but eventually they will be done with and should be collected effectively to make room for other scripts that may be entirely different. So we simply need a reasonable solution to save memory here, and if String.Intern() never frees anything while our server application is planned to run 24/7/52, it's just not the thing for us to use, and we need clarification on this.
    An additional benefit, hopefully, is that comparisons of pooled strings will be much faster, since the references will be equal and such comparisons will occur thousands of times.







    • Edited by Xaver111 Monday, June 10, 2013 3:07 PM typos, better sketch-out
    Monday, June 10, 2013 9:48 AM
  • The goal of string interning is for the string to never be freed. The MSDN docs you quoted do make it sound like someone can force some of the strings to be freed, but it sounds like you have to try to force that to happen. I think that you should operate on the assumption that an interned string will never be freed so long as the executable is running, since that's most likely what is going to happen.
    Monday, June 10, 2013 6:39 PM
  • The documentation of ICorProfilerInfo2::GetStringLayout suggests that the Length+1 field was the buffer length and was removed in CLR 4.0.  This change was listed in .NET Framework 4 Migration Issues and in CLR V4: Stuff That May Break Your Profiler.  See also RuntimeHelpers.OffsetToStringData.
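To see the effect of that change from managed code, RuntimeHelpers.OffsetToStringData can simply be printed under each target framework. This is a minimal sketch with a hypothetical wrapper name, `StringLayoutInfo`. In the 32-bit .NET 3.5 dump from the question, the characters start 12 bytes into the object (type handle, Length+1, Length), so dropping one 4-byte field should make the reported offset 4 bytes smaller. Newer .NET versions may flag this property as obsolete.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical wrapper around the framework property discussed above.
public static class StringLayoutInfo
{
    // Offset in bytes from the start of a string object to its first character.
    public static int CharDataOffset()
    {
        return RuntimeHelpers.OffsetToStringData;
    }
}
```

Running `Console.WriteLine(StringLayoutInfo.CharDataOffset());` once compiled for .NET 3.5 and once for .NET 4.0 on the same platform should show the difference, if the buffer-length field was indeed removed.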
    Wednesday, July 10, 2013 7:45 PM