String interning?

I think this has been asked before.

Possible duplicates:

Strange string literal comparison

Two different "strings" are the same object instance?


Yes, the compiler emits constant string expressions as ldstr instructions, which guarantees interning (via MSDN):

The Common Language Infrastructure (CLI) guarantees that the result of two ldstr instructions referring to two metadata tokens that have the same sequence of characters return precisely the same string object (a process known as "string interning").

This isn't every string; it is constant string expressions in your code. For example:

string s = "abc" + "def";

is only one string expression: the IL will be a single ldstr of "abcdef" (the compiler computes the concatenation at compile time).

This does not hurt performance.

Strings generated at runtime are not interned automatically, for example:

int i = GetValue();
string s = "abc" + i;

Here, "abc" is interned, but the result (for example "abc8" if GetValue() returns 8) is not. Also note that:

char[] chars = {'a','b','c'};
string s = new string(chars);
string t = "abc";

Here s and t are different references: the literal assigned to t is interned, but the new string assigned to s is not.
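The cases above can be checked with object.ReferenceEquals. This is a minimal sketch, assuming a typical JIT-compiled runtime; the class name InternDemo and the GetValue stub (hard-coded to return 8) are hypothetical and exist only for illustration.

```csharp
using System;

public static class InternDemo
{
    // Hypothetical stand-in for the GetValue() call in the text.
    public static int GetValue() { return 8; }

    // Two identical literals refer to one interned instance.
    public static bool LiteralsShareInstance()
    {
        string a = "abc";
        string b = "abc";
        return object.ReferenceEquals(a, b);
    }

    // "abc" + "def" is folded by the compiler into the literal "abcdef".
    public static bool ConstantConcatSharesInstance()
    {
        string s = "abc" + "def";
        return object.ReferenceEquals(s, "abcdef");
    }

    // Built at run time: equal in value to "abc8", but a separate object.
    public static bool RuntimeConcatSharesInstance()
    {
        string s = "abc" + GetValue();
        return object.ReferenceEquals(s, "abc8");
    }

    // new string(char[]) always allocates for non-empty input.
    public static bool NewStringSharesInstance()
    {
        char[] chars = { 'a', 'b', 'c' };
        string s = new string(chars);
        string t = "abc";
        return object.ReferenceEquals(s, t);
    }
}
```

Under those assumptions the first two methods should return true and the last two false, while an ordinary == comparison would report equal values in every case.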


String literals are automatically interned.

Programmatically created strings are not interned by default (nor are strings that come from user input).

In the above, "Some Text" and "Some Other Text" have both been interned and since you are using the literal in these places, you see that the interned version is the one referenced.

In your code, if you have:

string.Format("{0} {1}", "Some", "Text")

You will see that the returned reference is not the same as the one the "Some Text" literal refers to, even though the values are equal.
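A small sketch of that difference, and of how string.Intern maps a run-time string back to the pooled instance. The class and method names are made up for illustration, and the results assume a typical JIT-compiled runtime.

```csharp
using System;

public static class FormatInternDemo
{
    // string.Format builds its result at run time, so the result is a new
    // object even though its value equals the literal "Some Text".
    public static bool FormattedSharesInstanceWithLiteral()
    {
        string formatted = string.Format("{0} {1}", "Some", "Text");
        return object.ReferenceEquals(formatted, "Some Text");
    }

    // string.Intern returns the pooled instance that has the same value,
    // which is the instance the literal refers to.
    public static bool InternedSharesInstanceWithLiteral()
    {
        string formatted = string.Format("{0} {1}", "Some", "Text");
        string pooled = string.Intern(formatted);
        return object.ReferenceEquals(pooled, "Some Text");
    }
}
```

The first method should return false and the second true: interning a run-time string is opt-in, and it gives back the shared reference rather than changing the string you passed in.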


Does .NET use string interning for every string that I use?

No, but it does use it for those strings that it knows about at compile time because they are constants in the code.

string x = "abc"; //interned
string y = "ab" + "c"; //interned as the same string because the
                       //compiler can work out that it's the same as
                       //y = "abc" at compile time so there's no need
                       //to do that concatenation at run-time. There's
                       //also no need for "ab" or "c" to exist in your
                       //compiled application at all.
string z = new StreamReader(@"C:\myfile.text").ReadToEnd();
                       //z isn't interned because it isn't known at compile
                       //time. Note that @"C:\myfile.text" is interned:
                       //although no variable refers to it, it is still a
                       //string literal in the code.

If so, doesn't it hurt performance?

No, it helps performance:

  1. All these strings are going to be in the application's memory somewhere. Interning means we don't keep unnecessary copies, so we use less memory.
  2. It makes comparisons super-fast when we know both strings are interned, because a reference comparison is enough.
  3. That case doesn't come up much, but the boost interning gives to other comparisons does.

Consider this code that exists in one of the built-in comparers:

public override int Compare(string x, string y)
{
    if (object.ReferenceEquals(x, y))
    {
        return 0;
    }
    if (x == null)
    {
        return -1;
    }
    if (y == null)
    {
        return 1;
    }
    return this._compareInfo.Compare(x, y, this._ignoreCase ? CompareOptions.IgnoreCase : CompareOptions.None);
}

This is for ordering, but the same applies to equality/inequality checks. To check that two strings are equal, or to put them in order, we need an O(n) operation where n is the length of the string (even where some skips and cleverness are possible, the cost is still proportional to the length). This is potentially slow for long strings, and comparing strings is something that a lot of applications do a lot of the time, so it is a great place for a speed boost. It's also slowest for the equality case (because the moment we find a difference we can return a value, but equal strings must be examined entirely).

Everything is always equal to itself even if you redefine what "equals" means (case-sensitive, case-insensitive, different cultures: everything is still equal to itself, and if you create an Equals() override that doesn't follow that, you will have a bug). Everything is always ordered at the same point as something it is equal to. This means two things:

  1. We can always consider something equal to itself without doing any more work.
  2. We can always give a comparison value of 0 for comparing something with itself without any more work.

Hence the code above short-cuts on this case without having to do the more complicated and expensive comparison. There's also no downside, since if we didn't cover this case we'd have to add a test for the case where both values passed were null anyway.

Now, it so happens that comparing something to itself comes up quite often naturally with the way certain algorithms work, so it's always worth doing. However, string interning increases the number of times when two strings we hold in different variables (x and z at the start of your question, for example) are actually the same object, so it increases how often the short-cut works for us.

It's a tiny optimisation most of the time, but we get it for free and we get it so often that it's great to have it. The practical takeaway: if you're writing an Equals or a Compare, consider whether you too should use this short-cut.
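As a rough illustration of that takeaway, here is a sketch of an Equals that takes the reference short-cut before doing the more expensive comparison. The Label type is hypothetical and not from any library; the case-insensitive comparison is just one example of a costlier equality rule.

```csharp
using System;

// Hypothetical immutable type, used only to illustrate the short-cut.
public sealed class Label : IEquatable<Label>
{
    private readonly string _text;

    public Label(string text)
    {
        if (text == null) throw new ArgumentNullException("text");
        _text = text;
    }

    public bool Equals(Label other)
    {
        // Short-cut: an object is always equal to itself.
        if (object.ReferenceEquals(this, other)) return true;
        if (object.ReferenceEquals(other, null)) return false;
        // Only now pay for the O(n) character-by-character comparison.
        return string.Equals(_text, other._text, StringComparison.OrdinalIgnoreCase);
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Label);
    }

    // Must agree with Equals: equal labels produce equal hash codes.
    public override int GetHashCode()
    {
        return StringComparer.OrdinalIgnoreCase.GetHashCode(_text);
    }
}
```

The reference check costs one pointer comparison, so it is cheap to add even when it rarely hits.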

A related question then, is "should I intern everything?"

Here though, we have to consider a downside that compiled-in strings don't have. Interning is never wasteful with compiled-in strings, because they have to be somewhere. If, however, you read a string from a file, interned it, and then never used it again, it's going to live a long time, and that's wasteful. If you did it all the time, you could cripple your memory use.

Let's imagine though that you are frequently reading a bunch of items that include some identifiers. You are regularly using these identifiers to match items with data from another source. There's a small set of identifiers that you will ever see (say there are only a couple of hundred possible values). Then, because equality checks are what these strings are all about, and there aren't many of them, interning (on both the data read in and the data you compare it with - it's pointless otherwise) becomes a win.

Or, let's say that there are a few thousand such objects, and the data we match them with is always cached in memory - that means those strings are always going to be somewhere in memory anyway, so interning becomes a no-brainer win. (Unless there's the possibility of lots of "not found" results - interning those identifiers just to not find a match is a loss.)
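One possible way to handle the "not found" concern is to intern only the known identifiers and look incoming ones up with string.IsInterned, which returns the pooled instance or null without adding anything. This is a sketch under that assumption; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class IdentifierInterning
{
    // Intern each identifier of the small, known set once.
    public static void InternKnown(IEnumerable<string> knownIds)
    {
        foreach (string id in knownIds)
        {
            string.Intern(id);
        }
    }

    // Returns the pooled instance for an incoming identifier, or null if
    // no string with that value is in the pool. string.IsInterned only
    // looks up and never adds, so unmatched values do not accumulate.
    // Caveat: the pool is process-wide, so any interned string (including
    // unrelated literals) counts as "found" here.
    public static string Canonical(string incoming)
    {
        return string.IsInterned(incoming);
    }
}
```

Once both sides hold pooled references, a match between them can be confirmed by reference equality alone, and the usual string comparison also hits its reference short-cut.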

Finally, the same basic technique can be done differently. XmlReader, for example, stores the strings it's comparing in a NameTable that acts like a private intern pool, but the whole thing can be collected when it's finished. You can also apply the technique to any reference type that won't be changed during the time it's pooled (the best way to guarantee that is to make it immutable, so it can never change). Use of this technique with very large collections with large amounts of duplication can massively reduce memory use (my biggest saving was at least 16GB - it could be more but the server kept crashing at around that point before the technique was applied) and/or speed up comparisons.
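A private pool of this kind can be sketched with a dictionary that maps each value to its first-seen instance. This is not how NameTable is implemented; Pool<T> is a hypothetical helper, and it assumes T is immutable with consistent Equals and GetHashCode.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical private pool: unlike the global intern pool, the whole
// thing becomes collectable once nothing references the Pool instance.
public sealed class Pool<T> where T : class
{
    private readonly Dictionary<T, T> _items = new Dictionary<T, T>();

    public int Count
    {
        get { return _items.Count; }
    }

    // Returns the first instance seen that is equal to value, adding
    // value itself to the pool if no equal instance exists yet.
    public T GetOrAdd(T value)
    {
        if (value == null) return null;
        T existing;
        if (_items.TryGetValue(value, out existing)) return existing;
        _items.Add(value, value);
        return value;
    }
}
```

Passing every value through GetOrAdd as it is read means duplicates collapse to one shared instance, and dropping the pool afterwards releases anything that is no longer referenced elsewhere. The sketch is not thread-safe.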