What is Java String interning?

http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#intern()

Basically doing String.intern() on a series of strings will ensure that all strings having same contents share same memory. So if you have list of names where 'john' appears 1000 times, by interning you ensure only one 'john' is actually allocated memory.

This can be useful to reduce memory requirements of your program. But be aware that the cache is maintained by JVM in permanent memory pool which is usually limited in size compared to heap so you should not use intern if you don't have too many duplicate values.


More on memory constraints of using intern()

On one hand, it is true that you can remove String duplicates by internalizing them. The problem is that the internalized strings go to the Permanent Generation, which is an area of the JVM that is reserved for non-user objects, like Classes, Methods and other internal JVM objects. The size of this area is limited, and is usually much smaller than the heap. Calling intern() on a String has the effect of moving it out from the heap into the permanent generation, and you risk running out of PermGen space.

-- From: http://www.codeinstructions.com/2009/01/busting-javalangstringintern-myths.html


From JDK 7 (I mean in HotSpot), something has changed.

In JDK 7, interned strings are no longer allocated in the permanent generation of the Java heap, but are instead allocated in the main part of the Java heap (known as the young and old generations), along with the other objects created by the application. This change will result in more data residing in the main Java heap, and less data in the permanent generation, and thus may require heap sizes to be adjusted. Most applications will see only relatively small differences in heap usage due to this change, but larger applications that load many classes or make heavy use of the String.intern() method will see more significant differences.

-- From Java SE 7 Features and Enhancements

Update: Interned strings are stored in main heap from Java 7 onwards. http://www.oracle.com/technetwork/java/javase/jdk7-relnotes-418459.html#jdk7changes


There are some "catchy interview" questions, such as why you get equals! if you execute the below piece of code.

String s1 = "testString";
String s2 = "testString";
if(s1 == s2) System.out.println("equals!");

If you want to compare Strings you should use equals(). The above will print equals because the testString is already interned for you by the compiler. You can intern the strings yourself using intern method as is shown in previous answers....


JLS

JLS 7 3.10.5 defines it and gives a practical example:

Moreover, a string literal always refers to the same instance of class String. This is because string literals - or, more generally, strings that are the values of constant expressions (§15.28) - are "interned" so as to share unique instances, using the method String.intern.

Example 3.10.5-1. String Literals

The program consisting of the compilation unit (§7.3):

package testPackage;
class Test {
    public static void main(String[] args) {
        String hello = "Hello", lo = "lo";
        System.out.print((hello == "Hello") + " ");
        System.out.print((Other.hello == hello) + " ");
        System.out.print((other.Other.hello == hello) + " ");
        System.out.print((hello == ("Hel"+"lo")) + " ");
        System.out.print((hello == ("Hel"+lo)) + " ");
        System.out.println(hello == ("Hel"+lo).intern());
    }
}
class Other { static String hello = "Hello"; }

and the compilation unit:

package other;
public class Other { public static String hello = "Hello"; }

produces the output:

true true true true false true

JVMS

JVMS 7 5.1 says says that interning is implemented magically and efficiently with a dedicated CONSTANT_String_info struct (unlike most other objects which have more generic representations):

A string literal is a reference to an instance of class String, and is derived from a CONSTANT_String_info structure (§4.4.3) in the binary representation of a class or interface. The CONSTANT_String_info structure gives the sequence of Unicode code points constituting the string literal.

The Java programming language requires that identical string literals (that is, literals that contain the same sequence of code points) must refer to the same instance of class String (JLS §3.10.5). In addition, if the method String.intern is called on any string, the result is a reference to the same class instance that would be returned if that string appeared as a literal. Thus, the following expression must have the value true:

("a" + "b" + "c").intern() == "abc"

To derive a string literal, the Java Virtual Machine examines the sequence of code points given by the CONSTANT_String_info structure.

  • If the method String.intern has previously been called on an instance of class String containing a sequence of Unicode code points identical to that given by the CONSTANT_String_info structure, then the result of string literal derivation is a reference to that same instance of class String.

  • Otherwise, a new instance of class String is created containing the sequence of Unicode code points given by the CONSTANT_String_info structure; a reference to that class instance is the result of string literal derivation. Finally, the intern method of the new String instance is invoked.

Bytecode

Let's decompile some OpenJDK 7 bytecode to see interning in action.

If we decompile:

public class StringPool {
    public static void main(String[] args) {
        String a = "abc";
        String b = "abc";
        String c = new String("abc");
        System.out.println(a);
        System.out.println(b);
        System.out.println(a == c);
    }
}

we have on the constant pool:

#2 = String             #32   // abc
[...]
#32 = Utf8               abc

and main:

 0: ldc           #2          // String abc
 2: astore_1
 3: ldc           #2          // String abc
 5: astore_2
 6: new           #3          // class java/lang/String
 9: dup
10: ldc           #2          // String abc
12: invokespecial #4          // Method java/lang/String."<init>":(Ljava/lang/String;)V
15: astore_3
16: getstatic     #5          // Field java/lang/System.out:Ljava/io/PrintStream;
19: aload_1
20: invokevirtual #6          // Method java/io/PrintStream.println:(Ljava/lang/String;)V
23: getstatic     #5          // Field java/lang/System.out:Ljava/io/PrintStream;
26: aload_2
27: invokevirtual #6          // Method java/io/PrintStream.println:(Ljava/lang/String;)V
30: getstatic     #5          // Field java/lang/System.out:Ljava/io/PrintStream;
33: aload_1
34: aload_3
35: if_acmpne     42
38: iconst_1
39: goto          43
42: iconst_0
43: invokevirtual #7          // Method java/io/PrintStream.println:(Z)V

Note how:

  • 0 and 3: the same ldc #2 constant is loaded (the literals)
  • 12: a new string instance is created (with #2 as argument)
  • 35: a and c are compared as regular objects with if_acmpne

The representation of constant strings is quite magic on the bytecode:

  • it has a dedicated CONSTANT_String_info structure, unlike regular objects (e.g. new String)
  • the struct points to a CONSTANT_Utf8_info Structure that contains the data. That is the only necessary data to represent the string.

and the JVMS quote above seems to say that whenever the Utf8 pointed to is the same, then identical instances are loaded by ldc.

I have done similar tests for fields, and:

  • static final String s = "abc" points to the constant table through the ConstantValue Attribute
  • non-final fields don't have that attribute, but can still be initialized with ldc

Conclusion: there is direct bytecode support for the string pool, and the memory representation is efficient.

Bonus: compare that to the Integer pool, which does not have direct bytecode support (i.e. no CONSTANT_String_info analogue).