String concatenation, redux
Indified String concatenation is a fantastic beast. In this post I will try to shed some light on some of the implementation details, and maybe get to why I get excited over finding some peculiar way to optimize it from time to time.
Let me know if something is particularly unclear, or worse, wrong.
TL;DR
I turned an exponential factor into a constant one…
RFR: 8222852: Reduce String concat combinator tree shapes by folding constants into prependers
Joining Strings
JEP 280 - Indify String Concatenation - ISC - brought the ability to implement String concatenation expressions using invokedynamic
- indy for short.
What this means is that when a compiler like javac is faced with an expression like "foo" + bar
then instead of emitting some bytecode sequence that more directly implements the expression, it uses an indy to defer to the runtime to build up an implementation. This can have some interesting effects, since the runtime is free to create a concrete implementation that is better tailored for the particular environment - and also free to hook into some internal implementation details that statically compiled bytecode couldn’t (without resorting to Unsafe
and such).
Mainly the focus has been on building up an implementation that works well with the current breed of JIT compilers.
The StringBuilder
chains emitted by javac historically turns out to occassionally be hard to optimize for most JITs, especially when things get more complex. There’s been a lot of optimization work over the years to try and optimize String concatenation, so when an indy implementation managed to outperform those older forms by a rather significant factor (4-6x) there was much excitement. More work is now in progress to apply similar strategies to other parts of the runtime and libraries. For example JEP 348 should hopefully provide an indified version of String.format
that is quite reminiscent of ISC.
The drawback is that when loading and linking each call site we now need to run through a complex bootstrap routine, which is likely to be more expensive than “just” loading a bunch of straight-forward, statically compiled bytecode.
Exactly how expensive depends, but in JDK 9 bootstrapping the first indified String concat expression could take something like 30-90ms (depending on hardware). Some of that bootstrapping overhead is/was a one-off deal, some of the added overhead will happen again when bootstrapping subsequent call sites. For some cases that incremental overhead might look benign, but I guess that depends on expectations…
Optimize the runtime, not the bytecode
One of the clever ideas behind deferring to the runtime to provide a concrete implementation is that our runtimes are likely to evolve faster than the bytecode it ends up running, which will often be compiled to an older target than the JVM version we run on.
One beautiful thing with this isn’t the specific improvements themselves, but that these optimizations apply without recompilation (once you’re targetting something newer than JDK 8). Any JDK 9-13 numbers in this post remains the same whether I compile with a target of JDK 9 or any later version.
But let’s get to those promised implementation details, shall we? Oh, boy…
Expression shapes
When implementing a String concat expression like "foo" + bar
we generalize the expression and say it has the shape of a String literal concatenated with some argument that has whatever type bar
is. So for a bar
of type int
we’ll get one shape, if bar
is an Object
we’ll get another shape. When bootstrapping a method handle to implement these, each unique shape will result in a unique expression tree. Parts of the expression tree will be perfectly shareable between expressions of another shape, parts will not be.
String bar = ...
String foo = "foo" + bar;
String baz = "baz" + foo; // perfectly shareable shape.
In the above example the two concat expressions are perfectly shareable. Any MethodHandles
and underlying structures loaded and generated to implement them will be perfectly shareable - they’ll be different instances of the same expression tree: one created with the "foo"
literal, another with the "baz"
literal.
Many expression shapes!
There are theoretically a huge number of possible such shapes expressible. Really big expressions would thankfully be split apart by javac
, but there are still a staggering amount of possible shapes. Just enumerate the shapes of concatenating two String arguments intermixed by String literals:
// No literals:
foo = bar + baz;
// One literal:
foo = "foo" + bar + baz;
foo = bar + "foo" + baz;
foo = bar + baz + "foo";
// Two literals:
foo = "foo" + bar + "foo" + baz;
foo = bar + "foo" + baz + "foo";
foo = "foo" + bar + baz + "foo";
// Three literals:
foo = "foo" + bar + "foo" + baz + "foo";
1 + 3 + 3 + 1 = 8 shapes already, and that’s just getting started. If we add similar expressions where bar
is an int
instead? Eight more shapes. Same, but we replace baz
with an int
instead? Another eight.
The number of shapes of two arguments mixed with constants are thus 8 times the number of types we care about, squared. The types we cared about originally were String
, Object
and all primitives (boolean
, byte
, char
, short
, int
, long
, float
, double
), so counting 800 shapes in total. And that’s just for two arguments.
The factor eight isn’t constant, either: With more arguments, the number of ways to mix in String literals also grows. In fact the possibilities doubles with every argument, so in aggregate have 2^(n+1)
ways to mix in constants around n
arguments. The actual number of shapes for n
arguments is, in theory, 2^(n+1)*10^n
. So for 3 arguments it’d be possible to observe 16000 shapes in a given application. Four arguments: 320000, and so on.
In the default strategy for indified String concatenation each shape will require at least some unique class. Maximum number of classes? Well, things gets even more complicated after a while since the MethodHandle
implementation will start chaining these expressions into sub-expressions that might themselves be partially shareable, but it definitely grows into the millions.
In practice the number of shapes in any particular program is likely to be much more modest. While concatenation expressions are likely to be common, the number of unique shapes in an application is seldom too extreme, maybe 10-200 shapes are to be expected. Each shape can generate a small to moderate amount of classes, however, so we can get quite substantial improvements even when the number of shapes is small in practice.
MethodHandleInlineCopyStrategy
The default OpenJDK strategy build up an MethodHandle
combinator tree, piece by piece:
- first some rough filtering to turn
String
,Object
,float
anddouble
arguments intoString
- then a sequence of “mixers” that look at each argument and calculates the right size and encoding for the String
- then instantiate the
byte[]
that will back theString
object - then a sequence of “prependers” that take each constant and argument in reverse order and fill up the
byte[]
- finally take the
String
encoding value and thebyte[]
and allocate the resultingString
In essence the algorithm remains the same since JDK 9. The implementation is “arguably hard to read” since the MethodHandle
expression trees is built up in reverse.
This gets some of the performance edge from the fact that we can call package-private APIs in java.lang
, some from the use of Unsafe
to work on uninitialized byte[]
s, and some from the fact that MethodHandle
combinator trees are likely to be aggressively inlined.
JDK 12: indexCoder, Object Stringifier
In JDK 12, I found we improved by lumping together the index
and coder
fields implementation into
a single long
field. This simplified the expression tree a lot, with
fewer classes needed for any and all expressions.
I also merged the “Stringifier” used for filtering String
and Object
arguments into a String
suited for prepending.
This effectively means the number of types we care about when determining whether a shape is unique drops from 10 to 9, which drops off a lot of unique shapes.
Four argument shapes drops from 320000 to 209952, for example. Big drop in theory, perhaps not so big in practice, but even on more realistic applications these optimization started to add up.
Oh, and I found out we could reduce the number of rebinds we do when applying the same filter to more than one argument. This optimization proved quite beneficial to some String concat usage scenarios, and is generally applicable so might end up being an optimization in other places as well.
All in all JDK 12 about halved bootstrap overheads on some typical scenarios.
JDK 13: Folding constants
There are a few nice little optimizations coming in JDK 13, including one that should make a category of really trivial concatenations bootstrap really fast.
However, the one that really got me excited is this one:
Instead of binding in a simple prepender for each argument and each constant, bind surrounding constants to the prependers for the arguments. This prepender will then prepend the suffix constant (if one exist), then prepend of argument, then the prefix constant (if on exist).
This means we’ll only bind in one prepender per argument into the
main MethodHandle
, and none for each constant. The constants will instead be neatly folded
into each argument prepender. (Your compiler probably does this sort of thing all the time, but we
don’t have that luxury here.)
Effectively this means that constants will no longer affect the shape of the expression.
So in essence the expressions "foo" + bar + baz
and bar + "foo" + baz
will share the same shape
(assuming bar
and baz
are of the same type). This reduces the theoretical number of
observable shapes for n
arguments by a factor of 2^(n+1)
! *tweets excitedly*
Still a huge number of shapes are possible in theory; in practice things are looking quite… OK.
Show you some numbers
I added a note to the review thread for my latest optimization in this area, showing numbers for a simple but realistic mix of concatenation expressions:
public class StringConcatMix {
public static void main(String ... args) {
String s = String.valueOf(args.length);
String concat;
concat = s + s;
concat = "c" + s + "c";
concat = s + "c" + s;
concat = "c" + s + "c" + s;
concat = "c" + s + s + "c";
concat = s + "c" + s + s + "c";
concat = "c" + s + "c" + args.length;
concat = "c" + s + "c" + s + "c" + s;
}
}
Building and running the above on JDK 8 through most-recent JDK 13 (*EA!)
Total Overhead
JDK 8: 60ms 0ms
JDK 9: 215ms 129ms
JDK 11: 164ms 118ms
JDK 12: 111ms 68ms
JDK 13*: 86ms 46ms
What we see here is that the overhead in JDK 9 was pretty hefty, and that we’re now
down to more acceptable numbers. Still there’s a small regression compared to the JDK 8
StringBuilder
approach. Might be a small price to pay for better peak
performance. There are always trade-offs involved, and historically the
JVM favors throughput (and latency) over startup and footprint.
Show you some outrageous numbers
I mentioned above that there are theoretically up to 320000 shapes needed to implement all concatenation expressions of four arguments mixed with zero to five constants.
Let’s see how we fare in practice on something as outrageously synthetic like that.
So I wrote this horrible little program to generate a java program enumerating them all (the source code generated is 9Mb, and I had to split the implementation into a lot of nested inner classes for it to even compile).
The results are somewhat interesting:
JDK #classes Time
8 - ~3.6s
11 39394 ~19.5s
12 27212 ~18.5s
13* 3174 ~15.6s
(#classes is the difference in loaded classes between running the code built with JDK 8 and built with JDK 9 for the given JDK)
It appears we are far from needing one class per “shape” at this scale in JDK 11: Under the hood, the
method handle implementation is pretty good at splitting apart heavy expression trees into shareable
sub-expressions, which makes reasoning about theoretical bounds of the needed number of
generated classes (such as LambdaForm
:s and Species
classes) hard in practice.
In the end we really do massively reduce the number of classes needed in JDK 12, and spectacularly so in the latest builds. The bootstrap time is falling off, too, but maybe not as quickly as I’d have expected on this synthetic test (compared to the outcome in other tests).
A (sub)word about theory and practice
Another saving grace in the implementation is that subword integral primitive types (boolean
, byte
,
char
, short
) are automatically widened to int
for purposes of some of the internal structures created,
so we end up being concerned with only six types for the most part (seven in JDK 9-11). This
helps improve internal sharing of structural classes a lot. The number of classes generated in practice does look closer to 2^5*7^4 = 76832
in
JDK 11, 2^5*6^4 = 41472
in JDK 12 and 6^4 = 1296
in JDK 13. Various implementation details make us
need more in some cases, fewer in others.
What’s next?
I think there’s still room for improvement, for sure, but I guess we should keep in mind to focus on improvements that help in more realistic cases. If we can improve some problematic but theoretical corner case we should at least make sure we don’t make trivial and realistic cases worse.
A few ideas:
- Filtering is applied in order, so the optimization to bind in just one unique filter combinator acting on
all the arguments won’t apply perfectly when there are interleaving
String
,float
anddouble
Stringifiers. Since the “Stringification” offloat
anddouble
arguments shouldn’t be order dependent, we could group the filters and apply each filter in turn. - We could try “unrolling” arguments once and consume arguments pairwise. That’d mean we’d need a mixer and a prepender for each combination of two types. We’d need a lot more boilerplate, but might end up with an implementation that builds much shorter trees in practice, so we’d get more reusable building blocks, and fewer unique forms.
- Maybe we should also consider emitting a trivial, boxing vararg adapter for when the number of arguments becomes so large that we don’t really gain much peak performance from specialization to begin with. Such a fallback might be able to implement every non-trivial shape with a single, complex but somewhat inefficient
MethodHandle
. After some given number of arguments then perhaps the performance loss will be lost in the noise, anyhow.
Edit: Added overhead measures for StringConcatMix