Introduction

JDK 9 introduced a way to dynamically decide how to concatenate Strings through JEP 280.

In the words of the JEP, Indify String Concatenation changed the String-concatenation bytecode sequence generated by javac to use invokedynamic calls to JDK library functions.

Bytecode view

Let’s take a quick look at that bytecode sequence before and after JEP 280.

Take this simple “Hello $1” Java program:

public class StringConcat {
    public static void main(String[] args) {
        System.out.println("Hello " + args[0]);
    }
}

JDK 8 bytecode

Compile for JDK 8 compatibility with javac --release 8 StringConcat.java then print the bytecode with javap -v StringConcat:

  public static void main(java.lang.String[]);
    descriptor: ([Ljava/lang/String;)V
    flags: (0x0009) ACC_PUBLIC, ACC_STATIC
    Code:
      stack=4, locals=1, args_size=1
         0: getstatic     #7                  // Field java/lang/System.out:Ljava/io/PrintStream;
         3: new           #13                 // class java/lang/StringBuilder
         6: dup
         7: invokespecial #15                 // Method java/lang/StringBuilder."<init>":()V
        10: ldc           #16                 // String Hello
        12: invokevirtual #18                 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        15: aload_0
        16: iconst_0
        17: aaload
        18: invokevirtual #18                 // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
        21: invokevirtual #22                 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
        24: invokevirtual #26                 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
        27: return
      LineNumberTable:
        line 3: 0
        line 4: 27

Push System.out to stack, allocate a StringBuilder, append the "Hello " constant, append args[0], call toString() on the StringBuilder, call System.out.println with the result, return. Simple!
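In source form, that bytecode corresponds roughly to the hand-written Java below. This is only an illustration: the class name Jdk8StyleConcat and the hello helper are made up here.

```java
final class Jdk8StyleConcat {
    // Same steps as the JDK 8 bytecode: new StringBuilder, two appends, toString.
    static String hello(String name) {
        return new StringBuilder()
                .append("Hello ")
                .append(name)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(hello(args.length > 0 ? args[0] : "World"));
    }
}
```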

JDK 9 View

Now let’s see the same with JEP 280. Compile with javac StringConcat.java and print the class bytecode with javap -v StringConcat:

  public static void main(java.lang.String[]);
    descriptor: ([Ljava/lang/String;)V
    flags: (0x0009) ACC_PUBLIC, ACC_STATIC
    Code:
      stack=3, locals=1, args_size=1
         0: getstatic     #7                  // Field java/lang/System.out:Ljava/io/PrintStream;
         3: aload_0
         4: iconst_0
         5: aaload
         6: invokedynamic #13,  0             // InvokeDynamic #0:makeConcatWithConstants:(Ljava/lang/String;)Ljava/lang/String;
        11: invokevirtual #17                 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
        14: return
      LineNumberTable:
        line 3: 0
        line 4: 14
...
BootstrapMethods:
  0: #34 REF_invokeStatic java/lang/invoke/StringConcatFactory.makeConcatWithConstants:(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/invoke/CallSite;
    Method arguments:
      #32 Hello \u0001

Push System.out to stack, load args[0], invokedynamic!, call System.out.println with the result, return.

invokedynamic looks like magic, but it’s really just a trick: the first time the invokedynamic is executed, the VM bootstraps the call site by making an upcall to the referenced bootstrap method - StringConcatFactory.makeConcatWithConstants - which returns a CallSite. It’s the job of the bootstrap method to make sure that the produced CallSite has a target MethodHandle. That target is then invoked as if the call sequence were an invokevirtual with the MethodHandle on the stack before the arguments - in this case it takes a single String argument and returns a String.
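To make the upcall a bit less abstract, here is a sketch that does by hand roughly what the VM does the first time the call site is executed: call the bootstrap method with a lookup, a name, the call site type and the recipe, then invoke the target of the returned CallSite. The class name BootstrapByHand and the hello helper are made up for illustration; the bootstrap method and the "Hello \u0001" recipe are the ones from the javap output above.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.StringConcatFactory;

final class BootstrapByHand {
    static String hello(String name) {
        try {
            // The upcall the VM would make on first execution of the invokedynamic.
            CallSite callSite = StringConcatFactory.makeConcatWithConstants(
                    MethodHandles.lookup(),
                    "makeConcatWithConstants",
                    MethodType.methodType(String.class, String.class),
                    "Hello \u0001"); // the control character marks where the argument goes
            // The call site's target has type (String)String, so invokeExact matches.
            MethodHandle target = callSite.getTarget();
            return (String) target.invokeExact(name);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(hello(args.length > 0 ? args[0] : "World"));
    }
}
```

The real VM only bootstraps once per call site and then keeps calling the linked target; this sketch re-bootstraps on every call to keep things short.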

A MethodHandle is an executable reference to some code which the VM natively knows how to invoke. It could be a direct reference to a method, or there can be layers of transforms of the inputs or the return values - arguments can be filtered using other MethodHandles, auxiliary arguments can be added and folded in, etc. The resulting MethodHandle could be visualized as a syntax tree. (I’ve not seen a proof that the MethodHandles API is Turing complete in isolation, but it probably is.)
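As a small illustration of such layering, the sketch below builds a tiny expression tree using only public java.lang.invoke API: a direct handle to String::concat, with its second argument filtered through String::valueOf and its first argument bound to a constant. The class and method names are made up, and this is far simpler than what StringConcatFactory actually builds.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class CombinatorSketch {
    static String greet(Object value) {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            // Direct handle: (String receiver, String arg) -> String
            MethodHandle concat = lookup.findVirtual(String.class, "concat",
                    MethodType.methodType(String.class, String.class));
            // Direct handle: (Object) -> String
            MethodHandle valueOf = lookup.findStatic(String.class, "valueOf",
                    MethodType.methodType(String.class, Object.class));
            // Filter argument 1 through valueOf: (String, Object) -> String
            MethodHandle filtered = MethodHandles.filterArguments(concat, 1, valueOf);
            // Bind the constant as argument 0: (Object) -> String
            MethodHandle bound = MethodHandles.insertArguments(filtered, 0, "Hello ");
            return (String) bound.invokeExact(value);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(greet(42));
    }
}
```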

StringConcatFactory.makeConcatWithConstants

Specifically, the default strategy for bootstrapping concats in StringConcatFactory will create an expression tree that combines some public methods (such as String::valueOf) and a few helper methods from java.lang.StringConcatHelper:

static long initialCoder() { ... }
// T: boolean, char, int, long, String
static long mix(long lengthCoder, T value) { ... } 
static byte[] newArray(long indexCoder) { ... }
static long prepend(long lengthCoder, byte[] buf, T value) { ... } 
static String newString(byte[] buf, long indexCoder) { ... }

using a number of combinators and transforms from java.lang.invoke.MethodHandles:

Class<?>[] ptypes = mt.erase().parameterArray(); 
MethodHandle mh = MethodHandles.dropArgumentsTrusted(newString(), 2, ptypes);
..
mh = filterInPrependers(mh, constants, ptypes);
..
MethodHandle newArrayCombinator = newArray();
mh = MethodHandles.foldArgumentsWithCombiner(mh, 0, newArrayCombinator,
        1 // index
);
..
mh = filterAndFoldInMixers(mh, initialLengthCoder, ptypes);
if (objFilters != null) {
  mh = MethodHandles.filterArguments(mh, 0, objFilters);
}
return mh;

The combinator tree is built up in reverse, making the code challenging to read. There are also a few non-public

Flow chart

A flow chart of what happens when the mh returned is invoked:

flowchart TB;
    A("mh.invoke(Object,float,int,...)") --filter arguments and narrow types to only deal with String, int, long, char, boolean --> B("mh.invoke(String, String, int, ...)")
    B -- insert initial lengthCoder --> C("mh.invoke(long lengthCoder, String, String, int, ...)")
    C -- for each chunk of up to 4 arguments --> D("mixer.invoke(lengthCoder, arg1, ..)")
    D -- replace lengthCoder with result --> C
    C -- after mixing, call newArray --> E("mh.invoke(long lengthCoder, byte[] buf, String, String, int, ...)")
    E -- for each chunk of up to 4 arguments --> F("Call prepend using coder, buf and the arguments")
    F -- update the lengthCoder, fold away the argument --> G("mh.invoke(long coder, byte[] buf)")
    G -- newString --> H(Return)

We do have a static variant for simple concatenations of two Object arguments which can be used as a visualization-of-sorts for what’s going on:

    static String simpleConcat(Object first, Object second) {
        // MethodHandles.filterArguments
        String s1 = stringOf(first); 
        String s2 = stringOf(second);
        ...
        // filterAndFoldInMixers:
        long indexCoder = mix(initialCoder(), s1);
        indexCoder = mix(indexCoder, s2);
        
        byte[] buf = newArray(indexCoder);
        // prepend each argument in reverse order, since we are prepending
        // from the end of the byte array
        indexCoder = prepend(indexCoder, buf, s2);
        indexCoder = prepend(indexCoder, buf, s1);
        // return newString
        return newString(buf, indexCoder);
    }
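Since the StringConcatHelper methods are JDK-internal, here is a simplified stand-alone model of the same mix / allocate / prepend sequence. It is a sketch under loud assumptions: it works on a char[] rather than a coder-aware byte[], tracks only the length, and the names PrependSketch, mix and prepend are stand-ins rather than the real helpers.

```java
final class PrependSketch {
    // "mix": accumulate the total length, failing on int overflow.
    static int mix(int length, String value) {
        return Math.addExact(length, value.length());
    }

    // "prepend": copy value so that it ends at index, and return the new index.
    static int prepend(int index, char[] buf, String value) {
        index -= value.length();
        value.getChars(0, value.length(), buf, index);
        return index;
    }

    static String concat(Object first, Object second) {
        String s1 = String.valueOf(first);
        String s2 = String.valueOf(second);

        int length = mix(mix(0, s1), s2);
        char[] buf = new char[length];

        // Fill the buffer from the end, so arguments go in reverse order.
        int index = length;
        index = prepend(index, buf, s2);
        index = prepend(index, buf, s1);
        if (index != 0) {
            throw new IllegalStateException("buffer not completely filled");
        }
        return new String(buf);
    }

    public static void main(String[] args) {
        System.out.println(concat("Hello ", 42));
    }
}
```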

The high-arity `StringBuilder` fallback

As the MethodHandle expression tree grows we eventually run into some blocking issues. Besides generating a lot of intermediate transform classes (which end up being unused), the resulting MH can take an unreasonable amount of time and resources to compile. See JDK-8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases.

As a fix we opted to bring back code which spins a class per concatenation for sufficiently complex expressions, using a StringBuilder approach very similar to the code that would be emitted by javac pre-JEP 280.

  return new Consumer<CodeBuilder>() {
    @Override
    public void accept(CodeBuilder cb) {
      cb.new_(STRING_BUILDER);
      cb.dup();

      int len = 0;
      for (String constant : constants) {
        if (constant != null) {
          len += constant.length();
        }
      }
      len += args.parameterCount() * ARGUMENT_SIZE_FACTOR;
      cb.loadConstant(len);
      cb.invokespecial(STRING_BUILDER, "<init>", INT_CONSTRUCTOR_TYPE);

      // At this point, we have a blank StringBuilder on stack, fill it in with .append calls.
      {
        int off = 0;
        for (int c = 0; c < args.parameterCount(); c++) {
          if (constants[c] != null) {
            cb.ldc(constants[c]);
            cb.invokevirtual(STRING_BUILDER, "append", APPEND_STRING_TYPE);
          }
          Class<?> cl = args.parameterType(c);
          TypeKind kind = TypeKind.from(cl);
          cb.loadLocal(kind, off);
          off += kind.slotSize();
          MethodTypeDesc desc = getSBAppendDesc(cl);
          cb.invokevirtual(STRING_BUILDER, "append", desc);
        }
        if (constants[constants.length - 1] != null) {
          cb.ldc(constants[constants.length - 1]);
          cb.invokevirtual(STRING_BUILDER, "append", APPEND_STRING_TYPE);
        }
      }

      cb.invokevirtual(STRING_BUILDER, "toString", TO_STRING_TYPE);
      cb.areturn();
    }
  };

This fallback is new in JDK 23 and can be controlled by supplying -Djava.lang.invoke.StringConcat.highArityThreshold=<NN> - where NN has a default value of 20. This means any expression with more than 20 arguments will generate a specialized class using the StringBuilder approach.

Performance impact

JEP 280 was obviously motivated by performance, and the ability to emit code that is easier for JIT compilers to optimize. As such there are various benchmarks, including the org.openjdk.bench.java.lang.StringConcat benchmark, which has been added to over the years.

Let’s build that with pre-JEP 280 mode, using -XDstringConcat=inline when building, and compare with the recent JDK mainline.

Name                                    Change
StringConcat.concat123String             0,87x (p = 0.000*)
StringConcat.concat13String              1,39x (p = 0.000*)
StringConcat.concat23String              0,85x (p = 0.000*)
StringConcat.concat23StringConst         0,91x (p = 0.000*)
StringConcat.concat4String               1,43x (p = 0.000*)
StringConcat.concat6String               1,56x (p = 0.000*)
StringConcat.concatConst2String          1,64x (p = 0.000*)
StringConcat.concatConst4String          1,73x (p = 0.000*)
StringConcat.concatConst6Object          1,72x (p = 0.000*)
StringConcat.concatConst6String          1,74x (p = 0.000*)
StringConcat.concatConstBoolByte         2,69x (p = 0.000*)
StringConcat.concatConstInt              1,62x (p = 0.000*)
StringConcat.concatConstIntConstInt      1,62x (p = 0.000*)
StringConcat.concatConstString           1,33x (p = 0.000*)
StringConcat.concatConstStringConstInt   1,68x (p = 0.000*)
StringConcat.concatEmptyConstInt         1,15x (p = 0.000*)
StringConcat.concatEmptyConstString      2,40x (p = 0.000*)
StringConcat.concatEmptyLeft             3,02x (p = 0.000*)
StringConcat.concatEmptyRight            2,98x (p = 0.000*)
StringConcat.concatMethodConstString     1,00x (p = 0.339 )
StringConcat.concatMix4String            1,47x (p = 0.000*)

---
config:
    xyChart:
        width: 900
        height: 600
    themeVariables:
        xyChart:
            plotColorPalette: "#4344A3, #B34443" 
---
xychart-beta horizontal
  title "StringConcat JMH microbenchmark, speed-up factor"
  x-axis [concat123String, concat13String, concat23String, concat23StringConst, concat4String, concat6String, concatConst2String, concatConst4String, concatConst6Object, concatConst6String, concatConstBoolByte, concatConstInt, concatConstIntConstInt, concatConstString, concatConstStringConstInt, concatEmptyConstInt, concatEmptyConstString, concatEmptyLeft, concatEmptyRight, concatMethodConstString, concatMix4String]
  y-axis 0.04 --> 3.2
  bar [0.87, 1.39, 0.85, 0.91, 1.43, 1.56, 1.64, 1.73, 1.72, 1.74, 2.69, 1.62, 1.62, 1.33, 1.68, 1.15, 2.40, 3.02, 2.98, 1.00, 1.47]
  line [1.0,  1.0, 1.0,  1.0, 1.0,  1.0, 1.0,  1.0, 1.0,  1.0, 1.0,  1.0, 1.0,  1.0, 1.0,  1.0, 1.0,  1.0, 1.0,  1.0, 1.0,  1.0]

(Linux-x64, Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz)

We can see that there are mostly improvements while a few cases regress.

The regressions at the top occur where we fall back to spinning bytecode using the simple StringBuilder strategy. The performance cost comes from a combination of outlining that code to a separate class and that code being large enough that inlining it back into the caller is unlikely to happen. In practice outlining might even help overall performance.

Startup woes - past and present

The current strategy has evolved since JDK 9, mainly to address startup and footprint impacts.

An example of changes made was to combine the logical int length and byte coder into a single long lengthCoder (JDK-8213035).

This change (in JDK 12) meant that we bind fewer method handles into the root method handle, which leads to faster execution and fewer generated classes. Posted results for this change alone showed a 10ms improvement out of a total 131ms runtime on a startup test.
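The packing idea can be sketched in a few lines. Treat the layout as an assumption for illustration: here the length sits in the low 32 bits and a single UTF16 flag sits just above it, and the coder is detected by scanning characters since String.coder() is not accessible outside java.lang. All names are made up.

```java
final class LengthCoderSketch {
    // Assumed layout: low 32 bits hold the length, bit 32 flags UTF-16 content.
    static final long UTF16 = 1L << 32;

    static long mix(long lengthCoder, String value) {
        lengthCoder += value.length();
        if (!isLatin1(value)) {
            lengthCoder |= UTF16;
        }
        // The length part went negative: the int length overflowed.
        if ((int) lengthCoder < 0) {
            throw new OutOfMemoryError("Overflow: String length out of range");
        }
        return lengthCoder;
    }

    static int length(long lengthCoder) {
        return (int) lengthCoder;
    }

    static boolean isUtf16(long lengthCoder) {
        return (lengthCoder & UTF16) != 0;
    }

    // Stand-in for the package-private String.coder() check.
    private static boolean isLatin1(String s) {
        return s.chars().allMatch(c -> c <= 0xFF);
    }

    public static void main(String[] args) {
        long lc = mix(mix(0L, "abc"), "\u0434!");
        System.out.println(length(lc) + " chars, utf16=" + isUtf16(lc));
    }
}
```

A single long flowing through the mixers means one synthetic argument in the expression tree instead of two, which is where the savings in bound method handles come from.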

Further work refactored how we stringify Objects and fold constants into the prepend combinators so that even fewer shapes were needed. From https://cl4es.github.io/2019/05/14/String-Concat-Redux.html:

         Total   Overhead
JDK 8:    60ms        0ms
JDK 9:   215ms      129ms
JDK 11:  164ms      118ms 
JDK 12:  111ms       68ms
JDK 13*:  86ms       46ms

On one stress test we dropped the number of loaded classes from 39394 to 3174 from JDK 11 to JDK 13 alone.

Since then there’s not been much targeted work to improve this, and the overhead has slus

Sidetrack: Improving the current implementation

Some of the changes, like JDK-8213035, weren’t perfectly performance neutral, though. I saw some small changes on x64 back then, and when doing a comparison with a -XDstringConcat=inline baseline I saw some pretty significant slowdowns on my M1.

A few turned out to be noise but a few persisted, such as concatMix4String.

Investigating, it turns out that we could profit from streamlining the code a bit: taking care to better inline code into the helper functions, avoiding unnecessary shifts, that kind of thing (PR#19927):

Name                            Change
StringConcat.concat13String      1.25x (p = 0.000*) // Linux x64
StringConcat.concatConst2String  1.20x (p = 0.000*)
StringConcat.concatConst4String  1.17x (p = 0.000*)
StringConcat.concatConst6String  1.17x (p = 0.000*)
StringConcat.concatMix4String    1.24x (p = 0.000*)

StringConcat.concatMix4String    1,76x (p = 0,000*) // Macbook M1

Some of this fixes regressions that have crept into the implementation since inception in JDK 9.

But this PR also demonstrates a key design benefit of JEP 280: delegating to the runtime to generate the code shape leaves us free to experiment and optimize the code without needing to recompile the Java code with a patched javac. The static bytecode remains unchanged.

Lingering woes

So while things have improved since JDK 9 there is still a hefty cost of spinning up complex concat expressions.

That 46ms overhead number I blogged about for JDK 13 all those years ago more or less persists on the same hardware setup.

When working on PR#19927 I realized none of the String concat stress tests I’ve experimented with were added to the OpenJDK, so I’ve added a couple of JMH:ified variants in StringConcatStartup. Running this stand-alone with perf stat -r 10 20 times and collecting the results yields this result:

Name                  Cnt           Base           Error
StringConcatStartup    20        238,000 ±         6,667
  :.cycles                2432315862,050 ±  54176897,441
  :.instructions          5762516585,300 ± 129310503,641
  :.taskclock                    785,500 ±        15,806
* = significant

Add to this the aforementioned StringBuilder fallback where we opt out of the optimized concatenation for high-arity

Leyden to the rescue!

Ioi Lam has done great work within Project Leyden to allow pre-resolving String concat expressions as produced by the StringConcatFactory, storing them in the Leyden AOT archive and reconstituting the final MethodHandle from the archive when linking the callsite. Basically short-cutting the StringConcatFactory entirely:

---
config:
    xyChart:
        width: 600
        height: 600
    themeVariables:
        xyChart:
            plotColorPalette: "#4344A3, #8384F3" 
---
xychart-beta
  title "StringConcatStartup.main, ms/op"
  x-axis [Default, Premain]
  y-axis 0 --> 400
  bar [238, 51.5]

That’s a 4.6x speed-up. And this ~51ms figure includes not only the initialization of all the String concat call sites, but is the start-to-finish time for the entire JVM process.

So Leyden already does great work for StringConcatFactory - for anything captured in a training run.

So we’re done, then?

Well, not so fast. First off, anything not resolved during training will not be captured. Thus when we encounter any unseen callsites the runtime will go through the StringConcatFactory from scratch - potentially spinning a significant number of classes at runtime.

While it’s anyone’s guess what concat callsite capture rate Leyden deployments will have, I still think it’s prudent to improve on the status quo. No reason to let the baseline generate a lot of classes that we don’t need.

Another issue with the current implementation is that it scales poorly if we ever want to improve on type specialization.

In the current model javac emits exactly typed invocations to the StringConcatFactory bootstrap method, and the returned method handle is adapted to the exact type - even though the MethodHandle currently produced by the factory only cares about (some) primitive types and Object. We could specialize more today - for rather incremental gains - but shy away from this since each added specialization increases the potential number of LambdaForm classes that could be generated exponentially.
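To illustrate the gap between exact call site types and the shapes the factory cares about, here is a sketch using public API only. MethodType.erase() collapses every reference type to Object (as mt.erase() does in the excerpt earlier), and asType adapts one generic handle back to an exact type. The class and helper names are made up, and String::valueOf stands in for a generically typed concat handle.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class ErasureSketch {
    // Do the exact types (a, int)String and (b, int)String erase to the same shape?
    static boolean sameErasedShape(Class<?> a, Class<?> b) {
        MethodType first = MethodType.methodType(String.class, a, int.class);
        MethodType second = MethodType.methodType(String.class, b, int.class);
        return first.erase().equals(second.erase());
    }

    // One generic (Object)String handle, adapted to the exact type (StringBuilder)String.
    static String viaExactType(StringBuilder sb) {
        try {
            MethodHandle generic = MethodHandles.lookup().findStatic(String.class, "valueOf",
                    MethodType.methodType(String.class, Object.class));
            MethodHandle exact = generic.asType(
                    MethodType.methodType(String.class, StringBuilder.class));
            return (String) exact.invokeExact(sb);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(sameErasedShape(String.class, StringBuilder.class));
        System.out.println(sameErasedShape(String.class, long.class));
        System.out.println(viaExactType(new StringBuilder("adapted")));
    }
}
```

Reference types all share one erased shape while each primitive keeps its own, which is why every extra specialization multiplies the number of possible shapes.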

Remodeling to spin classes

The simpleConcat method shown earlier is almost what the current SCF strategy would produce if we let it. One difference is that if one of the operands is a string constant the SCF MH strategy would seed the MH expression with the result of mix(initialCoder(), constant). But for a single constant + parameter there’s not much of a difference.

When the StringConcatFactory sees something which can use simpleConcat it either produces a DirectMethodHandle (two reference args being concatenated, foo + bar) or inserts the constant in the right position:

        if (paramCount == 1) {
            String prefix = constants[0];
            if (prefix == null) {
                if (suffix == null) {
                    return unaryConcat(mt.parameterType(0));
                } else if (!mt.hasPrimitives()) {
                    return MethodHandles.insertArguments(simpleConcat(), 1, suffix);
                } // else fall-through
            } else if (suffix == null && !mt.hasPrimitives()) {
                // Non-primitive argument
                return MethodHandles.insertArguments(simpleConcat(), 0, prefix);
            } // fall-through if there's both a prefix and suffix
        }

The idea is to generalize. To do that we need something that will hold on to the constants, then efficiently go through the arguments and return a string.

Extrapolating from simpleConcat and taking a generalized approach to whether there are constants around the arguments we would end up with something like this for a concatenation taking a String and an int:

    private final String c0;
    private final String c1;
    private final String c2;
    private final long initialCoder;
    GeneratedStringConcat(String[] constants) {
        long initialCoder = StringConcatHelper.initialCoder();
        c0 = constants[0];
        initialCoder = StringConcatHelper.mix(initialCoder, c0);
        c1 = constants[1];
        initialCoder = StringConcatHelper.mix(initialCoder, c1);
        c2 = constants[2];
        initialCoder = StringConcatHelper.mix(initialCoder, c2);
        this.initialCoder = initialCoder;
    }

    // Concatenates an expression "prefix" + foo + "constant" + bar + "suffix" 
    String concat(Object o0, int i1) {
        // Stringify Object, float, double args:
        String s0 = StringConcatHelper.stringOf(o0);

        long lengthCoder = initialCoder;
        lengthCoder = StringConcatHelper.mix(lengthCoder, s0);
        lengthCoder = StringConcatHelper.mix(lengthCoder, i1);

        // prepend from the end
        byte[] buf = StringConcatHelper.newArray(lengthCoder, c2);

        // prepend from the end
        lengthCoder = StringConcatHelper.prepend(lengthCoder, buf, c1, i1);
        lengthCoder = StringConcatHelper.prepend(lengthCoder, buf, c0, s0);
        return StringConcatHelper.newString(buf, lengthCoder);
    }

But hold on! The long lengthCoder hack was something we did to work around overheads incurred in the MH combinator trees. That is, by applying mixers on a long which encoded both the int length and byte coder we simplified the mixer MethodHandles and could reduce the number of synthetic arguments in the expression tree.

In a plain Java translation all that is probably just added complexity. Let’s simplify!

Let StringConcatHelper redefine mixer to only deal with length and retain the intermediate overflow checking:

    static int mix(int length, String value) {
        length += value.length();
        return checkOverflow(length);
    }
    
    private static int checkOverflow(int length) {
        if (length < 0) {
            throw new OutOfMemoryError("Overflow: String length out of range");
        }
        return length;
    }

    private final String c0;
    private final String c1;
    private final String c2;
    private final int length;
    private final byte coder;
    GeneratedStringConcat(String[] constants) {
        byte coder = String.COMPACT_STRINGS ? String.LATIN1 : String.UTF16;
        int length = 0;

        c0 = constants[0];
        coder |= c0.coder();
        length = StringConcatHelper.mix(length, c0); // check for overflow
        ...

        this.length = length;
        this.coder = coder;
    }

    // Concatenates an expression "prefix" + foo + "constant" + bar + "suffix" 
    String doConcat(Object o0, int i1) {
        // Stringify Object, float, double args:
        String s0 = StringConcatHelper.stringOf(o0);
        
        // Only string(ified) args can mutate initial coder
        int coder = this.coder | s0.coder(); // we're inside java.lang, which gives access to package-private

        // Mix in lengths
        int length = this.length;
        length = StringConcatHelper.mix(length, s0);
        length = StringConcatHelper.mix(length, i1);
        
        // prepend from the end
        byte[] buf = StringConcatHelper.newArray(length, coder, c2);
                
        // prepend from the end
        length = StringConcatHelper.prepend(length, coder, buf, c1, i1);
        length = StringConcatHelper.prepend(length, coder, buf, c0, s0);
        return StringConcatHelper.newString(buf, length, coder);
    }

Splitting expands the code a bit more in both the constructor and the concat method, but not having to pack and unpack the coder at every mixer and prepend step should make it more straightforward and easier for the compilers to optimize.
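Because the class above leans on package-private helpers, it can’t be compiled outside java.lang. As a rough, runnable stand-in, the sketch below keeps the same structure (constants captured in the constructor, lengths mixed up front, buffer filled from the end) but makes some loud simplifications: it uses a char[] instead of a coder-aware byte[], it stringifies the int instead of writing digits directly, and every name in it is hypothetical.

```java
final class ConcatShapeSketch {
    private final String c0;
    private final String c1;
    private final String c2;
    private final int constantLength;

    // Constants are captured once, and their combined length is precomputed.
    ConcatShapeSketch(String... constants) {
        c0 = constants[0];
        c1 = constants[1];
        c2 = constants[2];
        constantLength = Math.addExact(Math.addExact(c0.length(), c1.length()), c2.length());
    }

    // Concatenates c0 + o0 + c1 + i1 + c2
    String concat(Object o0, int i1) {
        String s0 = String.valueOf(o0);
        String s1 = Integer.toString(i1); // simplification: the real helpers write digits in place

        int length = Math.addExact(Math.addExact(constantLength, s0.length()), s1.length());
        char[] buf = new char[length];

        // Prepend from the end.
        int index = length;
        index = prepend(index, buf, c2);
        index = prepend(index, buf, s1);
        index = prepend(index, buf, c1);
        index = prepend(index, buf, s0);
        index = prepend(index, buf, c0);
        if (index != 0) {
            throw new IllegalStateException("buffer not completely filled");
        }
        return new String(buf);
    }

    private static int prepend(int index, char[] buf, String value) {
        index -= value.length();
        value.getChars(0, value.length(), buf, index);
        return index;
    }

    public static void main(String[] args) {
        ConcatShapeSketch site = new ConcatShapeSketch("a=", ", b=", ";");
        System.out.println(site.concat("x", 7));
    }
}
```

The point of the shape is that two call sites with the same argument types but different constants can share one class and differ only in the instance they hold.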

Incrementally getting there

After discussing these prototyping ideas in a few related PRs, Shaojin Wen (Alibaba) created a PR to generate code similar to what the MH-based strategy would do, but using the classfile API: https://github.com/openjdk/jdk/pull/20273

The code generated has more or less the same raw performance for low-arity expressions as the optimal code produced by the MH strategy, which makes sense since it’s more or less the same generated code. On small startup tests the PR#20273 implementation already looks like a great win, too:

---
config:
    xyChart:
        width: 600
        height: 600
    themeVariables:
        xyChart:
            plotColorPalette: "#4344A3, #8384F3" 
---
xychart-beta
  title "StringConcatStartup, ms/op"
  x-axis [MixedLarge, MixedSmall, StringLarge]
  y-axis 0 --> 400
  bar [320.172, 25.647, 89.95]
  bar [108.981, 5.0, 45.094]

But Shaojin’s implementation generates a class per concatenation. This is bound to scale poorly, as a moderately sized Java application can have many thousands of String concatenations. That means it’s probably not a good replacement for the main StringConcatFactory out of the box.

On a stress test that enumerates 320000 4-arity concatenations the MH-based baseline loads about 3500 generated classes; PR#20273 generates 320000 (and takes roughly twice as long on this extreme):

---
config:
    xyChart:
        width: 800
        height: 600
    themeVariables:
        xyChart:
            plotColorPalette: "#4344A3, #B34443" 
---
xychart-beta
  title "Strings stress test, classes loaded"
  x-axis [Baseline, pr#20273]
  y-axis 0 --> 340000
  bar [13888, 332564]

However, we realized it might be a good replacement for the StringBuilder fallback and as a basis for further work.

Throughput seems comparable to the StringBuilder strategy for high-arity expressions, but the optimized approach helps keep memory pressure low and comparable to the MH-based strategy (which is problematic for other reasons for complex expressions):

---
config:
    xyChart:
        width: 1000
        height: 600
    themeVariables:
        xyChart:
            plotColorPalette: "#4344A3, #B34443" 
---
xychart-beta
  title "StringConcat.concat123 -prof gc, B/op"
  x-axis [MH, StringBuilder, pr#20273]
  y-axis 0 --> 2000
  bar [600, 1840, 616]

Prototyping continues to see if we can get something that performs well at peak, starts up fast and scales nicely.

Full-fledged prototype

I started out from Shaojin’s approach and prototyped a version which generates a shareable class and puts it in a cache. It’s now been merged back into PR#20273, but

It performs well on micros. We’re even beating the current MH-based strategy on some of the micros, though it regresses a bit on others.

---
config:
    xyChart:
        width: 800
        height: 600
    themeVariables:
        xyChart:
            plotColorPalette: "#4344A3, #B34443" 
---
xychart-beta
  title "StringConcat.concatMix4String, ns/op"
  x-axis [MH-Baseline, SB-Baseline, Prototype]
  y-axis 0 --> 125
  bar [94.172, 99.163, 73.405]

Startup tests show compelling improvements with more substantial tests taking 60-70% less time:

Name                            Cnt    Base    Error     Test   Error  Unit  Change
MixedLarge.run                   10 357,285 ± 41,059  151,936 ± 7,964 ms/op   2,35x (p = 0,000*)
MixedSmall.run                   20  25,464 ±  0,777    7,490 ± 0,343 ms/op   3,40x (p = 0,000*)
StringLarge.run                  10  93,364 ±  5,002   27,388 ± 1,423 ms/op   3,41x (p = 0,000*)
StringSingle.constBool           40   2,887 ±  2,490    1,097 ± 0,059 ms/op   2,63x (p = 0,015 )
StringSingle.constBoolString     40   0,288 ±  0,026    0,763 ± 0,041 ms/op   0,38x (p = 0,000*)
StringSingle.constBoolean        40   0,165 ±  0,016    0,149 ± 0,009 ms/op   1,11x (p = 0,003*)
StringSingle.constBooleanString  40   3,816 ±  0,165    1,029 ± 0,050 ms/op   3,71x (p = 0,000*)
StringSingle.constFloat          40   2,785 ±  0,120    1,370 ± 0,493 ms/op   2,03x (p = 0,000*)
StringSingle.constFloatString    40   5,268 ±  2,117    1,485 ± 0,077 ms/op   3,55x (p = 0,000*)
StringSingle.constInt            40   2,178 ±  0,127    1,031 ± 0,044 ms/op   2,11x (p = 0,000*)
StringSingle.constIntString      40   0,183 ±  0,027    0,106 ± 0,007 ms/op   1,72x (p = 0,000*)
StringSingle.constInteger        40   0,155 ±  0,015    0,143 ± 0,009 ms/op   1,09x (p = 0,014 )
StringSingle.constIntegerString  40   3,750 ±  0,164    0,994 ± 0,051 ms/op   3,77x (p = 0,000*)
StringSingle.constString         40   0,166 ±  0,017    0,137 ± 0,008 ms/op   1,21x (p = 0,000*)
StringThree.stringIntString      40   6,939 ±  1,475    1,616 ± 0,120 ms/op   4,29x (p = 0,000*)
StringThree.stringIntegerString  40   6,066 ±  2,076    1,093 ± 0,064 ms/op   5,55x (p = 0,000*)
  * = significant

There’s still some overhead showing on the 4-arity startup stress test, where the existing solution actually generates fewer classes than there are distinct shapes. The latest version generates around 6,500 classes compared to ~3,500 classes for the baseline implementation. The wall clock times are more or less the same, though.

The main difference comes from using distinct logic when there are float or double arguments. Perhaps using the stringifier trick to pre-process those arguments and turn them into Strings would let us make do with fewer classes in total.
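To show why pre-stringifying could cut the class count, here is a hypothetical sketch of a shape cache. It is not the prototype’s code: the normalization rule (float, double and every reference type become String), the cache layout and all names are assumptions made for illustration, and a plain Object stands in for a generated class.

```java
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class ShapeCacheSketch {
    private static final Map<MethodType, Object> CACHE = new ConcurrentHashMap<>();
    static final AtomicInteger GENERATED = new AtomicInteger();

    // Assumed normalization: anything that gets stringified up front is keyed as String.
    static MethodType shapeOf(Class<?>... exactParams) {
        Class<?>[] params = exactParams.clone();
        for (int i = 0; i < params.length; i++) {
            Class<?> p = params[i];
            if (!p.isPrimitive() || p == float.class || p == double.class) {
                params[i] = String.class;
            }
        }
        return MethodType.methodType(String.class, params);
    }

    // Returns the shared stand-in "class" for a call site with these parameter types.
    static Object classFor(Class<?>... exactParams) {
        return CACHE.computeIfAbsent(shapeOf(exactParams), shape -> {
            GENERATED.incrementAndGet(); // a real implementation would spin a class here
            return new Object();
        });
    }

    public static void main(String[] args) {
        classFor(String.class, int.class);
        classFor(Object.class, int.class);
        classFor(double.class, int.class);
        classFor(int.class, int.class);
        System.out.println("call sites: 4, generated: " + GENERATED.get());
    }
}
```

With this rule, call sites taking (String, int), (Object, int) and (double, int) all land on one entry, while (int, int) needs its own.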

Conclusions

Building up complex logic from small building blocks using MethodHandles transforms - as done by JEP 280 - has proven throughput performance, but has challenges with startup overheads and code complexity for larger expressions which can be cumbersome for JITs.

Generating hidden classes into privileged packages from bootstrap methods gives access to privileged APIs and unlocks similar performance as the current-best MH-based approach, at lower deployment and warmup cost.

A hybrid approach where we generate as few classes as possible by leveraging MethodHandles for things that it’s good at, such as filtering and adapting arguments, will end up being the best overall implementation.

                           Cnt     Base     Error      Test    Error  Unit  Change
concat123String             15 1115,249 ±  47,949  1117,600 ± 63,910 ns/op   1,00x (p = 0,904 )
concat13String              15   47,661 ±   0,411    47,305 ±  1,359 ns/op   1,01x (p = 0,314 )
concat13StringConst         15   75,703 ±   3,654    68,695 ±  0,421 ns/op   1,10x (p = 0,000*)
concat23String              15  145,045 ±   1,407   144,934 ±  3,178 ns/op   1,00x (p = 0,897 )
concat23StringConst         15  124,363 ±   1,247   125,227 ±  5,216 ns/op   0,99x (p = 0,514 )
concat30Mix                 15  358,019 ±  19,446   344,140 ±  7,561 ns/op   1,04x (p = 0,013 )
concat3String               15   13,540 ±   0,693    16,308 ±  0,740 ns/op   0,83x (p = 0,000*)
concat4String               15   16,193 ±   0,731    25,779 ±  1,276 ns/op   0,63x (p = 0,000*)
concat6String               15   21,549 ±   0,954    20,055 ±  0,425 ns/op   1,07x (p = 0,000*)
concatConst2String          15   11,599 ±   0,889     8,549 ±  0,221 ns/op   1,36x (p = 0,000*)
concatConst4String          15   16,871 ±   0,861    25,449 ±  0,771 ns/op   0,66x (p = 0,000*)
concatConst6Object          15   58,020 ±   2,317    52,429 ±  1,515 ns/op   1,11x (p = 0,000*)
concatConst6String          15   21,050 ±   0,939    20,368 ±  1,163 ns/op   1,03x (p = 0,070 )
concatConstBool             15    3,832 ±   0,038     3,842 ±  0,115 ns/op   1,00x (p = 0,734 )
concatConstString           15    5,362 ±   0,044     5,453 ±  0,359 ns/op   0,98x (p = 0,316 )
concatConstStringConst      15    8,799 ±   0,310     6,668 ±  0,212 ns/op   1,32x (p = 0,000*)
concatConstStringConstInt   15   13,411 ±   1,395    29,877 ±  1,989 ns/op   0,45x (p = 0,000*)
concatEmptyRight            15    2,483 ±   0,132     2,513 ±  0,095 ns/op   0,99x (p = 0,451 )
concatMethodConstString     15    5,582 ±   0,357     5,393 ±  0,186 ns/op   1,04x (p = 0,064 )
concatMix4String            15   93,714 ±   5,203    80,934 ±  2,970 ns/op   1,16x (p = 0,000*)
concatStringBoolString      15   22,164 ±   1,641     8,943 ±  0,309 ns/op   2,48x (p = 0,000*)