Investigating MD5 overheads
Background
A few weeks ago we received a PR intending to speed up java.util.UUID::nameUUIDFromBytes by caching the return value of MessageDigest.getInstance("MD5") in a static field. On the face of it, this seemed like a nice little improvement. But that was before realizing that MessageDigest objects are stateful and thus can’t be shared statically without some synchronization mechanism.
To fix this correctness issue the PR was updated to use a ThreadLocal (sketched below), which comes with its own set of issues:
- Every thread that has ever called UUID::nameUUIDFromBytes would indefinitely hold on to an instance of the MD5 MessageDigest object. (This could be partially resolved by wrapping the object stored in the ThreadLocal in a SoftReference, at the cost of complexity and some speed…)
- ThreadLocals might not play at all nicely with Project Loom, since we can expect the number of virtual threads used to grow at a higher rate. As a precaution the OpenJDK team has deliberately gotten rid of a few dubious ThreadLocal optimizations for this reason.
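For reference, the updated PR’s approach was roughly this shape (a sketch for illustration, not the actual patch):

// shared per-thread digest instance
private static final ThreadLocal<MessageDigest> MD5 =
    ThreadLocal.withInitial(() -> {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException nsae) {
            throw new InternalError("MD5 not supported", nsae);
        }
    });

// ... and in nameUUIDFromBytes:
byte[] md5Bytes = MD5.get().digest(name); // no MessageDigest.getInstance on the hot path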
Shooting down or flat out rejecting a PR is no fun, though. And it definitely seems like there is room for improvement here. In this post I’ll explore some profiling and prototyping work aimed at finding alternative optimizations we could consider. Maybe this can inspire even better suggestions.
I’ve drafted a PR with the current best-performing mix of improvements, but let’s start from the beginning…
It’s 2021 - why bother with MD5?
MD5 is a cryptographic hash function used for more things besides the UUID method we’re looking at here.
I’m no cryptography expert, but I know that MD5 is considered cryptographically broken and susceptible to collision attacks. So while it’s probably never a good idea to rely on MD5 hashing for anything security sensitive, it is a very fast hash that remains a popular choice for integrity checking.
As it happens the MD5 implementation in the OpenJDK was recently optimized by means of intrinsifying the bulk of the algorithm. Only on x86 initially, but implementations on other architectures seem to be coming. Intrinsification is a fun but complicated (and sometimes surprising) technique where the JVM replaces some Java code with hand-optimized versions during compilation. This recent optimization, which will be added in JDK 16, allegedly improves performance of doing MD5 digests by 15-20%.
Basic performance analysis
Before even attempting to optimize anything, it’s important to first try to understand the performance of what you’re trying to optimize. We could zoom in on MessageDigest.getInstance("MD5") directly - or why not the public API method the original PR sought to optimize: java.util.UUID.nameUUIDFromBytes. There’s no single right answer for exactly where to start examining, but drilling down into the details from a somewhat high-level API extracted from a real-world use case will often lead you in the right direction.
Pick (or create) a microbenchmark
Consider this method in java.util.UUID:
public static UUID nameUUIDFromBytes(byte[] name) {
MessageDigest md;
try {
md = MessageDigest.getInstance("MD5");
} catch (NoSuchAlgorithmException nsae) {
throw new InternalError("MD5 not supported", nsae);
}
byte[] md5Bytes = md.digest(name);
md5Bytes[6] &= 0x0f; /* clear version */
md5Bytes[6] |= 0x30; /* set to version 3 */
md5Bytes[8] &= 0x3f; /* clear variant */
md5Bytes[8] |= 0x80; /* set to IETF variant */
return new UUID(md5Bytes);
}
This certainly seems like a gateway to examining the overheads of MessageDigest.getInstance and digest, as well as seeing how these overheads stack up against the creation of the UUID object itself - a good starting point that lets us get a feel for how much each part contributes.
It so happens there already is a UUID-related microbenchmark in the OpenJDK sources, so why not extend that. Let’s keep it simple: generate some random byte[]s at setup and call the public method:
@Benchmark
public UUID fromType3Bytes() {
return UUID.nameUUIDFromBytes(uuidBytes[index]);
}
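The setup this alludes to might look roughly like the following (the field names and input count are assumptions for illustration, not the actual UUIDBench code):

private byte[][] uuidBytes;
private int index; // stepped through between invocations; details elided here

@Setup
public void setup() {
    java.util.Random r = new java.util.Random(42);
    uuidBytes = new byte[2048][16];
    for (byte[] b : uuidBytes) {
        r.nextBytes(b); // fill each pre-generated input with random bytes
    }
}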
After setting up the OpenJDK for building and running JMH, running this will execute that microbenchmark: make test TEST=micro:UUIDBench.fromType3Bytes.
Benchmark Score Error Units
fromType3Bytes 1.460 ± 0.089 ops/us
Run some quick diagnostics
Many of the typical JMH parameters are controllable via the make-based runner, including a catch-all MICRO=OPTIONS="...", but you can also just build the microbenchmarks bundle with make build-microbenchmark and run them directly via java -jar build/<platform>/images/test/micro/benchmarks.jar.
Running with the JMH GC profiler, -prof gc, is my go-to first quick check. It gives a rundown of how much allocation we’re doing per call:
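Using the MICRO=OPTIONS pass-through mentioned above, a command along these lines should produce that (the option string is the standard JMH one):

make test TEST=micro:UUIDBench.fromType3Bytes MICRO=OPTIONS="-prof gc"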
Benchmark Score Error Units
·gc.alloc.rate.norm 488.042 ± 0.005 B/op
488 bytes per call? That’s quite a lot of allocation for something so innocuous.
Pick your profiler
Since version 1.24, JMH includes support for async-profiler (which needs to be installed separately) - a very powerful profiler that now works nicely together with JMH. The ability to generate CPU flame graphs is nice and easy to use:
make test TEST=micro:UUIDB.from MICRO=OPTIONS="-prof async:output=flamegraph"
# generates flame-cpu-forward|reverse.svg in a folder named after the micro
There’s also an allocation profiler/sampler, which I decided to toy with for this analysis. No better reason than wanting to play with something that’s new to me:
make test TEST=micro:UUIDB.from MICRO=OPTIONS="-prof async:event=alloc"
...
--- 1291138320 bytes (24.90%), 923 samples
[ 0] int[]
[ 1] sun.security.provider.MD5.<init>
[ 2] jdk.internal.reflect.GeneratedConstructorAccessor1.newInstance
--- 1287651328 bytes (24.83%), 921 samples
[ 0] byte[]
[ 1] sun.security.provider.DigestBase.<init>
[ 2] sun.security.provider.MD5.<init>
[ 3] jdk.internal.reflect.GeneratedConstructorAccessor1.newInstance
--- 645300224 bytes (12.44%), 310 samples
[ 0] byte[]
[ 1] sun.security.provider.DigestBase.engineDigest
[ 2] java.security.MessageDigest$Delegate.engineDigest
--- 640256000 bytes (12.35%), 610 samples
[ 0] java.util.UUID
[ 1] java.util.UUID.nameUUIDFromBytes
[ 2] org.openjdk.bench.java.util.UUIDBench.fromType3Bytes
--- 640251904 bytes (12.35%), 609 samples
[ 0] java.lang.ref.WeakReference
[ 1] java.lang.reflect.AccessibleObject.slowVerifyAccess
[ 2] java.lang.reflect.AccessibleObject.verifyAccess
[ 4] java.lang.reflect.Constructor.newInstanceWithCaller
[ 5] java.lang.reflect.Constructor.newInstance
[ 6] java.security.Provider.newInstanceUtil
--- 639006720 bytes (12.32%), 305 samples
[ 0] java.lang.Object[]
[ 1] java.security.Provider.newInstanceUtil
[ 2] java.security.Provider$Service.newInstance
I’ve trimmed some of the output. It took me a moment to realize that what’s listed are all reversed stack traces of the hottest allocation sites, and that the current JMH integration does not seem to support generating flame graphs from alloc events. Alas! This is still pretty cool, since it quickly points me to some potential areas of concern…
The first two stack traces pertain to the instantiation of the sun.security.provider.MD5 instance - and represent just about half of the allocations. Seems like a good place to focus our search for improvements.
Next is some byte[] allocated when calling digest on the MD5. This is likely what’s being returned out to UUID. The UUID code will then unpack the byte values into two long values. I guess returning a view into the digested bytes could avoid some of this, but that might be a lot of brain surgery for a dubious gain.
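For reference, that unpacking is roughly the following (a sketch of what the private UUID(byte[]) constructor does, not the exact code):

long msb = 0;
long lsb = 0;
for (int i = 0; i < 8; i++)
    msb = (msb << 8) | (md5Bytes[i] & 0xff); // first 8 bytes -> most significant bits
for (int i = 8; i < 16; i++)
    lsb = (lsb << 8) | (md5Bytes[i] & 0xff); // last 8 bytes -> least significant bits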
Next is the UUID allocation itself. Probably not much we can do about that.
The next two are interesting. Both originate in java.security.Provider::newInstanceUtil. Allocating WeakReferences in a method called slowVerifyAccess seems suspect. And the Object[]s allocated in Provider.newInstanceUtil look like they might be empty varargs arrays? We’ll get to that…
MD5 - room for improvement?
I mentioned that code in the MD5 implementation has recently been intrinsified. When a method is intrinsified, what you see is not necessarily what you get. And in this case something struck me when I examined the state of the current code: one of the large data structures in our allocation profiles is not used at all by the intrinsified code. Still, it’s always allocated, filled up and cleared.
// temporary buffer, used by implCompress()
private int[] x; // will be cleared after a call to digest(byte[])
void implCompress(byte[] buf, int ofs) {
implCompressCheck(buf, ofs);
implCompress0(buf, ofs);
}
private void implCompressCheck(byte[] buf, int ofs) {
Objects.requireNonNull(buf);
// The checks performed by the method 'b2iBig64'
// are sufficient for the case when the method
// 'implCompressImpl' is replaced with a compiler
// intrinsic.
b2iLittle64(buf, ofs, x); // fills up x
}
// The method 'implCompress0' seems not to use its parameters.
// The method can, however, be replaced with a compiler intrinsic
// that operates directly on the array 'buf' (starting from
// offset 'ofs') and not on array 'x', therefore 'buf' and 'ofs'
// must be passed as parameter to the method.
@IntrinsicCandidate
void implCompress0(byte[] buf, int ofs) {
// regular java code uses x, but the intrinsic doesn't
}
It also seems a bit wrong that a method named implCompressCheck - whose name implies it just does a few checks - also has the side effect of reading values into the temporary buffer x.
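Making the check side-effect free could look something like this (a sketch under the assumption that a plain range check is sufficient; not the actual prototype):

private void implCompressCheck(byte[] buf, int ofs) {
    Objects.requireNonNull(buf);
    // bounds-check the 64-byte block without reading anything into a temporary buffer
    Objects.checkFromIndexSize(ofs, 64, buf.length);
}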
I’ve prototyped a variant that refactors implCompressCheck to only do checking, gets rid of the temporary buffer and moves the reading of values from the buffer into implCompress0 - which avoids redundantly reading values twice. This turns out to be beneficial in the UUID micro:
Benchmark Score Error Units
fromType3Bytes 1.523 ± 0.066 ops/us
fromType3Bytes:·gc.norm 408.036 ± 0.003 B/op
An 80 B/op reduction from our 488 B/op baseline, and maybe a small throughput win. If we zoom in on the digest method - for which there’s another microbenchmark - we see that the improvement can be rather pronounced:
Benchmark Score Error Units
MessageDigests.digest 2719.049 ± 30.538 ops/ms # baseline
MessageDigests.digest 3241.352 ± 67.353 ops/ms # prototype
(digesterName: md5, length: 16, provider: DEFAULT)
The effect of this optimization diminishes on larger inputs - down to getting lost in the noise on digests over a few KBs. This suggests the work done in implCompressCheck - which would scale with the input length - is successfully dead code eliminated (DCE’d) by the JIT, while the pointless allocation and clearing of the x buffer can’t be eliminated.
Handling conversion from bytes to ints
If you’ve looked at the source code for the last experiment you might have noticed how I played around with using VarHandles to read int values from a byte[]. This is essentially what the existing code in sun.security.provider.ByteArrayAccess does, but with slightly more ceremony and use of Unsafe. VarHandles were added in JDK 9. They do have a few quirks, but I think using them can nicely encapsulate some of the tricky logic and Unsafe usage in the ByteArrayAccess::b2iLittle64 method that I’m “inlining” here.
Running experiments on this code with the intrinsic disabled shows that this quick and dirty VarHandles implementation performs just as well as Unsafe on my hardware. There’s a bit of overhead (-30%) when running in the interpreter. This overhead was almost halved by extracting long values rather than ints, without regressing higher optimization levels. But optimizing for the interpreter is typically not worth the hassle, unless you’re optimizing code that is startup sensitive in some way. And if so, it might be better to look at ways of improving the VarHandle implementation itself…?
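A minimal sketch of the idea, assuming a little-endian byteArrayViewVarHandle (this isn’t the exact code from the experiment):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

class LittleEndian {
    private static final VarHandle INT_LE =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    // Reads 16 little-endian ints (one 64-byte block) starting at ofs,
    // roughly what ByteArrayAccess::b2iLittle64 does with Unsafe today.
    static void b2iLittle64(byte[] buf, int ofs, int[] dst) {
        for (int i = 0; i < 16; i++) {
            dst[i] = (int) INT_LE.get(buf, ofs + i * 4);
        }
    }
}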
Anyway… if I am to finish up and propose a patch based on this experiment I want to first simplify the code as much as possible and do more thorough testing on more hardware. Revamping using more modern, non-Unsafe APIs is nice, but comes with some risks. It would also be interesting to try out implementing and comparing this byte-to-int conversion using other techniques, including ByteBuffers and the incubating Foreign Memory Access API - as well as consolidating any change in approach with the legacy implementation in sun.security.provider.ByteArrayAccess.
Accidental constraints
One unfortunate side effect of the intrinsification of MD5.implCompress0 is that there’s now a HotSpot intrinsic depending on the int[] state array in MD5. This means that to try out things like flattening this array into four ints I’d have to rewrite some gnarly intrinsic code on all the platforms it’s implemented on. I’m not sure flattening this array would make much of a difference here (a bit less allocation and faster cloning), but it’s a bit annoying that it’s now far from trivial.
Reflection overheads
Let’s take a step back and instead zoom in on the various allocations we saw in Provider::newInstanceUtil. In the code that was allocating Object[]s to the tune of 12%(!?) of the total we find this:
Constructor<?> con = clazz.getConstructor();
return con.newInstance();
For security and integrity reasons clazz.getConstructor() will copy out a new Constructor on each call, and the first call to newInstance on any given Constructor performs an expensive access check. Since we get a fresh copy every time, that first-time check effectively happens on every invocation here.
Both methods are also varargs methods. Calling a varargs method with no arguments means javac generates the allocation of an empty array and passes that to the method. This could mean that optimizing it like this is “meaningful”:
private static final Class<?>[] EMPTY_CLZ = new Class<?>[0];
private static final Object[] EMPTY_OBJ = new Object[0];
Constructor<?> con = clazz.getConstructor(EMPTY_CLZ);
return con.newInstance(EMPTY_OBJ);
Our JIT compilers can often optimize away such allocations. But in this case we do see a tiny gain:
Benchmark Score Error Units
fromType3Bytes 1.575 ± 0.497 ops/us
·gc.alloc.rate.norm 472.040 ± 0.029 B/op
-16 bytes per call, or about 3.2% of the total - not the ~12% the allocation profiling estimated for this method…
Since I’m running this on a 64-bit JVM with default settings, this equals one less allocation of a minimally sized object, such as an empty array. In real applications a micro-optimization like this is likely to be a waste of time 99 times out of 10 - but in internal JDK code that might be used by an assortment of public APIs in any number of ways, it could very well be a reasonable thing to consider. Even so, it seems we might have been led astray…
A bigger cost allocation-wise is probably that we’re copying a Constructor from clazz on every call. And for calling into a default constructor there’s a (deprecated) short-hand that caches the constructor: clazz.newInstance():
Benchmark Score Error Units
fromType3Bytes 1.576 ± 0.061 ops/us
·gc.alloc.rate.norm 368.031 ± 0.027 B/op
-120 bytes per call compared to the baseline. That’s more like it! And quite possibly a decent throughput gain.
Though our documentation recommends replacing uses of clazz.newInstance() with clazz.getDeclaredConstructor().newInstance(), it appears the latter can underperform due to the constructor copying along with an expensive (and allocating!) first-time access check. Fixing this might not be trivial without pulling off some heroics in the JIT compiler, so in the meantime we could be better off caching the default constructor in Provider.
Using a ClassValue could be an efficient way of doing so while ensuring we only weakly reference the classes - avoiding any leaks:
private static final ClassValue<Constructor<?>> DEFAULT_CONSTRUCTORS =
new ClassValue<>() {
@Override
protected Constructor<?> computeValue(Class<?> clazz) {
try {
return clazz.getConstructor();
} catch (NoSuchMethodException e) {
throw new RuntimeException(e);
}
}
};
private Constructor<?> getConstructor(Class<?> clazz)
        throws NoSuchMethodException {
try {
return DEFAULT_CONSTRUCTORS.get(clazz);
} catch (RuntimeException re) {
if (re.getCause() instanceof NoSuchMethodException nsme) {
throw nsme;
}
throw re;
}
}
...
Constructor<?> con = getConstructor(clazz);
return con.newInstance(EMPTY_OBJ);
The need to wrap a NoSuchMethodException in a RuntimeException and then unwrap it is slightly unfortunate, but this does the trick and gives a result very close to using the deprecated clazz.newInstance():
Benchmark Score Error Units
fromType3Bytes 1.575 ± 0.086 ops/us
·gc.alloc.rate.norm 368.034 ± 0.006 B/op
But we’re still spending quite a bit of time in Provider. I prototyped a variant inspired by the original PR that caches the MessageDigest returned from MessageDigest.getInstance("MD5") in a holder in UUID. But instead of a ThreadLocal I clone the shared MD before use. This is thread-safe, relatively straightforward, and avoids any reflection- and Provider-induced overheads on the hot path. This appears to be a marginal improvement over clazz.newInstance() on total allocations, but a large improvement on throughput:
Benchmark Score Error Units
fromType3Bytes 2.018 ± 0.076 ops/us
·gc.alloc.rate.norm 344.029 ± 0.022 B/op
This might be the best we can do safely here, but it’d be nice to optimize this more generally and avoid tailoring our optimization efforts so specifically to UUID.
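In rough strokes the prototype does something like this (a sketch with assumed names, not the actual draft PR):

private static final class MD5Holder {
    // template digest created once; MD5 digests from the default provider are cloneable
    static final MessageDigest TEMPLATE;
    static {
        try {
            TEMPLATE = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException nsae) {
            throw new InternalError("MD5 not supported", nsae);
        }
    }
}

public static UUID nameUUIDFromBytes(byte[] name) {
    MessageDigest md;
    try {
        md = (MessageDigest) MD5Holder.TEMPLATE.clone(); // cheap, thread-safe copy
    } catch (CloneNotSupportedException cnse) {
        // a real patch would fall back to MessageDigest.getInstance("MD5") here
        throw new InternalError("MD5 not cloneable", cnse);
    }
    byte[] md5Bytes = md.digest(name);
    md5Bytes[6] &= 0x0f; /* clear version */
    md5Bytes[6] |= 0x30; /* set to version 3 */
    md5Bytes[8] &= 0x3f; /* clear variant */
    md5Bytes[8] |= 0x80; /* set to IETF variant */
    return new UUID(md5Bytes);
}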
Summing up
We started out with a baseline:
Benchmark Score Error Units
fromType3Bytes 1.460 ± 0.089 ops/us
·gc.alloc.rate.norm 488.042 ± 0.005 B/op
Combining the idea of cloning a cached MD5 object with the removal of the temporary buffer x from MD5 nets us a rather significant improvement over our baseline:
Benchmark Score Error Units
fromType3Bytes 2.186 ± 0.228 ops/us # ~1.45-1.5x
·gc.alloc.rate.norm 264.023 ± 0.006 B/op # -46%
The ThreadLocal approach suggested in PR#1821 still wins on this microbenchmark - and also works together with the independent MD5 optimization to get a bit further:
Benchmark Score Error Units
fromType3Bytes 2.578 ± 0.060 ops/us # ~1.75x
·gc.alloc.rate.norm 64.005 ± 0.001 B/op # -87%
We’ll probably not beat the ThreadLocal approach on this microbenchmark by trimming down overheads in the underlying code, but the current and upcoming disadvantages of using ThreadLocals in the JDK libraries mean that the ThreadLocal PR is unlikely to make it through review. So maybe what we have here is the best we can do in a practical sense.
My intent is to clean up and get some of the experiments explored here integrated over the next few weeks as time allows. Rampdown work for JDK 16 might take precedence, though.
The ideal here would be to inch close enough to those ThreadLocal numbers on this specific micro without making anything else worse and without just resorting to point fixes.
(Happy New Year!)