Following up on a recent post (Investigating MD5 Overheads) where I detailed some of the analysis I did around improving java.util.UUID::nameUUIDFromBytes. As of writing that PR is still open, but we have integrated a few optimizations that improve things substantially in the vicinity.
#1933 was the first PR to get integrated as a result of the analysis I did back then.
As suggested by the prototyping and analysis I did early on, the main improvement comes from caching the Constructor used to instantiate the digest object. The main difference from the prototype is that instead of caching in a ClassValue or a ThreadLocal, I cache it in the java.security.Provider$Service object. This means there's no extra lookup or anything, and the gain gets a bit better for it.
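The idea can be sketched roughly like this - with illustrative names, not the actual Provider.Service internals: resolve the reflective Constructor once, stash it in a field, and reuse it on every subsequent instantiation.

```java
import java.lang.reflect.Constructor;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the caching idea (ServiceSketch is a made-up stand-in for
// Provider.Service): instead of doing Class.forName + getConstructor on
// every newInstance call, resolve the Constructor once and reuse it.
final class ServiceSketch {
    private final String className;
    // Cached Constructor, resolved lazily on first use.
    private final AtomicReference<Constructor<?>> cachedCtor = new AtomicReference<>();

    ServiceSketch(String className) {
        this.className = className;
    }

    Object newInstance() {
        try {
            Constructor<?> ctor = cachedCtor.get();
            if (ctor == null) {
                // First call: pay for the reflective lookup once ...
                ctor = Class.forName(className).getConstructor();
                // ... then publish it so later calls skip the lookup.
                cachedCtor.compareAndSet(null, ctor);
            }
            return ctor.newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Since the cache lives in the service object itself, there is no per-thread state and no side-table keyed by class - the Constructor is simply found where the lookup already ends up.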
Add in a few minor enhancements, such as desugaring a use of Objects.hash. Objects.hash is a nice convenience, but it allocates a varargs array, so desugaring it can improve performance - but as always, make sure that it really matters first. In this particular case - slightly affecting the performance of a few public JDK APIs - it seemed like a reasonable thing to do.
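As a sketch of what such a desugaring looks like (HashExample is a made-up class, not JDK code), the varargs call and the manual computation produce identical hash values, but the latter allocates nothing:

```java
import java.util.Objects;

// Objects.hash(a, b) boxes primitives and allocates a varargs Object[]
// on every call. Desugaring it into the equivalent manual computation
// produces the same value with no allocation.
class HashExample {
    final int x;
    final long y;

    HashExample(int x, long y) { this.x = x; this.y = y; }

    // Convenient, but allocates an Object[] and boxes x and y:
    int hashWithVarargs() {
        return Objects.hash(x, y);
    }

    // Desugared: same result, no allocation.
    int hashDesugared() {
        int result = 1;
        result = 31 * result + Integer.hashCode(x);
        result = 31 * result + Long.hashCode(y);
        return result;
    }
}
```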
All in all the allocation overhead of MessageDigest.getInstance is now zero: only the MessageDigest itself will be allocated. Per operation this means a 144 byte reduction in allocation pressure.
Since the optimization is general to any java.security.Provider$Service, this not only means that MessageDigest.getInstance gets a speed bump - any security service retrieved via said Provider API will improve similarly.
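For illustration, lookups like these all funnel through the same Provider.Service machinery, so each of them should see the same reduction in lookup cost (the algorithm names here are just common examples, not ones singled out by the PR):

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.Signature;
import javax.crypto.Mac;

// Each getInstance below resolves its implementation through
// java.security.Provider$Service, so the cached-Constructor
// optimization benefits all of them, not just MessageDigest.
class ProviderServices {
    static String[] algorithms() {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            Mac mac = Mac.getInstance("HmacSHA256");
            Signature sig = Signature.getInstance("SHA256withRSA");
            return new String[] {
                md.getAlgorithm(), mac.getAlgorithm(), sig.getAlgorithm()
            };
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```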
Too late to change the summary of the enhancement to reflect this, though…
Shrink MD5 and SHA down to size
The prototype I did to optimize MD5 itself turned out pretty good: get rid of the temporary array, and fold the code that reads the data to digest into the method that will be replaced by an intrinsic.
This shrinks the instance size by a fair chunk, and by avoiding some activity that did not get optimized away by the intrinsic we get a small improvement to digest performance. Most noticeable when the data to digest is small.
So I drafted the pull request … only to soon realize the same (or at least very similar) optimization could be applied to most of the SHA implementations. The main difference is that the temporary state array to be optimized away for the SHA impls are quite a bit larger, so removing the array code altogether would be cumbersome. But moving the allocation of and use of the arrays inside the intrinsified methods means that once the method gets JIT compiled, the array will never need to be allocated.
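The refactoring pattern can be sketched roughly like this - class and field names are hypothetical, and the actual round logic is reduced to a trivial stand-in:

```java
// Before: the scratch array lives in the instance, adding to the
// footprint of every digest object (names illustrative, not the
// actual sun.security code).
class ShaBefore {
    private final int[] w = new int[64];   // part of every instance
    int state;

    void implCompress(byte[] buf, int ofs) {
        // Expand buf into w, then run the rounds (trivial stand-in here):
        for (int i = 0; i < 16; i++) {
            w[i] = buf[ofs + i];
        }
        for (int i = 0; i < 16; i++) {
            state += w[i];
        }
    }
}

// After: the scratch array is a local inside the method that the JIT
// replaces with an intrinsic. Interpreted code still allocates it, but
// once the intrinsic kicks in the Java body - array included - is
// never executed.
class ShaAfter {
    int state;

    void implCompress(byte[] buf, int ofs) {
        int[] w = new int[64];             // elided once intrinsified
        for (int i = 0; i < 16; i++) {
            w[i] = buf[ofs + i];
        }
        for (int i = 0; i < 16; i++) {
            state += w[i];
        }
    }
}
```

Both versions compute the same result; the point is purely where the scratch array lives and whether the intrinsic can make its allocation disappear.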
Running the getAndDigest microbenchmark I added with the PR using -prof gc shows that this takes effect, and can reduce allocations a lot:
Baseline:
Benchmark                                        (digesterName) (length) Cnt     Score   Error Units
MessageDigests.getAndDigest:·gc.alloc.rate.norm  MD5            16       15    312.011 ± 0.005 B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-1          16       15    584.020 ± 0.006 B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-256        16       15    544.019 ± 0.016 B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-512        16       15   1056.037 ± 0.003 B/op

PR:
Benchmark                                        (digesterName) (length) Cnt     Score   Error Units
MessageDigests.getAndDigest:·gc.alloc.rate.norm  MD5            16       15    232.009 ± 0.006 B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-1          16       15    584.021 ± 0.001 B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-256        16       15    272.012 ± 0.015 B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-512        16       15    400.017 ± 0.019 B/op
(The intrinsic for SHA-1 isn’t enabled by default, so no allocation reduction there.)
Together with the 144 B/op allocation reduction we got from the MessageDigest.getInstance optimization, that means stronger digests like SHA-256 and SHA-512 drop their footprint overheads by a lot: 416 and 800 bytes per operation respectively. That's more than half for SHA-256 and two thirds for SHA-512.
Going back now and running the microbenchmark for UUID::nameUUIDFromBytes that I added in PR#1855, it's clear we've achieved a respectable gain on the specific operation that started this little adventure.
Baseline:
Benchmark                 (size)  Mode   Cnt  Score   Error  Units
UUIDBench.fromType3Bytes     128  thrpt   15  1.620 ± 0.048  ops/us

PR:
Benchmark                 (size)  Mode   Cnt  Score   Error  Units
UUIDBench.fromType3Bytes     128  thrpt   15  2.309 ± 0.049  ops/us
Around 40-50% speed-up, depending on how large your inputs are.
By analysing and trying to find improvements deeper down in the internals, we've managed to deliver patches that improve the security area in general. Translating the MD5 improvements to the various SHA implementations was also a great success. Some of these improvements will hopefully add up to real gains on some hypothetical real app.
While great, those improvements don't necessarily preclude caching the MD5 object in java.util.UUID::nameUUIDFromBytes. There are still some gains to be had on the microbenchmark by doing so:
Benchmark                 (size)  Mode   Cnt  Score   Error  Units
UUIDBench.fromType3Bytes     128  thrpt   15  2.632 ± 0.080  ops/us
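Such a call-site cache might look like the following sketch - to be clear, this is a hypothetical illustration of the approach being weighed, not what the JDK does:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical per-thread cache of an MD5 MessageDigest. Each thread
// pays for getInstance once, then reuses the same object across calls.
final class Md5Cache {
    private static final ThreadLocal<MessageDigest> MD5 =
        ThreadLocal.withInitial(() -> {
            try {
                return MessageDigest.getInstance("MD5");
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError(e);   // MD5 is a mandatory algorithm
            }
        });

    static byte[] digest(byte[] input) {
        MessageDigest md = MD5.get();
        md.reset();                            // clear any previous state
        return md.digest(input);
    }
}
```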
However, we think adding a thread-local cache to this specific call site is not motivated. The true cost of adding thread-locals is hard to analyze: there might be footprint creep in large applications with many threads, and there might be performance penalties if/when the cached objects are moved around by GCs (which is hard to capture properly in microbenchmarks).