Reducing MD5 (and SHA) overheads
Background
Following up on a recent post - Investigating MD5 Overheads - where I detailed some of the analysis I did around improving java.util.UUID::nameUUIDFromBytes. As of writing, that PR is still open, but we have integrated a few optimizations that improve things substantially in the vicinity.
Optimize MessageDigest.getInstance (… by optimizing java.security.Provider::getService)
#1933 was the first PR to get integrated as a result of the analysis I did back then.
As suggested by the prototyping and analysis I did early on, the main improvement comes from caching the Constructor used to instantiate the digest object. The main difference from the prototype is that instead of caching in a ClassValue or a ThreadLocal, I cache it in the java.security.Provider$Service object itself. This means there's no extra lookup at all, and the gain gets a bit better for it.
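The shape of that caching is roughly the following (an illustrative sketch with made-up field and method names, not the actual java.security.Provider code):

```java
import java.lang.reflect.Constructor;

// Illustrative sketch only - not the real java.security.Provider$Service code.
// The first newInstance() call resolves and caches the Constructor; subsequent
// calls reuse it, skipping the repeated reflective lookup and its allocations.
final class Service {
    private final String className;
    private volatile Constructor<?> ctor; // cached after first use

    Service(String className) {
        this.className = className;
    }

    Object newInstance() throws Exception {
        Constructor<?> c = ctor;
        if (c == null) {
            c = Class.forName(className).getDeclaredConstructor();
            ctor = c; // benign race: worst case we resolve it more than once
        }
        return c.newInstance();
    }
}
```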
Add in a few minor enhancements, such as desugaring a call to Objects.hash. Objects.hash is a nice convenience, but it allocates a varargs array, so desugaring it can improve performance - but as always, make sure that it really matters first. In this particular case - slightly affecting the performance of a few public JDK APIs - it seemed like a reasonable thing to do.
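As a generic illustration (not the exact code touched in the PR), a two-argument Objects.hash call can be desugared by hand into the equivalent hash combination, trading the varargs array for a couple of extra characters:

```java
import java.util.Objects;

class HashDesugarExample {
    // Convenient, but allocates a varargs Object[] on every call:
    static int hashWithVarargs(String type, String algorithm) {
        return Objects.hash(type, algorithm);
    }

    // Hand-desugared equivalent (assuming non-null arguments) that produces
    // the same value as Objects.hash, without the array allocation:
    static int hashDesugared(String type, String algorithm) {
        return 31 * (31 + type.hashCode()) + algorithm.hashCode();
    }
}
```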
All in all, the allocation overhead of MessageDigest.getInstance is now zero: only the MessageDigest itself will be allocated. Per operation this means a 144 byte reduction in allocation pressure.
Since the optimization is general to any java.security.Provider$Service, this not only means that MessageDigest.getInstance gets a speed bump - any security service retrieved via that Provider API will improve similarly.
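For example, lookups like these also go through Provider.getService under the hood, so they should see the same kind of improvement:

```java
import javax.crypto.Mac;
import java.security.NoSuchAlgorithmException;
import java.security.Signature;

class OtherServiceLookups {
    static void lookups() throws NoSuchAlgorithmException {
        // Both resolve a java.security.Provider$Service internally,
        // so both benefit from the cached Constructor.
        Mac mac = Mac.getInstance("HmacSHA256");
        Signature sig = Signature.getInstance("SHA256withRSA");
    }
}
```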
Too late to change the summary of the enhancement to reflect this, though…
Shrink MD5 and SHA down to size
The prototype I did to optimize MD5 itself turned out pretty well: get rid of the temporary array, and fold the code that reads the data to digest into the method that will be replaced by an intrinsic. This shrinks the instance size by a fair chunk, and by avoiding some activity that did not get optimized away by the intrinsic we get a small improvement to digest performance - most noticeable when the data to digest is small.
So I drafted the pull request … only to soon realize that the same (or at least a very similar) optimization could be applied to most of the SHA implementations. The main difference is that the temporary state arrays to be optimized away in the SHA impls are quite a bit larger, so removing the array code altogether would be cumbersome. But moving the allocation and use of the arrays inside the intrinsified methods means that once the method gets JIT compiled, the array never needs to be allocated.
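Roughly, the trick looks like this (an illustrative sketch with made-up names and sizes, not the actual sun.security.provider code): the scratch array is allocated inside the method that HotSpot replaces with an intrinsic, so once the method is JIT compiled the allocation disappears along with the rest of the Java fallback.

```java
// Illustrative sketch - not the actual JDK digest code.
final class ShaLikeDigest {
    private final int[] state = new int[8]; // persistent digest state

    // Candidate for intrinsification: HotSpot replaces the entire body,
    // including the scratch-array allocation, with hand-written assembly.
    private void implCompress(byte[] buf, int ofs) {
        int[] w = new int[64]; // scratch space, only used by the Java fallback
        // ... expand buf into w and run the compression rounds against state ...
    }
}
```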
Running the getAndDigest microbenchmark I added with the PR using -prof gc shows that this takes effect, and can reduce allocations a lot:
Benchmark (digesterName) (length) Cnt Score Error Units
Baseline
MessageDigests.getAndDigest:·gc.alloc.rate.norm MD5 16 15 312.011 ± 0.005 B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-1 16 15 584.020 ± 0.006 B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-256 16 15 544.019 ± 0.016 B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-512 16 15 1056.037 ± 0.003 B/op
PR
MessageDigests.getAndDigest:·gc.alloc.rate.norm MD5 16 15 232.009 ± 0.006 B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-1 16 15 584.021 ± 0.001 B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-256 16 15 272.012 ± 0.015 B/op
MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-512 16 15 400.017 ± 0.019 B/op
(The intrinsic for SHA-1 isn’t enabled by default, so no allocation reduction there.)
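For context, the benchmark in question is essentially of this shape (a rough reconstruction based on the parameter names in the results above, not the exact code of the JDK micro):

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Random;

@State(Scope.Thread)
public class MessageDigests {
    @Param({"MD5", "SHA-1", "SHA-256", "SHA-512"})
    String digesterName;

    @Param({"16"})
    int length;

    byte[] data;

    @Setup
    public void setup() {
        data = new byte[length];
        new Random(42).nextBytes(data);
    }

    @Benchmark
    public byte[] getAndDigest() throws NoSuchAlgorithmException {
        // Exercises both the getInstance lookup and the digest itself,
        // which is why both optimizations show up in this benchmark.
        return MessageDigest.getInstance(digesterName).digest(data);
    }
}
```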
Together with the 144 B/op allocation reduction we got from the MessageDigest.getInstance optimization, that means that stronger digests like SHA-256 and SHA-512 drop their footprint overheads by a lot: 416 and 800 bytes per operation respectively (544 - 272 = 272 plus 144 for SHA-256; 1056 - 400 = 656 plus 144 for SHA-512). More than half for SHA-256 and two thirds for SHA-512.
Pretty neat!
Summing up
Going back now and running the microbenchmark for UUID::nameUUIDFromBytes that I added in PR#1855, it's clear we've achieved a respectable gain on the specific operation that started this little adventure.
Before:
Benchmark (size) Mode Cnt Score Error Units
UUIDBench.fromType3Bytes 128 thrpt 15 1.620 ± 0.048 ops/us
After:
Benchmark (size) Mode Cnt Score Error Units
UUIDBench.fromType3Bytes 128 thrpt 15 2.309 ± 0.049 ops/us
Around 40-50% speed-up, depending on how large your inputs are.
By analyzing and chasing improvements deeper down in the internals, we've managed to deliver patches that improve the security area in general. Translating the MD5 improvements to the various SHA implementations is a nice bonus, too. Some of these improvements will hopefully add up to real gains on some hypothetical real app.
While great, those improvements don't necessarily preclude caching the MD5 object in java.util.UUID::nameUUIDFromBytes. There are still some gains to be had on the microbenchmark by doing so:
Benchmark (size) Mode Cnt Score Error Units
UUIDBench.fromType3Bytes 128 thrpt 15 2.632 ± 0.080 ops/us
However, we think adding a thread-local cache to that specific call site is not motivated. The true cost of adding thread-locals is hard to analyze: there might be footprint creep in large applications with many threads, and there might be performance penalties if/when the cached objects are moved around by GCs (which is hard to capture properly in microbenchmarks).
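For completeness, the kind of call-site cache being discussed would look something like this (a hypothetical sketch, not code that was proposed for the JDK):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class CachedMd5 {
    // Hypothetical per-thread cache of the MD5 digest object. This is the
    // variant discussed above: it wins a bit more in the microbenchmark,
    // but carries footprint and GC costs that are hard to measure there.
    private static final ThreadLocal<MessageDigest> MD5 = ThreadLocal.withInitial(() -> {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new InternalError("MD5 not supported", e);
        }
    });

    static byte[] digest(byte[] name) {
        MessageDigest md = MD5.get();
        md.reset(); // ensure no state leaks between uses
        return md.digest(name);
    }
}
```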