Date: Sat, 27 Nov 2010 19:22:07 -0500
From: "D. J. Bernstein"
To: Multiple recipients of list
Subject: Second preimages for 6 (7? (8??)) rounds of Keccak?
At the Crypto 2010 rump session, Dinur and Shamir announced an algebraic
attack on Hamsi-256. They've now posted their paper:
http://eprint.iacr.org/2010/602
It turns out that Dinur and Shamir are looking through 2^256 possible
second preimages, but they argue that this still qualifies as an attack
because they've reduced the number of bit operations in the inner loop.
The simplest form of the attack is as follows. Split a possible preimage
into a 224-bit prefix and a 32-bit suffix. The structure of Hamsi allows
(e.g.) output bits 5,156,214,221,249 to be written as polynomials of
degree at most 6 in the suffix bits, where the polynomial coefficients
are particular functions of the prefix. Compute these polynomials, and
then evaluate these polynomials on all 2^32 suffixes; only about 2^27
suffixes will match the target in the selected 5 output bits.
Standard fast algorithms for multipoint evaluation take only about
5*32*2^31 bit operations, for a total of 80*2^256 bit operations.
Computing the polynomials in the first place is not a bottleneck, since
the polynomials have low degree. The number of compression-function
evaluations is reduced by a factor of almost 2^5.
Now let's think about what this means for Keccak. Out of the five SHA-3
candidates specified in the Keccak submission, let's focus on the
maximum-security candidate keccakc1024: Keccak[r=576,c=1024,nr=24] with
512-bit output. Each 576-bit input block is fed through 24 rounds; let's
weaken Keccak by reducing the 24 rounds to 6 rounds.
Because the block size is already above 512 bits, we can expect to find
a single-block second preimage. Here's how we'll do that:
Choose 64 bits (b[1],...,b[64]) not matching the first preimage.
For each pattern of 128 bits (c[1],...,c[128]):
Compute a polynomial representation of the function f
that maps (x[1],...,x[384]) to the first 10 bits
of Keccak6(b[1],...,b[64],c[1],...,c[128],x[1],...,x[384]).
Use fast multipoint evaluation to compute f
for all (x[1],...,x[384]).
For each (x[1],...,x[384]) where f matches the target:
Compute Keccak6(b[1],...,b[64],c[1],...,c[128],x[1],...,x[384]).
This procedure searches through 2^128*2^384 = 2^512 Keccak inputs, so it
has an excellent chance of finding a second preimage, and adding early
termination with a restart effectively increases this chance to 1. So
the only question is how long the procedure takes.
Each Keccak round is quadratic in the input bits, so Keccak6 has degree
at most 64, so each bit of f has degree at most 64 in x[1],...,x[384];
i.e., f can be written as
sum_S c_S prod_{i in S} x_i
where S ranges over subsets of {1,2,...,384} having size at most 64, and
each c_S is a 10-bit string. We'll compute c_S using a standard
identity, namely higher-order differentiation modulo 2:
c_S = sum_{T subset S} f(T).
There are sum(j=0,64,binomial(384,j)) < 2^246 choices of T, so the
initial computation of each f(T) uses only 2^246 evaluations of Keccak6.
Across the whole attack that's just 2^128*2^246 = 2^374 evaluations of
Keccak6.
Separate computation of c_S = sum_{T subset S} f(T) for each S now uses
sum(j=0,64,2^j*binomial(384,j)) < 2^310 additions of 10-bit strings.
Across the whole attack that's a total of 10*2^438 bit operations. This
is clearly highly suboptimal---standard optimizations (CSE, Paar, etc.)
make a huge difference in the number of additions, and this becomes
important as the degree increases---but I'll stick to it for simplicity.
Fast multipoint evaluation now takes just 384*2^383 additions of 10-bit
strings. Across the whole attack that's a total of 1920*2^512 bit
operations. This is again suboptimal, but it's already much faster than
2^512 evaluations of Keccak6.
This initial filter eliminates most of the 2^384 input strings, leaving
just 2^374 input strings that require Keccak6 evaluations. Across the
whole attack that's just 2^128*2^374 = 2^502 Keccak6 evaluations.
Overall the attack involves 2^502 Keccak6 evaluations plus 1920*2^512
bit operations plus negligible extra work. Can someone double-check the
details and see whether I've calculated all this correctly?
For comparison, there are several papers using Keccak degree bounds to
"distinguish" more rounds of Keccak from a random oracle, but nobody has
managed to give a proper definition of a hash-function "distinguisher"
(a definition that's mathematically clear and that doesn't allow every
function to be trivially distinguished), and it's certainly not possible
to turn these "distinguishers" into attacks against accepted security
notions such as preimage resistance. The SHA-3 Zoo doesn't list any
preimage attacks against reduced-round Keccak, and the only occurrences
of "preimage" on the Keccak cryptanalysis page are pointing to
* Lathrop finding partial preimages in 4 rounds of Keccak and
speculating that the attacks could be pushed to more rounds, and
* Morawiecki and Srebrny finding more restricted partial preimages in
3 rounds of Keccak.
I think that this very simple 6-round Keccak attack can be pushed to 7
rounds (possibly even 8?), if CSE is used to accelerate the computation
of c_S; if that's correct then Keccak will need at least 8 (or even 9?)
rounds to meet its preimage-resistance goals. Full 24-round Keccak
therefore has <=200% extra computation as a security margin.
For people who evaluate attack cost by counting compression-function
evaluations, this 2^10 reduction in the cost of second preimages can be
easily improved to a reduction factor of 2^100 or more. Of course, this
would actually make the attack much slower; to me this is yet another
clear demonstration of how silly it is to evaluate attack cost by
counting compression-function evaluations.
In the opposite direction, for people who have even the slightest
understanding of the physical reality of attack cost, this attack makes
no sense: it's inherently memory-intensive and communication-intensive,
and for the same machine size it's clearly much slower than parallel
exhaustive search, even though it has somewhat fewer bit operations. The
same is true for the Dinur--Shamir attack on Hamsi-256.
One can also reasonably ask why anyone should care about the difference
between 2^520 bit operations and 2^525 bit operations, or between 2^260
bit operations and 2^270 bit operations. If we're really trying to
maximize the epsilon in 2^(512+epsilon), shouldn't we be selecting the
SHA-3 candidate with the largest number of bit operations in its
compression function and with no known inner-loop optimizations? Is
there any evidence that this type of superficial "security" metric has
any value in predicting real-world security?
---D. J. Bernstein
Research Professor, Computer Science, University of Illinois at Chicago
P.S. I'm also curious whether anyone has tried message modification
against Keccak. I wouldn't expect anything to get past 12 rounds, but
the lack of reduced-round collision examples is a bit embarrassing!
===========================================================================
Date: Thu, 2 Dec 2010 09:48:49 -0500
From: "D. J. Bernstein"
To: Multiple recipients of list
Subject: Re: Second preimages for 6 (7? (8??)) rounds of Keccak?
Here's an improved analysis of the bit-operation count for the attack,
including one way to use CSE. For 7 rounds of Keccak, the number of bit
operations here is under 1/30 compared to brute force. For 8 rounds of
Keccak, the number of bit operations here is under 1/1.4 compared to
brute force. The technique breaks down at 9 rounds.
The general setup is that we have a function f mapping F_2^n to F_2^b,
and we know that f has degree at most d, i.e., each output bit of f can
be written as a sum of monomials of degree at most d in the input bits
x[1],x[2],...,x[n]. In other words,
f(x[1],x[2],...,x[n]) = sum_S c_S prod_{i in S} x[i]
where S ranges over subsets of {1,2,...,n} having size at most d, and
each c_S is in F_2^b. This formula uses only monomials that are reduced
modulo x[i]^2-x[i], but allowing x[i]^2 etc. does not add any generality.
We aren't given these coefficients c_S; we're given an algorithm that
computes f by computing a longer string and extracting b bits from it.
Our main goal is to compute f on all 2^n inputs without simply applying
the algorithm 2^n times.
Step 1: Use the specified f algorithm to compute f(T) for each subset T
of {1,2,...,n} having size at most d. The number of T's is exactly
sum_{j in {0,1,...,d}} {n choose j}.
This number is 2^n if d=n, but drops below 2^n as d decreases, and
begins dropping rapidly as d decreases below n/2.
Step 2: For each subset S of {1,2,...,n} having size at most d,
interpolate c_S from these known f(T) values using the higher-order
differentiation formula
c_S = sum_{T subset S} f(T).
A separate computation of each c_S (as in Dinur-Shamir) would take
sum_{j in {0,1,...,d}} (2^j-1){n choose j} additions of b-bit vectors,
but at the bottom of this message I'll explain an application of CSE
that reduces this cost to
sum_{j in {0,1,...,d}} j{n choose j}
additions of b-bit vectors. This makes a huge difference in complexity,
for example reducing 2^476 down to 2^356 for n=384, d=128.
Step 3: Put c_S = 0 for all other subsets S of {1,2,...,n}. This
produces an array of 2^n coefficients c_S. For each i in {n,...,2,1}
successively, add all x[i]^0 entries in the array to all x[i]^1 entries
in the array; this takes a total of n 2^(n-1) additions of b-bit
vectors, and produces the desired values of f. This is a standard fast
multipoint evaluation technique, which Dinur and Shamir call a "fast
Moebius transform." For example, for n=2, this process replaces
c0 + c1 x + c2 y + c3 xy with
c0 + c1 x, (c0+c2) + (c1+c3) x and then
c0, c0+c1, c0+c2, c0+c1+c2+c3.
The first step replaced the polynomials in y by their values at y=0,y=1,
and the second step did the same with x.
Step 4: For each output that matches the target b-bit string, apply the
specified f algorithm again to compute the longer string, and see
whether that matches the longer target. The b-bit match will happen
about 2^(n-b) times (unless f is horrifyingly nonrandom).
If f takes T bit operations then this whole procedure takes
T sum_{j in {0,1,...,d}} {n choose j}
+ b sum_{j in {0,1,...,d}} j{n choose j}
+ bn 2^(n-1)
+ T 2^(n-b)
bit operations. Here are some Keccak examples:
* 6 rounds of Keccak have T=48384 (if I've counted correctly), d=64.
Then b=9 and n=173 use just 892*2^n bit operations, more than 50
times faster than the T*2^n bit operations used by brute force.
Note that this applies to 256-bit Keccak outputs.
* 7 rounds of Keccak have T=56448, d=128. Then b=8 and n=317 use just
1510*2^n bit operations, 37 times faster than brute force. Note
that this also applies to Keccak[] with >=384-bit output, and can
marginally break 256-bit output.
* 8 rounds of Keccak have T=64512, d=256. Then b=7 and n=505 use just
44599*2^n bit operations, still a significant speedup. I've limited
n here so that there's only about a 1% overhead from wasted
computations in a search for 512-bit second preimages.
Finally, here's the CSE application. For each j from 0 through d in
turn, for each set S of size j, we'll compute c_S using exactly j
additions of recorded subexpressions. This is trivial for j=0, so assume
j>=1. Write S = {s1,s2,...,sj} with s1
To: Multiple recipients of list
Subject: RE: Second preimages for 6 (7? (8??)) rounds of Keccak?
Dear Dan,
we have studied the attack in your original message and the generalization in your follow-up message with great interest. Despite its theoretical nature (due to the memory requirements a.o.), we think it is interesting because it exploits the small algebraic degree of the Keccak-f round function for attacking Keccak rather than building a distinguisher of Keccak-f.
Reading your follow-up message, we noticed that most of it is not Keccak-specific. We tried to summarize it and make explicit the conditions a hash function must satisfy for this attack to be applicable. This is what we came up with. Can you comment?
Consider a hash function with an m-bit digest and consider messages of a certain fixed length L.
Let there be:
- a subset A (with |A| = n) of the message bit positions and
- a subset B (with |B| = b) of the digest bit positions, such that for any fixed value of the message bits not in A, the bits in B have algebraic degree d < n as functions of bits in A.
If this degree d is sufficiently small, the computation effort of finding (second) pre-images can be reduced.
This attack can be applied to any hash function (possibly round-reduced) where a subset B of the digest bits have a low algebraic degree in a subset A of the message bits of the last block. The bits in B are only separated from the bits in A by a single compression function (or output transformation) call.
The downside of the attack is that this workload reduction comes at the cost of memory. For it to be a reduction, one considers the computational effort not including the cost of memory access (which apparently becomes important as the amount of memory grows).
The memory required for the attack as described in your follow-up message is b*2^n bits, required for doing Step 3. This may be reduced somewhat by some clever programming techniques (at the cost of increasing the computational effort) but the required memory is in any case lower bounded by the storage required for the c_S values in between Step 2 and Step 3: b*sum_{j in {0,1,...,d}} {n choose j} bits.
For Keccak with Keccak-f reduced to 6, 7 or 8 rounds, the value of L is free. The active bits must all be in the last block, so n is limited by the rate (minus the padding). One can tune b and n resulting in different memory-time trade-offs, but the ones reported are:
- 6 rounds: 2^176 bits of memory give a workload reduction by a factor 50 (~6 bits)
- 7 rounds: 2^320 bits of memory give a workload reduction by a factor 37 (~5 bits)
- 8 rounds: 2^508 bits of memory give a workload reduction by a factor 1.4 (half a bit)
Kind regards,
the Keccak team