SHA-3 Hardware Implementations Round Three

Revision as of 17:47, 22 May 2012

1 Call for Contributions

Implementers (both submitters and non-submitters): Do you have results that complement this site? Let us know at sha3zoo-hardware@iaik.tugraz.at. If you are making your HDL code available, please also provide us with the corresponding information.

2 Important Information

This page summarizes key properties of reported hardware implementations of the SHA-3 candidates that are currently under consideration by NIST. This is work in progress. If you know of any implementations that should be mentioned on this page, refer to our call for contributions.

A list of hardware implementations of the round 1 candidates can be found here. A list of hardware implementations of the round 2 candidates is archived here. Please note that the pages for round 1 and 2 candidates are provided for reference and will not be updated.

The implementations are categorized into FPGA and standard-cell ASIC implementations. Note that the diversity of implementation scope, target technologies, and synthesis tools makes direct comparisons between different hardware implementations difficult. The more of these parameters agree, the more reasonable the comparison becomes.

The target technology should be as similar as possible. For FPGA implementations, it is desirable to compare implementations on the same target device (or at least on devices of the same FPGA family). For standard-cell ASIC implementations, at least the minimal gate length of the process (e.g., 0.13 µm) should agree. Ideally, the implementations use the same standard-cell library (which implies the use of the same process technology).
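The guideline above can be sketched as a small grouping step: results are only set side by side when they share a target technology. This is a minimal illustration, not part of the site's tooling. The four rows are taken from the high-speed FPGA table in Section 5.1, and the function names `group_by_technology` and `comparable_groups` are hypothetical.

```python
from collections import defaultdict

# (hash function, reference, technology, size, throughput in Mbit/s)
# Rows copied from the high-speed FPGA table in Section 5.1.
RESULTS = [
    ("BLAKE-32", "Kobayashi et al. [3]", "Xilinx Virtex 5", "1660 slices", 2676),
    ("BLAKE-32", "Homsirikamol et al. [30]", "Xilinx Virtex 5", "1523 slices", 3143),
    ("BLAKE-32", "Homsirikamol et al. [30]", "Altera Stratix III", "3635 ALUTs", 2901),
    ("BLAKE-32", "Submission doc. [1]", "Xilinx Virtex 4", "3087 slices", 2235),
]


def group_by_technology(results):
    """Collect result rows under their target technology (hypothetical helper)."""
    groups = defaultdict(list)
    for row in results:
        groups[row[2]].append(row)
    return dict(groups)


def comparable_groups(results):
    """Keep only technologies with at least two results, i.e. something to compare."""
    return {
        tech: rows
        for tech, rows in group_by_technology(results).items()
        if len(rows) > 1
    }


for tech, rows in comparable_groups(RESULTS).items():
    print(tech, [(ref, size, mbit) for _, ref, _, size, mbit in rows])
```

With these rows, only the two Xilinx Virtex 5 results end up in a common group. Even within a group, the implementation scope and the synthesis tools may still differ, so the grouping is a necessary filter rather than a sufficient one.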

In order to facilitate the comparison of hardware modules with different implementation scopes, we classify them into the three categories described in Sections 2.1 to 2.3.

For suggestions regarding the structure of this site, let us know at sha3zoo-hardware@iaik.tugraz.at.

2.1 Fully Autonomous Implementation

Such hardware implementations include the complete functionality of a SHA-3 candidate (or a specific version thereof). That means the input message can be loaded piecewise into the hardware module and it delivers the message digest as output. All hash calculations happen exclusively within the hardware module. If integrated in a system, the achievable throughput of a fully autonomous implementation depends on the speed of the hardware module itself and the speed of the (system dependent) data interface delivering the input message.
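The dependence on both the module and the data interface can be illustrated with a deliberately simple model in which the slower of the two limits the system. This is an assumption made for illustration only, since real systems may add buffering or protocol overhead. The function name and the 800 Mbit/s interface figure are hypothetical; 3143 Mbit/s is the BLAKE-32 Virtex 5 result of Homsirikamol et al. [30] from Section 5.1.

```python
def system_throughput_mbit_s(core_mbit_s, interface_mbit_s):
    """Simplified bottleneck model: the slower component bounds the system.

    Assumes the hash core and the data interface operate concurrently and
    ignores buffering and protocol overhead.
    """
    if core_mbit_s < 0 or interface_mbit_s < 0:
        raise ValueError("throughput values must be non-negative")
    return min(core_mbit_s, interface_mbit_s)


# Core: 3143 Mbit/s (table value); interface: 800 Mbit/s (hypothetical).
print(system_throughput_mbit_s(3143, 800))
```

Under this model, a fast core behind a slow interface is limited to the interface rate, which is why the reported core throughput is best read as an upper bound for an integrated system.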

2.2 Implementation with External Memory

These implementations use external memory to hold intermediate values during the hashing of a message. The implemented hardware itself normally consists of the core logic functionality of the hash function, some registers for short-lived temporary values, and possibly a memory controller for access to the external memory. Such implementations can load the input message either over a dedicated interface (similar to a fully autonomous implementation) or from the external memory. In order to reach the maximal throughput of the hardware module, the external memory must be sufficiently fast.

2.3 Implementation of Core Functionality

Such implementations comprise only important parts of the hash function (e.g., the compression function), which normally allows a first-order estimate of the performance figures of full implementations.

3 Tweaks of Round Three Candidates over Round Two

The main tweaks for round three consist of the adaptation of round numbers for some of the candidates. For implementations of round 2 variants (cf. round two results), we extrapolated to the performance of round 3 variants. Extrapolated results are marked in color. If the tweaks for an algorithm are expected to be negligible for performance (e.g., just a change of constants), we include the results for the round 2 variant verbatim.

BLAKE: The round three versions of BLAKE have been renamed to BLAKE-224, BLAKE-256, BLAKE-384, and BLAKE-512. The number of rounds has been increased from 10 to 14 for BLAKE-224 and BLAKE-256, and from 14 to 16 for BLAKE-384 and BLAKE-512. Thus, throughput for BLAKE-224 and BLAKE-256 is expected to decrease by a factor of 10/14 (reduction by about 28.6%), and for BLAKE-384 and BLAKE-512 by a factor of 14/16 (reduction by 12.5%).
Grøstl: The shift distances for the Q permutation have been changed and the round constants for both P and Q permutation have been modified. The first is not expected to have an impact on hardware performance, whereas the latter is likely to increase overall hardware size and/or decrease throughput slightly.
JH: The number of rounds has been increased from 35.5 to 42. Thus, throughput of JH is expected to decrease by a factor of 35.5/42 (reduction by about 15.5%).
Keccak: The padding rule has been simplified and some parameters have been redefined. No significant impact on hardware performance is expected.
Skein: A single 64-bit constant has been changed. No significant impact on hardware performance is expected.
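The round-count extrapolation described for BLAKE and JH can be sketched as follows. The sketch assumes, as the text does, that throughput scales inversely with the number of rounds. The names `ROUND_CHANGES`, `scaling_factor`, `extrapolate_throughput`, and `reduction_percent` are hypothetical, and the 1600 Mbit/s input is an arbitrary example value rather than a table entry.

```python
from fractions import Fraction

# (round-2 rounds, round-3 rounds), as stated in the text above.
ROUND_CHANGES = {
    "BLAKE-224/256": (Fraction(10), Fraction(14)),
    "BLAKE-384/512": (Fraction(14), Fraction(16)),
    "JH": (Fraction(71, 2), Fraction(42)),  # 35.5 rounds -> 42 rounds
}


def scaling_factor(candidate):
    """Ratio of old to new round count, e.g. 10/14 for BLAKE-224/256."""
    old_rounds, new_rounds = ROUND_CHANGES[candidate]
    return old_rounds / new_rounds


def extrapolate_throughput(candidate, round2_mbit_s):
    """Scale a round-2 throughput to the expected round-3 throughput."""
    return float(round2_mbit_s * scaling_factor(candidate))


def reduction_percent(candidate):
    """Expected throughput reduction in percent."""
    return float((1 - scaling_factor(candidate)) * 100)


for name in ROUND_CHANGES:
    print(f"{name}: factor {scaling_factor(name)}, "
          f"reduction {reduction_percent(name):.2f}%")

# Arbitrary example: a 1600 Mbit/s round-2 BLAKE-384/512 result.
print(extrapolate_throughput("BLAKE-384/512", 1600))
```

Exact fractions are used so that the half round in JH's original 35.5 rounds does not introduce rounding noise into the factor.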

4 Ongoing Hardware Benchmarking Efforts

To describe it in the words of the initiators and maintainers: "ATHENa: Automated Tool for Hardware EvaluatioN is a project started at George Mason University, aimed at fair, comprehensive, and automated evaluation of cryptographic cores developed using hardware description languages, such as VHDL and Verilog." More information about the project and the current results can be found on the ATHENa webpage. Note: As each hash module submitted to ATHENa is implemented on several FPGA platforms, the SHA-3 zoo pages will not replicate all results produced by the ATHENa project on this webpage. Instead, please refer directly to the ATHENa webpage.

5 Summary of All Results

This section lists known published results in four categories: high-speed and low-area implementations, each for FPGA and ASIC. If the HDL source code is available, a link is provided as well.

5.1 High-Speed Implementations (FPGA)

Important note: The size and functionality of slices vary between FPGA families. A direct comparison of the slice count of implementations on different FPGA families is therefore problematic.

Hash Function Name	Reference / HDL	Impl. Scope	Implementation Details	Technology	Size	Throughput	Clock Frequency
BLAKE-32	Submission doc. [1] / Submission webpage	Core functionality	Compression function with 8 G function units	Xilinx Virtex-II Pro	3091 slices	1724 Mbit/s	37.0 MHz
BLAKE-32	Submission doc. [1] / Submission webpage	Core functionality	Compression function with 8 G function units	Xilinx Virtex 4	3087 slices	2235 Mbit/s	48.0 MHz
BLAKE-32	Submission doc. [1] / Submission webpage	Core functionality	Compression function with 8 G function units	Xilinx Virtex 5	1694 slices	3103 Mbit/s	67.0 MHz
BLAKE-32	Namin and Hasan [2] / N/A	Core functionality	Compression function with 8 G function units and I/O registers	Altera Stratix III	5435 ALUTs	2186.2 Mbit/s	46.97 MHz
BLAKE-32	Kobayashi et al. [3] / RCIS webpage	Fully autonomous		Xilinx Virtex 5	1660 slices	2676 Mbit/s	115 MHz
BLAKE-32	Homsirikamol et al. [30] / On request	Fully autonomous	4 G function units per iteration	Xilinx Virtex 5	1523 slices	3143 Mbit/s	128.9 MHz
BLAKE-32	Homsirikamol et al. [30] / On request	Fully autonomous	4 G function units per iteration	Altera Stratix III	3635 ALUTs	2901 Mbit/s	119.0 MHz
BLAKE-32	Baldwin et al. [31] / UCC webpage	Fully autonomous		Xilinx Virtex 5	1118 slices	1169 Mbit/s	118.06 MHz
BLAKE-32	Matsuo et al. [33] / RCIS website	Fully autonomous		Xilinx Virtex 5	1660 slices	2676 Mbit/s	115 MHz
BLAKE-64	Submission doc. [1] / Submission webpage	Core functionality	Compression function with 8 G function units	Xilinx Virtex-II Pro	11122 slices	1177 Mbit/s	17.0 MHz
BLAKE-64	Submission doc. [1] / Submission webpage	Core functionality	Compression function with 8 G function units	Xilinx Virtex 4	11483 slices	1707 Mbit/s	25.0 MHz
BLAKE-64	Submission doc. [1] / Submission webpage	Core functionality	Compression function with 8 G function units	Xilinx Virtex 5	4329 slices	2389 Mbit/s	35.0 MHz
BLAKE-64	Baldwin et al. [31] / UCC webpage	Fully autonomous		Xilinx Virtex 5	1718 slices	1299 Mbit/s	90.91 MHz
BLAKE-64	Homsirikamol et al. [30] / On request	Fully autonomous	4 G function units per iteration	Xilinx Virtex 5	3064 slices	3520 Mbit/s	99.7 MHz
BLAKE-64	Homsirikamol et al. [30] / On request	Fully autonomous	4 G function units per iteration	Altera Stratix III	7086 ALUTs	3161 Mbit/s	89.5 MHz
Grøstl-224/256	Jungk et al. [6] / N/A	Fully autonomous	P & Q permutation in parallel	Xilinx Spartan 3	6136 slices	4520 Mbit/s	88.3 MHz
Grøstl-224/256	Submission doc. [7] / N/A	Fully autonomous	P & Q permutation in parallel	Xilinx Virtex 5	1722 slices	10276 Mbit/s	200.7 MHz
Grøstl-224/256	Baldwin et al. [4] / N/A	Core functionality	P & Q permutation in parallel, S-box in BRAM	Xilinx Spartan 3	4827 slices	3660 Mbit/s	71.53 MHz
Grøstl-224/256	Baldwin et al. [4] / N/A	Core functionality	P & Q permutation in parallel, S-box in BRAM	Xilinx Virtex 5	4516 slices	7310 Mbit/s	142.87 MHz
Grøstl-256	Kobayashi et al. [3] / RCIS webpage	Fully autonomous		Xilinx Virtex 5	4057 slices	5171 Mbit/s	101 MHz
Grøstl-256	Homsirikamol et al. [30] / On request	Fully autonomous	P & Q permutations interleaved	Xilinx Virtex 5	1597 slices	7885 Mbit/s	323.4 MHz
Grøstl-256	Homsirikamol et al. [30] / On request	Fully autonomous	P & Q permutations interleaved	Altera Stratix III	6350 ALUTs	5380 Mbit/s	220.7 MHz
Grøstl-256	Baldwin et al. [31] / UCC webpage	Fully autonomous		Xilinx Virtex 5	2391 slices	3242 Mbit/s	101.32 MHz
Grøstl-256	Matsuo et al. [33] / RCIS website	Fully autonomous		Xilinx Virtex 5	2616 slices	7885 Mbit/s	154 MHz
Grøstl-384/512	Submission doc. [7] / N/A	Fully autonomous	P & Q permutation in parallel	Xilinx Spartan 3	20233 slices	5901 Mbit/s	80.7 MHz
Grøstl-384/512	Baldwin et al. [4] / N/A	Core functionality	P & Q permutation parallel, S-box in LUTs	Xilinx Spartan 3	17452 slices	3180 Mbit/s	79.61 MHz
Grøstl-384/512	Baldwin et al. [4] / N/A	Core functionality	P & Q permutation parallel, S-box in LUTs	Xilinx Virtex 5	19161 slices	6090 Mbit/s	83.33 MHz
Grøstl-384/512	Submission doc. [7] / N/A	Fully autonomous	P & Q permutation in parallel	Xilinx Virtex 5	5419 slices	15395 Mbit/s	210.5 MHz
Grøstl-384/512	Jungk and Reith [22] / N/A	Fully autonomous	Shared P & Q permutation	Xilinx Spartan 3	8308 slices	3474 Mbit/s	95 MHz
Grøstl-512	Baldwin et al. [31] / UCC webpage	Fully autonomous		Xilinx Virtex 5	4845 slices	3619 Mbit/s	123.4 MHz
Grøstl-512	Homsirikamol et al. [30] / On request	Fully autonomous	P & Q permutations interleaved	Xilinx Virtex 5	3138 slices	10314 Mbit/s	292.1 MHz
Grøstl-512	Homsirikamol et al. [30] / On request	Fully autonomous	P & Q permutations interleaved	Altera Stratix III	12355 ALUTs	7142 Mbit/s	202.3 MHz
JH-256	Homsirikamol et al. [30] / On request	Fully autonomous		Xilinx Virtex 5	1018 slices	5416 Mbit/s	380.8 MHz
JH-256	Homsirikamol et al. [30] / On request	Fully autonomous		Altera Stratix III	3525 ALUTs	5515 Mbit/s	387.8 MHz
JH-256	Matsuo et al. [33] / RCIS website	Fully autonomous		Xilinx Virtex 5	2661 slices	2639 Mbit/s	201 MHz
JH	Baldwin et al. [31] / UCC webpage	Fully autonomous		Xilinx Virtex 5	1291 slices	1941 Mbit/s	250.13 MHz
JH-512	Homsirikamol et al. [30] / On request	Fully autonomous		Xilinx Virtex 5	1104 slices	5610 Mbit/s	394.5 MHz
JH-512	Homsirikamol et al. [30] / On request	Fully autonomous		Altera Stratix III	3709 ALUTs	5556 Mbit/s	390.6 MHz
Keccak	Updated spec. (v1.2) [8] / Submission webpage	Fully autonomous	Core (round function, state register) & IO buffer	Altera Cyclone III	5776 LEs	7500 Mbit/s	133 MHz
Keccak	Updated spec. (v1.2) [8] / Submission webpage	Fully autonomous	Core (round function, state register) & IO buffer	Altera Stratix III	4713 ALUTs	12400 Mbit/s	218 MHz
Keccak	J. Strömbergson [9] / Submission webpage	Fully autonomous	Core (round function, state register) only	Xilinx Spartan 3A	3393 slices	4800 Mbit/s	85 MHz
Keccak	Updated spec. (v1.2) [8] / Submission webpage	Fully autonomous	Core (round function, state register) & IO buffer	Xilinx Virtex 5	1412 slices	6900 Mbit/s	122 MHz
Keccak(-224)	Baldwin et al. [31] / UCC webpage	Fully autonomous		Xilinx Virtex 5	1117 slices	5915 Mbit/s	189 MHz
Keccak(-256)	Homsirikamol et al. [30] / On request	Fully autonomous		Xilinx Virtex 5	1272 slices	12817 Mbit/s	282.7 MHz
Keccak(-256)	Homsirikamol et al. [30] / On request	Fully autonomous		Altera Stratix III	4213 ALUTs	12393 Mbit/s	273.4 MHz
Keccak(-256)	Baldwin et al. [31] / UCC webpage	Fully autonomous		Xilinx Virtex 5	1117 slices	6263 Mbit/s	189 MHz
Keccak(-256)	Matsuo et al. [33] / RCIS website	Fully autonomous		Xilinx Virtex 5	1433 slices	8397 Mbit/s	205 MHz
Keccak(-384)	Baldwin et al. [31] / UCC webpage	Fully autonomous		Xilinx Virtex 5	1117 slices	8190 Mbit/s	189 MHz
Keccak(-512)	Baldwin et al. [31] / UCC webpage	Fully autonomous		Xilinx Virtex 5	1117 slices	8518 Mbit/s	189 MHz
Keccak(-512)	Homsirikamol et al. [30] / On request	Fully autonomous		Xilinx Virtex 5	1257 slices	6845 Mbit/s	285.2 MHz
Keccak(-512)	Homsirikamol et al. [30] / On request	Fully autonomous		Altera Stratix III	3979 ALUTs	7310 Mbit/s	304.6 MHz
Keccak	Akin et al. [34] / N/A	Core functionality	One Keccak-f round per cycle	Xilinx Spartan 3	2024 slices	3460 Mbit/s	81.4 MHz
Keccak	Akin et al. [34] / N/A	Core functionality	One Keccak-f round per cycle	Xilinx Virtex-II	2024 slices	5810 Mbit/s	136.6 MHz
Keccak	Akin et al. [34] / N/A	Core functionality	One Keccak-f round per cycle	Xilinx Virtex 4	2024 slices	6070 Mbit/s	142.9 MHz
Skein-256-h	Men Long [11] / N/A	Core functionality	UBI component	Xilinx Virtex 5	1001 slices	408.7 Mbit/s	114.9 MHz
Skein-256-256	Stefan Tillich [12] / On request	Fully autonomous	8 Threefish rounds unrolled	Xilinx Virtex 5	937 slices	1751 Mbit/s	68.4 MHz
Skein-256-256	Stefan Tillich [12] / On request	Fully autonomous	8 Threefish rounds unrolled	Xilinx Spartan 3	2421 slices	669 Mbit/s	26.14 MHz
Skein-256-256	Kobayashi et al. [3] / RCIS webpage	Fully autonomous		Xilinx Virtex 5	854 slices	1482 Mbit/s	115 MHz
Skein-512-256	Homsirikamol et al. [30] / On request	Fully autonomous	4 Threefish rounds unrolled	Xilinx Virtex 5	1621 slices	3178 Mbit/s	118.0 MHz
Skein-512-256	Homsirikamol et al. [30] / On request	Fully autonomous	4 Threefish rounds unrolled	Altera Stratix III	4645 ALUTs	2503 Mbit/s	92.9 MHz
Skein-256-256	Matsuo et al. [33] / RCIS website	Fully autonomous		Xilinx Virtex 5	854 slices	1402 Mbit/s	115 MHz
Skein-512-h	Men Long [11] / N/A	Core functionality	UBI component	Xilinx Virtex 5	1877 slices	817.4 Mbit/s	114.9 MHz
Skein-512-512	Stefan Tillich [12] / On request	Fully autonomous	8 Threefish rounds unrolled	Xilinx Virtex 5	1632 slices	3535 Mbit/s	69.04 MHz
Skein-512-512	Stefan Tillich [12] / On request	Fully autonomous	8 Threefish rounds unrolled	Xilinx Spartan 3	4273 slices	1365 Mbit/s	26.66 MHz
Skein-512	Baldwin et al. [31] / UCC webpage	Fully autonomous		Xilinx Virtex 5	1786 slices	1945 Mbit/s	83.65 MHz
Skein-512-512	Homsirikamol et al. [30] / On request	Fully autonomous	4 Threefish rounds unrolled	Xilinx Virtex 5	1716 slices	3209 Mbit/s	119.1 MHz
Skein-512-512	Homsirikamol et al. [30] / On request	Fully autonomous	4 Threefish rounds unrolled	Altera Stratix III	4794 ALUTs	2434 Mbit/s	90.3 MHz

5.2 Low-Area Implementations (FPGA)

Hash Function Name	Reference / HDL	Impl. Scope	Implementation Details	Technology	Size	Throughput	Clock Frequency
BLAKE-32	Beuchat et al. [13] / N/A	Fully autonomous	Rescheduled G function	Xilinx Spartan-3	124 slices	115 Mbit/s	190.0 MHz
BLAKE-32	Beuchat et al. [13] / N/A	Fully autonomous	Rescheduled G function	Xilinx Virtex-4	124 slices	216 Mbit/s	357.0 MHz
BLAKE-32	Beuchat et al. [13] / N/A	Fully autonomous	Rescheduled G function	Xilinx Virtex-5	56 slices	225 Mbit/s	372.0 MHz
BLAKE-32	Beuchat et al. [13] / N/A	Fully autonomous	Rescheduled G function	Altera Cyclone III	285 LEs	116 Mbit/s	192.0 MHz
BLAKE-32	Submission doc. [1] / Submission webpage	Core functionality	Compression function with 1 G function unit	Xilinx Virtex-II Pro	958 slices	371 Mbit/s	59.0 MHz
BLAKE-32	Submission doc. [1] / Submission webpage	Core functionality	Compression function with 1 G function unit	Xilinx Virtex 4	960 slices	430 Mbit/s	68.0 MHz
BLAKE-32	Submission doc. [1] / Submission webpage	Core functionality	Compression function with 1 G function unit	Xilinx Virtex 5	390 slices	575 Mbit/s	91.0 MHz
BLAKE-64	Beuchat et al. [13] / N/A	Fully autonomous	Rescheduled G function	Xilinx Spartan-3	229 slices	138 Mbit/s	158.0 MHz
BLAKE-64	Beuchat et al. [13] / N/A	Fully autonomous	Rescheduled G function	Xilinx Virtex-4	230 slices	219 Mbit/s	250.0 MHz
BLAKE-64	Beuchat et al. [13] / N/A	Fully autonomous	Rescheduled G function	Xilinx Virtex-5	108 slices	314 Mbit/s	358.0 MHz
BLAKE-64	Beuchat et al. [13] / N/A	Fully autonomous	Rescheduled G function	Altera Cyclone III	542 LEs	123 Mbit/s	140.0 MHz
BLAKE-64	Submission doc. [1] / Submission webpage	Core functionality	Compression function with 1 G function unit	Xilinx Virtex-II Pro	1802 slices	326 Mbit/s	36.0 MHz
BLAKE-64	Submission doc. [1] / Submission webpage	Core functionality	Compression function with 1 G function unit	Xilinx Virtex 4	1856 slices	381 Mbit/s	42.0 MHz
BLAKE-64	Submission doc. [1] / Submission webpage	Core functionality	Compression function with 1 G function unit	Xilinx Virtex 5	939 slices	533 Mbit/s	59.0 MHz
Grøstl-224/256	Jungk et al. [6] / N/A	Fully autonomous	64-bit datapath, P & Q permutation in parallel	Xilinx Spartan 3	2486 slices	404 Mbit/s	63.2 MHz
Grøstl-224/256	Jungk et al. [6] / N/A	Fully autonomous	64-bit datapath, P & Q permutation in parallel	Xilinx Virtex 2 Pro	2754 slices	512 Mbit/s	81.5 MHz
Grøstl-224/256	Jungk and Reith [22] / N/A	Fully autonomous	Shared P & Q permutation, S-Box based on composite field arithmetic	Xilinx Spartan 3	1276 slices	192 Mbit/s	60 MHz
Grøstl-384/512	Jungk and Reith [22] / N/A	Fully autonomous	Shared P & Q permutation, S-Box based on composite field arithmetic	Xilinx Spartan 3	2110 slices	144 Mbit/s	63 MHz
Keccak	Updated spec. (v1.2) [8] / Submission webpage	Using external memory	Small core using system memory	Altera Stratix III	855 ALUTs	96.8 Mbit/s	366 MHz
Keccak	Updated spec. (v1.2) [8] / Submission webpage	Using external memory	Small core using system memory	Altera Cyclone III	1559 LEs	47.8 Mbit/s	181 MHz
Keccak	Updated spec. (v1.2) [8] / Submission webpage	Using external memory	Small core using system memory	Xilinx Virtex 5	444 slices	70.1 Mbit/s	265 MHz
Skein-256-256	Namin and Hasan [2] / N/A	Core functionality	One round of Threefish iterated	Altera Stratix III	1385 ALUTs	573.9 Mbit/s	161.42 MHz

5.3 High-Speed Implementations (ASIC)

A comparison of implementations of all 14 round 2 candidates was presented informally at IAIK (Graz University of Technology) on Sept. 16, 2009. The updated presentation slides can be found here.

Hash Function Name	Reference / HDL	Impl. Scope	Implementation Details	Technology	Size	Throughput	Clock Frequency
BLAKE-32	Submission doc. [1] / Submission webpage	Core functionality	Compression function with 8 G function units	UMC 0.18 µm	58.30 kGates	5295 Mbit/s	114 MHz
BLAKE-32	Submission doc. [1] / Submission webpage	Core functionality	Compression function with 4 G function units	UMC 0.18 µm	41.31 kGates	4153 Mbit/s	170 MHz
BLAKE-32	Namin and Hasan [2] / N/A	Core functionality	Compression function with 8 G function units and I/O registers	STM 90 nm	53 kGates	4475 Mbit/s(*)	96.15 MHz
BLAKE-32	Tillich et al. [14] / On request	Fully autonomous	Compression function with 4 G function units with CSAs	UMC 0.18 µm	45.64 kGates	3971 Mbit/s	170.64 MHz
BLAKE-32	Henzen et al. [29] / ETH webpage	Fully autonomous	Four parallel G functions modules	UMC 90 nm	47.5 kGates	9752 Mbit/s	400 MHz
BLAKE-32	Guo et al. [35] / VT webpage	Fully autonomous		UMC 0.13 µm	43.52 kGates	4645 Mbit/s	200 MHz
BLAKE-32	RCIS webpage [37] / RCIS webpage	Fully autonomous		STM 90 nm	37 kGates	6668 Mbit/s	286.5 MHz
BLAKE-32	Henzen et al. [40] / Submission webpage	Fully autonomous	Compression function with 8 G function units	UMC 0.18 µm	79 kGates	6376 Mbit/s	137 MHz
BLAKE-32	Henzen et al. [40] / Submission webpage	Fully autonomous	Compression function with 4 G function units	UMC 0.18 µm	48 kGates	5847 Mbit/s	240 MHz
BLAKE-32	Henzen et al. [40] / Submission webpage	Fully autonomous	Compression function with 8 G function units	UMC 0.13 µm	67 kGates	9365 Mbit/s	201 MHz
BLAKE-32	Henzen et al. [40] / Submission webpage	Fully autonomous	Compression function with 4 G function units	UMC 0.13 µm	43 kGates	8047 Mbit/s	330 MHz
BLAKE-32	Henzen et al. [40] / Submission webpage	Fully autonomous	Compression function with 8 G function units	UMC 90 nm	65 kGates	17498 Mbit/s	376 MHz
BLAKE-32	Henzen et al. [40] / Submission webpage	Fully autonomous	Compression function with 4 G function units	UMC 90 nm	38 kGates	15143 Mbit/s	621 MHz
BLAKE-64	Submission doc. [1] / Submission webpage	Core functionality	Compression function with 8 G function units	UMC 0.18 µm	132.47 kGates	5910 Mbit/s	87 MHz
BLAKE-64	Submission doc. [1] / Submission webpage	Core functionality	Compression function with 4 G function units	UMC 0.18 µm	82.73 kGates	4810 Mbit/s	136 MHz
BLAKE-64	Henzen et al. [40] / Submission webpage	Fully autonomous	Compression function with 8 G function units	UMC 0.18 µm	147 kGates	7216 Mbit/s	106 MHz
BLAKE-64	Henzen et al. [40] / Submission webpage	Fully autonomous	Compression function with 4 G function units	UMC 0.18 µm	98 kGates	7192 Mbit/s	204 MHz
BLAKE-64	Henzen et al. [40] / Submission webpage	Fully autonomous	Compression function with 8 G function units	UMC 0.13 µm	139 kGates	10802 Mbit/s	158 MHz
BLAKE-64	Henzen et al. [40] / Submission webpage	Fully autonomous	Compression function with 4 G function units	UMC 0.13 µm	92 kGates	10265 Mbit/s	291 MHz
BLAKE-64	Henzen et al. [40] / Submission webpage	Fully autonomous	Compression function with 8 G function units	UMC 90 nm	128 kGates	20317 Mbit/s	298 MHz
BLAKE-64	Henzen et al. [40] / Submission webpage	Fully autonomous	Compression function with 4 G function units	UMC 90 nm	79 kGates	18782 Mbit/s	532 MHz
Blue Midnight Wish-256	Namin and Hasan [2] / N/A	Core functionality	Compression function with f0, f1, and f2 unrolled in sequence and I/O registers	STM 90 nm	164 kGates	26665 Mbit/s(*)	52.08 MHz
Blue Midnight Wish-256	Tillich et al. [14] / On request	Fully autonomous	Compression function with f0, f1, and f2 unrolled	UMC 0.18 µm	169.74 kGates	5358 Mbit/s	10.46 MHz
Blue Midnight Wish-256	Henzen et al. [29] / ETH webpage	Fully autonomous	single-cycle f0 and f2, f1 iteratively	UMC 90 nm	150 kGates	8486 Mbit/s	298 MHz
Blue Midnight Wish-256	Guo et al. [35] / VT webpage	Fully autonomous		UMC 0.13 µm	198.17 kGates	12220 Mbit/s	48 MHz
Blue Midnight Wish	Akin et al. [34] / N/A	Core functionality	Compression function with f0, f1, and f2 unrolled in sequence	Synopsys 90 nm	55.9 kGates	26320 Mbit/s	52.63 MHz
Blue Midnight Wish-256	RCIS webpage [37] / RCIS webpage	Fully autonomous		STM 90 nm	128.7 kGates	25937 Mbit/s	101.3 MHz
CubeHash16/32-h	Tillich et al. [14] / On request	Fully autonomous	Dynamically reconfigurable r and b parameters, two rounds unrolled	UMC 0.18 µm	58.87 kGates	4665 Mbit/s	145.77 MHz
CubeHash16/32-h	Bernet et al. [20] / N/A	Fully autonomous	One round per cycle	0.13 µm	34.33 kGates	9248 Mbit/s(***)	578 MHz
CubeHash16/32-h	Bernet et al. [20] / N/A	Fully autonomous	Half a round per cycle	0.13 µm	21.54 kGates	8000 Mbit/s(***)	1000 MHz
CubeHash16/32-256	Henzen et al. [29] / ETH webpage	Fully autonomous	One round per cycle, IV fixed	UMC 90 nm	42.5 kGates	10667 Mbit/s	667 MHz
CubeHash16/32-256	Guo et al. [35] / VT webpage	Fully autonomous		UMC 0.13 µm	38.18 kGates	4624 Mbit/s	289 MHz
CubeHash16/32-256	RCIS webpage [37] / RCIS webpage	Fully autonomous		STM 90 nm	35.5 kGates	8247 Mbit/s	515.5 MHz
ECHO-224/256	Lu et al. [5] / N/A	Fully autonomous		0.13 µm	521.1 kGates	14850 Mbit/s	87.1 MHz
ECHO-256	Tillich et al. [14] / On request	Fully autonomous	Four parallel AES rounds, 16 AES MixColumns 32-bit column multipliers	UMC 0.18 µm	141.49 kGates	2246 Mbit/s	141.84 MHz
ECHO-256	Henzen et al. [29] / ETH webpage	Fully autonomous	8 AES rounds per cycle	UMC 90 nm	260 kGates	13966 Mbit/s	291 MHz
ECHO-256	Guo et al. [35] / VT webpage	Fully autonomous		UMC 0.13 µm	92.73 kGates	3366 Mbit/s	217 MHz
ECHO-256	RCIS webpage [37] / RCIS webpage	Fully autonomous		STM 90 nm	101.1 kGates	5621 Mbit/s	362.3 MHz
ECHO-384/512	Lu et al. [5] / N/A	Fully autonomous		0.13 µm	516.8 kGates	7750 Mbit/s	83.3 MHz
Fugue-256	Submission doc. [15] / N/A	Fully autonomous	Four columns of SMIX transformation in parallel (SUPER4_P)	IBM 90 nm	109.85 kGates	13913 Mbit/s	869.5 MHz
Fugue-256	Tillich et al. [14] / On request	Fully autonomous	Four columns of SMIX transformation in parallel	UMC 0.18 µm	46.26 kGates	4092 Mbit/s	255.75 MHz
Fugue-256	Henzen et al. [29] / ETH webpage	Fully autonomous	S-box as LUT	UMC 90 nm	55 kGates	8815 Mbit/s	551 MHz
Fugue-256	Guo et al. [35] / VT webpage	Fully autonomous		UMC 0.13 µm	91.09 kGates	2385 Mbit/s	149 MHz
Fugue-256	RCIS webpage [37] / RCIS webpage	Fully autonomous		STM 90 nm	56.7 kGates	2721 Mbit/s	170.1 MHz
Grøstl-256	Tillich et al. [14] / On request	Fully autonomous	One shared permutation for P & Q, one pipeline stage	UMC 0.18 µm	58.40 kGates	6290 Mbit/s	270.27 MHz
Grøstl-256	Henzen et al. [29] / ETH webpage	Fully autonomous	P and Q permutation interleaved with one pipeline stage, S-box as LUT	UMC 90 nm	135 kGates	16254 Mbit/s	667 MHz
Grøstl-256	Guo et al. [35] / VT webpage	Fully autonomous		UMC 0.13 µm	110.11 kGates	9606 Mbit/s	188 MHz
Grøstl-256	RCIS webpage [37] / RCIS webpage	Fully autonomous		STM 90 nm	139.1 kGates	17297 Mbit/s	337.8 MHz
Grøstl-256	RCIS webpage [39] / RCIS webpage	Fully autonomous		STM 90 nm	120.8 kGates	16275 Mbit/s	349.7 MHz
Grøstl-384/512	Submission doc. [7] / N/A	Fully autonomous	P & Q permutation in parallel	UMC 0.18 µm	341 kGates	6225 Mbit/s	85.1 MHz
Hamsi-256	Junfeng Fan (Hamsi website) [16] / N/A	Fully autonomous		0.13 µm	22 kGates	4940 Mbit/s	1080 MHz
Hamsi-256	Tillich et al. [14] / On request	Fully autonomous	Three instances of P/Pf function unrolled	UMC 0.18 µm	58.66 kGates	5565 Mbit/s	173.91 MHz
Hamsi-256	Henzen et al. [29] / ETH webpage	Fully autonomous	Message expansions in LUTs, one round per cycle	UMC 90 nm	45 kGates	8686 Mbit/s	814 MHz
Hamsi-256	Guo et al. [35] / VT webpage	Fully autonomous		UMC 0.13 µm	29.94 kGates	3571 Mbit/s	446 MHz
Hamsi-256	RCIS webpage [37] / RCIS webpage	Fully autonomous		STM 90 nm	67.6 kGates	7767 Mbit/s	970.9 MHz
Hamsi-512	Junfeng Fan (Hamsi website) [16] / N/A	Fully autonomous		0.13 µm	50 kGates	3970 Mbit/s	820 MHz
JH-256	Tillich et al. [14] / On request	Fully autonomous	320 S-boxes, one round of R₈ per cycle	UMC 0.18 µm	58.83 kGates	4991 Mbit/s	380.22 MHz
JH-256	Henzen et al. [29] / ETH webpage	Fully autonomous	S-boxes as LUTs, stored constants	UMC 90 nm	80 kGates	10807 Mbit/s	760 MHz
JH-256	Guo et al. [35] / VT webpage	Fully autonomous		UMC 0.13 µm	62.42 kGates	5128 Mbit/s	391 MHz
JH-256	RCIS webpage [37] / RCIS webpage	Fully autonomous		STM 90 nm	54.6 kGates	10022 Mbit/s	763.4 MHz
Keccak	Updated spec. (v1.2) [8] / Submission webpage	Fully autonomous	Core (round function, state register) & IO buffer	ST 0.13 µm	48 kGates	29900 Mbit/s	526 MHz
Keccak	Submission doc. [8] / Submission webpage	Fully autonomous	Core (round function, state register) only	ST 0.13 µm	40 kGates	15000 Mbit/s	500 MHz
Keccak(-256)	Tillich et al. [14] / On request	Fully autonomous	One instance of Keccak-f round	UMC 0.18 µm	56.32 kGates	21229 Mbit/s	487.80 MHz
Keccak(-256)	Henzen et al. [29] / ETH webpage	Fully autonomous	One round per cycle	UMC 90 nm	50 kGates	43011 Mbit/s	949 MHz
Keccak(-256)	Guo et al. [35] / VT webpage	Fully autonomous		UMC 0.13 µm	47.43 kGates	15457 Mbit/s	377 MHz
Keccak	Akin et al. [34] / N/A	Core functionality	One Keccak-f round per cycle	Synopsys 90 nm	10.5 kGates	19320 Mbit/s	454.5 MHz
Keccak(-256)	RCIS webpage [37] / RCIS webpage	Fully autonomous		STM 90 nm	50.7 kGates	33333 Mbit/s	781.3 MHz
Keccak(-256)	RCIS webpage [39] / RCIS webpage	Fully autonomous		STM 90 nm	55.9 kGates	43986 Mbit/s	1030.9 MHz
Luffa-224/256	Knežević and Verbauwhede [17] / Author's webpage	Fully autonomous	Three permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each)	UMC 0.13 µm	30.83 kGates	31960 Mbit/s	1124 MHz
Luffa-256	Namin and Hasan [2] / N/A	Core functionality	Compression function (1 cycle latency) and I/O registers	STM 90 nm	122 kGates	25702 Mbit/s(*)	100.4 MHz
Luffa-224/256	Tillich et al. [14] / On request	Fully autonomous	Three permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each)	UMC 0.18 µm	44.97 kGates	13741 Mbit/s	483.09 MHz
Luffa-256	Henzen et al. [29] / ETH webpage	Fully autonomous	Three parallel step modules, SubCrumb as logic	UMC 90 nm	55 kGates	23256 Mbit/s	727 MHz
Luffa-256	Guo et al. [35] / VT webpage	Fully autonomous		UMC 0.13 µm	37.94 kGates	13943 Mbit/s	490 MHz
Luffa-256	RCIS webpage [37] / RCIS webpage	Fully autonomous		STM 90 nm	39.6 kGates	28732 Mbit/s	1010.1 MHz
Luffa-256	Satoh et al. [38] / RCIS webpage	Fully autonomous	Three permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each), two rounds unrolled	STM 90 nm	62.8 kGates	35068.5 Mbit/s	684.9 MHz
Luffa-384	Knežević and Verbauwhede [17] / Author's webpage	Fully autonomous	Four permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each)	UMC 0.13 µm	50.07 kGates	23126 Mbit/s	813 MHz
Luffa-512	Knežević and Verbauwhede [17] / Author's webpage	Fully autonomous	Five permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each)	UMC 0.13 µm	65.1 kGates	19617 Mbit/s	690 MHz
Luffa	Akin et al. [34] / N/A	Core functionality	Three step modules	Synopsys 90 nm	11.5 kGates	21370 Mbit/s	769.2 MHz
Shabal-256	Namin and Hasan [2] / N/A	Core functionality	Compression function with I/O registers (latency of 16 clock cycles)	STM 90 nm	20 kGates	4408 Mbit/s(*)	413.22 MHz
Shabal-256	Tillich et al. [14] / On request	Fully autonomous	One word rotation per cycle, 50 cycles per block	UMC 0.18 µm	54.19 kGates	3282 Mbit/s	320.51 MHz
Shabal	Bernet et al. [20] / N/A	Fully autonomous	One word rotation per cycle, 52 cycles per block	0.13 µm	41.32 kGates	6351 Mbit/s(***)	645 MHz
Shabal-256	Henzen et al. [29] / ETH webpage	Fully autonomous	30 adders, 16 subtractors	UMC 90 nm	45 kGates	6819 Mbit/s	693 MHz
Shabal-256	Guo et al. [35] / VT webpage	Fully autonomous		UMC 0.13 µm	49.44 kGates	2945 Mbit/s	362 MHz
Shabal-256	RCIS webpage [37] / RCIS webpage	Fully autonomous		STM 90 nm	34.6 kGates	6059 Mbit/s	591.7 MHz
SHAvite-3₂₅₆	Tillich et al. [14] / On request	Fully autonomous	Four AES rounds (two for compression, two for message expansion)	UMC 0.18 µm	57.39 kGates	3152 Mbit/s	227.79 MHz
SHAvite-3₂₅₆	Henzen et al. [29] / ETH webpage	Fully autonomous	One AES round each for message expansion and F³ round	UMC 90 nm	75 kGates	7999 Mbit/s	562 MHz
SHAvite-3₂₅₆	Guo et al. [35] / VT webpage	Fully autonomous		UMC 0.13 µm	55.25 kGates	4599 Mbit/s	341 MHz
SHAvite-3₂₅₆	RCIS webpage [37] / RCIS webpage	Fully autonomous		STM 90 nm	59.4 kGates	8421 Mbit/s	625 MHz
SIMD-256(**)	Tillich et al. [14] / On request	Fully autonomous	Two FFT-64 with two FFT-8 and 16 multipliers (8x8 bit) each	UMC 0.18 µm	104.17 kGates	924 Mbit/s	64.93 MHz
SIMD-256	Henzen et al. [29] / ETH webpage	Fully autonomous	Four parallel Feistel modules, message expansion based on NNT₈ and eight multipliers	UMC 90 nm	135 kGates	5177 Mbit/s	364 MHz
SIMD-256	Guo et al. [35] / VT webpage	Fully autonomous		UMC 0.13 µm	139.55 kGates	2157 Mbit/s	194 MHz
SIMD-256	RCIS webpage [37] / RCIS webpage	Fully autonomous		STM 90 nm	139 kGates	3171 Mbit/s	284.9 MHz
Skein-256-256	Stefan Tillich [12] / On request	Fully autonomous	8 Threefish rounds unrolled	UMC 0.18 µm	53.87 kGates	1762 Mbit/s	68.8 MHz
Skein-256-256	Namin and Hasan [2] / N/A	Core functionality	All 72 Threefish rounds unrolled	STM 90 nm	369 kGates	3126 Mbit/s(*)	12.21 MHz
Skein-256-256	Tillich et al. [14] / On request	Fully autonomous	8 Threefish rounds unrolled	UMC 0.18 µm	58.61 kGates	1882 Mbit/s	73.52 MHz
Skein-256-256	Henzen et al. [29] / ETH webpage	Fully autonomous	Four unrolled Threefish rounds	UMC 90 nm	50 kGates	3558 Mbit/s	264 MHz
Skein-256-256	Guo et al. [35] / VT webpage	Fully autonomous		UMC 0.13 µm	40.9 kGates	1941 Mbit/s	159 MHz
Skein-256-256	RCIS webpage [37] / RCIS webpage	Fully autonomous		STM 90 nm	43.1 kGates	3295 Mbit/s	270.3 MHz
Skein-512-512	Tillich et al. [14] / On request	Fully autonomous	8 Threefish rounds unrolled	UMC 0.18 µm	102.04 kGates	2502 Mbit/s	48.87 MHz
Skein-512	Walker et al. [36] / N/A	Fully autonomous	8 Threefish rounds unrolled	Intel 32 nm	57.93 kGates	32320 Mbit/s	631.31 MHz

(*) Estimated peak throughput for the minimal delay of compression function: 1000 * (Input Size in bits) / [(Compression Function Delay in ns) * (Number of Cycles)] = Throughput in Mbit/s.
(**) Implementation of round-one variant.
(***) Estimated peak throughput: Throughput for CubeHash8/1-h implementation * 16.

5.4 Low-Area Implementations (ASIC)

Hash Function Name	Reference / HDL	Impl. Scope	Implementation Details	Technology	Size	Throughput	Clock Frequency
BLAKE-32	Tillich et al. [18] / N/A	Fully autonomous	One G function in 11 cycles	AMS 0.35 µm	25.57 kGates	15.4 Mbit/s	31.25 MHz
BLAKE-32	Submission doc. [1] / Submission webpage	Core functionality	Compression function with a single G function unit	UMC 0.18 µm	10.54 kGates	253 Mbit/s	40 MHz
BLAKE-32	Submission doc. [1] / Submission webpage	Core functionality	Compression function with a half G function unit	UMC 0.18 µm	9.89 kGates	127 Mbit/s	40 MHz
BLAKE-32	Henzen et al. [40] / Submission webpage	Fully autonomous	1 adder and 4-word latch array	UMC 0.18 µm	13.56 kGates	135 Mbit/s	215 MHz
BLAKE-32	Henzen et al. [40] / Submission webpage	Using external memory	1 adder and 4-word latch array	UMC 0.18 µm	8.60 kGates	62 Mbit/s	100 MHz
BLAKE-64	Submission doc. [1] / Submission webpage	Core functionality	Compression function with a single G function unit	UMC 0.18 µm	20.61 kGates	181 Mbit/s	20 MHz
BLAKE-64	Submission doc. [1] / Submission webpage	Core functionality	Compression function with a half G function unit	UMC 0.18 µm	19.46 kGates	91 Mbit/s	20 MHz
CubeHash16/32-h	Bernet et al. [20] / N/A	Fully autonomous	Process two 32-bit words per cycle, 64 cycles per round	0.13 µm	7.63 kGates	32 Mbit/s(****)	100 MHz
ECHO-224/256	Lu et al. [5] / N/A	Fully autonomous		0.13 µm	82.8 kGates	373 Mbit/s	66.6 MHz
Fugue-256	Submission doc. [15] / N/A	Fully autonomous	One SMIX transformation (SUPER1_L)	IBM 90 nm	59.22 kGates	2000 Mbit/s	500 MHz
Grøstl-224/256	Tillich et al. [18] / N/A	Fully autonomous	64-bit datapath, P & Q permutation shared	AMS 0.35 µm	14.62 kGates	145.9 Mbit/s	55.87 MHz
Grøstl-224/256	Grøstl website [19] / N/A	Fully autonomous	64-bit datapath, P & Q permutation shared	UMC 0.18 µm	17 kGates	645 Mbit/s	246.9 MHz
Grøstl-256	RCIS webpage [39] / RCIS webpage	Fully autonomous		STM 90 nm	34.8 kGates	2478 Mbit/s	101.6 MHz
Keccak	Updated spec. (v1.2) [8] / Submission webpage	Using external memory	Small core using system memory	ST 0.13 µm	6.5 kGates	176.4 Mbit/s(*)	666.7 MHz
Keccak	Updated spec. (v1.2) [8] / Submission webpage	Using external memory	Small core using system memory, clock freq. limited to 200 MHz	ST 0.13 µm	5 kGates	52.9 Mbit/s(**)	200 MHz
Luffa-224/256	Knežević and Verbauwhede [17] / Author's webpage	Fully autonomous	One permutation block (64 S-boxes, 4 MixWord blocks)	UMC 0.13 µm	18.26 kGates	2461 Mbit/s	250 MHz
Luffa-256	Mikami et al. [27] / N/A	Fully autonomous	One permutation block (64 S-boxes, 4 MixWord blocks)	UMC 0.13 µm	10.34 kGates	538 Mbit/s	806 MHz
Luffa-256	Satoh et al. [38] / RCIS webpage	Fully autonomous	One permutation block (64 S-boxes, 4 MixWord blocks)	STM 90 nm	14.7 kGates	3641.1 Mbit/s	355.9 MHz
Luffa-384	Knežević and Verbauwhede [17] / Author's webpage	Fully autonomous	6 S-boxes, 1 MixWord	TSMC 90 nm	27.13 kGates	1882 Mbit/s	250 MHz
Luffa-512	Knežević and Verbauwhede [17] / Author's webpage	Fully autonomous	One permutation block (64 S-boxes, 4 MixWord blocks)	UMC 0.13 µm	37.35 kGates	1524 Mbit/s	250 MHz
Shabal	Bernet et al. [20] / N/A	Fully autonomous	One adder, one subtractor, one incrementer. 165 cycles per block	0.13 µm	23.32 kGates	310 Mbit/s	100 MHz
Skein-256-256	Tillich et al. [18] / N/A	Fully autonomous	64-bit datapath	AMS 0.35 µm	12.89 kGates	19.8 Mbit/s	80 MHz
Skein-256-256	Namin and Hasan [2] / N/A	Core functionality	One round of Threefish iterated	STM 90 nm	21 kGates	1018.8 Mbit/s(***)	286.53 MHz

(*) Estimation for 64-bit memory interface: (1024 bits/permutation) * (666.7 * 10^6 cycles/s) / (3870 cycles/permutation) = 176.41 * 10^6 bits/s
(**) Estimation for 64-bit memory interface: (1024 bits/permutation) * (200 * 10^6 cycles/s) / (3870 cycles/permutation) = 52.92 * 10^6 bits/s
(***) Estimated peak throughput for the minimal delay of compression function: 1000 * (Input Size in bits) / [(Compression Function Delay in ns) * (Number of Cycles)] = Throughput in Mbit/s
(****) Estimated peak throughput: Throughput for CubeHash8/1-h implementation * 16.
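
The estimates in footnotes (*) and (**) follow from the figures stated there: 1024 bits per permutation and 3870 cycles per permutation over a 64-bit memory interface. A minimal sketch of that arithmetic is given below; the function name `throughput_mbit_s` is illustrative and does not come from the Keccak specification.

```python
def throughput_mbit_s(bits_per_block: float, clock_mhz: float, cycles_per_block: float) -> float:
    """Throughput in Mbit/s for a core needing `cycles_per_block` cycles per block."""
    # (bits/block) * (10^6 cycles/s) / (cycles/block) = 10^6 bits/s = Mbit/s
    return bits_per_block * clock_mhz / cycles_per_block


# Figures taken from footnotes (*) and (**): 1024 bits and 3870 cycles per permutation.
keccak_at_666 = throughput_mbit_s(1024, 666.7, 3870)
keccak_at_200 = throughput_mbit_s(1024, 200.0, 3870)

print(f"{keccak_at_666:.2f} Mbit/s at 666.7 MHz")
print(f"{keccak_at_200:.2f} Mbit/s at 200 MHz")
```

The same helper can be applied to any row for which the block size and the cycle count per block are known.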

6 Comparative Studies

This section summarizes the reported results of publications that examined more than one round-two candidate in a similar setup.

6.1 BLAKE, BMW, Luffa, Shabal, Skein

Reference	HDL	Category	Impl. Scope	Technology
Namin and Hasan [2]	N/A	High-speed FPGA	Core functionality	Altera Stratix III

Hash Function Name	Impl. Details	Size	Throughput	Clock Frequency
BLAKE-32	Compression function with 8 G function units and I/O registers	5435 ALUTs	2186.2 Mbit/s	46.97 MHz
Blue Midnight Wish-256	Compression function with f0, f1, and f2 unrolled in sequence and I/O registers	12917 ALUTs	4889.6 Mbit/s	9.55 MHz
Luffa-256	Compression function (1 cycle latency) and I/O registers	16552 ALUTs	12042.2 Mbit/s	47.04 MHz
Shabal-256	Compression function with I/O registers (latency of 16 clock cycles)	1440 ALUTs	3125.6 Mbit/s	195.35 MHz
Skein-256-256	All 72 Threefish rounds unrolled (device too small)	N/A	N/A	N/A

Reference	HDL	Category	Impl. Scope	Technology
Namin and Hasan [2]	N/A	High-speed ASIC	Core functionality	STM 90 nm

Hash Function Name	Impl. Details	Size	Throughput	Clock Frequency
BLAKE-32	Compression function with 8 G function units and I/O registers	53 kGates	4475 Mbit/s(*)	96.15 MHz
Blue Midnight Wish-256	Compression function with f0, f1, and f2 unrolled in sequence and I/O registers	164 kGates	26665 Mbit/s(*)	52.08 MHz
Luffa-256	Compression function (1 cycle latency) and I/O registers	122 kGates	25702 Mbit/s(*)	100.4 MHz
Shabal-256	Compression function with I/O registers (latency of 16 clock cycles)	20 kGates	4408 Mbit/s(*)	413.22 MHz
Skein-256-256	All 72 Threefish rounds unrolled	369 kGates	3126 Mbit/s(*)	12.21 MHz

(*) Estimated peak throughput for the minimal delay of compression function: 1000 * (Input Size in bits) / [(Compression Function Delay in ns) * (Number of Cycles)] = Throughput in Mbit/s.
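
The formula in footnote (*) can be written as a short helper. The sketch below makes two assumptions that the footnote does not state: the compression function delay is taken to be the clock period at the listed frequency, and the Skein-256-256 design is taken to process one 256-bit block in a single cycle (consistent with "All 72 Threefish rounds unrolled"). The function names are illustrative.

```python
def peak_throughput_mbit_s(input_bits: float, delay_ns: float, cycles: float) -> float:
    """Footnote formula: 1000 * input size / (delay in ns * number of cycles)."""
    return 1000.0 * input_bits / (delay_ns * cycles)


def delay_ns_from_clock(clock_mhz: float) -> float:
    """Assumption: the compression function delay equals one clock period."""
    return 1000.0 / clock_mhz


# Skein-256-256 row: 12.21 MHz; 256-bit block and 1 cycle are assumed here.
skein_estimate = peak_throughput_mbit_s(256, delay_ns_from_clock(12.21), 1)
print(f"Skein-256-256 estimate: {skein_estimate:.0f} Mbit/s")
```

Under these assumptions the estimate comes out close to the 3126 Mbit/s listed for Skein-256-256. Rows with multi-cycle designs need the actual cycle count from the cited report.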

6.2 BLAKE, CubeHash, ECHO, Grøstl, Hamsi, Luffa, Shabal, Skein

Reference	HDL	Category	Impl. Scope	Technology
Kobayashi et al. [3]	RCIS webpage	High-speed FPGA	Fully autonomous	Xilinx Virtex 5

Hash Function Name	Size	Throughput	Clock Frequency
BLAKE-32	1660 slices	2676 Mbit/s	115 MHz
CubeHash16/32-256	590 slices	2960 Mbit/s	185 MHz
ECHO-256	3556 slices	1614 Mbit/s	104 MHz
Grøstl-256	4057 slices	5171 Mbit/s	101 MHz
Hamsi-256	718 slices	1680 Mbit/s	210 MHz
Luffa-256	1048 slices	6343 Mbit/s	223 MHz
Shabal-256	1251 slices	1739 Mbit/s	214 MHz
Skein-256	854 slices	1482 Mbit/s	115 MHz

6.3 CubeHash, Grøstl, Shabal

Reference	HDL	Category	Impl. Scope	Technology
Baldwin et al. [4]	N/A	High-speed FPGA	Core functionality	Xilinx Spartan 3

Hash Function Name	Impl. Details	Size	Throughput	Clock Frequency
CubeHash8/1-256(*)	2 compression functions unrolled	3268 slices	70 Mbit/s	37.9 MHz
Grøstl-224/256	P & Q permutation in parallel, S-box in BRAM	4827 slices	3660 Mbit/s	71.53 MHz
Grøstl-384/512	P & Q permutation in parallel, S-box in LUTs	17452 slices	3180 Mbit/s	79.61 MHz
Shabal	36 adders in permutation	2223 slices	740 Mbit/s	71.48 MHz

(*) CubeHash16/32-h implemented in a similar fashion can be expected to have throughput increased by a factor of about 16.

Reference	HDL	Category	Impl. Scope	Technology
Baldwin et al. [4]	N/A	High-speed FPGA	Core functionality	Xilinx Virtex 5

Hash Function Name	Impl. Details	Size	Throughput	Clock Frequency
CubeHash8/1-256(*)	1 iterated compression function	1178 slices	160 Mbit/s	166.8 MHz
Grøstl-224/256	P & Q permutation in parallel, S-box in BRAM	4516 slices	7310 Mbit/s	142.87 MHz
Grøstl-384/512	P & Q permutation in parallel, S-box in LUTs	19161 slices	6090 Mbit/s	83.33 MHz
Shabal	36 adders in permutation	2768 slices	1450 Mbit/s	138.87 MHz

(*) CubeHash16/32-h implemented in a similar fashion can be expected to have throughput increased by a factor of about 16.
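
The factor of about 16 in footnote (*) can be derived from the CubeHash parameters: CubeHashr/b applies r rounds per b-byte message block. The sketch below assumes that both variants use the same round datapath at the same clock frequency, so throughput scales with the bytes processed per round. The function names are illustrative, and the scaled values are estimates, not reported results.

```python
def bytes_per_round(r: int, b: int) -> float:
    """CubeHashr/b absorbs b message bytes every r rounds."""
    return b / r


def scaling_factor(r_from: int, b_from: int, r_to: int, b_to: int) -> float:
    """Throughput ratio between two parameter sets on an identical round datapath."""
    return bytes_per_round(r_to, b_to) / bytes_per_round(r_from, b_from)


factor = scaling_factor(8, 1, 16, 32)  # CubeHash8/1 -> CubeHash16/32

# Reported CubeHash8/1-256 throughputs from the two tables above, in Mbit/s.
reported = {"Xilinx Spartan 3": 70, "Xilinx Virtex 5": 160}
estimated = {device: tp * factor for device, tp in reported.items()}

print(f"scaling factor: {factor:g}")
for device, tp in estimated.items():
    print(f"{device}: about {tp:g} Mbit/s (estimate)")
```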

6.4 All 14 Round-Two Candidates

Reported results are post-synthesis. An interactive graphical comparison of various area-performance tradeoffs of this study can be found here.

Reference	HDL	Category	Impl. Scope	Technology
Tillich et al. [14]	On request	High-speed ASIC	Fully autonomous	UMC 0.18 µm

Hash Function Name	Impl. Details	Size	Throughput	Clock Frequency
BLAKE-32	Compression function with 4 G function units with CSAs	45.64 kGates	3971 Mbit/s	170.64 MHz
Blue Midnight Wish-256	Compression function with f0, f1, and f2 unrolled	169.74 kGates	5358 Mbit/s	10.46 MHz
CubeHash16/32-h	Dynamically reconfigurable r and b parameters, two rounds unrolled	58.87 kGates	4665 Mbit/s	145.77 MHz
ECHO-256	Four parallel AES rounds, 16 AES MixColumns 32-bit column multipliers	141.49 kGates	2246 Mbit/s	141.84 MHz
Fugue-256	Four columns of SMIX transformation in parallel	46.26 kGates	4092 Mbit/s	255.75 MHz
Grøstl-256	One shared permutation for P & Q, one pipeline stage	58.40 kGates	6290 Mbit/s	270.27 MHz
Hamsi-256	Three instances of P/Pf function unrolled	58.66 kGates	5565 Mbit/s	173.91 MHz
JH-256	320 S-boxes, one round of R₈ per cycle	58.83 kGates	4991 Mbit/s	380.22 MHz
Keccak(-256)	One instance of Keccak-f round	56.32 kGates	21229 Mbit/s	487.80 MHz
Luffa-224/256	Three permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each)	44.97 kGates	13741 Mbit/s	483.09 MHz
Shabal-256	One word rotation per cycle, 50 cycles per block	54.19 kGates	3282 Mbit/s	320.51 MHz
SHAvite-3₂₅₆	Four AES rounds (two for compression, two for message expansion)	57.39 kGates	3152 Mbit/s	227.79 MHz
SIMD-256(*)	Two FFT-64 with two FFT-8 and 16 multipliers (8x8 bit) each	104.17 kGates	924 Mbit/s	64.93 MHz
Skein-256-256	8 Threefish rounds unrolled	58.61 kGates	1882 Mbit/s	73.52 MHz
Skein-512-512	8 Threefish rounds unrolled	102.04 kGates	2502 Mbit/s	48.87 MHz

(*) Implementation of round-one variant.

6.5 BLAKE, Grøstl, Skein

Reference	HDL	Category	Impl. Scope	Technology
Tillich et al. [18]	N/A	Low-area ASIC	Fully autonomous	AMS 0.35 µm

Hash Function Name	Impl. Details	Size	Throughput	Clock Frequency
BLAKE-32	One G function in 11 cycles	25.57 kGates	15.4 Mbit/s	31.25 MHz
Grøstl-224/256	64-bit datapath, P & Q permutation shared	14.62 kGates	145.9 Mbit/s	55.87 MHz
Skein-256-256	64-bit datapath	12.89 kGates	19.8 Mbit/s	80 MHz

6.6 ECHO, Hamsi, Luffa

Reference	HDL	Category	Impl. Scope	Technology
Ramakers and Narinx [25]	Hosted by SHA-3 zoo	High-speed FPGA	Core functionality	Xilinx Virtex 5

Hash Function Name	Impl. Details	Size	Throughput	Clock Frequency
ECHO-256	Straightforward instantiation of complete compression function	15006 slices	23860 Mbit/s	139 MHz
ECHO-256	Optimized: 4 x 2 AES round instances with pipeline register in BigSubWords	12061 slices	3560 Mbit/s	187 MHz
Hamsi-256	Straightforward instantiation of complete compression function	4664 slices	6620 Mbit/s	207 MHz
Hamsi-256	Non-linear permutation block reused	2113 slices	1970 Mbit/s	308 MHz
Luffa-256	Straightforward instantiation of complete compression function	9611 slices	12290 Mbit/s	48.2 MHz
Luffa-256	One step block reused for 8 rounds	2303 slices	5090 Mbit/s	179 MHz

6.7 All 14 Round-Two Candidates

Reported results of this study are post-P&R performances of designs targeting high throughput.

Reference	HDL	Category	Impl. Scope	Technology
Henzen et al. [29]	ETH webpage	High-speed ASIC	Fully autonomous	UMC 90 nm

Hash Function Name	Impl. Details	Size	Throughput	Clock Frequency
BLAKE-32	Four parallel G function modules	47.5 kGates	9752 Mbit/s	400 MHz
Blue Midnight Wish-256	Single-cycle f0 and f2, f1 computed iteratively	150 kGates	8486 Mbit/s	298 MHz
CubeHash16/32-256	One round per cycle, IV fixed	42.5 kGates	10667 Mbit/s	667 MHz
ECHO-256	8 AES rounds per cycle	260 kGates	13966 Mbit/s	291 MHz
Fugue-256	S-box as LUT	55 kGates	8815 Mbit/s	551 MHz
Grøstl-256	P and Q permutation interleaved with one pipeline stage, S-box as LUT	135 kGates	16254 Mbit/s	667 MHz
Hamsi-256	Message expansions in LUTs, one round per cycle	45 kGates	8686 Mbit/s	814 MHz
JH-256	S-boxes as LUTs, stored constants	80 kGates	10807 Mbit/s	760 MHz
Keccak(-256)	One round per cycle	50 kGates	43011 Mbit/s	949 MHz
Luffa-256	Three parallel step modules, SubCrumb as logic	55 kGates	23256 Mbit/s	727 MHz
Shabal-256	30 adders, 16 subtractors	45 kGates	6819 Mbit/s	693 MHz
SHAvite-3₂₅₆	One AES round each for message expansion and F³ round	75 kGates	7999 Mbit/s	562 MHz
SIMD-256	Four parallel Feistel modules, message expansion based on NNT₈ and eight multipliers	135 kGates	5177 Mbit/s	364 MHz
Skein-256-256	Four unrolled Threefish rounds	50 kGates	3558 Mbit/s	264 MHz

6.8 All 14 Round-Two Candidates

The designs are optimized for throughput-to-area ratio. The cited results are those for the Xilinx Virtex 5 and Altera Stratix III platforms (for both the 256-bit and the 512-bit versions of the candidates). Results marked with N/A did not fit into the largest device of the device family. For a full listing of all ATHENa results, refer to the ATHENa webpage.
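
As a rough illustration of the throughput-to-area metric, the sketch below computes Mbit/s per slice for three rows of the Xilinx Virtex 5 table that follows. The choice of rows is arbitrary, and the ratio is computed here from the listed values rather than quoted from the study.

```python
# (throughput in Mbit/s, size in slices), copied from the Virtex 5 table of this section.
rows = {
    "BLAKE-32": (3143, 1523),
    "Keccak(-256)": (12817, 1272),
    "Luffa-256": (9692, 949),
}

# Throughput-to-area ratio in Mbit/s per slice.
ratios = {name: throughput / slices for name, (throughput, slices) in rows.items()}

# List the designs from best to worst ratio.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ratio:.2f} Mbit/s per slice")
```

Such ratios are only comparable within one device family, since slices and ALUTs are different resource units.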

Reference	HDL	Category	Impl. Scope	Technology
Homsirikamol et al. [30]	On request	High-speed FPGA	Fully autonomous	Xilinx Virtex 5

Hash Function Name	Impl. Details	Size	Throughput	Clock Frequency
BLAKE-32	4 G function units per iteration	1523 slices	3143 Mbit/s	128.9 MHz
BLAKE-64	4 G function units per iteration	3064 slices	3520 Mbit/s	99.7 MHz
Blue Midnight Wish-256	Fully unrolled	4353 slices	6141 Mbit/s	12.0 MHz
Blue Midnight Wish-512	Fully unrolled	N/A	N/A	N/A
CubeHash16/32-256		684 slices	4385 Mbit/s	274.1 MHz
CubeHash16/32-512		734 slices	4315 Mbit/s	269.7 MHz
ECHO-256	3 clk cycles per round	4982 slices	11323 Mbit/s	184.3 MHz
ECHO-512	3 clk cycles per round	5044 slices	7779 Mbit/s	235.5 MHz
Fugue-256	2 clk cycles per round	708 slices	3495 Mbit/s	218.4 MHz
Fugue-512	4 clk cycles per round	979 slices	1773 Mbit/s	221.6 MHz
Grøstl-256	P & Q permutations interleaved	1597 slices	7885 Mbit/s	323.4 MHz
Grøstl-512	P & Q permutations interleaved	3138 slices	10314 Mbit/s	292.1 MHz
Hamsi-256		720 slices	3049 Mbit/s	285.9 MHz
Hamsi-512		1900 slices	1942 Mbit/s	182.1 MHz
JH-256		1018 slices	5416 Mbit/s	380.8 MHz
JH-512		1104 slices	5610 Mbit/s	394.5 MHz
Keccak(-256)		1272 slices	12817 Mbit/s	282.7 MHz
Keccak(-512)		1257 slices	6845 Mbit/s	285.2 MHz
Luffa-256		949 slices	9692 Mbit/s	340.7 MHz
Luffa-512		1960 slices	7691 Mbit/s	240.3 MHz
Shabal-256	32-bit datapath	283 slices	1719 Mbit/s	214.9 MHz
Shabal-512	32-bit datapath	283 slices	1719 Mbit/s	214.9 MHz
SHAvite-3₂₅₆	3 clk cycles per round	1076 slices	3253 Mbit/s	235.1 MHz
SHAvite-3₅₁₂	4 clk cycles per round	2090 slices	3841 Mbit/s	213.8 MHz
SIMD-256	4 SIMD steps unrolled	8922 slices	3123 Mbit/s	54.9 MHz
SIMD-512	4 SIMD steps unrolled	19639 slices	4938 Mbit/s	43.4 MHz
Skein-512-256	4 Threefish rounds unrolled	1621 slices	3178 Mbit/s	118.0 MHz
Skein-512-512	4 Threefish rounds unrolled	1716 slices	3209 Mbit/s	119.1 MHz

Reference	HDL	Category	Impl. Scope	Technology
Homsirikamol et al. [30]	N/A	High-speed FPGA	Fully autonomous	Altera Stratix III

Hash Function Name	Impl. Details	Size	Throughput	Clock Frequency
BLAKE-32	4 G function units per iteration	3635 ALUTs	2901 Mbit/s	119.0 MHz
BLAKE-64	4 G function units per iteration	7086 ALUTs	3161 Mbit/s	89.5 MHz
Blue Midnight Wish-256	Fully unrolled	12619 ALUTs	6339 Mbit/s	12.4 MHz
Blue Midnight Wish-512	Fully unrolled	25192 ALUTs	9820 Mbit/s	9.6 MHz
CubeHash16/32-256		1922 ALUTs	3726 Mbit/s	232.9 MHz
CubeHash16/32-512		1930 ALUTs	3267 Mbit/s	204.2 MHz
ECHO-256	3 clk cycles per round	20723 ALUTs	14335 Mbit/s	233.3 MHz
ECHO-512	3 clk cycles per round	21187 ALUTs	8172 Mbit/s	247.4 MHz
Fugue-256	2 clk cycles per round	2397 ALUTs	3319 Mbit/s	207.4 MHz
Fugue-512	4 clk cycles per round	2783 ALUTs	1598 Mbit/s	199.8 MHz
Grøstl-256	P & Q permutations interleaved	6350 ALUTs	5380 Mbit/s	220.7 MHz
Grøstl-512	P & Q permutations interleaved	12355 ALUTs	7142 Mbit/s	202.3 MHz
Hamsi-256		2308 ALUTs	2997 Mbit/s	281.0 MHz
Hamsi-512		6401 ALUTs	2001 Mbit/s	187.6 MHz
JH-256		3525 ALUTs	5515 Mbit/s	387.8 MHz
JH-512		3709 ALUTs	5556 Mbit/s	390.6 MHz
Keccak(-256)		4213 ALUTs	12393 Mbit/s	273.4 MHz
Keccak(-512)		3979 ALUTs	7310 Mbit/s	304.6 MHz
Luffa-256		3032 ALUTs	8570 Mbit/s	301.3 MHz
Luffa-512		6891 ALUTs	8579 Mbit/s	268.1 MHz
Shabal-256	32-bit datapath	1744 ALUTs	877 Mbit/s	109.7 MHz
Shabal-512	32-bit datapath	1744 ALUTs	877 Mbit/s	109.7 MHz
SHAvite-3₂₅₆	3 clk cycles per round	3042 ALUTs	3397 Mbit/s	245.5 MHz
SHAvite-3₅₁₂	4 clk cycles per round	5619 ALUTs	4071 Mbit/s	226.6 MHz
SIMD-256	4 SIMD steps unrolled	25728 ALUTs	3123 Mbit/s	54.9 MHz
SIMD-512	4 SIMD steps unrolled	53623 ALUTs	5668 Mbit/s	49.8 MHz
Skein-512-256	4 Threefish rounds unrolled	4645 ALUTs	2503 Mbit/s	92.9 MHz
Skein-512-512	4 Threefish rounds unrolled	4794 ALUTs	2434 Mbit/s	90.3 MHz

6.9 All 14 Round-Two Candidates

Results are reported without the wrapper for long messages.

Reference	HDL	Category	Impl. Scope	Technology
Baldwin et al. [31]	UCC webpage	High-speed FPGA	Fully autonomous	Xilinx Virtex 5

Hash Function Name	Size	Throughput	Clock Frequency
BLAKE-32	1118 slices	1169 Mbit/s	118.06 MHz
BLAKE-64	1718 slices	1299 Mbit/s	90.91 MHz
Blue Midnight Wish-256	4997 slices	457 Mbit/s	14.02 MHz
Blue Midnight Wish-512	9810 slices	287 Mbit/s	10 MHz
CubeHash8/32	695 slices	2509 Mbit/s	166.83 MHz
ECHO-256	7372 slices	5373 Mbit/s	198.93 MHz
ECHO-512	8633 slices	18133 Mbit/s	166.69 MHz
Fugue-256	1689 slices	914 Mbit/s	200.04 MHz
Fugue-384	2380 slices	640 Mbit/s	200.08 MHz
Fugue-512	2596 slices	481 Mbit/s	200.16 MHz
Grøstl-256	2391 slices	3242 Mbit/s	101.32 MHz
Grøstl-512	4845 slices	3619 Mbit/s	123.4 MHz
Hamsi-256	1518 slices	358 Mbit/s	72.41 MHz
Hamsi-512	6229 slices	79 Mbit/s	16.51 MHz
JH	1291 slices	1941 Mbit/s	250.13 MHz
Keccak(-224)	1117 slices	5915 Mbit/s	189 MHz
Keccak(-256)	1117 slices	6263 Mbit/s	189 MHz
Keccak(-384)	1117 slices	8190 Mbit/s	189 MHz
Keccak(-512)	1117 slices	8518 Mbit/s	189 MHz
Luffa-256	2221 slices	5333 Mbit/s	166.67 MHz
Luffa-384	3740 slices	5336 Mbit/s	166.75 MHz
Luffa-512	3700 slices	5336 Mbit/s	166.75 MHz
Shabal	1583 slices	1469 Mbit/s	148.04 MHz
SHAvite-3₂₅₆	3125 slices	1170 Mbit/s	109.17 MHz
SHAvite-3₅₁₂	9775 slices	931 Mbit/s	59.4 MHz
SIMD-256	22704 slices	1338 Mbit/s	107.2 MHz
SIMD-512	43729 slices	2677 Mbit/s	107.2 MHz
Skein-512	1786 slices	1945 Mbit/s	83.65 MHz

6.10 All 14 Round-Two Candidates

Results include throughputs without interface overhead.

Reference	HDL	Category	Impl. Scope	Technology
Matsuo et al. [33]	RCIS webpage	High-speed FPGA	Fully autonomous	Xilinx Virtex 5

Hash Function Name	Size	Throughput	Clock Frequency
BLAKE-32	1660 slices	2676 Mbit/s	115 MHz
Blue Midnight Wish-256	4350 slices	8704 Mbit/s	34 MHz
CubeHash16/32-256	590 slices	2960 Mbit/s	185 MHz
ECHO-256	2827 slices	2312 Mbit/s	149 MHz
Fugue-256	4013 slices	1248 Mbit/s	78 MHz
Grøstl-256	2616 slices	7885 Mbit/s	154 MHz
Hamsi-256	718 slices	1680 Mbit/s	210 MHz
JH-256	2661 slices	2639 Mbit/s	201 MHz
Keccak(-256)	1433 slices	8397 Mbit/s	205 MHz
Luffa-256	1048 slices	7424 Mbit/s	261 MHz
Shabal-256	1251 slices	2335 Mbit/s	228 MHz
SHAvite-3₂₅₆	1063 slices	3382 Mbit/s	251 MHz
SIMD-256	3987 slices	835 Mbit/s	75 MHz
Skein-256-256	854 slices	1402 Mbit/s	115 MHz

The following results are for the same implementations as in Matsuo et al. [33], implemented in STM 90 nm technology.

Reference	HDL	Category	Impl. Scope	Technology
RCIS webpage [37]	RCIS webpage	High-speed ASIC	Fully autonomous	STM 90 nm

Hash Function Name	Size	Throughput	Clock Frequency
BLAKE-32	37 kGates	6668 Mbit/s	286.5 MHz
Blue Midnight Wish-256	128.7 kGates	25937 Mbit/s	101.3 MHz
CubeHash16/32-256	35.5 kGates	8247 Mbit/s	515.5 MHz
ECHO-256	101.1 kGates	5621 Mbit/s	362.3 MHz
Fugue-256	56.7 kGates	2721 Mbit/s	170.1 MHz
Grøstl-256	139.1 kGates	17297 Mbit/s	337.8 MHz
Hamsi-256	67.6 kGates	7767 Mbit/s	970.9 MHz
JH-256	54.6 kGates	10022 Mbit/s	763.4 MHz
Keccak(-256)	50.7 kGates	33333 Mbit/s	781.3 MHz
Luffa-256	39.6 kGates	28732 Mbit/s	1010.1 MHz
Shabal-256	34.6 kGates	6059 Mbit/s	591.7 MHz
SHAvite-3₂₅₆	59.4 kGates	8421 Mbit/s	625 MHz
SIMD-256	139 kGates	3171 Mbit/s	284.9 MHz
Skein-256-256	43.1 kGates	3295 Mbit/s	270.3 MHz

6.11 Blue Midnight Wish, Keccak, Luffa

Reference	HDL	Category	Impl. Scope	Technology
Akin et al. [34]	N/A	High-speed FPGA	Core functionality	Xilinx Spartan 3

Hash Function Name	Impl. Details	Size	Throughput	Clock Frequency
Blue Midnight Wish	Compression function with f0, f1, and f2 unrolled in sequence	10531 slices	2110 Mbit/s	4.22 MHz
Keccak	One Keccak-f round per cycle	2024 slices	3460 Mbit/s	81.4 MHz
Luffa	Three step modules	2956 slices	1480 Mbit/s	157.3 MHz

Reference	HDL	Category	Impl. Scope	Technology
Akin et al. [34]	N/A	High-speed FPGA	Core functionality	Xilinx Virtex-II

Hash Function Name	Impl. Details	Size	Throughput	Clock Frequency
Blue Midnight Wish	Compression function with f0, f1, and f2 unrolled in sequence	10432 slices	3360 Mbit/s	6.71 MHz
Keccak	One Keccak-f round per cycle	2024 slices	5810 Mbit/s	136.6 MHz
Luffa	Three step modules	2952 slices	8370 Mbit/s	301.4 MHz

Reference	HDL	Category	Impl. Scope	Technology
Akin et al. [34]	N/A	High-speed FPGA	Core functionality	Xilinx Virtex 4

Hash Function Name	Impl. Details	Size	Throughput	Clock Frequency
Blue Midnight Wish	Compression function with f0, f1, and f2 unrolled in sequence	10486 slices	4510 Mbit/s	9.01 MHz
Keccak	One Keccak-f round per cycle	2024 slices	6070 Mbit/s	142.9 MHz
Luffa	Three step modules	2989 slices	8560 Mbit/s	308.2 MHz

Reference	HDL	Category	Impl. Scope	Technology
Akin et al. [34]	N/A	High-speed ASIC	Core functionality	Synopsys 90 nm

Hash Function Name	Impl. Details	Size	Throughput	Clock Frequency
Blue Midnight Wish	Compression function with f0, f1, and f2 unrolled in sequence	55.9 kGates	26320 Mbit/s	52.63 MHz
Keccak	One Keccak-f round per cycle	10.5 kGates	19320 Mbit/s	454.5 MHz
Luffa	Three step modules	11.5 kGates	21370 Mbit/s	769.2 MHz

6.12 All 14 Round-Two Candidates

Results are post-P&R and include throughputs without interface overhead.

Reference	HDL	Category	Impl. Scope	Technology
Guo et al. [35]	VT webpage	High-speed ASIC	Fully autonomous	UMC 0.13 µm

Hash Function Name	Size	Throughput	Clock Frequency
BLAKE-32	43.52 kGates	4645 Mbit/s	200 MHz
Blue Midnight Wish-256	198.17 kGates	12220 Mbit/s	48 MHz
CubeHash16/32-256	38.18 kGates	4624 Mbit/s	289 MHz
ECHO-256	92.73 kGates	3366 Mbit/s	217 MHz
Fugue-256	91.09 kGates	2385 Mbit/s	149 MHz
Grøstl-256	110.11 kGates	9606 Mbit/s	188 MHz
Hamsi-256	29.94 kGates	3571 Mbit/s	446 MHz
JH-256	62.42 kGates	5128 Mbit/s	391 MHz
Keccak(-256)	47.43 kGates	15457 Mbit/s	377 MHz
Luffa-256	37.94 kGates	13943 Mbit/s	490 MHz
Shabal-256	49.44 kGates	2945 Mbit/s	362 MHz
SHAvite-3₂₅₆	55.25 kGates	4599 Mbit/s	341 MHz
SIMD-256	139.55 kGates	2157 Mbit/s	194 MHz
Skein-256-256	40.9 kGates	1941 Mbit/s	159 MHz

7 References

[1] Jean-Philippe Aumasson, Luca Henzen, Willi Meier, and Raphael C.-W. Phan. SHA-3 proposal BLAKE (version 1.3). Available online at http://131002.net/blake/blake.pdf.

[2] A. H. Namin and M. A. Hasan. Hardware Implementation of the Compression Function for Selected SHA-3 Candidates. Available online at http://www.vlsi.uwaterloo.ca/~ahasan/hasan_report.html.

[3] Kazuyuki Kobayashi, Jun Ikegami, Shin'ichiro Matsuo, Kazuo Sakiyama, and Kazuo Ohta. Evaluation of Hardware Performance for the SHA-3 Candidates Using SASEBO-GII. IACR Eprint report 2010/010. Available online at http://eprint.iacr.org/2010/010.pdf.

[4] Brian Baldwin, Andrew Byrne, Mark Hamilton, Neil Hanley, Robert P. McEvoy, Weibo Pan, and William P. Marnane. FPGA Implementations of SHA-3 Candidates: CubeHash, Grøstl, LANE, Shabal and Spectral Hash. IACR Eprint report 2009/342. Available online at http://eprint.iacr.org/2009/342.pdf.

[5] Liang Lu, Maire O'Neill, and Earl Swartzlander. Hardware Evaluation of SHA-3 Hash Function Candidate ECHO. Presentation at the Claude Shannon Institute Workshop on Coding and Cryptography 2009. Slides available online at http://www.ucc.ie/en/crypto/CodingandCryptographyWorkshop/TheClaudeShannonWorkshoponCodingCryptography2009/DocumentFile,75649,en.pdf.

[6] Bernhard Jungk, Steffen Reith, and Jürgen Apfelbeck. On Optimized FPGA Implementations of the SHA-3 Candidate Grøstl. IACR Eprint report 2009/206. Available online at http://eprint.iacr.org/2009/206.pdf.

[7] Praveen Gauravaram, Lars R. Knudsen, Krystian Matusiewicz, Florian Mendel, Christian Rechberger, Martin Schläffer, and Søren S. Thomsen. Grøstl - a SHA-3 candidate (October 31, 2008). Available online at http://www.groestl.info/Groestl.pdf.

[8] Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles van Assche. KECCAK sponge function family main document (Version 1.2, April 23, 2009). Available online at http://keccak.noekeon.org/Keccak-main-1.2.pdf.

[9] Joachim Strömbergson. Implementation of the Keccak Hash Function in FPGA Devices. Available online at http://www.strombergson.com/files/Keccak_in_FPGAs.pdf.

[10] Romain Feron and Julien Francq. FPGA Implementation of Shabal: Our First Results (Version 2.0, February 19, 2010). Available online at http://www.shabal.com/wp-content/uploads/2010/03/FPGA-Implementation-of-Shabal-First-ResultsV2.0.pdf.

[11] Men Long. Implementing Skein Hash Function on Xilinx Virtex-5 FPGA Platform (Version 0.7, February 2, 2009). Available online at http://www.skein-hash.info/sites/default/files/skein_fpga.pdf.

[12] Stefan Tillich. Hardware Implementation of the SHA-3 Candidate Skein. IACR Eprint report 2009/159. Available online at http://eprint.iacr.org/2009/159.pdf.

[13] Jean-Luc Beuchat, Eiji Okamoto, and Teppei Yamazaki. Compact Implementations of BLAKE-32 and BLAKE-64 on FPGA. IACR Eprint report 2010/173. Available online at http://eprint.iacr.org/2010/173.pdf.

[14] Stefan Tillich, Martin Feldhofer, Mario Kirschbaum, Thomas Plos, Jörn-Marc Schmidt, and Alexander Szekely. High-Speed Hardware Implementations of BLAKE, Blue Midnight Wish, CubeHash, ECHO, Fugue, Grøstl, Hamsi, JH, Keccak, Luffa, Shabal, SHAvite-3, SIMD, and Skein. IACR Eprint report 2009/510. Available online at http://eprint.iacr.org/2009/510.pdf.

[15] Shai Halevi, William E. Hall, and Charanjit S. Jutla. The Hash Function Fugue (October 30, 2008). Available online at http://domino.research.ibm.com/comm/research_projects.nsf/pages/fugue.index.html/$FILE/NIST-submission-Oct08-fugue.pdf.

[16] Junfeng Fan. Hardware Evaluation of The Hash Function Hamsi. Available online at http://homes.esat.kuleuven.be/~okucuk/hamsi/implementations.html.

[17] Miroslav Knezevic and Ingrid Verbauwhede. Hardware Evaluation of the Luffa Hash Family. 4th Workshop on Embedded Systems Security 2009. Available online at http://www.cosic.esat.kuleuven.be/publications/article-1282.pdf.

[18] Stefan Tillich, Martin Feldhofer, Wolfgang Issovits, Thomas Kern, Hermann Kureck, Michael Mühlberghuber, Georg Neubauer, Andreas Reiter, Armin Köfler, and Mathias Mayrhofer. Compact Hardware Implementations of the SHA-3 Candidates ARIRANG, BLAKE, Grøstl, and Skein. IACR Eprint report 2009/349. Available online at http://eprint.iacr.org/2009/349.pdf.

[19] Grøstl website. http://www.groestl.info/.

[20] Markus Bernet, Luca Henzen, Hubert Kaeslin, Norbert Felber, and Wolfgang Fichtner. Hardware Implementations of the SHA-3 Candidates Shabal and CubeHash. 52nd IEEE International Midwest Symposium on Circuits and Systems, 2009. Available online at http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5236043.

[21] Michel Kinsy and Richard Uhler. SHA-3: FPGA Implementation of ESSENCE and ECHO Hash Algorithm Candidates Using Bluespec. Available online at http://csg.csail.mit.edu/6.375/6_375_2009_www/projects/group1_report.pdf.

[22] Bernhard Jungk and Steffen Reith. On FPGA-based implementations of Grøstl. IACR Eprint report 2010/260. Available online at http://eprint.iacr.org/2010/260.pdf.

[23] Jérémie Detrey, Pierre Gaudry, and Karim Khalfallah. A Low-Area yet Performant FPGA Implementation of Shabal. IACR Eprint report 2010/292. Available online at http://eprint.iacr.org/2010/292.pdf.

[24] Jean-Luc Beuchat, Eiji Okamoto, and Teppei Yamazaki. A Compact FPGA Implementation of the SHA-3 Candidate ECHO. IACR Eprint report 2010/364. Available online at http://eprint.iacr.org/2010/364.pdf.

[25] Wim Ramakers and Hans Narinx. Implementation and evaluation of SHA-3 candidates on FPGA. Extended abstract of Master Thesis "Implementatie en Evaluatie van SHA-3-Kandidaten op FPGA" (Dutch). Extended abstract available online at http://ehash.iaik.tugraz.at/uploads/1/12/Ramakers_Narinx2010ECHO-Hamsi-Luffa_ExtendedAbstract_ENGLISH.pdf. Full thesis available online at http://ehash.iaik.tugraz.at/uploads/6/62/Ramakers_Narinx2010ECHO-Hamsi-Luffa_Thesis_DUTCH.pdf.

[26] Julien Francq and Céline Thuillet. Unfolding Method for Shabal on Virtex-5 FPGAs: Concrete Results. IACR Eprint report 2010/406. Available online at http://eprint.iacr.org/2010/406.pdf.

[27] Shugo Mikami, Nagamasa Mizushima, Setsuko Nakamura, and Dai Watanabe. A Compact Hardware Implementation of SHA-3 Candidate Luffa (version 20101105). Available online at http://www.sdl.hitachi.co.jp/crypto/luffa/ACompactHardwareImplementationOfSHA-3CandidateLuffa_20101105.pdf.

[28] Imed Mabrouk and Ryad Benadjila. ECHO webpage (hardware subpage). http://crypto.rd.francetelecom.com/ECHO/hard/.

[29] Luca Henzen, Pietro Gendotti, Patrice Guillet, Enrico Pargaetzi, Martin Zoller, and Frank K. Gürkaynak. Developing a Hardware Evaluation Method for SHA-3 Candidates. 12th International Workshop on Cryptographic Hardware and Embedded Systems (CHES), 2010. Available online at http://www.springerlink.com/content/g0115v3272156r06/.

[30] Ekawat Homsirikamol, Marcin Rogawski, and Kris Gaj. Comparing Hardware Performance of Fourteen Round Two SHA-3 Candidates Using FPGAs. IACR Eprint report 2010/445. Available online at http://eprint.iacr.org/2010/445.pdf.

[31] Brian Baldwin, Neil Hanley, Mark Hamilton, Liang Lu, Andrew Byrne, Maire O'Neill, and William P. Marnane. FPGA Implementations of the Round Two SHA-3 Candidates. Second SHA-3 Candidate Conference, 2010. Available online at http://csrc.nist.gov/groups/ST/hash/sha-3/Round2/Aug2010/documents/papers/BALDWIN_FPGA_SHA3.pdf.

[32] Mohamed El Hadedy, Martin Margala, Danilo Gligoroski, and Svein J. Knapskog. Resource-Efficient Implementation of Blue Midnight Wish-256 Hash Function on Xilinx FPGA Platform. Second SHA-3 Candidate Conference, 2010. Available online at http://csrc.nist.gov/groups/ST/hash/sha-3/Round2/Aug2010/documents/papers/El-Hadedy_SmallSizeFPGA-BMW256.pdf.

[33] Shin'ichiro Matsuo, Miroslav Knezevic, Patrick Schaumont, Ingrid Verbauwhede, Akashi Satoh, Kazuo Sakiyama, and Kazuo Ota. How Can We Conduct "Fair and Consistent" Hardware Evaluation for SHA-3 Candidate? Second SHA-3 Candidate Conference, 2010. Available online at http://csrc.nist.gov/groups/ST/hash/sha-3/Round2/Aug2010/documents/papers/MATSUO_SHA-3_Criteria_Hardware_revised.pdf.

[34] Abdulkadir Akin, Aydin Aysu, Onur Can Ulusel, and Erkay Savas. Efficient Hardware Implementations of High Throughput SHA-3 Candidates Keccak, Luffa and Blue Midnight Wish for Single- and Multi-Message Hashing. Second SHA-3 Candidate Conference, 2010. Available online at http://csrc.nist.gov/groups/ST/hash/sha-3/Round2/Aug2010/documents/papers/SAVAS_SHA3_NIST_final.pdf.

[35] Xu Guo, Sinan Huang, Leyla Nazhandali, and Patrick Schaumont. Fair and Comprehensive Performance Evaluation of 14 Second Round SHA-3 ASIC Implementations. Second SHA-3 Candidate Conference, 2010. Available online at http://csrc.nist.gov/groups/ST/hash/sha-3/Round2/Aug2010/documents/papers/SCHAUMONT_SHA3.pdf.

[36] Jesse Walker, Farhana Sheikh, Sanu K. Mathew, and Ram Krishnamurthy. A Skein-512 Hardware Implementation. Available online at http://csrc.nist.gov/groups/ST/hash/sha-3/Round2/Aug2010/documents/papers/WALKER_skein-intel-hwd.pdf.

[37] RCIS webpage. http://staff.aist.go.jp/akashi.satoh/SASEBO/en/sha3/implement.html.

[38] Akashi Satoh, Toshihiro Katashita, Takeshi Sugawara, Naofumi Homma, and Takafumi Aoki. Hardware Implementations of Hash Function Luffa. IEEE International Symposium on Hardware-Oriented Security and Trust (HOST), 2010. Available online at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5513102&tag=1.

[39] RCIS webpage (Other ASIC Implementations). http://staff.aist.go.jp/akashi.satoh/SASEBO/en/sha3/others.html.

[40] Luca Henzen, Jean-Philippe Aumasson, Willi Meier, and Raphael C.-W. Phan. VLSI Characterization of the Cryptographic Hash Function BLAKE. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2010. Available online at http://131002.net/data/papers/HAMP10.pdf.

[41] Mohamed El Hadedy, Danilo Gligoroski, and Svein J. Knapskog. Single Core Implementation of Blue Midnight Wish Hash Function on VIRTEX 5 Platform. Available online at http://people.item.ntnu.no/~danilog/Hash/BMW-SecondRound/SmallSizeFPGA-BMWOct2010.pdf.

@@ Line 259: / Line 259: @@
 |-
 | BLAKE-64  || [http://131002.net/blake/blake.pdf Submission doc.] [[#Ref001|[1]]] / [http://131002.net/blake/ Submission webpage]  || [[#Implementation_of_Core_Functionality|Core functionality]]  ||  Compression function with 1 G function unit  || Xilinx Virtex 5 || align="right"| 939 slices  || align="right"| 533 Mbit/s  || align="right"| 59.0 MHz
-|-
-| Blue Midnight Wish-256  || [http://csrc.nist.gov/groups/ST/hash/sha-3/Round2/Aug2010/documents/papers/El-Hadedy_SmallSizeFPGA-BMW256.pdf El Hadedy et al.] [[#Ref032|[32]]] / N/A  || [[#Fully_Autonomous_Implementation|Fully autonomous]]  ||  32-bit datapath, 1 memory block  || Xilinx Virtex  || align="right"| 895 slices  || align="right"| 9 Mbit/s  || align="right"| 38 MHz
-|-
-| Blue Midnight Wish-256  || [http://csrc.nist.gov/groups/ST/hash/sha-3/Round2/Aug2010/documents/papers/El-Hadedy_SmallSizeFPGA-BMW256.pdf El Hadedy et al.] [[#Ref032|[32]]] / N/A  || [[#Fully_Autonomous_Implementation|Fully autonomous]]  ||  32-bit datapath, 2 memory blocks  || Xilinx Virtex 5  || align="right"| 84 slices  || align="right"| 28 Mbit/s  || align="right"| 116 MHz
-|-
-| Blue Midnight Wish-256  || [http://people.item.ntnu.no/~danilog/Hash/BMW-SecondRound/SmallSizeFPGA-BMWOct2010.pdf El Hadedy et al.] [[#Ref041|[41]]] / [http://www.q2s.ntnu.no/sha3_nist_competition/start Submission webpage]  || [[#Fully_Autonomous_Implementation|Fully autonomous]]  ||  32-bit datapath, 3 memory blocks  || Xilinx Virtex 5  || align="right"| 51 slices  || align="right"| 68.71 Mbit/s  || align="right"| 141 MHz
-|-
-| Blue Midnight Wish-512  || [http://people.item.ntnu.no/~danilog/Hash/BMW-SecondRound/SmallSizeFPGA-BMWOct2010.pdf El Hadedy et al.] [[#Ref041|[41]]] / [http://www.q2s.ntnu.no/sha3_nist_competition/start Submission webpage]  || [[#Fully_Autonomous_Implementation|Fully autonomous]]  ||  64-bit datapath, 3 memory blocks  || Xilinx Virtex 5  || align="right"| 105 slices  || align="right"| 112.18 Mbit/s  || align="right"| 115 MHz
-|-
-| ECHO  || [http://eprint.iacr.org/2010/364.pdf Beuchat et al.] [[#Ref024|[24]]] / On request from author  || [[#Fully_Autonomous_Implementation|Fully autonomous]]  ||  Adapted towards FPGA implementation (127 slices and 1 memory block)  || Xilinx Virtex 5 || align="right"| 127 slices  || align="right"| 72 Mbit/s  || align="right"| 352.0 MHz
-|-
-| ECHO  || Announced 19-08-2010 on hash-forum@nist.gov / On request from author  || [[#Fully_Autonomous_Implementation|Fully autonomous]]  ||  All ECHO + all AES variants  || Xilinx Virtex 5 || align="right"| 231 slices  || align="right"| 81.7 Mbit/s (ECHO-224/256), 41.9 Mbit/s (ECHO-384/512) || align="right"| 351.0 MHz
 |-
 | Grøstl-224/256  || [http://eprint.iacr.org/2009/206.pdf Jungk et al.] [[#Ref006|[6]]] / N/A  || [[#Fully_Autonomous_Implementation|Fully autonomous]]  || 64-bit datapath, P & Q permutation in parallel || Xilinx Spartan 3  || align="right"| 2486 slices  || align="right"| 404 Mbit/s  || align="right"| 63.2 MHz
@@ Line 281: / Line 268: @@
 |-
 | Grøstl-384/512  || [http://eprint.iacr.org/2010/260.pdf Jungk and Reith] [[#Ref022|[22]]] / N/A  || [[#Fully_Autonomous_Implementation|Fully autonomous]]  || Shared P & Q permutation, S-Box based on composite field arithmetic  || Xilinx Spartan 3  || align="right"| 2110 slices  || align="right"| 144 Mbit/s  || align="right"| 63 MHz
 |-
 | Keccak  || [http://keccak.noekeon.org/Keccak-main-1.2.pdf Updated spec. (v1.2)] [[#Ref008|[8]]] / [http://keccak.noekeon.org/ Submission webpage]  || [[#Implementation_with_External_Memory|Using external memory]]  || Small core using system memory || Altera Stratix III  || align="right"| 855 ALUTs  || align="right"| 96.8 Mbit/s  || align="right"| 366 MHz
@@ Line 288: / Line 276: @@
 | Keccak  || [http://keccak.noekeon.org/Keccak-main-1.2.pdf Updated spec. (v1.2)] [[#Ref008|[8]]] / [http://keccak.noekeon.org/ Submission webpage]  || [[#Implementation_with_External_Memory|Using external memory]]  || Small core using system memory || Xilinx Virtex 5  || align="right"| 444 slices  || align="right"| 70.1 Mbit/s  || align="right"| 265 MHz
-|-
-| Luffa-256 || [http://www.sdl.hitachi.co.jp/crypto/luffa/ACompactHardwareImplementationOfSHA-3CandidateLuffa_20101105.pdf Mikami et al.] [[#Ref027|[27]]] / N/A || [[#Fully_Autonomous_Implementation|Fully autonomous]]  || One permutation block (64 S-boxes, 4 MixWord blocks) || Xilinx Virtex 5 || align="right"| 355 slices  || align="right"| 33 Mbit/s  || align="right"| 50 MHz
-|-
-| Shabal  || [http://ehash.iaik.tugraz.at/uploads/d/d4/FPGA_Implementation_of_Shabal_-_First_Results.pdf Feron and Francq] [[#Ref010|[10]]] / N/A  || [[#Fully_Autonomous_Implementation|Fully autonomous]]  || 36 adders in permutation  || Xilinx Virtex 5  || align="right"| 596 slices (+ 40 DSP blocks) || align="right"| 1142 Mbit/s  || align="right"| 109 MHz
-|-
-| Shabal  || [http://eprint.iacr.org/2009/342.pdf Baldwin et al.] [[#Ref004|[4]]] / N/A  || [[#Implementation_of_Core_Functionality|Core functionality]]  || 1 adder in permutation  || Xilinx Spartan 3  || align="right"| 1933 slices  || align="right"| 540 Mbit/s  || align="right"| 89.71 MHz
-|-
-| Shabal  || [http://eprint.iacr.org/2009/342.pdf Baldwin et al.] [[#Ref004|[4]]] / N/A  || [[#Implementation_of_Core_Functionality|Core functionality]]  || 1 adder in permutation  || Xilinx Virtex 5  || align="right"| 2307 slices  || align="right"| 1330 Mbit/s  || align="right"| 222.22 MHz
-|-
-| Shabal-512  || [http://eprint.iacr.org/2010/292.pdf Detrey et al.] [[#Ref023|[23]]] / [http://hwshabal.gforge.inria.fr/ INRIA webpage (see SCM tree)]  || [[#Fully_Autonomous_Implementation|Fully autonomous]]  || Exploiting SRL16 primitive  || Xilinx Virtex 5  || align="right"| 153 slices  || align="right"| 2051 Mbit/s  || align="right"| 256 MHz
-|-
-| Shabal-512  || [http://eprint.iacr.org/2010/292.pdf Detrey et al.] [[#Ref023|[23]]] / [http://hwshabal.gforge.inria.fr/ INRIA webpage (see SCM tree)]  || [[#Fully_Autonomous_Implementation|Fully autonomous]]  || Exploiting SRL16 primitive  || Xilinx Spartan 3  || align="right"| 499 slices  || align="right"| 800 Mbit/s  || align="right"| 100 MHz
 |-
 | Skein-256-256  || [http://www.vlsi.uwaterloo.ca/~ahasan/hasan_report.html Namin and Hasan] [[#Ref002|[2]]] / N/A  || [[#Implementation_of_Core_Functionality|Core functionality]]  || One round of Threefish iterated  || Altera Stratix III  || align="right"| 1385 ALUTs  || align="right"| 573.9 Mbit/s  || align="right"| 161.42 MHz