Difference between revisions of "SHA-3 Hardware Implementations"

Revision as of 15:20, 15 July 2009

1 Important Information

This page summarizes key properties of reported hardware implementations of the SHA-3 candidates. This is work in progress. The implementations are categorized into FPGA and standard-cell ASIC implementations.

Note that the diversity of implementation scopes, target technologies, and synthesis tools makes direct comparisons between different hardware implementations difficult. The more of these parameters agree, the more reasonable the comparison becomes.

The target technology should be as similar as possible. For FPGA implementations, it is desirable to compare implementations on the same target device (or at least on devices of the same FPGA family). For standard-cell ASIC implementations, at least the minimal gate length of the process (e.g., 0.13 µm) should agree. Ideally, the implementations use the same standard-cell library (which implies the use of the same process technology).
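The comparability rule above can be expressed as a small check. The following is a minimal sketch, not part of the original page: the `Implementation` record, its field names, and the `roughly_comparable` helper are hypothetical names introduced for illustration. It encodes only the minimum criteria stated above (same FPGA family, or same minimal gate length for ASICs) and ignores the stricter "same device" and "same standard-cell library" cases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Implementation:
    """Hypothetical record for one table row (illustration only)."""
    name: str
    kind: str                               # "FPGA" or "ASIC"
    device_family: Optional[str] = None     # FPGA only, e.g. "Xilinx Virtex 5"
    gate_length_um: Optional[float] = None  # ASIC only, e.g. 0.13


def roughly_comparable(a: Implementation, b: Implementation) -> bool:
    """Return True if a and b meet the minimum criteria for a comparison."""
    if a.kind != b.kind:
        return False  # FPGA and ASIC results are never compared directly
    if a.kind == "FPGA":
        # Minimum requirement: same FPGA family
        return a.device_family is not None and a.device_family == b.device_family
    # ASIC: at least the minimal gate length of the process should agree
    return a.gate_length_um is not None and a.gate_length_um == b.gate_length_um


# Rows taken from the tables on this page
skein_v5 = Implementation("Skein-256", "FPGA", device_family="Xilinx Virtex 5")
keccak_v5 = Implementation("Keccak", "FPGA", device_family="Xilinx Virtex 5")
blake_asic = Implementation("BLAKE-32", "ASIC", gate_length_um=0.18)
keccak_asic = Implementation("Keccak", "ASIC", gate_length_um=0.13)

print(roughly_comparable(skein_v5, keccak_v5))      # same FPGA family
print(roughly_comparable(blake_asic, keccak_asic))  # different gate lengths
```

Even when this check passes, differences in implementation scope and synthesis tools can still distort a comparison.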

In order to facilitate the comparison of hardware modules with different implementation scopes, we classify them into three categories:

For suggestions regarding the structure of this site, let us know at sha3zoo-hardware@iaik.tugraz.at

1.1 Fully Autonomous Implementation

Such hardware implementations include the complete functionality of a SHA-3 candidate (or a specific version thereof). That means the input message can be loaded piecewise into the hardware module and it delivers the message digest as output. All hash calculations happen exclusively within the hardware module. If integrated in a system, the achievable throughput of a fully autonomous implementation depends on the speed of the hardware module itself and the speed of the (system dependent) data interface delivering the input message.

1.2 Implementation with External Memory

These implementations use external memory to hold intermediate values during the hashing of a message. The implemented hardware itself normally consists of the core logic functionality of the hash function, some registers for short-lived temporary values, and possibly a memory controller for access to the external memory. Such implementations can load the input message either over a dedicated interface (similar to a fully autonomous implementation) or from the external memory. In order to reach the maximal throughput of the hardware module, the external memory must be sufficiently fast.

1.3 Implementation of Core Functionality

Such implementations comprise only important parts of the hash function (e.g., the compression function). They normally allow a first-order estimate of the performance figures of a full implementation.

2 High-Speed Implementations (FPGA)

Important note: The size and functionality of slices vary between FPGA families. A direct comparison of the slice counts of implementations on different FPGA families is therefore problematic.

Hash Function Name	Reference	Impl. Scope	Implementation Details	Technology	Size	Throughput	Clock Frequency
BLAKE-32	Submission document	Core functionality	Compression function with 8 G function units	Xilinx Virtex-II Pro	3091 slices	1724 Mbit/s	37.0 MHz
BLAKE-32	Submission document	Core functionality	Compression function with 8 G function units	Xilinx Virtex 4	3087 slices	2235 Mbit/s	48.0 MHz
BLAKE-32	Submission document	Core functionality	Compression function with 8 G function units	Xilinx Virtex 5	1694 slices	3103 Mbit/s	67.0 MHz
BLAKE-64	Submission document	Core functionality	Compression function with 8 G function units	Xilinx Virtex-II Pro	11122 slices	1177 Mbit/s	17.0 MHz
BLAKE-64	Submission document	Core functionality	Compression function with 8 G function units	Xilinx Virtex 4	11483 slices	1707 Mbit/s	25.0 MHz
BLAKE-64	Submission document	Core functionality	Compression function with 8 G function units	Xilinx Virtex 5	4329 slices	2389 Mbit/s	35.0 MHz
CHI-224/256	Submission document	Fully autonomous	Iterative implementation	Xilinx Virtex 2	1582 slices	3200 Mbit/s	126.0 MHz
CubeHash8/1-256	Baldwin et al.	Core functionality	2 compression functions unrolled	Xilinx Spartan 3	3268 slices	70 Mbit/s	37.9 MHz
CubeHash8/1-256	Baldwin et al.	Core functionality	1 iterated compression function	Xilinx Virtex 5	1178 slices	160 Mbit/s	166.8 MHz
Grøstl-224/256	Jungk et al.	Fully autonomous	P & Q permutation in parallel	Xilinx Spartan 3	6136 slices	4520 Mbit/s	88.3 MHz
Grøstl-224/256	Submission document	Fully autonomous	P & Q permutation in parallel	Xilinx Virtex 5	1722 slices	10276 Mbit/s	200.7 MHz
Grøstl-384/512	Submission document	Fully autonomous	P & Q permutation in parallel	Xilinx Spartan 3	20233 slices	5901 Mbit/s	80.7 MHz
Grøstl-384/512	Baldwin et al.	Core functionality	P & Q permutation interleaved, S-box in BRAM	Xilinx Spartan 3	6313 slices	2910 Mbit/s	79.61 MHz
Grøstl-384/512	Submission document	Fully autonomous	P & Q permutation in parallel	Xilinx Virtex 5	5419 slices	15395 Mbit/s	210.5 MHz
Keccak	Updated specification (v1.2)	Fully autonomous	Core (round function, state register) & IO buffer	Altera Cyclone III	5776 LEs	7500 Mbit/s	133 MHz
Keccak	Updated specification (v1.2)	Fully autonomous	Core (round function, state register) & IO buffer	Altera Stratix III	4713 ALUTs	12400 Mbit/s	218 MHz
Keccak	Joachim Strömbergson	Fully autonomous	Core (round function, state register) only	Xilinx Spartan 3A	3393 slices	4800 Mbit/s	85 MHz
Keccak	Updated specification (v1.2)	Fully autonomous	Core (round function, state register) & IO buffer	Xilinx Virtex 5	1412 slices	6900 Mbit/s	122 MHz
MD6	Submission document	Core functionality	Compression function with 16 parallel steps	Xilinx Virtex-II Pro	5313 slices	1232 Mbit/s	150.3 MHz
MD6	Submission document	Core functionality	Compression function with 32 parallel steps	Xilinx Virtex-II Pro	7529 slices	1894 Mbit/s	141.6 MHz
MD6-256	Henzen et al.	Fully autonomous	Sequential mode (L=0), compression function with 16 parallel steps	Xilinx Virtex-4	4465 slices	8440 Mbit/s	286 MHz
MD6-512	Henzen et al.	Fully autonomous	Sequential mode (L=0), compression function with 16 parallel steps	Xilinx Virtex-4	4515 slices	5254 Mbit/s	286 MHz
Skein-256	Men Long	Core functionality	UBI component	Xilinx Virtex 5	1001 slices	408.7 Mbit/s	114.9 MHz
Skein-256	Stefan Tillich	Fully autonomous	8 Threefish rounds unrolled	Xilinx Virtex 5	937 slices	1751 Mbit/s	68.4 MHz
Skein-256	Stefan Tillich	Fully autonomous	8 Threefish rounds unrolled	Xilinx Spartan 3	2421 slices	669 Mbit/s	26.14 MHz
Skein-512	Men Long	Core functionality	UBI component	Xilinx Virtex 5	1877 slices	817.4 Mbit/s	114.9 MHz
Skein-512	Stefan Tillich	Fully autonomous	8 Threefish rounds unrolled	Xilinx Virtex 5	1632 slices	3535 Mbit/s	69.04 MHz
Skein-512	Stefan Tillich	Fully autonomous	8 Threefish rounds unrolled	Xilinx Spartan 3	4273 slices	1365 Mbit/s	26.66 MHz

3 Low-Area Implementations (FPGA)

Hash Function Name	Reference	Impl. Scope	Implementation Details	Technology	Size	Throughput	Clock Frequency
BLAKE-32	Submission document	Core functionality	Compression function with 1 G function unit	Xilinx Virtex-II Pro	958 slices	371 Mbit/s	59.0 MHz
BLAKE-32	Submission document	Core functionality	Compression function with 1 G function unit	Xilinx Virtex 4	960 slices	430 Mbit/s	68.0 MHz
BLAKE-32	Submission document	Core functionality	Compression function with 1 G function unit	Xilinx Virtex 5	390 slices	575 Mbit/s	91.0 MHz
BLAKE-64	Submission document	Core functionality	Compression function with 1 G function unit	Xilinx Virtex-II Pro	1802 slices	326 Mbit/s	36.0 MHz
BLAKE-64	Submission document	Core functionality	Compression function with 1 G function unit	Xilinx Virtex 4	1856 slices	381 Mbit/s	42.0 MHz
BLAKE-64	Submission document	Core functionality	Compression function with 1 G function unit	Xilinx Virtex 5	939 slices	533 Mbit/s	59.0 MHz
Grøstl-224/256	Jungk et al.	Fully autonomous	64-bit datapath, P & Q permutation in parallel	Xilinx Spartan 3	2486 slices	404 Mbit/s	63.2 MHz
Grøstl-224/256	Jungk et al.	Fully autonomous	64-bit datapath, P & Q permutation in parallel	Xilinx Virtex 2 Pro	2754 slices	512 Mbit/s	81.5 MHz
Keccak	Updated specification (v1.2)	Using external memory	Small core using system memory	Altera Stratix III	855 ALUTs	96.8 Mbit/s	366 MHz
Keccak	Updated specification (v1.2)	Using external memory	Small core using system memory	Altera Cyclone III	1559 LEs	47.8 Mbit/s	181 MHz
Keccak	Updated specification (v1.2)	Using external memory	Small core using system memory	Xilinx Virtex 5	444 slices	70.1 Mbit/s	265 MHz

4 High-Speed Implementations (ASIC)

Hash Function Name	Reference	Impl. Scope	Implementation Details	Technology	Size	Throughput	Clock Frequency
AURORA-224/256	Submission document	Fully autonomous	One round of one MSM and one CPM in parallel with 1 cycle latency (Type-H1), table-lookup S-box	0.13 µm	35.02 kGates	10352 Mbit/s	363.9 MHz
AURORA-384/512	Submission document	Fully autonomous	One round of one MSM and two CPMs in parallel with 1 cycle latency (Type-H1), table-lookup S-box	0.13 µm	56.75 kGates	9132 Mbit/s	361.2 MHz
BLAKE-32	Submission document	Core functionality	Compression function with 8 G function units	UMC 0.18 µm	58.30 kGates	5295 Mbit/s	114 MHz
BLAKE-32	Submission document	Core functionality	Compression function with 4 G function units	UMC 0.18 µm	41.31 kGates	4153 Mbit/s	170 MHz
BLAKE-64	Submission document	Core functionality	Compression function with 8 G function units	UMC 0.18 µm	132.47 kGates	5910 Mbit/s	87 MHz
BLAKE-64	Submission document	Core functionality	Compression function with 4 G function units	UMC 0.18 µm	82.73 kGates	4810 Mbit/s	136 MHz
CHI-224/256	Submission document	Fully autonomous	Iterative implementation	0.13 µm	101.46 kGates	4800 Mbit/s	188 MHz
Fugue-256	Submission document	Fully autonomous	Four SMIX transformations parallel (SUPER4_P)	IBM 90 nm	109.85 kGates	13913 Mbit/s	869.5 MHz
Grøstl-224/256	Grøstl website	Fully autonomous	One shared permutation for P & Q, one pipeline stage	UMC 0.18 µm	58.4 kGates	6290 Mbit/s	270.2 MHz
Grøstl-384/512	Submission document	Fully autonomous	P & Q permutation in parallel	UMC 0.18 µm	341 kGates	6225 Mbit/s	85.1 MHz
Keccak	Updated specification (v1.2)	Fully autonomous	Core (round function, state register) & IO buffer	ST 0.13 µm	48 kGates	29900 Mbit/s	526 MHz
Keccak	Submission document	Fully autonomous	Core (round function, state register) only	ST 0.13 µm	40 kGates	15000 Mbit/s	500 MHz
LANE-224/256	Submission document	Fully autonomous	Six permutation blocks in parallel (two full AES engines each)	0.13 µm	243.49 kGates	14191 Mbit/s	305 MHz
LANE-384/512	Submission document	Fully autonomous	Six permutation blocks in parallel (four full AES engines each)	0.13 µm	466.19 kGates	20958 Mbit/s	286 MHz
Lesamnta-256	Submission document	Fully autonomous		90 nm	190.1 kGates	6026 Mbit/s	282.5 MHz
Lesamnta-512	Submission document	Fully autonomous		90 nm	393 kGates	9992 Mbit/s	234.2 MHz
Luffa-224/256	Supporting document	Fully autonomous	Three permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each)	0.13 µm	26.85 kGates	12642 Mbit/s	444 MHz
Luffa-384	Supporting document	Fully autonomous	Four permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each)	0.13 µm	34.99 kGates	12642 Mbit/s	444 MHz
Luffa-512	Supporting document	Fully autonomous	Five permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each)	0.13 µm	44.16 kGates	12642 Mbit/s	444 MHz
MD6	Submission document	Core functionality	Compression function with 48 parallel steps	GPDSK 90 nm	145 kGates	N/A	200 MHz
MD6	Submission document	Using external memory	Compression function with 16 parallel steps & memory control logic	GPDSK 90 nm	105 kGates	N/A	200 MHz
MD6-256	Henzen et al.	Fully autonomous	Sequential mode (L=0), compression function with 16 parallel steps	0.18 µm	69.78 kGates	16320 Mbit/s	552 MHz
MD6-512	Henzen et al.	Fully autonomous	Sequential mode (L=0), compression function with 16 parallel steps	0.18 µm	69.78 kGates	10103 Mbit/s	552 MHz
Skein-256	Stefan Tillich	Fully autonomous	8 Threefish rounds unrolled	UMC 0.18 µm	53.87 kGates	1762 Mbit/s	68.8 MHz
Skein-512	Stefan Tillich	Fully autonomous	8 Threefish rounds unrolled	UMC 0.18 µm	102.35 kGates	2501 Mbit/s	48.8 MHz

5 Low-Area Implementations (ASIC)

Hash Function Name	Reference	Impl. Scope	Implementation Details	Technology	Size	Throughput	Clock Frequency
AURORA-224/256	Submission document	Fully autonomous	One round of one MSM or one CPM with 2 cycles latency (Type-H4)	0.13 µm	11.11 kGates	2179 Mbit/s	306.4 MHz
AURORA-384/512	Submission document	Fully autonomous	One round of one MSM or one CPM with 2 cycles latency (Type-H4)	0.13 µm	14.61 kGates	1191 Mbit/s	293.1 MHz
BLAKE-32	Submission document	Core functionality	Compression function with a single G function unit	UMC 0.18 µm	10.54 kGates	253 Mbit/s	40 MHz
BLAKE-32	Submission document	Core functionality	Compression function with a half G function unit	UMC 0.18 µm	9.89 kGates	127 Mbit/s	40 MHz
BLAKE-64	Submission document	Core functionality	Compression function with a single G function unit	UMC 0.18 µm	20.61 kGates	181 Mbit/s	20 MHz
BLAKE-64	Submission document	Core functionality	Compression function with a half G function unit	UMC 0.18 µm	19.46 kGates	91 Mbit/s	20 MHz
CHI-224/256	Submission document	Fully autonomous	Iterative implementation	0.13 µm	62.99 kGates	968 Mbit/s	38 MHz
Fugue-256	Submission document	Fully autonomous	One SMIX transformation (SUPER1_L)	IBM 90 nm	59.22 kGates	2000 Mbit/s	500 MHz
Grøstl-224/256	Grøstl website	Fully autonomous	64-bit datapath, P & Q permutation shared	UMC 0.18 µm	17 kGates	645 Mbit/s	246.9 MHz
Keccak	Updated specification (v1.2)	Using external memory	Small core using system memory	ST 0.13 µm	6.5 kGates	176.4 Mbit/s(*)	666.7 MHz
Keccak	Updated specification (v1.2)	Using external memory	Small core using system memory, clock freq. limited to 200 MHz	ST 0.13 µm	5 kGates	52.9 Mbit/s(**)	200 MHz
LANE-224/256	Submission document	Fully autonomous	One permutation block (one S-box and MixColumns block)	0.13 µm	16.46 kGates	23.3 Mbit/s	100 MHz
Lesamnta-256	Submission document	Fully autonomous		90 nm	20.7 kGates	336.9 Mbit/s	169.8 MHz
Lesamnta-512	Submission document	Fully autonomous		90 nm	44.3 kGates	571.9 Mbit/s	144.1 MHz
Luffa-224/256	Supporting document	Fully autonomous	One permutation block (One S-box, one MixWord block)	0.13 µm	10.16 kGates	28.7 Mbit/s	100 MHz

(*) Estimation for 64-bit memory interface: (1024 bits/permutation) * (666.7 * 10^6 cycles/s) / (3870 cycles/permutation) = 176.41 * 10^6 bits/s
(**) Estimation for 64-bit memory interface: (1024 bits/permutation) * (200 * 10^6 cycles/s) / (3870 cycles/permutation) = 52.92 * 10^6 bits/s
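The two estimates above follow the same formula, which can be checked with a few lines of Python. This is a minimal sketch: the function name is an assumption introduced here, while the numbers (1024 bits per permutation, 3870 cycles per permutation, and the two clock frequencies) are taken directly from the footnotes.

```python
def estimated_throughput_bits_per_s(bits_per_permutation: float,
                                    clock_hz: float,
                                    cycles_per_permutation: float) -> float:
    """Bits processed per permutation times permutations completed per second."""
    return bits_per_permutation * clock_hz / cycles_per_permutation


# (*) small Keccak core at 666.7 MHz, 64-bit memory interface
fast = estimated_throughput_bits_per_s(1024, 666.7e6, 3870)
# (**) same core with the clock limited to 200 MHz
slow = estimated_throughput_bits_per_s(1024, 200e6, 3870)

print(f"{fast / 1e6:.2f} Mbit/s")  # 176.41 Mbit/s
print(f"{slow / 1e6:.2f} Mbit/s")  # 52.92 Mbit/s
```

Since the cycle count per permutation is fixed, the estimated throughput scales linearly with the clock frequency.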

6 Call for contributions

Implementers (both submitters and non-submitters): Do you have results that complement this site? Let us know at sha3zoo-hardware@iaik.tugraz.at

@@ Line 66: / Line 66: @@
 |-
 | Grøstl-384/512  || [http://www.groestl.info/Groestl.pdf Submission document]  || [[#Fully_Autonomous_Implementation|Fully autonomous]]  || P & Q permutation in parallel  || Xilinx Spartan 3  || align="right"| 20233 slices  || align="right"| 5901 Mbit/s  || align="right"| 80.7 MHz
+|-
+| Grøstl-384/512  || [http://eprint.iacr.org/2009/342.pdf Baldwin et al.]  || [[#Implementation_of_Core_Functionality|Core functionality]]  || P & Q permutation interleaved, S-box in BRAM  || Xilinx Spartan 3  || align="right"| 6313 slices  || align="right"| 2910 Mbit/s  || align="right"| 79.61 MHz
 |-
 | Grøstl-384/512  || [http://www.groestl.info/Groestl.pdf Submission document]  || [[#Fully_Autonomous_Implementation|Fully autonomous]]  || P & Q permutation in parallel  || Xilinx Virtex 5  || align="right"| 5419 slices  || align="right"| 15395 Mbit/s  || align="right"| 210.5 MHz
