Difference between revisions of "SHA-3 Hardware Implementations"
Revision as of 14:50, 9 April 2009
1 Important Information
This page summarizes key properties of reported hardware implementations of the SHA-3 candidates. This is work in progress. The implementations are categorized into FPGA and standard-cell ASIC implementations.
Note that the diversity of implementation scope, target technologies, and synthesis tools makes direct comparisons between different hardware implementations difficult. The more of these parameters agree, the more reasonable the comparison becomes.
The target technology should be as similar as possible. For FPGA implementations, it is desirable to compare implementations on the same target device (or at least on devices of the same FPGA family). For standard-cell ASIC implementations, at least the minimal gate length of the process (e.g., 0.13 µm) should agree. Ideally, the implementations use the same standard-cell library (which implies the use of the same process technology).
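In order to facilitate the comparison of hardware modules with different implementation scopes, we classify them into three categories: fully autonomous implementations, implementations with external memory, and implementations of core functionality.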
For suggestions regarding the structure of this site, let us know at sha3zoo-hardware@iaik.tugraz.at
1.1 Fully Autonomous Implementation
Such hardware implementations include the complete functionality of a SHA-3 candidate (or a specific version thereof). That means the input message can be loaded piecewise into the hardware module, which delivers the message digest as output. All hash calculations happen exclusively within the hardware module. If integrated in a system, the achievable throughput of a fully autonomous implementation depends on the speed of the hardware module itself and the speed of the (system-dependent) data interface delivering the input message.
1.2 Implementation with External Memory
These implementations use external memory to hold intermediate values during the hashing of a message. The implemented hardware itself normally consists of the core logic functionality of the hash function, some registers for short-lived temporary values, and possibly a memory controller for access to the external memory. Such implementations can load the input message either over a dedicated interface (similar to a fully autonomous implementation) or from the external memory. In order to reach the maximal throughput of the hardware module, the external memory must be sufficiently fast.
1.3 Implementation of Core Functionality
Such implementations comprise only important parts of the hash function (e.g., the compression function), which normally allows a first-order estimate of the performance figures of full implementations.
2 High-Speed Implementations (FPGA)
Important note: The size and functionality of slices vary between FPGA families. A direct comparison of the slice counts of implementations on different FPGA families is therefore problematic.
Hash Function Name | Reference | Impl. Scope | Impl. Details | Technology | Size | Throughput | Clock Frequency |
---|---|---|---|---|---|---|---|
BLAKE-32 | Submission document | Core functionality | Compression function with 8 G function units | Xilinx Virtex 5 | 3091 slices | 1724 Mbit/s | 37.0 MHz |
BLAKE-32 | Submission document | Core functionality | Compression function with 8 G function units | Xilinx Virtex 4 | 3087 slices | 2235 Mbit/s | 48.0 MHz |
BLAKE-32 | Submission document | Core functionality | Compression function with 8 G function units | Xilinx Virtex-II Pro | 1694 slices | 3103 Mbit/s | 67.0 MHz |
BLAKE-64 | Submission document | Core functionality | Compression function with 8 G function units | Xilinx Virtex 5 | 11122 slices | 1177 Mbit/s | 17.0 MHz |
BLAKE-64 | Submission document | Core functionality | Compression function with 8 G function units | Xilinx Virtex 4 | 11483 slices | 1707 Mbit/s | 25.0 MHz |
BLAKE-64 | Submission document | Core functionality | Compression function with 8 G function units | Xilinx Virtex-II Pro | 4329 slices | 2389 Mbit/s | 35.0 MHz |
CHI-224/256 | Submission document | Fully autonomous | Iterative implementation | Xilinx Virtex 2 | 1582 slices | 3200 Mbit/s | 126.0 MHz |
Grøstl-224/256 | Submission document | Fully autonomous | P & Q permutation in parallel | Xilinx Spartan 3 | 6582 slices | 4439 Mbit/s | 86.7 MHz |
Grøstl-224/256 | Submission document | Fully autonomous | P & Q permutation in parallel | Xilinx Virtex 5 | 1722 slices | 10276 Mbit/s | 200.7 MHz |
Grøstl-384/512 | Submission document | Fully autonomous | P & Q permutation in parallel | Xilinx Spartan 3 | 20233 slices | 5901 Mbit/s | 80.7 MHz |
Grøstl-384/512 | Submission document | Fully autonomous | P & Q permutation in parallel | Xilinx Virtex 5 | 5419 slices | 15395 Mbit/s | 210.5 MHz |
Keccak | Joachim Strömbergson | Fully autonomous | Core (round function, state register) only | Altera Cyclone III | 5842 LEs | 7000 Mbit/s | 123 MHz |
Keccak | Joachim Strömbergson | Fully autonomous | Core (round function, state register) only | Altera Stratix III | 4550 ALUTs | 10000 Mbit/s | 176 MHz |
Keccak | Joachim Strömbergson | Fully autonomous | Core (round function, state register) only | Xilinx Spartan 3A | 3393 slices | 4800 Mbit/s | 85 MHz |
Keccak | Joachim Strömbergson | Fully autonomous | Core (round function, state register) only | Xilinx Virtex 5 | 1483 slices | 6700 Mbit/s | 118 MHz |
MD6 | Submission document | Core functionality | Compression function with 16 parallel steps | Xilinx Virtex-II Pro | 5313 slices | 1232 Mbit/s | 150.3 MHz |
MD6 | Submission document | Core functionality | Compression function with 32 parallel steps | Xilinx Virtex-II Pro | 7529 slices | 1894 Mbit/s | 141.6 MHz |
Skein-256 | Men Long | Core functionality | UBI component | Xilinx Virtex 5 | 1001 slices | 408.7 Mbit/s | 114.9 MHz |
Skein-256 | Stefan Tillich | Fully autonomous | 8 Threefish rounds unrolled | Xilinx Virtex 5 | 937 slices | 1751 Mbit/s | 68.4 MHz |
Skein-256 | Stefan Tillich | Fully autonomous | 8 Threefish rounds unrolled | Xilinx Spartan 3 | 2421 slices | 669 Mbit/s | 26.14 MHz |
Skein-512 | Men Long | Core functionality | UBI component | Xilinx Virtex 5 | 1877 slices | 817.4 Mbit/s | 114.9 MHz |
Skein-512 | Stefan Tillich | Fully autonomous | 8 Threefish rounds unrolled | Xilinx Virtex 5 | 1632 slices | 3535 Mbit/s | 69.04 MHz |
Skein-512 | Stefan Tillich | Fully autonomous | 8 Threefish rounds unrolled | Xilinx Spartan 3 | 4273 slices | 1365 Mbit/s | 26.66 MHz |
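Since slice counts are only meaningful within one FPGA family, a reader may want to group rows by target device before comparing them. The following is a minimal sketch of that idea, not part of the original page: the `rows` list holds a few fully autonomous entries copied from the table above, and the helper name `group_by_device` is a hypothetical choice for illustration.

```python
from collections import defaultdict

# (hash function, device, size in slices, throughput in Mbit/s)
# A few fully autonomous rows copied from the FPGA table above.
rows = [
    ("Grøstl-224/256", "Xilinx Virtex 5", 1722, 10276),
    ("Keccak", "Xilinx Virtex 5", 1483, 6700),
    ("Skein-256", "Xilinx Virtex 5", 937, 1751),
    ("Grøstl-224/256", "Xilinx Spartan 3", 6582, 4439),
    ("Skein-256", "Xilinx Spartan 3", 2421, 669),
]


def group_by_device(rows):
    """Collect table rows per target device (hypothetical helper)."""
    groups = defaultdict(list)
    for name, device, slices, mbit_s in rows:
        groups[device].append((name, slices, mbit_s))
    return dict(groups)


by_device = group_by_device(rows)
for device, entries in by_device.items():
    print(device)
    # Within one device, list entries by ascending slice count.
    for name, slices, mbit_s in sorted(entries, key=lambda e: e[1]):
        print(f"  {name}: {slices} slices, {mbit_s} Mbit/s")
```

Even within one device group, the implementation details (for example, the degree of unrolling) still differ, so the grouped figures remain a rough guide only.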
3 High-Speed Implementations (ASIC)
Hash Function Name | Reference | Impl. Scope | Implementation Details | Technology | Size | Throughput | Clock Frequency |
---|---|---|---|---|---|---|---|
AURORA-224/256 | Submission document | Fully autonomous | One round of one MSM and one CPM in parallel with 1 cycle latency (Type-H1), table-lookup S-box | 0.13 µm | 35.02 kGates | 10352 Mbit/s | 363.9 MHz |
AURORA-384/512 | Submission document | Fully autonomous | One round of one MSM and two CPMs in parallel with 1 cycle latency (Type-H1), table-lookup S-box | 0.13 µm | 56.75 kGates | 9132 Mbit/s | 361.2 MHz |
BLAKE-32 | Submission document | Core functionality | Compression function with 8 G function units | UMC 0.18 µm | 58.30 kGates | 5295 Mbit/s | 114 MHz |
BLAKE-32 | Submission document | Core functionality | Compression function with 4 G function units | UMC 0.18 µm | 41.31 kGates | 4153 Mbit/s | 170 MHz |
BLAKE-64 | Submission document | Core functionality | Compression function with 8 G function units | UMC 0.18 µm | 132.47 kGates | 5910 Mbit/s | 87 MHz |
BLAKE-64 | Submission document | Core functionality | Compression function with 4 G function units | UMC 0.18 µm | 82.73 kGates | 4810 Mbit/s | 136 MHz |
CHI-224/256 | Submission document | Fully autonomous | Iterative implementation | 0.13 µm | 101.46 kGates | 600 Mbit/s | 188 MHz |
Fugue-256 | Submission document | Fully autonomous | Four SMIX transformations parallel (SUPER4_P) | IBM 90 nm | 109.85 kGates | 13913 Mbit/s | 869.5 MHz |
Grøstl-224/256 | Grøstl website | Fully autonomous | One shared permutation for P & Q, one pipeline stage | UMC 0.18 µm | 58.4 kGates | 6290 Mbit/s | 270.2 MHz |
Grøstl-384/512 | Submission document | Fully autonomous | P & Q permutation in parallel | UMC 0.18 µm | 341 kGates | 6225 Mbit/s | 85.1 MHz |
Keccak | Submission document | Fully autonomous | Core (round function, state register) & IO buffer | ST 0.13 µm | 48 kGates | 28400 Mbit/s | 500 MHz |
Keccak | Submission document | Fully autonomous | Core (round function, state register) only | ST 0.13 µm | 40 kGates | 15000 Mbit/s | 500 MHz |
LANE-224/256 | Submission document | Fully autonomous | Six permutation blocks in parallel (two full AES engines each) | 0.13 µm | 243.49 kGates | 14191 Mbit/s | 305 MHz |
LANE-384/512 | Submission document | Fully autonomous | Six permutation blocks in parallel (four full AES engines each) | 0.13 µm | 466.19 kGates | 20958 Mbit/s | 286 MHz |
Lesamnta-256 | Submission document | Fully autonomous | | 90 nm | 190.1 kGates | 6026 Mbit/s | 282.5 MHz |
Lesamnta-512 | Submission document | Fully autonomous | | 90 nm | 393 kGates | 9992 Mbit/s | 234.2 MHz |
Luffa-224/256 | Supporting document | Fully autonomous | Three permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each) | 0.13 µm | 26.85 kGates | 12642 Mbit/s | 444 MHz |
Luffa-384 | Supporting document | Fully autonomous | Four permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each) | 0.13 µm | 34.99 kGates | 12642 Mbit/s | 444 MHz |
Luffa-512 | Supporting document | Fully autonomous | Five permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each) | 0.13 µm | 44.16 kGates | 12642 Mbit/s | 444 MHz |
MD6 | Submission document | Core functionality | Compression function with 48 parallel steps | GPDSK 90 nm | 145 kGates | N/A | 200 MHz |
MD6 | Submission document | Using external memory | Compression function with 16 parallel steps & memory control logic | GPDSK 90 nm | 105 kGates | N/A | 200 MHz |
Skein-256 | Stefan Tillich | Fully autonomous | 8 Threefish rounds unrolled | UMC 0.18 µm | 53.87 kGates | 1762 Mbit/s | 68.8 MHz |
Skein-512 | Stefan Tillich | Fully autonomous | 8 Threefish rounds unrolled | UMC 0.18 µm | 102.35 kGates | 2501 Mbit/s | 48.8 MHz |
4 Low-Area Implementations (ASIC)
Hash Function Name | Reference | Impl. Scope | Implementation Details | Technology | Size | Throughput | Clock Frequency |
---|---|---|---|---|---|---|---|
AURORA-224/256 | Submission document | Fully autonomous | One round of one MSM or one CPM with 2 cycles latency (Type-H4) | 0.13 µm | 11.11 kGates | 2179 Mbit/s | 306.4 MHz |
AURORA-384/512 | Submission document | Fully autonomous | One round of one MSM or one CPM with 2 cycles latency (Type-H4) | 0.13 µm | 14.61 kGates | 1191 Mbit/s | 293.1 MHz |
BLAKE-32 | Submission document | Core functionality | Compression function with a single G function unit | UMC 0.18 µm | 10.54 kGates | 253 Mbit/s | 40 MHz |
BLAKE-32 | Submission document | Core functionality | Compression function with a half G function unit | UMC 0.18 µm | 9.89 kGates | 127 Mbit/s | 40 MHz |
BLAKE-64 | Submission document | Core functionality | Compression function with a single G function unit | UMC 0.18 µm | 20.61 kGates | 181 Mbit/s | 20 MHz |
BLAKE-64 | Submission document | Core functionality | Compression function with a half G function unit | UMC 0.18 µm | 19.46 kGates | 91 Mbit/s | 20 MHz |
CHI-224/256 | Submission document | Fully autonomous | Iterative implementation | 0.13 µm | 62.99 kGates | 121 Mbit/s | 38 MHz |
Fugue-256 | Submission document | Fully autonomous | One SMIX transformation (SUPER1_L) | IBM 90 nm | 59.22 kGates | 2000 Mbit/s | 500 MHz |
Grøstl-224/256 | Grøstl website | Fully autonomous | 64-bit datapath, P & Q permutation shared | UMC 0.18 µm | 17 kGates | 645 Mbit/s | 246.9 MHz |
Keccak | Submission document | Using external memory | Small core using system memory | ST 0.13 µm | 6 kGates | 26 Mbit/s(*) | 100 MHz |
LANE-224/256 | Submission document | Fully autonomous | One permutation block (one S-box and MixColumns block) | 0.13 µm | 16.46 kGates | 23.3 Mbit/s | 100 MHz |
Lesamnta-256 | Submission document | Fully autonomous | | 90 nm | 20.7 kGates | 336.9 Mbit/s | 169.8 MHz |
Lesamnta-512 | Submission document | Fully autonomous | | 90 nm | 44.3 kGates | 571.9 Mbit/s | 144.1 MHz |
Luffa-224/256 | Supporting document | Fully autonomous | One permutation block (One S-box, one MixWord block) | 0.13 µm | 10.16 kGates | 28.7 Mbit/s | 100 MHz |
(*) Estimation for 64-bit memory interface based on published performance figures: (1024 bits/permutation) * (100 * 10^6 cycles/s) / (3870 cycles/permutation) = 26.46 * 10^6 bits/s
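The estimate in the footnote can be reproduced with a few lines of Python. This is only a sketch of the arithmetic stated above; the function name `throughput_bits_per_s` is a hypothetical choice, and the three input figures are taken directly from the footnote.

```python
def throughput_bits_per_s(bits_per_permutation, clock_hz, cycles_per_permutation):
    """Throughput = bits processed per permutation * permutations per second."""
    return bits_per_permutation * clock_hz / cycles_per_permutation


# Figures from the footnote: 1024 bits/permutation, 100 MHz clock,
# 3870 cycles/permutation.
estimate = throughput_bits_per_s(1024, 100e6, 3870)
print(f"{estimate / 1e6:.2f} Mbit/s")  # prints "26.46 Mbit/s"
```

The same function applies to any implementation for which the bits per iteration, clock frequency, and cycle count are known.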
5 Call for contributions
Implementers (both submitters and non-submitters): Do you have results that complement this site? Let us know at sha3zoo-hardware@iaik.tugraz.at