SHA-3 Hardware Implementations
1 Call for Contributions
Implementers (both submitters and non-submitters): Do you have results that complement this site? Let us know at sha3zoo-hardware@iaik.tugraz.at. If you are making your HDL code available, please also provide us with the corresponding information.
2 Important Information
This page summarizes key properties of reported hardware implementations of the SHA-3 candidates that are currently under consideration by NIST. This is a work in progress. If you know of any implementations that should be mentioned on this page, refer to our call for contributions.
A list of hardware implementations of the round 1 candidates can be found here. Please note that the page for round 1 candidates is provided for reference and will not be updated.
The implementations are categorized into FPGA and standard-cell ASIC implementations. Note that the diversity of implementation scope, target technologies, and synthesis tools makes direct comparisons between different hardware implementations difficult. The more of these parameters agree, the more reasonable the comparison becomes.
The target technology should be as similar as possible. For FPGA implementations, it is desirable to compare implementations on the same target device (or at least on devices of the same FPGA family). For standard-cell ASIC implementations, at least the minimum gate length of the process (e.g., 0.13 µm) should agree. Ideally, the implementations use the same standard-cell library (which implies the use of the same process technology).
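To facilitate the comparison of hardware modules with different implementation scopes, we classify them into three categories: fully autonomous implementations, implementations with external memory, and implementations of core functionality. Each category is described below.
If you have suggestions regarding the structure of this site, let us know at sha3zoo-hardware@iaik.tugraz.at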
2.1 Fully Autonomous Implementation
Such hardware implementations include the complete functionality of a SHA-3 candidate (or a specific version thereof). That is, the input message can be loaded piecewise into the hardware module, which delivers the message digest as output. All hash calculations happen exclusively within the hardware module. If integrated into a system, the achievable throughput of a fully autonomous implementation depends on the speed of the hardware module itself and the speed of the (system-dependent) data interface delivering the input message.
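The dependence on both the module and the interface can be illustrated with a minimal sketch. It assumes the simplest possible bottleneck model, in which the sustained throughput is the smaller of the two rates. The function name, the 32-bit interface at 100 MHz, and the core rate are hypothetical and not taken from any listed implementation.

```python
def system_throughput_mbit_s(core_mbit_s: float, interface_mbit_s: float) -> float:
    """Sustained throughput under a simple bottleneck model (an assumption):
    the slower of the hash core and the data interface sets the limit."""
    return min(core_mbit_s, interface_mbit_s)


# Hypothetical interface: 32 bits per cycle at 100 MHz -> 3200 Mbit/s.
interface_rate = 32 * 100.0

# Hypothetical core rate, chosen only for illustration.
core_rate = 10000.0

effective = system_throughput_mbit_s(core_rate, interface_rate)
print(effective)  # 3200.0
```

In this example the interface, not the hash core, limits the system. Real systems may behave differently, for instance when buffering or protocol overhead is involved.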
2.2 Implementation with External Memory
These implementations use external memory to hold intermediate values during the hashing of a message. The implemented hardware itself normally consists of the core logic functionality of the hash function, some registers for short-lived temporary values, and possibly a memory controller for access to the external memory. Such implementations can load the input message either over a dedicated interface (similar to a fully autonomous implementation) or from the external memory. To reach the maximum throughput of the hardware module, the external memory must be sufficiently fast.
2.3 Implementation of Core Functionality
Such implementations comprise only important parts of the hash function (e.g., the compression function), which normally allows a first-order estimate of the performance figures of a full implementation.
3 Summary of All Results
This section covers four categories of implementations (high-speed and low-area, each for FPGA and ASIC) and lists the known published results. If the HDL source code is available, a link is provided as well.
3.1 High-Speed Implementations (FPGA)
Important note: The size and functionality of slices vary between FPGA families. A direct comparison of the slice counts of implementations on different FPGA families is therefore problematic.
Hash Function Name | Reference / HDL | Impl. Scope | Impl. Details | Technology | Size | Throughput | Clock Frequency |
---|---|---|---|---|---|---|---|
BLAKE-32 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with 8 G function units | Xilinx Virtex-II Pro | 3091 slices | 1724 Mbit/s | 37.0 MHz |
BLAKE-32 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with 8 G function units | Xilinx Virtex 4 | 3087 slices | 2235 Mbit/s | 48.0 MHz |
BLAKE-32 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with 8 G function units | Xilinx Virtex 5 | 1694 slices | 3103 Mbit/s | 67.0 MHz |
BLAKE-32 | Namin and Hasan [2] / N/A | Core functionality | Compression function with 8 G function units and I/O registers | Altera Stratix III | 5435 ALUTs | 2186.2 Mbit/s | 46.97 MHz |
BLAKE-32 | Kobayashi et al. [3] / RCIS webpage | Fully autonomous | | Xilinx Virtex 5 | 1660 slices | 2676 Mbit/s | 115 MHz |
BLAKE-64 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with 8 G function units | Xilinx Virtex-II Pro | 11122 slices | 1177 Mbit/s | 17.0 MHz |
BLAKE-64 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with 8 G function units | Xilinx Virtex 4 | 11483 slices | 1707 Mbit/s | 25.0 MHz |
BLAKE-64 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with 8 G function units | Xilinx Virtex 5 | 4329 slices | 2389 Mbit/s | 35.0 MHz |
Blue Midnight Wish-256 | Namin and Hasan [2] / N/A | Core functionality | Compression function with f0, f1, and f2 unrolled in sequence and I/O registers | Altera Stratix III | 12917 ALUTs | 4889.6 Mbit/s | 9.55 MHz |
CubeHash8/1-256(***) | Baldwin et al. [4] / N/A | Core functionality | 2 compression functions unrolled | Xilinx Spartan 3 | 3268 slices | 70 Mbit/s | 37.9 MHz |
CubeHash8/1-256(***) | Baldwin et al. [4] / N/A | Core functionality | 1 iterated compression function | Xilinx Virtex 5 | 1178 slices | 160 Mbit/s | 166.8 MHz |
CubeHash16/32-256 | Kobayashi et al. [3] / RCIS webpage | Fully autonomous | | Xilinx Virtex 5 | 590 slices | 2960 Mbit/s | 185 MHz |
ECHO-224/256 | Lu et al. [5] / N/A | Fully autonomous | | Xilinx Virtex 5 | 9333 slices | 14860 Mbit/s | 87.1 MHz |
ECHO-224/256 | Kinsy and Uhler [21] / N/A | Fully autonomous | 273 cycles per block | Altera Cyclone II | 39091 LEs | 397 Mbit/s(*) | 70.6 MHz |
ECHO-256 | Kobayashi et al. [3] / RCIS webpage | Fully autonomous | | Xilinx Virtex 5 | 3556 slices | 1614 Mbit/s | 104 MHz |
ECHO-384/512 | Lu et al. [5] / N/A | Fully autonomous | | Xilinx Virtex 5 | 9097 slices | 7810 Mbit/s | 83.9 MHz |
ECHO-384/512 | Kinsy and Uhler [21] / N/A | Fully autonomous | 341 cycles per block | Altera Cyclone II | 39091 LEs | 212 Mbit/s(**) | 70.6 MHz |
Grøstl-224/256 | Jungk et al. [6] / N/A | Fully autonomous | P & Q permutation in parallel | Xilinx Spartan 3 | 6136 slices | 4520 Mbit/s | 88.3 MHz |
Grøstl-224/256 | Submission doc. [7] / N/A | Fully autonomous | P & Q permutation in parallel | Xilinx Virtex 5 | 1722 slices | 10276 Mbit/s | 200.7 MHz |
Grøstl-224/256 | Baldwin et al. [4] / N/A | Core functionality | P & Q permutation in parallel, S-box in BRAM | Xilinx Spartan 3 | 4827 slices | 3660 Mbit/s | 71.53 MHz |
Grøstl-224/256 | Baldwin et al. [4] / N/A | Core functionality | P & Q permutation in parallel, S-box in BRAM | Xilinx Virtex 5 | 4516 slices | 7310 Mbit/s | 142.87 MHz |
Grøstl-256 | Kobayashi et al. [3] / RCIS webpage | Fully autonomous | | Xilinx Virtex 5 | 4057 slices | 5171 Mbit/s | 101 MHz |
Grøstl-384/512 | Submission doc. [7] / N/A | Fully autonomous | P & Q permutation in parallel | Xilinx Spartan 3 | 20233 slices | 5901 Mbit/s | 80.7 MHz |
Grøstl-384/512 | Baldwin et al. [4] / N/A | Core functionality | P & Q permutation parallel, S-box in LUTs | Xilinx Spartan 3 | 17452 slices | 3180 Mbit/s | 79.61 MHz |
Grøstl-384/512 | Baldwin et al. [4] / N/A | Core functionality | P & Q permutation parallel, S-box in LUTs | Xilinx Virtex 5 | 19161 slices | 6090 Mbit/s | 83.33 MHz |
Grøstl-384/512 | Submission doc. [7] / N/A | Fully autonomous | P & Q permutation in parallel | Xilinx Virtex 5 | 5419 slices | 15395 Mbit/s | 210.5 MHz |
Grøstl-384/512 | Jungk and Reith [22] / N/A | Fully autonomous | Shared P & Q permutation | Xilinx Spartan 3 | 8308 slices | 3474 Mbit/s | 95 MHz |
Hamsi-256 | Kobayashi et al. [3] / RCIS webpage | Fully autonomous | | Xilinx Virtex 5 | 718 slices | 1680 Mbit/s | 210 MHz |
Keccak | Updated spec. (v1.2) [8] / Submission webpage | Fully autonomous | Core (round function, state register) & IO buffer | Altera Cyclone III | 5776 LEs | 7500 Mbit/s | 133 MHz |
Keccak | Updated spec. (v1.2) [8] / Submission webpage | Fully autonomous | Core (round function, state register) & IO buffer | Altera Stratix III | 4713 ALUTs | 12400 Mbit/s | 218 MHz |
Keccak | J. Strömbergson [9] / Submission webpage | Fully autonomous | Core (round function, state register) only | Xilinx Spartan 3A | 3393 slices | 4800 Mbit/s | 85 MHz |
Keccak | Updated spec. (v1.2) [8] / Submission webpage | Fully autonomous | Core (round function, state register) & IO buffer | Xilinx Virtex 5 | 1412 slices | 6900 Mbit/s | 122 MHz |
Luffa-256 | Namin and Hasan [2] / N/A | Core functionality | Compression function (1 cycle latency) and I/O registers | Altera Stratix III | 16552 ALUTs | 12042.2 Mbit/s | 47.04 MHz |
Luffa-256 | Kobayashi et al. [3] / RCIS webpage | Fully autonomous | | Xilinx Virtex 5 | 1048 slices | 6343 Mbit/s | 223 MHz |
Shabal | Feron and Francq [10] / N/A | Fully autonomous | 36 adders in permutation | Xilinx Virtex 5 | 1171 slices | 2588 Mbit/s | 126 MHz |
Shabal | Baldwin et al. [4] / N/A | Core functionality | 36 adders in permutation | Xilinx Spartan 3 | 2223 slices | 740 Mbit/s | 71.48 MHz |
Shabal | Baldwin et al. [4] / N/A | Core functionality | 36 adders in permutation | Xilinx Virtex 5 | 2768 slices | 1450 Mbit/s | 138.87 MHz |
Shabal-256 | Namin and Hasan [2] / N/A | Core functionality | Compression function with I/O registers (latency of 16 clock cycles) | Altera Stratix III | 1440 ALUTs | 3125.6 Mbit/s | 195.35 MHz |
Shabal-256 | Kobayashi et al. [3] / RCIS webpage | Fully autonomous | | Xilinx Virtex 5 | 1251 slices | 1739 Mbit/s | 214 MHz |
Shabal-512 | Detrey et al. [23] / INRIA webpage (see SCM tree) | Fully autonomous | Exploiting SRL16 primitive | Xilinx Virtex 5 | 153 slices | 2051 Mbit/s | 256 MHz |
Shabal-512 | Detrey et al. [23] / INRIA webpage (see SCM tree) | Fully autonomous | Exploiting SRL16 primitive | Xilinx Spartan 3 | 499 slices | 800 Mbit/s | 100 MHz |
Skein-256-h | Men Long [11] / N/A | Core functionality | UBI component | Xilinx Virtex 5 | 1001 slices | 408.7 Mbit/s | 114.9 MHz |
Skein-256-256 | Stefan Tillich [12] / On request | Fully autonomous | 8 Threefish rounds unrolled | Xilinx Virtex 5 | 937 slices | 1751 Mbit/s | 68.4 MHz |
Skein-256-256 | Stefan Tillich [12] / On request | Fully autonomous | 8 Threefish rounds unrolled | Xilinx Spartan 3 | 2421 slices | 669 Mbit/s | 26.14 MHz |
Skein-256-256 | Kobayashi et al. [3] / RCIS webpage | Fully autonomous | | Xilinx Virtex 5 | 854 slices | 1482 Mbit/s | 115 MHz |
Skein-512-h | Men Long [11] / N/A | Core functionality | UBI component | Xilinx Virtex 5 | 1877 slices | 817.4 Mbit/s | 114.9 MHz |
Skein-512-512 | Stefan Tillich [12] / On request | Fully autonomous | 8 Threefish rounds unrolled | Xilinx Virtex 5 | 1632 slices | 3535 Mbit/s | 69.04 MHz |
Skein-512-512 | Stefan Tillich [12] / On request | Fully autonomous | 8 Threefish rounds unrolled | Xilinx Spartan 3 | 4273 slices | 1365 Mbit/s | 26.66 MHz |
(*) Estimated peak throughput ignoring I/O bottleneck resulting from specific interface: (1536 bits/block) * (70.6 * 10^6 cycles/s) / (273 cycles/block) = 397.22 * 10^6 bits/s.
(**) Estimated peak throughput ignoring I/O bottleneck resulting from specific interface: (1024 bits/block) * (70.6 * 10^6 cycles/s) / (341 cycles/block) = 212.01 * 10^6 bits/s.
(***) CubeHash16/32-h implemented in a similar fashion can be expected to have throughput increased by a factor of about 16.
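The two ECHO estimates in footnotes (*) and (**) follow from one relation: bits per block times clock frequency, divided by cycles per block. A minimal sketch that reproduces both figures; the function name is illustrative only, and the result is a peak value that ignores any I/O bottleneck, as the footnotes state.

```python
def peak_throughput_mbit_s(block_bits: int, clock_mhz: float, cycles_per_block: int) -> float:
    """Peak throughput in Mbit/s, ignoring any I/O bottleneck.

    With the clock given in MHz (10^6 cycles/s), the result is in Mbit/s.
    """
    return block_bits * clock_mhz / cycles_per_block


# Values taken from footnotes (*) and (**) for the Cyclone II implementations.
echo_224_256 = peak_throughput_mbit_s(1536, 70.6, 273)
echo_384_512 = peak_throughput_mbit_s(1024, 70.6, 341)

print(f"{echo_224_256:.2f}")  # 397.22
print(f"{echo_384_512:.2f}")  # 212.01
```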
3.2 Low-Area Implementations (FPGA)
Hash Function Name | Reference / HDL | Impl. Scope | Implementation Details | Technology | Size | Throughput | Clock Frequency |
---|---|---|---|---|---|---|---|
BLAKE-32 | Beuchat et al. [13] / N/A | Fully autonomous | Rescheduled G function | Xilinx Spartan-3 | 124 slices | 115 Mbit/s | 190.0 MHz |
BLAKE-32 | Beuchat et al. [13] / N/A | Fully autonomous | Rescheduled G function | Xilinx Virtex-4 | 124 slices | 216 Mbit/s | 357.0 MHz |
BLAKE-32 | Beuchat et al. [13] / N/A | Fully autonomous | Rescheduled G function | Xilinx Virtex-5 | 56 slices | 225 Mbit/s | 372.0 MHz |
BLAKE-32 | Beuchat et al. [13] / N/A | Fully autonomous | Rescheduled G function | Altera Cyclone III | 285 LEs | 116 Mbit/s | 192.0 MHz |
BLAKE-32 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with 1 G function unit | Xilinx Virtex-II Pro | 958 slices | 371 Mbit/s | 59.0 MHz |
BLAKE-32 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with 1 G function unit | Xilinx Virtex 4 | 960 slices | 430 Mbit/s | 68.0 MHz |
BLAKE-32 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with 1 G function unit | Xilinx Virtex 5 | 390 slices | 575 Mbit/s | 91.0 MHz |
BLAKE-64 | Beuchat et al. [13] / N/A | Fully autonomous | Rescheduled G function | Xilinx Spartan-3 | 229 slices | 138 Mbit/s | 158.0 MHz |
BLAKE-64 | Beuchat et al. [13] / N/A | Fully autonomous | Rescheduled G function | Xilinx Virtex-4 | 230 slices | 219 Mbit/s | 250.0 MHz |
BLAKE-64 | Beuchat et al. [13] / N/A | Fully autonomous | Rescheduled G function | Xilinx Virtex-5 | 108 slices | 314 Mbit/s | 358.0 MHz |
BLAKE-64 | Beuchat et al. [13] / N/A | Fully autonomous | Rescheduled G function | Altera Cyclone III | 542 LEs | 123 Mbit/s | 140.0 MHz |
BLAKE-64 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with 1 G function unit | Xilinx Virtex-II Pro | 1802 slices | 326 Mbit/s | 36.0 MHz |
BLAKE-64 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with 1 G function unit | Xilinx Virtex 4 | 1856 slices | 381 Mbit/s | 42.0 MHz |
BLAKE-64 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with 1 G function unit | Xilinx Virtex 5 | 939 slices | 533 Mbit/s | 59.0 MHz |
ECHO | Beuchat et al. [24] / N/A | Fully autonomous | Adapted towards FPGA implementation (127 slices and 1 memory block) | Xilinx Virtex 5 | 127 slices | 72 Mbit/s | 352.0 MHz |
Grøstl-224/256 | Jungk et al. [6] / N/A | Fully autonomous | 64-bit datapath, P & Q permutation in parallel | Xilinx Spartan 3 | 2486 slices | 404 Mbit/s | 63.2 MHz |
Grøstl-224/256 | Jungk et al. [6] / N/A | Fully autonomous | 64-bit datapath, P & Q permutation in parallel | Xilinx Virtex 2 Pro | 2754 slices | 512 Mbit/s | 81.5 MHz |
Grøstl-224/256 | Jungk and Reith [22] / N/A | Fully autonomous | Shared P & Q permutation, S-Box based on composite field arithmetic | Xilinx Spartan 3 | 1276 slices | 192 Mbit/s | 60 MHz |
Grøstl-384/512 | Jungk and Reith [22] / N/A | Fully autonomous | Shared P & Q permutation, S-Box based on composite field arithmetic | Xilinx Spartan 3 | 2110 slices | 144 Mbit/s | 63 MHz |
Keccak | Updated spec. (v1.2) [8] / Submission webpage | Using external memory | Small core using system memory | Altera Stratix III | 855 ALUTs | 96.8 Mbit/s | 366 MHz |
Keccak | Updated spec. (v1.2) [8] / Submission webpage | Using external memory | Small core using system memory | Altera Cyclone III | 1559 LEs | 47.8 Mbit/s | 181 MHz |
Keccak | Updated spec. (v1.2) [8] / Submission webpage | Using external memory | Small core using system memory | Xilinx Virtex 5 | 444 slices | 70.1 Mbit/s | 265 MHz |
Shabal | Feron and Francq [10] / N/A | Fully autonomous | 36 adders in permutation | Xilinx Virtex 5 | 596 slices (+ 40 DSP blocks) | 1142 Mbit/s | 109 MHz |
Shabal | Baldwin et al. [4] / N/A | Core functionality | 1 adder in permutation | Xilinx Spartan 3 | 1933 slices | 540 Mbit/s | 89.71 MHz |
Shabal | Baldwin et al. [4] / N/A | Core functionality | 1 adder in permutation | Xilinx Virtex 5 | 2307 slices | 1330 Mbit/s | 222.22 MHz |
Shabal-512 | Detrey et al. [23] / INRIA webpage (see SCM tree) | Fully autonomous | Exploiting SRL16 primitive | Xilinx Virtex 5 | 153 slices | 2051 Mbit/s | 256 MHz |
Shabal-512 | Detrey et al. [23] / INRIA webpage (see SCM tree) | Fully autonomous | Exploiting SRL16 primitive | Xilinx Spartan 3 | 499 slices | 800 Mbit/s | 100 MHz |
Skein-256-256 | Namin and Hasan [2] / N/A | Core functionality | One round of Threefish iterated | Altera Stratix III | 1385 ALUTs | 573.9 Mbit/s | 161.42 MHz |
3.3 High-Speed Implementations (ASIC)
A comparison of implementations of all 14 round-two candidates was presented informally at IAIK (Graz University of Technology) on Sept. 16, 2009. The updated presentation slides can be found here.
Hash Function Name | Reference / HDL | Impl. Scope | Implementation Details | Technology | Size | Throughput | Clock Frequency |
---|---|---|---|---|---|---|---|
BLAKE-32 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with 8 G function units | UMC 0.18 µm | 58.30 kGates | 5295 Mbit/s | 114 MHz |
BLAKE-32 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with 4 G function units | UMC 0.18 µm | 41.31 kGates | 4153 Mbit/s | 170 MHz |
BLAKE-32 | Namin and Hasan [2] / N/A | Core functionality | Compression function with 8 G function units and I/O registers | STM 90 nm | 53 kGates | 4475 Mbit/s(*) | 96.15 MHz |
BLAKE-32 | Tillich et al. [14] / On request | Fully autonomous | Compression function with 4 G function units with CSAs | UMC 0.18 µm | 45.64 kGates | 3971 Mbit/s | 170.64 MHz |
BLAKE-64 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with 8 G function units | UMC 0.18 µm | 132.47 kGates | 5910 Mbit/s | 87 MHz |
BLAKE-64 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with 4 G function units | UMC 0.18 µm | 82.73 kGates | 4810 Mbit/s | 136 MHz |
Blue Midnight Wish-256 | Namin and Hasan [2] / N/A | Core functionality | Compression function with f0, f1, and f2 unrolled in sequence and I/O registers | STM 90 nm | 164 kGates | 26665 Mbit/s(*) | 52.08 MHz |
Blue Midnight Wish-256 | Tillich et al. [14] / On request | Fully autonomous | Compression function with f0, f1, and f2 unrolled | UMC 0.18 µm | 169.74 kGates | 5358 Mbit/s | 10.46 MHz |
CubeHash16/32-h | Tillich et al. [14] / On request | Fully autonomous | Dynamically reconfigurable r and b parameters, two rounds unrolled | UMC 0.18 µm | 58.87 kGates | 4665 Mbit/s | 145.77 MHz |
CubeHash16/32-h | Bernet et al. [20] / N/A | Fully autonomous | One round per cycle | 0.13 µm | 34.33 kGates | 9248 Mbit/s(***) | 578 MHz |
CubeHash16/32-h | Bernet et al. [20] / N/A | Fully autonomous | Half a round per cycle | 0.13 µm | 21.54 kGates | 8000 Mbit/s(***) | 1000 MHz |
ECHO-224/256 | Lu et al. [5] / N/A | Fully autonomous | | 0.13 µm | 521.1 kGates | 14850 Mbit/s | 87.1 MHz |
ECHO-256 | Tillich et al. [14] / On request | Fully autonomous | Four parallel AES rounds, 16 AES MixColumns 32-bit column multipliers | UMC 0.18 µm | 141.49 kGates | 2246 Mbit/s | 141.84 MHz |
ECHO-384/512 | Lu et al. [5] / N/A | Fully autonomous | | 0.13 µm | 516.8 kGates | 7750 Mbit/s | 83.3 MHz |
Fugue-256 | Submission doc. [15] / N/A | Fully autonomous | Four columns of SMIX transformation in parallel (SUPER4_P) | IBM 90 nm | 109.85 kGates | 13913 Mbit/s | 869.5 MHz |
Fugue-256 | Tillich et al. [14] / On request | Fully autonomous | Four columns of SMIX transformation in parallel | UMC 0.18 µm | 46.26 kGates | 4092 Mbit/s | 255.75 MHz |
Grøstl-256 | Tillich et al. [14] / On request | Fully autonomous | One shared permutation for P & Q, one pipeline stage | UMC 0.18 µm | 58.40 kGates | 6290 Mbit/s | 270.27 MHz |
Grøstl-384/512 | Submission doc. [7] / N/A | Fully autonomous | P & Q permutation in parallel | UMC 0.18 µm | 341 kGates | 6225 Mbit/s | 85.1 MHz |
Hamsi-256 | Junfeng Fan (Hamsi website) [16] / N/A | Fully autonomous | | 0.13 µm | 22 kGates | 4940 Mbit/s | 1080 MHz |
Hamsi-256 | Tillich et al. [14] / On request | Fully autonomous | Three instances of P/Pf function unrolled | UMC 0.18 µm | 58.66 kGates | 5565 Mbit/s | 173.91 MHz |
Hamsi-512 | Junfeng Fan (Hamsi website) [16] / N/A | Fully autonomous | | 0.13 µm | 50 kGates | 3970 Mbit/s | 820 MHz |
JH-256 | Tillich et al. [14] / On request | Fully autonomous | 320 S-boxes, one round of R8 per cycle | UMC 0.18 µm | 58.83 kGates | 4991 Mbit/s | 380.22 MHz |
Keccak | Updated spec. (v1.2) [8] / Submission webpage | Fully autonomous | Core (round function, state register) & IO buffer | ST 0.13 µm | 48 kGates | 29900 Mbit/s | 526 MHz |
Keccak | Submission doc. [8] / Submission webpage | Fully autonomous | Core (round function, state register) only | ST 0.13 µm | 40 kGates | 15000 Mbit/s | 500 MHz |
Keccak(-256) | Tillich et al. [14] / On request | Fully autonomous | One instance of Keccak-f round | UMC 0.18 µm | 56.32 kGates | 21229 Mbit/s | 487.80 MHz |
Luffa-224/256 | Knežević and Verbauwhede [17] / Author's webpage | Fully autonomous | Three permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each) | UMC 0.13 µm | 30.83 kGates | 31960 Mbit/s | 1124 MHz |
Luffa-256 | Namin and Hasan [2] / N/A | Core functionality | Compression function (1 cycle latency) and I/O registers | STM 90 nm | 122 kGates | 25702 Mbit/s(*) | 100.4 MHz |
Luffa-224/256 | Tillich et al. [14] / On request | Fully autonomous | Three permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each) | UMC 0.18 µm | 44.97 kGates | 13741 Mbit/s | 483.09 MHz |
Luffa-384 | Knežević and Verbauwhede [17] / Author's webpage | Fully autonomous | Four permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each) | UMC 0.13 µm | 50.07 kGates | 23126 Mbit/s | 813 MHz |
Luffa-512 | Knežević and Verbauwhede [17] / Author's webpage | Fully autonomous | Five permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each) | UMC 0.13 µm | 65.1 kGates | 19617 Mbit/s | 690 MHz |
Shabal-256 | Namin and Hasan [2] / N/A | Core functionality | Compression function with I/O registers (latency of 16 clock cycles) | STM 90 nm | 20 kGates | 4408 Mbit/s(*) | 413.22 MHz |
Shabal-256 | Tillich et al. [14] / On request | Fully autonomous | One word rotation per cycle, 50 cycles per block | UMC 0.18 µm | 54.19 kGates | 3282 Mbit/s | 320.51 MHz |
Shabal | Bernet et al. [20] / N/A | Fully autonomous | One word rotation per cycle, 52 cycles per block | 0.13 µm | 41.32 kGates | 6351 Mbit/s(***) | 645 MHz |
SHAvite-3-256 | Tillich et al. [14] / On request | Fully autonomous | Four AES rounds (two for compression, two for message expansion) | UMC 0.18 µm | 57.39 kGates | 3152 Mbit/s | 227.79 MHz |
SIMD-256(**) | Tillich et al. [14] / On request | Fully autonomous | Two FFT-64 with two FFT-8 and 16 multipliers (8x8 bit) each | UMC 0.18 µm | 104.17 kGates | 924 Mbit/s | 64.93 MHz |
Skein-256-256 | Stefan Tillich [12] / On request | Fully autonomous | 8 Threefish rounds unrolled | UMC 0.18 µm | 53.87 kGates | 1762 Mbit/s | 68.8 MHz |
Skein-256-256 | Namin and Hasan [2] / N/A | Core functionality | All 72 Threefish rounds unrolled | STM 90 nm | 369 kGates | 3126 Mbit/s(*) | 12.21 MHz |
Skein-256-256 | Tillich et al. [14] / On request | Fully autonomous | 8 Threefish rounds unrolled | UMC 0.18 µm | 58.61 kGates | 1882 Mbit/s | 73.52 MHz |
Skein-512-512 | Tillich et al. [14] / On request | Fully autonomous | 8 Threefish rounds unrolled | UMC 0.18 µm | 102.04 kGates | 2502 Mbit/s | 48.87 MHz |
(*) Estimated peak throughput for the minimal delay of compression function: 1000 * (Input Size in bits) / [(Compression Function Delay in ns) * (Number of Cycles)] = Throughput in Mbit/s.
(**) Implementation of round-one variant.
(***) Estimated peak throughput: Throughput for CubeHash8/1-h implementation * 16.
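Footnote (*) can be written as a small helper. The sketch below is only an illustration: the function name is hypothetical, and the example delay is not taken from [2] but derived here from the 100.4 MHz listed for Luffa-256, assuming the clock period equals the compression function delay. The 256-bit input size is Luffa's message block size, and the single cycle comes from the "1 cycle latency" entry in the table.

```python
def peak_throughput_from_delay(input_bits: int, delay_ns: float, cycles: int) -> float:
    """Footnote (*): 1000 * input bits / (delay in ns * cycles) = Mbit/s."""
    return 1000.0 * input_bits / (delay_ns * cycles)


# Assumption: the delay equals the clock period at the listed 100.4 MHz.
luffa_delay_ns = 1000.0 / 100.4

luffa_256 = peak_throughput_from_delay(256, luffa_delay_ns, 1)
print(round(luffa_256))  # 25702, in line with the table entry for Luffa-256
```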
3.4 Low-Area Implementations (ASIC)
Hash Function Name | Reference / HDL | Impl. Scope | Implementation Details | Technology | Size | Throughput | Clock Frequency |
---|---|---|---|---|---|---|---|
BLAKE-32 | Tillich et al. [18] / N/A | Fully autonomous | One G function in 11 cycles | AMS 0.35 µm | 25.57 kGates | 15.4 Mbit/s | 31.25 MHz |
BLAKE-32 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with a single G function unit | UMC 0.18 µm | 10.54 kGates | 253 Mbit/s | 40 MHz |
BLAKE-32 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with a half G function unit | UMC 0.18 µm | 9.89 kGates | 127 Mbit/s | 40 MHz |
BLAKE-64 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with a single G function unit | UMC 0.18 µm | 20.61 kGates | 181 Mbit/s | 20 MHz |
BLAKE-64 | Submission doc. [1] / Submission webpage | Core functionality | Compression function with a half G function unit | UMC 0.18 µm | 19.46 kGates | 91 Mbit/s | 20 MHz |
CubeHash16/32-h | Bernet et al. [20] / N/A | Fully autonomous | Process two 32-bit words per cycle, 64 cycles per round | 0.13 µm | 7.63 kGates | 32 Mbit/s(****) | 100 MHz |
ECHO-224/256 | Lu et al. [5] / N/A | Fully autonomous | | 0.13 µm | 82.8 kGates | 373 Mbit/s | 66.6 MHz |
Fugue-256 | Submission doc. [15] / N/A | Fully autonomous | One SMIX transformation (SUPER1_L) | IBM 90 nm | 59.22 kGates | 2000 Mbit/s | 500 MHz |
Grøstl-224/256 | Tillich et al. [18] / N/A | Fully autonomous | 64-bit datapath, P & Q permutation shared | AMS 0.35 µm | 14.62 kGates | 145.9 Mbit/s | 55.87 MHz |
Grøstl-224/256 | Grøstl website [19] / N/A | Fully autonomous | 64-bit datapath, P & Q permutation shared | UMC 0.18 µm | 17 kGates | 645 Mbit/s | 246.9 MHz |
Keccak | Updated spec. (v1.2) [8] / Submission webpage | Using external memory | Small core using system memory | ST 0.13 µm | 6.5 kGates | 176.4 Mbit/s(*) | 666.7 MHz |
Keccak | Updated spec. (v1.2) [8] / Submission webpage | Using external memory | Small core using system memory, clock freq. limited to 200 MHz | ST 0.13 µm | 5 kGates | 52.9 Mbit/s(**) | 200 MHz |
Luffa-224/256 | Knežević and Verbauwhede [17] / Author's webpage | Fully autonomous | One permutation block (64 S-boxes, 4 MixWord blocks) | UMC 0.13 µm | 18.26 kGates | 2461 Mbit/s | 250 MHz |
Luffa-384 | Knežević and Verbauwhede [17] / Author's webpage | Fully autonomous | One permutation block (64 S-boxes, 4 MixWord blocks) | UMC 0.13 µm | 27.13 kGates | 1882 Mbit/s | 250 MHz |
Luffa-512 | Knežević and Verbauwhede [17] / Author's webpage | Fully autonomous | One permutation block (64 S-boxes, 4 MixWord blocks) | UMC 0.13 µm | 37.35 kGates | 1524 Mbit/s | 250 MHz |
Shabal | Bernet et al. [20] / N/A | Fully autonomous | One adder, one subtractor, one incrementer. 165 cycles per block | 0.13 µm | 23.32 kGates | 310 Mbit/s | 100 MHz |
Skein-256-256 | Tillich et al. [18] / N/A | Fully autonomous | 64-bit datapath | AMS 0.35 µm | 12.89 kGates | 19.8 Mbit/s | 80 MHz |
Skein-256-256 | Namin and Hasan [2] / N/A | Core functionality | One round of Threefish iterated | STM 90 nm | 21 kGates | 1018.8 Mbit/s(***) | 286.53 MHz |
(*) Estimation for 64-bit memory interface: (1024 bits/permutation) * (666.7 * 10^6 cycles/s) / (3870 cycles/permutation) = 176.41 * 10^6 bits/s
(**) Estimation for 64-bit memory interface: (1024 bits/permutation) * (200 * 10^6 cycles/s) / (3870 cycles/permutation) = 52.92 * 10^6 bits/s
(***) Estimated peak throughput for the minimal delay of compression function: 1000 * (Input Size in bits) / [(Compression Function Delay in ns) * (Number of Cycles)] = Throughput in Mbit/s
(****) Estimated peak throughput: Throughput for CubeHash8/1-h implementation * 16.
4 Comparative Studies
This section summarizes the reported results of publications which examined more than one round-two candidate in a similar setup.
4.1 Blake, BMW, Luffa, Shabal, Skein
Reference | HDL | Category | Impl. Scope | Technology |
---|---|---|---|---|
Namin and Hasan [2] | N/A | High-speed FPGA | Core functionality | Altera Stratix III |
Hash Function Name | Impl. Details | Size | Throughput | Clock Frequency |
---|---|---|---|---|
BLAKE-32 | Compression function with 8 G function units and I/O registers | 5435 ALUTs | 2186.2 Mbit/s | 46.97 MHz |
Blue Midnight Wish-256 | Compression function with f0, f1, and f2 unrolled in sequence and I/O registers | 12917 ALUTs | 4889.6 Mbit/s | 9.55 MHz |
Luffa-256 | Compression function (1 cycle latency) and I/O registers | 16552 ALUTs | 12042.2 Mbit/s | 47.04 MHz |
Shabal-256 | Compression function with I/O registers (latency of 16 clock cycles) | 1440 ALUTs | 3125.6 Mbit/s | 195.35 MHz |
Skein-256-256 | All 72 Threefish rounds unrolled (device too small) | N/A | N/A | N/A |
Reference | HDL | Category | Impl. Scope | Technology |
---|---|---|---|---|
Namin and Hasan [2] | N/A | High-speed ASIC | Core functionality | STM 90 nm |
Hash Function Name | Impl. Details | Size | Throughput | Clock Frequency |
---|---|---|---|---|
BLAKE-32 | Compression function with 8 G function units and I/O registers | 53 kGates | 4475 Mbit/s(*) | 96.15 MHz |
Blue Midnight Wish-256 | Compression function with f0, f1, and f2 unrolled in sequence and I/O registers | 164 kGates | 26665 Mbit/s(*) | 52.08 MHz |
Luffa-256 | Compression function (1 cycle latency) and I/O registers | 122 kGates | 25702 Mbit/s(*) | 100.4 MHz |
Shabal-256 | Compression function with I/O registers (latency of 16 clock cycles) | 20 kGates | 4408 Mbit/s(*) | 413.22 MHz |
Skein-256-256 | All 72 Threefish rounds unrolled | 369 kGates | 3126 Mbit/s(*) | 12.21 MHz |
(*) Estimated peak throughput for the minimal delay of compression function: 1000 * (Input Size in bits) / [(Compression Function Delay in ns) * (Number of Cycles)] = Throughput in Mbit/s.
4.2 Blake, CubeHash, ECHO, Grøstl, Hamsi, Luffa, Shabal, Skein
Reference | HDL | Category | Impl. Scope | Technology |
---|---|---|---|---|
Kobayashi et al. [3] | RCIS webpage | High-speed FPGA | Fully autonomous | Xilinx Virtex 5 |
Hash Function Name | Impl. Details | Size | Throughput | Clock Frequency |
---|---|---|---|---|
BLAKE-32 | | 1660 slices | 2676 Mbit/s | 115 MHz |
CubeHash16/32-256 | | 590 slices | 2960 Mbit/s | 185 MHz |
ECHO-256 | | 3556 slices | 1614 Mbit/s | 104 MHz |
Grøstl-256 | | 4057 slices | 5171 Mbit/s | 101 MHz |
Hamsi-256 | | 718 slices | 1680 Mbit/s | 210 MHz |
Luffa-256 | | 1048 slices | 6343 Mbit/s | 223 MHz |
Shabal-256 | | 1251 slices | 1739 Mbit/s | 214 MHz |
Skein-256 | | 854 slices | 1482 Mbit/s | 115 MHz |
4.3 CubeHash, Grøstl, Shabal
Reference | HDL | Category | Impl. Scope | Technology |
---|---|---|---|---|
Baldwin et al. [4] | N/A | High-speed FPGA | Core functionality | Xilinx Spartan 3 |
Hash Function Name | Impl. Details | Size | Throughput | Clock Frequency |
---|---|---|---|---|
CubeHash8/1-256(*) | 2 compression functions unrolled | 3268 slices | 70 Mbit/s | 37.9 MHz |
Grøstl-224/256 | P & Q permutation in parallel, S-box in BRAM | 4827 slices | 3660 Mbit/s | 71.53 MHz |
Grøstl-384/512 | P & Q permutation parallel, S-box in LUTs | 17452 slices | 3180 Mbit/s | 79.61 MHz |
Shabal | 36 adders in permutation | 2223 slices | 740 Mbit/s | 71.48 MHz |
(*) CubeHash16/32-h implemented in a similar fashion can be expected to have throughput increased by a factor of about 16.
Reference | HDL | Category | Impl. Scope | Technology |
---|---|---|---|---|
Baldwin et al. [4] | N/A | High-speed FPGA | Core functionality | Xilinx Virtex 5 |
Hash Function Name | Impl. Details | Size | Throughput | Clock Frequency |
---|---|---|---|---|
CubeHash8/1-256(*) | 1 iterated compression function | 1178 slices | 160 Mbit/s | 166.8 MHz |
Grøstl-224/256 | P & Q permutation in parallel, S-box in BRAM | 4516 slices | 7310 Mbit/s | 142.87 MHz |
Grøstl-384/512 | P & Q permutation parallel, S-box in LUTs | 19161 slices | 6090 Mbit/s | 83.33 MHz |
Shabal | 36 adders in permutation | 2768 slices | 1450 Mbit/s | 138.87 MHz |
(*) CubeHash16/32-h implemented in a similar fashion can be expected to have throughput increased by a factor of about 16.
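The factor of about 16 in the footnote follows from the CubeHash parameters: CubeHashr/b-h applies r rounds for every b-byte message block, so the work per message byte is r/b rounds. A minimal sketch of that arithmetic, assuming the same hardware and clock frequency so that throughput scales inversely with rounds per byte (the function name is illustrative):

```python
from fractions import Fraction


def rounds_per_byte(r, b):
    """CubeHashr/b applies r rounds per b-byte message block."""
    return Fraction(r, b)


old = rounds_per_byte(8, 1)    # CubeHash8/1: 8 rounds per message byte
new = rounds_per_byte(16, 32)  # CubeHash16/32: 1/2 round per message byte

# Assumption: identical datapath and clock, so throughput is inversely
# proportional to the number of rounds spent per message byte.
speedup = old / new
print(speedup)  # 16
```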
4.4 All 14 Round-Two Candidates
An interactive graphical comparison of various area-performance tradeoffs of this study can be found here.
Reference | HDL | Category | Impl. Scope | Technology |
---|---|---|---|---|
Tillich et al. [14] | On request | High-speed ASIC | Fully autonomous | UMC 0.18 µm |
Hash Function Name | Impl. Details | Size | Throughput | Clock Frequency |
---|---|---|---|---|
BLAKE-32 | Compression function with 4 G function units with CSAs | 45.64 kGates | 3971 Mbit/s | 170.64 MHz |
Blue Midnight Wish-256 | Compression function with f0, f1, and f2 unrolled | 169.74 kGates | 5358 Mbit/s | 10.46 MHz |
CubeHash16/32-h | Dynamically reconfigurable r and b parameters, two rounds unrolled | 58.87 kGates | 4665 Mbit/s | 145.77 MHz |
ECHO-256 | Four parallel AES rounds, 16 AES MixColumns 32-bit column multipliers | 141.49 kGates | 2246 Mbit/s | 141.84 MHz |
Fugue-256 | Four columns of SMIX transformation in parallel | 46.26 kGates | 4092 Mbit/s | 255.75 MHz |
Grøstl-256 | One shared permutation for P & Q, one pipeline stage | 58.40 kGates | 6290 Mbit/s | 270.27 MHz |
Hamsi-256 | Three instances of P/Pf function unrolled | 58.66 kGates | 5565 Mbit/s | 173.91 MHz |
JH-256 | 320 S-boxes, one round of R8 per cycle | 58.83 kGates | 4991 Mbit/s | 380.22 MHz |
Keccak(-256) | One instance of Keccak-f round | 56.32 kGates | 21229 Mbit/s | 487.80 MHz |
Luffa-224/256 | Three permutation blocks in parallel (64 S-boxes, 4 MixWord blocks each) | 44.97 kGates | 13741 Mbit/s | 483.09 MHz |
Shabal-256 | One word rotation per cycle, 50 cycles per block | 54.19 kGates | 3282 Mbit/s | 320.51 MHz |
SHAvite-3-256 | Four AES rounds (two for compression, two for message expansion) | 57.39 kGates | 3152 Mbit/s | 227.79 MHz |
SIMD-256(*) | Two FFT-64 with two FFT-8 and 16 multipliers (8x8 bit) each | 104.17 kGates | 924 Mbit/s | 64.93 MHz |
Skein-256-256 | 8 Threefish rounds unrolled | 58.61 kGates | 1882 Mbit/s | 73.52 MHz |
Skein-512-512 | 8 Threefish rounds unrolled | 102.04 kGates | 2502 Mbit/s | 48.87 MHz |
(*) Implementation of round-one variant.
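For a design that absorbs one message block every fixed number of clock cycles, throughput follows from block size, clock frequency, and cycles per block. The sketch below checks this relation against the Shabal-256 row in the table above, which states 50 cycles per block. The 512-bit block size is an assumption taken from the Shabal specification, not from the table, and the function name is illustrative.

```python
def throughput_mbps(block_bits, clock_mhz, cycles_per_block):
    """Throughput in Mbit/s for one block every `cycles_per_block` cycles."""
    return block_bits * clock_mhz / cycles_per_block


# Shabal-256 row: 320.51 MHz and 50 cycles per block (both from the table).
# The 512-bit message block is an assumption from the Shabal specification.
shabal = throughput_mbps(512, 320.51, 50)
print(round(shabal))  # 3282, matching the reported 3282 Mbit/s
```

The same relation can be applied to other rows only when the cycle count per block is known, and the table does not state it for every design.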
4.5 BLAKE, Grøstl, Skein
Reference | HDL | Category | Impl. Scope | Technology |
---|---|---|---|---|
Tillich et al. [18] | N/A | Low-area ASIC | Fully autonomous | AMS 0.35 µm |
Hash Function Name | Impl. Details | Size | Throughput | Clock Frequency |
---|---|---|---|---|
BLAKE-32 | One G function in 11 cycles | 25.57 kGates | 15.4 Mbit/s | 31.25 MHz |
Grøstl-224/256 | 64-bit datapath, P & Q permutation shared | 14.62 kGates | 145.9 Mbit/s | 55.87 MHz |
Skein-256-256 | 64-bit datapath | 12.89 kGates | 19.8 Mbit/s | 80 MHz |
5 References
[1] Jean-Philippe Aumasson, Luca Henzen, Willi Meier, and Raphael C.-W. Phan. SHA-3 proposal BLAKE (version 1.3). Available online at http://131002.net/blake/blake.pdf.
[2] A. H. Namin and M. A. Hasan. Hardware Implementation of the Compression Function for Selected SHA-3 Candidates. Available online at http://www.vlsi.uwaterloo.ca/~ahasan/hasan_report.html.
[3] Kazuyuki Kobayashi, Jun Ikegami, Shin'ichiro Matsuo, Kazuo Sakiyama, and Kazuo Ohta. Evaluation of Hardware Performance for the SHA-3 Candidates Using SASEBO-GII. IACR Eprint report 2010/010. Available online at http://eprint.iacr.org/2010/010.pdf.
[4] Brian Baldwin, Andrew Byrne, Mark Hamilton, Neil Hanley, Robert P. McEvoy, Weibo Pan, and William P. Marnane. FPGA Implementations of SHA-3 Candidates: CubeHash, Grøstl, LANE, Shabal and Spectral Hash. IACR Eprint report 2009/342. Available online at http://eprint.iacr.org/2009/342.pdf.
[5] Liang Lu, Máire O'Neill, and Earl Swartzlander. Hardware Evaluation of SHA-3 Hash Function Candidate ECHO. Presentation at the Claude Shannon Institute Workshop on Coding and Cryptography 2009. Slides available online at http://www.ucc.ie/en/crypto/CodingandCryptographyWorkshop/TheClaudeShannonWorkshoponCodingCryptography2009/DocumentFile,75649,en.pdf.
[6] Bernhard Jungk, Steffen Reith, and Jürgen Apfelbeck. On Optimized FPGA Implementations of the SHA-3 Candidate Grøstl. IACR Eprint report 2009/206. Available online at http://eprint.iacr.org/2009/206.pdf.
[7] Praveen Gauravaram, Lars R. Knudsen, Krystian Matusiewicz, Florian Mendel, Christian Rechberger, Martin Schläffer, and Søren S. Thomsen. Grøstl - a SHA-3 candidate (October 31, 2008). Available online at http://www.groestl.info/Groestl.pdf.
[8] Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche. KECCAK sponge function family main document (Version 1.2, April 23, 2009). Available online at http://keccak.noekeon.org/Keccak-main-1.2.pdf.
[9] Joachim Strömbergson. Implementation of the Keccak Hash Function in FPGA Devices. Available online at http://www.strombergson.com/files/Keccak_in_FPGAs.pdf.
[10] Romain Feron and Julien Francq. FPGA Implementation of Shabal: Our First Results (Version 2.0, February 19, 2010). Available online at http://www.shabal.com/wp-content/uploads/2010/03/FPGA-Implementation-of-Shabal-First-ResultsV2.0.pdf.
[11] Men Long. Implementing Skein Hash Function on Xilinx Virtex-5 FPGA Platform (Version 0.7, February 2, 2009). Available online at http://www.skein-hash.info/sites/default/files/skein_fpga.pdf.
[12] Stefan Tillich. Hardware Implementation of the SHA-3 Candidate Skein. IACR Eprint report 2009/159. Available online at http://eprint.iacr.org/2009/159.pdf.
[13] Jean-Luc Beuchat, Eiji Okamoto, and Teppei Yamazaki. Compact Implementations of BLAKE-32 and BLAKE-64 on FPGA. IACR Eprint report 2010/173. Available online at http://eprint.iacr.org/2010/173.pdf.
[14] Stefan Tillich, Martin Feldhofer, Mario Kirschbaum, Thomas Plos, Jörn-Marc Schmidt, and Alexander Szekely. High-Speed Hardware Implementations of BLAKE, Blue Midnight Wish, CubeHash, ECHO, Fugue, Grøstl, Hamsi, JH, Keccak, Luffa, Shabal, SHAvite-3, SIMD, and Skein. IACR Eprint report 2009/510. Available online at http://eprint.iacr.org/2009/510.pdf.
[15] Shai Halevi, William E. Hall, and Charanjit S. Jutla. The Hash Function Fugue (October 30, 2008). Available online at http://domino.research.ibm.com/comm/research_projects.nsf/pages/fugue.index.html/$FILE/NIST-submission-Oct08-fugue.pdf.
[16] Junfeng Fan. Hardware Evaluation of The Hash Function Hamsi. Available online at http://homes.esat.kuleuven.be/~okucuk/hamsi/implementations.html.
[17] Miroslav Knezevic and Ingrid Verbauwhede. Hardware Evaluation of the Luffa Hash Family. 4th Workshop on Embedded Systems Security 2009. Available online at http://www.cosic.esat.kuleuven.be/publications/article-1282.pdf.
[18] Stefan Tillich, Martin Feldhofer, Wolfgang Issovits, Thomas Kern, Hermann Kureck, Michael Mühlberghuber, Georg Neubauer, Andreas Reiter, Armin Köfler, and Mathias Mayrhofer. Compact Hardware Implementations of the SHA-3 Candidates ARIRANG, BLAKE, Grøstl, and Skein. IACR Eprint report 2009/349. Available online at http://eprint.iacr.org/2009/349.pdf.
[19] Grøstl website. http://www.groestl.info/.
[20] Markus Bernet, Luca Henzen, Hubert Kaeslin, Norbert Felber, and Wolfgang Fichtner. Hardware Implementations of the SHA-3 Candidates Shabal and CubeHash. 52nd IEEE International Midwest Symposium on Circuits and Systems, 2009. Available online at http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5236043.
[21] Michel Kinsy and Richard Uhler. SHA-3: FPGA Implementation of ESSENCE and ECHO Hash Algorithm Candidates Using Bluespec. Available online at http://csg.csail.mit.edu/6.375/6_375_2009_www/projects/group1_report.pdf.
[22] Bernhard Jungk and Steffen Reith. On FPGA-based implementations of Grøstl. IACR Eprint report 2010/260. Available online at http://eprint.iacr.org/2010/260.pdf.
[23] Jérémie Detrey, Pierre Gaudry, and Karim Khalfallah. A Low-Area yet Performant FPGA Implementation of Shabal. IACR Eprint report 2010/292. Available online at http://eprint.iacr.org/2010/292.pdf.
[24] Jean-Luc Beuchat, Eiji Okamoto, and Teppei Yamazaki. IACR Eprint report 2010/364. Available online at http://eprint.iacr.org/2010/364.pdf.