Quick Overview of the Halo2 GPU Acceleration Solution

SuperScalar, a pioneering company in the realm of accelerated verifiable computation, offers a comprehensive range of products encompassing GPU, FPGA, and ASIC solutions. This article delves into the GPU acceleration solution that SuperScalar has not only been diligently developing but also successfully deployed in Zero-Knowledge Proof (ZKP) projects, including Taiko.

Background

Currently, many leading Zero-Knowledge Proof (ZKP) projects are widely adopting the Halo2 algorithm, such as PSE, Taiko, Scroll, and others. These projects typically employ the CPU version for proof generation. However, due to the substantial hardware resources required for handling large-scale circuits, generating ZKP proofs using CPU can be time-consuming, ranging from seconds to several minutes, or even longer. Hence, there is an urgent need to explore an acceleration solution.

GPU are known to excel at parallel computing.Algorithms such as Multi-Scalar Multiplication (MSM) and Number Theoretic Transform (NTT) in the Halo2 algorithm are time-consuming and involve large-scale parallel computation, making use of GPU ideal.From a technical perspective, GPU offer comprehensive development tools, including programming languages like CUDA, which contribute to shorter development cycles.From a commercialization perspective, GPU are easy to access, have a strong supply chain and a mature ecosystem.Therefore, utilizing GPU acceleration is seen as the ideal direction for ZKP acceleration, and GPU hold immense potential in the field of ZKP acceleration.

Some projects and teams have already started utilizing GPU to accelerate ZKP such as ALEO, as well as some more professional competitions, such as ZPrize. With the aid of GPU acceleration, it is expected to promote the further development and wide application of ZKP, providing more potential possibilities in areas such as privacy protection, secure encryption and blockchain.

Motivation to accelerate Halo2 with GPU

– The CPU version of Halo2 proof generation is very slow

Currently, the CPU version of the Halo2 algorithm requires a significant amount of time for proof generation, resulting in a relatively slow proof generation speed, which may take several minutes or even longer.

– MSM, NTT and Evaluation calculations take a high proportion of time

During the generation of Halo2 proof,Calculations such as MSM and NTT account for a considerable proportion of the total time required.The specific distribution of time may vary depending on the circuit, with some circuits exhibiting a particularly high proportion of time allocated to Evaluation.By optimizing calculations such as MSM, NTT, and Evaluation, it is possible to substantially reduce the time required for proof generation.

– To Integrate Hardware Acceleration into the Halo2 Framework

In order to take full advantage of the computing power of GPU,we need to integrate hardware acceleration into the Halo2 framework. This requires a unified design of module management and memory management to take into account the overall computing characteristics of Halo2.In this way, you can speed up proof generation and improve overall performance.

Solution Overview

Architecture

The GPU framework includes an internal GPU manager module, which comprises Rust code for modules like MSM, NTT, and others.The primary responsibility of this module is to provide external interfaces while internally being capable of invoking CUDA code and performing certain management operations.This module interfaces with CUDA code through GPU FFI (Foreign Function Interface).In the context of the Halo2 algorithm, the most fundamental calculations involve finite field computations, whereas modules like MSM, NTT, and Evaluation are built upon these finite field calculations.

Algorithm optimization

– In finite field calculations, we optimize the computation process using Montgomery modular multiplication and implement acceleration through assembly language.

– For MSM module, we employ the Pippenger algorithm for grouping to reduce computation overhead, enhance parallelism, and utilize point addition operations with lower computational complexity. Additionally, we incorporate excellent algorithmic optimizations from the industry, such as those observed in ZPrize.

– In the NTT module, we utilize the Fast Fourier Transform (FFT) algorithm.

– Field calculations encompass various computations based on finite fields, including batch inversion.

Data Transfer

Precomputation: For specific circuits, the base points for MSM and the omegas for FFT are often fixed. These data can be transmitted to the GPU memory during GPU initialization, eliminating the need to transmit these bases every time MSM and FFT calculations are performed. This approach helps save on additional data transfer overhead.

Reduce Redundant Data Transfers: Through combined computations, such as performing FFT followed immediately by MSM calculations, the result of FFT can be used directly as input for MSM. This avoids the necessity of transferring data from GPU to CPU and back to GPU, reducing the complexity and cost associated with data transfer.

Extensible compatibility

By configuring the curve, we can achieve compatibility with bn264 and other different curves. This means that our system can flexibly adapt to various curve parameter configurations to meet diverse needs and application scenarios.

Benchmark

Configuration:

8-core AMD Ryzen 7 7800X3D 8-Core Processor

64GB RAM

NVIDIA RTX 3090 GPU with 24GB of VRAM

MSM: