

GPUs have emerged as a powerful tool for accelerating general-purpose applications. We described performance bottlenecks on the FPGA. To make FPGAs more competitive in raw performance with high-performance CPU and GPU platforms, it is important to increase external memory bandwidth, minimize data movement between a host and a device, and reduce OpenCL runtime overhead on an FPGA.

When DRAM memory bandwidth is comparable on the three computing platforms, the FPGA can achieve the highest kernel performance for large workloads.
Population count is a primitive used in many applications. Commodity processors have dedicated instructions for achieving high-performance population count. Motivated by the productivity of high-level synthesis and the importance of population count, in this paper we investigated OpenCL implementations of population count algorithms and evaluated their performance and resource utilization on an FPGA. Based on the results, we selected the most efficient implementation. We then derived a reduction pattern from a representative application of population count. We parallelized the reduction with atomic functions and optimized it with vectorized memory accesses, tree reduction, and compute-unit duplication. We evaluated the performance of the reduction kernel on an Intel(R) Xeon(R) CPU, an Intel(R) Iris(TM) Pro integrated GPU, and an FPGA card that features an Intel(R) Arria(R) 10 FPGA.
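As a rough illustration of the population count and reduction pattern named in this abstract, the following is a minimal sketch in plain C. It is not the OpenCL kernel code evaluated in the paper. The function names (`popcount32`, `tree_reduce`, `popcount_array`) and the `LANES` width are hypothetical, chosen only for this example. The bit-parallel counting method shown is one well-known software algorithm and is not necessarily the variant the authors selected. Atomic functions and compute-unit duplication are not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical number of partial sums, standing in for a vector width. */
#define LANES 8

/* Bit-parallel population count of one 32-bit word (a common software
 * method; an assumption, not necessarily the paper's chosen algorithm). */
static uint32_t popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* sums of 2-bit fields */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* sums of 4-bit fields */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* sums of 8-bit fields */
    return (x * 0x01010101u) >> 24;                   /* add the four bytes   */
}

/* Pairwise (tree) sum performed in place: each pass folds the upper part
 * of the buffer onto the lower part until one value remains.
 * The buffer contents are overwritten. */
static uint64_t tree_reduce(uint64_t *buf, size_t n)
{
    if (n == 0)
        return 0;
    assert(buf != NULL);
    while (n > 1) {
        size_t half = (n + 1) / 2;   /* element count of the next level */
        for (size_t i = 0; i + half < n; ++i)
            buf[i] += buf[i + half];
        n = half;
    }
    return buf[0];
}

/* Total number of set bits in an array: accumulate per-lane partial
 * counts, then combine the lanes with the tree reduction above. */
static uint64_t popcount_array(const uint32_t *data, size_t n)
{
    uint64_t partial[LANES] = {0};
    for (size_t i = 0; i < n; ++i)
        partial[i % LANES] += popcount32(data[i]);
    return tree_reduce(partial, LANES);
}
```

Keeping `LANES` independent partial sums loosely mirrors how vectorized accesses expose several accumulations per step, with the tree reduction combining them in a logarithmic number of passes. On a real device the lane count and the combining strategy would depend on the hardware and toolchain.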
