Abstract:
Convolutional Neural Networks (CNNs) are known for their high performance and are widely applied in many computer vision tasks, including object detection and image classification. Large-scale CNNs are computationally intensive and demand high-performance computing resources. This issue has commonly been addressed by deploying them on Graphics Processing Units (GPUs). However, GPU-based implementations consume significant power, limiting their use in embedded systems. In this context, FPGA-based implementations offer a reasonable alternative with tolerable power consumption, but the substantial memory and computation requirements of large-scale CNNs make them challenging to deploy on resource-limited edge devices. This work addresses the challenge of deploying CNNs on resource-restricted embedded systems by proposing a memory-efficient parallel architecture (MPA). The goal is to exploit the inherent parallelism of CNNs while minimizing memory and power consumption. The MPA approach is a software/hardware co-design. At the software level, a low-memory network compression framework removes redundant parameters to reduce the model size and categorizes layers into No-Pruning (NP) and Pruning (P) layers. Weight quantization is then applied to each category of layers to compress the model further by reducing the bit width of the weight parameters. Depending on the weight distribution, NP layers undergo Optimized Quantization (OQ), whereas P layers are subjected to Incremental Quantization (INQ). We propose the OQ algorithm for NP layers, which performs quantization using optimal quantization levels (Q-levels) obtained from the optimizer. Using the proposed compression framework, compression ratios of 11x, 5x, 8.5x, and 7.5x are achieved for LeNet-5, VGG-16, AlexNet, and ResNet-50, respectively, yielding a significant reduction in memory utilization with negligible accuracy loss. To further minimize resource utilization, MPA provides a parallel hardware architecture for the convolution (CONV) layers at the hardware level, and the compressed model obtained in the previous step is mapped onto the target hardware, i.e., an FPGA, using this architecture. In the parallel architecture, multiple 1D processing elements (PEs) are connected in parallel to form a 2D-PE, achieving both data-level and computation-level parallelism. Each 2D-PE executes the convolution operation of a CONV layer by performing the multiply-accumulate (MAC) operations of the convolution in parallel, which further reduces the computational cost of the system. Within each 1D-PE, additional computation-level parallelism is achieved by connecting multiple registers, multipliers, adders, and multiplexers in a systolic-array manner. For P layers, however, the multipliers are replaced with barrel shifters to further reduce the system's resource utilization. The design obtained after applying both software- and hardware-level optimizations is evaluated in terms of resource utilization, measured as the number of slice registers, LUTs, DSPs, and flip-flops. The barrel-shifter-based PE consumes roughly half the resources of the multiplier-based PE, resulting in a prominent decrease in overall resource utilization. Consequently, the proposed system is a practical solution for deployment on embedded systems with constrained hardware resources.
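For illustration, the sketch below shows INQ-style power-of-two quantization of the kind applied to P layers. It is an assumed, minimal example rather than the authors' implementation: the function name, the exponent-range rule, and the bit-width handling are hypothetical choices used only to make the idea concrete.

```python
import numpy as np

def quantize_power_of_two(weights, bit_width=5):
    """Quantize weights to signed powers of two (INQ-style); a minimal sketch.

    Each nonzero weight w is mapped to sign(w) * 2**k, where k is the weight's
    exponent rounded in the log domain and clipped to an allowed range.
    The exponent range below is a hypothetical choice derived from bit_width.
    """
    w = np.asarray(weights, dtype=np.float64)
    # Largest exponent set by the largest magnitude weight; smallest exponent
    # limited by the bit budget (one code kept for zero), both assumptions.
    n1 = int(np.floor(np.log2(np.max(np.abs(w)) + 1e-12)))
    n2 = n1 - (2 ** (bit_width - 1) - 1)
    exps = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), n2, n1)
    q = np.sign(w) * np.power(2.0, exps)
    q[np.abs(w) < 2.0 ** n2] = 0.0  # values below the smallest level become zero
    return q
```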
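A complementary sketch illustrates why power-of-two weights let the P-layer PEs use barrel shifters instead of multipliers: multiplying an integer activation by ±2^k reduces to a left or right shift. The function name, fixed-point format, and example values are illustrative assumptions, not the paper's hardware design.

```python
def shift_mac(acc, activation, exponent, negative=False):
    """One MAC step with a power-of-two weight: acc += (+/-) activation * 2**exponent.

    For an integer activation, the multiplication is a left shift when
    exponent >= 0 and a right shift otherwise, which is exactly the
    operation a barrel shifter performs in place of a full multiplier.
    """
    shifted = activation << exponent if exponent >= 0 else activation >> -exponent
    return acc - shifted if negative else acc + shifted

# Illustrative usage: weight = -2**-2 = -0.25 applied to an activation of 96
# (fixed-point scaling omitted for brevity); result is -24, i.e. 96 * -0.25.
acc = shift_mac(0, 96, -2, negative=True)
```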