Projects:2017s1-122 On-Chip Learning
Dr Braden Phillips, George Stamatescu
Xiaoyang Dai, Tao Zeng
This project explores efficient methods for implementing a digit classification neural network on a chip, and evaluates several approaches. Within the scope of this study, a group of available weight quantisation methods is compared and evaluated comprehensively, and power-of-two and normalised power-of-two quantisation are identified as the two most promising approaches. In addition, three network optimisation strategies (neuron replacement, alternative activation functions and matrix multiplication optimisation) are proposed and tested in this thesis to make the target neural network more hardware friendly and to simplify the hardware implementation. The present results are essential components of the proposed outcomes and indicate the future steps of this study.
Deep learning neural networks are becoming popular and are widely researched across a large range of areas because of their strong performance, adaptability and commercial prospects. However, because deep neural networks place strict requirements on hardware platforms and energy supply, their application in areas with restricted hardware resources and energy budgets, such as the Internet of Things, has so far been limited. Consequently, efficient and commercially viable implementations of neural networks are in high demand. In this thesis, Field-Programmable Gate Array (FPGA) based implementations of a digit classification deep neural network are researched, focusing on the necessary weight quantisation strategies and the network optimisations for FPGA.
Although neural networks with deep learning algorithms have been widely applied, their applications have several limitations. Firstly, the training process of a neural network is computationally intensive. Practical neural networks with complicated structures are generally trained on Graphics Processing Units (GPUs) and Central Processing Units (CPUs), because training involves a great deal of computation and processors with a large number of computing cores are the best option. Moreover, trained neural networks still carry out their prediction process on personal computers, laptops and smartphones, even though the computing capabilities of these devices go far beyond what prediction requires. This wastes a significant amount of hardware resources and energy.
These limitations provide part of the motivation for this project, which investigates an innovative way to implement neural networks: realising them on FPGAs. FPGAs have the potential to overcome the limitations mentioned above. The first reason is that FPGA circuits cover a wide range of power levels, so low-power FPGA circuits can be allocated to simple neural networks while more powerful FPGA circuits handle complex workloads. Secondly, FPGA circuits can be designed for specific neural networks according to detailed performance requirements, which contributes to efficient and effective use of hardware and energy resources. Another motivation is that on-chip deep learning is a new and active research area with an enormous potential market. Google has released its own neural network chip, the Tensor Processing Unit (TPU); TPUs are assembled into clusters to provide a cloud neural network computing service. In addition, a startup company called DeePhi is using FPGAs to build a general neural network platform for individual developers, and has received investment from two investment companies, GSR Ventures and Banyan Capital. These commercial cases demonstrate that on-chip neural network implementation is competitive and that its investment value and market potential are recognised. The business value and the considerable market therefore also motivate the project.
Following these motivations, this study aims to achieve an FPGA based implementation of a digit classification neural network. A trained digit classification neural network is able to classify images of handwritten digits.
The target neural network is a fully connected neural network with 30 hidden neurons. It was chosen for the following reasons. Firstly, this basic neural network shares its fundamental principles and theory with neural networks of more complex structure, so results obtained on the target network can be generalised to the on-chip implementations of other neural networks. In addition, the selected digit classification neural network takes a reasonable time to train and test. With about 50 training epochs and an appropriate learning rate, training normally finishes within half an hour on a laptop. Compared with complex neural networks, which usually take several hours to train, the simple digit classification network accelerates this research.
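As a rough illustration of the prediction process such a network performs, the following Python sketch runs one forward pass through a fully connected network with 30 hidden neurons. The 784-input and 10-output layer sizes, the random weights and the function names are assumptions made for this example; the sigmoid activation is the one this page refers to later, and the project's trained parameters are not reproduced here.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, weights, biases):
    """Propagate one input vector through every layer and return the
    index of the most active output neuron (the predicted digit)."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return int(np.argmax(a))

# Assumed layer sizes: 28x28 pixel input, 30 hidden neurons, 10 digits.
sizes = [784, 30, 10]
rng = np.random.default_rng(0)
# Random placeholders standing in for trained weights and biases.
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes, sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]

digit = predict(rng.random(784), weights, biases)
```

Prediction needs only these matrix-vector products and activations, which is the regular, concurrent workload that the FPGA implementation targets.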
FPGA based implementations of neural networks can be accomplished in two ways: training neural networks on an FPGA, or implementing trained neural networks on an FPGA. This study considers the FPGA based implementation of a trained neural network, because this approach makes use of the advantages of each available technology and platform. Compared with FPGA devices, CPUs and GPUs have advantages in training neural networks because of their outstanding computing ability. Meanwhile, FPGAs can give very high performance for problems of the right form, e.g. those where you can take advantage of concurrency and where there is regularity in the processing patterns, which fits the characteristics of the prediction process of neural networks. As a result, FPGA is an appropriate and competitive solution for the hardware implementation of neural networks.
Two main issues are involved in the on-chip implementation of the digit classification neural network, namely network optimisation and FPGA architecture design. Network optimisation focuses on making the changes to the original neural network that are needed to meet the requirements of an FPGA based implementation and to improve its performance. FPGA architecture design is the practical implementation of the target neural network, which involves a large amount of hardware programming work.
So far, two kinds of measures have been researched, tested and evaluated: weight reformation and the optimisation of network structure. Both are essential to the on-chip implementation of neural networks: weight reformation reduces the storage requirements of a neural network, and a network with an optimised structure is more feasible to implement on an FPGA than the original network. In this section, four candidate weight reformation approaches are analysed and the most satisfactory one is selected. In addition, three network optimisation methods (neuron replacement, matrix multiplication reformation and alternative activation functions) are presented; these methods are going to be adopted in the implementations.
Among the quantisation methods considered, binary quantisation minimises the bit length of the weights: each weight shrinks from 32 bits to 1 bit, a 32-fold reduction to the minimum possible bit length. With such quantisation, deep neural networks with a large number of hidden neurons and hidden layers can be implemented on an FPGA, which is one of the proposed outcomes of the project. The research of Mohammad Rastegari, Vicente Ordonez, Joseph Redmon and Ali Farhadi provides a usable quantisation method. They showed that binary weights can be estimated by filtering the float weights with a binary filter, which generates a binary weight matrix from the signs of the float weights, together with a scale factor, which is the average of the absolute values of the float weights.
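The binary estimate described above (a sign matrix multiplied by the mean absolute weight) can be sketched in a few lines of Python. This is a minimal illustration rather than the project's code: the function name `binarise` and the choice to map a weight of exactly zero to +1 are assumptions made for this example.

```python
import numpy as np

def binarise(weights):
    """Approximate float weights as alpha * B, where B holds only -1 and +1."""
    w = np.asarray(weights, dtype=np.float64)
    # Binary filter: keep only the sign of each weight
    # (assumption: a weight of exactly 0 is mapped to +1).
    b = np.where(w >= 0, 1.0, -1.0)
    # Scale factor: average of the absolute float weight values.
    alpha = float(np.mean(np.abs(w)))
    return alpha, b

w = np.array([[0.5, -0.25],
              [-1.0, 0.25]])
alpha, b = binarise(w)   # alpha is 0.5 for this example matrix
w_hat = alpha * b        # binary estimate of w
```

Only the 1-bit signs in `b` need to be stored per weight; the single scale factor `alpha` is shared by the whole matrix.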
According to the research of Fengfu Li, Bo Zhang and Bin Liu, a large proportion of weights are scattered around the origin, which suggests a possible way to enhance the performance of binary networks. Binary weights take only the two values -1 and 1. If a third value, 0, is added to the set of allowed weights, giving ternary weights, the new weights may represent the original weights better and improve the prediction accuracy of the network. In a ternary weight neural network, each weight requires 2 bits of storage. This gives a compression rate of up to 16 times compared with a 32-bit float weight network. Although this is lower than the compression rate of binary weights, it is sufficient for implementing a neural network on an FPGA.
As with the binary method described above, ternary methods are founded on minimising the Euclidean distance between the original and the quantised weights. Li, Zhang and Liu proposed a usable ternary approach in their study.
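A minimal Python sketch of a threshold-based ternary quantiser is given below for illustration. The threshold factor of 0.7 times the mean absolute weight follows the approximation usually quoted for ternary weight networks and should be treated as an assumption here, since this page does not state the value used in the project; the function and parameter names are likewise hypothetical.

```python
import numpy as np

def ternarise(weights, threshold_ratio=0.7):
    """Approximate float weights as alpha * T, where T holds -1, 0 and +1."""
    w = np.asarray(weights, dtype=np.float64)
    # Weights whose magnitude is below the threshold are set to zero.
    delta = threshold_ratio * float(np.mean(np.abs(w)))
    mask = np.abs(w) > delta
    t = np.where(mask, np.sign(w), 0.0)
    # Scale factor: mean magnitude of the weights that survive the threshold.
    alpha = float(np.mean(np.abs(w[mask]))) if mask.any() else 0.0
    return alpha, t, delta

w = np.array([0.9, -0.05, 0.1, -1.1])
alpha, t, delta = ternarise(w)
w_hat = alpha * t   # ternary estimate of w
```

In this example the two small weights near the origin become 0, while the two large weights keep their sign and share one scale factor.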
Moreover, a quantisation approach with more distinct weight values is expected to represent the original weights better and to achieve higher prediction accuracy.
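Power-of-two quantisation, which this page evaluates later at 3 bits per weight, is one such approach. The sketch below is a hedged illustration only: it assumes one sign bit plus a 2-bit exponent (exponents -3 to 0), rounds in the log2 domain and has no zero level, none of which is confirmed by this page. The normalised variant is not covered.

```python
import numpy as np

def quantise_power_of_two(weights, min_exp=-3, max_exp=0):
    """Round each weight to sign * 2**k with k clipped to [min_exp, max_exp]."""
    w = np.asarray(weights, dtype=np.float64)
    sign = np.where(w >= 0, 1.0, -1.0)
    # Guard against log2(0): tiny weights fall to the smallest magnitude.
    magnitude = np.maximum(np.abs(w), 2.0 ** (min_exp - 1))
    # Nearest exponent in the log2 domain, limited to the assumed 2-bit range.
    exponent = np.clip(np.rint(np.log2(magnitude)), min_exp, max_exp).astype(int)
    return sign * np.exp2(exponent), exponent

w = np.array([0.3, -0.6, 0.05, 1.7])
q, k = quantise_power_of_two(w)
# q is [0.25, -0.5, 0.125, 1.0] and k is [-2, -1, -3, 0]
```

With four exponents and two signs this gives eight distinct weight values, compared with two for binary and three for ternary weights.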
To test the listed quantisation methods, they were compared comprehensively in Python in terms of weight distribution, prediction accuracy and storage cost. These three factors are essential for the on-chip implementation of neural networks. Trained float weights have been shown to follow an approximately Gaussian distribution, so the difference between a Gaussian distribution and the distribution of the quantised weights makes it possible to predict the performance of a quantised network and to make a preliminary evaluation. Prediction accuracy and storage cost are the direct evaluation criteria: quantised networks with higher prediction accuracy and lower storage cost perform better and are more suitable for implementation on an FPGA.
In the comparison of quantised weight distributions, power-of-two and normalised power-of-two weights clearly represent the original distribution best among the four quantised weight forms, as shown by the Gaussian-like shape of their distribution bars. The distribution of ternary weights takes second place with three bars, while the binary weights show a uniform distribution over their two available values, which is distinctly different from a Gaussian distribution. This project aims to establish the validity and performance of the quantisation approaches by analysing the distributions of the processed weights, and to determine whether performance is positively related to Gaussian similarity. Consequently, power-of-two and normalised power-of-two weights are expected to perform best, followed by the ternary network and then the binary network.
This expectation is initially confirmed by the gaps between the networks with different quantised weights. Specifically, the networks processed with power-of-two and normalised power-of-two quantisation reach a high prediction accuracy, approximately equal to that of the original network. In contrast, the binary and ternary networks predicted the handwritten digits with much lower accuracy. In addition, the prediction accuracy of the ternary and binary networks fluctuates dramatically over the training period while that of the other networks remains stable, which suggests that binary and ternary quantisation may distort the original weights.
The assumption about prediction accuracy is supported by the statistical data. The original, power-of-two and normalised power-of-two networks have nearly equal peak and average prediction accuracy (around 95%). The ternary and binary networks score much lower (less than 85%), and the binary network performed worst, with an average prediction accuracy of 53.92% and a peak of 64.77%, which is far below the results reported in relevant peer studies. This is likely a result of omitting the quantised training phases, which were left out of this project to save hardware resources. The data indicate that quantisation methods with more available weight values are likely to cause a smaller decline in prediction accuracy, and that the performance of a quantised network is positively related to the similarity of its weight distribution to a Gaussian distribution.
In addition, the variation in the correlation coefficient with the original neural network's prediction accuracy is worth examining. Because prediction accuracy is the main evaluation criterion for neural networks, it is reasonable to treat this coefficient as an index of the association between the processed weights and the original weights. As with prediction accuracy, the power-of-two and normalised power-of-two weight networks show a distinctly strong correlation with the original network, with coefficients above 0.9. Meanwhile, the binary and ternary networks show an extremely weak correlation, with numerically small values. This evidence supports the view that binary and ternary reformations are likely to distort the original network seriously, which is unacceptable for the on-chip implementation of neural networks.
Although the four processed neural networks differ greatly in prediction performance, they differ only slightly in theoretical storage cost. Among the four kinds of weights, binary weights need the least memory at 1 bit per weight, ternary weights need 2 bits per weight, and the two kinds of power-of-two weights need the most at 3 bits per weight. Even for the power-of-two and normalised power-of-two weights, the storage requirement is more than 10 times smaller than that of 32-bit float weights, which is sufficient for on-chip implementation and meets the needs of this project.
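The storage figures above can be checked with a short calculation. The bits per weight come from the text; the layer sizes (784 inputs, 30 hidden neurons, 10 outputs) are an assumption for this example, since only the 30 hidden neurons are stated on this page, and biases and shared scale factors are ignored.

```python
# Bits needed to store one weight under each representation (from the text).
BITS_PER_WEIGHT = {"float32": 32, "binary": 1, "ternary": 2, "power-of-two": 3}

def weight_count(layer_sizes):
    """Number of weights in a fully connected network (biases ignored)."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def storage_bits(layer_sizes):
    """Total weight storage in bits for every representation."""
    n = weight_count(layer_sizes)
    return {name: n * bits for name, bits in BITS_PER_WEIGHT.items()}

layers = [784, 30, 10]  # assumed layer sizes, see the note above
bits = storage_bits(layers)
compression = {name: bits["float32"] / b for name, b in bits.items()}
# compression: binary 32x, ternary 16x, power-of-two about 10.7x
```

The compression ratios depend only on the bits per weight, so they hold for any layer sizes: 32/3 is about 10.7, which matches the statement that power-of-two weights need more than 10 times less storage than float weights.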
Taking the previous discussions and evaluations together, it is reasonable to give high priority to power-of-two and normalised power-of-two quantisation in further tests and assessment, because of their slight effect on prediction accuracy and their adequate storage cost. The remaining two quantisation methods may not be adequate options. In addition, power-of-two and normalised power-of-two weights have the potential to simplify the neural network's mathematical operations and reduce its computational requirements on hardware.
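To illustrate why power-of-two weights simplify the arithmetic, the sketch below replaces each multiplication in a dot product with a bit shift. It assumes non-negative integer (fixed-point) activations and weights stored as a sign and an exponent; the function names and the example values are hypothetical and do not describe the project's hardware design.

```python
def shift_multiply(x_fixed, sign, exponent):
    """Compute x_fixed * sign * 2**exponent using only a shift and a negation."""
    if exponent >= 0:
        shifted = x_fixed << exponent
    else:
        shifted = x_fixed >> -exponent   # discards the low-order bits
    return shifted if sign >= 0 else -shifted

def shift_dot(xs_fixed, signs, exponents):
    """Dot product of fixed-point inputs with power-of-two weights."""
    return sum(shift_multiply(x, s, e)
               for x, s, e in zip(xs_fixed, signs, exponents))

# Weights 0.25, -0.5 and 1.0 written as (sign, exponent) pairs.
xs = [64, 128, 32]
signs = [1, -1, 1]
exponents = [-2, -1, 0]
result = shift_dot(xs, signs, exponents)   # 16 - 64 + 32 = -16
```

In hardware, each shift by a fixed amount is far cheaper than a general multiplier, which is the saving the paragraph above refers to.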
We have shown a range of techniques and evaluations used in the hardware implementation of a digit classification neural network. When implementing a neural network in hardware, it is feasible to optimise the network according to its statistical features, as the quantisation process in this thesis demonstrates. Instead of listing and evaluating a number of available quantisation approaches equally, the quantisation phase begins with an analysis of the neural network's weight distribution: by comparing the similarity of the original and quantised weight distributions, the performance of the considered quantisation methods was predicted. This prediction gave the evaluation of the quantisation methods a clear direction, which simplified this phase. Summarising the results, several valuable conclusions are obtained. Among the four candidate weight reformation methods, power-of-two and normalised power-of-two quantisation demonstrate better overall performance than binary and ternary quantisation, which gives these two approaches high priority in further research. Neural network hardware design is inseparable from weight quantisation, so weight quantisation must also be evaluated from the hardware perspective. For instance, with power-of-two and normalised power-of-two weights, the hardware matrix computing architecture can use shift operations, which consume much less energy, time and area than the multiplication operations used with 32-bit, binary and ternary weights. As for the optimisation of neural network structure, hidden neuron replacement is feasible in this case, with some risk to adaptability. In addition, the performance of three potential alternatives to the sigmoid activation function was evaluated preliminarily, and the assessment shows that they are worth testing and comparing in future.
Moreover, the analysis of the reformation of matrix multiplication substantiates that the four listed quantisation methods not only reduce the storage cost of weights but also greatly simplify matrix multiplication, which saves hardware resources.
The four versions of the hardware implementation represent different hardware structure design strategies. For neural networks with a large number of neurons, a circular structure saves hardware resources such as registers, logic elements (LEs), circuit area and wiring, which allows an implementation to achieve the lowest possible power consumption. By contrast, parallel structures boost implementation speed dramatically, at the price of higher hardware resource consumption, larger hardware scale and higher energy requirements. A balance can be achieved by applying both structures appropriately in a hardware implementation, which means using parallel hardware modules in the layers with low workload and circular structures in the heavy modules to reduce resource consumption.
 N. Hemsoth (2016), FPGA BASED DEEP LEARNING ACCELERATORS TAKE ON ASICS [Online]. Available: https://www.nextplatform.com/2016/08/23/fpga-based-deep-learning-accelerators-take-asics/
 A. Turing, “COMPUTING MACHINERY AND INTELLIGENCE,” MIND. 59.236, pp. 433-460, Oct.1950.
 C. Robert, “A brief history of machine learning,” Slide share, Jun.2016.
 M. A. Nielsen, “Using neural nets to recognize handwritten digits,” in Neural Networks and Deep Learning. Determination Press, 2015.
 Y. Taigman, M. Yang, M. Ranzato, L. Wolf, “DeepFace: Closing the Gap to Human-Level Performance in Face Verification,” in Conference on Computer Vision and Pattern Recognition, Jun. 2014.
 G. E. Hinton, “Learning multiple layers of representation,” Trends in Cognitive Science, 11, pp. 428-434, 2007.
 A. Krizhevsky, “ImageNet Classification with Deep Convolutional Neural Networks,” Advances in Neural Information Processing Systems 25, 2012.
 Y. Bengio, “Learning Deep Architectures for AI,” in Foundations & Trends in Machine Learning, 2(1), pp. 1-127, 2009.
 W. Remigiusz, “Synthesis of compositional microprogram control units for programmable devices,” University of Zielona Góra, Zielona Góra, pp. 153, 2009.
 S. Higginbotham (2016), GOOGLE TAKES UNCONVENTIONAL ROUTE WITH HOMEGROWN MACHINE LEARNING CHIPS [Online]. Available: https://www.nextplatform.com/2016/05/19/google-takes-unconventional-route-homegrown-machine-learning-chips/
 DeePhi. (2017). Investors [Online]. Available: http://www.deephi.com/index.html#contact
 J. Park, W. Sung, “FPGA BASED IMPLEMENTATION OF DEEP NEURAL NETWORKS USING ON-CHIP MEMORY ONLY,” Department of Electrical and Computer Engineering, Seoul National University, Korea, Aug. 2016.
 M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks,” Allen Institute for AI, University of Washington, Aug. 2016.
 F. Li et al., “Ternary weight networks,” 30th Conference on Neural Information Processing Systems, Barcelona, Spain, 2016.
 M. Panicker and C. Babu, “Efficient FPGA Implementation of Sigmoid and Bipolar Sigmoid Activation Functions for Multilayer Perceptrons,” IOSR Journal of Engineering, vol.2, no. 6, pp. 1352-1356, June, 2012.
 Girau, B, “FPNA: CONCEPTS AND PROPERTIES,” FPGA Implementations of Neural Networks, 2006, pp.63-102.
 C. Z. Tang and H. K. Kwan, “Multilayer Feedforward Neural Networks with Single Power-of-two Weights,” IEEE TRANSACTIONS ON SIGNAL PROCESSING, August, 1993, pp.2724-2727.