# A Spiking Neural Network Decoder for Implantable Brain Machine Interfaces and its Sparsity-aware Deployment on RISC-V Microcontrollers

Jiawei Liao, Graduate Student Member, IEEE, Oscar Toomey, Member, IEEE, Xiaying Wang, Member, IEEE, Lars Widmer, Cynthia A. Chestek, Senior Member, IEEE, Luca Benini, Fellow, IEEE, Taekwang Jang, Senior Member, IEEE

*Abstract*—Implantable Brain-machine interfaces (BMIs) are promising for motor rehabilitation and mobility augmentation, and they demand accurate and energy-efficient algorithms. In this paper, we propose a novel spiking neural network (SNN) decoder for regression tasks for implantable BMIs. The SNN is trained with enhanced spatio-temporal backpropagation to fully leverage its capability to handle temporal problems. The proposed SNN decoder outperforms the state-of-the-art Kalman filter and artificial neural network (ANN) decoders in offline finger velocity decoding tasks.

The decoder is deployed on a RISC-V-based hardware platform and optimized to exploit sparsity. The proposed implementation has an average power consumption of 0.50mW in a dutycycled mode. When conducting continuous inference without duty-cycling, it achieves an energy efficiency of  $1.88\mu$ J per inference, which is 5.5X less than the baseline ANN. Additionally, the average decoding latency is 0.12ms for each inference, which is 5.7X faster than the ANN implementation.

Index Terms—Spiking neural network, neural decoder, implantable, brain-machine interface, sparsity, regression

## I. INTRODUCTION

**B** RAIN-MACHINE interfaces (BMIs) bridge neurons in the brain with external electronic devices. They have become increasingly popular in both academia and industry as a promising approach to assist individuals with paralysis and amputations in regaining or enhancing their motor functions. An early study in 2004 conducted a human clinical trial to showcase prosthetic devices with an implanted BMI [1]. A BMI proposed in [2] was able to decode handwriting movement intentions at a speed close to typical smartphone texting. In a more recent study, researchers demonstrated a BMI system that bridges the brain and spinal cord to treat a patient with spinal cord injury. By recording and processing the brain's neural signals to stimulate neurons in the spine, this new BMI system has restored the person's ability to walk [3].

A BMI system typically consists of three main components: data acquisition, data processing, and stimulation or actuation. Both non-invasive and invasive methods are viable choices for acquiring data from the brain. Although non-invasive methods, such as EEG, reduce risks for patients, invasive methods such as iEEG or ECoG usually outperform the non-invasive methods in terms of precision thanks to their higher signal-tonoise ratio and finer temporal-spatial resolution [4]. Despite the advantages, implanted BMIs are often tethered using wires for the transfer of power and the large amount of data necessary for accurate decoding [1], [2], resulting in a higher risk of scar tissue formation and infection. One promising approach to reduce the damage caused by the wires is neural dust [5], featuring wireless power and data transfer to and from the miniaturized and distributed devices [6]. However, this wireless approach faces stringent requirement, arising from two factors: 1) limited battery life and transferable power, and 2) the risk of tissue damage caused by the heat generated by the implant. Among the components of a BMI system, the data transceiver is usually the most power-hungry element, given its role in handling vast amount of data acquired from highchannel recording circuits. To mitigate these challenges, edge computing emerges as a viable solution for BMI systems. By processing data directly within the sensor system, it reduces data transfer power. This indicates the growing importance of developing energy-efficient neuronal decoding algorithms within the BMI field.

1

Spiking neural networks (SNNs), inspired by biological neurons, are promising for energy-efficient inference. Neurons in SNNs employ single-bit spikes to transfer information between layers, unlike neurons in conventional artificial neural networks (ANNs) that generally use multiple bits. This property allows for efficient hardware implementation. Moreover, SNNs possess inherent temporal memory, allowing them to extract temporal features and making them well-suited for time-series regression tasks such as motor function decoding. Another beneficial property of SNNs lies in the firing behavior of neurons, which occurs only occasionally and asynchronously. This leads to data sparsity that can be harnessed to achieve more energy-efficient computation. Thanks to their energyefficient properties, SNNs have been extensively explored for classification tasks including image classification [7], [8] and gesture recognition [9], [10], yet more studies are necessary to assess their usage in time-series regression tasks.

Although SNNs show promise for energy-efficient implementation, several challenges remain before this potential can be fully realized. Firstly, SNNs have a more complex data flow

Jiawei Liao, Oscar Toomey, Xiaying Wang, Lars Widmer, Luca Benini and Taekwang Jang are with the Department of Information Technology and Electrical Engineering, ETH Zürich, 8092 Zürich, Switzerland (Email: liaoj@iis.ee.ethz.ch). Xiaying Wang is also with the Swiss University of Traditional Chinese Medicine, 5330 Bad Zurzach, Switzerland.

Cynthia A. Chestek is with the Department of Biomedical Engineering and Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109 USA.



Fig. 1. Application scenario and data flow of the proposed design. The neural signal is obtained from the motor cortex region of a brain before the SBP feature is extracted. During training, the input data is prepared with a sliding window containing 10 time steps with 9 steps overlapping. Then, noise is injected before the data is normalized. In test mode, all the data is streamed into an SNN neural decoder deployed on an MCU that predicts the finger velocity.

than conventional ANNs. In contrast to typical fully connected or convolution networks, SNNs require additional handling of temporal memory. Secondly, despite the natural temporal data sparsity of SNNs, effectively exploiting the sparsity on resource-constrained hardware remains a significant challenge. The sparsity lies in feature maps varying from one inference to another, leading to irregular memory access patterns. Thirdly, many SNN implementations require multiple inference cycles to form the temporal dimension for a single input data due to the rate encoding, increasing energy consumption, and processing time [11]. Therefore, despite the benefits of single-bit operation and sparsity of SNNs, their practical application has been limited to neuromorphic hardware specifically designed for this purpose with inherent limitations in flexibility and adaptability [12]. On the other hand, recent general-purpose edge computing platforms have proven to be energy efficient and suitable for BMI applications [13], [14]. Exploring SNN deployment on these platforms is beneficial since they are more versatile, affordable, and widely accessible. They are optimal for algorithmic and implementation explorations before the final deployment on neural implants, as suggested by [15].

In this work, we propose a neural decoding algorithm and deployment methodology for a low-complexity SNN, trained with an enhanced backpropagation method, aimed at the offline open-loop finger velocity prediction for implantable BMIs, as depicted in Fig. 1. Previous research in [16] presents an SNN-based finger velocity decoder, reporting decoding accuracy and computational complexity. The achieved decoding coefficients were 0.745 and 0.582 for the two given datasets. Building on [16], this work introduces new regularization techniques, yielding improved accuracies of 0.782 and 0.624. Furthermore, this paper delves into the implementation details, such as the model, training method, and the entire hardware deployment flow for inference, including the quantization and optimization for the deployment on the energy-efficient hardware platform.

The main contributions of this paper are summarised below:

• We propose an open source<sup>1</sup> and low-complexity SNN for

continuous-time finger velocity decoding for low-power implantable BMIs.

- We demonstrate an SNN backpropagation training strategy enhanced with neuron reset-by-subtraction, trainable decay factors, and noise injection to achieve a higher accuracy than the state-of-the-art KF [17] and ANN decoders [18].
- The proposed design is quantized with quantization-aware training and deployed to a RISC-V-based GAP9 micro-controller (MCU) for inference, resulting in a memory footprint saving of around 4X. The deployment is optimized to leverage SNN sparsity to enhance performance and efficiency.
- The implementation achieves an energy per inference of  $1.88 \,\mu$ J for continuous inference, 5.7X less than the baseline ANN and 2.1X less than the baseline SNN implementation that does not exploit sparsity. The average inference latency is only 0.12 ms, 5.5X and 1.8X faster than the conventional ANN and SNN baseline, respectively.

To the best of our knowledge, our work is the first sparsityaware implementation of an SNN on a general-purpose edge computing platform for neural implants.

# II. RELATED WORKS

# A. Neural decoding algorithms

Neural decoding algorithms translate brain activities into signals that can be used to operate actuators based on the user's intention. Linear decoders, such as linear regression, linear discriminant analysis, and variants of Kalman filters have been developed to perform arm and hand control [17], [25]–[28]. Linear decoders require little computation and can be implemented in highly energy-efficient hardware. However, linear decoders can only achieve moderate accuracy. Nonlinear decoders, including recurrent neural networks [29], [30] and feed-forward neural networks [31], [32] have received recent attention. While neural networks are powerful, they come at the cost of high computational complexity, implying high energy consumption in hardware implementations [33]. Therefore, it is essential to develop decoding algorithms that

<sup>&</sup>lt;sup>1</sup>https://github.com/liaoRichard/SNN-for-Finger-Velocity-iBMI (To be updated on acceptance)

TABLE I Comparison table

|               | This Work                | BioCAS24 [19]          | TBioCAS22 [20]                                     | TBioCAS16 [21]      | ISSCC24 [22]       | Frontiers16 [23] | ISCAS23 [24] | IEEE Sens.J24 [13] |
|---------------|--------------------------|------------------------|----------------------------------------------------|---------------------|--------------------|------------------|--------------|--------------------|
| Туре          | MCU                      | FPGA                   | ASIC                                               | ASIC                | ASIC               | ASIC             | MCU          | MCU                |
| Task          | BMI Finger               | BMI Finger             | BMI Finger                                         | BMI Finger          | BMI Hand           | BMI Object       | Digit        | BMI                |
|               | Movement                 | Movement               | Movement                                           | Movement            | Writing            | Control          | Recognition  | Motor Imagery      |
| Feature       | SBP                      | MUA                    | SBP                                                | Spike rate          | Spike rate         | Spike rate       | Spike rate   | EEG                |
| Algorithm     | SNN                      | SNN                    | SSKF                                               | ELM                 | DNC + LDA          | SNN              | SNN          | CNN                |
| Output Type   | Regression               | Regression             | Regression                                         | 12 classes          | 31 classes         | 4 clases         | 10 clases    | 2 classes          |
| Accuracy      | 0.782 corr<br>0.627 corr | 0.84 corr <sup>c</sup> | 0.601 corr <sup>d</sup><br>0.459 corr <sup>d</sup> | 99.3% <sup>c</sup>  | 90.8% <sup>c</sup> | 50% - 70%°       | 97.2% °      | 82.51%             |
| Power         | 0.50 mW                  | 56.4 mW                | 0.588 mW <sup>a</sup>                              | 0.4 uW <sup>b</sup> | 0.22 mW            | 4 mW             | 99.5 mW      | 10.17 mW           |
| Latency       | 0.12 ms                  | 0.3 ms                 | < 1 ms                                             | -                   | -                  | < 100 ms         | 0.04 ms      | 2.95 ms            |
| Energy/infer. | 1.88 uJ                  | 56.4 uJ                | -                                                  | -                   | -                  | -                | 4.9 uJ       | 30uJ               |
|               |                          |                        |                                                    |                     |                    |                  |              |                    |

<sup>a</sup> Feature extraction power included <sup>b</sup> MCU power not included <sup>d</sup> Reproduced results, assuming SSKF has same accuracy as KF

<sup>c</sup> Not same tasks or datasets, cannot be used for direct comparison

can achieve high accuracy while consuming low energy. To this end, SNNs emerge as promising candidates. Our proposed SNN achieves better accuracy compared with the state-of-theart KF and ANN neural decoders in offline neural decoding tasks reported in table III, which will be discussed in detail in section V.

# B. SNN training and deployment

There are three main methods for SNN training:

1) ANN-to-SNN conversion requires an ANN trained before converting it to an SNN. Typically, it relies on rate coding and tries to mimic the computation of the ANN by setting the parameters of the SNN to correlate the spiking neurons' firing rates with the neurons' activation of the original ANN [34]. In the same vein, [35] introduced an SNN decoder converted from a Kalman filter decoder. This approach has shown accuracy comparable to its ANN or KF counterparts but typically requires high spike counts due to the conversion process and rate approximation, resulting in high energy consumption and latency.

2) Unsupervised learning, which relies on local learning rules [36] such as spike-timing-dependent-plasticity (STDP). This is a biologically plausible approach, but without global supervision, it has not yet achieved state-of-the-art performance in terms of accuracy and energy efficiency.

3) SNN backpropagation such as spatio-temporal-backpropogation (STBP) [7], [8]. This approach establishes an error backpropagation path for gradient descent training by applying a surrogate function in the backward flow to approximate the derivative of the spike activity. It utilizes the temporal feature of the input and has demonstrated good performance in classification tasks. In this work, we construct the SNN using the SNN backpropagation method.

After training, SNNs are deployed. The key performance indicator for deployment is energy efficiency. Due to the irregular memory access pattern and more complex data flow, deploying SNNs efficiently can be challenging. FPGA accelerators [37], and analog [38] or digital [39]–[41] ASIC accelerators have been designed targeting efficient deployment of SNNs. There are only a few studies discussing SNN deployment on MCUs. An SNN is deployed on a low-cost MCU for digit classification task [24], which, while demonstrating feasibility, does so without detailed optimization or deployment strategies. This system reports a relatively high power consumption of  $99.5 \,\mathrm{mW}$  and an energy efficiency of  $4.9 \,\mu\mathrm{J}$  per inference. In contrast, our paper offers an indepth optimization of SNN deployment on MCU, with special emphasis on exploiting sparsity.

## C. BMI hardware systems

Table I provides a comparative analysis of our approach with previous works in the field of BMI hardware deployment and implementation.

The works in [13], [21]–[23] focus on implementing BMIs for classification tasks. Specifically, [13] introduces a CNN deployed on a RISC-V-based MCU for motor imagery tasks, consuming tens of uJ per inference and showing the feasibility of applying modern MCUs to BMI applications. [21] discusses a mixed-signal ASIC for classifying finger movements using an extreme learning machine (ELM) algorithm, which consumes only  $0.4 \mu$ W. However, it should be noted that this measurement excludes the power of the MCU and peripherals. Moreover, the adaptability of ELM for BMI regression tasks requires further exploration.

[19], [20] introduce BMI hardware for regression tasks. [20] describes an ASIC that supports feature extraction and decoding, maintaining a low power consumption of  $588 \mu W$  and a processing latency under 1 ms for closed-loop BMI tasks. However, the KF-based decoder in this system yields relatively lower accuracy than neural network-based decoders.

In a recent work by Leone, et al. [19], an FPGA-based SNN decoder was introduced for finger movement regression tasks, with a latency of only 0.3 ms. However, this design has a power consumption of 56.4 mW and energy per inference of  $56.4 \mu$ J which are significantly higher when compared to other neural decoders.

Compared to the prior works, our design demonstrates an SNN-based decoder and its sparsity-aware deployment on a flexible MCU with competitive decoding accuracy, latency of  $0.12 \,\mu s$ , and energy efficiency of  $1.88 \,\mu J/inference$  in BMI regression tasks.

#### **III.** Algorithm

# A. Input feature

The proposed network adopts the spiking band power (SBP) introduced in [42] as the input feature. SBP is a neural feature

defined as an absolute average of a 300-1000 Hz bandpass-filtered signal. The advantage of the spiking band power feature lies in its lower data rate than the direct action potential acquisition, implying lower power for communication, while its value is still determined mainly by single neuron spikes. SBP has shown good performance for motor prediction tasks when embedded in various algorithms [17], [18]. SBP feature extractors have been implemented in both digital [20] and analog [6] circuits consuming very low power. In this work, we propose a network receiving SBP features from 96 channels.

## B. Neuron Model

The proposed network uses a leaky-integrate-and-fire (LIF) neuron as its neuron model, which is one of the most commonly applied models in computational neuroscience. It is inspired by the biological neuron but implemented in a simplified form. As demonstrated in [43], the LIF neuron model shows a good compromise between cognitive capabilities and computational complexity, thus making it a suitable candidate for embedded platforms with limited resources. The action potentials are simplified to events and neural dynamics, which are governed by two equations describing the membrane potential evolution and the spike generation mechanism [44]. The membrane potential is defined by the differential equation

$$\tau \frac{du}{dt} = -u + i,\tag{1}$$

where u is the membrane potential, i is the input current, and  $\tau$  is the time constant for the membrane potential decay.

To make the model more suitable for efficient computation, we can solve the equation (1) with the forward Euler method, which gives us the discrete-time equation

$$u(t+dt) = (1 - \frac{dt}{\tau})u(t) + \frac{dt}{\tau}i.$$
(2)

In the scenario of neural networks, the input i is the weighted sum of the output from a previous layer and can be represented by

$$i^{l}(t) = W^{l}s^{l-1}(t) + B^{l},$$
(3)

where W is the weight matrix, B is the bias vector, and l represents the layer number.

The neuron fires its output in accordance with its current membrane potential. This behavior is governed by the second half of the neural dynamics model, the spike generation mechanism, represented by a simple Heaviside step function

$$s^{l}(t) = \begin{cases} 1 & u^{l}(t) - V_{th} \ge 0\\ 0 & u^{l}(t) - V_{th} < 0, \end{cases}$$
(4)

where s represents the single-bit output of a neuron, and Vth is the threshold voltage.

The membrane potential will reset to a new value after the neuron fires. Typically, when the membrane potential exceeds a threshold, the neuron fires and resets its membrane potential to zero, which means all information regarding recent input activity is lost after firing. However, we adjusted the reset scheme to subtract the threshold voltage  $(V_{th})$  from the membrane potential when firing, as inspired by [45]. This



Fig. 2. LIF neuron dynamics. Membrane potential u keeps the internal state of a neuron. If the membrane potential is higher than a threshold voltage, then the neuron will fire and reset the membrane potential. Otherwise, the neuron keeps its membrane potential for the next step and does not fire.

means that the information from very high activities is not completely lost. The data above the threshold can influence the neuron's decision at the next firing. This adaptation poses only a little hardware overhead, requiring only one subtraction operation per neuron per inference.

The equation (2) can be further simplified by using decay factors  $\lambda = 1 - \frac{dt}{\tau}$  and merging the factor  $\frac{dt}{\tau}$  into the weights of the input. After incorporating the spikes and resetting, the membrane potential can be modeled by

$$u^{l}(t) = \lambda(u^{l}(t-1) - s^{l-1}(t-1)V_{th}) + i^{l}(t).$$
(5)

The behavior of our neural model follows (3), (4), and (5), and it is illustrated in Fig. 2. The membrane potential u keeps the internal state of a neuron. The membrane potential and the decaying factor  $\lambda$  allow the neuron to partly retain its former information and inherently capture temporal dynamics.

The decay factor  $\lambda$  determines the length of history that a neuron can remember. Larger  $\lambda$  leads to slower membrane potential decay, while small values lead to fast decay. Different  $\lambda$  values indicate the different lengths of history that a neuron can remember. Typically, a single  $\lambda$  is employed across the whole network. However, in our proposed network, each neuron has its own individually trainable decay factor  $\lambda$ , allowing for varying sensitivities to historical data among neurons. In this way, we allow the network to capture more complex temporal dynamics. Fig. 3 shows the distribution of the decay factor that spans the entire range between 0 and 1 after training. Introducing trainable decay factors increases the complexity of training. However, in the inference process, its overhead of storing the decay factor is negligible since the number of neurons is far fewer than the number of weights.

In contrast to the normal ANN implementation where the information between layers is carried by multi-bit real values, the information between spiking layers is encoded in 1-bit spikes. Therefore, one of the advantages of using an SNN instead of a conventional ANN is that the input current *i* can be calculated as cheaper additions of weights instead of Multiply-Accumulate (MAC) operations due to the 1-bit input spikes. All hidden neurons fire only occasionally when the membrane potentials exceed the threshold, introducing a high degree of sparsity in the intermediate features. The reduced computational complexity and the high degree of sparsity are



Fig. 3. Decay factor  $\lambda$  distribution before and after training. Tau is randomly initialized around 0.6. After training,  $\lambda$  spread in the range between 1 and 0



Fig. 4. Topology sweep. The scope of this exploration includes the number of spiking layers and the number of neurons in each layer except for the output layer.

the two main sources of energy efficiency of the proposed SNN compared to the conventional ANNs.

#### C. Proposed Network Architecture

In this work, we propose a fully-connected SNN for velocity regression. A conventional ANN is proposed in [18] for the same task, where the input layer is a temporal convolution layer for feature extraction from multiple time steps, followed by three fully connected layers. However, the large number of neurons in the flattened output of the temporal convolutional layers leads to a high number of parameters and computations in their second layer.

Instead of processing multiple time steps of inputs using a temporal convolution layer, we only pass inputs from a single time step in each inference. Then, we rely on the implicit recurrent nature of the neurons for temporal processing. As a result our model only uses 4 fully connected spiking layers. This choice was made to exploit an intrinsic property of SNNs, i.e., the membrane potential of each neuron acts as a memory element and stores temporal information from previous inputs.

The input to the first layer of our proposed network are multi-bit SBP features from 96 channels. Hence, the input



Fig. 5. Proposed SNN unfolded for training (a) and in inference (b)

current is calculated as  $i(t) = \sum_{n=0}^{M-1} wSBP_n(t)$ , rather than the  $i(t) = \sum_{i=0}^{M-1} w^l s^{l-1}(t)$  used in the other layers. Hence, for the first layer, it is not possible to replace MAC operations with additions. The second and third layers are spiking layers formed by the LIF neurons described in section III-B. The output layer is composed of two leaky-integrate neurons predicting the velocity of two fingers. In contrast to some works converting the spikes back to the real value, we directly use the membrane potential as the output. Therefore, the neurons in the last layer are modeled by  $u(t) = \lambda u(t-1) + i(t)$  without spike and reset mechanisms.

The number of hidden layers and neurons in each layer is determined based on the parameter sweep shown in Fig. 4. We choose a configuration consisting of 4 layers, each comprising 256 neurons except for the output layer, after making a trade off between the configuration size and the performance improvement.

Between successive layers, batch normalization is implemented to improve convergence. Training an SNN with a backpropagation approach requires unfolding the network, equivalent to expanding the network to more layers. This severely exacerbates the gradient vanishing or explosion problems. Unlike the conventional ANN, the SNN has an extra temporal dimension, and neurons firing too frequently or scarcely can degrade performance [8]. Therefore, the scale of the threshold voltage and the input needs to be balanced. We employ the threshold-based batch normalization introduced in [8] to address these issues. The threshold-dependent batch normalization can be described by

$$y = \frac{V_{th}(x - E[x])}{\sqrt{Var[x] + \epsilon}} * \gamma + \beta.$$
(6)

In training, E[x] and Var[x] are calculated over the mini batches. In inference, we use the mean and standard deviation calculated in the training dataset. Unlike the standard batch normalization function, the  $V_{th}$  is also taken into consideration. To account for the additional temporal domain, the mean and variance are calculated for each channel not only for all the spatial dimensions but also for the temporal dimension. To reduce computation complexity, we merge batch normalization with the pre-synaptic calculation described in (3).

When we export parameters for hardware deployment, batch normalization is fused into weight and bias to eliminate



Fig. 6. (Left) Time step sweep. The number of time steps for the network to unfold in the training phase. The time step of 10 is chosen as the accuracy does not increase with more time steps. (Right) Sweep strength of the noise. The noise is white Gaussian noise, whose standard deviation is set to a ratio relative to the mean standard deviation of signals from all channels. Best performance is achieved at 0.9.

overhead in inference as in

$$W_{fused} = \frac{V_{th}W}{\sqrt{Var[x] + \epsilon}} * \gamma \tag{7}$$

and

$$B_{fused} = \frac{V_{th}(B-\mu)}{\sqrt{Var[x] + \epsilon}} * \gamma + \beta.$$
(8)

## D. Training

Spiking functions for neurons in SNNs are not directly differentiable. We implement a surrogate function to allow the gradient to propagate back through the neurons, as suggested by [7], [46]. In this work, we use a square surrogate function defined by

$$\frac{\partial a}{\partial u} = \begin{cases} 1 & \text{if} |u(t) - V_{th}| < 0.5\\ 0 & \text{else.} \end{cases}$$
(9)

In the forward path, the spike function is governed by (4), while, in the backpropagation phase, the gradient is calculated by the surrogate function. We choose this surrogate function for its simplicity and good performance shown in other works [7].

The training process is developed on top of the publicly available PyTorch implementation of the STBP backpropagation training method introduced in [7]. We modify it to support the proposed architecture incuding the reset-by-subtract scheme, trainable decay factor, quantization, hyperparameter sweeps, regularization as well as non-spiking output layers.

Fig. 5 shows the unfolded network during the training process. The STBP allows backpropagation through both temporal and spatial dimensions. The network is unfolded for ten time steps during the training process. This is equivalent to having a much deeper network, similar to the training process of RNNs. The time sequence length for training should be long enough to capture the relevant temporal dependencies; however, extending the number of time steps overly may lead to longer training duration and worsen the issue of vanishing or exploding gradients. We observed that the network does not need to unfold more than ten frames as it does not improve the accuracy, shown in Fig. 6. Additionally, we discard the first two frames in the loss calculation because the network

has not converged to a stable prediction, yet. The loss is then back-propagated through spatial and temporal dimensions in this unfolded network.

We generate the training dataset by using a sliding window with a length of 10 time steps and an overlap of 9 time steps. These training samples are then shuffled during training. Note that during the inference process, the network is not unfolded, and all neurons maintain and update their internal state across multiple inferences. One time step is used once per inference. This operating scenario mimics the situation where a real-time prediction task runs on data that are streamed continuously to the network.

We apply the AdamW optimizer [47] with a learning rate and a weight decay of  $2 \times 10^{-3}$  and  $1 \times 10^{-2}$ , respectively. Learning rate decay is configured to decay by 1 decade every 20 epochs for better convergence. The batch size used during training is 128. The membrane threshold  $V_{th}$  is set to 0.4 which facilitates an optimal firing rate, thereby contributing to improved accuracy. These hyperparameters are determined by grid search.

Similarly to [32], we use the time-integrated mean square error as the loss function during training, while the Pearson correlation coefficient and mean square error are used as the metric to evaluate the performance of the SNN, which are commonly adopted for neural decoding algorithm comparisons [32], [42].

# E. Regularization

Overfitting is a common problem for neural networks that have numerous parameters and are trained with limited data. Our model also suffers from overfitting as observed in Fig. 7. We address this issue by three means. First, the weight decay function from the AdamW optimizer is applied. Second, dropout is implemented between each layer. It is performed for only spatial dimensions. At each time step, a new dropout mask is generated randomly. The dropout probability must be carefully calibrated. While low dropout probability provides insufficient regularization, excessively high probability can lead to the neglect of too many neurons during training, causing significant information loss. After a grid search, a dropout probability of 0.2 is chosen.

Third, white Gaussian noise is inserted before normalizing the input data to enhance the regularization effect. The inserted noise has a zero mean, and the standard deviation is set to 0.9 times the average standard deviation of the input features across all channels. While the noise helps to mitigate overfitting by introducing variability and uncertainty during training, too strong noise obscures the input feature and reduces accuracy. The standard deviation of 0.9 is determined as a result of a parameter sweep shown in Fig. 6. The noise added to input data from all the channels is calculated based on the same criteria mentioned above.

Noise injection not only adds randomness to the input data but also serves as a data augmentation method. As introduced in section III-D, each training sample is prepared by using a sliding window of 10 time steps with 9 steps overlapped with the previous sample. Noise is added after these 10-step-long



Fig. 7. The correlation coefficients over epochs for the training set (dash line) and the validation set (solid line). The gap between the training line and the corresponding validation line shows the effectiveness of the regularisation methods. The baseline is the case when no regularisation method is applied. Weight decay does not help. Dropout and noise injection significantly alleviate the overfitting and improve accuracy.

samples are made. This means that the same original data in different sliding windows will be slightly different due to the added noise, and this effectively augments the dataset to 10 times larger.

The impact after adding each regularisation scheme is shown in Fig. 7. Weight decay does not appear to have a noticeable effect. Dropout, on the other hand, reduces the disparity between the training and validation curves. The best accuracy is attained after applying noise injection, which shows the effectiveness of this approach in mitigating the overfitting issue.

# IV. DEPLOYMENT

# A. Quantization

Quantization is an effective method for improving hardware efficiency. Decreasing the bit precision of the data and weights reduces the memory footprint and enables the usage of lighter integer computations instead of expensive floatingpoint computations. Quantization unleashes the full capability of the target MCU equipped with 8-bit SIMD instructions.

Before performing quantization, the batch normalization parameters are first fused into weights and biases. Thanks to the 1-bit information carrier in an SNN, no scaling is required between layers. We implement uniform quantization with a symmetrical range. This quantization is performed on a layer-by-layer basis, with distinct scaling factors computed for each individual layer. The scaling factor is primarily determined by the weights. The scaling factors of weight, bias, and threshold voltage are calculated by the bit-width and the maximum values of the weights in that layer as  $scale = 2^{(bitwidth-1)-1}/W_{max}$ .

The quantized weight, bias, and threshold voltage are generated by rounding the scaled values to the closest integer as



Fig. 8. GreenWaves Technologies GAP9 Microprocessor block diagram.

 $quant = round(value \cdot scale)$ . The membrane potential uses more bits than the weights do to provide numerical stability.

The decay factor ranges from 0 to 1 and is quantized differently. The scaling factor of  $\lambda$  is only determined by the chosen bit-width. Unlike the weight and bias, which are applied once per inference for various inputs, the decay factor is repeatedly multiplied by the membrane potential in each forward pass, so the scaling factor for the quantized decay factor may impair numerical stability. To address this issue, we re-scale the multiplication result back through right-shift and truncation after performing multiplication with the scaled  $\lambda$ , formulated as  $u^{l}(t) = (\lambda 2^{bw}(u^{l}(t-1) - s^{l-1}(t-1)V_{th}) >> bw + i^{l}(t)$ .

To compensate for the accuracy loss caused by quantization, quantization-aware training (QAT) is adopted. After regular training epochs, QAT is applied for 20 epochs. During QAT, a straight through estimator is used [48]. In the forward path, the simulated quantization is applied to mimic the effects of quantization, while in the back-propagation, the error is back-propagated with full precision, as the gradients for the piece-wise flat operator for quantization are almost zero everywhere [48].

The neural decoder shows almost no loss even when the weights are quantized to 4 bits and decay factors are quantized to 3 bits. The model is exported after quantization and used for hardware deployment.

# B. Hardware Platform

The network is deployed on a GAP9 hardware platform, a low-power 32-bit microcontroller (MCU), consisting of 10 RISC-V ISA cores as depicted in Fig. 8. A 'Fabric Controller' core runs on startup and manages IO-related tasks, such as interfacing with peripherals and configuring operating voltages and frequencies. The platform is equipped with a compute cluster that allows task parallelization, consisting of a 'master' core, eight worker cores, and an accelerator (NE). Each core on GAP9 supports an extended instruction set with support for zero-overhead hardware loops and various Single-Instruction-Multiple-Data (SIMD) operations for 8- and 16-bit data types.

The MCU uses a hierarchical memory architecture consisting entirely of SRAM, accessed through interleaved memory banks. The memory regions of relevance to this work are (1) 1.5 MB located near the Fabric Controller with a longer access time to the cluster known as L2, (2) 128 kB located in the cluster with one cycle of access time known as L1.

Considering the memory hierarchy, care must be taken to ensure that data is in a low-access cost memory region when it is needed for execution. To achieve this, a dedicated direct memory access (DMA) unit is equipped and can be used to move data between L2 and L1 at high bandwidth without interrupting execution. A single transfer command consists of source and destination addresses and the number of bytes to move.

# C. Data Movement & Execution

Eight cores in the compute cluster are used for most of the execution. Between the execution of each layer, the master core is responsible for allocating and deallocating buffers, and configuring the tasks to be executed on each worker core. The model parameters and the membrane potentials are ordinarily stored in the L2 memory region. The weights for each layer are stored in a 2D array. The binary input spike vector determines which columns are used during the weight summation operation along each row of the weight matrix. This is illustrated in Fig. 9 where the weights are denoted by  $W_{ij}$ , and *i* and *j* represent the neurons in the current and previous layers, respectively. To decrease memory access times, the parameters and inputs are transferred to L1 for execution.

However, it is not feasible to transfer the entire model from the L2 memory region to L1 due to the model size. Additionally, transferring the entire model is not optimal due to the latency and energy overhead. Therefore, we propose different data moving and execution strategies for different layers according to their characteristics.

1) Layer 1: The input to the first layer is an 8-bit normalized SBP feature vector. The data movement and execution scheme for layer 1 is similar to double buffering, where the data transfer runs concurrently with the model execution. Firstly, the input vector for the model is transferred to L1, followed by the membrane potentials of all four layers. Then, the parameters for the first layer are split into two blocks of 128 rows, where execution can begin as soon as the first block has been transferred. Once the first block has finished executing and the parameters for the second block have arrived in L1, the second block can be executed. This hides the latency associated with transferring the second half of layer 1's parameters, as demonstrated by the program profiler trace in Fig. 10.

The blocks of parameters were configured to contain all parameters needed to execute a portion of the layer in a single contiguous region of memory. This allows all parameters for layer 1 to be moved into L1 with only two transfer commands, minimizing the cost of queuing a transfer. Each block contains 128 rows of the weight matrix W and the 128 corresponding bias and decay factors. The threshold voltages are read once from L2 at the start of each layer's execution by the cluster master core. The task is split row-wise as there are no data dependencies between rows. The computation is then distributed and parallelized on the 8 cluster cores, with each core processing 16 rows of each block in layer 1.

The worker cores perform the weight matrix multiplication step using a dot product SIMD operation available in each core on the GAP9 platform. This instruction computes the dot product between two 4-element 8-bit vectors, allowing four values from each row of the weight matrix to be processed per cycle per core. Once the result for a row has been calculated, the reset and decay factors are applied using 2-element 16-bit vector instructions.

2) Layer 2, 3 & 4: In contrast to the first layer, each subsequent layer receives a binary vector as input. This is an important characteristic since it ensures that only the columns in the weight matrix corresponding to neurons fired in the preceding layer contribute to the update of the membrane potential. Thus, we propose the Sparse Copy implementation, taking advantage of sparsity by only transferring the weights required for execution from L2 to L1 (as illustrated in Fig. 9). This reduces energy consumption and execution time by decreasing the amount of data transferred. Instead of transferring data after a layer's execution is finished, the transfer is queued during layer execution to give the DMA more time to complete transfers. Once the membrane potential for a neuron is updated, the spike condition is checked, and the transfer of the corresponding weight column in the next layer is initiated. This allows the DMA peripheral to process many of the transfers while the cluster is still executing. This behavior is depicted in Fig. 10. Two large buffers are used in L1: one for reading during execution, the other for transferring new data. Due to the limited size of L1, the size of these buffers is less than the total size of the layer 2 or layer 3 weight matrices. However, in practice, the number of neuron spikes is never large enough to result in a buffer overflow, thanks to the sparsity. Once the transfer is queued, the destination address for the next transfer in the L1 weight buffer is incremented by 256 Bytes, the size of one column in the weight matrix. Besides saving the weights transfer, the sparse copy strategy also eases the write-back of the results. Saving the binary spikes is no longer required because the necessary DMA commands to obtain the relevant weights for the next layer are issued immediately after comparing the membrane potential to the threshold voltage.

The transfers were initiated using a DMA mode that allows cores to initiate transfers in chronological order, forcing subsequent transfers to wait. Because the transfer size is very small, i.e., 256 bytes, the time spent waiting is usually short. When multiple cores experience a spike in a short window, this latency accumulates and may affect execution time. This issue is further exacerbated for layers with a higher spike rate. An advantage of the proposed implementation is the capability of distributing the combined overhead of transfer initialization across all 8 cores rather than just the master core, as well as masking the transfer time behind the execution.

As with layer 1, layers 2 and 3 are executed on all 8 cluster cores. Each core processes 32 of the 256 neurons in each layer. The program needs to keep track of the location to write the next column of the weight matrix, so there is a risk that two cores might initiate DMA transfers to the same memory address, in which case, one overwrites the other. This problem is solved using a hardware semaphore, which supports



Fig. 9. Sparse Copy - The weight matrix calculation and initiation of transfers for the next layer are depicted in this diagram for a reduced weight matrix size. The columns required for computation are transferred from L2 to L1 using the DMA peripheral. The membrane potentials are then updated across multiple cores (two are shown here) using 4-element and 2-element SIMD. Once the spike conditions of the neurons in the layer are known, the process may repeat for the next layer.

atomic integer increment and decrement. It is initialized to 0 before each layer executes. Whenever a core wants to transfer a column of weights, it loads and increments the semaphore value to determine the destination address in the L1 weight buffer. It can then safely queue a DMA transfer to this address from the column's memory address in L2. It is not necessary for the weight matrix in L1 to maintain the original column order as in the weight matrix in L2, the elements transferred to L1 only need to be summed in a row-wise manner.

As with Layer 1, SIMD operations are used to perform the weight matrix calculation. A 4-element 8-bit addition instruction allows four rows of each column in the weight matrix to be added per cycle, as shown in Fig. 9. The quantization configuration used during training guarantees that the 8-bit signed addition will not over- or underflow.

Finally, because the fourth layer consists of only two rows, it is executed on just the master core. Additionally, the layer 4 weights are padded to 16 bits to enable the use of 2-element 16-bit addition SIMD operations.

TABLE II MODEL PARAMETERS AND ASSOCIATED QUANTISATION.

| Parameter | Layers     | Bit Precision | Total size for all layers |
|-----------|------------|---------------|---------------------------|
| W         | 1, 2, 3, 4 | 8 bits        | 156,160 Bytes             |
| $\lambda$ | 1, 2, 3, 4 | 16 bits       | 1,540 Bytes               |
| B         | 1, 2, 3, 4 | 16 bits       | 1,540 Bytes               |
| $V_{th}$  | 1, 2, 3    | 16 bits       | 6 Bytes                   |
| u         | 1, 2, 3, 4 | 16 bits       | 1,540 Bytes               |
| Total     |            |               | 160,786 Bytes             |

## A. Environment

The model is constructed in Python with the Pytorch framework. The training and hyperparameter search are facilitated with Raytune framework to enable parallel training on multiple GPUs. The code for deployment to MCU was written in C and then compiled and deployed with GAP9 SDK Release v5.11.0.

The algorithm is trained and evaluated on two datasets recorded from non-human primates while they were performing two-degree-of-freedom finger tasks. The datasets contain positions and velocities of two fingers, as well as the SBP feature from 96 channels. Dataset A, also used in [32], contains 817s data. Dataset B is an open-source dataset, used also in [17], containing 610s data<sup>2</sup>. The SBP feature in the dataset is sampled at 2 kHz. The SBP is first time-averaged in non-overlapping time bins before being fed into the proposed decoder. Time bin sizes are chosen to be  $50 \,\mathrm{ms}$  and  $32 \,\mathrm{ms}$ for dataset A and B, respectively, as introduced in [17], [32]. SBP features are standardized by removing means and scaling to a standard deviation of 1 before being fed into the SNN. Predicted velocity is also standardized with statistics from the training set. Standardized velocity is recovered to the original scale after inference. Hyperparameter optimization for the training parameters, unfolding time steps, and noise standard deviation is only conducted on dataset A.

The first 80% of the data are used for training, and the remaining 20% are for validation. The non-quantized model is trained for 60 epochs. The quantized SNN is trained with full precision for 30 epochs before QAT is enabled for another 20 epochs. An inference is performed for every time frame to generate two-finger velocities in a streaming fashion.

## B. Memory footprint

Although the weights can be quantized to 4 bits, they are padded to 8 bits for the hardware deployment to take advantage of the 8-bit SIMD operations of the MCU. The bit precision for different parameters and their memory footprint are presented in Table II. Layer 4 is a non-spiking layer, so no threshold voltage is required. In total, the model requires 159,246 Bytes of storage for parameters or 160,786 Bytes, including the neuron membrane potentials.

## C. Accuracy

The velocity predicted by the proposed SNN, and the true velocity of two fingers are plotted in Fig. 11. The accuracy is presented as the correlation coefficients between the predicted velocity and the recorded velocity in the validation set. The reported correlation coefficients are the mean values for two finger velocities. Table III summarises the mean correlation and root mean square error (RMSE) over eight runs. To compare our results with previous work, we replicate the KF predictor [17] and the ANN predictor [32] using the same parameters as the original papers. Our proposed work and the quantized model reach the highest correlation coefficients of 0.783 and 0.624 for datasets A and B, respectively.

<sup>2</sup>Dataset is available for download: https://deepblue.lib.umich.edu/data/ concern/data\_sets/0g354f51t



Fig. 10. Profile trace for the sparse copy implementation running at 150 MHz depicting how DMA transfers are initiated for columns in subsequent layers. After initialization, the inputs and membrane potentials are transferred to LI. Then the first half of layer 1's parameters are transferred before execution begins. Once the first block has finished executing and the DMA is finished transferring the second block, execution continues to the second half of layer 1. The DMA queue operations for sparse copy are shown in red. Many queue operations are omitted for the sake of clarity.



Fig. 11. Predicted velocity versus true velocity for two fingers.

## D. Computation complexity analysis

The SNN's potential to achieve energy-efficient computation stems from the sparsity of the spikes. The spike rate for neurons and the average spike count in different layers are evaluated for the whole validation set of dataset A to quantify the sparsity in our network. The spike rate is calculated as the ratio of a neuron to fire over the total number of inferences. Fig. 12 shows the histogram for the spike rate distribution on the three spiking layers. The average spike rates for the three layers are 19%, 19%, and 9%, respectively. Out of the 770 neurons, only 120 of them fire in each inference step on average. Ideally, this high degree of sparsity significantly reduces the necessary number of memory accesses and computation operations. The required operations are summarised in Table IV.

In conventional ANNs, the weighted sum computation for each input to a neuron requires a MAC operation. Whereas, in the proposed SNN, as the spike status is either 1 or 0, this process has been replaced by the add operations, which require much less power [49]. It is important to note that for

TABLE III ACCURACY IN TERMS OF CORRELATION COEFFICIENTS AND RMSE.

| Decoder       | Corr A             | RMSE A            | Corr B            | RMSE B             |
|---------------|--------------------|-------------------|-------------------|--------------------|
| KF [17]       | $0.601^{\dagger}$  | $0.026^{\dagger}$ | $0.459^{\dagger}$ | $0.016^{\dagger}$  |
| ANN [18]      | 0.732 <sup>†</sup> | $0.032^{\dagger}$ | 0.593†            | 0.017 <sup>†</sup> |
| Proposed SNN  | 0.783              | 0.022             | 0.624             | 0.014              |
| Quantised SNN | 0.782              | 0.021             | 0.627             | 0.014              |

<sup>†</sup>Reproduced results.

a fair comparison between the proposed SNN and ANN, the SNN needs to update its membrane potentials, as described in (5), once per inference, which amounts to an additional MAC operation per each neuron. In this complexity analysis comparison, it is assumed that three memory loads and one store are required for each MAC operation, while for each addition, two loads and one store are required.

As a reference point, we also include deployment of the same proposed SNN labeled as SNN baseline in our comparison, which does not exploit sparsity in any manner. Each layer in the SNN baseline implementation uses the double buffering approach described in section IV-C1.

Thanks to the high degree of sparsity, the proposed SNN requires up to one order of magnitude fewer operations and memory accesses than the ANN and the SNN baseline implementation.

## E. Cycle Count Comparison

The preceding analysis examines the potential savings in SNN operations under ideal circumstances. However, it is important to note that there may be additional overhead in actual implementation. For example, addressing the irregular access pattern resulting from sparsity may require longer processing times. Therefore, we implement the decoder described in section IV targeting deployment on GAP9.

First, we assess the number of cycles required for the inference to understand the latency of the decoder using the GAP9 SDK. We compare the implementations in our work to a baseline ANN [18] that is trained for the same task and deployed on the same hardware platform using the *nntool*, which is an automatic neural network deployment



Fig. 12. (Top) Spike rate distribution for three spiking layers. Each bar indicates the number of neurons exhibiting a certain spike rate. (Bottom) The number of cycles for the processing of a layer for different numbers of input spikes. Dots represent measured cycles for SNN sparse copy implementation. The number of cycles required for the SNN baseline implementation without using sparsity is illustrated by the orange lines

TABLE IV Performance and computation complexity analysis.

|                 | Theoretical complexity analysis |     |     |               | Measurement             |                            |            |
|-----------------|---------------------------------|-----|-----|---------------|-------------------------|----------------------------|------------|
|                 | # Parameters                    | MAC | ADD | Memory access | Avg. Power <sup>a</sup> | Energy / Inf. <sup>b</sup> | Time/ Inf. |
| Decoder         | (K)                             | (K) | (K) | (K)           | ( <b>mW</b> )           | ( <b>u</b> J)              | (ms)       |
| ANN [18]        | 526                             | 529 | -   | 2116          | 0.60                    | 10.69                      | 0.66       |
| SNN baseline    | 158                             | 25  | 132 | 496           | 0.57                    | 3.90                       | 0.22       |
| SNN sparse copy | 158                             | 25  | 26  | 178           | 0.50                    | 1.88                       | 0.12       |

<sup>a</sup> Average power measured when the MCU is waked up every 50 ms, and after the inference, the MCU is put into sleep mode.

<sup>b</sup> Energy/inference measured when the MCU performs continuous inference without sleep to show the best achievable energy efficiency.

tool provided by the SDK, with the accelerator enabled. The corresponding cycle counts measured for each layer are shown in Fig. 13. The cycle counts are averaged over 500 inferences from the validation dataset and are measured by finding the time taken from the end of execution of the previous layer to the end of execution of the current layer. The time for copying the SBP inputs and membrane potentials to L1 is not included in the individual layer cycle counts. The ANN takes many more cycles in the first layer due to the larger fully connected structure after the temporal convolution. All the implementations have the same sizes for the layer 2,3, and 4. For the SNN implementations, the input to layer 2, 3, and 4 are binary spikes. Therefore, the sparse copy implementation can leverage the sparsity and achieve the lowest cycle counts. The overhead of queuing DMA transfers in the sparse copy implementation for the weights of the subsequent layer is demonstrated in the increased execution time of Layer 1 of the sparse copy method over the SNN baseline implementation. Nevertheless, the gains made in subsequent layers justify this cost.

To further evaluate how well the sparsity is utilized, we

also measure the number of cycles required for each layer when there are different numbers of spikes. The number of spikes is data-dependent. The distribution for spike counts observed while performing inference on the validation dataset is shown in Fig. 12 for the sparse copy implementation. The cycle count for the sparse copy exhibits a strong linear relationship with the spike count, which is always significantly lower than that of the SNN baseline implementation in the measurement range. However, it is expected that at a higher spike rate, i.e., lower sparsity, the slope will change as the DMA peripheral becomes saturated and the bottleneck shifts away from execution towards data transfer.

# F. Power, energy, and latency measurement

The measurements for power, energy, and latency are done on the physical GAP9 module. The chip runs at 150 MHz, 0.65 V, for all measurements in order to maximize energy efficiency. The current consumption of the GAP9 chip is measured using a DC power analyzer, from which the power usage is then calculated.



Fig. 13. Number of cycles used per layer for the ANN and SNN implementations on GAP9. The reported SNNs include an SNN baseline implementation without sparsity optimization, and an SNN effectively utilizes the sparsity noted as an SNN sparse copy. Note that the SNN does not have a temporal convolution layer, and the first fully connected layer of the ANN is 16 times larger than that of the SNN.

The measurement results for the SNN baseline, sparse copy, and ANN implementations, taken over 3000 inferences, are shown in Table IV. The average energy consumption for the proposed implementation when running the inference continuously is  $1.88 \,\mu$ J per inference and the average latency is  $0.12 \,\mathrm{ms}$ , while the SNN baseline implementation takes  $3.90 \,\mu$ J per inference and  $0.22 \,\mathrm{ms}$  latency. The baseline ANN takes a much higher energy of  $10.69 \,\mu$ J per inference and  $0.66 \,\mathrm{ms}$ .

For dataset A, which is utilized in the power measurements, the sampling interval is 50 ms. To save power and energy, the chip is configured to a 'light sleep' mode between inferences. In this mode, the cluster and fabric controller are powered down, reducing consumption to  $\approx 395 \,\mathrm{uW}$ . However, L2 is the only retentive memory region during sleep, requiring the membrane potentials for each layer to be copied to L2 between each inference. A GPIO pad was configured as a 'wake-up source' from sleep and triggered from an external waveform generator every 50 ms. The power trace for four inferences is shown in Fig. 14. When the time spent asleep is included, the average power usage is just  $0.50 \,\mathrm{mW}$ . Due to the long sleep time in this setup, the power is dominated by the sleeping power.

## G. Discussion

The proposed implementation shows significantly improved performance, particularly in accuracy, energy efficiency, and latency for continuous measurement. This can be attributed to the following factors: 1) The sparsity is efficiently utilized. The proposed implementation effectively reduces the number of memory accesses and executions. 2) The proposed decoder exhibits a high level of sparsity. Utilizing sparsity in MCU comes with the overhead of handling irregular patterns. This includes the need to determine which parameters to retrieve based on the output of each individual neuron. An excess



Fig. 14. Power trace captured over four inferences of the SNN sparse copy implementation, running at 150 MHz, 0.65 V; including sleep between each inference. A zoomed-in capture of a single inference is shown in the top right.

of spikes could lead to congestion and degrade performance. However, the high sparsity in our network prevents us from these issues. 3) The MCU features a multi-core cluster with a DMA. The DMA allows us to hide the latency of fetching the irregularly positioned parameters by transferring data while executing. However, if more spikes occur, the DMA peripheral can be saturated, and the bottleneck may shift from execution to data transfer. Therefore, it is essential to analyze the spike rate for each layer to develop effective strategies for data execution and transfer for the best efficiency. 4) The proposed SNN has a much higher number of connections compared to neurons. Despite the overhead of calculating each neuron's internal status, the overhead is negligible in our fully connected structure. In total, the SNN contains 156160 connections while having only 770 neurons. Therefore, the number of operations is dominated by the operations related to the connections between neurons instead of the neurons' internal states.

Our work demonstrates the advantage of using SNNs compared to conventional methods in an application where low energy and power consumption is a critical requirement. Our optimized implementation and in-depth complexity analyses provide important insight for future work in automatizing SNN deployment on ultra-low-power hardware platforms.

# VI. CONCLUSION

We present a new SNN neural decoder to predict finger velocities along with its deployment method for implantable BMI and successfully demonstrate its capability in solving a realworld regression problem. The proposed SNN is trained with STBP backpropagation enhanced by trainable decay factor, reset-by-subtract, and noise injection techniques to improve accuracy while keeping computation complexity low. The model is also fully quantized and deployed with optimization to efficiently utilize the sparsity on a general-purpose GAP9 MCU platform.

Compared to previous works, our SNN achieves the best correlation coefficient of 0.782 and 0.624 for the offline inference, while showing significantly less computation complexity, indicating potential in achieving energy-efficient hardware implementation. The deployed SNN on GAP9 achieves an average latency of  $0.12 \,\mathrm{ms}$  for each inference, which is 5.7X and 2.1X less than the baseline ANN and SNN baseline implementation without exploiting sparsity, respectively. The average power consumption, by duty cycling the MCU with sleep mode, is  $0.50 \,\mathrm{mW}$ . The energy per inference assuming continuous inference without sleep is 1.88 µJ, which is 1.8X less than the SNN baseline implementation and 5.5X less than ANN. To the best of our knowledge, at the time of submission, this is the first sparsity-aware on-board demonstration showing that the SNN can be competitive in terms of latency and power consumption, even with general-purpose hardware platforms.

#### ACKNOWLEDGMENTS

We would like to thank Dr. Alfio Di Mauro for his advice on SNN training, Dr. Samuel R. Nason-Tomaszewski and Joseph Costello for their help on the datasets, and Cyrill Künzi for his contribution to the integration of the Raytune framework.

#### REFERENCES

- [1] L. R. Hochberg, M. D. Serruya, G. M. Friehs, J. A. Mukand, M. Saleh, A. H. Caplan, A. Branner, D. Chen, R. D. Penn, and J. P. Donoghue, "Neuronal ensemble control of prosthetic devices by a human with tetraplegia," vol. 442, no. 7099, pp. 164–171, number: 7099 Publisher: Nature Publishing Group. [Online]. Available: https://www.nature.com/articles/nature04970
- [2] F. R. Willett, D. T. Avansino, L. R. Hochberg, J. M. Henderson, and K. V. Shenoy, "High-performance brain-to-text communication via handwriting," vol. 593, no. 7858, pp. 249–254, number: 7858 Publisher: Nature Publishing Group. [Online]. Available: https://www.nature.com/articles/s41586-021-03506-2
- [3] H. Lorach, A. Galvez, V. Spagnolo, F. Martel, S. Karakas, N. Intering, M. Vat, O. Faivre, C. Harte, S. Komi, J. Ravier, T. Collin, L. Coquoz, I. Sakr, E. Baaklini, S. D. Hernandez-Charpak, G. Dumont, R. Buschman, N. Buse, T. Denison, I. Van Nes, L. Asboth, A. Watrin, L. Struber, F. Sauter-Starace, L. Langar, V. Auboiroux, S. Carda, S. Chabardes, T. Aksenova, R. Demesmaeker, G. Charvet, J. Bloch, and G. Courtine, "Walking naturally after spinal cord injury using a brain–spine interface," vol. 618, no. 7963, pp. 126–133. [Online]. Available: https://www.nature.com/articles/s41586-023-06094-5
- [4] A. B. Rapeaux and T. G. Constandinou, "Implantable brain machine interfaces: first-in-human studies, technology challenges and trends," vol. 72, pp. 102–111. [Online]. Available: https: //linkinghub.elsevier.com/retrieve/pii/S095816692100183X
- [5] D. Seo, R. M. Neely, K. Shen, U. Singhal, E. Alon, J. M. Rabaey, J. M. Carmena, and M. M. Maharbiz, "Wireless recording in the peripheral nervous system with ultrasonic neural dust," vol. 91, no. 3, pp. 529–539. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0896627316303440
- [6] J. Lim, E. Moon, M. Barrow, S. R. Nason, P. R. Patel, P. G. Patil, S. Oh, I. Lee, H.-S. Kim, D. Sylvester, D. Blaauw, C. A. Chestek, J. Phillips, and T. Jang, "26.9 a 0.19×0.17mm2 wireless neural recording IC for motor prediction with near-infrared-based power and data telemetry," in 2020 IEEE International Solid- State Circuits Conference - (ISSCC), pp. 416–418, ISSN: 2376-8606.
- [7] Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi, "Spatiotemporal backpropagation for training high-performance spiking neural networks," vol. 0, publisher: Frontiers. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fnins.2018.00331/full
- [8] H. Zheng, Y. Wu, L. Deng, Y. Hu, and G. Li, "Going deeper with directly-trained larger spiking neural networks." [Online]. Available: http://arxiv.org/abs/2011.05280

- [9] E. Donati, M. Payvand, N. Risi, R. Krause, and G. Indiveri, "Discrimination of EMG signals using a neuromorphic implementation of a spiking neural network," vol. 13, no. 5, pp. 795–803, conference Name: IEEE Transactions on Biomedical Circuits and Systems. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8747378
- [10] F. Baracat, A. Mazzoni, S. Micera, G. Indiveri, and E. Donati, "Neuromorphic event-based processing of transradial intraneural recording for online gesture recognition." [Online]. Available: https://www.techrxiv.org/users/722218/articles/707510-neuromorphicevent-based-processing-of-transradial-intraneural-recording-for-onlinegesture-recognition
- [11] S. Narayanan, K. Taht, R. Balasubramonian, E. Giacomin, and P.-E. Gaillardon, "SpinalFlow: An architecture and dataflow tailored for spiking neural networks," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, pp. 349–362. [Online]. Available: https://ieeexplore.ieee.org/document/9138926/
- [12] K. Malcolm and J. Casco-Rodriguez, "A comprehensive review of spiking neural networks: Interpretation, optimization, efficiency, and best practices." [Online]. Available: http://arxiv.org/abs/2303.10780
- [13] X. Wang, M. Hersche, M. Magno, and L. Benini, "MI-BMInet: An efficient convolutional neural network for motor imagery brain-machine interfaces with EEG channel selection," vol. 24, no. 6, pp. 8835–8847. [Online]. Available: https://ieeexplore.ieee.org/document/10409134/
- [14] X. Wang, L. Cavigelli, T. Schneider, and L. Benini, "Sub-100 µW multispectral riemannian classification for EEG-based brain-machine interfaces," vol. 15, no. 6, pp. 1149–1160. [Online]. Available: https://ieeexplore.ieee.org/document/9658220/
- [15] X. Liu and A. G. Richardson, "Edge deep learning for neural implants: a case study of seizure detection and prediction," vol. 18, no. 4, p. 046034. [Online]. Available: https://iopscience.iop.org/article/10.1088/ 1741-2552/abf473
- [16] J. Liao, L. Widmer, X. Wang, A. Di Mauro, S. R. Nason-Tomaszewski, C. A. Chestek, L. Benini, and T. Jang, "An energy-efficient spiking neural network for finger velocity decoding for implantable brainmachine interface," in 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), pp. 134–137.
- [17] S. R. Nason, M. J. Mender, A. K. Vaskov, M. S. Willsey, P. G. Patil, and C. A. Chestek, "Real-time linear prediction of simultaneous and independent movements of two finger groups using an intracortical brain-machine interface." [Online]. Available: http://biorxiv.org/lookup/doi/10.1101/2020.10.27.357228
- [18] M. S. Willsey, S. R. Nason-Tomaszewski, S. R. Ensel, H. Temmar, M. J. Mender, J. T. Costello, P. G. Patil, and C. A. Chestek, "Real-time brain-machine interface in non-human primates achieves high-velocity prosthetic finger movements using a shallow feedforward neural network decoder," vol. 13, no. 1, p. 6899, number: 1 Publisher: Nature Publishing Group. [Online]. Available: https: //www.nature.com/articles/s41467-022-34452-w
- [19] G. Leone, L. Martis, L. Raffo, and P. Meloni, "Spiking neural networks for integrated reach-to-grasp decoding on FPGAs," in 2023 IEEE Biomedical Circuits and Systems Conference (BioCAS). IEEE, pp. 1–5. [Online]. Available: https://ieeexplore.ieee.org/document/10389037/
- [20] H. An, S. R. Nason-Tomaszewski, J. Lim, K. Kwon, M. S. Willsey, P. G. Patil, H.-S. Kim, D. Sylvester, C. A. Chestek, and D. Blaauw, "A power-efficient brain-machine interface system with a sub-mw feature extraction and decoding ASIC demonstrated in nonhuman primates," vol. 16, no. 3, pp. 395–408, conference Name: IEEE Transactions on Biomedical Circuits and Systems.
- [21] Y. Chen, E. Yao, and A. Basu, "A 128-channel extreme learning machine-based neural decoder for brain machine interfaces," vol. 10, no. 3, pp. 679–692. [Online]. Available: http://ieeexplore.ieee.org/ document/7348721/
- [22] M. A. Shaeri, U. Shin, A. Yadav, R. Caramellino, G. Rainer, and M. Shoaran, "33.3 MiBMI: A 192/512-channel 2.46mm<sup>2</sup> miniaturized brain-machine interface chipset enabling 31-class brainto-text conversion through distinctive neural codes," in 2024 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, pp. 546–548. [Online]. Available: https://ieeexplore.ieee.org/document/ 10454533/
- [23] F. Boi, T. Moraitis, V. De Feo, F. Diotalevi, C. Bartolozzi, G. Indiveri, and A. Vato, "A bidirectional brain-machine interface featuring a neuromorphic hardware decoder," vol. 10, publisher: Frontiers. [Online]. Available: https://www.frontiersin.org/journals/neuroscience/ articles/10.3389/fnins.2016.00563/full
- [24] C. Boretti, L. Prono, C. Frenkel, G. Indiveri, F. Pareschi, M. Mangia, R. Rovatti, and G. Setti, "Event-based classification with recurrent spiking neural networks on low-end micro-controller units," in 2023

- [25] J. L. Collinger, B. Wodlinger, J. E. Downey, W. Wang, E. C. Tyler-Kabara, D. J. Weber, A. J. McMorland, M. Velliste, M. L. Boninger, and A. B. Schwartz, "7 degree-of-freedom neuroprosthetic control by an individual with tetraplegia," vol. 381, no. 9866, pp. 557–564. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3641862/
- [26] G. Hotson, D. P. McMullen, M. S. Fifer, M. S. Johannes, K. D. Katyal, M. P. Para, R. Armiger, W. S. Anderson, N. V. Thakor, B. A. Wester, and N. E. Crone, "Individual finger control of the modular prosthetic limb using high-density electrocorticography in a human subject," vol. 13, no. 2, p. 026017. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4875758/
- [27] A. B. Ajiboye, F. R. Willett, D. R. Young, W. D. Memberg, B. A. Murphy, J. P. Miller, B. L. Walter, J. A. Sweet, H. A. Hoyen, M. W. Keith, P. H. Peckham, J. D. Simeral, J. P. Donoghue, L. R. Hochberg, and R. F. Kirsch, "Restoration of reaching and grasping in a person with tetraplegia through brain-controlled muscle stimulation: a proof-of-concept demonstration," vol. 389, no. 10081, pp. 1821–1830. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5516547/
- [28] L. R. Hochberg, D. Bacher, B. Jarosiewicz, N. Y. Masse, J. D. Simeral, J. Vogel, S. Haddadin, J. Liu, S. S. Cash, P. van der Smagt, and J. P. Donoghue, "Reach and grasp by people with tetraplegia using a neurally controlled robotic arm," vol. 485, no. 7398, pp. 372–375. [Online]. Available: http://www.nature.com/articles/nature11076
- [29] D. Sussillo, P. Nuyujukian, J. M. Fan, J. C. Kao, S. D. Stavisky, S. Ryu, and K. Shenoy, "A recurrent neural network for closed-loop intracortical brain-machine interface decoders," vol. 9, no. 2, p. 026027. [Online]. Available: https://iopscience.iop.org/article/10.1088/ 1741-2560/9/2/026027
- [30] T. Hosman, M. Vilela, D. Milstein, J. N. Kelemen, D. M. Brandman, L. R. Hochberg, and J. D. Simeral, "BCI decoder performance comparison of an LSTM recurrent neural network and a kalman filter in retrospective simulation," in 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER), pp. 1066–1071, ISSN: 1948-3554.
- [31] J. I. Glaser, A. S. Benjamin, R. H. Chowdhury, M. G. Perich, L. E. Miller, and K. P. Kording, "Machine learning for neural decoding," vol. 7, no. 4, pp. ENEURO.0506–19.2020. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7470933/
- [32] M. S. Willsey, S. R. Nason, S. R. Ensel, H. Temmar, M. J. Mender, J. T. Costello, P. G. Patil, and C. A. Chestek, "Real-time brain-machine interface achieves high-velocity prosthetic finger movements using a biologically-inspired neural network decoder," [Online]. Available: http://biorxiv.org/lookup/doi/10.1101/2021.08.29.456981
- [33] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," vol. 105, no. 12, pp. 2295–2329, conference Name: Proceedings of the IEEE.
- [34] B. Rueckauer, I.-A. Lungu, Y. Hu, M. Pfeiffer, and S.-C. Liu, "Conversion of continuous-valued deep networks to efficient eventdriven networks for image classification," vol. 11, p. 682. [Online]. Available: https://www.frontiersin.org/article/10.3389/fnins.2017.00682
- [35] J. Dethier, V. Gilja, P. Nuyujukian, S. A. Elassaad, K. V. Shenoy, and K. Boahen, "Spiking neural network decoder for brain-machine interfaces," p. 10.1109/NER.2011.5910570. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3864805/
- [36] T. Masquelier and S. J. Thorpe, "Unsupervised learning of visual features through spike timing dependent plasticity," vol. 3, no. 2, pp. 1–11, publisher: Public Library of Science. [Online]. Available: https://doi.org/10.1371/journal.pcbi.0030031
- [37] D. Neil and S.-C. Liu, "Minitaur, an event-driven FPGA-based spiking network accelerator," vol. 22, no. 12, pp. 2621–2628, conference Name: IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
- [38] S. Moradi, N. Qiao, F. Stefanini, and G. Indiveri, "A scalable multicore architecture with heterogeneous memory structures for dynamic neuromorphic asynchronous processors (DYNAPs)," vol. 12, no. 1, pp. 106–122, conference Name: IEEE Transactions on Biomedical Circuits and Systems.
- [39] C. Frenkel, J.-D. Legat, and D. Bol, "MorphIC: A 65-nm 738k-synapse/mm\$<sup>2</sup>\$ quad-core binary-weight digital neuromorphic processor with stochastic spike-driven online learning," vol. 13, no. 5, pp. 999–1010. [Online]. Available: http://arxiv.org/abs/1904.08513
- [40] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G.-J. Nam, B. Taba, M. Beakes, B. Brezzo, J. B. Kuang, R. Manohar, W. P. Risk, B. Jackson, and D. S. Modha, "TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip," vol. 34, no. 10, pp.

1537–1557, conference Name: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

- [41] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, Y. Liao, C.-K. Lin, A. Lines, R. Liu, D. Mathaikutty, S. McCoy, A. Paul, J. Tse, G. Venkataramanan, Y.-H. Weng, A. Wild, Y. Yang, and H. Wang, "Loihi: A neuromorphic manycore processor with on-chip learning," vol. 38, no. 1, pp. 82–99, conference Name: IEEE Micro.
- [42] S. R. Nason, A. K. Vaskov, M. S. Willsey, E. J. Welle, H. An, P. P. Vu, A. J. Bullard, C. S. Nu, J. C. Kao, K. V. Shenoy, T. Jang, H.-S. Kim, D. Blaauw, P. G. Patil, and C. A. Chestek, "A low-power band of neuronal spiking activity dominated by local single units improves the performance of brain-machine interfaces," vol. 4, no. 10, pp. 973–983. [Online]. Available: http://www.nature.com/articles/s41551-020-0591-0
- [43] E. Izhikevich, "Which model to use for cortical spiking neurons?" vol. 15, no. 5, pp. 1063–1070, conference Name: IEEE Transactions on Neural Networks.
- [44] W. Gerstner, W. M. Kistler, R. Naud, and L. Paninski, Neuronal Dynamics: From Single Neurons to Networks and Models of Cognition. Cambridge University Press. [Online]. Available: https://www.cambridge.org/core/books/neuronal-dynamics/ 75375090046733765596191E23B2959D
- [45] P.-Y. Tan, C.-W. Wu, and J.-M. Lu, "An improved STBP for training high-accuracy and low-spike-count spiking neural networks," in 2021 Design, Automation Test in Europe Conference Exhibition (DATE), pp. 575–580, ISSN: 1558-1101.
- [46] S. B. Shrestha and G. Orchard, "SLAYER: Spike layer error reassignment in time," in Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc. [Online]. Available: https://papers.nips.cc/paper/2018/hash/ 82f2b308c3b01637c607ce05f52a2fed-Abstract.html
- [47] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization." [Online]. Available: https://openreview.net/forum?id=Bkg6RiCqY7
- [48] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, "A survey of quantization methods for efficient neural network inference." [Online]. Available: http://arxiv.org/abs/2103.13630
- [49] M. Horowitz, "1.1 computing's energy problem (and what we can do about it)," in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14, ISSN: 2376-8606.