Developing methods towards better deep brain stimulation treatments.
We used the activity of a few neurons in the brains of rats to predict the timing of audio beeps they were hearing. We hope that the models and insights we developed for this problem will eventually be useful on the path towards improving treatments for Parkinson’s disease and other neurological disorders.
Deep brain stimulation is being used to help alleviate the symptoms of severe Parkinson’s disease, dystonia, and epilepsy (with recent promising results in treating severe depression). It involves inserting electrodes into specific parts of the brain, as shown in figure 1, and then later tuning their electrical pulses to alleviate symptoms. Once the pulse pattern is set, it is usually left pulsing continuously.
Recently, there has been growing interest in using deep brain stimulation electrodes to measure the activity of the surrounding neurons in the patient’s brain. If the severity of symptoms could be decoded from neural activity, this might be used to automatically adapt the pattern of pulses in response, as shown in figure 2. This idea is known as closed-loop or adaptive deep brain stimulation.
Being able to implement adaptive deep brain stimulation without implanting extra sensors would require reliable monitoring of the patient’s symptoms from just a few recording channels of local field potentials in the brain. This is a difficult problem, akin to looking at a few transistors in a computer and trying to guess whether malware is running. We thought it sounded like a task suited to machine learning (ML).
There have been many attempts to use ML to decode local neural signals, a few of them using data from people with deep brain stimulation implants. Still, most decoding of neural activity uses traditional approaches, despite ML having been shown to be superior in many cases. We partnered with the Queensland Brain Institute to start applying ML methods to some of their pre-existing data, as a first step towards exploring the techniques best suited to adaptive deep brain stimulation.
The dataset consisted of sessions of neural activity in rats exposed to various-pitched beeps over a few minutes. An example session is shown in figure 3: the beeps are shown in red, and the firing of a neuron is shown as a white vertical line. There were five rats in this dataset, with 20–25 sessions per rat, each session containing the activity of 1–50 neurons. The recordings were taken using an electrode similar to the deep brain stimulation electrodes.
Our focus was to decode the timing of beeps from neural activity, as a first step towards decoding symptom severity from neural activity. Although the recordings come from rat brains rather than human ones, and the beeps likely induce a quicker and different neural response than Parkinson’s symptoms, both problems represent a type of inference from the activity of a few neurons.
The main challenges of this data were its heterogeneity and its tiny size. The whole dataset was just 15 megabytes over 114 sessions. On top of this, there were large variations between sessions, with 20 different beep pitches, different numbers and types of neurons being recorded in each rat, and different rats, which were being conditioned in some sessions to respond to some beep patterns and not others.
These problems are unlikely to go away with more realistic clinical data. It is not feasible to obtain mountains of human deep brain stimulation data, and every patient will have different neurons recorded and exhibit different responses. As such, any techniques we develop to deal with the small size and high heterogeneity of the rat data should also be applicable to human data.
Firstly, I tried unsupervised learning, with Toeplitz Inverse Covariance-based Clustering (TICC). This is an unsupervised clustering technique specifically developed for multi-channel time-series data. It involves learning a correlation network for each cluster, containing the characteristic correlations between the data channels across adjacent times for that cluster. The correlation networks for each cluster are slid across time, and each time point is put into the cluster of the strongest-responding correlation network, as shown in figure 4.
I applied TICC to each session independently, because of the heterogeneity of the data across rats and sessions. TICC performed well in the few sessions where neurons were clearly responding to beeps, as in figure 5. Even in these sessions, the neurons usually only clearly respond to one pitch of beeps and not the other. The unsupervised clusters also matched up with some beeps in one or two sessions where there was no obvious neural response, as shown in figure 6. In about 85–90% of the sessions though, the clusters did not line up with the beeps. If the beeps were causing any consistent pattern of neural activity, it was probably insignificant compared to whatever else was going through the rat’s mind. I figured that the model needed to know what to look for, thus requiring supervised learning.
To attempt supervised learning, I had to confront the problems of heterogeneity and the small dataset. Because of these problems, directly applying supervised learning to whole sessions produced poor results. To overcome them, I decided to look only at small time-windows or “snippets” and predict whether a beep was occurring. The price for overcoming heterogeneity and the size of the dataset in this way is that each neuron’s responses are effectively being treated as independent, identical, and localized in just a 40-second snippet. Figure 7 shows some of these snippets. Either they coincide with the start of a beep, or they are cut from a time away from any beeps. Classifying the snippets by eye seems an almost impossible task: it is unlikely that a human would be able to perform much better than chance. If there were some subtle patterns though, we hoped that ML would be able to pick up on them.
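As a rough illustration of the snippet idea, here is a minimal sketch of how balanced positive and negative snippets could be cut from one neuron's spike times. The function names, the `margin` parameter, and the rule that a negative snippet merely has to start away from any beep onset are assumptions made for this example, not details of the original pipeline.

```python
import numpy as np

SNIPPET_LEN = 40.0  # seconds, the snippet length quoted in the text


def cut_snippet(spike_times, start, length=SNIPPET_LEN):
    """Return the spike times inside [start, start + length), relative to start."""
    spike_times = np.asarray(spike_times, dtype=float)
    mask = (spike_times >= start) & (spike_times < start + length)
    return spike_times[mask] - start


def make_snippets(spike_times, beep_onsets, session_end, rng, margin=5.0):
    """Build a balanced list of snippets and labels for one neuron.

    Positive snippets (label 1) start exactly at a beep onset. Negative
    snippets (label 0) start at random times at least `margin` seconds
    from every onset. `margin` is a hypothetical choice for this sketch.
    """
    beep_onsets = np.asarray(beep_onsets, dtype=float)
    snippets, labels = [], []
    for onset in beep_onsets:
        if onset + SNIPPET_LEN <= session_end:
            snippets.append(cut_snippet(spike_times, onset))
            labels.append(1)
    n_positive = len(snippets)
    attempts = 0
    while labels.count(0) < n_positive and attempts < 10_000:
        attempts += 1
        start = rng.uniform(0.0, session_end - SNIPPET_LEN)
        if np.all(np.abs(beep_onsets - start) >= margin):
            snippets.append(cut_snippet(spike_times, start))
            labels.append(0)
    return snippets, np.array(labels)
```

Each snippet is kept as a variable-length array of spike times, so any later feature extraction is free to choose its own representation.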
I started by replicating some of the methods in “Neural Activity Classification with Machine Learning Models Trained on Interspike Interval Series Data” (Lazarevich et al., 2018). These involved creating lists of the time intervals between the firing times of a neuron, known as inter-spike intervals (ISIs), and using the tsfresh statistics package to automatically generate hundreds of relevant statistics on the training set, to be used as features for machine learning models. The statistics generated by tsfresh include things like Fourier components, wavelet transform coefficients, and autocorrelations. I won’t explain the models used to learn on these statistics here, but they fall into two main categories: global approaches and bag-of-pattern approaches, as shown in figure 8.
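To make the ISI representation concrete, the sketch below computes the intervals with NumPy and derives a handful of summary statistics by hand. This is only a small stand-in for the hundreds of features that tsfresh generates; the function name and the particular statistics chosen are assumptions for illustration.

```python
import numpy as np


def isi_features(spike_times):
    """Compute a few summary statistics of a neuron's inter-spike intervals."""
    # ISIs are the differences between consecutive (sorted) firing times.
    isis = np.diff(np.sort(np.asarray(spike_times, dtype=float)))
    if isis.size < 3:
        # Too few spikes to summarize; returning zeros is a hypothetical choice.
        return {"mean": 0.0, "std": 0.0, "cv": 0.0, "lag1_autocorr": 0.0}
    mean = isis.mean()
    std = isis.std()
    centred = isis - mean
    denom = np.sum(centred ** 2)
    # Lag-1 autocorrelation: do long intervals tend to follow long intervals?
    lag1 = np.sum(centred[:-1] * centred[1:]) / denom if denom > 0 else 0.0
    return {
        "mean": float(mean),
        "std": float(std),
        "cv": float(std / mean) if mean > 0 else 0.0,  # coefficient of variation
        "lag1_autocorr": float(lag1),
    }
```

A dictionary like this, computed per snippet, would form one row of the feature table that a classifier is trained on.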
Overall, I found that the bag-of-pattern approaches performed poorly, at no more than 52% accuracy when trained and tested on balanced datasets of different randomly sampled snippets, while global approaches such as XGBoost (a gradient boosting library) performed much better, at up to 61% accuracy. This is consistent with the general trend found by Lazarevich et al., where bag-of-pattern approaches struggled to achieve more than 58% accuracy at distinguishing between awake and asleep neural activity, while XGBoost performed best, with over 70% accuracy. The lower accuracies of our results compared to theirs are indicative of the difficulty of our problem and the challenges of our dataset.
In order to improve on these results, I decided to try feeding the data to the models in a more direct way. To do this, I turned each snippet of firing times into a histogram. I found the best results with 400 histogram bins, one every 0.1 seconds. I again tried many ML models learning on these histograms, including dense neural networks (NNs), 1D convolutional NNs, locally connected NNs, residual NNs, long short-term memory (LSTM) NNs, and XGBoost. Again I found the best results with XGBoost, which achieved 72% accuracy on the randomly sampled snippet histograms, or 74% if given extra information about which rat and neuron the snippet came from. That’s a large jump from the 61% accuracy of the ISI-statistics-based approaches, much better than humans would be able to achieve, and even on par with the wake/sleep classification accuracies of Lazarevich et al. (2018). Considering that their statistics-based methods performed so poorly on our data, it would be interesting to see how our histogram-based methods perform on theirs.
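The histogram representation is simple enough to sketch in a few lines. The example below bins a snippet of relative spike times into 400 bins of 0.1 seconds and optionally appends rat and neuron identifiers as extra columns. The function names and the choice to encode the identifiers as plain numeric columns are assumptions for this sketch.

```python
import numpy as np


def snippet_histogram(snippet, length=40.0, n_bins=400):
    """Count spikes in equal-width bins (0.1 s each for the defaults)."""
    counts, _ = np.histogram(snippet, bins=n_bins, range=(0.0, length))
    return counts.astype(np.float32)


def build_features(snippets, rat_ids=None, neuron_ids=None):
    """Stack snippet histograms into a feature matrix, one row per snippet.

    If given, rat and neuron identifiers are appended as extra numeric
    columns (a hypothetical encoding of the "extra information").
    """
    X = np.stack([snippet_histogram(s) for s in snippets])
    extras = [
        np.asarray(col, dtype=np.float32)[:, None]
        for col in (rat_ids, neuron_ids)
        if col is not None
    ]
    return np.hstack([X] + extras)
```

The resulting matrix has the usual (samples, features) layout, so it could be passed to a gradient-boosted classifier such as `xgboost.XGBClassifier` or to any of the neural network variants listed above.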
Of course, classifying randomly sampled snippets of neural activity is a long way from extracting beep times across a whole session. Firstly, the model should be trained on snippets from earlier sessions and tested on later ones, rather than trained and tested on snippets from random times. However, in this experiment the rats were only trained (conditioned) to actually respond to the beeps in later sessions. Training our model on earlier sessions means that it only sees unconditioned rats, and is then asked to make predictions even as the rats’ brains change in response to what they learn. When we trained the model only on unconditioned rats like this, the accuracy on snippets dropped from 74% to 68%, which is still an impressive result given the dataset shift.
Predicting beep times
Directly running the snippet-histogram XGBoost model trained on unconditioned rats across full sessions to predict when beeps occur produces poor results (see figure 10). Every half a second, the model is fed a histogram of the following 40 seconds of activity in one neural channel. These predictions are then averaged across neural channels and averaged across time with a Gaussian window. If the model’s average predicted probability of a beep is ever larger than some threshold (e.g. 52%), then a beep is predicted to have occurred. The poor performance of these sliding predictions is perhaps because the model was only trained on beeps starting exactly when the snippets start, so it was not robust when slid across the whole session.
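The post-processing described above (average across channels, smooth over time, threshold) can be sketched as follows. The input is assumed to be an array of per-channel beep probabilities, one column per half-second window start. The kernel width `sigma_steps`, the edge padding, and the rule of reporting a beep at each upward threshold crossing are assumptions for this sketch.

```python
import numpy as np


def gaussian_smooth(values, sigma_steps):
    """Smooth a 1D series with a normalized Gaussian kernel (edge-padded)."""
    radius = int(np.ceil(3 * sigma_steps))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma_steps) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(values, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")  # same length as `values`


def predict_beep_times(channel_probs, step=0.5, sigma_steps=2.0, threshold=0.52):
    """Turn per-channel sliding-window probabilities into predicted beep times.

    channel_probs has shape (n_channels, n_steps); column i holds the
    probabilities for the window starting at time i * step seconds.
    """
    mean_probs = np.asarray(channel_probs, dtype=float).mean(axis=0)
    smoothed = gaussian_smooth(mean_probs, sigma_steps)
    above = smoothed > threshold
    # Report one beep per upward crossing of the threshold.
    rising = above & ~np.concatenate(([False], above[:-1]))
    return np.flatnonzero(rising) * step
```

Reporting only upward crossings keeps one long run of high probability from being counted as many separate beeps.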
The lack of robustness was illustrated by looking at which times within each snippet were most important to the model, using Shapley additive explanations (SHAP). On average, the neural activity in the first tenth of a second of each snippet had a 10 times greater impact on the model’s decision than any other time, as shown in figure 11. The model had discovered that it didn’t need to look too hard at deeper patterns of neural activity, because the first tenth of a second told it a lot of what it needed to know about whether a beep had just occurred. This narrow focus is exactly what let the model down when it was asked to make sliding predictions over a whole session, because there are many opportunities for neural activity to look similar in that tiny 0.1 second period at the start of the sliding prediction window, leading to many false predictions.
In order to make the model more robust when slid across whole sessions, I offset each training snippet across a 40-sample distribution from 2 seconds earlier to 2 seconds later. This resulted in lower accuracy on snippets, but much better performance on predicting beep times across whole sessions, as shown in figure 12. In some sessions, our model’s predictions are even perfect when TICC could not generate useful clusters, and no obvious pattern of response is visible to the human eye, as shown in figure 13. On the other hand, there seem to be some sessions where the neurons really aren’t responding to one or more pitches of beeps, and so the model struggles.
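A minimal sketch of this offsetting step is shown below. It assumes the 40 offsets are evenly spaced between 2 seconds before and 2 seconds after the beep onset; the text does not say whether they were evenly spaced or randomly drawn, so treat that, along with the function names, as an assumption.

```python
import numpy as np


def offset_starts(onset, max_shift=2.0, n_offsets=40):
    """Snippet start times shifted from max_shift before to max_shift after onset."""
    return onset + np.linspace(-max_shift, max_shift, n_offsets)


def augmented_snippets(spike_times, onset, length=40.0):
    """Cut one snippet per shifted start time around a single beep onset."""
    spike_times = np.asarray(spike_times, dtype=float)
    snippets = []
    for start in offset_starts(onset):
        mask = (spike_times >= start) & (spike_times < start + length)
        # Spike times are stored relative to the shifted snippet start.
        snippets.append(spike_times[mask] - start)
    return snippets
```

Because the beep no longer lands at a fixed position within the window, a model trained on these snippets cannot rely on the first bin alone.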
To quantify our model’s performance, we ran it across the whole testing set of sessions, with a Gaussian time-averaging window and a beep-prediction threshold chosen to maximize the true positives and minimize the false positives on the first half of each session, with the metrics then evaluated on the second half of each session. The model picked up an average of 47% of the beeps in each session, and if the model predicted a beep, there was an average 41% chance that a beep had really occurred. Of course, in some sessions this is 100% and 100% as in figure 13, and in others it is closer to 0% and 0%, again illustrating just how varied the responses of different neurons can be to the different beeps in different rats at different times.
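The two numbers quoted above correspond to recall (the fraction of real beeps picked up) and precision (the fraction of predicted beeps that were real). The sketch below shows one way to compute them from lists of predicted and actual beep times. The matching rule, in which a prediction counts as correct if it falls within `tolerance` seconds of a real beep, is an assumption, since the text does not state how predictions were matched to beeps.

```python
import numpy as np


def beep_detection_metrics(predicted, actual, tolerance=1.0):
    """Return (recall, precision) for predicted versus actual beep times.

    A real beep counts as detected if some prediction lies within
    `tolerance` seconds of it; a prediction counts as correct if some
    real beep lies within `tolerance` seconds of it.
    """
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    if predicted.size == 0 or actual.size == 0:
        return 0.0, 0.0
    # gaps[i, j] is the time difference between prediction i and beep j.
    gaps = np.abs(predicted[:, None] - actual[None, :])
    recall = float(np.mean(gaps.min(axis=0) <= tolerance))
    precision = float(np.mean(gaps.min(axis=1) <= tolerance))
    return recall, precision
```

Averaging these per-session values over the second half of every test session would give figures comparable in form to the 47% and 41% reported here.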
Although the final results are certainly less than clinical-grade, we have demonstrated many accurate predictions of beep times, even when no response is obvious to a human or to other ML models. My key findings are:
- Toeplitz Inverse Covariance-based Clustering (TICC) is quite a powerful unsupervised method for neural data. In future work, it should be applied to a dataset where it is allowed to learn across sessions.
- Developed as a means of overcoming the heterogeneity and tiny size of our data, our approach of using histograms of snippets outperformed the statistics-based approaches of other published neural decoding methods on our data.
- A model’s accuracy can be improved by feeding it extra information on top of the brain signals.
- Gradient-boosting models like XGBoost consistently outperformed all other models with this small data. Custom NNs may still win with large data.
- Analyzing feature importances with Shapley additive explanations (SHAP) was very useful for seeing what models are looking at and how to make them more robust.
- Feeding a model with manipulated data can make it more generalizable and robust, which is often more important than test accuracy.
These insights should remain applicable as we move closer towards deep brain stimulation data, which will share similar issues. Our methods of overcoming the small size and heterogeneity of data should be especially useful, such as making snippets and manipulating them so as to break the model’s tunnel vision. Adding extra information on top of brain signals may also improve the results of adaptive deep brain stimulation.
Finally, in order to improve our work and bridge the gap back from detecting beeps to implementing adaptive deep brain stimulation, we should:
- Move towards more realistic data, such as raw local field potentials recorded from brains with extended symptom-like activity.
- Investigate a wider range of techniques, such as transfer learning and reinforcement learning, which may be useful for real-time adaptive deep brain stimulation.
Above all, we hope that our research will eventually help to improve people’s lives.
We would like to thank Pankaj Sah’s lab and particularly Dr Alan Woodruff and Dr Francois Windels at the Queensland Brain Institute for supplying the data and discussing the findings. We would also like to thank Maciej Trzaskowski, and everyone at Max Kelsen for their support and supervision during this project.