Machine learning validation of EEG+tACS artefact removal

Objective. Electroencephalography (EEG) recorded during transcranial alternating current stimulation (tACS) is highly desirable in order to investigate brain dynamics during stimulation, but is corrupted by large amplitude stimulation artefacts. Artefact removal algorithms have been presented previously, but with substantial debate about their performance, their utility, and the presence of any residual artefacts. This paper investigates whether machine learning can be used to validate artefact removal algorithms. The postulation is that residual artefacts in the EEG after cleaning would be independent of the experiment performed; if they dominated the cleaned signal, it would be impossible to differentiate between different parts of an EEG+tACS experiment, or between different behavioural tasks performed. Approach. Ten participants undertook two tasks (nBack and backwards digit recall) during simultaneous EEG+tACS, exercising different aspects of working memory. Stimulation with no task, and sham conditions, were also performed. A previously reported tACS artefact removal algorithm from our group was used to clean the EEG, and a linear discriminant analysis was trained on the cleaned EEG to differentiate different parts of the experiment. Main results. Baseline, baseline during tACS, working memory task without tACS, and working memory task with tACS data segments could be differentiated with accuracies ranging from 65%–94%, far exceeding chance levels. EEG from the nBack and backwards digit recall tasks could be separated during stimulation, with an accuracy exceeding 72%. If residual tACS artefacts remained after the EEG cleaning, they did not dominate the classification process. Significance. This helps build confidence that true EEG information is present after artefact removal. Our methodology presents a new approach to validating tACS artefact removal techniques.


Introduction
Recent years have seen the development of many noninvasive brain stimulation technologies. In particular, transcranial electrical stimulation (tES) is a relatively new technique with many applications as a therapeutic and investigative tool [1]. It operates by injecting small amounts of current into the scalp via rubber electrodes that are enclosed in saline soaked sponges. Transcranial alternating current stimulation (tACS), where the applied stimulation has a sinusoidal waveform, is of particular interest as it can potentially be used in a closed loop manner, where the frequency and phase of the stimulation are guided based upon concurrently measured brain information [2,3]. The effects of tES are highly variable between different subjects and studies [4,5], and it is thought that allowing brain state dependent stimulation paradigms may be a key step towards reducing this variability and understanding its causes [6].
Electroencephalography (EEG) is a widely used brain monitoring technique, with a high temporal resolution which makes it highly suitable for use in closed loop applications. However, tACS guiding via EEG is difficult due to the stimulation artefacts that tACS produces in simultaneous EEG monitoring. These are illustrated in figure 1 where it can be seen that the high amplitude, non-linear, tACS artefacts obscure the low amplitude EEG data.
In recent years there have been several investigations into removing this artefact from the EEG by applying post-collection signal processing to the raw EEG data [7–17]. There have also been investigations into the properties of the tACS artefacts [18–22], for example how they spread across the head [19] and how they mix with other bio-signals such as the ECG and breathing patterns [18]. A recent review of methodological aspects of EEG+tACS is given in [23]. However, to date these techniques have received only modest use in the research literature, with most studies restricted to analysing EEG data only before and after the tACS stimulation was applied [7]. There remains much debate as to how good tACS artefact cleaning methods are, how good they need to be, and the significance of any post-cleaning residual artefacts in the EEG trace.
Proving that artefacts have been successfully removed is an intrinsically challenging problem as, in on-person tests, there is no baseline available against which the cleaned EEG can be directly compared. In our previous EEG+tACS artefact removal work [7] we introduced the use of a head phantom, where prerecorded EEG was played into a phantom which then had tACS applied to it, giving a known baseline against which the artefact removal could be validated. This was combined with a number of on-person tests in order to build up the overall confidence that the cleaned EEG was not dominated by residual artefacts (although some residual artefacts, particularly for high stimulation frequencies, undoubtedly remained).
In this paper we investigate an alternative validation approach for checking EEG+tACS artefact removal, based upon machine learning. In our approach we split the data from an EEG+tACS data collection experiment into different classes: EEG when no tACS was used, and cleaned EEG collected during tACS with artefact removal applied; collected both while the subject was performing a working memory task and while the subject was sitting stationary with no task performed. The underlying assumption is that residual artefacts present after tACS artefact removal would be independent of the task being performed. If residual artefacts dominated the cleaned EEG traces then it would not be possible to use machine learning to discriminate between the different parts of the experiment. We use a previously reported artefact removal approach, superposition of moving averages (SMA) from [7,12], with a new experiment designed to investigate this validation approach. We believe that it can be used to add further confidence in the performance of an artefact removal technique.
(Figure 1 caption: The ongoing artefact is not a pure sinusoid at the stimulation frequency, but has an approximately 100 µV ripple present. Images reproduced from [7] under its CC BY 4.0 license.)

Working memory tasks
Subjects were asked to perform two tasks, one after another, exercising similar but different aspects of working memory. The first was a visual nBack task, which tested visual working memory span and executive function. Subjects were presented with a sequence of visual stimuli and asked to report whether the current stimulus matched the nth stimulus before it, as illustrated in figure 2. If the current stimulus matched the stimulus two iterations ago this would be a two-back trial, three iterations ago a three-back trial, and so on. The images used were diagrams of sporting activities, shown in figure 2, easily identifiable by participants. A keyboard press was used to indicate whether the stimuli matched or not, and the next image was presented once a button was pressed.
The second task was a backward digit recall task, testing verbal memory span and executive function. A sequence of single digit numbers was presented to the subject, who was then asked to recall them in reverse order. Numbers were displayed on a screen one at a time, each for 0.8 s, followed by a 0.2 s blank screen. The difficulty of the task was varied by changing the length of the number to recall. After all digits were displayed the subject was asked to type them in on a keyboard, in reverse order of display. Subjects were asked to do as many repetitions of this as possible during the trial block (2 min).
Both these tasks utilise the central executive of the multi-modal working memory model but use different forms of working memory capacity. Both tasks were implemented as customised Matlab scripts, synchronised with our EEG data collection.
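The original tasks were implemented as Matlab scripts; as a minimal illustration of the nBack logic described above, the following Python sketch (all function names are hypothetical, and the 30% match rate is an assumed parameter not stated in the text) generates a randomised stimulus sequence with a controllable proportion of match trials and scores match/no-match responses:

```python
import random

def make_nback_sequence(stimuli, n, length, target_rate=0.3, rng=None):
    """Generate an nBack stimulus sequence.

    Each position after the first n has probability `target_rate` of
    repeating the stimulus from n steps back (a 'match' trial).
    Returns the sequence and the list of correct match/no-match answers.
    """
    rng = rng or random.Random()
    seq = [rng.choice(stimuli) for _ in range(n)]
    answers = [False] * n  # the first n trials can never be matches
    for _ in range(length - n):
        if rng.random() < target_rate:
            seq.append(seq[-n])  # force a match with the n-back stimulus
            answers.append(True)
        else:
            choices = [s for s in stimuli if s != seq[-n]]
            seq.append(rng.choice(choices))
            answers.append(False)
    return seq, answers

def score_responses(answers, responses):
    """Fraction of trials where the match/no-match keypress was correct."""
    correct = sum(a == r for a, r in zip(answers, responses))
    return correct / len(answers)
```

With seven stimuli (A–G as in figure 2) and n = 2, `make_nback_sequence(list("ABCDEFG"), 2, 50)` would yield one 50-trial two-back block.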

Experimental procedures

Practice session
Before the experimental session with EEG and tACS, subjects were invited to a practice session during which they practised the two working memory tasks at different difficulty levels. The purpose was to familiarise subjects with the tasks and also to establish a baseline difficulty at which each individual's performance peaked (n for the nBack task and the number of digits for the digit recall task). This minimised the influence of any learning effects that repetition may have had in the experimental session. In the experimental session the difficulty of each task was set, for each participant, to the n or number of digits at which their performance was 50% in the practice session.

Experimental sessions
After the practice session, each subject attended two experimental sessions. Each session was approximately 1 h long, including the time to set up and test the EEG and tACS equipment. The protocols for the two sessions were identical, with the exception that tACS was applied in one session while the other was a sham session in which no tACS was applied (but the tACS electrodes were still placed, to keep participants blinded to the condition). The experimental sessions were separated by at least 24 h and at most 1 week.
The experimental session was sub-divided into six blocks of EEG+tACS, with each block lasting for 4 min. Each working memory task was repeated three times, giving the total of six experimental blocks. All experiments were conducted between 2 and 5 pm on a weekday to minimise baseline EEG variations, and subjects were allowed to take breaks between blocks to prevent fatigue.
Within one block the experiment was further sub-divided into the following three stages:
1. Pre: A one minute recording of background EEG data before the working memory task was performed. No tACS was applied in this period.
2. During: A two minute period during which one of the working memory tasks was performed. EEG data was recorded, and tACS applied (or applied as sham).
3. Post: A one minute recording of background EEG data after the working memory task was performed. No tACS was applied in this period.
This division of the experiment is illustrated in figure 3.
No breaks were present between the Pre, During and Post stages. To prevent fatigue, after each block there was a break of at least 5 min, after which the participant was asked when they were ready for the next block.
In line with our previous tACS artefact removal work [7], deliberately short 2 min stimulations were used with the intention of avoiding behavioural effects. This allows the artefact removal problem to be investigated without the confounding factor of potential long term changes in brain state. Here we do not investigate the behavioural impact of tACS on task performance; rather, the aim is to investigate the artefact removal validation procedure.
All subjects were also asked to do a control session which consisted of a 5 min tACS stimulation, with EEG also collected, where subjects were asked to relax, keep their eyes open and minimise movements. This was intended to mirror the Pre and Post sections of each block, but now with tACS applied. This provided a baseline for the stimulation condition where no working memory task was being performed.

Participants
For this proof of concept work there were a total of ten subjects, five male and five female, aged 20–35 years. The study was deliberately not powered to investigate the behavioural impact of tACS on task performance, as this was not the study aim. Instead, ten subjects were selected as sufficient to investigate the artefact removal validation procedure, and to be in line with other early stage EEG+tACS artefact removal works (reviewed in [7] and in the discussion section here).
All procedures used in this study were reviewed and approved by the University of Manchester Research Ethics Committee, application number 14263. Procedures involving human participants were performed in accordance with the ethical standards of the institution, and written informed consent was obtained from all individual participants included in the study.

EEG and tACS setup
An 8 channel EEG montage was used for this study with the ground and reference channel placed at Cz. The 8 electrode positions were determined by using the ten-twenty system: Fp1, F3, F4, C3, C4, P4, O1, O2. All electrode impedances were below 10 kΩ and the data was sampled at 500 Hz. EEG was recorded using an Enobio EEG recorder (Neuroelectrics, Spain). Our EEG+tACS artefact removal procedures are channel count independent, and using a low number of electrodes is important for allowing for quicker experimental set ups which may be critical for clinical applications and when working with subjects who are vulnerable [7].
tACS stimulation was delivered using the DC-Stimulator PLUS (Neuroconn, Germany). The tACS electrodes were placed at Fp2 and P3, selected to match montages that target working memory effects in previous tACS research [24,25]. Stimulation was performed at 1 mA peak-to-peak at 5 Hz for the duration of the working memory tasks (2 min) with a ramp up and down time of 1 s each. Again, the choice of 1 mA and 5 Hz stimulation was to mimic previously used choices in experiments looking at the effects of parietal stimulation in working memory [24,25].

Data analysis

Pre-processing
All recorded data was analysed offline in Matlab (Mathworks, USA). Prior to analysis the EEG data was filtered using 3rd order low pass and high pass Butterworth filters with cutoffs at 45 Hz and 0.5 Hz respectively.
If tACS stimulation was applied, the tACS artefact was removed using the superposition of moving averages (SMA) approach, described in and unmodified from [7]. This gave either raw EEG data (from baseline and sham periods) or cleaned EEG data (from tACS periods) for use with a machine learning classifier.
(Figure 2 caption: Top: the nBack working memory task. Stimuli A–G were presented in a randomised sequence and the subject asked to identify whether the current stimulus was the same as the one from n iterations before. Bottom: shapes used for the nBack task depicting sporting activities. Images taken from www.clker.com under a creative commons CC0 public domain license.)

Feature matrix generation
For input to the machine learning classifier the EEG signal for each channel was split into non-overlapping windows of 5 s. Small window sizes were used to increase the number of epochs available for classification, with 5 s selected to ensure at least ten cycles of the lowest frequency of interest (4 Hz theta) were present in each window. This duration also ensured that the potential non-linear heart beat and breathing artefacts discussed in [18] would be present in each analysis window.
The EEG data in each 5 s window was downsampled to 256 Hz and decomposed using the discrete wavelet transform with the db1 wavelet basis function. This gave time-frequency estimates of the signal approximately matching the widely used cognitive frequency bands: theta (4–8 Hz), alpha (8–16 Hz), beta (16–32 Hz) and gamma (32–64 Hz). In each band the power and Shannon entropy were calculated, and these features used as the inputs to the machine learning. In total 64 features were present for each time window of data: eight channels, with two measures from each of four frequency bands. These features were selected because band powers are widely used across many EEG applications, and entropy is a measure of order in the brain which would be affected by tACS. For example, some works (e.g. [26]) have suggested that closed loop tACS could be adjusted in terms of entropy slope rather than only the amplitude and frequency commonly considered.
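A hedged Python sketch of this feature pipeline is given below. Since db1 is the Haar wavelet, the dyadic decomposition can be written directly in NumPy (the input is assumed to already be at 256 Hz); the exact entropy definition used in the original work is not stated, so Shannon entropy over normalised squared wavelet coefficients is assumed, and all function names are hypothetical:

```python
import numpy as np

FS = 256       # Hz, after downsampling
WIN = 5 * FS   # 5 s windows -> 1280 samples

def haar_dwt(x, levels):
    """Single-channel db1 (Haar) wavelet decomposition.

    Returns detail coefficients per level; level k spans roughly
    FS/2**(k+1) .. FS/2**k Hz.
    """
    details = []
    a = np.asarray(x, dtype=float)
    for _ in range(levels):
        d = (a[0::2] - a[1::2]) / np.sqrt(2.0)
        a = (a[0::2] + a[1::2]) / np.sqrt(2.0)
        details.append(d)
    return details

def band_features(window):
    """Power and Shannon entropy in four dyadic bands for one channel.

    With FS = 256 Hz, detail levels 2-5 approximate gamma (32-64 Hz),
    beta (16-32 Hz), alpha (8-16 Hz) and theta (4-8 Hz).
    """
    details = haar_dwt(window, levels=5)
    feats = []
    for d in details[1:5]:            # skip level 1 (64-128 Hz)
        power = np.mean(d ** 2)
        p = d ** 2 / np.sum(d ** 2)   # normalised coefficient energies
        entropy = -np.sum(p * np.log2(p + 1e-12))
        feats.extend([power, entropy])
    return feats

def feature_matrix(eeg):
    """eeg: samples x 8 channels -> (epochs x 64) feature matrix."""
    n_win = eeg.shape[0] // WIN
    rows = []
    for w in range(n_win):
        seg = eeg[w * WIN:(w + 1) * WIN]
        row = []
        for ch in range(seg.shape[1]):
            row.extend(band_features(seg[:, ch]))
        rows.append(row)
    return np.array(rows)
```

Eight channels with two measures in each of four bands gives the 64 features per 5 s epoch described in the text.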

Classification
We created four different machine learning models, fitted and tested with different variants of the available feature data, in order to examine the separability performance in different cases. All models used a linear discriminant analysis (LDA) classifier based on Fisher's discriminant analysis [27], and the associated built-in Matlab function. This classifier assumes that the data in each class is drawn from Gaussian distributions with different means but the same covariance, and aims to express the dependent variable (the class) as a linear combination of the features. It was extended to multiclass models by [28]. In all cases 65% of the data was used to train the classifier and the remaining 35% used to test its performance, with a permutation test (described below) used to cross-validate this performance.
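The original classifier used Matlab's built-in LDA; an equivalent sketch in Python with scikit-learn, repeating the 65/35 split over random partitions of the epochs (function name hypothetical; stratified splitting is an assumption), might look like:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

def evaluate_split(X, y, train_frac=0.65, n_repeats=1000, seed=0):
    """Fit an LDA on a random 65% of epochs, test on the remaining 35%.

    Repeating over random splits yields a distribution of accuracies
    whose spread (the error bars in the results figures) captures any
    time-ordering effects in the epoch sequence.
    """
    rng = np.random.RandomState(seed)
    accs = []
    for _ in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=train_frac, stratify=y,
            random_state=rng.randint(2**31 - 1))
        lda = LinearDiscriminantAnalysis()
        lda.fit(X_tr, y_tr)
        accs.append(lda.score(X_te, y_te))
    return np.mean(accs), np.std(accs)
```

`X` is the (epochs x 64) feature matrix and `y` the per-epoch class labels; the returned mean and standard deviation correspond to the bar heights and error bars in the results.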
The first model split the data into four classes. The baseline class (B) came from data collected before and after each working memory task. The task class (T) came from when the subject was performing the nBack task without tACS stimulation present. The task + stimulation class (T + S) came from EEG data recorded while the subject was performing the task in the stimulation experimental session, and the baseline + stimulation class (B + S) came from the control session where no task was performed. Each task was repeated three times, giving six blocks in each of the stimulation and sham sessions. This data split is illustrated in figure 3. The procedure was performed separately for each person, with a different LDA trained and tested for each subject. The second model also split the data into four classes, identical to the above, but this time with the task data taken from the backwards digit recall task. These first two models are intended to demonstrate the possibility of separating data collected during the task from the baseline cases, regardless of whether stimulation was present or not.
The third model split the data into six classes:
1. Baseline: the Pre and Post sections of each block, where no task and no tACS were present.
2. Baseline + stimulation: the control session, where no task was present, but tACS was.
3. nBack task: data from during the nBack working memory task when no tACS was applied (sham).
4. nBack task + stimulation: data from during the nBack working memory task when tACS was applied.
5. Backwards digit recall task: data from during the backwards digit recall working memory task when no tACS was applied (sham).
6. Backwards digit recall task + stimulation: data from during the backwards digit recall working memory task when tACS was applied.
In this approach data from both tasks was used to build a single model, such that the classifier could be used to differentiate between the two different measures of working memory span (nBack and backwards digit recall) during tACS stimulation. The three models discussed so far were developed and applied individually to data from each subject. For each analysis, ten different subject dependent classifiers were made, one for each participant, using their own training and test sections of data. However, in practical use this would necessitate a training session each time a new subject was recruited, in order to generate a new model for that person. Our fourth model was therefore a subject independent classifier, built to investigate the data separation performance when the same classifier was used for all subjects. This classifier pooled the training data from all participants to train one classifier, which was then used on the test data from each participant. It had the benefit of using more data to develop the classifier, which may lead to better performance, at the cost of increased variability in the input data, which may lead to poorer performance as the classifier is not optimised for each separate person.
To create the subject independent model, the feature matrices from all subjects were normalised by dividing each feature observation by the absolute maximum of that feature for each subject. These normalised features were then concatenated to create a single matrix of normalised features from all subjects. Subsequently, 50% of this data was selected at random to train a classifier, which was then used to classify the normalised data from each subject separately. This cross-validation was repeated 1000 times.
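This per-subject normalisation step can be sketched as follows (a minimal NumPy illustration; the function name is hypothetical, and the zero-feature guard is an added safety check not described in the text):

```python
import numpy as np

def normalise_per_subject(feature_mats):
    """Scale each subject's features by the per-feature absolute maximum,
    then stack all subjects into one pooled matrix.

    feature_mats: list of (epochs x features) arrays, one per subject.
    After scaling, every feature lies in [-1, 1] for every subject,
    removing gross between-subject amplitude differences.
    """
    scaled = []
    for X in feature_mats:
        X = np.asarray(X, dtype=float)
        max_abs = np.max(np.abs(X), axis=0)
        max_abs[max_abs == 0] = 1.0  # guard against all-zero features
        scaled.append(X / max_abs)
    return np.vstack(scaled)
```

The pooled matrix is then split 50/50 at random into training and test data for the subject independent classifier.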

Classification validity
In all of the models, since the data sample sizes are not equal (there are different numbers of epochs for each class), the chance level of randomly guessing the class is not one divided by the number of classes present. The actual chance level was calculated by permutation, generating 10 000 random classifier outputs, splitting these to match the proportion of each class of data present, and assessing the chance classification accuracy obtained. These chance levels of performance are reported alongside our results for the actual model classification.
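The permutation estimate of the chance level can be sketched as below. The exact scoring details of the original procedure are not given, so here random classifier outputs drawn with the data's class proportions are scored against equally random labels; the expected accuracy of such a classifier is the sum of the squared class proportions, rather than one divided by the number of classes:

```python
import numpy as np

def chance_level(class_counts, n_perm=10_000, seed=0):
    """Empirical chance accuracy for imbalanced classes.

    Draws random classifier outputs with the same class proportions as
    the data and scores them against equally random labels, as a sketch
    of the permutation procedure described in the text.
    """
    rng = np.random.RandomState(seed)
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    n = int(counts.sum())
    labels = np.arange(len(counts))
    accs = np.empty(n_perm)
    for i in range(n_perm):
        truth = rng.choice(labels, size=n, p=p)
        guess = rng.choice(labels, size=n, p=p)
        accs[i] = np.mean(truth == guess)
    return accs.mean()
```

For four balanced classes this recovers the familiar 25% chance level; with imbalanced classes the chance level rises above one over the number of classes, which is why it must be reported alongside the model accuracies.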
In addition, an assumption for supervised learning using classifiers is that the data is independent and identically distributed (IID) [29]. This means that the order of the data is not relevant and thus all epochs are drawn from the same arbitrary distribution [30]. In practice this means that the order in which the data (each epoch) is presented to the classifier should not affect the classification performance. This is not automatically satisfied for time series data such as the EEG, which evolves over time, where samples collected close together in time may be more similar than those collected further apart [31].
To test this assumption, our data split using 65% of the data to train the classifier and the remaining 35% to test its performance was repeated 1000 times, using random selections of data windows to form the training and test data sets. This process randomised the feature data matrix, mixing the time order of the collected features. The results of this test are included here as the error bars in the classification results graphs; any change in performance due to time ordering effects would be captured in these error bars.

Results

Figure 4 summarises the fraction of epochs classified correctly by the first three models (four class nBack, four class digit recall, and six class), and the chance performance level for each. In all cases, the accuracy of identifying epochs far exceeds that of a chance classifier. For both four class models (i.e. regardless of which working memory task was performed) the accuracy was over 80%, showing that it was possible to differentiate between cleaned EEG collected from the different parts of the experiment using our LDA classifier. This is shown in more detail in figure 5, which gives the per-class accuracies within these overall performance results, for both the nBack and backwards digit recall classification models. The recognition accuracy varies depending on the class, with the worst accuracy in recognising the baseline case, and the best in recognising epochs from the baseline + stimulation case. With these levels of accuracy it was clearly possible to separate cleaned EEG epochs collected during tACS from those where no tACS was performed.

Four class classification accuracy
The ease of separating stimulation epochs from non-stimulation epochs could be a potential indicator that residual artefacts dominate the cleaned EEG data. If present, these artefacts may give a distinguishable feature which would make it easy to separate baseline from baseline + stimulation, and task from task + stimulation. However, this explanation is not supported by the high classification accuracy of both the task + stimulation and baseline + stimulation cases. While the accuracy was not perfect, these cases are not being interchanged, suggesting that if residual artefacts were present and dominating the classification decision, these residual artefacts must differ depending on whether or not a working memory task was being performed.

Six class classification accuracy
The above is further supported by the six class classification results. The overall average, across all subjects and classes, is shown in figure 4. At 75.1% this accuracy far exceeds that from randomly assigning class labels. The breakdown of per-class accuracy is shown in figure 6. Importantly, the classifier was able to separate the nBack and backwards digit recall tasks, with and without stimulation. During tACS stimulation, epochs from the nBack task were recognised with an 80% accuracy, and epochs from the backwards digit recall task with a 72% accuracy. If residual artefacts remained present after the data cleaning, and these dominated the classification process, then these residual artefacts would also have to vary with which task was being performed. Our methodology cannot rule this out, but the tasks were carefully selected to be similar, yet different, to minimise this possibility. Overall the classification accuracies add considerable weight to the conclusion that true EEG information, differing between tasks and between task/no task conditions, is indeed being collected after artefact removal.

Subject independent classifier
As a final validation step, table 1 gives the performance of the subject independent classifier evaluated on each subject in turn. The performance averaged across all subjects was 67.7%, lower than the subject dependent classifier (75.1% from figure 4) but still well above chance. This reduced performance comes with the benefit of not needing to generate a new set of classifier training data each time a new subject is recruited, which will be important as we move towards closed loop tACS systems. Even with this reduction, the accuracy is still over three times the chance level of 18.2%.

Discussion
The verification of on-person tACS artefact removal is intrinsically very challenging as there is no ground truth reference available. It cannot be determined directly whether differences in the recovered EEG signal are due to residual or higher order artefacts still being present; due to tACS having a modulating effect on the brain and so altering the EEG which is collected; or simply because recordings with different stimulation settings cannot be performed simultaneously, and so slightly different EEGs are present at different times. As a result there is much debate over how good the artefact removal process must be in order to start being of practical use.
In our previous work [7] we used a multi-step validation procedure to evaluate the artefact removal performance: performing head phantom tests where the EEG signal to recover was known; inspecting individual on-person tests in the time and frequency domains; comparing the descriptive statistics of raw and cleaned EEG data; and demonstrating the recording of evoked potentials during tACS. Here we propose a new validation step, using a classification based approach to discriminate different parts of an experiment from EEG data after tACS artefact removal. We were successfully able to separate two different tasks that measure different forms of working memory capacity (nBack and backwards digit recall) during tACS from background EEG.
Our experimental protocol was carefully designed with this objective in mind. A relatively low number of subjects was used, as the aim was not to report a behavioural or similar effect of tACS, and the tACS stimulation duration was kept deliberately short in order to avoid inducing changes in the brain. A low number of EEG channels was used with channel count independent artefact removal, and a relatively unsophisticated machine learning classifier (an LDA) was used to avoid overfitting, or over-analysis, of the data. Our tasks were selected to involve similar, but different, aspects of working memory, previously shown to be enhanced by tACS stimulation [25,32–34].
Our results suggest that if residual artefacts remained after the data cleaning, and these dominated the classification process, then these residual artefacts would also have to vary with which task (if any) was being performed. We are not aware of either a technological or physiological basis for this. To us the more natural conclusion is that either no residual artefacts remained present after the data cleaning, or that those that did remain did not dominate the classification process. This is not direct evidence that the cleaned data is good enough for use in any arbitrary situation, but it does add weight to the overall picture as to the significance of any remaining residual artefacts.

Table 2 summarises the performance validation procedures used in the EEG+tACS artefact removal works that we are aware of to date. To our knowledge ten different methods have been reported, with time and frequency domain visual inspection of on-person data being by far the most common validation technique. The use of simulated/synthetic EEG+tACS data, modelling the tACS interference as an additive noise source, is the second most common. Most of these models include some second order effects such as timing jitter; however, the simple additive tACS model is very debatable [16,17].
The majority of works use more than one validation approach in order to help build confidence in the artefact removal procedure. We propose our classification approach described here as a new potential validation approach which incorporates an ability to assess the impact of any residual artefacts which may be left in the cleaned EEG data. The current study deliberately used a small number of subjects and a short stimulation time to avoid the stimulation inducing changes in brain state. Future work will focus on longer duration tACS protocols, investigating whether a change in the classification performance is seen for data selected from the start, compared to the end, of stimulation.

Conclusions
Machine learning can be used to investigate the performance of tACS artefact removal from simultaneously collected EEG signals. A linear discriminant analysis classifier could separate baseline, baseline during tACS (after artefact removal), working memory task without tACS, and working memory task with tACS (after artefact removal) data segments with accuracies ranging from 65%–94%. EEG from the nBack and backwards digit recall tasks could be separated during stimulation, with an accuracy exceeding 72%. This suggests a limited impact of any residual artefacts which may be present after the data cleaning process: if residual tACS artefacts remained after the EEG cleaning, they did not dominate the classification process. This new validation method could be used in addition to existing validation processes to build greater confidence in the performance of tACS artefact removal algorithms.