Decoding Music Attention from “EEG headphones”: a User-friendly Auditory Brain-computer Interface

Microsoft Research · 5y ago

Key Takeaways

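The talk explores whether a user-friendly auditory brain-computer interface can decode which instrument in a polyphonic music mixture a listener is attending to, using EEG recorded from an 11-channel, saline-sensor headphone rather than a gel-based cap. Steady-state visual evoked potentials and modulated pure tones are discussed only as conventional paradigms that tend to cause fatigue. A linear auditory attention decoding (stimulus reconstruction) model followed by a simple SVM classifier reaches an average accuracy of around 64% across nine subjects, and the average rises above 70% once segments with weak attention are removed using a median-based threshold. A convolutional neural network was tried as a way to automate that segment selection, with mixed results.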
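
The linear decoding step that the transcript below describes in words (reconstruct the stimulus envelope from delayed EEG, correlate it with each instrument's envelope, then average only the segments whose correlations differ clearly) can be sketched roughly as follows. This is a minimal NumPy illustration, not the speaker's code: the function names, the `ridge` value, the set of lags, and the fallback when no segment survives are all assumptions made for the example.

```python
import numpy as np


def lag_matrix(eeg, lags):
    """Stack time-shifted copies of `eeg` (samples x channels) side by side.

    For a lag of k samples, row t holds eeg[t + k], i.e. the response k
    samples after stimulus time t. Rows past the end are left as zeros.
    """
    n_samples, n_channels = eeg.shape
    out = np.zeros((n_samples, n_channels * len(lags)))
    for i, k in enumerate(lags):
        out[: n_samples - k, i * n_channels:(i + 1) * n_channels] = eeg[k:]
    return out


def train_decoder(eeg, envelope, lags, ridge=1.0):
    """Regularized least-squares decoder: g = (R^T R + ridge * I)^-1 R^T s.

    R^T R plays the role of the response autocorrelation and R^T s the
    response-stimulus cross-correlation mentioned in the talk.
    """
    R = lag_matrix(eeg, lags)
    return np.linalg.solve(R.T @ R + ridge * np.eye(R.shape[1]), R.T @ envelope)


def correlation_features(eeg, g, lags, envelopes):
    """Reconstruct the envelope from EEG and correlate it with each candidate."""
    s_hat = lag_matrix(eeg, lags) @ g
    return np.array([np.corrcoef(s_hat, env)[0, 1] for env in envelopes])


def masked_mean_features(seg_corrs, threshold):
    """Average per-segment correlations over the segments worth keeping.

    `seg_corrs` has shape (2, n_segments): one row per candidate stream.
    A segment is kept when the absolute difference between its two
    correlations exceeds `threshold` (e.g. a median taken from training data).
    """
    keep = np.abs(seg_corrs[0] - seg_corrs[1]) > threshold
    if not keep.any():
        # Assumption for this sketch: fall back to all segments.
        return seg_corrs.mean(axis=1)
    return seg_corrs[:, keep].mean(axis=1)
```

In this sketch the two numbers returned by `correlation_features` (or by `masked_mean_features` after segmenting a trial) would be the per-trial features handed to a classifier; the threshold passed to `masked_mean_features` would be estimated from training trials only, so that test data does not influence it.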

Full Transcript

[Music] Okay, all right. Thank you, Dimitra. Hello and good morning, everyone. It's my great pleasure to introduce Winko today. He's a fourth-year PhD student at Carnegie Mellon, and he's a returning intern with our team, the Audio and Acoustics Research Group, working on brain-computer interfaces. And with that, Winko, please take it away.

Thank you, Hannes, for the introduction, and thank you everyone for coming. Before I start, please feel free to interrupt me anytime during the talk if you have a question. Today I'm very excited to share with all of you what I did in the past three months. I will start with some background. In the title there is a term, "EEG headphones", so what is EEG? EEG stands for electroencephalography. It is a recording of the electrical potential along the scalp. A common way of acquiring that information is by placing multiple electrodes, usually integrated into a cap, on the head, so that we can measure the potential change along the scalp over time. These electrodes pick up neural activity. When a neuron fires, it generates a certain waveform called an action potential. However, bear in mind that EEG is not picking up a single neuron firing; it's impossible to capture that with electrodes outside your brain. Instead, it's picking up the synchrony of a local population of neurons: if there is synchronous behavior within a local population, that's what EEG picks up from the brain. Decades of EEG studies show that EEG carries information about a person's perception and emotion, and it can also be modulated by a person's cognitive states, like attention. So it offers a space for neural engineers to build an interface to communicate with an external device directly using our brain, and that's the basic idea behind a brain-computer interface, or BCI. By definition, a BCI is a communication or control system that allows real-time interaction
between the human brain and external devices. It has been used in many applications. As you can see in the picture, in assistive technology the patient or user can put on an EEG cap and then, just by using his mind, control this wheelchair to navigate. BCI has also been applied in stress monitoring; for example, the interface can monitor a person's stress level in real time and interactively respond accordingly. A very simple way to implement a BCI system is to decode different cognitive states first and then label these states with a specific system output. For example, if we can decode a person's attention to the left, we can label that as a left click, and similarly we can label attention to the right as a right click. If we combine these functions with eye tracking, for example, then we may control a mouse without even moving our fingers or hand. This type of binary output, like left or right, sounds very simple, but it may actually have great value in certain applications, and therefore this kind of binary output is what we focus on in this project.

A very popular way of designing a BCI paradigm is by using external stimuli. The idea is that you present multiple objects in a chosen sensory modality and ask the person to pay attention to one of these objects. In vision, as you can see in the picture, people can present, for example, four or even more circles on the screen, and these circles flash at different rates. When you are paying attention to one of them, the rate at which the attended target flashes shows up as a stronger component in the EEG signal, and we call this type of response a steady-state visual evoked potential, or SSVEP. Similarly, in audition, people play sounds to different ears or through other channels. These sounds are usually pure tones, and each is modulated by a different
modulation frequency, so when you are paying attention to one of them, the attended modulation frequency may have a stronger component. These neural signatures are well studied; SSVEP, for example, has been reported to be very robust across sessions and across subjects. But one downside of using these paradigms is that the stimuli can cause fatigue very easily. Just imagine staring at a constantly flashing object, or listening to a modulated pure tone, for more than 10 minutes; it won't be a very pleasant experience.

With that in mind, we tried to evaluate the user-friendliness of current EEG-based BCI systems and see if we can find some room to improve, and it turns out there's a lot of room for us to work on. Here I'm showing you a very typical EEG study. The EEG cap usually has multiple channels, and the number of channels can go from maybe 24 all the way up to 128 or even more. Also, if you choose a wired EEG system, there's always a big bundle of cables behind you when you're using it, and these cables are connected to an amplifier, as you can see here, which is usually bulky and heavy at the same time. So it can be very overwhelming for users to carry it around if they want to use this BCI system outside, and it also looks very obtrusive when you're putting it on. Also, many of these EEG systems use gel-based electrodes: we put gel between the electrode and the scalp to make sure there's a good connection between the two. That means that before you use it, you have to put gel into each of these holes, and it may take some time to make it work. This gel can also be very sticky and messy, so after you use it you also need to wash it off. That creates extra overhead in time and effort before
and after each time you use the system. Also, as I said before, some stimuli, like constantly flashing objects, can cause fatigue easily, and some paradigms, like motor imagery, may require a very lengthy training session before a subject is able to use them at all. This list can go on and on. All these factors together make BCI a less mature technology, I would say, for the consumer market. This motivates us to explore the feasibility of building a user-friendly and also functional auditory BCI system. This is a very general, open-ended question, so we narrowed it down to three specific aims. First, we want to use a more compact form factor to collect EEG data; we want to walk away from messy gel-based EEG systems, of course. Second, we want to use more pleasant stimuli, because we don't want to annoy our subjects. These two points usually mean that we need to compromise on certain aspects; for example, we might end up using fewer channels than a traditional EEG system, or the SNR may not be as good because we walk away from gel-based sensors. So the third aim is to see if we can maintain high decoding accuracy and efficiency of the system while we achieve the first two points.

Regarding the form factor, as I said, we definitely want to walk away from this kind of setup. A smart idea would be to embed the EEG system into existing products that people wear daily. For example, there's a smart design that combines EEG electrodes into a hat, so people just put it on and walk away with it. Or, if we target VR/AR users, we can try to combine the EEG system with a VR/AR headset; since people need to put them on anyway, it offers a very convenient space for us to collect EEG data as well. Another even smarter, but also very challenging, idea is to collect EEG data from earphones or
ear canals. This is work that was done at MSR last year by another intern. He built these earphones with electrodes at the ear tips, collected signals from this device, and tried to use them to decode a person's stress level, and the results were pretty good. This year, since we want to focus on the auditory domain, we decided to use a new product on the market called Smartfones. The idea is to combine the EEG system into a set of headphones. They have three sensors on the top and four sensors on each side. They use saline-solution-based sensors, meaning you have to soak some sponges overnight and then put them into those sockets every time you use it. It's not perfect, but compared to a gel-based system it's still very handy and requires much less time to prepare, and compared to dry sensors it offers a better connection between the sensor and the scalp, so it's kind of a trade-off. This is also a perfect match for our project, because we need to play sound anyway, so integrating EEG into a headphone makes perfect sense.

Last year, when I was an intern at MSR, we also tried to build an auditory BCI system. I used a sequence of tones, and I tried to make it more pleasant by creating something like a melody for the users. This year we want to go one step further in that direction: we want to use real music, because people can enjoy listening to music for a very long time without being annoyed. If we can find the right kind of music, then we can probably decode their attention. We did some literature review in the first few weeks of my internship. There aren't many studies in this domain, but there was one study using polyphonic music, meaning they have multiple instruments in their stimuli, and they asked people to pay attention to a specific
instrument while the sound mixture is playing, and they found some interesting neural signatures when the person is paying attention to a specific stream. So we thought maybe we can borrow this idea, use polyphonic music as the stimuli, and ask people to pay attention.

In the next section I will share with you what we designed for our experiment. We created stimuli using excerpts for three instruments: a vibraphone, a piano, and a harmonica. The one for vibraphone is from "I'm Yours", the one for piano is from "Wherever You Will Go", and the one for harmonica is from "Forever Young". I want to play the piano one just to give you a taste, but I'm not sure whether the audio is working, so let me try. Do you hear anything? Yeah? Cool. Okay, so that's the piano, and if we combine all three together, it creates a mixture. [Music] The reason we chose these three songs is that they use the same chord sequence, and if we combine them together, because they follow the same chord progression, the mixture creates a very harmonious sound. So people can really enjoy it and, in the meantime, choose to pay attention to a particular instrument for the task.

These are the stimuli that the subjects hear most often during the experiment. Occasionally during the task they will hear a modified version of these, which we call oddballs. There are four bars in each excerpt, so to create an oddball we modify the melody by shifting the pitch up or down for the second bar or the last bar. Here's an example, also for piano. Compared to the standard piano stimulus, we only modified one bar of it, so they sound exactly the same elsewhere. The reason we want to create oddball stimuli is that we want to create a task for the subjects, which is a
sort of incentive for them to focus more. We also recognized that it might be very challenging for people to pay attention to a particular instrument if they don't have any musical training. So we spatialized the sound in three directions, with the vibraphone on the left, the piano on the right, and the harmonica in the middle. With that spatialization, it becomes a very easy task even for non-musicians like myself.

The whole experiment is trial-based. At the very beginning of a trial, we present a visual cue, arrows pointing in different directions, as a way to direct the person's attention to a particular direction. Then we play the excerpt twice. The first repetition is always a standard for the attended stream, and the second repetition can be either a standard or an oddball, with the modified bar at either the second or the last bar. The task is to identify whether these two are the same or not. Subjects answer with their mouse, and we give feedback on how they performed after each trial. With this trial-based design we have three conditions, and all the trials are randomized for each block. We have 28 trials for attention to left and 28 trials for attention to right. We also have 14 trials for attention to the center. Attention to the center is not what we focus on because, as I said, we want to focus on producing a binary output from the BCI system, but it is a very good reference for us to do some sanity checks. I will show some results from this later in the results section. We don't have a lot of trials per condition because we want to squeeze everything into 30 minutes, which is good for subjects during this pandemic situation. As you can imagine, the pandemic makes data collection extremely challenging, especially when you are working remotely
like I'm doing now. Luckily, we have two identical units of the Smartfones, which means we can conduct the same experiment at two different places. So we came up with a plan: I recruited a few subjects in the Boston area, where I live, and Misha helped me a lot in getting subjects from the Seattle area. Together we have nine subjects in total, which was a really challenging task during this pandemic season, but this number is already very good for typical BCI research. Since we have two experimental sites and two experimenters, I wrote an app in MATLAB to show the subjects all the instructions, because we want to normalize how much they know about the task before they actually do it. The app has a function for setting up the experiment and a training session where the subject can play around with the stimuli and the task as many times as they want, just to get familiar with what's going to happen in the actual experiment. And then, of course, the app also has a function for the actual experiment. So everything is in one place, which makes everything easier for both the subjects and the experimenters.

Now I will share with you how we processed and analyzed our data. We collected EEG data from the 11-channel Smartfones, and we did minimal pre-processing before the analysis. This includes a bandpass filter with a 2 to 8 Hz passband and manually removing some epochs that are obviously bad, for example this one in the figure. These bad epochs are usually due to, for example, packet loss in the Bluetooth connection, or strong motion artifacts, like if the subject is yawning because the task is so boring, or sneezing. So it's very easy to pick them out from the pool. Now, the reason I chose a very narrow band for the bandpass filter is
because I want to apply the auditory attention decoding, or AAD, algorithm to my data. This method was developed based on a finding from neuroscience: when we are listening to a sound, like running speech in this case, the low-frequency components of the EEG, like 2 to 8 Hz, may resemble the envelope of the sound wave. Also, if we are listening to two streams of sound, for example, and attend to one of them, then the attended stream may have a stronger representation in the EEG signal. So if we can find a way to reconstruct the stimulus envelope from the low-frequency component of the EEG signal, then we can correlate the reconstruction with the available stimulus envelopes, and the one you're attending to may give you a higher correlation. That's the whole idea behind the AAD method.

In the AAD algorithm we focus on two signals. One is the stimulus feature, the stimulus envelope, or s; the other is the multi-channel EEG data, or r. Here we are treating the brain as a linear system, which by the way is not true, but it is a very handy assumption. Using the s and r signals, we can estimate the system response by using either of the two as the input and the other as the output. For BCI we focus more on backward modeling, or stimulus reconstruction, where we use r as the input and the stimulus envelope as the output to see if we can reconstruct it. Suppose we can find this decoder using some method; then we can apply it to new test EEG data to find the reconstruction and correlate that with the original, true envelopes we have. If you are attending to one of them, the correlation with that one should be higher. Here are some equations for how this works numerically. This r, as I
said, is the EEG data, and we are trying to find this decoder g, which is a function of space, denoted as n here, where n refers to the channels, and also a function of time, denoted as tau here. Tau is a delay variable that is meant to model the delay between the stimulus onset and the time it shows up in the EEG. For speech decoding, the best delay value is usually about 200 milliseconds. So we can give a set of values for tau, and the algorithm will delay the EEG signal by different amounts and see which delay value gives a good reconstruction. This s-hat here is the reconstruction, and the optimal g is found by minimizing the mean squared error between the reconstruction and the original stimulus envelope. The solution for g in this case takes the autocorrelation of the responses, or the EEG signal, plus a regularization term, inverts that, and multiplies it with the cross-correlation between the response and the stimulus envelope.

I applied this AAD method to my data. I pooled all the EEG signals we have for the three conditions, and this forms the signal r(t). I also took the corresponding envelopes for each trial and put them together as s(t), and then I used the equation to estimate our decoder. After we get the decoder, given a new trial for testing, we input these eight seconds of EEG data into the decoder, and that gives us a reconstruction. Then we correlate that reconstruction with either the envelope of the vibraphone or the envelope of the piano. That gives us two correlations, which we use as the features to train and test a simple SVM classifier. So that's the whole decoding pipeline for this line of
analysis. Now I will show you some results. Before I show you the decoding accuracies, I want to share the decoder weights to see what they look like. As I said, g is a function of both time and space, so here I'm showing you g collapsed along the time axis: I'm taking the average across all the delays we have and showing you the weight for each channel location. Here we are looking at the Smartfones, with three sensors on the top. The yellow color means a higher weight and the blue color means a lower weight, and it seems like the three sensors on the top carry most of the weight. The decoder weight is a surrogate for how much a channel is contributing to the decoding, so in this case it means the three channels on the top contribute more to decoding. We can compare our results to a previous study on speech decoding, where they also applied this AAD method, and there's a similar pattern in the decoder weights. For the three electrodes at the top, C3 and C4 are closer to the two clusters with higher values, and Cz, in the middle, is further away from those two clusters, and this pattern is exactly what we saw in our decoder weights. This correspondence is a good sanity check both for the hardware, because it's a new product, and for the algorithm, because I want to make sure that I'm implementing this AAD method properly.

With that decoder we tried to reconstruct the stimulus envelope. Here I'm showing you the results from one subject, averaged across all the trials. The blue trace is the original envelope, and the red one is the reconstruction, or the predicted envelope. In each condition, attention to
left, attention to center, attention to right, we observe that in certain time windows, for example this one, there's a high correspondence between the original, the blue trace, and the red trace, the prediction. [Music] That means that during those windows the AAD method can reconstruct the original stimulus pretty well. But in some other windows, for example if you look here, the red trace and the blue trace go in opposite ways. The reason this is happening could be that the decoder is not getting enough data, because we only have around 30 minutes of data, so it may need more data to get a better decoder. But it could also be that the person is not paying full attention during those windows. Because we are seeing more correspondence in some time windows than in others, it means the attention effort from each subject can be a function of time: they may choose to pay more attention in some windows than in others. This might be determined by factors like when a note is being played in a particular stream, or how each subject scheduled their attention during the experiment to do the task. We should keep this point in mind for interpreting some results I will show later.

Here I'm showing the correlation between the reconstruction and the envelope of each instrument. The result is averaged across all the trials, and each color represents a single subject. The pattern we are expecting is that when you are paying attention to the vibraphone, for example, the correlation with the vibraphone envelope should be higher, and when you are paying attention to the piano, the correlation for the piano is higher. And since in these two
conditions we are not paying attention to the harmonica, the correlation with the harmonica envelope should always be close to zero. This is exactly what we are seeing here, which is a good sign. So we can use the correlations with these two envelopes as the features for classification, and here are the results. The average decoding accuracy is around 64%, which is surely above the chance level. Because BCI studies usually have a very small number of trials, in order to be significantly above chance we have to exceed this dashed line, which is around 58%, so the average result is surely above that significance level as well. We are also observing some individual differences, with the highest accuracy being around 70% and the lowest one even below the significance level. In general this is a good result, because now we know that even though we are using the Smartfones and music, which is very user-friendly but much less optimal than traditional EEG setups, we can still decode auditory attention to a fair degree.

So that's good, but the question is, can we improve it? Remember, when I showed you the reconstruction results, I mentioned that attention might be a function of time in many subjects. During windows when you are not paying much attention, the correlation between the reconstruction and the original envelope might be low. Keeping that information in our data pool may actually reduce the SNR, which in turn will reduce the overall decoding accuracy. So we may want to find a way to remove those windows when the person is not paying much attention and then classify on the remaining data. Here's what I did. First, given new EEG data, I divide it into smaller segments to capture those small windows
of attention. Here I'm using two seconds of data for each segment with 80% overlap, and that gives us roughly 18 segments. Then we input each of the segments into the attention decoder that we got before, and it reconstructs the envelope for those two seconds. Then we correlate those two seconds of reconstructed envelope with the two seconds of envelope for each of the instruments, and that gives us a correlation coefficient for each segment. In total, since we have 18 segments, we will have a 2-by-18 matrix as the feature for each trial.

Now, how can we decide which ones to keep and which ones to throw away? Let's go back to this correlation result. We are expecting to observe a gradient between these two variables: when you are paying attention, the stream you are attending to should have a higher correlation than the other. When you are not paying much attention, this gradient may disappear, and the difference may end up as a very small value if you just subtract one from the other. So I think maybe we can look at that feature, the difference between the two, to decide whether to throw away certain segments or keep them. Here I'm showing you this difference feature, but I take the absolute value because, for a new trial, we don't know which one should be subtracted from which. So, to be fair and unbiased, I'm taking the absolute difference between these two variables, and my assumption is that the greater the value is, the more attention you are paying, or the more information there is in that segment. Here I'm showing you this absolute difference for one single subject. The x-axis is the segment number, from 1 all the way to 18, and the y-axis is the trial number, from 1 all the way to
whatever we have. The blue segments are the ones I think should be removed, for the reasons I just mentioned, and the more yellow a segment is, the more we should keep it for further analysis. For this subject, the yellows are scattered around; they don't form a specific pattern. Whereas if I show you the same matrix for another subject, we can see these yellow segments pretty much align at certain segments; it seems like this person is using a very consistent way of focusing throughout the whole experiment. Another way to visualize this is to sort each row by the absolute value, and now it's very clear that the subject at the bottom has higher values in this matrix than the other one. Now, a bet: which of the two subjects has a higher decoding accuracy if we use the whole trial as the decoding feature? It's very obvious that this subject five, for example, has a very high decoding accuracy, because he has more yellow segments in this matrix.

So we want to remove the blue segments on the right part of this matrix, and we need to find a criterion to do that. The easy way would be to set a fixed threshold value, like 0.4 in this case. Then for this subject we are keeping roughly half of the data, which is good, but if we do the same for the other one, for some trials there are no surviving segments at all, so it's not ideal. Therefore we want to use a distribution-based threshold: we start from this matrix, find the distribution of all the elements in the matrix, find the median, use that as a cutoff, and throw away everything below it. In this way we throw away half of the segments from each subject, which makes more sense than a fixed value. With that feature selection we can reduce this 2-by-18 matrix to a 2-by-1
vector. As I said, since we don't want to bias the distribution using the testing data, we first divide the whole data set into a training set and a testing set. With the training set we calculate the absolute difference feature, form the distribution, find its median, and use it as the threshold. Then we calculate the mean of the surviving segments for each trial, which means we are taking the average of each row here over the elements that are not masked in blue. So we are still working with a 2-by-1 dimensionality, which is good for this type of analysis because we don't have a lot of data. The result is shown here. You can see this feature selection gives a lot of gain in decoding accuracy, especially for these few subjects, and the average now is above 70%, which is a good number for this overall EEG setup, because we are trying to be more user-friendly.

Here I want to make a note about so-called BCI literacy. In BCI studies, researchers sometimes find that the signal from certain subjects cannot be decoded well, while the same method may work very well on other subjects. So some researchers may claim that these subjects are not BCI-literate, because their signal cannot be decoded. I'm not totally against this idea, because each person has a unique anatomical structure in their brain, so it's really possible that for some people the cortices are folded in a way that some task-relevant features cannot be captured. But there's also a possibility that sometimes there's just too much noise in the data, and we need to remove it to see the real effect. For example, in this case, subject four jumped from even below the chance level all the way to a pretty decent decoding
accuracy. So that's the point I want to make here.

Recently, I mean last week, we reviewed the whole analysis pipeline and found that maybe we can automate this feature selection process, because we are setting a hard threshold at 50%, that is, the median. That is subjective, and it may not be the optimal threshold. So we want to automate this, and the obvious choice is a convolutional neural network. Here we built a very simple neural net with two convolutional layers, one in each direction of the matrix, and with some dropout to make sure there's no overfitting. I don't want to spend a lot of time on this because the result is not great. The yellow bars are the results from the CNN. In some cases it's working, like subjects four and five, where it's pretty close to the red bar, but in other cases it's not. I think it's because we really didn't have enough time to tweak the parameters or revise the architecture. So this is just something we think could work, and it's not really optimal. I think our team may spend more time on this after my internship.

I also want to compare my results with previous studies, because we are using a very user-friendly setup and I want to know how our decoding pipeline performs. Before we jump to the decoding accuracy, I want to mention that, compared to these two studies, we are using far fewer channels, which don't even cover the whole brain; we are using music, which is more pleasant but carries less contextual meaning than speech; and we are using relatively shorter data for decoding. However, if we compare with other studies that also use this linear auditory attention decoding algorithm, our result is actually not bad. The
decoding accuracy is 70%, which is kind of medium, but if you take into account how much data we use, the equivalent efficiency of the system, or the information transfer rate, which is quantified in bits per minute, is actually way better than the other two.

One thing worth mentioning here is that in this particular study the authors propose an end-to-end DNN architecture for classification, which is totally different from using AAD to extract some high-level features first and then doing classification on those features. In their study they achieved pretty good results, a good gain compared to the AAD method. So this is definitely something we want to try in the future, given the huge boost and the expertise of our team. But I still want to give some merit to the AAD approach, because AAD is inspired by neuroscience studies where scientists really observed this neural signature in many, many human subjects. A neural network, however, is purely data driven, so it is trained on the data we already have, and for studies like this the sample size is usually small: a few tens of samples, or in some good cases a few hundred. A typical DNN architecture involves a large number of variables to fit, usually hundreds of thousands, as in this paper. That means this is a hugely underdetermined problem, not to mention that the signal-to-noise ratio of EEG is usually very low. Therefore I think these DNN-based approaches may sometimes give a solution that is tailored to the data set but may not generalize easily to other data sets.

In the future there are still a lot of things we can work on. From this study we learned that we can improve a lot in our study design. Some
subjects reported that the task is not very challenging: because the first repetition is always the standard, after a few trials they realized it's always the same for the first repetition, so they started to pay less attention when they heard it the first time. We can also have a better oddball design, because some people reported that the oddball in the unattended stream may actively grab your attention during the task, and that may reduce the attentional effort you are paying to the target. So maybe in the future we can use other designs, like missing notes, which don't grab attention but can still serve as an oddball. We can also have a better stimulus or experimental design. One drawback here is that the vibraphone always starts first, so for a lot of people the default attention goes to the left, and if the task is to attend to the right, they need to shift their attention after a few seconds. That's again something we learned from this study and definitely want to improve in the future.

We also want to work more on the form factor, where we can have even more user-friendly hardware, like the in-ear electrodes I showed you before. And despite all the negatives I mentioned about the DNN approach, we want to try this end-to-end classification, because in that domain we have a lot of techniques to play with, like data augmentation, which might help us reduce the chance of overfitting while still giving us generalizable networks. We already applied a CNN approach to the study we did last year and achieved good results, so we may try the same architecture for this year's study.

So what did we learn from this project? We found that we can acquire information about cognitive states from very compact EEG devices, and this device doesn't have to cover the
full brain as many other EEG systems do. We can use polyphonic music as a very pleasant stimulus, and we can direct the user's attention to specific instruments by using spatialization. And we can decode music attention using a stimulus reconstruction method and achieve good decoding accuracy.

With that I want to thank my team. I want to thank my mentor, Hannes, for being a great mentor to me for two years; it's really nice talking to you every time. I also want to thank Evan for giving me this opportunity, and special thanks to Dimitra for helping me get a lot of good data, which is really valuable for this project. I want to thank my colleagues in the Audio and Acoustics Research Group and also the MSR BCI team; it's really great to hear the updates every week, share what I found, and get feedback, which is very valuable to me. I want to thank my advisor for always being supportive, and special thanks to my wife as well. She's not on the call today, but maybe she can hear it from next door. None of this could have happened without her. I'm really grateful to have had this chance to do this project with all of you. Thank you very much for your attention. Now I can take any questions you have.

Thank you so much, Winko. I believe Hannes's call was dropped, so I can act as moderator. That was a wonderful presentation. As someone who recorded some of this data, and who was actually a participant in the study, I wanted to mention something that struck me as interesting. You said that the top three electrodes show the most promising signatures, and what I've seen is that the top three electrodes were actually the ones with the poorest connection. That is very encouraging to know, because the top ones were the ones that had the most contact with hair, and we happened to have quite a few
females in the study who happened to have lots of hair. So this is very encouraging; basically, keep working on that optimal design. The hardware design can certainly change the results to be even better.

Yes, for the impedance issue, I observed the same thing in my subjects. Because of the hair, the impedance of the top three sensors is always higher than the others, because the others contact the skin directly. But that noise is usually at high frequencies, and in my study I use the low-frequency components from 2 to 8 Hz, so I think it doesn't matter much to the analysis. Also, I want to show you this: we did a beep test before each actual experiment, playing beeps through the two audio channels. This is a way to find a typical auditory evoked potential in the EEG signal. It turns out the three at the top actually give us a very reliable ERP, or event-related potential, from those beep tests, even better than the other eight channels. So impedance-wise they might be worse than the others, but signal-wise they are maybe even better. That's what I think may contribute to the decoder weight results.

There's a question in the comments, Winko, if you want to go ahead, or I can read it. The comment right at the end asks: is there a potential of being able to read attention from pre-existing songs, or at least to be able to detect whether a given song has the requirements for being used?

So, pre-existing... [inaudible] I'm back. Sorry, my laptop crashed, I don't know what happened.

I'm guessing the question is: can you use songs that the subject is already familiar with, or decide beforehand whether they are...

I see. So when we started this study we kind of hoped that the subjects already knew
those songs, so that familiarity with the song would help people identify the oddball very easily. So that was actually a hope. Of course we can use a song you already know, but the thing is, it's not an easy task to pay attention to a particular element in a song, such as one instrument, without any manipulation of the stimuli, especially for non-musicians. That's why in our study we spatialized it, to make it a much easier job. If we can do the same for other pre-existing songs, then yes, maybe we can still do it.

And also, I guess... yeah, sorry. I guess if you think of [inaudible], so they're played at the same time but they follow the same rhythm, then potentially, if you spatialize those two, you could pay attention to one or the other, say if you were using electronic music.

Yeah, and also, if you want to apply this AAD method: if you look at how the decoder is calculated, it's not discriminating any particular sound; you're just putting everything together to train the decoder. So in that sense we don't have very specific requirements for what song can be used, but maybe not very loud or noisy ones. I don't know, it's just my guess.

There was one question... oh sorry, can you hear me? Yeah. There was one question in the messages about the sensors, and I see Dimitra already replied. The sensors inside the headphones are just little sponges, similar to foamy ear plugs, that are soaked in salty water, basically. So no gel involved.

Yeah, the white parts are just sponges, and you have to squeeze them to put them into the socket. To use them you have to soak them overnight before the experiment in saline solution, so
yeah. But they're kind of pandemic friendly, because they're disposable. Oh yeah, so there's that advantage.

Hi, sorry, this is [inaudible]. I just didn't know: if they're really wet, do they leave your hair wet, your head wet, or is it just slightly moist?

In the beginning, when you first put it on, it may be a little bit wet, and if you soaked it for a long time and didn't squeeze much, there might be some water drops on your head. But after a while it's just moist around the sensor region, so I guess it's still okay.

I've been thinking about other kinds of sensors, like near infrared, other kinds of things that might passively, or I guess actively, light up an area. Have those been used, or do you think that might function in this case?

The thing with fNIRS is that it's capturing the blood oxygen change over time, and this oxygen change is a slow process, usually on the order of a few seconds. That means when you play a sound and wait for the blood oxygen level to go up to its peak, it may take five to six seconds. So it may not be able to capture very quick dynamics like the sounds we are playing here. But it might be a good tool to capture other things, like stress level, for example; there might be more reliable signatures for those types of applications.

Cool, thanks.

I guess just a follow-up comment. Given that right now the trial length is eight seconds in our case, right, Winko: if you could reliably detect an attention shift with fNIRS, you would kind of be in the same ballpark in terms of lag. Right now, with AAD, you basically need at least four or five seconds of data to determine an attention shift.
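
The lag argument above can be illustrated with a small simulation. This is a minimal sketch, not anything shown in the talk: it assumes a gamma-shaped hemodynamic response with shape 6 and scale 1 s (a common textbook approximation), and the function name `hrf`, the sampling rate and the event times are all illustrative. It checks where the response to a brief sound peaks and whether two sounds one second apart remain distinguishable.

```python
import numpy as np

def hrf(t, shape=6.0, scale=1.0):
    """Gamma-shaped hemodynamic response (assumed form), peak normalised to 1."""
    t = np.asarray(t, dtype=float)
    h = np.where(t > 0, t ** (shape - 1) * np.exp(-t / scale), 0.0)
    return h / h.max()

fs = 10.0                        # samples per second (assumed)
t = np.arange(0, 30, 1 / fs)     # 30 s time axis
h = hrf(t)
peak_latency = t[np.argmax(h)]   # analytically (shape - 1) * scale seconds

# Two brief sounds, one second apart (hypothetical event times).
events = np.zeros_like(t)
events[int(2 * fs)] = 1.0
events[int(3 * fs)] = 1.0
response = np.convolve(events, h)[: len(t)]

# Count strict local maxima in the simulated response.
mid = response[1:-1]
n_peaks = int(np.sum((mid > response[:-2]) & (mid > response[2:])))

print(f"peak latency: {peak_latency:.1f} s, local maxima: {n_peaks}")
```

Under these assumptions the two sounds merge into a single slow bump, which is the "smeared in time" behaviour that makes envelope-tracking methods hard to apply to such a signal.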
Yeah, well, the length of the trial is one aspect; the other is the spatiotemporal resolution. It's not just delayed, it's also smeared in time, so you're only seeing very slow dynamics in the BOLD signal, or in the fNIRS signal. So I think we cannot apply the AAD method here for fNIRS sensors, but there might be other algorithms we can use.

Right. I guess my question to you is: are you familiar with any studies that use fNIRS for attention decoding? It might be interesting to know what their decoding accuracy is, just, say, left versus right.

Well, I can't think of any off the top of my head, but our lab is working on spatial attention to sound and we also have a postdoc doing fNIRS on that, so maybe there already is some.

Okay. The papers I have seen typically average fNIRS signals for 60 seconds to get a reliable reading.

Yeah, that's similar to what fMRI is doing: you have a very long experiment session and then you collapse the time-domain signal to get information about space. So I guess it may be a bit of a stretch if we only use 10 seconds.

Yeah, in that case it would be low.

I have a question... Lester, or others? Let's see, there was a comment in the messages about breaking down a single song into different tracks and then treating those as different stimuli. I think that might work. I guess the similarity between the different segments might be a problem here: focusing on one segment versus another, all the instruments are the same, the sound is the same, they're just at different places in time. That might be a little tricky. But there was one question about how you collected the potential or correct stimuli data to calculate the mean squared
error.

Oh, so the mean squared error for the decoder, I assume. [Music] If you're asking about the mean squared error with respect to the true stimuli: we already know the envelopes of the stimuli, we know what we are playing to the subject, so we use them as the ground truth. Then we try to reconstruct them using this method, and we can compare the two. I'm not sure whether that answers it.

Okay, thank you. So, Rick, you had another question?

Yeah, I can speak it faster than I'm typing; I'm only about halfway through typing. So this is an interesting result. You're showing that there's a spatial sensitivity to these sensors. My intuition, and I'm just an amateur at this, would have told me that the ear pieces would be more sensitive, if you will, to picking up these signals, so your results seem counterintuitive. What is it about either the brain function, or where things end up in the head for this music attention, that makes the top pieces more sensitive than the ones in the ears?

That's a very good question, because in the beginning I also expected to see more from the sides than from the top, especially if you look at this topography, where these clusters are actually around the auditory cortex, which is on the side. One thing I can think of is that when we do a beep test, where you give the person some beeps from two sides, the strongest effect we see is actually on the top. That's because for most people these cortices are folded in a way that points to the middle. So even though those cortices are around this area, the dipoles, which are like batteries, are pointing more toward the center. That's what I can think of for now to explain this. And of course the other thing is, if you look at our beep-test sanity check, the
three sensors on the top always have very good SNR, whereas the oth
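
The distribution-based threshold described earlier in the talk can be sketched as follows. This is a minimal illustration, not the speaker's code: it assumes each trial yields a 2-by-18 matrix of per-segment correlations (one row per instrument), uses the absolute difference between the rows as the per-segment feature, and takes the median over training trials only. The function names, the fallback for trials with no surviving segment, and the random stand-in data are all hypothetical.

```python
import numpy as np

def fit_median_threshold(train_corr):
    """train_corr: (n_trials, 2, n_segments) per-segment correlations.
    Returns the median of |row 0 - row 1| over all training segments."""
    abs_diff = np.abs(train_corr[:, 0, :] - train_corr[:, 1, :])
    return float(np.median(abs_diff))

def select_and_average(corr, threshold):
    """Average each row over segments whose |difference| exceeds the threshold.
    Returns (n_trials, 2). If no segment survives in a trial, all segments
    of that trial are used (an assumption made for this sketch)."""
    abs_diff = np.abs(corr[:, 0, :] - corr[:, 1, :])
    keep = abs_diff > threshold              # (n_trials, n_segments)
    keep[~keep.any(axis=1)] = True           # fallback for empty trials
    weights = keep[:, None, :].astype(float)
    return (corr * weights).sum(axis=2) / weights.sum(axis=2)

# Random stand-in data: 40 training and 10 testing trials, 18 segments each.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 0.2, size=(40, 2, 18))
test = rng.normal(0.0, 0.2, size=(10, 2, 18))

thr = fit_median_threshold(train)            # fitted on training data only
train_feat = select_and_average(train, thr)  # (40, 2)
test_feat = select_and_average(test, thr)    # (10, 2)
kept_fraction = float(np.mean(np.abs(train[:, 0] - train[:, 1]) > thr))
print(train_feat.shape, test_feat.shape, kept_fraction)
```

Fitting the threshold on the training split and reusing it on the test split mirrors the point made in the talk about not biasing the distribution with testing data.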

Original Description

Listening to music is a joyful way to relax. Many people enjoy playing music in the background in their daily life, even while working. This makes music an excellent choice as a stimulus for building a user-friendly auditory brain-computer interface. In this study, we designed a new brain-computer interface system using music as the stimuli, and recorded brain signals from Smartfones, an EEG recording device integrated into a pair of headphones. We spatialized musical instruments, and asked the participants to pay attention to either the vibraphone on the left or the piano on the right. Then, we used a stimulus reconstruction method to decode attention from EEG signals. Results show that the proposed system can achieve good decoding accuracy on top of its superior user-friendliness compared to EEG caps. These results suggest the viability of using music in designing future auditory brain-computer interfaces. See more at https://www.microsoft.com/en-us/research/video/decoding-music-attention-from-eeg-headphones-a-user-friendly-auditory-brain-computer-interface/
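
A stimulus reconstruction decoder of the kind mentioned above is commonly implemented as a regularised linear (backward) model that maps time-lagged EEG to the sound envelope. The following is a minimal sketch under that assumption; the lag count, ridge parameter, channel count, function names and synthetic data are illustrative and are not taken from the study.

```python
import numpy as np

def lagged(eeg, n_lags):
    """Stack EEG samples at t, t+1, ..., t+n_lags-1 (EEG follows the sound).
    eeg: (n_samples, n_channels) -> (n_samples, n_channels * n_lags)."""
    n, c = eeg.shape
    out = np.zeros((n, c * n_lags))
    for k in range(n_lags):
        out[: n - k, k * c:(k + 1) * c] = eeg[k:]
    return out

def train_decoder(eeg, attended_env, n_lags=16, ridge=1.0):
    """Ridge regression from lagged EEG to the attended envelope."""
    X = lagged(eeg, n_lags)
    XtX = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ attended_env)

def decode(eeg, env_a, env_b, w, n_lags=16):
    """Reconstruct an envelope and pick the candidate it correlates with more."""
    recon = lagged(eeg, n_lags) @ w
    r_a = np.corrcoef(recon, env_a)[0, 1]
    r_b = np.corrcoef(recon, env_b)[0, 1]
    return ("a" if r_a > r_b else "b"), r_a, r_b

# Synthetic stand-in: EEG is a delayed, noisy copy of envelope "a".
rng = np.random.default_rng(0)
gains = rng.normal(size=11)                  # hypothetical channel gains

def fake_trial(n=2000, delay=5):
    env_a, env_b = np.abs(rng.normal(size=(2, n)))
    eeg = np.outer(np.roll(env_a, delay), gains) + rng.normal(size=(n, gains.size))
    return eeg, env_a, env_b

eeg_train, a_train, _ = fake_trial()
eeg_test, a_test, b_test = fake_trial()
w = train_decoder(eeg_train, a_train)
label, r_a, r_b = decode(eeg_test, a_test, b_test, w)
print(label, round(r_a, 2), round(r_b, 2))
```

The decoder is trained on one trial and evaluated on a separate one, so the comparison of the two correlations is not contaminated by the training data.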


This video demonstrates the use of EEG headphones for decoding music attention in a user-friendly auditory brain-computer interface. Using a stimulus reconstruction (auditory attention decoding) method with median-threshold feature selection, the system reaches an average decoding accuracy above 70%; an exploratory convolutional neural network for automating the feature selection did not consistently match that result. The video covers the basics of brain-computer interfaces, EEG data analysis, and the implementation of a BCI system using EEG data.
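
To put a two-class accuracy of about 70% in context, the talk compares systems by information transfer rate in bits per minute. A common definition is the Wolpaw formula; the sketch below assumes that definition and an 8-second trial (the length mentioned in the Q&A), and it ignores any pause between trials, so it may not match how the figures in the talk were computed. The function names are illustrative.

```python
import math

def wolpaw_bits_per_trial(accuracy, n_classes=2):
    """Wolpaw information per selection; returns 0 at or below chance."""
    p, n = accuracy, n_classes
    if p <= 1.0 / n:
        return 0.0
    bits = math.log2(n) + p * math.log2(p)
    if p < 1.0:
        bits += (1.0 - p) * math.log2((1.0 - p) / (n - 1))
    return bits

def itr_bits_per_minute(accuracy, trial_seconds, n_classes=2):
    """Bits per minute, assuming back-to-back trials of the given length."""
    return wolpaw_bits_per_trial(accuracy, n_classes) * 60.0 / trial_seconds

itr_70 = itr_bits_per_minute(0.70, trial_seconds=8.0)
print(f"{itr_70:.3f} bits/min")
```

Because the rate scales inversely with trial length, a system with moderate accuracy on short trials can come out ahead of a more accurate system that needs much longer data windows.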

Key Takeaways

Present visual cue to direct attention
Play excerpt twice with standard and modified version
Ask subjects to identify whether two excerpts are the same or not
Give feedback to subjects after each trial
Apply bandpass filter for pre-processing
Use AAD method to reconstruct stimulus envelope
Train and test on simple SVM classifier
Divide EEG data into smaller segments
Input each segment into the attention decoder
Reconstruct the envelope for each instrument
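
The segment-wise steps in the list above can be sketched as follows. This is an illustrative outline rather than the study's code: it assumes a reconstructed envelope is already available for the whole trial, splits it into equal-length segments (18 is used here to match the 2-by-18 matrix mentioned in the talk), and correlates each segment with the matching segment of each instrument's envelope. The function name and the synthetic signals are hypothetical.

```python
import numpy as np

def segment_correlations(recon, env_a, env_b, n_segments=18):
    """Return a (2, n_segments) matrix of per-segment Pearson correlations
    between the reconstructed envelope and each instrument's envelope."""
    seg_len = len(recon) // n_segments
    out = np.zeros((2, n_segments))
    for s in range(n_segments):
        sl = slice(s * seg_len, (s + 1) * seg_len)
        out[0, s] = np.corrcoef(recon[sl], env_a[sl])[0, 1]
        out[1, s] = np.corrcoef(recon[sl], env_b[sl])[0, 1]
    return out

# Synthetic stand-in: the reconstruction is a noisy copy of envelope "a".
rng = np.random.default_rng(1)
n = 18 * 100
env_a = np.abs(rng.normal(size=n))
env_b = np.abs(rng.normal(size=n))
recon = env_a + rng.normal(scale=1.0, size=n)

corr = segment_correlations(recon, env_a, env_b)
abs_diff = np.abs(corr[0] - corr[1])   # per-segment feature used for selection
print(corr.shape, corr[0].mean().round(2), corr[1].mean().round(2))
```

The resulting matrix, or a reduced version of it, is the kind of low-dimensional feature that a simple classifier such as an SVM could then be trained on.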

💡 EEG headphones combined with a stimulus reconstruction decoder and segment-level feature selection can achieve good decoding accuracy for music attention in a user-friendly auditory brain-computer interface.
