Forest Sound Scene Simulation and Bird Localization with Distributed Microphone Arrays

Microsoft Research · Intermediate · Algorithms & Data Structures · 5y ago


Key Takeaways

This video by Microsoft Research demonstrates a forest sound scene simulation and bird localization using distributed microphone arrays, focusing on acoustic modeling, sound source localization, and robustness evaluation.

Full Transcript

[Music]

Okay, while we count down the last seconds, I can maybe introduce Shoken. Shoken Kaneko was with us this summer, virtually, for an internship. He's from the University of Maryland, and he worked on a very exciting AI for Earth internship project on bird sounds. With that, Shoken, please take it away.

Okay, thank you very much, everyone, for coming to my final internship presentation, and thank you, Hannes, for the introduction. So let's get started.

Just briefly about the background. There is a technique called audio-based wildlife monitoring. It is an important technique that biologists use for the conservation of animal species and ecosystems. Microsoft has an initiative, the AI for Earth project, which is concerned with these issues of conservation and sustainability, and we wanted to contribute to this field of audio-based wildlife monitoring by means of our expertise in acoustics and signal processing.

The particular problem that we're interested in is bird localization in forests using distributed microphone arrays. Our collaborator, Justin Kitzes at the University of Pittsburgh, and his research group are planning to deploy this kind of distributed microphone array in a forest, with about 20 by 20 microphones, so about 400 microphones, which would cover, in the largest case, one kilometer by one kilometer of forest.

The challenges in this kind of localization problem are the following. First of all, the acoustics of a forest can be highly reverberant, which may harm the performance of the localization algorithm. It can also be very noisy because of other animals in the forest, or wind noise and rain, these kinds of natural noise sources. Then we also have measurement errors, which are often ignored, but we have to live with them. We will have synchronization
error, because the signals of these distributed channels will be synchronized using the GPS clock, and there is a certain amount of synchronization error. We also have to deal with the positional ambiguity of the microphones: it is not possible to know the exact positions of the microphones, so there will be some error there as well.

So the proposed approach is to conduct computer simulations of this audio-based bird localization in a reverberant and noisy forest, and to study the performance and the challenges that may be present in existing source localization methods. We combine forest acoustic simulation, bird sound and background noise dataset preparation, and simulation of the measurement errors that I just mentioned. We then apply sound source localization techniques to these virtual forest sound scene recordings and look at how they perform.

I would like to briefly mention related work on this topic of bird audio localization. The existing work can be roughly categorized into two classes. One group uses densely localized, compact microphone arrays, and another group uses the sparse distributed arrays that we are going to use. Depending on the form of the microphone array, the algorithms that are used will be different. Recently there has been some work that combines these two approaches: for each microphone node in the sparse grid, a dense localized array is used, and in that way the localization performance can be improved. But in our work we focus on the sparse grid with mono microphone nodes, because we are interested in covering a large area of the forest, and using multi-channel nodes would increase the cost significantly. So for the time being we focus on these simple sparse arrays.

A possible architecture of such a bird monitoring system could look like this. We have a forest and a
distributed microphone grid, and some birds living in this forest. There will be a bird activity detector, which detects the activation of bird sound, and that will trigger the localizer and the bird species classifier. The biologists are interested in the information of when and where which kind of bird is active, in order to compute higher-level information such as the territories of birds, the number of birds living in the forest, and that kind of thing. So this is the basic low-level information.

In our early-stage review of the related literature, we found that there has been a considerable amount of work on bird activity detection and species classification in recent years, mostly using deep neural networks, and even the biologists are training deep neural networks by themselves nowadays. So we thought that we could potentially deliver a more interesting contribution by focusing on the localization part.

The scope of the project starts with the forest acoustic simulation. Then we will discuss the dataset development for the simulated forest sound scenes, and as a third part we'll discuss the robustness of the sound localization algorithms in this virtual forest sound scene.

Okay, so let's start with the forest acoustic simulation. The problem of forest acoustics has a long history and dates back at least to the 1940s, to the work by Eyring and Foldy. The aim of these early works was to compute a macroscopic bulk parameter of a forest, such as an effective wave number. The theory of these early works is really solid. It rigorously takes into account the multiple scattering that happens in a forest, caused by the tree trunks. But one drawback is that, to compress all the microscopic information about the scattering events in a forest, the theory has to rely on statistical averaging, and that will
basically sacrifice the microscopic information about the scattering events. So I thought that using this single bulk parameter of a forest to synthesize forest impulse responses would be too difficult. I mean, it doesn't have the capacity to do that for arbitrary sound source and receiver positions. So I didn't rely on this work for our simulations.

Then there is another line of research in the field of sound effects, or digital reverberation synthesis. There have been two papers, one from 2008 and one from 2017, and both focus on modeling the multiple scattering that happens in outdoor environments. That's very nice, but the price they had to pay is a very high computational cost, which scales cubically, so as N cubed, where N is the number of trees. So it was nice that their algorithms take multiple scattering into account, but they are unfortunately limited to very small forests, about 25 or 50 trees or so. So unfortunately we also couldn't rely on this line of work.

So we had to develop our own acoustic simulation algorithm. Our approach was to use a simple physical model, but to do the computation at large scale. The physical effect that we take into account is the scattering by the tree trunks, but the difference from previous methods is that we do it in a first-order approximation, so we ignore multiple scattering in our model. This may sound like an oversimplification of the problem. However, it turns out that the resulting impulse responses are quite reasonable. We also take into account air dissipation, and this also turned out to be a very important factor. The property of our algorithm is that it scales linearly with respect to the number of trees. Because of this scalability, the simulation of a forest of one kilometer by one kilometer
with 100,000 trees can be easily performed.

Here the forest impulse response model of our algorithm is described. Basically it has three components: the direct signal, the ground reflection component, and the tree scattering component. The direct signal is composed of the distance attenuation, which contains a time delay and the amplitude attenuation, and the air dissipation component. The ground reflection component has, in addition to that, the ground reflection coefficient. The most interesting part is the tree scattering. It has the distance attenuation from the source to the trees, then there is the tree scattering filter, which depends on the scattering angle, and then we have the distance attenuation from the tree to the receiver. Here the amplitude attenuation factor is included in the scattering filter, so only the delay component is described here. Then we also have the air dissipation, which depends on the path from source to tree to receiver.

Next, the tree scattering filter is described like this. This may look a bit complicated, but it's basically a classical result of theoretical acoustics. You can find it in very old theoretical acoustics papers. But here I introduce two parameters that we can potentially use to tune the result. One is this exponent alpha, where rho is the distance from the tree to the receiver. In theory, if you consider a scattered cylindrical wave, this alpha should be 0.5, but in reality the trees are not infinitely long cylinders like they are in theory, so I wanted to make this flexible. The other is beta, a parameter that controls the amplitude of the scattered signal.

Next is the air dissipation model. The nice thing about this is that it is standardized, so we just use this standardized model and implement it as a finite impulse response filter.

Okay, so I would like to present some demos of these forest
impulse responses. I have here five different variations. The first one is a real forest impulse response, recorded in a national park in Finland, this nice, beautiful forest here. Then we have the forest reverberation synthesis algorithms from 2008 and from 2017, and two variations of our algorithm with two different numbers of trees, that is, tree densities. Okay, let's start with the real one. I will first play back the impulse responses, and then I will play back the impulse responses convolved with a dry bird sound.

Okay, so this is a real forest impulse response. This is the 2008 algorithm. The 2017 algorithm. Ours with a hundred thousand trees. And ours with 500,000 trees. Again, the real one.

Shoken, we cannot hear it. Oh, we can, or I can at least, but I think you have to play the samples for people to appreciate.

Okay, yeah, I will do that. Okay, so do you hear this sound?

Yes.

Okay. Sorry, maybe the head of the impulse response wasn't played back. I don't know, maybe it's because of PowerPoint. But anyway, let's listen to the birds. This is a dry bird song. This is the real forest. The 2008 algorithm. The 2017 algorithm. Ours with a hundred thousand trees. Oops, sorry. Ours with five hundred thousand trees. And again, the real one.

Okay, so I think this demonstrates that our algorithm at least works. It doesn't produce a useless impulse response, and it's actually quite reasonable.

One comment on the performance. The algorithm by Spratt et al. from 2008 has an open-source MATLAB implementation. We computed an impulse response for a forest of 50 trees, computing up to second-order scattering, and it took about 31 minutes to synthesize a single forest impulse response. In contrast, our algorithm was able to compute a single forest impulse response for a forest of 100,000 trees in just 1.3 seconds. So it's significantly faster and more efficient.

Oh, sorry. Okay, so next we
evaluated these forest impulse responses using a quantitative measure, which is the echo density. I won't go into the details of this metric, but it basically measures the density of the echo spikes in a sliding-window fashion, so it gives you a temporal profile of the echo density. Typical room impulse responses have this kind of echo density profile: they grow and saturate at a constant value.

Here are the echo densities for some of these forest impulse responses. The red one is the 2008 algorithm, the 2017 algorithm is this yellow one, ours with 500,000 trees is this green one, and the real recording from the Koli National Park is the blue one. What we can see here first is that real forest impulse responses tend to have this flat profile, so from the very beginning of the impulse response it has an almost constant echo density. We observe this in basically all of the Koli recordings. There are 10 or maybe 20 or so provided and publicly available. That almost constant profile is not observed in the previous algorithms, but it can be observed in our algorithm with many, many trees. If we reduce the number of trees in our algorithm, we lose this constant profile and get a pattern more similar to this one, so initially it has a smaller value. But by adding enough trees, we get this realistic profile here.

Okay, and one last component of the forest simulation is the integration of the sound source directivity. We're interested in birds, and birds will have some kind of directivity of the radiation. We approximate this using a theoretical model: we approximate the bird's head with a sphere and assume a point source mounted on the sphere's surface. This is also a well-known problem in theoretical acoustics, and the result can be easily computed. Okay, so this was the last part on the forest impulse responses.

What is the diameter of the sphere?

Okay, so here we just
use a constant value. I think it was 2.5 centimeters or so, around that, or maybe five, of the head, yeah, of the bird, of our average bird head.

Thank you.

Okay, so next I'll briefly talk about the sound scene dataset creation. We have used two well-known bird audio databases, the Macaulay master set and the xeno-canto database. The Macaulay master set is a subset of a much larger dataset, known as the Macaulay Library dataset. The xeno-canto dataset itself is also a massive dataset, with more than a few hundred thousand bird sound clips, but we use just a subset of that. Using a general-purpose voice activity detector, we extracted clean bird clips, roughly a thousand from both of the datasets, plus 2,000 noise clips from the Macaulay set and 40,000 noise clips from the xeno-canto dataset. This number roughly tells you how noisy the xeno-canto dataset is. Actually, what we heard from our collaborating biologists is that the Macaulay master set is one of the cleanest and xeno-canto is one of the messiest, so it's nice to have this contrast here.

I'll just play back some examples. This is a clean bird clip from the xeno-canto set, and an example noise clip. For each of the extracted noise clips we also generate stationary noise clips that have the same spectral shape. Okay, and here's another noise clip from the xeno-canto dataset, and the stationary version of that. These noise clips are extracted from field recordings of bird sound, so they kind of mimic reality.

We implemented a script that randomly generates forest sound scenes using the database of bird clips and noise clips and the forest impulse responses that we simulate, and that saves the audio clip and metadata containing all the labels and ground truth information about the sound scene. Here's an example sound scene, which is six seconds long and has 10 bird chirps. And the same sound
scene recorded at a different position in the virtual forest is this. Oops, sorry, what happened? Oh, okay. I'll play back the first one again. So I guess you have recognized the contrast depending on the microphone position.

Okay, and as one last component of the sound scene simulation, we integrate simulated wind noise. There's a nice paper from 2018 which describes a method to simulate wind noise, and I'll just play back an example here. Okay, so it sounds quite reasonable, and we use this as an additional noise source in our experiments.

Okay, so the last part is about the sound source localization experiments. I'll briefly introduce the basics of the sound source localization techniques that we are using here, which are called time delay estimation methods. The idea is simple. If you have two microphones and a sound source and record the signal, you receive the sound wave at different times, depending on the distances between the sound source and the microphones. So the idea is to compute the cross-correlation function between these two microphone signals. That will give you a spike at the time corresponding to the time difference of arrival. Identifying that spike means that you can identify the hyperboloid of that specific time difference of arrival, which serves as a hint of the sound source position. If you only have two microphones, you only get this single hyperboloid, but if you add more microphones you get multiple hyperboloids, and the point where all the hyperboloids intersect is basically the position of the sound source.

To do this robustly, we usually compute a function called the spatial likelihood function. For each of the possible points we compute the likelihood of the sound source being there, and we aggregate that over all the microphone pairs that are present in the array. These kinds of methods are called
generalized cross-correlation (GCC) methods. "Generalized" means here that we apply a spectral weighting function in the computation of the cross-correlation. The computation is usually done in the spectral domain, and before going back to the time domain we apply a spectral filter.

Okay, so in our experiments we compared two of these GCC algorithms. The first one is GCC-PHAT and the second one is GCC-Roth. There are many more variants of these GCC algorithms, but here we focus on the very simple ones, because the more sophisticated functions require either a priori knowledge of the signal and noise spectra or online estimation of that information, and that would complicate the signal processing pipeline a little bit. We also wanted to focus on the basic algorithms because GCC-PHAT is very widely used, also by the bioacoustics people. So we focus on these algorithms here.

To accelerate the experiments at larger scale, we run the experiments on the GPU. In the largest case in our experiments we consider a microphone grid that covers a forest of 400 meters by 400 meters, with a spatial resolution of 20 centimeters, and that means we have 2,000 by 2,000 possible sound source positions. That computation can be a little bit expensive, but it can be very efficiently parallelized on the GPU, so we do this optimization here.

Oh, by the way, this is one example of the spatial likelihood function. You usually see these traces of the hyperboloids that I just mentioned, and the point where all the hyperboloids intersect gives you the bright spot, which is likely to be the sound source position.

Okay, so before going to the experiments, I would like to explain some of the assumptions that we make here. We assume that we have this kind of system. We particularly assume that we have a robust bird activity detector as a preprocessor,
which robustly reports the microphone that is closest to the bird, and in our experiments we use the nine microphones, the three-by-three grid of microphones, that contains the bird. This assumption is made in order to reduce the computational cost. In our largest case, as I mentioned, we have a 400 meter by 400 meter grid, but the forest can be much larger, and to reduce the search space in the localization algorithm we made this assumption.

So now I'd like to explain the first experiment. The first experiment is an exhaustive study of all the variables that we have. We have nine variables here. The first one is the source position. For the source position we use a hundred randomly selected positions within the microphone grid, and basically we use these 100 positions to average the results. For the microphone spacing we have four choices, from 25 meters to 200 meters. Oh, by the way, the microphone spacing is this distance d here.

For the microphone grid type we have five choices. In the very ideal case we have a completely regular Cartesian grid, like this image shows, at the recording stage and also at the localization stage. But in reality we don't have access to the exact microphone positions, so here we assume some microphone position error and use a perturbed grid at the localization stage. But also, in reality it's not possible to place the microphones at these kinds of exact Cartesian grid positions, because if you have a river in the forest or if you have rocks in the forest, you can't place the microphones there. So in reality it would be a perturbed grid at the recording stage, and for the localization stage we have three choices here: either the regular Cartesian grid, or the exact perturbed grid, or another perturbed grid. This last one is the most realistic case.

For the tree density we have either no trees, or 10,000 trees per square
kilometer, or 100,000 trees per square kilometer. For directivity we have an on/off switch: on means the directional bird directivity, and off is the omnidirectional case. For the noise types we have 10 different choices: either clean, so no additive noise, or the shaped stationary noise that I played back before, or the extracted noise clips, or the wind noise, and for each of these noise categories we have three different levels, from 40 dBA to 60 dBA. For the synchronization error we also have an on/off switch, and in the on case we apply a random synchronization error to each of the microphone channels, independently sampled from a uniform distribution from minus one millisecond to plus one millisecond. For the GCC weighting function we have the PHAT and the Roth functions, and for the datasets we have the two, the Macaulay and xeno-canto datasets. If we multiply these numbers we get 960,000 combinations, which is close to 1 million, so it was essential to run the experiments on the GPU, especially in the large grid cases.

Okay, so before describing the results I would like to explain what kind of evaluation metrics we use. For that, the histogram of the localization errors is plotted here. What we usually observe is that we have many, many experiments where the localization error is very small, but then we have this long tail of almost random localization errors. So it is natural to categorize the results into success and fail, and basically we use this success rate, the probability of success, as the evaluation metric in the next couple of slides. For the threshold we use 5 meters, so if the localization error is smaller than five meters, we treat it as a success. This matches the histogram that we get here, but it is also based on insights from the biologists. What our collaborators told us is that they're interested in computing
the territories of the birds, and for that purpose 5 to 10 meters would be sufficient for the localization accuracy. So we also incorporate that knowledge here.

As I mentioned, we have this nine-dimensional explanatory variable space, and visualizing the results in this nine-dimensional space is a bit tricky. In order to compress the results, we looked at the standard deviation of the success variable. The success variable is a binary variable, where one means success and zero means fail. If you look at the standard deviation of success for each of the variables, we get this. We find that the microphone spacing, the noise type, and the GCC algorithm choice have a significant impact here, and the impact of the other variables is relatively small.

To give you some numbers: we observe that the presence of source directivity drops the average success rate by 9.3 percent, the application of synchronization error drops the result by 1.8 percent, and the presence of microphone position error drops the result on average by 7.8 percent. The GCC-PHAT algorithm can outperform GCC-Roth by 20 percent in success rate, which is quite significant.

So to reduce the variable space, we first select the realistic scenario. We take the results where directivity is on, synchronization error is on, and the microphone grid is the most realistic one, and we take the GCC-PHAT algorithm. The number of trees had a relatively small variance here, so we average the results over the realistic cases, that is, the cases with trees, excluding the case without trees. The source position and dataset dimensions are also used to average the results. So we end up with a two-dimensional variable space, which is trivial to visualize and which is summarized here. On the x-axis we have the noise type, and on the y-axis we have the microphone spacing. What we see here is that the general trend is that if we increase the
microphone spacing, we lose success rate of the localization, and also if we increase the noise level, we lose success rate. So that's a reasonable result, but it implies some things. The results tell us that the SNR, the signal-to-noise ratio, is important, and that the spectral weighting function, the GCC algorithm, has a significant impact on the results.

This led us to two ideas to improve the performance of the sound source localization. The first idea is to apply a problem-specific spectral weighting, which in our case is simply band limiting. So far we haven't utilized the fact that the things we are trying to localize are birds. So here we incorporate a very simple piece of domain knowledge, that bird vocalization is mostly contained in the spectral range of one kilohertz to eight kilohertz, and apply that band limiting in the GCC algorithm. The second idea is to improve the SNR, so we apply noise reduction as preprocessing. In particular, we use the MMSE-STSA noise suppressor, which is applied independently to each of the microphone channels.

That gives us these results. At the top left we have the baseline result. At the top right we have the result with the band limiting applied, which improves the average success rate by nine percent. If we add noise reduction only, we get a 7.8 percent average improvement, and if we apply both of these techniques we boost the success rate by 14.4 percent. This may sound obvious, but the result is actually very encouraging, because if we look at the moderate noise levels and, for example, the microphone spacing of 100 meters, initially we had only about a 50 percent success rate, but after these improvements we have a 70 to 75 percent success rate.
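
As a rough illustration of the band-limited GCC idea in the passage above, the following NumPy sketch estimates the time difference of arrival between two microphone channels. It is not the speaker's implementation: the function name `gcc_tdoa`, its parameters, the use of single-frame spectra, and the Roth weighting written as 1/|X|² are assumptions made for this example. Only the 1 kHz to 8 kHz band and the choice between PHAT and Roth weighting come from the talk.

```python
import numpy as np

def gcc_tdoa(x, y, fs, weighting="phat", band=(1000.0, 8000.0)):
    """Estimate the delay of y relative to x (seconds) with band-limited GCC.

    Hypothetical helper for illustration; a positive result means y lags x.
    """
    n = len(x) + len(y)                 # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    cross = Y * np.conj(X)              # cross-power spectrum
    eps = 1e-12                         # guards against division by zero
    if weighting == "phat":
        w = 1.0 / (np.abs(cross) + eps)     # keep phase only
    elif weighting == "roth":
        w = 1.0 / (np.abs(X) ** 2 + eps)    # single-frame assumption
    else:
        raise ValueError("weighting must be 'phat' or 'roth'")
    # Band limiting: zero all bins outside the assumed vocalization band.
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    cc = np.fft.irfft(cross * w * mask, n)
    # Reorder so that the lags run from -max_lag to +max_lag.
    max_lag = n // 2
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    lag = int(np.argmax(cc)) - max_lag
    return lag / fs

# Usage: a noise burst and a copy delayed by a known number of samples.
rng = np.random.default_rng(0)
fs = 32000
x = rng.standard_normal(4096)
delay = 25
y = np.concatenate((np.zeros(delay), x[:-delay]))
print(gcc_tdoa(x, y, fs) * fs, "samples")
```

In a full localizer, one such delay estimate (or the whole correlation curve) per microphone pair would be aggregated over candidate source positions to build the spatial likelihood function. That step is omitted here.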
So that means that initially this hundred-meter spacing seemed to be kind of useless, but it actually turned out that it can be useful and practical for applications.

Shoken, if I may, just to give people a point of reference. For those of you who live in the Seattle area, I went to Volunteer Park, which is a pretty popular park here in Seattle, and I recorded ambient noise in the evening, with people walking around and airplanes flying overhead, and it was around 50 dBA. So it would correspond roughly to the moderate noise level that Shoken is showing here with W50, S50, and C50.

Yeah, exactly. Thank you very much, Hannes.

Okay, and this is now another visualization of the results. On the x-axis we now have the mean signal-to-noise ratio, where the average is taken over the microphone channels that are used. One thing that I should mention is that this SNR is not estimated but the exact SNR: because we are generating our sound scenes ourselves, we have access to the exact ground truth values for these numbers. On the y-axis is again the success rate. This blue chain line is the baseline, the yellow chain line is the result with band limiting applied, and the solid lines are with band limiting and noise reduction applied. Here I tried several different noise reduction algorithms, but the difference was relatively small. What we again see is that for the same average SNR, especially in this intermediate region, we get a boost of about 20 percent in success rate, or, if you look at it horizontally, for the same success rate we can have about 10 dB more noise.

So this is an encouraging result, but it also provides suggestions for future improvements. So far we have only incorporated minimal domain knowledge about birds, but I would suggest doing more of that. For example, a
machine-learning-based bird sound extractor or suppressor can be utilized to do the separation or estimation of the signal and noise more accurately. There are GCC algorithms that can utilize this knowledge, and that would improve the result of the localization algorithm. We can also utilize domain knowledge about bird vocalization frequency spectra by using the output of the bird species classifier. If we know the species or the type of vocalization of the birds, the biologists have knowledge about this species-dependent bird vocalization spectrum, so that can be used as an additional weighting function for the localization.

With that I would conclude my talk. We have developed an efficient forest acoustic simulation algorithm, a dataset of bird audio clips and background noise clips, and a forest sound scene generator. Using those, we have studied the robustness and the performance of a widely used sound source localization algorithm in a simulated forest that contains many kinds of errors and noises. We have found that the noise type and microphone spacing have a significant impact on the localization accuracy. With that, we would suggest integrating more domain knowledge and real-world statistics about bird vocalization, which would be very useful to improve the localization performance even more. I would also mention that the sound scene generator that we have developed can be effectively used to train these kinds of machine learning models, so I think that's one promising component of possible future work.

Okay, so with that I would like to end my talk. We would like to thank Microsoft's AI for Earth project for sponsoring this research. We would like to thank Professor Justin Kitzes at the University of Pittsburgh for inputs from the biologist's perspective. I personally would like to thank Hannes Gamper for his amazing mentorship. It was really
fun working with you, I really enjoyed the project. I would like to thank Ivan Tashev for having me, and I would like to thank everyone in the audio and acoustics group for the very warm welcome, especially Dimitra for organizing all the fun events and the reading groups, and I would like to thank all the other interns in the AEG for sharing their very inspiring work every week. So thank you very much for having me at Microsoft Research this summer, and if you have any questions I would like to answer them. Thank you very much.

Are there any questions?

So, just a general question, this is Ivan Tashev. You have the localization algorithm and you have this training on the simulation. Did you do any work on the bird classification and reducing the signal-to-noise ratio?

No, we haven't integrated any machine learning approach in this project, so that was out of scope.

Okay, thank you.

If I may add to this: basically our thinking here was that there are domain experts that are actively working on the mono case, so they are training classifiers and maybe even detectors, and it might also be somewhat application-specific, so depending on what types of birds or bird species they're interested in, they might train different classifiers. So we felt that the best way to contribute to their work is by sticking with the signal processing aspects and then plugging in whatever pre-trained models the biologists might already have.

I was more interested in the aspect of getting one signal out of this vast microphone array, and what kind of beamforming and enhancement you can do before feeding it to the mono classifier. But okay, next projects are coming.

So, nice project. I'm especially interested in how you implemented those things, because I know all these models are a lot of work. So are there libraries available for all those things, or is it that especially these spherical models and cylindrical models are
implemented by yourself? And you mentioned you ran it on the GPU.

Yeah, basically I implemented most of those things by myself. The scattering and radiation filters are implemented by myself, and the localization algorithm by myself. I used an open source implementation for the [unclear] measure metric computation, and I also used an open source library for the voice activity detection here, and maybe other ones.

Cool, so it means you implemented all those spherical models and stuff yourself?

Yeah.

Nice, that will be of good use. Oh sorry, no, somebody else.

Can you hear me?

Yeah, I hear you.

Okay, this is [unclear]. So I have some questions. Before you do the classification of the bird, did you try to identify whether the sound is coming from a bird?

Yeah, so I guess you're talking about the bird activity detection here, which classifies if the sound is a bird or not. There are existing models that do that, but we don't use them in our experiments.

So you took the dataset and trained your neural model to do this identification, is that right?

No, no, we don't do any of that identification. In our experiments we only generate sound scenes which are birds, so we assume that we know it's a bird sound that is there, and we try to localize that.

I see, okay.

Yeah, so just to chime in briefly: this is Hannes speaking, by the way. When we started out, I did briefly try to train a simple machine learning-based bird activity detector, or BAD. The issue there is that you don't have a very finely labeled dataset to train with, so what you get is basically clips that say either bird present or not present, and I think the clips were 10 seconds long, and what we would have been interested in is something more fine-grained than that. So I tried to train a
model that would output continuous bird activity detection estimates, but we decided that rather than relying on my hastily implemented, suboptimal bird activity detection, let's take that problem out of the equation and just assume that the bird activity detector works as intended. In practice, what might happen is, even if you have a bad or no bird activity detector, you use a very simple metric and just say: whenever there's a loud sound at any microphone, use that as a query, run the localization, potentially run noise suppression and/or beamforming, and then feed that to a bird classifier. If the bird classifier says this is a bird, or this is a particular bird, then you go back and mark that location as an incident. So if your localization algorithm is efficient enough, you might be able to afford to just run a lot of false positives through your localization algorithm rather than have a very accurate bird activity detector that would weed out a lot of those false positives. That was our thinking.

That's actually a good point. But that leads me to my second question. I understand this work is based on simulation, but I guess in the future you will do field experiments. How do you get ground truth for those bird locations? Have you done any investigation on that?

That's a good question. What the biologists have done so far is using loudspeakers as virtual birds.

For emulation?

Yeah, for emulation. So it's quite tricky to get the real ground truth positions from living birds, right?

Yeah. I have no idea, but I guess one option might be, if you equip a few birds with a GPS sensor and a microphone, you might get a ground truth location. You might be able to use something like that.

But GPS has far worse performance than acoustic localization, right?

Right, yes.

You might want to put, like, a cage somewhere in a room and use a Vicon system
to do, like, camera-based localization, the same thing.

Yeah, I think the closest is people using loudspeakers in the forest, and that has been done.

I see. And also, in the simulation I see you take the directionality of the birds into account, but did you also take the microphone directionality into account? Because I know you probably assume the microphones are all omnidirectional, but there are not only omnidirectional microphones, so probably you want to account for that.

Yeah, good point. The answer is no, we haven't taken it into account. We could have done it, but I thought we already took the bird directivity into account, so we could see some kind of effect, how that would affect the result. But you're right, in reality there would also be some directivity of the microphones.

Right, because if the microphone is directional, this might involve mechanically steering the direction of the microphone, with either better or worse performance of your system, I guess. So my last question is about the time synchronization. In the beginning of your presentation, I remember you mentioned you investigated the time error introduced by maybe GPS, and you briefly mentioned in your results turning this on and off in your simulation. How exactly does that affect or impact your results?

Yeah, so I just briefly mentioned the average effect here. Sorry, where was it? Yeah, so actually it had quite a little drop here in average success rate, only about two percent, which is relatively small. And we used a synchronization error, as I mentioned, from minus one millisecond to one millisecond, so uniformly sampled from that range.

But do you know how much synchronization error from GPS there could be?

Yeah, so what our collaborators told us is that in their conditions they achieve about five microseconds, so I think our condition is much more strict than it might be
in reality. But in some literature people say that using this GPS clock-based synchronization, the synchronization error could be one or two milliseconds, so I think the value that we assume is kind of a worst case.

I see. Because I just did a quick search, and what I found on Google says GPS achieves, like, tens of nanoseconds of synchronization accuracy. That's obviously overkill for your system, I guess. Okay, yeah, that's my question. Thanks.

Thank you.

Nice presentation, thank you. Can your algorithm tell the difference, because you said you didn't place microphones by the rivers and things like that, can it tell the difference between a constant steady sound, like the seashore, and bird calls?

Sorry, I'm not sure if I understood the question correctly, but you're talking about the sound source signal?

Yeah, because you were saying you didn't place microphones by rivers to record the bird sounds. It's kind of limiting the dataset. I took from that, anyway, that you couldn't, for instance, place these by the seashore and actually get a lot of seabird noise and those recordings.

So, sorry, I think the question is, Shoken, that you mentioned briefly at one point that a microphone cannot be placed on a river, right? I think what Shoken meant is that in the localization you assume that the microphones are relatively evenly or regularly spaced, and if there is some geological feature that might prevent you from being able to do that, it's not a limitation of the method or of the dataset. It just means that the position of the microphone might be limited by geological obstacles. But the constant noise of an ocean or a river should not, in theory, prevent this method or the localization from working.

Okay, all right, thank you.

Yeah, sorry, my earphone was a bit
too quiet, sorry, I couldn't hear it.

Any more questions? Otherwise I have one generic one.

May I ask one? This is Rico.

Yeah.

Hi Hannes, hi Shoken, this is very cool work. Going back to the earlier question on the synchronization, I wonder if you or other researchers tried some active compensation. Let's say you put four noise-like sources in the field and then try to calibrate. Would that make sense?

Yeah, maybe. We haven't done that kind of thing, and I don't know if there's related work on that in the field of bird activity detection, but maybe, potentially, yes. I'm not sure, I'm not a specialist in this GPS-based synchronization, so sorry, I honestly don't have an answer.

That's fine. Okay, thanks.

I don't see any other raised hands. Sorry, Rico, I didn't realize, I forgot about that feature. Oh, Dimitra's raising her hand. Go ahead, Dimitra.

No, it's a great feature, yeah, I forgot about it. So Shoken, this is great work, and a lot of work. If you look at those beautiful success rate plots that you gave us, in practice you can use that as a rubric, a metric, to decide: given the forest, or the situation, or the question you're trying to ask as a biologist, you can use this as a metric to decide what parameters to use, how to place your microphones, and what the wiggle room is for everything, right?

Yeah. But of course, before applying this to reality, it should be confirmed by real measurements in real forests, to study how that correlates. But in principle, yes.

So the only thing you're missing is real impulse responses?

Yeah, real forests and real birds.

Because the audio you used were real birds, and the noise was also real, from the forests?

Yeah, basically.

I should maybe add here: the idea originally was to use real recordings, but because of COVID-19 the field recordings had to stop. So the lab that was gearing up to make those measurements, they had to basically postpone
their collection. And then we realized that actually doing a proper simulation, as well as we can, including simulating the reverb, simulating the noise, the synchronization error, bird directivity, and all that, might actually be useful even if you have the chance to go back to the forest and take measurements, because you can probe a much bigger space, both spatially and in terms of the variables, with this approach than you could if you limit yourself to one particular forest with one type of birds and one category of noise examples you will find there. So it seems like there is actually value in doing this simulation beyond just the results you're seeing here.

Okay, if there are no more questions, I think, thank you again, Shoken, it was great having you with us remotely. I see, yes, go ahead.

Hi. So when I go hiking with my wife, we often notice these birds, I think they're called warblers, and they have a very unique song where they appear to be throwing their voices. I'm wondering if there's any understanding as to how they appear to do this, or if you've looked at it, or if anyone has looked into this, or if your work could be used to look into this.

Yeah, so we haven't looked at particular bird species, but I guess that our collaborators, the biologists, know very much about that, so yeah, we haven't done much on that specifically.

Okay, thank you.

Okay, if there are no other questions, then let's thank Shoken again. Thank you.

Nice.
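
The talk refers to GCC (generalized cross-correlation) time delay estimation with an additional spectral weighting function, but the transcript does not show the implementation. The following is a minimal sketch of that general idea, not the speaker's code. It assumes the common PHAT normalization, and the names `gcc_delay`, `band_weight`, `weight_fn`, and `max_delay` are hypothetical, as is the simple band-pass weight standing in for a species-dependent vocalization spectrum.

```python
import numpy as np


def band_weight(f_lo, f_hi):
    """Return a weighting function that keeps only f_lo..f_hi (in Hz).

    Hypothetical stand-in for a species-dependent vocalization spectrum.
    """
    def weight_fn(freqs):
        return ((freqs >= f_lo) & (freqs <= f_hi)).astype(float)
    return weight_fn


def gcc_delay(x, y, fs, weight_fn=None, max_delay=None):
    """Estimate by how many seconds y lags x (negative if y leads x)."""
    # Zero-pad so the correlation is linear rather than circular.
    n_fft = 1 << (len(x) + len(y) - 1).bit_length()
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(y, n_fft)
    cross = Y * np.conj(X)
    # PHAT normalization (an assumption here): keep only the phase.
    cross = cross / np.maximum(np.abs(cross), 1e-12)
    if weight_fn is not None:
        # Extra spectral weighting, e.g. emphasizing the expected bird band.
        cross = cross * weight_fn(np.fft.rfftfreq(n_fft, 1.0 / fs))
    cc = np.fft.irfft(cross, n_fft)
    max_shift = n_fft // 2
    if max_delay is not None:
        max_shift = max(1, min(max_shift, int(round(max_delay * fs))))
    # Reorder so that index 0 corresponds to a lag of -max_shift samples.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (int(np.argmax(cc)) - max_shift) / fs
```

A call such as `gcc_delay(mic_a, mic_b, fs, weight_fn=band_weight(2000.0, 8000.0))` would restrict the correlation to a 2 to 8 kHz band; those band edges are placeholders, not values from the talk. Pairwise delays like this are the usual input to a position estimate across several microphones.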

Original Description

Audio-based wildlife monitoring is an important method for studying animal habitations and for the conservation of animal species and ecosystems. In this work, we have developed a highly efficient and scalable forest acoustics simulation algorithm, a dataset of bird audio clips and background noise clips extracted from two publicly available field recording databases, and a synthetic forest wildlife sound scene generator for distributed microphone array recording setups. We used the synthetic forest sound scenes to study the robustness of commonly used sound source localization algorithms in a wildlife monitoring setup in various reverberation, noise, and measurement error conditions. In our simulated bird localization experiments, we observed that the microphone spacing, signal-to-noise ratio, and the choice of the spectral weighting function in the localization algorithm have significant impact on localization accuracy, while the effect of synchronization error and microphone position misalignment was modest. We also observed that problem-specific spectral weighting in the localization algorithm and noise suppression pre-processing significantly improve the localization accuracy. These results are expected to help design practical wildlife monitoring systems and suggest promising directions for further improvements. See more at https://www.microsoft.com/en-us/research/video/forest-sound-scene-simulation-and-bird-localization-with-distributed-microphone-arrays/
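
To put the synchronization figures from the Q&A in perspective, a clock offset can be converted into an equivalent acoustic path-length error by multiplying it by the speed of sound. The sketch below does only that arithmetic. The value of 343 m/s (air at roughly 20 °C) is an assumption, and `path_error_m` is a hypothetical helper name; the ±1 ms and 5 µs values are the ones quoted in the talk.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C


def path_error_m(sync_error_s, c=SPEED_OF_SOUND):
    """Path-length error (metres) equivalent to a clock offset (seconds)."""
    return abs(sync_error_s) * c


cases = [
    ("simulated worst case, 1 ms", 1e-3),
    ("value reported by collaborators, 5 us", 5e-6),
]
for label, err in cases:
    print(f"{label}: {path_error_m(err):.4f} m")
```

Under this assumption a 1 ms offset corresponds to about 0.34 m of path length and a 5 µs offset to under 2 mm, which is one way to read the observation that the simulated synchronization error had only a modest effect.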

This video shows how to simulate forest sound scenes and localize bird sounds using distributed microphone arrays, with a focus on acoustic modeling and sound source localization. It discusses generalized cross-correlation (GCC) time delay estimation, spectral weighting functions, and noise suppression pre-processing as ways to improve localization accuracy.

Key Takeaways

Extract clean bird clips and noise clips using a general-purpose voice activity detector
Generate stationary noise clips with the same spectral shape as the noise clips
Implement a script to randomly generate sound scenes using bird clips and noise clips, and forest impulse responses
Integrate simulated wind noise into the sound scene simulations
Use time delay estimation methods for sound source localization
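
Two of the steps above, generating stationary noise with the same spectral shape as a recorded noise clip and mixing a bird clip with noise, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method used in the video: it assumes the stationary noise is made by keeping the magnitude spectrum and randomizing the phase, and that clips are mixed at a chosen signal-to-noise ratio. The function names `stationary_noise_like` and `mix_at_snr` are hypothetical, and forest impulse responses are left out.

```python
import numpy as np


def stationary_noise_like(noise, rng):
    """Noise with the same magnitude spectrum as `noise` but random phase."""
    spec = np.fft.rfft(noise)
    phase = rng.uniform(0.0, 2.0 * np.pi, size=spec.shape)
    shaped = np.abs(spec) * np.exp(1j * phase)
    # DC (and Nyquist for even lengths) must stay real for a real signal.
    shaped[0] = spec[0]
    if len(noise) % 2 == 0:
        shaped[-1] = spec[-1]
    return np.fft.irfft(shaped, n=len(noise))


def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` to the requested SNR (dB) and add it to `signal`.

    Both arrays must have the same length. Returns the mixture and the
    gain that was applied to the noise.
    """
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + gain * noise, gain
```

A scene generator along these lines would draw a random bird clip, a random noise clip, and a target SNR for each scene, for example `mix_at_snr(bird, stationary_noise_like(noise, rng), 5.0)`, where the 5 dB value is only a placeholder.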

💡 The use of simulation can probe a bigger space than physical measurements and provide valuable insights beyond just results, making it a crucial tool for audio-based wildlife monitoring and conservation efforts.
