Fireside Chat with Shadab Khan - AI in Healthcare and Data Science Career Tips

Imaad Mohamed Khan · Beginner ·📐 ML Fundamentals ·4y ago

Skills: Data Literacy 70%, ML Maths Basics 60%

Key Takeaways

Shadab Khan discusses his experiences in building AI systems in healthcare, analyzing structured and unstructured datasets, and shares career tips for data and AI scientists.

Full Transcript

Hello and welcome to yet another Mantissa Data Science webinar. My name is Imaad Mohamed Khan, and I organize Mantissa Data Science meetups and webinars. I will give you a little bit of an introduction on what Mantissa has been doing so far, and then I will introduce the speaker. After that we will have a fireside chat with the speaker, and hopefully towards the end we will have some time for questions and answers.

Today we have with us Shadab Khan. Thank you so much for taking the time to do this session with us. I'll just give a brief introduction of what Shadab has done, and then I will allow him to say a few words. Shadab leads a team of applied scientists and engineers at G42 to solve problems in healthcare AI. His team develops solutions for clinical care, healthcare operations, and healthcare finance by analyzing structured and unstructured datasets ranging from electronic health records, genomics, and medical imaging to claims, among others. Before joining G42 Healthcare, Shadab was a researcher at the Inception Institute of AI in the UAE, where he focused on machine learning from limited data. Shadab obtained his PhD from Dartmouth College in biomedical engineering and did a research fellowship at Harvard Medical School and Boston Children's Hospital in radiology. I think Shadab is a great candidate for a fireside chat because he has such varied experiences, and I'm really excited for the session. If you would like to introduce yourself in a few words?

Sure. Thank you so much for inviting me here. I'm really happy to hear about how organically Mantissa Data Science has grown as a community. It's really exciting to see that, regardless of location, community resources such as Jeremy Howard's famous fast.ai courses are benefiting people so much. I've personally used it myself and found it very, very useful to get started in this area. So, as you already mentioned, I lead a team of applied scientists here at G42
Healthcare in Abu Dhabi. G42 is a company which is fairly new to the space. We've been in existence for the past three years, based here in Abu Dhabi, looking at several verticals of interest, healthcare of course among them, where I currently work.

Cool. Okay, if you do have more to add?

No, I'm good.

Okay. If that is the case, then I'll also brief the audience on questions and answers. There is a place for Q&A on the right side of your screen, perhaps, where you can go and drop your questions. We can either take them during the session if we have time, I mean while we are having this fireside chat, or towards the end of the session we will of course pick some of them up and talk with Shadab about them. So if you have any questions, please feel free to drop them in the Q&A section or even in the chat section. I will be taking a look at them during the session as well.

With that said, let's start with the first part of our chat, which is AI in healthcare, and which is what I was advertising everywhere, because this is an interesting sub-domain of AI, or rather an area in which AI is being applied. Your background makes it even more interesting: you have an engineering background, but your PhD and the subsequent work are in the healthcare domain. So how did this happen, and what inspired you to work in healthcare?

I think you're muted.

That has to be the most committed mistake. I first got interested in healthcare as a domain when I took up a research assistant position at the Institute for Systems and Robotics in Lisbon, Portugal, where I spent the better part of six months as part of my bachelor's thesis work on a problem called automatic karyotyping. The idea is that you have these chromosomes which are twisted and disoriented and sometimes overlapping on top of each other, and you have to essentially disentangle them and arrange them in a neat order, so
the pathologist can analyze them for any signatures of disease or abnormality, so to speak. That was, broadly speaking, a medical image analysis problem, and that was what really sparked my interest in not just image analysis but specifically medical image analysis. Working in healthcare, it's hard not to feel the importance of the problems you're working on. As cliche as it might sound, it really is a very satisfying domain to be working in. Particularly for those reasons, I started looking at PhD positions in medical image analysis or related areas. Luckily I got an admit from Dartmouth to work with Ryan Halter on developing a medical imaging device, which was focused on analyzing signatures of bioimpedance as a biomarker to identify whether or not there is a signature of cancer in the imaging domain. That was really interesting, because you often get to work with images that are coming from a medical imaging scanner, but this, at least to me, sounded like a very unique opportunity to work on building the medical imaging device itself, and that was very exciting. So I actually got involved as an electrical engineer. In fact, my PhD was focused on medical imaging device development more than the analysis of the images themselves. Towards the end of my PhD, I explored the interest area that got me to Dartmouth in the first place, the image analysis side, and I thought I wanted to explore that a bit more. So I started looking for postdoc positions and ended up in Ali Gholipour's lab at Boston Children's and Harvard, where I focused on computational analysis of diffusion-weighted imaging acquired from live fetuses, you know, from pregnant women. So although I've been an engineer first, and used these methods from
electrical engineering and computer science in my line of work, the application domain has so far been primarily healthcare. So that's how, being an engineer, I got into the area, and I really, really enjoy working in healthcare.

Cool, sounds really cool. One related question that I would have is this. Of course, you said you've come from that engineering background and your primary skill set is more towards engineering. But if you're working on an analysis project in a particular domain, then oftentimes, if you lack the understanding of the domain, you're likely to make inferences that a domain expert would perhaps not make. So how did you overcome those shortcomings, so to speak? Or did you really try to beef up on the domain side as well?

That's a very good question. I think what I did sort of changed as I grew in my career. Early on, I really used to try to learn even the domain ideas myself, but very soon you hit a wall and you realize that medicine is an expertise by itself, and you can't just learn it on Coursera or wherever it is, right? If you ask a computer scientist, he will say it's just a dictionary, right? Yeah, not all the answers are found on WebMD. One of the advantages of being at Dartmouth was that it's a really small community, and we worked very closely with our medical school collaborators. At some point you realize it's just easier to ask them the questions than to try to figure everything out on your own. What I realized was that it was still very useful, and in fact necessary, to pick up a working vocabulary, so you're able to communicate effectively with your collaborators. But you definitely don't need to learn everything by yourself. You can't, really, right? So I think having the right vocabulary,
it enables you to be an effective communicator within the team, and that allows you to understand the problem space that you're addressing a bit better than you would if you did not have that working vocabulary. So that's what I started focusing on. When I read research papers or watched lectures, you see a recurring theme of keywords that come up often, so just looking those words up and trying to memorize what their meanings are was helpful. This is what I continued on later during my career, when I moved to Boston and even during my time at Inception and G42. For any new domain area that I try to address, I do try to learn the basics, but then stop there, and then hopefully find collaborators who are masters of their game and work with them. To give you one example, among the resources that I found really useful are the introductory biology course from MIT on Coursera, and oftentimes the review papers, which would be helpful for you when you come into a clinical problem as an engineer. So these are some of the resources that I've used.

Great. Okay, speaking of memorizing, take us through a memorable project that has been etched in your memory for a long period of time. What is something that makes that project memorable, what were the problems you faced, and how?

So I think I really thoroughly enjoyed my project work during my postdoc, and it was a very hard problem. In neuroscience, a lot of the neuroscientists often rely on what is called an atlas of a brain, which is essentially like a map of the brain, right? It turns out that atlases of fetuses are not very widely available, for a lot of reasons. The image atlases are often computed from
images of the subjects that are acquired. On one hand, you have the Human Connectome Project, where you have seven-tesla scans of the human brain, which describe the anatomy in exquisite detail, available to make your life easy for computing these atlases. With these high-resolution maps, essentially, you can start to study the anatomy, and that tells you a lot about what the reference is. In some ways it's like trying to take images of thousands of people and then trying to figure out what the average face looks like, so you can focus on the differences rather than the similarities. That's a simplified example, but it turns out that in the study of human brains it's often useful to know what the differences are, to assess if that change in structure has any impact on the function, and that then dictates various abilities that you might possess as a human, and so on.

Anyway, it turns out that atlases of fetal brains are really hard to acquire, because fetuses move a lot, and the MRI scanners make a loud noise. So when a pregnant woman is lying on the bed, because of the noise and just the agitation, you know, the space is different, and so fetuses move a lot. And these MRI images take on the order of a few minutes to tens of minutes to be acquired, particularly the diffusion-weighted scans that we were interested in. So that was the problem that I focused on. It was really challenging, again, because the motion makes the problem very hard. There are a lot of artifacts, and the way these 3D MRI images are acquired, they're acquired slice by slice. So it's sort of like you're trying to slice a watermelon at equal spacing from above, and imagine that instead of the
watermelon being steady, let's say you're holding the watermelon and I'm slicing it, but instead of holding it steady you're just moving it all the time. If I were to try to put the watermelon together by stacking the slices one over the other, I'll probably not end up with the same shape as I had, because of the motion. That's what made it hard. Working with the team there, we came up with algorithms and methods to compensate for that motion and try to reconstruct the anatomical details, and that ended up being the world's first reference for the diffusion-weighted atlas of the fetal brain. Diffusion imaging is really interesting because it allows you to study the white matter tracts of the brain. These tracts are like wires or threads that run through the brain and connect different areas, and any deformities or changes from the reference tracts can then be evaluated for any impact on the function. To be able to even investigate that hypothesis, whether or not someone's abnormal tracts have any impact on the function, you need that reference first. So, developing these motion-compensated imaging methods, we ended up being the world's first team to produce an atlas of the fetal brain in the second and third trimesters of pregnancy. As part of the solution, we actually employed both the classical model-based approaches for registration and some newer deep-learning-based approaches for segmentation of the fetal brain, to stabilize that registration problem a little bit. It ended up being a very satisfying project, with lots of different complexities, and you really had to throw a lot of computational power at it, from both classical and the newer algorithms, to be able to come up with a solution that was satisfying.
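
Editorial illustration, not part of the talk: the watermelon analogy can be sketched as a toy simulation. The code below is only a minimal sketch under strong assumptions. It is not the registration method the speaker's team used. The "watermelon" is a synthetic binary sphere, motion is a random integer in-plane shift per slice, and compensation simply re-centers each slice's centroid on the field-of-view center, which only works because this toy object is symmetric and centered. All function names (`make_sphere`, `acquire_with_motion`, `compensate`) are hypothetical.

```python
import numpy as np

def make_sphere(n=32, radius=10):
    """Binary sphere centered in an n x n x n grid (the 'watermelon')."""
    z, y, x = np.ogrid[:n, :n, :n]
    c = n // 2
    inside = (z - c) ** 2 + (y - c) ** 2 + (x - c) ** 2 <= radius ** 2
    return inside.astype(float)

def acquire_with_motion(volume, max_shift=4, seed=0):
    """Simulate slice-by-slice acquisition while the object moves.

    Each slice gets its own random integer in-plane shift (toy motion model).
    """
    rng = np.random.default_rng(seed)
    shifts = rng.integers(-max_shift, max_shift + 1, size=(volume.shape[0], 2))
    moved = np.stack(
        [np.roll(s, (int(d[0]), int(d[1])), axis=(0, 1)) for s, d in zip(volume, shifts)]
    )
    return moved, shifts

def compensate(stack):
    """Toy motion compensation: re-center every non-empty slice.

    Assumption: the true object is centered in the field of view, so each
    slice's centroid should sit at the image center.
    """
    n = stack.shape[1]
    target = np.array([n // 2, n // 2])
    fixed = np.zeros_like(stack)
    for i, s in enumerate(stack):
        if s.sum() == 0:
            continue  # empty slice, nothing to align
        centroid = np.argwhere(s > 0).mean(axis=0)
        shift = np.rint(target - centroid).astype(int)
        fixed[i] = np.roll(s, (int(shift[0]), int(shift[1])), axis=(0, 1))
    return fixed

truth = make_sphere()
moved, _ = acquire_with_motion(truth)
restored = compensate(moved)

# Voxel-wise disagreement with the true shape, before and after compensation.
err_naive = np.abs(moved - truth).sum()
err_fixed = np.abs(restored - truth).sum()
print("naive stacking error:", err_naive)
print("after compensation:  ", err_fixed)
```

Naively stacking the moved slices gives a distorted shape, while the re-centered stack should match the original sphere in this idealized setting. Real fetal MRI involves 3D rotations, non-symmetric anatomy, and noise, which is why the talk describes combining registration with learned segmentation instead of a centroid trick like this.
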
Cool. Sounds like really a lot of work, and finally a good solution that you could be satisfied with.

Yes, indeed. The atlas is publicly available, if anyone is interested.

I think you can share the link with me and I can maybe add it here. If you have it now, you can put it in the chat, or maybe later I can add it on the YouTube video as well.

So building the atlas itself was a challenge, you said, because of the continuous motion of the fetus, and then you're not able to reconstruct back to what it was originally supposed to be, because you're not able to capture the overall image in its most accurate form, so to speak. That is essentially a problem of building the right dataset, if you look at it: you don't have that dataset. This is a problem I have seen in the industry, especially with healthcare, especially with diseases, because you don't have datasets easily available, and even if you do, it's likely to be an imbalanced version of the population, because only a subset of the population is infected with a certain disease. So my next question is along that line. Building ML solutions in healthcare is often challenging because the availability of, say, labeled data is limited. Have you had these problems, and if you have, what are some of the ways in which you've been able to tackle them?

Right. I think anyone who has worked in healthcare ML has faced these challenges. You're very right, Imaad. In fact, this particular problem is very close to my heart. It's something that I've been focusing on for the past three years, during my time at Inception, and I continue to focus on it at G42, because, excuse me, the problem of low data, of low labeled data, in healthcare is not going to go away anytime soon. I've looked at specifically three different ways of addressing this. First, I've looked at the problem of
actually generating more labeled data more easily. One of the projects that I did with an intern at Inception was to look at ways to assist an annotator, essentially a radiologist, in being able to more quickly annotate the objects in an image, whether it's for a segmentation or a classification problem. We developed some methods so that, instead of a radiologist having to annotate at a pixel level where an object lies, they could just put in four clicks at what we call the extreme points, and from these four clicks we're then able to produce a full segmentation of the object that was intended. We've tried the method on different CT and MRI images for different organs, and it works fairly well. So that was one of the projects we did. Our contribution in that particular line of work was to come up with principles that allow us to encode what we call a confidence map. We take the four extreme points and then we say: okay, if a user has given me four extreme points for an object, then it is quite likely that the lines connecting the extreme points lie on the object, more so than the points that are away from these lines. Then we came up with some mathematical models to encode the distances from these lines as a confidence map, and we use this confidence map as a prior to produce the final full segmentation. That was the way in which we tried to accelerate the annotation process itself, because if you only have a limited time to annotate, instead of producing limited annotations on a few images, you produce these AI-assisted annotations on lots of different images. So that was one way.

In another project, we actually looked at something called active learning, which is
actually a very interesting and very well-studied area within machine learning. In active learning, the idea is that you have a pool of labeled samples and a pool of unlabeled samples. You try to learn a model from the labeled samples, and then you use this model somehow to try to understand which of the samples from the unlabeled set need to be annotated to come up with a model that's better than the model you would learn if you were to randomly draw some samples from the unlabeled set. I'll use an analogy to describe this; not sure if it's going to be useful. Let's say that you were a trainee chef, and the master chef charges you a fee for each dish that they teach you. You are on a budget, trying to minimize what you spend while trying to maximize how many new dishes you're able to make. Let's say that you're trying to learn Thai cuisine, and one of the dishes you learned from the master is red curry with tofu. Let's say another dish that you learned from the master is green curry with chicken. For each of these dishes you've already paid 100 per dish. Now, would it be useful for you to learn green curry with tofu? So green curry with tofu is one thing that you learned, red curry with chicken is another thing you learned, so would it be useful for you to learn red curry with tofu? I hope I'm keeping track here, but the idea is this: if you know how to prepare the proteins in these two different dishes, one with green curry flavor and one with red curry flavor, you can probably extrapolate that knowledge and prepare both green curry with tofu and chicken and red curry with tofu and chicken. So these redundancies in your data, or rather how useful your data elements are for learning a better model, that's what
we try to assess with active learning. Having a model that you learn from the limited initial set, you try to come up with heuristics to draw, let's call them, a more discriminative set of samples from the unlabeled set, to then retrain your initial model. The idea is, if I'm going to pay a radiologist, let's say, 200 per CT scan for annotating lung cancer nodules in that CT scan, then I better be sending them CT images that are more conducive for the learning algorithm that I'm trying to use to train the model. So that's what we tried to assess using active learning, and we explored multiple different heuristics to draw samples from the unlabeled set. You can think of one simple heuristic as an uncertainty-based model. The idea is, if your classifier produces a score between zero and one, you try to draw the samples for which your model is most confused. So if your model is producing a result of 0.5, or close to the threshold that you optimize using an ROC curve or something else, you might want to send that for annotation. On the other hand, you could be looking at the latent space structure of your data to try to identify which samples are furthest away from the labeled set in your latent space, and use distance-based heuristics to draw a set. We explored a few different approaches in that line of work, and we have a paper in preparation to be submitted to a journal; the draft is already out on arXiv. So that was active learning: if we have to get images annotated, which images make more sense?

And finally, the other approaches that we're looking at are semi-supervised and self-supervised learning models. In the computer vision world, there has been this rather impressive progress made in the
past two or three years, where we're now seeing ImageNet top-1 accuracy on the order of 90 percent, compared to the 80 or so which was, I think, the top-1 accuracy maybe a few years ago. A lot of this progress has been unlocked with these algorithms called semi-supervised or self-supervised learning methods. I'll briefly describe them. In semi-supervised learning algorithms, the idea is that you have a labeled set, you train a model on it, and then you produce what we call pseudo-labels on the unlabeled set. Then you retrain your initial model, combining the labeled set and these pseudo-labels that your model produces, and you repeat this a few times. There is this framework called the teacher-student training approach, where the initial model that produces a pseudo-label is considered the teacher, and the model that learns from both the labeled set and the pseudo-labels is called the student. By doing this knowledge distillation between teacher and student, back and forth, you try to come up with a model that is actually better than the model you would have learned if you were only learning from the labeled set. In this learning paradigm you have not increased your annotation budget by any means. You still do not have any labels on the unlabeled set of images or samples, but you still end up with a model that performs much better than the model you would have if you were only learning from the labeled set. Similar approaches have been tried in a learning paradigm called self-supervised learning, where you learn a proxy task. For example, on all of your available images, with and without labels, you could rotate them by a certain amount, let's say 90 degrees, and then try to predict that rotation as a proxy task. Then you repeat this learning process many
times over with many different samples, and so you end up with what we call a foundation model. Then you fine-tune this foundation model on your actual task, for example classification, on limited labeled data. It turns out that this learning paradigm can also work fairly well, particularly if you have an unlabeled set that is very large compared to your initial labeled set. In some studies from Google and Facebook, we've seen that they've used an unlabeled set on the order of 300 million or even a billion images, compared to ImageNet, which is around one million images. So we're looking at these semi-supervised and self-supervised learning approaches. Particularly in our studies with semi-supervised learning approaches on chest X-ray analysis, we've seen some extremely promising results, and we're in the process of preparing a manuscript, so I hope to share a draft with you as soon as it's out on arXiv. The results are really promising, to the point where we're not really seeing any difference between training with a hundred percent labeled data versus training with ten percent labeled data in what we call these strongly regularized semi-supervised learning regimes.

So these are the three ways in which me and my team have looked at addressing the limited data problem. The first one was, how do we accelerate the annotation process? The second one was, if we were to seek annotations, which images? And finally, if we just could not seek additional annotations, how do we make use of unlabeled datasets to improve the performance of our models? But this is far from a solved problem, and we have lots to accomplish still.

All right, thank you so much for sharing these methods with us. I know, like you said, this is something that you are concentrating on, so you have a
lot more context on this in recent years. So I hope the self-supervised method is out soon and we get to see what you are already seeing.

Sure.

So, one of the things related to this: when I was in my previous company, we were actually trying to do model evaluation. I'm talking with respect to the second point you mentioned, where you are trying to look at samples which are, say, not already represented in the labeled dataset, so you want to find the more discriminative ones. This is something we were doing in model evaluation as well. When our model wasn't performing as well, we were trying to look at those cases where it was deviating a lot from the usual failure modes, so maybe when your predicted probabilities were even lower than, say, 0.2 or 0.3, and then trying to figure out how that dataset is even structured, and why it is so different from the other decisions that the model is involved with. So I think that technique of finding the most discriminative samples could be useful in a lot of different contexts that way. And one of the things that could happen with, say, the teacher-student model is that if your teacher is initially biased, the bias could creep into your entire system going forward, and then your entire labeled dataset could have issues that you didn't imagine it would have. Sometimes there's even unintentional bias that creeps into datasets if they're not carefully built. So have you seen such instances of bias in your datasets, and if yes, how have you tackled it? If not, what do you think we can do to ensure that we build more representative datasets? Of course, one tactic you mentioned is looking at discriminative samples, but is there anything else that you can tell us?

Right, so that's a very important question, Imaad. Bias is something that a lot of people who work in
healthcare machine learning are really trying to address. In my own experience, I've seen bias creep in in ways that are really befuddling, to say the least. Among the projects that I can talk about, we've analyzed the datasets from this large-scale healthcare survey that is in fact being conducted in the United States, since the start of 1998 until now in 2020, and it will continue in the future as well. A lot of participants are being asked to answer a questionnaire on health care indicators, for example: do you have diabetes or not, what's your height and weight, what's your race and ethnicity, and so on, among other questions. In the datasets, for some of the participants, there are lab measurements available as well. And there are other questions, for example whether you've had hepatitis A in the past, or whether you've been infected with hepatitis C in the past or not, and the list goes on. There are questions on diet, there are questions on lifestyle, how many times you eat vegetables and fruits in a week, and so on. We were very interested in analyzing these datasets for a number of reasons. Particularly, we were looking for indicators of future chronic disease incidence: diabetes, cardiovascular disease, and so on. At some point during the analysis of this data, a manuscript appeared which talked about how it was possible for the authors, who trained a machine learning model on chest X-rays, to predict the race of the patient looking purely at the X-ray and nothing else. And that's a problem, right? I mean, it's an image, and an X-ray at that, with no additional information, and the authors were able to predict the race of the patient. And, you know, it
was a fairly well-performing model. To the authors' credit, they tried a lot to impair the model from coming to its decisions by changing the resolution, decreasing the amount of detail. If I recall correctly, they even went as far as to reduce the resolution to 16×16, and I remember reading on Twitter that the AUROC was still like 0.65. So that was very interesting, and we thought: okay, can we detect the race if we remove the markers of race and so on from the NHANES dataset? Oddly enough, the answer was yes. Looking at nothing but the health data coming from these large-scale survey participants, we were in fact able to identify the race, and the AUROC there was like 86 percent or above. I mean, we're still working on the study, so our preprint isn't out yet. But this goes on to tell you that there are many ways in which you may be leaking information to your model that you're not really aware of.

So what can you do about it? First, I think as model developers we have to be aware of the many ways in which these biases can creep in. Whenever we analyze a problem, one of the first things we do is try to explore the distribution of the data along many dimensions, such as gender, ethnicity, race, and so on, to see if there's any lack of representation in the data with respect to a particular dimension. There are ways in which you can compensate for it somewhat. I think a lot of us are very familiar with dealing with class imbalance in these datasets, so you can try to address it to some extent. Lastly, once a model is trained, instead of looking at an aggregate overall metric such as accuracy, we actually try to look at the performance metrics along these
multiple dimensions through which bias can unintentionally creep in. This is an area we are continuing to focus on, learning as we go along, to ensure that we're reducing the bias in the models that we train.

I think one piece of news that had come out was that Google image search was showing certain results that it shouldn't be showing because of the bias in their training dataset. But yeah, it's an ongoing process, I suppose, as you discover more and more dimensions along which you could have had bias, and then you try to fix those.

Particularly in healthcare it's important because, on one hand, you have Turing Award winners like Geoffrey Hinton claiming that radiologists should be worried for their jobs. On the other hand, you have ground realities that make healthcare an extremely hard problem to work on. Fair disclaimer: I'm a big fan of Geoffrey Hinton, but it goes to show that healthcare is not as easy as a lot of people actually think. So the least you can do is be humble about the models you train, be aware of the potential shortcomings, and be prepared to address them.

Absolutely. And one question on that related note: when you're trying to improve the performance of your model, what are some of the data pre-processing techniques that have proven to work for you? Maybe you can share a few.

Sure. I think Andrew Ng recently had a talk where he described moving from model-centric to data-centric AI. I think most people, I would say, have by now realized that developing machine learning models is so much about the data itself. So among the few things we do is, first, we just manually and
visually look at the datasets. Whether they come from imaging, from structured datasets, or text, or whatever it is, we sit down, actually look at the dataset, and try to understand what the different dimensions in the dataset are. Getting familiar with the dataset is extremely important because you sometimes come across things, let's say a problem with a dataset, that you wouldn't really find otherwise if you were not looking. I think Jeremy Howard, to his credit, in the fast.ai course actually presents an example. Was it the cars dataset, where he finds examples on which the model does not perform well, which were actually not even cars? I'm sorry, it was probably the cats and dogs dataset, I don't remember. Either way, what he described was that this dataset claimed to have cats and dogs, but there were some images which were neither, on which the model did not perform well. So manual exploration is really important, so you can find these edge cases and take care of them. Among other things that we do is look for ways to train our model to make it more robust towards the many forms of variation that can occur in a dataset. In a medical imaging dataset, for example, that means applying augmentation techniques such as rotation, translation, shear, zooming in and out, and changing the resolution a bit, among others. Another thing that we do, and this is also a suggestion that Andrej Karpathy talks about on his blog, is to actually look at the data as it goes through the different input stages, right up until the point where it's taken into the model. Believe it or not, it has actually helped us many times in our real-world applications, when
you start with reasonable assumptions, for example, "okay, I'll introduce XYZ Gaussian blur", and then you end up finding that the object you're trying to detect is probably lost by this extent of blur, so you should probably tone it down a bit. These are among the techniques that we've used. In general, we look at regularizing the model training chain all throughout. You have the data part, the model part, and the metric part, and we use several regularization techniques on the data, model, and metric to ensure that we have a model that is well regularized and that does not learn these shortcuts.

All right, thanks for sharing those with us. I think we have time for one last question, and then we can open the forum for audience questions, if there are any.

Sure.

I think this is going to be a bit more of a generic question, because there are a lot of people out there who are looking to get into the field and they've been spoiled for resources, so to speak. They always have these questions on what resources to focus on and all of that. So I would maybe just put it as simply as one line: what are the skill sets required to succeed and grow in, say, the data science, machine learning, AI industry? And if you have a few resources that you would like to share, you can go ahead.

Sure. I think one has to be comfortable with programming, so you should at least be comfortable writing code in Python. That's sort of become the language of choice for most ML practitioners. You should also be able to understand code written by others, and this is where, if you have an intermediate to somewhat advanced level of exposure to Python, you can really thrive. Because, as you said, Imaad, there are so many resources, there are so many code bases available on GitHub which you can take a look at
and learn from. It does require you to understand good Python programming practices: knowing how to write a class, and some of the Python decorators, what decorators are in Python, and so on. So knowing an intermediate to advanced level of Python is extremely helpful. There are many Python programming courses available on Coursera where one could start, but it is one of those things where you learn with practice. I can't recall a resource off the top of my head because I learned Python many years ago, but any new person in the field has to make sure they're spending time writing code and not only reading about it. You also have to understand the basics of the math involved. I think having a high school level of exposure to topics from calculus, linear algebra, and probability would be a good start, but there are a number of resources that one could use to refresh their knowledge. I think the Introduction to Statistical Learning book by Daniela Witten, Tibshirani, and others is a great resource. That book actually uses R to write its examples, but you can find others. And, by no means an introductory book, and no one should be disappointed if they find it difficult to read, but an absolute favorite is Murphy's book on machine learning. I was looking up because it's on my shelf right here. It's very comprehensive; it talks about probabilistic approaches in machine learning and covers a wide variety of topics that would be of interest to someone who's familiar with the basics of machine learning.

Okay, so what is your one-line statement for someone who wants to just enter the field?

Start by doing it. I think Jeremy Howard's approach should work well for a lot of people who are new to the area and are probably intimidated by the amount of work that needs to be done to get to the
point where you can start producing results. And definitely don't let people scare you into believing that you can't really do it if you don't become a master in math first, or a master in advanced Python first. I think you can learn a lot as you go along. It's a marathon, not a sprint, so you just have to have the dedication and stay focused.

All right, yeah, sounds good.

I know it's very cliche, but it's what I really think.

No, it is what it is. At the end of the day it's about putting in the hours, I suppose, and then keeping at it all the time. Again, like I said, it's cliched, but it is what it is. Okay, I think with that I come to the end of the list of questions that I had, but we can also take a look at the Q&A section and see if there are a few questions. I can already see some, so I've brought one question on stage. I think Dr. Valli, I don't know who Dr. Valli is, but Dr. Valli asks the question: does your company incorporate use of federated learning in healthcare?

We are familiar with federated learning approaches. We haven't had a use for it just yet, because a lot of the datasets that we work with come from our local collaborators, where we're able to provide them with a secure UAE-based cloud computing facility to be able to work with their datasets. Nevertheless, we are familiar with federated learning approaches, how to implement it and how to execute it, so we'd be able to do it if there was a need.

Okay, I hope that is answered.

And on that point, anyone interested in federated learning should really check out a recent paper coming from NVIDIA and a Boston medical group. It's on how they implemented federated learning to learn from more than 20 participating sites spread all across the globe, and the results are obviously very interesting: to see that, with the combined
datasets from these multiple sites, the performance of the model was actually better than what it would be if it was learned from site-specific data. One of my dear friends, Zainul Abidin, was a co-author on that paper, and he has a very good write-up on LinkedIn about another study, which I encourage everyone to check out.

Okay, let's get to the next question: any ML use cases used for tabular data, for example patient-level analysis or HCP-level analysis?

Assuming HCP is healthcare professional, okay. So yes, there are many. In the analysis of healthcare survey data from NHANES that I talked about, we used a fair number of tabular data analysis approaches such as random forests, gradient boosting machines, and so on. In our projects on the analysis of healthcare claims, healthcare finance datasets, and healthcare operations datasets, we do address these problems as tabular data problems, and we look at both classical and modern machine learning algorithms to address them. It's really hard to beat gradient boosting machines if you tune them well, is what I've found. But there are approaches such as TabTransformer and others which have looked at applying deep learning to tabular datasets, which I encourage everyone to check out.

I think one related question to that is: when you use, say, gradient boosting machines or random forests, do you also try to look at their feature importances? And what do you do in general to figure out feature importances when you use, say, an ensemble?

We definitely do use feature importance methods. One thing to keep in mind is that a lot of the approaches that we have used in particular are all retrospective, post hoc analysis techniques, such as looking at the SHAP values, and it can give you a pretty good idea of what the
model is using to make its decision: among the variety of features, which ones are more important. But there are several limitations as well that one has to be aware of, of when and when not to trust what you see as an important feature. The short answer is yes, we do use SHAP, LIME, and other approaches to see what the model thinks are important features.

Thanks for bringing up the limitations point, because I made a video on that just last week on YouTube, so go check that out if you want.

Absolutely, I will.

No, not to you, but in general to everyone.

No, no, it's always good to learn new things.

There was a paper on general pitfalls of using model-agnostic interpretable machine learning techniques, which is again SHAP, LIME, and all of these techniques. Christoph Molnar, who has written the Interpretable Machine Learning book, is the lead author of this paper. So what I essentially did was try to summarize what he has done in the paper. So yeah, I have that video on YouTube. I'm just bringing it up because it came up.

No, absolutely, I'd love to check it out because I really like his book. We use it, we've read chapters from it, we recommend it to all of our team members, new interns, and so on. So I'd love to check out your video.

Ideally, maybe you would directly read the paper. You don't need to check my summary of the paper. But okay, that's all right.

Sure.

Okay, the next question is from Case One: how are ML or DL techniques leveraged in computational neuroscience? Any use cases in the brain?

Sure. In computational neuroscience you could be using ML or DL techniques to analyze the connections between different parts of the brain. As an example, it turns out you can represent our brain's connectome as a graph. So once you have essentially converted the
diffusion tracts that describe the strength of connections, and I'm simplifying things here a bit, so any neuroscientists, don't take offense, but once you've quantified the strength of these connections between different areas of the brain, you can formulate that as a graph. And once you have a graph essentially representing the connections of an individual brain, you can then go on and apply any number of graph machine learning or deep learning techniques, for a variety of reasons. One approach could be to try to identify, let's say, a model graph representing what the connections within the brain should ideally look like, and then, with reference to this model graph, what would constitute an abnormality, and what would be the impact of it. Among other applications, you could be using it to study the spike train that is generated from the neurons, and that also relates to brain-computer interfaces. If anyone has seen Neuralink's demonstration of how they use these signals acquired from an ape's brain for creating these brain-machine interfaces, a lot of the interpretation there, I presume, is using some sort of machine learning or deep learning, though I don't think they've publicly disclosed what algorithms were used.

Okay, we have time for one last question. If there is any question, please drop it in the Q&A. We just have a minute, so if you drop it now it will be taken up, or else I think Shadab can take it later. You can also contact him, perhaps on LinkedIn.

That's okay.

Okay, then I think there's no question that I can see. So thank you so much for this chat with us. It was really great having you today for this session. If you have any last words
that you would like to share, this is the time.

I'd just like to say again that when you enter the data science or machine learning field, a lot of things are very intimidating. Words like eigenvector and eigenanalysis tormented me for a very long time, to be honest. It's important not to lose sight of the bigger picture. As long as you are determined, you keep coming back to the concepts that you learn, and you're open-minded about the possibilities and limitations of the algorithms, I think you can go really far and long in data science as a professional.

Okay, all right then. Thank you so much, everyone, for joining. I hope it was a good use of your time as well, and I will see you next month with yet another speaker, or a webinar, or a fireside chat, whatever it is. Until then, have a good time and stay safe wherever you are. Thank you so much, bye.

All right, thank you. It was a pleasure talking to you.
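
The subgroup evaluation described in the chat, looking at performance along dimensions such as gender or race instead of a single aggregate metric, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the speaker's actual code: the function name `accuracy_by_group`, the group labels "A" and "B", and the toy data are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return overall accuracy plus sample count and accuracy per subgroup.

    `groups` holds one subgroup label per sample (hypothetical attribute,
    e.g. a demographic category recorded alongside each record).
    """
    if not (len(y_true) == len(y_pred) == len(groups)):
        raise ValueError("y_true, y_pred and groups must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)

    overall = sum(correct.values()) / sum(total.values())
    per_group = {
        g: {"n": total[g], "accuracy": correct[g] / total[g]} for g in total
    }
    return overall, per_group


# Hypothetical toy data: true labels, model predictions, subgroup labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "A", "B", "B", "B"]

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print(f"overall accuracy: {overall:.2f}")
for group, stats in sorted(per_group.items()):
    print(f"group {group}: n={stats['n']}, accuracy={stats['accuracy']:.2f}")
```

In this toy data the aggregate accuracy looks acceptable while the smaller group is served much worse, which is the kind of gap an overall metric hides. The per-group sample counts also give a first look at under-representation along that dimension.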

Original Description

Shadab leads a team of applied scientists and engineers at G42 to solve problems in healthcare AI. His team develops solutions for clinical care, healthcare operations, and healthcare finance by analyzing structured and unstructured datasets ranging from electronic health records, genomics, medical imaging, and claims, among others. Before joining G42 Healthcare, Shadab was a researcher at the Inception Institute of AI in UAE, where he focused on machine learning from limited data. Shadab obtained his Ph.D. from Dartmouth College in Biomedical Engineering and did a research fellowship at Harvard Medical School and Boston Children's Hospital in Radiology. In this fireside chat, we'll get to know more about his experiences and learnings while building AI systems in Healthcare. We first understand his motivation to work at the intersection of AI and Healthcare and then understand some of the problems he faced while building AI systems and how he overcame them. Towards the end, we will talk to him on some advice he would like to share with upcoming Data and AI scientists.

In this fireside chat, Shadab Khan shares his experiences in building AI systems in healthcare and provides career tips for data and AI scientists. He discusses the importance of analyzing structured and unstructured datasets and overcoming challenges in machine learning.

Key Takeaways

Understand the intersection of AI and healthcare
Analyze structured and unstructured datasets
Develop solutions for clinical care, healthcare operations, and healthcare finance
Overcome challenges in machine learning from limited data
Pursue a career in data science and AI

💡 Building AI systems in healthcare requires analyzing diverse datasets and overcoming challenges in machine learning from limited data.
