Applications of Machine Learning to COVID-19 Research with Isaac Godfried
Key Takeaways
The video discusses applying machine learning to COVID-19 research, focusing on transformers, BERT embeddings, and semantic search over a dataset of around 45,000 scholarly articles released for the Allen AI research challenge on coronavirus. It also touches on the CoronaWhy group and its Slack, BM25 retrieval, and UMAP for visualizing the embedding space.
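As a rough illustration of the embedding comparison summarized above, the sketch below pools per-token vectors into one vector per title and compares titles with cosine similarity. It is a minimal sketch under stated assumptions, not the speaker's notebook code: mean pooling is assumed (the talk only says the word embeddings were pooled naively), the 3-dimensional vectors are hypothetical stand-ins for the 768-dimensional BERT outputs, and the function names are made up for this example.

```python
import math


def mean_pool(token_vectors):
    """Average equal-length token vectors into a single vector (assumed pooling)."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical token vectors; a real model would return one 768-dimensional
# vector per token of each article title.
title_a_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
title_b_tokens = [[1.0, 1.0, 0.0]]
title_c_tokens = [[0.0, 0.0, 1.0]]

emb_a = mean_pool(title_a_tokens)
emb_b = mean_pool(title_b_tokens)
emb_c = mean_pool(title_c_tokens)

sim_ab = cosine_similarity(emb_a, emb_b)  # vectors point the same way
sim_ac = cosine_similarity(emb_a, emb_c)  # vectors are orthogonal
print(round(sim_ab, 3), round(sim_ac, 3))
```

Pooling this way discards word order and weights every token equally, which may be one reason unrelated phrases can still score as similar with an untuned model.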
Full Transcript
Hi everyone. I hope everyone's doing well in quarantine or social distancing right now, and not going too stir crazy. Okay, so probably a lot of you are already familiar at this point with the Allen AI research challenge on coronavirus. If you aren't, I'll just very briefly say that they released this large dataset of around 45,000 scholarly articles, and the idea is to look and see which ones are the most useful for answering specific questions about the virus. You can read this later if you haven't already seen it; I won't go into too much more detail.
One of the really cool things that came out of this, unlike a lot of other Kaggle challenges, was that there was immediately a cooperative effort instead of a really competitive effort from the get-go. This group called CoronaWhy formed, and now we have over 500 members, I believe, on our Slack, and a lot of people are working in teams in a very organized way to try to address some of the real problems and the progress we can make on this issue. I found that very interesting. If you're interested in joining the group, we're always looking for new people, and I can send you over the Slack later.
Specifically, what I'm working on, and for some of you this may be review, is forming good sentence embeddings, or good general-purpose embeddings, to do semantic search on this corpus. One of my interests for a while has been using transformers to do effective representation learning. Some of that has actually been on the time series side, but returning to the NLP side for the moment, I really wanted to see if I could find some good, useful representations.
One of the really challenging things about this task is that we have no real evaluation metrics. All evaluation is qualitative, and we really have to rely on [inaudible]. Moreover, I guess many of us don't even really know what good results would be, so we really have to rely on experts to evaluate what the results are and whether they make sense. I just wanted to develop, using these embeddings and my knowledge of clustering, a good way to quickly cluster things, so that the experts could see whether those things make sense.
So I wrote out this notebook very quickly. The top part, which you're probably all used to, is just downloading and installing. My main questions with this notebook are: Are raw embeddings useful? How can we construct an efficient semantic search using these embeddings? There's always a trade-off between doing a full semantic search and the memory required, which I found out all too well when this started repeatedly crashing due to lack of RAM. And the other big thing is, as I said, what does the embedding space look like? If we display the embedding space and have experts look at it, can they tell us whether it makes sense?
And since I was asked earlier to say this, I will say it right now: there's a lot of machine learning going around with people not really understanding the problem space, and not always understanding how it impacts things in a clinical sense or in a medical sense. I know I don't know that by myself, so I always want to rely on medical experts to evaluate my results and look at them. I think you should follow all good machine learning best practices, but then, in addition, really try to collaborate and form these cross-team collaborations, because we can't solve it on our own as machine learning experts. We need that expert advice.
So without further ado, I'll just quickly run through some of this. As I said, at this point I just wanted to see how these vanilla SciBERT embeddings perform. I essentially loaded the model and did a very naive embedding method: I basically pooled across all the word embeddings. Later I'll show you how I refined this a bit. Then I did some basic cosine similarity scores. Some of these seem to actually give meaningful results, but here we do see a high correlation, for instance, between "compliance" and "MERS coronavirus", and with a random word there's still a high correlation, so obviously that's not great.
Moving through, I just did some other helper functions, and then what I really wanted to do, as I said, is plot the embedding space. I used UMAP, which I find really useful; it's one of my go-to dimensionality reduction techniques. With that, I plotted the article title embeddings, just using this naive method, and it was nice to see, at least in my own non-expert opinion, having just said that, that certain things do seem to form distinct patterns in the clusters. Here we can see "health capacity management"; I know that's going off screen, but that's pretty much the only thing in this area. Then if we look at something like the top of the cluster, you see there are similar kinds of article titles grouped together in this part, though obviously we'd want to get an epidemiologist or a biochemist to actually thoroughly evaluate whether these make sense.
Moving down, I made a couple more clusters, and then I did a semantic search on the various titles. One of the problems, as I said from the get-go, is that these are 768-dimensional embeddings returned by the BERT model, so they take up a lot of space, and it's just not practical to do a full search of the corpus. Because I was limited to embedding only 200 articles, I think some of the results weren't that great to begin with; I could only embed the titles in 200-article chunks. So that was definitely a limitation.
I did try this as a possible unsupervised evaluation metric. Specifically, I thought: suppose we have two different queries, where one is "coronavirus person-to-person transmission mechanics" and the other is "coronavirus infection origin and transmission from animals". These are actually two fairly different questions from a research standpoint. A normal search engine might return similar results for them, but ideally you'd want them to return very different results. So I took those two queries and embedded the ten returned results, and you can see that these aren't very good, because ideally we'd see distinct areas in the embedding space for the different search results, just qualitatively. Instead they're mixed, and there are even some overlapping ones. But again, this is just on the partial corpus and not the full one, just about 200 articles in the search, so it's a little bit understandable.
Later on I went to embedding abstracts, full abstracts, which was of course even more RAM intensive, unfortunately. What I finally did was combine it with what's called the BM25 index, which is a more vanilla search algorithm, similar to TF-IDF with a few slight variations. One of the things I found is that when I run that search on the abstracts, have it return a short list of twenty results from the full 45,000 articles, and then re-weight those results with semantic search, I actually did get more distinct clusters. For instance, here's "coronavirus human-to-bat transmission" (Kaggle kind of cut off some of the edge there), and here's "COVID-19 person-to-person transmission". Though these aren't perfect, you can see these abstracts do form their own distinct patterns in the embedding space, and there is some differentiation between the two, unlike the other one, where they were just overlapping.
So that was my first attempt, and I came up with these conclusions and next steps. One of the things I've looked at most recently was fine-tuning a sentence transformer model on MedNLI, which is a natural language inference dataset, but it can be used to gauge how similar sentences are based on the labels in it. So I fine-tuned a full sentence transformer model, and this model is actually nice because it produces full sentence embeddings. I haven't done the full clustering analysis on it yet, but from the initial results, at least qualitatively on a few things: for "bat-to-human transmission" and "camel-to-human transmission mechanism", it gives a fairly high similarity score, which I think is good. And if you're looking at "treatment efficiency of chloroquine on COVID patients" and "bat-to-human transmission coronavirus", it gives a fairly low similarity score, which is what we want, because those are essentially two queries asking very different questions. So what I've seen qualitatively, just on this basic analysis, is that it seems to be performing a lot better.
Okay, I think that covers most of what I was going to go over. As I said, it is definitely an interesting project. That was a bit informal, but I was just asked, I think two days ago or a day ago, to prepare this, so hopefully it makes sense to people. Happy to answer any questions.
This is great, thank you so much. A few questions are coming in already. Can you see the chat? I'll pull it up. This is the hardest part to figure out, if you need to stop sharing.
Okay, so: "Can we use UMAP for other things than you have used it for?" Yeah, I think you can use UMAP for any type of clustering. Anytime you have embeddings, or you want to do dimensionality reduction, you can use UMAP. I was also thinking of using it to maybe reduce the dimensionality of those 768-dimensional vectors so they take up a little less memory, but it's a good dimensionality reduction technique in general. And yeah, I can definitely add some article links to it; I think I already linked to it in a couple of places in my notebook.
That's a really good question about the RAM intensiveness. These models are hard to run at scale, so that's why I think most people do use some kind of initial search index, where you return an initial list of results before computing the similarity scores, which is what I was looking at. There are ways, as I said, to maybe use UMAP to reduce the dimensions of the embedded text, so that can definitely help too; I haven't really studied it entirely at this point. But it definitely is a question of how much RAM and how many resources are available versus how good you want the search results to be, so that's one of those real-world trade-offs you have to weigh.
If you have any more questions, feel free to ask them verbally or send them in the chat. Can you tell us a little bit more about some of the stuff that the CoronaWhy group is doing?
Yeah, sure. CoronaWhy has created four different main tasks; you can see them all on the CoronaWhy page. One is focused on geography and how geographic factors influence the virus, another is focused on transmission specifically, another is focused on vaccines and other therapeutics, and the fourth is on various risk factors associated with it. So there are those four core tasks, on which people are doing very specific NLP efforts. For instance, on the geo section, they're extracting named entities from the medical literature, named entities like locations and sub-level information about countries, and then combining that with geographic data to look at how geography impacts the spread, at least from the literature. On the vaccines and therapeutics task, they're looking specifically at extracting the vaccine and therapeutics information. So there are multiple efforts going on right now. I've been more focused on the common effort, which is to find general models that could work across all tasks, so that's where those sentence embeddings come in. It's definitely an interesting group, with a lot of cool things going on.
Do you know the link to the Slack? I can drop it in the chat if you give me the link.
Yeah, I can send that out. I can get you the link.
Gotcha. And I also posted a link to our Slack community. I see two more questions, one from [inaudible] and one from Jonathan.
Okay, sure. So Jonathan: "Have you considered visualizing any attention components for your transformers?" Yeah, I think that could definitely be useful. I didn't do it too much in that notebook, but I think it would be useful to see which words are being weighted in the language model when creating the embedding. To see which tokens in particular, and whether it's attending to something like "coronavirus" or "COVID-19" more, would be helpful to know. So that would be a good next step too.
I have a question that piggybacks off of that. I'm actually building this into Weights & Biases right now: attention mechanisms and a way to visualize them. What are you using right now to visualize attention, or for other projects, since you haven't used it in this one?
For attention right now, I try to use heat maps between whatever the input is and whatever the output sequence is, so I think that's the big one. I guess you could also look at specific context vectors, and visualizing those could definitely be helpful as well.
So are you using saliency maps right now, or should I say heat maps?
I actually haven't heard of them specifically. Right now I've done some of my own visualizations of the activations, but I might look into them. I haven't done too much on the actual visualizations, but I think that could definitely help with interpretability.
There was another question on dimensionality reduction: "Would you translate it to feature selection, is that right?" Yeah, it's related to that. UMAP, PCA, t-SNE: they all take a very high-dimensional vector and try to find the parts of it that really stick out and define it in the embedding space, in simple terms, and then they use those to map to the low-dimensional embedding space.
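The two-stage search described in the talk (a BM25 shortlist over the whole corpus, then re-weighting the shortlist with embedding similarity) could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the notebook's code: it uses a common Okapi BM25 formulation with assumed parameters `k1=1.5` and `b=0.75`, it re-ranks purely by cosine similarity rather than blending the two scores, and the documents, 2-dimensional vectors, and function names are all hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25 (assumed variant)."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))  # number of documents containing each term
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_tokens, query_vec, docs_tokens, doc_vecs, shortlist_size=20):
    """Stage 1: cheap lexical shortlist. Stage 2: re-rank it by embedding similarity."""
    lexical = bm25_scores(query_tokens, docs_tokens)
    by_bm25 = sorted(range(len(docs_tokens)), key=lambda i: lexical[i], reverse=True)
    shortlist = by_bm25[:shortlist_size]
    return sorted(shortlist,
                  key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                  reverse=True)


# Hypothetical toy corpus and embeddings (real abstracts would use 768 dimensions).
docs = [
    "bat to human transmission of coronavirus".split(),
    "person to person transmission of coronavirus".split(),
    "hospital capacity management during outbreaks".split(),
]
doc_vecs = [[1.0, 0.1], [0.2, 1.0], [0.0, 1.0]]

ranked = search("coronavirus transmission".split(), [0.0, 1.0],
                docs, doc_vecs, shortlist_size=2)
print(ranked)
```

Only the shortlisted documents need their embeddings compared, which is the memory and compute saving. The cost is visible in the toy data: the third document has the embedding closest to the query but never reaches the re-ranking stage because it shares no query terms.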
Original Description
Isaac Godfried is a machine learning engineer at Monster where his main focus is to remove barriers related to the use of deep learning in industry.
As part of our Virtual Deep Learning Salon, he shared how he's applying machine learning to the COVID-19 dataset and how we can do this responsibly.