Ask Me Anything with Ed Grefenstette, Head of Machine Learning at Cohere



Key Takeaways

Ed Grefenstette, Head of Machine Learning at Cohere, discusses various topics including machine learning systems, conversational intelligence, NLP use cases, and the limitations of artificial general intelligence, while also sharing insights on the company's approach to large-scale language modeling and innovation sharing.

Full Transcript

Question number one: what do you suggest students do if they want to practice pushing machine learning models to production?

Okay, this is a great and interesting starter question, because it's a bit more specific than "how do you get started in machine learning?" It asks instead how we put machine learning models into production, a skill that isn't necessarily taught in universities, and isn't necessarily something that people, even with my background in research, have a lot of experience with. For a long period, from when I started working on deep learning around 2012 until I got back into startup land, scale was handled by engineers who were far more capable than I was. But with the sort of Cambrian explosion of machine learning libraries and frameworks that really democratized access to this, there has also been a wave of people coming straight out of undergraduate degrees, or dropping out of PhD programs, going into industry and starting these startups. I feel like the fieldcraft of MLOps, how to actually deal with efficiency, organize your data, and version your models so that you can organize a business around ML as a first-class citizen, has grown so quickly that at one point I had a hard time keeping up with it. So this is not an area I have expertise in, but fortunately a lot of the people who have developed this expertise firsthand have put time into publishing courses, blog posts, and books. One that I'm reading right now to learn about this aspect, since I obviously have to understand it within Cohere even though it's not my expertise, is by Chip Huyen (I'm not sure I'm pronouncing her name correctly), who taught one of the courses at Stanford on MLOps. She put out a wonderful book a few months ago called Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications, which
covers, for me, the whole stack in a very production-oriented manner. So I really would recommend starting with that book. I'm not making royalties or commission from it; it just happens to be at my bedside. I'm sure there are other resources out there that are equivalently or perhaps differently useful, so I really recommend just using Google or your favorite search engine and finding them. What's sure is that there's a lot more information out there than the last time I had to learn about this sort of stuff, when it was a matter of figuring it out by yourself and no one really knew how to do things.

Awesome recommendation, duly noted. Thank you so much for this. We have the first live question coming from the Q&A section. Please go ahead.

Hi, I just wanted to ask what NLP use cases you're most interested in, and what you find interesting in the realm of NLP.

Personally, a lot of things excite me within language, in that it's the main tool by which we not just communicate but make plans, transact, record information for ourselves and for others, and aid our memory. So all these use cases are fascinating. The thing I find absolutely fascinating in terms of technological progress in just the last few months, and certainly in the last week, is conversational intelligence: the ability of unified models not just to carry on open-ended conversation, as we see with ChatGPT or Google's LaMDA or Meta's BlenderBot 3, but to increasingly interact with the world in meaningful ways and show different fortes. Across the three models I mentioned, for example, LaMDA can interface with APIs that can search the web or operate a calculator, demonstrating a really interesting form of groundedness that's also reflected in BlenderBot 3. In contrast, with what we've seen from OpenAI's GPT chat, or ChatGPT (I never remember which way around it is), it's clear that they've
built on top of the strengths of their Codex model and its understanding of code as a modality. It's a fantastic tool for interacting with knowledge surrounding code and with code itself: editing code, getting recommendations about how to produce code. These all point to a sort of inflection point. This class of technology was previously, I think, quite niche (people had a hard time getting open-ended conversational bots that could do more than just chit-chat off the ground), and it's now at a turning point where we're finally seeing the potential for really broad applications, with these models being a central entry point to a number of tools and application cases. So I'm really excited to see this domain develop over the next few months, and obviously we're keen participants in this ourselves, hopefully with some exciting stuff to announce in the coming months.

Amazing, thank you so much. Let me just move on, because we have lots of questions coming. The next one is from Yan yam. Do you want to ask it yourself?

Oh yeah, hello. I was just wondering what deep learning framework, among PyTorch, TensorFlow, JAX, and perhaps others, Cohere uses, and why. What are the different trade-offs between frameworks that you think about?

I myself have worked with PyTorch for many years, for about three years during my time at FAIR, and before that I worked with Lua Torch and TensorFlow 1.0 when I was at DeepMind. At Cohere we use JAX internally, and the primary reason is that our compute is obtained in partnership with Google, so we have access to a large number of TPUs, and JAX is the best framework for getting the most out of TPUs. I think if we were in a situation where we were operating with a large number of GPUs, we would probably consider something like PyTorch instead, which has better performance on GPUs and, especially with the release of PyTorch 2.0, announced
last week, has further potential for optimization for large-scale models that would be very appealing. So for us it's a pragmatic decision that's primarily directed by what hardware we have access to. If hardware and efficiency aren't your primary concern, then I think it really boils down to how much the relative features of these frameworks matter. If you want to do higher-order derivatives or something that's a bit more mathematically fancy, up until about a year or so ago JAX was the obvious choice, because it built on top of NumPy and made everything differentiable, and it was very convenient to use the vmap abstraction to not have to reason explicitly about batches. But now there's functorch. I haven't played with it directly, but I have the impression that it's starting to reach feature parity with JAX in terms of flexibility. So again, if hardware isn't your priority, it's more about those features. For most businesses, focusing on large models is always going to be about efficiency and hardware, and there the choice is pretty easy: if you have GPUs for training, you use PyTorch, and if you have TPUs, you use JAX. Then for inference you can use C++ or ONNX or a number of different frameworks.

Very comprehensive. Thank you so much, John, for the question.

Thanks.

And for the answer. Next up is Eden with a bunch of questions. Eden, you're welcome to ask the first question or any of the group of questions you have.

Okay, cool. So, two questions. One is really fast: what's your favorite ice cream flavor?

At the risk of sounding boring, I really like vanilla. I like chocolate, I like cookie dough as well, but vanilla is great because I like toppings on ice cream, and so it's like a blank canvas and you can just
add stuff to it. In fact, as an anecdote about what you can add to vanilla: we had a hackathon this summer where our Brazilian interns (and this will segue, I guess, to your question about internships) formed a team and used our large language models, prompted with a number of recipes, to generate new recipes, and then they made them. One of them was a dessert that involved red wine and vanilla ice cream, a kind of vanilla ice cream and red wine cocktail, which you wouldn't imagine works, but in a very strange way it kind of did, or at least it didn't make us sick. So yeah, vanilla really is the canvas upon which you can put any paint, even if it sounds unreasonable.

I think the weirdest ice cream flavor I've had is balsamic and strawberry, which turned out pretty well, actually.

I've had Marmite and vanilla ice cream. It was pretty weird too. Some things aren't necessarily made to be combined, but you can try.

Yeah. So my longer question is: how do you decide what technology and what research advances in ML you choose to share with the community, compared to keeping them as the company's competitive advantage? What is considered significant or important enough that it needs to be shared, compared to "this is something that will make my model slightly better, which will give me the edge over Google or OpenAI or whatever"?

I think right now the landscape in large-scale language modeling is very data-directed. So I like to think that we should reach a point where we're comfortable sharing any sort of innovation we're doing on the modeling front or the training front, and in fact perhaps even the modeling code, with the community, because our competitive advantage lies in the data and how we use it. Obviously I think we'll realistically have to keep something to ourselves so that we can keep that
competitive advantage, in that our existence as a company is driven by having something to offer for which we can take profit. But we don't necessarily benefit from significant secrecy when it comes to any of the techniques we're developing, and in fact, by sharing them, we can help showcase our potential for innovation rather than just implementing other people's modeling techniques. So we're very eager to both innovate and share the product of that innovation with the broader community. This aligns quite well, I think, with how things worked at Facebook AI Research, where I worked before, and to a certain extent at DeepMind, although that was a bit more secretive. At Facebook AI Research the idea was: we really want to open-source everything we're doing, release everything we're doing, and if the business is going to take any value out of it, it's from having the experts in-house to advise on how things can be adapted to be business-facing. I think some of the same reasoning can apply here. We want to build the best models possible and then have those be available to our customers, users, and developers to build astounding language technologies, and that's our competitive edge, not necessarily how we produce the models.

Oh, and I'm sure you already read the third question.

I'm sorry, I'll answer that last question very quickly. We are taking interns, and you can apply on the careers web page. I'm sure Sandra can put up the link after this.

You will be guided. Thank you so much.

Sorry for taking up so much time. Thank you.

Thank you for your questions. Okay, next up is Sarah.

Thank you so much. I mean, I'm excited to ask my question, but I think maybe Alan's question is a good one to go before mine. I think it'll provide good context.

Let's do Alan's first, then. Okay, let's do Alan, then.

Yeah. So I guess mine is a little bit more abstract or philosophical, but I'm very interested in what you would define as a threshold of artificial general intelligence
and how long do you think it'll be before we get there?

I am a bit skeptical. Well, okay, first I'll preface this by saying that AGI means a lot of different things to different people, so I can't dismiss AGI completely, in the sense that some people will have a very moderate view of what it means that is probably in line with what I find reasonable. But I don't believe that an unboundedly general intelligence is possible, just because I think it's a contradiction. Humans aren't general intelligences, for example. We are geared towards being good at particular ways of learning and making decisions that are the byproduct of our evolutionary history, and therefore a byproduct of the constraints under which we have to operate in the world. We can't memorize unboundedly long sentences, and we don't communicate with unboundedly long phrases, because we need to make decisions within particular time frames or we'll starve, or a bear will eat us, or, in modern society, we'll just fail to make any productive advances. All these constraints and pressures from our environment have formed within us, through evolutionary pathways but also through a kind of meta-evolution within society, strong biases towards what we're good at adapting to and what we're not. For example, a calculator is undoubtedly better than us at doing rapid calculation, just because we don't have any evolutionary need to do ten-digit calculations in our head. But we are good at doing very rapid pattern matching and at making good-enough estimates of small sets of numbers, just because that allows us to expend very little energy, cognitively and otherwise, in making these decisions. So I think the same sort of thing applies when you're talking about more general intelligence: there's always some
point where you have to bias it towards a particular class of problems, because if you take the abstract space of all the possible problems, some are in strong opposition to each other. You can come up with very artificial examples here, and I won't do that in the interest of time, but you can come up with one class of problems where being optimal at solving those makes you almost inversely suboptimal at solving another class of problems. So it's inconceivable to me that there's a general learner that's jointly optimal at doing both of those things. In a very hand-wavy sense this has a connection to no-free-lunch theorems, but I'm not an expert in optimization theory, so I'm not going to pretend that's a ground truth; I think it's more of a hunch. In a recent paper called General Intelligence Requires Rethinking Exploration, which I wrote with my student Minqi Jiang, we talk about increasingly general intelligence, which I think is a more plausible notion, in that we definitely want to come up with methods that, in an open-ended learning setting, force our agents and our models to be increasingly general over learning iterations. But that's always under the view of there being some sort of cone of bias, set by the environment under which we're generalizing, that obviously bounds the generality of the system. That didn't answer your question directly, but the short answer is: I don't think AGI exists, so it doesn't make sense to talk about when it arrives or what the threshold is. I hope that seems like an appropriate answer.

Yeah, I think so. It sets a pretty high bar for AGI, higher than I'm used to. I guess on a practical level I was just more interested in: do you think AI is going to be smarter than the average human, and when's
that going to happen?

Uh, 28.50. There we go.

Awesome, thank you so much, Alan. And Sarah, here we go.

Yeah, thank you so much, because now I can ask a better question, now that I understand how you think of AGI. It's a two-parter, I guess: how much do you think that the current deep learning, machine learning, neural net approaches, despite being called that, actually parallel natural systems, neurobiological systems? And do you think that it matters? Do you think having technical systems that operate more like that sensory-based, intuitive, limited human capacity is something worth exploring, or is it better for us to just focus on big prediction and calculation tools, which are obviously far better than we are at doing such things?

On the first point, about biological plausibility, I'm definitely the wrong person to ask. I grew up in France and had to do the scientific Baccalaureate, which requires you to do, amongst other things, physics, mathematics, and biology, and my biology paper was my worst in the set. So I'm barely confident when it comes to even talking about biological plausibility. All I can say is that we don't build planes to fly like birds, and so it's fine for things to obtain the same sort of abstract functionality without necessarily replicating the same functional mechanisms. To that end, personally, I'm interested in whether or not backprop happens in the brain. Those are things that, as a hobbyist, almost as an outsider to that debate, I find quite fascinating, and I was entertained and stimulated by Geoff Hinton's keynote at NeurIPS on trying to think about learning methods that are more biologically plausible. But I don't think it's essential for us to seek that. Part of your question was also: do I think just deep
learning as a framework gets us to the level of human intelligence? Sorry, am I projecting that question?

No, that was my original question, but now that I understand how you think about it, I guess my other question is more: do you think it's worth a technical pursuit of a more human kind of learning, with all of our failings?

I mean, rather than focusing on the biological problems, I like to think about what aspects of our own learning pathways we might want to replicate at the functional level: how we memorize things, how we try to incorporate information in the short and long term. These are nice things to try to reason about when we're building artificial systems. A large part of my earlier machine learning career, around the time I was at DeepMind, was on things like differentiable neural computers and neural pushdown automata, where we were trying to join the algorithmic world, the world of reasoning about memory and short-term and long-term planning, into the architectural level of how we build neural networks. I still think that's an interesting area to push. It's just that large-scale models tended to produce better extrinsic results on practical real-world tasks, so that kind of kneecapped that whole line of work. But I still think there's a lot of inspiration to be taken from looking at how humans plan, what we're good at and what we're poor at, if what we're looking to do is produce systems that reason at some abstract level in a similar way to us, so that we have a better chance of getting them to align with our biases, with our strengths and weaknesses, and perhaps supplement those. But it's not necessary that we exclusively focus on that in artificial intelligence. In fact, it's beneficial to also consider the design of artificial systems that highly complement us, and, frankly, the calculator, which I mentioned earlier, is a great example
of this. It does something very well that we're bad at (at least, most people are bad at doing 10- or 15-digit multiplication), and therefore it's useful, but it's also terrible at doing things that we're good at. So the calculator hasn't immediately made us all irrelevant.

Awesome, thank you so much. Can I just ask you to repeat the name of the paper that you referenced, the author of the paper about natural learning? You mentioned someone who wrote a paper about how we could make, how we can, uh...

Oh, Geoff Hinton. It was his keynote at NeurIPS.

Okay, cool, yes, thank you.

I'm sure that'll be online in video form soon.

Thank you, Sarah. Awesome. Ed, we have a next question, but before that, I want you all to remember to ask your questions in the Q&A section and upvote the questions that you would like to be asked. Next up will be Ethan. Ethan, are you with us?

Hello, can you hear me?

Yes, yes.

All right. First I want to say thank you so much for having this Q&A and speaking today; we really appreciate it. My question in particular: I read a lot of work talking about how people have just been increasing the scale of neural networks, making them larger, with more data and more model parameters. Some people say we can continue doing that and we'll continue to reap benefits. Other people are saying we need to introduce other types of systems, something along the lines of symbolic artificial intelligence, to supplement that. I was curious how much further we can continue scaling deep learning models before we need to start doing other things.

Yeah, that's a great question. I'll answer it in perhaps a slightly more general sense, by starting with declaring, which is going to be a very controversial thing to say perhaps, that I don't actually believe that our current large language models, the way we're training large language models, and Transformer architectures are going to bring us
all the way to human-level language understanding, which seems like a very strange thing to say when working for a company which is, at least at present, predicated upon doing this. I'll explain what I mean before everyone thinks I've gone rogue and cuts me off. If you start from this intuition, there are really three ways you can try to advance things or operate.

The first is to just stick to your intuition that it can never really work, it's just kind of working now, and be a contrarian. I'm not singling out or subtweeting any specific person in our community, but you know who you are. Being a contrarian is great, but it's not necessarily the most constructive way to move the dial on the field, even if you happen to be right, even if you have a healthy skepticism, and it's always good to be a bit skeptical.

The second is to say: okay, I'm skeptical, but I'm going to try to produce a testable hypothesis, which is the basis of the scientific method, in the form of a benchmark that seeks to illuminate a particular failure mode of the current paradigm. It says: I don't believe that language models in their current incarnation are going to exhibit quality X, which is convincingly and extrinsically useful, or an essential part of human communication, for example. This has been kind of the bread and butter of machine learning for the last decade at least: people develop benchmarks, the community rallies around trying to produce training or model changes that improve upon the benchmarks, you get SOTA, you get your ICML paper, hooray. But that is a slow and gradual process. It's a healthy way to engage in scientific progress in an engineering discipline. The problem with this is that the complexity of the behavior that language models
exhibit, and of the situations they can be deployed in, has reached a degree at which it's very difficult to design benchmarks that find reliable, robust failure modes reflecting the diversity of the manner in which they can be interacted with or used. There's still a place for benchmarks, obviously. My student Laura Ruis has recently put out a fantastic paper on a benchmark that tests whether or not large language models can understand conversational implicature, a form of pragmatics, and we find that they can't, so there's progress to be made there.

But in practice this brings us to the third way of being a skeptic, which is: let's just try to build cool stuff with the technology and see where it fails in the real world. It's the most unforgiving benchmark of all: the diverse ways in which users are going to try to mess with your model and get it to fail, just by virtue of trying to get utility from it. It also happens to be the way of addressing this problem where you can make a lot of money, so it's great that you can align the scientific method and financial incentives in such a way. Now obviously, if your model fails to land with the public and find utility, that in itself is not a sufficient case for the negative thesis, because you could have had poor product-market fit, you could have marketed it improperly, or you could have just not advertised the qualities of your model sufficiently, and that's why it didn't connect with users. But once you start having users engage with your model in real-world applications, you get a signal about what it's good at, where more progress is needed, and what the eventual critical failure mode is, if it exists. You really have a firsthand opportunity to see that, and then hopefully fix it, so that you can continue to add value and
continue to operate as a company and push the boundaries of technology. In this sense Cohere is really a fantastic place to work, to plug my employer here. If you're working at DeepMind, or at Facebook AI Research, some of the smartest people in the world work there and it's a great environment to do research, but it's very difficult to put anything you're building in front of customers. If you have a really great idea and you go to Google and say "I want some cross-functional collaboration", even if your idea is great and economically valuable, these companies are big, and they're incentivized to be conservative, to not rock the boat too much with their current revenue models, and to just really slowly and carefully integrate new technology into their user-facing offering. Conversely, you can work at a startup at an earlier stage, where of course you can put things in front of users, but you don't necessarily have the resources, computational and otherwise, including engineering support, to build things at a very large scale. Cohere and OpenAI and a few other places, and there are very few of them, are where you have both this immediacy of being able to take research ideas and put them in front of users in order to get that tight feedback loop, and the resources, engineering and computational, to build things at scale. So on the basis of my skepticism, I think Cohere is a fantastic place for me to be working to address this fundamental question.

Specifically, you mentioned scale. On that point, I too am skeptical about where scale is going to get us. I don't think we can just scale the data and scale the models, but it keeps on working. So, in line with my earlier response, we're going to really want to find the failure mode before we start thinking about how to fix it.

Got it,
thank you.

Awesome, thank you so much, Ethan, for the question. Thanks, Ed. I agree, Cohere is an awesome place to work on stuff. The next question is coming from Dwayne.

Good morning, everyone. Thank you, Sandra; thank you, Ed. Since this is an AMA, my question totally leans into the "anything" category, and it's a pretty niche question for the industry that I work in, which is the news industry. I'm curious, Ed, if you've had any daydreams or passing thoughts, as you're consuming what I'd call good, reputable news to start with, about applications ML might have in helping us improve trust and legitimacy for the consumers that read our content.

That's an interesting question. I've thought about the question of trust in the news, and observed that over the last four to eight years a very human problem has emerged, where people have lost trust in news and institutions. Misinformation, before we worry about machines producing it, is already amply produced by humans, to the harm of that trust. I'm not an expert in this particular domain, but I do have confidence in the following. We have systems that increasingly "understand" (I use scare quotes because they're mimicking understanding, or at least the jury's out as to whether this aligns with the actual mechanism by which humans understand, but allow me the anthropomorphization) and can explain, generate, or reference linguistic information and information in other modalities, such as pictures. As we build products that allow you to search, converse, and operate over these modalities, this will allow us to build tools that automatically extract, for example: given a news article, here are other reliable news sources or sources of information that are congruent or
aligned with the information being produced there, and here are some news sources that you might consider in opposition, and at least synthesize where the fault lines lie and what the sources of information are. That way people don't feel like they have this tunnel-vision view, where their favorite newspaper tells them XYZ and that's all they're going to look at, but rather can very easily, without much cognitive load, aggregate information from several sources and make a more educated decision about whether or not they trust particular sources. That said, you can give people the best tool in the world of this nature, and you're still going to be fighting against the natural human bias to stick with your team, to stick with people you trust, and to reject evidence even if it's well argued. So I don't think it's going to be a panacea, but I'm eager to think about how people are going to integrate the sort of technology we're building into trying to distill information in an explainable way, such that people can at least explore the possibility that their favorite news source isn't necessarily trustworthy, or better yet, reinforce the idea that they're not just believing their trusted news source because they live in a bubble, but because that information is consistent with what the other news sources are saying.

Yeah, very good. Thank you very much.

Thank you. The next question comes from Ferris. Do you want to ask your question, or shall I do that for you? Let me do it for you, then. Ferris is asking: speaking of biological systems, are you designing any systems that have a human in the loop, sort of hybrid systems?

Yeah, so one popular topic right now in large-scale language model training is learning from human feedback. OpenAI have been working a lot on this aspect, in particular with their move from DaVinci 2 to DaVinci 3.
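
Learning from human feedback is often set up by fitting a reward model to pairwise human preferences between two candidate completions. The following is a minimal sketch of that idea, assuming a Bradley-Terry style loss; the function name `preference_loss` and the scalar reward inputs are hypothetical, chosen for illustration, and nothing in this talk confirms that this exact formulation is what OpenAI or Cohere use.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred completion wins.

    Assumes a Bradley-Terry model, where
        P(chosen beats rejected) = sigmoid(reward_chosen - reward_rejected),
    so the loss is -log(sigmoid(margin)) = softplus(-margin).
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin): avoids overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical reward-model scores for two completions of the same prompt.
agrees_with_human = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_human = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees_with_human < disagrees_with_human)
```

When the two rewards are equal the loss is log 2, and it shrinks towards zero as the reward model scores the human-preferred completion increasingly higher than the rejected one. In practice the rewards would come from a trained network and the loss would be averaged over many annotated pairs.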
And we're definitely looking into this. Our core belief right now is that collecting data... For example, OpenAI have used this primarily to train excellent instruction-following models: models that, rather than operating by continuing a sequence of text or a sequence of inductive reasoning, are good at just taking instructions in a very natural form, like "write me a poem about this" or "tell me about that". That's the basis of a lot of the models we're seeing now, including, I believe, ChatGPT. They trained that initially just by collecting a lot of instructions and completions of those instructions, and training against that data in a supervised format. They then augmented that with human-in-the-loop (slightly asynchronous human-in-the-loop) RL from human feedback, where they collect feedback about two proposed continuations that the model gives for a prompt, and then try to rank them in alignment with how humans would rank them, in order to be able to further train the model when you have a bunch of prompts but don't necessarily have the gold-standard responses from annotators.

I like that paradigm a lot; it definitely makes a lot of sense. Our current design philosophy is to think: let's focus on actually just getting core data, initially from annotators purely, and then from further annotation of how people are interacting with our models, and then just continue supervised learning, focusing more on the diversity of our data and on the quality of that data, and see how far that gets us before we try reinforcement learning or a similar method. This might seem absurd for me to say, given I've been working a lot on reinforcement learning, both with my group at UCL and in my prior work at Facebook AI Research, but I'm a big fan
of pragmatics, and of being conservative with the degree of complexity you add to your learning mechanism. So we have a pretty reasonable approach, which is: see how well a simple method scales, and then add complexity on the algorithmic front.

Wonderful, thank you. Thank you, Ed, for answering the first question. Now we have a question coming from Jean-Pierre.

Hi. Good question, so try to keep it short please, if possible; we have more questions coming.

I've been studying deep learning during my master's degree, and it looked to me like it's mapping: very extensive, smart mapping. And when you look at human beings, or even nature, it's all hierarchical: the body is made of organs, an eye is made of an iris, and so on; a cat is made of legs and a head. So it seems that we model reality and information as a hierarchical structure of objects, and then we learn the properties of those objects. So it's really easy for us: you've never seen a cat running, and then you see a cat running, and you just update the property. So I was wondering, maybe I'm wrong, but are there any people who do this today? Is there research in that area? Because in Transformers you have some levels of hierarchy, but they are not defined in terms of the objects that people use every day in language or in their minds.

Yeah. I mean, deep learning is a very broad and wide field now, and so there certainly are a number of project areas that work on interpretability, not just of the output but of the intermediate representations. There was a lot of work between 2014 and 2017, around that time, on trying to, as I said, produce neural architectures that have more mechanistically, if not biologically, inspired structures, like differentiable neural computers, differentiable stacks or queues, and Neural GPUs, where we're doing continuous relaxations of discrete architectures, and so you can think about what's happening at the
intermediate levels: it's manipulating bits of information and trying to compose them. That's been a very interesting research program. Separately, there's also been work on neurosymbolic methods, where somewhere in the system things collapse onto interpretable symbols, and the neural network components manipulate those symbols or operate on them as an intermediate layer. Again, there's a large body of research on this, including work I did with Richard Evans on differentiable Prolog interpreters.

But to go back to your foundational question, I don't think it's wrong to say that standard deep learning as we do it today, like taking very large-scale Transformers, is also capable of learning hierarchy implicitly by virtue of doing something like mapping. I mean, in machine learning, when you're training a large network against your data, you are learning a mapping, right? You're learning a function, and every function is a mapping from input to output. The idea is that, by having enough data and a sufficiently expressive model, the best way of explaining the training data is to actually implicitly induce some degree of compositional substructure within the data, and then exploit that in order to extrapolate well. So the only explanation that I find plausible for why these neural networks exhibit compositionality, in that you can run them and explain data that you haven't seen during training time, is that this is exactly what has happened during training: the stochastic gradient optimization has induced within the network something like a hierarchy. Is it interpretable? No. But is it there? It must be, in some form; if not, they would just have completely fit the training data and not be able to explain held-out data.

Thank you. Thank you, Jean-Pierre, for the question; thanks for your answer. We have a question
coming from Adrian Ziegler next.

Yeah, hi there. Mine is a bit more applied. How do you think about prompt engineering versus fine-tuning, or also alignment methods such as InstructGPT, when it comes to using an LM for a specific use case?

That's a great question, and it's very relevant to our current strategy, so I'm happy to talk about it a bit. Prompt engineering is a fascinating mechanism by which we can interact with language models, whether they're general ones or slightly more specific ones. It says: let's exploit the fact that they take text as a modality, and use text to condition what the rest of the generation is going to look like. You can see that as trying to take something general and get it to act in a specific way. People like it because it doesn't require actually understanding how the models were trained, or further training of the model, or any sort of programming. So it's been a very intuitive way for developers to interact with large-scale language models and get them to do things like act like a summarization engine, or act like a chatbot. So that's super cool.

Where it starts to fail, if you take the most general models, even very good ones, is that you might obtain something that acts like a summarization engine or something that acts like a chatbot, and it's going to work something like 95 percent of the time. If that's good enough for your product, then just stick to prompt engineering. But in practice, for a lot of business-critical cases, what you want is for things to work 99.5 percent of the time. To get that sort of robustness, to get that sort of highly specialized behavior, fine-tuning is the solution that we typically engage in: we find data that reflects the downstream application, and we retrain on it. But obviously, as you go towards specificity using fine-tuning, you lose a lot of the
generality. So if you fine-tune too far towards a specific use case and someone wants something in the neighborhood of that use case, you've kind of trained yourself out of the generality that would have allowed that rapid adaptation. So a lot of what we're doing, and I think a lot of what the field is going to start exploring, is figuring out what the sweet spot is between specificity and general applicability: how can we train something that's domain-specific enough that it's commercially applicable, but in a more constrained way, where we can still prompt it to have specific behaviors that will for one prompt help business A and for another prompt help business B, and obviously let the businesses themselves determine what these prompts are, with some guidance from us. Exploring the space is a difficult task, but it's really kind of our raison d'être right now, so we're pretty thick into it.

Awesome, thank you so much. Actually, um... go ahead.

Yeah, I guess for me this is just such an exciting time, because it seems like every few weeks there's another really large language model, and they just keep getting better and bigger, with more options. So it's utterly fantastic. But I guess I'm curious about your perspective on how the whole industry will adapt to this. Do you see that some of these models will become increasingly specialized in various ways, so we'll see very diverse ways to use these different tools? Or is it more like a winner-takes-all, where the biggest, best, smartest models will dominate? I'm really interested in your view on that, as well as how you see Cohere fitting into that competitive space. What's your niche or role, or how do you believe you'll stand out?

So, with regard to whether there's going to be a winner-takes-all, I mean, I hope not, because OpenAI have had a pretty big head start on everyone else. If it was that simple,
that would be great for them, and also a very interesting moment for the field, because to really take it all you'd have to have a general model that is so robust that everyone just wants to use it rather than try to find some sort of specialized model, or something that works better in another domain. In practice, what I'm seeing from what they've put out with GPT-3, and I guess GPT-3.5 as what is unofficially underpinning ChatGPT, is that these are really powerful and great models. They're in line with some of the progress we've seen within Google and within Facebook. All these models have really clear strengths but also very clear failure modes, and I don't necessarily believe that any of these companies will be able to address all the failure modes fast enough, or capitalize on all the strengths fast enough, that they will be able to focus on every area at once. Clearly, having a really good, strong foundation model isn't sufficient to necessarily address every use case. And so companies that have the second-best or third-best base model, but that try to enter areas where the other companies aren't necessarily competing, will have a better time just by virtue of focusing on the data and the pragmatics of that particular area.

So I think what will be happening over the next few years is that there won't necessarily be a Cambrian explosion of large-scale language modeling companies in terms of building the base models, but there will be, I think, more than two or three that will be servicing different areas and that will grow into their own strengths. And if there's a winner-takes-all situation, that will be by virtue of them either merging or a monopoly forming, which I think we also would like, for regulatory reasons, not to see, rather than just by virtue of someone having a significant head start.

Exciting time to be in this
space, that's for sure. Thank you so much for the question. Thanks, Ed. The next question comes from Marine. You can unmute yourself and ask directly, or I'll do it for you. So, Ed, you highlighted the difference between the production cycle in big companies versus startups. Marine is asking: what makes a good infrastructure for deploying ML models?

I'm woefully underqualified to answer this question. I should say my background is that of research, and when I was in research, as I said, we piggybacked on the excellent engineering work of research engineers to really help us scale. Increasingly, researchers have to actually think about the model architecture and the data regime in order to focus on leveraging scalability, but that wasn't something that I necessarily had to do that much myself at the point when I was doing primarily research. And again, within the business: I came to Cohere, and we have an excellent infrastructure team and an excellent foundations team that have built the framework for training these models and then serving them at scale, so it's not necessary for me to have extremely well-defined expertise in this. So the very short answer is: I can complement what these people bring to the table, but fortunately I can rely on their expertise, and I'm not in a good position to answer that question.

Fair, that's a lovely and honest answer.

All right. So I think this might be a little bit of a stupid question, but what do you think about using Transformer architecture to improve Transformer architecture, maybe by engineering prompts so you can query questions about mathematical proofs, where the knowledge base backing the model is a corpus of mathematical statements and their proofs?

That's good. I mean, first, I don't think that's a stupid question at all. That sounds very interesting. There are simple
questions, but there aren't any completely stupid questions, and that, by any measure, was not a stupid one. I think it's an interesting research topic. So you're asking: why not use several networks, for example have one model learn to prompt another? This reminds me a little bit of the lot of work there's been on using networks to condition other networks, going back to work by Schmidhuber on slow and fast learning, which was probably among the early meta-learning papers; I think that was with Sepp Hochreiter. There has obviously been work by [inaudible] and colleagues at Facebook, and [inaudible] at Google, on hypernetworks, where you have networks predicting the parameters, or modulating the parameters, of other networks. What you're proposing sounds a little bit like that, except in Transformer land. So let's pretend that you have one network conditioning the activations of another via a discrete layer, via language as a modality. I'm trying to think if that's actually been done. There's been work around that kind of topic.

I guess, yes, there's work on hierarchical reinforcement learning with language as a sort of command medium. Chelsea Finn's lab had what I'd describe as a position paper where they showed that feudal reinforcement learning could be defined in such a way that the meta-controller... If you're not familiar with feudal reinforcement learning: instead of having a single policy acting on an environment, you have a meta-controller that works at a higher level of abstraction, and a lower-level controller that takes instructions from the meta-controller and acts on the world. To give you an example, a robotic arm might need in practice to learn how to lift a red ball, and then a blue ball, and then a green ball. If you just let a single policy direct the control of this robotic arm and try to explore what the right combination of balls is, it's going to take
a while, because that's exploring all the different ways of actuating and moving the arm. If you factorize that exploration problem, so that you have a meta-controller that gets to output instructions like "move the red ball", "move the blue ball", "move the green ball", and the controller follows those instructions, then the exploration space is equally factorized, in that you are now just exploring the different combinations of balls you move, and the controller only needs to get good at following the low-level instructions. Using language as an intermediate modality there sounds a lot like what you're proposing. You have a complex problem, and you have some sort of factorization into a network that's an expert at posing questions and a network that's an expert at answering them; then you could train it through some adversarial process. But I don't know of any literature that's tried exactly that, which is good news, because if anyone listening to this call is looking for a research project, then you've just proposed something that sounds like something to flesh out a bit more and try.

Oh yes, grab it while it's still hot. Meanwhile, we're moving to the question that was posted by Benjamin.

Thank you, Sandra. So, in relation to large language models, how do you think GPT-3 and other forms of generative AI should be governed going forward, knowing that there's ample opportunity to misuse some of these technologies, for example in terms of polluting information ecosystems via chatbots driven by GPT providing misinformation, and so on? Do you think that guided release strategies and other forms of self-governance are enough, or is this current open-source movement of these models stipulating an area that governments should more actively move into and regulate going forward? Thank you.

That's a really tough question, and it's been a
long day, so let me try and think of something not embarrassing to say. I mean, I definitely a
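
As an aside on the human-feedback step described earlier in the transcript (annotators compare two proposed continuations for a prompt, and the system is trained to rank them the way humans would), one common way to write that objective is a pairwise preference loss. The sketch below is illustrative only: the function names and the example scores are hypothetical, and the transcript does not say which loss OpenAI or Cohere actually use.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_preferred - score_rejected)).

    The loss is small when the scorer rates the human-preferred
    continuation above the rejected one, and large otherwise.
    """
    margin = score_preferred - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(score_pairs):
    """Average the pairwise loss over (preferred, rejected) score pairs."""
    return sum(pairwise_preference_loss(p, r) for p, r in score_pairs) / len(score_pairs)


if __name__ == "__main__":
    # Hypothetical scores a reward model might assign to two continuations.
    agrees_with_human = pairwise_preference_loss(2.0, 0.0)
    disagrees_with_human = pairwise_preference_loss(0.0, 2.0)
    print(agrees_with_human < disagrees_with_human)
```

In a real system the two scores would come from a learned model and the loss would be minimized by gradient descent; here they are plain numbers so the ranking behavior is easy to inspect.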

Original Description

Ed previously worked at the University of Oxford's Department of Computer Science, and was a Fulford Junior Research Fellow at Somerville College, while also lecturing at Hertford College to students taking Oxford's new computer science and philosophy course. He is now an Honorary Professor at UCL. His research interests include natural language and generation, machine reasoning, open-ended learning, and meta-learning. He was involved in, and on multiple occasions was the lead of, various projects such as the production of differentiable neural computers, data structures, and program interpreters; teaching artificial agents to play the 80s game NetHack; and examining whether neural networks could reliably solve logical or mathematical problems. His life's goal is to get computers to do the thinking as much as possible, so he can focus on the fun stuff.


This video features Ed Grefenstette, Head of Machine Learning at Cohere, discussing various topics in machine learning, including large-scale language modeling, conversational intelligence, and the limitations of artificial general intelligence. Ed shares insights on Cohere's approach to innovation sharing and competitive advantage through data-driven approaches. The video also covers technical topics such as Jax, PyTorch, and Transformer architectures.

Key Takeaways

Build large-scale language models using Jax and PyTorch
Fine-tune language models for specific tasks and robustness
Design and deploy ML models using Transformer architectures
Implement retrieval augmented generation and fine-tuning for natural language understanding
Evaluate and benchmark language models for reliability and performance

💡 The key insight from this video is that large-scale language modeling is a complex task that requires careful consideration of factors such as data quality, model architecture, and fine-tuning techniques. Ed Grefenstette's discussion highlights the importance of innovation sharing, competitive advan
