The Library Problem

Data Skeptic · Intermediate · 🏗️ Systems Design & Architecture · 9y ago

Key Takeaways

The Library Problem is an interview question about predicting, at checkout time, whether a library book will be returned late or not at all. The episode walks through how a candidate might approach it with machine learning and data analysis, including binary classification, class imbalance, feature engineering, and dimensionality reduction.

Full Transcript

[Music] Data Skeptic features interviews with experts on topics related to data science, all through the eye of scientific skepticism. [Music] All right, Data Skeptic listeners, welcome to the last episode of 2016. Thanks for tuning in to the last special of this year. You know, last year this time I did a little rant called Let's Kill the Word Cloud. And I gotta say, I've seen a lot fewer word clouds. A bunch of people reached out on social media to show me projects where they explicitly chose not to do a word cloud and found a better result. That was awesome. Let's keep that up. I was trying to think what we should try and kill this year, and I actually had a better idea. So, one of the things I do professionally is help companies figure out what tools to use, build prototypes, and then sometimes staff their teams on those prototypes. I'm not really a recruiter in that way, but sometimes I'll build some model for some production system, and when I'm done, I staff it so that whoever I built it for can maintain it and extend that initial work when I'm gone. I finished up a project like this just this past week. So, I was doing a lot of interviewing, trying to find someone to carry the project on. But a funny thing happened to me, and I thought it would make a good episode. So, there's a question I like to ask on interviews when I'm doing an initial phone call with people. It's not a super technical question, and it's more of a chance for me to understand exactly how far along people are in their journey as a data scientist. It's somewhat open-ended, so people take it in different directions that reflect their own experience. It's also something that fairly quickly helps me determine if people have done some applied machine learning on real-world problems or if they've just read a lot and done course demos and stuff like that. Real data just is different. There are certain things that you have to experience in the wild.
So, I've probably asked maybe a hundred people this in the last two to three years. I guess with those sorts of numbers, I shouldn't be surprised at what happened, but it did take me a little off guard. So, here's what happened. I get on the call, we're asking about background, their thesis, and stuff like that. And then I hit them with this question that I'll hit you with in a minute. And the person says, "Oh, yeah, yeah, my friend said you'd ask this." And then they started going into a pretty prepared spiel. Now, this was not a referral. This wasn't like I interviewed someone and they said, "Oh, also interview my friend." I had no idea who this person was talking about. So, I stopped them and I said, "Well, what do you mean that your friend said I'd ask this? Cuz this is something I want to ask impromptu. I don't want a scripted answer. I want to, not catch you off guard, but get a natural reaction, not a prepared answer, cuz it would be pretty easy to ask someone and get them to help you prepare for it. I want to hear your reaction." So, I'm like, "Well, what friend?" And apparently, this person had a friend who interviewed with me months earlier at a totally different company. So, I guess this is out there in the zeitgeist and I should probably retire this question, since I really liked it for its impromptuness. And if I'm going to retire it, why not talk about it here? So, what's this going to be? I'm going to lay out this question for you the way I lay it out for everybody I would interview. And then I'm going to walk you through what I'm looking for, what I'm not looking for, and just how I would decompose this question. So, if you're a senior data scientist, this might be a little bit introductory for you, but maybe you could study my technique. You won't agree with everything I say, but you could learn ways in which I interview that you might benefit from. For a more junior person, study this question.
This would be something you might be asked or something similar. And if you're sitting across the table from someone like me, this is maybe the way they're going to interpret your answer. And if you're not really a data science professional at all, you're just a casual listener who likes my data stories, stuff like the potholes episode this past year or the Saturday Night Live analysis we did, the more pop side of the show, if you will, I think it's worth staying tuned as well. You'll hear a little bit about how a data science problem, admittedly a fake and contrived one, but a problem nonetheless, would get sorted through by someone. And I think just this process can be interesting. So here's the problem as I present it to people. I find the city nearest to them, Los Angeles for me, and I say: all right, imagine the downtown main branch of the Los Angeles library has hired you to deal with this issue they have of books being returned late or sometimes not at all. They'd like to leverage all the data they have to try and predict, at the time of checkout, is there a risk that the person's going to bring the book back late? And to do this, you're given 10 years of transactional data, tons of transactions. Each row is a single instance of one book being checked out from the library. Now, every row in that big database has exactly five columns. It has a book ID that we'll talk about in a minute, a patron ID we'll talk about in a minute, and a cart ID, which is just a way to link checkouts: if people checked out multiple books at the same time, they would all have the same cart ID. Then you have a checkout timestamp, meaning down to the second when the person got the book. And you have a return timestamp, which is down to the second when the book was returned to the library, or that can be null if the book was stolen or recently checked out. Now, going back to those IDs, we've got a patron ID, which links to a patron table.
That's where the library has stored everything they know about their patrons. And that table has everything you'd expect it would have. Name, address, demographics, income, anything you could plausibly argue the library would or should know about a patron, they've got it in there. Now, the book ID links to a books table, which for each book has every bit of information you can find on Amazon.com. So, its price, the year it was published, what color its cover is, anything on Amazon, you've got those data points. All right. Now, for a second, let's step out of what I would tell the person I'm talking to, and I'm going to give a little meta-analysis on this problem. Now, I set it up this way because now we're trusting intuition on this data set. If I were to describe for you some, I don't know, medical data set or engine telemetry, I'd probably have to explain to you a lot about those fields. Even if you're not a library patron, the idea of a library existing, what its business is, how it works, these are obvious things. And I don't want to have to walk through features, their cardinality, this sort of thing, because we'll just let it all be sort of assumed. Everyone knows Amazon.com. Everyone knows what data is there. So I don't know. I guess that's not really called a mnemonic, but it's a nice abbreviation that I find very useful. It's also useful because there's an implication about the mechanism of that system. When you ask any non-trivial question that isn't just "do you know this method" but more "how would you tackle this problem," there's always a lot of hand-wavy stuff about the details of what you do and how you do it, because the reality is the process is iterative. You look through the data, you do exploration, you find things and you exploit them, and down the road I might give someone a data set and ask them to demonstrate their prowess. But at an early stage, that's too big a commitment. I just want to have an initial talk.
So as I was saying, we have an understanding of the system of a library. For example, we would expect that it's not a common case that there are thefts. You know, if you ask me, how often does a convenience store get stolen from? And I don't mean like armed robbery, but someone taking a five-finger discount. I don't know. Higher than I'd like to think, I'm sure, but I have no idea. It could be anywhere between 1% and 20% of the inventory. I really don't know. But the library, there's a sense of like that's not really where you go to steal. It must be the minority of books that are checked out that come back late or end up getting stolen or something like that. It's just sort of an intuitive thing that's there, and it's this shorthand we get to take advantage of. So that right there is the first thing that the best candidates tend to point out. They say, "Well, this is a class imbalance problem. We probably have a ton of examples of good behavior at the library and a small percentage of bad behavior." It's the classic situation of, "Well, if only 1% of the books are late, then I can make a 99% accurate classifier immediately by just assuming all books get returned. I'm only wrong 1% of the time." So class imbalance is a tricky thing. It's not in every field, but I feel like most data scientists who have done a couple of good projects have encountered it. So you get a couple of bonus points if you point that out at this stage. Now it's also interesting to see what people say next off the cuff. You know, what's intuitive about it. An interesting one is people will sometimes talk about the price of the book. Like, oh, maybe more expensive books get stolen. And for sure that happens. I guarantee you there's cases where there's an expensive book. Some real lowdown person that should be ashamed of themselves wants to own this book but doesn't want to come by it honestly. They go to the library. They steal it. But to be honest with you, I don't see price as the major driver here.
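To make the class imbalance point above concrete, here is a minimal sketch in Python. The 1% late rate is the illustrative figure from the discussion, and the labels are made up, not real library data. It shows that an "always on time" guess scores 99% accuracy while catching none of the late returns.

```python
# Hypothetical labels: 1 = returned late or never, 0 = returned on time.
# 10 late checkouts out of 1000 mirrors the illustrative 1% late rate.
labels = [1] * 10 + [0] * 990

# A "classifier" that assumes every book comes back on time.
predictions = [0] * len(labels)

# Accuracy: fraction of checkouts where the guess matches the outcome.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the late class: of the truly late checkouts, how many were flagged?
flagged_late = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall_late = flagged_late / sum(labels)

print(f"accuracy:    {accuracy:.2%}")
print(f"late recall: {recall_late:.2%}")
```

The high accuracy comes entirely from the majority class, which is why accuracy alone is a poor diagnostic on a problem shaped like this one.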
Think through this problem, and again, the data will tell the whole story. So a lot of people say that at this point. They say, "Well, I need to do exploratory data analysis," and of course that's correct, but let's put our coin down a little bit here. We can think through this problem in advance of looking at the data. It's all right to push people a little to get into some hypotheticals. What would you first look at? And the price one, sometimes I'll point it out, sometimes I won't. You know, is that an effective one? Maybe. Of course we should look at it, but I don't think that's going to be the strongest signal. Similarly, the color of the book. That seems totally irrelevant to me. Now, I know there's that Netflix result where they found that people like things with black covers or something like that. It was pretty cool. That came out last year, or the year before. Now, that might affect what you check out, but I don't think people are less inclined to return a book or more inclined to steal a book based on the color of its cover. So, while we might use that as a feature, it seems unlikely that it would be helpful. And in fact, when we look at the resulting model that we'll eventually build, if that shows up at all, we should be skeptical of it. Say, well, what magnitude of contribution is that adding? If it's a major contributor, you know something's wrong. It's not to say that that's false. Maybe there's a certain very famous type of book that is in a rare color on its spine and people are known for stealing that book, in which case it's correlated with color. That feature has information in it. There's predictive power there, but you haven't gotten to the root of it. So yeah, let's keep color around, but let's be very careful with our results on it.
So a lot of times, you know, I let people drift around, explore the problem, ask questions, get some details, and if I need to put them on track, you know, if they're just sort of going off in too many directions, I might say, "Well, just off the cuff, what is your expectation? What do you think would be the most useful feature you could come up with for this data set?" And again, no right or wrong answers because there is no data set, but there's an answer that is deeply preferred, and it's: well, I would aggregate the user's history and look at whether they've returned books on time successfully before. That is almost certainly the most predictive value. Yes, past behavior is not indicative of future behavior, but actually it is. Now, that's not there in the data set I described. Remember, I had this patron ID stuff, and everything that you'd expect the library would know about the person is in there, their age, their demographics, whatever. I didn't say how many books they've checked out in the past because, number one, that would have been too much spoon-feeding. But also, that's not likely to be there in most application databases or most transactional systems. All that's important to the engineer who built that system is: scan the book, note that this person checked it out at this time. And maybe when they register the library card, it's: capture all the actual facts we have about that person, their address, their birth date, these sorts of things. Coming up with derived data, like how many books has a person checked out, what is their average time between checkouts, how many times have they returned late or not at all, these are analytical questions that generally get created by a business intelligence layer, if one exists at the company.
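The derived patron-history features described above can be sketched as follows. This is an illustration under stated assumptions, not the method used on any real project: the field names (`patron_id`, `checkout_ts`, `return_ts`), the helper `add_patron_history`, and the 30-day loan period are all hypothetical. The one design choice worth noting is that each checkout only sees outcomes that were observable at that moment, so the feature does not leak future information into the prediction.

```python
from collections import defaultdict
from datetime import datetime, timedelta

LOAN_PERIOD = timedelta(days=30)  # assumption: one-month loans


def add_patron_history(rows):
    """Return copies of rows with two extra fields per checkout:
    prior_checkouts and prior_late_rate (None when no outcome is known yet)."""
    history = defaultdict(list)  # patron_id -> [(due_ts, return_ts), ...]
    enriched = []
    for row in sorted(rows, key=lambda r: r["checkout_ts"]):
        now = row["checkout_ts"]
        past = history[row["patron_id"]]
        # Only use outcomes already observable at `now`: the book was back,
        # or its due date had passed. Anything else would leak the future.
        known = [(due, ret) for due, ret in past
                 if (ret is not None and ret <= now) or due < now]
        late = sum(1 for due, ret in known if ret is None or ret > due)
        enriched.append({
            **row,
            "prior_checkouts": len(past),
            "prior_late_rate": late / len(known) if known else None,
        })
        history[row["patron_id"]].append((now + LOAN_PERIOD, row["return_ts"]))
    return enriched


# Made-up transactions for a single patron.
rows = [
    {"patron_id": "p1", "checkout_ts": datetime(2016, 1, 1), "return_ts": datetime(2016, 1, 10)},
    {"patron_id": "p1", "checkout_ts": datetime(2016, 2, 1), "return_ts": datetime(2016, 4, 1)},
    {"patron_id": "p1", "checkout_ts": datetime(2016, 5, 1), "return_ts": None},
]
for r in add_patron_history(rows):
    print(r["checkout_ts"].date(), r["prior_checkouts"], r["prior_late_rate"])
```

A first-time patron gets `None` for the late rate rather than zero, which keeps "no history" distinct from "a clean history".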
So, somebody outside of the core engineering team that builds or maintains or services the checkout process comes along and grabs that database and usually does some data warehousing or things like that. If that exists, that's great. Those people will be there to help you. And sometimes people bring that up. More experienced people will know that and say, "Well, do I have a data cube or these sorts of things?" And my answer in this case is no, because it's not necessarily common that you do have that layer there or that it has all the features you need. It might if they've done a great job, but those people might be data warehousing professionals, which are not the same things as data scientists. So, go back to the core data and explore it and realize, hey, I'm going to have to aggregate some stuff here. That feature, once I've said it, seems rather obvious. That's one of those obvious-after-you-know-it things that is for me a good clue about someone's experience and seniority. That doesn't mean you fail if you don't come up with that. I'm going to make an inference about a person's experience from it. Now, generally I try and drop a few breadcrumbs along the way and lead the person to that answer. I'd like them to get it, but hopefully with as few prompts as I can give out. The danger here, or the naive answer I don't want to see, is when someone says, "Oh, you just take your data and throw it blindly into your algorithm and hope for the best." Machine learning, which is essentially what we're doing here, isn't magic. I mean, it works by processes that make sense. And on a project like this, machine learning is a lot like manipulating insects. You know, they do what they're going to do. Sometimes you have to go fill a gap to make sure they don't walk through it or, you know, make some small change to get the insects to react the way you want. That's the process of tuning machine learning in my experience.
At least when I have constructed or engineered features; it's different when you have, you know, large-scale telemetry stuff. But back up, I jumped into machine learning here. The other thing, and this is interesting, no points against anyone who fails to say this, because I think people who are more experienced will leap over this fact, but in thoroughness, someone should point out, well, this seems to be a classification problem. And you'll notice I forgot to mention that. That is to say, you can treat this as a binary outcome. The book will be returned on time, or the book will be returned late or not at all. So every record in the database can be labeled that way: returned on time, not returned on time. And then the process you want to go through here is creating a classifier that makes a prediction: given all the observable data, which class do we think this particular checkout is about to be a part of? A book that will get returned on time or one that won't. Now, in truth, you might want a more probabilistic method. So, here we sometimes get into methodology, and I'll leave that discussion out of this podcast because there's lots of ways you could go. This could be a very Bayesian-driven approach. Some people start talking about probabilistic programming at this stage. And really the binary classifier is sort of a sledgehammer approach. Not necessarily the most sophisticated, but a good place to start for sure. So I'm going to talk mostly through that way because, after all, if you can't build a decent binary classifier, you're probably not going to be able to build something a little bit better. So I'll see this as iterative. I'll learn the problem trying to predict the binary events. Then I'll figure out, well, how is this going to fit into the library operationally? And would it make sense for me to give them a slightly more complex model, and will they be able to interpret its results, and things like that.
As I said when I set up the problem, they want to know at the transactional level, at the checkout counter, at the time the book's being checked out: is there a risk? Which could be rephrased to: what is the risk? But realize then that you have a librarian or a checkout person. I don't know if librarian's a title you have to achieve or not, but the person there is going to interpret that result. So we should think carefully, regardless of how we arrive at our answer: what should we show them? Because if you say, you know, there's an 80% chance this book will be returned, or you say there's a 20% chance this book won't be returned, that's the same information, but it's likely to get interpreted differently. So, let's think of this as a binary classifier just for the time being. And there's a little note to be made here. You definitely get points if you recognize this, but I guess no points off if you miss it. When I described the timestamp of the book being returned, I said it would be down to the second when the book was returned, and null if it hasn't been yet or was stolen. So, I actually gave you an interesting point right there in plain text, but I rush over it so not everyone sees it. There's some length of time you're allowed to have the book out. Let's just say it's a month just to keep it simple. Well, that means the most recent month's records are not useful to you, because if I checked out a book yesterday, my return time is null. I haven't returned it yet, but I still might. This is what we call right censoring: the fact that we haven't yet had a chance to observe whether or not that person will return the book. It's uncertain. It could go either way. And therefore, that data is a little bit different from data that's more than one month old, where we now can establish whether or not the transaction turned out to be successfully returned or not. So again, sledgehammer approach, we could just say drop the most recent month's data. That's a viable approach.
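As a rough sketch of the "drop the most recent month" approach just described, combined with the binary labeling discussed earlier, the snippet below discards every checkout inside the final loan period and labels the rest. The field names, the helper `label_checkouts`, the snapshot date, and the 30-day loan period are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

LOAN_PERIOD = timedelta(days=30)  # assumption: one-month loans


def label_checkouts(rows, snapshot_ts):
    """Drop checkouts too recent to have a known outcome, then label the rest.
    late = 1 means returned after the due date or never returned."""
    cutoff = snapshot_ts - LOAN_PERIOD
    labeled = []
    for row in rows:
        if row["checkout_ts"] > cutoff:
            continue  # right-censored window: outcome not yet observable
        due = row["checkout_ts"] + LOAN_PERIOD
        ret = row["return_ts"]
        labeled.append({**row, "late": int(ret is None or ret > due)})
    return labeled


# Made-up transactions; the snapshot is the moment the data was exported.
snapshot = datetime(2016, 12, 31)
rows = [
    {"book_id": "a", "checkout_ts": datetime(2016, 1, 1), "return_ts": datetime(2016, 1, 15)},
    {"book_id": "b", "checkout_ts": datetime(2016, 3, 1), "return_ts": datetime(2016, 5, 1)},
    {"book_id": "c", "checkout_ts": datetime(2016, 6, 1), "return_ts": None},
    {"book_id": "d", "checkout_ts": datetime(2016, 12, 20), "return_ts": None},
    {"book_id": "e", "checkout_ts": datetime(2016, 12, 10), "return_ts": datetime(2016, 12, 12)},
]
for r in label_checkouts(rows, snapshot):
    print(r["book_id"], r["late"])
```

Note that book "e" is dropped even though it was already returned. Keeping only the resolved rows from the last month would retain early returners and discard everyone else, which would make recent behavior look better than it is.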
And actually, I use that in a lot of situations. It's not always doable. If you don't have a lot of records, that can be problematic because you want to squeeze as much out of it as you can. But I said you had 10 years' worth of data. So, it seems reasonable that you can throw out the most recent month. And again, we can appeal to our intuition about libraries. As far as I know, there's no, you know, blitz "check out your library book" day, some tent revival or the equivalent of Black Friday where behaviors are totally different. That's not to say that there aren't seasonal patterns or major events at the library, but I see no reason to think that there'd be a sudden shift where the last month of data is exposing some major innovation or dramatic change in behavior. It seems to me that, in the library process, any trends it has change slowly. So throwing out the last month to me seems okay. Now, I do have people sometimes chase that rabbit hole of, well, here's how I would handle this. Generally, it's someone for whom that was exactly their problem: the most recent records were the most important. So they had to deal with the censoring, and that's cool. I like to let people go off on that tangent and hear about that work. So in a way I uncover that aspect of what they bring to the table, but it's not a required thing to comment on here. Getting back to aggregating patron data, you know, how many books has this person checked out previously? What percentage were successfully returned? Those are great features, but I also suspect, again playing on my intuitions about the library, that we have a very long-tail distribution over usage. Maybe 10% of the users account for 90% of the books checked out. You know, for sure there's some book heads that go in and check out multiple books a week, but that's rare. That's not the common case. I would bet that the average person, the mean average, checks out around 1.9 books.
Maybe the harmonic mean would be like 1.2, and the median value, I would guess, would be one. Oh, maybe I should unpack those real quick in case you don't know them. When we usually say mean, it typically refers to the arithmetic mean. That's just sum them up, divide by n. The median, of course, is the 50th percentile. The person in the, you know, absolute middle if you line them up in order. So when I say I assume it's one, that means that at least half of your patrons have only ever checked out one book. And the harmonic mean is an interesting one. You really should go look it up, or I should do a mini episode on it. It's like the ugly stepchild average. It's even less popular than the modal average, which is the most common number. But anyway, the point being it's extremely long-tailed and there are 80/20 dynamics involved. So then that ever-so-predictive feature we talked about, did they successfully return books to the library in the past, is not going to be relevant for the majority of your patrons. Remember, median of one, which means 50% of the people checked out one book. Not much of a history there. I mean, there is, like, you can look at all the people who've only checked out one book ever and what's the frequency with which those returned it or didn't. And now you kind of have a prior. I guess that could be useful for you. And that's not a bad approach. But something needs to be done about this long tail of customers. It's not obvious, and it's the kind of thing you arrive at just through exploring the data. Is there really a dividing line somewhere where there are super users and normal users? That's an angle a lot of people go to. I tend to find that I don't like those approaches because there's rarely a logical place to split the data. What does it take to be rich? I don't know, a million dollars in assets makes you a rich person, perhaps. I don't know, maybe that's not even rich. But does that mean someone who has $1 less is not a rich person? No. It's a gradient.
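The averages unpacked above can be compared on a small made-up sample using Python's standard `statistics` module. The checkout counts below are invented to be long-tailed and do not reproduce the 1.9 and 1.2 guesses from the discussion; the point is only how far apart the averages drift when one heavy user sits in the tail.

```python
from statistics import harmonic_mean, mean, median, mode

# Hypothetical lifetime checkout counts for ten patrons (invented, long-tailed):
# most people borrowed once, one "book head" borrowed forty times.
checkouts_per_patron = [1, 1, 1, 1, 1, 1, 2, 2, 3, 40]

arithmetic = mean(checkouts_per_patron)       # sum them up, divide by n
middle = median(checkouts_per_patron)         # 50th percentile
harmonic = harmonic_mean(checkouts_per_patron)  # n divided by the sum of reciprocals
most_common = mode(checkouts_per_patron)      # the modal average

print("arithmetic mean:", arithmetic)
print("median:         ", middle)
print("harmonic mean:  ", round(harmonic, 3))
print("mode:           ", most_common)
```

Here the single heavy user pulls the arithmetic mean up to 5.3 while the median and mode stay at 1, and the harmonic mean lands in between because it is dominated by the small values.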
So that isn't to say that binning users is bad. Using deciles, actually, you know, the top 10% of people, group them together. 10th to 20th percentile, group them together. That can be useful. And I do that on a lot of projects. Not to be a broken record, but the data will speak for itself. So a lot of the best ways to treat that come out of your exploratory analysis. But I like to push people to chase their tail a little bit here and talk to me about some of the things they would try, and maybe even stories about what's happened in the past. That's the best part here. When I ask you, well, how would you do this? A lot of times people start saying, well, when I had this problem, I did it this way. Sometimes I get into interesting discussions at this point about the similarity of users. Can we cluster people together based on behaviors and then model those as sort of like personas? That's a possibility. I always find personas are a little bit dangerous because most of the ones I hear coming out of, like, marketing teams are not data-driven. They're these arbitrary characters that they make up and overlay onto a customer base, but they don't really have good formal data-driven definitions. If you can't cleanly and inarguably label every person in your database into one of your personas, then what good are they? They don't seem to be a mapping to your actual organization's data. But you never know, there might be clusters of users. I could see where graduate students could emerge from their checkout behavior, the types of books they check out, the frequency, the time of day of visits. That could be a group that could be emergent from some dimensionality reduction process. I could see families being another. Of course, lots of kids' books checked out. I don't know, maybe, my mom always took us to the library early in the morning for some reason. Oh yeah, they had like a reading school thing. So maybe every Thursday at 10, checkouts around then are predictive of being in a family.
Those could be features you engineer. You could say, like, "appears to be family checkout," and that might be useful. But better yet, find some way to cluster that or reduce its dimensionality and let those behavioral profiles emerge from the data itself. And over the last couple years, I've heard some awesome stories from people about how they were working on some problem and did something clever in that area to really eke out the last couple percentage points in accuracy they were after. Cuz that's really what this is. It's an iteration process. We come up with a bunch of features, things that describe those transactions or maybe the users or the books. And we're looking for patterns and things that have information content in them such that we can make a prediction about the future here. On a problem like this, it can be endless in what you try. This is an area where I want to see how clever people are. The family thing I mentioned in fact wasn't my original idea. That was one that somebody gave me on their talk. They said, "Well, maybe I could label families, and family behavior is different." And I said, "Well, how would you detect families?" And we got into it. Very cool. Think about this problem. Step away from the numbers and the models. Just getting your algorithm to compile is not the key. Let's think about intuitively how this gets solved. In fact, let's say there was no algorithm. Let's say there was a person's job, the lieutenant supervisor of checkout forecasting. And that person just stood there. They should probably be on like some second floor with a tinted window so they could gaze down maniacally at the people. And they've got a little check box. They're going to say yes or no whether or not they think that person is going to return their book on time. Maybe scold them when they don't. Actually, don't scold them. Now we're getting into Goodhart's law and the nudge people and all that.
Could a person whose full-time job it is to look at these transactions and make a best guess, could they guess better than random? Almost certainly. I mean, first of all, they're going to get to know the frequent patrons, and presumably frequent patrons are successfully returning books on time. So, if they recognize, you know, Mrs. McCaffrey who comes every Tuesday and Mr. Jeffre who comes with his grandson every Wednesday, and they always return their books, well, then that person has memorized those particular data points and they'll predict them correctly. Now, we'd like a model that wasn't so much about memorization, and I think that person would develop more general things too. You know, they might observe that, I don't know, reference books get checked in better than romance novels or something like that. They'll develop these patterns. Now, the danger here is that they develop superstition. And let me tell you, no matter who you are, if you try and do this on your own, you will develop superstitions. You'll notice something a couple of times, it'll stick out to you, and you'll start to think it's a pattern. A lot of times it will be, most of the time, even. But we can always be fooled into seeing patterns that aren't there. Even the machines can, but they're a little better at avoiding it because we can mechanically control their procedures a little better and we can inspect how exactly they work. It might not always be easy to interpret them, but everything is deterministic and we have all the things we need to measure the units of computation that did our process. We don't have all the tools to measure what goes on in people's brains yet. So, if we trust that a clever person with that awesome job title could have some accuracy doing this process, we can think about, well, what is it they're going to pick up on? What might we be able to represent as features so that our algorithm that builds a model can learn to predict?
So, we really should be moving into talking about modeling. But to wrap up on features, I like to push people for a while and give them the opportunity to show me themselves being clever with what they'd come up with. It doesn't have to blow my mind, but say something like, well, I'll look at, you know, in the past, I'll look at when books generally get returned. Are they returned, you know, always on the last day before they're due, or do some people keep them overnight? Can that lag be informative and predictive? Maybe we can look at the checkout habits of the patron. If they always come at roughly the same time every week, are those people more successful at returning books than people that come randomly? Interesting idea. Great follow-up question: how do you represent that consistency mathematically? It's not like I check out the book every Tuesday morning down to the same second. What's your wiggle room? How do you measure that? How do you represent it in a way that might be useful? I also hope people don't fall for all my red herrings. When I say we have demographic data, I really don't think that that's, well, actually, you know, I don't know, maybe there is a gender difference. It wouldn't surprise me if someone said women were better at returning books on time than men. Although that's quite prejudiced of me to say, even though I am a man. And yeah, sometimes at this stage, if someone has some social science background, we get into the ethics and all that stuff. And that's good discussion. But the reality is I doubt that something like race has that much impact on the rate at which you return books. Some people get into the distance from the person's home to the library. That's interesting. We really don't know what will work until we try it. So, this is just a fun time to explore different ideas we might chase after. Almost everybody hits on some way that they want to aggregate the book features.
Now, if you took every little data point on Amazon.com against every book, you'd have this really sparse distribution of data. That is to say, like, maybe books sell very well when they have a particularly famous illustrator on them. So we should have a dummy variable, you know, for each book, that says it was created by illustrator A or illustrator B or illustrator C. Now, most illustrators have only illustrated a small number of books. Not all books are illustrated. There's something useful there, but you have to find the right way to represent it. Maybe it's dimensionality reduction. The statisticians tend to go one way here. The ML people tend to go another. The more pure maths people often get into linear algebra here. Lots of good ideas. Again, no right or wrong answers. A great time to get a sense of the tools people go to to solve problems. All right, moving on to the algorithm, when I'm kind of trying to wind people down because we could go on and on about this sort of stuff all day, I say, "Well, all right, what algorithm are you going to use?" And as I said earlier, we're just going to assume this is a binary classification problem. We put in all the data we have about the patron, the book they're checking out, and any of those aggregated features like their prior checkout history. Out should come a yes or no on whether or not the model thinks they're going to be late returning it. I think most of the answers I get here are just what people happen to use on a recent project. Logistic regression, random forest, XGBoost. There's no right or wrong answer here. There really isn't. I could probably build an argument for each of them and they'd all sound convincing. And in fact, there is actually provably no right answer. And that proof is something called the no free lunch theorem. Let me get a pen. I got to do a mini episode on that. No free lunch theorem. All right. Not January. Maybe February. Anyway, I'll figure that out later.
Lots of people say XGBoost. Great algorithm. Why'd you pick it? Somebody, this is the best description I ever heard of XGBoost, they said, well, of all the techniques for optimizing a decision process, XGBoost seems like the kitchen sink. And I really agree with that. It has all these clever tricks and optimizations that make it no surprise that that tree-based algorithm wins, I think, the majority of Kaggle competitions. Now, we could talk about why that is and the class of problems that end up on Kaggle and yada yada, but there's no doubt XGBoost is a great algorithm, but it is not the best because there is provably no best. No free lunch theorem. But when I hear what people go to, I learn their personal biases and I start to hear about their experiences. A common answer I get when interviewing more junior people is, well, I'd use random forest. Why? Well, because that's what we used on the Titanic project and this other thing for my final and it worked great. So, I'd use it again. A more senior answer would hem and haw a little bit about the nuances of different algorithms and why they would or wouldn't work given some of the situations here. So, this is a point for someone's seniority to come into focus a little bit. The maturity with which they can describe their choice of algorithm means a lot. Not necessarily a requirement, depending on what the needs are, but this is always a really interesting part in the conversation for me. So, the modeling part usually is pretty short. There's not a lot to say without some data. There's no point in talking about whether or not it needs to be regularized or we need some stacking of models or all these really advanced ML techniques. We don't even know if there's enough information in our data set to make a decent prediction. If your prediction is only slightly better than noise, then chasing after fancier models and things is generally not going to benefit you that much. So, how do we know if we have a good model?
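One way to check whether a prediction is "only slightly better than noise" is to compare it against a trivial baseline that always predicts the most common outcome. The sketch below assumes labels where 1 means returned late and 0 means returned on time; the labels and predictions are made up for illustration, and in practice both numbers would be measured on held-out data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Hypothetical labels: 1 = returned late, 0 = returned on time.
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 0, 1, 0, 0, 0, 1, 1, 0]

model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline_accuracy(y_true)
lift = model_acc - baseline_acc  # how much the model adds over guessing "on time"

print(model_acc, baseline_acc, lift)
```

If the lift is close to zero, the features probably do not carry enough signal yet, and regularization or stacking is unlikely to change that.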
Let's get into our diagnostics. How will you measure the accuracy of your model? We cover diagnostics a lot on many episodes, although there's still a bunch I have to get to. So, I'm not going to get into those here, but this is a place in the conversation where I ask people to tell me what diagnostics they'd use, why, and often whether they can define it for me. And this is a pet peeve of mine. I might be alone in this. Maybe some people agree with me, but I hate when there's interviewers who are essentially just looking if you've memorized stuff. I just want to know that you intuitively know why you pick it. Do you have to recall the exact formula for computing the F1 score? In my opinion, no. Tell me what it is, why you chose it, and the intuition of how it's going to work. 2 seconds on Google will give you the formula. I don't know if that's an older generation thing or not. I feel like the memorization side of stuff is dying. I hope so, cuz for me, I don't care how people get things solved. I'm going to assume that you have access to an internet connection at all times and can Google anything. And just consider that an extension of the skill set you bring to the table. I don't know that everyone agrees with me, but that's my little soapbox. And then as I'm winding this up, I'll often ask, "Well, what sort of accuracy do you think you'd achieve here?" And this throws a lot of people off. I know people hate this part of the question because no one wants to answer it. And I can see why. You don't have any data. You don't have this fictitious model. How can you possibly know how accurate it'll be? Well, I'm looking for more of a philosophical answer. And generally, I'll give some breadcrumbs to get people to arrive at that realization. What I really want to ask, but not use these words, is how solvable of a problem is this? Are you going to get to a 99.98% accurate system? No way. Obviously not. Right? Think about it. Just take your intuition.
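For reference, the F1 score mentioned above is the harmonic mean of precision and recall. The sketch below computes all three from scratch, assuming 1 marks a late return; the sample labels are invented for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive class.

    Each metric falls back to 0.0 when its denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels: 1 = returned late, 0 = returned on time.
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 0, 1, 0, 0, 0, 1, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

The intuition is that F1 only rewards a model that both catches the late returns (recall) and avoids flagging on-time patrons (precision), which matters when late returns are the minority class and plain accuracy can look good by ignoring them.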
All the model you're going to build has to go on is the data the library already had. That data does not contain all the factors at play in why someone would be late returning a book. What if someone suddenly takes ill, goes into the hospital, and the book is the last thing of their concern. They return it 6 months late. Nothing in the data set you had could have predicted that illness. Even if somehow you had their health records, you can't necessarily predict with certainty that they'll get sick with some virus they'll encounter 4 days after they get the book. What are the core reasons why people would be late or steal a book? Surely the data set has some predictive power about that, but it doesn't contain everything we'd want to observe to perfectly know the answer. You know, if someone says, "Oh, I work on engine telemetry and we can predict a spark plug failure with 99.995% accuracy an hour before it would happen." Yeah, I believe that because you presumably have highly refined, well-engineered sensors with good precision and accuracy. And not just that, but a published description from the manufacturer of just how accurate they are. That, you know, it measures the temperature to this accuracy within this amount of time and how it works on other weird conditions and changing temperatures and things. And that's a physical system that if I were to assume that heat is the only thing that could cause a spark plug to fail, well, no, I don't have to assume that. I can probably figure out all of the physical reasons why it would fail. If I can measure all the things that correlate with those physical reasons, then I'm only bounded by the accuracy of my sensors. In other words, how good can the manufacturers provide that sensor data to me? That's a different sort of a class of machine learning. Here we've got these loose correlative implications. And at some point people building these systems have to ask themselves what's the benefit of further investment.
All right, we built this model in a month. It's 65% accurate, which sounds low, but that might be useful for the library. Can we get it to 70? I don't know, maybe, probably. Can we get it to 95? No way. All right. So what's our asymptotic ceiling here? What other things might we need to observe? Sometimes at this point folk wisdom comes out for good or for bad. People will say, "Well, I've noticed that left-handed people are so much more responsible than right-handed people. All the southpaws are very honorable people returning their books on time every time." Well, you don't know if they're left or right-handed in the data set, presumably. It's a little bit of a weird question for the library to ask. All right. So, can we get that data? If we have reason to believe that that might be true, how do we get that data into our data set? And that's our iterative process. So, by this point, the conversation's gone, usually in one of about 20 ways, and we just end up discussing whatever aspect of the problem we've zoned in on. I always enjoyed hearing different people's takes on this problem. And it's been useful for me because I find that, you know, in about 15 minutes, you can have an efficient discussion that gives me a lot of depth and insight into, as I said, where someone is on their professional journey. But yeah, I like asking this off the cuff. So, I guess it's got to be retired. But, I don't know. Hopefully, you listeners got some value out of it. So, as we're wrapping up here and closing out this year of Data Skeptic, don't unplug your earbuds just yet. Stay with me for a minute and please humor me as I get all NPR for a second. I'm going to hit you hard for the next couple weeks with this stuff and then it'll mellow out. But here's the new deal. I'm at capacity on what I could do for this show. But Data Skeptic as an entity has a lot of momentum behind it. We recently launched a blog and we're ramping that up at dataskeptic.com.
You'll start to see some guest posters coming on in, I think, the end of January. We want to do a lot more projects this year, like our OpenHouse project that you're going to hear about in one of the early episodes in 2017, or things like the SNL impact analysis we did recently. Data Skeptic comes out every Friday morning like clockwork. Always has, always will. So, don't get me wrong, the show isn't going anywhere, but we're getting to the point where we do need your support. If you'd like Data Skeptic to continue to grow, both in the quality of our programming and into new media like videos, please consider supporting us by becoming a member at dataskeptic.com. Our transcripts have lots of links to relevant content, but those transcripts cost us a lot of money. We want to continue to do that, not just so that the hearing impaired community can enjoy the content, but because it also makes the show citable on places like Wikipedia. I mean, we get some deep experts to share original thoughts right here that you can't find anywhere else. If 1% of you contribute to the show, that would make a massive impact. So, let me ask you this. Do you enjoy Data Skeptic every week or, some of you, every other week? That's fine. Have we given you something? Maybe you've learned a different way of explaining a concept or an introduction to a cool topic you hadn't previously been exposed to. Has Data Skeptic helped your career? Bro, come on now. Break us off a piece of that. We want to go bigger in 2017. We learned about open data and potholes. In March, we scratched the surface of auditing algorithms and data science and policing. We heard from toolmakers like Hadley Wickham and Wes McKinney about Feather, Jon Morra about Aloha, Marco Tulio Ribeiro about LIME, Michael Cuthbert about music21, and others. Bike sharing optimization, unstructured data in finance, stealing machine learning algorithms from the cloud.
I called out a company making dubious claims that they could detect terrorists using facial recognition. Oh man, this is a really good year actually. And the docket we have for early next year is on point as well. 2017 can be even better with your support. So, please consider heading over to the brand new dataskeptic.com and click on membership. For as little as $1 an episode, you can help Data Skeptic to remain your source for being skeptical both of and with data.

Original Description

We close out 2016 with a discussion of a basic interview question which might get asked when applying for a data science job. Specifically, how a library might build a model to predict if a book will be returned late or not.  

The Library Problem is an open-ended data science interview question: how might a library build a model to predict whether a book will be returned late? The episode walks through feature engineering, the choice of algorithm, diagnostics, and how accurate such a model can realistically be.

Key Takeaways

Represent features as data points
Aggregate features to improve prediction accuracy
Use dimensionality reduction to handle sparse data
Choose a binary classification algorithm such as logistic regression, random forest, or XGBoost
Ask questions to clarify the problem
Identify the core reasons for a problem
Iterate on the model to refine its accuracy

💡 The no free lunch theorem states that there is no single best algorithm for all problems, and the choice of algorithm depends on the problem and the data.
