Allstate Purchase Prediction Challenge on Kaggle

Data School · Beginner · 📐 ML Fundamentals · 12y ago


Key Takeaways

The Allstate Purchase Prediction Challenge on Kaggle is a machine learning competition in which participants predict which car insurance options a customer will buy. Submissions are scored on prediction accuracy, and a prediction counts as correct only if all seven options match, out of more than 2,000 possible combinations. The talk walks through the approaches the speaker tried, including logistic regression, random forests, and stacked models, and explains why a simple plan-replacement strategy ended up working better than any of them.

Full Transcript

Hello, my name is Kevin Markham, and I'm going to be talking about the Allstate Purchase Prediction Challenge. Here's our agenda for the talk. We're going to start by talking about the goal of the Allstate competition, why this is a difficult goal to achieve, what data we have access to, and what we can learn from that data. Can machine learning help us? What approaches worked and what did not? What did I learn? And then of course, did I profit?
So what was the goal of the Allstate competition? It was a machine learning competition run by Kaggle. And machine learning is basically the process by which computers learn patterns from data. The competition was sponsored by Allstate, the insurance company. And the goal was to predict which car insurance options a customer will buy. So let's get some context around this problem and then we'll move on to an example. There are seven car insurance options, each with two to four possible values. Those values are identified only by number. A quote consists of a single combination of those seven options, and customers can review one or more quotes before actually making their purchase.
Here's an example. This is one customer's car insurance quote history. You can see there are seven quotes, and for each quote there are seven different options, A through G. And you can see the values they chose for those options. And then the question is, what did they actually purchase? Well, as it turns out, they purchased options identical to the last quote they looked at. So that might be a useful pattern to watch out for. Here's another example. In this case, the customer looks at six different quotes, but for five of them they're looking at identical options. It seems almost obvious what they're going to purchase, except that they purchase something that they never looked at during the quoting process. Perhaps the prediction process isn't going to be so easy after all.
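The size of the prediction space follows from the option structure just described. Here is a minimal sketch, assuming per-option value counts of 3, 2, 4, 3, 2, 4 and 4 for A through G. The talk only says each option has two to four values, so these specific counts are an assumption used for illustration.

```python
from math import prod

# Assumed number of possible values per option. The talk only says that
# each option takes two to four values; these exact counts are illustrative.
VALUES_PER_OPTION = {"A": 3, "B": 2, "C": 4, "D": 3, "E": 2, "F": 4, "G": 4}


def count_combinations(values_per_option):
    """Number of distinct plans = product of the per-option value counts."""
    return prod(values_per_option.values())


n_plans = count_combinations(VALUES_PER_OPTION)
print(n_plans)  # 2304
```

Under these assumed counts there are 2,304 distinct plans, which is in line with the "over 2,000 possible combinations" figure quoted in the talk.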
Let's talk about the competition and how it actually works. We have access to some training data, which is 97,000 customers and their complete quote history, including what they purchased. And then we've got the test data, which is 55,000 customers with a partial quote history, and our goal is to predict what they purchased. The evaluation metric for the competition is prediction accuracy.
So why is this difficult? Well, for one, there are over 2,000 possible combinations of options. And importantly, your prediction is only counted as correct if you get all seven options right. So there's no partial credit and no feedback given on which options you get wrong. In addition, the options are not even identified as to their meaning. All you know is that they are A through G. You don't know what those options signify.
So let's first start with a naive approach to predicting. For every customer, let's simply predict that they will purchase the last set of options they were quoted. The good news is that this works pretty well. You get an accuracy of 0.53793. The bad news, however, is that everyone figured out this strategy, and indeed 46% of competitors have that identical score for that exact reason.
But never fear, because it's data to the rescue. I hadn't yet mentioned it, but we have access to a lot more data than just the quote history. We have a variety of customer data. We have data about the car they're trying to insure. And we have additional data about each quote, including the day and time of the quote and how much it cost. So, let's start looking through the data and see what we can learn from it. I mentioned before that there are over 2,000 possible option combinations, but perhaps only a small subset of those are ever actually purchased. That would reduce the problem space significantly. Unfortunately, that's not the case. Most of the possible option combinations do appear in either the training set or the test set.
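The naive "last quote" strategy and the all-or-nothing scoring rule can be sketched in a few lines. This is an illustration only: encoding a plan as a seven-character string (one character per option A through G) is an assumption, and the customer data is invented, not taken from the competition files.

```python
def last_quote_baseline(quote_history):
    """Predict that the customer buys the last plan they were quoted.

    quote_history: list of plan strings such as "1021002", one character
    per option A-G, ordered from first quote to last (assumed encoding).
    """
    return quote_history[-1]


def exact_match_accuracy(predicted, purchased):
    """All-or-nothing accuracy: a plan scores only if all seven options match."""
    hits = sum(p == t for p, t in zip(predicted, purchased))
    return hits / len(purchased)


# Toy data, invented for illustration.
histories = {
    "cust1": ["0011002", "1011002", "1021002"],
    "cust2": ["1133123", "1133123"],
    "cust3": ["0021011", "0021011", "0021011"],
}
purchases = {"cust1": "1021002", "cust2": "1133123", "cust3": "0021012"}

ids = sorted(histories)
preds = [last_quote_baseline(histories[c]) for c in ids]
truth = [purchases[c] for c in ids]

# cust3 differs from the last quote on option G only, yet scores zero.
print(exact_match_accuracy(preds, truth))  # 2 of 3 correct
```

Note that the third customer gets six of seven options right and still counts as a miss, which is the "no partial credit" rule in action.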
Now, I'm going to briefly walk you through some visualizations I made of the data. This plot simply shows that the more quotes you have for a customer, the better that naive strategy will work, the naive strategy being simply predicting that they'll purchase the last thing that they looked at. This plot shows that the test set has been significantly truncated in comparison to the training set. In other words, the test set has a lot less quote data for each customer on average, making this a harder problem. This plot shows that customer behavior can vary based upon the time of day that the customers are viewing quotes. And this is an interesting plot that shows that what a customer chooses for one option might affect what they choose for another option. So for example, a customer that chooses either C=3 or C=4 will almost always choose D=3.
Let's just try predicting what a customer will purchase based upon these option interactions. We start with that naive approach to make what I call the baseline predictions, then we create a list of rules about pairs of options, and then we use these rules to fix the baseline predictions. Unfortunately, this approach worked worse than the naive approach. Why didn't this approach work? Well, these rules are based on strong patterns in the data, but patterns are of course not always correct. And more importantly, you don't actually know how many of the seven options need to be changed from the baseline in order to make them correct.
The key insight here is that there's a huge risk when changing any baseline prediction. There's more than a 50% chance you will break a prediction that is already correct. And you need to balance that against the tiny chance that you will take an incorrect prediction and make it the correct one. In other words, it's very important to only change a baseline prediction if you're sure it's wrong.
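The risk argument above can be made concrete with a little arithmetic. This is a rough sketch, assuming the baseline is right about 53.8% of the time (the naive score quoted earlier) and treating `q`, the chance that a replacement plan is right when the baseline was wrong, as a hypothetical parameter. Neither function comes from the talk.

```python
def expected_gain_per_change(p_baseline_correct, q_fix_correct):
    """Expected change in the number of correct predictions from altering one.

    If the baseline was correct, any change breaks it (-1).
    If the baseline was wrong, the change repairs it with probability q (+1).
    """
    p_wrong = 1.0 - p_baseline_correct
    return p_wrong * q_fix_correct - p_baseline_correct


def min_precision_to_break_even(q_fix_correct):
    """Smallest P(baseline wrong | we decided to change it) for a gain >= 0.

    Solves  p_wrong * q - (1 - p_wrong) >= 0  for p_wrong.
    """
    return 1.0 / (1.0 + q_fix_correct)


# Changing a randomly chosen prediction loses even with a perfect fix (q = 1).
print(expected_gain_per_change(0.53793, 1.0) < 0)  # True

# If the fix is right only half the time, we must be right about who
# changes at least two times out of three just to break even.
print(min_precision_to_break_even(0.5))
```

In this simplified model, the break-even point depends entirely on how sure you are that the baseline is wrong, which is why the talk goes on to focus on identifying the customers who will change.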
So I decided to use a stacked model approach, meaning that first we would predict which customers are likely to change options after their final quote, and then second we would fix the baseline predictions only for those customers. Step one then is to predict who will change. I modeled this with logistic regression and random forests, which are two common machine learning approaches. And then you evaluate a model using an ROC curve. What you see on the left is a reference ROC curve that explains that the yellow line means a model is excellent, the pink line means that a model is performing well, and the blue line means that a model is worthless. And as you can see, my model was basically worthless.
No worries though, because we can do some feature engineering and perhaps that will help improve the model. This simply means that we're going to create some new features by transforming or combining the existing features. My theory was that these might be less noisy than the raw features and less likely to overfit the training data. I'm not going to walk through the new features in detail, but I'll just show some examples. I created new features called family and time of day. This is an interesting feature I created called state group, in which I clustered states based upon similar customer behavior in those states. So for example, North Dakota and South Dakota are in the same cluster because customers were doing similar things in those states. Two more features I created were stability and plan frequency. These were basically my way of summarizing the quote history for a given customer in a single number.
So we're ready to try step one again. We're going to redo our model, except with my newly engineered features, and of course again evaluate using the ROC curve. On the left is my previous ROC curve and on the right is my new ROC curve. You might not be able to tell, but it's actually a tiny bit better than the one on the left, but it's still relatively worthless.
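An ROC curve is often summarized by the area under it (AUC), which equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. Here is a small from-scratch sketch of that calculation on invented data. It is not the speaker's evaluation code, and it assumes the "worthless" reference curve in the slide corresponds to the usual chance-level diagonal, where AUC is about 0.5.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. This is O(n_pos * n_neg), which is fine for a
    small example but too slow for large datasets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels: 1 = customer changed options after the final quote.
labels = [1, 0, 1, 0, 1, 0]
print(roc_auc(labels, [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]))  # 1.0, perfect ranking
print(roc_auc(labels, [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]))  # 0.5, chance level
```

In practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead of a hand-written loop.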
However, all is not lost, because I had a new insight. When predicting which customers will change, it's actually much more important to optimize for precision than accuracy. In other words, we need to minimize the false positives by setting a high probability threshold. So, for example, in the test set, about 25,000 customers will change options after their final quote. We don't actually need to try to find all 25,000 of those customers. Instead, we need to find perhaps a hundred of those customers that we are sure will change and then fix their baseline predictions. Now, we're going to optimize for precision. I created a cross-validation framework to predict the test set precision of my model. I changed the probability threshold from 0.5 to 0.85, and I was able to validate that this approach was indeed working.
The next step is to predict new plans for the customers who I predict will change. There are two options for how to do this. I could build one model to predict the entire combination of seven options at once, or I could build seven different models. Each one would predict an individual option, and then I would combine the results at the end. I ended up choosing the second option, and I used random forests and a single hidden layer neural network. Unfortunately, I had poor prediction results. In order for this seven-model approach to be accurate enough, each model needs to be performing at about 90% accuracy. Instead, my models were performing with 60 to 80% accuracy. And thus, when you combined the predictions for the seven different options, it rarely predicted a completely correct combination of options.
But I did have a backup plan. Instead of machine learning, I decided to do some human learning and try to outthink the computer. So, I located nine customers in the test set that had a very, very high probability of changing from their last quote.
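Two pieces of arithmetic from this part of the talk can be checked with a short sketch: how raising the probability cutoff trades coverage for precision, and how per-option accuracy compounds across seven separate models. The data is invented, and the compounding calculation assumes the seven models make independent errors, which is a simplification rather than something the talk states.

```python
def precision_at_threshold(labels, probs, threshold):
    """Precision of 'will change' predictions made at a probability cutoff."""
    flagged = [y for y, p in zip(labels, probs) if p >= threshold]
    if not flagged:
        return None  # nothing flagged, so precision is undefined
    return sum(flagged) / len(flagged)


def joint_accuracy(per_option_accuracy, n_options=7):
    """Chance all options are right, assuming independent errors per option."""
    return per_option_accuracy ** n_options


# Toy data, invented: 1 = customer changed after the final quote.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.95, 0.60, 0.88, 0.55, 0.52, 0.30, 0.70, 0.86]

print(precision_at_threshold(labels, probs, 0.5))   # 4 of 7 flagged are right
print(precision_at_threshold(labels, probs, 0.85))  # 2 of 3 flagged are right

for acc in (0.6, 0.8, 0.9):
    print(acc, round(joint_accuracy(acc), 3))  # 0.028, 0.21, 0.478
```

The stricter cutoff flags fewer customers but is right about a larger share of them. The second loop shows why per-option accuracy of 60 to 80% rarely yields a fully correct seven-option plan under the independence assumption.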
I went through my list of rules about unlikely option combinations and, one by one, went through each customer looking for option combinations that seemed unlikely and changing them. I tweaked my combinations by comparing them against my random forest model. This was a very time-intensive method, but I figured I could convert it into a pure machine learning model if it worked. Unfortunately, it was a huge waste of time because it did not work.
Time was running out in the competition, but I came up with a new strategy. There was a tip in the Kaggle forums that recommended that you should locate plans that were rarely purchased and then replace those plans with more likely alternatives. In my estimation, these are probably combinations of options that simply don't make sense to most people. Note that this approach completely ignores all customer data and focuses purely on the plans. The tasks from here are first to determine which plans are unlikely. So I calculated a view count and a purchase likelihood for every plan. And then I needed to determine the best replacement plan for each unlikely plan. So I was just tallying which plans were actually purchased by those who viewed them. I calculated a metric called replacement plan commonality. And then for all of these values, I simply set thresholds to determine how many plans would be replaced and what plans they would be replaced by.
And as it turned out, it worked. It improved upon the baseline approach. And as you can see, I moved up 365 places on the leaderboard just by improving on my best score by 0.0024. So, I went to work tuning my three threshold values and submitted many different combinations to Kaggle. My best submission beat the baseline by 0.06%. And that's not a typo. It's 0.06%, not 6%. And even the top competitor was only beating the baseline by 0.78%, proving that this is a very challenging problem. Here are the details of my best submission.
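The plan-replacement strategy can be sketched as a tally over (last quoted plan, purchased plan) pairs. The metric definitions below (views, purchase rate, commonality), the three threshold values, and the toy data are all illustrative assumptions. The talk names the metrics but does not define them, so this is not the speaker's actual code.

```python
from collections import Counter, defaultdict


def build_replacements(pairs, min_views, max_purchase_rate, min_commonality):
    """Map rarely purchased plans to a more likely alternative.

    pairs: iterable of (last_quoted_plan, purchased_plan) from training data.
    All metric definitions and thresholds here are assumptions.
    """
    bought_by_viewed = defaultdict(Counter)
    for viewed, bought in pairs:
        bought_by_viewed[viewed][bought] += 1

    replacements = {}
    for viewed, bought_counts in bought_by_viewed.items():
        views = sum(bought_counts.values())
        purchase_rate = bought_counts[viewed] / views
        # Plans that were bought instead of the one last viewed.
        others = [(n, plan) for plan, n in bought_counts.items() if plan != viewed]
        if not others:
            continue
        n_best, best_plan = max(others)
        commonality = n_best / views
        if (views >= min_views and purchase_rate <= max_purchase_rate
                and commonality >= min_commonality):
            replacements[viewed] = best_plan
    return replacements


def apply_replacements(baseline_predictions, replacements):
    """Swap a baseline plan only if it is on the replacement list."""
    return [replacements.get(plan, plan) for plan in baseline_predictions]


# Toy training pairs, invented for illustration.
pairs = (
    [("1021002", "1021003")] * 6 + [("1021002", "1021002")] * 2
    + [("1021002", "0021003")] * 2
    + [("1133123", "1133123")] * 9 + [("1133123", "1133122")]
)
rules = build_replacements(pairs, min_views=5, max_purchase_rate=0.3,
                           min_commonality=0.5)
print(rules)  # {'1021002': '1021003'}
print(apply_replacements(["1021002", "1133123", "0000000"], rules))
```

Only the plan that is rarely bought as quoted and has a dominant alternative gets replaced. Every other baseline prediction passes through unchanged, which keeps the number of risky edits small.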
So all we do is start with the naive approach to make our baseline predictions, and then if the predicted plan is any one of those five plans on the left, we simply change it to the plan on the right. Note that this approach completely ignores all the work I had done up to this point, in that it doesn't use any of my models or any of the features I engineered. I also wanted to see if I could improve this approach. So, I tried stacking this approach with one of my existing models, but that didn't succeed in improving my test set accuracy. I also came up with some other ideas, but ultimately ran out of time to actually try them. In my estimation, the top competitors are likely using an ensemble of models that incorporates this approach somehow.
So, what did I learn from this competition? Number one, early on in the competition, try many different approaches, because you need to give yourself time to iterate upon a proper approach rather than going down the rabbit hole of an approach that doesn't actually work. Number two, smarter strategies trump more modeling and more data. Just because you know how to build a model and you have access to a lot of data doesn't actually mean you need to use those things. In the end, the strategy that worked best for me had nothing to do with any modeling or most of the data I had access to. Number three, real-world data is hard to work with. It's much harder to work with than the kind of data you're usually given in a machine learning textbook. For instance, it's not necessarily obvious what to do with all of this quote data that you're given, or how to learn something from each and every quote. It's a very challenging problem. Number four, algorithms and processes that allow for rapid iteration are priceless. So even though there are algorithms like random forests that tend to perform well, they also take a lot longer to run than logistic regression.
So I ended up mostly using logistic regression because it allowed me to rapidly iterate through different models. And number five, learn from others around you. There are so many different approaches you can try that it's helpful to see what others around you are doing, so that you can use that knowledge to help focus your efforts.
Thank you so much for your time. I hope you enjoyed the presentation. I've got here a link to my GitHub repository, where I have a paper I wrote about this competition as well as all the code I used. It also includes a link to the Kaggle forums, where many different competitors are discussing their own approaches to how they solved this problem.

Original Description

This is a presentation about my participation in Kaggle's "Allstate Purchase Prediction Challenge." Kaggle is a website that hosts machine learning competitions.
RESOURCES:
- Project paper, code, and presentation slides: https://github.com/justmarkham/kaggle-allstate
- Competition website: http://www.kaggle.com/c/allstate-purchase-prediction-challenge
- Blog post: https://www.dataschool.io/kaggle-allstate-purchase-prediction-challenge/
LET'S CONNECT!
- Newsletter: http://www.dataschool.io/subscribe/
- Twitter: https://twitter.com/justmarkham
- Facebook: https://www.facebook.com/DataScienceSchool/
- LinkedIn: https://www.linkedin.com/in/justmarkham/

Key Takeaways

Explore the competition dataset and understand the problem
Build a baseline by predicting the last plan each customer was quoted
Model which customers will change plans using logistic regression and random forests
Evaluate model performance using cross-validation and ROC curves
Create new features through feature engineering
Optimize for precision instead of accuracy when deciding which predictions to change
Stack models so that baseline predictions are changed only for customers predicted to switch

💡 The competition highlights the importance of handling complex option combinations and optimizing for precision instead of accuracy in machine learning models.
