Causal Discovery | Inferring causality from observational data
Key Takeaways
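The video covers causal discovery, which aims to infer causal structure from observational data. It frames causal discovery as an inverse problem and introduces three tricks: conditional independence testing (the PC algorithm), greedy search of the DAG space (the Greedy Equivalence Search algorithm), and exploiting asymmetries. It ends with a Python example that applies the PC, Greedy Equivalence Search, and LiNGAM algorithms to census data.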
Full Transcript
Hey folks, welcome back. This is the final video in the three-part series on causality. In this video I'll be talking about causal discovery, which aims at inferring causal structure from data. I'll start by introducing what causal discovery is, sketching some big ideas, and then finishing with a concrete example with code in Python. So with that, let's get into the video.

In the previous video I was talking about causal inference, which aims at answering questions about cause and effect. We talked about a lot of great things: we talked about the do-operator, which simulates interventions, we talked about confounding, we talked about estimating causal effects, and all these things were great. However, there was one key assumption that was necessary in order to do causal inference, which was a causal model. And obviously, a lot of times in the real world we don't have a causal model in hand when we're starting our analysis, and it's not always clear which variables cause which. Causal discovery is one thing that might help with obtaining a causal model. The goal of causal discovery is to find causal structure in data, so basically, given data, inferring the underlying causal model.

Causal discovery is an example of a so-called inverse problem, and inverse problems can be understood in contrast to forward problems. For example, imagine you have an ice cube sitting on your kitchen counter. You know the shape of the ice cube, you know the volume, and if you were to let that ice cube sit there for a few hours, you could probably predict with some reasonable degree of accuracy what the resulting puddle of water would look like. The inverse problem is like the opposite of this. In other words, the inverse problem would be, given a puddle of water on the kitchen counter, predicting the shape of the ice cube that made that puddle. And clearly this is a hard problem, because there are several different shapes of ice that could create the same puddle of water. Connecting this to causal discovery, the shape of the
ice cube is like our causal model, and the puddle of water is like the statistics that we observe in our data. So following this analogy, there are several causal models that could potentially generate the same statistics we observe in a given data set. The approach to solving inverse problems is to make assumptions: basically, we narrow down the possible number of solutions through assumptions. And although assumptions help, they often do not fully solve the problem. This is where we need to use some tricks to go a little further. Here I'll talk about three different tricks for causal discovery.

The first trick is conditional independence testing. I start here with a definition of statistical independence, which is shown here. In other words, two variables X and Y are said to be independent if their joint distribution is equal to the product of their individual distributions. From this we can get a definition of conditional independence, which is basically the same thing, however now we look at the distributions of each variable when conditioned on a particular variable, say Z. We can use this idea of conditional independence testing to do causal discovery, and this is actually the main idea behind one of the first causal discovery algorithms, called the PC algorithm, which is named after its authors, Clark Glymour and Peter Spirtes. I probably butchered that, so I apologize, but there's a reference to a review paper by them at the bottom here.

So I'll just briefly go through the main idea of the PC algorithm. More details can be found in the blog linked in the description. The first step is to form a fully connected undirected graph, so we have a node for each variable in our data set and we connect undirected edges between each of these nodes. In step two we do pairwise independence tests, so we do an independence test between every possible pair of variables, and if two variables are independent we delete the undirected edge between them. The third step is conditional independence tests, so basically
we do the same thing, however we pick a variable to condition on. Then if two variables are found to be conditionally independent, we delete the edge between them and we add that conditioned node to the separation set, and we continue these conditional independence tests until there are no more candidates for conditional independence testing. Then in step four we orient colliders. So if we have three variables, say i, j, and k, we form a collider out of them, meaning we make directed edges pointing from i to k and from j to k, given k is not in the separation set of i and j. Then in step five we add more directed edges to the graph following two constraints, namely we do not create any new v-structures in our graph, nor do we create any directed cycles. And hopefully, after all that, we output a directed acyclic graph which represents the causal connections of our system. Again, more details are on the blog, and the two references at the bottom here have a great description of the PC algorithm.

So trick two is a greedy search of the DAG space. There are three key concepts here. First is a DAG, which should be familiar since they've been discussed in the previous two videos. Next is a DAG space, or in other words, the space of all possible DAGs. For example, consider the space of DAGs with two nodes and one edge, which is shown here. There are only two possibilities: X could point to Y, or Y could point to X. Then finally we have the notion of a greedy search, which is a widely used idea in optimization. In short, a greedy search is an optimization strategy that picks what's best in the short run as opposed to the long run, and this is usually done using a heuristic or rule of thumb. For example, suppose you're trying to get out of a forest. You may think: I'm trapped in a forest, forests have trees, so to get out of the forest I should go where there aren't any trees. In other words, every step you take should be in the direction with the least number of trees. So you repeat this strategy and go all the way along this
black line until you finally get out of the forest, which we can call the greedy path because it is the result of a greedy search. However, if at the start you were to go in the exact opposite direction of the greedy path, you would make it back to civilization much faster. So you might say, why would we ever want to use a greedy search? From the look of it, they just seem to give suboptimal solutions. Well, the problem is that a lot of the time computing the optimal solution is intractable, meaning if you ran an algorithm that tried out all possible solutions, all possible paths out of the forest, and compared them to each other, you would be waiting a long time for it to run, and maybe your grandkids would see the solution in their lifetime.

And this is the problem we face in causal discovery. We want to find the optimal DAG that best explains a given data set. The problem is, if we tried an exhaustive search, we have to deal with the fact that the number of possible DAGs is super-exponential in the number of nodes in a graph. In other words, if we have just three variables, the number of possible DAGs is 25. If we have six nodes, we're already over three million possibilities, and if we have a measly ten variables, or ten nodes in our graph, the number of possible DAGs is on the order of 10 to the power of 18.
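As a rough check on those numbers, here is a minimal sketch that counts labeled DAGs with Robinson's recurrence. The recurrence and the function name `count_dags` are not from the video; they are brought in only to illustrate how quickly the DAG space grows.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes, via Robinson's recurrence.

    a(0) = 1
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k*(n-k)) * a(n-k)
    """
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


if __name__ == "__main__":
    # The node counts quoted in the transcript: 3, 6, and 10.
    for n in (3, 6, 10):
        print(f"{n} nodes: {count_dags(n):,} possible DAGs")
```

For three nodes this gives 25 and for six nodes 3,781,503, which matches the figures quoted above. An exhaustive search would have to score every one of these graphs, which is why a greedy search is attractive.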
So even though greedy searches do not guarantee the optimal solution, at least they give us a solution in a reasonable amount of time. A causal discovery method that uses a greedy search is the so-called Greedy Equivalence Search algorithm. The basic idea of this algorithm is you start with a completely unconnected DAG and you iteratively add edges to this unconnected graph such that you maximize a score value. So in other words, you start with a set of nodes that correspond to each of your variables but no edges between them, then you add edges one by one according to some score. So the question is, what is the score that I'm talking about? Basically, this quantifies how good your DAG is, or how well the DAG explains the data. There are a few options for defining this score. One is the so-called Bayesian information criterion, and source number one, referenced at the bottom here, has a brief discussion for anyone that is interested. Then you can repeat this process until you reach some stopping criterion, which could be that some number of edges have been added, or the score stops increasing, or whatever that may be.

Okay, so the final trick is exploiting asymmetry. As I discussed in the first video, asymmetry is a fundamental property of this causality framework, so it's natural to think maybe we can leverage asymmetry to help us find good causal models from data. There are three flavors of asymmetry that I've come across, and I'll say as a disclaimer that these aren't any kind of standard classification for these things. These are just some labels I'm putting on some themes that I have gleaned from looking at this stuff. So the first flavor is what I call time asymmetry, which is based on the idea that causes precede effects. This is what is used in Granger causality, which is a method for quantifying an asymmetric relationship between two variables based on prediction. More information about Granger causality can be found in this first reference here, and there's a lot of
stuff out there on Granger causality. You can just do a simple Google search and you'll probably find a bunch of stuff. The second asymmetry is what I call complexity asymmetry, which is basically the Occam's razor principle that simpler models are better. So going back to our ice cube example from earlier, following this principle we would say the ice cube that's actually a cube is preferred over the more complicated ice because it is simpler. And finally, the third flavor is what I call functional asymmetry, where better functional fits are better candidates for a causal model. One method that uses this is the non-linear additive noise model. The way this works is, suppose we start with two statistically dependent variables X and Y. We then model Y in terms of a non-linear function of X. Then we compute a noise term by taking the difference between Y and this non-linear functional fit, and then finally we test whether the noise term N is independent of X. If it is, we accept the model and say X causes Y, and if not, we reject it. Then we can do the same thing in the opposite direction, where we model X in terms of a non-linear function of Y and repeat the same procedure. More details on this method can be found in reference number three, and generally all of these are great resources if you're trying to learn more about causal discovery.

Okay, so wrapping up these tricks, we have a trick-based taxonomy. It's important to note that these tricks are not mutually exclusive in all cases. There are indeed algorithms that will mix and match different ones for causal discovery, as shown in the bottom row of the table here. As a bit of commentary, causal discovery seems, to me at least, to be a relatively young field, so there still has not emerged a single or small set of causal discovery algorithms that beat out all others in all situations. I'll also say that this is by no means an exhaustive list of causal discovery techniques. However, this is probably a good start for anyone
trying to get into causal discovery, and the references given at the bottom here can get the ball rolling for you.

Okay, so I will conclude with a concrete example, like in the previous video. We're going to be using the same census data set as before, but instead of having just three variables of age, education, and wealth, we're going to include more variables, and instead of using the Microsoft DoWhy library for causal inference, we're going to be using the Causal Discovery Toolbox. So again, the first step is importing libraries and loading data. Then, for a lot of these causal discovery algorithms, it helps to start with a so-called graph skeleton. This is like step two that we saw with the PC algorithm, where we do the pairwise independence testing and we have bi-directed edges, or undirected edges, between variables that are statistically dependent. Then you can visualize the network pretty easily using NetworkX.

The first causal discovery algorithm that I use here in this example is the PC algorithm. Again, we just do that in two lines and it spits out this causal graph. The graph is somewhat reasonable. It's not perfect, but we can see that we have "has graduate degree", which is like our education variable, causing the variable "greater than 50k", which is our income variable, and then we also have age causing our income variable, which is what we expected. But what was not expected is that our education variable, "has graduate degree", is pointing toward age. So this is saying whether or not someone has a graduate degree has a causal impact on their age, which is not true. If you give someone a graduate degree, it's not going to have any effect on their age. Another interesting thing is we have several variables having a causal effect on the number of hours that someone works in a week. So whether or not they have a graduate degree has a causal effect on the number of hours they work, their age has an effect, and whether or not they're female. So there are basically two options, male or female, in
this data set. And then the edge to the ethnicity information, captured by "is white", is bi-directional, so the PC algorithm wasn't able to break that symmetry. But what's interesting is that hours per week causes a single variable, which is "in relationship". So what this is saying is the number of hours you work per week has a causal effect on whether you're in a relationship or not. We could look at this all day and kind of craft whatever stories we want in our minds, but this should definitely be taken with a grain of salt.

The next algorithm that we try out is the Greedy Equivalence Search algorithm, which uses trick number two, greedy search of the DAG space, and this gives us a causal graph that is somewhat similar to what the PC algorithm gave us. Notably, for that edge between hours per week and "is white", the symmetry was broken, so it's not a bi-directed edge. And then finally we use the LiNGAM algorithm, and this one doesn't really give us something sensible. It's basically everything causing "has graduate degree": whether you make more than 50,000 impacts your graduate degree, how many hours per week you work has an impact on your graduate degree, and these edges seem backwards. This algorithm doesn't seem to do a great job, and that's because it's assuming linear relationships between variables, and since most of these variables are boolean, that's not something that necessarily makes sense.

Code can be found at the GitHub linked at the bottom here. I put the link in the description. Feel free to take this data and run with it, and feel free to leave a comment. I'd be interested to hear the results of your analysis. So that concludes our series on causality. We started in the first video introducing this new science of cause and effect, in the second video we talked about causal inference, and finally in this video we concluded with causal discovery. If you enjoyed the series, please consider liking, subscribing, sharing, and commenting your thoughts. If you're interested in learning more, check out the blog, check out
the GitHub to get your hands on the example code discussed in this video. As always, thanks for watching.
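To make trick one from the transcript a little more concrete: X and Y are independent when P(X, Y) = P(X) P(Y), and conditionally independent given Z when P(X, Y | Z) = P(X | Z) P(Y | Z). The sketch below simulates a chain X → Z → Y and uses partial correlation as a stand-in conditional independence test. This is an illustrative assumption, not necessarily the test used by the Causal Discovery Toolbox, and partial correlation only detects linear dependence. The helper names (`pearson`, `residuals`, `partial_corr`) and the simulated coefficients are hypothetical.

```python
import random
from math import sqrt


def pearson(a, b):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)


def residuals(target, predictor):
    """Residuals of a least-squares straight-line fit of target on predictor."""
    n = len(target)
    mean_t, mean_p = sum(target) / n, sum(predictor) / n
    slope = sum(
        (p - mean_p) * (t - mean_t) for p, t in zip(predictor, target)
    ) / sum((p - mean_p) ** 2 for p in predictor)
    return [(t - mean_t) - slope * (p - mean_p) for t, p in zip(target, predictor)]


def partial_corr(x, y, z):
    """Correlation between x and y after linearly regressing z out of both."""
    return pearson(residuals(x, z), residuals(y, z))


# Simulated chain X -> Z -> Y with Gaussian noise (made-up coefficients).
random.seed(0)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
z = [2.0 * xi + random.gauss(0, 1) for xi in x]
y = [-1.5 * zi + random.gauss(0, 1) for zi in z]

marginal = pearson(x, y)             # X and Y are strongly correlated
conditional = partial_corr(x, y, z)  # close to zero once Z is accounted for

if __name__ == "__main__":
    print(f"corr(X, Y)     = {marginal:.3f}")
    print(f"corr(X, Y | Z) = {conditional:.3f}")
```

In PC-algorithm terms, the strong marginal correlation keeps the X–Y edge in step two, while the near-zero partial correlation would remove it in step three and put Z in the separation set of X and Y.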
Original Description
🤝 Work with me: https://aibuilder.academy/yt/tufdEUSjmNI
🚀 Ship AI apps in weeks, not months: https://aibuilder.academy/courses/yt/tufdEUSjmNI
This is the final video in a three-part series on causality. In it, I sketch some big ideas from causal discovery, which aims to infer causal structure from data. I finish with a concrete example of doing causal discovery in Python.
Series Playlist: https://www.youtube.com/playlist?list=PLz-ep5RbHosVVTz9HEzpI4d6xpWsc8rOa
📰 Read more: https://medium.com/towards-data-science/causal-discovery-6858f9af6dcb?sk=2134f5b56c1ce943afdfebbf9e1dcb45
💻 Example code: https://github.com/ShawhinT/YouTube-Blog/tree/main/causality/causal_discovery
Resources:
- The Book of Why by Judea Pearl: https://www.amazon.com/Book-Why-Science-Cause-Effect/dp/046509760X
- Causal Discovery Review: https://www.frontiersin.org/articles/10.3389/fgene.2019.00524/full
- Causal Discovery Toolbox: https://fentechsolutions.github.io/CausalDiscoveryToolbox/html/index.html
Introduction - 0:00
Causal Discovery - 0:21
Forward/Inverse Problem - 1:09
3 Tricks of Causal Discovery - 2:28
Trick 1: Conditional Independence Testing - 2:32
Trick 2: Greedy Search of DAG Space - 5:01
Trick 3: Exploiting Asymmetries - 8:23
Trick-based Taxonomy - 10:34
Example: Causal Discovery with Census Data - 11:13
Closing remarks - 14:26
Playlist
Uploads from Shaw Talebi · Shaw Talebi · 13 of 60