Causal Discovery | Inferring causality from observational data

Shaw Talebi · Beginner ·🛠️ AI Tools & Apps ·4y ago

Key Takeaways

The video introduces causal discovery, which aims to infer causal structure from observational data. It frames causal discovery as an inverse problem and covers three tricks for tackling it: conditional independence testing (the PC algorithm), greedy search of the DAG space (the Greedy Equivalence Search algorithm), and exploiting asymmetries. It closes with a worked example in Python.

Full Transcript

Hey folks, welcome back. This is the final video in the three-part series on causality. In this video I'll be talking about causal discovery, which aims at inferring causal structure from data. I'll start by introducing what causal discovery is, sketch some big ideas, and then finish with a concrete example with code in Python. So with that, let's get into the video.

In the previous video I was talking about causal inference, which aims at answering questions about cause and effect. We talked about a lot of great things: we talked about the do-operator, which simulates interventions, we talked about confounding, and we talked about estimating causal effects. All these things were great. However, there was one key assumption that was necessary in order to do causal inference, which was a causal model. Obviously, a lot of times in the real world we don't have a causal model in hand when we're starting our analysis, and it's not always clear which variables cause which. Causal discovery is one thing that might help with obtaining a causal model. The goal of causal discovery is to find causal structure in data, so basically, given data, inferring the underlying causal model.

Causal discovery is an example of a so-called inverse problem, and inverse problems can be understood in contrast to forward problems. For example, imagine you have an ice cube sitting on your kitchen counter. You know the shape of the ice cube, you know the volume, and if you were to let that ice cube sit there for a few hours, you could probably predict with some reasonable degree of accuracy what the resulting puddle of water would look like. The inverse problem is like the opposite of this. In other words, the inverse problem would be, given a puddle of water on the kitchen counter, predicting the shape of the ice cube that made that puddle. Clearly this is a hard problem, because there are several different shapes of ice that could create the same puddle of water.

Connecting this to causal discovery, the shape of the ice cube is like our causal model, and the puddle of water is like the statistics that we observe in our data. Following this analogy, there are several causal models that could potentially generate the same statistics we observe in a given data set. The approach to solving inverse problems is to make assumptions: basically, we narrow down the possible number of solutions through assumptions. Although assumptions help, they often do not fully solve the problem. This is where we need to use some tricks to go a little further. Here I'll talk about three different tricks for causal discovery.

The first trick is conditional independence testing. I start here with a definition of statistical independence: two variables X and Y are said to be independent if their joint distribution is equal to the product of their individual distributions, P(X, Y) = P(X) P(Y). From this we can get a definition of conditional independence, which is basically the same thing, except that now we look at the distributions of each variable when conditioned on a particular variable, say Z: P(X, Y | Z) = P(X | Z) P(Y | Z). We can use this idea of conditional independence testing to do causal discovery, and this is actually the main idea behind one of the first causal discovery algorithms, called the PC algorithm, which is named after its authors, Peter Spirtes and Clark Glymour. I probably butchered that, so I apologize, but there's a reference to a review paper by them at the bottom here.

I'll just briefly go through the main idea of the PC algorithm; more details can be found in the blog linked in the description. The first step is to form a fully connected undirected graph, so we have a node for each variable in our data set and we connect undirected edges between each of these nodes. In step two we do pairwise independence tests, so we do an independence test between every possible pair of variables, and if two variables are independent we delete the undirected edge between them. The third step is conditional independence tests, so basically we do the same thing, but we pick a variable to condition on. If two variables are found to be conditionally independent, we delete the edge between them and we add that conditioned node to the separation set, and we continue these conditional independence tests until there are no more candidates for conditional independence testing. Then in step four we orient colliders. If we have three variables, say i, j, and k, with i and j each connected to k but not to each other, we form a collider out of them, meaning we make directed edges pointing from i to k and from j to k, given k is not in the separation set of i and j. Then in step five we add more directed edges to the graph following two constraints: we do not create any new v-structures in our graph, nor do we create any directed cycles. Hopefully, after all that, we output a directed acyclic graph which represents the causal connections of our system. Again, more details are on the blog, and the two references at the bottom here have a great description of the PC algorithm.

Trick two is a greedy search of the DAG space. There are three key concepts here. First is a DAG, which should be familiar since DAGs have been discussed in the previous two videos. Next is a DAG space, or in other words the space of all possible DAGs. For example, consider the space of DAGs with two nodes and one edge, which is shown here. There are only two possibilities: X could point to Y, or Y could point to X. Then finally we have the notion of a greedy search, which is a widely used idea in optimization. In short, a greedy search is an optimization strategy that picks what's best in the short run as opposed to the long run, and this is usually done using a heuristic or rule of thumb.

For example, suppose you're trying to get out of a forest. You may think: I'm trapped in a forest, forests have trees, so to get out of the forest I should go where there aren't any trees. In other words, every step you take should be in the direction with the least number of trees. So you repeat this strategy and go all the way along this black line until you finally get out of the forest, which we can call the greedy path because it is the result of a greedy search. However, if at the start you were to go in the exact opposite direction of the greedy path, you would make it back to civilization much faster.

So you might say, why would we ever want to use a greedy search? From the look of it, they just seem to give suboptimal solutions. Well, the problem is that a lot of the time computing the optimal solution is intractable, meaning if you ran an algorithm that tried out all possible solutions, all possible paths out of the forest, and compared them to each other, you would be waiting a long time for it to run, and maybe your grandkids would see the solution in their lifetime. This is the problem we face in causal discovery. We want to find the optimal DAG that best explains a given data set. The problem is, if we tried an exhaustive search, we have to deal with the fact that the number of possible DAGs is super-exponential in the number of nodes in a graph. In other words, if we have just three variables, the number of possible DAGs is 25. If we have six nodes, we're already over three million possibilities. And if we have a measly ten variables, or ten nodes in our graph, the number of possible DAGs is on the order of 10 to the power of 18.
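The DAG counts quoted in the transcript can be reproduced with a few lines of Python. This is only a sketch: it relies on Robinson's recurrence for counting labeled DAGs, which the video does not mention, and the function name count_dags is made up for illustration.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes, via Robinson's recurrence."""
    if n == 0:
        return 1  # the empty graph
    total = 0
    for k in range(1, n + 1):
        # Pick k nodes with no incoming edges; the alternating sign is
        # inclusion-exclusion over that choice.
        sign = 1 if k % 2 == 1 else -1
        total += sign * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
    return total


for n in (2, 3, 6, 10):
    print(n, count_dags(n))
```

Note that this count includes graphs with fewer edges (even none), so for two nodes it gives 3 rather than the 2 single-edge DAGs mentioned earlier. For three and six nodes it gives 25 and 3,781,503, matching the figures in the transcript.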
So even though greedy searches do not guarantee the optimal solution, at least they give us a solution in a reasonable amount of time. A causal discovery method that uses a greedy search is the so-called Greedy Equivalence Search algorithm. The basic idea of this algorithm is that you start with a completely unconnected DAG and you iteratively add edges to this unconnected graph such that you maximize a score value. In other words, you start with a set of nodes that correspond to each of your variables, but no edges between them. Then you add edges one by one according to some score. So the question is, what is the score that I'm talking about? Basically, it quantifies how good your DAG is, or how well the DAG explains the data. There are a few options for defining this score. One is the so-called Bayesian information criterion, and source number one, referenced at the bottom here, has a brief discussion for anyone who is interested. Then you repeat this process until you reach some stopping criterion, which could be that some number of edges have been added, or the score stops increasing, or whatever that may be.

Okay, so the final trick is exploiting asymmetry. As I discussed in the first video, asymmetry is a fundamental property of this causality framework, so it's natural to think maybe we can leverage asymmetry to help us find good causal models from data. There are three flavors of asymmetry that I've come across. I'll say as a disclaimer that this isn't any kind of standard classification; these are just some labels I'm putting on some themes that I have gleaned from looking at this stuff. The first flavor is what I call time asymmetry, which is based on the idea that causes precede effects. This is what is used in Granger causality, which is a method for quantifying an asymmetric relationship between two variables based on prediction. More information about Granger causality can be found in this first reference here, and there's a lot of stuff out there on Granger causality; you can just do a simple Google search and you'll probably find a bunch of stuff.

The second asymmetry is what I call complexity asymmetry, which is basically the Occam's razor principle that simpler models are better. Going back to our ice cube example from earlier, following this principle we would say the ice cube that's actually a cube is preferred over the more complicated ice, because it is simpler.

Finally, the third flavor is what I call functional asymmetry, where better functional fits are better candidates for a causal model. One method that uses this is the nonlinear additive noise model. The way this works is, suppose we start with two statistically dependent variables X and Y. We then model Y in terms of a nonlinear function of X. Then we compute a noise term by taking the difference between Y and this nonlinear functional fit. Then finally we test whether the noise term N is independent of X. If it is, we accept the model and say X causes Y, and if not, we reject it. Then we can do the same thing in the opposite direction, where we model X in terms of a nonlinear function of Y and repeat the same procedure. More details on this method can be found in reference number three, and generally all of these are great resources if you're trying to learn more about causal discovery.

Okay, so wrapping up these tricks, we have a trick-based taxonomy. It's important to note that these tricks are not mutually exclusive in all cases. There are indeed algorithms that will mix and match different ones for causal discovery, as shown in the bottom row of the table here. As a bit of commentary, causal discovery seems, to me at least, to be a relatively young field, so there still has not emerged a single or small set of causal discovery algorithms that beat out all others in all situations. I'll also say that this is by no means an exhaustive list of causal discovery techniques. However, this is probably a good start for anyone trying to get into causal discovery, and the references given at the bottom here can get the ball rolling for you.

Okay, so I will conclude with a concrete example, like in the previous video. We're going to be using the same census data set as before, but instead of having just three variables of age, education, and wealth, we're going to include more variables, and instead of using the Microsoft DoWhy library for causal inference, we're going to be using the Causal Discovery Toolbox. Again, the first step is importing libraries and loading data. Then, for a lot of these causal discovery algorithms, it helps to start with a so-called graph skeleton. This is like step two that we saw with the PC algorithm, where we do the pairwise independence testing and we have bidirected or undirected edges between variables that are statistically dependent. Then you can visualize the network pretty easily using NetworkX.

The first causal discovery algorithm that I use in this example is the PC algorithm. Again, we just do that in two lines and it spits out this causal graph. The graph is somewhat reasonable. It's not perfect, but we can see that "has graduate degree", which is like our education variable, causes the "greater than 50k" variable, which is our income variable. We also have age causing our income variable, which is what we expected. What was not expected is that our education variable, "has graduate degree", is pointing toward age. This is saying that whether or not someone has a graduate degree has a causal impact on their age, which is not true: if you give someone a graduate degree, it's not going to have any effect on their age. Another interesting thing is that we have several variables having a causal effect on the number of hours that someone works in a week. Whether or not they have a graduate degree has a causal effect on the number of hours they work, their age has an effect, and so does whether or not they're female (there are basically two options, male or female, in this data set). Then the edge to the ethnicity information, captured by "is white", is bidirectional, so the PC algorithm wasn't able to break that symmetry. What's interesting is that hours per week causes a single variable, which is "in relationship". What this is saying is that the number of hours you work per week has a causal effect on whether you're in a relationship or not. We could look at this all day and kind of craft whatever stories we want in our minds, but this should definitely be taken with a grain of salt.

The next algorithm that we try out is the Greedy Equivalence Search algorithm, which uses trick number two, greedy search of the DAG space. This gives us a causal graph that is somewhat similar to what the PC algorithm gave us. Notably, for that edge between hours per week and "is white", the symmetry was broken, so it's not a bidirected edge. Then finally we use the LiNGAM algorithm, and this one doesn't really give us something sensible. Basically, everything is causing "has graduate degree": whether you make more than 50,000 impacts your graduate degree, how many hours per week you work has an impact on your graduate degree, and these edges seem backwards. This algorithm doesn't seem to do a great job, and that's because it's assuming linear relationships between variables, and since most of these variables are Boolean, that's not something that necessarily makes sense.

Code can be found at the GitHub linked at the bottom here; I'll put the link in the description. Feel free to take this data and run with it, and feel free to leave a comment. I'd be interested to hear the results of your analysis. So that concludes our series on causality. We started in the first video introducing this new science of cause and effect, in the second video we talked about causal inference, and finally in this video we concluded with causal discovery. If you enjoyed the series, please consider liking, subscribing, sharing, and commenting your thoughts. If you're interested in learning more, check out the blog, and check out the GitHub to get your hands on the example code discussed in this video. As always, thanks for watching.
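The functional-asymmetry trick described in the transcript (fit the effect on the cause, then check whether the leftover noise is independent of the cause) can be illustrated with a toy example. This is a minimal sketch under simplifying assumptions, not the method from the video's code: it swaps the nonlinear fit for a straight-line fit with non-Gaussian (uniform) noise, which is closer to the LiNGAM setting, and it uses a crude dependence proxy (the correlation between squared values) instead of a proper independence test. All function names and the simulated data are made up for illustration.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def residuals(cause, effect):
    """Noise term left after a least-squares line fit of effect on cause."""
    n = len(cause)
    mean_c, mean_e = sum(cause) / n, sum(effect) / n
    slope = (sum((c - mean_c) * (e - mean_e) for c, e in zip(cause, effect))
             / sum((c - mean_c) ** 2 for c in cause))
    return [(e - mean_e) - slope * (c - mean_c) for c, e in zip(cause, effect)]


def dependence_score(cause, effect):
    """Crude proxy: |corr| between squared centered cause and squared noise."""
    mean_c = sum(cause) / len(cause)
    noise = residuals(cause, effect)
    return abs(pearson([(c - mean_c) ** 2 for c in cause],
                       [r ** 2 for r in noise]))


# Simulated data (an assumption for this sketch): x causes y.
rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(5000)]
y = [xi + rng.uniform(-1, 1) for xi in x]

score_x_to_y = dependence_score(x, y)  # noise should look independent of x
score_y_to_x = dependence_score(y, x)  # noise should look dependent on y
print("x -> y:", round(score_x_to_y, 3))
print("y -> x:", round(score_y_to_x, 3))
print("preferred:", "x -> y" if score_x_to_y < score_y_to_x else "y -> x")
```

The direction whose noise term looks less dependent on the assumed cause is the one the model accepts. With Gaussian noise and a linear relationship both directions would look equally good, which is why this family of methods needs either nonlinearity or non-Gaussian noise.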

Original Description

🤝 Work with me: https://aibuilder.academy/yt/tufdEUSjmNI
🚀 Ship AI apps in weeks, not months: https://aibuilder.academy/courses/yt/tufdEUSjmNI

This is the final video in a three-part series on causality. In it, I sketch some big ideas from causal discovery, which aims to infer causal structure from data. I finish with a concrete example of doing causal discovery in Python.

Series Playlist: https://www.youtube.com/playlist?list=PLz-ep5RbHosVVTz9HEzpI4d6xpWsc8rOa
📰 Read more: https://medium.com/towards-data-science/causal-discovery-6858f9af6dcb?sk=2134f5b56c1ce943afdfebbf9e1dcb45
💻 Example code: https://github.com/ShawhinT/YouTube-Blog/tree/main/causality/causal_discovery

Resources:
- The Book of Why by Judea Pearl: https://www.amazon.com/Book-Why-Science-Cause-Effect/dp/046509760X
- Causal Discovery Review: https://www.frontiersin.org/articles/10.3389/fgene.2019.00524/full
- Causal Discovery Toolbox: https://fentechsolutions.github.io/CausalDiscoveryToolbox/html/index.html

Introduction - 0:00
Causal Discovery - 0:21
Forward/Inverse Problem - 1:09
3 Tricks of Causal Discovery - 2:28
Trick 1: Conditional Independence Testing - 2:32
Trick 2: Greedy Search of DAG Space - 5:01
Trick 3: Exploiting Asymmetries - 8:23
Trick-based Taxonomy - 10:34
Example: Causal Discovery with Census Data - 11:13
Closing remarks - 14:26

This video teaches viewers about causal discovery, a technique used to infer causal structure from observational data. It covers key concepts and algorithms, including the PC algorithm and Greedy Equivalence Search. Viewers will learn how to apply these techniques to real-world problems and improve their understanding of causal relationships.

Key Takeaways

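Form a fully connected undirected graph
Run pairwise independence tests and delete edges between independent variables
Run conditional independence tests, deleting edges and recording separation sets
Orient colliders with directed edges
Add more directed edges without creating new v-structures or directed cycles
Import libraries
Load data
Build a graph skeleton with pairwise independence testing
Determine graph structure with a causal discovery algorithm (PC, Greedy Equivalence Search, LiNGAM)
Visualize the network structure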
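The conditional independence and collider-orientation steps listed above can be tried out on simulated data. This is a rough sketch, not the PC algorithm itself: it handles a single triple of variables, assumes linear Gaussian relationships so that (partial) correlation can stand in for an independence test, and uses a fixed threshold of 0.1 in place of a proper significance test. The function names and the simulated data are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(a, b, c):
    """Correlation of a and b after conditioning on c."""
    r_ab, r_ac, r_bc = pearson(a, b), pearson(a, c), pearson(b, c)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def separating_set(i, j, k, threshold=0.1):
    """Set that renders i and j independent, or None if none is found."""
    if abs(pearson(i, j)) < threshold:
        return set()      # independent without conditioning on anything
    if abs(partial_corr(i, j, k)) < threshold:
        return {"k"}      # independent once we condition on k
    return None           # edge between i and j would be kept


def is_collider(i, j, k):
    """Orient i -> k <- j only if k is not in the separation set of i and j."""
    sepset = separating_set(i, j, k)
    return sepset is not None and "k" not in sepset


rng = random.Random(0)
n = 5000

# Collider: i -> k <- j
ci = [rng.gauss(0, 1) for _ in range(n)]
cj = [rng.gauss(0, 1) for _ in range(n)]
ck = [a + b + rng.gauss(0, 1) for a, b in zip(ci, cj)]

# Chain: i -> k -> j
hi = [rng.gauss(0, 1) for _ in range(n)]
hk = [a + rng.gauss(0, 1) for a in hi]
hj = [b + rng.gauss(0, 1) for b in hk]

collider_found = is_collider(ci, cj, ck)
chain_mistaken_for_collider = is_collider(hi, hj, hk)
print("collider data oriented as collider:", collider_found)
print("chain data oriented as collider:", chain_mistaken_for_collider)
```

In the collider data, i and j are independent until k is conditioned on, so k stays out of their separation set and the edges are oriented into k. In the chain data, conditioning on k is what makes i and j independent, so no collider is formed.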

💡 Causal discovery is a relatively young field, and there is no single or small set of causal discovery algorithms that beat out all others in all situations.
