Sequence Alignment | Needleman Wunsch Algorithm

Aladdin Persson · Intermediate ·⚡ Algorithms & Data Structures ·6y ago

Key Takeaways

The video explains the Needleman Wunsch algorithm for sequence alignment using dynamic programming, covering edit steps, optimal subproblems, and the recursive formula for finding the minimum cost alignment. The algorithm is also known as the Edit Distance problem on Leetcode.

Full Transcript

welcome back for another cool algorithm in this video we want to go to an algorithm for sequence alignment using dynamic programming so the problem is that we're giving two strings X which has a length m and Y which has length N and the goal is to transform X into y with a minimum number of edit steps so an edit step is essentially to insert delete or replace a character and notation X I is the character of position I in X so for example let's say that we have two quite similar words occurrence and y is occurrence how do we make X into y well after some thinking we can see that if we insert some if insert an AC here between the C and the U and change this a into e then the two strings will equal so that's a total number of edit steps of two to be able to find a solution to this problem we need to think about the different cases that can arise so the idea is let's assume that we're only looking at the last two elements and the cases that can arise then and we also assume that all subproblems are optimal up to those just last two elements so let's say that we are in case one so this is the case that can occur right we're looking at when x and y is of length 3 but this this argument with holes in general but so let's say that we're only considering so this x3 and y3 right here let's assume that these these are already solved so we already know this is optimal what do we do if that's the case well we just align them we just make X 3 equal y3 and if if it's the case that that X 3 is not is already equal to Y 3 then we just pay pay a cost of 0 and if they aren't equal then we just set x3 to y3 another case that can arise is that the X is of a is a larger length or greater length than Y so and we know that these are optimal so we're only considering the last two elements right here and what we do here is that well we just remove x3 right because these are optimal and remember also the goal is to transform X into y so we make X into y by just removing this then for case three we have that that the length of X is less than length of Y and what we do here since we are we know we assume this subproblem of these are optimal we were just looking at the last element then we insert an element here so we insert y3 into this gap the case for that can arise is that we have gap on both of the two glass elements but really this isn't the relevant subproblem because then we can just remove those gaps and we have the optimal solution so we ignore case four so we have three cases let's try to generalize the idea that we we saw in the previous slide case one well XM is aligned with yn then we can look at x1 up to xn minus 1 and y1 up to Y n minus 1 by simply aligning XM and yn so we make XM equal the character of Y N and remember here if they are equal then we don't pay anything but if they are in equal if you don't equal each other then we need to pay one to change the character of XM so okay so let's say that X M comes after Y n then it follows that we can look at X 1 up to X M minus 1 and y1 up to Y n by deleting X M case 3yn comes after XM it follows then that we can look at x1 up to XM + y1 up to Y n minus 1 by inserting yn in the last of X so we've seen that the structure depends on solving the subproblems right we assumed and on the previous slide we assumed we've solved sub-problem and we'll just look at the last elements but having solved those subproblems suggests a dynamical programming approach so we can define opt IJ to be the minimum number of edit steps required to transform X I X 1 up to X I into y1 up to Y J and then we remember the three cases we have that opted IJ is the minimum of aligning this specific character this I guess X I and the wide J and then the finding the solution to the sub-problem when we look at X 1 up to X I minus 1 and y1 up to Y J minus 1 or we can delete or we can insert so those are the three options for each we pay a cost of 1 here for deleting right if we delete one that we move one backward in X if we insert then we're still at the end of X but we move one back in y we also pay a cost of 1 of inserting so and we saw that this function previously was 1 if they are not equal otherwise they are 0 right if the characters are already equal then we don't have to pay anything but so this this is a recursive formula we need to consider what the base case is for for opt IJ so the initialization of opt I J then we need to remember first what opt IJ means it's the number of edit steps to transform X 1 up to X I to Y 1 2 up to YJ so if we have opted let's look at the first one if you have opted zero zero this means transform nothing into nothing right that's the cost of zero if opt I comma 0 that means transform I characters into nothing well we just do that by deleting the I characters so that will have a cost of I opt 0 comma J let's just transform nothing into J characters and we do that by inserting inserting the J elements and we get J it cost a total cost of J let's look at an example when we have the case X is some DNA sequence and Y is some other DNA sequence and we want to know how can we you know make this DNA sequence into this one so let's form a a matrix where we have the Y that we want to transform it to and the X and then remember that we have the initialization or the first row and the first column so what this means right here is that we want to transform nothing into nothing right that's cos of 0 and here we have nothing and transform it into for example it's a T and C we're here well that's a cost of 2 the cost of 3 etc here if we're standing here then we want to transform T into nothing we do that by removing and that's a cost of 1 here we want to remove T and G as a cost of 2 etc and then we look at the first one and we see here that we're either going so we're either let's say the first case we align T and T and we pay a price of zero because they are already the same and then we move to the diagonal okay so if we are to align them then we pay that cost and we move with the diagonal if we instead in delete if we delete this is the delete one then we move one up and we move here right if we delete the T then we pay a cost of one and we end up in this position transforming nothing into T for the cost of one that will total cost be two or in the last case we insert T and if we insert it a then we will be in this position where we have to transform T into nothing oh then we look at those that's zero two and two and we choose the one that has the smallest so we choose 0 we just align the t's that T values I mean we just align T with T then for the next one we do the same thing and recognize here that we're also looking from on the 1 up the one to the left and I mean the one up and the one to the diagonal and went to the left right here we can see instantly that the one to the left is of smaller value right so that the optimal solution in this case would be the one to the left which would be this one inserting in this case if we want to transform tea into TC then the best choice is to insert C which is also shown by the recursive formula now we can see here as well that the rest of these this row will be quite easy to fill out because as we see this the smallest value is always the one I mean the one smallest here is also the one to the left so if we wanted to transform T into TCG for example that would mean we just insert C and insert G similarly if we have TCGA and we have T then we want to insert CG and a so this row will be quite easy to form just increment each column value here by one from the pre compared to the previous one and then in this this column right here will also be quite easy to form because if we have T G and we want the target to be just a single T care at the character T then we just remove the G and we pay a price of one for example if we have TGA that's a price of two etc and we continue with our recursive formula we just look well we can go above diagonal or the one to the left okay in this case the one to the diagonal is best so if we go to the diagonal one that means we we match or we align rather G with C right and we cost the price of one for that which is why there's a 1 in this recursive formula case the one is the best and we just earned this G into a sea so what we've seen is that we're always looking at three values either the one to the left or the one to the diagonal or the one above if we look at their subproblem up above us that means we delete the one that we're currently at if we go to the subproblem at the diagonal that means that we align them to the left that means that we um we insert in this case I believe there actually is no the best one is they want to the diagonal if we were to compared to recursive formula since G equals G already we pay a price of zero of going into the diagonal one so the total would just still be one here compared to if we were to insert a G right because these two are the same but inserting a G we would pay a cost of one so the best choice here is to align G with G and then to insert C or go to this sub-problem now quite similar for the rest of these values mean in this case we just we see that the one to the left is always the smallest one so we just keep inserting that's the best choice then let's see here yeah okay sorry so for this one we look at the one above us and the one to the diagonal and we compare either we be in this case actually since a is not equal to C we have two choices that we can take either we remove the a which is the one above us and we go here and we just transform this G into AC for total cost of two or we go to the diagonal one of inserting insert I mean aligning a with with C and then removing G so there are two viable options here and the same for the the next one we look at the one above or the one to the diagonal and they are both viable in this case because a does not equal G we pay a price of the same of going above by deleting a or by aligning G and certain then solving that sub-problem also have a cost of two example when we're here we noticed that a is the same as this one a so we pay a price of zero and we can just go and we go to this one or the subproblem that means that we pay a total price of one and so I think you understand intuition and the idea behind the recursive formula so if we just fill out the rest of these values then at the end we will come to the last at the bottom left I mean bottom right that means we want to transform the entire sequence entire string X into y that will be a total of edit steps of three but how do we actually find a solution so we know the minimum number of edits but we also want to know which of the specific edits to do so the idea here is that we use something called backtracking which is we just check whichever of the three cases was the best and that was the Edit graphically we can denote this as the three the three options and I we have discussed this also if we move to the diagonal that means that aligning was the best choice and remember aligning we can either change it or just and pay a price of one or we can just if they already equal pay a price of zero moving horizontally means inserting was the best choice moving above or yeah moving to the one above us that means deletion was the best choice for example in this case when we compare these three here we see that see here actually we see that inserting was best so in this case we want to insert a and now we look again and we look which which of these three was the best ones and the best choice in this case was lining which is see and then we just do the same thing which it which of these three were best it was deleting and then continuing going through each of the choices until we eventually come to the base case here and then we see that from doing this we've actually formed the specific edits to do the best choice in Athiya in the end is to insert a and then to align the C with C and then to remove G and then to align T with T here align G with G align C with C line a with a align G with G and then to insert here insert C and then at the end insert mean align T with T so we see that the the Edit steps were to insert C remove G and and also insert a and just a quick note on the time complexity finding each specific array value was a which is is it a constant time but we need to fill the entire this matrix so knowing that X is of length M and Y is of length n then we have Big O of M times and time complexity in the next video I will code this algorithm in Python so if you want to know this the implementation details check out the next video

Original Description

Explanation step by step of how the sequence alignment algorithms problem works. Other common names of this algorithm is the Needleman Wunsch algorithm. In the next video we will code this algorithm from scratch using Python. On Leetcode this algorithm can be found under "Edit Distance", it seems this algorithm has many different names! Code Repository: https://github.com/AladdinPerzon/Algorithms-Collection-Python
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Aladdin Persson · Aladdin Persson · 24 of 60

1 computeCost.m Linear Regression Cost Function - Machine Learning
computeCost.m Linear Regression Cost Function - Machine Learning
Aladdin Persson
2 gradientDescent.m Gradient Descent Implementation -  Machine Learning
gradientDescent.m Gradient Descent Implementation - Machine Learning
Aladdin Persson
3 Neural Network from scratch - Part 1 (Standard Notation)
Neural Network from scratch - Part 1 (Standard Notation)
Aladdin Persson
4 Neural Network from scratch - Part 2 (Forward Propagation)
Neural Network from scratch - Part 2 (Forward Propagation)
Aladdin Persson
5 Neural Network from scratch - Part 3 (Backward Propagation)
Neural Network from scratch - Part 3 (Backward Propagation)
Aladdin Persson
6 Neural Network from scratch - Part 4 (With Python)
Neural Network from scratch - Part 4 (With Python)
Aladdin Persson
7 sigmoid.m - Programming Assignment 2 Machine Learning
sigmoid.m - Programming Assignment 2 Machine Learning
Aladdin Persson
8 costFunction.m - Programming Assignment 2 Machine Learning
costFunction.m - Programming Assignment 2 Machine Learning
Aladdin Persson
9 predict.m - Programming Assignment 2 Machine Learning
predict.m - Programming Assignment 2 Machine Learning
Aladdin Persson
10 costFunctionReg.m - Programming Assignment 2 Machine Learning
costFunctionReg.m - Programming Assignment 2 Machine Learning
Aladdin Persson
11 lrCostFunction.m - Programming Assignment 3 Machine Learning
lrCostFunction.m - Programming Assignment 3 Machine Learning
Aladdin Persson
12 oneVsAll.m - Programming Assignment 3 Machine Learning
oneVsAll.m - Programming Assignment 3 Machine Learning
Aladdin Persson
13 predictOneVsAll.m - Programming Assignment 3 Machine Learning
predictOneVsAll.m - Programming Assignment 3 Machine Learning
Aladdin Persson
14 predict.m - Programming Assignment 3 Machine Learning
predict.m - Programming Assignment 3 Machine Learning
Aladdin Persson
15 Caesar Cipher Encryption and Decryption with example
Caesar Cipher Encryption and Decryption with example
Aladdin Persson
16 Cryptography: Caesar Cipher Python
Cryptography: Caesar Cipher Python
Aladdin Persson
17 Vigenere Cipher Explained (with Example)
Vigenere Cipher Explained (with Example)
Aladdin Persson
18 Cryptography: Vigenere Cipher Python
Cryptography: Vigenere Cipher Python
Aladdin Persson
19 Hill Cipher Explained (with Example)
Hill Cipher Explained (with Example)
Aladdin Persson
20 Cryptography: Hill Cipher Python
Cryptography: Hill Cipher Python
Aladdin Persson
21 Interval Scheduling Greedy Algorithm: Python
Interval Scheduling Greedy Algorithm: Python
Aladdin Persson
22 Weighted Interval Scheduling Algorithm Explained
Weighted Interval Scheduling Algorithm Explained
Aladdin Persson
23 Weighted Interval Scheduling Python Code
Weighted Interval Scheduling Python Code
Aladdin Persson
Sequence Alignment | Needleman Wunsch Algorithm
Sequence Alignment | Needleman Wunsch Algorithm
Aladdin Persson
25 Sequence Alignment | Needleman Wunsch in Python
Sequence Alignment | Needleman Wunsch in Python
Aladdin Persson
26 Codility BinaryGap Python
Codility BinaryGap Python
Aladdin Persson
27 Codility CyclicRotation Python
Codility CyclicRotation Python
Aladdin Persson
28 Derivation Linear Regression with Gradient Descent
Derivation Linear Regression with Gradient Descent
Aladdin Persson
29 Linear Regression Gradient Descent From Scratch in Python
Linear Regression Gradient Descent From Scratch in Python
Aladdin Persson
30 Pytorch Neural Network example
Pytorch Neural Network example
Aladdin Persson
31 Pytorch CNN example (Convolutional Neural Network)
Pytorch CNN example (Convolutional Neural Network)
Aladdin Persson
32 Pytorch LeNet implementation from scratch
Pytorch LeNet implementation from scratch
Aladdin Persson
33 Pytorch VGG implementation from scratch
Pytorch VGG implementation from scratch
Aladdin Persson
34 Pytorch GoogLeNet / InceptionNet implementation from scratch
Pytorch GoogLeNet / InceptionNet implementation from scratch
Aladdin Persson
35 How to save and load models in Pytorch
How to save and load models in Pytorch
Aladdin Persson
36 How to build custom Datasets for Images in Pytorch
How to build custom Datasets for Images in Pytorch
Aladdin Persson
37 Pytorch Transfer Learning and Fine Tuning Tutorial
Pytorch Transfer Learning and Fine Tuning Tutorial
Aladdin Persson
38 Pytorch Data Augmentation using Torchvision
Pytorch Data Augmentation using Torchvision
Aladdin Persson
39 Pytorch Quick Tip: Weight Initialization
Pytorch Quick Tip: Weight Initialization
Aladdin Persson
40 Pytorch Quick Tip: Using a Learning Rate Scheduler
Pytorch Quick Tip: Using a Learning Rate Scheduler
Aladdin Persson
41 Pytorch ResNet implementation from Scratch
Pytorch ResNet implementation from Scratch
Aladdin Persson
42 Pytorch TensorBoard Tutorial
Pytorch TensorBoard Tutorial
Aladdin Persson
43 Pytorch DCGAN Tutorial (See description for updated video)
Pytorch DCGAN Tutorial (See description for updated video)
Aladdin Persson
44 Naive Bayes from Scratch - Machine Learning Python
Naive Bayes from Scratch - Machine Learning Python
Aladdin Persson
45 Spam Classifier using Naive Bayes in Python
Spam Classifier using Naive Bayes in Python
Aladdin Persson
46 K-Nearest Neighbor from scratch - Machine Learning Python
K-Nearest Neighbor from scratch - Machine Learning Python
Aladdin Persson
47 Linear Regression Normal Equation Python
Linear Regression Normal Equation Python
Aladdin Persson
48 SVM from Scratch - Machine Learning Python (Support Vector Machine)
SVM from Scratch - Machine Learning Python (Support Vector Machine)
Aladdin Persson
49 Neural Network from Scratch - Machine Learning Python
Neural Network from Scratch - Machine Learning Python
Aladdin Persson
50 Pytorch RNN example (Recurrent Neural Network)
Pytorch RNN example (Recurrent Neural Network)
Aladdin Persson
51 Pytorch Bidirectional LSTM example
Pytorch Bidirectional LSTM example
Aladdin Persson
52 Pytorch Text Generator with character level LSTM
Pytorch Text Generator with character level LSTM
Aladdin Persson
53 Logistic Regression from Scratch - Machine Learning Python
Logistic Regression from Scratch - Machine Learning Python
Aladdin Persson
54 K-Means Clustering from Scratch - Machine Learning Python
K-Means Clustering from Scratch - Machine Learning Python
Aladdin Persson
55 Pytorch Torchtext Tutorial 1: Custom Datasets and loading JSON/CSV/TSV files
Pytorch Torchtext Tutorial 1: Custom Datasets and loading JSON/CSV/TSV files
Aladdin Persson
56 Pytorch Torchtext Tutorial 2: Built in Datasets with Example
Pytorch Torchtext Tutorial 2: Built in Datasets with Example
Aladdin Persson
57 Pytorch Torchtext Tutorial 3: From Textfiles to Dataset
Pytorch Torchtext Tutorial 3: From Textfiles to Dataset
Aladdin Persson
58 Paper Review: Sequence to Sequence Learning with Neural Networks
Paper Review: Sequence to Sequence Learning with Neural Networks
Aladdin Persson
59 Pytorch Seq2Seq Tutorial for Machine Translation
Pytorch Seq2Seq Tutorial for Machine Translation
Aladdin Persson
60 Pytorch Seq2Seq with Attention for Machine Translation
Pytorch Seq2Seq with Attention for Machine Translation
Aladdin Persson

The Needleman Wunsch algorithm is a dynamic programming approach to sequence alignment, finding the optimal alignment by considering edit steps and their costs. The algorithm uses a recursive formula and backtracking to find the minimum cost alignment. This video explains the algorithm step-by-step, covering its application to sequence alignment problems.

Key Takeaways
  1. Initialize the matrix with base case values
  2. Fill in the matrix by considering the minimum cost of aligning, deleting, or inserting characters
  3. Choose the minimum cost by comparing the costs of the three options
  4. Continue this process until the entire matrix is filled in
  5. Read off the optimal alignment from the matrix
  6. Use backtracking to find the specific edits to perform
💡 The Needleman Wunsch algorithm uses dynamic programming to find the optimal alignment between two sequences, with a time complexity of O(M*N) where M and N are the lengths of the two sequences.

Related AI Lessons

Bloom Filters, Explained Properly
Learn how Bloom filters work and their benefits, including tiny memory and blazing speed, in exchange for potential false positives.
Dev.to · Daksh Gargas
Prefix Sums: The Preprocessing Trick That Makes Range Queries Instant
Learn how prefix sums enable instant range queries in arrays, boosting performance in various applications
Medium · Programming
I Thought I Was Ready for the Interview — Then One Simple Math Question Destroyed Me
A simple math question can destroy a developer's interview, highlighting the importance of being prepared for unexpected questions
Medium · Programming
Week 2(Day 10): LeetCode Two Pointers(slow & fast): Remove Duplicates from Sorted Array (Brute…
Learn to remove duplicates from a sorted array using the two pointers technique, improving from brute force to optimized solutions
Medium · Python
Up next
Stump Grinder Carbide Wheel Grinds Hardwood To Chips
Innoforge Studio
Watch →