Unicode Normalization for NLP in Python

James Briggs · Intermediate ·🧠 Large Language Models ·5y ago

Key Takeaways

This video teaches Unicode normalization techniques for natural language processing in Python, covering how to handle annoying font variants and diacritics in text input using the four normal forms (NFD, NFC, NFKD, and NFKC).

Full Transcript

Okay, so we're going to take a look at Unicode normalization. Unicode normalization is something we use when we have those weird font variants that people always use on the internet. If you've ever seen people using those odd characters, I think they use them to express some form of individuality or to catch your attention. Then we have another issue where we have weird glyphs in text, and this is more reasonable because they're actually a part of language: the accents above the e's and so on in Italian or Spanish. Those little glyphs, all together, are called diacritics, and whenever we come across diacritics or that weird text, we can get issues when we're building models.

The issue with the weird text is that if someone has written hello world in normal text, and we're comparing it to someone's hello world in some weird text with circles around every letter, we can't actually compare like for like, because our models, or code in general, are not going to be able to match those two different sets of Unicode characters. The issue with diacritics is that those characters have a hidden property: we have one Unicode character, which is the capital C with cedilla, but then we have an identical-looking sequence of characters, for example the Latin capital C immediately followed by something called the combining cedilla character, and together they look exactly like the other Unicode character. This is quite difficult to deal with. So we have these two problems, and we use Unicode normalization to deal with them when we're building NLP models.

So I said there are two forms of equivalent characters that are not really identical. The first is compatibility equivalence. That's where we have stuff like font variants, different line break sequences, circled variants, superscripts, subscripts, fractions, and a few other things as well. Now we want our model to see both
hello world with the circles and plain hello world as one, because that's how we read it and that's how it's supposed to be interpreted. That is what compatibility equivalence is for, and we'll look at how we actually deal with that pretty soon.

Then we also have canonical equivalence, which is the thing with the accents and the glyphs I mentioned before. There are a few different cases, but the two I think are most relevant are, first, combined characters, where we have that C with cedilla character and also the capital C plus the combining cedilla character merged together, and second, conjoined Korean characters, which I think are pretty common as well. Canonical equivalence is much more to do with characters where we can't really see that they are different, but they are in fact different, whereas compatibility equivalence is more to do with characters that were purposely made different but whose meaning is really the same.

We have two different directions for how we can transform our text between these forms. We have decomposition, which is breaking Unicode characters down into smaller or more normal parts, and composition, which is taking multiple Unicode characters and merging them into a single accepted Unicode character.

I've got this example here, U+00C7. If we take a look, this is our C with cedilla, and we can see what it looks like: it has the C and a little cedilla at the bottom. On the other side we have these two characters, and if we take a look we can see this is the C plus the cedilla, so these are two separate Unicode characters. Then we see that they actually look exactly the same, and obviously that's where our problem is. What we can do is decompose them into their different parts. Now, these are already separated, so when we decompose them we just get the same thing again, whereas for our
C with cedilla character, we decompose it and get two parts: the Latin capital C and the combining cedilla character. Then we can perform canonical composition to put both together and merge them back into the capital C with cedilla. That's essentially how decomposition and composition work. It's slightly different for compatibility decomposition, but we'll talk about that quite soon.

When we combine the fact that we have these two directions, composition and decomposition, with our two types of equivalence, compatibility and canonical, we get four normal forms. The first is form D, which is canonical decomposition. That's what I showed you here, where we decompose a character into its individual parts.

Let's take a look at how to actually do this in Python. We'll take this Unicode here, \u00C7, which is our C with cedilla character, and if we print it out we see that character. The other one is where it's both parts together, so I'm just going to call it C plus cedilla. That is the Latin capital C, which is \u0043, and if I print this out before we put the cedilla on the end, we just have a C. Then for the cedilla we add \u0327, and we get that. Obviously these look the same, but if we compare them we'll see that they are not the same: we get False.

To deal with that, we need canonical decomposition, or NFD, which we can see here. To do all of this we need to import the unicodedata library, and then we use unicodedata.normalize. In this case we're using NFD, which is canonical decomposition, and we want to pass in our C with cedilla, because we want to break it down into its two parts. So that's the one we need to transform, and on the other side we're going to have our C plus
cedilla, which is our two characters. If we change this to use normalize, we now get True. What we've done is convert a single character into the two separate characters, and that's because we've used normal form D: we decomposed it, we broke it apart.

On the other side of that we have canonical composition, where we build them back up into one, and to do that we use NFC. Obviously, if we try it on this one, we're not going to get the right answer. We won't find that they match, because we're composing this back into itself, so it's just going to be this again, against this, which is still separate. So we switch which side we have the function on. If I remove this and copy it across, we see that now we get True, because what we've done is convert these two characters into this one. That's how we normalize for canonical equivalence, which is essentially where we can't actually see the difference.

On the other side we have where people are using the weird text. In our abbreviations we have these two with the K, and that K means compatibility. Where there isn't a K, we're using canonical equivalence; where there is a K, we're using compatibility equivalence.

The first of those is normal form KD, which is compatibility decomposition. This breaks down the fancy or alternative characters into their smaller parts, if they do have smaller parts. For example, with fractions, the one-over-two fraction gets broken down into the numbers one and two and a fraction slash character, which we can see down here. We also have our fancy characters, so where we have this fancy capital H, we decompose it into just a normal Latin capital letter H. That's how compatibility decomposition works, and to apply it we use NFKD. So if we take what we have here, we're just going to switch what we're actually using. I'm going to switch out the C with cedilla for
this fancy H. In fact, we can just leave it like that, because we can at least see what we're doing. So we put that here, and we want to compare it to a normal letter H. Obviously it's False; it doesn't match. What we need to do is normalize this and decompose it into the capital H character. So let's take this and use our normalize function again, but this time we want compatibility equivalence, so we use K, and we're decomposing, so we use D. Now we can see that we are getting True. If we print out the result of this function, we can see it's just taking that H and converting it into something normal.

That leads us on to our final normal form, which is normal form KC. Normal form KC consists of two steps. We have the compatibility decomposition, which is what we've just done, and then there's a second step, which is canonical composition, where we build those different parts back up canonically. This allows us to normalize all variants of a given character into a single shared form.

For example, with our fancy H, we can add the combining cedilla to make some horrible monstrosity of a character. We'd write that out as follows: we have the H here, so we put that straight in, and then we come up here, get our cedilla Unicode, and put that in. If we put those together, we get this weird character. Now, if we wanted to compare that to another character, the H with cedilla, which is a single Unicode character, we're going to have some issues, because that one is just one character. If we use NFKD, we can give it a go. We'll add this in and try to compare it to this, and we get False. That's because this is breaking it down into two different parts, an H and the combining cedilla. If I remove this and print it out, you see they look the same, but they're not the same, because we have those two characters again. So this is where we
need canonical composition to bring those together into a single character. So that looks like this: initially we have our compatibility decomposition, and if we go across, we have this final step, which is canonical composition. This is the NFKC normal form, so normal form KC. To apply it, all we need to do is adjust this to KC. We run that and seem to get the same result, but then if we add this, we can see that now we're getting what we need.

In reality, I think for most cases, or almost all that I can think of anyway, you're going to use NFKC to normalize your text, because it's going to give you the cleanest, simplest, most normalized dataset. So when going forward with your language models, this is definitely the form that I would go with. Of course, you can mix it up and use different ones. If this is quite confusing and hard to get a grasp of, I would definitely recommend just taking these Unicode characters, playing around with them a little bit, applying these normal form functions to them, and seeing what happens. I think it'll probably click quite quickly.

So that's it for this video. I hope it's been useful and you've enjoyed it. Thank you for watching, and I will see you again in the next one.
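
The comparisons walked through in the transcript can be reproduced roughly as follows. This is a sketch rather than the exact code from the video: the variable names are made up for illustration, and U+210B (SCRIPT CAPITAL H) is an assumption standing in for the "fancy H", since the transcript does not say which code point was used. The C with cedilla code points are the ones given in the video.

```python
import unicodedata

# Code points named in the video
c_with_cedilla = "\u00C7"        # single precomposed character
c_plus_cedilla = "\u0043\u0327"  # Latin capital C + combining cedilla

# They render identically but are different strings
print(c_with_cedilla == c_plus_cedilla)  # False

# NFD (canonical decomposition): split the single character into two
print(unicodedata.normalize("NFD", c_with_cedilla) == c_plus_cedilla)  # True

# NFC (canonical composition): merge the two characters into one
print(c_with_cedilla == unicodedata.normalize("NFC", c_plus_cedilla))  # True

# NFKD (compatibility decomposition) on a font variant.
# Assumption: U+210B is used here as the "fancy H".
fancy_h = "\u210B"
print(fancy_h == "H")                                 # False
print(unicodedata.normalize("NFKD", fancy_h) == "H")  # True

# NFKD on the one-half fraction gives three separate characters
half_parts = unicodedata.normalize("NFKD", "\u00BD")
print([unicodedata.name(ch) for ch in half_parts])
# ['DIGIT ONE', 'FRACTION SLASH', 'DIGIT TWO']

# Fancy H + combining cedilla versus the single character H with cedilla
fancy_h_cedilla = fancy_h + "\u0327"
h_with_cedilla = "\u1E28"  # LATIN CAPITAL LETTER H WITH CEDILLA

# NFKD leaves two characters (H + combining cedilla), so no match
print(unicodedata.normalize("NFKD", fancy_h_cedilla) == h_with_cedilla)  # False

# NFKC decomposes by compatibility, then composes canonically
print(unicodedata.normalize("NFKC", fancy_h_cedilla) == h_with_cedilla)  # True
```

The last two comparisons show why NFKC is the form recommended at the end of the video: it is the only one of the four that maps both the font variant and the combining sequence onto the same single character.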

Original Description

ℕ𝕠-𝕠𝕟𝕖 𝕚𝕟 𝕥𝕙𝕖𝕚𝕣 𝕣𝕚𝕘𝕙𝕥 𝕞𝕚𝕟𝕕 𝕨𝕠𝕦𝕝𝕕 𝕖𝕧𝕖𝕣 𝕦𝕤𝕖 𝕥𝕙𝕖𝕤𝕖 𝕒𝕟𝕟𝕠𝕪𝕚𝕟𝕘 𝕗𝕠𝕟𝕥 𝕧𝕒𝕣𝕚𝕒𝕟𝕥𝕤. 𝕋𝕙𝕖 𝕨𝕠𝕣𝕤𝕥 𝕥𝕙𝕚𝕟𝕘, 𝕚𝕤 𝕚𝕗 𝕪𝕠𝕦 𝕕𝕠 𝕒𝕟𝕪 𝕗𝕠𝕣𝕞 𝕠𝕗 ℕ𝕃ℙ 𝕒𝕟𝕕 𝕪𝕠𝕦 𝕙𝕒𝕧𝕖 𝕔𝕙𝕒𝕣𝕒𝕔𝕥𝕖𝕣𝕤 𝕝𝕚𝕜𝕖 𝕥𝕙𝕚𝕤 𝕚𝕟 𝕪𝕠𝕦𝕣 𝕚𝕟𝕡𝕦𝕥, 𝕪𝕠𝕦𝕣 𝕥𝕖𝕩𝕥 𝕓𝕖𝕔𝕠𝕞𝕖𝕤 𝕔𝕠𝕞𝕡𝕝𝕖𝕥𝕖𝕝𝕪 𝕦𝕟𝕣𝕖𝕒𝕕𝕒𝕓𝕝𝕖. We also find that text like this is incredibly common - particularly on social media.

Another pain-point comes from diacritics (the little glyphs in Ç, é, Å) that you'll find in almost every European language. These characters have a hidden property that can trip up any NLP model - take a look at the Unicode for two versions of Ç:

Latin capital letter C with cedilla: \u00C7
Latin capital letter C + combining cedilla: \u0043\u0327

Both are completely different, despite rendering as the same character. To deal with all of these text variants we need to use Unicode normalization - which we will cover in this video.

🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5
Medium article: https://towardsdatascience.com/what-on-earth-is-unicode-normalization-56c005c55ad0
Friend link (free access): https://towardsdatascience.com/what-on-earth-is-unicode-normalization-56c005c55ad0?sk=0cd19a9ad9f5d948b33179bab3c3b7cd
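
Both problems from the description, the double-struck font variants and the two encodings of Ç, can be handled in one preprocessing step. The sketch below assumes NFKC as the target form, as recommended in the video. The helper name `normalize_text` is hypothetical and not taken from the video or the article.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical helper: map text to Unicode normal form KC."""
    return unicodedata.normalize("NFKC", text)


# The first word of the description, "No-one", written with double-struck
# letters (escapes are used so the code points are explicit)
double_struck = "\u2115\U0001D560-\U0001D560\U0001D55F\U0001D556"
print(normalize_text(double_struck))  # No-one

# The two versions of C with cedilla from the description
single = "\u00C7"
combined = "\u0043\u0327"
print(single == combined)                                  # False
print(normalize_text(single) == normalize_text(combined))  # True

# After NFKC the combining sequence collapses to one code point
print(len(combined), len(normalize_text(combined)))  # 2 1
```

Applying the same helper to every input, at training and at inference time, means a model only ever sees one representation of each character.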

