Unicode Normalization for NLP in Python
ℕ𝕠-𝕠𝕟𝕖 𝕚𝕟 𝕥𝕙𝕖𝕚𝕣 𝕣𝕚𝕘𝕙𝕥 𝕞𝕚𝕟𝕕 𝕨𝕠𝕦𝕝𝕕 𝕖𝕧𝕖𝕣 𝕦𝕤𝕖 𝕥𝕙𝕖𝕤𝕖 𝕒𝕟𝕟𝕠𝕪𝕚𝕟𝕘 𝕗𝕠𝕟𝕥 𝕧𝕒𝕣𝕚𝕒𝕟𝕥𝕤. 𝕋𝕙𝕖 𝕨𝕠𝕣𝕤𝕥 𝕥𝕙𝕚𝕟𝕘, 𝕚𝕤 𝕚𝕗 𝕪𝕠𝕦 𝕕𝕠 𝕒𝕟𝕪 𝕗𝕠𝕣𝕞 𝕠𝕗 ℕ𝕃ℙ 𝕒𝕟𝕕 𝕪𝕠𝕦 𝕙𝕒𝕧𝕖 𝕔𝕙𝕒𝕣𝕒𝕔𝕥𝕖𝕣𝕤 𝕝𝕚𝕜𝕖 𝕥𝕙𝕚𝕤 𝕚𝕟 𝕪𝕠𝕦𝕣 𝕚𝕟𝕡𝕦𝕥, 𝕪𝕠𝕦𝕣 𝕥𝕖𝕩𝕥 𝕓𝕖𝕔𝕠𝕞𝕖𝕤 𝕔𝕠𝕞𝕡𝕝𝕖𝕥𝕖𝕝𝕪 𝕦𝕟𝕣𝕖𝕒𝕕𝕒𝕓𝕝𝕖.
We also find that text like this is incredibly common, particularly on social media.
Another pain point comes from diacritics (the little glyphs in Ç, é, Å) that you'll find in almost every European language.
These characters have a hidden property that can trip up any NLP model. Take a look at the Unicode for two versions of Ç:
Latin capital letter C with cedilla: \u00C7
Latin capital letter C + combining cedilla: \u0043\u0327
The two are completely different code point sequences, despite rendering as the same character.
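As a minimal sketch of the mismatch, the two code point sequences listed above can be compared directly in Python. The variable names are illustrative only; `unicodedata` is part of the standard library.

```python
import unicodedata

# The two encodings of the same visible character.
precomposed = "\u00C7"       # single code point: C with cedilla
decomposed = "\u0043\u0327"  # two code points: C + combining cedilla

# A plain string comparison treats them as different text.
print(precomposed == decomposed)

# The lengths differ too: one code point versus two.
print(len(precomposed), len(decomposed))

# List the code points that make up the decomposed version.
for ch in decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Any exact-match lookup, such as a tokenizer vocabulary or a dictionary key, would likewise treat these as two unrelated strings.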
To deal with all of these text variants, we need to use Unicode normalization, which we will cover in this video.
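A rough sketch of what normalization looks like in Python, using the standard library's `unicodedata.normalize`. The helper name `normalize_text` and the choice of NFKC as the default are assumptions made for illustration, not necessarily the approach taken in the video.

```python
import unicodedata

def normalize_text(text: str, form: str = "NFKC") -> str:
    """Hypothetical helper: apply a Unicode normalization form
    (one of "NFC", "NFD", "NFKC", "NFKD") to the input text."""
    return unicodedata.normalize(form, text)

# Canonical forms: NFC composes, NFD decomposes.
composed = normalize_text("\u0043\u0327", "NFC")  # C + combining cedilla
decomposed = normalize_text("\u00C7", "NFD")      # precomposed C with cedilla
print(composed == "\u00C7")
print(decomposed == "\u0043\u0327")

# Compatibility form NFKC also folds font variants such as
# double-struck letters into their plain equivalents.
fancy = "\u2115\U0001D543\u2119"  # double-struck N, L, P
print(normalize_text(fancy))
```

The canonical forms (NFC, NFD) only reconcile composed and decomposed sequences. The compatibility forms (NFKC, NFKD) additionally map stylistic variants to plain characters, which is usually what you want for the font variants but does discard the styling.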
70% Discount on the NLP With Transformers in Python course:
https://bit.ly/3DFvvY5
Medium article:
https://towardsdatascience.com/what-on-earth-is-unicode-normalization-56c005c55ad0
Friend link (free access):
https://towardsdatascience.com/what-on-earth-is-unicode-normalization-56c005c55ad0?sk=0cd19a9ad9f5d948b33179bab3c3b7cd
Uploads from James Briggs · James Briggs · 23 of 60