Domain-Specific AI Models: How to Create Customized BERT and SBERT Models for Your Business
Key Takeaways
The video outlines how to create customized BERT and SBERT models for domain-specific applications, using tools like Hugging Face, PyTorch, and Sentence Transformers, and techniques such as fine-tuning and tokenization with BPE and WordPiece.
Full Transcript
Hello community. Today I want to talk about domain-specific NLP models: Transformer models that have a very specific focus. So here we go.
When I talk to my clients, I hear a lot of: "I have a specific domain data set. I'm working in biomedical, in finance, in legal. I need some very specific systems." And the question I get asked is: "How can you train this system for a query system that I want in my company? I'm sure that my words, my product names, my partners, my patterns and my, I don't know, science are definitely not in the general data set that common NLP models have been trained on. So can you build me something?" Well, let's answer this.
When I tell them, "Well, there's a BERT system, and it has a vocabulary of 30,000 tokens in the general default case (of course we can program higher values)", they say: "Hey, but my specific corporate domain knowledge alone already has, I don't know, 10,000 specific words, so this will not work at all." So you have to tell them: calm down. Transformer models like BERT do not act on single words only. We humans do, but machine code looks completely different. And then there is a rabbit hole if you have to do the explaining, so I would like to show you my simple way to convince clients.
First I tell them: "Hey, when I prepare the domain-specific data of your corporation, I have a tool. It's called a tokenizer, and it performs four tasks. There's a normalizer, a pre-tokenizer, the tokenizer model itself that I use (number three will be the main point here), and then of course the post-processing for the special tokens and the attention masks that I need if I work with BERT models, Transformer models and so on." And we have libraries with an already predefined structure that helps a lot, for example at Hugging Face.
Now focus on point number three, the tokenizer model. I have here for you two videos on my YouTube channel where I explain in detail how two different models of a tokenizer work.
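As a rough illustration of the four tokenizer tasks named above (normalizer, pre-tokenizer, tokenizer model, post-processing), here is a minimal sketch in plain Python. It is not the Hugging Face implementation: the vocabulary, the function names and the whole-word lookup used as the "model" step are assumptions made only for this example. A real pipeline would use a trained BPE or WordPiece model in step three.

```python
import re
import unicodedata

# Hypothetical toy vocabulary; a real tokenizer learns it from the domain corpus.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "quantum": 4, "field": 5, "theory": 6}


def normalize(text: str) -> str:
    """Task 1: lowercase and strip accents (NFD, then drop combining marks)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def pre_tokenize(text: str) -> list:
    """Task 2: split into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenizer_model(words: list) -> list:
    """Task 3: map words to tokens. A whole-word lookup stands in for BPE/WordPiece."""
    return [word if word in VOCAB else "[UNK]" for word in words]


def post_process(tokens: list, max_len: int = 8) -> dict:
    """Task 4: add special tokens, pad to max_len, build the attention mask."""
    tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
    padding = max_len - len(tokens)
    return {
        "tokens": tokens + ["[PAD]"] * padding,
        "input_ids": [VOCAB[t] for t in tokens] + [VOCAB["[PAD]"]] * padding,
        "attention_mask": [1] * len(tokens) + [0] * padding,
    }


def encode_text(text: str, max_len: int = 8) -> dict:
    """Run the four tasks in order."""
    return post_process(tokenizer_model(pre_tokenize(normalize(text))), max_len)


encoded = encode_text("Quantum Field Théory!")
print(encoded["tokens"])
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The exclamation mark is not in the toy vocabulary, so it becomes "[UNK]". With a sub-word model in task three, that fallback becomes rare, which is the point made next in the transcript.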
The first is byte pair encoding, BPE. This is the most common and a really powerful tool. The second, which comes with the BERT structure, is called the WordPiece model.
Now the first one, byte pair encoding. It works by starting from the single characters in a word. Then it analyzes them, merges them together based on frequency, and creates new tokens from the bottom up. The advantage of BPE is that it can build words it has never seen by using multiple sub-word tokens. You need smaller vocabularies, and you have a good chance that you end up with no unknown tokens. This is why it's such a great system, and I like to work a lot with BPE.
WordPiece takes more or less the opposite path. When it tokenizes, WordPiece tries to match the longest possible piece of a word first, and then it splits the rest of the word into further sub-word tokens. It is completely different, as you can see. If you have to choose, I would recommend that in general you go first with BPE.
Okay, let me give you an example; it's always good to have an example. Here are a two-word term and a three-word term: quantum chromodynamics, a term from science, and quantum field theory, also of course from theoretical physics. So the first tokenizer, BPE: how does it analyze this, and what tokens does it come up with? You can see it splits quantum chromodynamics into four different tokens and quantum field theory into three different tokens. Great. Now BERT does it a little bit differently. You see here that even the second token from BPE, "chrom", is now split into "ch" and "rom", and you also see that "theory", the last word here, is split again into smaller pieces. So you see you have a completely different structure of your vocabulary, where you have your token and the numerical value assigned to it. So choose wisely.
Anyway, then the clients. If your client demands a dedicated solution for their corporate, maybe secret, domain data, great. I mean, they are not interested in a general system that has been trained on billions of sentences about politics, news, economy, finance, whatever. A client wants a dedicated NLP system: fast, narrowly focused only on their domain knowledge, efficient and performance-oriented.
So what are the steps? Normally I create an individual tokenizer from scratch and train it on their corporate data. With this tokenizer I take a BERT model and train it from scratch on their corporate data. I do the same for the bi-encoder in the Sentence Transformer, SBERT, and then I build a neural information retrieval system specific to the needs of the client. Then you have the optimization of the cloud infrastructure; maybe you hire an additional engineer for this. But of course you have to make sure that both sides understand that the price for this very specific individual solution is four to five times higher than if I just code a general solution for a client. So keep this in mind.
Now, if you want to learn more about Sentence Transformers and how to optimize them, I have a whole YouTube playlist, which you see here, on my YouTube channel. Starting on the right side you can see one, two, three, four, five, six videos just on training and preparing the data set for fine-tuning a Sentence Transformer, SBERT, in Python. And if you have the training set, I have another YouTube playlist with a lot of videos explaining in detail how to fine-tune the model, the system, and how to do domain adaptation, or the transfer of domain knowledge, for your SBERT system. So you have a lot of videos and a lot of solutions where I show you the code, the theory and the application in detail.
Let me point out four specific videos for you. At the top, I show you how to code, in Python and PyTorch, a semantic information retrieval system with Sentence Transformers. This is really an advanced neural information system, and there's a specific video for it on my channel. Then, a little bit easier if you want: if you have fine-tuned, let's say, an SBERT Sentence Transformer system that you already built on domain one, let's say mathematics, and you want to train it now on a second domain, let's say physics or chemistry or whatever you have, there's a specific video where I show you how to train SBERT on two knowledge domains.
Now, it helps with the client if you have a graphic visualization. For that I have a video for you on UMAP, parametric UMAP. All our encodings, the sentence embedding vectors, live in a high-dimensional vector space. To bring this down from a 1000-dimensional vector space, a mathematical topological space, to a three-dimensional visualization that you can show your client, where you can see clustered topics for example, you need a topological tool. In this top right video I show you how to use the topological tool UMAP, how to code this and how to apply it to your sentences, so you can have visualizations for your client.
The last video, on the bottom right, is for you if you want to go one step further. If you say, "I don't just want to have visualizations, I want to work with knowledge graphs, and I want to combine the topics of Sentence Transformers, sentence embeddings and vector spaces and use this for a graph-based data approach", because maybe your client also has some knowledge graph applications, this is the video for you. It shows how to use Sentence Transformers with graph-structured data, with a heterogeneous graph data structure, and how you combine them to gain insights into corporate data that you have never seen before.
Oh yeah, last point. ChatGPT is now, in December 2022, really trending, and a lot of people ask me: "Hey, with this we don't need Google Search anymore, we don't need an information retrieval system?" The answer is no. ChatGPT is just an LLM, a large language model. If you want to know what ChatGPT is, what Galactica is, what BLOOM is, what Flan-T5 is, what the purpose of each model is, how you can code it, how you can optimize the code and how you can tune the performance of those models, I have a specific playlist on my YouTube channel where I show you the large language models from each company, from OpenAI to Google: what they can do, how you can use them, what their characteristics are, what the theory behind them is and how they are built. In general, do not just follow a trend because it's trendy. I would like to provide some knowledge about those LLMs: what they are, how you can use them, what they are designed for, and that they have a very specific niche application.
So this was the last slide. I hope you enjoyed it a little bit, and I see you in my next video.
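The bottom-up merging that the transcript describes for byte pair encoding can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the tokenizer used in the video: the tiny corpus, the function names and the tie-breaking rule are invented for the example, and production libraries add byte-level handling and word-boundary markers.

```python
from collections import Counter


def apply_merge(symbols: tuple, pair: tuple) -> tuple:
    """Fuse every adjacent occurrence of `pair` in `symbols` into one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus: list, num_merges: int) -> list:
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    # Every word starts as a tuple of single characters, weighted by frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:  # every word is already a single token
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[apply_merge(symbols, best)] += freq
        words = merged_words
    return merges


def bpe_encode(word: str, merges: list) -> list:
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Hypothetical mini corpus standing in for a company's domain text.
corpus = ["quantum"] * 5 + ["dynamics"] * 3 + ["chromo"] * 2
merges = learn_bpe(corpus, num_merges=30)

print(bpe_encode("quantum", merges))         # a word seen during training
print(bpe_encode("chromodynamics", merges))  # a word never seen as a whole
```

The second call shows the property highlighted in the transcript: a word that never appeared in the training text is still covered by sub-word tokens and falls back to single characters at worst, so no unknown token is needed as long as its characters occurred in the corpus.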
Original Description
A lot of viewers asked how to train a Transformer (BERT, SBERT) on domain-specific knowledge, where there are a lot of special terms and complex medical or biochemical names. Can a pre-trained SBERT system learn these semantic content relations, although it has not been pre-trained on them? Is fine-tuning an SBERT system on new data sets enough to integrate this specific information?
Here is an answer to all your questions.
My YT Playlist on DATA SET for SBERT Fine-tuning:
https://www.youtube.com/watch?v=JxfS5ZjdxGE&list=PLgy71-0-2-F1Yf7waKzaywNKMCbD8FtaA
My YT Playlist on SBERT Fine-tuning:
https://www.youtube.com/watch?v=FidMAm-tj9k&list=PLgy71-0-2-F1GVPahTCcfUNIdPvaWUXJG
My YT Playlist on LLM:
https://www.youtube.com/watch?v=DNy4UhBrOKI&list=PLgy71-0-2-F0byY7llx5kyNrHHyBVguvZ
00:00 Company specific DATA
01:48 It is not WORDS
04:06 A sequence of Tokens
05:15 Client demands
06:47 My Solutions
10:42 Use Large Language Models (ChatGPT)