Domain-Specific AI Models: How to Create Customized BERT and SBERT Models for Your Business

Discover AI · Beginner ·🧠 Large Language Models ·3y ago

Key Takeaways

The video demonstrates how to create customized BERT and SBERT models for domain-specific applications, using tools like Hugging Face, PyTorch, and Sentence Transformers, and techniques such as fine-tuning and tokenization with BPE and WordPiece.

Full Transcript

Hello community. Today I want to talk about domain-specific NLP models: Transformer models that have a very specific focus. So here we go.

When I talk to my clients, I hear a lot of: "I have a specific domain data set. I'm working in biomedical, in finance, in legal; I need some very specific systems." And the question I get asked is: "How can you train this system for a query system that I want in my company? I'm sure that my words, my product names, my partners, my patterns, and my, I don't know, science are definitely not in the general data set that common NLP models have been trained on. So can you build me something?"

Well, let's answer this. When I tell them, "There's a BERT system, and it has a vocabulary of 30,000 tokens in the general default case (of course we can program higher values)," they say, "Hey, but my specific corporate domain knowledge alone already has, I don't know, 10,000 specific words, so this will not work at all." So you have to tell them: calm down. Transformer models like BERT do not act on single words only. We humans do, but machine code looks completely different. And then there is a rabbit hole if you have to do the explaining, so I would like to show you my simple way to convince clients.

First I tell them: when I prepare the domain-specific data of your corporation, I have a tool called a tokenizer, and it performs four tasks. There is a normalizer, a pre-tokenizer, the tokenizer model itself (number three will be the main point here), and then of course post-processing for the special tokens and the attention masks that I need if I work with BERT models, Transformer models, and so on. We have libraries with some predefined structure that helps a lot, for example from Hugging Face. Now focus on point number three, the tokenizer model. I have two videos on my YouTube channel that explain in detail how two different tokenizer models work.
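The four tokenizer tasks named above can be sketched in plain Python. This is a minimal illustration, not the Hugging Face implementation: the tiny vocabulary, the function names, and the split of "theory" into "the" + "##ory" are all assumptions made up for the example, and the model stage uses the greedy longest-match rule commonly used for WordPiece-style vocabularies.

```python
import unicodedata

# Hypothetical toy vocabulary; a real one is learned from the corporate corpus.
VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]",
         "quantum", "field", "the", "##ory", "ch", "##rom"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def normalize(text: str) -> str:
    """Stage 1 (normalizer): Unicode-normalize and lowercase the raw text."""
    return unicodedata.normalize("NFKC", text).lower().strip()


def pre_tokenize(text: str) -> list[str]:
    """Stage 2 (pre-tokenizer): split into word-level chunks on whitespace."""
    return text.split()


def model(word: str) -> list[str]:
    """Stage 3 (tokenizer model): greedy longest-match split into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces that continue a word carry the "##" prefix.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in TOKEN_TO_ID:
                break
            end -= 1
        if end == start:  # nothing in the vocabulary matches
            return ["[UNK]"]
        pieces.append(candidate)
        start = end
    return pieces


def post_process(tokens: list[str], max_len: int) -> tuple[list[int], list[int]]:
    """Stage 4 (post-processing): special tokens, padding, attention mask."""
    tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
    mask = [1] * len(tokens) + [0] * (max_len - len(tokens))
    tokens = tokens + ["[PAD]"] * (max_len - len(tokens))
    return [TOKEN_TO_ID[t] for t in tokens], mask


def encode(text: str, max_len: int = 8):
    """Run all four stages and return (subword tokens, input ids, mask)."""
    tokens = []
    for word in pre_tokenize(normalize(text)):
        tokens.extend(model(word))
    ids, mask = post_process(tokens, max_len)
    return tokens, ids, mask


tokens, ids, mask = encode("Quantum Field Theory")
print(tokens)  # ['quantum', 'field', 'the', '##ory']
print(ids)     # [2, 4, 5, 6, 7, 3, 0, 0]
print(mask)    # [1, 1, 1, 1, 1, 1, 0, 0]
```

The model never sees "words" after stage 3: it only receives the integer ids and the attention mask, which is why a 30,000-token vocabulary can still cover domain terms that are not in it as whole words.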
The first tokenizer model is byte pair encoding (BPE). This is the most common and a really powerful tool. The second, which comes with the BERT structure, is called the WordPiece model.

Byte pair encoding works by starting from the single characters in a word. It analyzes them, merges them together based on frequency, and creates new tokens from the bottom up. The advantage of BPE is that it can build words it has never seen by using multiple subword tokens. You need smaller vocabularies, and you have a good chance of having no unknown tokens. This is why it's such a great system, and I like to work a lot with BPE. WordPiece takes more or less the opposite path: it tries to build long words first and then starts to split those words into multiple subword tokens. It is completely different, as you can see. If you have to choose, I would recommend that in general you go first with BPE.

Okay, let me give you an example; it's always good to have an example. Here are a two-word and a three-word term: "quantum chromodynamics", a term from science, and "quantum field theory", also of course from theoretical physics. So how does the first tokenizer, BPE, analyze this, and what tokens does it come up with? You can see it splits "quantum chromodynamics" into four different tokens and "quantum field theory" into three different tokens. Great. Now BERT does it a little bit differently. You see here that even the second token from BPE, "chrom", is now split into "ch" and "rom", and also that "theory", the last word here, is split again into sub-tokens. So you have a completely different structure of your vocabulary, where you have your token and the numerical value assigned to it. So choose wisely.

Anyway, then the clients. If your client demands a dedicated solution for their corporate, maybe secret, domain data, great. They are not interested in a general system that has been trained on billions of sentences about politics, news, economy, finance, whatever. The client wants a dedicated NLP system: fast, narrowly focused only on their domain knowledge, efficient, and performance-oriented.

So what are the steps? Normally I create an individual tokenizer from scratch and train it on their corporate data. With this tokenizer, I take a BERT model and train it from scratch on their corporate data. I do the same for the bi-encoder in Sentence Transformers (SBERT), and then I build a neural information retrieval system specific to the needs of the client. Then you have the optimization of the cloud infrastructure; maybe you hire an additional ML engineer for this. But of course you have to make sure that you both understand that the price for this very specific individual solution is four to five times higher than if I just code a general solution for a client. So keep this in mind.

Now, if you want to learn more about Sentence Transformers and how to optimize them, I have a whole YouTube playlist. You see here on my YouTube channel, starting on the right side, one, two, three, four, five, six videos just on training and preparing the data set for fine-tuning an SBERT Sentence Transformer in Python. And once you have the training set, I have another YouTube playlist with a lot of videos explaining in detail how to fine-tune the model, and how to do domain adaptation, the transfer of domain knowledge, for your SBERT system. So you have a lot of videos with a lot of solutions where I show you the code, the theory, and the application in detail.

Let me point out four specific videos for you. At the top, I show you how to code, in Python and PyTorch, a semantic information retrieval system with Sentence Transformers. This is really an advanced neural information system, and there's a specific video on my channel. Then, a little bit easier: say you have already fine-tuned an SBERT Sentence Transformer system on one domain, let's say mathematics, and you want to train it now on a second domain, let's say physics or chemistry or whatever you have. There's a specific video where I show you how to train SBERT on two knowledge domains.

Now, with a client it helps to have a graphic visualization. I have a video for you on UMAP and parametric UMAP. All our sentence embedding vectors live in a high-dimensional vector space, and to bring this down from a 1,000-dimensional vector space, a mathematical topological space, to a three-dimensional visualization that you can show your client, where you can see clustered topics for example, you need a topological tool. In this top-right video I show you how to use the topological tool UMAP, how to code it, and how to apply it to your sentences so you can have visualizations for your client.

The last video, on the bottom right, is for going one step further. If you say, "I don't just want visualizations, I want to work with knowledge graphs; I want to combine Sentence Transformers, sentence embeddings, and vector spaces and use this for a graph-based data approach," because maybe your client also has some knowledge graph applications, this is the video for you. It shows how to use Sentence Transformers with heterogeneous graph data structures and how to combine them to gain insights into corporate data that you have never seen before.

Oh yeah, last point. ChatGPT is really trending now, in December 2022, and a lot of people ask me: "Hey, with this, we don't need Google Search anymore, we don't need information retrieval systems?" And the answer is no. ChatGPT is just an LLM, a large language model. If you want to know what ChatGPT is, what Galactica is, what BLOOM is, what Flan-T5 is, what the purpose of each model is, how you can code it, how you can optimize the code, and how you can tune the performance of those models, I have a specific playlist on my YouTube channel where I show you the large language models from each company, from OpenAI to Google: what they can do, how you can use them, what their characteristics are, what the theory behind them is, and how they are built. In general, do not just follow a trend because it's trendy. I would like to provide some knowledge about those LLMs: what they are, how you can use them, what they are designed for, and that they have a very specific niche application.

So this was the last slide. I hope you enjoyed it a little bit, and I'll see you in my next video.
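The bottom-up, frequency-based merging that the transcript attributes to BPE can be sketched in a few lines of plain Python. This is a simplified character-level illustration under stated assumptions, not the tokenizer used in the video: the function names and the tiny corpus are made up, and production tokenizers add details such as byte-level handling and end-of-word markers.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` by one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Every word starts as a tuple of single characters, weighted by count.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:  # every word is already a single symbol
            break
        # Most frequent pair wins; ties are broken alphabetically.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        new_words = Counter()
        for symbols, count in words.items():
            new_words[merge_pair(symbols, best)] += count
        words = new_words
    return merges


def apply_bpe(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize a (possibly unseen) word by replaying the learned merges."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical mini corpus standing in for the client's domain text.
corpus = ["quantum", "quantum", "quantize", "chromodynamics",
          "chromosome", "field", "theory"]
merges = train_bpe(corpus, num_merges=20)
print(apply_bpe("chromatic", merges))  # a word that is not in the corpus
```

Because the fallback is always the single character, a word that never appeared in the training corpus is still split into known pieces, which is the "no unknown tokens" property mentioned in the transcript (for characters that occur in the training data).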

Original Description

A lot of viewers asked how to train a Transformer (BERT, SBERT) on domain-specific knowledge, where there are a lot of special terms and complex medical or biochemical names. Can a pre-trained SBERT system learn these semantic content relations although it has not been pre-trained on them? Is fine-tuning an SBERT system on new data sets enough to integrate this specific information? Here is an answer to all your questions.

My YT Playlist on DATA SET for SBERT Fine-tuning: https://www.youtube.com/watch?v=JxfS5ZjdxGE&list=PLgy71-0-2-F1Yf7waKzaywNKMCbD8FtaA
My YT Playlist on SBERT Fine-tuning: https://www.youtube.com/watch?v=FidMAm-tj9k&list=PLgy71-0-2-F1GVPahTCcfUNIdPvaWUXJG
My YT Playlist on LLM: https://www.youtube.com/watch?v=DNy4UhBrOKI&list=PLgy71-0-2-F0byY7llx5kyNrHHyBVguvZ

00:00 Company specific DATA
01:48 It is not WORDS
04:06 A sequence of Tokens
05:15 Client demands
06:47 My Solutions
10:42 Use Large Language Models (ChatGPT)

Playlist

Uploads from Discover AI · Discover AI · 6 of 60

1 Step Into the Unknown (by YouChat) - May 2023 be your best year yet
2 Wishing you all an amazing 2023 filled with Love, Laughter, and Happiness!
3 Create a Smarter Future!
4 The Art of Text to Vector Transformation: A Comprehensive Look at AI and NLP Transformers
5 Feature Vectors: The Key to Unlocking the Power of BERT and SBERT Transformer Models
6 Domain-Specific AI Models: How to Create Customized BERT and SBERT Models for Your Business
7 Achieve Unimaginable Levels of Domain Knowledge through SBERT Extreme in 3D (SBERT 48)
8 Unlocking Scientific Domain Knowledge w/ BPE Tokenizer: An Amazing Journey! (SBERT 49)
9 SBERT Extreme 3D: Train a BERT Tokenizer on your (scientific) Domain Knowledge (SBERT 50)
10 Discover Vision Transformer (ViT) Tech in 2023
11 Pre-Train BERT from scratch: Solution for Company Domain Knowledge Data | PyTorch (SBERT 51)
12 Flan-T5-XL model on a free COLAB | A free LLM - that explains itself w/ reasoning /write essay | AI
13 BERT and GPT in Language Models like ChatGPT or BLOOM | EASY Tutorial on Large Language Models LLM
14 Free Alternative to ChatGPT: Flan-T5-XL GUI (open-source) #shorts
15 From T5 to T5X: A Game-Changing Evolution with JAX & FLAX
16 How to start with ChatGPT? | Short Introduction to OpenAI API #shorts
17 The Future of Conversational AI? Google's PaLM w/ RLHF | LLM ChatGPT Competitor
18 Microsoft and ChatGPU
19 From Zero to FLAN-T5 XL Model GUI with Gradio: A Step-by-Step Guide on Free COLAB Notebook PyTorch
20 Google's 2nd Answer to "BING ChatGPT": Sparrow | after BARD w/ LaMDA | 2nd Gen Conversational AI
21 TF2: Pre-Train BERT from scratch (a Transformer), fine-tune & run inference on text | KERAS NLP
22 3D Visualization for BERT: How to Pre-Train with a New Layer & Fine-Tune with Downstream Task Layer
23 FLAN-T5-XXL on NVIDIA A100 GPU w/ HF Inference Endpoints, let's explore 11b models!
24 ChatGPT - Can it Lie to you?
25 ChatGPT Alternative: Perplexity by Perplexity.AI
26 2023 KerasNLP Tutorial: Explore Latest KERAS Toolbox & NLP Processing Library for BERT - TF2
27 Self-aware AI: You.com/chat vs Perplexity.ai | Live Demo, LLMs show Future of ChatGPT w/ BING
28 BLOOM 176B Inference on AWS | Bigger than GPT-3 for more Power!
29 Fine-tune ChatGPT? Buy Embeddings /OpenAI? What are Embeddings? My own ChatGPT? | Visual Q+A
30 Unleashing the Power of BLOOM 176B with AWS ml.p4de.24xlarge, DJL & DeepSpeed: The Ultimate Boost!
31 After ChatGPT: NEW BioGPT by Microsoft | Do YOU trust Microsoft for your Medication?
32 Improve ChatGPT: Modular, Adaptive, Smart LLM | Inside ChatGPT
33 Fine-tune ChatGPT w/ in-context learning ICL - Chain of Thought, AMA, reasoning & acting: ReAct
34 The Intersection of Copyright Law and Human Faces: Exploring Virtual K-Pop with MAVE
35 New TECH: Vision Transformer 2023 on Image Classification | AI
36 PyTorch code Vision Transformer: Apply ViT models pre-trained and fine-tuned | AI Tech
37 New BING ChatGPT: Unlock the Power of Emotions in your Search Engine!
38 New BING ChatGPT loses its mind
39 Self-Attention Heads of last Layer of Vision Transformer (ViT) visualized (pre-trained with DINO)
40 Visualizing the Self-Attention Head of the Last Layer in DINO ViT: A Unique Perspective on Vision AI
41 Microsoft strongly restricts access to ChatGPT on new BING - WHY?
42 PyTorch ViT: The Ultimate Guide to Fine-Tuning for Object Identification (COLAB)
43 New BING Chat AGGRESSIVE
44 Panoptic Image Segmentation: Mask2Former explained | Identify all objects!
45 Code Panoptic Image Segmentation w/ Vision Transformer & Mask2Former - A PyTorch tutorial
46 Dream Job Alert: AI Prompt Engineer - $335K | AI Prompt Design: A Crash Course
47 Streamlining Similar Image Detection with ViT in PyTorch: A Step-by-Step Guide
48 Microsoft's CEO in Trouble #shorts
49 Why wait for KOSMOS-1? Code a VISION - LLM w/ ViT, Flan-T5 LLM and BLIP-2: Multimodal LLMs (MLLM)
50 OpenAI's ChatGPT can NOW summarize external Sources on the Internet?
51 ChatGPT polarizes
52 Hospital /Clinic AI Decision Models: Performance of 12 AI LLM Systems (incl $$) Radiology, Biomed
53 ChatGPT Prompt Engineering w/ in-context learning (ICL) - 7 Examples | Tutorial
54 Chat with your Image! BLIP-2 connects Q-Former w/ VISION-LANGUAGE models (ViT & T5 LLM)
55 ChatGPT: Multidimensional Prompts
56 ChatGPT: In-context Retrieval-Augmented Learning (IC-RALM) | In-context Learning (ICL) Examples
57 Code your BLIP-2 APP: VISION Transformer (ViT) + Chat LLM (Flan-T5) = MLLM
58 Buy Microsoft "Azure OpenAI Service" or buy from OpenAI its API for ChatGPT access & tuning?
59 Pretraining vs Fine-tuning vs In-context Learning of LLM (GPT-x) EXPLAINED | Ultimate Guide ($)
60 Reversible Transformer: ReFORMER for GPU Memory Optimization! Reversible Residual Layers?

This video teaches viewers how to create customized BERT and SBERT models for domain-specific applications, covering topics such as tokenization, fine-tuning, and domain adaptation. By following the steps outlined in the video, viewers can develop tailored NLP models for their business needs.

Key Takeaways
  1. Create an individual tokenizer from scratch and train it on the corporate data
  2. Train a BERT model from scratch on the corporate data with this tokenizer
  3. Train the Sentence Transformer (SBERT) bi-encoder on the same corporate data
  4. Build a neural information retrieval system specific to the client's needs
  5. Optimize the cloud infrastructure for the customized solution
💡 Customized BERT and SBERT models can be created by training on domain-specific data and fine-tuning pre-trained models, allowing for more accurate and effective NLP tasks in specialized domains.
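The retrieval step that sits on top of the bi-encoder can be sketched as follows: documents and queries are encoded independently into vectors, and documents are ranked by cosine similarity to the query. This is a minimal sketch under loud assumptions: `BagOfWordsEncoder` is a hypothetical stand-in that only counts words, so it matches exact terms and not meaning. In a real system the vectors would come from the trained SBERT model, while the ranking logic stays the same.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BagOfWordsEncoder:
    """Hypothetical stand-in for a trained SBERT bi-encoder."""

    def __init__(self, corpus: list[str]):
        vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})
        self.index = {tok: i for i, tok in enumerate(vocabulary)}

    def encode(self, text: str) -> list[float]:
        """Map text to an L2-normalized vector (zero vector if no known word)."""
        vec = [0.0] * len(self.index)
        for tok in tokenize(text):
            if tok in self.index:
                vec[self.index[tok]] += 1.0
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else vec


def search(query, documents, doc_vectors, encoder, top_k=2):
    """Rank documents by cosine similarity to the query."""
    q = encoder.encode(query)
    # Vectors are unit length, so the dot product is the cosine similarity.
    scores = [sum(a * b for a, b in zip(q, d)) for d in doc_vectors]
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [(documents[i], scores[i]) for i in ranked[:top_k]]


# Hypothetical mini corpus standing in for the client's documents.
documents = [
    "Quantum chromodynamics describes the strong interaction between quarks",
    "Quantum field theory combines quantum mechanics and special relativity",
    "The quarterly finance report covers revenue and costs",
]
encoder = BagOfWordsEncoder(documents)
# Bi-encoder property: document vectors are computed once, ahead of any query.
doc_vectors = [encoder.encode(doc) for doc in documents]

for doc, score in search("strong interaction of quarks", documents, doc_vectors, encoder):
    print(f"{score:.3f}  {doc}")
```

Precomputing `doc_vectors` is the design choice that makes a bi-encoder practical for search: only the query has to be encoded at request time.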


Chapters (6)

0:00 Company specific DATA
1:48 It is not WORDS
4:06 A sequence of Tokens
5:15 Client demands
6:47 My Solutions
10:42 Use Large Language Models (ChatGPT)