Domain-Specific AI Models: How to Create Customized BERT and SBERT Models for Your Business

Discover AI · Beginner · 🧠 Large Language Models · 3y ago

Skills: LLM Foundations 90% · Fine-tuning LLMs 85% · Prompt Craft 80%

Key Takeaways

The video explains how to create customized BERT and SBERT models for domain-specific applications, using tools like Hugging Face, PyTorch, and Sentence Transformers, and techniques such as fine-tuning and tokenization with BPE and WordPiece.

Full Transcript

Hello community. Today I want to talk about domain-specific NLP models: Transformer models that have a very specific focus. So here we go.

When I talk to my clients, I hear a lot of: "I have a specific domain data set. I'm working in biomedical, in finance, in legal. I need some very specific systems." And the question I get asked is: "How can you train this system for a query system that I want in my company? I'm sure that my words, my product names, my partners, my patterns and my, I don't know, science are definitely not in the general data set that common NLP models have been trained on. So can you build me something?" Well, let's answer this.

When I tell them, "Well, there's a BERT system, and it has a vocabulary of 30,000 tokens in the general default case (of course we can program higher values)," they say, "Hey, but my specific corporate domain knowledge alone already has, I don't know, 10,000 specific words, so this will not work at all." You have to tell them: calm down. Transformer models like BERT do not act on single words only. We humans do, but the machine code looks completely different. And then there is a rabbit hole if you have to do the explaining, so I would like to show you my simple way to convince clients.

First I tell them: when I prepare the domain-specific data of your corporation, I have a tool called a tokenizer, and it performs four tasks. There is (1) a normalizer, (2) a pre-tokenizer, (3) the tokenizer model itself, and number three will be the main point here, and then of course (4) post-processing for the special tokens and the attention masks that I need if I work with BERT models, Transformer models and so on. We have libraries with a predefined structure that helps a lot, let's say from Hugging Face. Now focus on point number three, the tokenizer model. I have two videos for you on my YouTube channel where I explain in detail how two different tokenizer models work.
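
To make the four tokenizer tasks concrete, here is a minimal pure-Python sketch of such a pipeline. It is not the Hugging Face API (the `tokenizers` library exposes these stages as separate normalizer, pre-tokenizer, model and post-processor components); the vocabulary, the function names and `max_len` are hypothetical and chosen only for illustration. The model stage uses a greedy longest-match-first lookup with `##` continuation pieces, which is how WordPiece-style vocabularies are commonly applied at encoding time.

```python
import unicodedata

# Hypothetical toy vocabulary; a real one is learned from the domain corpus.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "quantum": 4, "field": 5, "theory": 6,
         "ch": 7, "##rom": 8, "##odynamics": 9}

def normalize(text):
    """Stage 1: lowercase the text and strip accents."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def pre_tokenize(text):
    """Stage 2: split the text into word-level chunks on whitespace."""
    return text.split()

def model(word):
    """Stage 3: greedy longest-match-first subword lookup (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        start = end
    return pieces

def post_process(pieces, max_len=10):
    """Stage 4: add special tokens, pad, and build the attention mask."""
    tokens = ["[CLS]"] + pieces[:max_len - 2] + ["[SEP]"]
    mask = [1] * len(tokens) + [0] * (max_len - len(tokens))
    tokens = tokens + ["[PAD]"] * (max_len - len(tokens))
    ids = [VOCAB[t] for t in tokens]
    return tokens, ids, mask

def encode(text, max_len=10):
    """Run all four stages in order."""
    pieces = [p for w in pre_tokenize(normalize(text)) for p in model(w)]
    return post_process(pieces, max_len)

if __name__ == "__main__":
    tokens, ids, mask = encode("Quantum Chromodynamics")
    print(tokens)
    print(ids)
    print(mask)
```

The point of the sketch is that the model never sees whole words: a term that is missing from the vocabulary is still represented as a sequence of known pieces, and the attention mask marks which positions carry real tokens and which are padding.
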
The first is byte pair encoding (BPE); this is the most common and a really powerful tool. The second, for the BERT structure, is called the WordPiece model.

Now the first one, byte pair encoding: it works by starting from the single characters in a word, then it analyzes them, merges them together based on frequency, and creates new tokens from the bottom up. The advantage of BPE is that it can build words it has never seen by using multiple subword tokens. You need smaller vocabularies, and you have a good chance that you end up with no unknown tokens. This is why it's such a great system, and I like to work a lot with BPE.

WordPiece takes more or less the opposite path. WordPiece tries to build long words first and then starts to split those words into multiple subword tokens, so it is completely different, as you can see. If you have to choose, I would recommend that in general you go with BPE first.

OK, let me give you an example; it's always good to have an example. There are two terms: "quantum chromodynamics", a term from science, and "quantum field theory", also of course from theoretical physics. The first tokenizer, BPE: how does it analyze this, and what tokens does it come up with? You can see it splits "quantum chromodynamics" into four different tokens and "quantum field theory" into three different tokens. Great. Now BERT does it a little bit differently. You see here that even the second token from BPE, "chrom", is now split into "ch" and "rom", and you also see that "theory", the last word here, is split again. So you see, you have a completely different structure of your vocabulary, where you have your token and the numerical value assigned to it. So choose wisely.

Then the client demands. If your client demands a dedicated solution for their corporate, maybe secret, domain data, great. I mean, they are not interested in a general system that has been trained on billions of sentences: politics, news, economy, finance,
whatever. If a client wants a dedicated NLP system that is fast, narrowly focused only on their domain knowledge, efficient and performance-oriented, what are the steps?

Normally I create an individual tokenizer from scratch and train it on their corporate data. I take a BERT model and train it from scratch with this tokenizer on their corporate data. I do the same for the bi-encoder in the Sentence Transformer (SBERT), and then I build a neural information retrieval system specific to the needs of the client. Then you have the optimization of the cloud infrastructure; maybe you hire an additional ML engineer for this. But of course you have to make sure that you both understand that the price for this very specific individual solution is four to five times higher than if I just code a general solution for a client. So keep this in mind.

Now, if you want to learn more about Sentence Transformers and how to optimize them, I have a whole YouTube playlist, which you see here, on my YouTube channel. Starting on the right side you can see one, two, three, four, five, six videos just on training and preparing the data set for fine-tuning a Sentence Transformer (SBERT) in Python. And if you have the training set, I have another YouTube playlist with a lot of videos explaining in detail how to fine-tune the model, and how to do domain adaptation, the transfer of domain knowledge, for your SBERT system. So you have a lot of videos and a lot of solutions where I show you the code, the theory and the application in detail.

Let me point out four specific videos for you. At the top, I show you how to code, in Python and PyTorch, a semantic information retrieval system with Sentence Transformers. This is a really advanced neural information system, and there is a specific video on my channel. Then, a little bit easier: if you have fine-tuned an SBERT Sentence Transformer system that you already built on domain one, let's say mathematics,
and you now want to train it on a second domain, let's say physics or chemistry or whatever you have, there is a specific video where I show you how to train SBERT on two knowledge domains.

Now, it helps with the client if you have a graphic visualization. I have a video for you on UMAP, parametric UMAP. All our sentence embedding vectors live in a high-dimensional vector space, and to bring this down from a 1,000-dimensional vector space, a mathematical topological space, to a three-dimensional visualization that you can show your client, where you can see clustered topics for example, you need a topological tool. In this top-right video I show you how to use the topological tool UMAP, how to code it and how to apply it to your sentences, so you can have visualizations for your client.

The last video, on the bottom right, is for going one step further. If you say, "I don't just want visualizations, I want to work with knowledge graphs, and I want to combine Sentence Transformers, sentence embeddings and vector spaces and use this for a graph-based data approach," because maybe your client also has some knowledge graph applications, this is the video for you. It covers how to use Sentence Transformers with a heterogeneous graph data structure and how to combine them to gain insights into corporate data that you have never seen before.

Oh yeah, last point. ChatGPT is really trending now, in December 2022, and a lot of people ask me, "Hey, with this we don't need Google search anymore, we don't need an information retrieval system?" The answer is no. ChatGPT is just an LLM, a large language model. If you want to know what ChatGPT is, what Galactica is, what BLOOM is, what Flan-T5 is, what the purpose of each model is, how you can code it, how you can optimize the code and how you can tune the performance of those models, I have a specific playlist on my YouTube channel where I show you the large
language models offered by each company, from OpenAI to Google: what they can do, how you can use them, what their characteristics are, what the theory behind them is and how they are built. In general, do not just follow a trend because it's trendy. I would like to provide some knowledge about those LLMs: what they are, how you can use them, what they are designed for, and that they have a very specific niche application.

So this was the last slide. I hope you enjoyed it a little bit, and I'll see you in my next video.
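
The bottom-up BPE procedure described in the transcript (start from single characters, merge the most frequent adjacent pair, repeat) can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the speaker's code and not a production tokenizer: the corpus, the function names `train_bpe`, `merge_pair` and `encode`, and the tie-breaking rule are all made up for the example.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(corpus, num_merges):
    """Learn an ordered list of merge rules from a list of words."""
    # Every distinct word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken alphabetically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

if __name__ == "__main__":
    # Hypothetical miniature "domain corpus".
    corpus = (["quantum"] * 5 + ["dynamics"] * 4 + ["chromodynamics"] * 2
              + ["thermodynamics"] * 2 + ["field"] * 3 + ["theory"] * 3)
    merges = train_bpe(corpus, num_merges=100)
    print(encode("quantum", merges))
    print(encode("electrodynamics", merges))  # a word never seen in training
```

Because the base symbols are single characters, a word that never appeared in the corpus can still be encoded as long as its characters are known, which is the "no unknown tokens" property mentioned above; pieces learned from related words are reused where they fit.
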

Original Description

A lot of viewers asked how to train Transformer models (BERT, SBERT) on domain-specific knowledge, where there are a lot of special terms and complex medical or biochemical names. Can a pre-trained SBERT system learn these semantic content relations although it has not been pre-trained on them? Is fine-tuning an SBERT system on new data sets enough to integrate this specific information? Here is an answer to all your questions.

My YT Playlist on DATA SET for SBERT Fine-tuning: https://www.youtube.com/watch?v=JxfS5ZjdxGE&list=PLgy71-0-2-F1Yf7waKzaywNKMCbD8FtaA
My YT Playlist on SBERT Fine-tuning: https://www.youtube.com/watch?v=FidMAm-tj9k&list=PLgy71-0-2-F1GVPahTCcfUNIdPvaWUXJG
My YT Playlist on LLM: https://www.youtube.com/watch?v=DNy4UhBrOKI&list=PLgy71-0-2-F0byY7llx5kyNrHHyBVguvZ

00:00 Company specific DATA
01:48 It is not WORDS
04:06 A sequence of Tokens
05:15 Client demands
06:47 My Solutions
10:42 Use Large Language Models (ChatGPT)

This video teaches viewers how to create customized BERT and SBERT models for domain-specific applications, covering topics such as tokenization, fine-tuning, and domain adaptation. By following the steps outlined in the video, viewers can develop tailored NLP models for their business needs.

Key Takeaways

Create an individual tokenizer from scratch
Train a BERT model from scratch on corporate data
Optimize cloud infrastructure for the customized solution
Fine-tune a Sentence Transformer (SBERT) on corporate data
Use a neural information retrieval system specific to the client's needs
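
The retrieval step in the list above can be illustrated with a small bi-encoder-style search: query and documents are embedded independently and ranked by cosine similarity. The sketch below is an assumption-laden stand-in, not the video's implementation. In particular, `embed` is a hypothetical hashed character-trigram encoder used only so the example runs without any model; in a real system a domain-trained SBERT model would produce the vectors. `DIM`, the documents and the function names are likewise invented for illustration.

```python
import math
import zlib

DIM = 256  # hypothetical embedding size for this toy encoder

def embed(text):
    """Stand-in for an SBERT encoder: hashed character trigrams, L2-normalized."""
    vec = [0.0] * DIM
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity of two already-normalized vectors (a dot product)."""
    return sum(x * y for x, y in zip(a, b))

def build_index(docs):
    """Embed every document once, up front, independently of any query."""
    return [(doc, embed(doc)) for doc in docs]

def search(query, index, top_k=2):
    """Embed the query on its own and rank documents by similarity."""
    q = embed(query)
    scored = [(cosine(q, vec), doc) for doc, vec in index]
    return sorted(scored, reverse=True)[:top_k]

if __name__ == "__main__":
    # Hypothetical corporate documents.
    docs = [
        "Quantum chromodynamics describes the strong interaction.",
        "Quarterly revenue grew in the finance division.",
        "Quantum field theory unifies particles and fields.",
    ]
    index = build_index(docs)
    for score, doc in search("quantum chromodynamics", index):
        print(f"{score:.3f}  {doc}")
```

The design choice that matters here is the bi-encoder shape: documents are embedded once and stored, so answering a query costs one encoding plus a similarity scan over the stored vectors.
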

💡 Customized BERT and SBERT models can be created by training on domain-specific data and fine-tuning pre-trained models, allowing for more accurate and effective NLP tasks in specialized domains.

Chapters (6)

0:00 Company specific DATA

1:48 It is not WORDS

4:06 A sequence of Tokens

5:15 Client demands

6:47 My Solutions

10:42 Use Large Language Models (ChatGPT)
