Code demo of BERTopic - BERTopic for Topic Modeling

Cohere · Advanced · 📄 Research Papers Explained · 3y ago

Key Takeaways

This video demonstrates the use of BERTopic, a popular Python library for topic modeling, to explore large text archives and identify relationships between topics. The code demo shows how to train a model, load a saved model, and customize topic labels.
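
The train-or-load step mentioned above might look like the following minimal sketch. It assumes the bertopic package is installed. The helper names, the file name arxiv_topic_model, and the sampling seed are hypothetical and not taken from the video.

```python
import random
from pathlib import Path


def sample_abstracts(abstracts, n=10_000, seed=42):
    """Randomly pick up to n abstracts (the seed is an assumption, for repeatability)."""
    abstracts = list(abstracts)
    if len(abstracts) <= n:
        return abstracts
    return random.Random(seed).sample(abstracts, n)


def load_or_train(abstracts, model_path="arxiv_topic_model"):
    """Load a saved BERTopic model if one exists; otherwise train and save one."""
    # Imported lazily so the sampling helper works without BERTopic installed.
    from bertopic import BERTopic

    if Path(model_path).exists():
        return BERTopic.load(model_path)

    # Default pipeline: sentence-transformers embeddings, UMAP, HDBSCAN, c-TF-IDF.
    topic_model = BERTopic()
    topic_model.fit_transform(abstracts)  # returns (topics, probabilities)
    topic_model.save(model_path)
    return topic_model
```

Calling load_or_train(sample_abstracts(all_abstracts)) trains on the first run and loads from disk afterwards, which mirrors the shortcut taken in the demo; all_abstracts stands for whatever list of abstract strings you have.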

Full Transcript

... arXiv articles, right? We've all read the articles here. arXiv is a really nice database with...

Yeah, before you go into it: since you're showing code, can you maybe zoom in a little bit? The code is not that clear. Right, yeah, this is much better, I think. Perfect.

You know, we want some data, and what's nicer data than a lot of abstracts from mostly computer vision and some machine learning articles? I randomly sampled 10,000 of them, just to make it a little bit easier for myself, and that's what we're going to do the topic modeling on. This abstract, for example, is about reinforcement learning, and we have a lot more of those abstracts. We want to see which topics become popular, which topics we can find, whether we can find relationships between some of those topics, whether we can do some fine-tuning, et cetera.

Now we can train our model. We can do the exact same thing as we did before: we import our package, we instantiate it, and we do fit_transform on our abstracts. Now, I can do that, but I'm kind of lazy and we have limited time, so I'm going to load in the model instead. I've done this before and saved the model, and now I'm essentially going to load it in.

So after this training procedure (and this training procedure is, of course, sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF), we can view the topics that we have created. We have a topic model, we call get_topic_info, and what we get is a DataFrame that has several columns. One column is called Topic and shows you the topic ID; it goes from -1 to 147. So we see two things happening here. There are a lot of topics that were found by default, but there's also that -1, and the -1 documents are outliers. I'm still showing the outliers because otherwise you feel like you're missing documents, but those are all the documents that couldn't be clustered. We can fine-tune it so that group becomes larger or smaller, but that depends on HDBSCAN.

We have the count of the number of documents in a topic, but most interestingly we have the topic representation: the topic ID and then the words that best represent that cluster, that topic. If you pick this one, that's reinforcement learning, right? And we have some object detection, and we have some adversarial attacks, and we can go through all of these topics and see what exactly is happening here. We can also say, okay, we're going to pick the top 10 most frequent ones and read through them. We see segmentation, some molecular topics (we can dive into those to see what exactly is being talked about), and something about Transformer models, self-attention, vision. Interesting to see that also being there.

We can go through all of these topics and read through them, but we can also say: okay, I don't like this representation. I don't like the 4 here, I don't like those underscores, and I think there are way too many words here. So let's customize our topic labels. We can generate topic labels automatically based on the words that we have here, but instead let's say we keep three words, we remove the topic prefix because it's ugly or annoying or whatever the reason might be, and we change the separator, because an underscore might make it more difficult to read. One thing that we add on top of that: let's define our own topic labels, because I think I know that this one is about Transformer-based models, and I know that this one is about reinforcement learning. So we can also do some topic labeling here.

Now we've run this and we get exactly what we wanted: a representation that is a little bit nicer than the previous one. We might be doing this because we want to visualize certain topics, and some topics are more interesting than others, so we label them. But you can also say, I want 10 or 20 words, so it gives you a little bit more understanding of what is happening here.

There are a bunch more things that we can do with that topic model. We can update the topics to change the n-gram range, because I want words concatenated in a certain way. I might want to merge certain topics because I think they are very similar to one another and I don't want to consider them separately. I might want to reduce the topics to 150 or 10. I might want to find certain topics that I couldn't find before, because I don't want to go through 150 topics to see if my topic of interest is in there. There are many things we can do with something like this that give us a little bit more control over how our topic model looks.

But after doing all of this, we still want a little bit more knowledge about which topics are in there, but also the relationships between topics, between documents and topics, or between the words within the topics, et cetera. For that we come to the second pillar, and that's the visualizations that are possible within BERTopic.
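
The inspection, refinement, and labeling steps described in the transcript can be sketched roughly as follows. This is an illustration under assumptions, not the code shown in the video: the wrapper function names are hypothetical, and the BERTopic method signatures used here (merge_topics, reduce_topics, update_topics, generate_topic_labels, set_topic_labels, find_topics) have changed between releases, so check them against the version you have installed.

```python
from collections import Counter


def outlier_share(topic_model):
    """Fraction of documents assigned to topic -1, i.e. left unclustered."""
    counts = Counter(topic_model.topics_)  # topics_ holds one topic ID per document
    total = sum(counts.values())
    return counts[-1] / total if total else 0.0


def refine_topics(topic_model, docs, merge=None, nr_topics=None, n_gram_range=None):
    """Optionally merge topics, reduce their number, and recompute topic words."""
    if merge is not None:
        topic_model.merge_topics(docs, merge)  # e.g. [1, 2] merges topics 1 and 2
    if nr_topics is not None:
        topic_model.reduce_topics(docs, nr_topics=nr_topics)
    if n_gram_range is not None:
        # (1, 2) allows two-word phrases in the topic representation.
        topic_model.update_topics(docs, n_gram_range=n_gram_range)
    return topic_model


def relabel_topics(topic_model, custom_labels=None, nr_words=3, separator=", "):
    """Generate shorter labels without the topic-ID prefix, then apply manual overrides."""
    labels = topic_model.generate_topic_labels(
        nr_words=nr_words, topic_prefix=False, separator=separator
    )
    topic_model.set_topic_labels(labels)
    if custom_labels:
        # Mapping of topic ID -> hand-written label; the IDs are placeholders.
        topic_model.set_topic_labels(custom_labels)
    return topic_model.get_topic_info()


def search_topics(topic_model, query, top_n=5):
    """Return (topic_ids, similarities) for the topics closest to a search term."""
    return topic_model.find_topics(query, top_n=top_n)
```

For example, relabel_topics(topic_model, {4: "Transformer-based models", 0: "Reinforcement learning"}) would apply two hand-written labels; the topic IDs 4 and 0 are placeholders, so read the real ones from get_topic_info() first. Running refine_topics before relabel_topics is probably the safer order, since merging or reducing topics changes the topic IDs.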

Original Description

BERTopic for Topic Modeling - Talking Language AI full episode: https://www.youtube.com/watch?v=uZxQz87lb84

Exploring large text archives with software is commonly called "topic modeling". Go in-depth into BERTopic (the popular Python topic modeling library) with its creator, Maarten Grootendorst. We explore three important pillars of the package: modularity, variations, and visualizations. Each of the pillars demonstrates how BERTopic gives control back to the developer, allowing for a one-stop shop for topic modeling. This video also demonstrates BERTopic's basic capabilities and some advanced tricks that new and advanced users of BERTopic may enjoy.

Maarten is an open source developer and maintainer (BERTopic, PolyFuzz, KeyBERT), data scientist, and psychologist.

===

Join the Cohere Discord: https://discord.gg/co-mmunity
Discussion thread for this episode (feel free to ask questions): https://discord.com/channels/95442198...
Maarten on Twitter: https://twitter.com/MaartenGr
BERTopic: https://maartengr.github.io/BERTopic/
BERTopic on GitHub: https://github.com/MaartenGr/BERTopic
BERTopic paper: https://arxiv.org/abs/2203.05794

Playlist

Uploads from Cohere · Cohere · 20 of 60

1 Andreas Madsen on Independent Research and Interpretability
2 Plex: Towards Reliability using Pretrained Large Model Extensions
3 Independent Research Panel Discussion
4 The Future of ML Ops: Open Challenges and Opportunities
5 C4AI Special - Grad School Applications
6 Cohere For AI Fireside Chat: Samy Bengio
7 Cohere For AI - Scholars Program Information Session
8 Modular and Composable Transfer Learning with Jonas Pfeiffer
9 Jay Alammar Presents Large Language Models for Real World Applications
10 Catherine Olsson - Mechanistic Interpretability: Getting Started
11 How To Prompt Engineer a Tech Interview App | TOHacks 2022 Winners
12 C4AI Sparks: Samy Bengio
13 BERTopic for Topic Modeling - Maarten Grootendorst - Talking Language AI Ep#1
14 Exploring News Headlines With Text Clustering | Jay Alammar
15 Scale TransformX | Fireside Chat: Aidan Gomez and Alexandr Wang
16 Making Large Language Models Accessible | Scale AI Fireside chat with Bill MacCartney
17 Intro to KeyBERT - BERTopic for Topic Modeling
18 Intro to PolyFuzz - BERTopic for Topic Modeling
19 API Design Philosophy - BERTopic for Topic Modeling
20 Code demo of BERTopic - BERTopic for Topic Modeling
21 Short texts vs long texts in BERTopic- BERTopic for Topic Modeling
22 How People can help BERTopic - BERTopic for Topic Modeling
23 Cohere For AI: Training Sensorimotor Agency in Cellular Automata with Bert Chan
24 Cohere API Community Demos | October 2022
25 Perfect Prompt Demo By Arjun Patel
26 Project Idea Generator Demo By Tobechukwu Okamkpa
27 SuperTransformer Demo By Amir Nagri and Team Megatron
28 Cohere For AI Fireside Chat: Pablo Samuel Castro
29 How Startups Can Use NLP to Build a Competitive Moat
30 Build Chatbots Faster with Large Language Models
31 Tools to Improve Training Data - Vincent Warmerdam - Talking Language AI Ep#2
32 Utku Evci - Sparsity and Beyond Static Network Architectures
33 Adding human intelligence to ML models with human-learn #shorts #machinelearning #nlp
34 Iterating on your data with doubtlab - Tools to Improve Training Data
35 Adding Human Intelligence to ML models with Human learn - Tools to Improve Training Data
36 Scikt Learn embeddings helpers with Embetter - Tools to Improve Training Data
37 Building Cohere API Demo App With Streamlit | Adrien Morisot
38 Rosanne Liu - career creation for non-standard candidates
39 Giving computers many human languages with Cohere's multilingual embeddings
40 Learning by Distilling Context with Charlie Snell
41 Sentence Transformers and Embedding Evaluation - Nils Reimers - Talking Language AI Ep#3
42 Reflecting on for.ai...
43 Create a Custom Language Model with Surge AI and Cohere
44 Cohere API Community Demos | November 2022
45 Cohere API Community Demos | December 2022
46 Cohere For AI Presents: Colin Raffel
47 Lucas Beyer - FlexiViT: One Model for All Patch Sizes
48 What is Neural Search? Nils Reimers - Sentence Transformers and Embedding Evaluation
49 Evaluating Information Retrieval with BEIR
50 Evaluating Embeddings with MTEB Massive text embeddings benchmark - Nils Reimers
51 High quality text classification with few training examples with SetFit
52 Multilingual and cross lingual embeddings - Nils Reimers
53 Developing open-source software: lessons, benefits, and challenges - Nils Reimers
54 Ask Me Anything with Ed Grefenstette, Head of Machine Learning at Cohere
55 HyperWrite Powers Its Generative AI Service with Cohere
56 EMNLP 2022 Conference Special Edition - Talking Language AI #4
57 Cohere API Community Demos | January 2023
58 C4AI Sparks: Rosanne Liu on Career Creation for Non-Standard Candidates
59 Michael Tschannen - Image-and-Language Understanding from Pixels Only
60 How to Add AI to your App

This video teaches how to use BERTopic for topic modeling and customize topic labels. It demonstrates how to train a model, load a saved model, and visualize topic relationships.

Key Takeaways
  1. Import the BERTopic library
  2. Load a dataset of text abstracts
  3. Train a BERTopic model
  4. Load a saved BERTopic model
  5. Customize topic labels
  6. Visualize topic relationships
💡 BERTopic allows for customizable topic labeling and visualization of topic relationships, enabling more effective exploration of large text archives.
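
The last step in the list, visualizing topic relationships, might be sketched as below. It assumes a fitted or loaded BERTopic model and relies on BERTopic's visualize_* methods returning Plotly figures. The function names, the choice of figures, and the output directory are hypothetical and not taken from the video.

```python
from pathlib import Path


def build_figures(topic_model, top_n_topics=10):
    """Create a few of BERTopic's built-in figures, keyed by a short name."""
    return {
        "topics": topic_model.visualize_topics(),        # inter-topic distance map
        "hierarchy": topic_model.visualize_hierarchy(),  # how topics group together
        "heatmap": topic_model.visualize_heatmap(),      # topic-to-topic similarity
        "barchart": topic_model.visualize_barchart(top_n_topics=top_n_topics),
    }


def export_figures(figures, out_dir="figures"):
    """Write each figure to <out_dir>/<name>.html and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, fig in figures.items():
        path = out / f"{name}.html"
        fig.write_html(str(path))  # Plotly figures can be saved as standalone HTML
        paths.append(path)
    return paths
```

In a notebook, a single figure can also be displayed directly with fig.show() instead of being written to disk.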
