Coding an AI Voice Bot from Scratch: Real-Time Conversation with Python

AssemblyAI · Intermediate ·🧠 Large Language Models ·2y ago

Skills: LLM Foundations90%LLM Engineering85%Prompt Craft80%Fine-tuning LLMs70%

Key Takeaways

This video demonstrates how to build a real-time AI voice assistant using Python, leveraging AssemblyAI for speech-to-text transcription, OpenAI for text response generation, and 11 Labs for audio generation. The project utilizes various tools such as Port Audio, Brew, and pip, and requires API keys for AssemblyAI, OpenAI, and 11 Labs.

Full Transcript

in this video I'll show you how to build an AI voice bot in Python it will be able to understand realtime audio input and at the same time generate real-time audio responses here's the scenario where our AI voice bot is working at a dental clinic thank you for calling Vancouver dental clinic my name is Sandy how may I assist you hi Sandy my name is Smitha and I would like to book an appointment with the denst tomorrow hello Smitha I can definitely help you with that let me check our schedule for availability could you please tell me your preferred time for the appointment tomorrow of course uh I would like to meet the dentist tomorrow at 12:00 noon great choice Smitha I have an opening at noon tomorrow with the dentist shall I go ahead and book that appointment for you that would be perfect thanks Sandy you're welcome Smitha I have successfully booked your appointment with the dentist for tomorrow at noon please make sure to bring your insurance information there are four steps involved in building our AI voice bot first off is installing all of the necessary python libraries so that is assembly AI open Ai and 11 Labs assembly AI is going to be used for accurate realtime speech to text transcription that means that whatever you're saying in real time is going to be transcribed and once that is transcribed that transcript is passed to open AI where we'll be generating text responses of how a dental assistant would respond once we get that text response from openi we'll pass that text response to 11 Labs where audio is going to be generated and that is exactly how our AI voice bot will work first off let's start by installing all of the necessary python libraries I've went ahead and created a voice virtual environment in my project folder once I've activated that now I'm going to install all the necessary libraries so I'm going to start off with Port audio so Brew install Port audio and also we are going to do pip install assembly AI extras and then we're going to do pip install 11 labs followed by Brew install MP and lastly we're also going to install open aai all of these commands and the code for this project will be in the description box below so do check out the GitHub link first let's start off by importing the python libraries that we've just downloaded so let's start off with assembly AI uh um after that we're going to import 11 Labs specifically we're importing the generate function and stream function and then also let's import open AI once we've done that let's create a class called AI assistant next we're going to initialize this class most importantly we need the API keys for all three of these services that we're using to get an assembly AI API key click on the link in the description box below once you've created API keys for all three of these Services you can declare them here let's do assembly AI do settings dot API key equals to API key and this is where you can enter your assembly AI API key once you've done that let's also Define the open AI API key once you have all three API Keys defined let's go ahead and create an empty transcriber object after we've done this let's also create a list containing the full transcripts of everything that we're saying and also what the AI assistant is saying as well as well so let's do self. full transcript before the conversation starts we want full transcript to only include a single thing which is the prompt that we want to give to openi so let's start writing that prompt we'll first have to define the role as system and and once we've done that we also have to Define content and this is where we write our prompt so let's write you are a receptionist at a dental clinic be resourceful and efficient so that's all our prompt will contain and that is what our full transcript list will contain this full transcript list is actually really important because every time that we communicate to open AI API we will be sending a full transcript of whatever has been said by you and also by the voicebot so it's really important that you follow this specific format next we can move on to step number two which is real-time transcription with assembly AI the first thing you want to do is create a method called start transcription in start transcription we will now create a transcriber object and store it into the transcriber uh variable that we've just created so self. transcriber object equals to assembly I do realtime transcriber we'll also set the sample rate to 16,000 and we want to define a few different methods lastly we want to define something called and aeran Silent threshold and you want to set this to a th000 this defines the time in which the program will actually wait before determining that you have ended a sentence when you're talking in real time so what this code does is it connects your microphone and streams data to assembly AI API next up we'll Define a method called stop transcription what this method does is it closes the transcriber and it sets It To None again next we need to Define these four methods on data on error on open and on close these four methods Define how the realtime transcriber works so let's head on over to assem assembly a documentation in order to do so so inside of assembly a documentation for realtime streaming we're want to look at this first code example what we want to do is copy this four functions right here which we need for our code so go on over and copy this once you've done that let's head on back to vs code and paste this there when you're pasting the code ensure that it's aligned once you've done that we have to make a few changes to the code that we've just pasted first off let's change the parameters to import self in each of these methods once you've done that what I want to do is actually comment out this code in on open because I don't want to actually print out anything in terminal besides the transcripts and instead I'm just going to write return I'm also going to be doing the same thing for the methods called on error and on close we also want to make some changes to the on data method so the on data method is really important because we actually get to Define what we want to do with the real-time transcript which is coming in from assembly ai's API so in the second if statement right here is where we actually receive the final real-time transcript that means that whenever you finish saying a sentence that entire sentence is actually being printed out or sent to you right here instead instead of printing it out what I want to do is send that over to a new method called generate AI response which we will be defining and the parameter for this will be the transcript we are now at step three where we're going to write code to pass the realtime transcript to open ai's API we'll start off by writing a function called generate AI response here the parameters will be self and transcript the very first thing that we're going to do in this method is called the stop transcription method the reason why we're doing that is because we want to pause the realtime transcription stream while we are passing and communicating with openi API so let's do self. stop transcription after which we want to now add our real-time transcript to our full transcript list next we also want to print out our real-time transcript which the user has just said now we're ready to pass this transcript directly to openi API for the model we're going to be making use of GPT 3.5 Turbo and for messages we are going to be passing the full transcript after which let's define a parameter called AI response AI response is equals to response do choices what this line of code does is it retrieves the response from open eyes API and stores it into AI response and at this point what we can do is we can go ahead and generate audio so that's exactly what we'll do we'll do self do generate audio and this is a method that we now have to go ahead and create and we're going to pass AI response as a parameter at this point once we have generated audio we can go ahead and restart the real-time transcription so you can continue having that conversation so what we want to do is now call the start transcription function at this point we're at the last and final step where we'll be generating audio with 11 laps so we're going to create a method called generate audio and the parameters will be self and text this text right here is actually the response from open ai's API and the first thing that we want to do is add that into full transcript next we also want to print out this text saying that it is actually from the AI assistant next we have to write the code to send a request to 11 Labs API and we'll be making use of the generate function that we imported at the beginning of this for voice I'm going to be selecting Rachel but there's a bunch of different voices available on 11 Labs which you can feel free to browse and select the ones that you want and I'm also setting the stream parameter to through and I'm going to call the stream function and pass this audio stream so this is the end of the generate audio method next we're actually going to define the start and end of our project we'll start off by defining the initial greeting that our AI voicebot has to say so we'll say thank you for calling Vancouver Dental Clinic my name is Sandy how may I assist you so this is the initial greeting which our AI voice bot will read out to us before starting our real-time transcription passing it to open Ai and then generating more audio now let us initialize the class AI assistant and the first thing that we want to do is call the generate audio method and pass greeting inside after which we also want to call the start transcription function at this point you can hit save and start running this project now I can go into terminal and run our python file thank you for calling Vancouver dental clinic my name is Sandy how may I assist you hi Sandy I'm Smitha and I'd like to book an appointment with Dr Lee tomorrow hello Smitha I'm happy to help you with that let me check Dr Lee's availability for tomorrow could you please tell me your preferred time for the appointment uh I would like to book it at 3 p.m. tomorrow I'm sorry but Dr Lee is fully booked tomorrow afternoon however we do have availability at 10: a.m. or 1 p.m. would any of these times work for you Smitha yes uh 1 p.m. actually works for me Sandy great I've successfully scheduled your appointment with Dr Lee for tomorrow at 1 p.m. can I have your phone number to confirm the appointment Smitha yes my phone number is 1 2 3 4 5 6 7 8 9 thank you Smitha I have your phone number as 1 123456789 you will receive a confirmation call or text shortly if you have any other questions or need further assistance feel free to let me know thank you for choosing Vancouver dental clinic check out this next video to learn how to transcribe a live phone call in Python using assembly Ai and twio

Original Description

🔑 Get your AssemblyAI API key here: https://www.assemblyai.com/?utm_source=youtube&utm_medium=referral&utm_campaign=yt_smit_17 Learn how to build a real-time AI voice assistant using Python that can handle incoming calls, transcribe speech, generate intelligent responses, and provide a human-like conversational experience. Perfect for call centers, customer support, and virtual receptionist applications. In this coding tutorial, you'll integrate multiple cutting-edge technologies, including: 1. Assemblyai Speech-to-Text API for accurate real-time transcription. 2. OpenAI's powerful language models for natural language processing (NLP) and response generation. 3. ElevenLabs' AI voice synthesis to convert text responses into natural-sounding audio. Step-by-step, you'll create a Python application that seamlessly combines these APIs, enabling your AI assistant to listen to incoming audio, comprehend the speech, formulate contextual responses, and communicate back with synthesized voice in real-time. Github code: https://github.com/smithakolan/AssemblyAI-AI-Voice-Bot/ Timestamps: 00:00 - Intro & Demo of application 01:10 - Outline of application 01:58 - Step 1: download python libraries 06:21 - Step 1: Streaming Speech-to-Text with AssemblyAI 12:11 - Step 3: OpenAI Chat completion 15:32 - Step 4: Generate Human-like audio with Elevenlabs 18:48 - Running our AI Call Assistant #AIVoiceAssistant #RealTimeSpeechRecognition #NaturalLanguageProcessing #AIVoiceSynthesis #PythonTutorial #CallCenterAutomation #VoiceBot #StreamingSpeechtoText ▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬▬ 🖥️ Website: https://www.assemblyai.com 🐦 Twitter: https://twitter.com/AssemblyAI 🦾 Discord: https://discord.gg/Cd8MyVJAXd ▶️ Subscribe: https://www.youtube.com/c/AssemblyAI?sub_confirmation=1 🔥 We're hiring! Check our open roles: https://www.assemblyai.com/careers ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ #MachineLearning #DeepLearning

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from AssemblyAI · AssemblyAI · 0 of 60

← Previous Next →

Python Speech Recognition in 5 Minutes

Python Speech Recognition in 5 Minutes

Python Click Part 1 of 4

Python Click Part 1 of 4

Python Click Part 2 of 4

Python Click Part 2 of 4

Python Click Part 3 of 4

Python Click Part 3 of 4

Python Click Part 4 of 4

Python Click Part 4 of 4

Deep learning in 5 minutes | What is deep learning?

Deep learning in 5 minutes | What is deep learning?

How to make a web app that transcribes YouTube videos with Streamlit | Part 1

How to make a web app that transcribes YouTube videos with Streamlit | Part 1

How to make a web app that transcribes YouTube videos with Streamlit | Part 2

How to make a web app that transcribes YouTube videos with Streamlit | Part 2

Batch normalization | What it is and how to implement it

Batch normalization | What it is and how to implement it

Real-time Speech Recognition in 15 minutes with AssemblyAI

Real-time Speech Recognition in 15 minutes with AssemblyAI

Regularization in a Neural Network | Dealing with overfitting

Regularization in a Neural Network | Dealing with overfitting

Add speech recognition to your Streamlit apps in 5 minutes

Add speech recognition to your Streamlit apps in 5 minutes

Transformers for beginners | What are they and how do they work

Transformers for beginners | What are they and how do they work

Automatic Chapter Detection With AssemblyAI | Python Tutorial

Automatic Chapter Detection With AssemblyAI | Python Tutorial

Deep Learning Series Part 1 - What is Deep Learning?

Deep Learning Series Part 1 - What is Deep Learning?

Deep Learning Series part 2 - Why is it called “Deep Learning”?

Deep Learning Series part 2 - Why is it called “Deep Learning”?

Activation Functions In Neural Networks Explained | Deep Learning Tutorial

Activation Functions In Neural Networks Explained | Deep Learning Tutorial

Deep Learning Series part 3 - Deep Learning vs. Machine Learning

Deep Learning Series part 3 - Deep Learning vs. Machine Learning

Deep Learning Series part 4 - Why is Deep Learning better for NLP?

Deep Learning Series part 4 - Why is Deep Learning better for NLP?

Intro to Batch Normalization Part 1

Intro to Batch Normalization Part 1

Intro to Batch Normalization Part 2

Intro to Batch Normalization Part 2

Intro to Batch Normalization Part 3 - What is Normalization?

Intro to Batch Normalization Part 3 - What is Normalization?

Intro to Batch Normalization Part 4

Intro to Batch Normalization Part 4

Intro to Batch Normalization Part 5

Intro to Batch Normalization Part 5

Sentiment Analysis for Earnings Calls with AssemblyAI

Sentiment Analysis for Earnings Calls with AssemblyAI

Summarizing my favorite podcasts with Python

Summarizing my favorite podcasts with Python

Introduction to Regularization

Introduction to Regularization

How/Why Regularization in Neural Networks?

How/Why Regularization in Neural Networks?

Getting Started With Torchaudio | PyTorch Tutorial

Getting Started With Torchaudio | PyTorch Tutorial

Types of Regularization

Types of Regularization

Tuning Alpha in L1 and L2 Regularization

Tuning Alpha in L1 and L2 Regularization

Dropout Regularization

Dropout Regularization

What is GPT-3 and how does it work? | A Quick Review

What is GPT-3 and how does it work? | A Quick Review

Backpropagation For Neural Networks Explained | Deep Learning Tutorial

Backpropagation For Neural Networks Explained | Deep Learning Tutorial

Jupyter Notebooks Tutorial | How to use them & tips and tricks!

Jupyter Notebooks Tutorial | How to use them & tips and tricks!

Best Free Speech-To-Text APIs and Open Source Libraries

Best Free Speech-To-Text APIs and Open Source Libraries

Regularization - Early stopping

Regularization - Early stopping

Regularization - Data Augmentation

Regularization - Data Augmentation

Bias and Variance for Machine Learning | Deep Learning

Bias and Variance for Machine Learning | Deep Learning

Recurrent Neural Networks (RNNs) Explained - Deep Learning

Recurrent Neural Networks (RNNs) Explained - Deep Learning

What is BERT and how does it work? | A Quick Review

What is BERT and how does it work? | A Quick Review

Introduction to Transformers

Introduction to Transformers

Transformers | What is attention?

Transformers | What is attention?

Transformers | how attention relates to Transformers

Transformers | how attention relates to Transformers

Transformers | Basics of Transformers

Transformers | Basics of Transformers

Supervised Machine Learning Explained For Beginners

Supervised Machine Learning Explained For Beginners

Transformers | Basics of Transformers Encoders

Transformers | Basics of Transformers Encoders

Transformers | Basics of Transformers I/O

Transformers | Basics of Transformers I/O

How to evaluate ML models | Evaluation metrics for machine learning

How to evaluate ML models | Evaluation metrics for machine learning

Unsupervised Machine Learning Explained For Beginners

Unsupervised Machine Learning Explained For Beginners

Weight Initialization for Deep Feedforward Neural Networks

Weight Initialization for Deep Feedforward Neural Networks

Q-Learning Explained - Reinforcement Learning Tutorial

Q-Learning Explained - Reinforcement Learning Tutorial

Should You Use PyTorch or TensorFlow in 2022?

Should You Use PyTorch or TensorFlow in 2022?

What is Layer Normalization? | Deep Learning Fundamentals

What is Layer Normalization? | Deep Learning Fundamentals

I created a Python App to study FASTER

I created a Python App to study FASTER

How to create your FIRST NEURAL NETWORK with TensorFlow!

How to create your FIRST NEURAL NETWORK with TensorFlow!

Neural Networks Summary: All hyperparameters

Neural Networks Summary: All hyperparameters

Getting Started with OpenAI API and GPT-3 | Beginner Python Tutorial

Getting Started with OpenAI API and GPT-3 | Beginner Python Tutorial

Convert Speech-To-Text In Python in 60 seconds!

Convert Speech-To-Text In Python in 60 seconds!

Gradient Clipping for Neural Networks | Deep Learning Fundamentals

Gradient Clipping for Neural Networks | Deep Learning Fundamentals

This video teaches you how to build a real-time AI voice assistant using Python, covering topics such as speech-to-text transcription, text response generation, and audio generation. By following the steps outlined in the video, you can create a conversational AI system that can handle incoming calls and provide human-like responses.

Key Takeaways

Install necessary Python libraries
Create a voice virtual environment
Activate the virtual environment
Install Port Audio
Install AssemblyAI with extras
Create a method called start transcription
Define a silent threshold
Copy and modify four functions from AssemblyAI documentation
Stop transcription
Define methods for on data, on error, on open, and on close

💡 The key to building a successful real-time AI voice assistant is to integrate multiple AI APIs and fine-tune the models for specific tasks, such as speech-to-text transcription and text response generation.

🔒 Pro feature: Ask AI to explain this lesson →

More on: LLM Foundations

View skill →

Getting Started with Vertex AI Gemini 1.5 Flash

I TRAINED AN AI TO SOLVE 2+2 (w/ Live Coding)

I TRAINED AN AI TO SOLVE 2+2 (w/ Live Coding)

How to use the ChatGPT API with Python!!

How to use the ChatGPT API with Python!!

Nicholas Renotte

Gemini 2.5: Create an interactive plot of economic data

Gemini 2.5: Create an interactive plot of economic data

Google DeepMind

LangChain Chatbots: Building a Personalized AI Assistant

LangChain Chatbots: Building a Personalized AI Assistant

Analytics Vidhya

Auto-generating meeting notes with Python

Auto-generating meeting notes with Python

Related Reads

Sonnet 5 vs GLM-5.2 vs everyone: how to pick the cheapest LLM API in 2026

Learn how to choose the cheapest LLM API in 2026 by comparing pricing models of Sonnet 5 and GLM-5.2

JSON-Schema masks can block needed tool calls

Learn how JSON-Schema masks can block needed tool calls in LLM agents and how to sidestep the problem with a two-pass inference hack

The Invisible Cage: What the Evolution from Claude Sonnet 4.6

Explore the evolution of AI models like Claude Sonnet 4.6 and its implications on AI development

The Best Vector Database in 2026: Qdrant vs Pinecone vs Weaviate vs Milvus vs pgvector

Learn how to choose the best vector database for your RAG system in 2026, comparing Qdrant, Pinecone, Weaviate, Milvus, and pgvector

Dev.to · Darshit Radadiya

Chapters (7)

Intro & Demo of application

1:10 Outline of application

1:58 Step 1: download python libraries

6:21 Step 1: Streaming Speech-to-Text with AssemblyAI

12:11 Step 3: OpenAI Chat completion

15:32 Step 4: Generate Human-like audio with Elevenlabs

18:48 Running our AI Call Assistant

5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems

Dave Ebbelaar (LLM Eng)