An OpenAI ChatGPT-4o Replica: Voice-based Chatbot

Fateh Ali Aamir
Jul 2, 2024


GPT-4o impressed us with those amazing demos from OpenAI, and we’d all love to try it out. However, you can’t get that same experience in your own application because there is no API support for it yet. So how do you get that experience? By building it yourself.
Today I’m presenting a voice-based chatbot that uses OpenAI’s services and tools to give us that speech-to-speech experience. This is a foundational application model: it can be adapted to various use cases such as RAG, phone calling, or virtual assistants. The possibilities are endless. Let’s dive in.

Architecture

So the application works in the following steps (sketched in code right after this list):
1. Input Audio: We take the input using the sounddevice library
2. Transcription Model: The OpenAI Whisper model converts our audio to text
3. Prompt + Transcription: Both components are stitched together and forwarded to the LLM
4. LLM: OpenAI gpt-3.5-turbo is currently used to generate responses
5. LLM Response: Take the response and pass it to the TTS model
6. Text-to-Speech: Convert the text to speech using the OpenAI TTS model
7. Output Audio: Play the audio using the pygame library
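
In code, the whole pipeline reduces to a chain of the functions we are about to build. A minimal sketch (in the real code below, record_audio triggers the downstream steps itself, and everything is wired up behind Flask endpoints):

record_audio("output.wav", 44100)      # 1. capture microphone input until stopped
text = transcribe_audio("output.wav")  # 2. Whisper: speech -> text
reply = transcribe_and_chat(text)      # 3-5. prompt + gpt-3.5-turbo: text -> response
streamed_audio(reply)                  # 6. TTS: response -> response.mp3
play_audio("response.mp3")             # 7. playback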

from flask import Flask, render_template, request, jsonify
import sounddevice as sd
import numpy as np
from scipy.io.wavfile import write
import threading
import pygame
import requests
from openai import OpenAI

First, we make all of our necessary imports. We use Flask to build the application and render_template to render our index.html file. jsonify structures our response data, while requests sends HTTP requests to the TTS endpoint. numpy and scipy handle the audio conversion, threading lets the recording run in the background, and pygame and openai are used as mentioned above.

app = Flask(__name__)

Initializing our Flask app!

# Initialize OpenAI client
client = OpenAI(api_key="")
# Initialize pygame for sound playback
pygame.mixer.init()
# Initialize variables
audio_filename = "output.wav"
transcription_text = ""
response_text = ""
# Threading event for stopping recording
stop_event = threading.Event()

Here we set up the OpenAI client, initialize the pygame mixer, define the output filename and the two global text variables, and create the stop_event that signals when recording should stop.
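
As a side note, you may prefer not to hard-code the key. The openai client reads the OPENAI_API_KEY environment variable by default, so an equivalent setup (a small sketch) is:

import os
from openai import OpenAI

# picks up the key from the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))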

def record_audio(filename='output.wav', samplerate=44100):
    global transcription_text, response_text
    print("Recording started...")
    try:
        recording = []
        with sd.InputStream(samplerate=samplerate, channels=2, dtype='int16') as stream:
            # keep pulling 1024-frame chunks until the stop event fires
            while not stop_event.is_set():
                frame, overflowed = stream.read(1024)
                recording.append(frame)
        recording = np.concatenate(recording, axis=0)
        write(filename, samplerate, recording)
        print("Recording finished, audio saved to", filename)
        transcription_text = transcribe_audio(filename)
        if transcription_text:
            response_text = transcribe_and_chat(transcription_text)
    except Exception as e:
        print(f"Error during recording: {e}")

The record_audio function takes the filename and the samplerate as parameters and then uses sd.InputStream to record audio from your microphone. It keeps recording until the stop_event is set. The recorded frames are then concatenated and written to a WAV file. This file is sent to the transcribe_audio function, where we use the OpenAI Whisper model. Once we have the transcription_text from the model, we send it to the transcribe_and_chat function, which returns the response_text.
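
If you want to try record_audio on its own, outside of Flask, you can fire stop_event from a timer thread (a hypothetical test snippet, assuming the definitions above are already in scope):

import threading

# stop the recording automatically after 5 seconds
threading.Timer(5.0, stop_event.set).start()
record_audio("output.wav", 44100)
print(transcription_text, response_text)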

def transcribe_audio(filename):
    try:
        with open(filename, 'rb') as audio_file:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
                response_format="text"
            )
        transcribed_text = transcription.strip()
        print(f">> You said: {transcribed_text}")
        return transcribed_text
    except requests.exceptions.RequestException as e:
        print(f"Error in transcription: {e}")
    except Exception as e:
        print(f"Error in transcribe_audio: {e}")

The transcribe_audio function opens our file and calls client.audio.transcriptions.create to get our transcription. This call takes three parameters: the model name, the file, and the response format. Since we request response_format="text", the result comes back as a plain string, which we strip and return after logging it to the console.

def transcribe_and_chat(input_text):
    try:
        messages = [
            {"role": "system", "content": "You are a helpful assistant providing concise responses in at most two sentences."},
            {"role": "user", "content": input_text}
        ]
        chat_response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
        response_text = chat_response.choices[0].message.content.strip()
        print(f">> The assistant said: {response_text}")
        streamed_audio(response_text)
        return response_text
    except Exception as e:
        print(f"Error in transcribe_and_chat: {e}")
        return ""
return ""

The transcribe_and_chat function takes our system prompt and the input_text (the transcription), then uses client.chat.completions.create to send the messages to the OpenAI gpt-3.5-turbo model and get a response. You can swap in a different model via the model parameter. Once we get the response_text, we send it to the streamed_audio function so we can hear it.
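
Note that transcribe_and_chat is stateless: each turn only sees the system prompt and the latest transcription. If you want the assistant to remember earlier turns, one option is to keep the messages list alive between calls. A minimal sketch of that idea (chat_with_history is a hypothetical helper, not part of the code above):

conversation = [
    {"role": "system", "content": "You are a helpful assistant providing concise responses in at most two sentences."}
]

def chat_with_history(input_text):
    conversation.append({"role": "user", "content": input_text})
    chat_response = client.chat.completions.create(model="gpt-3.5-turbo", messages=conversation)
    reply = chat_response.choices[0].message.content.strip()
    conversation.append({"role": "assistant", "content": reply})  # remember the answer
    return reply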

def streamed_audio(input_text):
    url = "https://api.openai.com/v1/audio/speech"
    headers = {
        "Authorization": "Bearer _",  # replace _ with your OpenAI API key
        "Content-Type": "application/json"
    }
    data = {
        "model": "tts-1",
        "input": input_text,
        "voice": "echo",
        "response_format": "mp3"
    }
    try:
        response = requests.post(url, headers=headers, json=data, stream=True)
        response.raise_for_status()
        with open("response.mp3", "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        play_audio("response.mp3")
    except requests.exceptions.RequestException as e:
        print(f"Error in streamed_audio: {e}")

The streamed_audio function uses the OpenAI TTS model to speak our text in one of its many beautiful voices. We first set up the URL and add our OpenAI API key to the Authorization header of the request. The data dictionary must include the model, the input, the voice, and the response format. We then send the POST request, write the response body to a file called response.mp3, and use the play_audio function to hear the text.
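
If you’d rather not build the HTTP request by hand, the openai Python SDK exposes the same endpoint through the client we already created. A sketch of the equivalent call (assuming a recent 1.x version of the SDK):

speech = client.audio.speech.create(
    model="tts-1",
    voice="echo",
    input=input_text,
    response_format="mp3"
)
speech.write_to_file("response.mp3")
play_audio("response.mp3")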

def play_audio(file_path):
    pygame.mixer.music.load(file_path)
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():
        pygame.time.wait(100)  # sleep briefly instead of busy-waiting

This is the play_audio function, which uses the pygame library’s mixer to load and play our response.mp3 file.

@app.route('/')
def index():
    return render_template('index.html')

Here we are setting up the root page of our server where we will render index.html, which is our frontend interface.

@app.route('/start_recording', methods=['POST'])
def start_recording():
    try:
        global stop_event
        stop_event.clear()  # clear any previous event set
        # run the recording on a background thread so this request returns immediately
        threading.Thread(target=record_audio, args=(audio_filename, 44100)).start()
        return jsonify({'message': 'Recording started.'})
    except Exception as e:
        print(f"Error in start_recording: {e}")
        return jsonify({'message': 'Error in starting recording.'})

The /start_recording POST endpoint first clears the stop_event, in case it was set by a previous run. It then starts a background thread that calls the record_audio function, which runs until we stop it ourselves. We return a basic jsonified message.

@app.route('/stop_recording', methods=['POST'])
def stop_recording():
    try:
        global stop_event
        stop_event.set()  # set stop event to stop recording
        return jsonify({
            'message': 'Recording stopped.',
            'input_text': transcription_text,
            'response_text': response_text
        })
    except Exception as e:
        print(f"Error in stop_recording: {e}")
        return jsonify({'message': 'Error in stopping recording.'})

The /stop_recording POST endpoint sets the stop_event, which ends the recording loop. It returns the message ‘Recording stopped.’ along with the input_text and the response_text. One caveat: the transcription and LLM call run on the recording thread after the event is set, so those fields can still be empty at the moment this response is sent.
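
With the server running, you can exercise both endpoints from a separate Python shell. A hypothetical test script, assuming Flask’s default address of http://127.0.0.1:5000:

import time
import requests

requests.post("http://127.0.0.1:5000/start_recording")
time.sleep(5)  # speak into your microphone during this window
result = requests.post("http://127.0.0.1:5000/stop_recording").json()
# because of the race mentioned above, these fields may still be
# empty if you read them the instant recording stops
print("You said:", result.get("input_text"))
print("Assistant:", result.get("response_text"))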

if __name__ == '__main__':
    app.run(debug=True)

Finally, we run our server and we’re good to go!

This was a walkthrough of the back-end only; the actual front-end interface has not been included here, but I can share it on demand. It isn’t fancy or hard to implement, which is why I left it out.
The application shows how easy it is to build your own voice assistant using OpenAI’s services and tools. The best part is that you can talk in any language and it will respond in the same language. AI is moving fast and will only keep climbing from here. The best we can do is take advantage of it and ride the wave of discovery.
