Building Voice Applications with OpenAI’s Realtime API
Introduction
The way we interact with AI is evolving rapidly. With the introduction of OpenAI’s Realtime API, developers can now integrate low-latency, multimodal speech-to-speech experiences directly into their applications. This breakthrough enables seamless real-time conversations with AI, eliminating the delays and complexities of traditional speech processing pipelines.
In this article, we’ll explore how the Realtime API works and walk through a practical implementation using WebRTC to establish a peer-to-peer connection for real-time voice interaction. We’ll cover:
- How the Realtime API improves voice AI applications
- Setting up a WebRTC connection for live audio exchange
- Sending and receiving messages with a WebRTC data channel
- Building a UI to control sessions and visualize interactions
By the end, you’ll understand how to build a real-time AI voice assistant capable of engaging in natural, human-like conversations.
The Evolution of AI Voice Assistants
Previously, voice AI applications required multiple components:
- Automatic Speech Recognition (ASR) — Converting speech to text using models like Whisper.
- Text Processing — Sending text to an AI model for reasoning and response generation.
- Text-to-Speech (TTS) — Converting the AI’s response back into speech using a separate model.
This approach introduced latency and loss of expressiveness, making interactions feel unnatural. The Realtime API streamlines this process by handling speech-to-speech interactions natively in a single API call — similar to ChatGPT’s Advanced Voice Mode.
Under the hood, the Realtime API:
- Supports persistent, real-time connections with OpenAI’s GPT-4o Realtime model over either WebSockets or WebRTC.
- Supports function calling, allowing AI assistants to trigger actions dynamically based on user inputs.
- Streams audio input and output directly, ensuring natural turn-taking and interruptions, just like human conversation.
With these capabilities, developers can create voice-enabled applications for language learning, customer support, accessibility tools, and more.
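Function calling is configured by sending a session.update event over the data channel once the session is open. The sketch below is illustrative: the tool name and schema (get_weather, location) are hypothetical examples, not part of the API itself.

```javascript
// Sketch: registering a tool via a session.update event.
// The tool name and schema (get_weather / location) are hypothetical examples.
function buildToolConfig() {
  return {
    type: "session.update",
    session: {
      tools: [
        {
          type: "function",
          name: "get_weather",
          description: "Look up the current weather for a city",
          parameters: {
            type: "object",
            properties: { location: { type: "string" } },
            required: ["location"],
          },
        },
      ],
      tool_choice: "auto",
    },
  };
}

// Once the data channel is open, send it:
// dataChannel.send(JSON.stringify(buildToolConfig()));
```

When the model decides to call the tool, it emits a function-call event over the same channel; your code runs the function and sends the result back as a conversation item.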
Implementing Real-Time AI Conversations with WebRTC
To demonstrate how to integrate OpenAI’s Realtime API into a React application, let’s break down the key components of a WebRTC-based voice assistant. You can find more about this application at the official OpenAI Realtime Console GitHub repository.
Setting Up WebRTC for Live Audio Streaming
WebRTC (Web Real-Time Communication) enables peer-to-peer communication for low-latency media streaming. In our application, we:
- Capture microphone input using navigator.mediaDevices.getUserMedia()
- Establish a WebRTC RTCPeerConnection
- Create a DataChannel for sending and receiving text-based interactions
- Attach an <audio> element to play the AI’s voice response
Here’s how we initialize the WebRTC connection:
// Create the peer connection and play incoming AI audio through an <audio> element
const pc = new RTCPeerConnection();
const audioElement = document.createElement("audio");
audioElement.autoplay = true;
pc.ontrack = (e) => (audioElement.srcObject = e.streams[0]);

// Capture the microphone and send its audio track to the model
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
pc.addTrack(mediaStream.getTracks()[0], mediaStream);

// Data channel for exchanging JSON events with the Realtime API
const dataChannel = pc.createDataChannel("oai-events");
This sets up the foundation for real-time voice streaming between the user and OpenAI’s model. If you want a deeper introduction to WebRTC itself, see my earlier article on the topic.
Starting a Session with the Realtime API
To connect with OpenAI’s Realtime API, we need to:
- Obtain an ephemeral authentication key from our server.
- Generate a WebRTC SDP (Session Description Protocol) offer.
- Send the offer to OpenAI’s API to establish a connection.
- Receive and set the API’s response as the remote description.
async function startSession() {
  // Fetch a short-lived key from our own server
  // (never expose the main API key in client-side code)
  const tokenResponse = await fetch("/token");
  const data = await tokenResponse.json();
  const EPHEMERAL_KEY = data.client_secret.value;

  // Create an SDP offer and set it as the local description
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // Send the offer to the Realtime API; the model is passed as a query parameter
  const model = "gpt-4o-realtime-preview-2024-12-17";
  const sdpResponse = await fetch(`https://api.openai.com/v1/realtime?model=${model}`, {
    method: "POST",
    body: offer.sdp,
    headers: {
      Authorization: `Bearer ${EPHEMERAL_KEY}`,
      "Content-Type": "application/sdp",
    },
  });

  // The API's SDP answer completes the WebRTC handshake
  const answer = { type: "answer", sdp: await sdpResponse.text() };
  await pc.setRemoteDescription(answer);
}
Once this function executes, our application can stream voice data in real time to OpenAI’s Realtime API.
Sending and Receiving Messages Over the Data Channel
To communicate with the AI, we need to send structured messages over WebRTC’s data channel:
function sendTextMessage(message) {
  // Add the user's message to the conversation
  const event = {
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [{ type: "input_text", text: message }],
    },
  };
  dataChannel.send(JSON.stringify(event));

  // Ask the model to generate a response to the new item
  dataChannel.send(JSON.stringify({ type: "response.create" }));
}
Similarly, we listen for responses from the AI:
dataChannel.addEventListener("message", (e) => {
  const response = JSON.parse(e.data);
  console.log("AI Response:", response);
});
Now, our AI assistant can send and receive messages dynamically, enabling natural conversations.
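In practice the server emits many distinct event types rather than one monolithic response, so a real listener routes on the type field. A hedged sketch of handling a few common events (the names response.audio_transcript.delta, response.done, and error follow the Realtime API’s event naming):

```javascript
// Route incoming server events by their type field.
// Returns a short description of how the event was handled, for illustration.
function handleServerEvent(event) {
  switch (event.type) {
    case "response.audio_transcript.delta":
      // Incremental transcript of the AI's spoken reply
      return `transcript: ${event.delta}`;
    case "response.done":
      // The model finished its turn
      return "response complete";
    case "error":
      return `error: ${event.error?.message ?? "unknown"}`;
    default:
      return `unhandled: ${event.type}`;
  }
}

// Wire it into the data channel:
// dataChannel.addEventListener("message", (e) => handleServerEvent(JSON.parse(e.data)));
```

Routing by type keeps transcript rendering, turn-end logic, and error display in separate code paths instead of one console.log.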
Building a User Interface for Real-Time AI Interaction
To make our assistant more interactive, we create a simple UI with:
- A start/stop session button
- A message log to display interactions
- A tool panel for additional controls
<SessionControls
  startSession={startSession}
  stopSession={stopSession}
  sendClientEvent={sendClientEvent}
  sendTextMessage={sendTextMessage}
  isSessionActive={isSessionActive}
/>
This React component allows users to start a conversation, send messages, and receive real-time responses.
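The stopSession handler passed to the component is not shown above. A plausible sketch, which is my assumption rather than code from the original console, is a cleanup function that tears down everything startSession created:

```javascript
// Sketch: end the session by closing the data channel, stopping the
// local microphone tracks, and closing the peer connection.
function stopSession(pc, dataChannel, mediaStream) {
  if (dataChannel) dataChannel.close();
  if (mediaStream) mediaStream.getTracks().forEach((t) => t.stop());
  if (pc) pc.close();
}
```

Stopping the tracks (not just closing the connection) matters: it releases the microphone and turns off the browser’s recording indicator.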
Conclusion
With OpenAI’s Realtime API, developers can build voice applications that feel natural — low latency, expressive, and interactive. By leveraging WebRTC for real-time communication, we’ve demonstrated how to:
- Stream live audio to and from OpenAI’s GPT-4o model
- Use WebRTC for peer-to-peer connectivity
- Send structured messages via a data channel
- Build a simple React UI for user interaction
This marks a major step in making AI conversations more human-like and responsive. Whether for education, customer support, or accessibility, real-time AI voice interactions are now within reach.
Ready to build your own real-time AI assistant? Start experimenting with OpenAI’s Realtime API today!