Building Voice Applications with OpenAI’s Realtime API
Introduction
The way we interact with AI is evolving rapidly. With the introduction of OpenAI’s Realtime API, developers can now integrate low-latency, multimodal speech-to-speech experiences directly into their applications. This breakthrough enables seamless real-time conversations with AI, eliminating the delays and complexities of traditional speech processing pipelines.
In this article, we’ll explore how the Realtime API works and walk through a practical implementation using WebRTC to establish a peer-to-peer connection for real-time voice interaction. We’ll cover:
- How the Realtime API improves voice AI applications
- Setting up a WebRTC connection for live audio exchange
- Sending and receiving messages with a WebRTC data channel
- Building a UI to control sessions and visualize interactions
By the end, you’ll understand how to build a real-time AI voice assistant capable of engaging in natural, human-like conversations.
The Evolution of AI Voice Assistants
Previously, voice AI applications required multiple components:
- Automatic Speech Recognition (ASR) — Converting speech to text using models like Whisper.
- Text Processing — Sending text to an AI model for reasoning and response generation.
- Text-to-Speech (TTS) — Converting the AI’s response back into speech using a separate model.
This approach introduced latency and loss of expressiveness, making interactions feel unnatural. The Realtime API streamlines this process by handling speech-to-speech interactions natively in a single API call — similar to ChatGPT’s Advanced Voice Mode.
Under the hood, the Realtime API:
- Supports persistent, real-time connections with OpenAI’s GPT-4o Realtime model over either WebSockets or WebRTC.
- Supports function calling, allowing AI assistants to trigger actions dynamically based on user inputs.
- Streams audio input and output directly, ensuring natural turn-taking and interruptions, just like human conversation.
With these capabilities, developers can create voice-enabled applications for language learning, customer support, accessibility tools, and more.
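Function calling is configured by sending a session.update event over the data channel once the session is open. The sketch below is illustrative: the tool name and schema (get_weather, location) are hypothetical examples, not part of the API itself.

```javascript
// Sketch: registering a tool via a session.update event.
// The tool name and schema (get_weather / location) are hypothetical examples.
function buildToolConfig() {
  return {
    type: "session.update",
    session: {
      tools: [
        {
          type: "function",
          name: "get_weather",
          description: "Look up the current weather for a city",
          parameters: {
            type: "object",
            properties: { location: { type: "string" } },
            required: ["location"],
          },
        },
      ],
      tool_choice: "auto",
    },
  };
}

// Once the data channel is open, send it:
// dataChannel.send(JSON.stringify(buildToolConfig()));
```

When the model decides to call the tool, it emits a function-call event over the same channel; your code runs the function and sends the result back as a conversation item.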
Implementing Real-Time AI Conversations with WebRTC
To demonstrate how to integrate OpenAI’s Realtime API into a React application, let’s break down the key components of a WebRTC-based voice assistant. You can find more about this application at the official OpenAI Realtime Console GitHub repository.
Setting Up WebRTC for Live Audio Streaming
WebRTC (Web Real-Time Communication) enables peer-to-peer communication for low-latency media streaming. In our application, we:
- Capture microphone input using navigator.mediaDevices.getUserMedia()
- Establish a WebRTC RTCPeerConnection
- Create a DataChannel for sending and receiving text-based interactions
- Attach an <audio> element to play the AI’s voice response
Here’s how we initialize the WebRTC connection:
// Create the peer connection and play incoming AI audio through an <audio> element
const pc = new RTCPeerConnection();
const audioElement = document.createElement("audio");
audioElement.autoplay = true;
pc.ontrack = (e) => (audioElement.srcObject = e.streams[0]);

// Capture the microphone and send its audio track to the model
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
pc.addTrack(mediaStream.getTracks()[0], mediaStream);

// Data channel for exchanging JSON events with the Realtime API
const dataChannel = pc.createDataChannel("oai-events");
This sets up the foundation for real-time voice streaming between the user and OpenAI’s model. If you want a deeper introduction to WebRTC itself, see my earlier article on the topic.
Starting a Session with the Realtime API
To connect with OpenAI’s Realtime API, we need to:
- Obtain an ephemeral authentication key from our server.
- Generate a WebRTC SDP (Session Description Protocol) offer.
- Send the offer to OpenAI’s API to establish a connection.
- Receive and set the API’s response as the remote description.
async function startSession() {
  // Fetch a short-lived key from our own server
  // (never expose the main API key in client-side code)
  const tokenResponse = await fetch("/token");
  const data = await tokenResponse.json();
  const EPHEMERAL_KEY = data.client_secret.value;

  // Create an SDP offer and set it as the local description
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // Send the offer to the Realtime API; the model is passed as a query parameter
  const model = "gpt-4o-realtime-preview-2024-12-17";
  const sdpResponse = await fetch(`https://api.openai.com/v1/realtime?model=${model}`, {
    method: "POST",
    body: offer.sdp,
    headers: {
      Authorization: `Bearer ${EPHEMERAL_KEY}`,
      "Content-Type": "application/sdp",
    },
  });

  // The API's SDP answer completes the WebRTC handshake
  const answer = { type: "answer", sdp: await sdpResponse.text() };
  await pc.setRemoteDescription(answer);
}
Once this function executes, our application can stream voice data in real time to OpenAI’s Realtime API.
Sending and Receiving Messages Over the Data Channel
To communicate with the AI, we need to send structured messages over WebRTC’s data channel:
function sendTextMessage(message) {
  // Add the user's message to the conversation
  const event = {
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [{ type: "input_text", text: message }],
    },
  };
  dataChannel.send(JSON.stringify(event));

  // Ask the model to generate a response to the new item
  dataChannel.send(JSON.stringify({ type: "response.create" }));
}
Similarly, we listen for responses from the AI:
dataChannel.addEventListener("message", (e) => {
  const response = JSON.parse(e.data);
  console.log("AI Response:", response);
});
Now, our AI assistant can send and receive messages dynamically, enabling natural conversations.
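In practice the server emits many distinct event types rather than one monolithic response, so a real listener routes on the type field. A hedged sketch of handling a few common events (the names response.audio_transcript.delta, response.done, and error follow the Realtime API’s event naming):

```javascript
// Route incoming server events by their type field.
// Returns a short description of how the event was handled, for illustration.
function handleServerEvent(event) {
  switch (event.type) {
    case "response.audio_transcript.delta":
      // Incremental transcript of the AI's spoken reply
      return `transcript: ${event.delta}`;
    case "response.done":
      // The model finished its turn
      return "response complete";
    case "error":
      return `error: ${event.error?.message ?? "unknown"}`;
    default:
      return `unhandled: ${event.type}`;
  }
}

// Wire it into the data channel:
// dataChannel.addEventListener("message", (e) => handleServerEvent(JSON.parse(e.data)));
```

Routing by type keeps transcript rendering, turn-end logic, and error display in separate code paths instead of one console.log.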
Building a User Interface for Real-Time AI Interaction
To make our assistant more interactive, we create a simple UI with:
- A start/stop session button
- A message log to display interactions
- A tool panel for additional controls
<SessionControls
  startSession={startSession}
  stopSession={stopSession}
  sendClientEvent={sendClientEvent}
  sendTextMessage={sendTextMessage}
  isSessionActive={isSessionActive}
/>
This React component allows users to start a conversation, send messages, and receive real-time responses.
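The stopSession handler passed to the component is not shown above. A plausible sketch, which is my assumption rather than code from the original console, is a cleanup function that tears down everything startSession created:

```javascript
// Sketch: end the session by closing the data channel, stopping the
// local microphone tracks, and closing the peer connection.
function stopSession(pc, dataChannel, mediaStream) {
  if (dataChannel) dataChannel.close();
  if (mediaStream) mediaStream.getTracks().forEach((t) => t.stop());
  if (pc) pc.close();
}
```

Stopping the tracks (not just closing the connection) matters: it releases the microphone and turns off the browser’s recording indicator.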
Conclusion
With OpenAI’s Realtime API, developers can build voice applications that feel natural — low latency, expressive, and interactive. By leveraging WebRTC for real-time communication, we’ve demonstrated how to:
- Stream live audio to and from OpenAI’s GPT-4o model
- Use WebRTC for peer-to-peer connectivity
- Send structured messages via a data channel
- Build a simple React UI for user interaction
This marks a major step in making AI conversations more human-like and responsive. Whether for education, customer support, or accessibility, real-time AI voice interactions are now within reach.
Ready to build your own real-time AI assistant? Start experimenting with OpenAI’s Realtime API today!