Building Voice Applications with OpenAI’s Realtime API

Fateh Ali Aamir
4 min read · Feb 2, 2025

Introduction

The way we interact with AI is evolving rapidly. With the introduction of OpenAI’s Realtime API, developers can now integrate low-latency, multimodal speech-to-speech experiences directly into their applications. This breakthrough enables seamless real-time conversations with AI, eliminating the delays and complexities of traditional speech processing pipelines.


In this article, we’ll explore how the Realtime API works and walk through a practical implementation using WebRTC to establish a peer-to-peer connection for real-time voice interaction. We’ll cover:

  • How the Realtime API improves voice AI applications
  • Setting up a WebRTC connection for live audio exchange
  • Sending and receiving messages with a WebRTC data channel
  • Building a UI to control sessions and visualize interactions

By the end, you’ll understand how to build a real-time AI voice assistant capable of engaging in natural, human-like conversations.

The Evolution of AI Voice Assistants

Previously, voice AI applications required multiple components:

  1. Automatic Speech Recognition (ASR) — Converting speech to text using models like Whisper.
  2. Text Processing — Sending text to an AI model for reasoning and response generation.
  3. Text-to-Speech (TTS) — Converting the AI’s response back into speech using a separate model.

This approach introduced latency and loss of expressiveness, making interactions feel unnatural. The Realtime API streamlines this process by handling speech-to-speech interactions natively in a single API call — similar to ChatGPT’s Advanced Voice Mode.
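To make the contrast concrete, here is a rough sketch of that traditional pipeline using the OpenAI Node SDK. The model names and file handling are illustrative assumptions rather than part of this article's application:

import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Traditional three-step pipeline: ASR -> text reasoning -> TTS.
// Each step is a separate network round trip, which is where the latency comes from.
async function respondToSpeech(inputAudioPath) {
  // 1. Automatic Speech Recognition with Whisper
  const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream(inputAudioPath),
    model: "whisper-1",
  });

  // 2. Text processing with a chat model
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: transcription.text }],
  });
  const replyText = completion.choices[0].message.content;

  // 3. Text-to-Speech with a separate model
  const speech = await openai.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: replyText,
  });
  fs.writeFileSync("reply.mp3", Buffer.from(await speech.arrayBuffer()));
}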

Under the hood, the Realtime API:

  • Connects over WebRTC or WebSocket for persistent, real-time communication with OpenAI’s GPT-4o Realtime model.
  • Supports function calling, allowing AI assistants to trigger actions dynamically based on user inputs.
  • Streams audio input and output directly, ensuring natural turn-taking and interruptions, just like human conversation.

With these capabilities, developers can create voice-enabled applications for language learning, customer support, accessibility tools, and more.
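As a concrete example of the function-calling support mentioned above, a tool is registered by sending a session.update event once a session is connected (we’ll set up the WebRTC data channel used for this below). The get_weather tool here is a hypothetical example, not part of this article’s application:

// Hypothetical tool definition: register a weather lookup the model can call.
// Sent as a client event over the data channel after the session starts.
const sessionUpdate = {
  type: "session.update",
  session: {
    tools: [
      {
        type: "function",
        name: "get_weather",
        description: "Look up the current weather for a city.",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    ],
    tool_choice: "auto",
  },
};
dataChannel.send(JSON.stringify(sessionUpdate));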

Implementing Real-Time AI Conversations with WebRTC

To demonstrate how to integrate OpenAI’s Realtime API into a React application, let’s break down the key components of a WebRTC-based voice assistant. You can find the full application in the official OpenAI Realtime Console GitHub repository.

Setting Up WebRTC for Live Audio Streaming

WebRTC (Web Real-Time Communication) enables peer-to-peer communication for low-latency media streaming. In our application, we:

  • Capture microphone input using navigator.mediaDevices.getUserMedia()
  • Establish a WebRTC RTCPeerConnection
  • Create a DataChannel for sending and receiving text-based interactions
  • Attach an <audio> element to play the AI’s voice response

Here’s how we initialize the WebRTC connection:

// Create the peer connection and an audio element to play the model's voice.
const pc = new RTCPeerConnection();
const audioElement = document.createElement("audio");
audioElement.autoplay = true;

// Play whatever audio track the model sends back.
pc.ontrack = (e) => (audioElement.srcObject = e.streams[0]);

// Capture the microphone and send it to the model.
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
pc.addTrack(mediaStream.getTracks()[0]);

// Data channel for exchanging JSON events (text messages, tool calls, etc.).
const dataChannel = pc.createDataChannel("oai-events");

This sets up the foundation for real-time voice streaming between the user and OpenAI’s model. If you need to learn more about WebRTC, you can read my article here.

Starting a Session with the Realtime API

To connect with OpenAI’s Realtime API, we need to:

  1. Obtain an ephemeral authentication key from our server.
  2. Generate a WebRTC SDP (Session Description Protocol) offer.
  3. Send the offer to OpenAI’s API to establish a connection.
  4. Receive and set the API’s response as the remote description.

async function startSession() {
  // 1. Obtain an ephemeral key from our own server.
  const tokenResponse = await fetch("/token");
  const data = await tokenResponse.json();
  const EPHEMERAL_KEY = data.client_secret.value;

  // 2. Create an SDP offer describing our audio and data channels.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // 3. Send the offer to the Realtime API.
  const sdpResponse = await fetch("https://api.openai.com/v1/realtime", {
    method: "POST",
    body: offer.sdp,
    headers: {
      Authorization: `Bearer ${EPHEMERAL_KEY}`,
      "Content-Type": "application/sdp",
    },
  });

  // 4. Use the API's SDP answer as the remote description.
  const answer = { type: "answer", sdp: await sdpResponse.text() };
  await pc.setRemoteDescription(answer);
}

Once this function executes, our application can stream voice data in real time to OpenAI’s Realtime API.
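The /token route used in startSession is assumed to be a small endpoint on our own backend that mints the ephemeral key, so the real API key never reaches the browser. A minimal Express sketch, assuming OpenAI’s REST sessions endpoint and the gpt-4o-realtime-preview model, could look like this:

// Minimal sketch of the /token route (assumption: Express on the backend).
import express from "express";

const app = express();

app.get("/token", async (req, res) => {
  // Exchange the server-side API key for a short-lived client secret.
  const response = await fetch("https://api.openai.com/v1/realtime/sessions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-realtime-preview",
      voice: "verse",
    }),
  });
  res.json(await response.json());
});

app.listen(3000);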

Sending and Receiving Messages Over the Data Channel

To communicate with the AI, we need to send structured messages over WebRTC’s data channel:

function sendTextMessage(message) {
  // Add the user's text to the conversation...
  const event = {
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [{ type: "input_text", text: message }],
    },
  };
  dataChannel.send(JSON.stringify(event));

  // ...then ask the model to generate a response.
  dataChannel.send(JSON.stringify({ type: "response.create" }));
}

Similarly, we listen for responses from the AI:

dataChannel.addEventListener("message", (e) => {
  // Every server event arrives as a JSON string.
  const response = JSON.parse(e.data);
  console.log("AI Response:", response);
});

Now, our AI assistant can send and receive messages dynamically, enabling natural conversations.
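In a real application we’d branch on the event type instead of just logging everything. The server event names below (response.audio_transcript.delta and response.done) come from the Realtime API’s event reference, but treat this as an illustrative sketch; appendToTranscript is a hypothetical UI helper:

dataChannel.addEventListener("message", (e) => {
  const event = JSON.parse(e.data);

  switch (event.type) {
    case "response.audio_transcript.delta":
      // Incremental transcript of the audio the model is speaking.
      appendToTranscript(event.delta); // hypothetical UI helper
      break;
    case "response.done":
      // The model has finished its turn.
      console.log("Response complete:", event.response);
      break;
    default:
      console.log("Event:", event.type);
  }
});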

Building a User Interface for Real-Time AI Interaction

To make our assistant more interactive, we create a simple UI with:

  • A start/stop session button
  • A message log to display interactions
  • A tool panel for additional controls

<SessionControls
  startSession={startSession}
  stopSession={stopSession}
  sendClientEvent={sendClientEvent}
  sendTextMessage={sendTextMessage}
  isSessionActive={isSessionActive}
/>

This React component allows users to start a conversation, send messages, and receive real-time responses.
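The stopSession prop isn’t shown above; a minimal sketch, assuming it only needs to release the microphone and tear down the connection, might be:

function stopSession() {
  // Stop sending microphone audio.
  pc.getSenders().forEach((sender) => sender.track?.stop());

  // Close the data channel and the peer connection.
  dataChannel.close();
  pc.close();
}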

Conclusion

With OpenAI’s Realtime API, developers can build voice applications that feel natural — low latency, expressive, and interactive. By leveraging WebRTC for real-time communication, we’ve demonstrated how to:

  • Stream live audio to and from OpenAI’s GPT-4o model
  • Use WebRTC for peer-to-peer connectivity
  • Send structured messages via a data channel
  • Build a simple React UI for user interaction

This marks a major step in making AI conversations more human-like and responsive. Whether for education, customer support, or accessibility, real-time AI voice interactions are now within reach.

Ready to build your own real-time AI assistant? Start experimenting with OpenAI’s Realtime API today!
