Creating a Multi-Model Chatbot Using Flask, OpenAI and Replicate

Fateh Ali Aamir
6 min read · Jun 12, 2024


I was looking for a chatbot that would let me work with multiple LLM models at once, but I couldn’t find one, so I decided to build my own. What I present here is a POC-level multi-model chatbot where you can choose between any of the supported models and run inference. I decided to start with a range of different models; you’ll learn about them in a bit. To give you an overview of the project: the backend uses Python, LangChain, and Flask, and the frontend is HTML and vanilla JavaScript for now.

Models

OpenAI
For OpenAI, we decided to go with gpt-3.5-turbo and gpt-4. You will need an OpenAI API key to access these models. You can create an account here and get your API key. You will need to add credits to your account to use the key.
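
If you want a quick sanity check that your key works before wiring up the app, a minimal call through LangChain looks roughly like this (a sketch, assuming your key is already exported as OPENAI_API_KEY and langchain_openai is installed):

from langchain_openai import ChatOpenAI

# Quick sanity check; assumes OPENAI_API_KEY is set in your environment
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
print(llm.invoke("Say hello in one sentence.").content)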

Replicate
We decided to use falcon-40b-instruct and llama-2-70b-chat, both hosted on Replicate. Replicate also offers a limited number of free inferences, so you can save some bucks there. You will also need to create an account here and retrieve your API token.
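
The same kind of quick check works for Replicate through LangChain’s community integration (a sketch, assuming REPLICATE_API_TOKEN is set and langchain_community is installed):

from langchain_community.llms import Replicate

# Quick sanity check; assumes REPLICATE_API_TOKEN is set in your environment
llm = Replicate(
    model="meta/llama-2-70b-chat",
    model_kwargs={"temperature": 0.75, "max_length": 200},
)
print(llm.invoke("Say hello in one sentence."))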

app.py

from flask import Flask, render_template, request, jsonify
from utils import generate_response

app = Flask(__name__)

# Define the available LLM models
llm_models = ["gpt-3.5-turbo", "gpt-4", "Llama-2-70b-chat", "Falcon-40b-instruct"]

@app.route('/')
def index():
    # Render the home page with the model dropdown
    return render_template('index.html', llm_models=llm_models)

@app.route('/generate_response', methods=['POST'])
def get_response():
    data = request.json
    query = data['query']
    chat_history = []
    model_choice = data['model_choice']

    # Call the generate_response function from utils
    response, updated_chat_history = generate_response(query, chat_history, model_choice)

    return jsonify({"response": response, "chat_history": updated_chat_history})

if __name__ == '__main__':
    app.run(debug=True)

The app.py file sets up two endpoints (avoid naming this file flask.py, since a module with that name would shadow the Flask package itself). The first one is the default home page, which renders index.html and also loads the list of LLM models. The second is the generate_response endpoint, which passes the input query to the generate_response() function defined in utils.py. That function processes the query and returns the response.
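
You can also test the endpoint without the frontend by posting JSON to it directly, for example with the requests library (a sketch, assuming the Flask dev server is running locally on the default port 5000):

import requests

# Hypothetical test call against the local dev server
payload = {"query": "Hello, who are you?", "model_choice": "gpt-3.5-turbo"}
response = requests.post("http://127.0.0.1:5000/generate_response", json=payload)

print(response.json()["response"])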

utils.py

import os
import random
import time
from langchain.prompts import PromptTemplate
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore
from langchain.chains.question_answering import load_qa_chain
from pinecone import Pinecone
from langchain_community.llms import Replicate

def chat_inference(query: str, model_choice: str) -> str:
    try:
        # Add your keys here (or load them from the environment)
        os.environ["OPENAI_API_KEY"] = ""
        os.environ["PINECONE_API_KEY"] = ""
        os.environ["REPLICATE_API_TOKEN"] = ""

        pinecone = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

        # Prompt: the retrieved documents fill {context}, the user query fills {question}
        prompt = PromptTemplate(
            template=(
                "You are a helpful assistant. Use the following context to answer the question.\n\n"
                "Context:\n{context}\n\nQuestion: {question}"
            ),
            input_variables=["context", "question"],
        )

        # Set up OpenAI Embeddings model
        embedding_model = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=1536)

        # Load Pinecone index and create vector store
        vector_store = PineconeVectorStore(index_name="chatbot", embedding=embedding_model)

        # Similarity search: fetch the two most relevant documents
        input_documents = vector_store.similarity_search(query, k=2)

        # Determine the chat model based on the user's choice
        if model_choice == "gpt-3.5-turbo":
            chat_llm = ChatOpenAI(
                openai_api_key=os.environ["OPENAI_API_KEY"],
                model="gpt-3.5-turbo",
                temperature=0,
                verbose=True,
            )

            # Chain
            chain = load_qa_chain(
                llm=chat_llm,
                chain_type="stuff",
                prompt=prompt,
                verbose=True,
            )

            # Response
            response = chain.run(
                input_documents=input_documents,
                question=query,
            )

            return response
        elif model_choice == "gpt-4":
            chat_llm = ChatOpenAI(
                openai_api_key=os.environ["OPENAI_API_KEY"],
                model="gpt-4",
                temperature=0,
                verbose=True,
            )

            # Chain
            chain = load_qa_chain(
                llm=chat_llm,
                chain_type="stuff",
                prompt=prompt,
                verbose=True,
            )

            # Response
            response = chain.run(
                input_documents=input_documents,
                question=query,
            )

            return response
        elif model_choice == "Llama-2-70b-chat":
            chat_llm = Replicate(
                model="meta/llama-2-70b-chat",
                model_kwargs={"temperature": 0.75, "max_length": 500, "top_p": 1},
            )
            context = "\n".join(doc.page_content for doc in input_documents)

            # Fill the prompt with the retrieved context and the user's question
            formatted_prompt = prompt.format(context=context, question=query)
            return chat_llm.invoke(formatted_prompt)
        elif model_choice == "Falcon-40b-instruct":
            chat_llm = Replicate(
                model="joehoover/falcon-40b-instruct:7d58d6bddc53c23fa451c403b2b5373b1e0fa094e4e0d1b98c3d02931aa07173",
                model_kwargs={"temperature": 0.75, "max_length": 500, "top_p": 1},
            )
            context = "\n".join(doc.page_content for doc in input_documents)

            # Fill the prompt with the retrieved context and the user's question
            formatted_prompt = prompt.format(context=context, question=query)
            return chat_llm.invoke(formatted_prompt)
        else:
            return "Unknown model selected"

    except Exception as exception:
        error = str(exception)
        print("error: ", error)
        return error

def generate_response(query: str, chat_history: list, model_choice: str) -> tuple:
    response = chat_inference(query, model_choice)

    print(response)

    chat_history.append((query, response))
    time.sleep(random.randint(0, 5))  # random pause of up to 5 seconds before returning
    return response, chat_history

The generate_response() function retrieves the response from chat_inference(), which takes in the query and the model_choice. After that, it appends the latest exchange to chat_history and returns both the response and the updated history.
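
To make the return value concrete, here is a rough sketch of calling it outside of Flask; the history is simply a list of (query, response) tuples (the example query and the trimmed response text are hypothetical):

# Hypothetical stand-alone call, outside of Flask
response, chat_history = generate_response("Hello, who are you?", [], "gpt-3.5-turbo")

# response     -> "I am a helpful assistant ..."
# chat_history -> [("Hello, who are you?", "I am a helpful assistant ...")]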

The chat_inference() function is a standard inference function that does the following things (in order):
1. Sets the API keys and initializes the Pinecone client.
2. Sets up the prompt, the OpenAI embedding model, and the Pinecone vector store.
3. Runs a similarity_search() to retrieve similar documents from Pinecone.
4. Initializes the LLM of your choice and runs inference on the query.
5. Finally, returns the response from the LLM.
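
One thing worth noting: the API keys in chat_inference() are set to empty strings as placeholders. A common alternative, sketched here under the assumption that you add python-dotenv to the project and keep the keys in a .env file, is to load them once at startup instead of hardcoding them:

# Hypothetical alternative to hardcoding keys in utils.py
# pip install python-dotenv, then put OPENAI_API_KEY, PINECONE_API_KEY
# and REPLICATE_API_TOKEN in a .env file next to the code
from dotenv import load_dotenv

load_dotenv()  # reads the .env file and populates os.environ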

index.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Chatbot</title>
    <style>
        body {
            width: 100vw;
            height: 100vh;
            overflow-y: scroll;
            overflow-x: hidden;
            background-size: cover;
            background-repeat: no-repeat;
background-image: url("data:image/svg+xml;utf8,%3Csvg xmlns=%22http:%2F%2Fwww.w3.org%2F2000%2Fsvg%22 width=%222000%22 height=%221125%22%3E%3Cg filter=%22url(%23a)%22%3E%3Cpath fill=%22%23018f81%22 d=%22M-1000-562.5h4000v2250h-4000z%22%2F%3E%3Cpath d=%22m-77-225-364 631 1008 902 452-274%22 fill=%22%23018f81%22%2F%3E%3Cpath d=%22M693-120 343 137l355 1133L1756 251%22 fill=%22%23003055%22%2F%3E%3Cpath d=%22m1983.085 163.118-1087 1105 231 80 920-990M-320.479-505.019l-339 455 19 872 1165-950M2166.453-724.48l-867 340 1154 754v-616%22 fill=%22%23632491%22%2F%3E%3Cpath d=%22M2098 968 958 2150l853 241 649-421%22 fill=%22%23003055%22%2F%3E%3C%2Fg%3E%3Cdefs%3E%3Cfilter id=%22a%22 x=%22-260%22 y=%22-260%22 width=%222520%22 height=%221645%22 filterUnits=%22userSpaceOnUse%22 color-interpolation-filters=%22sRGB%22%3E%3CfeFlood flood-opacity=%220%22 result=%22BackgroundImageFix%22%2F%3E%3CfeBlend in=%22SourceGraphic%22 in2=%22BackgroundImageFix%22 result=%22shape%22%2F%3E%3CfeGaussianBlur stdDeviation=%22260%22 result=%22effect1_foregroundBlur_1_2%22%2F%3E%3C%2Ffilter%3E%3C%2Fdefs%3E%3C%2Fsvg%3E");
            padding-bottom: 10rem;
            margin: 0;
            padding: 0;
        }
        .container {
            width: 65%;
            height: 85%;
            padding: 20px;
            border-radius: 8px;
            margin: 0 auto;
            background-color: #f5f5f5;
            box-shadow: 0 0 30px hsla(0, 1%, 82%, 0.137);
            color: #000;
            border: 1px solid #dddddd00;
        }
        h1 {
            color: #333;
            text-align: center;
            margin-bottom: 20px;
        }
        #chatHistory {
            border: 1px solid #ccc;
            padding: 10px;
            width: 100%;
            height: 400px;
            overflow-y: scroll;
            margin-bottom: 10px;
        }
        #chatHistory p {
            margin: 5px 0;
        }
        .user-message {
            color: blue;
        }
        .bot-message {
            color: green;
        }
        .error-message {
            color: red;
        }
        .spinner {
            border: 4px solid rgba(0, 0, 0, 0.1);
            width: 36px;
            height: 36px;
            border-radius: 50%;
            border-left-color: #09f;
            animation: spin 1s ease infinite;
            display: none;
            margin: 10px auto;
        }
        @keyframes spin {
            0% { transform: rotate(0deg); }
            100% { transform: rotate(360deg); }
        }
    </style>
</head>
<body>
    <div class="container">
        <h1>Chatbot</h1>
        <select id="modelDropdown">
            {% for model in llm_models %}
            <option value="{{ model }}">{{ model }}</option>
            {% endfor %}
        </select>
        <!-- Prevent the default form submission on Enter so the page doesn't reload -->
        <form id="message-form" style="display: flex;" onsubmit="event.preventDefault(); sendMessage();">
            <input type="text" id="userInput" placeholder="Type your message..." style="flex: 1;">
            <button type="button" onclick="sendMessage()" style="margin-left: 10px;">Send</button>
        </form>
        <div id="chatHistory"></div>
        <div id="errorMessage" class="error-message" style="margin-top: 10px;"></div>
        <div class="spinner" id="loadingSpinner"></div>
    </div>

    <script>
        function sendMessage() {
            var query = document.getElementById('userInput').value;
            var modelChoice = document.getElementById('modelDropdown').value;
            var errorMessage = document.getElementById('errorMessage');
            var loadingSpinner = document.getElementById('loadingSpinner');

            if (!query.trim()) {
                errorMessage.textContent = "Please enter a message.";
                return;
            }

            errorMessage.textContent = '';
            loadingSpinner.style.display = 'block';

            fetch('/generate_response', {
                method: 'POST',
                headers: {
                    'Content-Type': 'application/json',
                },
                body: JSON.stringify({
                    query: query,
                    model_choice: modelChoice
                }),
            })
            .then(response => response.json())
            .then(data => {
                document.getElementById('chatHistory').innerHTML += "<p class='user-message'><strong>User:</strong> " + query + "</p>";
                document.getElementById('chatHistory').innerHTML += "<p class='bot-message'><strong>Chatbot:</strong> " + data.response + "</p>";
                document.getElementById('userInput').value = '';
                document.getElementById('chatHistory').scrollTop = document.getElementById('chatHistory').scrollHeight;
                loadingSpinner.style.display = 'none';
            })
            .catch((error) => {
                console.error('Error:', error);
                errorMessage.textContent = "An error occurred while processing your request.";
                loadingSpinner.style.display = 'none';
            });
        }
    </script>
</body>
</html>

The index.html file includes a dropdown to select your LLM model and a textbox to type in your query. Once you hit Send, a spinner is shown until the response has been retrieved. The messages then appear in the scrollable window in the centre of the screen. It’s very minimal, but it gets the job done.

A multi-model chatbot can help with several use cases. As we have come to understand, different LLMs have different pros and cons, so for each query you can pick the model that is best suited to your needs. I hope to grow this into something much bigger and much more optimized so that it can be a valuable application for all LLM lovers out there!

Thanks for reading! 🚀
