Asking OpenAI about ICC T20 World Cup
Recently I followed some Udemy courses on Generative AI that taught how to write our own chatbot using OpenAI.
The ICC Men's T20 World Cup also concluded recently, with India emerging as champions. I asked ChatGPT about it, and it did not know the answer directly. Instead, it searched the web and gave me what it found.
If I close the chat and come back tomorrow with the same question, it will search the web again. The reason is that ChatGPT's foundation model was not trained on information about the ICC T20 World Cup 2024, which concluded just a couple of days ago.
Wikipedia, on the other hand, is updated within hours by its large community of contributors.
So how can we use this information, or any information that is available publicly (on the internet) or privately (within a company, at home, etc.), to get answers from ChatGPT?
We can achieve this with a Retrieval Augmented Generation (RAG) architecture, which works in the following steps.
- Upload and process information: I copied the Wikipedia section about the T20 World Cup and saved it as a PDF. The uploaded PDF is then processed by extracting its text and splitting it into chunks at line breaks.
- Store embeddings in a vector store: The chunks are fed into an OpenAI embeddings model, which converts each chunk into a numeric vector that captures its meaning. The vectors are then stored in a vector store for later querying. The vector store used here is not persistent (it lives in memory), but persistent ones exist as well.
- Question from Actor: The Actor can now ask a question about the T20 World Cup.
- Similarity search: The question is used as the input to a similarity search in the vector store, which finds the chunks that best match it.
- Send matched information: The matched chunks are sent, together with the question, to an OpenAI Large Language Model (LLM), which has been trained on a vast amount of data using an enormous amount of computing.
- Generate natural language answer: The OpenAI LLM generates the answer in natural language.
- Answer: The answer is sent back to the Actor.
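The retrieval part of these steps can be sketched in plain Python without any API calls. This is only an illustration under loud assumptions: the `embed` function below is a toy word-count vector rather than a real OpenAI embedding, the "vector store" is a plain list, and the sample document and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector.

    Real embeddings are dense float vectors produced by a model;
    this stand-in only counts lowercase words.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_search(question, store, k=1):
    """Return the k chunks whose vectors are most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]


# Hypothetical sample document, chunked into lines.
document = """India won the 2024 ICC Men's T20 World Cup final.
The tournament was hosted by the West Indies and the United States.
FAISS is a library for similarity search over vectors."""
chunks = [line.strip() for line in document.splitlines() if line.strip()]

# In-memory "vector store": each chunk paired with its vector.
store = [(chunk, embed(chunk)) for chunk in chunks]

best_match = similarity_search("Who won the T20 World Cup final?", store, k=1)
print(best_match)
```

Here the first line of the sample document ranks highest because it shares the most words with the question. A real embeddings model matches on meaning rather than exact words, which is why the application that follows uses OpenAI embeddings and FAISS instead.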
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain_community.chat_models import ChatOpenAI

OPENAI_API_KEY = "YOUR_OPEN_AI_KEY"  # Replace with your OpenAI key

# Upload PDF files
st.header("My First Chatbot")

with st.sidebar:
    st.title("Your Documents")
    file = st.file_uploader("Upload a PDF file and start asking questions", type="pdf")

# Extract the text
if file is not None:
    pdf_reader = PdfReader(file)
    text = ""
    for page in pdf_reader.pages:
        # extract_text() can return nothing for pages without text
        text += page.extract_text() or ""
    # st.write(text)

    # Break it into chunks of about 500 characters, split at line breaks
    text_splitter = RecursiveCharacterTextSplitter(
        separators=["\n"],
        chunk_size=500,
        chunk_overlap=150,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    # st.write(chunks)

    # Generate embeddings with OpenAI
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

    # Create the vector store - FAISS (Facebook AI Similarity Search)
    # This embeds every chunk and builds an in-memory index
    vector_store = FAISS.from_texts(chunks, embeddings)

    # Get user question
    user_question = st.text_input("Type your question here")

    # Do similarity search
    if user_question:
        match = vector_store.similarity_search(user_question)
        # st.write(match)

        # Define LLM
        llm = ChatOpenAI(
            openai_api_key=OPENAI_API_KEY,
            temperature=0,
            max_tokens=1000,
            model_name="gpt-3.5-turbo"
        )

        # Output results: "stuff" puts all matched chunks into one prompt
        chain = load_qa_chain(llm, chain_type="stuff")
        response = chain.run(input_documents=match, question=user_question)
        st.write(response)
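To make the last step less of a black box, here is a rough sketch of what a "stuff" style chain does conceptually: it places all the matched chunks into a single prompt together with the question. The function name `build_stuff_prompt` and the prompt wording are my own assumptions for illustration, not LangChain's actual template.

```python
def build_stuff_prompt(chunks, question):
    """Combine retrieved chunks and the question into one prompt string.

    Hypothetical helper: the wording is illustrative only and is not
    the template that LangChain uses internally.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks, standing in for the similarity search result.
retrieved = [
    "India won the 2024 ICC Men's T20 World Cup final.",
    "The tournament was hosted by the West Indies and the United States.",
]
prompt = build_stuff_prompt(retrieved, "Who won the T20 World Cup?")
print(prompt)
```

Because every matched chunk goes into one prompt, this approach is simple but only works while the chunks fit within the model's context window, which is one reason the chunk size is kept small.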
We can now see this system in action. When we ask the same question of our own chatbot instead of ChatGPT, it answers from the information it holds locally rather than searching the internet.
Conclusion
There are several ways to give OpenAI models access to the latest information. The most widely used ones are:
- Retrieval Augmented Generation (RAG)
- Finetuning
- Prompt Engineering
Of these three, RAG is, at the moment, the one that is relatively inexpensive to implement while still generating highly accurate answers, because the model is handed the relevant source text along with the question. This makes it ideal for businesses that have private data repositories but need the power of OpenAI and ChatGPT to answer employee and customer queries.
References
Recommended Udemy Course — https://www.udemy.com/course/generative-ai-for-beginners-b