While people are generally interested in chatting with ChatGPT, as a developer I have been thinking about how to use OpenAI APIs to build business applications. Large Language Models (LLMs) have a data freshness problem. For example, because ChatGPT's knowledge cutoff date is September 2021, it can’t answer questions that require recent information, such as the latest offers and promotions. LLMs also don’t have access to proprietary or confidential information, such as internal company documents you’d like to interact with via an LLM.
The first challenge is adding those documents to the LLM. We could try training the LLM on these documents, but this is time-consuming and expensive. And what happens when a new document is added? Retraining for every new document is beyond inefficient; it is simply impractical.
So, how do we handle this problem? We can use retrieval-augmented generation. This technique allows us to retrieve relevant information from an external knowledge base and give that information to our LLM. The external knowledge base is our “window” into the world beyond the LLM’s training data. Over the last couple of months, I have been learning how to implement retrieval-augmented generation for LLMs by developing a tech POC, built with LangChain.js and the OpenAI Embeddings API, that lets you chat with and query your own files.
After doing some research, I came up with the following plan and executed it:
- Prepare the documents: I downloaded HTML files from the Mastercard Priceless website with the following command. I interrupted it partway through, which left a total of 171 files, stored in the data/docs folder of the repo:
wget --user-agent="Mozilla" --no-parent \
  --adjust-extension \
  -e robots=off -r -m \
  --reject='*\?*,*.css,*.jpg,*.gif,*.png,*.js' \
  -P data/docs https://www.priceless.com/
- Create embeddings of these documents and store them in a vectorstore.
- Build a web app that lets you chat with and query your documents:
When the user enters a question, the frontend sends both the new question and the chat history, if any, to the backend.
If the chat history is empty, the backend calls the OpenAI Embeddings API to generate embeddings for the new question, then uses those embeddings to run a similarity search against the vectorstore, which returns the top 4 related document chunks.
If the chat history is not empty, the backend first calls the OpenAI Completion API to generate a standalone question from the new question and the chat history. It then calls the OpenAI Embeddings API to generate embeddings for the standalone question and uses them to run the same similarity search, which returns the top 4 related document chunks.
The document chunks returned from the vectorstore are used as the context for the OpenAI Completion API to generate the final answer, which is streamed to the frontend. Once text generation is complete, the related document chunks, along with metadata such as their URLs, are sent to the frontend to render the list of sources.
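The request flow above can be sketched roughly as follows. This is a toy, self-contained Python illustration, not the project's LangChain.js code: `embed` is a bag-of-words stand-in for the OpenAI Embeddings API, `condense_question` is a placeholder for the Completion API call that produces the standalone question, and every function name and example URL here is hypothetical.

```python
import math
import re
from collections import Counter

TOP_K = 4  # the backend described above retrieves the top 4 chunks


def embed(text: str) -> Counter:
    """Toy stand-in for an embeddings API: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def condense_question(question: str, history: list) -> str:
    """Placeholder for the LLM call that rewrites a follow-up as a standalone question.

    This stub just prepends the previous user question; a real system asks the LLM.
    """
    if not history:
        return question
    last_question, _last_answer = history[-1]
    return f"{last_question} {question}"


def retrieve(question: str, history: list, store: list, k: int = TOP_K):
    """Return the standalone question and the k chunks most similar to it."""
    standalone = condense_question(question, history)
    query_vec = embed(standalone)
    ranked = sorted(store, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return standalone, ranked[:k]


def build_prompt(standalone: str, chunks: list) -> str:
    """Assemble the context-plus-question prompt for the completion model."""
    context = "\n\n".join(c["text"] for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {standalone}"


# Tiny in-memory "vectorstore": each chunk keeps its text, vector and source URL.
docs = [
    ("Broadway show tickets in New York", "https://example.com/a"),
    ("Dining offers in Paris", "https://example.com/b"),
    ("Golf experiences in Scotland", "https://example.com/c"),
    ("Jazz concerts and entertainment in New York", "https://example.com/d"),
    ("Museum tours in London", "https://example.com/e"),
]
store = [{"text": t, "vector": embed(t), "url": u} for t, u in docs]

history = [("Any entertainment offers?", "Yes.")]
standalone, chunks = retrieve("What about New York?", history, store)
prompt = build_prompt(standalone, chunks)
sources = [c["url"] for c in chunks]  # what the frontend would render as sources
```

The only branch in the flow is whether a standalone question needs to be generated first; retrieval and prompt assembly are the same in both cases.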
This project supports 4 vectorstores: HNSWLib, Chroma, Milvus and Pinecone. Here are some basic facts collected from the internet.
The table below, screen-captured from the Python notebook here, shows what happened when the same set of questions was answered using the different vectorstores.
As you can see from the table above, although the standalone questions generated by the LLM are almost the same, the similarity search results from the vectorstores differ, which causes the differences in the final answers. In some cases, the search results don’t contain any relevant document chunks, resulting in the LLM replying “I’m sorry …” because it cannot produce a useful answer from the context provided.
After completing the comparison of vectorstores, I also compared embedding functions.
I tested the embedding functions of some open source LLMs on my Mac (without a GPU). It turns out that, although they are too slow to deliver a reasonable experience for end users, the answers are not bad.
The results below were captured after I used Vicuna (q4_1), GPT4ALL and OpenAI to generate embeddings for 3 HTML files, which are the files this chatbot returns when someone asks "Entertainment in New York". These files were split into 60 chunks with a chunk size of 500 tokens. I then had each model use its generated embedding vectors to run RetrievalQA on the same question. For comparison, I also ran the same query using OpenAI with the pre-generated embedding vectors stored in Pinecone. Finally, I appended the answer from this chatbot. Refer to the two Python notebooks (part 1 & part 2) under the scripts/ folder for more details.
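To make the chunking step concrete, here is a minimal sketch of fixed-size splitting. It is an illustration under stated assumptions, not the splitter used in the notebooks: it counts whitespace-separated words rather than real model tokens, and the `overlap` parameter is a hypothetical extra that the post does not mention.

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 0) -> list:
    """Split text into chunks of at most `chunk_size` whitespace-separated tokens.

    Consecutive chunks share `overlap` tokens. Real tokenizers count tokens
    differently, so chunk boundaries here are only an approximation.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


# A synthetic 1,200-word document splits into chunks of 500, 500 and 200 words.
sample = " ".join(f"w{i}" for i in range(1200))
plain_chunks = split_into_chunks(sample)
overlapping_chunks = split_into_chunks(sample, chunk_size=500, overlap=100)
```

Each chunk is then embedded separately, so the chunk size directly affects both the number of embedding calls and how much context each search hit carries.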
Here are some findings through the development of this project:
The capabilities of vectorstores vary. You need to carefully evaluate them and select the one most suitable to your business needs.
The OpenAI Embeddings API might be costly for a large set of documents. As a reference, OpenAI charged USD 22 to embed the 171 HTML files stored in the data/docs folder, which have a total size of around 180 MB.
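As a rough sanity check on that figure, the token count implied by a bill can be back-computed from a per-token price. The rate below ($0.0004 per 1,000 tokens) is an assumed, illustrative value, not a number taken from the bill, and the "180M" size is taken here to mean megabytes; substitute the current price list and your own measurements.

```python
ASSUMED_PRICE_PER_1K_TOKENS_USD = 0.0004  # assumption for illustration only


def implied_tokens(total_cost_usd: float, price_per_1k_tokens_usd: float) -> float:
    """Number of tokens that a given bill corresponds to at a given rate."""
    return total_cost_usd / price_per_1k_tokens_usd * 1000


def estimate_cost(num_tokens: float, price_per_1k_tokens_usd: float) -> float:
    """Embedding cost in USD for a given number of tokens."""
    return num_tokens / 1000 * price_per_1k_tokens_usd


tokens = implied_tokens(22.0, ASSUMED_PRICE_PER_1K_TOKENS_USD)
corpus_bytes = 180 * 1024 * 1024  # "around 180M", read as megabytes
bytes_per_token = corpus_bytes / tokens

print(f"implied tokens: {tokens:,.0f}")
print(f"bytes per token: {bytes_per_token:.2f}")
```

Under these assumptions the bill corresponds to roughly 55 million tokens, or a little over 3 bytes of HTML per token. The same two functions can be used in the other direction to budget an embedding run before starting it.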
To save costs, I used HNSWLib to store all embeddings generated by OpenAI in the local file system and then load them into other vectorstores: Pinecone, Chroma and Milvus.
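The idea of paying for embeddings once and reusing them can be sketched as below. This is a simplified stand-in, not the project's code: it persists vectors to a JSON file instead of HNSWLib's on-disk format, `paid_embed` is a dummy replacement for the embeddings API, and `InMemoryStore` is a hypothetical placeholder for a real vectorstore client.

```python
import json
import tempfile
from pathlib import Path

embed_calls = 0  # counts how often the "paid" API stand-in is invoked


def paid_embed(text: str) -> list:
    """Dummy stand-in for a paid embeddings API call."""
    global embed_calls
    embed_calls += 1
    return [float(len(text)), float(text.count(" "))]


class InMemoryStore:
    """Hypothetical placeholder for a vectorstore client."""

    def __init__(self, name: str):
        self.name = name
        self.records = []

    def add(self, records: list) -> None:
        self.records.extend(records)


def build_cache(chunks: list, path: Path) -> None:
    """Embed every chunk exactly once and persist text plus vector to disk."""
    records = [{"text": c, "vector": paid_embed(c)} for c in chunks]
    path.write_text(json.dumps(records))


def load_cache(path: Path) -> list:
    """Read previously computed embeddings without calling the API again."""
    return json.loads(path.read_text())


cache_path = Path(tempfile.mkdtemp()) / "embeddings.json"
build_cache(["offer one", "offer two", "offer three"], cache_path)

# Populate several stores from the same cache; no further embedding calls happen.
stores = [InMemoryStore(name) for name in ("store_a", "store_b", "store_c")]
for store in stores:
    store.add(load_cache(cache_path))
```

Because every store is filled from the same cached vectors, differences in search results between stores come from the stores themselves rather than from the embeddings.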
It might be worthwhile to use other embedding methods, such as Sentence Transformers, which would not only save costs but also remove the risk of leaking sensitive or confidential information to an external API.
To reduce the source data size and save costs, I tried using html2text to convert all HTML files to plain text files. The result was not ideal - in most cases the chatbot either couldn’t answer the user's question or gave an incomplete answer. Eventually I had to switch back to using the HTML files as is - you can see this when you tap any item in the list of sources in the frontend.
The capabilities of vectorstores vary. Enterprises need to carefully evaluate them and select the one most suitable to their business needs.
As of now, for most enterprises building LLM-based applications, the best bet would be to use OpenAI APIs. To save costs, other embedding methods can be considered if they work with the selected vectorstore to deliver good similarity search results.
PS. You can play with the chatbot deployed at Netlify. The sources, including NextJS code, Python notebooks, HTML files and HNSWLib data files, are all hosted at GitHub. Feel free to create an issue or submit a pull request. Enjoy!