… and why you might just want to stick with the Chat Completion API
Not unlike the rest of the world, Revelry has been diving head first into the land of generative AI within the past year or so. So far, we’ve primarily leveraged OpenAI’s APIs to incorporate generative AI into various custom software applications.
Obviously, OpenAI is not the only option out there. There are a number of alternative LLMs, such as Anthropic‘s Claude 2, Meta‘s Llama 2, as well as countless other open source models. We have played around with some of these other LLMs, but so far we’ve gotten the most mileage out of OpenAI’s GPT 3.5 and now GPT 4.
Until very recently, the primary API people have been using to communicate with these OpenAI LLMs has been the chat completions API . However, at the OpenAI DevDay on November 6, they announced an open beta of the new Assistants API, as well as Custom GPTs.
For this blog post, I wanted to lay out some of the pros and cons of the various options that OpenAI offers when it comes to utilizing their LLMs. The most relevant options are:
- Chat Completion API
- Assistants API
- Custom GPTs
I will write a followup article that goes more into depth on our specific experiences, but for now I want share some of my thoughts comparing the offerings that OpenAI has available, and to give some advice to teams that are trying to figure out their best options when it comes to incorporating generative AI into their systems.
If you haven’t already, you might want to check out the official Assistants API docs before reading on.
This is an appealing option for teams that want to integrate their own custom software with OpenAI’s models via API, but don’t want to build their own systems to manage prompt templates, build out their own RAG interface, or use models that live outside of OpenAI’s ecosystem.
The biggest downside of integrating with the Assistants API is that it further locks you into OpenAI. If you integrate your application deeply with the assistants API, it is inevitably going to be harder to make your LLMs pluggable, which would be necessary if you wanted to switch to open source or proprietary models down the line. This is because you will not have implemented all the other pieces that are necessary to communicate effectively with an LLM if you are letting OpenAI do all the hard stuff outside of just talking to the LLM in plain text.
There are other downsides which I’ll elaborate on below, but first:
A Tangent About RAG
It’s awesome that OpenAI has built tools that incorporate this functionality into assistants and custom GPTs with minimal effort, but there are a lot of reasons why it might make more sense to just build this yourself.
RAG, aka Retrieval Augmented Generation, is very easy to implement with tools like LangChain. Its also not hard to build a RAG flow without LangChain; there are tons of options out there. What you choose to build with depends on how low level you want to go, how much control you want, and how much vendor lock-in you want to avoid.
How to Build a RAG Flow
- Set up a Vector Database
- Enable uploading of documents to your system that need to go into the vectorDB (probably via some web interface)
- Extract plain text from the files (can be more involved depending on the file type)
- For each uploaded document, chunk the text based on content type
- there are a lot of decisions to be made here in terms of how large the chunks are, what to separate the chunks on, how much overlap there should be, etc
- Convert those chunks into vector embeddings
- You can use openAI’s embedding models via API, but you can also use any embedding model of your choice (open source or proprietary).
- Store those vector embeddings in your vector DB
- Query against the vector DB using semantic search to pull relevant pieces of information out, and inject that info into a prompt before it’s sent to the LLM
- Send the “retrieval augmented” prompt to the LLM to generate the stuff. Hence “Retrieval Augmented Generation”.
Having control over the way the text is chunked, what embedding models are used, how the vector db is queried, and how the results are injected into prompts is a really good reason to implement this yourself as opposed to keeping it all hidden in a black box that is an OpenAI assistant (or customGPT).
Revelry has gone down this path for the AI powered product delivery software we have built (ProdOps.AI), which happens to be built with Phoenix/Elixir, using Pinecone for the VectorDB, and OpenAI for creating the embeddings as well as the LLMs (no LangChain).
We could definitely dive deeper into building a RAG flow from scratch, but the gist of it is that it’s not that hard, and you probably don’t want to let OpenAI do it all for you. Otherwise, you won’t be able to use any other LLMs without rebuilding a lot of your stack.
Ok, RAG rant over. Below are some more concise pros and cons:
- Enables (simple) RAG via the “Knowledge Retrieval” feature, which allows for knowledge base-specific assistants.
- Allows for customized system prompts on a per-assistant basis
- Allows for more transparency and ability to drill down into whats happening under the hood.
- Below is a breakdown of the the distinct resources exposed by the assistant API (ripped directly from the docs):
- Assistant: Purpose-built AI that uses OpenAI’s models and calls tools
- Thread: A conversation session between an Assistant and a user. Threads store Messages and automatically handle truncation to fit content into a model’s context.
- Message: A message created by an Assistant or a user. Messages can include text, images, and other files. Messages stored as a list on the Thread.
- Run: An invocation of an Assistant on a Thread. The Assistant uses it’s configuration and the Thread’s Messages to perform tasks by calling models and tools. As part of a Run, the Assistant appends Messages to the Thread.
- Step: A detailed list of steps the Assistant took as part of a Run. An Assistant can call tools or create Messages during it’s run. Examining Run Steps allows you to introspect how the Assistant is getting to it’s final results.
- Locks you further into OpenAI
- its a lot different than simply interacting with an LLM, which is what you would be dealing with if you didn’t use OpenAI (also more like what the chat completions API is)
- If you build a complex app on top of the chat completions API, it won’t be hard to swap OpenAI’s models out for some open source models (once you have them set up to respond to you appropriately that is).
- In the end, if you use the chat completions API, you are just sending prompts to an LLM. This is the same as any other LLM. If you want to move from assistants API to some non-OpenAI model, you will have a lot more re-writing to do and will have to implement all the RAG/prompt management
- Doesn’t support streaming (chat completions API does)
- Doesn’t support web browsing or image generation (custom GPTs & chatGPT do)
- More complicated to implement; requires a bunch of API requests for what used to be just one request per message. The steps are essentially:
- Create a Thread
- Add Message(s) to thread
- Create and start a Run
- Poll for Run status until its complete
- Poll for Steps to see what additional steps were determined to be necessary (e.g. knowledge retrieval)
- Once Run is complete, fetch the thread messages again to get the latest response
- Every time you want to send a subsequent message, you again need to add the message to the thread, create a run, poll for its status, etc.
- It feels much slower than the completions API, clearly due to the additional API requests and lack of streaming, but likely also due to additional complexity on OpenAI’s side that is completely obscured from us.
- The Knowledge Retrieval RAG functionality is pretty limited at this point
- You can only upload a limited number of files, and its not the easiest thing to do via their UI.
- You have no control over how/what knowledge gets fetched and/or how it gets sent to the LLM
- There is some visibility via the /steps into when knowledge retrieval is used, but its not that much compared to what you’d get with a custom RAG stack.
Chat Completions API
This is probably the best option if you don’t want to be locked into OpenAI but you want to benefit from the undeniable effectiveness and (probably temporary) superiority of the latest GPT models. It’s faster, and it’s closer to what you would be doing if you wanted to interact with a non OpenAI LLM down the line.
- Simple to implement: one API request per message, one response back
- Supports streaming
- Seems much faster, obviously due to less API requests but perhaps there are other reasons that the Assistant API is slower if we could see under the hood.
- Doesn’t support RAG or prompt management out of the box
- Neither of these are are that hard to implement, especially with tools like LangChain
This is a great option for casual ChatGPT users who want to systemize their prompts and have GPTs set up for specific use cases. They are super easy to set up and publish, share with friends, etc. OpenAI has said they will be creating a GPT Store (similar to App Stores today) and that there will be revenue sharing… but its hard to see how this will be very profitable since it seems like OpenAI just owns whatever you’ve put in there. You also have very little visibility/control over how it works, and you are limited to a text chat interface. (You can get a lot more creative than this if you instruct LLMs to respond with structured data via API).
- Supports web browsing and image generation out of the box, which is not available via Assistants or Chat Completions API
- Generally seems slightly faster than the API; less latency (same as using chatGPT)
- Super easy to build, can do it via chat or manually configure the instructions
- Can be published so that anyone with chatGPT Plus can use it
- Comes with the ability to provide suggestion CTAs out of the box (these are hardcoded, but do provide a nice UX with minimal effort)
- Potential for being an early mover in the GPT Store that OpenAI has teased
- They say there will be revenue share here but its very unclear how that would work
- OpenAI will train on the data that they get from your custom GPT
- You can’t collect the data that is going through your custom GPT
- Runs inside of ChatGPT so you don’t have much control over the UX
- Only people with Chat GPT Plus subscriptions can use it; a bit limiting if you want to share with the masses.
If you made it this far, I’m impressed! We invite you to keep coming back to our blog to learn more about our experiences with AI and custom software development. Also, if you are building something cool with generative AI, let us know in the comments! And even better, if you need help integrating AI into your custom software systems, we can help! Get it touch.
We're building an AI-powered Product Operations Cloud, leveraging AI in almost every aspect of the software delivery lifecycle. Want to test drive it with us? Join the ProdOps party at ProdOps.ai.