Some Early Thoughts on “Strawberry”
OpenAI just released two new large language models (LLMs), o1-preview and o1-mini, and announced a third: o1. Though o1 itself isn't available yet, OpenAI has published benchmark results for it.
If you’re interested in integrating against these models via the API, there are a few things you should know.
Chain of Thought Overview
The o1 models were all trained to use chain of thought (CoT) reasoning. This technique takes advantage of the fact that, as models generate each token, they condition not just on the initial input but also on their own output so far: in industry jargon, they are “auto-regressive”. By writing out intermediate reasoning first, a model supplies itself with relevant information before generating an answer for you.
This is not a new technique: if you have ever used the prompt engineering trick of asking a model to “think step by step”, you’ve done it yourself. What does seem novel is the automated way OpenAI has incorporated it, specifically training the models to use it and running it in the background.
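As a rough illustration of the manual version of that trick, here is a minimal sketch using the OpenAI Python SDK against gpt-4o. The prompt wording is just an example, and this is only the prompt-level analogue of what o1 does internally, not the same mechanism.

```python
# A minimal sketch of manual chain-of-thought prompting with gpt-4o.
# The "think step by step" instruction is the classic prompt-engineering trick;
# the o1 models do something similar behind the scenes without being asked.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": (
                "A train leaves at 2:15pm and arrives at 5:40pm. "
                "How long is the trip? Think step by step, "
                "then give the final answer on its own line."
            ),
        }
    ],
)

print(response.choices[0].message.content)
```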
You Need a Tier 5 Account for API Access
Once you’ve spent $1,000 over the lifetime of the account and made your first payment 30+ days ago, you’re automatically moved into Tier 5, which grants you access to o1-preview and o1-mini via the API. Otherwise, the easiest way to test the models is through a ChatGPT Plus/Team/Enterprise/Edu subscription.
Rate Limits
The rate limits as of release are very low: 20 requests per minute (RPM) for each of the o1 models, and 300M and 150M tokens per minute (TPM). This is going to be a major limiting factor in early use cases, though the limits will increase over time; the initial release is only intended for prototyping.
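If you're prototyping against those limits, you'll probably hit 429 errors. Here's a minimal sketch of retrying with exponential backoff, assuming the v1 OpenAI Python SDK (which raises RateLimitError); the retry count and sleep times are arbitrary choices.

```python
# A minimal sketch of retrying o1 calls under tight rate limits,
# assuming the v1 OpenAI Python SDK, which raises RateLimitError on 429s.
import time

from openai import OpenAI, RateLimitError

client = OpenAI()

def ask_o1(prompt: str, retries: int = 5) -> str:
    """Call o1-mini, backing off exponentially when rate-limited."""
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model="o1-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except RateLimitError:
            # 2, 4, 8, 16... seconds; tune to your own traffic patterns.
            time.sleep(2 ** (attempt + 1))
    raise RuntimeError("Still rate-limited after retries")
```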
It’s Really Good… at Certain Things
When OpenAI released gpt-4o, it was a direct upgrade over gpt-4: better, faster, and cheaper. With the o1 family, there are different tradeoffs involved.
They are good at complex tasks that benefit from multi-step reasoning, including difficult questions in coding, science, mathematics, and logic. For “creative” tasks like personal writing or editing text, humans generally preferred gpt-4o in OpenAI’s benchmarks.
Mini is Sometimes Better Than Preview
The o1-mini model is not as good at factual knowledge, but on the STEM tasks it was specifically trained for, especially coding and mathematics, it often outperforms o1-preview.
It’s Much Slower
The o1 models’ enhanced capabilities on certain tasks are due to their background use of chain of thought. In our testing, mostly on moderate to challenging coding questions, the CoT process with o1-preview usually took 4 to 30 seconds and occasionally ran for several minutes. o1-mini seems to be about 3-5x faster and often produces results of similar quality, while gpt-4o is still several times faster than even o1-mini.
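Latency varies a lot by task, so it's worth measuring it on your own prompts rather than relying on our anecdotal numbers. A rough timing sketch (the prompt is a placeholder):

```python
# A rough sketch for comparing wall-clock latency across models on your own prompts.
# Timings vary a lot run to run, so treat single measurements as anecdotal.
import time

from openai import OpenAI

client = OpenAI()
PROMPT = "Refactor this function to be iterative instead of recursive: ..."

for model in ["gpt-4o", "o1-mini", "o1-preview"]:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    elapsed = time.perf_counter() - start
    print(f"{model}: {elapsed:.1f}s, {response.usage.completion_tokens} completion tokens")
```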
No Streaming, No Function Calling, and Other Limitations
The API doesn’t offer the ability to stream responses. Because the model has to finish its CoT process before it starts generating the final output tokens, users may have to wait quite a long time for a response even if streaming is added in the future.
Also not yet supported: function calling, image inputs (the models are text-only), system messages, and model parameters like temperature. OpenAI says many of these features will be available in future o1-series models.
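In practice that means an o1 request today is a plain, blocking chat completion: a single user message, no system prompt, no temperature, and ideally a generous client-side timeout. A minimal sketch (the 600-second timeout is an arbitrary choice):

```python
# A minimal sketch of calling o1-preview given the current API limitations:
# no streaming, no system message, no temperature or other sampling parameters.
from openai import OpenAI

# The chain-of-thought phase can run for minutes, so give the HTTP client
# a generous timeout (600 seconds here is an arbitrary choice).
client = OpenAI(timeout=600.0)

response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        # Only user (and assistant) messages are accepted for now.
        {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ],
)

print(response.choices[0].message.content)
```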
“Intermediate” Tokens May Cost A Lot
OpenAI does not let you see the actual intermediate tokens generated during the CoT process, but it will charge you for them as output tokens. For a complex, multi-step task, these hidden tokens can be many multiples of the final output tokens.
While o1-mini is one-fifth the cost of o1-preview, it’s only slightly cheaper per output token than gpt-4o ($12/million vs. $15/million), and you can expect it to use far more output tokens. For comparison, gpt-4o-mini is 1/25th the price of gpt-4o and 1/100th the cost of o1-preview. For full details, check OpenAI’s pricing page.
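Although you can't read the reasoning tokens themselves, the usage block on the response does report how many completion tokens you were billed for, and recent SDK versions break out a reasoning-token count. A sketch of inspecting it; the completion_tokens_details field name is what current SDK versions expose and may differ in yours, and the $12/million figure is the o1-mini output price quoted above.

```python
# A sketch of checking how many hidden reasoning tokens a call consumed.
# Assumes a recent v1 OpenAI Python SDK; field names may differ across versions.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-mini",
    messages=[{"role": "user", "content": "Find all integer solutions to x^2 - y^2 = 12."}],
)

usage = response.usage
# completion_tokens includes the hidden chain-of-thought tokens you're billed for.
details = getattr(usage, "completion_tokens_details", None)
reasoning = getattr(details, "reasoning_tokens", None) if details else None

print(f"prompt tokens:     {usage.prompt_tokens}")
print(f"completion tokens: {usage.completion_tokens}")
print(f"reasoning tokens:  {reasoning}")
# Rough output-side cost at the $12/million o1-mini output rate quoted above.
print(f"output cost: ~${usage.completion_tokens * 12 / 1_000_000:.4f}")
```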
Conclusions
For certain tasks, the results from o1-preview and o1-mini feel well worth the tradeoffs. It will be interesting to test o1 when it’s released, since it performs significantly better on a number of benchmarks. There are more tradeoffs to weigh with these models, though, and there’s no substitute for experimenting with different models on your specific tasks.
It’s also worth remembering that every time a new closed model has been released, many of the competitive advantages it initially offered have evaporated rapidly as open-source models closed the gap. The LLM landscape is continually changing, and as you build applications on top of these models, it’s best to make swapping them out easy so you can take advantage of quality, latency, and cost tradeoffs as they shift.
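One lightweight way to keep that flexibility is to route every call through a single helper and treat the model name as configuration rather than hard-coding it. A sketch, with hypothetical names (the LLM_MODEL environment variable and complete() helper are ours, not part of any SDK):

```python
# A sketch of keeping the model choice swappable, with hypothetical helper names.
import os

from openai import OpenAI

client = OpenAI()

def complete(prompt: str, model: str | None = None) -> str:
    """Single choke point for model calls, so swapping models is a config change."""
    model = model or os.environ.get("LLM_MODEL", "gpt-4o")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Later: run with LLM_MODEL=o1-mini, no code changes required.
```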
Want to chat on this topic? Reach out. We love this stuff.
We're building an AI-powered Product Operations Cloud, leveraging AI in almost every aspect of the software delivery lifecycle. Want to test drive it with us? Join the ProdOps party at ProdOps.ai.