
Memory Consumption and Limitations in LLMs with Large Context Windows

Part I: Introduction to Large Language Models, Context, and Tokens

This post is the first in a series in which we will explore the limits of large language models (LLMs) with respect to memory overhead and context windows. The goal is to impart a high-level understanding of what an LLM is and the limitations of such a system as of late 2023.

Introduction

Language models are probabilistic models that take words as inputs and generate words as outputs. This definition is purposefully broad, as language models can take many forms. They can be used to accomplish tasks such as text translation, predictive text, and sentiment analysis, among others.

Let’s take sentiment analysis as an example: 

  • A user inputs a string of text into a text box on a computer.
  • The computer converts the string into a machine-readable form and passes it to the language model.
  • The model encodes this input into a form that it can operate on.
  • The model then processes this input and decodes it, producing an output of “positive” or “negative” sentiment based upon the model’s past training.
  • The output of “positive” or “negative” is returned as an output string from the model.

In this instance, the “word” output isn’t a sentence, but an abstract concept of positivity or negativity, expressed as an output string of either “positive” or “negative” (or possibly “neither”). All language models function in a similar fashion: words in, the model operates on a representation of those words, words out.
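To make that flow concrete, here is a minimal sketch of a sentiment-analysis call. The use of the Hugging Face transformers library and its default pretrained model is our own illustrative assumption; the post doesn’t prescribe any particular tooling.

```python
# A minimal sketch of the sentiment-analysis flow described above, using the
# Hugging Face "transformers" library as an illustration (not prescribed by this post).
from transformers import pipeline

# Loads a small pretrained sentiment model the first time it runs.
classifier = pipeline("sentiment-analysis")

# Words in -> the model operates on a representation of those words -> words out.
result = classifier("I really enjoyed this movie!")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```

Under the hood, the pipeline handles the encoding and decoding steps from the list above, so the caller only ever sees strings going in and a label coming out.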

Large Language Models (LLMs)

A large language model is a specific kind of language model (usually a neural network). The threshold at which a language model counts as “large” is not clearly defined, but generally speaking, LLMs are trained on extremely large datasets. The current state-of-the-art LLMs are trained on large swathes of data scraped from the open internet over many years[1], a large number of books[2], and other data sources. This makes them formidable tools for interfacing with the vast amounts of training data they contain, as they can accept written English text prompts and output written English text results on almost any topic a user can fathom.

The user input into an LLM prompt is called the context. For example, if you are writing React code, you might send in the code for a component along with a written request asking the model to refactor it in some way. The combination of your request (“Refactor the code below:”) and the component code is what makes up the context of the prompt. This then gets tokenized, embedded, and input into the LLM.
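As a rough illustration of what that looks like in practice, the sketch below assembles the request and the component code into a single context and sends it to a model. The choice of OpenAI’s Python client, the model name, and the toy React component are all assumptions made for the sake of the example.

```python
# Sketch: building the context (instruction + component code) for an LLM prompt.
# Assumes the OpenAI Python client (>= 1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

component_code = """
function Greeting({ name }) {
  return <p>Hello, {name}!</p>;
}
"""

# The written request plus the code together make up the context of the prompt.
context = "Refactor the code below:\n" + component_code

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": context}],
)
print(response.choices[0].message.content)
```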

Tokenization and embedding are the processes that convert the input natural language text into mathematical objects (vectors, to be specific) that can be operated upon by the neural network that makes up the LLM. We’ll dive deeper into these concepts in the next post.

The context window is the maximum amount of tokenized prompt text that can be input into the model in a single request. The nature and the limitations of the context window are a significant focal point of this series of blog posts. The context window for many of the most popular LLMs isn’t exceptionally large – it’s usually on the order of 1,000-10,000 tokens. We’ll dive deeper into how words get converted into tokens in the next post, but for now you can assume an average of about 0.75:1 as a ratio of words to tokens[3]. That is, if you input 750 words into an LLM prompt, you would expect that prompt to be around 1000 tokens in size.
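If you want to check this ratio for yourself, one option is to count tokens with OpenAI’s tiktoken library, which implements the same tokenization as the web tool in reference [3]. The library and encoding name are our assumptions here; other models use different tokenizers and will produce slightly different counts.

```python
# Counting tokens to check the ~0.75:1 words-to-tokens ratio mentioned above.
# Assumes OpenAI's "tiktoken" library; other tokenizers will give different counts.
import tiktoken

text = (
    "The context window is the maximum amount of tokenized prompt text "
    "that can be input into the model in a single request."
)

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4-era models
tokens = enc.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
```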

The Big Picture

The mass adoption of LLM-enabled technology demonstrates that these systems have the potential to be truly paradigm-shifting in regards to the way we interact with and use vast amounts of information. These systems have a remarkable ability to index and reference seemingly boundless amounts of data – with the catch that this information must be included in their training dataset, or fit within the comparatively small context window of the prompt.

For example, if you wanted to leverage this power to ask questions about a novel not included in the model’s training data, you’d have to include the full text of said novel in your prompt. This causes an immediate problem – for many models, the average novel won’t come close to fitting inside the context window. As another example, if you had the entire codebase of an established product (hundreds of thousands to millions of lines of code) and wanted to include it in the context of a prompt, no available model comes close to offering a context window that large.
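Some back-of-the-envelope arithmetic makes the mismatch clear. The novel length and context window size below are illustrative assumptions, not properties of any particular model.

```python
# Rough arithmetic: a full-length novel vs. a 2023-era context window.
# All numbers here are illustrative assumptions.
novel_words = 100_000          # a typical full-length novel
words_per_token = 0.75         # the approximate ratio from earlier in this post
novel_tokens = novel_words / words_per_token   # ~133,000 tokens

context_window = 8_192         # a common context window size circa 2023

print(f"Novel: ~{novel_tokens:,.0f} tokens; window: {context_window:,} tokens")
print(f"The novel is ~{novel_tokens / context_window:.0f}x too large to fit")
```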

Why is this the case? What are the physical limitations that prevent context windows from allowing millions or billions of tokens? The answer, which will be discussed in more detail in the follow-up posts, is related to constraints on memory. These memory constraints are a consequence of the architecture of the LLMs – specifically, they are a result of certain fundamental mechanisms that are present in most of the popular LLMs currently available. 

In the next post, we will discuss the embedding and tokenization process that turns written text into embedded vectors. We’ll also dive further into understanding memory constraints and attempt to address the nature of the token limit that is present in LLMs. We will conclude by exploring how memory scales with tokens.

Want to chat more on this topic? Connect with one of our software strategy, design and development experts. We love this stuff!

References

[1] Common Crawl – https://commoncrawl.org/

[2] “Language Models are Few-Shot Learners”, Sec. 2.2 – https://arxiv.org/abs/2005.14165

[3] OpenAI Tokenizer – https://platform.openai.com/tokenizer

We're building an AI-powered Product Operations Cloud, leveraging AI in almost every aspect of the software delivery lifecycle. Want to test drive it with us? Join the ProdOps party at ProdOps.ai.