GPT-4 Has the Memory of a Goldfish

Large language models know a lot but can’t remember much at all.

Animation of speech bubbles moving in a circle
Illustration by The Atlantic

By this point, the many defects of AI-based language models have been analyzed to death—their incorrigible dishonesty, their capacity for bias and bigotry, their lack of common sense. GPT-4, the newest and most advanced such model yet, is already being subjected to the same scrutiny, and it still seems to misfire in pretty much all the ways earlier models did. But large language models have another shortcoming that has so far gotten relatively little attention: their shoddy recall. These multibillion-dollar programs, which require several city blocks’ worth of energy to run, may now be able to code websites, plan vacations, and draft company-wide emails in the style of William Faulkner. But they have the memory of a goldfish.

Ask ChatGPT “What color is the sky on a sunny, cloudless day?” and it will formulate a response by inferring a sequence of words that are likely to come next. So it answers, “On a sunny, cloudless day, the color of the sky is typically a deep shade of blue.” If you then reply, “How about on an overcast day?,” it understands that you really mean to ask, in continuation of your prior question, “What color is the sky on an overcast day?” This ability to remember and contextualize inputs is what lets ChatGPT carry on some semblance of an actual human conversation rather than simply providing one-off answers like a souped-up Magic 8 Ball.

The trouble is that ChatGPT’s memory, like the memory of large language models more generally, is terrible. Each time a model generates a response, it can take into account only a limited amount of text, known as the model’s context window. ChatGPT has a context window of roughly 4,000 words: long enough that the average person messing around with it might never notice, but short enough to render all sorts of complex tasks impossible. For instance, it wouldn’t be able to summarize a book, review a major coding project, or search your Google Drive. (Technically, context windows are measured not in words but in tokens, chunks of text that are often shorter than a word, so the limits are somewhat tighter in words than the round figures here suggest. The distinction becomes more important when you’re dealing with both visual and linguistic inputs.)

For a vivid illustration of how this works, tell ChatGPT your name, paste 5,000 or so words of nonsense into the text box, and then ask what your name is. You can even say explicitly, “I’m going to give you 5,000 words of nonsense, then ask you my name. Ignore the nonsense; all that matters is remembering my name.” It won’t make a difference. ChatGPT won’t remember.
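The experiment can be mimicked with a toy sketch. It assumes, purely for illustration, that a chat interface keeps only the most recent text that fits in the window and drops the rest; OpenAI has not documented exactly how ChatGPT handles overflow, and real systems count tokens rather than whitespace-separated words. The function name `visible_context` and the 4,000-word limit are hypothetical stand-ins.

```python
def visible_context(conversation: str, window_words: int = 4000) -> str:
    """Return the part of a conversation that still fits in the window.

    Assumption: overflow is handled by keeping only the most recent words.
    Words stand in for tokens to keep the sketch simple.
    """
    if window_words <= 0:
        return ""
    words = conversation.split()
    return " ".join(words[-window_words:])


# Tell the model a name, bury it under 5,000 words of filler, then ask.
intro = "My name is Ada."
nonsense = " ".join(["blah"] * 5000)
question = "What is my name?"

context = visible_context(f"{intro} {nonsense} {question}")

print("Ada" in context)             # False: the introduction fell out of the window
print(context.endswith(question))   # True: the question itself is still visible
```

Under that assumption, no instruction placed before the nonsense can help, because the instruction scrolls out of view along with the name.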

With GPT-4, the context window has been increased to roughly 8,000 words, as many as would be spoken in about an hour of face-to-face conversation. A heavy-duty version of the software that OpenAI has not yet released to the public can handle 32,000 words. That’s the most impressive memory yet achieved by a transformer, the type of neural net on which all the leading large language models are now based, says Raphaël Millière, a Columbia University philosopher whose work focuses on AI and cognitive science. Evidently, OpenAI made expanding the context window a priority, given that the company devoted a whole team to the issue. But how exactly that team pulled off the feat is a mystery; OpenAI has divulged pretty much zero about GPT-4’s inner workings. In the technical report released alongside the new model, the company justified its secrecy with appeals to the “competitive landscape” and “safety implications” of AI. When I asked for an interview with members of the context-window team, OpenAI did not answer my email.

For all the improvement to its short-term memory, GPT-4 still can’t retain information from one session to the next. Engineers could make the context window two times or three times or 100 times bigger, and this would still be the case: Each time you started a new conversation with GPT-4, you’d be starting from scratch. When booted up, it is born anew. (Doesn’t sound like a very good therapist.)

But even without solving this deeper problem of long-term memory, just lengthening the context window is no easy thing. As the engineers extend it, Millière told me, the computation power required to run the language model, and thus its cost of operation, increases sharply. (In a standard transformer, the core computation grows roughly with the square of the window’s length.) A machine’s total memory capacity is also a constraint, according to Alex Dimakis, a computer scientist at the University of Texas at Austin and a co-director of the Institute for Foundations of Machine Learning. No single computer that exists today, he told me, could support, say, a million-word context window.
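One well-known source of that growth: in a standard transformer, self-attention compares every token in the window with every other token, so the number of comparisons rises with the square of the window's length. The sketch below only counts those pairwise comparisons. It ignores everything else a model does, as well as any efficiency tricks a particular system might use, and the function name is illustrative.

```python
def attention_pairs(window_tokens: int) -> int:
    """Count the token-to-token comparisons in one full self-attention pass.

    Assumption: plain (dense) attention, where each token attends to every token.
    """
    return window_tokens * window_tokens


base = attention_pairs(4_000)
for window in (4_000, 8_000, 32_000, 1_000_000):
    pairs = attention_pairs(window)
    ratio = pairs // base
    print(f"{window:>9,} tokens -> {pairs:,} comparisons ({ratio:,}x the 4,000-token count)")
```

Doubling the window quadruples the count, and a million-token window calls for 62,500 times as many comparisons as a 4,000-token one.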

Some AI developers have extended language models’ context windows through the use of work-arounds. In one approach, the model is programmed to maintain a working summary of each conversation. Say the model has a 4,000-word context window, and your conversation runs to 5,000 words. The model responds by saving a 100-word summary of the first 1,100 words for its own reference, and then remembers that summary plus the most recent 3,900 words. As the conversation gets longer and longer, the model continually updates its summary—a clever fix, but more a Band-Aid than a solution. By the time your conversation hits 10,000 words, the 100-word summary would be responsible for capturing the first 6,100 of them. Necessarily, it will omit a lot.
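The arithmetic of that work-around can be sketched in a few lines. This is a simplified illustration, not any developer's actual implementation: `summarize` is a hypothetical stand-in that just keeps the first few words (a real system would ask the model to write the summary, and would likely update the old summary rather than start over), and words again stand in for tokens.

```python
def summarize(words: list[str], budget: int) -> list[str]:
    """Hypothetical stand-in for a model-written summary: keep the first `budget` words."""
    return words[:budget]


def build_context(conversation: list[str], window: int = 4000, summary_budget: int = 100) -> dict:
    """Fit a conversation into the window using a running summary plus recent text."""
    if summary_budget >= window:
        raise ValueError("the summary must leave room for recent text")
    if len(conversation) <= window:
        # Everything fits, so no summary is needed yet.
        return {"summary": [], "recent": conversation, "words_summarized": 0}

    recent_budget = window - summary_budget      # 3,900 with the defaults
    older = conversation[:-recent_budget]        # text that no longer fits verbatim
    recent = conversation[-recent_budget:]       # kept word for word
    return {
        "summary": summarize(older, summary_budget),
        "recent": recent,
        "words_summarized": len(older),
    }


for length in (5_000, 10_000):
    conversation = [f"w{i}" for i in range(length)]
    ctx = build_context(conversation)
    print(length, ctx["words_summarized"], len(ctx["summary"]) + len(ctx["recent"]))
# prints:
# 5000 1100 4000
# 10000 6100 4000
```

The context never exceeds 4,000 words, but the same 100-word summary has to stand in for an ever larger share of the conversation.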

Other engineers have proposed more complex fixes for the short-term-memory issue, but none of them solves the rebooting problem. That, Dimakis told me, will likely require a more radical shift in design, perhaps even a wholesale abandonment of the transformer architecture on which every GPT model has been built. Simply expanding the context window will not do the trick.

The problem, at its core, is not really a problem of memory but one of discernment. The human mind is able to sort experience into categories: We (mostly) remember the important stuff and (mostly) forget the oceans of irrelevant information that wash over us each day. Large language models make no such distinction. They have no capacity for triage, no ability to tell garbage from gold. “A transformer keeps everything,” Dimakis told me. “It treats everything as important.” In that sense, the trouble isn’t that large language models can’t remember; it’s that they can’t figure out what to forget.